BOOTSTRAPPING NON-PARALLEL VOICE CONVERSION FROM SPEAKER-ADAPTIVE
TEXT-TO-SPEECH
Hieu-Thi Luong (1,2), Junichi Yamagishi (1,2,3)

(1) SOKENDAI (The Graduate University for Advanced Studies), Kanagawa, Japan
(2) National Institute of Informatics, Tokyo, Japan
(3) The University of Edinburgh, Edinburgh, UK
ABSTRACT
Voice conversion (VC) and text-to-speech (TTS) are two tasks
that share a similar objective, generating speech with a target
voice. However, they are usually developed independently
under vastly different frameworks. In this paper, we propose
a methodology to bootstrap a VC system from a pretrained
speaker-adaptive TTS model and unify the techniques as well
as the interpretations of these two tasks. Moreover, by offload-
ing the heavy data demand to the training stage of the TTS
model, our VC system can be built using a small amount of
target speaker speech data. It also opens up the possibility of
using speech in a foreign unseen language to build the system.
Our subjective evaluations show that the proposed framework
is able to not only achieve competitive performance in the
standard intra-language scenario but also adapt and convert
using speech utterances in an unseen language.
Index Terms: voice conversion, cross-lingual, speaker
adaptation, transfer learning, text-to-speech
1. INTRODUCTION
A voice conversion (VC) system is used to convert the speech
of a certain speaker to the speech of a desired target while retain-
ing the linguistic content [1]. VC and text-to-speech (TTS)
are two different systems functioning under different
scenarios, but they share a common goal: generating speech
with a target voice. If we define the TTS system as a syn-
thesis system generating speech using linguistic instructions
(in an abstract sense) obtained from written text [2], then the
VC system can be defined as a synthesis system generating
speech using linguistic instructions extracted from a reference
utterance. This similarity in objective but difference in operating
context makes VC and TTS complementary, each having its
own role in a spoken dialogue system. More
specifically, as text input is easy to create and modify, TTS is
capable of generating a large amount of speech automatically
and cheaply. However, it is difficult to generate speech when
the desired linguistic instructions cannot be represented in the
expected written form (e.g. when text is written in foreign
languages). On the other hand, speech used as a reference
input for VC is more time-consuming and expensive to pro-
duce, but the system can be straightforwardly extended to an
unseen language.
(This work was partially supported by a JST CREST Grant (JP-
MJCR18A6, VoicePersonae project), Japan, and MEXT KAKENHI Grants
(16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.)
A VC system is usually classified as parallel or non-
parallel depending on the nature of the data available for
development. The parallel VC system is developed using
a parallel corpus containing pairs of utterances spoken by
designated source and target speakers. The parallel speech
utterances of source and target are aligned using dynamic
time warping (DTW) to create the training set. By using this
training set, a mapping function is formulated to transform
the acoustic features of one speaker to another [3]. Many
methods have been proposed to model this transformation
for parallel VC systems [1, 4, 5]. Parallel VC can be developed
with just the data of the source and target speakers, which is one
of its advantages. However, acquisition of parallel speech data re-
quires a lot more planning and preparation than acquisition
of non-parallel speech, which in turn makes the former more
expensive to develop, especially when we want a VC system
with the voice of a particular speaker. Therefore, many ap-
proaches have been proposed to develop VC systems using
a non-parallel corpus [6, 7]. For the conventional Gaussian
mixture model (GMM) approach, non-parallel VC can be
adapted from a pretrained parallel VC in the model space
using the maximum a posteriori (MAP) method [7, 8] or as
interpolation between multiple parallel models [9]. For recent
neural network approaches, a non-parallel VC can be trained
by directly using an intermediate linguistic representation
extracted from an automatic speech recognition (ASR) model
[10, 11] or by indirectly encouraging the network to disen-
tangle linguistic information from the speaker characteristics
using methods like variational autoencoder (VAE) [12], gen-
erative adversarial networks (GAN) [13, 14] or some other
techniques [15]. For both parallel and non-parallel VC, the
systems usually change the voice but are unable to change
the duration of the utterance. Many recent studies have fo-
cused on converting the speaking rate along with the voice by using
sequence-to-sequence models [11, 16, 17, 18], as speaking
rate is also a speaker characteristic.
Fig. 1. Training the initial speaker-adaptive TTS model (linguistic
encoder LEnc, acoustic encoder AEnc, speaker-conditioned decoder
Dec, and the MAE and KLD losses).

Recently, Luong et al. proposed a speaker-adaptive TTS
model that could perform speaker adaptation with untran-
scribed speech using backpropagation algorithm [19]. We
find that the proposed system can potentially be used for
developing non-parallel VC as well. In this paper, we in-
troduce a framework for creating a VC system using the
untranscribed speech of a target speaker by bootstrapping
from the speaker-adaptive TTS model proposed by Luong et
al. [19]. Furthermore, we investigate the performance of our
framework when it adapts and converts using speech utter-
ances of an unseen language. The rest of the paper is organized
as follows. Section 2 describes our framework, Section 3
introduces different scenarios in which a VC can operate in
an unseen language and their practical applications, Section 4
describes the experiment setup, Section 5 presents subjective
results, and Section 6 concludes our findings.
2. BOOTSTRAPPING VOICE CONVERSION FROM
SPEAKER-ADAPTIVE TTS
2.1. Multimodal architecture for speaker-adaptive TTS
We first summarize the speaker-adaptive TTS proposed in
[19]. A conventional TTS model is essentially a function that
transforms linguistic features x, extracted from a given text,
to the acoustic features y, which could be used to synthesize
speech waveform. A neural TTS model is a TTS model that
uses neural networks as the approximation function for the x
to y transformation. Neural networks are trained with back-
propagation algorithm and can also be adapted to a new un-
seen domain with backpropagation when the adaptation data
is labeled. In the context of TTS, an acoustic model can
easily be adapted to voices of unseen speakers if the adap-
tation data is transcribed speech but it is not as straightfor-
ward when the adaptation data is untranscribed.

Fig. 2. Transferring to VC and adapting to the target speaker.

Many techniques
have been proposed to perform adaptation with untranscribed
speech by avoiding backpropagation and extracting a certain
speaker representation from an external system instead [20,
21]. However, this forward-based approach limits the per-
formance of the adaptation process [19].
Luong et al. [19] proposed a backpropagation-based adap-
tation method using untranscribed speech by introducing a
latent variable z, namely latent linguistic embedding (LLE),
which presumably contains no information about the speak-
ers. The conventional acoustic model then splits into a
speaker-dependent (SD) acoustic decoder Dec and a speaker-
independent (SI) linguistic encoder LEnc. The acoustic
decoder is defined by its SI parameters θ_core and SD parameters
θ_spk,(k), one for each k-th speaker in the training set, while the
linguistic encoder is defined by its SI parameters φ_L. The TTS
model can be described as follows:

    z_L ∼ LEnc(x; φ_L) = p(z|x)              (1)
    ỹ_L = Dec(z_L; θ_core, θ_spk,(k))        (2)
With this new factorized model, if we want to adapt the acous-
tic decoder to unseen speakers using speech, we train an aux-
iliary acoustic encoder AEnc (defined by its SI parameter
φ_A) to use as a substitute for the linguistic encoder:

    z_A ∼ AEnc(y; φ_A) = q(z|y)              (3)
    ỹ_A = Dec(z_A; θ_core, θ_spk,(k))        (4)
The acoustic encoder is only used to perform the unsuper-
vised speaker adaptation in [19]; however, the network created
by stacking the acoustic encoder and the acoustic decoder is
essentially a speech-to-speech stack that could potentially be
used as a VC system. This motivated us to develop
a VC framework based on this model.
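To make the factorization above concrete, below is a minimal PyTorch-style sketch of the three modules; the module names, layer types, and sizes are our own illustrative assumptions, not the exact architecture of [19].

    # Minimal sketch of the factorized model in Equations (1)-(4); the layer
    # types and sizes are illustrative assumptions, not the architecture of [19].
    import torch
    import torch.nn as nn

    class LinguisticEncoder(nn.Module):              # SI parameters phi_L
        def __init__(self, x_dim=367, z_dim=64, h_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)

        def forward(self, x):                        # parameters of p(z|x), Eq. (1)
            h = self.net(x)
            return self.mu(h), self.logvar(h)

    class AcousticEncoder(nn.Module):                # SI parameters phi_A
        def __init__(self, y_dim=80, z_dim=64, h_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(y_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)

        def forward(self, y):                        # parameters of q(z|y), Eq. (3)
            h = self.net(y)
            return self.mu(h), self.logvar(h)

    class Decoder(nn.Module):                        # SI theta_core plus SD biases theta_spk,(k)
        def __init__(self, z_dim=64, y_dim=80, n_speakers=72, h_dim=256):
            super().__init__()
            self.fc1 = nn.Linear(z_dim, h_dim)
            self.fc2 = nn.Linear(h_dim, y_dim)
            self.spk_bias = nn.Embedding(n_speakers, h_dim)  # one bias vector per speaker

        def forward(self, z, spk_id):                # Eq. (2) / Eq. (4)
            h = torch.relu(self.fc1(z) + self.spk_bias(spk_id))
            return self.fc2(h)

In this sketch, LinguisticEncoder plus Decoder form the TTS path, while AcousticEncoder plus Decoder form the speech-to-speech path that we use for VC.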
2.2. Latent Linguistic Embedding for many-to-one voice
conversion
By using the speaker-adaptive TTS model described above,
we propose a three-step framework to develop a VC system
for a target speaker:
1) train the initial TTS model: The first step is to train
the TTS model; it is identical to the training mode defined in
Section IV-B of [19]. The data required for this step is a multi-
speaker transcribed speech corpus. In this step, we need to
jointly train all modules using a deterministic loss function:
    loss_train = loss_main + β · loss_tie          (5)

loss_main is the distortion between the output of the TTS stack
and the natural acoustic features, and loss_tie is the KL diver-
gence (KLD) between the output of the linguistic encoder and the
output of the acoustic encoder. The setup is almost the same
as described in [19]. One small difference is that we use the
mean absolute error (MAE) as the distortion function instead
of the mean square error as shown in Fig. 1. In this step, the
acoustic encoder is trained to transform acoustic features to
the linguistic representation LLE while the acoustic decoder
is trained to become a good initial model for speaker adapta-
tion.
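As a rough illustration of this step, the sketch below computes the joint loss of Equation (5) with the modules sketched in Section 2.1, assuming both encoders output diagonal Gaussians; the KLD direction, the reduction, and the sampling details are our own assumptions rather than the exact recipe of [19].

    # Sketch of the joint training objective in Equation (5); beta = 0.25 as in
    # Section 4.2. The KLD direction and reduction are illustrative assumptions.
    import torch

    def training_step(x, y, spk_id, lenc, aenc, dec, beta=0.25):
        mu_l, logvar_l = lenc(x)                     # linguistic encoder, p(z|x)
        mu_a, logvar_a = aenc(y)                     # acoustic encoder, q(z|y)
        z_l = mu_l + torch.randn_like(mu_l) * torch.exp(0.5 * logvar_l)
        y_hat = dec(z_l, spk_id)                     # output of the TTS stack
        loss_main = torch.mean(torch.abs(y_hat - y))                 # MAE distortion
        var_l, var_a = torch.exp(logvar_l), torch.exp(logvar_a)
        loss_tie = 0.5 * torch.mean(                                 # KLD tying q(z|y) to p(z|x)
            logvar_l - logvar_a + (var_a + (mu_a - mu_l) ** 2) / var_l - 1.0)
        return loss_main + beta * loss_tie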
2) transfer to VC and adapt to the target: For build-
ing a VC system we only need the acoustic encoder and the
acoustic decoder trained previously; the linguistic encoder is
discarded as it is no longer needed. The acoustic encoder and
decoder make a complete speech-to-speech network. To have
a VC system with a particular voice, we adapt the acoustic
decoder using the available speech data of the target. This
is identical to the unsupervised adaptation mode described
in Section IV-B of [19]. More specifically, we removed all
SD parameters θ_spk,(k) and then fine-tuned the remaining parameters
θ_core,(r) for the r-th target speaker by minimizing the loss
function:

    loss_adapt = L_MAE(ỹ_A, y)          (6)
Slightly different from [19], we removed the standard deviation
σ from the acoustic encoder, as illustrated in Fig. 2. Simply
speaking, instead of sampling z_A from a distribution (Equation 3),
z_A always takes the mean value. We found that this change
slightly improves the performance of the adapted model in the
conversion step.
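A minimal sketch of this adaptation step, reusing the module sketches above: the speaker biases are dropped, the acoustic encoder is frozen and only its mean output is used, and the remaining core decoder parameters are fine-tuned with the MAE loss of Equation (6). The optimizer, learning rate, and number of steps are illustrative assumptions.

    # Sketch of step 2: unsupervised adaptation to the target speaker (Eq. 6).
    import torch

    def adapt_to_target(target_mels, aenc, dec, steps=1000, lr=1e-4):
        for p in aenc.parameters():                  # the acoustic encoder is kept fixed
            p.requires_grad = False
        core_params = list(dec.fc1.parameters()) + list(dec.fc2.parameters())
        opt = torch.optim.Adam(core_params, lr=lr)   # speaker biases are dropped
        for _ in range(steps):
            for y in target_mels:                    # untranscribed target mel-spectrograms
                z_a, _ = aenc(y)                     # mean only, no sampling (sigma removed)
                h = torch.relu(dec.fc1(z_a))         # no speaker bias term
                y_hat = dec.fc2(h)
                loss_adapt = torch.mean(torch.abs(y_hat - y))    # MAE against natural features
                opt.zero_grad()
                loss_adapt.backward()
                opt.step()
        return dec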
3) converting speech of arbitrary source speakers: The
adapted VC system can then be used to convert speech of
arbitrary source speakers to the target. Our system does not
need to train on or adapt to the speech data of source speakers,
although doing so might help the performance.
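Conversion then amounts to a single forward pass through the frozen acoustic encoder and the adapted decoder; the sketch below assumes hypothetical helpers mel_from_wav and wavenet_vocoder for feature extraction and waveform generation, which are not part of the original system description.

    # Sketch of step 3: converting an utterance of an arbitrary source speaker.
    import torch

    def convert(source_wav, aenc, adapted_dec, mel_from_wav, wavenet_vocoder):
        y_src = mel_from_wav(source_wav)             # source mel-spectrogram
        z_a, _ = aenc(y_src)                         # speaker-independent LLE (mean)
        h = torch.relu(adapted_dec.fc1(z_a))         # adapted core decoder, no speaker bias
        y_tgt = adapted_dec.fc2(h)                   # mel-spectrogram in the target voice
        return wavenet_vocoder(y_tgt)                # waveform via the fine-tuned vocoder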
2.3. Related works
While the original VC is narrowly defined as a function that
directly maps acoustic units of a source to a target speaker
[3, 22], recent VC systems have usually been modeled with
an intermediate linguistic representation, either explicitly [10]
or implicitly [23] due to the rapid development of representa-
tion learning and disentanglement ability of neural networks.
This approach has further reduced the theoretical difference
between VC and TTS. Motivated by the same observation,
Mingyang Zhang et al. [24] proposed a method to jointly train
a TTS and a parallel VC system using a shared sequence-to-
sequence model. The proposed framework improves the sta-
bility of the generated speech but does not add any benefits in
terms of data efficiency, as parallel speech data of the source
and target speakers is still a requirement. Recently, Jing-Xuan
Zhang et al. [18] have introduced a non-parallel sequence-to-
sequence voice conversion system with a procedure
similar to ours: the initial model is trained with a TTS-like
network to help disentangle the linguistic representation, and the
model is then adapted to the source and target speakers. How-
ever, the adaptation step of the framework proposed in [18] re-
quires both source and target speaker speech as well as their
transcripts, which increases the data demand for building a
VC system with a particular voice.
Despite being different in motivation and procedure, our
framework has the same data efficiency as the models using
phonetic posteriorgrams (PPG) proposed by Sun et al. [10].
The LLEs in our setup play a similar role to the PPGs, but are
very different in terms of characteristics. While PPGs are the
direct output of an ASR network with an explicit interpreta-
tion, LLEs are latent variables and guided by a TTS network.
This data efficiency not only lowers the
requirements for building a VC system but also opens up the
possibility of solving other interesting tasks. Creating and us-
ing the VC system with a low-resource unseen language is
one example [25].
3. USING VOICE CONVERSION SYSTEM IN AN
UNSEEN LANGUAGE
The proposed three-step framework has an interesting charac-
teristic, namely the asymmetric nature of the data re-
quirements for each step. While the TTS training step re-
quires a transcribed multi-speaker corpus, the adaptation step
only requires a small amount of untranscribed speech from
the target speaker. Moreover, the model does not need to train
on data of source speakers and hypothetically can convert ut-
terances of arbitrary speakers out of the box. We could take
advantage of these characteristics to build VC systems for
low-resource languages. For our experiments, we define the
abundant language as a language with a transcribed speech cor-
pus that is used to train the initial TTS model (seen); in our case
it is English (E). The low-resource language is a language
whose data does not include transcripts and is not used in the
TTS step (unseen); in our case it is Japanese (J). We first de-
fine multiple scenarios in which our framework can interact
with an unseen language. These scenarios are dictated by the
language used in the adaptation and conversion steps:
Fig. 3. The asymmetric nature of data requirements for each
step of the proposed framework: pretraining uses speech and
transcript data of multiple speakers, adaptation uses speech data
of the target speaker, and conversion uses only an utterance of a
source speaker.
(a) EE-E: We adapt the pretrained English model to an
English speaker and use it to convert English utterances.
Generally speaking, the entire framework from start to finish
operates within a single language. This is the typical intra-
language scenario of a voice conversion system.
(b) EE-J: We adapt the pretrained English model to
an English speaker but use it to convert Japanese utterances.
Generally speaking, the VC system is used to generate speech
from linguistic instructions that have not been seen. The ability
to generate speech for content that is difficult to represent in
the expected written form (a foreign language in this case) is
one advantage of VC over TTS. This scenario is referred to as
cross-language voice conversion [26, 27].
(c) EJ-E: We adapt the pretrained English model to a
Japanese speaker and use it to convert English utterances. In
this scenario we want to build a VC system in the abundant
language; however, the speech data of the target speaker is
only available in a low-resource unseen language. This is
sometimes referred to as cross-lingual voice conversion [28,
29]; however, we call it cross-language speaker adaptation to
distinguish it from EE-J. Unlike other scenarios involving the
unseen language, cross-language speaker adaptation is rele-
vant for both VC [28] and TTS [30, 25]. Even though it is
not evaluated in this paper, this scenario is also the unsuper-
vised cross-language speaker adaptation scenario of the TTS
system proposed in [19].
(d) EJ-J: We adapt the pretrained English model to a
Japanese speaker and use it to convert Japanese utterances.
In this scenario we essentially bootstrap a VC system for a
low-resource language from a pretrained model of an abun-
dant one. The written form of the target language is not used
in the training, the adaptation or the conversion stages. This
scenario is sometimes referred to as text-to-speech without text,
which is the main topic of the Zero Resource Speech Chal-
lenge 2019 [31]. Even though our scenario has the same
objective as the challenge, the approach is a little different.
The participants of the challenge are encouraged to develop
intra-language unsupervised unit discovery methods, which
are more difficult [32, 33, 34]. Our framework is bootstrapped
from an abundant language, which is a more practical ap-
proach [35, 36].
Fig. 4. Different scenarios of using a non-parallel voice con-
version system with an unseen language: (a) EE-E, (b) EE-J,
(c) EJ-E, and (d) EJ-J. In this paper, English
plays the role of the abundant language (seen) while Japanese
plays the role of the low-resource language (unseen).
4. EXPERIMENT
4.1. Dataset
We used 25.6 hours of transcribed speech from 72 English
speakers to train the initial speaker-adaptive TTS model and
the initial SI WaveNet vocoder. The data is a subset of the
VCTK speech corpus [37]. To validate the proposed method,
we reenact the SPOKE task of the Voice Conversion Chal-
lenge 2018 (VCC2018) [38]. We build the VC system for 4
target speakers (2 males and 2 females) using 81 utterances
per person. We then evaluate the SPOKE task using the
speech of 4 source speakers (2 males and 2 females); each
speaker contributes 35 utterances. Even though the task pro-
vided training data for the source speakers, we did not use
the training data in our experiments as our model can convert
speech of an arbitrary source speaker.
To test the performance of the proposed system in the con-
text of unseen language as described in Section 3, we use 2
in-house bilingual speakers (1 male and 1 female) as the new
target speakers. These two target speakers can speak En-
glish and Japanese at an almost native level. 400 utterances per
speaker per language are used as training data; 10 more ut-
terances are used for validation. For evaluation, we reuse
2 source speakers (1 male and 1 female) from VCC2018 as
the English source speakers and add 2 native Japanese speak-
ers (1 male and 1 female) as the Japanese source speakers.
Fig. 5. Subjective results for the SPOKE task of the Voice
Conversion Challenge 2018 (quality vs. similarity for N10, OUR,
and B01). The lines indicate 95% confidence intervals, and all
results are statistically significant.
Table 1. Detailed subjective results for SPOKE task.

(a) Quality
        F-F    F-M    M-F    M-M    ALL
N10     3.91   3.96   3.85   3.93   3.91
B01     3.10   2.12   1.84   2.49   2.39
OUR     2.72   2.77   2.41   2.68   2.65

(b) Similarity
        F-F    F-M    M-F    M-M    ALL
N10     3.18   3.55   3.14   3.48   3.33
B01     2.74   2.87   2.21   2.04   2.47
OUR     2.92   3.13   2.70   3.26   3.00
Each source speaker contributes 35 utterances for evaluation.
While Japanese plays the role of the low-resource language in our ex-
periments, to provide a good reference, we train an additional
system in which Japanese is the abundant language and then
use it as the upper bound. This intra-lingual VC system for
Japanese, namely JJ-J, is pretrained on 44.9 hours of tran-
scribed speech of 235 native speakers and then adapted to the
Japanese training speech of the two bilingual targets.
4.2. Model configuration
Our configuration imitates the setup described in [19] as
closely as possible. The linguistic feature is a 367-dimensional
vector containing various information that is deemed to be use-
ful for English speech synthesis. The acoustic feature is the
80-dimensional mel-spectrogram. One difference compared
with [19] is that all speech data is resampled to 22050 Hz
(which is the sampling rate of the VCC2018 dataset) before it
is used to extract acoustic features. The WaveNet vocoder is also
trained on the 22050 Hz waveforms, which have been quantized
using 10-bit µ-law.
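For reference, a minimal sketch of this front-end is given below; the STFT settings (n_fft, hop_length) are not specified in the paper and are assumptions of ours.

    # Sketch of the front-end: resample to 22050 Hz, extract an 80-dimensional
    # mel-spectrogram, and 10-bit mu-law quantize the waveform for WaveNet.
    import numpy as np
    import librosa

    def extract_features(path, sr=22050, n_mels=80, n_fft=1024, hop_length=256):
        wav, _ = librosa.load(path, sr=sr)                   # resample to 22050 Hz
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        log_mel = np.log(np.maximum(mel, 1e-10))             # log-compressed mel
        return wav, log_mel

    def mu_law_quantize(wav, bits=10):
        # 10-bit mu-law companding gives 1024 discrete levels for the vocoder targets
        mu = 2 ** bits - 1
        companded = np.sign(wav) * np.log1p(mu * np.abs(wav)) / np.log1p(mu)
        return ((companded + 1.0) / 2.0 * mu + 0.5).astype(np.int64)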
Fig. 6. Subjective results for the cross-language speaker adapta-
tion task: quality (left) and similarity (right) for NA-E, EE-E,
and EJ-E. All results are statistically significant.

The neural network used for the speaker-adaptive acous-
tic model has the same structure as in [19], with the size of
the LLE set to 64. Speaker bias is placed at layers B5, B6,
B7, and B8 (Fig. 1 in [19]) in the form of a one-hot vector
to represent speakers in the training set. The initial multi-
modal architecture is trained using the loss function defined
in Equation 5 with β set to 0.25. The SI WaveNet vocoder
contains 40 dilated causal layers and is conditioned on the
natural mel-spectrogram. The WaveNet vocoder is fine-tuned
for each target speaker using their respective available train-
ing speech data.
5. EVALUATION AND DISCUSSION
5.1. Voice conversion challenge 2018 SPOKE task
We follow the guidelines of VCC2018 and build systems
for the 4 target speakers of the SPOKE task and convert the
evaluation utterances of the 4 source speakers. We compare
our system (OUR) with B01 [39], which is the baseline of
VCC2018, and N10 [40], which is the best system based on
the subjective test. Instead of reproducing the experiments
for B01 and N10, we use the utterances submitted by the
participants to ensure the highest quality. We conduct the
listening test to judge the quality and similarity of utterances
generated by the OUR, B01 and N10 systems. A total of
28 native English speakers participated in our test; most of
them completed approximately 10 sessions. Each session is prepared
to contain generated utterances of every target speaker from
every system. In summary, each system is judged 1088 times
for quality and another 1088 times for similarity. In the qual-
ity question, the participant is asked to judge the quality of
the presented speech sample using a 5-point mean opinion
score (MOS) scale. In the similarity question, the participant is
asked to judge the similarity between the utterance generated
by one VC system and a natural utterance spoken by the
target speaker on a 4-point scale. The setup is the same as in
VCC2018 [38].
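As an illustration of how the per-system scores and 95% confidence intervals in Fig. 5 could be aggregated, a simple sketch is shown below; this is our own aggregation example, not the authors' evaluation script.

    # Aggregating listener ratings into an MOS with a 95% confidence interval.
    import numpy as np
    from scipy import stats

    def mos_with_ci(ratings, confidence=0.95):
        ratings = np.asarray(ratings, dtype=float)   # e.g. 1088 quality judgments
        mean = ratings.mean()
        sem = stats.sem(ratings)                     # standard error of the mean
        half_width = sem * stats.t.ppf((1 + confidence) / 2.0, len(ratings) - 1)
        return mean, half_width                      # MOS and CI half-width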
The general results can be seen in Fig. 5. While it is not as
good as the best system N10, our system is slightly better than
the baseline B01 in terms of quality and significantly better in
terms of speaker similarity. One should note that B01 is a
strong baseline; it is ranked 3rd in quality and 6th in speaker
similarity measurements at VCC2018 [38]. This validated
our framework for VC. More detailed results can be found in
Table 1, which shows consistent performance of our system
across same-gender and cross-gender pairs.

Fig. 7. Quality evaluations for low-resource language voice
conversion (EE-J, EJ-J, JJ-J, NA-J). All results are statistically
significant.
5.2. Cross-language speaker adaptation
Next, we evaluate our system in the scenario of cross-
language speaker adaptation (EJ-E). For this task we develop
a system using the speech data of the bilingual target speak-
ers. We compare EJ-E with EE-E, which is the standard
intra-language scenario for English, and a reference system NA-E,
which consists of held-out natural English utterances.
The native English speakers that participated in the sur-
vey for the previous task are asked to evaluate the cross-language
speaker adaptation task as well. In summary, each scenario
is judged 544 times for quality and another 544 times for
similarity. The quality and similarity results are illustrated
in Fig. 6 and are statistically significant between all scenar-
ios. The cross-language speaker adaptation EJ-E is generally
worse than EE-E, as one would expect.
5.3. Voice conversion for low-resource language
Finally, we test the performance of our system when using
it to convert speech of a low-resource language. This consists
of the EE-J and EJ-J scenarios. We use JJ-J, the abundant
language scenario of Japanese, as the upper-bound and NA-J,
the hold-out natural Japanese utterances, as the reference. We
prepare a listening test similar to that of the previous tasks. A
total of 148 native Japanese speakers participated in our test; no partici-
pant is allowed to do more than 10 sessions. Each session is
prepared to contain all source and target pairs. In summary,
each scenario is judged 2800 times for quality and another
2800 times for similarity except the reference natural speech,
which is only judged 1400 times for each measurement.
The result of the quality test is illustrated in Fig. 7. Our
standard Japanese system JJ-J achieved a relatively better
score than the English counterpart EE-E, although technically
they should not be compared directly. The cross-language
converting scenarios, EJ-J and EE-J, are a lot worse than the
standard scenario JJ-J; as expected, EJ-J has a slightly better
score than EE-J. The result for the similarity test is illustrated
in Fig. 8.

Fig. 8. Similarity evaluations for low-resource language voice
conversion (NA-E, EE-E, EE-J, EJ-J, JJ-J, NA-J); responses range
from "same (sure)" to "different (sure)" on a 4-point scale.

For the four systems that convert Japanese speech,
their similarity result trend is the same as their quality result
trend. For the similarity test, we also included two extra sys-
tems that produce English speech, EE-E and NA-E. In these
two cases the listener would listen to an English utterance
contributed by EE-E or NA-E and a natural Japanese utter-
ance of the same speaker (as the target speakers are bilingual)
and judge their similarity. Interestingly, the similarity result of
NA-E is a lot worse than that of NA-J even though they both
contain natural speech spoken by the same speaker in real life.
This suggests that utterances spoken by the same speaker in
different languages might not have consistent characteristics.
Or the problem might be that it is difficult for listeners to evalu-
ate speaker similarity in a cross-language scenario, especially
when they are not fluent in both languages.
Even though the performance in the unseen language sce-
narios is not as good as in the standard scenario, the subjective
results have shown the ability of our system to perform inter-
language tasks, establishing a preliminary result as well as
a solid baseline for future improvements.¹
6. CONCLUSION
In this paper, we have presented a methodology to bootstrap
a VC system from a pretrained speaker-adaptive TTS model.
By transferring knowledge learned previously by the TTS
model, we were able to significantly lower the data require-
ments for building the VC system. This in turn allows the
system to operate in a low-resource unseen language. The
subjective results show that our VC system achieves competi-
tive performance compared with existing methods. Moreover,
it can also be used for cross-language voice conversion and
cross-language speaker adaptation. While the performance
in these unseen language scenarios is not as good, all ex-
periments in this work are conducted under the assumption
of minimal available resources. Our future work includes
taking advantage of additional available resources, such
as a multi-lingual corpus [28, 41], to further improve the
robustness of the VC system.
¹ Speech samples are available at https://nii-yamagishilab.github.io/sample-vc-bootstrapping-tts/
7. REFERENCES
[1] Tomoki Toda, Alan W Black, and Keiichi Tokuda,
“Voice conversion based on maximum-likelihood esti-
mation of spectral parameter trajectory, IEEE Trans.
Audio, Speech, Language Process., vol. 15, no. 8, pp.
2222–2235, 2007.
[2] Paul Taylor, Text-to-speech synthesis, Cambridge uni-
versity press, 2009.
[3] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano,
and Hisao Kuwabara, “Voice conversion through vector
quantization, J. Acoust. Soc. Jpn. (E), vol. 11, no. 2,
pp. 71–76, 1990.
[4] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo
Ariki, “Exemplar-based voice conversion using sparse
representation in noisy environments, IEICE Transac-
tions on Fundamentals of Electronics, Communications
and Computer Sciences, vol. 96, no. 10, pp. 1946–1953,
2013.
[5] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-
Rong Dai, “Voice conversion using deep neural net-
works with layer-wise generative training, IEEE/ACM
Transactions on Audio, Speech and Language Process-
ing (TASLP), vol. 22, no. 12, pp. 1859–1872, 2014.
[6] Daniel Erro, Asunción Moreno, and Antonio Bonafonte,
“INCA algorithm for training voice conversion systems
from nonparallel corpora, IEEE Trans. Audio, Speech,
Language Process., vol. 18, no. 5, pp. 944–953, 2009.
[7] Yining Chen, Min Chu, Eric Chang, Jia Liu, and Run-
sheng Liu, “Voice conversion with smoothed gmm and
map adaptation, in Proc. EUROSPEECH, 2003, pp.
2413–2416.
[8] Chung-Han Lee and Chung-Hsien Wu, “Map-based
adaptation for speech conversion using adaptation data
selection and non-parallel training, in Proc. INTER-
SPEECH, 2006, pp. 2254–2257.
[9] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano,
“Eigenvoice conversion based on gaussian mixture
model,” in Proc. INTERSPEECH, 2006, pp. 2446–2449.
[10] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and He-
len Meng, “Phonetic posteriorgrams for many-to-one
voice conversion without parallel data training, in
2016 IEEE International Conference on Multimedia and
Expo (ICME). IEEE, 2016, pp. 1–6.
[11] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi,
and Hiroshi Saruwatari, “Voice conversion us-
ing sequence-to-sequence learning of context posterior
probabilities, Proc. INTERSPEECH, pp. 1268–1272,
2017.
[12] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu,
Yu Tsao, and Hsin-Min Wang, “Voice conversion from
non-parallel corpora using variational auto-encoder, in
Proc. APSIPA. IEEE, 2016, pp. 1–6.
[13] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-
data-free voice conversion using cycle-consistent ad-
versarial networks, arXiv preprint arXiv:1711.11293,
2017.
[14] Fuming Fang, Junichi Yamagishi, Isao Echizen, and
Jaime Lorenzo-Trueba, “High-quality nonparallel voice
conversion based on cycle-consistent adversarial net-
work, in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 5279–5283.
[15] Andros Tjandra, Berrak Sisman, Mingyang Zhang,
Sakriani Sakti, Haizhou Li, and Satoshi Nakamura,
“Vqvae unsupervised unit discovery and multi-scale
code2spec inverter for zerospeech challenge 2019,
arXiv preprint arXiv:1905.11449, 2019.
[16] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-
yi Lee, and Lin-shan Lee, “Rhythm-flexible voice
conversion without parallel data using cycle-gan over
phoneme posteriorgram sequences, in Proc. SLT, 2018,
pp. 274–281.
[17] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and
Nobukatsu Hojo, “Atts2s-vc: Sequence-to-sequence
voice conversion with attention and context preservation
mechanisms, in Proc. ICASSP, 2019, pp. 6805–6809.
[18] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai,
“Non-parallel sequence-to-sequence voice conversion
with disentangled linguistic and speaker representa-
tions, arXiv preprint arXiv:1906.10508, 2019.
[19] Hieu-Thi Luong and Junichi Yamagishi, “A unified
speaker adaptation method for speech synthesis using
transcribed and untranscribed speech with backpropa-
gation, arXiv preprint arXiv:1906.07414, 2019.
[20] Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan
Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruom-
ing Pang, Ignacio Lopez Moreno, and Yonghui Wu,
“Transfer learning from speaker verification to mul-
tispeaker text-to-speech synthesis, arXiv preprint
arXiv:1806.04558, 2018.
[21] Eliya Nachmani, Adam Polyak, Yaniv Taigman, and
Lior Wolf, “Fitting new speakers based on a short un-
transcribed sample, arXiv preprint arXiv:1802.06984,
2018.
[22] Yannis Stylianou, Olivier Cappé, and Eric Moulines,
“Continuous probabilistic transform for voice conver-
sion, IEEE Transactions on speech and audio process-
ing, vol. 6, no. 2, pp. 131–142, 1998.
[23] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and
Nobukatsu Hojo, “Acvae-vc: Non-parallel many-to-
many voice conversion with auxiliary classifier varia-
tional autoencoder, arXiv preprint arXiv:1808.05092,
2018.
[24] Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou
Li, and Junichi Yamagishi, “Joint training frame-
work for text-to-speech and voice conversion using
multi-source tacotron and wavenet, arXiv preprint
arXiv:1903.12389, 2019.
[25] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen M
Meng, “Personalized, cross-lingual tts using phonetic
posteriorgrams., in Proc. INTERSPEECH, 2016, pp.
322–326.
[26] Masanobu Abe, Kiyohiro Shikano, and Hisao
Kuwabara, “Statistical analysis of bilingual speakers
speech for cross-language voice conversion, J. Acoust.
Soc. Am., vol. 90, no. 1, pp. 76–82, 1991.
[27] Mikiko Mashimo, Tomoki Toda, Kiyohiro Shikano, and
Nick Campbell, “Evaluation of cross-language voice
conversion based on GMM and Straight, in Proc. EU-
ROSPEECH, 2001, pp. 361–364.
[28] Yi Zhou, Xiaohai Tian, Haihua Xu, Rohan Kumar Das,
and Haizhou Li, “Cross-lingual voice conversion with
bilingual phonetic posteriorgram and average model-
ing, in Proc. ICASSP, 2019, pp. 6790–6794.
[29] Sai Sirisha Rallabandi and Suryakanth V Gangashetty,
An approach to cross-lingual voice conversion, in
Proc. IJCNN, 2019.
[30] Yi-Jian Wu, Yoshihiko Nankaku, and Keiichi Tokuda,
“State mapping based method for cross-lingual speaker
adaptation in hmm-based speech synthesis, in Proc.
INTERSPEECH, 2009, pp. 528–531.
[31] Ewan Dunbar, Robin Algayres, Julien Karadayi, Math-
ieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie
Miskic, Charlotte Dugrain, Lucas Ondel, Alan W.
Black, Laurent Besacier, Sakriani Sakti, and Emmanuel
Dupoux, “The zero resource speech challenge 2019:
TTS without T, arXiv preprint arXiv:1904.11469,
2019.
[32] Siyuan Feng, Tan Lee, and Zhiyuan Peng, “Combining
adversarial training and disentangled speech representa-
tion for robust zero-resource subword modeling, arXiv
preprint arXiv:1906.07234, 2019.
[33] Andy T Liu, Po-chun Hsu, and Hung-yi Lee, “Unsu-
pervised end-to-end learning of discrete linguistic units
for voice conversion, arXiv preprint arXiv:1905.11563,
2019.
[34] Ryan Eloff, André Nortje, Benjamin van Niekerk,
Avashna Govender, Leanne Nortje, Arnu Pretorius,
Elan Van Biljon, Ewald van der Westhuizen, Lisa
van Staden, and Herman Kamper, “Unsupervised
acoustic unit discovery for speech synthesis using dis-
crete latent-variable neural networks, arXiv preprint
arXiv:1904.07556, 2019.
[35] Sunayana Sitaram, Sukhada Palkar, Yun-Nung Chen,
Alok Parlikar, and Alan W Black, “Bootstrapping text-
to-speech for speech processing in languages without an
orthography, in Proc. ICASSP, 2013, pp. 7992–7996.
[36] Prasanna Kumar Muthukumar and Alan W Black, “Au-
tomatic discovery of a phonetic inventory for unwrit-
ten languages for statistical speech synthesis, in Proc.
ICASSP, 2014, pp. 2594–2598.
[37] Christophe Veaux, Junichi Yamagishi, and Kirsten
MacDonald, “CSTR VCTK corpus: English multi-
speaker corpus for CSTR voice cloning toolkit, 2017,
http://dx.doi.org/10.7488/ds/1994.
[38] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki
Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kin-
nunen, and Zhenhua Ling, “The voice conversion chal-
lenge 2018: Promoting development of parallel and
nonparallel methods, in Proc. Odyssey, 2018, pp. 195–
202.
[39] Kazuhiro Kobayashi and Tomoki Toda, “sprocket:
Open-source voice conversion software, in Proc.
Odyssey, 2018, pp. 203–210.
[40] Li-Juan Liu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou,
and Li-Rong Dai, “Wavenet vocoder with limited train-
ing data for voice conversion,” in Proc. INTERSPEECH,
2018, pp. 1983–1987.
[41] Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, and Hung-
yi Lee, “End-to-end text-to-speech for low-resource
languages by cross-lingual transfer learning, arXiv
preprint arXiv:1904.06508, 2019.