7. REFERENCES
[1] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 8, pp. 2222–2235, 2007.
[2] Paul Taylor, Text-to-Speech Synthesis, Cambridge University Press, 2009.
[3] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, “Voice conversion through vector quantization,” J. Acoust. Soc. Jpn. (E), vol. 11, no. 2, pp. 71–76, 1990.
[4] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E96-A, no. 10, pp. 1946–1953, 2013.
[5] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 12, pp. 1859–1872, 2014.
[6] Daniel Erro, Asunción Moreno, and Antonio Bonafonte, “INCA algorithm for training voice conversion systems from nonparallel corpora,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 5, pp. 944–953, 2009.
[7] Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu, “Voice conversion with smoothed GMM and MAP adaptation,” in Proc. EUROSPEECH, 2003, pp. 2413–2416.
[8] Chung-Han Lee and Chung-Hsien Wu, “MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training,” in Proc. INTERSPEECH, 2006, pp. 2254–2257.
[9] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice conversion based on Gaussian mixture model,” in Proc. INTERSPEECH, 2006, pp. 2446–2449.
[10] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in Proc. ICME, 2016, pp. 1–6.
[11] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Voice conversion using sequence-to-sequence learning of context posterior probabilities,” in Proc. INTERSPEECH, 2017, pp. 1268–1272.
[12] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, 2016, pp. 1–6.
[13] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
[14] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proc. ICASSP, 2018, pp. 5279–5283.
[15] Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura, “VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech Challenge 2019,” arXiv preprint arXiv:1905.11449, 2019.
[16] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, and Lin-shan Lee, “Rhythm-flexible voice conversion without parallel data using Cycle-GAN over phoneme posteriorgram sequences,” in Proc. SLT, 2018, pp. 274–281.
[17] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, “ATTS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in Proc. ICASSP, 2019, pp. 6805–6809.
[18] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations,” arXiv preprint arXiv:1906.10508, 2019.
[19] Hieu-Thi Luong and Junichi Yamagishi, “A unified speaker adaptation method for speech synthesis using transcribed and untranscribed speech with backpropagation,” arXiv preprint arXiv:1906.07414, 2019.
[20] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” arXiv preprint arXiv:1806.04558, 2018.
[21] Eliya Nachmani, Adam Polyak, Yaniv Taigman, and Lior Wolf, “Fitting new speakers based on a short untranscribed sample,” arXiv preprint arXiv:1802.06984, 2018.