Real Time Voice Cloning System

Abstract

Real-time voice cloning, an emerging area of artificial intelligence, has witnessed significant advances in recent years owing to the rapid progress of deep learning techniques. This survey paper delves into real-time voice cloning systems that employ deep learning methodologies. The ability to generate highly realistic, natural-sounding speech from limited audio samples has garnered attention due to its potential applications in entertainment, assistive technology, virtual assistants, and more. The survey provides an in-depth analysis of the key components and techniques employed in real-time voice cloning systems. We explore neural network architectures that have been applied to voice cloning, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs). Additionally, we investigate the role of different training paradigms, including supervised, semi-supervised, and unsupervised learning, and discuss their implications for cloning accuracy and efficiency. Furthermore, the paper examines datasets used for training and evaluation, ranging from large-scale multilingual corpora to more specialized speech datasets. The surveyed framework can clone voices not encountered during training and generate speech from previously unseen text.

Country: India

Shruti Parshuram Kambali¹, Ansari Majid Ali², Priyanshi Upendra Srivastav³, Aryan Manish Dandwekar⁴, Dr. Radhika Nanda⁵

  1. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, New Mumbai, Maharashtra, India
  2. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, New Mumbai, Maharashtra, India
  3. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, New Mumbai, Maharashtra, India
  4. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, New Mumbai, Maharashtra, India
  5. Professor, Dept. of AI & ML, Smt. Indira Gandhi College of Engineering, Ghansoli, New Mumbai, Maharashtra, India

IRJIET, Volume 7, Issue 10, October 2023 pp. 294-303

DOI: https://doi.org/10.47001/IRJIET/2023.710038
