Text To Speech with Emotion Control Using Deep Learning

Abstract

Recent advances in neural Text-to-Speech (TTS) systems have produced speech with high intelligibility and naturalness, yet most deployed systems still sound emotionally neutral. This lack of affective expressiveness limits user engagement and degrades the quality of human–computer interaction, especially in applications such as virtual assistants, audiobooks, education, and accessibility technologies. This work proposes an emotion-infused TTS framework that extends a Tacotron-based sequence-to-sequence architecture with explicit emotion conditioning. The system leverages the Emotional Speech Database (ESD) to model five emotional categories—neutral, happy, angry, sad, and surprise—and incorporates emotion vectors alongside text embeddings in the encoder–decoder pipeline. Mel-spectrograms predicted by the model are converted to waveforms using the Griffin–Lim algorithm. Experimental training on the English emotional subsets of ESD demonstrates stable convergence of the mel-spectrogram reconstruction loss and the capability to synthesize perceptually distinct emotional speech, as observed through qualitative waveform and spectrogram analysis. A web-based interface is further developed to enable end-user interaction, allowing text input or file upload with a selectable emotional style. The proposed system shows that explicit emotion conditioning can significantly enhance the expressiveness of neural TTS without sacrificing intelligibility, and it provides a practical foundation for emotionally aware human–machine communication.
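The emotion conditioning described above—injecting an emotion vector alongside the text embeddings—can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the embedding and encoder dimensions, the randomly initialised lookup table, and the function name are all assumptions. In a trained model the table entries would be learned jointly with the rest of the network.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "angry", "sad", "surprise"]
EMB_DIM = 64   # hypothetical emotion-embedding size
ENC_DIM = 256  # hypothetical text-encoder output size

rng = np.random.default_rng(0)
# One vector per emotion category (randomly initialised here;
# learned end-to-end in a real Tacotron-style system).
emotion_table = rng.normal(size=(len(EMOTIONS), EMB_DIM)).astype(np.float32)

def condition_on_emotion(encoder_out, emotion):
    """Broadcast the emotion embedding across time and concatenate it
    to every encoder timestep, giving the decoder a global style cue."""
    e = emotion_table[EMOTIONS.index(emotion)]               # (EMB_DIM,)
    e_tiled = np.tile(e, (encoder_out.shape[0], 1))          # (T, EMB_DIM)
    return np.concatenate([encoder_out, e_tiled], axis=-1)   # (T, ENC_DIM + EMB_DIM)

# e.g. encoder outputs for a 37-symbol input sequence
text_enc = rng.normal(size=(37, ENC_DIM)).astype(np.float32)
cond = condition_on_emotion(text_enc, "happy")   # shape (37, 320)
```

Because the emotion vector is repeated at every timestep, the decoder's attention sees the same style cue regardless of which input symbols it attends to, which is what makes the conditioning "global" rather than per-phoneme.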
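The Griffin–Lim step converts the predicted magnitude spectrogram to a waveform by alternating between imposing the target magnitude and re-estimating a consistent phase. A compact sketch using SciPy's STFT/ISTFT is given below; the window length, overlap, and iteration count are illustrative defaults, not the settings used in this work, and the sketch operates on a linear magnitude spectrogram (a mel-spectrogram would first need to be mapped back to the linear-frequency scale).

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Recover a waveform from a magnitude spectrogram by iterating:
    ISTFT -> STFT -> keep estimated phase -> re-impose target magnitude."""
    rng = np.random.default_rng(seed)
    # Start from a random phase estimate.
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, wav = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, est = stft(wav, nperseg=nperseg, noverlap=noverlap)
        # Keep the phase implied by the time-domain signal,
        # but force the magnitudes back to the target.
        spec = mag * np.exp(1j * np.angle(est))
    _, wav = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return wav
```

Each iteration can only reduce (or keep) the mismatch between the target magnitudes and the magnitudes of the reconstructed signal, which is why the loop converges to a perceptually reasonable waveform even though no phase information was predicted by the model.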


¹Manasa S M, ²Subramanya S Gujjar, ³Shashank H S, ⁴Subhash G K, ⁵Vikyath M A

  1. Assistant Professor, Department of IS&E, JNNCE, Shivamogga, Karnataka, India
  2. UG Student, Department of IS&E, JNNCE, Shivamogga, Karnataka, India
  3. UG Student, Department of IS&E, JNNCE, Shivamogga, Karnataka, India
  4. UG Student, Department of IS&E, JNNCE, Shivamogga, Karnataka, India
  5. UG Student, Department of IS&E, JNNCE, Shivamogga, Karnataka, India

IRJIET, Volume 9, Issue 12, December 2025, pp. 192–198

doi.org/10.47001/IRJIET/2025.912029
