Vision-to-Voice: AI for Generating Description & Audio of Visual Content

Abstract

The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advance at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper presents an end-to-end Vision-to-Voice pipeline encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, enabling AI systems to narrate visual content for human users. The proposed methodology integrates attention-based image captioning with context-aware linguistic augmentation and a neural vocoder trained for expressive speech synthesis, yielding fluent and natural audio descriptions. While individual advances in image captioning and text-to-speech (TTS) synthesis are well documented, their fusion into a single end-to-end, real-time system poses distinct research and engineering challenges, including preserving context across modalities, maintaining linguistic fluency, and ensuring audio naturalness. The system addresses these gaps through a unified encoder-decoder captioning module with Bahdanau attention, followed by a Tacotron 2-based Mel-spectrogram generation module and a HiFi-GAN-based waveform synthesis module. Experiments on standard datasets, including Flickr8K and LJSpeech, demonstrate the efficacy of the proposed system in caption quality (BLEU) and audio naturalness (mean opinion score, MOS). The Vision-to-Voice system holds promising applications in assistive technologies, multimedia enrichment, and automated video annotation, contributing to both academic research and real-world accessibility solutions.
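To make the captioning module concrete, the sketch below implements the additive (Bahdanau) attention mechanism named above, in which a small learned scoring network weighs each image-region feature against the decoder's current hidden state before forming a context vector. This is a minimal PyTorch illustration, not the paper's implementation; the dimension names (feat_dim, hidden_dim, attn_dim) and the sizes in the usage snippet are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive (Bahdanau) attention over per-region image features.

    Illustrative sketch only: feat_dim, hidden_dim, and attn_dim are
    assumed names, not taken from the paper.
    """
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.W_feat = nn.Linear(feat_dim, attn_dim)      # projects image features
        self.W_hidden = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.v = nn.Linear(attn_dim, 1)                  # scalar score per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(
            self.W_feat(features) + self.W_hidden(hidden).unsqueeze(1)
        ))                                       # (batch, num_regions, 1)
        alpha = torch.softmax(scores, dim=1)     # attention weights over regions
        context = (alpha * features).sum(dim=1)  # weighted sum: (batch, feat_dim)
        return context, alpha.squeeze(-1)

# Usage with assumed sizes: a 7x7 CNN grid flattened to 49 region vectors.
attn = BahdanauAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
features = torch.randn(4, 49, 2048)      # batch of 4 images
hidden = torch.randn(4, 512)             # current decoder hidden state
context, alpha = attn(features, hidden)  # context: (4, 2048), alpha: (4, 49)
```

In the full pipeline described above, the caption decoded with this attention would then be rendered to a Mel spectrogram by the Tacotron 2 module and to a waveform by the HiFi-GAN vocoder.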


¹P. Jayanth, ²K. Lakshmi Sree, ³K. Karthik Kumar Reddy, ⁴G. Om Prakash, ⁵G. Reddy Prasad

  1–5. Department of Artificial Intelligence, Madanapalle Institute of Technology & Science, Madanapalle, India

IRJIET, Volume 9, Special Issue of ICCIS-2025, May 2025, pp. 206–213

https://doi.org/10.47001/IRJIET/2025.ICCIS-202533
