BeatLens: A Context-Aware Vision-to-Music Framework for Image-Based Song Recommendations

Abstract

BeatLens is an AI-based song recommendation engine designed to enhance social media storytelling by automating music selection for Instagram stories. It addresses the often time-consuming task of choosing a song by analyzing uploaded images with computer vision models, YOLO for object detection and CLIP for scene classification, to decipher visual context. The system then uses Large Language Models (LLMs) such as LLaMA 3, LLaVA, and Mistral to suggest songs that match the mood, theme, and setting of the image. For maximum accessibility, BeatLens supports 14 languages: English, Marathi, Hindi, Spanish, Punjabi, Bhojpuri, Korean, German, Portuguese, Japanese, Tamil, Telugu, Kannada, and Malayalam. This multilingual functionality, paired with AI-powered analysis, turns song selection into an intuitive, streamlined process that improves user experience and minimizes decision fatigue.
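
To make the pipeline concrete, the sketch below traces one plausible implementation of the flow described above: YOLO detects objects, CLIP performs zero-shot scene classification, and an Ollama-hosted LLM turns the combined context into song suggestions. The checkpoint names (yolov8n.pt, openai/clip-vit-base-patch32, llama3), the scene label set, and the prompt wording are illustrative assumptions rather than details taken from the paper.

# Minimal sketch of the abstract's pipeline: YOLO -> CLIP -> LLM.
# Model checkpoints, scene labels, and the prompt are assumptions,
# not specifics from the paper.
from PIL import Image
import torch
import ollama                                       # local LLM runner
from ultralytics import YOLO                        # object detection
from transformers import CLIPModel, CLIPProcessor   # scene classification

# Assumed candidate scene descriptions for zero-shot CLIP classification.
SCENE_LABELS = ["a beach at sunset", "a city street at night",
                "a mountain hike", "a birthday party", "a cozy cafe"]

def analyze_image(path: str) -> dict:
    image = Image.open(path).convert("RGB")

    # 1) Detect objects with YOLO (yolov8n.pt is an assumed checkpoint).
    det = YOLO("yolov8n.pt")(image)[0]
    objects = sorted({det.names[int(c)] for c in det.boxes.cls})

    # 2) Score each candidate scene description against the image with CLIP
    #    and keep the most probable one.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=SCENE_LABELS, images=image,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return {"objects": objects, "scene": SCENE_LABELS[int(probs.argmax())]}

def recommend_songs(context: dict, language: str = "English") -> str:
    # 3) Ask a locally served LLM for songs matching the visual context;
    #    the prompt template is a guess at what such a system might use.
    prompt = (f"An Instagram story photo shows {context['scene']}, "
              f"featuring: {', '.join(context['objects'])}. "
              f"Suggest five {language} songs that match its mood.")
    reply = ollama.chat(model="llama3",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

if __name__ == "__main__":
    print(recommend_songs(analyze_image("story.jpg"), language="Hindi"))

Swapping the model tag passed to ollama.chat (for example, to llava or mistral) would exercise the other LLMs the abstract mentions; a LLaVA variant could even consume the image directly rather than the YOLO/CLIP summary.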

Country: India

Aditya Arolkar¹, Dhaval Smart², Gaurav Waghmare³, Pratham Atale⁴, Prof. Sonali Despande⁵

  1. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, Navi Mumbai, Maharashtra, India
  2. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, Navi Mumbai, Maharashtra, India
  3. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, Navi Mumbai, Maharashtra, India
  4. Student, Smt. Indira Gandhi College of Engineering, Ghansoli, Navi Mumbai, Maharashtra, India
  5. Professor, Smt. Indira Gandhi College of Engineering, Ghansoli, Navi Mumbai, Maharashtra, India

IRJIET, Volume 9, Issue 4, April 2025, pp. 140–146

https://doi.org/10.47001/IRJIET/2025.904021

References

  1. YOLO (You Only Look Once): Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
  2. CLIP (Contrastive Language-Image Pre-training): Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
  3. Streamlit: Streamlit documentation. Retrieved from https://streamlit.io/
  4. Ollama: Ollama documentation. Retrieved from https://ollama.com/
  5. Transformers Library: Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … & Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing. arXiv:1910.03771.
  6. PyTorch: PyTorch documentation. Retrieved from https://pytorch.org/
  7. NumPy: Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
  8. PIL (Pillow): Pillow documentation. Retrieved from https://pillow.readthedocs.io/en/stable/
  9. OpenCV: OpenCV documentation. Retrieved from https://opencv.org/
  10. LLaMA 3, LLaVA, Mistral, Gemma, Phi: Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.