Language-Independent Data Augmentation for Text Classification [LiDA]

Abstract

Building high-performance text classification models for low-resource languages is challenging because labelled data is scarce. Traditional approaches rely on manually annotated corpora, which are expensive and time-consuming to obtain. Data augmentation can reduce this cost, but most existing augmentation methods are language-dependent: they rely on linguistic resources such as synonym replacement, word embeddings, or grammar-based transformations, which restricts their applicability in multilingual and low-resource settings. Our approach, LiDA, combines back-translation, token-level perturbations, and contrastive learning to create diverse, semantically meaningful augmented samples. Back-translation introduces natural variation while preserving meaning; token-level perturbations modify individual tokens to improve robustness; and contrastive learning helps the model distinguish subtle differences between text representations, which improves generalization to unseen data. Our results show that LiDA outperforms traditional augmentation techniques by generating more contextually relevant and linguistically diverse samples, particularly in low-resource settings. The method also improves model adaptability to multilingual data, demonstrating its potential as a scalable, language-agnostic augmentation strategy.
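
As a rough illustration of two of the components named in the abstract, the sketch below shows one possible form of token-level perturbation and of a contrastive objective. It is not the authors' implementation. The function names (perturb_tokens, cosine, contrastive_loss), the whitespace tokenisation, the choice of random deletion and random swap as the perturbations, and the InfoNCE-style form of the loss are all assumptions made for this example. Back-translation is left out because it needs an external translation model.

```python
import math
import random


def perturb_tokens(text, p_delete=0.1, n_swaps=1, seed=None):
    """Hypothetical helper: perturb `text` by random deletion and random swaps.

    It splits on whitespace only, so it needs no language-specific tools.
    """
    rng = random.Random(seed)
    tokens = text.split()
    if len(tokens) < 2:
        return text  # nothing sensible to perturb
    # Random deletion: drop each token with probability p_delete.
    kept = [t for t in tokens if rng.random() >= p_delete]
    if not kept:  # never return an empty sample
        kept = [rng.choice(tokens)]
    # Random swap: exchange two distinct positions, n_swaps times.
    for _ in range(n_swaps):
        if len(kept) < 2:
            break
        i, j = rng.sample(range(len(kept)), 2)
        kept[i], kept[j] = kept[j], kept[i]
    return " ".join(kept)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE-style loss for one anchor.

    The loss is small when the anchor is closer to its positive (for example,
    the embedding of an augmented copy) than to the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    print(perturb_tokens("this film was surprisingly good", seed=0))
    print(contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]))
```

In a setup like this, each original sentence and its perturbed copy would be encoded by the classifier's encoder and treated as an anchor-positive pair, with other sentences in the batch serving as negatives.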

Country: India

1 M. Sharmila Devi, 2 G. Sharanya, 3 B. Himaja, 4 A. Bhavya Rohitha, 5 A. Sujitha, 6 J. Swapna Kumari

  1. Assistant Professor, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India
  2. Student, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India
  3. Student, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India
  4. Student, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India
  5. Student, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India
  6. Student, Department of Computer Science & Engineering, Santhiram Engineering College, Nandyal, A.P., India

IRJIET, Volume 9, Special Issue of INSPIRE’25, April 2025, pp. 164-171

doi.org/10.47001/IRJIET/2025.INSPIRE27
