Building
high-performance text classification models in low-resource languages is a
challenging task due to the scarcity of labelled data. Traditional approaches
rely on manually annotated corpora, which are expensive and time-consuming to
obtain. Data augmentation offers a cost-effective way to enlarge such training
sets; however, most existing augmentation methods are language-dependent,
relying on linguistic tools such as synonym replacement, word embeddings, or
grammar-based transformations, which restricts their applicability to
multilingual and low-resource settings. Our approach, LiDA, combines
back-translation, token-level perturbations, and contrastive learning to create
diverse, semantically meaningful augmented samples that enhance model learning.
Back-translation introduces natural variations while preserving meaning,
token-level perturbations modify individual tokens to improve robustness, and
contrastive learning helps the model distinguish between subtle differences in
text representations, leading to better generalization across unseen data. Our
results show that LiDA outperforms traditional augmentation techniques by
generating more contextually relevant and linguistically diverse samples,
particularly in low-resource environments. Furthermore, our method enhances
model adaptability to multilingual data, demonstrating its potential as a
scalable and language-agnostic augmentation strategy.
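The abstract does not specify which token-level perturbations LiDA applies; as a minimal illustrative sketch (assuming generic operations such as random deletion and adjacent swaps, which are common choices for robustness-oriented augmentation), a language-agnostic perturbation function might look like:

```python
import random

def perturb_tokens(tokens, p_delete=0.1, p_swap=0.1, seed=None):
    """Illustrative token-level perturbation: random deletion plus
    random adjacent swaps. The exact operations used by LiDA are an
    assumption; this sketch only shows the general technique."""
    rng = random.Random(seed)
    # Random deletion: drop each token with probability p_delete,
    # but never return an empty sequence.
    kept = [t for t in tokens if rng.random() > p_delete] or [rng.choice(tokens)]
    # Random adjacent swap: exchange neighbouring tokens with
    # probability p_swap to introduce mild word-order noise.
    out = kept[:]
    for i in range(len(out) - 1):
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

Because the function operates on an arbitrary token list, it requires no language-specific resources, which is the property the abstract emphasises.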
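The contrastive objective used by LiDA is likewise not stated in the abstract; assuming a standard SimCLR-style NT-Xent loss over (original, augmented) embedding pairs, a NumPy sketch of how such a loss distinguishes subtle representation differences is:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over paired embeddings (illustrative;
    the specific objective in LiDA is an assumption).

    z1, z2: (N, d) arrays where row i of z1 and z2 embed an original
    sentence and its augmented version, respectively."""
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalise rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    n = z1.shape[0]
    # Exclude self-similarity so it never acts as a candidate pair.
    np.fill_diagonal(sim, -np.inf)
    # The positive for row i is its counterpart at i+n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

Minimising this loss pulls each sentence toward its augmented version while pushing it away from all other sentences in the batch, which matches the abstract's description of learning to separate subtle differences in text representations.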
IRJIET, Volume 9, Special Issue of INSPIRE’25 April 2025 pp. 164-171