Model-generated Code-mixed Sentences

Code-switching (or code-mixing) is a common linguistic practice in which speakers alternate between multiple languages within a single discourse. Developing language models with code-switching capabilities is crucial for serving large multilingual communities. However, collecting real-world code-mixed sentences for training remains challenging due to their colloquial nature, which highlights the importance of synthetic code-mixed data. Preliminary experiments show that LLMs can generate code-mixed sentences by switching entities across languages. However, these model-generated sentences are often unnatural, limiting their usefulness for training. This project explores methods to improve the naturalness of model-generated code-mixed sentences. Ultimately, we aim to build an automated pipeline capable of generating natural-sounding code-switched sentences for downstream tasks.
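The entity-switching idea mentioned above can be illustrated with a minimal rule-based sketch: substitute selected words in a monolingual sentence with their translations to produce a synthetic code-mixed sentence. The tiny English-to-Spanish lexicon below is purely illustrative and not part of the project's actual pipeline, which uses LLMs rather than a fixed dictionary.

```python
# Toy sketch of entity-based code-mixing: replace words found in a
# bilingual lexicon with their translations. The lexicon here is a
# hypothetical example, not data from the project.
EN_TO_ES = {
    "breakfast": "desayuno",
    "morning": "mañana",
}

def code_mix(sentence: str, lexicon: dict) -> str:
    """Replace any word found in `lexicon` with its translation."""
    out = []
    for token in sentence.split():
        # Strip trailing punctuation so dictionary lookups still match.
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        out.append(lexicon.get(core.lower(), core) + tail)
    return " ".join(out)

print(code_mix("I had breakfast this morning.", EN_TO_ES))
# → "I had desayuno this mañana."
```

A dictionary-based substitution like this produces exactly the kind of stilted output the project aims to move beyond: the switched sentence is grammatical at the token level but often unnatural, motivating LLM-based generation and naturalness improvements.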

Tianyi Zhu
UROP Student (Jan ‘25)

Barid Xi Ai
Research Fellow

Postdoctoral Research Fellow at WING

Min-Yen Kan
Associate Professor

WING lead; interests include Digital Libraries, Information Retrieval, and Natural Language Processing.