In multilingual natural language processing (NLP), languages are distinguished by their resource level, that is, the amount of resources available for a given language. The Common Crawl statistics page shows that data crawled from the internet is predominantly English, followed by Russian and German. Together, these languages make up more than half of the 363 TiB of uncompressed content. Examples of high-resource languages are English, Chinese, and Spanish. Examples of medium-resource languages are Hindi, Swedish, and Turkish. Examples of low-resource languages are Zulu, Amharic, and Khmer.
When large language models (LLMs) are pre-trained on overwhelmingly English data, the resulting models are English-centric [1, 2]. Speakers of medium- and low-resource languages are left behind, both linguistically and culturally. For low-resource languages, models tend to produce broken translations and spelling errors and to exhibit English-centric biases, which prevents speakers from using these tools effectively [3, 4, 5].
Dialects are varieties within a language. They have even fewer tokens available than the language as a whole and usually carry nuanced cultural distinctions.
This project aims to measure (and possibly increase!) model performance on low-resource languages, with an emphasis on dialects. Possible research outcomes are a benchmark and a novel method for improving LLM performance on dialects.
Recommended reading:
● Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts
● Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
● (Optional) CLRLC-LLMs Workshop @ NeurIPS 2025
References:
[1] Zhong, Chengzhi, et al. "What Language Do Non-English-Centric Large Language Models Think in?" Findings of the Association for Computational Linguistics: ACL 2025.
[2] Gupta, Vansh, et al. "Multilingual performance biases of large language models in education." arXiv preprint arXiv:2504.17720 (2025).
[3] Misra, Amit, et al. "AI Diffusion in Low Resource Language Countries." arXiv preprint arXiv:2511.02752 (2025).
[4] Digital Divide Data. "Low-Resource Languages in AI." Digital Divide Data, 20 Jan. 2023, www.digitaldividedata.com/blog/low-resource-languages-in-ai.
[5] Arnett, Catherine, and Tyler Chang. "An Analysis of Multilingual Models on Hugging Face." Hugging Face, 18 Sept. 2025, huggingface.co/blog/catherinearnett/hf-model-survey.
Joaquin Vanschoren
Dalton Harmsen