Data and AI cluster

Project: Dialects and Low-Resource Languages in LLMs

Description

In multilingual natural language processing (NLP), languages are commonly grouped by how well resourced they are, that is, by the amount of data and tooling available for them. The Common Crawl statistics page shows that data crawled from the internet is predominantly English, followed by Russian and German. These languages make up more than half of the 363 TiB of uncompressed content. Examples of high-resource languages are English, Chinese, and Spanish. Examples of medium-resource languages are Hindi, Swedish, and Turkish. Examples of low-resource languages are Zulu, Amharic, and Khmer.

When training large language models (LLMs), pre-training on overwhelmingly English data results in English-centric models [1, 2]. Speakers of medium- and low-resource languages are left behind, both linguistically and culturally. For low-resource languages, models produce broken translations, spelling errors, and English-centric biases, which prevents speakers from using these tools effectively [3, 4, 5].

Dialects are varieties of a language. Even fewer tokens are available for them than for the language as a whole, and they usually carry nuanced cultural distinctions.

This project aims to measure (and possibly improve!) model performance on low-resource languages, with particular attention to dialects. Possible research outcomes include a benchmark and a novel method for improving LLM performance on dialects.
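
As a rough illustration of what "measuring performance" could look like, the sketch below computes accuracy per language variety and the gap to a reference variety. It is only a minimal example under stated assumptions: the function names (accuracy_by_variety, gap_to_reference), the record format, and the toy data are all hypothetical, and a real benchmark for this project may use entirely different tasks and metrics.

```python
from collections import defaultdict


def accuracy_by_variety(records):
    """Return accuracy per language variety.

    `records` is an iterable of (variety, prediction, gold) tuples.
    This record format is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for variety, prediction, gold in records:
        total[variety] += 1
        correct[variety] += int(prediction == gold)
    return {variety: correct[variety] / total[variety] for variety in total}


def gap_to_reference(scores, reference):
    """Return reference accuracy minus each other variety's accuracy."""
    return {
        variety: scores[reference] - score
        for variety, score in scores.items()
        if variety != reference
    }


# Invented toy predictions, not real model outputs or results.
records = [
    ("english", "a", "a"),
    ("english", "b", "b"),
    ("english", "c", "c"),
    ("english", "d", "a"),
    ("zulu", "a", "a"),
    ("zulu", "b", "c"),
    ("zulu", "c", "a"),
    ("zulu", "d", "d"),
]

scores = accuracy_by_variety(records)
gaps = gap_to_reference(scores, reference="english")
print(scores)
print(gaps)
```

A positive gap means the model does worse on that variety than on the reference. The same comparison could be made between a standard language and one of its dialects.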

Details

Supervisor: Joaquin Vanschoren
Secondary supervisor: Dalton Harmsen
Interested?: Get in contact