Data diversity as the foundation for better language models

ERC Starting Grant awarded for research into data diversity in Natural Language Processing


In September, Dong Nguyen received an ERC Starting Grant of €1.5 million for her research on the impact of data diversity on the quality of language models. In her DataDivers project, she aims to develop methods to accurately measure the diversity of a dataset and to investigate how this diversity influences the behaviour of a language model. The goal: to create fairer and more robust language models by placing diversity at the core of their training process.

Talking computers are no longer just a sci-fi concept. Even though devices often don’t literally speak, chatbots, automatic translation, and automated moderation on online platforms are part of our everyday lives. For Dong Nguyen, a researcher in the field of Natural Language Processing (NLP), this is great news: her research is more relevant than ever. NLP is a technology that enables computers to understand and use human language. Although it has existed since the 1950s, the field has experienced a tremendous leap in the past five years, largely due to the rise of generative AI. “The developments are moving incredibly fast. That’s why I try to identify research questions that will remain relevant in the long term.”

Quality over quantity

NLP models, such as the one behind ChatGPT, require extensive training before they can be used. Researchers are increasingly realising that the quality of the training data plays a significant role in the performance of a language model, Nguyen explains. “When a model learns certain stereotypes or performs poorly on specific tasks, it is often due to the data it was trained on.”

Currently, scientists still know little about exactly how training data shapes the resulting behaviour of a language model. Nguyen believes that data diversity is a key predictor of a model’s behaviour, but how to measure this diversity is still an open question, she says. “What characteristics should the data have, and how do these traits influence the model? For example, do the texts cover different topics, dialects, or genres?” Nguyen asks. “Sometimes, a smaller, carefully curated dataset can train a model better than a vast amount of data.”

How do you train a language model?

Training a language model involves several stages. First, the model is pretrained on a large dataset, often consisting of internet data, books, or even synthetically generated texts. The model is then fine-tuned with smaller, more specific datasets. For example, if you want to train a model to recognise hate speech, you provide it with examples of both hate speech and non-hate speech.
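As a concrete illustration of that second stage, the sketch below fine-tunes a small pretrained model on a two-example toy dataset. The Hugging Face transformers library, the distilbert-base-uncased checkpoint, and the example texts are assumptions made for illustration; the explanation above does not prescribe any particular toolkit.

```python
# A minimal sketch of the fine-tuning stage: start from a pretrained
# model (the first stage) and train it further on a small, task-specific
# labelled dataset (the second stage). Toy data for illustration only.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

texts = ["example of hateful text", "example of neutral text"]
labels = [1, 0]  # 1 = hate speech, 0 = not hate speech

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps the tokenised texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```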

Measuring data diversity

In her DataDivers project, Nguyen is developing methods to accurately measure the diversity of datasets. Data diversity is a broad concept, and there has been little research on it within NLP. “Diversity is not just about representation,” Nguyen notes, “but also about factors like variation in topics, writing styles, and grammatical structures.” She plans to draw on methods from other fields, such as ecology and the social sciences, where many techniques for measuring diversity have already been developed.
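Ecology offers one such family of measures. As a hedged illustration (the project’s own measures are still to be developed), the sketch below computes two classic ecological indices, Shannon entropy and Simpson’s index, over the word distribution of a tiny invented corpus:

```python
# Two diversity indices borrowed from ecology, applied to the word
# distribution of a toy corpus. The corpus and the choice of unit
# (word types) are assumptions made for illustration only.
from collections import Counter
import math

corpus = [
    "the match ended in a draw",
    "the striker scored twice",
    "parliament passed the new bill",
]
counts = Counter(word for sentence in corpus for word in sentence.split())
total = sum(counts.values())
proportions = [c / total for c in counts.values()]

# Shannon entropy: higher means tokens are spread more evenly over types.
shannon = -sum(p * math.log(p) for p in proportions)

# Simpson's index D: probability that two randomly drawn tokens are the
# same type; 1 - D is commonly reported as a diversity score.
simpson = sum(p * p for p in proportions)

print(f"Shannon entropy: {shannon:.3f}")
print(f"Simpson diversity (1 - D): {1 - simpson:.3f}")
```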

Nguyen will then conduct experiments to investigate how dataset diversity impacts language model performance. By training models on datasets with varying levels of diversity, she will explore how data diversity affects a model’s accuracy, its ability to handle text from different demographic groups, and how quickly it learns new tasks. The aim is to identify which types of diversity matter most.
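The overall shape of such an experiment can be sketched in a few lines. The snippet below is a self-contained toy version, assuming scikit-learn and two invented four-sentence training sets; it is far too small to show real effects and only illustrates the procedure of holding the model fixed while varying training-set diversity:

```python
# Train the same classifier on a low-diversity and a higher-diversity
# toy corpus, then compare accuracy on a shared mixed-domain test set.
# All texts and labels (1 = positive, 0 = negative) are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sets = {
    "low diversity (sports only)": (
        ["great goal", "awful match", "brilliant save", "terrible referee"],
        [1, 0, 1, 0],
    ),
    "higher diversity (sports + film)": (
        ["great goal", "awful match", "brilliant film", "terrible plot"],
        [1, 0, 1, 0],
    ),
}
test_texts = ["brilliant match", "terrible film"]
test_labels = [1, 0]

for name, (texts, labels) in train_sets.items():
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    acc = model.score(test_texts, test_labels)
    print(f"{name}: test accuracy = {acc:.2f}")
```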

When a model learns certain stereotypes or performs poorly on specific tasks, it is often due to the data it was trained on

In the final phase of the project, Nguyen aims to adjust how language models are trained, ensuring that diversity is integrated from the start. This could be achieved by assigning more weight to certain data points within a dataset. Additionally, the way data is collected could be modified to enhance the diversity of the dataset.
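As a purely illustrative sketch of what “assigning more weight to certain data points” could look like in practice (the project’s actual method is still to be designed), the snippet below up-weights an under-represented group of data points using PyTorch’s WeightedRandomSampler:

```python
# Up-weight a rare group during training via weighted sampling.
# The toy data and the group labels are hypothetical.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

data = torch.arange(10).float().unsqueeze(1)   # ten toy data points
groups = torch.tensor([0] * 8 + [1] * 2)       # two points from a rare group

# Weight each point inversely to its group's size, so the rare group
# is drawn about as often as the common one.
group_sizes = torch.bincount(groups).float()
weights = 1.0 / group_sizes[groups]

sampler = WeightedRandomSampler(weights, num_samples=len(data),
                                replacement=True)
loader = DataLoader(TensorDataset(data, groups), batch_size=4,
                    sampler=sampler)

for batch_data, batch_groups in loader:
    # The rare group now appears in roughly half of the sampled points.
    print(batch_groups.tolist())
```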

Hate speech

Data diversity may be particularly important for language models designed to detect hate speech on social media platforms. These models sometimes become overly sensitive to patterns that should not matter: Nguyen explains that a model can inadvertently associate particular names or topics with hate speech, even when those features are irrelevant to the task. She expects that more diverse training data could help avoid these kinds of mistakes.

Nguyen also believes that data diversity can improve a model’s ability to perform well on data that differs from its training data. For instance, language models tend to perform well on standard language, but struggle with dialects. Models that are exposed to more diverse datasets can generalise better. “It’s just like learning a new language: if you only read news articles or sports reports, your vocabulary will be limited, and you’ll have less understanding of the language in other areas”, Nguyen explains.

The learning process of a language model can be compared to someone learning a new language: if you only read news articles or sports reports, you’ll have a limited vocabulary and less understanding of the language than if you were to read a variety of texts

About Dong

After completing her master’s in Language Technologies at Carnegie Mellon University, Dong specialised in NLP. At Utrecht University, she works within the Natural Language Processing group, where she also leads her own lab: the NLP and Society Lab. How can social factors of language use be modelled and analysed? Together with her PhD students and UU students, she analyses online conversations, investigating whether there are signals that predict whether a conversation will go well or poorly. She also looks into questions such as: how do we measure whether models are ‘fair’? What stereotypes do they learn? Can models recognise and understand gender-neutral pronouns? How do we evaluate language models? How can we smartly select data to train better models? Dong Nguyen is also a member of the Utrecht Young Academy.