‘You get much more out of a text corpus with Named Entity Recognition’

In this series of interviews we show what contribution projects can make to FAIR research IT. The research teams of the projects have received a grant from the FAIR Research IT Innovation Fund.

I-Analyzer helps you to easily search and visualize text corpora. In addition, the Named Entity Recognition functionality will soon allow you to search for entities such as place names, persons or organisations quickly and easily. You can also visualize these Named Entities. The TextMiNER project is integrating this functionality into I-Analyzer, which contributes to working FAIR and to Open Science.

Suppose you are researching the theme ‘collaboration’ in English newspapers. You open the I-Analyzer research software and enter your search query for the corpus of The Times. Soon all kinds of data appear on your screen: you see which texts mention collaboration, how often it occurs and when. You scroll through some articles and at a glance you see every place in the text where the word ‘collaboration’ occurs, because each occurrence is marked in colour. What’s more: on a topographical map you can see all place names mentioned in these articles.

I-Analyzer is ideal for researchers who are working with large text corpora. In addition, with Named Entity Recognition you will soon be able to retrieve and visualize even more data quickly and easily.

Until recently this scenario was only a dream for some researchers, but soon that will change. This is thanks to Research Software Engineer Berit Janssen and her colleagues from the Research Software Lab (RSLab) team of the UU Centre for Digital Humanities. “Our team supports Humanities researchers who are using software, for instance as part of large corpus studies on newspaper articles,” Berit Janssen explains. “We advise on the right software for this and also develop software ourselves on request.” The team built I-Analyzer: a tool for searching and visualizing text corpora. It can be used to analyze a large collection of texts from a bird’s eye view (distant reading). “Useful if, for instance, you want to know which search terms a corpus contains, or how the different types of texts are distributed within it. This way you can answer your research question or make a selection that you analyze further with ‘close reading’, zooming in on the details.”

At the moment you will mainly find newspaper corpora in I-Analyzer, but also, for instance, court records and parliamentary data. “Researchers can work directly with these corpora. If they want to work on another corpus, they submit a request to us.” Recently I-Analyzer has become open source, so anyone can use it with their own dataset. This contributes to Open Science. “We are working towards making it easier for researchers to enter corpora without using code. In this way accessibility is increased.”

Named Entity Recognition (NER)

Dr. Berit Janssen, photo by Annemiek van der Kuil, PhotoA

Although researchers gained a lot from the arrival of I-Analyzer, Berit Janssen and her colleagues noticed a need that was not being met. “Researchers regularly asked: can we do something with Named Entity Recognition (NER)? Historians wanted to do research into place names in newspaper articles during the Second World War. Sifting manually through a text corpus takes up a lot of time. NER offers the option to search a corpus automatically for so-called ‘entities’, such as place names, persons, brand names or years. Although this leads to somewhat more mistakes than manual work, it allows you to analyze more texts.” A useful functionality then, but it has to be available of course. This is how the RSLab team came up with the idea of applying to the FAIR Research IT Innovation Fund for the TextMiNER project. “To make research with Named Entities in I-Analyzer possible, we need to enhance the data. That means that we go through all texts in a corpus in one go and label all Named Entities. For that we use a model within the spaCy software. We store the labels and make them visible in I-Analyzer, so researchers can proceed with them.”
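The enhancement step described here (go through every text once, label the entities, store the labels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the TextMiNER implementation: the interview mentions a spaCy model, whereas this sketch swaps in a tiny hand-made lookup table (GAZETTEER) so that it runs without any model installed. The names EntitySpan, tag_entities and enhance_corpus, the labels and the sample sentences are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a trained NER model: a tiny lookup table.
# A real pipeline would use a model's predictions instead.
GAZETTEER = {
    "London": "GPE",
    "Utrecht": "GPE",
    "The Times": "ORG",
}


@dataclass(frozen=True)
class EntitySpan:
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # entity type, e.g. "GPE" for a place name
    text: str    # the entity as it appears in the document


def tag_entities(text):
    """Return the entity spans found in one text, sorted by start offset."""
    spans = []
    for name, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            spans.append(EntitySpan(match.start(), match.end(), label, name))
    return sorted(spans, key=lambda span: span.start)


def enhance_corpus(documents):
    """Label every document in one pass; map document id to entity spans."""
    return {doc_id: tag_entities(text) for doc_id, text in documents.items()}


# Made-up sample corpus for illustration only.
corpus = {
    "doc1": "The Times reported from London yesterday.",
    "doc2": "A delegation travelled from Utrecht to London.",
}
annotations = enhance_corpus(corpus)
```

Storing the labels as character offsets next to the text, rather than inside it, leaves the original documents untouched and still makes it possible to highlight each entity later. Whether I-Analyzer stores its labels this way is not stated in the interview.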

What is Named Entity Recognition?

Named Entity Recognition (NER) is performed by machine learning models. These models are trained on large amounts of data in which humans have annotated the 'named entities'. This allows the models to predict where place names, personal names and other 'entities' are likely to occur in new data.

Benefits for researchers

Thanks to this project, I-Analyzer users will soon get direct access to the NER functionality. Ideal for researchers working with large amounts of textual data, Berit Janssen says. “For instance, a researcher wanted to track the concept of Fairtrade. In that case, NER can be used to analyze which chocolate brands are mentioned in English newspapers over time. Another research project dealt with family companies. As companies are labeled too, you can search for them in the corpus of annual proceedings of Dutch companies.”

In short, this functionality offers many options for researchers. What exactly can they expect when they start working with it? “Suppose you search a corpus in I-Analyzer. Then you see what entities have been found and you can apply statistics to this data. Maybe you are comparing several periods: what occurs more frequently and when? Or you zoom in on the retrieved data: you see the sentence containing the entity, along with its label and content.” The team also wants to represent the Named Entities visually, including histograms and geographical maps on which place names are plotted. In addition, all entities are colour-marked in the texts. It is important to note that you can only search for Named Entities in the enhanced corpora in I-Analyzer, Berit Janssen emphasizes. “Unless you know how to program yourself. The code we are developing is open source, so researchers will soon be able to apply Named Entity Recognition to their text corpora themselves.”
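The period comparison mentioned above (what occurs more frequently, and when?) comes down to counting labelled entities per time span. The sketch below is only an illustration under stated assumptions: the record layout (year, label, entity text), the period boundaries and the sample data are made up, and counts_per_period is a hypothetical helper, not part of I-Analyzer.

```python
from collections import Counter

# Hypothetical records: (publication year, entity label, entity text).
records = [
    (1938, "GPE", "London"),
    (1938, "GPE", "Berlin"),
    (1941, "GPE", "London"),
    (1941, "GPE", "London"),
    (1941, "ORG", "Red Cross"),
    (1947, "GPE", "Berlin"),
]

# Hypothetical periods to compare, as inclusive (first year, last year).
PERIODS = {
    "1930s": (1930, 1939),
    "1940s": (1940, 1949),
}


def counts_per_period(records, periods, label):
    """Count how often each entity with the given label occurs per period."""
    counts = {name: Counter() for name in periods}
    for year, entity_label, text in records:
        if entity_label != label:
            continue  # only count entities of the requested type
        for name, (first, last) in periods.items():
            if first <= year <= last:
                counts[name][text] += 1
    return counts


place_counts = counts_per_period(records, PERIODS, "GPE")
for name, counter in place_counts.items():
    print(name, counter.most_common())
```

Counts of this kind are the sort of input a histogram per period would need.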

Working FAIR

TextMiNER fits in well with a FAIR way of working, Berit Janssen explains. “For instance, NER makes certain aspects of data easier to find. This project also makes this functionality more accessible to researchers in a reusable way. As a team we always try to develop software that can be used more than once and has lasting value. NER is also a method that can be used across disciplines. That is why a project such as TextMiNER is so great: we are systematically creating a solution that helps many researchers.” Moreover, TextMiNER increases ‘interoperability’ because several research techniques are linked, such as filtering, analysing and viewing Named Entities. Does I-Analyzer also allow easy collaboration or the replication of research? “In principle, yes. I-Analyzer can also be accessed by researchers outside Utrecht University. However, both parties then need to have access to the corpus. Unfortunately, not all data within I-Analyzer is in the public domain. This is because the owner of a newspaper corpus is often a publisher, with whom each university has to enter into its own agreement. Enhancements of non-public data are therefore difficult to share. As a researcher you might still be able to indicate which Named Entities have been found without sharing the text itself, but that must be coordinated with the copyright owner.”

Seizing the initiative

Without the Innovation Fund this project could not have got off the ground quickly, according to Berit Janssen. “We have seen for quite some time now that researchers really want this. Some even had some budget, but never enough to enhance large amounts of data, let alone to represent them visually. The Innovation Fund offered a splendid opportunity to finally take this up.” The good thing about this grant, according to her, is that they could apply for it as Research Software Engineers. “Many funds require a researcher. Now we could take our own initiative to create this solution.” The budget pays for the development time. Currently, Berit Janssen is the one mainly working on it. “First I set up a pilot by preparing data. I mainly test how we can save the labels in such a way that they can be retrieved quickly.”

From 2024 onwards several team members will start working on the project. “Then we will apply NER on a large scale and develop the visual representation of Named Entities.” Which corpus will be singled out? “Probably The Times. Newspaper corpora are of interest to many researchers. Moreover, this corpus was the first one in I-Analyzer. After that the debates held in the Dutch parliament might be a good option, since they are publicly available.” In any case, the RSLab keeps exploring how it can help researchers further with smart solutions. “Take for instance the representation of Named Entities on a map. Researchers really wanted that, but we always had to disappoint them. But soon we will be able to say: yes, we can do that!”

Would you like to start working with I-Analyzer? Get direct access here or have a look at the I-Analyzer GitHub pages and the TextMiNER project.

About the FAIR Research IT Innovation Fund

Utrecht University wants each research team to be well supported in the field of research IT. One of the ways to achieve this is through the FAIR Research IT Innovation Fund. Scientists can receive a grant for projects which, for instance, improve the IT infrastructure of scientific research. You may think of projects that enable enough storage capacity for data, or of the development of tools and services that help researchers in their work. FAIR and open science principles are the guidelines when selecting projects. Other researchers must be able to easily and quickly reuse the knowledge and solutions.