‘This software makes privacy-sensitive research data more accessible’.

In this series of interviews, we show what contribution projects can make to FAIR research IT. The research teams of the projects have received a grant from the FAIR Research IT Innovation Fund.

Do you, as a scientist or as a research organization, work with privacy-sensitive data? The software tool metasyn (formerly known as MetaSynth) makes it easy to create synthetic data that safeguards the privacy of the people in the dataset. This lets you publish a stand-in for sensitive data that could otherwise not be shared. It also makes doing research with existing data easier. And what’s more: it contributes to FAIR data and Open Science.

If a researcher expresses excitement about ‘very nice statistical problems’, you know right away: no lack of enthusiasm here. The words come from Erik-Jan van Kesteren, assistant professor in the Human Data Science Group, part of the Department of Methodology and Statistics of the Faculty of Social & Behavioural Sciences. Some years ago, the statistician was asked to set up the Social Data Science Team (SoDa), part of ODISSEI (see textbox). ‘We help social scientists with things like data science and computational research, such as using algorithms and models to understand patterns. So the research I do is really focused on statistical problems from daily research practice.’

Tackling privacy-sensitive data issues

Erik-Jan van Kesteren, assistant professor in the Human Data Science Group, part of the Department of Methodology and Statistics of the Faculty of Social & Behavioural Sciences
Erik-Jan van Kesteren, photo by Annemiek van der Kuil, PhotoA

In the SoDa team, Erik-Jan van Kesteren noticed a recurring problem: in privacy-sensitive projects, data are often hard to obtain. ‘Many (social) scientists work with privacy-sensitive data. Think for instance of a psychologist asking people to fill in a questionnaire on their mental health. Or a researcher working with microdata of Statistics Netherlands (CBS). That kind of data often cannot be published due to privacy issues. As a result, others cannot reproduce or check your research. That is at odds with the idea behind Open Science. What’s more: you want to be able to build on existing research.’

What is ODISSEI?

The Social Data Science Team is part of ODISSEI, a national research infrastructure for the Dutch social sciences. Here faculties, data providers, long-term research projects and other organizations join forces. The purpose: making data more readily available to social scientists and helping them make better use of the options that data science has to offer.

Synthetic data

How do we make publishing and (re)using privacy-sensitive data easier? A promising solution, according to Erik-Jan van Kesteren: synthetic data. ‘A kind of imitation data that look like real data and that you can publish. To do so, you use a statistical model that looks at the characteristics of your data, but then produces other data. That is why they are unfit for the final analysis. Look at it as test data or data to practise with.’

Useful, but not without its pitfalls. ‘First of all, there are privacy risks involved in a lot of existing programs. If you extensively train the model using your data, the synthetic data look a lot like your real data. You have a lot of analytical validity then: in a manner of speaking, you could reproduce the final results. However, that also means that individual records can be traced back.’ In addition, Erik-Jan van Kesteren noticed problems with the usability of existing synthetic data solutions. ‘You could make a slightly improved model for each project, allowing you to preserve privacy, but that is custom work. The data differ, as does the privacy framework: different rules and agreements apply per project or organization. That is why you prefer a generic solution which is privacy-friendly at the same time.’

Metasyn is a generic solution which is at the same time very privacy-friendly. That is why it is suitable for many researchers and projects.

Privacy first

That is how the idea for metasyn came up. Together with SoDa colleague and Research Engineer Raoul Schram, Erik-Jan van Kesteren developed a proof of concept. Points of departure: synthetic data with privacy guarantees, in which the privacy is automatically included. Erik-Jan van Kesteren on the privacy aspect: ‘Our motto is: as privacy-friendly as possible. That is why metasyn only keeps characteristics on the level of variables. For instance, take a dataset with income and age. Then I can look at the distribution of income, but if I want to do something with age next, I will be using another model. So the relationship between income and age is dropped. That limits the analytical validity, but it guarantees that you cannot predict someone’s income based on age.’
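The per-variable idea described above can be sketched in a few lines of plain Python. This is an illustration only, not metasyn's own code or API: the function names (`fit_column`, `fit_marginals`, `synthesize`) are hypothetical, the example data are made up, and every numeric column is simply assumed to follow a normal distribution, which is a strong simplification.

```python
import random
import statistics


def fit_column(values):
    """Summarise one numeric column by its mean and standard deviation only."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


def fit_marginals(rows):
    """Fit each column separately; nothing about relationships between columns is stored."""
    columns = rows[0].keys()
    return {name: fit_column([row[name] for row in rows]) for name in columns}


def synthesize(model, n, seed=None):
    """Draw every column independently from its own fitted distribution."""
    rng = random.Random(seed)
    return [
        {name: rng.gauss(params["mean"], params["sd"]) for name, params in model.items()}
        for _ in range(n)
    ]


# Made-up "real" data in which income rises perfectly with age.
real = [{"age": age, "income": 1000 * age} for age in range(20, 70)]

model = fit_marginals(real)                  # only per-variable summaries leave this step
synthetic = synthesize(model, n=5, seed=1)   # age and income are drawn independently

for row in synthetic:
    print(round(row["age"]), round(row["income"]))
```

Because `synthesize` draws each column on its own, the link between age and income in the made-up data does not carry over to the synthetic rows. That is the trade-off from the interview: less analytical validity in exchange for stronger privacy.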

Then the second point of departure: the ‘automation’ of privacy. ‘To get that done, we are building two plugins. With the help of the plugins, organisations such as YOUth or national statistics offices can set up rules about what is exported and what is not, in accordance with their privacy guidelines. These choices are implemented in the plugin. Then, as a researcher, you can choose, in a kind of mix and match, which privacy definition you want to follow, aligned with your project. It offers many options and saves time.’

A solution for both publishing and doing research

Who might be especially interested in metasyn? ‘In any case, researchers and data providers who want to make privacy-sensitive data more readily available from an open science perspective. Our vision is that projects such as YOUth make a synthetic version available alongside a data description. As if we leave the door ajar: have a peek at what’s inside. We are also running a pilot with synthetic data together with YOUth.’

Moreover, metasyn is a solution for researchers who want to use existing data. ‘Suppose you want to know if YOUth has interesting data for your research. Formerly you had to have a conversation first, trying to find out which data to link and whether you can measure what you want to measure. Thanks to metasyn you can already run a kind of test version of your research. That is how you discover: can I answer my research question with this data? That makes the process more efficient and easier.’ And thanks to the synthetic data you become familiar with the data. ‘You can write your script, see if it works, and solve problems. You can see the data in front of you. That is so important to researchers: you must be able to get a sense of your data. It makes your research less error-prone.’ And there is a bonus: because you can do all this beforehand, you can make specific requests for the real data. ‘That saves money, because you often pay for the length of time you have access.’

FAIR and Open Science

Metasyn fits in well with the thoughts behind FAIR. ‘It makes privacy-sensitive datasets more accessible, but it also increases interoperability: metasyn allows you to export data to, for instance, Python, R or Excel. And the data become more reusable: with the help of existing data, new research is easier to do. And if you publish your research with a synthetic dataset, others can also reuse your code for another research project or build upon your research.’ As said before, this project fits in with the open science mentality. ‘We are very transparent as it is. You can see clearly what leaves the secured environment: we have a format that you can just open and read. That is important for organisations working with privacy-sensitive data. In this way they can see exactly what’s happening with the data.’

As a researcher you must be able to get a sense of your data. Metasyn makes that possible in the early stages of your research.

A big leap forwards thanks to the grant of the Innovation Fund

Metasyn is one of the projects that received a grant from the FAIR Research IT Innovation Fund (see textbox). For metasyn this grant came at a crucial moment. ‘Before, the project was run on a small scale, but going from proof of concept to actually usable software is a big leap. We can make this leap thanks to the grant.’ It gives Raoul Schram and Erik-Jan van Kesteren the opportunity to develop the software further, for instance by making it suitable for more data types, such as open questions. ‘And so we are building these plugins, for which we get help from other researchers thanks to the grant. We have also taken on a student assistant to help us with the documentation and landing experience, so a visitor to the metasyn page knows right away what it is and what it is used for.’

You can have a look at metasyn, download it and even contribute to it: it is completely open source. ‘We are still a bit in the experimental phase, but as an individual researcher you can already experience how easy it is to create synthetic data. We also have tutorials. So: give it a try and let us know how it goes!’

About the FAIR Research IT Innovation Fund

Utrecht University wants each research team to be well supported in the field of research IT. One of the ways to achieve this is through the FAIR Research IT Innovation Fund. Scientists can receive a grant for projects which, for instance, improve the IT infrastructure of scientific research. Think of projects that provide enough storage capacity for data, or of the development of tools and services that help researchers in their work. FAIR and open science principles are the guidelines when selecting projects. Other researchers must be able to easily and quickly reuse the knowledge and solutions.