Recommendations on the Use of Synthetic Data to Train AI Models

Payal Arora et al.

Cover of 'Recommendations on the Use of Synthetic Data to Train AI Models' (2024)

The United Nations University recently published Recommendations on the Use of Synthetic Data to Train AI Models. The report, the first set of policy guidelines of its kind, was written by an invited group of experts that included computer scientists and AI and social experts, among them Professor of Inclusive AI Cultures Payal Arora.

Training AI models through artificially generated data

Using synthetic or artificially generated data in training Artificial Intelligence (AI) algorithms is a burgeoning practice with significant potential to affect society directly. It can address data scarcity, privacy, and bias issues, but it also raises concerns about data quality, security, and ethical implications. While some systems use only synthetic data, synthetic data is most often used together with real-world data to train AI models.

The recommendations in this document apply to any system in which some synthetic data is used. Synthetic data has the potential to enhance existing data and so allow for more efficient and inclusive practices and policies. However, we cannot assume synthetic data to be automatically better than, or even equivalent to, data from the physical world. There are many risks to using synthetic data, including cybersecurity risks, bias propagation, and increased model error. This document sets out recommendations for the responsible use of synthetic data in AI training.

Besides Arora, the contributing expert group consisted of Fernando Buarque, Yik Chan Chin, Mamello Thinyane, Serge Stinckwich, Eleonore Fournier-Tombs, Tshilidzi Marwala, and Philippe de Wilde.