Most leading chatbots routinely exaggerate science findings

ChatGPT is often asked for summaries, but how accurate are these really?

AI chatbots ChatGPT and DeepSeek open on a phone. Photo: Solen Feyissa, via Unsplash

It seems so convenient: when you are short of time, asking ChatGPT or another chatbot to summarise a scientific paper to get the gist of it quickly. But in up to 73 per cent of cases, these large language models (LLMs) produce inaccurate conclusions, a new study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University and University of Cambridge) finds.

Almost 5,000 LLM-generated summaries analysed

The researchers tested ten of the most prominent LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA. “We entered abstracts and articles from top science journals, such as Nature, Science, and The Lancet,” says Peters, “and asked the models to summarise them. Our key question: how accurate are the summaries that the models generate?”

“Over a year, we collected 4,900 summaries. When we analysed them, we found that six of the ten models systematically exaggerated claims they found in the original texts. Often the differences were subtle. But nuances can be of great importance in making sense of scientific findings.”

For instance, LLMs often changed cautious, past-tense claims into more sweeping, present-tense versions: ‘The treatment was effective in this study’ became ‘The treatment is effective’. “Changes like this can mislead readers,” Chin-Yee warns. “They can give the impression that the results are more widely applicable than they really are.”

When asked for more accuracy, the chatbots exaggerated even more often.

The researchers also directly compared human-written with LLM-generated summaries of the same texts. Chatbots were nearly five times more likely to produce broad generalisations than their human counterparts.

Accuracy prompts backfired

Peters and Chin-Yee did try to get the LLMs to generate accurate summaries. They, for instance, specifically asked the chatbots to avoid inaccuracies. “But strikingly, the models then produced exaggerated conclusions even more often,” Peters says. “They were nearly twice as likely to produce overgeneralised conclusions.”

“This effect is concerning. Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they will get a more reliable summary. Our findings suggest the exact opposite.”

Newer AI models, like ChatGPT-4o and DeepSeek, performed even worse.

Why are these exaggerations happening?

“LLMs may inherit the tendency to make broader claims from the texts they are trained on,” Chin-Yee explains, referring to findings from previous studies. “Human experts also tend to draw more general conclusions, from Western samples to all people, for example.”

“Yet many of the original articles did not contain problematic generalisations, while the summaries suddenly did,” Peters adds. “Worse still, newer AI models, like ChatGPT-4o and DeepSeek, performed worse overall than older ones.”

Another reason for LLMs’ generalisation bias might lie in people’s interaction with the chatbots. “The human users involved in the models’ fine-tuning may prefer LLM responses that sound helpful and widely applicable. In this way, the models might learn to favour such answers – even at the expense of accuracy.”

There is a real risk that AI-generated science summaries could spread misinformation.

Reducing the risks

“If we want AI to support science literacy rather than undermine it, we need more vigilance and testing of LLMs in science communication contexts,” Peters stresses. “These tools are already being widely used for science summarisation, so their outputs can shape public science understanding – accurately or misleadingly. Without proper oversight, there is a real risk that AI-generated science summaries could spread misinformation, or present uncertain science as settled fact.”

If you still wish to use a chatbot to summarise a text, Peters and Chin-Yee recommend using LLMs such as Claude, which had the highest generalisation accuracy. It may also help to use prompts that enforce indirect, past-tense reporting, and, if you are a programmer, to set chatbots to a lower ‘temperature’ (the parameter that controls a chatbot’s ‘creativity’).
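For programmers, those two recommendations can be combined in an API call rather than a chat window. The sketch below is a minimal illustration only, assuming the OpenAI Python client and a placeholder abstract; the model name, prompt wording, and temperature value are example choices for demonstration, not settings tested in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

abstract = "..."  # paste the abstract or article text to be summarised here

response = client.chat.completions.create(
    model="gpt-4o",      # example model name; any chat-capable model would do
    temperature=0.2,     # lower temperature: less 'creative', more conservative wording
    messages=[
        {
            "role": "system",
            "content": (
                "Summarise the following scientific abstract. "
                "Report the findings indirectly and in the past tense "
                "(e.g. 'the authors reported that the treatment was effective in this study') "
                "and do not generalise beyond the sample that was actually studied."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)

print(response.choices[0].message.content)
```

Even with a low temperature and a cautious prompt, the study’s findings suggest the output should still be checked against the original paper before it is passed on.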