The Noisy Label Filter Procedure

Systematic reviews are essential for synthesizing evidence, but their reproducibility is often hindered by the lack of access to fully labeled datasets—particularly the labeling decisions for excluded studies. While PRISMA guidelines require the publication of search strategies and final inclusions, they do not mandate sharing the full set of screened records and their corresponding labels. This makes it difficult to run simulation studies with active learning models, which rely on complete and accurate labels. To address this challenge, the ASReview team developed the Noisy Label Filter (NLF) procedure—a method to approximate the original dataset’s labeling as closely as possible when only partial data are available.
Progress
The Noisy Label Filter (NLF) procedure is designed to reduce the risk of mislabeling in reconstructed datasets by identifying potentially relevant records among those whose status is uncertain. It helps ensure a reliable simulation setup with minimal screening effort. In our application, we replicated the dataset from a 2018 systematic review on psychological treatment for borderline personality disorder. The reconstructed search yielded 1,543 records, or 1,053 after deduplication; we verified that all 20 known relevant studies were present and labeled the remaining 1,033 records as "noisy."
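As a minimal sketch of this preparation step, the snippet below marks the 20 known relevant records and leaves every other record unlabeled ("noisy"). The file names and the `doi`/`included` columns are assumptions for illustration; the actual reconstructed export may use a different layout, and how unlabeled records should be encoded depends on the screening tool.

```python
import pandas as pd

# Hypothetical file names and column layout; adjust to the actual
# reconstructed export of the search.
records = pd.read_csv("reconstructed_search_deduplicated.csv")   # 1,053 records
known = pd.read_csv("known_relevant_inclusions.csv")             # 20 records

# Match the known inclusions, here naively on DOI; in practice a
# title/abstract match or a manual check may be needed as well.
is_known_relevant = records["doi"].isin(known["doi"].dropna())

# Known inclusions get label 1; all other records stay unlabeled ("noisy").
# They are simply left empty here; check how your screening tool expects
# unlabeled records to be encoded.
records["included"] = pd.NA
records.loc[is_known_relevant, "included"] = 1

assert is_known_relevant.sum() == len(known), "a known relevant record is missing"
records.to_csv("nlf_input.csv", index=False)
```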
To assess and refine these noisy labels, we applied the NLF procedure in four steps:
- Prepare your dataset by reconstructing the original search and verifying that the known relevant records are included and correctly labeled.
- Screen the noisy records using ASReview, where an expert labels the records the active learning model predicts to be most likely relevant, continuing until a predefined stopping rule is met (e.g., 50 irrelevant records in a row); a sketch of such a stopping-rule check follows this list.
- Assign final labels based on this process. If no additional relevant studies are found, the remaining records can be confidently marked as irrelevant.
- Run a simulation study using ASReview Makita to compare the performance of various active learning models.
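The stopping rule from the screening step can also be checked programmatically. The sketch below assumes a hypothetical export of the screening decisions in the order they were made (the file and column names are assumptions, not a fixed ASReview format) and reports whether the last 50 decisions were all irrelevant.

```python
import pandas as pd

# Hypothetical export of screening decisions in the order they were made.
decisions = pd.read_csv("screening_decisions.csv")
labels = decisions["label"].tolist()   # 1 = relevant, 0 = irrelevant

STOP_AFTER_N_IRRELEVANT = 50  # the predefined stopping rule from step 2


def stopping_rule_met(labels, n=STOP_AFTER_N_IRRELEVANT):
    """True when the last n screening decisions were all irrelevant."""
    return len(labels) >= n and all(label == 0 for label in labels[-n:])


if stopping_rule_met(labels):
    print(f"Stopping rule met after screening {len(labels)} records.")
else:
    print(f"Keep screening: fewer than {STOP_AFTER_N_IRRELEVANT} "
          "consecutive irrelevant records at the end of the log.")
```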
In our case study, one of the original authors screened 111 records in ASReview and identified four potentially relevant records. After team discussion, these four were deemed irrelevant, consistent with the original exclusion decisions. This confirmed that no relevant studies were hiding among the noisy labels and allowed us to finalize the dataset with confidence. We then ran simulations comparing models such as Naïve Bayes, Random Forest, and SVM, and found that Naïve Bayes retrieved the relevant studies with the least screening effort.
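ASReview Makita generates the scripts that run such simulations; as a rough stand-in for what they measure, the sketch below mimics the comparison with scikit-learn, ranking records by a classifier trained on TF-IDF features and counting how many records must be screened before every relevant record is found. The file name, column names, and prior-knowledge choices are illustrative assumptions, not the ASReview implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical fully labeled dataset produced after the NLF procedure;
# the "abstract" and "included" columns are assumptions.
data = pd.read_csv("nlf_final_labels.csv")
X = TfidfVectorizer(stop_words="english").fit_transform(data["abstract"].fillna(""))
y = data["included"].to_numpy()


def rank_scores(clf, X):
    """Score unscreened records; higher means more likely relevant."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(X)
    return clf.predict_proba(X)[:, 1]


def records_screened_until_all_found(clf, X, y, seed=0):
    """Toy active-learning loop: start from one relevant and one irrelevant
    prior record, repeatedly retrain, screen the highest-ranked record next,
    and return how many records were screened when the last relevant record
    was found."""
    rng = np.random.default_rng(seed)
    relevant = np.flatnonzero(y == 1)
    irrelevant = np.flatnonzero(y == 0)
    screened = [int(rng.choice(relevant)), int(rng.choice(irrelevant))]
    unscreened = [i for i in range(len(y)) if i not in screened]
    while set(relevant) - set(screened):
        clf.fit(X[screened], y[screened])
        scores = rank_scores(clf, X[unscreened])
        nxt = unscreened[int(np.argmax(scores))]
        screened.append(nxt)
        unscreened.remove(nxt)
    return len(screened)


models = {
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(random_state=0),
}
for name, clf in models.items():
    print(name, records_screened_until_all_found(clf, X, y))
```

In practice one would run the simulations through the ASReview tooling rather than a hand-rolled loop like this; the sketch only conveys the idea of comparing classifiers on the screening effort needed to retrieve all relevant records.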
The NLF procedure also serves as a rapid check by a second screener: it is a diagnostic tool for assessing whether any relevant studies were missed, essentially a check for false negatives. This validation step supports higher confidence in the integrity of the reconstructed dataset.
Funding
This project and its resulting output are funded by the Dutch Research Council (NWO) under grant number 406.22.GO.048.
People involved
- Rutger Neeleman - Student
- Matthijs Oud - Advisor
- Felix Weijdema - Advisor
- Cathalijn Leenaars - Advisor
- Rens van de Schoot - Advisor