Dr. Filip Moons

Buys Ballotgebouw
Princetonplein 5
Room 3.65
3584 CC Utrecht

Assistant Professor
Mathematics Education
+32 494 03 43 83
f.v.l.moons@uu.nl

How to Measure Inter-Rater Reliability When Subjects Fall into Multiple Categories

Cohen’s and Fleiss’ kappa are household names for measuring inter-rater agreement, but both assume that each rater picks exactly one category per subject. That’s a serious limitation in many real-world scenarios: think of psychiatric diagnoses (patients with several disorders), or coding qualitative data where the same segment receives multiple labels.
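
For reference, classic (single-category) Fleiss’ kappa for N subjects, n raters per subject and k categories, with n_ij the number of raters who assign subject i to category j, is computed as follows (standard textbook notation, not necessarily the paper’s):

  P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij}-1), \qquad \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i,
  p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}, \qquad \bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.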

Our latest paper (see below) introduces a generalized Fleiss’ kappa that:

  • Allows for multiple category assignments per subject by each rater,
  • Supports weighted and hierarchical categories (primary/sub-diagnoses, etc.),
  • Handles missing data and variable numbers of raters,
  • Reduces to classic Fleiss’ kappa when each rater assigns a single category (see the sketch after this list),
  • Includes worked-out examples, with R and Excel resources available.
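
To make that single-category special case concrete, here is a minimal Python sketch of classic Fleiss’ kappa, the measure the generalization reduces to. The function name and data layout are illustrative only and are not taken from the paper’s R or Excel resources:

import numpy as np

def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n (n >= 2).
    Illustrative sketch; not the paper's implementation."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()  # raters per subject (assumed constant)
    # Per-subject agreement: fraction of rater pairs that agree
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    # Chance agreement from the marginal category proportions
    p_j = counts.sum(axis=0) / counts.sum()
    P_e = (p_j ** 2).sum()
    return (P_i.mean() - P_e) / (1 - P_e)

# Example: 4 subjects, 3 raters, 2 categories
print(round(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]), 3))  # 0.333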

To cite the paper:
Moons, F. & Vandervieren, E. (2025). Measuring agreement among several raters classifying subjects into one or more (hierarchical) categories: A generalization of Fleiss’ kappa. Behavior Research Methods, 57, 287. https://doi.org/10.3758/s13428-025-02746-8

Download the Excel add-in for applying the generalized Fleiss’ kappa: 

Note: first install the add-in, then open the worksheet.

The data repository on OSF: https://osf.io/q5nft/