Label fusion and training methods for reliable representation of inter-rater uncertainty

Andreanne Lemay 1,2 (ORCID 0000-0001-8581-2929), Charley Gros 1,2 (ORCID 0000-0003-4318-0024), Enamundram Naga Karthik 3,2 (ORCID 0000-0003-2940-5514), Julien Cohen-Adad 1,2 (ORCID 0000-0003-3662-9532)
1: Electrical Engineering, Polytechnique Montreal, 2: Mila, Quebec AI Institute, Montreal, QC, Canada, 3: Biomedical Engineering, NeuroPoly lab
Publication date: 2023/01/18


Medical tasks are prone to inter-rater variability due to multiple factors such as image quality, professional experience and training, or guideline clarity. Training deep learning networks with annotations from multiple raters is a common practice that mitigates the model's bias towards a single expert. Reliable models that generate calibrated outputs and reflect the inter-rater disagreement are key to the integration of artificial intelligence in clinical practice. Various methods exist to take into account different expert labels. We focus on comparing three label fusion methods: STAPLE, averaging of the raters' segmentations, and random sampling of each rater's segmentation during training. Each label fusion method is studied using both the conventional training framework and the recently published SoftSeg framework, which limits information loss by treating the segmentation task as a regression. Our results, across 10 data splits on two public datasets (spinal cord gray matter challenge, and multiple sclerosis brain lesion segmentation), indicate that SoftSeg models, regardless of the ground truth fusion method, had better calibration and better preservation of the inter-rater variability than their conventional counterparts, without impacting the segmentation performance. Conventional models, i.e., trained with a Dice loss, with binary inputs, and with a sigmoid/softmax final activation, were overconfident and underestimated the uncertainty associated with inter-rater variability. Conversely, fusing labels by averaging with the SoftSeg framework led to underconfident outputs and overestimation of the rater disagreement. In terms of segmentation performance, the best label fusion method differed between the two datasets studied, indicating that this parameter might be task-dependent. However, SoftSeg had segmentation performance systematically superior or equal to that of the conventionally trained models and had the best calibration and preservation of the inter-rater variability. SoftSeg has a low computational cost and performed similarly, in terms of uncertainty, to ensembles, which require multiple models and forward passes. Our code is available at
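To make two of the three label fusion methods named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes each rater provides a binary mask of identical shape stored as a NumPy array; the function names `fuse_average` and `fuse_random_sampling` and the toy masks are hypothetical and chosen for illustration only. STAPLE is not sketched here because it requires an iterative expectation-maximization estimate of each rater's performance.

```python
import numpy as np


def fuse_average(rater_masks):
    """Voxel-wise mean of the raters' binary masks.

    Returns a soft label in [0, 1]: the fraction of raters who marked
    each voxel as foreground. (Hypothetical helper, for illustration.)
    """
    stacked = np.stack(rater_masks).astype(np.float32)
    return stacked.mean(axis=0)


def fuse_random_sampling(rater_masks, rng):
    """Return one rater's mask chosen uniformly at random.

    Intended to be called each time a training sample is drawn, so the
    network sees different raters' labels over the course of training.
    (Hypothetical helper, for illustration.)
    """
    index = int(rng.integers(len(rater_masks)))
    return rater_masks[index]


# Toy 2x3 masks from three raters (made-up values, assumption only).
rater_a = np.array([[0, 1, 1], [0, 1, 0]])
rater_b = np.array([[0, 1, 0], [0, 1, 0]])
rater_c = np.array([[0, 1, 1], [1, 1, 0]])
masks = [rater_a, rater_b, rater_c]

soft_label = fuse_average(masks)

rng = np.random.default_rng(0)
sampled_label = fuse_random_sampling(masks, rng)
```

In this sketch, voxels where all raters agree receive a value of exactly 0 or 1 in `soft_label`, while voxels with disagreement receive an intermediate value, which is the kind of soft target a regression-style framework can learn from directly.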


Inter-rater variability · Calibration · Segmentation · Deep learning · Soft training · Label fusion

BibTeX

@article{melba:2022:031:lemay,
  title = "Label fusion and training methods for reliable representation of inter-rater uncertainty",
  author = "Lemay, Andreanne and Gros, Charley and Naga Karthik, Enamundram and Cohen-Adad, Julien",
  journal = "Machine Learning for Biomedical Imaging",
  volume = "1",
  issue = "January 2023 issue",
  year = "2022",
  pages = "1--27",
  issn = "2766-905X",
  doi = "",
  url = ""
}

RIS

TY  - JOUR
AU  - Lemay, Andreanne
AU  - Gros, Charley
AU  - Naga Karthik, Enamundram
AU  - Cohen-Adad, Julien
PY  - 2022
TI  - Label fusion and training methods for reliable representation of inter-rater uncertainty
T2  - Machine Learning for Biomedical Imaging
VL  - 1
IS  - January 2023 issue
SP  - 1
EP  - 27
SN  - 2766-905X
DO  -
UR  -
ER  -
