Synthetic Data for Robust Stroke Segmentation

Liam Chalcroft1, Ioannis Pappas2, Cathy J. Price1, John Ashburner1
1: Wellcome Centre for Human Neuroimaging, University College London, 2: University of Southern California
Publication date: 2025/08/14
https://doi.org/10.59275/j.melba.2025-f3g6

Abstract

Current deep learning-based approaches to lesion segmentation in neuroimaging often depend on high-resolution images and extensive annotated data, limiting clinical applicability. This paper introduces a novel synthetic data framework tailored for stroke lesion segmentation, expanding the SynthSeg methodology to incorporate lesion-specific augmentations that simulate diverse pathological features. Using a modified nnUNet architecture, our approach trains models with label maps from healthy and stroke datasets, facilitating segmentation across both normal and pathological tissue without reliance on specific sequence-based training. Our method achieves robust out-of-domain performance where conventional approaches fail, with in-domain performance of 48.2% Dice compared to 57.5% for conventional training. Crucially, even with oracle knowledge of the optimal domain adaptation method - an unrealistic scenario in practice - conventionally-trained models cannot match our synthetic approach in out-of-domain settings. The framework demonstrates that synthetic pre-training provides fundamental robustness unachievable through test-time adaptation alone. Our approach reduces reliance on domain-specific training data and helps bridge the gap between research-grade and clinical scans to improve clinical stroke neuroimaging workflows. PyTorch training code and weights are publicly available at https://github.com/liamchalcroft/SynthStroke, along with an SPM toolbox featuring a plug-and-play model at https://github.com/liamchalcroft/SynthStrokeSPM

Keywords

Machine Learning · Image Segmentation · Domain Adaptation

Bibtex
@article{melba:2025:014:chalcroft,
  title = "Synthetic Data for Robust Stroke Segmentation",
  author = "Chalcroft, Liam and Pappas, Ioannis and Price, Cathy J. and Ashburner, John",
  journal = "Machine Learning for Biomedical Imaging",
  volume = "3",
  issue = "August 2025 issue",
  year = "2025",
  pages = "317--346",
  issn = "2766-905X",
  doi = "https://doi.org/10.59275/j.melba.2025-f3g6",
  url = "https://melba-journal.org/2025:014"
}

RIS
TY - JOUR
AU - Chalcroft, Liam
AU - Pappas, Ioannis
AU - Price, Cathy J.
AU - Ashburner, John
PY - 2025
TI - Synthetic Data for Robust Stroke Segmentation
T2 - Machine Learning for Biomedical Imaging
VL - 3
IS - August 2025 issue
SP - 317
EP - 346
SN - 2766-905X
DO - https://doi.org/10.59275/j.melba.2025-f3g6
UR - https://melba-journal.org/2025:014
ER -


Disclaimer: the following html version has been automatically generated and the PDF remains the reference version. Feedback can be sent directly to publishing-editor@melba-journal.org

1 Introduction

Semantic segmentation is a critical component of neuroimaging pipelines, enabling precise quantification of anatomical structures and lesions for applications like tracking disease progression and planning treatments. In research settings, segmentation labels are typically derived from high-quality, standardised structural scans (e.g., MPRAGE) that benefit from consistent field-of-view, spacing, orientation, and minimal artifacts. In contrast, clinical scans often exhibit significant variability in these factors, which can severely impact deep learning model performance. Consequently, models trained on homogeneous research-grade data may not generalise well to the diverse, lower-quality images encountered in clinical practice.

Both traditional probabilistic methods (Ashburner and Friston, 2005) and modern deep discriminative methods (Isensee et al., 2020) require prior information to be provided for a given sequence: for a traditional method this may be an atlas or template, and for modern methods it takes the form of training data. Atlas-based methods build a template of the anatomical structure of the brain, which may be deformed into alignment with a new subject to assign voxel-wise anatomical classes. This has been shown to be robust for delineating healthy structure even with shifts in contrast (Puonti et al., 2016). However, it is non-trivial to include classes of pathology such as stroke within such a model, owing to their inherent heterogeneity in location and geometric properties. In the context of generative models, lesions may be handled through anomaly detection, as demonstrated in Seghier et al. (2008). That method does not, however, attempt to label the pathology directly and (by design) will label physiological changes such as ventricular enlargement in addition to the responsible infarct.

Deep discriminative models trained using supervised learning have been able to reach human-level performance when tested in-domain on large datasets for a variety of brain pathologies and imaging modalities (Baid et al., 2021; de la Rosa et al., 2024). There is still however a significant gap when trying to translate these models to clinical data, where each hospital is likely to vary both in the scanning equipment used and the choice of imaging sequences (Nguyen et al., 2024). This poses a significant challenge to the adoption of deep learning for automating the labelling of clinical data, which could greatly help to accelerate the translation of modern research in stroke prognosis (Loughnan et al., 2019).

To extend such methods to the open-ended domain of clinical scans, models often need to perform on sequences for which no training data may be available. To this end, domain randomisation via synthetic data has been shown to give impressive results for healthy brain parcellation in SynthSeg (Billot et al., 2023). In this method, a set of ground truth healthy tissues are used to generate synthetic images, under the assumption that each tissue class’ intensity distribution should roughly follow a Gaussian. By assigning random Gaussian distributions to each class, a deep learning model can learn to extract shape information for parcellation in a way that is invariant to the input image’s relative tissue contrast, hence allowing the model to be used on any sequence at test-time, without training data or prior knowledge of the sequence. This method of training with synthetic data has since been extended to tasks such as image registration (Hoffmann et al., 2022, 2023, 2024), image super-resolution (Iglesias et al., 2021, 2023), surface estimation (Gopinath et al., 2023, 2024a) and vascular segmentation (Chollet et al., 2024). A comprehensive overview is available in Gopinath et al. (2024b).

An additional benefit of this method is that the ’forward model’ of creating an MRI (or CT) image from tissues of different physical properties is a perfect 1:1 mapping to the ’inverse model’ of labelling the tissues (i.e. segmentation) from the acquired image. In structures that exhibit a large amount of inter-rater variability, this is likely to help prevent a model from imitating under- or over-segmentation from imperfect ground truths: the images are generated from the corresponding segmentation labels, so the labels are always consistent with the images being segmented.

Work prior to SynthSeg demonstrated the potential of encoding anatomical priors in a neural network through pre-training with unpaired parcellation labels (Dalca et al., 2018). Such methods face significantly larger challenges when applied to the heterogeneous shape and spatial distribution of lesions. In healthy parcellation, anatomical structures have consistent positions across individuals (e.g., the brainstem reliably appears in the same region of the brain), a regularity that has motivated atlas-based approaches to parcellation. In contrast, lesions are highly variable across individuals - not only in number and size but also in their spatial distribution. Unlike anatomical structures, lesions cannot be reliably mapped to a specific location or shape within an atlas. Although the exact site of lesion initiation is often influenced by the brain’s vascular architecture - meaning that certain regions are statistically more susceptible to stroke due to the location of large blood vessels - the resulting lesion’s size, shape, and spread are highly variable. Multiple sclerosis serves as an exception, with lesions that are somewhat predictable in their white matter localisation (Lassmann, 2018), enabling modelling through synthetic deep learning frameworks (Billot et al., 2021; Laso et al., 2024) and traditional probabilistic models (Cerri et al., 2021). More recently, Liu et al. (2024) have shown promising results by training a SynthSeg-like model on lesion labels from various pathologies, providing a foundation for fine-tuning on multiple downstream datasets.

Robust open-domain stroke segmentation remains an unsolved challenge. Most domain-specific frameworks for stroke lesion segmentation are targeted towards lesion-pasting (Zhang et al., 2021; Dai et al., 2022; Basaran et al., 2023), aiming to augment the anatomical variety without making any attempt to augment the variety in image contrasts. Likewise, label-conditioned generative models such as TumorGAN (Li et al., 2020) can similarly only generate new lesioned brains within the learned distribution of image intensities. None of these works approach the task of robustness to shifts in image appearance, instead focusing on shape-related augmentation.

In our work, we extend the SynthSeg framework to the task of stroke lesion segmentation via a novel lesion-pasting method that better simulates variety in lesion appearance. Our hybrid approach trades a statistically significant 9.3 percentage-point reduction in median in-domain Dice (57.5% vs 48.2%, p<<0.001) for improved out-of-domain robustness. Crucially, we demonstrate that even with oracle knowledge of the optimal domain adaptation method, conventional training cannot match our synthetic approach in out-of-domain scenarios. We validate this on a comprehensive range of lesion datasets with a wide distribution of image characteristics and lesion physiology. To assist in widespread evaluation of this framework, we release PyTorch training code and weights, together with a MATLAB toolbox for SPM to reduce the barrier to clinical adoption.

2 Methods

Terminology and Notation

For clarity, we define the key terms used throughout this work:

  • TTA (Test-Time Augmentation): A procedure at inference where multiple augmented versions (here generated by flips) of an input are processed and their predictions averaged to improve robustness.

  • DA (Domain Adaptation): Techniques applied at test time to adapt a trained model to new data distributions.

  • Oracle DA: The hypothetical best-case scenario where the optimal DA method is known a priori for each dataset/modality combination.

2.1 Synthetic Data Generation Framework

Rationale.

Our goal is to create a large, diverse and perfectly labelled training set without the labour of voxel-wise annotation. We therefore generate paired image-label volumes by composing healthy tissue maps with realistically shaped stroke lesions, followed by intensity synthesis and heavy image-quality augmentation (Fig. 2).

(i) Healthy-tissue label bank.

Instead of the 100+ FreeSurfer classes used by SynthSeg, we adopt the nine posterior tissue maps produced by MultiBrain (Brudfors et al., 2020). This reduces memory usage, speeds up sampling and still retains the GM/WM/CSF boundaries that matter for lesion realism (Fig. 1).

Figure 1: Sample generated images using different labels for a single subject. (a) FreeSurfer anatomical labels (Puonti et al., 2016). (b) MultiBrain tissue labels (Brudfors et al., 2020). (c) MultiBrain tissue labels masked to simulate skull-stripping.

(ii) Lesion Copy-Paste (Lesion-CP).

We extend Soft-CP (Dai et al., 2022) with random dilate/erode ’feathering’ and a spatially varying bias-field multiplier (MONAI Random Bias Field (Cardoso et al., 2022)) to mimic penumbral intensity fall-off (Middleton et al., 2024).

(iii) Intensity sampling.

Each tissue class is assigned $\mu \sim \mathcal{U}(0,255)$, $\sigma \sim \mathcal{U}(0,16)$ and Gaussian blur $\text{FWHM} \sim \mathcal{U}(0,2)$. For stroke lesions we modulate the copy-pasted mask with the bias field to create intra-lesion heterogeneity. We sample from a single Gaussian distribution, with the implications of this choice examined via Hartigan’s dip-test in Appendix A.
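The per-class intensity sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the function names `sample_class_intensities` and `synthesise_voxels` and the toy 1D label map are hypothetical, and the blur FWHM is only sampled here, not applied.

```python
import random

def sample_class_intensities(n_classes, rng):
    """Draw one (mu, sigma, fwhm) triple per tissue class.

    Ranges follow the text: mu ~ U(0, 255), sigma ~ U(0, 16), FWHM ~ U(0, 2).
    """
    return [
        {
            "mu": rng.uniform(0.0, 255.0),
            "sigma": rng.uniform(0.0, 16.0),
            "fwhm": rng.uniform(0.0, 2.0),
        }
        for _ in range(n_classes)
    ]

def synthesise_voxels(labels, params, rng):
    """Replace each label by a draw from that class's Gaussian."""
    return [rng.gauss(params[k]["mu"], params[k]["sigma"]) for k in labels]

rng = random.Random(0)
params = sample_class_intensities(n_classes=3, rng=rng)
labels = [0, 0, 1, 2, 2, 1]  # toy 1D "label map" (hypothetical)
image = synthesise_voxels(labels, params, rng)
```

Because the class means are redrawn for every sample, the relative contrast between tissues changes from one synthetic image to the next, which is what pushes the network towards shape rather than intensity cues.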

(iv) Image-quality augmentations.

Table 1 summarises every random transform (bias field, affine, elastic, skull-strip imperfection, noise, resolution, motion, contrast, etc.).

Future work will explore mixture-of-Gaussians modelling (Ashburner and Friston, 2005) to represent lesions that are simultaneously hyper- and hypo-intense.

Table 1: Parameter ranges for every image-quality augmentation used during synthetic training. indicates augmentations applied only to synthetic data and not to real ATLAS samples.

| Category | Transform | Sampling range / notes |
| --- | --- | --- |
| Bias field | Multiplicative bias | Control points $\mathcal{U}(2,7)$, strength $\mathcal{U}(0,0.5)$ |
| Affine | Rotation | $\mathcal{U}(-15^{\circ},15^{\circ})$ on each axis |
| | Shear | $\mathcal{U}(0,0.012)$ |
| | Zoom | $\mathcal{U}(0.85,1.15)$ |
| Elastic | Deformation grid | Control points $\mathcal{U}(0,10)$, max disp. $\mathcal{U}(0,0.05)$ |
| Skull-strip flaw | Dilation | $p=0.3$, radius = 2 vox. |
| | Erosion | $p=0.3$, radius = 4 vox. |
| Noise | Gaussian (SNR) | SNR $\mathcal{U}(0,10)$, smoothed by $g$-factor $\mathcal{U}(2,5)$ |
| Resolution | Slice anisotropy | Thk. factor $\mathcal{U}(1,8)$ (base res. 1 mm$^{3}$) |
| Contrast | Gamma | $\gamma=10^{\mathcal{N}(0,0.6)}$ |
| Motion blur | PSF width (FWHM) | $\mathcal{U}(0,3)$ vox. |
| Flip | Mirror on each axis | $p=0.8$ |
Figure 2: Schematic overview of the data generation process. Lesions are sampled from a template-normalised bank of lesion binary masks, and healthy tissue maps are sampled from a template-normalised bank of MultiBrain segmentations. Pasting of lesions onto healthy tissue maps is performed using a spatially varying lesion intensity to simulate penumbra. Tissue intensities may then be sampled from Gaussian distributions and image-label pairs used to train a segmentation model in a supervised manner.
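As an illustration of how the ranges in Table 1 translate into per-sample random draws, a small sketch is given below. It is not the authors' code. Several details are assumptions: the second argument of $\mathcal{N}(0,0.6)$ is read as a standard deviation, the flip probability is applied independently per axis, and shear and zoom are drawn as single scalars. The function name and dictionary keys are hypothetical.

```python
import random

def sample_augmentation_params(rng):
    """Draw one set of augmentation parameters using the ranges in Table 1."""
    return {
        "bias_strength": rng.uniform(0.0, 0.5),
        "rotation_deg": [rng.uniform(-15.0, 15.0) for _ in range(3)],
        "shear": rng.uniform(0.0, 0.012),
        "zoom": rng.uniform(0.85, 1.15),
        "slice_thickness_factor": rng.uniform(1.0, 8.0),
        # gamma = 10 ** N(0, 0.6); 0.6 is assumed to be the standard deviation
        "gamma": 10.0 ** rng.gauss(0.0, 0.6),
        "motion_fwhm_vox": rng.uniform(0.0, 3.0),
        # one independent coin flip per axis (assumption)
        "flip": [rng.random() < 0.8 for _ in range(3)],
    }

def apply_gamma(intensities, gamma):
    """Gamma-adjust intensities that have already been scaled to [0, 1]."""
    return [v ** gamma for v in intensities]

rng = random.Random(0)
aug = sample_augmentation_params(rng)
adjusted = apply_gamma([0.0, 0.25, 0.5, 1.0], aug["gamma"])
```

Sampling the exponent of $\gamma$ from a zero-centred normal makes brightening and darkening equally likely on a log scale, while the end points 0 and 1 are left unchanged by the power transform.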

With this synthetic data generation pipeline established, we now describe the datasets, network architecture, and training procedures used in our experiments.

2.2 Training Data and Sampling

Healthy maps: OASIS-3 (N=2 679, 2 579/100 train/val) with MultiBrain segmentation, all warped to ICBM space. Lesion masks: ATLAS (N=655, 419/105/131 train/val/test) aligned to the same space. During sampling we paste one random ATLAS lesion onto one random OASIS subject and on-the-fly augment as above.

2.3 Network Architecture and Optimisation

Backbone. 3D U-Net (nnUNet template) with six levels ($16 \rightarrow 32 \rightarrow 64 \rightarrow 128 \rightarrow 320$ channels), PReLU activations and one residual unit per block (Isensee et al., 2024).

Output channels. Background + Gray Matter (GM) + White Matter (WM) + GM/WM Partial Volume + Cerebrospinal Fluid (CSF) + Lesion (total six channels) for Synth; Background + Lesion for Baseline. When real images (binary GT) enter the mixed loader we mask the loss to the lesion channel only.

Training schedule. $192^{3}$ crops, batch 1, 1 200 epochs × 500 iterations ($=6\times 10^{5}$ updates), AdamW ($\eta_{0}=10^{-4}$, weight-decay 0.01, poly LR decay with power 0.9), combined Dice + CE loss, dropout 0.2, gradient-norm clip 12.
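The update count and the polynomial learning-rate decay can be made concrete with a short sketch. It assumes the common form $\eta_t = \eta_0 (1 - t/T)^{0.9}$ evaluated per update; whether the released code steps the schedule per iteration or per epoch is not stated here, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(step, total_steps=600_000, base_lr=1e-4, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power

total_updates = 1200 * 500  # epochs x iterations per epoch

lr_start = poly_lr(0)
lr_mid = poly_lr(total_updates // 2)
lr_end = poly_lr(total_updates)
```

Under this form the rate starts at $10^{-4}$, falls to roughly half of that at the midpoint, and reaches zero at the final update.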

Data-loader mixing. Synthetic : Real ratio = 2 579 : 419 (mirrors sample counts). Real MPRAGE images receive the same spatial/intensity transforms as synthetic batches.
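One way to realise the 2 579 : 419 ratio is to pick the source of each training sample at random with weights equal to the sample counts. The sketch below shows that reading only; the actual loader may instead concatenate the two datasets, and the names `draw_source`, `N_SYNTH` and `N_REAL` are hypothetical.

```python
import random

N_SYNTH, N_REAL = 2579, 419  # sample counts quoted in the text

def draw_source(rng):
    """Return 'synthetic' or 'real' with probability proportional to the counts."""
    return rng.choices(["synthetic", "real"], weights=[N_SYNTH, N_REAL], k=1)[0]

p_synth = N_SYNTH / (N_SYNTH + N_REAL)

rng = random.Random(0)
draws = [draw_source(rng) for _ in range(1000)]
```

With these counts roughly 86 % of training samples are synthetic and the remainder are real MPRAGE images.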

Comparison model. WMH-SynthSeg (Laso et al., 2024; Fischl, 2012) is included as an “off-the-shelf” robust-lesion baseline, but its training labels target small, periventricular WMHs and differ substantially from large cortical stroke lesions.

2.4 Domain Adaptation Methods

At test time we evaluate a diverse set of unsupervised domain-adaptation (DA) techniques to determine: (i) whether the Baseline model can, under an oracle choice of DA method, match the performance of the DA-free Synth model, and (ii) whether Synth’s inherent robustness offers a better starting point for DA on truly unseen data.

We tested six DA methods (TTA, DAE, TENT, PL, UPL, DPL) and report the best-performing configuration for each modality/dataset combination as Oracle DA in Tables 2-5. This represents an upper bound on baseline performance, as in practice one cannot know a priori which DA method will work best. Full individual results for all methods appear in Appendix B, Tables 10-13.

DA techniques evaluated.

(1) TTA - Test-Time Augmentation (Wang et al. (2019); denoted TTA). Eight mirror-flipped volumes ($2^{3}$ axis-flip combinations) are inferred, logits averaged, and softmax/argmax yields the mask. This heuristic is cheap and rarely degrades performance.
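The flip-based averaging can be sketched on a toy volume stored as nested lists. The example is illustrative only: it uses a single-channel output and a stand-in `toy_model`, whereas the real pipeline averages multi-channel logits from the segmentation network. All function names are hypothetical.

```python
from itertools import product

def flip(vol, axes):
    """Flip a nested-list 3D volume along the given axes (subset of 0, 1, 2)."""
    if 0 in axes:
        vol = vol[::-1]
    if 1 in axes:
        vol = [plane[::-1] for plane in vol]
    if 2 in axes:
        vol = [[row[::-1] for row in plane] for plane in vol]
    return vol

def add(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    if isinstance(a, list):
        return [add(x, y) for x, y in zip(a, b)]
    return a + b

def scale(a, s):
    """Multiply every element of a nested list by the scalar s."""
    if isinstance(a, list):
        return [scale(x, s) for x in a]
    return a * s

def flip_combinations():
    """All 2**3 subsets of the three spatial axes."""
    return [tuple(axis for axis, on in enumerate(bits) if on)
            for bits in product([False, True], repeat=3)]

def tta_predict(model, vol):
    """Average model outputs over all eight axis-flip combinations."""
    combos = flip_combinations()
    total = None
    for axes in combos:
        out = flip(model(flip(vol, axes)), axes)  # undo the flip on the output
        total = out if total is None else add(total, out)
    return scale(total, 1.0 / len(combos))

def toy_model(v):
    """Stand-in for the segmentation network (flip-equivariant by construction)."""
    return scale(v, 2.0)

vol = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
averaged = tta_predict(toy_model, vol)
```

For a model that is exactly flip-equivariant, as the toy one is, the average equals the single-pass output; any gain in practice comes from the network not being perfectly equivariant.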

(2) DAE - Denoising Auto-Encoder Regularisation (Karani et al., 2021). We train a 3-layer denoising auto-encoder to regularise noisy labels for both Baseline and Synth. At inference a three-layer normalisation network ($3{\times}3{\times}3$ kernels, 16 channels) is prepended to the frozen segmentor; its activation is

$$f(x)=\exp\left(-\frac{x^{2}}{\sigma^{2}}\right), \tag{1}$$

where $\sigma$ is learned. For each test image the network is re-initialised and optimised for 100 steps to minimise Dice + L2 loss between segmentation logits and their DAE-cleaned counterpart.
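For reference, the activation in Eq. (1) is a Gaussian-shaped bump centred at zero. The scalar sketch below evaluates it for a fixed $\sigma$; in the method $\sigma$ is a learned parameter, and the function name `norm_activation` is hypothetical.

```python
import math

def norm_activation(x, sigma):
    """f(x) = exp(-x^2 / sigma^2) from Eq. (1); sigma is learned in the method."""
    return math.exp(-(x * x) / (sigma * sigma))

peak = norm_activation(0.0, sigma=1.0)
left = norm_activation(-2.0, sigma=1.5)
right = norm_activation(2.0, sigma=1.5)
```

The output lies in (0, 1], is symmetric in $x$, and a larger $\sigma$ widens the range of inputs that pass through with a value close to one.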

(3) TENT - Test-Time Entropy Minimisation (Wang et al., 2021). Because 3D memory limits rule out batch-norm adaptation, we instead optimise an identical normalisation network (initialised from scratch per subject) for 100 steps to minimise Shannon entropy

$$\mathcal{H}(\hat{y})=-\sum\nolimits_{c}p_{c}(\hat{y})\log p_{c}(\hat{y}), \tag{2}$$

with $p_{c}$ the class-$c$ softmax probability.
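The entropy objective in Eq. (2) can be evaluated per voxel as in the sketch below. It covers the loss only, not the optimisation of the normalisation network; the small `eps` inside the logarithm is an added numerical safeguard and not part of Eq. (2), and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def shannon_entropy(logits, eps=1e-12):
    """H = -sum_c p_c log p_c for one voxel's class logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p + eps) for p in probs)

uncertain = shannon_entropy([0.0, 0.0])     # two equally likely classes
confident = shannon_entropy([10.0, -10.0])  # one dominant class
```

Minimising this quantity pushes predictions towards confident, low-entropy outputs; the uniform two-class case gives the maximum value of $\log 2$.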

(4) PL Family - Pseudo-Labelling (Self-Training) (Chen et al., 2021). Three variants are tested:

  • PL: threshold softmax at $\tau=\tfrac{1.5}{N_{C}}$ to keep only high-confidence voxels, where $N_{C}$ is the total number of output classes.

  • UPL: PL plus uncertainty masking via 10-sample Monte-Carlo dropout (variance $>0.05$ discarded).

  • DPL: full “prototype-consistency” pipeline that further removes voxels inconsistent with decoder-feature prototypes.

All PL variants fine-tune all segmentation weights for 2000 iterations with weighted cross-entropy.
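The PL and UPL voxel-selection rules can be sketched per voxel as follows. Two details are assumptions: the threshold is applied to the maximum softmax probability, and the variance is computed over Monte-Carlo samples of a single class probability. DPL's prototype check is not shown, and the function names are hypothetical.

```python
def pl_threshold(n_classes):
    """Confidence threshold tau = 1.5 / N_C."""
    return 1.5 / n_classes

def variance(samples):
    """Population variance of a list of numbers."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

def select_voxel(probs, mc_samples=None, max_var=0.05):
    """Return the pseudo-label class index, or None if the voxel is discarded.

    probs: softmax probabilities for one voxel.
    mc_samples: optional Monte-Carlo dropout samples of one class probability (UPL).
    """
    tau = pl_threshold(len(probs))
    k = max(range(len(probs)), key=probs.__getitem__)
    if probs[k] < tau:
        return None  # not confident enough (PL rule)
    if mc_samples is not None and variance(mc_samples) > max_var:
        return None  # too uncertain under dropout (UPL rule)
    return k

kept = select_voxel([0.9, 0.1])
dropped_pl = select_voxel([0.6, 0.4])
dropped_upl = select_voxel([0.9, 0.1], mc_samples=[0.9] * 5 + [0.1] * 5)
```

With two output classes the threshold is 0.75, while with six classes it is 0.25, so the rule is considerably stricter for the binary Baseline than for the six-channel Synth model.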

Common optimiser settings. Every trainable DA method (DAE, TENT, PL, UPL, DPL) uses AdamW (Loshchilov and Hutter, 2019) with learning-rate 0.002 and weight-decay 0.01. TTA is inference-only and therefore parameter-free.

2.5 Large-Scale Pseudo-Labelling

Because the synthetic pipeline decouples image realism from label accuracy, we can tolerate imperfect pseudo-labels. We therefore use Baseline + TTA to annotate 1159 chronic stroke MPRAGE scans from PLORAS Sample 1 (Seghier et al., 2016); this PLORAS-MPRAGE cohort feeds the mixed loader exactly like ATLAS.

3 Experiments

3.1 Datasets

Models were validated on four stroke-lesion datasets. We assessed the models’ in-domain performance on the hold-out test set for the ATLAS dataset (131 subjects, 1 mm isotropic MPRAGE). Out-of-domain (OOD) robustness was evaluated on three additional cohorts: the ISLES 2015 dataset (Maier et al., 2017) (N=28 subjects with skull-stripped T1w/T2w/FLAIR/DWI), the ARC dataset (Gibson et al., 2024; Johnson et al., 2024) (N=229 T2w, 202 T1w, 85 FLAIR; N=84 subjects have all three), and the hospital scans from 661 acute-stroke patients in PLORAS Sample 2 (N=106 T2w, 300 FLAIR, 255 CT), collectively referred to as PLORAS (Price et al., 2010). ISLES and PLORAS also introduce an acute-versus-chronic shift. There is no overlap in patients between the PLORAS hospital scans (acute) and the PLORAS MPRAGE cohort described in Section 2.5.

For PLORAS, images are resampled from the original 2 mm isotropic resolution to 1 mm to maintain a single preprocessing pipeline, acknowledging that this constitutes a second resampling step.

3.2 Experimental Design

Pre-processing and inference.

All test images are re-oriented to RAS, resliced to 1 mm$^{3}$ voxels, histogram-normalised and $z$-scored. Inference uses a $192^{3}$ sliding window with 50 % overlap and a Gaussian blending kernel ($\sigma=0.125$). Test-time augmentation (TTA) averages logits over all eight combinations of left-right, anterior-posterior and inferior-superior flips.
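The sliding-window geometry can be illustrated along one axis. The sketch assumes that $\sigma=0.125$ is expressed as a fraction of the window size, in the style of the `sigma_scale` argument of MONAI's `sliding_window_inference`; that reading, along with the function names, is an assumption and not confirmed by the text.

```python
import math

def window_starts(length, window=192, overlap=0.5):
    """Start indices of windows covering [0, length) with the given overlap."""
    if length <= window:
        return [0]
    stride = int(window * (1.0 - overlap))
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # final window sits flush with the edge
    return starts

def gaussian_weights(window=192, sigma_scale=0.125):
    """1D Gaussian blending weights; sigma = sigma_scale * window (assumption)."""
    sigma = sigma_scale * window
    centre = (window - 1) / 2.0
    return [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(window)]

def z_score(values):
    """Zero-mean, unit-variance normalisation of a flat list of intensities."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

starts_256 = window_starts(256)
weights = gaussian_weights()
normalised = z_score([1.0, 2.0, 3.0, 4.0])
```

Predictions from overlapping windows are combined as a weighted average, so voxels near a window border, where context is limited, contribute less than voxels near its centre.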

Multi-modal ensembles.

When multiple MR sequences are available for the same subject (ISLES 2015, ARC) we average the per-modality logits before the softmax. This simple ensembling mimics a realistic clinical deployment.
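A per-voxel sketch of this logit averaging is shown below. The two-class logits are toy values chosen for illustration, not model outputs, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_voxel(per_modality_logits):
    """Average logits across modalities, then apply softmax and argmax."""
    n = len(per_modality_logits)
    n_classes = len(per_modality_logits[0])
    mean_logits = [sum(m[c] for m in per_modality_logits) / n
                   for c in range(n_classes)]
    probs = softmax(mean_logits)
    label = max(range(n_classes), key=probs.__getitem__)
    return probs, label

# Toy logits [background, lesion] for one voxel from two sequences
t1w_logits = [2.0, -1.0]
flair_logits = [-0.5, 3.0]
probs, label = ensemble_voxel([t1w_logits, flair_logits])
```

Averaging before the softmax lets a sequence that is strongly confident outweigh one that is only weakly confident in the opposite direction, as happens for the toy voxel here.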

Pseudo-label training.

Pseudo-labels are generated for the PLORAS MPRAGE cohort with the Baseline + TTA model. A new Baseline and a new Synth model are then re-trained using the union of ATLAS and pseudo-labelled data, following exactly the same optimisation schedule as the originals.

3.3 Evaluation Metrics

Prior to metric computation, predictions and ground truth are resliced to 1 mm and zero-padded to $256^{3}$ voxels. We report Dice and Surface-Dice (Seidlitz et al., 2022) (1 mm tolerance) in the main text; HD95, absolute volume difference (AVD), absolute lesion difference (ALD), lesion-wise F1, true-positive rate (TPR) and false-positive rate (FPR) appear in Appendix B.

Dice quantifies volumetric overlap (1 = perfect, 0 = none), whereas HD95 captures boundary error while being robust to outliers. AVD reports absolute volume mismatch in $\text{cm}^{3}$. ALD counts mismatches in the number of connected components, and lesion-wise F1 scores per-lesion detection accuracy. TPR and FPR follow standard definitions.
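For concreteness, the Dice overlap can be computed on flattened binary masks as sketched below. Returning 1.0 when both masks are empty is a convention assumed here; the evaluation code may treat that case differently. The function name is hypothetical.

```python
def dice(pred, truth):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    denominator = sum(map(bool, pred)) + sum(map(bool, truth))
    if denominator == 0:
        return 1.0  # both masks empty (assumed convention)
    return 2.0 * intersection / denominator

partial = dice([1, 1, 0, 0], [1, 0, 0, 0])
perfect = dice([0, 1, 1, 0], [0, 1, 1, 0])
disjoint = dice([1, 0], [0, 1])
```

In the partial-overlap example one voxel is shared between a two-voxel prediction and a one-voxel ground truth, giving a Dice of 2/3.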

3.4 Comparison Methods

We evaluate four primary approaches:

(i) Baseline: A standard 3D U-Net trained solely on real MPRAGE images from the ATLAS dataset (N=419) using supervised learning. This represents the conventional approach of training on a single modality with manual annotations.

(ii) Oracle DA: The baseline model enhanced with the single best-performing domain adaptation method (selected post-hoc from TTA, TENT, DAE, PL, UPL, DPL) for each modality. This represents the theoretical upper bound of what domain adaptation can achieve when the optimal method is known - an unrealistic scenario in practice.

(iii) WMH-SynthSeg: The pre-trained white matter hyperintensity model from Laso et al. (2024); Fischl (2012), included as an off-the-shelf robust lesion segmentation baseline. Note that this model was trained for small periventricular WMH lesions, which differ substantially from large cortical stroke lesions.

(iv) Synth (Ours): Our proposed approach using synthetic data generation as described in Section 2. The model is trained on a mixture of synthetic data (generated from OASIS healthy maps + ATLAS lesion masks) and real ATLAS MPRAGE images, following the framework illustrated in Figure 2.

For fair comparison, TTA is only applied when a model is explicitly labelled “+TTA” or when it is selected as the Oracle DA. All candidate maps are binarised via argmax on posterior probabilities without additional calibration or threshold tuning.
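The post-hoc selection behind Oracle DA amounts to taking, for each modality, the DA method with the best test score. The sketch below uses made-up scores and placeholder modality names purely to show the mechanics; none of the numbers are results from this paper, and the function name is hypothetical.

```python
def oracle_da(scores):
    """For each modality, return the (method, score) pair with the highest score."""
    return {
        modality: max(by_method.items(), key=lambda item: item[1])
        for modality, by_method in scores.items()
    }

# Made-up Dice values for illustration only; these are NOT results from the paper.
toy_scores = {
    "modality_A": {"TTA": 0.31, "TENT": 0.12, "DAE": 0.28},
    "modality_B": {"TTA": 0.02, "TENT": 0.05, "DAE": 0.01},
}
best = oracle_da(toy_scores)
```

Because the choice is made using test-set scores, it cannot be reproduced in deployment, which is why Oracle DA is treated as an upper bound rather than a practical method.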

4 Results

Overall results are shown in Figure 3, which compares (i) the Baseline, (ii) the Oracle DA (best-performing domain adaptation for Baseline only), (iii) WMH-SynthSeg, and (iv) the DA-free Synth model. Oracle DA represents the hypothetical best-case scenario where the optimal DA method is known a priori for each dataset/modality. Tables 2, 3, 4 and 5 provide comprehensive results for all tested domain adaptation methods on both Baseline and Synth models.

Figure 3: Dice and Surface Dice metrics for all reported datasets. ’Oracle DA’ represents the hypothetical best-case scenario where optimal DA method is known a priori for each dataset/modality and applied to the baseline model.

4.1 ATLAS

The ATLAS dataset represents the in-domain scenario with T1-weighted images matching our training distribution. Table 2 shows performance on the held-out test set. The conventional Baseline achieves a median Dice of 57.5 %, whereas our Synth model reaches 48.2 %, a 9.3 percentage-point gap that represents the ’price of robustness’ we accept for the larger out-of-domain gains reported later. Surface Dice follows the same trend (49.4 % vs. 38.1 %).

Applying test-time augmentation (Baseline+TTA) changes Dice by <<0.1 %, indicating that TTA adds little benefit when the evaluation domain is already aligned with training.

The off-the-shelf WMH-SynthSeg, trained for small periventricular WMH lesions, scores only 7.3 % Dice, confirming that cortical stroke in ATLAS lies well outside its intended scope.

A paired Wilcoxon test (Appendix Figure 7) verifies that the Baseline–Synth Dice difference is statistically significant, underscoring that domain-invariant training still sacrifices some in-domain accuracy. Voxel-level metrics in Appendix Table 10 reveal the Baseline achieves a higher recall but at the cost of more false positives, suggesting a tendency to over-segment.

Worst Scanner Syndrome. This in-domain drop is consistent with the ’Worst Scanner Syndrome’ hypothesis (Moyer and Golland, 2021), which posits that many domain-invariance strategies pull feature quality toward the least informative - or noisiest - domain. As the next Results sections demonstrate, that modest in-domain penalty is offset by substantial gains on heterogeneous clinical data.

Table 2: Median results on the ATLAS hold-out set (N=131). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets.

| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 57.5 (52.3-62.7) | 49.4 (44.5-54.3) |
| | Baseline+TTA | 57.5 (52.2-62.8) | 49.5 (44.5-54.5) |
| | WMH-SynthSeg | 7.3 (4.8-9.7) | 8.9 (6.9-10.9) |
| | Synth (Ours) | 48.2 (43.1-53.4) | 38.1 (33.4-42.7) |

4.2 ARC

The ARC dataset contains research-quality chronic stroke scans, representing a moderate domain shift from our ATLAS training data. We expect T1w performance to be strongest given its proximity to the training domain.

Table 3 shows performance on ARC, comparing Baseline, Oracle DA (best possible domain adaptation for Baseline), WMH-SynthSeg, and our Synth approach. For T1w, both baseline and Oracle DA achieve 75.2% Dice, with Synth at 72.3% - all maintaining strong performance (Appendix Figure 8 shows statistical significance). However, T2w reveals dramatic differences: baseline drops to 0.4% and Oracle DA to 0.1%, while Synth maintains 26.8%. FLAIR shows intermediate performance with Oracle DA at 12.4% versus Synth at 14.1%. The ensemble results are particularly striking: Oracle DA achieves only 11.7% while Synth reaches 60.2%.

This pattern - strong baseline performance on T1w but catastrophic failure on other modalities even with optimal DA - demonstrates that synthetic pre-training provides fundamental robustness unachievable through post-hoc adaptation.

Table 3: Median results on the ARC dataset (N=229). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets. ’Oracle DA’ represents the hypothetical best-case scenario where optimal DA method is known a priori for each dataset/modality and applied to the baseline model.

| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 75.2 (71.5-79.0) | 41.7 (39.1-44.4) |
| | Oracle DA | 75.2 (71.3-79.0) | 42.1 (39.4-44.8) |
| | WMH-SynthSeg | 8.3 (6.8-9.8) | 8.9 (7.7-10.1) |
| | Synth (Ours) | 72.3 (68.4-76.2) | 33.7 (31.2-36.2) |
| T2w | Baseline | 0.4 (0.0-1.4) | 1.2 (0.8-1.7) |
| | Oracle DA | 0.1 (0.0-0.4) | 1.0 (0.8-1.2) |
| | WMH-SynthSeg | 3.2 (2.2-4.2) | 6.0 (5.2-6.8) |
| | Synth (Ours) | 26.8 (23.3-30.2) | 12.0 (10.4-13.6) |
| FLAIR | Baseline | 12.0 (7.6-16.3) | 6.3 (4.7-7.8) |
| | Oracle DA | 12.4 (5.8-19.0) | 7.7 (5.2-10.2) |
| | WMH-SynthSeg | 2.4 (0.9-3.9) | 4.3 (3.0-5.5) |
| | Synth (Ours) | 14.1 (9.7-18.5) | 6.6 (4.8-8.4) |
| Ensemble | Baseline | 3.4 (1.3-5.4) | 2.0 (0.8-3.2) |
| | Oracle DA | 11.7 (8.6-14.7) | 6.3 (4.7-7.8) |
| | Synth (Ours) | 60.2 (56.6-63.8) | 26.3 (24.3-28.4) |

4.3 ISLES 2015

ISLES 2015 contains skull-stripped sub-acute stroke scans, representing our most challenging domain shift with both acquisition and pathology differences from the chronic ATLAS training data.

The results in Figure 3 and Table 4 show that the Synth model outperforms the Baseline model in all modalities with regard to both Dice and Surface Dice, with high statistical significance in the Dice metric evidenced in Appendix Figures 11-13. Baseline performance is near-zero across all modalities. Oracle DA shows minimal recovery: 10.5% for T1w (still below Synth’s 11.0%) and 0.0% for all other modalities (T2w, FLAIR, DWI). In contrast, Synth achieves 11.0% (T1w), 11.1% (T2w), 21.2% (FLAIR), and 5.6% (DWI). The ensemble demonstrates the starkest contrast: 0.0% for both baseline and Oracle DA versus 42.3% for Synth.

The poor performance correlates with high false positive rates rather than missed detections (Appendix Table 12), suggesting models struggle with tissue discrimination in this domain. When baseline predictions fail catastrophically, no DA method can recover meaningful performance.

It is also evident from Table 4 that the model performance is highly dependent on the choice of image sequence available. The ensemble improves performance over individual sequences in ISLES2015 (Table 4) but shows mixed results in ARC, suggesting dataset-specific benefits rather than universal improvement. Although we only show the upper limit of an ensemble of all four sequences, it is expected that in cases where fewer sequences are available we will still observe constructive gains from post-hoc ensembling.

Table 4: Median results on the ISLES2015 dataset (N=28). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets. ’Oracle DA’ represents the hypothetical best-case scenario where optimal DA method is known a priori for each dataset/modality and applied to the baseline model.

| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 0.0 (0.0-7.4) | 0.0 (0.0-4.8) |
| | Oracle DA | 10.5 (0.0-21.7) | 3.8 (0.0-9.9) |
| | WMH-SynthSeg | 0.0 (0.0-4.5) | 0.7 (0.0-3.9) |
| | Synth (Ours) | 11.0 (1.2-20.8) | 3.4 (0.0-8.5) |
| T2w | Baseline | 0.0 (0.0-0.5) | 0.3 (0.0-1.0) |
| | Oracle DA | 0.0 (0.0-0.2) | 0.0 (0.0-0.6) |
| | WMH-SynthSeg | 0.1 (0.0-2.5) | 0.8 (0.0-3.0) |
| | Synth (Ours) | 11.1 (0.7-21.6) | 7.1 (2.7-11.5) |
| FLAIR | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Oracle DA | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | WMH-SynthSeg | 0.3 (0.0-2.3) | 0.8 (0.0-3.0) |
| | Synth (Ours) | 21.2 (8.5-34.0) | 14.7 (9.0-20.3) |
| DWI | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Oracle DA | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | WMH-SynthSeg | 0.4 (0.0-2.2) | 1.3 (0.0-3.6) |
| | Synth (Ours) | 5.6 (0.0-14.4) | 4.1 (0.8-7.4) |
| Ensemble | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Oracle DA | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Synth (Ours) | 42.3 (30.2-54.5) | 19.1 (12.7-25.4) |

4.4 PLORAS

The PLORAS dataset represents the most extreme domain shift with real clinical data exhibiting large diversity in acquisition protocols and slice thickness. We expect minimal baseline performance given these challenging conditions. For all available modalities, the Synth model outperforms the baseline with statistical significance in Dice (see Appendix Figures 15 - 17).

Results in Table 5 demonstrate the most extreme performance gap between conventional training and our synthetic approach. On this challenging clinical dataset, Oracle DA achieves at most 0.2% Dice (CT modality), while our Synth model achieves 11.9%, 25.4%, and 11.3% for T2w, FLAIR, and CT respectively. The near-total failure of both baseline and Oracle DA on real clinical data - where acquisition protocols, slice thickness, and image quality vary substantially - validates our core hypothesis: domain adaptation cannot substitute for domain-invariant training when deployment conditions diverge significantly from training data. Even WMH-SynthSeg, despite being trained for robustness, achieves only 0.0-4.9% Dice, likely due to its focus on small periventricular lesions rather than large cortical strokes. Full results for all individual DA methods are provided in Appendix Table 13. A number of samples are also visualised for this dataset in Figure 4.

Table 5: Median results on the PLORAS dataset (N=661). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets. 'Oracle DA' represents the hypothetical best-case scenario where the optimal DA method is known a priori for each dataset/modality and applied to the baseline model.

| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T2w | Baseline | 0.0 (0.0-1.3) | 0.1 (0.0-0.6) |
| | Oracle DA | 0.1 (0.0-2.2) | 0.1 (0.0-0.8) |
| | WMH-SynthSeg | 4.9 (4.0-5.8) | 0.0 (0.0-0.0) |
| | Synth (Ours) | 11.9 (6.7-17.1) | 8.4 (6.1-10.7) |
| FLAIR | Baseline | 0.0 (0.0-0.4) | 0.0 (0.0-0.2) |
| | Oracle DA | 0.0 (0.0-0.4) | 0.0 (0.0-0.2) |
| | WMH-SynthSeg | 4.6 (4.2-5.0) | 0.0 (0.0-0.0) |
| | Synth (Ours) | 25.4 (22.5-28.3) | 8.5 (7.4-9.6) |
| CT | Baseline | 0.0 (0.0-1.1) | 0.0 (0.0-0.5) |
| | Oracle DA | 0.2 (0.0-0.5) | 0.3 (0.0-0.7) |
| | WMH-SynthSeg | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Synth (Ours) | 11.3 (8.0-14.6) | 7.9 (6.0-9.8) |
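
The table entries pair a median with a Student's t interval. The following is a minimal sketch of one way such an entry could be computed. It assumes the interval is centred on the median with half-width t·s/√n and clipped to the 0-100% range, a reading that is consistent with the symmetric, zero-bounded intervals in the tables but is not confirmed by the text. The function name and the example values are hypothetical.

```python
import math
import statistics


def median_with_interval(scores, t_crit):
    """Return (median, lower, upper) for per-subject scores given in percent.

    Assumption: the interval is median +/- t_crit * s / sqrt(n), clipped to
    [0, 100]. t_crit is the two-sided 95% Student's t critical value for
    n - 1 degrees of freedom and is supplied by the caller.
    """
    n = len(scores)
    centre = statistics.median(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(n)
    lower = max(0.0, centre - half_width)
    upper = min(100.0, centre + half_width)
    return centre, lower, upper


# Illustrative per-subject Dice values (made up, not taken from the paper).
dice = [0.0, 5.0, 10.0, 15.0, 20.0]
# 2.776 is the two-sided 95% critical value of Student's t with 4 degrees of freedom.
centre, lower, upper = median_with_interval(dice, t_crit=2.776)
print(f"{centre:.1f} ({lower:.1f}-{upper:.1f})")
```

For cohorts as large as PLORAS (N=661) the critical value is close to the normal quantile of about 1.96, so the interval width is driven mainly by the spread of per-subject scores.
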
Figure 4: Sample visualisations in the PLORAS dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.

4.5 Domain Adaptation

Tables 3, 4 and 5 demonstrate that even with oracle selection of the optimal domain adaptation method, the baseline model cannot match Synth performance in out-of-domain settings. We evaluated six established DA techniques (TTA, TENT, DAE, PL, UPL, DPL) on both baseline and Synth models to determine whether domain adaptation could enable baseline generalisation. Complete results in Appendix Tables 11-13 consistently show that poor initial baseline predictions prevent effective adaptation. In contrast, when applied to our Synth model, several DA methods yield substantial improvements, demonstrating the potential for compound gains when robust pre-training is combined with appropriate post-hoc adaptation.

Analysis of complete DA results (Appendix Tables 11-13) reveals DAE as the most consistently effective method for Synth. DAE achieves substantial improvements: ARC T2w increases from 26.8% to 54.3% Dice, ISLES T1w from 11.0% to 31.1%, and PLORAS CT from 11.3% to 23.9%. TTA provides modest gains for PLORAS FLAIR (25.4% to 29.4%), while TENT benefits ARC FLAIR (14.1% to 36.1%). Pseudo-labelling methods (PL, UPL, DPL) show high variance: they are occasionally strong but often catastrophic.
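
To make the simplest of these methods concrete, the sketch below illustrates test-time augmentation (TTA) by averaging a prediction with the prediction obtained from a left-right flipped input. The choice of a single flip, the 2D nested-list images, and the names `flip_lr` and `tta_predict` are assumptions for illustration; the augmentations used in the experiments are not specified in this section.

```python
def flip_lr(image):
    """Mirror a 2D nested-list image along its columns."""
    return [row[::-1] for row in image]


def tta_predict(predict, image):
    """Average predict(image) with the un-flipped prediction of the flipped image."""
    direct = predict(image)
    # Predict on the flipped input, then flip the output back so voxels align.
    flipped_back = flip_lr(predict(flip_lr(image)))
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, flipped_back)
    ]


def toy_predict(image):
    # Hypothetical stand-in for a network: deliberately not flip-equivariant.
    return [[v * (j + 1) for j, v in enumerate(row)] for row in image]


averaged = tta_predict(toy_predict, [[1.0, 0.0]])
```

Averaging can only smooth the predictions a model already makes, which is in line with the observation that such methods help Synth but cannot recover a baseline whose initial predictions are near empty.
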

The key finding: while DA cannot rescue baseline models trained on narrow data, it can enhance robustly pre-trained models. Synth+DAE consistently outperforms both Synth alone and Baseline+Oracle DA, demonstrating complementary benefits.

Our domain adaptation experiments serve primarily to establish an upper bound on baseline model performance. Oracle selection of the best DA method for each dataset/modality combination is unrealistic in practice, since real-world deployment would lack this knowledge for unseen data, yet even under this assumption the baseline model cannot match Synth performance in out-of-domain settings. Synthetic pre-training therefore provides robustness that cannot be recovered through test-time adaptation alone, which underscores the value of appearance-invariant training even with our current limitation of modelling stroke heterogeneity through single Gaussian distributions. Because DA results varied substantially by method and modality, with no single approach improving every setting, we report all tested combinations (Appendix Tables 11-13) rather than cherry-picking optimal results.
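
The oracle selection described above can be sketched as a per-setting maximum over DA methods. The function name, the dictionary layout, and all numbers below are hypothetical placeholders, not results from the paper.

```python
def oracle_da(scores_by_setting):
    """Pick, for each (dataset, modality), the DA method with the highest score.

    Selection uses test-set scores, which is why the result is an optimistic
    upper bound rather than a procedure that could be deployed.
    """
    oracle = {}
    for setting, per_method in scores_by_setting.items():
        best_method = max(per_method, key=per_method.get)
        oracle[setting] = (best_method, per_method[best_method])
    return oracle


# Placeholder median Dice values (%), invented purely for illustration.
baseline_scores = {
    ("dataset_A", "T2w"): {"none": 0.0, "TTA": 0.1, "TENT": 0.0, "DAE": 0.05},
    ("dataset_A", "CT"): {"none": 0.0, "TTA": 0.0, "TENT": 0.2, "DAE": 0.1},
}
# Placeholder scores for a robust model evaluated without any adaptation.
robust_scores = {("dataset_A", "T2w"): 10.0, ("dataset_A", "CT"): 9.0}

best = oracle_da(baseline_scores)
gap = {setting: robust_scores[setting] - best[setting][1] for setting in best}
```

A positive `gap` in every setting corresponds to the situation reported here: even the best-case adapted baseline stays below the unadapted robust model.
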

4.6 Ablation: To mix or not to mix?

For all experiments shown thus far, the Synth model used a mix of both synthetic data and the real ATLAS dataset. In order to evaluate whether this decision is justified, an ablation is performed where the Synth model using mixed real/synthetic data is compared to a model trained with only synthetic data. This model is trained in the exact same manner as described previously for the baseline and Synth models. Results for the test datasets ATLAS, ARC and ISLES 2015 are shown in Tables 6, 7 and 8 respectively.

The results reveal a nuanced picture. For in-domain and near-domain T1w data (ATLAS, ARC), mixing with real data provides clear benefits, with pure synthetic training achieving only 19.7% and 46.7% Dice respectively compared to 48.2% and 72.3% for mixed training. However, in several out-of-domain scenarios pure synthetic training outperforms mixed training, sometimes by a wide margin: ARC T2w (62.6% vs 26.8%), ISLES T1w (30.4% vs 11.0%), and ISLES FLAIR (37.2% vs 21.2%). The pattern is not universal, as mixed training remains ahead for ARC FLAIR (14.1% vs 9.6%) and ISLES T2w (11.1% vs 7.4%). These results suggest that including real T1w data may inadvertently bias the model toward T1w-specific features, reducing generalisation to other modalities. The ARC T2w and ISLES FLAIR improvements with pure synthetic training may reflect these sequences’ different tissue contrast patterns, which are better captured by unbiased synthetic variation than by a model partially trained on T1w-specific features. Further improving the realism of the generated lesions with methods proposed in Liu et al. (2025) may help close the gap for the model trained only on synthetic data, potentially achieving both in-domain performance and out-of-domain generalisation.
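
The trade-off between mixed and pure synthetic training can be framed as a single mixing probability. The sketch below is a toy data stream, not the training pipeline used in the paper: `p_real`, `mixed_stream` and `toy_synth` are hypothetical names, and the mixing proportion used in the experiments is not stated in this section.

```python
import random


def mixed_stream(real_items, synth_generator, p_real, n_samples, seed=0):
    """Yield n_samples training items, each drawn from real data with probability p_real."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        if real_items and rng.random() < p_real:
            yield ("real", rng.choice(real_items))
        else:
            yield ("synth", synth_generator(rng))


def toy_synth(rng):
    # Stand-in for a label-map-to-image generator: returns a random "contrast" id.
    return rng.randint(0, 9999)


# p_real=0.0 corresponds to the "no real data" ablation; other values give mixed training.
synthetic_only = list(mixed_stream(["atlas_001", "atlas_002"], toy_synth, 0.0, 8))
mixed = list(mixed_stream(["atlas_001", "atlas_002"], toy_synth, 0.5, 8))
```

Sweeping `p_real` between 0 and 1 would be one way to study how the real/synthetic balance affects in-domain and out-of-domain performance.
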

Figure 5: Dice and Surface Dice metrics for all reported datasets, for models trained on different combinations of real/synthetic data.
Table 6: Median results for ablation and pseudo-label experiments on the ATLAS test set (N=131). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets.
| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 57.5 (52.3-62.7) | 49.4 (44.5-54.3) |
| | Baseline+Pseudo | 59.8 (54.8-64.7) | 48.6 (44.0-53.3) |
| | Synth (no real data) | 19.7 (15.5-23.9) | 14.6 (12.3-16.9) |
| | Synth (Ours) | 48.2 (43.1-53.4) | 38.1 (33.4-42.7) |
| | Synth+Pseudo | 49.6 (44.5-54.7) | 38.2 (33.8-42.6) |

Table 7: Median results for ablation and pseudo-label experiments on the ARC test set (N=229). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets.
| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 75.2 (71.5-79.0) | 41.7 (39.1-44.4) |
| | Baseline+Pseudo | 75.9 (72.1-79.7) | 38.2 (35.8-40.7) |
| | Synth (no real data) | 46.7 (43.0-50.4) | 18.2 (16.5-20.0) |
| | Synth (Ours) | 72.3 (68.4-76.2) | 33.7 (31.2-36.2) |
| | Synth+Pseudo | 74.3 (70.4-78.2) | 36.8 (34.3-39.3) |
| T2w | Baseline | 0.4 (0.0-1.4) | 1.2 (0.8-1.7) |
| | Baseline+Pseudo | 1.4 (0.3-2.4) | 3.0 (2.5-3.4) |
| | Synth (no real data) | 62.6 (58.8-66.5) | 30.9 (28.7-33.0) |
| | Synth (Ours) | 26.8 (23.3-30.2) | 12.0 (10.4-13.6) |
| | Synth+Pseudo | 37.2 (33.2-41.1) | 14.7 (12.7-16.7) |
| FLAIR | Baseline | 12.0 (7.6-16.3) | 6.3 (4.7-7.8) |
| | Baseline+Pseudo | 14.5 (10.1-18.9) | 7.0 (5.5-8.5) |
| | Synth (no real data) | 9.6 (7.0-12.3) | 7.3 (6.1-8.6) |
| | Synth (Ours) | 14.1 (9.7-18.5) | 6.6 (4.8-8.4) |
| | Synth+Pseudo | 12.9 (8.4-17.3) | 6.3 (4.4-8.1) |
| Ensemble | Baseline | 3.4 (1.3-5.4) | 2.0 (0.8-3.2) |
| | Baseline+Pseudo | 12.4 (9.7-15.0) | 6.5 (5.1-8.0) |
| | Synth (no real data) | 59.7 (56.2-63.1) | 26.0 (24.2-27.9) |
| | Synth (Ours) | 60.2 (56.6-63.8) | 26.3 (24.3-28.4) |
| | Synth+Pseudo | 69.4 (65.8-73.0) | 34.0 (31.7-36.3) |

Table 8: Median results for ablation and pseudo-label experiments on the ISLES 2015 test set (N=28). Best score shown in bold. Student’s t distribution 95% confidence intervals given in brackets.
| Modality | Model | Dice (%) | Surface Dice (%) |
| --- | --- | --- | --- |
| T1w | Baseline | 0.0 (0.0-7.4) | 0.0 (0.0-4.8) |
| | Baseline+Pseudo | 0.3 (0.0-9.1) | 0.0 (0.0-4.0) |
| | Synth (no real data) | 30.4 (21.3-39.6) | 9.4 (4.1-14.8) |
| | Synth (Ours) | 11.0 (1.2-20.8) | 3.4 (0.0-8.5) |
| | Synth+Pseudo | 17.9 (7.6-28.3) | 8.0 (2.1-13.8) |
| T2w | Baseline | 0.0 (0.0-0.5) | 0.3 (0.0-1.0) |
| | Baseline+Pseudo | 0.1 (0.0-0.7) | 0.4 (0.0-1.1) |
| | Synth (no real data) | 7.4 (0.0-18.4) | 8.1 (3.0-13.1) |
| | Synth (Ours) | 11.1 (0.7-21.6) | 7.1 (2.7-11.5) |
| | Synth+Pseudo | 18.1 (7.1-29.1) | 9.5 (4.4-14.5) |
| FLAIR | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Baseline+Pseudo | 0.0 (0.0-0.3) | 0.0 (0.0-0.4) |
| | Synth (no real data) | 37.2 (25.5-48.9) | 16.0 (10.8-21.1) |
| | Synth (Ours) | 21.2 (8.5-34.0) | 14.7 (9.0-20.3) |
| | Synth+Pseudo | 4.2 (0.0-17.3) | 3.9 (0.0-9.8) |
| DWI | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Baseline+Pseudo | 0.0 (0.0-0.4) | 0.0 (0.0-0.0) |
| | Synth (no real data) | 8.2 (0.0-16.8) | 5.1 (2.0-8.3) |
| | Synth (Ours) | 5.6 (0.0-14.4) | 4.1 (0.8-7.4) |
| | Synth+Pseudo | 5.6 (0.0-14.7) | 4.3 (0.8-7.8) |
| Ensemble | Baseline | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Baseline+Pseudo | 0.0 (0.0-0.0) | 0.0 (0.0-0.0) |
| | Synth (no real data) | 27.2 (15.7-38.8) | 15.6 (10.3-20.9) |
| | Synth (Ours) | 42.3 (30.2-54.5) | 19.1 (12.7-25.4) |
| | Synth+Pseudo | 50.1 (37.6-62.6) | 24.9 (18.7-31.1) |

4.7 Use of Pseudo-labelling

The use of synthetic data for semi-supervised pseudo-label training is also explored using the PLORAS MPRAGE dataset (N=1159), which is distinct from the PLORAS hospital scans used in the Experiments. This approach leverages a key advantage of our framework: because synthetic training decouples image generation from label accuracy, the model can tolerate and potentially benefit from imperfect pseudo-labels. Results in Figure 6 and Tables 6, 7 and 8 demonstrate an overall positive effect of pseudo-labels in the Synth model, with the main exception of FLAIR images in ISLES 2015.

Notable improvements include: ARC T2w increasing from 26.8% to 37.2% Dice, ISLES T1w from 11.0% to 17.9%, and ARC ensemble from 60.2% to 69.4%. The baseline model shows minimal improvement with pseudo-labels in out-of-domain scenarios (e.g., ISLES ensemble remains at 0.0%), suggesting that pseudo-labelling cannot overcome fundamental domain shift. The FLAIR degradation in ISLES 2015 (21.2% to 4.2%) may indicate that pseudo-labels from chronic MPRAGE data introduce biases incompatible with acute FLAIR lesion appearance. The in-domain ATLAS dataset shows a modest improvement for both models with pseudo-labels, slightly larger for the baseline (57.5% to 59.8%) than for Synth (48.2% to 49.6%), suggesting that pseudo-labels effectively expand the training distribution when the domain gap is small.
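
A minimal sketch of how pseudo-labels might be prepared is given below: predicted lesion probabilities are thresholded into binary masks, and empty masks are discarded before being added to the label pool. The threshold of 0.5, the `min_voxels` filter, the 2D nested-list maps, and both function names are assumptions for illustration and are not taken from the paper.

```python
def binarise(prob_map, threshold=0.5):
    """Turn a nested-list probability map into a binary lesion mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def build_pseudo_label_pool(prob_maps, threshold=0.5, min_voxels=1):
    """Keep pseudo-label masks that contain at least min_voxels lesion voxels."""
    pool = []
    for prob_map in prob_maps:
        mask = binarise(prob_map, threshold)
        if sum(map(sum, mask)) >= min_voxels:
            pool.append(mask)
    return pool


# Two toy predictions: the second contains no voxel above threshold and is dropped.
predictions = [
    [[0.9, 0.1], [0.6, 0.2]],
    [[0.1, 0.2], [0.3, 0.4]],
]
pool = build_pseudo_label_pool(predictions)
```

Because training images are synthesised from label maps, a mask kept in the pool always matches the image generated from it, which is one way to read the tolerance to imperfect pseudo-labels noted above.
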

Figure 6: Dice and Surface Dice metrics for all reported datasets, for models trained with/without an additional training dataset of MPRAGE images and pseudo-labels.

These results across four diverse datasets - from research-quality ATLAS to challenging clinical PLORAS scans - reveal consistent patterns that warrant deeper analysis of the fundamental principles underlying domain robustness.

5 Discussion

Our results reveal fundamental principles about domain robustness in medical image segmentation. The consistent failure of domain adaptation to rescue baseline models - even with oracle selection of the optimal method - demonstrates that post-hoc adaptation cannot substitute for domain-invariant training. This finding challenges prevailing assumptions about the sufficiency of test-time adaptation for clinical deployment.

The compound effect observed with Synth+DAE suggests that synthetic pre-training and test-time adaptation address different aspects of domain shift. Synthetic training provides feature representations invariant to appearance changes, while DA methods fine-tune these representations to specific test-time distributions. However, this synergy only emerges when the base model already possesses sufficient robustness - when initial predictions fail catastrophically (as with baseline models on ISLES 2015 and PLORAS), no amount of adaptation can recover performance.

The variable benefit of multi-modal ensembling - substantial for ISLES 2015 but mixed for ARC - indicates that fusion strategies must consider dataset-specific characteristics rather than assuming universal improvement. Clinical deployment should balance single-modality robustness with opportunistic multi-modal fusion.
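
For reference, a post-hoc ensemble of the kind discussed here can be sketched as a voxel-wise average of per-modality lesion probabilities followed by thresholding. The flattened list representation, the threshold of 0.5, and the function name are assumptions for illustration.

```python
def ensemble_mask(prob_maps_by_modality, threshold=0.5):
    """Average per-modality lesion probabilities voxel-wise, then threshold."""
    maps = list(prob_maps_by_modality.values())
    n_modalities = len(maps)
    averaged = [sum(voxels) / n_modalities for voxels in zip(*maps)]
    return [1 if p >= threshold else 0 for p in averaged]


# Flattened toy probability maps for two modalities of the same subject.
mask = ensemble_mask({
    "T1w": [0.9, 0.2, 0.4],
    "FLAIR": [0.7, 0.1, 0.8],
})
```

With equal weights, a modality that predicts poorly can drag down an otherwise confident prediction, which is one plausible reason the benefit of ensembling depends on the dataset.
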

6 Limitations and Future Directions

Appendix A confirms that 20–40% of lesions in several cohorts have multimodal intensity distributions, underscoring the central limitation of our single-Gaussian sampling. While our spatially-varying approach introduces intra-lesion heterogeneity, it does not capture lesions whose intensities follow more than one mode. Future implementations should explore mixture models to capture lesions that simultaneously exhibit hyper- and hypo-intense regions within the same pathology. It may also be advantageous to paste several lesions per subject, each assigned its own independently sampled intensity profile. This would allow a single synthetic brain to display chronic hypointense scars alongside acute hyperintense infarcts, better reflecting the heterogeneous lesion chronology typically observed in stroke cohorts.
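
As a sketch of the proposed extension, lesion intensities could be drawn from a randomly parameterised Gaussian mixture instead of a single Gaussian. The parameter ranges, the normalised intensity scale, and the function name below are assumptions for illustration and do not describe the current generator.

```python
import random


def sample_lesion_intensities(n_voxels, n_components, rng):
    """Draw lesion voxel intensities from a randomly parameterised Gaussian mixture.

    With n_components=1 this reduces to single-Gaussian sampling.
    """
    # Each component has its own mean and standard deviation (assumed ranges).
    components = [
        (rng.uniform(0.0, 1.0), rng.uniform(0.01, 0.1))
        for _ in range(n_components)
    ]
    # Random positive mixing weights; random.choices normalises them internally.
    weights = [rng.random() + 1e-6 for _ in range(n_components)]
    picks = rng.choices(components, weights=weights, k=n_voxels)
    return [rng.gauss(mu, sigma) for mu, sigma in picks]


rng = random.Random(0)
hypo_and_hyper = sample_lesion_intensities(n_voxels=1000, n_components=2, rng=rng)
```

Sampling the component assignment per voxel gives a salt-and-pepper mixture; assigning components per sub-region would be closer to lesions with distinct hypo- and hyper-intense cores.
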

Additional qualitative results in Appendix Figures 18 - 23 display further positive and negative results across the available image modalities within the PLORAS dataset. A number of failure modes may be observed here, such as missing cerebellar lesions or under-segmenting large hemispheric lesions.

Although the results in this paper are reported directly as segmentation metrics, it remains important to validate that the model's predictions are useful for downstream tasks. This could be assessed by comparing how well the segmentation masks predict functional scores such as the Comprehensive Aphasia Test (CAT) (Hope et al., 2024) or the NIH Stroke Scale (NIHSS).

In the context of neuroimaging data, there are many shifts in domain beyond those related to scanner sequence and resolution targeted in this work. Shifts related to anatomical shape, brought on by changes in demographics or the presence of confounding conditions such as atrophy, are of equal importance to shifts in image appearance. Prior work has attempted to model such factors through the use of causal inference and counterfactual generation (Pawlowski et al., 2020; Pombo et al., 2023; Ribeiro et al., 2023). It is conceivable that such an approach could be used to introduce morphological changes to the healthy tissue maps to create a model that is agnostic to shifts in domains related to both shape and appearance.

While this study focused on stroke, the results are expected to translate to other similar domains, such as haemorrhage and glioblastoma. Future work will compare the impact of the mixing proportions of real and synthetic data. Additionally, the utility of multi-modal data in a multi-channel model will be examined versus post-hoc averaging of individual modality predictions. Lesions often appear differently across modalities, so a model trained with multi-channel inputs is expected to leverage these differences more effectively. This relationship between multi-channel inputs may better be modelled using quantitative MRI data.

7 Conclusion

In this study, we introduced a novel synthetic data generation and training framework for stroke lesion segmentation, building on the success of prior works in healthy brain parcellation. A model trained using this framework was evaluated on a wide range of datasets covering research and clinical data, and chronic and acute stroke pathologies. Our experiments demonstrate that synthetic pre-training provides fundamental robustness unachievable through test-time adaptation alone. While accepting a 9.3 percentage-point reduction in median in-domain Dice (48.2% vs 57.5%), our approach maintains performance where conventional methods fail entirely. Even with oracle knowledge of the optimal domain adaptation method - an unrealistic scenario in practice - conventionally-trained models cannot match our synthetic approach in out-of-domain settings.

Our results demonstrate that even imperfect appearance modelling can provide substantial benefits for cross-modality segmentation. Incorporating recent work on realistic lesion simulation (Liu et al., 2025) may further improve performance both in- and out-of-distribution.

Qualitative results in Figure 4 demonstrate the potential of the model as a starting checkpoint in an active-learning framework such as MONAI Label (Diaz-Pinto et al., 2024), where predictions may be refined to generate training data for task-specific fine-tuning. The uncertainty of healthy-tissue segmentations via MC dropout may also provide an effective heuristic for prioritising items to refine/label in such a framework (Nath et al., 2021).

Results also showed the potential benefit of the model for semi-supervised learning through pseudo-labelling, where predicted labels may be fed back into the model with much less caution than is typically required when training with real images (Pham et al., 2021; Xu et al., 2024). A simple way to reinforce this further may be to train a real/fake discriminator to perform automated quality control and threshold optimisation for the generated binary maps. Separately, the ablation study revealed that pure synthetic training can outperform mixed training for certain out-of-domain modalities, suggesting opportunities for further optimisation of the real/synthetic data balance.

Our approach reduces reliance on domain-specific training data and helps bridge the gap between research-grade and clinical scans to improve clinical stroke neuroimaging workflows, providing a foundation for more robust and widely applicable lesion segmentation tools.


Acknowledgments

LC is supported by the EPSRC-funded UCL Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (i4health) (EP/S021930/1), and the Wellcome Trust (203147/Z/16/Z and 205103/Z/16/Z). IP is supported by the Alzheimer’s Association grant number SG-20-678486-GAAIN2. CP is funded by Wellcome (203147/Z/16/Z, 205103/Z/16/Z and 224562/Z/21/Z). This research was supported by NVIDIA and utilised NVIDIA RTX A6000 48GB.


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

We declare we don’t have conflicts of interest.


Data availability

Authors have released public code and weights to reimplement all experiments. Model weights are also made available as a toolbox for SPM. This toolbox is written in pure MATLAB and requires no external installs or compilation, to minimise the barrier for clinical evaluation. All datasets are public except the PLORAS study data, which is part of an ongoing study and will be published upon completion.

References

  • Ashburner and Friston (2005) John Ashburner and Karl J. Friston. Unified segmentation. NeuroImage, 26(3):839–851, 2005. ISSN 10538119.
  • Baid et al. (2021) Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, and others. The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification, 2021.
  • Basaran et al. (2023) Berke Doga Basaran, Weitong Zhang, Mengyun Qiao, Bernhard Kainz, Paul M. Matthews, et al. Lesionmix: A lesion-level data augmentation method for medical image segmentation, 2023. URL https://arxiv.org/abs/2308.09026.
  • Billot et al. (2021) Benjamin Billot, Stefano Cerri, Koen Van Leemput, Adrian V. Dalca, and Juan Eugenio Iglesias. Joint segmentation of multiple sclerosis lesions and brain anatomy in MRI scans of any contrast and resolution with CNNs. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, April 2021.
  • Billot et al. (2023) Benjamin Billot, Douglas N. Greve, Oula Puonti, Axel Thielscher, Koen Van Leemput, et al. SynthSeg: Segmentation of brain MRI scans of any contrast and resolution without retraining. Medical Image Analysis, 86:102789, May 2023. ISSN 1361-8415.
  • Brudfors et al. (2020) Mikael Brudfors, Yaël Balbastre, Guillaume Flandin, Parashkev Nachev, and John Ashburner. Flexible Bayesian Modelling for Nonlinear Image Registration, pages 253–263. Springer International Publishing, 2020. ISBN 9783030597160.
  • Cardoso et al. (2022) M. Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, et al. Monai: An open-source framework for deep learning in healthcare, 2022.
  • Cerri et al. (2021) Stefano Cerri, Oula Puonti, Dominik S. Meier, Jens Wuerfel, Mark Mühlau, et al. A contrast-adaptive method for simultaneous whole-brain and lesion segmentation in multiple sclerosis. NeuroImage, 225:117471, January 2021. ISSN 1053-8119.
  • Chen et al. (2021) Cheng Chen, Quande Liu, Yueming Jin, Qi Dou, and Pheng-Ann Heng. Source-free domain adaptive fundus image segmentation with denoised pseudo-labeling, 2021. URL https://arxiv.org/abs/2109.09735.
  • Chollet et al. (2024) Etienne Chollet, Yaël Balbastre, Chiara Mauri, Caroline Magnain, Bruce Fischl, et al. Neurovascular segmentation in soct with deep learning and synthetic training data, 2024. URL https://arxiv.org/abs/2407.01419.
  • Dai et al. (2022) Pingping Dai, Licong Dong, Ruihan Zhang, Haiming Zhu, Jie Wu, et al. Soft-CP: A credible and effective data augmentation for semantic segmentation of medical lesions, 2022.
  • Dalca et al. (2018) Adrian V. Dalca, John Guttag, and Mert R. Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, June 2018.
  • de la Rosa et al. (2024) Ezequiel de la Rosa, Mauricio Reyes, Sook-Lei Liew, Alexandre Hutton, Roland Wiest, et al. A robust ensemble algorithm for ischemic stroke lesion segmentation: Generalizability and clinical utility beyond the isles challenge, 2024. URL https://arxiv.org/abs/2403.19425.
  • Diaz-Pinto et al. (2024) Andres Diaz-Pinto, Sachidanand Alle, Vishwesh Nath, Yucheng Tang, Alvin Ihsani, et al. Monai label: A framework for ai-assisted interactive labeling of 3d medical images. Medical Image Analysis, 95:103207, July 2024. ISSN 1361-8415. URL http://dx.doi.org/10.1016/j.media.2024.103207.
  • Fischl (2012) Bruce Fischl. Freesurfer. NeuroImage, 62(2):774–781, August 2012. ISSN 1053-8119. URL http://dx.doi.org/10.1016/j.neuroimage.2012.01.021.
  • Gibson et al. (2024) Makayla Gibson, Roger Newman-Norlund, Leonardo Bonilha, Julius Fridriksson, Gregory Hickok, et al. The Aphasia Recovery Cohort, an open-source chronic stroke repository. Scientific Data, 11(1):1–8, 2024. ISSN 20524463. URL http://dx.doi.org/10.1038/s41597-024-03819-7.
  • Gopinath et al. (2023) Karthik Gopinath, Douglas N. Greve, Sudeshna Das, Steve Arnold, Colin Magdamo, et al. Cortical analysis of heterogeneous clinical brain mri scans for large-scale neuroimaging studies, 2023. URL https://arxiv.org/abs/2305.01827.
  • Gopinath et al. (2024a) Karthik Gopinath, Douglas N. Greve, Colin Magdamo, Steve Arnold, Sudeshna Das, et al. Recon-all-clinical: Cortical surface reconstruction and analysis of heterogeneous clinical brain mri, 2024a. URL https://arxiv.org/abs/2409.03889.
  • Gopinath et al. (2024b) Karthik Gopinath, Andrew Hoopes, Daniel C. Alexander, Steven E. Arnold, Yael Balbastre, et al. Synthetic data in generalizable, learning-based neuroimaging. Imaging Neuroscience, 2024b. ISSN 2837-6056. URL http://dx.doi.org/10.1162/imag_a_00337.
  • Hartigan and Hartigan (1985) J. A. Hartigan and P. M. Hartigan. The dip test of unimodality. The Annals of Statistics, 13(1), March 1985. ISSN 0090-5364. URL http://dx.doi.org/10.1214/aos/1176346577.
  • Hoffmann et al. (2022) Malte Hoffmann, Benjamin Billot, Douglas N Greve, Juan Eugenio Iglesias, Bruce Fischl, et al. Synthmorph: learning contrast-invariant registration without acquired images. IEEE Transactions on Medical Imaging, 41(3):543–558, 2022.
  • Hoffmann et al. (2023) Malte Hoffmann, Andrew Hoopes, Bruce Fischl, and Adrian V Dalca. Anatomy-specific acquisition-agnostic affine registration learned from fictitious images. In Medical Imaging 2023: Image Processing, volume 12464, page 1246402. SPIE, 2023.
  • Hoffmann et al. (2024) Malte Hoffmann, Andrew Hoopes, Douglas N Greve, Bruce Fischl, and Adrian V Dalca. Anatomy-aware and acquisition-agnostic joint registration with synthmorph. Imaging Neuroscience, 2:1–33, 2024.
  • Hope et al. (2024) Thomas M.H. Hope, Douglas Neville, Mohamed L. Seghier, and Cathy J. Price. Continuous lesion images drive more accurate predictions of outcomes after stroke than binary lesion images. October 2024. URL http://dx.doi.org/10.1101/2024.10.04.616726.
  • Iglesias et al. (2023) Juan E. Iglesias, Benjamin Billot, Yaël Balbastre, Colin Magdamo, Steven E. Arnold, et al. Synthsr: A public ai tool to turn heterogeneous clinical brain scans into high-resolution t1-weighted images for 3d morphometry. Science Advances, 9(5), February 2023. ISSN 2375-2548. URL http://dx.doi.org/10.1126/sciadv.add3607.
  • Iglesias et al. (2021) Juan Eugenio Iglesias, Benjamin Billot, Yaël Balbastre, Azadeh Tabari, John Conklin, et al. Joint super-resolution and synthesis of 1 mm isotropic MP-RAGE volumes from clinical MRI exams with scans of different orientation, resolution and contrast. NeuroImage, 237, 2021.
  • Isensee et al. (2020) Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, December 2020. ISSN 1548-7105.
  • Isensee et al. (2024) Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, et al. nnu-net revisited: A call for rigorous validation in 3d medical image segmentation, 2024. URL https://arxiv.org/abs/2404.09556.
  • Johnson et al. (2024) Lisa Johnson, Roger Newman-Norlund, Alex Teghipco, et al. Progressive lesion necrosis is related to increasing aphasia severity in chronic stroke. NeuroImage: Clinical, 41, January 2024. ISSN 22131582.
  • Karani et al. (2021) Neerav Karani, Ertunc Erdil, Krishna Chaitanya, and Ender Konukoglu. Test-time adaptable neural networks for robust medical image segmentation. Medical Image Analysis, 68:101907, February 2021. ISSN 1361-8415. URL http://dx.doi.org/10.1016/j.media.2020.101907.
  • Laso et al. (2024) Pablo Laso, Stefano Cerri, Annabel Sorby-Adams, Jennifer Guo, Farrah Mateen, et al. Quantifying white matter hyperintensity and brain volumes in heterogeneous clinical and low-field portable mri, 2024. URL https://arxiv.org/abs/2312.05119.
  • Lassmann (2018) Hans Lassmann. Multiple sclerosis pathology. Cold Spring Harbor Perspectives in Medicine, 8(3):a028936, January 2018. ISSN 2157-1422.
  • Li et al. (2020) Qingyun Li, Zhibin Yu, Yubo Wang, and Haiyong Zheng. Tumorgan: A multi-modal data augmentation framework for brain tumor segmentation. Sensors, 20(15):4203, July 2020. ISSN 1424-8220. URL http://dx.doi.org/10.3390/s20154203.
  • Liu et al. (2024) Peirong Liu, Oula Puonti, Annabel Sorby-Adams, William T. Kimberly, and Juan E. Iglesias. PEPSI: Pathology-Enhanced Pulse-Sequence-Invariant Representations for Brain MRI, 3 2024. URL http://arxiv.org/abs/2403.06227.
  • Liu et al. (2025) Peirong Liu, Ana Lawry Aguila, and Juan E. Iglesias. Unraveling normal anatomy via fluid-driven anomaly randomization, 2025. URL https://arxiv.org/abs/2501.13370.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
  • Loughnan et al. (2019) Robert Loughnan, Diego L. Lorca-Puls, Andrea Gajardo-Vidal, Valeria Espejo-Videla, Céline R. Gillebert, et al. Generalizing post-stroke prognoses from research data to clinical data. NeuroImage: Clinical, 24(October 2019):102005, 2019. ISSN 22131582. URL https://doi.org/10.1016/j.nicl.2019.102005.
  • Maier et al. (2017) Oskar Maier, Bjoern H. Menze, Janina von der Gablentz, Levin Häni, Mattias P. Heinrich, et al. ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Medical Image Analysis, 35:250–269, January 2017. ISSN 1361-8415.
  • Middleton et al. (2024) Jon Middleton, Marko Bauer, Kaining Sheng, Jacob Johansen, Mathias Perslev, et al. Local gamma augmentation for ischemic stroke lesion segmentation on mri, 2024. URL https://arxiv.org/abs/2401.06893.
  • Moyer and Golland (2021) Daniel Moyer and Polina Golland. Harmonization and the worst scanner syndrome, 2021. URL https://arxiv.org/abs/2101.06255.
  • Nath et al. (2021) Vishwesh Nath, Dong Yang, Bennett A. Landman, Daguang Xu, and Holger R. Roth. Diminishing uncertainty within the training pool: Active learning for medical image segmentation. IEEE Transactions on Medical Imaging, 40(10):2534–2547, October 2021. ISSN 1558-254X. URL http://dx.doi.org/10.1109/TMI.2020.3048055.
  • Nguyen et al. (2024) Thanh N Nguyen, Mohamad Abdalkader, Urs Fischer, Zhongming Qiu, Simon Nagel, et al. Endovascular management of acute stroke. The Lancet, 404(10459):1265–1278, September 2024. ISSN 0140-6736. URL http://dx.doi.org/10.1016/S0140-6736(24)01410-7.
  • Pawlowski et al. (2020) Nick Pawlowski, Daniel C. Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual inference, 2020. URL https://arxiv.org/abs/2006.06485.
  • Pham et al. (2021) Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V. Le. Meta pseudo labels, 2021. URL https://arxiv.org/abs/2003.10580.
  • Pombo et al. (2023) Guilherme Pombo, Robert Gray, M Jorge Cardoso, Sebastien Ourselin, Geraint Rees, John Ashburner, and Parashkev Nachev. Equitable modelling of brain imaging by counterfactual augmentation with morphologically constrained 3D deep generative models. Med. Image Anal., 84(102723):102723, February 2023.
  • Price et al. (2010) Cathy J. Price, Mohamed L. Seghier, and Alex P. Leff. Predicting language outcome and recovery after stroke: The PLORAS system. Nature Reviews Neurology, 6(4):202–210, 2010. ISSN 17594758.
  • Puonti et al. (2016) Oula Puonti, Juan Eugenio Iglesias, and Koen Van Leemput. Fast and sequence-adaptive whole-brain segmentation using parametric Bayesian modeling. NeuroImage, 143:235–249, 2016. ISSN 10959572. URL http://dx.doi.org/10.1016/j.neuroimage.2016.09.011.
  • Ribeiro et al. (2023) Fabio De Sousa Ribeiro, Tian Xia, Miguel Monteiro, Nick Pawlowski, and Ben Glocker. High fidelity image counterfactuals with probabilistic causal models, 2023. URL https://arxiv.org/abs/2306.15764.
  • Seghier et al. (2008) Mohamed L. Seghier, Anil Ramlackhansingh, Jenny Crinion, Alexander P. Leff, and Cathy J. Price. Lesion identification using unified segmentation-normalisation models and fuzzy clustering. NeuroImage, 41(4):1253–1266, 2008. ISSN 10538119. URL http://dx.doi.org/10.1016/j.neuroimage.2008.03.028.
  • Seghier et al. (2016) Mohamed L. Seghier, Elnas Patel, Susan Prejawa, Sue Ramsden, Andre Selmer, et al. The PLORAS Database: A data repository for Predicting Language Outcome and Recovery After Stroke. NeuroImage, 124:1208–1212, 2016. ISSN 10959572. URL http://dx.doi.org/10.1016/j.neuroimage.2015.03.083.
  • Seidlitz et al. (2022) Silvia Seidlitz, Jan Sellner, Jan Odenthal, Berkin Özdemir, Alexander Studier-Fischer, et al. Robust deep learning-based semantic organ segmentation in hyperspectral images. Medical Image Analysis, 80:102488, August 2022. ISSN 1361-8415. URL http://dx.doi.org/10.1016/j.media.2022.102488.
  • Wang et al. (2021) Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization, 2021.
  • Wang et al. (2019) Guotai Wang, Wenqi Li, Sébastien Ourselin, and Tom Vercauteren. Automatic Brain Tumor Segmentation Using Convolutional Neural Networks with Test-Time Augmentation, pages 61–72. Springer International Publishing, 2019. ISBN 9783030117269. URL http://dx.doi.org/10.1007/978-3-030-11726-9_6.
  • Xu et al. (2024) Moucheng Xu, Yukun Zhou, Chen Jin, Marius de Groot, Daniel C. Alexander, et al. Expectation maximisation pseudo labels. Medical Image Analysis, 94:103125, May 2024. ISSN 1361-8415. URL http://dx.doi.org/10.1016/j.media.2024.103125.
  • Zhang et al. (2021) Xinru Zhang, Chenghao Liu, Ni Ou, Xiangzhu Zeng, Xiaoliang Xiong, et al. CarveMix: A Simple Data Augmentation Method for Brain Lesion Segmentation, pages 196–205. Springer International Publishing, 2021. ISBN 9783030871932. URL http://dx.doi.org/10.1007/978-3-030-87193-2_19.

A Lesion-intensity unimodality

Motivation. Most classical stroke-segmentation pipelines, including our synthetic generator, implicitly assume that the grey-level distribution of a lesion is unimodal. That assumption underpins common appearance priors such as single-Gaussian modelling and simple intensity normalisation. However, chronic infarcts often contain a mixture of tissue constituents, and in CT the same infarct can include both hypo- and hyper-dense cores. Quantifying how often real lesions violate unimodality therefore informs whether richer mixture-based priors are warranted.

Experiment. For every connected component in each dataset, we extracted the raw voxel intensities and applied Hartigan’s dip test (Hartigan and Hartigan, 1985) to the empirical one-dimensional distribution. A lesion was labelled unimodal when the null hypothesis of unimodality could not be rejected at $\alpha = 0.05$; otherwise it was classified as multimodal. For efficiency, we randomly subsampled at most 2048 voxels per lesion before testing. Table 9 summarises the counts.
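The per-lesion procedure described above can be sketched as follows. This is a minimal, standard-library illustration rather than the authors' code: the 6-connectivity rule, the dictionary-based volume representation, and the names `connected_components`, `classify_lesions` and `dip_pvalue` are all assumptions made for the example. The dip test itself is not implemented here; `dip_pvalue` stands for any implementation of Hartigan's dip test that returns a p-value. The 10-voxel minimum size is taken from the caption of Table 9.

```python
import random
from collections import deque

MAX_SAMPLES = 2048  # at most this many voxels are tested per lesion
MIN_VOXELS = 10     # only lesions larger than this are tested (Table 9 caption)
ALPHA = 0.05        # significance level for rejecting unimodality

# Face-sharing neighbours (6-connectivity); the connectivity rule is an assumption.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_components(mask):
    """Yield each connected component of `mask` as a list of (z, y, x) voxels.

    `mask` is a set of voxel coordinates labelled as lesion.
    """
    remaining = set(mask)
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in OFFSETS:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.append(neighbour)
                    queue.append(neighbour)
        yield component


def classify_lesions(mask, intensity, dip_pvalue, rng=None):
    """Label every sufficiently large lesion as 'unimodal' or 'multimodal'.

    `intensity` maps voxel coordinates to raw intensities; `dip_pvalue` is a
    caller-supplied (hypothetical) function returning the dip-test p-value.
    """
    rng = rng or random.Random(0)
    labels = []
    for component in connected_components(mask):
        if len(component) <= MIN_VOXELS:
            continue  # too small to test
        if len(component) > MAX_SAMPLES:
            component = rng.sample(component, MAX_SAMPLES)  # random subsample
        values = [intensity[voxel] for voxel in component]
        p_value = dip_pvalue(values)
        # Failing to reject the null hypothesis of unimodality -> 'unimodal'.
        labels.append("unimodal" if p_value >= ALPHA else "multimodal")
    return labels
```

Passing the test as a callable keeps the sketch independent of any particular dip-test library; the subsampling cap bounds the cost of the test for very large lesions.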

Findings. Across MRI datasets the majority of lesions were indeed unimodal, but a non-negligible tail of multimodal cases emerged:

  • ATLAS (chronic MPRAGE T1w) showed $>95\%$ unimodal lesions; only 80 of 1818 connected components exhibited multimodality, in line with mature cavities whose signal is dominated by a single CSF-like class.

  • ISLES 2015 exhibited 12-38% multimodal lesions depending on modality, reflecting mixed-phase tissue in a sub-acute cohort.

  • ARC (chronic research scans) retained a high unimodal rate in T1w/T2w but FLAIR contained 39% multimodal lesions.

  • PLORAS revealed the strongest departure: while T2w/FLAIR remained largely unimodal, CT lesions were predominantly multimodal (58%), underscoring how density heterogeneity dominates in CT.

Implications. A single-Gaussian appearance prior is largely adequate for T1w/T2w MRI, where $\geq 80\%$ of lesions were unimodal; it begins to break down for modalities that emphasise subtle tissue heterogeneity - most notably FLAIR and especially CT. Adopting explicit mixture-based priors, or synthetically sampling lesions with mixed intensities, therefore represents a principled next step for improving model robustness across modalities and imaging protocols.

Table 9: Counts of connected-component lesions that are unimodal or multimodal in each dataset–modality pair. For every lesion larger than 10 voxels we sampled at most 2048 raw intensity values and ran Hartigan’s dip test; failure to reject the null hypothesis of unimodality at $\alpha = 0.05$ yields the Unimodal label, otherwise Multimodal. ATLAS is split into the subjects used for model development (T1w [Train]) and the held-out evaluation set (T1w [Validation]).

| Dataset | Modality | Unimodal | Multimodal |
|---|---|---|---|
| ATLAS | T1w [Train] | 1384 | 65 |
| | T1w [Validation] | 354 | 15 |
| ISLES2015 | T1w | 52 | 9 |
| | T2w | 46 | 15 |
| | FLAIR | 54 | 7 |
| | DWI | 38 | 23 |
| ARC | T1w | 286 | 64 |
| | T2w | 335 | 48 |
| | FLAIR | 90 | 57 |
| PLORAS | T2w | 203 | 14 |
| | FLAIR | 712 | 29 |
| | CT | 248 | 342 |
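
The percentages quoted in the Findings can be re-derived from the counts in Table 9. The short sketch below does this arithmetic; the dictionary layout and the function names are illustrative choices, while the counts themselves are transcribed from the table.

```python
# (unimodal, multimodal) lesion counts transcribed from Table 9
TABLE_9 = {
    ("ATLAS", "T1w [Train]"): (1384, 65),
    ("ATLAS", "T1w [Validation]"): (354, 15),
    ("ISLES2015", "T1w"): (52, 9),
    ("ISLES2015", "T2w"): (46, 15),
    ("ISLES2015", "FLAIR"): (54, 7),
    ("ISLES2015", "DWI"): (38, 23),
    ("ARC", "T1w"): (286, 64),
    ("ARC", "T2w"): (335, 48),
    ("ARC", "FLAIR"): (90, 57),
    ("PLORAS", "T2w"): (203, 14),
    ("PLORAS", "FLAIR"): (712, 29),
    ("PLORAS", "CT"): (248, 342),
}


def multimodal_fraction(unimodal, multimodal):
    """Fraction of tested lesions for which unimodality was rejected."""
    return multimodal / (unimodal + multimodal)


def pooled_counts(dataset):
    """Sum (unimodal, multimodal) counts over all rows of one dataset."""
    rows = [counts for (name, _), counts in TABLE_9.items() if name == dataset]
    return sum(u for u, _ in rows), sum(m for _, m in rows)


if __name__ == "__main__":
    for (dataset, modality), (uni, multi) in TABLE_9.items():
        pct = 100 * multimodal_fraction(uni, multi)
        print(f"{dataset:10s} {modality:18s} {pct:5.1f}% multimodal")
    uni, multi = pooled_counts("ATLAS")
    print(f"ATLAS pooled: {multi} of {uni + multi} lesions multimodal")
```

Pooling the two ATLAS rows gives 80 multimodal lesions out of 1818, and the PLORAS CT row gives 342 of 590, i.e. about 58%, matching the figures in the text.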

B Further segmentation metrics

Additional metrics for all datasets are reported below in Tables 10-13. Individual scores are given for each DA method, applied to both the baseline model and our Synth model.

Table 10: Mean results on ATLAS hold-out set (N=131). Best score shown in bold. FPR values are shown as a multiple of $10^{-3}$.

| Modality | Model | Dice | HD95 | AVD | ALD | LF1 | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| T1w | Baseline | 0.500 | 44.3 | 7.60 | 3.48 | 0.480 | 0.510 | 0.251 |
| | Baseline+TTA | 0.503 | 41.4 | 7.34 | 2.52 | 0.537 | 0.502 | 0.234 |
| | Synth (Ours) | 0.456 | 47.4 | 8.73 | 2.26 | 0.536 | 0.426 | 0.200 |
| | Synth+TTA | 0.455 | 48.3 | 8.51 | 1.76 | 0.559 | 0.424 | 0.191 |

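Table 11: Mean results on the ARC dataset (N=229). Best score per column is shown in bold. FPR values are shown as a multiple of 1000.

| Modality | Model | Dice | HD95 | AVD | ALD | LF1 | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| T1w | Baseline | 0.646 | 24.5 | 18.04 | 6.19 | 0.335 | 0.600 | 1.178 |
| | Baseline+TTA | 0.648 | 23.7 | 18.33 | 3.42 | 0.442 | 0.598 | 1.126 |
| | Baseline+TENT | 0.077 | 144.9 | 75.50 | 3.51 | 0.180 | 0.060 | 0.285 |
| | Baseline+DAE | 0.614 | 26.2 | 26.53 | 3.15 | 0.507 | 0.685 | 2.238 |
| | Baseline+PL | 0.325 | 67.5 | 146.72 | 175.61 | 0.017 | 0.613 | 10.342 |
| | Baseline+UPL | 0.341 | 74.4 | 48.10 | 22.84 | 0.244 | 0.262 | 0.695 |
| | Baseline+DPL | 0.340 | 52.5 | 52.48 | 9.53 | 0.324 | 0.247 | 0.465 |
| | Synth (Ours) | 0.609 | 27.9 | 22.88 | 1.99 | 0.613 | 0.539 | 0.953 |
| | Synth+TTA | 0.611 | 27.0 | 24.08 | 1.84 | 0.640 | 0.534 | 0.885 |
| | Synth+TENT | 0.560 | 33.4 | 26.71 | 2.42 | 0.528 | 0.604 | 2.385 |
| | Synth+DAE | 0.613 | 23.4 | 20.60 | 2.26 | 0.581 | 0.640 | 1.903 |
| | Synth+PL | 0.311 | 158.6 | 48.41 | 260.54 | 0.017 | 0.310 | 2.303 |
| | Synth+UPL | 0.340 | 56.3 | 54.57 | 20.35 | 0.334 | 0.243 | 0.457 |
| | Synth+DPL | 0.255 | 58.3 | 63.36 | 7.44 | 0.530 | 0.169 | 0.339 |
| T2w | Baseline | 0.040 | 76.8 | 73.80 | 47.12 | 0.044 | 0.063 | 6.778 |
| | Baseline+TTA | 0.023 | 77.0 | 71.83 | 26.00 | 0.054 | 0.036 | 6.819 |
| | Baseline+TENT | 0.000 | 148.5 | 85.40 | 5.11 | 0.047 | 0.000 | 0.206 |
| | Baseline+DAE | 0.056 | 75.8 | 123.49 | 10.09 | 0.113 | 0.134 | 10.938 |
| | Baseline+PL | 0.006 | 78.8 | 127.78 | 353.31 | 0.007 | 0.014 | 12.731 |
| | Baseline+UPL | 0.000 | 256.0 | 85.74 | 2.57 | 0.013 | 0.000 | 0.185 |
| | Baseline+DPL | 0.000 | 256.0 | 85.74 | 2.57 | 0.013 | 0.000 | 0.185 |
| | Synth (Ours) | 0.299 | 66.2 | 40.79 | 38.81 | 0.060 | 0.304 | 2.282 |
| | Synth+TTA | 0.284 | 66.2 | 43.70 | 16.00 | 0.117 | 0.274 | 1.893 |
| | Synth+TENT | 0.321 | 55.5 | 43.43 | 15.09 | 0.140 | 0.326 | 1.663 |
| | Synth+DAE | 0.437 | 40.8 | 40.19 | 11.43 | 0.196 | 0.410 | 1.344 |
| | Synth+PL | 0.384 | 65.5 | 40.26 | 22.77 | 0.095 | 0.475 | 3.881 |
| | Synth+UPL | 0.120 | 50.7 | 73.30 | 22.71 | 0.172 | 0.074 | 0.484 |
| | Synth+DPL | 0.055 | 52.3 | 78.43 | 47.28 | 0.108 | 0.032 | 0.447 |
| FLAIR | Baseline | 0.193 | 66.2 | 52.15 | 49.49 | 0.049 | 0.205 | 4.532 |
| | Baseline+TTA | 0.202 | 65.7 | 47.98 | 25.98 | 0.090 | 0.197 | 3.719 |
| | Baseline+TENT | 0.011 | 165.6 | 79.40 | 4.94 | 0.070 | 0.008 | 0.628 |
| | Baseline+DAE | 0.287 | 54.8 | 75.47 | 11.09 | 0.232 | 0.327 | 5.981 |
| | Baseline+PL | 0.126 | 81.9 | 162.13 | 382.66 | 0.007 | 0.240 | 13.548 |
| | Baseline+UPL | 0.000 | 256.0 | 82.36 | 2.62 | 0.035 | 0.000 | 0.497 |
| | Baseline+DPL | 0.000 | 256.0 | 82.36 | 2.62 | 0.035 | 0.000 | 0.497 |
| | Synth (Ours) | 0.199 | 59.3 | 60.36 | 17.69 | 0.150 | 0.157 | 1.349 |
| | Synth+TTA | 0.190 | 56.4 | 62.98 | 8.36 | 0.256 | 0.143 | 1.185 |
| | Synth+TENT | 0.340 | 49.7 | 46.05 | 5.01 | 0.247 | 0.286 | 1.582 |
| | Synth+DAE | 0.263 | 51.5 | 53.14 | 5.88 | 0.248 | 0.215 | 1.581 |
| | Synth+PL | 0.276 | 96.8 | 45.97 | 93.18 | 0.027 | 0.255 | 2.133 |
| | Synth+UPL | 0.029 | 99.1 | 80.44 | 4.79 | 0.272 | 0.016 | 0.520 |
| | Synth+DPL | 0.010 | 146.9 | 81.38 | 7.11 | 0.189 | 0.005 | 0.512 |
| Ensemble | Baseline | 0.107 | 53.3 | 77.04 | 22.20 | 0.309 | 0.071 | 1.137 |
| | Baseline+TTA | 0.087 | 65.8 | 78.91 | 9.10 | 0.429 | 0.057 | 1.126 |
| | Baseline+TENT | 0.003 | 238.0 | 85.73 | 2.60 | 0.054 | 0.002 | 0.185 |
| | Baseline+DAE | 0.212 | 51.3 | 73.84 | 39.25 | 0.168 | 0.212 | 2.350 |
| | Baseline+PL | 0.016 | 76.1 | 70.76 | 512.92 | 0.004 | 0.014 | 3.869 |
| | Baseline+UPL | 0.000 | 256.0 | 85.74 | 2.57 | 0.013 | 0.000 | 0.185 |
| | Baseline+DPL | 0.000 | 256.0 | 85.74 | 2.57 | 0.013 | 0.000 | 0.185 |
| | Synth (Ours) | 0.525 | 34.9 | 35.60 | 10.09 | 0.356 | 0.451 | 0.770 |
| | Synth+TTA | 0.530 | 33.0 | 36.51 | 4.50 | 0.470 | 0.452 | 0.696 |
| | Synth+TENT | 0.451 | 39.1 | 38.50 | 11.12 | 0.370 | 0.412 | 0.960 |
| | Synth+DAE | 0.579 | 27.6 | 27.52 | 6.77 | 0.432 | 0.531 | 0.923 |
| | Synth+PL | 0.529 | 50.2 | 33.21 | 29.02 | 0.106 | 0.515 | 1.551 |
| | Synth+UPL | 0.244 | 53.0 | 65.77 | 9.80 | 0.470 | 0.159 | 0.310 |
| | Synth+DPL | 0.172 | 71.4 | 71.62 | 13.97 | 0.414 | 0.107 | 0.284 |
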
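Table 12: Mean results on the ISLES2015 dataset (N=28). Best score per column is shown in bold. FPR values are shown as a multiple of 1000.

| Modality | Model | Dice | HD95 | AVD | ALD | LF1 | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| T1w | Baseline | 0.115 | 84.2 | 38.14 | 3.32 | 0.269 | 0.076 | 0.034 |
| | Baseline+TTA | 0.091 | 113.2 | 40.11 | 1.96 | 0.311 | 0.058 | 0.012 |
| | Baseline+TENT | 0.002 | 178.2 | 43.77 | 2.14 | 0.161 | 0.001 | 0.001 |
| | Baseline+DAE | 0.246 | 68.2 | 48.07 | 7.18 | 0.285 | 0.191 | 1.604 |
| | Baseline+PL | 0.040 | 81.6 | 42.79 | 643.29 | 0.003 | 0.039 | 1.979 |
| | Baseline+UPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Synth (Ours) | 0.222 | 70.5 | 30.49 | 7.14 | 0.208 | 0.168 | 0.162 |
| | Synth+TTA | 0.230 | 78.5 | 31.41 | 2.18 | 0.349 | 0.163 | 0.085 |
| | Synth+TENT | 0.076 | 108.2 | 33.79 | 40.86 | 0.103 | 0.058 | 0.265 |
| | Synth+DAE | 0.268 | 58.1 | 29.18 | 3.64 | 0.316 | 0.223 | 0.293 |
| | Synth+PL | 0.050 | 76.9 | 42.55 | 3.93 | 0.321 | 0.028 | 0.008 |
| | Synth+UPL | 0.000 | 241.7 | 43.79 | 2.04 | 0.050 | 0.000 | 0.000 |
| | Synth+DPL | 0.001 | 221.7 | 43.77 | 2.46 | 0.127 | 0.000 | 0.000 |
| T2w | Baseline | 0.007 | 65.4 | 50.81 | 15.00 | 0.078 | 0.044 | 2.786 |
| | Baseline+TTA | 0.003 | 65.5 | 48.97 | 6.86 | 0.095 | 0.025 | 2.260 |
| | Baseline+TENT | 0.001 | 142.6 | 43.38 | 3.75 | 0.050 | 0.001 | 0.085 |
| | Baseline+DAE | 0.006 | 71.5 | 88.17 | 9.43 | 0.071 | 0.069 | 5.805 |
| | Baseline+PL | 0.002 | 74.2 | 71.28 | 155.21 | 0.013 | 0.005 | 5.055 |
| | Baseline+UPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Synth (Ours) | 0.231 | 74.8 | 52.73 | 11.82 | 0.166 | 0.443 | 3.534 |
| | Synth+TTA | 0.245 | 74.6 | 50.99 | 4.68 | 0.279 | 0.454 | 3.379 |
| | Synth+TENT | 0.215 | 77.4 | 33.44 | 20.11 | 0.109 | 0.276 | 1.755 |
| | Synth+DAE | 0.255 | 63.1 | 31.10 | 4.61 | 0.280 | 0.328 | 1.437 |
| | Synth+PL | 0.245 | 73.3 | 37.57 | 13.71 | 0.161 | 0.288 | 2.280 |
| | Synth+UPL | 0.152 | 63.3 | 31.06 | 12.43 | 0.138 | 0.116 | 0.631 |
| | Synth+DPL | 0.111 | 58.6 | 35.25 | 8.11 | 0.172 | 0.073 | 0.143 |
| FLAIR | Baseline | 0.000 | 71.4 | 40.06 | 10.79 | 0.018 | 0.000 | 0.515 |
| | Baseline+TTA | 0.000 | 112.9 | 41.24 | 3.43 | 0.000 | 0.000 | 0.238 |
| | Baseline+TENT | 0.000 | 119.0 | 43.47 | 4.32 | 0.000 | 0.000 | 0.021 |
| | Baseline+DAE | 0.000 | 80.1 | 61.65 | 15.29 | 0.015 | 0.004 | 3.679 |
| | Baseline+PL | 0.000 | 77.1 | 43.43 | 70.43 | 0.009 | 0.000 | 0.490 |
| | Baseline+UPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Synth (Ours) | 0.314 | 62.9 | 18.84 | 10.61 | 0.154 | 0.346 | 0.975 |
| | Synth+TTA | 0.329 | 82.1 | 18.37 | 3.50 | 0.275 | 0.359 | 0.891 |
| | Synth+TENT | 0.127 | 79.0 | 28.03 | 34.54 | 0.031 | 0.107 | 0.460 |
| | Synth+DAE | 0.274 | 55.7 | 20.25 | 7.86 | 0.152 | 0.305 | 0.989 |
| | Synth+PL | 0.252 | 75.7 | 50.59 | 111.43 | 0.036 | 0.330 | 3.176 |
| | Synth+UPL | 0.232 | 137.9 | 24.88 | 3.04 | 0.271 | 0.187 | 0.111 |
| | Synth+DPL | 0.205 | 145.6 | 25.16 | 7.36 | 0.236 | 0.168 | 0.191 |
| DWI | Baseline | 0.000 | 80.1 | 42.63 | 5.54 | 0.023 | 0.000 | 0.138 |
| | Baseline+TTA | 0.000 | 102.8 | 43.83 | 2.43 | 0.000 | 0.000 | 0.133 |
| | Baseline+TENT | 0.000 | 103.9 | 43.52 | 1.89 | 0.000 | 0.000 | 0.021 |
| | Baseline+DAE | 0.000 | 76.2 | 40.04 | 11.21 | 0.025 | 0.000 | 2.123 |
| | Baseline+PL | 0.004 | 77.5 | 40.33 | 835.36 | 0.001 | 0.004 | 0.726 |
| | Baseline+UPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Synth (Ours) | 0.193 | 82.5 | 47.64 | 16.71 | 0.100 | 0.352 | 2.741 |
| | Synth+TTA | 0.204 | 81.1 | 44.32 | 8.39 | 0.187 | 0.336 | 2.344 |
| | Synth+TENT | 0.104 | 86.6 | 30.71 | 38.50 | 0.051 | 0.108 | 0.882 |
| | Synth+DAE | 0.185 | 80.4 | 39.10 | 12.36 | 0.106 | 0.262 | 2.139 |
| | Synth+PL | 0.186 | 85.2 | 81.64 | 17.43 | 0.120 | 0.397 | 5.759 |
| | Synth+UPL | 0.179 | 60.8 | 32.42 | 19.43 | 0.154 | 0.155 | 0.268 |
| | Synth+DPL | 0.089 | 117.4 | 39.27 | 3.75 | 0.300 | 0.060 | 0.031 |
| Ensemble | Baseline | 0.000 | 242.8 | 43.79 | 2.00 | 0.000 | 0.000 | 0.000 |
| | Baseline+TTA | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+TENT | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DAE | 0.000 | 231.0 | 43.68 | 2.00 | 0.000 | 0.000 | 0.007 |
| | Baseline+PL | 0.000 | 198.9 | 43.79 | 2.11 | 0.024 | 0.000 | 0.000 |
| | Baseline+UPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Baseline+DPL | 0.000 | 256.0 | 43.79 | 2.11 | 0.000 | 0.000 | 0.000 |
| | Synth (Ours) | 0.370 | 60.4 | 19.68 | 7.50 | 0.234 | 0.339 | 0.256 |
| | Synth+TTA | 0.378 | 69.7 | 20.01 | 2.75 | 0.348 | 0.334 | 0.223 |
| | Synth+TENT | 0.111 | 114.8 | 32.23 | 3.21 | 0.235 | 0.089 | 0.047 |
| | Synth+DAE | 0.365 | 51.4 | 19.84 | 3.75 | 0.354 | 0.331 | 0.204 |
| | Synth+PL | 0.246 | 60.7 | 29.01 | 21.18 | 0.125 | 0.188 | 0.150 |
| | Synth+UPL | 0.016 | 218.2 | 42.52 | 5.18 | 0.131 | 0.009 | 0.000 |
| | Synth+DPL | 0.033 | 187.3 | 41.09 | 6.04 | 0.223 | 0.020 | 0.000 |
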
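Table 13: Mean results on the PLORAS dataset (N=661). Best score per column is shown in bold. FPR values are shown as a multiple of 1000.

| Modality | Model | Dice | HD95 | AVD | ALD | LF1 | TPR | FPR |
|---|---|---|---|---|---|---|---|---|
| T2w | Baseline | 0.028 | 80.3 | 112.82 | 12.22 | 0.101 | 0.064 | 9.638 |
| | Baseline+TTA | 0.020 | 79.1 | 92.69 | 11.02 | 0.103 | 0.045 | 8.409 |
| | Baseline+TENT | 0.005 | 101.5 | 30.79 | 5.12 | 0.020 | 0.007 | 1.849 |
| | Baseline+DAE | 0.044 | 82.7 | 204.49 | 2.64 | 0.276 | 0.123 | 14.983 |
| | Baseline+PL | 0.000 | 235.8 | 32.46 | 2.13 | 0.009 | 0.004 | 1.696 |
| | Baseline+UPL | 0.000 | 256.0 | 32.47 | 2.12 | 0.009 | 0.004 | 1.695 |
| | Baseline+DPL | 0.000 | 256.0 | 32.47 | 2.12 | 0.009 | 0.004 | 1.695 |
| | Synth (Ours) | 0.250 | 71.9 | 40.20 | 8.42 | 0.243 | 0.371 | 4.156 |
| | Synth+TTA | 0.253 | 70.5 | 37.00 | 4.75 | 0.329 | 0.361 | 3.917 |
| | Synth+TENT | 0.202 | 77.0 | 61.99 | 21.23 | 0.118 | 0.330 | 5.612 |
| | Synth+DAE | 0.291 | 58.7 | 23.68 | 4.28 | 0.307 | 0.312 | 2.969 |
| | Synth+PL | 0.162 | 81.0 | 33.28 | 8.58 | 0.179 | 0.186 | 3.737 |
| | Synth+UPL | 0.101 | 66.0 | 25.80 | 11.07 | 0.186 | 0.073 | 2.014 |
| | Synth+DPL | 0.053 | 72.1 | 28.44 | 3.42 | 0.212 | 0.036 | 1.832 |
| FLAIR | Baseline | 0.010 | 74.2 | 42.12 | 18.69 | 0.033 | 0.013 | 2.839 |
| | Baseline+TTA | 0.006 | 74.6 | 38.04 | 12.57 | 0.026 | 0.008 | 2.119 |
| | Baseline+TENT | 0.000 | 106.3 | 37.65 | 4.99 | 0.014 | 0.003 | 0.718 |
| | Baseline+DAE | 0.027 | 76.2 | 98.18 | 3.81 | 0.155 | 0.055 | 7.805 |
| | Baseline+PL | 0.000 | 249.9 | 39.25 | 2.46 | 0.000 | 0.002 | 0.582 |
| | Baseline+UPL | 0.000 | 256.0 | 39.25 | 2.53 | 0.000 | 0.002 | 0.582 |
| | Baseline+DPL | 0.000 | 256.0 | 39.25 | 2.53 | 0.000 | 0.002 | 0.582 |
| | Synth (Ours) | 0.296 | 72.1 | 48.04 | 17.93 | 0.137 | 0.415 | 3.443 |
| | Synth+TTA | 0.314 | 71.4 | 41.15 | 9.07 | 0.213 | 0.412 | 2.967 |
| | Synth+TENT | 0.221 | 79.4 | 48.21 | 9.87 | 0.140 | 0.280 | 3.362 |
| | Synth+DAE | 0.304 | 63.6 | 43.34 | 8.82 | 0.199 | 0.359 | 3.258 |
| | Synth+PL | 0.279 | 84.3 | 81.32 | 20.01 | 0.156 | 0.519 | 5.573 |
| | Synth+UPL | 0.145 | 78.5 | 31.51 | 15.50 | 0.115 | 0.108 | 1.028 |
| | Synth+DPL | 0.243 | 80.1 | 25.71 | 13.82 | 0.146 | 0.224 | 1.514 |
| CT | Baseline | 0.029 | 70.9 | 28.01 | 8.54 | 0.084 | 0.032 | 1.492 |
| | Baseline+TTA | 0.017 | 125.0 | 27.63 | 4.11 | 0.076 | 0.011 | 0.593 |
| | Baseline+TENT | 0.000 | 207.7 | 28.74 | 2.59 | 0.018 | 0.000 | 0.550 |
| | Baseline+DAE | 0.064 | 78.2 | 78.65 | 2.98 | 0.198 | 0.106 | 5.426 |
| | Baseline+PL | 0.011 | 78.3 | 79.93 | 2010.46 | 0.002 | 0.032 | 6.333 |
| | Baseline+UPL | 0.000 | 256.0 | 28.87 | 2.84 | 0.000 | 0.000 | 0.542 |
| | Baseline+DPL | 0.000 | 256.0 | 28.87 | 2.84 | 0.000 | 0.000 | 0.542 |
| | Synth (Ours) | 0.234 | 66.8 | 19.71 | 5.43 | 0.233 | 0.204 | 0.965 |
| | Synth+TTA | 0.229 | 85.2 | 20.34 | 2.65 | 0.317 | 0.195 | 0.913 |
| | Synth+TENT | 0.222 | 58.6 | 20.82 | 9.17 | 0.139 | 0.210 | 1.245 |
| | Synth+DAE | 0.307 | 51.6 | 23.81 | 4.13 | 0.258 | 0.298 | 1.657 |
| | Synth+PL | 0.000 | 110.0 | 26.69 | 34.56 | 0.002 | 0.001 | 0.714 |
| | Synth+UPL | 0.000 | 255.0 | 28.86 | 2.84 | 0.000 | 0.000 | 0.543 |
| | Synth+DPL | 0.000 | 256.0 | 28.87 | 2.84 | 0.000 | 0.000 | 0.542 |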
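
For readers who want to reproduce the voxel-wise columns of the tables above, the following sketch computes Dice, TPR and FPR from a pair of binary masks using their standard definitions. It is an illustration under stated assumptions, not the evaluation code used for the paper: the flat-list mask representation, the function names, and the convention of returning Dice = 1 when both masks are empty are choices made for this example, and the sketch returns the raw, unscaled FPR.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally long flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def voxel_metrics(pred, truth):
    """Return Dice, true-positive rate and false-positive rate."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    # Dice = 2TP / (2TP + FP + FN); the empty-vs-empty value is an assumption.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    # TPR (sensitivity) = TP / (TP + FN)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # FPR = FP / (FP + TN), left unscaled here
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"dice": dice, "tpr": tpr, "fpr": fpr}


if __name__ == "__main__":
    prediction = [1, 1, 0, 0, 1, 0, 0, 0]
    reference = [1, 0, 0, 0, 1, 1, 0, 0]
    print(voxel_metrics(prediction, reference))
```

Because lesions occupy a small fraction of the brain volume, TN dominates the FPR denominator and raw FPR values are very small, which is presumably why the tables report them on a rescaled axis.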

C Wilcoxon Significance Tests

Significance measures are provided below in Figures 7-17 as Wilcoxon signed-rank test values for pairwise comparisons between models on each dataset. ‘Oracle DA’ represents the hypothetical best-case scenario in which the optimal DA method is known a priori for each dataset/modality and applied to the baseline model. Median Dice values are shown along the diagonal.
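
The two ingredients of this comparison, oracle selection and the paired test, can be sketched as follows. This is a hedged illustration rather than the analysis code: selecting the oracle by mean Dice is an assumption (the paper may use a different criterion), the function names are hypothetical, and the test uses the large-sample normal approximation without tie or continuity correction, whereas statistical packages often use the exact distribution for small samples. The Dice values are the ARC T2w baseline rows of Table 11.

```python
import math

# Mean Dice of the baseline model with each DA method on ARC T2w (Table 11).
ARC_T2W_BASELINE = {
    "TTA": 0.023, "TENT": 0.000, "DAE": 0.056,
    "PL": 0.006, "UPL": 0.000, "DPL": 0.000,
}


def oracle_method(dice_by_method):
    """Pick the DA method with the highest score, as an oracle would."""
    return max(dice_by_method, key=dice_by_method.get)


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Returns (W, p) where W = min(W+, W-) and p comes from the normal
    approximation. Zero differences are discarded.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = float(min(w_plus, w_minus))
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w, p


if __name__ == "__main__":
    best = oracle_method(ARC_T2W_BASELINE)
    print("Oracle DA for ARC T2w:", best, ARC_T2W_BASELINE[best])
```

In use, `a` and `b` would hold per-subject Dice scores of two models on the same subjects, for example the Synth model and the oracle-adapted baseline. For ARC T2w the oracle choice by mean Dice is DAE (0.056), still well below the unadapted Synth model (0.299) in Table 11.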

Figure 7: Wilcoxon signed-rank test values for Dice metric measurements in the ATLAS T1w dataset.
Figure 8: Wilcoxon signed-rank test values for Dice metric measurements in the ARC T1w dataset.
Figure 9: Wilcoxon signed-rank test values for Dice metric measurements in the ARC T2w dataset.
Figure 10: Wilcoxon signed-rank test values for Dice metric measurements in the ARC FLAIR dataset.
Figure 11: Wilcoxon signed-rank test values for Dice metric measurements in the ISLES 2015 T1w dataset.
Figure 12: Wilcoxon signed-rank test values for Dice metric measurements in the ISLES 2015 T2w dataset.
Figure 13: Wilcoxon signed-rank test values for Dice metric measurements in the ISLES 2015 FLAIR dataset.
Figure 14: Wilcoxon signed-rank test values for Dice metric measurements in the ISLES 2015 DWI dataset.
Figure 15: Wilcoxon signed-rank test values for Dice metric measurements in the PLORAS T2w dataset.
Figure 16: Wilcoxon signed-rank test values for Dice metric measurements in the PLORAS FLAIR dataset.
Figure 17: Wilcoxon signed-rank test values for Dice metric measurements in the PLORAS CT dataset.

D Additional Qualitative Results

Figures 18-23 below show success and failure cases for each of the three modalities in the PLORAS dataset (T2w, FLAIR and CT).

Figure 18: Sample visualisations of successful cases in the PLORAS T2w dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.
Figure 19: Sample visualisations of failure cases in the PLORAS T2w dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.
Figure 20: Sample visualisations of successful cases in the PLORAS FLAIR dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.
Figure 21: Sample visualisations of failure cases in the PLORAS FLAIR dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.
Figure 22: Sample visualisations of successful cases in the PLORAS CT dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.
Figure 23: Sample visualisations of failure cases in the PLORAS CT dataset. Green indicates a true positive prediction, red a false positive, and blue a false negative.