Understanding differences in applying DETR to natural and medical images

Yanqi Xu1, Yiqiu Shen2, Carlos Fernandez-Granda1, Laura Heacock3, Krzysztof J. Geras2
1: Center for Data Science, New York University, New York, NY, USA; 2: NYU Grossman School of Medicine, New York, NY, USA; 3: NYU Langone Health, New York, NY, USA
Publication date: 2025/05/30
https://doi.org/10.59275/j.melba.2025-g137

Abstract

Natural images depict real-world scenes such as landscapes, animals, and everyday items. Transformer-based detectors, such as the Detection Transformer (DETR), have demonstrated strong object detection performance on natural image datasets. These models are typically optimized through complex engineering strategies tailored to the characteristics of natural scenes. However, medical images differ significantly from natural images and present unique challenges, such as high resolutions, smaller and fewer regions of interest, and subtle inter-class differences. In this study, we evaluated the effectiveness of common design choices in transformer-based detectors when applied to medical imaging. Using two representative datasets, a mammography dataset and a chest CT dataset, we showed that common design choices proposed for natural images, including complex encoder architectures, multi-scale feature fusion, query initialization, and iterative bounding box refinement, fail to improve object detection performance and can even be detrimental to it. In contrast, simpler and shallower architectures often achieve equal or superior results at lower computational cost. These findings highlight that standard design practices need to be reconsidered when adapting transformer models to medical imaging, and suggest that simplicity may be more effective than added complexity in this domain. Our model code and weights are publicly available at https://github.com/nyukat/Mammo-DETR.

Keywords

Machine Learning · Vision Transformers · Object Detection · Breast Cancer

Bibtex

@article{melba:2025:009:xu,
  title = "Understanding differences in applying DETR to natural and medical images",
  author = "Xu, Yanqi and Shen, Yiqiu and Fernandez-Granda, Carlos and Heacock, Laura and Geras, Krzysztof J.",
  journal = "Machine Learning for Biomedical Imaging",
  volume = "3",
  issue = "May 2025 issue",
  year = "2025",
  pages = "152--170",
  issn = "2766-905X",
  doi = "https://doi.org/10.59275/j.melba.2025-g137",
  url = "https://melba-journal.org/2025:009"
}

RIS

TY  - JOUR
AU  - Xu, Yanqi
AU  - Shen, Yiqiu
AU  - Fernandez-Granda, Carlos
AU  - Heacock, Laura
AU  - Geras, Krzysztof J.
PY  - 2025
TI  - Understanding differences in applying DETR to natural and medical images
T2  - Machine Learning for Biomedical Imaging
VL  - 3
IS  - May 2025 issue
SP  - 152
EP  - 170
SN  - 2766-905X
DO  - https://doi.org/10.59275/j.melba.2025-g137
UR  - https://melba-journal.org/2025:009
ER  -
