Breast Density FL

Organized by challenge-organizer

First phase

June 15, 2022, midnight UTC


Competition Ends
Aug. 31, 2022, 1:26 p.m. UTC


You are NOT at the main challenge submission site. Please navigate to the main challenge site to participate in the challenge.

  • is now pointing to the main challenge site:
  • (you are here now) was where the challenge was mainly being hosted.
  • You can continue to use the forum here.


The correct interpretation of breast density is important in the assessment of breast cancer risk. Furthermore, identifying dense breasts may help stratify patients who may have masked cancers and may benefit from additional imaging. However, because imaging characteristics differ between mammography systems, models built using data from one system have been shown not to generalize to other systems. Distributed learning techniques, including federated learning, are increasingly popular approaches for learning from multi-institutional datasets without the need for data sharing. However, the optimal techniques for distributed learning, especially in the setting of heterogeneous data, are still an active area of research. Open questions include the best approaches for model training (e.g. the use of batch vs. group vs. instance norms), model aggregation (FedAvg, FedProx, etc.) and workflow (e.g. federated averaging vs. cyclical weight transfer vs. split learning).

We will use a large dataset of digital mammograms from the Digital Mammographic Imaging Screening Trial (DMIST), acquired from 33 institutions. This is a rich dataset with high-quality labels, consisting of over 100,000 images from over 21,000 patients. The goal of the challenge is to develop the best, most generalizable models for breast density estimation using distributed/federated learning. The algorithm should be unbiased (i.e. it should work well for all populations and ethnicities) and generalizable (i.e. it should work on data acquired on multiple scanners), achieving high performance in aggregate and for individual sites relative to centrally hosted data.


Organizing team (names and affiliations):

  • Jayashree Kalpathy-Cramer - MGB/Harvard Medical School
  • Holger Roth - NVIDIA
  • Keyvan Farahani - National Cancer Institute, National Institutes of Health
  • Laura Coombs - American College of Radiology
  • Kendall Schmidt - American College of Radiology Data Science Institute
  • Benjamin Bearce - QTIM (Quantitative Translational Imaging in Medicine) Lab at MGB
  • Ken Chang - QTIM (Quantitative Translational Imaging in Medicine) Lab at MGB
  • Kristopher Kersten - NVIDIA

Primary Contact:


The results of the challenge are in and the winners have been announced:

1st Place: MarawanElbatel, KaoutherMouheb, Joaquin Seia, Robert Marti (University of Girona, Spain)

Runner-Up: Ruipeng Zhang, Yao Zhang, Ziqing Fan, Jiangchao Yao, Ya Zhang, Yanfeng Wang (Shanghai Jiao Tong University & Shanghai AI Laboratory, China)

Final leaderboard (Phase II): [table not preserved; teams were ordered by Avg. Rank]

We congratulate the winners and thank all participants for making this challenge a success!

Presentation slides and video recordings from the Challenge Event at MICCAI'22 are available below (see event agenda).



Challenge Schedule:

  • April 1, 2022: Challenge website is open
  • May 4, 2022: The reference implementation using NVFlare and MONAI is available
  • June 15, 2022: Submissions open for training & validation
  • August 7, 2022: Submission site opens
  • NEW! September 5, 2022 (extended from August 31): Deadline for final submission using the submission site (23:59 AOE)
  • September 18, 2022: Winners announced at MICCAI 2022, Singapore [8:00 AM to 10:00 AM (SGT time)]

Training Leaderboard

Challenge Event Agenda @MICCAI 2022:

Time: Sep 18 / 08:00 AM to 10:00 AM (SGT time)

8:00-8:40   Intro by Prof. Jayashree Kalpathy-Cramer & Overview by Dr. Holger Roth
              Intro: slides
              Overview: slides

8:40-9:00   Short videos by finalists (incl. time for Q&A at the end)
              Winner: MarawanElbatel et al. video
              Runner-up: Ruipeng Zhang et al. video
              Finalist: Conrad Testagrose et al. video
              Finalist: Yi Qin video
              Finalist: Hu Yaojun et al. video

9:00-9:20   Ranking analysis, 3-minute video from NVIDIA, winner announcement & GPU award
              Ranking: slides
              Message from NVIDIA: video

9:20-9:30   Discussion & closing remarks

9:30-10:00  Morning coffee break


Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, Conant EF, Fajardo LL, Bassett L, D'Orsi C, Jong R, Rebner M; Digital Mammographic Imaging Screening Trial (DMIST) Investigators Group. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med. 2005 Oct 27;353(17):1773-83. doi: 10.1056/NEJMoa052911. Epub 2005 Sep 16. Erratum in: N Engl J Med. 2006 Oct 26;355(17):1840. PMID: 16169887. Describes the dataset.

Chang K, Beers AL, Brink L, Patel JB, Singh P, Arun NT, Hoebel KV, Gaw N, Shah M, Pisano ED, Tilkin M, Coombs LP, Dreyer KJ, Allen B, Agarwal S, Kalpathy-Cramer J. Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density. J Am Coll Radiol. 2020 Dec;17(12):1653-1662. doi: 10.1016/j.jacr.2020.05.015. Epub 2020 Jun 24. PMID: 32592660. Describes prior work in developing a breast density model for this dataset.

A further study described a federated learning activity using this and other datasets.

Further comments:

We believe that it is important to provide an opportunity for distributed learning challenges where multiple
approaches for learning from institutional datasets in a secure fashion (exchanging model parameters, not images)
are explored. These could include traditional approaches including federated averaging or other approaches
including cyclical weight transfer and split learning.
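
To make the idea of exchanging model parameters rather than images concrete, here is a minimal sketch of the aggregation step in federated averaging. It is an illustration only, not the challenge's reference implementation: the function name `federated_average`, the dictionary layout of the weights, and the three example sites are all assumptions made for this example.

```python
from typing import Dict, List

# Assumed layout: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Three hypothetical sites; only parameters leave each site, never images.
site_weights = [
    {"layer.weight": [1.0, 2.0]},
    {"layer.weight": [3.0, 4.0]},
    {"layer.weight": [5.0, 6.0]},
]
site_sizes = [100, 100, 200]
global_weights = federated_average(site_weights, site_sizes)
```

In a full training loop the server would send `global_weights` back to every site for the next round of local training. Cyclical weight transfer and split learning organize the exchange differently and are not covered by this sketch.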


NVIDIA will be sponsoring a GPU award for the challenge winner!

Challenge Evaluation

Docker submission on Medici platform:

Data are not available to the participants. They will be allowed to interact with the data only through docker containers. 

The DMIST dataset provided for training has only been available on request from ACR. We will note in the challenge rules that participants who requested access to DMIST in the past will not be allowed to use their copy of DMIST for any step of their training pipeline.

We allow users to utilize public datasets, either for pretraining or included in their docker containers, e.g. for regularization of the global model. Participants will be required to explain how they use any public data in their method description (similar to other challenges, e.g. COVID-19-20). The ultimate goal of the challenge is to find a good FL method for breast density classification. Therefore, we aim to compare different approaches to this overall goal rather than particular aspects of a method. The use of publicly available data does not circumvent the constraints of the FL setting.

We will provide an example docker using MONAI and NVIDIA FLARE as a baseline benchmark, but participants can use any framework. We will provide detailed instructions about the inputs and outputs required for the docker. The dataset for each phase will be set up on 3 VMs to simulate real-life federated learning. The dataset on each VM will be substantially different in terms of image properties (the data are non-IID).

Runs are unlimited during the training phase, subject to a limit of 3 submissions per day. During the test phase, only the last run is officially used for the final scoring.

Implementation & Hardware Resources

  • The reference implementation using NVFlare and MONAI is available
  • Each client machine contains a single 16GB NVIDIA Tesla V100 GPU.


Code Availability:

The script used for evaluating different submissions is available here.

The open source toolkit will be used for the evaluation. We recommend that all participants make their code publicly available and open-source. In order to win, the code must be open source.

Conflicts of Interest:

Only ACR and Jayashree Kalpathy-Cramer will have access to the test labels.


Metric(s):

  • Overall kappa
  • Kappa for each institution

Kappa is commonly used to evaluate breast density algorithms. We will measure kappa for the overall test set as well as for the differently sized test sets from the 3 institutions, and aggregate ranks will be used in the final consensus ranking.

Kappa and the distance-based metric are commonly used measures for the task of estimating 4-class breast density. Particularly for the scenario of multi-institutional data and federated learning, we are interested in both overall performance and per-institution performance, since there is heterogeneity in the data. Thus, we propose an aggregate rank as the final ranking measure.

e.g. Lehman et al., 2018; Chang et al., 2020
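
As a rough illustration of the kappa metric, the sketch below computes Cohen's kappa for density grades 1 to 4 in plain Python. The text does not say which kappa variant the official evaluation script uses, so the optional linear weighting here is an assumption, as are the function name `cohen_kappa` and the example labels.

```python
import math
from typing import Sequence


def cohen_kappa(truth: Sequence[int], pred: Sequence[int],
                n_classes: int = 4, linear_weights: bool = False) -> float:
    """Cohen's kappa for labels in 1..n_classes, optionally linearly weighted."""
    if len(truth) != len(pred) or len(truth) == 0:
        raise ValueError("truth and pred must be non-empty and equally long")
    n = len(truth)

    # Joint distribution of (true grade, predicted grade).
    joint = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        if not (1 <= t <= n_classes and 1 <= p <= n_classes):
            raise ValueError("labels must lie in 1..n_classes")
        joint[t - 1][p - 1] += 1.0 / n

    truth_marginal = [sum(row) for row in joint]
    pred_marginal = [sum(joint[i][j] for i in range(n_classes))
                     for j in range(n_classes)]

    def penalty(i: int, j: int) -> float:
        # Unweighted: every error costs 1. Linear: cost grows with grade distance.
        if linear_weights:
            return abs(i - j) / (n_classes - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(n_classes) for j in range(n_classes)]
    observed = sum(penalty(i, j) * joint[i][j] for i, j in cells)
    expected = sum(penalty(i, j) * truth_marginal[i] * pred_marginal[j]
                   for i, j in cells)
    if expected == 0.0:
        return math.nan  # undefined when chance disagreement is zero
    return 1.0 - observed / expected


# Hypothetical labels: one near miss (4 called 3) versus one far miss (4 called 1).
truth = [1, 2, 3, 4]
near = cohen_kappa(truth, [1, 2, 3, 3], linear_weights=True)
far = cohen_kappa(truth, [1, 2, 3, 1], linear_weights=True)
```

With linear weights the near miss scores higher than the far miss, whereas the unweighted variant treats both errors the same. Running the function once on the full test set and once per institution gives the overall and per-institution values described above.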

Ranking method(s):

A first set of four ranks will be computed for all 4 tasks (overall kappa as well as kappa for the test sets of institutions 1, 2 and 3).

Furthermore, BI-RADS breast density grades are ordered, i.e. if the density grades are 1, 2, 3, 4, calling a 4 a 1
should be worse than calling a 4 a 3. Therefore, we will also use a per-image distance metric that penalizes
predictions more heavily the further they are from the truth. Again, this will be evaluated on the global and
institutional test sets. With the image-based metric, we will use the challengeR toolkit to do statistical ranking
across these four tasks.
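
A per-image distance of this kind could look like the sketch below. The exact formula and sign convention of the official metric are not given in this text, so the absolute grade difference used here is an assumption, and `per_image_distance` and `mean_distance` are hypothetical helper names.

```python
from typing import List, Sequence


def per_image_distance(truth: Sequence[int], pred: Sequence[int]) -> List[int]:
    """Absolute difference between true and predicted grade for each image."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must be equally long")
    return [abs(t - p) for t, p in zip(truth, pred)]


def mean_distance(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Average per-image distance; lower is better under this assumed convention."""
    distances = per_image_distance(truth, pred)
    if not distances:
        raise ValueError("need at least one image")
    return sum(distances) / len(distances)


# Calling a 4 a 1 costs 3, calling a 4 a 3 costs 1.
example = per_image_distance([4, 4], [1, 3])
```

Keeping the per-image values, rather than only their mean, matters because statistical ranking tools work on the distribution of per-case scores.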

For simplicity, the final consensus ranking will be based on all eight tasks, equally-weighted, based on the mean
ranks using the Euclidean distance. See for details.
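
The sketch below shows the basic idea of an equally weighted mean-rank aggregation. It is a simplification and does not reproduce the challengeR toolkit or its statistical ranking. The team names, the scores, and the helper names `rank_descending` and `mean_rank` are made up for illustration, and only two tasks are shown instead of eight.

```python
from typing import Dict


def rank_descending(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank teams by score (1 = best); tied teams share the average rank."""
    ordered = sorted(scores, key=lambda team: scores[team], reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for team in ordered[i:j + 1]:
            ranks[team] = average_rank
        i = j + 1
    return ranks


def mean_rank(task_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each team's rank over all tasks, with equal task weights."""
    per_task = [rank_descending(scores) for scores in task_scores.values()]
    teams = list(per_task[0])
    return {team: sum(r[team] for r in per_task) / len(per_task) for team in teams}


# Hypothetical kappa scores for three teams on two of the tasks.
scores = {
    "overall_kappa": {"A": 0.8, "B": 0.7, "C": 0.6},
    "site1_kappa": {"A": 0.5, "B": 0.7, "C": 0.6},
}
consensus = mean_rank(scores)
```

Here team B has the lowest mean rank and would lead. For a metric where lower values are better, the scores would need to be negated before calling `rank_descending`.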

Our submission process provides error handling for missing and duplicate rows in the submission. Each case will
be machine-readable. Therefore, we expect each trained model to be able to produce a prediction for each case. If,
for any reason, a case is missing a prediction, we will assign a random prediction (one of the four classes) so that
Kappa scores can still be computed for all images. For the distance-based metric, the worst value (furthest away
from the truth) will be chosen.
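
The handling of incomplete submissions for the kappa computation might be sketched as follows. Filling a missing case with a random class follows the description above. Keeping the first of several duplicate rows is an assumption, since the text does not say how duplicates are resolved, and `complete_predictions` is a hypothetical name.

```python
import random
from typing import Dict, Iterable, List, Tuple

CLASSES = (1, 2, 3, 4)  # the four breast density grades


def complete_predictions(case_ids: Iterable[str],
                         rows: List[Tuple[str, int]],
                         rng: random.Random) -> Dict[str, int]:
    """Return exactly one prediction per expected case."""
    predictions: Dict[str, int] = {}
    for case_id, label in rows:
        # Assumption: when a case appears more than once, keep the first row.
        predictions.setdefault(case_id, label)

    completed: Dict[str, int] = {}
    for case_id in case_ids:
        if case_id in predictions:
            completed[case_id] = predictions[case_id]
        else:
            # Missing case: assign a random class so kappa can still be computed.
            completed[case_id] = rng.choice(CLASSES)
    return completed


# Hypothetical submission with a duplicate row for "a" and no row for "b".
result = complete_predictions(["a", "b"], [("a", 2), ("a", 3)], random.Random(0))
```

Rows for case identifiers that are not in the expected list are ignored in this sketch.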

The goal is to have an algorithm that performs well on all institutions as well as the entire population.

Further analyses:

We will conduct thorough post-submission analysis to examine

  1. Biases with respect to ethnicity (this information is not provided to the participants)
  2. Biases with respect to breast size (potential confounder)
  3. Inter-algorithm agreement and variability to identify "difficult" cases
  4. Ranking variability
  5. Generalization - The test set will include data from previously unseen institutions to evaluate the robustness of
    the final trained model.


Challenge Rules:

  • Top-performing methods will be announced publicly; the leaderboard will be available, but team names can be pseudonyms.
  • Top teams will be invited to participate in the overall challenge manuscript. Participants may publish their results after the primary challenge paper has been published.
  • Winners should make their models/best approaches available, preferably within AI-Lab or MONAI.
  • Each team is expected to be working on independent approaches.
  • Members of the organizers' institutes may participate in the challenge but are not eligible for awards and are not listed in the final leaderboard.
  • We will not take and reuse any participant's model/algorithm without prior consent.
  • All uploaded docker images will be subject to security scans to check for vulnerabilities.
  • All docker images and code must be zipped in .zip format.
  • Submissions must be directly aligned with the challenge task. Only computations related to the task are allowed.
  • Challenge organizers reserve the right to revoke a participant from the challenge without notice. All decisions by the organizers are final.
  • Submissions will be restricted to 3 per day.
  • The runtime of submissions will be limited to 8 hours (including training and inference).
  • Additional details and rules are in the "Challenge Evaluation" section of this website.


Start: June 15, 2022, midnight

Description: Training phase: Train models in preparation for the test phase.


Start: Aug. 31, 2022, midnight

Description: Test phase: Perform Inference on Test Data.
