Main

As artificial intelligence (AI) models are increasingly deployed in real-world clinical settings1,2, it is crucial to evaluate not only model performance but also potential biases toward specific demographic groups3,4. Although deep learning has achieved human-level performance in numerous medical imaging tasks5,6, prior literature indicates that these models tend to reproduce biases present in the data, causing performance disparities between protected subgroups7,8,9,10,11. For instance, chest X-ray (CXR) classifiers trained to predict the presence of disease systematically underdiagnose Black patients12, potentially leading to delays in care. To ensure the responsible and equitable deployment of such models, it is essential to understand the source of such biases and, where feasible, take actions to correct them13,14.

Recent studies have unveiled the surprising ability of deep models to predict demographic information, such as self-reported race15, sex and age16, from medical images, achieving performance far beyond that of radiologists. These findings raise the concern that disease prediction models may leverage demographic features as heuristic ‘shortcuts’17,18—correlations that are present in the data but have no real clinical basis18—much as deep models have been shown to use the hospital source as a shortcut for disease prediction19,20.

In this work, we investigated four questions. First, we asked whether disease classification models also use demographic information as shortcuts and whether such demographic shortcuts result in biased predictions. Second, we evaluated the extent to which state-of-the-art methods can remove such shortcuts and create ‘locally optimal’ models that are also fair. Third, we considered real-world clinical deployment settings where shortcuts may not be valid in the out-of-distribution (OOD) data, dissecting the interplay between algorithmic fairness and shortcuts when the data shift. Finally, we explored which algorithms and model selection criteria can lead to ‘globally optimal’ models that maintain fairness when deployed in an OOD setting.

We performed a systematic investigation into how medical AI leverages demographic shortcuts, addressing these questions with an emphasis on fairness disparities across both in-distribution (ID) training sets and external test sets. Our primary focus is on CXR prediction models, with further validation in dermatology (Extended Data Fig. 1) and ophthalmology (Extended Data Fig. 2). Our X-ray analysis draws upon six extensive, international radiology datasets: MIMIC-CXR21, CheXpert22, NIH23, SIIM24, PadChest25 and VinDr26. We explored fairness within both individual and intersectional subgroups spanning race, sex and age12. Our assessment uncovers compelling new insights into how medical AI encodes demographics and the impact that this has on various fairness considerations, especially when models are applied outside their training context during real-world domain shifts, and it yields actionable guidance on which models to select for fairness under distribution shift.

Results

Datasets and model training

We used six publicly available CXR datasets, as described in Table 1. We focused on four binary classification tasks that have been shown to have disparate performance between protected groups7,27: ‘No Finding’, ‘Effusion’, ‘Pneumothorax’ and ‘Cardiomegaly’. The detailed prevalence rates of the diseases for each demographic subgroup are shown in Extended Data Table 1.

Table 1 Demographic and label characteristics of the six X-ray datasets used in this study

We also examined medical AI applications in dermatology and ophthalmology. Specifically, we used the ISIC dataset28 with ‘No Finding’ as the task for dermatological imaging (Extended Data Fig. 1a) and the ODIR dataset29 with ‘Retinopathy’ as the task for ophthalmology images (Extended Data Fig. 2a).

To evaluate fairness, we examined the class-conditioned error rate that is likely to lead to worse patient outcomes for a screening model. For ‘No Finding’, a false positive indicates falsely predicting that a patient is healthy when they are ill, which could lead to delays in treatment12; we, therefore, evaluated the differences in false-positive rate (FPR) between demographic groups. For all other diseases, we evaluated the false-negative rate (FNR) for the same reason. Equality in these metrics is equivalent to equality of opportunity30. We chose to study fairness through the notion of equalized odds, as it has been widely used in previous work in the CXR and fairness literature7,12. In addition, shortcut learning using a particular demographic attribute leads to differences in class-conditioned error rates (that is, FPR and FNR gaps) across attributes31,32, and so studying these gaps allows us to glean insight into the severity of shortcut learning. Finally, FPR and FNR (which equalized odds enforces to be equal across groups) are meaningful metrics in the clinical setting, as they correspond to error rates of decision-making at the individual level12.
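As a concrete illustration of this metric, the minimal sketch below computes the class-conditioned error-rate gap between demographic subgroups; the function and variable names are illustrative and are not taken from the study code.

```python
import numpy as np

def error_rate_gap(y_true, y_pred, group, use_fpr):
    """Equality-of-opportunity gap between demographic subgroups.

    use_fpr=True  -> compare false-positive rates (used for 'No Finding',
                     where a false positive calls an ill patient healthy).
    use_fpr=False -> compare false-negative rates (used for disease labels).
    """
    rates = []
    for g in np.unique(group):
        in_group = group == g
        if use_fpr:
            negatives = in_group & (y_true == 0)           # patients without the label
            rates.append(np.mean(y_pred[negatives] == 1))  # subgroup FPR
        else:
            positives = in_group & (y_true == 1)           # patients with the label
            rates.append(np.mean(y_pred[positives] == 0))  # subgroup FNR
    return max(rates) - min(rates)

# For example, the FPR gap for 'No Finding' across age groups:
# gap = error_rate_gap(y_true, y_pred, age_group, use_fpr=True)
```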

To understand and quantify the types and degrees of distribution shifts in our study, we examined whether there are significant statistical differences in distributions between demographic groups in the ID settings as well as across different datasets in the OOD settings. Specifically, we analyzed prevalence shifts P(Y|A) and representation shifts P(X|A) across different subgroups for ID scenarios and added label shifts P(Y) and covariate shifts P(X) for OOD scenarios (Methods). Our analyses indicate that all the distributions that we examined show statistically significant shifts, affecting most demographic groups in the ID context (Extended Data Table 2) and across various sites in the OOD context (Extended Data Table 3). We note that our analysis does not presuppose specific types of distribution shifts; instead, we simulated real-world deployment conditions where any of these shifts might occur, aiming for results that are generalizable to complex, real-world scenarios.

We trained a grid of deep convolutional neural networks33 on MIMIC-CXR (radiology), CheXpert (radiology), ODIR (ophthalmology) and ISIC (dermatology), varying the classification task. Our approach follows previous work that achieves state-of-the-art performance in these tasks8,12 using empirical risk minimization (ERM)34. We also evaluated algorithms designed to remove spurious correlations or increase model fairness during training. We categorized these algorithms into those that (1) reweight samples based on their group to combat underrepresentation (ReSample35 and GroupDRO36); (2) adversarially remove group information from model representations (DANN37 and CDANN38); and (3) more generically attempt to improve model generalization—that is, the exponential moving average (MA39). In total, our analysis encompassed 3,456 models trained on MIMIC-CXR, corresponding to the Cartesian product of four tasks, four demographic attributes, six algorithms, 12 hyperparameter settings and three random seeds. We summarized our experimental pipeline in Fig. 1.

Fig. 1: Overall experimental pipeline.
figure 1

a, We trained a grid of deep learning models on medical images from a variety of modalities on several clinical tasks. We applied a variety of state-of-the-art algorithms to mitigate shortcuts, for up to four demographic attributes (where available). b, We evaluated each model ID (that is, on the same dataset on which it was trained), along the axes of performance, fairness, amount of demographic information encoded and calibration. c, We evaluated the performance and fairness of CXR classification models on OOD domains. To mimic a realistic deployment setting where OOD samples are not observed, we chose the ‘best’ model based on several ID selection criteria.

Algorithmic encoding of attributes leads to fairness gaps

We separately trained deep learning models for our four distinct CXR prediction tasks (‘No Finding’, ‘Cardiomegaly’, ‘Effusion’ and ‘Pneumothorax’) as well as ‘Retinopathy’ in ophthalmology and ‘No Finding’ in dermatology. Each model consists of a feature extractor followed by a disease prediction head. We then employed a transfer learning approach, wherein we kept the weights of the feature extractor frozen and retrained the model to predict sensitive attributes (for example, race). This allowed us to assess the amount of attribute-related information present in the features learned by each model, as measured by the area under the receiver operating characteristic curve (AUROC) for attribute prediction (Methods). Previous work15,40 demonstrated that deep models trained for disease classification encode demographic attributes and that such encoding could lead to algorithmic bias41. We extend this investigation to a broader array of datasets, attributes and imaging modalities. As Fig. 2a,c,e confirms, the penultimate layer of different disease models contains significant information about four demographic attributes (age, race, sex and the intersection of sex and race), and this encoding is consistent across different tasks and medical imaging modalities.

Fig. 2: Medical imaging models encode sensitive attributes and are unfair across subgroups.
figure 2

a, The AUROC of demographic attribute prediction from frozen representations for the best ERM model. We trained ERM models on MIMIC-CXR to predict four different binary tasks. ERM representations encode demographic attributes to a high degree. b, The fairness gap, as defined by the FPR gap for ‘No Finding’, and the FNR gap for all other tasks for the best ERM model. ERM models exhibit high fairness gaps, especially between age groups. c, The AUROC of demographic attribute prediction from frozen representations for the best ERM model on the ODIR dataset (ophthalmology), following the same experimental setup. d, The fairness gap for the best ERM model on the ODIR dataset (ophthalmology). e, The AUROC of demographic attribute prediction from frozen representations for the best ERM model on the ISIC dataset (dermatology), following the same experimental setup. f, The fairness gap for the best ERM model on the ISIC dataset (dermatology). a–f, Each bar and its error bar indicate the mean and standard deviation across three independent runs. g, The correlation between attribute prediction performance and fairness for all learned models. We excluded models with suboptimal performance—that is, with an overall validation AUROC below 0.7. The attribute prediction AUROC shows a high correlation with the fairness gap (‘No Finding’, age: R = 0.82, P = 4.7 × 10−8; ‘No Finding’, sex and race: R = 0.81, P = 8.4 × 10−9; ‘Cardiomegaly’, age: R = 0.81, P = 1.9 × 10−7; ‘Effusion’, race: R = 0.71, P = 6.4 × 10−6; ‘Pneumothorax’, sex: R = 0.59, P = 2.3 × 10−3; all using two-sided t-test). The center line and the shadow denote the mean and 95% CI, respectively.

We then assessed the fairness of these models across demographic subgroups as defined by equal opportunity30—that is, discrepancies in the model’s FNR or FPR for demographic attributes. We focused on underdiagnosis12—that is, discrepancies in FPR for ‘No Finding’ and discrepancies in FNR for other diseases. For each demographic attribute, we identified two key subgroups with sufficient sample sizes: age groups ‘80–100’ (n = 8,063) and ‘18–40’ (n = 7,319); race groups ‘White’ (n = 32,732) and ‘Black’ (n = 8,279); sex groups ‘female’ (n = 25,782) and ‘male’ (n = 27,794); and sex and race groups ‘White male’ (n = 18,032) and ‘Black female’ (n = 5,027). In all tasks, we observed that the models displayed biased performance within the four demographic attributes, as evidenced by the FPR and FNR disparities (Fig. 2b). The observed gaps can be as large as 30% for age. The same results hold for the other two imaging modalities (Fig. 2d,f). Similar results for overdiagnosis (FNR of ‘No Finding’ and FPR for disease prediction) can be found in Extended Data Fig. 3.

We further investigated the degree to which demographic attribute encoding ‘shortcuts’ may impact model fairness. Previous work showed that, when models use demographic variables as shortcuts, they can exhibit gaps in subgroup FPR and FNR31,40. We note that a model encoding demographic information does not necessarily imply a fairness violation, as the model may not necessarily use this information for its prediction. For each task and attribute combination, we trained different models with varying hyperparameters (Methods). We focused on the correlation between the degree of encoding of different attributes and the fairness gaps as assessed by underdiagnosis. Figure 2g shows that a stronger encoding of demographic information is significantly correlated with stronger model unfairness (‘No Finding’, age: R = 0.82, P = 4.7 × 10−8; ‘No Finding’, sex and race: R = 0.81, P = 8.4 × 10−9; ‘Cardiomegaly’, age: R = 0.81, P = 1.9 × 10−7; ‘Effusion’, race: R = 0.71, P = 6.4 × 10−6; ‘Pneumothorax’, sex: R = 0.59, P = 2.3 × 10−3; all using two-sided t-test). Such consistent observations indicate that models using demographic encodings as heuristic shortcuts also have larger fairness disparities, as measured by discrepancies in FPR and FNR.

Mitigating shortcuts creates locally optimal models

We performed model evaluations first in the ID setting, where ERM models trained and tested on data from the same source performed well. We compared ERM to state-of-the-art robustness methods designed to address fairness gaps while maintaining overall performance. As shown in Fig. 3a, ERM models exhibited large fairness gaps across age groups when predicting ‘Cardiomegaly’ (that is, models centered in the top right corner, with an FNR gap of 20% between groups ‘80–100’ and ‘18–40’). By applying data rebalancing methods that address prevalence shifts during training (for example, ReSample), we observed reduced fairness gaps in certain contexts. By applying debiasing robustness methods that correct demographic shortcuts, such as GroupDRO and DANN, the resulting models were able to close the FNR gap while achieving similar AUROCs (that is, models in the bottom right corner). Our results hold when using the worst-group AUROC as the performance metric (Extended Data Fig. 4) and across different combinations of diseases and attributes (Fig. 3b and Extended Data Fig. 4).

Fig. 3: Algorithms for removing demographic shortcuts mitigate ID fairness gaps and maintain performance.
figure 3

a, Tradeoff between the fairness gap and overall AUROC for all trained models, for ‘Cardiomegaly’ prediction using ‘age’ as the attribute. We plotted the Pareto front—the best achievable fairness gap with a minimum constraint on the performance. b, Tradeoff between the fairness gap and overall AUROC for all trained models, with more disease prediction tasks and attributes. c, Tradeoff between the fairness gap and the overall AUROC on the ODIR dataset (ophthalmology). d, Tradeoff between the fairness gap and the overall AUROC on the ISIC dataset (dermatology).

To demonstrate the value of model debiasing, we further plotted the set of ‘locally optimal models’—those on the Pareto front42 that best balance the performance–fairness tradeoff on ID data (Fig. 3a). Models that lie on this front are ‘locally optimal’, as they achieve the smallest fairness gap possible under a fixed performance constraint (for example, AUROC > 0.8). In the ID setting, we found several existing algorithms that consistently achieve high ID fairness without notable loss in overall disease prediction performance (Fig. 3a,b and Extended Data Fig. 4).
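A minimal sketch of how such a Pareto front over (overall AUROC, fairness gap) can be extracted from a grid of trained models is shown below; the dictionary keys are hypothetical placeholders rather than names from the study code.

```python
def pareto_front(models):
    """Keep models that are not dominated on (higher AUROC, lower fairness gap).

    `models` is a list of dicts with 'auroc' and 'fairness_gap' entries; a model
    is Pareto-optimal if no other model is at least as good on both criteria and
    strictly better on one of them.
    """
    front = []
    for m in models:
        dominated = any(
            o["auroc"] >= m["auroc"] and o["fairness_gap"] <= m["fairness_gap"]
            and (o["auroc"] > m["auroc"] or o["fairness_gap"] < m["fairness_gap"])
            for o in models
        )
        if not dominated:
            front.append(m)
    # Sorting by AUROC traces the best achievable fairness gap at each performance level.
    return sorted(front, key=lambda m: m["auroc"])
```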

Similar to our observations in radiology, we identified fairness gaps within subgroups based on age and sex in dermatology and ophthalmology, respectively (Fig. 2d,f). We further verified the Pareto front for both attributes, where similar observations hold that algorithms for fixing demographic shortcuts could improve ID fairness while incurring minimal detriments to performance (Fig. 3c,d). The steepness of the Pareto front suggests that small sacrifices in performance could yield substantial gains in fairness.

Locally optimal models exhibit tradeoffs in other metrics

We examined how locally optimal models that balance fairness and AUROC impact other metrics, as previous work showed that it is theoretically impossible to simultaneously satisfy fairness measured by probabilistic equalized odds and calibration by group43. We found that optimizing fairness alone leads to worse results for other clinically meaningful metrics in some cases, indicating an inherent tradeoff between fairness and these metrics. First, for the ‘No Finding’ prediction task, enforcing fair predictions across groups results in a larger expected calibration error (ECE) gap between groups (Extended Data Fig. 5a). Across different demographic attributes, we found a consistent, statistically significant negative correlation between the ECE gap and the fairness gap (age: R = −0.85, P = 7.5 × 10−42; race: R = −0.64, P = 6.1 × 10−15; sex: R = −0.73, P = 4.4 × 10−28; sex and race: R = −0.45, P = 1.9 × 10−8; all using two-sided t-test).

We explored the relationship between fairness and other metrics, including average precision and average F1 score. For ‘No Finding’ prediction, fairer models have both worse average precision and worse average F1 score (Extended Data Fig. 5a). The same trend holds across different diseases—for example, for ‘Effusion’ (Extended Data Fig. 5b). These findings show that such models, although locally optimal, perform worse on other important and clinically relevant metrics. This uncovers the limitation of blindly optimizing for fairness, emphasizing the necessity for more comprehensive evaluations to ensure the reliability of medical AI models.

Local fairness does not transfer under distribution shift

When deploying AI models in real settings, it is crucial to ensure that models can generalize to data from unseen institutions or environments. We directly tested all trained models in the OOD setting, where we report results on external test datasets that were unseen during model development. Figure 4 illustrates that the correlation between ID and OOD performance is high across different settings, consistent with previous work44. However, we found that there was no consistent correlation between ID and OOD fairness. For example, Fig. 4b shows an instance where the correlation between ID fairness and OOD fairness is strongly positive (‘Effusion’ with ‘age’ as the attribute; R = 0.98, P = 3.0 × 10−36, two-sided t-test), whereas Fig. 4c shows an instance where the correlation between these metrics is actually significantly negative (‘Pneumothorax’ with ‘sex and race’ as the attribute; R = −0.50, P = 4.4 × 10−3, two-sided t-test). Across 16 combinations of task and attribute, we found that five settings exhibited this negative correlation, and three additional settings exhibited only a weak (R < 0.5) positive correlation (see Extended Data Fig. 6a,b for additional correlation plots). Thus, improving ID fairness may not lead to improvements in OOD fairness, highlighting the complex interplay between fairness and distribution shift45,46.

Fig. 4: The transfer of performance (overall AUROC) and fairness between the ID (MIMIC-CXR) and OOD datasets.
figure 4

a, We plotted the Pearson correlation coefficient of ID versus OOD performance versus the Pearson correlation coefficient of ID versus OOD fairness. Here, each point was derived from a grid of models trained on a particular combination of task and attribute. We found that there was a high correlation between ID and OOD performance in all cases, but the correlation between ID and OOD fairness was tenuous. b, One particular point where fairness transfers between ID and OOD datasets (‘Effusion’ with ‘age’ as the attribute; R = 0.98, P = 3.0 × 10−36, two-sided t-test). The center line and the shadow denote the mean and 95% CI, respectively. c, One particular point where fairness does not transfer between ID and OOD datasets (‘Pneumothorax’ with ‘sex and race’ as the attribute; R = −0.50, P = 4.4 × 10−3, two-sided t-test). The center line and the shadow denote the mean and 95% CI, respectively. d, The ID Pareto front for ‘Cardiomegaly’ prediction using ‘race’ as the attribute. e, The transformation of the ID Pareto front to the OOD Pareto front, for ‘Cardiomegaly’ prediction using ‘race’ as the attribute. Models that are Pareto optimal ID often do not maintain Pareto optimality OOD.

In addition, we investigated whether models achieving ID Pareto optimality between fairness and performance maintain it in OOD settings. As shown for ‘Cardiomegaly’ prediction using race as the attribute, models originally on the ID Pareto front (Fig. 4d) are not guaranteed to maintain Pareto optimality when deployed in a different OOD setting (Fig. 4e). We show additional examples of this phenomenon in Extended Data Fig. 6c.

Dissecting model fairness under distribution shift

To disentangle the OOD fairness gap, we present a way to decompose model fairness under distribution shift. Specifically, we attribute the change in fairness between the ID and OOD settings to the difference in performance change for each of the groups—that is, the change in fairness is determined by how differently the distribution shift affects each group (Methods).

In Extended Data Fig. 7, we show examples of transferring a trained model from ID setting to OOD setting. For example, Extended Data Fig. 7d illustrates an ERM model trained to predict ‘No Finding’ on CheXpert (ID) and transferred to MIMIC-CXR (OOD) while evaluating fairness across sex. We found that the model was fair with respect to the FPR gap in the ID setting (−0.1% gap, not significant) but had a significant FPR gap when deployed in the OOD setting (3.2%), with females being underdiagnosed at a higher rate (Extended Data Fig. 7e). We then segmented this FPR gap by sex and found that females experienced an increase in FPR of 3.9%, whereas males experienced an increase in FPR of 0.8% (Extended Data Fig. 7f). In other words, the model becomes worse for both groups in an OOD setting but to a much larger extent for female patients. This decomposition suggests that mitigation strategies that reduce the impact of the distribution shift on females could be effective in reducing the OOD fairness gap in this instance.

We further extended this study to a larger set of tasks and protected attributes (Extended Data Fig. 7). Across all settings, the disparate impact of distribution shift on each group was a significant component, indicating that mitigating the impact of distribution shift is as important as mitigating ID unfairness, if the goal is to achieve a fair model OOD.

Globally optimal model selection for OOD fairness

Figure 4 shows that selecting a model based on ID fairness may not lead to a model with optimal OOD fairness. Here, we examined alternative model selection criteria that may lead to better OOD fairness when we have access only to ID data. Our goal is to find ‘globally optimal’ models that maintain their performance and fairness in new domains. First, we restricted our selection to models that had satisfactory ID overall performance (defined as those with an overall validation AUROC within 5% of that of the best ERM model). This set of models also had satisfactory OOD performance (Supplementary Fig. 1).

Next, we proposed eight candidate model selection criteria (Fig. 5a), each corresponding to selecting the model from this set that minimizes or maximizes some ID metric. We evaluated the selected model by its OOD fairness across five external datasets, each containing up to four attributes and up to four tasks, corresponding to a total of 42 settings. We compared the OOD fairness of the selected model to the OOD fairness of an ‘oracle’, which observes samples from the OOD dataset and directly chooses the model with the smallest OOD fairness gap. For each setting, we computed the increase in fairness gap of each selection criterion relative to the oracle. In Fig. 5a, we report the mean across the 42 settings as well as the 95% confidence interval (CI) computed from 1,000 bootstrap iterations. Surprisingly, we found that selecting the model with the minimum ID fairness gap may not be optimal. Instead, two other criteria, based on selecting models whose embeddings contain the least attribute information, lead to a lower average OOD fairness gap. For instance, we observed a significantly lower increase in OOD fairness gap by selecting models with the ‘Minimum Attribute Prediction Accuracy’ as compared to ‘Minimum Fairness Gap’ (P = 9.60 × 10−94, one-tailed Wilcoxon rank-sum test). The result echoes our finding in Fig. 2 that the encoding of demographic attributes is positively correlated with ID fairness gaps.
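The comparison of a selection criterion against the OOD oracle can be sketched as follows; the metric keys and the convention of minimizing the ID metric are illustrative assumptions, not the study implementation.

```python
def criterion_regret(models, id_metric):
    """Increase in OOD fairness gap of an ID-based selection rule over the oracle.

    `models` is a list of dicts holding ID metrics (for example, 'id_fairness_gap'
    or 'attribute_prediction_auroc') and the OOD fairness gap for one setting;
    the oracle picks the model with the smallest OOD gap directly.
    """
    selected = min(models, key=lambda m: m[id_metric])         # uses ID data only
    oracle = min(models, key=lambda m: m["ood_fairness_gap"])  # requires OOD access
    return selected["ood_fairness_gap"] - oracle["ood_fairness_gap"]
```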

Fig. 5: OOD fairness of models with different model selection criteria and for different algorithms.
figure 5

a, We varied the ID model selection criteria and compared the selected model against the oracle that chooses the model that is most fair OOD. We plotted the increase in OOD fairness gap of the selected model over the oracle, averaged across 42 combinations of OOD dataset, task and attribute. We used non-parametric bootstrap sampling (n = 1,000) to define the bootstrap distribution for the metric. We found that selection criteria based on choosing models with minimum attribute encoding achieve better OOD fairness than naively selecting based on ID fairness or other aggregate performance metrics (‘Minimum Attribute Prediction Accuracy’ versus ‘Minimum Fairness Gap’: P = 9.60 × 10−94, one-tailed Wilcoxon rank-sum test; ‘Minimum Attribute Prediction AUROC’ versus ‘Minimum Fairness Gap’: P = 1.95 × 10−12, one-tailed Wilcoxon rank-sum test). b, We selected the model for each algorithm with the minimum ID fairness gap. We evaluated its OOD fairness against the oracle on the same 42 settings. We found that removing demographic encoding (that is, DANN) leads to the best OOD fairness (‘DANN’ versus ‘ERM’: P = 1.86 × 10−117, one-tailed Wilcoxon rank-sum test). On each box, the central line indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to 1.5 times the interquartile range. Points beyond the whiskers are plotted individually using the ‘+’ symbol.

Finally, we studied the fairness of each algorithm in the OOD setting. We maintained the performance cutoff described above and selected the model for each algorithm with the lowest ID fairness gap. In Fig. 5b, we report the mean increase in OOD fairness gap relative to the oracle across the same 42 settings. We found that methods that remove demographic information from embeddings (specifically, DANN) lead to the lowest average OOD fairness gap (‘DANN’ versus ‘ERM’: P = 1.86 × 10−117, one-tailed Wilcoxon rank-sum test). Our findings demonstrate that evaluating and removing demographic information encoded by the model ID may be the key to ‘globally optimal’ models that transfer both performance and fairness to external domains.

Discussion

We demonstrated how medical imaging AI models encode demographic attributes as ‘shortcuts’, how this encoding relates to model fairness and how these relationships change under distribution shifts. Notably, we validated our findings across global-scale datasets in radiology (Table 1) and across multiple medical imaging modalities (Extended Data Figs. 1 and 2). The results show that algorithmic encoding of protected attributes leads to unfairness (Fig. 2) and that mitigating shortcuts can reduce ID fairness gaps while maintaining performance (Fig. 3). However, our results also show that there exists an inherent tradeoff with clinically meaningful metrics beyond fairness (Extended Data Fig. 5) and that such fairness does not transfer under distribution shift (Fig. 4). We provide initial strategies to dissect and explain model fairness under distribution shifts (Extended Data Fig. 7). Our results further reveal actionable algorithm and model selection strategies for OOD fairness (Fig. 5).

Our results have multiple implications. First, they offer a cautionary tale on the efficacy and consequences of eliminating demographic shortcuts in disease classification models. On the one hand, removing shortcuts addresses ID fairness, which is a crucial consideration in fair clinical decision-making12. On the other hand, the resulting tradeoffs with other metrics and the non-transferability to OOD settings raise questions about the long-term utility of removing such shortcuts. This is particularly challenging in the healthcare setting, where the relationship between demographics and the disease or outcome label is complex47, variables can be mislabeled48 and distribution shifts between domains are difficult to quantify1.

Second, we frame demographic features as potential ‘shortcuts’ that should not be used by the model to make disease predictions. However, some demographic variables could be a direct causal factor in some diseases (for example, sex as a causal factor of breast cancer). In these cases, it would not be desirable to remove all demographic reliance but, instead, to match the model’s reliance on the demographic attribute to its true causal effect49. In the tasks that we examined here, demographic variables, such as race, may have an indirect effect on disease (for example, through socioeconomic status)50, which may vary across geographic location or even time period51. Whether demographic variables should serve as proxies for these causal factors is a decision that should rest with the model deployers14,47,52,53.

Third, we present a preliminary decomposition for diagnosing OOD model fairness changes by expressing the OOD fairness gap as a function of the ID fairness gap and the performance change of each group. We found that the disparate impact of distribution shift on per-group performance is a major contributor to the lack of fairness in OOD settings. Our work suggests that, for practitioners trying to achieve fairness in models deployed in a different domain, mitigating the impact of distribution shift for particular groups is at least as important as mitigating ID unfairness. However, building models robust to arbitrary domain shifts is, in general, a challenging task54,55. Having some knowledge or data about how the distributions may shift, or even the ability to actively collect data for particular groups, may be necessary56. Developing methods and deriving theoretical characterizations of fairness under distribution shift is an active area of research45,46.

Fourth, the US Food & Drug Administration (FDA), as the primary regulatory body for medical technologies, does not require external validation of clinical AI models, relying instead on assessment by the product creator57. Our findings underscore the necessity of regularly evaluating model performance under distribution shift58,59, challenging the popular assumption that a single model can remain fair across different settings60. This questions the effectiveness of developer assurances on model fairness at the time of testing and highlights the need for regulatory bodies to consider real-world performance monitoring, including fairness degradation61. Finally, when a model is deployed in any clinical environment, both its overall and per-group performance, as well as associated clinical outcomes, should be continuously monitored62.

Finally, although we imply that smaller ‘fairness gaps’ are better, enforcing these group fairness definitions can lead to worse utility and performance for all groups43,63,64,65,66, and other fairness definitions may be better suited to the clinical setting8,67. We note that these invariant notions of fairness can have drawbacks66: equalized odds is incompatible with calibration by group (Extended Data Fig. 5), and enforcing equalized odds often leads to a ‘leveling down’ effect in overall performance63,64. We present the Pareto curve showing the tradeoff between fairness and accuracy, allowing practitioners to select a model that best fits their deployment scenario. In general, we encourage practitioners to choose a fairness definition that is best suited to their use case and to carefully consider the performance–equality tradeoff. The impact of minimizing algorithmic bias on real-world health disparities, the ultimate objective, is complex68, and there is no guarantee that deploying a fair model will lead to equitable outcomes. In addition, although we constructed several models for clinical risk prediction, we do not advocate for the deployment of these models in real-world clinical settings without practitioners carefully testing the models on their own data and taking other considerations into account (for example, privacy, regulation and interpretability)1,3.

Methods

Datasets and pre-processing

The datasets used in this study are summarized in Table 1. Unless otherwise stated, we trained models on MIMIC-CXR21 and evaluated them on an OOD dataset created by merging CheXpert22, NIH23, SIIM24, PadChest25 and VinDr26. We included all images (both frontal and lateral) and split each dataset into 70% train, 15% validation and 15% test sets. Note that only MIMIC-CXR and CheXpert have patient race information available, and we extracted race (and other attributes) following established protocols69. For MIMIC-CXR, demographic information was obtained by merging with MIMIC-IV70. For CheXpert, separate race labels were obtained from the Stanford Center for Artificial Intelligence in Medicine & Imaging (https://aimi.stanford.edu/) website. Where applicable, we dropped patients with missing values for any attribute.

For all datasets, we excluded samples where the corresponding patient has missing age or sex. For ODIR and ISIC, we dropped samples from patients younger than 18 years or older than 80 years due to small sample sizes (that is, less than 3% of the total dataset).

Owing to computational constraints, we mainly chose four prediction tasks for CXRs (that is, ‘No Finding’, ‘Effusion’, ‘Cardiomegaly’ and ‘Pneumothorax’). We selected these tasks for several reasons: (1) diversity in presentation: ‘Effusion’, ‘Cardiomegaly’ and ‘Pneumothorax’ each present distinctively and occur in different locations on a CXR, allowing for a comprehensive evaluation across varied pathologies and underlying causes; (2) prevalence in clinical and research settings: these labels are not only common in clinical practice but also frequently studied in prior academic work7,12,63 and used in commercial diagnostic systems71; and (3) performance and fairness considerations: these labels are among those with both the highest diagnostic accuracy and substantial fairness gaps on MIMIC-CXR, making them particularly relevant for exploring the relationship between model performance and fairness7,12.

We scaled all images to 224 × 224 for input to the model. We applied the following image augmentations during training only: random flipping of the images along the horizontal axis, random rotation of up to 10° and a crop of a random size (70–100%) and a random aspect ratio (3/4 to 4/3).
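A sketch of this augmentation pipeline using torchvision is given below; the exact composition and ordering used in the study code may differ.

```python
from torchvision import transforms

# Training-time augmentations: horizontal flip, rotation up to 10 degrees and a
# random crop covering 70-100% of the image with aspect ratio 3/4 to 4/3,
# resized to the 224 x 224 model input.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.ToTensor(),
])

# Validation and test images are only resized, with no augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```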

Evaluation methods

To evaluate the performance of disease classification in medical imaging, we used the following metrics: AUROC, true-positive rate (TPR), true-negative rate (TNR) and expected calibration error (ECE).

The TPR and TNR are calculated as (FN, false negative; FP, false positive; TP, true positive; TN, true negative):

$${\mathrm{TPR}}=\frac{{\mathrm{TP}}}{{\mathrm{TP}}+{\mathrm{FN}}}$$
$${\mathrm{TNR}}=\frac{{\mathrm{TN}}}{{\mathrm{TN}}+{\mathrm{FP}}}$$

When reporting the sensitivity and specificity, we followed previous work12,72 in selecting the threshold that maximizes the F1 score. This threshold optimization procedure is conducted separately for each dataset, task, algorithm and attribute combination. We followed standard procedures to calculate the 95% CI for sensitivity and specificity.
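The threshold selection step can be sketched as follows; this is one straightforward way to maximize F1, not necessarily the exact routine used in the study.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, y_score):
    """Return the decision threshold that maximizes F1 for one dataset/task/
    algorithm/attribute combination; sensitivity and specificity are then
    reported at this threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the last point.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]
```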

We also reported the AUROC, the area under the ROC curve, as an aggregate measure of detection performance. Finally, we reported the ECE73, which we computed using the netcal library74.

Assessing the fairness of machine learning models

To assess the fairness of machine learning models, we evaluated the metrics described above for each demographic group as well as the difference in the value of the metric between groups. Equality of TPR and TNR between demographic groups is known in the algorithmic fairness literature as equal odds75. As the models that we studied in this work are likely to be used as screening or triage tools, the cost of an FP may be different from the cost of an FN. In particular, for ‘No Finding’ prediction, FPs (corresponding to underdiagnosis12) would be more costly than FNs, and so we focused on the FPR (or TNR) for this task. For all remaining disease prediction tasks, we focused on the FNR (or TPR) for the same reason. Equality in one of the class-conditioned error rates is an instance of equal opportunity30.

Finally, we also examined the per-group ECE and the ECE gap between groups. Note that zero ECE for both groups (that is, calibration per group) implies the fairness definition known as sufficiency of the risk score75. We emphasize that differences in calibration between groups are a significant source of disparity, as consistent under-estimation or over-estimation of risk for a particular group could lead to under-treatment or over-treatment for that group at a fixed operating threshold relative to the true risk76.
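As an illustration, per-group calibration error and the resulting ECE gap can be computed with a simple equal-width binning scheme; the study itself used the netcal library, so this sketch only mirrors the standard definition.

```python
import numpy as np

def binned_ece(y_true, y_prob, n_bins=10):
    """Equal-width binned expected calibration error for binary predictions."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob > lo) & (y_prob <= hi)
        if in_bin.any():
            confidence = y_prob[in_bin].mean()  # mean predicted probability in the bin
            accuracy = y_true[in_bin].mean()    # observed positive rate in the bin
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

def ece_gap(y_true, y_prob, group):
    """Difference in per-group ECE between demographic subgroups."""
    values = [binned_ece(y_true[group == g], y_prob[group == g]) for g in np.unique(group)]
    return max(values) - min(values)
```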

Quantifying the distribution shifts

We examined and quantified the types and degrees of distribution shifts in both ID and OOD settings in this study. Inspired by previous work46,77, we performed a series of hypothesis tests to determine if there were significant statistical differences in distributions between demographic groups and across different pairs of datasets. All P values were adjusted for multiple testing using Bonferroni correction78.

ID setting

We studied the following distribution shifts in the ID setting:

Prevalence shift: P(Y|A)

For binary outcomes Y across groups, we calculated the total variational distance between the probability distributions of Y conditioned on different groups and used a two-sample binomial proportion test, where the null hypothesis corresponds to P(Y|A = a1) = P(Y|A = a2):

$${d}_{\mathrm{Y}}({a}_{1},{a}_{2})={d}_{\mathrm{TV}}({\mathrm{P}}({\mathrm{Y}}|{\mathrm{A}}={a}_{1}),{\mathrm{P}}({\mathrm{Y}}|{\mathrm{A}}={a}_{2}))$$
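For binary Y, the total variational distance reduces to the absolute difference in prevalence between the two groups. The sketch below pairs it with a two-sample proportion z-test as one common form of the binomial proportion test; the exact test variant is an assumption here.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def prevalence_shift(y, a, a1, a2):
    """TV distance between P(Y|A=a1) and P(Y|A=a2) for binary Y, with a
    two-sample proportion test of the null P(Y|A=a1) = P(Y|A=a2)."""
    y1, y2 = y[a == a1], y[a == a2]
    d_tv = abs(y1.mean() - y2.mean())
    _, p_value = proportions_ztest(
        count=np.array([y1.sum(), y2.sum()]),
        nobs=np.array([len(y1), len(y2)]),
    )
    return d_tv, p_value
```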

Representation shift: P(X|A)

When comparing the distribution of the input images X, we first encoded them into representations derived from a frozen foundation model f that is trained in a self-supervised manner on diverse CXR datasets79,80. We then used the maximum mean discrepancy (MMD) distance and a permutation-based hypothesis test following ref. 81 to test whether demographic groups differed statistically in their distribution of representations:

$${d}_{\mathrm{X}}({a}_{1},{a}_{2})={d}_{\mathrm{MMD}}\left({\mathrm{P}}\left(f({\mathrm{X}})\,|\,{\mathrm{A}}={a}_{1}\right),{\mathrm{P}}\left(f({\mathrm{X}})\,|\,{\mathrm{A}}={a}_{2}\right)\right)$$
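A minimal sketch of the MMD statistic with a permutation-based hypothesis test is given below, assuming an RBF kernel with a fixed bandwidth; the kernel choice and bandwidth are illustrative assumptions rather than the exact configuration of ref. 81.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between two sets of representations under an RBF kernel."""
    def kernel(a, b):
        sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dist / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def mmd_permutation_test(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the null that both groups share one distribution."""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(x, y)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        x_p, y_p = pooled[perm[: len(x)]], pooled[perm[len(x):]]
        exceed += rbf_mmd2(x_p, y_p) >= observed
    return observed, (exceed + 1) / (n_perm + 1)
```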

OOD setting

We studied the following distribution shifts in the OOD setting (the null hypothesis is PID(·) = POOD(·)):

Label shift: P(Y)

We calculated the total variational distance between the probability distributions of binary outcomes Y across ID and OOD datasets using a two-sample binomial proportion test:

$${d}_{\mathrm{Y}}={d}_{\mathrm{TV}}({\mathrm{P}}_{\mathrm{ID}}\left({\mathrm{Y}}\right),{\mathrm{P}}_{\mathrm{OOD}}({\mathrm{Y}}))$$

Prevalence shift: P(Y|A = a)

We similarly evaluated the distance between the distributions of Y conditioned on specific demographic subgroups (A) between ID and OOD datasets:

$${d}_{{\mathrm{Y}}{\rm{|}}{\mathrm{A}}}(a)={d}_{\mathrm{TV}}({\mathrm{P}}_{\mathrm{ID}}\left({\mathrm{Y}}{\rm{|}}{\mathrm{A}}=a\right),{\mathrm{P}}_{\mathrm{OOD}}({\mathrm{Y}}{\rm{|}}{\mathrm{A}}=a))$$

Covariate shift: P(X)

We again encoded X into representations derived from a frozen foundation model f and then used the MMD distance and a permutation-based hypothesis test81 to examine if ID and OOD datasets differed statistically in their distribution of representations:

$${d}_{\mathrm{X}}={d}_{\mathrm{MMD}}\left({\mathrm{P}}_{\mathrm{ID}}\left(f({\mathrm{X}})\right),{\mathrm{P}}_{\mathrm{OOD}}\left(f({\mathrm{X}})\right)\right)$$

Representation shift: P(X|A = a)

Similarly, we calculated the MMD distance conditioned on subgroup A to evaluate shifts in the representation space:

$${d}_{{\mathrm{X}}|{\mathrm{A}}}(a)={d}_{\mathrm{MMD}}\left({\mathrm{P}}_{\mathrm{ID}}\left(f({\mathrm{X}})\,|\,{\mathrm{A}}=a\right),{\mathrm{P}}_{\mathrm{OOD}}\left(f({\mathrm{X}})\,|\,{\mathrm{A}}=a\right)\right)$$

We provide additional results on quantifying various distribution shifts in both ID and OOD settings in the Supplementary Information (Supplementary Tables 1–3).

Training details

We trained DenseNet-121 (ref. 33) models on each task, initializing with ImageNet82 pre-trained weights. We evaluated six algorithms: ERM34, ReSample35, GroupDRO36, DANN37, CDANN38 and MA39.

For each combination of task, algorithm and demographic attribute, we conducted a random hyperparameter search83 with 12 runs. During training, for a particular attribute, we evaluated the worst-group validation AUROC every 1,000 steps and stopped early if this metric had not improved for five evaluations. We tuned the learning rate and weight decay for all algorithms and also tuned algorithm-specific hyperparameters as described in the original works. We selected the hyperparameter setting that maximized the worst-attribute validation AUROC. CIs were computed as the standard deviation across three different random seeds for each hyperparameter setting.

We also explored a multi-label training setup, where models were trained simultaneously on 14 binary labels available in MIMIC-CXR7. We followed the same experimental protocol as outlined in the main paper, including hyperparameter tuning and model selection. Our findings in the multi-label setup mirrored those seen in the binary task setup (Supplementary Figs. 2 and 3).

To obtain the level of demographic encoding within representations (Fig. 2), we first computed representations using a trained disease prediction model. We froze these representations and trained a multi-class multinomial logistic regression model to predict the demographic group on the training set using the scikit-learn library84. We varied the L2 regularization strength between 10−5 and 10 and selected the model with the best macro-averaged AUROC on the validation set. We report the macro-averaged AUROC on the test set.
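A sketch of this probing procedure is shown below; it sweeps scikit-learn's inverse-regularization parameter C as a stand-in for the L2 strength sweep described above, so the exact grid and settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def macro_auroc(a_true, proba):
    """Macro-averaged one-vs-rest AUROC (positive-class score for binary attributes)."""
    if proba.shape[1] == 2:
        return roc_auc_score(a_true, proba[:, 1])
    return roc_auc_score(a_true, proba, multi_class="ovr", average="macro")

def probe_attribute_encoding(z_train, a_train, z_val, a_val, z_test, a_test):
    """Fit logistic-regression probes on frozen representations, select the
    regularization strength on the validation set and report the test AUROC."""
    best_score, best_clf = -np.inf, None
    for C in np.logspace(-1, 5, 7):  # C = 1/lambda for L2 strengths between 1e-5 and 10
        clf = LogisticRegression(C=C, max_iter=1000).fit(z_train, a_train)
        score = macro_auroc(a_val, clf.predict_proba(z_val))
        if score > best_score:
            best_score, best_clf = score, clf
    return macro_auroc(a_test, best_clf.predict_proba(z_test))
```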

Decomposing OOD fairness

Here we present a first approach toward decomposing the fairness gap in an OOD environment as a function of the ID fairness gap and the impact that the distribution shift has on each group. In particular, let Dsrc and Dtar be the source and target datasets, respectively. Let g ∈ G be a particular group from a set of groups. Let Lf(g, D) be an evaluation metric for a model f that is decomposable over individual samples—that is, \({L}_{f}(g,D)={\sum }_{(x,y,g^{\prime})\in D;\,g^{\prime}=g}l(f(x),y)\). Examples of such metrics are the accuracy, TPR or TNR. Then, we can decompose:

$$\begin{array}{c}{L}_{f}\,({g}_{1},{D}_{\mathrm{tar}})-{L}_{f}\,({g}_{2},{D}_{\mathrm{tar}})=[{L}_{f}\,({g}_{1},{D}_{\mathrm{src}})-{L}_{f}\,({g}_{2},{D}_{\mathrm{src}})]\\ +[{L}_{f}\,({g}_{2},{D}_{\mathrm{src}})-{L}_{f}\,({g}_{2},{D}_{\mathrm{tar}})]-[{L}_{f}\,({g}_{1},{D}_{\mathrm{src}})-{L}_{f}\,({g}_{1},{D}_{\mathrm{tar}})].\end{array}$$

The left-hand side is the fairness gap in the OOD environment, and the three terms on the right-hand side are (1) the fairness gap in the ID data, (2) the impact of the distribution shift on g2 and (3) the impact of the distribution shift on g1. We note that, to achieve a low fairness gap in the OOD environment, it is important not only to minimize the ID fairness gap (term 1) but also to minimize the difference in how the distribution shift impacts each group (term 2 − term 3).
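The decomposition can be written directly in code; `L` is any per-group metric function of the form described above, and all names are illustrative.

```python
def decompose_ood_fairness_gap(L, g1, g2, D_src, D_tar):
    """Split the OOD fairness gap into the ID gap and the per-group impact of the
    distribution shift, following the identity above.

    `L(group, dataset)` is a per-group evaluation metric such as accuracy, TPR or TNR.
    """
    id_gap = L(g1, D_src) - L(g2, D_src)          # term 1: ID fairness gap
    shift_on_g2 = L(g2, D_src) - L(g2, D_tar)     # term 2: impact of the shift on g2
    shift_on_g1 = L(g1, D_src) - L(g1, D_tar)     # term 3: impact of the shift on g1
    ood_gap = id_gap + shift_on_g2 - shift_on_g1  # equals L(g1, D_tar) - L(g2, D_tar)
    return {"ood_gap": ood_gap, "id_gap": id_gap,
            "shift_on_g2": shift_on_g2, "shift_on_g1": shift_on_g1}
```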

Evaluation with different medical imaging modalities

In addition to radiology, we also examined medical AI applications in dermatology and ophthalmology to corroborate our findings. Specifically, Extended Data Fig. 1 shows the results for dermatological imaging. We used the ISIC dataset28, which contains 32,259 images sourced from multiple international sites. We focused on the ‘No Finding’ task, taking into account ‘sex’ and ‘age’ as the sensitive demographic attributes (Extended Data Fig. 1a). Similar to our observations in radiology, we identified fairness gaps within subgroups based on age and sex (Extended Data Fig. 1b), although these disparities were less significant than those observed in CXR assessments (for example, fairness gaps smaller than 2%). This was further confirmed by the Pareto front plot, where most models, including ERM, could achieve a good performance–fairness tradeoff (Extended Data Fig. 1c).

We extended our analysis to ophthalmology images, specifically focusing on retinopathy detection, using the ODIR dataset29 with 6,800 images (Extended Data Fig. 2). The task that we considered was ‘Retinopathy’, with ‘sex’ and ‘age’ being used as demographic attributes (Extended Data Fig. 2a). Notably, significant subgroup fairness gaps were observed in age (43% FNR gap between groups ‘60–80’ and ‘18–40’). In contrast, the fairness gap based on sex was less significant, with a 3% FNR difference between ‘female’ and ‘male’ subgroups. We further verified the Pareto front for both attributes, where similar observations hold that algorithms for fixing demographic shortcuts could improve ID fairness while incurring minimal detriments to performance (measured in AUROC).

Analysis on underdiagnosis versus overdiagnosis

In evaluating fairness metrics, our primary study centered on underdiagnosis, specifically the disparities in FPR for ‘No Finding’ and discrepancies in FNR for other conditions. However, an alternative approach involves focusing on overdiagnosis, defined as differences in FNR for ‘No Finding’ and differences in FPR for other diseases. We present findings on the relationship between the two in Extended Data Fig. 3. An analysis spanning two datasets (MIMIC-CXR and CheXpert) and various tasks revealed a consistent pattern: larger gaps in underdiagnosis tend to correspond with larger overdiagnosis discrepancies. Nonetheless, certain task and attribute combinations exhibited more complex trends, indicating a need for deeper exploration and informed decision-making regarding the most appropriate fairness metrics for critical disease evaluations in practical medical settings.

Direct prediction of demographic attributes

We provide additional analyses of the demographic attribute information encoded by the models. In the main paper, we analyzed the predictiveness of attributes (for example, age, sex and race) based on the embeddings from a disease classification model. The distinct predictiveness between attributes in these domains could be attributed to the intrinsic characteristics of the datasets or the nature of the conditions being studied. To delve deeper, we conducted an experiment training an ERM model to predict these attributes directly using the dermatology dataset (ISIC), and we show the results in Supplementary Table 4. We observed that certain attributes are indeed less predictive than others (age compared with sex), suggesting that age may be inherently more challenging to encode within the studied dermatology dataset. Furthermore, the results reveal variations in the predictiveness of age across different subgroups (for example, age groups ‘18–40’ and ‘60–80’ exhibit higher AUROC than the ‘40–60’ group).

Analysis using multi-label models

Prior work7 studied CXR classification in the multi-label setting, where a model contains an encoder followed by an individual linear classification head for each of the downstream tasks. We followed this setup7 to study the following 14 binary labels in MIMIC-CXR: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Airspace Opacity, No Finding, Effusion, Pleural Other, Pneumonia, Pneumothorax and Support Devices.

We adapted the following methods to the multi-label setting: (1) ERM, (2) DANN and (3) CDANN, and we followed the same experimental protocol as in the main paper in terms of hyperparameter tuning. Note that GroupDRO and ReSample are challenging to adapt to the multi-label setting, as the number of groups is exponential in the number of tasks. For each combination of algorithm and attribute, we selected the multi-label model that maximizes the worst-attribute AUROC, averaged across the 14 tasks.

First, we examined the level of demographic encoding present in the embeddings of the best multi-task ERM model and found that it also encodes a variety of demographic information, similar to the single-label case (Supplementary Fig. 2a). We further showed the fairness gaps of this best multi-label ERM model and observed that a variety of fairness gaps exist and are statistically significant across all tasks (Supplementary Fig. 2b). In addition, we plotted the correlation between the fairness gap and the attribute prediction AUROC, across all trained multi-label models. We found a strong and statistically significant positive correlation among all combinations of task and attribute, similar to the single-label case (Supplementary Fig. 2c).

We also present Pareto plots showing the tradeoff between the fairness gap and overall AUROC across all models, for each combination of task and attribute (Supplementary Fig. 3). Overall, we found that the Pareto fronts for the multi-label models demonstrate similar behavior to the single-task models—multi-label ERM exhibits the best overall AUROC but has a high fairness gap. In addition, with debiasing methods such as multi-label DANN and CDANN, we were able to achieve fair models with minimal loss in overall AUROC.

Test set rebalancing for prevalence shift

We investigated whether eliminating the prevalence shift in the test set would address the observed fairness gaps. Following prior work40, we balanced the test set for multiple attributes—age and race—ensuring that demographic proportions (for example, ‘White’ aged ‘20–40’ versus ‘Black’ aged ‘60–80’) and disease prevalence are uniform across all attribute combinations. This approach aims to eliminate prevalence shifts within the test set. Our findings in Supplementary Fig. 4 suggest that, although test set rebalancing can reduce fairness gaps for certain task and attribute combinations (for example, ‘No Finding’ for ‘race’ and ‘Cardiomegaly’ for ‘age’, as compared to Fig. 2b), significant gaps remain even after rebalancing, indicating that fairness gaps are influenced by multiple shifts beyond just prevalence shifts.
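One way to implement such rebalancing is to subsample the test set so that every attribute-combination and label cell has equal size; the column names below are hypothetical and the exact balancing scheme of ref. 40 may differ.

```python
import pandas as pd

def rebalance_test_set(df, attr_cols=("age_group", "race"), label_col="label", seed=0):
    """Subsample so each (attribute combination, label) cell has the same number of
    samples, removing prevalence shift within the test set."""
    groups = df.groupby(list(attr_cols) + [label_col], group_keys=False)
    n = groups.size().min()  # size of the smallest cell
    return groups.apply(lambda g: g.sample(n, random_state=seed)).reset_index(drop=True)
```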

Statistical analysis

Correlation

To calculate the correlations between variables, we used Pearson correlation coefficients and their associated P values (two-sided t-test, α = 0.05). The 95% CI for the Pearson correlation coefficient was also calculated.

Increase in OOD fairness gap

One-tailed Wilcoxon rank-sum test (α = 0.05) was used to assess the increase in OOD fairness gap compared to oracle models.

CIs

We used non-parametric bootstrap sampling to generate CIs: random samples of size n (equal to the size of the original dataset) were repeatedly sampled 1,000 times from the original dataset with replacement. We then estimated the increase in OOD fairness gap compared to oracle using each bootstrap sample (α = 0.05).
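A sketch of this percentile bootstrap is given below; the statistic passed in would be, for example, the mean increase in OOD fairness gap over the oracle, and the variable names are illustrative.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the data with replacement n_boot times
    and recompute the statistic on each resample."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = np.array([
        statistic(data[rng.integers(0, len(data), len(data))]) for _ in range(n_boot)
    ])
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# For example: bootstrap_ci(gap_increase_per_setting, np.mean)
```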

All statistical analysis was performed with Python version 3.9 (Python Software Foundation).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.