AI Can Detect Race from X-Rays Even When Humans Can't
People used to worry
that robots were getting so smart that they’d soon
start secretly plotting to take over the world. But
now experts worry that AI is getting so smart that it
could be secretly plotting to do racism to Black
people:
However, our findings
that AI can trivially predict self-reported race —
even from corrupted, cropped, and noised medical
images — in a setting where clinical experts cannot,
creates an enormous risk for all model deployments
in medical imaging: if an AI model secretly used its
knowledge of self-reported race to misclassify all
Black patients, radiologists would not be able to
tell using the same data the model has access to.
From a new preprint on
arXiv:
Reading
Race: AI Recognises Patient’s Racial Identity
In Medical Images
Imon Banerjee,
Ananth Reddy Bhimireddy, John L. Burns, Leo
Anthony Celi, Li-Ching Chen, Ramon Correa, Natalie
Dullerud, Marzyeh Ghassemi, Shih-Cheng Huang,
Po-Chih Kuo, Matthew P Lungren, Lyle Palmer,
Brandon J Price, Saptarshi Purkayastha, Ayis
Pyrros, Luke Oakden-Rayner, Chima Okechukwu, Laleh
Seyyed-Kalantari, Hari Trivedi, Ryan Wang, Zachary
Zaiman, Haoran Zhang, Judy W Gichoya
Background:
In medical imaging, prior studies have
demonstrated disparate AI performance by race, yet
there is no known correlation for race on medical
imaging that would be obvious to the human expert
interpreting the images.
Methods:
Using private and public datasets we evaluate: A)
performance quantification of deep learning models
to detect race from medical images, including the
ability of these models to generalize to external
environments and across multiple imaging
modalities, B) assessment of possible confounding
anatomic and phenotype population features, such
as disease distribution and body habitus as
predictors of race, and C) investigation into the
underlying mechanism by which AI models can
recognize race.
Findings: Standard deep learning models can be
trained to predict race from medical images with
high performance across multiple imaging
modalities. Our findings hold under external
validation conditions, as well as when models
are optimized to perform clinically motivated
tasks. We demonstrate this detection is not due
to trivial proxies or imaging-related surrogate
covariates for race, such as underlying disease
distribution. Finally, we show that performance
persists over all anatomical regions and
frequency spectrum of the images suggesting that
mitigation efforts will be challenging and
demand further study.
Interpretation:
We emphasize that model ability to predict
self-reported race is itself not the issue of
importance. However, our findings
that AI can trivially predict self-reported race
— even from corrupted, cropped, and noised
medical images — in a setting where clinical
experts cannot, creates an enormous risk for all
model deployments in medical imaging: if
an AI model secretly used its knowledge of
self-reported race to misclassify all Black
patients, radiologists would not be able to tell
using the same data the model has access to.
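To make "standard deep learning models" concrete, here is a minimal sketch of the kind of experiment the abstract describes: fine-tune an off-the-shelf image classifier on chest x-rays labelled with self-reported race and see how well it predicts that label. It is an illustration only, not the authors' pipeline; the `CXRRaceDataset` wrapper, the label set, and the training settings are all assumptions.

```python
# Illustrative sketch only: train a standard classifier to predict self-reported race
# from chest x-rays. CXRRaceDataset, the label set, and all hyperparameters are
# hypothetical stand-ins, not the paper's actual setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

from my_cxr_dataset import CXRRaceDataset      # hypothetical dataset wrapper

RACE_GROUPS = ["Asian", "Black", "White"]      # assumed label set
train_ds = CXRRaceDataset(split="train", labels=RACE_GROUPS)

model = models.resnet34(weights="IMAGENET1K_V1")            # ordinary ImageNet backbone
model.fc = nn.Linear(model.fc.in_features, len(RACE_GROUPS))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in DataLoader(train_ds, batch_size=32, shuffle=True):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

Nothing about this recipe is exotic, which is part of the authors' point: a completely ordinary training setup is enough to pick up the racial signal.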
From the blog of
one of the authors:
AI
has the worst superpower… medical racism.
August 2, 2021 ~ Luke Oakden-Rayner
Is this
the darkest timeline? Are we the baddies?
… instead I wanted to write something else which I think will complement the paper: an explanation of why I and many of my co-authors think this issue is important.
One thing we
noticed when we were working on this research was
that there was a clear divide in our team. The
more clinical and safety/bias related researchers
were shocked, confused, and frankly horrified by
the results we were getting. Some of the computer
scientists and the more junior researchers on the
other hand were surprised by our reaction. They
didn’t really understand why we were concerned.
So in a way, this
blog post can be considered a primer, a companion
piece for the paper which explains the why. Sure,
AI can detect a patient’s racial identity, but why
does it matter?
Disclaimer: I’m
white. I’m glad I got to contribute, and I am
happy to write about this topic, but that does not
mean I am somehow an authority on the lived
experiences of minoritized racial groups. These
are my opinions after discussion with my much more
knowledgeable colleagues, several of whom have
reviewed the blog post itself.
A brief summary
In extremely brief
form, here is what the paper showed:
- AI can trivially learn to identify the self-reported racial identity of patients to an absurdly high degree of accuracy.
- AI does learn to do this when trained for clinical tasks.
- These results generalise, with successful external validation and replication in multiple x-ray and CT datasets.
- Despite many attempts, we couldn’t work out what it learns or how it does it. It didn’t seem to rely on obvious confounders, nor did it rely on a limited anatomical region or portion of the image spectrum.
Now for the
important part: so what?
An argument in
four steps
I’m going to try
to lay out, as clearly as possible, that this AI
behaviour is both surprising, and a very bad thing
if we care about patient safety, equity, and
generalisability.
The argument will
have the following parts:
1. Medical practice is biased in favour of the privileged classes in any society, and worldwide towards a specific type of white men.
2. AI can trivially learn to recognise features in medical imaging studies that are strongly correlated with racial identity. This provides a powerful and direct mechanism for models to incorporate the biases in medical practice into their decisions.
3. Humans cannot identify the racial identity of a patient from medical images. In medical imaging we don’t routinely have access to racial identity information, so human oversight of this problem is extremely limited at the clinical level.
4. The features the AI makes use of appear to occur across the entire image spectrum and are not regionally localised, which will severely limit our ability to stop AI systems from doing this.
There are several
other things I should point out before we get
stuck in. First of all, a definition. We are
talking about racial identity, not genetic
ancestry or any other biological process that
might come to mind when you hear the word “race”.
Racial identity is a social, legal, and political
construct that consists of our own perceptions of
our race, and how other people see us. In the context of this work, we rely
on self-reported race as our indicator of racial
identity.
Before you jump
in with questions about this approach and the
definition, a quick reminder on what we are trying
to research. Bias in medical
practice is almost never about genetics or
biology. No patient has genetic ancestry testing
as part of their emergency department workup. We
are interested in factors that may bias doctors in
how they decide to investigate and treat patients,
and in that setting the only information they get
is visual (i.e.,
skin tone, facial features etc.) and
sociocultural (clothing, accent and language use,
and so on). What we care about is race as a social
construct, even if some elements of that construct
(such as skin tone) have a biological basis.
Secondly,
whenever I am using the term bias in this piece, I
am referring to the social definition, which is a
subset of the strict technical definition; it is
the biases that impact decisions made about humans
on the basis of their race. These biases can in
turn produce health disparities, which the NIH
defines as “a health difference that adversely
affects disadvantaged populations”.
Third, I want to
take as given that racial bias in medical AI is
bad. I feel like this shouldn’t need to be said,
but the ability of AI to homogenise,
institutionalise, and algorithm-wash health
disparities across regions and populations is not
a neutral thing.
AI can seriously
make things much, much worse.
… In medical
imaging we like to think of ourselves as above
this problem, particularly with respect to race
because we usually don’t know the identity of our
patients. We report the scans without ever seeing
the person, but that only protects us from direct
bias. Biases still affect who gets referred for
scans and who doesn’t, and they affect which scans
are ordered. …
But it is true
that, in general, we read the scan as it comes.
The scan can’t tell us what colour a person’s skin
is.
Can it?
Part II – AI can
detect racial identity in x-rays and CT scans
I’ve already
included some results up in the summary section,
and there are more in the paper, but I’ll very
briefly touch on my interpretation of them here.
Firstly, the
performance of these models ranges from high to
absurd. An AUC
of 0.99 for recognising the self-reported race of
a patient, which has no recognised medical imaging
correlate? This is flat out nonsense.
Every
radiologist I have told about these results is
absolutely flabbergasted, because despite all of
our expertise, none of us would have believed in
a million years that x-rays and CT scans contain
such strong information about racial identity.
Honestly we are talking jaws dropped – we see
these scans every day and we have never noticed [Editor's Note:
BECAUSE YOU NEVER LOOKED OR BECAUSE YOU ARE NOT
PHYSICAL OR FORENSIC ANTHROPOLOGISTS.].
The
second important aspect though is that, with
such a strong correlation, it appears that AI
models learn the features correlated with racial
identity by default. For example, in our
experiments we showed that the distribution of
diseases in the population for several datasets
was essentially non-predictive of racial identity
(AUC = 0.5 to 0.6), but we also found that if you
train a model to detect those diseases, the model
learns to identify patient race almost as well as
the models directly optimised for that purpose
(AUC = 0.86). Whaaat?
Despite racial
identity not being useful for the task (since the
disease distribution does not differentiate racial
groups), the model learns it anyway? …
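One way to probe this kind of result, assuming you have a trained disease classifier and race labels for a held-out set, is a linear probe: freeze the disease model, take its penultimate-layer features, and see how well a simple classifier recovers self-reported race from them. The sketch below is that generic idea, not the paper's exact protocol; `disease_model` and `probe_loader` are assumed to already exist.

```python
# Generic linear-probe sketch (not the paper's protocol): check whether features
# learned for disease detection also encode self-reported race.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

disease_model.eval()                                        # assumed: trained on disease labels only
feature_extractor = torch.nn.Sequential(*list(disease_model.children())[:-1])

feats, races = [], []
with torch.no_grad():
    for images, race_labels in probe_loader:                # assumed: loader also yields race labels
        f = feature_extractor(images).flatten(start_dim=1)
        feats.append(f.numpy())
        races.append(race_labels.numpy())
X, y = np.concatenate(feats), np.concatenate(races)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te),
                    multi_class="ovr", labels=probe.classes_)
print(f"race AUC from disease-model features: {auc:.2f}")   # high AUC = race is encoded anyway
```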
But
no matter how it works, the take-home message is
that it appears that models will tend to learn
to recognise race, even when it seems irrelevant
to the task. So the dozens upon dozens of FDA
approved x-ray and CT scan AI models on the
market now … probably do this? Yikes!
There is one more
interpretation of these results that is worth
mentioning, for the “but this is expected model
behaviour” folks. Even from a
purely technical perspective, ignoring the
racial bias aspect, the fact models learn
features of racial identity is bad. There is no
causal pathway linking racial identity and the
appearance of, for example, pneumonia on a chest
x-ray. By definition these features are
spurious.
By
definition!
They are
shortcuts. Unintended cues. The model is
underspecified for the problem it is intended to
solve.
However we want
to frame this, the model has
learned something that is wrong, and this means
the model can behave in undesirable and
unexpected ways [Editor's Note: Naughty model!
Bad, bad model. Now go stand in the corner.].
I
won’t be surprised if this becomes a canonical
example of the biggest weakness of deep learning
– the ability of deep learning to pick up
unintended cues from the data. I’m certainly
going to include it in all my talks.
Part III – Humans
can’t identify racial identity in medical images
… The problem is
much worse for racial bias. At least in MRI
super-resolution, the radiologist is expected to
review the original low quality image to ensure it
is diagnostic quality (which seems like a
contradiction to me, but whatever). In AI with
racial bias though, humans literally cannot
recognise racial identity from images. Unless
they are provided with access to additional data
(which they don’t currently have easy access to in
imaging workflows) they will be completely unable
to appreciate the bias no matter how skilled they
are and no matter how much effort they apply to
the task.
Part IV – We don’t know how to stop it
[Editor's
Note: Remember HAL-9000 in 2001: A Space
Odyssey?]
This
is probably the biggest problem here. We ran an
extensive series of experiments to try and work
out what was going on.
First, we tried
obvious demographic confounders (for example,
Black patients tend to have higher rates of
obesity than white patients, so we checked whether
the models were simply using body mass/shape as a
proxy for racial identity). None of them appeared
to be responsible, with very low predictive
performance when tested alone.
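To see what "tested alone" means here, a confounder check can be as simple as asking whether a crude body-habitus variable by itself predicts self-reported race. The sketch below assumes a metadata table with a BMI column and collapses the task to a binary Black-versus-other comparison for simplicity; an AUC near 0.5 is the near-chance result described above.

```python
# Hypothetical confounder check: does BMI alone predict self-reported race?
# The metadata file and column names are assumptions for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("patient_metadata.csv")                    # hypothetical metadata table
X = df[["bmi"]]
y = (df["self_reported_race"] == "Black").astype(int)       # binary comparison for simplicity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC from BMI alone: {auc:.2f}")                     # values near 0.5 mean little signal
```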
Next
we tried to pin down what sort of features were
being used. There was no clear anatomical
localisation, no specific region of the images
that contributed to the predictions. Even more
interesting, no part of the image spectrum was
primarily responsible either. We could get rid
of all the high-frequency information, and the
AI could still recognise race in fairly blurry
(non-diagnostic) images. Similarly, and I think
this might be the most amazing figure I have
ever seen, we could get rid of the low-frequency
information to the point that a human can’t even
tell the image is still an x-ray, and the model
can still predict racial identity just as well
as with the original image!
Damn
their eyes!
Performance is maintained with the low-pass filter to around the LPF25 level, which is quite blurry but still readable. But for the high-pass filter, the model can still recognise the racial identity of the patient well past the point that the image is just a grey box.
…
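The frequency experiment is simple to reproduce in outline. A sketch of the idea, assuming a single-channel x-ray held as a NumPy array: transform the image, zero out everything outside (or inside) a chosen spatial-frequency radius, transform back, and feed the result to the model. The cutoffs and the circular mask below are illustrative choices, not the paper's exact settings.

```python
# Illustrative frequency filtering: keep only low or only high spatial frequencies.
import numpy as np

def frequency_filter(image: np.ndarray, cutoff: float, keep: str = "low") -> np.ndarray:
    """Zero out frequencies outside (keep='low') or inside (keep='high') a radius."""
    f = np.fft.fftshift(np.fft.fft2(image))                 # centre the spectrum
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)   # distance from spectrum centre
    mask = dist <= cutoff if keep == "low" else dist > cutoff
    return np.abs(np.fft.ifft2(np.fft.ifftshift(f * mask)))

xray = np.random.rand(224, 224)                             # stand-in for a real chest x-ray
blurry = frequency_filter(xray, cutoff=25, keep="low")      # roughly the "LPF25" idea
ghostly = frequency_filter(xray, cutoff=100, keep="high")   # little left that a human can read
```

The paper's striking result is that models keep predicting race on both kinds of degraded input, far past the point where the images stop looking like x-rays to a human.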
This difficulty
in isolating the features associated with racial
identity is really important, because one
suggestion people tend to have when they get shown
evidence of racial bias is that we should make the
algorithms “colorblind” – to remove the features
that encode the protected attribute and thereby
make it so the AI cannot “see” race but should
still perform well on the clinical tasks we care
about.
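In practice, that "colorblind" suggestion usually cashes out as something like the adversarial setup sketched below: attach a race-prediction head to the shared features and reverse its gradient, so the backbone is penalised for carrying racial information while still being trained on the clinical task. This is a generic adversarial-debiasing sketch, not anything evaluated in the paper, and the findings above are precisely why it may not be enough.

```python
# Generic adversarial-debiasing sketch (not from the paper): a gradient-reversal
# race head pushes the shared features to discard racial-identity information.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.weight = weight
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the backbone.
        return -ctx.weight * grad_output, None

class DebiasedClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, n_diseases: int, n_races: int):
        super().__init__()
        self.backbone = backbone                             # any feature extractor
        self.disease_head = nn.Linear(feat_dim, n_diseases)  # the clinical task
        self.race_head = nn.Linear(feat_dim, n_races)        # the adversary

    def forward(self, x, adv_weight: float = 1.0):
        feats = self.backbone(x).flatten(start_dim=1)
        disease_logits = self.disease_head(feats)
        race_logits = self.race_head(GradientReversal.apply(feats, adv_weight))
        return disease_logits, race_logits
```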
Here,
it seems like there is no easy way to remove
racial information from images. It is everywhere
and it is in everything.
Perhaps Disraeli
was right when he had the character who was his
mouthpiece in his novels explain, “All is race.”
An urgent problem
AI
seems to easily learn racial identity
information from medical images, even when the
task seems unrelated. We can’t isolate how it
does this, and we humans can’t recognise when AI
is doing it unless we collect demographic
information (which is rarely readily available
to clinical radiologists). That is bad.
There are around
30 AI systems using CXR and CT Chest imaging on
the market currently, FDA cleared, many of which
were trained on the exact same datasets we
utilised in this research. That
is worse.
I don’t know
about you, but I’m worried. AI might be
superhuman, but not every superpower is a force
for good.
The line between
superheroism and supervillainy is a fine one.
It’s
almost as if race does exist.
But of course we’ve been told over
and over that that can’t possibly be true.
But did anybody tell Artificial Intelligence that?
It’s almost as if AI isn’t a True Believer in the
conventional wisdom about the scientific
nonexistence of race. Something
must be done to inject the natural stupidity of
our elite wisdom into Artificial Intelligence.