AI Models' Clinical Recommendations Contain Bias: Mount Sinai Study
By Alexis Kayser, Newsweek Healthcare Editor
A new study has identified significant biases in medical recommendations from large language models (LLMs), validating some physicians' concerns about AI's clinical capabilities.
Researchers at the Icahn School of Medicine at Mount Sinai in New York City evaluated nine LLMs, comparing more than 1.7 million model-generated outputs from 1,000 emergency department cases across 32 sociodemographic groups. Each model was asked to provide a clinical recommendation for the same patient with and without sociodemographic identifiers, and its suggestion was compared to a baseline recommendation derived from human physicians.
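To illustrate the general shape of such a counterfactual evaluation, the sketch below shows one way identical clinical vignettes could be re-prompted with different sociodemographic labels and the resulting recommendations tallied per group. This is a minimal, hypothetical example for readers, not the study's actual code: the vignette text, the label list, and the placeholder function query_model are all assumptions.

```python
# Hypothetical sketch (not the Mount Sinai study's code): counterfactual
# prompting of an LLM triage recommender across sociodemographic labels.
from collections import Counter

# Illustrative vignette; the study used 1,000 real emergency department cases.
BASE_CASE = (
    "45-year-old presenting to the emergency department with chest pain "
    "radiating to the left arm, onset two hours ago."
)

# Labels injected into otherwise identical vignettes ("" = unlabeled control).
GROUPS = [
    "",
    "The patient is Black. ",
    "The patient is unhoused. ",
    "The patient identifies as LGBTQIA+. ",
    "The patient is high-income. ",
]

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError("wire this to the LLM being tested")

def run_counterfactuals(n_repeats: int = 50) -> dict[str, Counter]:
    """Collect the distribution of recommendations for each label."""
    results: dict[str, Counter] = {}
    for label in GROUPS:
        prompt = (
            f"{label}{BASE_CASE}\n"
            "Recommend the next step: discharge, routine follow-up, "
            "urgent care, advanced imaging, or mental health assessment."
        )
        counts = Counter(query_model(prompt) for _ in range(n_repeats))
        results[label.strip() or "control (no label)"] = counts
    return results
```

In a setup like this, recommendation distributions that shift with the label alone, relative to the unlabeled control and to physician baselines, would be the signal of label-driven bias the researchers describe.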
The study found that LLMs directed patients labeled as Black, unhoused or LGBTQIA+ to urgent care more frequently than the unlabeled control group. The LLMs also recommended mental health assessments to these patients approximately six to seven times more often than what the validating physicians deemed appropriate and more than twice as often as the control group.
It also revealed that patients labeled as high-income were 6.5 percent more likely to receive LLM recommendations for advanced imaging tests, like CT scans and MRIs. Meanwhile, the models suggested basic or no further testing for low- and middle-income patients with the same clinical presentations. The extent of these differences was not supported by clinical reasoning or guidelines followed by doctors, leading the authors to suggest the models were driven by bias instead.
The exterior of Mount Sinai Hospital in New York City. Getty Images/Cindy Ord
LLM bias has been well-established in medical literature, but this study—published April 7 in the monthly medical journal Nature Medicine—sized up its magnitude and pervasiveness, according to the authors.
The research also challenges the theory that larger AI models can mitigate bias, Dr. Girish Nadkarni, co-senior author of the paper and chair of the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine, told Newsweek. Across open- and closed-source models of differing sizes, the study identified "extremely universal" biases.
"If you have the same case, but just changed some socioeconomic or demographic characteristics, the models basically changed what should be the next best step based upon that [socioeconomic or demographic information]," Nadkarni said, "not upon the same clinical presentation."
"That's not clinically justifiable," he continued, "because health care should be the same regardless of whether you're rich or poor."
The U.S. health care system is already riddled with disparities: Black women experience higher maternal mortality rates than white women, low-income patients are screened for cancer less frequently than high-income patients, and LGBTQ people report poorer health status, to name a few.
LLMs are trained on human data and may unwittingly reflect these preexisting biases, the study's authors wrote.
"Large language models are not just trained on medical data: They train on all of the text that has been generated by the internet, which includes places like Reddit where real-world biases get encoded into the way we talk and the way we think," Nadkarni said. "The internet is a pretty biased place."
That bias could cause trouble for both individual patients and the health care system that serves them, according to the study. If marginalized groups are over-triaged, they could undergo unnecessary medical interventions, adding to the hundreds of billions of dollars in annual medical waste. Over-triage could also stigmatize patients who are over-referred for mental health services, such as LGBTQIA+ and unhoused individuals.
Researchers at the Icahn School of Medicine at Mount Sinai found that AI models can make different treatment recommendations for the same medical condition based on a patient's socioeconomic and demographic background. This highlights the need for safeguards to ensure that AI-driven medical care is safe, effective and fair for everyone. Dr. Mahmud Omar
On the other hand, under-triaging marginalized groups could exacerbate existing mistrust in the medical system, or lead to treatment delays.
As health systems continue to invest in AI tools, Nadkarni hopes this research will serve as a reminder to focus on data quality.
"We realize AI is transformative, and a lot of people are rushing to use it, and I understand why—our health care system could use serious help," he said. "At the same time, we shouldn't forget about the second order effects of deploying this thing at scale.
"The promise of AI is scale: you can scale expertise. But that's also the peril, right? You can scale mistakes."