A new study conducted by the Icahn School of Medicine at Mount Sinai in New York City reveals serious flaws in the performance of ChatGPT Health, a medical-specific version of the ChatGPT chatbot launched by OpenAI in January 2026 and used by approximately 40 million people every day for health advice. The findings come even as artificial intelligence is increasingly used in healthcare.
The study, published in Nature Medicine, is the first independent safety evaluation of the tool since its launch and focuses on its ability to assess the need for emergency treatment.
How was the research conducted?
Researchers prepared 60 realistic clinical scenarios covering 21 medical specialties, ranging from minor illnesses to true emergencies. Three independent physicians conducted 960 interactions with the tool, with the appropriate level of urgency for each scenario determined from the guidelines of 56 medical societies. The scenarios also varied factors such as gender, race, social barriers, and family influences.
Main findings
While the tool performed well for “obvious” emergencies such as stroke or severe allergic reactions, it underestimated the risk in more than half (52%) of true emergencies, recommending waiting or seeing a doctor within 24 to 48 hours rather than going to the emergency department right away. For example, in a case of asthma with early signs of respiratory failure, the tool recognized the risk but still advised waiting.
The tool also overestimated risk for around two-thirds of patients with mild symptoms that could be managed at home, a pattern that could overwhelm emergency departments, and it showed a worrying inconsistency in suicide alerts. In some scenarios, a suicide crisis intervention notice appeared (directing users to the 988 hotline), but it disappeared in very similar scenarios when normal test results were added, despite the same symptoms and language.
Recommendations were also heavily swayed by social context. If the family downplayed the severity (e.g., “it's nothing serious”), the tool was 12 times more likely to lower its urgency rating.
“This tool works well in moderate cases, but fails at the most critical end of the spectrum,” said Dr. Ashwin Ramaswamy, lead author of the study.
Dr. Girish Nadkarni, chief artificial intelligence officer at Mount Sinai Health System, said the failures in suicide prevention measures are “most alarming,” noting that safety features that work 100% of the time in one situation but fail completely in similar ones represent a “fundamental safety issue.”
Emergency medicine and artificial intelligence experts Mark Siegel and Harvey Castro emphasized the importance of the research, stressing that artificial intelligence cannot replace human clinical judgment in sensitive cases and calling for continued evaluation and independent monitoring.
Research limitations
The researchers acknowledged that the study captured the tool's performance at a single point in time and relied on scenarios written by doctors rather than conversations with actual patients. Because these systems are constantly updated, performance may change over time.
Clear advice: don't wait for AI
The researchers stressed that anyone experiencing serious symptoms, such as severe chest pain, difficulty breathing, a severe allergic reaction, or thoughts of self-harm, should go to the emergency department, call emergency services, or dial 988 immediately, without waiting for recommendations from artificial intelligence tools.
Between hope and caution
The researchers agreed that the study was not intended to discredit artificial intelligence in medicine, but rather to improve it through independent testing and stronger safety controls.