Harvard study says OpenAI o1 beat two ER doctors on diagnosis in small sample

A Harvard-led study published this week in Science found OpenAI’s o1 performed nominally better than or on par with two attending physicians and OpenAI’s 4o model across a set of 76 emergency room cases at Beth Israel Deaconess Medical Center. ^[1]

The research team came from Harvard Medical School and Beth Israel Deaconess Medical Center. For patients who came into the Beth Israel emergency room, it compared diagnoses from two internal medicine attending physicians with those from the two OpenAI models. ^[1]

Two other attending physicians, blinded to whether each diagnosis came from a human or an AI model, evaluated the diagnoses. Harvard Medical School said the models were given the same information available in the electronic medical records at the time of each diagnosis, and the data were not pre-processed. ^[1]

The largest gap appeared at the first ER triage step. There, o1 produced the exact or a very close diagnosis in 67% of cases, compared with 55% for one physician and 50% for the other. ^[1]

The study said o1 did as well as or better than the doctors and 4o at each diagnostic touchpoint. Arjun Manrai said the model was tested against “virtually every benchmark” and “eclipsed both prior models and our physician baselines.” ^[1]

The paper adds to a growing body of work on how large language models perform in medical settings, but the experiment was small: 76 cases from a single emergency room. The next step is further evaluation of these models in clinical settings. ^[1]

Sources