Key Highlights
- Stanford launches MedAgentBench, the first benchmark to measure how AI agents perform real-world electronic health record (EHR) tasks.
- Claude 3.5 Sonnet v2 achieved a 69.7% success rate, the highest among the frontier large language models tested.
- Researchers highlight AI’s potential as a clinical teammate, helping address physician burnout and staffing shortages.
AI benchmarking moves beyond knowledge tests: Unlike earlier evaluations that focused on exams like the USMLE, MedAgentBench assesses how well AI agents execute physician tasks such as retrieving patient data, ordering medications, and handling test requests inside a realistic clinical system.
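To make the task format concrete, the sketch below shows the kind of EHR interaction an agent must get right: composing a FHIR-style search for a patient's most recent lab result and extracting the value from the returned bundle. The base URL, patient ID, LOINC code, and helper names here are illustrative assumptions, not MedAgentBench's actual interface.

```python
# Hypothetical sketch of an agent-style EHR task: fetch a patient's most
# recent lab observation via a FHIR-like search. All identifiers below
# (server URL, patient ID, helper names) are made up for illustration.

def build_observation_query(base_url, patient_id, loinc_code):
    """Compose a FHIR Observation search URL, sorted newest-first."""
    return (f"{base_url}/Observation?patient={patient_id}"
            f"&code={loinc_code}&_sort=-date&_count=1")

def latest_value(bundle):
    """Extract (value, unit) from the first entry of a search bundle."""
    entries = bundle.get("entry", [])
    if not entries:
        return None
    quantity = entries[0]["resource"].get("valueQuantity", {})
    return quantity.get("value"), quantity.get("unit")

# A minimal bundle shaped like a FHIR server response (sample data).
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [{"resource": {
        "resourceType": "Observation",
        "valueQuantity": {"value": 139, "unit": "mmol/L"},
    }}],
}

url = build_observation_query("https://fhir.example.org", "pat-001", "2951-2")
print(url)
print(latest_value(sample_bundle))  # (139, 'mmol/L')
```

The hard part for an agent is not the HTTP call itself but choosing the right query parameters and interpreting the structured response correctly, which is exactly what task-execution benchmarks like this probe.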
Key findings from Stanford’s study: The benchmark tested 12 large language models across 300 clinical tasks. Claude 3.5 Sonnet v2 led with 69.7% success, GPT-4o followed with 64%, while many models lagged below 50%. Researchers emphasized that transparency into strengths and weaknesses is critical to guide safe deployment in healthcare.
Implications for clinicians and health systems: The study shows AI is unlikely to replace doctors but can support them by handling routine “clinical housekeeping” tasks. This could reduce physician workload, mitigate burnout, and help address the projected global shortage of over 10 million healthcare workers by 2030.
The road toward deployment: The Stanford team noted that understanding error patterns, building safety frameworks, and ensuring interoperability are prerequisites before widespread adoption. With improvements in newer models, AI agents could soon transition from research prototypes to real-world pilots in hospitals.
About Stanford HAI: Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI) is a global leader in advancing trustworthy, human-centered AI solutions. Its interdisciplinary research spans healthcare, education, and policy, with a mission to augment human expertise and create meaningful societal impact.