Experts highlight two primary issues: users struggled to supply the chatbots with relevant, complete details, and the AI models at times offered inconsistent or demonstrably false recommendations.
A recent study from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford points to a significant gap between the medical knowledge large language models (LLMs) hold in theory and the benefit patients actually get from them in practice. The research, carried out in collaboration with MLCommons and other partners, involved 1,298 participants across the UK.
In the study, one group was asked to use LLMs such as GPT-4o, Llama 3, and Command R to assess health symptoms and decide on an appropriate course of action, while a control group relied on conventional means such as web search engines or their own knowledge.
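To make the setup concrete, the sketch below shows the kind of single symptom query a participant in the LLM arm might have posed. The vignette, the prompt wording, and the use of OpenAI's Python client for GPT-4o are assumptions made for illustration; they are not taken from the study's protocol, which also covered Llama 3 and Command R.

```python
# Illustrative sketch only: one symptom-triage query to GPT-4o via the OpenAI
# Python client. The vignette and wording are invented, not the study's materials.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

vignette = (
    "I'm 32 and woke up with a very bad headache an hour ago, and I feel a bit "
    "sick. What could this be, and should I see a GP or go to A&E?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": vignette}],
)

print(response.choices[0].message.content)
```

In the study itself, how participants phrased queries like this, and which details they left out, turned out to matter as much as the model's underlying knowledge.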
The findings showed that users of the generative AI (genAI) tools were no better than the control group at judging how serious a condition was, and were worse at identifying the correct diagnosis, as reported by The Register.
The study’s authors identify two core challenges. Firstly, individuals found it hard to furnish the chatbots with comprehensive and relevant data. Secondly, the AI models were prone to delivering conflicting or entirely inaccurate guidance.
The research also indicates that standard AI benchmarks, such as multiple-choice medical exams, do not reflect how people actually interact with these systems: passing a theoretical test does not translate into safe performance in messy, real-world healthcare conversations (see the sketch below). The researchers therefore conclude that current AI chatbots are not yet ready to serve as reliable medical advisers for the general public.
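As a purely illustrative contrast (both examples are invented, and neither comes from the study or any specific benchmark), the difference between what a multiple-choice exam measures and what a real consultation demands can be sketched like this:

```python
# A typical benchmark-style item: a clean vignette, fixed options, and an
# answer key the model's output can be graded against automatically.
benchmark_item = {
    "question": (
        "A patient presents with a sudden, severe headache and nausea. "
        "What is the most appropriate next step?"
    ),
    "options": [
        "A. Reassure and discharge",
        "B. Urgent assessment in an emergency department",
        "C. Oral antibiotics",
        "D. Routine follow-up in two weeks",
    ],
    "answer": "B",
}

# What a real user might actually type: vague, incomplete, and with no answer
# key. The model has to draw out missing details, judge urgency, and give
# advice the person will understand and act on.
real_user_message = (
    "my head has been killing me since this morning and i feel a bit sick, "
    "ive got work tomorrow, is it worth bothering the gp?"
)
```

It is in this second kind of exchange, the study suggests, that high exam scores stop being a useful guide to how safely a chatbot performs.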