Leveraging LLMs for early detection of cognitive decline

A newly published study from IMO Health, Harvard Medical School, and Mass General Brigham demonstrates AI's potential for the early detection of cognitive decline.

Large language models (LLMs) have excelled in various healthcare domains, streamlining workflows, enhancing data quality, improving patient care, and more. But until recently, little was known about their effectiveness in identifying specific clinical conditions in patient medical records.  

To help address this gap, IMO Health collaborated with Harvard Medical School to evaluate LLMs for detecting signs of cognitive decline in real clinical notes, comparing their error profiles with those of traditional models. The study, Enhancing Early Detection of Cognitive Decline in the Elderly: A Comparative Study Utilizing Large Language Models in Clinical Notes, was recently published online in The Lancet's eBioMedicine.

Conducted at Mass General Brigham in Boston, MA, home to a large cohort of Alzheimer's patients, the study analyzed clinical notes associated with diagnoses of mild cognitive impairment in patients aged 50 and older.

Experts from IMO Health developed and refined prompts for two LLMs, Llama 2 and GPT-4, on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud-computing platforms, testing multiple approaches, including baseline prompting, retrieval-augmented generation (RAG), and error analysis-based instructions, to select the best-performing LLM-based method.
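The paper's exact prompts and retrieval setup are not reproduced here, but a minimal Python sketch can illustrate what retrieval-augmented prompting for this task might look like. The reference snippets, toy word-overlap retriever, and prompt wording below are hypothetical stand-ins, not the study's actual materials.

```python
# Hypothetical sketch of retrieval-augmented prompting for cognitive-decline
# screening in clinical notes. The knowledge snippets, the naive retriever,
# and the prompt wording are illustrative only, not the published study's prompts.

KNOWLEDGE_BASE = [
    "Mild cognitive impairment (MCI) may present as memory complaints noted by the patient or family.",
    "Documented word-finding difficulty or repeated questions can signal early cognitive decline.",
    "Normal age-related forgetfulness does not interfere with independent daily functioning.",
]

def retrieve_context(note: str, k: int = 2) -> list[str]:
    """Rank knowledge-base snippets by simple word overlap with the note (toy retriever)."""
    note_words = set(note.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda snippet: len(note_words & set(snippet.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(note: str) -> str:
    """Assemble a RAG-style prompt asking the model for a yes/no judgment plus supporting evidence."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve_context(note))
    return (
        "You are reviewing a clinical note for evidence of cognitive decline.\n"
        f"Reference information:\n{context}\n\n"
        f"Clinical note:\n{note}\n\n"
        "Does this note contain evidence of cognitive decline? "
        "Answer 'yes' or 'no' and quote the supporting text."
    )

if __name__ == "__main__":
    example_note = (
        "72-year-old patient; daughter reports repeated questions and "
        "word-finding difficulty over the past six months."
    )
    print(build_prompt(example_note))  # This string would then be sent to the chosen LLM endpoint.
```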

“Early detection of cognitive decline can facilitate timely interventions for Alzheimer’s disease (AD) and related dementias (ADRD),” Jingcheng Du, PhD, VP of AI Innovations at IMO Health, said. “The comparative analysis in this study shed light on how we can optimize general-domain LLMs for clinical decision support tasks.”  

The study found that LLMs and traditional machine learning models trained on local electronic health record (EHR) data each produce distinct errors. Combining these complementary models into a single hybrid model can substantially improve diagnostic precision.
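As a rough illustration of why combining models with different error profiles can raise precision, the sketch below applies a simple agreement rule to hypothetical per-note flags from an LLM and a locally trained classifier. The agreement rule and the example predictions are assumptions for illustration, not the ensemble logic reported in the study.

```python
# Illustrative-only sketch of a hybrid decision rule over per-note predictions
# from an LLM and a traditional model trained on local EHR data. The
# "both-must-agree" rule below is an assumption, not the study's actual method.

notes = ["note_001", "note_002", "note_003"]

# Hypothetical per-note flags (True = the model thinks the note shows cognitive decline).
llm_pred   = {"note_001": True,  "note_002": True,  "note_003": False}
local_pred = {"note_001": True,  "note_002": False, "note_003": False}

def hybrid_flag(note_id: str) -> bool:
    """Flag a note only when both models agree, trading some recall for higher precision."""
    return llm_pred[note_id] and local_pred[note_id]

for note_id in notes:
    print(note_id, hybrid_flag(note_id))
# note_001 True, note_002 False, note_003 False
```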

Specifically, the ensemble model achieved a precision of 90.2%, a recall of 94.2%, and an F1-score of 92.1%. Precision improved most notably, rising from the 70%–79% range seen with the best-performing single models to above 90%.
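For context, the F1-score is the harmonic mean of precision and recall, so the reported figures are internally consistent: F1 = 2 × 0.902 × 0.942 / (0.902 + 0.942) ≈ 0.921, or 92.1%.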

“This study validated that even the most powerful LLMs cannot reach optimal clinical performance if they are generic models,” Du said.  

To produce the most accurate, actionable insights, LLMs must possess deep clinical knowledge. Fortifying such applications with comprehensive clinical terminology can also greatly enhance performance and ensure the reliability of results.  

“Future research should investigate integrating LLMs with smaller, localized models and incorporating medical data and domain knowledge to enhance performance on specific tasks,” the study authors wrote. 

This study was supported in part by NIH grants NIA R44AG081006, NLM R01LM014239, and NIA R01AG080429.

Click here to explore more innovative research publications authored by leading IMO Health scientists.
