Can LLMs excel in medical coding? Yes, with rich semantics and clinical AI

By leveraging rich clinical terminology and IMO Clinical AI to train LLMs, the accuracy of medical coding can be greatly improved.
In recent years, there has been growing interest in using artificial intelligence (AI), especially large language models (LLMs), to automate the medical coding process. Medical coding, which involves assigning standardized codes like ICD-10 and CPT® to diagnoses and procedures, is a critical but time-consuming administrative task in healthcare.

LLMs are deep learning models trained on vast datasets, which makes them well suited to generating text and automating tasks. But according to a Mount Sinai study published online in NEJM AI on April 19, 2024, LLMs are poor medical coders.1 The study emphasizes the need to refine and validate these technologies before implementing them in large-scale clinical settings.

LLMs display weaknesses in medical coding

Specifically, the study found that out-of-the-box LLMs like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performed poorly at medical coding when simply prompted to generate codes from descriptions. Out-of-the-box LLMs are pre-trained models that have not been fine-tuned or adapted for specific tasks, in contrast to specialized LLMs that have been further trained on domain-specific data to improve their performance in particular areas.

The best-performing model, OpenAI's GPT-4, achieved only 46% exact match accuracy for ICD-9 codes, 34% for ICD-10, and 50% for CPT. The models often generated codes that were imprecise or outright fabricated.
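To make the metric concrete, here is a minimal sketch of how exact match accuracy can be computed. The normalization step (trimming whitespace and ignoring case) is an assumption for illustration, the study's own scoring rules may differ, and the strings below are placeholders rather than real medical codes.

```python
def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of predictions identical to the reference code.

    Assumption: codes are compared after trimming whitespace and ignoring case.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty dataset")
    hits = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


# Placeholder strings, not real medical codes.
reference = ["A1", "B2", "C3", "D4"]
predicted = ["A1", "B9", "c3 ", "D40"]

print(exact_match_accuracy(predicted, reference))  # 0.5
```

Under this metric a near miss such as "D40" for "D4" counts as a full error, which is one reason exact match scores for generated codes can be low.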

The Mount Sinai study aligns with IMO Health’s ongoing exploration of automated medical coding. We too have found that out-of-the-box LLMs, while impressive in many areas, often struggle with the complex and nuanced task of medical coding.

To address these limitations, we have focused on enhancing LLMs specifically for medical coding by leveraging our terminology resources, mapping knowledge, AI techniques, and IMO Health tools. By combining the power of LLMs with our expertise in medical informatics, we aim to create more accurate and reliable automated medical coding solutions that can support healthcare providers and improve patient care.

Enhancing LLMs with deep clinical ontologies and informatics

Structured clinical terminology, composed of codified terms from a common clinical vocabulary, can be employed to accurately represent clinical concepts like diseases or lab results. However, managing constantly changing data on millions of clinical terms, concepts, their interrelationships, and complex clinical nuances requires specialized expertise. Improving NLP model performance requires training on a comprehensive structured clinical terminology and deep domain expertise.

Thirty years in the making, IMO Health remains the most advanced and widely adopted terminology solution in the industry. With its extensive coverage, meticulous content curation, and well-documented guidelines, IMO Health terminology has the potential to significantly enhance LLMs for medical coding.

Widely adopted and comprehensive

IMO Health terminology is used by 89% of US physicians, nurses, and physician assistants. Its versatility supports various electronic health record (EHR) use cases, such as problem list management and unstructured note processing.

With 1.13 million unique concepts and 4.23 million lexical items distributed across 24 active domains, IMO Health terminology encompasses a wide range of industry-standard terminologies, including ICD-10-CM, ICD-9-CM, SNOMED CT®, CPT, HCPCS, ICD-10-PCS, LOINC®, RxNorm®, NDC, and CVX.

Compared to the Unified Medical Language System (UMLS), IMO Health terminology includes approximately 20% more synonyms per concept and a higher percentage of long and complex terms (Figure 1), reflecting the precise language used in clinical care.

Figure 1. Displays the percentage distribution of term length (number of tokens) of IMO Health terminology vs the Unified Medical Language System (UMLS) -- that is, how many words or phrases fall into different length categories.

Accurate and up-to-date content curation 

The content creation and maintenance of IMO Health’s terminology is powered by a team of industry experts, including MDs, RNs, pharmacists, medical laboratory scientists, and credentialed HIM professionals.  

Collectively, the team boasts 150 years of clinical informatics expertise, more than 130 years of experience in health information, and 160 years of clinical practice experience spanning various specialties, such as surgery, oncology, radiology, pediatrics, orthopedics, ER, and family medicine. Together, the team has spent hundreds of thousands of hours over three decades creating, curating, updating, and maintaining content. 

Well-documented guidelines and instructions 

IMO Health terminology includes a wealth of best practices, industry standards, coding guidelines, and IMO Health-specific rules. These resources are meticulously documented with detailed instructions and rich positive and negative examples. Hundreds of pages of editorial guidelines are designed to ensure consistent, high-quality content, and they provide ready source material for LLM prompts that simplify medical coding tasks.

Decades of access to clinical data 

With a decades-long history as the terminology and coding foundation in all major EHRs, IMO Health has accumulated an extensive knowledge base. This includes capturing the clinical terms physicians search for when seeking medical codes and the codes they select, along with the distributions of search terms, frequencies, and co-occurrences. These insights can enhance LLMs by providing additional context to medical codes. 

Enhancing LLMs with proven AI techniques 

To enhance LLMs for medical coding, we leverage several proven techniques, including advanced prompt engineering, retrieval-augmented generation, agents and tools, and fine-tuning. Let’s explore each of these techniques in more detail.

Advanced prompt engineering 

Prompt engineering – or the act of writing and refining inputs to elicit high-quality outputs – plays a crucial role in guiding LLMs to generate more accurate medical codes. At IMO Health, using ICD-10-CM codes as examples, we have summarized 22 coding rules and incorporated them as part of the prompts. In doing so, we have observed a significant improvement in the accuracy of generated ICD-10-CM codes compared to using simple questions alone. 
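As a rough illustration of rule-based prompting, the sketch below places a numbered list of coding rules ahead of the diagnosis term. The function name `build_coding_prompt` and the three example rules are hypothetical and written only for this example; they are not IMO Health's 22 rules.

```python
def build_coding_prompt(term: str, rules: list[str]) -> str:
    """Assemble a prompt that states coding rules before asking for a code."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "You are assisting with ICD-10-CM coding.\n"
        "Follow these rules:\n"
        f"{numbered}\n\n"
        f"Diagnosis: {term}\n"
        "Answer with one ICD-10-CM code and a one-sentence justification."
    )


# Hypothetical rules for illustration only.
EXAMPLE_RULES = [
    "Choose the most specific code that the documentation supports.",
    "Do not infer laterality that the text does not state.",
    "If no code fits, say so instead of guessing.",
]

prompt = build_coding_prompt("toe stiffness", EXAMPLE_RULES)
print(prompt)
```

Keeping the rules in a plain list makes it straightforward to add, remove, or reorder them and measure the effect of each change on accuracy.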

Retrieval-augmented generation (RAG)

Retrieval-augmented generation involves having the LLM reference relevant medical coding information retrieved from IMO Health’s terminology and normalization application programming interfaces (APIs).  

By leveraging retrieved codes from IMO Health’s terminologies, we minimize the occurrence of fake or inaccurate codes and reduce hallucinations, or outputs that are nonsensical or entirely fabricated (Figure 2). This approach simplifies the task from generating codes to selecting from pre-existing candidates, thus reducing the reliance on prior knowledge from the base LLMs.  

As a result, it becomes possible to use smaller LLMs to build lower-cost and faster-running solutions. 

Figure 2. Illustrates how IMO Health leverages retrieval-augmented generation (RAG), a technique that involves retrieving information from comprehensive terminologies and knowledge graphs and feeding it into LLMs, to reduce hallucinations.
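The shift from generating codes to selecting among retrieved candidates can be sketched as follows. This is a toy example under stated assumptions: the in-memory `TERMINOLOGY` dictionary and token-overlap ranking stand in for a real terminology service, the `llm` argument is any callable that maps a prompt to text, and the codes are placeholders rather than real ICD-10-CM codes.

```python
from typing import Callable, Optional

# Toy terminology: description -> code. Codes are placeholders, not real codes.
TERMINOLOGY = {
    "toe stiffness": "DEMO-001",
    "toe fracture": "DEMO-002",
    "knee stiffness": "DEMO-003",
    "ankle sprain": "DEMO-004",
}


def retrieve_candidates(query: str, k: int = 3) -> list[tuple[str, str]]:
    """Rank terminology entries by token overlap with the query; keep the top k."""
    query_tokens = set(query.lower().split())
    scored = []
    for description, code in TERMINOLOGY.items():
        overlap = len(query_tokens & set(description.split()))
        if overlap:
            scored.append((overlap, description, code))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(description, code) for _, description, code in scored[:k]]


def select_code(query: str, llm: Callable[[str], str]) -> Optional[str]:
    """Ask the model to choose among retrieved codes; reject anything else."""
    candidates = retrieve_candidates(query)
    if not candidates:
        return None
    options = "\n".join(f"{code}: {description}" for description, code in candidates)
    prompt = (
        f"Pick the best code for '{query}'. "
        f"Reply with one code from this list only:\n{options}"
    )
    answer = llm(prompt).strip()
    allowed = {code for _, code in candidates}
    # A code that was never retrieved is treated as a hallucination.
    return answer if answer in allowed else None


def first_option_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the code on the first option line."""
    return prompt.splitlines()[1].split(":")[0]


print(select_code("toe stiffness after injury", first_option_llm))  # DEMO-001
print(select_code("toe stiffness after injury", lambda p: "FAKE-999"))  # None
```

The final membership check is what keeps fabricated codes out of the result: the model can only return something that the retrieval step already produced.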

Agents and tools

An LLM agent is a specialized AI system designed to perform specific tasks or functions within a larger AI ecosystem. These agents are often built on top of foundational LLMs and are trained to handle particular domains or use cases.

At IMO Health, we formalize mapping and editorial guidelines into prompts to build a chain of thought for LLMs when performing medical coding. We also instruct the LLM to call upon IMO Health tools and APIs, including natural language processing (NLP) pipelines2 and our normalization solution, IMO Precision Normalize3, when applicable. By using agents, the output becomes explainable, trustworthy, and acceptable to human medical coders, instead of a black box (Figure 3).

Figure 3. Represents an example agent, or a software component that assists the AI in making decisions, and an LLM chain of thought. As depicted above, the LLM leverages IMO Health tools and APIs to turn the input “impaired toe range of motion” into “toe stiffness,” yielding an accurate code match.
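A minimal sketch of such an agent loop is shown below. Everything in it is hypothetical: `normalize` and `lookup_code` are local stand-ins, not the IMO Health APIs, `scripted_planner` replaces the LLM's decision about which tool to call next, and the code is a placeholder. The point is the recorded trace, which is what makes each step reviewable.

```python
from typing import Callable

# Hypothetical stand-ins for a normalization service and a code lookup.
NORMALIZED = {"impaired toe range of motion": "toe stiffness"}
CODES = {"toe stiffness": "DEMO-001"}  # placeholder, not a real code


def normalize(term: str) -> str:
    """Map a free-text term to a preferred term when one is known."""
    return NORMALIZED.get(term.lower(), term.lower())


def lookup_code(term: str) -> str:
    """Return the code for a preferred term, or an empty string if unknown."""
    return CODES.get(term.lower(), "")


TOOLS: dict[str, Callable[[str], str]] = {
    "normalize": normalize,
    "lookup_code": lookup_code,
}


def scripted_planner(trace: list[tuple[str, str]]) -> str:
    """Stand-in for the LLM: choose the next tool from the steps taken so far."""
    if not trace:
        return "normalize"
    if trace[-1][0] == "normalize":
        return "lookup_code"
    return "finish"


def run_agent(
    term: str,
    planner: Callable[[list[tuple[str, str]]], str] = scripted_planner,
    max_steps: int = 5,
) -> tuple[str, list[tuple[str, str]]]:
    """Run tools until the planner finishes; return the result and the trace."""
    trace: list[tuple[str, str]] = []
    current = term
    for _ in range(max_steps):
        action = planner(trace)
        if action == "finish":
            break
        current = TOOLS[action](current)
        trace.append((action, current))
    return current, trace


code, trace = run_agent("impaired toe range of motion")
print(code)   # DEMO-001
print(trace)  # [('normalize', 'toe stiffness'), ('lookup_code', 'DEMO-001')]
```

The `max_steps` cap is a simple guard against a planner that never decides to finish.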

Fine-tuning 

Fine-tuning involves further training the base LLM on high-quality medical coding datasets to improve its understanding of the medical coding task. By exposing the LLM to a large volume of relevant data, including IMO Health terminology synonyms, mapping relationships, and historical product logs, we can fine-tune it to better capture the nuances and intricacies of medical coding. 
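One way to picture the data preparation step is to expand each preferred term and its synonyms into prompt and completion pairs, written one JSON object per line. The sketch below assumes that generic JSONL layout; the field names, the toy synonyms, and the placeholder code are illustrative and do not describe IMO Health's actual training data or any particular fine-tuning service.

```python
import json

# Toy data: terms are illustrative and the code is a placeholder.
SYNONYMS = {
    "toe stiffness": ["impaired toe range of motion", "stiff toe"],
}
CODE_FOR = {"toe stiffness": "DEMO-001"}


def build_examples(
    synonyms: dict[str, list[str]], code_for: dict[str, str]
) -> list[dict[str, str]]:
    """Create one training pair per preferred term and per synonym."""
    examples = []
    for preferred, variants in synonyms.items():
        code = code_for.get(preferred)
        if code is None:
            continue  # skip terms that have no mapped code
        for text in [preferred, *variants]:
            examples.append(
                {"prompt": f"Assign a code to: {text}", "completion": code}
            )
    return examples


def to_jsonl(examples: list[dict[str, str]]) -> str:
    """Serialize examples as JSON Lines, one object per line."""
    return "\n".join(json.dumps(example) for example in examples)


examples = build_examples(SYNONYMS, CODE_FOR)
print(to_jsonl(examples))
```

Each synonym yields its own training pair, so richer synonym coverage translates directly into more varied examples for the same code.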

The result: better performance on medical coding 

Improved mapping accuracy 

In a recent test on a typical dataset, the top-performing out-of-the-box LLM achieved an accuracy of 45% on ICD-10-CM code prediction. By enhancing the LLM with the UMLS as a RAG resource, we were able to improve the performance to 64%.  

However, when using the IMO Precision Normalize API without any LLMs, the performance reached an impressive 83%. This is primarily due to the extensive coverage of challenging terms by synonyms in IMO Health’s comprehensive medical terminology. 

When we evaluated the medical coding solution powered by IMO Clinical AI, which combines LLMs with our proprietary resources and techniques, the performance reached 90% accuracy on the same dataset. This demonstrates the effectiveness of IMO Health’s approach in delivering highly accurate medical coding results (Figure 4).

Figure 4. Demonstrates the difference in ICD-10-CM mapping accuracy between out-of-the-box LLMs and other AI tools – and IMO Health’s medical coding solution powered by IMO Clinical AI.

Enriched results with secondary codes and HCC integration 

Using IMO Health as a bridge to medical codes offers several benefits beyond improved accuracy. IMO Health’s terminology not only returns the preferred primary code but also provides preferred secondary codes, cross-referenced to multiple terminologies. This captures the detailed semantic differences between medical codes, providing a more comprehensive and precise coding output (Figure 5). 

IMO Health’s terminology also includes Hierarchical Condition Category (HCC) scores, which are crucial for risk adjustment and reimbursement purposes in value-based care models. By integrating HCC scores directly into the coding process, IMO Clinical AI streamlines the workflow and eliminates the need for manual HCC assignment. 

Figure 5. Uses the diagnosis “breast cancer metastasized to pelvis” to show how solutions infused with IMO Clinical AI return preferred secondary codes, cross-referenced to multiple terminologies, in addition to preferred primary codes.

Explainable and trustworthy code selections 

One of the key advantages of IMO Clinical AI is its ability to explain why certain medical codes are chosen and why they are more suitable compared to other similar codes. By prompting the LLM with our mapping knowledge and terminology resources, the generated explanations are more clinically logical, with fewer hallucinations and false statements. This makes the results more acceptable and trustworthy to medical coders when they review the output. 

The explainable nature of IMO Clinical AI code selections is particularly valuable when there is ambiguity or multiple potential codes for a given medical condition or procedure. By providing clear and clinically sound reasoning for the chosen codes, the system instills confidence in medical coders and facilitates a more efficient review process.  

This transparency also enables coders to quickly identify and address any potential discrepancies or uncommon cases, further improving the overall accuracy and reliability of the coding output (Figure 6). 

Figure 6. Provides an example of medical coding in IMO Studio4, underscoring the explainable nature of IMO Clinical AI.

Cost-efficiency optimization

IMO Clinical AI doesn’t simply rely on LLMs for all medical coding tasks. Thanks to IMO Health’s comprehensive terminology synonyms and mappings, many input diagnosis terms are already covered directly without the need for LLMs. Only the uncovered terms or terms with low confidence scores are sent to LLMs for further analysis.  

In an early study, only 25.1% of input diagnosis terms required LLMs, while the overall accuracy on the entire dataset increased from 82.9% to 90.0%, a gain of 7.1 percentage points.
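The routing logic described here can be sketched as a lookup with a confidence threshold and an LLM fallback. The `DIRECT` table, its confidence scores, the 0.8 threshold, and the placeholder codes are all assumptions made for this example; the source does not state how confidence is computed or which cutoff is used.

```python
from typing import Callable

# term -> (code, confidence). Placeholder codes and made-up scores.
DIRECT = {
    "toe stiffness": ("DEMO-001", 0.98),
    "ankle sprain": ("DEMO-004", 0.95),
    "foot pain": ("DEMO-005", 0.40),
}


def route(
    term: str, llm: Callable[[str], str], threshold: float = 0.8
) -> tuple[str, str]:
    """Return (code, source); call the LLM only for missing or low-confidence terms."""
    hit = DIRECT.get(term.lower())
    if hit is not None and hit[1] >= threshold:
        return hit[0], "terminology"
    return llm(term), "llm"


def llm_share(
    terms: list[str], llm: Callable[[str], str], threshold: float = 0.8
) -> float:
    """Fraction of terms that had to be sent to the LLM."""
    sent = sum(route(term, llm, threshold)[1] == "llm" for term in terms)
    return sent / len(terms)


def fake_llm(term: str) -> str:
    """Stand-in for a real model call."""
    return "LLM-PICK"


terms = ["toe stiffness", "ankle sprain", "foot pain", "odd phrasing"]
print(route("toe stiffness", fake_llm))  # ('DEMO-001', 'terminology')
print(route("foot pain", fake_llm))      # ('LLM-PICK', 'llm')
print(llm_share(terms, fake_llm))        # 0.5
```

Raising the threshold sends more terms to the LLM and raises cost; lowering it relies more on direct terminology matches, so the cutoff is a tuning knob between cost and coverage.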

This selective use of LLMs offers significant cost-efficiency benefits. By leveraging IMO Health’s extensive terminology resources as the foundation and using LLMs judiciously for more complex or ambiguous cases, the system optimizes computational resources and reduces the overall cost of the medical coding process. This cost-efficiency, combined with the high accuracy and explainability of the system, makes IMO Clinical AI an attractive solution for healthcare organizations looking to automate their medical coding workflows. 

Conclusion 

IMO Clinical AI represents a notable breakthrough in medical coding automation. By leveraging LLMs alongside IMO Health’s extensive medical terminology resources and proprietary techniques, we deliver remarkable accuracy and efficiency.  

With features like improved mapping accuracy, comprehensive code coverage, HCC score integration, explainable code selections, and cost-efficiency optimization, IMO Clinical AI is poised to transform the medical coding landscape, saving time, reducing errors, and ultimately improving patient care. 

Our selective use of LLMs ensures that computational resources are utilized effectively, minimizing costs while maintaining accuracy. As healthcare organizations seek to optimize revenue cycle management and improve clinical documentation, IMO Clinical AI offers a reliable, transparent, and cost-efficient solution, driving significant value and efficiency gains.  

By combining IMO Health’s rich terminology resources with the power of LLMs, IMO Clinical AI sets a new standard for automated medical coding, enabling healthcare providers to focus on delivering high-quality patient care while streamlining their coding processes. 


1Soroush, A., Glicksberg, B. S., Zimlichman, E., Barash, Y., Freeman, R., Charney, A. W., … & Klang, E. (2024). Large Language Models Are Poor Medical Coders—Benchmarking of Medical Code Querying. NEJM AI, AIdbp2300040.

2IMO Entity Extraction API: https://developer.imohealth.com/api-catalog/entity-extraction 

3IMO Precision Normalize API: https://developer.imohealth.com/api-catalog/imor-precision-normalize-api 

4IMO Studio: https://studio.imohealth.com/ 

RxNorm® is a registered trademark of the National Library of Medicine. 

CPT is a registered trademark of the American Medical Association. All rights reserved. 

SNOMED and SNOMED CT® are registered trademarks of SNOMED International. 
