Can LLMs excel in medical coding? Yes, with rich semantics and clinical AI

In recent years, there has been growing interest in using artificial intelligence (AI), especially large language models (LLMs), to automate the medical coding process. Medical coding, which involves assigning standardized codes like ICD-10 and CPT® to diagnoses and procedures, is a critical but time-consuming administrative task in healthcare.

LLMs are deep learning models trained on vast datasets, which makes them well suited to generating text and automating tasks. But according to a recent Mount Sinai study published online April 19, 2024, in NEJM AI, LLMs are poor medical coders.¹ The study emphasizes the need to refine and validate these technologies before implementing them in large-scale clinical settings.

LLMs display weaknesses in medical coding

Specifically, the study found that out-of-the-box LLMs like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performed poorly at medical coding when simply prompted to generate codes from descriptions. Out-of-the-box LLMs are pre-trained models that have not been fine-tuned or adapted for specific tasks, in contrast to specialized LLMs that have been further trained on domain-specific data to improve their performance in particular areas.

The model with the best performance, GPT-4, which is the latest and most advanced language model developed by OpenAI, the creators of ChatGPT, only achieved 46% exact match accuracy for ICD-9 codes, 34% for ICD-10, and 50% for CPT. The models often generated codes that were imprecise or even contained falsified information.
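To make the metric concrete, here is a minimal Python sketch of exact-match accuracy. It assumes that "exact match" means the predicted code string equals the reference code after trimming whitespace and ignoring case; the study's own scoring may differ in detail. The function name and the sample codes are illustrative and are not taken from the study.

```python
def exact_match_accuracy(predicted, expected):
    """Fraction of predictions that equal the reference code exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    # Only whitespace and case are normalized; a near miss such as
    # J45.90 vs. J45.909 still counts as wrong.
    hits = sum(
        p.strip().upper() == e.strip().upper()
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


# Illustrative data only (assumption), not results from the study.
predicted = ["E11.9", "I10", "J45.90", "M54.5"]
expected = ["E11.9", "I10", "J45.909", "M54.50"]
print(exact_match_accuracy(predicted, expected))  # 0.5
```

Under this definition a truncated or over-general code earns no credit, which is one reason the reported scores look low even when a model lands in the right code family.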

The Mount Sinai study aligns with IMO Health’s ongoing exploration of automated medical coding. We too have found that out-of-the-box LLMs, while impressive in many areas, often struggle with the complex and nuanced task of medical coding.

To address these limitations, we have focused on enhancing LLMs specifically for medical coding by leveraging our terminology resources, mapping knowledge, AI techniques, and IMO Health tools. By combining the power of LLMs with our expertise in medical informatics, we aim to create more accurate and reliable automated medical coding solutions that can support healthcare providers and improve patient care.

Enhancing LLMs with deep clinical ontologies and informatics

Structured clinical terminology, composed of codified terms from a common clinical vocabulary, can be employed to accurately represent clinical concepts like diseases or lab results. However, managing constantly changing data on millions of clinical terms, concepts, their interrelationships, and complex clinical nuances requires specialized expertise. Improving NLP model performance requires training on a comprehensive structured clinical terminology and deep domain expertise.

Thirty years in the making, IMO Health remains the most advanced and widely adopted terminology solution in the industry. With its extensive coverage, meticulous content curation, and well-documented guidelines, IMO Health terminology has the potential to significantly enhance LLMs for medical coding.

Widely adopted and comprehensive

IMO Health terminology is used by 89% of US physicians, nurses, and physician assistants. Its versatility supports various electronic health record (EHR) use cases, such as problem list management and unstructured note processing.

With 1.13 million unique concepts and 4.23 million lexical items distributed across 24 active domains, IMO Health terminology encompasses a wide range of industry-standard terminologies, including ICD-10-CM, ICD-9-CM, SNOMED CT®, CPT, HCPCS, ICD-10-PCS, LOINC®, RxNorm®, NDC, and CVX.

Compared to the Unified Medical Language System (UMLS), IMO Health terminology includes approximately 20% more synonyms per concept and a higher percentage of long and complex terms (Figure 1), reflecting the precise language used in clinical care.

Accurate and up-to-date content curation

The content creation and maintenance of IMO Health’s terminology is powered by a team of industry experts, including MDs, RNs, pharmacists, medical laboratory scientists, and credentialed HIM professionals.

Collectively, the team boasts 150 years of clinical informatics expertise, more than 130 years of experience in health information, and 160 years of clinical practice experience spanning various specialties, such as surgery, oncology, radiology, pediatrics, orthopedics, ER, and family medicine. Together, the team has spent hundreds of thousands of hours over three decades creating, curating, updating, and maintaining content.

Well-documented guidelines and instructions

IMO Health terminology includes a wealth of best practices, industry standards, coding guidelines, and IMO Health-specific rules. These resources are meticulously documented with detailed instructions and rich positive and negative examples. Hundreds of pages of editorial guidelines are designed to ensure consistent, high-quality content, and they provide a ready basis for LLM prompts that simplify medical coding tasks.

Decades of access to clinical data

With a decades-long history as the terminology and coding foundation in all major EHRs, IMO Health has accumulated an extensive knowledge base. This includes capturing the clinical terms physicians search for when seeking medical codes and the codes they select, along with the distributions of search terms, frequencies, and co-occurrences. These insights can enhance LLMs by providing additional context to medical codes.

Enhancing LLMs with proven AI techniques

To enhance LLMs for medical coding, we leverage several proven techniques, including advanced prompt engineering, retrieval-augmented generation, agents and tools, and fine-tuning. Let’s explore each of these techniques in more detail.

Advanced prompt engineering

Prompt engineering – or the act of writing and refining inputs to elicit high-quality outputs – plays a crucial role in guiding LLMs to generate more accurate medical codes. At IMO Health, using ICD-10-CM codes as examples, we have summarized 22 coding rules and incorporated them as part of the prompts. In doing so, we have observed a significant improvement in the accuracy of generated ICD-10-CM codes compared to using simple questions alone.
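As a rough illustration of folding coding rules into a prompt, the Python sketch below numbers a list of rules and places them ahead of the description to be coded. The three rules shown are generic placeholders written for this example, not IMO Health's 22 rules, and the function name and prompt wording are assumptions.

```python
# Placeholder rules for illustration only (assumption); not IMO Health's rule set.
ILLUSTRATIVE_RULES = [
    "Code to the highest level of specificity the description supports.",
    "Do not assume laterality or severity that is not stated.",
    "Return a single ICD-10-CM code and nothing else.",
]


def build_coding_prompt(description, rules):
    """Assemble a prompt that states the rules before the term to be coded."""
    numbered = "\n".join(
        f"{i}. {rule}" for i, rule in enumerate(rules, start=1)
    )
    return (
        "You are assisting with ICD-10-CM coding.\n"
        "Follow these rules:\n"
        f"{numbered}\n\n"
        f"Description: {description.strip()}\n"
        "Code:"
    )


prompt = build_coding_prompt(
    "type 2 diabetes mellitus without complications", ILLUSTRATIVE_RULES
)
print(prompt)
```

Keeping the rules in a list rather than in one hand-written string makes it easier to add, remove, or reorder rules and to compare accuracy between rule sets.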

Retrieval-augmented generation (RAG)

Retrieval-augmented generation involves having the LLM reference relevant medical coding information retrieved from IMO Health’s terminology and normalization application programming interfaces (APIs).

By leveraging retrieved codes from IMO Health’s terminologies, we minimize the occurrence of fake or inaccurate codes and reduce hallucinations, or outputs that are nonsensical or entirely fabricated (Figure 2). This approach simplifies the task from generating codes to selecting from pre-existing candidates, thus reducing the reliance on prior knowledge from the base LLMs.

As a result, it becomes possible to use smaller LLMs to build lower-cost and faster-running solutions.
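The shift from generating codes to selecting among retrieved candidates can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the candidate list is hard-coded where a real system would call a terminology service, `ask_llm` is any callable that returns model text, and `fake_llm` is a stand-in so the example runs offline. None of these names come from IMO Health's APIs.

```python
def select_code(description, candidates, ask_llm):
    """Ask the model to choose among retrieved candidates; reject anything else.

    candidates: list of (code, title) pairs produced by a retrieval step.
    ask_llm: callable that takes a prompt string and returns the model's text.
    """
    if not candidates:
        return None
    options = "\n".join(f"- {code}: {title}" for code, title in candidates)
    prompt = (
        "Choose the single best code for the description below.\n"
        "Answer with one code from the list and nothing else.\n\n"
        f"Description: {description}\n"
        f"Candidates:\n{options}\n"
        "Answer:"
    )
    answer = ask_llm(prompt).strip().upper()
    allowed = {code.upper() for code, _ in candidates}
    # A code outside the retrieved set is treated as fabricated and dropped.
    return answer if answer in allowed else None


# Hypothetical retrieval result (assumption).
candidates = [
    ("E11.9", "Type 2 diabetes mellitus without complications"),
    ("E11.65", "Type 2 diabetes mellitus with hyperglycemia"),
]


def fake_llm(prompt):
    # Stand-in for a real model call.
    return " e11.9 "


print(select_code("T2DM, no complications", candidates, fake_llm))  # E11.9
```

Because the final check only accepts codes from the retrieved set, a fabricated code can never reach the output; the worst case is that no code is returned and the term is flagged for review.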

Agents and tools

An LLM agent is a specialized AI system designed to perform specific tasks or functions within a larger AI ecosystem. These agents are often built on top of foundational LLMs and are trained to handle particular domains or use cases.

At IMO Health, we formalize mapping and editorial guidelines into prompts that give the LLM a chain of thought to follow when performing medical coding. We also instruct the LLM to call upon IMO Health tools and APIs, including natural language processing (NLP) pipelines² and our normalization solution, IMO Precision Normalize³, when applicable. With agents, the output is explainable, trustworthy, and acceptable to human medical coders rather than the product of a black box (Figure 3).
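One way to picture the tool-calling part of an agent is a small dispatcher that runs each tool the model asks for and records every step, so a reviewer can see how a code was reached. The Python sketch below is an assumption-laden illustration: `normalize_term` and `lookup_code` are local stand-ins rather than real IMO Health API calls, and the scripted JSON strings take the place of actual model output.

```python
import json


def normalize_term(term):
    # Hypothetical stand-in for a normalization service (assumption).
    lookup = {"t2dm": "type 2 diabetes mellitus"}
    return lookup.get(term.strip().lower(), term)


def lookup_code(term):
    # Hypothetical stand-in for a terminology lookup (assumption).
    codes = {"type 2 diabetes mellitus": "E11.9"}
    return codes.get(term)


TOOLS = {"normalize_term": normalize_term, "lookup_code": lookup_code}


def run_tool_calls(requests, tools):
    """Run model-requested tool calls in order and keep a reviewable trace."""
    trace = []
    for raw in requests:
        request = json.loads(raw)
        name = request.get("tool")
        args = request.get("args", {})
        if name not in tools:
            # Unknown tools are recorded, never executed.
            trace.append({"tool": name, "args": args, "error": "unknown tool"})
            continue
        trace.append({"tool": name, "args": args, "result": tools[name](**args)})
    return trace


# Scripted JSON strings stand in for what a model would emit at each step.
requests = [
    '{"tool": "normalize_term", "args": {"term": "T2DM"}}',
    '{"tool": "lookup_code", "args": {"term": "type 2 diabetes mellitus"}}',
    '{"tool": "guess_code", "args": {"term": "T2DM"}}',
]
for step in run_tool_calls(requests, TOOLS):
    print(step)
```

The trace is the part that supports explainability: each entry shows which tool ran, with what arguments, and what it returned.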

Fine-tuning

Fine-tuning involves further training the base LLM on high-quality medical coding datasets to improve its understanding of the medical coding task. By exposing the LLM to a large volume of relevant data, including IMO Health terminology synonyms, mapping relationships, and historical product logs, we can fine-tune it to better capture the nuances and intricacies of medical coding.
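Preparing such data often starts with turning term-to-code mappings into training records. The Python sketch below shows one possible shape, assuming a simple prompt/completion JSON Lines layout; the field names, the prompt wording, and the sample mappings are all assumptions, and the format a given fine-tuning service expects may differ.

```python
import json


def to_training_records(mappings):
    """Convert (term, code) pairs into prompt/completion records."""
    records = []
    for term, code in mappings:
        term, code = term.strip(), code.strip()
        if not term or not code:
            # Skip incomplete pairs rather than train on empty text.
            continue
        records.append(
            {"prompt": f"Assign an ICD-10-CM code: {term}", "completion": code}
        )
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Illustrative synonym-to-code pairs (assumption), including one bad row.
mappings = [
    ("T2DM", "E11.9"),
    ("  high blood pressure ", "I10"),
    ("", "R69"),
]
records = to_training_records(mappings)
print(to_jsonl(records))
```

Listing several synonyms for the same code, as in a terminology with many lexical variants per concept, gives the model multiple surface forms that should all lead to one answer.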

The result: better performance on medical coding

Improved mapping accuracy

In a recent test on a typical dataset, the top-performing out-of-the-box LLM achieved an accuracy of 45% on ICD-10-CM code prediction. By enhancing the LLM with the UMLS as a RAG resource, we were able to improve the performance to 64%.

However, when using the IMO Precision Normalize API without any LLMs, the performance reached an impressive 83%. This is primarily due to the extensive coverage of challenging terms by synonyms in IMO Health’s comprehensive medical terminology.

When we evaluated the medical coding solution powered by IMO Clinical AI, which combines LLMs with our proprietary resources and techniques, the performance reached 90% accuracy on the same dataset (Figure 4). This demonstrates the effectiveness of IMO Health’s approach in delivering highly accurate medical coding results.

Enriched results with secondary codes and HCC integration

Using IMO Health as a bridge to medical codes offers several benefits beyond improved accuracy. IMO Health’s terminology not only returns the preferred primary code but also provides preferred secondary codes, cross-referenced to multiple terminologies. This captures the detailed semantic differences between medical codes, providing a more comprehensive and precise coding output (Figure 5).

IMO Health’s terminology also includes Hierarchical Condition Category (HCC) scores, which are crucial for risk adjustment and reimbursement purposes in value-based care models. By integrating HCC scores directly into the coding process, IMO Clinical AI streamlines the workflow and eliminates the need for manual HCC assignment.

Explainable and trustworthy code selections

One of the key advantages of IMO Clinical AI is its ability to explain why certain medical codes are chosen and why they are more suitable compared to other similar codes. By prompting the LLM with our mapping knowledge and terminology resources, the generated explanations are more clinically logical, with fewer hallucinations and false statements. This makes the results more acceptable and trustworthy to medical coders when they review the output.

The explainable nature of IMO Clinical AI code selections is particularly valuable when there is ambiguity or multiple potential codes for a given medical condition or procedure. By providing clear and clinically sound reasoning for the chosen codes, the system instills confidence in medical coders and facilitates a more efficient review process.

This transparency also enables coders to quickly identify and address any potential discrepancies or uncommon cases, further improving the overall accuracy and reliability of the coding output (Figure 6).

Cost-efficiency optimization

IMO Clinical AI doesn’t simply rely on LLMs for all medical coding tasks. Thanks to IMO Health’s comprehensive terminology synonyms and mappings, many input diagnosis terms are already covered directly without the need for LLMs. Only the uncovered terms or terms with low confidence scores are sent to LLMs for further analysis.

In an early study, only 25.1% of input diagnosis terms required LLMs, while the overall accuracy on the entire dataset increased from 82.9% to 90.0%, a gain of 7.1 percentage points.

This selective use of LLMs offers significant cost-efficiency benefits. By leveraging IMO Health’s extensive terminology resources as the foundation and using LLMs judiciously for more complex or ambiguous cases, the system optimizes computational resources and reduces the overall cost of the medical coding process. This cost-efficiency, combined with the high accuracy and explainability of the system, makes IMO Clinical AI an attractive solution for healthcare organizations looking to automate their medical coding workflows.
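The routing described in this section can be sketched in Python as a lookup with an LLM fallback. This is a minimal illustration, not IMO Clinical AI's implementation: the lookup table, the confidence scores, the 0.8 threshold, and the function names are all assumptions, and `fake_llm` stands in for a real model call.

```python
def route_term(term, terminology, ask_llm, threshold=0.8):
    """Return (code, source); call the LLM only when the lookup is not confident."""
    hit = terminology.get(term.strip().lower())
    if hit is not None:
        code, confidence = hit
        if confidence >= threshold:
            return code, "terminology"
    # Uncovered or low-confidence terms fall through to the model.
    return ask_llm(term), "llm"


def llm_share(terms, terminology, ask_llm, threshold=0.8):
    """Fraction of terms that had to be sent to the LLM."""
    if not terms:
        return 0.0
    sources = [route_term(t, terminology, ask_llm, threshold)[1] for t in terms]
    return sources.count("llm") / len(terms)


# Hypothetical lookup table: term -> (code, confidence score).
terminology = {
    "essential hypertension": ("I10", 0.99),
    "type 2 diabetes mellitus": ("E11.9", 0.95),
    "asthma": ("J45.909", 0.60),
}


def fake_llm(term):
    # Stand-in for a real model call.
    return "UNKNOWN"


terms = ["Essential hypertension", "type 2 diabetes mellitus", "asthma", "chest pain"]
print(llm_share(terms, terminology, fake_llm))  # 0.5
```

In this toy example half the terms reach the model: one because its lookup score falls below the threshold and one because it is not covered at all. Raising or lowering the threshold trades model cost against how much the lookup is trusted.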

Conclusion

IMO Clinical AI represents a notable breakthrough in medical coding automation. By leveraging LLMs alongside IMO Health’s extensive medical terminology resources and proprietary techniques, we deliver remarkable accuracy and efficiency.

With features like improved mapping accuracy, comprehensive code coverage, HCC score integration, explainable code selections, and cost-efficiency optimization, IMO Clinical AI is poised to transform the medical coding landscape, saving time, reducing errors, and ultimately improving patient care.

Our selective use of LLMs ensures that computational resources are utilized effectively, minimizing costs while maintaining accuracy. As healthcare organizations seek to optimize revenue cycle management and improve clinical documentation, IMO Clinical AI offers a reliable, transparent, and cost-efficient solution, driving significant value and efficiency gains.

By combining IMO Health’s rich terminology resources with the power of LLMs, IMO Clinical AI sets a new standard for automated medical coding, enabling healthcare providers to focus on delivering high-quality patient care while streamlining their coding processes.


¹Soroush, A., Glicksberg, B. S., Zimlichman, E., Barash, Y., Freeman, R., Charney, A. W., … & Klang, E. (2024). Large Language Models Are Poor Medical Coders—Benchmarking of Medical Code Querying. NEJM AI, AIdbp2300040.

²IMO Entity Extraction API: https://developer.imohealth.com/api-catalog/entity-extraction

³IMO Precision Normalize API: https://developer.imohealth.com/api-catalog/imor-precision-normalize-api

⁴IMO Studio: https://studio.imohealth.com/

RxNorm® is a registered trademark of the National Library of Medicine.

SNOMED and SNOMED CT® are registered trademarks of SNOMED International.
