Implement Advanced Terminology Grounding for Medication and Condition Mapping #21

Description

@nicoloesch

Motivation

We need to make our data grounding pipeline for the OMOP CDM more robust. Currently, standard vocabulary mapping leaves significant gaps in source data retention (particularly for medications) and introduces downstream complexity when navigating hierarchies (particularly for conditions).

National Drug Codes (NDCs) change constantly, and local source systems often contain custom text or lack mappings altogether. Losing 30–40% of drug codes during ETL because they are unmapped severely reduces the data's research value.

Instead of relying solely on exact string matches or outdated OMOP CONCEPT_RELATIONSHIP tables, we could use semantic embeddings. By converting drug names into vector embeddings, we can find the "closest match" geometrically, verify that the relationship makes sense, and ensure the mapped code falls into the correct therapeutic drug class (e.g., ATC class).


Pitch

1. Embedding-Based Mapping for Unmapped Drug Vocabularies (NDC -> RxNorm)

  • Problem: Manual mapping is a bottleneck, and standard exact-match lookups result in a 30–40% data loss for legacy or poorly coded drug names.
  • Solution: Implement a semantic embedding workflow (e.g., using a clinical LLM or a biomedical embedding model) to calculate vector similarity between unmapped source drug strings and standard RxNorm concepts.
  • Validation: The system should identify the closest geometric match, evaluate neighboring concept relationships, and verify that the target falls within the expected therapeutic drug class.
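
A minimal sketch of the match-then-validate flow described above, under stated assumptions. The `embed` function is a toy character-trigram stand-in so that the example runs without dependencies; it is lexical, not semantic, and a real pipeline would presumably swap in a biomedical embedding model. The candidate list, concept IDs, `ground_drug`, and the `min_score` threshold are all hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical candidate set: (placeholder concept_id, name, ATC class prefix).
RXNORM_CANDIDATES = [
    (1001, "metformin hydrochloride 500 MG Oral Tablet", "A10"),
    (1002, "metoprolol tartrate 50 MG Oral Tablet", "C07"),
    (1003, "atorvastatin calcium 20 MG Oral Tablet", "C10"),
]

def ground_drug(source_name, expected_atc=None, min_score=0.3):
    """Return the closest candidate, or None if it fails validation."""
    src = embed(source_name)
    scored = sorted(
        ((cosine(src, embed(name)), cid, name, atc)
         for cid, name, atc in RXNORM_CANDIDATES),
        reverse=True,
    )
    score, cid, name, atc = scored[0]
    if score < min_score:
        return None  # nothing is close enough; leave for manual review
    if expected_atc is not None and not atc.startswith(expected_atc):
        return None  # closest match falls outside the expected drug class
    return {"concept_id": cid, "name": name, "atc": atc, "score": score}

match = ground_drug("METFORMIN HCL 500MG TAB", expected_atc="A10")
print(match)
```

Returning `None` instead of the next-best candidate is a deliberate choice in this sketch: a failed class check is treated as a signal for manual review rather than silently falling back to a weaker match.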

2. Compositional Parsing for Custom/Local Medication Codes

  • Problem: Source data contains localized/proprietary codes representing compounded medications or custom mixtures that lack direct RxNorm representation.
  • Solution: Develop a mechanism to ingest the ingredients list of these custom formulations. Use a compositional query approach to break down the mixture by its active components and group/map the local code under the appropriate high-level RxNorm Ingredient or clinical drug form group.
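
The decomposition step could look roughly like the sketch below. It assumes the ingredients list arrives as a semicolon-separated string with strengths following each name; the `INGREDIENTS` lookup, its placeholder IDs, and the `decompose` helper are hypothetical, and in practice the lookup would be backed by RxNorm Ingredient concepts in the vocabulary tables.

```python
import re

# Hypothetical lookup from normalised ingredient name to a placeholder concept ID.
INGREDIENTS = {
    "lidocaine": 9001,
    "diphenhydramine": 9002,
    "nystatin": 9003,
}

def decompose(formulation: str) -> dict:
    """Split a custom formulation into mapped and unmapped components."""
    mapped, unmapped = {}, []
    for part in formulation.split(";"):
        part = part.strip()
        if not part:
            continue
        # Keep only the text before the first strength, e.g. "Lidocaine 2%" -> "lidocaine".
        name = re.split(r"\s+\d", part, maxsplit=1)[0].strip().lower()
        if name in INGREDIENTS:
            mapped[name] = INGREDIENTS[name]
        else:
            unmapped.append(name)
    return {"mapped": mapped, "unmapped": unmapped}

result = decompose("Lidocaine 2%; Diphenhydramine 12.5 mg/5 mL; Cherry flavoring")
print(result)
```

Keeping the unmapped components in the result, rather than dropping them, lets the ETL distinguish inactive excipients from active ingredients that are simply missing from the lookup.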

3. SNOMED Hierarchy Simplification via CCSR Mapping

  • Problem: SNOMED CT’s deeply nested polyhierarchical structure makes downstream data grounding, cohort building, and feature engineering overly complex.
  • Solution: Build a transformation layer that collapses the complex SNOMED hierarchy into a flatter, categorical graph structure using the HCUP Clinical Classifications Software Refined (CCSR) framework. This will allow us to map granular clinical findings into stable, well-defined clinical categories.
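
One way to picture the collapse is an ancestor walk that stops at the nearest concept carrying a category, sketched below. CCSR itself is defined over ICD-10-CM codes, so the `ANCHORS` table (SNOMED concept to category) is an assumption that would have to be derived from an intermediate mapping. The hierarchy, the category labels, and `ccsr_categories` are hypothetical and use concept names in place of real identifiers.

```python
from collections import deque

# Hypothetical child -> parents edges; a concept may have several parents.
PARENTS = {
    "Diabetic nephropathy": ["Diabetes complication", "Kidney disease"],
    "Diabetes complication": ["Diabetes mellitus"],
    "Type 2 diabetes": ["Diabetes mellitus"],
    "Diabetes mellitus": ["Endocrine disorder"],
    "Kidney disease": ["Genitourinary disorder"],
}

# Hypothetical anchors: concepts that carry a category label directly.
ANCHORS = {
    "Diabetes mellitus": "ENDO_DIABETES",
    "Kidney disease": "GU_KIDNEY",
}

def ccsr_categories(concept: str) -> list:
    """Collect the category of the nearest anchored ancestor on every path."""
    found, seen = set(), {concept}
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        if node in ANCHORS:
            found.add(ANCHORS[node])
            continue  # stop climbing once this path is categorised
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(found)

print(ccsr_categories("Diabetic nephropathy"))
print(ccsr_categories("Type 2 diabetes"))
```

Because the hierarchy is polyhierarchical, a single finding can land in more than one category; whether to keep all of them or pick a default is left open here.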

Alternatives

No response


Additional context

  • Reference for CCSR: HCUP CCSR Tools & Software
  • Target Model: OMOP CDM v5.4 / v6.0 (CONCEPT_RELATIONSHIP, DRUG_EXPOSURE, CONDITION_OCCURRENCE)

Metadata

  • Labels: enhancement