Motivation
We need to make our data grounding pipeline for the OMOP CDM both more efficient and more robust. Currently, standard vocabulary mapping leaves significant gaps in source data retention (particularly for medications) and adds downstream complexity when navigating hierarchies (particularly for conditions).
National Drug Codes (NDCs) change constantly, and local source systems often carry custom free text or lack mappings altogether. Losing 30–40% of drug codes during ETL because they are unmapped destroys much of the data's research value.
Instead of relying solely on exact string matches or outdated OMOP CONCEPT_RELATIONSHIP tables, we could use semantic embeddings. By converting drug names into vector embeddings, we can find the "closest match" geometrically, verify that the relationship makes sense, and ensure the mapped code falls into the correct therapeutic drug class (e.g., ATC class).
Pitch
1. Embedding-Based Mapping for Unmapped Drug Vocabularies (NDC -> RxNorm)
- Problem: Manual mapping is bottlenecked, and standard exact-match lookups result in a 30-40% data loss for legacy or poorly coded drug names.
- Solution: Implement a semantic embedding workflow (e.g., using a clinical LLM/biomedical embeddings model) to calculate vector similarity between unmapped source drug strings and standard RxNorm concepts.
- Validation: The system should identify the closest geometric match, evaluate neighboring concept relationships, and verify that the target falls within the expected therapeutic drug class.
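A minimal sketch of the match-then-validate flow described above. It is not a proposed implementation: the character-trigram `embed` function is only a toy stand-in for a real biomedical embedding model, the `CANDIDATES` table and its concept IDs are placeholders, and the `threshold` value is an arbitrary assumption. Only the ATC prefixes shown are real class codes.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for an embedding model: a bag of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical target concepts: (placeholder concept id, name, ATC class prefix).
CANDIDATES = [
    (1, "metformin hydrochloride 500 MG Oral Tablet", "A10"),
    (2, "atorvastatin 20 MG Oral Tablet", "C10"),
    (3, "lisinopril 10 MG Oral Tablet", "C09"),
]

def map_drug(source_name, expected_atc=None, threshold=0.3):
    """Return the closest candidate, or None if it fails either check."""
    src = embed(source_name)
    scored = sorted(
        ((cosine(src, embed(name)), cid, name, atc) for cid, name, atc in CANDIDATES),
        reverse=True,
    )
    score, cid, name, atc = scored[0]
    if score < threshold:
        return None  # nothing is close enough; leave for manual review
    if expected_atc is not None and not atc.startswith(expected_atc):
        return None  # closest match sits in the wrong therapeutic class
    return {"concept_id": cid, "name": name, "atc": atc, "score": score}

match = map_drug("METFORMIN HCL 500MG TAB", expected_atc="A10")
print(match["name"] if match else "unmapped")
```

Rejecting a match outright when the class check fails is a deliberately conservative choice for the sketch; a real pipeline might instead fall through to the next-nearest candidate or queue the record for review.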
2. Compositional Parsing for Custom/Local Medication Codes
- Problem: Source data contains localized/proprietary codes representing compounded medications or custom mixtures that lack direct RxNorm representation.
- Solution: Develop a mechanism to ingest the ingredients list of these custom formulations. Use a compositional query approach to break down the mixture by its active components and group/map the local code under the appropriate high-level RxNorm Ingredient or clinical drug form group.
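One way the compositional breakdown could look, as a rough sketch under stated assumptions: the ingredient list arrives as a single delimited string, strengths can be stripped with a simple regular expression, and `INGREDIENTS` is a hypothetical lookup table whose IDs are placeholders rather than real RxNorm or OMOP concept IDs.

```python
import re

# Hypothetical ingredient lookup; the ids are placeholders, not real concept ids.
INGREDIENTS = {
    "lidocaine": 101,
    "diphenhydramine": 102,
    "aluminum hydroxide": 103,
}

# Strips simple strengths such as "2%", "12.5 mg" or "200 mg".
STRENGTH = re.compile(r"\b\d+(?:\.\d+)?\s*(?:(?:mg|mcg|ml|g)\b|%)", re.IGNORECASE)

def decompose(local_code, formulation):
    """Split a custom formulation into components and map each to an ingredient."""
    cleaned = STRENGTH.sub(" ", formulation.lower())
    # Components are assumed to be separated by '/', '+', ',' or ';'.
    parts = [" ".join(p.split()) for p in re.split(r"[/+,;]", cleaned)]
    mapped, unmapped = [], []
    for part in filter(None, parts):
        if part in INGREDIENTS:
            mapped.append({"ingredient": part, "concept_id": INGREDIENTS[part]})
        else:
            unmapped.append(part)  # keep for review rather than dropping silently
    return {"local_code": local_code, "mapped": mapped, "unmapped": unmapped}

result = decompose(
    "LOCAL-001",
    "Lidocaine 2% / Diphenhydramine 12.5 mg / Aluminum hydroxide 200 mg + Flavoring",
)
print([m["ingredient"] for m in result["mapped"]], result["unmapped"])
```

Keeping the unmapped components in the result, instead of discarding them, preserves the source information that the proposal is trying to retain.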
3. SNOMED Hierarchy Simplification via CCSR Mapping
- Problem: SNOMED CT’s deeply nested polyhierarchical structure makes downstream data grounding, cohort building, and feature engineering overly complex.
- Solution: Build a transformation layer that collapses the complex SNOMED hierarchy into a flatter, categorical graph structure using the HCUP Clinical Classifications Software Refined (CCSR) framework. This will allow us to map granular clinical findings into stable, well-defined clinical categories.
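The collapsing step might be sketched as an upward walk through the polyhierarchy that stops at the nearest ancestor carrying a category on each path. Everything in this example is hypothetical: the concept names, the `PARENTS` edges, and the `CAT_*` labels are illustrative placeholders rather than real SNOMED CT concepts or CCSR category codes, and how categories get attached to SNOMED concepts in the first place is left open.

```python
from collections import deque

# Hypothetical polyhierarchy: child -> list of parents (a concept may have several).
PARENTS = {
    "diabetic nephropathy": ["diabetes complication", "kidney disease"],
    "diabetes complication": ["diabetes mellitus"],
    "diabetes mellitus": ["endocrine disorder"],
    "kidney disease": ["genitourinary disorder"],
}

# Hypothetical category assignments at selected anchor concepts.
CATEGORY = {
    "diabetes mellitus": "CAT_DIABETES",
    "kidney disease": "CAT_KIDNEY",
}

def categories_for(concept):
    """Collect the nearest category on every upward path from the concept."""
    found, seen, queue = set(), {concept}, deque([concept])
    while queue:
        node = queue.popleft()
        if node in CATEGORY:
            found.add(CATEGORY[node])
            continue  # stop climbing this path once a category is reached
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found

print(sorted(categories_for("diabetic nephropathy")))
```

Returning a set rather than a single label keeps the multi-parent nature of the source hierarchy visible: a finding with two clinically distinct parents lands in two flat categories instead of being forced into one.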
Alternatives
No response
Additional context
- Reference for CCSR: HCUP CCSR Tools & Software
- Target Model: OMOP CDM v5.4 / v6.0 (CONCEPT_RELATIONSHIP, DRUG_EXPOSURE, CONDITION_OCCURRENCE)