Implement Advanced Terminology Grounding for Medication and Condition Mapping

### Motivation

We need to make our data grounding pipeline for the OMOP CDM more efficient and more robust. Currently, standard vocabulary mapping leaves significant gaps in source data retention (particularly for medications) and introduces downstream complexity when navigating hierarchies (particularly for conditions).

> National Drug Codes (NDCs) change constantly, and local source systems often have custom text or missing maps. Losing 30–40% of your drug codes during ETL due to unmapped data destroys the data's research value.

Instead of relying solely on exact string matches or outdated OMOP `CONCEPT_RELATIONSHIP` tables, we could use semantic embeddings. By converting drug names into vector embeddings, we can find the closest match by vector similarity, check that the relationship makes sense, and confirm that the mapped code falls into the correct therapeutic drug class (e.g., ATC class).

---

### Pitch

#### 1. Embedding-Based Mapping for Unmapped Drug Vocabularies (NDC -> RxNorm)
* **Problem:** Manual mapping is a bottleneck, and standard exact-match lookups lose 30–40% of drug codes for legacy or poorly coded drug names.
* **Solution:** Implement a semantic embedding workflow (e.g., using a clinical LLM or a biomedical embedding model) to compute vector similarity between unmapped source drug strings and standard RxNorm concepts.
* **Validation:** The system should identify the closest geometric match, evaluate neighboring concept relationships, and verify that the target falls within the expected therapeutic drug class.
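
A minimal sketch of the match-then-validate flow described above, using only the standard library. Everything here is an assumption for illustration: `embed` is a toy character-trigram stand-in for a real biomedical embedding model, `CANDIDATES` is a hypothetical in-memory list with placeholder concept IDs (not real RxNorm identifiers), and the `min_score` threshold is arbitrary.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for an embedding model: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical candidate targets: (placeholder concept id, name, ATC class).
CANDIDATES = [
    (1, "metformin hydrochloride 500 MG Oral Tablet", "A10BA"),
    (2, "atorvastatin 20 MG Oral Tablet", "C10AA"),
    (3, "lisinopril 10 MG Oral Tablet", "C09AA"),
]

def ground_drug(source_name, expected_atc=None, min_score=0.3):
    """Return the closest candidate, or None if it fails validation."""
    src = embed(source_name)
    scored = sorted(
        ((cosine(src, embed(name)), cid, name, atc)
         for cid, name, atc in CANDIDATES),
        reverse=True,
    )
    score, cid, name, atc = scored[0]
    if score < min_score:
        return None  # nothing is close enough; leave for manual review
    if expected_atc and not atc.startswith(expected_atc):
        return None  # closest match sits in the wrong therapeutic class
    return {"concept_id": cid, "name": name, "atc": atc, "score": score}

match = ground_drug("METFORMIN HCL 500MG TAB", expected_atc="A10")
print(match["name"] if match else "unmapped")
```

Rejecting a match whose ATC class disagrees with the expected one is what keeps a high-similarity but clinically wrong neighbor from being written to `DRUG_EXPOSURE`; a real pipeline would likely route such rejects to review rather than drop them.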

#### 2. Compositional Parsing for Custom/Local Medication Codes
* **Problem:** Source data contains localized/proprietary codes representing compounded medications or custom mixtures that lack direct RxNorm representation.
* **Solution:** Develop a mechanism to ingest the ingredients list of these custom formulations. Use a compositional query approach to break down the mixture by its active components and group/map the local code under the appropriate high-level RxNorm Ingredient or clinical drug form group.
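
One way the compositional step could look, as a rough sketch. The ingredient lookup `INGREDIENT_CONCEPTS`, its placeholder IDs, the delimiter set, and the local code `LOC-001` are all hypothetical; a real implementation would resolve ingredients against the RxNorm Ingredient concepts loaded in the vocabulary tables.

```python
import re

# Hypothetical lookup: normalized ingredient name -> placeholder concept id.
INGREDIENT_CONCEPTS = {"lidocaine": 101, "prilocaine": 102, "gabapentin": 103}

def map_local_code(local_code, ingredient_text):
    """Split a free-text ingredient list and map each component."""
    mapped, unmapped = [], []
    # Assumed delimiters between components: "/", ",", "+", ";".
    for part in re.split(r"\s*[/,+;]\s*", ingredient_text):
        tokens = re.findall(r"[a-z]+", part.lower())
        hit = next((t for t in tokens if t in INGREDIENT_CONCEPTS), None)
        if hit:
            mapped.append((hit, INGREDIENT_CONCEPTS[hit]))
        elif part.strip():
            unmapped.append(part.strip())
    return {
        "local_code": local_code,
        "ingredients": sorted(mapped),
        "unmapped": unmapped,
        # Only group the local code automatically when nothing is left over.
        "fully_mapped": bool(mapped) and not unmapped,
    }

result = map_local_code("LOC-001", "Lidocaine 2% / Prilocaine 2% / Aqua base")
print(result["ingredients"], result["unmapped"])
```

Components that do not resolve (here the base) are reported rather than silently dropped, so a reviewer can decide whether they are inactive excipients or missing ingredient mappings.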

#### 3. SNOMED Hierarchy Simplification via CCSR Mapping
* **Problem:** SNOMED CT’s deeply nested polyhierarchical structure makes downstream data grounding, cohort building, and feature engineering overly complex.
* **Solution:** Build a transformation layer that collapses the complex SNOMED hierarchy into a flatter, categorical structure using the **HCUP Clinical Classifications Software Refined (CCSR)** framework. Because CCSR for diagnoses is defined over ICD-10-CM codes, this layer needs a SNOMED-to-ICD-10-CM crosswalk as an intermediate step. This will allow us to map granular clinical findings into stable, well-defined clinical categories.
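
The rollup could be sketched as a walk up the hierarchy followed by a two-hop lookup. CCSR diagnosis categories are keyed on ICD-10-CM codes, so the sketch assumes a SNOMED-to-ICD-10-CM crosswalk is available. All identifiers below (`S_*`, `ICD_*`, `CCSR_*`) are placeholders, not real codes, and the three dictionaries are hypothetical stand-ins for the corresponding vocabulary tables.

```python
from collections import deque

# Placeholder polyhierarchy: child concept -> list of parent concepts.
PARENTS = {
    "S_T2DM_NEPHROPATHY": ["S_T2DM", "S_KIDNEY_DISEASE"],
    "S_T2DM": ["S_DIABETES"],
    "S_KIDNEY_DISEASE": [],
    "S_DIABETES": [],
}
# Placeholder crosswalks: SNOMED -> ICD-10-CM -> CCSR category.
SNOMED_TO_ICD10CM = {"S_T2DM": "ICD_A", "S_KIDNEY_DISEASE": "ICD_B"}
ICD10CM_TO_CCSR = {"ICD_A": "CCSR_ENDOCRINE_X", "ICD_B": "CCSR_GENITOURINARY_Y"}

def ccsr_categories(concept):
    """Breadth-first walk upward; each branch stops at its first mapped node."""
    found, seen, queue = set(), {concept}, deque([concept])
    while queue:
        node = queue.popleft()
        icd = SNOMED_TO_ICD10CM.get(node)
        if icd in ICD10CM_TO_CCSR:
            found.add(ICD10CM_TO_CCSR[icd])
            continue  # mapped: do not climb further on this branch
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(found)

print(ccsr_categories("S_T2DM_NEPHROPATHY"))
```

Because a concept can have several parents, it can land in more than one category; returning a set of categories rather than a single one preserves that, and downstream feature engineering can then choose a default if it needs exactly one.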

---

### Alternatives

_No response_

---

### Additional context

* Reference for CCSR: [HCUP CCSR Tools & Software](https://hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp)
* Target Model: OMOP CDM v5.4 / v6.0 (`CONCEPT_RELATIONSHIP`, `DRUG_EXPOSURE`, `CONDITION_OCCURRENCE`)
