Skip to content

N+1 SQL round-trips in ancestry validation during concept_ground (find_standard_paths / find_standard_concepts) #24

Description

@nicoloesch

Motivation

Expensive round trips to the OMOP CDM DB cause slow retrieval of data, that can be optimised:

  • find_standard_concepts() (omop_graph/src/omop_graph/reasoning/grounding.py) loops over every anchor concept ID (for parent in parent_ids: find_standard_paths(kg, candidate, target=parent, ...)), e.g. 21
    anchors for the Observation domain.
  • find_standard_paths() (omop_graph/src/omop_graph/graph/paths.py), during BFS graph traversal, calls
    kg.get_potential_ancestor(child_id=subject_node.concept_id, parent_id=target) once per standard concept
    node it encounters.
  • get_potential_ancestor() (omop_graph/src/omop_graph/graph/kg.py) opens a session and runs a single-pair point lookup (q_concept_potential_ancestor in omop_graph/src/omop_graph/graph/queries.py)
    • PK-indexed and fast on its own (~0.5ms), but called once per (candidate, parent_id) pair.

Volume: N candidates × ~10 standard concepts per BFS × 21 parent_ids ≈ up to 2,100 SQL round trips per
concept_ground call. At ~0.5ms each that's already ~1s in pure SQL time, plus per-round-trip Python/network
overhead, accounting for the bulk of the observed 5-6s.

This was previously masked by a much larger bottleneck in omop-alchemy.

Pitch

Add a batched ancestor-check method so each candidate makes one round trip instead of up to 21:

# omop_graph/src/omop_graph/graph/kg.py
def get_potential_ancestors_batch(
    self, child_id: int, parent_ids: tuple[int, ...]
) -> set[int]:
    """Return the subset of parent_ids that are ancestors of child_id."""
    ...
# omop_graph/src/omop_graph/graph/queries.py
def q_concept_potential_ancestors_batch(
    child_id: int, parent_ids: tuple[int, ...]
) -> Select:
    return select(Concept_Ancestor.ancestor_concept_id).where(
        and_(
            Concept_Ancestor.descendant_concept_id == child_id,
            Concept_Ancestor.ancestor_concept_id.in_(parent_ids),
            Concept_Ancestor.min_levels_of_separation > 0,
        )
    )

Update find_standard_paths() to accept the full set of parent_ids and use the batched result instead of
checking one target at a time, and update find_standard_concepts() to pass all anchors through in a
single call rather than looping and calling find_standard_paths() per anchor.

Expected improvement: ~2,100 round trips → ~N (one per candidate). Embedding tier: ~5-6s → ~0.5-1s.

Alternatives

  • Pre-materialised descendant sets per anchor: at grounding startup, load the full descendant
    concept_id set for each anchor into memory once; replace the SQL ancestry check with an O(1) set
    membership test. Avoids per-request round trips entirely, at the cost of a one-time startup load
    (~1-2s per anchor group) and higher memory use. Worth it if anchor sets are stable across requests.
  • Push ancestry into the SQL query itself: already done for the ExactLabel/ExactSynonym resolvers (the
    functional-index fix joins directly against concept_ancestor in one query). The FTS resolver could do
    the same, but its candidate set comes from tsvector hits rather than a name-equality filter, so the
    ancestry check has to happen after the fact. This may be tolerable today only because FTS result counts are small
    (<20).

Recommend the batched-query approach (Pitch) first: smallest change, same code structure, no caching
invalidation concerns.

Additional context

  • Profiling source: 2026-06-15 grounding pipeline investigation. Full concept_ground timing before any
    fixes:
    • ExactLabelResolver+ExactSynonymResolver: 26,142ms
    • FullTextResolver+FullTextSynonymResolver: 211ms
    • EmbeddingResolver: 886ms
    • PartialLabelResolver+PartialSynonymResolver: 3,337ms
    • Total: 30,591ms

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions