Activate the MONDO close-match guard in disease glom (with build-vs-build clique-diff tool)#883

Open
gaurav wants to merge 3 commits into add-mpo from fix-mondo-close-match-guard

gaurav commented Jun 29, 2026

Summary

Fixes a latent bug in how the disease/phenotype build reads the MONDO_close concord, which
left glom's close-match guard a silent no-op, and adds a reusable build-vs-build clique-diff
tool plus the before/after impact analysis that this change requires.

This is the dedicated follow-up promised in #790, where the same code was touched only enough to
stop the build crashing (validating the real 3-column MONDO_close format while preserving
the dormant x[1] behaviour). Activating the guard changes disease clique merging broadly and is
orthogonal to adding MP, so it lives here with its own evidence.

Note

Based on add-mpo (#790), not main — it builds on that PR's compute_cliques_for_impact_report
refactor. Merge #790 first; this PR will retarget to main automatically.

The bug

compute_cliques_for_impact_report() read MONDO_close (a 3-column subject predicate object
concord) with close_mondos[x[0]].add(x[1]), keying the MONDO subject to column 2, the predicate
(oio:closeMatch), instead of column 3, the close-match object. No clique ever
contains the literal string oio:closeMatch, so glom's close= guard (if cd in newset) never
matched. The guard has therefore never fired, on main and before it; close (but not exact)
matches were free to collapse into exact cliques.

The fix is one line of intent: key on x[2], matching the (stuff[0], stuff[2]) parsing every
other concord in that function already uses.
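
As an illustration only, the difference between the two keyings can be sketched as below. The rows, the tab separator, and the helper names (`read_close_matches`, `merge_is_blocked`) are assumptions made for this sketch; the check is a simplified stand-in for glom's `close=` guard, not its actual implementation.

```python
from collections import defaultdict

# Hypothetical rows from a 3-column concord: subject, predicate, object.
# The identifiers and the tab separator are assumptions for this sketch.
rows = [
    "MONDO:0000001\toio:closeMatch\tMEDDRA:10000001",
    "MONDO:0000002\toio:closeMatch\tMEDDRA:10000002",
]


def read_close_matches(lines, value_column):
    """Map each subject (column 1) to the set of values in value_column."""
    close = defaultdict(set)
    for line in lines:
        x = line.rstrip("\n").split("\t")
        close[x[0]].add(x[value_column])
    return close


def merge_is_blocked(clique, close):
    """Simplified guard: block a merge that would put an identifier in the
    same clique as one of its recorded close matches."""
    return any(cd in clique for member in clique for cd in close.get(member, ()))


candidate = {"MONDO:0000001", "MEDDRA:10000001"}

buggy = read_close_matches(rows, 1)  # x[1]: stores the predicate string
fixed = read_close_matches(rows, 2)  # x[2]: stores the close-match object

print(merge_is_blocked(candidate, buggy))  # guard never matches
print(merge_is_blocked(candidate, fixed))  # guard fires
```

With the `x[1]` keying, every subject maps to `{"oio:closeMatch"}`, which can never be a clique member, so the check is always false.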

Impact (why this needs SME review)

Measured on a full local disease build (Biolink 4.4.3), same intermediate ids/concords,
differing only in x[1] vs x[2]. glom is deterministic — two x[1] builds were clique-identical
— so every difference is signal, not run-to-run noise.
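
One way to express the "clique-identical" check, as a rough sketch: treat each build as a set of cliques and each clique as a set of CURIEs, so that neither clique order nor member order affects the comparison. The function names and sample identifiers are hypothetical and are not taken from the repository.

```python
def clique_signature(cliques):
    """Order-independent fingerprint of a build: a set of member sets."""
    return frozenset(frozenset(members) for members in cliques)


def builds_are_clique_identical(build_a, build_b):
    """True when both builds group exactly the same CURIEs the same way."""
    return clique_signature(build_a) == clique_signature(build_b)


# Hypothetical builds: same cliques, different ordering.
run_1 = [["MONDO:1", "DOID:1"], ["MONDO:2", "UMLS:C2"]]
run_2 = [["UMLS:C2", "MONDO:2"], ["DOID:1", "MONDO:1"]]
# A build where one member has left its clique.
run_3 = [["MONDO:1"], ["MONDO:2", "UMLS:C2"]]

print(builds_are_clique_identical(run_1, run_2))
print(builds_are_clique_identical(run_1, run_3))
```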

  • 1,219 identifiers, all MEDDRA, are dropped from Disease.txt. They were present only because
    the dormant guard let them merge into a MONDO disease clique; correctly kept out, they have no
    independent disease typing in this pipeline and leave the compendium entirely.
  • 1,191 of 365,465 Disease cliques lose ≥1 member; net Disease clique count +20, PhenotypicFeature +13.
  • PhenotypicFeature.txt is essentially unaffected (the close pairs are MONDO-subject).
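
A count like "1,219 identifiers, all MEDDRA" can be derived by set-differencing the CURIEs of two builds and tallying the dropped ones by prefix. The sketch below assumes one JSON object per line with an `identifiers` list whose entries carry the CURIE under the key `i`; that record shape, the helper names, and the sample data are assumptions for illustration, not the babel-clique-diff implementation.

```python
import json
from collections import Counter


def curies_in_compendium(lines):
    """Collect every CURIE mentioned in a JSONL compendium.

    Assumes each record looks like {"identifiers": [{"i": "PREFIX:ID"}, ...]}.
    """
    curies = set()
    for line in lines:
        if line.strip():
            record = json.loads(line)
            curies.update(ident["i"] for ident in record["identifiers"])
    return curies


def dropped_by_prefix(before_lines, after_lines):
    """Tally CURIEs present before but absent after, grouped by prefix."""
    dropped = curies_in_compendium(before_lines) - curies_in_compendium(after_lines)
    return Counter(curie.split(":", 1)[0] for curie in dropped)


# Hypothetical one-clique builds, before and after activating the guard.
before = [json.dumps({"identifiers": [{"i": "MONDO:1"}, {"i": "MEDDRA:9"}]})]
after = [json.dumps({"identifiers": [{"i": "MONDO:1"}]})]

print(dropped_by_prefix(before, after))
```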

Dropping 1,219 MEDDRA codes is correct if they are genuinely distinct concepts (the guard's
premise), but a coverage loss if anything relies on that MEDDRA → MONDO normalization. Per the
project rule against dropping valid identifiers without good reason, this is an SME call — the
evidence is in docs/pipelines/diseasephenotype/mondo-close-match-guard/ (README.md,
clique-diff.csv, dropped-members.csv, summary.json).

New tool: babel-clique-diff

A reusable build-vs-build clique diff (tools/clique_diff) that compares the finished compendia of
two builds and reports split/merged/lost cliques and, crucially, dropped CURIEs.

It is deliberately distinct from source-impact-report: that answers "what does adding source X
do?" by re-glomming intermediates with vs. without one source; this answers "how did the cliques
change between build A and build B?" given the same inputs but different code/config/data.
Because it works on finished compendia rather than glom state, it can also compare a local build
against a published stars.renci.org build with no re-glom — useful for validating any glom-logic
change (close-match, unique_prefixes, overuse filtering) or as a release regression check.

Test plan

  • uv run ruff check / ruff format --check — clean.
  • uv run rumdl check . — clean.
  • uv run pytest -m unit -q — green (243 passed), including new tests/tools/test_clique_diff_tool.py.

Out of scope

🤖 Generated with Claude Code

gaurav and others added 3 commits June 29, 2026 19:40
A new tools/clique_diff package + `babel-clique-diff` entry point that compares the
finished JSONL compendia of two builds and reports which cliques split/merged/lost
members, and — the headline signal — which CURIEs were dropped from the output entirely.

This is distinct from source-impact-report: that answers "what does adding source X do?"
by re-glomming intermediate files with vs. without one source; this answers "how did the
cliques change between build A and build B?" given the same inputs but different code,
config, or upstream data. It works on finished compendia rather than glom state, so it can
also compare a local build against a published stars.renci.org build without re-running
glom — useful for validating any glom-logic change or as a release regression check.

Includes unit tests for the four destination kinds (kept/regrouped/moved/dropped) and a
docs/tools/README.md entry. The test module has a unique basename (test_clique_diff_tool)
to avoid a pytest prepend-mode collision with tests/test_clique_diff.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

compute_cliques_for_impact_report() read MONDO_close with close_mondos[x[0]].add(x[1]),
keying the MONDO subject to column 2 (the predicate, "oio:closeMatch") instead of column 3
(the close-match object). No clique ever contains the literal predicate string, so glom's
`close=` guard never matched and was a silent no-op on every build, letting close (but not
exact) matches collapse into exact cliques. Key on x[2] instead, matching the
(stuff[0], stuff[2]) parsing every other concord in this function already uses.

Activating the guard is a deliberate, SME-reviewable change: on a full local disease build
it drops 1,219 MEDDRA identifiers from Disease.txt (present only via the dormant guard's
incorrect close-as-exact merges). The before/after impact analysis and the SME-facing
dropped-members list are committed under
docs/pipelines/diseasephenotype/mondo-close-match-guard/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Generated by babel-clique-diff against two full local disease builds (x[1] dormant guard
vs x[2] active guard), differing only in the one-line MONDO_close parsing fix. glom is
deterministic (two x[1] builds were clique-identical), so the diff is all signal.

- README.md — the bug, the fix, and the SME decision (1,219 MEDDRA codes leave Disease.txt).
- clique-diff.csv — every changed clique with destination_kind (kept/regrouped/moved/dropped).
- dropped-members.csv — the 1,219 dropped CURIEs with the MONDO clique each left.
- summary.json — per-compendium counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 29, 2026
Cross-link the deferred x[1]->x[2] close-match-guard fix to its dedicated PR (#883), and
note the concrete impact (~1,219 MEDDRA identifiers dropped from Disease.txt) that the
follow-up's before/after analysis quantifies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav commented Jun 30, 2026

Heads up: the babel-clique-diff tool commit from this PR has been extracted into its own standalone PR off main (#885) so it can land independently of the MONDO close-match-guard change here. Once #885 merges, please rebase this PR to drop its now-duplicated babel-clique-diff commit. The new "MP kept disjoint from HP" PR (#886) depends on #885.

gaurav added a commit that referenced this pull request Jun 30, 2026
The extracted babel-clique-diff commit linked to #883's mondo-close-match-guard
evidence dir, which is not present on this standalone tool branch. Replace it with
the general commit-location convention for clique-diff outputs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 30, 2026
Cross-link the deferred x[1]->x[2] close-match-guard fix to its dedicated PR (#883), and
note the concrete impact (~1,219 MEDDRA identifiers dropped from Disease.txt) that the
follow-up's before/after analysis quantifies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 30, 2026
Add docs/sources/MP/disjointness.md explaining the post-glom split, why
unique_prefixes/concord-dropping are insufficient, and the measured impact
(added/split/moved/deleted) from babel-clique-diff comparing the overlap-allowed
build to the disjoint build. Commit the clique-diff CSV + summary JSON under
docs/sources/MP/disjointness/. Update the MP and HP READMEs: MP/HP are now disjoint,
so an MP clique carries only the Mammalia taxon and an HP clique only Homo sapiens
(correcting the earlier "mixed cliques carry both taxa" note). Cross-link prior PRs
(#790, #300, #883, #742/#781).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>