Activate the MONDO close-match guard in disease glom (with build-vs-build clique-diff tool) #883
Open
gaurav wants to merge 3 commits into
A new tools/clique_diff package + `babel-clique-diff` entry point that compares the finished JSONL compendia of two builds and reports which cliques split/merged/lost members, and — the headline signal — which CURIEs were dropped from the output entirely. This is distinct from source-impact-report: that answers "what does adding source X do?" by re-glomming intermediate files with vs. without one source; this answers "how did the cliques change between build A and build B?" given the same inputs but different code, config, or upstream data. It works on finished compendia rather than glom state, so it can also compare a local build against a published stars.renci.org build without re-running glom — useful for validating any glom-logic change or as a release regression check. Includes unit tests for the four destination kinds (kept/regrouped/moved/dropped) and a docs/tools/README.md entry. The test module has a unique basename (test_clique_diff_tool) to avoid a pytest prepend-mode collision with tests/test_clique_diff.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
compute_cliques_for_impact_report() read MONDO_close with close_mondos[x[0]].add(x[1]), keying the MONDO subject to column 2 (the predicate, "oio:closeMatch") instead of column 3 (the close-match object). No clique ever contains the literal predicate string, so glom's `close=` guard never matched and was a silent no-op on every build, letting close (but not exact) matches collapse into exact cliques. Key on x[2] instead, matching the (stuff[0], stuff[2]) parsing every other concord in this function already uses. Activating the guard is a deliberate, SME-reviewable change: on a full local disease build it drops 1,219 MEDDRA identifiers from Disease.txt (present only via the dormant guard's incorrect close-as-exact merges). The before/after impact analysis and the SME-facing dropped-members list are committed under docs/pipelines/diseasephenotype/mondo-close-match-guard/. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generated by babel-clique-diff against two full local disease builds (x[1] dormant guard vs x[2] active guard), differing only in the one-line MONDO_close parsing fix. glom is deterministic (two x[1] builds were clique-identical), so the diff is all signal.
- README.md: the bug, the fix, and the SME decision (1,219 MEDDRA codes leave Disease.txt).
- clique-diff.csv: every changed clique with destination_kind (kept/regrouped/moved/dropped).
- dropped-members.csv: the 1,219 dropped CURIEs with the MONDO clique each left.
- summary.json: per-compendium counts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 29, 2026
Cross-link the deferred x[1]->x[2] close-match-guard fix to its dedicated PR (#883), and note the concrete impact (~1,219 MEDDRA identifiers dropped from Disease.txt) that the follow-up's before/after analysis quantifies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heads up: the |
gaurav added a commit that referenced this pull request Jun 30, 2026
The extracted babel-clique-diff commit linked to #883's mondo-close-match-guard evidence dir, which is not present on this standalone tool branch. Replace it with the general commit-location convention for clique-diff outputs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 30, 2026
Cross-link the deferred x[1]->x[2] close-match-guard fix to its dedicated PR (#883), and note the concrete impact (~1,219 MEDDRA identifiers dropped from Disease.txt) that the follow-up's before/after analysis quantifies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gaurav added a commit that referenced this pull request Jun 30, 2026
Add docs/sources/MP/disjointness.md explaining the post-glom split, why unique_prefixes/concord-dropping are insufficient, and the measured impact (added/split/moved/deleted) from babel-clique-diff comparing the overlap-allowed build to the disjoint build. Commit the clique-diff CSV + summary JSON under docs/sources/MP/disjointness/. Update the MP and HP READMEs: MP/HP are now disjoint, so an MP clique carries only the Mammalia taxon and an HP clique only Homo sapiens (correcting the earlier "mixed cliques carry both taxa" note). Cross-link prior PRs (#790, #300, #883, #742/#781). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes a latent bug in how the disease/phenotype build reads the `MONDO_close` concord, which left glom's close-match guard a silent no-op, and adds a reusable build-vs-build clique-diff tool plus the before/after impact analysis that this change requires.
This is the dedicated follow-up promised in #790, where the same code was touched only enough to stop the build crashing (validating the real 3-column `MONDO_close` format while preserving the dormant `x[1]` behaviour). Activating the guard changes disease clique merging broadly and is orthogonal to adding MP, so it lives here with its own evidence.
Note
Based on `add-mpo` (#790), not `main`: it builds on that PR's `compute_cliques_for_impact_report` refactor. Merge #790 first; this PR will retarget to `main` automatically.
The bug
`compute_cliques_for_impact_report()` read `MONDO_close` (a 3-column `subject predicate object` concord) with `close_mondos[x[0]].add(x[1])`, keying the MONDO subject to column 2, the predicate (`oio:closeMatch`), instead of column 3, the close-match object. No clique ever contains the literal string `oio:closeMatch`, so glom's `close=` guard (`if cd in newset`) never matched. The guard has therefore never fired, on `main` and before it; close (but not exact) matches were free to collapse into exact cliques.
The fix is one line of intent: key on `x[2]`, matching the `(stuff[0], stuff[2])` parsing every other concord in that function already uses.
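To make the failure mode concrete, here is a minimal sketch rather than the actual Babel code. `read_close_matches` and `guard_blocks_merge` are hypothetical helpers, the CURIEs are placeholders, the rows are assumed to be tab-separated, and the guard is reduced to the `if cd in newset` membership test described above.

```python
from collections import defaultdict

# Placeholder rows in the 3-column "subject predicate object" layout.
# The tab separator and the identifiers are assumptions for illustration.
MONDO_CLOSE_LINES = [
    "MONDO:0000001\toio:closeMatch\tMEDDRA:10000001",
    "MONDO:0000002\toio:closeMatch\tMEDDRA:10000002",
]


def read_close_matches(lines, object_column):
    """Map each subject (column 0) to the values found in `object_column`."""
    close = defaultdict(set)
    for line in lines:
        x = line.rstrip("\n").split("\t")
        close[x[0]].add(x[object_column])
    return close


def guard_blocks_merge(candidate_clique, close):
    """Simplified stand-in for the close= guard.

    Returns True when some member of the candidate clique has a
    close match that is also a member of that clique.
    """
    return any(
        cd in candidate_clique
        for member in candidate_clique
        for cd in close.get(member, ())
    )


# A merge that would put a MONDO term and its close match in one clique.
candidate = {"MONDO:0000001", "MEDDRA:10000001"}

buggy = read_close_matches(MONDO_CLOSE_LINES, object_column=1)  # x[1]: predicate
fixed = read_close_matches(MONDO_CLOSE_LINES, object_column=2)  # x[2]: object

print(guard_blocks_merge(candidate, buggy))  # False: only "oio:closeMatch" is looked up
print(guard_blocks_merge(candidate, fixed))  # True: the close-match object is found
```

With the `x[1]` keying, every subject maps to the same predicate string, so the membership test can never succeed regardless of the clique's contents.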
Impact (why this needs SME review)
Measured on a full local `disease` build (Biolink 4.4.3), same intermediate ids/concords, differing only in `x[1]` vs `x[2]`. glom is deterministic (two `x[1]` builds were clique-identical), so every difference is signal, not run-to-run noise.
- 1,219 MEDDRA identifiers are dropped from `Disease.txt`. They were present only because the dormant guard let them merge into a MONDO disease clique; correctly kept out, they have no independent disease typing in this pipeline and leave the compendium entirely.
- `PhenotypicFeature.txt` is essentially unaffected (the close pairs are MONDO-subject).
Dropping 1,219 MEDDRA codes is correct if they are genuinely distinct concepts (the guard's premise), but a coverage loss if anything relies on that MEDDRA → MONDO normalization. Per the project rule against dropping valid identifiers without good reason, this is an SME call; the evidence is in `docs/pipelines/diseasephenotype/mondo-close-match-guard/` (`README.md`, `clique-diff.csv`, `dropped-members.csv`, `summary.json`).
New tool: `babel-clique-diff`
A reusable build-vs-build clique diff (`tools/clique_diff`) that compares the finished compendia of two builds and reports split/merged/lost cliques and, crucially, dropped CURIEs.
It is deliberately distinct from `source-impact-report`: that answers "what does adding source X do?" by re-glomming intermediates with vs. without one source; this answers "how did the cliques change between build A and build B?" given the same inputs but different code/config/data.
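The build-vs-build question can be illustrated with a rough sketch. This is not the `babel-clique-diff` implementation: `load_cliques`, `diff_builds`, and `clique` are hypothetical names, the JSONL field names (`identifiers`, `i`) are an assumption about the compendium format, and the sketch reports only dropped CURIEs and split cliques rather than the tool's full set of destination kinds.

```python
import json
from collections import defaultdict


def load_cliques(jsonl_lines):
    """Map every CURIE to its clique leader (the first identifier listed).

    Assumes each line is a JSON object with an "identifiers" list whose
    entries carry the CURIE under the key "i".
    """
    curie_to_leader = {}
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        curies = [ident["i"] for ident in record["identifiers"]]
        for curie in curies:
            curie_to_leader[curie] = curies[0]
    return curie_to_leader


def diff_builds(build_a, build_b):
    """Compare two CURIE -> leader maps produced by load_cliques."""
    # CURIEs present in build A but absent from build B entirely.
    dropped = sorted(set(build_a) - set(build_b))

    # For each build-A clique, collect the build-B cliques its members landed in.
    destinations = defaultdict(set)
    for curie, leader_a in build_a.items():
        if curie in build_b:
            destinations[leader_a].add(build_b[curie])

    # A clique is split when its surviving members span several build-B cliques.
    split = {
        leader: sorted(dests)
        for leader, dests in destinations.items()
        if len(dests) > 1
    }
    return dropped, split


def clique(*curies):
    """Build one placeholder compendium line for the demo below."""
    return json.dumps({"identifiers": [{"i": c} for c in curies]})


build_a = load_cliques([
    clique("MONDO:0000001", "DOID:0000001", "MEDDRA:10000001"),
    clique("MONDO:0000002", "UMLS:C0000001", "UMLS:C0000002"),
])
build_b = load_cliques([
    clique("MONDO:0000001", "DOID:0000001"),
    clique("MONDO:0000002", "UMLS:C0000001"),
    clique("UMLS:C0000002"),
])

dropped, split = diff_builds(build_a, build_b)
print(dropped)  # ['MEDDRA:10000001']
print(split)    # {'MONDO:0000002': ['MONDO:0000002', 'UMLS:C0000002']}
```

Because `load_cliques` accepts any iterable of lines, an open file handle for a finished compendium could be passed in directly; no intermediate glom state is needed, which matches the finished-compendia approach described here.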
Because it works on finished compendia rather than glom state, it can also compare a local build against a published `stars.renci.org` build with no re-glom, which is useful for validating any glom-logic change (close-match, `unique_prefixes`, overuse filtering) or as a release regression check.
Test plan
- `uv run ruff check` / `ruff format --check`: clean.
- `uv run rumdl check .`: clean.
- `uv run pytest -m unit -q`: green (243 passed), including new `tests/tools/test_clique_diff_tool.py`.
Out of scope
`x[1]` (the production behaviour of that PR) and is not regenerated here.
🤖 Generated with Claude Code