Add protein/chemical overlap report#830
Draft
gaurav wants to merge 4 commits into
Inventories cross-references that bridge the chemical/protein boundary so the Translator community can review which protein and chemical cliques a merge would combine (issues #706, #667, #440, #513, #276).

generate_protein_chemical_overlap_report() streams the chemical compendia, the protein compendium, and every concord file, emitting:
- candidate_pairs: deduplicated (chem leader, prot leader) merge candidates -- the list of mappings a DrugProtein conflation would apply (#706).
- bridges: one row per boundary-crossing concord edge, source-attributed.
- duplicate_curies: CURIEs in both a chemical and a protein clique (#276/#513).
- summary: per-source counts.

Each candidate carries two triage discriminators: chem_has_inchikey (a structurally-defined chemical cross-referenced to a protein is usually a bug) and prot_reaches_gene (merging would make a chemical normalize to a gene, #662).

Memory is bounded by concord size: only CURIEs referenced by a concord are indexed, so the large compendia are streamed, not held in RAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds rule generate_protein_chemical_overlap_report, gated on outputs_done (like the other check_* report rules) so the compendia, conflations, and intermediate concords all exist. It globs both concord directories at runtime and writes the four TSVs under reports/protein_chemical/. Registered as a dependency of all_reports so it runs with the standard reports target. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds docs/reports/ProteinChemicalOverlap.md (purpose, the two triage discriminators, the four output files, how to run it against a downloaded build) and links it from the docs index. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Records the design for a follow-up tool that consumes candidate_pairs.tsv and simulates a hypothetical DrugProtein conflation layered on the existing GeneProtein/DrugChemical conflations, reporting before/after normalization for a probe set so the chemical-reaches-gene effect (#662, #654) is concrete before anything ships. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
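A minimal sketch of the before/after comparison that this design describes, not the planned tool itself. It assumes each conflation is a list of groups whose first entry is the preferred identifier, and that a hypothetical DrugProtein group lists the protein first. Both are illustrative choices, as are all identifiers and function names below.

```python
def build_lookup(groups):
    """Map every member of a conflation group to the group's first entry."""
    return {member: group[0] for group in groups for member in group}


def normalize(curie, lookups):
    """Follow the conflation layers until the identifier stops changing."""
    seen = {curie}
    changed = True
    while changed:
        changed = False
        for lookup in lookups:
            target = lookup.get(curie, curie)
            if target not in seen:  # the seen-set also guards against cycles
                seen.add(target)
                curie, changed = target, True
    return curie


# Hypothetical existing conflations (preferred identifier first).
drug_chemical = build_lookup([["CHEBI:1", "CHEBI:5"]])
gene_protein = build_lookup([["NCBIGene:10", "UniProtKB:P1"]])

# Hypothetical candidate pairs, as (chemical leader, protein leader).
candidate_pairs = [("CHEBI:1", "UniProtKB:P1")]
drug_protein = build_lookup([[prot, chem] for chem, prot in candidate_pairs])

probes = ["CHEBI:5", "UniProtKB:P1", "CHEBI:7"]
before = {p: normalize(p, [drug_chemical, gene_protein]) for p in probes}
after = {p: normalize(p, [drug_chemical, drug_protein, gene_protein]) for p in probes}

# Keep only the probes whose normalized identifier would change.
changed_probes = {p: (before[p], after[p]) for p in probes if before[p] != after[p]}
for probe, (old, new) in changed_probes.items():
    print(f"{probe}: {old} -> {new}")
```

In this toy data only the chemical probe changes, and it ends up at a gene identifier, which is the chemical-reaches-gene effect the design wants to make visible.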
Inventories the cross-references that bridge the chemical/protein boundary, so the Translator community can review which protein and chemical cliques a merge would combine. This is the empirical input to the "list of mappings that would be applied if we combined proteins and chemicals" called for in #706, and a generalization of the UMLS-only report proposed in #667.
What it does
`generate_protein_chemical_overlap_report()` streams the chemical compendia, the protein compendium, and every concord file, and finds concord edges whose two endpoints landed in different cliques: one chemical, one protein. It writes four TSVs under `babel_outputs/reports/protein_chemical/`:
- `candidate_pairs.tsv`: deduplicated (chemical leader → protein leader) merge candidates, with supporting sources/predicates aggregated and sorted by evidence. This is the SSSOM-able mapping list that a DrugProtein conflation (#440) would apply.
- `bridges.tsv`: one row per boundary-crossing concord edge, source-attributed (raw evidence).
- `duplicate_curies.tsv`: CURIEs in both a chemical and a protein clique (#276/#513). Scoped to concord-referenced CURIEs to keep memory bounded; the exhaustive list is a one-query lookup against the DuckDB `Edge` table.
- `summary.tsv`: per-source counts with the discriminator splits and a `TOTAL` row.

Each candidate carries two triage discriminators:
- `chem_has_inchikey`: a structurally-defined chemical cross-referenced to a protein is usually a bug (e.g. CHEBI:24536 "Pepsin" actually being hexachlorocyclohexane); the genuine protein-as-chemical cases have no InChIKey.
- `prot_reaches_gene`: whether merging would make a chemical normalize all the way to a gene (the #662 concern), derived from the GeneProtein conflation.

Memory is bounded by concord size: only CURIEs referenced by a concord are indexed, so the large compendia are streamed rather than held in RAM.
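The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/reports/protein_chemical_overlap.py`: the in-memory inputs, the `INCHIKEY:` prefix check, the leader-is-first-member convention, and every function and field name other than `chem_has_inchikey` and `prot_reaches_gene` are hypothetical.

```python
from collections import defaultdict

INCHIKEY_PREFIX = "INCHIKEY:"  # assumed prefix for structure-derived identifiers


def index_cliques(cliques, wanted):
    """Stream cliques, keeping only members that some concord references."""
    leader_of, leader_has_inchikey = {}, {}
    for members in cliques:
        referenced = [m for m in members if m in wanted]
        if not referenced:
            continue  # clique never touches a concord, so it is not indexed
        leader = members[0]  # assumption: the first member is the clique leader
        for member in referenced:
            leader_of[member] = leader
        leader_has_inchikey[leader] = any(m.startswith(INCHIKEY_PREFIX) for m in members)
    return leader_of, leader_has_inchikey


def find_overlap(edges, chem_cliques, prot_cliques, gene_protein_groups):
    """Return bridges, candidate pairs and duplicate CURIEs for toy inputs."""
    wanted = {curie for _, subj, _, obj in edges for curie in (subj, obj)}
    chem_leader, chem_inchikey = index_cliques(chem_cliques, wanted)
    prot_leader, _ = index_cliques(prot_cliques, wanted)
    conflated = {curie for group in gene_protein_groups for curie in group}

    bridges, sources_by_pair = [], defaultdict(set)
    for source, subj, predicate, obj in edges:
        # An edge may cross the boundary in either direction.
        for chem, prot in ((subj, obj), (obj, subj)):
            if chem in chem_leader and prot in prot_leader:
                pair = (chem_leader[chem], prot_leader[prot])
                bridges.append((source, chem, predicate, prot))
                sources_by_pair[pair].add(source)

    candidates = [
        {
            "chem_leader": chem,
            "prot_leader": prot,
            "sources": sorted(sources),
            "chem_has_inchikey": chem_inchikey[chem],
            "prot_reaches_gene": prot in conflated,
        }
        for (chem, prot), sources in sources_by_pair.items()
    ]
    candidates.sort(key=lambda row: -len(row["sources"]))  # most evidence first
    duplicates = sorted(set(chem_leader) & set(prot_leader))
    return {"bridges": bridges, "candidate_pairs": candidates, "duplicate_curies": duplicates}


# Hypothetical inputs: (source, subject, predicate, object) concord edges.
edges = [
    ("UMLS", "CHEBI:1", "skos:exactMatch", "UniProtKB:P1"),
    ("MESH", "MESH:D1", "skos:exactMatch", "UniProtKB:P1"),
    ("MESH", "CHEBI:2", "skos:exactMatch", "CHEBI:3"),
]
chem_cliques = [
    ["CHEBI:1", "INCHIKEY:AAA"],
    ["CHEBI:9", "MESH:D1"],
    ["CHEBI:2", "CHEBI:3"],
    ["CHEBI:7"],  # not referenced by any concord, so never indexed
]
prot_cliques = [["UniProtKB:P1"]]
gene_protein_groups = [["NCBIGene:10", "UniProtKB:P1"]]

report = find_overlap(edges, chem_cliques, prot_cliques, gene_protein_groups)
```

The index only ever holds CURIEs that appear in `edges`, which is why its size tracks the concords rather than the compendia. A real implementation would read the compendia and concords from files one line at a time instead of taking lists.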
Changes
- `src/reports/protein_chemical_overlap.py`: core module.
- `tests/reports/test_protein_chemical_overlap.py`: unit tests on synthetic fixtures (InChIKey vs no-InChIKey crossings, duplicate CURIE, gene-reaching protein, no-conflation path).
- `src/snakefiles/reports.snakefile`: rule `generate_protein_chemical_overlap_report`, gated on `outputs_done` and registered under `all_reports`.
- `docs/reports/ProteinChemicalOverlap.md` (+ README index entry): purpose, discriminators, outputs, how to run against a downloaded build, and a future-work section.
Follow-up
The natural next step is a conflation-chain impact simulator that consumes `candidate_pairs.tsv` and shows before/after normalization for a probe set; this is tracked in #829.
Status / notes
- `ruff check`, `ruff format --check`, `snakefmt --check`, and `rumdl` are all clean; unit tests pass.
- Running it against a downloaded `babel_outputs/` (e.g. the stars.renci.org snapshot) and sanity-checking the top candidate pairs is a good pre-merge step. Draft until then.

Related: #706, #667, #662, #654, #513, #440, #276.
🤖 Generated with Claude Code