
Add protein/chemical overlap report #830

Draft
gaurav wants to merge 4 commits into main from worktree-generate-protein-overlap-report

gaurav (Collaborator) commented Jun 4, 2026

Inventories the cross-references that bridge the chemical/protein boundary, so the Translator community can review which protein and chemical cliques a merge would combine. This is the empirical input to the "list of mappings that would be applied if we combined proteins and chemicals" called for in #706, and a generalization of the UMLS-only report proposed in #667.

What it does

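generate_protein_chemical_overlap_report() streams the chemical compendia, the protein compendium, and every concord file, and finds concord edges whose two endpoints landed in different cliques: one chemical, one protein. It writes four TSVs under babel_outputs/reports/protein_chemical/:

  • candidate_pairs: deduplicated (chemical leader, protein leader) merge candidates, i.e. the list of mappings a DrugProtein conflation would apply (#706).
  • bridges: one row per boundary-crossing concord edge, attributed to its source.
  • duplicate_curies: CURIEs that appear in both a chemical and a protein clique (#276, #513).
  • summary: per-source counts.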

Each candidate carries two triage discriminators:

  • chem_has_inchikey: a structurally defined chemical cross-referenced to a protein is usually a bug (e.g. CHEBI:24536 "Pepsin" actually being hexachlorocyclohexane); the genuine protein-as-chemical cases have no InChIKey.
  • prot_reaches_gene: whether merging would make a chemical normalize all the way to a gene (the concern raised in #662, "Proteins and chemicals are being combined in potentially confusing ways"), derived from the GeneProtein conflation.
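
As a rough illustration of how the two discriminators could be computed and combined, here is a minimal sketch. It is not the code in src/reports/protein_chemical_overlap.py: the `INCHIKEY:` prefix check, the shape of the GeneProtein conflation groups (lists of CURIEs that each include a gene), and the triage labels are all assumptions made for the example.

```python
from typing import Iterable, Sequence


def chem_has_inchikey(chem_clique: Iterable[str]) -> bool:
    """True if any identifier in the chemical clique is an InChIKey.

    Assumption: InChIKeys appear as CURIEs with an INCHIKEY: prefix.
    """
    return any(curie.upper().startswith("INCHIKEY:") for curie in chem_clique)


def prot_reaches_gene(
    prot_clique: Iterable[str],
    gene_protein_groups: Iterable[Sequence[str]],
) -> bool:
    """True if the protein clique shares a CURIE with any GeneProtein group.

    Assumption: every group is a list of CURIEs that contains a gene.
    """
    members = set(prot_clique)
    return any(members.intersection(group) for group in gene_protein_groups)


def triage(has_inchikey: bool, reaches_gene: bool) -> str:
    """Hypothetical labels for sorting candidate pairs during review."""
    if has_inchikey:
        # A structurally defined chemical mapped to a protein is usually a bug.
        return "likely-bad-xref"
    if reaches_gene:
        # Merging would let the chemical normalize through to a gene.
        return "review-gene-reach"
    return "plausible-merge"
```

Checking the InChIKey flag first mirrors the ordering in the description: a structural identifier is treated as the stronger signal that the cross-reference itself is wrong.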

Memory is bounded by concord size: only CURIEs referenced by a concord are indexed, so the large compendia are streamed rather than held in RAM.
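
The concord-bounded indexing described above can be sketched as three passes. This is an illustrative outline, not the module's actual code. It assumes concord lines are tab-separated `subject, predicate, object` triples, that each compendium line is a JSON object with an `identifiers` list whose entries carry the CURIE under `"i"`, and that the first identifier is the clique leader. All function names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def _endpoints(line: str) -> Optional[tuple[str, str]]:
    """Split a concord line into (subject, object); None if malformed."""
    parts = line.rstrip("\n").split("\t")
    return (parts[0], parts[2]) if len(parts) >= 3 else None


def referenced_curies(concord_lines: Iterable[str]) -> set[str]:
    """Pass 1: collect every CURIE that is an endpoint of some concord edge."""
    wanted: set[str] = set()
    for line in concord_lines:
        ends = _endpoints(line)
        if ends:
            wanted.update(ends)
    return wanted


def index_leaders(compendium_lines: Iterable[str], wanted: set[str]) -> dict[str, str]:
    """Pass 2: stream a compendium, keeping only wanted CURIE -> leader entries."""
    index: dict[str, str] = {}
    for line in compendium_lines:
        ids = [ident["i"] for ident in json.loads(line)["identifiers"]]
        if not ids:
            continue
        leader = ids[0]  # assumption: first identifier is the clique leader
        for curie in ids:
            if curie in wanted:
                index[curie] = leader
    return index


def crossing_edges(
    concord_lines: Iterable[str],
    chem_index: dict[str, str],
    prot_index: dict[str, str],
) -> Iterator[tuple[str, str, str, str]]:
    """Pass 3: yield (chem_leader, prot_leader, chem_curie, prot_curie)."""
    for line in concord_lines:
        ends = _endpoints(line)
        if not ends:
            continue
        subj, obj = ends
        # An edge may cross the boundary in either orientation.
        for chem, prot in ((subj, obj), (obj, subj)):
            if chem in chem_index and prot in prot_index:
                yield chem_index[chem], prot_index[prot], chem, prot
```

Because the two indexes only ever hold CURIEs seen in pass 1, their size is capped by the concords rather than by the compendia, which is the memory bound the description refers to. Deduplicating the `(chem_leader, prot_leader)` prefix of each yielded tuple would give the candidate pairs.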

Changes

  • src/reports/protein_chemical_overlap.py — core module.
  • tests/reports/test_protein_chemical_overlap.py — unit tests on synthetic fixtures (InChIKey vs no-InChIKey crossings, duplicate CURIE, gene-reaching protein, no-conflation path).
  • src/snakefiles/reports.snakefile — rule generate_protein_chemical_overlap_report, gated on outputs_done and registered under all_reports.
  • docs/reports/ProteinChemicalOverlap.md (+ README index entry) — purpose, discriminators, outputs, how to run against a downloaded build, and a future-work section.

Follow-up

The natural next step is a conflation-chain impact simulator that consumes candidate_pairs.tsv and shows before/after normalization for a probe set — tracked in #829.
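
The simulator is not part of this PR and its design is tracked in #829. Purely to make the before/after idea concrete, the following sketch treats each conflation as a `source leader -> target leader` mapping and layers hypothetical chemical-to-protein links on top of the existing ones. The function names, the dictionary representation, and the direction of the DrugProtein link are all assumptions.

```python
from typing import Iterable, Mapping


def normalize(curie: str, leader_of: Mapping[str, str], conflated_to: Mapping[str, str]) -> str:
    """Resolve a CURIE to its clique leader, then follow conflation links."""
    current = leader_of.get(curie, curie)
    seen = {current}
    # Follow the chain, stopping at a terminal node or if a cycle appears.
    while current in conflated_to and conflated_to[current] not in seen:
        current = conflated_to[current]
        seen.add(current)
    return current


def simulate(
    probes: Iterable[str],
    leader_of: Mapping[str, str],
    conflated_to: Mapping[str, str],
    candidate_pairs: Iterable[tuple[str, str]],
) -> dict[str, tuple[str, str]]:
    """Return {probe: (before, after)} for a hypothetical DrugProtein layer."""
    after_map = dict(conflated_to)  # copy so the existing conflations are untouched
    for chem_leader, prot_leader in candidate_pairs:
        # Assumption: the new conflation sends the chemical to the protein.
        after_map.setdefault(chem_leader, prot_leader)
    return {
        probe: (
            normalize(probe, leader_of, conflated_to),
            normalize(probe, leader_of, after_map),
        )
        for probe in probes
    }
```

With a GeneProtein link from a protein to a gene already in `conflated_to`, a chemical probe whose leader appears in `candidate_pairs` would resolve to the gene in the "after" column, which is the chemical-reaches-gene effect the follow-up is meant to expose.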

Status / notes

  • ruff check, ruff format --check, snakefmt --check, rumdl all clean; unit tests pass.
  • Not yet validated against a real build — the logic is exercised by unit fixtures only. Running it against a downloaded babel_outputs/ (e.g. the stars.renci.org snapshot) and sanity-checking the top candidate pairs is a good pre-merge step. Draft until then.

Related: #706, #667, #662, #654, #513, #440, #276.

🤖 Generated with Claude Code

gaurav and others added 4 commits June 4, 2026 17:39
Inventories cross-references that bridge the chemical/protein boundary so the
Translator community can review which protein and chemical cliques a merge would
combine (issues #706, #667, #440, #513, #276).

generate_protein_chemical_overlap_report() streams the chemical compendia, the
protein compendium, and every concord file, emitting:
  - candidate_pairs: deduplicated (chem leader, prot leader) merge candidates --
    the list of mappings a DrugProtein conflation would apply (#706).
  - bridges: one row per boundary-crossing concord edge, source-attributed.
  - duplicate_curies: CURIEs in both a chemical and a protein clique (#276/#513).
  - summary: per-source counts.

Each candidate carries two triage discriminators: chem_has_inchikey (a
structurally-defined chemical cross-referenced to a protein is usually a bug) and
prot_reaches_gene (merging would make a chemical normalize to a gene, #662).

Memory is bounded by concord size: only CURIEs referenced by a concord are
indexed, so the large compendia are streamed, not held in RAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds rule generate_protein_chemical_overlap_report, gated on outputs_done (like
the other check_* report rules) so the compendia, conflations, and intermediate
concords all exist. It globs both concord directories at runtime and writes the
four TSVs under reports/protein_chemical/. Registered as a dependency of
all_reports so it runs with the standard reports target.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds docs/reports/ProteinChemicalOverlap.md (purpose, the two triage
discriminators, the four output files, how to run it against a downloaded build)
and links it from the docs index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Records the design for a follow-up tool that consumes candidate_pairs.tsv and
simulates a hypothetical DrugProtein conflation layered on the existing
GeneProtein/DrugChemical conflations, reporting before/after normalization for a
probe set so the chemical-reaches-gene effect (#662, #654) is concrete before
anything ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Projects

Status: Backlog
