
napkin-math: advisory source-preservation audit (Fork B, proposal 141 PR 1) #751

Merged: neoneye merged 2 commits into main from napkin-math/source-preservation-audit-141-pr1 on May 21, 2026
Member

@neoneye neoneye commented May 21, 2026

Summary

Starts proposal 141 with the smallest useful slice: a deterministic advisory audit that diffs a prior parameters.json against a current parameters.json and classifies every prior signal as one of five outcomes. No strict mode, no CI gating, no schema changes to extract prompts, no LLM rationale parsing — those land in later proposal 141 PRs after this audit's findings (false-positive rate, useful catches) are measured on the corpus.

What counts as a "prior signal"

Both entry id values AND entry output_name values across the five sections (key_values, missing_values_to_estimate, derived_questions, recommended_first_calculations, unmodelled_gates). Output_names are first-class because downstream consumers (calculations.py, monte-carlo, summarize-assessment) bind by output_name, not by entry id — a calc whose entry id survives but whose output_name changes is a genuine signal regression.
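A minimal sketch of how the combined signal set could be gathered, assuming each of the five sections is a list of objects with an `id` and an optional `output_name` (the JSON layout and the helper name `collect_signals` are assumptions, not the PR's actual code). A name that is both an id and an output_name is recorded once with kind `id`, as the follow-up commit on this PR describes.

```python
from typing import Any

# The five audited sections named in the PR description.
SECTIONS = (
    "key_values",
    "missing_values_to_estimate",
    "derived_questions",
    "recommended_first_calculations",
    "unmodelled_gates",
)


def collect_signals(parameters: dict[str, Any]) -> dict[str, dict[str, str]]:
    """Map each signal name to its section and kind ("id" or "output_name")."""
    signals: dict[str, dict[str, str]] = {}
    # First pass: entry ids. An id is the authoritative reading of a name.
    for section in SECTIONS:
        for entry in parameters.get(section) or []:
            entry_id = entry.get("id")
            if isinstance(entry_id, str) and entry_id:
                signals[entry_id] = {"section": section, "kind": "id"}
    # Second pass: output_names that are not already known as ids.
    for section in SECTIONS:
        for entry in parameters.get(section) or []:
            output_name = entry.get("output_name")
            if isinstance(output_name, str) and output_name:
                signals.setdefault(
                    output_name, {"section": section, "kind": "output_name"}
                )
    return signals


# Hypothetical prior artifact: one entry whose output_name differs from its id,
# and one calculation where id == output_name.
prior = {
    "derived_questions": [{"id": "q_margin", "output_name": "old_margin"}],
    "recommended_first_calculations": [
        {"id": "net_margin", "output_name": "net_margin"}
    ],
}
signals = collect_signals(prior)
```

Here `signals` holds three names: `q_margin` and `net_margin` with kind `id`, and `old_margin` with kind `output_name`.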

Classification

Each prior signal name is classified by what evidence the current artifact provides:

| Status | Meaning |
| --- | --- |
| preserved_by_id | same name appears as a current entry id |
| preserved_by_output_name | name appears as a current output_name (downstream binders use output_name) |
| preserved_as_formula_dependency | name is on a current formula_hint RHS or in a current depends_on list |
| likely_renamed | snake_case token Jaccard overlap ≥ 0.4 with one or more candidates from current ids ∪ current output_names (advisory; top 3 candidates by overlap) |
| absent_unexplained | no preservation evidence at all |

Threshold calibration: actual_X → X_target overlaps at 0.6, X_dkk → X_floor_dkk at 0.75, X_capex → X_capex_floor at 0.83. Unrelated ids stay near 0. 0.4 is permissive enough to surface renames without flooding the report.
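The calibration figures can be reproduced with a few lines of token arithmetic. This is a sketch that assumes tokens are the lowercase pieces between underscores; the concrete signal names are hypothetical stand-ins for the `X` patterns above, chosen so that `X` has three, two, and four tokens respectively.

```python
def tokens(name: str) -> set[str]:
    """Split a snake_case name into its set of lowercase tokens."""
    return {t for t in name.lower().split("_") if t}


def jaccard(a: str, b: str) -> float:
    """Token Jaccard overlap: |A ∩ B| / |A ∪ B|, or 0.0 when both are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# actual_X -> X_target with a three-token X: 3 shared tokens out of 5.
rename_target = jaccard("actual_api_latency_ms", "api_latency_ms_target")
# X_dkk -> X_floor_dkk with a two-token X: 3 shared tokens out of 4.
rename_floor = jaccard("budget_total_dkk", "budget_total_floor_dkk")
# X_capex -> X_capex_floor with a four-token X: 5 shared tokens out of 6.
rename_capex = jaccard("site_build_phase_one_capex", "site_build_phase_one_capex_floor")
# Names with no shared token score exactly 0.
unrelated = jaccard("crate_loss_volume", "public_compliance_threshold")
```

The first three scores come out at 0.6, 0.75, and roughly 0.83, all above the 0.4 threshold, while the unrelated pair scores 0.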

Detail entries carry prior_kind ("id" or "output_name") so reviewers can distinguish entry-id drift from output-name drift. The text renderer tags each line with [section/kind].
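The classification can be sketched as a first-match cascade. This assumes the table order is also the precedence order and uses a simplified right-hand-side parser; the helper names `rhs_names` and `classify` are hypothetical, and the audit's own `parse_rhs_tokens` reportedly also filters builtins, which this sketch omits.

```python
import re


def tokens(name):
    return {t for t in name.lower().split("_") if t}


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rhs_names(formula_hint):
    """Identifiers to the right of the first '=' in a formula_hint (simplified)."""
    if not isinstance(formula_hint, str):
        return set()
    rhs = formula_hint.split("=", 1)[-1]
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rhs))


def classify(name, prior_kind, current_entries, threshold=0.4):
    """Return a detail entry for one prior signal, checking the strongest evidence first."""
    ids = {e["id"] for e in current_entries if e.get("id")}
    output_names = {e["output_name"] for e in current_entries if e.get("output_name")}
    dependencies = set()
    for e in current_entries:
        dependencies |= rhs_names(e.get("formula_hint"))
        dependencies |= set(e.get("depends_on") or [])

    if name in ids:
        status = "preserved_by_id"
    elif name in output_names:
        status = "preserved_by_output_name"
    elif name in dependencies:
        status = "preserved_as_formula_dependency"
    elif any(jaccard(name, c) >= threshold for c in ids | output_names):
        status = "likely_renamed"
    else:
        status = "absent_unexplained"
    return {"name": name, "prior_kind": prior_kind, "status": status}


# Hypothetical current artifact, flattened to a list of entries.
current_entries = [
    {"id": "q_margin", "output_name": "new_margin",
     "formula_hint": "new_margin = revenue_total - cost_total"},
    {"id": "cost_floor_dkk", "depends_on": ["unit_price"]},
]
kept = classify("q_margin", "id", current_entries)
dependency = classify("revenue_total", "id", current_entries)
renamed = classify("cost_dkk", "id", current_entries)
lost = classify("old_margin", "output_name", current_entries)
```

In this example `old_margin` shares only the `margin` token with `q_margin` and `new_margin` (overlap 1/3), so it falls below the threshold and is reported as `absent_unexplained` with `prior_kind` set to `output_name`.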

Out of scope (later proposal 141 PRs)

  • Fork A: source-digest regex scan against the current artifact.
  • dropped_signals schema: optional field in extract prompts.
  • LLM rationale parsing of dropped_signals entries.
  • Strict mode / CI gating policy.

Empirical findings on v49 → v51 (6 probes)

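| Plan | Prior signals | by_id | by_output_name | renamed | absent |
| --- | --- | --- | --- | --- | --- |
| euro_adoption | 18 | 17 | 0 | 1 | 0 |
| yellowstone | 18 | 12 | 0 | 5 | 1 |
| crate | 23 | 11 | 2 | 7 | 3 |
| mars_gtld | 23 | 14 | 1 | 6 | 2 |
| datacenter | 25 | 0 | 0 | 11 | 14 |
| paperclip | 18 | 3 | 0 | 12 | 3 |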

Useful catches the audit surfaces mechanically:

  • Paperclip latency-tripwire trio all absent from v51: actual_api_p99_latency_ms, api_latency_margin_ms, api_latency_p99_threshold_ms — exactly the silent-regression failure mode the plan doc has been flagging as needing mechanical detection.
  • Yellowstone public_compliance_threshold absent — documented v49→v51 regression in the plan's "Known limitations" section.
  • Crate q_logistics_budget_margin, target_recovery_rate, annual_crate_loss_volume all absent.
  • Datacenter wholesale restructure (0 ids preserved verbatim, 11 renamed, 14 absent) — useful diagnostic; reviewer can scan rename candidates.

False-positive surface (called out honestly):

  • Some likely_renamed candidates at jaccard 0.4–0.5 are borderline. Example: paperclip's manual_intervention_margin_hours → actual_manual_intervention_hours_per_week (j=0.43) might be a different concept (margin variable vs realised variable), not a true rename.
  • The token-only heuristic treats role changes (margin ↔ realised) as renames.
  • No semantic check (label overlap, source_text overlap) — just snake_case token Jaccard. Adding semantic checks is a candidate refinement for PR 3 once we see how reviewers actually use the audit.

Test plan

  • pytest experiments/napkin_math/tests/test_audit_source_preservation.py (17 synthetic unit tests pass — 15 + 2 for output_name drift and output_name-as-rename-candidate)
  • Audit run end-to-end on all 6 v49→v51 probes — output is sensible and surfaces the known absences
  • No hand-patching of parameters.json (the audit reads, never writes)
  • Output_name treated as a first-class audited signal (review feedback)
  • CI green on this branch

🤖 Generated with Claude Code

neoneye and others added 2 commits May 21, 2026 19:27
… PR 1)

Starts proposal 141 with the smallest useful slice: a deterministic advisory audit that diffs a prior parameters.json against a current parameters.json and classifies every prior signal as one of five outcomes. No strict mode, no CI gating, no schema changes to extract prompts, no LLM rationale parsing — those land in later proposal 141 PRs after this audit's findings (false-positive rate, useful catches) are measured on the corpus.

Classification of each prior id:

  preserved_by_id                — same id appears in current

  preserved_by_output_name       — prior id appears as an output_name in current (downstream binders use output_name, so the signal is alive)

  preserved_as_formula_dependency — prior id is on a current formula RHS or in a depends_on list (alive as a calculation input)

  likely_renamed                 — snake_case token Jaccard overlap >= 0.4 with one or more current ids (advisory; top 3 candidates by overlap)

  absent_unexplained             — no preservation evidence at all

Calibration: the 0.4 Jaccard threshold catches typical renames (actual_X → X_target overlaps at 0.6; X_dkk → X_floor_dkk at 0.75; X_capex → X_capex_floor at 0.83) while rejecting unrelated ids. Tested by synthetic fixtures; corpus probes show 1–12 renames per plan depending on how much the artifact was restructured between iterations.

15 synthetic unit tests cover: preserved_by_id (within-section and cross-section), preserved_by_output_name (primitive → calculation rename), preserved_as_formula_dependency (RHS or depends_on), likely_renamed (single and ranked candidates), absent_unexplained (no overlap and empty current), empty prior, jaccard primitive, parse_rhs_tokens primitive with assignment / builtin-filter / non-string handling, and the text-report rendering.

Out of scope (deferred to later proposal 141 PRs):

  - Fork A: source-digest regex scan against the current artifact.

  - The optional dropped_signals schema in extract prompts.

  - LLM rationale parsing of dropped_signals entries.

  - Strict-mode exit-non-zero policy.

Corpus probe findings (v49 → v51, 6 probes): the audit mechanically surfaces the paperclip latency-tripwire trio absence (actual_api_p99_latency_ms, api_latency_margin_ms, api_latency_p99_threshold_ms — all absent_unexplained), the yellowstone public_compliance_threshold absence, the crate q_logistics_budget_margin/target_recovery_rate/annual_crate_loss_volume absences, and datacenter's wholesale 0-of-23-ids restructure. False-positive surface acknowledged: some likely_renamed candidates at jaccard 0.4–0.5 are borderline; role changes (margin ↔ realised variable) read as renames by the token-only heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…de ids

Review feedback on PR #751: the audit only counted prior ids as signals, so a calc whose entry id survives but whose output_name changes (or disappears) was reported as 'preserved_by_id' even though the downstream-binding signal was lost. Downstream consumers (calculations.py, monte-carlo, summarize-assessment) bind by output_name, not by entry id, so output_name drift is a genuine signal regression that the audit must surface.

Changes:

(1) build_signal_index now returns a 'signals' dict that includes BOTH ids and output_names. The audit iterates this combined signal set instead of just prior ids. A name that appears as both an id and an output_name (typical for calcs where id == output_name) is counted once with kind='id' as the authoritative reading.

(2) find_rename_candidates draws candidates from the union of current ids AND current output_names. Mars-style restructures where a prior id is moved to a new calc's output_name (rather than to a current entry's id) now surface as likely_renamed with the output_name as the candidate.

(3) Detail entries gain a 'prior_kind' field ('id' or 'output_name') and the text report tags each line with [section/kind] so reviewers can distinguish entry-id drift from output-name drift.
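As an illustration of change (2), a rename-candidate search over the union of current ids and current output_names might look like the sketch below. The function name `rank_rename_candidates`, its signature, and the example names are hypothetical; only the 0.4 threshold and the top-3 limit come from this PR.

```python
def tokens(name):
    return {t for t in name.lower().split("_") if t}


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_rename_candidates(name, current_ids, current_output_names,
                           threshold=0.4, limit=3):
    """Return up to `limit` (candidate, score) pairs, best overlap first."""
    scored = []
    for candidate in set(current_ids) | set(current_output_names):
        score = jaccard(name, candidate)
        if score >= threshold:
            scored.append((candidate, round(score, 2)))
    # Sort by descending score, then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


# The prior id has moved to a new calc's output_name; no current id matches well.
candidates = rank_rename_candidates(
    "registry_fee_usd",
    current_ids={"q_fee"},
    current_output_names={"registry_fee_usd_annual", "annual_registry_fee"},
)
```

Here both surviving candidates are output_names (overlaps 0.75 and 0.5), while the id `q_fee` scores 0.25 and is dropped, which is the situation an ids-only search would have missed.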

Two new synthetic tests added: (a) the exact scenario the reviewer requested — prior derived_question with id=q_margin and output_name=old_margin against current with id=q_margin and output_name=new_margin reports q_margin preserved_by_id AND old_margin absent_unexplained; (b) a Mars-style rename to a current output_name surfaces as likely_renamed with the output_name as the candidate. 17 unit tests pass total.

Corpus probe re-run on v49 → v51: prior_total grew for plans whose prior output_names were distinct from their ids (crate 21→23, mars 22→23, datacenter 23→25). Euro_adoption, yellowstone, and paperclip totals are unchanged because their prior calcs all had id == output_name. The known absences from the previous run still surface (paperclip latency-tripwire trio, yellowstone public_compliance_threshold, crate missing signals, datacenter wholesale restructure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member Author

neoneye commented May 21, 2026

Addressed (`266b6af2`). Output_name is now a first-class audited signal:

  1. build_signal_index returns a combined signals dict containing both ids AND output_names. The audit iterates this combined set instead of just prior ids.
  2. find_rename_candidates searches the union of current ids and current output_names so Mars-style restructures (where a prior id is moved to a new calc's output_name) surface as likely_renamed.
  3. Detail entries gain prior_kind ("id" or "output_name"); text report tags each line with `[section/kind]`.

The exact test you requested is in: test_output_name_drift_on_preserved_id_is_reported. Prior derived_question{id=q_margin, output_name=old_margin} against current derived_question{id=q_margin, output_name=new_margin} reports q_margin preserved_by_id AND old_margin absent_unexplained, with prior_kind="output_name" on the second detail entry.

Also added test_rename_candidates_drawn_from_output_names_too — a Mars-style rename where the prior id maps to a current output_name (not a current id) now surfaces as likely_renamed with the output_name as a candidate.

Corpus probe re-run (counts in updated PR body): prior_total grew for plans whose prior output_names were distinct from their ids — crate 21→23, mars_gtld 22→23, datacenter 23→25. Euro_adoption, yellowstone, and paperclip totals are unchanged because their prior calcs all had id == output_name. All previously-surfaced absences (paperclip latency trio, yellowstone public_compliance_threshold, crate trio, datacenter restructure) still surface; no findings regressed.

17 unit tests pass total (15 prior + 2 new). PR body updated.

@neoneye neoneye merged commit 8a65fae into main May 21, 2026
3 checks passed
@neoneye neoneye deleted the napkin-math/source-preservation-audit-141-pr1 branch May 21, 2026 17:52
neoneye added a commit that referenced this pull request May 21, 2026
Three review fixes:

1. plan: update the stale 'No formal source-preservation audit implementation' bullet — Fork B shipped in PR #751/#752/#753; Fork A, orchestrator-side prior-baseline injection, and strict-mode are the actual still-pending follow-ups.

2. plan: bump the document title from 2026-05-20 to 2026-05-22; add an italicised note that the doc was originally drafted 2026-05-20 and renamed/refreshed for the post-#753 ship-set.

3. methology: stop overclaiming what the assessment Basis column exposes. summarize_assessment.py maps source:'data' → 'report_derived' and source:'assumption' → 'model_assumption', and that is what the column shows; the finer 'plan-internal gap forecast vs bare commitment' distinction lives in the rationale string, not the column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…proposal 141 PR 2)

Builds on PR PlanExeOrg#751 (Fork B advisory audit) by giving the LLM an optional vocabulary to explain absences. Three coordinated changes:

(1) Both extract prompts (from-digest and from-full) gain an optional top-level dropped_signals array. Each entry must name a structural reason from a closed enum (replaced_by, cap_pressure, out_of_scope, moved_to_unmodelled_gate, redundant_with) and reference the current signal it was replaced by, made redundant with, moved to, or capped under. Hard limit 8 entries; rationale ≤25 words. Corpus-agnostic wording — no plan literals.

(2) validate_parameters.py grows a 19th check, dropped_signals_schema, that ERRORs on malformed entries: unknown reason, unresolved replacement_id/redundant_with_id, cap_pressure on an array that isn't actually at its cap, moved_to_unmodelled_gate replacement_id not pointing at an unmodelled_gates entry, rationale over the 25-word cap, total over the 8-entry cap. Per the proposal, malformed dropped_signals entries are audit failures — they should not be accepted as explanations.

(3) audit_source_preservation.py adds a new explained_drop classification status that ranks above likely_renamed and absent_unexplained. When current parameters.json's dropped_signals records the prior signal with a valid reason, the audit reclassifies the disappearance as explained_drop with the structured reason and reference. Malformed dropped_signals entries are silently skipped by the audit (validate_parameters surfaces them); double-counting is avoided by ignoring dropped_signals entries whose id is actually preserved in current.
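The split between changes (2) and (3) can be sketched as follows: the validator reports malformed entries as errors, while the audit quietly ignores them and only reclassifies disappearances that a well-formed entry explains. The entry keys `id`, `reason`, and `rationale`, the helper names, and the example signals are assumptions for illustration; the cap_pressure and moved_to_unmodelled_gate checks are left out.

```python
ALLOWED_REASONS = {
    "replaced_by", "cap_pressure", "out_of_scope",
    "moved_to_unmodelled_gate", "redundant_with",
}
MAX_ENTRIES = 8
MAX_RATIONALE_WORDS = 25
REFERENCE_KEYS = ("replacement_id", "redundant_with_id")


def entry_errors(entry, current_names):
    """Schema problems for one dropped_signals entry; an empty list means clean."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    errors = []
    if entry.get("reason") not in ALLOWED_REASONS:
        errors.append(f"unknown reason {entry.get('reason')!r}")
    for key in REFERENCE_KEYS:
        ref = entry.get(key)
        if ref is not None and ref not in current_names:
            errors.append(f"{key} {ref!r} does not resolve")
    if len(str(entry.get("rationale", "")).split()) > MAX_RATIONALE_WORDS:
        errors.append("rationale exceeds 25 words")
    return errors


def schema_errors(dropped_signals, current_names):
    """Validator side: every problem is reported."""
    errors = []
    if len(dropped_signals) > MAX_ENTRIES:
        errors.append(f"{len(dropped_signals)} entries exceed the cap of {MAX_ENTRIES}")
    for i, entry in enumerate(dropped_signals):
        errors += [f"entry {i}: {msg}" for msg in entry_errors(entry, current_names)]
    return errors


def apply_explained_drops(details, dropped_signals, current_names):
    """Audit side: malformed entries are skipped, preserved signals are untouched."""
    drops = {
        d["id"]: d
        for d in dropped_signals
        if not entry_errors(d, current_names) and isinstance(d.get("id"), str)
    }
    result = []
    for detail in details:
        drop = drops.get(detail["name"])
        if drop and detail["status"] in ("likely_renamed", "absent_unexplained"):
            detail = {**detail, "status": "explained_drop", "reason": drop["reason"]}
        result.append(detail)
    return result


current_names = {"queue_budget_ms", "q_margin"}
details = [
    {"name": "queue_margin_ms", "status": "likely_renamed"},
    {"name": "old_margin", "status": "absent_unexplained"},
    {"name": "q_margin", "status": "preserved_by_id"},
]
dropped = [
    {"id": "queue_margin_ms", "reason": "replaced_by",
     "replacement_id": "queue_budget_ms", "rationale": "Folded into one budget."},
    {"id": "old_margin", "reason": "no_longer_needed", "rationale": "Not in the enum."},
    {"id": "q_margin", "reason": "out_of_scope", "rationale": "Id is still present."},
]
audited = apply_explained_drops(details, dropped, current_names)
problems = schema_errors(dropped, current_names)
```

Only `queue_margin_ms` becomes `explained_drop`. The entry with the unknown reason leaves `old_margin` as `absent_unexplained` and shows up in `problems`, and the entry for the still-present `q_margin` has no effect.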

15 new unit tests added: 9 validator (replaced_by clean, unknown reason, unresolved references, cap_pressure must match a capped array at cap, redundant_with required field, moved_to_unmodelled_gate must point at unmodelled_gates, rationale word cap, entry count cap, absent field is clean) + 6 audit (reclassification from absent to explained_drop, explained_drop outranks likely_renamed, ignored when prior is actually preserved, silently skips malformed entries, cap_pressure handling, moved_to_unmodelled_gate handling). 51 total tests (28 validator + 23 audit) all pass. 9/9 smoke checks. All 6 v51 parameters.json validate clean with 19 checks; the new schema check is correctly inert on legacy outputs without the optional field.

Discovered limitation worth surfacing: the LLM can only meaningfully emit dropped_signals with origin=prior_baseline when the orchestrator passes it the prior parameters.json as additional input. The current extract skill reads only the source digest, so a same-LLM same-session regeneration of v51 would emit zero prior_baseline drops. The schema/validator/audit-consumption infrastructure is in place; the orchestrator/skill wiring that lets the LLM see prior baselines is a separate PR (proposal 141 PR 3 candidate). The audit's Fork B comparison itself does not need the LLM to know about the prior — the audit reads both files externally.

Out of scope (later proposal 141 PRs):

  - Fork A: source-digest regex scan against the current artifact (independent advisory line)

  - Orchestrator wiring to pass prior parameters.json to the extract skill

  - Strict mode / CI gating policy

  - source_claim_ids per-entry field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…posal 141 PR 3)

Builds on PR PlanExeOrg#751 (Fork B audit) and PR PlanExeOrg#752 (dropped_signals schema + strict consumption) by closing the prior_baseline loop: the extract skill now has a way to see what the previous iteration emitted, so it can decide what to preserve and what to explain-drop.

Per review direction, the orchestration is intentionally NARROW: the full prior parameters.json is NOT passed in. Instead, prepare_extract_input.py builds a compact Prior Signal Ledger and appends it to the combined digest at the end. The ledger contains only:

  - signal names (entry ids and output_names)

  - section and kind (id or output_name)

  - formula_hint when present

  - depends_on when non-empty

Intentionally excluded: source_text, label, comment, value. These would anchor the LLM on old phrasings and old framings — the ledger is a preservation BUDGET, not a phrasing TARGET. The source digest above the ledger remains the authoritative input.
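A rough sketch of what such a ledger builder could look like, assuming the same five-section layout as the audit. The markdown line format and the function name `sketch_ledger` are hypothetical; the commit only states which fields are included and which are excluded.

```python
SECTIONS = (
    "key_values",
    "missing_values_to_estimate",
    "derived_questions",
    "recommended_first_calculations",
    "unmodelled_gates",
)


def sketch_ledger(prior):
    """Render names, section/kind tags, formula_hint and depends_on; nothing else."""
    lines = ["## Prior Signal Ledger"]
    seen = set()
    for section in SECTIONS:
        for entry in prior.get(section) or []:
            # The id is visited first, so kind=id wins when id == output_name.
            for kind in ("id", "output_name"):
                name = entry.get(kind)
                if not isinstance(name, str) or not name or name in seen:
                    continue
                seen.add(name)
                line = f"- {name} [{section}/{kind}]"
                if entry.get("formula_hint"):
                    line += f" | formula_hint: {entry['formula_hint']}"
                if entry.get("depends_on"):
                    line += f" | depends_on: {', '.join(entry['depends_on'])}"
                lines.append(line)
    return "\n".join(lines)


# Hypothetical prior artifact; label, value and source_text must not leak through.
prior = {
    "key_values": [
        {"id": "unit_price", "label": "Unit price", "value": 12.5,
         "source_text": "priced at 12.50 per unit"},
    ],
    "derived_questions": [
        {"id": "q_margin", "output_name": "margin_total",
         "formula_hint": "margin_total = revenue_total - cost_total",
         "depends_on": ["revenue_total", "cost_total"]},
    ],
}
ledger = sketch_ledger(prior)
```

Because the builder only ever reads `id`, `output_name`, `formula_hint`, and `depends_on`, the excluded fields cannot reach the digest even if a prior artifact carries them.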

Changes:

(1) prepare_extract_input.py grows a --prior CLI flag pointing at a prior parameters.json. When omitted (first-iteration extraction), no ledger is appended and behavior is unchanged. When provided, build_prior_signal_ledger emits a compact markdown section appended after the bundle.

(2) Both extract prompts (from-digest and from-full) gain a 'Prior Signal Ledger' subsection in the dropped_signals area. Posture: ledger is advisory metadata, source remains authoritative; preserve when source-supported, record dropped_signals when not; do NOT invent dropped_signals entries for signals not in the ledger or source.

(3) 12 synthetic unit tests cover ledger construction: key_value ids with section/kind tags, output_names tracked separately when distinct from ids, formula_hint and depends_on inclusion, formula_hint omission when null, id-equals-output_name dedupe (kind=id wins), unmodelled_gates inclusion, first-iteration empty-ledger message, and explicit exclusion of label/source_text/comment/value. Plus 3 end-to-end tests covering build_combined_digest with and without --prior.

End-to-end empirical check: ran prepare_extract_input.py --prior on paperclip's v49 parameters.json. The ledger lands at the end of the digest with all 16 prior signals — including the latency-tripwire trio (api_latency_p99_threshold_ms, api_latency_margin_ms, actual_api_p99_latency_ms) that v51 silently dropped per the v49→v51 audit on main. The infrastructure is now in place for the LLM extract skill to see these prior signals and either preserve them OR record dropped_signals.

What this PR explicitly does NOT do:

  - Does not re-run the LLM extract skill end-to-end (that is the user's next step via the standard skill workflow). The skill re-run plus audit comparison is the empirical validation of whether the ledger actually helps the LLM populate dropped_signals usefully.

  - Does not pass the full prior parameters.json (per review direction — anchoring risk).

  - Does not change strict mode, CI gating, or Fork A scope (those land in later PRs once this loop is proven useful).

  - Does not bundle Phase 5 verify-bounds-citations or different-LLM validation.

Empirical posture: 12 new unit tests pass. 9/9 smoke checks pass. End-to-end smoke run on paperclip produced a clean digest with the ledger appended. No corpus literals introduced (the ledger emits the actual prior_baseline ids, but those are extracted ids from gitignored corpus outputs, not literals embedded in the prompt or code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
huangyingting pushed a commit to repomesh/PlanExe that referenced this pull request May 22, 2026
…hip-set

Updates two docs to reflect the post-PlanExeOrg#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour — two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter; extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field; 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema); bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError; advisory audit_source_preservation.py step.

20260520_plan.md → 20260522_plan.md: bump status date; mark PR PlanExeOrg#750 merged; add PR PlanExeOrg#751/PlanExeOrg#752/PlanExeOrg#753 entries (proposal 141 implementation); update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set); add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom); reorder Next likely move now that proposal 141 has shipped — Phase 5 citation verifier promoted to PlanExeOrg#1, Phase 8 samplers added as PlanExeOrg#2 with v58 cases that bite now, Phase 9 composite-band cap as PlanExeOrg#3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>