napkin-math: advisory source-preservation audit (Fork B, proposal 141 PR 1) #751
… PR 1)

Starts proposal 141 with the smallest useful slice: a deterministic advisory audit that diffs a prior parameters.json against a current parameters.json and classifies every prior signal as one of five outcomes. No strict mode, no CI gating, no schema changes to extract prompts, no LLM rationale parsing; those land in later proposal 141 PRs after this audit's findings (false-positive rate, useful catches) are measured on the corpus.

Classification of each prior id:
- preserved_by_id: same id appears in current
- preserved_by_output_name: prior id appears as an output_name in current (downstream binders use output_name, so the signal is alive)
- preserved_as_formula_dependency: prior id is on a current formula RHS or in a depends_on list (alive as a calculation input)
- likely_renamed: snake_case token Jaccard overlap >= 0.4 with one or more current ids (advisory; top 3 candidates by overlap)
- absent_unexplained: no preservation evidence at all

Calibration: the 0.4 Jaccard threshold catches typical renames (actual_X → X_target overlaps at 0.6; X_dkk → X_floor_dkk at 0.75; X_capex → X_capex_floor at 0.83) while rejecting unrelated ids. Tested by synthetic fixtures; corpus probes show 1–12 renames per plan depending on how much the artifact was restructured between iterations.

15 synthetic unit tests cover: preserved_by_id (within-section and cross-section), preserved_by_output_name (primitive → calculation rename), preserved_as_formula_dependency (RHS or depends_on), likely_renamed (single and ranked candidates), absent_unexplained (no overlap and empty current), empty prior, jaccard primitive, parse_rhs_tokens primitive with assignment / builtin-filter / non-string handling, and the text-report rendering.

Out of scope (deferred to later proposal 141 PRs):
- Fork A: source-digest regex scan against the current artifact.
- The optional dropped_signals schema in extract prompts.
- LLM rationale parsing of dropped_signals entries.
- Strict-mode exit-non-zero policy.
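The five-way classification described in this commit message can be sketched roughly as below. This is a minimal illustration, not the code in audit_source_preservation.py: the function names `classify` and `rhs_identifiers`, the flat list of entries, and the assumption that a formula looks like `lhs = rhs` are all hypothetical. Only the status names, the field names (`id`, `output_name`, `formula_hint`, `depends_on`) and the 0.4 threshold come from the text above.

```python
import re

JACCARD_THRESHOLD = 0.4  # threshold quoted in the commit message


def tokens(name):
    """Split a snake_case name into a set of lowercase tokens."""
    return {t for t in name.lower().split("_") if t}


def jaccard(a, b):
    """Token-set Jaccard overlap between two snake_case names."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rhs_identifiers(formula):
    """Identifiers right of the first '=' (assumed 'lhs = rhs' shape)."""
    if not isinstance(formula, str):
        return set()  # tolerate null / non-string hints
    rhs = formula.split("=", 1)[-1]
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rhs))


def classify(prior_id, current_entries):
    """Return one of the five statuses for a single prior id.

    current_entries is a flat list of dicts (a simplification of the
    sectioned parameters.json layout).
    """
    ids = {e["id"] for e in current_entries if e.get("id")}
    output_names = {e["output_name"] for e in current_entries if e.get("output_name")}
    dependencies = set()
    for e in current_entries:
        dependencies |= rhs_identifiers(e.get("formula_hint"))
        dependencies |= set(e.get("depends_on") or [])

    # Strongest evidence first; the first match wins.
    if prior_id in ids:
        return "preserved_by_id"
    if prior_id in output_names:
        return "preserved_by_output_name"
    if prior_id in dependencies:
        return "preserved_as_formula_dependency"
    if any(jaccard(prior_id, cur) >= JACCARD_THRESHOLD for cur in ids):
        return "likely_renamed"
    return "absent_unexplained"
```

The ordering of the checks is the design point: a rename guess is only attempted once every exact-match form of evidence has failed, which keeps the heuristic status from masking a real preservation.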
Corpus probe findings (v49 → v51, 6 probes): the audit mechanically surfaces the paperclip latency-tripwire trio absence (actual_api_p99_latency_ms, api_latency_margin_ms, api_latency_p99_threshold_ms — all absent_unexplained), the yellowstone public_compliance_threshold absence, the crate q_logistics_budget_margin/target_recovery_rate/annual_crate_loss_volume absences, and datacenter's wholesale 0-of-23-ids restructure. False-positive surface acknowledged: some likely_renamed candidates at jaccard 0.4–0.5 are borderline; role changes (margin ↔ realised variable) read as renames by the token-only heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
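To make the calibration numbers and the borderline cases concrete, here is a small sketch of token-set Jaccard overlap with a top-3 candidate ranking. It assumes names are split on underscores and compared as sets, which reproduces the j=0.43 figure the PR body quotes for paperclip's manual_intervention_margin_hours. The function `rename_candidates` is a hypothetical stand-in for the real find_rename_candidates, and `manual_intervention_hours`, `budget_cap_dkk` and `annual_site_build_out_capex` are made-up ids used only for illustration.

```python
def tokens(name):
    """Split a snake_case name into a set of lowercase tokens."""
    return {t for t in name.lower().split("_") if t}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the token sets of two names."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rename_candidates(prior_name, current_names, threshold=0.4, top_n=3):
    """Current names at or above the threshold, best overlap first."""
    scored = [(jaccard(prior_name, cur), cur) for cur in current_names]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # score desc, then name
    return [(name, round(score, 2)) for score, name in kept[:top_n]]


# actual_X -> X_target with a three-token X: 3 shared / 5 total = 0.6
print(jaccard("actual_api_p99_latency", "api_p99_latency_target"))

# The borderline paperclip pair: 3 shared tokens out of 7 distinct ones.
print(rename_candidates(
    "manual_intervention_margin_hours",
    [
        "actual_manual_intervention_hours_per_week",
        "api_latency_margin_ms",
        "manual_intervention_hours",  # hypothetical id
    ],
))
```

Because the score only counts shared tokens, a margin variable and a realised variable that share most of their words score like a rename, which is the false-positive surface acknowledged above.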
…de ids

Review feedback on PR #751: the audit only counted prior ids as signals, so a calc whose entry id survives but whose output_name changes (or disappears) was reported as 'preserved_by_id' even though the downstream-binding signal was lost. Downstream consumers (calculations.py, monte-carlo, summarize-assessment) bind by output_name, not by entry id, so output_name drift is a genuine signal regression that the audit must surface.

Changes:

(1) build_signal_index now returns a 'signals' dict that includes BOTH ids and output_names. The audit iterates this combined signal set instead of just prior ids. A name that appears as both an id and an output_name (typical for calcs where id == output_name) is counted once with kind='id' as the authoritative reading.

(2) find_rename_candidates draws candidates from the union of current ids AND current output_names. Mars-style restructures where a prior id is moved to a new calc's output_name (rather than to a current entry's id) now surface as likely_renamed with the output_name as the candidate.

(3) Detail entries gain a 'prior_kind' field ('id' or 'output_name') and the text report tags each line with [section/kind] so reviewers can distinguish entry-id drift from output-name drift.

Two new synthetic tests added:
(a) the exact scenario the reviewer requested: prior derived_question with id=q_margin and output_name=old_margin against current with id=q_margin and output_name=new_margin reports q_margin preserved_by_id AND old_margin absent_unexplained;
(b) a Mars-style rename to a current output_name surfaces as likely_renamed with the output_name as the candidate.

17 unit tests pass total.

Corpus probe re-run on v49 → v51: prior_total grew for plans whose prior output_names were distinct from their ids (crate 21→23, mars 22→23, datacenter 23→25). Euro_adoption, yellowstone, and paperclip totals are unchanged because their prior calcs all had id == output_name.
The known absences from the previous run still surface (paperclip latency-tripwire trio, yellowstone public_compliance_threshold, crate missing signals, datacenter wholesale restructure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
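The combined signal set described in change (1) can be illustrated with the sketch below. It is an assumption-laden approximation rather than the real build_signal_index: the function name `sketch_signal_index` and the shape of the returned dict are hypothetical, while the five section names, the `id` / `output_name` fields, and the rule that kind='id' wins when a name is both come from this PR.

```python
SECTIONS = (
    "key_values",
    "missing_values_to_estimate",
    "derived_questions",
    "recommended_first_calculations",
    "unmodelled_gates",
)


def sketch_signal_index(parameters):
    """Map every signal name to its kind and section.

    Ids are indexed first so that a name used as both an id and an
    output_name is counted once, with kind='id'.
    """
    signals = {}
    for kind in ("id", "output_name"):
        for section in SECTIONS:
            for entry in parameters.get(section) or []:
                name = entry.get(kind)
                if name and name not in signals:
                    signals[name] = {"kind": kind, "section": section}
    return signals


# The reviewer's scenario: the entry id survives, the output_name drifts.
prior = {
    "key_values": [{"id": "unit_cost", "output_name": "unit_cost"}],
    "derived_questions": [{"id": "q_margin", "output_name": "old_margin"}],
}
current = {
    "key_values": [{"id": "unit_cost", "output_name": "unit_cost"}],
    "derived_questions": [{"id": "q_margin", "output_name": "new_margin"}],
}
missing = sorted(set(sketch_signal_index(prior)) - set(sketch_signal_index(current)))
print(missing)
```

With ids alone the two files would look identical; indexing output_names as well is what lets old_margin show up as a lost signal.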
Addressed (`266b6af2`). Output_name is now a first-class audited signal:
The exact test you requested is in:
Also added
Corpus probe re-run (counts in updated PR body): prior_total grew for plans whose prior output_names were distinct from their ids: crate 21→23, mars_gtld 22→23, datacenter 23→25. Euro_adoption, yellowstone, and paperclip totals are unchanged because their prior calcs all had id == output_name.
All previously-surfaced absences (paperclip latency trio, yellowstone public_compliance_threshold, crate missing signals, datacenter wholesale restructure) still surface.
17 unit tests pass total (15 prior + 2 new). PR body updated.
Three review fixes:

1. plan: update the stale 'No formal source-preservation audit implementation' bullet. Fork B shipped in PR #751/#752/#753; Fork A, orchestrator-side prior-baseline injection, and strict-mode are the actual still-pending follow-ups.
2. plan: bump the document title from 2026-05-20 to 2026-05-22; add an italicised note that the doc was originally drafted 2026-05-20 and renamed/refreshed for the post-#753 ship-set.
3. methology: stop overclaiming what the assessment Basis column exposes. summarize_assessment.py maps source:'data' → 'report_derived' and source:'assumption' → 'model_assumption', and that is what the column shows; the finer 'plan-internal gap forecast vs bare commitment' distinction lives in the rationale string, not the column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…proposal 141 PR 2)

Builds on PR PlanExeOrg#751 (Fork B advisory audit) by giving the LLM an optional vocabulary to explain absences. Three coordinated changes:

(1) Both extract prompts (from-digest and from-full) gain an optional top-level dropped_signals array. Each entry must name a structural reason from a closed enum (replaced_by, cap_pressure, out_of_scope, moved_to_unmodelled_gate, redundant_with) and reference the current signal it was replaced by, made redundant with, moved to, or capped under. Hard limit 8 entries; rationale ≤25 words. Corpus-agnostic wording, no plan literals.

(2) validate_parameters.py grows a 19th check, dropped_signals_schema, that ERRORs on malformed entries:
- unknown reason
- unresolved replacement_id/redundant_with_id
- cap_pressure on an array that isn't actually at its cap
- moved_to_unmodelled_gate replacement_id not pointing at an unmodelled_gates entry
- rationale over the 25-word cap
- total over the 8-entry cap
Per the proposal, malformed dropped_signals entries are audit failures; they should not be accepted as explanations.

(3) audit_source_preservation.py adds a new explained_drop classification status that ranks above likely_renamed and absent_unexplained. When current parameters.json's dropped_signals records the prior signal with a valid reason, the audit reclassifies the disappearance as explained_drop with the structured reason and reference. Malformed dropped_signals entries are silently skipped by the audit (validate_parameters surfaces them); double-counting is avoided by ignoring dropped_signals entries whose id is actually preserved in current.
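A rough sketch of the kind of check that change (2) describes follows. It is not the dropped_signals_schema check in validate_parameters.py. The function name `check_dropped_signals`, the entry field names `id`, `reason` and `rationale`, and the error wording are assumptions; the reason enum, the 8-entry and 25-word caps, and the `replacement_id` / `redundant_with_id` names are taken from the text above. The cap_pressure rule is left out because the per-array caps are not stated here.

```python
ALLOWED_REASONS = {
    "replaced_by",
    "cap_pressure",
    "out_of_scope",
    "moved_to_unmodelled_gate",
    "redundant_with",
}
MAX_ENTRIES = 8
MAX_RATIONALE_WORDS = 25


def check_dropped_signals(parameters, current_signals):
    """Return a list of error strings; an empty list means clean.

    current_signals is the set of ids and output_names in the current file.
    """
    entries = parameters.get("dropped_signals")
    if entries is None:
        return []  # the field is optional, so legacy outputs stay clean
    if not isinstance(entries, list):
        return ["dropped_signals must be a list"]

    errors = []
    if len(entries) > MAX_ENTRIES:
        errors.append(f"{len(entries)} entries exceeds the cap of {MAX_ENTRIES}")

    gate_ids = {g.get("id") for g in parameters.get("unmodelled_gates") or []}
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: not an object")
            continue
        reason = entry.get("reason")
        if not isinstance(reason, str) or reason not in ALLOWED_REASONS:
            errors.append(f"entry {i}: unknown reason {reason!r}")
            continue
        rationale = entry.get("rationale")
        if isinstance(rationale, str) and len(rationale.split()) > MAX_RATIONALE_WORDS:
            errors.append(f"entry {i}: rationale over {MAX_RATIONALE_WORDS} words")
        if reason == "replaced_by" and entry.get("replacement_id") not in current_signals:
            errors.append(f"entry {i}: unresolved replacement_id")
        if reason == "redundant_with" and entry.get("redundant_with_id") not in current_signals:
            errors.append(f"entry {i}: unresolved redundant_with_id")
        if reason == "moved_to_unmodelled_gate" and entry.get("replacement_id") not in gate_ids:
            errors.append(f"entry {i}: replacement_id is not an unmodelled_gates entry")
    return errors
```

Resolving references against the current file is what stops an explanation from pointing at a signal that does not exist, which would otherwise let an unexplained absence pass as explained.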
15 new unit tests added:
- 9 validator: replaced_by clean, unknown reason, unresolved references, cap_pressure must match a capped array at cap, redundant_with required field, moved_to_unmodelled_gate must point at unmodelled_gates, rationale word cap, entry count cap, absent field is clean
- 6 audit: reclassification from absent to explained_drop, explained_drop outranks likely_renamed, ignored when prior is actually preserved, silently skips malformed entries, cap_pressure handling, moved_to_unmodelled_gate handling

51 total tests (28 validator + 23 audit) all pass. 9/9 smoke checks. All 6 v51 parameters.json validate clean with 19 checks; the new schema check is correctly inert on legacy outputs without the optional field.

Discovered limitation worth surfacing: the LLM can only meaningfully emit dropped_signals with origin=prior_baseline when the orchestrator passes it the prior parameters.json as additional input. The current extract skill reads only the source digest, so a same-LLM same-session regeneration of v51 would emit zero prior_baseline drops. The schema/validator/audit-consumption infrastructure is in place; the orchestrator/skill wiring that lets the LLM see prior baselines is a separate PR (proposal 141 PR 3 candidate). The audit's Fork B comparison itself does not need the LLM to know about the prior, because the audit reads both files externally.

Out of scope (later proposal 141 PRs):
- Fork A: source-digest regex scan against the current artifact (independent advisory line)
- Orchestrator wiring to pass prior parameters.json to the extract skill
- Strict mode / CI gating policy
- source_claim_ids per-entry field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
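The audit-side behaviour from this commit (explained_drop outranking likely_renamed and absent_unexplained, malformed entries skipped, preserved signals left alone) might look roughly like the sketch below. The function `apply_explained_drops`, the input shape (a dict of name to status), and the entry field names `id` and `reason` are hypothetical; the status names, the reason enum and the precedence rule are taken from the commit message.

```python
ALLOWED_REASONS = {
    "replaced_by",
    "cap_pressure",
    "out_of_scope",
    "moved_to_unmodelled_gate",
    "redundant_with",
}
# Only these two statuses may be upgraded to explained_drop.
OVERRIDABLE = {"likely_renamed", "absent_unexplained"}


def apply_explained_drops(statuses, dropped_signals):
    """Reclassify unexplained disappearances that the current file explains.

    statuses maps each prior signal name to its Fork B status.
    """
    result = {name: {"status": status} for name, status in statuses.items()}
    for entry in dropped_signals or []:
        if not isinstance(entry, dict):
            continue  # malformed: skipped silently, the validator reports it
        name, reason = entry.get("id"), entry.get("reason")
        if not isinstance(name, str) or not isinstance(reason, str):
            continue
        if reason not in ALLOWED_REASONS:
            continue
        detail = result.get(name)
        if detail is None or detail["status"] not in OVERRIDABLE:
            continue  # not a prior signal, or the signal is actually preserved
        reference = entry.get("replacement_id") or entry.get("redundant_with_id")
        result[name] = {"status": "explained_drop", "reason": reason, "reference": reference}
    return result
```

Restricting the override to the two "missing" statuses is what avoids double-counting: a dropped_signals entry for a signal that still exists changes nothing.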
…posal 141 PR 3)

Builds on PR PlanExeOrg#751 (Fork B audit) and PR PlanExeOrg#752 (dropped_signals schema + strict consumption) by closing the prior_baseline loop: the extract skill now has a way to see what the previous iteration emitted, so it can decide what to preserve and what to explain-drop.

Per review direction, the orchestration is intentionally NARROW: the full prior parameters.json is NOT passed in. Instead, prepare_extract_input.py builds a compact Prior Signal Ledger and appends it to the combined digest at the end. The ledger contains only:
- signal names (entry ids and output_names)
- section and kind (id or output_name)
- formula_hint when present
- depends_on when non-empty

Intentionally excluded: source_text, label, comment, value. These would anchor the LLM on old phrasings and old framings; the ledger is a preservation BUDGET, not a phrasing TARGET. The source digest above the ledger remains the authoritative input.

Changes:

(1) prepare_extract_input.py grows a --prior CLI flag pointing at a prior parameters.json. When omitted (first-iteration extraction), no ledger is appended and behavior is unchanged. When provided, build_prior_signal_ledger emits a compact markdown section appended after the bundle.

(2) Both extract prompts (from-digest and from-full) gain a 'Prior Signal Ledger' subsection in the dropped_signals area. Posture: ledger is advisory metadata, source remains authoritative; preserve when source-supported, record dropped_signals when not; do NOT invent dropped_signals entries for signals not in the ledger or source.

(3) 12 synthetic unit tests cover ledger construction: key_value ids with section/kind tags, output_names tracked separately when distinct from ids, formula_hint and depends_on inclusion, formula_hint omission when null, id-equals-output_name dedupe (kind=id wins), unmodelled_gates inclusion, first-iteration empty-ledger message, and explicit exclusion of label/source_text/comment/value.
Plus 3 end-to-end tests covering build_combined_digest with and without --prior.

End-to-end empirical check: ran prepare_extract_input.py --prior on paperclip's v49 parameters.json. The ledger lands at the end of the digest with all 16 prior signals, including the latency-tripwire trio (api_latency_p99_threshold_ms, api_latency_margin_ms, actual_api_p99_latency_ms) that v51 silently dropped per the v49→v51 audit on main. The infrastructure is now in place for the LLM extract skill to see these prior signals and either preserve them OR record dropped_signals.

What this PR explicitly does NOT do:
- Does not re-run the LLM extract skill end-to-end (that is the user's next step via the standard skill workflow). The skill re-run plus audit comparison is the empirical validation of whether the ledger actually helps the LLM populate dropped_signals usefully.
- Does not pass the full prior parameters.json (per review direction: anchoring risk).
- Does not change strict mode, CI gating, or Fork A scope (those land in later PRs once this loop is proven useful).
- Does not bundle Phase 5 verify-bounds-citations or different-LLM validation.

Empirical posture: 12 new unit tests pass. 9/9 smoke checks pass. End-to-end smoke run on paperclip produced a clean digest with the ledger appended. No corpus literals introduced (the ledger emits the actual prior_baseline ids, but those are extracted ids from gitignored corpus outputs, not literals embedded in the prompt or code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
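For readers who want a picture of the Prior Signal Ledger, the sketch below builds a ledger with the inclusion and exclusion rules listed in this commit. It is an illustration under assumptions, not the real build_prior_signal_ledger: the function name `sketch_prior_signal_ledger`, the row layout, the `## ` heading and the empty-ledger wording are all hypothetical. The included fields (names, section, kind, formula_hint, depends_on), the excluded fields (source_text, label, comment, value) and the kind=id dedupe rule come from the text above.

```python
SECTIONS = (
    "key_values",
    "missing_values_to_estimate",
    "derived_questions",
    "recommended_first_calculations",
    "unmodelled_gates",
)


def _row(name, section, kind, entry):
    """One ledger line: name, [section/kind], then optional structure hints."""
    parts = [f"- {name} [{section}/{kind}]"]
    hint = entry.get("formula_hint")
    if hint:  # omitted when null or empty
        parts.append(f"formula_hint: {hint}")
    deps = entry.get("depends_on") or []
    if deps:  # omitted when empty
        parts.append("depends_on: " + ", ".join(deps))
    return " | ".join(parts)


def sketch_prior_signal_ledger(prior):
    """Render a compact markdown ledger of prior signal names.

    Only structural fields are read; label, source_text, comment and value
    are never touched, so old phrasing cannot leak into the digest.
    """
    rows, seen = [], set()
    for kind in ("id", "output_name"):  # ids first, so kind=id wins on ties
        for section in SECTIONS:
            for entry in prior.get(section) or []:
                name = entry.get(kind)
                if name and name not in seen:
                    seen.add(name)
                    rows.append(_row(name, section, kind, entry))
    if not rows:
        rows = ["(no prior signals: first-iteration extraction)"]
    return "\n".join(["## Prior Signal Ledger", ""] + rows) + "\n"
```

Building each row from an allow-list of fields, rather than copying the entry and deleting fields, means any field added to parameters.json later is excluded by default.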
…hip-set

Updates two docs to reflect the post-PlanExeOrg#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour:
- two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter
- extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field
- 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema)
- bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError
- advisory audit_source_preservation.py step

20260520_plan.md → 20260522_plan.md:
- bump status date
- mark PR PlanExeOrg#750 merged
- add PR PlanExeOrg#751/PlanExeOrg#752/PlanExeOrg#753 entries (proposal 141 implementation)
- update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set)
- add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom)
- reorder Next likely move now that proposal 141 has shipped: Phase 5 citation verifier promoted to #1, Phase 8 samplers added as #2 with v58 cases that bite now, Phase 9 composite-band cap as #3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Starts proposal 141 with the smallest useful slice: a deterministic advisory audit that diffs a prior `parameters.json` against a current `parameters.json` and classifies every prior signal as one of five outcomes. No strict mode, no CI gating, no schema changes to extract prompts, no LLM rationale parsing; those land in later proposal 141 PRs after this audit's findings (false-positive rate, useful catches) are measured on the corpus.

What counts as a "prior signal"
Both entry `id` values AND entry `output_name` values across the five sections (`key_values`, `missing_values_to_estimate`, `derived_questions`, `recommended_first_calculations`, `unmodelled_gates`). Output_names are first-class because downstream consumers (calculations.py, monte-carlo, summarize-assessment) bind by output_name, not by entry id: a calc whose entry id survives but whose output_name changes is a genuine signal regression.

Classification
Each prior signal name is classified by what evidence the current artifact provides:
- `preserved_by_id`: the same id appears in current
- `preserved_by_output_name`: the prior name appears as a current `output_name` (downstream binders use output_name)
- `preserved_as_formula_dependency`: the prior name is on a current `formula_hint` RHS or in a current `depends_on` list
- `likely_renamed`: snake_case token Jaccard overlap >= 0.4 with one or more current names (advisory; top 3 candidates by overlap)
- `absent_unexplained`: no preservation evidence at all

Threshold calibration: `actual_X → X_target` overlaps at 0.6, `X_dkk → X_floor_dkk` at 0.75, `X_capex → X_capex_floor` at 0.83. Unrelated ids stay near 0. 0.4 is permissive enough to surface renames without flooding the report.

Detail entries carry `prior_kind` (`"id"` or `"output_name"`) so reviewers can distinguish entry-id drift from output-name drift. The text renderer tags each line with `[section/kind]`.

Out of scope (later proposal 141 PRs)
- `dropped_signals` schema: optional field in extract prompts.
- LLM rationale parsing of `dropped_signals` entries.

Empirical findings on v49 → v51 (6 probes)
Useful catches the audit surfaces mechanically:
- paperclip latency-tripwire trio absent: `actual_api_p99_latency_ms`, `api_latency_margin_ms`, `api_latency_p99_threshold_ms`. This is exactly the silent-regression failure mode the plan doc has been flagging as needing mechanical detection.
- yellowstone: `public_compliance_threshold` absent (documented v49→v51 regression in the plan's "Known limitations" section).
- crate: `q_logistics_budget_margin`, `target_recovery_rate`, `annual_crate_loss_volume` all absent.

False-positive surface (called out honestly):
- Some `likely_renamed` candidates at jaccard 0.4–0.5 are borderline. Example: paperclip's `manual_intervention_margin_hours → actual_manual_intervention_hours_per_week` (j=0.43) might be a different concept (margin variable vs realised variable), not a true rename.

Test plan
`pytest experiments/napkin_math/tests/test_audit_source_preservation.py` (17 synthetic unit tests pass: 15 + 2 for output_name drift and output_name-as-rename-candidate)

🤖 Generated with Claude Code