napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks #750
…tes_and_thresholds via the risks-side rule

Addresses the residual paperclip v53c failure mode: the LLM occasionally files a '$X exceeds threshold, then <downside>' tripwire under risks_and_shocks instead of gates_and_thresholds. The 'do not restate' guard in the risks prompt didn't catch the case because the LLM put the item in risks first, not as a restatement.

Change: adds a structural-priority paragraph to the risks_and_shocks bucket prompt that tells the LLM NOT to emit a sentence here when its source side has the 'If <metric> <comparator> <numeric threshold>, then <consequence>' shape. That shape belongs in gates_and_thresholds even when the then-clause is downside-flavoured (cost, schedule, scope, penalty, vendor switch).

Why ONLY the risks side and not also a parallel paragraph in the gates prompt: I tried both sides in v54 and found a clear over-narrowing regression. The LLM became more conservative about what counted as a gate, with paperclip expert_criticism dropping from 6 gates to 2, yellowstone selected_scenario from 6 to 3, and similar shrinkages elsewhere. Adding a long structural-shape paragraph to the gates prompt implicitly raised the bar for what counted as a gate (numeric thresholds only), excluding legitimate deadline/categorical gates. The risks-side rule alone is enough to claim the if/then numeric sentences for gates without narrowing gates from the other direction.

Empirical posture (regression check, NOT improvement claim; same-LLM same-session Gemini Flash Lite reruns):

Paperclip 3x (v55a/b/c): $75k OPC UA bid lands in gates_and_thresholds public top in ALL THREE runs (vs 2/3 before in v53). This is the focal v53c case the change targets.

5-other-plan cross-probe regression (euro_adoption, yellowstone, crate, mars_gtld, datacenter): bucket counts mostly unchanged. Modest 1-2-item shrinkages in datacenter selected_scenario and yellowstone selected_scenario are within typical LLM run variance.

Two sections produced 0 gates in v55 (crate premortem and mars_gtld expert_criticism). These are LLM run variance unrelated to this change: (a) the gates bucket prompt is unchanged so a risks-side rule cannot affect first-pass gate emission; (b) mars_gtld expert_criticism gates is a known-flaky combo (saw 0 candidates on PR #744 rerun); (c) crate premortem produced 6 gates fine in v54 with an even more aggressive prompt, ruling out the v55 risks-side change as the cause.

Unit tests: 28 pass (no test changes; the prompt edit is structural language, not new code paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…et promoter (deterministic, code-side)

Review feedback on PR #750 (first round): the risks-side prompt rule had a wrong causal model. gates_and_thresholds is emitted BEFORE risks_and_shocks in BUCKET_SPECS, so a risks-side prompt rule cannot move an item into gates; it can only suppress emission in risks. Worse, when gates already missed the item (v53c-style), the risks-side suppression rule removes the fallback visible copy, turning 'wrong bucket but visible' into 'missing from public output entirely'. The v55 3/3 paperclip result was LLM run variance in the gates bucket call, not evidence the prompt rule worked.

Replaces the prompt-side rule with a deterministic post-processor that scans the actual LLM emissions across both buckets and reroutes by structural shape:

has_gate_shape(line): true when the surface form matches 'If <something with a digit token> ... then <consequence>', the structural shape the gates bucket prompt asks the LLM to produce. Language-neutral (digits are digits in any locale); does not key on English-only keywords beyond the if/then template the bucket prompt already requires. Qualitative if-then sentences (no numeric token) are intentionally excluded. They may legitimately be gates (categorical/approval/deadline), but the promoter only fires on the unambiguous numeric pattern to avoid stealing genuine risks.

promote_gate_shaped_risks(gates_items, risks_items): scans risks for gate-shaped items whose normalised source_quote is NOT already represented in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed for an actual risk. Items already in gates by quote are left in risks untouched (within-bucket dedupe is a separate concern handled by the existing 'do not restate' prompt rule). Inputs are not mutated.

Wiring: defers annotate_scored_items (top-N filter) for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged candidate pools, then annotate fires on the augmented gates pool and the remaining risks pool. Per-bucket metadata gains a cross_bucket_promoted_count field so downstream consumers can audit.

Reverted the earlier risks-side prompt addition from this branch: it was both causally wrong AND created a worse failure mode (per the user critique). Gates and risks bucket prompts are now back to their pre-PR state.

Empirical posture: 37 unit tests pass (28 prior + 9 new: has_gate_shape true/false/non-string, promotion fire/skip/dedupe/empty/no-mutation). Same-LLM same-session paperclip 3x + 5-other-plan regression sweep (v56) shows 0 promotions across all 32 plan x section cells. The v53c-style miscategorisation did not recur in this session, so the promoter had nothing to act on. The change is a deterministic backstop for a rare LLM failure mode, analogous to PR #747's calculation-output strip which also fired 0 times on its regression corpus. The 8-run regression is otherwise within typical same-LLM variance (mostly +/-1-3 items). paperclip v56c expert_criticism is a separate 0-candidate emission failure unrelated to this change, since the gates LLM call is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
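As an illustration of the move-not-copy behaviour this commit describes, here is a minimal sketch, not the PR's actual code. It assumes candidate items are plain dicts carrying `line_english` and `source_quote`, that quote normalisation is lowercase plus whitespace collapse, and that the promoter returns a `(gates, risks, promoted_count)` tuple. The regex covers only the canonical if/then form from this commit.

```python
import re
from copy import deepcopy

# Assumption: canonical shape is "If <condition with a digit> ... then <consequence>".
_IF_THEN = re.compile(r"^\s*if\b(?P<cond>.*?)\bthen\b(?P<cons>.+)$",
                      re.IGNORECASE | re.DOTALL)


def has_gate_shape(line):
    """True only for if/then lines whose condition contains a digit."""
    if not isinstance(line, str):
        return False
    m = _IF_THEN.match(line)
    return bool(m and re.search(r"\d", m.group("cond")))


def _norm(quote):
    # Hypothetical normalisation: lowercase and collapse whitespace.
    return " ".join(str(quote).lower().split())


def promote_gate_shaped_risks(gates_items, risks_items):
    """Move gate-shaped risks into the gates pool; inputs are not mutated."""
    gate_quotes = {_norm(g.get("source_quote", "")) for g in gates_items}
    new_gates = deepcopy(gates_items)
    remaining_risks = []
    promoted = 0
    for item in risks_items:
        quote = _norm(item.get("source_quote", ""))
        if has_gate_shape(item.get("line_english")) and quote not in gate_quotes:
            new_gates.append(deepcopy(item))   # moved, so the risks slot is freed
            gate_quotes.add(quote)             # a later duplicate stays in risks
            promoted += 1
        else:
            remaining_risks.append(deepcopy(item))
    return new_gates, remaining_risks, promoted
```

Because top-N selection runs only after this step, a promoted item competes against other gates instead of against genuine risks.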
Addressed review (
Empirical finding worth surfacing: 0 promotions fired across all 8 v56 runs (paperclip 3× + 5 regression probes, 32 plan×section cells). The v53c-style miscategorisation didn't recur in this LLM session, so the promoter had nothing to act on. The change is a deterministic backstop, same posture as PR #747's calculation-output strip. 37 unit tests pass (28 prior + 9 new). PR title and body rewritten to reflect the new approach and the honest empirical posture.
…mparison shape (the actual v53c phrasing)

Review feedback on PR #750 (second round): the first iteration only caught canonical 'If <... digit ...> then <...>' form, which is NOT the phrasing the LLM used in the historical v53c failure. The v53c risk-bucket line was declarative: 'Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system.' has_gate_shape() returned False on that, so the advertised v53c backstop did not address the historical failure.

Extended the detector to also recognise the declarative form: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>.

Comparison verbs in the recognised list are structural cues, not domain vocabulary: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. The verb membership is the structural cue; if the line uses a causal verb ('X risks Y', 'X causes Y', 'Failure of X leads to Y'), it stays in risks.

Numeric guard: the threshold must contain a digit token AND the separator comma/colon must not be followed by another digit (so commas inside numbers like '$75,000' do not split the match). 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.' is rejected because there is no comparison verb between subject and digit (negative regression test added).

Deterministic if/then rewrite preserves the gates_and_thresholds bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>' with case adjustments for mid-sentence flow. line_original is intentionally not rewritten; it keeps the source's native phrasing for downstream consumers.

Three new regression tests added: (1) the exact v53c phrasing is now recognised by has_gate_shape and rewritten to if/then form by gate_shape_promotion; (2) the genuine supply-chain risk shape stays rejected; (3) the promoter end-to-end correctly moves the v53c-shaped risk to gates with line_english rewritten while source_quote, scores, status, and line_original are preserved. 42 unit tests pass total.

Empirical posture: audited the extended detector across the v56 sweep (290 risks candidate lines, 8 runs). 0 of 290 match the extended pattern. v56 risks emissions are dominated by causal forms ('X risks Y', 'X causes Y') rather than declarative comparison. The detector covers the v53c shape (verified by unit test on the exact historical line) but the v53c shape did not recur in v56. The promoter remains a deterministic backstop: exercised by unit tests, dormant on the live sweep, same posture as PR #747's calculation-output strip rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
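The declarative detector and its numeric guard can be sketched with a single regular expression. This is an illustrative reconstruction under stated assumptions, not the code in the PR: the function names `split_declarative_gate` and `to_if_then` are hypothetical, lines are assumed to be single-line strings, and the case adjustment for mid-sentence flow is deliberately left out.

```python
import re

# Verb list taken from the commit message above.
COMPARISON_VERBS = (
    "exceeds", "falls below", "drops below", "rises above", "breaches",
    "is above", "is below", "is greater than", "is less than", "is more than",
    "reaches", "surpasses",
)
_VERB_ALT = "|".join(re.escape(v) for v in COMPARISON_VERBS)

# Assumed pattern: <subject> <verb> <threshold with digit> [,:] <consequence>.
# Inside the threshold, a comma/colon is allowed only when a digit follows it,
# so "$75,000" stays whole; the separator itself must NOT be followed by a digit.
_DECLARATIVE_GATE = re.compile(
    r"^\s*(?P<subject>.+?)\s+(?P<verb>" + _VERB_ALT + r")\s+"
    r"(?P<threshold>[^,:]*?\d(?:[^,:]|[,:](?=\d))*)"
    r"[,:](?!\d)\s*(?P<consequence>\S.*)$",
    re.IGNORECASE,
)


def split_declarative_gate(line):
    """Return (subject, verb, threshold, consequence), or None if no match."""
    if not isinstance(line, str):
        return None
    m = _DECLARATIVE_GATE.match(line)
    if m is None:
        return None
    return (m.group("subject"), m.group("verb"),
            m.group("threshold").strip(), m.group("consequence"))


def to_if_then(line):
    """Compose the if/then form; casing is left exactly as in the input."""
    parts = split_declarative_gate(line)
    if parts is None:
        return None
    subject, verb, threshold, consequence = parts
    return f"If {subject} {verb} {threshold}, then {consequence}"
```

With this pattern the historical v53c line splits at the comma after `$75,000`, while the supply-chain line returns `None` because no listed comparison verb appears in it.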
Addressed ( Extended
Comparison verb list (structural cues, not domain vocabulary): exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses.
Numeric guard: separator comma/colon must not be followed by another digit so commas inside numbers like '$75,000' do not split the match.
Deterministic if/then rewrite preserves the gates bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>'.
The exact tests you requested are in:
42 unit tests pass total. Honest empirical posture (audited the extended detector against v56): across 290 risks candidate lines in the 8 v56 runs, 0 match the extended pattern. The actual v56 risks emissions are dominated by causal forms, not declarative comparison. The v53c shape didn't recur in v56. The promoter is still a deterministic backstop: covered by the regression test on the literal historical line, dormant on the live sweep. Same posture as PR #747's calculation-output strip rule.
PR title remains "addresses paperclip v53c miscategorisation"; the change now actually does, by regression-test construction on the literal phrasing.
…f/then rewrite

Review feedback on PR #750 (third round): the rewrite naively lowercased the first character of the subject and consequence, which damages acronyms. 'API job queue latency exceeds 100ms, ...' was being rewritten to 'If aPI job queue latency...', which is visibly broken.

Fix: only lowercase the first character when it is followed by a lowercase letter (a regular capitalised word like 'Middleware'). Acronyms like 'API' / 'OPC UA' / digit prefixes like '5G' all have a non-lowercase second character, so they are left unchanged.

Two new regression tests: (1) 'API job queue latency exceeds 100ms, ...' rewrites to 'If API job queue latency exceeds 100ms, then ...' with the acronym intact, (2) 'Middleware development bid exceeds $75,000, ...' still rewrites to 'If middleware development bid exceeds $75,000, then ...' (regular capitalisation still adjusted).

44 unit tests pass (42 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
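The casing rule from this commit fits in a few lines. The sketch below is an assumption about how such a helper could look, not the PR's implementation; `lower_first_if_regular_word` and `compose_if_then` are hypothetical names.

```python
def lower_first_if_regular_word(text: str) -> str:
    """Lowercase the first character only when the second one is lowercase.

    'Middleware ...' becomes 'middleware ...', while 'API ...', 'OPC UA ...'
    and '5G ...' are returned unchanged because their second character is
    not a lowercase letter.
    """
    if len(text) >= 2 and text[0].isupper() and text[1].islower():
        return text[0].lower() + text[1:]
    return text


def compose_if_then(condition: str, consequence: str) -> str:
    """Join an already-split condition and consequence into if/then form."""
    return (f"If {lower_first_if_regular_word(condition)}, "
            f"then {lower_first_if_regular_word(consequence)}")
```

Checking the second character rather than maintaining an acronym list keeps the rule vocabulary-free, which matches the structural posture of the rest of the detector.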
Addressed both cleanup items (
44 unit tests pass.
…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR).

PR #749 marked merged (was previously open). PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note: first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line.

Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.

Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hip-set

Updates two docs to reflect the post-PlanExeOrg#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour: two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter; extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field; 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema); bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError; advisory audit_source_preservation.py step.

20260520_plan.md → 20260522_plan.md: bump status date; mark PR PlanExeOrg#750 merged; add PR PlanExeOrg#751/PlanExeOrg#752/PlanExeOrg#753 entries (proposal 141 implementation); update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set); add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom); reorder Next likely move now that proposal 141 has shipped: Phase 5 citation verifier promoted to #1, Phase 8 samplers added as #2 with v58 cases that bite now, Phase 9 composite-band cap as #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Addresses the residual paperclip v53c failure mode: the LLM sometimes files a tripwire ('If $X exceeds threshold, then <downside>', OR the declarative '<X> exceeds threshold, <consequence>') under risks_and_shocks instead of gates_and_thresholds. The misfiled item then competes against actual risks at top-N selection and can fall out of the public output entirely.
Approach
A deterministic post-processor over the LLM emissions, not a prompt rule. The first iteration of this PR tried a risks-side prompt rule. Review correctly flagged that the causal model was wrong: gates emits before risks, so a risks-side prompt rule cannot move items into gates, and it risks creating a worse failure mode in which both buckets miss the item. That commit has been reverted on this branch; the gates and risks prompts are now back to their pre-PR state.
Code change
has_gate_shape(line): true when the surface form matches either (1) 'If <... digit token ...> then <consequence>', or (2) '<subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>', the actual v53c phrasing. Recognised comparison verbs: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. Causal verbs ('X risks Y', 'X causes Y', 'Failure of X leads to Y') are not recognised; those stay in risks. Numeric guard: separator comma/colon must not be followed by another digit, so commas inside numbers like $75,000 don't split the match. Qualitative if-then sentences without a numeric token between if and then are intentionally excluded so the promoter doesn't steal categorical/approval/deadline gates that the LLM already categorises correctly.
gate_shape_promotion(line): returns the if/then form of line if it has any recognised gate shape, else None. For declarative inputs it produces a deterministic if/then rewrite preserving the gates bucket's output contract. Acronym casing is preserved ('API job queue latency...' rewrites to 'If API ...', NOT 'If aPI ...'); the case adjustment only fires on regular capitalised words (uppercase followed by lowercase). line_original is intentionally NOT rewritten; it preserves the source's native phrasing.
promote_gate_shaped_risks(gates, risks): scans risks for gate-shaped items whose normalised source_quote is NOT already in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed. Items already in gates by quote are left in risks untouched (within-bucket dedupe is the existing 'do not restate' prompt rule's job).
Wiring: defers annotate_scored_items for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged pools, then annotate_scored_items fires on the augmented gates pool and the remaining risks pool. The promoter's count is exposed in per_bucket.gates_and_thresholds.cross_bucket_promoted_count for downstream auditing.
Empirical posture
test_compress_report_section.py. New cases include: has_gate_shape true/false for both if/then and declarative shapes; positive regression on the literal v53c phrasing 'Middleware development bid exceeds $75,000, consuming budget...'; negative regression on the genuine risk 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.'; end-to-end promoter test that asserts the v53c-shaped risk is moved with line_english rewritten to if/then form and source_quote / scores / status / line_original preserved; acronym preservation (API stays API); regular capitalisation lowering (Middleware → middleware).
…X risks Y, X causes Y, Failure of X leads to Y) which the detector intentionally rejects. The change is a deterministic backstop for a rare LLM failure mode, analogous to the calculation-output strip rule in PR #747 (napkin-math(bounds): Phase 4 runtime + schema readiness), which also fired 0 times on its v48 regression corpus.
What this PR explicitly does NOT claim
…qv=True somewhere, but in the wrong bucket. When that recurs, the promoter routes it to gates with a clean if/then rewrite that preserves the gates bucket's output contract and acronym casing.
Test plan
pytest worker_plan/.../tests/test_compress_report_section.py (44 pass)
🤖 Generated with Claude Code