napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks by neoneye · Pull Request #750 · PlanExeOrg/PlanExe

neoneye · 2026-05-21T15:40:52Z

Summary

Addresses the residual paperclip v53c failure mode: the LLM sometimes files a tripwire under risks_and_shocks instead of gates_and_thresholds. The tripwire appears either in if/then form ('If $X exceeds threshold, then <downside>') or in declarative form ('<X> exceeds threshold, <consequence>'). The misfiled item then competes against actual risks at top-N selection and can fall out of the public output entirely.

Approach

A deterministic post-processor over the LLM emissions, not a prompt rule. The first iteration of this PR tried a risks-side prompt rule; review correctly flagged that its causal model was wrong. Gates emits before risks, so a risks-side prompt rule cannot move items into gates; it can only suppress them in risks, which risks creating a worse failure mode where both buckets miss the item. That commit has been reverted on this branch; the gates and risks prompts are now back to their pre-PR state.

Code change

has_gate_shape(line) — true when the surface form matches either:
- Canonical if/then numeric: If <... digit token ...> then <consequence>
- Declarative comparison numeric: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence> — the actual v53c phrasing. Recognised comparison verbs: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. Causal verbs (X risks Y, X causes Y, Failure of X leads to Y) are not recognised — those stay in risks.
Numeric guard: separator comma/colon must not be followed by another digit, so commas inside numbers like $75,000 don't split the match. Qualitative if-then sentences without a numeric token between if and then are intentionally excluded so the promoter doesn't steal categorical/approval/deadline gates that the LLM already categorises correctly.
gate_shape_promotion(line) — returns the if/then form of line if it has any recognised gate shape, else None. For declarative inputs it produces a deterministic if/then rewrite preserving the gates bucket's output contract. Acronym casing is preserved (API job queue latency... rewrites to If API ..., NOT If aPI ...) — the case adjustment only fires on regular capitalised words (uppercase followed by lowercase). line_original is intentionally NOT rewritten; it preserves the source's native phrasing.
promote_gate_shaped_risks(gates, risks) — scans risks for gate-shaped items whose normalised source_quote is NOT already in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed. Items already in gates by quote are left in risks untouched (within-bucket dedupe is the existing 'do not restate' prompt rule's job).
Wiring — defers annotate_scored_items for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged pools, then annotate_scored_items fires on the augmented gates pool and the remaining risks pool. The promoter's count is exposed in per_bucket.gates_and_thresholds.cross_bucket_promoted_count for downstream auditing.
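The detector described above could be sketched roughly as follows. This is a minimal illustration, not the code that landed in the PR: the verb list comes from the description, but the regular expressions, the constant names, and the exact matching strategy are assumptions.

```python
import re

# Verb list copied from the PR description; the regexes are this sketch's own.
_COMPARISON_VERBS = (
    r"exceeds|falls below|drops below|rises above|breaches|reaches|surpasses|"
    r"is (?:above|below|greater than|less than|more than)"
)

# Canonical form: capture everything between "if" and the first "then".
_IF_THEN = re.compile(r"^\s*if\b(?P<condition>.*?)\bthen\b\s*\S", re.IGNORECASE)

# One threshold character: anything except a comma/colon, or a comma/colon
# that is immediately followed by a digit (so "$75,000" stays in one piece).
_THRESHOLD_CHAR = r"(?:[^,:]|[,:](?=\d))"

# Declarative form: <subject> <verb> <threshold with digit> [,:] <consequence>.
# The separator itself must NOT be followed by a digit (the numeric guard).
_DECLARATIVE = re.compile(
    r"^\s*\S.*?\s(?:" + _COMPARISON_VERBS + r")\s+"
    + _THRESHOLD_CHAR + r"*?\d" + _THRESHOLD_CHAR + r"*?"
    + r"[,:](?!\d)\s*\S",
    re.IGNORECASE,
)


def has_gate_shape(line):
    """Return True when the line looks like a numeric gate (hypothetical sketch)."""
    if not isinstance(line, str):
        return False
    if_then = _IF_THEN.match(line)
    # Qualitative if/then sentences (no digit in the condition) are excluded.
    if if_then and re.search(r"\d", if_then.group("condition")):
        return True
    return bool(_DECLARATIVE.match(line))
```

In this sketch the causal-verb exclusion needs no extra code: a line such as 'X causes Y' simply contains none of the recognised comparison verbs, so neither pattern matches.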

Empirical posture

Unit tests: 44 pass in test_compress_report_section.py. New cases include: has_gate_shape true/false for both if/then and declarative shapes; positive regression on the literal v53c phrasing Middleware development bid exceeds $75,000, consuming budget...; negative regression on the genuine risk Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.; end-to-end promoter test that asserts the v53c-shaped risk is moved with line_english rewritten to if/then form and source_quote / scores / status / line_original preserved; acronym preservation (API stays API); regular capitalisation lowering (Middleware → middleware).
v56 regression sweep (paperclip 3× + 5 other probes, 290 risks candidate lines, 32 plan×section cells): 0 promotions fired. The v53c shape did NOT recur in this same-LLM same-session sweep; the v56 risks emissions are dominated by causal forms (X risks Y, X causes Y, Failure of X leads to Y) which the detector intentionally rejects. The change is a deterministic backstop for a rare LLM failure mode, analogous to the calculation-output strip rule in PR #747 (napkin-math(bounds): Phase 4 runtime + schema readiness), which also fired 0 times on its v48 regression corpus.
The 8-run regression otherwise shows typical same-LLM variance (mostly ±1-3 items per cell, no systematic over-narrowing). One section had a 0-candidate emission failure (paperclip v56c expert_criticism gates), unrelated to this change — the gates LLM call itself is unchanged; the promoter only acts post-LLM.

What this PR explicitly does NOT claim

Does not claim to "fix the v53c failure" by changing live LLM behaviour in the v56 session: the v53c-shaped failure didn't recur, so the promoter had nothing to act on.
Does not claim to improve any bucket count metric — the v56 vs v53 count deltas are within typical same-LLM variance.
Does claim to provide a deterministic backstop for the structural failure mode that PR #743 (napkin-math(compress): second-pass shifts the variance failure from emission to ranking) and PR #744 (napkin-math(compress): paraphrase-tolerant quote verification) left open: the LLM correctly emits the item with qv=True somewhere, but in the wrong bucket. When that recurs, the promoter routes it to gates with a clean if/then rewrite that preserves the gates bucket's output contract and acronym casing.

Test plan

pytest worker_plan/.../tests/test_compress_report_section.py (44 pass)
v56 paperclip 3× + 5 regression probes — no systematic over-narrowing, 0 false-positive promotions
Reverted the wrong risks-side prompt rule from this PR's first commit
Detector covers both if/then numeric AND declarative comparison forms (the actual v53c phrasing)
Acronym casing preserved by the if/then rewrite
CI green on this branch

🤖 Generated with Claude Code

…tes_and_thresholds via the risks-side rule

Addresses the residual paperclip v53c failure mode: the LLM occasionally files a '$X exceeds threshold, then <downside>' tripwire under risks_and_shocks instead of gates_and_thresholds. The 'do not restate' guard in the risks prompt didn't catch the case because the LLM put the item in risks first, not as a restatement.

Change: adds a structural-priority paragraph to the risks_and_shocks bucket prompt that tells the LLM NOT to emit a sentence here when its source side has the 'If <metric> <comparator> <numeric threshold>, then <consequence>' shape; that shape belongs in gates_and_thresholds even when the then-clause is downside-flavoured (cost, schedule, scope, penalty, vendor switch).

Why ONLY the risks side and not also a parallel paragraph in the gates prompt: I tried both sides in v54 and found a clear over-narrowing regression. The LLM became more conservative about what counted as a gate, with paperclip expert_criticism dropping from 6 gates to 2, yellowstone selected_scenario from 6 to 3, and similar shrinkages elsewhere. Adding a long structural-shape paragraph to the gates prompt implicitly raised the bar for what counted as a gate (numeric thresholds only), excluding legitimate deadline/categorical gates. The risks-side rule alone is enough to claim the if/then numeric sentences for gates without narrowing gates from the other direction.

Empirical posture (regression check, NOT improvement claim; same-LLM same-session Gemini Flash Lite reruns):
- Paperclip 3x (v55a/b/c): $75k OPC UA bid lands in gates_and_thresholds public top in ALL THREE runs (vs 2/3 before in v53). This is the focal v53c case the change targets.
- 5-other-plan cross-probe regression (euro_adoption, yellowstone, crate, mars_gtld, datacenter): bucket counts mostly unchanged. Modest 1-2-item shrinkages in datacenter selected_scenario and yellowstone selected_scenario are within typical LLM run variance.
- Two sections produced 0 gates in v55 (crate premortem and mars_gtld expert_criticism). These are LLM run variance unrelated to this change: (a) the gates bucket prompt is unchanged so a risks-side rule cannot affect first-pass gate emission; (b) mars_gtld expert_criticism gates is a known-flaky combo (saw 0 candidates on PR #744 rerun); (c) crate premortem produced 6 gates fine in v54 with an even more aggressive prompt, ruling out the v55 risks-side change as the cause.

Unit tests: 28 pass (no test changes; the prompt edit is structural language, not new code paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…et promoter (deterministic, code-side)

Review feedback on PR #750 (first round): the risks-side prompt rule had a wrong causal model. gates_and_thresholds is emitted BEFORE risks_and_shocks in BUCKET_SPECS, so a risks-side prompt rule cannot move an item into gates; it can only suppress emission in risks. Worse, when gates already missed the item (v53c-style), the risks-side suppression rule removes the fallback visible copy, turning 'wrong bucket but visible' into 'missing from public output entirely'. The v55 3/3 paperclip result was LLM run variance in the gates bucket call, not evidence the prompt rule worked.

Replaces the prompt-side rule with a deterministic post-processor that scans the actual LLM emissions across both buckets and reroutes by structural shape:
- has_gate_shape(line): true when the surface form matches 'If <something with a digit token> ... then <consequence>', the structural shape the gates bucket prompt asks the LLM to produce. Language-neutral (digits are digits in any locale); does not key on English-only keywords beyond the if/then template the bucket prompt already requires. Qualitative if-then sentences (no numeric token) are intentionally excluded: they may legitimately be gates (categorical/approval/deadline) but the promoter only fires on the unambiguous numeric pattern to avoid stealing genuine risks.
- promote_gate_shaped_risks(gates_items, risks_items): scans risks for gate-shaped items whose normalised source_quote is NOT already represented in the gates pool. Promoted items are MOVED to the gates candidate pool (not copied) so the risks slot is reclaimed for an actual risk. Items already in gates by quote are left in risks untouched (within-bucket dedupe is a separate concern handled by the existing 'do not restate' prompt rule). Inputs are not mutated.
- Wiring: defers annotate_scored_items (top-N filter) for gates_and_thresholds and risks_and_shocks until both have completed first+second-pass merging. After the bucket loop, the promoter runs on both merged candidate pools, then annotate fires on the augmented gates pool and the remaining risks pool. Per-bucket metadata gains a cross_bucket_promoted_count field so downstream consumers can audit.

Reverted the earlier risks-side prompt addition from this branch: it was both causally wrong AND created a worse failure mode (per the user critique). Gates and risks bucket prompts are now back to their pre-PR state.

Empirical posture: 37 unit tests pass (28 prior + 9 new: has_gate_shape true/false/non-string, promotion fire/skip/dedupe/empty/no-mutation). Same-LLM same-session paperclip 3x + 5-other-plan regression sweep (v56) shows 0 promotions across all 32 plan x section cells; the v53c-style miscategorisation did not recur in this session, so the promoter had nothing to act on. The change is a deterministic backstop for a rare LLM failure mode, analogous to PR #747's calculation-output strip which also fired 0 times on its regression corpus. The 8-run regression is otherwise within typical same-LLM variance (mostly +/-1-3 items; paperclip v56c expert_criticism is a separate 0-candidate emission failure unrelated to this change, since the gates LLM call is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
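The move-not-copy behaviour described in this commit message might look like the sketch below. It is only an illustration under stated assumptions: the item layout (dicts with line_english, line_original, source_quote), the quote normalisation, the three-value return, and the simplified stand-in for gate_shape_promotion are all hypothetical and not taken from the repository.

```python
import copy
import re


def _normalise_quote(quote):
    # Assumed normalisation: lowercase and collapse whitespace.
    return " ".join(str(quote).lower().split())


def _stand_in_promotion(line):
    # Simplified stand-in for gate_shape_promotion: it only recognises
    # "<subject> exceeds <threshold>, <consequence>" and skips case adjustment.
    match = re.match(r"^(.+?) exceeds (\S*\d\S*?), (.+)$", line)
    if match is None:
        return None
    subject, threshold, consequence = match.groups()
    return f"If {subject} exceeds {threshold}, then {consequence}"


def promote_gate_shaped_risks(gates, risks):
    """Return (new_gates, new_risks, promoted_count) without mutating inputs."""
    new_gates = copy.deepcopy(gates)
    seen_quotes = {_normalise_quote(g.get("source_quote", "")) for g in new_gates}
    new_risks = []
    promoted_count = 0
    for item in risks:
        rewritten = _stand_in_promotion(item.get("line_english", ""))
        quote_key = _normalise_quote(item.get("source_quote", ""))
        if rewritten is None or quote_key in seen_quotes:
            # Not gate-shaped, or gates already holds this quote: stay in risks.
            new_risks.append(copy.deepcopy(item))
            continue
        moved = copy.deepcopy(item)
        moved["line_english"] = rewritten  # line_original is left as emitted
        new_gates.append(moved)            # moved, not copied: no risks entry
        seen_quotes.add(quote_key)
        promoted_count += 1
    return new_gates, new_risks, promoted_count
```

Working on deep copies is one straightforward way to honour the "inputs are not mutated" guarantee; the actual implementation may achieve it differently.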

neoneye · 2026-05-21T16:00:10Z

Addressed review (1755269c). The risks-side prompt rule has been reverted on this branch; the new approach is the deterministic post-processor (option 2):

has_gate_shape(line) — true for If <... digit token ...> then <...> surface form. Language-neutral.
promote_gate_shaped_risks(gates, risks) — scans risks for gate-shaped items whose normalised quote is not in gates, moves them to the gates candidate pool, removes them from risks. Items already in gates by quote are left in risks untouched.
Wiring: defers annotate_scored_items for both buckets until both have emitted; runs promoter between LLM emission and top-N filter.
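The deferral described in the wiring bullet can be illustrated with a small control-flow sketch. Everything here is hypothetical scaffolding: run_bucket_loop, its parameters, and the shape of the per-bucket dict are assumptions made for illustration, and the promote and annotate callables stand in for promote_gate_shaped_risks and annotate_scored_items, whose real signatures are not shown in this PR page.

```python
DEFERRED_BUCKETS = ("gates_and_thresholds", "risks_and_shocks")


def run_bucket_loop(bucket_order, merged_candidates, promote, annotate):
    """Annotate buckets, holding back gates and risks until both are merged.

    bucket_order: bucket names in emission order.
    merged_candidates: name -> candidate list after first+second-pass merge.
    promote: (gates, risks) -> (gates, risks, promoted_count).
    annotate: candidate list -> filtered list (the top-N step).
    """
    per_bucket = {}
    deferred = {}
    for name in bucket_order:
        pool = merged_candidates[name]
        if name in DEFERRED_BUCKETS:
            deferred[name] = pool  # the top-N filter must wait for the promoter
        else:
            per_bucket[name] = {"items": annotate(pool)}

    if all(name in deferred for name in DEFERRED_BUCKETS):
        gates, risks, promoted_count = promote(
            deferred["gates_and_thresholds"], deferred["risks_and_shocks"]
        )
        per_bucket["gates_and_thresholds"] = {
            "items": annotate(gates),
            "cross_bucket_promoted_count": promoted_count,
        }
        per_bucket["risks_and_shocks"] = {"items": annotate(risks)}
    else:
        # Only one of the pair was requested: nothing to promote across.
        for name, pool in deferred.items():
            per_bucket[name] = {"items": annotate(pool)}
    return per_bucket
```

The point of the ordering is that a misfiled item gets rerouted before the top-N filter has a chance to drop it from a crowded risks pool.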

Empirical finding worth surfacing: 0 promotions fired across all 8 v56 runs (paperclip 3× + 5 regression probes, 32 plan×section cells). The v53c-style miscategorisation didn't recur in this LLM session, so the promoter had nothing to act on. The change is a deterministic backstop, same posture as PR #747's calculation-output strip (which also fired 0 times on its v48 regression corpus).

37 unit tests pass (28 prior + 9 new). PR title and body rewritten to reflect the new approach and the honest empirical posture.

…mparison shape (the actual v53c phrasing)

Review feedback on PR #750 (second round): the first iteration only caught canonical 'If <... digit ...> then <...>' form, which is NOT the phrasing the LLM used in the historical v53c failure. The v53c risk-bucket line was declarative: 'Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system.' has_gate_shape() returned False on that, so the advertised v53c backstop did not address the historical failure.

Extended the detector to also recognise the declarative form: <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence>.

Comparison verbs in the recognised list are structural cues, not domain vocabulary: exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. The verb membership is the structural cue; if the line uses a causal verb ('X risks Y', 'X causes Y', 'Failure of X leads to Y'), it stays in risks.

Numeric guard: the threshold must contain a digit token AND the separator comma/colon must not be followed by another digit (so commas inside numbers like '$75,000' do not split the match). 'Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase.' is rejected because there is no comparison verb between subject and digit (negative regression test added).

Deterministic if/then rewrite preserves the gates_and_thresholds bucket's output contract: a declarative line is rewritten as 'If <subject> <verb> <threshold>, then <consequence>' with case adjustments for mid-sentence flow. line_original is intentionally not rewritten; it keeps the source's native phrasing for downstream consumers.

Three new regression tests added: (1) the exact v53c phrasing is now recognised by has_gate_shape and rewritten to if/then form by gate_shape_promotion; (2) the genuine supply-chain risk shape stays rejected; (3) the promoter end-to-end correctly moves the v53c-shaped risk to gates with line_english rewritten while source_quote, scores, status, and line_original are preserved. 42 unit tests pass total.

Empirical posture: audited the extended detector across the v56 sweep (290 risks candidate lines, 8 runs). 0 of 290 match the extended pattern. v56 risks emissions are dominated by causal forms ('X risks Y', 'X causes Y') rather than declarative comparison. The detector covers the v53c shape (verified by unit test on the exact historical line) but the v53c shape did not recur in v56. The promoter remains a deterministic backstop: exercised by unit tests, dormant on the live sweep, same posture as PR #747's calculation-output strip rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye · 2026-05-21T16:43:19Z

Addressed (b495818b). The detector now actually covers the v53c shape:

Extended has_gate_shape to recognise two forms:

Canonical If <... digit ...> then <...> (unchanged from prior commit)
Declarative <subject> + comparison verb + <threshold with digit> + comma/colon + <consequence> — the actual v53c phrasing

Comparison verb list (structural cues, not domain vocabulary): exceeds, falls below, drops below, rises above, breaches, is above/below/greater than/less than/more than, reaches, surpasses. Causal verbs (X risks Y, X causes Y, Failure of X leads to Y) are NOT recognised — those stay in risks.

Numeric guard: separator comma/colon must not be followed by another digit so commas inside numbers like $75,000 don't split the match.

Deterministic if/then rewrite preserves the gates bucket's output contract: a declarative line is rewritten as If <subject> <verb> <threshold>, then <consequence> with case adjustments. line_original stays unchanged.

The exact tests you requested are in:

Positive: test_has_gate_shape_accepts_declarative_v53c_form uses the literal v53c line "Middleware development bid exceeds $75,000, consuming budget planned for the physical handoff accumulation system."
Negative: test_has_gate_shape_rejects_genuine_risk_with_colon_delay uses "Supply chain disruption: 4 to 6 weeks delay and $15,000 cost increase."
End-to-end: test_promote_gate_shaped_risks_rewrites_declarative_v53c_form asserts the v53c-shaped risk is promoted with line_english rewritten to if/then form, source_quote/scores/status preserved, line_original untouched, and the item removed from risks.

42 unit tests pass total.

Honest empirical posture (audited the extended detector against v56): across 290 risks candidate lines in the 8 v56 runs, 0 match the extended pattern. The actual v56 risks emissions are dominated by causal forms, not declarative comparison. The v53c shape didn't recur in v56. The promoter is still a deterministic backstop — covered by the regression test on the literal historical line, dormant on the live sweep. Same posture as PR #747's calculation-output strip rule.

PR title remains "addresses paperclip v53c miscategorisation" — the change now actually does, by regression-test construction on the literal phrasing.

…f/then rewrite

Review feedback on PR #750 (third round): the rewrite naively lowercased the first character of the subject and consequence, which damages acronyms. 'API job queue latency exceeds 100ms, ...' was being rewritten to 'If aPI job queue latency...', which is visibly broken.

Fix: only lowercase the first character when it is followed by a lowercase letter (a regular capitalised word like 'Middleware'). Acronyms like 'API' / 'OPC UA' and digit prefixes like '5G' do not match that pattern, so they are left unchanged.

Two new regression tests: (1) 'API job queue latency exceeds 100ms, ...' rewrites to 'If API job queue latency exceeds 100ms, then ...' with the acronym intact; (2) 'Middleware development bid exceeds $75,000, ...' still rewrites to 'If middleware development bid exceeds $75,000, then ...' (regular capitalisation still adjusted).

44 unit tests pass (42 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye · 2026-05-21T17:01:44Z

Addressed both cleanup items (542cc196):

Acronym casing preserved. _lowercase_first_preserving_acronyms only lowercases the first character when it's uppercase AND followed by a lowercase letter (a regular capitalised word). API / OPC UA / 5G all have a non-lowercase second character, so they're left unchanged. Two new regression tests: one asserts API job queue latency exceeds 100ms, ... rewrites to If API job queue latency exceeds 100ms, then ... with aPI explicitly absent; the other asserts Middleware still lowers to middleware so the normal capitalisation path still works.
PR body updated to reflect 44 tests (was 37) and to describe both the canonical if/then detector AND the declarative comparison detector with the comparison-verb list, numeric guard, rewrite behaviour, and acronym preservation.
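The casing rule in the first item above is small enough to sketch directly. The helper below follows the rule as stated in the comment, but its body is a guess at the behaviour rather than the merged code, and to_if_then is a hypothetical wrapper added only to show the helper in context.

```python
def lowercase_first_preserving_acronyms(text):
    """Lowercase the first character only for a regular capitalised word.

    Sketch of the rule described for _lowercase_first_preserving_acronyms:
    the first character must be uppercase AND the second must be lowercase.
    """
    if len(text) >= 2 and text[0].isupper() and text[1].islower():
        return text[0].lower() + text[1:]
    return text  # acronyms ("API"), digit prefixes ("5G"), short strings


def to_if_then(subject, verb, threshold, consequence):
    # Hypothetical wrapper: reassemble already-parsed clauses into if/then form.
    subject = lowercase_first_preserving_acronyms(subject.strip())
    consequence = lowercase_first_preserving_acronyms(consequence.strip())
    return f"If {subject} {verb} {threshold}, then {consequence}"


# 'Workers stall.' is an invented consequence used only for illustration.
print(to_if_then("API job queue latency", "exceeds", "100ms", "Workers stall."))
print(to_if_then(
    "Middleware development bid", "exceeds", "$75,000",
    "consuming budget planned for the physical handoff accumulation system.",
))
```

Checking the second character rather than the whole first word keeps the rule cheap and covers 'API', 'OPC UA', and '5G' without any acronym dictionary.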

44 unit tests pass.

…promoter) in 20260520 plan

Per user direction, the plan-status update lands in PR #750 (not a separate doc PR).
- PR #749 marked merged (was previously open).
- PR #750 added to the landed-on-main section with the honest 'shipped after two reverted iterations' process note: first attempt was a risks-side prompt rule with the wrong causal model, second attempt only detected canonical if/then form and missed the actual v53c declarative phrasing, third commit extended the detector to both shapes with acronym-preserving rewrite. 44 unit tests including the literal v53c regression on the historical line.
- Phase 1 status row updated to reference PR #750 as the cross-bucket promoter backstop on top of #737/#743/#744.
- Next-likely-move list re-ordered: bucket-categorisation no longer item 1 (now covered by #750). Proposal 141 takes item 1, Phase 5 verify-bounds-citations takes item 2, different-LLM behavioural validation takes item 3, prompt-hygiene takes item 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hip-set

Updates two docs to reflect the post-#753 state of the napkin-math pipeline.

methology.md: describe the current pipeline behaviour:
- two-batch compress with paraphrase-tolerant quote match and cross-bucket promoter;
- extract's source-arithmetic preservation, threshold-pairing, and dropped_signals field;
- 19-check validator (added aggregate_not_bounded, requirement_has_margin, dropped_signals_schema);
- bounds' asymmetric source label on commitment defaults, calculation-output strip, reserved correlations block, reserved lognormal/pert disciplines with loud NotImplementedError;
- advisory audit_source_preservation.py step.

20260520_plan.md → 20260522_plan.md:
- bump status date;
- mark PR #750 merged;
- add PR #751/#752/#753 entries (proposal 141 implementation);
- update Phase status table (added 4.5 audit row, reclassified Phase 8 as partially done, Phase 10 marked done for current ship-set);
- add v58 14-plan empirical snapshot (1 viable / 5 fragile / 8 doom);
- reorder Next likely move now that proposal 141 has shipped: Phase 5 citation verifier promoted to item 1, Phase 8 samplers added as item 2 with v58 cases that bite now, Phase 9 composite-band cap as item 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

neoneye and others added 2 commits May 21, 2026 17:40

neoneye changed the title ~~napkin-math(compress): claim downside-framed if/then sentences for gates_and_thresholds (risks-side rule)~~ napkin-math(compress): cross-bucket promoter for gate-shaped items misfiled under risks May 21, 2026

neoneye merged commit 63b59ef into main May 21, 2026
3 checks passed

neoneye deleted the napkin-math/compress-gates-vs-risks-priority branch May 21, 2026 17:13

neoneye mentioned this pull request May 21, 2026

docs(napkin-math): refresh methodology + plan status for 2026-05-22 ship-set #754

Merged

4 tasks

neoneye merged 5 commits into main from napkin-math/compress-gates-vs-risks-priority
