Skip to content

callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract#487

Merged
igerber merged 10 commits into
mainfrom
cs-vcov-type-phase1b-4
May 23, 2026
Merged

callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract#487
igerber merged 10 commits into
mainfrom
cs-vcov-type-phase1b-4

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 23, 2026

Summary

Phase 1b interstitial. Four tightly-coupled items in CallawaySantAnna and adjacent infrastructure:

  1. Bug fix: bare CallawaySantAnna(cluster="state").fit(...) was a silent no-op — accepted at __init__, stored as self.cluster, returned by get_params(), but never consumed in the fit / aggregator / bootstrap pipeline. Users got per-unit IF inference believing they had cluster-robust SE; typically SE too small under intra-cluster correlation, no warning. Survey-PSU clustering via survey_design=SurveyDesign(psu="state") was NOT affected. Fix synthesizes a minimal SurveyDesign(psu=self.cluster, weight_type="pweight") when bare cluster= is set without an explicit survey design, threading through the existing PSU-meat aggregator + PSU-multiplier bootstrap. Three-branch wiring mirrors estimators.py:497-516; wiring inserted BEFORE _validate_unit_constant_survey so synthesized / injected designs catch movers on panel data.

  2. Narrow vcov_type input contract: CallawaySantAnna(vcov_type=...) now accepts {"hc1"} only. {classical, hc2, hc2_bm, conley} REJECTED at __init__ with methodology-rooted messages. Rejection is library-architectural, not paper-prescribed: CS uses influence-function-based variance per Callaway & Sant'Anna (2021); per-(g,t) doubly-robust / IPW / outcome-regression structure has no single design matrix to compute hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF on. Narrow contract is permanent. _validate_vcov_type extracted as @staticmethod; called from both __init__ and fit() so sklearn-style set_params mutations are re-checked at use time.

  3. Survey / non-survey contract via dedicated df_inference field: bare cluster= populates new Results.df_inference (cluster-level df for HonestDiD's t-critical-value selection) but leaves Results.survey_metadata as None — preserving the canonical "user provided a survey design" signal that DiagnosticReport (Bacon skip + 2x2 PT skip) and CallawaySantAnnaResults.summary() (survey block render) read. HonestDiD at both extraction sites prefers df_inference over survey_metadata.df_survey. Replicate-weight survey designs combined with bare cluster= raise NotImplementedError (replicate IF variance ignores PSU entirely).

  4. TripleDifference defensive test: added test_cluster_changes_ses regression test asserting TripleDifference(cluster="state") produces SE differing from cluster=None. TripleDifference's bare-cluster code path at triple_diff.py:1245-1259 was already correct but lacked a positive regression test.

Audit verified the bare-cluster= no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare cluster= correctly.

Methodology references

  • Method name(s): CallawaySantAnna (Callaway & Sant'Anna 2021)
  • Paper / source link(s): https://doi.org/10.1016/j.jeconom.2020.12.001
  • Any intentional deviations from the source (and why):
    • Narrow vcov_type to {"hc1"}: library-architectural, not a deviation from CS 2021 — analytical-sandwich variance families and Conley spatial-HAC don't compose with IF-based variance machinery. Documented in docs/methodology/REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" subsection.
    • SurveyDesign(psu=cluster) synthesis for bare cluster= is a documented synthesis (per CLAUDE.md "Documenting Deviations" — **Note:** label in REGISTRY.md): the synthesis itself is new wiring; the downstream PSU-meat machinery (_compute_stratified_psu_meat) is the established survey-side path; the CR1 Liang-Zeger algebra on IF is Williams (2000) / Hansen (2007).

Validation

  • Tests added/updated:
    • tests/test_staggered.pyTestCallawaySantAnnaClusterWiring (8 tests: panel + RCS + survey precedence coverage), TestCallawaySantAnnaVcovTypeNarrowContract (10 tests: input contract + set_params re-validation), TestCallawaySantAnnaClusterSafetyGates (6 tests: mover validation + replicate reject + survey/non-survey contract + end-to-end HonestDiD integration)
    • tests/test_triple_diff.pyTestTripleDifferenceClusterDefensive::test_cluster_changes_ses
    • tests/test_staggered.py::TestCallawaySantAnnaClusterWiring::test_cluster_none_path_unchanged verifies cluster=None path is bit-equal to pre-PR (wiring guarded by if self.cluster is not None:)
    • All 410 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass
  • Backtest / simulation / notebook evidence: N/A — no tutorial / notebook surface changes

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

igerber and others added 2 commits May 22, 2026 22:16
Phase 1b interstitial PR addressing four tightly-coupled items in
CallawaySantAnna and adjacent infrastructure:

1. Bug fix: bare CallawaySantAnna(cluster="state").fit(...) was a silent
   no-op — the parameter was accepted at __init__, stored as self.cluster,
   returned by get_params(), but never consumed in the fit / aggregator /
   bootstrap pipeline. The docstring at staggered.py:154-156 claimed
   "Defaults to unit-level clustering" but the aggregator at
   staggered_aggregation.py:193-213 computed per-unit IF variance regardless
   of self.cluster, and the bootstrap at staggered_bootstrap.py:323-347
   drew per-unit multiplier weights regardless. Users who explicitly set
   cluster="state" got per-unit inference believing they had cluster-robust
   SE — typically SE too small under intra-cluster correlation, no warning.

   Survey-PSU clustering via survey_design=SurveyDesign(psu="state") was
   NOT affected and continued to cluster correctly via the existing
   _compute_stratified_psu_meat machinery.

   The fix synthesizes a minimal SurveyDesign(psu=self.cluster,
   weight_type="pweight") when bare cluster= is set without an explicit
   survey design, threading the synthesized PSU through the existing
   survey-PSU machinery (aggregator + bootstrap). Three-branch wiring
   mirrors estimators.py:497-516:
   - bare cluster + no survey → synthesize SurveyDesign(psu=cluster)
   - survey w/o PSU + cluster → inject cluster as PSU via
     _inject_cluster_as_psu, AND construct effective_survey_design =
     replace(survey_design, psu=cluster) so _validate_unit_constant_survey
     catches movers on panel data
   - survey w/ PSU + cluster → PSU wins; _resolve_effective_cluster
     emits UserWarning if partitions differ

   Wiring is inserted BEFORE the existing validators so synthesized /
   injected designs pass through unit-constancy + pweight checks.

   cluster_name on CallawaySantAnnaResults reflects the canonical PSU:
   survey_design.psu when explicit PSU is provided, self.cluster when bare
   cluster synthesizes / injects.

2. Narrow vcov_type input contract: CallawaySantAnna(vcov_type=...) now
   accepts {"hc1"} only at __init__. Analytical-sandwich families
   {classical, hc2, hc2_bm} and conley rejected with methodology-rooted
   messages. The rejection is library-architectural, not paper-prescribed:
   CS uses influence-function-based variance per Callaway & Sant'Anna
   (2021); per-(g,t) doubly-robust / IPW / outcome-regression structure
   has no single design matrix to compute hat-matrix leverage or
   Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent;
   the same surface applies to other IF-based estimators (ImputationDiD,
   EfficientDiD) when their PRs land. _validate_vcov_type extracted as
   @staticmethod; called from both __init__ and fit() so sklearn-style
   set_params mutations are re-checked at use time (preventing silent
   propagation of bad values to Results metadata).

3. Survey/non-survey contract separation via dedicated df_inference field:
   bare cluster= populates the new Results.df_inference field (carrying
   cluster-level df for downstream HonestDiD t-critical-value selection)
   but leaves Results.survey_metadata as None — preserving the canonical
   "user provided a survey design" signal that DiagnosticReport
   (diagnostic_report.py:848-856 Bacon skip + 1150-1158 2x2 PT skip) and
   CallawaySantAnnaResults.summary() (staggered_results.py:235-238 survey
   block render) read. Legitimate survey_design= fits populate both
   df_inference AND survey_metadata; the two surfaces agree
   (df_inference == survey_metadata.df_survey). HonestDiD at
   honest_did.py prefers df_inference over survey_metadata.df_survey at
   BOTH extraction sites (MPD branch + CS branch).

   Replicate-weight survey designs combined with bare cluster= raise
   NotImplementedError: replicate-weight variance is computed by replicate
   reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster
   entirely (survey.py:104-109 enforces replicate_weights are mutually
   exclusive with strata/psu/fpc). Honoring bare cluster= would silently
   have no effect on variance while populating cluster_name/n_clusters
   dishonestly. Fail-closed per feedback_no_silent_failures.

   vcov_type, cluster_name, n_clusters, and df_inference added to
   CallawaySantAnnaResults with full class-docstring attribute entries.

4. TripleDifference defensive test_cluster_changes_ses regression test:
   TripleDifference's bare-cluster code path (triple_diff.py:1245-1259)
   was already correct but lacked a positive regression test (only an
   error-handling test for missing cluster columns existed). Added
   defensive coverage on a fixed-seed panel with state-level random
   effects, mirroring tests/test_two_stage.py::test_cluster_changes_ses.

REGISTRY.md adds the "IF-based variance estimators vs analytical-sandwich
estimators" taxonomy subsection explaining the structural split (CS,
ImputationDiD, EfficientDiD vs DiD, MPD, TWFE, SA, StackedDiD,
WooldridgeDiD), plus CS variance-families + cluster-wiring documentation.
CHANGELOG Fixed + Added bullets. TODO.md Phase 1b row updated (4
remaining estimators, CS path narrowed permanently); new CS conley and
cluster=-deprecation follow-up rows. llms-full.txt CS entry updated with
the new vcov_type + cluster behavior.

Audit verified the cluster= silent no-op was CS-specific — the other 7
Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD,
TripleDifference, TwoStageDiD, EfficientDiD) handle bare cluster=
correctly.

Tests: 25 new tests across TestCallawaySantAnnaClusterWiring (8 panel +
RCS + survey precedence coverage), TestCallawaySantAnnaVcovTypeNarrowContract
(10 input contract + set_params re-validation), TestCallawaySantAnnaClusterSafetyGates
(6 mover validation + replicate reject + survey/non-survey contract +
end-to-end HonestDiD integration), and TestTripleDifferenceClusterDefensive
(1 defensive SE-changes regression). All 326 tests
(test_staggered + test_staggered_rc + test_triple_diff + test_honest_did)
pass. cluster=None path is bit-equal to pre-PR (wiring guarded by
if self.cluster is not None).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment
⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

  • The new df_inference plumbing introduces a downstream inference bug: HonestDiD now prefers a field that CallawaySantAnna populates from the wrong df source on some survey-backed fits.
  • The vcov_type={"hc1"} restriction and the SurveyDesign(psu=cluster) synthesis are documented in the Methodology Registry, so I am not treating those as methodology defects.
  • vcov_type was inserted mid-__init__ signature, which breaks positional CallawaySantAnna(...) callers.
  • The added regressions cover the analytical cluster wiring well, but they do not pin the bootstrap path that the PR also changes.
  • I could not run the test suite in this environment because numpy is unavailable.

Methodology

  • Severity: P1. Impact: diff_diff/staggered.py (L646-L669) computes the actual survey df from the unit-collapsed design, and diff_diff/staggered.py (L1993-L1999) can tighten it further, but diff_diff/staggered.py (L2212-L2218) stores df_inference from the original resolved_survey.df_survey, and diff_diff/honest_did.py (L831-L849) now prefers that field unconditionally. On panel survey fits without an explicit PSU (weights/strata/fpc only), this can switch HonestDiD from the unit-level df actually used by CS to the larger observation-level df; on survey fits whose df was later tightened, it can override the corrected survey_metadata.df_survey. That makes HonestDiD CIs/p-values anti-conservative with no warning. Concrete fix: only populate/prefer df_inference for the bare-cluster= path that intentionally leaves survey_metadata=None, or derive it from the same final df object that backs the fitted results’ actual inference.
  • Severity: P3 informational. Impact: the permanent vcov_type="hc1" restriction and the bare-cluster= synthesis are documented in docs/methodology/REGISTRY.md (L306-L348) and docs/methodology/REGISTRY.md (L368-L378), so I am not treating those deviations as defects. Concrete fix: none.

Code Quality

  • Severity: P2. Impact: diff_diff/staggered.py (L325-L343) inserts vcov_type between cluster and n_bootstrap, unlike sibling estimators that append it after the existing positional tail, e.g. diff_diff/sun_abraham.py (L520-L530) and diff_diff/wooldridge.py (L323-L336). Any positional caller that previously passed n_bootstrap, bootstrap_weights, or later args positionally will now bind them incorrectly and fail at runtime. Concrete fix: preserve the old positional order and make vcov_type keyword-only, or append it at the end with a deprecation shim for old positional calls.

Performance
No new findings.

Maintainability
No new findings beyond the constructor-compatibility issue above.

Tech Debt

  • Severity: P3 informational. Impact: the PR’s deferrable follow-ups are properly tracked in TODO.md (L102-L108), so they do not block merge. Concrete fix: none.

Security
No findings.

Documentation/Tests

  • Severity: P3. Impact: the new cluster-wiring regression suite in tests/test_staggered.py (L4283-L4776) does not exercise the bootstrap path, even though the fix explicitly changes bootstrap routing via diff_diff/staggered_bootstrap.py (L318-L347). That leaves the bootstrap side of the original silent-no-op bug class unpinned. Concrete fix: add one positive regression with cluster="state", n_bootstrap>0 asserting clustered bootstrap SEs differ from cluster=None, plus one precedence case showing survey_design.psu still wins under bootstrap.

Path to Approval

  1. Narrow the new df_inference contract so it only covers the bare-cluster=/no-survey case, or compute it from the same final df source that backs the fitted CS inference; update HonestDiD to prefer df_inference only when survey_metadata is absent.
  2. Add a regression for panel SurveyDesign(weights=...) (no explicit PSU) + aggregate="event_study" + HonestDiD, and a second regression for a survey case whose final df is tightened, so the new precedence cannot silently overstate denominator df again.

…compat + bootstrap test coverage

Three issues:

1. df_inference contract narrowing: prior plumbing populated df_inference
   from resolved_survey.df_survey for ALL cluster branches (synthesize,
   inject, conflict), and HonestDiD preferred df_inference unconditionally.
   But CS internally uses resolved_survey_unit.df_survey (unit-collapsed)
   for safe_inference and can tighten that further via overall_effective_df
   recompute at staggered.py:1995-1999. Preferring resolved_survey.df_survey
   in HonestDiD on panel survey fits without explicit PSU (weights/strata/
   fpc only) could switch from the unit-level df actually used by CS to
   the larger observation-level df, making HonestDiD CIs/p-values anti-
   conservative without warning. Fix narrows df_inference to bare-cluster-
   synthesize path only (the case where survey_metadata is intentionally
   None to preserve the survey/non-survey contract for DiagnosticReport /
   summary). For inject/conflict branches, df_inference stays None and
   HonestDiD reads df_survey directly from survey_metadata (which carries
   the actual CS-internal df, including any post-recompute tightening).
   HonestDiD preference inverted: survey_metadata.df_survey first,
   df_inference fallback for bare-cluster fits.

2. Positional-arg compatibility: vcov_type was inserted between cluster
   and n_bootstrap in __init__, breaking positional callers who passed
   n_bootstrap or later args positionally (e.g.,
   CallawaySantAnna("never_treated", 0, "dr", 0.05, None, 999) would
   silently bind 999 to vcov_type → ValueError at validation). Moved
   vcov_type to the end of __init__ (matches SA / WooldridgeDiD
   convention of appending new params).

3. Bootstrap path coverage: added two new tests in
   TestCallawaySantAnnaClusterSafetyGates:
   - test_bare_cluster_bootstrap_se_differs_from_unit_level pins that
     bare cluster= routes bootstrap through the PSU-multiplier-weights
     branch at staggered_bootstrap.py:323-347 (closes the bootstrap-
     side of the original silent-no-op bug class)
   - test_survey_design_psu_wins_under_bootstrap pins that explicit
     PSU wins under bootstrap when cluster= + survey_design.psu have
     different partitions (mirrors the analytical-sandwich precedence)

Existing test_explicit_survey_design_does_populate_survey_metadata
updated to assert df_inference IS None in the inject branch (matches
new narrowed contract); the survey_metadata.df_survey assertion is
retained.

CHANGELOG Fixed bullet updated to describe the narrowed df_inference
contract + the inverted HonestDiD preference order.

All 328 tests (test_staggered + test_staggered_rc + test_triple_diff
+ test_honest_did) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7c6146e54ae45fd0cd53416a38980a0dab50b862


Overall Assessment
✅ Looks good. Highest unmitigated severity: P3.

Executive Summary

  • This re-review resolves the prior substantive issues: the df_inference/HonestDiD bug is fixed in substance, the constructor keeps positional compatibility, and the bootstrap path now has explicit regressions.
  • The vcov_type={"hc1"} restriction and the bare-cluster=SurveyDesign(psu=...) synthesis are documented in docs/methodology/REGISTRY.md:L306-L352 and docs/methodology/REGISTRY.md:L368-L378, so I am not treating them as methodology defects.
  • CallawaySantAnna now only emits df_inference on the bare-cluster/no-survey branch in diff_diff/staggered.py:L2204-L2223, which removes the prior risk of HonestDiD overriding survey-backed denominator df.
  • The previous bootstrap coverage gap is addressed by tests/test_staggered.py:L4792-L4878.
  • Remaining issues are documentation-only: the generated guide shows the wrong argument order, and the new results/test docs still describe df_inference on survey-backed fits contrary to the implementation.
  • I could not run the test suite here because this environment is missing numpy.

Methodology

  • Severity: P3 informational. Impact: No unmitigated methodology defect remains from the prior review. df_inference is now gated to the bare-cluster/no-survey path in diff_diff/staggered.py:L2204-L2223, while bare-cluster CS event-study extraction still reads it in diff_diff/honest_did.py:L835-L853. Concrete fix: none.
  • Severity: P3 informational. Impact: The narrow vcov_type contract and synthesized PSU routing are explicitly documented in docs/methodology/REGISTRY.md:L306-L352 and docs/methodology/REGISTRY.md:L368-L378, so under the stated rubric they are not defects. Concrete fix: none.

Code Quality
No findings. The earlier positional-signature concern is fixed in code: vcov_type remains last in CallawaySantAnna.__init__ at diff_diff/staggered.py:L325-L342.

Performance
No findings.

Maintainability
No findings.

Tech Debt

  • Severity: P3 informational. Impact: The remaining Conley follow-up and broader standalone-estimator vcov_type threading are explicitly tracked in TODO.md:L102-L108, so the deferrals are properly recorded and non-blocking. Concrete fix: none.

Security
No findings.

Documentation/Tests

  • Severity: P3. Impact: diff_diff/guides/llms-full.txt:L186-L193 places vcov_type before n_bootstrap, but the real constructor keeps vcov_type at the end in diff_diff/staggered.py:L325-L342. That reintroduces the old positional-order confusion in the generated guide even though the code fixed it. Concrete fix: move vcov_type to the end of the pseudo-signature or mark it keyword-only there.
  • Severity: P3. Impact: diff_diff/staggered_results.py:L112-L121, diff_diff/staggered_results.py:L172-L180, and tests/test_staggered.py:L4706-L4739 still describe explicit survey-design fits as populating df_inference, but the implementation intentionally leaves df_inference=None whenever survey_metadata is present in diff_diff/staggered.py:L2204-L2223. Concrete fix: update those docstrings/comments to match the actual contract.
  • No further test-coverage finding. The prior bootstrap gap is now covered by tests/test_staggered.py:L4792-L4878.

Two P3 doc-only follow-ups:

1. llms-full.txt CS entry kept vcov_type before n_bootstrap in the
   pseudo-signature even after the constructor was reordered to append
   vcov_type at the end. Moved vcov_type to the end of the pseudo-
   signature so the generated guide matches the actual __init__ order
   (matches SA / WooldridgeDiD convention).

2. staggered_results.py class attribute docstring + inline field comment
   for df_inference still described it as populated on "Legitimate
   survey_design= fits" alongside survey_metadata. Updated to reflect
   the narrowed contract: df_inference is populated on the bare-cluster-
   synthesize path ONLY (where survey_metadata is None). When user
   provides survey_design=, df_inference stays None and the canonical
   df carrier is survey_metadata.df_survey. HonestDiD reads
   survey_metadata first, df_inference fallback.

3. test_explicit_survey_design_does_populate_survey_metadata docstring
   still said "+ df_inference also populated"; updated to match the
   body assertion (which already correctly asserts df_inference is None).

No source/test logic changes. All 80 targeted tests + test_honest_did
pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 2449842cfd598fa7ae47a86f810914183b1f93bc


Overall Assessment

⛔ Blocker — one unmitigated P0 finding.

Executive Summary

  • Prior re-review items look resolved: the guide signature is fixed, and df_inference is now narrowed to the bare-cluster/no-survey CS path.
  • The vcov_type={"hc1"} restriction and bare cluster=SurveyDesign(psu=...) synthesis are documented in the Methodology Registry, so I am not treating those choices themselves as defects.
  • Severity: P0 [Newly identified]. The new cluster wiring still does not make analytical group_time_effects[(g,t)] SEs cluster-robust. Aggregate/bootstrap surfaces are wired, but the public ATT(g,t) surface remains unit-level and then only gets a clustered d.f. attached afterward.
  • The new tests pin overall_se, bootstrap routing, and metadata/HonestDiD behavior, but they do not cover analytical ATT(g,t) SEs, which is why this residual no-op survived.
  • I could not run the tests locally because this environment is missing numpy, pandas, and pytest.

Methodology

  • Severity: P0 [Newly identified]. Impact: the new contract says cluster=X means CR1 Liang-Zeger on the IF (docs/methodology/REGISTRY.md:L326-L331, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L154-L167, diff_diff/guides/llms-full.txt:L186-L199), but analytical ATT(g,t) SEs are still computed with unit-level formulas. The no-cov vectorized path computes se = sqrt(sum(inf^2)) and then only passes df_survey into safe_inference_batch (diff_diff/staggered.py:L999-L1066); the covariate-reg fast path does the same (diff_diff/staggered.py:L1363-L1417); and the main panel/RCS fit loops just store those se_gt values and attach clustered d.f. (diff_diff/staggered.py:L1812-L1827, diff_diff/staggered.py:L1904-L1919). The shared ATT(g,t) helpers are still unit-level too (diff_diff/staggered.py:L2329-L2377, diff_diff/staggered.py:L2543-L2563, diff_diff/staggered.py:L2843-L2928, diff_diff/staggered.py:L3303-L3309). That yields silently misleading per-cell SE/t/p/CI whenever users inspect results.group_time_effects under analytical inference. Concrete fix: compute ATT(g,t) analytical variance from the per-cell IF using the same cluster/PSU-aware machinery as the aggregate path (panel: resolved_survey_unit; RCS: resolved_survey) and use that in all four per-cell code paths.
  • Severity: P3 informational. Impact: the hc1-only vcov_type contract and the synthetic-PSU routing are explicitly documented in docs/methodology/REGISTRY.md:L321-L352 and :368-L376, so I am not treating them as methodology defects. Callaway & Sant’Anna (2021) is ATT(g,t)-based and paired with a simple bootstrap procedure, which is consistent with the repo’s IF-based framing. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The TODO.md updates are proper tracking, not blockers.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new CS cluster tests cover overall_se, bootstrap routing, metadata, and HonestDiD d.f. propagation (tests/test_staggered.py:L4290-L4495, tests/test_staggered.py:L4622-L4878), but nothing asserts that analytical group_time_effects[(g,t)]["se"] changes under cluster= or matches explicit SurveyDesign(psu=...). That left the blocker above unpinned. Concrete fix: add panel and RCS regressions that compare at least one ATT(g,t) analytical SE under cluster=None vs cluster="state" and verify equality between bare cluster="state" and explicit SurveyDesign(psu="state").

Path to Approval

  1. Replace unit-level analytical ATT(g,t) variance with cluster/PSU-aware IF variance across _compute_att_gt_fast, _compute_all_att_gt_vectorized, _compute_all_att_gt_covariate_reg, and _compute_att_gt_rc (or their shared helpers), so cluster= actually changes the public ATT(g,t) SE surface.
  2. Recompute ATT(g,t) t_stat, p_value, and conf_int from those corrected SEs, not just from df_survey.
  3. Add targeted panel and repeated-cross-section regressions on results.group_time_effects[(g,t)] for bare cluster= and explicit SurveyDesign(psu=...).

CI codex R3 P0: the cluster wiring contract documented in REGISTRY.md
("cluster=X means CR1 Liang-Zeger on the IF") was honored at the
aggregate inference surface (overall_se, event-study, group, bootstrap)
but the per-cell public surface results.group_time_effects[(g,t)]["se"]
remained unit-level. Users inspecting per-cell ATT(g,t) inference under
cluster= got silently misleading SE/t/p/CI even though overall inference
was correctly cluster-robust.

Fix: new module-level _cluster_robust_se_from_per_gt_if helper that
aggregates the per-(g,t) IF by PSU and returns CR1 Liang-Zeger SE.
Applied at all 4 ATT(g,t) computation sites identified by the codex:

1. _compute_all_att_gt_vectorized (no-covariate vectorized batch) —
   recompute se after building inf_info, overwrite group_time_effects
   [(g,t)]["se"] which was set with the unit-level value
2. _compute_all_att_gt_covariate_reg (covariate-reg batch) — same pattern
3. Main panel single-cell loop (after _compute_att_gt_fast) — local
   se_gt update flows into gte_entry["se"]
4. RC fit loop (after _compute_att_gt_rc) — uses resolved_survey.psu
   (per-obs) instead of resolved_survey_unit.psu (per-unit)

The recompute is gated by `if psu is not None`, so cluster=None remains
bit-equal to pre-PR. For cluster=unit (each unit its own cluster), the
CR1 formula coincides with the unit-level IF formula (modulo ddof
conventions in the underlying OR path) — methodologically consistent
with Williams (2000) CR1-on-IF for IF-based estimators.

Tests:
- test_per_gt_analytical_se_changes_with_cluster: asserts at least one
  (g,t) cell shows measurable SE divergence between cluster=None and
  cluster="state" on a panel with intra-cluster correlation
- test_per_gt_se_matches_explicit_survey_design: asserts per-(g,t) SE
  agrees (rel=1e-10) between bare cluster="state" and explicit
  SurveyDesign(psu="state") — both activate the same CR1 aggregation

All 414 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did + test_two_stage) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 2b166cb0e9d434a64e031352b527afeadf445bf0


Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

  • Re-review scope: the prior blocker on clustered analytical ATT(g,t) surfaces looks functionally addressed. The PR now wires per-cell clustered SE updates into the vectorized, covariate-regression, general panel, and RCS CS paths, and adds targeted regression coverage in tests/test_staggered.py:L4832-L4910.
  • Severity P1 [Newly identified]: the new per-cell cluster helper is labeled and used as CR1 Liang-Zeger, but it computes sqrt(sum(psi_c^2)) and bypasses the shared PSU variance helper, omitting the documented CR1 finite-sample correction and shared PSU-meat semantics. See diff_diff/staggered.py:L95-L151 versus diff_diff/survey.py:L1407-L1425 and L2139-L2177, plus docs/methodology/REGISTRY.md:L328-L331 and L368-L370.
  • Impact: CallawaySantAnnaResults.group_time_effects[(g,t)] still understates clustered analytical SEs, so the corresponding t-stats, p-values, and CIs remain too optimistic relative to the library’s documented IF/CR1 contract, especially with few clusters.
  • The vcov_type={"hc1"} narrowing and bare cluster= synthesis are documented in REGISTRY.md, so I am not treating those choices themselves as defects.
  • I could not run the test suite locally because this environment is missing numpy; this review is static.

Methodology

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s changed analytical clustering path now reaches the public ATT(g,t) surface, but the new helper in diff_diff/staggered.py:L95-L151 does not implement the documented CR1/PSU-meat variance contract. It returns sqrt(sum(psi_per_cluster**2)) and is used from the per-cell paths at diff_diff/staggered.py:L1107-L1120, L1467-L1472, L1914-L1922, and L2022-L2030. The registry says hc1 with cluster=X is CR1 Liang-Zeger on the IF and is routed through _compute_stratified_psu_meat (docs/methodology/REGISTRY.md:L328-L331, L368-L370), while the shared survey helper applies the PSU centering / G/(G-1) logic and the common undefined-variance behavior (diff_diff/survey.py:L1407-L1425, L2139-L2177). As written, the per-cell clustered SEs are still methodologically mismatched to the documented contract. Concrete fix: build the per-cell phi vector and call compute_survey_if_variance() (or _compute_stratified_psu_meat() on a 1-column matrix) with the relevant resolved survey object, so ATT(g,t) SEs inherit the same G/(G-1), centering/FPC, and G<2 -> NaN behavior as the rest of CS clustered inference.
  • Severity: P3. Impact: the vcov_type="hc1"-only contract and the bare cluster= → synthesized SurveyDesign(psu=...) route are explicitly documented in docs/methodology/REGISTRY.md:L328-L331 and L368-L370. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The TODO.md additions properly track deferrable follow-up work and do not mitigate the P1 above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new CS tests prove that per-cell SEs now change under cluster= and that bare-cluster wiring matches explicit SurveyDesign(psu=...) wiring (tests/test_staggered.py:L4832-L4910), but they do not pin parity with the shared CR1/PSU variance helper or the documented finite-sample factor. That is why the remaining SE mismatch can survive with green tests. Concrete fix: add a deterministic few-cluster regression that compares one clustered ATT(g,t) SE against compute_survey_if_variance() (or the exact CR1 formula) on the same per-cell IF vector.

Path to Approval

  1. Replace the manual _cluster_robust_se_from_per_gt_if() formula with the same survey/PSU variance routine the registry points to, using the appropriate panel or RCS resolved survey object for each ATT(g,t) surface.
  2. Add a targeted regression that checks one clustered ATT(g,t) SE against the shared variance helper / exact CR1 formula under a small-G design, so the finite-sample correction is locked in.

…elper

CI codex R4 P1: the per-(g,t) cluster helper from R3 computed
sqrt(sum(psi_per_cluster^2)) — the basic IF aggregation — but the
documented CR1 contract in REGISTRY.md routes through
_compute_stratified_psu_meat, which adds:

- G/(G-1) finite-sample correction
- PSU centering (subtracting per-stratum mean)
- FPC handling
- lonely-PSU / G<2 → NaN behavior

Without these, per-(g,t) clustered SEs were methodologically
mismatched to the aggregate path and still understated SE on
few-cluster designs.

Fix: rewrite _cluster_robust_se_from_per_gt_if to build the per-(g,t)
psi vector from inf_info and route through compute_survey_if_variance
(survey.py:2139) which delegates to _compute_stratified_psu_meat. The
helper signature changes from taking a bare psu_array to taking the
full ResolvedSurveyDesign object — all 4 call sites updated. Per-cell
SEs now inherit the SAME design-based machinery as overall +
event-study + bootstrap inference.

For panel path: ``resolved_survey_unit`` (length n_units). For RCS
path: ``resolved_survey`` (length n_obs). The shared helper handles
both index spaces uniformly.

New regression test test_per_gt_se_matches_compute_survey_if_variance
_helper: hand-constructs a small-G (10 cluster) design + synthetic
inf_info, calls the helper, and asserts it equals direct
compute_survey_if_variance reconstruction at rel=0/abs=0. Pins that
the helper bypasses NO part of the shared machinery — any future
refactor that diverges from compute_survey_if_variance fails this
test. Closes CI codex R4 P1 + P2 findings.

All 331 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: dfd91dbb96e196334d2a5377f05574d33e22283b


Overall Assessment

⚠️ Needs changes — the prior methodology mismatch on clustered per-cell SE computation looks resolved, but one unmitigated P1 remains on the public ATT(g,t) clustered-SE surface.

Executive Summary

  • Re-review: the previous P1 is functionally addressed. _cluster_robust_se_from_per_gt_if() now routes through the shared survey variance helper, matching the REGISTRY’s documented CR1/PSU-meat contract. See diff_diff/staggered.py:L95-L165, diff_diff/survey.py:L2139-L2183, docs/methodology/REGISTRY.md:L326-L352, and tests/test_staggered.py:L4924-L4984.
  • Severity P1: when clustered design-based variance is undefined, the new helper returns None, and all four call sites silently keep the old unit-level per-cell SE instead of propagating NaN. See diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L162-L165, diff_diff/staggered.py:L1121-L1133, diff_diff/staggered.py:L1481-L1488, diff_diff/staggered.py:L1930-L1938, diff_diff/staggered.py:L2040-L2048.
  • Impact: group_time_effects[(g,t)]["se"] can remain finite under requested clustered inference even though the shared survey helper says clustered variance is unidentified. That is an undocumented variance mismatch on the public per-cell surface. See diff_diff/survey.py:L1407-L1413, diff_diff/survey.py:L2139-L2183.
  • The vcov_type={"hc1"} narrowing and bare-cluster= synthesis are documented in REGISTRY.md / TODO.md, so I am not treating those choices themselves as defects. See docs/methodology/REGISTRY.md:L306-L370 and TODO.md:L102-L108.
  • Static review only: I could not execute the tests here because this environment is missing numpy and pytest.

Methodology

  • Resolved prior finding: the per-cell clustered helper now uses compute_survey_if_variance() rather than a bespoke sqrt(sum psi_c^2) formula, which brings it back in line with the documented CR1/PSU-meat route. See diff_diff/staggered.py:L95-L165, diff_diff/survey.py:L2139-L2183, docs/methodology/REGISTRY.md:L326-L352.
  • Severity: P1. Impact: _cluster_robust_se_from_per_gt_if() returns None when compute_survey_if_variance() is NaN, and each caller interprets that as “leave the unit-level SE in place” rather than “clustered variance is undefined.” That yields a finite per-cell se on group_time_effects[(g,t)] under clustered inference, contrary to the shared survey variance contract and the library’s NaN-safe inference conventions. See diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L162-L165, diff_diff/staggered.py:L1121-L1133, diff_diff/staggered.py:L1481-L1488, diff_diff/staggered.py:L1930-L1938, diff_diff/staggered.py:L2040-L2048, diff_diff/survey.py:L1407-L1413, diff_diff/survey.py:L2139-L2183, and diff_diff/utils.py:L177-L205. Concrete fix: once a PSU-backed design is active, do not retain the estimator’s unit-level SE on helper failure; propagate se=np.nan (or raise) and let safe_inference() / safe_inference_batch() produce all-NaN inference for that cell.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: the new results docs and test comments say HonestDiD reads survey_metadata.df_survey first and falls back to df_inference, but the Callaway-Sant’Anna extraction branch still checks df_inference first. Current fit invariants make this harmless because df_inference is only populated when survey_metadata is absent, but the mismatch is a refactor trap. See diff_diff/staggered_results.py:L112-L126, tests/test_staggered.py:L4706-L4743, and diff_diff/honest_did.py:L835-L853. Concrete fix: either make the CS-specific extraction order match the docs/comments or rewrite the comments to document the mutual-exclusion invariant explicitly.

Tech Debt

  • No findings. The new TODO rows in TODO.md:L102-L108 appropriately track deferrable follow-up work and do not mitigate the P1 above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new regression coverage pins the finite-variance happy path and helper parity, but it does not exercise the compute_survey_if_variance() -> NaN branch, so the remaining P1 would pass CI. See tests/test_staggered.py:L4924-L4984. Concrete fix: add regressions for bare cluster= with one unique cluster, and for an explicitly survey-backed design whose clustered TSL variance is unidentified, asserting per-cell se, t_stat, p_value, and both CI bounds are all NaN.

Path to Approval

  1. Change all PSU-backed per-(g,t) paths in diff_diff/staggered.py so that failure to obtain a finite clustered variance never falls back to the preexisting unit-level SE; use np.nan (or a hard error) instead.
  2. Add targeted tests for the undefined clustered-variance branches, and assert the full per-cell inference surface is NaN-consistent under those designs.

…gnment

Three fixes:

1. P1: when compute_survey_if_variance returns NaN (clustered variance
   undefined — e.g., G<2, lonely-PSU removed all strata), the per-(g,t)
   helper returned None and all 4 call sites silently kept the
   unit-level SE. group_time_effects[(g,t)]["se"] could remain finite
   under requested clustered inference even though the shared survey
   helper said the variance was unidentified — violates the
   safe_inference NaN-consistent contract + feedback_no_silent_failures.

   Fix: change helper return contract:
   - float SE → use it
   - NaN     → propagate NaN through to safe_inference (NaNs out the
               full per-cell inference surface)
   - None    → malformed (negative variance), caller keeps unit-level

   All 4 call sites updated to drop the np.isfinite(se_cluster) guard —
   NaN now flows through and triggers the safe_inference NaN-consistent
   behavior. The full per-cell surface (se, t_stat, p_value, conf_int)
   becomes NaN when clustered variance is undefined.

2. P3: HonestDiD CS branch (honest_did.py:835) and dCDH branch
   (honest_did.py:1003) read df_inference first, falling back to
   survey_metadata.df_survey. The MPD branch (honest_did.py:654) was
   updated in R2 to prefer survey_metadata first. The CS + dCDH
   branches still had the old preference order, creating a refactor
   trap (mutual-exclusion invariant kept this harmless today, but the
   mismatch between docs/comments and code is a bug magnet). Updated
   both CS + dCDH branches via replace_all to mirror the MPD ordering:
   survey_metadata first, df_inference fallback.

3. P2: added test_per_gt_se_propagates_nan_when_cluster_variance_
   undefined regression. Forces G=1 (single-cluster design) so
   compute_survey_if_variance returns NaN, then asserts at least one
   (g,t) cell has NaN se AND that the full inference surface (se,
   t_stat, p_value, both CI bounds) is NaN-consistent. Pins the
   NaN-propagation contract end-to-end.

All 416 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did + test_two_stage) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0c763bb9eef6727bcf1a46de32157679766ba46c


Overall Assessment

✅ Looks good — re-review of the prior clustered-SE finding is favorable. I do not see any unmitigated P0/P1 issues in the changed code.

Executive Summary

  • Re-review focus: the prior P1 on clustered per-(g,t) SEs silently falling back to unit-level SEs appears resolved. _cluster_robust_se_from_per_gt_if() now returns NaN for unidentified clustered variance, and all four affected Callaway-Sant’Anna call sites propagate that through safe_inference / safe_inference_batch instead of retaining the old SE. diff_diff/staggered.py:L95-L178, diff_diff/staggered.py:L1134-L1168, diff_diff/staggered.py:L1496-L1538, diff_diff/staggered.py:L1946-L1965, diff_diff/staggered.py:L2052-L2077, diff_diff/utils.py:L177-L263
  • The bare-cluster= wiring is now consistent with the Methodology Registry’s documented synthesis: synthesize SurveyDesign(psu=cluster), reject replicate-weight + bare-cluster combinations fail-closed, and preserve the non-survey survey_metadata contract while surfacing cluster df via df_inference. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L1660-L1760, diff_diff/staggered.py:L2346-L2413
  • The HonestDiD consumer side is aligned with that results contract: it now prefers survey_metadata.df_survey and only falls back to df_inference when survey_metadata is absent. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L854, diff_diff/honest_did.py:L1004-L1023
  • The permanently narrow vcov_type={"hc1"} surface is documented in REGISTRY.md, tracked in TODO.md for deferred conley, and consistently enforced in constructor, fit-time revalidation, get_params(), and results metadata. I am treating that as a documented design choice, not a methodology defect. docs/methodology/REGISTRY.md:L306-L370, TODO.md:L102-L108, diff_diff/staggered.py:L4274-L4357, tests/test_staggered.py:L4498-L4567
  • Minor only: one private-helper docstring still describes the pre-fix fallback behavior even though the implementation now propagates NaN. diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L163-L177

Methodology

  • No unmitigated P0/P1 findings. The changed Callaway-Sant’Anna variance/cluster behavior matches the registry’s IF-based variance taxonomy and its documented SurveyDesign(psu=...) synthesis, and the prior per-cell clustered-SE mismatch is fixed in the changed analytical and RCS paths. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L95-L178, diff_diff/staggered.py:L1660-L1760
  • No methodology finding on vcov_type narrowing. The deviation is explicitly documented and tracked, so under the review rules it is informational rather than defective. docs/methodology/REGISTRY.md:L306-L370, TODO.md:L102-L108

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: _cluster_robust_se_from_per_gt_if()’s docstring still says compute_survey_if_variance() returning NaN leads to None and a fallback to unit-level SE, but the implementation now returns NaN specifically to avoid that fallback. Runtime behavior is correct, but the docstring now documents the opposite contract from the code and tests, which is a refactor trap. diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L163-L177, tests/test_staggered.py:L4985-L5032 Concrete fix: update the helper docstring’s return-contract paragraph so it says NaN is propagated to callers and only malformed inputs / invariant violations return None.

Tech Debt

  • No findings. The new TODO rows are appropriate tracking for deferred work and do not mitigate any correctness issue. TODO.md:L102-L108

Security

  • No findings.

Documentation/Tests

  • No findings. The added regressions cover the previously missing undefined-cluster-variance branch, the df_inference/survey_metadata split, HonestDiD df fallback, RCS cluster wiring, vcov_type revalidation, bootstrap PSU precedence, and the defensive TripleDifference regression. tests/test_staggered.py:L4323-L4458, tests/test_staggered.py:L4498-L4567, tests/test_staggered.py:L4622-L4794, tests/test_staggered.py:L4985-L5090, tests/test_triple_diff.py:L1100-L1141
  • Static review only: I could not execute the tests here because this environment does not have numpy, pandas, or pytest.

CI codex R6 P3: _cluster_robust_se_from_per_gt_if docstring still said
"compute_survey_if_variance returning NaN leads to None and fallback to
unit-level SE" — but R5 changed the contract: NaN now propagates so the
per-cell inference surface NaN-outs uniformly. Docstring documented
the opposite of the code, creating a refactor trap.

Rewrote the return-contract paragraph to enumerate the three explicit
return types (float SE / NaN / None) with the caller's contract for
each. Cross-references feedback_no_silent_failures for the rationale.

No code/test changes — docstring-only fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 82e7489b43d89c0e013148769ea1bdb1393acd82


Overall Assessment

✅ Looks good — re-review of the prior clustered-inference findings is favorable. I do not see any unmitigated P0/P1 issues in the changed code.

Executive Summary

  • The prior bare-cluster= silent no-op is resolved: CallawaySantAnna.fit() now synthesizes or injects PSU structure before survey validation, and the clustered path reaches analytical SEs, per-(g,t) SEs, RCS, and bootstrap routing consistently. diff_diff/staggered.py:L1661-L1769, diff_diff/staggered.py:L1946-L2090, diff_diff/staggered.py:L2358-L2425
  • HonestDiD now reads survey_metadata.df_survey first and only falls back to df_inference when survey_metadata is absent, which fixes the stale-df risk for explicit survey fits. diff_diff/honest_did.py:L654-L673
  • The permanently narrow vcov_type={"hc1"} surface is documented in REGISTRY.md and tracked in TODO.md; the constructor, fit(), and sklearn-style set_params mutation path now enforce that contract consistently. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L4286-L4368, TODO.md:L102-L107
  • The previous minor docstring mismatch is fixed: _cluster_robust_se_from_per_gt_if() now documents NaN propagation instead of unit-level fallback. diff_diff/staggered.py:L95-L190
  • Remaining issue is minor: the PR adds strong clustered-bootstrap coverage, but it still does not pin the undefined clustered-bootstrap branch (G<2) with a regression.
  • Static review only: I could not run pytest here because pytest is not installed in this environment.

Methodology

  • Severity: P3-informational. Impact: none. Cross-checking the primary source, Callaway-Sant’Anna inference is influence-function based and paired with a multiplier bootstrap, with a clustered-bootstrap modification discussed separately; that supports the PR’s hc1-only contract, and the CR1-on-IF SurveyDesign(psu=...) route is explicitly documented in the registry as a library synthesis rather than an undocumented deviation. Concrete fix: none. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L1661-L1769, diff_diff/staggered.py:L4286-L4328, TODO.md:L102-L107. citeturn0view0turn7find0turn5find0
  • No other methodology findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings beyond the follow-up rows the PR already adds/trims in TODO.md; those are appropriately tracked and are not blockers. TODO.md:L102-L108

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests cover analytical G<2 NaN propagation and clustered bootstrap precedence, but they do not pin the newly reachable bare-cluster= + bootstrap undefined-variance branch. That path returns zero PSU multipliers at diff_diff/bootstrap_utils.py:L557-L562 and depends on downstream zero-SE guards at diff_diff/bootstrap_utils.py:L365-L377 and diff_diff/bootstrap_utils.py:L472-L485 to keep the full inference surface NaN-consistent. A future refactor could regress that behavior without failing the current suite. Concrete fix: add a CallawaySantAnna(cluster="single_cluster", n_bootstrap=...) regression that asserts overall_*, per-(g,t), and requested aggregate bootstrap inference fields are all NaN when clustered bootstrap variance is unidentified. tests/test_staggered.py:L4796-L5090, diff_diff/bootstrap_utils.py:L365-L377, diff_diff/bootstrap_utils.py:L472-L485, diff_diff/bootstrap_utils.py:L557-L562

CI codex R7 P3 wanted a regression pinning that bare-cluster + bootstrap
NaN-propagates when clustered bootstrap variance is undefined (G=1
single-cluster, no within-PSU variability). Writing the test exposed a
real silent-failure gap: when n_psu < 2 in the PSU-multiplier-weights
bootstrap path at staggered_bootstrap.py:323-347, all multiplier draws
collapse to constants → BLAS produces ≈0 variance (e.g., 6.7e-16), not
NaN. Downstream safe_inference treats this as finite-but-tiny SE,
yielding near-infinite t-stats and ~0 p-values for inference that is
actually undefined.

Fix: capture `_bootstrap_cluster_variance_unidentified = len(psu_ids) < 2`
after PSU multiplier weight generation; emit a UserWarning explaining
the diagnosis; before returning CSBootstrapResults, NaN-out ALL
bootstrap inference fields (overall_se, gt_ses, event_study_ses,
group_effect_ses, plus their CIs / p-values; cband_crit_value=None).
The conditional dict construction guards against optional fields being
None (e.g., when no event-study aggregation was requested).

New test_bare_cluster_bootstrap_propagates_nan_when_g_less_than_2
pins the NaN contract end-to-end: fits CS with cluster="single_cluster"
(all units in 1 cluster) + n_bootstrap=99 + aggregate="event_study",
asserts overall_* + per-(g,t) + event-study are NaN-consistent (se,
t_stat, p_value, both CI bounds). Closes CI codex R7 P3.

All 417 tests (test_staggered + test_staggered_rc + test_triple_diff +
test_honest_did + test_two_stage) pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a8746b2d2ac9999408a3df4aab12d4d4b8b63402


Overall Assessment

✅ Looks good — this re-review is favorable. I do not see any unmitigated P0/P1 issues in the changed code. The prior clustered-inference concerns appear resolved. Static review only: I could not run pytest here because python -m pytest fails with No module named pytest.

Executive Summary

  • The bare-cluster= silent no-op looks fixed end-to-end: fit() now synthesizes or injects PSU structure before survey validation, and the clustered path reaches analytical per-(g,t) SEs, overall aggregation, RCS, and bootstrap routing consistently. diff_diff/staggered.py:L95-L190, diff_diff/staggered.py:L1661-L1785, diff_diff/staggered.py:L1948-L2089
  • The vcov_type={"hc1"} narrowing is documented in the Methodology Registry and TODO, and the constructor / fit / sklearn-style mutation path now enforce that contract consistently. docs/methodology/REGISTRY.md:L306-L352, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L4285-L4328, TODO.md:L102-L107
  • HonestDiD now prefers survey_metadata.df_survey and only falls back to df_inference, which resolves the stale-df risk for explicit survey fits while still supporting bare-cluster CS results. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L854, diff_diff/staggered.py:L2373-L2425
  • The prior clustered-bootstrap G<2 concern is now explicitly handled: bootstrap warns and NaN-outs inference surfaces, and the new regression suite pins that behavior. diff_diff/staggered_bootstrap.py:L323-L580, tests/test_staggered.py:L5034-L5090
  • The TripleDifference sidecar test is appropriately defensive and does not raise new methodology concerns. tests/test_triple_diff.py:L1100-L1140

Methodology

  • Severity: P3-informational. Impact: none. The two substantive methodology-facing changes in this PR, SurveyDesign(psu=cluster) synthesis for bare cluster= and the permanent vcov_type={"hc1"} contract, are explicitly documented in REGISTRY.md as IF-based variance routing rather than an undocumented deviation from Callaway-Sant’Anna. Concrete fix: none. docs/methodology/REGISTRY.md:L306-L352, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L1661-L1785, diff_diff/staggered.py:L4285-L4328

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: none. The remaining conley follow-up for CallawaySantAnna is explicitly tracked in TODO.md, so it is not a blocker under the repo’s deferred-work rules. Concrete fix: none. TODO.md:L102-L108

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new suite now covers overall clustered SE divergence, per-(g,t) clustered SEs, RCS, HonestDiD df fallback, bootstrap G<2, and the TripleDifference defensive case, but it still does not directly pin bare-cluster= through aggregate="group". _aggregate_by_group() has its own aggregation/inference path, so a regression isolated to grouped effects would currently escape the added tests. Concrete fix: add a CallawaySantAnna(cluster="state", aggregate="group") regression that asserts grouped SEs differ from cluster=None and match survey_design=SurveyDesign(psu="state"); optionally add a G=1 grouped-bootstrap NaN-consistency case. diff_diff/staggered_aggregation.py:L782-L860, tests/test_staggered.py:L4281-L5095
  • Severity: P3-informational. Impact: review validation depth is limited. I could not execute the added tests in this environment because pytest is not installed. Concrete fix: none for the PR itself; rely on CI for execution.

CI codex R8 P3: aggregate="group" has its own SE computation at
staggered_aggregation.py:782-860 (_aggregate_by_group) independent of
overall + event-study, so a regression isolated to grouped effects
would escape the existing tests. Added test that asserts:

1. Grouped SEs differ between cluster=None and cluster="state" on a
   panel with intra-cluster correlation (max diff > 1e-6)
2. Bare cluster="state" matches explicit SurveyDesign(psu="state") on
   the grouped surface (rel=1e-10, abs=1e-12) — confirms both routes
   activate the same CR1 aggregation through _aggregate_by_group

All 418 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: d0572ec74d44a5507a822d9c94626848cc68e833


Overall Assessment

✅ Looks good. I do not see any unmitigated P0/P1 issues in the changed code. On re-review, the prior clustered-inference concerns appear resolved, including the earlier aggregate="group" coverage gap. Static review only: I could not run pytest here because /usr/bin/python: No module named pytest.

Executive Summary

  • Bare cluster= is now wired before survey validation and reaches the intended analytical per-cell, aggregation, and bootstrap paths instead of remaining a silent no-op. diff_diff/staggered.py:L1128-L1187, diff_diff/staggered.py:L1495-L1551, diff_diff/staggered.py:L1661-L1778, diff_diff/staggered.py:L1938-L2089, diff_diff/staggered_bootstrap.py:L324-L599
  • The HonestDiD denominator-df precedence is fixed: explicit survey fits now use survey_metadata.df_survey first, with df_inference only as a fallback for bare-cluster CS fits. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L855, diff_diff/honest_did.py:L1004-L1023, diff_diff/staggered.py:L2373-L2425, diff_diff/staggered_results.py:L98-L126
  • The vcov_type={"hc1"} restriction and synthesized SurveyDesign(psu=cluster) routing are documented in the Methodology Registry/TODO, so they are informational rather than undocumented methodology deviations. docs/methodology/REGISTRY.md:L306-L378, TODO.md:L102-L107, diff_diff/staggered.py:L4285-L4368
  • The prior P3 test gap on grouped aggregation is addressed: the new suite now covers grouped aggregation, per-(g,t) clustered SE parity, RCS clustering, G<2 NaN propagation, HonestDiD df fallback, and PSU-precedence/bootstrap behavior. tests/test_staggered.py:L4281-L5180
  • The TripleDifference change is purely defensive regression coverage; I did not find a new methodology or correctness issue there. tests/test_triple_diff.py:L1100-L1140

Methodology

  • Severity: P3-informational. Impact: none. Concrete fix: none. The methodology-facing changes are documented rather than hidden deviations: the registry now classifies CallawaySantAnna as an IF-based variance estimator, documents the synthesized SurveyDesign(psu=cluster) routing, and records the permanently narrow vcov_type={"hc1"} contract. That is consistent with Callaway-Sant’Anna’s asymptotic-linear/influence-function inference and their multiplier-bootstrap discussion, including the paper’s note that clustered uncertainty is handled by drawing cluster-level multipliers when the number of clusters is large. diff_diff/staggered.py:L95-L190, diff_diff/staggered.py:L1661-L1778, diff_diff/staggered.py:L4285-L4328, docs/methodology/REGISTRY.md:L306-L378 (psantanna.com)
  • No unmitigated methodology defects found.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3-informational. Impact: none. Concrete fix: none. The remaining conley limitation is explicitly tracked in TODO.md, so it is not blocking under the repo’s deferred-work policy. TODO.md:L102-L107

Security

  • No findings.

Documentation/Tests

  • Severity: P3-informational. Impact: the prior grouped-aggregation coverage gap from the last review is closed; the new tests now cover grouped aggregation, clustered per-cell SEs, RCS clustering, G<2 NaN propagation, survey/non-survey metadata contracts, and HonestDiD df fallback. Concrete fix: none. tests/test_staggered.py:L4281-L5180, tests/test_triple_diff.py:L1100-L1140
  • Severity: P3-informational. Impact: local execution validation is limited in this environment. Concrete fix: none for the PR itself; rely on CI. I could not run python -m pytest ... here because pytest is not installed.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 23, 2026
@igerber igerber merged commit 44e5947 into main May 23, 2026
33 of 34 checks passed
@igerber igerber deleted the cs-vcov-type-phase1b-4 branch May 23, 2026 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant