callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract by igerber · Pull Request #487 · igerber/diff-diff

igerber · 2026-05-23T02:20:24Z

Summary

Phase 1b interstitial. Four tightly-coupled items in CallawaySantAnna and adjacent infrastructure:

Bug fix: bare CallawaySantAnna(cluster="state").fit(...) was a silent no-op — accepted at __init__, stored as self.cluster, returned by get_params(), but never consumed in the fit / aggregator / bootstrap pipeline. Users got per-unit IF inference believing they had cluster-robust SE; typically SE too small under intra-cluster correlation, no warning. Survey-PSU clustering via survey_design=SurveyDesign(psu="state") was NOT affected. Fix synthesizes a minimal SurveyDesign(psu=self.cluster, weight_type="pweight") when bare cluster= is set without an explicit survey design, threading through the existing PSU-meat aggregator + PSU-multiplier bootstrap. Three-branch wiring mirrors estimators.py:497-516; wiring inserted BEFORE _validate_unit_constant_survey so synthesized / injected designs catch movers on panel data.
Narrow vcov_type input contract: CallawaySantAnna(vcov_type=...) now accepts {"hc1"} only. {classical, hc2, hc2_bm, conley} REJECTED at __init__ with methodology-rooted messages. Rejection is library-architectural, not paper-prescribed: CS uses influence-function-based variance per Callaway & Sant'Anna (2021); per-(g,t) doubly-robust / IPW / outcome-regression structure has no single design matrix to compute hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF on. Narrow contract is permanent. _validate_vcov_type extracted as @staticmethod; called from both __init__ and fit() so sklearn-style set_params mutations are re-checked at use time.
Survey / non-survey contract via dedicated df_inference field: bare cluster= populates new Results.df_inference (cluster-level df for HonestDiD's t-critical-value selection) but leaves Results.survey_metadata as None — preserving the canonical "user provided a survey design" signal that DiagnosticReport (Bacon skip + 2x2 PT skip) and CallawaySantAnnaResults.summary() (survey block render) read. HonestDiD at both extraction sites prefers df_inference over survey_metadata.df_survey. Replicate-weight survey designs combined with bare cluster= raise NotImplementedError (replicate IF variance ignores PSU entirely).
TripleDifference defensive test: added test_cluster_changes_ses regression test asserting TripleDifference(cluster="state") produces SE differing from cluster=None. TripleDifference's bare-cluster code path at triple_diff.py:1245-1259 was already correct but lacked a positive regression test.

Audit verified the bare-cluster= no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare cluster= correctly.

Methodology references

Method name(s): CallawaySantAnna (Callaway & Sant'Anna 2021)
Paper / source link(s): https://doi.org/10.1016/j.jeconom.2020.12.001
Any intentional deviations from the source (and why):
- Narrow vcov_type to {"hc1"}: library-architectural, not a deviation from CS 2021 — analytical-sandwich variance families and Conley spatial-HAC don't compose with IF-based variance machinery. Documented in docs/methodology/REGISTRY.md "IF-based variance estimators vs analytical-sandwich estimators" subsection.
- SurveyDesign(psu=cluster) synthesis for bare cluster= is a documented synthesis (per CLAUDE.md "Documenting Deviations" — **Note:** label in REGISTRY.md): the synthesis itself is new wiring; the downstream PSU-meat machinery (_compute_stratified_psu_meat) is the established survey-side path; the CR1 Liang-Zeger algebra on IF is Williams (2000) / Hansen (2007).

Validation

Tests added/updated:
- tests/test_staggered.py — TestCallawaySantAnnaClusterWiring (8 tests: panel + RCS + survey precedence coverage), TestCallawaySantAnnaVcovTypeNarrowContract (10 tests: input contract + set_params re-validation), TestCallawaySantAnnaClusterSafetyGates (6 tests: mover validation + replicate reject + survey/non-survey contract + end-to-end HonestDiD integration)
- tests/test_triple_diff.py — TestTripleDifferenceClusterDefensive::test_cluster_changes_ses
- tests/test_staggered.py::TestCallawaySantAnnaClusterWiring::test_cluster_none_path_unchanged verifies cluster=None path is bit-equal to pre-PR (wiring guarded by if self.cluster is not None:)
- All 410 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass
Backtest / simulation / notebook evidence: N/A — no tutorial / notebook surface changes

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@staticmethod

Phase 1b interstitial PR addressing four tightly-coupled items in CallawaySantAnna and adjacent infrastructure: 1. Bug fix: bare CallawaySantAnna(cluster="state").fit(...) was a silent no-op — the parameter was accepted at __init__, stored as self.cluster, returned by get_params(), but never consumed in the fit / aggregator / bootstrap pipeline. The docstring at staggered.py:154-156 claimed "Defaults to unit-level clustering" but the aggregator at staggered_aggregation.py:193-213 computed per-unit IF variance regardless of self.cluster, and the bootstrap at staggered_bootstrap.py:323-347 drew per-unit multiplier weights regardless. Users who explicitly set cluster="state" got per-unit inference believing they had cluster-robust SE — typically SE too small under intra-cluster correlation, no warning. Survey-PSU clustering via survey_design=SurveyDesign(psu="state") was NOT affected and continued to cluster correctly via the existing _compute_stratified_psu_meat machinery. The fix synthesizes a minimal SurveyDesign(psu=self.cluster, weight_type="pweight") when bare cluster= is set without an explicit survey design, threading the synthesized PSU through the existing survey-PSU machinery (aggregator + bootstrap). Three-branch wiring mirrors estimators.py:497-516: - bare cluster + no survey → synthesize SurveyDesign(psu=cluster) - survey w/o PSU + cluster → inject cluster as PSU via _inject_cluster_as_psu, AND construct effective_survey_design = replace(survey_design, psu=cluster) so _validate_unit_constant_survey catches movers on panel data - survey w/ PSU + cluster → PSU wins; _resolve_effective_cluster emits UserWarning if partitions differ Wiring is inserted BEFORE the existing validators so synthesized / injected designs pass through unit-constancy + pweight checks. cluster_name on CallawaySantAnnaResults reflects the canonical PSU: survey_design.psu when explicit PSU is provided, self.cluster when bare cluster synthesizes / injects. 2. Narrow vcov_type input contract: CallawaySantAnna(vcov_type=...) now accepts {"hc1"} only at __init__. Analytical-sandwich families {classical, hc2, hc2_bm} and conley rejected with methodology-rooted messages. The rejection is library-architectural, not paper-prescribed: CS uses influence-function-based variance per Callaway & Sant'Anna (2021); per-(g,t) doubly-robust / IPW / outcome-regression structure has no single design matrix to compute hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF on. The narrow contract is permanent; the same surface applies to other IF-based estimators (ImputationDiD, EfficientDiD) when their PRs land. _validate_vcov_type extracted as @staticmethod; called from both __init__ and fit() so sklearn-style set_params mutations are re-checked at use time (preventing silent propagation of bad values to Results metadata). 3. Survey/non-survey contract separation via dedicated df_inference field: bare cluster= populates the new Results.df_inference field (carrying cluster-level df for downstream HonestDiD t-critical-value selection) but leaves Results.survey_metadata as None — preserving the canonical "user provided a survey design" signal that DiagnosticReport (diagnostic_report.py:848-856 Bacon skip + 1150-1158 2x2 PT skip) and CallawaySantAnnaResults.summary() (staggered_results.py:235-238 survey block render) read. Legitimate survey_design= fits populate both df_inference AND survey_metadata; the two surfaces agree (df_inference == survey_metadata.df_survey). HonestDiD at honest_did.py prefers df_inference over survey_metadata.df_survey at BOTH extraction sites (MPD branch + CS branch). Replicate-weight survey designs combined with bare cluster= raise NotImplementedError: replicate-weight variance is computed by replicate reweighting (BRR / Fay / JK1 / JKn / SDR) and ignores PSU/cluster entirely (survey.py:104-109 enforces replicate_weights are mutually exclusive with strata/psu/fpc). Honoring bare cluster= would silently have no effect on variance while populating cluster_name/n_clusters dishonestly. Fail-closed per feedback_no_silent_failures. vcov_type, cluster_name, n_clusters, and df_inference added to CallawaySantAnnaResults with full class-docstring attribute entries. 4. TripleDifference defensive test_cluster_changes_ses regression test: TripleDifference's bare-cluster code path (triple_diff.py:1245-1259) was already correct but lacked a positive regression test (only an error-handling test for missing cluster columns existed). Added defensive coverage on a fixed-seed panel with state-level random effects, mirroring tests/test_two_stage.py::test_cluster_changes_ses. REGISTRY.md adds the "IF-based variance estimators vs analytical-sandwich estimators" taxonomy subsection explaining the structural split (CS, ImputationDiD, EfficientDiD vs DiD, MPD, TWFE, SA, StackedDiD, WooldridgeDiD), plus CS variance-families + cluster-wiring documentation. CHANGELOG Fixed + Added bullets. TODO.md Phase 1b row updated (4 remaining estimators, CS path narrowed permanently); new CS conley and cluster=-deprecation follow-up rows. llms-full.txt CS entry updated with the new vcov_type + cluster behavior. Audit verified the cluster= silent no-op was CS-specific — the other 7 Phase 1b estimators (SunAbraham, StackedDiD, WooldridgeDiD, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD) handle bare cluster= correctly. Tests: 25 new tests across TestCallawaySantAnnaClusterWiring (8 panel + RCS + survey precedence coverage), TestCallawaySantAnnaVcovTypeNarrowContract (10 input contract + set_params re-validation), TestCallawaySantAnnaClusterSafetyGates (6 mover validation + replicate reject + survey/non-survey contract + end-to-end HonestDiD integration), and TestTripleDifferenceClusterDefensive (1 defensive SE-changes regression). All 326 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did) pass. cluster=None path is bit-equal to pre-PR (wiring guarded by if self.cluster is not None). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md

github-actions · 2026-05-23T02:30:14Z

Overall Assessment
⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

The new df_inference plumbing introduces a downstream inference bug: HonestDiD now prefers a field that CallawaySantAnna populates from the wrong df source on some survey-backed fits.
The vcov_type={"hc1"} restriction and the SurveyDesign(psu=cluster) synthesis are documented in the Methodology Registry, so I am not treating those as methodology defects.
vcov_type was inserted mid-__init__ signature, which breaks positional CallawaySantAnna(...) callers.
The added regressions cover the analytical cluster wiring well, but they do not pin the bootstrap path that the PR also changes.
I could not run the test suite in this environment because numpy is unavailable.

Methodology

Severity: P1. Impact: diff_diff/staggered.py (L646-L669) computes the actual survey df from the unit-collapsed design, and diff_diff/staggered.py (L1993-L1999) can tighten it further, but diff_diff/staggered.py (L2212-L2218) stores df_inference from the original resolved_survey.df_survey, and diff_diff/honest_did.py (L831-L849) now prefers that field unconditionally. On panel survey fits without an explicit PSU (weights/strata/fpc only), this can switch HonestDiD from the unit-level df actually used by CS to the larger observation-level df; on survey fits whose df was later tightened, it can override the corrected survey_metadata.df_survey. That makes HonestDiD CIs/p-values anti-conservative with no warning. Concrete fix: only populate/prefer df_inference for the bare-cluster= path that intentionally leaves survey_metadata=None, or derive it from the same final df object that backs the fitted results’ actual inference.
Severity: P3 informational. Impact: the permanent vcov_type="hc1" restriction and the bare-cluster= synthesis are documented in docs/methodology/REGISTRY.md (L306-L348) and docs/methodology/REGISTRY.md (L368-L378), so I am not treating those deviations as defects. Concrete fix: none.

Code Quality

Severity: P2. Impact: diff_diff/staggered.py (L325-L343) inserts vcov_type between cluster and n_bootstrap, unlike sibling estimators that append it after the existing positional tail, e.g. diff_diff/sun_abraham.py (L520-L530) and diff_diff/wooldridge.py (L323-L336). Any positional caller that previously passed n_bootstrap, bootstrap_weights, or later args positionally will now bind them incorrectly and fail at runtime. Concrete fix: preserve the old positional order and make vcov_type keyword-only, or append it at the end with a deprecation shim for old positional calls.

Performance
No new findings.

Maintainability
No new findings beyond the constructor-compatibility issue above.

Tech Debt

Severity: P3 informational. Impact: the PR’s deferrable follow-ups are properly tracked in TODO.md (L102-L108), so they do not block merge. Concrete fix: none.

Security
No findings.

Documentation/Tests

Severity: P3. Impact: the new cluster-wiring regression suite in tests/test_staggered.py (L4283-L4776) does not exercise the bootstrap path, even though the fix explicitly changes bootstrap routing via diff_diff/staggered_bootstrap.py (L318-L347). That leaves the bootstrap side of the original silent-no-op bug class unpinned. Concrete fix: add one positive regression with cluster="state", n_bootstrap>0 asserting clustered bootstrap SEs differ from cluster=None, plus one precedence case showing survey_design.psu still wins under bootstrap.

Path to Approval

Narrow the new df_inference contract so it only covers the bare-cluster=/no-survey case, or compute it from the same final df source that backs the fitted CS inference; update HonestDiD to prefer df_inference only when survey_metadata is absent.
Add a regression for panel SurveyDesign(weights=...) (no explicit PSU) + aggregate="event_study" + HonestDiD, and a second regression for a survey case whose final df is tightened, so the new precedence cannot silently overstate denominator df again.

…compat + bootstrap test coverage Three issues: 1. df_inference contract narrowing: prior plumbing populated df_inference from resolved_survey.df_survey for ALL cluster branches (synthesize, inject, conflict), and HonestDiD preferred df_inference unconditionally. But CS internally uses resolved_survey_unit.df_survey (unit-collapsed) for safe_inference and can tighten that further via overall_effective_df recompute at staggered.py:1995-1999. Preferring resolved_survey.df_survey in HonestDiD on panel survey fits without explicit PSU (weights/strata/ fpc only) could switch from the unit-level df actually used by CS to the larger observation-level df, making HonestDiD CIs/p-values anti- conservative without warning. Fix narrows df_inference to bare-cluster- synthesize path only (the case where survey_metadata is intentionally None to preserve the survey/non-survey contract for DiagnosticReport / summary). For inject/conflict branches, df_inference stays None and HonestDiD reads df_survey directly from survey_metadata (which carries the actual CS-internal df, including any post-recompute tightening). HonestDiD preference inverted: survey_metadata.df_survey first, df_inference fallback for bare-cluster fits. 2. Positional-arg compatibility: vcov_type was inserted between cluster and n_bootstrap in __init__, breaking positional callers who passed n_bootstrap or later args positionally (e.g., CallawaySantAnna("never_treated", 0, "dr", 0.05, None, 999) would silently bind 999 to vcov_type → ValueError at validation). Moved vcov_type to the end of __init__ (matches SA / WooldridgeDiD convention of appending new params). 3. Bootstrap path coverage: added two new tests in TestCallawaySantAnnaClusterSafetyGates: - test_bare_cluster_bootstrap_se_differs_from_unit_level pins that bare cluster= routes bootstrap through the PSU-multiplier-weights branch at staggered_bootstrap.py:323-347 (closes the bootstrap- side of the original silent-no-op bug class) - test_survey_design_psu_wins_under_bootstrap pins that explicit PSU wins under bootstrap when cluster= + survey_design.psu have different partitions (mirrors the analytical-sandwich precedence) Existing test_explicit_survey_design_does_populate_survey_metadata updated to assert df_inference IS None in the inject branch (matches new narrowed contract); the survey_metadata.df_survey assertion is retained. CHANGELOG Fixed bullet updated to describe the narrowed df_inference contract + the inverted HonestDiD preference order. All 328 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T10:37:32Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7c6146e54ae45fd0cd53416a38980a0dab50b862

Overall Assessment
✅ Looks good. Highest unmitigated severity: P3.

Executive Summary

This re-review resolves the prior substantive issues: the df_inference/HonestDiD bug is fixed in substance, the constructor keeps positional compatibility, and the bootstrap path now has explicit regressions.
The vcov_type={"hc1"} restriction and the bare-cluster= → SurveyDesign(psu=...) synthesis are documented in docs/methodology/REGISTRY.md:L306-L352 and docs/methodology/REGISTRY.md:L368-L378, so I am not treating them as methodology defects.
CallawaySantAnna now only emits df_inference on the bare-cluster/no-survey branch in diff_diff/staggered.py:L2204-L2223, which removes the prior risk of HonestDiD overriding survey-backed denominator df.
The previous bootstrap coverage gap is addressed by tests/test_staggered.py:L4792-L4878.
Remaining issues are documentation-only: the generated guide shows the wrong argument order, and the new results/test docs still describe df_inference on survey-backed fits contrary to the implementation.
I could not run the test suite here because this environment is missing numpy.

Methodology

Severity: P3 informational. Impact: No unmitigated methodology defect remains from the prior review. df_inference is now gated to the bare-cluster/no-survey path in diff_diff/staggered.py:L2204-L2223, while bare-cluster CS event-study extraction still reads it in diff_diff/honest_did.py:L835-L853. Concrete fix: none.
Severity: P3 informational. Impact: The narrow vcov_type contract and synthesized PSU routing are explicitly documented in docs/methodology/REGISTRY.md:L306-L352 and docs/methodology/REGISTRY.md:L368-L378, so under the stated rubric they are not defects. Concrete fix: none.

Code Quality
No findings. The earlier positional-signature concern is fixed in code: vcov_type remains last in CallawaySantAnna.__init__ at diff_diff/staggered.py:L325-L342.

Performance
No findings.

Maintainability
No findings.

Tech Debt

Severity: P3 informational. Impact: The remaining Conley follow-up and broader standalone-estimator vcov_type threading are explicitly tracked in TODO.md:L102-L108, so the deferrals are properly recorded and non-blocking. Concrete fix: none.

Security
No findings.

Documentation/Tests

Severity: P3. Impact: diff_diff/guides/llms-full.txt:L186-L193 places vcov_type before n_bootstrap, but the real constructor keeps vcov_type at the end in diff_diff/staggered.py:L325-L342. That reintroduces the old positional-order confusion in the generated guide even though the code fixed it. Concrete fix: move vcov_type to the end of the pseudo-signature or mark it keyword-only there.
Severity: P3. Impact: diff_diff/staggered_results.py:L112-L121, diff_diff/staggered_results.py:L172-L180, and tests/test_staggered.py:L4706-L4739 still describe explicit survey-design fits as populating df_inference, but the implementation intentionally leaves df_inference=None whenever survey_metadata is present in diff_diff/staggered.py:L2204-L2223. Concrete fix: update those docstrings/comments to match the actual contract.
No further test-coverage finding. The prior bootstrap gap is now covered by tests/test_staggered.py:L4792-L4878.

Two P3 doc-only follow-ups: 1. llms-full.txt CS entry kept vcov_type before n_bootstrap in the pseudo-signature even after the constructor was reordered to append vcov_type at the end. Moved vcov_type to the end of the pseudo- signature so the generated guide matches the actual __init__ order (matches SA / WooldridgeDiD convention). 2. staggered_results.py class attribute docstring + inline field comment for df_inference still described it as populated on "Legitimate survey_design= fits" alongside survey_metadata. Updated to reflect the narrowed contract: df_inference is populated on the bare-cluster- synthesize path ONLY (where survey_metadata is None). When user provides survey_design=, df_inference stays None and the canonical df carrier is survey_metadata.df_survey. HonestDiD reads survey_metadata first, df_inference fallback. 3. test_explicit_survey_design_does_populate_survey_metadata docstring still said "+ df_inference also populated"; updated to match the body assertion (which already correctly asserts df_inference is None). No source/test logic changes. All 80 targeted tests + test_honest_did pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T10:48:03Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2449842cfd598fa7ae47a86f810914183b1f93bc

Overall Assessment

⛔ Blocker — one unmitigated P0 finding.

Executive Summary

Prior re-review items look resolved: the guide signature is fixed, and df_inference is now narrowed to the bare-cluster/no-survey CS path.
The vcov_type={"hc1"} restriction and bare cluster= → SurveyDesign(psu=...) synthesis are documented in the Methodology Registry, so I am not treating those choices themselves as defects.
Severity: P0 [Newly identified]. The new cluster wiring still does not make analytical group_time_effects[(g,t)] SEs cluster-robust. Aggregate/bootstrap surfaces are wired, but the public ATT(g,t) surface remains unit-level and then only gets a clustered d.f. attached afterward.
The new tests pin overall_se, bootstrap routing, and metadata/HonestDiD behavior, but they do not cover analytical ATT(g,t) SEs, which is why this residual no-op survived.
I could not run the tests locally because this environment is missing numpy, pandas, and pytest.

Methodology

Severity: P0 [Newly identified]. Impact: the new contract says cluster=X means CR1 Liang-Zeger on the IF (docs/methodology/REGISTRY.md:L326-L331, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L154-L167, diff_diff/guides/llms-full.txt:L186-L199), but analytical ATT(g,t) SEs are still computed with unit-level formulas. The no-cov vectorized path computes se = sqrt(sum(inf^2)) and then only passes df_survey into safe_inference_batch (diff_diff/staggered.py:L999-L1066); the covariate-reg fast path does the same (diff_diff/staggered.py:L1363-L1417); and the main panel/RCS fit loops just store those se_gt values and attach clustered d.f. (diff_diff/staggered.py:L1812-L1827, diff_diff/staggered.py:L1904-L1919). The shared ATT(g,t) helpers are still unit-level too (diff_diff/staggered.py:L2329-L2377, diff_diff/staggered.py:L2543-L2563, diff_diff/staggered.py:L2843-L2928, diff_diff/staggered.py:L3303-L3309). That yields silently misleading per-cell SE/t/p/CI whenever users inspect results.group_time_effects under analytical inference. Concrete fix: compute ATT(g,t) analytical variance from the per-cell IF using the same cluster/PSU-aware machinery as the aggregate path (panel: resolved_survey_unit; RCS: resolved_survey) and use that in all four per-cell code paths.
Severity: P3 informational. Impact: the hc1-only vcov_type contract and the synthetic-PSU routing are explicitly documented in docs/methodology/REGISTRY.md:L321-L352 and :368-L376, so I am not treating them as methodology defects. Callaway & Sant’Anna (2021) is ATT(g,t)-based and paired with a simple bootstrap procedure, which is consistent with the repo’s IF-based framing. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The TODO.md updates are proper tracking, not blockers.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new CS cluster tests cover overall_se, bootstrap routing, metadata, and HonestDiD d.f. propagation (tests/test_staggered.py:L4290-L4495, tests/test_staggered.py:L4622-L4878), but nothing asserts that analytical group_time_effects[(g,t)]["se"] changes under cluster= or matches explicit SurveyDesign(psu=...). That left the blocker above unpinned. Concrete fix: add panel and RCS regressions that compare at least one ATT(g,t) analytical SE under cluster=None vs cluster="state" and verify equality between bare cluster="state" and explicit SurveyDesign(psu="state").

Path to Approval

Replace unit-level analytical ATT(g,t) variance with cluster/PSU-aware IF variance across _compute_att_gt_fast, _compute_all_att_gt_vectorized, _compute_all_att_gt_covariate_reg, and _compute_att_gt_rc (or their shared helpers), so cluster= actually changes the public ATT(g,t) SE surface.
Recompute ATT(g,t) t_stat, p_value, and conf_int from those corrected SEs, not just from df_survey.
Add targeted panel and repeated-cross-section regressions on results.group_time_effects[(g,t)] for bare cluster= and explicit SurveyDesign(psu=...).

CI codex R3 P0: the cluster wiring contract documented in REGISTRY.md ("cluster=X means CR1 Liang-Zeger on the IF") was honored at the aggregate inference surface (overall_se, event-study, group, bootstrap) but the per-cell public surface results.group_time_effects[(g,t)]["se"] remained unit-level. Users inspecting per-cell ATT(g,t) inference under cluster= got silently misleading SE/t/p/CI even though overall inference was correctly cluster-robust. Fix: new module-level _cluster_robust_se_from_per_gt_if helper that aggregates the per-(g,t) IF by PSU and returns CR1 Liang-Zeger SE. Applied at all 4 ATT(g,t) computation sites identified by the codex: 1. _compute_all_att_gt_vectorized (no-covariate vectorized batch) — recompute se after building inf_info, overwrite group_time_effects [(g,t)]["se"] which was set with the unit-level value 2. _compute_all_att_gt_covariate_reg (covariate-reg batch) — same pattern 3. Main panel single-cell loop (after _compute_att_gt_fast) — local se_gt update flows into gte_entry["se"] 4. RC fit loop (after _compute_att_gt_rc) — uses resolved_survey.psu (per-obs) instead of resolved_survey_unit.psu (per-unit) The recompute is gated by `if psu is not None`, so cluster=None remains bit-equal to pre-PR. For cluster=unit (each unit its own cluster), the CR1 formula coincides with the unit-level IF formula (modulo ddof conventions in the underlying OR path) — methodologically consistent with Williams (2000) CR1-on-IF for IF-based estimators. Tests: - test_per_gt_analytical_se_changes_with_cluster: asserts at least one (g,t) cell shows measurable SE divergence between cluster=None and cluster="state" on a panel with intra-cluster correlation - test_per_gt_se_matches_explicit_survey_design: asserts per-(g,t) SE agrees (rel=1e-10) between bare cluster="state" and explicit SurveyDesign(psu="state") — both activate the same CR1 aggregation All 414 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:04:39Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 2b166cb0e9d434a64e031352b527afeadf445bf0

Overall Assessment

⚠️ Needs changes — one unmitigated P1 finding.

Executive Summary

Re-review scope: the prior blocker on clustered analytical ATT(g,t) surfaces looks functionally addressed. The PR now wires per-cell clustered SE updates into the vectorized, covariate-regression, general panel, and RCS CS paths, and adds targeted regression coverage in tests/test_staggered.py:L4832-L4910.
Severity P1 [Newly identified]: the new per-cell cluster helper is labeled and used as CR1 Liang-Zeger, but it computes sqrt(sum(psi_c^2)) and bypasses the shared PSU variance helper, omitting the documented CR1 finite-sample correction and shared PSU-meat semantics. See diff_diff/staggered.py:L95-L151 versus diff_diff/survey.py:L1407-L1425 and L2139-L2177, plus docs/methodology/REGISTRY.md:L328-L331 and L368-L370.
Impact: CallawaySantAnnaResults.group_time_effects[(g,t)] still understates clustered analytical SEs, so the corresponding t-stats, p-values, and CIs remain too optimistic relative to the library’s documented IF/CR1 contract, especially with few clusters.
The vcov_type={"hc1"} narrowing and bare cluster= synthesis are documented in REGISTRY.md, so I am not treating those choices themselves as defects.
I could not run the test suite locally because this environment is missing numpy; this review is static.

Methodology

Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s changed analytical clustering path now reaches the public ATT(g,t) surface, but the new helper in diff_diff/staggered.py:L95-L151 does not implement the documented CR1/PSU-meat variance contract. It returns sqrt(sum(psi_per_cluster**2)) and is used from the per-cell paths at diff_diff/staggered.py:L1107-L1120, L1467-L1472, L1914-L1922, and L2022-L2030. The registry says hc1 with cluster=X is CR1 Liang-Zeger on the IF and is routed through _compute_stratified_psu_meat (docs/methodology/REGISTRY.md:L328-L331, L368-L370), while the shared survey helper applies the PSU centering / G/(G-1) logic and the common undefined-variance behavior (diff_diff/survey.py:L1407-L1425, L2139-L2177). As written, the per-cell clustered SEs are still methodologically mismatched to the documented contract. Concrete fix: build the per-cell phi vector and call compute_survey_if_variance() (or _compute_stratified_psu_meat() on a 1-column matrix) with the relevant resolved survey object, so ATT(g,t) SEs inherit the same G/(G-1), centering/FPC, and G<2 -> NaN behavior as the rest of CS clustered inference.
Severity: P3. Impact: the vcov_type="hc1"-only contract and the bare cluster= → synthesized SurveyDesign(psu=...) route are explicitly documented in docs/methodology/REGISTRY.md:L328-L331 and L368-L370. Concrete fix: none.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The TODO.md additions properly track deferrable follow-up work and do not mitigate the P1 above.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new CS tests prove that per-cell SEs now change under cluster= and that bare-cluster wiring matches explicit SurveyDesign(psu=...) wiring (tests/test_staggered.py:L4832-L4910), but they do not pin parity with the shared CR1/PSU variance helper or the documented finite-sample factor. That is why the remaining SE mismatch can survive with green tests. Concrete fix: add a deterministic few-cluster regression that compares one clustered ATT(g,t) SE against compute_survey_if_variance() (or the exact CR1 formula) on the same per-cell IF vector.

Path to Approval

Replace the manual _cluster_robust_se_from_per_gt_if() formula with the same survey/PSU variance routine the registry points to, using the appropriate panel or RCS resolved survey object for each ATT(g,t) surface.
Add a targeted regression that checks one clustered ATT(g,t) SE against the shared variance helper / exact CR1 formula under a small-G design, so the finite-sample correction is locked in.

…elper CI codex R4 P1: the per-(g,t) cluster helper from R3 computed sqrt(sum(psi_per_cluster^2)) — the basic IF aggregation — but the documented CR1 contract in REGISTRY.md routes through _compute_stratified_psu_meat, which adds: - G/(G-1) finite-sample correction - PSU centering (subtracting per-stratum mean) - FPC handling - lonely-PSU / G<2 → NaN behavior Without these, per-(g,t) clustered SEs were methodologically mismatched to the aggregate path and still understated SE on few-cluster designs. Fix: rewrite _cluster_robust_se_from_per_gt_if to build the per-(g,t) psi vector from inf_info and route through compute_survey_if_variance (survey.py:2139) which delegates to _compute_stratified_psu_meat. The helper signature changes from taking a bare psu_array to taking the full ResolvedSurveyDesign object — all 4 call sites updated. Per-cell SEs now inherit the SAME design-based machinery as overall + event-study + bootstrap inference. For panel path: ``resolved_survey_unit`` (length n_units). For RCS path: ``resolved_survey`` (length n_obs). The shared helper handles both index spaces uniformly. New regression test test_per_gt_se_matches_compute_survey_if_variance _helper: hand-constructs a small-G (10 cluster) design + synthetic inf_info, calls the helper, and asserts it equals direct compute_survey_if_variance reconstruction at rel=0/abs=0. Pins that the helper bypasses NO part of the shared machinery — any future refactor that diverges from compute_survey_if_variance fails this test. Closes CI codex R4 P1 + P2 findings. All 331 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:17:37Z

🔁 AI review rerun (requested by @igerber)

Head SHA: dfd91dbb96e196334d2a5377f05574d33e22283b

Overall Assessment

⚠️ Needs changes — the prior methodology mismatch on clustered per-cell SE computation looks resolved, but one unmitigated P1 remains on the public ATT(g,t) clustered-SE surface.

Executive Summary

Re-review: the previous P1 is functionally addressed. _cluster_robust_se_from_per_gt_if() now routes through the shared survey variance helper, matching the REGISTRY’s documented CR1/PSU-meat contract. See diff_diff/staggered.py:L95-L165, diff_diff/survey.py:L2139-L2183, docs/methodology/REGISTRY.md:L326-L352, and tests/test_staggered.py:L4924-L4984.
Severity P1: when clustered design-based variance is undefined, the new helper returns None, and all four call sites silently keep the old unit-level per-cell SE instead of propagating NaN. See diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L162-L165, diff_diff/staggered.py:L1121-L1133, diff_diff/staggered.py:L1481-L1488, diff_diff/staggered.py:L1930-L1938, diff_diff/staggered.py:L2040-L2048.
Impact: group_time_effects[(g,t)]["se"] can remain finite under requested clustered inference even though the shared survey helper says clustered variance is unidentified. That is an undocumented variance mismatch on the public per-cell surface. See diff_diff/survey.py:L1407-L1413, diff_diff/survey.py:L2139-L2183.
The vcov_type={"hc1"} narrowing and bare-cluster= synthesis are documented in REGISTRY.md / TODO.md, so I am not treating those choices themselves as defects. See docs/methodology/REGISTRY.md:L306-L370 and TODO.md:L102-L108.
Static review only: I could not execute the tests here because this environment is missing numpy and pytest.

Methodology

Resolved prior finding: the per-cell clustered helper now uses compute_survey_if_variance() rather than a bespoke sqrt(sum psi_c^2) formula, which brings it back in line with the documented CR1/PSU-meat route. See diff_diff/staggered.py:L95-L165, diff_diff/survey.py:L2139-L2183, docs/methodology/REGISTRY.md:L326-L352.
Severity: P1. Impact: _cluster_robust_se_from_per_gt_if() returns None when compute_survey_if_variance() is NaN, and each caller interprets that as “leave the unit-level SE in place” rather than “clustered variance is undefined.” That yields a finite per-cell se on group_time_effects[(g,t)] under clustered inference, contrary to the shared survey variance contract and the library’s NaN-safe inference conventions. See diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L162-L165, diff_diff/staggered.py:L1121-L1133, diff_diff/staggered.py:L1481-L1488, diff_diff/staggered.py:L1930-L1938, diff_diff/staggered.py:L2040-L2048, diff_diff/survey.py:L1407-L1413, diff_diff/survey.py:L2139-L2183, and diff_diff/utils.py:L177-L205. Concrete fix: once a PSU-backed design is active, do not retain the estimator’s unit-level SE on helper failure; propagate se=np.nan (or raise) and let safe_inference() / safe_inference_batch() produce all-NaN inference for that cell.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: the new results docs and test comments say HonestDiD reads survey_metadata.df_survey first and falls back to df_inference, but the Callaway-Sant’Anna extraction branch still checks df_inference first. Current fit invariants make this harmless because df_inference is only populated when survey_metadata is absent, but the mismatch is a refactor trap. See diff_diff/staggered_results.py:L112-L126, tests/test_staggered.py:L4706-L4743, and diff_diff/honest_did.py:L835-L853. Concrete fix: either make the CS-specific extraction order match the docs/comments or rewrite the comments to document the mutual-exclusion invariant explicitly.

Tech Debt

No findings. The new TODO rows in TODO.md:L102-L108 appropriately track deferrable follow-up work and do not mitigate the P1 above.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: the new regression coverage pins the finite-variance happy path and helper parity, but it does not exercise the compute_survey_if_variance() -> NaN branch, so the remaining P1 would pass CI. See tests/test_staggered.py:L4924-L4984. Concrete fix: add regressions for bare cluster= with one unique cluster, and for an explicitly survey-backed design whose clustered TSL variance is unidentified, asserting per-cell se, t_stat, p_value, and both CI bounds are all NaN.

Path to Approval

Change all PSU-backed per-(g,t) paths in diff_diff/staggered.py so that failure to obtain a finite clustered variance never falls back to the preexisting unit-level SE; use np.nan (or a hard error) instead.
Add targeted tests for the undefined clustered-variance branches, and assert the full per-cell inference surface is NaN-consistent under those designs.

…gnment Three fixes: 1. P1: when compute_survey_if_variance returns NaN (clustered variance undefined — e.g., G<2, lonely-PSU removed all strata), the per-(g,t) helper returned None and all 4 call sites silently kept the unit-level SE. group_time_effects[(g,t)]["se"] could remain finite under requested clustered inference even though the shared survey helper said the variance was unidentified — violates the safe_inference NaN-consistent contract + feedback_no_silent_failures. Fix: change helper return contract: - float SE → use it - NaN → propagate NaN through to safe_inference (NaNs out the full per-cell inference surface) - None → malformed (negative variance), caller keeps unit-level All 4 call sites updated to drop the np.isfinite(se_cluster) guard — NaN now flows through and triggers the safe_inference NaN-consistent behavior. The full per-cell surface (se, t_stat, p_value, conf_int) becomes NaN when clustered variance is undefined. 2. P3: HonestDiD CS branch (honest_did.py:835) and dCDH branch (honest_did.py:1003) read df_inference first, falling back to survey_metadata.df_survey. The MPD branch (honest_did.py:654) was updated in R2 to prefer survey_metadata first. The CS + dCDH branches still had the old preference order, creating a refactor trap (mutual-exclusion invariant kept this harmless today, but the mismatch between docs/comments and code is a bug magnet). Updated both CS + dCDH branches via replace_all to mirror the MPD ordering: survey_metadata first, df_inference fallback. 3. P2: added test_per_gt_se_propagates_nan_when_cluster_variance_ undefined regression. Forces G=1 (single-cluster design) so compute_survey_if_variance returns NaN, then asserts at least one (g,t) cell has NaN se AND that the full inference surface (se, t_stat, p_value, both CI bounds) is NaN-consistent. Pins the NaN-propagation contract end-to-end. All 416 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:28:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0c763bb9eef6727bcf1a46de32157679766ba46c

Overall Assessment

✅ Looks good — re-review of the prior clustered-SE finding is favorable. I do not see any unmitigated P0/P1 issues in the changed code.

Executive Summary

Re-review focus: the prior P1 on clustered per-(g,t) SEs silently falling back to unit-level SEs appears resolved. _cluster_robust_se_from_per_gt_if() now returns NaN for unidentified clustered variance, and all four affected Callaway-Sant’Anna call sites propagate that through safe_inference / safe_inference_batch instead of retaining the old SE. diff_diff/staggered.py:L95-L178, diff_diff/staggered.py:L1134-L1168, diff_diff/staggered.py:L1496-L1538, diff_diff/staggered.py:L1946-L1965, diff_diff/staggered.py:L2052-L2077, diff_diff/utils.py:L177-L263
The bare-cluster= wiring is now consistent with the Methodology Registry’s documented synthesis: synthesize SurveyDesign(psu=cluster), reject replicate-weight + bare-cluster combinations fail-closed, and preserve the non-survey survey_metadata contract while surfacing cluster df via df_inference. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L1660-L1760, diff_diff/staggered.py:L2346-L2413
The HonestDiD consumer side is aligned with that results contract: it now prefers survey_metadata.df_survey and only falls back to df_inference when survey_metadata is absent. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L854, diff_diff/honest_did.py:L1004-L1023
The permanently narrow vcov_type={"hc1"} surface is documented in REGISTRY.md, tracked in TODO.md for deferred conley, and consistently enforced in constructor, fit-time revalidation, get_params(), and results metadata. I am treating that as a documented design choice, not a methodology defect. docs/methodology/REGISTRY.md:L306-L370, TODO.md:L102-L108, diff_diff/staggered.py:L4274-L4357, tests/test_staggered.py:L4498-L4567
Minor only: one private-helper docstring still describes the pre-fix fallback behavior even though the implementation now propagates NaN. diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L163-L177

Methodology

No unmitigated P0/P1 findings. The changed Callaway-Sant’Anna variance/cluster behavior matches the registry’s IF-based variance taxonomy and its documented SurveyDesign(psu=...) synthesis, and the prior per-cell clustered-SE mismatch is fixed in the changed analytical and RCS paths. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L95-L178, diff_diff/staggered.py:L1660-L1760
No methodology finding on vcov_type narrowing. The deviation is explicitly documented and tracked, so under the review rules it is informational rather than defective. docs/methodology/REGISTRY.md:L306-L370, TODO.md:L102-L108

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: _cluster_robust_se_from_per_gt_if()’s docstring still says compute_survey_if_variance() returning NaN leads to None and a fallback to unit-level SE, but the implementation now returns NaN specifically to avoid that fallback. Runtime behavior is correct, but the docstring now documents the opposite contract from the code and tests, which is a refactor trap. diff_diff/staggered.py:L121-L125, diff_diff/staggered.py:L163-L177, tests/test_staggered.py:L4985-L5032 Concrete fix: update the helper docstring’s return-contract paragraph so it says NaN is propagated to callers and only malformed inputs / invariant violations return None.

Tech Debt

No findings. The new TODO rows are appropriate tracking for deferred work and do not mitigate any correctness issue. TODO.md:L102-L108

Security

No findings.

Documentation/Tests

No findings. The added regressions cover the previously missing undefined-cluster-variance branch, the df_inference/survey_metadata split, HonestDiD df fallback, RCS cluster wiring, vcov_type revalidation, bootstrap PSU precedence, and the defensive TripleDifference regression. tests/test_staggered.py:L4323-L4458, tests/test_staggered.py:L4498-L4567, tests/test_staggered.py:L4622-L4794, tests/test_staggered.py:L4985-L5090, tests/test_triple_diff.py:L1100-L1141
Static review only: I could not execute the tests here because this environment does not have numpy, pandas, or pytest.

CI codex R6 P3: _cluster_robust_se_from_per_gt_if docstring still said "compute_survey_if_variance returning NaN leads to None and fallback to unit-level SE" — but R5 changed the contract: NaN now propagates so the per-cell inference surface NaN-outs uniformly. Docstring documented the opposite of the code, creating a refactor trap. Rewrote the return-contract paragraph to enumerate the three explicit return types (float SE / NaN / None) with the caller's contract for each. Cross-references feedback_no_silent_failures for the rationale. No code/test changes — docstring-only fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:37:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 82e7489b43d89c0e013148769ea1bdb1393acd82

Overall Assessment

✅ Looks good — re-review of the prior clustered-inference findings is favorable. I do not see any unmitigated P0/P1 issues in the changed code.

Executive Summary

The prior bare-cluster= silent no-op is resolved: CallawaySantAnna.fit() now synthesizes or injects PSU structure before survey validation, and the clustered path reaches analytical SEs, per-(g,t) SEs, RCS, and bootstrap routing consistently. diff_diff/staggered.py:L1661-L1769, diff_diff/staggered.py:L1946-L2090, diff_diff/staggered.py:L2358-L2425
HonestDiD now reads survey_metadata.df_survey first and only falls back to df_inference when survey_metadata is absent, which fixes the stale-df risk for explicit survey fits. diff_diff/honest_did.py:L654-L673
The permanently narrow vcov_type={"hc1"} surface is documented in REGISTRY.md and tracked in TODO.md; the constructor, fit(), and sklearn-style set_params mutation path now enforce that contract consistently. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L4286-L4368, TODO.md:L102-L107
The previous minor docstring mismatch is fixed: _cluster_robust_se_from_per_gt_if() now documents NaN propagation instead of unit-level fallback. diff_diff/staggered.py:L95-L190
Remaining issue is minor: the PR adds strong clustered-bootstrap coverage, but it still does not pin the undefined clustered-bootstrap branch (G<2) with a regression.
Static review only: I could not run pytest here because pytest is not installed in this environment.

Methodology

Severity: P3-informational. Impact: none. Cross-checking the primary source, Callaway-Sant’Anna inference is influence-function based and paired with a multiplier bootstrap, with a clustered-bootstrap modification discussed separately; that supports the PR’s hc1-only contract, and the CR1-on-IF SurveyDesign(psu=...) route is explicitly documented in the registry as a library synthesis rather than an undocumented deviation. Concrete fix: none. docs/methodology/REGISTRY.md:L306-L378, diff_diff/staggered.py:L1661-L1769, diff_diff/staggered.py:L4286-L4328, TODO.md:L102-L107. citeturn0view0turn7find0turn5find0
No other methodology findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings beyond the follow-up rows the PR already adds/trims in TODO.md; those are appropriately tracked and are not blockers. TODO.md:L102-L108

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests cover analytical G<2 NaN propagation and clustered bootstrap precedence, but they do not pin the newly reachable bare-cluster= + bootstrap undefined-variance branch. That path returns zero PSU multipliers at diff_diff/bootstrap_utils.py:L557-L562 and depends on downstream zero-SE guards at diff_diff/bootstrap_utils.py:L365-L377 and diff_diff/bootstrap_utils.py:L472-L485 to keep the full inference surface NaN-consistent. A future refactor could regress that behavior without failing the current suite. Concrete fix: add a CallawaySantAnna(cluster="single_cluster", n_bootstrap=...) regression that asserts overall_*, per-(g,t), and requested aggregate bootstrap inference fields are all NaN when clustered bootstrap variance is unidentified. tests/test_staggered.py:L4796-L5090, diff_diff/bootstrap_utils.py:L365-L377, diff_diff/bootstrap_utils.py:L472-L485, diff_diff/bootstrap_utils.py:L557-L562

CI codex R7 P3 wanted a regression pinning that bare-cluster + bootstrap NaN-propagates when clustered bootstrap variance is undefined (G=1 single-cluster, no within-PSU variability). Writing the test exposed a real silent-failure gap: when n_psu < 2 in the PSU-multiplier-weights bootstrap path at staggered_bootstrap.py:323-347, all multiplier draws collapse to constants → BLAS produces ≈0 variance (e.g., 6.7e-16), not NaN. Downstream safe_inference treats this as finite-but-tiny SE, yielding near-infinite t-stats and ~0 p-values for inference that is actually undefined. Fix: capture `_bootstrap_cluster_variance_unidentified = len(psu_ids) < 2` after PSU multiplier weight generation; emit a UserWarning explaining the diagnosis; before returning CSBootstrapResults, NaN-out ALL bootstrap inference fields (overall_se, gt_ses, event_study_ses, group_effect_ses, plus their CIs / p-values; cband_crit_value=None). The conditional dict construction guards against optional fields being None (e.g., when no event-study aggregation was requested). New test_bare_cluster_bootstrap_propagates_nan_when_g_less_than_2 pins the NaN contract end-to-end: fits CS with cluster="single_cluster" (all units in 1 cluster) + n_bootstrap=99 + aggregate="event_study", asserts overall_* + per-(g,t) + event-study are NaN-consistent (se, t_stat, p_value, both CI bounds). Closes CI codex R7 P3. All 417 tests (test_staggered + test_staggered_rc + test_triple_diff + test_honest_did + test_two_stage) pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:51:17Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a8746b2d2ac9999408a3df4aab12d4d4b8b63402

Overall Assessment

✅ Looks good — this re-review is favorable. I do not see any unmitigated P0/P1 issues in the changed code. The prior clustered-inference concerns appear resolved. Static review only: I could not run pytest here because python -m pytest fails with No module named pytest.

Executive Summary

The bare-cluster= silent no-op looks fixed end-to-end: fit() now synthesizes or injects PSU structure before survey validation, and the clustered path reaches analytical per-(g,t) SEs, overall aggregation, RCS, and bootstrap routing consistently. diff_diff/staggered.py:L95-L190, diff_diff/staggered.py:L1661-L1785, diff_diff/staggered.py:L1948-L2089
The vcov_type={"hc1"} narrowing is documented in the Methodology Registry and TODO, and the constructor / fit / sklearn-style mutation path now enforce that contract consistently. docs/methodology/REGISTRY.md:L306-L352, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L4285-L4328, TODO.md:L102-L107
HonestDiD now prefers survey_metadata.df_survey and only falls back to df_inference, which resolves the stale-df risk for explicit survey fits while still supporting bare-cluster CS results. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L854, diff_diff/staggered.py:L2373-L2425
The prior clustered-bootstrap G<2 concern is now explicitly handled: bootstrap warns and NaN-outs inference surfaces, and the new regression suite pins that behavior. diff_diff/staggered_bootstrap.py:L323-L580, tests/test_staggered.py:L5034-L5090
The TripleDifference sidecar test is appropriately defensive and does not raise new methodology concerns. tests/test_triple_diff.py:L1100-L1140

Methodology

Severity: P3-informational. Impact: none. The two substantive methodology-facing changes in this PR, SurveyDesign(psu=cluster) synthesis for bare cluster= and the permanent vcov_type={"hc1"} contract, are explicitly documented in REGISTRY.md as IF-based variance routing rather than an undocumented deviation from Callaway-Sant’Anna. Concrete fix: none. docs/methodology/REGISTRY.md:L306-L352, docs/methodology/REGISTRY.md:L368-L378, diff_diff/staggered.py:L1661-L1785, diff_diff/staggered.py:L4285-L4328

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3-informational. Impact: none. The remaining conley follow-up for CallawaySantAnna is explicitly tracked in TODO.md, so it is not a blocker under the repo’s deferred-work rules. Concrete fix: none. TODO.md:L102-L108

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new suite now covers overall clustered SE divergence, per-(g,t) clustered SEs, RCS, HonestDiD df fallback, bootstrap G<2, and the TripleDifference defensive case, but it still does not directly pin bare-cluster= through aggregate="group". _aggregate_by_group() has its own aggregation/inference path, so a regression isolated to grouped effects would currently escape the added tests. Concrete fix: add a CallawaySantAnna(cluster="state", aggregate="group") regression that asserts grouped SEs differ from cluster=None and match survey_design=SurveyDesign(psu="state"); optionally add a G=1 grouped-bootstrap NaN-consistency case. diff_diff/staggered_aggregation.py:L782-L860, tests/test_staggered.py:L4281-L5095
Severity: P3-informational. Impact: review validation depth is limited. I could not execute the added tests in this environment because pytest is not installed. Concrete fix: none for the PR itself; rely on CI for execution.

CI codex R8 P3: aggregate="group" has its own SE computation at staggered_aggregation.py:782-860 (_aggregate_by_group) independent of overall + event-study, so a regression isolated to grouped effects would escape the existing tests. Added test that asserts: 1. Grouped SEs differ between cluster=None and cluster="state" on a panel with intra-cluster correlation (max diff > 1e-6) 2. Bare cluster="state" matches explicit SurveyDesign(psu="state") on the grouped surface (rel=1e-10, abs=1e-12) — confirms both routes activate the same CR1 aggregation through _aggregate_by_group All 418 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-23T11:59:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d0572ec74d44a5507a822d9c94626848cc68e833

Overall Assessment

✅ Looks good. I do not see any unmitigated P0/P1 issues in the changed code. On re-review, the prior clustered-inference concerns appear resolved, including the earlier aggregate="group" coverage gap. Static review only: I could not run pytest here because /usr/bin/python: No module named pytest.

Executive Summary

Bare cluster= is now wired before survey validation and reaches the intended analytical per-cell, aggregation, and bootstrap paths instead of remaining a silent no-op. diff_diff/staggered.py:L1128-L1187, diff_diff/staggered.py:L1495-L1551, diff_diff/staggered.py:L1661-L1778, diff_diff/staggered.py:L1938-L2089, diff_diff/staggered_bootstrap.py:L324-L599
The HonestDiD denominator-df precedence is fixed: explicit survey fits now use survey_metadata.df_survey first, with df_inference only as a fallback for bare-cluster CS fits. diff_diff/honest_did.py:L654-L673, diff_diff/honest_did.py:L835-L855, diff_diff/honest_did.py:L1004-L1023, diff_diff/staggered.py:L2373-L2425, diff_diff/staggered_results.py:L98-L126
The vcov_type={"hc1"} restriction and synthesized SurveyDesign(psu=cluster) routing are documented in the Methodology Registry/TODO, so they are informational rather than undocumented methodology deviations. docs/methodology/REGISTRY.md:L306-L378, TODO.md:L102-L107, diff_diff/staggered.py:L4285-L4368
The prior P3 test gap on grouped aggregation is addressed: the new suite now covers grouped aggregation, per-(g,t) clustered SE parity, RCS clustering, G<2 NaN propagation, HonestDiD df fallback, and PSU-precedence/bootstrap behavior. tests/test_staggered.py:L4281-L5180
The TripleDifference change is purely defensive regression coverage; I did not find a new methodology or correctness issue there. tests/test_triple_diff.py:L1100-L1140

Methodology

Severity: P3-informational. Impact: none. Concrete fix: none. The methodology-facing changes are documented rather than hidden deviations: the registry now classifies CallawaySantAnna as an IF-based variance estimator, documents the synthesized SurveyDesign(psu=cluster) routing, and records the permanently narrow vcov_type={"hc1"} contract. That is consistent with Callaway-Sant’Anna’s asymptotic-linear/influence-function inference and their multiplier-bootstrap discussion, including the paper’s note that clustered uncertainty is handled by drawing cluster-level multipliers when the number of clusters is large. diff_diff/staggered.py:L95-L190, diff_diff/staggered.py:L1661-L1778, diff_diff/staggered.py:L4285-L4328, docs/methodology/REGISTRY.md:L306-L378 (psantanna.com)
No unmitigated methodology defects found.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3-informational. Impact: none. Concrete fix: none. The remaining conley limitation is explicitly tracked in TODO.md, so it is not blocking under the repo’s deferred-work policy. TODO.md:L102-L107

Security

No findings.

Documentation/Tests

Severity: P3-informational. Impact: the prior grouped-aggregation coverage gap from the last review is closed; the new tests now cover grouped aggregation, clustered per-cell SEs, RCS clustering, G<2 NaN propagation, survey/non-survey metadata contracts, and HonestDiD df fallback. Concrete fix: none. tests/test_staggered.py:L4281-L5180, tests/test_triple_diff.py:L1100-L1140
Severity: P3-informational. Impact: local execution validation is limited in this environment. Concrete fix: none for the PR itself; rely on CI. I could not run python -m pytest ... here because pytest is not installed.

igerber and others added 2 commits May 22, 2026 22:16

Merge remote-tracking branch 'origin/main' into cs-vcov-type-phase1b-4

b0b65cd

# Conflicts: # CHANGELOG.md

igerber added the ready-for-ci Triggers CI test workflows label May 23, 2026

igerber merged commit 44e5947 into main May 23, 2026
33 of 34 checks passed

igerber deleted the cs-vcov-type-phase1b-4 branch May 23, 2026 17:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract#487

callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract#487
igerber merged 10 commits into
mainfrom
cs-vcov-type-phase1b-4

igerber commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 23, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant