TwoStageDiD Wave E.3 parity: always-treated full-design retention by igerber · Pull Request #485 · igerber/diff-diff

igerber · 2026-05-22T12:33:25Z

Summary

Mechanical transfer of the Wave E.3 invariant from SpilloverDiD (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) to TwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design now retains its FULL-DOMAIN n_psu / n_strata / df_survey / strata / fpc / psu arrays. Per-cluster scores are zero-padded onto the full-domain unique-PSU list via new optional kwargs (score_pad_mask, cluster_ids_full) on _compute_gmm_variance; PSUs containing only always-treated rows get zero score rows but still count toward G_full.
Replicate-weight refit callback (_refit_ts) subsets each w_r via keep_mask.values before threading into stage-1/stage-2, mirroring the same subsetting the main fit applies to survey_weights (otherwise solve_ols rejects the length mismatch and compute_replicate_refit_variance swallows the ValueError so replicate inference NaNs out).
Cluster injection block sources cluster_ids_raw from FULL-DOMAIN data[cluster] (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against resolved_survey.strata stays length-aligned; df["_survey_cluster"] aligned via resolved_survey.psu[keep_mask.values].
Closes the parity follow-up tracked at TODO.md since PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482.

Methodology references (required if estimator / math changes)

Method name(s): TwoStageDiD (Gardner 2022) — survey variance under always-treated unit drop.
Paper / source link(s): Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R, §2.5 "Domains and subpopulations" (R survey::svyrecvar(subset()) zero-pad convention). Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.
Any intentional deviations from the source (and why): None new. The full-domain retention under always-treated drop is documented in REGISTRY.md TwoStageDiD section as "documented synthesis — Wave E.3 parity, full-domain survey design under always-treated drop" mirroring the SpilloverDiD Wave E.3 invariant (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) and the in-library precedents at imputation.py:2175-2183 (PreTrendsImputation) and prep.py:1401-1432 (DCDH cell variance).

Validation

Tests added/updated:
- New TestTwoStageDiDWaveE3ParityAlwaysTreated in tests/test_two_stage.py (8 tests: no-always-treated baseline, full-domain df_survey preservation, full-domain n_psu reporting, per-cluster zero-pad mock-spy, subpopulation+always-treated composition, cluster-as-PSU+always-treated, no-survey path unchanged, PSU entirely-always-treated).
- Strengthened test_two_stage_always_treated in tests/test_replicate_weight_expansion.py to assert finite overall_se / overall_p_value / overall_conf_int under always-treated + replicate-weight design.
- New test_two_stage_always_treated_event_study_and_group_replicate in the same file: exercises event-study + group replicate refit branches end-to-end with aggregate="all", asserts finite se + p_value + conf_int on non-reference horizons and cohorts.
Backtest / simulation / notebook evidence (if applicable): N/A — no tutorial changes.

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…n survey design Mechanical transfer of the Wave E.3 SpilloverDiD invariant (PR #482, merge 24de906) to TwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design retains its FULL-DOMAIN n_psu / n_strata / df_survey / strata / fpc / psu arrays instead of being subsetted via replace(...). Per-cluster scores aggregate at fit-length then zero-pad onto the full-domain unique-PSU list via two new optional kwargs on _compute_gmm_variance: score_pad_mask and cluster_ids_full. PSUs containing only always-treated rows get zero score rows but still count toward G_full for n_psu / df_survey accounting. Documented synthesis (library-convention adoption): adopts the canonical R survey::svyrecvar(subset()) convention (Lumley 2010 §2.5), already established at imputation.py:2175-2183 (PreTrendsImputation), prep.py:1401-1432 (DCDH cell variance), and spillover.py (PR #482). Implementation: - diff_diff/two_stage.py: delete L1485-1525 design-subset block; promote keep_mask to fit()-level scope (always defined; defaults all-True); cluster injection sources cluster_ids_raw from FULL-DOMAIN data[cluster] (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against resolved_survey.strata stays length-aligned; df["_survey_cluster"] aligned to post-drop length via resolved_survey.psu[keep_mask.values]; _compute_gmm_variance + 3 inner _stage2_* methods gain score_pad_mask / cluster_ids_full kwargs; zero-pad expansion after per-cluster aggregation; strata/fpc obs_idx lookups use cluster_ids_full under padding; G < 2 unidentified gate fires on G_full when padding active. - diff_diff/two_stage.py: _refit_ts callback subsets each replicate weight w_r via keep_mask.values before threading into _fit_untreated_model and _stage2_*, matching the keep_mask-subsetting applied to survey_weights in the main fit (otherwise solve_ols rejects the length mismatch and compute_replicate_refit_variance swallows the ValueError so replicate inference NaNs out). - Always-treated warning text updated to reflect the new contract: weights are subsetted for OLS, design retained for variance. - Replicate variance + always-treated: existing path now works correctly (score_pad_mask_arg stays None on _uses_replicate_ts paths; per-replicate refit handles resampling separately). Tests (tests/test_two_stage.py): - New TestTwoStageDiDWaveE3ParityAlwaysTreated class with 8 tests: (a) no-always-treated baseline, (b) full-domain df_survey preservation under drop, (c) full-domain n_psu reporting, (d) per-cluster zero-pad mock-spy on _compute_stratified_meat_from_psu_scores, (e) subpopulation + always-treated composition, (f) cluster-as-PSU + always-treated, (g) no-survey path unchanged, (h) PSU entirely-always-treated. Tests (tests/test_replicate_weight_expansion.py): - Strengthened test_two_stage_always_treated to assert finite overall_se / overall_p_value / overall_conf_int (was only asserting finite ATT, missing the replicate-SE regression class). - New test_two_stage_always_treated_event_study_and_group_replicate exercising the event-study + group replicate refit branches end-to-end under always-treated drop with aggregate="all"; asserts finite se + p_value + conf_int on non-reference horizons and cohorts. Docs: - docs/methodology/REGISTRY.md: TwoStageDiD section gains "documented synthesis — Wave E.3 parity" note; SpilloverDiD Wave E.3 follow-up note updated to mark TwoStageDiD parity as shipped. - CHANGELOG.md: Unreleased Added entry leading with documented-synthesis framing. - TODO.md: drop parity follow-up row. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-22T12:40:31Z

Overall Assessment

✅ Looks good

Executive Summary

The estimator-side change is a documented Wave E.3 synthesis, not an undocumented methodology deviation: docs/methodology/REGISTRY.md:L1316-L1318 now explicitly records the full-domain survey-design retention rule, and the implementation in diff_diff/two_stage.py:L1460-L1704 and diff_diff/two_stage.py:L2958-L3084 matches that contract.
I did not find a P0/P1 issue in identification, weighting, variance/SE computation, or inference propagation for the changed TwoStageDiD paths.
The risky propagation points appear complete: the new padding kwargs are threaded through static/event-study/group stage-2 paths, and replicate refits now subset full-domain replicate weights before refitting in diff_diff/two_stage.py:L1788-L1897.
The main non-blocking issue is a small performance regression: the zero-pad branch now runs on every analytical survey fit, even when no always-treated rows were dropped.
The added coverage is good for the intended bugfix, but the new replicate tests use unseeded randomness.
I could not execute the test suite in this environment because runtime tooling/deps are missing; the test review below is static.

Methodology

No findings. The PR changes the survey-variance realization for TwoStageDiD, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1317-L1317, so under the review rubric it is mitigated/P3-informational rather than a defect. The code preserves post-drop OLS estimation while zero-padding PSU scores onto the full-domain survey design, which is consistent with the documented Lumley-style domain-estimation convention implemented here at diff_diff/two_stage.py:L1498-L1512, diff_diff/two_stage.py:L1555-L1602, and diff_diff/two_stage.py:L2969-L3084.

Code Quality

No findings. The PR continues to use safe_inference() on the changed inference surfaces and does not introduce the banned inline t_stat = effect / se pattern; see diff_diff/two_stage.py:L1707-L1709, diff_diff/two_stage.py:L1907-L1929, diff_diff/two_stage.py:L2599-L2606, and diff_diff/two_stage.py:L2713-L2720.

Performance

Severity: P3
Impact: score_pad_mask_arg is populated for every non-replicate survey fit because keep_mask defaults to all-True, so _compute_gmm_variance() always takes the zero-pad/copy branch even when no always-treated rows were removed. That adds avoidable np.unique/np.searchsorted work and full-size c_by_cluster / s2_by_cluster copies on the baseline survey path. See diff_diff/two_stage.py:L1464-L1467, diff_diff/two_stage.py:L1677-L1682, and diff_diff/two_stage.py:L2969-L3027.
Concrete fix: Only pass score_pad_mask and cluster_ids_full when rows were actually dropped, e.g. gate on n_always_treated > 0 or not keep_mask.all().

Maintainability

No findings. Internal propagation of the new split-length contract looks complete across the changed call graph: diff_diff/two_stage.py:L1684-L1758, diff_diff/two_stage.py:L2377-L2388, diff_diff/two_stage.py:L2549-L2560, and diff_diff/two_stage.py:L2675-L2686.

Tech Debt

No findings. Removing the resolved TODO entry in TODO.md is consistent with the shipped code/docs/tests for this parity item.

Security

No findings.

Documentation/Tests

Severity: P3
Impact: The new replicate tests add unseeded np.random.normal(...) draws at tests/test_replicate_weight_expansion.py:L395-L401 and tests/test_replicate_weight_expansion.py:L437-L440. The assertions are coarse enough that flakiness risk is low, but deterministic fixtures are preferable for regression coverage in inference code.
Concrete fix: Seed a local generator (rng = np.random.default_rng(...)) and draw from it, or replace the perturbations with fixed values.

Execution note: I could not run pytest here because the environment lacks both pytest and core runtime deps such as numpy; I limited validation to static inspection and AST parsing of the changed Python files.

CI codex review P3 #1 (performance): _compute_gmm_variance was taking the zero-pad branch (np.unique + np.searchsorted + full-size c_by_cluster / s2_by_cluster copies) on EVERY analytical survey fit because keep_mask defaults to all-True. Gate on `n_always_treated > 0` so baseline survey fits with no drop pass `None / None` and take the bit-identical pre-PR path. CI codex review P3 #2 (test determinism): seed local np.random.default_rng in the two new always-treated replicate-weight tests so always-treated outcomes are deterministic across runs (was using global np.random.normal which is non-deterministic at import time). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-22T13:03:19Z

🔁 AI review rerun (requested by @igerber)

Head SHA: db3f7fea1bec4cbeed0046ca8b6c73c1c1e9cc9b

Overall Assessment

✅ Looks good

Executive Summary

The methodology-affecting change is on TwoStageDiD survey variance under always-treated unit drop, and it is explicitly documented in docs/methodology/REGISTRY.md:L1316-L1318, so this is a documented synthesis rather than an undocumented deviation.
The implementation matches that contract across the fit-time survey/cluster handling in diff_diff/two_stage.py:L1459-L1768 and the zero-pad PSU-score expansion in diff_diff/two_stage.py:L2770-L3075.
The replicate-weight fix is correctly applied at the refit boundary: full-domain replicate weights are subset with the same keep_mask before stage 1 / stage 2 refits in diff_diff/two_stage.py:L1793-L1898, covering overall, event-study, and group surfaces.
The prior re-review P3 items are addressed: the zero-pad branch is now gated to actual always-treated drops in diff_diff/two_stage.py:L1667-L1687, and the new replicate tests use seeded RNGs in tests/test_replicate_weight_expansion.py:L393-L405 and tests/test_replicate_weight_expansion.py:L437-L447.
I did not find any unmitigated P0/P1 issues in the changed code. Validation here was static only because this environment lacks pytest and numpy.

Methodology

Severity: P3. Impact: No action required. The affected method is TwoStageDiD survey-variance handling when always-treated units are dropped. The new full-domain-retention rule is explicitly documented in docs/methodology/REGISTRY.md:L1316-L1318, and the implementation follows that contract in diff_diff/two_stage.py:L1459-L1768 and diff_diff/two_stage.py:L2963-L3075. I did not find an undocumented mismatch in identification, weighting, or survey df / PSU accounting. Concrete fix: None.

Code Quality

No findings. The new private kwargs are propagated through static, event-study, and group paths, and the replicate refit now applies the same post-drop mask before all changed stage-2 surfaces in diff_diff/two_stage.py:L1690-L1768 and diff_diff/two_stage.py:L1793-L1898.

Performance

No findings. The earlier always-on zero-pad/copy regression is resolved by activating the padding path only when always-treated rows were actually dropped in diff_diff/two_stage.py:L1667-L1687.

Maintainability

No findings. The registry note, in-code docstrings/comments, and the new targeted tests all describe the same Wave E.3 contract.

Tech Debt

No findings. Removing the shipped parity follow-up from TODO.md is consistent with the delivered code, docs, and tests.

Security

No findings.

Documentation/Tests

No findings. Coverage is materially stronger in tests/test_two_stage.py:L1549-L1912 and tests/test_replicate_weight_expansion.py:L380-L493, and the prior unseeded-randomness concern is resolved.
Static-review limitation: I could not execute the tests here because the environment does not have pytest or numpy installed.

igerber added the ready-for-ci Triggers CI test workflows label May 22, 2026

igerber merged commit fdf2ceb into main May 22, 2026
33 of 34 checks passed

igerber deleted the spillover-conley-twostagedid-wave-e3-parity branch May 22, 2026 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485
igerber merged 2 commits into
mainfrom
spillover-conley-twostagedid-wave-e3-parity

igerber commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 22, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 22, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 22, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant