Skip to content

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485

Merged
igerber merged 2 commits into
mainfrom
spillover-conley-twostagedid-wave-e3-parity
May 22, 2026
Merged

TwoStageDiD Wave E.3 parity: always-treated full-design retention#485
igerber merged 2 commits into
mainfrom
spillover-conley-twostagedid-wave-e3-parity

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 22, 2026

Summary

  • Mechanical transfer of the Wave E.3 invariant from SpilloverDiD (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) to TwoStageDiD. When the always-treated handler drops units from the OLS sample, the resolved survey design now retains its FULL-DOMAIN n_psu / n_strata / df_survey / strata / fpc / psu arrays. Per-cluster scores are zero-padded onto the full-domain unique-PSU list via new optional kwargs (score_pad_mask, cluster_ids_full) on _compute_gmm_variance; PSUs containing only always-treated rows get zero score rows but still count toward G_full.
  • Replicate-weight refit callback (_refit_ts) subsets each w_r via keep_mask.values before threading into stage-1/stage-2, mirroring the same subsetting the main fit applies to survey_weights (otherwise solve_ols rejects the length mismatch and compute_replicate_refit_variance swallows the ValueError so replicate inference NaNs out).
  • Cluster injection block sources cluster_ids_raw from FULL-DOMAIN data[cluster] (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against resolved_survey.strata stays length-aligned; df["_survey_cluster"] aligned via resolved_survey.psu[keep_mask.values].
  • Closes the parity follow-up tracked at TODO.md since PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482.

Methodology references (required if estimator / math changes)

  • Method name(s): TwoStageDiD (Gardner 2022) — survey variance under always-treated unit drop.
  • Paper / source link(s): Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R, §2.5 "Domains and subpopulations" (R survey::svyrecvar(subset()) zero-pad convention). Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.
  • Any intentional deviations from the source (and why): None new. The full-domain retention under always-treated drop is documented in REGISTRY.md TwoStageDiD section as "documented synthesis — Wave E.3 parity, full-domain survey design under always-treated drop" mirroring the SpilloverDiD Wave E.3 invariant (PR SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482) and the in-library precedents at imputation.py:2175-2183 (PreTrendsImputation) and prep.py:1401-1432 (DCDH cell variance).

Validation

  • Tests added/updated:
    • New TestTwoStageDiDWaveE3ParityAlwaysTreated in tests/test_two_stage.py (8 tests: no-always-treated baseline, full-domain df_survey preservation, full-domain n_psu reporting, per-cluster zero-pad mock-spy, subpopulation+always-treated composition, cluster-as-PSU+always-treated, no-survey path unchanged, PSU entirely-always-treated).
    • Strengthened test_two_stage_always_treated in tests/test_replicate_weight_expansion.py to assert finite overall_se / overall_p_value / overall_conf_int under always-treated + replicate-weight design.
    • New test_two_stage_always_treated_event_study_and_group_replicate in the same file: exercises event-study + group replicate refit branches end-to-end with aggregate="all", asserts finite se + p_value + conf_int on non-reference horizons and cohorts.
  • Backtest / simulation / notebook evidence (if applicable): N/A — no tutorial changes.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…n survey design

Mechanical transfer of the Wave E.3 SpilloverDiD invariant (PR #482, merge
24de906) to TwoStageDiD. When the always-treated handler drops units from
the OLS sample, the resolved survey design retains its FULL-DOMAIN n_psu /
n_strata / df_survey / strata / fpc / psu arrays instead of being subsetted
via replace(...). Per-cluster scores aggregate at fit-length then zero-pad
onto the full-domain unique-PSU list via two new optional kwargs on
_compute_gmm_variance: score_pad_mask and cluster_ids_full. PSUs containing
only always-treated rows get zero score rows but still count toward G_full
for n_psu / df_survey accounting.

Documented synthesis (library-convention adoption): adopts the canonical
R survey::svyrecvar(subset()) convention (Lumley 2010 §2.5), already
established at imputation.py:2175-2183 (PreTrendsImputation),
prep.py:1401-1432 (DCDH cell variance), and spillover.py (PR #482).

Implementation:
- diff_diff/two_stage.py: delete L1485-1525 design-subset block; promote
  keep_mask to fit()-level scope (always defined; defaults all-True);
  cluster injection sources cluster_ids_raw from FULL-DOMAIN data[cluster]
  (not post-drop df[cluster]) so _inject_cluster_as_psu's zip against
  resolved_survey.strata stays length-aligned; df["_survey_cluster"]
  aligned to post-drop length via resolved_survey.psu[keep_mask.values];
  _compute_gmm_variance + 3 inner _stage2_* methods gain score_pad_mask /
  cluster_ids_full kwargs; zero-pad expansion after per-cluster
  aggregation; strata/fpc obs_idx lookups use cluster_ids_full under
  padding; G < 2 unidentified gate fires on G_full when padding active.
- diff_diff/two_stage.py: _refit_ts callback subsets each replicate
  weight w_r via keep_mask.values before threading into _fit_untreated_model
  and _stage2_*, matching the keep_mask-subsetting applied to survey_weights
  in the main fit (otherwise solve_ols rejects the length mismatch and
  compute_replicate_refit_variance swallows the ValueError so replicate
  inference NaNs out).
- Always-treated warning text updated to reflect the new contract: weights
  are subsetted for OLS, design retained for variance.
- Replicate variance + always-treated: existing path now works correctly
  (score_pad_mask_arg stays None on _uses_replicate_ts paths; per-replicate
  refit handles resampling separately).

Tests (tests/test_two_stage.py):
- New TestTwoStageDiDWaveE3ParityAlwaysTreated class with 8 tests:
  (a) no-always-treated baseline, (b) full-domain df_survey preservation
  under drop, (c) full-domain n_psu reporting, (d) per-cluster zero-pad
  mock-spy on _compute_stratified_meat_from_psu_scores, (e) subpopulation
  + always-treated composition, (f) cluster-as-PSU + always-treated,
  (g) no-survey path unchanged, (h) PSU entirely-always-treated.

Tests (tests/test_replicate_weight_expansion.py):
- Strengthened test_two_stage_always_treated to assert finite overall_se /
  overall_p_value / overall_conf_int (was only asserting finite ATT,
  missing the replicate-SE regression class).
- New test_two_stage_always_treated_event_study_and_group_replicate
  exercising the event-study + group replicate refit branches end-to-end
  under always-treated drop with aggregate="all"; asserts finite se +
  p_value + conf_int on non-reference horizons and cohorts.

Docs:
- docs/methodology/REGISTRY.md: TwoStageDiD section gains "documented
  synthesis — Wave E.3 parity" note; SpilloverDiD Wave E.3 follow-up note
  updated to mark TwoStageDiD parity as shipped.
- CHANGELOG.md: Unreleased Added entry leading with documented-synthesis
  framing.
- TODO.md: drop parity follow-up row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • The estimator-side change is a documented Wave E.3 synthesis, not an undocumented methodology deviation: docs/methodology/REGISTRY.md:L1316-L1318 now explicitly records the full-domain survey-design retention rule, and the implementation in diff_diff/two_stage.py:L1460-L1704 and diff_diff/two_stage.py:L2958-L3084 matches that contract.
  • I did not find a P0/P1 issue in identification, weighting, variance/SE computation, or inference propagation for the changed TwoStageDiD paths.
  • The risky propagation points appear complete: the new padding kwargs are threaded through static/event-study/group stage-2 paths, and replicate refits now subset full-domain replicate weights before refitting in diff_diff/two_stage.py:L1788-L1897.
  • The main non-blocking issue is a small performance regression: the zero-pad branch now runs on every analytical survey fit, even when no always-treated rows were dropped.
  • The added coverage is good for the intended bugfix, but the new replicate tests use unseeded randomness.
  • I could not execute the test suite in this environment because runtime tooling/deps are missing; the test review below is static.

Methodology

No findings. The PR changes the survey-variance realization for TwoStageDiD, but that deviation is explicitly documented in docs/methodology/REGISTRY.md:L1317-L1317, so under the review rubric it is mitigated/P3-informational rather than a defect. The code preserves post-drop OLS estimation while zero-padding PSU scores onto the full-domain survey design, which is consistent with the documented Lumley-style domain-estimation convention implemented here at diff_diff/two_stage.py:L1498-L1512, diff_diff/two_stage.py:L1555-L1602, and diff_diff/two_stage.py:L2969-L3084.

Code Quality

No findings. The PR continues to use safe_inference() on the changed inference surfaces and does not introduce the banned inline t_stat = effect / se pattern; see diff_diff/two_stage.py:L1707-L1709, diff_diff/two_stage.py:L1907-L1929, diff_diff/two_stage.py:L2599-L2606, and diff_diff/two_stage.py:L2713-L2720.

Performance

  • Severity: P3
    Impact: score_pad_mask_arg is populated for every non-replicate survey fit because keep_mask defaults to all-True, so _compute_gmm_variance() always takes the zero-pad/copy branch even when no always-treated rows were removed. That adds avoidable np.unique/np.searchsorted work and full-size c_by_cluster / s2_by_cluster copies on the baseline survey path. See diff_diff/two_stage.py:L1464-L1467, diff_diff/two_stage.py:L1677-L1682, and diff_diff/two_stage.py:L2969-L3027.
    Concrete fix: Only pass score_pad_mask and cluster_ids_full when rows were actually dropped, e.g. gate on n_always_treated > 0 or not keep_mask.all().

Maintainability

No findings. Internal propagation of the new split-length contract looks complete across the changed call graph: diff_diff/two_stage.py:L1684-L1758, diff_diff/two_stage.py:L2377-L2388, diff_diff/two_stage.py:L2549-L2560, and diff_diff/two_stage.py:L2675-L2686.

Tech Debt

No findings. Removing the resolved TODO entry in TODO.md is consistent with the shipped code/docs/tests for this parity item.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: The new replicate tests add unseeded np.random.normal(...) draws at tests/test_replicate_weight_expansion.py:L395-L401 and tests/test_replicate_weight_expansion.py:L437-L440. The assertions are coarse enough that flakiness risk is low, but deterministic fixtures are preferable for regression coverage in inference code.
    Concrete fix: Seed a local generator (rng = np.random.default_rng(...)) and draw from it, or replace the perturbations with fixed values.

Execution note: I could not run pytest here because the environment lacks both pytest and core runtime deps such as numpy; I limited validation to static inspection and AST parsing of the changed Python files.

CI codex review P3 #1 (performance): _compute_gmm_variance was taking the
zero-pad branch (np.unique + np.searchsorted + full-size c_by_cluster /
s2_by_cluster copies) on EVERY analytical survey fit because keep_mask
defaults to all-True. Gate on `n_always_treated > 0` so baseline survey
fits with no drop pass `None / None` and take the bit-identical pre-PR
path.

CI codex review P3 #2 (test determinism): seed local np.random.default_rng
in the two new always-treated replicate-weight tests so always-treated
outcomes are deterministic across runs (was using global np.random.normal
which is non-deterministic at import time).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: db3f7fea1bec4cbeed0046ca8b6c73c1c1e9cc9b


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

Performance

  • No findings. The earlier always-on zero-pad/copy regression is resolved by activating the padding path only when always-treated rows were actually dropped in diff_diff/two_stage.py:L1667-L1687.

Maintainability

  • No findings. The registry note, in-code docstrings/comments, and the new targeted tests all describe the same Wave E.3 contract.

Tech Debt

  • No findings. Removing the shipped parity follow-up from TODO.md is consistent with the delivered code, docs, and tests.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 22, 2026
@igerber igerber merged commit fdf2ceb into main May 22, 2026
33 of 34 checks passed
@igerber igerber deleted the spillover-conley-twostagedid-wave-e3-parity branch May 22, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant