Skip to content

Phase 1b interstitial #2: TripleDifference vcov_type narrow contract#488

Merged
igerber merged 8 commits into
mainfrom
triple-diff-vcov-type-phase1b
May 24, 2026
Merged

Phase 1b interstitial #2: TripleDifference vcov_type narrow contract#488
igerber merged 8 commits into
mainfrom
triple-diff-vcov-type-phase1b

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 24, 2026

Summary

  • TripleDifference vcov_type input contract (Phase 1b interstitial Add fixed effects and absorb parameters to DifferenceInDifferences #2, permanently narrow to {\"hc1\"}). Mirrors CallawaySantAnna PR callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract #487 template. Analytical-sandwich families {classical, hc2, hc2_bm} and conley REJECTED at __init__ with methodology-rooted messages: TripleDifference uses influence-function-based variance per Ortiz-Villavicencio & Sant'Anna (2025) on the 3-pairwise-DiD decomposition inf = w3·IF_3 + w2·IF_2 - w1·IF_1; the analytical-sandwich families have no single design matrix for hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF.
  • vcov_type + cluster_name fields added to TripleDifferenceResults; summary() routes through shared _format_vcov_label: bare fits render \"HC1 heteroskedasticity-robust\", clustered fits render \"CR1 cluster-robust at <name>, G=<n>\" (actual algebra is Liang-Zeger CR1 on combined IF), survey-backed fits suppress the variance-estimator line (Survey Design block is the canonical surface).
  • cluster= + SurveyDesign(replicate_weights=[...]) raises NotImplementedError at fit() — replicate-weight variance ignores PSU/cluster entirely; honoring cluster= would silently have no effect on variance. Mirrors CS PR callaway-santanna: fix cluster= silent no-op + narrow vcov_type contract #487 guard.
  • cluster_name + n_clusters suppressed under survey designs so they don't misreport the raw cluster= argument when survey_design.psu overrides it.
  • _validate_vcov_type @staticmethod called from __init__ AND fit() (mirrors CS pattern) so sklearn-style set_params mutations can't silently propagate bad values to Results metadata.
  • All default behavior bit-equal across {dr, reg, ipw} × {default, cluster, survey, replicate-survey} paths.
  • 35 new tests in TestTripleDifferenceVcovType: 5-surface bit-equal matrix (default/cluster/TSL-survey/replicate-survey/fit-time-revalidation) + 8 introspection/convenience tests + 3 codex-driven additions (summary cluster label, summary survey suppression, cluster+replicate rejection).

Methodology references

  • Method: TripleDifference (DDD) — Ortiz-Villavicencio & Sant'Anna (2025), "Better Understanding Triple Differences Estimators", arXiv:2505.09942. IF-based variance: SE = std(w3·IF_3 + w2·IF_2 - w1·IF_1) / sqrt(n); CR1 Liang-Zeger on combined IF when cluster= set.
  • Intentional deviations from the source: None new. The vcov_type narrowing to {\"hc1\"} is a library-architectural surface restriction, not a methodological deviation — the paper's asymptotic theory is written around influence functions rather than a single hat-matrix-bearing regression. Existing documented deviations on the variance side (no Stata-style (n-1)/(n-p) finite-sample factor on the CR1 algebra because the IF has no design-matrix p in the OLS sense) are now explicitly called out in REGISTRY.md.

Validation

  • Tests added: tests/test_triple_diff.py::TestTripleDifferenceVcovType (35 tests; +477 LoC). Full file: 76 → 111 passing.
  • Tests still green: tests/test_methodology_triple_diff.py (44 passing, 63 skipped R-parity); tests/test_survey_phase3.py::TestTripleDifferenceSurvey (8 passing).
  • 8 rounds of /ai-review-local --backend codex iterated to clean (zero actionable findings). Each round caught a different consumer surface: R1 summary-label rendering, R2 cluster+survey metadata, R3+R4 REGISTRY wording, R5 replicate-weight coverage, R7 cluster+replicate silent no-op (real bug discovered by R5's added test). R8 verdict: ✅ Looks good.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

igerber and others added 8 commits May 24, 2026 10:16
Mirrors CS PR #487. TripleDifference uses IF-based variance per
Ortiz-Villavicencio & Sant'Anna (2025) on the 3-pairwise-DiD
decomposition; analytical-sandwich families {classical, hc2, hc2_bm}
have no single design matrix for hat-matrix leverage or BM Satterthwaite
DOF and are rejected at __init__ with methodology-rooted messages.
conley deferred to follow-up TODO row.

vcov_type='hc1' (default) preserves all current behavior bit-equally
across {dr, reg, ipw} on default / cluster= / survey_design= paths.
vcov_type field added to TripleDifferenceResults + to_dict + summary
("Variance estimator: hc1"). _validate_vcov_type @staticmethod called
from __init__ AND fit() so sklearn-style set_params mutations can't
silently propagate bad values to Results metadata.

24 new tests in TestTripleDifferenceVcovType: 5-surface contract
(default/cluster/survey bit-equal, __init__ rejection per family,
fit()-time revalidation) + 8 introspection/convenience-function tests.

REGISTRY IF-based taxonomy section now lists TripleDifference alongside
CallawaySantAnna in "Enforced today". CHANGELOG, TODO row strikes
(L107, L206), and new Conley deferral row added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex P2 (.claude/reviews/local-review-latest.md):
the unconditional "Variance estimator: hc1" line in summary() was
misleading for clustered fits (actual algebra is CR1 Liang-Zeger on
the combined IF, not the raw hc1 sandwich) and for survey-backed fits
(actual algebra is TSL on the combined IF).

Changes:
- Add cluster_name field to TripleDifferenceResults (mirrors CS PR #487)
- Thread self.cluster as cluster_name in fit() Results construction
- summary() routes through shared _format_vcov_label (results.py:49-89):
  - Bare: "HC1 heteroskedasticity-robust"
  - Cluster: "CR1 cluster-robust at <name>, G=<n>"
  - Survey: line suppressed (Survey Design block is the canonical surface)
- to_dict() carries cluster_name when set

Tests (3 new) close the codex P3 regression-coverage gap:
- test_summary_cluster_label_is_cr1_not_raw_hc1
- test_summary_no_variance_estimator_line_under_survey
- test_results_cluster_name_carries_through (+ to_dict assertion)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex R2 P2 (.claude/reviews/local-review-latest.md):
when survey_design.psu overrides a conflicting cluster= argument,
the Results cluster_name + n_clusters fields previously still recorded
the raw self.cluster (the ignored argument). That made to_dict() and
downstream reporting describe the wrong clustering level.

Fix mirrors the variance-estimator line suppression in summary(): on
survey-backed fits the Survey Design block (PSU/strata/df) is the
canonical surface, so cluster_name + n_clusters are suppressed
(set to None) rather than reporting potentially-overridden raw inputs.

Reviewer offered two equivalent options ("reflect effective PSU OR
suppress under survey-backed fits"); suppression is simpler and
consistent with the existing summary() architecture.

Tests:
- test_cluster_name_suppressed_under_survey_design: cluster='state'
  + survey_design(psu='psu') with conflicting partitions; resolver
  emits the documented UserWarning; Results carries None for both
  cluster_name and n_clusters; to_dict() doesn't leak; Survey Design
  block remains in summary()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex R3 P3 (.claude/reviews/local-review-latest.md):
the IF-based taxonomy section's "hc1 with cluster=X" bullet claimed
enforced estimators activate CR1 by synthesizing SurveyDesign(psu=X)
and routing through _compute_stratified_psu_meat. That's true for
CallawaySantAnna but not for TripleDifference, which computes the
algebraically equivalent CR1 directly from cluster-summed IFs inline
at triple_diff.py:1245-1259 (no SurveyDesign synthesis — the IF is
already in scope at the SE call site).

Reword the bullet to describe both activation paths and note they
produce the same numerical result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex R4 P3 (.claude/reviews/local-review-latest.md):
the post-taxonomy "documented synthesis" note still read as if
SurveyDesign(psu=...) synthesis were the universal clustered-hc1
route. After R3 reworded the per-estimator bullet, this generic note
needed the same treatment — split the activation path into the
CallawaySantAnna synthesis route and the TripleDifference inline-CR1
route. CR1 Liang-Zeger algebra on the IF is Williams (2000) /
Hansen (2007) in both cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex R5 P1 (.claude/reviews/local-review-latest.md):
the new vcov_type contract was previously only regression-tested on
the default, bare-cluster, and analytical-TSL survey paths. The
distinct replicate-variance branch in fit() at lines 727-747
(survey_design with replicate_weights + JK1/BRR/Fay/JKn/SDR) was
unverified, leaving a coverage hole on a supported inference branch.

Add _ddd_replicate_panel helper (mirrors the JK1 pattern from
tests/test_survey_phase6.py::test_triple_diff_replicate_all_methods
but uses default_rng for reproducibility) + parameterized
test_replicate_survey_hc1_bit_equal_baseline over reg/ipw/dr. Asserts:
- ATT/SE bit-equal between default and explicit vcov_type="hc1"
- results.vcov_type == "hc1" on the replicate branch
- summary() still suppresses raw "Variance estimator" line under
  survey (Survey Design block remains the canonical surface)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-merge-check audit caught a missing surface: diff_diff/guides/llms-full.txt
TripleDifference signature listing (lines 575-583) did not include the new
vcov_type parameter, even though the CallawaySantAnna entry at line 199 has
the exact analog line. The llms-full.txt is the AI-agent contract bundled in
the wheel + published on RTD; the signature listing should mirror the public
TripleDifference.__init__ signature.

Add the vcov_type line with the same "permanently narrow {hc1} per IF-based
variance" framing as CS, citing Ortiz-Villavicencio & Sant'Anna (2025).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses local codex R7 P1 (.claude/reviews/local-review-latest.md):
TripleDifference(cluster=X, survey_design=SurveyDesign(replicate_weights=...))
was a silent no-op — the replicate-variance branch in fit() computes
SE only via compute_replicate_if_variance(psi_rep, resolved_survey)
and never consumes PSU/cluster state. Adding cluster='state' produced
identical ATT/SE and no warnings relative to the no-cluster replicate
fit, while populating cluster_name/n_clusters on Results dishonestly.

Fail-closed per feedback_no_silent_failures: raise NotImplementedError
at fit() when both cluster= and survey_design.replicate_weights are
set. Message points to two valid alternatives: omit cluster= (replicate
weights encode design implicitly) or use a non-replicate survey design
with explicit strata/psu/fpc.

Mirrors CS PR #487 guard at staggered.py:1705-1719 (the audit that
caught it on CS originally noted "the other 7 Phase 1b estimators ...
handle bare cluster= correctly" — but that audit only checked whether
bare-cluster works at all, not the replicate-weight interaction).

Tests:
- test_cluster_plus_replicate_weights_rejected: parameterized over
  reg/ipw/dr, asserts NotImplementedError with "replicate-weight"
  in message

Docs:
- REGISTRY.md TripleDifference section gains a per-estimator Note
  on the cluster= + replicate-weight rejection
- CHANGELOG entry amended with the new rejection clause

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • TripleDifference is the only affected method; the PR narrows the public vcov_type surface and adds validation/results plumbing, but it does not change the underlying ATT or SE algebra in _core_ddd.
  • I did not find an undocumented methodology mismatch against the Methodology Registry.
  • Parameter propagation looks complete across constructor, get_params(), fit-time revalidation, convenience wrapper, results serialization, and summary output.
  • No new inline inference or partial-NaN anti-patterns were introduced; inference still routes through safe_inference.
  • Review limitation: I could not run pytest here because the environment is missing pytest and runtime deps such as pandas; verification was static only.

Methodology

  • Severity: P3. Impact: TripleDifference remains aligned with the repo’s documented IF-based variance contract. The new vcov_type={"hc1"} restriction, fit-time revalidation, and results metadata only formalize the already-existing IF / clustered-IF / survey-TSL paths; I do not see an undocumented estimator, weighting, variance, or assumption drift in the diff. Concrete fix: none. diff_diff/triple_diff.py:L543-L551, diff_diff/triple_diff.py:L619-L684, diff_diff/triple_diff.py:L781-L818, docs/methodology/REGISTRY.md:L321-L360, docs/methodology/REGISTRY.md:L2039-L2045. citeturn1view1
  • Severity: P3. Impact: The new cluster + replicate_weights rejection is a fail-closed guard against a silent no-op and is explicitly documented in REGISTRY.md, so it is informational rather than blocking. Concrete fix: none. diff_diff/triple_diff.py:L661-L684, docs/methodology/REGISTRY.md:L2045-L2045

Code Quality

  • No findings. The changed fit path still uses safe_inference, and the modified file does not introduce new inline t_stat = ... / se or partial-NaN inference logic. diff_diff/triple_diff.py:L750-L779

Performance

  • No findings. The change is validation/metadata/test surface only; no hot-path estimator math was expanded.

Maintainability

  • No findings. The new parameter is propagated consistently through constructor state, get_params(), fit-time validation, the convenience wrapper, to_dict(), and summary(). diff_diff/triple_diff.py:L516-L556, diff_diff/triple_diff.py:L619-L623, diff_diff/triple_diff.py:L781-L818, diff_diff/triple_diff.py:L2036-L2055, diff_diff/triple_diff.py:L2101-L2205

Tech Debt

  • Severity: P3. Impact: Deferred conley support is properly tracked in TODO.md, so the deferral is non-blocking under the project’s review policy. Concrete fix: none in this PR. diff_diff/triple_diff.py:L2017-L2027, TODO.md:L107-L112

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new tests cover the right surfaces for this change: default, cluster, survey, replicate-survey, fit-time revalidation, and results-summary metadata behavior. Concrete fix: none on the PR itself; rerun the targeted TripleDifference tests in CI or a normal dev environment with pytest installed. tests/test_triple_diff.py:L1206-L1618

@igerber igerber added ready-for-ci Triggers CI test workflows and removed ready-for-ci Triggers CI test workflows labels May 24, 2026
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented May 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3b6fa9bdebc1457f96cf1cac909c74cfdad0879f


Overall Assessment

Looks good. No unmitigated P0/P1 findings in the PR diff.

Executive Summary

  • TripleDifference is the only affected method; the PR narrows the public vcov_type surface and adds validation/results plumbing, but it does not change the underlying ATT or existing IF / CR1 / survey-TSL variance algebra.
  • The narrowed vcov_type={"hc1"} contract is consistent with the source paper's IF-based DDD formulation and the registry's documented IF-vs-sandwich split; I did not find an undocumented methodology deviation. (arxiv.org)
  • The new cluster + replicate_weights guard correctly fail-closes the only silent-no-op risk in the changed code path.
  • Parameter propagation is complete across constructor, fit-time revalidation, get_params(), the convenience wrapper, results serialization, and summary output.
  • The prior AI review had no P1+ findings, and I did not identify any new P1+ regressions.
  • I could not execute pytest here because pytest is not installed.

Methodology

  • Severity: P3. Impact: The permanent vcov_type={"hc1"} narrowing is consistent with the paper and is now explicitly documented in the Methodology Registry, so this is informational rather than a defect. The paper motivates DDD-specific RA/IPW/DR estimators, shows DDD is not generally just "two DiDs," and builds the asymptotic theory around influence-function / recentered-influence-function representations, with cluster-robust inference treated as a straightforward extension of that setup. Concrete fix: none. diff_diff/triple_diff.py:L407-L415, diff_diff/triple_diff.py:L543-L551, diff_diff/triple_diff.py:L1986-L2034, docs/methodology/REGISTRY.md:L321-L360, docs/methodology/REGISTRY.md:L2039-L2045 (arxiv.org)
  • Severity: P3. Impact: Rejecting cluster= with replicate-weight survey designs is a documented fail-closed guard against misleading variance metadata and a silent no-op. Concrete fix: none. diff_diff/triple_diff.py:L661-L684, docs/methodology/REGISTRY.md:L2045-L2045

Code Quality

  • No findings. The diff continues to route inference through safe_inference() and does not introduce inline t_stat = ... / se or partial-NaN patterns. diff_diff/triple_diff.py:L750-L822

Performance

  • No findings. The change is validation, metadata plumbing, and summary formatting only; the DDD estimation and IF construction paths are unchanged. diff_diff/triple_diff.py:L619-L684, diff_diff/triple_diff.py:L781-L822

Maintainability

  • No findings. vcov_type propagation is complete across constructor state, fit-time revalidation, get_params(), the convenience wrapper, results serialization, and summary rendering. diff_diff/triple_diff.py:L516-L556, diff_diff/triple_diff.py:L619-L623, diff_diff/triple_diff.py:L781-L822, diff_diff/triple_diff.py:L2036-L2055, diff_diff/triple_diff.py:L2101-L2205

Tech Debt

  • Severity: P3. Impact: conley remains an explicitly tracked follow-up in TODO.md, so the new rejection message is properly tracked and non-blocking under the project policy. Concrete fix: none in this PR. TODO.md:L107-L113

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: The new tests cover the key changed surfaces: default, cluster, survey, replicate-survey, fit-time revalidation, and results-summary metadata. Concrete fix: run tests/test_triple_diff.py -k 'TripleDifferenceVcovType or TripleDifferenceClusterDefensive' in CI or a normal dev environment; I could not run it here because pytest is not installed. tests/test_triple_diff.py:L1205-L1618

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 24, 2026
@igerber igerber merged commit 03b0b13 into main May 24, 2026
62 of 64 checks passed
@igerber igerber deleted the triple-diff-vcov-type-phase1b branch May 24, 2026 17:02
igerber added a commit that referenced this pull request May 25, 2026
Phase 1b interstitial #3 for ImputationDiD. Mirrors the CallawaySantAnna
(PR #487) + TripleDifference (PR #488) template for IF-based estimators:
vcov_type is permanently narrow to {"hc1"} because the per-unit influence
function aggregation (Borusyak-Jaravel-Spiess 2024 Theorem 3) has no
single design matrix on which hat-matrix leverage or Bell-McCaffrey
Satterthwaite DOF can be defined.

Source surface:
- diff_diff/imputation.py: vcov_type param + @staticmethod
  _validate_vcov_type + fit()-time revalidation +
  cluster+replicate-weights NotImplementedError guard +
  Results cluster_name/n_clusters resolution
- diff_diff/imputation_results.py: vcov_type/cluster_name/n_clusters
  fields + new to_dict() + variance-estimator line in summary() routing
  through shared _format_vcov_label helper
- diff_diff/imputation_bootstrap.py: dual-site n_clusters<2 /
  n_psu<2 NaN guards via new _build_nan_bootstrap_results helper
  (closes the BLAS-roundoff zero-SE class predicted to recur on
  IF-based estimators)

Tests: 34 new tests in TestImputationDiDVcovType covering default /
cluster / TSL-survey / replicate-survey bit-equality (parametrized over
aggregate modes), bootstrap × cluster + bootstrap × survey bit-equality,
fit()-time revalidation after set_params bypass, bootstrap n_psu<2 /
n_clusters<2 NaN propagation, pretrends bit-equality, and the full
introspection + safety-gate surface (8 tests).

Docs: REGISTRY.md (IF-based taxonomy + 4 new Notes), CHANGELOG.md,
TODO.md (row narrowed, Conley follow-up added), llms-full.txt
(vcov_type + pretrends signature drift fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request May 25, 2026
… 3 (Phase 1b interstitial #3)

Phase 1b interstitial #3 for ImputationDiD. Mirrors the CallawaySantAnna
(PR #487) + TripleDifference (PR #488) template for IF-based estimators:
vcov_type is permanently narrow to {"hc1"} because the per-unit influence
function aggregation (Borusyak-Jaravel-Spiess 2024 Theorem 3) has no
single design matrix on which hat-matrix leverage or Bell-McCaffrey
Satterthwaite DOF can be defined.

Source surface (diff_diff/):
- imputation.py: vcov_type param + @staticmethod _validate_vcov_type +
  fit()-time revalidation + cluster+replicate-weights NotImplementedError
  guard + Results metadata resolution (cluster_name=unit by default for
  the Theorem 3 unit-clustered IF variance; suppressed under ANY survey
  design — analytical OR replicate — because replicate variance ignores
  cluster/PSU entirely)
- imputation_results.py: vcov_type/cluster_name/n_clusters fields, new
  to_dict() method, summary() variance line via shared _format_vcov_label
  (default cluster=None renders "CR1 cluster-robust at <unit>, G=<n>";
  bootstrap fits suppress the analytical label and render
  "Inference method: bootstrap" instead, mirroring DiDResults.summary()
  gate at results.py:213-226)
- imputation_bootstrap.py: dual-site n_clusters<2 / n_psu<2 NaN guards
  via new _build_nan_bootstrap_results helper (closes the BLAS-roundoff
  zero-SE class predicted to recur on IF-based estimators)

Tests: 42 new tests in TestImputationDiDVcovType covering default /
cluster / TSL-survey / replicate-survey + bootstrap × cluster + bootstrap
× survey bit-equality (ALL parametrized over aggregate ∈ {None,
"event_study", "group"} with per-horizon and per-group SE override
branches pinned); fit()-time revalidation after set_params bypass;
bootstrap n_psu<2 / n_clusters<2 NaN propagation including coef_var NaN;
pretrends=True × vcov_type='hc1' × cluster bit-equality; introspection
(default attr, get_params, Results carries, to_dict, summary label
default+cluster+bootstrap-suppressed, cluster_name suppression under
both analytical AND replicate survey, fit-clone idempotence,
convenience function); input rejection on classical/hc2/hc2_bm/conley/
unknown with distinct methodology-keyword pins; cluster+replicate
rejection. Full pytest tests/test_imputation.py: 128 passed.

Docs:
- REGISTRY.md: IF-based taxonomy adds ImputationDiD to "Enforced today"
  tier; ImputationDiD section gains 4 new Notes (vcov_type contract,
  cluster+replicate fail-closed, bootstrap n<2 NaN, default unit-cluster
  CR1 rendering)
- CHANGELOG.md: [Unreleased] entry
- TODO.md: Phase 1b row narrowed to TwoStageDiD + EfficientDiD;
  ImputationDiD Conley follow-up row added
- guides/llms-full.txt: vcov_type + pretrends signature drift fix +
  shared staggered-results section advertises new variance metadata
  fields and to_dict()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant