Skip to content

Fix flaky TestExpandHoldsNoAccessExclusiveLock#66

Merged
iarunsaragadam merged 2 commits into
mainfrom
fix/flaky-expand-lock-test
Jun 23, 2026
Merged

Fix flaky TestExpandHoldsNoAccessExclusiveLock#66
iarunsaragadam merged 2 commits into
mainfrom
fix/flaky-expand-lock-test

Conversation

@iarunsaragadam

Copy link
Copy Markdown
Contributor

The #33 operational lock test false-failed in CI (expand held a sustained AccessExclusiveLock). Root cause: the observer flagged a lock seen on 5 consecutive ~1ms polls (~5ms), but the personal-index swap's DROP INDEX+RENAME is a legitimate size-independent metadata ACCESS EXCLUSIVE held for a few ms — which a slow/contended CI runner stretches across those polls.

  • Measure contiguous wall-clock hold and tolerate brief metadata flicks (<400ms); a real data-proportional in-place rebuild — the regression the test guards against — holds it far longer (the old whole-schema-in-one-tx approach held it for the entire migration).
  • Drop the secondary absolute >1s per-insert latency check (CI-noise sensitive); the bounded per-insert context already detects a genuine block.

Test-only. Ran 25× with -race locally, green; full postgres package green; lint 0.

TestExpandHoldsNoAccessExclusiveLock flagged a sustained ACCESS EXCLUSIVE when
the lock was seen on 5 consecutive ~1ms polls (~5ms). The personal-index swap's
DROP+RENAME is a legitimate size-INDEPENDENT metadata ACCESS EXCLUSIVE held for
a few ms — which a slow/contended CI runner can stretch across those polls,
false-failing the test. Measure CONTIGUOUS wall-clock hold and tolerate brief
metadata flicks (<400ms); a real data-proportional in-place rebuild (the actual
regression) holds it far longer. Also drop the secondary absolute >1s per-insert
latency check (CI-noise sensitive) — the bounded per-insert context already
detects a genuine block.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8f07a1ef0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +148 to +149
default:
heldSince = time.Time{} // released (or a transient poll error): reset

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check elapsed hold time before resetting

If the observer samples present=true and then is descheduled until after the lock is released, this branch clears heldSince without comparing the elapsed wall time, so a real AccessExclusiveLock held longer than maxContiguousHold can pass whenever no successful poll occurs while the threshold is exceeded. Since this test is specifically meant to catch sustained locks under contended CI scheduling, record a failure on the release/stop path when time.Since(heldSince) exceeds the threshold before resetting it.

Useful? React with 👍 / 👎.

The config-layer GATEWAY_DATA_REGION validator was only exercised via its
empty-region path (22% covered), dropping internal/config below the 88%
per-package coverage gate. Add a direct branch test (empty/valid pass;
uppercase/dot/space/slash/over-length rejected; at-bound allowed) → config
coverage 86.8% -> 97.1%.
@iarunsaragadam iarunsaragadam merged commit 93fd2ad into main Jun 23, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant