WIP merge wait methods in#5391
Draft
denik wants to merge 21 commits into
Draft
Conversation
Collaborator
|
Commit: 79818c6
22 interesting tests: 15 SKIP, 7 KNOWN
Top 24 slowest tests (at least 2 minutes):
|
b00a6f3 to
786f578
Compare
f0e65f7 to
9f7fc23
Compare
bb3c2e3 to
cdd95c8
Compare
…for intermediate state saves Resources with multi-step deployments (apps, clusters, model serving endpoints, etc.) previously split their logic between DoCreate/DoUpdate and WaitAfterCreate/WaitAfterUpdate at an arbitrary point, making it hard to persist state incrementally. Replace the WaitAfterXxx methods with an *Engine parameter on DoCreate and DoUpdate. Engine.SetID + Engine.SaveState can be called immediately after the initial API call succeeds, before any long-running wait, so the resource is tracked in state even if deployment is interrupted mid-wait (preventing orphaned resources). Simple resources pass _ *Engine and are unaffected. Complex resources (apps, clusters, database instances, model serving endpoints, vector search endpoints and indexes) now inline their wait logic and call engine.SetID/SaveState at the appropriate point. Co-authored-by: Denis Bilenko
Make testserver app DELETE asynchronous (sets DELETING state) to match real API behaviour. AppsGet auto-removes the app after returning it in DELETING state, so the next request sees it as gone. app_test.go: use real SDK calls (Create + DeleteByName) to put the app in DELETING state before testing DoCreate retry logic, instead of injecting state directly. RetriesWhenGetReturnsNotFound uses a one-shot POST override so GET returns 404 naturally without a custom GET handler. all_test.go: retry DoRead once after DoDelete to let async deletions (DELETING → gone) clear before asserting the resource is absent. Add Server.GetWorkspace to testserver for pre-seeding state in tests. Co-authored-by: Denis Bilenko
Use per-test [[Repls]] rules (matching raw digits) in the two publish-failure tests so the URL parameter is normalized to ?[WSPARAM]=[NUMID] regardless of whether the testserver or cloud environment is used. Revert the over-broad parent-level rule that broke detect-change's existing ?[ow]=... cleanup. Co-authored-by: Denis Bilenko
macOS ships bash 3.2 which does not support &>> (bash 4+). Replace with >> file 2>&1 which works on all bash versions. Co-authored-by: Denis Bilenko
MSYS_NO_PATHCONV=1 (set in the parent test.toml) prevents MSYS2 from converting POSIX paths to Windows paths, causing Python to receive a broken path (/c/a/... instead of C:\a\...) when invoking fault.py. Unset it before the fault.py call, matching the pattern used by other dashboard scripts before their bin helper invocations. Co-authored-by: Denis Bilenko
…d=false
Engine.SaveState:
- Accepts resourceKey for logging: "SaveState: resources.X id=Y N bytes: {...}"
- Skips WAL write if state is unchanged (structdiff.IsEqual), logging a skip message.
- Records the last saved value in e.lastSaved for subsequent comparisons.
Dashboard DoCreate:
- Saves intermediate state with Published=false (the actual draft state) instead of
the user's Published=true. This ensures the planner sees a real diff (false→true)
on the next deploy if publish is interrupted, rather than treating the resource as
up-to-date and silently skipping the publish.
publish-failure-retry acceptance test:
- Adds `bundle plan -o json` and extracts the published change to verify
old=false, new=true in the plan after a failed publish.
README.md: update stale SetID+SaveState reference to current SaveState(ctx, id, state) API.
Co-authored-by: Denis Bilenko
…reate/DoUpdate Mergiraf reintroduced the standalone WaitAfterCreate and WaitAfterUpdate methods during the rebase against main (sql_warehouse.go had a merge conflict). Inline their wait logic directly into DoCreate and DoUpdate and remove the standalone methods, consistent with the Engine callback pattern used by other resources. Co-authored-by: Denis Bilenko
df19a64 to
bf602f1
Compare
…ublish-failure-retry - Replace fixed hex regex [[Repls]] with replace_ids.py called after the first deploy; the real dashboard ID is captured from state as [DASHBOARD1_ID]. - Inline the plan -o json | jq output directly into output.txt instead of saving to out.plan_published_change.json and trace cat-ing it. Co-authored-by: Denis Bilenko
…dashboard DoUpdate Two more cases where state could be persisted earlier to avoid orphaning or stale-etag conflicts: sql_warehouse DoCreate: the warehouse is created, then DoCreate polls for RUNNING (and may Stop it). If interrupted during the wait, the warehouse was orphaned. Now SaveState is called right after Create returns the id, before the wait. dashboard DoUpdate: Update() bumps the server-side etag, then publishDashboard() can fail. Mirror the DoCreate fix — save the new etag with Published=false before publishing. This keeps the etag in sync (a stale etag makes the next Update fail with a conflict) and records published=false so the planner re-publishes next deploy. Resolves the pre-existing TODO. Add publish-failure-retry-on-update acceptance test: deploy, then trigger an update with an injected publish failure, and verify the update issued a PATCH (no CREATE) and the plan records published old=false. Co-authored-by: Denis Bilenko
DoUpdate applies up to four sequential mutating calls (tags, AI gateway, config, notifications) and then waits up to 35 minutes for the endpoint to finish updating. If that wait was interrupted, none of the applied changes were persisted to state, forcing them to be re-applied on the next deploy. Save the new config right after the mutating calls and before the wait, mirroring DoCreate which already saves before waitForEndpointReady. The id (endpoint name) matches the one DoCreate persists, so there is no id mismatch. Co-authored-by: Denis Bilenko
…, not resource id) The previous attempt to save state before the async wait in postgres DoCreate used engine.SaveState(ctx, waiter.Name(), config). waiter.Name() returns the LRO operation name (e.g. .../branches/foo/operations/UUID), not the resource name (…/branches/foo) that DoCreate ultimately returns. Using the operation name as the id creates a dead WAL entry and does not help with orphan recovery. Replace with a TODO comment explaining the two options for a proper fix: 1. Derive the resource name from request inputs (Parent + resource-type + Id). 2. Use waiter.Metadata() if it surfaces the resource name before Wait completes. No panic in production: the framework saves state via db.SaveState (not through the Engine) after DoCreate returns, so the mismatch between operation-name and real-id never hits the Engine's id-mismatch check. The test panic was a test artifact where the same Engine was reused across Create→Update. Co-authored-by: Denis Bilenko
Previously apply.go called db.SaveState directly after DoCreate/DoUpdate returned, bypassing the engine. This meant a DoCreate that called engine.SaveState with a wrong id (e.g. an LRO operation name instead of the real resource name) would go undetected — the final save would silently use the correct id. Route both final saves through the engine instead: - If DoCreate/DoUpdate called engine.SaveState with a wrong id, the engine's id-mismatch check panics, catching the bug immediately in tests. - If DoCreate/DoUpdate already saved identical state (e.g. the resource saved before a long wait), the engine's dedup skips the redundant write. - State-save I/O errors are logged internally by the engine and no longer abort the deployment (the resource was already created/updated successfully; aborting would confuse the user). UpdateWithID is left unchanged: DoUpdateWithID takes no Engine parameter. Co-authored-by: Denis Bilenko
The type does exactly one thing: save state. StateSaver names that directly. NewNopEngine -> NewNopStateSaver, engine.go -> state_saver.go. Co-authored-by: Denis Bilenko
…aver - SaveStateWith[F any]: type-safe helper that temporarily sets a field to an intermediate value before saving, then restores it. Used to save Published=false before publishDashboard and Lifecycle=nil before warehouse/cluster lifecycle management, so the planner sees a real diff if deployment is interrupted mid-way. - Change saveFunc signature to func(id string, b json.RawMessage) error and add DeploymentState.SaveStateJSON to accept pre-marshaled bytes. StateSaver already marshals x to JSON for dedup; passing those bytes directly avoids a redundant marshal on every save. - Fix lastSaved aliasing: store []byte (JSON snapshot) instead of any pointer. SaveStateWith modifies the config, saves, then restores; a pointer-based lastSaved would point at the restored value, making the subsequent final save look like a no-op and leaving Published=false in the WAL. Co-authored-by: Isaac
Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
Why
Tests