cmd/loop: CLI session recording and replay tests #1072
Conversation
Summary of Changes

This pull request significantly enhances the testing infrastructure for the Loop CLI by introducing a robust session recording and replay mechanism. The primary goal is to establish a high-level, deterministic, and regression-resistant testing path that does not depend on a live loopd daemon or specific local environment details. By capturing and replaying real command executions, the CLI's behavior, including output formatting, error messaging, and gRPC interactions, can be thoroughly verified in a controlled and reproducible manner, greatly improving the reliability of future changes.

Highlights
Activity
Code Review
This pull request introduces a comprehensive session recording and replay framework for testing the Loop CLI. This is a significant improvement that will make the CLI more robust and regression-resistant. The implementation is well-structured, abstracting away dependencies like the gRPC connection, clock, and stdio to allow for deterministic testing. The recorded test sessions are extensive and cover a wide range of commands and scenarios. The code is of high quality, and I only have one minor suggestion to improve the clarity of the replay logic.
Force-pushed 6fca25d to 763c480.
Rebased, added recorded sessions for newly added commands and flags (the last commit).
My biggest question is about the design for fixture maintenance. With 114 fixture files, what's the plan for bulk updates when CLI output format changes? A few things that would help long-term:

I don't think it's a blocker for this test update. Just want to understand the maintenance story before this grows further. Otherwise seems solid!
Review comment on:

```go
// finalize writes the recorded session to disk once.
func (r *sessionRecorder) finalize(runErr error) error {
```
Minor: consider writing session even if stopHooks fails, to preserve debugging data
From Claude, the realistic risk is low. stopHooks() just calls unhook functions that restore original file descriptors. Failures here would be unusual (maybe a closed pipe, or dup2 failure). In normal operation this won't happen.
Its suggestion is:

```go
r.finalizeOnce.Do(func() {
	// Still try to stop hooks, but don't abandon ship on error.
	hookErr := r.stopHooks()
	r.logExit(runErr)

	// ... write session to disk ...

	// Return hook error only if write succeeded.
	if finalizeErr == nil {
		finalizeErr = hookErr
	}
})
```
Handled in cmd/loop: keep sessions on hook errors. Finalization now still records the exit state and writes the session file even if hook cleanup returns an error, and only reports the cleanup error after the file has been persisted. That keeps the debugging artifact without hiding an unusual teardown failure.
Review comment on:

```go
// copyExportedFields copies exported fields from src into dst.
func copyExportedFields(dst, src reflect.Value) {
	// Iterate fields and copy only exported ones.
```
Moderate concern here: this copies only exported fields from cli.Command, cli.Flag, and cli.Argument structs. Here are the points where it's fragile:
- Unexported state is silently ignored
If urfave/cli stores important state in unexported fields, the clone won't have it. For example, if cli.StringFlag had:

```go
type StringFlag struct {
	Name  string // exported - copied
	value string // unexported - NOT copied
	isSet bool   // unexported - NOT copied
}
```
The test at line 1071 (require.False(t, clonedAlpha.IsSet())) actually relies on this behavior - it expects
IsSet() to return false because the unexported state isn't copied. So currently this works in your favor.
- Structural assumptions about cli.Command
The code explicitly handles these fields (line 688-692):
- Flags
- MutuallyExclusiveFlags
- Arguments
- Commands
If urfave/cli v3.5 adds a new field like ConditionalFlags []cli.Flag, the cloning code won't deep-clone it - it'll share references with the original.
- Version pinning mitigates but doesn't eliminate risk
The repo is on urfave/cli v3.4.1. When you upgrade, the test TestCloneCommandForReplayResetsFlagState should catch obvious breaks, but it doesn't comprehensively cover all cli.Command fields.
Suggestions:
- Add a comment documenting the urfave/cli version dependency:

```go
// cloneCommandForReplay deep-clones a command tree for deterministic replays.
// NOTE: This relies on urfave/cli/v3 internal structure. If upgrading cli
// versions, verify TestCloneCommandForReplayResetsFlagState still passes.
```
- Consider a simpler alternative: since this is test-only, you could just call newRootCommand() fresh for each test instead of cloning. Is cloning actually necessary?
Handled in cmd/loop: document replay cloning. I kept cloning and added the missing explanation: the CLI keeps package-level command pointers with mutable flag state, so reusing one root would leak IsSet and related state across replayed sessions. The comment also calls out that the clone logic depends on the current urfave/cli structure and points future upgrades at TestCloneCommandForReplayResetsFlagState.
Review comment on:

```go
defer r.Close()

writer := composeWriter(forward, onChunk)
_, _ = io.Copy(writer, r)
```
Minor/Nit: You have a lot of error swallowing in this file. Consider logging io.Copy errors in session_stdio.go for debugging incomplete recordings, or documenting why they're intentionally ignored.
Claude suggests:

```go
go func() {
	defer wg.Done()
	defer r.Close()

	writer := composeWriter(forward, onChunk)
	if _, err := io.Copy(writer, r); err != nil {
		// At minimum, log it for debugging.
		log.Debugf("session stdout copy error: %v", err)
	}
}()
```

Or track errors in a struct field and surface them in finalize():

```go
type sessionRecorder struct {
	// ...
	copyErr error // Set if io.Copy fails.
}
```
Handled in cmd/loop: clarify ignored stdio errors. I kept the current behavior but documented why those background io.Copy errors are intentionally ignored. During hook restoration the pipe endpoints are closed on purpose, and that can terminate the copy loops with an expected error. Logging those teardown paths would mostly add noise rather than improve diagnosis.
Review comment on:

```go
// cloneCommandForReplay deep-clones a command tree for deterministic replays.
func cloneCommandForReplay(cmd *cli.Command) *cli.Command {
```
I'll be more concise on this method. Do you need to clone at all vs calling newRootCommand()? If there is a reason cloning is needed, I recommend adding a comment to document it.
I figured out why cloning is necessary. The package-level command pointers (loopOutCommand, etc.) have mutable flag state that would otherwise bleed between test runs. I recommend adding a comment explaining this, something like:

```go
// cloneCommandForReplay deep-clones a command tree because the CLI uses
// package-level command pointers with mutable flag state. Without cloning,
// flag state from one replay would bleed into subsequent replays.
//
// NOTE: This uses reflection to copy exported fields only. If upgrading
// urfave/cli versions, verify TestCloneCommandForReplayResetsFlagState
// still passes.
func cloneCommandForReplay(cmd *cli.Command) *cli.Command {
```
Handled in cmd/loop: document replay cloning. I kept cloning and added the missing explanation: the CLI keeps package-level command pointers with mutable flag state, so reusing one root would leak IsSet and related state across replayed sessions. The comment also calls out that the clone logic depends on the current urfave/cli structure and points future upgrades at TestCloneCommandForReplayResetsFlagState.
Review comment on:

```go
// forceDeterministicJSON is enabled by tests to obtain stable JSON
// output.
forceDeterministicJSON bool
```
Could these package-level variables be encapsulated in a context or config struct? This looks fine for a CLI, but may complicate future refactorings.
I left the package-level CLI hooks as-is in this series. They are scoped to the CLI runtime and the record/replay harness, and moving them behind a config or context object would turn this into a broader refactor than this PR needs.
Review comment on:

```go
	return false
}

// TestRecordedSessions replays all recorded sessions and compares output.
```
Test parallelism safety (minor): The tests modify package globals (sessionRec, sessionTransport, cliClock, forceDeterministicJSON) without synchronization. This is fine since tests run sequentially. A comment could prevent future breakage:
```go
// TestRecordedSessions replays all recorded sessions and compares output.
// NOTE: Do not add t.Parallel() - this test modifies package-level globals.
func TestRecordedSessions(t *testing.T) {
```
Handled in cmd/loop: note replay test stays serial. I added an explicit note above TestRecordedSessions that it must not use t.Parallel() because the replay harness swaps package-level globals for transport, clock, recorder state, and JSON normalization.
Thanks for the suggestion! I implemented re-generation of all fixtures when only stdout and stderr changed (see commit "cmd/loop: bless text-only session updates", env var LOOP_UPDATE_RECORDED_SESSIONS). If a broader change lands, it makes sense to regenerate the fixtures; fortunately AI is capable of doing it. I also changed the format of fixtures so multi-line stdout and stderr outputs are actually stored as multiline (…). I pushed new commits instead of squashing to make it easier to follow the changes. I'll rebase the changes later. Also I'm aware that the CI is broken now - I'll fix it later as well.
Rebased - that was the only way to fix CI, because it runs against a state merged with master, and there was a recent change in CLI texts in master. Fixed that in "fixup! testdata: add static address and swaps sessions".
Cool, I'm wrapping up my review. This is a really excellent PR with a lot of strong changes including the deterministic replay and bless mode safety. Have a couple more non-blocking suggestions.
Review comment on:

```go
_ = w.Close()
if !useOrig {
	wg.Wait()
}
```
Minor: I'd like to add a comment here to explain the intentional behavior of the goroutine, that it skips waiting for the copy. For example:

```go
// When reading from the original terminal stdin, we skip waiting
// for the copy goroutine. It blocks on terminal input indefinitely
// and will terminate when the pipe closes or the process exits.
// Waiting here would hang the CLI.
```
Thanks! I added the comment.
Review comment on:

```go
data, err := json.Marshal(payload)
if err != nil {
	return
}
```
Minor: consider logging marshal errors here to stderr
Logging to stderr may be problematic here, since stderr may be hooked.
I fixed it by saving the error and checking it from the Finalize() method.
kaldun-tech left a comment:
A couple other things I checked here:
- The testdata fixtures do not contain sensitive data.
- Did not find any concerns with race conditions in session_recorder
- We could add a Makefile target for recording sessions in a follow-up PR, but this would only wrap a one-liner already documented in AGENTS.md. Don't think it's needed.
- Bless mode is great!
Force-pushed f4fff1a to 71d2249.
@kaldun-tech Thanks for the review! I addressed the comments and rebased the PR to make the Git commits atomic.
@claude please review this |
Code Review: CLI Session Recording and Replay

This is a well-conceived feature that brings deterministic, CI-friendly regression testing to the Loop CLI. The overall design is sound and the implementation is thorough. Below are observations organized by severity.

Potential Bugs / Behavioral Changes

1. Silent behavioral regression in abandonSwap (swaps.go)

The removal of the … If this is intentional (e.g., the flag is being deprecated in favour of positional args), it should be documented and probably also enforced with a runtime warning when users still pass …

2. Hook error silently dropped when file write fails in finalize

Inside …

Architecture / Design Concerns

3. Package-level global state makes tests non-parallelisable
4. Reflection-based command cloning is fragile against urfave/cli upgrades
5. sessionDefaultDir is a hardcoded relative path
Minor Issues

6. Redundant contains wrapper

The …

7. Five separate sessionRec nil checks in main()

The recording setup is scattered across five separate …

8. Double-signal os.Exit bypasses deferred cleanup

In …

9. TestUpdateRecordedSessionsEnabled env var setup

The test saves the existing env var value via …

Positive Highlights

Summary

The main concern is the behavioral regression in abandonSwap (point 1) — please confirm whether dropping the …
Make the output homedir-independent to facilitate tests.
abandonSwap used to access flag --id which does not exist.
Add an env-gated bless mode to recorded-session replay so CLI-only text changes can refresh fixtures without a live recording pass. The replay still uses the recorded gRPC traffic, stdin, and environment, and it only rewrites stdout, stderr, and run_error after the command preserves the recorded success or failure shape. Keep the updater strict by refusing to bless sessions when replay leaves recorded gRPC events unconsumed, and add focused tests for the rewrite rules. Document the bless workflow next to the session fixtures, including the need for -count=1 so the Go test cache does not skip updates.
Correct the duplicated article in the static openchannel help text so the command summary reads cleanly. Refresh the recorded help fixtures that surface that summary directly, including the dedicated openchannel help output and the parent static command listing.
This one is not a supported CLI behavior change. Relevant current shape:

```go
var abandonSwapCommand = &cli.Command{
	Name:      "abandonswap",
	ArgsUsage: "ID",
	Flags: []cli.Flag{
		&cli.BoolFlag{Name: "i_know_what_i_am_doing"},
	},
	Action: abandonSwap,
}
```

I left the code as-is here because reintroducing the branch would keep dead code around for a flag the command cannot parse.
Fixed in … Relevant part:

```go
file, err := os.Create(r.filePath)
if err != nil {
	finalizeErr = errors.Join(err, eventErr, hookErr)
	return
}
encoder := json.NewEncoder(file)
encoder.SetIndent("", " ")
if err := encoder.Encode(fileContent); err != nil {
	finalizeErr = errors.Join(err, eventErr, hookErr)
	return
}
```

I also updated the finalize test so it asserts both the write failure and hook failure are returned:

```go
err := recorder.Finalize(nil)
require.ErrorContains(t, err, "is a directory")
require.ErrorContains(t, err, "stdout restore failed")
```
Leaving this as a follow-up. The replay test already documents that it must stay serial because the CLI uses package-level command/runtime state. Moving this into a … Current guardrail:

```go
// NOTE: Do not add t.Parallel() here; the replay harness mutates package-level
// globals such as the active transport, clock, and JSON normalization mode.
func TestRecordedSessions(t *testing.T) {
```
Leaving this unchanged in the PR. The code now has an explicit comment explaining why replay clones the command tree and what urfave/cli upgrade risk to watch for, and the regression test covers the mutable flag/argument state that caused the issue here. Relevant comment:

```go
// NOTE: This relies on urfave/cli storing replay-relevant configuration in
// exported fields. If upgrading urfave/cli, verify that
// TestCloneCommandForReplayResetsFlagState still passes.
func cloneCommandForReplay(cmd *cli.Command) *cli.Command {
```

An upstream …
Fixed in … Relevant code:

```go
func ensureSessionBaseDir(baseDir string) error {
	info, err := os.Stat(baseDir)
	if errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("%s does not exist; run session recording "+
			"from the repository root", baseDir)
	}
	if err != nil {
		return err
	}
	if !info.IsDir() {
		return fmt.Errorf("%s is not a directory", baseDir)
	}
	return nil
}
```

I added a unit test for that failure mode:

```go
t.Chdir(t.TempDir())
t.Setenv(sessionEnvVar, "true")

recorder, err := newSessionRecorder([]string{testLoopArg, testOutArg})
require.Nil(t, recorder)
require.EqualError(t, err, "cmd/loop/testdata/sessions does not "+
	"exist; run session recording from the repository root")
```
Fixed in …

```go
if evt.Event == expected || slices.Contains(alternatives, evt.Event) {
	return &evt, idx, nil
}
```
Leaving this unchanged. The separate checks are a little repetitive, but they keep each installed hook close to its lifetime/defer and avoid introducing a broader setup abstraction in this PR. Since all blocks are guarded by … Representative shape:

```go
if sessionRec != nil {
	restoreClock := hookClock(
		clock.NewTestClock(time.Unix(sessionClockStartUnix, 0)),
	)
	defer restoreClock()

	if err := sessionRec.Start(nil, nil, nil); err != nil {
		fatal(err)
	}
}
```
Fixed in …

```go
// A second signal requests immediate shutdown. Finalize the
// session first because os.Exit skips deferred cleanup.
_ = sessionRec.Finalize(fmt.Errorf("signal: %s", sig))
os.Exit(130)
```
Fixed in …

```go
if current, ok := os.LookupEnv(updateRecordedSessionsEnvVar); ok {
	t.Setenv(updateRecordedSessionsEnvVar, current)
} else {
	t.Setenv(updateRecordedSessionsEnvVar, "false")
}
_ = os.Unsetenv(updateRecordedSessionsEnvVar)

enabled, err := updateRecordedSessionsEnabled()
require.NoError(t, err)
require.False(t, enabled)
```
hieblmi left a comment:
Great work @starius, I've found a couple of issues:
- cmd/loop/session_replay_test.go:689: normal replay mode does not assert that all recorded gRPC events were consumed. That check only runs in bless mode, so CI can miss regressions where a command stops making a recorded trailing RPC but the visible text still matches.
- cmd/loop/main.go:263: live recording replaces the CLI clock with a fixed timestamp. Since that timestamp is January 26, 2026, future loop out / quote out recordings
send stale swap_publication_deadline values via cmd/loop/loopout.go:251 and cmd/loop/quote.go:209, which can diverge from real CLI behavior or be rejected by a live server.
This work adds a repeatable, high-level testing path for the Loop CLI by recording real command runs and replaying them in tests. The goal is to make CLI behavior verifiable and regression-resistant without depending on a live loopd or local environment details. The recorded sessions become fixtures that exercise success and error paths across subcommands and flags, while replay ensures deterministic output checks in CI.
What changed
- Session recording of real CLI runs, enabled when LOOP_SESSION_RECORD=true is set.
- The home directory is replaced with ~ and JSON output and timestamps are normalized during replay, so sessions are portable across machines.
- Sessions are stored under cmd/loop/testdata/sessions/ and adding more is documented.
The CLI previously relied on manual testing or live daemons, which makes it hard to detect regressions and impossible to run in CI without complex setup. Recording/replay gives us deterministic, reproducible coverage of real behavior, including error messaging, output formatting, and gRPC request/response handling, while still exercising the production code paths.
How it works
Recording wraps the CLI process and intercepts stdio and gRPC calls, saving a JSON session file with events and metadata. Replay loads these events, supplies recorded input, and provides a recorded gRPC connection so commands run without a live daemon. Output is normalized where needed so sessions can be replayed across machines.
Testing
- Replay all recorded sessions: go test ./cmd/loop -v
- Record a new session: LOOP_SESSION_RECORD=true loop <command> --network regtest

Pull Request Checklist

- Update release_notes.md if your PR contains major features, breaking changes or bugfixes