Skip to content

feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors#3

Closed
suharvest wants to merge 7 commits into
tasict:mainfrom
suharvest:pr/auto-heal
Closed

feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors#3
suharvest wants to merge 7 commits into
tasict:mainfrom
suharvest:pr/auto-heal

Conversation

@suharvest

Copy link
Copy Markdown

Summary

Stacked on #2 — this branch contains the 5 commits from #2 plus 2 new commits. If GitHub shows extra commits in the diff, please merge #2 first (or retarget this PR and I'll rebase).

Two commits that together address the "task stuck in `investigating` after the session is actually done" class of bugs:

  • `feat(auto-heal)`: new `lib/auto-heal.mjs` probes `GET /session/:id/message?limit=1`. When the last assistant message has `info.finish` set and `info.time.completed >= job.startedAt`, the local job is transitioned to `completed` and the message text is persisted. If the task-worker PID is dead and the session has been silent for >60 s, the job is marked `failed` with a clear reason.
    • `status`, `result`, and `task-resume-candidate` run a silent heal pass before reading state, so they never report a false "running".
    • New `companion.mjs heal` subcommand reconciles in bulk (`--dry-run` / `--json` / `--all`).
    • No-op when the server is unreachable.
  • `feat(server)`: watcher-side stall detection in `sendPrompt`.
    • Idle timeout (`OPENCODE_IDLE_TIMEOUT_MS`, default 15 min): abort when the session shows no message/part/tool-output change for too long.
    • Bash-tool stuck detector: when the latest tool is a bash in `status=running` but `opencode serve` has zero child processes for N consecutive polls, abort. This catches the ask-permission deadlock (Bash permission 'ask' hangs forever in headless/server mode anomalyco/opencode#14473) where the shell already exited cleanly but tool state never flipped.
    • Graceful degradation on Windows or when `lsof`/`pgrep` are unavailable — idle timeout still covers those cases.
    • Race restructure: a rejection from fetch or watcher no longer cancels the other side, so the server's 5-min POST cap can no longer kill live sessions before the watcher observes completion.

Absolute `PROMPT_TIMEOUT_MS` raised to 4 h as a pure safety cap; real stall detection now lives in the watcher.

Test plan

  • Reproduced the "stuck investigating" case with a killed task-worker — `status` now auto-reconciles to `completed` with the assistant output
  • `companion.mjs heal --dry-run` reports the correct intended actions without mutating state
  • Bash deadlock (tool.status=running, no child) aborts within ~15 s; other bash tools (still running children) are untouched
  • Idle timeout fires on a session that genuinely stalls, and stays quiet on long silent-but-progressing tasks
  • Server unreachable → heal is a silent no-op, existing `status` / `result` still work

When Claude dispatches an opencode rescue task (via Agent tool or direct
companion Bash call), this hook detects the new task-xxx id in the tool
response and injects a system-reminder instructing Claude to start a
persistent Monitor covering that id. On terminal states the Monitor
emits a READY line pointing to the companion result command so Claude
fetches the full payload and summarizes it for the user without needing
to be asked.

- New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs
- hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s)

Gracefully no-ops on non-matching tool output or missing companion markers.
On terminal state the Monitor script now calls companion result <id> and
emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS,
default 1500). Claude sees the result summary directly in the Monitor
event and no longer needs a follow-up Bash call.

Also fixes an extra trailing ) in the inline node -e expression that
would have caused the status parser to syntax-error at runtime.
… timeouts

OpenCode server's POST /session/:id/message occasionally fails to close
its HTTP response after the session emits the terminal assistant message
(observed with glm-5 backend, opencode 1.4.x). Without this fix,
sendPrompt hangs until AbortSignal fires, leaving the companion job
stuck in 'investigating' status until the (previously 5 min) timeout.

Changes:
- Race the POST fetch against a /session/:id/message polling watcher;
  whichever returns first aborts the other. Watcher only accepts a
  completion whose info.time.completed >= prompt startedAt.
- Bump generic request() timeout and sendPrompt timeout to 30 min,
  configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS
  env vars.
- Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS
  (default 5s).
`status` handler was ignoring argv entirely — `--json` was silently
dropped and positional task ids were never matched. Tooling that piped
status through jq would choke on the markdown fallback with "parse
error: Invalid numeric literal".

Now:
- `status --json` emits a workspace snapshot as JSON ({workspaceRoot,
  running, latestFinished, recent})
- `status <tid> [--json]` looks up a single job by id/prefix. JSON
  form is {workspaceRoot, job: <enriched|null>} so callers can always
  read .job.status safely.
- `status --all` widens from session-scoped to all-sessions (useful
  for cross-session observers like monitor scripts)
- Markdown output unchanged for the no-flag case.
Previously the Monitor script only emitted on status/phase transitions.
For long-running tasks that sit in 'running/investigating' for many
minutes, the user saw one initial event and then nothing — no way to
tell if the task was still alive.

Now:
- Include the last line of progressPreview in the state signature so
  any new log activity inside the task triggers an event (with elapsed
  time + latest log snippet)
- Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min)
  with current status/phase/elapsed even when nothing has changed
- Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var
Long-running background tasks occasionally get stuck in investigating
status after the OpenCode session has finished server-side (POST body
never closes, watcher misses the terminal signal, or task-worker dies).

- New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and
  transitions the local job to completed when the last assistant
  message has info.finish set and info.time.completed >= job.startedAt.
  If the task-worker PID is dead and the session is silent >60s, the
  job is marked failed with a clear reason.
- status, result, and task-resume-candidate run a silent heal pass
  before reading state so they never report a false "running" for a
  session that is actually complete.
- New `companion.mjs heal` subcommand scans and reconciles in bulk,
  with --dry-run / --json / --all flags.
- Heal is a no-op when the server is unreachable, so offline use of
  status/result keeps working.
Raise the absolute prompt timeout to 4h as a pure safety cap and move
real stall detection into the watcher so long-but-alive tasks aren't
killed by a fixed deadline.

- Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when
  the session shows no message/part/tool-output change for too long.
- Bash-tool stuck detector: when the latest tool is a bash in status
  running but `opencode serve` has zero child processes for N
  consecutive polls (default 3 × 5s), abort. This catches the
  ask-permission deadlock (anomalyco/opencode#14473) where the shell process
  already exited cleanly but tool state never flipped to completed.
  Gracefully degrades on Windows or when lsof/pgrep is unavailable.
- Restructure fetch-vs-watcher race so a rejection from one side no
  longer cancels the other. The server's 5-min POST cap used to kill
  sendPrompt before the watcher could observe completion; now both
  settle independently and we prefer whichever succeeded.
@suharvest

Copy link
Copy Markdown
Author

Superseded by #5 — consolidated hardening suite including this PR's work plus server/auto-heal/SAFETY_HEADER/wait-and-result on top. Closing since no review activity in 2+ days and the follow-on work builds on this.

@suharvest suharvest closed this Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant