feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors by suharvest · Pull Request #3 · tasict/opencode-plugin-cc

suharvest · 2026-04-18T07:52:29Z

Summary

Stacked on #2 — this branch contains the 5 commits from #2 plus 2 new commits. If GitHub shows extra commits in the diff, please merge #2 first (or retarget this PR and I'll rebase).

Two commits that together address the "task stuck in `investigating` after the session is actually done" class of bugs:

`feat(auto-heal)`: new `lib/auto-heal.mjs` probes `GET /session/:id/message?limit=1`. When the last assistant message has `info.finish` set and `info.time.completed >= job.startedAt`, the local job is transitioned to `completed` and the message text is persisted. If the task-worker PID is dead and the session has been silent for >60 s, the job is marked `failed` with a clear reason.
- `status`, `result`, and `task-resume-candidate` run a silent heal pass before reading state, so they never report a false "running".
- New `companion.mjs heal` subcommand reconciles in bulk (`--dry-run` / `--json` / `--all`).
- No-op when the server is unreachable.
`feat(server)`: watcher-side stall detection in `sendPrompt`.
- Idle timeout (`OPENCODE_IDLE_TIMEOUT_MS`, default 15 min): abort when the session shows no message/part/tool-output change for too long.
- Bash-tool stuck detector: when the latest tool is a bash in `status=running` but `opencode serve` has zero child processes for N consecutive polls, abort. This catches the ask-permission deadlock (Bash permission 'ask' hangs forever in headless/server mode anomalyco/opencode#14473) where the shell already exited cleanly but tool state never flipped.
- Graceful degradation on Windows or when `lsof`/`pgrep` are unavailable — idle timeout still covers those cases.
- Race restructure: a rejection from fetch or watcher no longer cancels the other side, so the server's 5-min POST cap can no longer kill live sessions before the watcher observes completion.

Absolute `PROMPT_TIMEOUT_MS` raised to 4 h as a pure safety cap; real stall detection now lives in the watcher.

Test plan

Reproduced the "stuck investigating" case with a killed task-worker — `status` now auto-reconciles to `completed` with the assistant output
`companion.mjs heal --dry-run` reports the correct intended actions without mutating state
Bash deadlock (tool.status=running, no child) aborts within ~15 s; other bash tools (still running children) are untouched
Idle timeout fires on a session that genuinely stalls, and stays quiet on long silent-but-progressing tasks
Server unreachable → heal is a silent no-op, existing `status` / `result` still work

When Claude dispatches an opencode rescue task (via Agent tool or direct companion Bash call), this hook detects the new task-xxx id in the tool response and injects a system-reminder instructing Claude to start a persistent Monitor covering that id. On terminal states the Monitor emits a READY line pointing to the companion result command so Claude fetches the full payload and summarizes it for the user without needing to be asked. - New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs - hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s) Gracefully no-ops on non-matching tool output or missing companion markers.

On terminal state the Monitor script now calls companion result <id> and emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS, default 1500). Claude sees the result summary directly in the Monitor event and no longer needs a follow-up Bash call. Also fixes an extra trailing ) in the inline node -e expression that would have caused the status parser to syntax-error at runtime.

… timeouts OpenCode server's POST /session/:id/message occasionally fails to close its HTTP response after the session emits the terminal assistant message (observed with glm-5 backend, opencode 1.4.x). Without this fix, sendPrompt hangs until AbortSignal fires, leaving the companion job stuck in 'investigating' status until the (previously 5 min) timeout. Changes: - Race the POST fetch against a /session/:id/message polling watcher; whichever returns first aborts the other. Watcher only accepts a completion whose info.time.completed >= prompt startedAt. - Bump generic request() timeout and sendPrompt timeout to 30 min, configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS env vars. - Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS (default 5s).

`status` handler was ignoring argv entirely — `--json` was silently dropped and positional task ids were never matched. Tooling that piped status through jq would choke on the markdown fallback with "parse error: Invalid numeric literal". Now: - `status --json` emits a workspace snapshot as JSON ({workspaceRoot, running, latestFinished, recent}) - `status <tid> [--json]` looks up a single job by id/prefix. JSON form is {workspaceRoot, job: <enriched|null>} so callers can always read .job.status safely. - `status --all` widens from session-scoped to all-sessions (useful for cross-session observers like monitor scripts) - Markdown output unchanged for the no-flag case.

Previously the Monitor script only emitted on status/phase transitions. For long-running tasks that sit in 'running/investigating' for many minutes, the user saw one initial event and then nothing — no way to tell if the task was still alive. Now: - Include the last line of progressPreview in the state signature so any new log activity inside the task triggers an event (with elapsed time + latest log snippet) - Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min) with current status/phase/elapsed even when nothing has changed - Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var

Long-running background tasks occasionally get stuck in investigating status after the OpenCode session has finished server-side (POST body never closes, watcher misses the terminal signal, or task-worker dies). - New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and transitions the local job to completed when the last assistant message has info.finish set and info.time.completed >= job.startedAt. If the task-worker PID is dead and the session is silent >60s, the job is marked failed with a clear reason. - status, result, and task-resume-candidate run a silent heal pass before reading state so they never report a false "running" for a session that is actually complete. - New `companion.mjs heal` subcommand scans and reconciles in bulk, with --dry-run / --json / --all flags. - Heal is a no-op when the server is unreachable, so offline use of status/result keeps working.

Raise the absolute prompt timeout to 4h as a pure safety cap and move real stall detection into the watcher so long-but-alive tasks aren't killed by a fixed deadline. - Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when the session shows no message/part/tool-output change for too long. - Bash-tool stuck detector: when the latest tool is a bash in status running but `opencode serve` has zero child processes for N consecutive polls (default 3 × 5s), abort. This catches the ask-permission deadlock (anomalyco/opencode#14473) where the shell process already exited cleanly but tool state never flipped to completed. Gracefully degrades on Windows or when lsof/pgrep is unavailable. - Restructure fetch-vs-watcher race so a rejection from one side no longer cancels the other. The server's 5-min POST cap used to kill sendPrompt before the watcher could observe completion; now both settle independently and we prefer whichever succeeded.

suharvest · 2026-04-19T23:25:40Z

Superseded by #5 — consolidated hardening suite including this PR's work plus server/auto-heal/SAFETY_HEADER/wait-and-result on top. Closing since no review activity in 2+ days and the follow-on work builds on this.

suharvest added 7 commits April 18, 2026 15:46

This was referenced Apr 18, 2026

Thanks + a few issues found in real-world use (see #1 #2 #3) #4

Open

Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result #5

Open

suharvest closed this Apr 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors#3

feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors#3
suharvest wants to merge 7 commits into
tasict:mainfrom
suharvest:pr/auto-heal

suharvest commented Apr 18, 2026

Uh oh!

suharvest commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suharvest commented Apr 18, 2026

Summary

Test plan

Uh oh!

suharvest commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant