feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors #3
Closed
suharvest wants to merge 7 commits into
When Claude dispatches an opencode rescue task (via Agent tool or direct companion Bash call), this hook detects the new task-xxx id in the tool response and injects a system-reminder instructing Claude to start a persistent Monitor covering that id. On terminal states the Monitor emits a READY line pointing to the companion result command so Claude fetches the full payload and summarizes it for the user without needing to be asked.
- New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs
- hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s)
Gracefully no-ops on non-matching tool output or missing companion markers.
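The detection step can be sketched roughly as below. This is a minimal illustration, not the hook's actual code: the id pattern, the function names, and the reminder wording are all assumptions, since the description only says ids look like task-xxx.

```javascript
// Hypothetical id pattern: the PR only says ids look like "task-xxx".
const TASK_ID_PATTERN = /\btask-[A-Za-z0-9]+/g;

// Collect unique task ids from a tool response, in first-seen order.
function extractTaskIds(toolResponseText) {
  if (typeof toolResponseText !== "string") return [];
  const matches = toolResponseText.match(TASK_ID_PATTERN) ?? [];
  return [...new Set(matches)];
}

// Build the reminder text. Returns null when nothing matches so the
// caller can exit quietly (the "graceful no-op" case).
function buildMonitorReminder(toolResponseText) {
  const ids = extractTaskIds(toolResponseText);
  if (ids.length === 0) return null;
  // Assumed wording; the real hook's reminder text is not shown in the PR.
  return `<system-reminder>Start a persistent Monitor covering: ${ids.join(", ")}</system-reminder>`;
}
```

Returning null rather than an empty string keeps the no-op path explicit, so the caller never emits an empty reminder.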
On terminal state the Monitor script now calls `companion result <id>` and emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS, default 1500). Claude sees the result summary directly in the Monitor event and no longer needs a follow-up Bash call.
Also fixes an extra trailing `)` in the inline `node -e` expression that would have caused the status parser to syntax-error at runtime.
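A minimal sketch of the bounding logic, assuming the limit is read from the environment and applied as a simple character cut. The function names and the truncation marker are hypothetical; only the variable name and the 1500 default come from the description.

```javascript
const DEFAULT_RESULT_CHARS = 1500; // default stated in the commit message

// Read the limit from the environment, falling back on bad or missing input.
function resultCharLimit(env = process.env) {
  const parsed = Number.parseInt(env.OPENCODE_MONITOR_RESULT_CHARS ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RESULT_CHARS;
}

// Cut the summary at `limit` characters and flag the cut.
// The " [truncated]" marker is an assumption, not the script's real output.
function truncateSummary(text, limit) {
  if (text.length <= limit) return text;
  return text.slice(0, limit) + " [truncated]";
}
```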
… timeouts
OpenCode server's POST /session/:id/message occasionally fails to close its HTTP response after the session emits the terminal assistant message (observed with glm-5 backend, opencode 1.4.x). Without this fix, sendPrompt hangs until AbortSignal fires, leaving the companion job stuck in 'investigating' status until the (previously 5 min) timeout.
Changes:
- Race the POST fetch against a /session/:id/message polling watcher; whichever returns first aborts the other. Watcher only accepts a completion whose info.time.completed >= prompt startedAt.
- Bump generic request() timeout and sendPrompt timeout to 30 min, configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS env vars.
- Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS (default 5s).
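The race described above might look something like the following sketch. `raceWithAbort` and `isFreshCompletion` are hypothetical names, and the two task arguments stand in for the real POST fetch and polling watcher; timestamps are assumed to be epoch milliseconds.

```javascript
// Accept only completions that finished at or after this prompt started,
// so an older assistant message is not mistaken for the new reply.
function isFreshCompletion(message, startedAt) {
  const completed = message?.info?.time?.completed;
  return typeof completed === "number" && completed >= startedAt;
}

// Race two abortable tasks; whichever settles first aborts the other.
// Each task is a function (signal) => Promise.
function raceWithAbort(runPost, runWatcher) {
  const postCtl = new AbortController();
  const watchCtl = new AbortController();
  const post = runPost(postCtl.signal).finally(() => watchCtl.abort());
  const watcher = runWatcher(watchCtl.signal).finally(() => postCtl.abort());
  return Promise.race([post, watcher]);
}
```

Note that with `Promise.race` a rejection on one side also settles the race; the later commit in this PR changes exactly that behavior.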
`status` handler was ignoring argv entirely — `--json` was silently
dropped and positional task ids were never matched. Tooling that piped
status through jq would choke on the markdown fallback with "parse
error: Invalid numeric literal".
Now:
- `status --json` emits a workspace snapshot as JSON ({workspaceRoot,
running, latestFinished, recent})
- `status <tid> [--json]` looks up a single job by id/prefix. JSON
form is {workspaceRoot, job: <enriched|null>} so callers can always
read .job.status safely.
- `status --all` widens from session-scoped to all-sessions (useful
for cross-session observers like monitor scripts)
- Markdown output unchanged for the no-flag case.
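As a rough illustration of the new argument handling, the sketch below parses the flags and builds the single-job JSON form. The function names and the in-memory `jobs` array are assumptions; the flag names and the `{workspaceRoot, job}` shape come from the commit message.

```javascript
// Parse `status` arguments: two boolean flags plus one optional task id.
function parseStatusArgs(argv) {
  const opts = { json: false, all: false, taskId: null };
  for (const arg of argv) {
    if (arg === "--json") opts.json = true;
    else if (arg === "--all") opts.all = true;
    else if (!arg.startsWith("-") && opts.taskId === null) opts.taskId = arg;
  }
  return opts;
}

// Look up a job by exact id first, then by the first matching prefix.
function findJob(jobs, taskId) {
  return (
    jobs.find((job) => job.id === taskId) ??
    jobs.find((job) => job.id.startsWith(taskId)) ??
    null
  );
}

// `job` is always present (possibly null), so `.job.status` style
// lookups in jq never hit a missing key.
function renderSingleJobJson(workspaceRoot, jobs, taskId) {
  return JSON.stringify({ workspaceRoot, job: findJob(jobs, taskId) });
}
```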
Previously the Monitor script only emitted on status/phase transitions. For long-running tasks that sit in 'running/investigating' for many minutes, the user saw one initial event and then nothing, with no way to tell if the task was still alive.
Now:
- Include the last line of progressPreview in the state signature so any new log activity inside the task triggers an event (with elapsed time + latest log snippet)
- Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min) with current status/phase/elapsed even when nothing has changed
- Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var
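The signature-plus-heartbeat idea can be sketched as a small poll handler. This is an illustrative JavaScript model only (the Monitor script itself may be structured quite differently); `createEmitter` and the return values are hypothetical, while `status`, `phase`, and `progressPreview` are the fields named above.

```javascript
// Last non-empty line of the progress preview, or "" if there is none.
function lastLine(text) {
  const lines = (text ?? "").split("\n").filter((line) => line.trim() !== "");
  return lines.length > 0 ? lines[lines.length - 1] : "";
}

// Any change in status, phase, or latest log line changes the signature.
function stateSignature(job) {
  return [job.status, job.phase, lastLine(job.progressPreview)].join("|");
}

// Returns a per-poll handler that reports "change", "heartbeat", or null.
function createEmitter(heartbeatPolls = 10) {
  let previous = null;
  let quietPolls = 0;
  return function onPoll(job) {
    const signature = stateSignature(job);
    if (signature !== previous) {
      previous = signature;
      quietPolls = 0;
      return "change";
    }
    quietPolls += 1;
    if (quietPolls >= heartbeatPolls) {
      quietPolls = 0; // restart the count after each heartbeat
      return "heartbeat";
    }
    return null;
  };
}
```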
Long-running background tasks occasionally get stuck in investigating status after the OpenCode session has finished server-side (POST body never closes, watcher misses the terminal signal, or task-worker dies).
- New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and transitions the local job to completed when the last assistant message has info.finish set and info.time.completed >= job.startedAt. If the task-worker PID is dead and the session is silent >60s, the job is marked failed with a clear reason.
- status, result, and task-resume-candidate run a silent heal pass before reading state so they never report a false "running" for a session that is actually complete.
- New `companion.mjs heal` subcommand scans and reconciles in bulk, with --dry-run / --json / --all flags.
- Heal is a no-op when the server is unreachable, so offline use of status/result keeps working.
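The heal decision reduces to a small pure function once the probe results are in hand. The sketch below is an assumption-laden illustration: `decideHeal`, its argument shape, the `info.role` check, and the failure reason text are hypothetical, while the `info.finish`, `info.time.completed >= job.startedAt`, and 60 s conditions come from the description. `process.kill(pid, 0)` is the standard Node.js way to test whether a process exists.

```javascript
const SILENT_THRESHOLD_MS = 60_000; // ">60s" from the commit message

// Signal 0 checks for existence without delivering a signal.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === "EPERM"; // process exists but belongs to another user
  }
}

// Decide what to do with a job that is locally marked as running.
// Returns a new state, or null to leave the job untouched.
function decideHeal({ job, lastMessage, workerAlive, silentMs }) {
  const info = lastMessage?.info;
  const completedAt = info?.time?.completed;
  const finished =
    info?.role === "assistant" && // assumed field marking assistant messages
    Boolean(info.finish) &&
    typeof completedAt === "number" &&
    completedAt >= job.startedAt;
  if (finished) return { status: "completed" };
  if (!workerAlive && silentMs > SILENT_THRESHOLD_MS) {
    return { status: "failed", reason: "task-worker exited and session is silent" };
  }
  return null;
}
```

Keeping the decision separate from the network probe makes the unreachable-server case trivial: if the probe fails, the caller simply skips `decideHeal`.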
Raise the absolute prompt timeout to 4h as a pure safety cap and move real stall detection into the watcher so long-but-alive tasks aren't killed by a fixed deadline.
- Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when the session shows no message/part/tool-output change for too long.
- Bash-tool stuck detector: when the latest tool is a bash in status running but `opencode serve` has zero child processes for N consecutive polls (default 3 × 5s), abort. This catches the ask-permission deadlock (anomalyco/opencode#14473) where the shell process already exited cleanly but tool state never flipped to completed. Gracefully degrades on Windows or when lsof/pgrep is unavailable.
- Restructure fetch-vs-watcher race so a rejection from one side no longer cancels the other. The server's 5-min POST cap used to kill sendPrompt before the watcher could observe completion; now both settle independently and we prefer whichever succeeded.
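Two of these pieces lend themselves to a short sketch: the consecutive-poll counter behind the bash-stuck detector, and a race that tolerates one side failing. Everything here is illustrative. The function names and the observation shape are hypothetical, and `Promise.any` is one plausible way to get "prefer whichever succeeded" semantics, not necessarily what the PR uses.

```javascript
// Trip only after `requiredPolls` consecutive suspicious observations.
function createBashStuckDetector(requiredPolls = 3) {
  let consecutive = 0;
  return function observe({ toolName, toolStatus, childCount }) {
    // childCount === null models "probe unavailable" (e.g. no pgrep/lsof),
    // which counts as not suspicious so the detector degrades gracefully.
    const suspicious =
      toolName === "bash" && toolStatus === "running" && childCount === 0;
    consecutive = suspicious ? consecutive + 1 : 0;
    return consecutive >= requiredPolls;
  };
}

// Idle check: true when nothing has changed for longer than the timeout.
function isIdle(lastChangeAt, now, idleTimeoutMs = 15 * 60 * 1000) {
  return now - lastChangeAt > idleTimeoutMs;
}

// Resolve with the first side to succeed; reject (with an AggregateError)
// only if both the POST and the watcher fail.
function firstSuccess(postPromise, watcherPromise) {
  return Promise.any([postPromise, watcherPromise]);
}
```

Resetting the counter on any healthy observation means a single poll with a live child process is enough to clear suspicion, which avoids aborting a shell that is merely between commands.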
This was referenced Apr 18, 2026
Author
Superseded by #5 — consolidated hardening suite including this PR's work plus server/auto-heal/SAFETY_HEADER/wait-and-result on top. Closing since no review activity in 2+ days and the follow-on work builds on this.
Summary
Two commits that together address the "task stuck in `investigating` after the session is actually done" class of bugs:
Absolute `PROMPT_TIMEOUT_MS` raised to 4 h as a pure safety cap; real stall detection now lives in the watcher.
Test plan