Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5
Open
suharvest wants to merge 23 commits into
Open
Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5suharvest wants to merge 23 commits into
suharvest wants to merge 23 commits into
Conversation
When Claude dispatches an opencode rescue task (via Agent tool or direct companion Bash call), this hook detects the new task-xxx id in the tool response and injects a system-reminder instructing Claude to start a persistent Monitor covering that id. On terminal states the Monitor emits a READY line pointing to the companion result command so Claude fetches the full payload and summarizes it for the user without needing to be asked. - New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs - hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s) Gracefully no-ops on non-matching tool output or missing companion markers.
On terminal state the Monitor script now calls companion result <id> and emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS, default 1500). Claude sees the result summary directly in the Monitor event and no longer needs a follow-up Bash call. Also fixes an extra trailing ) in the inline node -e expression that would have caused the status parser to syntax-error at runtime.
… timeouts OpenCode server's POST /session/:id/message occasionally fails to close its HTTP response after the session emits the terminal assistant message (observed with glm-5 backend, opencode 1.4.x). Without this fix, sendPrompt hangs until AbortSignal fires, leaving the companion job stuck in 'investigating' status until the (previously 5 min) timeout. Changes: - Race the POST fetch against a /session/:id/message polling watcher; whichever returns first aborts the other. Watcher only accepts a completion whose info.time.completed >= prompt startedAt. - Bump generic request() timeout and sendPrompt timeout to 30 min, configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS env vars. - Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS (default 5s).
`status` handler was ignoring argv entirely — `--json` was silently
dropped and positional task ids were never matched. Tooling that piped
status through jq would choke on the markdown fallback with "parse
error: Invalid numeric literal".
Now:
- `status --json` emits a workspace snapshot as JSON ({workspaceRoot,
running, latestFinished, recent})
- `status <tid> [--json]` looks up a single job by id/prefix. JSON
form is {workspaceRoot, job: <enriched|null>} so callers can always
read .job.status safely.
- `status --all` widens from session-scoped to all-sessions (useful
for cross-session observers like monitor scripts)
- Markdown output unchanged for the no-flag case.
Previously the Monitor script only emitted on status/phase transitions. For long-running tasks that sit in 'running/investigating' for many minutes, the user saw one initial event and then nothing — no way to tell if the task was still alive. Now: - Include the last line of progressPreview in the state signature so any new log activity inside the task triggers an event (with elapsed time + latest log snippet) - Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min) with current status/phase/elapsed even when nothing has changed - Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var
Long-running background tasks occasionally get stuck in investigating status after the OpenCode session has finished server-side (POST body never closes, watcher misses the terminal signal, or task-worker dies). - New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and transitions the local job to completed when the last assistant message has info.finish set and info.time.completed >= job.startedAt. If the task-worker PID is dead and the session is silent >60s, the job is marked failed with a clear reason. - status, result, and task-resume-candidate run a silent heal pass before reading state so they never report a false "running" for a session that is actually complete. - New `companion.mjs heal` subcommand scans and reconciles in bulk, with --dry-run / --json / --all flags. - Heal is a no-op when the server is unreachable, so offline use of status/result keeps working.
Raise the absolute prompt timeout to 4h as a pure safety cap and move real stall detection into the watcher so long-but-alive tasks aren't killed by a fixed deadline. - Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when the session shows no message/part/tool-output change for too long. - Bash-tool stuck detector: when the latest tool is a bash in status running but `opencode serve` has zero child processes for N consecutive polls (default 3 × 5s), abort. This catches the ask-permission deadlock (anomalyco/opencode#14473) where the shell process already exited cleanly but tool state never flipped to completed. Gracefully degrades on Windows or when lsof/pgrep is unavailable. - Restructure fetch-vs-watcher race so a rejection from one side no longer cancels the other. The server's 5-min POST cap used to kill sendPrompt before the watcher could observe completion; now both settle independently and we prefer whichever succeeded.
GLM-5 (and likely other models) inside an opencode session sometimes
inherits routing directives from the outer Claude Code CLAUDE.md
("delegate long tasks to opencode-rescue") and tries to invoke Task
with subagent_type="opencode:rescue" / "codex:rescue". Those are
Claude Code skill namespaces, not opencode agents — the call errors
out, then the model stalls indefinitely retrying with zero output
emission, which the session-level idle watchdog only catches after
15min.
buildTaskPrompt now prepends an explicit notice that routing rules
have already been consumed and Claude Code subagent_types are
unavailable here; execute directly instead.
Applies only to prompts going through opencode-companion (task,
review paths). Direct 'opencode run' CLI invocations still need to
strip CLAUDE.md themselves — prefer Agent(subagent_type=opencode-
rescue) for proper watchdog coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tdio Opens an append fd to the given path and passes it as stdout+stderr to the detached spawn; closes the fd in the parent immediately after fork. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hed worker Resolves logFile path and calls upsertJob with status=queued before spawn so status/result commands see the job immediately. Updates pid after spawn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ind OPENCODE_STRICT_TERMINAL Some OpenCode versions emit terminal assistant messages without a finish field. Accept completed > 0 as sufficient when OPENCODE_STRICT_TERMINAL != "1". Applies to both sendPrompt watcher and probeSessionTerminal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… live child processes Prevents premature session termination when tool subprocesses are still running. Resets lastActivityMs and logs a reason instead of aborting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ocess is dead Replaces the unconditional skip with a dead-worker check: if job.pid is gone and updatedAt is older than STALE_IDLE_MS, upsert status=failed and return healed-failed so the job stops being retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…usted Adds OPENCODE_MONITOR_MAX_POLLS (default 120) poll counter. When the loop exits due to poll exhaustion rather than a terminal status, emits a TERMINAL=timeout line for each non-terminal job so the parent thread sees a clear signal. Guards against duplicate TERMINAL lines via terminal[] map. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The prior patch used `continue` inside the idle-threshold branch, which skipped the POLL_INTERVAL_MS sleep at loop tail and hammered the opencode server while tools were alive. Replace with a straight if/else so the reset path falls through to the normal poll interval. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE_PLUGIN_DATA can be exported by an unrelated plugin (e.g. codex companion) into the shared env, causing opencode state to land in another plugin's data directory. Derive our own data dir from the script's install path instead, falling back to CLAUDE_PLUGIN_DATA only when it already names an opencode-scoped dir. Add OPENCODE_COMPANION_DATA as an explicit override.
…rver Two server-side robustness features so newcomers hit fewer cryptic failures: 1. errors.mjs — classifyError() annotates fetch/abort/connection errors with actionable fix hints. Wired into request() and sendPrompt() in opencode-server.mjs so "fetch failed" becomes e.g. "Aborted after Ns (OPENCODE_PROMPT_TIMEOUT_MS=...). For longer tasks set ... to a higher value." ECONNREFUSED gets the start-server hint, 401/403 points at OPENCODE_SERVER_PASSWORD, 5xx points at docker logs. 2. opencode-config.mjs — ensureOpencodeConfig() writes the headless-safe permission set (bash/edit/webfetch/external_directory all "allow") to ~/.config/opencode/opencode.json before ensureServer spawns opencode. Idempotent, atomic temp+rename, respects XDG_CONFIG_HOME. Fixes anomalyco/opencode#14473 hang that cost 48 min in one session to diagnose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New subcommands for first-run self-service: - doctor [--fix] [--json] [--verbose] — 8 health checks covering opencode binary, config permissions, server reachability, plugin data dir resolution, stuck jobs, disk. With --fix, writes missing opencode.json permissions and runs autoHealJobs. Exits 1 on failures for CI use. - config [--json] — dumps all recognized env vars with current values and source (default | env), plus resolved state dir, opencode config path, and server reachability. Makes hidden defaults discoverable. Adds getSessionLastActivity() to auto-heal.mjs (used in follow-up commit for status breadcrumbs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleStatus now enriches each running job with its opencode session's last activity (tool + command head + age, or text snippet) via getSessionLastActivity. render.mjs displays breadcrumb under the job line when present. Makes "investigating" actionable — users see "bash: docker exec speech ... (3s ago)" instead of a static phase label. README gets a Quickstart pointing at 'doctor' as the first thing to run, an Environment Variables table documenting all OPENCODE_* knobs with defaults, and a Pitfalls section for the two recurring surprises (stale investigating → heal; headless bash hang → doctor --fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l guidance
Rescue subagents occasionally return placeholder strings ("Monitor started",
"Waiting for completion", "Task forwarded in background") instead of the
companion's rendered terminal report. Add a PostToolUse hook that spots those
patterns, surfaces any task ids seen, and tells the main thread to poll
companion status/result rather than treat the vague output as final. Updates
opencode-rescue agent prompt and the result-handling/runtime skills to
document the dispatch-and-poll contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Skill + Agent to the existing Task blocklist (was only Task). Adds explicit 'IGNORE that instruction' for task text that mentions delegating to opencode-rescue / codex-rescue. Covers plugin:name colon-namespaced identifiers as a class. Addresses observed sub-agent confusion where names like opencode:rescue look like plugin skills but are actually agents, causing Skill() calls to fail silently or stall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scue wrapper
Replaces the fragile 3+ Bash call dispatch-and-poll loop in
opencode-rescue.md with a 2-step loop using a new server-side blocking
wait-and-result subcommand.
opencode-companion.mjs:
- New handleWaitAndResult: polls job status every 250ms up to --max-wait
seconds, returns final formatted result on terminal state (exit 0),
or JSON status-running on timeout (exit 2). Bad task-id exits 1.
opencode-rescue.md:
- Dispatch loop collapsed from 'task --background + sleep-30-poll loop +
result' to 'task --background + wait-and-result loop (max 20 rounds)'
- Each iteration is one clean Bash call with server-side blocking up
to 8 minutes (safe under Claude Code's 10-minute Bash tool cap).
Fixes the observed ~5/6 failure rate where the wrapper returned
placeholder text like 'Monitor started' / 'dispatched, polling'
instead of the completed report. Tested with: fast-path (exit 0),
timeout path (exit 2), already-terminal, bad-id. All pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- strengthen SAFETY_HEADER: block Skill/Agent/Task delegation - add wait-and-result subcommand + simplify opencode-rescue wrapper
This was referenced Apr 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Consolidates and supersedes previously open PRs #1 (state isolation), #2 (monitor/sendPrompt hardening), #3 (auto-heal). Adds two new layers on top. Closing #1/#2/#3 in favor of this one since none received review activity and each depends on the next.
23 commits organized by theme. All from a single author addressing related reliability issues around long-running background tasks.
Commit groups
1. State isolation (supersedes #1)
4fb3c30fix(state): derive own plugin data dir to avoid cross-plugin leak2. Server + error handling
c215eaafeat(server): classify errors + auto-repair opencode.json on ensureServer743639bfix(server): relax looksTerminal; gate behind OPENCODE_STRICT_TERMINAL3c77c98fix(server): bump idle timeout to 1h, skip abort when opencode has live child0247c99fix(server): avoid busy-loop in idle watcher live-child skip31febd0fix(server): race sendPrompt against completion watcher + env-tunable timeoutsdeb3419feat(server): idle + bash-stuck detectors in sendPrompt watcher3. Worker process + job record
1e5b537fix(process): accept logFile opt in spawnDetached to capture worker stdio240b7d5feat(companion): pre-register queued job record before spawning detached worker4. Monitor + progress (supersedes #2)
bd11ee5feat: auto-start Monitor on rescue dispatch via PostToolUse hookc378dfdfeat(monitor): surface progress activity and heartbeat02160b8fix(monitor): inline result fetch + fix parse-status JS syntax6d48b0ffix(status): honor --json flag and single-task lookupcd5ef39fix(monitor): emit explicit TERMINAL=timeout echo when MAX_POLLS exhausted3915b83feat(monitor): detect vague Agent-result notifications and inject poll guidance5. Auto-heal (supersedes #3)
ec27573feat(auto-heal): reconcile stuck jobs via session terminal probe6569f11fix(auto-heal): mark job failed when server unreachable and worker dead6. Doctor / UX
08ae72bfeat(companion): add doctor + config onboarding subcommandsb124d3dfeat(status,docs): progress breadcrumbs + README quickstart7. SAFETY_HEADER (new in this PR)
7616201fix(prompts): add SAFETY_HEADER to block recursive subagent delegation8c26b9fstrengthen SAFETY_HEADER: explicitly block Skill/Agent/Task delegationRecursive-delegation failure mode: inside an opencode session, GLM-5 could see 'delegate to opencode-rescue' in the task text (inherited from Claude Code's CLAUDE.md) and try to recursively invoke a subagent/skill that doesn't exist in this runtime, stalling silently. SAFETY_HEADER prepends an explicit block on Task/Agent/Skill invocations with 'plugin:name' colon-namespaced identifiers.
8. wait-and-result subcommand (new in this PR)
61f05d6feat(companion): add wait-and-result subcommand, simplify opencode-rescue wrapper180151aMerge branch 'feat/safety-header-skill-agent-block'The opencode-rescue wrapper was failing ~5/6 dispatches by returning placeholder text like 'Monitor started' / 'Task dispatched, polling' instead of the completed report. Root cause: the 3+ Bash-call dispatch-and-poll state machine was fragile — any interruption (bash error, task-id regex miss, wrapper state loss) caused the wrapper to short-circuit with vague output, while the inner companion job kept running with no way to notify.
Fix pattern borrowed from codex-rescue: single blocking
task --waitcall. Implemented via newwait-and-resultsubcommand that polls server-side up to --max-wait seconds, returns final formatted result on terminal state (exit 0), or JSON status-running on timeout (exit 2). opencode-rescue wrapper collapses from 3-step loop to 2-step loop (task submit + wait-and-result loop up to 20×8min).Each wait-and-result call respects Claude Code's 10-minute Bash tool cap via --max-wait 480 default. For tasks >8min, the loop iterates.
Validated: 4 direct CLI tests (fast-path, timeout, already-terminal, bad-id) + Agent-tool end-to-end dispatch returns clean '## Job: ...' result in 27s.
Test plan
node opencode-companion.mjs wait-and-result <fresh-id> --max-wait 30→ exit 0 with formatted report--max-wait 3on slow task → exit 2, JSON running statusWhy consolidated
None of #1/#2/#3 received review activity in 2+ days. Groups 2-8 build on group 1's state isolation. Groups 4 (monitor) and 8 (wait-and-result) both target the vague-notification problem from different angles (runtime hook vs. server-side wait) — they coexist as defense-in-depth. Easier for maintainer to review as one atomic change than three stacked PRs + two separate follow-ups.
🤖 Generated with Claude Code