Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result by suharvest · Pull Request #5 · tasict/opencode-plugin-cc

suharvest · 2026-04-19T23:25:24Z

Summary

Consolidates and supersedes previously open PRs #1 (state isolation), #2 (monitor/sendPrompt hardening), #3 (auto-heal). Adds two new layers on top. Closing #1/#2/#3 in favor of this one since none received review activity and each depends on the next.

23 commits organized by theme. All from a single author addressing related reliability issues around long-running background tasks.

Commit groups

1. State isolation (supersedes #1)

4fb3c30 fix(state): derive own plugin data dir to avoid cross-plugin leak

2. Server + error handling

c215eaa feat(server): classify errors + auto-repair opencode.json on ensureServer
743639b fix(server): relax looksTerminal; gate behind OPENCODE_STRICT_TERMINAL
3c77c98 fix(server): bump idle timeout to 1h, skip abort when opencode has live child
0247c99 fix(server): avoid busy-loop in idle watcher live-child skip
31febd0 fix(server): race sendPrompt against completion watcher + env-tunable timeouts
deb3419 feat(server): idle + bash-stuck detectors in sendPrompt watcher

3. Worker process + job record

1e5b537 fix(process): accept logFile opt in spawnDetached to capture worker stdio
240b7d5 feat(companion): pre-register queued job record before spawning detached worker

4. Monitor + progress (supersedes #2)

bd11ee5 feat: auto-start Monitor on rescue dispatch via PostToolUse hook
c378dfd feat(monitor): surface progress activity and heartbeat
02160b8 fix(monitor): inline result fetch + fix parse-status JS syntax
6d48b0f fix(status): honor --json flag and single-task lookup
cd5ef39 fix(monitor): emit explicit TERMINAL=timeout echo when MAX_POLLS exhausted
3915b83 feat(monitor): detect vague Agent-result notifications and inject poll guidance

5. Auto-heal (supersedes #3)

ec27573 feat(auto-heal): reconcile stuck jobs via session terminal probe
6569f11 fix(auto-heal): mark job failed when server unreachable and worker dead

6. Doctor / UX

08ae72b feat(companion): add doctor + config onboarding subcommands
b124d3d feat(status,docs): progress breadcrumbs + README quickstart

7. SAFETY_HEADER (new in this PR)

7616201 fix(prompts): add SAFETY_HEADER to block recursive subagent delegation
8c26b9f strengthen SAFETY_HEADER: explicitly block Skill/Agent/Task delegation

Recursive-delegation failure mode: inside an opencode session, GLM-5 could see 'delegate to opencode-rescue' in the task text (inherited from Claude Code's CLAUDE.md) and try to recursively invoke a subagent/skill that doesn't exist in this runtime, stalling silently. SAFETY_HEADER prepends an explicit block on Task/Agent/Skill invocations with 'plugin:name' colon-namespaced identifiers.

8. wait-and-result subcommand (new in this PR)

61f05d6 feat(companion): add wait-and-result subcommand, simplify opencode-rescue wrapper
180151a Merge branch 'feat/safety-header-skill-agent-block'

The opencode-rescue wrapper was failing ~5/6 dispatches by returning placeholder text like 'Monitor started' / 'Task dispatched, polling' instead of the completed report. Root cause: the 3+ Bash-call dispatch-and-poll state machine was fragile — any interruption (bash error, task-id regex miss, wrapper state loss) caused the wrapper to short-circuit with vague output, while the inner companion job kept running with no way to notify.

Fix pattern borrowed from codex-rescue: single blocking task --wait call. Implemented via new wait-and-result subcommand that polls server-side up to --max-wait seconds, returns final formatted result on terminal state (exit 0), or JSON status-running on timeout (exit 2). opencode-rescue wrapper collapses from 3-step loop to 2-step loop (task submit + wait-and-result loop up to 20×8min).

Each wait-and-result call respects Claude Code's 10-minute Bash tool cap via --max-wait 480 default. For tasks >8min, the loop iterates.

Validated: 4 direct CLI tests (fast-path, timeout, already-terminal, bad-id) + Agent-tool end-to-end dispatch returns clean '## Job: ...' result in 27s.

Test plan

node opencode-companion.mjs wait-and-result <fresh-id> --max-wait 30 → exit 0 with formatted report
Same with --max-wait 3 on slow task → exit 2, JSON running status
Already-terminal id → exit 0 fast path
Bad id → exit 1 with stderr
Dispatch via Agent(opencode:opencode-rescue) → clean report, no vague placeholder
auto-heal reconciles stuck job when worker dies mid-run
SAFETY_HEADER blocks recursive Task/Agent/Skill invocation from within opencode session

Why consolidated

None of #1/#2/#3 received review activity in 2+ days. Groups 2-8 build on group 1's state isolation. Groups 4 (monitor) and 8 (wait-and-result) both target the vague-notification problem from different angles (runtime hook vs. server-side wait) — they coexist as defense-in-depth. Easier for maintainer to review as one atomic change than three stacked PRs + two separate follow-ups.

🤖 Generated with Claude Code

When Claude dispatches an opencode rescue task (via Agent tool or direct companion Bash call), this hook detects the new task-xxx id in the tool response and injects a system-reminder instructing Claude to start a persistent Monitor covering that id. On terminal states the Monitor emits a READY line pointing to the companion result command so Claude fetches the full payload and summarizes it for the user without needing to be asked. - New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs - hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s) Gracefully no-ops on non-matching tool output or missing companion markers.

On terminal state the Monitor script now calls companion result <id> and emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS, default 1500). Claude sees the result summary directly in the Monitor event and no longer needs a follow-up Bash call. Also fixes an extra trailing ) in the inline node -e expression that would have caused the status parser to syntax-error at runtime.

… timeouts OpenCode server's POST /session/:id/message occasionally fails to close its HTTP response after the session emits the terminal assistant message (observed with glm-5 backend, opencode 1.4.x). Without this fix, sendPrompt hangs until AbortSignal fires, leaving the companion job stuck in 'investigating' status until the (previously 5 min) timeout. Changes: - Race the POST fetch against a /session/:id/message polling watcher; whichever returns first aborts the other. Watcher only accepts a completion whose info.time.completed >= prompt startedAt. - Bump generic request() timeout and sendPrompt timeout to 30 min, configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS env vars. - Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS (default 5s).

`status` handler was ignoring argv entirely — `--json` was silently dropped and positional task ids were never matched. Tooling that piped status through jq would choke on the markdown fallback with "parse error: Invalid numeric literal". Now: - `status --json` emits a workspace snapshot as JSON ({workspaceRoot, running, latestFinished, recent}) - `status <tid> [--json]` looks up a single job by id/prefix. JSON form is {workspaceRoot, job: <enriched|null>} so callers can always read .job.status safely. - `status --all` widens from session-scoped to all-sessions (useful for cross-session observers like monitor scripts) - Markdown output unchanged for the no-flag case.

Previously the Monitor script only emitted on status/phase transitions. For long-running tasks that sit in 'running/investigating' for many minutes, the user saw one initial event and then nothing — no way to tell if the task was still alive. Now: - Include the last line of progressPreview in the state signature so any new log activity inside the task triggers an event (with elapsed time + latest log snippet) - Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min) with current status/phase/elapsed even when nothing has changed - Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var

Long-running background tasks occasionally get stuck in investigating status after the OpenCode session has finished server-side (POST body never closes, watcher misses the terminal signal, or task-worker dies). - New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and transitions the local job to completed when the last assistant message has info.finish set and info.time.completed >= job.startedAt. If the task-worker PID is dead and the session is silent >60s, the job is marked failed with a clear reason. - status, result, and task-resume-candidate run a silent heal pass before reading state so they never report a false "running" for a session that is actually complete. - New `companion.mjs heal` subcommand scans and reconciles in bulk, with --dry-run / --json / --all flags. - Heal is a no-op when the server is unreachable, so offline use of status/result keeps working.

Raise the absolute prompt timeout to 4h as a pure safety cap and move real stall detection into the watcher so long-but-alive tasks aren't killed by a fixed deadline. - Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when the session shows no message/part/tool-output change for too long. - Bash-tool stuck detector: when the latest tool is a bash in status running but `opencode serve` has zero child processes for N consecutive polls (default 3 × 5s), abort. This catches the ask-permission deadlock (anomalyco/opencode#14473) where the shell process already exited cleanly but tool state never flipped to completed. Gracefully degrades on Windows or when lsof/pgrep is unavailable. - Restructure fetch-vs-watcher race so a rejection from one side no longer cancels the other. The server's 5-min POST cap used to kill sendPrompt before the watcher could observe completion; now both settle independently and we prefer whichever succeeded.

GLM-5 (and likely other models) inside an opencode session sometimes inherits routing directives from the outer Claude Code CLAUDE.md ("delegate long tasks to opencode-rescue") and tries to invoke Task with subagent_type="opencode:rescue" / "codex:rescue". Those are Claude Code skill namespaces, not opencode agents — the call errors out, then the model stalls indefinitely retrying with zero output emission, which the session-level idle watchdog only catches after 15min. buildTaskPrompt now prepends an explicit notice that routing rules have already been consumed and Claude Code subagent_types are unavailable here; execute directly instead. Applies only to prompts going through opencode-companion (task, review paths). Direct 'opencode run' CLI invocations still need to strip CLAUDE.md themselves — prefer Agent(subagent_type=opencode- rescue) for proper watchdog coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tdio Opens an append fd to the given path and passes it as stdout+stderr to the detached spawn; closes the fd in the parent immediately after fork. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…hed worker Resolves logFile path and calls upsertJob with status=queued before spawn so status/result commands see the job immediately. Updates pid after spawn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ind OPENCODE_STRICT_TERMINAL Some OpenCode versions emit terminal assistant messages without a finish field. Accept completed > 0 as sufficient when OPENCODE_STRICT_TERMINAL != "1". Applies to both sendPrompt watcher and probeSessionTerminal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… live child processes Prevents premature session termination when tool subprocesses are still running. Resets lastActivityMs and logs a reason instead of aborting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ocess is dead Replaces the unconditional skip with a dead-worker check: if job.pid is gone and updatedAt is older than STALE_IDLE_MS, upsert status=failed and return healed-failed so the job stops being retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…usted Adds OPENCODE_MONITOR_MAX_POLLS (default 120) poll counter. When the loop exits due to poll exhaustion rather than a terminal status, emits a TERMINAL=timeout line for each non-terminal job so the parent thread sees a clear signal. Guards against duplicate TERMINAL lines via terminal[] map. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The prior patch used `continue` inside the idle-threshold branch, which skipped the POLL_INTERVAL_MS sleep at loop tail and hammered the opencode server while tools were alive. Replace with a straight if/else so the reset path falls through to the normal poll interval. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CLAUDE_PLUGIN_DATA can be exported by an unrelated plugin (e.g. codex companion) into the shared env, causing opencode state to land in another plugin's data directory. Derive our own data dir from the script's install path instead, falling back to CLAUDE_PLUGIN_DATA only when it already names an opencode-scoped dir. Add OPENCODE_COMPANION_DATA as an explicit override.

…rver Two server-side robustness features so newcomers hit fewer cryptic failures: 1. errors.mjs — classifyError() annotates fetch/abort/connection errors with actionable fix hints. Wired into request() and sendPrompt() in opencode-server.mjs so "fetch failed" becomes e.g. "Aborted after Ns (OPENCODE_PROMPT_TIMEOUT_MS=...). For longer tasks set ... to a higher value." ECONNREFUSED gets the start-server hint, 401/403 points at OPENCODE_SERVER_PASSWORD, 5xx points at docker logs. 2. opencode-config.mjs — ensureOpencodeConfig() writes the headless-safe permission set (bash/edit/webfetch/external_directory all "allow") to ~/.config/opencode/opencode.json before ensureServer spawns opencode. Idempotent, atomic temp+rename, respects XDG_CONFIG_HOME. Fixes anomalyco/opencode#14473 hang that cost 48 min in one session to diagnose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New subcommands for first-run self-service: - doctor [--fix] [--json] [--verbose] — 8 health checks covering opencode binary, config permissions, server reachability, plugin data dir resolution, stuck jobs, disk. With --fix, writes missing opencode.json permissions and runs autoHealJobs. Exits 1 on failures for CI use. - config [--json] — dumps all recognized env vars with current values and source (default | env), plus resolved state dir, opencode config path, and server reachability. Makes hidden defaults discoverable. Adds getSessionLastActivity() to auto-heal.mjs (used in follow-up commit for status breadcrumbs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

handleStatus now enriches each running job with its opencode session's last activity (tool + command head + age, or text snippet) via getSessionLastActivity. render.mjs displays breadcrumb under the job line when present. Makes "investigating" actionable — users see "bash: docker exec speech ... (3s ago)" instead of a static phase label. README gets a Quickstart pointing at 'doctor' as the first thing to run, an Environment Variables table documenting all OPENCODE_* knobs with defaults, and a Pitfalls section for the two recurring surprises (stale investigating → heal; headless bash hang → doctor --fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l guidance Rescue subagents occasionally return placeholder strings ("Monitor started", "Waiting for completion", "Task forwarded in background") instead of the companion's rendered terminal report. Add a PostToolUse hook that spots those patterns, surfaces any task ids seen, and tells the main thread to poll companion status/result rather than treat the vague output as final. Updates opencode-rescue agent prompt and the result-handling/runtime skills to document the dispatch-and-poll contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Skill + Agent to the existing Task blocklist (was only Task). Adds explicit 'IGNORE that instruction' for task text that mentions delegating to opencode-rescue / codex-rescue. Covers plugin:name colon-namespaced identifiers as a class. Addresses observed sub-agent confusion where names like opencode:rescue look like plugin skills but are actually agents, causing Skill() calls to fail silently or stall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…scue wrapper Replaces the fragile 3+ Bash call dispatch-and-poll loop in opencode-rescue.md with a 2-step loop using a new server-side blocking wait-and-result subcommand. opencode-companion.mjs: - New handleWaitAndResult: polls job status every 250ms up to --max-wait seconds, returns final formatted result on terminal state (exit 0), or JSON status-running on timeout (exit 2). Bad task-id exits 1. opencode-rescue.md: - Dispatch loop collapsed from 'task --background + sleep-30-poll loop + result' to 'task --background + wait-and-result loop (max 20 rounds)' - Each iteration is one clean Bash call with server-side blocking up to 8 minutes (safe under Claude Code's 10-minute Bash tool cap). Fixes the observed ~5/6 failure rate where the wrapper returned placeholder text like 'Monitor started' / 'dispatched, polling' instead of the completed report. Tested with: fast-path (exit 0), timeout path (exit 2), already-terminal, bad-id. All pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- strengthen SAFETY_HEADER: block Skill/Agent/Task delegation - add wait-and-result subcommand + simplify opencode-rescue wrapper

suharvest and others added 23 commits April 18, 2026 15:46

fix(process): accept logFile opt in spawnDetached to capture worker s…

1e5b537

…tdio Opens an append fd to the given path and passes it as stdout+stderr to the detached spawn; closes the fd in the parent immediately after fork. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'feat/safety-header-skill-agent-block'

180151a

- strengthen SAFETY_HEADER: block Skill/Agent/Task delegation - add wait-and-result subcommand + simplify opencode-rescue wrapper

This was referenced Apr 19, 2026

fix(state): derive own plugin data dir to avoid cross-plugin leak #1

Closed

feat: post-tool-use monitor + sendPrompt watcher hardening #2

Closed

feat: auto-heal stuck jobs + idle / bash-stuck watcher detectors #3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5

Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5
suharvest wants to merge 23 commits into
tasict:mainfrom
suharvest:main

suharvest commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suharvest commented Apr 19, 2026

Summary

Commit groups

1. State isolation (supersedes #1)

2. Server + error handling

3. Worker process + job record

4. Monitor + progress (supersedes #2)

5. Auto-heal (supersedes #3)

6. Doctor / UX

7. SAFETY_HEADER (new in this PR)

8. wait-and-result subcommand (new in this PR)

Test plan

Why consolidated

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant