Skip to content

Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5

Open
suharvest wants to merge 23 commits into
tasict:mainfrom
suharvest:main
Open

Consolidated hardening suite: state isolation, server+monitor+auto-heal, SAFETY_HEADER, wait-and-result#5
suharvest wants to merge 23 commits into
tasict:mainfrom
suharvest:main

Conversation

@suharvest

Copy link
Copy Markdown

Summary

Consolidates and supersedes previously open PRs #1 (state isolation), #2 (monitor/sendPrompt hardening), #3 (auto-heal). Adds two new layers on top. Closing #1/#2/#3 in favor of this one since none received review activity and each depends on the next.

23 commits organized by theme. All from a single author addressing related reliability issues around long-running background tasks.

Commit groups

1. State isolation (supersedes #1)

  • 4fb3c30 fix(state): derive own plugin data dir to avoid cross-plugin leak

2. Server + error handling

  • c215eaa feat(server): classify errors + auto-repair opencode.json on ensureServer
  • 743639b fix(server): relax looksTerminal; gate behind OPENCODE_STRICT_TERMINAL
  • 3c77c98 fix(server): bump idle timeout to 1h, skip abort when opencode has live child
  • 0247c99 fix(server): avoid busy-loop in idle watcher live-child skip
  • 31febd0 fix(server): race sendPrompt against completion watcher + env-tunable timeouts
  • deb3419 feat(server): idle + bash-stuck detectors in sendPrompt watcher

3. Worker process + job record

  • 1e5b537 fix(process): accept logFile opt in spawnDetached to capture worker stdio
  • 240b7d5 feat(companion): pre-register queued job record before spawning detached worker

4. Monitor + progress (supersedes #2)

  • bd11ee5 feat: auto-start Monitor on rescue dispatch via PostToolUse hook
  • c378dfd feat(monitor): surface progress activity and heartbeat
  • 02160b8 fix(monitor): inline result fetch + fix parse-status JS syntax
  • 6d48b0f fix(status): honor --json flag and single-task lookup
  • cd5ef39 fix(monitor): emit explicit TERMINAL=timeout echo when MAX_POLLS exhausted
  • 3915b83 feat(monitor): detect vague Agent-result notifications and inject poll guidance

5. Auto-heal (supersedes #3)

  • ec27573 feat(auto-heal): reconcile stuck jobs via session terminal probe
  • 6569f11 fix(auto-heal): mark job failed when server unreachable and worker dead

6. Doctor / UX

  • 08ae72b feat(companion): add doctor + config onboarding subcommands
  • b124d3d feat(status,docs): progress breadcrumbs + README quickstart

7. SAFETY_HEADER (new in this PR)

  • 7616201 fix(prompts): add SAFETY_HEADER to block recursive subagent delegation
  • 8c26b9f strengthen SAFETY_HEADER: explicitly block Skill/Agent/Task delegation

Recursive-delegation failure mode: inside an opencode session, GLM-5 could see 'delegate to opencode-rescue' in the task text (inherited from Claude Code's CLAUDE.md) and try to recursively invoke a subagent/skill that doesn't exist in this runtime, stalling silently. SAFETY_HEADER prepends an explicit block on Task/Agent/Skill invocations with 'plugin:name' colon-namespaced identifiers.

8. wait-and-result subcommand (new in this PR)

  • 61f05d6 feat(companion): add wait-and-result subcommand, simplify opencode-rescue wrapper
  • 180151a Merge branch 'feat/safety-header-skill-agent-block'

The opencode-rescue wrapper was failing ~5/6 dispatches by returning placeholder text like 'Monitor started' / 'Task dispatched, polling' instead of the completed report. Root cause: the 3+ Bash-call dispatch-and-poll state machine was fragile — any interruption (bash error, task-id regex miss, wrapper state loss) caused the wrapper to short-circuit with vague output, while the inner companion job kept running with no way to notify.

Fix pattern borrowed from codex-rescue: single blocking task --wait call. Implemented via new wait-and-result subcommand that polls server-side up to --max-wait seconds, returns final formatted result on terminal state (exit 0), or JSON status-running on timeout (exit 2). opencode-rescue wrapper collapses from 3-step loop to 2-step loop (task submit + wait-and-result loop up to 20×8min).

Each wait-and-result call respects Claude Code's 10-minute Bash tool cap via --max-wait 480 default. For tasks >8min, the loop iterates.

Validated: 4 direct CLI tests (fast-path, timeout, already-terminal, bad-id) + Agent-tool end-to-end dispatch returns clean '## Job: ...' result in 27s.

Test plan

  • node opencode-companion.mjs wait-and-result <fresh-id> --max-wait 30 → exit 0 with formatted report
  • Same with --max-wait 3 on slow task → exit 2, JSON running status
  • Already-terminal id → exit 0 fast path
  • Bad id → exit 1 with stderr
  • Dispatch via Agent(opencode:opencode-rescue) → clean report, no vague placeholder
  • auto-heal reconciles stuck job when worker dies mid-run
  • SAFETY_HEADER blocks recursive Task/Agent/Skill invocation from within opencode session

Why consolidated

None of #1/#2/#3 received review activity in 2+ days. Groups 2-8 build on group 1's state isolation. Groups 4 (monitor) and 8 (wait-and-result) both target the vague-notification problem from different angles (runtime hook vs. server-side wait) — they coexist as defense-in-depth. Easier for maintainer to review as one atomic change than three stacked PRs + two separate follow-ups.

🤖 Generated with Claude Code

suharvest and others added 23 commits April 18, 2026 15:46
When Claude dispatches an opencode rescue task (via Agent tool or direct
companion Bash call), this hook detects the new task-xxx id in the tool
response and injects a system-reminder instructing Claude to start a
persistent Monitor covering that id. On terminal states the Monitor
emits a READY line pointing to the companion result command so Claude
fetches the full payload and summarizes it for the user without needing
to be asked.

- New plugins/opencode/scripts/post-tool-use-monitor-hook.mjs
- hooks.json: register PostToolUse (matcher: Agent|Bash, timeout 5s)

Gracefully no-ops on non-matching tool output or missing companion markers.
On terminal state the Monitor script now calls companion result <id> and
emits the truncated summary inline (bounded by OPENCODE_MONITOR_RESULT_CHARS,
default 1500). Claude sees the result summary directly in the Monitor
event and no longer needs a follow-up Bash call.

Also fixes an extra trailing ) in the inline node -e expression that
would have caused the status parser to syntax-error at runtime.
… timeouts

OpenCode server's POST /session/:id/message occasionally fails to close
its HTTP response after the session emits the terminal assistant message
(observed with glm-5 backend, opencode 1.4.x). Without this fix,
sendPrompt hangs until AbortSignal fires, leaving the companion job
stuck in 'investigating' status until the (previously 5 min) timeout.

Changes:
- Race the POST fetch against a /session/:id/message polling watcher;
  whichever returns first aborts the other. Watcher only accepts a
  completion whose info.time.completed >= prompt startedAt.
- Bump generic request() timeout and sendPrompt timeout to 30 min,
  configurable via OPENCODE_REQUEST_TIMEOUT_MS / OPENCODE_PROMPT_TIMEOUT_MS
  env vars.
- Completion poll interval configurable via OPENCODE_COMPLETION_POLL_MS
  (default 5s).
`status` handler was ignoring argv entirely — `--json` was silently
dropped and positional task ids were never matched. Tooling that piped
status through jq would choke on the markdown fallback with "parse
error: Invalid numeric literal".

Now:
- `status --json` emits a workspace snapshot as JSON ({workspaceRoot,
  running, latestFinished, recent})
- `status <tid> [--json]` looks up a single job by id/prefix. JSON
  form is {workspaceRoot, job: <enriched|null>} so callers can always
  read .job.status safely.
- `status --all` widens from session-scoped to all-sessions (useful
  for cross-session observers like monitor scripts)
- Markdown output unchanged for the no-flag case.
Previously the Monitor script only emitted on status/phase transitions.
For long-running tasks that sit in 'running/investigating' for many
minutes, the user saw one initial event and then nothing — no way to
tell if the task was still alive.

Now:
- Include the last line of progressPreview in the state signature so
  any new log activity inside the task triggers an event (with elapsed
  time + latest log snippet)
- Emit a heartbeat every HEARTBEAT_POLLS ticks (default 10 = ~5min)
  with current status/phase/elapsed even when nothing has changed
- Both tunable via OPENCODE_MONITOR_HEARTBEAT_POLLS env var
Long-running background tasks occasionally get stuck in investigating
status after the OpenCode session has finished server-side (POST body
never closes, watcher misses the terminal signal, or task-worker dies).

- New lib/auto-heal.mjs probes GET /session/:id/message?limit=1 and
  transitions the local job to completed when the last assistant
  message has info.finish set and info.time.completed >= job.startedAt.
  If the task-worker PID is dead and the session is silent >60s, the
  job is marked failed with a clear reason.
- status, result, and task-resume-candidate run a silent heal pass
  before reading state so they never report a false "running" for a
  session that is actually complete.
- New `companion.mjs heal` subcommand scans and reconciles in bulk,
  with --dry-run / --json / --all flags.
- Heal is a no-op when the server is unreachable, so offline use of
  status/result keeps working.
Raise the absolute prompt timeout to 4h as a pure safety cap and move
real stall detection into the watcher so long-but-alive tasks aren't
killed by a fixed deadline.

- Idle timeout (OPENCODE_IDLE_TIMEOUT_MS, default 15min): abort when
  the session shows no message/part/tool-output change for too long.
- Bash-tool stuck detector: when the latest tool is a bash in status
  running but `opencode serve` has zero child processes for N
  consecutive polls (default 3 × 5s), abort. This catches the
  ask-permission deadlock (anomalyco/opencode#14473) where the shell process
  already exited cleanly but tool state never flipped to completed.
  Gracefully degrades on Windows or when lsof/pgrep is unavailable.
- Restructure fetch-vs-watcher race so a rejection from one side no
  longer cancels the other. The server's 5-min POST cap used to kill
  sendPrompt before the watcher could observe completion; now both
  settle independently and we prefer whichever succeeded.
GLM-5 (and likely other models) inside an opencode session sometimes
inherits routing directives from the outer Claude Code CLAUDE.md
("delegate long tasks to opencode-rescue") and tries to invoke Task
with subagent_type="opencode:rescue" / "codex:rescue". Those are
Claude Code skill namespaces, not opencode agents — the call errors
out, then the model stalls indefinitely retrying with zero output
emission, which the session-level idle watchdog only catches after
15min.

buildTaskPrompt now prepends an explicit notice that routing rules
have already been consumed and Claude Code subagent_types are
unavailable here; execute directly instead.

Applies only to prompts going through opencode-companion (task,
review paths). Direct 'opencode run' CLI invocations still need to
strip CLAUDE.md themselves — prefer Agent(subagent_type=opencode-
rescue) for proper watchdog coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tdio

Opens an append fd to the given path and passes it as stdout+stderr to the
detached spawn; closes the fd in the parent immediately after fork.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hed worker

Resolves logFile path and calls upsertJob with status=queued before spawn so
status/result commands see the job immediately. Updates pid after spawn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ind OPENCODE_STRICT_TERMINAL

Some OpenCode versions emit terminal assistant messages without a finish
field. Accept completed > 0 as sufficient when OPENCODE_STRICT_TERMINAL != "1".
Applies to both sendPrompt watcher and probeSessionTerminal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… live child processes

Prevents premature session termination when tool subprocesses are still
running. Resets lastActivityMs and logs a reason instead of aborting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ocess is dead

Replaces the unconditional skip with a dead-worker check: if job.pid is gone
and updatedAt is older than STALE_IDLE_MS, upsert status=failed and return
healed-failed so the job stops being retried.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…usted

Adds OPENCODE_MONITOR_MAX_POLLS (default 120) poll counter. When the loop
exits due to poll exhaustion rather than a terminal status, emits a
TERMINAL=timeout line for each non-terminal job so the parent thread sees a
clear signal. Guards against duplicate TERMINAL lines via terminal[] map.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The prior patch used `continue` inside the idle-threshold branch, which skipped
the POLL_INTERVAL_MS sleep at loop tail and hammered the opencode server while
tools were alive. Replace with a straight if/else so the reset path falls
through to the normal poll interval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE_PLUGIN_DATA can be exported by an unrelated plugin (e.g. codex
companion) into the shared env, causing opencode state to land in
another plugin's data directory. Derive our own data dir from the
script's install path instead, falling back to CLAUDE_PLUGIN_DATA only
when it already names an opencode-scoped dir. Add OPENCODE_COMPANION_DATA
as an explicit override.
…rver

Two server-side robustness features so newcomers hit fewer cryptic failures:

1. errors.mjs — classifyError() annotates fetch/abort/connection errors with
   actionable fix hints. Wired into request() and sendPrompt() in
   opencode-server.mjs so "fetch failed" becomes e.g. "Aborted after Ns
   (OPENCODE_PROMPT_TIMEOUT_MS=...). For longer tasks set ... to a higher
   value." ECONNREFUSED gets the start-server hint, 401/403 points at
   OPENCODE_SERVER_PASSWORD, 5xx points at docker logs.

2. opencode-config.mjs — ensureOpencodeConfig() writes the headless-safe
   permission set (bash/edit/webfetch/external_directory all "allow") to
   ~/.config/opencode/opencode.json before ensureServer spawns opencode.
   Idempotent, atomic temp+rename, respects XDG_CONFIG_HOME. Fixes
   anomalyco/opencode#14473 hang that cost 48 min in one session to diagnose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New subcommands for first-run self-service:

- doctor [--fix] [--json] [--verbose] — 8 health checks covering opencode
  binary, config permissions, server reachability, plugin data dir resolution,
  stuck jobs, disk. With --fix, writes missing opencode.json permissions and
  runs autoHealJobs. Exits 1 on failures for CI use.
- config [--json] — dumps all recognized env vars with current values and
  source (default | env), plus resolved state dir, opencode config path, and
  server reachability. Makes hidden defaults discoverable.

Adds getSessionLastActivity() to auto-heal.mjs (used in follow-up commit for
status breadcrumbs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleStatus now enriches each running job with its opencode session's last
activity (tool + command head + age, or text snippet) via
getSessionLastActivity. render.mjs displays breadcrumb under the job line
when present. Makes "investigating" actionable — users see "bash: docker
exec speech ... (3s ago)" instead of a static phase label.

README gets a Quickstart pointing at 'doctor' as the first thing to run, an
Environment Variables table documenting all OPENCODE_* knobs with defaults,
and a Pitfalls section for the two recurring surprises (stale investigating
→ heal; headless bash hang → doctor --fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l guidance

Rescue subagents occasionally return placeholder strings ("Monitor started",
"Waiting for completion", "Task forwarded in background") instead of the
companion's rendered terminal report. Add a PostToolUse hook that spots those
patterns, surfaces any task ids seen, and tells the main thread to poll
companion status/result rather than treat the vague output as final. Updates
opencode-rescue agent prompt and the result-handling/runtime skills to
document the dispatch-and-poll contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Skill + Agent to the existing Task blocklist (was only Task).
Adds explicit 'IGNORE that instruction' for task text that mentions
delegating to opencode-rescue / codex-rescue.
Covers plugin:name colon-namespaced identifiers as a class.

Addresses observed sub-agent confusion where names like opencode:rescue
look like plugin skills but are actually agents, causing Skill() calls
to fail silently or stall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scue wrapper

Replaces the fragile 3+ Bash call dispatch-and-poll loop in
opencode-rescue.md with a 2-step loop using a new server-side blocking
wait-and-result subcommand.

opencode-companion.mjs:
  - New handleWaitAndResult: polls job status every 250ms up to --max-wait
    seconds, returns final formatted result on terminal state (exit 0),
    or JSON status-running on timeout (exit 2). Bad task-id exits 1.

opencode-rescue.md:
  - Dispatch loop collapsed from 'task --background + sleep-30-poll loop +
    result' to 'task --background + wait-and-result loop (max 20 rounds)'
  - Each iteration is one clean Bash call with server-side blocking up
    to 8 minutes (safe under Claude Code's 10-minute Bash tool cap).

Fixes the observed ~5/6 failure rate where the wrapper returned
placeholder text like 'Monitor started' / 'dispatched, polling'
instead of the completed report. Tested with: fast-path (exit 0),
timeout path (exit 2), already-terminal, bad-id. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- strengthen SAFETY_HEADER: block Skill/Agent/Task delegation
- add wait-and-result subcommand + simplify opencode-rescue wrapper
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant