---
status: resolved
trigger: "Echo communes feature: agents get stuck in an endless loop at the Stop hook, and more and more owl.exe processes spin up in the background"
created: 2026-04-15T22:40:14-07:00
updated: 2026-04-15T23:22:00-07:00
---

## Current Focus

hypothesis: CONFIRMED — H1 is the root cause. `spawn_echo_commune_if_live` has no guard against being re-invoked when the echo commune's OWN `claude -p --resume` child finishes and fires a Stop hook on the SAME live agent session. H2 (not detached) is REFUTED: detach flags are correct; the visible "hang" is a symptom of the recursive hook storm.
test: Fix = environment-variable guard. Set `OWL_ECHO_COMMUNE=1` on the `claude -p --resume` child inside `run_echo_commune`; have `spawn_echo_commune_if_live` early-return when that env var is set.
expecting: Fix breaks the recursion. Each Stop hook runs once per user turn; echo commune fires once; its nested `claude -p` Stop hook sees the guard and skips.
next_action: DONE. Fix applied; `cargo check --release` clean; `cargo test --lib --release` 21/21 pass. Deploy (copy target/release/owl.exe → ~/.claude/skills/owl/owl.exe) pending — user handles via docs/DEPLOY.md.

## Symptoms

expected: Echo commune fires once after a Stop event on a normal agent session, runs detached, and completes without re-triggering itself. The parent agent is not blocked.
actual: Agents get stuck in an endless loop at the Stop hook. More and more `owl.exe` processes pile up in the background. The running agent appears blocked.
errors: None reported in user-visible output; symptom is process proliferation and agent hang.
reproduction: Invoke/enable the echo communes feature on an agent session; observe Stop hook firing repeatedly and background owl.exe process count climbing.
started: Since phase 18.1-04 introduced the echo communes feature (commit fea2efe, 2026-04-14).

## User hypotheses (to verify/refute)

1. **H1 CONFIRMED** — Stop hook fires for the `-p --resume` echo commune session itself, which creates another echo commune (recursive trigger — no guard).
2. **H2 REFUTED** — Detach flags in `spawn_echo_commune_if_live` are correct (`CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW` on Windows, `setsid()` on Unix; stdin/stdout/stderr all `Stdio::null()`; `cmd.spawn()` returns immediately). The observed "blocked agent" is a downstream symptom of H1 — the recursive Stop-hook storm blocks Claude Code's main session.

## Eliminated

- H2 (blocking/non-detached child) — verified by reading `src/owl/hook_idle.rs:86-111`: all detach flags and null I/O are in place.

## Evidence

- timestamp: 2026-04-15T22:42:00-07:00
  source: src/owl/hook_idle.rs:45-112 (spawn_echo_commune_if_live)
  finding: Function checks `info.state == PerchState::Live` and `Psyche ready`, then spawns `owl.exe _echo-commune {session_uuid} {owl_id} {psyche_id}` detached. No guard exists to detect "we are already INSIDE an echo-commune-spawned claude session."
- timestamp: 2026-04-15T22:42:05-07:00
  source: src/owl/echo_commune.rs:15-56 (run_echo_commune)
  finding: Runs `claude -p --resume {session_uuid}` against the SAME session UUID as the parent live agent. `env_remove("OWL_SESSION_ID")` on the child only prevents the env var from propagating — it does NOT prevent the Stop hook from firing when `claude -p` finishes, nor does it prevent perch lookup (Claude Code sends stdin `session_id` on the hook, and `find_perch_cached` matches by the resumed UUID directly since `info.json.session_id == session_uuid`).
- timestamp: 2026-04-15T22:42:10-07:00
  source: plugin/spt/hooks/hooks.json:33-42
  finding: The `Stop` hook is wired unconditionally to `owl.exe hook-idle`. Claude Code fires the Stop hook whenever the assistant finishes responding, INCLUDING in `-p` (print) mode invoked via `claude -p --resume`.
- timestamp: 2026-04-15T22:42:15-07:00
  source: src/common/hook_output.rs:117-144 (find_perch_by_session)
  finding: Perch lookup prefers `session_id` match; falls back to `parent_pid` ancestor walk (via `get_parent_pid()` which walks up to find nearest `claude.exe`). Both paths resolve back to the SAME live-agent perch when the child `claude -p --resume` fires its Stop hook.
- timestamp: 2026-04-15T22:42:20-07:00
  source: grep for OWL_ECHO / ECHO_COMMUNE_ACTIVE / IN_ECHO
  finding: No guard env var or any other recursion-prevention mechanism exists anywhere in the codebase.
- timestamp: 2026-04-15T22:42:25-07:00
  source: src/owl/hook_idle.rs:86-111 (detach flags)
  finding: Detach flags verified correct — `CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW` on Windows, `setsid()` on Unix, all I/O handles null. Refutes H2.
- timestamp: 2026-04-15T22:42:30-07:00
  source: src/owl/echo_commune.rs:50 (wait_with_output)
  finding: The `_echo-commune` subprocess DOES wait synchronously for its `claude -p` child (to parse [COMMUNE] markers). This is correct in isolation but contributes to process fan-out: each recursive level keeps a live `_echo-commune` AND a live `claude -p` process waiting, stacking up until system limits intervene.
- timestamp: 2026-04-15T23:20:00-07:00
  source: cargo check --release / cargo test --lib --release (post-fix)
  finding: Fix compiles cleanly (only 3 pre-existing unrelated dead-code warnings). All 21 library unit tests pass. `cargo build --release` could not replace target/release/owl.exe because an existing owl.exe (PID 92496) is running and holds the file lock; this is a Windows file-locking artifact and does not affect fix correctness — the updated object files are on disk and link the moment the running owl.exe exits.

## Root Cause

**Recursive Stop-hook trigger with no cycle guard.**

Sequence:
1. User interacts with live agent; Stop hook fires in live agent's Claude session.
2. `hook-idle` handler spawns detached `owl.exe _echo-commune {live_session_uuid} {owl_id} {psyche_id}`.
3. `_echo-commune` subprocess launches `claude -p --resume {live_session_uuid}` (same session as parent live agent) and waits.
4. When that `claude -p` response completes, Claude Code fires the Stop hook for it too.
5. That Stop hook runs `owl.exe hook-idle`, which:
   - Finds the SAME live agent perch (session_id matches, OR parent_pid ancestor walk finds the same `claude.exe`).
   - Sees `state == Live` and Psyche ready — no guard to skip.
   - Spawns ANOTHER `owl.exe _echo-commune` → goto step 3.
6. Each level of recursion adds one `_echo-commune` + one `claude -p` process. The tree fans out until the user's agent hangs waiting on hook chains / session locks.

`env_remove("OWL_SESSION_ID")` was the author's attempt at a guard but is insufficient — it only clears a cached env var, not the perch state or stdin-delivered session_id.

## Fix Plan

**Minimal guard via inherited environment variable.**

1. `src/owl/echo_commune.rs` — in `run_echo_commune`, set `cmd.env("OWL_ECHO_COMMUNE", "1")` on the `claude -p --resume` invocation. This env var is inherited by any child hook processes Claude Code spawns while that session is active.
2. `src/owl/hook_idle.rs` — in `spawn_echo_commune_if_live`, early-return when `std::env::var("OWL_ECHO_COMMUNE").is_ok()`. Place this check FIRST (before filesystem work) so the recursive hook exits cheaply.

This is a 2-line behavioral fix (plus guard comment) that:
- Breaks the recursion at depth 1.
- Does not depend on process topology, parent_pid walking, or session UUID matching.
- Leaves `setIdle_ready` intact (the Stop hook still does its primary job of marking the perch idle-ready for poll delivery).
- Preserves H2's already-correct detach flags.

### Optional hardening (not required for fix, noted for completeness):

- Could also clear `.idle-ready` setting inside the echo-commune-triggered Stop hook (skip it when `OWL_ECHO_COMMUNE=1`), since the echo commune isn't a real user-turn boundary. Tracked as nice-to-have; primary fix is the recursion guard.

## Resolution

**root_cause:** Recursive Stop-hook trigger — `spawn_echo_commune_if_live` had no guard preventing re-entry when the echo commune's own `claude -p --resume` child finished and fired its own Stop hook, which resolved back to the same live-agent perch and spawned another echo commune ad infinitum.

**fix:** Added `OWL_ECHO_COMMUNE=1` environment variable as a recursion guard. Set on the `claude -p --resume` child in `run_echo_commune`; child hook processes inherit it and `spawn_echo_commune_if_live` early-returns when it's present (check placed first, before any filesystem work). Breaks the recursion at depth 1 without altering detach semantics or the primary `set_idle_ready` behavior.

**files_changed:**
- `src/owl/echo_commune.rs` — added `cmd.env("OWL_ECHO_COMMUNE", "1")` at line 31 with explanatory comment
- `src/owl/hook_idle.rs` — added early-return env-var guard at top of `spawn_echo_commune_if_live` (lines 46-57) with explanatory comment

**verification:**
- `cargo check --release` — clean (3 pre-existing unrelated dead-code warnings only).
- `cargo test --lib --release` — 21/21 pass, 0 failed.
- `cargo build --release` — fails to replace `target/release/owl.exe` because a running owl.exe (PID 92496) holds the file lock on Windows. This does not indicate a fix problem; the lib compiled successfully and only the final link-replace step is blocked. Fix will link as soon as the running owl.exe exits, or the user can terminate that PID before deploying.

**deploy steps (for user, per docs/DEPLOY.md):**
1. Stop the running owl.exe (PID 92496) — `powershell -Command "Stop-Process -Id 92496 -Force"` (or let it exit naturally if it's a short-lived process).
2. `cargo build --release` to produce the updated binary.
3. Copy `target/release/owl.exe` → `~/.claude/skills/owl/owl.exe`.
4. No SKILL.md / LIVE-SKILL.md changes required (fix is pure Rust internals).
