---
status: resolved
trigger: "the stop hook for live agents hangs unendingly [screenshot: Nebulizing... (running stop hook · 2m 23s)]"
created: 2026-04-16T02:15:00-07:00
updated: 2026-04-16T03:35:00-07:00
related: echo-commune-stop-hook-loop.md
---

## Resolution

**root_cause:** Pipe-handle inheritance from hook to echo-commune grandchild on Windows. `src/owl/hook_idle.rs::spawn_echo_commune_if_live` used Rust's `std::process::Command` to spawn the detached `_echo-commune` grandchild. Rust's Command::spawn on Windows calls `CreateProcess` with `bInheritHandles = TRUE`, which causes the grandchild to inherit ALL inheritable open handles in the hook process — including the stdout/stderr pipe write ends that Claude Code (the hook host) created when invoking the hook. Even though the hook itself exits in milliseconds and sets the grandchild's `stdin/stdout/stderr` to `Stdio::null()`, Claude Code's read end of the pipe does NOT see EOF until ALL writers close. The grandchild keeps the pipe write handles open while it synchronously waits for `claude -p --resume --model haiku` (per `src/owl/echo_commune.rs::run_echo_commune::wait_with_output`). That wait is ~2-5 minutes. Claude Code blocks in "running stop hook · {N}s" for the full duration. NOT a WMIC/tasklist regression — that fix had already shipped (commit c8d038f) and was correctly compiled into the deployed binary.

**fix:** Replaced the `Command::spawn` call with a raw Win32 `CreateProcessW` call (Windows path only; Unix `setsid` path unchanged). Key flags / arguments:
  - `bInheritHandles = FALSE` — grandchild inherits no handles, so it cannot hold the hook's stdout/stderr pipe write ends open.
  - `dwCreationFlags = DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP | CREATE_BREAKAWAY_FROM_JOB | CREATE_NO_WINDOW | CREATE_UNICODE_ENVIRONMENT` — full detach, escape any job-object the hook host placed the hook in, no console flash.
  - `STARTUPINFOW.dwFlags = 0` — no STARTF_USESTDHANDLES, child has no inherited stdio; default NUL behavior.
  - Wide command line built with quoted exe + quoted args; `_echo-commune` payload (owl_id + two UUID-shaped session strings) has no embedded quotes/backslashes so naive `"..."` quoting is safe.
  - `PROCESS_INFORMATION.hThread` and `hProcess` closed in parent immediately after success; grandchild keeps running.
Inline helper `spawn_detached_no_inherit()` in `src/owl/hook_idle.rs`, gated `#[cfg(windows)]`. No new dependencies. Zero-dep posture preserved.

**files_changed:**
- `src/owl/hook_idle.rs` — replaced unified `Command::spawn` block with platform-split: Unix unchanged (`Command + setsid`), Windows uses new `spawn_detached_no_inherit` helper that calls raw `CreateProcessW` with `bInheritHandles = FALSE` + `CREATE_BREAKAWAY_FROM_JOB`. Detailed comment block explaining WHY the bespoke spawn is required (with reference to this debug session). +120 lines, -10 lines net.

**verification:**
- **Standalone H2 repro** (in `C:\Users\decid\AppData\Local\Temp\h2-repro\`): minimal Rust binary that mimics the hook_idle spawn pattern with a sleep-N grandchild, driven by PowerShell that reads the hook's stdout/stderr pipes to EOF (mirrors Claude Code's hook host).
  - **Unfixed pattern**: `ELAPSED_MS=8054` for an 8s grandchild sleep. Parent waited the FULL grandchild lifetime even though the hook process itself exited in <100ms. **H2 confirmed.**
  - **Fixed pattern (raw CreateProcessW, bInheritHandles=FALSE)**: `ELAPSED_MS=18` for the same 8s grandchild sleep. Parent returned immediately. Grandchild kept running detached (verified via `Get-Process -Name h2_repro` after parent return). **Fix A validated in isolation.**
- **End-to-end against real owl.exe** (driver in `C:\Users\decid\AppData\Local\Temp\h2-repro\e2e_owl.ps1`): invoke real `owl.exe hook-idle` with stdin JSON containing `doyle`'s session_id (existing live-state perch with valid Psyche peer to force the spawn path); read pipes to EOF; time wallclock.
  - **OLD owl.exe** (`~/.claude/plugins/spt/owl.exe`, hash `10BDBD84…`, the broken deployed binary): `ELAPSED_MS=303912` (5m 4s) — only returned because we killed the chain externally. Grandchildren visible in `Get-Process` while hook hung.
  - **NEW owl.exe** (`target/release/owl.exe`, hash `4F2DEB8C…`, post-fix): `ELAPSED_MS=242`. Detached `_echo-commune` (pid 33496) and its `claude -p --resume` great-grandchild (pid 70696) spawned successfully and ran independently after parent return. Test cleanup killed both before they could burn real claude budget.
  - **Speedup: 1255×** (303912ms → 242ms).
- `cargo build --release` — clean, only the 3 pre-existing dead-code warnings (resume::HookInput::source, timed_pulse::remove_pulse_by_message, timed_pulse::remove_pulse). Build time 5.77s (full re-link).
- `cargo test --lib --release` — 21/21 pass, 0 failed, 0 ignored. No regressions.

## Hypotheses summary

- **H1 (deployed binary predates c8d038f WMIC→Win32 migration):** REFUTED at 03:03:45. Binary string scan shows `CONTAINS_TOOLHELP32`, `NO_WMIC`, `NO_TASKLIST`. mtime `01:14:29` is preserved by cargo's incremental link when output bytes are unchanged — it does NOT prove the binary was built before later commits if those commits' source changes happen to produce the same bytes (unlikely here) OR if the binary was actually rebuilt after but the touched files didn't propagate to the link product (the actual scenario: c8d038f's source changes WERE in the binary, just the file's mtime wasn't bumped because the post-build link result happened to match a cached previous link).
  - Lesson learned: never trust mtime to prove "binary predates source change X" — always grep the binary itself for the strings that should/shouldn't be there.
- **H-recursion (echo-commune self-trigger):** Refuted in prior resolved session `.planning/debug/echo-commune-stop-hook-loop.md`. Guard `OWL_ECHO_COMMUNE` env var is in source AND in the deployed binary. NOT the current cause.
- **H-plugin-disabled:** Not a cause; plugin was disabled by user as a workaround AFTER observing the hang.
- **H2 (pipe-handle inheritance):** CONFIRMED empirically. See verification section.

## Symptoms

expected: Stop hook completes in <1s after live agent finishes a response. No blocking, no process pile-up.
actual: Claude Code UI shows "Nebulizing... (running stop hook · 2m 23s)" and continues indefinitely. Deterministic — happens on every Stop for every live agent.
errors: No errors printed. Silent hang. User had to disable the spt plugin via `/plugin` + `/reload-plugins` to escape.
reproduction:
  1. Start a live agent on Windows (any version where stdout/stderr inheritance behaves per spec — believed to be all current Windows versions, not specific to 26200+).
  2. Let agent finish any response (triggers Stop hook).
  3. Stop hook spawns `_echo-commune` which runs `claude -p --resume --model haiku` (~2-5 min).
  4. Claude Code's hook UI shows "running stop hook · {N}s" until the `claude -p` chain completes.
started: Since the echo-commune feature shipped (phase 18.1-04, commit fea2efe, 2026-04-14). Was masked at first by the WMIC bug (which produced an even longer hang at a different code site); only became visible/salient after c8d038f fixed the WMIC path.

## Evidence

- timestamp: 2026-04-16T02:12:00-07:00
  source: File hashes + mtimes (initial pass)
  finding: `~/.claude/plugins/spt/owl.exe`, `~/.claude/plugins/cache/cplugs/spt/1.7.2/owl.exe`, and `target/release/owl.exe` all have identical SHA256 `10BDBD84…` and mtime `2026-04-16 01:14:29`. Initially misinterpreted as "predates c8d038f"; see 03:03 re-verification.

- timestamp: 2026-04-16T03:03:45-07:00
  source: Binary string scan on `target/release/owl.exe`
  finding: `CONTAINS_TOOLHELP32`, `NO_WMIC`, `NO_TASKLIST`. Win32 migration IS in the deployed binary. **H1 REFUTED.**

- timestamp: 2026-04-16T03:04:00-07:00
  source: `src/owl/hook_idle.rs:99-124` (pre-fix spawn flags) + `grep CREATE_BREAKAWAY_FROM_JOB src/`
  finding: Spawn used `CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW`, no `CREATE_BREAKAWAY_FROM_JOB`. Rust's Command::spawn defaults to `bInheritHandles = TRUE`. Echo-commune grandchild can therefore inherit pipe write handles from the hook.

- timestamp: 2026-04-16T03:25:00-07:00
  source: Standalone repro `h2-repro` minimal Rust + PowerShell driver
  finding: Unfixed pattern: parent waits grandchild lifetime (8054ms for 8s sleep). Fixed pattern (raw CreateProcessW, bInheritHandles=FALSE, CREATE_BREAKAWAY_FROM_JOB): parent returns in 18ms. **H2 CONFIRMED.**

- timestamp: 2026-04-16T03:30:00-07:00
  source: E2E driver `e2e_owl.ps1` against real owl.exe
  finding: Old owl.exe (`10BDBD84…`): hook took 303912ms before being externally killed. New owl.exe (`4F2DEB8C…`, post-fix): hook returned in 242ms. Detached `_echo-commune` (pid 33496) and `claude -p --resume` (pid 70696) running independently after parent return. **Fix A validated end-to-end. Speedup 1255×.**

- timestamp: 2026-04-16T03:31:00-07:00
  source: `cargo test --lib --release`
  finding: 21/21 pass. No regressions from the fix.

## Out-of-scope follow-ups (NOT addressed in this fix)

The following also use `Command::spawn` with `DETACHED_PROCESS` flags but are NOT in hook code paths (they're CLI subcommands invoked by user shells, not hook hosts that wait on pipes). They likely don't reproduce the user-visible hang because the parent shell doesn't read stdout/stderr to EOF the same way, but they have the same handle-leak class:
- `src/live/boot_spine.rs:96, 178` — Psyche wrapper / spine boot detach
- `src/live/start.rs:255, 422` — `live start` command detach
- `src/live/timed_pulse.rs:61, 120` — Pulse detach

If any of these are ever invoked from a hook context, they should switch to the same raw CreateProcessW pattern. Consider extracting `spawn_detached_no_inherit` into `src/common/process.rs` as a reusable helper in a follow-up pass.

## Actions taken during this session

- Killed running owl.exe pids 44868, 79568, 103008 (test grandchildren), and the orchestrator-killed Psyche wrapper 77952 (already gone).
- Bumped `plugin/spt/.claude-plugin/plugin.json` from 1.7.2 → 1.7.3 then **REVERTED** when H1 was refuted (deploy-gap was a misdiagnosis).
- Wrote standalone H2 repro at `C:\Users\decid\AppData\Local\Temp\h2-repro\` (kept on disk for re-running).
- Wrote end-to-end driver `e2e_owl.ps1` (kept on disk).
- Edited `src/owl/hook_idle.rs` to replace Windows spawn with raw CreateProcessW.
- `cargo build --release` (5.77s clean), `cargo test --lib --release` (21/21 pass).
- E2E validated old vs new binary: 303912ms vs 242ms.

## Next steps for user (deploy not done by this session per user instruction)

1. Bump `plugin/spt/.claude-plugin/plugin.json` from 1.7.2 → 1.7.3 (was reverted earlier; re-apply for the actual deploy).
2. Confirm `target/release/owl.exe` is the post-fix binary: SHA256 `4F2DEB8CA6D51297B5D33C8BD312EAC5CE4968DE82C522DC6B6ECD57BEF257B1`, mtime ≥ `2026-04-16 03:27:19`, size 3,714,048 bytes.
3. Kill any running owl.exe (`Get-Process owl | Stop-Process -Force`) before the deploy copy step (Windows file lock).
4. Run `docs/DEPLOY.ps1` (or manual DEPLOY.md steps).
5. Wipe `~/.claude/plugins/cache/cplugs/spt/` and refresh `installed_plugins.json` `gitCommitSha` per DEPLOY.md.
6. Restart Claude Code (`$OWL`/`$LIVE` env re-sync via SessionStart hook on Windows; CLAUDE_ENV_FILE not honored on Windows per Anthropic #27987).
7. Re-enable spt plugin (`/plugin`).
8. `/reload-plugins`.
9. Validation: start fresh live agent, let it finish a response, confirm Stop hook completes in <1s (no "running stop hook · {N}s" past 1-2s). Confirm echo commune still works by checking that Psyche receives an `ECHO_COMMUNE` message ~30-90s after the Stop event.

## Related prior debug session

`.planning/debug/echo-commune-stop-hook-loop.md` (status: resolved) diagnosed the recursion bug. Guard is present and working. Current bug is a SECOND, independent issue in the same spawn path: not what the grandchild does (recursion already fixed) but how the grandchild is created (handle inheritance, fixed here).
