---
slug: binary-handoff-leaks-parent-processes
status: resolved
trigger: 16 live owl.exe processes accumulated; per-listener chain grows by ONE process per deploy; defer-prune queue growing monotonically because old owl.exe binaries still hold file locks
tdd_mode: false
goal: find_root_cause_only
created: 2026-05-15
---

# Debug Session: binary-handoff-leaks-parent-processes

## Symptoms

**Observable:** 16 live `owl.exe` processes on the user's box. Per-listener chain grows by exactly ONE process per deploy. Each chain link runs a DIFFERENT version of owl.exe; the most ancient process at the chain root still holds an exclusive file lock on its version's `owl.exe`, which is why DEPLOY.ps1's prune queue keeps growing monotonically.

```
ProcessId  ParentProcessId  CommandLine
41808      41520            ...1.9.15\owl.exe live revive dunsen
35828      41808            ...1.10.0\owl.exe poll dunsen listen --live
40568      35828            ...1.10.1\owl.exe poll dunsen listen --live
38820      40568            ...1.10.2\owl.exe poll dunsen listen --live
49116      38820            ...1.10.3\owl.exe poll dunsen listen --live
46444      49116            ...1.10.4\owl.exe poll dunsen listen --live
43980      46444            ...1.10.5\owl.exe poll dunsen listen --live
40380      43980            ...1.10.6\owl.exe poll dunsen listen --live
```

(Plus an analogous 8-deep `todlando` chain rooted at 1.10.1, and the live dunsen-psyche wrapper subtree.)

The leaf (1.10.6) is the only ACTIVE listener; all 7 ancestors are blocked in `WaitForSingleObject(child_handle, INFINITE)` inside `spawn_and_wait_inherit_stdio` (`src/common/win_spawn.rs:454` and `:686`). Per deploy: chain depth +1.

---

## Root Cause

**The stdio-relay listener handoff is working exactly as designed (D-03 / D-14 / HANDOFF-05). The chain-accumulation property is an unintended consequence of that design that Phase 18.4 never modeled.**

### Design intent (Phase 18.4 D-03, `18.4-CONTEXT.md:31-35`)

> "The listener uses a **stdio-relay** handoff. **Old process stays alive**, spawns new `owl.exe` as a child with the same argv + inherited stdin, and pipes the child's stdout/stderr to its own parent (Claude's Bash tool). **The old process exits only when the child exits.**"

> "**D-14:** Old process **exits with the child's exit code** when the child terminates. One clean background-task-completion event per handoff generation."

The motivation is Claude Code's Bash-tool background-task tracking: the BashID is bound to the original PID, and Claude Code does NOT inherit descendants into the BashID on either Unix or Windows (per the `claude-code-guide` research line in `18.4-CONTEXT.md:117`). So if the old listener exited after spawning the child, the agent's background task would complete and the new listener would be invisible to the agent.

The relay design fixes that — at the cost of pinning the old process to the BashID's wait. **For ONE deploy in a session this is fine.** For N deploys it produces an N+1-deep wait chain.

### What actually happens — line-by-line trace

1. **Listener startup** (`src/owl/poll.rs:22-552`): a process running `owl.exe poll <id> listen --live` enters the main poll loop.
2. **Handoff detection** (`src/owl/poll.rs:383`): each iteration calls `handoff::handoff_available()`, which compares `current_exe()` against the path resolved from `~/.claude/plugins/installed_plugins.json` (`src/common/handoff.rs:66-74`).
3. **On handoff** (`src/owl/poll.rs:383-446`):
   - Lines 404-405: release TCP port + ready file (so child can rebind cleanly).
   - Lines 411-412: build re-runnable argv `poll <id> listen --live` (the Phase 18.5 Bug #10 fix).
   - Lines 424-430: build envs with `OWL_HANDOFF_CHILD=1` (Bug #11 bypass for the duplicate-listener guard) and `SPT_TRAMPOLINE_GUARD=1`.
   - Lines 432-435: `#[cfg(windows)] win_spawn::spawn_and_wait_inherit_stdio(&target, &args_ref, &envs)`.
   - **Line 438:** `Ok(exit_code) => std::process::exit(exit_code)` — fires only AFTER the child terminates.
4. **The blocking wait** (`src/common/win_spawn.rs:454`): inside `spawn_and_wait_inherit_stdio`, the Windows call is `WaitForSingleObject(pi.h_process, INFINITE)`. The parent process is parked here for the entire lifetime of the child.

The child is itself a poll listener that **never voluntarily exits** under normal operation — it's the active listener for the perch. So the parent's `WaitForSingleObject` never returns, the `exit(exit_code)` at poll.rs:438 is never reached, and the old process stays alive holding (a) its `owl.exe` binary file handle, (b) a process slot, (c) all inherited stdio handles, (d) any heap state.

### Recurrence per deploy

When a new deploy lands a 1.10.7 binary and the `installed_plugins.json` flips:

- Only the LEAF (1.10.6) detects the handoff at its `handoff_available()` check — it's the only process still actively executing the poll loop. All 7 ancestors are frozen in `WaitForSingleObject`.
- 1.10.6 spawns 1.10.7 and enters `WaitForSingleObject` itself. Now the chain is 9 deep.
- This compounds linearly: chain depth after N deploys-since-last-listener-restart = N+1 processes for that perch.

The accumulation continues until the entire chain terminates together (e.g., user runs `$LIVE stop dunsen`, or the listener exits cleanly via signoff/stop-flag/poison-file). On clean teardown the leaf calls `process::exit(0)` (paths at poll.rs:177, 284, 308, 359, 366, 375, 392). Its parent's `WaitForSingleObject` then returns, the parent exits with the child's code, and the same happens at each level up the chain, propagating the exit code from leaf to root. All processes exit. But until then, every deploy adds one more frozen ancestor.
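The growth and teardown behaviour described above can be modelled without any OS calls. The following is a minimal sketch, not code from the repository: `Chain`, `start`, `deploy`, `pinned_versions`, and `stop` are hypothetical names introduced only for illustration.

```rust
/// Toy model of one perch's relay chain. Each entry is the version of one
/// process; the last entry is the active leaf, all earlier entries are
/// ancestors parked in their blocking wait.
/// (Hypothetical type for illustration; not part of the codebase.)
#[derive(Debug, Default)]
pub struct Chain {
    versions: Vec<String>,
}

impl Chain {
    /// A fresh listener start: one process, no ancestors.
    pub fn start(version: &str) -> Self {
        Chain { versions: vec![version.to_string()] }
    }

    /// A deploy flips the installed version. Only the leaf is still running
    /// the poll loop, so only the leaf notices. It spawns the new version as
    /// a child and then blocks, becoming one more frozen ancestor.
    pub fn deploy(&mut self, new_version: &str) {
        self.versions.push(new_version.to_string());
    }

    /// Total processes alive for this perch.
    pub fn depth(&self) -> usize {
        self.versions.len()
    }

    /// Versions whose binary is still held open by a waiting ancestor.
    pub fn pinned_versions(&self) -> &[String] {
        let n = self.versions.len();
        &self.versions[..n.saturating_sub(1)]
    }

    /// The leaf exits with `code`. Every ancestor's wait returns in turn and
    /// each exits with the same code, so the whole chain unwinds. Returns the
    /// code the chain root reports.
    pub fn stop(&mut self, code: i32) -> i32 {
        self.versions.clear();
        code
    }
}
```

Starting at one process and applying N deploys gives a depth of N+1, with N pinned versions; a single `stop` resets the depth to zero, which is the only way the model (and the real chain) ever shrinks.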

### Why the file-lock prune problem follows directly

Each ancestor was launched against a specific version's `owl.exe` path (e.g. PID 41808 = `1.9.15\owl.exe`). Windows keeps a running process's image file open for the lifetime of the process, which blocks deletion of that file but not a rename of its parent directory. DEPLOY.ps1's `.pending-prune-{ver}-{ts}/` rename therefore works, but the subsequent recursive-delete fails on the still-locked `owl.exe`. The prune is deferred to the next deploy, which arms ANOTHER `.pending-prune-` directory, and the cycle repeats. Disk usage grows monotonically by one cached version directory per deploy.

### Why Phase 18.4 didn't catch this

Phase 18.4 HUMAN-UAT Test 1 (`18.4-HUMAN-UAT.md:23-32`) was a **single deploy** in a fresh session:
- Pre-handoff pid: 38140 (1.8.6)
- Post-handoff pid: 40488 (1.8.7-test)
- "Both owl.exe processes exited" on stop.

The test ran one swap and stopped the listener. The relay topology (38140 alive as parent, 40488 as child) was **explicitly verified and accepted** as correct. The test did NOT iterate the swap repeatedly across deploys within a single listener generation, which is the case that exposes accumulation.

Phase 18.4/18.5 docs contain **zero** mentions of chain depth, multi-deploy accumulation, second-handoff behavior, or N-deep relays. The mental model was "one handoff per listener generation, then the listener gets restarted by the user before the next deploy."

In normal `DEPLOY.ps1` developer workflow, `Stop-Process owl.exe` runs at the start of every deploy (per D-13 in `18.4-CONTEXT.md:58`), which would have killed the entire chain and reset depth to 0 every time. **What broke that invariant is that the developer (the user here) ran multiple deploys WITHOUT manually killing live agents in between** — relying on the Phase 18.4/18.5 handoff path to keep them alive. The handoff path delivered on its "keep the agent alive across deploys" promise, but failed to model the cumulative cost of doing so.

---

## Confirmed Leak Mechanism — Line References

| Step | File:Line | Code | Effect |
|------|-----------|------|--------|
| 1 | `src/owl/poll.rs:383` | `if let Some(target) = handoff::handoff_available()` | Detect pending handoff |
| 2 | `src/owl/poll.rs:404` | `poll_listener.close();` | Release TCP port + registry |
| 3 | `src/owl/poll.rs:405` | `fs::remove_file(&ready_path);` | Remove ready file |
| 4 | `src/owl/poll.rs:411` | `build_handoff_child_argv(id, live, psyche)` | Re-runnable argv (Bug #10 fix) |
| 5 | `src/owl/poll.rs:429` | `envs.push((HANDOFF_CHILD_ENV, "1"))` | Bypass child DUPLICATE guard (Bug #11) |
| 6 | `src/owl/poll.rs:435` | `win_spawn::spawn_and_wait_inherit_stdio(...)` | Spawn child + block-wait |
| 7 | `src/common/win_spawn.rs:454` | `WaitForSingleObject(pi.h_process, INFINITE)` | **PARENT FROZEN HERE** |
| 8 | `src/owl/poll.rs:438` | `std::process::exit(exit_code)` | Unreachable until child exits |

Hypothesis ledger from the initial bug report:

- **H1 (exit call is unreachable):** Partially correct — the exit call is reachable as code but its prerequisite (`WaitForSingleObject` returning) is unsatisfied while the child is alive. The bug is NOT a missing exit-on-fork path; it's that the design REQUIRES the parent to wait. ✓
- **H2 (exit call is conditional):** No — the `Ok(exit_code) => exit(exit_code)` branch is unconditional once Wait returns.
- **H3 (stdio-relay hangs):** Yes — but "hangs" is too strong. The relay does what it was built to do: block until child exits. The child is a listener, so it doesn't exit. ✓
- **H4 (Windows Job Object inheritance):** No — `CREATE_BREAKAWAY_FROM_JOB` is set explicitly (`win_spawn.rs:553`, per `18.4-CONTEXT.md:34` and `18.4-02-SUMMARY.md:67`). Child is intentionally NOT in the parent's job. Job inheritance is not the leak.
- **H5 (trampoline confusion):** No — `SPT_TRAMPOLINE_GUARD=1` is set on every handoff spawn (`poll.rs:430`); the child's main.rs trampoline check short-circuits and dispatches into poll directly. Trampoline does not intercept anything in the relay chain.

**The true cause is structural: the chain-of-Wait pattern is the literal D-03/D-14 contract.**

---

## Edge-Case Ledger

### 1. Wrapper handoff is structurally immune (good news)

The Psyche wrapper handoff path uses `spawn-new-then-exit` (`18.4-CONTEXT.md:39`, D-06), not stdio-relay. The wrapper writes `wrapper-state.json` and exits immediately; the new wrapper rehydrates from disk. There is no chain. The user's snapshot shows "the current 1.10.6 dunsen-psyche wrapper subtree" only at one generation, confirming this works.

### 2. Touch supervisor is also immune

Same `spawn_detached_no_inherit` + state-on-disk pattern (`touch_loop.rs:67-103`). D-06/D-08 also apply.

### 3. Trampoline is immune

Trampoline uses `execvp` on Unix (process-image replacement, PID preserved) and `CreateProcess + Wait + propagate exit code` on Windows for a *short-lived command*. It's also a chain-of-Wait pattern in principle, but it's only used for one-shot CLI invocations that complete in milliseconds — not for long-running listeners. No leak.

### 4. Failed handoff (HANDOFF_FAILED branch, `poll.rs:439-444`)

`spawn_and_wait_inherit_stdio` returning `Err` triggers `output::owl_err("HANDOFF_FAILED:{id} {e}")` and `exit(1)`. Old process DOES exit cleanly on spawn failure. The leak is specific to the **successful** handoff path.

### 5. HANDOFF_DEFER child path (Bug #12 fix, `poll.rs:390-393`)

When a poll process is the wrapper's `poll_psyche` subprocess (`OWL_UNDER_WRAPPER=1`), it does NOT enter the stdio-relay branch — instead exits with code 2 and `HANDOFF_DEFER:<id>` on stderr so the outer wrapper takes over the handoff. This deferral path correctly avoids stacking a chain inside the wrapper. (But this is wrapper-internal — does not affect the Self listener chain.)

### 6. Signal handling and chain teardown

When the user runs `$LIVE stop dunsen`, only the leaf listener observes the signal (poison-file at owlery, `poll.rs:370-376`). Leaf calls `process::exit(0)`. That termination ripples up the chain: each ancestor's `WaitForSingleObject` returns child's exit code, the ancestor calls `process::exit(exit_code)`, and so on. Chain unwinds cleanly. Confirmed by UAT Test 1: "Clean process teardown on stop; both owl.exe processes exited."

### 7. Crash propagation

If any chain link crashes mid-relay, `GetExitCodeProcess` returns the crash code; ancestor propagates it via `exit(code)`. No special handling — the design just propagates. This is fine. But: if the crash happens at a middle link rather than the leaf, the leaf becomes orphaned-from-Bash and `parent_pid` checks in the wrapper's orphan detection start pointing at dead PIDs. (See impact #3 in the original bug report.)

### 8. Phase 30 stale-signoff fix is potentially affected (low risk in practice)

The signoff/commune drop-file mechanism (`poll.rs:262-287`) is per-listener. Only the leaf (active poll loop) scans `.claude/{self_id}-{commune,signoff}.md`. Ancestors are frozen in WaitForSingleObject and don't run any code, so they cannot scan-and-fire stale signoff EVENTs. **The phase 30 cleanup is not broken by chain depth** — only the active leaf participates. The original bug report's worry about "multi-generation signoff files" doesn't apply because non-leaf processes don't write signoff files.

### 9. Orphan detection ambiguity (real risk)

The Psyche wrapper's orphan check (`src/live/wrapper/orphan.rs`, called from `src/live/wrapper/mod.rs`) inspects the Self perch's `info.json` for `pid` and `parent_pid` (`get_parent_pid()` in `poll.rs:124`). After a handoff, the leaf's `parent_pid` is the previous listener PID (an ancestor in the chain), which IS alive (frozen in Wait), so `is_process_alive(parent_pid)` returns true and the wrapper sees a "live" parent. This is semantically wrong in a chain (the parent is not Claude's Bash process; it's a previous relay generation) but **does not currently cause incorrect orphan detection**, because the wrapper checks `pid` first (`info.json.pid` = current leaf PID) and consults `parent_pid` only as a fallback when `pid` is dead. By that time the entire chain has collapsed (the leaf's exit ripples up). So this is latent but not actively buggy today.
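The check order described above (`pid` first, `parent_pid` only as a fallback) can be sketched as a pure function. This is an illustration under assumptions: `OrphanVerdict` and `orphan_verdict` are hypothetical names, and the liveness probe is injected as a closure rather than calling the real `is_process_alive`.

```rust
/// Outcome of the orphan check as described in the text.
/// (Hypothetical enum; the real wrapper may represent this differently.)
#[derive(Debug, PartialEq, Eq)]
pub enum OrphanVerdict {
    /// `info.json.pid` is alive, so `parent_pid` is never consulted.
    ListenerAlive,
    /// The listener is dead but its recorded parent is still alive.
    ParentAlive,
    /// Neither the listener nor its recorded parent is alive.
    Orphaned,
}

/// Applies the two-step order: listener pid first, parent pid as fallback.
/// `is_alive` stands in for the real process-liveness probe.
pub fn orphan_verdict(
    pid: u32,
    parent_pid: u32,
    is_alive: impl Fn(u32) -> bool,
) -> OrphanVerdict {
    if is_alive(pid) {
        OrphanVerdict::ListenerAlive
    } else if is_alive(parent_pid) {
        OrphanVerdict::ParentAlive
    } else {
        OrphanVerdict::Orphaned
    }
}
```

In a relay chain the middle case should be short-lived at most: a dead leaf causes its waiting parent to exit as well, so the fallback normally lands on `Orphaned`.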

### 10. TCP port re-binding race

Each handoff has the documented sub-ms gap where no listener owns the port (D-05). Multi-deploy doesn't compound this — each handoff is a fresh single rebind. Spool fallback (P2P-04) is the documented safety net. No new bug.

---

## Proposed Fix Design (NOT APPLIED — investigation only)

Two non-mutually-exclusive directions:

### Fix A: Defang the relay chain on startup ("cleanup ancestors")

**Mechanism:** when a fresh poll listener starts (via the handoff-child path, `OWL_HANDOFF_CHILD=1`), it inspects its own process ancestor chain at startup. For each ancestor whose image is `owl.exe` AND whose argv matches `poll <same_id> listen [--live|--psyche]`, send a controlled exit signal:

1. Walk parents via `GetCurrentProcessId` → `Process32First/Next` snapshot to find `parent_pid` → repeat. Stop at first non-owl.exe parent or at the configured max depth.
2. For each owl.exe ancestor running the same listener argv: write a poison file for that PID, OR call `TerminateProcess` with exit code 0.
   - Caveat: TerminateProcess does not run destructors. The poison-file approach is gentler but the ancestor is parked in `WaitForSingleObject`, so it won't run the poison-file check loop iteration. Practical choice: TerminateProcess with exit code 0, accepting that destructors are skipped (poll listener owns no critical destructors that aren't already best-effort).
3. After ancestor cleanup, the new leaf is alone. Bash-tool BashID is still bound to the original PID (now dead). Important: Claude Code's Bash tool will see the bg task as completed if it polls PID liveness. That breaks the "no Bash-tool wake" guarantee from D-03.

**Tradeoff:** This violates the Phase 18.4 D-03 invariant ("background task active across the handoff") for the cumulative case. Acceptable IFF the wake-event is acceptable on multi-deploy day. (The user can decide; the alternative — chain growth forever — is worse.)

**Surface area:** new helper in `src/common/process.rs` (Windows: Process32 snapshot ancestry walk; Unix: read `/proc/<pid>/status` PPid chain). New invocation site in `poll.rs` startup, gated on `OWL_HANDOFF_CHILD=1`. ~80-120 lines including tests.
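The ancestry walk at the core of Fix A can be separated from the OS-specific snapshot code and exercised against a synthetic process table. The sketch below assumes a hypothetical `ProcInfo` row type and hypothetical function names (`is_same_listener`, `stale_ancestors`); `args` is taken to be the arguments after the program name. It only selects candidate PIDs and does not terminate anything.

```rust
use std::collections::HashMap;

/// One row of a process snapshot. (Hypothetical shape: on Windows it would
/// be filled from a process snapshot, on Linux from /proc.)
#[derive(Debug, Clone)]
pub struct ProcInfo {
    pub parent_pid: u32,
    pub image: String,
    /// Arguments after the program name, e.g. ["poll", "dunsen", "listen", "--live"].
    pub args: Vec<String>,
}

/// True when `args` starts with `poll <id> listen` for the given perch id.
pub fn is_same_listener(args: &[String], id: &str) -> bool {
    matches!(args, [a, b, c, ..] if a == "poll" && b == id && c == "listen")
}

/// Walks parent links upward from `self_pid` and collects owl.exe ancestors
/// that run the same listener. Stops at the first non-matching or unknown
/// parent, or once `max_depth` ancestors were collected. The bound also
/// protects against cycles caused by PID reuse in a stale snapshot.
pub fn stale_ancestors(
    table: &HashMap<u32, ProcInfo>,
    self_pid: u32,
    id: &str,
    max_depth: usize,
) -> Vec<u32> {
    let mut out = Vec::new();
    let mut cur = self_pid;
    while out.len() < max_depth {
        let Some(me) = table.get(&cur) else { break };
        let ppid = me.parent_pid;
        let Some(parent) = table.get(&ppid) else { break };
        let is_owl = parent.image.eq_ignore_ascii_case("owl.exe");
        if !is_owl || !is_same_listener(&parent.args, id) {
            break;
        }
        out.push(ppid);
        cur = ppid;
    }
    out
}
```

Run against the `dunsen` snapshot from the Symptoms section with the leaf PID 40380, this predicate stops below the chain root (PID 41808), because the root's arguments are `live revive dunsen` rather than `poll dunsen listen ...`. Whether the root should also be reaped is the third open question below.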

### Fix B: Skip stdio-relay on subsequent handoffs in the same session

**Mechanism:** on first handoff, do the current stdio-relay (preserve Bash-tool tracking for the first swap of the session). On subsequent handoffs detected within the same chain, the relay parent (existing leaf) does spawn-new-then-exit instead of spawn-and-wait.

Detection: when the current process was itself spawned as a handoff child (`OWL_HANDOFF_CHILD=1` env var was set by its parent), on its OWN handoff it uses spawn-new-then-exit instead of spawn-and-wait. This way the chain is always either 1 or 2 deep, never more.

**Tradeoff:** Second handoff onward DOES wake Claude's Bash tool. (Same downside as Fix A on second-and-later deploys.) But preserves the no-wake guarantee on the most common case (first deploy of the session).

**Surface area:** ~20 lines in `poll.rs`: check `is_handoff_child()` (already exists at lines 818-820) at handoff time, and switch to `spawn_detached_no_inherit` (already exists in `win_spawn.rs`) instead of `spawn_and_wait_inherit_stdio`. The parent that spawned the now-detached child must then `exit(0)` immediately. Trivial.
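Fix B's decision rule is small enough to state as code. The sketch below is an assumption-laden illustration: `HandoffStrategy`, `choose_strategy`, and `choose_strategy_from_env` are hypothetical names, and treating exactly the value `"1"` as the marker is an assumption about how the existing `is_handoff_child()` interprets `OWL_HANDOFF_CHILD`.

```rust
/// How the current leaf should hand off to the new binary.
/// (Hypothetical enum for illustration.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum HandoffStrategy {
    /// Spawn the child and block until it exits (the existing stdio-relay).
    RelayAndWait,
    /// Spawn the child detached, then exit immediately.
    DetachAndExit,
}

/// Fix B's rule: a process that was itself spawned as a handoff child must
/// not start another relay. `marker` is the value of OWL_HANDOFF_CHILD, if
/// set. Assumption: only the exact value "1" counts as the marker.
pub fn choose_strategy(marker: Option<&str>) -> HandoffStrategy {
    match marker {
        Some("1") => HandoffStrategy::DetachAndExit,
        _ => HandoffStrategy::RelayAndWait,
    }
}

/// Convenience wrapper that reads the marker from the process environment.
pub fn choose_strategy_from_env() -> HandoffStrategy {
    choose_strategy(std::env::var("OWL_HANDOFF_CHILD").ok().as_deref())
}
```

Keeping the rule as a pure function of the marker value makes the first-handoff and later-handoff cases testable without spawning processes.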

### Fix C: Combine A + B (recommended)

- **B** prevents future chain growth: any handoff after the first becomes spawn-and-exit. Chain depth caps at 2 (Bash-tool's tracked PID + active leaf).
- **A** cleans up the EXISTING 16-process leak: at next listener startup (whether from handoff or from `$LIVE revive`), kill same-perch owl.exe ancestors.

After deploy that ships A+B:
- First deploy after fix lands: existing 8-deep chains continue to exist. A handoff fires; new leaf at depth 9 starts. A's ancestor-scan kills depths 1-8. Chain depth = 2 (the unfortunate handoff parent + the new leaf). Bash-tool wakes once for the cleanup.
- All subsequent deploys: chain depth stays at 2 (B prevents growth).

### Tests

- Unit: ancestor-chain walker against a synthesized parent/child PID list with mixed owl/non-owl PIDs; argv-matching predicate; max-depth bound.
- Integration: `tests/handoff_integration.rs` extended with a "double-handoff" scenario — arm harness with version A, listener running, arm again with version B, assert only 2 owl.exe processes alive (the harness pid + the leaf), not 3.
- Manual smoke (replicating user's scenario): three sequential `DEPLOY.ps1` runs on a live `/spt:live` perch, observe chain stays at 2.

### Open questions for user before fix

1. Is the "Bash-tool wake on second-and-later deploys" tradeoff acceptable? The alternative (literal job-object handoff via WSADuplicateSocket / SCM_RIGHTS) was rejected in Phase 18.4 as over-engineered (`18.4-CONTEXT.md:159`). Re-opening that is feasible but ~10× the surface.
2. Should Fix A also reap the currently-stranded `.pending-prune-{ver}-{ts}/` directories on the same trigger, OR leave that to DEPLOY.ps1's existing logic?
3. Should the ancestor cleanup also kill the chain root (PID 41808 = `live revive dunsen`) when its argv differs from `poll <id> listen ...`? That's the head of the chain in the user's snapshot and holds the oldest `owl.exe` file lock (1.9.15). Yes, recommend including any owl.exe ancestor regardless of subcommand, but cap depth at e.g. 16 and refuse if argv mismatches `<id>` (defensive against unrelated owl.exe in the tree).

---

## Current Focus

Root cause confirmed. Awaiting user greenlight on fix direction before implementing.

## Evidence

- timestamp: 2026-05-15
  source: src/owl/poll.rs:378-446
  finding: handoff branch calls `spawn_and_wait_inherit_stdio` then `exit(exit_code)` on child termination. The wait-then-exit pattern is the documented D-03 design, NOT a bug.

- timestamp: 2026-05-15
  source: src/common/win_spawn.rs:454, :686
  finding: `WaitForSingleObject(pi.h_process, INFINITE)` blocks the parent for the entire child lifetime. Confirms parent cannot exit until child does.

- timestamp: 2026-05-15
  source: .planning/phases/18.4-introduce-seamless-handoff-between-old-and-new-binaries-afte/18.4-CONTEXT.md:31-35 (D-03), :35 (D-14)
  finding: Stay-alive-relay was an explicit design decision motivated by Claude Code's Bash-tool descendant-orphan behavior. The design accepts chain growth in exchange for BashID preservation.

- timestamp: 2026-05-15
  source: .planning/phases/18.4-introduce-seamless-handoff-between-old-and-new-binaries-afte/18.4-HUMAN-UAT.md:24-32
  finding: UAT validated single-handoff scenario. Two owl.exe processes alive during the relay, both exited on stop. Multi-deploy scenario never tested.

- timestamp: 2026-05-15
  source: grep over Phase 18.4/18.5 docs for "chain", "N-deep", "nested", "multi.*handoff", "accumulat"
  finding: Zero hits relating to multi-handoff chain depth. The design did not model this case.

- timestamp: 2026-05-15
  source: src/owl/poll.rs:818-820 (`is_handoff_child`), src/common/handoff.rs:33 (`HANDOFF_CHILD_ENV`)
  finding: The `OWL_HANDOFF_CHILD=1` env-var marker already exists and is observable at startup. Fix B (skip relay on subsequent handoffs) can detect "we ourselves were a handoff child" with no new state.

- timestamp: 2026-05-15
  source: src/common/win_spawn.rs (existing `spawn_detached_no_inherit`)
  finding: A detached-spawn helper already exists. Fix B can reuse it; no new Windows-API surface needed.

## Resolution

**root_cause:** The Phase 18.4 D-03/D-14 stdio-relay listener handoff is working as designed — the OLD listener stays alive in `WaitForSingleObject(child, INFINITE)` for the lifetime of the NEW listener so Claude Code's Bash-tool background-task tracking remains pinned to the original PID. The design implicitly assumes a single handoff per session. On repeated deploys within one listener generation, each handoff stacks another waiting ancestor, producing an N+1-deep process chain per perch where every ancestor pins an exclusive file lock on its version's `owl.exe`. The chain only collapses when the leaf exits (via stop / signoff / crash). No code defect; an unmodeled property of the multi-deploy steady state.

**fix (applied):** Fix D only — forward-looking sentinel exit code that propagates upward through the handoff-child chain so chain depth caps at 2.

**Mechanism:**
1. New constant `HANDOFF_DEFER_EXIT_CODE: i32 = 42` in `src/common/handoff.rs`. Value chosen to avoid collision with the existing exit-code set (0 normal/D12, 1 DUPLICATE/TCP-bind/HANDOFF_FAILED, 2 HANDOFF_DEFER under wrapper from Phase 18.5 Bug #12).
2. In `src/owl/poll.rs::run` handoff branch: if `is_handoff_child()` returns true (we are ourselves a handoff-child via `OWL_HANDOFF_CHILD=1`), tear down our perch handles and exit with the sentinel instead of spawning our own stdio-relay child. Emit `HANDOFF_BUBBLE:<id>` to stderr for log audit.
3. In the chain-root path (NOT a handoff-child), wrap the existing stdio-relay spawn-and-Wait in a respawn loop. When the child exits with the sentinel, re-resolve target from the latest `installed_plugins.json` and respawn in place. The loop reuses the existing stack slot — chain depth stays at exactly 2 (root + active leaf) regardless of how many deploys land.
4. Race handling: if `handoff_available()` returns None after a bubble (manifest reverted / now matches current_exe — extremely unlikely since /plugin update is monotonic), emit `HANDOFF_RESPAWN_RACE:<id>` and exit sentinel upward.
5. Wrapper handoff (`perform_wrapper_handoff` in `src/live/wrapper/mod.rs`) unchanged — already uses spawn-detached-then-exit (Phase 18.4 D-06), structurally immune. Touch supervisor (D-08) same.
6. Unix path updated symmetrically in `handoff::spawn_and_wait_inherit_stdio` (preserves call-site parity, even though Unix did not exhibit the leak in user reports).
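The chain-root side of the mechanism above can be sketched as a loop over injected closures, so the control flow is checkable without real processes. This is not the code from `poll.rs`: `root_respawn_loop` is a hypothetical name, `resolve` stands in for `handoff_available()`, and `spawn_and_wait` stands in for the blocking stdio-relay spawn. Only the constant's name and value come from the text.

```rust
/// Sentinel from the text: a handoff child exits with this code to ask the
/// chain root to respawn the newest binary.
pub const HANDOFF_DEFER_EXIT_CODE: i32 = 42;

/// Chain-root respawn loop (sketch). Returns the exit code the root should
/// exit with once the loop ends.
///
/// `resolve` yields the current handoff target, or None when the manifest no
/// longer points at a different binary. `spawn_and_wait` blocks until the
/// spawned child exits and yields its exit code.
pub fn root_respawn_loop<R, S>(mut resolve: R, mut spawn_and_wait: S) -> i32
where
    R: FnMut() -> Option<String>,
    S: FnMut(&str) -> std::io::Result<i32>,
{
    loop {
        // Re-resolve on every iteration so a respawn picks up the latest
        // installed version rather than the one seen at the first handoff.
        let Some(target) = resolve() else {
            // Race case from step 4: nothing to hand off to any more.
            return HANDOFF_DEFER_EXIT_CODE;
        };
        match spawn_and_wait(&target) {
            // The child bubbled the sentinel: respawn in place, no new
            // ancestor is added to the chain.
            Ok(code) if code == HANDOFF_DEFER_EXIT_CODE => continue,
            // Normal exit or crash code: propagate unchanged.
            Ok(code) => return code,
            // Spawn failure maps to exit code 1 (HANDOFF_FAILED).
            Err(_) => return 1,
        }
    }
}
```

Because the root stays in the loop while children come and go, the process count for the perch alternates between one and two and never grows with the number of deploys.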

**Commits:**
- `233ecd4` test(handoff): RED — sentinel exit code triggers parent respawn instead of normal-exit propagation
- `8182e2d` fix(handoff): GREEN — Fix D sentinel-exit-and-respawn breaks multi-deploy chain growth

**Verification:**
- `cargo test --lib -- --test-threads=1` → 356 passed, 0 failed, 1 ignored.
- `cargo test --test handoff_integration -- --test-threads=1` → 16 passed, 0 failed, 1 ignored (incl. new `handoff_child_defers_via_sentinel_exit_code`).
- `cargo build --release` → clean (5.67s).

**Files touched:**
- `src/common/handoff.rs`: added `HANDOFF_DEFER_EXIT_CODE` constant.
- `src/owl/poll.rs`: split handoff branch into handoff-child deferral path + chain-root respawn loop.
- `tests/handoff_integration.rs`: new test + `#[ignore]` negative-control placeholder.

**Out-of-scope (per user direction):**
- **No Fix A.** Existing pre-Fix-D 16-process leak (8-deep dunsen chain + 8-deep todlando chain + 1.10.6 dunsen-psyche wrapper subtree, all visible in the symptom snapshot at the top of this file) remains until the user manually drains via taskkill. That cleanup is explicit user-decision scope, not Fix D scope.
- **No prune-dir reaping.** DEPLOY.ps1's `.pending-prune-{ver}-{ts}/` retry logic is unchanged. Once the user drains the existing chain, those locks release and the next deploy's prune step will succeed naturally.
- **No root-of-chain auto-kill.** No code path now terminates owl.exe ancestors. The chain teardown still relies on the existing leaf→root exit-code propagation (which is now correct under Fix D since leaves bubble sentinel up; the loop respawns; teardown ripples normally on `$LIVE stop`).
- **No deploy.** The user decides when to deploy the new binary and when to drain the existing leak.

**Open questions from the find-root-cause phase — RESOLVED:**
1. *Bash-tool wake on second-and-later deploys acceptable?* — N/A under Fix D: the chain root keeps its Bash-tool BashID across an unbounded number of handoffs (it never exits; the respawn loop reuses its stack slot). No Bash-tool wake at all on subsequent deploys. Better than the Fix B tradeoff originally proposed.
2. *Should Fix A also reap .pending-prune dirs?* — Out of scope per user direction; deferred.
3. *Should ancestor cleanup also kill chain root when argv differs?* — Out of scope per user direction; no ancestor cleanup in Fix D.

**Edge-case ledger from investigation — re-checked under Fix D:**
- Wrapper handoff (Edge 1) — unchanged, still immune.
- Touch supervisor (Edge 2) — unchanged, still immune.
- Trampoline (Edge 3) — unchanged; the trampoline in `src/main.rs` calls `spawn_and_wait_inherit_stdio` and propagates exit codes. If a trampoline-spawned child does a Fix D handoff and bubbles sentinel up, the trampoline parent simply propagates 42 to its OWN parent (Claude's Bash tool), which sees an unusual but harmless exit code. The trampoline is one-shot CLI; no chain accumulation possible there regardless.
- Failed handoff (Edge 4 / HANDOFF_FAILED) — unchanged, still exits 1 cleanly.
- HANDOFF_DEFER child path under wrapper (Edge 5 / Bug #12) — unchanged; the wrapper-deferral check at poll.rs:390 fires BEFORE the new `is_handoff_child()` check at poll.rs:408, so under-wrapper polls still exit 2 with `HANDOFF_DEFER:<id>` as Phase 18.5 specifies.
- Signal handling (Edge 6) — unchanged; leaf-initiated teardown still ripples up normally. Under the new respawn loop, a stop signal at the chain root naturally exits the loop via the non-sentinel exit-code branch.
- Crash propagation (Edge 7) — unchanged. A child crash returns its crash code (NOT 42), so the root exits with that code rather than respawning. Equivalent to pre-Fix-D semantics for crashes.
- Phase 30 stale-signoff fix (Edge 8) — unchanged; only the leaf scans drop files.
- Orphan detection ambiguity (Edge 9) — unchanged. The wrapper's `parent_pid` fallback only triggers if `pid` is dead, which still requires full chain collapse.
- TCP port re-binding race (Edge 10) — unchanged; sub-ms gap on each respawn iteration; spool fallback covers losers (P2P-04).

Status: **resolved**. Move to `.planning/debug/resolved/`.
