---
status: fixing
trigger: "todlando-psyche shows OFFLINE post binary handoff, wrapper pid 3108 dead"
created: 2026-04-20T02:35:00-07:00
updated: 2026-04-20T03:45:00-07:00
---

## Current Focus

hypothesis: The SessionEnd hook fired `owl cleanup-session` with OWL_SESSION_ID=14619600-..., a stale Claude Code session UUID that had been inherited through the environment all the way down to poll subprocess 3108 and written into todlando-psyche/info.json. cleanup::run scans the owlery for info.json files containing that session_id, finds todlando-psyche, calls kill_process(3108) [TerminateProcess(handle,1) → exit=1, empty stdout and stderr], and calls soft_stop_perch [removes todlando-psyche/ready]. Wrapper 42804 survived (it owns no info.json, so the session_id lookup finds nothing for it), saw that ready was gone on its next iteration, and exited cleanly.
test: Implementing Fix B: in cleanup::run, skip perches with state=Psyche so wrapper-managed psyches are not torn down by Claude Code session ending. Also implement secondary fix: update `.psyche-wrapper-<self_id>.pid` in perform_wrapper_handoff so Touch loop's pid check works after handoff.
expecting: After the fix, a SessionEnd for an old stale session_id that is still embedded in a psyche's info.json will NOT kill the poll subprocess and will NOT remove the psyche ready file. Wrapper continues running normally.
next_action: Write code changes, run `cargo test`, deploy, verify with user.

## Symptoms

expected: todlando-psyche wrapper stays alive across binary handoff (v1.8.8 → v1.8.9 → v1.8.10). Phase 18.4/18.5 shipped handoff specifically for this case.
actual: todlando-psyche OFFLINE in list-psyches output. Poll subprocess pid 3108 and its parent, wrapper pid 42804 (gen=17, rehydrated), both dead. Last log line: "cleanup complete, wrapper process exiting" at 01:40:08. Only `todlando-psyche/ready` was removed; `todlando/ready` remained.
errors:
  - poll iteration 2 (post-handoff, post-pulse 14): exit=1 empty stdout + empty stderr
  - wrapper logged: "ready file gone after empty poll, exiting"
reproduction: A live agent's wrapper has been running long enough that the OWL_SESSION_ID it inherited via env from `$LIVE start` has been superseded by one or more NEWER Claude Code sessions on the host. When the OLDER session ends (user closes that window / session rotates), its SessionEnd hook fires `owl cleanup-session` with the OLD session UUID. That UUID still matches the psyche's info.json, because every poll subprocess spawned under the wrapper's env writes it there. cleanup-session then kills the poll subprocess and soft-stops the psyche perch.
started: 2026-04-20 01:40:08 PDT.

## Eliminated

- hypothesis: handoff race (child never spawned)
  evidence: Log shows successful handoff at 01:15:44 to v1.8.10, new pid 42804 spawned, rehydrated wrapper-state.json, and ran pulse 14 successfully 20min later. Handoff worked perfectly.
  timestamp: 2026-04-20T02:38:00-07:00

- hypothesis: orphan parent_pid check killed wrapper
  evidence: Wrapper exit path was "ready file gone after empty poll, exiting" — not an INIT_SIGNOFF or orphan path. Orphan code would log differently.
  timestamp: 2026-04-20T02:38:00-07:00

- hypothesis: rate-limit cascade killed wrapper
  evidence: Log shows exit=1 "You've hit your limit" errors on pulses 6-9, between 20:52 and 21:52, but the wrapper kept running. A rate limit by itself does not kill the wrapper, so something else must have happened at 01:40:08.
  timestamp: 2026-04-20T02:38:00-07:00

- hypothesis: User-initiated `owl stop` / `live stop` command
  evidence: User explicitly confirmed no stop command was issued. Self-side todlando listener poll stayed running the whole time; only the Psyche wrapper died. The exit was voluntary (logged cleanup), not TerminateProcess against the wrapper itself.
  timestamp: 2026-04-20T03:30:00-07:00

- hypothesis: Touch loop PSYCHE_DEAD → Spine cleanup
  evidence: boot_spine::handle_spine_message PSYCHE_DEAD handler only deletes ready files when BOTH psyche AND self have no ready. User confirmed todlando/ready was still present. So this path did NOT run the ready removal.
  timestamp: 2026-04-20T03:40:00-07:00

## Evidence

- timestamp: 2026-04-20T02:38:00-07:00
  checked: todlando.log full content (298 lines)
  found: Wrapper gen 17 rehydrated successfully on v1.8.10 at 01:15:44 (handoff worked). Pulse 14 completed at 01:35:57 (exit=0). Poll iteration 2 started at 01:35:57, gate rejected echo, scheduled short pulse in 804s. At 01:40:08 (4m 11s = 251s later, well short of the scheduled 804s, so NOT a pulse firing) the poll exited code=1 with empty stdout and empty stderr, then the wrapper logged "ready file gone after empty poll, exiting". The wrapper ran its designed cleanup and exited voluntarily.
  implication: The wrapper correctly died because Self Psyche perch `todlando-psyche/` lost its `ready` file. Question: who deleted it, without terminating the wrapper process?

- timestamp: 2026-04-20T03:38:00-07:00
  checked: Grep'd all `fs::remove_file.*ready` and `soft_stop_perch` call sites in src/
  found: Two classes of ready removal: (a) the perch's OWN poll subprocess via ctrl+c / ready-gone / poison / handoff, (b) external callers via `soft_stop_perch` in owl stop / live stop / cleanup-session / boot_spine cleanup_perch / list_psyches stale-GC / doctor --fix. Wrapper's own cleanup() (lifecycle.rs:78) also calls soft_stop_perch AFTER the loop exits — but that's AFTER the ready-missing detection, not the cause.
  implication: Narrows suspect list to the external callers that could have fired during the 01:35:57-01:40:08 window without user intervention.

- timestamp: 2026-04-20T03:40:00-07:00
  checked: `cleanup::run` in src/owl/cleanup.rs — the `cleanup-session` subcommand wired to SessionEnd hook (plugin/spt/hooks/hooks.json:13-22)
  found: Scans owlery. For each perch whose info.json text contains `"session_id":"<target_session>"`: (1) calls process::kill_process(pid) — TerminateProcess(handle, 1) on Windows → exit=1 empty; (2) for state=Psyche/Live/Listener, calls soft_stop_perch which removes ready file + info-companion files. EXACTLY matches all observed symptoms: exit=1 empty + ready removed + wrapper unharmed (wrapper owns no info.json directly).
  implication: SessionEnd → cleanup-session → stale-session-match is the root cause mechanism.
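
To make the match concrete, here is a minimal sketch of the text scan described in this entry, assuming each perch is a directory under the owlery that may hold an `info.json`. The function name `perches_matching_session` and the directory layout are assumptions for illustration only; this is not the code in src/owl/cleanup.rs.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: return every perch directory under `owlery` whose
/// info.json text contains `"session_id":"<session_id>"`.
/// It mirrors the substring match described in the notes; it does not parse JSON.
fn perches_matching_session(owlery: &Path, session_id: &str) -> io::Result<Vec<PathBuf>> {
    let needle = format!("\"session_id\":\"{session_id}\"");
    let mut matches = Vec::new();
    for entry in fs::read_dir(owlery)? {
        let perch = entry?.path();
        // An entry without a readable info.json can never match.
        let Ok(text) = fs::read_to_string(perch.join("info.json")) else {
            continue;
        };
        if text.contains(&needle) {
            matches.push(perch);
        }
    }
    // Sort so the result does not depend on directory iteration order.
    matches.sort();
    Ok(matches)
}
```

A scan of this shape can only ever select perches that own an info.json. The wrapper process has none, which is consistent with the wrapper surviving while the poll subprocess and the `ready` file did not.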

- timestamp: 2026-04-20T03:42:00-07:00
  checked: info.json session_ids for all active perches
  found: todlando-psyche = 14619600-d2e1-40e7-aecb-833ff781dafb (UUID format → Claude Code session UUID). todlando (live) = fb822094-... (different UUID). doyle-psyche = c1a4dc0c-... (yet another). doyle (live) = cd4c5420-... These UUIDs were written into info.json by whichever poll subprocess was spawned under each wrapper's env — they inherit OWL_SESSION_ID from the wrapper process env, which in turn traces back via handoff chain to the ORIGINAL Claude Code session the `$LIVE start` invocation ran under.
  implication: Detached wrappers carry a STALE Claude Code session UUID in their env forever. Each time the poll subprocess is spawned it rewrites info.json with that stale UUID. Any cleanup-session fired by the original session's end will match the stale UUID and tear down the psyche perch.

- timestamp: 2026-04-20T03:42:00-07:00
  checked: Doyle comparison — why doyle-psyche survived 01:40:08
  found: doyle-psyche/info.json has a DIFFERENT session_id (c1a4dc0c-...) than todlando-psyche (14619600-...). Todlando-psyche's stale session ended at 01:40:08 (mtime on status/ dir matches exactly). Doyle-psyche's stale session didn't end in that window.
  implication: Confirms the vulnerability is a function of WHICH Claude Code session is embedded in each psyche's info.json. Every wrapper has this latent risk — it only manifests when that specific ancestor Claude session ends.

- timestamp: 2026-04-20T03:43:00-07:00
  checked: Secondary issue — `.psyche-wrapper-<self_id>.pid` post-handoff
  found: `perform_wrapper_handoff` (src/live/wrapper/mod.rs:419-476) does NOT rewrite `.psyche-wrapper-<self_id>.pid` with the new child pid. doyle's pid file still contains 44840 (pre-handoff pid, long dead). This is an independent bug in the Touch→Spine health-check chain (src/live/touch_loop.rs:124-135): Touch uses that pid file to verify a psyche's wrapper is alive during a re-poll gap. Post-handoff, that check fails → Touch would spuriously send PSYCHE_DEAD → Spine would notify Self. (Does NOT delete the ready by itself in current code, but is a latent vulnerability that should be fixed alongside the primary fix.)
  implication: Fix the stale-pid-file bug as a secondary hardening to close the Touch→Spine false-positive path.
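
As a rough illustration of the secondary fix, the sketch below rewrites the pid file once the new child has been spawned. Only the `.psyche-wrapper-<self_id>.pid` name comes from these notes. The helper names, the `dir` parameter, and the write-to-temp-then-rename step are assumptions of this sketch, not the existing `perform_wrapper_handoff` code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Pid file path for a given self id. The containing directory is an assumption.
fn wrapper_pid_path(dir: &Path, self_id: &str) -> PathBuf {
    dir.join(format!(".psyche-wrapper-{self_id}.pid"))
}

/// Hypothetical helper: record the new wrapper pid after a successful handoff.
/// Writes a temporary file first and renames it into place, so a concurrent
/// reader sees either the old pid or the new one, never a partial write.
fn write_wrapper_pid(dir: &Path, self_id: &str, child_pid: u32) -> io::Result<()> {
    let final_path = wrapper_pid_path(dir, self_id);
    let tmp_path = dir.join(format!(".psyche-wrapper-{self_id}.pid.tmp"));
    fs::write(&tmp_path, child_pid.to_string())?;
    fs::rename(&tmp_path, &final_path)
}

/// Hypothetical reader, standing in for the liveness check's first step.
/// Returns None if the file is missing or does not hold a valid pid.
fn read_wrapper_pid(dir: &Path, self_id: &str) -> Option<u32> {
    fs::read_to_string(wrapper_pid_path(dir, self_id))
        .ok()?
        .trim()
        .parse()
        .ok()
}
```

In this sketch the handoff path would call `write_wrapper_pid` with the spawned child's pid right after the detached spawn succeeds, so a later read returns the live pid instead of the pre-handoff one.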

## Resolution

root_cause: **Claude Code SessionEnd hook fired `owl cleanup-session` with OWL_SESSION_ID=14619600-d2e1-40e7-aecb-833ff781dafb — a stale Claude Code session UUID that had been env-inherited by the detached wrapper chain (through one or more `perform_wrapper_handoff` steps) and was written into todlando-psyche/info.json by whichever poll subprocess ran under that wrapper's env.** cleanup::run matched the stale UUID in the psyche's info.json, called `process::kill_process(3108)` which invokes TerminateProcess(handle,1) on Windows producing exit=1 empty stdout empty stderr (exactly the signature observed, and matching the pre-existing resolved/poll-exits-empty-killing-wrapper.md Windows signature), and soft_stop_perch which removed `todlando-psyche/ready`. The wrapper process (42804) was NOT killed because it owns no info.json directly — cleanup only matches perches via info.json scan. Wrapper 42804 observed the empty poll result + missing ready on the next iteration and voluntarily ran its designed cleanup(). Doyle survived because doyle-psyche/info.json contained a DIFFERENT stale session_id whose parent Claude session didn't happen to end in that window.

This is a **structural design flaw**: detached wrappers persist indefinitely but their env (including OWL_SESSION_ID) is frozen at spawn time. Every poll subprocess they spawn writes that stale UUID into info.json. cleanup-session's correctness assumption — that matching session_id == owned-by-this-session — is violated by long-lived detached wrappers.

fix: Two-part code change:

1. **Primary fix (src/owl/cleanup.rs):** Skip `PerchState::Psyche` perches in cleanup::run. Rationale: Psyche perches are wrapper-managed, not session-scoped. They should only be torn down via `live stop` or the wrapper's own cleanup path — never by a Claude Code SessionEnd that happens to carry a UUID the psyche's info.json still records. Session-based cleanup remains intact for Working/Listener/Live perches (the intended targets — transient working perches and listener drops).

2. **Secondary fix (src/live/wrapper/mod.rs perform_wrapper_handoff):** After successful detached spawn, write the new child pid to `.psyche-wrapper-<self_id>.pid` so the Touch→Spine re-poll-gap check (touch_loop.rs:124-135) continues to work correctly post-handoff. Prevents a separate latent false-positive PSYCHE_DEAD path.
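
The selection rule behind the primary fix can be sketched as a pure filter. The `PerchState` enum and `Perch` struct here are simplified stand-ins whose variants are taken from the state names used in these notes, and `perches_to_clean` is a hypothetical name. The real cleanup::run also kills the process and calls soft_stop_perch for each selected perch, which this sketch leaves out.

```rust
/// Stand-in for the project's perch state; variants follow the names in the notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PerchState {
    Working,
    Listener,
    Live,
    Psyche,
}

/// Simplified perch record (assumption: the real type carries more fields).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Perch {
    name: String,
    state: PerchState,
    session_id: String,
}

/// Hypothetical selection step: which perches may a SessionEnd cleanup tear down?
/// A perch qualifies only if its session_id matches the ended session AND it is
/// not a wrapper-managed Psyche perch.
fn perches_to_clean<'a>(perches: &'a [Perch], ended_session: &str) -> Vec<&'a Perch> {
    perches
        .iter()
        .filter(|p| p.session_id == ended_session)
        .filter(|p| p.state != PerchState::Psyche)
        .collect()
}
```

Keeping the rule as a side-effect-free filter makes both planned unit tests straightforward: a Psyche perch with a matching session_id must be absent from the result, and a Working perch with the same session_id must be present.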

verification: 
- Run existing `cargo test` to ensure no regressions in cleanup-session semantics.
- Add unit test: a Psyche perch with a matching session_id must NOT be cleaned by cleanup::run.
- Add unit test: a Working perch with a matching session_id MUST still be cleaned.
- Manual verification: start a live agent, note the session_id in its psyche's info.json, wait for the ancestor Claude Code session to end, and confirm the psyche's ready file survives.

files_changed:
  - src/owl/cleanup.rs
  - src/live/wrapper/mod.rs
