---
status: resolved
trigger: |
  doyle gen43 wrapper terminated immediately after starting. Log shows wrapper started
  at 23:11:24, ran init_session OK, entered first poll iteration at 23:11:50, drained
  3 stale spool messages (all content-timestamped 2026-05-15T04:48:49-07:00, ~18h old),
  detected `<EVENT type="init_signoff">` in the drained batch, fired final_session,
  and exited 33 seconds after spawn. The wrapper's startup defense
  `drain_stale_init_signoffs` (lifecycle.rs:84) WAS called at run() line 527 but did
  not catch the stale init_signoff.
created: 2026-05-15
updated: 2026-05-16
---

# Debug Session: wrapper-drains-stale-signoff

## Symptoms

- **Expected**: After `init_session` and before the poll loop, `drain_stale_init_signoffs`
  consumes all stale `init_signoff`-bearing messages from `doyle-psyche`'s spool so the
  first poll iteration does not see them.
- **Actual**: Gen43 wrapper's first poll iteration drained 3 messages from the spool;
  the second of them was `<EVENT type="init_signoff" timestamp="2026-05-15T04:48:49-07:00">`.
  The wrapper's `msg.to_ascii_lowercase().contains("init_signoff")` predicate (mod.rs:654)
  matched and called `final_session(&msg)`, which returned true, breaking the loop.
  Net effect: wrapper exits ~33s after spawn. doyle live status flips to dead immediately.
- **Errors**: None — clean exit code 0. The final psyche even correctly identified
  the body as "Stale echo. Timestamp `04:48:49` predates gen42's post-signoff amendment"
  but had already committed to teardown.
- **Timeline** (doyle.log, 2026-05-15):
  - 23:11:24  wrapper started gen=43 session=doyle-psyche-43-23-11 handoff_hydrated=false
  - 23:11:24  spawning claude init...
  - 23:11:49  claude init exited code=0 (init_session OK)
  - 23:11:49  auto-commit: git_commit_context after init_session
  - 23:11:50  poll iteration 1 starting
  - 23:11:50  poll exited code=0 stderr=DRAIN:3 queued messages (senders: unknown)
  - 23:11:50  poll returned 977 bytes
  - 23:11:50  INIT_SIGNOFF detected, firing final context save
  - 23:11:56  [PSYCHE] final (exit=0): "Stale echo. Timestamp 04:48:49 predates ..."
  - 23:11:57  wrapper exiting

  NB: `drain_stale_init_signoffs` is called at mod.rs:527 — BETWEEN the init_session
  commit at 23:11:49 and the "poll iteration 1 starting" log at 23:11:50. The function
  emits a log line `discarded stale INIT_SIGNOFF from spool for doyle-psyche` for every
  init_signoff it discards. Grep for that line in doyle.log returns ZERO hits. Confirms
  the function ran but discarded nothing.
- **Reproduction**: Any time a `doyle-psyche` spool contains an `init_signoff`-bearing
  message that is positioned AFTER any non-`init_signoff` message (ordering by row id,
  oldest first), the next wrapper to spawn for `doyle` will hit this bug.

## Current Focus

hypothesis: |
  `drain_stale_init_signoffs` has a fatal early-break: it stops at the FIRST
  non-`init_signoff` message it drains. The function (lifecycle.rs:84-111):

  ```rust
  loop {
      match spool::drain_one(&self.psyche_id, &owlery) {
          Ok(Some(msg)) => {
              if msg.to_ascii_lowercase().contains("init_signoff") {
                  self.log("discarded stale INIT_SIGNOFF ...");
                  continue;  // keep draining
              }
              // Non-INIT_SIGNOFF message -- re-spool it
              let from = msg.strip_prefix("__REPLY_TO__:")...;
              let body = msg.splitn(2, '\n').nth(1).unwrap_or("");
              let _ = spool::spool_message(&self.psyche_id, from, body, &owlery);
              break;  // <-- BUG: stops here, leaves remaining messages in spool
          }
          _ => break,
      }
  }
  ```

  When the spool order is `[echo_commune, init_signoff, msg]`:
  1. `drain_one()` returns the echo_commune (no "init_signoff" substring).
  2. Function respools the echo_commune (now at the END of the queue) and BREAKS.
  3. Remaining `init_signoff` + `msg` rows still have `delivered = 0`.
  4. Wrapper enters poll loop, spawns `owl poll doyle-psyche` subprocess.
  5. That subprocess drains ALL undelivered via `drain_all_with_metadata`
     (poll.rs:146), emits 3 EVENT lines including the init_signoff.
  6. Wrapper's `init_signoff` predicate at mod.rs:654 matches → final_session
     → wrapper exits.

  The structural defect is a "respool-and-stop" design that assumes init_signoffs are
  always at the FRONT of the queue. Any time another message (echo_commune, msg, pulse,
  file_drop, anything) is queued BEFORE a stale init_signoff in the spool, the defense
  fails open.

test: read lifecycle.rs:84-111, mod.rs:527 (call site), mod.rs:654 (init_signoff
  predicate), spool.rs:96-124 (drain_one impl — UPDATE delivered=1 on a single oldest
  row), spool.rs:227-250 (drain_all_with_metadata — same `WHERE delivered = 0` query,
  drains ALL rows). poll.rs:146-181 confirms the subprocess emits all drained rows as
  EVENT lines on first poll. doyle.log lines 1-25 confirm the timing matches the
  hypothesis exactly (init_session at 23:11:49, drain_stale ran silently between
  23:11:49 and 23:11:50, poll iter 1 then DRAIN:3 → init_signoff detected).

expecting: |
  Fix in `drain_stale_init_signoffs`: instead of `drain_one` + respool-and-break,
  use `peek_all` + `mark_delivered` to atomically mark only the init_signoff-bearing
  rows as delivered, leaving all other rows untouched at `delivered = 0`. This
  preserves spool ordering, avoids the respool that re-queues echo_communes to the
  back, and discards every stale init_signoff regardless of position.

next_action: |
  Apply fix in two atomic commits via GSD conventions:

  Commit A (test, RED) — `drain_stale_init_signoffs_drains_init_signoff_behind_non_init_signoff`
    in src/live/wrapper/lifecycle.rs `#[cfg(test)] mod tests`. SPT_HOME-isolated
    tempdir per existing convention. Pre-populate the doyle-psyche spool with
    three rows in this order: (a) anonymous `<EVENT type="echo_commune">body</EVENT>`,
    (b) anonymous `<EVENT type="init_signoff" timestamp="...">...</EVENT>`,
    (c) anonymous `<EVENT type="msg" from="unknown"></EVENT>`. Call the (refactored)
    drain helper. Assert post-state: row (b) is `delivered = 1`; rows (a) and (c)
    are `delivered = 0` and still at their original (or non-respooled) positions.
    Run with `cargo test drain_stale_init_signoffs_drains_init_signoff_behind_non_init_signoff
    -- --test-threads=1`. Expect RED on current code.

  Commit B (fix, GREEN) — rewrite the loop body to:
    1. Read all undelivered rows via `spool::peek_all(&psyche_id, &owlery)`.
    2. Filter for rows whose `body` (or formatted `__REPLY_TO__:{from}\n{body}` form
       — match the same shape `drain_one` returns so the case-insensitive
       contains-check is consistent) lowercased contains `"init_signoff"`.
    3. Collect their msg_ids; call `spool::mark_delivered(&psyche_id, &ids, &owlery)`.
    4. Log one summary line: `discarded N stale INIT_SIGNOFFs from spool for {psyche_id}`
       when N > 0 (matches existing log idiom; preserves grep-ability).
    Re-run the regression test → expect GREEN. Run `cargo test --lib` to confirm no
    other tests regressed.

  Both commits land on main per GSD-quick conventions. No deploy needed for the test
  cycle; deploy with `docs/DEPLOY.ps1 -Bump patch` once both commits are green and the
  user greenlights v1.10.9.

## Evidence

- timestamp: 2026-05-15T23:11:24..57
  observation: |
    Full gen43 wrapper log (25 lines) shows the bug end-to-end. Notable: no
    "discarded stale INIT_SIGNOFF from spool" log line anywhere. That string is
    the ONLY emission point in `drain_stale_init_signoffs` (lifecycle.rs:93).
    Its absence is hard proof that the function drained nothing — and the
    subsequent DRAIN:3 in the poll-iter-1 stderr proves messages WERE in the
    spool when the function ran.
  source: C:\Users\decid\AppData\Local\spt\logs_latest\doyle.log lines 1-25.

- timestamp: 2026-05-15 (code read)
  observation: |
    `drain_stale_init_signoffs` at src/live/wrapper/lifecycle.rs:84-111:
    Loop body uses `spool::drain_one` (single oldest row).
    When the drained body contains "init_signoff": logs + `continue` (keep draining).
    When the drained body does NOT contain "init_signoff": respools the message
    (line 105 `spool::spool_message(..., from, body, &owlery)`) then `break` at
    line 106. The break exits the loop, leaving any remaining undelivered rows
    in the spool untouched.
  source: src/live/wrapper/lifecycle.rs:84-111.

- timestamp: 2026-05-15 (code read)
  observation: |
    `spool::drain_one` (src/common/spool.rs:96-124) uses
    `SELECT id, from_id, body FROM messages WHERE delivered = 0 ORDER BY id ASC LIMIT 1`,
    then `UPDATE messages SET delivered = 1 WHERE id = ?1`. So messages are
    "drained" by marking delivered=1, not by DELETE. Respool inserts a NEW row
    with a NEW id at the end — original-id row remains delivered=1 but ordering
    moves a respooled message to the back of the queue.
  source: src/common/spool.rs:96-124.

- timestamp: 2026-05-15 (code read)
  observation: |
    `spool::drain_all_with_metadata` (src/common/spool.rs:227-250) uses the
    SAME `WHERE delivered = 0 ORDER BY id ASC` predicate (no LIMIT) and marks
    each row delivered=1 as it iterates. Called by `owl poll <id> listen` at
    poll.rs:146 in the post-bind drain-spool stanza. Every undelivered row at
    that moment is emitted as an `<EVENT type="msg" ...>...</EVENT>` line via
    `emit_event_line` (poll.rs:170).
  source: src/common/spool.rs:227-250; src/owl/poll.rs:139-182.

- timestamp: 2026-05-15 (code read)
  observation: |
    Wrapper's live-loop init_signoff predicate at src/live/wrapper/mod.rs:654:
      `if msg.to_ascii_lowercase().contains("init_signoff") {
          if self.final_session(&msg) { break; }
      }`
    The `msg` here is the COMBINED stdout of the poll subprocess — all 3 drained
    EVENT lines concatenated. Substring match on lowercased text is satisfied
    by the init_signoff EVENT regardless of position within the multi-event blob.
  source: src/live/wrapper/mod.rs:643-658.

- timestamp: 2026-05-15 (code read)
  observation: |
    `spool::peek_all` (spool.rs:255-272) and `spool::mark_delivered` (spool.rs:276-290)
    exist and are the natural primitives for a position-independent
    "discard-only-init_signoff" implementation. peek_all returns
    (msg_id, from_id, body, created_at) tuples without changing state;
    mark_delivered atomically flips a list of ids to delivered=1 within
    `BEGIN IMMEDIATE` / `COMMIT`.
  source: src/common/spool.rs:255-290.

- timestamp: 2026-05-15 (cross-check vs. prior debug)
  observation: |
    Prior session `psyche-wrapper-init-signoff-loop.md` (resolved 2026-05-13) fixed
    the wrapper's live-loop init_signoff predicate from
    `line.trim_start().starts_with("INIT_SIGNOFF")` to `msg.contains("INIT_SIGNOFF")`
    (later refined to `to_ascii_lowercase().contains("init_signoff")`). That fix is
    the REASON the gen43 wrapper now correctly DETECTS init_signoff in the drained
    batch — which is desirable for the legitimate live-signoff path but disastrous
    for the stale-batch path because `drain_stale_init_signoffs` is supposed to
    prevent stale init_signoffs from ever reaching that predicate.
    The two fixes are complementary: predicate fix made the live path reliable;
    THIS fix makes the stale-startup defense reliable. Both must hold.
  source: .planning/debug/resolved/psyche-wrapper-init-signoff-loop.md;
    src/live/wrapper/mod.rs:643-658 (current predicate).

- timestamp: 2026-05-15 (cross-check vs. file-drop path)
  observation: |
    Prior session `stale-signoff-fires-on-next-session-start.md` (resolved 2026-05-15)
    added `drain_stale_signoff_file` (src/live/start.rs:90) as the parallel defense
    for the Phase 30 .claude/{id}-signoff.md FILE-DROP path. That fix covers a
    DIFFERENT path (cwd-side file-drop, not spool-side message) and does NOT touch
    the spool. The current bug is in the spool defense (lifecycle.rs), not the
    file-drop defense (start.rs). Two paths, two defenses, two bugs — the file-drop
    defense was newly added and works; the spool defense was pre-existing and has
    the early-break flaw.
  source: .planning/debug/resolved/stale-signoff-fires-on-next-session-start.md;
    src/live/start.rs:90-130.

## Eliminated

- **The 3 messages were spooled DURING gen43's startup race (between
  drain_stale_init_signoffs and first poll iteration)**: ELIMINATED. The log
  shows init_session completed at 23:11:49 and the very next event is
  "poll iteration 1 starting" at 23:11:50 — a 1-second window. The 3 drained
  messages carry content timestamps `04:48:49` (~18h earlier). No external
  process delivered messages with backdated content stamps in that 1-second
  window. The messages were already in the spool when the wrapper started.

- **`drain_stale_init_signoffs` was never called for gen43**: ELIMINATED. The
  call site at mod.rs:527 is unconditional (always runs except on
  `hydrated_from_handoff` path — and the gen43 wrapper log line 1 shows
  `handoff_hydrated=false`). The function ran; it just discarded nothing
  because of the early-break bug.

- **Predicate substring mismatch (e.g. "init_signoff" vs "INIT_SIGNOFF")**:
  ELIMINATED. Both lifecycle.rs:92 and mod.rs:654 use
  `msg.to_ascii_lowercase().contains("init_signoff")` — the same idiom. The
  predicate is correct in both places; the defect is purely the early-break
  short-circuit in lifecycle.rs.

- **The poll subprocess re-spooled (rather than drained) the init_signoff
  row, so it kept coming back**: ELIMINATED. `drain_all_with_metadata` at
  spool.rs:245 marks every drained row `delivered = 1` in the same scan.
  After the gen43 first-iter drain the rows are gone (from the
  `delivered = 0` working set). The bug fires exactly once per stale batch.

- **The handoff path leaked stale messages from gen42's wrapper-state.json**:
  ELIMINATED. `handoff_hydrated=false` on line 1. Cold-start path.
  wrapper-state.json was not consulted.

- **`emit_event_line` double-wrap (echo-commune-double-envelope bug)**:
  ELIMINATED for THIS bug. The double-wrap bug was fixed (poll.rs:611-635
  `body_is_typed_event_envelope` passthrough). The drained messages here are
  raw EVENT envelopes, not double-wrapped. The init_signoff predicate matches
  regardless of wrapping anyway.

## ROOT CAUSE (confirmed)

`src/live/wrapper/lifecycle.rs::drain_stale_init_signoffs` (lines 84-111) breaks
out of its drain loop the first time it encounters a non-`init_signoff` message.
The non-`init_signoff` message is respooled (re-inserted at the end of the queue)
and the function returns, leaving any remaining undelivered rows — including stale
`init_signoff` rows — in the spool with `delivered = 0`. The next poll subprocess
drains them via `drain_all_with_metadata` (poll.rs:146), the combined stdout reaches
the wrapper's live-loop init_signoff predicate (mod.rs:654), and the wrapper fires
`final_session` and exits.

The early-break design implicitly assumes init_signoff messages are always at the
FRONT of the spool queue. This held in early prototypes where the wrapper-Self
signoff path was the only producer of messages on the psyche perch, but no longer
holds: echo-commune dispatch (b438306), Phase 30 file_drop EVENTs (d6bd054), pulse
triggers, and direct `$OWL send doyle-psyche ...` calls can all interleave queued
rows in arbitrary order. Any non-init_signoff row ahead of a stale init_signoff
row defeats the defense.

## Resolution

status: root_cause_found
specialist: rust
fix_target: |
  Replace the drain_one + respool-and-break loop with a peek-and-targeted-mark approach:

  ```rust
  pub fn drain_stale_init_signoffs(&self) {
      let owlery = owlery::owlery_dir();
      let rows = match spool::peek_all(&self.psyche_id, &owlery) {
          Ok(r) => r,
          Err(_) => return,  // best-effort defense; never crash boot
      };
      let stale_ids: Vec<i64> = rows.iter()
          .filter(|(_, _, body, _)| body.to_ascii_lowercase().contains("init_signoff"))
          .map(|(id, _, _, _)| *id)
          .collect();
      if !stale_ids.is_empty() {
          let _ = spool::mark_delivered(&self.psyche_id, &stale_ids, &owlery);
          self.log(&format!(
              "discarded {} stale INIT_SIGNOFF(s) from spool for {}",
              stale_ids.len(), self.psyche_id
          ));
      }
  }
  ```

  Behavior change: drains every stale init_signoff regardless of position; never
  respools non-init_signoff rows (they sit at their original ids, will be picked
  up by the wrapper's first poll iteration in their original order). Preserves
  ordering. Atomic via mark_delivered's BEGIN IMMEDIATE transaction. Single log
  line summarizing the count (matches existing idiom; preserves grep-ability).

files_to_inspect:
  - src/live/wrapper/lifecycle.rs (drain_stale_init_signoffs body, lines 84-111;
    add unit test in `#[cfg(test)] mod tests` at line 152)
  - src/common/spool.rs (peek_all line 255, mark_delivered line 276 — primitives
    used by fix; no edits needed)
  - src/live/wrapper/mod.rs (call site at line 527; no edits needed)
  - tests/ — no new integration test required; lifecycle.rs unit test with
    SPT_HOME tempdir + ENV_LOCK is sufficient (existing convention).

verification_plan:
  - cargo test --lib drain_stale_init_signoffs -- --test-threads=1
    (new regression test asserts: stale init_signoff BEHIND non-init_signoff
    rows in the spool gets discarded; non-init_signoff rows remain at
    delivered=0 with original msg_ids preserved)
  - cargo test --lib -- --test-threads=1 (full suite, ensure no regression)
  - cargo build --release
  - Manual: spawn doyle live, drop an init_signoff via
    `$OWL send doyle-psyche <<<'<EVENT type="init_signoff">stale</EVENT>'`,
    THEN drop a benign message via `$OWL send doyle-psyche <<<'hello'`,
    THEN kill the wrapper without letting it drain (taskkill on the wrapper PID),
    THEN restart via `$LIVE start doyle`. Expect: wrapper starts, drain_stale
    logs `discarded 1 stale INIT_SIGNOFF(s) from spool for doyle-psyche`,
    first poll iter drains the 'hello' message normally, wrapper stays alive.
  - Verify the previously-resolved psyche-wrapper-init-signoff-loop regression
    (live-loop init_signoff predicate) still passes — no overlap with this fix.

## Resolution Applied

Resolution complete via two atomic commits on main:

- d6e8bc1 test(wrapper): RED — drain_stale_init_signoffs leaks signoff behind echo_commune
  Adds regression test drain_stale_init_signoffs_drains_init_signoff_behind_non_init_signoff
  in src/live/wrapper/lifecycle.rs tests module. SPT_HOME-isolated tempdir + local
  ENV_LOCK / SptHomeSnapshot helpers (private to the lifecycle test module so the
  test file stays self-contained). Test seeds [echo_commune, init_signoff, msg] into
  the doyle-psyche spool, calls drain_stale_init_signoffs, asserts post-state has
  zero undelivered rows containing init_signoff AND exactly 2 non-signoff rows
  remain (no respool duplication). Fails RED on pre-fix code at lifecycle.rs:106
  with the init_signoff row at id=2 leaking through.

- 7da2cfe fix(wrapper): GREEN — drain stale INIT_SIGNOFFs position-independently
  Rewrites drain_stale_init_signoffs to use spool::peek_all + spool::mark_delivered
  instead of drain_one + respool-and-break. Position-independent: every undelivered
  row whose body lowercased contains init_signoff is atomically marked delivered=1
  via BEGIN IMMEDIATE / COMMIT in a single mark_delivered call. Non-signoff rows
  remain at delivered=0 in their original positions so the wrapper first poll
  iteration picks them up normally. Single summary log line discarded N stale
  INIT_SIGNOFF(s) from spool for {psyche_id} preserves grep-ability.

  Adds two companion tests covering interleaved-multiple-signoffs (mixed lowercase
  EVENT envelopes + uppercase legacy literal) and empty-spool no-op. All three
  drain_stale_init_signoffs tests GREEN. Full lib suite: 360 passed / 0 failed / 1
  ignored (up from 358 baseline; +1 from RED commit + 2 from companion tests).
  cargo build --release clean.

Defense-in-depth picture (three orthogonal entry points, three parallel defenses):

| Path                                    | Defense                                | Location                                |
|-----------------------------------------|----------------------------------------|-----------------------------------------|
| Live-loop init_signoff predicate        | msg.to_ascii_lowercase().contains      | src/live/wrapper/mod.rs:654             |
| Cold-start file-drop .claude/{id}-signoff.md | drain_stale_signoff_file (surface+unlink) | src/live/start.rs:90               |
| Spool startup defense (THIS fix)        | drain_stale_init_signoffs peek+mark    | src/live/wrapper/lifecycle.rs:84        |

The three defenses are independent and complementary. Each path was previously
covered by a debug session: psyche-wrapper-init-signoff-loop (resolved 2026-05-13),
stale-signoff-fires-on-next-session-start (resolved 2026-05-15), and
wrapper-drains-stale-signoff (this session, resolved 2026-05-16).

Verification:

- cargo test --lib drain_stale_init_signoffs -- --test-threads=1 → 3 passed.
- cargo test --lib -- --test-threads=1 → 360 passed / 0 failed / 1 ignored.
- cargo build --release → clean (3 pre-existing dead-code warnings only).
- Manual verification deferred to next live-agent spawn; the RED-then-GREEN
  test pair pins the structural contract directly against the spool primitives,
  which is the only path the wrapper consults at startup.
