# M5-D4 — shell sleep/wake + owner cascade (JIT plan)

**Status:** **CLOSED 2026-06-04** — D4a (`2fe4573` + zombie-liveness hardening
`78e9e18`) · D4b (`2399eeb`) · D4c (`ba4dadf`) · D4d + sleep/wake E2E (`8bfa933`) —
all CI-green-FINAL. REQ-SHELL-2 `[impl, unit, int]` satisfied. Successor:
`M5-D5-PLAN.md` (resting arms). Upstream: D3 closed (`1271023`,
CI-green-FINAL — see `M5-D3-PLAN.md`). D4 makes the shell lifecycle real: link-break
close semantics, ephemeral-vs-persistent divergence, `relink`, the `wake_command`
wake-watcher, state-keyed wake resolution, and the owner cascade driven from both ends
(`spt shutdown` / `api owner-shutdown`). Authoritative model: **CONTEXT §Shell model**
(~line 231 lifecycle, §Shell sleep/wake ~line 258), `docs/MANIFEST.md` §Shell adapters
(sleep/wake paragraph), M5-PLAN §D4. Requirement: **REQ-SHELL-2** (activates
`[impl, unit]` at D4a).

## Goal

A shell instance sleeps and wakes from either end:

- **Sleep.** The owner suspends (`spt suspend` / `spt shutdown` / `api owner-shutdown`)
  → every online shell closes gracefully (`pre_close` instruction + `close_timeout_ms`
  window, then force-kill). **Ephemeral** shells tear down with history erased;
  **persistent** shells go offline, re-linkable.
- **While offline.** A manifest `wake_command` watcher runs (mutually exclusive with the
  binary).
- **Wake.** The watcher's wake-opcode exit triggers state-keyed wake resolution
  (dormant → online shell; suspended → revive owner, then online; no reachable
  instance → documented refusal). `persistent` shells auto-online on the owner's wake
  edge; `relink` is the manual online switch.
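
The state-keyed wake resolution named in the goal can be read as a pure decision over the owner's local rest record. The sketch below is illustrative only: `WakePlan` and `plan_wake` are hypothetical names, `RestState` is redeclared locally to mirror the enum listed in the code map, and `None` stands in for "no local owner perch". The `Active` arm follows the locked decision further down (Active ⇒ online the shell).

```rust
/// Local mirror of the `RestState {Active, Dormant, Suspended}` enum from the
/// code map, redeclared so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestState {
    Active,
    Dormant,
    Suspended,
}

/// Hypothetical result type: what wake resolution decides to do.
#[derive(Debug, PartialEq, Eq)]
pub enum WakePlan {
    /// Owner session is already running: bring the shell online only.
    OnlineShell,
    /// Owner is suspended: revive it first, then bring the shell online.
    ReviveOwnerThenOnline,
    /// No local owner perch: refuse with a documented code.
    Refuse(&'static str),
}

/// Hypothetical pure planner. `owner_state` is `None` when the owner has no
/// perch on the local node.
pub fn plan_wake(owner_state: Option<RestState>) -> WakePlan {
    match owner_state {
        Some(RestState::Active) | Some(RestState::Dormant) => WakePlan::OnlineShell,
        Some(RestState::Suspended) => WakePlan::ReviveOwnerThenOnline,
        None => WakePlan::Refuse("WAKE_NO_REACHABLE_INSTANCE"),
    }
}
```

Keeping the decision separate from its side effects (the owner wake event, the shell launch) makes each arm unit-testable without a daemon.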

## Researched code map (verified 2026-06-04)

- `spt-daemon::shellhost` — `launch_shell` (relay detached / stdin brokered; parks
  `shell.pid` + `link.token` on the perch), `fill_spawn_command` (mints fresh token),
  `bind_shell_by_token` (flips ONLINE), `verify_link`, `kill_shell_pid`/`kill_shell_at`
  (interim close: `taskkill /T /F` · `kill -9` — D4a wraps these in the graceful close).
- `spt-daemon::resting` — `RestState {Active,Dormant,Suspended}`, `RestEvent`,
  `daemon_rest_event(id, event, knob)` (the transition host: echo gate at the rest edge,
  `fire_wake_effects` at wake). Call sites: `cli.rs cmd_rest` (`spt suspend/wake`),
  `attach.rs` (detach/attach edges), `dispatch.rs`. **The cascade hooks land here.**
- `spt-runtime::manifest::Shell` — ALL D4 fields already parsed: `ephemeral`,
  `persistent`, `wake_command`, `pre_close`, `close_timeout_ms`, `can_shutdown`.
- `spt-daemon::grants` — `CAP_OWNER_SHUTDOWN` + `author_can_shutdown_grant(store,
  shell_id, node)` exist (D1c) — **no call site yet**; D4d wires authoring at spawn and
  the gate at `api owner-shutdown`.
- `spt-store::shellinfo` — `ShellInfo.status` online|offline, `teardown_shell` removes
  the perch dir entirely (spool.db included — the ephemeral history-erase primitive).
- `spt shell relink` — CLI shape exists, refuses with `SHELL_RELINK_PENDING` (cli.rs
  ~2054). `spt suspend/wake <id>` exist (local-only, `cmd_rest`). `api shutdown` exists
  (M2b live-agent signoff: final echo-commune then soft-stop — `live::cmd_shutdown`).
- Backoff pattern to copy: `spt-net::pairing::ratelimit::backoff_secs` (exponential,
  base × 2^(n-1), saturating cap).
- Detached spawn: `daemon::detached_no_inherit(program, &[String]) -> io::Result<u32>`
  (KH 5.6 — the only safe detached spawn on Windows).
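
For reference, the backoff shape named above (exponential, base × 2^(n-1), saturating cap) might look like the following. This is a sketch under assumptions, not the actual `spt-net::pairing::ratelimit::backoff_secs`: the function name `sketch_backoff_secs`, the parameter order, and the 1-based `attempt` convention are all hypothetical.

```rust
/// Hypothetical sketch of an exponential backoff: base * 2^(attempt-1),
/// clamped to `cap_secs`. `attempt` is 1-based; 0 is treated as "no wait".
pub fn sketch_backoff_secs(attempt: u32, base_secs: u64, cap_secs: u64) -> u64 {
    if attempt == 0 {
        return 0;
    }
    // `checked_shl` returns None once the shift reaches the bit width, so a
    // very large attempt count saturates instead of overflowing.
    let factor = 1u64.checked_shl(attempt - 1).unwrap_or(u64::MAX);
    base_secs.saturating_mul(factor).min(cap_secs)
}
```

With a 1s base and a 60s cap this yields 1, 2, 4, 8, 16, 32 and then stays at 60.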

## Decisions (locked)

- **Owner "offline" = `Suspended`.** Dormant is resting-warm (session running) — shells
  stay online through dormancy. The cascade fires on the **edge into Suspended**
  (from Active or Dormant); `persistent` auto-online fires on the **wake edge into
  Active**. Both hooks live in `resting::daemon_rest_event` (the single transition
  host all suspend paths already route through), best-effort like the existing echo
  gate + wake effects — a cascade failure is loud but never wedges the rest state.
- **Graceful close routine** (`shellhost::close_shell`): (1) if `pre_close` declared,
  compose a dedicated **`shell_close` frame** carrying the instruction and deliver per
  receipt mode (relay: spool — the binary drains it during the grace window; stdin:
  `send_effect` via the broker session). `pre_close` is manifest-declared machinery,
  **not** vocabulary-checked (the vocab gates *agent* commands; the manifest is its own
  authority). (2) wait up to `close_timeout_ms` (absent ⇒ 0 — immediate) polling child
  liveness; (3) force-kill (`kill_shell_at`), remove `shell.pid`, flip status offline.
  The binary **never** survives a link-break (CONTEXT) — force-kill is unconditional at
  window end.
- **Ephemeral vs persistent at close:** after the binary closes — `ephemeral` ⇒
  `teardown_shell` (perch + spool gone = history erase; the mint slot frees);
  `persistent` (or plain non-ephemeral) ⇒ status=offline, perch intact, re-linkable.
- **`relink` = the online switch, no re-approval:** resolve ref (alias|canonical, owner
  exclusivity holds) → must be offline + adapter still registered → re-fill spawn
  (mints a **fresh** link token — stale tokens die with the old binary) → launch per
  receipt mode → bind flips online (launch alone does not). Spawn gates do **not**
  re-fire (they govern *minting*; this instance was approved at mint — the cap counts
  existing instances, and this one already counts). Kills any running wake-watcher
  first (mutual exclusivity).
- **Wake-watcher home: new `spt-daemon::shellwake`,** daemon-hosted. Per-watcher
  **supervision thread** in the daemon (KH 7.4 stance: own thread, never the control
  loop), child spawned with the filled `wake_command` template; `waker.pid` parked on
  the perch (one watcher per offline instance — pid file is the lock; stale-pid check
  on reconcile). Exit-opcode supervision: **`WAKE_OPCODE = 86`** (documented in
  MANIFEST.md at D4b) → wake resolution; any other exit → respawn with exponential
  backoff (`1s × 2^(n-1)`, cap 60s — the ratelimit pattern) and **give-up after 6
  consecutive crash-exits** (counter resets on any shell activity: relink, spawn,
  owner wake). Watcher is killed on online-flip (relink/bind/auto-online) and on
  teardown.
- **Reconcile, not events-only:** `shellwake::reconcile(owlery)` sweeps all owners'
  shells — offline + `wake_command` + no live watcher ⇒ start one; online/torn-down +
  live watcher ⇒ kill it. Called at daemon boot (crash recovery — watchers are daemon
  children and die with it) and nudged after every lifecycle flip (close/relink/
  cascade). Status flips happen in CLI-process library code, so the nudge is
  best-effort (`ensure_running` + a control-channel poke; if that is missed, the next
  boot/tick catches it). The invariant holder is the reconciler, not the caller.
- **Wake resolution (D4c, state-keyed on the owner's local rest record):**
  `read_rest(owner_perch)` — `Dormant` ⇒ online the shell only; `Suspended` ⇒
  `daemon_rest_event(owner, Wake)` (existing wake effects: resurface + freshness pull)
  then online the shell; `Active` ⇒ online the shell. **No local owner perch** ⇒
  refuse with `WAKE_NO_REACHABLE_INSTANCE` naming the deferral (the
  `shell_wake_spawn_anywhere` fresh-spawn branch rides instantiate-anywhere; D1c's
  `author_wake_spawn_anywhere` grant shape is the seam). The *active-elsewhere
  cross-node attach* arm upgrades at D6 (MRA resolution) / D8c (cross-node link) —
  D4 resolves against the local node and says so in the refusal text.
- **`spt shutdown <id>` = the graceful self-suspend verb:** runs the live signoff
  boundary when the endpoint is live (the existing `api shutdown` echo-commune
  machinery), then `daemon_rest_event(id, Suspend)` — which now cascades shells.
  Distinct from `spt suspend` only by the signoff leg; both ride the same edge so the
  cascade is owned by the transition host, not the verbs.
- **`api owner-shutdown <shell-id> --link <token>`:** machinery surface (AuthFlags do
  not apply — the link token IS the auth, same as poll/emit). `verify_link` → resolve
  owner → `grants::check(CAP_OWNER_SHUTDOWN, shell_id, node, None)` — **fail-closed:
  no grant ⇒ refuse** (no escalation prompt: the manifest flag is the only authoring
  path, CONTEXT "the flag *is* the consent") → `daemon_rest_event(owner, Suspend)`.
  The firing shell goes offline in its own cascade — expected, documented in help.
  Grant authoring: spawn (D3d gate chain tail) calls `author_can_shutdown_grant` iff
  manifest `can_shutdown = true`; teardown revokes it.
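
The "reconcile, not events-only" decision reduces to a per-shell rule that can be sketched as a pure function. The names here (`ShellStatus`, `WatcherAction`, `reconcile_one`) are hypothetical and only encode the two rules stated above; combinations the plan does not mention are left untouched in this sketch, which is an assumption.

```rust
/// Hypothetical view of a shell's lifecycle status as the reconciler sees it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShellStatus {
    Online,
    Offline,
    TornDown,
}

/// Hypothetical action the reconciler takes for one shell.
#[derive(Debug, PartialEq, Eq)]
pub enum WatcherAction {
    Start,
    Kill,
    Leave,
}

/// Decide what to do about one shell's wake-watcher.
/// - offline + `wake_command` declared + no live watcher => start one
/// - online or torn down + live watcher => kill it
/// - anything else => leave as-is (assumption: the plan names no other rule)
pub fn reconcile_one(
    status: ShellStatus,
    has_wake_command: bool,
    watcher_alive: bool,
) -> WatcherAction {
    match (status, has_wake_command, watcher_alive) {
        (ShellStatus::Offline, true, false) => WatcherAction::Start,
        (ShellStatus::Online, _, true) | (ShellStatus::TornDown, _, true) => {
            WatcherAction::Kill
        }
        _ => WatcherAction::Leave,
    }
}
```

Because the rule is idempotent, running it at boot and again after every lifecycle nudge is safe: a second pass over an already-consistent shell returns `Leave`.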

## Pieces (build order — one CI-green slice each)

1. **D4a — link-break close + divergence + relink + owner cascade**
   (`REQ-SHELL-2` impl/unit begins). `shellhost::close_shell` (pre_close frame +
   timeout + force-kill); ephemeral teardown vs persistent offline; `shell relink`
   un-stubbed; cascade hooks in `daemon_rest_event` (suspend edge ⇒ close all online
   shells; wake edge ⇒ relaunch `persistent` offline shells); `shell teardown` on an
   online instance routes through `close_shell` first.
2. **D4b — wake-watcher** (`spt-daemon::shellwake`). Watcher spawn/supervise thread,
   `WAKE_OPCODE` exit → wake resolution call, crash-exit exponential backoff +
   give-up, `waker.pid` exclusivity, `reconcile` at daemon boot + lifecycle nudges,
   mutual exclusivity with the binary both directions. Mock waker bin (adapters/mock
   third `[[bin]]` or arg-mode of mock-shell) for tests. MANIFEST.md gains the
   wake-opcode value.
3. **D4c — state-keyed wake resolution** (`REQ-SHELL-2` completes impl/unit).
   `shellwake::resolve_wake(owlery, owner, shell_id)` per the locked decision;
   refusal text documents the spawn-anywhere deferral; D6 MRA upgrade seam noted at
   the fn signature.
4. **D4d — `spt shutdown` + `api owner-shutdown`.** CLI verb (signoff leg + Suspend
   edge); api machinery command gated by `CAP_OWNER_SHUTDOWN` fail-closed;
   `author_can_shutdown_grant` wired at spawn, revoked at teardown.
5. **E2E extension** (closes D4 acceptance, rides the last slice): full cycle in
   `shell_e2e.rs` — spawn → online → `spt shutdown` → pre_close evidence + offline →
   watcher up → wake-opcode exit → owner revived + shell online → ephemeral twin torn
   down. Crash-exit backoff observed with a deliberately-crashing waker.
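
The D4b exit-opcode supervision (item 2) can be sketched as a classification of each watcher exit. `Supervise`, `on_watcher_exit`, and the constant names other than `WAKE_OPCODE` are hypothetical. The sketch assumes "give-up after 6 consecutive crash-exits" means the sixth crash in a row stops respawning; it also leaves the counter reset (relink, spawn, owner wake) to the caller. The `Option<i32>` input matches what `std::process::ExitStatus::code()` returns, which is `None` when a Unix child is killed by a signal.

```rust
/// Exit code that means "wake the shell" (value from the locked decision).
pub const WAKE_OPCODE: i32 = 86;
/// Assumed reading: the 6th consecutive crash-exit gives up.
pub const GIVE_UP_AFTER: u32 = 6;
/// Backoff cap in seconds (from the locked decision).
pub const BACKOFF_CAP_SECS: u64 = 60;

/// Hypothetical outcome of supervising one watcher exit.
#[derive(Debug, PartialEq, Eq)]
pub enum Supervise {
    ResolveWake,
    RespawnAfterSecs(u64),
    GiveUp,
}

/// Classify a watcher exit. `prior_crashes` is the number of consecutive
/// crash-exits seen before this one; any non-wake exit counts as a crash.
pub fn on_watcher_exit(exit_code: Option<i32>, prior_crashes: u32) -> Supervise {
    if exit_code == Some(WAKE_OPCODE) {
        return Supervise::ResolveWake;
    }
    let n = prior_crashes.saturating_add(1);
    if n >= GIVE_UP_AFTER {
        return Supervise::GiveUp;
    }
    // 1s * 2^(n-1), clamped to the cap; checked_shl guards large shifts.
    let factor = 1u64.checked_shl(n - 1).unwrap_or(u64::MAX);
    Supervise::RespawnAfterSecs(factor.min(BACKOFF_CAP_SECS))
}
```

Under this reading the delays before give-up are 1, 2, 4, 8 and 16 seconds, so the 60s cap only matters if the give-up threshold is raised.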

## THIS slice — D4a (link-break close + divergence + relink + cascade)

- **`shellchan::compose_close_frame(instruction)`** — new `shell_close` EVENT type
  (spt-proto taxonomy), MAC-stamped like every link frame. Not vocab-checked (above).
- **`shellhost::close_shell(owlery, owner, shell_id, shell: &Shell)`** — the locked
  routine: pre_close deliver (relay spool / stdin send_effect; http unreachable — spawn
  already refuses http) → liveness-poll up to `close_timeout_ms` (absent ⇒ 0) →
  `kill_shell_at` + pid-file cleanup → divergence: `ephemeral` ⇒ `teardown_shell`,
  else status=offline. Returns what it did (`Closed{torn_down: bool}`) for cascade
  reporting. Brokered (stdin) shells: broker session end + pid kill — both, belt and
  suspenders (the session label is `<owner>/shells/<id>`).
- **`relink`** (cli.rs `ShellCmd::Relink` un-stub): offline-only (online ⇒
  `SHELL_ALREADY_ONLINE`), registered-adapter check (deregistered ⇒ refuse naming
  `adapter add`), fresh token via `fill_spawn_command`, launch per receipt, status
  stays offline until bind (the D3b contract). Owner exclusivity negative carried.
- **Cascade hooks** in `resting.rs` (`daemon_rest_event`, beside the echo gate):
  - edge → `Suspended`: for each owner shell with status=online ⇒ `close_shell`
    (divergence applies per-shell). Best-effort per shell; failures reported, edge
    never wedged.
  - edge → `Active` via `Wake`: for each owner shell `persistent` + status=offline ⇒
    relaunch (the relink machinery, no fresh approval). Non-persistent stay offline
    (watcher's job, D4b).
  - Manifest resolution by `adapter_name` through the registered set; an
    unregistered-but-existing instance closes with defaults (no pre_close, 0 timeout).
- **`shell teardown` on an online instance** routes through `close_shell` before the
  perch removal (today it pid-kills then removes — keep the force path as fallback).
- **Activate `REQ-SHELL-2` `[impl, unit]`** in `traceable-reqs.toml` in this commit;
  tag `[impl->REQ-SHELL-2]` / `[unit->REQ-SHELL-2]` on the real evidence.
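
The control flow of the close routine (pre_close delivery, liveness poll up to the window, force-kill, divergence) can be sketched against an abstract process handle. Everything here is hypothetical scaffolding rather than the real `shellhost::close_shell`: the `ShellProc` trait, `CloseSpec`, `close_shell_sketch`, and the `FakeShell` test double are invented for illustration, and pid-file cleanup and status writes are omitted. Only the `Closed { torn_down }` return shape comes from the plan.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical abstraction over a running shell binary.
pub trait ShellProc {
    fn deliver_pre_close(&mut self, instruction: &str);
    fn is_alive(&mut self) -> bool;
    fn force_kill(&mut self);
}

/// Hypothetical slice of the manifest fields the close routine reads.
pub struct CloseSpec<'a> {
    pub pre_close: Option<&'a str>,
    pub close_timeout_ms: Option<u64>,
    pub ephemeral: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Closed {
    pub torn_down: bool,
}

pub fn close_shell_sketch(p: &mut dyn ShellProc, spec: &CloseSpec<'_>, poll: Duration) -> Closed {
    // (1) Deliver the pre_close instruction only if the manifest declares one.
    if let Some(instruction) = spec.pre_close {
        p.deliver_pre_close(instruction);
    }
    // (2) Poll liveness until the window ends; an absent timeout means 0.
    let deadline = Instant::now() + Duration::from_millis(spec.close_timeout_ms.unwrap_or(0));
    while Instant::now() < deadline && p.is_alive() {
        sleep(poll);
    }
    // (3) Force-kill whatever is still running at window end.
    if p.is_alive() {
        p.force_kill();
    }
    // Divergence: ephemeral => teardown, otherwise offline and re-linkable.
    Closed { torn_down: spec.ephemeral }
}

/// Test double: reports alive for `polls_until_exit` liveness checks.
pub struct FakeShell {
    pub polls_until_exit: u32,
    pub frames: Vec<String>,
    pub killed: bool,
}

impl ShellProc for FakeShell {
    fn deliver_pre_close(&mut self, instruction: &str) {
        self.frames.push(instruction.to_string());
    }
    fn is_alive(&mut self) -> bool {
        if self.killed || self.polls_until_exit == 0 {
            return false;
        }
        self.polls_until_exit -= 1;
        true
    }
    fn force_kill(&mut self) {
        self.killed = true;
    }
}
```

Passing the poll interval in as a parameter keeps tests fast; a real caller would presumably pick a fixed small interval.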

### Tests (D4a)
- close: pre_close declared ⇒ `shell_close` frame lands in the spool (relay) /
  reaches stdin (brokered, in-proc broker) before the kill; no pre_close ⇒ no frame;
  timeout 0 ⇒ immediate kill; child already dead ⇒ clean no-op close.
- divergence: ephemeral ⇒ perch gone + mint slot freed + spool gone; persistent ⇒
  offline + perch intact + `relink` brings it back (fresh token differs from old;
  old token refused on poll/emit after relink).
- relink: offline-only; deregistered adapter refused; non-owner refused; alias ref
  works; status flips online only at bind.
- cascade: owner suspend (cmd_rest path) closes both an online persistent relay shell
  and an online ephemeral shell with correct divergence; owner wake auto-onlines only
  the persistent one; dormant edge (Detach) closes nothing.
- teardown-online: graceful close observed before removal.
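
The cascade cases above lend themselves to a pure "plan" function that tests can assert on without spawning processes. The sketch below is one possible shape, not the shipped hook in `daemon_rest_event`: `ShellRow`, `Edge`, `CascadeOp`, and `plan_cascade` are hypothetical names, and the row fields are a simplified stand-in for `ShellInfo` plus the manifest flags.

```rust
/// Hypothetical flattened view of one owner shell.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ShellRow {
    pub id: &'static str,
    pub online: bool,
    pub persistent: bool,
    pub ephemeral: bool,
}

/// Hypothetical rest-state edges relevant to the cascade.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Edge {
    IntoSuspended,
    IntoDormant,
    WakeIntoActive,
}

/// Hypothetical per-shell operation the cascade would perform.
#[derive(Debug, PartialEq, Eq)]
pub enum CascadeOp {
    CloseThenTeardown(&'static str),
    CloseToOffline(&'static str),
    Relaunch(&'static str),
}

/// Select the per-shell operations for a given edge:
/// - into Suspended: close every online shell (ephemeral ones tear down)
/// - wake into Active: relaunch persistent shells that are offline
/// - into Dormant: nothing, shells stay online through dormancy
pub fn plan_cascade(edge: Edge, shells: &[ShellRow]) -> Vec<CascadeOp> {
    shells
        .iter()
        .filter_map(|s| match edge {
            Edge::IntoSuspended if s.online => Some(if s.ephemeral {
                CascadeOp::CloseThenTeardown(s.id)
            } else {
                CascadeOp::CloseToOffline(s.id)
            }),
            Edge::WakeIntoActive if !s.online && s.persistent => {
                Some(CascadeOp::Relaunch(s.id))
            }
            _ => None,
        })
        .collect()
}
```

Executing each planned op best-effort, and reporting failures without aborting the loop, would keep the edge from wedging as the locked decision requires.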

## NOT in D4 (lands later in M5)
- Presence/MRA upgrade of wake-resolution + escalation targets — **D6**.
- Cross-node link/relink + other-node discovery arm — **D8c**; cross-node spawn +
  `shell_wake_spawn_anywhere` *behavior* — deferred past M5 (instantiate-anywhere).
- Deferred-message resting gate (`REQ-INST-6`) + remote `spt suspend <id@node>` — **D5**.
- HTTP command-receipt mode — unless D8's shell demands it.

## Conventions (carried)
- NO `cargo fmt`. Tag `[impl->REQ-…]` / `[unit->REQ-…]` on real evidence in the same
  commit. `traceable-reqs check` from repo root before declaring done.
- Linux CI clippy `-D warnings` is the real gate; `--no-default-features` must stay
  clean.
- Push each slice → watch the **sha-pinned FINAL** conclusion; never push the next
  slice over an in-flight run.
- `cargo build -p mock-adapter` before local `shell_e2e` iteration (no dep edge —
  stale-binary trap). Flags before variadic positionals in tests/scripts (clap trap).
- Commit trailer: `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`.
