# Restoration D3 — supervision anchor + update trigger (task plan)

> Working doc for RESTORATION-PLAN.md **D3** (ADR-0018 Q2/Q3/Q4-disc, amendment
> V2; the V6 N-1 harness deferred here from D2). D1 (process-split skeleton) and
> D2 (loop relocation) are DONE + cross-OS CI-green on main (7398d7c). This is
> the milestone's first **versioned-surface** change — the spawn-time argv field
> — so it is where the V6 N-1 compat harness gets its first real assertion.

## Goal (D3-close invariant)

The broker is the **always-up per-machine anchor and the brain's supervisor**,
and a routine `apply` makes the **live brain process** restart onto the swapped
binary — new code runs immediately, no manual bounce, every broker-held endpoint
untouched at the process level. Concretely at D3-close:

- The seed-control lock + liveness (`ensure_running`/`is_running`/`daemon stop`)
  provably live in the never-restarting broker layer; a brain kill+respawn never
  flaps `is_running` nor lets a second daemon win the bind. External contracts
  **unchanged**.
- The broker owns the **generation counter** (KH-2.4 custody, V2), increments it
  on every brain spawn — planned *or* crash — and hands `{generation,
  start-reason}` to the brain at spawn via a **versioned argv field** (KH-2.3
  forward/back-tolerant, defaulted).
- `spt update apply` triggers a **real brain-process restart**: after the binary
  swap it asks the broker (over IPC) to cycle its brain child with
  `start-reason=update`; the broker's supervisor (D1) respawns from the
  executable path = the new binary. The in-process `apply_brain_only` handoff
  stops being the production trigger.
- A first CI-real **old-broker × new-brain** harness asserts the argv field's N-1
  window (old broker spawns new brain with no `--generation` → new brain defaults
  cleanly). Scaffolded here; becomes the green gate at D7.

## What is already satisfied (don't re-migrate)

- **Seed-control already serves in the broker process.** `Daemon::run` runs
  `seedmap::serve_seed_control` as its **foreground** loop (`daemon.rs:239`),
  i.e. in the broker process — D1 already put the Q2 anchor where Q2 wants it.
  `is_running`/`ensure_running`/`request_stop` already target the seed channel
  (`daemon.rs:321/360/329`). D3-1 is therefore **lock-in + falsifiable test**,
  not a relocation.
- **The brain supervisor already exists** (`brainproc::supervise_brain` +
  `spawn_brain_supervisor`, D1) with respawn + capped backoff. D3 extends it with
  generation custody + a planned-restart trigger; it does not rebuild it.

## Per-commit discipline

Each sub-task is its own atomic commit with evidence tagged in-commit. Every
commit must pass these gates: `cargo build` · `cargo test` · `cargo clippy` ·
`cargo build --no-default-features` · `traceable-reqs check` (EXIT=0) ·
`xtask check`. Push to a dev-freeform branch → **CI both runners** before any tag.

---

## D3-1 — Seed-control anchor lock-in (Q2) · confirm + assert  ✅ DONE

Seed-control + liveness are already broker-resident (above). This commit makes
the invariant **falsifiable** rather than incidental: a brain restart must not
disturb the single-daemon-per-`SPT_HOME` contract.

- **Test (process-level, brain_split sibling):** bring up `spt daemon run`, prove
  `is_running()` true; kill the brain child; while the supervisor respawns it,
  `is_running()` stays **continuously** true (the seed channel never dropped —
  it lives in the broker, not the brain) and a second `spt daemon run` racing the
  bind **loses and exits** (the winner keeps serving). The respawned brain
  (new `brain.ready` pid) reconnects to the same still-bound seed+broker sockets.
- **No code move expected.** If the audit surfaces any liveness/seed path that a
  brain restart *can* perturb, fix it here; otherwise this is a guard test + a
  doc-tag confirming custody. External `ensure_running`/`is_running`/`daemon
  stop` behavior is byte-for-byte unchanged.

Evidence: `[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (seed anchor survives a
brain cycle). REQ already `[doc,impl,unit]` active — no toml change.
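
The "stays **continuously** true" assertion can be made mechanical with a small sampling helper. The following is a minimal sketch, not the project's test code: `sample_liveness` and `Liveness` are hypothetical names, and the probe closure stands in for whatever call the real test makes (for example a wrapper around `is_running()`).

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of sampling a liveness probe across a time window.
#[derive(Debug, PartialEq)]
pub enum Liveness {
    /// Every sample returned true; carries the number of samples taken.
    Continuous(u32),
    /// A sample returned false; carries how many true samples preceded it.
    Dropped(u32),
}

/// Polls `probe` every `interval` until `window` has elapsed.
/// Takes at least one sample and stops at the first false reading.
pub fn sample_liveness<F>(mut probe: F, window: Duration, interval: Duration) -> Liveness
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    let mut samples = 0u32;
    loop {
        if !probe() {
            return Liveness::Dropped(samples);
        }
        samples += 1;
        if start.elapsed() >= window {
            return Liveness::Continuous(samples);
        }
        thread::sleep(interval);
    }
}
```

In the process test, the window would span the kill-to-respawn interval, and any `Dropped(_)` result fails the test. Sampling can only miss a flap shorter than `interval`, so a tight interval is worth the extra probes.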

---

## D3-2 — Generation custody → broker + versioned spawn-time argv field · Q2/V2, KH-2.3/2.4  ✅ DONE

The first new **versioned surface** of the milestone. The broker owns the
generation counter and hands `{generation, start-reason}` to each brain at spawn.

- **Broker-owned counter.** The supervisor holds an `AtomicU64` generation,
  incremented on **every** spawn (cold, crash-respawn, update-respawn). Because
  the broker observes every respawn (V2), this is strictly more reliable than the
  brain→brain `BrainState.generation` it replaces (that retires fully in D4).
- **Start-reason discriminator** (Q4's update-vs-crash, one channel): `cold` =
  the broker's first brain spawn; `crash` = a supervised respawn after an
  unexpected child exit; `update` = a respawn the broker itself initiated via the
  D3-3 trigger. The supervisor sets it per spawn.
- **Carrier = argv on `spt daemon brain`** (KH-2.3: "any brain-relaunch argv must
  be versioned and forward/backward tolerant"). `spawn_brain_child` appends
  `--generation <N> --start-reason <cold|crash|update>`. The brain entry parses
  them with **defaults** (`--generation` absent → 0; `--start-reason` absent →
  `cold`), so:
  - **N-1 (steady state): old broker × new brain.** The old broker spawns the new
    brain with the *old* argv (no `--generation`); the new brain defaults — never
    a clap hard-reject before state read (KH-2.3's exact failure mode).
  - **new broker × old brain.** The old brain must ignore the unknown flags
    (verify the brain arg parse tolerates-and-ignores unknowns, or the flags are
    optional+ignored — do **not** let clap error on an unknown argument on this
    path).
- **Brain surfaces them** for observability/D5 (the start-reason feeds D5's
  update-vs-crash deadline rehydration). `gen_start = now()` stays fresh on every
  start (KH-2.4 — never rehydrated); only the *counter* comes from the broker now.
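
The broker side of this handoff can be sketched as follows. This is an illustration under stated assumptions, not the code in `brainproc`: `GenerationCustody` and `brain_argv` are hypothetical names, and numbering the first spawn as generation 1 (so that the brain's absent-flag default of 0 stays distinguishable) is an assumption the plan does not state.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Why the broker is spawning a brain (the start-reason discriminator).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Cold,
    Crash,
    Update,
}

impl StartReason {
    /// The value carried on the command line.
    pub fn as_arg(self) -> &'static str {
        match self {
            StartReason::Cold => "cold",
            StartReason::Crash => "crash",
            StartReason::Update => "update",
        }
    }
}

/// Hypothetical holder for the broker-owned generation counter.
#[derive(Default)]
pub struct GenerationCustody {
    current: AtomicU64,
}

impl GenerationCustody {
    /// Increments the counter and returns the generation for the brain
    /// about to be spawned. Assumption: the first spawn is generation 1.
    pub fn bump(&self) -> u64 {
        self.current.fetch_add(1, Ordering::SeqCst) + 1
    }
}

/// Builds the argv for `spt daemon brain` with the two versioned flags appended.
pub fn brain_argv(generation: u64, reason: StartReason) -> Vec<String> {
    vec![
        "daemon".to_string(),
        "brain".to_string(),
        "--generation".to_string(),
        generation.to_string(),
        "--start-reason".to_string(),
        reason.as_arg().to_string(),
    ]
}
```

The supervisor would call `bump()` once per spawn, whatever the cause, and pass the result straight into the argv it hands to the child.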

Evidence: `[impl->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (broker custody of the
counter + spawn-time handoff) · `[impl->REQ-HAZARD-HANDOFF-ARGV-COMPAT]` (the
versioned, defaulted argv field) · `[unit->REQ-HAZARD-HANDOFF-ARGV-COMPAT]`
(argv parse/default table: present, absent, unknown-tolerated) ·
`[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (counter increments per spawn,
reason classification). REQ-HAZARD-HANDOFF-ARGV-COMPAT already `[impl,unit]`
active — no toml change.
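
The parse/default table named in the evidence (present, absent, unknown-tolerated) can be illustrated with a hand-rolled tolerant parser. This is a sketch only: the real brain entry parses with clap, and `parse_spawn_args` and `SpawnInfo` are hypothetical names. Falling back to the default on a malformed value is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Cold,
    Crash,
    Update,
}

/// What the brain learns from the broker at spawn.
#[derive(Debug, PartialEq, Eq)]
pub struct SpawnInfo {
    pub generation: u64,
    pub start_reason: StartReason,
}

/// Tolerant parse: a missing flag keeps its default, an unknown flag is
/// skipped, and a malformed value leaves the default in place.
pub fn parse_spawn_args<S: AsRef<str>>(args: &[S]) -> SpawnInfo {
    let mut info = SpawnInfo {
        generation: 0,
        start_reason: StartReason::Cold,
    };
    let mut i = 0;
    while i < args.len() {
        let value = args.get(i + 1).map(|v| v.as_ref());
        match (args[i].as_ref(), value) {
            ("--generation", Some(v)) => {
                if let Ok(n) = v.parse::<u64>() {
                    info.generation = n;
                }
                i += 2;
            }
            ("--start-reason", Some(v)) => {
                match v {
                    "cold" => info.start_reason = StartReason::Cold,
                    "crash" => info.start_reason = StartReason::Crash,
                    "update" => info.start_reason = StartReason::Update,
                    _ => {} // unknown reason from a newer broker: keep default
                }
                i += 2;
            }
            // Unknown flag, unknown positional, or a known flag with no value.
            _ => i += 1,
        }
    }
    info
}
```

The key property is that no input makes the function fail: every row of the table resolves to a `SpawnInfo`, which is the behavior the N-1 window needs before any state is read.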

---

## D3-3 — Update trigger rewire (Q3): real brain-process restart onto the swap  ✅ DONE

The pillar this whole milestone exists for. Today `apply_staged` (in the CLI
`spt update apply` process) swaps the binary then calls `apply_brain_only`, which
snapshots/drops/re-attaches a **CLI-side** `Brain` connection in-process — the
live brain process keeps running the OLD code (the ADR-0018 regression). D3-3
makes the live brain process actually restart onto the new binary.

- **Route the trigger through the broker** (so V2 holds — the broker *knows* the
  restart was planned). Add a broker control verb (e.g. `KIND_BRAIN_RESTART`,
  additive/defaulted per KH-2.3) the CLI sends after the binary swap. Broker-side
  handling raises a `restart-requested(reason=update)` flag the supervisor
  watches; the supervisor kills the current brain child, bumps the generation,
  and respawns from `current_exe()` (now the swapped binary) with
  `--start-reason update`. The new brain connects → new code live.
- **`apply_staged` stops doing the in-process handoff** as the production trigger:
  after the swap + `record_applied`, it sends `KIND_BRAIN_RESTART` over the broker
  socket instead of calling `apply_brain_only(... handoff_retry ...)`. On a node
  with no running broker the restart is vacuous: `apply` still swaps, and the next
  `ensure_running` brings up the new binary. The endpoint-survival guarantee is
  the broker holding PTY/QUIC across the child cycle (proven end-to-end at D7).
- **`apply_brain_only` + `Brain::handoff` stay** for the existing unit tests; the
  production path no longer calls them. The full `BrainState`-message retire +
  multi-session resume is **D4** — D3-3 restarts the brain process; D4 makes the
  new brain re-attach *all* sessions in resume mode. (In today's spt-core daemon
  the prod brain hosts only net-consumers + shellwake, both disk/`net-status`
  re-derived on start, so D3-3's restart is already correct for them; D4 adds
  session-cursor resume when daemon-hosted sessions land.)
- **`update.rs:233-234`'s "exec the new binary's brain" finally becomes real** —
  via broker respawn rather than `exec` (Q8: `Command::spawn`, no OS fork).

Evidence: `[impl->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (the live process restarts
onto the swap; broker-observed planned cycle) · `[impl->REQ-UPD-3]` (the real
no-manual-bounce trigger) · `[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (the
restart verb flips the supervisor to a planned respawn with reason=update; a
no-broker apply still swaps). The REQ-UPD-3 / REQ-DAEMON-2 **int** re-point to a
process-level survival E2E is **D7**, not here — D3-3 adds impl only, leaving the
(mis-evidenced) in-process int tags for the D7 commit that lands their
replacement (never two int evidences at once).
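
How the restart verb turns the next respawn into a planned one can be sketched with a flag shared between the broker's control handler and the supervisor. All names here (`SupervisorState`, `request_restart`, `next_spawn`) are hypothetical; the real supervisor's structure and the wire handling of `KIND_BRAIN_RESTART` are not shown in this plan.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Cold,
    Crash,
    Update,
}

/// Hypothetical state shared by the control handler and the supervisor loop.
#[derive(Default)]
pub struct SupervisorState {
    generation: AtomicU64,
    restart_requested: AtomicBool,
    spawned_once: AtomicBool,
}

impl SupervisorState {
    /// Called by the broker when the restart verb arrives.
    pub fn request_restart(&self) {
        self.restart_requested.store(true, Ordering::SeqCst);
    }

    /// True while a planned restart is waiting; the supervisor loop would
    /// poll this and kill the current brain child when it is set.
    pub fn restart_pending(&self) -> bool {
        self.restart_requested.load(Ordering::SeqCst)
    }

    /// Called by the supervisor immediately before every spawn. Consumes the
    /// restart flag, classifies the spawn, and bumps the generation.
    pub fn next_spawn(&self) -> (u64, StartReason) {
        let planned = self.restart_requested.swap(false, Ordering::SeqCst);
        let first = !self.spawned_once.swap(true, Ordering::SeqCst);
        let reason = if first {
            StartReason::Cold
        } else if planned {
            StartReason::Update
        } else {
            StartReason::Crash
        };
        let generation = self.generation.fetch_add(1, Ordering::SeqCst) + 1;
        (generation, reason)
    }
}
```

Because the flag is consumed inside `next_spawn`, a child exit with no prior request is classified as `crash`, which matches the safe default described in the risks section. A request that arrives before the first spawn is absorbed into `cold` in this sketch.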

---

## D3-4 — V6 N-1 compat harness scaffold (build here; gate at D7)  ✅ DONE

The first real versioned surface (D3-2's argv field) now exists, so the
old-broker × new-brain compat test has something with teeth to assert.

- **Shape:** a CI-real exercise pairing an **old-broker binary** with a
  **new-brain binary** across the spawn handshake: the old broker spawns the new
  brain with the *old* argv (no `--generation`/`--start-reason`); assert the new
  brain comes up `ready` with defaulted generation 0 / reason `cold` — never a
  clap reject, never an unlogged death (KH-2.3). The inverse (new broker spawns
  old brain, unknown flags ignored) rounds out the window.
- **Binary provenance:** reuse the brain_split real-binary harness pattern; the
  "old broker" is a pinned prior `spt` (a checked-in fixture build or the
  last-release binary the rig already stages). Keep it a single first-field
  assertion now; it **grows one assertion per later additive field** (D4–D6) and
  becomes a **green CI gate at D7**.
- **Scope guard:** a breaking verb-signature change is a broker-breaking update
  class — out of scope, must be *refused*, never silently shipped (RESTORATION
  risk note). The harness asserts additive-compat only.

Evidence: `[unit->REQ-HAZARD-HANDOFF-ARGV-COMPAT]` +
`[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (the N-1 window holds for the first
versioned field). The **int** activation of the N-1 gate is D7.
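
To tell "comes up `ready`" apart from an unlogged death, the harness needs a bounded wait on the readiness signal. The sketch below assumes readiness is published as a marker file holding the brain's pid as decimal text; that format is an assumption made for illustration, and `wait_for_ready` and `ReadyError` are hypothetical names.

```rust
use std::fs;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum ReadyError {
    /// No readable, non-empty marker appeared before the deadline.
    TimedOut,
    /// The marker appeared but did not contain a pid.
    Malformed(String),
}

/// Polls `marker` until it holds a pid or `timeout` elapses.
/// An absent or empty file is treated as "not ready yet".
pub fn wait_for_ready(marker: &Path, timeout: Duration) -> Result<u32, ReadyError> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(text) = fs::read_to_string(marker) {
            let trimmed = text.trim();
            if !trimmed.is_empty() {
                return trimmed
                    .parse::<u32>()
                    .map_err(|_| ReadyError::Malformed(trimmed.to_string()));
            }
        }
        if Instant::now() >= deadline {
            return Err(ReadyError::TimedOut);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

In the old-broker × new-brain case, a `TimedOut` result is the failure the harness exists to catch: the new brain rejected the old argv or died before reporting ready.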

---

## Sequencing

D3-1 (anchor lock-in, no new surface) → **D3-2** (generation + argv — the new
versioned field) → **D3-3** (trigger rewire; consumes D3-2's `start-reason`) →
**D3-4** (N-1 harness over D3-2's field). D3-2 before D3-3 because the
update-trigger respawn must stamp `start-reason=update`, which D3-2 defines.

## Traceability — no toml activation needed in D3

| REQ | State entering D3 | D3 adds | Activation note |
|-----|-------|---------|-----------------|
| `REQ-HAZARD-BROKER-PROCESS-ISOLATION` | `[doc,impl,unit]` (D1/D2) | impl + unit (custody, trigger, anchor test) | already active; int still → D7 |
| `REQ-HAZARD-HANDOFF-ARGV-COMPAT` | `[impl,unit]` | impl + unit (the argv field) | already active |
| `REQ-UPD-3` | `int` (in-process, mis-evidenced) | impl (real trigger) | int re-point stays **D7** |

No `required_stages` change lands in D3 (rule 5): every D3 REQ is already active,
and the int re-points are D7 work. `traceable-reqs check` stays green at every
commit.

## Risks / watch-items

- **Argv N-1 is the single most update-frequency-sensitive invariant (KH-2.3).**
  The brain arg parse must DEFAULT a missing field and IGNORE an unknown one —
  never clap-reject before state read. D3-4 is the backstop; verify it on the
  real binaries, not just a unit parse.
- **Planned-vs-crash must be broker-decided (V2).** If the brain self-exits and
  the supervisor can't tell it was planned, the reason defaults to `crash`. That
  is *safe* (crash recovery is a superset of planned-restart recovery) but loses
  the D5 phase-preserve optimization. Route the trigger through the broker so
  `update` is authoritative.
- **Endpoint survival is asserted end-to-end at D7, not D3.** D3-3 proves the
  process *restarts onto the swap*; the PTY-child + live-QUIC gapless survival
  across that restart is the D7 `int` E2E. Don't over-claim REQ-UPD-3 here.
- **Broker socket-bind flake on kitsubito** (DEFERRED): D3-1/D3-4 add real
  process-spawn tests, so fix the bind determinism opportunistically
  (unlink-before-bind / per-test TempDir) rather than fighting flakes.
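
The two bind-determinism tactics named in the last item can be sketched for Unix sockets as below. This is a Unix-only illustration built on the standard library. The plan's "TempDir" may mean a crate type, which this sketch replaces with a counter-suffixed directory. `fresh_test_dir` and `bind_fresh` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU32, Ordering};

static NEXT_DIR: AtomicU32 = AtomicU32::new(0);

/// Creates a directory unique to this process and call, so parallel tests
/// never share a socket path.
pub fn fresh_test_dir(label: &str) -> io::Result<PathBuf> {
    let n = NEXT_DIR.fetch_add(1, Ordering::SeqCst);
    let dir = std::env::temp_dir().join(format!("{label}-{}-{n}", std::process::id()));
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Removes a stale socket file left by an earlier run, then binds.
/// A missing file is fine; any other removal error is reported.
pub fn bind_fresh(path: &Path) -> io::Result<UnixListener> {
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    UnixListener::bind(path)
}
```

Unlink-before-bind belongs in test setup only. In the daemon itself a second `spt daemon run` must still lose the bind, so the production path cannot remove a socket that a live broker owns.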

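## Immediate next step

D3-1 through D3-4 are all marked DONE above, so nothing in D3 is pending. The
kickoff step recorded here when the plan was written was D3-1: write the
seed-anchor-survives-brain-cycle process test (sibling of `brain_split.rs`),
confirm no seed/liveness path is brain-perturbable, tag
`[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]`, then move to D3-2 (generation +
argv). Work this doc defers goes to D4 (`BrainState` retire, session resume) and
D7 (the **int** re-points and the N-1 CI gate).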
