# Restoration D6 — failure atomicity: readiness-gated auto-rollback (task plan)

> **VET STATUS: APPROVED + GREENLIT (doyle, 2026-06-10).** File:line audit clean;
> structure (primitive → consumer+harness → invariant+activation) is the proven
> D5 shape. Four amendments **A8–A11 folded below**; all 5 open calls resolved
> (see §Resolved open calls); 2 doyle notes folded. Clear to execute D6-1.

> Working doc for RESTORATION-PLAN.md **D6** (ADR-0018 Q7, V1). D1 (process
> split), D2 (loop relocation), D3 (supervision anchor + real brain-process
> update trigger), D4 (multi-session cold-start resume; `BrainState` message
> retired), D5 (durable absolute-deadline loop timing) are DONE + cross-OS
> CI-green on main (@2ba4fd7). The broker now owns the brain supervisor
> (`brainproc::supervise_brain`), respawns it from the executable path on every
> exit (crash → `reason=Crash` + capped backoff; planned update →
> `reason=Update`, prompt respawn), and stamps `{generation, start-reason}` at
> spawn. `apply` swaps the binary on disk and fires `request_brain_restart`, and
> the supervisor cycles the brain onto the swapped binary. **What is missing is
> the safety latch:** the supervisor spawns the new binary and *assumes it
> works* — there is no readiness gate, so a new brain that bricks its *logic*
> (panics on boot, never re-attaches) would crash-loop forever under the
> existing backoff: endpoints alive (the broker holds them), logic dead —
> "up but useless." D6 makes a bad brain-only update **self-heal back to the
> last-known-good binary** without a human.
>
> **Discriminator + custody already in hand (D3).** The broker owns the
> generation counter (`brainproc.rs:315`, increments every spawn) and stamps
> `StartReason::{Cold,Crash,Update}` at spawn (`brainproc.rs:323`). D6 adds the
> **ready** observation and the **rollback** arm to the loop the broker already
> drives — it does not add a new process or a new supervisor.

## Goal (D6-close invariant)

A brain-only update that fails to come up **rolls back automatically** to the
last-known-good binary, with every hosted endpoint untouched throughout (the
broker holds them). Concretely at D6-close:

- **Readiness-gated promotion.** A freshly-respawned **update** brain is *on
  trial* until it signals **ready** — re-attached all hosted sessions + resumed
  its loops. Until ready it is not "the accepted binary"; it is a candidate.
- **Bounded retry → auto-rollback.** If the candidate fails to reach ready
  within a bounded retry budget / healthy-run window (it exits pre-ready, or
  never signals ready), the supervisor **spawns the last-known-good binary
  instead**, **quarantines** the bad version (no auto re-apply / re-fetch), and
  surfaces a **loud consent-style notification**. The broker holds **both** the
  candidate and the last-known-good path and **chooses which to spawn from
  durable state** — no risky file rename under duress at the failure instant.
- **Two-phase applied record (fixes a real existing bug).** `apply` writes
  `applied-pending` at swap (not the optimistic `applied` it writes today,
  `applyhost.rs:178-179`, before the brain boots — the `applied.json={version:6}`
  observed on `enlyzeam`). The supervisor **promotes** it to `applied` on the
  ready signal, or **corrects** it to `rolled-back(quarantine=N, running=N-1)`
  on failure.
- **A promoted binary is no longer on trial.** Once a candidate reaches ready,
  it is the accepted binary; a *later* crash is a normal supervised respawn
  (capped backoff), not a rollback trigger — the rollback latch is the
  *first-boot* trial, not a permanent crash-loop watchdog.
- **[V1] Rollback-state-compat invariant, asserted while free.** Auto-rollback
  spawns the **old** binary against durable state the **new** brain may already
  have written. Safe today (2026-06-09 audit: zero state-migration code), but the
  first durable-state schema migration would silently break rollback. D6 lands
  the invariant as an explicit, tested **tripwire** (a brain must not
  irreversibly migrate durable state before ready-promotion — every pre-ready
  write stays N-1-readable) + the KH 6.8 doc, so a future migration trips a known
  wire.

## What is already satisfied / true (don't re-build, don't mis-scope)

- **The supervisor + generation custody + StartReason are wired (D1/D3).**
  `supervise_brain` (`brainproc.rs:301-377`) is broker-owned, respawns on exit,
  owns the generation counter (`:315`), and stamps `Update` vs `Crash`
  (`:362-368`). D6 consumes these; it does **not** re-mint custody or add an IPC
  field on the wire (the ready signal is brain-local disk state, like the D5
  anchor — the D7 N-1 verb harness gains **no** assertion in D6).
- **The binary already steps aside as `<exe>.old-<version>` (`applyhost.rs:166,
  206-214`).** The last-known-good backup D6 rolls back to **already exists** —
  D6 adds *selection* + *quarantine* on top of it, not the backup machinery. The
  on-land-failure rollback (`applyhost.rs:169-171`) is the precedent rename;
  D6's failure-time path deliberately does **not** rename (see D6-2).
- **The ready breadcrumb exists but is too weak to gate on.** The brain writes
  `brain.ready` (its pid) after *connect* (`brainproc.rs:162`) and on heartbeat
  (`:218`) via `write_ready` (`:274-279`); the path is `<spt_home>/brain.ready`
  (`endpoint.rs:55-66`). It is **best-effort, written post-connect not
  post-resume, and not generation-stamped** — so it cannot today distinguish
  "*this* generation is ready" from "a *previous* generation left a stale ready
  file." D6-1 strengthens it (post-resume + generation-stamped); it does not
  invent a new channel.
- **The durable-record pattern already exists.** `ReleaseCache`
  (`relcache.rs:289-343`) already persists `applied.json` (`{version}`) and
  `last-outcome.json` (`{outcome}` free-string: `staged`/`applied`/`rejected:…`)
  as atomic JSON under `spt_home()`. D6-1's two-phase record **extends this file
  set** with the same atomic-write + corrupt-degrades posture — no new storage
  machinery.
- **The consent-notif surface exists.** `produce_consent_notif`
  (`notif.rs:363-396`) already authors a daemon-originated consent-style notify
  envelope and first-fires it to the most-active visible endpoint. D6's loud
  rollback notif **reuses this surface** (a new notif *kind*/body, not a new
  delivery path).
- **Forward-correct, not field-exercised (the D4/D5 posture, verbatim).** The
  daemon brain hosts no PTY sessions today, so `resume_sessions`
  (`brain.rs:596-608`) is a near-noop and a real *field* update is not driven on
  a single-home dev box. D6 is therefore **forward-correct capability + a real
  spawn/kill/respawn harness proof** — the readiness-gate + rollback machinery is
  built and exercised by a real harness now (a candidate that fails-to-ready is
  injected, the rollback observed), so it is genuine machinery the live-agent
  adapter inherits, not dead code. Keep the harness **real** (real
  spawn/kill/record/notif against a temp `spt_home`), not a mocked stub.
- **The backoff + healthy-run machinery already exists (Q7 reuse claim, verified).**
  `SUPERVISE_BACKOFF_BASE`=2s, `SUPERVISE_BACKOFF_CAP`=60s,
  `SUPERVISE_HEALTHY_RUN`=30s (`brainproc.rs:38-43`); `next_backoff` doubling
  with healthy-run reset. D6 reuses the healthy-run window as the trial window —
  it does not add a parallel timing system.
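
A minimal sketch of the doubling-with-reset rule the last bullet reuses. The three constants carry the values quoted above; the `next_backoff` signature shown here (previous delay plus how long the last run stayed up) is an assumption, and the real function in `brainproc.rs` may be shaped differently.

```rust
use std::time::Duration;

// Values as quoted in the plan (`brainproc.rs:38-43`).
const SUPERVISE_BACKOFF_BASE: Duration = Duration::from_secs(2);
const SUPERVISE_BACKOFF_CAP: Duration = Duration::from_secs(60);
const SUPERVISE_HEALTHY_RUN: Duration = Duration::from_secs(30);

/// Hypothetical signature. `prev` is the delay used before the run that just
/// ended; `ran_for` is how long that run stayed up.
fn next_backoff(prev: Duration, ran_for: Duration) -> Duration {
    if ran_for >= SUPERVISE_HEALTHY_RUN {
        // A healthy run resets the ladder to the base delay.
        return SUPERVISE_BACKOFF_BASE;
    }
    // Double with saturating math, then clamp into [BASE, CAP].
    prev.saturating_mul(2)
        .min(SUPERVISE_BACKOFF_CAP)
        .max(SUPERVISE_BACKOFF_BASE)
}
```

Under D6 the same `SUPERVISE_HEALTHY_RUN` value also serves as the per-attempt trial window, so no second timer is introduced.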

## Per-commit discipline

Each sub-task is its own atomic commit with evidence tagged in-commit. Gates on
every commit: `cargo build` · `cargo test` · `cargo clippy` · `cargo build
--no-default-features` · `traceable-reqs check` (EXIT=0) · `xtask check`. Push to
a dev-freeform branch → **CI both runners** before any tag (the v0.3.0 lesson:
the single-home non-elevated dev box is blind to the very spawn/socket topology
D6 exercises — CI is the only place the real process-spawn rollback runs on both
OSes).

---

## D6-1 — Two-phase applied record + generation-stamped ready contract · Q7

The two primitives D6-2 consumes, built + unit-proven in isolation before the
supervisor gates on them (the D5-1 posture: primitive first, consumer second).

### (a) Two-phase applied/outcome record

- **Replace the free-string outcome with a structured, three-state record.**
  Today `apply` writes `record_applied(version)` + `record_last_outcome("applied")`
  (`applyhost.rs:178-179`) **before** the brain restart is even triggered — the
  optimistic record the design indicts. Make the outcome a small enum persisted
  under `spt_home()` (sibling of `applied.json`/`last-outcome.json`, same atomic
  + corrupt-degrades-safe posture):
  - `AppliedPending { version, rollback_binary, candidate_started_ms }` — written
    by `apply` at the moment of swap (the binary is on disk, the trial brain is
    about to spawn).
  - `Applied { version }` — written by the **supervisor** when the candidate
    generation signals ready (promotion).
  - `RolledBack { quarantine_version, running_version, rollback_binary }` —
    written by the **supervisor** when the candidate exhausts its retry budget
    pre-ready (correction).
- **The record is the cross-process channel.** `apply` runs in the **CLI
  process** (`spt update apply`), swaps the binary, fires `request_brain_restart`,
  and **returns** — it cannot itself observe ready/rollback (that happens later,
  asynchronously, in the broker's supervisor thread). So the `AppliedPending`
  record **carries everything the supervisor needs**: the candidate `version` and
  the `rollback_binary` path (the `<exe>.old-<version>` aside `apply` just
  created, `applyhost.rs:166`). The supervisor reads the record to know both the
  candidate (its own `current_exe`) and the last-known-good to roll back to. One
  source of truth, no shared memory, no new IPC verb.
- **`apply` writes `AppliedPending`, never the eager `applied`.** The CLI's
  returned outcome becomes **provisional**: "restart triggered, promotion
  pending" — `apply` no longer *claims* `applied` synchronously (it cannot know
  yet). The final outcome converges via the record (read by `xtask
  debug-converge` / the rollback notif). **[resolved — return pending
  immediately]** `apply` does **not** block on the trial verdict (blocking the CLI
  for a 30 s window is poor UX and the notif surfaces a rollback loudly); it
  returns a pending status with **honest text** ("trial started; verdict via notif
  / `debug-converge`"), never a premature "applied."
- **Saturating / fail-safe throughout.** A corrupt or absent record degrades to
  "no pending trial" (the supervisor treats a missing pending-record as "spawn
  the candidate as a normal cold/crash brain", never a hard fail) — same posture
  as `DaemonConfig`/`EffectJournal`/the D5 anchor.
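
The three states and their two legal transitions can be sketched as below. This is an illustration under stated assumptions, not the landed code: the `u64` version type and the function names `on_trial`, `promote`, and `correct` are hypothetical, and persistence is left out. A loaded record is modeled as `Option`, with `None` standing for "absent or corrupt".

```rust
use std::path::PathBuf;

/// The three states named above. Field names follow the plan; the `u64`
/// version type is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum AppliedRecord {
    AppliedPending { version: u64, rollback_binary: PathBuf, candidate_started_ms: u64 },
    Applied { version: u64 },
    RolledBack { quarantine_version: u64, running_version: u64, rollback_binary: PathBuf },
}

/// `None` (absent or corrupt) degrades to "no pending trial".
fn on_trial(rec: &Option<AppliedRecord>) -> bool {
    matches!(rec, Some(AppliedRecord::AppliedPending { .. }))
}

/// Supervisor-side promotion on the ready signal. Anything other than an
/// unpromoted pending record is returned unchanged, so the call is idempotent.
fn promote(rec: Option<AppliedRecord>) -> Option<AppliedRecord> {
    match rec {
        Some(AppliedRecord::AppliedPending { version, .. }) => {
            Some(AppliedRecord::Applied { version })
        }
        other => other,
    }
}

/// Supervisor-side correction on budget exhaustion. `running_version` is
/// passed in because this sketch does not assume versions are consecutive.
fn correct(rec: Option<AppliedRecord>, running_version: u64) -> Option<AppliedRecord> {
    match rec {
        Some(AppliedRecord::AppliedPending { version, rollback_binary, .. }) => {
            Some(AppliedRecord::RolledBack {
                quarantine_version: version,
                running_version,
                rollback_binary,
            })
        }
        other => other,
    }
}
```

Keeping both transitions total (every input maps to some output, never a panic) matches the fail-safe posture the last bullet asks for.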

### (b) Generation-stamped, post-resume ready contract

- **The ready signal must answer "is *generation N* ready," not "did *some*
  brain once write ready."** Strengthen the breadcrumb on two axes:
  - **Generation-stamped, self-describing shape [A10].** `brain.ready` changes
    from bare pid text to a self-describing **JSON `{pid, generation}`** (house
    posture; the `generation` is handed at spawn, `brainproc.rs:147`
    `run_brain(generation, reason)`). The supervisor that spawned generation N
    waits for a ready stamp **== N**; a stale stamp from generation N−1 (left by a
    brain that died instantly) **does not satisfy the gate**. *(This is the exact
    staleness class doyle caught in D5 amd-6 — the keyed-anchor cross-clobber. A
    bare, un-stamped ready file would let a candidate that panics in microseconds
    be falsely promoted on its predecessor's breadcrumb. This is the single
    load-bearing correctness invariant of D6.)*
  - **[A10] The shape change has exactly one reader to migrate.** The only reader
    of `brain.ready` today is the `brain_split.rs` harness (polls pid +
    pid-change; no production machine reader). **Update its parser in the same
    commit** as the shape change. And **register `brain.ready` in D6-3's pre-ready
    durable-file set** — it is itself a pre-ready durable write whose shape D6-1b
    mutates, so the V1 tripwire must own it from birth (it dogfoods its own
    invariant).
  - **[A9] Equality-stamping alone has a cross-lifetime hole — the supervisor
    must CLEAR the file before each trial spawn.** `generation` is **in-memory** in
    the supervisor (`brainproc.rs:315` `let mut generation: u64 = 0`) and **resets
    to 0 on a broker restart**. Lethal sequence: yesterday's broker leaves
    `brain.ready` stamped gen=0; the machine reboots mid-trial (record still
    `AppliedPending`); the fresh broker spawns the candidate as gen 0; the
    candidate panics in microseconds; the supervisor polls `ready_generation()` →
    finds the **stale file stamped 0 == this lifetime's trial gen 0** → **false
    promotion of a brain that never booted.** Equality-stamping only guards
    *same-lifetime* staleness. Fix (cheap, belt+braces): **the supervisor deletes
    `brain.ready` immediately before each trial spawn** — the prior child is
    already waited-on dead (no late-writer race), so the only `brain.ready` that
    can exist post-clear is the one *this* trial child writes. Generation-stamping
    stays as the same-lifetime defense; the clear closes the cross-lifetime hole.
  - **Post-resume, not post-connect** — write ready only **after**
    `resume_sessions` (`brain.rs:596-608`) completes + loops are resumed, i.e.
    the design's "re-attached all sessions + resumed loops," not merely "socket
    connected" (today's `:162` write is post-connect). Today resume is a near-noop
    (no hosted sessions), so this is forward-correct: the gate is *defined*
    correctly now so the adapter era inherits a real readiness semantic, not a
    connect-only proxy.
- **Pure, testable observation.** The supervisor's "is generation N ready?" is a
  pure read of the stamped breadcrumb (`ready_generation(spt_home) -> Option<u64>`),
  so the unit harness asserts it with explicit values — no wall-clock flake.
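
A dependency-free sketch of the read side and the A9 clear. `ready_generation` keeps the signature given above; `parse_generation`, `clear_ready`, and `gate_satisfied` are hypothetical helper names. The hand-rolled field extraction is only there to keep the example self-contained; the real reader would presumably use whatever JSON machinery the other durable files already use.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `generation` number from a `{"pid": .., "generation": ..}`
/// body. Bare pid text (the old shape) or any malformed body yields `None`.
fn parse_generation(body: &str) -> Option<u64> {
    let key = "\"generation\"";
    let rest = &body[body.find(key)? + key.len()..];
    let rest = rest.trim_start().strip_prefix(':')?.trim_start();
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Pure read of the stamped breadcrumb; a missing file is simply "not ready".
fn ready_generation(spt_home: &Path) -> Option<u64> {
    let body = fs::read_to_string(spt_home.join("brain.ready")).ok()?;
    parse_generation(&body)
}

/// A9: delete the breadcrumb before each trial spawn. An already-absent file
/// is success, so the call is idempotent.
fn clear_ready(spt_home: &Path) -> io::Result<()> {
    match fs::remove_file(spt_home.join("brain.ready")) {
        Err(e) if e.kind() != io::ErrorKind::NotFound => Err(e),
        _ => Ok(()),
    }
}

/// The gate: only a stamp equal to the trial generation counts.
fn gate_satisfied(spt_home: &Path, trial_generation: u64) -> bool {
    ready_generation(spt_home) == Some(trial_generation)
}
```

With a stale file stamped `0` on disk, `gate_satisfied(home, 0)` is true until `clear_ready` runs. That is the cross-lifetime hole the equality check cannot close by itself.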

Evidence: `[impl->REQ-UPD-6]` (the applied record is now two-phase + structured)
· `[impl->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (the generation-stamped ready
contract — the gate a brain-process restart is promoted/rolled-back against) ·
`[unit->REQ-UPD-6]` (record round-trips: pending→applied promotion, pending→
rolled-back correction, corrupt/absent degrades to no-trial; the
`AppliedPending` carries the rollback path) · `[unit->REQ-HAZARD-BROKER-PROCESS-
ISOLATION]` (**the staleness invariant**: a ready stamp for generation N−1 does
**not** satisfy a gate waiting on N; an N-stamped ready does; absent ready never
spuriously promotes; **[A9] cross-lifetime: a stale file stamping the *exact*
trial generation — the broker-restart gen-reset case — must NOT promote, proving
the clear-before-spawn closes it, not the equality check alone**; **[A10] the
JSON `{pid, generation}` shape round-trips and the `brain_split.rs` parser reads
it**). REQ-UPD-6 + REQ-HAZARD-BROKER-PROCESS-ISOLATION already active — **no toml
change in D6-1**.

---

## D6-2 — Readiness-gated supervisor: bounded-retry → auto-rollback + quarantine + notif · Q7

> **DONE (local, pre-CI).** `supervise_brain` gained the injected `TrialEnv` +
> the trial gate: record-driven binary selection, A8 record-latch, A9
> clear-before-spawn, A11 kill-before-rollback, K=3 pre-ready budget, promote-on-
> ready (relocated legacy `applied.json`/`last-outcome=applied` write), `RolledBack`
> + loud resurfacing notif (`produce_rollback_notif`, NOTIF_KIND_ROLLBACK),
> quarantine guard at the apply gate (`Quarantined` outcome), CLI `AppliedPending`
> pending semantics + honest "trial started" text. 7 new harness tests (real
> spawn/kill via injected children + scripted ready/record/notif fakes) + the
> quarantine + rollback-notif + pending-message tests. DEFERRED.md rows landed
> (reboot residual + operator `--force` override). REQ-HAZARD-BROKER-PROCESS-
> ISOLATION impl+unit (no toml change — D6-3 is the one activation).

The supervisor consumes D6-1's two primitives and gains the rollback arm. **This
is the bulk of D6** (the D5-2 analogue — the consumer + the real harness).

- **Record-driven binary selection (honors "no file manipulation at failure
  time").** The supervisor consults the durable record at each spawn to choose
  which binary to launch:
  - record `Applied` / absent → spawn `current_exe()` (the accepted binary, as
    today).
  - record `AppliedPending{version, rollback_binary}` → spawn `current_exe()` as
    the **trial candidate**, *gated* (below).
  - record `RolledBack{rollback_binary}` → spawn the **`rollback_binary`** path
    directly (the last-known-good `.old-N`), **never** `current_exe` (whose bytes
    are the quarantined bad version).
  This makes rollback survive a reboot for free: a fresh broker start re-reads the
  record and keeps spawning the good binary — no rename of the on-disk
  `current_exe` is ever needed, exactly the design's "the broker holds both paths
  and chooses which to spawn (no file manipulation at failure time)."
  - **[open-call-3] Land an honest `docs/DEFERRED.md` row for the residual.**
    Record-driven selection rescues the *brain*, but `current_exe` on disk still
    holds the quarantined bytes — an OS-service/autostart/`ensure_running`
    relaunch execs them **as the broker**, and **if the bad binary panics before
    broker bind the node is down until manual intervention** (selection cannot
    save a binary that won't boot). The deferred reconcile (calm-path rename of
    the good bytes over `current_exe` on the next clean apply, or a broker
    self-reconcile once stable) closes it. The row must name this residual plainly,
    not read as cosmetic.
- **[A8] The trial latch is the RECORD, not `StartReason`.** A candidate is "on
  trial" **iff an unpromoted `AppliedPending` record is present — regardless of
  the spawn reason.** A reason-keyed gate breaks on two real paths: (a) the
  candidate **crashes** pre-ready → the supervisor respawns it with `reason=Crash`,
  but the on-disk bytes are still the candidate's, so the trial **must continue**
  (and that crash counts toward K); (b) the machine **reboots** mid-trial → a
  fresh broker's first spawn is `reason=Cold`, but the record is still
  `AppliedPending`, so **that spawn IS a trial**. `StartReason` stays purely the
  D5-anchor preserve-vs-reset discriminator; it does **not** gate the trial. Each
  trial spawn first **clears `brain.ready`** (A9) so the gate reads only this
  child's stamp.
- **The trial gate (bounded-retry → rollback).** For every trial spawn (latch =
  unpromoted `AppliedPending`):
  - **Promote on ready.** Poll `ready_generation()` (D6-1b) for a stamp == this
    spawn's generation, within the per-attempt healthy-run window
    (`SUPERVISE_HEALTHY_RUN`, 30 s, reused — not a new timer). **[A11/note-ii]
    slice the poll sleep like `sleep_interruptible`** so a daemon stop stays
    prompt during the window. On a match → write `Applied{version}` (promotion),
    clear the trial latch; the candidate is now the accepted binary and any later
    crash is a **normal** supervised respawn (capped backoff), **not** a rollback.
  - **Retry on pre-ready exit.** If the candidate **exits before** signaling
    ready, that is a **failed boot** — respawn the *same* candidate (the record is
    still `AppliedPending`, so still on trial) up to **K=3 consecutive pre-ready
    exits** (resolved with doyle; concrete + testable, matched to the existing
    capped-backoff cadence). A transient first-boot hiccup should not immediately
    abandon a good binary.
  - **[A11] Roll back on budget exhaustion — KILL the candidate first if it is
    alive-but-never-ready.** Two exhaustion paths both → rollback: K consecutive
    pre-ready *exits*, **or** the window elapses with the child **alive but never
    ready** (the "up but useless" panic-in-a-loop case). In the alive-never-ready
    path the supervisor **must kill the candidate child before spawning the
    rollback binary**; otherwise two brains would be live at once (every
    seed/socket/IPC assumption is single-brain). *This is the one place the system
    deliberately
    terminates a brain; endpoints are untouched throughout (the broker holds
    them) — say so in the rustdoc.* Then write
    `RolledBack{quarantine=N, running=N-1, rollback_binary}`, **switch selection
    to the rollback binary** (spawn `.old-N` henceforth), and **fire the loud
    notif** (below).
- **Quarantine (no auto re-apply / re-fetch).** The `RolledBack` record **is**
  the quarantine marker: the update pump (`peerloop.rs`) must **not** re-stage or
  re-apply the quarantined version. Add the guard at the stage/apply gate (a
  staged version equal to a `RolledBack.quarantine` is refused with a clear
  "quarantined — failed readiness on this node" reason). The bad bytes stay on
  disk (no deletion under duress) but are never spawned (record-driven selection)
  and never re-applied (quarantine guard).
  - **[note-i] Quarantine-clear semantics.** The equality guard clears
    **naturally** when version N+1 stages (N+1 ≠ the quarantined N). The path to
    name (don't build now): "version N was actually fine — the failure was
    environmental" needs an **explicit operator re-apply override** (even just a
    deferred `spt update apply --force`) so a node cannot be **permanently
    allergic** to a genuinely-good version. Record it as the deliberate escape
    hatch; the override itself can defer to the alarm/adapter era.
- **[A8/open-call-4] Rollback is SELECTION, not an update — it never touches the
  monotonic floor.** Rollback **spawns** the `.old-N` binary via the supervisor;
  it does **not** route through the update engine / `apply` / `verify_metadata`.
  So the anti-downgrade monotonic version floor (`REQ-HAZARD-UPDATE-ROLLBACK`,
  toml:629-631 — *attack* protection: refuse a version *downgrade*) **never sees**
  the recovery rollback, and the two concerns stay cleanly separate. **Record the
  non-interaction explicitly** so no future change "fixes" `verify_metadata` to
  permit recovery downgrades — *selection-not-apply is the mechanism, by design.*
- **Loud consent-style rollback notif.** On rollback, author a notif via the
  existing `produce_consent_notif` surface (`notif.rs:363-396`) — a distinct
  *kind*/body: "update vN failed to come up — rolled back to vN−1, quarantined;
  the daemon is healthy on the previous version." **[doyle watch-for]** "loud" =
  it must **resurface** (not fire-once-and-vanish) until acknowledged, because a
  rolled-back operator needs to *know* (the agent's surface map flagged the
  current consent notif as first-fire + boundary-resurface — confirm the resurface
  path covers the rollback kind so it is not a silent one-shot). Reuse the
  boundary-resurface machinery; do not build a new persistence path.
- **Supervisor signature grows, stays injectable for the harness.**
  `supervise_brain` gains the rollback context — the `spt_home` (to read the
  record + ready stamp), the record reader/writer, and the notif hook — kept
  **generic / injected** exactly as `spawn_child` is today (`brainproc.rs:305`),
  so the unit harness injects fakes (a scripted ready/no-ready, a capturing notif
  sink) and asserts the state machine deterministically without real processes.
  Production binds the real `spt_home` + relcache + notif sender.
- **Idempotent / saturating throughout** (the peerloop cadence-panic lesson):
  all duration math saturates; a missing record / missing ready file degrades
  to the safe branch (treat as no-trial / not-ready), never a panic in the
  broker's supervisor thread (a panic there orphans the brain).
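
The bullets above reduce to three pure decisions: which binary to spawn, what to do after one trial observation, and whether a staged version is quarantined. The sketch below shows one way to express them. The type names, function names, and the `TRIAL_PRE_READY_BUDGET` constant are hypothetical, and the `Record` enum is a trimmed stand-in for the D6-1 record that carries only what these decisions need.

```rust
use std::path::{Path, PathBuf};

/// K=3 from the plan; the constant name is an assumption.
const TRIAL_PRE_READY_BUDGET: u32 = 3;

/// Trimmed view of the durable record.
enum Record {
    Absent,
    Applied,
    AppliedPending,
    RolledBack { rollback_binary: PathBuf, quarantine_version: u64 },
}

/// Record-driven selection: the binary to spawn, and whether the spawn is a
/// gated trial. `StartReason` is deliberately not an input (A8).
fn select_binary<'a>(record: &'a Record, current_exe: &'a Path) -> (&'a Path, bool) {
    match record {
        Record::Absent | Record::Applied => (current_exe, false),
        Record::AppliedPending => (current_exe, true),
        Record::RolledBack { rollback_binary, .. } => (rollback_binary.as_path(), false),
    }
}

#[derive(Debug, PartialEq)]
enum Observation { Ready, ExitedPreReady, WindowElapsedAlive }

#[derive(Debug, PartialEq)]
enum Step { Promote, RetryCandidate, RollBack { kill_candidate_first: bool } }

/// One trial observation in, next step plus updated pre-ready exit count out.
fn trial_step(obs: Observation, pre_ready_exits: u32) -> (Step, u32) {
    match obs {
        Observation::Ready => (Step::Promote, 0),
        Observation::ExitedPreReady => {
            let exits = pre_ready_exits.saturating_add(1);
            if exits >= TRIAL_PRE_READY_BUDGET {
                // The child already exited, so there is nothing to kill.
                (Step::RollBack { kill_candidate_first: false }, exits)
            } else {
                (Step::RetryCandidate, exits)
            }
        }
        // A11: alive but never ready, so kill before spawning the rollback binary.
        Observation::WindowElapsedAlive => {
            (Step::RollBack { kill_candidate_first: true }, pre_ready_exits)
        }
    }
}

/// Quarantine guard: `Some(reason)` refuses the staged version.
fn quarantine_refusal(record: &Record, staged_version: u64) -> Option<&'static str> {
    match record {
        Record::RolledBack { quarantine_version, .. }
            if *quarantine_version == staged_version =>
        {
            Some("quarantined: failed readiness on this node")
        }
        _ => None,
    }
}
```

Because all three functions are free of I/O, a harness can drive the state machine with scripted observations and assert each step without waiting on a real window.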

Evidence — **[open-call-4 resolved: tag `REQ-HAZARD-BROKER-PROCESS-ISOLATION`, do
NOT reuse `REQ-HAZARD-UPDATE-ROLLBACK`]** (that hazard is anti-*downgrade-attack*
protection — the monotonic floor — the opposite concern from recovery rollback;
conflating them muddies both. Q7 failure-atomicity is an arm of the restoration
hazard ADR-0018 minted for exactly this — no new REQ):
`[impl->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (the supervisor gates promotion on
ready and auto-rolls-back to the last-known-good binary — the brain restart fails
*safe*, endpoints held; bounded-retry → rollback + quarantine; rollback is
selection-not-apply) · `[unit->REQ-HAZARD-BROKER-PROCESS-ISOLATION]` (**the real
spawn/kill/respawn harness, forward-correct not pre-failed**: (1) a candidate that
signals ready for its generation within the window → record promoted to `Applied`,
no rollback, a later kill respawns the *same* accepted binary; (2) a candidate
that exits pre-ready K=3 times → record corrected to `RolledBack`, selection
switches to the rollback binary, the rollback binary is what spawns next, a
capturing notif sink received the loud rollback notif; (3) **[A11]** the
alive-but-never-ready path: window elapses with the child alive → the supervisor
**kills** it, then spawns the rollback binary (assert: the candidate pid is dead
before the rollback pid exists — never two live brains); (4) the staleness guard
end-to-end: a stale generation-N−1 ready file does **not** promote generation N,
**and [A9] a stale file stamping the *exact* trial generation does not promote
either** (the clear-before-spawn is load-bearing); (5) **[A8]** an `AppliedPending`
record + a `Cold`/`Crash` start is **still gated** (trial latch is the record, not
the reason) and still rolls back on budget exhaustion; (6) record-driven reboot: a
fresh supervisor reading a `RolledBack` record spawns the rollback binary, not
`current_exe`; (7) quarantine: a staged version == `RolledBack.quarantine` is
refused). The **int** process-level survival re-point stays **D7**.
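
Harness cases (1) and (3) both hinge on the promote-on-ready poll and its sliced sleep. A minimal sketch of that loop follows, assuming an injected `probe` closure in place of `ready_generation` and an `AtomicBool` stop flag; `poll_ready` and `PollOutcome` are hypothetical names, and the real `sleep_interruptible` may signal a stop differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum PollOutcome { Ready, WindowElapsed, Stopped }

/// Poll `probe` until it reports `trial_generation`, the window elapses, or
/// `stop` is raised. The sleep is sliced so a daemon stop is noticed within
/// one `slice` rather than one full window.
fn poll_ready(
    mut probe: impl FnMut() -> Option<u64>,
    trial_generation: u64,
    window: Duration,
    slice: Duration,
    stop: &AtomicBool,
) -> PollOutcome {
    let started = Instant::now();
    loop {
        if stop.load(Ordering::Relaxed) {
            return PollOutcome::Stopped;
        }
        if probe() == Some(trial_generation) {
            return PollOutcome::Ready;
        }
        // Saturating: never underflows if the window has already passed.
        let remaining = window.saturating_sub(started.elapsed());
        if remaining.is_zero() {
            return PollOutcome::WindowElapsed;
        }
        thread::sleep(slice.min(remaining));
    }
}
```

A `WindowElapsed` result with the child still alive is the A11 path: the caller kills the candidate before it spawns the rollback binary.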

---

## D6-3 — [V1] Rollback-state-compat tripwire + KH 6.8 promotion + REQ activation · Q7-V1

> **DONE (local, pre-CI).** New `rollback_compat.rs`: `PRE_READY_DURABLE_FILES`
> registry (deadline-<key>.json / applied-state.json / brain.ready) + the
> tripwire unit test (each shape additive / N-1-readable — load-bearing fields
> present, an unknown extra field still deserializes; a non-additive change
> trips it). KH 6.8 promoted (D5 note → D6 guard). The one toml activation:
> `REQ-HAZARD-ROLLBACK-STATE-COMPAT` doc→`[doc,impl,unit]` co-landing its
> evidence. D6 milestone complete (D6-1/D6-2/D6-3).

D6 mints the forward invariant that keeps auto-rollback honest as durable-state
schemas evolve — asserted **now, while free** (zero migrations exist), because it
is unmintable retroactively once a migration ships. (The D5-3 analogue: lock the
invariant + docs + the one toml activation.)

- **The invariant, as an explicit tested tripwire (not a migration framework).**
  *A brain must not irreversibly migrate durable state before ready-promotion —
  every pre-ready write stays N-1-readable.* Land it minimally + honestly
  (activate-don't-pre-fail — building a migration engine for migrations that
  don't exist is the dead-code anti-pattern):
  - **Enumerate the pre-ready durable writes** in one place — the files a brain
    writes *before* it signals ready, which a rolled-back N−1 brain must still
    read: `deadline-<key>.json` (D5), the two-phase applied record (D6-1), the
    generation-stamped **`brain.ready`** breadcrumb (**[A10]** — its own shape
    changed in D6-1b, so the tripwire owns it from birth), and any future
    addition. A small registry/const list + a doc comment naming the rule.
  - **A schema marker per durable file** (a version field or an additive-only
    contract) + a **tripwire unit test** asserting the discipline: the test
    encodes "these pre-ready files are additive / N-1-readable at their current
    shape," so the day someone changes a durable file's shape in a
    non-additive/pre-ready way, **this test fails** and forces the migration
    behind ready-promotion (or into an N-1-tolerant additive form). The tripwire,
    not a framework, is the deliverable.
- **KH 6.8 promotion.** Today KH 6.8 carries a **D5 conformance note** (the
  `deadline-<key>.json` is additive → rollback-N-1-safe by construction). Promote
  it to the **D6 guard**: the invariant is now *asserted* (the tripwire), not just
  noted; the pre-ready durable-file set is enumerated; a future *migration* of any
  of them is the thing the guard trips on. Dual-audience per DOCS-STRATEGY;
  `xtask check` stays drift-clean.
- **The two-phase record is itself rollback-N-1-safe — record it.** The new
  applied-record file (D6-1) is **additive** durable state: a rolled-back pre-D6
  binary does not know it → ignores it (falls back to the old `applied.json`/
  `last-outcome.json` it does understand, which D6 keeps writing alongside for the
  convergence query). No existing-file schema migration → N-1-safe by
  construction, same as the D5 anchor. Note it in KH 6.8 beside the deadline note.
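
One way to picture the tripwire is as a registry of load-bearing fields plus an additive-only check. The sketch below is an assumption-heavy illustration: the `DurableFile` struct, `n1_readable`, and `tripped` are hypothetical, and only `brain.ready` is filled in because its `{pid, generation}` shape is the one stated in this plan. The field lists of the other registered files are not guessed here.

```rust
/// Hypothetical registry entry: a pre-ready durable file plus the fields an
/// N-1 reader depends on.
struct DurableFile {
    name: &'static str,
    load_bearing: &'static [&'static str],
}

/// Only `brain.ready` is listed; the other registered files are omitted
/// rather than invented.
const PRE_READY_DURABLE_FILES: &[DurableFile] = &[DurableFile {
    name: "brain.ready",
    load_bearing: &["pid", "generation"],
}];

/// Additive-only check: every field the N-1 reader needs is still present.
/// Extra fields are tolerated; a removed or renamed field is not.
fn n1_readable(load_bearing: &[&str], current_fields: &[&str]) -> bool {
    load_bearing
        .iter()
        .all(|field| current_fields.iter().any(|c| c == field))
}

/// Names of registered files whose current shape trips the wire.
/// `current_fields_of` stands in for however the test derives today's shape.
fn tripped(current_fields_of: impl Fn(&str) -> Vec<&'static str>) -> Vec<&'static str> {
    PRE_READY_DURABLE_FILES
        .iter()
        .filter(|f| !n1_readable(f.load_bearing, &current_fields_of(f.name)))
        .map(|f| f.name)
        .collect()
}
```

Adding a field leaves `tripped` empty, while dropping or renaming `generation` returns `["brain.ready"]`, which is the failure that should push the change behind ready-promotion.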

Evidence: `[impl->REQ-HAZARD-ROLLBACK-STATE-COMPAT]` (the pre-ready durable-file
registry + schema markers — the guard surface) · `[unit->REQ-HAZARD-ROLLBACK-
STATE-COMPAT]` (the tripwire: the current pre-ready file set is additive /
N-1-readable; a non-additive pre-ready change trips the assert) + the KH 6.8
doc-note promotion (`[doc->REQ-HAZARD-ROLLBACK-STATE-COMPAT]` already exists →
add impl/unit). **This is the one toml activation in D6** (rule 5): set
`REQ-HAZARD-ROLLBACK-STATE-COMPAT.required_stages = ["doc","impl","unit"]` in the
**same commit** that lands the guard + tripwire. `traceable-reqs check` stays
green at the commit (evidence + activation land together).

---

## Sequencing

D6-1 (the two primitives — two-phase record + generation-stamped ready contract,
unit-proven in isolation) → **D6-2** (the supervisor consumes them; the real
spawn/kill/respawn harness proves promote-on-ready / bounded-retry→rollback /
quarantine / notif / record-driven-reboot) → **D6-3** (mint the V1 tripwire,
promote KH 6.8, the one toml activation). D6-1 before D6-2 because the supervisor
gates on the primitives; D6-3 last because it asserts the cross-binary durable-
state invariant the first two establish (and is the only activation).

## N-1 compat — does NOT grow in D6

D6 adds **no new IPC wire field**. The ready signal is a generation-stamped
*disk breadcrumb* (brain-local), the applied record is *disk state* — neither
crosses the broker↔brain IPC verb surface. The generation + StartReason it keys
on already ship as D3-2's argv field. The D7 old-broker × new-brain
verb-surface harness therefore gains **no new assertion** in D6 (it grew in
D3-2 and D4-1; D5 and D6 contribute none). The only D6 versioned surfaces are the
**disk-file shapes** (ready breadcrumb + two-phase record), governed by the
additive / corrupt-degrades / **rollback-N-1-safe** rules above — i.e. by the V1
invariant D6-3 mints, not by KH-2.3 wire compat. Record this so the D7 close-out
does not hunt for a D6 wire field that doesn't exist.

## Traceability — one toml activation in D6 (D6-3)

| REQ | State entering D6 | D6 adds | Activation note |
|-----|-------------------|---------|-----------------|
| `REQ-HAZARD-BROKER-PROCESS-ISOLATION` | `[doc,impl,unit]` | impl + unit (readiness gate, auto-rollback, bounded-retry, staleness invariant) | already active; **int → D7** |
| `REQ-HAZARD-ROLLBACK-STATE-COMPAT` | `[doc]` | impl + unit (the tripwire guard) | **activated at D6-3** → `[doc,impl,unit]` |
| `REQ-UPD-6` (applied/convergence record) | active | impl + unit (two-phase record) | already active; no stage change |

**[open-call-4 resolved] No new REQ; auto-rollback tags
`REQ-HAZARD-BROKER-PROCESS-ISOLATION`.** `REQ-HAZARD-UPDATE-ROLLBACK` (toml:629-631)
is anti-downgrade-*attack* protection (the monotonic version floor) — the *opposite*
concern from recovery rollback; conflating them muddies both. Q7 failure-atomicity
is an arm of the restoration hazard ADR-0018 minted for exactly this.

Rule 5: the **only** `required_stages` change in D6 is
`REQ-HAZARD-ROLLBACK-STATE-COMPAT` `[doc]` → `[doc,impl,unit]`, landed in the D6-3 commit
that delivers the tripwire evidence. `traceable-reqs check` stays green at every
commit.

## Risks / watch-items (baked in for the vet)

- **The staleness invariant is load-bearing — and has a cross-lifetime hole
  [A9].** A bare, un-stamped ready file would let a candidate that panics within
  microseconds be falsely *promoted* off its predecessor's stale breadcrumb.
  Generation-stamping guards *same-lifetime* staleness, but the supervisor's generation is
  **in-memory and resets to 0 on a broker restart**, so a reboot-mid-trial can
  make a stale gen-0 file match a fresh gen-0 trial. **The clear-before-each-trial-
  spawn (delete `brain.ready` before spawning) is what actually closes it** — test
  both the N−1 case **and** the exact-same-generation-after-reboot case.
- **[A8] Trial latch = the RECORD, never `StartReason`.** A reason-keyed gate
  silently breaks on a pre-ready crash (respawn is `Crash`, trial must continue)
  and on a reboot-mid-trial (fresh spawn is `Cold`, but it's still a trial). The
  unpromoted `AppliedPending` record is the only latch; reason is the D5-anchor
  discriminator and nothing more.
- **[A11] Never two live brains.** The alive-but-never-ready rollback path must
  **kill the candidate child before** spawning the rollback binary — every
  seed/socket/IPC assumption is single-brain. This is the one deliberate brain
  kill; endpoints stay up (broker holds them). Assert pid-dead-before-rollback-pid
  in the harness.
- **Promotion latch vs permanent watchdog.** Rollback triggers on the
  *first-boot trial* only. Do **not** turn the supervisor into a permanent
  crash-loop watchdog that rolls back a long-accepted binary that crashes weeks
  later for an unrelated reason — that is a different failure class and would
  thrash a healthy fleet. The latch clears on the first ready-promotion.
- **"No file manipulation at failure time" (the design's explicit constraint).**
  Resist the temptation to rename `.old-N` back over `current_exe` at the failure
  instant (the brittle multi-step path the design rejected). Selection is
  record-driven: the broker *chooses which path to spawn*. The on-disk
  `current_exe` reconciliation is a deferred follow-up, not failure-time work.
- **Quarantine must actually stop the pump (no re-apply / re-fetch loop).** A
  rolled-back version that the update pump immediately re-stages + re-applies is
  an infinite rollback loop. The `RolledBack.quarantine` guard at the stage/apply
  gate is not optional — without it the self-heal becomes a self-harm loop.
- **`apply` returns provisional now.** The CLI's `Applied` becomes
  `applied-pending` — the operator's `spt update apply` no longer *proves* the new
  code is live, only that the trial started. The notif + convergence query carry
  the real verdict. Make the CLI message honest (don't print "applied" when it is
  pending) — a false "applied" is exactly the `enlyzeam` optimism D6 exists to
  kill.
- **Forward-correct, not field-exercised (keep the harness real).** No hosted
  sessions today → resume is a near-noop and no real field update runs on the
  dev box. The proof is the **real spawn/kill/respawn harness** (a real candidate
  that fails-to-ready, a real rollback observed against a temp `spt_home`), not a
  mock. Same posture as D4/D5 — genuine machinery the adapter inherits.
- **The notif must be loud (resurface, not fire-once).** A rollback the operator
  never sees defeats the consent-style intent. Confirm the rollback notif rides
  the boundary-resurface path, not a silent one-shot.
- **Don't break the convergence table.** Keep writing the legacy `applied.json`/
  `last-outcome.json` (the M8 `xtask debug-converge` reads them) alongside the new
  two-phase record — the new record is additive, the old query still works.
- **Supervisor panic = orphaned brain.** Every new branch in `supervise_brain`
  must be panic-free (saturating math, safe degrades). A panic in the broker's
  supervisor thread drops the supervision the whole self-heal depends on.
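
The A8, A9, record-driven-selection and quarantine watch-items above reduce to a
few pure predicates. What follows is a minimal Rust sketch under stated
assumptions, not the `brainproc.rs` implementation. Only the variant names
`AppliedPending` / `Applied` / `RolledBack`, the `version`, `rollback_binary` and
`quarantine` fields, and the `brain.ready` filename come from this plan. The
function names, the field types, and the idea that `RolledBack` also carries
`rollback_binary` are hypothetical.

```rust
use std::path::{Path, PathBuf};
use std::{fs, io};

/// Hypothetical shape of the two-phase applied record (field types assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum AppliedRecord {
    AppliedPending { version: String, rollback_binary: PathBuf },
    Applied { version: String },
    // Assumption: the rolled-back record keeps the good binary's path so a
    // rebooted broker can still select it.
    RolledBack { quarantine: String, rollback_binary: PathBuf },
}

/// [A8] The trial latch is the record alone. There is deliberately no
/// `StartReason` parameter: a `Crash` or `Cold` respawn is still a trial.
pub fn in_trial(record: Option<&AppliedRecord>) -> bool {
    matches!(record, Some(AppliedRecord::AppliedPending { .. }))
}

/// Record-driven selection: choose which path to spawn, never rename files
/// at failure time.
pub fn binary_to_spawn<'a>(record: Option<&'a AppliedRecord>, current_exe: &'a Path) -> &'a Path {
    match record {
        Some(AppliedRecord::RolledBack { rollback_binary, .. }) => rollback_binary.as_path(),
        _ => current_exe,
    }
}

/// Promotion requires a breadcrumb stamped with the generation under trial.
pub fn is_promotable(ready_generation: Option<u64>, trial_generation: u64) -> bool {
    ready_generation == Some(trial_generation)
}

/// [A9] Delete `brain.ready` before each trial spawn. A missing file is fine.
pub fn clear_ready(spt_home: &Path) -> io::Result<()> {
    match fs::remove_file(spt_home.join("brain.ready")) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

/// Quarantine guard for the stage/apply gate: refuse the rolled-back version.
pub fn may_stage(record: Option<&AppliedRecord>, candidate_version: &str) -> bool {
    !matches!(
        record,
        Some(AppliedRecord::RolledBack { quarantine, .. })
            if quarantine.as_str() == candidate_version
    )
}
```

Note that `is_promotable(Some(0), 0)` is true: the predicate alone cannot tell a
stale gen-0 breadcrumb from a fresh one after a broker restart, which is why
`clear_ready` has to run before every trial spawn.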

## Resolved open calls (doyle vet, 2026-06-10)

1. **`apply` return semantics** — **RESOLVED: return `pending` immediately.** With
   honest CLI text ("trial started; verdict via notif / `debug-converge`"). A
   false "applied" is the `enlyzeam` optimism D6 kills.
2. **Retry budget** — **RESOLVED: `K=3` consecutive pre-ready exits.** Precision:
   the healthy-run window is **per-attempt** (30 s each, existing const); K counts
   exits-before-ready; the alive-but-never-ready case consumes its attempt at
   window expiry via the A11 kill. Either exhaustion path → rollback.
3. **On-disk `current_exe` reconciliation after rollback** — **RESOLVED:
   deferred follow-up** (record-driven selection is correct *now*). **But the
   DEFERRED row must NAME the residual honestly, not read as cosmetic:** after a
   rollback the on-disk `current_exe` still holds the quarantined bytes, so an
   OS-service / autostart / `ensure_running` relaunch execs `current_exe` **as the
   broker**. Record-driven selection rescues the *brain* (the reboot case above),
   **but it cannot save a binary that won't boot** — if the bad binary panics
   *before broker bind*, the node is **down until manual intervention**. The
   deferred reconcile (calm-path rename of the good bytes over `current_exe` on
   the next clean apply, or a broker self-reconcile once stable) is what closes
   that hole. **Add this as an honest `docs/DEFERRED.md` row** (not a failure-time
   rename, which is the brittle path the design rejected).
4. **REQ id for auto-rollback** — **RESOLVED registry-first: tag
   `REQ-HAZARD-BROKER-PROCESS-ISOLATION`, do NOT reuse `REQ-HAZARD-UPDATE-ROLLBACK`**
   (toml:629-631 is anti-downgrade-*attack* / monotonic-floor — the opposite
   concern). No new REQ; Q7 is the failure-atomicity arm of the restoration hazard
   ADR-0018 minted. (Folded into D6-2 evidence + the traceability table.)
5. **Ready channel** — **RESOLVED: the FILE** (generation-stamped `brain.ready`).
   The supervisor has no broker IPC conn; the file is reboot-readable, follows the
   D5-anchor pattern, and corrupt-degrades **SAFE to not-ready**. With A9
   (clear-before-trial-spawn) + A10 (format discipline) it is solid; an IPC
   ready-verb would grow the KH-2.3 wire surface for no gain (keep
   "N-1-does-not-grow-in-D6" true).

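The retry budget in call 2 can be modelled as a small pure counter. This is a
sketch under assumptions rather than the supervisor's code: `K=3` and the rule
that window expiry consumes an attempt come from the resolution above, while
`TrialBudget`, `AttemptOutcome`, `Verdict`, and the choice to roll back *on* the
third failure are hypothetical. The 30 s per-attempt window and the A11 kill are
assumed to be enforced by the caller.

```rust
/// K from the resolved open call: consecutive pre-ready failures tolerated.
pub const MAX_PRE_READY_EXITS: u32 = 3;

/// What the supervisor observed for one attempt (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttemptOutcome {
    /// A breadcrumb with the matching generation appeared inside the window.
    Ready,
    /// The candidate exited before becoming ready.
    ExitedBeforeReady,
    /// The window expired with the candidate alive but not ready; the caller
    /// is assumed to have killed the child already [A11].
    WindowExpired,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Promote,
    RetryCandidate,
    RollBack,
}

/// Counts failed attempts within a single trial. A `Ready` outcome ends the
/// trial, so the count is consecutive by construction.
#[derive(Debug, Default)]
pub struct TrialBudget {
    failed_attempts: u32,
}

impl TrialBudget {
    pub fn record(&mut self, outcome: AttemptOutcome) -> Verdict {
        match outcome {
            AttemptOutcome::Ready => Verdict::Promote,
            AttemptOutcome::ExitedBeforeReady | AttemptOutcome::WindowExpired => {
                // Saturating math keeps this branch panic-free.
                self.failed_attempts = self.failed_attempts.saturating_add(1);
                if self.failed_attempts >= MAX_PRE_READY_EXITS {
                    Verdict::RollBack
                } else {
                    Verdict::RetryCandidate
                }
            }
        }
    }
}
```

Both exhaustion paths share one counter here, so a mix of early exits and
expired windows reaches rollback after the same three attempts.
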
## Immediate next step

Start **D6-1**: (a) add the structured two-phase applied record to `relcache.rs`
(`AppliedPending`/`Applied`/`RolledBack`, atomic-write + corrupt-degrades-safe,
the `AppliedPending` carrying `version` + `rollback_binary`), and flip
`applyhost.rs:177-179` from eager `applied` to `AppliedPending` at swap; (b) change
`brain.ready` to JSON `{pid, generation}` **[A10]** (update the `brain_split.rs`
parser in the same commit; it is the only reader), move its write to
post-`resume_sessions`, and add the pure `ready_generation(spt_home) -> Option<u64>`
reader.
Unit-prove the record round-trip + the staleness invariant — both the N−1-stamped
case **and [A9]** the exact-same-generation-after-reboot case (which only the
supervisor's clear-before-spawn, D6-2, defeats) — with explicit values **before**
D6-2 wires the supervisor to gate on them. Do **not** add an IPC wire field (the
ready signal is disk-local, like the D5 anchor).
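
As a starting point for step (b), the pure reader might look like the sketch
below. The signature `ready_generation(spt_home) -> Option<u64>` and the JSON
`{pid, generation}` shape are from this plan; everything else is an assumption.
In particular, the hand-rolled `json_u64_field` helper is hypothetical and only
exists to keep the example std-only; the real parser would presumably use
whatever JSON handling the crate already has. The property being illustrated is
corrupt-degrades-safe: any unreadable, truncated, or malformed file yields
`None`, which the gate treats as not-ready.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper: pull an unsigned integer field out of a flat JSON
/// object. Returns `None` on a missing key, a non-numeric value, or overflow.
fn json_u64_field(text: &str, key: &str) -> Option<u64> {
    let needle = format!("\"{key}\"");
    let start = text.find(&needle)? + needle.len();
    let rest = text[start..].trim_start().strip_prefix(':')?.trim_start();
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Reads the generation stamped into `brain.ready`, or `None` if the file is
/// missing, truncated, or malformed. Never panics.
pub fn ready_generation(spt_home: &Path) -> Option<u64> {
    let raw = fs::read_to_string(spt_home.join("brain.ready")).ok()?;
    let text = raw.trim();
    // A write cut off before the closing brace is treated as corrupt.
    if !(text.starts_with('{') && text.ends_with('}')) {
        return None;
    }
    // Require both fields so a bare or partial breadcrumb cannot promote.
    json_u64_field(text, "pid")?;
    json_u64_field(text, "generation")
}
```

The unit proofs named above would then pin explicit values, for example a
breadcrumb stamped with generation 3 read during a generation-4 trial, and a
generation-0 breadcrumb left over from before a broker restart.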