# Broker/Brain Split Restoration — Design Rationale (pre-ADR)

**Status:** **Ratified (2026-06-09).** Independently verified-with-amendments (2026-06-09, agent `doyle` — see §0); six amendments folded in (marked **[V1]**–**[V6]**). The operator ratified both open decisions: the §9 artifacts and the **sequencing** decision (§6.1 — restoration = next milestone, before `spt-claude-code`). The §9 artifacts are now **delivered** (ADR-0018, `REQ-HAZARD-BROKER-PROCESS-ISOLATION` + `REQ-HAZARD-ROLLBACK-STATE-COMPAT`, KNOWN-HAZARDS 6.7/6.8); build plan = `RESTORATION-PLAN.md`. The only genuinely-external item left is the legacy multipart-truncation row (§7, routed to `claude_skill_owl`).

**Audience:** dual (human reviewer + AI dev-agent). This is a *why* document, not an implementation plan.

---

## 0. Verification (2026-06-09, independent — agent `doyle`)

This document was independently verified against ADR-0004 (incl. §A/§B amendments), the PRD (Goal 5, R-DAEMON-2, R-UPD-3, success criterion 4), `CONTEXT.md`, KNOWN-HAZARDS (2.3 / 2.4 / 2.5 / 5.9 / 7.2 / 7.4), the `traceable-reqs` evidence tags, and a source-level audit of every `file:line` claim.

**Verdict: the core holds.** Confirmed at source: the broker is an in-process thread (no broker child-spawn exists anywhere in the repo); no doc/plan/commit marks the collapse deliberate (silent drift; regression framing correct); `update.rs:233-234`'s "exec the new binary's brain" is aspirational and never wired; `applyhost.rs:176` records `applied` before the handoff (the optimistic `applied.json` observed on `enlyzeam`); `applyhost.rs:238-239` re-attaches `sessions.first()` with `from_seq=0` (both Q6 gaps real). Q2, Q3, Q5 (incl. the shellwake exception), Q6, Q7's direction, and Q8 agreed; update-exit as crash-equivalent is right (KH 7.2 plus the `EffectJournal` guarantee idempotency at that boundary).

**Two audit corrections folded in:** the NetHost is **already broker-owned** (`broker.rs:175` `OnceLock`, per the §B M4-D4a note), so the Q5 net item is near-free, not a migration; the supervise-backoff is wired at `peerloop.rs:805` (M8-D4), confirming Q7's reuse claim.

**Six amendments raised and accepted** — each folded into the relevant section below and marked **[V1]**–**[V6]**.

---

## 1. Why this session happened

This came out of live operations, not a planning cycle.

While verifying the v0.3.2 cross-OS update fix end-to-end on the real fleet, the third node (`enlyzeam`, a Windows box hosting the `gerald` ready_agent) was updated:

- `spt update apply` returned `APPLIED:6`.
- The installed `spt.exe` on disk was the valid 0.3.2 Windows PE (magic `4d5a`, sha matched the published binary), `--version` reported `0.3.2`, `applied.json` = `{version:6}`, backup `spt.exe.old-6` present.
- **But the running daemon's pid was unchanged, and it was still executing 0.3.0 in memory.**

The operator's question: spt was specced with a seamless "hot swap" — why did `apply` change the on-disk binary but not the *running code*?

Investigation established that:

1. The hot-swap that *is* wired into `apply` is a **per-session brain handoff** (`apply_brain_only` → `Brain::handoff`), whose job is endpoint survival (REQ-UPD-3), not running-process replacement.
2. By explicit design, `applyhost.rs:36-38` states: *"The daemon's own resident loops keep running the old logic until its next restart (`ensure_running` spawns the executable path — now the new binary); that restart is routine lifecycle, never an endpoint cycle."*
3. So new code only goes live on a *subsequent, unrelated* daemon restart (next logon, or a manual restart — which is what made the *other* node, `hfenduleam`, show a new pid). The practical consequence on `enlyzeam`: `gerald`'s live daemon kept running 0.3.0 — including the message-envelope codec — so the box from the original bug report **was still reproducing the original `\r` corruption even though the fix sat on disk.**

The operator then asked the real question this document answers: **what would it take to make an update actually hand off to the new executable seamlessly — no dropped processes, internal loop timings consistent across the swap, plus any other angles that matter?**

Walking that question backward surfaced the actual root cause (Section 2), which reframed the whole task from "add a feature" to "correct a regression."

---

## 2. The root cause: unintended spec/impl drift

### 2.1 What the spec decided

ADR-0004 (*"Single consolidated daemon with broker/brain split; peer-propagated gated self-update"*, **accepted 2026-05-29**) decided a **two-layer split implemented as two processes**:

- **broker** (stable kernel) — holds *only* the un-transferable, must-not-die resources: PTY master fds, spawned harness child processes, listening network sockets, and (per the §B ownership table) the Iroh/QUIC endpoint + conn table. Minimal, versioned local IPC. Almost never updates.
- **daemon brain** (userspace) — all logic (routing, registry, pulse/psyche loops, manifest parsing, update orchestration). Restarts freely on update; rehydrates from disk and re-attaches to broker-held handles.

Explicit evidence the split was meant to be a **process** boundary:
- ADR-0004 Consequences: *"A small internal broker **process** exists beneath the daemon — a deliberate, bounded walk-back of 'literally one process,' … guaranteeing endpoint survival across updates."*
- ADR-0004 rejected alternatives include **"whole-daemon live FD-passing"** and **"drain + restart"** — i.e. the chosen design is specifically *restart the brain, the broker process survives.*
- `M3a-PLAN.md`, `M3-PLAN.md`, `M3b-PLAN.md` all name "the broker **process** + IPC (M3b)."
- Spikes 01/03/04/05/06 each **PROVEN** with *two separate binaries* (`spt-spikes/spike-01-broker-handoff`): a PTY child + a live QUIC transfer survive 100× brain restarts, gapless and exactly-once. A spike that proves "survives a brain restart" presupposes the broker is a separate process.

### 2.2 What production actually does

The production daemon runs the broker as a **background thread inside the single `spt daemon` process**:

- `daemon.rs:41-44` — *"bind the broker (served on a background thread) + the seed-control channel (the foreground loop)."*
- `daemon.rs:165-170` — `Broker::bind_in_with_net(...)` returns an `Arc<Broker>`; the daemon `thread::spawn`s `broker.serve()`. `broker.rs:181/196` confirm `bind` returns `Arc<Self>` (an in-process object, not a spawned child).
- The pump (`daemon.rs:280`), dispatcher, digest hub, net consumers, psyche loops — **all threads in the same process.**
- A repo-wide search finds **no** `Command`/subcommand that spawns a broker child anywhere.

### 2.3 Why this is a regression, not a deferral

The operator confirmed there was **no deliberate decision** to collapse the two processes into one; it is unintended drift. The consequences are not cosmetic:

- The brain cannot restart onto a new binary without killing the in-process broker thread — which would close every PTY, orphan every harness child, and drop every listening socket. So **the no-endpoint-drop self-update pillar is silently unrealized.**
- `apply` therefore performs an in-process `Brain::handoff` that re-attaches a subscriber within the *same old process* — achieving nothing for a binary swap. `update.rs:234`'s statement that "the live daemon execs the verified new binary" is **aspirational; it was never wired** because the process split it depends on does not exist in production.
- Updated code does not run until an unrelated restart or logon (observed live, Section 1).

This document's decisions restore ADR-0004's intended model.

> **Glossary note already applied:** `CONTEXT.md` was sharpened during the session to state explicitly that there is **one broker per machine** (per `SPT_HOME`), *not* one per endpoint — a single broker hosts every endpoint's resources and is present even with zero endpoints online. This corrected a latent "one broker per endpoint" misconception and is load-bearing for decision Q2.

---

## 3. Design goal

Restore the two-process model so that a routine **brain-only** update (the common case per ADR-0004 §A's class taxonomy) hands off to the new executable seamlessly:

1. no endpoint or harness child process is dropped or suspended (REQ-UPD-3),
2. internal loop timings stay consistent across the swap,
3. a failed update fails *safe*,
4. the mechanism is uniform across Windows and Linux.

The broker-touching update classes (broker-compatible / broker-breaking) remain as ADR-0004 left them — out of scope here; this work is the **brain-only** path's realization.

---

## 4. Decisions and their reasons

Each decision lists the alternatives considered and why they were rejected, so a verifier can check the reasoning, not just the conclusion.

### Q2 — Supervision topology: the broker is the always-up per-machine anchor

**Decision.** The broker process is the long-lived per-machine anchor. It owns the **seed-control lock + liveness** (the single-daemon-per-home invariant and the `ensure_running`/`is_running` ping target) and **supervises/spawns the brain** as its child. On update, the broker re-spawns the brain.

**Reason.** The broker is the one component guaranteed present whenever the machine's daemon is alive — it runs even with zero endpoints (the "bare daemon" case, `daemon.rs:196`). The seed lock and liveness *must* survive a brain restart, or a second daemon could win the bind during the restart window and `is_running` would flap. Anchoring them in the never-restarting layer makes the single-instance invariant and liveness survive brain restarts for free.

**Alternatives rejected.**
- *A thin third supervisor process spawning both broker and brain* — adds a process for no invariant the broker can't already hold.
- *Brain as parent, spawning the broker* — incoherent: a brain restart would orphan or kill its own broker child, dropping endpoints.

**Operator's challenge that strengthened this.** The operator initially held a "one broker per endpoint" model and correctly objected that liveness shouldn't live in something that only exists when an endpoint is online. The code shows one broker *per machine*, always present, so the objection's principle is satisfied and actually reinforces putting liveness in the broker.

### Q3 — Update trigger: brain self-exit, broker auto-respawn from disk

**Decision.** `apply` verifies + swaps the binary on disk, then signals the brain to **snapshot and self-exit**. The broker — already the brain's supervisor — observes the exit and **respawns from the executable path (now the new binary)**; the new brain re-attaches.

**Reason.** This reuses primitives that already exist. `apply_brain_only` already models exactly "snapshot the outgoing brain, drop it — the crash boundary the broker already tolerates — re-attach." The broker must already supervise and respawn the brain for crash recovery, so an update is just a *planned* snapshot+exit handled by the existing respawn path, and "respawn from the executable path" naturally picks up the swapped binary — which is literally what `applyhost.rs:37` says happens "at next restart." We simply trigger that restart immediately instead of waiting for logon.

**Alternatives rejected.**
- *Broker actively cycles the brain on command* — nearly identical, but puts cycle logic into the layer we want to keep dumb and stable.
- *The `spt update apply` CLI spawns the new brain directly* — parentage is wrong: the brain would be a child of the transient CLI process and die when the command returns.
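
A minimal sketch of the respawn path described above. It assumes a hypothetical `spawn_brain` helper and hypothetical `--start-reason` / `--generation` flags; none of these names are confirmed by the source (only the `daemon brain` entry is named, in §5). The point it illustrates is that the broker resolves the executable path at each spawn, so a binary swapped by `apply` is picked up by the ordinary respawn, with no `exec` involved.

```rust
use std::io;
use std::path::Path;
use std::process::{Child, Command};

/// Why the broker is starting a brain. Names are illustrative, not from the codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Fresh,
    Update,
    Crash,
}

impl StartReason {
    pub fn as_str(self) -> &'static str {
        match self {
            StartReason::Fresh => "fresh",
            StartReason::Update => "update",
            StartReason::Crash => "crash",
        }
    }
}

/// Classify a brain exit. `planned` is true when `apply` asked the brain to
/// snapshot and self-exit before the exit was observed.
pub fn reason_after_exit(planned: bool) -> StartReason {
    if planned {
        StartReason::Update
    } else {
        StartReason::Crash
    }
}

/// Argv for the brain child. The two flags are hypothetical.
pub fn brain_args(reason: StartReason, generation: u64) -> Vec<String> {
    vec![
        "daemon".to_string(),
        "brain".to_string(),
        format!("--start-reason={}", reason.as_str()),
        format!("--generation={}", generation),
    ]
}

/// Spawn the brain from the on-disk executable path. The path is opened at
/// spawn time, so after `apply` swaps the file this starts the new binary.
pub fn spawn_brain(exe: &Path, reason: StartReason, generation: u64) -> io::Result<Child> {
    Command::new(exe).args(brain_args(reason, generation)).spawn()
}
```

In this shape the broker's supervise loop would wait on the returned `Child`, call `reason_after_exit`, and call `spawn_brain` again with the same path; the update case and the crash case differ only in the reason passed along.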

### Q4 — Loop-timing continuity: durable absolute deadlines, not a handoff snapshot

**Decision.** Loop timing that must survive lives as **durable absolute-deadline state on disk**, rehydrated on every brain start. Two categories:
- **Periodic resident loops** (pump cadences, pulse/echo-commune): persist a `(anchor, interval)` pair **once** on a fresh/crash start. The next fire is derived functionally — `next_fire = anchor + interval × ⌈max(0, now − anchor) / interval⌉` — with **no per-fire writes**. An **update restart re-reads `(anchor, interval)` and keeps deriving** (phase preserved, lands mid-grid). A **crash restart is treated as a fresh start** (rewrite `anchor = now`; phase reset is acceptable).
- **One-shot scheduled events** (alarms): persist the absolute `target-time` **at creation**; every start (update *or* crash) reads it and fires-if-due. **These never reset on crash** — a user's "remind me at 3pm" is a commitment that must outlive any restart. **[V3]** The one-shot *rule* is specified this milestone, but the *machinery* is **built with the alarm port, not here**: the daemon has no one-shot consumer today (§7), so building it now would ship untested dead code (matches the project's activate-don't-pre-fail posture).

The brain distinguishes update-restart from crash/fresh-restart via a signal on the broker's respawn (the broker *initiated* the planned cycle but *observed* the unexpected crash). The `(anchor, interval)` write happens once per fresh/crash start, zero per update, zero per fire.
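
The periodic-loop rule can be sketched as two small pure functions. This is an illustration under stated assumptions, not the project's code: `next_fire`, `anchor_for_start`, and `StartReason` are hypothetical names, times are epoch milliseconds, and an update start that finds no persisted pair is treated like a fresh start.

```rust
/// Next fire time (epoch ms) on the grid defined by `(anchor, interval)`:
/// anchor + interval * ceil(max(0, now - anchor) / interval).
pub fn next_fire(anchor_ms: u64, interval_ms: u64, now_ms: u64) -> u64 {
    assert!(interval_ms > 0, "interval must be non-zero");
    // max(0, now - anchor): a clock reading before the anchor counts as zero elapsed.
    let elapsed = now_ms.saturating_sub(anchor_ms);
    // Integer ceiling of elapsed / interval.
    let steps = elapsed / interval_ms + u64::from(elapsed % interval_ms != 0);
    anchor_ms + steps * interval_ms
}

/// Start reason handed to the brain at spawn (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Fresh,
    Update,
    Crash,
}

/// Pick the `(anchor, interval)` pair to run with, plus whether it must be
/// written to disk. Only an update start with a persisted pair keeps the
/// old phase; every other start re-anchors at `now` and writes once.
pub fn anchor_for_start(
    reason: StartReason,
    persisted: Option<(u64, u64)>,
    interval_ms: u64,
    now_ms: u64,
) -> ((u64, u64), bool) {
    match (reason, persisted) {
        (StartReason::Update, Some(pair)) => (pair, false),
        _ => ((now_ms, interval_ms), true),
    }
}
```

With an anchor of 1000 ms and a 300 ms interval, a brain that comes up at 1750 ms after an update derives 1900 ms as its next fire, which is the same grid point the old brain would have hit. A start that lands exactly on a grid point yields that point itself, so the fire is due immediately.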

**Reason.**
- `BrainState` (the handoff snapshot) is the wrong vehicle: it only exists for a *planned* update, but a brain **crash** (same respawn path) carries no snapshot, so anything timing-critical in it is lost on crash. Timing that must survive has to survive crash too → it belongs on disk.
- In-memory monotonic `Instant` phase cannot be transferred across a process boundary meaningfully (new process = new monotonic epoch); only absolute epoch-ms is portable.
- The operator added the write-minimization refinement (store anchor+interval once, derive thereafter) so the disk is touched once per fresh start rather than per fire — important given a 200 ms tick across many loops.

**Current state that motivates it.** The pump cadences are already catch-up/idempotent ("stagger from everything due now," `peerloop.rs:304`) and are restart-safe as-is. The pulse loop is **phase-relative** (`lifecycle.rs:486` `sleep_interruptible(pulse_period)`), so its phase resets on restart — this is the loop that needs converting to a disk-anchored deadline. Idempotent catch-up loops need no phase state at all; only phase-significant loops are converted.

**Alternative rejected.** *Snapshot each loop's `Instant`/phase into the handoff message and restore it*: fragile (monotonic clocks don't cross processes), lost on crash, and incompatible with the model-B trigger chosen in Q3 (the outgoing brain exits before the new one starts, so no brain→brain message exists).

### Q5 — IPC boundary: the §B rule adjudicates; everything brain→broker becomes a versioned IPC verb

**Decision.** Apply ADR-0004 §B's rule — *a resource is broker-owned iff a live consumer would lose continuity on a brain restart* — to every component that today shares the in-process `Arc<Broker>`:
- **Net bring-up + REQ-DAEMON-9 self-heal retry → broker** (the brain must not own net bring-up). *Audit correction:* the NetHost is **already** broker-owned (`broker.rs:175` `OnceLock`, §B M4-D4a), and net bring-up + the self-heal retry already live in the `daemon.rs` entry that *becomes* the broker entry — so this item is **near-free**, not a migration.
- **Digest hub → broker** for the live projections + subscriber sockets; **the parse that feeds it → brain** (pushes parsed digests over IPC).
- **Seed lock + liveness → broker** (Q2).
- **Every other brain→broker call becomes a versioned IPC verb** — in the two-process world there is no shared `Arc`, so each current direct call is a latent IPC verb on the same versioned contract that already carries `SPAWN`/`INPUT`/`NET_*`/`SESSIONS`.

**Accepted consequence.** This enlarges the "dumb kernel." There is precedent: ADR-0004:63 accepted exactly this when QUIC ownership moved into the broker: *"the broker is 'stable' in update cadence, not in narrowness."*

**[V6] N-1 compat is scoped milestone work.** Because the broker almost never updates but the brain updates routinely, the **steady state after every routine update is NEW brain × OLD broker.** The `classify` pre-swap handshake (`brain_ipc_version ≥ broker.min_compatible`) gates the *forward* direction, but each direct-call→verb conversion must hold that N-1 window across the *whole* verb surface. The milestone must include a **CI-real compatibility test — old-broker binary × new-brain binary across the verb surface** — or KH-2.3 (handoff-argv-compat) returns the first time a verb signature changes.
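
A rough sketch of the two checks this paragraph distinguishes: the forward version gate, and a verb-surface check of the kind a new-brain × old-broker CI test could assert. The struct, both function names, and the `DIGEST_PUSH` verb in the usage note are hypothetical; only the inequality and the `SPAWN`/`INPUT`/`SESSIONS` verb names come from the text.

```rust
/// What a broker reports about its IPC contract (field names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct BrokerContract {
    pub ipc_version: u32,
    pub min_compatible: u32,
}

/// Forward gate as stated in the text: brain_ipc_version >= broker.min_compatible.
pub fn forward_compatible(brain_ipc_version: u32, broker: BrokerContract) -> bool {
    brain_ipc_version >= broker.min_compatible
}

/// Verbs the new brain would send that the old broker does not serve.
/// An empty result is what an N-1 compatibility test would require.
pub fn unsupported_verbs<'a>(brain_sends: &[&'a str], broker_serves: &[&str]) -> Vec<&'a str> {
    brain_sends
        .iter()
        .copied()
        .filter(|verb| !broker_serves.iter().any(|served| served == verb))
        .collect()
}
```

For example, a brain that starts sending a new `DIGEST_PUSH` verb passes the version gate against an older broker but shows up in `unsupported_verbs`, which is the gap the version inequality alone does not catch.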

**Exception — shellwake stays brain-side.** The supervised `wake_command` watcher *children* are an exception to the §B continuity rule: they remain **brain-owned** and are re-reconciled from disk on brain start (the boot sweep at `daemon.rs:206-217` already does orphan cleanup). 

**Reason for the exception.** Strict §B would put the watcher children in the broker for continuity, but the operator judged the continuity loss acceptable: updates are rare, the brain re-reconciles watchers on start, and keeping watcher-child *supervision logic* out of the kernel keeps the broker narrower. The only cost is a brief window during a (rare) brain update where an offline shell could miss a wake — explicitly accepted.

### Q6 — Multi-session handoff: the broker is cursor-of-record; the `BrainState` message retires

**Decision.** The broker tracks each session-subscriber's last-delivered cursor. On (re)start the new brain queries the broker for **all** hosted sessions and re-attaches each in **resume** mode. **Output is at-least-once** (the broker resumes from last-*sent*); **input/effects stay exactly-once** via the already-broker-owned `EffectJournal`. The explicit `BrainState` handoff *message* is **retired** for production — the new brain cold-starts and reconstructs continuity by querying the broker.

**[V2] Generation custody.** Retiring `BrainState` would orphan the generation counter (KH 2.4: increments on every start/revive; `gen_start` always `= now()`). Custody moves to the **broker**: it owns the counter, increments it on **every** brain spawn (planned or crash), and hands `{generation, start-reason}` to the brain at spawn time via a **versioned argv/hello field** (KH-2.3 forward-compat, defaulted). Because the broker observes every respawn, this is strictly more reliable than passing it brain-to-brain — and the **same spawn-time channel carries Q4's update-vs-crash discriminator** (one channel, both payloads).
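
One way the spawn-time channel could look, as a sketch only. `SpawnHello`, `GenerationCounter`, `parse_hello`, and the flag spellings are hypothetical, and the defaults chosen for missing or unknown values (generation 0, a fresh start) are assumptions made for the example. The shape follows the text: the broker bumps one counter on every spawn, and the brain parses the fields tolerantly so an older or newer peer does not break the handoff.

```rust
/// Start reason carried on the spawn-time channel (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartReason {
    Fresh,
    Update,
    Crash,
}

/// The payload handed to the brain at spawn.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpawnHello {
    pub generation: u64,
    pub start_reason: StartReason,
}

/// Broker side: a single counter, incremented on every brain spawn,
/// whether the previous brain exited by plan or by crash.
#[derive(Debug, Default)]
pub struct GenerationCounter(u64);

impl GenerationCounter {
    pub fn next_spawn(&mut self, start_reason: StartReason) -> SpawnHello {
        self.0 += 1;
        SpawnHello { generation: self.0, start_reason }
    }
}

/// Brain side: tolerant parse. Unknown flags are ignored and missing or
/// malformed fields keep their defaults (an assumption of this sketch).
pub fn parse_hello<I, S>(args: I) -> SpawnHello
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut hello = SpawnHello { generation: 0, start_reason: StartReason::Fresh };
    for arg in args {
        let arg = arg.as_ref();
        if let Some(v) = arg.strip_prefix("--generation=") {
            hello.generation = v.parse().unwrap_or(hello.generation);
        } else if let Some(v) = arg.strip_prefix("--start-reason=") {
            hello.start_reason = match v {
                "update" => StartReason::Update,
                "crash" => StartReason::Crash,
                _ => StartReason::Fresh,
            };
        }
    }
    hello
}
```

Defaulting an unrecognised start reason to a fresh start errs toward a phase reset, which Q4 already accepts for crash restarts.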

**Reason.**
- Two concrete gaps exist today: `applyhost.rs:239` re-attaches only `sessions.first()` (one session) and with `from_seq = 0` (re-replays the whole ring → duplicate output).
- Under the model-B trigger (Q3: brain self-exit, broker respawn) the outgoing brain is gone before the new one starts, so **no brain→brain message can be passed**; continuity must come from the persistent side. The broker already holds the per-session output ring (`broker.rs:73-143`); making it also track the delivered cursor lets a respawned brain resume gaplessly with **no disk writes** (respecting Q4) and **no handoff frame**.
- At-least-once output matches the *existing* terminal-stream contract: SPIKE-05 already established "terminal-stream ≠ exactly-once transfer" (a resize repaint reorders and duplicates the stream anyway). Exactly-once output would require a per-chunk ack protocol for a guarantee the terminal stream does not make; the thing that *must* be exactly-once (injected input/effects) already has the journal.
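
A small model of the cursor-of-record idea, under loudly stated assumptions: `SessionRing` and `resume_all` are hypothetical names, ring eviction is omitted, and "resume from last-sent" is read here as re-delivering the last-sent chunk itself, which is one interpretation that yields at-least-once output. It is not the broker's actual `OutputLog`.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Per-session output ring with a broker-side delivered cursor.
#[derive(Debug, Default)]
pub struct SessionRing {
    chunks: VecDeque<(u64, String)>,
    next_seq: u64,
    last_sent: Option<u64>,
}

impl SessionRing {
    /// Append an output chunk and return its sequence number.
    pub fn push(&mut self, data: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.chunks.push_back((seq, data.to_string()));
        seq
    }

    /// Live delivery: everything after the cursor.
    pub fn deliver(&mut self) -> Vec<(u64, String)> {
        let from = self.last_sent.map_or(0, |s| s + 1);
        self.take_from(from)
    }

    /// Resume after a brain respawn: start at the last-sent chunk itself,
    /// so an in-flight chunk may repeat but is never skipped.
    pub fn resume(&mut self) -> Vec<(u64, String)> {
        let from = self.last_sent.unwrap_or(0);
        self.take_from(from)
    }

    fn take_from(&mut self, from: u64) -> Vec<(u64, String)> {
        let out: Vec<(u64, String)> =
            self.chunks.iter().filter(|c| c.0 >= from).cloned().collect();
        if let Some(last) = out.last() {
            self.last_sent = Some(last.0);
        }
        out
    }
}

/// Re-attach every hosted session in resume mode, not just the first.
pub fn resume_all(
    sessions: &mut BTreeMap<String, SessionRing>,
) -> Vec<(String, Vec<(u64, String)>)> {
    sessions.iter_mut().map(|(id, ring)| (id.clone(), ring.resume())).collect()
}
```

Compared with re-attaching at `from_seq = 0`, a resume in this model replays at most the one chunk that may have been in flight, and `resume_all` covers every session the broker hosts.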

**Alternative rejected.** *Add brain→broker acks so the broker resumes from last-processed (exactly-once output)* — cost without benefit, given the terminal-stream contract.

### Q7 — Failure atomicity: bounded retry → auto-rollback to last-known-good

**Decision.** If the new brain fails to reach a **ready** signal (re-attached all sessions + resumed loops) within a bounded number of boots / a healthy-run window, the broker **rolls back to the last-known-good binary** (the `spt.exe.old-N` backup), **quarantines** the bad version (no auto re-apply/re-fetch), and surfaces a loud consent-style notification. The broker holds **last-known-good + candidate** paths and chooses which to spawn (no file manipulation at failure time). Endpoints survive throughout (the broker holds them). The **applied record becomes two-phase**: `applied-pending` at swap, promoted to `applied` on the ready signal, or corrected to `rolled-back(quarantine=N, running=N-1)` on failure.

**[V1] Rollback-state-compat invariant (forward invariant, mint this milestone).** Auto-rollback spawns the **old** binary against durable state the **new** brain may already have written. This is safe *today* (a source audit confirmed zero state-migration code exists), but the first release that migrates a durable-state schema would silently break Q7's rollback. The invariant must be minted now, while it is free: **a brain must not irreversibly migrate durable state before ready-promotion** — equivalently, every pre-ready write must remain readable by the N-1 brain. Cheap to assert now; unmintable retroactively after a migration ships.

**Reason.**
- The dangerous case is a *runtime* bug (new brain panics on boot), not a failed binary swap on disk (already rolled back at `applyhost.rs:167-170`) and not an IPC mismatch (the `classify` handshake `brain_ipc_version ≥ broker.min_compatible` is checked pre-swap). A naive supervisor would crashloop the new binary, leaving endpoints alive but logic dead: "up but useless."
- The backup binary is already produced for exactly this, and the readiness+backoff machinery already exists — `peerloop.rs:82-91` has `SUPERVISE_BACKOFF_BASE/CAP` + `SUPERVISE_HEALTHY_RUN` (60 s healthy resets backoff), and `supervise_pump` is wired at `peerloop.rs:805` (M8-D4, verified). A bad update that bricks *logic* should self-heal back to last-good without a human, the same spirit as the boot-race self-heal already shipped.
- The two-phase record fixes a real existing bug: `applyhost.rs:176` writes `record_applied(version)` + `last-outcome=applied` **before** the brain handoff — i.e. before we know the new brain boots. (This is the optimistic `applied.json={version:6}` observed on `enlyzeam`.)
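
The two-phase record and the spawn-time binary choice can be sketched as a small state machine. Everything named here (`AppliedRecord`, `UpdateSlot`, `MAX_FAILED_BOOTS` and its value of 3) is hypothetical; the sketch counts failed boots only and leaves out the healthy-run window and backoff that the existing supervise machinery would supply.

```rust
use std::path::{Path, PathBuf};

/// Two-phase applied record: pending at swap, promoted on ready,
/// corrected on rollback. Variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AppliedRecord {
    Pending { version: u32 },
    Applied { version: u32 },
    RolledBack { quarantine: u32, running: u32 },
}

/// Hypothetical bound on failed boots before rollback.
pub const MAX_FAILED_BOOTS: u32 = 3;

/// What the broker holds for one update: both binary paths plus the record.
pub struct UpdateSlot {
    pub record: AppliedRecord,
    pub previous_version: u32,
    pub last_known_good: PathBuf,
    pub candidate: PathBuf,
    pub failed_boots: u32,
}

impl UpdateSlot {
    /// Choose which binary to spawn. No files are moved at failure time;
    /// the broker only switches which path it uses.
    pub fn binary_to_spawn(&self) -> &Path {
        match self.record {
            AppliedRecord::RolledBack { .. } => self.last_known_good.as_path(),
            _ => self.candidate.as_path(),
        }
    }

    /// The brain reported ready: promote pending to applied.
    pub fn on_ready(&mut self) {
        if let AppliedRecord::Pending { version } = self.record {
            self.record = AppliedRecord::Applied { version };
        }
    }

    /// The brain exited before reporting ready. After the bound is hit,
    /// quarantine the candidate and fall back to last-known-good.
    pub fn on_boot_failure(&mut self) {
        if let AppliedRecord::Pending { version } = self.record {
            self.failed_boots += 1;
            if self.failed_boots >= MAX_FAILED_BOOTS {
                self.record = AppliedRecord::RolledBack {
                    quarantine: version,
                    running: self.previous_version,
                };
            }
        }
    }
}
```

In this model a record can only reach `Applied` through `on_ready`, so an `applied` written before the new brain has booted cannot occur.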

**Alternative rejected.** *Bounded retry → halt and wait for manual intervention* — safer-feeling but leaves logic dead until a human shows up; inconsistent with the self-healing posture.

### Q8 — Cross-platform uniformity (confirmation, not a fork)

**Decision/observation.** The broker spawns the brain as a **child process** (`Command::spawn`) and they communicate over the existing local socket IPC. **No `exec` is required**, so there is no Windows/Unix divergence. ConPTY (Windows) and forkpty (Linux) handles stay broker-side and never cross to the brain. Spikes 01 and 04 already proved the handoff on both OSes.

**Reason.** This is precisely the uniformity ADR-0004:25 sought when it rejected whole-daemon live FD-passing as "hard and platform-divergent." Spawn-plus-socket sidesteps it entirely.

---

## 5. Implied decomposition (consequences of the above)

- `spt daemon run` becomes the **broker** process entry: binds the seed-control channel + broker socket + NetHost + digest hub; holds harness children + the `EffectJournal`; spawns the brain child.
- A new hidden `spt daemon brain` entry is the **brain** process: connects to the broker, runs pump/dispatch/psyche loops/session drivers, rehydrates from disk, emits a `ready` signal.
- `ensure_running` / `is_running` / `daemon stop` contracts are unchanged — they target the broker's seed channel.
- The de-elevation guard (`daemon.rs:60-103`) applies at the **broker** entry; the brain child inherits the unelevated identity.

---

## 6. Scope and sequencing

This is **milestone-sized**, not a patch, and should be planned as a roadmap phase, not folded into an unrelated branch. It includes:

- the process decomposition;
- an IPC-verb surface for the hard cases (digest push, seed-lock/liveness, plus the general direct-call audit; net bring-up is near-free per the Q5 audit correction);
- broker-side per-session delivered-cursor tracking;
- broker-owned generation custody handed at spawn **[V2]**;
- durable absolute-deadline timing across **phase-significant** resident loops only **[V4]** (the idempotent pump cadences need *no* conversion, since converting them would reintroduce the per-loop writes Q4 minimized; the one-shot deadline *machinery* defers to the alarm port **[V3]**, and only its rule is fixed here);
- the auto-rollback + readiness machinery with the **[V1]** rollback-state-compat invariant;
- the two-phase applied record;
- an activated **REQ-UPD-3 int stage proving process-level endpoint survival [V5]**;
- the **[V6]** N-1 old-broker × new-brain CI compat test.

### 6.1 Sequencing — RATIFIED (operator, 2026-06-09)

**Restoration = the next milestone, before `spt-claude-code` scoping.** `doyle`'s recommendation, ratified by the operator 2026-06-09. Reasons:

- **(a) Queue is open** — mesh shipped at v0.3.0, M8 acceptance is COMPLETE.
- **(b) Self-bootstrapping (decisive).** Until this ships, *every* release needs manual daemon bounces fleet-wide: that cost was paid on all 3 nodes for v0.3.2, and `enlyzeam` reproduced the `\r` corruption for about a day with the fix sitting on disk. The restoration release is the **last** one that needs a manual bounce; once it lands, every adapter-era release rolls seamlessly.
- **(c) Adapter window safe.** The split changes daemon *internals*, not the CLI/API surface M8 froze. Better the adapter lands on the final topology than atop a daemon we later perform open-heart surgery on while it hosts the user's daily driver.


---

## 7. Separate gaps surfaced (out of scope here, tracked separately)

- **Alarms have no durable daemon scheduler.** In the spt-core daemon, `alarm` exists only as an event *shape* (`spt-proto`) + relay handling (`psyrelay.rs:93`) + a test fixture; `target-time` is only a wire attribute. The actual "wait until target, fire the TIMED PULSE" timer lives in the legacy owl listener in-memory. Porting alarms into the daemon as a durable scheduler (which would then ride the Q4 one-shot mechanism) is its own gap.
- **Registry one-cadence reconverge window.** `RegistryHost` is intentionally in-memory (`registryhost.rs:19-26`): a restart forgets peers' rows and reconverges within one pump cadence (self-rows re-advertised each tick; mirrored to `identity/registry/` JSON for out-of-process readers). Across a brain restart this is a brief reconverge window, accepted by design — not a blocker, noted for completeness.
- **Multipart inter-agent message truncation: belongs to LEGACY `claude_skill_owl`, not spt-core.** During the verification exchange (which ran over the legacy live-agent messaging infra, i.e. the `owl.exe` listener, not the spt-core daemon), only about parts 1–8 of an 18-part chunked message reached the recipient; a long resend also dropped its tail. Long messages silently lose their tail, a delivery-integrity issue in the chunk-reassembly path. **This is a legacy-project bug** (the agents in this exchange run on `claude_skill_owl`'s transport); spt-core's *own* reassembly is already covered by `REQ-HAZARD-EVENTPART-REASSEMBLY`. It is flagged here only because the session surfaced it; route the repro + fix to the legacy project. (`doyle` can partner on a repro harness.)
- **Stale DEFERRED mesh row.** Side-flag from verification: the mesh row in `docs/DEFERRED.md` is stale (mesh shipped at v0.3.0). Minor cleanup, noted for a docs pass.

---

## 8. Cross-references for verification

- **ADR-0004** — `docs/adr/0004-single-daemon-broker-brain-split-and-self-update.md` (the decision being restored; §A update classes, §B ownership table, the spike amendments).
- **Requirements** — REQ-UPD-3 (no-endpoint-drop, brain-only), REQ-UPD-4 (staged + consent), REQ-UPD-6/8 (update-set / platform-safe), REQ-DAEMON-3/9 (auto-start / net self-heal), REQ-HAZARD-UPDATE-ROLLBACK, REQ-HAZARD-HANDOFF-ARGV-COMPAT.
- **Glossary/model** — `CONTEXT.md` daemon / broker / brain (incl. the one-broker-per-machine clause added this session).
- **Spikes** — `docs/spikes/SPIKE-01-broker-handoff.md`, `-03-quic-survival`, `-04-forkpty-parity`, `-05-restart-stress`, `-06-idempotent-boundary` (proofs that the two-process handoff works on both OSes).
- **Key source** — `daemon.rs` (`run`, net consumers, self-heal, digest hub, shellwake, seed-control), `broker.rs` (`OutputLog`, `bind`, `EffectJournal`), `brain.rs` (`BrainState`, `cold_start`/`handoff`/`snapshot`), `applyhost.rs` (`apply_staged`, `connect_retry`, record-before-handoff), `update.rs` (classes, `classify`, `apply_brain_only`), `peerloop.rs` (cadences, supervise-backoff), `lifecycle.rs` (pulse loop), `registryhost.rs`.

---

## 9. Artifacts — DELIVERED (ratified 2026-06-09)

Verification is done (§0); the operator ratified; these durable artifacts are now written (commit 44497f4 unless noted):

- **ADR-0018** (`docs/adr/0018-broker-brain-process-isolation-restoration.md`) — extends + amends ADR-0004: the drift finding, the Q2–Q8 restoration decisions, the six verification amendments. Status: Accepted (2026-06-09).
- **`REQ-*` registry mints** (registry-first, `traceable-reqs.toml`; minted INACTIVE per rule 5, activated per `RESTORATION-PLAN.md`'s schedule):
  - **`REQ-HAZARD-BROKER-PROCESS-ISOLATION`** — a brain restart must never drop a hosted endpoint (REQ-UPD-3's real promise). **Evidence [V5]:** an `int` stage productionizing SPIKE-01/03 — a PTY child + a live QUIC conn survive a brain-*process* restart onto a *swapped* binary. REQ-UPD-3 / REQ-DAEMON-2's current `int` evidence (`applyhost.rs:39/77`, `update.rs:246`, `brain_swap.rs`) proves only the *in-process* handoff shape — **mis-evidenced today, re-pointed at plan task D7.**
  - **`REQ-HAZARD-ROLLBACK-STATE-COMPAT` [V1]** — a brain must not irreversibly migrate durable state before ready-promotion (pre-ready writes stay N-1-readable).
  - The functional two-process-split realization + the **[V6]** N-1 old-broker × new-brain CI compat test are carried as `RESTORATION-PLAN.md` tasks D1–D7 (the compat test lands as a D7 gate).
- **KNOWN-HAZARDS** §6.7 (in-process collapse) + §6.8 (rollback-state-compat), with condensed-checklist rows.
- **Build plan** `RESTORATION-PLAN.md` (D1–D7), linked from ROADMAP.
- (Separately, per §7) the **multipart message-truncation** bug is routed to legacy `claude_skill_owl` — the one genuinely-external item, not an spt-core artifact.