# Restore broker/brain process isolation (correct the in-process-collapse regression)

## Status

Accepted (2026-06-09) — **extends and amends ADR-0004.** The design was independently verified (agent `doyle`, verified-with-amendments); implementation is a dedicated future milestone, sequenced next (before `spt-claude-code`). Full rationale, the `file:line` audit, and the per-decision alternatives live in `docs/BROKER-BRAIN-SPLIT-RESTORATION.md`.

## Context

ADR-0004 decided the broker/brain split **as two processes** — a stable broker kernel (PTY masters, harness children, sockets, the Iroh/QUIC endpoint) beneath a freely-restartable brain — specifically to satisfy the no-endpoint-drop self-update invariant (REQ-UPD-3): a routine (brain-only) update restarts the brain while the broker process survives, so no hosted endpoint terminates. The spikes (01/03/04/05/06) proved this with two separate binaries.

On 2026-06-09, while verifying the v0.3.2 cross-OS update fix on the live fleet, a regression was discovered: **the production daemon runs the broker as a background thread inside the single `spt daemon` process** (`daemon.rs:165-170`: `Broker::bind_in_with_net` → `Arc<Broker>` served on `thread::spawn`; no broker child-spawn exists anywhere in the repo). This is **unintended drift** — no doc, plan, commit, or comment marks the collapse as a deliberate interim (the operator confirmed no such decision was made).

Consequences of the drift:
- The brain cannot restart onto a new binary without killing the in-process broker thread, which would close every PTY, orphan every harness child, and drop every listening socket. **The no-endpoint-drop self-update pillar is therefore silently unrealized.**
- `spt update apply` performs an in-process `Brain::handoff` that re-attaches a subscriber within the *same old process*, which is a no-op for a binary swap. The statement at `update.rs:233-234` that "the live daemon execs the verified new binary" is aspirational; it was never wired.
- New code does not run until an unrelated restart/logon. Observed live: `enlyzeam` ran 0.3.0 for about a day with the valid 0.3.2 binary already on disk, and kept reproducing the `\r`-corruption bug the update was meant to fix.
- **REQ-DAEMON-2 and REQ-UPD-3 carry `int` evidence that proves only the in-process handoff shape** (`tests/update.rs`, `brain_swap.rs`, the M3b-B9 daemon E2E) — i.e. the regression is masked in the requirement registry: the tests pass while proving the wrong thing.

## Decision

**Restore the ADR-0004 two-process model.** The brain runs new code by restarting onto the swapped binary while the broker process — holding every continuity-bearing resource — survives. The design decisions (grill Q2–Q8, hardened by six verification amendments `[V1]`–`[V6]`):

1. **Supervision (Q2).** The broker is the always-up per-machine anchor (one per `SPT_HOME`, present even with zero endpoints). It owns the seed-control lock + liveness (`ensure_running`/`is_running` target) and supervises/spawns the brain as its child.
2. **Update trigger (Q3).** `apply` swaps the binary on disk, then signals the brain to snapshot and self-exit; the broker auto-respawns it from the executable path (now the new binary); the new brain re-attaches. This reuses the existing snapshot→drop→re-attach primitive: an update is a *planned* crash on the path the broker already recovers from. **`[AMENDED 2026-06-11, v0.4.2]`** "The executable path" silently assumed path-string semantics. On Linux, apply renames the running `spt` to `spt.old-N`, and a per-spawn `std::env::current_exe()` (a `readlink` of `/proc/self/exe`, which tracks the inode) then **follows the rename to `.old-N`**, so the broker respawned the brain onto the OLD bytes while recording `applied` (caught live on kitsubito during the v0.4.1 roll). Fix: capture the canonical exe path **once at broker start** as the respawn default (Windows already had path-at-start semantics via `GetModuleFileName`), and **bytes-gate promotion**: the trial promotes only if the `brain.ready` `exe_hash` equals the staged artifact hash; otherwise it auto-rolls back. See `REQ-HAZARD-BRAIN-RESPAWN-PATH` / KNOWN-HAZARDS §6.11 / `V042-PLAN.md`.
3. **Loop-timing continuity (Q4).** Durable absolute-deadline state on disk, not a handoff snapshot. Periodic loops persist `(anchor, interval)` once on a fresh/crash start and derive `next_fire` functionally (no per-fire writes); an update restart keeps deriving (phase preserved), a crash restart resets the anchor. One-shot deadlines persist at creation and survive crash + update. **`[V3]`** The one-shot *rule* is fixed here; its *machinery* is built with the alarm port (no untested dead code). **`[V4]`** Only phase-significant loops convert — the idempotent pump cadences need none.
4. **IPC boundary (Q5).** ADR-0004 §B adjudicates: net bring-up → broker (already broker-owned via `broker.rs:175` `OnceLock` — near-free), digest hub → broker / parse → brain, seed-lock → broker; every brain→broker call becomes a versioned IPC verb (no shared `Arc`). **Exception:** shellwake watcher children stay brain-side, re-reconciled from disk on start (rare updates make the window tolerable). Broker enlargement accepted (ADR-0004:63 precedent). **`[V6]`** Steady state after any update is new-brain × old-broker, so a CI-real old-broker × new-brain compat test across the verb surface is required (else KH-2.3 returns).
5. **Multi-session handoff (Q6).** The broker becomes cursor-of-record per session; the new brain re-attaches **all** sessions in resume mode. Output is at-least-once (matches the SPIKE-05 terminal-stream contract); input/effects stay exactly-once via the broker-owned `EffectJournal`. The explicit `BrainState` handoff *message* retires (no brain→brain channel exists under the self-exit trigger).
6. **Failure atomicity (Q7).** Bounded-retry → auto-rollback to the last-known-good binary, gated on a brain `ready` signal (reuse the `peerloop.rs:805` supervise-backoff + healthy-run). Quarantine the failed version and raise a loud notification. The applied record becomes **two-phase** (`applied-pending` → `applied` on ready, or `rolled-back` on failure), fixing today's optimistic `record_applied` before boot (`applyhost.rs:176`). **`[V1]`** Forward invariant minted: a brain must not irreversibly migrate durable state before ready-promotion (pre-ready writes stay N-1-readable), or auto-rollback silently breaks on the first schema migration.
7. **Generation custody (Q2/`[V2]`).** Retiring `BrainState` would orphan the KH-2.4 generation counter; the **broker** owns it, increments on every spawn, and hands `{generation, start-reason}` to the brain at spawn via a versioned argv/hello field (KH-2.3 compat). The same channel carries Q4's update-vs-crash discriminator.
8. **Cross-platform uniformity (Q8).** The broker spawns the brain as a child process (`Command::spawn`) and talks to it over socket IPC. There is no `exec`, so no Windows/Unix divergence. ConPTY/forkpty handles never leave the broker.
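
The loop-timing rule in decision 3 can be illustrated with a minimal sketch. Everything named here (`LoopSchedule`, `StartReason`, `schedule_at_start`, the whole-second Unix timestamps, and the choice that the first fire lands one interval after the anchor) is an assumption made for illustration, not the project's actual types or policy. It shows `next_fire` derived purely from the persisted `(anchor, interval)` pair, with a write needed only when the anchor is reset.

```rust
/// Hypothetical on-disk record for one phase-significant periodic loop.
/// Times are whole seconds since the Unix epoch to keep the sketch simple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LoopSchedule {
    anchor: u64,
    interval: u64, // must be > 0
}

/// Why the brain is starting (hypothetical names for the discriminator).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartReason {
    Fresh,
    Crash,
    Update,
}

impl LoopSchedule {
    /// First fire time strictly after `now`, derived purely from
    /// `(anchor, interval)`; nothing is written per fire.
    fn next_fire(&self, now: u64) -> u64 {
        if now < self.anchor {
            // Clock is behind the anchor: the anchor itself is the next fire.
            return self.anchor;
        }
        let periods_elapsed = (now - self.anchor) / self.interval;
        self.anchor + (periods_elapsed + 1) * self.interval
    }
}

/// Decide the schedule to run with at brain start. The returned bool says
/// whether the record must be persisted (true only when the anchor is new).
fn schedule_at_start(
    persisted: Option<LoopSchedule>,
    reason: StartReason,
    now: u64,
    interval: u64,
) -> (LoopSchedule, bool) {
    assert!(interval > 0, "interval must be non-zero");
    match (reason, persisted) {
        // Update restart with a record on disk: keep deriving, phase preserved.
        (StartReason::Update, Some(existing)) => (existing, false),
        // Fresh or crash start (or no record yet): re-anchor at `now`.
        _ => (LoopSchedule { anchor: now, interval }, true),
    }
}
```

With an anchor of 1000 and an interval of 60, an update restart at 1130 keeps the anchor and fires next at 1180, while a crash restart at the same instant re-anchors to 1130 and fires next at 1190.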

## Consequences

- **Decomposition:** `spt daemon run` becomes the broker process entry (binds seed-control + broker socket + NetHost + digest hub; holds children + `EffectJournal`; spawns the brain); a new hidden `spt daemon brain` is the brain entry (connects, runs the logic loops, rehydrates from disk, emits `ready`). `ensure_running`/`is_running`/`daemon stop` contracts unchanged. The de-elevation guard applies at the broker entry.
- **Requirement registry:** REQ-DAEMON-2 and REQ-UPD-3's `int` evidence must be **re-pointed** to a productionized SPIKE-01/03 E2E proving *process-level* endpoint survival (`[V5]`). Two new hazards are minted: `REQ-HAZARD-BROKER-PROCESS-ISOLATION` (a brain restart must never drop a hosted endpoint) and `REQ-HAZARD-ROLLBACK-STATE-COMPAT`. A KNOWN-HAZARDS §6.7 entry records the regression.
- **Sequencing:** the restoration is the next milestone, before `spt-claude-code` scoping (operator-accepted 2026-06-09). Rationale: it is the *last* release that needs a manual fleet daemon bounce (a cost paid 3× for v0.3.2), so every adapter-era release rolls seamlessly; and the adapter is better built on the final topology than atop a daemon under later surgery while it hosts the user's daily driver. The split changes daemon internals, not the M8-frozen CLI/API surface.
- **Out of scope (unchanged):** the broker-touching update classes (broker-compatible / broker-breaking) remain as ADR-0004 left them; whole-daemon FD-passing stays the deferred "future polish" (ADR-0004:38); a durable in-daemon alarm scheduler is a separate gap (alarms are legacy-listener-only today).
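
As a rough illustration of the decomposition, the broker-side spawn of `spt daemon brain` might be assembled as below. This is a sketch under stated assumptions: `BrainSupervisor`, `StartReason`, `HELLO_VERSION`, and the flag names `--hello-version`, `--generation`, and `--start-reason` are all hypothetical, chosen only to show a versioned argv field carrying `{generation, start-reason}` and an executable path that is fixed at broker start instead of being re-read per spawn.

```rust
use std::path::PathBuf;
use std::process::Command;

/// Version of the spawn-time hello fields (hypothetical value).
const HELLO_VERSION: u32 = 1;

/// Why the brain is being started (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartReason {
    Fresh,
    Crash,
    Update,
}

impl StartReason {
    fn as_arg(self) -> &'static str {
        match self {
            StartReason::Fresh => "fresh",
            StartReason::Crash => "crash",
            StartReason::Update => "update",
        }
    }
}

/// Broker-side state needed to (re)spawn the brain.
struct BrainSupervisor {
    /// Executable path captured once at broker start, never re-read per spawn.
    exe: PathBuf,
    /// Generation counter owned by the broker; bumped on every spawn.
    generation: u64,
}

impl BrainSupervisor {
    fn new(exe_at_start: PathBuf) -> Self {
        Self { exe: exe_at_start, generation: 0 }
    }

    /// Build the command for the next brain spawn and bump the generation.
    /// The caller would invoke `.spawn()` on the returned `Command`.
    fn next_command(&mut self, reason: StartReason) -> Command {
        self.generation += 1;
        let mut cmd = Command::new(&self.exe);
        cmd.args(["daemon", "brain"])
            .arg("--hello-version")
            .arg(HELLO_VERSION.to_string())
            .arg("--generation")
            .arg(self.generation.to_string())
            .arg("--start-reason")
            .arg(reason.as_arg());
        cmd
    }
}
```

In a real broker, `exe_at_start` would presumably come from a single `std::env::current_exe()` call made when the broker starts, so a later rename of the running binary cannot redirect the respawn.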
