# Single consolidated daemon with broker/brain split; peer-propagated gated self-update

## Status

accepted (2026-05-29)

## Context

ADR-0002 collapsed PTY-hosting and network-hosting into one per-machine `spt-daemon`. Two further forces refine its shape:

1. **Consolidation.** The sister project runs poll listeners as ephemeral per-session background tasks and Psyche wrappers as detached per-live-agent supervisor processes. But poll listeners already interact directly with the agent session (capsule/idle), and Psyche wrappers already invoke harness binaries directly. Once the daemon owns every PTY, keeping these as separate processes is unjustified.

2. **Seamless self-update with a hard no-terminate constraint.** Self-update is a day-one pillar. The constraint: *no endpoint process may terminate or suspend during an spt-core update* — we cannot assume every endpoint can safely suspend. The naive "drain + restart the daemon" approach violates this for spt-hosted sessions (the daemon owns their PTY; killing the daemon SIGHUPs the child).

## Decision

**Consolidate all per-machine logic into the one daemon.** Poll-listener logic and Psyche/pulse loops move into the `spt-daemon` — no separate listener or wrapper processes. The only residue is a thin, stateless **in-session relay** for harness-hosted sessions (topology 1), where spt cannot reach into a process tree it doesn't own; the relay just pipes the daemon's events into the session's stdout and is freely killable.

**Split the logical daemon into two implementation layers** to satisfy the no-terminate invariant:
- **broker** (stable kernel) — holds *only* un-transferable, must-not-die resources: PTY master fds, spawned harness child processes, listening network sockets. Minimal, versioned local IPC. Almost never updates.
- **daemon brain** (userspace) — all logic (routing, registry, pulse/psyche loops, manifest parsing, update orchestration). Restarts freely; rehydrates from disk state and re-attaches to broker-held handles.

Routine self-updates replace only the brain → endpoints never notice. The broker updates rarely (IPC-contract change, held-resource-type change, OS PTY/socket API change, or broker bugfix); the IPC is versioned so a newer brain talks to an older broker. Logical addressing is unchanged — still one per-machine `spt-daemon`; the broker is internal, not separately addressable.
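
The "newer brain talks to an older broker" rule can be illustrated with a small version handshake. This is a minimal sketch, not the project's IPC: the `Handshake` type, the `negotiate` function, and the idea that each side advertises an inclusive range of supported IPC versions are all assumptions made for illustration.

```rust
use std::ops::RangeInclusive;

/// Hypothetical handshake outcome; the ADR does not define the real wire format.
#[derive(Debug, PartialEq, Eq)]
pub enum Handshake {
    /// Both sides can speak this IPC version.
    Agreed(u16),
    /// No common version: the brain must not attach to this broker.
    Incompatible,
}

/// Pick the highest IPC version supported by both the running broker
/// and the freshly started brain (assumed: each advertises a range).
pub fn negotiate(broker: RangeInclusive<u16>, brain: RangeInclusive<u16>) -> Handshake {
    // The overlap of two ranges is [max(starts), min(ends)].
    let low = *broker.start().max(brain.start());
    let high = *broker.end().min(brain.end());
    if low <= high {
        Handshake::Agreed(high)
    } else {
        Handshake::Incompatible
    }
}
```

Under these assumptions, a brain that supports versions 1 through 4 attaching to a broker that supports 1 through 2 would settle on version 2, which is the routine brain-only update path. An `Incompatible` result is the signal that a broker update is needed.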

Rejected alternatives for the invariant:
- **drain + restart**: violates no-terminate for spt-hosted sessions.
- **whole-daemon live FD-passing**: ConPTY handle transfer mid-swap is hard and platform-divergent; pushing that complexity into the rare broker update instead of every routine update is strictly better.
- **topology-1-only invariant with spt-hosted as fast-follow**: silently breaks the stated hard constraint for the exact new topology spt-core introduces.

**Self-update delivery:** peer-propagated over P2P, layered on self-fetch, out-of-band still supported. All binaries signature-verified before handoff (spt-core's own release key) regardless of source — peer-propagation otherwise lets one compromised node poison the subnet. spt-core conducts updates for the whole stack: self first, then ripple-update each registered adapter via the adapter manifest's update declaration (file-pull or delegated command). The plugin's role shrinks to initial bootstrap only.
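
The "self first, then ripple" ordering can be sketched as a plan builder. The type and field names below (`UpdateAvenue`, `AdapterManifest`, `Step`, `plan_update`) are hypothetical, and so is the choice to skip an adapter whose manifest carries no update declaration; the ADR only names the two avenues (file-pull and delegated command).

```rust
/// The two update avenues named in this ADR; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UpdateAvenue {
    FilePull { source: String },
    DelegatedCommand { argv: Vec<String> },
}

/// Hypothetical slice of an adapter manifest relevant to updates.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AdapterManifest {
    pub name: String,
    pub update: Option<UpdateAvenue>,
}

/// One step of the conducted update, in execution order.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    UpdateSelf,
    PullFiles { adapter: String, source: String },
    RunCommand { adapter: String, argv: Vec<String> },
    /// Assumption: an adapter without a declared avenue is skipped, not failed.
    Skip { adapter: String },
}

/// Build the plan: spt-core first, then each registered adapter in order.
pub fn plan_update(adapters: &[AdapterManifest]) -> Vec<Step> {
    let mut steps = vec![Step::UpdateSelf];
    for a in adapters {
        let adapter = a.name.clone();
        steps.push(match &a.update {
            Some(UpdateAvenue::FilePull { source }) => Step::PullFiles {
                adapter,
                source: source.clone(),
            },
            Some(UpdateAvenue::DelegatedCommand { argv }) => Step::RunCommand {
                adapter,
                argv: argv.clone(),
            },
            None => Step::Skip { adapter },
        });
    }
    steps
}
```

Separating planning from execution keeps the ordering testable without touching the network; signature verification of every fetched artifact would still have to happen before any step runs.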

**Cadence/consent:** not fully automatic by default; gated on user confirmation delivered to the most-recently-active live session, with an opt-in full-auto choice.
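
As a rough illustration of the consent gate, the routing decision might look like the following. The `Session` and `Consent` types, the `last_active` stamp, and the `Defer` outcome when no live session exists are assumptions; the ADR does not say what happens when nobody is available to confirm.

```rust
/// Hypothetical view of a registered session.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Session {
    pub id: String,
    pub live: bool,
    /// Monotonic activity stamp, e.g. milliseconds since daemon start.
    pub last_active: u64,
}

/// Where the update confirmation goes.
#[derive(Debug, PartialEq, Eq)]
pub enum Consent {
    /// The user opted in to full-auto updates: no prompt.
    FullAuto,
    /// Prompt in this session (the most recently active live one).
    AskSession(String),
    /// Assumption: with no live session, hold the update until one appears.
    Defer,
}

pub fn consent_route(sessions: &[Session], full_auto_opt_in: bool) -> Consent {
    if full_auto_opt_in {
        return Consent::FullAuto;
    }
    sessions
        .iter()
        .filter(|s| s.live)
        .max_by_key(|s| s.last_active)
        .map(|s| Consent::AskSession(s.id.clone()))
        .unwrap_or(Consent::Defer)
}
```

Dead sessions are filtered out before picking the maximum, so a recently closed session never receives the prompt.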

## Consequences

- The daemon is the single brain for a machine; crash-recovery and update logic must cover PTYs, networking, registry, spools, listeners, and psyche loops together.
- A small internal broker process exists beneath the daemon — a deliberate, bounded walk-back of "literally one process," preserving B1's *intent* (one network identity, one supervisor, one firewall prompt) while guaranteeing endpoint survival across updates.
- Peer-propagated updates make release signing mandatory, not optional.
- spt-core becomes the update conductor for adapters too; adapter manifests must declare an update avenue.
- The "deliver to most-recently-active session" mechanism is a v1 building block that the deferred PresenceChannel will later generalize.
- Whole-daemon live FD-passing (zero-interruption even for broker updates) remains a possible future polish but is explicitly not required for v1.

## Validation & amendments (2026-05-31 — Stage A red-team + Spike #1)

Codex adversarial review (`docs/reviews/STAGE-A-codex-redteam.md`) raised four FATAL findings against this ADR; Spike #1 (`docs/spikes/SPIKE-01-broker-handoff.md`, `spt-spikes/spike-01-broker-handoff`) tested the hardest path. The resulting binding amendments:

### A. Update-class taxonomy (resolves FATAL #1 — R-UPD-3 self-contradiction)
R-UPD-3's "no endpoint terminates during update" is absolute **only** for the *brain-only* class. There are three classes, each with its own invariant:
- **brain-only** (the routine case) — endpoints MUST survive untouched. **Spike #1: PROVEN on Windows ConPTY.**
- **broker-compatible** — broker binary swap behind a versioned IPC the running broker can hot-accept; endpoints survive; no spike yet.
- **broker-breaking** — held-resource-type or OS-API change; requires a *planned endpoint-cycle*. R-UPD-3 is explicitly weakened here: this class MAY suspend endpoints, with consent + scheduling. PRD R-UPD-3 must be reworded to scope the absolute guarantee to brain-only.
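
The three classes above can be sketched as a small classifier. The `ReleaseDiff` inputs and the rule that a broker change the running broker cannot hot-accept is treated as broker-breaking are assumptions for illustration; the ADR defines the classes but not how a release is classified.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateClass {
    BrainOnly,
    BrokerCompatible,
    BrokerBreaking,
}

/// Hypothetical summary of what a release changes relative to the running install.
pub struct ReleaseDiff {
    pub broker_binary_changed: bool,
    pub held_resource_types_changed: bool,
    pub os_api_changed: bool,
    /// Can the running broker hot-accept the new broker behind the versioned IPC?
    pub running_broker_accepts_new_ipc: bool,
}

pub fn classify(d: &ReleaseDiff) -> UpdateClass {
    if d.held_resource_types_changed || d.os_api_changed {
        UpdateClass::BrokerBreaking
    } else if d.broker_binary_changed {
        if d.running_broker_accepts_new_ipc {
            UpdateClass::BrokerCompatible
        } else {
            // Assumption: a swap that cannot be hot-accepted needs an endpoint cycle.
            UpdateClass::BrokerBreaking
        }
    } else {
        UpdateClass::BrainOnly
    }
}

impl UpdateClass {
    /// The no-terminate invariant holds for every class except broker-breaking.
    pub fn endpoints_must_survive(self) -> bool {
        !matches!(self, UpdateClass::BrokerBreaking)
    }

    /// Broker-breaking updates require consent plus scheduling.
    pub fn needs_scheduled_consent(self) -> bool {
        matches!(self, UpdateClass::BrokerBreaking)
    }
}
```

Encoding the invariant as a method on the class keeps the weakened R-UPD-3 scope explicit at every call site that decides whether endpoints may be cycled.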

### B. Ownership table (resolves FATAL #4 — "any live stream owned by brain breaks the invariant")
The boundary is normative, not the shorthand "broker holds fds, brain holds logic." A resource is broker-owned **iff** a live consumer would lose continuity on brain restart.

| Resource | Owner | Spiked? |
|---|---|---|
| PTY master read+write handle | **broker** | ✅ #1 |
| Spawned harness child process | **broker** | ✅ #1 |
| Accepted local client socket (`api listen`, relay) | **broker** | ✅ #1 (plain TCP) |
| Iroh endpoint + accepted QUIC streams | **broker** | ✅ #3 (loopback shape) → **implemented M4-D4a** (`spt-daemon::nethost::NetHost`: dedicated tokio runtime in the broker owns the endpoint + conn table; brain drives status/dial over IPC net frames) |
| mDNS socket / relay session | **broker** | ✅ closed by construction at M4-D4a (2026-06-03): `MdnsAddressLookup` and the relay session are constructed *inside* the broker-owned iroh endpoint — there is no separate socket to own |
| Listening sockets | **broker** | ✅ #1 |
| Routing, registry state, pulse/psyche loops, manifest parse, update orchestration | brain (rehydrate from disk) | n/a |

A consequence the Codex review forced into the open: pushing live Iroh/QUIC ownership into the broker means networking stream-state lives in the "stable kernel," enlarging it. Accepted as the cost of the invariant; the broker is "stable" in *update cadence*, not in *narrowness*.
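
The ownership rule and table above reduce to a single predicate. The sketch below encodes it with hypothetical `Resource` and `Owner` enums; the variant names are illustrative and are not the identifiers used in `spt-daemon`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Owner {
    Broker,
    Brain,
}

/// Hypothetical resource kinds mirroring the rows of the ownership table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Resource {
    PtyMaster,
    HarnessChild,
    AcceptedClientSocket,
    /// Includes accepted QUIC streams, mDNS lookup and relay session,
    /// which the table says are constructed inside the endpoint.
    IrohEndpoint,
    ListeningSocket,
    Routing,
    RegistryState,
    PulseLoop,
    ManifestParse,
    UpdateOrchestration,
}

impl Resource {
    /// The normative test: would a live consumer lose continuity on brain restart?
    pub fn live_consumer_loses_continuity(self) -> bool {
        matches!(
            self,
            Resource::PtyMaster
                | Resource::HarnessChild
                | Resource::AcceptedClientSocket
                | Resource::IrohEndpoint
                | Resource::ListeningSocket
        )
    }
}

/// Broker-owned iff continuity would be lost; everything else rehydrates from disk.
pub fn owner_of(r: Resource) -> Owner {
    if r.live_consumer_loses_continuity() {
        Owner::Broker
    } else {
        Owner::Brain
    }
}
```

Keeping the predicate separate from `owner_of` means a new resource type forces an explicit answer to the continuity question rather than a default placement.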

### C. ConPTY DSR gotcha (new hazard, discovered by Spike #1)
ConPTY withholds **all** child stdout until the terminal answers its startup cursor-position query (`ESC [ 6 n`). A broker reading a ConPTY master that does not reply `ESC [ <r>;<c> R` sees only the 4-byte query and then nothing, which looks like a hung or silent child. The spike received 0 bytes of program output until the reader answered the DSR, then the full output. → `REQ-HAZARD-CONPTY-DSR`; every ConPTY reader must auto-answer DSR. Added to KNOWN-HAZARDS §5.
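
A reader that auto-answers DSR could be structured roughly as follows. This is a minimal sketch and not the spike's code: `DsrResponder`, `Scan`, and `feed` are hypothetical names, stripping the query from the forwarded output is a design assumption, and the caller is assumed to supply the cursor position to report (answering `1;1` is a common placeholder). The matcher keeps state across calls because a read may split the 4-byte query between two chunks.

```rust
/// The cursor-position query ConPTY emits at startup: ESC [ 6 n.
const DSR_QUERY: &[u8] = b"\x1b[6n";

/// Stateful scanner; `matched` carries a partial query across chunk boundaries.
#[derive(Default)]
pub struct DsrResponder {
    matched: usize,
}

pub struct Scan {
    /// Bytes to forward to consumers, with DSR queries removed.
    pub output: Vec<u8>,
    /// Bytes to write back to the PTY master: one reply per query seen.
    pub replies: Vec<u8>,
}

impl DsrResponder {
    /// Scan one chunk read from the PTY master.
    pub fn feed(&mut self, chunk: &[u8], row: u16, col: u16) -> Scan {
        let mut scan = Scan { output: Vec::new(), replies: Vec::new() };
        for &b in chunk {
            if b == DSR_QUERY[self.matched] {
                self.matched += 1;
                if self.matched == DSR_QUERY.len() {
                    // Full query seen: answer with ESC [ <row> ; <col> R.
                    let reply = format!("\x1b[{};{}R", row, col);
                    scan.replies.extend_from_slice(reply.as_bytes());
                    self.matched = 0;
                }
            } else {
                // Not a DSR after all: release the bytes that were held back.
                scan.output.extend_from_slice(&DSR_QUERY[..self.matched]);
                if b == DSR_QUERY[0] {
                    // ESC occurs only at the start of the query, so a fresh
                    // match can begin only at the current byte.
                    self.matched = 1;
                } else {
                    self.matched = 0;
                    scan.output.push(b);
                }
            }
        }
        scan
    }
}
```

The caller would write `replies` to the PTY master's input side and forward `output` to attached clients. A partial query at the end of a chunk stays held until the next `feed` call, so other escape sequences such as `ESC [ 2 J` pass through intact once they diverge from the query.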

### D. Self-update delivery hardening (resolves SERIOUS #5)
Signature-verify alone is insufficient. v1 adds: monotonic version field (rollback rejection), release-metadata expiry, channel pinning, key rotation/revocation list, and **adapter content signing** (adapter file-pull / delegated-command updates were not covered by "binary signature"). Track as `REQ-HAZARD-UPDATE-ROLLBACK`.
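
The layered checks can be sketched as a single admission function run before any handoff. Everything named here is hypothetical (`ReleaseMeta`, `TrustState`, `Reject`, `admit`); `signature_ok` stands in for the result of a real signature verifier, which is out of scope for this sketch, and rejecting an equal version number is an assumed reading of "monotonic".

```rust
use std::collections::HashSet;

/// Hypothetical signed release metadata.
pub struct ReleaseMeta {
    pub version: u64,
    pub channel: String,
    /// Expiry as seconds since the Unix epoch.
    pub expires_at: u64,
    pub key_id: String,
    /// Result of the (not shown) signature verification over this metadata.
    pub signature_ok: bool,
}

/// Hypothetical local trust state.
pub struct TrustState {
    pub installed_version: u64,
    pub pinned_channel: String,
    pub revoked_keys: HashSet<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    BadSignature,
    RevokedKey,
    WrongChannel,
    Expired,
    Rollback,
}

/// Admit a release only if every hardening check passes.
pub fn admit(meta: &ReleaseMeta, trust: &TrustState, now: u64) -> Result<(), Reject> {
    if !meta.signature_ok {
        return Err(Reject::BadSignature);
    }
    if trust.revoked_keys.contains(&meta.key_id) {
        return Err(Reject::RevokedKey);
    }
    if meta.channel != trust.pinned_channel {
        return Err(Reject::WrongChannel);
    }
    if now >= meta.expires_at {
        return Err(Reject::Expired);
    }
    // Monotonic version: anything not strictly newer is treated as a rollback.
    if meta.version <= trust.installed_version {
        return Err(Reject::Rollback);
    }
    Ok(())
}
```

The signature check comes first so that none of the other fields are trusted before the metadata itself is authenticated. Adapter content would need an equivalent check over its own signed metadata.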

### E. Open spike gaps (must close before M3 builds the daemon)
1. ✅ **CLOSED (Spike #3, 2026-06-01).** Live Iroh + file-transfer stream survival across brain restart (FATAL #2). Spike #1 only proved PTY + plain TCP; Spike #3 stood up a broker-owned Iroh endpoint + uni QUIC transfer and proved a live peer download survives a brain restart gapless + exactly-once. Loopback shape validated; off-node transport + the QUIC-ownership *implementation* defer to M4. (`docs/spikes/SPIKE-03-quic-survival.md`.)
2. ✅ **CLOSED (Spike #4, 2026-06-01).** Linux `forkpty` parity. Spike #1's binary, unchanged, passed all four invariants on `gravity-linux` (Ubuntu 22.04) — `portable-pty` selects `forkpty` on Unix; the ConPTY-DSR branch is inert there. Invariant B passed with the *strict* contiguity check that ConPTY-under-resize fails (Spike #5): `forkpty` is a raw pipe, ConPTY a screen buffer — different stream contracts, promoted into M3a's `SessionSurface` design. (`docs/spikes/SPIKE-04-forkpty-parity.md`.)
3. ✅ **CLOSED (Spike #5, 2026-06-01).** 100× restart + resize-under-load stress (codex #3). The broker survived 100 rapid brain kill/restart cycles + a resize thread racing the reader with no leak (one child, clean reap), no hang (watchdog deadline), no lost byte (full value-set coverage in the client's observation window). Surfaced two binding ConPTY findings for M3a/REQ-TERM-3: resize triggers a repaint that reorders and duplicates the stream (terminal-stream contract ≠ exactly-once transfer contract); mid-session attach yields the viewport only, not scrollback. (`docs/spikes/SPIKE-05-restart-stress.md`.)
4. ✅ **CLOSED (Spike #6, 2026-06-01).** Idempotent/exactly-once delivery across brain restart (FATAL-adjacent #14). Durable-ID + WAL + dedup-at-effect proven exactly-once across a crash at any protocol point (before-intent / before-effect / after-effect). Two binding design constraints surfaced for `REQ-HAZARD-RESTART-IDEMPOTENT` / M3b-B5: broker-owned recovery anchor; dedup-at-effect keyed by durable ID. (`docs/spikes/SPIKE-06-idempotent-boundary.md`.)
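
The durable-ID + WAL + dedup-at-effect pattern from item 4 can be modelled in memory as follows. This is an illustrative sketch, not Spike #6's code: `WalEntry`, `EffectSink`, `CrashAt`, `deliver`, and `recover` are hypothetical names, a `Vec` stands in for the on-disk log, and a crash is simulated by returning early. The WAL and the sink's seen-set live outside the brain's functions, mirroring the "broker-owned recovery anchor" constraint.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WalEntry {
    Intent { id: u64, payload: String },
    Done { id: u64 },
}

/// Stand-in for the effect side, e.g. the session receiving a message.
#[derive(Default)]
pub struct EffectSink {
    seen: HashSet<u64>,
    pub delivered: Vec<String>,
}

impl EffectSink {
    /// Dedup-at-effect: apply only the first time a durable ID is seen.
    pub fn apply(&mut self, id: u64, payload: &str) -> bool {
        if !self.seen.insert(id) {
            return false;
        }
        self.delivered.push(payload.to_string());
        true
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashAt {
    Never,
    BeforeIntent,
    BeforeEffect,
    AfterEffect,
}

/// One delivery attempt by a brain; returns false if the brain "crashed".
pub fn deliver(wal: &mut Vec<WalEntry>, sink: &mut EffectSink, id: u64, payload: &str, crash: CrashAt) -> bool {
    if crash == CrashAt::BeforeIntent {
        return false;
    }
    wal.push(WalEntry::Intent { id, payload: payload.to_string() });
    if crash == CrashAt::BeforeEffect {
        return false;
    }
    sink.apply(id, payload);
    if crash == CrashAt::AfterEffect {
        return false;
    }
    wal.push(WalEntry::Done { id });
    true
}

/// Run by the restarted brain: replay every intent that has no Done record.
pub fn recover(wal: &mut Vec<WalEntry>, sink: &mut EffectSink) {
    let done: HashSet<u64> = wal
        .iter()
        .filter_map(|e| match e {
            WalEntry::Done { id } => Some(*id),
            _ => None,
        })
        .collect();
    let pending: Vec<(u64, String)> = wal
        .iter()
        .filter_map(|e| match e {
            WalEntry::Intent { id, payload } if !done.contains(id) => Some((*id, payload.clone())),
            _ => None,
        })
        .collect();
    for (id, payload) in pending {
        // Replay is safe even if the effect already happened: the sink dedups.
        sink.apply(id, &payload);
        wal.push(WalEntry::Done { id });
    }
}
```

In this model a crash before the intent is logged leaves nothing to replay, so the sender must retry with the same durable ID; crashes after the intent are repaired by `recover`, and the sink's dedup makes a replay of an already-applied effect a no-op.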

**Gate status (2026-06-01): ALL FOUR §E GAPS CLOSED (PASS)** (Spikes #3, #4, #5, #6) → the M3-PLAN Phase-0 spike-gate is **complete**. The broker/brain split is validated on both OSes (ConPTY + `forkpty`), across QUIC-stream survival, restart+resize churn, and exactly-once idempotency. M3a and M3b are both unblocked; the QUIC-ownership *implementation* and off-node transport remain explicitly deferred to M4 (only the *shape* was spiked).
