# M4 Plan — networking + instances

> **Status: COMPLETE (2026-06-04).** D0–D9 all delivered CI-green; the §D9b
> two-host proof ran on the real rig (CI run `26958175812`, HFENDULEAM ↔
> gravity-linux — `docs/TWO-HOST-RUNBOOK.md` evidence appendix); every M4 req
> satisfied at its `required_stages` or rule-5-deferred with rationale
> (`traceable-reqs check` green); ROADMAP §M4 records the delivery. Deferred
> seams annotated to M5/post-v1 in ROADMAP §M5 + §Out below.

> **Just-in-time, lightweight** — same pattern as `M0/M1/M2a/M2b/M3a/M3b/M3c-PLAN.md`.
> Task layer authored in full 2026-06-03 at M4 start (see §Tasks).
> Branch: `dev-freeform`. Authoritative architecture: **ADR-0002** (networking baked
> into core), **ADR-0003** (multi-instance node-anchored identity), **ADR-0005**
> (pairing), **ADR-0006/0007** (subnet membership + notification primitive), and
> `ROADMAP.md` §M4.

> **Upstream is done:** **M3 COMPLETE (2026-06-03)** — M3a (spt-term PTY) · M3b
> (broker/brain daemon) · M3c (signed self-update). The broker's resource-ownership
> table (ADR-0004 §B) and the update engine's transport seam are both built
> transport-agnostic precisely so M4 is an *integration*, not a re-architecture:
> M4 plugs the P2P transport into seams that already exist.

## Goal

Deliver the real cross-machine proof: `spt-net` (Iroh WAN + mDNS LAN + TOTP-SPAKE2
pairing + the eventually-consistent subnet registry) and the multi-instance model
(ADR-0003) so the same agent runs on each of a user's machines with a synced mind,
reachable and operable from anywhere — **cross-node message + drive an agent on
machine A from machine B**, plus update peer-propagation across the subnet.

## Scope

### In

- **`spt-net` crate** — the remaining sibling in the layering
  (`…→spt-msg→{spt-net, spt-term, spt-runtime}→…`): Iroh endpoint (WAN, the Spike #2
  shape) + mDNS (LAN discovery) + TOTP-SPAKE2 pairing (ADR-0005) (`R-NET-*`, `R-PAIR-*`).
- **Subnet registry** — eventually-consistent `endpoint_id → [instances]` + resolution
  policy (local → most-recently-active → `id@node`) (`REQ-INST-7`).
- **QUIC-stream broker ownership — the implementation.** Spike #3 validated the
  *shape* (a broker-owned Iroh endpoint + live QUIC stream survives a brain restart);
  M4 builds it for real into the broker's ADR-0004 §B ownership table.
- **Cross-node Psyche sync (P2P)** — replaces the interim gh-repo sync; the direct-write
  precedence marker + node id (KNOWN-HAZARDS 6.5).
- **Remote-drive of an already-running instance** — the network session-surface (the
  highest-value v1 slice; remote-drive detection + off-node file transfer).
- **Update peer-propagation transport (`REQ-UPD-1`)** — plug the P2P transport into the
  **M3c update-engine seam**: the subnet self-heals to latest, every propagated binary
  still passing the M3c signature/rollback gate before any handoff (one compromised node
  must not poison the subnet — ADR-0004 §D).

### Out

- **Shells, PresenceChannel, and the consent-gated capabilities** (remote-exec,
  instantiate-anywhere) — **M5**, landing into the v1 consent seam M3c-C2 already shaped.
- **Cross-user / per-(subnet, user)** generalization of the notif spool + dismissal — a
  documented forward seam, not v1.

## Requirements (activate per task when M4 starts)

`R-NET-*`, `R-PAIR-*`, `REQ-INST-7`, `REQ-UPD-1`, `REQ-EP-4` (PresenceChannel impl — or
M5), and the distributed precedence/freshness generalization of 6.5 — all currently
`required_stages = []`. Re-confirm the new-requirement assessment (TRACEABILITY rule 3)
at task-authoring time (pairing + subnet wire formats may surface `REQ-HAZARD-*`).

## Clean-room posture

Copy-verbatim only the sister's stable wire/registry **formats** that survive (ADR-0001);
clean-room the transport (the sister had no P2P). Reuse Spike #2/#3 Iroh learnings (iroh
0.98 API, the loopback QUIC-survival shape) — those were sandbox spikes, not in-repo crates.

## Tasks (expanded 2026-06-03 at M4 start)

> **Authored against:** ADR-0002 (net baked-in), ADR-0003 (instances + #7/#8/#9 red-team),
> ADR-0005 (pairing + #10/#11/#12 red-team), ADR-0006 (multi-subnet), ADR-0007 (notif),
> ADR-0009 (endpoint ACL, node-tier), ADR-0004 §B/§D (broker ownership + update gate),
> KNOWN-HAZARDS 6.5 (`REQ-HAZARD-DIRECT-WRITE-PRECEDENCE`). Sub-tasks are the executable
> unit; each lands with its tag in the same commit (TRACEABILITY rule 1). Activate the
> listed reqs **at the sub-task that delivers them** (rule 5), not up front.

**Sequencing.** D0→D1 are the transport floor. D2 (pairing) and D3 (registry) both sit on
D1 and can interleave, but D3's cross-node resolution wants D2's trust material to be real,
so author D2 first. D4 (broker-owned endpoint) underpins D5/D7 (both ride broker QUIC
streams). D6 (Psyche sync) needs D3's per-subnet distribution. D8 (notif) reuses D3+D6
distribution. D9 is the closeout. **Each D-task is itself a milestone-sized body** — expect
a `/self-clear` + commune between most of them; commit atomically per sub-task.

### D0 — Scaffold `crates/spt-net`
- **D0a** New crate (sibling of spt-term/spt-runtime, above spt-msg). Workspace member; deps
  `spt-proto`/`spt-store`/`spt-msg`; `publish = false` (not public SDK, R-ARCH-2).
- **D0b** `net` cargo feature (default-on in `spt` ref binary, optional for lib consumers —
  ADR-0002). Empty `net`-gated module tree so a `--no-default-features` lib build stays clean.
- **Reqs:** REQ-NET-1 (`doc` stub only at D0; `impl`/`unit` land in D1). · **Acceptance:**
  `cargo check` + `--no-default-features` both compile; layering acyclic; `traceable-reqs check` green.

### D1 — Iroh endpoint + mDNS discovery
- **D1a** Iroh endpoint bound to the node's Ed25519 identity (`REQ-NODE-IDENTITY` already
  built in spt-proto — reuse, don't re-mint). Spike #2 shape (iroh 0.98).
- **D1b** n0 default relay + `self-host relay` config knob + plain-language disclosure string
  (ADR-0002 v1 relay stance). · **REQ-NET-2**
- **D1c** mDNS LAN discovery: one node advertises, another finds it; connect on a `net` ALPN.
- **Reqs:** REQ-NET-1 (impl+unit), REQ-NET-2. · **Acceptance:** two local nodes discover + connect; E2E.

### D2 — TOTP-SPAKE2 pairing (security-critical)
- **D2a** TOTP seed = durable subnet secret; QR / `otpauth://` provisioning; SPAKE2 handshake
  using the current 6-digit code as the PAKE password (ADR-0005). Dedicated pre-trust pairing ALPN.
- **D2b** **Full pairing-transcript binding** — roles, both node pubkeys, subnet ID, seed
  epoch, TOTP time-step, confirmation MACs (red-team **#12**). Negative tests: MITM, replay,
  reflection, wrong-subnet, stale time-step.
- **D2c** **Seed epoch + rotation** — removing a node rotates the subnet seed so an old
  node/seed cannot rejoin; transcript binds the epoch (red-team **#10**). Trust-store delete
  alone is NOT revocation.
- **D2d** **Subnet-global rate limiting** — one active ceremony per subnet, shared attempt
  counter, exponential backoff; justify-or-drop the ±1 TOTP window (red-team **#11**).
- **D2e** Local trust store (authorized peer pubkeys) + TOFU + warn-on-key-change. · **REQ-PAIR-2**
- **D2f** Multi-subnet ceremony: discovery takes TOTP code + subnet-name; rendezvous token =
  `H(subnet-name ‖ TOTP-epoch)`; create-new names the subnet up front. · **REQ-PAIR-5, REQ-PAIR-4**
- **D2g** Elevation-gated per-subnet code fetch (Win UAC / Linux root or elevated agent
  endpoint; else authenticator app). `spt pair show-totp [--subnet|--create-new]`. · **REQ-PAIR-6, REQ-PAIR-3**
- **Reqs:** REQ-PAIR-1..6. · **New hazard reqs to register first (rule 3):** add
  `REQ-HAZARD-PAIR-TRANSCRIPT-BIND` (#12), `REQ-HAZARD-PAIR-SEED-ROTATION` (#10),
  `REQ-HAZARD-PAIR-RATE-LIMIT` (#11) to `traceable-reqs.toml` before satisfying.
- **Acceptance:** two machines pair via one TOTP code; trust material stored; all negative tests pass.
- **(REQ-PAIR-7 subnet icon = GUI-only consumer; defer to D9 unless trivial.)**
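
The D2d rule (one active ceremony per subnet, a shared attempt counter, exponential backoff) can be sketched as below. This is a minimal illustration, not the `spt-net` implementation: `PairingLimiter`, `Refusal`, and the base/max delay parameters are hypothetical names and values chosen for the example. The state is keyed by subnet rather than by peer, so guesses spread across many peers still hit one counter.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Why a pairing attempt was refused.
#[derive(Debug, PartialEq, Eq)]
pub enum Refusal {
    /// Another ceremony is already running for this subnet.
    CeremonyActive,
    /// The subnet is backing off until the given instant.
    BackingOff { until: Instant },
}

#[derive(Default)]
struct SubnetState {
    active: bool,
    failures: u32,
    blocked_until: Option<Instant>,
}

/// Hypothetical subnet-global limiter (names and fields are assumptions).
pub struct PairingLimiter {
    base: Duration,
    max: Duration,
    subnets: HashMap<String, SubnetState>,
}

impl PairingLimiter {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, subnets: HashMap::new() }
    }

    /// Try to start a ceremony for `subnet` at time `now`.
    pub fn begin(&mut self, subnet: &str, now: Instant) -> Result<(), Refusal> {
        let st = self.subnets.entry(subnet.to_string()).or_default();
        if st.active {
            return Err(Refusal::CeremonyActive);
        }
        if let Some(until) = st.blocked_until {
            if now < until {
                return Err(Refusal::BackingOff { until });
            }
        }
        st.active = true;
        Ok(())
    }

    /// Record the outcome of the active ceremony and update the backoff.
    pub fn finish(&mut self, subnet: &str, success: bool, now: Instant) {
        let st = self.subnets.entry(subnet.to_string()).or_default();
        st.active = false;
        if success {
            st.failures = 0;
            st.blocked_until = None;
        } else {
            st.failures = st.failures.saturating_add(1);
            // Delay is base * 2^(failures - 1), capped at `max`.
            let shift = (st.failures - 1).min(16);
            let delay = self.base.saturating_mul(1u32 << shift).min(self.max);
            st.blocked_until = Some(now + delay);
        }
    }
}
```

Passing `now` in explicitly keeps the limiter deterministic under test; a real version would also need the counter to be shared across the subnet's nodes, which this single-process sketch does not attempt.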

### D3 — Subnet registry + resolution policy
- **D3a** Per-subnet eventually-consistent registry `endpoint_id → [instances]` with
  status active/dormant/offline. · **REQ-INST-7**
- **D3b** **Registry lease/epoch via monotonic per-node epochs** (NOT wall-clock) so
  active/dormant transitions + partitions don't route to the wrong instance (red-team **#8**).
  Chaos-test stale entries + ambiguous active instance.
- **D3c** Resolution policy: local → most-recently-active → explicit `id@node`. Qualified
  addressing `[subnet:]id[@node]`; **ambiguity across visible subnets refuses + forces
  qualification** (no guessing). · **REQ-INST-10**
- **D3d** Multi-subnet membership (one node holds N trust-contexts/seeds; same-user only,
  cross-user seam shaped). Join-time bare-id collision check. · **REQ-INST-9**
- **D3e** Endpoint visibility per-`(endpoint,subnet)`: excluded = not advertised AND not
  routable; OR-of-defaults + per-`(E,S)` override; **visibility gates sync**. · **REQ-INST-12, REQ-INST-13**
- **D3f** `spt rename <id> <new_id>` rippled to all instances (registry, every perch,
  `a-<id>` context branch), collision-checked per target subnet, 6.5-reconciled. · **REQ-INST-11**
- **Reqs:** REQ-INST-7,9,10,11,12,13. **New hazard req (rule 3):** `REQ-HAZARD-REGISTRY-EPOCH-LEASE` (#8).
- **Acceptance:** an `endpoint_id` resolves to a live instance across nodes; ambiguous id refuses.
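
To make the D3c policy concrete, here is a hedged sketch of parsing `[subnet:]id[@node]` and resolving it in the order local → most-recently-active, refusing when a match spans more than one visible subnet. All type and function names are illustrative, and `last_active` is a hypothetical ordering key: how the real registry ranks activity across nodes (D3b's epochs) is not modeled here.

```rust
/// A parsed `[subnet:]id[@node]` address.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Address {
    pub subnet: Option<String>,
    pub id: String,
    pub node: Option<String>,
}

/// One registry row. `last_active` is a hypothetical ordering key.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Instance {
    pub subnet: String,
    pub id: String,
    pub node: String,
    pub last_active: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Resolution<'a> {
    Found(&'a Instance),
    NotFound,
    /// The id matched in several subnets: refuse and list them.
    Ambiguous(Vec<String>),
}

pub fn parse_address(s: &str) -> Option<Address> {
    let (subnet, rest) = match s.split_once(':') {
        Some((subnet, rest)) => (Some(subnet), rest),
        None => (None, s),
    };
    let (id, node) = match rest.split_once('@') {
        Some((id, node)) => (id, Some(node)),
        None => (rest, None),
    };
    if id.is_empty() || subnet == Some("") || node == Some("") {
        return None;
    }
    Some(Address {
        subnet: subnet.map(str::to_string),
        id: id.to_string(),
        node: node.map(str::to_string),
    })
}

pub fn resolve<'a>(addr: &Address, local_node: &str, registry: &'a [Instance]) -> Resolution<'a> {
    // Apply every qualifier the caller supplied.
    let matches: Vec<&'a Instance> = registry
        .iter()
        .filter(|i| i.id == addr.id)
        .filter(|i| addr.subnet.as_ref().map_or(true, |s| *s == i.subnet))
        .filter(|i| addr.node.as_ref().map_or(true, |n| *n == i.node))
        .collect();

    // Matches in more than one subnet: refuse rather than guess.
    let mut subnets: Vec<String> = matches.iter().map(|i| i.subnet.clone()).collect();
    subnets.sort();
    subnets.dedup();
    if subnets.len() > 1 {
        return Resolution::Ambiguous(subnets);
    }

    // Prefer the local instance, then the most recently active one.
    if let Some(local) = matches.iter().copied().find(|i| i.node == local_node) {
        return Resolution::Found(local);
    }
    match matches.into_iter().max_by_key(|i| i.last_active) {
        Some(i) => Resolution::Found(i),
        None => Resolution::NotFound,
    }
}
```

In this sketch an explicit `@node` narrows the candidate set before the ambiguity check, so `id@node` only refuses if that node hosts the id in more than one subnet. That ordering is a design choice of the example, not something the plan specifies.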

### D4 — QUIC-stream broker ownership (impl Spike #3 shape)
- **D4a** Move the Iroh endpoint + live QUIC streams into the **stable broker** per ADR-0004
  §B ownership table (Spike #3 validated the shape; build it for real). Broker owns the
  socket; brain attaches/detaches.
- **D4b** Stream survives a brain restart **gapless** (no endpoint interruption — composes
  with the M3b handoff substrate + REQ-UPD-3 guarantee).
- **D4c** `REQ-EP-4` PresenceChannel broker endpoint seam (day-one seam; minimal v1 surface).
- **Reqs:** ADR-0004 §B, REQ-EP-4 (seam). **Track (ADR-0002 SERIOUS #6):** start the
  fault-injection acceptance matrix (crash/hang Iroh, mDNS, registry, PTY broker independently).
- **Acceptance:** a live cross-node QUIC stream survives a brain restart gapless.

### D5 — Cross-node message + remote-drive of a running instance
- **D5a** WAN message delivery over the broker endpoint (the REQ-MSG path, now off-node). · **REQ-NET-1**
- **D5b** Remote-control mode (byte-stream terminal attach to an instance on another node;
  compute+files stay remote) — distinct from local operation. · **REQ-INST-8**
- **D5c** Off-node remote-drive **detection** + file transfer (handoff seam: `git pull` +
  fresh-with-preload resume; remote goes dormant, no teardown). · **REQ-REACH-1**
- **D5d** Endpoint access control (ADR-0009 node-tier): per-endpoint whitelist, origin-node
  gate, stateful-firewall (reply/outbound exempt), outer gate before grants. · **REQ-SEC-1**
- **Reqs:** REQ-NET-1, REQ-INST-8, REQ-REACH-1, REQ-SEC-1. · **Acceptance:** drive an agent
  on machine A from machine B; off-node access denied without a grant.
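
One possible reading of the D5d node-tier gate, as a small sketch: inbound traffic to an endpoint is admitted only if the origin node is on that endpoint's whitelist, or if it is a reply on a flow this node opened (the stateful-firewall exemption). `NodeGate` and its methods are hypothetical names; ADR-0009 defines the real rules, and the grant layer that sits behind this outer gate is not shown.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical outer gate evaluated before any per-capability grant.
#[derive(Default)]
pub struct NodeGate {
    /// Endpoint id -> origin nodes allowed to reach it.
    whitelist: HashMap<String, HashSet<String>>,
    /// (endpoint, remote node) pairs for which this node opened the flow.
    outbound: HashSet<(String, String)>,
}

impl NodeGate {
    /// Whitelist `origin` for `endpoint`.
    pub fn allow(&mut self, endpoint: &str, origin: &str) {
        self.whitelist
            .entry(endpoint.to_string())
            .or_default()
            .insert(origin.to_string());
    }

    /// Remember that `endpoint` initiated traffic toward `remote`.
    pub fn note_outbound(&mut self, endpoint: &str, remote: &str) {
        self.outbound.insert((endpoint.to_string(), remote.to_string()));
    }

    /// Decide whether inbound traffic from `origin` may reach `endpoint`.
    pub fn check_inbound(&self, endpoint: &str, origin: &str) -> Decision {
        let is_reply = self
            .outbound
            .contains(&(endpoint.to_string(), origin.to_string()));
        let listed = self
            .whitelist
            .get(endpoint)
            .map_or(false, |nodes| nodes.contains(origin));
        if is_reply || listed {
            Decision::Allow
        } else {
            Decision::Deny
        }
    }
}
```

The default is deny: an endpoint with no whitelist entry and no outbound flow admits nothing, which matches the acceptance line "off-node access denied without a grant" at this outer tier.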

### D6 — Cross-node Psyche sync (P2P), retire gh-repo interim
- **D6a** P2P context replication replacing the interim `gh`-repo sync (ADR-0002 hub stays
  opt-in for `tracked/` only). Two-tier: live context → all instances; project context →
  same-project instances. · **REQ-NET-3, REQ-INST-2, REQ-INST-5**
- **D6b** **Distributed precedence/freshness merge**: the distributed generalization of
  6.5 (red-team **#7**). The precedence marker carries source + timestamp + **node
  identity**; freshness = newest-and-newer-than-mine.
  - **Merge-model DECIDED (2026-06-03, best judgment):** per-node **monotonic epoch
    counters + a version vector** (one counter per node, persisted in branchstore). A
    write carries the author's vector; on receipt, if the incoming vector strictly
    dominates → accept; if dominated → drop (stale); if **concurrent** (neither
    dominates) → **surface as an explicit conflict**, never silently newest-wins.
  - **Rationale:** text context snapshots written infrequently by ≤N nodes don't need
    CRDT machinery, but DO need correct causality + concurrent-write detection (#7's
    core); a version vector gives both cheaply and **unifies with D3b's epoch-lease
    (#8)**: one per-node monotonic epoch source serves registry leases *and* sync
    precedence. Wall-clock stays only as a human-facing tiebreaker hint inside a flagged
    conflict, never as the ordering authority.
  - **Conflict-surface UX CONFIRMED with user (2026-06-03, ADR-0013): Psyche
    auto-reconcile.** Both versions persist as tracked conflict artifacts (never lose
    either); the endpoint's own Psyche merges them in one bounded stdout-captured turn on
    the **active instance's node** (single reconciler; fallback lowest node id); merged
    write = `join(vA,vB)` + reconciler bump (dominates both parents, clears subnet-wide).
  - **D7.5a (the bounded stdout-captured Psyche turn driver) is pulled forward into
    D6b** as its prerequisite; D7.5b/c stay at D7.5.
  - **Transport confirmed git-native:** incremental bundles over broker streams,
    pull-based ref-scoped, shared DAG (no per-node divergent histories); BranchStore v1 =
    git-CLI-backed (REQ-STORE-1); hub mode stays a deferred opt-in seam.
- **D6c** Subnet-exclusive sync honored (hidden ⟹ not synced; per-endpoint membership list).
- **Reqs:** REQ-NET-3, REQ-INST-2, REQ-INST-5; `REQ-HAZARD-DIRECT-WRITE-PRECEDENCE` (extend
  coverage to multi-writer/node-id). · **Acceptance:** the mind follows the user across two
  machines (two-tier); concurrent-write conflict surfaced, not silently dropped.
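
The D6b merge rule (accept if the incoming vector strictly dominates, drop if dominated, surface a conflict if concurrent, and reconcile with `join(vA,vB)` plus a reconciler bump) can be sketched as follows. `VersionVector`, `Verdict`, `classify`, and `reconcile` are illustrative names, not the branchstore types; treating an identical vector as `Drop` is an assumption of this sketch.

```rust
use std::collections::BTreeMap;

/// One monotonic counter per node; a missing entry counts as 0.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct VersionVector(BTreeMap<String, u64>);

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Incoming strictly dominates the local vector.
    Accept,
    /// Incoming is stale (dominated) or identical.
    Drop,
    /// Neither dominates: concurrent writes, surface explicitly.
    Conflict,
}

impl VersionVector {
    pub fn get(&self, node: &str) -> u64 {
        self.0.get(node).copied().unwrap_or(0)
    }

    /// Advance this node's own counter (done by the author of a write).
    pub fn bump(&mut self, node: &str) {
        *self.0.entry(node.to_string()).or_insert(0) += 1;
    }

    /// True if `self` is at least `other` on every node.
    fn covers(&self, other: &Self) -> bool {
        other.0.iter().all(|(node, &count)| self.get(node) >= count)
    }

    /// Pointwise maximum of the two vectors.
    pub fn join(&self, other: &Self) -> Self {
        let mut out = self.clone();
        for (node, &count) in &other.0 {
            let entry = out.0.entry(node.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
        out
    }
}

/// Classify an incoming write against the local vector.
pub fn classify(mine: &VersionVector, incoming: &VersionVector) -> Verdict {
    match (incoming.covers(mine), mine.covers(incoming)) {
        (true, false) => Verdict::Accept,
        (false, true) | (true, true) => Verdict::Drop,
        (false, false) => Verdict::Conflict,
    }
}

/// Vector for the merged write: join both parents, then bump the reconciler.
pub fn reconcile(a: &VersionVector, b: &VersionVector, reconciler: &str) -> VersionVector {
    let mut merged = a.join(b);
    merged.bump(reconciler);
    merged
}
```

Because the reconciled vector is the join plus one bump, it strictly dominates both parents, so every node that held either conflicting version classifies the merged write as `Accept` and the conflict clears.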

### D7 — Update peer-propagation transport (into the M3c engine seam)
- **D7a** Plug the P2P transport into the **existing M3c update-engine seam** (`update::`
  front door); the subnet self-heals to latest. No re-architecture of the engine.
- **D7b** **Every propagated binary still passes the M3c signature/rollback gate per node**
  before any handoff — one compromised node must not poison the subnet (ADR-0004 §D). · **REQ-UPD-1**
- **Reqs:** REQ-UPD-1. · **Acceptance:** a signed update self-heals across the subnet; the
  Ed25519+rollback gate runs on every receiving node; a tampered propagated binary is rejected.

### D7.5 — Psyche outbound relay channel (prerequisite for D8) — ADR-0012
> Surfaced via grill-with-docs 2026-06-03. The Psyche is **sandboxed** (Read/Write/Edit
> only) — its **sole outbound is stdout** `<EVENT type="reply|notify">` intents, which the
> daemon parses + relays as its proxy. spt-core has **neither** a stdout-capturing
> live-Psyche driver **nor** a marker parser yet; the interim `spawn_session` `Stdio::null()`
> silently discards every Psyche reply/notify (KNOWN-HAZARDS 7.3).
- **D7.5a** Drive the live-Psyche turn as a **bounded, stdin-fed, stdout-CAPTURED** harness
  invocation (per-turn `--resume`, mirroring the sister) — **not** the detached
  `Stdio::null()` `spawn_session`. The captured stdout is the reply/notify channel.
  **Pulled forward into D6b** (ADR-0013: the conflict-reconcile turn needs it); D7.5
  consumes the driver built there.
- **D7.5b** Parse the Psyche's stdout `<EVENT type="reply">` / `<EVENT type="notify">`
  envelopes; extend the `spt-proto::event` type taxonomy (`+reply`, `+notify`).
- **D7.5c** Daemon relay + **sanitize** (the trust boundary): strip any Psyche-supplied
  `from=`/target, re-stamp `from=<self_id>`, **constrain routing** (`reply` → the
  `__REPLY_TO__` sender only; `notify` → the agent's own user/subnet only), validate body
  (4.1). Anti-spoof is a conformance invariant, not best-effort.
- **Reqs:** `REQ-HAZARD-PSYCHE-OUTBOUND-PROXY` (activate here), ADR-0012. · **Feeds:** D8
  (the `notify` relay is D8's Psyche-producer path; the `reply` half is psyche→agent
  messaging). · **Acceptance:** a Psyche turn emitting `<EVENT type="reply">` relays to the
  inbound sender; `<EVENT type="notify">` reaches the user; a Psyche-supplied `from=` is
  stripped (anti-spoof negative test); a null-stdout driver fails the guard.
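
The D7.5c sanitize step can be illustrated with a short sketch: whatever `from=` or target the Psyche supplied is discarded, `from` is re-stamped with the daemon's own id, and routing is derived only from the intent type. The types below are hypothetical stand-ins for the `spt-proto::event` taxonomy, and the body checks (non-empty, size cap) are placeholder assumptions, not the actual 4.1 validation.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Intent {
    Reply,
    Notify,
}

/// What the Psyche emitted on stdout. Every field is untrusted.
#[derive(Debug, Clone)]
pub struct RawEvent {
    pub kind: Intent,
    pub from: Option<String>,
    pub target: Option<String>,
    pub body: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// Only the sender of the message being answered.
    ReplyTo(String),
    /// Only the agent's own user/subnet.
    OwnUser,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Outbound {
    pub from: String,
    pub route: Route,
    pub body: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RelayError {
    NoReplyTarget,
    EmptyBody,
    BodyTooLarge,
}

/// Turn an untrusted Psyche intent into a daemon-stamped outbound event.
pub fn sanitize(
    raw: RawEvent,
    self_id: &str,
    reply_to: Option<&str>,
    max_body: usize,
) -> Result<Outbound, RelayError> {
    // Psyche-supplied `from` and `target` are dropped unconditionally.
    let RawEvent { kind, from: _, target: _, body } = raw;
    if body.trim().is_empty() {
        return Err(RelayError::EmptyBody);
    }
    if body.len() > max_body {
        return Err(RelayError::BodyTooLarge);
    }
    // Routing comes from daemon state, never from the event itself.
    let route = match kind {
        Intent::Reply => Route::ReplyTo(reply_to.ok_or(RelayError::NoReplyTarget)?.to_string()),
        Intent::Notify => Route::OwnUser,
    };
    Ok(Outbound { from: self_id.to_string(), route, body })
}
```

The key property for the anti-spoof negative test is that no code path copies `raw.from` or `raw.target` into the output, so a spoofed value cannot survive regardless of its content.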

### D8 — Subnet notification primitive
- **D8a** Notif as a first-class kind (user-directed, dismissable, resurfacing): per-subnet
  replicated spool (reuses D3 distribution), subnet-tagged, dismiss-state replicates
  subnet-wide. Two states seen/dismissed. · **REQ-NOTIF-1**
- **D8b** First-fire → most-recently-active endpoint in-subnet; resurface undismissed at
  reported boundaries (wake / api-clear / api-compact / new-session) gated by seen-per-endpoint
  + cross-endpoint suppression timeout (~1h). Scoped to visible subnets.
- **D8c** Refactor the M3c self-update prompt + consent escalation to be **notif producers**
  (one primitive, not three ad-hoc paths).
- **D8d** `spt notify` (agent-issued subnet notif) + `notif_command` manifest seam on harness
  + shell adapters; envelope `from` = issuer endpoint+node+subnet. · **REQ-NOTIF-2**
- **Reqs:** REQ-NOTIF-1, REQ-NOTIF-2. · **Acceptance:** a notif fires to the active endpoint,
  resurfaces at a boundary, dismiss replicates subnet-wide; `spt notify` reaches the user.
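
The D8b resurfacing gate admits more than one interpretation; the sketch below encodes one reading, stated as an assumption: a notif is not shown again on an endpoint that has already seen it, and it is not shown on a different endpoint until the suppression timeout has elapsed since it last surfaced anywhere. `Notif`, its fields, and `should_resurface` are hypothetical names, and time is modeled as a `Duration` offset from an arbitrary origin to keep the example deterministic.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Hypothetical replicated notif record.
pub struct Notif {
    /// Dismissal replicates subnet-wide and is final.
    pub dismissed: bool,
    /// Endpoints that have already shown this notif.
    pub seen_by: HashSet<String>,
    /// When any endpoint last surfaced it (offset from an arbitrary origin).
    pub last_surfaced: Option<Duration>,
}

/// Decide, at a reported boundary, whether `endpoint` should show the notif.
pub fn should_resurface(n: &Notif, endpoint: &str, now: Duration, suppress: Duration) -> bool {
    if n.dismissed {
        return false;
    }
    // Seen-per-endpoint gate.
    if n.seen_by.contains(endpoint) {
        return false;
    }
    // Cross-endpoint suppression: wait out the timeout after the last surfacing.
    match n.last_surfaced {
        Some(last) => now.saturating_sub(last) >= suppress,
        None => true,
    }
}
```

With the plan's approximate one-hour timeout, a notif first shown on one endpoint would stay quiet on the others for that hour, then resurface at their next wake or new-session boundary unless it was dismissed in the meantime.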

### D9 — Activation sweep + cross-node E2E + closeout
- **D9a** Dormancy resource budget (red-team **#9**): measure N dormant sessions on real
  adapters; define adapter warm/cold policy before locking warm default. · **REQ-INST-3, REQ-INST-4**
- **D9b** Two-host cross-node E2E on the real rig (gravity-linux ↔ a Windows host): pair →
  register → cross-node message → remote-drive → Psyche sync → update self-heal.
- **D9c** REQ-INST-14 (resource advertisement) + REQ-INST-15 (immutable home subnet + `spt
  fork`, ADR-0010) — register/activate if in V1-mid line; else document as forward seam.
- **D9d** Confirm every M4 req `required_stages` is satisfied; `traceable-reqs check` green;
  amend ROADMAP (M4 delivered) + CONTEXT glossary; CI matrix green (ubuntu+windows+traceability).
- **Acceptance:** the V1-mid scope line met; `check` green; CI green; cross-node proof demoed.

## Requirement → task coverage map

| Task | Activates |
|------|-----------|
| D0 | REQ-NET-1 (doc) |
| D1 | REQ-NET-1 (impl/unit), REQ-NET-2 |
| D2 | REQ-PAIR-1..6 + 3 new `REQ-HAZARD-PAIR-*` |
| D3 | REQ-INST-7,9,10,11,12,13 + `REQ-HAZARD-REGISTRY-EPOCH-LEASE` |
| D4 | REQ-EP-4 (seam), ADR-0004 §B |
| D5 | REQ-INST-8, REQ-REACH-1, REQ-SEC-1, REQ-NET-1 (WAN msg) |
| D6 | REQ-NET-3, REQ-INST-2, REQ-INST-5, `REQ-HAZARD-DIRECT-WRITE-PRECEDENCE` (extend) |
| D7 | REQ-UPD-1 |
| D7.5 | REQ-HAZARD-PSYCHE-OUTBOUND-PROXY (+ ADR-0012) |
| D8 | REQ-NOTIF-1, REQ-NOTIF-2 (notify-from-psyche path depends on D7.5) |
| D9 | REQ-INST-3,4 (+14,15 if in line); sweep |

**Deferred to M5 (out of M4, per Scope §Out):** REQ-EP-5 (shell instantiation), REQ-REACH-2
(remote command exec), PresenceChannel full impl, cross-user generalization of every per-subnet
key. Their seams are shaped in D4/D5/D8 but not built.

## Red-team obligations (binding — from the Stage-A review, must close in M4)

| Item | ADR | Lands in | Gate |
|------|-----|----------|------|
| #6 one-daemon failure domain | 0002 | D4 | fault-injection acceptance matrix |
| #7 wall-clock loses concurrent writes | 0003 | D6 | per-node epoch + version-vector merge, conflict surfaced |
| #8 ambiguous resolution under skew | 0003 | D3 | monotonic per-node epoch lease + chaos test |
| #9 dormant-warm resource cost | 0003 | D9 | measured budget + warm/cold policy |
| #10 trust-delete ≠ revocation | 0005 | D2c | seed epoch + rotation |
| #11 distributed PAKE guessing | 0005 | D2d | subnet-global rate limit, one ceremony |
| #12 transcript must bind | 0005 | D2b | full transcript spec + negative tests |
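
For obligation #12 (D2b), the sketch below shows one way the bound fields could be serialized before each side computes its confirmation MAC: every field is length-prefixed so that no two distinct transcripts share an encoding. The field layout, the `spt-pair-transcript-v1` domain tag, and all names are assumptions for illustration; the actual transcript spec is the D2b deliverable. The 30-second time-step is the TOTP default from RFC 6238, which the plan does not restate.

```rust
/// Which side is computing the confirmation MAC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Initiator = 1,
    Responder = 2,
}

/// Hypothetical set of fields the pairing transcript binds.
#[derive(Debug, Clone, Copy)]
pub struct Transcript<'a> {
    pub initiator_pubkey: [u8; 32],
    pub responder_pubkey: [u8; 32],
    pub subnet_id: &'a str,
    pub seed_epoch: u64,
    pub totp_step: u64,
}

/// TOTP time-step counter, assuming the RFC 6238 default 30-second step.
pub fn totp_step(unix_secs: u64) -> u64 {
    unix_secs / 30
}

/// Append one field with a 4-byte big-endian length prefix.
fn put(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Bytes that `author` would feed to its confirmation MAC.
pub fn mac_input(t: &Transcript<'_>, author: Role) -> Vec<u8> {
    let mut out = Vec::new();
    put(&mut out, b"spt-pair-transcript-v1"); // hypothetical domain tag
    put(&mut out, &[author as u8]);
    put(&mut out, &t.initiator_pubkey);
    put(&mut out, &t.responder_pubkey);
    put(&mut out, t.subnet_id.as_bytes());
    put(&mut out, &t.seed_epoch.to_be_bytes());
    put(&mut out, &t.totp_step.to_be_bytes());
    out
}
```

Including the author's role means a MAC produced by one side cannot be reflected back as the other side's confirmation, and binding the seed epoch and time-step makes a MAC from a rotated seed or an old step fail to verify. Those map onto the reflection, replay, and stale-time-step negative tests listed in D2b.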

## Workspace change

Add `crates/spt-net` (the remaining sibling). Final layering (R-ARCH-1, acyclic):
`spt-proto → spt-store → spt-msg → {spt-net, spt-term, spt-runtime} → spt-live → spt-daemon → spt`.
`spt-net` is **not** public SDK (R-ARCH-2 stays proto/runtime/msg).
