# Two-host rig runbook — the D9b ladder on real hardware

How to run the `SPT_TWO_HOST` integration ladder (`crates/spt-daemon/tests/twohost.rs`)
across two real machines, and what green looks like. This is the M4 cross-node proof
(M4-PLAN §D9b): **pair → register → cross-node message → remote-drive → Psyche sync →
update self-heal → notif cross-node** — every rung over a real network path between two
separate processes on two separate hosts, no loopback shortcuts.

The test is **env-gated**: without `SPT_TWO_HOST=1` and a role it skips silently, so CI's
normal sweeps never run it. The `int` evidence tags live on the gated tests; the
requirement `int` stages activate only after a real rig run is green (TRACEABILITY.md
rule: tags on real evidence).

## The rig

| host | OS | role | tailscale address |
|------|----|------|-------------------|
| HFENDULEAM | Windows 11 | **a** — joiner / driver | `100.68.35.65` |
| kitsubito | Ubuntu 22.04 | **b** — seed-holder / server | `100.98.197.12` |

(kitsubito replaced gravity-linux `100.100.62.43` as the Linux rig host /
runner on 2026-06-07, with the same Ubuntu 22.04 baseline; the historical
evidence below was captured on gravity.)

Tailscale addresses are the rig's stable transport (the hosts sit on different
physical subnets and the Linux host's name does not resolve from HFENDULEAM; the
tailnet IPs are static). Any reachable IP pair works — LAN addresses included.

## How the two sides find each other (no runtime exchange)

Everything either side needs is **derived from a shared secret + static env**, so both
legs run non-interactively (the Linux leg is driven by a commit-message-gated CI
step — there is no SSH between the rig hosts):

- node identities ← `sha256(secret:"node-a"/"node-b")`;
- the subnet TOTP seed ← `sha256(secret:"subnet-totp")[..20]` — role A computes the
  6-digit code locally, standing in for the user reading the seed-holder's
  `spt subnet show-code` screen. **The wire ceremony is the real SPAKE2/TOTP
  exchange** (run_initiator/run_responder over the pre-trust ALPN, cross-host);
  only the operator's eyeballs are simulated;
- the release signing key ← `sha256(secret:"release")`;
- dialable addresses ← derived node id + the peer's IP + fixed UDP ports
  (`BindScope::Port`): relays and discovery disabled, one direct path each way.
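
The derivation scheme above can be approximated from a shell for a quick sanity check. This is a minimal sketch, not the test's code: `derive_hex` is a hypothetical helper, and the byte layout fed to SHA-256 (the secret and the label joined by a colon) is an assumption that may differ from what `twohost.rs` does. It prints raw digests only; how the test turns them into node ids or keys is not shown here.

```sh
#!/bin/sh
# Sketch only: derive_hex is a hypothetical helper, and the "<secret>:<label>"
# input layout is an assumption, not confirmed against twohost.rs.

# Print the hex SHA-256 of "<secret>:<label>".
derive_hex() {
    if command -v sha256sum >/dev/null 2>&1; then
        printf '%s:%s' "$1" "$2" | sha256sum | cut -d' ' -f1
    else
        # macOS and some BSDs ship shasum instead of sha256sum.
        printf '%s:%s' "$1" "$2" | shasum -a 256 | cut -d' ' -f1
    fi
}

# Placeholder secret when the rig variable is not set.
secret=${SPT_TWO_HOST_SECRET:-example-secret}

node_a=$(derive_hex "$secret" node-a)
node_b=$(derive_hex "$secret" node-b)
# [..20] keeps the first 20 bytes, i.e. the first 40 hex characters.
totp_seed=$(derive_hex "$secret" subnet-totp | cut -c1-40)
release_key=$(derive_hex "$secret" release)

echo "node-a material:   $node_a"
echo "node-b material:   $node_b"
echo "subnet TOTP seed:  $totp_seed"
echo "release key seed:  $release_key"
```

Because every value depends only on the secret and a fixed label, running this on both hosts with the same secret prints identical lines, which is what lets the two legs start without exchanging anything at runtime.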

## Env contract

| var | role | meaning | rig value |
|-----|------|---------|-----------|
| `SPT_TWO_HOST` | both | `1` activates; anything else silent-skips | `1` |
| `SPT_TWO_HOST_ROLE` | both | `a` (joiner/driver) / `b` (seed-holder/server) | per host |
| `SPT_TWO_HOST_SECRET` | both | shared secret, identical both sides | any string |
| `SPT_TWO_HOST_PEER_IP` | both | the OTHER host's IP | see table above |
| `SPT_TWO_HOST_SUBNET` | both | subnet name | default `twohost` |
| `SPT_TWO_HOST_PORT_A` | both | A's broker UDP port | default `7460` |
| `SPT_TWO_HOST_PORT_B` | both | B's broker UDP port | default `7461` |
| `SPT_TWO_HOST_WAIT_SECS` | both | per-rung convergence deadline | `900` on the rig (absorbs runner start skew); default `300` |
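
Before a manual run it can help to echo the effective contract. The sketch below is a hypothetical preflight helper (`twohost_preflight` is not part of the repo); it mirrors the table's variables and defaults. Treating a missing secret or peer IP as an error is this sketch's choice, since the table does not say how the test itself reacts to that.

```sh
#!/bin/sh
# Hypothetical preflight: mirrors the env contract table; not part of the repo.
twohost_preflight() {
    # The test itself skips silently unless SPT_TWO_HOST=1 and a role are set.
    if [ "${SPT_TWO_HOST:-}" != "1" ]; then
        echo "skip: SPT_TWO_HOST is not 1"
        return 1
    fi
    case "${SPT_TWO_HOST_ROLE:-}" in
        a|b) ;;
        *) echo "skip: SPT_TWO_HOST_ROLE must be a or b"; return 1 ;;
    esac
    # Flagging these two as errors is an assumption made by this sketch.
    if [ -z "${SPT_TWO_HOST_SECRET:-}" ]; then
        echo "error: SPT_TWO_HOST_SECRET is empty"
        return 1
    fi
    if [ -z "${SPT_TWO_HOST_PEER_IP:-}" ]; then
        echo "error: SPT_TWO_HOST_PEER_IP is empty"
        return 1
    fi
    # Print the effective values, falling back to the documented defaults.
    echo "ok: role=$SPT_TWO_HOST_ROLE peer=$SPT_TWO_HOST_PEER_IP" \
         "subnet=${SPT_TWO_HOST_SUBNET:-twohost}" \
         "port_a=${SPT_TWO_HOST_PORT_A:-7460}" \
         "port_b=${SPT_TWO_HOST_PORT_B:-7461}" \
         "wait=${SPT_TWO_HOST_WAIT_SECS:-300}s"
}

twohost_preflight || true
```

Run it in the same shell session that will launch `cargo test`, so it sees the same environment the test will.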

## Prerequisites

- Clocks NTP-synced (the pairing TOTP window is ±30 s).
- UDP reachable in both directions on the rig's three ports (`7460`-`7462` with the
  default port settings, the range the rule below opens). The Windows side may need a
  one-time inbound allow (run elevated):
  `netsh advfirewall firewall add rule name="spt twohost rig" dir=in action=allow protocol=UDP localport=7460-7462`
  In practice A's pump dials B first and B's replies ride the same UDP 4-tuple, so
  a stateful firewall's default policy usually passes them; add the rule if role B logs
  `PUMP_PEER_FAIL`/dial timeouts against A.
- Both sides on the same commit (the wire formats must match).

## Procedure 1 — CI-driven rig run (the standard path)

Both rig hosts are the repo's self-hosted runners, so one tagged push runs both legs:

1. Land the change you want proven; wait for normal CI green.
2. Push a commit whose message contains **`[twohost]`** (an empty commit is fine):
   `git commit --allow-empty -m "test(rig): two-host ladder run [twohost]"`
3. The `twohost-a` (hfenduleam) and `twohost-b` (kitsubito) jobs gate on the tag and
   on `needs: test`: they release together after both test legs finish, which keeps
   the two roles inside each other's `SPT_TWO_HOST_WAIT_SECS` window.
4. Verify the final job conclusions, pinned to the commit sha (gotchas #7/#13):
   `gh run list --commit <sha> --json name,conclusion`
5. Capture both job logs' `TWOHOST OK:` lines into the evidence appendix below.
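
For step 5, a small filter saves hand-trimming the job logs. This is a hedged sketch: `twohost_lines` is a hypothetical helper, and it keeps every line containing `TWOHOST ` (the `OK:` rungs plus the `role` lines), dropping whatever prefix the log viewer put in front of the marker, timestamps included.

```sh
#!/bin/sh
# Hypothetical helper: read a job log on stdin and print only the TWOHOST
# lines, with everything before the marker removed.
twohost_lines() {
    sed -n 's/^.*\(TWOHOST .*\)$/\1/p'
}

# Example (assumes the GitHub CLI is installed; <run-id> is left for you to fill in):
#   gh run view <run-id> --log | twohost_lines

# Demo on a made-up log excerpt; the prefix format here is illustrative only.
printf '%s\n' \
    'job step 14:35:25 TWOHOST OK: file fetch (sid 1)' \
    'unrelated cargo output' \
    'job step 14:35:26 TWOHOST role A: ladder complete' | twohost_lines
```

If the appendix should keep the timestamps, a plain `grep 'TWOHOST '` preserves the full line instead.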

## Procedure 2 — manual two-terminal run

On kitsubito (start first):

```sh
SPT_TWO_HOST=1 SPT_TWO_HOST_ROLE=b \
SPT_TWO_HOST_SECRET=rig-<date> SPT_TWO_HOST_PEER_IP=100.68.35.65 \
SPT_TWO_HOST_WAIT_SECS=900 \
cargo test -p spt-daemon --test twohost -- --nocapture
```

On HFENDULEAM (PowerShell):

```powershell
$env:SPT_TWO_HOST="1"; $env:SPT_TWO_HOST_ROLE="a"
$env:SPT_TWO_HOST_SECRET="rig-<date>"; $env:SPT_TWO_HOST_PEER_IP="100.98.197.12"
$env:SPT_TWO_HOST_WAIT_SECS="900"
cargo test -p spt-daemon --test twohost -- --nocapture
```

Role A retries the pairing dial until B's ceremony endpoint answers, so the exact start
order does not matter as long as both sides come up within the `SPT_TWO_HOST_WAIT_SECS` window.
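
The retry behaviour lives inside the test; the shell sketch below only illustrates the general retry-until-deadline pattern it describes. `retry_until` is a hypothetical helper, and the one-second poll interval is an arbitrary choice rather than the test's.

```sh
#!/bin/sh
# Generic retry-until-deadline pattern; this is not the code in twohost.rs.
# Usage: retry_until SECONDS COMMAND [ARGS...]
retry_until() {
    deadline=$(( $(date +%s) + $1 ))
    shift
    # Re-run the command until it succeeds or the deadline has passed.
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "never converged within the window: $*" >&2
            return 1
        fi
        sleep 1
    done
}

# Demo: `true` succeeds at once, so this returns immediately.
retry_until "${SPT_TWO_HOST_WAIT_SECS:-300}" true && echo "converged"
```

The point of the pattern is that a late-starting peer is absorbed by the window rather than causing an immediate failure.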

## What green looks like

Role A prints, in order:

```
TWOHOST OK: pairing (initiator) — Pinned
TWOHOST OK: register: B's perch row replicated to A
TWOHOST OK: message sent (B asserts the spool)
TWOHOST OK: file fetch (sid <n>)
TWOHOST OK: remote-drive (attach echo)
TWOHOST OK: sync: A pulled B's mind
TWOHOST OK: update: B's v6 staged at A through the verify gate
TWOHOST OK: update: consent notif surfaced at A
TWOHOST OK: notif: B's dismissal replicated to A
TWOHOST OK: file push (done barrier)
TWOHOST role A: ladder complete
```

Role B prints, in order:

```
TWOHOST OK: pairing (responder) — Pinned
TWOHOST OK: register: A's perch row replicated to B
TWOHOST OK: message: A's WAN message spooled at B
TWOHOST OK: notif: A's insert fired on B
TWOHOST OK: notif dismissed at B (<id>) — replicating back
TWOHOST OK: sync: B pulled A's mind
TWOHOST OK: done-file pushed by A (ladder complete on A)
TWOHOST role B: ladder complete
```

Plus `test result: ok` on both sides. Any rung that never converges panics with
`never converged on the rig: <rung>` — file a flake-ledger entry, don't shrug.
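
A saved log can be checked mechanically against the expected order. The sketch below assumes the log sits in a local file; `check_rungs_in_order` is a hypothetical helper that looks for each marker's first occurrence as a fixed string and requires the line numbers to increase.

```sh
#!/bin/sh
# Hypothetical checker: succeed only if every given rung marker appears in the
# log, each on a later line than the one before it.
check_rungs_in_order() {
    log=$1
    shift
    last=0
    for rung in "$@"; do
        # Line number of the first fixed-string match, empty if there is none.
        n=$(grep -n -F -e "$rung" "$log" | head -n 1 | cut -d: -f1) || true
        if [ -z "$n" ]; then
            echo "missing rung: $rung" >&2
            return 1
        fi
        if [ "$n" -le "$last" ]; then
            echo "out of order: $rung" >&2
            return 1
        fi
        last=$n
    done
    echo "all $# rungs present, in order"
}

# Example for role A (role-a.log is a placeholder file name):
#   check_rungs_in_order role-a.log \
#       "pairing (initiator)" "register:" "message sent" "file fetch" \
#       "remote-drive" "sync:" "update: B's v6 staged" "update: consent notif" \
#       "notif: B's dismissal" "file push" "role A: ladder complete"
```

Markers are substrings rather than full lines so that variable parts such as `(sid <n>)` do not need to be known in advance.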

## What each rung proves

| rung | evidence | requirement |
|------|----------|-------------|
| pairing ceremony, cross-host SPAKE2 + seed transfer | trust pinned both sides; joiner holds the subnet seed | REQ-PAIR-1, REQ-PAIR-5 |
| registry replication, both directions | each side's perch row in the other's gated registry | REQ-INST-7 |
| WAN message | A's send lands in B's perch spool through the dispatcher funnel | REQ-NET-1 |
| remote-drive | A fetches a file off B, types into B's live PTY, reads the echo, pushes a file back | REQ-INST-8, REQ-REACH-1 |
| Psyche sync, both directions | each side bootstrap-pulls a mind it never held (registry-derived want-ref, gate served remotely) | REQ-INST-5 |
| update self-heal | B's staged v6 verified + staged at A under the trusted key; consent notif surfaced (gated default) | REQ-UPD-1, REQ-UPD-4 posture |
| notif cross-node | insert on A fires on B; B's dismissal replicates back to A | REQ-NOTIF-1 |

## Evidence appendix

> Filled from the first green rig run; later reruns replace it.

**Run:** CI run `26958175812` on `a86371c` (2026-06-04, Procedure 1 — `[twohost]`
tagged push; both jobs green on the first attempt, no flakes, no firewall rule
needed — role A's pump traffic kept the reverse UDP 4-tuple warm as predicted).
Role A retried the pairing dial for ~85 s while gravity's job built; once both
sides were up the whole ladder converged in ~8 s over tailscale.

Role A (hfenduleam, node `bcead52b…4b41a`):

```
14:33:51 TWOHOST role A: node bcead52b862344aef51998ca5d8f15dce1e38e6cee3795b219a5849f6ad4b41a
14:35:18 TWOHOST OK: pairing (initiator) — Pinned
14:35:25 TWOHOST OK: register: B's perch row replicated to A
14:35:25 TWOHOST OK: message sent (B asserts the spool)
14:35:25 TWOHOST OK: file fetch (sid 1)
14:35:25 TWOHOST OK: remote-drive (attach echo)
14:35:25 TWOHOST OK: sync: A pulled B's mind
14:35:25 TWOHOST OK: update: B's v6 staged at A through the verify gate
14:35:25 TWOHOST OK: update: consent notif surfaced at A
14:35:26 TWOHOST OK: notif: B's dismissal replicated to A
14:35:26 TWOHOST OK: file push (done barrier)
14:35:26 TWOHOST role A: ladder complete
         test result: ok. 2 passed; 0 failed (94.95s)
```

Role B (gravity-linux, node `9bbcee97…8b612`):

```
14:35:16 TWOHOST role B: node 9bbcee970607c7454b0baf8b38b032ccdf1a36de7ef04d614158fecd1f08b612
14:35:16 TWOHOST role B: drive session 1 ready
14:35:18 TWOHOST OK: pairing (responder) — Pinned
14:35:22 TWOHOST OK: register: A's perch row replicated to B
14:35:25 TWOHOST OK: message: A's WAN message spooled at B
14:35:25 TWOHOST OK: notif: A's insert fired on B
14:35:25 TWOHOST OK: notif dismissed at B (bcead52b…:1780583722231) — replicating back
14:35:25 TWOHOST OK: sync: B pulled A's mind
14:35:26 TWOHOST OK: done-file pushed by A (ladder complete on A)
14:35:26 TWOHOST role B: ladder complete
         test result: ok. 2 passed; 0 failed (10.15s)
```

## M5 rungs (D9a — appended to the same ladder)

One tagged run now also climbs the M5 legs, in order after the M4 rungs:

| rung | what happens | requirement |
|------|--------------|-------------|
| remote suspend/wake | A sends the wire rest ops; B's transition host runs the edges; the state flips come back through the ordinary registry advertisement | REQ-INST-6 (rig leg) |
| presence-shift redirect | the user is most-recently-active at B (gossiped stamp); A's produce first-fires REMOTE (nothing surfaced at A), B's feed-apply surfaces it there; the marks replicate back | REQ-PRES-1 |
| cross-node shell relink + drive | A relinks its persistent notify shell living at B and drives a vocabulary-checked `notify` command over the shelllink wire; B relaunches the binary and the command renders | REQ-SHELL-1, REQ-SHELL-2 |
| notif→toast (the dogfood demo) | an agent's `spt notify` at A surfaces as a native render on whichever host the user last touched — B, via the shell's `[session.notif]` template | REQ-NOTIF-2 |

**The notify shell binary:** the CI twohost jobs check out + build
`SaberMage/spt-shell-notify` and pass `SPT_TWO_HOST_NOTIFY_BIN`; the rung also
expects the workspace `spt` binary in the same target dir (the `needs: test`
leg built it on the same runner). Either missing ⇒ the rungs degrade loudly to
a fallback template shell: the link/seam mechanics still prove, only the
real-adapter render assert is skipped.

**The real-toast eyeball check (manual, non-CI):** run role B on a host with a
display and `SPT_TWO_HOST_NOTIFY_BIN` set but WITHOUT the `--render-file` knob
in the manifest the test writes — or simply run the standalone adapter's
binary directly (`notify-shell --render-title hi --render-body there`): a
native toast (Windows) / notify-send bubble (Linux) appears. CI asserts the
render-file observable; the OS call itself is this one eyeball step.

> M5 evidence appendix: filled from the first green M5-rung rig run.

**M5 evidence (D9a):** run `26998058816` on `620edf3` (2026-06-04, third
attempt — two gap-closes: the canonical-snapshot-dir fix `b518219`, the
in-job spt build `620edf3`). Full 11-rung ladder green with the REAL
standalone adapter (`real_mode=true`), role B (gravity):

```
06:00:35 TWOHOST role B: notify instance notify-0 minted (real_mode=true)
06:00:40 TWOHOST OK: pairing (responder) — Pinned
06:00:44 TWOHOST OK: register: A's perch row replicated to B
06:00:47 TWOHOST OK: message: A's WAN message spooled at B
06:00:47 TWOHOST OK: notif: A's insert fired on B
06:00:47 TWOHOST OK: notif dismissed at B — replicating back
06:00:47 TWOHOST OK: sync: B pulled A's mind
06:00:48 TWOHOST OK: rest: A's remote suspend landed (B suspended)
06:00:48 TWOHOST OK: rest: A's remote wake landed (B active again)
06:00:50 TWOHOST OK: presence: A's redirected notif surfaced at B
06:00:50 TWOHOST OK: presence: the surfacing node owns the marks
06:00:51 TWOHOST OK: shell: A's cross-node notify command RENDERED at B
06:00:51 TWOHOST OK: toast: A's notify rendered via the shell template at B
06:00:52 TWOHOST OK: done-file pushed by A (ladder complete on A)
06:00:52 TWOHOST role B: ladder complete
```

---

## Restoration field-run evidence (D7-4) — the seamless-update acceptance

<!-- [doc->REQ-DAEMON-2] -->

The restoration milestone's field acceptance: a published restoration-line
release (`v0.4.0`, release counter 7) applied on the live fleet, with the
running brain proven to be executing the **released artifact bytes** via the
`brain.ready` `exe_hash` breadcrumb (D7-1) — the bytes-level fact the
v0.3.2/enlyzeam incident lacked (binary on disk, old code resident for ~a day).

**Published artifact hashes (SHA256SUMS, `v0.4.0`):**

```
b4b7ea142d2ec3d7bb20e33cb66c8ff81464aeef258eca42842e55e04cf07b5e  spt-x86_64-linux
81c24675e344bf7426a654df364ef2ba5606b3f0e758308942c987cd98747da9  spt-x86_64-windows.exe
```

**Fleet roll (2026-06-10), both nodes → v0.4.0, two-process resident:**

| host | OS | broker pid | brain pid | `brain.ready` exe_hash | == artifact |
|------|----|-----------|-----------|------------------------|-------------|
| kitsubito | Linux | 819830 | 819839 | `b4b7ea14…cf07b5e` | ✅ linux |
| HFENDULEAM | Windows | 17232 | 38324 (`--start-reason cold`) | `81c24675…8747da9` | ✅ windows |

Both nodes: a `daemon run` broker + a real `daemon brain` child process, the
brain's `exe_hash` an **exact match** to the released artifact for its platform
— the resident bytes ARE the release, not a stale image.
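
The comparison itself is a plain string match against `SHA256SUMS`. The sketch below assumes you already have the full `exe_hash` string from the `brain.ready` breadcrumb in hand (how to read the breadcrumb is not covered here); `expected_hash` and `verify_exe_hash` are hypothetical helper names.

```sh
#!/bin/sh
# Hypothetical helpers for comparing a breadcrumb hash with SHA256SUMS.

# Print the hash that SUMS_FILE records for ARTIFACT ("<hash>  <name>" lines).
expected_hash() {
    awk -v name="$2" '$2 == name { print $1 }' "$1"
}

# Usage: verify_exe_hash SUMS_FILE ARTIFACT EXE_HASH
verify_exe_hash() {
    want=$(expected_hash "$1" "$2")
    if [ -z "$want" ]; then
        echo "no entry for $2 in $1" >&2
        return 1
    fi
    if [ "$want" = "$3" ]; then
        echo "match: $2"
    else
        echo "MISMATCH: $2 (want $want, got $3)" >&2
        return 1
    fi
}

# Example (SHA256SUMS downloaded next to the script; hash taken from the
# published v0.4.0 list above):
#   verify_exe_hash SHA256SUMS spt-x86_64-linux \
#       b4b7ea142d2ec3d7bb20e33cb66c8ff81464aeef258eca42842e55e04cf07b5e
```

A full-string comparison is deliberate: the abbreviated `prefix…suffix` form in the table is for reading, not for verification.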

**The last manual bounce (expected, by design).** The fleet was on `v0.3.2` —
the **in-process** daemon, i.e. the regression itself. Observed live on
kitsubito: `spt update apply` swapped the binary on disk (`spt --version` →
`0.4.0`, friendly "Updated spt-core to v0.4.0." message) yet the **running
daemon pid was unchanged** (159446, pump still STALLED) — the in-process broker
cannot self-relaunch onto new bytes. A manual unit restart
(`systemctl --user restart spt-daemon.service`) then brought up the v0.4.0
two-process daemon (new pid 819830 + brain 819839, pump LIVE, `brain.ready`
written). This crossing is each node's **last manual bounce** — `v0.4.0` is, by
the milestone's own framing, the last release that needs one.

**The seamless no-bounce assert is carried forward.** Proving that `update apply`
triggers a brain-process respawn onto new bytes **with no manual bounce**
requires the daemon to already be split-era (`v0.4.0`) before the apply. That
assert is therefore owned by the D7-1 process-level survival E2E (green) and will
be field-confirmed on the **next** release roll (`v0.4.1`: v0.4.0 → v0.4.1,
seamless). The `v0.3.2 → v0.4.0` crossing here proves the bytes-match +
resident-two-process half; the no-bounce half lands at v0.4.1.

> Outstanding at this rung: enlyzeam (`<0.3.2`) — its catch-up bounce onto the
> split era is the project's final manual bounce; pair it with the v0.4.1 roll.
