# Robust WAN subnet join: meet-before-code + per-family bind gate

<!-- [doc->REQ-JOIN-TWO-PHASE] [doc->REQ-NET-FAMILY-GATE] [doc->REQ-JOIN-DIAGNOSTICS] -->

## Status

accepted (2026-06-27; grilled with operator + `doyle`, `/grill-with-docs`, against a confirmed field incident). Builds on ADR-0002 (networking in core, n0 relay/DNS default), the TOTP-seeded SPAKE2 pairing model (CONTEXT.md "Pairing & trust"), and the rendezvous/meet path (`crates/spt-net/src/net/pairing/{meet,rendezvous,totp,ntp}.rs`, `crates/spt-daemon/src/pairhost.rs`). Full per-item spec: `docs/design/robust-wan-subnet-join.md`.

## Context

A remote machine (SCELTOUIN) could not join an spt subnet over WAN: it "never found a machine," then silently stopped with no error. Diagnosis (confirmed by toggling the variable): the host had **half-broken IPv6** (AAAA records resolve, but the IPv6 path is dead). spt's iroh endpoint binds **dual-stack** and attempts both families for n0 discovery (`dns.iroh.link`) and the rendezvous hole-punch; the dead-IPv6 attempts are not refused, they **silently time out**, burning the 30 s rotating-rendezvous window (`SEARCH_DEADLINE` is 75 s). The usual outcome was a silent timeout; occasionally the meet succeeded slowly, after the pre-entered code had rotated, and the join failed with "Wrong code (codes rotate every 30 s)". Disabling IPv6 fixed it instantly.

Three defects surfaced behind one symptom:

1. **A glossary conflation that misshaped the flow.** CONTEXT.md called the **TOTP-code** "the link-discovery selector." It is not — the rendezvous/meet selector is `(subnet-name, TOTP-epoch)` (`rendezvous_token = SHA256(domain ‖ name ‖ totp_step)`), both **public**. The secret **code** is the SPAKE2 password *only*, used in the ceremony *after* the meet. The flow prompted for the code up front because the mental model said "the code routes discovery" — so the entered code aged across a slow meet and went stale.
2. **No resilience to half-broken IPv6.** Dead-AAAA candidate attempts silently consume the rendezvous window; the operator had to disable IPv6 by hand.
3. **A silent failure path.** `connect_seed_holder` (`pairhost.rs:437`) retries for 75 s with no output; `dial_via_rendezvous` (`meet.rs:281`) discards each probe error; `brain.rs:1024` `_ => continue` can swallow the terminal event — so the user saw nothing.
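
The distinction in item 1 can be sketched as follows: the rendezvous selector is built only from public inputs, and the secret code never appears. This is a minimal sketch, not the code in `rendezvous.rs`. The function names, the plain concatenation, and the big-endian step encoding are assumptions, and the SHA-256 step itself is left out so the example has no dependencies.

```rust
/// Rotation period of the TOTP step, taken from the "codes rotate every 30 s" message.
const TOTP_STEP_SECS: u64 = 30;

/// Public epoch counter: which 30 s step a Unix timestamp falls in.
fn totp_step(unix_secs: u64) -> u64 {
    unix_secs / TOTP_STEP_SECS
}

/// Bytes that would be fed to SHA-256 to obtain the rendezvous token.
/// ASSUMPTION: plain concatenation with a big-endian step. The real encoding
/// (separators, length prefixes, endianness) is not specified in this ADR.
/// Note that the secret TOTP code is not an input.
fn rendezvous_preimage(domain: &[u8], subnet_name: &str, step: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(domain.len() + subnet_name.len() + 8);
    buf.extend_from_slice(domain);
    buf.extend_from_slice(subnet_name.as_bytes());
    buf.extend_from_slice(&step.to_be_bytes());
    buf
}
```

Two joiners that know only the subnet name and the current time compute the same preimage within one step, which is why the meet can run before any code is entered.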

## Decision

### 1. Meet-before-code — a two-phase join

Because the meet is **code-independent** (selector = public `(name, epoch)`), reorder the join so the code is collected **after** a member is found and used **immediately** at the ceremony — fresh regardless of how long discovery took.

Extend the `brain.pair_join` event loop into two phases on one stream: CLI `PairMeetReq` → daemon meets `(name, current-epoch)`, resolving the seed-holder's **real, stable** pairing address → `MetMember` event → CLI prompts for the code → `PairCodeSubmit` → daemon runs the SPAKE2 ceremony against the held address → `PairJoined`/`PairFail`. The daemon holds the resolved real address between phases, bounded by a **5-minute wait-for-code timeout**. A genuine wrong-code retry re-runs only the ceremony against the held address, with **no re-search**. The non-interactive `--code` path stays one-shot (no prompt); it relies on fast discovery (decision 2) to keep the supplied code in its window. Security is unchanged: the meet is already pre-trust/unauthenticated (`SPT_PAIR_MEET_ALPN`); authentication stays in SPAKE2.
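
The held-address state between the two phases can be sketched as a small state machine. This is an illustration only: `JoinState`, `JoinInput`, `Ceremony`, and `step` are hypothetical names, the address is a placeholder `String`, and the sketch assumes a wrong-code retry does not restart the 5-minute clock (the ADR does not say either way).

```rust
use std::time::{Duration, Instant};

/// Wait-for-code bound from this decision.
const CODE_WAIT_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// Outcome of one SPAKE2 ceremony run (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Ceremony {
    Ok,
    WrongCode,
    Error(String),
}

/// Daemon-side join state (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum JoinState {
    Searching,
    /// A member was met; its resolved address is held while the CLI prompts.
    AwaitingCode { addr: String, met_at: Instant },
    Joined,
    Failed(String),
}

/// Inputs that drive the state machine (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum JoinInput {
    MemberFound { addr: String },
    SearchExhausted { reason: String },
    /// A `PairCodeSubmit` arrived and the ceremony ran with this outcome.
    CodeSubmitted(Ceremony),
}

fn step(state: JoinState, input: JoinInput, now: Instant) -> JoinState {
    match (state, input) {
        (JoinState::Searching, JoinInput::MemberFound { addr }) => {
            JoinState::AwaitingCode { addr, met_at: now }
        }
        (JoinState::Searching, JoinInput::SearchExhausted { reason }) => {
            JoinState::Failed(reason)
        }
        (JoinState::AwaitingCode { addr, met_at }, JoinInput::CodeSubmitted(outcome)) => {
            if now.duration_since(met_at) > CODE_WAIT_TIMEOUT {
                return JoinState::Failed("timed out waiting for code".to_string());
            }
            match outcome {
                Ceremony::Ok => JoinState::Joined,
                // Wrong code: keep the held address, so no re-search is needed.
                Ceremony::WrongCode => JoinState::AwaitingCode { addr, met_at },
                Ceremony::Error(e) => JoinState::Failed(e),
            }
        }
        // Inputs that do not apply to the current state leave it unchanged.
        (state, _) => state,
    }
}
```

The point of the shape is that `AwaitingCode` carries the address forward, so the only transition back into a search is a fresh join.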

### 2. Per-family bind gate

spt-core controls which IP families the iroh endpoint binds (`NetEndpoint::bind`). At bind, probe each family's reachability **once** and bind **only the families that actually work**: dual-stack when both are healthy, IPv4-only when IPv6 is the half-broken kind, **IPv6-only when IPv4 is the dead one** (drop the *dead* family, never a fixed "prefer IPv4" — IPv6-only networks must still work). Re-evaluated on daemon restart. Explicit overrides `SPT_DISABLE_IPV6` / `SPT_DISABLE_IPV4` force a family off (escape hatch + determinism + testing), mirroring the `SPT_NTP_SERVER` pattern.
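
The gate's decision table is small enough to write out. The sketch below is a pure function over probe results and overrides; `BindFamilies`, `FamilyInputs`, and `choose_families` are hypothetical names, not the `NetEndpoint::bind` API. The overrides are passed in as booleans because this ADR does not define how the `SPT_DISABLE_IPV6` / `SPT_DISABLE_IPV4` values are parsed, and returning `None` when no family is usable is an assumption about a case the ADR does not cover.

```rust
/// Which families the endpoint should bind (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BindFamilies {
    DualStack,
    V4Only,
    V6Only,
}

/// Probe results plus operator overrides (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct FamilyInputs {
    v4_reachable: bool,
    v6_reachable: bool,
    /// Corresponds to SPT_DISABLE_IPV4 being set.
    disable_v4: bool,
    /// Corresponds to SPT_DISABLE_IPV6 being set.
    disable_v6: bool,
}

/// Drop whichever family is dead or forced off; never prefer one by default.
fn choose_families(i: FamilyInputs) -> Option<BindFamilies> {
    let v4 = i.v4_reachable && !i.disable_v4;
    let v6 = i.v6_reachable && !i.disable_v6;
    match (v4, v6) {
        (true, true) => Some(BindFamilies::DualStack),
        (true, false) => Some(BindFamilies::V4Only),
        (false, true) => Some(BindFamilies::V6Only),
        // ASSUMPTION: with nothing usable, the caller decides how to fail.
        (false, false) => None,
    }
}
```

The SCELTOUIN case corresponds to `v4_reachable: true, v6_reachable: false`, which yields `V4Only` without any override.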

### 3. Diagnosable join

The meet must never fail silently: (a) **live progress** during the search (elapsed / deadline) so silence never reads as a hang; (b) a **detailed failure** on exhaustion — rendezvous candidates and families attempted, relay vs direct, the last concrete error — surfaced **before** any code prompt (a dead subnet must not make the user fetch a code); (c) **propagate the terminal event** (`brain.rs:1024` must deliver `NoSeedHolder`/`PairFail` to the CLI, not `continue` it away); and (d) a `--verbose` / `SPT_LOG` **discovery trace** (per-probe id, discovery path, timeouts) so the next WAN-join issue is self-serve.
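
Points (a) to (c) can be sketched together: every event is rendered, and terminal events are matched by name so that no wildcard arm can drop them. The types below are hypothetical stand-ins for the real `brain.rs` events, and the `MeetFailure` fields are one possible reading of "candidates and families attempted, relay vs direct, the last concrete error".

```rust
use std::fmt;
use std::time::Duration;

/// Failure detail carried by the terminal "no seed-holder" event (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct MeetFailure {
    candidates_tried: usize,
    families: Vec<&'static str>,
    via_relay: bool,
    last_error: String,
}

impl fmt::Display for MeetFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "no seed-holder found: {} rendezvous candidates tried over [{}] ({}); last error: {}",
            self.candidates_tried,
            self.families.join(", "),
            if self.via_relay { "relay" } else { "direct" },
            self.last_error
        )
    }
}

/// Events on the join stream (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum JoinEvent {
    Progress { elapsed: Duration, deadline: Duration },
    MetMember,
    NoSeedHolder(MeetFailure),
    PairFail(String),
    PairJoined,
}

/// One CLI line per event. The match is exhaustive on purpose: adding a new
/// variant is a compile error until it is rendered.
fn render(ev: &JoinEvent) -> String {
    match ev {
        JoinEvent::Progress { elapsed, deadline } => {
            format!("searching... {}s / {}s", elapsed.as_secs(), deadline.as_secs())
        }
        JoinEvent::MetMember => "found a member; enter the current code".to_string(),
        JoinEvent::NoSeedHolder(failure) => failure.to_string(),
        JoinEvent::PairFail(why) => format!("pairing failed: {}", why),
        JoinEvent::PairJoined => "joined".to_string(),
    }
}

fn is_terminal(ev: &JoinEvent) -> bool {
    matches!(
        ev,
        JoinEvent::NoSeedHolder(_) | JoinEvent::PairFail(_) | JoinEvent::PairJoined
    )
}

/// Forward every event, stopping after the first terminal one.
fn forward(events: impl IntoIterator<Item = JoinEvent>) -> Vec<String> {
    let mut lines = Vec::new();
    for ev in events {
        lines.push(render(&ev));
        if is_terminal(&ev) {
            break;
        }
    }
    lines
}
```

Because `NoSeedHolder` is terminal and is rendered before the loop exits, a dead subnet produces its explanation without the CLI ever reaching a code prompt.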

## Considered and rejected

- **The belief that "find a machine first, then prompt for the code" is impossible.** The reorder was initially thought to be blocked because the rendezvous id seemed to be derived from the code; the source shows that the selector uses the public epoch, so the reorder is sound. Recorded because the conflation was load-bearing.
- **Probe-then-join (two stateless calls: quick probe, then a one-shot join with the code).** Simpler and stateless, but the code is still entered *before* the join-meet, so a slow second meet re-strands it; it does not actually guarantee the goal. Rejected in favor of the callback-stream (decision 1).
- **"Always prefer IPv4."** Breaks IPv6-only networks. Rejected for "drop the *dead* family."
- **Shrinking the staleness window via fast discovery alone (keep one-shot).** The IPv6 fix shrinks it but a momentarily slow meet still strands the code. Fast discovery is necessary (and is decision 2) but not sufficient — the reorder is what removes the window.
- **Periodic family re-evaluation.** Deferred — once-at-bind (re-probe on restart) is enough for v1; live re-evaluation is added complexity without a demonstrated need.

## Consequences

- WAN join is robust to half-broken IPv6 with no manual intervention, and the code can no longer go stale across a slow meet.
- A join that fails says **why**, before bothering the user for a code — and `--verbose`/`SPT_LOG` makes the discovery path inspectable (this incident needed code-reading + manual `Test-NetConnection`).
- New protocol surface: a `MetMember` event + a `PairCodeSubmit` message + a daemon-side held-address state machine with a 5-minute timeout. The `--code` path and scripted automation are unchanged.
- The per-family gate is reusable beyond join — any spt connectivity benefits from not attempting a dead family.
- CONTEXT.md is corrected: **TOTP-epoch** (public, routes the meet) vs **TOTP-code** (secret, SPAKE2 only); "the code routes discovery" is now an explicit _Avoid_.
