# Robust WAN subnet join — design

> Source: `/grill-with-docs` session 2026-06-27 (operator + doyle), off a confirmed field incident
> (SCELTOUIN could not join SPT_DEV over WAN — half-broken IPv6 silently burned the rendezvous
> window). Decision record: **ADR-0030**. Glossary correction landed in CONTEXT.md ("link discovery").
> Each item below lands as a tagged `REQ-*` in `traceable-reqs.toml`; this is the next milestone's spec.

## Background (the confirmed root cause)

The joiner dials an **ephemeral, 30 s-rotating** rendezvous id derived from the **public**
`(subnet-name, TOTP-epoch)` — *not* the secret code. iroh resolves that id via n0 DNS
(`dns.iroh.link`) + relay on WAN, mDNS on LAN, then the responder answers with its real address and
the SPAKE2 ceremony runs (the code is the PAKE password). SCELTOUIN had **half-broken IPv6** (AAAA
resolves, path dead); iroh's dual-stack dial attempts against the dead IPv6 candidates timed out silently,
blowing the ±1-epoch window (three 30 s epochs, ~90 s of token acceptance; `SEARCH_DEADLINE` is 75 s). Symptoms: usually a silent
timeout; occasionally a slow meet whose pre-entered code had rotated → "wrong code". Refuted cheaply
first: clock/NTP (offset +0.09 s), Windows firewall (inbound-UDP allow present), n0 relay reach (fine).
The discriminator: `Test-NetConnection dns.iroh.link -Port 443` — IPv6 addr **fails**, IPv4 succeeds.
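
The window arithmetic above can be sketched as follows. This is a minimal illustration, assuming the ±1 means the responder accepts an id whose epoch is within one step of its own current epoch; `epoch_at`, `accepted`, and `acceptance_window_secs` are hypothetical names, not functions from the codebase.

```rust
/// Rotation period of the rendezvous id (30 s, from the text above).
const EPOCH_SECS: u64 = 30;

/// Hypothetical helper: the epoch index containing a Unix timestamp.
fn epoch_at(unix_secs: u64) -> u64 {
    unix_secs / EPOCH_SECS
}

/// Assumed ±1 rule: an id derived for `id_epoch` is accepted while the
/// responder's current epoch is within one step of it.
fn accepted(id_epoch: u64, now_secs: u64) -> bool {
    epoch_at(now_secs).abs_diff(id_epoch) <= 1
}

/// Total time an id for one epoch stays acceptable: three epochs.
fn acceptance_window_secs() -> u64 {
    3 * EPOCH_SECS
}

fn main() {
    // Chosen to fall exactly on an epoch boundary (1_800_000_000 = 30 * 60_000_000).
    let epoch_start = 1_800_000_000;
    let id_epoch = epoch_at(epoch_start);
    println!("acceptance window: {} s", acceptance_window_secs());
    for delay in [0u64, 29, 59, 60, 75] {
        println!(
            "id derived at +0 s, arriving at +{delay} s: accepted = {}",
            accepted(id_epoch, epoch_start + delay)
        );
    }
}
```

Under this reading, an id derived at the start of its epoch is no longer accepted 60 s later, so a discovery that stalls on dead candidates can outlive its id before the 75 s deadline fires. Whether the joiner re-derives the id per probe is not modelled here.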

## A. Meet-before-code — two-phase join → `REQ-JOIN-TWO-PHASE`

The meet selector is public `(name, epoch)`; the code is only the ceremony password. So collect the
code **after** the meet and use it immediately — fresh at ceremony regardless of discovery time.

- **Protocol (one stream, extend `brain.pair_join`, `brain.rs:1009`):**
  `PairMeetReq{subnet}` → daemon runs the meet `(name, current-epoch)` (`pairhost.rs` `connect_seed_holder`
  / `meet.rs` `dial_via_rendezvous`), resolving the seed-holder's **real, stable** pairing address →
  `MetMember` event → CLI prompts the code (`cli.rs` `cmd_subnet_join`, `:6236`) → `PairCodeSubmit{code}`
  → daemon dials the held real-address on `SPT_PAIR_ALPN` + SPAKE2 → `PairJoined`/`PairFail`.
- **Held-address state machine:** the daemon caches the resolved real-address between `MetMember` and
  `PairCodeSubmit`, bounded by a **5-minute wait-for-code timeout**. The real-address is the member's
  stable endpoint (not the rotating rendezvous id), so it does not age within the hold.
- **Wrong-code retry re-runs the ceremony only** against the held address — no re-search (a property
  of holding the address). Bounded by the same 5-minute hold.
- **`--code` path unchanged (one-shot):** code supplied at invocation → no prompt; meet then ceremony
  in sequence. Relies on B (fast discovery) to keep the supplied code in its window; on staleness it
  fails loudly per C (no silent burn).
- **Security unchanged:** the meet is pre-trust/unauthenticated (`SPT_PAIR_MEET_ALPN`); auth stays in
  SPAKE2. Meeting before the code is entered weakens nothing.
- Activate impl+unit+int (int = a real two-phase join over a live meet; wrong-code retry hits ceremony
  only; the 5-minute hold times out cleanly).
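
The held-address state machine above can be sketched as below. This is an illustrative model only: `JoinState`, `HeldAddr`, `Ceremony`, `on_met`, and `on_code` are hypothetical names, the address is reduced to a string, and the SPAKE2 ceremony is replaced by a closure so the transitions can be exercised without a network.

```rust
use std::time::{Duration, Instant};

/// Wait-for-code bound from the design: 5 minutes.
const HOLD_TIMEOUT: Duration = Duration::from_secs(5 * 60);

/// Hypothetical stand-in for the seed-holder's resolved real address.
#[derive(Debug, Clone, PartialEq)]
struct HeldAddr(String);

/// Result of one ceremony attempt (stand-in for the real SPAKE2 run).
enum Ceremony {
    Ok,
    WrongCode,
}

/// Hypothetical daemon-side join state.
#[derive(Debug, PartialEq)]
enum JoinState {
    Searching,
    Holding { addr: HeldAddr, met_at: Instant },
    Joined,
    Failed(String),
}

impl JoinState {
    /// Meet succeeded: cache the real address and start the hold clock.
    fn on_met(self, addr: HeldAddr, now: Instant) -> JoinState {
        match self {
            JoinState::Searching => JoinState::Holding { addr, met_at: now },
            other => other,
        }
    }

    /// Code submitted: run the ceremony against the held address only.
    fn on_code(self, now: Instant, run: impl FnOnce(&HeldAddr) -> Ceremony) -> JoinState {
        match self {
            JoinState::Holding { addr, met_at } => {
                if now.duration_since(met_at) >= HOLD_TIMEOUT {
                    return JoinState::Failed("wait-for-code timeout".to_string());
                }
                match run(&addr) {
                    Ceremony::Ok => JoinState::Joined,
                    // Wrong code: keep the same address and the same clock, no re-search.
                    Ceremony::WrongCode => JoinState::Holding { addr, met_at },
                }
            }
            other => other,
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let state = JoinState::Searching.on_met(HeldAddr("seed-holder".to_string()), t0);
    let state = state.on_code(t0 + Duration::from_secs(40), |_| Ceremony::WrongCode);
    println!("after wrong code: {state:?}");
    let state = state.on_code(t0 + Duration::from_secs(55), |_| Ceremony::Ok);
    println!("after retry: {state:?}");
}
```

Keeping `met_at` unchanged on a wrong code is what bounds retries by the same 5-minute hold rather than restarting it per attempt.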

## B. Per-family bind gate → `REQ-NET-FAMILY-GATE`

- **At `NetEndpoint::bind` (`endpoint.rs`), probe each IP family's reachability ONCE** and bind only
  the families that work: dual-stack when both healthy; IPv4-only when IPv6 is the half-broken
  AAAA-resolves-but-dead kind; **IPv6-only when IPv4 is the dead one** (drop the *dead* family — never
  a fixed prefer-IPv4; IPv6-only networks must keep working). Re-evaluated on daemon restart.
- **Probe shape (impl-verify for the executor):** a short-timeout reachability check per family —
  e.g. a bounded connect to the n0 relay's A / AAAA, or a default-route presence check — kept cheap so
  it does not slow bind. Confirm the iroh API for binding a single family (omit the v6/v4 socket) and
  whether iroh's own hole-punch already fast-fails candidates (may narrow the fix).
- **Override knobs:** `SPT_DISABLE_IPV6` / `SPT_DISABLE_IPV4` force a family off (escape hatch +
  determinism + testing), mirroring `SPT_NTP_SERVER`. A family forced off is never bound, regardless of
  the health probe.
- Reusable beyond join — every spt connection benefits. Activate impl+unit.
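
The gate decision above can be sketched as a pure function over the probe results and the override knobs. Everything here is an assumption for illustration: `Families`, `Overrides`, `choose`, and `probe` are hypothetical names; the knobs are treated as "set means off"; the both-probes-fail case falls back to dual-stack, which the design does not specify; and the probe is a plain TCP connect with a timeout rather than anything iroh provides.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Which IP families the endpoint should bind.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Families {
    Dual,
    V4Only,
    V6Only,
}

/// Operator overrides (assumed semantics: variable set means family forced off).
struct Overrides {
    disable_v4: bool,
    disable_v6: bool,
}

fn overrides_from_env() -> Overrides {
    Overrides {
        disable_v4: std::env::var_os("SPT_DISABLE_IPV4").is_some(),
        disable_v6: std::env::var_os("SPT_DISABLE_IPV6").is_some(),
    }
}

/// One possible probe shape: a bounded TCP connect to a known address of the family.
#[allow(dead_code)]
fn probe(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Decide which families to bind; `None` means both were forced off.
fn choose(v4_ok: bool, v6_ok: bool, o: &Overrides) -> Option<Families> {
    match (!o.disable_v4, !o.disable_v6) {
        (false, false) => None,
        // A forced-off family is never bound; the other is bound even if its probe failed.
        (true, false) => Some(Families::V4Only),
        (false, true) => Some(Families::V6Only),
        (true, true) => match (v4_ok, v6_ok) {
            (true, false) => Some(Families::V4Only),
            (false, true) => Some(Families::V6Only),
            // Both healthy: dual-stack. Both dead: assumed fallback to dual-stack
            // rather than binding nothing.
            _ => Some(Families::Dual),
        },
    }
}

fn main() {
    let overrides = overrides_from_env();
    // Fixed probe results standing in for the half-broken-IPv6 incident.
    println!("half-broken IPv6: {:?}", choose(true, false, &overrides));
    println!("IPv6-only network: {:?}", choose(false, true, &overrides));
}
```

Separating the decision from the probe keeps the drop-the-dead-family rule unit-testable without network access, which fits the impl+unit activation for this requirement.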

## C. Diagnosable join → `REQ-JOIN-DIAGNOSTICS`

- **Live progress** during the meet (replace the one-shot "Searching…" at `cli.rs:6268` with periodic
  elapsed / deadline output) so silence never reads as a hang.
- **Detailed failure** on meet-exhaustion — rendezvous candidates + families attempted (IPv4/IPv6),
  relay vs direct, the last concrete error — surfaced **before** any code prompt. (`connect_seed_holder`
  `pairhost.rs:437` and `dial_via_rendezvous` `meet.rs:281` currently swallow per-attempt errors —
  thread the last error up with attempt context.)
- **Propagate the terminal event:** the `_ => continue` arm at `brain.rs:1024` must deliver a daemon
  `NoSeedHolder`/`PairFail` to the CLI as a printed error instead of discarding it (this is *why the
  user saw nothing*).
- **`--verbose` / `SPT_LOG` discovery trace:** per-probe derived id, discovery path taken (mDNS / n0
  DNS / relay), per-family timeouts — opt-in, so the next WAN-join issue is self-serve. No knob exists
  today.
- Activate impl+unit.
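
The failure detail and terminal-event handling above can be sketched as follows. This is a hedged illustration, not the existing event model: `Attempt`, `MeetFailure`, `DaemonEvent`, and `render` are hypothetical names, and the real events in `brain.rs` may carry different payloads.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Family {
    V4,
    V6,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Path {
    Direct,
    Relay,
}

/// One failed dial, kept with its context instead of being swallowed.
#[derive(Debug)]
struct Attempt {
    candidate: String,
    family: Family,
    path: Path,
    error: String,
}

/// Accumulated context reported when the meet is exhausted.
#[derive(Debug, Default)]
struct MeetFailure {
    attempts: Vec<Attempt>,
}

impl MeetFailure {
    fn record(&mut self, attempt: Attempt) {
        self.attempts.push(attempt);
    }

    fn last_error(&self) -> Option<&str> {
        self.attempts.last().map(|a| a.error.as_str())
    }
}

impl fmt::Display for MeetFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "meet failed after {} attempt(s):", self.attempts.len())?;
        for a in &self.attempts {
            writeln!(f, "  {} via {:?}/{:?}: {}", a.candidate, a.family, a.path, a.error)?;
        }
        match self.last_error() {
            Some(e) => write!(f, "last error: {e}"),
            None => write!(f, "no attempts were made"),
        }
    }
}

/// Hypothetical daemon-to-CLI events.
enum DaemonEvent {
    Progress { elapsed_secs: u64, deadline_secs: u64 },
    NoSeedHolder(MeetFailure),
    PairFail(String),
    Unrelated,
}

/// What the CLI prints for an event; terminal failures are never skipped.
fn render(ev: DaemonEvent) -> Option<String> {
    match ev {
        DaemonEvent::Progress { elapsed_secs, deadline_secs } => {
            Some(format!("searching... {elapsed_secs}s / {deadline_secs}s"))
        }
        DaemonEvent::NoSeedHolder(failure) => Some(format!("error: no seed holder found\n{failure}")),
        DaemonEvent::PairFail(reason) => Some(format!("error: pairing failed: {reason}")),
        DaemonEvent::Unrelated => None,
    }
}

fn main() {
    let mut failure = MeetFailure::default();
    failure.record(Attempt {
        candidate: "id-a".to_string(),
        family: Family::V6,
        path: Path::Direct,
        error: "timed out".to_string(),
    });
    if let Some(line) = render(DaemonEvent::Progress { elapsed_secs: 30, deadline_secs: 75 }) {
        println!("{line}");
    }
    if let Some(line) = render(DaemonEvent::NoSeedHolder(failure)) {
        println!("{line}");
    }
}
```

Matching the terminal variants explicitly, rather than through a catch-all arm, means a newly added failure event fails to compile until someone decides how it is shown.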

## Cross-cutting

- **CONTEXT.md** corrected (done this session): **TOTP-epoch** (public, routes the meet) vs
  **TOTP-code** (secret, SPAKE2 only); _Avoid_ "the code routes discovery".
- **ADR-0030** doc-ratifies the cluster. Traceability: seed all three `REQ-*` with `required_stages =
  ["doc"]` (ADR + this doc are the doc evidence); impl/unit/int activate at the build milestone (rule 5).
- **Known kin (the other two silent join traps):** `docs/DEFERRED.md` row 23 (Windows-firewall inbound
  UDP drop) and row 30 (clock skew → wrong epoch). Both were refuted for this incident but are related;
  the diagnostics (C) would have made each self-evident.
- **Public docs** (ride the build): MANIFEST/reference unaffected (no manifest surface); gh-pages
  `networking/overview` + a join/pairing troubleshooting note (families, `SPT_DISABLE_IPV6`, `--verbose`).
