# v0.17.0 — robust WAN join + presence-liveness truth — build JIT

> Next spt-core minor (after v0.16.0 / counter 35). **Design (binding):** ADR-0030 +
> `docs/design/robust-wan-subnet-join.md` (the join cluster); the presence fix is a correctness bug
> (no ADR). REQs seeded in `traceable-reqs.toml` (`required_stages=["doc"]` for the ADR-0030 cluster,
> `[]` for the presence fix). Glossary correction (TOTP-epoch vs TOTP-code) already in CONTEXT.md.
> Source: a confirmed field incident this session (SCELTOUIN couldn't join over WAN; and its picker
> showed HFENDULEAM's dead endpoints as ONLINE). todlando builds, doyle gates, deployah publishes.

## Why this milestone (the two confirmed bugs)

1. **WAN join silently fails on half-broken IPv6.** iroh binds dual-stack; dead-AAAA candidate
   attempts silently burn the 30 s rendezvous window. Refuted first: clock/NTP, firewall, relay.
   Discriminator: `Test-NetConnection dns.iroh.link -Port 443` IPv6 fails / IPv4 ok. Fixed live by
   disabling IPv6. ([[broken-ipv6-poisons-iroh-discovery]])
2. **Dead endpoints show ONLINE on remote nodes.** `registryhost.rs:404` advertises a not-alive perch
   as `Dormant`; `picker/data.rs:268` maps remote `Dormant → Online`. Dead → Dormant → green ONLINE,
   while the home node reads the same perch Offline.

## Build invariants (every wave)

- **Activate per wave, int at the fold** (rule 5 / [[traceable-per-wave-activation]]). Tag evidence in
  the same commit (`[impl->]`/`[unit->]`/`[int->]` on real code/tests).
- **Gate each wave:** `cargo nextest` (spt-net + spt-daemon + spt; NEVER bare `cargo test` on Windows:
  env races), `cargo clippy --workspace`, `traceable-reqs`/`xtask check`, and `xtask gen` where a CLI
  surface changed. Pre-kill the orphan `target/debug/spt.exe`, scoped to workspace/PID
  ([[no-machinewide-killon-shared-runner]]).
- **Shared working tree** with todlando — commit SELECTIVELY (`git add <files>`, never `-A`); coordinate
  `traceable-reqs.toml` edits (the contended file).
- **CI fires on the release PR** (push only triggers main/dev-freeform); twohost rig is `[twohost]`-gated.

## Waves

### W1 — per-family bind gate (REQ-NET-FAMILY-GATE) — THE URGENT ROOT FIX
- At `NetEndpoint::bind` (`crates/spt-net/src/net/endpoint.rs`), probe each IP family's reachability
  ONCE; bind only the working families (dual-stack / IPv4-only / IPv6-only — drop the DEAD family, never
  a fixed prefer-IPv4). Re-eval on daemon restart.
- `SPT_DISABLE_IPV6` / `SPT_DISABLE_IPV4` overrides force a family off regardless of the probe (mirror
  `SPT_NTP_SERVER`).
- **Impl-verify:** the iroh single-family bind API (omit the v6/v4 socket) + whether iroh's hole-punch
  already fast-fails candidates (may narrow the fix). Keep the probe cheap (short timeout; don't slow bind).
- Activate **impl+unit** (unit = family-selection logic both/v4-dead/v6-dead/forced, no live sockets).
- *Shippable alone as the urgent WAN-join unblock.*
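
A minimal sketch of the family-selection logic the W1 unit stage is meant to cover, with no live sockets. Every name here (`Probe`, `BindPlan`, `Overrides`, `select_families`) is hypothetical and not taken from `endpoint.rs`. Two points are assumptions the plan leaves open: how the override variables are parsed (here: any value other than empty, `0`, or `false` counts as set) and what happens when no family survives (here: `None`, leaving the policy to the caller).

```rust
/// Outcome of probing one IP family once at bind time (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probe {
    Reachable,
    Dead,
}

/// Which sockets the endpoint should bind (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BindPlan {
    DualStack,
    V4Only,
    V6Only,
}

/// Operator overrides; a set flag forces that family off regardless of the probe.
#[derive(Debug, Clone, Copy, Default)]
pub struct Overrides {
    pub disable_v4: bool,
    pub disable_v6: bool,
}

impl Overrides {
    /// Build overrides from an env-style lookup so tests need no real environment.
    /// ASSUMPTION: empty, "0" and "false" mean "not set"; anything else means "set".
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        let on = |key: &str| {
            get(key).map_or(false, |v| {
                let v = v.trim().to_ascii_lowercase();
                !v.is_empty() && v != "0" && v != "false"
            })
        };
        Overrides {
            disable_v4: on("SPT_DISABLE_IPV4"),
            disable_v6: on("SPT_DISABLE_IPV6"),
        }
    }
}

/// Drop the dead (or forced-off) family; never apply a fixed prefer-IPv4 rule.
/// Returns `None` when no family is usable (caller decides; an assumption here).
pub fn select_families(v4: Probe, v6: Probe, ov: Overrides) -> Option<BindPlan> {
    let use_v4 = v4 == Probe::Reachable && !ov.disable_v4;
    let use_v6 = v6 == Probe::Reachable && !ov.disable_v6;
    match (use_v4, use_v6) {
        (true, true) => Some(BindPlan::DualStack),
        (true, false) => Some(BindPlan::V4Only),
        (false, true) => Some(BindPlan::V6Only),
        (false, false) => None,
    }
}
```

In production the lookup would presumably be `|k| std::env::var(k).ok()`; keeping it injectable is what lets the both/v4-dead/v6-dead/forced cases run as plain unit tests.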

### W2 — two-phase meet-before-code join (REQ-JOIN-TWO-PHASE)
- Extend `brain.pair_join` (`crates/spt-daemon/src/brain.rs:1009`) into the two-phase stream:
  `PairMeetReq{subnet}` → daemon meets `(name, current-epoch)` resolving the seed-holder's **real stable**
  pairing address (`pairhost.rs` `connect_seed_holder` / `meet.rs` `dial_via_rendezvous`) → new
  `MetMember` event → CLI prompts the code (`cli.rs` `cmd_subnet_join`, `:6236`) → `PairCodeSubmit{code}`
  → daemon dials the held addr on `SPT_PAIR_ALPN` + SPAKE2 → `PairJoined`/`PairFail`.
- Daemon holds the resolved real-address between phases; **5-minute wait-for-code timeout**. A wrong-code
  retry re-runs the **ceremony only** against the held addr (no re-search).
- `--code` path stays one-shot (no prompt). Security unchanged (meet is pre-trust `SPT_PAIR_MEET_ALPN`).
- Activate **impl+unit+int** (int = real two-phase join over a live meet + wrong-code-retry-hits-ceremony-only
  + held-address timeout).
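
The held-address behaviour between the two phases can be sketched as a small state holder. This is an illustration under stated assumptions, not the `brain.rs` implementation: `JoinSession`, `Outcome`, `on_met`, and `submit_code` are hypothetical names, the ceremony is abstracted to a closure returning success or failure, and the 5-minute window is assumed to run from the meet without being reset by a wrong-code retry (the plan does not say either way).

```rust
use std::time::{Duration, Instant};

/// Wait-for-code window from the plan.
pub const CODE_WAIT: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Joined,
    WrongCode,
    HeldAddrExpired,
    NotMet,
}

/// Holds the seed-holder address resolved by the meet until a code arrives.
pub struct JoinSession<A> {
    held: Option<(A, Instant)>,
}

impl<A> JoinSession<A> {
    pub fn new() -> Self {
        JoinSession { held: None }
    }

    /// Phase 1 finished: remember the resolved address and when it was met.
    pub fn on_met(&mut self, addr: A, now: Instant) {
        self.held = Some((addr, now));
    }

    /// Phase 2: run the ceremony against the held address only (no re-search).
    /// `ceremony` stands in for the dial + SPAKE2 step and returns true on success.
    pub fn submit_code(&mut self, now: Instant, ceremony: impl FnOnce(&A) -> bool) -> Outcome {
        let Some((addr, met_at)) = self.held.take() else {
            return Outcome::NotMet;
        };
        if now.duration_since(met_at) > CODE_WAIT {
            // The address is dropped; a new meet is needed.
            return Outcome::HeldAddrExpired;
        }
        if ceremony(&addr) {
            Outcome::Joined
        } else {
            // Keep the address (and the original deadline) for a retry.
            self.held = Some((addr, met_at));
            Outcome::WrongCode
        }
    }
}
```

Passing `now` in explicitly keeps the timeout testable without sleeping, which matches the int-stage cases above (wrong-code retry hits the ceremony only; held-address timeout).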

### W3 — diagnosable join (REQ-JOIN-DIAGNOSTICS)
- (a) Live search progress (replace the one-shot "Searching…" `cli.rs:6268` with periodic elapsed/deadline).
- (b) Detailed meet-failure (candidates + families + relay-vs-direct + last error) surfaced BEFORE any code
  prompt; thread the last error up from `connect_seed_holder` (`pairhost.rs:437`) and `dial_via_rendezvous`
  (`meet.rs:281`), both of which swallow it today.
- (c) Propagate the terminal event: `brain.rs:1024` `_ => continue` must deliver `NoSeedHolder`/`PairFail`
  to the CLI (this is why the user saw nothing).
- (d) `--verbose` / `SPT_LOG` discovery trace (per-probe id, path mDNS/n0-DNS/relay, per-family timeouts).
- Activate **impl+unit**. `xtask gen` if the `--verbose` flag touches the CLI surface.
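
Items (a) and (c) can be illustrated together: every event is rendered for the user, and the loop stops on a terminal event instead of skipping it with `_ => continue`. This is a hedged sketch only. The variant names `MetMember`, `PairJoined`, `NoSeedHolder`, and `PairFail` come from the plan, but `JoinEvent`, `Searching`, the field names, `relay`, and the message wording are all hypothetical.

```rust
use std::time::Duration;

/// Events the daemon streams to the CLI during a join (fields are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JoinEvent {
    Searching { elapsed: Duration, deadline: Duration },
    MetMember,
    PairJoined,
    NoSeedHolder { last_error: String },
    PairFail { reason: String },
}

impl JoinEvent {
    /// Terminal events end the stream and must always reach the CLI.
    pub fn is_terminal(&self) -> bool {
        matches!(
            self,
            JoinEvent::PairJoined | JoinEvent::NoSeedHolder { .. } | JoinEvent::PairFail { .. }
        )
    }

    /// One user-facing line per event (wording is illustrative).
    pub fn render(&self) -> String {
        match self {
            JoinEvent::Searching { elapsed, deadline } => {
                format!("Searching... {}s / {}s", elapsed.as_secs(), deadline.as_secs())
            }
            JoinEvent::MetMember => "Met a member; enter the pairing code".to_string(),
            JoinEvent::PairJoined => "Joined".to_string(),
            JoinEvent::NoSeedHolder { last_error } => {
                format!("No seed holder found (last error: {last_error})")
            }
            JoinEvent::PairFail { reason } => format!("Pairing failed: {reason}"),
        }
    }
}

/// Forward every event to `out`; return the terminal event, if one arrived.
/// Unlike a catch-all `_ => continue`, no variant is silently dropped.
pub fn relay(
    events: impl IntoIterator<Item = JoinEvent>,
    mut out: impl FnMut(String),
) -> Option<JoinEvent> {
    for ev in events {
        out(ev.render());
        if ev.is_terminal() {
            return Some(ev);
        }
    }
    None
}
```

Matching exhaustively in `render` means a newly added event variant fails to compile until it is given a user-visible line, which guards against the original "user saw nothing" failure mode recurring.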

### W4 — presence-liveness truth (REQ-PRESENCE-LIVENESS-TRUTH) — independent, can parallel W1-W3
- `registryhost.rs:397-405`: a not-alive perch with NO live broker session advertises `Status::Offline`,
  not `Dormant`. **Preserve `Dormant`** for a resting-but-addressable perch (`RestState::Dormant` on a LIVE
  perch) AND the **unbound-but-live-session** case (`is_perch_alive` is bound-gated — the `else` also catches
  live-unbound perches; must NOT mis-mark those Offline). Reuse the picker's `is_perch_unbound` signal.
- `picker/data.rs:266-270`: render a remote `Dormant` DISTINCT from `Active` — reserve `EpStatus::Online`/
  green for `Active`; show `Dormant` as a distinct idle state (a new `EpStatus` or display variant).
- Activate **impl+unit+int** (int = a dead perch reads Offline on a REMOTE node's picker; home-vs-remote agree).
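
The intended truth table can be sketched as two pure functions, one per side of the fix. This assumes a simplified input: `PerchFacts` and its fields, `advertised_status`, `remote_ep_status`, and the `EpStatus::Idle` variant are hypothetical stand-ins (the plan only says "a new `EpStatus` or display variant"), and the real code would derive these facts from `is_perch_alive`, the broker session state, and `RestState`.

```rust
/// Status advertised to remote nodes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Dormant,
    Offline,
}

/// Simplified inputs (hypothetical struct, not the registry's real shape).
#[derive(Debug, Clone, Copy)]
pub struct PerchFacts {
    /// Bound-gated liveness, as `is_perch_alive` is described in the plan.
    pub alive: bool,
    /// A live broker session exists, bound or not.
    pub live_session: bool,
    /// The perch is resting (`RestState::Dormant`).
    pub resting: bool,
}

/// Host side: only a perch with no liveness at all advertises Offline.
pub fn advertised_status(p: PerchFacts) -> Status {
    if p.alive {
        if p.resting { Status::Dormant } else { Status::Active }
    } else if p.live_session {
        // Unbound but live: must NOT be mis-marked Offline.
        Status::Dormant
    } else {
        Status::Offline
    }
}

/// Picker-side display state; `Idle` is an assumed name for the new variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EpStatus {
    Online,
    Idle,
    Offline,
}

/// Picker side: reserve Online (green) for Active only.
pub fn remote_ep_status(s: Status) -> EpStatus {
    match s {
        Status::Active => EpStatus::Online,
        Status::Dormant => EpStatus::Idle,
        Status::Offline => EpStatus::Offline,
    }
}
```

Composing the two gives the property the int stage checks: a dead perch with no session maps to `Offline` on a remote picker, never to green.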

### W5 — int activations, docs, release
- Activate int stages (W2/W4 + any deferred) at the fold; doyle gates the matrix ×3 on both runners (twohost
  rig skipped unless a wave touches the cross-host seam; W2's meet/ceremony does NOT change the QUIC subnet
  transport, only the pairing handshake order, so judge at the fold).
- Public docs (VERSION numbers, no internal codes): gh-pages `networking/overview` + a join/pairing
  troubleshooting note (families, `SPT_DISABLE_IPV6`, `--verbose`); `reference.md` via `xtask gen` if `--verbose`
  landed. No MANIFEST surface.
- Release: bump-in-PR ([[release-standard-bump-in-pr]]), doyle CHANGELOG-vet, sign FRESH, cross-check counter
  **36** monotonic from published metadata, deployah publishes.

## Sequencing / priority
- **W1 + W4 are the correctness blockers** (urgent: unblock WAN join + stop false-ONLINE). W2+W3 are the UX
  hardening that makes join robust + self-diagnosing. If a fast patch is wanted, W1 (+W4) can ship first as
  v0.17.0; W2/W3 fold into the same minor or a follow-up — doyle's call at scope time.
- W4 is independent of the join cluster (different subsystem: presence gossip vs pairing) — parallelizable.
