# Mesh-D8 — concurrent liveness probes (`subnet status --nodes`)

> JIT plan for `SUBNET-MESH-PLAN.md` §Build phases · phase 8. Activates **REQ-MESH-6** (impl, unit). Independent polish after D7 (no crypto, no wire change): the mesh widen (D6) makes a node see ALL members of a subnet, many offline — so the serial per-node probe loop now costs `offline_count × ceiling`. D8 fans the probes out so the wall-time is bounded by **one** single-probe ceiling, not `k × ceiling`.

## Pre-flight — verify the REQ-SUBNET-5 loop (done; it IS serial)

Confirmed by reading, not assumed:

- `cli.rs:2856–2866` (`cmd_subnet_status`, the `--nodes` arm): a **sequential** `rows.into_iter().map(|r| … probe_node(&r.node, &sub.name) …)`. Each `Probe`-state row blocks the next.
- `probe_node` (`cli.rs:2675`) wraps one probe in `run_bounded(PROBE_TIMEOUT=2500ms, false, …)` → `wansend::probe_node_serving` (`wansend.rs:420`): cold-starts a brain IPC connection and `net_dial`s the peer, then asks "serving `subnet`?". A dead id-only peer would otherwise hang the full iroh discovery timeout (~30s), so the 2.5s bound is load-bearing.
- `run_bounded` (`cli.rs:2651`) = spawn-thread + `recv_timeout`. So **one probe already runs on its own thread**; the serialization is purely the caller's `.map`, which joins each before starting the next.

So with `k` offline members the view waits up to `k × 2.5s`, and the mesh makes `k` large: every roster member that decays to `Probe` is probed, whether or not it is actually online. REQ-MESH-6 exists to fix exactly this regression.
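
For reference, the spawn-thread + `recv_timeout` pattern attributed to `run_bounded` above can be sketched as follows. This is a minimal illustration, not the code in `cli.rs`: the `(timeout, default, closure)` parameter order is inferred from the `run_bounded(PROBE_TIMEOUT, false, …)` call shape, and the generic bounds are assumptions.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `f` on its own thread and wait at most `timeout` for its result.
/// On timeout (or if the worker panics) return `default`. The worker thread
/// is detached, so a wedged call keeps running but no longer blocks the caller.
/// Hypothetical signature: the real `run_bounded` may differ.
fn run_bounded<T, F>(timeout: Duration, default: T, f: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that.
        let _ = tx.send(f());
    });
    // Err covers both "timed out" and "sender dropped without sending".
    rx.recv_timeout(timeout).unwrap_or(default)
}
```

Because the wait happens in the caller, mapping this over `k` rows one at a time waits for each row before the next thread is even spawned, which is the serialization described above.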

## The fix — fan out the probe phase

One pure-core helper that runs a batch of probes concurrently under a single ceiling, with the real probe injected (so the unit test injects a deterministic fake). Shape:

```rust
/// Run every (node, subnet) probe concurrently; collect results in input order.
/// Wall-time ≈ ONE `ceiling`, not k×ceiling. `probe` is injected (real:
/// probe_node_serving; test: a deterministic sleeper). Bounded concurrency so a
/// huge mesh can't spawn unbounded threads/broker conns.
fn probe_all<F>(items: &[(String, String)], ceiling: Duration, max_inflight: usize, probe: F) -> Vec<bool>
where F: Fn(&str, &str) -> bool + Send + Sync + Clone + 'static
```

- **Concurrency:** spawn one `std::thread` per probe (each does its own `recv_timeout`-bounded run, mirroring `run_bounded`), but cap in-flight at `max_inflight` (a small fixed pool, e.g. 16) so a 200-member subnet doesn't open 200 simultaneous broker IPC conns + dials. With a cap, worst-case wall ≈ `ceil(k / max_inflight) × ceiling` (every probe timing out); for a normal fleet (`k ≤ max_inflight`) that's one ceiling. **Log if `k > max_inflight`** (no silent cap: the wait is then a small multiple of the ceiling, and the user should know why).
- **Ordering:** carry the input index through, so results map back to rows deterministically (the render order is stable).
- **Each thread is hard-bounded** by `ceiling` (its own `recv_timeout`), so one wedged dial costs one ceiling, never the batch — KH 5.3 (every subprocess/dial bounded) preserved.
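
One way the three properties above could fit together is sketched below, assuming a fixed pool of `max_inflight` scoped workers that pull indices from a shared counter. The body is illustrative only: the over-cap `eprintln!` stands in for whatever logging the CLI actually uses, and a timed-out probe thread is detached rather than cancelled, so the cap limits concurrent waits, not lingering wedged dials.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Mutex};
use std::thread;
use std::time::Duration;

fn probe_all<F>(items: &[(String, String)], ceiling: Duration, max_inflight: usize, probe: F) -> Vec<bool>
where
    F: Fn(&str, &str) -> bool + Send + Sync + Clone + 'static,
{
    // Never spawn more workers than there are items; treat a cap of 0 as 1.
    let workers = max_inflight.max(1).min(items.len());
    if items.len() > workers {
        // Stand-in for the real log call (assumption).
        eprintln!("probing {} nodes, {} at a time", items.len(), workers);
    }

    let next = AtomicUsize::new(0); // next unclaimed input index
    let results = Mutex::new(vec![false; items.len()]); // slot i answers items[i]

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some((node, subnet)) = items.get(i) else { break };

                // Hard-bound this one probe: run it on a detached thread and
                // wait at most `ceiling`. This is why `F` is `Clone + 'static`.
                let (tx, rx) = mpsc::channel();
                let (p, n, sn) = (probe.clone(), node.clone(), subnet.clone());
                thread::spawn(move || {
                    let _ = tx.send(p(&n, &sn));
                });
                // Timeout or a panicked probe both settle as `false`.
                let ok = rx.recv_timeout(ceiling).unwrap_or(false);
                results.lock().unwrap()[i] = ok;
            });
        }
    }); // scope joins every worker here

    results.into_inner().unwrap()
}
```

Writing into slot `i` rather than pushing results as they arrive is what keeps the output in input order regardless of which probe finishes first.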

### Caller change (`cmd_subnet_status` `--nodes` arm)

Three steps instead of the inline `.map`:
1. Build `rows` per subnet as today (`node_status_rows`); collect the `Probe`-state ones into a flat `Vec<(node, subnet)>` (the work-list), remembering each row's place. `Online`/`Offline` rows need no probe — settle them directly.
2. One `probe_all(&worklist, PROBE_TIMEOUT, MAX_INFLIGHT, |n, s| probe_node_serving(n, s))` call across **all** subnets at once (not per-subnet) — the fan-out spans the whole view, so a multi-subnet status pays one ceiling total, not one per subnet.
3. Stitch results back into `(NodeRow, bool)` settled rows; render unchanged (`render_node_rows`).
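
The collect / probe / stitch flow above might look roughly like this. `Liveness`, the `NodeRow` fields, and `settle_rows` are hypothetical stand-ins (the real types live in `cli.rs`), and the sketch assumes `Online` settles to `true` and `Offline` to `false`. The batch call is injected as a closure so the stitching can be exercised without any network.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Liveness {
    Online,
    Offline,
    Probe,
}

#[derive(Debug, Clone, PartialEq)]
struct NodeRow {
    node: String,
    state: Liveness,
}

/// Settle every row of every subnet with a single batch probe call.
/// `probe_batch` stands in for `|w| probe_all(w, PROBE_TIMEOUT, MAX_INFLIGHT, ...)`.
fn settle_rows<B>(
    per_subnet: Vec<(String, Vec<NodeRow>)>,
    probe_batch: B,
) -> Vec<(String, Vec<(NodeRow, bool)>)>
where
    B: FnOnce(&[(String, String)]) -> Vec<bool>,
{
    // Step 1: flat (node, subnet) work-list across ALL subnets, in row order.
    let worklist: Vec<(String, String)> = per_subnet
        .iter()
        .flat_map(|(subnet, rows)| {
            rows.iter()
                .filter(|r| r.state == Liveness::Probe)
                .map(move |r| (r.node.clone(), subnet.clone()))
        })
        .collect();

    // Step 2: one batch for the whole view (skipped when nothing needs probing;
    // the "Checking remote nodes…" announce would fire just before this call).
    let results = if worklist.is_empty() {
        Vec::new()
    } else {
        probe_batch(&worklist)
    };
    debug_assert_eq!(results.len(), worklist.len());

    // Step 3: walk the rows in the same order, so the n-th Probe row takes
    // the n-th result and the render order is unchanged.
    let mut answers = results.into_iter();
    per_subnet
        .into_iter()
        .map(|(subnet, rows)| {
            let settled: Vec<(NodeRow, bool)> = rows
                .into_iter()
                .map(|r| {
                    let up = match r.state {
                        Liveness::Online => true,   // assumption
                        Liveness::Offline => false, // assumption
                        Liveness::Probe => answers.next().unwrap_or(false),
                    };
                    (r, up)
                })
                .collect();
            (subnet, settled)
        })
        .collect()
}
```

Here the row's place is remembered implicitly by iteration order; carrying explicit `(subnet_idx, row_idx)` pairs alongside the work-list would be an equally valid choice if the rows are ever reordered between steps.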

The "Checking remote nodes…" announce stays (fire once before the batch when any row is `Probe`).

Keep `probe_node` for any remaining single-probe site (and the existing test); `probe_all` becomes the batch path the `--nodes` view uses.

## Tests

- **unit (REQ-MESH-6):** the timing fact via a **deterministic injected probe** (no real network — the mock-clock seam the reg note names). Inject `probe = |_, _| { sleep(D); true }` for `N` items; assert the batch completes in **well under `N × D`** (≈ one `D` plus scheduling slop) — e.g. `N=8, D=200ms`: assert elapsed `< 4 × D` (loose enough for CI jitter, tight enough to prove it isn't serial `8 × D = 1600ms`). Assert results come back in input order and all `true`.
  - Plus a **cap** unit: `N=20, max_inflight=4, D=100ms` → elapsed in `[~5×D, ~8×D]` (proves the cap batches rather than either serializing fully or ignoring the bound), and the over-cap path is logged.
  - Plus a **timeout** unit: a probe that sleeps `> ceiling` settles `false` (its `recv_timeout` default) without dragging the batch past one ceiling.
- **Regression:** the existing `--nodes` rendering + `probe_node`/`run_bounded` tests stay green; `node_status_rows` (the pure liveness classifier) is untouched; a status view with mixed Online/Offline/Probe rows renders the same settled output as the serial path would (order-stable).
- `traceable-reqs check` clean with **REQ-MESH-6 `[impl, unit]`**; clippy + `--no-default-features` + workspace green; docs-drift gate (run `xtask gen` only if `status --help` text changed — D8 changes no flags, so likely a no-op, but check).

## Done when

- `spt subnet status --nodes` probes every silent member **concurrently**: a view with `k` offline members settles in ≈ one `PROBE_TIMEOUT` when `k ≤ max_inflight` (at most `ceil(k / max_inflight) × PROBE_TIMEOUT` otherwise), not `k × PROBE_TIMEOUT`; verified by the injected-probe timing unit.
- Each probe stays individually hard-bounded (one wedged dial = one ceiling, never the batch); the cap is logged when exceeded (no silent truncation of the wait).
- Render output is identical to the serial path (order-stable, same Online/Offline/Probe→bool settling); existing tests green.
- `traceable-reqs check` clean with **REQ-MESH-6 `[impl, unit]`** activated; clippy / `--no-default-features` / docs-drift green.
- One commit: `perf(mesh): concurrent subnet status --nodes liveness probes (REQ-MESH-6) — Mesh-D8`.

## Explicitly NOT D8 (carry forward)

- Cross-node un-tombstone convergence beyond local clear-on-re-pair (D3 boundary; revisit only on a field hit).
- Multi-user `(subnet, user)` revocation scoping (the seam exists; the cross-user milestone owns it).
- Force-drop of a revokee's lingering live conn (D7 deferred; tombstone+rotation already lock out — revisit only if a field hit demands prompt teardown).
- Any async-runtime adoption for the probe fan-out — threads + the existing `run_bounded` discipline are sufficient; do not pull tokio into the CLI for this.
