# Build plan — `xtask debug-converge` (deferred follow-up)

<!-- [doc->REQ-UPD-6] -->

Status: **BUILT (M8-D4, decision 19).** The watcher shipped as specified:
status-only query on the update wire (`UpdRecord::StatusQuery`/`Status`,
served by `propagate::serve_update` under the same trust gate as a pull),
`propagate::classify_status` as the pure classifier, the poll loop + table
in `xtask debug-converge`. Deviations from the spec below: the per-node
status carries `(channel, staged_version, applied_version, last_outcome)`
from durable records (`applied.json` / `last-outcome.json` in the release
cache — apply and pump record them), and expected-node derivation is the
coordinator's trust rows (self excluded) rather than a reachability sweep —
an unreachable trusted node renders `Offline`, never a silent omission.
This document remains the design rationale.

## Goal

After a coordinator stages a debug rollout, answer one question without
hand-walking nodes: **did every expected lab node converge on the target debug
version, and if not, why not?**

`debug-converge` polls until every *expected* debug-pinned reachable node
reports the target version applied, or until the timeout fires, at which point
it prints a per-node table of last-known states. This is what turns "fast to
stage" into "fast to *verify deployed*", the actual deployment-speed win the
rollout pipeline exists for.

## Scope

- **In:** an `xtask debug-converge` subcommand (maintainer tooling only, never
  the public `spt` surface — ADR-0016), per-node state classification, a
  convergence table, exit codes, loopback + two-host tests.
- **Out (still deferred):** remote build orchestration, dashboards/TUI, adapter
  inclusion, stable-release convergence. Keep the first watcher narrow.

## CLI

```
xtask debug-converge --version <u64> [flags]
```

| Flag | Default | Meaning |
|---|---|---|
| `--version <u64>` | required | target debug version to converge on |
| `--nodes <id,…>` | derived | explicit expected-node set; overrides derivation |
| `--subnet <name>` | the rollout's lab subnet | scope derivation to one subnet |
| `--timeout <secs>` | 120 | overall watch budget |
| `--poll <secs>` | 3 | per-round poll interval |
| `--home <path>` | `$SPT_HOME` | coordinator home (testing) |

**Expected-node derivation (default):** every node the coordinator can reach in
the selected lab subnet that is pinned to the `debug` channel. Reuse the
trust-derived subnet membership introduced in `b9326f4` (self + pinned peers,
registry rows as enrichment) so a freshly-joined or eviction-decayed node is
still *expected* and shows up as a row, not a silent omission. `--nodes` gives
an explicit set for targeted tests.
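A minimal sketch of that derivation rule, following the shipped behavior noted in the Status paragraph (trust rows, self excluded). `TrustRow`, its fields, and `expected_nodes` are hypothetical names for illustration; the real trust-row shape and membership read may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical shape of one coordinator trust row (illustration only).
#[derive(Debug, Clone)]
pub struct TrustRow {
    pub node_id: String,
    pub subnet: String,
}

/// Derive the expected-node set: an explicit `--nodes` list wins outright;
/// otherwise take every trusted node in the subnet except the coordinator.
/// Reachability is deliberately not consulted, so an unreachable trusted
/// node still gets a row.
pub fn expected_nodes(
    explicit: Option<&[String]>,
    trust_rows: &[TrustRow],
    subnet: &str,
    self_id: &str,
) -> BTreeSet<String> {
    if let Some(ids) = explicit {
        return ids.iter().cloned().collect();
    }
    trust_rows
        .iter()
        .filter(|row| row.subnet == subnet && row.node_id != self_id)
        .map(|row| row.node_id.clone())
        .collect()
}
```

A `BTreeSet` keeps the rows in a stable order between poll rounds, which makes the rendered table easier to diff by eye.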

## Per-node state model

The table classifies each expected node into exactly one terminal-or-pending
state (names mirror the existing typed pull/apply outcomes so the vocabulary
stays one set):

| State | Meaning | Terminal? |
|---|---|---|
| `Applied` | node reports `current debug version == target` | ✓ success |
| `NotPinned` | node is on the `stable` channel — rejected at the metadata gate | ✓ (expected-set bug or intentional) |
| `Offline` | unreachable within the round | pending → timeout |
| `StagedAwaitingConsent` | candidate staged, default-gated, no apply ack yet | pending |
| `BlockedByBrokerResources` | broker-touching candidate refused while live endpoints held | pending (operator must quiesce) |
| `Rejected{reason}` | typed refusal: `NoArtifactForPlatform`, `WrongChannel`, `Rollback`, `ArtifactMismatch`, … | ✓ failure |

Convergence = every expected node is `Applied`. Any node still pending when the
timeout fires drops into the timeout table with its last-known state.
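The state table and the convergence rule can be sketched as a small enum. This is an illustration only: the variant names come from the table, while `NodeState`, `is_pending`, and `converged` are hypothetical and the refusal reason is held as a plain string rather than a typed outcome.

```rust
/// The per-node states from the table above. `Rejected` carries the refusal
/// reason as a string here purely for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeState {
    Applied,
    NotPinned,
    Offline,
    StagedAwaitingConsent,
    BlockedByBrokerResources,
    Rejected { reason: String },
}

impl NodeState {
    /// Pending states may still change on a later poll round.
    pub fn is_pending(&self) -> bool {
        matches!(
            self,
            NodeState::Offline
                | NodeState::StagedAwaitingConsent
                | NodeState::BlockedByBrokerResources
        )
    }
}

/// Convergence holds only when every expected node is `Applied`.
/// Note: an empty slice is vacuously converged in this sketch.
pub fn converged(states: &[NodeState]) -> bool {
    states.iter().all(|s| *s == NodeState::Applied)
}
```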

## The one real dependency: a node-status read path

The watcher needs each node's **(pinned channel, current applied debug
version, last candidate state)**. This is the only genuinely new surface; the
rest is assembly. Resolve it in this order of preference:

1. **Reuse the update pull/serve handshake** (`propagate.rs`). The serve leg
   already answers "what do you have?" during a pull. Add a *status-only*
   query (no fetch) that returns `{channel, applied_version, last_outcome}` for
   the requester's platform. Cheapest: it rides the existing update wire
   (`net/update.rs`) and the QUIC-handshake-identity origin rule (KH 7.5) gives
   the responder's identity for free.
2. **Registry/presence enrichment.** If a node already publishes its applied
   version + provenance into a registry row (the rollout metadata carries
   `git_commit`/version provenance), the coordinator can read convergence from
   rows it already syncs — no new round-trip. Verify whether
   `relcache`/registry persists "applied version" anywhere queryable; if not,
   prefer option 1.

Pick option 1 as the baseline (self-contained, testable on loopback); treat
option 2 as an optimization once a node actually persists applied-version into
a synced row. **Do not** invent a bespoke push channel — same constraint as the
rollout driver itself (ADR-0016).
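To make option 1 concrete, here is a hedged sketch of a status reply and a pure classifier over it. The field set follows the `(channel, staged_version, applied_version, last_outcome)` tuple from the Status paragraph; `StatusReply`, `LastOutcome`, `State`, and `classify` are hypothetical names, not the actual `UpdRecord` or `propagate::classify_status` definitions. The `Pending` variant is an assumption: it covers a reachable debug-pinned node with nothing staged yet, which the state table does not name.

```rust
/// Hypothetical last-outcome record a node might report.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LastOutcome {
    BlockedByBrokerResources,
    Rejected(String),
}

/// Hypothetical status reply; the real wire record may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StatusReply {
    pub channel: String,
    pub staged_version: Option<u64>,
    pub applied_version: Option<u64>,
    pub last_outcome: Option<LastOutcome>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum State {
    Applied,
    NotPinned,
    Offline,
    StagedAwaitingConsent,
    BlockedByBrokerResources,
    Rejected(String),
    Pending, // assumption: reachable, debug-pinned, nothing staged yet
}

/// Pure classifier: `None` means the node did not answer this round.
/// Simplification: it does not check whether `last_outcome` refers to the
/// target version or to an older candidate.
pub fn classify(reply: Option<&StatusReply>, target: u64) -> State {
    let Some(r) = reply else {
        return State::Offline;
    };
    if r.channel != "debug" {
        return State::NotPinned;
    }
    if r.applied_version == Some(target) {
        return State::Applied;
    }
    match &r.last_outcome {
        Some(LastOutcome::Rejected(reason)) => State::Rejected(reason.clone()),
        Some(LastOutcome::BlockedByBrokerResources) => State::BlockedByBrokerResources,
        None if r.staged_version == Some(target) => State::StagedAwaitingConsent,
        None => State::Pending,
    }
}
```

Keeping the classifier free of I/O means the same function can be driven by canned replies in unit tests and by live replies in the watcher.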

## Output

```
DEBUG_CONVERGE target=7 channel=debug subnet=lab expected=3
  node-a   Applied                v7
  node-b   StagedAwaitingConsent  v6  (run: spt update apply)
  node-c   Rejected               NoArtifactForPlatform (x86_64-apple-darwin)
CONVERGED 1/3  (timeout 120s)
```

- All expected `Applied` → print table, exit `0`.
- Timeout with any pending/failed → print table + the quiesce/apply hints,
  exit non-zero. The exit code must be machine-distinguishable from a usage
  error (reserve `2` for usage; use `1` for non-convergence).
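The rendering and exit-code policy might look like the following sketch. The exit-code values come from the policy above; `Row`, `render`, the constant names, and the column widths are assumptions chosen to approximate the sample output.

```rust
/// Exit codes from the policy above: 0 converged, 1 non-convergence, 2 usage.
pub const EXIT_CONVERGED: i32 = 0;
pub const EXIT_NOT_CONVERGED: i32 = 1;
pub const EXIT_USAGE: i32 = 2;

/// One table row: node id, state label, and a free-form detail column
/// (version, refusal reason, or operator hint).
pub struct Row<'a> {
    pub node: &'a str,
    pub state: &'a str,
    pub detail: &'a str,
}

/// Render the table and pick the exit code in one place so the two cannot
/// disagree about what counts as converged.
pub fn render(target: u64, subnet: &str, timeout_secs: u64, rows: &[Row<'_>]) -> (String, i32) {
    let applied = rows.iter().filter(|r| r.state == "Applied").count();
    let mut out = format!(
        "DEBUG_CONVERGE target={target} channel=debug subnet={subnet} expected={}\n",
        rows.len()
    );
    for r in rows {
        out.push_str(&format!("  {:<8} {:<22} {}\n", r.node, r.state, r.detail));
    }
    out.push_str(&format!(
        "CONVERGED {applied}/{}  (timeout {timeout_secs}s)\n",
        rows.len()
    ));
    let code = if applied == rows.len() {
        EXIT_CONVERGED
    } else {
        EXIT_NOT_CONVERGED
    };
    (out, code)
}
```

`EXIT_USAGE` is not returned by `render`; it is listed so that argument-parsing failures have a code a calling script can tell apart from non-convergence.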

## Implementation steps

1. **Status query (wire).** Add the status-only request/response to
   `net/update.rs` + the daemon serve path in `propagate.rs`; origin identity
   from the stream table (KH 7.5), never payload bytes. Unit-test the
   request/response codec.
2. **Node enumeration.** Factor the trust-derived subnet membership read out of
   the `subnet status` path (b9326f4) into something `debug-converge` can call
   for the expected set. Honor `--nodes` / `--subnet`.
3. **Poll loop + classifier.** In `xtask`: round-robin the expected set, map
   each response to the state model, and stop early once every node is
   `Applied`; otherwise keep polling until `--timeout` expires. The classifier
   is a pure function, so it is unit-testable with canned responses.
4. **Table + exit codes.** Render + exit-code policy above.
5. **Tests.**
   - *unit:* state classifier (each typed outcome → row); table formatting;
     exit-code policy. `[unit->REQ-UPD-6]`
   - *int (loopback + two-host):* coordinator stages v_N; drive a second node
     through Offline → StagedAwaitingConsent → Applied and assert the table
     converges and exits 0; assert a stable-pinned node shows `NotPinned` and a
     missing-platform node shows `Rejected{NoArtifactForPlatform}`. This is the
     **int** stage REQ-UPD-6 currently lacks. `[int->REQ-UPD-6]`
6. **Runbook.** Replace the manual "Apply and observe" steps in
   `DEBUG-ROLLOUT.md` with `xtask debug-converge`, keeping the manual path as
   the fallback.
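The poll loop from step 3 can be sketched with the network call injected as a closure, which keeps the loop testable without a live node. `watch` and `is_applied` are hypothetical names; in the real watcher the closure's role would be played by a status query plus classification.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll every expected node each round until all report applied or the
/// overall budget runs out. `is_applied` stands in for "send a status query
/// and classify the reply". Returns the nodes that were still not applied
/// when the loop stopped; an empty result means convergence.
pub fn watch<F>(
    nodes: &[String],
    timeout: Duration,
    poll: Duration,
    mut is_applied: F,
) -> Vec<String>
where
    F: FnMut(&str) -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // One round: ask every node, keep the ones that have not converged.
        let remaining: Vec<String> = nodes
            .iter()
            .filter(|n| !is_applied(n.as_str()))
            .cloned()
            .collect();
        // Stop early on convergence, or give up once the budget is spent.
        if remaining.is_empty() || Instant::now() >= deadline {
            return remaining;
        }
        sleep(poll);
    }
}
```

With the defaults from the CLI table this would be called with a 120 second timeout and a 3 second poll interval. The sketch always runs at least one round, so a zero timeout still produces a table.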

## Traceability

Building this activated the **`int`** stage for `REQ-UPD-6`
(`required_stages = ["doc","impl","unit","int"]` since M8-D4). The `int`
evidence is the loopback convergence test
(`tests/propagate.rs::status_query_drives_the_convergence_table_end_to_end`):
a lab node walked Pending → StagedAwaitingConsent → Applied through the real
status wire, plus the NotPinned / Rejected / untrusted-all-None shapes. The
two-host run rides M8 acceptance criterion 6.

## Estimate

Small-to-medium. The status query (step 1) is the only new protocol surface and
the only real risk; steps 2–4 are assembly over existing primitives; step 5 is
the bulk of the diff. One focused session if the loopback status query lands
cleanly.