# v0.13.0 P1c — controller-writer reorder on brain-restart re-serve (root-fix)

**Status:** design locked (doyle), build → todlando, gate → doyle.
**Operator ruling (2026-06-20):** root-fix the flaky resume race before shipping v0.13.0. Full fix + new hazard REQ.
**P1b is INNOCENT** — proven mechanically (its diff touches only input-ack machinery) + empirically (the test passes post-P1b in isolation). This race is **pre-existing**, surfaced because the v0.13.0 release gate requires green-both-runners and this flaky cluster blocks it.

## Symptom

`crates/spt-daemon/tests/attach.rs::attach_survives_target_brain_restart_exactly_once` is flaky on Linux (~50% fail in isolation). It either:
- panics at `attach.rs:722` (`.expect("re-serve")`) with `output gap: got seq 1 want 0`, **or**
- HANGS in `render_until` (`attach.rs:729`): the serve thread died on the gap, so `MARKER_TWO` never reaches the wire and the main thread waits forever.

Same root event; FAIL-vs-HANG is pure harness timing. Sibling cluster members: `inject_control_wedge::g2`, `broker::spawn_env_reaches_child` (the commune's `g2/w5_a2` flaky cluster).

## Root cause (definitive, instrumented repro on kitsubito)

**Two `controller_writer` threads race the SAME brain connection's socket.** life2 (the handoff brain) registers as controller on the same session **twice**, over the same `Brain::conn` socket:

1. `Brain::handoff` (brain.rs:330-346) calls `subscribe(prior.session_id, prior.next_seq)`. life1 already consumed seq 0 → `prior.next_seq = 1` → broker `become_controller(from_seq=1)`, `initial=[1]`, spawns **writer-A** (writes seq 1).
2. `serve_attach` re-handles the replayed `Request{from_seq:0}` (attach.rs:186 → `attach_as(sid, 0, …)`) → broker `become_controller(from_seq=0)`, `initial=[0,1]`, spawns **writer-B** (writes seq 0, then 1).

`become_controller` (broker.rs:349-375) drops the prior `ControllerSink` (its `tx`) but **does not stop the prior writer thread** — writer-A keeps flushing its `initial` batch. Both writers hold clones of the same `SharedSend` (`Arc<Mutex<socket>>`, broker.rs:364) and `controller_writer` (broker.rs:703-737) writes its `initial` batch frame-by-frame under that mutex with **no inter-thread ordering**. When writer-A's seq 1 wins the socket before writer-B's seq 0, the consumer (strict legacy path, brain.rs:599-608) sees seq 1 while `next_seq=0` → gap.
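A minimal model of the race, not the broker code: `SharedWire`, `flush_initial`, `strict_consume` and `failing_interleaving` are hypothetical stand-ins for `SharedSend`, `controller_writer`'s initial-batch loop, and the strict legacy consumer. The sketch assumes the mutex is taken per frame, as described above, and forces the losing order by letting writer-A finish before writer-B starts.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the shared socket: records seqs in the order they hit the wire.
type SharedWire = Arc<Mutex<Vec<u64>>>;

/// Hypothetical writer: flushes its `initial` batch frame by frame, locking
/// per frame, so nothing orders it against another writer on the same wire.
fn flush_initial(wire: &SharedWire, initial: &[u64]) {
    for &seq in initial {
        wire.lock().unwrap().push(seq);
    }
}

/// Hypothetical strict consumer: any frame that is not exactly `next_seq`
/// is a gap.
fn strict_consume(wire: &[u64], mut next_seq: u64) -> Result<u64, String> {
    for &got in wire {
        if got != next_seq {
            return Err(format!("output gap: got seq {} want {}", got, next_seq));
        }
        next_seq += 1;
    }
    Ok(next_seq)
}

/// Forces the failing interleaving: writer-A (initial=[1]) completes before
/// writer-B (initial=[0,1]) is even spawned.
fn failing_interleaving() -> Vec<u64> {
    let wire: SharedWire = Arc::new(Mutex::new(Vec::new()));

    let a = {
        let w = Arc::clone(&wire);
        thread::spawn(move || flush_initial(&w, &[1]))
    };
    a.join().unwrap();

    let b = {
        let w = Arc::clone(&wire);
        thread::spawn(move || flush_initial(&w, &[0, 1]))
    };
    b.join().unwrap();

    let frames = wire.lock().unwrap().clone();
    frames
}
```

Feeding `failing_interleaving()` to `strict_consume(.., 0)` reproduces the `got seq 1 want 0` shape from the symptom. In the real test the order is not forced, which is why the failure is intermittent.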

### RACEDIAG trace (failing interleaving)
```
become_controller sid=1 from_seq=1 ring=[0,1] initial=[1]   # handoff: prior.next_seq=1
ctrlwriter WRITE-INITIAL seq=1                               # writer-A writes seq 1 FIRST
become_controller sid=1 from_seq=0 ring=[0,1] initial=[0,1] # re-serve Request(from_seq=0)
legacy-recv got_seq=1 next_seq=0 → legacy-GAP               # consumer sees 1 before 0
ctrlwriter WRITE-INITIAL seq=0                              # writer-B writes seq 0 — too late
```

### The two compounding defects
- **A (the real ordering bug):** the handoff's `from_seq=1` replay writes seq 1 to life2's connection *before* the operator's `from_seq=0` replay writes seq 0. **`prior.next_seq` is life1's consumption cursor, NOT life2's connection state — life2's socket has been sent NOTHING, so its replay floor is wrong.** You cannot serve a `from_seq=0` full replay on a connection that already streamed seq 1; the two subscribes are contradictory.
- **B (the latent broker hazard):** `become_controller` lets a superseded `controller_writer` keep writing the shared socket. Two writers on one connection can interleave non-monotonic seqs. Ring eviction is ruled out as an alternative explanation for the gap: the ring cap (`DEFAULT_LOG_CHUNKS=4096`) is far larger than the test's chunk count.

Snap-above tolerance ALONE does **not** fix it: the consumer would snap to seq 1 then **dedup-drop** the late seq 0 → seq 0's bytes lost → the exactly-once byte-identity assert fails. The reorder must be eliminated, not tolerated.
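To make the byte-loss argument concrete, here is a sketch of a dedup-below + snap-above consumer run over the reordered wire. `snap_above_consume` and the payload strings are hypothetical; the actual logic at brain.rs:582-595 may differ in detail, so treat this as an illustration of the argument only.

```rust
/// Hypothetical dedup-below + snap-above consumer.
/// Returns the payloads actually delivered, in delivery order.
fn snap_above_consume<'a>(wire: &[(u64, &'a str)], mut next_seq: u64) -> Vec<&'a str> {
    let mut delivered = Vec::new();
    for &(seq, bytes) in wire {
        if seq < next_seq {
            // Dedup-below: treated as a re-send of something already seen.
            continue;
        }
        // Snap-above: accept the frame even if it skips ahead, and move the
        // cursor past it.
        delivered.push(bytes);
        next_seq = seq + 1;
    }
    delivered
}
```

On the reordered wire `[(1, "chunk1"), (0, "chunk0"), (1, "chunk1")]` with `next_seq = 0`, the consumer snaps to seq 1, then discards the late seq 0 as a duplicate, so only `"chunk1"` is delivered. On the monotonic wire `[(0, "chunk0"), (1, "chunk1")]` both chunks arrive.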

## Required invariant (what the fix must guarantee)

> On a single brain↔broker connection, the controller output-frame stream is **monotonic non-decreasing in seq** (modulo dedup re-sends). Exactly **one** `controller_writer` is ever live per connection; a superseded writer writes **no further frames**. A re-serve never replays a seq below what has already been written to that connection.
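
The monotonicity clause can be checked mechanically over a captured wire trace. The helper below is a hypothetical sketch (`first_regression` is not an existing function in the codebase); it treats equal adjacent seqs as permitted re-sends and reports the first strict decrease.

```rust
/// Returns `(index, previous_seq, seq)` for the first frame whose seq is
/// strictly lower than the frame before it, or `None` if the wire is
/// monotonic non-decreasing.
fn first_regression(wire: &[u64]) -> Option<(usize, u64, u64)> {
    wire.windows(2)
        .enumerate()
        .find(|(_, w)| w[1] < w[0])
        .map(|(i, w)| (i + 1, w[0], w[1]))
}
```

A keystone test could assert `first_regression(&captured).is_none()` alongside the byte-identity check, assuming the harness can capture on-wire seqs.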

## Fix (three parts — implement all)

1. **PRIMARY — collapse the duplicate controller registration on handoff.**
   `Brain::handoff` must not leave a stale `from_seq=1` controller registration that `serve_attach` then re-subscribes from 0. Recommended: make `handoff` **not eagerly `subscribe`-as-controller** (the brain.rs:344 call) — let `serve_attach`'s `attach_as(sid, 0, …)` be the *sole* controller registration + replay for that connection. (handoff is test-only per the brain.rs:328 comment, so this is safe; verify no production caller depends on the eager subscribe.) Net: one `become_controller`, one writer, monotonic 0,1.

2. **BROKER HARDENING — `become_controller` must stop a superseded writer before the new one writes.** Enforce the invariant in the broker so a double-registration can never silently spawn two racing writers, even if some future caller double-subscribes. Approach (choose what's cleanest + W1-safe — **never block the drain under the `Mutex<OutputLog>`**):
   - Epoch-gate the `controller_writer` initial-batch writes (the writer checks the shared `controller_epoch` before each frame and returns once superseded — mirrors `mark_controller_gone`), **and/or**
   - Clamp a re-serve's replay floor so it never re-emits below the connection's already-delivered frames (but only when that does not drop an un-sent lower seq — see defect A: the real floor must be the connection's true contiguous-sent cursor, not `prior.next_seq`).
   The combination of (1)+(2) must make a superseded writer provably silent.

3. **DEFENSE-IN-DEPTH — seed `session_cursors` on handoff** (brain.rs:341 is currently empty) so the consumer uses the dedup-below + snap-above path (brain.rs:582-595), matching production `resume_sessions` (brain.rs:830) and cold-start. The strict reject-gap legacy path (brain.rs:599-608) is a latent trap for any at-least-once reorder; production resume already avoids it.
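
The epoch-gate option from part 2 could look roughly like the sketch below. `Conn`, `supersede` and `gated_flush` are hypothetical names, and the `AtomicU64` type for `controller_epoch` is an assumption. The one point the sketch is meant to show is that the epoch check happens while holding the socket mutex, so a superseded writer cannot slip a frame in after the new registration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical per-connection state.
struct Conn {
    /// Stand-in for the socket: seqs in on-wire order.
    wire: Mutex<Vec<u64>>,
    /// Bumped on every controller registration (assumed representation).
    controller_epoch: AtomicU64,
}

/// Registers a new controller: bumps the epoch and returns the value the
/// new writer must carry.
fn supersede(conn: &Conn) -> u64 {
    conn.controller_epoch.fetch_add(1, Ordering::SeqCst) + 1
}

/// Flushes `initial`, re-checking the epoch under the socket mutex before
/// each frame. Returns how many frames were written before being superseded.
fn gated_flush(conn: &Conn, my_epoch: u64, initial: &[u64]) -> usize {
    let mut written = 0;
    for &seq in initial {
        let mut wire = conn.wire.lock().unwrap();
        if conn.controller_epoch.load(Ordering::SeqCst) != my_epoch {
            break; // superseded: write nothing further
        }
        wire.push(seq);
        written += 1;
    }
    written
}
```

If writer-A's epoch is superseded before it writes, `gated_flush` returns 0 for A and the wire carries only writer-B's `[0, 1]`. The gate cannot retract a frame writer-A flushed before the second registration, which is why part 1 (a single registration) is still required.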

## Traceability (mint FIRST, then satisfy — rule 3)

Add to `traceable-reqs.toml`:
- **`REQ-HAZARD-CONTROLLER-WRITER-REORDER`** — "Two `controller_writer` threads must never race one brain connection's socket: `become_controller` superseding a controller leaves the prior writer able to flush its `initial` batch concurrently with the new writer on the same `SharedSend`, producing a non-monotonic on-wire seq stream (`got seq 1 want 0`) that the strict consumer rejects (gap) or that drops an un-sent lower seq under snap-above (byte loss). Invariant: one live `controller_writer` per connection; a superseded writer writes nothing further; re-serve replay is monotonic vs what the connection already received. Root-caused via instrumented repro (doyle agent, 2026-06-20); pre-existing, surfaced by the v0.13.0 green-CI gate. NOT a P1b regression." `required_stages = ["doc","impl","unit","int"]`.
- Separately, in `KNOWN-HAZARDS.md`: a new entry (next 7.x) using the failure / invariant / spt-core-mapping / source structure.

## Gate (non-vacuous — doyle)

- `attach_survives_target_brain_restart_exactly_once` runs **20× isolated, single-threaded, on Linux/kitsubito, ALL green** (it is the deterministic RED-on-revert carrier). Wrap with `timeout` so a regression surfaces as a clean fail, never a 27-min hang.
- A focused **unit/int test that deterministically forces the two-`become_controller`-on-one-connection ordering** (the keystone): pre-fix it reorders → gap/drop; post-fix the wire stays monotonic + byte-exact. Real broker + real connection, no mocks.
- Full seam sweep: every test touching `become_controller` / `controller_writer` / handoff / resume (`attach.rs`, `broker.rs`, `resume.rs`, `brain_swap.rs`, `inject_control_wedge.rs`) — the W1 controller-writer model is shared (see memory: shared-seam change → run ALL seam tests).
- clippy `--workspace --all-targets -D warnings` = 0; traceable EXIT=0; docs-drift OK.
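
The gate above relies on the external `timeout` wrapper. As a complementary in-process guard, a test body could be bounded with a helper like the one sketched here. `run_with_deadline` is hypothetical and not part of the existing harness; it turns a hang into an `Err` instead of an indefinite wait.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `f` on a helper thread and waits at most `limit` for its result.
/// Returns `Err` if `f` does not finish in time or panics before sending.
fn run_with_deadline<T, F>(limit: Duration, f: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already gave up, the send error is irrelevant.
        let _ = tx.send(f());
    });
    rx.recv_timeout(limit)
        .map_err(|e| format!("no result (limit {:?}): {}", limit, e))
}
```

A hung helper thread is leaked rather than killed, so this only converts the hang into a reportable failure; the process-level `timeout` remains the backstop.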

## After green

Resume the v0.13.0 publish pipeline: green-both-runners on PR #26 → deployah cuts v0.13.0 (MINOR) → verify GH-Pages docs → ping perri. Branch: `v0.13.0-p1b-ack-deadlock` (or a fresh `v0.13.0-p1c-…`; fold into PR #26's `v0.13.0-delivery-control` head).
