---
name: pump-ipc-deadline-fix
description: "Locked plan+ruling for the peer-pump stall fix (brain-IPC deadline); doyle-ratified, merged to main"
metadata: 
  node_type: memory
  type: project
  originSessionId: 9190dea1-a704-43e5-b9a0-12f763cfa866
---

**STATUS 2026-06-10: DONE, MERGED to main @ 4d8b810 (PR #3, squash=merge-commit; fix commit fe195e4). doyle pre-merge gate APPROVED + CI green on BOTH runners (kitsubito Linux + hfenduleam Windows `test` 6m23s incl. the new pumpdeadline socket test; twohost skipped, no tag). Rides the v0.4.1 brain-only roll. Don't re-plan. codec.rs nets to zero vs main (WIP added read_frame_deadline, the fix removed it). NEXT v0.4.1: seamless no-bounce assert + enlyzeam catch-up + restoration milestone close.**

Mechanism switch DONE: reader-thread+channel carrier (BrainConn Whole|Split; Brain::cold_start_pump splits at construction + spawns the named `pump-ipc-reader` thread; reads go via recv_timeout). The abandoned codec::read_frame_deadline + set_nonblocking path is REMOVED along with its unit test.

Both doyle amendments folded:
- A1 (reader leak: accept, not fix): thread is named + logs spawn/exit; the DEFERRED broker-B row + KH 7.6 record the parked-reader-on-Windows motivation.
- A2 (net_dial deadline covers the ensure_conn redial): confirmed.

New test: tests/pumpdeadline.rs (real-socket never-reply → net_open_stream → io::TimedOut).

GATES GREEN: build·workspace test exit 0·MESH E2E ORACLE 2-pass (the Windows proof the nonblock approach failed)·clippy -D·--no-default·traceable EXIT=0·xtask.

NOTE: writing the test surfaced A1's leak live (an EOF-driven join hung >60s; fixed by changing the test's shape, not by fixing the accepted leak).

---

**Peer-pump stall FIX — doyle-ratified ruling (binding) + implementation plan.** In progress on branch (after the `simple-fixes` batch). Diagnosed 2026-06-11: the deployed v0.4.0 pump wedged 2.2h on `brain.net_open_stream`'s unbounded `loop { read_event() }` — a peer conn black-holed, broker never replied, single-threaded pump froze, `supervise_pump` can't rescue a BLOCKED thread (only catches panic/error/return). Heartbeat (loop-top) correctly stalled. The behaviour-neutral pump-seam is NOT implicated (same hole old+new).

**doyle ruling (frozen, all binding):**
- **Q1: A now, B deferred.** A = brain-side deadline (ships via brain-only update, reaches the fleet — v0.4.1 candidate). B = broker-side bound (rare broker-update path) → `docs/DEFERRED.md` row "the broker must never make a brain wait unbounded on a QUIC op — bound net_open_stream/send/dial handlers", next broker-update batch. C (scratch-thread) impractical (Brain not Send-shareable).
- **Q2: TOTAL-WAIT deadline per net_* call, NOT per-read timeout.** read_event's loop `continue`s past unrelated events (closed-watcher conn events for OTHER peers); a per-read timeout resets on every such event, so a steady drip keeps the wait alive → wedge survives. Shape: deadline = now + PUMP_PEER_IO_TIMEOUT at call entry, each read bounded by REMAINING. Mechanism = my discretion (initial pick: interprocess v2.4.2 `Stream::set_nonblocking(bool)` → non-blocking + WouldBlock-poll-until-deadline, assumed cross-platform and testable with a fake `Read` yielding WouldBlock; later ABANDONED, see MECHANISM below).
- **AMENDMENT (recovery tier, binding):** a TimedOut brain-IPC read POISONS the client (late reply e.g. NetStreamOpened for the abandoned op can bind the WRONG stream id in a LATER call — gotcha-#1-adjacent). Two-tier: (1) ordinary peer_step Err = broker REPLIED with error → existing per-peer abort + conn drop + redial (unchanged); (2) `e.kind()==TimedOut` → bubble OUT of the round → `run_peer_pump` returns Err → `supervise_pump` restarts (PEER_PUMP_FAIL, 5s floor; fresh brain client + conn cache + WorkerLasts reset → first-tick re-prime, V4 stagger, idempotent). ZERO new recovery code — supervisor already exists for this.
- **Q3: PUMP_PEER_IO_TIMEOUT = 30s const** (> any legit round-trip, < 60s QUIC idle). Explicit conn-evict subsumed by the restart.
- **House discipline (same commit):** (1) new REQ-HAZARD-* in traceable-reqs.toml, required_stages=[doc,impl,unit], ACTIVATED in the fix commit (rules 3+5). (2) KNOWN-HAZARDS.md entry (paid-for bug, twice: 2026-06-07 + the 2.2h wedge). (3) Heartbeat UNTOUCHED (loop-top proved itself). (4) Unit evidence: (a) never-replying-transport → net_open_stream → io::TimedOut; (b) the TIER-SPLIT choreography fact — scripted TimedOut step → run_peer_pump returns Err (supervisor path), ordinary Err → per-peer abort only.
- **GATE: one pre-merge doyle review.** Ping when ready. GO given.
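
Sketch of the Q2 shape (total-wait deadline, each read bounded by REMAINING). Illustrative only: `Event`, `wait_for_reply` and the `read_one` closure are hypothetical stand-ins, not the real `read_event` / `net_open_stream` code; the point is that the deadline is fixed once at call entry, so unrelated events cannot extend the wait.

```rust
use std::io;
use std::time::{Duration, Instant};

/// Hypothetical event type: only `Reply` answers the pending call.
#[derive(Debug, PartialEq)]
pub enum Event {
    Unrelated,
    Reply(u32),
}

/// Wait for a `Reply` under ONE deadline fixed at call entry.
/// `read_one` stands in for a single bounded read and is handed the time still remaining.
pub fn wait_for_reply<F>(total: Duration, mut read_one: F) -> io::Result<u32>
where
    F: FnMut(Duration) -> io::Result<Event>,
{
    let deadline = Instant::now() + total;
    loop {
        // The remaining budget only shrinks; skipping an unrelated event never re-arms it.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(io::Error::new(io::ErrorKind::TimedOut, "total wait exceeded"));
        }
        match read_one(remaining)? {
            Event::Reply(id) => return Ok(id),
            Event::Unrelated => continue,
        }
    }
}
```

With a per-read timeout in place of `remaining`, a source that drips `Unrelated` events faster than the timeout would keep this loop alive indefinitely; with the fixed deadline it ends in `TimedOut`.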

**MECHANISM (decided 2026-06-11 after a platform wall):** interprocess v2.4.2 local sockets — Windows named pipes support NEITHER portable read timeout: `set_recv_timeout` → `no_timeouts()` ERROR (windows/named_pipe/local_socket/stream.rs:53), and `set_nonblocking` uses deprecated PIPE_NOWAIT which CORRUPTS mid-stream (proven: tests/mesh.rs "never converged: B sees A" failed on the Windows host with the non-blocking toggle). hfenduleam (the wedge target) IS Windows → the mechanism MUST work there. → **reader-thread + channel**: pump brain `Stream::split()` into (RecvHalf→reader thread doing blocking `read_frame`→`Sender<io::Result<Envelope>>`, SendHalf→main thread writes). `read_event_until(deadline)` = `frames.recv_timeout(remaining)` → TimedOut on expiry. Blocking reads work on BOTH OSes; the channel gives the deadline. On TimedOut → pump bubbles → supervised restart → fresh Brain (new conn + new reader thread); old reader+RecvHalf abandoned (old conn drops → read_frame errors → reader exits). conn becomes an enum: `Whole(Stream)` (non-pump, unchanged) | `Split{send, frames, _reader}` (pump). send()/read_event_until dispatch on it. ABANDON the non-blocking + codec::read_frame_deadline approach.
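
Minimal std-only sketch of the reader-thread + channel carrier described above. Assumptions, not the real code: `Frame` is a plain `Vec<u8>` with a made-up 1-byte length prefix (the real carrier sends `io::Result<Envelope>`), any `Read + Send` value stands in for the interprocess `RecvHalf`, and `spawn_reader` / `read_frame_until` are hypothetical names.

```rust
use std::io::{self, Read};
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Instant;

/// Stand-in for the real `Envelope` (assumption).
pub type Frame = Vec<u8>;

/// Blocking frame read; the 1-byte length prefix is illustrative only.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Frame> {
    let mut len = [0u8; 1];
    r.read_exact(&mut len)?;
    let mut body = vec![0u8; len[0] as usize];
    r.read_exact(&mut body)?;
    Ok(body)
}

/// Move the receive half into a named thread that does plain blocking reads
/// and forwards every result over a channel.
pub fn spawn_reader<R>(mut recv_half: R) -> io::Result<Receiver<io::Result<Frame>>>
where
    R: Read + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let _reader = thread::Builder::new()
        .name("pump-ipc-reader".into())
        .spawn(move || loop {
            let frame = read_frame(&mut recv_half);
            let failed = frame.is_err();
            // Exit when the consumer is gone or the connection errored (e.g. EOF).
            if tx.send(frame).is_err() || failed {
                break;
            }
        })?;
    Ok(rx)
}

/// The deadline lives on the channel, not on the socket.
pub fn read_frame_until(
    frames: &Receiver<io::Result<Frame>>,
    deadline: Instant,
) -> io::Result<Frame> {
    let remaining = deadline.saturating_duration_since(Instant::now());
    match frames.recv_timeout(remaining) {
        Ok(result) => result,
        Err(RecvTimeoutError::Timeout) => {
            Err(io::Error::new(io::ErrorKind::TimedOut, "no frame before deadline"))
        }
        Err(RecvTimeoutError::Disconnected) => {
            Err(io::Error::new(io::ErrorKind::BrokenPipe, "reader thread exited"))
        }
    }
}
```

Note the sketch shares the accepted A1 property: a reader parked in a blocking read only exits once that read returns, i.e. when the abandoned connection actually errors.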

**Impl plan:** brain.rs: `enum BrainConn { Whole(Stream), Split{ send: SendHalf, frames: Receiver<io::Result<Envelope>> } }` replacing `conn: Stream`; `set_pump_io_timeout` splits + spawns reader thread; `send`/`read_event_until` dispatch; `read_event_until(Some)` = recv_timeout(remaining). sync.rs + propagate.rs pull loops: deadline re-armed on each RELEVANT frame (DONE). pump: connect() sets pump io_timeout=30s; fan-out tier-split via `peer_outcome` (TimedOut→Err bubble, ordinary→drop conn+continue) (DONE). traceable-reqs.toml REQ-HAZARD activated (DONE). KNOWN-HAZARDS 7.6 + DEFERRED broker-B row (DONE). Tests: (a) socket-pair never-reply → net_open_stream TimedOut (brain-level, replaces the codec poll test); (b) peer_outcome tier-split (DONE). Gates: build·test (mesh E2E is the oracle)·clippy -D·--no-default·traceable EXIT=0·xtask. Then ping doyle gate.
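
Sketch of the `peer_outcome` tier-split as the plan words it (TimedOut → Err bubble, ordinary Err → drop conn + continue). The signature and the `PeerOutcome` enum are assumptions for illustration; the real function may carry more context.

```rust
use std::io;

/// Hypothetical per-peer result for the non-fatal tiers.
#[derive(Debug)]
pub enum PeerOutcome {
    /// The step succeeded; carry on with this peer.
    Continue,
    /// The broker REPLIED with an error: abort this peer, drop its conn, redial later.
    DropConn(io::Error),
}

/// Classify one peer step. A TimedOut brain-IPC read poisons the client,
/// so it is returned as Err and bubbles out of the round to the supervisor.
pub fn peer_outcome(step: io::Result<()>) -> io::Result<PeerOutcome> {
    match step {
        Ok(()) => Ok(PeerOutcome::Continue),
        Err(e) if e.kind() == io::ErrorKind::TimedOut => Err(e),
        Err(e) => Ok(PeerOutcome::DropConn(e)),
    }
}
```

A fan-out loop calling `peer_outcome(step)?` would then leave the round on the first TimedOut while handling every other error per peer, which matches the "ZERO new recovery code" intent of the amendment.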

Related: [[v040-published-fleet-rolled]] (brain-only update proved), the pump-seam (PUMP-SEAM-PLAN.md EXECUTED, merged 610c176 — the clean seam this lands on).
