# Fault-injection acceptance matrix

**Why this exists.** ADR-0002 collapsed networking into the one per-machine daemon
(SERIOUS #6, Stage-A red team): a networking bug and a PTY bug now share a failure
domain. The collapse is locked — the owed mitigation is this matrix: inject a fault
into each subsystem **independently**, observe the blast radius (what else breaks)
and the recovery (what brings it back), and grow the proven rows until every
subsystem has injection evidence. Begun at M4-D4 (per `M4-D4-PLAN.md` piece 5);
rows land as their machinery lands.
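
The inject / observe / recover loop behind the brain-kill rows can be sketched in miniature. This is a toy model, not the in-tree harness: `Ring`, `Durable`, `Brain`, and `kill_brain_mid_stream` are hypothetical names, and the sketch assumes that a delivery and its cursor advance commit together.

```rust
/// Toy stand-in for a broker-held ring: it outlives any single brain.
/// (Hypothetical type; the real broker API is not shown in this document.)
#[derive(Default)]
pub struct Ring {
    records: Vec<u8>,
}

impl Ring {
    pub fn push(&mut self, byte: u8) {
        self.records.push(byte);
    }

    /// Every record at or after `cursor`, paired with its sequence number.
    pub fn read_from(&self, cursor: u64) -> Vec<(u64, u8)> {
        self.records
            .iter()
            .enumerate()
            .skip(cursor as usize)
            .map(|(i, b)| (i as u64, *b))
            .collect()
    }
}

/// State assumed to survive a brain crash: the cursor and what was delivered.
#[derive(Default)]
pub struct Durable {
    pub cursor: u64,
    pub delivered: Vec<u8>,
}

/// A brain holds no durable state; dropping it models the injected crash.
pub struct Brain;

impl Brain {
    /// Consume up to `limit` records, advancing the durable cursor per record.
    pub fn pump(&self, ring: &Ring, durable: &mut Durable, limit: usize) {
        for (seq, byte) in ring.read_from(durable.cursor).into_iter().take(limit) {
            durable.delivered.push(byte);
            durable.cursor = seq + 1;
        }
    }
}

/// One injection scenario: kill the brain mid-stream, keep the ring filling,
/// then let a successor resubscribe from the durable cursor.
pub fn kill_brain_mid_stream(input: &[u8], crash_after: usize) -> Vec<u8> {
    let mut ring = Ring::default();
    let mut durable = Durable::default();
    let split = input.len() / 2;

    for &b in &input[..split] {
        ring.push(b);
    }
    let brain = Brain;
    brain.pump(&ring, &mut durable, crash_after);
    drop(brain); // fault injected: the brain is gone

    // Survives untouched: the ring keeps accepting output with no brain alive.
    for &b in &input[split..] {
        ring.push(b);
    }

    // Recovery: the successor replays the dead window from the cursor.
    let successor = Brain;
    successor.pump(&ring, &mut durable, usize::MAX);
    durable.delivered
}
```

A row counts as proven when an assertion of this shape holds for every crash point: the delivered sequence equals the input, with no gap and no duplicate.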

**How to read it.** *Fault* = what is injected. *Blast radius* = the components that
degrade, by design. *Survives* = what must keep working untouched. *Recovery* = the
path back. *Evidence* = the test (or gap) backing the row. Rows in the **planned** table
(numbered P1, P2, ...) are accepted debt, tracked here so the matrix is the single place
to look; their *Target* column names where or when the injection test is expected to land.

## Proven rows (injection evidence in-tree)

| # | Fault injected | Blast radius (by design) | Survives untouched | Recovery | Evidence |
|---|---|---|---|---|---|
| 1 | **Brain killed mid-PTY-stream** (logic crash / routine self-update) | Logic halts until restart | Broker, PTY child (pid stable), output log | New brain re-subscribes from cursor; gapless + exactly-once | `spt-daemon/tests/` B2/B9 handoff suite; `tests/idempotent.rs` (crash before-intent / before-effect / after-effect) |
| 2 | **Brain killed mid-QUIC-stream, receive side** | Logic halts | Broker-owned endpoint, conn, stream, read ring | Resubscribe from durable cursor; ring replays the dead window | `tests/netstream.rs::receiver_brain_restart_is_gapless_and_exactly_once` |
| 3 | **Brain killed mid-QUIC-stream, send side** | Logic halts | Broker-owned conn/stream + effect journal | Re-drive whole durable op sequence; journal dedups replays (no dup on the wire) | `tests/netstream.rs::sender_brain_restart_redrive_is_exactly_once` |
| 4 | **Brain dead across peer connect/disconnect** | Presence consumption pauses | PresenceLog ring buffers liveness transitions | Resubscribe from presence cursor; missed transitions replay | `tests/presence.rs` (both tests) |
| 5 | **Peer node vanishes (graceful close)** | That conn's streams end | Local broker, other conns, brain | Closed-watcher removes conn row + emits `disconnected`; dial fresh when needed | `tests/presence.rs`; `nethost.rs` closed-watcher |
| 6 | **Registry feed: stale/replayed update over the wire** | None — update dropped | Stored registry state (newer epoch holds) | None needed (epoch lease absorbs it) | `tests/replicate.rs::registries_converge_over_the_wire_and_the_lease_holds`; `replicate.rs` unit tests |
| 7 | **Registry feed: corrupt record in the stream** | That record only — skipped | The rest of the feed (decoder does not wedge) | Lease makes loss safe; next update supersedes | `replicate.rs::decoder_survives_chunk_splits_and_corrupt_lines` |
| 8 | **Net disabled / endpoint bind fails at boot** | No WAN; net frames answer `enabled:false` / typed error | PTY hosting (daemon degrades to net-less broker) | Fix config, restart daemon | `broker.rs::dispatch_net_status` (None arm); `Daemon::run` degrade path |
| 9 | **Broker process restart (journal survives, conns don't)** | Live conns/streams lost | Effect journal (durable) | Deduped op whose resource is gone → typed "retry with a fresh op_id" error; brain re-dials | `broker.rs::dispatch_net_dial` / `dispatch_net_stream_open` restart arms |
| 10 | **WAN msg feed replayed** (sender redrive, or receiver brain restart resubscribing from a stale cursor) | None — replayed records dedup | Spool state (each op exactly one row); already-delivered ops | Durable `wan_seen` op-id claims absorb the replay; spool path claims atomically with the row | `tests/wanmsg.rs::receiver_restart_replays_feed_without_double_delivery`; `spool.rs` wan unit tests |
| 11 | **WAN msg record forges its origin** (payload `origin_node`/`node` field) | None — forgery inert (unknown field) | Access-gate subject (handshake-proven `remote_id_hex`) | None needed (records carry no origin field by design — KH 7.5) | `tests/wanmsg.rs::wan_message_lands_exactly_once_under_transport_origin`; `wanmsg.rs::forged_origin_field_is_inert` |
| 12 | **Target brain killed mid-remote-attach** (operator typing into the dead window) | Pumping pauses until handoff | Broker-held QUIC stream, session + child, output log, effect journal; the operator's viewport | Successor re-serves the same stream from seq 0 (worst case): replayed input dedups at the PTY-write journal; re-transmitted output dedups at the operator's render cursor — every byte exactly once | `tests/attach.rs::attach_survives_target_brain_restart_exactly_once` |
| 13 | **Target brain killed mid-file-transfer** (push receive: partial temp file on disk) | Transfer pauses until handoff | Broker-held QUIC stream + ring, the partial temp, the durable progress record (last observed position) | Successor re-serves the same stream from seq 0 (worst case full replay): chunks carry absolute offsets — replayed bytes rewrite in place; the commit is an atomic temp→final rename gated on temp completeness, and a replayed commit dedups against an already-committed final (never renames garbage over it) | `tests/xfer.rs::push_survives_target_brain_restart_exactly_once`; `xfer.rs::recv_state_chunks_idempotent_and_commit_replays_safely` |
| 14 | **Stream lost mid-context-bundle-sync** (responder dies / wire tears before `Done`) | That one pull errors at the requester | The requester's context store (an incomplete bundle is never fetched — length-gated, scratch file deleted); the responder's store untouched | No resume protocol by design: re-pull fresh — the apply is ancestry-idempotent (an already-joined tip short-circuits; fetched tips quarantine under `refs/spt-sync/` so a partial fetch never clobbers `refs/heads/`) | `tests/sync.rs::torn_pull_recovers_by_repulling` |
| 15 | **Partition: concurrent context writes on two nodes** | The file's auto-propagation pauses (conflict surfaced, not propagated) | BOTH versions — local file untouched on each node + the other's version as a durable tracked artifact (hazard 6.6); every other file keeps syncing | Elected reconciler (Active instance's node; lowest-id fallback) runs one bounded Psyche turn → merged write `join(vA,vB)+bump` dominates both parents → propagates as a plain accept, clears artifacts subnet-wide | `tests/sync.rs::concurrent_writes_reconcile_on_elected_node_and_converge`; `spt-store::syncmerge` unit suite |
| 16 | **Reconcile turn fails** (harness absent / timeout / garbage or empty output / 6.5-suppressed write) | That file stays in conflict-holding state | Both versions (artifacts + local file) — exactly where they were; the store, the sync loop | Retry at the next sync/activation; a fresher direct write re-merges against the new local version | `spt-live reconcile.rs::failed_turn_preserves_everything` / `suppressed_write_preserves_artifacts`; `turn.rs::empty_stdout_is_an_error` |
| 17 | **Registry ambiguity during reconciler election** (partition: two nodes both see Active / no Active anywhere) | Two textually-different merges of the same pair may both mint | Neither version is ever lost — merged writes carry joined vectors, so the two merges classify **concurrent** and re-surface as a new conflict pair (detected, not silently last-wins) | Deterministic tiebreak bounds the storm (lowest node id among Actives / non-Offline holders); the re-surfaced pair reconciles on the next pass | `spt-daemon reconcile.rs::double_active_tiebreaks_deterministically` / `fallback_lowest_non_offline`; `contextmark.rs::merge_decisions` (concurrent never auto-picks) |
| 18 | **Compromised relay node serves a tampered update artifact** (bytes corrupted at rest after the relay's own verification) | That one pull ends `Rejected(ArtifactMismatch)` at the puller | The puller's staged release + running binary (nothing staged, nothing applied — no code path from wire bytes to the cache except through `plan_verified`); the rest of the subnet pulls from honest peers | Pull from another peer; the per-node gate (REQ-UPD-2) runs at **every** hop, so one poisoned node never re-propagates | `tests/propagate.rs::tampered_relay_artifact_is_rejected_and_never_staged` |
| 19 | **Non-conforming peer offers a rollback/expired/off-channel release** | None — rejected pre-fetch, zero artifact bytes move | The puller's version (monotonic floor holds); the wire (no transfer ever starts) | None needed — the offer-then-fetch shape gates metadata before bytes (REQ-HAZARD-UPDATE-ROLLBACK) | `tests/propagate.rs::rollback_offer_is_rejected_before_any_fetch` |
| 20 | **Untrusted node queries for the staged update** | None — refused by not offering (fail-closed, the up-to-date shape: learns nothing, not even whether a release exists) | The staged release; the serve loop (one refusal, no state) | None needed; pair the node to entitle it | `tests/propagate.rs::untrusted_origin_gets_no_offer` |
| 21 | **Stream lost mid-update-artifact** (responder dies / wire tears before `Done`) | That one pull errors at the requester | The requester's staged release (staging is atomic with an artifact-then-metadata commit point, so a torn pull stages nothing); the scratch file is transient | No resume protocol by design: re-query fresh (the query is idempotent and chunks are positional) | Torn-pull shape shared with row 14 (`request_update` errors on EOF-before-`Done`); staging atomicity: `relcache.rs::torn_or_corrupt_stage_offers_nothing` |
| 22 | **Psyche emits spoofed routing** (`<EVENT type="reply" from="evil" to="victim">` on its stdout) | None — the spoof is structurally inert: the intent parser carries body only (attrs unrepresentable), the daemon re-stamps `from=<psyche_id>` and routes to the inbound message's structural sender (its `from`) alone | The spoof target (receives nothing); the wire (only daemon-authored envelopes); every other agent's identity | None needed (anti-spoof is structural + re-stamp, ADR-0012 / KH 7.3) | `psyrelay.rs::spoofed_routing_is_stripped_and_restamped`; `spt-live outbound.rs::spoofed_routing_attrs_are_unrepresentable` |
| 23 | **Psyche replies with no inbound sender to answer** (alarm-fired turn emits a reply) | That one intent drops with a typed outcome (fail-closed — never broadcast, never guessed) | Every perch (zero deliveries); the notify leg (independent) | The Psyche's next answered turn replies normally | `psyrelay.rs::reply_without_target_is_dropped` |
| 24 | **Live-Psyche driver loses stdout** (the 7.3 null-stdout shape: detached/`Stdio::null()` regression or a turn that produced nothing) | That turn fails loudly (`TurnError::EmptyOutput`); nothing relays | The relay boundary (no silent success — indistinguishable-from-bug shapes are errors); the spool | Fix the driver / retry the turn | `psyrelay.rs::null_stdout_driver_fails_the_guard`; `spt-live turn.rs::empty_stdout_is_an_error` |
| 25 | **Runaway Psyche dumps an oversize body** (> 64 KiB in one intent) | That intent drops with a typed outcome | Delivery substrate (never asked to carry it); other intents in the same turn relay normally | None needed (legitimate long bodies chunk at delivery — T3 `EVENT-PART`) | `psyrelay.rs::empty_and_oversize_bodies_drop_typed` |
| 26 | **Notif feed replayed whole** (re-presented rows after loss/restart — the push-feed's normal recovery) | None — the semilattice join no-ops (`Unchanged` per row, nothing written) | The spool (no duplicates, no state regression) | None needed: replay IS the recovery protocol (full-row re-presentation, no delta bookkeeping) | `tests/notifsync.rs::notif_spools_converge_over_the_wire_and_dismiss_replicates` (replay leg); `notifsync.rs::feed_converges_two_stores_and_dismiss_replicates` |
| 27 | **Concurrent dismiss + surface on two nodes** (the same notif dismissed on A while B marks it seen/surfaced) | None — the writes commute through the join (OR/union/max); both survive the bidirectional exchange | Both stores converge to identical rows; the dismiss latch (a stale undismissed copy can never un-dismiss) | None needed (no conflict exists to surface — every field is monotone) | `notifsync.rs::concurrent_dismiss_and_surface_commute_across_the_feed`; `spt-store notif.rs::merge_is_idempotent_and_commutative` |
| 28 | **Untrusted origin injects notif records** (unpaired peer, or a `home`-trusted peer spoofing into `work`) | None — dropped fail-closed at the apply gate, zero rows written; the gate's subject is the handshake-proven stream-table origin, never payload bytes (KH 7.5) | The spool (nothing materialized); every subnet the origin isn't trusted in (trust is per-subnet) | None needed; pair the node to entitle it | `tests/notifsync.rs` (gate-negative leg); `notifsync.rs::untrusted_and_cross_subnet_origins_are_dropped` |
| 29 | **Notif record for a non-member subnet** (a peer feeds rows for a subnet this node never joined) | None — dropped fail-closed; the subnet is never materialized locally (the REQ-INST-13 posture: unconfigured replicates nowhere) | The member subnets' spools; the feed (other records still apply) | None needed | `notifsync.rs::non_member_subnet_record_never_materializes` |
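
Rows 26 and 27 rest on the notif merge being a semilattice join (OR / union / max). A minimal sketch of that property follows; the field names (`dismissed`, `seen_by`, `surfaced_at`) and the `Applied` outcome type are assumptions for illustration, not the `spt-store` schema.

```rust
use std::collections::BTreeSet;

/// Hypothetical notif row in which every field is monotone.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct NotifRow {
    pub dismissed: bool,           // latch: joins by OR, so it can never un-dismiss
    pub seen_by: BTreeSet<String>, // joins by set union
    pub surfaced_at: u64,          // joins by max
}

/// Whether applying an incoming row wrote anything.
#[derive(Debug, PartialEq, Eq)]
pub enum Applied {
    Changed,
    Unchanged,
}

impl NotifRow {
    /// Field-wise join: commutative, associative, and idempotent.
    pub fn join(&self, other: &NotifRow) -> NotifRow {
        NotifRow {
            dismissed: self.dismissed || other.dismissed,
            seen_by: self.seen_by.union(&other.seen_by).cloned().collect(),
            surfaced_at: self.surfaced_at.max(other.surfaced_at),
        }
    }

    /// Apply an incoming row; a replayed row joins to the same value and no-ops.
    pub fn apply(&mut self, incoming: &NotifRow) -> Applied {
        let merged = self.join(incoming);
        if merged == *self {
            Applied::Unchanged
        } else {
            *self = merged;
            Applied::Changed
        }
    }
}
```

Because the join is idempotent, replaying the whole feed is safe without delta bookkeeping; because it is commutative, a dismiss on one node and a surface on another converge to the same row in either exchange order.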

## Planned rows (accepted debt — no injection evidence yet)

| # | Fault to inject | Expected blast radius | Expected survival | Target |
|---|---|---|---|---|
| P1 | **Iroh endpoint task hangs** (not crash — a wedged accept loop) | New inbound conns stall; existing conns + PTY hosting unaffected | PTY side fully isolated (separate threads/runtime) | D9 chaos pass |
| P2 | **mDNS discovery dies** | LAN dial-by-id fails; direct-addr + relay dials unaffected | Endpoint, conns, PTY | D9 (needs real LAN) |
| P3 | **Relay unreachable** (n0 outage / air-gap) | WAN dials needing relay fail; LAN/direct unaffected | Everything local | D9 two-host |
| P4 | **PTY broker drain thread dies** | That session's output stalls | Other sessions, net side | with terminal-wrapper hardening |
| P5 | **Manifest invocation hangs** (adapter subprocess) | That endpoint's lifecycle op stalls | Daemon scheduler (KH 5.3 timeouts + 7.4 non-blocking) | when daemon hosts N agent loops |
| P6 | **Registry feed flood** (malicious/buggy peer spams updates) | Bounded by ring caps + lease; CPU cost unmeasured | — | D9 + REQ-SEC-1 outer gate |
| P7 | **Closed-watcher starvation** (net runtime saturated) | Presence lags behind reality | Conn table eventually consistent | D9 chaos pass |
| P8 | **Receiver crash between a live-TCP WAN delivery and its op-id claim** | That one in-flight op re-delivers on replay (at-least-once window — chosen over claim-first, whose crash gap silently *drops*; see `wan.rs` module docs) | Spool path (claim+row atomic); all other ops | live-listener durable receipt, if the window ever bites |
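
The trade-off in P8 can be illustrated with a toy receiver that crashes in the gap between its two steps. This is a sketch under assumptions, not the `wan.rs` logic: `Receiver`, `Order`, and `crash_then_replay` are hypothetical names, and the `seen` set stands in for the durable op-id claims.

```rust
use std::collections::HashSet;

/// Toy receiver: `seen` models durable op-id claims, `delivered` the side
/// effects that reached the live listener.
#[derive(Default)]
pub struct Receiver {
    pub seen: HashSet<u64>,
    pub delivered: Vec<u64>,
}

/// The two possible orderings of "deliver" and "claim".
#[derive(Clone, Copy)]
pub enum Order {
    DeliverThenClaim,
    ClaimThenDeliver,
}

impl Receiver {
    /// Handle one op. `crash_between` models the process dying after the
    /// first step and before the second.
    pub fn handle(&mut self, op_id: u64, order: Order, crash_between: bool) {
        if self.seen.contains(&op_id) {
            return; // replay of an already-claimed op dedups
        }
        match order {
            Order::DeliverThenClaim => {
                self.delivered.push(op_id);
                if crash_between {
                    return;
                }
                self.seen.insert(op_id);
            }
            Order::ClaimThenDeliver => {
                self.seen.insert(op_id);
                if crash_between {
                    return;
                }
                self.delivered.push(op_id);
            }
        }
    }
}

/// Crash in the gap on the first attempt, then replay the same op.
pub fn crash_then_replay(order: Order) -> Vec<u64> {
    let mut rx = Receiver::default();
    rx.handle(1, order, true); // the injected crash
    rx.handle(1, order, false); // sender redrive or feed replay
    rx.delivered
}
```

With deliver-then-claim the replay delivers the op a second time (a visible duplicate); with claim-then-deliver the replay dedups against the claim and the op is never delivered at all (a silent drop), which is the failure the row says was rejected.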

**Maintenance rule:** when a planned row gains a real injection test, move it up into
the proven table with its evidence path. When new subsystems land (D5 WAN delivery,
D6 Psyche sync), add their fault rows *in the same change*: a subsystem without a row
here is a review finding.
