---
name: v081-livehost-bootrace
description: "v0.8.1 task — fix REQ-HAZARD-LIVEHOST-BOOT-RACE: net-less node, live agent goes online but daemon never hosts Psyche (no LIVEHOST_PSYCHE). doyle delegation, I own waves+execution. Verified root-cause-so-far + wave plan + the non-negotiable real-daemon E2E gate."
metadata: 
  node_type: memory
  type: project
  originSessionId: 8350d43e-fe2d-4e24-9999-eecbcc6d3a50
---

**v0.8.1 livehost boot-race** — doyle delegation (operator-directed), I own W/T planning + execution, doyle gates + drives release (counter 17). Branch `v0.8.1-livehost-bootrace` off main (v0.8.0), base @2aab895 mints `REQ-HAZARD-LIVEHOST-BOOT-RACE` (required_stages=[]; I activate [impl,unit,int] per wave, same-commit tags). PR → ping doyle at PR-ready (self-drive until then). See [[v080-merge-reconciliation]] (the v0.8.0 livehost reapply this builds on).

**DEFECT (perri v0.8.0 dogfood, real bug):** harness-hosted live agent (`spt api seed` + `spt api listen <id>`) goes status=online but NEVER gets a daemon-hosted Psyche on a net-less/unpaired/peer-pump-STALLED node. Host-predicate fully passes (state=live_agent, status=online, adapter resolvable, psyche_init present) yet `reconcile_once` never emits `LIVEHOST_PSYCHE`. "peer pump: STALLED" reappears after every restart.

**VERIFIED from source (brainproc.rs:159-255):** run_brain :161 `connect_retry(&broker_socket)?` connects to the LOCAL BROKER (not net); broker binds its IPC listener INDEPENDENT of net (broker.rs:575 bind_in_with_net). So on a net-less node the broker connect still succeeds → :230 `spawn_live_host` IS reached (unconditional, own thread, before the heartbeat loop). The heartbeat loop (:233) net-gates ONLY the consumers (dispatcher/peer pump) via consumer_gate, NOT livehost. ⇒ **H1 (brain starved before :230 by net) REFUTED by source**: the live reconcile thread runs net-less. (Caveat NOT yet verified: whether the BROKER itself spawns the brain child net-less, or whether the broker's own boot stalls before binding; H3/broker-side, open.)
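
A minimal sketch of the boot order read from source above, as a toy model rather than the real `run_brain`. `Component` and `started_components` are hypothetical names, and the blocking `connect_retry` is collapsed into a bool as a simplifying assumption. It only illustrates that the net gate covers the consumers and not the live host:

```rust
/// Things the brain process may start. Names mirror the note, not real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Component {
    LiveHost,
    Dispatcher,
    PeerPump,
}

/// Simplified model of the boot order (assumption: the real `connect_retry`
/// blocks and retries; here it is reduced to `broker_connected`).
pub fn started_components(broker_connected: bool, net_up: bool) -> Vec<Component> {
    let mut started = Vec::new();
    if !broker_connected {
        // Without the local broker socket nothing downstream is reached.
        return started;
    }
    // Live host is started unconditionally, before any net check.
    started.push(Component::LiveHost);
    // Only the consumers sit behind the net gate.
    if net_up {
        started.push(Component::Dispatcher);
        started.push(Component::PeerPump);
    }
    started
}
```

Under this model a net-less node still starts the live host, which is the claim that refutes H1.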

**Explore report (afe7959) — TREAT WITH CAUTION, its mechanism is shaky:** concluded "registry load-order race → install_dir=None → spawn_psyche fails." INTERNALLY INCONSISTENT: `resolve_option_in(adapter)` and the install_dir `find` use the SAME `registered` slice + same parent — can't succeed-then-miss. And a first-tick registry miss self-heals next tick (5s) ≠ the PERSISTENT defect. So the specific theory is likely WRONG. doyle warned "do NOT presume (perri refuted doyle's first guess)."

**TRUE failure = runtime-observable only.** Real candidates: (a) `spawn_psyche` persistently fails → `LIVEHOST_PSYCHE_FAIL` every tick — most likely REQ-INSTALL-11 psyche-binary-not-found (perri's blocked proof, copy-mode binary absent from install_dir+PATH); (b) H3 broker doesn't spawn brain child net-less; (c) reconcile predicate fails on a field. doyle's brief: "check LIVEHOST_PSYCHE_FAIL."

**WAVE PLAN:**
- **W1 — reproduce via the non-negotiable gate artifact:** a REAL detached-daemon E2E (spawn the real brain CHILD via `spt daemon`, NOT in-process; existing live_firsthost_e2e uses in-process daemon + hand-writes perch, so the production broker→brain-child→run_brain→spawn_live_host→reconcile path was NEVER E2E-tested). `api seed` + `api listen` a live agent on a net-less/unpaired node → assert `LIVEHOST_PSYCHE`. Watch it FAIL → read the ACTUAL failure (LIVEHOST_PSYCHE_FAIL text / behavior). Ground the fix in THIS, not the report. (Look at crates/spt-daemon/tests/daemon_lifecycle_real_brain.rs for real-brain-spawn infra.)
- **W2 — fix** per observed failure: live-host reconcile runs net-INDEPENDENTLY + survives the boot-race, self-healing like REQ-DAEMON-9 (dispatcher/peer-pump boot-race self-heal in broker.rs attach_net). Likely also fix REQ-INSTALL-11 psyche binary resolution if that's the observed failure.
- **W3 — green + activate + gate.**

**WHY IT MATTERS (doyle):** unblocks perri's last 2 parity proofs — REQ-INSTALL-11 install-dir binary resolution (psyche must actually SPAWN to exercise it) + F-006 interim PATH-copy.

**GATE (pre-flight before pinging doyle):** `clippy --workspace --all-targets -D warnings` (EXACT CI invocation, NOT -p) · `traceable-reqs check` EXIT=0 · docs-drift `xtask gen` if any CLI touched · the real-daemon E2E GREEN. Then doyle code-reads + verifies the E2E spawns the brain CHILD + gates → deployah ships v0.8.1 (counter 17) → doyle pings perri.

**W1 DONE — real-daemon E2E built + bug DID NOT reproduce as a net-race (RUNTIME-PROVEN reframe):** `crates/spt/tests/livehost_bootrace_e2e.rs` (`netless_online_live_agent_is_hosted_by_the_real_daemon_brain`) launches the REAL `spt daemon run` broker → real `spt daemon brain` CHILD, forces net-less (corrupt node.key → NODE_KEY_FAIL no-retry), real `api seed`+`listen` establish→online, REAL psyche binary (`psychebin`=copy of spt) by BARE token ONLY in install source_dir (exercises REQ-INSTALL-11), bounded 25s psyche-perch poll. **PASSES net-less** — `LIVEHOST_PSYCHE` fires with net fully starved. Negative control (binary removed) → `LIVEHOST_PSYCHE_FAIL: ... program not found` per 5s tick. So:
- **H1 net-gating REFUTED by RUNTIME** (not just source): spawn_live_host @brainproc:230 unconditional, before the heartbeat net-gate (consumer_gate gates ONLY dispatcher/peer-pump). Net-independence ALREADY HOLDS — the E2E locks it in; NO behavior fix needed there.
- **install-dir resolution CORRECT + CONSISTENT:** github/release registration sets `source_dir = adapters/_github/<safe>/` (cli.rs:5470,5487; runtime.rs:421-424), copy-mode source_dir = manifest dir; livehost (rec.source_dir) AND api seam (resolve_ctx_manifest, mod.rs:501) use the SAME source_dir. The W1 "divergence" candidate REFUTED. resolve_program_in_dir (runtime.rs:434) works (E2E proves).
- **REAL spt-core gap = SILENT-FAILURE MASKING (the REQ title names it):** spawn_psyche Err → `LIVEHOST_PSYCHE_FAIL` is eprintln-ONLY on the brain's INVISIBLE stderr (brainproc.rs:927-948 brain inherits broker stderr → foreground `spt daemon run`, unseen by a harness watching only `api listen` stdout). Agent stays state=live_agent+alive=true+status=online (cmd_listen stamps unconditionally, startup.rs:284) with NO Psyche, NO harness-visible error. = perri's exact diagnostic.
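
The bounded perch poll used by the W1 E2E could look roughly like the helper below. This is a sketch and not the code in `livehost_bootrace_e2e.rs`: `poll_until` is a hypothetical name, and the real test's readiness check (psyche perch present on disk) is abstracted into a closure.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `ready` every `interval` until it returns true or `timeout` elapses.
/// Returns true if the condition was observed, false on timeout.
pub fn poll_until<F: FnMut() -> bool>(timeout: Duration, interval: Duration, mut ready: F) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

With a 25s timeout and a 5s reconcile tick, a positive run sees the perch within a few ticks, while the negative control (binary removed) runs to the timeout and the test then inspects the failure signal.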

**PROPOSED v0.8.1 SCOPE (pending doyle confirm — premise shifted):** (1) KEEP the real-daemon E2E as the non-negotiable gate artifact; re-tag its evidence to `[int->REQ-HAZARD-LIVEHOST-BOOT-RACE]` (W1 tagged it REQ-DAEMON-1). (2) SURFACE the silent host-failure — write a disk-observable signal on the live perch when spawn_psyche fails (so "online but un-hosted" is diagnosable by the harness), the real behavioral fix for the masking. (3) net-independence = E2E regression-lock only (already holds). perri's actual binary: present-but-unresolved (would need an spt-core resolution fix — but E2E says resolution works) vs absent (adapter packaging, not spt-core, but now surfaced) — doyle has perri's env to disambiguate.

**doyle ENDORSED the reframe + proposed scope items (1)-(3) with 4 refinements (RULING):** (1) signal HARNESS-REACHABLE = info.json field on the live perch (perri reads perch state, never brain stderr), surfaced by `spt endpoint list`/status; (2) do NOT de-stamp status=online (agent reachable; only Psyche missing; brain-restart rehydrate has legit online-without-Psyche windows): keep a SEPARATE psyche-host-health field, online stays authoritative; (3) CURRENT-STATE stamp (overwrite reason+ts+attempts, CLEARED on successful host), NOT append (5s reconcile would spam); (4) REQ primary deliverable shifts to "surface the silent psyche-host failure on the perch", net-independence = locked-in invariant (E2E). int = E2E positive net-less host + negative missing-binary→signal-on-perch. BOUNDARY: spt-core SURFACES; perri owns her packaging.
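
The current-state stamp semantics in the ruling can be sketched as follows. This illustrates the rule (overwrite on failure, clear on success, never touch status) and is not the shipped `host_one`. Only the `psyche_host_error` field name comes from this note; the struct layouts, `record_host_outcome`, and the `ts_unix` representation are assumptions.

```rust
/// Hypothetical shape of the psyche-host-health stamp.
#[derive(Debug, Clone, PartialEq)]
pub struct PsycheHostError {
    pub reason: String,
    pub ts_unix: u64,
    pub attempts: u32,
}

/// Hypothetical slice of the perch info: status plus the separate health field.
#[derive(Debug, Clone, PartialEq)]
pub struct PerchInfo {
    pub status: String,
    pub psyche_host_error: Option<PsycheHostError>,
}

/// Record one reconcile outcome as CURRENT STATE (not an append-only log).
pub fn record_host_outcome(info: &mut PerchInfo, outcome: Result<(), String>, now_unix: u64) {
    match outcome {
        // Successful host: clear the stamp entirely.
        Ok(()) => info.psyche_host_error = None,
        // Failure: overwrite reason and timestamp, bump the attempt counter.
        Err(reason) => {
            let attempts = info
                .psyche_host_error
                .as_ref()
                .map_or(0, |e| e.attempts)
                .saturating_add(1);
            info.psyche_host_error = Some(PsycheHostError { reason, ts_unix: now_unix, attempts });
        }
    }
    // `info.status` is deliberately never modified here: online stays authoritative.
}
```

Because each tick overwrites a single record, a failure repeating every 5s costs one small field instead of an ever-growing list, and a harness reading the perch sees either the latest reason or nothing.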

**REQ TITLE: doyle APPROVED my wording VERBATIM — I apply it MYSELF** in the evidence commit (his separate commit would race my branch; title is a doc-field, no traceable race). The approved title is the long wording I proposed (surface host-failure + net-independent invariant + current-state field + status authoritative + real-daemon E2E + adapter owns packaging). **FINALIZE CHECKLIST (on me after the build agent abe4a6b2/todlando-w161 reports):** (a) apply the approved REQ title to traceable-reqs.toml (build agent only set required_stages, didn't have the verbatim title); (b) review the agent's commit/diff; (c) run FULL gate — clippy --workspace --all-targets -D warnings, traceable EXIT0, E2E pos+neg (`cargo test -p spt --test livehost_bootrace_e2e`), docs-drift (xtask check/gen); (d) push branch + open PR; (e) ping doyle PR-ready. doyle gates → deployah ships v0.8.1 (counter 17) → doyle pings perri. (Part 9 dropped — owl 8-part cap, substance complete per doyle.)

**PR-READY @28d6e70 → PR #17 (base main), pinged doyle.** Build by subagent (8e3b14c), I applied the approved-verbatim title (amended into the evidence commit → 28d6e70), reviewed host_one (clear-on-Ok/stamp-on-Err/status-untouched ✓) + the negative E2E (real `spt daemon run`→brain CHILD ✓ doyle's non-negotiable), re-ran the FULL gate on my tree: clippy --workspace -D warnings clean · traceable EXIT0 ([impl,unit,int]) · both real-daemon E2Es pass (positive netless-host + negative missing-binary→perch-stamp) · xtask check OK. Files: info.rs (psyche_host_error field+helper+unit), livehost.rs (host_one), cli.rs (SELF-pin render), 2 E2Es, traceable-reqs.toml (activate+title), CONTEXT.md. **PENDING — fold into my NEXT natural change (doyle directive, NOT a standalone PR):** add a `docs/FLAKE-LEDGER.md` entry — *kitsubito "Build notify-shell" step: crates.io curl/dep-download blip (`download of config.json failed, curl failed` fetching serde_json) → `gh run rerun --failed` clears (seen run 27652755792 attempt 1, v0.8.1 PR #17).* A one-line doc entry isn't worth a both-runner CI cycle mid-release, so it RIDES the next thing I touch. Hold ready.

**v0.8.1 GATED + SHIPPING.** doyle GATE-PASS on PR #17 (independently re-verified: code-read, clippy -D warnings=0, traceable EXIT0, both real-daemon E2Es broker→brain-CHILD-via-brain.ready-pid, all units, CI green) → handed deployah the release GO (counter 17, vetted changelog). doyle pings perri post-publish. I STOOD DOWN.

**CI run 27652755792 (PR #17): attempt 1 RED, attempt 2 GREEN via doyle's `gh run rerun --failed`.** CORRECTION of my own error: I mis-diagnosed the red as a "transient mid-flight read that self-healed." WRONG — doyle clarified: a job reports `conclusion=FAILURE` ONLY when COMPLETE, never mid-flight. Attempt 1 genuinely FAILED on the "Build notify-shell" step (crates.io curl dep-download blip: `download of config.json failed, curl failed` fetching serde_json) — a real transient INFRA flake. The green came from doyle's RERUN, not no-action; the in_progress I saw afterward was attempt 2 (the rerun), not a stale read of attempt 1. **LESSONS:** (1) `conclusion=failure` = COMPLETE — never dismiss as stale; READ THE FAILED STEP LOG. (2) a run flipping to in_progress AFTER a failure conclusion = a RERUN was triggered (check attempt number / who reran), not a stale read. (3) kitsubito "Build notify-shell" has a crates.io curl/dep-download flake → `gh run rerun --failed` clears it (FLAKE-LEDGER candidate). [[hfenduleam-disk-full-ci]] / [[unmergeable-pr-blocks-ci]] siblings — CI-red triage: read the actual failed step before classifying.
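
The triage lessons above reduce to a small decision rule over two observations of the same run. The sketch below encodes that rule only. All type and function names are hypothetical, and it assumes the attempt number and conclusion have already been read from the CI system.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Conclusion {
    Success,
    Failure,
}

/// One observation of a CI run. `conclusion` is None while the attempt is still running.
#[derive(Debug, Clone, Copy)]
pub struct RunSnapshot {
    pub attempt: u32,
    pub conclusion: Option<Conclusion>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Triage {
    ReadFailedStepLog,
    RerunInFlight,
    StillRunning,
    Green,
}

/// Classify the newer snapshot relative to the older one.
pub fn triage(prev: RunSnapshot, now: RunSnapshot) -> Triage {
    match now.conclusion {
        // A failure conclusion means the attempt is COMPLETE: go read the failed step.
        Some(Conclusion::Failure) => Triage::ReadFailedStepLog,
        Some(Conclusion::Success) => Triage::Green,
        // Running again with a higher attempt number: someone triggered a rerun.
        None if now.attempt > prev.attempt => Triage::RerunInFlight,
        None => Triage::StillRunning,
    }
}
```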

**LESSON:** the Explore root-cause agent (afe7959) over-theorized a shaky "registry race" — REFUTED. The reproduce-agent (af5fc4ee) with a NEGATIVE CONTROL nailed it. Ground in RUNTIME repro, not agent narrative. [[grep-req-tags-to-find-impl]] sibling.
