# v0.13.2 W3 — Live daemon-coordinated adapter update: DESIGN PROPOSAL (todlando → doyle gate)

Response to `W3-DESIGN-GATE.md`. Code-grounded; **no impl written**. All file:line from a 4-agent sweep of the live tree on `v0.13.2-adapter-packaging` @0d8f8ca.

---

## 0. The load-bearing finding (reshapes the whole wave)

W3 touches **three distinct "resident" things. The first two live in different homes with opposite reachability; the third (the psyche) is only partially reachable:**

| Thing | Home | IPC-reachable? | Kill story |
|---|---|---|---|
| **Translation binary** (`[message-idle-translation-binary]`) | **broker** — `Translation.child: Arc<TranslationChild>` (broker.rs:845) inside `HostedSession{endpoint, translation}` in `sessions: Arc<Mutex<HashMap<u64,HostedSession>>>` (broker.rs:872,1112). **Survives brain restart.** | **YES** (broker `handle_conn`) | bounded `terminate()` (translation.rs:312: 500ms grace + 20ms poll + force-kill) |
| **BrainLifecycle** (manifest + psyche supervisor) | **brain driver thread** — *moved* into the `std::thread::spawn(move ...)` closure (livehost.rs:400), behind **no lock**, runs `run_pulse_loop` forever | **NO** — unreachable from any IPC handler; only signal in is `stop: Arc<AtomicBool>` | thread join on stop flag |
| **Psyche** (`[session.psyche_init] detach=true`) | separate **detached install-dir binary**, spawned via `spawn_psyche_owned` (psyche.rs:108) → `runtime.spawn_session_owned(PSYCHE_INIT_ROLE)`; handle held as `HostedLife.psyche_child: Option<Child>` (livehost.rs:102) | partially (handle in brain's LiveSet) | `stop_host` handle-reap (livehost.rs:244) **only if brain alive**; else cmdline-scoped reap at **brain start only** (livehost.rs:683) |

**Consequence:** W3a/W3d (cycle the translation child) belong in the **broker** and are clean. W3c (manifest refresh) **cannot** be "re-clone into the running BrainLifecycle" as ADR-0025 states — the live instance is thread-owned and IPC-unreachable. It needs a **shared refresh handle** the brain registers at host time. This is the biggest correction below.

---

## 1. W3a — resident-children registry  → **the broker's `sessions` map already IS it**

Do **not** build a parallel registry (dual source of truth). The broker is the daemon-state anchor (broker.rs:1112, survives brain restart); each `HostedSession` already carries `endpoint: String` (broker.rs:869) + `translation: Option<Arc<Translation>>` (broker.rs:872). The translation child is fed in at session spawn (`dispatch_spawn` → `TranslationChild::spawn`, translation.rs:240; stored broker.rs:845) and torn down on session drop/teardown (`terminate()` at broker.rs:1095/1104 + Drop translation.rs:342).

**Proposed shape:**
- Add two fields to `HostedSession`: `adapter: String` and `install_dir: PathBuf` (sourced from the `SpawnReq`/harnesshost path that already carries `translation_binary`, harnesshost.rs:142). These let an apply ask *"which live sessions run adapter X, and where is its install dir."*
- Add two broker query helpers over the existing map: `fn resident_children_for_endpoint(&self, endpoint: &str) -> Vec<Arc<Translation>>` and `fn endpoints_running_adapter(&self, adapter: &str) -> Vec<(String /*endpoint*/, PathBuf /*install_dir*/)>`.
- **Flag (JIT W3a wording):** the JIT says "formalize … into a registry" — grounded, the registry exists; W3a is a thin query + 2 new `HostedSession` fields, not a new structure.
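
A minimal sketch of the two queries, assuming a simplified `HostedSession` and a placeholder `Translation` type. The `Broker` wrapper and the `label` field are hypothetical stand-ins; only `endpoint`, `translation`, and the two proposed fields come from the text above.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

// Placeholder for the broker's Translation wrapper (the real one holds the child).
pub struct Translation {
    pub label: String,
}

// Simplified shape: `endpoint` and `translation` exist today;
// `adapter` and `install_dir` are the two proposed additions.
pub struct HostedSession {
    pub endpoint: String,
    pub translation: Option<Arc<Translation>>,
    pub adapter: String,
    pub install_dir: PathBuf,
}

// Hypothetical wrapper around the existing sessions map.
pub struct Broker {
    pub sessions: Arc<Mutex<HashMap<u64, HostedSession>>>,
}

impl Broker {
    /// All resident translation children hosted for one endpoint.
    pub fn resident_children_for_endpoint(&self, endpoint: &str) -> Vec<Arc<Translation>> {
        let sessions = self.sessions.lock().unwrap();
        sessions
            .values()
            .filter(|s| s.endpoint == endpoint)
            .filter_map(|s| s.translation.clone())
            .collect()
    }

    /// Distinct (endpoint, install_dir) pairs whose live session runs `adapter`.
    pub fn endpoints_running_adapter(&self, adapter: &str) -> Vec<(String, PathBuf)> {
        let sessions = self.sessions.lock().unwrap();
        let mut out: Vec<(String, PathBuf)> = sessions
            .values()
            .filter(|s| s.adapter == adapter)
            .map(|s| (s.endpoint.clone(), s.install_dir.clone()))
            .collect();
        out.sort();
        out.dedup();
        out
    }
}
```

Both helpers only read the map under its existing lock, so no second source of truth is introduced.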

---

## 2. W3c — manifest refresh  → **shared `Arc<RwLock>` handle the brain registers** (ADR-0025 correction)

ADR-0025 step 3 ("Re-clone the new on-disk manifest into the running `BrainLifecycle`") is **not directly implementable**: BrainLifecycle is moved into the brain thread (livehost.rs:400), no lock, IPC-unreachable. Fields read live: `manifest` at lifecycle.rs:355 (`history`/echo); `runtime` at 185/209/430 (`spawn_psyche`, `spawn_psyche_owned`, `fire_echo`).

**Proposed mechanism (std only, no new dep):**
- Change BrainLifecycle fields to interior-mutable shared handles:
  `manifest: Arc<RwLock<Manifest>>`, `runtime: Arc<RwLock<ManifestRuntime>>`.
  The 4 read sites become `.read()` borrows; the refresh becomes `fn refresh_manifest(&self, new: Manifest, install_dir: Option<&Path>)` — now `&self` (interior mutability), so it is callable through a cloned `Arc` from another thread.
- At host time (livehost.rs `host_one`, ~360), the brain **registers** `(endpoint → Arc<RwLock<Manifest>> + Arc<RwLock<ManifestRuntime>>)` into a broker-side per-endpoint map (lives next to / inside the broker, the only IPC-reachable, brain-restart-surviving home). The apply handler swaps via this handle; next pulse tick the brain reads the new manifest.
- **Brain-parity preserved:** the brain thread keeps running; PTY + broker session + OutputLog (all broker-state) untouched; only the manifest/runtime *contents* swap.
- `ManifestRuntime::with_install_dir(manifest, install_dir)` (runtime.rs:298) is the rebuild call (it caches `manifest` + `install_dir` immutably for REQ-INSTALL-11 token resolution; rebuilding = new clone under the write lock).
- **Alternative considered & rejected:** "re-host" (rebuild BrainLifecycle) — heavier, re-spawns the psyche even on a translation-only change, and blurs brain-parity. The `Arc<RwLock>` swap is the minimal correct mechanism.
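
The swap-through-a-shared-handle idea can be sketched as follows. This is an illustration under assumptions, not the proposed code: `Manifest` is reduced to a single field, `RefreshHandle` and `RefreshRegistry` are hypothetical names, and the `runtime` handle is omitted for brevity.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Reduced stand-in for the real Manifest.
#[derive(Clone, Debug, PartialEq)]
pub struct Manifest {
    pub version: String,
}

// Shared between the brain thread (reader) and the apply handler (writer).
#[derive(Clone)]
pub struct RefreshHandle {
    pub manifest: Arc<RwLock<Manifest>>,
}

impl RefreshHandle {
    /// Takes `&self`: interior mutability lets another thread swap the contents.
    pub fn refresh_manifest(&self, new: Manifest) {
        *self.manifest.write().unwrap() = new;
    }

    /// What a read site in the pulse loop would do: a short `.read()` borrow.
    pub fn current_version(&self) -> String {
        self.manifest.read().unwrap().version.clone()
    }
}

// Hypothetical broker-side map: endpoint -> handle registered at host time.
#[derive(Default)]
pub struct RefreshRegistry {
    handles: Mutex<HashMap<String, RefreshHandle>>,
}

impl RefreshRegistry {
    pub fn register(&self, endpoint: &str, handle: RefreshHandle) {
        self.handles.lock().unwrap().insert(endpoint.to_string(), handle);
    }

    /// Returns false when no brain has registered a handle for `endpoint`.
    pub fn refresh(&self, endpoint: &str, new: Manifest) -> bool {
        // Clone the handle out so the registry lock is not held during the swap.
        let handle = self.handles.lock().unwrap().get(endpoint).cloned();
        match handle {
            Some(h) => {
                h.refresh_manifest(new);
                true
            }
            None => false,
        }
    }
}
```

The brain keeps its own clone of the handle and never restarts; it simply observes the new contents on its next read.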

---

## 3. (c) PRIMARY — psyche-from-own-copy  + REQ-INSTALL-11 proof

The locker is `claude-spt-psyche.exe` running **detached, unsupervised, from the shared install dir** (psyche.rs spawn → `runtime.spawn_session_owned`, resolves `<install_dir>/<program>` via `resolve_program_in_dir`, runtime.rs:524).

**Proposed:** spawn the detached psyche from a **per-endpoint private copy**, never the shared install-dir binary.
- **Where:** a per-endpoint private dir — propose `<perch_dir>/.live-bin/` (per-endpoint, beside the perch, torn down with it). One copy of just the psyche program binary (+ any sibling it needs to exec).
- **When:** at host/spawn (`spawn_psyche_owned`, lifecycle.rs:203 / livehost.rs:369). Resolve the bare psyche token to the shipped binary (`<install_dir>/<program>`), **copy** those bytes to `<.live-bin>/<program>`, then spawn the **copy** path.
- **Update refresh:** the next psyche spawn re-copies the (now updated) install-dir bytes → own-copy self-heals — exactly the "ephemeral self-heals on next spawn" rule, now made lock-safe. On apply: reap the current psyche (handle or §4 scoped-reap), then it re-spawns from a fresh copy of the new bytes.
- **REQ-INSTALL-11 proof:** REQ-INSTALL-11's contract = *a bare program token resolves to the adapter's shipped binary* (so the adapter is self-contained, not PATH-dependent). The own-copy keeps that intact: resolution still **finds** the shipped binary in `install_dir` (to source its bytes); only the **exec path** is the relocated copy, whose *content is identical* to the shipped binary. All other token resolution (args, `strings/`, `{session_id}` keys) still resolves against `install_dir` unchanged. The endpoint stays self-contained; the only change is the running executable no longer holds a lock on the *update target*. Net: REQ-INSTALL-11 holds; we add an explicit `[unit]` proving the spawned psyche path is under `.live-bin/` and its bytes equal `<install_dir>/<program>`.
- **Resident translation binary stays in-install-dir (NOT own-copy).** Justification: own-copy is only needed for an **unsupervised** locker (detached, no handle, reap-timing-dependent). The translation child is **supervised** — broker handle + bounded `terminate()` — so W3d's stop-before-swap releases its lock deterministically. Giving it own-copy would add copy churn for zero benefit. (Stated explicitly per the brief's ask.)
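
A rough sketch of the own-copy staging step, assuming the psyche needs only its single program file. `stage_own_copy` is a hypothetical helper name; the `.live-bin` directory follows the location proposed above and is still subject to confirmation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy `<install_dir>/<program>` to `<perch_dir>/.live-bin/<program>` and
/// return the private path to exec. Resolution still starts from install_dir;
/// only the exec path is relocated.
pub fn stage_own_copy(install_dir: &Path, perch_dir: &Path, program: &str) -> io::Result<PathBuf> {
    let shipped = install_dir.join(program);
    let live_bin = perch_dir.join(".live-bin");
    fs::create_dir_all(&live_bin)?;
    let private = live_bin.join(program);
    // fs::copy overwrites an existing destination, so each spawn re-copies
    // whatever bytes install_dir currently holds (the self-heal behaviour).
    fs::copy(&shipped, &private)?;
    Ok(private)
}
```

Because the copy overwrites the previous private binary, the old psyche has to be reaped before a re-spawn calls this again; on Windows a still-running copy would be expected to block the overwrite.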

---

## 4. (a) SUPPORTING — stop-path / brain-death orphan reap

Grounded gap: `stop_host` reaps `psyche_child` by handle (livehost.rs:244) **only when the owning brain is alive**; the cmdline-scoped `psyche_orphan_should_reap` (livehost.rs:595; predicate: pid-alive ∧ basename==psyche-prog ∧ cmdline⊇`<id>-psyche`, fail-safe-decline on any unreadable signal) runs **only at brain start** (`reap_orphan_psyches`, livehost.rs:683). perri's orphan slipped past both: the brain died abruptly (handle lost) and endpoint-stop never ran the scoped reap.

**Proposed:** invoke the **existing** scoped-reap at the **endpoint-stop / brain-death reconcile** path, not just brain start.
- Hook: `reconcile_once` stop-side (livehost.rs:280-301) and `confirm_residency_or_unhost` (livehost.rs:477) — after `set.stop_host(&id)` (handles the handle-alive case), run a scoped-reap pass for `<id>-psyche` to cover the **handle-lost** case (brain-less orphan). Reuse `psyche_orphan_should_reap` verbatim → keep the fail-safe-decline invariant (a missed dup is bounded; a wrong-kill is catastrophic — never kill on an unreadable signal).
- This fixes the orphan **leak** independent of updates (perri step 3 reaps the orphan).
- **REQ-HAZARD: propose a NEW one** — distinct from `REQ-HAZARD-UNHOST-PSYCHE-REAP` (handle reap) and `REQ-HAZARD-BRAIN-RESTART-PSYCHE-DUP` (brain-start reap). Proposed id: **`REQ-HAZARD-STOP-PATH-PSYCHE-ORPHAN-REAP`** — *"endpoint-stop / brain-death reconcile reaps a brain-less perch's orphan detached psyche via the cmdline-scoped guard (the handle-reap cannot — the owning brain is gone), preserving fail-safe-decline."* **doyle confirms the id/shape before I mint.**
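
To make the fail-safe-decline invariant concrete, here is an illustrative predicate with the same three conditions. It is not the existing `psyche_orphan_should_reap` (whose signature is not quoted here); `ProcProbe` and `orphan_should_reap` are hypothetical names, and each probe field is `None` when the signal could not be read.

```rust
/// What could be learned about a candidate pid. `None` = unreadable signal.
pub struct ProcProbe {
    pub alive: Option<bool>,
    pub exe_basename: Option<String>,
    pub cmdline: Option<String>,
}

/// Reap only when every signal is readable AND all three conditions hold.
/// Any unreadable signal declines: a missed duplicate is bounded,
/// a wrong kill is not.
pub fn orphan_should_reap(probe: &ProcProbe, psyche_prog: &str, endpoint_id: &str) -> bool {
    match (
        probe.alive,
        probe.exe_basename.as_deref(),
        probe.cmdline.as_deref(),
    ) {
        (Some(true), Some(base), Some(cmd)) => {
            let marker = format!("{}-psyche", endpoint_id);
            base == psyche_prog && cmd.contains(marker.as_str())
        }
        // Dead pid, or at least one signal unreadable: decline.
        _ => false,
    }
}
```

Calling the same predicate from the stop-side path keeps one definition of "safe to kill" across brain start and endpoint stop.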

---

## 5. (b) BACKSTOP + W3b — CRC-gated, extract-to-temp, atomic per-file swap

Reuse `spt_daemon::release::sha256_hex(&[u8]) -> String` (release.rs:603); `sha2` is already a workspace dep, so **no new crate**. Compare = `==` on lowercase hex. ("CRC" in this section is shorthand for that SHA-256 digest comparison, not a CRC32.)

**Proposed swap (replaces today's `remove_dir_all(dest)` + blanket extract — that IS the Windows lock-failure path):**
1. Extract the staged archive to a **temp tree** (sibling of install_dir) — reuses W1 `extract_release_archive` shape, into temp not dest.
2. Walk the temp tree; for each file, `sha256_hex(temp_bytes) != sha256_hex(installed_bytes)` → mark for swap (a file with no installed counterpart also counts as differing); identical → skip (untouched, including unchanged running binaries).
3. Swap each differing file **atomically**: write into install_dir as `<file>.new`, then rename over the original (rename is atomic on the same fs; the stopped binary's lock has already been released by the W3d stop step, §6 step 2, which runs before this).
4. Remove the temp tree.

This is both W3b (only-changed) and the (b) backstop (one busy file can't fail the whole update — only the file actually being replaced needs its lock free, and W3d frees exactly the resident ones). **Note (per brief): (b) alone can't replace a *running* locked `.exe` — that's why (c)+(a)+W3d-stop are primary.**
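
Steps 2 and 3 can be sketched with std only. Assumptions, stated loudly: `swap_changed_files` is a hypothetical name; the `digest` parameter stands in for `sha256_hex` so the sketch needs no crate; a file missing from install_dir is treated as changed; files that the new release removed are not handled.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `temp_root`; for every file whose digest differs from its installed
/// counterpart, write `<file>.new` beside the target and rename it over.
/// Returns the relative paths that were swapped, sorted.
pub fn swap_changed_files<F>(temp_root: &Path, install_dir: &Path, digest: F) -> io::Result<Vec<PathBuf>>
where
    F: Fn(&[u8]) -> String,
{
    let mut swapped = Vec::new();
    let mut pending = vec![temp_root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
                continue;
            }
            let rel = path
                .strip_prefix(temp_root)
                .expect("walked path is under temp_root")
                .to_path_buf();
            let target = install_dir.join(&rel);
            let new_bytes = fs::read(&path)?;
            let unchanged = match fs::read(&target) {
                Ok(old) => digest(&old) == digest(&new_bytes),
                // Assumption: a file absent from install_dir counts as changed.
                Err(e) if e.kind() == io::ErrorKind::NotFound => false,
                Err(e) => return Err(e),
            };
            if unchanged {
                continue; // identical bytes: never touched, lock or no lock
            }
            if let Some(parent) = target.parent() {
                fs::create_dir_all(parent)?;
            }
            let mut staged = target.clone().into_os_string();
            staged.push(".new");
            let staged = PathBuf::from(staged);
            fs::write(&staged, &new_bytes)?;
            fs::rename(&staged, &target)?; // same directory, so same filesystem
            swapped.push(rel);
        }
    }
    swapped.sort();
    Ok(swapped)
}
```

Unchanged files are only read, never written, which is what lets an unchanged running binary stay locked without failing the update.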

---

## 6. W3d — daemon-apply IPC (route + per-endpoint apply + bounded + partial-failure)

**CLI keeps fetch+verify+stage** (`cmd_adapter_update`, cli.rs:6064; stage @6136). New decision after stage:
- **Which endpoints run this adapter live?** `roster::enumerate()` (spt/roster.rs:25) → `.alive` ⨯ `registry::registered()` cross-ref (no live-adapter registry exists; build the snapshot). 
  - **None live →** CLI applies directly on the current path **but via the §5 CRC swap** (not `remove_dir_all`) — no lock, no daemon round-trip.
  - **Some live →** CLI hands **apply** to the daemon.
- **IPC surface — the BROKER, not seedmap.** Two surfaces exist: seed-control (`seedmap.rs` — put/take/ping) and the **broker session protocol** (`handle_conn` match, broker.rs:1306; kinds in msg.rs:24+). The apply must reach the translation child + sessions → it is a **broker** command. Add `KIND_ADAPTER_APPLY` (msg.rs) + `AdapterApplyReq { adapter, staged_archive_path, install_dir }` + a `dispatch_adapter_apply` arm at broker.rs:1306 (pattern: `KIND_KILL`→`dispatch_kill`, broker.rs:2184). CLI client sends it over the same broker channel the api client already uses (confirm the exact CLI-side broker-call helper at impl time; `spt/src/api/mod.rs` is the client home).
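
The post-stage decision could look roughly like this. `AdapterApplyReq` carries the three fields named above; `RosterEntry`, `ApplyRoute`, and `route_apply` are hypothetical names for illustration, and the roster slice stands in for the `roster::enumerate()` ⨯ `registry::registered()` snapshot.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, PartialEq)]
pub struct AdapterApplyReq {
    pub adapter: String,
    pub staged_archive_path: PathBuf,
    pub install_dir: PathBuf,
}

// Hypothetical snapshot row: one registered endpoint and whether it is alive.
pub struct RosterEntry {
    pub endpoint: String,
    pub adapter: String,
    pub alive: bool,
}

#[derive(Debug, PartialEq)]
pub enum ApplyRoute {
    /// No live endpoint runs this adapter: CLI applies via the §5 swap itself.
    Direct(AdapterApplyReq),
    /// At least one live endpoint: hand the apply to the broker.
    Daemon { req: AdapterApplyReq, endpoints: Vec<String> },
}

pub fn route_apply(roster: &[RosterEntry], req: AdapterApplyReq) -> ApplyRoute {
    let endpoints: Vec<String> = roster
        .iter()
        .filter(|e| e.alive && e.adapter == req.adapter)
        .map(|e| e.endpoint.clone())
        .collect();
    if endpoints.is_empty() {
        ApplyRoute::Direct(req)
    } else {
        ApplyRoute::Daemon { req, endpoints }
    }
}
```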

**`dispatch_adapter_apply`, per affected endpoint (bounded throughout):**
1. Query resident children + manifest-refresh handle (W3a/W3c).
2. **Stop** the resident translation child — `terminate()` (already bounded, translation.rs:312). Reap the psyche (handle or §4 scoped-reap).
3. **§5 CRC swap** install_dir from the staged archive (locks now free).
4. **W3c refresh** the BrainLifecycle manifest/runtime via the shared `Arc<RwLock>` handle.
5. **Restart** the resident translation child from the refreshed manifest path (re-spawn). The psyche re-spawns on the next pulse from a fresh §3 own-copy.

**Brain-parity:** the `BrainLifecycle`/broker session/PTY never restart — only the resident child cycles (translation re-spawn; psyche re-spawn).

**Partial-failure recovery (binding):** ordering = *validate staged → stop child → atomic per-file swap → refresh manifest → restart*. A failure **before** swap-commit leaves install_dir on the OLD bytes → restart the OLD child, report update-failed, endpoint stays live on the old version. Because the swap is per-file write-temp-then-rename (§5), there is no "half-swapped + stopped-binary stranded" window: either a file's rename committed or it didn't; on abort we restart from whatever is on disk (old for every file whose rename had not yet landed). The endpoint is **never** left stopped-without-restart.
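
The ordering and the "never left stopped" rule can be sketched as a small driver. All names here (`ApplySteps`, `ApplyOutcome`, `apply_endpoint`) are hypothetical, and the `RestartFailed` variant is an assumption added for completeness; the text above does not specify what happens if the restart itself fails.

```rust
/// One implementor per affected endpoint; each method is assumed bounded.
pub trait ApplySteps {
    fn validate_staged(&mut self) -> Result<(), String>;
    fn stop_children(&mut self) -> Result<(), String>;
    fn swap_files(&mut self) -> Result<(), String>;
    fn refresh_manifest(&mut self) -> Result<(), String>;
    fn restart_children(&mut self) -> Result<(), String>;
}

#[derive(Debug, PartialEq)]
pub enum ApplyOutcome {
    Updated,
    /// Update failed, endpoint is live on whatever bytes are on disk.
    FailedStillLive(String),
    /// Assumption: surfaced separately because recovery here is unspecified.
    RestartFailed(String),
}

// The fallible middle of the sequence; `?` aborts at the first failure.
fn run_update<S: ApplySteps>(steps: &mut S) -> Result<(), String> {
    steps.stop_children()?;
    steps.swap_files()?;
    steps.refresh_manifest()
}

pub fn apply_endpoint<S: ApplySteps>(steps: &mut S) -> ApplyOutcome {
    // Validation fails before anything is stopped, so no restart is needed.
    if let Err(e) = steps.validate_staged() {
        return ApplyOutcome::FailedStillLive(e);
    }
    let update = run_update(steps);
    // Restart runs unconditionally once a stop may have happened.
    let restart = steps.restart_children();
    match (update, restart) {
        (Ok(()), Ok(())) => ApplyOutcome::Updated,
        (Err(e), Ok(())) => ApplyOutcome::FailedStillLive(e),
        (_, Err(e)) => ApplyOutcome::RestartFailed(e),
    }
}
```

Keeping the restart outside the fallible chain is the design point: no early return can skip it once the child may have been stopped.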

---

## 7. W3e — int keystone, NO mocks (ship-blocker)

Reuse the v0.12.1 dummy-harness fixture + real broker + real daemon. perri's exact 4-step repro as acceptance:
1. bring up a claude-spt-style live agent `wall-X` → daemon spawns the detached psyche;
2. kill the psyche's **parent** (brain/livehost) → parent-dead orphan;
3. `spt endpoint stop wall-X` → **(a)** scoped-reap kills the orphan; perch `alive=false`;
4. `spt adapter update` → **(c)** psyche never ran from the shared target **and/or (b)** locked-file no longer fails the whole update → extract/swap clean.

**Regression (the gate):** the **Windows locked-binary** case — pre-fix `Access is denied (os error 5)`, post-fix clean.
**Plus two variants:** (i) graceful-stop (handle-reap path still reaps); (ii) happy-path live-update — endpoint stays live, translation stops→CRC-swaps→restarts→**serves the new binary**, psyche re-spawns from the new own-copy. 
If it flakes on hfenduleam (heavy: real broker + PTY + resident child), use the `SPT_ATTACH_IPC_DEADLINE_MS` / watchdog env-knob precedent.

---

## 8. Docs + traceability (what I'll write)

- **ADR-0025 amendment** — add the **third axis** to the resident/ephemeral split: *ephemeral-but-install-dir-locking → run from a per-endpoint own copy*. Correct the two errors below; document (a)/(c)/broker-as-registry/manifest-refresh-via-shared-handle.
- **REQ-ADAPTER-LIVE-UPDATE** → `["doc","impl","unit","int"]` (activate at W3 build).
- **New `REQ-HAZARD-STOP-PATH-PSYCHE-ORPHAN-REAP`** (§4) — doyle confirms id/shape before mint.
- Tags `[impl/unit/int->REQ-ADAPTER-LIVE-UPDATE]` + `[unit/int->REQ-HAZARD-STOP-PATH-PSYCHE-ORPHAN-REAP]` on real evidence.

---

## 9. Flags: where ADR-0025 / the JIT are wrong or underspecified

1. **ADR-0025 mis-classifies the psyche as in-daemon** ("the psyche loop is daemon-hosted, not a separate process (ADR-0004)"). Field-false for claude-spt: it's a separate detached install-dir binary. → §3 own-copy + amendment.
2. **ADR-0025 step 3 ("re-clone into the running BrainLifecycle") is not implementable as written** — BrainLifecycle is thread-moved (livehost.rs:400), unlocked, IPC-unreachable. → §2 shared `Arc<RwLock>` handle registered at host.
3. **JIT W3a ("formalize into a registry")** — the broker `sessions` map already is the per-endpoint resident registry. → §1 thin query + 2 fields, not a new structure.
4. **IPC target ambiguity** — apply is a **broker** command (sessions/translation live there), **not** a seed-control command. → §6.
5. **`TranslationChild::spawn` path is opaque/pre-resolved** (`Command::new(path)`, translation.rs:241, from `SpawnReq.translation_binary`, harnesshost.rs:142). The restart (W3d step 5) must re-resolve the path from the **refreshed** manifest so it picks up new bytes / a moved path.

---

## 10. Proposed build order (reviewable sub-commits, doyle gates each)

`W3a` (broker query + `HostedSession{adapter,install_dir}`) → `W3b` (pure `sha256`-CRC swap helper, unit-tested) → `W3c` (`BrainLifecycle` `Arc<RwLock>` + `refresh_manifest` + brain-registers-handle) → **(a)** stop-path scoped-reap + **(c)** psyche own-copy → `W3d` (`KIND_ADAPTER_APPLY` + `dispatch_adapter_apply` + CLI routing/decision) → `W3e` (NO-mocks int + Windows-lock regression). 

**No impl until doyle confirms** — especially: the §2 manifest-refresh mechanism, the §3 own-copy dir location, the §4 new REQ-HAZARD id/shape, and the §6 broker-IPC-vs-seedmap call.
