# Multi-instance, node-anchored identity model

## Status

accepted (2026-05-29) · amended (2026-05-30) — see Amendment below; the original framing over-weighted remote-drive as the *definition* of multi-instance.

## Amendment (2026-05-30): native-instances-with-synced-mind is primary; remote-drive is a separate mode

The original decision below framed "operate a LiveAgent on any node from any other machine" primarily as **remote PTY drive**. That weighting is backwards. Corrected model:

- **Primary — instance:** the *same* endpoint (same harness+adapter) running **natively** on a machine (local files, local compute), with the **Psyche mind synced** across instances. Files/session-history are per-node and cannot teleport; the mind follows. This is the core meaning of "same agent on multiple machines."
- **Instance dormancy:** typically one instance is actively driven; the others are **dormant** (live, state-preserved, re-activatable in place). Switching machines makes the previously driven instance dormant, with no explicit stop. Registry status is one of active / dormant / offline.
- **Remote-control (separate, Shell-like):** attaching a control/view surface to an instance *running on another node* (compute+files stay remote; you are a viewport — the byte-stream terminal attach). A driven surface, user→agent direction; distinct from the instance concept.
- **Handoff (remote-control → local):** `git pull` brings files; the fresh-with-preload resume seam preloads the other instance's synced context; the remote instance goes dormant. No teardown.
- **Two-tier context sync:** **live context** (per-agent) syncs to all instances; **project context** (per-agent-per-project) syncs only to same-project instances. Resolves divergent simultaneous work without losing "mind follows everywhere." (Depends on refining the Psyche authoring directives so each tier holds the right material.)
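
The two-tier rule in the last bullet can be sketched as a routing function. This is a minimal illustration, not the project's implementation: the `Tier`, `Instance`, and `sync_targets` names are hypothetical, and it assumes at most one instance per endpoint per node.

```rust
/// Which sync tier a piece of Psyche context belongs to.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Tier {
    /// Per-agent: follows the agent to every instance.
    Live,
    /// Per-agent-per-project: only meaningful inside the named project.
    Project(String),
}

/// One materialization of an endpoint on a node (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Instance {
    endpoint_id: String,
    node: String,
    /// Project this seat is working in, if any.
    project: Option<String>,
}

/// Return the instances that should receive an update authored on `origin`.
fn sync_targets<'a>(origin: &Instance, tier: &Tier, all: &'a [Instance]) -> Vec<&'a Instance> {
    all.iter()
        // Only instances of the same agent share a mind.
        .filter(|i| i.endpoint_id == origin.endpoint_id)
        // The authoring instance already has the update.
        .filter(|i| i.node != origin.node)
        // Live context goes everywhere; project context stays in-project.
        .filter(|i| match tier {
            Tier::Live => true,
            Tier::Project(p) => i.project.as_deref() == Some(p.as_str()),
        })
        .collect()
}
```

Under these assumptions, a live-tier update from `ling@desktop` reaches every other `ling` instance, while a project-tier update reaches only the `ling` instances seated in the same project.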

The decision below stands except where it implies remote-drive is the primary realization of multi-instance — read "drivable from anywhere" as "runnable natively on any of its nodes with synced mind, **plus** an optional remote-control surface mode."

## Context

The sister project models an agent as essentially one perch on one machine, and explicitly deferred "agent name resolution" across a subnet. spt-core commits to two user-facing requirements that break that model: (a) the user can interact with any endpoint from any machine, and (b) a LiveAgent can be operated on any node it has ever been instantiated on, from any other machine. The Instances concept (each endpoint classifies as an instance; the same endpoint ID can have instances across multiple machines with equal functionality) requires deciding what an "instance" actually is before any of it can be built.

## Decision

Split agent identity into two layers: a subnet-wide **endpoint ID** (the logical identity, `ling`) and an **instance** (a materialization of that ID on a specific node, `ling@desktop`).
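
As an illustration of the two layers, the sketch below parses the two address forms used in this document: a bare endpoint ID (`ling`) and a node-qualified instance (`ling@desktop`). The `Address` type and `parse_address` function are hypothetical names, and the validation rules (non-empty parts, a single `@`) are assumptions rather than a specified grammar.

```rust
/// An address as a user might type it (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Address {
    /// Logical identity only; needs a resolution policy to pick an instance.
    Bare { endpoint_id: String },
    /// A specific instance: endpoint ID materialized on a node.
    Qualified { endpoint_id: String, node: String },
}

/// Parse `id` or `id@node`. Assumed rules: no empty parts, at most one `@`.
fn parse_address(s: &str) -> Result<Address, String> {
    match s.split_once('@') {
        None if !s.is_empty() => Ok(Address::Bare {
            endpoint_id: s.to_string(),
        }),
        Some((id, node)) if !id.is_empty() && !node.is_empty() && !node.contains('@') => {
            Ok(Address::Qualified {
                endpoint_id: id.to_string(),
                node: node.to_string(),
            })
        }
        _ => Err(format!("invalid address: {s:?}")),
    }
}
```

Keeping the bare and qualified forms as distinct variants makes it explicit at the type level that a bare ID still has to pass through name resolution before it refers to any particular instance.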

**Instances are node-anchored, not portable.** An instance's substance — project working directory, harness session history, Psyche context, spool — lives on its node's filesystem and does not teleport. Instances of one ID are distinct *seats* (possibly on different projects), sharing only identity. "Equal functionality across all machines" therefore means *any instance is drivable with the full command surface from any node*, NOT that instances share state.

Rejected alternative — **portable/synced instances** (one logical conversation+state replicated across nodes): a LiveAgent's whole point is operating inside a project on a machine; the project filesystem can't be teleported, and full state replication across nodes is a distributed-consistency problem with no payoff for the actual use case (drive my desktop agent from my laptop).

Remote drive is realized by daemon-to-daemon PTY proxying over the network session-surface abstraction — enabled by the prior decision (ADR-0002) to collapse PTY-host and network-host into one `spt-daemon`.

**v1 scope (V1-mid):** ship the data-model split, the subnet registry (the now-mandatory name-resolution layer), the resolution policy, remote-drive of already-running instances, and cross-node Psyche sync transport. Defer (seam-compatible) instantiate-anywhere + its consent gate, remote command execution + its security gate, and the cross-instance context-freshness auto-feed rule.

## Consequences

- The subnet registry becomes a mandatory, first-class subsystem (eventually-consistent `endpoint_id → [instances]`), not a deferred nicety.
- Bare-ID addressing needs a deterministic resolution policy (local → most-recently-active → explicit `id@node` override).
- The networking decision (ADR-0002) is now load-bearing for a flagship user feature, not just messaging — remote drive depends on it working well.
- Cross-node Psyche sync over P2P replaces the sister project's GitHub-repo sync, removing the `gh`/account/setup dependency.
- Remote command execution and remote instantiation are the highest-risk capabilities in the system; both are gated behind a consent/security model whose design is deferred but whose existence is assumed by the architecture.
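
The registry shape and the resolution policy from the bullets above can be sketched together. Everything named here (`Registry`, `InstanceRecord`, `resolve`, the `last_active` counter) is hypothetical. The sketch assumes that an explicit `id@node` always wins, that offline instances are skipped when resolving a bare ID, and that ties on recency break toward the lexicographically smallest node name so the result stays deterministic.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Active,
    Dormant,
    Offline,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceRecord {
    node: String,
    status: Status,
    /// Recency counter; higher means more recently active (assumed field).
    last_active: u64,
}

/// endpoint_id -> instances known on the subnet.
type Registry = HashMap<String, Vec<InstanceRecord>>;

fn resolve<'a>(
    registry: &'a Registry,
    endpoint_id: &str,
    explicit_node: Option<&str>,
    local_node: &str,
) -> Option<&'a InstanceRecord> {
    let candidates = registry.get(endpoint_id)?;

    // Explicit `id@node` override: return that instance or nothing.
    if let Some(node) = explicit_node {
        return candidates.iter().find(|r| r.node == node);
    }

    // Bare ID: consider only instances that are not offline (assumption).
    let reachable: Vec<&InstanceRecord> = candidates
        .iter()
        .filter(|r| r.status != Status::Offline)
        .collect();

    // Step 1: prefer the instance on the local node.
    if let Some(local) = reachable.iter().copied().find(|r| r.node == local_node) {
        return Some(local);
    }

    // Step 2: otherwise the most recently active, ties broken by node name.
    reachable.into_iter().max_by(|a, b| {
        a.last_active
            .cmp(&b.last_active)
            .then_with(|| b.node.cmp(&a.node))
    })
}
```

The recency field is deliberately a plain counter rather than a timestamp, so the same sketch works whether the registry ends up tracking wall-clock time or per-node epochs.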

## Stage A red-team amendments (2026-05-31)

Codex (`docs/reviews/STAGE-A-codex-redteam.md`) flagged three SERIOUS issues; resolve before M4:

- **#7 — wall-clock "newest" loses concurrent context writes.** Two active instances editing the same live topic is a semantic *conflict*, not a stale-write race; wall-clock newest silently drops a valid update, and LLM-authored live/project tiering is too soft to be the conflict boundary. **Required:** monotonic per-node epochs / vector clocks (or CRDT-ish merge / explicit conflict review) for the cross-node Psyche merge — the distributed generalization of KNOWN-HAZARDS 6.5. *Decision pending: which merge model.*
- **#8 — `local → most-recently-active → id@node` is ambiguous under partition/clock-skew.** Can route messages or remote-control to the wrong instance around active/dormant transitions. **Required:** registry lease/epoch using monotonic per-node epochs, not wall-clock; chaos-test stale entries + ambiguous active instances.
- **#9 — dormant-warm default has real resource/security cost.** Warm sessions retain credentials, file locks, model/billing footprint, PTYs, watchers on *every* node. **Required:** measure N dormant sessions on real adapters; define an adapter-specific warm/cold policy + resource budget before locking warm as the default (R-INST-3).
  **CLOSED 2026-06-04 (M4-D9-3):** measured with the env-gated harness (`crates/spt-daemon/tests/budget.rs`, `SPT_BUDGET=1`) on the rig — an idle warm seat burns **zero CPU**; the cost is RSS only (~8 MiB shell-class, ~300 MiB LLM-harness-class; suspended residual = the on-disk record). Policy locked in `docs/DORMANCY-BUDGET.md`: warm default **confirmed**, auto-suspend opt-in default OFF with the global → node → endpoint `auto_suspend_after_ms` knob chain (D9-2), constrained nodes default the node leg ON.
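
To make issue #7 concrete, the sketch below shows how per-node epochs arranged as a vector clock can tell a stale write apart from a true conflict, which a wall-clock "newest wins" rule cannot do. The merge model is still undecided, so this illustrates only one of the listed options. `VectorClock`, `Causality`, `clock`, and `compare` are hypothetical names.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// node name -> highest epoch seen from that node (hypothetical shape).
type VectorClock = BTreeMap<String, u64>;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Causality {
    Same,
    /// `a` happened before `b`: `a` is stale and safe to overwrite.
    Before,
    /// `a` happened after `b`: `a` supersedes `b`.
    After,
    /// Neither saw the other: a real conflict that needs merge or review.
    Concurrent,
}

/// Convenience constructor for examples.
fn clock(entries: &[(&str, u64)]) -> VectorClock {
    entries
        .iter()
        .map(|&(node, epoch)| (node.to_string(), epoch))
        .collect()
}

/// Compare two clocks entry by entry; a missing entry counts as epoch 0.
fn compare(a: &VectorClock, b: &VectorClock) -> Causality {
    let (mut a_ahead, mut b_ahead) = (false, false);
    for node in a.keys().chain(b.keys()) {
        let ea = a.get(node).copied().unwrap_or(0);
        let eb = b.get(node).copied().unwrap_or(0);
        match ea.cmp(&eb) {
            Ordering::Greater => a_ahead = true,
            Ordering::Less => b_ahead = true,
            Ordering::Equal => {}
        }
    }
    match (a_ahead, b_ahead) {
        (false, false) => Causality::Same,
        (false, true) => Causality::Before,
        (true, false) => Causality::After,
        (true, true) => Causality::Concurrent,
    }
}
```

If `desktop` and `laptop` each advance their own epoch on the same live topic without seeing the other's write, `compare` reports `Concurrent`, and the merge layer can then route the pair to whichever conflict strategy is eventually chosen instead of silently dropping one.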
