# Architecture Patterns: Native Runtime Replacement

**Domain:** Native IPC runtime replacing bash-based agent messaging
**Researched:** 2026-03-29
**Confidence:** MEDIUM (IPC mechanisms well-understood; integration with Claude Code agent model has some unknowns)

## Recommended Architecture

**Central daemon + thin CLI client in a single binary.** The binary operates in two modes selected by subcommand: `owl daemon` starts the broker, everything else (`owl poll`, `owl deliver`, etc.) connects as a client.

```
                   ┌────────────────────────────────────────┐
                   │        owl daemon (long-lived)         │
                   │                                        │
                   │ ┌────────────────────────────────────┐ │
                   │ │ Message Router (in-memory)         │ │
                   │ │                                    │ │
                   │ │ perch_table: HashMap<ID, Perch>    │ │
                   │ │ msg_queues: HashMap<ID, VecDeque>  │ │
                   │ └─────────────┬──────────────────────┘ │
                   │               │                        │
                   │ ┌─────────────┴──────────────────────┐ │
                   │ │ Local Socket Listener              │ │
                   │ │ (named pipe Win / UDS Unix)        │ │
                   │ └─────────────┬──────────────────────┘ │
                   │               │                        │
                   │ ┌─────────────┴──────────────────────┐ │
                   │ │ Psyche Process Supervisor          │ │
                   │ │ (spawns/monitors claude -p)        │ │
                   │ └────────────────────────────────────┘ │
                   └───────────────┬────────────────────────┘
                                   │ local socket
          ┌────────────────────────┼────────────────────────┐
          │                        │                        │
   ┌──────┴──────┐          ┌──────┴──────┐          ┌──────┴──────┐
   │ owl poll    │          │ owl deliver │          │ owl list    │
   │ (agent A)   │          │ (agent B)   │          │ (agent C)   │
   │ CLI client  │          │ CLI client  │          │ CLI client  │
   └─────────────┘          └─────────────┘          └─────────────┘
```

### Why Daemon + CLI, Not Peer-to-Peer

1. **Blocking poll semantics require a persistent broker.** Agents call `$OWL poll <id>` which blocks until a message arrives. With peer-to-peer, each agent would need its own listener -- the current file-based system already does this but with the overhead of disk writes and polling loops. A central daemon can use condition variables or async channels to wake blocked clients instantly with zero polling.

2. **State centralization simplifies lifecycle management.** The daemon holds the perch table, pending messages, and liveness state in memory. `list`, `stop --all`, `cleanup-session`, and stale detection become instant lookups rather than filesystem scans. Session scoping (which perches belong to which session) is a HashMap filter.

3. **Single point of coordination for Psyche processes.** The daemon can own the Psyche wrapper loop (launching `claude -p`, feeding messages on resume). This eliminates the current pattern of background bash subshells with PID files.

4. **Peer-to-peer adds complexity with no benefit here.** All agents run on the same machine in the same user context. There is no network partitioning to handle, no multi-host routing. A local daemon is simpler and faster.
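
To make point 2 concrete, here is a minimal sketch of session scoping as a filter over an in-memory perch table. The `Perch` fields and the `perches_in_session` helper are assumptions for illustration; the design above only states that the table is a `HashMap` keyed by perch ID.

```rust
use std::collections::HashMap;

/// Hypothetical perch metadata; the field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Perch {
    pid: u32,
    session: String,
}

/// Return the (perch ID, PID) pairs that belong to `session`, sorted by ID
/// so that `list`-style output is stable across runs.
fn perches_in_session(table: &HashMap<String, Perch>, session: &str) -> Vec<(String, u32)> {
    let mut found: Vec<(String, u32)> = table
        .iter()
        .filter(|(_, perch)| perch.session == session)
        .map(|(id, perch)| (id.clone(), perch.pid))
        .collect();
    found.sort();
    found
}
```

A `cleanup-session` handler could iterate over the returned pairs to unregister each perch, with no directory scan involved.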

### Why Single Binary, Not Separate Daemon and Client

1. **Distribution simplicity.** One file to copy: `owl.exe` (or `owl` on Unix). The current system's value proposition is "copy files and go" -- maintaining that with a single binary preserves it.

2. **Auto-start pattern.** Any CLI command can check whether the daemon is running and start it automatically if not. This means agents never need to explicitly start the daemon -- the first `$OWL` call bootstraps it.

3. **Version coherence.** Daemon and client are always the same version. No compatibility matrix.
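
The two-mode dispatch could look roughly like the sketch below. It uses only the standard library as a stand-in for the `clap`-based parser proposed later in this document; the `Mode` enum and `select_mode` function are hypothetical names.

```rust
/// Which half of the single binary should run for this invocation.
#[derive(Debug, PartialEq)]
enum Mode {
    /// `owl daemon`: start the long-lived broker.
    Daemon,
    /// Any other subcommand: connect to the broker as a thin client.
    Client { subcommand: String, args: Vec<String> },
}

/// Decide the mode from the arguments that follow the program name.
/// Returns `None` when no subcommand was given.
fn select_mode(args: &[String]) -> Option<Mode> {
    let (first, rest) = args.split_first()?;
    if first == "daemon" {
        Some(Mode::Daemon)
    } else {
        Some(Mode::Client {
            subcommand: first.clone(),
            args: rest.to_vec(),
        })
    }
}
```

In `main`, the input would be `std::env::args().skip(1).collect::<Vec<_>>()`. Only the `Mode::Daemon` branch needs to set up the listener and any async runtime.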

### Component Boundaries

| Component | Responsibility | Communicates With |
|-----------|---------------|-------------------|
| **CLI Parser** | Parse subcommands and args, dispatch to daemon mode or client mode | All components (entry point) |
| **Daemon / Broker** | Accept connections, route messages, manage perch state, supervise Psyche processes | CLI clients via local socket |
| **Perch Manager** | Register/unregister perches, track PID/session metadata, detect stale perches | Daemon (internal module) |
| **Message Router** | Queue messages per perch, notify blocked poll clients when messages arrive | Daemon (internal module) |
| **Psyche Supervisor** | Launch `claude -p` processes, feed poll results to `claude --resume`, track generations | Daemon (internal module) |
| **Client Stub** | Connect to daemon socket, send command, receive response, print to stdout/stderr | Daemon via local socket |

### Data Flow

**Message delivery (agent A sends to agent B):**

1. Agent B's CLI runs `owl poll B-id --setup` -- client connects to daemon, sends `POLL B-id --setup`
2. Daemon creates perch for B-id, registers the connection as a "waiting client" for B-id
3. Agent A's CLI runs `owl deliver B-id A-id <<< "hello"` -- client connects, sends a `DELIVER B-id A-id <len>` header followed by the body `hello`
4. Daemon looks up B-id in perch table, finds a waiting client, pushes the message to that client's connection
5. B's waiting client receives the message, prints it to stdout (with `__REPLY_TO__:A-id` header), disconnects
6. B's agent processes the message, re-runs `owl poll B-id` to wait for the next one

**Key difference from current system:** Step 4 is instant (no 1-second sleep loop). The daemon holds B's connection open and pushes the message the moment it arrives.

**Ephemeral send (one-shot with reply wait):**

1. Agent A's CLI runs `owl send target-id my-id <<< "msg"` -- client connects, sends a `SEND target-id my-id <len>` header followed by the body `msg`
2. Daemon creates an ephemeral perch for my-id, delivers the message to target-id, holds the connection open
3. When a reply arrives at my-id's ephemeral perch, daemon pushes it to A's held connection
4. Client prints the reply, disconnects. Daemon auto-removes the ephemeral perch.

## IPC Transport: Local Sockets via `interprocess`

**Use the `interprocess` crate** (`local_socket` module). It provides a unified API that compiles to named pipes on Windows and Unix domain sockets on Unix/macOS. Confidence: HIGH -- this is the standard Rust cross-platform local IPC library, actively maintained, supports async via Tokio.

**Socket path:** `~/.claude/owlery/owl.sock` (Unix) or `\\.\pipe\owl-messaging` (Windows named pipe). The daemon creates the socket on startup and removes it on shutdown.

**Wire protocol:** Simple line-delimited text. Each command is one line: `COMMAND arg1 arg2 ...`. Responses are newline-delimited. Message bodies use length-prefix framing to handle multi-line content.

```
CLIENT -> DAEMON:  DELIVER target-id from-id 42\n<42 bytes of body>
DAEMON -> CLIENT:  OK SENT:target-id\n

CLIENT -> DAEMON:  POLL my-id listen --setup\n
DAEMON -> CLIENT:  OK READY:my-id\n
                   (connection held open until message arrives)
DAEMON -> CLIENT:  MSG 57\n<57 bytes: __REPLY_TO__:sender\nbody>
```
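
As an illustration of how small the parser for this protocol can be, the sketch below decodes the two header lines shown above. The `Command` enum and `parse_header` are hypothetical names, and only `DELIVER` and `POLL` are covered; the remaining commands would follow the same shape.

```rust
/// Parsed form of a command header line (illustrative subset).
#[derive(Debug, PartialEq)]
enum Command {
    /// `DELIVER <target> <from> <body_len>`; the body follows the newline.
    Deliver { target: String, from: String, body_len: usize },
    /// `POLL <id> [args...]`
    Poll { id: String, args: Vec<String> },
}

/// Parse one header line (with or without its trailing newline).
fn parse_header(line: &str) -> Result<Command, String> {
    let mut parts = line.split_whitespace();
    match parts.next() {
        Some("DELIVER") => {
            let target = parts.next().ok_or("DELIVER: missing target")?.to_string();
            let from = parts.next().ok_or("DELIVER: missing sender")?.to_string();
            let body_len = parts
                .next()
                .ok_or("DELIVER: missing body length")?
                .parse::<usize>()
                .map_err(|e| format!("DELIVER: bad body length: {}", e))?;
            Ok(Command::Deliver { target, from, body_len })
        }
        Some("POLL") => {
            let id = parts.next().ok_or("POLL: missing id")?.to_string();
            // Everything after the ID is passed through as positional args/flags.
            Ok(Command::Poll { id, args: parts.map(str::to_string).collect() })
        }
        Some(other) => Err(format!("unknown command: {}", other)),
        None => Err("empty command line".to_string()),
    }
}
```

After a `Deliver` header is parsed, the daemon would read exactly `body_len` bytes from the connection as the message body.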

**Why not gRPC, HTTP, or protobuf:** They would be overkill. The protocol has 7 commands with positional args and plain-text bodies. A custom line protocol is simpler, faster, needs no extra dependencies or code generation step (no protobuf compiler), and matches the existing text-based output format exactly.

## Blocking Poll Semantics

The current bash `_poll_once` blocks using `sleep 1` in a loop checking for `.msg` files. The native replacement eliminates this polling entirely:

1. Client sends `POLL <id>` to daemon
2. Daemon registers the client connection in a `waiting_clients: HashMap<ID, Sender<Message>>` map
3. When a message arrives for that ID (via `DELIVER`), the daemon sends it through the channel
4. The client connection receives the message and returns it to the calling process's stdout
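
The four steps above can be sketched with standard-library channels. This is a simplified stand-in: the daemon described here would use Tokio channels, and the `Router` and `Message` types below are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, PartialEq)]
struct Message {
    from: String,
    body: String,
}

#[derive(Default)]
struct Router {
    /// One sender per perch that currently has a blocked poll.
    waiting_clients: HashMap<String, Sender<Message>>,
}

impl Router {
    /// Step 2: register a poller and return the receiver it blocks on.
    fn register_poll(&mut self, id: &str) -> Receiver<Message> {
        let (tx, rx) = channel();
        self.waiting_clients.insert(id.to_string(), tx);
        rx
    }

    /// Step 3: wake the poller for `id`. Returns false when nobody is
    /// waiting; a full router would queue the message in that case.
    fn deliver(&mut self, id: &str, msg: Message) -> bool {
        match self.waiting_clients.remove(id) {
            Some(tx) => tx.send(msg).is_ok(),
            None => false,
        }
    }
}
```

The connection handler for a `POLL` would call `rx.recv()` (step 4), which parks the thread until `deliver` sends, so no sleep loop is involved.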

**For pulse intervals:** The daemon starts a tokio timer when registering a poll with `--pulse-interval N`. Whichever fires first -- message arrival or timer expiry -- wakes the client. Timer expiry sends `PULSE_TRIGGER (<timestamp>)` just like the current system.
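
A minimal sketch of the "whichever fires first" rule, using `recv_timeout` from the standard library in place of a Tokio timer. The `Wake` enum and function name are assumptions for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why a blocked poll woke up.
#[derive(Debug, PartialEq)]
enum Wake {
    /// A message arrived before the pulse interval elapsed.
    Message(String),
    /// The pulse interval elapsed first; the caller emits PULSE_TRIGGER.
    Pulse,
    /// The sending side went away (for example, the daemon is shutting down).
    Disconnected,
}

/// Block until a message arrives or `pulse_interval` elapses.
fn wait_for_message_or_pulse(rx: &Receiver<String>, pulse_interval: Duration) -> Wake {
    match rx.recv_timeout(pulse_interval) {
        Ok(body) => Wake::Message(body),
        Err(RecvTimeoutError::Timeout) => Wake::Pulse,
        Err(RecvTimeoutError::Disconnected) => Wake::Disconnected,
    }
}
```

In an async daemon the same race would typically be written as a `select` between the channel receive and a sleep future.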

**BUSY state:** When a poll returns a message, the daemon marks the perch as BUSY (no active poll connection). Messages arriving while BUSY queue in memory. When the agent re-polls, queued messages are delivered immediately.
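
The BUSY behaviour amounts to a two-state machine per perch. The sketch below is one possible shape; `PerchState`, `on_deliver`, and `on_poll` are hypothetical names.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum PerchState {
    /// A poll connection is open and waiting.
    Polling,
    /// The agent is processing a message; no poll connection is open.
    Busy,
}

struct Perch {
    state: PerchState,
    queue: VecDeque<String>,
}

impl Perch {
    fn new() -> Self {
        Perch { state: PerchState::Busy, queue: VecDeque::new() }
    }

    /// A message arrives. Returns it when a poll is waiting (push now);
    /// otherwise queues it and returns `None`.
    fn on_deliver(&mut self, msg: String) -> Option<String> {
        match self.state {
            PerchState::Polling => {
                self.state = PerchState::Busy;
                Some(msg)
            }
            PerchState::Busy => {
                self.queue.push_back(msg);
                None
            }
        }
    }

    /// The agent re-polls. Returns the oldest queued message immediately,
    /// or `None` when the queue is empty and the poll must block.
    fn on_poll(&mut self) -> Option<String> {
        match self.queue.pop_front() {
            Some(msg) => {
                self.state = PerchState::Busy;
                Some(msg)
            }
            None => {
                self.state = PerchState::Polling;
                None
            }
        }
    }
}
```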

## Preserving the CLI Interface

**The `$OWL` env var points to the new binary instead of `bash owl.sh`.** All subcommands and output formats remain identical:

| Current | New | Change |
|---------|-----|--------|
| `$OWL poll <id> --setup` | `owl poll <id> --setup` | None -- same args, same stdout/stderr output |
| `$OWL deliver <id> <from> <<< "body"` | `owl deliver <id> <from> <<< "body"` | None -- reads stdin, same output |
| `$OWL list` | `owl list` | None -- same formatted output |
| `$OWL stop <id>` | `owl stop <id>` | None |
| `$OWL cleanup-session` | `owl cleanup-session` | None |
| `$OWL session-resume` | `owl session-resume` | None |

**Critical:** The output format (ANSI colors, `TAG:value` tokens, stderr for status, stdout for content) must be byte-identical. SKILL.md teaches agents to parse these exact patterns. The SKILL.md file itself does not need to change -- it references `$OWL <command>` which is agnostic to the underlying binary.

**Settings.json change:** Only the env var value changes:
```json
{
  "env": {
    "OWL": "C:/Users/.../.claude/skills/owl/owl",
    "LIVE": "C:/Users/.../.claude/skills/owl/owl live"
  }
}
```

Note: `$LIVE` commands become subcommands of the same binary (`owl live start`, `owl live stop`, etc.) rather than a separate script. The `$LIVE` env var can point to `owl live` as a prefix.

## Psyche Wrapper Loop

The current `live.sh start` launches a background bash subshell that:
1. Initializes a `claude -p` session with `--agents` JSON and `--agent owl-psyche`
2. Loops: blocks on `$OWL poll <psyche-id>`, feeds output to `claude --resume`
3. Exits when the perch is removed

**In the native daemon, this becomes a supervised async task:**

```
Daemon
  └─ PsycheSupervisor
       └─ Task per Psyche:
            1. Spawn `claude -p --agents ... --agent owl-psyche --name <session>`
            2. Loop:
               a. Await message from internal channel (daemon routes messages here)
               b. Spawn `claude -p --resume <session>` with message as stdin
               c. If perch removed, break
            3. On exit: clean up perch, remove from registry
```

**Advantages over bash wrapper:**
- No PID file management -- the daemon owns the child processes directly
- No background subshell -- tokio task is lightweight
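- Graceful shutdown is task cancellation plus waiting for the child to exit, not `kill -USR1` + `sleep 0.5` + `kill -9`. Note that `task.abort()` alone does not terminate a spawned child process; with `tokio::process` the child must be killed explicitly or spawned with `kill_on_drop(true)`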
- Generation tracking is an atomic counter, not a file read/increment/write

**The daemon still shells out to `claude -p`** because claude is an external binary. The native runtime manages the process lifecycle, not the conversation logic.
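
A rough sketch of the process-management half, using blocking `std::process` for brevity (the daemon would likely use an async equivalent). The `-p` and `--resume` flags are the ones named above; `resume_command`, `feed_message`, and `next_generation` are hypothetical helpers.

```rust
use std::io::{self, Write};
use std::process::{Child, Command, Stdio};
use std::sync::atomic::{AtomicU64, Ordering};

/// Generation counter kept in memory instead of a file on disk.
static GENERATION: AtomicU64 = AtomicU64::new(0);

/// Atomically bump and return the generation number.
fn next_generation() -> u64 {
    GENERATION.fetch_add(1, Ordering::SeqCst) + 1
}

/// Build (but do not spawn) the resume invocation for one incoming message.
fn resume_command(session: &str) -> Command {
    let mut cmd = Command::new("claude");
    cmd.arg("-p").arg("--resume").arg(session).stdin(Stdio::piped());
    cmd
}

/// Spawn the command and write `message` to its stdin. The stdin handle is
/// dropped at the end of the `if let`, which signals EOF to the child.
fn feed_message(mut cmd: Command, message: &str) -> io::Result<Child> {
    let mut child = cmd.spawn()?;
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(message.as_bytes())?;
    }
    Ok(child)
}
```

Because the supervisor holds the returned `Child`, it can wait on or kill the process directly, which is what removes the need for PID files.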

## Graceful Migration Strategy

**Phase 1: Build the native binary with file-based fallback.**

The binary can operate in two modes:
- **Daemon mode:** Full IPC via local socket
- **File-compat mode:** Reads/writes `.msg` files in `~/.claude/owlery/` just like the bash scripts

File-compat mode allows a single native binary to replace `owl.sh` immediately without requiring all agents to upgrade simultaneously. The daemon can also monitor the owlery directory for file-based messages from any remaining bash-based agents.

**Phase 2: Daemon mode becomes default.**

Once the native binary is proven stable, daemon mode becomes the default. File-based fallback remains available via `--file-compat` flag.

**Phase 3: Remove file-based fallback.**

After sufficient time, the file-compat code is removed. The owlery directory is only used for the daemon socket and possibly state persistence.

**Coexistence during migration:**

| Scenario | Works? | How |
|----------|--------|-----|
| Native daemon + bash agent | YES | Daemon monitors owlery for .msg files from bash agents, or bash agent is updated to call native binary |
| Bash script + native client | YES | Native client writes .msg files when no daemon is running (file-compat mode) |
| Mixed native daemon + native clients | YES | Primary mode |
| Multiple daemons | NO | Socket lock prevents this. First daemon wins, others become clients. |

**Migration checklist:**
1. Build native binary with identical CLI interface
2. Test output byte-compatibility with bash scripts
3. Replace `$OWL`/`$LIVE` env vars to point at native binary
4. SKILL.md and LIVE-SKILL.md require zero changes (they reference `$OWL`/`$LIVE`)
5. Deploy to `~/.claude/skills/owl/owl` (or `owl.exe` on Windows)

## Patterns to Follow

### Pattern 1: Auto-Start Daemon

**What:** Every CLI command checks if the daemon is running. If not, it spawns the daemon as a detached background process (Windows has no `fork`, so this is a process spawn on every platform), waits for the socket to become available, then proceeds.

**When:** Every CLI invocation.

**Example flow:**
```
owl poll my-id --setup
  -> try connect to socket
  -> ECONNREFUSED or socket missing
  -> spawn detached: owl daemon --background
  -> retry connect (with 100ms backoff, 3 attempts)
  -> send POLL command
```

**Why:** Agents should never have to think about daemon lifecycle. The first `$OWL` call in a session starts it; `cleanup-session` or `stop --all` shuts it down.
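
The connect, start, retry flow can be expressed as a small generic helper. This is a sketch under stated assumptions: `connect_or_start` is a hypothetical name, and the real `connect` and `start_daemon` closures would wrap the local-socket connect call and a spawn of the current executable with the `daemon` subcommand.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Try `connect` once. On failure, call `start_daemon` exactly once and then
/// retry up to `attempts` times, sleeping `backoff` before each retry.
/// Returns the last connection error if every attempt fails.
fn connect_or_start<T, E>(
    mut connect: impl FnMut() -> Result<T, E>,
    start_daemon: impl FnOnce(),
    attempts: u32,
    backoff: Duration,
) -> Result<T, E> {
    // First attempt: the daemon may already be running.
    let mut last_err = match connect() {
        Ok(conn) => return Ok(conn),
        Err(e) => e,
    };
    start_daemon();
    for _ in 0..attempts {
        sleep(backoff);
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

With the values from the flow above, the call would pass `3` attempts and `Duration::from_millis(100)`.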

### Pattern 2: Length-Prefixed Message Framing

**What:** Multi-line message bodies are sent as `<length>\n<body>` rather than relying on delimiters.

**When:** Any command that transmits message content (deliver, reply, send).

**Why:** Message bodies can contain any text, including newlines. The current `.msg` file format uses `__REPLY_TO__:` as the first line with the rest being body. The wire protocol needs explicit framing since there are no file boundaries.
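
A minimal sketch of `<length>\n<body>` framing over any byte stream. `write_frame` and `read_frame` are hypothetical names; the length is the body size in bytes, written as decimal text.

```rust
use std::io::{self, BufRead, Write};

/// Write one frame: the body length in bytes, a newline, then the raw body.
fn write_frame<W: Write>(out: &mut W, body: &[u8]) -> io::Result<()> {
    writeln!(out, "{}", body.len())?;
    out.write_all(body)
}

/// Read one frame written by `write_frame`. Fails with `InvalidData` if the
/// length line is not a number, and with `UnexpectedEof` if the stream ends
/// before the full body has arrived.
fn read_frame<R: BufRead>(input: &mut R) -> io::Result<Vec<u8>> {
    let mut header = String::new();
    input.read_line(&mut header)?;
    let len: usize = header
        .trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    let mut body = vec![0u8; len];
    input.read_exact(&mut body)?;
    Ok(body)
}
```

Because the reader consumes exactly `len` bytes, newlines inside the body (including a leading `__REPLY_TO__:` line) need no escaping.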

### Pattern 3: Graceful Shutdown Chain

**What:** On shutdown, the daemon: (1) sends termination to all Psyche child processes, (2) drains pending messages, (3) notifies waiting poll clients with a disconnect, (4) removes the socket file.

**When:** `stop --all`, `cleanup-session`, or daemon receives SIGTERM (on Windows, the equivalent console control event).

**Why:** The current system has a two-phase kill (USR1 + sleep + kill -9). The native system should do this cleanly via async task cancellation.
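
Steps 2 and 3 of the chain can be sketched with standard-library channels: pending messages are flushed to any waiting poller, and dropping the senders is what tells blocked pollers to disconnect. The `Daemon` struct here is a hypothetical reduction; steps 1 and 4 (terminating children, removing the socket file) are left as comments.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

struct Daemon {
    waiting_clients: HashMap<String, Sender<String>>,
    pending: HashMap<String, Vec<String>>,
}

impl Daemon {
    /// Run steps 2 and 3. Returns the (perch ID, message) pairs that could
    /// not be handed to any waiting client.
    fn shutdown(&mut self) -> Vec<(String, String)> {
        // Step 1 (not shown): terminate Psyche child processes.
        let mut undelivered = Vec::new();
        // Step 2: drain pending messages to pollers that are still connected.
        for (id, queue) in self.pending.drain() {
            for msg in queue {
                let sent = self
                    .waiting_clients
                    .get(&id)
                    .map_or(false, |tx| tx.send(msg.clone()).is_ok());
                if !sent {
                    undelivered.push((id.clone(), msg));
                }
            }
        }
        // Step 3: dropping every sender makes blocked receivers observe a
        // disconnect once they have read any messages already sent.
        self.waiting_clients.clear();
        // Step 4 (not shown): remove the socket file.
        undelivered
    }
}
```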

### Pattern 4: State Persistence for Session Resume

**What:** The daemon writes a snapshot of the perch table to `~/.claude/owlery/state.json` whenever the table changes. On restart, it reads this file to know which perches were active.

**When:** On perch creation/deletion and on daemon shutdown.

**Why:** `session-resume` currently scans the owlery directory for dead perches. With a daemon, the directory may be empty (everything is in-memory). The state file bridges daemon restarts.
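
One way to keep the snapshot from ever being observed half-written is to write a temporary file and rename it into place. The sketch below assumes a flat `{ "<perch-id>": "<session>" }` layout, which is an illustrative guess at the schema, and hand-rolls the JSON to stay dependency-free; a real implementation would more likely use a JSON library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Encode `s` as a JSON string literal (quotes, backslashes, control chars).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialize (perch ID, session) pairs as a single JSON object.
fn snapshot_json(perches: &[(String, String)]) -> String {
    let fields: Vec<String> = perches
        .iter()
        .map(|(id, session)| format!("{}:{}", json_string(id), json_string(session)))
        .collect();
    format!("{{{}}}", fields.join(","))
}

/// Write the snapshot next to `path` under a temporary name, then rename it
/// over `path` so readers see either the old or the new file.
fn write_snapshot(path: &Path, perches: &[(String, String)]) -> io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    fs::write(&tmp, snapshot_json(perches))?;
    fs::rename(&tmp, path)
}
```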

## Anti-Patterns to Avoid

### Anti-Pattern 1: HTTP/REST for Local IPC

**What:** Using an HTTP server (e.g., hyper, actix) as the daemon transport.

**Why bad:** HTTP adds ~1ms overhead per request (header parsing, connection management) with no benefit for local single-machine IPC. The owl protocol has 7 commands -- an HTTP API adds route parsing, serialization, and content-type handling for zero gain. Named pipes/UDS are lower latency and simpler.

**Instead:** Use `interprocess::local_socket` with a custom line protocol.

### Anti-Pattern 2: SQLite as Message Store

**What:** Using SQLite (e.g., via `rusqlite`) as the message queue backend.

**Why bad:** Messages are transient (write-once, read-once, delete-on-read). A database adds write-ahead log overhead, VACUUM concerns, and file locking complexity for what is fundamentally an in-memory queue. The current file-based system's problems (disk I/O, polling) would be partially replicated.

**Instead:** In-memory `VecDeque<Message>` per perch. Persistence only needed for the perch table metadata (for session-resume), not messages.

### Anti-Pattern 3: Async Runtime for the CLI Client

**What:** Pulling in Tokio for the thin CLI client side.

**Why bad:** Each `$OWL deliver` invocation spawns a new process, connects to the socket, sends one command, gets one response, and exits. An async runtime's startup cost (~1ms) is wasted here. The client only makes one blocking call.

**Instead:** Use synchronous `std::io` for the client. Only the daemon needs Tokio (for multiplexing connections and managing timers/tasks). Because both modes live in one binary, branch at runtime: construct the Tokio runtime only inside the `daemon` subcommand path instead of wrapping `main` in `#[tokio::main]`.
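
A blocking client exchange needs very little code. The sketch below is generic over a reader and a writer so it can be exercised in memory; `exchange` is a hypothetical name, and wiring it to the actual local-socket stream type is left out.

```rust
use std::io::{self, BufRead, Write};

/// Send one command line, then block until one response line comes back.
/// The returned string has its trailing newline removed.
fn exchange<R: BufRead, W: Write>(
    reader: &mut R,
    writer: &mut W,
    command: &str,
) -> io::Result<String> {
    writer.write_all(command.as_bytes())?;
    writer.write_all(b"\n")?;
    writer.flush()?;
    let mut response = String::new();
    reader.read_line(&mut response)?;
    Ok(response.trim_end().to_string())
}
```

For a command such as `POLL`, the same function simply stays blocked in `read_line` until the daemon pushes a line, so the client never needs an event loop.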

### Anti-Pattern 4: Daemon as Mandatory First Step

**What:** Requiring users to run `owl daemon start` before any other command works.

**Why bad:** Breaks the zero-setup promise. The current system works the moment files are copied.

**Instead:** Auto-start daemon on first CLI call. The user never knows the daemon exists.

## Scalability Considerations

| Concern | 5 agents (typical) | 50 agents (stress) | 500 agents (theoretical) |
|---------|---------------------|---------------------|--------------------------|
| Memory | ~1MB (perch table + queues) | ~10MB | ~100MB (still fine) |
| Connections | 5 concurrent sockets | 50 sockets (well within OS limits) | 500 sockets (may require raising the per-process file descriptor limit on Unix) |
| Message latency | <1ms (local socket push) | <1ms | <5ms (contention on router lock) |
| Daemon startup | ~10ms | ~10ms | ~10ms |
| Psyche processes | 1-3 `claude -p` children | 10+ (CPU-bound, not daemon's problem) | Impractical (claude itself is the bottleneck) |

The daemon is not the bottleneck in any realistic scenario. Claude Code agents themselves are the expensive resource.

## Build Order (Suggested Phases)

1. **CLI skeleton + daemon loop** -- Parse all subcommands with clap, start local socket listener, accept connections. No message routing yet.
2. **Perch management + message routing** -- Implement setup, deliver, list, stop. Messages flow through daemon. Poll blocks correctly.
3. **Output compatibility layer** -- Match ANSI output byte-for-byte with bash scripts. Test against captured bash output.
4. **Psyche supervisor** -- Port the wrapper loop from live.sh. Launch/monitor claude processes.
5. **Session management** -- cleanup-session, session-resume, state persistence.
6. **File-compat bridge** -- Optional: monitor owlery directory for .msg files from legacy bash agents.
7. **Cross-platform builds** -- CI pipeline for Windows, Linux, macOS binaries.

## Sources

- [interprocess crate (cross-platform local sockets)](https://github.com/kotauskas/interprocess) -- HIGH confidence, actively maintained, standard Rust IPC library
- [interprocess docs](https://docs.rs/interprocess) -- local_socket module provides named pipe (Windows) / UDS (Unix) abstraction
- [clap subcommand patterns](https://docs.rs/clap/latest/clap/trait.Subcommand.html) -- standard Rust CLI argument parsing
- [Rust cross-compilation guide](https://rust-lang.github.io/rustup/cross-compilation.html) -- official rustup documentation
- [cross tool for CI builds](https://github.com/cross-rs/cross) -- Docker-based cross-compilation
- [Single binary gRPC server-client pattern](https://tjtelan.com/blog/lets-build-a-single-binary-grpc-server-client-with-rust-in-2020/) -- architectural reference (we use local sockets, not gRPC, but the single-binary pattern applies)

---

*Architecture research: 2026-03-29*
