# Architecture Patterns: Native Runtime Replacement

**Domain:** Native IPC runtime replacing bash-based agent messaging
**Researched:** 2026-03-29
**Confidence:** MEDIUM (IPC mechanisms well-understood; integration with Claude Code agent model has some unknowns)

## Recommended Architecture

**Central daemon + thin CLI client in a single binary.** The binary operates in two modes selected by subcommand: `owl daemon` starts the broker, everything else (`owl poll`, `owl deliver`, etc.) connects as a client.

```
                ┌────────────────────────────────────────┐
                │        owl daemon (long-lived)         │
                │                                        │
                │  ┌──────────────────────────────────┐  │
                │  │  Message Router (in-memory)      │  │
                │  │                                  │  │
                │  │  perch_table: HashMap            │  │
                │  │  msg_queues: HashMap             │  │
                │  └────────────┬─────────────────────┘  │
                │               │                        │
                │  ┌────────────┴─────────────────────┐  │
                │  │  Local Socket Listener           │  │
                │  │  (named pipe Win / UDS Unix)     │  │
                │  └────────────┬─────────────────────┘  │
                │               │                        │
                │  ┌────────────┴─────────────────────┐  │
                │  │  Psyche Process Supervisor       │  │
                │  │  (spawns/monitors claude -p)     │  │
                │  └──────────────────────────────────┘  │
                └───────────────┬────────────────────────┘
                                │ local socket
       ┌────────────────────────┼────────────────────────┐
       │                        │                        │
┌──────┴──────┐          ┌──────┴──────┐          ┌──────┴──────┐
│  owl poll   │          │ owl deliver │          │  owl list   │
│  (agent A)  │          │  (agent B)  │          │  (agent C)  │
│ CLI client  │          │ CLI client  │          │ CLI client  │
└─────────────┘          └─────────────┘          └─────────────┘
```

### Why Daemon + CLI, Not Peer-to-Peer

1. **Blocking poll semantics require a persistent broker.** Agents call `$OWL poll ` which blocks until a message arrives. With peer-to-peer, each agent would need its own listener -- the current file-based system already does this but with the overhead of disk writes and polling loops. A central daemon can use condition variables or async channels to wake blocked clients instantly with zero polling.

2. **State centralization simplifies lifecycle management.** The daemon holds the perch table, pending messages, and liveness state in memory. `list`, `stop --all`, `cleanup-session`, and stale detection become instant lookups rather than filesystem scans. Session scoping (which perches belong to which session) is a HashMap filter.

3. **Single point of coordination for Psyche processes.** The daemon can own the Psyche wrapper loop (launching `claude -p`, feeding messages on resume). This eliminates the current pattern of background bash subshells with PID files.

4. **Peer-to-peer adds complexity with no benefit here.** All agents run on the same machine in the same user context. There is no network partitioning to handle, no multi-host routing. A local daemon is simpler and faster.

### Why Single Binary, Not Separate Daemon and Client

1. **Distribution simplicity.** One file to copy: `owl.exe` (or `owl` on Unix). The current system's value proposition is "copy files and go" -- maintaining that with a single binary preserves it.

2. **Auto-start pattern.** Any CLI command can check whether the daemon is running and start it automatically if not. This means agents never need to explicitly start the daemon -- the first `$OWL` call bootstraps it.

3. **Version coherence.** Daemon and client are always the same version. No compatibility matrix.
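
The two-mode dispatch and the auto-start behaviour described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the project's implementation: `Mode`, `select_mode`, and `connect_or_start` are hypothetical names, and the `connect` and `start_daemon` hooks stand in for the real socket connect and the real "spawn `owl daemon` in the background" step, neither of which is shown here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Which half of the single binary runs for this invocation (hypothetical type).
#[derive(Debug, PartialEq)]
enum Mode {
    /// `owl daemon`: start the broker.
    Daemon,
    /// Any other subcommand: act as a thin client of the daemon.
    Client { command: String, args: Vec<String> },
}

/// Picks the mode from the arguments that follow the program name.
/// Returns `None` when no subcommand was given.
fn select_mode(argv: &[String]) -> Option<Mode> {
    let (first, rest) = argv.split_first()?;
    if first.as_str() == "daemon" {
        Some(Mode::Daemon)
    } else {
        Some(Mode::Client {
            command: first.clone(),
            args: rest.to_vec(),
        })
    }
}

/// Tries to connect once; on failure starts the daemon a single time and
/// retries up to `retries` times, sleeping `backoff` before each retry.
/// `connect` and `start_daemon` are caller-supplied hooks (assumptions).
fn connect_or_start<C, E>(
    mut connect: impl FnMut() -> Result<C, E>,
    start_daemon: impl FnOnce(),
    retries: u32,
    backoff: Duration,
) -> Result<C, E> {
    let mut last_err = match connect() {
        Ok(conn) => return Ok(conn),
        Err(e) => e,
    };
    start_daemon();
    for _ in 0..retries {
        sleep(backoff);
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(e) => last_err = e,
        }
    }
    // Every attempt failed; report the most recent error.
    Err(last_err)
}
```

Keeping the connect and start steps as closures means the retry logic can be exercised without a real socket, and the same helper would serve every client subcommand.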
### Component Boundaries

| Component | Responsibility | Communicates With |
|-----------|---------------|-------------------|
| **CLI Parser** | Parse subcommands and args, dispatch to daemon mode or client mode | All components (entry point) |
| **Daemon / Broker** | Accept connections, route messages, manage perch state, supervise Psyche processes | CLI clients via local socket |
| **Perch Manager** | Register/unregister perches, track PID/session metadata, detect stale perches | Daemon (internal module) |
| **Message Router** | Queue messages per perch, notify blocked poll clients when messages arrive | Daemon (internal module) |
| **Psyche Supervisor** | Launch `claude -p` processes, feed poll results to `claude --resume`, track generations | Daemon (internal module) |
| **Client Stub** | Connect to daemon socket, send command, receive response, print to stdout/stderr | Daemon via local socket |

### Data Flow

**Message delivery (agent A sends to agent B):**

1. Agent B's CLI runs `owl poll B-id --setup` -- client connects to daemon, sends `POLL B-id --setup`
2. Daemon creates perch for B-id, registers the connection as a "waiting client" for B-id
3. Agent A's CLI runs `owl deliver B-id A-id <<< "hello"` -- client connects, sends `DELIVER B-id A-id hello`
4. Daemon looks up B-id in perch table, finds a waiting client, pushes the message to that client's connection
5. B's waiting client receives the message, prints it to stdout (with `__REPLY_TO__:A-id` header), disconnects
6. B's agent processes the message, re-runs `owl poll B-id` to wait for the next one

**Key difference from current system:** Step 4 is instant (no 1-second sleep loop). The daemon holds B's connection open and pushes the message the moment it arrives.

**Ephemeral send (one-shot with reply wait):**

1. Agent A's CLI runs `owl send target-id my-id <<< "msg"` -- client connects, sends `SEND target-id my-id msg`
2. Daemon creates an ephemeral perch for my-id, delivers the message to target-id, holds the connection open
3. When a reply arrives at my-id's ephemeral perch, daemon pushes it to A's held connection
4. Client prints the reply, disconnects. Daemon auto-removes the ephemeral perch.

## IPC Transport: Local Sockets via `interprocess`

**Use the `interprocess` crate** (`local_socket` module). It provides a unified API that compiles to named pipes on Windows and Unix domain sockets on Unix/macOS. Confidence: HIGH -- this is the standard Rust cross-platform local IPC library, actively maintained, supports async via Tokio.

**Socket path:** `~/.claude/owlery/owl.sock` (Unix) or `\\.\pipe\owl-messaging` (Windows named pipe). The daemon creates the socket on startup and removes it on shutdown.

**Wire protocol:** Simple line-delimited text. Each command is one line: `COMMAND arg1 arg2 ...`. Responses are newline-delimited. Message bodies use length-prefix framing to handle multi-line content.

```
CLIENT -> DAEMON:  DELIVER target-id from-id 42\n<42 bytes of body>
DAEMON -> CLIENT:  OK SENT:target-id\n

CLIENT -> DAEMON:  POLL my-id listen --setup\n
DAEMON -> CLIENT:  OK READY:my-id\n
                   (connection held open until message arrives)
DAEMON -> CLIENT:  MSG 57\n<57 bytes: __REPLY_TO__:sender\nbody>
```

Why not gRPC, HTTP, or protobuf: Overkill. The protocol has 7 commands with positional args and plain-text bodies. A custom line protocol is simpler, faster, zero-dependency (no protobuf compiler needed), and matches the existing text-based output format exactly.

## Blocking Poll Semantics

The current bash `_poll_once` blocks using `sleep 1` in a loop checking for `.msg` files. The native replacement eliminates this polling entirely:

1. Client sends `POLL ` to daemon
2. Daemon registers the client connection in a `waiting_clients` map (a `HashMap` from perch ID to a channel for that client)
3. When a message arrives for that ID (via `DELIVER`), the daemon sends it through the channel
4. The client connection receives the message and returns it to the calling process's stdout

**For pulse intervals:** The daemon starts a tokio timer when registering a poll with `--pulse-interval N`. Whichever fires first -- message arrival or timer expiry -- wakes the client. Timer expiry sends `PULSE_TRIGGER ()` just like the current system.

**BUSY state:** When a poll returns a message, the daemon marks the perch as BUSY (no active poll connection). Messages arriving while BUSY queue in memory. When the agent re-polls, queued messages are delivered immediately.

## Preserving the CLI Interface

**The `$OWL` env var points to the new binary instead of `bash owl.sh`.** All subcommands and output formats remain identical:

| Current | New | Change |
|---------|-----|--------|
| `$OWL poll --setup` | `owl poll --setup` | None -- same args, same stdout/stderr output |
| `$OWL deliver <<< "body"` | `owl deliver <<< "body"` | None -- reads stdin, same output |
| `$OWL list` | `owl list` | None -- same formatted output |
| `$OWL stop ` | `owl stop ` | None |
| `$OWL cleanup-session` | `owl cleanup-session` | None |
| `$OWL session-resume` | `owl session-resume` | None |

**Critical:** The output format (ANSI colors, `TAG:value` tokens, stderr for status, stdout for content) must be byte-identical. SKILL.md teaches agents to parse these exact patterns. The SKILL.md file itself does not need to change -- it references `$OWL ` which is agnostic to the underlying binary.

**Settings.json change:** Only the env var value changes:

```json
{
  "env": {
    "OWL": "C:/Users/.../.claude/skills/owl/owl",
    "LIVE": "C:/Users/.../.claude/skills/owl/owl live"
  }
}
```

Note: `$LIVE` commands become subcommands of the same binary (`owl live start`, `owl live stop`, etc.) rather than a separate script. The `$LIVE` env var can point to `owl live` as a prefix.

## Psyche Wrapper Loop

The current `live.sh start` launches a background bash subshell that:

1. Initializes a `claude -p` session with `--agents` JSON and `--agent owl-psyche`
2. Loops: blocks on `$OWL poll `, feeds output to `claude --resume`
3. Exits when the perch is removed

**In the native daemon, this becomes a supervised async task:**

```
Daemon
  └─ PsycheSupervisor
       └─ Task per Psyche:
            1. Spawn `claude -p --agents ... --agent owl-psyche --name `
            2. Loop:
               a. Await message from internal channel (daemon routes messages here)
               b. Spawn `claude -p --resume ` with message as stdin
               c. If perch removed, break
            3. On exit: clean up perch, remove from registry
```

**Advantages over bash wrapper:**

- No PID file management -- the daemon owns the child processes directly
- No background subshell -- tokio task is lightweight
- Graceful shutdown is `task.abort()` + wait, not `kill -USR1` + `sleep 0.5` + `kill -9`
- Generation tracking is an atomic counter, not a file read/increment/write

**The daemon still shells out to `claude -p`** because claude is an external binary. The native runtime manages the process lifecycle, not the conversation logic.

## Graceful Migration Strategy

**Phase 1: Build the native binary with file-based fallback.**

The binary can operate in two modes:

- **Daemon mode (default):** Full IPC via local socket
- **File-compat mode:** Reads/writes `.msg` files in `~/.claude/owlery/` just like the bash scripts

File-compat mode allows a single native binary to replace `owl.sh` immediately without requiring all agents to upgrade simultaneously. The daemon can also monitor the owlery directory for file-based messages from any remaining bash-based agents.

**Phase 2: Daemon mode becomes default.**

Once the native binary is proven stable, daemon mode becomes the default. File-based fallback remains available via `--file-compat` flag.

**Phase 3: Remove file-based fallback.**

After sufficient time, the file-compat code is removed. The owlery directory is only used for the daemon socket and possibly state persistence.
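
To make file-compat mode a little more concrete, the sketch below parses and renders the legacy `.msg` layout, in which the first line is a `__REPLY_TO__:` header and the rest is the body. It is only an illustration: `FileMessage`, `parse_msg_file`, and `render_msg_file` are hypothetical names, and the handling of a missing header or a trailing `\r` is an assumption rather than documented behaviour of the bash scripts.

```rust
/// Header prefix used on the first line of a legacy `.msg` file.
const REPLY_TO_PREFIX: &str = "__REPLY_TO__:";

/// A message as it might be stored in a `.msg` file (hypothetical type).
#[derive(Debug, PartialEq)]
struct FileMessage {
    /// Sender ID from the header line, if the header is present.
    reply_to: Option<String>,
    /// Everything after the header line, newlines included.
    body: String,
}

/// Splits file contents into the optional reply-to header and the body.
/// Treating a missing header as "no sender" is an assumption of this sketch.
fn parse_msg_file(contents: &str) -> FileMessage {
    match contents.strip_prefix(REPLY_TO_PREFIX) {
        Some(rest) => {
            // The header ends at the first newline; the body is the remainder.
            let (sender, body) = rest.split_once('\n').unwrap_or((rest, ""));
            FileMessage {
                reply_to: Some(sender.trim_end_matches('\r').to_string()),
                body: body.to_string(),
            }
        }
        None => FileMessage {
            reply_to: None,
            body: contents.to_string(),
        },
    }
}

/// Renders a message back into the header-plus-body file layout.
fn render_msg_file(msg: &FileMessage) -> String {
    match &msg.reply_to {
        Some(sender) => format!("{}{}\n{}", REPLY_TO_PREFIX, sender, msg.body),
        None => msg.body.clone(),
    }
}
```

A bridge built along these lines could hand the parsed `reply_to` and `body` to the in-memory router, so file-based and socket-based senders would share the same delivery path.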

**Coexistence during migration:**

| Scenario | Works? | How |
|----------|--------|-----|
| Native daemon + bash agent | YES | Daemon monitors owlery for .msg files from bash agents, or bash agent is updated to call native binary |
| Bash script + native client | YES | Native client writes .msg files when no daemon is running (file-compat mode) |
| Mixed native daemon + native clients | YES | Primary mode |
| Multiple daemons | NO | Socket lock prevents this. First daemon wins, others become clients. |

**Migration checklist:**

1. Build native binary with identical CLI interface
2. Test output byte-compatibility with bash scripts
3. Replace `$OWL`/`$LIVE` env vars to point at native binary
4. SKILL.md and LIVE-SKILL.md require zero changes (they reference `$OWL`/`$LIVE`)
5. Deploy to `~/.claude/skills/owl/owl` (or `owl.exe` on Windows)

## Patterns to Follow

### Pattern 1: Auto-Start Daemon

**What:** Every CLI command checks if the daemon is running. If not, fork the daemon in the background, wait for socket to become available, then proceed.

**When:** Every CLI invocation.

**Example flow:**

```
owl poll my-id --setup
  -> try connect to socket
  -> ECONNREFUSED or socket missing
  -> fork: owl daemon --background
  -> retry connect (with 100ms backoff, 3 attempts)
  -> send POLL command
```

**Why:** Agents should never have to think about daemon lifecycle. The first `$OWL` call in a session starts it; `cleanup-session` or `stop --all` shuts it down.

### Pattern 2: Length-Prefixed Message Framing

**What:** Multi-line message bodies are sent as a byte length, a newline, and then the body (as in `42\n<42 bytes of body>`), rather than relying on delimiters.

**When:** Any command that transmits message content (deliver, reply, send).

**Why:** Message bodies can contain any text, including newlines. The current `.msg` file format uses `__REPLY_TO__:` as the first line with the rest being body. The wire protocol needs explicit framing since there are no file boundaries.
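
The framing rule can be illustrated with a small encoder and decoder for the `DELIVER` command, using only the standard library. This is a sketch under assumptions: `write_deliver`, `read_deliver`, and the `Deliver` struct are hypothetical names, IDs are assumed to contain no spaces, and error handling is reduced to `InvalidData` errors.

```rust
use std::io::{self, BufRead, Read, Write};

/// A decoded `DELIVER` frame (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Deliver {
    target: String,
    from: String,
    body: Vec<u8>,
}

/// Writes `DELIVER <target> <from> <len>\n` followed by exactly `len` body bytes.
fn write_deliver<W: Write>(out: &mut W, target: &str, from: &str, body: &[u8]) -> io::Result<()> {
    writeln!(out, "DELIVER {} {} {}", target, from, body.len())?;
    out.write_all(body)
}

/// Reads one frame: a header line, then as many bytes as the header announces.
/// Newlines inside the body are safe because the body is never scanned for them.
fn read_deliver<R: BufRead>(input: &mut R) -> io::Result<Deliver> {
    let invalid = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_string());

    let mut header = String::new();
    input.read_line(&mut header)?;
    let parts: Vec<&str> = header.trim_end().split(' ').collect();
    if parts.len() != 4 || parts[0] != "DELIVER" {
        return Err(invalid("expected `DELIVER target from len`"));
    }
    let len: usize = parts[3].parse().map_err(|_| invalid("length is not a number"))?;

    // Read exactly `len` bytes; a short stream surfaces as an UnexpectedEof error.
    let mut body = vec![0u8; len];
    input.read_exact(&mut body)?;

    Ok(Deliver {
        target: parts[1].to_string(),
        from: parts[2].to_string(),
        body,
    })
}
```

Because the reader consumes exactly the announced number of bytes, several frames can follow one another on the same connection without any delimiter between them.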

### Pattern 3: Graceful Shutdown Chain

**What:** On shutdown, the daemon: (1) sends termination to all Psyche child processes, (2) drains pending messages, (3) notifies waiting poll clients with a disconnect, (4) removes the socket file.

**When:** `stop --all`, `cleanup-session`, or daemon receives SIGTERM.

**Why:** The current system has a two-phase kill (USR1 + sleep + kill -9). The native system should do this cleanly via async task cancellation.

### Pattern 4: State Persistence for Session Resume

**What:** The daemon periodically writes a snapshot of the perch table to `~/.claude/owlery/state.json`. On restart, it reads this file to know which perches were active.

**When:** On perch creation/deletion and on daemon shutdown.

**Why:** `session-resume` currently scans the owlery directory for dead perches. With a daemon, the directory may be empty (everything is in-memory). The state file bridges daemon restarts.

## Anti-Patterns to Avoid

### Anti-Pattern 1: HTTP/REST for Local IPC

**What:** Using an HTTP server (e.g., hyper, actix) as the daemon transport.

**Why bad:** HTTP adds ~1ms overhead per request (header parsing, connection management) with no benefit for local single-machine IPC. The owl protocol has 7 commands -- an HTTP API adds route parsing, serialization, and content-type handling for zero gain. Named pipes/UDS are lower latency and simpler.

**Instead:** Use `interprocess::local_socket` with a custom line protocol.

### Anti-Pattern 2: SQLite as Message Store

**What:** Using SQLite (e.g., via `rusqlite`) as the message queue backend.

**Why bad:** Messages are transient (write-once, read-once, delete-on-read). A database adds write-ahead log overhead, VACUUM concerns, and file locking complexity for what is fundamentally an in-memory queue. The current file-based system's problems (disk I/O, polling) would be partially replicated.

**Instead:** In-memory `VecDeque` per perch. Persistence only needed for the perch table metadata (for session-resume), not messages.

### Anti-Pattern 3: Async Runtime for the CLI Client

**What:** Pulling in Tokio for the thin CLI client side.

**Why bad:** Each `$OWL deliver` invocation spawns a new process, connects to the socket, sends one command, gets one response, and exits. An async runtime's startup cost (~1ms) is wasted here. The client only makes one blocking call.

**Instead:** Use synchronous `std::io` for the client. Only the daemon needs Tokio (for multiplexing connections and managing timers/tasks). Use conditional compilation or runtime branching to only initialize Tokio in daemon mode.

### Anti-Pattern 4: Daemon as Mandatory First Step

**What:** Requiring users to run `owl daemon start` before any other command works.

**Why bad:** Breaks the zero-setup promise. The current system works the moment files are copied.

**Instead:** Auto-start daemon on first CLI call. The user never knows the daemon exists.

## Scalability Considerations

| Concern | 5 agents (typical) | 50 agents (stress) | 500 agents (theoretical) |
|---------|---------------------|---------------------|--------------------------|
| Memory | ~1MB (perch table + queues) | ~10MB | ~100MB (still fine) |
| Connections | 5 concurrent sockets | 50 sockets (well within OS limits) | 500 sockets (may need connection pooling) |
| Message latency | <1ms (local socket push) | <1ms | <5ms (contention on router lock) |
| Daemon startup | ~10ms | ~10ms | ~10ms |
| Psyche processes | 1-3 `claude -p` children | 10+ (CPU-bound, not daemon's problem) | Impractical (claude itself is the bottleneck) |

The daemon is not the bottleneck in any realistic scenario. Claude Code agents themselves are the expensive resource.

## Build Order (Suggested Phases)

1. **CLI skeleton + daemon loop** -- Parse all subcommands with clap, start local socket listener, accept connections. No message routing yet.
2. **Perch management + message routing** -- Implement setup, deliver, list, stop. Messages flow through daemon. Poll blocks correctly.
3. **Output compatibility layer** -- Match ANSI output byte-for-byte with bash scripts. Test against captured bash output.
4. **Psyche supervisor** -- Port the wrapper loop from live.sh. Launch/monitor claude processes.
5. **Session management** -- cleanup-session, session-resume, state persistence.
6. **File-compat bridge** -- Optional: monitor owlery directory for .msg files from legacy bash agents.
7. **Cross-platform builds** -- CI pipeline for Windows, Linux, macOS binaries.

## Sources

- [interprocess crate (cross-platform local sockets)](https://github.com/kotauskas/interprocess) -- HIGH confidence, actively maintained, standard Rust IPC library
- [interprocess docs](https://docs.rs/interprocess) -- local_socket module provides named pipe (Windows) / UDS (Unix) abstraction
- [clap subcommand patterns](https://docs.rs/clap/latest/clap/trait.Subcommand.html) -- standard Rust CLI argument parsing
- [Rust cross-compilation guide](https://rust-lang.github.io/rustup/cross-compilation.html) -- official rustup documentation
- [cross tool for CI builds](https://github.com/japaric/rust-cross) -- Docker-based cross-compilation
- [Single binary gRPC server-client pattern](https://tjtelan.com/blog/lets-build-a-single-binary-grpc-server-client-with-rust-in-2020/) -- architectural reference (we use local sockets, not gRPC, but the single-binary pattern applies)

---

*Architecture research: 2026-03-29*