# ADR 0005: Deploy Topology - Two-App Fly Split + Litestream + Tigris

[doc->REQ-DEP-01] [doc->REQ-DEP-02] [doc->REQ-DEP-03]

**Date:** 2026-05-08

**Phase:** 05 (during planning)

## Status

**Accepted**: locked at start of Phase 5 planning. Re-evaluation gates listed in **Forcing Functions for Re-Open** below; absent any of those triggers, the contract here is binding through Phase 7 PAR-05 (`.bnu` per-user character migration, which reuses the D-17 ssh-sftp-then-delete ritual verbatim).

Supersedes: nothing.

Superseded by: nothing.

## Context

Phase 4 produced a deployable Node 22 + Colyseus 0.17.10 + Better-Auth + better-sqlite3 server at `apps/server`. Phase 5 ships that server to production. The topology decisions are locked in [`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md) §"Environment topology" (D-01..D-04) and are recorded here for permanent reference.

**D-01 (two-app split):** `rebno-staging` + `rebno-prod` are independent Fly apps, each with its own Fly Volume + per-app Tigris bucket. This separation lets RESTORE.md drills and Phase 4 carry-forward verification (kill -9 mid-tick, argon2 bench, multi-client smoke) execute on staging without any prod blast-radius. Two-app fixed cost stays well under $10/mo at the <50 CCU target.

**D-02 (region `lax`):** Both apps in Los Angeles (US-West). Same region for staging and prod so latency profile + Tigris RTT translate 1:1 from RESTORE.md validation runs to prod incidents. The operator is West-Coast-adjacent and the initial audience is 1-5 people, minimizing operator latency. Region is revisitable if the playerbase expands. Custom domains: `staging.rebno.decidel.com` + `rebno.decidel.com` (Namecheap-managed; A/AAAA records added per `fly certs add` output post-deploy). The locked value in `apps/server/fly.staging.toml` and `apps/server/fly.prod.toml` is: `primary_region = "lax"`.

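
The locked region value lends itself to a mechanical check. The following is a minimal sketch only, assuming the raw TOML text of each config is already in hand; `readPrimaryRegion` and `checkRegionLock` are hypothetical names introduced for illustration, and the regex stands in for a real TOML parser.

```typescript
// Hypothetical shape: a config file name paired with its raw TOML text.
interface FlyConfig {
  name: string;
  text: string;
}

// Hypothetical helper: extract primary_region from raw fly.toml text.
// A regex keeps the sketch dependency-free; a real check would likely parse TOML properly.
function readPrimaryRegion(tomlText: string): string | null {
  const match = tomlText.match(/^\s*primary_region\s*=\s*"([^"]*)"\s*(?:#.*)?$/m);
  return match ? match[1] : null;
}

// Returns a list of human-readable problems.
// An empty list means every config pins the expected region (D-02 locks "lax").
function checkRegionLock(configs: FlyConfig[], expected: string = "lax"): string[] {
  const problems: string[] = [];
  for (const config of configs) {
    const region = readPrimaryRegion(config.text);
    if (region === null) {
      problems.push(config.name + ": primary_region is missing");
    } else if (region !== expected) {
      problems.push(config.name + ': primary_region is "' + region + '", expected "' + expected + '"');
    }
  }
  return problems;
}
```

A caller would read both `fly.staging.toml` and `fly.prod.toml`, pass them in together, and treat any non-empty result as a failure.
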
**D-03 (staging persists across redeploys):** `pnpm migrate:legacy-accounts` runs **once** on initial staging deploy (mirrors the prod ritual exactly, dogfooded once before prod ever sees it). Subsequent staging redeploys preserve the SQLite DB on the Fly Volume. Volume wipe is an explicit operator action when needed (e.g., schema-test reset), documented in `docs/runbooks/RESTORE.md`. Litestream WAL replication targets an RPO of < 1 second, measured on staging soak.

**D-04 (staging access = IP allowlist + invite token):** Staging carries real `localList.txt`-derived plaintext credentials until first-login rehash purges them. Access is gated in two layers: (a) a Fly Machines TCP allowlist on operator IP(s) via Fly proxy configuration, and (b) `apps/server` middleware (`src/staging-invite.ts`) that rejects WS handshakes lacking `?invite=` when `STAGING_MODE=1`. The token rotates via `fly secrets set` on demand. Prod machines never set `STAGING_MODE` (the middleware no-ops; public WSS endpoint). `lint-deploy-stack.mjs` enforces this at CI time: `STAGING_MODE` in `fly.prod.toml` fails the drift guard.

The canonical Fly Volume mount layout, established by these decisions, is:

```
/data/rebno.db     - SQLite database (WAL-mode)
/data/keys/        - Ed25519 keypair (room signing, Phase 4 D-11)
/data/seed/        - One-shot legacy import landing zone (deleted after use)
/data/snapshots/   - Litestream pre-migrate snapshots (D-09)
```

## Decision

`rebno-staging` and `rebno-prod` are deployed as two independent Fly applications in region `lax`. Each app owns its own Fly Volume (mounted at `/data`) and its own Tigris bucket (provisioned via `fly storage create -a `), so a staging-WAL restore never touches the prod-bucket path. Each app runs on a single Fly Machine (`auto_stop_machines = "off"`, `min_machines_running = 1`); there is no autoscaling in v1.

Staging access is gated by two layers: a Fly proxy IP allowlist (operator IPs only) and the `STAGING_INVITE_TOKEN` middleware.
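
The token layer of that gate can be sketched as follows. This is an illustration under stated assumptions, not the contents of `src/staging-invite.ts`: the names `getQueryParam` and `isHandshakeAllowed` are hypothetical, the environment is passed in explicitly rather than read from the process, the supplied `invite` value is assumed to be compared against `STAGING_INVITE_TOKEN`, and rejecting every handshake when staging mode is on but no token is configured is an assumed fail-closed choice.

```typescript
// Environment is passed in explicitly so the sketch has no Node-specific dependencies.
type Env = { [key: string]: string | undefined };

// Hypothetical helper: read one query parameter from a request URL such as "/?invite=abc".
function getQueryParam(requestUrl: string, key: string): string | null {
  const queryStart = requestUrl.indexOf("?");
  if (queryStart === -1) return null;
  const query = requestUrl.slice(queryStart + 1).split("#")[0];
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    const rawKey = eq === -1 ? pair : pair.slice(0, eq);
    if (rawKey !== key) continue;
    if (eq === -1) return "";
    try {
      return decodeURIComponent(pair.slice(eq + 1));
    } catch (_err) {
      // Malformed percent-encoding: treat the parameter as absent.
      return null;
    }
  }
  return null;
}

// Returns true when the WebSocket handshake may proceed.
function isHandshakeAllowed(requestUrl: string, env: Env): boolean {
  // Without STAGING_MODE=1 (the prod case) the gate does nothing.
  if (env["STAGING_MODE"] !== "1") return true;
  const expected = env["STAGING_INVITE_TOKEN"];
  // Assumption: staging mode without a configured token rejects everyone.
  if (!expected) return false;
  const supplied = getQueryParam(requestUrl, "invite");
  // Plain equality keeps the sketch short; it is not a constant-time comparison.
  return supplied !== null && supplied === expected;
}
```

In a server this predicate would run during the WS upgrade, before any room logic sees the connection.
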
The middleware is toggled via the `STAGING_MODE=1` env var and is a zero-cost no-op when that var is absent (prod). `lint-deploy-stack.mjs` is the enforcement mechanism: it reads `apps/server/fly.staging.toml` and `apps/server/fly.prod.toml` and exits non-zero if either invariant drifts.

The docker entrypoint at `apps/server/docker-entrypoint.sh` enforces the deploy sequence on every container start:

1. Take a Litestream pre-migrate snapshot.
2. Run `drizzle-kit migrate` (fail-fast; failure = Fly health check red = no traffic).
3. Start the Litestream WAL replication sidecar in the background.
4. `exec node dist/index.js`.

The one-shot legacy account import (`pnpm migrate:legacy-accounts`) is NOT part of this entrypoint. It runs once per env via `fly ssh console` and is documented in `docs/runbooks/RESTORE.md` §"Legacy Credentials Import".

Promotion to prod uses image-SHA reuse (D-05/D-07): the operator creates a git tag `v0.5.0` on a staging-verified commit; `deploy-prod.yml` resolves the SHA from that tag and runs `flyctl deploy -a rebno-prod --image registry.fly.io/rebno-staging:`. The same bits that survived staging land on prod, with no rebuild.

Per-env secrets (`BETTER_AUTH_SECRET`, `STAGING_INVITE_TOKEN`, Tigris keys) are injected via `fly secrets set` and are never baked into the image or committed to the repository.

## Consequences

### Positive

- The RESTORE.md drill is validated on staging without prod blast-radius; Phase 4 carry-forward verification closes on staging before any prod traffic.
- Two-app separation means staging SIGKILL recovery, argon2 bench, and multi-client smoke are exercised on Fly hardware (real musl + ext4) before production ever receives a connection.
- Same-region (`lax`) staging + prod means latency profile and Tigris RTT translate 1:1 from RESTORE.md validation runs to prod incidents.
- Two-app fixed cost is well under $10/mo at <50 CCU, which is within budget.
- `auto_stop_machines = "off"` + `min_machines_running = 1` keeps both machines always-on, so WS handshakes never pay a cold-start penalty.

### Negative

1. Two Tigris buckets to provision and maintain (`fly storage create` per app); the operator must not skip either during initial env setup.
2. One extra Fly app fixed cost (the staging machine runs 24/7 even when idle).
3. The operator must manually run `fly storage create` per app before first deploy. `lint-deploy-stack.mjs` warns if the `BUCKET_NAME` env var is absent from `fly.staging.toml`, but the actual Tigris provisioning is manual.
4. The `STAGING_MODE` env var MUST NEVER be set on prod, and `lint-deploy-stack.mjs` is the only automated forcing function. A misconfigured prod deploy would erroneously enable the staging invite gate (the gate itself is harmless, but the signal is confusing). Operators must treat the lint as a hard gate.
5. `auto_stop_machines = "off"` means baseline cost even when idle. This is acceptable at <50 CCU but becomes a line item if the project expands without revisiting autoscaling.

### Neutral

- The prod machine is identically shaped to the staging machine in CPU, RAM, and image; only env vars (`STAGING_MODE`, `LOG_LEVEL`, `ALLOWED_ORIGINS`, `OTEL_RESOURCE_ATTRIBUTES`) and the Tigris bucket contents differ.
- Drizzle migrations are forward-only; the rollback procedure combines image-SHA redeployment with Litestream point-in-time restore (documented in RESTORE.md §"Combined Rollback") rather than down-migrations.

## Forcing Functions for Re-Open

Re-open this ADR ONLY if one of the following surfaces:

1. CCU exceeds 50 sustained → re-evaluate the single-machine model (v2 OPS-01/02 sharding, LiteFS read-replicas, Colyseus matchmaker + Redis driver).
2. Region latency complaints from the playerbase → re-evaluate the `lax` choice. Alternatives: `ord` (Chicago), `iad` (Ashburn), `ams` (Amsterdam); cost-of-switch is one `fly regions set` + redeploy.
3. Tigris pricing change > 2× current → re-evaluate per-app bucket isolation (options: shared bucket with prefix isolation; LiteFS; Postgres via ADR 0002 re-open).
4. RESTORE.md <5 min target consistently missed on staging → re-evaluate the Litestream pin (currently 0.3.13) or the restore strategy (candidates: LiteFS-based snapshot, Fly Volume clone API).

## References

- [`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md) §"Environment topology": verbatim D-01..D-04 source for this ADR.
- [`.planning/phases/05-deploy/05-RESEARCH.md`](../../.planning/phases/05-deploy/05-RESEARCH.md) §"Standard Stack" + §"State of the Art": alternatives considered for topology.
- [`.planning/phases/05-deploy/05-VALIDATION.md`](../../.planning/phases/05-deploy/05-VALIDATION.md): per-task verification map; rows 5-01-01 and 5-02-01 validate Dockerfile + fly.toml.
- [`apps/server/fly.staging.toml`](../../apps/server/fly.staging.toml): staging Fly config (`primary_region = "lax"`, `STAGING_MODE = "1"`, `auto_stop_machines = "off"`).
- [`apps/server/fly.prod.toml`](../../apps/server/fly.prod.toml): prod Fly config (`primary_region = "lax"`, no `STAGING_MODE`, `LOG_LEVEL = "info"`).
- [`apps/server/litestream.yml`](../../apps/server/litestream.yml): Litestream WAL replication config (`sync-interval: 1s`, Tigris S3-compat endpoint).
- [`apps/server/docker-entrypoint.sh`](../../apps/server/docker-entrypoint.sh): container entrypoint: snapshot → migrate → sidecar → exec server.
- [`docs/runbooks/RESTORE.md`](../../docs/runbooks/RESTORE.md): end-to-end restore runbook validated on staging in <5 min; documents the D-17 legacy-import ritual.
- [ADR 0002 - Persistence Layer](0002-persistence-layer.md): SQLite + Litestream → Tigris locked; Phase 5 lands the actual replicator config + bucket provisioning.
- [ADR 0006 - Observability Stack](0006-observability-stack.md): sibling ADR locking D-12..D-16 (OpenObserve on `rebno-obs`, dual-rail logs, OTel signals).