# ADR 0005: Deploy Topology — Two-App Fly Split + Litestream + Tigris

[doc->REQ-DEP-01] [doc->REQ-DEP-02] [doc->REQ-DEP-03]

**Date:** 2026-05-08
**Phase:** 05 (during planning)

## Status

**Accepted** — locked at start of Phase 5 planning. Re-evaluation gates listed in
**Forcing Functions for Re-Open** below; absent any of those triggers, the contract
here is binding through Phase 7 PAR-05 (`.bnu` per-user character migration, which
reuses the D-17 ssh-sftp-then-delete ritual verbatim).

Supersedes: nothing. Superseded by: nothing.

## Context

Phase 4 produced a deployable Node 22 + Colyseus 0.17.10 + Better-Auth + better-sqlite3
server at `apps/server`. Phase 5 ships that server to production. The topology
decisions are locked in
[`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md)
§"Environment topology" (D-01..D-04) and are recorded here for permanent reference.

**D-01 (two-app split):** `rebno-staging` + `rebno-prod` are independent Fly apps, each
with its own Fly Volume + per-app Tigris bucket. This separation lets RESTORE.md drills
and Phase 4 carry-forward verification (kill -9 mid-tick, argon2 bench, multi-client
smoke) execute on staging without any prod blast-radius. Two-app fixed cost stays
well under $10/mo at the <50 CCU target.

**D-02 (region `lax`):** Both apps run in Los Angeles (US-West). Staging and prod share
a region so the latency profile and Tigris RTT translate 1:1 from RESTORE.md validation
runs to prod incidents. The operator is West-Coast-adjacent, which minimizes operator
latency, and the initial audience is only 1-5 people. The region can be revisited if the
playerbase expands.
Custom domains: `staging.rebno.decidel.com` + `rebno.decidel.com` (Namecheap-managed;
A/AAAA records added per `fly certs add` output post-deploy). The locked value in
`apps/server/fly.staging.toml` and `apps/server/fly.prod.toml` is:
`primary_region = "lax"`.

**D-03 (staging persists across redeploys):** `pnpm migrate:legacy-accounts` runs
**once** on initial staging deploy (mirrors the prod ritual exactly, dogfooded once
before prod ever sees it). Subsequent staging redeploys preserve the SQLite DB on the
Fly Volume. Volume wipe is an explicit operator action when needed (e.g., schema-test
reset), documented in `docs/runbooks/RESTORE.md`. Litestream WAL replication targets an
RPO of < 1 second, measured during the staging soak.

**D-04 (staging access = IP allowlist + invite token):** Staging carries real
`localList.txt`-derived plaintext credentials until first-login rehash purges them.
Two-layer access gate: (a) Fly Machines TCP allowlist on operator IP(s) via Fly proxy
configuration, and (b) `apps/server` middleware (`src/staging-invite.ts`) that rejects
WS handshakes lacking `?invite=<STAGING_INVITE_TOKEN>` when `STAGING_MODE=1`. Token
rotates via `fly secrets set` on demand. Prod machines never set `STAGING_MODE`
(middleware no-ops; public WSS endpoint). `lint-deploy-stack.mjs` enforces this at
CI time — `STAGING_MODE` in `fly.prod.toml` fails the drift guard.
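
As a rough illustration of layer (b), the check in `src/staging-invite.ts` could look something like the sketch below. Only `STAGING_MODE`, `STAGING_INVITE_TOKEN`, and the `?invite=` query parameter come from this ADR; the function names, the `Env` type, and the constant-time comparison are assumptions made for the example, not the actual middleware.

```typescript
// Hypothetical sketch of the staging invite gate. Names other than
// STAGING_MODE and STAGING_INVITE_TOKEN are illustrative assumptions.
type Env = Record<string, string | undefined>;

// Compare two strings without exiting early on the first mismatch,
// so response timing does not reveal how much of the token matched.
function constantTimeEqual(a: string, b: string): boolean {
  const length = Math.max(a.length, b.length);
  let diff = a.length ^ b.length;
  for (let i = 0; i < length; i++) {
    // charCodeAt returns NaN past the end; `| 0` coerces that to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Returns true when the WS handshake may proceed.
function isHandshakeAllowed(requestUrl: string, env: Env): boolean {
  // Prod never sets STAGING_MODE, so the gate is a no-op there.
  if (env.STAGING_MODE !== "1") return true;

  const expected = env.STAGING_INVITE_TOKEN;
  // Fail closed: staging mode without a configured token admits nobody.
  if (!expected) return false;

  // The base URL is only needed to parse a path-relative request URL.
  const invite = new URL(requestUrl, "http://localhost").searchParams.get("invite");
  return invite !== null && constantTimeEqual(invite, expected);
}
```

Passing the environment in as an argument, rather than reading it globally, keeps the check easy to unit-test for both the staging and prod cases.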

The canonical Fly Volume mount layout, established by these decisions, is:

```
/data/rebno.db          — SQLite database (WAL-mode)
/data/keys/             — Ed25519 keypair (room signing, Phase 4 D-11)
/data/seed/             — One-shot legacy import landing zone (deleted after use)
/data/snapshots/        — Litestream pre-migrate snapshots (D-09)
```

## Decision

`rebno-staging` and `rebno-prod` are deployed as two independent Fly applications in
region `lax`. Each app owns its Fly Volume (mounted at `/data`) and its own Tigris bucket
(created via `fly storage create -a <env>`), so a staging-WAL restore never touches the
prod-bucket path. Each app runs on a single Fly Machine (`auto_stop_machines = "off"`,
`min_machines_running = 1`), with no autoscaling in v1.

Staging access is gated by two layers: Fly proxy IP allowlist (operator IPs only)
and `STAGING_INVITE_TOKEN` middleware. The middleware is toggled via the `STAGING_MODE=1`
env var and is a zero-cost no-op when that var is absent (prod). `lint-deploy-stack.mjs`
is the enforcement mechanism: it reads `apps/server/fly.staging.toml` and
`apps/server/fly.prod.toml` and exits non-zero if either file drifts from its expected
`STAGING_MODE` setting.
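
The drift guard's core check can be pictured roughly as follows. This is a minimal sketch, not the contents of `lint-deploy-stack.mjs`: the helper names are hypothetical, the line-based matching stands in for whatever TOML parsing the real script does, and the rule that staging must set `STAGING_MODE` is an assumption inferred from the staging config described in References.

```typescript
// Hypothetical model of the drift guard; the real lint-deploy-stack.mjs
// may parse TOML properly and check more invariants than shown here.

// True when the TOML text assigns `key` on a non-comment line.
function hasKey(tomlText: string, key: string): boolean {
  const pattern = new RegExp(`^\\s*${key}\\s*=`);
  return tomlText.split("\n").some((line) => pattern.test(line));
}

// Returns the list of violations; an empty list means no drift.
function findDrift(stagingToml: string, prodToml: string): string[] {
  const violations: string[] = [];
  if (hasKey(prodToml, "STAGING_MODE")) {
    violations.push("fly.prod.toml must not set STAGING_MODE");
  }
  // Assumed invariant: staging keeps the invite gate switched on.
  if (!hasKey(stagingToml, "STAGING_MODE")) {
    violations.push("fly.staging.toml must set STAGING_MODE");
  }
  return violations;
}
```

A CI wrapper would read both files, print any violations, and exit non-zero when the returned list is non-empty.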

The docker entrypoint at `apps/server/docker-entrypoint.sh` enforces the deploy
sequence on every container start: (1) take a Litestream pre-migrate snapshot, (2) run
`drizzle-kit migrate` (fail-fast; failure = Fly health check red = no traffic),
(3) start the Litestream WAL replication sidecar in the background, (4) `exec node dist/index.js`.
The one-shot legacy account import (`pnpm migrate:legacy-accounts`) is NOT part of
this entrypoint — it runs once per env via `fly ssh console` and is documented in
`docs/runbooks/RESTORE.md` §"Legacy Credentials Import".
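
The ordering and fail-fast behaviour of the four-step sequence above can be modelled in a few lines. The real entrypoint is a shell script; this TypeScript model is only an illustration, and the `Step` type, the `runStartup` function, and the step names are all hypothetical.

```typescript
// Illustrative model only: the real entrypoint is a shell script, and the
// step bodies here are stand-ins supplied by the caller.
type Step = { name: string; run: () => void };

// Runs steps in order and stops at the first failure, so a failed
// migration means the replication and server steps are never reached.
function runStartup(steps: Step[]): { completed: string[]; failedAt?: string } {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
    } catch {
      return { completed, failedAt: step.name };
    }
    completed.push(step.name);
  }
  return { completed };
}
```

The point of the model is that the snapshot always precedes the migration, and that nothing after a failed migration runs, which is what keeps a broken schema from ever serving traffic.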

Promotion to prod uses image-SHA reuse (D-05/D-07): operator creates a git tag
`v0.5.0` on a staging-verified commit; `deploy-prod.yml` resolves the SHA from that
tag and runs `flyctl deploy -a rebno-prod --image registry.fly.io/rebno-staging:<sha>`.
The same bits that survived staging land on prod, with no rebuild. Per-env secrets
(`BETTER_AUTH_SECRET`, `STAGING_INVITE_TOKEN`, Tigris keys) are injected via
`fly secrets set` — never baked into the image or committed to the repository.
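
To make the image-SHA reuse concrete, the argument list that `deploy-prod.yml` hands to `flyctl` could be assembled along these lines. This is a sketch under assumptions: `buildPromoteArgs` is a hypothetical helper, and the SHA validation (7 to 40 lowercase hex characters) is a guess at the tag format rather than something the workflow is known to enforce.

```typescript
// Hypothetical helper; only the flyctl flags and registry path come from
// the ADR text.
function buildPromoteArgs(sha: string): string[] {
  // Assumption: image tags are hex git SHAs (abbreviated or full).
  if (!/^[0-9a-f]{7,40}$/.test(sha)) {
    throw new Error(`not a git SHA: ${sha}`);
  }
  // Prod deploys the image already built for staging; nothing is rebuilt.
  return ["deploy", "-a", "rebno-prod", "--image", `registry.fly.io/rebno-staging:${sha}`];
}
```

Rejecting anything that is not a plain SHA up front means a mistyped or mis-resolved tag fails before `flyctl` is ever invoked against prod.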

## Consequences

### Positive

- RESTORE.md drill validated on staging without prod blast-radius; Phase 4
  carry-forward verification closes on staging before any prod traffic.
- Two-app separation means staging SIGKILL recovery, argon2 bench, and
  multi-client smoke are exercised on Fly hardware (real musl + ext4) before
  production ever receives a connection.
- Same-region (`lax`) staging + prod means latency profile and Tigris RTT
  translate 1:1 from RESTORE.md validation runs to prod incidents.
- Two-app fixed cost is well under $10/mo at <50 CCU — within budget.
- `auto_stop_machines = "off"` + `min_machines_running = 1` keeps both machines
  always-on for instant WS handshake latency without cold-start penalties.

### Negative

1. Two Tigris buckets to provision and maintain (`fly storage create` per app);
   operator must not skip either during initial env setup.
2. One extra Fly app fixed cost (staging machine runs 24/7 even when idle).
3. Operator must manually run `fly storage create` per app before first deploy —
   `lint-deploy-stack.mjs` warns if `BUCKET_NAME` env var is absent from
   `fly.staging.toml`, but the actual Tigris provisioning is manual.
4. `STAGING_MODE` env var MUST NEVER be set on prod; `lint-deploy-stack.mjs` is
   the only automated forcing function. A misconfigured prod deploy would enable
   the staging invite gate on prod and reject every public WS handshake that lacks
   the invite token (the gate fails closed, so nothing leaks, but players are locked
   out). Operators must treat the lint as a hard gate.
5. `auto_stop_machines = "off"` means baseline cost even when idle. This is
   acceptable at <50 CCU but becomes a line item if the project expands without
   revisiting autoscaling.

### Neutral

- Prod machine is identically shaped to the staging machine in CPU, RAM, and image —
  only env vars (`STAGING_MODE`, `LOG_LEVEL`, `ALLOWED_ORIGINS`,
  `OTEL_RESOURCE_ATTRIBUTES`) and the Tigris bucket contents differ.
- Drizzle migrations are forward-only; the rollback procedure combines image-SHA
  redeployment with Litestream point-in-time restore (documented in RESTORE.md
  §"Combined Rollback") rather than down-migrations.

## Forcing Functions for Re-Open

Re-open this ADR ONLY if one of the following surfaces:

1. Sustained CCU exceeds 50 → re-evaluate single-machine model (v2 OPS-01/02 sharding,
   LiteFS read-replicas, Colyseus matchmaker + Redis driver).
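2. Region latency complaints from playerbase → re-evaluate `lax` choice (alternatives:
   `ord` (Chicago), `iad` (Ashburn), `ams` (Amsterdam)). Fly Volumes are pinned to a
   region, so the cost-of-switch is a `primary_region` change plus a new volume in the
   target region, a Litestream restore onto it, and a redeploy.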
3. Tigris pricing change > 2× current → re-evaluate per-app bucket isolation (options:
   shared bucket with prefix isolation; LiteFS; Postgres via ADR 0002 re-open).
4. RESTORE.md <5 min target consistently missed on staging → re-evaluate Litestream
   pin (currently 0.3.13) or restore strategy (candidates: LiteFS-based snapshot,
   Fly Volume clone API).

## References

- [`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md)
  §"Environment topology" — verbatim D-01..D-04 source for this ADR.
- [`.planning/phases/05-deploy/05-RESEARCH.md`](../../.planning/phases/05-deploy/05-RESEARCH.md)
  §"Standard Stack" + §"State of the Art" — alternatives considered for topology.
- [`.planning/phases/05-deploy/05-VALIDATION.md`](../../.planning/phases/05-deploy/05-VALIDATION.md)
  — per-task verification map; rows 5-01-01 and 5-02-01 validate Dockerfile + fly.toml.
- [`apps/server/fly.staging.toml`](../../apps/server/fly.staging.toml) — staging Fly
  config (`primary_region = "lax"`, `STAGING_MODE = "1"`, `auto_stop_machines = "off"`).
- [`apps/server/fly.prod.toml`](../../apps/server/fly.prod.toml) — prod Fly config
  (`primary_region = "lax"`, no `STAGING_MODE`, `LOG_LEVEL = "info"`).
- [`apps/server/litestream.yml`](../../apps/server/litestream.yml) — Litestream WAL
  replication config (`sync-interval: 1s`, Tigris S3-compat endpoint).
- [`apps/server/docker-entrypoint.sh`](../../apps/server/docker-entrypoint.sh) —
  container entrypoint: snapshot → migrate → sidecar → exec server.
- [`docs/runbooks/RESTORE.md`](../../docs/runbooks/RESTORE.md) — end-to-end restore
  runbook validated on staging in <5 min; documents D-17 legacy-import ritual.
- [ADR 0002 — Persistence Layer](0002-persistence-layer.md) — SQLite + Litestream
  → Tigris locked; Phase 5 lands the actual replicator config + bucket provisioning.
- [ADR 0006 — Observability Stack](0006-observability-stack.md) — sibling ADR locking
  D-12..D-16 (OpenObserve on `rebno-obs`, dual-rail logs, OTel signals).
