# ADR 0002: Persistence Layer — SQLite + Drizzle + Litestream

**Date:** 2026-05-03
**Phase:** 03 close

## Status

**Accepted** — locked at end of Phase 3 (SDOC-04). Re-evaluation gates listed
in **Forcing Functions for Re-Open** below; absent any of those triggers, the
stack here is binding through Phase 4 SRV-01..03 (DB runtime), Phase 5
DEP-02/03 (Fly.io machine + volume) and DEP-07 (RESTORE.md restore SLA),
Phase 6 CLI-01..08 (the MVP gate consumes the schema indirectly through
`packages/protocol`), and Phase 7 PAR-04..07 (relational join surface review).

Supersedes: nothing. Superseded by: nothing.

## Context

Phase 3 reverse-engineering produced
[`docs/extracted-server/save-formats.json`](../extracted-server/save-formats.json)
(8 formats, 16 file-IO scripts) and
[`docs/extracted-server/persistence.md`](../extracted-server/persistence.md)
(write cadence + crash-loss surface). Together they document four properties
of the legacy data shape that drive this lock:

1. **Single-writer access pattern.** Every save is written by the singleton
   `server` object on a deterministic schedule
   ([`extracted/server-5-4/scripts/0349-operations.gml`](../../extracted/server-5-4/scripts/0349-operations.gml)
   step event; per-player save alarm `global.p_sendalarm[pid] = 900` at 30 fps
   = 30 s) or per-event (login/logout/save). The `0367-users_restore.gml`
   load-side mirrors this — there is no read-replica fan-out and no concurrent
   writer in the original. Modern equivalent: a Node singleton process driving
   the Colyseus tick loop. SQLite WAL with one writer is a perfect fit; a
   multi-writer relational store would be over-engineered for this shape.
2. **Append-mostly with full-file rewrite cadence.** `MB_Log.bnb` is rewritten
   in entirety on every change ([`0365-mb_backup.gml`](../../extracted/server-5-4/scripts/0365-mb_backup.gml)
   "load whole, mutate, write whole" pattern; per `persistence.md` "The whole
   MB file is rewritten on every change"). Frequency is low (topic/reply
   events, not per-tick). A SQL transactional INSERT replaces the rewrite
   without touching the access shape.
3. **Per-user denormalized layout.** `save-formats.json` documents
   `UserData/HXB/Bridges_<uid>.bnu`, `UserData/Inv/Inventory_<uid>.bnu`,
   `UserData/Areas/Areas_<uid>.bnu`, `UserData/News/News_<uid>.bnu` —
   `filename_pattern` per format, one file per (uid, category). The legacy
   schema is already "one row per user, multiple files per category." No
   relational join requirement surfaces in any documented script (Phase 3
   plan 03-04 protocol scan + plan 03-05 save-format scan). The Drizzle
   schema in [`tools/db-schema/src/tables.ts`](../../tools/db-schema/src/tables.ts)
   carries this forward as denormalized columns + composite primary keys
   (e.g. `inventory_items` keyed on `(account_id, category, slot)`); only
   `audit_log`, `characters`, and the message-board pair declare FKs.
4. **Tiny working set.** Per
   [`.planning/codebase/CONCERNS.md`](../../.planning/codebase/CONCERNS.md)
   the deployed `legacy/servers/enlyzeam-current/` instance has ~50 user
   rows, `MB_Log.bnb` ~150 KB at peak, total persistent footprint ≪ 1 MB.
   Even projecting forward to BNO's historical ~973 user files at ~1-3 KB
   each plus a 39 KB MB log, the total world state is < 5 MB.
   `persistence.md` summarizes: "SQLite handles this without breathing
   hard." A Postgres deployment for ≪ 1 MB of data is operationally
   absurd.

Phase 4 SRV-08 additionally requires SIGTERM-grace + atomic-writes survival
under `kill -9` (CONCERNS.md crash-loss surface, also flagged at
`persistence.md` "Crash-loss surface" — original is **not** crash-safe). The
legacy server has no journaling, no fsync discipline, and no double-buffer /
atomic-rename pattern, so a mid-rewrite crash truncates the `.bnu` and the
next `users_restore` reads garbage. SQLite WAL + `fullfsync` solves this by
construction.

[`.planning/research/PITFALLS.md`](../../.planning/research/PITFALLS.md) §B7
("Persistence loss on Fly.io machine restart") documents the production
risk: Fly.io restarts machines for maintenance, scaling, and deploy events,
and the legacy "every 30 s alarm" cadence loses up to 30 s (or 90 s, for
some files) of state per restart. The §B7 mitigation explicitly names
**Litestream** ("near-RPO-zero off-machine backup") and **SIGTERM grace
handlers** ("flush WAL, exit").

[`.planning/research/STACK.md`](../../.planning/research/STACK.md) records
the Phase-0 research call:

> **Persistence:** **`better-sqlite3` 12.9.0 + Litestream** to a Tigris
> bucket. Postgres is wrong for this scale; the server is single-machine
> and stateful by design.

STACK.md also pins component versions: Drizzle ORM 0.45.2, drizzle-kit
0.31.10, Litestream 0.3.x, all targeting Node 22 LTS. Plan 03-06 implemented
the Drizzle schema ([tables.ts](../../tools/db-schema/src/tables.ts)) and
emitted the baseline DDL ([0001_baseline.sql](../extracted-server/0001_baseline.sql))
this phase. This ADR locks the runtime that consumes those artifacts in
Phase 4.

## Decision

We will use **`better-sqlite3` 12.9.0 + Drizzle ORM 0.45.2 + drizzle-kit
0.31.10 + Litestream 0.3.13** for Phase 4+ persistence, with the SQLite
database file on a Fly.io persistent volume and the Litestream replica
target on a Tigris S3-compatible bucket.

Rationale, by component:

- **better-sqlite3 12.9.0** — synchronous DB access maps perfectly to the
  Colyseus single-tick game loop (no `await` spaghetti for player save/load
  on hot paths); per STACK.md "~10× faster than `node-sqlite` for the
  small-row, high-frequency reads a game server does." WAL mode + atomic
  writes survive `kill -9` (Phase 4 SRV-08 verifiable on staging — a direct
  mitigation of the CONCERNS.md / `persistence.md` crash-loss surface).
- **Drizzle ORM 0.45.2 + drizzle-kit 0.31.10** — schema-first authoring
  shipped this phase in [`tools/db-schema/src/tables.ts`](../../tools/db-schema/src/tables.ts);
  Phase 4 SRV-01..03 imports those table definitions verbatim. Bundle size
  ~7 KB vs Prisma's 1.6 MB engine; native `better-sqlite3` driver; no
  separate codegen step. The `payloadJson` escape-hatch on `audit_log`
  (D-13 + plan-3-06 RESEARCH §Open Question 1) means Phase 7 PAR-07 admin
  actions ship without a schema migration.
- **Litestream 0.3.13** — replicates SQLite WAL frames to the Tigris bucket
  with measured RPO < 1 s (per §B7 mitigation). The 0.3.x line is the
  production-recommended series at the time of this lock; designed by the
  Fly team explicitly for the Fly.io + persistent-volume + SQLite pattern.
  This satisfies Phase 5 DEP-07 RESTORE.md SLA without standing up a
  separate DB tier.
- **Fly.io persistent volume + Tigris bucket** — STACK.md records the
  documented hosting target. The volume holds `/data/rebno.db` plus the
  Litestream WAL working set; Tigris holds the off-machine replica. One
  Fly machine in `iad` (or owner-local) is sufficient for the < 50 CCU
  v1 scope.

**Postgres rejected for v1.** Reasons:

- Single-writer + per-user denormalized data shape doesn't justify
  relational overhead. The Phase 3 schema ([tables.ts](../../tools/db-schema/src/tables.ts))
  has only four FKs, all to a single `accounts` table; no many-to-many
  joins surface in the documented save-formats inventory.
- Total < 5 MB world state + < 50 CCU target fit comfortably in a single
  embedded SQLite file on local SSD; Postgres adds a network hop, a
  separate Fly app or pooled connection, and on-call rotation for
  another stateful service for negative scale benefit.
- Operational complexity disproportionate to the shape: pg server,
  connection-pool tuning, `pg_basebackup` + WAL-archiving tooling, schema
  migrations gated on engine version — none of which we need for the v1
  data shape, and all of which divert Phase 4 / 5 engineering effort
  away from the CLI-08 milestone.
- OPS-03 (v2 requirement, deferred per CONTEXT.md D-11) preserves the
  Postgres migration path as a forcing function: if Phase 7+ surfaces a
  relational join requirement that doesn't fit single-writer SQLite, the
  Drizzle data layer migrates with a one-line driver swap (per
  STACK.md §"Bridging to Postgres if you ever need it" — Drizzle's `pg`
  driver has the same API as the SQLite driver). The migration cost is
  scoped to a Phase-9 retrospective; v1 ships SQLite.

**Better-Auth integration deferred to Phase 4 SRV-09.** The hand-authored
`accounts` table in tables.ts uses Drizzle conventions (`id`, `username`,
`passwordHash`, `email`) that are partially compatible with Better-Auth's
generated Drizzle adapter columns; full reconciliation is a Phase 4
implementation step (Assumption A5) and may add Better-Auth-managed tables
alongside (not replacing) `accounts`. This does not affect the persistence
layer choice — argon2id-only credentials live in `accounts.passwordHash`
per CLAUDE.md hard rule #2.

## Consequences

### Positive

- **Phase 4 SRV-01..03 has zero-redesign-loop schema.** It consumes
  [tables.ts](../../tools/db-schema/src/tables.ts) +
  [0001_baseline.sql](../extracted-server/0001_baseline.sql) verbatim;
  the runtime stack is locked here so SRV-01 starts with `pnpm add
  better-sqlite3@12.9.0 drizzle-orm@0.45.2 -D drizzle-kit@0.31.10` and
  no further architecture decision.
- **Phase 4 SRV-08 SIGTERM-grace tractable.** Synchronous WAL writes +
  better-sqlite3's `pragma fullfsync = true` survive `kill -9`; the
  graceful-shutdown handler is a small flush-and-exit script, not a
  multi-stage drain protocol.
- **Phase 5 DEP-02/03 has documented Fly.io single-machine + persistent-
  volume + Litestream pattern.** STACK.md cites the official Fly endorsement
  ("All In on SQLite + Litestream"); Phase 5 mirrors it directly via
  `fly volumes create /data` + `fly storage create` (Tigris bucket) +
  `litestream.yml` `replicas: - type: s3`.
- **Phase 5 DEP-07 RESTORE SLA achievable.** Litestream-to-Tigris RPO < 1 s
  satisfies the RESTORE.md "minimum data-loss window" requirement; restore
  is `litestream restore` against the Tigris URL onto a fresh volume.
- **Crash-loss surface mitigated by construction.** All writes are SQLite
  transactions; the legacy "half-rewritten `.bnu`" hazard cannot occur. The
  `persistence.md` crash-loss section explicitly forward-links this ADR.

### Negative

- **Single-machine.** No multi-region failover; multi-region = OPS-01
  (v2 deferral). One Fly.io machine going down is total downtime until
  Litestream-restore on a fresh machine completes (minutes, not hours,
  per §B7 mitigation, but non-zero).
- **No cross-machine replication for scale-out reads.** SQLite WAL doesn't
  replicate writes to readers across machines. Future scale-out beyond 50
  CCU = OPS-02 (room sharding inside Colyseus matchmaker) and possibly
  Postgres-via-OPS-03; both are explicit v2 deferrals.
- **Litestream 0.3.x is in maintenance mode.** 0.5.x is in beta as of the
  ADR date. The 0.3.13 pin is the production-recommended series at lock
  time; Phase 5 dogfood window must re-verify the recommended pin (one of
  the Forcing Functions below).

### Neutral

- **Better-Auth's Drizzle adapter conventions partially overlap our
  `accounts` table.** Phase 4 SRV-09 reconciles at adapter-wire time
  (Assumption A5 from plan-3-06 RESEARCH); the `payloadJson`-style escape
  hatches on `audit_log` are retained regardless. No persistence-layer
  decision rides on this.

## Forcing Functions for Re-Open

If any one of the following fires, ADR 0002 is re-opened in a Phase-9
retrospective and the Postgres migration path (OPS-03) is enacted:

1. **Relational join surfaces in Phase 7.** Phase 7 PAR-04..06 surfaces a
   many-to-many relational query whose non-relational shape requires N+1
   SQLite reads exceeding 5 ms inside the Colyseus tick budget. The
   `audit_log.payloadJson` escape-hatch is the first mitigation; if it
   doesn't suffice, the migration triggers.
2. **CCU exceeds 50 sustained on a single Fly machine** AND room-shard
   mitigation (OPS-02) fails to relieve write contention. STACK.md targets
   < 50 CCU per machine; the Fly + persistent-volume + SQLite pattern
   stops being correct at the multi-machine boundary.
3. **Litestream 0.3.x deprecated by upstream** AND 0.5.x stabilized AND
   Fly.io's documented pattern shifts to the new line. The Phase 5 dogfood
   window is the natural re-verification point; this is also the trigger
   that may flip just the Litestream pin (without re-opening the SQLite +
   Drizzle decision).

## Related ADRs

- ADR 0001 — Client Engine (Phaser 3.90.0 locked at Phase 2 close).
- ADR 0003 — Canonical Snapshot (Phase 3 close, plan 03-08); locks the
  golden-snapshot diff harness that consumes the persistence layer.

## References

- [`docs/extracted-server/save-formats.json`](../extracted-server/save-formats.json)
  — 8-format data-shape inventory; the single-writer / append-mostly /
  per-user-denormalized signals come from `filename_pattern` + `gml_origin`
  fields here.
- [`docs/extracted-server/0001_baseline.sql`](../extracted-server/0001_baseline.sql)
  — Drizzle baseline DDL committed plan 03-06; Phase 4 SRV-01 runs this
  via drizzle-kit migrate.
- [`tools/db-schema/src/tables.ts`](../../tools/db-schema/src/tables.ts)
  — hand-authored Drizzle TS schema (Phase 4 imports verbatim); cites
  `accounts.passwordHash` as the argon2id-only credentials column.
- [`docs/extracted-server/persistence.md`](../extracted-server/persistence.md)
  — write-cadence + crash-loss-surface narrative; legacy "every 30 s"
  alarm vs. SQLite-transaction-per-event.
- [`.planning/research/STACK.md`](../../.planning/research/STACK.md) §Persistence
  — locks the better-sqlite3 + Drizzle + Litestream + Tigris stack and
  enumerates the Postgres-rejection rationale at scale.
- [`.planning/research/PITFALLS.md`](../../.planning/research/PITFALLS.md) §B7
  — Fly.io machine-restart loss + Litestream + SIGTERM-grace mitigation.
- [`.planning/codebase/CONCERNS.md`](../../.planning/codebase/CONCERNS.md)
  — plaintext-credentials risk (drives D-04 staging table separation) +
  drift'd-snapshots context (drives the migration-input-only treatment of
  legacy `.bnu`/`.bnb` files).
- **OPS-03** (v2 requirement, deferred per CONTEXT.md D-11) — Postgres
  migration path preserved as a forcing function; this ADR's re-open
  triggers (above) are the gating conditions for OPS-03.
- [`CLAUDE.md`](../../CLAUDE.md) Tech Stack section + hard rules #2
  (no plaintext passwords) and #6 (extract → document → rewrite — this
  ADR closes the "document" stage for SDOC-04 and unlocks the "rewrite"
  Phase 4 stage).
