# ADR 0006: Observability Stack — Self-Hosted OpenObserve + In-Process OTel SDK + Dual-Rail Logs

[doc->REQ-DEP-06]

**Date:** 2026-05-08
**Phase:** 05 (during planning)

## Status

**Accepted**: locked at the start of Phase 5 planning. Re-evaluation gates are listed
in **Forcing Functions for Re-Open** below; absent any of those triggers, the contract
here is binding through Phase 7.

Supersedes: nothing. Superseded by: nothing.

## Context

Phase 4 established structured pino JSON logging to stdout (Phase 4 D-23 redact list,
Plans 04-01..04-13). Phase 5 extends observability to cover all three signals (logs +
metrics + traces) via an in-process OTel SDK and a self-hosted OpenObserve instance.
The decisions are locked in
[`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md)
§"Logging + observability (DEP-06)" (D-12..D-16).

**D-12 (dual-rail: Fly stdout AND OpenObserve via OTel):** pino JSON writes to
stdout (Fly's built-in `fly logs` viewer; short but reliable retention). In parallel,
the in-process `@opentelemetry/sdk-node` SDK running inside each game-server process
(not a separate collector) ships logs + metrics + traces via OTLP-HTTP to `rebno-obs`
over Fly's private 6PN network (`flycast`). Failure to ship to OpenObserve is
non-fatal: the game server keeps logging to stdout. The dual-rail ensures
observability is never entirely dark even when the obs app is down.

**D-13 (signals: logs + metrics + traces, all three):**
- **Logs:** Every Colyseus lifecycle event (auth/onJoin/onLeave/onMessage/error),
  every persistence write, every rate-limit drop, every SIGTERM step, every
  `fs.watch` room hot-reload event, every legacy-credential staging hit. Redact
  list: passwords, session tokens, argon2id_hash, legacy_hash (Phase 4 D-23,
  `apps/server/src/log.ts:23-67`). Trace IDs propagate from request → tick handler
  → state-diff broadcast.
- **Metrics:** Node process (RSS, CPU, GC counts/durations); Colyseus
  (`rooms_active`, `players_per_room` histogram, `messages_in_per_sec` by msg_type,
  `messages_out_per_sec` by msg_type); game-loop (`tick_duration_ms` p50/p95/p99,
  `step_duration_ms`, `accumulator_lag`); persistence (`db_write_ms` by table,
  `litestream_replicate_ms`); auth (`auth_attempts_total` by outcome,
  `force_reset_active`); rate-limiter (`rate_limit_drops_total` by msg_type).
- **Traces:** Span per c2s message handler (validate → step → broadcast), span per
  HTTP `/api/auth/*` request, span per `room_join` (manifest verify → layout
  broadcast), span per persistence write batch.

**D-14 (OpenObserve on `rebno-obs` Fly app):** Third Fly app, single OpenObserve
binary, region `lax` (same as game servers — minimizes OTLP RTT). Fly Volume mounted
at `/data` holds local OpenObserve WAL; `fly storage create` provisions a per-obs
Tigris bucket backing OpenObserve's columnar log/metric/trace storage (Parquet on S3).
Pricing target: Fly shared-cpu-1x / 1 GB RAM at ~$2-3/mo, plus minimal Tigris storage
at <50 CCU log volume. Aggressive log cycling: `ZO_DATA_RETENTION_DAYS=14` +
`ZO_COMPACT_DATA_RETENTION_DAYS=30` + `ZO_USAGE_RETENTION_DAYS=14` to limit RAM
and Tigris growth.

**D-15 (obs access = IP allowlist + admin password):** OpenObserve UI is behind the
same Fly proxy IP allowlist as `rebno-staging` (operator IPs only). Admin password
injected via `fly secrets set ZO_ROOT_USER_PASSWORD=...` (`openssl rand -base64 24`
generated locally). No public OAuth for v1; revisit if team grows beyond solo
operator. The `default` org name is locked (Pitfall 5 — OpenObserve single-binary
startup creates exactly one org named `default`; using a different org name requires
multi-user config and is out of scope).

**D-16 (log levels):** `LOG_LEVEL=info` on prod; `LOG_LEVEL=debug` on staging by
default. `TRACE=1` is an operator-only env var, set manually, that enables tick-loop
microsecond timings (used during Phase 6 performance work; off by default). This
closes the Phase 4 SRV-08 verify-visibility carry-forward.

Alternatives considered and rejected:

- **Grafana stack (Alloy + Loki + Mimir + Tempo):** 4 separate services to operate
  vs OpenObserve's single binary. Significant ops overhead relative to <50 CCU
  audience. Rejected per D-14 + CONTEXT §"Specific Ideas".
- **SigNoz:** Higher minimum resource footprint (requires ClickHouse). Single
  OpenObserve binary is lighter for v1 scale. Rejected per RESEARCH §"State of the
  Art".
- **Sidecar OTel Collector process:** Higher RSS overhead vs in-process
  `@opentelemetry/sdk-node`. The in-process SDK adds only ~5 MB RSS vs ~50 MB+
  for the Collector binary. Rejected per CONTEXT D-12 / RESEARCH §"OTel collector
  form factor".
- **Hosted Honeycomb/Datadog/New Relic:** Cloud-vendor lock-in; per-event pricing
  punitive at scale; privacy concerns with game event data. Rejected.

## Decision

The observability stack uses **in-process `@opentelemetry/sdk-node`** loaded via
`node --import ./dist/otel-init.js dist/index.js` (per `apps/server/docker-entrypoint.sh`).
This is NOT a sidecar collector binary: `--import` preloads the SDK so that it hooks
Node's module loading before `index.ts` evaluates, enabling automatic instrumentation
of `http`, `ws`, `express`, and `better-sqlite3`. OTel SDK initialization is fail-soft:
a `try/catch` in `apps/server/src/otel-init.ts` logs the failure and continues, so the
server never crashes on an OTel init failure.
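
The fail-soft contract can be sketched as follows. This is not the contents of `otel-init.ts`; `startTelemetryFailSoft` and the injected `start` callback are hypothetical stand-ins for whatever constructs and starts the real SDK, chosen so the sketch runs without any OTel packages installed.

```typescript
// Hypothetical sketch of fail-soft telemetry startup.
// `start` stands in for the real SDK bootstrap (an assumption, not the actual API).
type Logger = { warn: (msg: string) => void };

function startTelemetryFailSoft(start: () => void, log: Logger): boolean {
  try {
    start();
    return true; // OTLP rail is up
  } catch (err) {
    // Log and continue: the server keeps running on the stdout rail alone.
    const reason = err instanceof Error ? err.message : String(err);
    log.warn(`otel init failed, continuing without OTLP export: ${reason}`);
    return false;
  }
}

// Usage: a bootstrap that throws does not propagate to the caller.
const warnings: string[] = [];
const started = startTelemetryFailSoft(
  () => {
    throw new Error("exporter misconfigured");
  },
  { warn: (msg) => warnings.push(msg) },
);
console.log(started, warnings);
```

Returning a boolean rather than rethrowing lets the caller record that the OTLP rail is dark while still starting the game server.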

`pino-opentelemetry-transport` acts as a parallel pino transport alongside stdout,
forwarding structured JSON log records over OTLP-HTTP to OpenObserve. The pino
redact pipeline (`apps/server/src/log.ts`) runs BEFORE the log record reaches either
transport, preserving the single-redact-pipeline invariant (Plan 08 Pitfall 10;
a unit test in `apps/server/test/otel-init.test.ts` asserts the redact list
survives dual-transport). This rules out the failure mode in which stdout is
sanitized but the OTel rail is not, which would leak secrets into OpenObserve.
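
The invariant can be illustrated with a dependency-free sketch: redact once, then hand the same sanitized record to every rail. This is not pino's implementation or the contents of `log.ts`. The helper names are hypothetical, the key names `password` and `session_token` are assumed spellings of the ADR's redact list, and only top-level keys are handled.

```typescript
// Hypothetical sketch: redact once, then fan out to every rail.
// Key names approximate the ADR's redact list; the helpers are illustrative only.
type LogRecord = Record<string, unknown>;
type Sink = (record: LogRecord) => void;

const REDACTED_KEYS = new Set([
  "password",
  "session_token",
  "argon2id_hash",
  "legacy_hash",
]);

// Returns a shallow copy with sensitive top-level values replaced.
function redact(record: LogRecord): LogRecord {
  const out: LogRecord = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = REDACTED_KEYS.has(key) ? "[Redacted]" : value;
  }
  return out;
}

function emit(record: LogRecord, sinks: readonly Sink[]): void {
  const safe = redact(record); // single redact step shared by all sinks
  for (const sink of sinks) sink(safe);
}

// Two in-memory arrays stand in for the stdout and OTLP rails.
const stdoutRail: LogRecord[] = [];
const otlpRail: LogRecord[] = [];
emit({ msg: "login", user: "alice", password: "hunter2" }, [
  (r) => stdoutRail.push(r),
  (r) => otlpRail.push(r),
]);
console.log(stdoutRail[0], otlpRail[0]);
```

Because no sink ever sees the unredacted record, adding a third rail cannot reintroduce a leak; the risk sits entirely in whether the redact step runs before the fan-out.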

`rebno-obs` is a third Fly app in region `lax` running a single OpenObserve binary
backed by Tigris. OTLP-HTTP ingestion is routed via
`rebno-obs.flycast:5080/api/default` — the `flycast` hostname makes this private
Fly 6PN traffic that never traverses the public internet. OpenObserve's HTTP UI on
port 5080 is gated behind the Fly proxy IP allowlist and the `ZO_ROOT_USER_PASSWORD`
secret. The `default` org name is the locked constant for all OTLP routing
(Pitfall 5); changing it requires rebuilding the OpenObserve config and updating
`otel-init.ts` — a re-open trigger.
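
As a rough sketch of why the org name is baked into routing, the per-signal URLs could be built as below. The host, port, and org come from this ADR; the `http://` scheme and the `/v1/<signal>` suffixes (the OTLP-HTTP default paths) are assumptions, and `otlpUrl` is a hypothetical helper rather than code from `otel-init.ts`.

```typescript
// Hypothetical sketch of per-signal OTLP-HTTP URLs.
// Scheme and /v1/<signal> suffixes are assumptions; host, port, and org are from the ADR.
const OBS_HOST = "rebno-obs.flycast:5080";
const OBS_ORG = "default"; // locked constant (Pitfall 5)

type Signal = "logs" | "metrics" | "traces";

function otlpUrl(signal: Signal): string {
  return `http://${OBS_HOST}/api/${OBS_ORG}/v1/${signal}`;
}

const endpoints = {
  logs: otlpUrl("logs"),
  metrics: otlpUrl("metrics"),
  traces: otlpUrl("traces"),
};
console.log(endpoints);
```

Every signal shares the `/api/default` prefix, so renaming the org would touch all three exporters at once.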

`ZO_ROOT_USER_PASSWORD` is generated via `openssl rand -base64 24` and injected via
`fly secrets set`. It is never committed to the repository or baked into
`apps/obs/fly.toml`. Secret rotation is documented in `docs/runbooks/RESTORE.md`
§"Secret Rotation".
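
For operators without `openssl` at hand, a Node equivalent of the same generation step might look like the sketch below. The documented procedure remains `openssl rand -base64 24`; `generateRootPassword` is a hypothetical helper, and the value should be passed straight to `fly secrets set` rather than logged or written to disk.

```typescript
import { randomBytes } from "node:crypto";

// Node equivalent of `openssl rand -base64 24`: 24 random bytes, base64-encoded.
// 24 bytes encode to exactly 32 base64 characters with no padding.
function generateRootPassword(): string {
  return randomBytes(24).toString("base64");
}

const password = generateRootPassword();
// Print only the length here; never log the secret itself.
console.log(password.length); // 32
```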

The `LOG_LEVEL` env var is set per-app in the `fly.toml` files: `info` on prod (pino's
default level), `debug` on staging (enables Phase 4 carry-forward argon2 bench timing
visibility). `TRACE=1` is an operator-only env var for Phase 6 performance deep-dives;
it is documented in `docs/runbooks/RESTORE.md` but not set by default.
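
One way the server could resolve these two variables at startup is sketched below. `LOG_LEVEL` and `TRACE` are the names used in this ADR; the `resolveLogging` helper, the fallback to `info`, and the handling of unknown values are assumptions, not the behavior of `log.ts`.

```typescript
// Hypothetical sketch: resolve logging settings from environment variables.
// LOG_LEVEL and TRACE are the ADR's names; fallback and validation are assumptions.
type Env = Record<string, string | undefined>;

const PINO_LEVELS = ["trace", "debug", "info", "warn", "error", "fatal"] as const;
type Level = (typeof PINO_LEVELS)[number];

function resolveLogging(env: Env): { level: Level; tickTimings: boolean } {
  const raw = (env.LOG_LEVEL ?? "info").toLowerCase();
  // Unknown values fall back to "info" rather than failing startup.
  const level = (PINO_LEVELS as readonly string[]).includes(raw)
    ? (raw as Level)
    : "info";
  // Tick-loop timings are opt-in and require the exact value "1".
  return { level, tickTimings: env.TRACE === "1" };
}

// In the server this would be called with process.env.
console.log(resolveLogging({ LOG_LEVEL: "debug" }));
console.log(resolveLogging({ TRACE: "1" }));
```

Taking the environment as a parameter keeps the function easy to unit-test without mutating `process.env`.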

## Consequences

### Positive

- ~$2-3/mo OpenObserve fixed cost on `shared-cpu-1x@1024MB`; one binary to operate
  vs Grafana stack's 4+ services.
- Parquet-on-S3 (Tigris) storage is claimed by OpenObserve's own benchmark to be 140×
  cheaper than Elasticsearch at comparable log volumes (a vendor figure; acceptable
  for <50 CCU).
- In-process OTel SDK adds only ~5 MB RSS to the game server vs ~50 MB+ for a
  sidecar Collector binary.
- Dual-rail (stdout + OTLP) means `fly logs` is always available even when
  `rebno-obs` is down or being upgraded — observability is never entirely dark.
- Single-redact-pipeline preservation (pino redact → both transports) eliminates a
  class of secret-leak bugs that dual-transport naively introduces (Plan 08 / Pitfall
  10 test guards this).

### Negative

1. The single OpenObserve instance is a SPOF for the observability plane (acceptable
   at <50 CCU; if `rebno-obs` is down, stdout logging still functions).
2. Bespoke dashboards are v2 — Phase 5 ships with OpenObserve's default dashboard
   only; per-room latency heatmaps, per-player Colyseus metrics, etc., defer to
   Phase 7 OPS-04.
3. No GitHub OAuth on the obs UI in v1 — operator IPs in the Fly allowlist must be
   maintained manually when the operator's IP changes.
4. pino + OTel multi-target redact pipeline is fragile — Pitfall 10 forces a unit
   test (`apps/server/test/otel-init.test.ts`) to assert the redact list survives
   dual-transport. Any future pino or `pino-opentelemetry-transport` upgrade must
   re-run this test.
5. Litestream replica encryption (Age) is deprecated in Litestream 0.5.x — relevant
   when ADR 0002 upgrades the Litestream pin from 0.3.13. This ADR's observability
   stack is unaffected, but the combined RESTORE.md procedure will need updating.

### Neutral

- Locks `default` as the OpenObserve org name (Pitfall 5). The name is a constant in
  the OTLP endpoint path (`/api/default/...`) in `apps/server/src/otel-init.ts`;
  changing it is a code change and a forcing function.
- OpenObserve is pinned to the latest stable release at plan time; `lint-deploy-stack.mjs`
  drift-guards the `apps/obs/Dockerfile` FROM tag to prevent silent upgrades via
  `latest`.
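
The drift guard amounts to rejecting any `FROM` line that is untagged or uses a floating tag. The sketch below shows that check in isolation; it is not the real `lint-deploy-stack.mjs`, `findUnpinnedFrom` is a hypothetical name, and the image names are placeholders rather than the project's actual base image.

```typescript
// Hypothetical sketch of a FROM-tag drift guard; not the real lint-deploy-stack.mjs.
// Flags any FROM image that has no tag or uses the floating `latest` tag.
function findUnpinnedFrom(dockerfile: string): string[] {
  const offenders: string[] = [];
  for (const line of dockerfile.split("\n")) {
    // Skip optional flags such as --platform=... before the image reference.
    const match = /^\s*FROM\s+(?:--\S+\s+)*(\S+)/i.exec(line);
    if (!match) continue;
    const image = match[1];
    if (image.includes("@sha256:")) continue; // digest pins are immutable
    // Look for the tag only after the final "/" so registry ports are ignored.
    const lastSegment = image.slice(image.lastIndexOf("/") + 1);
    const tag = lastSegment.includes(":") ? lastSegment.split(":")[1] : undefined;
    if (tag === undefined || tag === "latest") offenders.push(image);
  }
  return offenders;
}

// Placeholder image names, not the project's actual base image.
console.log(findUnpinnedFrom("FROM example.invalid/openobserve:v1.2.3\n"));
console.log(findUnpinnedFrom("FROM example.invalid/openobserve:latest\n"));
```

A production version would also need an allowlist for `scratch` and for multi-stage aliases, which this sketch would flag as untagged.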

## Forcing Functions for Re-Open

Re-open this ADR ONLY if one of the following surfaces:

1. OpenObserve goes EOL or the community edition licensing changes in a way that
   breaks the self-hosted `shared-cpu-1x` deployment model.
2. The OTel SDK (`BatchSpanProcessor`) or `pino-opentelemetry-transport` ships a
   breaking interface change that silently drops signals or breaks the redact
   pipeline (Pitfall 10).
3. Team grows beyond solo operator and needs SSO/OAuth on the observability UI —
   current v1 design (IP allowlist + admin password) is not multi-user-safe.
4. Log/metric volume exceeds OpenObserve `shared-cpu-1x@1024MB` capacity
   (OOM-kill observed on `rebno-obs`) → upgrade machine tier or migrate to
   Grafana/SigNoz.

## References

- [`.planning/phases/05-deploy/05-CONTEXT.md`](../../.planning/phases/05-deploy/05-CONTEXT.md)
  §"Logging + observability (DEP-06)" — verbatim D-12..D-16 source for this ADR.
- [`.planning/phases/05-deploy/05-RESEARCH.md`](../../.planning/phases/05-deploy/05-RESEARCH.md)
  §"State of the Art" + §"Standard Stack" — alternatives considered for observability.
- [`apps/obs/Dockerfile`](../../apps/obs/Dockerfile) — OpenObserve single-binary
  container image, Tigris-backed.
- [`apps/obs/fly.toml`](../../apps/obs/fly.toml) — `rebno-obs` Fly config; region
  `lax`; listens 5080 (flycast only — private 6PN); no public port.
- [`apps/server/src/otel-init.ts`](../../apps/server/src/otel-init.ts) — in-process
  OTel SDK bootstrap; fail-soft `try/catch`; SIGTERM → `sdk.shutdown()`.
- [`apps/server/src/log.ts`](../../apps/server/src/log.ts) — singleton pino logger;
  redact list covering passwords, session tokens, argon2id_hash, legacy_hash.
- [ADR 0005 — Deploy Topology](0005-deploy-topology.md) — sibling ADR locking D-01..D-04
  (`rebno-staging` + `rebno-prod` two-app split; `rebno-obs` is the third Fly app
  in the same region `lax`).
