# Pitfalls Research — REBNO

**Domain:** Reverse-engineered rebuild of a 2000s-era GameMaker 5.3a real-time multiplayer game on a modern Node+TS / Phaser-or-PixiJS / Fly.io stack
**Researched:** 2026-05-01
**Confidence:** HIGH for sections A and D (direct evidence in `decomp/wiki/` and `.planning/codebase/CONCERNS.md`); MEDIUM for sections B and C (informed by ecosystem patterns plus repo-specific knowledge)

This document is organized into four pitfall surfaces stacked on top of each other, plus cross-cutting tables. Stage references map to the 7-stage plan in `.planning/PROJECT.md`.

---

## Section A — Reverse-Engineering / Extraction Pitfalls

### A1: Reaching for UndertaleModTool / Altar.NET first

**What goes wrong:**
Engineer sees "GameMaker game, need to extract assets" and downloads UTMT or Altar.NET. Tool throws "FORM header not found" or opens with zero assets. Engineer assumes the binary is corrupt or anti-tamper-protected and burns days investigating phantom problems.

**Why it happens:**
Modern decompilers parse by hunting for the `FORM` magic header at a known offset, then walking tagged chunks (`GEN8`, `TXTR`, `CODE`, `SPRT`, etc.). GM 5.3a `.exe` files have **no `FORM` header anywhere** — they are a Delphi 5 stub with a sequential ZLIB-serialized payload appended to EOF (per `decomp/wiki/13-modern-tool-incompat.md`). Architectural mismatch, not a tooling bug.

**How to avoid:**
Hard-code the era-appropriate tool list in Stage 1 setup notes: only **GM Decompiler v2.1** (static, JAR), **GMD-Recovery** (dynamic, requires WinXP VM), and **LateralGM** (post-extraction parsing). Add a `decomp/TOOLS.md` at Stage 1 kickoff that explicitly lists "do not use" tools (UTMT, Altar.NET, any Studio-era tool) with the symptom signatures from `decomp/wiki/13-modern-tool-incompat.md`.

**Warning signs:**
- "FORM header not found"
- "Unrecognized data format"
- "Not a valid GameMaker file"
- Tool opens but asset count is 0

**Phase to address:**
Stage 1 (Extraction) — make this a documented precondition in the runbook before anyone touches a tool.

---

### A2: Extraction order wrong — XOR-decode before ZLIB-decompress (or vice versa)

**What goes wrong:**
Engineer locates the appended payload offset in the runner `.exe`, attempts to decompress with ZLIB, gets garbage. Or applies the XOR key to a ZLIB-decompressed stream that is no longer XOR-encoded.

**Why it happens:**
The 5.3a payload is **layered**: outer XOR obfuscation wraps an inner ZLIB-compressed `.gmd` blob. The XOR key is mathematically derived from the runner stub state (per `decomp/wiki/15-extraction-pipeline.md` Rank 2). Order: locate EOF payload → derive XOR key → XOR-decode → ZLIB-decompress → land plaintext `.gmd`. Skip a step or swap the order and every byte downstream is noise.

**How to avoid:**
Use **GM Decompiler v2.1** as the first attempt — it does the layering correctly. Do NOT write a custom extractor before Rank 2 has been tried and failed. If a custom extractor is ever needed, write it as a literal port of v2.1's algorithm with each stage materializing an intermediate file (`payload.raw` → `payload.xor-decoded` → `payload.zlib` → `project.gmd`) so the failing stage is obvious.

**Warning signs:**
- ZLIB returns "invalid block type" or "incorrect header check" on what should be the inner blob
- XOR-decoded output does not begin with ZLIB header bytes (`78 9C` / `78 DA`). Those bytes are the goal at this stage; if you see GameMaker resource strings instead, you are already past the ZLIB layer and a further decode pass will only corrupt the data
- Magic bytes don't match `decomp/wiki/03-gmd-format.md` after extraction
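
The header-byte check in the warning signs above can be automated when inspecting the `payload.xor-decoded` intermediate. The sketch below applies the header rules from the ZLIB specification (RFC 1950): compression method 8, window size field at most 7, and the two header bytes forming a multiple of 31. The function name is hypothetical, and a passing result only means "plausibly ZLIB", not "correctly decoded".

```typescript
// Hypothetical triage helper (not part of any existing tool): checks whether
// a buffer starts with a plausible ZLIB (RFC 1950) header.
export function looksLikeZlibHeader(bytes: Uint8Array): boolean {
  if (bytes.length < 2) return false;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const cmf = view.getUint8(0); // compression method and flags
  const flg = view.getUint8(1); // flags, including the check bits
  const method = cmf & 0x0f; // 8 means "deflate"
  const windowField = cmf >> 4; // CINFO, valid range 0..7
  const checkBitsOk = ((cmf << 8) | flg) % 31 === 0;
  return method === 8 && windowField <= 7 && checkBitsOk;
}
```

Running this on each intermediate file shows which stage of the pipeline first produces a ZLIB-looking stream; a two-byte check has false positives, so treat it as a hint before attempting the real decompress.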

**Phase to address:**
Stage 1 (Extraction).

---

### A3: Treating decompiled DnD blocks as faithful GML

**What goes wrong:**
Decompiler emits "GML" for a Drag-and-Drop action sequence. Engineer reads it, ports it to TypeScript, ships. Behavior diverges subtly from the original — a DnD "If Variable" block does not have identical short-circuit semantics to a GML `if` expression, and DnD-emitted local variable scoping can differ from hand-written GML.

**Why it happens:**
GameMaker 5.3a serializes DnD as **binary action nodes**, not GML strings (`decomp/wiki/04-dnd-serialization.md`, referenced from concerns audit). Decompilation is **translation**, not recovery — comments are gone, formatting is invented, edge-case semantics are approximated. Most BNO scripts are likely DnD-heavy given the era and the audience that built it.

**How to avoid:**
For every extracted asset, **retain both the raw binary node tree and the decompiled GML side-by-side**. Treat the binary as canonical and the decompiled GML as a derived view. When porting in Stage 4/6, if behavior is suspicious or the GML looks unnatural (deeply nested if/else with no comments, repeated `var` declarations, redundant `set_variable` calls), open the binary in LateralGM to verify the original DnD intent before porting.

**Warning signs:**
- Decompiled GML has unnatural structure — long flat if/else chains, no idiomatic GML patterns
- Variables scoped strangely (everything looks "local" or everything looks "global")
- LateralGM and the standalone decompiler disagree on the same `.gmd` script

**Phase to address:**
Stage 2 (Client analysis) and Stage 3 (Server analysis) — establish the binary-canonical convention; Stage 4/6 (rebuilds) — enforce checking binary when GML looks off.

---

### A4: Reading 39dll wire protocol from packet captures instead of GML call order

**What goes wrong:**
Engineer fires up Wireshark against the original server, captures packets, tries to reverse-engineer the protocol from byte patterns. Gets nowhere because there is no length prefix, no message ID schema, no framing — just opaque byte streams whose meaning is determined by **the order of `write*` calls in the GML emitter**.

**Why it happens:**
39dll has zero schema. Per `decomp/wiki/08-39dll-networking.md`: "packet structure = read/write call order." A `writebyte(MSG_TYPE_MOVE)` followed by `writedouble(x)` is exactly 9 bytes on the wire — no version byte, no length, no delimiter unless the script emits one explicitly. The packet layout literally **does not exist** outside the source. Wireshark sees a TCP byte stream and shrugs.

**How to avoid:**
Stage 3 protocol reverse-engineering is **GML-driven, not capture-driven**. Procedure:
1. Extract Master `.gmd` to plaintext GML.
2. `grep` every `sendmessage` and `receivemessage` call site.
3. For each site, trace backward through the script to enumerate the exact `clearbuffer; writebyte(...); writedouble(...); writestring(...)` sequence.
4. That sequence **is** the packet. Document as a TypeScript schema.
5. Mirror reads/writes on the new server in identical order.

Use packet captures only as a **validation cross-check** after the GML-derived schema exists, never as the primary source.
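
As an illustration of step 4, here is a minimal sketch of the `writebyte` + `writedouble` example as a TypeScript codec. The message ID value and the little-endian byte order are assumptions that must be confirmed against the extracted GML and 39dll itself; the function names are hypothetical.

```typescript
// ASSUMPTION: placeholder ID; the real value must come from the Master .gmd.
const MSG_TYPE_MOVE = 1;

// Mirrors the GML emitter order: writebyte(MSG_TYPE_MOVE); writedouble(x);
export function encodeMove(x: number): Uint8Array {
  const buffer = new ArrayBuffer(1 + 8); // 1-byte type + 8-byte double
  const view = new DataView(buffer);
  view.setUint8(0, MSG_TYPE_MOVE);
  view.setFloat64(1, x, true); // ASSUMPTION: little-endian on the wire
  return new Uint8Array(buffer);
}

// Reads must happen in exactly the order the writes happened.
export function decodeMove(packet: Uint8Array): { type: number; x: number } {
  if (packet.length !== 9) {
    throw new Error(`expected 9 bytes, got ${packet.length}`);
  }
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  return { type: view.getUint8(0), x: view.getFloat64(1, true) };
}
```

Each traced `sendmessage` site would get one encoder/decoder pair like this, so the schema document and the code stay in lockstep with the GML call order.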

**Warning signs:**
- Anyone proposing "let's sniff the wire and figure it out" — redirect to GML-first.
- Packet schema doc has fields like "unknown bytes 5-12" — go back to GML, the answer is in the script.
- Server emulator works for the first packet then desyncs — almost always a missed `write*` call somewhere mid-script that only fires on a code path you haven't traced.

**Phase to address:**
Stage 3 (Server analysis) — bake the GML-first procedure into the protocol-RE runbook.

---

### A5: Treating `.bno` / `.bnb` / `.bnu` as parseable formats

<!-- ERRATA 2026-05-03 (Phase 3 D-08): file_bin_* → file_text_*. Ground truth in extracted/server-5-4/scripts/0365-mb_backup.gml + 0367-users_restore.gml. -->

**What goes wrong:**
Engineer opens `Settings.bno` in a hex editor, sees lines like `Jarhead111\nbahoobutt\n0.0E+0\n...`, assumes it's some flavor of INI, writes a parser, ships save-data migration. Fields go in wrong slots because the field meaning is positional and the order is determined by the **order of `file_text_read_*` calls** (line-based primitives — note: NOT `file_bin_*`; see ERRATA above and `decomp/wiki/16-bno-bnb-notes.md`) in the GML loader script.

**Why it happens:**
Per `decomp/wiki/16-bno-bnb-notes.md` (referenced from CONCERNS.md): no external spec exists. The format **is** the call sequence of `file_text_*` (line-based) primitives — confirmed via `extracted/server-5-4/scripts/0365-mb_backup.gml` (`.bnb` writer with `@TOPIC`/`@REPLY` markers) and `0367-users_restore.gml` (`.bnu` reader). Earlier revisions of this section incorrectly cited `file_bin_*`; the `file_text_*` correction is the verified ground truth. A "double" field and a "string of digits that looks like a double" still cannot be distinguished without the GML.

**How to avoid:**
Do not write `.bno` / `.bnb` / `.bnu` parsers until Stage 3 has produced a documented schema **derived from the Master `.gmd`'s file-IO scripts**. The schema lives in `.planning/research/` (or equivalent) and includes: file-name → load-script-name → ordered field list (name, type, size, semantics). Stage 4 migration code consumes this schema; it never reads bytes blind.
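
A minimal sketch of what "migration code consumes the schema" could look like for a line-based file. The field names and types below are placeholders for illustration only; the real ordered field list has to come from the Stage 3 schema, and all identifiers here are hypothetical.

```typescript
type FieldType = "string" | "real";

interface FieldSpec {
  name: string;
  type: FieldType;
}

// PLACEHOLDER schema: real names, types and order come from the GML loader.
export const exampleSchema: FieldSpec[] = [
  { name: "fieldA", type: "string" },
  { name: "fieldB", type: "string" },
  { name: "fieldC", type: "real" },
];

// Reads one value per line, in schema order, the way file_text_* calls would.
export function parsePositional(
  text: string,
  schema: FieldSpec[],
): Record<string, string | number> {
  const lines = text.split(/\r?\n/);
  if (lines.length < schema.length) {
    throw new Error(`expected ${schema.length} lines, got ${lines.length}`);
  }
  const result: Record<string, string | number> = {};
  schema.forEach((field, index) => {
    const raw = lines[index] ?? "";
    if (field.type === "real") {
      const value = Number(raw);
      if (raw.trim() === "" || !Number.isFinite(value)) {
        throw new Error(`line ${index + 1} (${field.name}) is not a real: "${raw}"`);
      }
      result[field.name] = value;
    } else {
      result[field.name] = raw;
    }
  });
  return result;
}
```

Failing loudly on a type mismatch is deliberate: a positional offset error then surfaces at migration time instead of as garbled data in a live account.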

**Warning signs:**
- Migration script "mostly works" but a few users have garbled inventory or wrong levels — positional offset error.
- Field-count mismatch between save files of nominally the same version (some `.bnu` files are older format and the loader had a versioning branch you didn't trace).
- `Settings.bno` parses fine but `MSettings.bno` doesn't — different loader script, different field order.

**Phase to address:**
Stage 3 (Server analysis) — schema extraction; Stage 4 (Server rebuild) — migration code consumes schema, never reads bytes blind.

---

### A6: Trusting only the latest `.gmd` and ignoring `.gb1`–`.gb9` backups

**What goes wrong:**
Engineer extracts `BN Online Master 5-4.gmd` and `BN Online Client 5-8.gmd`, declares Stage 1 done, moves on. Later in Stage 3, a referenced script appears to be missing or behaves oddly — turns out the latest `.gmd` was saved mid-refactor, and an earlier `.gb1` has the working version of the same script.

**Why it happens:**
GameMaker 5.3a writes `.gb1`–`.gb9` rotational autosaves on every IDE save (per `decomp/wiki/14-gb1-backups.md`). They are **byte-identical to `.gmd` once renamed** but represent rotation history. Filename versioning (`5-2 DEBUG`, `5-6 TSide Revamp`) is the only changelog (per CONCERNS.md). The "latest" file might not be the most coherent file.

**How to avoid:**
Stage 1 extracts **every `.gmd` and every `.gb1`–`.gb9`** in `legacy/source-archive/`, `legacy/open-source-release/`, and `legacy/servers/*/` to plaintext. Land them in a structured tree per source file with a manifest noting filename, mtime, and source path. Stage 2/3 analysis can then diff across revisions. Build a "synthetic version history" by ordering extracted scripts chronologically (per CONCERNS.md fix approach).

**Warning signs:**
- A script references a function or object that doesn't exist in the latest `.gmd` — it likely exists in an earlier backup.
- Server behavior documented in `legacy/open-source-release/,ServerCommands.txt` doesn't match what's in `BN Online Master 5-4.gmd` — admin-keybind code may have lived in a Ctrl+O snippet (per `legacy/servers/enlyzeam-current/Ctrl+O Codes.txt`) that was never persisted to the project.
- Three server snapshots (enlyzeam-current, enlyzeam-archive, local-current) have overlapping but non-identical `.gmd` files (per CONCERNS.md "Three drift'd server snapshots").

**Phase to address:**
Stage 1 (Extraction) — extract everything; Stage 2/3 (Analysis) — cross-reference revisions when something looks wrong.

---

### A7: Skipping the WinXP VM and trying to run GMD-Recovery on Windows 11

**What goes wrong:**
Static decompilation (Rank 2) fails because the runner is packed, and engineer tries Rank 3 (GMD-Recovery dynamic injection) directly on Windows 11. Tool either doesn't run, hangs, or attaches but reads garbage from a memory layout it doesn't recognize.

**Why it happens:**
GMD-Recovery hooks Win32 process memory at offsets that assume a Windows XP-era loader. Per `decomp/wiki/15-extraction-pipeline.md` and CONCERNS.md "GameMaker 5.3a — abandoned 22+ years": the IDE and runner target XP kernels and run unreliably on modern Windows without compatibility shims.

**How to avoid:**
Establish a **Windows XP VM** as a Stage 1 prerequisite (before any extraction begins), not as a fallback discovered mid-Stage-1. Document VM image provenance, snapshot before each extraction run, and treat the VM as the **only** environment where Rank 3 happens. Try Rank 2 on the host (it's safe — no execution); only drop to Rank 3 inside the VM.

**Warning signs:**
- GMD-Recovery throws access-violation errors immediately.
- Memory dump has zero recognizable strings (because the runner never actually ran past initialization).
- Tool reports "process not found" even though Task Manager shows it.

**Phase to address:**
Stage 1 (Extraction) — VM setup is a Day-0 task, not a "we'll figure it out if we need it" item.

---

## Section B — Multiplayer Game Server Pitfalls

### B1: Trusting the client for game state

**What goes wrong:**
Server accepts a position update from the client and writes it directly to authoritative state. A user sends `{x: 99999, y: 99999}` and teleports anywhere. Or sends "I picked up item X" without server-side validation that they were near it.

**Why it happens:**
The original 39dll-based server may itself have trusted the client; it was not hardened in other respects either (no transport encryption, see CONCERNS.md "No transport encryption"). Porting that pattern faithfully reproduces the trust gap. Authoritative-server is a **discipline**, not a flag: every handler must validate.

**How to avoid:**
Hard rule in Stage 4: **client sends intent, server emits state**. Movement: client sends `{intent: "move-north", dt}`, server runs movement physics with collision and emits new authoritative position. Inventory: client sends `{intent: "pickup", item_local_id}`, server validates proximity and ownership before mutating. Make this enforceable with a TypeScript shape: every inbound message type carries `Intent` in its name; no inbound message type ever contains an authoritative state field.
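
One way the "Intent in the name, no authoritative fields" rule might look in code. This is a sketch under assumptions: the step size, pickup range, and all type names are hypothetical, collision is omitted, and the client-supplied `dt` is left out so the server alone decides how far a step moves.

```typescript
type Direction = "north" | "south" | "east" | "west";

// Inbound messages carry intent only; none has an x/y or inventory field.
interface MoveIntent { kind: "MoveIntent"; direction: Direction; }
interface PickupIntent { kind: "PickupIntent"; itemLocalId: number; }
type InboundIntent = MoveIntent | PickupIntent;

interface PlayerState { x: number; y: number; inventory: number[]; }
interface WorldItem { x: number; y: number; }

const STEP_PX = 4; // ASSUMPTION: placeholder, real value comes from Stage 2
const PICKUP_RANGE_PX = 16; // ASSUMPTION: placeholder

// Returns true when the intent was accepted and state was mutated.
export function applyIntent(
  player: PlayerState,
  items: Map<number, WorldItem>,
  intent: InboundIntent,
): boolean {
  switch (intent.kind) {
    case "MoveIntent": {
      if (intent.direction === "north") player.y -= STEP_PX;
      else if (intent.direction === "south") player.y += STEP_PX;
      else if (intent.direction === "east") player.x += STEP_PX;
      else player.x -= STEP_PX;
      return true;
    }
    case "PickupIntent": {
      const item = items.get(intent.itemLocalId);
      if (!item) return false; // unknown or already taken
      const distance = Math.hypot(item.x - player.x, item.y - player.y);
      if (distance > PICKUP_RANGE_PX) return false; // too far away
      items.delete(intent.itemLocalId);
      player.inventory.push(intent.itemLocalId);
      return true;
    }
  }
}
```

Because `InboundIntent` is a closed union, adding a message type that smuggles in a position field shows up in code review as a new interface rather than as an extra property on an untyped object.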

**Warning signs:**
- Any `if (msg.x !== undefined) state.x = msg.x` pattern in handlers.
- A handler that mutates state without checking the actor's permission/proximity/cooldown.
- Test with a hand-crafted WS frame: can a malicious client achieve a state change you'd reject in the UI? If yes, the server is not authoritative for that path.

**Phase to address:**
Stage 4 (Server rebuild) — bake into the handler scaffold from the first packet.

---

### B2: Zombie sessions on WebSocket disconnect

**What goes wrong:**
User's connection drops (laptop closes, WiFi flake, NAT timeout). Server's `socket.on('close')` doesn't fire promptly because TCP keep-alive defaults are 2 hours. Server still thinks the user is in the room. Other players see a frozen avatar; the user reconnects, gets "already logged in" error, can't rejoin.

**Why it happens:**
WebSocket close events are not guaranteed to fire on network failures — only on clean closes. Without an application-level heartbeat (ping/pong) with timeout, half-open connections persist until OS-level TCP timeout.

**How to avoid:**
- Application-level heartbeat: server sends WS ping every 15s, expects pong within 10s; missed pong → `terminate()` the socket and clean up session.
- On reconnect, **replace** any existing session for the same account atomically (kick the old socket, install the new one) rather than rejecting the new login. Wrap in a per-account mutex so two near-simultaneous reconnects don't race.
- Persist a short reconnect grace window (~30s) where the player's avatar stays in the room (marked "disconnecting") so reconnecting feels seamless.
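
The heartbeat bullet could be sketched as a small, transport-agnostic monitor. Everything here is hypothetical naming; the only assumption about the socket is that it has `ping()` and `terminate()` methods, which the `ws` package's `WebSocket` provides. Time is passed in explicitly so the logic can be tested without real timers.

```typescript
// Minimal structural view of a socket (a `ws` WebSocket should fit this).
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

interface HeartbeatState {
  lastPingAtMs: number;
  lastPongAtMs: number;
}

export class HeartbeatMonitor {
  private readonly sockets = new Map<PingableSocket, HeartbeatState>();
  private readonly pingIntervalMs: number;
  private readonly pongTimeoutMs: number;
  private readonly onDead: (socket: PingableSocket) => void;

  constructor(
    pingIntervalMs: number,
    pongTimeoutMs: number,
    onDead: (socket: PingableSocket) => void,
  ) {
    this.pingIntervalMs = pingIntervalMs;
    this.pongTimeoutMs = pongTimeoutMs;
    this.onDead = onDead;
  }

  track(socket: PingableSocket, nowMs: number): void {
    this.sockets.set(socket, { lastPingAtMs: nowMs, lastPongAtMs: nowMs });
  }

  // Call from the socket's pong event.
  notePong(socket: PingableSocket, nowMs: number): void {
    const state = this.sockets.get(socket);
    if (state) state.lastPongAtMs = nowMs;
  }

  // Call on a short timer, e.g. once per second.
  sweep(nowMs: number): void {
    this.sockets.forEach((state, socket) => {
      const awaitingPong = state.lastPingAtMs > state.lastPongAtMs;
      if (awaitingPong && nowMs - state.lastPingAtMs > this.pongTimeoutMs) {
        this.sockets.delete(socket);
        socket.terminate(); // half-open connection: do not wait for 'close'
        this.onDead(socket); // session cleanup hook
      } else if (!awaitingPong && nowMs - state.lastPingAtMs >= this.pingIntervalMs) {
        state.lastPingAtMs = nowMs;
        socket.ping();
      }
    });
  }
}
```

With `new HeartbeatMonitor(15_000, 10_000, cleanupSession)`, a silent socket is terminated roughly 25 seconds after its last sign of life; `cleanupSession` is where the reconnect grace window would start.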

**Warning signs:**
- Players reporting "I can't log in, says I'm already online" — zombie session.
- Other players seeing a non-moving avatar that doesn't respond to chat — stuck session.
- Server memory grows monotonically over hours — session map leaking.

**Phase to address:**
Stage 4 (Server rebuild) — heartbeat and reconnect logic are part of the session lifecycle from day 1, not an afterthought.

---

### B3: Message framing assumptions on binary WS

**What goes wrong:**
Engineer assumes "WebSocket gives me message boundaries, so I can just `JSON.parse(data)` or read my packet struct directly." Then under load or with large messages, the WebSocket library in use hands the handler a single fragment instead of the reassembled message, or two small messages arrive as one buffer in some libraries' configs, and the deserializer mis-frames everything downstream.

**Why it happens:**
WebSocket **does** preserve message boundaries at the protocol level (unlike raw TCP). But: (a) some libraries (especially `uWebSockets.js` in certain modes) may deliver per-frame rather than per-message if `permessage-deflate` and continuation frames interact weirdly; (b) developers used to TCP byte streams sometimes write length-prefixed parsers on top of WS, which is correct in spirit but error-prone if not also checking the WS opcode.

**How to avoid:**
- Use `ws` library's default behavior (one `'message'` event = one complete WebSocket message) and **trust the framing** — do NOT write a length-prefixed re-framer on top.
- For binary frames: every message is a self-contained packet. First byte is message type; remaining bytes are the payload according to the schema for that type.
- In tests, send fragmented frames deliberately and confirm the handler still receives one logical message.

**Warning signs:**
- Intermittent "unknown packet type" errors on the server with non-matching first bytes.
- Large messages (>16KB) cause desyncs while small messages work.
- Switching from `ws` to `uWebSockets.js` (or vice versa) changes packet-handling correctness — schema is too coupled to library framing semantics.

**Phase to address:**
Stage 4 (Server rebuild) — pick the WS library (`ws` is the conservative default for v1's <50-player target) and lock framing assumptions in tests.

---

### B4: Rate limiting / chat flood ignored until production

**What goes wrong:**
A single client sends 10,000 chat messages per second. Server fans them out to every player in the room. Every other client's WS buffer fills, server buffers fill, garbage collector thrashes, server falls over.

**Why it happens:**
Rate limiting feels like "we'll add it before launch" but the right place is **at the inbound packet handler**, before any server work happens. Bolting it on after handlers exist means revisiting every handler.

**How to avoid:**
Per-account, per-message-type token bucket at the packet dispatcher. Defaults (tunable):
- Movement intents: 30/s (matches expected tick rate; original game likely <30 FPS).
- Chat: 1/s sustained, burst of 5.
- Auth attempts: 5/min per IP.
- All other: 10/s.

Exceeding the bucket: drop packet silently for movement (idempotent, since the next intent supersedes); for chat/auth, send a rate-limit notice to the offender. Implement as middleware at Stage 4 packet dispatch, not per-handler.
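
A minimal sketch of the per-account, per-message-type token bucket described above. The class and key names are hypothetical, the burst sizes for movement and "other" are assumptions (only the chat burst is given above), and the per-IP auth limit is left out because it keys on a different identity.

```typescript
interface Limit {
  ratePerSec: number;
  burst: number;
}

// Rates follow the defaults listed above; bursts other than chat are assumed.
const LIMITS = new Map<string, Limit>([
  ["move", { ratePerSec: 30, burst: 30 }],
  ["chat", { ratePerSec: 1, burst: 5 }],
]);
const DEFAULT_LIMIT: Limit = { ratePerSec: 10, burst: 10 };

class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly limit: Limit;

  constructor(limit: Limit, nowMs: number) {
    this.limit = limit;
    this.tokens = limit.burst; // start full so normal play is never throttled
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.limit.burst, this.tokens + elapsedSec * this.limit.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

export class RateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();

  // Call once per inbound packet, before any handler runs.
  allow(accountId: string, messageType: string, nowMs: number): boolean {
    const key = `${accountId}:${messageType}`;
    let bucket = this.buckets.get(key);
    if (!bucket) {
      bucket = new TokenBucket(LIMITS.get(messageType) ?? DEFAULT_LIMIT, nowMs);
      this.buckets.set(key, bucket);
    }
    return bucket.tryConsume(nowMs);
  }
}
```

The dispatcher would call `allow(...)` first and only then look up the handler, so a flood costs one map lookup and a few arithmetic operations per packet. Buckets for disconnected accounts should be evicted with the session to avoid a slow leak.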

**Warning signs:**
- Server CPU spikes correlated with a single user being online.
- Chat history fills with rapid identical messages.
- Other players' clients lag when one player is "active."

**Phase to address:**
Stage 4 (Server rebuild) — rate limit middleware exists from the first handler.

---

### B5: bcrypt migration silently breaking dormant accounts

**What goes wrong:**
Plan: on first successful login, transparently re-hash the password. Reality: a user who hasn't logged in for 5 years tries the new system, password works, gets re-hashed, fine. **But:** a user whose remembered password is slightly different ("was it `harrypotter` or `Harrypotter`?") gets "invalid password" with no recourse — there's no forgot-password flow because there's no email on file in the original `localList.txt` (per CONCERNS.md it's just username + plaintext password, ~298 accounts).

The deeper risk: the bcrypt migration code accidentally hashes the **already-hashed** value on the user's second login (because the schema flag wasn't set), permanently locking the account.

**Why it happens:**
Two-state migration (legacy plaintext vs new bcrypt) needs an explicit per-row flag and an idempotent flow. Easy to get wrong with subtle race conditions (concurrent logins to the same account).

**How to avoid:**
- Add `password_hash_algo` enum column: `'legacy_plaintext' | 'bcrypt'`.
- Login flow:
  ```
  if algo == 'legacy_plaintext':
      if user_input == stored_value:
          stored_value = bcrypt(user_input)
          algo = 'bcrypt'
          commit   # stored_value and algo change in the same transaction
          accept
      else: reject
  elif algo == 'bcrypt':
      if bcrypt.compare(user_input, stored_value): accept
      else: reject
  ```
- Wrap the upgrade in a transaction with an account-level lock so two concurrent logins don't both try to upgrade.
- Add an audit log row on every algo transition; alert on any double-transition (would indicate the bug).
- **Provide an out-of-band reset path** for accounts where the original password is forgotten: a manual admin tool that issues a one-time reset token printed to the admin console (since there's no email on file). Document this in the runbook before launching.
- Hash the passwords in the migration source data so the legacy plaintext file isn't sitting on disk forever.
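
The two-state login flow could be sketched in TypeScript as below. The hasher is injected so the state machine can be tested without a real bcrypt dependency; a production hasher might wrap a bcrypt library's `hashSync` / `compareSync`. All names are hypothetical, and the transaction and account-level lock are assumed to wrap the call rather than being shown.

```typescript
type PasswordAlgo = "legacy_plaintext" | "bcrypt";

interface AccountRow {
  passwordHashAlgo: PasswordAlgo;
  storedValue: string;
}

// Injected so tests can use a fake; real code would back this with bcrypt.
interface PasswordHasher {
  hash(plain: string): string;
  verify(plain: string, hashed: string): boolean;
}

// ASSUMPTION: the caller holds the per-account lock and commits `row`
// atomically, so algo and storedValue can never be observed out of sync.
export function login(row: AccountRow, input: string, hasher: PasswordHasher): boolean {
  if (row.passwordHashAlgo === "legacy_plaintext") {
    if (input !== row.storedValue) return false;
    // Upgrade exactly once: both fields change together.
    row.storedValue = hasher.hash(input);
    row.passwordHashAlgo = "bcrypt";
    return true;
  }
  // Already upgraded: never re-hash, only verify.
  return hasher.verify(input, row.storedValue);
}
```

The double-hash bug described above corresponds to reaching the `hasher.hash` line while `storedValue` already holds a hash, which the explicit `passwordHashAlgo` check rules out as long as both fields are committed together.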

**Warning signs:**
- A user reports "I logged in fine yesterday, now my password doesn't work" — possible double-hash.
- Audit log shows non-monotonic algo transitions (`bcrypt` → `legacy_plaintext` is impossible; if it appears, you have a bug).
- First login on a legacy account completes much faster than a bcrypt login (no bcrypt work was done, which means the upgrade didn't run).

**Phase to address:**
Stage 4 (Server rebuild) — migration is a Stage 4 task with its own test suite covering: cold legacy login, hot legacy login under concurrency, second login post-upgrade, wrong-password legacy, wrong-password post-upgrade, manual admin reset flow.

---

### B6: Tick rate drift — server time vs client time

**What goes wrong:**
Server runs game loop at 30 Hz. Some ticks take longer than 33ms (GC pause, persistence flush, slow handler). Server falls behind. Either: (a) client sees jerky updates because server delivered no state for several frames, or (b) server "catches up" by running 5 ticks back-to-back, applying 5 frames of input in 1ms, and players experience teleportation.

**Why it happens:**
Naive `setInterval(tick, 33)` drifts. Naive "catch up" by running missed ticks back-to-back violates the assumption that `dt` per tick is bounded.

**How to avoid:**
- Use a **fixed-timestep accumulator** loop (Glenn Fiedler style): measure real time delta, accumulate, run as many fixed-dt physics ticks as fit, **cap** at e.g. 5 catch-up ticks to prevent spiral-of-death.
- Send periodic `server_time` in the heartbeat so the client can compute its clock offset and time-stamp its own intents in server-time.
- Log per-tick duration; alert if p99 > 2x target.
- Decouple expensive work (persistence, large fan-out) from the tick: offload to a worker or queue.
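
A minimal sketch of the fixed-timestep accumulator with a catch-up cap. The class name and the choice to discard whole missed ticks once the cap is hit are assumptions; the caller supplies the measured real-time delta, which keeps the loop testable without timers.

```typescript
export class FixedTimestepLoop {
  private accumulatorMs = 0;
  private readonly stepMs: number;
  private readonly maxCatchUpTicks: number;
  private readonly tick: (dtMs: number) => void;

  constructor(stepMs: number, maxCatchUpTicks: number, tick: (dtMs: number) => void) {
    this.stepMs = stepMs;
    this.maxCatchUpTicks = maxCatchUpTicks;
    this.tick = tick;
  }

  // Feed in the measured wall-clock delta; returns how many ticks ran.
  advance(realDeltaMs: number): number {
    this.accumulatorMs += realDeltaMs;
    let ticksRun = 0;
    while (this.accumulatorMs >= this.stepMs && ticksRun < this.maxCatchUpTicks) {
      this.tick(this.stepMs); // dt per tick is always the same bounded value
      this.accumulatorMs -= this.stepMs;
      ticksRun++;
    }
    if (this.accumulatorMs >= this.stepMs) {
      // Cap reached: drop the backlog of whole ticks instead of spiralling.
      this.accumulatorMs %= this.stepMs;
    }
    return ticksRun;
  }
}
```

A driver would call `advance(now - last)` from a timer and log the returned tick count; repeatedly hitting the cap is the signal that per-tick work is too slow and should be moved off the tick.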

**Warning signs:**
- Player positions appear to "snap" forward periodically.
- Server log shows tick durations occasionally >100ms.
- Two players with low ping see each other lag-spike at the same time → server-side, not network.

**Phase to address:**
Stage 4 (Server rebuild) — game loop is one of the first things written; get the accumulator pattern right from the start.

---

### B7: Persistence loss on Fly.io machine restart

**What goes wrong:**
Fly.io restarts the machine for a routine maintenance event, scaling event, or deploy. The server was holding 30 minutes of unsaved state in memory. Players lose progress. Or the server was mid-save when killed and the save file is half-written / corrupt.

**Why it happens:**
Fly machines can restart at any time. Stateful WS servers must assume their process can die between any two instructions. Naive "save every 5 minutes" loses up to 5 minutes; naive "save synchronously on every change" tanks the tick rate.

**How to avoid:**
- **Atomic writes** for save files: write to `tmp_path`, `fsync`, rename to `final_path`. Never overwrite in place.
- **Write-ahead log (WAL)** for in-flight changes: append-only log of state mutations, fsync per batch, rotated periodically. On crash recovery, replay the WAL from the last full snapshot. SQLite's WAL mode does this for you if persistence layer is SQLite (likely choice per `.planning/PROJECT.md` Key Decisions).
- **Litestream** to S3-compatible storage for SQLite — gives near-RPO-zero off-machine backup (continuous WAL streaming).
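- **Graceful shutdown handler**: on the shutdown signal, stop accepting new connections, send shutdown notice to clients, flush WAL, exit. Fly.io sends the `kill_signal` configured in `fly.toml` (SIGINT by default, so either set it to SIGTERM or handle both) and waits `kill_timeout` before SIGKILL.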
- **Health check** that fails if WAL hasn't been flushed in N seconds — surfaces stuck-flush bugs.
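
The atomic-write bullet could look like this in Node. It is a sketch using only `node:fs` and `node:path`; the function name and temp-file naming scheme are hypothetical. The temp file is created next to the target so the rename stays on one filesystem.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Write-to-temp, fsync, rename. Readers see either the old file or the
// complete new file, never a half-written one.
export function writeFileAtomic(finalPath: string, data: string | Uint8Array): void {
  const dir = path.dirname(finalPath);
  const tmpPath = path.join(dir, `.${path.basename(finalPath)}.${process.pid}.tmp`);
  const fd = fs.openSync(tmpPath, "w");
  try {
    fs.writeFileSync(fd, data);
    fs.fsyncSync(fd); // force the bytes to disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmpPath, finalPath); // replaces the target in one step
}
```

This covers flat save files only. It does not fsync the containing directory, which a stricter implementation would do on Linux, and it is unnecessary for data that already lives in SQLite.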

**Warning signs:**
- Player reports "I leveled up and then it was gone" after a deploy.
- Save file corruption errors after a hard restart.
- Persistent volume disk usage growing without bound (WAL not rotating).

**Phase to address:**
Stage 5 (Deploy) — Fly machine lifecycle is Stage 5 territory; Stage 4 (Server rebuild) — the WAL / atomic-write discipline must exist before deploy.

---

### B8: Schema versioning between client and server during rolling deploys

**What goes wrong:**
Engineer adds a new field to a packet, deploys server, deploys client. For ~30s, old clients are connected to new server (or new clients to old server during deploy ordering). Mismatched schemas cause parse errors, dropped messages, or worse — silent data corruption (a field added in the middle shifts every subsequent field).

**Why it happens:**
Binary packet schemas are positional (especially when faithfully porting 39dll's "call order = wire format"). Adding a field is not backward-compatible by default.

**How to avoid:**
- Every packet starts with a 1-byte (or 2-byte) **type discriminator** AND a 1-byte **version**. Server dispatches by `(type, version)` to versioned handlers.
- New fields go **at the end** of the payload, never inserted in the middle.
- Server understands the previous protocol version for at least one deploy cycle.
- Client sends supported protocol version on connect; server responds with negotiated version.
- On mismatch: client shows "please reload" overlay rather than connecting and silently corrupting.
- Shared TypeScript packet definitions live in a `shared/` package consumed by both client and server; bumping a version in `shared/` is the trigger for the deploy-cycle conversation.
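
A minimal sketch of `(type, version)` dispatch, assuming a 1-byte type followed by a 1-byte version as described in the first bullet. The class name and the packing of the pair into a single map key are illustrative choices, not an existing API.

```typescript
type PacketHandler = (payload: Uint8Array) => void;

export class VersionedDispatcher {
  private readonly handlers = new Map<number, PacketHandler>();

  // Key packs the two header bytes: high byte = type, low byte = version.
  register(type: number, version: number, handler: PacketHandler): void {
    this.handlers.set(((type & 0xff) << 8) | (version & 0xff), handler);
  }

  // Returns false for truncated packets and unknown (type, version) pairs,
  // so the caller can answer with a "please reload" notice.
  dispatch(packet: Uint8Array): boolean {
    if (packet.length < 2) return false;
    const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
    const key = (view.getUint8(0) << 8) | view.getUint8(1);
    const handler = this.handlers.get(key);
    if (!handler) return false;
    handler(packet.subarray(2)); // payload starts after the 2-byte header
    return true;
  }
}
```

Supporting version N-1 for one deploy cycle then means keeping two `register` calls for the same type, and deleting the older one in the following release.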

**Warning signs:**
- Mid-deploy: chat messages contain garbage characters (string-length field misread as part of body).
- Client crashes with "unknown packet type 0xFF" — almost always a version-skew issue.
- A new feature works in dev but breaks an old client that hasn't reloaded.

**Phase to address:**
Stage 4 (Server rebuild) — version discriminator from packet 1; Stage 5 (Deploy) — deploy procedure documents the "support N-1 for one cycle" rule.

---

### B9: Room state leakage when rooms hot-restart

**What goes wrong:**
Server detects a room is in a bad state (deadlocked timer, stale entity, etc.) and "restarts" the room by reinitializing it. But entity references held elsewhere (in player session objects, in the entity manager, in pending message queues) still point at the old objects. Operations on those stale references either no-op silently or, worse, mutate ghost state that's never visible to anyone.

**Why it happens:**
Room restart is ad-hoc maintenance code that develops late. By then, references to room state are spread across the codebase and the "restart" doesn't enumerate all of them.

**How to avoid:**
- Avoid ad-hoc "restart this room" entirely. If a room is in a bad state, restart the whole server (with WAL recovery from B7).
- If room hot-restart is genuinely needed: assign each room a monotonically increasing `room_instance_id`. All entity references include the `(room_id, room_instance_id)` tuple. Operations check the instance ID matches the current; mismatched ops are dropped with a warning log.
- Treat any "restart this subsystem" code with deep suspicion — usually it indicates the subsystem isn't designed to recover, which is a deeper problem.
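
If hot-restart is kept, the `(room_id, room_instance_id)` check might be sketched like this. All names are hypothetical; the point is only that every operation goes through one guard that compares the reference's instance ID with the room's current one.

```typescript
interface EntityRef {
  roomId: string;
  roomInstanceId: number;
  entityId: number;
}

export class RoomRegistry {
  private readonly currentInstance = new Map<string, number>();

  // Opens a room, or restarts it if it already exists.
  // Returns the new, strictly increasing instance id.
  start(roomId: string): number {
    const next = (this.currentInstance.get(roomId) ?? 0) + 1;
    this.currentInstance.set(roomId, next);
    return next;
  }

  isCurrent(ref: EntityRef): boolean {
    return this.currentInstance.get(ref.roomId) === ref.roomInstanceId;
  }
}

// Single choke point for entity operations: stale references are dropped.
export function applyToEntity(
  registry: RoomRegistry,
  ref: EntityRef,
  operation: () => void,
): boolean {
  if (!registry.isCurrent(ref)) {
    return false; // caller should log a warning with the stale ref
  }
  operation();
  return true;
}
```

The guard turns "mutating ghost state" into an explicit, countable event, which also makes the warning signs above easy to alert on.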

**Warning signs:**
- After a room restart: some players in the room can move, others can't.
- Chat messages addressed to a recently-restarted room appear to vanish.
- Entity counts don't decrease after a restart even though no players are in the room.

**Phase to address:**
Stage 4 (Server rebuild) — design the entity/reference model to either tolerate or forbid hot-restart from day 1.

---

### B10: Lag compensation copied from FPS playbooks when the original game didn't have it

**What goes wrong:**
Engineer reads about lag comp / client-side prediction / server reconciliation from FPS literature, implements it. Original BNO did **not** have any of that: players moved in discrete steps (grid-locked or per-pixel, to be confirmed in Stage 2), server sent state, client displayed it, no rewinding. New client now interpolates and predicts, and the "feel" diverges from the original because movement is now smooth where it used to be discrete-frame.

**Why it happens:**
Modern multiplayer best practices assume an FPS-style continuous world. BNO is closer to a 2D MMO with discrete tile/pixel movement — different problem class.

**How to avoid:**
Stage 2 (Client analysis) **must** document the original movement model precisely: pixels-per-tick, tick rate, whether movement is grid-locked or pixel-free, whether the client renders intermediate frames or snaps. Stage 6 (Client rebuild) implements **exactly that model** for the MVP. Lag comp is an explicit decision in Stage 7 (Full parity), not a default. If the original "feel" was that you sometimes saw other players take a small visible jump on lag, **preserve that** — players who loved the original game loved that feel.

**Warning signs:**
- Stage 6 client uses `lerp` for other players' positions and you didn't decide that explicitly.
- Original players testing the new client say "it feels different but I can't say why" — that's the smoothing.
- Client-side prediction code exists before server has even shipped — premature.

**Phase to address:**
Stage 2 (Client analysis) — document movement model; Stage 6 (Client rebuild) — match it; Stage 7 (Full parity) — explicit decision on whether to add interpolation.

---

## Section C — Faithful-Rebuild Pitfalls

### C1: Movement feel — pixel-per-tick differences

**What goes wrong:**
Original game moved 4 pixels per frame at 30 FPS (120 px/sec). New client moves "fast enough that it feels right" (5 pixels per frame at 60 FPS = 300 px/sec, 2.5x the original). Original players notice within seconds that "this is wrong", even if they can't articulate why.

**Why it happens:**
Movement speed in GameMaker 5.3a was implicit in per-frame `speed` / `hspeed` / `vspeed` values and tied to room speed. Easy to lose the exact constant during a port.

**How to avoid:**
Stage 2 documents every movement-related constant from the extracted GML: room speed (FPS), `hspeed`/`vspeed` for player walking, running, dashing, knockback. Stage 6 implements movement with a fixed-timestep loop at the **same effective px/sec** regardless of render FPS. Add automated tests: "player at (0,0) holding north for 1 game-second arrives at (0, -120)" with the constant from the original GML.
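
The "same effective px/sec regardless of render FPS" requirement and its test could be sketched as follows. The 30 ticks/sec and 4 px/tick constants are the illustrative figures used in this section, not verified values from the GML, and the class name is hypothetical. Ticks are derived from total elapsed time so that rounding in individual frame deltas cannot accumulate.

```typescript
const TICKS_PER_SECOND = 30; // ILLUSTRATIVE: replace with the extracted room speed
const PX_PER_TICK = 4; // ILLUSTRATIVE: replace with the extracted walk speed

export class PlayerMover {
  x = 0;
  y = 0;
  private elapsedMs = 0;
  private ticksRun = 0;

  // Called once per rendered frame with that frame's duration.
  update(frameDeltaMs: number, holdingNorth: boolean): void {
    this.elapsedMs += frameDeltaMs;
    const ticksDue = Math.floor((this.elapsedMs * TICKS_PER_SECOND) / 1000);
    while (this.ticksRun < ticksDue) {
      // Movement happens only on game ticks, in whole original-sized steps.
      if (holdingNorth) this.y -= PX_PER_TICK;
      this.ticksRun++;
    }
  }
}

// Simulates holding north for a number of equal-length frames.
export function simulateNorth(frames: number, frameDeltaMs: number): PlayerMover {
  const mover = new PlayerMover();
  for (let i = 0; i < frames; i++) mover.update(frameDeltaMs, true);
  return mover;
}
```

`simulateNorth(50, 20)` and `simulateNorth(200, 5)` both cover one game-second at different render rates, and both should land on the same position, which is the property the automated test needs to pin down.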

**Warning signs:**
- A returning player says "feels off" without specifying.
- Stage 2 documentation has movement speed listed as "TBD" or "approximate."
- Movement speed depends on render FPS in the new client (uncapped framerate makes you faster).

**Phase to address:**
Stage 2 (Client analysis) — extract exact constants; Stage 6 (Client rebuild) — implement with framerate-independent timing.

---

### C2: Audio timing and sample-rate differences

**What goes wrong:**
Original audio assets are MIDI (`.mid`) and WAV at original sample rates (per CONCERNS.md `legacy/audio/bno-songs/*.mid`). New client converts to OGG/MP3 at 44.1kHz. Loops have a 1-frame gap. Sound effects play 50ms later than original because of Web Audio scheduling. MIDI playback in browsers requires a soundfont, and the chosen soundfont sounds different from the original Windows MIDI synth players grew up with.

**Why it happens:**
Browser audio is fundamentally different from Win32 DirectSound + GM5's MIDI playback. Naively "convert and play" loses fidelity; playing the MIDI files in the browser through a soundfont synth sounds different from the original GM5 default MIDI device.

**How to avoid:**
- Audio decisions explicitly in Stage 2 (asset cataloguing): document each sound's original format, sample rate, intended loop points.
- For sound effects: use Web Audio with pre-decoded buffers and explicit scheduling (`source.start(when)`) — never rely on `<audio>` elements.
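- For BGM: prefer pre-rendering MIDI to OGG **with a soundfont chosen to match the original Windows General MIDI sound**, which on stock Windows came from the Microsoft GS Wavetable Synth and its `gm.dls` sample set. Render with FluidSynth and a candidate SoundFont (SGM-V2.01 is one option to audition), compare against a recording from an original-era machine, then render once at build time and ship OGG.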
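- Loop points: record explicit loop start/end points for each track and apply them at playback (Web Audio exposes `loopStart`/`loopEnd` on `AudioBufferSourceNode`), or pre-compute crossfaded loops. Test loop seams empirically.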
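
A minimal sketch of carrying catalogued loop points through to playback, assuming Stage 2 records them as sample offsets. `SoundCatalogEntry` and `applyLoopPoints` are hypothetical names; `LoopableSource` is a structural subset of the real `AudioBufferSourceNode`, whose `loopStart` and `loopEnd` are expressed in seconds:

```typescript
// Hypothetical Stage 2 catalogue record for one looping track.
interface SoundCatalogEntry {
  id: string;
  sampleRate: number;      // Hz of the source file the offsets were measured in
  loopStartSample: number; // first sample of the loop region
  loopEndSample: number;   // sample at which playback jumps back to loopStart
}

// The three AudioBufferSourceNode properties this sketch touches.
interface LoopableSource {
  loop: boolean;
  loopStart: number; // seconds
  loopEnd: number;   // seconds
}

function applyLoopPoints(source: LoopableSource, entry: SoundCatalogEntry): void {
  if (entry.sampleRate <= 0 || entry.loopEndSample <= entry.loopStartSample) {
    throw new RangeError(`invalid loop region for ${entry.id}`);
  }
  source.loop = true;
  // Converting to seconds keeps the points valid whatever rate the browser decodes at.
  source.loopStart = entry.loopStartSample / entry.sampleRate;
  source.loopEnd = entry.loopEndSample / entry.sampleRate;
}
```

In the browser, the same function accepts a node from `audioContext.createBufferSource()` directly, since it only relies on those three properties.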

**Warning signs:**
- A returning player says "the music sounds different" — soundfont mismatch.
- Sound effects feel late or "muddy" — Web Audio scheduling not used.
- Loops have an audible click — gapless looping not configured.

**Phase to address:**
Stage 2 (Client analysis — asset cataloguing); Stage 6 (Client rebuild) — implement audio pipeline; Stage 7 (Full parity) — soundfont fidelity verified by original players.

---

### C3: Sprite scaling artifacts on HiDPI displays

**What goes wrong:**
Original game ran at a fixed low resolution (likely 640x480 or 800x600) with pixel-art sprites. New client targets HiDPI Chrome at 1920x1080+. Engineer enables CSS upscaling or browser anti-aliasing; pixel art becomes blurry mush.

**Why it happens:**
Browsers default to bilinear filtering on canvas/image scaling. Pixel art needs nearest-neighbor (or a dedicated pixel-art shader like xBR/HQ4x).

**How to avoid:**
- `image-rendering: pixelated` (CSS) and `imageSmoothingEnabled = false` (canvas context).
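- Phaser 3: set `pixelArt: true` and `roundPixels: true` in game config. PixiJS: make nearest-neighbor the default texture scale mode (`SCALE_MODES.NEAREST` in v7 and earlier, `'nearest'` in v8; where the default is set differs between major versions, so check the docs for the version chosen).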
- Render at a fixed internal resolution (the original's 640x480 or whatever Stage 2 documents) and integer-scale to the viewport. Non-integer scaling breaks pixel art even with nearest-neighbor.
- Optional: support an explicit "smooth scaling" toggle in settings for users who prefer it, but default to crisp nearest-neighbor.
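
The integer-scaling rule above reduces to a small calculation. A minimal sketch, assuming a 640x480 internal resolution purely as an example (Stage 2 documents the real one); `integerLayout` and `Layout` are hypothetical names:

```typescript
interface Layout {
  scale: number;   // whole-number zoom factor
  width: number;   // scaled game width in physical pixels
  height: number;  // scaled game height in physical pixels
  offsetX: number; // left letterbox margin
  offsetY: number; // top letterbox margin
}

// Viewport sizes should be physical pixels (CSS pixels * devicePixelRatio),
// otherwise a HiDPI display reintroduces a fractional scale behind your back.
function integerLayout(viewportW: number, viewportH: number, baseW: number, baseH: number): Layout {
  const fit = Math.min(viewportW / baseW, viewportH / baseH);
  // Never go below 1x; a viewport smaller than the base resolution gets cropped.
  const scale = Math.max(1, Math.floor(fit));
  const width = baseW * scale;
  const height = baseH * scale;
  return {
    scale,
    width,
    height,
    offsetX: Math.floor((viewportW - width) / 2),
    offsetY: Math.floor((viewportH - height) / 2),
  };
}
```

For a 1920x1080 viewport and a 640x480 base, the best fit is 2.25x, so the sketch picks 2x (1280x960) and centers it with 320 px and 60 px margins instead of stretching to a blurry fractional scale.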

**Warning signs:**
- Sprites look "soft" on high-DPI screens.
- Sprite seams visible at non-integer scales (rendering at 2.5x of original).
- Letters in custom font assets are unreadable or have color fringing.

**Phase to address:**
Stage 6 (Client rebuild) — set scaling defaults at engine init; Stage 2 — document original render resolution.

---

### C4: Chat command syntax drift

**What goes wrong:**
Original BNO had chat commands like `/who`, `/tell <name> <msg>`, possibly `!command` admin syntax. New client implements `/whisper` instead of `/tell`, or requires `/who all` instead of bare `/who`. Returning players' muscle memory breaks; they think the feature is missing.

**Why it happens:**
Chat parser is a mundane piece of code easy to redesign with "better" syntax. But returning players type these commands without thinking.

**How to avoid:**
Stage 3 extracts every chat command from the Master `.gmd`'s chat parser GML — exact prefix character, exact command keyword, exact argument grammar, exact error message text. Stage 4 (server) and Stage 6/7 (client UI) reproduce them **verbatim**, including misspellings if any exist in the original (some users may have memorized the misspelling). Add new commands only as additions, never as replacements.
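
A table-driven parser keeps the extracted grammar in one place and makes "additions only" easy to enforce. A minimal sketch using only the two commands named above; the error strings are placeholders to be replaced with the verbatim originals, and the case-sensitive matching and single-space splitting are assumptions that Stage 3 must confirm:

```typescript
type ChatResult =
  | { kind: "say"; text: string }
  | { kind: "who" }
  | { kind: "tell"; target: string; message: string }
  | { kind: "error"; text: string };

// PLACEHOLDERS: replace with the exact strings extracted in Stage 3.
const PLACEHOLDER_UNKNOWN_COMMAND = "<verbatim unknown-command text from Stage 3>";
const PLACEHOLDER_TELL_USAGE = "<verbatim /tell usage text from Stage 3>";

// Splits "a b c" into ["a", "b c"]; the second part is "" when there is no space.
function splitFirstWord(text: string): [string, string] {
  const i = text.indexOf(" ");
  return i === -1 ? [text, ""] : [text.slice(0, i), text.slice(i + 1)];
}

function parseChat(line: string): ChatResult {
  if (!line.startsWith("/")) return { kind: "say", text: line };
  const [keyword, args] = splitFirstWord(line.slice(1));
  switch (keyword) {
    case "who":
      return { kind: "who" };
    case "tell": {
      const [target, message] = splitFirstWord(args);
      if (target === "" || message === "") {
        return { kind: "error", text: PLACEHOLDER_TELL_USAGE };
      }
      return { kind: "tell", target, message };
    }
    default:
      // No "helpful" aliases: /whisper is an unknown command, as in the original.
      return { kind: "error", text: PLACEHOLDER_UNKNOWN_COMMAND };
  }
}
```

Each extracted command then gets one `case` plus one test that feeds the original syntax and compares against the original response text.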

**Warning signs:**
- Chat command list in the new client doesn't match an extracted list from the original.
- "Modernized" command names — `/whisper` instead of `/tell`, `/help` instead of original keyword.
- New error messages for old commands — also disorienting.

**Phase to address:**
Stage 3 (Server analysis) — extract chat parser; Stage 4 — server reproduces; Stage 7 (Full parity) — verify against original via returning-player testing.

---

### C5: Save data migration — `.bnu` users expect their progress

**What goes wrong:**
~298 accounts in `localList.txt` have associated `.bnu` files in `legacy/servers/*/UserData/{HXB,Inv,MB_News}/` (per CONCERNS.md). Engineer ships new server with bcrypt-migrated accounts but forgets to migrate the per-user state — users log in with their old name and start at level 1 with no inventory.

**Why it happens:**
Account migration (`localList.txt` → users table) and state migration (`.bnu` files → user_state tables) are separate problems. Easy to do the first and forget the second. Worse: three drifted server snapshots (per CONCERNS.md) mean it's unclear which `.bnu` is canonical for any given user.

**How to avoid:**
- Stage 3 documents the `.bnu` schema (per A5).
- Stage 4 migration is a single transaction per user: account + all related `.bnu` files, all-or-nothing.
- Resolve the canonical-snapshot question in Stage 3: write a merge tool that prefers most-recent-mtime per user across the three snapshots (per CONCERNS.md fix approach), or pick one snapshot as authoritative and document that decision.
- Allow users to view their migration source (which snapshot, which date) on first login so they can flag "this isn't my latest progress."
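
The most-recent-mtime merge can be sketched as a pure function over a file listing. This version picks one snapshot per user (the one holding that user's newest file) and takes all of that user's files from it, so account state is never stitched together from different snapshots; that policy, the tie-break, and the names `BnuFile`, `CanonicalChoice`, and `pickCanonicalSnapshots` are assumptions for illustration:

```typescript
// One .bnu file found while scanning a server snapshot (hypothetical record shape).
interface BnuFile {
  username: string;
  snapshot: string; // which of the server snapshots the file came from
  path: string;
  mtimeMs: number;
}

interface CanonicalChoice {
  snapshot: string;
  newestMtimeMs: number;
  files: BnuFile[];
}

function pickCanonicalSnapshots(files: BnuFile[]): Map<string, CanonicalChoice> {
  // username -> snapshot -> that user's files in that snapshot
  const perUser = new Map<string, Map<string, CanonicalChoice>>();
  for (const file of files) {
    let bySnapshot = perUser.get(file.username);
    if (bySnapshot === undefined) {
      bySnapshot = new Map<string, CanonicalChoice>();
      perUser.set(file.username, bySnapshot);
    }
    let choice = bySnapshot.get(file.snapshot);
    if (choice === undefined) {
      choice = { snapshot: file.snapshot, newestMtimeMs: file.mtimeMs, files: [] };
      bySnapshot.set(file.snapshot, choice);
    }
    choice.files.push(file);
    choice.newestMtimeMs = Math.max(choice.newestMtimeMs, file.mtimeMs);
  }

  const result = new Map<string, CanonicalChoice>();
  for (const [username, bySnapshot] of perUser) {
    let best: CanonicalChoice | undefined;
    for (const choice of bySnapshot.values()) {
      // Strict ">" means the first snapshot seen wins a tie.
      if (best === undefined || choice.newestMtimeMs > best.newestMtimeMs) best = choice;
    }
    if (best !== undefined) result.set(username, best);
  }
  return result;
}
```

The returned `snapshot` and `newestMtimeMs` are also what a first-login "migration source" notice would display.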

**Warning signs:**
- Users complain "I had X items, now I have Y."
- Migration log shows account migrated but no `.bnu` parsed for that account — orphaned account.
- Some users find their progress, others don't, with no clear pattern — snapshot-selection inconsistency.

**Phase to address:**
Stage 3 (Server analysis) — schema + canonical-snapshot decision; Stage 4 (Server rebuild) — migration transaction.

---

### C6: Room layout drift — off-by-one tile edges

**What goes wrong:**
Stage 6 reimplements rooms by eyeballing screenshots or by reading the original GML's `instance_create` calls. Some tiles end up one pixel off, a wall section is shifted one tile, a teleport trigger is in the wrong spot. Returning players notice immediately because they navigated these rooms by muscle memory.

**Why it happens:**
GameMaker 5.3a stores room layouts as serialized lists of (x, y, object_index) — directly extractable. Engineers sometimes "redraw" rooms instead of importing the data, introducing drift.

**How to avoid:**
- Stage 1 extracts room layouts as structured data (`decomp/wiki/15-extraction-pipeline.md` says this is achievable).
- Stage 6 client **imports the extracted layout data directly** — no manual room building.
- Round-trip test: capture a screenshot of a room in the original game, render the same room in the new client with the same view, and diff pixel-by-pixel. Tile-level drift shows up immediately.
- Specifically test: room transitions / door positions / teleport pads, since off-by-one there breaks navigation entirely.
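
The pixel diff is most useful when it reports which tiles differ rather than a raw mismatch count. A minimal sketch operating on two RGBA buffers of identical size (for example the `data` of two canvas `ImageData` objects); `diffTiles` is a hypothetical name and the tile size is a parameter because the original grid size is not known here:

```typescript
interface TileCoord { tx: number; ty: number; }

// Returns the tiles containing at least one differing pixel, in row-major order.
// Assumes both captures are unscaled and taken with the same view.
function diffTiles(
  original: ArrayLike<number>,
  rebuilt: ArrayLike<number>,
  width: number,
  height: number,
  tileSize: number,
): TileCoord[] {
  const expected = width * height * 4;
  if (original.length !== expected || rebuilt.length !== expected) {
    throw new RangeError("both images must be width * height RGBA pixels");
  }
  const tilesX = Math.ceil(width / tileSize);
  const flagged = new Set<number>();
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const differs =
        original[i] !== rebuilt[i] ||
        original[i + 1] !== rebuilt[i + 1] ||
        original[i + 2] !== rebuilt[i + 2] ||
        original[i + 3] !== rebuilt[i + 3];
      if (differs) {
        flagged.add(Math.floor(y / tileSize) * tilesX + Math.floor(x / tileSize));
      }
    }
  }
  return Array.from(flagged)
    .sort((a, b) => a - b)
    .map((n) => ({ tx: n % tilesX, ty: Math.floor(n / tilesX) }));
}
```

An empty result for a room means the import is pixel-identical at that view; a cluster of flagged tiles along one edge is the typical signature of an off-by-one shift.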

**Warning signs:**
- Player says "this wall used to be here" or "the door is one tile over."
- Visual diff between original screenshot and new render shows misaligned regions.
- Room layout file in the new repo was hand-written rather than imported from extraction.

**Phase to address:**
Stage 1 (Extraction) — get structured room data; Stage 6 (Client rebuild) — consume directly.

---

### C7: "Parity" scope explosion

**What goes wrong:**
Stage 7 ("Full parity") becomes infinite. Every detail — particle effect on attack, exact NPC dialogue tree, every undocumented chat-room easter egg — has someone who remembers it and notices when it's missing. The MVP shipped in Stage 6 keeps not feeling "done." Burnout follows (see D5).

**Why it happens:**
"Parity" is a vibe, not a spec. Without an explicit scope, every gap is a defect. With ~298 users' lifetime memory of the original, the gap surface is unbounded.

**How to avoid:**
- **Define "parity" as a checklist of explicit features** in Stage 2 (client) and Stage 3 (server) outputs. Anything not on the checklist is **not parity** — it's a Stage 8 wishlist item.
- The checklist is closed at the end of Stage 3. New items added later go to "deferred."
- Use the extracted GML/asset inventory as the definitive source: every script, sprite, sound, room, NPC enumerated. Stage 7 success = each item has been explicitly addressed (implemented, deferred, or rejected).
- Plan a small **public retrospective** with original players after Stage 7 is declared done. Collect deferred items as a prioritized backlog, not as Stage 7 scope.

**Warning signs:**
- Stage 7 has been in progress for 2x the original estimate.
- "We can't launch yet, X is missing" where X wasn't on the original checklist.
- Multiple parallel "polish" branches with no merge-back plan.
- The team is afraid to ship because "people will notice things."

**Phase to address:**
Stage 2/3 — closed feature checklist; Stage 7 — measured against the closed list, no scope creep.

---

## Section D — Project Meta-Pitfalls

### D1: Doing extraction + rewrite simultaneously

**What goes wrong:**
Engineer extracts a few scripts, starts porting them to TypeScript while still extracting the rest. Discovers a script depends on a global variable initialized in a script not yet extracted. Discovers a packet type used in code not yet read. Has to re-architect mid-port. Both stages are now half-done and unverifiable.

**Why it happens:**
Excitement to ship something, plus underestimation of how interlinked the extracted code will be. Extraction feels like "boring prep work" and porting feels like "the real project."

**How to avoid:**
**Stages 1-3 are extract-and-document only. No new TypeScript is written before Stage 4.** Treat the extracted-and-documented artifacts as a self-contained product:
- Stage 1 ships: a structured tree of extracted GML, sprites, rooms, sounds.
- Stage 2 ships: a feature catalogue of the client.
- Stage 3 ships: a feature catalogue of the server, the 39dll packet schema, the `.bno`/`.bnb`/`.bnu` schemas, the persistence-layer decision.
- Only then does Stage 4 begin. The prior outputs are inputs; they don't change once a stage is closed.

This is consistent with `.planning/PROJECT.md`'s 7-stage plan; the discipline is **enforcing it** when the impulse to "just start coding" hits.

**Warning signs:**
- New TypeScript files appearing in the repo before Stages 1-3 are formally closed.
- Stage 1/2/3 outputs being edited after Stage 4 begins ("I noticed X while porting Y, let me update the doc") — should instead be a Stage 4 issue/note, not an upstream edit.
- Frequent "wait, what does this script do" moments during Stage 4 — Stage 2/3 was incomplete.

**Phase to address:**
Project meta — enforce stage boundaries via a stage-completion checklist in `.planning/PROJECT.md`. The `/gsd-transition` flow is the enforcement mechanism.

---

### D2: Skipping Stage 2/3 documentation, reading GML while rewriting

**What goes wrong:**
Engineer figures "I'll read the GML as I go in Stage 4." Stage 4 becomes constant context-switching between TypeScript and decompiled GML. Bugs appear weeks later because a script the engineer hadn't yet read held a constraint the new code violated. There's no documentation for the next contributor.

**Why it happens:**
Documentation feels like overhead when you're the only person on the project. "I have it in my head" is a lie that compounds.

**How to avoid:**
Stage 2 and Stage 3 produce **written artifacts** that are the input to Stage 4. The artifacts answer: what features exist, what data shapes flow between them, what the protocol looks like, what edge cases the original handled. If Stage 4 needs a fact that's not in the Stage 2/3 docs, the Stage 4 work pauses and the doc gets updated.

For a **single-developer project**, the discipline is even more important — your future self in Stage 6 won't remember what your Stage 4 self knew about the GML.

**Warning signs:**
- Stage 4 PRs that link to GML line numbers as the only "spec."
- Stage 2/3 docs are <50% the volume you'd expect for a game of this scope.
- "Why does the client do X?" → answer requires re-reading GML rather than checking docs.

**Phase to address:**
Stage 2 and Stage 3 — explicit documentation deliverables. Stage transition gate verifies docs exist.

---

### D3: Phaser/PixiJS lock-in before knowing engine feature requirements

**What goes wrong:**
Engineer picks Phaser 3 in Stage 1 because "it's GameMaker-like." In Stage 6, discovers BNO uses a feature combination Phaser handles awkwardly (e.g., custom blend modes, a specific particle system, mid-frame surface targets). Either fights Phaser, switches engines mid-Stage-6 (huge cost), or compromises faithfulness.

**Why it happens:**
Tool selection feels like a Day-0 decision. It isn't — it's a Stage-2 decision once the feature surface is documented. `.planning/PROJECT.md` correctly defers this to "end of Stage 2," but the temptation to start prototyping in Phaser earlier is real.

**How to avoid:**
- Hard rule: **no client-engine code before Stage 2 closes.** Not even a "quick prototype." Prototypes lock in mental commitment.
- Stage 2 deliverable includes a feature-vs-engine matrix: for each documented client engine feature (collision, animation, blend modes, particle systems, save/load, audio), grade Phaser 3 and PixiJS as Native / Plugin / Manual / Hard.
- Decision is made on the matrix at end of Stage 2. Decision is logged in `.planning/PROJECT.md` Key Decisions and not revisited unless Stage 6 hits a hard wall.

**Warning signs:**
- A `client/` folder appearing before Stage 2 closes.
- "I just want to see what the assets look like" prototypes that grow.
- Stage 6 hitting friction with the chosen engine — go back to the matrix and check whether the friction point was scored correctly.

**Phase to address:**
Stage 1 (don't start client code); Stage 2 (matrix + decision); `.planning/PROJECT.md` Key Decisions log.

---

### D4: IP/license trap — committing to public release before clearing assets

**What goes wrong:**
Project ships Stage 6 MVP, posts about it publicly, gets attention. Capcom (or their counsel) sends a DMCA notice. Project shuts down or pivots painfully. Worst case: the maintainer named in `decidel.com` or git history gets personally targeted.

**Why it happens:**
BNO is a Capcom-derived fan project (CONCERNS.md "Unlicensed derivative work of Capcom IP"). Asset filenames, sprites, music, and game mechanics openly trade on *Mega Man Battle Network*. Capcom has historically pursued MMBN fan projects. The repo also contains cracked software (GM 5.3a registration, RealVNC keygen) that compounds public-release risk.

**How to avoid:**
- **Keep the repo private through Stage 7.** Public-release decision is deferred per `.planning/PROJECT.md` Key Decisions ("License + public distribution: defer") — uphold that.
- Before any public action: complete the legal-prep steps in CONCERNS.md (delete cracked software, segregate Capcom-derived assets from original-team code, redact `localList.txt`/`remoteList.txt` plaintext credentials).
- Decide between (a) private friends-only deployment (lowest risk, matches the <50 player target), (b) "spiritual successor" with original-art replacement (medium risk, requires asset rework), (c) public release as-is (high risk, not recommended).
- Do not name Capcom IP in the project's public-facing materials (repo name, deployed URL, marketing copy) regardless of which path is chosen.
- Consider whether the maintainer's identity should be associated with the public artifact at all.

**Warning signs:**
- Discussion of "we should announce this on Reddit" before legal-prep is done.
- Deployment URL contains "Battle Network," "Megaman," "MMBN," or similar.
- Push to a public remote with `legacy/` accidentally included (CONCERNS.md "confirm `legacy/` has never been committed to a remote").

**Phase to address:**
Project meta — IP/release decision is a **gate before** any public-facing Stage 5 / Stage 6 deployment, not an afterthought. CONCERNS.md cleanup tasks are a prerequisite.

---

### D5: Single-developer burnout on a 7-stage project

**What goes wrong:**
Solo developer enthusiastic in Stages 1-2, slogs through Stage 3 (tedious protocol RE), runs out of energy mid-Stage-4. Project stalls indefinitely with most code half-done and no shipping artifact. Worse: the half-done state is in nobody's head and resuming requires re-learning everything.

**Why it happens:**
7 stages is a long arc. Without intermediate "shippable" milestones, motivation degrades. Reverse-engineering work has poor visible-progress feedback — you can spend a week understanding a packet and have nothing to show.

**How to avoid:**
- **Define "ship-worthy" outputs at every stage**, not just at Stage 7. Examples:
  - Stage 1: a published-internally extraction tree with a one-page "here's what we got" summary.
  - Stage 2: a one-page "BNO client features" doc — readable as a standalone artifact.
  - Stage 3: same for server.
  - Stage 4: a runnable server (no client) that accepts WS connections and echoes packets.
  - Stage 5: a deployed `https://...` URL that responds to a ping handler.
  - Stage 6: the MVP gate (two players, movement, chat).
  - Stage 7: the closed-checklist sign-off.
- Each ship-worthy output gets a small celebration / writeup. Even private notes count.
- **Time-box stages.** If Stage 2 takes 4x its estimate, that's a signal to scope down (drop catalogue depth, defer non-critical features) rather than push through.
- Build a **resumability buffer**: at the end of every working session, a short "where I am, what's next" note. Resuming after a break is the highest-friction moment for solo projects.
- Be honest about whether to recruit: if Stages 4-6 require a sustained pace the solo developer can't maintain, find a collaborator before the burnout, not after.
- For Stage 3 specifically (the most tedious): pair RE work with concrete deliverables (one packet schema = one TS type) so progress is visible.

**Warning signs:**
- More than 2 weeks since the last commit.
- "I'll get back to it next weekend" repeated for multiple weekends.
- Stage estimates blown by >2x without scope reduction.
- A growing list of TODOs, FIXMEs, and half-finished branches.
- Loss of the high-level mental model of the project — needing to re-read your own notes to remember what's next.

**Phase to address:**
Project meta — bake ship-worthy outputs and time-boxes into the Stage definitions in `.planning/PROJECT.md`. Resumability notes are a per-session discipline.

---

### D6: Dual-purpose `legacy/` directory — preservation vs working source

**What goes wrong:**
`legacy/` contains both authoritative source artifacts AND items that must be deleted before publication (cracked software, plaintext credentials per CONCERNS.md). Engineer in Stage 4 references files inside `legacy/`. Later, deletions for publication break the build/migration code.

**Why it happens:**
The same directory serves two purposes (preservation archive + working source for extraction).

**How to avoid:**
- Stage 1 extracts everything needed for Stages 2-6 into a **separate** `extracted/` (or similar) tree that is the canonical input for downstream stages. `legacy/` is read-only after Stage 1.
- The CONCERNS.md cleanup list (delete cracked software, redact credentials) operates on `legacy/` independently — `extracted/` doesn't carry the legal liabilities.
- `.gitignore` is consciously decided per-tree: `legacy/` likely stays git-ignored; `extracted/` likely is committed (with credentials redacted at extraction time).
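
The split is easier to keep if a check fails the build whenever downstream code reaches into `legacy/`. A minimal sketch of such a check as a pure function over already-loaded source text (how files are read and where it runs, for example a pre-commit hook, is left open); `findLegacyReferences` and the record shapes are hypothetical:

```typescript
interface SourceFile { path: string; text: string; }
interface LegacyHit { path: string; line: number; text: string; }

// Matches "legacy/" at the start of a path segment: "legacy/x", "../legacy/x", "a\\legacy\\x".
// Prose mentions such as "see legacy/ notes" (preceded by a space) are deliberately ignored.
const LEGACY_PATH = /(^|["'`\/\\])legacy[\/\\]/;

function findLegacyReferences(files: SourceFile[]): LegacyHit[] {
  const hits: LegacyHit[] = [];
  for (const file of files) {
    file.text.split("\n").forEach((text, index) => {
      if (LEGACY_PATH.test(text)) {
        hits.push({ path: file.path, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}
```

Run over the Stage 4+ source tree, any non-empty result is a file that will break once the CONCERNS.md cleanup touches `legacy/`.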

**Warning signs:**
- Stage 4+ code that imports from a `legacy/` path.
- Migration scripts that read `legacy/servers/.../localList.txt` directly instead of from a Stage-1 extracted-and-redacted equivalent.
- Confusion about which version of an asset is "the one" — `legacy/` has multiples, `extracted/` should have a single canonical.

**Phase to address:**
Stage 1 (Extraction) — establish the `legacy/` → `extracted/` split as a deliverable.

---

## Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Skip 39dll packet schema doc, write port from GML grep results | Faster Stage 4 start | Stage 4 becomes archaeology; future contributors stranded | Never — Stage 3 must produce the schema |
| Send JSON text (`JSON.stringify` / `JSON.parse`) over WS instead of binary frames | Easy debugging | Doesn't match original protocol; can't faithfully port; bandwidth waste | Only for an early Stage 4 prototype that's discarded |
| Plaintext password storage matching original | "Faithful" to original | Critical security incident waiting to happen; CONCERNS.md flags this | Never — bcrypt migration is non-negotiable |
| Manually rebuild rooms in editor instead of importing extraction | Visual / "fun" work | Off-by-one drift; can't re-sync if extraction improves | Never — always import |
| Skip WAL / atomic writes, save on a timer | Simpler code | Data loss on Fly.io restart; corruption risk | Only for local dev; never in deploy |
| `setInterval(tick, 33)` instead of accumulator loop | Simple | Tick drift, spiral of death under load | Only for early prototype; replace before Stage 5 |
| Hard-code legacy paths (like the `decidel.com` precedent in CONCERNS.md) | Faster | Reproduces a documented original-project bug | Never — externalize from the start |
| One big `handleMessage` function | Easier to start | Untestable; rate-limit retrofitting expensive | Only for the first packet; refactor to dispatcher before packet 3 |
| Skip protocol versioning byte | Saves 1 byte | Deploy-day breakage; B8 in full force | Never — version byte from packet 1 |
| Shared-types-by-copy-paste between client and server | No build setup | Schema drift; the bug class B8 prevents | Only if shared types package is genuinely too heavy for v1 |
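
The version-byte and dispatcher rows are cheap to satisfy together from the first packet. A minimal sketch, assuming a frame layout of `[version: u8][type: u8][payload...]`; that layout is an illustration only, not the original 39dll wire format, and `PROTOCOL_VERSION`, `encodeFrame`, and `dispatchFrame` are hypothetical names:

```typescript
const PROTOCOL_VERSION = 1; // hypothetical; bump on any incompatible schema change

type PacketHandler = (payload: Uint8Array) => void;

type DispatchResult =
  | { ok: true }
  | { ok: false; reason: "too-short" | "version-mismatch" | "unknown-type" };

// Assumed layout: byte 0 = protocol version, byte 1 = packet type, rest = payload.
function encodeFrame(type: number, payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(2 + payload.length);
  frame[0] = PROTOCOL_VERSION;
  frame[1] = type;
  frame.set(payload, 2);
  return frame;
}

// One handler per packet type instead of one big handleMessage function.
function dispatchFrame(frame: Uint8Array, handlers: Map<number, PacketHandler>): DispatchResult {
  if (frame.length < 2) return { ok: false, reason: "too-short" };
  if (frame[0] !== PROTOCOL_VERSION) return { ok: false, reason: "version-mismatch" };
  const handler = handlers.get(frame[1]);
  if (handler === undefined) return { ok: false, reason: "unknown-type" };
  handler(frame.subarray(2)); // a view into the frame, no copy
  return { ok: true };
}
```

Because every rejection is a value rather than an exception, the caller can map `version-mismatch` to a reload prompt and count `unknown-type` frames toward rate limiting.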

---

## Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Fly.io | Treat as ephemeral; lose state on restart | Persistent volume + Litestream + WAL + atomic writes (B7); SIGTERM grace handler |
| Fly.io | Single machine, single region for global users | Acceptable for v1's <50-player scope; revisit only if latency complaints concentrate by region |
| `ws` library | Send strings when binary intended | Set `binaryType` explicitly; use `Buffer` / `Uint8Array` end-to-end |
| `ws` library | Forget `clientTracking: false` for high-conn-count perf | Configure ws options based on expected concurrency; profile before tuning |
| `uWebSockets.js` | Use without understanding its async model | Stick with `ws` for v1 unless concrete perf need; uWS is faster but easier to misuse |
| Phaser 3 | Rely on `update(time, delta)` with variable delta for game logic | Run game logic in fixed-timestep loop driven from server state; render in `update` |
| PixiJS | Build whole game on top of PixiJS (it's a renderer, not a game framework) | If choosing PixiJS, also choose a state/scene/input layer (or build one) |
| SQLite | Leave the default `DELETE` journal mode: writers block readers and commits are slower than under WAL | Use WAL mode; configure synchronous=NORMAL; periodic checkpoint |
| Litestream | Forget to validate restore actually works | Test restore from a fresh machine into a staging environment monthly |
| bcrypt | Use cost factor 10 by default | Pick cost factor based on Stage 5 hardware (target ~250ms hash); document choice |
| Vite | Ship dev-mode bundle to production | Verify production build before Stage 6 deploy; check bundle size |
| Chrome | Assume gamepad/keyboard events fire identically across OSes | Test on Win/Mac/Linux Chrome; key codes can differ |

---

## Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Per-player state broadcast every tick to every player | CPU scales O(N²); latency rises | Send only state deltas; only broadcast to players in the same room | Breaks at ~30-50 concurrent (matches v1 target — design for it now) |
| GC pauses during tick | Periodic frame stalls; server log shows occasional 100ms+ tick | Pre-allocate buffers; reuse packet objects; profile with `--inspect` | At any scale; pauses get longer with more state |
| String concatenation for binary packets | Packet construction is hot path; concatenation allocates | Use `Buffer.alloc` once, write to offsets | At ~10-20 players with frequent updates |
| SQLite write per state change | Tick latency spikes; disk I/O dominates | Batch writes via WAL; persist on a separate timer or worker | At any write-heavy scale; first noticed ~5-10 active players |
| Synchronous bcrypt in request handler | Login latency tanks; event loop blocked during hash | Use bcrypt's async API; or move auth to a worker thread | Immediately on first concurrent login |
| Loading all room state on player join | Slow joins; memory grows with every join | Lazy-load; cache; share read-only state across rooms | At ~50+ rooms or large rooms |
| Logging every packet to disk | Disk fills; logging itself becomes the bottleneck | Sample logs; log only errors and key events at INFO; structured logs | Within hours of production traffic |
| Sending full state snapshots when delta would do | Bandwidth scales O(state × players × tickrate) | Diff against last-acked state per player | At ~10+ active players in same room |
| Re-rendering full canvas every frame | Client CPU pegged; battery drain on laptops | Phaser/PixiJS handle this if you let them; don't force full redraws | At any scale on low-end hardware |

For v1 (<50 concurrent players, single Fly machine per `.planning/PROJECT.md` Constraints), most of these are headroom-comfortable but **the design choices are still locked in early** — adding deltas after launch is much more painful than building them in.
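
Building deltas in early mostly means deciding on a per-player "last acknowledged state" from the start. A minimal sketch of that diff; the `PlayerState` fields are hypothetical stand-ins for whatever the Stage 3 packet schema defines:

```typescript
// Hypothetical per-player state; the real fields come from the Stage 3 packet schema.
interface PlayerState { x: number; y: number; facing: number; }
type PlayerDelta = Partial<PlayerState>;

const STATE_KEYS: Array<keyof PlayerState> = ["x", "y", "facing"];

// Returns only the fields that changed since the state the client last acknowledged,
// or null when there is nothing to send.
function diffState(lastAcked: PlayerState, current: PlayerState): PlayerDelta | null {
  const delta: PlayerDelta = {};
  let changed = false;
  for (const key of STATE_KEYS) {
    if (current[key] !== lastAcked[key]) {
      delta[key] = current[key];
      changed = true;
    }
  }
  return changed ? delta : null;
}

// Client side: fold a received delta into the last known full state.
function applyDelta(base: PlayerState, delta: PlayerDelta): PlayerState {
  return { ...base, ...delta };
}
```

A player standing still produces `null` and costs no bandwidth, which is where most of the O(state × players × tickrate) saving comes from at this scale.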

---

## Security Mistakes

| Mistake | Risk | Prevention |
|---------|------|------------|
| Faithfully port original plaintext-password storage | Catastrophic — reproduces the CRITICAL CONCERNS.md issue | Hard rule: bcrypt/argon2 from packet 1 of Stage 4; never write a plaintext-password code path |
| Faithfully port original "Ctrl+E executes clipboard on client" remote-exec feature | Wide-open RCE-by-design (CONCERNS.md "Server runs arbitrary clipboard code") | Refuse to port admin remote-exec features. Replace with limited admin tools (kick, mute, broadcast) with explicit allow-lists |
| Trust client for state (B1 above) | Cheating, griefing, data corruption | Authoritative server discipline from packet 1 |
| Commit `.env` / Fly tokens / production secrets | Standard secret-leak risk | `.gitignore` from project init; pre-commit hook scans |
| Log full packets including chat / passwords | PII leaks via logs | Redact known-sensitive fields in the logger |
| Echo error messages including stack traces / DB paths to client | Info leak; helps targeted attacks | Map server errors to client-safe codes; full detail goes to server logs only |
| No auth on the WS connection (treat WS as authenticated by IP) | Hijack risk; account takeover | Auth handshake is the first packet; subsequent packets carry session token |
| Reuse session tokens across deploys | Stolen tokens linger | Sign tokens with a per-deploy secret; rotate; short TTL |
| Username enumeration via differential login errors | Lets attackers build a target list | Always return same "invalid credentials" regardless of which field is wrong; same response time (B5 transparent re-hash logic must not be a timing oracle) |
| No rate limit on auth (brute-force) | Account takeover via password guessing | Per-IP and per-account auth rate limit (B4) |
| Ignore the cracked-software / pirated-binary problem (CONCERNS.md) | Legal liability + malware risk on dev machines | Delete cracked software from working tree; quarantine `RealVNC*Keygen.exe`; AV-scan all uncertain `.exe` files before any execution (CONCERNS.md fix approach) |
| Public deployment with Capcom-derived assets | DMCA, possible legal action | D4 above — keep private until decision made; remove obvious branding from any public artifact |
| Plaintext credentials archive (`localList.txt`) committed to a remote | Mass user credential leak — CRITICAL per CONCERNS.md | Verify `legacy/` has never been pushed; redact before any publication; consider hashing usernames too |
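
The log-redaction row can be enforced at the logger boundary so no call site has to remember it. A minimal sketch of a recursive filter; the key names in `SENSITIVE_KEYS` are hypothetical and should be replaced with the field names from the real packet and account schemas:

```typescript
// HYPOTHETICAL field names; align with the actual packet/account schemas.
const SENSITIVE_KEYS = new Set<string>(["password", "token", "sessionToken", "chatText"]);

// Returns a deep copy safe to log: sensitive fields are replaced, everything else is kept.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}
```

Logging `redact(packet)` instead of `packet` keeps structure visible for debugging while the password and token never reach disk; raw binary frames need the same treatment after decoding, not before.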

---

## UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Force returning users through password reset | "I lost my account" frustration | Transparent re-hash on first successful login (B5); reset only as fallback |
| Modernize chat command syntax (C4) | Returning players' muscle memory broken | Verbatim port; new commands are additions only |
| Add motion smoothing / interpolation that wasn't in original (B10, C1) | "Feels different" complaints with no fix | Match original movement model exactly; smoothing is an opt-in setting only |
| Show "X is typing..." indicators that didn't exist | Players notice the addition; some find it intrusive | Default off; opt-in setting |
| Replace original UI fonts with "modern" fonts | Loses identity; readable text feels foreign | Reuse original font assets if extractable; otherwise pick a close visual match documented in Stage 2 |
| Center the viewport on player when original used fixed-room view | Different feel; some rooms look wrong | Match original camera behavior per room (Stage 2 documents) |
| Show ping / latency UI indicators by default | Surfaces network jank that wasn't visible originally | Default off; debug overlay |
| Modal "you have been disconnected" alerts on every flicker | Annoying for marginal connections | Show only after grace period (matching B2 reconnect window); auto-reconnect silently when possible |
| Re-render "Loading..." screens for instant transitions | Adds perceived latency that wasn't there | Skip loading UI for transitions <500ms |
| HiDPI sprite blur (C3) | "Why does it look blurry now" | Pixel-perfect rendering settings + integer scaling |
| Force fullscreen on launch | Surprises users in a browser context | Fullscreen is opt-in via a button; default windowed |

---

## "Looks Done But Isn't" Checklist

- [ ] **Extraction complete:** GML extracted, but DnD nodes also extracted as binary (A3) — verify both are present per asset.
- [ ] **Packet schema done:** Schema docs exist, but verify every `sendmessage`/`receivemessage` site in extracted GML has a corresponding schema entry (A4) — grep-driven coverage check.
- [ ] **Save format done:** Schema for `.bno`/`.bnu`/`.bnb` exists, but verify a round-trip migration test on real `.bnu` files from `legacy/servers/` succeeds (A5).
- [ ] **Server authoritative:** "Authoritative" claim made, but verify with a hand-crafted malicious WS frame test against every state-changing handler (B1).
- [ ] **Reconnect works:** Reconnect tested in dev, but verify under network failure (kill WiFi mid-session, not just `socket.close()`) and zombie-cleanup with heartbeat timeout (B2).
- [ ] **bcrypt migration done:** First-login upgrade works, but verify: cold legacy login, two concurrent logins to same account, second login post-upgrade, wrong-password-then-right-password sequence, manual admin reset path (B5).
- [ ] **WAL persistence done:** SQLite WAL configured, but verify: `kill -9` the process mid-tick, restart, and confirm zero data loss; also verify that a Litestream restore from S3 works on a clean machine (B7).
- [ ] **Schema versioning done:** Version byte present, but verify: server with version N still accepts version N-1 client correctly; client with version N-1 against version N server gets the "please reload" prompt, not a crash (B8).
- [ ] **MVP gate (Stage 6):** "Two players in a room" works, but verify with two players in different geographic regions, on different network conditions (one on 4G, one on home WiFi) — the MVP per `.planning/PROJECT.md`.
- [ ] **Movement matches original (C1):** Have a player walk across a room and measure pixels/sec against the extracted GML constants. Don't trust the eyeball test.
- [ ] **Audio matches original (C2):** Returning player listens to BGM and SFX, confirms "yes that's the right sound." If only the developer has tested audio, audio is not done.
- [ ] **Sprite scaling correct (C3):** Test on a HiDPI display (Retina/4K) — sprites look crisp, not soft.
- [ ] **Chat commands match (C4):** All extracted chat commands work with original syntax; error messages match.
- [ ] **Save migration done (C5):** Pick 5 random users from `localList.txt`, run migration, verify their `.bnu` data lands intact.
- [ ] **Rooms identical (C6):** Pixel-diff at least 5 rooms (lobby, main hub, popular zones) against original screenshots — no off-by-one.
- [ ] **Stage 7 closed-checklist (C7):** Every item on the Stage 2/3 feature list has a status (done / deferred / rejected) — no items in "in progress" at sign-off.
- [ ] **CONCERNS.md addressed (D4):** Cracked software deleted, plaintext credentials redacted, `legacy/` never pushed to a remote. Verify with `git log --all -- legacy/`: empty output means no commit on any local or remote-tracking ref touches that path.
- [ ] **Deploy lifecycle (B7):** SIGTERM handler tested by triggering a Fly.io deploy mid-traffic; players reconnect within grace window without data loss.
- [ ] **Extracted vs working trees split (D6):** No Stage 4+ code imports from `legacy/` — only from `extracted/` (or equivalent).
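
The grep-driven coverage check in the packet-schema item (A4) could look roughly like the sketch below. It assumes, hypothetically, that the schema doc records each call site under a `file:line` key; the function names, the key convention, and the sample GML are illustrative and not taken from the extracted code.

```typescript
// One networking call found in extracted GML.
type CallSite = {
  file: string;
  line: number; // 1-based line number within the file
  call: string; // "sendmessage" or "receivemessage"
};

// Scan one GML source file for sendmessage(...) / receivemessage(...) calls.
function findCallSites(file: string, source: string): CallSite[] {
  const sites: CallSite[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    // Simplification: drop everything after "//" so commented-out calls are skipped.
    const code = text.split("//")[0];
    const pattern = /\b(sendmessage|receivemessage)\s*\(/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(code)) !== null) {
      sites.push({ file: file, line: index + 1, call: match[1] });
    }
  });
  return sites;
}

// Return the call sites with no entry in the schema doc.
// Keys use the hypothetical "file:line" convention.
function uncoveredSites(sites: CallSite[], documentedKeys: string[]): CallSite[] {
  return sites.filter((s) => documentedKeys.indexOf(s.file + ":" + s.line) === -1);
}

// Example with made-up GML; the script name and arguments are illustrative only.
const exampleGml = [
  "sendmessage(sock);",
  "// sendmessage(old_sock);",
  "size = receivemessage(sock);",
].join("\n");
const missing = uncoveredSites(findCallSites("scr_login.gml", exampleGml), ["scr_login.gml:1"]);
console.log(missing); // one entry: receivemessage at scr_login.gml line 3
```

The comment stripping here ignores block comments and `//` inside string literals, so a real check would need a more careful scan. The checklist item passes only when `uncoveredSites` returns an empty list across every extracted script.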

---

## Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| A1 — Wasted time on UTMT | LOW | Switch to GM Decompiler v2.1; lost time only |
| A2 — Broken extraction layer order | LOW | Use GM Decompiler v2.1 (does it for you); abandon custom extractor |
| A3 — Ported DnD blocks behave wrong | MEDIUM | Open binary in LateralGM for the suspect script; re-port from binary intent; add regression test against expected behavior |
| A4 — Packet schema wrong | MEDIUM-HIGH | Re-derive from GML for the broken packet type; this is hours per packet, days for many; mitigated by GML-first discipline |
| A5 — `.bno`/`.bnu` migration corrupted save data | HIGH | Original `legacy/servers/` is read-only — re-run migration from source; users who already played on the new server have lost interim progress; communicate, possibly roll back |
| A7 — GMD-Recovery doesn't work on modern Windows | LOW | Set up WinXP VM (Stage 1 prereq anyway); run there |
| B1 — Client trusted, exploited in production | HIGH | Identify exploited handler, add server-side validation, ship hotfix; audit all sibling handlers; possibly roll back state for affected players |
| B2 — Zombie sessions | LOW-MEDIUM | Add heartbeat + timeout; restart server to clear zombies; document fix |
| B5 — Account locked due to double-hash bug | HIGH per affected user | Manual admin reset per affected account; audit log identifies which accounts are at risk; consider mass password reset if scope is broad |
| B7 — Lost state on Fly restart | HIGH | Restore from Litestream backup (some loss); communicate to players; add WAL/atomic-write discipline |
| B8 — Mid-deploy schema-skew corruption | HIGH | Roll back deploy; identify affected records via audit log; restore from backup; harden version-byte handling |
| C1 — Movement feel wrong | LOW | Adjust constants from extracted GML; ship patch |
| C5 — Some users missing migrated saves | MEDIUM | Identify orphans via migration log; re-run migration for them; communicate |
| C6 — Rooms shifted off-by-one | LOW | Re-import room data from Stage 1 extraction; ship patch |
| C7 — Stage 7 scope explosion / stuck | MEDIUM (project-time, not engineering) | Define a closed checklist retroactively; cut to Stage 7-MVP; everything else becomes Stage 8 backlog |
| D1 — Mid-stage rewrite spiral | HIGH | Stop new work; complete the in-flight stage doc artifact; resume Stage N+1 with a clean baseline |
| D4 — DMCA / legal incident | CRITICAL | Take down public artifact immediately; consult counsel; do not engage publicly without advice |
| D5 — Burnout / project stall | HIGH | Honest scope cut; consider collaborator; resume with smaller, ship-worthy increments per stage |
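
The B2 recovery step ("add heartbeat + timeout") reduces to a periodic sweep over last-seen timestamps. The sketch below is library-agnostic and makes several assumptions: the table shape, the session ids, and the 5-second timeout are all hypothetical, and the real timeout would need to match the B2 reconnect window.

```typescript
// Hypothetical bookkeeping: session id -> timestamp (ms) of the last heartbeat reply.
type LastSeenTable = { [sessionId: string]: number };

// Return the ids of sessions that have been silent for longer than timeoutMs.
// The caller is expected to close these connections and free their room slots.
function findZombieSessions(
  lastSeen: LastSeenTable,
  nowMs: number,
  timeoutMs: number,
): string[] {
  return Object.keys(lastSeen).filter((id) => nowMs - lastSeen[id] > timeoutMs);
}

// Example: "alice" last replied 11 s ago, "bob" 3 s ago, with an assumed 5 s timeout.
const exampleZombies = findZombieSessions({ alice: 1000, bob: 9000 }, 12000, 5000);
console.log(exampleZombies); // alice only
```

Passing the current time in as a parameter keeps the sweep deterministic, so the simulated network failure test can advance a fake clock instead of waiting in real time.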

---

## Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| A1 — Wrong tool (UTMT etc.) | Stage 1 | Tool list documented; CI fails if a banned tool name appears in setup notes |
| A2 — Extraction layer order | Stage 1 | GM Decompiler v2.1 used; intermediate-file outputs at each layer |
| A3 — DnD decompilation infidelity | Stage 1 (preserve binaries) + Stage 4/6 (verify against binary) | Binary node trees retained alongside GML for every asset |
| A4 — Packet schema from captures | Stage 3 | Schema doc has a row for every grep'd `sendmessage`/`receivemessage` site |
| A5 — `.bno`/`.bnu` parsing without GML | Stage 3 | Schema document exists; migration tests pass on real `.bnu` files |
| A6 — Ignoring `.gb1`-`.gb9` | Stage 1 | Extraction tree includes outputs from every `.gmd` AND every `.gb*` backup, with manifest |
| A7 — No WinXP VM | Stage 1 (Day 0) | VM image documented; extraction runbook tested in VM |
| B1 — Client-trusted state | Stage 4 | Hand-crafted malicious WS frame tests for every state handler |
| B2 — Zombie WS sessions | Stage 4 | Heartbeat + timeout in place; tested with simulated network failure |
| B3 — Message framing | Stage 4 | Library framing trusted; no custom re-framer; tests with fragmented frames |
| B4 — Rate limiting | Stage 4 | Per-account, per-type token bucket at packet dispatcher; tests for chat flood |
| B5 — bcrypt migration | Stage 4 | Test suite covers cold/concurrent/upgraded/wrong-password/admin-reset paths |
| B6 — Tick rate drift | Stage 4 | Fixed-timestep accumulator with capped catch-up; per-tick duration logged |
| B7 — Persistence loss on restart | Stage 4 (WAL) + Stage 5 (Fly lifecycle) | kill -9 + restart test; Litestream restore test |
| B8 — Schema version skew | Stage 4 (version byte) + Stage 5 (deploy procedure) | Version-N server tested against version-N-1 client; deploy runbook documents N-1 support |
| B9 — Room hot-restart leakage | Stage 4 | Room hot-restart either avoided or implemented with instance-id discipline |
| B10 — Premature lag comp | Stage 2 (movement model) + Stage 6 (match it) | Movement model documented; lag comp is an explicit Stage 7 decision |
| C1 — Movement feel | Stage 2 (constants) + Stage 6 (impl) | Automated tests against extracted constants; returning-player playtest |
| C2 — Audio fidelity | Stage 2 (catalogue) + Stage 6 (impl) + Stage 7 (verify) | Returning-player audio playtest; soundfont decision logged |
| C3 — HiDPI sprite blur | Stage 6 | Pixel-perfect engine settings; HiDPI display test |
| C4 — Chat command drift | Stage 3 (extract) + Stage 4/6 (impl) | Verbatim match against extracted command list |
| C5 — Save migration loss | Stage 3 (schema) + Stage 4 (transactional migration) | 5-user round-trip migration test |
| C6 — Room layout drift | Stage 1 (structured extraction) + Stage 6 (direct import) | Pixel-diff against original screenshots |
| C7 — Parity scope explosion | Stage 2/3 (closed checklist) + Stage 7 (sign-off against checklist) | Every checklist item has a status before Stage 7 close |
| D1 — Extraction + rewrite simultaneous | Project meta — Stage 1-3 boundaries | `/gsd-transition` enforced; no `src/` TS code before Stage 4 |
| D2 — Skip Stage 2/3 docs | Stage 2/3 deliverables | Doc artifacts exist as Stage transition gate |
| D3 — Engine lock-in pre-Stage-2 | Stage 1-2 — no client engine code | Feature-vs-engine matrix at end of Stage 2; decision logged |
| D4 — IP/license trap | Project meta — pre-Stage-5/6 deploy gate | CONCERNS.md cleanup checklist completed before any public artifact; Capcom-IP names absent from public materials |
| D5 — Burnout | Project meta — every stage | Ship-worthy artifact per stage; resumability notes; time-box stage estimates |
| D6 — `legacy/` working-source confusion | Stage 1 | `extracted/` tree exists; Stage 4+ never imports from `legacy/` |
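
The B6 row's "fixed-timestep accumulator with capped catch-up" can be sketched as follows. This is one common way to write the pattern, not the project's server loop: the class name, the 50 ms tick, and the cap of 5 ticks are assumptions, and the real tick length should come from the Stage 2 movement model.

```typescript
// Minimal fixed-timestep accumulator with a cap on catch-up work.
class FixedTimestep {
  private accumulatorMs: number = 0;
  private readonly tickMs: number;
  private readonly maxCatchUpTicks: number;

  constructor(tickMs: number, maxCatchUpTicks: number) {
    this.tickMs = tickMs;
    this.maxCatchUpTicks = maxCatchUpTicks;
  }

  // Feed elapsed wall-clock time; returns how many simulation ticks to run now.
  advance(elapsedMs: number): number {
    this.accumulatorMs += elapsedMs;
    let ticks = Math.floor(this.accumulatorMs / this.tickMs);
    if (ticks > this.maxCatchUpTicks) {
      // After a long stall, run only the capped number of ticks and discard
      // the remaining backlog so the server cannot spiral trying to catch up.
      ticks = this.maxCatchUpTicks;
      this.accumulatorMs = 0;
    } else {
      // Keep the leftover fraction of a tick for the next call.
      this.accumulatorMs -= ticks * this.tickMs;
    }
    return ticks;
  }
}

// Example with an assumed 50 ms tick (20 Hz) and a cap of 5 catch-up ticks.
const exampleLoop = new FixedTimestep(50, 5);
console.log(exampleLoop.advance(120)); // 2 (20 ms carried over)
console.log(exampleLoop.advance(1000)); // 5 (backlog of 20 ticks capped and dropped)
```

Logging the value returned by `advance` on every call, alongside the measured per-tick duration, gives the drift signal that the verification column asks for.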

---

## Sources

- `.planning/PROJECT.md`: 7-stage plan, scope, constraints
- `.planning/codebase/CONCERNS.md`: legal, security, bit-rot, data-integrity risks already surfaced
- `decomp/wiki/00-overview.md`: engine identity and extraction goals
- `decomp/wiki/08-39dll-networking.md`: packet structure = call order; `sendmessage`/`receivemessage` workflow
- `decomp/wiki/13-modern-tool-incompat.md`: UTMT/Altar.NET architectural mismatch
- `decomp/wiki/15-extraction-pipeline.md`: ranked extraction methodology
- General multiplayer game server patterns (Glenn Fiedler "Fix Your Timestep!", Gaffer On Games series) — informing B6 fixed-timestep accumulator. MEDIUM confidence (industry-standard pattern, not BNO-specific).
- General `ws` library and WebSocket framing behavior — informing B3. MEDIUM confidence (well-known library; verify exact semantics in chosen WS lib at Stage 4 start).
- Litestream + SQLite WAL patterns — informing B7. MEDIUM confidence (standard pattern; verify Fly.io specifics at Stage 5).
- Capcom enforcement history against MMBN fan projects — informing D4. MEDIUM confidence (well-documented in fan-project communities; consult current legal advice before public release).

---

*Pitfalls research for: REBNO multi-stage rebuild (RE + multiplayer + faithful-rebuild + meta)*
*Researched: 2026-05-01*