# Client-only fast-path rollback runbook

[doc->REQ-DEP-04]

Operator-facing runbook for rolling back a Phase 06.5 client-only fast-path
release. This is the single-command rollback path established by
`scripts/client-release.sh` and the `/data/client-assets/releases/<sha>`
content-addressed layout.

---

## When to use this runbook

Use this when **all of the following** are true:

- The most recent deploy was a **client-only fast-path** release (no Docker
  image push event in Fly logs; only a tarball SSH-upload + `client-release.sh`
  invocation).
- The site is broken in a way that points squarely at a **bad client bundle** —
  white screen, JS error in the browser console, missing assets, hashed-asset
  404s.
- Server logs **do NOT** indicate a process crash, OOM, or container restart.
  A server-process crash needs a full-image rollback (re-run a prior
  `deploy-staging.yml` full-path job), not this runbook.

If you are unsure which path the bad deploy took, check the latest GitHub
Actions run on the relevant workflow. If the run pushed a Docker image, use
the full-image rollback (out of scope here). If it only ran the fast-path
job, continue.

---

## Preflight (60 seconds)

1. SSH into the affected machine:
   ```bash
   flyctl ssh console -a rebno-staging
   # OR for prod:
   flyctl ssh console -a rebno-prod
   ```

2. Inspect the current pointer:
   ```bash
   readlink /data/client-assets/current
   ```
   Note the SHA at the tail of the path — that is the deploy you are
   about to roll back **from**.

3. List available rollback targets, newest first:
   ```bash
   ls -lt /data/client-assets/releases/ | head -7
   ```
   The first line of `ls -l` output is a `total` summary, so `head -7` shows
   that line plus up to 6 directories: up to 5 retained releases and the
   just-deployed failure. Retention is **5 releases** (see
   06.5-RESEARCH § Pitfall 5).

4. Pick the **previous good** SHA. In most rollback scenarios this is the
   second directory listed (the newest one after the current failure),
   ignoring the leading `total` line.
   **Confirm the target dir exists** on the volume before issuing the swap.
   If it has been GC'd, follow the "If the rollback target is missing"
   section below.
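
Steps 2 to 4 can be combined into one helper. The following is a minimal sketch, not part of `scripts/client-release.sh`: the function name `previous_release` and the `ASSETS_ROOT` variable are hypothetical, and it assumes each release directory is named by its SHA and that directory mtime reflects deploy order (the same assumption `ls -lt` makes).

```bash
# Hypothetical helper: print the release that sits immediately after the
# current one in newest-first order, i.e. the next-older retained release.
previous_release() {
  local root="${1:-${ASSETS_ROOT:-/data/client-assets}}"
  local current_sha candidate seen_current=0
  # SHA of the deploy being rolled back from (tail of the symlink target).
  current_sha=$(basename "$(readlink "$root/current")")
  # Newest first, one name per line (SHAs contain no whitespace).
  while IFS= read -r candidate; do
    if [ "$seen_current" -eq 1 ] && [ -d "$root/releases/$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
    if [ "$candidate" = "$current_sha" ]; then
      seen_current=1
    fi
  done < <(ls -1t "$root/releases")
  echo "no older release retained under $root/releases" >&2
  return 1
}
```

Run on the machine, `previous_release` prints a candidate SHA; compare it against the `ls -lt` listing before using it in the swap.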

---

## Atomic rollback (one command)

Run this **on the machine** (you are still inside the `flyctl ssh console`
session from Preflight step 1):

```bash
cd /data/client-assets
ln -s releases/<previous-sha> current.new && mv -T current.new current
```

Replace `<previous-sha>` with the SHA you picked in Preflight step 4.

**Why `mv -T`:** `-T` makes `mv` treat `current` as the destination name
itself rather than as a directory to move into, so the swap is a single
`rename(2)` call. `rename(2)` is atomic within a filesystem (ext4 here), so
there is no instant at which `/data/client-assets/current` is missing and
`express.static` would return a 404. Do **NOT** substitute `ln -sfn` here
even though older deploy guides use it; it removes the old link and then
creates the new one, which is not atomic. See
`.planning/phases/06.5-static-client-asset-split-zero-cost-client-only-fly-deploys-/06.5-RESEARCH.md`
§ Pattern 1.
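
To see what `-T` changes, the swap can be reproduced in a scratch directory. This is an illustrative sketch only: the release names `aaa` and `bbb` are placeholders, and it assumes an `mv` that supports `-T` (GNU coreutils or BusyBox).

```bash
# Scratch-directory demonstration; nothing here touches /data/client-assets.
scratch=$(mktemp -d)
mkdir -p "$scratch/releases/aaa" "$scratch/releases/bbb"
ln -s releases/aaa "$scratch/current"

# Without -T: `current` resolves to a directory, so mv moves the new link
# INTO releases/aaa and the pointer does not change.
ln -s releases/bbb "$scratch/current.new"
mv "$scratch/current.new" "$scratch/current"
plain_result=$(readlink "$scratch/current")
rm "$scratch/releases/aaa/current.new"

# With -T: the destination is treated as a plain name, so the new link
# replaces the old one in a single rename.
ln -s releases/bbb "$scratch/current.new"
mv -T "$scratch/current.new" "$scratch/current"
atomic_result=$(readlink "$scratch/current")

echo "plain mv: $plain_result"
echo "mv -T:    $atomic_result"
rm -rf "$scratch"
```

The plain `mv` leaves the pointer at `releases/aaa`, while `mv -T` moves it to `releases/bbb`.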

You should see no output (a successful `mv -T` is silent). Verify the swap
took effect:
```bash
readlink /data/client-assets/current
# Expected: releases/<previous-sha>
# (the link created above is relative; `readlink -f` prints the absolute path)
```
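
If the swap is scripted, a few guards help. The sketch below is a hypothetical wrapper (the name `rollback_to` and its arguments are assumptions, not an existing script): it refuses a missing target, clears a `current.new` left behind by an interrupted attempt, and confirms the pointer afterwards.

```bash
# Hypothetical wrapper around the swap. Arguments: target SHA, then an
# optional assets root (defaults to the runbook's /data/client-assets).
rollback_to() {
  local sha="$1"
  local root="${2:-/data/client-assets}"
  if [ ! -d "$root/releases/$sha" ]; then
    echo "release $sha not found under $root/releases" >&2
    return 1
  fi
  # Clear a leftover temp link from an interrupted earlier attempt;
  # otherwise `ln -s` would fail or create the link in the wrong place.
  rm -f "$root/current.new"
  ln -s "releases/$sha" "$root/current.new"
  mv -T "$root/current.new" "$root/current"
  # Succeed only if the pointer now names the requested release.
  [ "$(readlink "$root/current")" = "releases/$sha" ]
}
```

Usage: `rollback_to <previous-sha>`. A non-zero exit status means the pointer was not changed to the requested release.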

---

## Verify rollback

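Run these probes **from a terminal NOT on the machine** (your laptop is
fine). The commands below target staging; for a prod rollback, substitute
the prod hostname and `-a rebno-prod`: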

```bash
# 1. Root probe — should serve the previous index.html
curl -fsS https://staging.rebno.decidel.com/ >/dev/null && echo OK_root

# 2. Health probe (NOTE: endpoint is /health — confirmed via
#    apps/server/src/index.ts; 06.5-RESEARCH § Pitfall 7)
curl -fsS https://staging.rebno.decidel.com/health >/dev/null && echo OK_health

# 3. Hashed-asset probe — pull the hashed JS path from the rolled-back
#    manifest, then verify it serves.
#    The Fly machine does not ship `jq` by default (06.5-RESEARCH
#    § Environment Availability), so the helper one-liner uses Node:
HASHED=$(flyctl ssh console -a rebno-staging -C \
  "node -e 'console.log(require(\"/data/client-assets/current/.vite/manifest.json\")[\"index.html\"].file)'")
curl -fsS "https://staging.rebno.decidel.com/${HASHED}" >/dev/null && echo OK_hashed
```

Three `OK_*` lines = rollback verified.

If any probe fails, the rollback target itself is likely broken too. Repeat
the swap with the **next-older** retained release SHA (one entry further
down in the newest-first `ls -lt` output) and re-run this Verify section.
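
The three probes can also be wrapped so that a failure is reported explicitly instead of showing up only as a missing `OK_*` line. This is a sketch under assumptions: the `probe` and `run_probes` functions and the `CURL` override are hypothetical, and the 10-second timeout is an arbitrary choice.

```bash
# Hypothetical wrapper: prints OK_<label> or FAIL_<label> and returns the
# probe's status. CURL can be overridden, for example for a dry run.
probe() {
  local label="$1" url="$2"
  if "${CURL:-curl}" -fsS --max-time 10 "$url" >/dev/null; then
    echo "OK_${label}"
  else
    echo "FAIL_${label}"
    return 1
  fi
}

# Run all three probes even if an early one fails, then report overall status.
run_probes() {
  local base="$1" hashed="$2" failed=0
  probe root   "$base/"        || failed=1
  probe health "$base/health"  || failed=1
  probe hashed "$base/$hashed" || failed=1
  return "$failed"
}
```

Usage, with `HASHED` obtained from the manifest as shown above: `run_probes https://staging.rebno.decidel.com "$HASHED"`.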

---

## If the rollback target is missing

Retention is **5 releases** (`KEEP_LAST=5` in `scripts/client-release.sh`).
If the operator needs to roll back to something older than the 5-deep
buffer, the release directory has been GC'd and the symlink swap cannot
recover it.
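
For intuition about why older targets disappear, a keep-last-N prune often has the shape sketched below. This is an assumption about the general pattern, not a copy of `scripts/client-release.sh`; the real script may order releases differently or protect the `current` target explicitly. The function name `prune_releases` is hypothetical.

```bash
# Illustrative only: NOT the contents of scripts/client-release.sh.
# Arguments: releases directory, then how many newest entries to keep.
prune_releases() {
  local releases_dir="$1"
  local keep="${2:-5}"
  local name
  # Newest first; everything after the first $keep entries is removed.
  ls -1t "$releases_dir" | tail -n "+$((keep + 1))" | while IFS= read -r name; do
    rm -rf -- "${releases_dir:?}/$name"
  done
}
```

Under this pattern, each deploy beyond the retention count pushes the oldest release out of the buffer, after which only a full-image redeploy can restore it.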

**Escape hatch — full-image redeploy of the matching SHA:**

1. Check out the desired prior commit locally: `git checkout <sha>`.
2. Re-run the existing full-image deploy workflow (`deploy-staging.yml` /
   `deploy-prod.yml`) — this rebuilds + pushes the Docker image with that
   SHA's `apps/server/public/` baked in, restoring the served bundle.
3. The full-path job typically takes 4-8 minutes (vs. the fast-path's
   30-90 seconds). This is the cost of going outside the retention window.

To reduce future occurrences, raise `KEEP_LAST` in
`scripts/client-release.sh` (volume headroom is generous — 5 releases ×
~14 MB = ~70 MB; the volume is 10 GB per `apps/server/fly.staging.toml`).
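
The headroom arithmetic can be worked through as follows. The release size and volume size come from the figures above; the 50% budget is an arbitrary illustrative cap (the volume may hold other data), and 10 GB is treated as 10 × 1024 MB.

```bash
# Figures from this runbook: ~14 MB per release, 10 GB volume.
release_mb=14
volume_mb=$((10 * 1024))
# Assumption for illustration: let client releases use at most half the volume.
budget_pct=50

current_usage_mb=$((5 * release_mb))
max_keep=$((volume_mb * budget_pct / 100 / release_mb))

echo "KEEP_LAST=5 uses about ${current_usage_mb} MB"
echo "a ${budget_pct}% budget fits about ${max_keep} releases"
```

Even under that conservative budget the volume fits several hundred releases, so raising `KEEP_LAST` well above 5 is unlikely to be constrained by disk space.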

---

## After rollback — what to do next

1. **Record the broken release.** Note the broken-deploy SHA, the deploy
   timestamp, the rollback timestamp, and the user-visible failure mode in
   your operator notebook or the operator UAT log.
2. **Open a fix PR.** The broken bundle is still in
   `/data/client-assets/releases/<broken-sha>` on the volume. Do not
   manually `rm -rf` it; a later deploy's GC will trim it once it falls
   outside the `KEEP_LAST` retention window.
3. **Re-deploy when ready.** A new fast-path commit will create a fresh
   release dir and swap the symlink forward. No special steps needed.
