# 15 - API Keys Dashboard (`allowed_apps.json` → Database)

**Status: NOT STARTED**

## Summary

Replace the static `allowed_apps.json` file — which every service loads synchronously at boot to validate `Authorization: Bearer` app keys — with a database-backed table managed through a new dashboard in arda. Key management (creating, rotating, disabling, scoping to APIs) becomes a SuperUser operation in the UI instead of a manual file edit + deploy. Existing integrators keep their current keys via a one-time migration.

**Shape of the change:**

- New Firestore collection `AllowedApps` becomes the source of truth at runtime.
- `AuthHandler` (in `auth/Auth.ts:323-406`) subscribes to the collection via snapshot listener; updates propagate to all running services within seconds of a dashboard change.
- JSON file degrades to "bootstrap fallback + disaster recovery snapshot" role — regenerated periodically from the DB, read only on cold boot if Firestore is unreachable.
- New dashboard UI at `/system/api-keys` in arda (SuperUser-only).
- One-time migration on admin_api boot seeds the Firestore collection from the current `allowed_apps.json` file; all migrated keys default to `reserved: true` to prevent accidental disruption of load-bearing integrations.
- New `reserved` flag protects critical keys from accidental deletion, rotation, or scope reduction. Destructive operations on reserved keys require a time-boxed "unlock" ceremony with a reason string, all audited.

## Current State

### `allowed_apps.json` today

- Location: path from `process.env.ALLOWED_APPS_DATA_FILE`. In the repo, `tests/allowed_apps.json` is the test fixture (7 entries); production uses a separate file on the EC2 filesystem.
- Loaded via synchronous `require()` at `auth/Auth.ts:337` inside the `AuthHandler` constructor.
- Entry shape (`auth/Auth.ts:40-44`):
  ```ts
  type AllowedApp = {
      name: string;
      apis: string[];
      id: string;
  }
  ```
- Validation path (`auth/Auth.ts:362-406`):
  - Extract bearer token, sanitize via `.replace(/[^a-zA-Z0-9]/g, '')`
  - Linear scan: `this.allowedApps.find(o => o.id === sanitizedToken)`
  - If found, check that `o.apis.includes(this.apiName)`; otherwise 403
  - If not found, 401
- Every service that calls `Auth.AuthHandler.createAuthHandler({ apiName: ... })` loads the full list. Confirmed callers:
  - `apps/api/api.ts` (auth server) — apiName: "accounts-api" et al
  - `apps/admin_api/admin_api.ts:41,71,93,114` — four handlers (admin-api, fusion-api, cnc-machine-api, pose-api)
  - Likely also cloud-api, fabricator APIs, etc.

### Pain points this resolves

1. **Adding/rotating a key requires a file edit + deploy of every service that loads the file.** The change is not atomic across services.
2. **No audit trail.** Any change is a git commit on the ops repo; there's no in-system record of who changed what, when, or why.
3. **Plaintext secrets in the filesystem.** Anyone with EC2 SSH access or the ops-repo read permission can view all API keys.
4. **No visibility.** There's no way to see which keys exist, which APIs they can call, or which are still in active use without reading the file.
5. **Collision footgun.** The current `tests/allowed_apps.json` has two entries with the same `id` (`accounts-integration-tests` and `admin-integration-tests` both use `f4czwPZdYidGTcWRbhcWAcFiTaDYhYDraDsjYLYO2YXd4LxQqpMpbcdmyxZEBKNZ`). `Array.find` returns the first match, so the second of the two entries is never reached. Firestore has no native unique constraint, so the dashboard's write path must reject a duplicate `keyHash` explicitly; with that check in place, this class of bug is caught at creation time.
6. **No usage tracking.** Can't tell which keys are stale or unused.
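
The collision in item 5 can be reproduced in a few lines. The sketch below mirrors the validation path described earlier (sanitize, `Array.find` by `id`, then check `apis`); the shortened key and the `apis` values are placeholders, not the fixture's real contents.

```ts
type AllowedApp = { name: string; apis: string[]; id: string };

// Placeholder data: two entries sharing one id, as in the test fixture.
const allowedApps: AllowedApp[] = [
    { name: "accounts-integration-tests", apis: ["accounts-api"], id: "SHAREDKEY123" },
    { name: "admin-integration-tests", apis: ["admin-api"], id: "SHAREDKEY123" },
];

// Same shape as the current check: the first match by id wins.
function authorize(apps: AllowedApp[], bearer: string, apiName: string): 200 | 401 | 403 {
    const token = bearer.replace(/[^a-zA-Z0-9]/g, "");
    const app = apps.find(o => o.id === token);
    if (!app) return 401;
    return app.apis.includes(apiName) ? 200 : 403;
}

const accountsStatus = authorize(allowedApps, "SHAREDKEY123", "accounts-api"); // 200
const adminStatus = authorize(allowedApps, "SHAREDKEY123", "admin-api");       // 403: second entry is never reached
```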

## Key Architectural Decisions

### 1. Where the data lives: Firestore (live) + JSON file (fallback snapshot)

Evaluated options:

| Option | Pros | Cons |
|---|---|---|
| **Firestore collection + snapshot listener** | Every service already has Firestore SDK (`AuthDatabase.ts` uses it for `BigScreenUsers`, `RefreshTokenWhitelist`, etc.); live snapshots eliminate polling; zero new infra | Dependent on Firestore uptime at boot |
| Postgres on admin_api + pull-based cache | Familiar; admin_api already has Postgres | Other services don't currently connect to admin_api's DB; adds new service boundary + polling complexity |
| Redis only | Fast; every service has Redis | Redis is cache/ephemeral; risky as source of truth for durable auth config |
| Keep JSON + sync writes from dashboard | Minimal runtime change | Requires shared filesystem or S3 push; fragile multi-EC2 sync |

**Chosen: Firestore**. Source of truth is the `AllowedApps` collection. Services use snapshot listeners for live updates. The JSON file at `process.env.ALLOWED_APPS_DATA_FILE` is regenerated periodically (every 5 min) by a background job that exports the current Firestore state; it's used only when Firestore is unreachable at cold boot.

### 2. Reserved key protection

Some keys are load-bearing (arda's production key, core service-to-service keys). Accidental deletion or rotation in the dashboard would cause an outage. The `reserved` boolean flag on each row gates destructive operations behind a time-boxed unlock ceremony.

### 3. Keys stored as hashes, not plaintext

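The dashboard generates a new key on creation: 64-char base64url, shown **once** to the SuperUser at creation time. The DB stores `SHA256(key)` for lookup. Validation becomes: hash the incoming bearer, then look up by hash (an O(1) map lookup vs. today's O(n) linear scan). One conflict to resolve: base64url keys can contain `-` and `_`, which the current sanitizer (`.replace(/[^a-zA-Z0-9]/g, '')`) strips. Either the sanitizer must admit those two characters or generated keys must be restricted to alphanumerics; otherwise the hash of the sanitized bearer will not match `keyHash`.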

**Why SHA256 and not bcrypt:** API keys have ~384 bits of entropy (64 base64url chars at 6 bits each). Deterministic hashing is necessary for O(1) lookup; bcrypt with a random salt would require iterating every row per request. Brute-force and rainbow-table attacks are infeasible at 384 bits, so a slow, salted hash adds no protection here. SHA256 is adequate for this threat model (DB exposure doesn't directly leak secrets); bcrypt's per-row salt is incompatible with hash-indexed lookup.
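
A minimal sketch of key generation and hashing under these assumptions, using Node's built-in `crypto` module. The function names are illustrative, not existing helpers in the codebase.

```ts
import { createHash, randomBytes } from "node:crypto";

// 48 random bytes encode to exactly 64 base64url characters (no padding).
function generateApiKey(): string {
    return randomBytes(48).toString("base64url");
}

// Deterministic digest, so the hash itself can serve as the lookup key.
function hashApiKey(plaintext: string): string {
    return createHash("sha256").update(plaintext, "utf8").digest("hex");
}

// Display-only prefix, matching the 8-character keyPrefix field.
function keyPrefixOf(plaintext: string): string {
    return plaintext.slice(0, 8);
}

const newKey = generateApiKey();
const newKeyHash = hashApiKey(newKey);
```

Only `newKeyHash` and `keyPrefixOf(newKey)` would be persisted; `newKey` is returned to the caller once and then discarded.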

### 4. Migration is one-time and conservative

First admin_api boot that sees an empty `AllowedApps` collection reads `ALLOWED_APPS_DATA_FILE`, inserts a document per entry with `reserved: true`, and writes a `_migration_marker` document with `{ completedAt, sourceFileSha1, entryCount }`. Subsequent boots see the marker and skip.

Every migrated key starts reserved so the dashboard cannot accidentally nuke a production integration on day one. Operators unlock individual keys as they're verified.

---

## Architecture

### Service boundaries

| Service | Role |
|---|---|
| **arda** (port 3010) | Dashboard UI at `/system/api-keys` (SuperUser-only). Calls admin_api CRUD endpoints via the existing `/api/admin/*` proxy. |
| **admin_api** (port 3999) | CRUD endpoints under `/admin/api-keys/*`. Writes to Firestore. Writes audit events to Postgres. Periodic job regenerates JSON snapshot. |
| **All services using `AuthHandler`** | Subscribe to Firestore `AllowedApps` on startup. Keep in-memory cache. Validate bearer tokens against the cache. |

### Runtime load sequence (service boot)

```
1. AuthHandler constructor called with { apiName }.
2. Initialize empty in-memory cache.
3. Try Firestore first:
    a. Query AllowedApps collection (simple getAll with a 5s timeout).
    b. On success: populate cache, set up snapshot listener for updates.
    c. On failure (timeout, auth, network): log CRITICAL, fall back.
4. Fallback: read JSON at ALLOWED_APPS_DATA_FILE synchronously, populate cache.
    a. Retry Firestore every 30s in the background; on success, swap to Firestore path.
5. Service accepts traffic once cache is non-empty.
6. Live updates: snapshot listener writes through to the cache atomically (build
   new cache, then swap reference).
```
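
Step 6 (build a new cache, then swap the reference) could look roughly like the sketch below. The class name echoes the planned `auth/AllowedAppsCache.ts`, but the shape is an assumption; it also assumes the listener copies each document ID into `uniqueId`, so bookkeeping documents such as `_migration_marker` can be skipped.

```ts
type AllowedAppDoc = {
    uniqueId: string;
    name: string;
    apis: string[];
    keyHash: string;
    previousKeyHash: string | null;
};

class AllowedAppsCache {
    // Readers only ever see a fully built map; the reference changes in one assignment.
    private byHash: ReadonlyMap<string, AllowedAppDoc> = new Map<string, AllowedAppDoc>();

    replaceAll(docs: AllowedAppDoc[]): void {
        const next = new Map<string, AllowedAppDoc>();
        for (const doc of docs) {
            // Assumption: documents whose ID starts with "_" are bookkeeping, not keys.
            if (doc.uniqueId.startsWith("_")) continue;
            next.set(doc.keyHash, doc);
            // Index the previous hash too, so rotation overlap resolves to the same entry.
            if (doc.previousKeyHash !== null) next.set(doc.previousKeyHash, doc);
        }
        this.byHash = next; // the swap
    }

    get(hash: string): AllowedAppDoc | undefined {
        return this.byHash.get(hash);
    }

    get size(): number {
        return this.byHash.size;
    }
}
```

Because a request handler captures nothing but the result of `get`, a snapshot update arriving mid-request cannot expose a half-populated map.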

### Token validation (hot path, unchanged in shape)

```
authorizeHttpRequest(req):
    token = extract & sanitize bearer
    hash = SHA256(token)
    entry = cache.byHash.get(hash) ?? cache.byLegacyPlaintext.get(token)  // see below
    if !entry: 401
    if entry.disabledAt: 401
    if !entry.apis.includes(this.apiName): 403
    if entry.rotationPrevValidUntil && hash === entry.previousKeyHash
        && Date.now() > entry.rotationPrevValidUntil: 401
    record usage (fire-and-forget — see §Usage tracking)
    return new AuthorizedApp(entry.name)
```

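`cache.byLegacyPlaintext` is a transitional map populated only when the cache was loaded from a JSON file that still holds plaintext keys (the pre-migration file, or the test fixture under `API_KEYS_SOURCE=file`). Once a service is fully on Firestore (keys are hashed), the legacy map is empty and unused. Snapshots regenerated from Firestore can only contain hashes, because plaintext is never stored, so the fallback loader has to populate `byHash` from those files rather than `byLegacyPlaintext`.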
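
The pseudocode above translates to a small pure function. This is a sketch, not the final `authorizeHttpRequest`: the type and function names are hypothetical, the map is assumed to index each entry under both its current and previous hash, and it is slightly stricter than the pseudocode in that a previous-hash match with no rotation window set is rejected.

```ts
type CachedKey = {
    name: string;
    apis: string[];
    keyHash: string;
    previousKeyHash: string | null;
    rotationPrevValidUntil: number | null; // Unix ms
    disabledAt: number | null;
};

type AuthResult = { status: 200; appName: string } | { status: 401 | 403 };

function authorizeHash(
    byHash: Map<string, CachedKey>,
    hash: string,
    apiName: string,
    now: number,
): AuthResult {
    const entry = byHash.get(hash);
    if (!entry) return { status: 401 };
    if (entry.disabledAt !== null) return { status: 401 };
    if (!entry.apis.includes(apiName)) return { status: 403 };
    // A match on the previous hash is only valid inside the rotation overlap window.
    if (hash !== entry.keyHash) {
        const until = entry.rotationPrevValidUntil;
        if (hash !== entry.previousKeyHash || until === null || now > until) {
            return { status: 401 };
        }
    }
    return { status: 200, appName: entry.name };
}
```

Passing `now` in explicitly keeps the rotation-expiry branch testable without mocking the clock.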

### Usage tracking (fire-and-forget)

On every successful authorization, push an entry onto an in-memory batch and flush to Firestore every 10 seconds:

```
UPDATE AllowedApps/<id> SET
  lastUsedAt = max(existing, now),
  lastUsedIp = <request IP>,
  usageCount = usageCount + <batch count>
```

This keeps the hot path at zero added latency and provides "last used" visibility without hammering Firestore.
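
One way to structure the in-memory batch is an accumulator keyed by document ID, drained by the periodic flush. `UsageBatcher` and its fields are hypothetical names for illustration.

```ts
type UsageUpdate = { keyId: string; lastUsedAt: number; lastUsedIp: string; count: number };

class UsageBatcher {
    private pending = new Map<string, UsageUpdate>();

    // Called on the hot path: a map update only, no I/O.
    record(keyId: string, ip: string, now: number): void {
        const current = this.pending.get(keyId);
        if (current) {
            current.count += 1;
            if (now >= current.lastUsedAt) {
                current.lastUsedAt = now;
                current.lastUsedIp = ip;
            }
        } else {
            this.pending.set(keyId, { keyId, lastUsedAt: now, lastUsedIp: ip, count: 1 });
        }
    }

    // Called by the periodic flush: returns the batch and starts a fresh one.
    drain(): UsageUpdate[] {
        const batch = Array.from(this.pending.values());
        this.pending = new Map();
        return batch;
    }
}
```

The flush would then apply each `count` as an atomic increment (Firestore's `FieldValue.increment` is one option), so concurrent flushes from several services do not overwrite each other's totals.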

---

## Database Schema

### Firestore collection: `AllowedApps`

Each document represents one API key entry. Document ID is the key's `uniqueId` (UUID).

```ts
{
    uniqueId: string,              // UUID; document ID
    name: string,                  // Human name, e.g. "arda-production"; unique
    description: string | null,    // Optional context
    apis: string[],                // Array of apiNames this key can call
    keyHash: string,               // SHA256 of the current plaintext key (hex)
    keyPrefix: string,             // First 8 chars of plaintext, for display (never the whole key)
    previousKeyHash: string | null,       // During rotation overlap
    previousKeyPrefix: string | null,
    rotationPrevValidUntil: number | null, // Unix ms; old key accepted until this timestamp
    reserved: boolean,             // Default false; migrated entries default true
    migratedFromFile: boolean,     // True for rows created by the migration
    ownerAccountId: string | null, // Bigscreen account who "owns" this key
    createdAt: number,             // Unix ms
    createdByAccountId: string | null, // null for migrated rows
    disabledAt: number | null,
    disabledReason: string | null,
    lastUsedAt: number | null,
    lastUsedIp: string | null,
    usageCount: number,            // Cumulative successful-auth count
    unlockedUntil: number | null,  // If reserved, timestamp when unlock expires
    unlockedByAccountId: string | null,
    unlockReason: string | null
}
```

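Indexes: `keyHash` (for O(1) lookup), `previousKeyHash`, `name`, `disabledAt`. Firestore indexes single fields automatically but does not enforce uniqueness, so admin_api must check `name` and `keyHash` uniqueness itself (for example, inside a transaction) before writing.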

### Firestore document: `AllowedApps/_migration_marker`

```ts
{
    completedAt: number,
    sourceFileSha1: string,  // SHA1 of the allowed_apps.json that was migrated
    entryCount: number,
    migratedByAccountId: string | null  // null if ran via automatic boot task
}
```

### Postgres table on admin_api: `api_key_audit_log`

Audit events live in admin_api's Postgres because that's where the dashboard UI reads from, and where structured querying is needed. New file `apps/db_setup/api_keys_audit_setup.ts`:

```sql
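-- uuid_generate_v4() requires the uuid-ossp extension; this is a no-op if it is already enabled.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE api_key_audit_log (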
    "uniqueId"       uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    "at"             BIGINT      NOT NULL,
    "eventType"      VARCHAR(48) NOT NULL,
    "actorAccountId" VARCHAR(64),
    "keyUniqueId"    VARCHAR(64),    -- Firestore document ID; denormalized for traceability
    "keyName"        VARCHAR(128),   -- Denormalized so audit survives deletion
    "ip"             VARCHAR(64),
    "details"        jsonb
);
CREATE INDEX idx_api_key_audit_at  ON api_key_audit_log("at" DESC);
CREATE INDEX idx_api_key_audit_key ON api_key_audit_log("keyUniqueId", "at" DESC);
```

Audit event types:
- `key_created`, `key_deleted`, `key_rotated`, `key_disabled`, `key_enabled`, `key_updated`
- `key_apis_added`, `key_apis_removed`
- `reserved_unlocked`, `reserved_relocked`, `reserved_relocked_manual`, `reserved_toggled`
- `migrated_from_file`
- `validation_failed_unknown_key`, `validation_failed_wrong_api` (sampled, see §Security Requirements; we don't want to flood this table during abusive probing)

---

## Reserved Semantics

### What `reserved: true` blocks

| Operation | Allowed on reserved? |
|---|---|
| Create a key with `reserved: true` | Yes (SuperUser only) |
| Edit description, owner | Yes |
| Add APIs to the `apis` list | Yes |
| Remove APIs from the `apis` list | **No, requires unlock** |
| Rotate key | **No, requires unlock** |
| Disable key | **No, requires unlock** |
| Delete key | **No, requires unlock** |
| Rename key | **No, requires unlock** |
| Toggle `reserved` off | **Requires unlock + confirmation** |

### Unlock ceremony

1. SuperUser navigates to the key's detail page, clicks "Unlock for edit".
2. Modal: confirm the key name, fill mandatory reason text (min 20 chars), re-enter SuperUser's password.
3. On confirm: set `unlockedUntil = now + 15 min`, `unlockedByAccountId`, `unlockReason`. Write audit row `reserved_unlocked`.
4. Dashboard shows a prominent banner: "UNLOCKED — expires in 14:58" with a live countdown and a manual "Lock now" button.
5. Destructive operations are available during the window. Each one writes its own audit row in addition to the unlock row.
6. When `Date.now() > unlockedUntil` OR "Lock now" is clicked: banner clears, `unlockedUntil` is cleared, audit row `reserved_relocked` (auto) or `reserved_relocked_manual`.

### Implementation notes

- The 15-minute window lives in Firestore (not in cookies / client state), so multiple SuperUsers see consistent state.
- Server-side every destructive operation checks `reserved === false OR (unlockedUntil !== null AND unlockedUntil > now)`. Client-side state is cosmetic only.
- `reserved` can only be toggled off by going through unlock. You cannot skip unlock by toggling `reserved` first.
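
The server-side guard reduces to one predicate over the stored state. A minimal sketch, with hypothetical function names and the 15-minute window from the ceremony above:

```ts
type ReservedState = { reserved: boolean; unlockedUntil: number | null };

const UNLOCK_WINDOW_MS = 15 * 60 * 1000;

// True when a destructive operation may proceed at time `now` (Unix ms).
function canRunDestructiveOp(key: ReservedState, now: number): boolean {
    return key.reserved === false || (key.unlockedUntil !== null && key.unlockedUntil > now);
}

// Returns the state after a successful unlock ceremony started at `now`.
function unlock(key: ReservedState, now: number): ReservedState {
    return { ...key, unlockedUntil: now + UNLOCK_WINDOW_MS };
}

// Returns the state after "Lock now" or expiry cleanup.
function relock(key: ReservedState): ReservedState {
    return { ...key, unlockedUntil: null };
}
```

Every destructive handler would call `canRunDestructiveOp` with the freshly read document, never with state supplied by the client.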

---

## Migration Plan

### Trigger

On admin_api boot. Runs in the same process as an awaited async task, and blocks admin_api's ready-for-traffic state until it completes (fails fast if Firestore is unreachable).

### Logic

```
1. Read AllowedApps/_migration_marker from Firestore.
2. If exists and completedAt is set: log "migration already complete", exit.
3. Else:
    a. Acquire Firestore advisory lock by creating AllowedApps/_migration_lock with
       { startedAt, startedByHost }. If it already exists and is < 5 min old,
       another instance is doing the migration — wait+retry. If it's older,
       assume stale lock and overwrite (another instance crashed).
    b. Read JSON file at ALLOWED_APPS_DATA_FILE.
    c. Compute SHA1 of the file contents as a traceability artifact.
    d. For each entry { name, apis, id }:
        - If an entry in Firestore already has the same keyHash, skip (defensive;
          should only happen on retried-after-partial-crash migrations).
        - Otherwise create a Firestore document with:
            uniqueId = uuid()
            name = entry.name
            apis = entry.apis
            keyHash = SHA256(entry.id)
            keyPrefix = entry.id.slice(0, 8)
            previousKeyHash = null
            rotationPrevValidUntil = null
            reserved = true
            migratedFromFile = true
            ownerAccountId = null
            createdAt = now()
            createdByAccountId = null
            disabledAt = null
            usageCount = 0
          Write Postgres audit row migrated_from_file per entry.
    e. Handle duplicate-id cases (see note on tests/allowed_apps.json below).
       Detect these by grouping file entries by id before step d runs;
       otherwise the keyHash check in step d silently skips the second entry:
        - Log a WARNING with both names
        - Create ONE Firestore document whose name is the concatenation
          "nameA+nameB" and whose apis is the union. Note this widens scope
          relative to current runtime behavior, where first-match wins and
          only the first entry's apis are honored.
        - Audit row notes the merge.
    f. Write _migration_marker with { completedAt, sourceFileSha1, entryCount }.
    g. Delete _migration_lock.
```
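
The lock handling in step 3a comes down to a three-way decision. The sketch below isolates that decision as a pure function; `decideLock` and its return values are illustrative names.

```ts
type MigrationLock = { startedAt: number; startedByHost: string };
type LockDecision = "acquire" | "wait" | "steal";

const STALE_LOCK_MS = 5 * 60 * 1000;

function decideLock(existing: MigrationLock | null, now: number): LockDecision {
    // No lock document: this instance may create one and proceed.
    if (existing === null) return "acquire";
    // A lock younger than 5 minutes belongs to a live migration; older ones are treated as stale.
    return now - existing.startedAt < STALE_LOCK_MS ? "wait" : "steal";
}
```

The read and the subsequent write still need to be atomic. One option is to create the lock with a create-if-absent call (the Firestore Node.js client's `DocumentReference.create` rejects when the document already exists) and to perform the stale-lock overwrite inside a transaction.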

**About the `tests/allowed_apps.json` duplicate id:** the current file has `accounts-integration-tests` and `admin-integration-tests` sharing the same `id`. `Array.find` returns the first match only, so today one of these effectively shadows the other. The merge behavior above makes this bug's resolution explicit and reviewable in the audit log.
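
A sketch of the planning half of the migration (steps d and e), kept free of I/O so it can be unit-tested: it groups file entries by `id`, merges duplicates, and returns the documents to write. `planMigration` and the `mergedFrom` field are hypothetical; only the fields relevant to the merge are shown.

```ts
import { createHash, randomUUID } from "node:crypto";

type FileEntry = { name: string; apis: string[]; id: string };
type PlannedDoc = {
    uniqueId: string; name: string; apis: string[];
    keyHash: string; keyPrefix: string;
    reserved: boolean; migratedFromFile: boolean;
    mergedFrom: string[]; // non-empty only when entries were merged; feeds the audit row
};

function planMigration(entries: FileEntry[]): PlannedDoc[] {
    // Group entries by plaintext id, preserving file order.
    const groups = new Map<string, FileEntry[]>();
    for (const entry of entries) {
        const group = groups.get(entry.id);
        if (group) group.push(entry);
        else groups.set(entry.id, [entry]);
    }
    const docs: PlannedDoc[] = [];
    groups.forEach((group, id) => {
        const apis: string[] = [];
        for (const entry of group) {
            for (const api of entry.apis) if (!apis.includes(api)) apis.push(api);
        }
        docs.push({
            uniqueId: randomUUID(),
            name: group.map(e => e.name).join("+"),
            apis, // union across the group
            keyHash: createHash("sha256").update(id, "utf8").digest("hex"),
            keyPrefix: id.slice(0, 8),
            reserved: true,
            migratedFromFile: true,
            mergedFrom: group.length > 1 ? group.map(e => e.name) : [],
        });
    });
    return docs;
}
```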

### What to do with the JSON file after migration

1. The file is **not deleted automatically**. After migration, services still fall back to it on Firestore unavailability, so removing it would reduce resilience.
2. A periodic job on admin_api (every 5 min) regenerates the file from Firestore, overwriting the migrated-from version. This keeps the fallback snapshot fresh.
3. Operators can manually delete the file once Firestore has been stable long enough (Phase 6 suggests ~3 months); at that point the bootstrap-fallback path no longer finds a file and only logs a warning.
4. For tests: the in-repo `tests/allowed_apps.json` is **not migrated**. Tests run with `API_KEYS_SOURCE=file` (new env var) which forces the fallback path. Test isolation preserved.

---

## API Endpoints (admin_api)

All SuperUser-only (via `AuthApi.getAccessPolicyHandler([AuthSchemas.AccessPolicy.SuperUser])`).

| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/admin/api-keys` | List all keys (no keyHash in response; only prefix + metadata) |
| POST | `/admin/api-keys` | Create key; returns plaintext key ONCE in the response body |
| GET | `/admin/api-keys/:uniqueId` | Fetch single key metadata |
| PUT | `/admin/api-keys/:uniqueId` | Update description, owner, apis (see reserved rules) |
| POST | `/admin/api-keys/:uniqueId/rotate` | Generate new plaintext; set `previousKeyHash`; return new plaintext ONCE |
| POST | `/admin/api-keys/:uniqueId/disable` | Set `disabledAt` |
| POST | `/admin/api-keys/:uniqueId/enable` | Clear `disabledAt` |
| DELETE | `/admin/api-keys/:uniqueId` | Hard delete (requires unlock if reserved) |
| POST | `/admin/api-keys/:uniqueId/unlock` | Start unlock ceremony (requires reason + password reconfirm) |
| POST | `/admin/api-keys/:uniqueId/relock` | Cancel unlock immediately |
| POST | `/admin/api-keys/:uniqueId/reserved` | Toggle reserved flag (requires unlock to turn off) |
| GET | `/admin/api-keys/:uniqueId/audit` | Paged audit log for this key |
| GET | `/admin/api-keys/audit` | Global audit log (paged) |
| POST | `/admin/api-keys/force-migrate` | Re-run migration (ignores marker); only callable when DB is manually cleared; ceremony + extra confirmation required |
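
The state change behind `POST /admin/api-keys/:uniqueId/rotate` can be sketched as a pure function over the rotation-related fields. `rotateKey` is a hypothetical name; the 7-day default mirrors the rotation window given under Security Requirements.

```ts
import { createHash, randomBytes } from "node:crypto";

type RotatableKey = {
    keyHash: string;
    keyPrefix: string;
    previousKeyHash: string | null;
    previousKeyPrefix: string | null;
    rotationPrevValidUntil: number | null; // Unix ms
};

const DEFAULT_OVERLAP_MS = 7 * 24 * 60 * 60 * 1000;

function rotateKey(
    key: RotatableKey,
    now: number,
    overlapMs: number = DEFAULT_OVERLAP_MS,
): { updated: RotatableKey; plaintextKey: string } {
    const plaintextKey = randomBytes(48).toString("base64url"); // 64 chars
    const updated: RotatableKey = {
        keyHash: createHash("sha256").update(plaintextKey, "utf8").digest("hex"),
        keyPrefix: plaintextKey.slice(0, 8),
        // The outgoing key stays valid until the overlap window closes.
        previousKeyHash: key.keyHash,
        previousKeyPrefix: key.keyPrefix,
        rotationPrevValidUntil: now + overlapMs,
    };
    return { updated, plaintextKey };
}
```

`updated` is what gets written to Firestore; `plaintextKey` goes into the one-time response body and nowhere else.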

### Response shape for list/detail

```ts
{
    uniqueId, name, description, apis,
    keyPrefix, keyHashTruncated,   // "abcd1234..."
    reserved, migratedFromFile, ownerAccountId,
    createdAt, createdByAccountId,
    disabledAt, disabledReason,
    lastUsedAt, lastUsedIp, usageCount,
    unlockedUntil, unlockedByAccountId, unlockReason,
    rotationPrevValidUntil, previousKeyPrefix
}
// The full keyHash and previousKeyHash are NEVER returned to any client;
// keyHashTruncated carries only a short prefix of the hash.
```
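
One way to make the "never returned" rule hard to violate is to route every response through a single mapper that drops the hash fields. The sketch below uses a trimmed-down document type and a hypothetical `toPublicView`; the 8-character truncation is an assumption based on the `"abcd1234..."` example.

```ts
type StoredKey = {
    uniqueId: string;
    name: string;
    apis: string[];
    keyPrefix: string;
    keyHash: string;
    previousKeyHash: string | null;
};

type PublicKeyView = Omit<StoredKey, "keyHash" | "previousKeyHash"> & { keyHashTruncated: string };

function toPublicView(doc: StoredKey): PublicKeyView {
    // Destructure the sensitive fields out so they cannot leak through the spread.
    const { keyHash, previousKeyHash, ...rest } = doc;
    void previousKeyHash; // intentionally discarded
    return { ...rest, keyHashTruncated: keyHash.slice(0, 8) + "..." };
}
```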

### Response shape for create/rotate (ONE-TIME reveal)

```ts
{
    uniqueId, name, ...
    plaintextKey: "64-char-key-shown-once"
}
```

The UI displays the plaintext key in a copy-to-clipboard modal with a warning: "This is the only time this key will be shown. Store it securely now." Refreshing the page does not re-show the key — the plaintext was never stored.

---

## UI Changes (arda)

### New top-level page: `/system/api-keys`

SuperUser-gated (via `ArdaWrapper` access-policy check, same pattern as other staff-only pages). Not exposed to non-SuperUsers at all; menu entry hidden.

### Pages

1. **List view** (`/system/api-keys`)
   - Table columns: Name, APIs (as tags), Reserved badge, Disabled badge, Last Used, Created, Actions (View / Disable / Unlock / Rotate)
   - Filters: by API, by reserved, by disabled, free-text search on name
   - "Create new key" button top-right

2. **Create modal**
   - Form: name, description, APIs (multi-select from known apiNames), owner account, reserved checkbox (default: false)
   - On submit: create → reveal modal shows plaintext ONCE with copy button and acknowledgment checkbox

3. **Detail page** (`/system/api-keys/:uniqueId`)
   - Header: name, key prefix, reserved badge, disabled badge, live unlock countdown if active
   - Tabs: Details | Audit Log | Usage
   - Details tab: all editable fields (subject to reserved rules); APIs list with per-api add/remove chips
   - Audit Log tab: paged event list with filters
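   - Usage tab: last used timestamp, last used IP, usage count total, simple sparkline of daily usage for last 30 days. The sparkline cannot come from the audit log, because successful validations are not audited individually and `usageCount` is cumulative; it needs a per-day counter that the schema does not yet define.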
   - Actions panel: Rotate, Disable/Enable, Delete, Unlock/Relock, Toggle Reserved

4. **Reveal modal** (shared by Create and Rotate)
   - Prominent display of plaintext key
   - "Copy to clipboard" button
   - "I have saved this key" acknowledgment checkbox (required to dismiss)
   - On dismiss: key cannot be shown again

### Menu

New entry under a "System" menu category in `ArdaWrapper.jsx`, visible only to SuperUser. Adjacent to "Developers" (from plan 14) if that ships.

---

## Files to Create

| File | Purpose | Approx LOC |
|---|---|---|
| `auth/AllowedAppsDatabase.ts` | Firestore CRUD + snapshot listener; SHA256 hashing helpers; usage batch flush | 350 |
| `auth/AllowedAppsCache.ts` | In-memory cache with atomic swap; fallback-to-JSON path | 200 |
| `api/src/admin/AllowedAppsApi.ts` | Admin API handlers for `/admin/api-keys/*` | 500 |
| `apps/db_setup/api_keys_audit_setup.ts` | Postgres audit table migration | 60 |
| `apps/admin_api/bootstrap/migrateAllowedApps.ts` | One-time migration from JSON file to Firestore; runs on admin_api boot | 200 |
| `apps/admin_api/bootstrap/exportAllowedApps.ts` | Periodic job that writes JSON snapshot from Firestore | 100 |
| `webapps/src/components/System/ApiKeys/ApiKeysList.jsx` | Dashboard list view | 250 |
| `webapps/src/components/System/ApiKeys/ApiKeyDetail.jsx` | Detail page | 300 |
| `webapps/src/components/System/ApiKeys/ApiKeyEditor.jsx` | Create / edit form | 250 |
| `webapps/src/components/System/ApiKeys/KeyRevealModal.jsx` | One-time plaintext display modal | 80 |
| `webapps/src/components/System/ApiKeys/UnlockCeremonyModal.jsx` | Reason + password reconfirm modal | 120 |
| `webapps/src/components/System/ApiKeys/ApiKeyAuditLog.jsx` | Audit log viewer | 150 |
| `tests/auth/AllowedAppsDatabase.spec.ts` | Unit tests for DB layer | 300 |
| `tests/api/AllowedAppsApi.spec.ts` | Integration tests for admin endpoints | 400 |
| `tests/auth/Migration.spec.ts` | Migration tests (empty DB, partial migration, re-run protection, duplicate id handling) | 250 |
| `docs/api-keys-dashboard.md` | Operator documentation | 300 |

## Files to Modify

| File | Change |
|---|---|
| `auth/Auth.ts` | Rewrite `AuthHandler` constructor (lines 332-350) + `authorizeHttpRequest` (lines 362-406): replace JSON-file `require()` with `AllowedAppsCache.load()`; switch linear `find` to `byHash.get(sha256(token))`; handle rotation overlap via `previousKeyHash`; fire-and-forget usage tracking. |
| `auth/index.ts` | Export `AllowedAppsDatabase`, `AllowedAppsCache`. |
| `auth/AuthDatabase.ts` | No direct changes, but document that `AllowedApps` is a sibling Firestore collection alongside existing ones at lines 28-40. |
| `apps/admin_api/admin_api.ts` | Register `/admin/api-keys/*` routes (SuperUser-gated). Wire the boot-time migration (`migrateAllowedApps.ts`) to run before `api.listen`. Start the periodic JSON-export job. |
| `webapps/arda/arda.js` | No direct changes — new admin API routes are forwarded via the existing `/api/admin/*` proxy at lines 69-74. |
| `webapps/arda/app/App.jsx` | Add React Router routes: `/system/api-keys`, `/system/api-keys/:uniqueId`. |
| `webapps/arda/app/ArdaWrapper.jsx` (lines 81-93) | Add System menu section with "API Keys" entry, SuperUser-gated. |
| `.env` sample / `DEV_SETUP.md` | New env vars: `API_KEYS_SOURCE=firestore\|file` (default `file` for backward compatibility during rollout; flipped to `firestore` after migration), `API_KEYS_FIRESTORE_COLLECTION` (default `AllowedApps`), `API_KEYS_JSON_EXPORT_INTERVAL_MS` (default 300000). |
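
A sketch of how the three new variables might be read and validated at boot, using the defaults from the table. `readApiKeysConfig` is a hypothetical helper, not an existing function.

```ts
type ApiKeysConfig = {
    source: "firestore" | "file";
    collection: string;
    exportIntervalMs: number;
};

function readApiKeysConfig(env: Record<string, string | undefined>): ApiKeysConfig {
    const source = env["API_KEYS_SOURCE"] ?? "file";
    if (source !== "firestore" && source !== "file") {
        throw new Error(`API_KEYS_SOURCE must be "firestore" or "file", got "${source}"`);
    }
    const exportIntervalMs = Number(env["API_KEYS_JSON_EXPORT_INTERVAL_MS"] ?? "300000");
    if (!Number.isInteger(exportIntervalMs) || exportIntervalMs <= 0) {
        throw new Error("API_KEYS_JSON_EXPORT_INTERVAL_MS must be a positive integer");
    }
    return {
        source,
        collection: env["API_KEYS_FIRESTORE_COLLECTION"] ?? "AllowedApps",
        exportIntervalMs,
    };
}
```

In a service this would be called as `readApiKeysConfig(process.env)`; failing at boot on a typo such as `API_KEYS_SOURCE=firestor` is safer than silently falling back to `file`.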

## Files NOT to Modify

- `tests/allowed_apps.json` — stays as-is for test runs.
- The many test files that construct integration-test requests with hardcoded keys — unaffected because tests run with `API_KEYS_SOURCE=file`.
- The `AllowedApp` type definition at `auth/Auth.ts:40-44` — stays for backward compatibility with any external code that imports it; the Firestore document shape is a superset.

---

## Security Requirements

All P0.

- **Keys never stored in plaintext at rest** (except transiently in memory after creation/rotation for the one-time display). DB holds only `SHA256(key)`.
- **Keys never logged.** The existing `Logger.error` call at `auth/Auth.ts:400` prints the sanitized token on auth failure. This must be redacted to just the prefix (`substring(0,8) + "..."`) before rollout, because these log lines reach log aggregation.
- **Validation errors for unknown keys are sampled before audit-writing.** Writing an audit row for every failed probe is a DoS vector. Sample rate: 1-in-100, or aggregate by IP over a rolling window. Successful validations are NOT individually logged; usage counters capture them in batch.
- **SuperUser policy enforced on every endpoint.** Defense in depth: the API route middleware requires SuperUser, and Firestore access control restricts writes to the admin_api service account.
- **Plaintext reveal modal**: server-side one-time return; front-end holds in component state only (no localStorage, no URL params, no server logs).
- **Unlock ceremony password reconfirm**: prevents the trivial attack of using a logged-in session left unattended at a desk. The reconfirm verifies the password with the same check as the existing `POST /auth/login` flow but issues no new token (just a "yes, this password matches" result). This may require a new endpoint on auth-api: `POST /auth/verify-password`.
- **Reserved flag enforcement on the server side, not just UI.** Every destructive endpoint checks `reserved === false || (unlockedUntil && unlockedUntil > now)`. Client-side disabling of buttons is UX only.
- **Rate limiting on admin endpoints** — same concerns as plan 14; resolve the `apps/api/api.ts:93-94` commented-out limiter before ship.
- **Key rotation window**: during `previousKeyHash` validity, both old and new keys are accepted. The window defaults to 7 days; configurable per rotation. After expiry, the old key permanently fails; no way to resurrect.
- **Admin_api itself authenticates with an API key.** Chicken-and-egg: if the migration corrupts admin_api's own key, admin_api can't restart. Mitigation: the bootstrap loads from JSON first (fast, synchronous), does its own auth, THEN swaps to Firestore. Also: admin_api's production key should be reserved, and SuperUser should rotate-with-overlap rather than hard-replace.
- **Audit row immutability**: audit table is append-only. No update endpoint. Admin_api's DB user has `INSERT, SELECT` grants only (no UPDATE, DELETE) on `api_key_audit_log`.
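- **Firestore access control**: only admin_api's service account can write to `AllowedApps`. Auth-api, cloud-api, etc. have read-only access. Server SDKs authenticate as service accounts and bypass Firestore security rules, so these restrictions are enforced through the IAM roles granted to each service account; security rules should additionally deny all client-side SDK access. The usage-counter flush in §Usage tracking writes to `AllowedApps` from every service, so it needs either a narrowly scoped exception or routing through admin_api.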
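
The 1-in-100 sampling of failed validations could be as simple as a per-IP counter. This is a sketch with a hypothetical `FailureSampler` class; a production version would also need eviction or a rolling window so the map cannot grow without bound.

```ts
class FailureSampler {
    private counts = new Map<string, number>();
    private readonly every: number;

    constructor(every: number = 100) {
        this.every = every;
    }

    // Returns true for the 1st, 101st, 201st... failure seen from a given IP.
    shouldAudit(ip: string): boolean {
        const seen = this.counts.get(ip) ?? 0;
        this.counts.set(ip, seen + 1);
        return seen % this.every === 0;
    }
}
```

Auditing the first failure from each IP, rather than the hundredth, means a single mistyped key still leaves a trace.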

---

## Rollout Phases

### Phase 1 — Schema + CRUD + audit (no runtime change)

- Firestore `AllowedApps` collection shape agreed
- Postgres `api_key_audit_log` migration
- `AllowedAppsDatabase.ts` CRUD (Firestore)
- Admin_api `/admin/api-keys/*` endpoints (SuperUser)
- Arda dashboard UI: list, detail, create, rotate, disable, audit log
- Services continue to load JSON at boot (no change to hot path)
- New keys created in the dashboard are written to BOTH Firestore AND appended to the JSON file (via admin_api writing to its filesystem and the periodic export job)

**Deliverable:** dashboard usable for viewing keys + creating new keys. Existing keys still flow through JSON file. No services changed.

### Phase 2 — Migration

- Run `migrateAllowedApps.ts` on admin_api boot (or manually via `POST /admin/api-keys/force-migrate` with super-confirmation)
- Verify Firestore now mirrors JSON file (entry count matches, hashes check out by comparing SHA256(JSON entry id) vs DB keyHash)
- JSON file remains authoritative at runtime — services still load from it

**Deliverable:** Firestore is populated from the current JSON; all migrated entries reserved.

### Phase 3 — Services read from Firestore with JSON fallback

- Rewrite `AuthHandler` to subscribe to Firestore snapshots when `API_KEYS_SOURCE=firestore`
- Default `API_KEYS_SOURCE=file` in production until explicitly flipped per service
- Flip services one at a time: cloud-api first (lowest risk), then auth-api, then admin_api last
- Each flip is a simple env var change + rolling restart
- JSON file stays as fallback; periodic export keeps it fresh

**Deliverable:** all services use Firestore as primary; JSON is fallback only.

### Phase 4 — Deprecate JSON for new key management

- Dashboard stops writing to the JSON file on create/rotate (only Firestore)
- Periodic export job keeps writing JSON from Firestore for disaster recovery
- Documented: "Do not edit allowed_apps.json manually — it is auto-generated"

**Deliverable:** Firestore is the only edit surface; JSON is read-only artifact.

### Phase 5 — Usage tracking + UX polish

- Usage counters, last-used timestamps, sparkline charts
- Reserved unlock ceremony fully wired including password reconfirm
- Bulk operations: disable multiple keys, filter-driven batch actions

**Deliverable:** observability and safety rails complete.

### Phase 6 (optional) — Retire the JSON file

- After ~3 months of stable Firestore operation: document JSON file removal
- Services still have fallback code; it simply looks for the file, fails to find it, and logs a warning
- Operators can delete the file on EC2

---

## Verification

Run the existing auth + admin_api test suites after every phase to confirm no regression.

### Phase 1 tests (`tests/api/AllowedAppsApi.spec.ts`)

- Create key → dashboard returns plaintext once; subsequent GET does NOT include plaintext; `keyHash === SHA256(plaintext)`
- Duplicate name rejected (400)
- Rotate key → new plaintext returned; old hash moves to `previousKeyHash`; both hashes validate until `rotationPrevValidUntil`
- Disable → key rejects with 401 on bearer auth; audit row written
- Enable → key accepts again
- Delete (non-reserved) → document removed from Firestore
- Non-SuperUser → 403 on all endpoints

### Phase 2 tests (`tests/auth/Migration.spec.ts`)

- Empty Firestore + JSON with 7 entries → migration creates 6 docs (the duplicate-id pair merges into one), all `reserved=true`, `migratedFromFile=true`
- Re-run with marker present → no changes; audit confirms idempotency
- Duplicate-id case (test file has this) → merged entry; audit row explains the merge
- Partial failure (Firestore returns error mid-migration) → marker NOT written; next run retries cleanly
- Concurrent migration attempts → advisory lock prevents double-write; second attempt waits

### Phase 3 tests (`tests/auth/AllowedAppsCache.spec.ts`)

- `AuthHandler` with `API_KEYS_SOURCE=firestore` and Firestore populated → validates bearer correctly via hash lookup
- Firestore snapshot update (add new key) → new key validates within 5 seconds
- Firestore snapshot update (disable key) → key rejected within 5 seconds
- Firestore unavailable at boot → falls back to JSON; logs CRITICAL; retries in background
- Rotation overlap: old key accepted until `rotationPrevValidUntil`, then rejected

### Reserved unlock tests

- Destructive op on reserved key without unlock → 403 `reserved_locked`
- Unlock → destructive op allowed for 15 min
- Destructive op 16 min after unlock → 403 again
- Manual relock → destructive op rejected immediately
- Unlock requires password reconfirm (wrong password → 401; correct → 200)
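
The time-boxed behavior in these cases might be modeled as below. This is a sketch under stated assumptions: `UnlockState`, `isUnlocked`, `checkDestructiveOp`, and `relock` are hypothetical names, and the password reconfirm and audit write are omitted. Only the 15-minute window and the `reserved_locked` code come from the tests above.

```ts
const UNLOCK_WINDOW_MS = 15 * 60 * 1000;

// Hypothetical per-key unlock state: null means locked.
type UnlockState = { unlockedAtMs: number | null };

function isUnlocked(state: UnlockState, nowMs: number): boolean {
    return state.unlockedAtMs !== null && nowMs - state.unlockedAtMs < UNLOCK_WINDOW_MS;
}

// Gate for destructive operations (delete, rotate, scope reduction).
function checkDestructiveOp(
    reserved: boolean,
    state: UnlockState,
    nowMs: number,
): { status: number; code?: string } {
    if (reserved && !isUnlocked(state, nowMs)) {
        return { status: 403, code: "reserved_locked" };
    }
    return { status: 200 };
}

// Manual relock discards the unlock timestamp immediately.
function relock(): UnlockState {
    return { unlockedAtMs: null };
}
```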

### Security-critical tests

- Key NEVER returned in any GET response body
- Creating a key and reading list immediately → plaintext is not in list
- Deleted keys still have audit rows (the key name is denormalized into each row)
- Audit sampling under load: 1000 failed-auth requests from the same IP → audit table has roughly 10 sampled rows, not 1000
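
One way to get from 1000 failures to about 10 audit rows is a per-IP counter that records the first failure and then every hundredth one. The sketch below is illustrative only: the 1-in-100 rate and the `FailedAuthSampler` class are assumptions chosen to match the numbers in the test case, and a real version would also need to expire counters over time.

```ts
const SAMPLE_EVERY = 100; // assumed sampling rate

// Hypothetical in-memory sampler for failed-auth audit rows, keyed by client IP.
class FailedAuthSampler {
    private counts = new Map<string, number>();

    // Returns true when this failure should be written to the audit table.
    shouldAudit(ip: string): boolean {
        const n = (this.counts.get(ip) ?? 0) + 1;
        this.counts.set(ip, n);
        // Audits failure 1, 101, 201, ... for each IP.
        return (n - 1) % SAMPLE_EVERY === 0;
    }
}
```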

---

## Operational Notes

### On-boarding a new service using AuthHandler

No code changes are needed for new services: they still call `Auth.AuthHandler.createAuthHandler({ apiName: "new-api" })`. When they come up:
- If `API_KEYS_SOURCE=firestore`: subscribes to Firestore collection; needs read access to the `AllowedApps` collection via its service account.
- If `API_KEYS_SOURCE=file`: needs `ALLOWED_APPS_DATA_FILE` on local disk.

To let clients call the new service, create a key via the dashboard with `apis: ["new-api"]` and distribute the plaintext key over a secure channel.

### Breaking glass: Firestore completely down

- All services are already running with their cached in-memory list — keys keep validating.
- If a service restarts during the outage: it falls back to JSON file at `ALLOWED_APPS_DATA_FILE`, logs CRITICAL.
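- The JSON file reflects the last export that completed before the outage began. With the 5-minute export cadence, it can miss changes made in the final 5 minutes before Firestore went down; it does not drift further during the outage, because no dashboard changes are possible. Acceptable for most incidents.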
- Dashboard (CRUD) is unavailable during outage; existing keys keep working.
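
The restart-during-outage path could look roughly like this. It is a sketch with injected dependencies so it runs without Firestore: `loadAtBoot` and its three callback parameters are hypothetical, the entry type is the existing `AllowedApp` shape, and the background retry is left out.

```ts
type AllowedApp = { name: string; apis: string[]; id: string };
type LoadResult = { apps: AllowedApp[]; source: "firestore" | "file" };

// Hypothetical boot loader: try Firestore first, fall back to the JSON snapshot.
async function loadAtBoot(
    fetchFromFirestore: () => Promise<AllowedApp[]>,
    readJsonFile: () => Promise<AllowedApp[]>,
    logCritical: (msg: string) => void,
): Promise<LoadResult> {
    try {
        return { apps: await fetchFromFirestore(), source: "firestore" };
    } catch (err) {
        logCritical(`Firestore unreachable at boot, falling back to JSON file: ${String(err)}`);
        // If the file is also missing, this rejects and the service fails to start.
        return { apps: await readJsonFile(), source: "file" };
    }
}
```

Returning the `source` alongside the list lets the caller decide whether to schedule a background retry against Firestore.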

### Breaking glass: key accidentally deleted

- If the audit log shows who deleted the key and when, and a recent Firestore export exists, either:
  - manually restore the entry from the JSON file copy of that export, or
  - re-create it via the dashboard with the known key from the integrator's environment (if they still have it).
- There is no automatic recovery. Deletion is audited but cannot be undone.

### Migrating to a second Firestore project

If multi-region / DR becomes a concern later: the snapshot listener pattern works against any Firestore project. Point `API_KEYS_FIRESTORE_COLLECTION` at the new project and run a data copy job. Out of scope for this plan.

---

## Open Questions to Resolve During Implementation

- **Which Firestore project hosts `AllowedApps`?** Same as the existing `BigScreenUsers` / `RefreshTokenWhitelist` project, or a separate one? Same is simpler; separate gives blast-radius isolation.
- **`POST /auth/verify-password` endpoint on auth-api** — needed for the unlock password reconfirm. Does this exist anywhere? If not, add it in Phase 5.
- **Service account IAM for Firestore writes** — auth-api currently only reads Firestore via `AuthDatabase`. Does its service account have write permission? (It should not need to write to `AllowedApps`; only admin_api should.) Confirm principle-of-least-privilege.
- **Handling of the duplicate-id case in `tests/allowed_apps.json`** — confirm with whoever originally authored the file whether this was intentional. If intentional, the merge behavior is correct. If accidental, this plan fixes a latent bug; call it out in the release notes.
- **Reserved flag default for migrated entries** — this SPEC defaults to `reserved=true`. Alternative: a bootstrap manifest maps known entries to their correct reserved state (integration-test entries as `reserved=false`, production entries as `reserved=true`). If we have time to classify them, the manifest approach is better ops ergonomics. Otherwise, default-reserved + unlock-as-needed is safe.
- **JSON export: overwrite atomically?** — the export job must write to a temp file and rename to avoid torn reads during service boot. Standard atomic-rename pattern.
- **Interaction with plan 14 (OAuth)** — the OAuth SPEC says OAuth clients do NOT need an `allowed_apps.json` entry. This plan does not change that. OAuth clients and API keys are parallel, separate trust systems.
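
For the JSON export question above, the atomic-rename pattern can be sketched as follows. The function name `writeJsonAtomically`, the temp-file naming scheme, and the example entry are assumptions; the sketch also skips `fsync`, so it guards against torn reads but not against data loss on a crash.

```ts
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, join } from "node:path";

// Hypothetical export helper: write to a temp file, then rename over the target.
function writeJsonAtomically(targetPath: string, data: unknown): void {
    // Keep the temp file in the target's directory so the rename stays on one filesystem.
    const tmpPath = join(dirname(targetPath), `.${basename(targetPath)}.${process.pid}.tmp`);
    writeFileSync(tmpPath, JSON.stringify(data, null, 2), "utf8");
    // rename() replaces the target in one step on POSIX filesystems, so a booting
    // service sees either the old file or the new one, never a partial write.
    renameSync(tmpPath, targetPath);
}

// Usage: export to a scratch directory and read the result back.
const scratchDir = mkdtempSync(join(tmpdir(), "allowed-apps-"));
const target = join(scratchDir, "allowed_apps.json");
writeJsonAtomically(target, [{ name: "example", apis: ["new-api"], id: "example-id" }]);
const roundTripped = JSON.parse(readFileSync(target, "utf8"));
```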
