# 15 - API Keys Dashboard (REVISED)

**Status: NOT STARTED**

**Supersedes:** `SPEC.md` in this folder. The only substantive change is the storage stack: the source of truth moves from Firestore to **Postgres RDS** (on admin_api's existing fabricator pool), with **Redis as the runtime read cache** used by every service on the hot path. Rest of the design — reserved flag, unlock ceremony, one-time-reveal, hashed storage, migration from JSON, dashboard UI — is unchanged.

## Summary

Replace the static `allowed_apps.json` file with a Postgres-backed table managed through a new dashboard in arda. Every service continues to validate `Authorization: Bearer` app keys on the hot path, but does so by reading from a shared Redis hash rather than a per-service in-memory copy of the file. The dashboard writes to Postgres (transactional, audited), then propagates the change to Redis. Services read Redis per-request; no per-service snapshot subscription, no per-service Postgres connection.

**Data flow at a glance:**

```mermaid
flowchart LR
    SU([SuperUser]) -->|HTTPS| ARDA[arda<br/>/system/api-keys]
    ARDA -->|/api/admin/api-keys/*<br/>via existing proxy| ADMIN[admin_api]

    ADMIN -->|INSERT/UPDATE<br/>in one tx| PG[(Postgres RDS<br/>api_keys +<br/>api_keys_audit_log)]
    ADMIN -->|HSET/HDEL| RED[(Redis hash<br/>allowed_apps)]
    ADMIN -.->|5 min<br/>snapshot| JSONF[[JSON file<br/>ALLOWED_APPS_DATA_FILE]]

    IT([Integrator<br/>external app]) -->|Authorization: Bearer| AUTH[auth-api]
    IT -->|Authorization: Bearer| CLOUD[cloud-api]
    IT -->|Authorization: Bearer| ADMIN

    AUTH -->|HGET per request| RED
    CLOUD -->|HGET per request| RED
    ADMIN -->|HGET per request| RED

    AUTH -.->|boot fallback| JSONF
    CLOUD -.->|boot fallback| JSONF
    ADMIN -.->|boot fallback| JSONF

    classDef store fill:#eef,stroke:#88a;
    classDef fallback fill:#fee,stroke:#a88;
    class PG,RED store;
    class JSONF fallback;
```

- **Solid arrows** are per-request hot-path calls.
- **Dashed arrows** are fallback paths used only when Redis is unreachable at boot, plus the periodic snapshot that keeps the fallback fresh.
- Postgres is the source of truth; Redis is the read cache; JSON is disaster-recovery.

Every migrated key defaults to `reserved: true`. Destructive operations on reserved keys require a time-boxed 15-minute unlock ceremony with a mandatory reason and password reconfirm, all audited in Postgres.

## Current State

### `allowed_apps.json` today

- Path: `process.env.ALLOWED_APPS_DATA_FILE`. In-repo fixture: `tests/allowed_apps.json` (7 entries). Production file lives on EC2 and is not in the repo.
- Loaded synchronously at service boot via `require(process.env.ALLOWED_APPS_DATA_FILE)` at `auth/Auth.ts:337`.
- Entry shape (`auth/Auth.ts:40-44`): `{ name, apis: string[], id }`.
- Validation path (`auth/Auth.ts:362-406`): sanitize bearer, linear scan `this.allowedApps.find(o => o.id === sanitizedToken)`, check `o.apis.includes(this.apiName)`.
- Callers construct one `AuthHandler` per api they serve. Confirmed in `apps/admin_api/admin_api.ts:41,71,93,114` (admin-api, fusion-api, cnc-machine-api, pose-api), `apps/api/api.ts` (auth server), likely cloud-api etc.

### Existing Postgres stack (source of truth for the new table)

- `lib/PostgresDatabase.ts` exposes `Postgres.getFabricatorClient()` (line 251) and `Postgres.getCloudClient()` (line 255). Both use `pg.Pool` backed by RDS in production (connection config via env).
- admin_api already owns tables in the fabricator pool; adding `api_keys` and `api_keys_audit_log` fits existing conventions (see `apps/db_setup/beyond_db_setup.ts` for style).
- Base class `PostgresRow` at `lib/PostgresDatabase.ts:85` provides `uniqueId`, `id`, `createdAt` plus `toRedisObject()` serialization — reuse as the pattern for the new table's row class.

### Existing Redis stack

- `@bigscreen/lib` `RedisClient()` (imported throughout `auth/AuthDatabase.ts`).
- Redis is already used for: `tokenRenewalNonceList`, `accessToken:{accountId}` sets, `singleUseToken:*`, `singleUseToken2:*`.
- Every service currently uses Redis. No new connection needed.

### Pain points this resolves

1. Every file edit requires a redeploy of every service consuming the file.
2. No audit trail for who changed which key when and why.
3. Plaintext secrets on disk — readable by anyone with EC2 SSH or ops-repo access.
4. No UI to see what keys exist, which APIs they can call, or which are stale.
5. **Duplicate-id bug** in `tests/allowed_apps.json` (two entries, `accounts-integration-tests` and `admin-integration-tests`, share `f4czwPZdYidGTcWRbhcWAcFiTaDYhYDraDsjYLYO2YXd4LxQqpMpbcdmyxZEBKNZ`); `Array.find` returns the first match, silently shadowing the second. Unique constraints in Postgres catch this class of bug.
6. No usage tracking — can't identify stale keys.

## Key Architectural Decisions

### 1. Source of truth: Postgres RDS (admin_api fabricator pool)

Why Postgres, not Firestore:
- admin_api already owns the pool. No new infra.
- Transactional writes (row + audit row in one commit).
- Schema-enforced types, uniqueness constraints (catches the dup-id bug).
- Standard backup/restore lives on RDS.
- SQL lets the audit viewer do complex queries (who edited reserved keys this month, keys with no usage in 90 days, etc.) without inventing query APIs.

Non-admin services do **not** connect to Postgres. Only admin_api writes and reads; every other service gets its data via Redis (see §2).

### 2. Runtime cache: Redis hash `allowed_apps`

Every service's hot path does one call: `HGET allowed_apps <sha256(bearer)>`. Typical cost: 0.1–1 ms round-trip. Replaces today's in-memory `.find()` that scans the file-loaded array.

Redis structure:

```
Key: "allowed_apps"   (Redis hash)
Fields: <sha256(plaintextKey) as hex>
Values: JSON:
  {
    uniqueId, name, apis: string[],
    reserved, disabledAt, rotationPrevValidUntil,
    isPreviousKey: boolean     // true for the old hash during rotation overlap
  }
```

Rotation overlap is expressed by writing two fields to the hash for the same logical record (current + previous), both with `uniqueId` pointing at the same Postgres row. Field values carry enough metadata for the validator to enforce `rotationPrevValidUntil`. A periodic cleanup job on admin_api HSCANs the hash every 5 min and HDELs previous-key entries whose `rotationPrevValidUntil` has passed.

Why Redis and not per-service in-memory cache:
- Cache invalidation is solved by the admin_api write path — single writer, single cache, always consistent.
- No per-service subscription, polling, or boot-time data fetch.
- Changes propagate within the one Redis round-trip a service already makes.

Why Redis and not "go to Postgres on every request":
- 10–30× faster per-request.
- admin_api's RDS would eat the load of every API call to every service.
- Keeps the existing security-group model (only admin_api needs RDS access).

### 3. Boot fallback: JSON file (unchanged behaviour)

If Redis is unreachable at boot, `AuthHandler` still loads `ALLOWED_APPS_DATA_FILE` into an in-memory map, logs CRITICAL, and keeps polling Redis in the background. The JSON file is regenerated every 5 minutes by admin_api from the Postgres source. This preserves the current resilience properties: if everything cloud-side goes down, the service still starts with its last-known-good snapshot.

### 4. Keys stored as SHA256 hashes, not plaintext

Dashboard generates 64-char base64url keys on creation; shows the plaintext **once** in a reveal modal; stores `SHA256(key)` in Postgres (`keyHash` column) and writes that same field into Redis. Validation = `HGET allowed_apps sha256(incomingToken)`.

SHA256 not bcrypt: bcrypt's random salt makes O(1) lookup impossible (would require iterating all rows per request). API keys have ~384 bits of entropy, making rainbow tables infeasible. Deterministic hash is adequate for this threat model.

### 5. `reserved` flag with unlock ceremony

Identical to the original SPEC. Reserved rows block destructive ops (rotate, disable, delete, rename, remove-APIs) behind a 15-minute unlock window that requires a mandatory reason string and password reconfirm.

### 6. Migration from JSON is one-time, idempotent, conservative

Admin_api boot reads `ALLOWED_APPS_DATA_FILE`, inserts one Postgres row per entry with `reserved=true`, writes a marker row, and populates Redis. Idempotency via marker row. Subsequent boots skip.

---

## Architecture

### Service boundaries

```mermaid
graph TB
    subgraph browser[Browser]
        SU([SuperUser<br/>arda cookie session])
    end

    subgraph cloud[Bigscreen cloud]
        subgraph ardaBox["arda :3010"]
            ARDA_UI[React dashboard]
            ARDA_PROXY["Express proxy<br/>/api/admin/*"]
        end

        subgraph adminBox["admin_api :3999"]
            ADMIN_ROUTES["/admin/api-keys/* CRUD"]
            ADMIN_JOBS["Background jobs<br/>export / reconcile<br/>sweep / usage flush"]
            ADMIN_AH[AuthHandler]
        end

        subgraph authBox["auth-api :3009"]
            AUTH_AH[AuthHandler]
        end

        subgraph otherBox["cloud-api / others"]
            OTHER_AH[AuthHandler]
        end
    end

    subgraph stores[Data stores]
        PG[(Postgres RDS<br/>fabricator pool)]
        RED[(Redis<br/>shared cluster)]
        JSONF[[JSON file<br/>on EC2 local disk]]
    end

    SU -->|HTTPS| ARDA_UI
    ARDA_UI --> ARDA_PROXY
    ARDA_PROXY -->|forward| ADMIN_ROUTES

    IT([Integrator app<br/>Authorization Bearer]) --> AUTH_AH
    IT --> OTHER_AH
    IT --> ADMIN_AH

    ADMIN_ROUTES -->|tx: row + audit| PG
    ADMIN_ROUTES -->|HSET/HDEL| RED
    ADMIN_JOBS -->|read| PG
    ADMIN_JOBS -->|maintain| RED
    ADMIN_JOBS -->|write snapshot| JSONF

    ADMIN_AH -->|HGET| RED
    AUTH_AH -->|HGET| RED
    OTHER_AH -->|HGET| RED

    ADMIN_AH -.->|boot fallback| JSONF
    AUTH_AH -.->|boot fallback| JSONF
    OTHER_AH -.->|boot fallback| JSONF
```

| Service | Role | Storage access |
|---|---|---|
| arda (port 3010) | Dashboard UI at `/system/api-keys` (SuperUser only) | None direct; uses existing `/api/admin/*` proxy |
| admin_api (port 3999) | CRUD endpoints; Postgres reads/writes; Redis HSET/HDEL; periodic JSON export; Redis TTL cleanup | Postgres (fabricator pool) + Redis |
| Every service using `AuthHandler` (auth-api, cloud-api, admin_api itself, etc.) | Per-request Redis HGET on auth path | Redis (read-only) + JSON file (boot fallback) |

No new cross-service dependencies; no new network connections; no new env vars for non-admin services.

### Runtime validation (hot path, post-migration)

```mermaid
sequenceDiagram
    autonumber
    participant C as Integrator client
    participant S as Service<br/>(auth-api / cloud-api / admin_api)
    participant AH as AuthHandler
    participant R as Redis<br/>hash: allowed_apps
    participant UT as UsageTracker<br/>(in-process batch)

    C->>S: HTTP request<br/>Authorization: Bearer <token>
    S->>AH: authorizeHttpRequest(req)
    AH->>AH: token = sanitize(bearer)<br/>hash = sha256(token)
    AH->>R: HGET allowed_apps hash

    alt hash found
        R-->>AH: {uniqueId, name, apis,<br/>disabledAt, isPreviousKey,<br/>rotationPrevValidUntil}
        alt disabledAt set
            AH-->>S: AuthError(401)
            S-->>C: 401
        else isPreviousKey and expired
            AH-->>S: AuthError(401)
            S-->>C: 401
        else apiName not in apis
            AH-->>S: AuthError(403)
            S-->>C: 403
        else all checks pass
            AH-)UT: record(uniqueId, ip)
            AH-->>S: AuthorizedApp(name)
            S-->>C: 200 OK
        end
    else hash not found
        R-->>AH: (nil)
        AH->>AH: sampled audit<br/>(1-in-100 or per-IP bucket)
        AH-->>S: AuthError(401)
        S-->>C: 401
    end
```

Pseudocode equivalent:

```
AuthHandler.authorizeHttpRequest(req):
    rawToken = extract bearer from req
    token = rawToken.replace(/[^a-zA-Z0-9]/g, '')
    hash = sha256(token).toString('hex')

    entry = await RedisClient.hget("allowed_apps", hash)
    if entry is null:
        record sampled-audit on failure (see §Security)
        throw AuthError(401)

    record = JSON.parse(entry)
    if record.disabledAt:
        throw AuthError(401)
    if record.isPreviousKey and Date.now() > record.rotationPrevValidUntil:
        throw AuthError(401)
    if not record.apis.includes(this.apiName):
        throw AuthError(403)

    // Fire-and-forget usage tracking
    pushUsageBatch(record.uniqueId, req.ip)

    return new AuthorizedApp(record.name)
```

The in-process cache from the original SPEC is gone — Redis is the cache. This simplifies `AuthHandler` significantly.

### Boot / fallback

```mermaid
flowchart TD
    Start([Service boot<br/>AuthHandler constructor]) --> Ping{Redis PING<br/>5s timeout}

    Ping -->|success| RedisMode[Mode: Redis primary<br/>validator uses HGET per request]
    Ping -->|failure| Critical[Log CRITICAL<br/>Redis unavailable at boot]

    Critical --> ReadJSON[Read ALLOWED_APPS_DATA_FILE<br/>synchronous require]
    ReadJSON --> BuildMap[Build fallbackMap<br/>keyed by plaintext id]
    BuildMap --> FallbackMode[Mode: JSON fallback<br/>validator uses Map lookup]

    FallbackMode --> BG[Start background<br/>reconnect loop]
    BG --> Retry{Retry Redis PING<br/>every 30s}
    Retry -->|still failing| Retry
    Retry -->|success| Swap[Swap to Redis mode<br/>drop fallbackMap reference]

    RedisMode --> Ready([Service accepts traffic])
    FallbackMode --> Ready
    Swap --> Ready
```

Pseudocode equivalent:

```
AuthHandler constructor:
    apiName = config.apiName
    fallbackMap = null

    try await RedisClient.ping() (5s timeout)
    catch:
        logger.critical("Redis unavailable at boot; falling back to JSON")
        fallbackMap = new Map()
        for entry in require(process.env.ALLOWED_APPS_DATA_FILE):
            fallbackMap.set(entry.id, entry)   // keyed by plaintext id, legacy style
        startBackgroundRedisReconnect()
```

In fallback mode, the validator consults `fallbackMap` with the plaintext id (matching today's behaviour). As soon as Redis is reachable again, the service swaps to the Redis path. This preserves exact parity with current resilience.

### Dashboard write path (admin_api)

Every create / update / rotate / disable / enable / delete follows the same shape: Postgres transaction first, Redis second, pub/sub third.

```mermaid
sequenceDiagram
    autonumber
    actor SU as SuperUser
    participant ARDA as arda<br/>(proxy)
    participant ADMIN as admin_api<br/>handler
    participant PG as Postgres<br/>fabricator pool
    participant R as Redis<br/>allowed_apps
    participant Reconcile as Reconciliation<br/>job (1 min)

    SU->>ARDA: POST /api/admin/api-keys<br/>{name, apis, description}
    ARDA->>ADMIN: POST /admin/api-keys

    ADMIN->>ADMIN: reserved-check:<br/>unlocked OR not reserved?
    alt reserved and not unlocked
        ADMIN-->>SU: 403 reserved_locked
    end

    ADMIN->>ADMIN: generate plaintext<br/>keyHash = sha256<br/>keyPrefix = plaintext[0..8]

    rect rgb(235, 245, 235)
    Note over ADMIN,PG: Single Postgres transaction
    ADMIN->>PG: BEGIN
    ADMIN->>PG: INSERT api_keys (...)
    ADMIN->>PG: INSERT api_keys_audit_log<br/>(event: key_created)
    ADMIN->>PG: COMMIT
    end

    ADMIN->>R: HSET allowed_apps<br/>keyHash → serialized record
    alt Redis write fails
        ADMIN->>ADMIN: log ERROR<br/>enqueue retry
        Reconcile-->>R: (later) HSET on next sweep
    end

    ADMIN->>R: PUBLISH allowed_apps:changed<br/>{type, uniqueId}
    Note over R: optional reactive hook<br/>(no reader today)

    ADMIN-->>ARDA: 200 OK<br/>{uniqueId, ..., plaintextKey}
    ARDA-->>SU: Reveal modal<br/>plaintext shown ONCE
```

Pseudocode equivalent:

```
handler(req):
    if reservedOperation and not unlocked:
        throw 403

    await Postgres.transaction(client =>
        client.query("INSERT/UPDATE/DELETE api_keys ...")
        client.query("INSERT INTO api_keys_audit_log ...")
    )

    // After commit, update Redis cache:
    if create or rotate:
        RedisClient.hset("allowed_apps", newKeyHash, json)
    if rotate:
        RedisClient.hset("allowed_apps", oldKeyHash, jsonWithIsPreviousKey)
    if disable / enable:
        RedisClient.hset("allowed_apps", keyHash, updatedJson)
    if delete:
        RedisClient.hdel("allowed_apps", keyHash, previousKeyHashIfAny)

    RedisClient.publish("allowed_apps:changed", JSON.stringify({type, uniqueId}))
    // Not required by any reader today; hook for future.
```

If the Redis write fails after the Postgres commit: log ERROR, enqueue a retry. Readers will see the stale value for seconds-to-minutes; a reconciliation job (every 1 min) walks Postgres vs Redis and fixes drift. Never a data-loss scenario — Postgres is truth.

### Key rotation with overlap

Rotation must not break in-flight integrations. Both the old and new key hashes live in Redis simultaneously for a configurable window (default 7 days); the old one is flagged `isPreviousKey: true` with an expiry timestamp. A 5-minute sweep job HDELs expired previous-key fields.

```mermaid
sequenceDiagram
    autonumber
    actor SU as SuperUser
    participant ADMIN as admin_api
    participant PG as Postgres
    participant R as Redis
    participant Sweep as Sweep job<br/>(5 min)
    participant Old as Client<br/>(old plaintext)
    participant New as Client<br/>(new plaintext)

    SU->>ADMIN: POST /admin/api-keys/:id/rotate
    ADMIN->>ADMIN: newPlaintext = random<br/>newHash = sha256

    ADMIN->>PG: UPDATE api_keys SET<br/>keyHash=newHash,<br/>previousKeyHash=oldHash,<br/>rotationPrevValidUntil=now+7d
    ADMIN->>PG: INSERT audit (key_rotated)

    ADMIN->>R: HSET allowed_apps<br/>newHash → {isPreviousKey:false}
    ADMIN->>R: HSET allowed_apps<br/>oldHash → {isPreviousKey:true,<br/>rotationPrevValidUntil:T+7d}

    ADMIN-->>SU: {plaintextKey: new}
    Note over SU: Reveal modal — distribute<br/>new plaintext to integrator

    rect rgb(232, 245, 232)
    Note over Old,New: During 7-day overlap window
    Old->>R: HGET oldHash
    R-->>Old: {isPreviousKey:true,<br/>valid until T+7d}
    Note over Old: now < T+7d → ACCEPT
    New->>R: HGET newHash
    R-->>New: {isPreviousKey:false}
    Note over New: → ACCEPT
    end

    Sweep->>R: HSCAN allowed_apps
    Sweep->>Sweep: find fields with<br/>rotationPrevValidUntil < now
    Sweep->>R: HDEL allowed_apps oldHash

    rect rgb(250, 235, 235)
    Note over Old: After sweep
    Old->>R: HGET oldHash
    R-->>Old: (nil)
    Note over Old: → 401
    end
```

### Periodic background jobs on admin_api

| Job | Interval | Purpose |
|---|---|---|
| Redis → JSON export | 5 min | Write current Postgres state to `ALLOWED_APPS_DATA_FILE` atomically (temp file + rename). Keeps boot-fallback snapshot fresh. |
| Redis reconciliation | 1 min | Walk Postgres `api_keys`; confirm every row has a corresponding Redis field and vice-versa. Fix drift. Log at WARN when drift found (surface rare Postgres-commit-but-Redis-fail cases). |
| Previous-key TTL sweep | 5 min | HSCAN `allowed_apps`; for each entry with `isPreviousKey=true` and `rotationPrevValidUntil < now`, HDEL. |
| Usage batch flush | 10 s | Drain in-process usage queue; `UPDATE api_keys SET lastUsedAt = GREATEST(lastUsedAt, $1), usageCount = usageCount + $2, lastUsedIp = $3 WHERE uniqueId = $4`. One UPDATE per touched key per flush cycle. |

---

## Database Schema

```mermaid
erDiagram
    api_keys {
        uuid uniqueId PK
        varchar name UK
        text description
        array apis
        varchar keyHash UK "SHA256 hex"
        varchar keyPrefix "first 8 plaintext chars"
        varchar previousKeyHash "during rotation"
        varchar previousKeyPrefix
        bigint rotationPrevValidUntil
        boolean reserved
        boolean migratedFromFile
        varchar ownerAccountId
        bigint createdAt
        varchar createdByAccountId
        bigint disabledAt
        text disabledReason
        bigint lastUsedAt
        varchar lastUsedIp
        bigint usageCount
        bigint unlockedUntil
        varchar unlockedByAccountId
        text unlockReason
    }
    api_keys_audit_log {
        uuid uniqueId PK
        bigint at
        varchar eventType
        varchar actorAccountId
        uuid keyUniqueId "FK-less, denormalised"
        varchar keyName "denormalised"
        varchar ip
        jsonb details
    }
    api_keys_migration_marker {
        int id PK "always 1"
        bigint completedAt
        bigint startedAt
        varchar sourceFileSha1
        int entryCount
        varchar migratedByAccountId
        varchar lockedByHost
        bigint lockExpiresAt
    }

    api_keys ||..o{ api_keys_audit_log : "referenced by uniqueId<br/>(no FK — survives delete)"
```

Three tables, all in the existing fabricator pool. `api_keys_audit_log` has no foreign key to `api_keys` so that deletion audit events survive the row they describe; the `keyName` and `keyUniqueId` columns are denormalised for traceability.

### Postgres (fabricator pool) — `api_keys`

New migration file `apps/db_setup/api_keys_setup.ts`, mirroring `apps/db_setup/beyond_db_setup.ts` style.

```sql
CREATE TABLE api_keys (
    "uniqueId"              uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    "name"                  VARCHAR(128) NOT NULL UNIQUE,
    "description"           TEXT,
    "apis"                  TEXT[]       NOT NULL,
    "keyHash"               VARCHAR(128) NOT NULL UNIQUE,        -- SHA256 hex of current key
    "keyPrefix"             VARCHAR(16)  NOT NULL,               -- First 8 chars of plaintext, for display
    "previousKeyHash"       VARCHAR(128),                        -- Old key during rotation overlap
    "previousKeyPrefix"     VARCHAR(16),
    "rotationPrevValidUntil" BIGINT,                             -- Unix ms
    "reserved"              BOOLEAN      NOT NULL DEFAULT false,
    "migratedFromFile"      BOOLEAN      NOT NULL DEFAULT false,
    "ownerAccountId"        VARCHAR(64),
    "createdAt"             BIGINT       NOT NULL,
    "createdByAccountId"    VARCHAR(64),
    "disabledAt"            BIGINT,
    "disabledReason"        TEXT,
    "lastUsedAt"            BIGINT,
    "lastUsedIp"            VARCHAR(64),
    "usageCount"            BIGINT       NOT NULL DEFAULT 0,
    "unlockedUntil"         BIGINT,
    "unlockedByAccountId"   VARCHAR(64),
    "unlockReason"          TEXT
);
CREATE UNIQUE INDEX idx_api_keys_key_hash          ON api_keys("keyHash");
CREATE INDEX        idx_api_keys_prev_key_hash     ON api_keys("previousKeyHash")  WHERE "previousKeyHash" IS NOT NULL;
CREATE INDEX        idx_api_keys_disabled          ON api_keys("disabledAt");
CREATE INDEX        idx_api_keys_owner             ON api_keys("ownerAccountId");
CREATE INDEX        idx_api_keys_reserved          ON api_keys("reserved");
```

### Postgres — `api_keys_audit_log`

```sql
CREATE TABLE api_keys_audit_log (
    "uniqueId"       uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
    "at"             BIGINT      NOT NULL,
    "eventType"      VARCHAR(48) NOT NULL,
    "actorAccountId" VARCHAR(64),
    "keyUniqueId"    uuid,                    -- FK-less; survives deletion via denormalised name
    "keyName"        VARCHAR(128),            -- Denormalised copy
    "ip"             VARCHAR(64),
    "details"        jsonb
);
CREATE INDEX idx_api_keys_audit_at  ON api_keys_audit_log("at" DESC);
CREATE INDEX idx_api_keys_audit_key ON api_keys_audit_log("keyUniqueId", "at" DESC);
```

DB grants enforce append-only: admin_api's DB user gets `SELECT, INSERT` only on `api_keys_audit_log` (no `UPDATE, DELETE`). This must be set at provisioning time; document in `DEV_SETUP.md`.

### Postgres — `api_keys_migration_marker`

```sql
CREATE TABLE api_keys_migration_marker (
    "id"                  INTEGER PRIMARY KEY,        -- always 1
    "completedAt"         BIGINT,
    "startedAt"           BIGINT,
    "sourceFileSha1"      VARCHAR(48),
    "entryCount"          INTEGER,
    "migratedByAccountId" VARCHAR(64),
    "lockedByHost"        VARCHAR(128),               -- set while in progress
    "lockExpiresAt"       BIGINT
);
-- Seeded with a single row id=1 on table creation; migration updates this row.
```

### Redis — `allowed_apps`

Single hash, fields keyed by SHA256 hex of the plaintext key.

```
HGET allowed_apps <hash>  →  JSON string
```

JSON shape (exactly what the validator needs; Postgres has more fields):

```ts
{
    uniqueId: string,            // points back to Postgres row
    name: string,
    apis: string[],
    reserved: boolean,
    disabledAt: number | null,
    isPreviousKey: boolean,
    rotationPrevValidUntil: number | null   // only meaningful when isPreviousKey=true
}
```

Dashboard-only fields (owner, description, audit, usage counters) are not in Redis — validator doesn't need them. Keeps Redis value size small.

---

## Reserved Semantics

(Unchanged from the original SPEC.)

`reserved=true` blocks the following without an active unlock: rotate, disable, delete, rename, remove-APIs, toggle-reserved-off. Allowed without unlock: edit description/owner, add APIs.

Unlock ceremony: mandatory reason (min 20 chars), SuperUser password reconfirm, 15-minute window, audited. Manual relock and auto-relock both supported. Destructive operations during unlock each write their own audit row on top of the unlock row.

Enforcement is server-side on every endpoint: `reserved === false OR (unlockedUntil !== null AND unlockedUntil > now)`. The unlock state lives in Postgres (not a cookie) so all SuperUsers see consistent state.

```mermaid
stateDiagram-v2
    [*] --> Unreserved: create with<br/>reserved=false
    [*] --> ReservedLocked: migrated from file<br/>OR create with<br/>reserved=true

    Unreserved --> ReservedLocked: toggle reserved ON<br/>(no ceremony needed)

    ReservedLocked --> UnlockPending: POST /unlock<br/>reason + password

    UnlockPending --> ReservedLocked: password wrong<br/>OR reason < 20 chars
    UnlockPending --> Unlocked: verified<br/>unlockedUntil = now + 15 min

    Unlocked --> Unlocked: destructive op<br/>(rotate / disable / delete / rename /<br/>remove-APIs)<br/>extra audit row per op
    Unlocked --> ReservedLocked: auto-relock<br/>(unlockedUntil expired)
    Unlocked --> ReservedLocked: POST /relock<br/>(manual)
    Unlocked --> Unreserved: POST /reserved toggle OFF<br/>(finishes the unlock)

    ReservedLocked --> [*]: delete — BLOCKED<br/>(403 reserved_locked)
    Unreserved --> [*]: delete — OK
    Unlocked --> [*]: delete — OK during window

    note right of UnlockPending
      Ceremony requires mandatory
      reason (≥20 chars) and
      SuperUser password reconfirm.
      Every transition is audited.
    end note
```

---

## Migration Plan

### Trigger

On admin_api boot, before `api.listen`. Blocking.

### Algorithm

```mermaid
flowchart TD
    Start([admin_api boot]) --> ReadMarker[SELECT<br/>api_keys_migration_marker<br/>WHERE id = 1]
    ReadMarker --> MarkerSet{completedAt<br/>set?}
    MarkerSet -->|yes| Done([Skip —<br/>migration complete])
    MarkerSet -->|no| TryLock[UPDATE marker<br/>SET lockedByHost, lockExpiresAt<br/>WHERE lockExpiresAt IS NULL<br/>OR lockExpiresAt < now]

    TryLock --> LockAcquired{rowcount<br/>= 1?}
    LockAcquired -->|no| Wait[Wait 10s]
    Wait --> ReadMarker
    LockAcquired -->|yes| ReadJSON[Read<br/>ALLOWED_APPS_DATA_FILE<br/>sha1 = SHA1 of contents]

    ReadJSON --> DedupeScan[Scan JSON for<br/>entries sharing same id<br/>plan merges up front]
    DedupeScan --> Loop[For each entry or merged-group]

    Loop --> HashIt[keyHash = sha256 entry.id]
    HashIt --> ExistsQ{Row exists<br/>WHERE keyHash = ?}
    ExistsQ -->|yes| NextEntry[Skip<br/>idempotent]
    ExistsQ -->|no| IsMerge{Merged<br/>group?}

    IsMerge -->|yes| InsMerge["INSERT api_keys row<br/>name = &quot;A+B&quot;<br/>apis = union<br/>reserved = true"]
    IsMerge -->|no| InsSingle[INSERT api_keys row<br/>name = entry.name<br/>apis = entry.apis<br/>reserved = true<br/>migratedFromFile = true]

    InsMerge --> AuditMerge[INSERT audit<br/>eventType = migrated_from_file<br/>details.merged = true<br/>details.originalNames list]
    InsSingle --> AuditSingle[INSERT audit<br/>eventType = migrated_from_file]
    AuditMerge --> RedisSet[HSET allowed_apps<br/>keyHash -> serialized record]
    AuditSingle --> RedisSet
    RedisSet --> NextEntry
    NextEntry --> MoreEntries{More<br/>entries?}
    MoreEntries -->|yes| Loop
    MoreEntries -->|no| Finish[UPDATE marker SET<br/>completedAt = now<br/>sourceFileSha1 = sha1<br/>entryCount = n<br/>lockedByHost = NULL<br/>lockExpiresAt = NULL]
    Finish --> Success([Migration complete])
```

Pseudocode equivalent:

```
1. Read api_keys_migration_marker row (seeded at table creation with id=1, completedAt NULL).
2. If completedAt is set: log "migration already complete", exit.
3. Acquire lock:
    a. If lockExpiresAt > now: another admin_api is migrating. Wait+retry, capped at 60s.
    b. Else: UPDATE marker SET lockedByHost = hostname(), lockExpiresAt = now + 5min WHERE id=1
       AND (lockExpiresAt IS NULL OR lockExpiresAt <= now)
       (optimistic lock; rowcount 0 means someone else took it)
4. Read JSON file at ALLOWED_APPS_DATA_FILE; compute SHA1 of contents.
5. For each entry in the JSON:
    a. keyHash = sha256(entry.id)
    b. Check if a Postgres row with this keyHash already exists (defensive; guards against
       partial-crash reruns after a lock expiry).
    c. If not, INSERT:
         uniqueId = uuid(), name = entry.name, apis = entry.apis, keyHash,
         keyPrefix = entry.id.slice(0,8), reserved = true, migratedFromFile = true,
         createdAt = now(), ownerAccountId = null, createdByAccountId = null
    d. Write audit row migrated_from_file.
    e. Also HSET to Redis allowed_apps hash.
6. Handle duplicate-id case (tests/allowed_apps.json has this):
    - If two JSON entries have the same id, merge: one Postgres row with name =
      "nameA+nameB", apis = union of both, audit row flagged explicitly with both
      original names in details.jsonb.
7. UPDATE marker SET completedAt = now, sourceFileSha1 = <computed>, entryCount = <n>,
   lockedByHost = NULL, lockExpiresAt = NULL WHERE id=1.
```

Idempotency: steps 5c and 6 check-and-insert; rerunning doesn't duplicate.

### Dealing with the test file's duplicate id

`tests/allowed_apps.json` entries 6 and 7 (`accounts-integration-tests` and `admin-integration-tests`) share the id `f4czwPZdYidGTcWRbhcWAcFiTaDYhYDraDsjYLYO2YXd4LxQqpMpbcdmyxZEBKNZ`. Current runtime behaviour: `Array.find` returns the first one (`accounts-integration-tests`); the second one never matches. The migration resolves this by merging — one row, combined apis list, audit trail explains the resolution. If this was accidental, the fix is effectively retroactive; if intentional, the merge preserves behaviour.

Release notes should call this out explicitly for reviewers.

### JSON file after migration

- NOT deleted automatically. The 5-minute export job regenerates it continuously from Postgres; boot-fallback still depends on its presence.
- After ~3 months of stable operation, the file can be deleted by an operator. The fallback path becomes dead code and a later phase can remove it.

### Tests keep using the JSON

`tests/allowed_apps.json` is not migrated. Tests set `API_KEYS_SOURCE=file` (new env var), which forces `AuthHandler` into fallback mode from the start, bypassing Redis. This preserves exact test-time behaviour.

---

## API Endpoints (admin_api)

All SuperUser-only via existing `AuthApi.getAccessPolicyHandler([AuthSchemas.AccessPolicy.SuperUser])`.

| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/admin/api-keys` | List all keys (no keyHash in response; prefix + metadata only) |
| POST | `/admin/api-keys` | Create key; returns plaintext ONCE in response body |
| GET | `/admin/api-keys/:uniqueId` | Fetch single key metadata |
| PUT | `/admin/api-keys/:uniqueId` | Update description, owner, apis (subject to reserved rules) |
| POST | `/admin/api-keys/:uniqueId/rotate` | Generate new plaintext; set previousKeyHash; return plaintext ONCE |
| POST | `/admin/api-keys/:uniqueId/disable` | Set disabledAt; HSET updated Redis entry |
| POST | `/admin/api-keys/:uniqueId/enable` | Clear disabledAt; HSET updated Redis entry |
| DELETE | `/admin/api-keys/:uniqueId` | Hard delete (requires unlock if reserved) |
| POST | `/admin/api-keys/:uniqueId/unlock` | Start unlock ceremony (reason + password reconfirm) |
| POST | `/admin/api-keys/:uniqueId/relock` | Cancel unlock immediately |
| POST | `/admin/api-keys/:uniqueId/reserved` | Toggle reserved (requires unlock to turn off) |
| GET | `/admin/api-keys/:uniqueId/audit` | Paged audit log for this key |
| GET | `/admin/api-keys/audit` | Global audit log (paged, filterable by eventType / actor / date) |
| POST | `/admin/api-keys/force-migrate` | Re-run migration (ignores marker); ceremony + confirmation required; only usable when DB is manually cleared |

Response shape for list/detail: full Postgres row MINUS `keyHash` and `previousKeyHash` (never returned to any client, including SuperUsers).

Response shape for create/rotate: Postgres row + `plaintextKey: string` shown ONCE. Plaintext is only present in the immediate HTTP response; never stored server-side beyond the transient handler scope.

---

## UI Changes (arda)

(Unchanged from the original SPEC — recap:)

- New top-level page: `/system/api-keys` (SuperUser-gated via existing `ArdaWrapper` access-policy check).
- List view with filters (by API, by reserved, by disabled, free-text name search).
- Create modal → reveal modal (one-time plaintext with copy-to-clipboard + ack checkbox).
- Detail page with Details / Audit / Usage tabs. Live countdown when unlocked.
- Unlock modal: password reconfirm + mandatory reason text (min 20 chars).
- Menu entry in `ArdaWrapper.jsx` under a new "System" category, SuperUser-only.

No changes in web-framework layer. Uses existing Semantic UI components (matches `Login.jsx`, `ArdaWrapper.jsx` style). Calls admin_api via the existing `/api/admin/*` proxy at `webapps/arda/arda.js:69-74` — no new arda-server routes.

---

## Files to Create

| File | Purpose | Approx LOC |
|---|---|---|
| `auth/ApiKeysRepository.ts` | Postgres CRUD on `api_keys` + audit; returns POJOs | 450 |
| `auth/ApiKeysCache.ts` | Redis read helpers (HGET); Redis write helpers (HSET/HDEL); fallback-to-JSON loader on Redis outage | 220 |
| `auth/UsageTracker.ts` | In-process usage batch queue; periodic flush via Postgres UPDATE | 120 |
| `api/src/admin/ApiKeysApi.ts` | Admin_api handlers for `/admin/api-keys/*` | 520 |
| `apps/db_setup/api_keys_setup.ts` | Postgres migration (api_keys + audit + marker tables, grants) | 130 |
| `apps/admin_api/bootstrap/migrateAllowedApps.ts` | Boot-time migration from JSON file to Postgres + Redis | 220 |
| `apps/admin_api/bootstrap/exportAllowedApps.ts` | Periodic JSON snapshot regeneration from Postgres | 110 |
| `apps/admin_api/bootstrap/reconcileRedisCache.ts` | Periodic Postgres↔Redis drift check + repair | 140 |
| `apps/admin_api/bootstrap/sweepRotationOverlap.ts` | Periodic Redis HSCAN + HDEL of expired previous-key fields | 80 |
| `webapps/src/components/System/ApiKeys/ApiKeysList.jsx` | Dashboard list view | 250 |
| `webapps/src/components/System/ApiKeys/ApiKeyDetail.jsx` | Detail page with tabs | 300 |
| `webapps/src/components/System/ApiKeys/ApiKeyEditor.jsx` | Create/edit form | 250 |
| `webapps/src/components/System/ApiKeys/KeyRevealModal.jsx` | One-time plaintext display | 80 |
| `webapps/src/components/System/ApiKeys/UnlockCeremonyModal.jsx` | Reason + password reconfirm | 120 |
| `webapps/src/components/System/ApiKeys/ApiKeyAuditLog.jsx` | Audit log viewer | 150 |
| `tests/auth/ApiKeysCache.spec.ts` | Unit tests for Redis read/write + fallback | 350 |
| `tests/api/ApiKeysApi.spec.ts` | Integration tests for admin endpoints (CRUD, rotate, unlock) | 450 |
| `tests/auth/Migration.spec.ts` | Migration tests (empty DB, rerun, dup-id merge, lock contention, partial failure) | 280 |
| `docs/api-keys-dashboard.md` | Operator doc (onboarding, recovery, disable-glass procedures) | 300 |

## Files to Modify

| File | Change |
|---|---|
| `auth/Auth.ts` | Rewrite `AuthHandler` constructor (lines 332-350): drop synchronous `require()` of JSON file; instead construct an `ApiKeysCache` that pings Redis, populates a fallback map from JSON on failure. Rewrite `authorizeHttpRequest` (lines 362-406): replace linear `find` with `ApiKeysCache.lookupByHash(sha256(token))`; handle `isPreviousKey` + `rotationPrevValidUntil`; fire `UsageTracker.record(...)` on success. |
| `auth/index.ts` | Export `ApiKeysCache`, `ApiKeysRepository`, `UsageTracker`. |
| `auth/AuthDatabase.ts` | No changes. Document in header that `api_keys` lives in Postgres (not Firestore like other auth tables) because it's admin_api-owned. |
| `apps/admin_api/admin_api.ts` | Before `api.listen` (or equivalent): await `migrateAllowedApps()`. Register `/admin/api-keys/*` routes. Start the periodic jobs (export / reconcile / sweep / usage-flush). |
| `webapps/arda/arda.js` | No code changes; `/api/admin/*` proxy already forwards to admin_api. |
| `webapps/arda/app/App.jsx` | Add routes: `/system/api-keys`, `/system/api-keys/:uniqueId`. |
| `webapps/arda/app/ArdaWrapper.jsx` (lines 81-93) | Add "System" menu category with "API Keys" entry, SuperUser-gated. |
| `.env` sample / `DEV_SETUP.md` | New env vars: `API_KEYS_SOURCE=redis\|file` (default `file` during rollout; flip to `redis` per-service after Phase 3), `API_KEYS_REDIS_KEY` (default `allowed_apps`), `API_KEYS_JSON_EXPORT_INTERVAL_MS` (default 300000), `API_KEYS_USAGE_FLUSH_INTERVAL_MS` (default 10000), `API_KEYS_ROTATION_OVERLAP_DAYS` (default 7). Document that admin_api's Postgres user needs `SELECT,INSERT,UPDATE,DELETE` on `api_keys`, `api_keys_migration_marker`, and `SELECT,INSERT` only on `api_keys_audit_log`. |

## Files NOT to Modify

- `tests/allowed_apps.json` — stays for test runs (tests use `API_KEYS_SOURCE=file`).
- Test files that hard-code integration-test keys — unaffected because tests run in file-source mode.
- `auth/Auth.ts:40-44` (`AllowedApp` type) — kept for backward compatibility with any external importer.

---

## Security Requirements

All P0.

- **Plaintext never stored at rest.** DB and Redis both hold `SHA256(key)` only. Plaintext returned exactly once at creation/rotation, never persisted server-side.
- **Plaintext never logged.** The existing `Logger.error` at `auth/Auth.ts:400` prints the sanitized token on auth failure — must be reduced to `substring(0,8) + "..."` before rollout. Log aggregation must never see full keys.
- **Failed-auth audit sampling.** Writing an audit row for every failed probe is a DoS vector on the audit table. Rate: sample 1-in-100 OR aggregate by IP on a rolling 1-minute window; whichever is smaller. Successful auths are counted in-batch via `UsageTracker`, not individually audited.
- **SuperUser enforced on every admin endpoint.** Middleware + server-side recheck. Client-side role checks are UX only.
- **One-time reveal is strictly one-time.** Plaintext is not written to localStorage, not in URLs, not in server logs, not in Redis (only the hash), and the HTTP response body is the only place it appears. The front-end should zero the variable after display.
- **Unlock password reconfirm** — new endpoint `POST /auth/verify-password` on auth-api or an equivalent check via existing `/auth/login` machinery (returns yes/no without issuing new tokens). Without this, "walked away from desk" → trivial unlock bypass.
- **Reserved enforcement on the server side, not just UI.** Every destructive endpoint verifies `reserved === false OR (unlockedUntil && unlockedUntil > now)`. Client button disables are cosmetic.
- **Rate limiting on `/admin/api-keys/*`.** Reuse whatever solution plans 14's `apps/api/api.ts:93-94` rate-limiter question settles on. Key-rotation loops could otherwise produce a floor of plaintext in HTTP logs.
- **Rotation window default 7 days, max 30.** Configurable per rotation call. After expiry, the previous key is unreachable via Redis and validation returns 401.
- **Admin_api self-bootstrap.** admin_api itself needs a valid API key to start serving requests, but it also hosts the migration. Bootstrap order: (1) load JSON file into in-memory fallback for auth, (2) start serving, (3) run migration, (4) switch to Redis-primary. This prevents a chicken-and-egg if the JSON file is the only surviving copy at first boot post-migration.
- **Audit log append-only via DB grants.** admin_api's Postgres role has `SELECT,INSERT` only on `api_keys_audit_log`. Enforces immutability even against bugs in the application layer.
- **Redis ACL.** The shared Redis cluster is already trusted at the network layer; `allowed_apps` doesn't need a special ACL beyond existing setup.
- **Postgres access limited to admin_api.** Other services don't connect; RDS security group remains tight.
- **Drift between Postgres and Redis.** Reconciliation job at 1-min interval detects + repairs. If Redis write fails after Postgres commit, the affected key may be temporarily unreachable until reconcile catches up — acceptable, never wrong data.
- **Deletion audit survives the row.** Audit log denormalises `keyName`; deleting a key doesn't cascade-delete audit history.

---

## Rollout Phases

```mermaid
flowchart LR
    P1[Phase 1<br/>Schema + CRUD + Dashboard UI]
    P2[Phase 2<br/>Migration from JSON<br/>all entries reserved=true]
    P3[Phase 3<br/>Redis cache +<br/>service flips]
    P4[Phase 4<br/>Dashboard-only writes]
    P5[Phase 5<br/>Usage tracking +<br/>UX polish]
    P6[Phase 6 optional<br/>Retire JSON file]

    P1 --> P2 --> P3 --> P4 --> P5 -.-> P6

    subgraph flipOrder[Phase 3 — service flips, one at a time]
      direction LR
      F1[Flip cloud-api<br/>API_KEYS_SOURCE=redis] --> F2[Observe 24h]
      F2 --> F3[Flip auth-api]
      F3 --> F4[Observe 24h]
      F4 --> F5[Flip admin_api last]
    end

    P3 --- flipOrder
```

Each phase lands independently. First-party auth behaviour unchanged throughout.

### Phase 1 — Schema + CRUD + audit (no runtime change)

- Postgres migrations (`api_keys_setup.ts`, `api_keys_audit_log`, `api_keys_migration_marker`) in dev + staging + prod
- `ApiKeysRepository.ts` CRUD
- Admin_api `/admin/api-keys/*` endpoints (SuperUser)
- Arda dashboard UI: list, detail, create, rotate, disable, audit
- New keys created in the dashboard are written to BOTH Postgres AND appended to the JSON file (via admin_api regenerating and atomically renaming the file)
- Services still load JSON at boot; no hot path changes

**Deliverable:** dashboard usable for creating + viewing keys; existing services see newly-created keys via JSON refresh on next restart.

### Phase 2 — Migration

- Run `migrateAllowedApps()` at admin_api boot (or force-migrate endpoint)
- Verify Postgres row count == JSON entry count; `keyHash` of each row matches SHA256 of each JSON id
- Every migrated row `reserved=true`
- JSON file remains authoritative at runtime for other services

**Deliverable:** Postgres mirrors the JSON; marker row set.

### Phase 3 — Redis cache + services flip

- Admin_api populates Redis `allowed_apps` hash from Postgres at boot and on every write
- `AuthHandler` rewrite deployed to all services; default `API_KEYS_SOURCE=file` preserves current behaviour
- Flip services one at a time to `API_KEYS_SOURCE=redis` + rolling restart. Order: cloud-api first (smallest blast radius), then auth-api, then admin_api last
- Each flip is observed for 24 h before the next
- JSON fallback stays in place; periodic export keeps it fresh

**Deliverable:** all services use Redis primary, Postgres as source of truth, JSON as boot fallback.

### Phase 4 — Dashboard-only writes

- Dashboard stops writing to JSON on create/rotate (Postgres + Redis only)
- Periodic export keeps JSON read-only from Postgres
- Documented: "Do not edit allowed_apps.json manually"

**Deliverable:** single edit surface is the dashboard.

### Phase 5 — UX polish + usage tracking

- Usage counters, last-used timestamps, sparkline charts on detail page
- Reserved unlock ceremony fully wired including password reconfirm
- Bulk operations on the list view

**Deliverable:** observability and safety rails complete.

### Phase 6 (optional) — Retire the JSON file

- Confirm ~3 months of stable Redis operation
- Remove the boot fallback code path from `AuthHandler`
- Operators delete the JSON file from EC2

**Deliverable:** JSON file removed.

---

## Verification

Run the existing auth + admin_api test suites after every phase to confirm no regression.

### Phase 1 tests (`tests/api/ApiKeysApi.spec.ts`)

- Create key → plaintext returned ONCE; follow-up GET excludes plaintext; `keyHash === SHA256(plaintext)`
- Duplicate name → 400
- Rotate → new plaintext returned; `previousKeyHash` set; `rotationPrevValidUntil` set; both hashes validate during overlap window (via direct `authorizeHttpRequest` call against the handler)
- Disable → next auth request → 401; audit row written
- Enable → key accepts again
- Delete non-reserved → row removed; Redis field HDEL'd
- Non-SuperUser → 403 on all endpoints

### Phase 2 tests (`tests/auth/Migration.spec.ts`)

- Empty DB + JSON with 7 entries → migration creates 6 rows (two IDs merge) with `reserved=true`, `migratedFromFile=true`
- Re-run with marker set → exits no-op; idempotency verified
- Duplicate-id case → merged row; audit details.jsonb names both original entries
- Partial-failure sim (abort mid-migration) → marker `completedAt` NULL; next boot retries; no duplicate rows created
- Lock contention sim (two admin_api instances start simultaneously) → one wins; other waits+retries; no double-write

### Phase 3 tests (`tests/auth/ApiKeysCache.spec.ts`)

- `AuthHandler` with `API_KEYS_SOURCE=redis` + Redis populated → validates via `HGET allowed_apps <hash>`
- Update in admin_api → Redis HSET → validator sees new state on next request (no cache lag because Redis is the cache)
- Disable → validator rejects on next request
- Rotation overlap: old key valid until `rotationPrevValidUntil`; after, rejected
- Redis unreachable at boot → `AuthHandler` falls back to JSON; CRITICAL log emitted; background reconnect attempts
- Redis drops mid-life → current request completes (already in Redis); subsequent requests eat ~1s timeouts until reconnect; log noise but no data corruption

### Reserved unlock tests

- Destructive op on reserved key without unlock → 403 `reserved_locked`
- Unlock → destructive op allowed; auto-relocks 15 min later
- Destructive op 16 min post-unlock → 403 again
- Manual relock → destructive op rejected immediately
- Unlock with wrong password → 401
- Two concurrent unlock attempts → both succeed and extend window to max of both; tested for race safety

### Security-critical tests

- List response has no `keyHash` or `previousKeyHash` field for any row
- Detail response has no `keyHash`
- Create/rotate response HAS `plaintextKey`; immediate follow-up GET to the same row does NOT
- 1000 failed-auth requests from same IP → audit table has ~10 rows (sampled), not 1000
- Admin_api with deleted audit row via direct SQL → `DELETE` fails with permission error (grant verification)

### Drift / reconciliation

- Manually `HDEL` a known-valid Redis field → next reconciliation run restores it within 60s
- Manually add a bogus Redis field not in Postgres → next reconciliation HDELs it
- Rotation previous key: manually set `rotationPrevValidUntil` to past → sweep job HDELs the previous-key field within 5 min

### End-to-end acceptance (end of Phase 3)

- Create a new "e2e-test" key via dashboard with `apis=["admin-api"]`
- Make an authenticated admin_api call with `Authorization: Bearer <plaintext>` → 200
- Rotate the key → old key still works for 7 days; new key works immediately
- Disable → both keys → 401
- Enable → both keys → 200 again (within rotation window)
- Delete (after unlocking if reserved) → both keys → 401, permanent

---

## Operational Notes

### Onboarding a new service

Existing pattern unchanged: call `Auth.AuthHandler.createAuthHandler({ apiName: "new-api" })`. Service comes up; in Redis mode, uses HGET against the shared hash. Needs no Postgres access.

To grant a new service its keys: SuperUser creates a key in the dashboard with `apis: ["new-api"]` (and whatever else). Distributes the plaintext via secure channel (1Password, encrypted email, in-person).

### Breaking glass: Redis completely down

- In-process validators return 401 for all bearer checks until fallback kicks in (~5s ping timeout at boot).
- Services that were already up continue with their fallback map (loaded at boot).
- New service instances that boot during the outage fall back to JSON file directly; operate normally.
- Dashboard (Postgres-only) still works; operators can still see keys but mutations to Redis fail (retry queue builds).
- Recovery: Redis comes back; reconciliation job re-populates Redis from Postgres within one minute.

### Breaking glass: Postgres down (RDS outage)

- Redis is already populated; all services continue validating normally.
- Dashboard CRUD fails (writes hit Postgres first).
- Existing in-flight keys work fine.
- Recovery: Postgres returns; dashboard works again.

### Breaking glass: key accidentally deleted

- Audit log shows who deleted when.
- If an export snapshot from before the deletion exists on disk: extract the plaintext from... wait, the snapshot only contains hashes post-Phase-4. Before Phase 4 the snapshot still has plaintexts (since it's a regeneration of the original file format).
- Actually: the export format becomes a concern. **Decision**: the JSON export from Postgres uses the same legacy format (`{ name, apis, id }` with plaintext id) ONLY IF Phase 6 hasn't retired the fallback. To support plaintext in the export, the Postgres row would need to store it — **which it doesn't**. Therefore after migration, the JSON export format has to change, and boot fallback has to handle hashed format.

Correction to the design: post-migration, the JSON file format changes to include `keyHash` instead of `id`. The boot fallback in `AuthHandler` handles both formats (legacy plaintext for services that haven't upgraded; new hashed format post-upgrade). This is a small wrinkle worth documenting.

- **Consequence**: recovery of a deleted key by plaintext from the JSON snapshot is not possible post-migration. If a key is accidentally deleted, recovery requires the integrator to have the plaintext on their side.
- **No automatic undo.** Deletion is audited but irreversible.

### Migrating to multi-region / DR

The pattern works against any shared Redis cluster. Moving to a different RDS instance: standard Postgres replication. Out of scope for this plan.

---

## Open Questions to Resolve During Implementation

- **`POST /auth/verify-password` for unlock.** Does any existing endpoint do this without issuing a new token? If not, add it on auth-api in Phase 5.
- **Duplicate-id in `tests/allowed_apps.json`.** Confirm with the original author whether this is intentional. Release notes should call it out either way.
- **Reserved flag default for migrated entries.** SPEC defaults to `true`. Alternative: bootstrap a classification manifest (integration-test keys → `false`, production keys → `true`) before Phase 2. Cheaper ops; requires someone to classify existing entries.
- **JSON export format post-migration.** The current format has plaintext ids; the migrated export needs to have hashed keys (because plaintexts aren't stored). Boot fallback code needs to read both formats. Confirm the transition strategy.
- **Rate limiter.** Same question as plans 14: the limiter commented out at `apps/api/api.ts:93-94` needs to be resolved (enabled or replaced) before dashboard writes are rate-limited.
- **Postgres role privileges.** admin_api's DB user currently has full CRUD; the audit-log append-only restriction requires a separate role with reduced grants, or a dedicated "audit writer" role. Worth confirming the deployment can provision this.
- **Redis cluster topology.** Is Redis run as a single instance, a replica set, or a sharded cluster today? The design assumes a single logical hash `allowed_apps` accessible by HGET from every service. If the cluster is sharded, all operations on this hash hit the same shard — not a scaling concern at Bigscreen's size (hundreds of keys), but worth confirming.
- **Interaction with plan 14 (OAuth).** Unchanged: OAuth clients and `api_keys` are parallel trust systems. OAuth clients have their own Postgres table (`oauth_clients`); regular API keys live in `api_keys`. The dashboard navigation should group them under a common "System" area but keep them as distinct tables and distinct admin pages.
