# 16 - TRUST_PROXY Rollout and Rate Limiter Re-enablement

**Status: NOT STARTED**

## Summary

Configure Express's `trust proxy` setting in production so `req.ip` returns the real client IP instead of the Nginx loopback address. This unblocks (a) re-enabling rate limiting on `apps/api`, (b) making the existing rate limiter in `cloud/cloud_api` bucket per client IP instead of collapsing all traffic into one bucket keyed on a proxy address, and (c) migrating the direct X-Forwarded-For readers in `auth/`, `api/src/Analytics.ts`, and the OAuth modules to a trust-proxy-respecting source.

The `TRUST_PROXY` env var and the Express `trust proxy` wiring were added in commit `66c8521b` ("Redis changes for rate limiter etc.") but have never been set in any deployed environment.

## Current State

### What's already in the code

- `apps/api/api.ts:21-34` — canonical trust-proxy contract comment; reads `TRUST_PROXY` env var and calls `api.set("trust proxy", ...)`.
- `apps/admin_api/admin_api.ts:29-34` — same pattern.
- Also wired in `cloud/cloud_api/cloud_api.ts` and `cloud/src/CloudAdminApi.ts` (per the commit).
- `api/src/RateLimiter.ts:102-126` — documents the policy: read `req.ip` only, never raw `x-forwarded-for`.
- `TRUST_PROXY` env var is not set on any fleet today.
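
A minimal sketch of what the env-var wiring might look like, for readers who have not opened `apps/api/api.ts`. The helper names `parseTrustProxy` and `applyTrustProxy` and the `AppLike` type are hypothetical, and the actual parsing in the repo may differ. The one detail worth noting is that Express treats a number as a hop count and a string as a list of addresses or subnets, so the string `"2"` from the environment has to be converted to the number `2`.

```typescript
// Hypothetical helpers; the real code in apps/api/api.ts may parse differently.
type TrustProxySetting = boolean | number | string;

// Convert the raw TRUST_PROXY string into a value Express understands.
// Returns undefined when the variable is unset or blank, meaning
// "leave Express's default (trust nothing) in place".
function parseTrustProxy(raw: string | undefined): TrustProxySetting | undefined {
  if (raw === undefined) return undefined;
  const value = raw.trim();
  if (value === "") return undefined;
  if (value === "true") return true;
  if (value === "false") return false;
  // A bare integer is a hop count and must be passed as a number.
  if (/^\d+$/.test(value)) return Number(value);
  // Anything else is passed through as an address/subnet list,
  // e.g. "loopback, 172.31.0.0/16".
  return value;
}

// Structural stand-in for an Express app so the sketch has no dependencies.
interface AppLike {
  set(name: string, value: unknown): unknown;
}

function applyTrustProxy(app: AppLike, raw: string | undefined): void {
  const setting = parseTrustProxy(raw);
  if (setting !== undefined) {
    app.set("trust proxy", setting);
  }
}
```

In a real service the second argument would be `process.env.TRUST_PROXY`.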

### Rate limiter status

- `apps/api/api.ts:16,121,124`: imports and middleware commented out. **Disabled.**
- `apps/admin_api/admin_api.ts`: no rate limiter imports. **Not applied.**
- `cloud/cloud_api/cloud_api.ts:14,387,394,400,613`: active on public endpoints only. **Enabled but broken.** Every request lands in one shared bucket keyed on a proxy address: with `trust proxy` unset, `req.ip` is the Nginx loopback, and the `req.socket.remoteAddress` fallback in `getClientIp` resolves to that same peer address. Result: 100 req/min shared across all users.

### Direct X-Forwarded-For readers (bypass trust-proxy policy)

- `auth/AuthApi.ts:48, 79, 317, 370, 600` — five call sites
- `api/src/Analytics.ts:67`
- `api/src/OAuthApi.ts:185`, `api/src/OAuthClientsApi.ts:687`
- `website/src/api.js:175`
- `cloud/src/CloudApi.ts:2918, 2964` — rate-limit bucketing in two places

These all use the pattern `forwardedRealIp || req.get("x-forwarded-for") || req.ip`. They work today (the ALB populates the header), but they're attacker-spoofable on any endpoint reachable without going through the ALB.

### Topology (per `devops/SERVER_TOPOLOGY_SPEC.md`)

All fleet services share the same chain: **Client → ALB → Nginx → Node.js**. Two trusted hops in front of Node for every service (API, Admin API, Cloud API, Cloud WebSocket, WebApps).

### Diagnostic endpoint

`apps/api/api.ts:184` — `/auth/whoami` has been extended to return a structured `chain` object with `reqIp`, `reqIps`, `socket`, `xForwardedFor`, and `forwardedRealIp`. Confirmed working on dev — returns a populated `ip` field after adding `AuthApi.requiresValidAccessToken` middleware.

## Recommended value: `TRUST_PROXY=2`

Two hops (Nginx + ALB), both in the VPC. Numeric over CIDR (`loopback, 172.31.0.0/16`) because it's explicit and fails loudly if the topology ever changes — silent misbehavior from a CIDR-based policy is harder to diagnose than a broken rate limiter.
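
To make the hop-count semantics concrete, here is a small model of how a numeric `trust proxy` value selects the client address. It is a simplified re-implementation for illustration, not Express's own code; `resolveClientIp` is a hypothetical name and the addresses in the usage note are placeholders.

```typescript
// Simplified model of hop-count trust: walk from the socket peer outward
// and skip `trustedHops` addresses; the next one is treated as the client.
function resolveClientIp(
  socketAddr: string,
  xForwardedFor: string | undefined,
  trustedHops: number,
): string {
  const forwarded = (xForwardedFor ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "");
  // Nearest hop first: the socket peer, then header entries right to left.
  const nearestFirst = [socketAddr, ...forwarded.reverse()];
  // If the chain is shorter than the hop count, the farthest address wins.
  const index = Math.min(trustedHops, nearestFirst.length - 1);
  return nearestFirst[index] ?? socketAddr;
}
```

With a socket peer of `127.0.0.1` and a header of `203.0.113.7, 172.31.5.9`, a hop count of 2 skips Nginx and the ALB and yields `203.0.113.7`, while a hop count of 0 yields `127.0.0.1`. An entry a client prepends to the header sits to the left of the real address and is never reached when the chain really has two trusted hops.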

## Rollout Plan

### Phase 1: Baseline

1. Hit `/auth/whoami` on every running fleet (at minimum `main-rice`, `dev-gem`) and record the `chain` object.
2. Expected (no TRUST_PROXY set): `chain.reqIps = []`, `chain.reqIp = "127.0.0.1"`, `chain.xForwardedFor` has the real chain.
3. Store outputs alongside this plan for before/after comparison.

### Phase 2: Dev validation

1. Set `TRUST_PROXY=2` on `dev-gem` (all five services — `api`, `admin_api`, `cloud_api`, `cloud_ws`, `webapps`).
2. Redeploy via Jenkins 003 pipeline.
3. Hit `/auth/whoami` on `dev-gem`.
4. Expected: `chain.reqIps = ["<client-ip>", "<alb-vpc-ip>"]` (2 entries), `chain.reqIp = "<client-ip>"`.
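5. If `chain.reqIps` has 1 entry: one hop is not appending to X-Forwarded-For. If the entry is the ALB's VPC IP, the ALB is not adding the client address (investigate ALB config); if it is your own IP, Nginx is not appending. Note that Express truncates `req.ips` at the trusted hop count, so an extra proxy cannot appear as a third `reqIps` entry. It shows up instead as `chain.reqIp` being a proxy address and `chain.xForwardedFor` carrying 3+ entries. Investigate in that case.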
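
The Phase 2 checks can be expressed as a small classifier over the `/auth/whoami` output. This is a sketch only: `classifyChain` and the verdict names are hypothetical, and it assumes `chain.xForwardedFor` is the raw comma-separated header string. Because `req.ips` never holds more entries than the trusted hop count, the extra-proxy case is detected by counting raw header entries.

```typescript
// Subset of the `chain` object; the field types are assumptions.
interface ProxyChain {
  reqIp: string;
  reqIps: string[];
  xForwardedFor?: string;
}

type ChainVerdict =
  | "unset-or-direct" // trust proxy not applied, or request bypassed the proxies
  | "ok"
  | "missing-hop"
  | "extra-hop"
  | "unexpected";

function classifyChain(chain: ProxyChain, callerIp: string): ChainVerdict {
  const headerEntries = (chain.xForwardedFor ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "").length;

  if (chain.reqIps.length === 0) return "unset-or-direct";
  // More header entries than trusted hops means something else appended.
  if (headerEntries > 2) return "extra-hop";
  if (chain.reqIps.length === 1) return "missing-hop";
  if (chain.reqIps.length === 2 && chain.reqIp === callerIp) return "ok";
  return "unexpected";
}
```

The caller passes its own public IP as `callerIp`, so the check also catches the case where the counts look right but `reqIp` is not the client.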

### Phase 3: Production rollout

One fleet at a time, not parallel:

1. `main-kiwi` first (smallest blast radius — Arda/Admin API only).
2. Then `main-ocean`.
3. Then `main-rice` (primary production).
4. Verify `/auth/whoami` after each. Watch Discord/error logs for 24h before advancing.

### Phase 4: Fix the cloud_api rate limiter

With `TRUST_PROXY=2` live, `req.ip` in `cloud_api` now returns real client IPs. The existing rate limiter starts actually bucketing per-user. No code change — just verify behavior.

The two bespoke rate limiters in `cloud/src/CloudApi.ts:2918,2964` still read `x-forwarded-for` directly. Migrate them to `req.ip` for consistency.

### Phase 5: Re-enable the apps/api rate limiter

1. Uncomment `apps/api/api.ts:16,121,124`.
2. Review limits: `authRateLimiter` allows 5 requests per 15 minutes in prod, which is aggressive for `/auth/login`. Validate against real traffic shapes.
3. Run `tests/api/098.RateLimiter.ts` to confirm behavior.
4. Deploy to dev, then prod.

### Phase 6: Clean up direct X-Forwarded-For readers

Migrate the direct readers listed above (11 call sites in total, 9 once the two `cloud/src/CloudApi.ts` sites are handled in Phase 4) to use `req.ip` instead of the raw header fallback. This removes the attacker-spoofing surface for any endpoint reachable without the ALB (e.g., direct EC2 access during SSH debugging).

Candidate utility: a single `getRealIp(req)` helper in `auth/` that encapsulates the policy, so future callers have one obvious answer.
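
One possible shape for that helper, offered as a sketch rather than a settled design. `RequestLike` is a hypothetical structural type standing in for the Express request, the `"unknown"` fallback is an assumption, and the IPv4-mapped normalization is an optional extra so that `::ffff:203.0.113.7` and `203.0.113.7` share a bucket.

```typescript
// Structural stand-in for the parts of an Express request the helper needs.
interface RequestLike {
  ip?: string;
  socket: { remoteAddress?: string };
}

// Single source of truth for "who is the client": trust `req.ip`, which
// Express derives from the `trust proxy` setting, and never read the raw
// X-Forwarded-For header.
function getRealIp(req: RequestLike): string {
  const ip = req.ip ?? req.socket.remoteAddress ?? "unknown";
  // Strip the IPv4-mapped IPv6 prefix so both forms map to the same key.
  return ip.startsWith("::ffff:") ? ip.slice("::ffff:".length) : ip;
}
```

Call sites that currently use `forwardedRealIp || req.get("x-forwarded-for") || req.ip` would call `getRealIp(req)` instead, which makes a client-supplied header irrelevant to the result.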

## Verification

After Phase 3:

- `/auth/whoami` on every production fleet shows `chain.reqIp === <your-ip>` and `chain.reqIps.length === 2`.
- Rate limiter tests: `cd tests && ts-mocha --bail --exit --timeout=500000 api/098.RateLimiter.ts`
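- Spot check: send 101 requests from a single IP to a `cloud_api` public endpoint within a minute; the 101st should return 429, while a request from a second IP in the same minute should still succeed. Before this work, all clients shared one bucket, so the second client would have been throttled as well.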
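
The spot check can be reasoned about with a toy fixed-window limiter. This is an illustrative model only: `FixedWindowLimiter` is hypothetical, and the algorithm in `api/src/RateLimiter.ts` (which is Redis-backed per the commit message) may differ. The 100 requests per minute figure is the `cloud_api` limit quoted earlier.

```typescript
// Toy in-memory fixed-window limiter keyed by an arbitrary string.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Record one request at time `nowMs` and return the HTTP status
  // the limiter would produce for it.
  hit(key: string, nowMs: number): number {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { start: nowMs, count: 1 });
      return 200;
    }
    current.count += 1;
    return current.count > this.limit ? 429 : 200;
  }
}

// Send `count` requests under one key and return the status of the last one.
function burst(limiter: FixedWindowLimiter, key: string, count: number, startMs: number): number {
  let status = 200;
  for (let i = 0; i < count; i++) {
    status = limiter.hit(key, startMs + i);
  }
  return status;
}
```

Keyed by real client IP, a burst of 101 requests exhausts only that client's bucket. Keyed by a shared proxy address, the same burst exhausts the bucket for every client, which is the failure mode this plan removes.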

## Out of scope / future work

- **Security Group → Managed Prefix List migration** for `ADMIN_API_SECURITY_GROUP_ID`. The current approach works because the SG is directly attached (per user confirmation). Prefix lists would be cleaner if the same team IP allowlist is ever needed on a second SG. Draft plan at `~/.claude/plans/abstract-waddling-pine.md`; not promoting to a numbered plan until there's a concrete driver.

## Critical files

- `apps/api/api.ts:21-34, 184` — trust proxy setup + whoami diagnostic
- `apps/admin_api/admin_api.ts:29-34`
- `cloud/cloud_api/cloud_api.ts` — active rate limiter
- `api/src/RateLimiter.ts` — rate limiter infrastructure + policy doc
- `auth/AuthApi.ts` — direct header readers (5 sites)
- `devops/SERVER_TOPOLOGY_SPEC.md` — topology reference (§3, §7.1)