# Debug: Media Server Delay

```
Claude Code v2.1.92
Opus 4.6 (1M context) · Claude Max
C:\Bigscreen\cloud\cloud\src
```

---

> `/effort max`

Set effort level to max (this session only): Maximum capability with deepest reasoning (Opus 4.6 only)

---

> I'm trying to diagnose a problem. The function `sendLiveRoomUpdatedToRoomInstance` delays sending out the websocket message in the function `onUserConnectedToRoomInternal` by several seconds. I want to understand why. There is no delay when calling it from other places e.g. `sendRoomUpdated` updates instantly, so this is not due to intermittent internet problems. The delay can be measured in seconds, anywhere from 5 to 10.

*Selected 1 lines from Cloud.ts in Visual Studio Code*

Let me read the relevant functions to understand the flow.

*Searched for 3 patterns, read 1 file*

The delay isn't in `sendLiveRoomUpdatedToRoomInstance` itself — it's in the work that happens before it. Let me look at the lock acquisition and the full caller.

*Searched for 2 patterns, read 1 file*

Now let me look at `joinLiveRoomInternal` to understand the full join flow and what might be holding the lock.

*Searched for 1 pattern, read 1 file*

Let me now trace how `onUserConnectedToRoom` gets called, and check the potentially expensive operations.

*Searched for 5 patterns, read 3 files*

Now let me look at the message queue implementation to understand if there's polling involved.

*Searched for 1 pattern, read 1 file*

Now let me check the media server's polling loop to understand how frequently it checks for messages.

*Searched for 4 patterns, read 2 files*

Let me look at how the media server sends the `UserConnected` response back and the `onUserConnectionRequest` handler.

*Searched for 4 patterns, read 2 files*

Now let me check how the websocket server consumes outbox messages and the cloud worker's polling interval.

*Searched for 2 patterns, read 1 file*

Let me check the Redlock configuration and retry behavior.

*Searched for 2 patterns, read 1 file*

Let me look at the `createPeer` function in the media server to check if it's slow.

*Searched for 1 pattern, read 1 file*

Let me check the cloud worker to see if there's an additional processing delay.

*Searched for 4 patterns, read 1 file*

---

Now I have a complete picture. Here's my diagnosis:

## Diagnosis: Why `sendLiveRoomUpdatedToRoomInstance` is delayed in `onUserConnectedToRoomInternal`

The delay is **not** in `sendLiveRoomUpdatedToRoomInstance` itself — it sends to a Redis queue that the websocket server picks up within 100ms. The delay is in everything that happens **before** it's called through the `onUserConnectedToRoom` path. There are three compounding factors:

### 1. Media Server Round-Trip (adds 300-800ms baseline)

The `joinRoom` → `onUserConnectedToRoom` path has a full round-trip through the media server that `sendRoomUpdated` doesn't:

```
joinLiveRoomInternal (Cloud.ts:1476)
  → pushes UserConnectionRequest to Redis queue
  → media_server_next polls every 200ms (line 1343)
  → createPeer: 9 sequential mediasoup IPC calls (lines 255-427)
  → HTTP POST back to cloud (RemoteQueue.sendMessage)
  → Cloud handles it in MediaServerApi.sendMessage (line 154)
  → Cloud.onUserConnectedToRoom
```

The `sendRoomUpdated` path skips all of this and calls `sendLiveRoomUpdatedToRoomInstance` directly.

### 2. Room Lock Contention (the primary suspect for 5-10s)

`onUserConnectedToRoom` must acquire the room lock (`Cloud.ts:1517`). This same lock is held by **13 different operations** including `joinRoom`, `updateSeat`, `updateRoomScreen`, `leaveRoom`, etc.

Critically, `joinRoom` **still holds the room lock** when it sends the `UserConnectionRequest` to the media server (`Cloud.ts:1476`). It continues holding it through `updateRoomSocialGraph` (`Cloud.ts:1487`) before releasing at line 1262. If the media server responds quickly, `onUserConnectedToRoom` arrives while `joinRoom` or other operations still hold the lock.

### 3. Redlock Retry Configuration Amplifies Contention (`lib/Redis.ts:123-129`)

```typescript
Redis._redlock = new Redlock([Redis.redis], {
    retryCount: -1,      // infinite retries
    retryDelay: 200,     // 200ms base delay
    retryJitter: 400     // up to 400ms random jitter
});
```

Each failed lock attempt waits **200-600ms** before retrying. Even if the lock becomes free immediately after a retry fails, the caller must wait the full retry interval. This means:

- 1 queued operation ahead: ~200-600ms wasted
- 3 operations ahead: ~1-2 seconds
- 5-8 operations ahead: **5-10 seconds** — matches your observed delay

In a room where multiple users are joining simultaneously, each `joinRoom` and each corresponding `onUserConnectedToRoom` all compete for the same lock. They don't form an orderly queue — Redlock is first-to-retry-wins, so operations can be starved.

### The Timeline That Produces 5-10s Delay

```
T=0ms:     User A joinRoom acquires lock
T=50ms:    User B joinRoom tries lock → RETRY (200-600ms wait)
T=100ms:   joinRoom sends UserConnectionRequest to media server
T=200ms:   User A joinRoom releases lock
T=300ms:   Media server polls, gets message
T=500ms:   User B joinRoom retries, acquires lock
T=500ms:   createPeer completes, HTTP POST with UserConnected
T=510ms:   onUserConnectedToRoom tries lock → BLOCKED by User B
T=700ms:   User C joinRoom arrives, tries lock → RETRY
T=800ms:   User B releases lock
T=900ms:   User C retries, acquires lock (beat onUserConnected randomly)
T=...      This pattern repeats for every concurrent operation
T=5-10s:   onUserConnectedToRoom finally acquires the lock
T=5-10s:   sendLiveRoomUpdatedToRoomInstance finally fires
```

### To Confirm

Add timing logs around the lock acquisition in `onUserConnectedToRoom`:

```typescript
const t0 = Date.now();
const _lock = await this.ACQUIRE_ROOM_LOCK_WITH_ID("onUserConnectedToRoom", roomId);
Logger.info(`[onUserConnectedToRoom] lock acquired in ${Date.now() - t0}ms for room ${roomId}`);
```

If the number is consistently in the seconds range, lock contention is confirmed.

*Crunched for 10m 46s*

---

> In this case, we only have a single user joining the room, and there's still a delay. Could the lock also be the cause of this?

No — with a single user and no concurrent operations, the lock is almost certainly **not** the cause. Here's why:

`joinRoom` sends the `UserConnectionRequest` at `Cloud.ts:1476`, then only has `updateRoomSocialGraph` (`Cloud.ts:1487`) left before releasing the lock. With 1 participant, that social graph call generates **zero pairs** (`N*(N-1) = 0`), so it completes in a few ms. The lock is released well before the media server can possibly respond, because the media server still needs:

- Up to 200ms polling delay (`media_server_next.ts:1343`)
- `createPeer` execution time
- HTTP round-trip back to the cloud

By the time `onUserConnectedToRoom` is called, the `joinRoom` lock has been free for hundreds of milliseconds.

**The 5-10 second delay is in the media server round-trip.** The chain is:

```
joinLiveRoomInternal (Cloud.ts:1476)
  → Redis lpush (MessageQueue.ts:190)
  → media_server_next polls via HTTP GET every 200ms (media_server_next.ts:1314)
  → createPeer: 9 sequential mediasoup IPC calls (media_server_next.ts:234-437)
  → RemoteQueue.sendMessage: HTTP POST back to cloud (RemoteMessageQueue.ts:92)
  → MediaServerApi.sendMessage handles it (MediaServerApi.ts:154)
  → Cloud.onUserConnectedToRoom → onUserConnectedToRoomInternal
  → sendLiveRoomUpdatedToRoomInstance
```

The most likely bottleneck is `createPeer` — it makes **9 sequential mediasoup worker IPC calls** (4 `createPlainTransport`, 1 `produceData`, 3 `produce`), plus potentially a `createRouter` on the first user in a group. If the mediasoup worker is slow to respond (resource pressure, port allocation delays, or worker initialization), this can easily account for seconds.

To pinpoint it, add timing around each stage in `media_server_next.ts`:

```typescript
async function onUserConnectionRequest(messageId: string, data: any) {
    const t0 = Date.now();
    Logger.debug("onUserConnectionRequest");

    const { userSessionId, roomId, groupId } = data;

    try {
        const peerId = getUserPeerId(userSessionId, roomId, groupId);

        const newPeer = await createPeer(peerId, roomId, groupId, userSessionId);
        Logger.info(`[onUserConnectionRequest] createPeer took ${Date.now() - t0}ms`);

        // ... build message ...

        const t1 = Date.now();
        await RemoteQueue.sendMessage(authToken, message);
        Logger.info(`[onUserConnectionRequest] sendMessage took ${Date.now() - t1}ms (total: ${Date.now() - t0}ms)`);
```

No, for a single user the lock is almost certainly **not** the cause. Here's why:

`joinRoom` releases the room lock (`Cloud.ts:1262`) well before the media server can respond. The media server polling alone adds 200ms (`media_server_next:1343`), plus `createPeer` time, plus the HTTP round-trip back. By that point, `joinRoom` has long released the lock. With no contention, Redlock acquires instantly (single Redis SET NX, <5ms).

The delay for a single user must be in the **media server round-trip chain**. To pinpoint it, I'd suggest adding timing at each stage. Let me check `createPeer` and the HTTP flow more closely.

*Searched for 1 pattern, read 1 file*

Let me check if there's a `RoomCreationRequest` that precedes the `UserConnectionRequest` and could be blocking.

*Searched for 2 patterns, read 1 file*

Let me check if expired key events could be acquiring the room lock in the background.

*Searched for 3 patterns, read 1 file*

Good question. There is one scenario where the lock could still be involved even for a single user: **background expiry events**.

Looking at `Cloud.ts:326-341`, Redis keyspace notifications trigger `onExpiredId`:

```
user:expiry:{userId}  → removeBigscreenUser → leaveRoom_WithoutUserLock → ACQUIRES ROOM LOCK (line 2350)
room:expiry:{roomId}  → deleteLiveRoom → ACQUIRES ROOM LOCK MULTIPLE TIMES (lines 996, 1033)
```

If another user's session is expiring from the same room, or the room itself is in the process of being cleaned up, that operation holds the room lock. `leaveRoomInternal` (lines 2375-2521) does ~12 sequential Redis operations, potentially a Postgres write, and media server messages while holding the lock.

**But** — for a *consistent* 5-10 second delay, background lock contention is unlikely. Expiry events are intermittent, and they'd cause variable delays, not a reliable 5-10s.

For a consistent delay with a single user, the most likely cause is **`createPeer` on the media server** (`media_server_next.ts:234-437`). It runs **9 sequential mediasoup IPC calls** — 4 `createPlainTransport` + 1 `produceData` + 3 `produce`. Each is a synchronous round-trip to the mediasoup C++ worker. If the worker is under CPU load (processing RTP packets for other users), each IPC call gets queued behind packet processing, and 9 calls at ~500-1000ms each = 5-9 seconds.

The fastest way to confirm is adding timestamps at each stage. Want me to add diagnostic timing logs across the chain?

*Crunched for 8m 29s*

---

> The media server code has not changed for two years though, and in the past the `createPeer` function has not been this much of a bottleneck. The main architectural change made recently is that the cloud_worker has been removed from the application stack. cloud_worker no longer shuttles messages around the app; instead, the cloud_api and ws_server have direct loops that handle these messages.

That's a critical clue. The cloud_worker previously polled the `BigscreenCloud` queue every 100ms. If the cloud_api now handles those messages directly, the polling interval or processing loop could be the bottleneck. Let me look at how cloud_api handles this.

*Read 1 file*

The key is `startMediaServerWorker()` at line 29. Let me read that file.

*Read 1 file*

This is the key piece of information. Now I see the `media_server_worker.ts` running inside the cloud_api process. Let me also check what `removeOldMediaServers` does — it's being called twice (from both `cloud_api.ts:645` AND `media_server_worker.ts:65`).

*Searched for 2 patterns, read 1 file*

Let me check how `Cloud.initialize()` works and if the missing `watchExpired` flag matters.

*Searched for 1 pattern, read 1 file*

Let me check how the `WebsocketInbox` is processed now, since the user said the ws_server also handles messages directly.

*Searched for 3 patterns*

The user mentioned the rate limiter was recently tightened. Let me check if it's throttling the media server's HTTP requests.

*Searched for 2 patterns, read 1 file*

---

This is very revealing. The `media_server_worker.ts` code looks functionally equivalent to the old `cloud_worker_v2.ts`, but there's one concrete bug I can see from the migration and one important architectural question.

### Bug: `removeOldMediaServers` runs twice

Both `cloud_api.ts:645` and `media_server_worker.ts:65` call `Cloud.removeOldMediaServers()` every 20 seconds. If a media server happens to be cleaned up, `deleteMediaServer` calls `deleteLiveRoom` for every connected room, which acquires the room lock **multiple times** per room (lines 996, 1033, and per-user at 2350). Two racing `removeOldMediaServers` calls could cause extended lock contention. But this alone wouldn't explain a *consistent* delay.

### Key question: which media server are you using?

For `media_server_next` (HTTP-based): The `UserConnected` response goes via HTTP POST to `MediaServerApi.sendMessage` (`cloud_api.ts:372`), which calls `onUserConnectedToRoom` directly. This path was **always handled by cloud_api** — the cloud_worker was never involved. So removing the cloud_worker shouldn't have changed anything for this path.

For `media_server_aws` (queue-based): The `UserConnected` message goes to the `BigscreenCloud` Redis queue, polled by `media_server_worker.ts`. This path DID move from a separate process to the cloud_api process.

Which one are you using? Because if it's `media_server_next`, the cloud_worker removal shouldn't have affected the `onUserConnectedToRoom` path at all, and the delay might be caused by something else that changed alongside the migration. If it's `media_server_aws`, the processing moved into the same process as Express, which changes the event loop dynamics.

*Cogitated for 9m 45s*

---

> In the current production system where we say this problem, `media_server_next` is used.

*Selected 1 lines from EyeTracking.ts in Visual Studio Code*

That's the crucial detail. For `media_server_next`, the `UserConnected` response comes via **HTTP POST** to `/api/media-server/message` (`cloud_api.ts:372`), which is handled directly by `MediaServerApi.sendMessage` (`MediaServerApi.ts:154`). The cloud_worker was **never in this path** — it only processed the `BigscreenCloud` Redis queue, which `media_server_next` doesn't use.

So the cloud_worker removal is a **red herring** for this specific problem. Something else changed at the same time. Looking at the recent commits, I see `07aac79d "Tighten rate limiter & simplify websocket server"` — let me check if the ws_server simplification changed the outbox polling behavior.

*Read 1 file*

That confirms the cloud_worker removal is a **red herring** for this problem. The `media_server_next` `UserConnected` flow is:

```
media_server_next → HTTP POST /api/media-server/message → MediaServerApi.sendMessage → Cloud.onUserConnectedToRoom
```

This path was **always handled directly by the cloud_api Express server**. The cloud_worker never touched it. So whatever is causing the 5-10 second delay, it's not the cloud_worker removal.

Something else must have changed at the same time. Looking at your recent commits, `07aac79d "Tighten rate limiter & simplify websocket server"` is suspicious. The `standardRateLimiter` is applied globally (`cloud_api.ts:83`) — **including the media server endpoints**. The media server polls every 200ms (75 GET requests per 15 seconds), which is close to the `maxRequests: 100` limit. If you have more than one media server behind the same IP, or if a burst pushes past 100, the rate limiter starts returning 429s, and `RemoteQueue.getMessages` silently swallows the error:

```typescript
// RemoteMessageQueue.ts:107
public static async getMessages(authToken: string): Promise<any[]> {
    try {
        let res = await superagent.get(...)
        return res.body;
    } catch (ex) {
        Logger.warn(ex);   // 429 caught here, returns undefined
    }
}
```

Then in the media server polling loop:

```typescript
// media_server_next.ts:1314
const messages = await RemoteQueue.getMessages(authToken);
if (messages) {   // undefined is falsy — entire batch is silently skipped
```

The `UserConnectionRequest` sits in the Redis queue until a poll succeeds. With a 15-second window, the delay could be anywhere from **0 to 15 seconds** — matching your observed 5-10 seconds.

**To verify this quickly:** Check your media server logs for `Logger.warn` entries around the time of the delay. If you see superagent HTTP 429 errors, the rate limiter is the culprit.

**The fix** would be to exempt the media server routes from the rate limiter, since they're already authenticated with a dedicated bearer token:

```typescript
// cloud_api.ts — apply rate limiter AFTER media server routes, or skip it for /api/media-server
api.use(/^(?!\/api\/media-server)/, standardRateLimiter);
```

Or more simply, move `api.use(standardRateLimiter)` below the media server route registrations and apply it selectively to the public/admin routes.

Want me to make that change?

*Baked for 2m 5s*

---

> Yes, in fact `standardRateLimiter` should only apply to the public endpoints, it should not apply to any of the `/admin` endpoints, nor the `/api/media-server` endpoints.

*Read 1 file*

Now let me check where the public endpoints begin so I can place the rate limiter correctly.

*Read 1 file*

The rate limiter should be added to the `publicHandlers` array instead of applied globally. This is the cleanest approach since all public endpoints already use that array.

### Changes made to `cloud_api.ts`:

**Removed global rate limiter:**

```diff
- // Apply rate limiter to all endpoints except admin endpoints (handled in middleware)
- api.use(standardRateLimiter);
+ // Rate limiter is applied only to public endpoints via publicHandlers below.
```

**Added to `publicHandlers` array:**

```diff
  const publicHandlers = [
+     standardRateLimiter,
      requiresStandardAccessToken,
      AuthApi.getAccessPolicyHandler([AuthSchemas.AccessPolicy.BigscreenDefault]),
      CloudApi.requiresConnectedUser
  ];
```

**Added to `/connection` route (custom handler chain):**

```diff
- api.get("/connection", requiresStandardAccessToken, ...);
+ api.get("/connection", standardRateLimiter, requiresStandardAccessToken, ...);
```

**Added to `/rooms/latest` route (custom handler chain):**

```diff
- api.get("/rooms/latest", requiresStandardBearerAuth, CloudApi.getPublicRooms);
+ api.get("/rooms/latest", standardRateLimiter, requiresStandardBearerAuth, CloudApi.getPublicRooms);
```

**Added to `sharedBrowserHandlers` array:**

```diff
  const sharedBrowserHandlers = [
+     standardRateLimiter,
      requiresWebsiteAccessToken,
      AuthApi.getAccessPolicyHandler([AuthSchemas.AccessPolicy.BigscreenDefault]),
      CloudApi.requiresConnectedUser
  ];
```

### Summary

1. **Removed** the global `api.use(standardRateLimiter)` that was hitting every endpoint
2. **Added** `standardRateLimiter` to the `publicHandlers` array (covers most public routes)
3. **Added** `standardRateLimiter` to the three public routes with custom handler chains: `/connection`, `/rooms/latest`, and `sharedBrowserHandlers`
4. **Not rate-limited**: `/admin/*` endpoints (protected by admin auth), `/api/media-server/*` endpoints (protected by media server bearer auth), and `/hyperbeam/webhook/*` (protected by hyperbeam bearer auth)

The media server polls `/api/media-server/messages` at 5 req/s — with the global rate limiter, any burst or multiple media servers from the same IP would have hit the 100 req/15s limit and silently dropped messages, stalling the `UserConnectionRequest` → `UserConnected` round-trip for up to 15 seconds.

*Crunched for 51s*
