fix: Fix users unable to see or hear each other in voice channels due to stale server sockets, passive non-initiators, and race conditions during peer connection setup.

Server:
- Close stale WebSocket connections sharing the same oderId in handleIdentify instead of letting them linger up to 45s
- Make user_joined/user_left broadcasts identity-aware so duplicate sockets don't produce phantom join/leave events
- Include serverIds in user_left payload for multi-room presence
- Simplify findUserByOderId now that stale sockets are cleaned up

Client - signaling:
- Add fallback offer system with 1s timer for missed user_joined races
- Add non-initiator takeover after 5s when the initiator fails to send an offer (NON_INITIATOR_GIVE_UP_MS)
- Scope peerServerMap per signaling URL to prevent cross-server collisions
- Add socket identity guards on all signaling event handlers
- Replace canReusePeerConnection with hasActivePeerConnection and isPeerConnectionNegotiating with extended grace periods

Client - peer connections:
- Extract replaceUnusablePeer helper to deduplicate stale peer replacement in offer and ICE handlers
- Add stale connectionstatechange guard to ignore events from replaced RTCPeerConnection instances
- Use deterministic initiator election in peer recovery reconnects
- Track createdAt on PeerData for staleness detection

Client - presence:
- Add multi-room presence tracking via presenceServerIds on User
- Replace clearUsers + individual userJoined with syncServerPresence for atomic server roster updates
- Make userLeft handle partial server removal instead of full eviction

Documentation:
- Add server-side connection hygiene, non-initiator takeover, and stale peer replacement sections to the realtime README
2026-04-04 02:47:58 +02:00
parent ae0ee8fac7
commit de2d3300d4
24 changed files with 1128 additions and 164 deletions
--- a/toju-app/src/app/infrastructure/realtime/README.md
+++ b/toju-app/src/app/infrastructure/realtime/README.md
@@ -144,13 +144,25 @@ sequenceDiagram

 When the WebSocket drops, `SignalingManager` schedules reconnection with exponential backoff (1s, 2s, 4s, ... up to 30s). On reconnect it replays the cached `identify` and `join_server` messages so presence is restored without the UI doing anything.
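
The backoff schedule described above (1s, 2s, 4s, ... capped at 30s) can be sketched as a pure function. This is an illustration only: `reconnectDelayMs`, `BASE_DELAY_MS`, and `MAX_DELAY_MS` are hypothetical names, not taken from `SignalingManager`.

```typescript
// Hypothetical constants; the README only states the 1s, 2s, 4s, ... 30s schedule.
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

// Returns the delay before reconnect attempt `attempt` (zero-based).
function reconnectDelayMs(attempt: number): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  // Double the delay on each attempt, never exceeding the cap.
  return Math.min(BASE_DELAY_MS * 2 ** safeAttempt, MAX_DELAY_MS);
}
```

Keeping the calculation separate from the timer makes the cap easy to check without a live socket.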

+### Server-side connection hygiene
+
+Browsers do not reliably fire WebSocket close events during page refresh or navigation (especially Chromium). The server's `handleIdentify` now closes any existing connection that shares the same `oderId` but a different `connectionId`. This guarantees `findUserByOderId` always routes offers and presence events to the freshest socket, eliminating a class of bugs where signaling messages landed on a dead tab's socket and were silently lost.
+
+Join and leave broadcasts are also identity-aware: `handleJoinServer` only broadcasts `user_joined` when the identity is genuinely new to that server (not just a second WebSocket connection for the same user), and `handleLeaveServer` / dead-connection cleanup only broadcast `user_left` when no other open connection for that identity remains in the server. The `user_left` payload includes `serverIds` listing the rooms the identity still belongs to, so the client can subtract correctly without over-removing.
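
A minimal sketch of the two server-side rules above, assuming connections are held in a map keyed by `connectionId`. The `Connection` shape and both helper names are hypothetical; the actual `handleIdentify` and cleanup code may be organized differently.

```typescript
// Hypothetical connection record; only the fields needed for the sketch.
interface Connection {
  connectionId: string;
  oderId: string | null;
  serverIds: Set<string>;
  close(): void;
}

// On identify: close every other connection that already claims this oderId.
function closeStaleConnections(
  connections: Map<string, Connection>,
  current: Connection,
  oderId: string
): string[] {
  const closed: string[] = [];
  for (const conn of Array.from(connections.values())) {
    if (conn.oderId === oderId && conn.connectionId !== current.connectionId) {
      conn.close();
      connections.delete(conn.connectionId);
      closed.push(conn.connectionId);
    }
  }
  current.oderId = oderId;
  return closed;
}

// Before broadcasting user_left: is the identity still present in the server
// through some other connection?
function identityStillInServer(
  connections: Map<string, Connection>,
  oderId: string,
  serverId: string,
  excludeConnectionId: string
): boolean {
  return Array.from(connections.values()).some(
    (conn) =>
      conn.connectionId !== excludeConnectionId &&
      conn.oderId === oderId &&
      conn.serverIds.has(serverId)
  );
}
```

In this sketch a `user_left` broadcast would be sent only when `identityStillInServer` returns `false` for the departing connection.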
+
+### Multi-room presence
+
+`server_users`, `user_joined`, and `user_left` are room-scoped presence messages, but the renderer must treat them as updates into a global multi-room presence view. The users store tracks `presenceServerIds` per user instead of clearing the whole slice when a new `server_users` snapshot arrives, so startup/search background rooms keep their server-rail voice badges and active voice peers do not disappear when the user views a different server.
+
+Peer routing also has to stay scoped to the signaling server that reported the membership. A `user_left` from one signaling cluster must only subtract that cluster's shared servers; otherwise a leave on `signal.toju.app` can incorrectly tear down a peer that is still shared through `signal-sweden.toju.app` or a local signaling server. Route metadata is therefore kept across peer recreation and only cleared once the renderer no longer shares any servers with that peer.
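
To illustrate partial removal, here is a sketch of a `user_left` update that subtracts a single server from a user's `presenceServerIds` and evicts the user only when nothing remains. The `PresenceUser` shape and `applyUserLeft` are assumptions made for this example, not the store's actual reducer.

```typescript
// Hypothetical slice entry; the real User model has more fields.
interface PresenceUser {
  oderId: string;
  presenceServerIds: string[];
}

// Removes one server from the user's presence and drops the user only when
// no shared servers are left.
function applyUserLeft(
  users: PresenceUser[],
  oderId: string,
  leftServerId: string
): PresenceUser[] {
  const next: PresenceUser[] = [];
  for (const user of users) {
    if (user.oderId !== oderId) {
      next.push(user);
      continue;
    }
    const remaining = user.presenceServerIds.filter((id) => id !== leftServerId);
    if (remaining.length > 0) {
      next.push({ ...user, presenceServerIds: remaining });
    }
  }
  return next;
}
```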
+
 ## Peer connection lifecycle

-Peers connect to each other directly with `RTCPeerConnection`. The "initiator" (whoever was already in the room) creates the data channel and audio/video transceivers, then sends an offer. The other side creates an answer.
+Peers connect to each other directly with `RTCPeerConnection`. The initiator is chosen deterministically from the identified logical peer IDs so only one side creates the offer and primary data channel for a given pair. The other side creates an answer. If identity or negotiation is still settling, the retry timer defers instead of comparing against the ephemeral local transport ID or reusing a half-open peer forever.
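
One way to picture the election is a pure comparison of the two logical IDs. The README does not state the actual ordering rule, so the "lexicographically smaller ID initiates" rule below is an assumption, as are the `electRole` name and the `"defer"` result for unsettled identity.

```typescript
type Role = "initiator" | "responder" | "defer";

// Assumed rule: the lexicographically smaller logical ID initiates.
// Both sides run the same comparison, so they always agree on the roles.
function electRole(localId: string | null, remoteId: string | null): Role {
  // Identity not settled yet (or nonsensical self-pair): let the retry timer defer.
  if (!localId || !remoteId || localId === remoteId) {
    return "defer";
  }
  return localId < remoteId ? "initiator" : "responder";
}
```

Because the result depends only on the pair of IDs, any retry path that calls the same function cannot produce two initiators for one pair.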

 ```mermaid
 sequenceDiagram
-    participant A as Peer A (initiator)
+    participant A as Peer A (elected initiator)
    participant Sig as Signaling Server
    participant B as Peer B

@@ -180,6 +192,16 @@ sequenceDiagram

 Both peers might send offers at the same time ("glare"). The negotiation module implements the "polite peer" pattern: one side is designated polite (the non-initiator) and will roll back its local offer if it detects a collision, then accept the remote offer instead. The impolite side ignores the incoming offer.
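
The glare rule can be reduced to a small decision function, following the commonly documented perfect-negotiation pattern. The `OfferContext` shape and `decideOnIncomingOffer` are hypothetical; the negotiation module may track collision state differently.

```typescript
// Hypothetical snapshot of local negotiation state when a remote offer arrives.
interface OfferContext {
  polite: boolean;        // true for the non-initiator
  makingOffer: boolean;   // a local offer is currently being created or sent
  signalingState: string; // e.g. "stable" or "have-local-offer"
}

type OfferDecision = "accept" | "rollback-then-accept" | "ignore";

function decideOnIncomingOffer(ctx: OfferContext): OfferDecision {
  // A collision exists when a local offer is in flight or not yet resolved.
  const collision = ctx.makingOffer || ctx.signalingState !== "stable";
  if (!collision) {
    return "accept";
  }
  // The polite side yields; the impolite side keeps its own offer.
  return ctx.polite ? "rollback-then-accept" : "ignore";
}
```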

+Existing members also schedule a short `user_joined` fallback offer, and the `server_users` path now re-arms the same retry when an initial attempt stalls. The joiner still tries first via its `server_users` snapshot, but the fallback heals late-join races or half-open peers where that initial offer never arrives or never finishes. The retry uses the same deterministic initiator election as the main `server_users` path so the pair cannot regress into dual initiators.
+
+### Non-initiator takeover
+
+If the elected initiator's offer never arrives (stale socket, network issue, page still loading), the non-initiator does not wait forever. It tracks the start of each waiting period in `nonInitiatorWaitStart`. For the first `NON_INITIATOR_GIVE_UP_MS` (5 s) it reschedules and logs. Once that window expires it takes over: removes any stale peer, creates a fresh `RTCPeerConnection` as initiator, and sends its own offer. This ensures every peer pair eventually establishes a connection regardless of which side was originally elected.
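
The waiting window can be sketched as follows. `NON_INITIATOR_GIVE_UP_MS` and `nonInitiatorWaitStart` are named in the text above; the `nextNonInitiatorAction` helper and its return values are assumptions for illustration, and the real code also removes the stale peer and creates the new connection.

```typescript
const NON_INITIATOR_GIVE_UP_MS = 5_000;

// Start time of the current waiting period, per remote peer.
const nonInitiatorWaitStart = new Map<string, number>();

// Decides whether the non-initiator keeps waiting for the elected initiator's
// offer or takes over. `now` is passed in so the logic is testable.
function nextNonInitiatorAction(peerId: string, now: number): "wait" | "take-over" {
  const startedAt = nonInitiatorWaitStart.get(peerId);
  if (startedAt === undefined) {
    // First time we notice the missing offer: open a waiting period.
    nonInitiatorWaitStart.set(peerId, now);
    return "wait";
  }
  if (now - startedAt < NON_INITIATOR_GIVE_UP_MS) {
    return "wait";
  }
  // Window expired: clear the marker and let the caller send its own offer.
  nonInitiatorWaitStart.delete(peerId);
  return "take-over";
}
```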
+
+### Stale peer replacement
+
+Offers or ICE candidates can arrive while the existing `RTCPeerConnection` for that peer is in `failed` or `closed` state (the browser's `connectionstatechange` event hasn't fired yet to clean it up). `replaceUnusablePeer()` in `negotiation.ts` detects this, closes the dead connection, removes it from the active map, and lets the caller proceed with a fresh peer. The `connectionstatechange` handler in `create-peer-connection.ts` also guards against stale events: if the connection object no longer matches the current map entry for that peer, the event is ignored so it cannot accidentally remove a replacement peer.
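
A rough sketch of both guards, using a structural stand-in for `RTCPeerConnection` so it runs outside a browser. The `PeerData` fields shown, the `replaceUnusablePeer` signature, and `isCurrentConnection` are assumptions; only the helper's name and purpose come from the text above.

```typescript
// Minimal structural stand-in for the parts of RTCPeerConnection used here.
interface PeerConnectionLike {
  connectionState: string;
  close(): void;
}

interface PeerData {
  connection: PeerConnectionLike;
  createdAt: number;
}

// Drops a peer whose connection is already failed or closed.
// Returns true when the caller should go on to create a fresh peer.
function replaceUnusablePeer(peers: Map<string, PeerData>, peerId: string): boolean {
  const existing = peers.get(peerId);
  if (!existing) {
    return false;
  }
  const state = existing.connection.connectionState;
  if (state !== "failed" && state !== "closed") {
    return false;
  }
  existing.connection.close();
  peers.delete(peerId);
  return true;
}

// Guard for connectionstatechange handlers: events from a connection that is
// no longer the map entry for this peer must be ignored.
function isCurrentConnection(
  peers: Map<string, PeerData>,
  peerId: string,
  connection: PeerConnectionLike
): boolean {
  return peers.get(peerId)?.connection === connection;
}
```

A `connectionstatechange` handler would return early when `isCurrentConnection` is `false`, so a late event from the old object cannot remove its replacement.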
+
 ### Disconnect recovery

 ```mermaid
@@ -196,7 +218,7 @@ stateDiagram-v2
    Closed --> [*]
 ```

-When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts: a fresh `RTCPeerConnection` is created and a new offer is sent every 5 seconds, up to 12 attempts.
+When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts. A fresh `RTCPeerConnection` is created every 5 seconds, up to 12 attempts; only the deterministically elected initiator sends a reconnect offer, while the other side waits for that offer.

 ## Data channel