fix: Fix users unable to see or hear each other in voice channels due to
stale server sockets, passive non-initiators, and race conditions during peer connection setup. Fix users unable to see or hear each other in voice channels due to stale server sockets, passive non-initiators, and race conditions during peer connection setup. Server: - Close stale WebSocket connections sharing the same oderId in handleIdentify instead of letting them linger up to 45s - Make user_joined/user_left broadcasts identity-aware so duplicate sockets don't produce phantom join/leave events - Include serverIds in user_left payload for multi-room presence - Simplify findUserByOderId now that stale sockets are cleaned up Client - signaling: - Add fallback offer system with 1s timer for missed user_joined races - Add non-initiator takeover after 5s when the initiator fails to send an offer (NON_INITIATOR_GIVE_UP_MS) - Scope peerServerMap per signaling URL to prevent cross-server collisions - Add socket identity guards on all signaling event handlers - Replace canReusePeerConnection with hasActivePeerConnection and isPeerConnectionNegotiating with extended grace periods Client - peer connections: - Extract replaceUnusablePeer helper to deduplicate stale peer replacement in offer and ICE handlers - Add stale connectionstatechange guard to ignore events from replaced RTCPeerConnection instances - Use deterministic initiator election in peer recovery reconnects - Track createdAt on PeerData for staleness detection Client - presence: - Add multi-room presence tracking via presenceServerIds on User - Replace clearUsers + individual userJoined with syncServerPresence for atomic server roster updates - Make userLeft handle partial server removal instead of full eviction Documentation: - Add server-side connection hygiene, non-initiator takeover, and stale peer replacement sections to the realtime README
This commit is contained in:
@@ -144,13 +144,25 @@ sequenceDiagram
|
||||
|
||||
When the WebSocket drops, `SignalingManager` schedules reconnection with exponential backoff (1s, 2s, 4s, ... up to 30s). On reconnect it replays the cached `identify` and `join_server` messages so presence is restored without the UI doing anything.
|
||||
|
||||
### Server-side connection hygiene
|
||||
|
||||
Browsers do not reliably fire WebSocket close events during page refresh or navigation (especially Chromium). The server's `handleIdentify` now closes any existing connection that shares the same `oderId` but a different `connectionId`. This guarantees `findUserByOderId` always routes offers and presence events to the freshest socket, eliminating a class of bugs where signaling messages landed on a dead tab's socket and were silently lost.
|
||||
|
||||
Join and leave broadcasts are also identity-aware: `handleJoinServer` only broadcasts `user_joined` when the identity is genuinely new to that server (not just a second WebSocket connection for the same user), and `handleLeaveServer` / dead-connection cleanup only broadcast `user_left` when no other open connection for that identity remains in the server. The `user_left` payload includes `serverIds` listing the rooms the identity still belongs to, so the client can subtract correctly without over-removing.
|
||||
|
||||
### Multi-room presence
|
||||
|
||||
`server_users`, `user_joined`, and `user_left` are room-scoped presence messages, but the renderer must treat them as updates into a global multi-room presence view. The users store tracks `presenceServerIds` per user instead of clearing the whole slice when a new `server_users` snapshot arrives, so startup/search background rooms keep their server-rail voice badges and active voice peers do not disappear when the user views a different server.
|
||||
|
||||
Peer routing also has to stay scoped to the signaling server that reported the membership. A `user_left` from one signaling cluster must only subtract that cluster's shared servers; otherwise a leave on `signal.toju.app` can incorrectly tear down a peer that is still shared through `signal-sweden.toju.app` or a local signaling server. Route metadata is therefore kept across peer recreation and only cleared once the renderer no longer shares any servers with that peer.
|
||||
|
||||
## Peer connection lifecycle
|
||||
|
||||
Peers connect to each other directly with `RTCPeerConnection`. The "initiator" (whoever was already in the room) creates the data channel and audio/video transceivers, then sends an offer. The other side creates an answer.
|
||||
Peers connect to each other directly with `RTCPeerConnection`. The initiator is chosen deterministically from the identified logical peer IDs so only one side creates the offer and primary data channel for a given pair. The other side creates an answer. If identity or negotiation is still settling, the retry timer defers instead of comparing against the ephemeral local transport ID or reusing a half-open peer forever.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant A as Peer A (initiator)
|
||||
participant A as Peer A (elected initiator)
|
||||
participant Sig as Signaling Server
|
||||
participant B as Peer B
|
||||
|
||||
@@ -180,6 +192,16 @@ sequenceDiagram
|
||||
|
||||
Both peers might send offers at the same time ("glare"). The negotiation module implements the "polite peer" pattern: one side is designated polite (the non-initiator) and will roll back its local offer if it detects a collision, then accept the remote offer instead. The impolite side ignores the incoming offer.
|
||||
|
||||
Existing members also schedule a short `user_joined` fallback offer, and the `server_users` path now re-arms the same retry when an initial attempt stalls. The joiner still tries first via its `server_users` snapshot, but the fallback heals late-join races or half-open peers where that initial offer never arrives or never finishes. The retry uses the same deterministic initiator election as the main `server_users` path so the pair cannot regress into dual initiators.
|
||||
|
||||
### Non-initiator takeover
|
||||
|
||||
If the elected initiator's offer never arrives (stale socket, network issue, page still loading), the non-initiator does not wait forever. It tracks the start of each waiting period in `nonInitiatorWaitStart`. For the first `NON_INITIATOR_GIVE_UP_MS` (5 s) it reschedules and logs. Once that window expires it takes over: removes any stale peer, creates a fresh `RTCPeerConnection` as initiator, and sends its own offer. This ensures every peer pair eventually establishes a connection regardless of which side was originally elected.
|
||||
|
||||
### Stale peer replacement
|
||||
|
||||
Offers or ICE candidates can arrive while the existing `RTCPeerConnection` for that peer is in `failed` or `closed` state (the browser's `connectionstatechange` event hasn't fired yet to clean it up). `replaceUnusablePeer()` in `negotiation.ts` detects this, closes the dead connection, removes it from the active map, and lets the caller proceed with a fresh peer. The `connectionstatechange` handler in `create-peer-connection.ts` also guards against stale events: if the connection object no longer matches the current map entry for that peer, the event is ignored so it cannot accidentally remove a replacement peer.
|
||||
|
||||
### Disconnect recovery
|
||||
|
||||
```mermaid
|
||||
@@ -196,7 +218,7 @@ stateDiagram-v2
|
||||
Closed --> [*]
|
||||
```
|
||||
|
||||
When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts: a fresh `RTCPeerConnection` is created and a new offer is sent every 5 seconds, up to 12 attempts.
|
||||
When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts. A fresh `RTCPeerConnection` is created every 5 seconds, up to 12 attempts; only the deterministically elected initiator sends a reconnect offer, while the other side waits for that offer.
|
||||
|
||||
## Data channel
|
||||
|
||||
|
||||
Reference in New Issue
Block a user