fix: Fix users unable to see or hear each other in voice channels due to
stale server sockets, passive non-initiators, and race conditions
during peer connection setup.

Server:
- Close stale WebSocket connections sharing the same oderId in
  handleIdentify instead of letting them linger up to 45s
- Make user_joined/user_left broadcasts identity-aware so duplicate
  sockets don't produce phantom join/leave events
- Include serverIds in user_left payload for multi-room presence
- Simplify findUserByOderId now that stale sockets are cleaned up

Client - signaling:
- Add fallback offer system with 1s timer for missed user_joined races
- Add non-initiator takeover after 5s when the initiator fails to send
  an offer (NON_INITIATOR_GIVE_UP_MS)
- Scope peerServerMap per signaling URL to prevent cross-server
  collisions
- Add socket identity guards on all signaling event handlers
- Replace canReusePeerConnection with hasActivePeerConnection and
  isPeerConnectionNegotiating with extended grace periods

Client - peer connections:
- Extract replaceUnusablePeer helper to deduplicate stale peer
  replacement in offer and ICE handlers
- Add stale connectionstatechange guard to ignore events from replaced
  RTCPeerConnection instances
- Use deterministic initiator election in peer recovery reconnects
- Track createdAt on PeerData for staleness detection

Client - presence:
- Add multi-room presence tracking via presenceServerIds on User
- Replace clearUsers + individual userJoined with syncServerPresence
  for atomic server roster updates
- Make userLeft handle partial server removal instead of full eviction

Documentation:
- Add server-side connection hygiene, non-initiator takeover, and stale
  peer replacement sections to the realtime README
2026-04-04 02:47:58 +02:00
parent ae0ee8fac7
commit de2d3300d4
24 changed files with 1128 additions and 164 deletions

@@ -144,13 +144,25 @@ sequenceDiagram
When the WebSocket drops, `SignalingManager` schedules reconnection with exponential backoff (1s, 2s, 4s, ... up to 30s). On reconnect it replays the cached `identify` and `join_server` messages so presence is restored without the UI doing anything.
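The backoff schedule described above can be sketched as a pure function. This is only an illustration: `reconnectDelayMs` and both constants are hypothetical names, and the real `SignalingManager` may compute its delays differently (for example with jitter).

```typescript
// Hypothetical constants; only the 1s start and the 30s cap come from the text above.
const BASE_RECONNECT_DELAY_MS = 1000;
const MAX_RECONNECT_DELAY_MS = 30000;

/**
 * Delay before reconnect attempt `attempt` (0-based).
 * Doubles each time (1s, 2s, 4s, ...) and is capped at 30s.
 */
function reconnectDelayMs(attempt: number): number {
  const exponent = Math.max(0, Math.floor(attempt));
  const uncapped = BASE_RECONNECT_DELAY_MS * Math.pow(2, exponent);
  return Math.min(uncapped, MAX_RECONNECT_DELAY_MS);
}
```

Keeping the schedule in a pure function makes the cap easy to unit test without real timers.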
### Server-side connection hygiene
Browsers do not reliably fire WebSocket close events during page refresh or navigation (especially Chromium). The server's `handleIdentify` now closes any existing connection that shares the same `oderId` but a different `connectionId`. This guarantees `findUserByOderId` always routes offers and presence events to the freshest socket, eliminating a class of bugs where signaling messages landed on a dead tab's socket and were silently lost.
Join and leave broadcasts are also identity-aware: `handleJoinServer` only broadcasts `user_joined` when the identity is genuinely new to that server (not just a second WebSocket connection for the same user), and `handleLeaveServer` / dead-connection cleanup only broadcast `user_left` when no other open connection for that identity remains in the server. The `user_left` payload includes `serverIds` listing the rooms the identity still belongs to, so the client can subtract correctly without over-removing.
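A minimal sketch of the two rules above, assuming the server keeps its connections in a map keyed by `connectionId`. The `Conn` shape and both function names are hypothetical and are not taken from the actual `handleIdentify` or `handleLeaveServer` code.

```typescript
// Hypothetical connection shape; the real server objects are assumed to differ.
interface Conn {
  connectionId: string;
  oderId?: string;
  serverIds: Set<string>;
  close(): void;
}

/**
 * On identify: close every other connection that already uses this oderId,
 * drop it from the registry, and return the closed connection IDs.
 */
function closeStaleConnections(
  conns: Map<string, Conn>,
  current: Conn,
  oderId: string
): string[] {
  const closed: string[] = [];
  conns.forEach((conn, id) => {
    if (conn.oderId === oderId && id !== current.connectionId) {
      conn.close();
      closed.push(id);
    }
  });
  closed.forEach((id) => conns.delete(id));
  current.oderId = oderId;
  return closed;
}

/**
 * True when another registered connection for the same identity is still in
 * the server, in which case user_joined / user_left should not be broadcast.
 */
function identityStillInServer(
  conns: Map<string, Conn>,
  oderId: string,
  serverId: string,
  exceptConnectionId: string
): boolean {
  let found = false;
  conns.forEach((conn, id) => {
    if (id !== exceptConnectionId && conn.oderId === oderId && conn.serverIds.has(serverId)) {
      found = true;
    }
  });
  return found;
}
```

In this sketch a join or leave broadcast would be sent only when `identityStillInServer` returns `false` for the connection being added or removed.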
### Multi-room presence
`server_users`, `user_joined`, and `user_left` are room-scoped presence messages, but the renderer must treat them as updates into a global multi-room presence view. The users store tracks `presenceServerIds` per user instead of clearing the whole slice when a new `server_users` snapshot arrives, so startup/search background rooms keep their server-rail voice badges and active voice peers do not disappear when the user views a different server.
Peer routing also has to stay scoped to the signaling server that reported the membership. A `user_left` from one signaling cluster must only subtract that cluster's shared servers; otherwise a leave on `signal.toju.app` can incorrectly tear down a peer that is still shared through `signal-sweden.toju.app` or a local signaling server. Route metadata is therefore kept across peer recreation and only cleared once the renderer no longer shares any servers with that peer.
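The presence rules above can be illustrated with two pure reducers. The names `syncServerPresence` and `presenceServerIds` appear in the commit message, but the signatures, the `PresenceUser` shape, and `applyUserLeft` are assumptions made for this sketch, and signaling-cluster scoping is left out.

```typescript
// Hypothetical user shape; the real store entity is assumed to carry more fields.
interface PresenceUser {
  oderId: string;
  presenceServerIds: string[];
}

/** Apply a server_users snapshot for one server without touching other servers. */
function syncServerPresence(
  users: PresenceUser[],
  serverId: string,
  roster: string[]
): PresenceUser[] {
  const inRoster = new Set(roster);
  const seen = new Set<string>();
  const next: PresenceUser[] = [];
  for (const user of users) {
    seen.add(user.oderId);
    const others = user.presenceServerIds.filter((id) => id !== serverId);
    const ids = inRoster.has(user.oderId) ? [...others, serverId] : others;
    // Evict only users who are no longer present in any server.
    if (ids.length > 0) next.push({ ...user, presenceServerIds: ids });
  }
  roster.forEach((oderId) => {
    if (!seen.has(oderId)) {
      seen.add(oderId);
      next.push({ oderId, presenceServerIds: [serverId] });
    }
  });
  return next;
}

/** Apply user_left for one server: partial removal, full eviction only when empty. */
function applyUserLeft(
  users: PresenceUser[],
  oderId: string,
  serverId: string
): PresenceUser[] {
  const next: PresenceUser[] = [];
  for (const user of users) {
    if (user.oderId !== oderId) {
      next.push(user);
      continue;
    }
    const ids = user.presenceServerIds.filter((id) => id !== serverId);
    if (ids.length > 0) next.push({ ...user, presenceServerIds: ids });
  }
  return next;
}
```

Because each reducer rewrites only the entries for one server, a snapshot for server A cannot erase a user who is still present in server B.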
## Peer connection lifecycle
-Peers connect to each other directly with `RTCPeerConnection`. The "initiator" (whoever was already in the room) creates the data channel and audio/video transceivers, then sends an offer. The other side creates an answer.
+Peers connect to each other directly with `RTCPeerConnection`. The initiator is chosen deterministically from the identified logical peer IDs so only one side creates the offer and primary data channel for a given pair. The other side creates an answer. If identity or negotiation is still settling, the retry timer defers instead of comparing against the ephemeral local transport ID or reusing a half-open peer forever.
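One way to picture deterministic election is a pure comparison of the two logical IDs. The README does not state the actual rule, so the lexicographic comparison below is an assumption, and `electRole` is a hypothetical name.

```typescript
type Role = "initiator" | "responder" | "defer";

/**
 * Assumed rule: the lexicographically smaller logical ID initiates.
 * Returns "defer" while either identity is unknown, so the caller can retry
 * later instead of falling back to an ephemeral transport ID.
 */
function electRole(localId: string | null, remoteId: string | null): Role {
  if (!localId || !remoteId || localId === remoteId) {
    return "defer";
  }
  return localId < remoteId ? "initiator" : "responder";
}
```

Any total order over the IDs would do, as long as both sides evaluate the same comparison: the pair then always resolves to exactly one initiator.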
```mermaid
sequenceDiagram
-participant A as Peer A (initiator)
+participant A as Peer A (elected initiator)
participant Sig as Signaling Server
participant B as Peer B
@@ -180,6 +192,16 @@ sequenceDiagram
Both peers might send offers at the same time ("glare"). The negotiation module implements the "polite peer" pattern: one side is designated polite (the non-initiator) and will roll back its local offer if it detects a collision, then accept the remote offer instead. The impolite side ignores the incoming offer.
Existing members also schedule a short `user_joined` fallback offer, and the `server_users` path now re-arms the same retry when an initial attempt stalls. The joiner still tries first via its `server_users` snapshot, but the fallback heals late-join races or half-open peers where that initial offer never arrives or never finishes. The retry uses the same deterministic initiator election as the main `server_users` path so the pair cannot regress into dual initiators.
### Non-initiator takeover
If the elected initiator's offer never arrives (stale socket, network issue, page still loading), the non-initiator does not wait forever. It tracks the start of each waiting period in `nonInitiatorWaitStart`. For the first `NON_INITIATOR_GIVE_UP_MS` (5 s) it reschedules and logs. Once that window expires it takes over: removes any stale peer, creates a fresh `RTCPeerConnection` as initiator, and sends its own offer. This ensures every peer pair eventually establishes a connection regardless of which side was originally elected.
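The waiting window can be sketched as a small decision function driven by the retry timer. `NON_INITIATOR_GIVE_UP_MS` and `nonInitiatorWaitStart` are named in the text above; the function name, the map's key type, and the return values are assumptions for illustration.

```typescript
const NON_INITIATOR_GIVE_UP_MS = 5000; // 5 s window from the text above

type WaitDecision = "keep-waiting" | "take-over";

/**
 * Called each time the retry timer fires for a peer whose offer has not arrived.
 * The first call records the start of the wait. Once the window has elapsed,
 * the entry is cleared and the caller should act as initiator.
 */
function checkNonInitiatorWait(
  nonInitiatorWaitStart: Map<string, number>,
  peerId: string,
  nowMs: number
): WaitDecision {
  const startedAt = nonInitiatorWaitStart.get(peerId);
  if (startedAt === undefined) {
    nonInitiatorWaitStart.set(peerId, nowMs);
    return "keep-waiting";
  }
  if (nowMs - startedAt < NON_INITIATOR_GIVE_UP_MS) {
    return "keep-waiting";
  }
  nonInitiatorWaitStart.delete(peerId);
  return "take-over";
}
```

On `"take-over"` the caller would remove any stale peer, create a fresh `RTCPeerConnection`, and send its own offer, as described above.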
### Stale peer replacement
Offers or ICE candidates can arrive while the existing `RTCPeerConnection` for that peer is in `failed` or `closed` state (the browser's `connectionstatechange` event hasn't fired yet to clean it up). `replaceUnusablePeer()` in `negotiation.ts` detects this, closes the dead connection, removes it from the active map, and lets the caller proceed with a fresh peer. The `connectionstatechange` handler in `create-peer-connection.ts` also guards against stale events: if the connection object no longer matches the current map entry for that peer, the event is ignored so it cannot accidentally remove a replacement peer.
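Both guards above can be sketched against a minimal structural type instead of the real `RTCPeerConnection`. The function names and the `PeerEntry` shape below are hypothetical; the actual `replaceUnusablePeer()` signature in `negotiation.ts` is not shown in this diff.

```typescript
// Structural stand-in for RTCPeerConnection so the sketch runs outside a browser.
interface PeerConnectionLike {
  connectionState: string;
  close(): void;
}

interface PeerEntry {
  connection: PeerConnectionLike;
  createdAt: number;
}

/**
 * Remove the stored peer if its connection is already failed or closed.
 * Returns true when the caller should go on to create a fresh peer.
 */
function replaceUnusablePeerSketch(peers: Map<string, PeerEntry>, peerId: string): boolean {
  const entry = peers.get(peerId);
  if (!entry) return false;
  const state = entry.connection.connectionState;
  if (state !== "failed" && state !== "closed") return false;
  entry.connection.close();
  peers.delete(peerId);
  return true;
}

/**
 * Guard for connectionstatechange handlers: events from a connection that is
 * no longer the current map entry for that peer should be ignored.
 */
function isCurrentConnection(
  peers: Map<string, PeerEntry>,
  peerId: string,
  connection: PeerConnectionLike
): boolean {
  const entry = peers.get(peerId);
  return entry !== undefined && entry.connection === connection;
}
```

The identity comparison in `isCurrentConnection` is what stops a late event from the old connection from removing its replacement.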
### Disconnect recovery
```mermaid
@@ -196,7 +218,7 @@ stateDiagram-v2
Closed --> [*]
```
-When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts: a fresh `RTCPeerConnection` is created and a new offer is sent every 5 seconds, up to 12 attempts.
+When a peer connection enters `disconnected`, a 10-second grace period starts. If it recovers on its own (network blip), nothing happens. If it reaches `failed`, the connection is torn down and a reconnect loop starts. A fresh `RTCPeerConnection` is created every 5 seconds, up to 12 attempts; only the deterministically elected initiator sends a reconnect offer, while the other side waits for that offer.
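The updated reconnect loop can be summarized as a planning function that decides what one attempt does. The constant names, `ReconnectStep`, and `nextReconnectStep` are hypothetical; only the 5-second interval, the 12-attempt limit, and the initiator-only offer come from the text above.

```typescript
const RECONNECT_INTERVAL_MS = 5000; // one attempt every 5 seconds
const MAX_RECONNECT_ATTEMPTS = 12;

type ReconnectStep =
  | { action: "send-offer" | "await-offer"; retryInMs: number }
  | { action: "give-up" };

/**
 * Plan the next reconnect attempt. `attemptsSoFar` counts attempts already made.
 * Both sides recreate the connection, but only the elected initiator offers.
 */
function nextReconnectStep(attemptsSoFar: number, isElectedInitiator: boolean): ReconnectStep {
  if (attemptsSoFar >= MAX_RECONNECT_ATTEMPTS) {
    return { action: "give-up" };
  }
  return {
    action: isElectedInitiator ? "send-offer" : "await-offer",
    retryInMs: RECONNECT_INTERVAL_MS
  };
}
```

Separating the decision from the timer keeps the "only one side offers" rule testable without a browser.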
## Data channel