fix: recurriing network issue

2026-04-30 04:04:34 +02:00
parent b1fe286be8
commit a49e18b9f0
16 changed files with 522 additions and 17 deletions
--- a/toju-app/src/app/infrastructure/realtime/README.md
+++ b/toju-app/src/app/infrastructure/realtime/README.md
@@ -115,18 +115,22 @@ graph TD

 ## Signaling (WebSocket)

-The signaling layer's only job is getting two peers to exchange SDP offers/answers and ICE candidates so they can establish a direct WebRTC connection. Once the peer connection is up, signaling is only used for presence (user joined/left) and reconnection.
+The signaling layer gets peers to exchange SDP offers/answers and ICE candidates so they can establish direct WebRTC connections. It also carries identity, room membership, presence, typing, and selected server-relayed fallback events when the peer data channel is unavailable.

 Each signaling URL gets its own `SignalingManager` (one WebSocket each). `SignalingTransportHandler` picks the right socket based on which server the message is for. `ServerSignalingCoordinator` tracks which peers belong to which servers and which signaling URLs, so we know when it is safe to tear down a peer connection after leaving a server.

 Room affinity is authoritative at this layer as well. The renderer repairs each room's saved `sourceId` / `sourceUrl` from server-directory responses and routes `join_server`, `view_server`, and room-scoped signaling traffic to that room's signaling URL first. If that route fails, alternate endpoints can be tried temporarily, but server-scoped raw messages are no longer broadcast to every connected signaling manager when the route is unknown.

+Server-relayed fallbacks are intentionally narrow. Room chat (`chat_message`), direct-message events (`direct-message`, `direct-message-status`, `direct-message-mutation`), and voice presence (`voice_state`) may flow over signaling so users can still see written chat and voice roster state while P2P data channels are down. Media, attachments, message inventory sync, screen/camera state, and plugin data-channel traffic remain peer-plane responsibilities.
+
 In UI/debug conversations, a **chat-server** means one of the saved rooms navigated from the server rail. Each chat-server has its own assigned signal server via `sourceId` / `sourceUrl`, and room-scoped feature/config checks must prefer that signal server before considering any global active endpoint. For example, KLIPY GIF picker visibility is resolved against the currently viewed chat-server's signal server so an unrelated offline chat-server does not hide the button everywhere.

 Cold-start routing now waits for the initial server-directory health probes so same-backend aliases can collapse to one canonical signaling endpoint before any saved rooms reconnect. When a room is reconnected on a chosen socket, its background rooms are re-joined on that same socket as well so stale per-signal memberships do not keep orphan managers alive, and reconnect replay only sends `view_server` for rooms that manager still has joined.

 This is still a non-federated model. Different signaling servers do not share peer registries or relay WebRTC offers for each other, so users in the same room must converge on the same signaling endpoint to discover one another reliably.

+The fallback path is fragile by design: it only helps when a usable signaling socket exists. If a production origin returns Cloudflare `521`/`522` or the WebSocket closes with `1006`, room reconnect must continue to other active compatible endpoints instead of treating the room as missing or the client as incompatible.
+
 ```mermaid
 sequenceDiagram
    participant UI as App