Toju/agents-docs/features/voice-signaling.md
brogeby b19c39208c docs: populate initial cross-context feature docs
Add area-level documentation for the five most significant cross-context
feature areas under agents-docs/features/:

- websocket-envelopes: full envelope catalogue, lifecycle, dispatcher
- ipc-bridge: window.electronAPI surface, IPC channels, CQRS dispatch
- plugin-system: manifest contract, runtime, capabilities, plugin-support API
- server-directory: REST endpoints, CQRS, entities, business rules
- voice-signaling: mesh signaling, RNNoise pipeline, domain split

Update agents-docs/FEATURES.md index alphabetically and remove the
"no cross-context feature docs" placeholder.

Each doc records honest TODOs for verified gaps (stale signaling-contracts.ts,
window.api vs window.electronAPI mismatch, IPC error envelope drift from
CONTEXT.md, missing OpenAPI coverage for server-directory routes, no
envelope round-trip test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:36:36 +02:00

Voice & WebRTC Signaling

Area: voice-signaling | Status: Active | Last updated: 2026-05-25

Overview

Voice and screen-share in Toju are pure WebRTC mesh: peers establish RTCPeerConnections directly, while the signaling server only forwards SDP and ICE messages. This area covers the end-to-end flow — envelope routing, peer election, RTCPeerConnection lifecycle, RNNoise denoising, and the relationships between the three product-client domains involved: voice-session, voice-connection, and direct-call. Screen-share rides on the same peer connection; its UI orchestration is its own domain but the signaling path is shared.

Responsibilities

  • Negotiate WebRTC sessions between peers using offer / answer / ice_candidate envelopes forwarded by the signaling server.
  • Elect an initiator deterministically when multiple peers arrive simultaneously, with a non-initiator fallback timer.
  • Maintain the local audio pipeline: mic capture → optional RNNoise denoising → RTCPeerConnection sender.
  • Track per-peer playback gain, mute, deafen, and speaking-activity state on the receive side.
  • Mirror voice presence (voice_state) and direct-call signaling (direct-call) to other peers via the WebSocket.

This area does not own:

  • The WebSocket envelope shape (see websocket-envelopes).
  • Screen-share UI orchestration (its own domain at toju-app/src/app/domains/screen-share/); only the peer connection plumbing is shared.
  • Persistent user settings beyond voiceSettingsStorage (audio device IDs, volumes, bitrate, latency profile, noise-reduction toggle, persisted to localStorage).

Key concepts

  • Mesh — every participant holds an RTCPeerConnection per other participant. No SFU / MCU.
  • Voice session — high-level "user is currently in voice room X" state. Owned by voice-session domain.
  • Voice connection — low-level transport/peer concerns: speaking detection, per-peer gain, mute / deafen state. Owned by voice-connection domain.
  • Direct call — 1:1 voice/video call with an optional group-upgrade path. Owned by direct-call domain.
  • Initiator — the peer responsible for sending the first offer. Elected first-peer-wins; non-initiators wait NON_INITIATOR_GIVE_UP_MS (≈5 s) before generating their own offer.
  • Data channel: a chat-labelled data channel established alongside each peer connection for P2P chat fallback and direct-message delivery.
  • Noise suppressor worklet — RNNoise WASM running in an AudioWorkletNode (NoiseSuppressorWorklet), loaded from rnnoise-worklet.js at the app root.

Signaling envelopes (consumed)

Defined in websocket-envelopes. Voice-relevant types:

  • offer, answer, ice_candidate — forwarded by the server to targetUserId without inspection.
  • direct-call — forwarded; payload carries call-scoped events (ring, participant join/leave, call end).
  • voice_state — broadcast to a server. Payload includes roomId, voiceGateway, mute/deafen flags.
  • server_users — full peer roster on join; seeds the initial offer fan-out.
  • user_joined — schedules a fallback offer after a grace delay (USER_JOINED_FALLBACK_OFFER_DELAY_MS, ≈1 s).
  • user_left — peer teardown, with special handling that preserves peers still under an active voice session.
  • connected / access_denied — connection lifecycle (server bootstrap and authorization).

The server is purely signaling: it does not track which oderId is in which voice room. Voice membership is derived client-side from the voice_state broadcasts observed on the server.
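
Deriving membership on the client can be sketched as a reducer over the voice_state broadcasts a client observes. This is a minimal illustration, not the app's implementation: the userId field, the convention that a null roomId means "left voice", and the names VoiceStatePayload and applyVoiceState are assumptions. Only roomId and the mute/deafen flags come from the payload description above.

```typescript
// Hypothetical payload shape: only roomId and the mute/deafen flags are
// described in this document; userId and the null-roomId convention are assumed.
interface VoiceStatePayload {
  userId: string;
  roomId: string | null; // assumed: null means "left voice"
  isMuted: boolean;
  isDeafened: boolean;
}

// roomId -> IDs of users currently observed in that room.
type VoiceRoster = Map<string, Set<string>>;

// Folds one observed voice_state broadcast into a new roster snapshot.
function applyVoiceState(roster: VoiceRoster, msg: VoiceStatePayload): VoiceRoster {
  const next: VoiceRoster = new Map<string, Set<string>>();
  roster.forEach((members, roomId) => {
    const remaining = new Set<string>();
    members.forEach((id) => {
      if (id !== msg.userId) remaining.add(id); // a user occupies at most one room
    });
    if (remaining.size > 0) next.set(roomId, remaining);
  });
  if (msg.roomId !== null) {
    const members = next.get(msg.roomId) ?? new Set<string>();
    members.add(msg.userId);
    next.set(msg.roomId, members);
  }
  return next;
}
```

Returning a fresh Map per update keeps each roster an immutable snapshot, which tends to fit signal-based state; a mutable map would work equally well for the bookkeeping itself.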


Session establishment flow

A new participant joining a voice room produces this exchange (initiator perspective; symmetrical when both arrive at once):

  1. Local user clicks "Join voice" → VoiceSessionFacade.startSession() populates the session model and asks voice-connection to ready peer transport.
  2. Server broadcasts user_joined to existing peers.
  3. Each existing peer evaluates: am I the elected initiator for the (me, new-peer) pair? If yes, the peer-connection manager calls doCreateAndSendOffer().
  4. Initiator constructs new RTCPeerConnection({ iceServers }) (infrastructure/realtime/peer-connection-manager/.../create-peer-connection.ts), adds local tracks, creates the data channel chat, generates an SDP offer, and sends it via the signaling transport.
  5. Responder receives offer; doHandleOffer() sets the remote description, generates an SDP answer, and sends answer.
  6. Initiator receives answer; doHandleAnswer() sets the remote description.
  7. Both sides emit ice_candidate as they gather candidates via onicecandidate.
  8. iceConnectionState reaches connected / completed → media flows.
  9. Either side may open the chat data channel for P2P text payloads (direct messages, etc.).

If the elected initiator does not send an offer within NON_INITIATOR_GIVE_UP_MS, the non-initiator promotes itself and sends the offer instead. This preserves liveness when only one side of the pair drops out.
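
The non-initiator fallback can be sketched as a per-peer timer that is cancelled when the remote offer arrives. This sketch assumes the election result is supplied by the caller as isInitiator, since the election rule itself is not specified here. OfferFallback, Schedule, and timeoutScheduler are hypothetical names, and the 5000 ms value reflects the approximate figure quoted above rather than the real constant.

```typescript
// Approximate value from this document (≈5 s); the real constant may differ.
const NON_INITIATOR_GIVE_UP_MS = 5000;

type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout; tests can inject a fake one.
const timeoutScheduler: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

class OfferFallback {
  private readonly pending = new Map<string, Cancel>();
  private readonly sendOffer: (peerId: string) => void;
  private readonly schedule: Schedule;

  constructor(sendOffer: (peerId: string) => void, schedule: Schedule = timeoutScheduler) {
    this.sendOffer = sendOffer;
    this.schedule = schedule;
  }

  // Called when a peer appears. How isInitiator is decided is out of scope here.
  onPeerAppeared(peerId: string, isInitiator: boolean): void {
    this.cancel(peerId);
    if (isInitiator) {
      this.sendOffer(peerId);
      return;
    }
    const cancelTimer = this.schedule(() => {
      this.pending.delete(peerId);
      this.sendOffer(peerId); // promote self: the elected initiator stayed silent
    }, NON_INITIATOR_GIVE_UP_MS);
    this.pending.set(peerId, cancelTimer);
  }

  // Called when the remote offer arrives or the peer leaves.
  cancel(peerId: string): void {
    const cancelTimer = this.pending.get(peerId);
    if (cancelTimer) cancelTimer();
    this.pending.delete(peerId);
  }
}
```

Injecting the scheduler keeps the timing logic testable without real delays; in the renderer the default setTimeout-backed scheduler would presumably be enough.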

user_left is treated carefully: the signaling-message-handler.spec.ts covers the case where a peer is still required by an active voice session and must not be torn down, even if other parts of the system think the peer has disconnected.


Domain responsibilities

voice-session (toju-app/src/app/domains/voice-session/)

  • VoiceSessionFacade (application/facades/voice-session.facade.ts) — owns the active session metadata (serverId, roomId, participantIds); drives a showFloatingControls signal when the user navigates away from the room.
  • VoiceWorkspaceService (application/services/voice-workspace.service.ts) — UI state for the workspace (hidden / expanded / minimized), focused stream ID, mini-window position.
  • voiceSettingsStorage (infrastructure/util/voice-settings-storage.util.ts) — localStorage persistence: input/output device IDs, output volume (0–100), bitrate (32–256 kbps), latency profile (low | balanced | high), noise-reduction toggle.
  • Joining a new voice target first calls endSession() so transitions cannot leak peer connections.
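
Reading persisted settings defensively could look like the sketch below. It assumes a JSON-serialised object and reads the documented ranges as 0 to 100 for volume and 32 to 256 kbps for bitrate. The field names, the default values, and parseVoiceSettings itself are hypothetical and not taken from voice-settings-storage.util.ts.

```typescript
type LatencyProfile = "low" | "balanced" | "high";
const LATENCY_PROFILES: LatencyProfile[] = ["low", "balanced", "high"];

// Hypothetical field names; the real storage schema may differ.
interface VoiceSettings {
  inputDeviceId: string | null;
  outputDeviceId: string | null;
  outputVolume: number; // 0 to 100
  bitrateKbps: number; // 32 to 256
  latencyProfile: LatencyProfile;
  noiseReduction: boolean;
}

// Illustrative defaults, not the application's actual values.
const DEFAULT_VOICE_SETTINGS: VoiceSettings = {
  inputDeviceId: null,
  outputDeviceId: null,
  outputVolume: 100,
  bitrateKbps: 96,
  latencyProfile: "balanced",
  noiseReduction: true,
};

function clampOr(value: unknown, min: number, max: number, fallback: number): number {
  if (typeof value !== "number" || !isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

// Parses a raw localStorage value, falling back to defaults field by field.
function parseVoiceSettings(raw: string | null): VoiceSettings {
  let stored: Record<string, unknown> = {};
  try {
    const parsed: unknown = raw === null ? null : JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      stored = parsed as Record<string, unknown>;
    }
  } catch {
    // Corrupt JSON: fall through and use defaults.
  }
  const d = DEFAULT_VOICE_SETTINGS;
  const input = stored["inputDeviceId"];
  const output = stored["outputDeviceId"];
  const profile = stored["latencyProfile"];
  const nr = stored["noiseReduction"];
  return {
    inputDeviceId: typeof input === "string" ? input : d.inputDeviceId,
    outputDeviceId: typeof output === "string" ? output : d.outputDeviceId,
    outputVolume: clampOr(stored["outputVolume"], 0, 100, d.outputVolume),
    bitrateKbps: clampOr(stored["bitrateKbps"], 32, 256, d.bitrateKbps),
    latencyProfile:
      LATENCY_PROFILES.indexOf(profile as LatencyProfile) !== -1
        ? (profile as LatencyProfile)
        : d.latencyProfile,
    noiseReduction: typeof nr === "boolean" ? nr : d.noiseReduction,
  };
}
```

In the renderer this would be called as parseVoiceSettings(localStorage.getItem(key)), where key is whatever storage key the app uses.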

voice-connection (toju-app/src/app/domains/voice-connection/)

Bridges the application layer to the low-level WebRTC infrastructure under toju-app/src/app/infrastructure/realtime/.

  • VoiceActivityService — RMS-based speaking detection via AnalyserNode (fftSize 256, RMS ≥ 0.015, 8-frame grace period).
  • VoicePlaybackService — per-peer GainNode chains (0–200% range), localStorage-persisted; deafen sets all gains to 0.
  • VoiceConnectionFacade — exposes signals like isVoiceConnected, isMuted; methods like toggleMute(), toggleNoiseReduction(), setOutputVolume().

Per the domain README, voice-connection does not own RTCPeerConnection construction or signaling — those live in infrastructure/realtime/peer-connection-manager.
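
The speaking detection described above can be sketched as an RMS threshold with a hold-over counter. The threshold (0.015) and the 8-frame grace period come from this document; the exact grace semantics are an assumption (here the speaking state drops on the 8th consecutive quiet frame), and SpeakingDetector is a hypothetical name. Frames are assumed to be time-domain samples in [-1, 1], such as those AnalyserNode.getFloatTimeDomainData() fills.

```typescript
const SPEAKING_RMS_THRESHOLD = 0.015;
const GRACE_FRAMES = 8;

// Root mean square of one frame of time-domain samples in [-1, 1].
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumOfSquares = 0;
  samples.forEach((s) => {
    sumOfSquares += s * s;
  });
  return Math.sqrt(sumOfSquares / samples.length);
}

class SpeakingDetector {
  // Starts saturated so the initial state is "not speaking".
  private quietFrames = GRACE_FRAMES;

  // Feed one analyser frame; returns the current speaking state.
  update(samples: Float32Array): boolean {
    if (rms(samples) >= SPEAKING_RMS_THRESHOLD) {
      this.quietFrames = 0; // loud frame: (re)start the grace window
    } else if (this.quietFrames < GRACE_FRAMES) {
      this.quietFrames++; // quiet frame inside the grace window
    }
    return this.quietFrames < GRACE_FRAMES;
  }
}
```

The grace window keeps the indicator from flickering between words; with fftSize 256 each frame carries 256 samples.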

direct-call (toju-app/src/app/domains/direct-call/)

  • Initiator flow (DirectCallService.startCall()): create/reuse the 1:1 DM, start a call-scoped voice session, send a direct-call "ring" envelope via PeerDeliveryService.
  • Recipient flow: store incoming session, ring assets/audio/call.wav (unless DND), show in-app modal + desktop notification.
  • Group upgrade: adding a third participant spawns a new group conversation; the active call swaps its chat panel to the new conversation but original DM history is preserved.
  • Invariant: incoming direct-call events are ignored unless the local user is in participantIds.
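
The participantIds invariant and the DND rule can be expressed as a small classifier that runs before any other handling. The event shape, the kind names, and the action names below are hypothetical. The only behaviour taken from this document is that events are ignored when the local user is not a participant and that the ring sound is skipped under DND.

```typescript
// Hypothetical event shape; only participantIds is named in this document.
type DirectCallKind = "ring" | "participant-joined" | "participant-left" | "ended";

interface DirectCallEvent {
  callId: string;
  kind: DirectCallKind;
  participantIds: string[];
}

type DirectCallAction = "ignore" | "ring" | "notify-silently" | "update";

function classifyDirectCallEvent(
  event: DirectCallEvent,
  localUserId: string,
  doNotDisturb: boolean,
): DirectCallAction {
  // Invariant: events for calls the local user is not part of are dropped.
  if (event.participantIds.indexOf(localUserId) === -1) return "ignore";
  if (event.kind !== "ring") return "update";
  // Assumed reading of "unless DND": still surface the call, without the sound.
  return doNotDisturb ? "notify-silently" : "ring";
}
```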

Screen share (toju-app/src/app/domains/screen-share/)

  • Adds dedicated MediaStreamTrack senders to the existing peer connection (does not open a new one).
  • Request / response model: a receiver sends screen-share-request; the sender attaches the share track; screen-share-stop tears it down.
  • Quality presets: low / balanced / high (resolution + FPS).
  • On Electron, ScreenShareSourcePickerService drives a Promise-based picker over getSources (see ipc-bridge).

RNNoise pipeline

Manager: infrastructure/realtime/media/noise-reduction.manager.ts.

Raw mic → MediaStreamAudioSourceNode → NoiseSuppressorWorklet (AudioWorkletNode) → MediaStreamAudioDestinationNode → clean stream → RTCPeerConnection sender

  • AudioContext at 48 kHz.
  • Worklet loaded from rnnoise-worklet.js (built from @timephy/rnnoise-wasm, output written to toju-app/public/).
  • If worklet load fails, the raw stream is passed through unchanged.
  • Mute takes priority — when muted, noise reduction is also disabled.
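
The mute-priority and fallback rules can be sketched independently of the Web Audio graph. In this sketch the denoise callback stands in for whatever wires the source, worklet, and destination nodes together; buildOutgoingAudio and its parameter names are hypothetical and do not mirror noise-reduction.manager.ts.

```typescript
interface OutgoingAudio<S> {
  stream: S;
  path: "raw" | "denoised";
}

// S is generic so the rule can be exercised without a real MediaStream.
async function buildOutgoingAudio<S>(
  raw: S,
  state: { muted: boolean; noiseReductionEnabled: boolean },
  denoise: (raw: S) => Promise<S>, // hypothetical: builds source -> worklet -> destination
): Promise<OutgoingAudio<S>> {
  // Mute takes priority: skip the worklet path entirely while muted.
  if (state.muted || !state.noiseReductionEnabled) {
    return { stream: raw, path: "raw" };
  }
  try {
    return { stream: await denoise(raw), path: "denoised" };
  } catch {
    // Worklet failed to load or start: pass the raw stream through unchanged.
    return { stream: raw, path: "raw" };
  }
}
```

Returning the chosen path alongside the stream makes a silent fallback observable, which may help when diagnosing why denoising appears to have no effect.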

Technical implementation

  • Envelope types: see websocket-envelopes.
  • Signaling adapter (renderer): toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.ts (and signaling-transport-handler.ts).
  • Peer-connection manager: toju-app/src/app/infrastructure/realtime/peer-connection-manager/create-peer-connection.ts, recovery (grace timers, reconnect), data-channel plumbing.
  • Voice settings: domains/voice-session/infrastructure/util/voice-settings-storage.util.ts.
  • Noise reduction: infrastructure/realtime/media/noise-reduction.manager.ts.
  • Worklet asset: toju-app/public/rnnoise-worklet.js.
  • Server side: signaling only — server/src/websocket/handler.ts::forwardRtcMessage.

Invariants

  • The server forwards offer / answer / ice_candidate / direct-call envelopes opaquely and never persists media or call state.
  • Switching voice rooms always tears down the prior session before starting the new one.
  • Mute overrides noise reduction (the manager disables the worklet path when muted).
  • Direct-call events with the local user absent from participantIds are ignored.

Testing

  • toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.spec.ts: user_left peer preservation under active voice.
  • toju-app/src/app/infrastructure/realtime/peer-connection-manager/recovery/peer-recovery.spec.ts — reconnect, grace timers, exponential backoff.
  • toju-app/src/app/infrastructure/realtime/peer-connection-manager/messaging/data-channel.spec.ts.
  • toju-app/src/app/domains/direct-call/application/services/direct-call.service.spec.ts.
  • E2E: e2e/tests/voice/multi-signal-eight-user-voice.spec.ts, e2e/tests/voice/direct-call.spec.ts (verify exact filenames in the suite — TODO).

Security considerations

  • WebRTC bypasses the server entirely once connected — peer IPs may be exposed to other participants via ICE candidates. Standard WebRTC privacy caveat.
  • Signaling envelopes are forwarded without verifying that source and target share a server — TODO: confirm whether forwardRtcMessage enforces membership.
  • The data channel chat carries P2P text payloads; integrity / authentication of those payloads is owned by the chat/direct-message domains, not by this area.
  • RNNoise runs entirely client-side; mic audio never leaves the local AudioContext until it enters the encrypted RTCPeerConnection.
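
If the membership TODO above turns out to be a real gap, one possible shape for a check is sketched below. This is purely illustrative and does not describe what forwardRtcMessage does today: ClientSession, RtcEnvelope, and forwardIfCoMembers are hypothetical names, and the per-connection serverIds set is an assumed piece of server state.

```typescript
interface RtcEnvelope {
  type: "offer" | "answer" | "ice_candidate";
  targetUserId: string;
  payload: unknown; // forwarded opaquely, never inspected
}

// Hypothetical per-connection state held by the signaling server.
interface ClientSession {
  serverIds: Set<string>;
  send(message: unknown): void;
}

function sharesServer(a: Set<string>, b: Set<string>): boolean {
  let shared = false;
  a.forEach((id) => {
    shared = shared || b.has(id);
  });
  return shared;
}

// Forwards the envelope only when sender and target share at least one server.
// Returns true when the envelope was delivered.
function forwardIfCoMembers(
  sessions: Map<string, ClientSession>,
  fromUserId: string,
  envelope: RtcEnvelope,
): boolean {
  const source = sessions.get(fromUserId);
  const target = sessions.get(envelope.targetUserId);
  if (!source || !target) return false;
  if (!sharesServer(source.serverIds, target.serverIds)) return false;
  target.send(envelope);
  return true;
}
```

The check only reads connection metadata, so the payload stays opaque and the forwarding invariant is unchanged.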

Performance considerations

  • Mesh topology — N×(N-1)/2 peer connections per voice room in total, with each client holding N-1 of them. Practical ceiling is bound by client CPU and uplink; no documented soft cap.
  • Bitrate is client-controlled (32–256 kbps); no server-enforced QoS.
  • Voice activity detection runs at fftSize 256 with an 8-frame grace period — chosen to minimise CPU while staying responsive to natural speech.
  • The signaling server's only cost is envelope forwarding (O(1) per envelope).
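
The mesh cost can be made concrete with a small calculation. The sketch below counts connections for N participants and gives a rough per-client audio uplink estimate, assuming each client sends one audio stream per peer at the configured bitrate and ignoring RTP, ICE, and screen-share overhead. meshCost is a hypothetical helper, not part of the codebase.

```typescript
interface MeshCost {
  perClient: number; // peer connections held by each participant
  total: number; // peer connections across the whole room
  uplinkKbps: number; // rough audio uplink per client, overhead ignored
}

function meshCost(participants: number, bitrateKbps: number): MeshCost {
  if (participants < 0 || Math.floor(participants) !== participants) {
    throw new RangeError("participants must be a non-negative integer");
  }
  const perClient = Math.max(0, participants - 1);
  return {
    perClient,
    total: (participants * perClient) / 2,
    uplinkKbps: perClient * bitrateKbps,
  };
}

// Eight participants at 64 kbps: 7 connections per client, 28 in the room,
// and roughly 448 kbps of audio uplink per client.
const eightUserRoom = meshCost(8, 64);
```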

Known issues and limitations

  • No SFU / MCU. Large rooms scale linearly with participant count on each client.
  • No recording or server-side mixing for voice or screen.
  • Bitrate is not enforced server-side — adversarial clients could ignore the suggested range.
  • No documented call-quality telemetry pipeline.

Related areas

  • websocket-envelopes — owns the wire types this area consumes.
  • ipc-bridge: getSources and the Linux audio-routing methods are used by screen-share.
  • plugin-system — plugins may participate as observers via voice_state broadcasts (subject to capability grants); no direct call control surface today.

Changelog

Date | Change
2026-05-25 | Initial documentation