Voice & WebRTC Signaling
Area: voice-signaling Status: Active Last updated: 2026-05-25
Overview
Voice and screen-share in Toju are pure WebRTC mesh: peers establish RTCPeerConnections directly, while the signaling server only forwards SDP and ICE messages. This area covers the end-to-end flow — envelope routing, peer election, RTCPeerConnection lifecycle, RNNoise denoising, and the relationships between the three product-client domains involved: voice-session, voice-connection, and direct-call. Screen-share rides on the same peer connection; its UI orchestration is its own domain but the signaling path is shared.
Responsibilities
- Negotiate WebRTC sessions between peers using offer/answer/ice_candidate envelopes forwarded by the signaling server.
- Elect an initiator deterministically when multiple peers arrive simultaneously, with a non-initiator fallback timer.
- Maintain the local audio pipeline: mic capture → optional RNNoise denoising → RTCPeerConnection sender.
- Track per-peer playback gain, mute, deafen, and speaking-activity state on the receive side.
- Mirror voice presence (voice_state) and direct-call signalling (direct-call) to other peers via the WebSocket.
This area does not own:
- The WebSocket envelope shape (see websocket-envelopes).
- Screen-share UI orchestration (its own domain at toju-app/src/app/domains/screen-share/); only the peer connection plumbing is shared.
- Persistent user settings beyond voiceSettingsStorage (audio device IDs, volumes, bitrate, latency profile, noise-reduction toggle, persisted to localStorage).
Key concepts
- Mesh: every participant holds an RTCPeerConnection per other participant. No SFU / MCU.
- Voice session: high-level "user is currently in voice room X" state. Owned by the voice-session domain.
- Voice connection: low-level transport/peer concerns such as speaking detection, per-peer gain, and mute / deafen state. Owned by the voice-connection domain.
- Direct call: 1:1 voice/video call with an optional group-upgrade path. Owned by the direct-call domain.
- Initiator: the peer responsible for sending the first offer. Elected first-peer-wins; non-initiators wait NON_INITIATOR_GIVE_UP_MS (≈5 s) before generating their own offer.
- Data channel: a chat-labelled data channel established alongside each peer connection for P2P chat fallback and direct-message delivery.
- Noise suppressor worklet: RNNoise WASM running in an AudioWorkletNode (NoiseSuppressorWorklet), loaded from rnnoise-worklet.js at the app root.
Signaling envelopes (consumed)
Defined in websocket-envelopes. Voice-relevant types:
- offer, answer, ice_candidate: forwarded by the server to targetUserId without inspection.
- direct-call: forwarded; payload carries call-scoped events (ring, participant join/leave, call end).
- voice_state: broadcast to a server. Payload includes roomId, voiceGateway, mute/deafen flags.
- server_users: full peer roster on join; seeds the initial offer fan-out.
- user_joined: schedules a fallback offer after a grace delay (USER_JOINED_FALLBACK_OFFER_DELAY_MS, ≈1 s).
- user_left: peer teardown, with special handling that preserves peers still under an active voice session.
- connected / access_denied: connection lifecycle (server bootstrap and authorization).
The server is purely signaling: it does not track which oderId is in which voice room. Voice membership is derived client-side from the voice_state broadcasts observed on the server.
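To illustrate how membership might be derived client-side from voice_state broadcasts, here is a minimal sketch. The payload shape is an assumption: only roomId and the mute/deafen flags are named above, so userId, isMuted, isDeafened and the "null roomId means left voice" convention are hypothetical, and voiceGateway is omitted because its shape is not documented here.

```typescript
// Hypothetical payload shape: field names other than roomId are assumptions.
interface VoiceStatePayload {
  userId: string;        // assumed identifier of the broadcasting user
  roomId: string | null; // assumed convention: null means "not in any voice room"
  isMuted: boolean;
  isDeafened: boolean;
}

// roomId -> set of user IDs currently observed in that room
type VoiceRoster = Map<string, Set<string>>;

// Returns a new roster with the user moved to the room named in the broadcast.
function applyVoiceState(roster: VoiceRoster, state: VoiceStatePayload): VoiceRoster {
  const next: VoiceRoster = new Map<string, Set<string>>();

  // Copy every room, dropping the user from wherever they were before.
  roster.forEach((members, roomId) => {
    const remaining = new Set<string>();
    members.forEach((id) => {
      if (id !== state.userId) remaining.add(id);
    });
    if (remaining.size > 0) next.set(roomId, remaining);
  });

  // Re-add the user to the room they announced, if any.
  if (state.roomId !== null) {
    const members = next.get(state.roomId) ?? new Set<string>();
    members.add(state.userId);
    next.set(state.roomId, members);
  }
  return next;
}
```

Because each broadcast replaces the user's previous position, a client that applies every observed voice_state in order ends up with a roster that never lists a user in two rooms at once.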
Session establishment flow
A new participant joining a voice room produces this exchange (initiator perspective; symmetrical when both arrive at once):
- Local user clicks "Join voice" → VoiceSessionFacade.startSession() populates the session model and asks voice-connection to ready peer transport.
- Server broadcasts user_joined to existing peers.
- Each existing peer evaluates: am I the elected initiator for the (me, new-peer) pair? If yes, the peer-connection manager calls doCreateAndSendOffer().
- Initiator constructs new RTCPeerConnection({ iceServers }) (infrastructure/realtime/peer-connection-manager/.../create-peer-connection.ts), adds local tracks, creates the data channel chat, generates an SDP offer, and sends it via the signaling transport.
- Responder receives offer → doHandleOffer() sets remote description, generates SDP answer, sends answer.
- Initiator receives answer → doHandleAnswer() sets remote description.
- Both sides emit ice_candidate as they gather candidates via onicecandidate.
- iceConnectionState reaches connected/completed → media flows.
- Either side may open the chat data channel for P2P text payloads (direct messages, etc.).
If the elected initiator never sends an offer within NON_INITIATOR_GIVE_UP_MS, the non-initiator promotes itself and initiates instead, which preserves liveness across asymmetric drop-outs.
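The election and fallback described above can be sketched as two pure functions. This is only an illustration under stated assumptions: the exact election rule is not spelled out here, so "earlier arrival initiates, ties broken by comparing user IDs" is a guess at what first-peer-wins means, PeerArrival is a hypothetical type, and 5000 ms stands in for the documented ≈5 s.

```typescript
// Assumed value: the doc only says NON_INITIATOR_GIVE_UP_MS is roughly 5 s.
const NON_INITIATOR_GIVE_UP_MS = 5000;

// Hypothetical record of when a peer was first seen.
interface PeerArrival {
  userId: string;
  joinedAtMs: number;
}

// Assumed rule: the earlier arrival initiates; simultaneous arrivals are
// resolved by comparing IDs so both sides reach the same answer.
function isInitiator(me: PeerArrival, other: PeerArrival): boolean {
  if (me.joinedAtMs !== other.joinedAtMs) {
    return me.joinedAtMs < other.joinedAtMs;
  }
  return me.userId < other.userId;
}

// A non-initiator gives up waiting once the timeout passes with no offer.
function shouldSelfPromote(waitedMs: number, offerReceived: boolean): boolean {
  return !offerReceived && waitedMs >= NON_INITIATOR_GIVE_UP_MS;
}
```

The useful property is symmetry: for any two distinct peers exactly one of isInitiator(a, b) and isInitiator(b, a) is true, so simultaneous arrivals do not produce two competing offers unless the fallback fires.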
user_left is treated carefully: the signaling-message-handler.spec.ts covers the case where a peer is still required by an active voice session and must not be torn down, even if other parts of the system think the peer has disconnected.
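The user_left rule reduces to a small guard. The sketch below is an assumption about its shape, not the handler's actual code: ActiveVoiceSession and shouldTearDownPeer are hypothetical names, and only participantIds comes from the session metadata described in this document.

```typescript
// Hypothetical view of the active session; participantIds is named in the doc.
interface ActiveVoiceSession {
  participantIds: ReadonlySet<string>;
}

// Tear the peer down only when no active voice session still needs it.
function shouldTearDownPeer(peerId: string, session: ActiveVoiceSession | null): boolean {
  if (session === null) return true;
  return !session.participantIds.has(peerId);
}
```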
Domain responsibilities
voice-session (toju-app/src/app/domains/voice-session/)
- VoiceSessionFacade (application/facades/voice-session.facade.ts): owns the active session metadata (serverId, roomId, participantIds); drives a showFloatingControls signal when the user navigates away from the room.
- VoiceWorkspaceService (application/services/voice-workspace.service.ts): UI state for the workspace (hidden / expanded / minimized), focused stream ID, mini-window position.
- voiceSettingsStorage (infrastructure/util/voice-settings-storage.util.ts): localStorage persistence for input/output device IDs, output volume (0–100), bitrate (32–256 kbps), latency profile (low | balanced | high), and the noise-reduction toggle.
- Joining a new voice target first calls endSession() so transitions cannot leak peer connections.
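As an illustration of the ranges listed for voiceSettingsStorage, the following sketch parses a stored JSON string and clamps values into the documented bounds. The field names, the default values, and the parseVoiceSettings helper are all hypothetical; only the ranges (0–100, 32–256 kbps) and the three latency profiles come from this document.

```typescript
type LatencyProfile = "low" | "balanced" | "high";

// Hypothetical field names; the real storage schema is not shown in this doc.
interface VoiceSettings {
  inputDeviceId: string | null;
  outputDeviceId: string | null;
  outputVolume: number; // documented range 0–100
  bitrateKbps: number;  // documented range 32–256
  latencyProfile: LatencyProfile;
  noiseReduction: boolean;
}

// Assumed defaults, chosen only for the example.
const DEFAULT_VOICE_SETTINGS: VoiceSettings = {
  inputDeviceId: null,
  outputDeviceId: null,
  outputVolume: 100,
  bitrateKbps: 96,
  latencyProfile: "balanced",
  noiseReduction: true,
};

function clampOr(value: unknown, min: number, max: number, fallback: number): number {
  if (typeof value !== "number" || !isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

// Accepts what localStorage.getItem() returns: a JSON string or null.
function parseVoiceSettings(json: string | null): VoiceSettings {
  let raw: Partial<VoiceSettings> = {};
  try {
    const parsed: unknown = json === null ? {} : JSON.parse(json);
    if (typeof parsed === "object" && parsed !== null) {
      raw = parsed as Partial<VoiceSettings>;
    }
  } catch {
    // Corrupt JSON falls back to defaults.
  }
  const d = DEFAULT_VOICE_SETTINGS;
  const profile = raw.latencyProfile;
  return {
    inputDeviceId: typeof raw.inputDeviceId === "string" ? raw.inputDeviceId : d.inputDeviceId,
    outputDeviceId: typeof raw.outputDeviceId === "string" ? raw.outputDeviceId : d.outputDeviceId,
    outputVolume: clampOr(raw.outputVolume, 0, 100, d.outputVolume),
    bitrateKbps: clampOr(raw.bitrateKbps, 32, 256, d.bitrateKbps),
    latencyProfile:
      profile === "low" || profile === "balanced" || profile === "high" ? profile : d.latencyProfile,
    noiseReduction: typeof raw.noiseReduction === "boolean" ? raw.noiseReduction : d.noiseReduction,
  };
}
```

Validating on read rather than on write means a value edited by hand in localStorage, or written by an older build, cannot push the audio pipeline outside the supported ranges.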
voice-connection (toju-app/src/app/domains/voice-connection/)
Bridges the application layer to the low-level WebRTC infrastructure under toju-app/src/app/infrastructure/realtime/.
- VoiceActivityService: RMS-based speaking detection via AnalyserNode (fftSize 256, RMS ≥ 0.015, 8-frame grace period).
- VoicePlaybackService: per-peer GainNode chains (0–200% range), localStorage-persisted; deafen sets all gains to 0.
- VoiceConnectionFacade: exposes signals like isVoiceConnected, isMuted; methods like toggleMute(), toggleNoiseReduction(), setOutputVolume().
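The speaking detection parameters above can be made concrete with a small sketch. It assumes the grace period means "stay in the speaking state until 8 consecutive quiet frames have been seen", which is one plausible reading; SpeakingDetector and rms are hypothetical names, and in the browser the samples would come from AnalyserNode.getFloatTimeDomainData() rather than a hand-built array.

```typescript
const SPEAKING_RMS_THRESHOLD = 0.015; // from the doc: RMS ≥ 0.015
const GRACE_FRAMES = 8;               // from the doc: 8-frame grace period

// Root-mean-square amplitude of one frame of time-domain samples.
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  const sumSquares = samples.reduce((sum, s) => sum + s * s, 0);
  return Math.sqrt(sumSquares / samples.length);
}

// Assumed semantics: speaking stays true until GRACE_FRAMES quiet frames in a row.
class SpeakingDetector {
  private quietFrames = GRACE_FRAMES; // start in the silent state

  update(samples: Float32Array): boolean {
    if (rms(samples) >= SPEAKING_RMS_THRESHOLD) {
      this.quietFrames = 0;
    } else if (this.quietFrames < GRACE_FRAMES) {
      this.quietFrames++;
    }
    return this.quietFrames < GRACE_FRAMES;
  }
}
```

The grace counter keeps the speaking indicator from flickering during the short pauses between words.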
Per the domain README, voice-connection does not own RTCPeerConnection construction or signaling — those live in infrastructure/realtime/peer-connection-manager.
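For the per-peer playback gain handled by VoicePlaybackService, a minimal sketch of the percent-to-gain mapping follows. The linear mapping (100% → gain 1.0) and the effectiveGain name are assumptions; the document only states the 0–200% range and that deafen forces every gain to 0.

```typescript
// Maps a stored volume percentage to a GainNode gain value.
// Assumption: the mapping is linear, so 200% corresponds to a gain of 2.0.
function effectiveGain(volumePercent: number, deafened: boolean): number {
  if (deafened) return 0; // deafen overrides every per-peer setting
  const clamped = Math.min(200, Math.max(0, volumePercent));
  return clamped / 100;
}
```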
direct-call (toju-app/src/app/domains/direct-call/)
- Initiator flow (DirectCallService.startCall()): create/reuse the 1:1 DM, start a call-scoped voice session, send a direct-call "ring" envelope via PeerDeliveryService.
- Recipient flow: store incoming session, ring assets/audio/call.wav (unless DND), show in-app modal + desktop notification.
- Group upgrade: adding a third participant spawns a new group conversation; the active call swaps its chat panel to the new conversation but original DM history is preserved.
- Invariant: incoming direct-call events are ignored unless the local user is in participantIds.
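The participant invariant and the DND rule from the recipient flow can be expressed as two guards. This is a sketch under assumptions: the DirectCallEvent shape and the event kind names are hypothetical, chosen to mirror the call-scoped events (ring, participant join/leave, call end) mentioned in this document.

```typescript
// Hypothetical event kinds mirroring "ring, participant join/leave, call end".
type DirectCallEventKind = "ring" | "participant-joined" | "participant-left" | "ended";

// Hypothetical payload shape; only participantIds is named in the doc.
interface DirectCallEvent {
  kind: DirectCallEventKind;
  callId: string;
  participantIds: readonly string[];
}

// Invariant: events are ignored unless the local user is a participant.
function shouldHandleDirectCallEvent(event: DirectCallEvent, localUserId: string): boolean {
  return event.participantIds.indexOf(localUserId) !== -1;
}

// The ringtone plays only for a handled "ring" event when DND is off.
function shouldPlayRingtone(event: DirectCallEvent, localUserId: string, doNotDisturb: boolean): boolean {
  return event.kind === "ring" && shouldHandleDirectCallEvent(event, localUserId) && !doNotDisturb;
}
```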
Screen share (toju-app/src/app/domains/screen-share/)
- Adds dedicated MediaStreamTrack senders to the existing peer connection (does not open a new one).
- Request / response model: a receiver sends screen-share-request; the sender attaches the share track; screen-share-stop tears it down.
- Quality presets: low / balanced / high (resolution + FPS).
- On Electron, ScreenShareSourcePickerService drives a Promise-based picker over getSources (see ipc-bridge).
RNNoise pipeline
Manager: infrastructure/realtime/media/noise-reduction.manager.ts.
Raw mic → MediaStreamAudioSourceNode → NoiseSuppressorWorklet (AudioWorkletNode) → MediaStreamAudioDestinationNode → clean stream → RTCPeerConnection sender
- AudioContext at 48 kHz.
- Worklet loaded from rnnoise-worklet.js (built from @timephy/rnnoise-wasm, output written to toju-app/public/).
- If worklet load fails, the raw stream is passed through unchanged.
- Mute takes priority: when muted, noise reduction is also disabled.
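The fallback and priority rules in this list amount to a small decision function. The sketch below is hypothetical (selectOutboundPath and its input shape are not taken from noise-reduction.manager.ts) and leaves out the Web Audio graph wiring so that it stays runnable outside a browser.

```typescript
type OutboundPath = "raw" | "denoised";

// Hypothetical snapshot of the state the manager would consult.
interface NoiseReductionInputs {
  muted: boolean;
  noiseReductionEnabled: boolean;
  workletLoaded: boolean;
}

// Decides which stream feeds the RTCPeerConnection sender.
function selectOutboundPath(inputs: NoiseReductionInputs): OutboundPath {
  if (inputs.muted) return "raw";                  // mute takes priority
  if (!inputs.noiseReductionEnabled) return "raw"; // user toggle is off
  if (!inputs.workletLoaded) return "raw";         // worklet failed to load
  return "denoised";
}
```

Only when all three conditions hold does the mic stream pass through the worklet; every other combination falls back to the raw stream.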
Technical implementation
- Envelope types: see websocket-envelopes.
- Signaling adapter (renderer): toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.ts (and signaling-transport-handler.ts).
- Peer-connection manager: toju-app/src/app/infrastructure/realtime/peer-connection-manager/, covering create-peer-connection.ts, recovery (grace timers, reconnect), and data-channel plumbing.
- Voice settings: domains/voice-session/infrastructure/util/voice-settings-storage.util.ts.
- Noise reduction: infrastructure/realtime/media/noise-reduction.manager.ts.
- Worklet asset: toju-app/public/rnnoise-worklet.js.
- Server side: signaling only, in server/src/websocket/handler.ts::forwardRtcMessage.
Invariants
- The server forwards offer/answer/ice_candidate/direct-call envelopes opaquely and never persists media or call state.
- Switching voice rooms always tears down the prior session before starting the new one.
- Mute overrides noise reduction (the manager disables the worklet path when muted).
- Direct-call events with the local user absent from participantIds are ignored.
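The room-switch invariant can be sketched as a tiny state holder whose startSession always ends the previous session first. VoiceSessionSketch and its log are hypothetical and stand in for the real facade, which also has to release peer connections at the point where this sketch only records an entry.

```typescript
// Hypothetical target descriptor; serverId and roomId are named in the doc.
interface VoiceTarget {
  serverId: string;
  roomId: string;
}

class VoiceSessionSketch {
  private active: VoiceTarget | null = null;
  // Records the order of lifecycle events so the invariant can be checked.
  readonly log: string[] = [];

  startSession(target: VoiceTarget): void {
    // Tear down any prior session before the new one starts.
    if (this.active !== null) this.endSession();
    this.active = target;
    this.log.push("start:" + target.roomId);
  }

  endSession(): void {
    if (this.active === null) return;
    this.log.push("end:" + this.active.roomId);
    this.active = null;
  }
}
```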
Testing
- toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.spec.ts: user_left peer preservation under active voice.
- toju-app/src/app/infrastructure/realtime/peer-connection-manager/recovery/peer-recovery.spec.ts: reconnect, grace timers, exponential backoff.
- toju-app/src/app/infrastructure/realtime/peer-connection-manager/messaging/data-channel.spec.ts.
- toju-app/src/app/domains/direct-call/application/services/direct-call.service.spec.ts.
- E2E: e2e/tests/voice/multi-signal-eight-user-voice.spec.ts, e2e/tests/voice/direct-call.spec.ts (TODO: verify exact filenames in the suite).
Security considerations
- Once connected, WebRTC media bypasses the server entirely, so peer IP addresses may be exposed to other participants via ICE candidates. This is the standard WebRTC privacy caveat.
- Signaling envelopes may be forwarded without verifying that source and target share a server. TODO: confirm whether forwardRtcMessage enforces membership.
- The data channel chat carries P2P text payloads; integrity / authentication of those payloads is owned by the chat/direct-message domains, not by this area.
- RNNoise runs entirely client-side; mic audio never leaves the local AudioContext until it enters the encrypted RTCPeerConnection.
Performance considerations
- Mesh topology — N×(N-1)/2 peer connections per voice room. Practical ceiling is bound by client CPU and uplink; no documented soft cap.
- Bitrate is client-controlled (32–256 kbps); no server-enforced QoS.
- Voice activity detection runs at fftSize 256 with an 8-frame grace period — chosen to minimise CPU while staying responsive to natural speech.
- The signaling server's only cost is envelope forwarding (O(1) per envelope).
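To make the mesh arithmetic concrete: with N participants the room as a whole carries N×(N-1)/2 connections, while each client maintains N-1 of them. The helper below, meshConnections, is a hypothetical illustration of that formula.

```typescript
interface MeshCost {
  total: number;     // connections across the whole room: N×(N-1)/2
  perClient: number; // connections each client maintains: N-1
}

function meshConnections(participants: number): MeshCost {
  if (participants < 0 || Math.floor(participants) !== participants) {
    throw new RangeError("participants must be a non-negative integer");
  }
  const perClient = Math.max(0, participants - 1);
  return { total: (participants * perClient) / 2, perClient };
}
```

For the eight-user case exercised by the E2E suite this gives 28 connections in the room and 7 per client, so each client encodes and uploads its audio seven times.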
Known issues and limitations
- No SFU / MCU. Large rooms scale linearly with participant count on each client.
- No recording or server-side mixing for voice or screen.
- Bitrate is not enforced server-side — adversarial clients could ignore the suggested range.
- No documented call-quality telemetry pipeline.
Related features
- websocket-envelopes — owns the wire types this area consumes.
- ipc-bridge: getSources and the Linux audio-routing methods are used by screen-share.
- plugin-system: plugins may participate as observers via voice_state broadcasts (subject to capability grants); no direct call control surface today.
Changelog
| Date | Change |
|---|---|
| 2026-05-25 | Initial documentation |