Toju/agents-docs/features/voice-signaling.md
brogeby b19c39208c docs: populate initial cross-context feature docs
Add area-level documentation for the five most significant cross-context
feature areas under agents-docs/features/:

- websocket-envelopes: full envelope catalogue, lifecycle, dispatcher
- ipc-bridge: window.electronAPI surface, IPC channels, CQRS dispatch
- plugin-system: manifest contract, runtime, capabilities, plugin-support API
- server-directory: REST endpoints, CQRS, entities, business rules
- voice-signaling: mesh signaling, RNNoise pipeline, domain split

Update agents-docs/FEATURES.md index alphabetically and remove the
"no cross-context feature docs" placeholder.

Each doc records honest TODOs for verified gaps (stale signaling-contracts.ts,
window.api vs window.electronAPI mismatch, IPC error envelope drift from
CONTEXT.md, missing OpenAPI coverage for server-directory routes, no
envelope round-trip test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:36:36 +02:00

# Voice & WebRTC Signaling
> **Area:** voice-signaling
> **Status:** Active
> **Last updated:** 2026-05-25
## Overview
Voice and screen-share in Toju use a pure WebRTC mesh: peers establish RTCPeerConnections directly, while the signaling server only forwards SDP and ICE messages. This area covers the end-to-end flow (envelope routing, peer election, RTCPeerConnection lifecycle, RNNoise denoising) and the relationships between the three product-client domains involved: `voice-session`, `voice-connection`, and `direct-call`. Screen-share rides on the same peer connection; its UI orchestration is its own domain but the signaling path is shared.
## Responsibilities
- Negotiate WebRTC sessions between peers using `offer` / `answer` / `ice_candidate` envelopes forwarded by the signaling server.
- Elect an initiator deterministically when multiple peers arrive simultaneously, with a non-initiator fallback timer.
- Maintain the local audio pipeline: mic capture → optional RNNoise denoising → RTCPeerConnection sender.
- Track per-peer playback gain, mute, deafen, and speaking-activity state on the receive side.
- Mirror voice presence (`voice_state`) and direct-call signaling (`direct-call`) to other peers via the WebSocket.
This area does **not** own:
- The WebSocket envelope shape (see [websocket-envelopes](./websocket-envelopes.md)).
- Screen-share UI orchestration (its own domain at `toju-app/src/app/domains/screen-share/`); only the peer connection plumbing is shared.
- Persistent user settings beyond `voiceSettingsStorage` (audio device IDs, volumes, bitrate, latency profile, noise-reduction toggle, persisted to localStorage).
## Key concepts
- **Mesh** — every participant holds an `RTCPeerConnection` per other participant. No SFU / MCU.
- **Voice session** — high-level "user is currently in voice room X" state. Owned by `voice-session` domain.
- **Voice connection** — low-level transport/peer concerns: speaking detection, per-peer gain, mute / deafen state. Owned by `voice-connection` domain.
- **Direct call** — 1:1 voice/video call with an optional group-upgrade path. Owned by `direct-call` domain.
- **Initiator** — the peer responsible for sending the first `offer`. Elected first-peer-wins; non-initiators wait `NON_INITIATOR_GIVE_UP_MS` (≈5 s) before generating their own offer.
- **Data channel** — `chat`-labelled data channel established alongside each peer connection for P2P chat fallback and direct-message delivery.
- **Noise suppressor worklet** — RNNoise WASM running in an `AudioWorkletNode` (`NoiseSuppressorWorklet`), loaded from `rnnoise-worklet.js` at the app root.
---
## Signaling envelopes (consumed)
Defined in [websocket-envelopes](./websocket-envelopes.md). Voice-relevant types:
- `offer`, `answer`, `ice_candidate` — forwarded by the server to `targetUserId` without inspection.
- `direct-call` — forwarded; payload carries call-scoped events (ring, participant join/leave, call end).
- `voice_state` — broadcast to a server. Payload includes `roomId`, `voiceGateway`, mute/deafen flags.
- `server_users` — full peer roster on join; seeds the initial offer fan-out.
- `user_joined` — schedules a fallback offer after a grace delay (`USER_JOINED_FALLBACK_OFFER_DELAY_MS`, ≈1 s).
- `user_left` — peer teardown, with special handling that preserves peers still under an active voice session.
- `connected` / `access_denied` — connection lifecycle (server bootstrap and authorization).
The server is **purely signaling**: it does not track which `oderId` is in which voice room. Voice membership is derived client-side from the `voice_state` broadcasts observed on the server.
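As a rough illustration of deriving membership client-side, the sketch below folds observed `voice_state` broadcasts into a room-to-users roster. The `VoiceStateEvent` shape, the `userId` field name, and the convention that a `null` `roomId` means "left voice" are assumptions made for illustration; they are not the actual envelope contract.

```typescript
// Hypothetical, simplified view of a `voice_state` broadcast.
// Only `roomId` and the mute/deafen flags are mentioned in this doc;
// the exact field names here are assumptions.
interface VoiceStateEvent {
  userId: string;
  roomId: string | null; // assumed: null means the user is no longer in voice
  isMuted: boolean;
  isDeafened: boolean;
}

// roomId -> user IDs currently believed to be in that voice room.
type VoiceRoster = Record<string, string[]>;

// Returns a new roster with the event applied; the input is not mutated.
function applyVoiceState(roster: VoiceRoster, event: VoiceStateEvent): VoiceRoster {
  const next: VoiceRoster = {};
  for (const roomId of Object.keys(roster)) {
    // Assumption: a user is in at most one voice room, so drop them everywhere first.
    const remaining = roster[roomId].filter((id) => id !== event.userId);
    if (remaining.length > 0) next[roomId] = remaining;
  }
  if (event.roomId !== null) {
    next[event.roomId] = [...(next[event.roomId] ?? []), event.userId];
  }
  return next;
}
```

Because the reducer is pure, replaying the same sequence of broadcasts always yields the same roster, which fits a server that keeps no voice-room state of its own.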
---
## Session establishment flow
A new participant joining a voice room produces this exchange (initiator perspective; symmetrical when both arrive at once):
1. Local user clicks "Join voice" → `VoiceSessionFacade.startSession()` populates the session model and asks `voice-connection` to ready peer transport.
2. Server broadcasts `user_joined` to existing peers.
3. Each existing peer evaluates: am I the elected initiator for the (me, new-peer) pair? If yes, the peer-connection manager calls `doCreateAndSendOffer()`.
4. Initiator constructs `new RTCPeerConnection({ iceServers })` (`infrastructure/realtime/peer-connection-manager/.../create-peer-connection.ts`), adds local tracks, creates the data channel `chat`, generates an SDP offer, and sends it via the signaling transport.
5. Responder receives `offer` → `doHandleOffer()` sets remote description, generates SDP answer, sends `answer`.
6. Initiator receives `answer` → `doHandleAnswer()` sets remote description.
7. Both sides emit `ice_candidate` as they gather candidates via `onicecandidate`.
8. `iceConnectionState` reaches `connected` / `completed` → media flows.
9. Either side may open the `chat` data channel for P2P text payloads (direct messages, etc.).
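
The receive side of steps 5 to 7 can be pictured as a small dispatcher that routes each forwarded envelope to the matching handler. This is a sketch under assumptions: the envelope fields (`fromUserId`, `sdp`, `candidate`), the handler signatures, and the `addIceCandidate` handler name are hypothetical; only the envelope type names and `doHandleOffer()` / `doHandleAnswer()` come from this document.

```typescript
// Hypothetical envelope shapes; the real contract lives in websocket-envelopes.
type SignalingEnvelope =
  | { type: "offer"; fromUserId: string; sdp: string }
  | { type: "answer"; fromUserId: string; sdp: string }
  | { type: "ice_candidate"; fromUserId: string; candidate: string };

// Handler signatures are assumptions made for this sketch.
interface PeerHandlers {
  doHandleOffer(fromUserId: string, sdp: string): void;
  doHandleAnswer(fromUserId: string, sdp: string): void;
  addIceCandidate(fromUserId: string, candidate: string): void;
}

function dispatchSignal(envelope: SignalingEnvelope, handlers: PeerHandlers): void {
  switch (envelope.type) {
    case "offer":
      // Responder path: set remote description, then reply with an answer.
      handlers.doHandleOffer(envelope.fromUserId, envelope.sdp);
      break;
    case "answer":
      // Initiator path: set remote description to complete negotiation.
      handlers.doHandleAnswer(envelope.fromUserId, envelope.sdp);
      break;
    case "ice_candidate":
      // Either side: feed the remote candidate to the peer connection.
      handlers.addIceCandidate(envelope.fromUserId, envelope.candidate);
      break;
  }
}
```

Modelling the envelope as a discriminated union lets the compiler narrow the payload per `type`, so each handler only sees the fields that belong to its message.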
If the elected initiator never sends an offer within `NON_INITIATOR_GIVE_UP_MS`, the non-initiator promotes itself and initiates instead, which preserves liveness across asymmetric drop-outs.
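
A minimal sketch of that fallback, assuming a per-peer timer that is cancelled when the expected offer arrives. The class name, the injected `Schedule` function, and the `sendOffer` callback are hypothetical; only `NON_INITIATOR_GIVE_UP_MS` and its approximate 5 s value come from this document.

```typescript
const NON_INITIATOR_GIVE_UP_MS = 5000; // approximate value, per this document

type Cancel = () => void;
type Schedule = (fn: () => void, delayMs: number) => Cancel;

// Default scheduler backed by setTimeout; tests can inject a fake one.
const timeoutSchedule: Schedule = (fn, delayMs) => {
  const handle = setTimeout(fn, delayMs);
  return () => clearTimeout(handle);
};

class NonInitiatorFallback {
  private readonly pending: Record<string, Cancel> = {};
  private readonly sendOffer: (peerId: string) => void;
  private readonly schedule: Schedule;

  constructor(sendOffer: (peerId: string) => void, schedule: Schedule = timeoutSchedule) {
    this.sendOffer = sendOffer;
    this.schedule = schedule;
  }

  // Call when we are NOT the elected initiator for this peer.
  waitForOffer(peerId: string): void {
    this.cancel(peerId); // never keep two timers for the same peer
    this.pending[peerId] = this.schedule(() => {
      delete this.pending[peerId];
      this.sendOffer(peerId); // promote ourselves to initiator
    }, NON_INITIATOR_GIVE_UP_MS);
  }

  // Call when the expected offer arrives or the peer goes away.
  cancel(peerId: string): void {
    this.pending[peerId]?.();
    delete this.pending[peerId];
  }
}
```

Injecting the scheduler keeps the timing logic testable without waiting five real seconds.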
`user_left` is treated carefully: `signaling-message-handler.spec.ts` covers the case where a peer is still required by an active voice session and must not be torn down, even if other parts of the system consider the peer disconnected.
---
## Domain responsibilities
### `voice-session` (`toju-app/src/app/domains/voice-session/`)
- `VoiceSessionFacade` (`application/facades/voice-session.facade.ts`) — owns the active session metadata (`serverId`, `roomId`, `participantIds`); drives a `showFloatingControls` signal when the user navigates away from the room.
- `VoiceWorkspaceService` (`application/services/voice-workspace.service.ts`) — UI state for the workspace (hidden / expanded / minimized), focused stream ID, mini-window position.
- `voiceSettingsStorage` (`infrastructure/util/voice-settings-storage.util.ts`) — localStorage persistence: input/output device IDs, output volume (0–100), bitrate (32–256 kbps), latency profile (`low | balanced | high`), noise-reduction toggle.
- Joining a new voice target first calls `endSession()` so transitions cannot leak peer connections.
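
For illustration, settings persistence with range clamping might look like the sketch below. The storage key, the `VoiceSettings` field names, and the default values are hypothetical; the ranges (volume 0 to 100, bitrate 32 to 256 kbps) and the latency profiles are the ones listed above. The `KeyValueStore` interface is a small subset of the Web Storage API so the sketch also runs outside a browser.

```typescript
type LatencyProfile = "low" | "balanced" | "high";

interface VoiceSettings {
  outputVolume: number; // 0 to 100
  bitrateKbps: number; // 32 to 256
  latencyProfile: LatencyProfile;
  noiseReduction: boolean;
}

// Subset of the Web Storage API; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "voice-settings"; // hypothetical key
const DEFAULTS: VoiceSettings = {
  // hypothetical defaults
  outputVolume: 100,
  bitrateKbps: 96,
  latencyProfile: "balanced",
  noiseReduction: true,
};

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

function saveVoiceSettings(store: KeyValueStore, settings: VoiceSettings): void {
  const normalized: VoiceSettings = {
    ...settings,
    outputVolume: clamp(settings.outputVolume, 0, 100),
    bitrateKbps: clamp(settings.bitrateKbps, 32, 256),
  };
  store.setItem(STORAGE_KEY, JSON.stringify(normalized));
}

function loadVoiceSettings(store: KeyValueStore): VoiceSettings {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return { ...DEFAULTS };
  try {
    // Stored fields override defaults; missing fields keep their default.
    return { ...DEFAULTS, ...(JSON.parse(raw) as Partial<VoiceSettings>) };
  } catch {
    return { ...DEFAULTS }; // corrupted JSON falls back to defaults
  }
}
```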
### `voice-connection` (`toju-app/src/app/domains/voice-connection/`)
Bridges the application layer to the low-level WebRTC infrastructure under `toju-app/src/app/infrastructure/realtime/`.
- **`VoiceActivityService`** — RMS-based speaking detection via `AnalyserNode` (fftSize 256, RMS ≥ 0.015, 8-frame grace period).
- **`VoicePlaybackService`** — per-peer `GainNode` chains (0–200% range), localStorage-persisted; deafen sets all gains to 0.
- **`VoiceConnectionFacade`** — exposes signals like `isVoiceConnected`, `isMuted`; methods like `toggleMute()`, `toggleNoiseReduction()`, `setOutputVolume()`.
Per the domain README, voice-connection does **not** own RTCPeerConnection construction or signaling — those live in `infrastructure/realtime/peer-connection-manager`.
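
The speaking detection described for `VoiceActivityService` can be approximated as follows. This is a sketch, not the service's actual code: the class and function names are hypothetical, and it assumes the grace period means the speaking state is held until 8 consecutive frames fall below the threshold. Each frame would typically be filled by `AnalyserNode.getFloatTimeDomainData()` with `fftSize` set to 256.

```typescript
const SPEAKING_RMS_THRESHOLD = 0.015;
const GRACE_FRAMES = 8;

// Root-mean-square of time-domain samples in the range [-1, 1].
function computeRms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

class SpeakingDetector {
  // Number of consecutive quiet frames seen; starts in the "not speaking" state.
  private quietFrames = GRACE_FRAMES;

  // Feed one analyser frame; returns whether the user counts as speaking.
  update(samples: Float32Array): boolean {
    if (computeRms(samples) >= SPEAKING_RMS_THRESHOLD) {
      this.quietFrames = 0;
    } else if (this.quietFrames < GRACE_FRAMES) {
      this.quietFrames++;
    }
    return this.quietFrames < GRACE_FRAMES;
  }
}
```

The grace counter avoids flicker in the speaking indicator during short pauses between words.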
### `direct-call` (`toju-app/src/app/domains/direct-call/`)
- Initiator flow (`DirectCallService.startCall()`): create/reuse the 1:1 DM, start a call-scoped voice session, send a `direct-call` "ring" envelope via `PeerDeliveryService`.
- Recipient flow: store incoming session, ring `assets/audio/call.wav` (unless DND), show in-app modal + desktop notification.
- Group upgrade: adding a third participant spawns a new group conversation; the active call swaps its chat panel to the new conversation but original DM history is preserved.
- Invariant: incoming `direct-call` events are ignored unless the local user is in `participantIds`.
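
The participant invariant amounts to a guard applied before any `direct-call` event is processed. In the sketch below, the `DirectCallEvent` shape, the `kind` values, and both function names are hypothetical; only the `participantIds` check reflects the invariant stated above.

```typescript
// Hypothetical, simplified shape of an incoming `direct-call` event.
interface DirectCallEvent {
  kind: "ring" | "participant-joined" | "participant-left" | "ended"; // illustrative kinds
  callId: string;
  participantIds: string[];
}

// True only when the local user is a listed participant of the call.
function isDirectCallEventForUser(event: DirectCallEvent, localUserId: string): boolean {
  return event.participantIds.indexOf(localUserId) !== -1;
}

// Drops every event that does not list the local user as a participant.
function filterDirectCallEvents(events: DirectCallEvent[], localUserId: string): DirectCallEvent[] {
  return events.filter((event) => isDirectCallEventForUser(event, localUserId));
}
```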
### Screen share (`toju-app/src/app/domains/screen-share/`)
- Adds dedicated `MediaStreamTrack` senders to the existing peer connection (does not open a new one).
- Request / response model: a receiver sends `screen-share-request`; the sender attaches the share track; `screen-share-stop` tears it down.
- Quality presets: `low` / `balanced` / `high` (resolution + FPS).
- On Electron, `ScreenShareSourcePickerService` drives a Promise-based picker over `getSources` (see [ipc-bridge](./ipc-bridge.md)).
---
## RNNoise pipeline
Manager: `infrastructure/realtime/media/noise-reduction.manager.ts`.
```
Raw mic → MediaStreamAudioSourceNode → NoiseSuppressorWorklet (AudioWorkletNode) → MediaStreamAudioDestinationNode → clean stream → RTCPeerConnection sender
```
- AudioContext at 48 kHz.
- Worklet loaded from `rnnoise-worklet.js` (built from `@timephy/rnnoise-wasm`, output written to `toju-app/public/`).
- If worklet load fails, the raw stream is passed through unchanged.
- Mute takes priority — when muted, noise reduction is also disabled.
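
The fallback and priority rules above reduce to a small decision about which stream feeds the sender. The sketch below is a hypothetical helper, not the manager's real API: it assumes the denoised stream is represented as `null` when the worklet failed to load, and it is generic over the stream type so it can be exercised without a browser `MediaStream`.

```typescript
interface SenderPathOptions {
  muted: boolean;
  noiseReductionEnabled: boolean;
}

// `denoised` is null when the worklet failed to load or the graph was never built.
function chooseSenderStream<S>(raw: S, denoised: S | null, options: SenderPathOptions): S {
  if (options.muted) return raw; // mute takes priority: skip the worklet path
  if (!options.noiseReductionEnabled) return raw; // toggle is off
  return denoised ?? raw; // worklet load failure falls back to the raw stream
}
```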
## Technical implementation
- **Envelope types**: see [websocket-envelopes](./websocket-envelopes.md).
- **Signaling adapter (renderer)**: `toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.ts` (and `signaling-transport-handler.ts`).
- **Peer-connection manager**: `toju-app/src/app/infrastructure/realtime/peer-connection-manager/` — `create-peer-connection.ts`, recovery (grace timers, reconnect), data-channel plumbing.
- **Voice settings**: `domains/voice-session/infrastructure/util/voice-settings-storage.util.ts`.
- **Noise reduction**: `infrastructure/realtime/media/noise-reduction.manager.ts`.
- **Worklet asset**: `toju-app/public/rnnoise-worklet.js`.
- **Server side**: signaling only — `server/src/websocket/handler.ts::forwardRtcMessage`.
## Invariants
- The server forwards `offer` / `answer` / `ice_candidate` / `direct-call` envelopes opaquely and never persists media or call state.
- Switching voice rooms always tears down the prior session before starting the new one.
- Mute overrides noise reduction (the manager disables the worklet path when muted).
- Direct-call events with the local user absent from `participantIds` are ignored.
## Testing
- `toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.spec.ts` — `user_left` peer preservation under active voice.
- `toju-app/src/app/infrastructure/realtime/peer-connection-manager/recovery/peer-recovery.spec.ts` — reconnect, grace timers, exponential backoff.
- `toju-app/src/app/infrastructure/realtime/peer-connection-manager/messaging/data-channel.spec.ts`.
- `toju-app/src/app/domains/direct-call/application/services/direct-call.service.spec.ts`.
- E2E: `e2e/tests/voice/multi-signal-eight-user-voice.spec.ts`, `e2e/tests/voice/direct-call.spec.ts` (verify exact filenames in the suite — TODO).
## Security considerations
- Once connected, WebRTC media flows directly between peers and bypasses the server, so peer IP addresses may be exposed to other participants via ICE candidates. This is the standard WebRTC privacy caveat.
- Signaling envelopes are forwarded without verifying that source and target share a server — TODO: confirm whether `forwardRtcMessage` enforces membership.
- The data channel `chat` carries P2P text payloads; integrity / authentication of those payloads is owned by the chat/direct-message domains, not by this area.
- RNNoise runs entirely client-side; mic audio never leaves the local AudioContext until it enters the encrypted RTCPeerConnection.
## Performance considerations
- Mesh topology means N×(N-1)/2 peer connections per voice room in total, with each client holding N-1 of them. The practical ceiling is bound by client CPU and uplink; there is no documented soft cap.
- Bitrate is client-controlled (32–256 kbps); no server-enforced QoS.
- Voice activity detection runs at fftSize 256 with an 8-frame grace period — chosen to minimise CPU while staying responsive to natural speech.
- The signaling server's only cost is envelope forwarding (O(1) per envelope).
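
To make the mesh cost concrete, the arithmetic can be worked through in a few lines. The function names are illustrative, and the uplink figure is a rough estimate that assumes one outgoing audio stream per peer at the configured bitrate; it ignores protocol overhead and screen-share tracks.

```typescript
// Total peer connections in a full mesh of n participants: n * (n - 1) / 2.
function totalMeshConnections(participants: number): number {
  return (participants * (participants - 1)) / 2;
}

// Connections each client maintains: one per other participant.
function connectionsPerClient(participants: number): number {
  return Math.max(0, participants - 1);
}

// Rough per-client audio uplink in kbps: one outgoing stream per peer.
function estimatedUplinkKbps(participants: number, bitrateKbps: number): number {
  return connectionsPerClient(participants) * bitrateKbps;
}
```

For an eight-user room this gives 28 connections in total, 7 per client, and roughly 448 kbps of audio uplink per client at 64 kbps.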
## Known issues and limitations
- **No SFU / MCU.** Large rooms scale linearly with participant count on each client.
- **No recording or server-side mixing** for voice or screen.
- **Bitrate is not enforced server-side** — adversarial clients could ignore the suggested range.
- **No documented call-quality telemetry pipeline.**
## Related features
- **[websocket-envelopes](./websocket-envelopes.md)** — owns the wire types this area consumes.
- **[ipc-bridge](./ipc-bridge.md)** — `getSources` and the Linux audio-routing methods are used by screen-share.
- **[plugin-system](./plugin-system.md)** — plugins may participate as observers via `voice_state` broadcasts (subject to capability grants); no direct call control surface today.
## Changelog
| Date | Change |
|------|--------|
| 2026-05-25 | Initial documentation |