docs: populate initial cross-context feature docs
Add area-level documentation for the five most significant cross-context feature areas under agents-docs/features/:

- websocket-envelopes: full envelope catalogue, lifecycle, dispatcher
- ipc-bridge: window.electronAPI surface, IPC channels, CQRS dispatch
- plugin-system: manifest contract, runtime, capabilities, plugin-support API
- server-directory: REST endpoints, CQRS, entities, business rules
- voice-signaling: mesh signaling, RNNoise pipeline, domain split

Update agents-docs/FEATURES.md index alphabetically and remove the "no cross-context feature docs" placeholder.

Each doc records honest TODOs for verified gaps (stale signaling-contracts.ts, window.api vs window.electronAPI mismatch, IPC error envelope drift from CONTEXT.md, missing OpenAPI coverage for server-directory routes, no envelope round-trip test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
177
agents-docs/features/voice-signaling.md
Normal file
@@ -0,0 +1,177 @@
# Voice & WebRTC Signaling

> **Area:** voice-signaling
> **Status:** Active
> **Last updated:** 2026-05-25

## Overview

Voice and screen-share in Toju are pure WebRTC mesh: peers establish RTCPeerConnections directly, while the signaling server only forwards SDP and ICE messages. This area covers the end-to-end flow — envelope routing, peer election, RTCPeerConnection lifecycle, RNNoise denoising, and the relationships between the three product-client domains involved: `voice-session`, `voice-connection`, and `direct-call`. Screen-share rides on the same peer connection; its UI orchestration is its own domain but the signaling path is shared.
## Responsibilities

- Negotiate WebRTC sessions between peers using `offer` / `answer` / `ice_candidate` envelopes forwarded by the signaling server.
- Elect an initiator deterministically when multiple peers arrive simultaneously, with a non-initiator fallback timer.
- Maintain the local audio pipeline: mic capture → optional RNNoise denoising → RTCPeerConnection sender.
- Track per-peer playback gain, mute, deafen, and speaking-activity state on the receive side.
- Mirror voice presence (`voice_state`) and direct-call signaling (`direct-call`) to other peers via the WebSocket.

This area does **not** own:

- The WebSocket envelope shape (see [websocket-envelopes](./websocket-envelopes.md)).
- Screen-share UI orchestration (its own domain at `toju-app/src/app/domains/screen-share/`); only the peer connection plumbing is shared.
- Persistent user settings beyond `voiceSettingsStorage` (audio device IDs, volumes, bitrate, latency profile, noise-reduction toggle, persisted to localStorage).
## Key concepts

- **Mesh** — every participant holds an `RTCPeerConnection` per other participant. No SFU / MCU.
- **Voice session** — high-level "user is currently in voice room X" state. Owned by `voice-session` domain.
- **Voice connection** — low-level transport/peer concerns: speaking detection, per-peer gain, mute / deafen state. Owned by `voice-connection` domain.
- **Direct call** — 1:1 voice/video call with an optional group-upgrade path. Owned by `direct-call` domain.
- **Initiator** — the peer responsible for sending the first `offer`. Election is first-peer-wins; non-initiators wait `NON_INITIATOR_GIVE_UP_MS` (≈5 s) before generating their own offer.
- **Data channel** — `chat`-labelled data channel established alongside each peer connection for P2P chat fallback and direct-message delivery.
- **Noise suppressor worklet** — RNNoise WASM running in an `AudioWorkletNode` (`NoiseSuppressorWorklet`), loaded from `rnnoise-worklet.js` at the app root.

---
## Signaling envelopes (consumed)

Defined in [websocket-envelopes](./websocket-envelopes.md). Voice-relevant types:

- `offer`, `answer`, `ice_candidate` — forwarded by the server to `targetUserId` without inspection.
- `direct-call` — forwarded; payload carries call-scoped events (ring, participant join/leave, call end).
- `voice_state` — broadcast to a server. Payload includes `roomId`, `voiceGateway`, mute/deafen flags.
- `server_users` — full peer roster on join; seeds the initial offer fan-out.
- `user_joined` — schedules a fallback offer after a grace delay (`USER_JOINED_FALLBACK_OFFER_DELAY_MS`, ≈1 s).
- `user_left` — peer teardown, with special handling that preserves peers still under an active voice session.
- `connected` / `access_denied` — connection lifecycle (server bootstrap and authorization).

The server is **purely signaling**: it does not track which `oderId` is in which voice room. Voice membership is derived client-side from the `voice_state` broadcasts observed on the server.

---
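Because the server keeps no voice-room roster, each client has to fold the `voice_state` broadcasts it observes into its own membership view. The following is a minimal sketch of such a fold, not the actual Toju implementation. The payload field names (`userId`, `isMuted`, `isDeafened`), the convention that `roomId: null` means "left voice", and the helper names are assumptions made for illustration; the real shape is defined in the envelope catalogue.

```typescript
// Hypothetical payload shape: field names are assumptions, not the real envelope.
interface VoiceStatePayload {
  userId: string;
  roomId: string | null; // assumed: null means the user is in no voice room
  isMuted: boolean;
  isDeafened: boolean;
}

// roomId -> userId -> latest state seen for that user
type VoiceRoster = Record<string, Record<string, VoiceStatePayload>>;

// Fold one observed broadcast into the local roster (mutates and returns it).
function applyVoiceState(roster: VoiceRoster, msg: VoiceStatePayload): VoiceRoster {
  // Assumed: a user is in at most one voice room, so drop any earlier entry first.
  Object.keys(roster).forEach((roomId) => {
    delete roster[roomId][msg.userId];
    if (Object.keys(roster[roomId]).length === 0) {
      delete roster[roomId];
    }
  });
  if (msg.roomId !== null) {
    const members = roster[msg.roomId] || {};
    members[msg.userId] = msg;
    roster[msg.roomId] = members;
  }
  return roster;
}

// Sorted user IDs currently believed to be in a room.
function membersOf(roster: VoiceRoster, roomId: string): string[] {
  return Object.keys(roster[roomId] || {}).sort();
}
```

A late joiner only sees broadcasts sent after it connected, so a real client would also need some way to learn existing state; that part is outside this sketch.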
## Session establishment flow

A new participant joining a voice room produces this exchange (initiator perspective; symmetrical when both arrive at once):

1. Local user clicks "Join voice" → `VoiceSessionFacade.startSession()` populates the session model and asks `voice-connection` to ready peer transport.
2. Server broadcasts `user_joined` to existing peers.
3. Each existing peer evaluates: am I the elected initiator for the (me, new-peer) pair? If yes, the peer-connection manager calls `doCreateAndSendOffer()`.
4. Initiator constructs `new RTCPeerConnection({ iceServers })` (`infrastructure/realtime/peer-connection-manager/.../create-peer-connection.ts`), adds local tracks, creates the data channel `chat`, generates an SDP offer, and sends it via the signaling transport.
5. Responder receives `offer` → `doHandleOffer()` sets remote description, generates SDP answer, sends `answer`.
6. Initiator receives `answer` → `doHandleAnswer()` sets remote description.
7. Both sides emit `ice_candidate` as they gather candidates via `onicecandidate`.
8. `iceConnectionState` reaches `connected` / `completed` → media flows.
9. Either side may open the `chat` data channel for P2P text payloads (direct messages, etc.).

If the elected initiator never sends an offer within `NON_INITIATOR_GIVE_UP_MS`, the non-initiator promotes itself and initiates instead, which preserves liveness across asymmetric drop-outs.

`user_left` is treated carefully: `signaling-message-handler.spec.ts` covers the case where a peer is still required by an active voice session and must not be torn down, even if other parts of the system think the peer has disconnected.

---
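The initiator election and the give-up fallback described above can be sketched as a small decision function plus a timer. This is an illustration under stated assumptions, not the peer-connection manager's code: the tie-break used here (the lexicographically smaller ID initiates) is a swapped-in rule chosen because it is easy to show, whereas the document describes the real election as first-peer-wins. The value `5000` is taken from the "≈5 s" figure above, and `planNegotiation` and `NegotiationHandle` are hypothetical names.

```typescript
// Assumed from the "≈5 s" figure in this document; the real constant may differ.
const NON_INITIATOR_GIVE_UP_MS = 5000;

// Hypothetical tie-break: the smaller ID initiates. Both peers evaluate the
// same comparison, so exactly one of them sees `true` for a given pair.
function isInitiator(localId: string, remoteId: string): boolean {
  return localId < remoteId;
}

interface NegotiationHandle {
  // Call when the remote peer's offer arrives, so the fallback is cancelled.
  onOfferReceived(): void;
}

function planNegotiation(
  localId: string,
  remoteId: string,
  sendOffer: () => void,
  giveUpMs: number = NON_INITIATOR_GIVE_UP_MS
): NegotiationHandle {
  if (isInitiator(localId, remoteId)) {
    sendOffer(); // elected initiator: offer immediately
    return {
      onOfferReceived(): void {
        // Nothing to cancel; resolving a simultaneous offer is out of scope here.
      },
    };
  }
  // Non-initiator: wait for the remote offer, promote ourselves if it never comes.
  const timer = setTimeout(sendOffer, giveUpMs);
  return {
    onOfferReceived(): void {
      clearTimeout(timer);
    },
  };
}
```

In this sketch `sendOffer` stands in for whatever wraps `doCreateAndSendOffer()`; injecting it keeps the timing logic independent of `RTCPeerConnection`.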
## Domain responsibilities

### `voice-session` (`toju-app/src/app/domains/voice-session/`)

- `VoiceSessionFacade` (`application/facades/voice-session.facade.ts`) — owns the active session metadata (`serverId`, `roomId`, `participantIds`); drives a `showFloatingControls` signal when the user navigates away from the room.
- `VoiceWorkspaceService` (`application/services/voice-workspace.service.ts`) — UI state for the workspace (hidden / expanded / minimized), focused stream ID, mini-window position.
- `voiceSettingsStorage` (`infrastructure/util/voice-settings-storage.util.ts`) — localStorage persistence: input/output device IDs, output volume (0–100), bitrate (32–256 kbps), latency profile (`low | balanced | high`), noise-reduction toggle.
- Joining a new voice target first calls `endSession()` so transitions cannot leak peer connections.
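The value ranges listed for `voiceSettingsStorage` suggest that persisted settings should be validated when they are read back. A minimal sketch of such a read path follows; it takes the raw localStorage string as an argument so it stays independent of the browser. The field names, the default values, and the `parseVoiceSettings` helper are assumptions for illustration and may not match the real utility.

```typescript
type LatencyProfile = "low" | "balanced" | "high";

// Hypothetical settings shape; only the ranges come from this document.
interface VoiceSettings {
  inputDeviceId: string | null;
  outputDeviceId: string | null;
  outputVolume: number; // 0-100
  bitrateKbps: number; // 32-256
  latencyProfile: LatencyProfile;
  noiseReduction: boolean;
}

// Placeholder defaults, chosen only for the sketch.
const DEFAULTS: VoiceSettings = {
  inputDeviceId: null,
  outputDeviceId: null,
  outputVolume: 100,
  bitrateKbps: 96,
  latencyProfile: "balanced",
  noiseReduction: true,
};

function clampNumber(value: unknown, min: number, max: number, fallback: number): number {
  if (typeof value !== "number" || !isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

function isLatencyProfile(value: unknown): value is LatencyProfile {
  return value === "low" || value === "balanced" || value === "high";
}

// Parse the raw stored string, clamping out-of-range values and
// falling back to defaults for missing or corrupted data.
function parseVoiceSettings(raw: string | null): VoiceSettings {
  let data: Record<string, unknown> = {};
  try {
    const parsed: unknown = raw ? JSON.parse(raw) : null;
    if (parsed !== null && typeof parsed === "object") {
      data = parsed as Record<string, unknown>;
    }
  } catch {
    // Corrupted JSON: keep the empty object so every field uses its default.
  }
  const input = data["inputDeviceId"];
  const output = data["outputDeviceId"];
  const profile = data["latencyProfile"];
  const noise = data["noiseReduction"];
  return {
    inputDeviceId: typeof input === "string" ? input : DEFAULTS.inputDeviceId,
    outputDeviceId: typeof output === "string" ? output : DEFAULTS.outputDeviceId,
    outputVolume: clampNumber(data["outputVolume"], 0, 100, DEFAULTS.outputVolume),
    bitrateKbps: clampNumber(data["bitrateKbps"], 32, 256, DEFAULTS.bitrateKbps),
    latencyProfile: isLatencyProfile(profile) ? profile : DEFAULTS.latencyProfile,
    noiseReduction: typeof noise === "boolean" ? noise : DEFAULTS.noiseReduction,
  };
}
```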
### `voice-connection` (`toju-app/src/app/domains/voice-connection/`)

Bridges the application layer to the low-level WebRTC infrastructure under `toju-app/src/app/infrastructure/realtime/`.

- **`VoiceActivityService`** — RMS-based speaking detection via `AnalyserNode` (fftSize 256, RMS ≥ 0.015, 8-frame grace period).
- **`VoicePlaybackService`** — per-peer `GainNode` chains (0–200% range), localStorage-persisted; deafen sets all gains to 0.
- **`VoiceConnectionFacade`** — exposes signals like `isVoiceConnected`, `isMuted`; methods like `toggleMute()`, `toggleNoiseReduction()`, `setOutputVolume()`.

Per the domain README, voice-connection does **not** own RTCPeerConnection construction or signaling — those live in `infrastructure/realtime/peer-connection-manager`.
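The RMS-based speaking detection mentioned for `VoiceActivityService` can be illustrated with a pure function over one frame of time-domain samples, such as the 256 floats an `AnalyserNode` with `fftSize` 256 yields from `getFloatTimeDomainData()`. This is a sketch, not the service's code: the threshold and grace length come from the figures above, while the class name and the exact grace semantics (speaking is held until the eighth consecutive quiet frame) are assumptions.

```typescript
const SPEAKING_RMS_THRESHOLD = 0.015; // from this document
const GRACE_FRAMES = 8; // from this document; exact semantics assumed

// Root mean square of one frame of samples in the range [-1, 1].
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical detector: one instance per monitored stream.
class SpeakingDetector {
  // Start at the limit so the initial state is "not speaking".
  private quietFrames = GRACE_FRAMES;

  // Feed one frame; returns whether the stream counts as speaking.
  update(samples: Float32Array): boolean {
    if (rms(samples) >= SPEAKING_RMS_THRESHOLD) {
      this.quietFrames = 0; // loud frame resets the grace counter
    } else if (this.quietFrames < GRACE_FRAMES) {
      this.quietFrames++; // quiet frame inside the grace window
    }
    return this.quietFrames < GRACE_FRAMES;
  }
}
```

The grace counter keeps the speaking indicator from flickering during the short pauses between words.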
### `direct-call` (`toju-app/src/app/domains/direct-call/`)

- Initiator flow (`DirectCallService.startCall()`): create/reuse the 1:1 DM, start a call-scoped voice session, send a `direct-call` "ring" envelope via `PeerDeliveryService`.
- Recipient flow: store incoming session, ring `assets/audio/call.wav` (unless DND), show in-app modal + desktop notification.
- Group upgrade: adding a third participant spawns a new group conversation; the active call swaps its chat panel to the new conversation, but the original DM history is preserved.
- Invariant: incoming `direct-call` events are ignored unless the local user is in `participantIds`.
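The participant invariant above amounts to a guard that runs before any other direct-call handling. A minimal sketch, assuming a hypothetical event shape (only `participantIds` is named in this document; `callId`, `kind`, and the function name are illustrative):

```typescript
// Hypothetical event shape; only `participantIds` comes from this document.
interface DirectCallEvent {
  callId: string;
  kind: "ring" | "participant-joined" | "participant-left" | "ended";
  participantIds: string[];
}

// Returns true only when the local user is a listed participant.
function shouldHandleDirectCallEvent(event: DirectCallEvent, localUserId: string): boolean {
  return event.participantIds.indexOf(localUserId) !== -1;
}
```

Since the server forwards these envelopes opaquely, a guard like this on the receiving client is the only place the check can happen.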
### Screen share (`toju-app/src/app/domains/screen-share/`)

- Adds dedicated `MediaStreamTrack` senders to the existing peer connection (does not open a new one).
- Request / response model: a receiver sends `screen-share-request`; the sender attaches the share track; `screen-share-stop` tears it down.
- Quality presets: `low` / `balanced` / `high` (resolution + FPS).
- On Electron, `ScreenShareSourcePickerService` drives a Promise-based picker over `getSources` (see [ipc-bridge](./ipc-bridge.md)).

---
## RNNoise pipeline

Manager: `infrastructure/realtime/media/noise-reduction.manager.ts`.

```
Raw mic → MediaStreamAudioSourceNode → NoiseSuppressorWorklet (AudioWorkletNode) → MediaStreamAudioDestinationNode → clean stream → RTCPeerConnection sender
```

- AudioContext at 48 kHz.
- Worklet loaded from `rnnoise-worklet.js` (built from `@timephy/rnnoise-wasm`, output written to `toju-app/public/`).
- If worklet load fails, the raw stream is passed through unchanged.
- Mute takes priority — when muted, noise reduction is also disabled.
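The fallback and priority rules in the list above reduce to a small decision about which path feeds the RTCPeerConnection sender. The sketch below encodes only that decision and leaves out the Web Audio wiring; the state fields and the function name are hypothetical, not taken from `noise-reduction.manager.ts`.

```typescript
type OutgoingPath = "raw" | "denoised";

// Hypothetical snapshot of the inputs that influence the decision.
interface PipelineState {
  muted: boolean;
  noiseReductionEnabled: boolean; // the user's toggle
  workletLoaded: boolean; // false if loading rnnoise-worklet.js failed
}

// Decide whether the sender should be fed by the worklet path or the raw mic.
function selectOutgoingPath(state: PipelineState): OutgoingPath {
  if (state.muted) return "raw"; // mute wins: the worklet path is disabled
  if (!state.noiseReductionEnabled) return "raw"; // user turned it off
  if (!state.workletLoaded) return "raw"; // load failure: pass through unchanged
  return "denoised";
}
```

Keeping the decision separate from the audio graph would make the mute-over-noise-reduction invariant checkable without an `AudioContext`.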
## Technical implementation

- **Envelope types**: see [websocket-envelopes](./websocket-envelopes.md).
- **Signaling adapter (renderer)**: `toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.ts` (and `signaling-transport-handler.ts`).
- **Peer-connection manager**: `toju-app/src/app/infrastructure/realtime/peer-connection-manager/` — `create-peer-connection.ts`, recovery (grace timers, reconnect), data-channel plumbing.
- **Voice settings**: `domains/voice-session/infrastructure/util/voice-settings-storage.util.ts`.
- **Noise reduction**: `infrastructure/realtime/media/noise-reduction.manager.ts`.
- **Worklet asset**: `toju-app/public/rnnoise-worklet.js`.
- **Server side**: signaling only — `server/src/websocket/handler.ts::forwardRtcMessage`.

## Invariants

- The server forwards `offer` / `answer` / `ice_candidate` / `direct-call` envelopes opaquely and never persists media or call state.
- Switching voice rooms always tears down the prior session before starting the new one.
- Mute overrides noise reduction (the manager disables the worklet path when muted).
- Direct-call events with the local user absent from `participantIds` are ignored.

## Testing

- `toju-app/src/app/infrastructure/realtime/signaling/signaling-message-handler.spec.ts` — `user_left` peer preservation under active voice.
- `toju-app/src/app/infrastructure/realtime/peer-connection-manager/recovery/peer-recovery.spec.ts` — reconnect, grace timers, exponential backoff.
- `toju-app/src/app/infrastructure/realtime/peer-connection-manager/messaging/data-channel.spec.ts`.
- `toju-app/src/app/domains/direct-call/application/services/direct-call.service.spec.ts`.
- E2E: `e2e/tests/voice/multi-signal-eight-user-voice.spec.ts`, `e2e/tests/voice/direct-call.spec.ts` (verify exact filenames in the suite — TODO).
## Security considerations

- Once connected, WebRTC traffic bypasses the server entirely, so peer IPs may be exposed to other participants via ICE candidates. This is the standard WebRTC privacy caveat.
- Signaling envelopes may be forwarded without verifying that source and target share a server. TODO: confirm whether `forwardRtcMessage` enforces membership.
- The data channel `chat` carries P2P text payloads; integrity / authentication of those payloads is owned by the chat/direct-message domains, not by this area.
- RNNoise runs entirely client-side; mic audio never leaves the local AudioContext until it enters the encrypted RTCPeerConnection.
## Performance considerations

- Mesh topology: N×(N-1)/2 peer connections per voice room in total, N-1 on each client. The practical ceiling is bound by client CPU and uplink; there is no documented soft cap.
- Bitrate is client-controlled (32–256 kbps); no server-enforced QoS.
- Voice activity detection runs at fftSize 256 with an 8-frame grace period — chosen to minimise CPU while staying responsive to natural speech.
- The signaling server's only cost is envelope forwarding (O(1) per envelope).
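The mesh cost above can be made concrete with a little arithmetic. The sketch below computes the room-wide and per-client connection counts plus a rough per-client audio uplink. It assumes every client sends one audio stream to each peer at the same bitrate and ignores RTP/DTLS overhead and screen-share traffic; the function names are illustrative only.

```typescript
// Total peer connections in a full mesh of `participants` clients.
function meshConnectionCount(participants: number): number {
  if (participants < 2) return 0;
  return (participants * (participants - 1)) / 2;
}

// Peer connections held by each individual client.
function connectionsPerClient(participants: number): number {
  return Math.max(0, participants - 1);
}

// Rough audio uplink per client: one outgoing stream per peer.
function uplinkKbpsPerClient(participants: number, bitrateKbps: number): number {
  return connectionsPerClient(participants) * bitrateKbps;
}

console.log(meshConnectionCount(8)); // 28
console.log(connectionsPerClient(8)); // 7
console.log(uplinkKbpsPerClient(8, 64)); // 448
```

Doubling the room from 8 to 16 participants roughly quadruples the total connection count (28 to 120) while the per-client load slightly more than doubles (7 to 15), which is why the ceiling is set by each client rather than by the signaling server.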
## Known issues and limitations

- **No SFU / MCU.** Each client's peer-connection count and upstream bandwidth grow linearly with participant count.
- **No recording or server-side mixing** for voice or screen.
- **Bitrate is not enforced server-side** — adversarial clients could ignore the suggested range.
- **No documented call-quality telemetry pipeline.**
## Related features

- **[websocket-envelopes](./websocket-envelopes.md)** — owns the wire types this area consumes.
- **[ipc-bridge](./ipc-bridge.md)** — `getSources` and the Linux audio-routing methods are used by screen-share.
- **[plugin-system](./plugin-system.md)** — plugins may participate as observers via `voice_state` broadcasts (subject to capability grants); no direct call control surface today.

## Changelog

| Date | Change |
|------|--------|
| 2026-05-25 | Initial documentation |