CHAPTER 04 — SOURCE ANALYSIS
Voice Mode Deep Dive
Claude Code's voice mode hides a sophisticated audio pipeline — Deepgram STT proxied through an Anthropic privacy relay, three distinct input modes, and a resilient reconnect strategy. Here's exactly how it works, traced from the v2.1.88 source.
What is Voice Mode?
Voice Mode lets you speak to Claude Code instead of typing. What looks like a simple microphone button actually involves six distinct stages of audio processing — from WebAudio capture to TTS playback — with Anthropic acting as a privacy relay between your microphone and the Deepgram speech recognition service.
The key discovery: your audio never goes directly to Deepgram. It streams through wss://ws.anthropic.com/v1/audio first, keeping your voice data inside Anthropic's infrastructure.
The Audio Pipeline
The Anthropic WebSocket relay at wss://ws.anthropic.com/v1/audio means Deepgram only receives audio that has already transited Anthropic's infrastructure. Anthropic controls the data flow and can apply its usage policies before any third-party STT service sees your voice data.
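The source excerpt does not document the relay's exact wire framing, but Deepgram's streaming API conventionally accepts raw 16-bit little-endian PCM, so a client would convert WebAudio's Float32 samples before sending each binary frame. A minimal sketch of that conversion (the function name and framing are assumptions, only the endpoint URL comes from the source):

```typescript
/** Convert Float32 samples in [-1, 1] to 16-bit signed little-endian PCM. */
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale into the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true /* little-endian */);
  }
  return out;
}

// Illustrative streaming loop (not from the source): each converted chunk
// becomes one binary WebSocket frame to the Anthropic relay.
// const ws = new WebSocket("wss://ws.anthropic.com/v1/audio");
// ws.onopen = () => ws.send(floatTo16BitPCM(chunk));
```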
Three Input Modes
Claude Code supports three distinct ways to trigger voice capture.
The AnalyserNode in AudioCapture.ts monitors audio energy levels continuously. When energy drops below a threshold for ~1.5 seconds (configurable), the utterance is considered complete. Fully hands-free.
Pro: completely hands-free, the most natural feel
Con: background noise can cause premature cutoff
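The VAD logic described above can be sketched as an RMS energy check feeding a silence timer. The real VoiceActivityDetector.ts reads an AnalyserNode from the WebAudio graph; here the threshold value, frame timing, and class shape are assumptions for illustration:

```typescript
/** Root-mean-square energy of one audio frame. */
function rmsEnergy(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

/** Fires once energy stays below a threshold for a configurable hold time. */
class SilenceDetector {
  private silentMs = 0;
  constructor(
    private threshold = 0.01, // assumed energy floor, not from the source
    private holdMs = 1500,    // ~1.5 s of silence ends the utterance
  ) {}

  /** Feed one frame; returns true when the utterance is considered complete. */
  update(frame: Float32Array, frameMs: number): boolean {
    if (rmsEnergy(frame) < this.threshold) {
      this.silentMs += frameMs;
    } else {
      this.silentMs = 0; // any speech resets the silence timer
    }
    return this.silentMs >= this.holdMs;
  }
}
```

Resetting the timer on every loud frame is what makes background noise a liability: a short noise burst keeps the utterance open, while a quiet speaker can trip the threshold early.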
The AnalyserNode from the WebAudio API monitors RMS energy in real time.

Silent-Drop Recovery
If the WebSocket drops mid-utterance, buffered audio is replayed on reconnect with exponential backoff.
Buffered audio preserved — replay on next attempt
Buffered audio replayed — transcript delivered as normal
If all 4 retries fail, the partial transcript captured up to the disconnect point is used as-is. The user sees the incomplete transcription and can retry manually.
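The recovery loop above can be sketched as exponential backoff around a connect-then-replay attempt. The source establishes only the backoff strategy, the four-retry limit, and buffered-audio replay; the base delay, cap, and helper names below are assumptions:

```typescript
/** Delay before retry `attempt` (0-based): doubles each time, capped. */
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

async function reconnectWithReplay(
  connect: () => Promise<void>, // opens a fresh WebSocket to the relay
  replayBuffered: () => void,   // re-sends audio buffered during the drop
  maxRetries = 4,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    try {
      await connect();
      replayBuffered(); // buffered audio survived every failed attempt
      return true;
    } catch {
      // Connection failed; keep the buffer, wait twice as long next time.
    }
  }
  return false; // caller falls back to the partial transcript
}
```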
Key Source Files
Voice mode is cleanly separated into five modules in the v2.1.88 source.
voice/VoiceSession.ts: main session manager, orchestrates all other modules
voice/AudioCapture.ts: microphone access, WebAudio resampling to 16 kHz mono PCM
voice/DeepgramConnection.ts: WebSocket client for the wss://ws.anthropic.com/v1/audio relay
voice/VoiceActivityDetector.ts: AnalyserNode silence-threshold detection for VAD mode
voice/AudioPlayback.ts: TTS audio queue and playback management
Key Discoveries
What the v2.1.88 source reveals about voice mode design decisions.
Contrary to what many assume, Claude Code uses Deepgram for speech recognition — not OpenAI's Whisper or Google STT.
Your voice data routes through wss://ws.anthropic.com/v1/audio before reaching Deepgram. Anthropic controls the data flow.
Audio capture starts immediately when you press the mic button — before the WebSocket handshake completes. No lost syllables.
Text-to-Speech uses the Anthropic API's own audio output — not ElevenLabs, AWS Polly, or any other third-party TTS service.
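The capture-before-handshake discovery implies a small buffering layer: chunks recorded while the WebSocket is still connecting are queued, then flushed in capture order once the connection opens. A sketch of that pattern (class and method names are assumed, not taken from the source):

```typescript
class PreconnectBuffer {
  private queue: Uint8Array[] = [];
  private send: ((chunk: Uint8Array) => void) | null = null;

  /** Called for every chunk, from the moment the mic button is pressed. */
  push(chunk: Uint8Array): void {
    if (this.send) this.send(chunk); // connected: stream directly
    else this.queue.push(chunk);     // still handshaking: buffer it
  }

  /** Called when the WebSocket opens; drains the backlog in capture order. */
  onOpen(send: (chunk: Uint8Array) => void): void {
    this.send = send;
    for (const chunk of this.queue) send(chunk);
    this.queue = [];
  }
}
```

Because capture starts before the handshake, the first syllables land in the queue rather than being dropped, which is exactly the "no lost syllables" behavior noted above.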