CHAPTER 04 — SOURCE ANALYSIS

Voice Mode Deep Dive

Claude Code's voice mode hides a sophisticated audio pipeline — Deepgram STT proxied through an Anthropic privacy relay, three distinct input modes, and a resilient reconnect strategy. Here's exactly how it works, traced from the v2.1.88 source.

What is Voice Mode?

Voice Mode lets you speak to Claude Code instead of typing. What looks like a simple microphone button actually involves six distinct stages of audio processing — from WebAudio capture to TTS playback — with Anthropic acting as a privacy relay between your microphone and the Deepgram speech recognition service.

The key discovery: your audio never goes directly to Deepgram. It streams through wss://ws.anthropic.com/v1/audio first, keeping your voice data inside Anthropic's infrastructure.

At a glance:

- 6 pipeline stages
- 16 kHz mono PCM audio
- 3 input modes
- 4 retries with exponential backoff

The Audio Pipeline

The pipeline runs six stages, from WebAudio microphone capture through the Anthropic relay and Deepgram transcription to TTS playback.

Privacy Architecture

The Anthropic WebSocket relay at wss://ws.anthropic.com/v1/audio means Deepgram only receives audio that has already transited Anthropic's infrastructure. Anthropic controls the data flow and can apply its usage policies before any third-party STT service sees your voice data.

Three Input Modes

Claude Code supports three distinct ways to trigger voice capture.

Voice Activity Detection
Automatic silence detection

The AnalyserNode in AudioCapture.ts monitors audio energy levels continuously. When energy drops below a threshold for ~1.5 seconds (configurable), the utterance is considered complete. Fully hands-free.

Pro: completely hands-free; the most natural feel.

Con: background noise can cause premature cutoff.

VAD Internals — VoiceActivityDetector.ts
- An AnalyserNode from the WebAudio API monitors RMS energy in real time
- Silence threshold: configurable, defaulting to ~1.5s of sub-threshold audio
- The same AnalyserNode is reused for the microphone visualizer in the UI
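The silence-detection loop described above can be sketched in pure TypeScript. This is a minimal illustration, assuming an RMS energy threshold and a ~1.5 s silence window; the class and constant names here are illustrative, not identifiers from VoiceActivityDetector.ts.

```typescript
const SILENCE_MS = 1500;        // ~1.5 s of sub-threshold audio ends the utterance
const ENERGY_THRESHOLD = 0.01;  // RMS level below which a frame counts as silence

/** RMS energy of one audio frame (the kind of data an AnalyserNode exposes). */
function rmsEnergy(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

/** Tracks how long the signal has been below threshold; reports utterance end. */
class SilenceTracker {
  private silentSince: number | null = null;

  /** Feed one frame; returns true once the utterance is considered complete. */
  update(frame: Float32Array, nowMs: number): boolean {
    if (rmsEnergy(frame) >= ENERGY_THRESHOLD) {
      this.silentSince = null;          // speech detected: reset the silence timer
      return false;
    }
    this.silentSince ??= nowMs;         // first silent frame: start timing
    return nowMs - this.silentSince >= SILENCE_MS;
  }
}
```

In the browser, the frames would come from `AnalyserNode.getFloatTimeDomainData` on each animation tick; the threshold comparison itself is independent of WebAudio, which is what makes the premature-cutoff failure mode easy to reproduce with noisy input.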

Silent-Drop Recovery

If the WebSocket drops mid-utterance, buffered audio is replayed on reconnect with exponential backoff.

WebSocket disconnects mid-utterance:

- Retry 1 (wait 1s): replay buffered audio. On failure, the buffer is preserved for the next attempt.
- Retry 2 (wait 2s): exponential backoff. Buffered audio preserved; replayed on the next attempt.
- Retry 3 (wait 4s): exponential backoff. Buffered audio preserved; replayed on the next attempt.
- Retry 4 (wait 8s): reconnected, stream continues. Buffered audio replayed; transcript delivered as normal.

Max Retries Exceeded

If all 4 retries fail, the partial transcript captured up to the disconnect point is used as-is. The user sees the incomplete transcription and can retry manually.
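The retry schedule above (1s, 2s, 4s, 8s, then give up) can be sketched as a small backoff loop. The helper names and signatures below are assumptions for illustration, not the actual DeepgramConnection.ts API.

```typescript
const BASE_DELAY_MS = 1000;  // first wait: 1 s
const MAX_RETRIES = 4;       // after 4 failures, fall back to the partial transcript

/** Exponential backoff: 1s, 2s, 4s, 8s for attempts 1..4. */
function backoffDelay(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

/**
 * Retries the connection, replaying the buffered PCM frames once a send
 * function is available. Returns false if all retries are exhausted,
 * in which case the caller uses the partial transcript as-is.
 */
async function reconnectWithReplay(
  buffer: Int16Array[],
  connect: () => Promise<(frame: Int16Array) => void>,
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    await sleep(backoffDelay(attempt));
    try {
      const send = await connect();
      buffer.forEach(send);   // buffered audio survives every failed attempt
      return true;            // transcript delivered as normal
    } catch {
      // connection failed: keep the buffer, wait longer, try again
    }
  }
  return false;
}
```

Injecting `connect` and `sleep` keeps the schedule testable without real sockets or timers, which is a common shape for this kind of reconnect logic.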

Key Source Files

Voice mode is cleanly separated into five modules in the v2.1.88 source.

voice/VoiceSession.ts

Main session manager — orchestrates all other modules

voice/AudioCapture.ts

Microphone access, WebAudio resampling to 16kHz mono PCM

voice/DeepgramConnection.ts

WebSocket client for wss://ws.anthropic.com/v1/audio relay

voice/VoiceActivityDetector.ts

AnalyserNode silence threshold detection for VAD mode

voice/AudioPlayback.ts

TTS audio queue and playback management
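The 16kHz mono PCM conversion attributed to AudioCapture.ts can be sketched as follows. This is a naive decimation from an assumed 48 kHz Float32 WebAudio input; the function name, input rate, and lack of a low-pass filter are all simplifications for illustration.

```typescript
/** Downsample 48 kHz mono Float32 samples to 16 kHz signed 16-bit PCM. */
function toPcm16k(input: Float32Array, inputRate = 48000): Int16Array {
  const ratio = inputRate / 16000;              // 3 for the common 48 kHz case
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Take every ratio-th sample, clamp to [-1, 1], scale to int16 range.
    const s = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

A production resampler would low-pass filter before decimating to avoid aliasing, but the shape of the transform (rate reduction plus float-to-int16 quantization) is what a 16 kHz mono PCM streaming format requires.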

Key Discoveries

What the v2.1.88 source reveals about voice mode design decisions.

Deepgram, not Whisper

Contrary to a common assumption, Claude Code uses Deepgram for speech recognition — not OpenAI's Whisper or Google STT.

Anthropic as privacy relay

Your voice data routes through wss://ws.anthropic.com/v1/audio before reaching Deepgram. Anthropic controls the data flow.

Pre-connection buffering

Audio capture starts immediately when you press the mic button — before the WebSocket handshake completes. No lost syllables.
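A minimal sketch of this capture-before-handshake pattern: frames queue locally from the instant the mic opens, then flush in order once the socket is ready. The class and method names are illustrative, not from the source.

```typescript
/** Queues audio frames until a transport is available, then streams directly. */
class BufferedSender {
  private queue: Int16Array[] = [];
  private send: ((frame: Int16Array) => void) | null = null;

  /** Called for every captured frame, starting the moment the mic button is pressed. */
  push(frame: Int16Array): void {
    if (this.send) this.send(frame);   // connected: stream directly
    else this.queue.push(frame);       // handshake still pending: buffer locally
  }

  /** Called once the WebSocket handshake completes. */
  onOpen(send: (frame: Int16Array) => void): void {
    this.send = send;
    this.queue.forEach(send);          // flush in capture order: no lost syllables
    this.queue = [];
  }
}
```

The ordering guarantee is the point: everything spoken during the handshake reaches the STT stream before any live frame, so the transcript starts at the first syllable rather than at connection time.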

TTS is Anthropic-native

Text-to-Speech uses the Anthropic API's own audio output — not ElevenLabs, AWS Polly, or any other third-party TTS service.