Chapter 17: Voice Mode
OpenClaw's Voice Mode lets users speak to your AI agent and hear spoken responses back. Instead of typing a message, a user can send a voice note on WhatsApp or Telegram, and the agent transcribes it, processes it, and replies with both text and an audio message. This chapter explains how to set up and tune Voice Mode, including the four reply modes, all supported providers, and the built-in Talk interface.
How Voice Mode Works
User sends voice note
↓
OpenClaw downloads audio
↓
Speech-to-Text (transcription)
↓
Agent processes transcribed text
↓
Text-to-Speech (synthesis), if the mode requires it
↓
Agent sends text reply + audio message
Both directions are configurable: you can enable transcription only, synthesis only, or both.
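The flow above can be sketched as a small function that chains the stages. This is only an illustration, not OpenClaw's implementation: `VoiceReply`, `handle_voice_note`, and the stub stages are hypothetical names, and passing `synthesize=None` stands in for a mode that does not require speech output.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceReply:
    text: str                      # text reply, always present
    audio: Optional[bytes] = None  # synthesized audio, only when TTS ran


def handle_voice_note(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    run_agent: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> VoiceReply:
    """Run one voice note through the stages (hypothetical sketch)."""
    transcript = transcribe(audio_in)    # speech-to-text
    reply_text = run_agent(transcript)   # agent processes the transcript
    # Text-to-speech is optional: skip it when no synthesizer is supplied.
    audio_out = synthesize(reply_text) if synthesize else None
    return VoiceReply(text=reply_text, audio=audio_out)


# Stub stages so the sketch runs without any real provider.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.upper().encode("utf-8")


reply = handle_voice_note(b"hello", fake_transcribe, fake_agent, fake_tts)
```

Swapping a stub for a real provider call changes one argument and leaves the rest of the flow untouched.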
Channel Support
| Channel | Receives Voice | Sends Voice | Format |
|---|---|---|---|
| WhatsApp | Yes | Yes | Opus OGG (48kHz, 64kbps) |
| Telegram | Yes | Yes | Opus OGG |
| Discord | Yes | Yes | MP3 |
| Signal | Yes | Yes | OGG |
| Matrix | Yes | Yes | OGG |
| Slack | Yes | Partial (text fallback) | N/A |
| iMessage | No | No | N/A |
| Teams | No | No | N/A |
Output format: WhatsApp requires Opus-encoded OGG at 48kHz/64kbps. OpenClaw handles this encoding automatically regardless of which TTS provider you use.
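OpenClaw performs this conversion for you, so the following is only a sketch of what an equivalent manual conversion might look like. It assumes an `ffmpeg` binary built with `libopus` is on the PATH; the source does not say which encoder OpenClaw uses internally, and `opus_ogg_command` is a hypothetical helper.

```python
from typing import List


def opus_ogg_command(
    src: str,
    dst: str,
    sample_rate_hz: int = 48_000,
    bitrate_kbps: int = 64,
) -> List[str]:
    """Build an ffmpeg argument list that encodes src as Opus in an OGG file."""
    return [
        "ffmpeg", "-y",              # overwrite dst if it already exists
        "-i", src,                   # input file in any format ffmpeg can read
        "-c:a", "libopus",           # Opus audio codec
        "-ar", str(sample_rate_hz),  # output sample rate (48 kHz)
        "-b:a", f"{bitrate_kbps}k",  # target audio bitrate (64 kbps)
        dst,                         # a .ogg extension selects the OGG container
    ]


cmd = opus_ogg_command("reply.mp3", "reply.ogg")
# To actually run it: subprocess.run(cmd, check=True)
```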
The Four Voice Reply Modes
The most important voice setting is mode: it controls when the agent speaks back. The hard part is not making the agent speak, but knowing when it should.
| Mode | Behavior | When to Use |
|---|---|---|
| always | Every reply becomes a voice note | Dedicated voice-first bots |
| inbound | If the user sent a voice note, reply with voice | Recommended production default |
| tagged | Agent decides, adding a [[tts]] marker when voice is appropriate | Smart mixed-mode bots |
| off | Text only, no voice output | Default |
{
  "voice": {
    "mode": "inbound"
  }
}
The inbound mode is the recommended default because it mirrors the user's own modality: if they spoke to you, speak back; if they typed, type back.
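The four modes reduce to a small decision function. The sketch below is an illustration of the behavior described in the table, not OpenClaw's code; `should_speak` and its parameters are hypothetical names.

```python
TTS_MARKER = "[[tts]]"  # marker the agent adds in tagged mode


def should_speak(mode: str, user_sent_voice: bool, reply_text: str) -> bool:
    """Decide whether a reply should be sent as a voice note."""
    if mode == "always":
        return True                      # every reply becomes audio
    if mode == "inbound":
        return user_sent_voice           # mirror the user's modality
    if mode == "tagged":
        return TTS_MARKER in reply_text  # the agent opted in for this reply
    if mode == "off":
        return False                     # text only
    raise ValueError(f"unknown voice mode: {mode!r}")
```

In tagged mode the marker would also need to be stripped from the text before it is shown or synthesized; that step is omitted here.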
Enabling Voice Mode
{
  "voice": {
    "enabled": true,
    "mode": "inbound",
    "transcription": {
      "provider": "openai-whisper",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "whisper-1",
      "language": "auto"
    },
    "synthesis": {
      "provider": "openai-tts",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "tts-1",
      "voice": "nova",
      "speed": 1.0
    },
    "maxVoiceChars": 1500
  }
}
Character limit: Responses over maxVoiceChars (default 1,500) are automatically summarized before synthesis, or fall back to text-only. This prevents extremely long audio files that users won't sit through.
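One way to picture the character limit is as a planning step that runs before synthesis. The source does not say when OpenClaw summarizes and when it falls back to text, so the rule below (summarize if a summarizer is available and its output fits, otherwise text-only) is an assumption, and `plan_voice_output` is a hypothetical name.

```python
from typing import Callable, Optional, Tuple


def plan_voice_output(
    reply_text: str,
    max_voice_chars: int = 1500,
    summarize: Optional[Callable[[str, int], str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (delivery, text_to_speak); delivery is 'voice' or 'text-only'."""
    if len(reply_text) <= max_voice_chars:
        return "voice", reply_text          # short enough to speak as-is
    if summarize is not None:
        summary = summarize(reply_text, max_voice_chars)
        if len(summary) <= max_voice_chars:
            return "voice", summary         # speak the shortened version
    return "text-only", None                # too long: skip synthesis


# A trivial stand-in summarizer that just truncates; a real one would
# ask the agent for a shorter spoken version.
def truncate(text: str, limit: int) -> str:
    return text[:limit]


plan = plan_voice_output("x" * 2000, summarize=truncate)
```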
Transcription Providers
OpenAI Whisper
{
  "transcription": {
    "provider": "openai-whisper",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "whisper-1",
    "language": "auto"
  }
}
Set "language": "auto" to auto-detect the spoken language, or use a specific code such as "ur" for Urdu or "en" for English.
Deepgram
{
  "transcription": {
    "provider": "deepgram",
    "apiKey": "${DEEPGRAM_API_KEY}",
    "model": "nova-2",
    "language": "en-US",
    "punctuate": true,
    "smartFormat": true
  }
}
Local Whisper
Run Whisper locally for zero cost and full privacy:
{
  "transcription": {
    "provider": "whisper-local",
    "model": "base",
    "device": "cpu"
  }
}
Install local Whisper:
pip install openai-whisper
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 39M | Very fast | Basic |
| base | 74M | Fast | Good |
| small | 244M | Moderate | Better |
| medium | 769M | Slow | Very good |
| large | 1.5G | Very slow | Best |
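To check that a local model works before pointing OpenClaw at it, you can call the `whisper` package (installed by the `pip` command above) directly. This sketch is independent of OpenClaw: `transcribe_locally` and `LOCAL_MODELS` are hypothetical names, and the package also needs `ffmpeg` on the PATH to decode audio.

```python
LOCAL_MODELS = ("tiny", "base", "small", "medium", "large")  # names from the table


def transcribe_locally(path: str, model_name: str = "base", device: str = "cpu") -> str:
    """Transcribe one audio file with a locally loaded Whisper model."""
    if model_name not in LOCAL_MODELS:
        raise ValueError(f"unknown model {model_name!r}; expected one of {LOCAL_MODELS}")
    import whisper  # imported lazily so the check above works without the package

    model = whisper.load_model(model_name, device=device)  # downloads weights on first use
    result = model.transcribe(path)                        # dict with a "text" entry
    return result["text"]
```

Starting with `base` on CPU and moving up only if accuracy is insufficient keeps the first test fast.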
Text-to-Speech Providers
Microsoft Edge TTS (Free, No API Key)
The easiest way to get started. It uses Microsoft's Edge browser TTS engine, which is free and needs no API key:
{
  "synthesis": {
    "provider": "edge-tts",
    "voice": "en-US-AriaNeural",
    "speed": 1.0
  }
}
Popular Edge TTS voices:
| Voice | Language | Style |
|---|---|---|
| en-US-AriaNeural | English (US) | Friendly, natural |
| en-US-GuyNeural | English (US) | Professional |
| ur-PK-AsadNeural | Urdu | Clear |
| ar-SA-ZariyahNeural | Arabic | Warm |
| zh-CN-XiaoxiaoNeural | Chinese | Expressive |
List all available voices: openclaw voice list-voices --provider edge-tts
OpenAI TTS
{
  "synthesis": {
    "provider": "openai-tts",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "tts-1-hd",
    "voice": "nova",
    "speed": 1.0
  }
}
Available voices: alloy, echo, fable, onyx, nova, shimmer
Use tts-1-hd for higher quality at roughly twice the cost: ~$0.015/1K chars (tts-1) versus ~$0.030/1K chars (tts-1-hd).
ElevenLabs
{
  "synthesis": {
    "provider": "elevenlabs",
    "apiKey": "${ELEVENLABS_API_KEY}",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "modelId": "eleven_turbo_v2",
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}
ElevenLabs produces the most natural-sounding voices and supports voice cloning. ElevenLabs v3 (latest) offers even more expressive output.
MiniMax
Chinese provider with high-quality multilingual voices, particularly strong for Asian languages:
{
  "synthesis": {
    "provider": "minimax",
    "apiKey": "${MINIMAX_API_KEY}",
    "model": "speech-01-turbo",
    "voiceId": "Calm_Woman"
  }
}
Local TTS (Coqui)
{
  "synthesis": {
    "provider": "coqui",
    "model": "tts_models/en/ljspeech/tacotron2-DDC",
    "device": "cpu"
  }
}
Provider Cost Comparison
| Provider | Cost | Quality | API Key? |
|---|---|---|---|
| Edge TTS | Free | Good | No |
| OpenAI tts-1 | $0.015/1K chars | Great | Yes |
| OpenAI tts-1-hd | $0.030/1K chars | Excellent | Yes |
| ElevenLabs | $0.18/1K chars | Best | Yes |
| MiniMax | ~$0.01/1K chars | Very good | Yes |
| Coqui (local) | Free (CPU) | Moderate | No |
| OpenAI Whisper | $0.006/min | N/A | Yes |
| Deepgram Nova | $0.0043/min | N/A | Yes |
| Local Whisper | Free | Good | No |
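The per-unit prices in the table are easier to compare as a monthly total. The sketch below is a back-of-the-envelope estimator using the table's figures (the MiniMax price is approximate in the source); the dictionary keys and `monthly_cost` are hypothetical labels, not OpenClaw identifiers.

```python
# Prices copied from the table above.
TTS_USD_PER_1K_CHARS = {
    "edge-tts": 0.0,
    "openai-tts-1": 0.015,
    "openai-tts-1-hd": 0.030,
    "elevenlabs": 0.18,
    "minimax": 0.01,   # listed as approximate
    "coqui": 0.0,
}
STT_USD_PER_MINUTE = {
    "openai-whisper": 0.006,
    "deepgram": 0.0043,
    "whisper-local": 0.0,
}


def monthly_cost(
    tts: str,
    stt: str,
    exchanges: int,
    avg_reply_chars: float,
    avg_note_minutes: float,
) -> float:
    """Estimate USD per month for a number of voice-in, voice-out exchanges."""
    synthesis = exchanges * avg_reply_chars / 1000 * TTS_USD_PER_1K_CHARS[tts]
    transcription = exchanges * avg_note_minutes * STT_USD_PER_MINUTE[stt]
    return round(synthesis + transcription, 2)


# 1,000 exchanges a month, 500-character replies, 30-second voice notes.
estimate = monthly_cost("openai-tts-1", "openai-whisper", 1000, 500, 0.5)
```

With these inputs, synthesis accounts for 500K characters and transcription for 500 minutes, so synthesis dominates the total for every paid TTS provider in the table.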
Talk Mode (Control UI)
In addition to messaging channel voice, OpenClaw supports real-time Talk Mode in the browser Control UI: a live voice conversation directly in your browser, similar to ChatGPT's Advanced Voice Mode.
Access it at http://127.0.0.1:18789/ → Talk tab.
Supported Talk providers:
- OpenAI Realtime API
- Google Live API
{
  "voice": {
    "talk": {
      "enabled": true,
      "provider": "openai-realtime",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "gpt-4o-realtime-preview",
      "voice": "alloy"
    }
  }
}
Talk Mode uses a browser WebSocket connection directly to the provider's realtime API, so responses feel instantaneous compared to TTS synthesis.
User Commands for Voice
| Command | Effect |
|---|---|
| /voice on | Enable voice replies for this session |
| /voice off | Disable voice replies, text only |
| /voice mode inbound | Switch to inbound mode |
| /voice mode always | Switch to always-on voice |
| /voice speed 1.2 | Set playback speed (0.5 to 2.0) |
| /tts latest | Switch to the latest available TTS model |
| /transcribe | Show transcription of the last voice note without responding |
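The `/voice` commands amount to small, validated updates to per-session settings. The sketch below illustrates that idea only: the session keys and `apply_voice_command` are hypothetical, and accepting `tagged` and `off` as mode arguments is an assumption drawn from the reply-mode table, since the command table shows only `inbound` and `always`.

```python
from typing import Dict

VALID_MODES = {"always", "inbound", "tagged", "off"}


def apply_voice_command(command: str, session: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the session settings updated by one /voice command."""
    parts = command.strip().split()
    if not parts or parts[0] != "/voice":
        raise ValueError("not a /voice command")
    args = parts[1:]
    updated = dict(session)  # leave the caller's dict untouched

    if args == ["on"]:
        updated["enabled"] = True
    elif args == ["off"]:
        updated["enabled"] = False
    elif len(args) == 2 and args[0] == "mode":
        if args[1] not in VALID_MODES:
            raise ValueError(f"unknown mode: {args[1]!r}")
        updated["mode"] = args[1]
    elif len(args) == 2 and args[0] == "speed":
        speed = float(args[1])          # raises ValueError on non-numeric input
        if not 0.5 <= speed <= 2.0:     # range given in the command table
            raise ValueError("speed must be between 0.5 and 2.0")
        updated["speed"] = speed
    else:
        raise ValueError(f"unsupported command: {command!r}")
    return updated
```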
Per-Workspace Voice Config
Override voice settings per workspace:
{
  "workspaces": [
    {
      "id": "voice-first",
      "agent": "balanced",
      "voice": {
        "mode": "always",
        "synthesis": {
          "provider": "edge-tts",
          "voice": "en-US-AriaNeural"
        },
        "maxVoiceChars": 800
      }
    },
    {
      "id": "text-only",
      "agent": "fast",
      "voice": {
        "mode": "off"
      }
    }
  ]
}
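A workspace override can be thought of as a merge of the workspace's voice block over the global one. The source does not state whether OpenClaw merges nested keys or replaces whole sections, so the recursive merge below is an assumption; `merge_voice_config` and the sample dictionaries are hypothetical.

```python
from typing import Any, Dict


def merge_voice_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay override on base without mutating either dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_voice_config(merged[key], value)  # merge nested section
        else:
            merged[key] = value                                   # override wins
    return merged


GLOBAL_VOICE = {
    "enabled": True,
    "mode": "inbound",
    "synthesis": {"provider": "openai-tts", "voice": "nova", "speed": 1.0},
    "maxVoiceChars": 1500,
}
VOICE_FIRST_OVERRIDE = {
    "mode": "always",
    "synthesis": {"provider": "edge-tts", "voice": "en-US-AriaNeural"},
    "maxVoiceChars": 800,
}

effective = merge_voice_config(GLOBAL_VOICE, VOICE_FIRST_OVERRIDE)
```

Under this assumption, keys the workspace does not mention (such as `speed`) keep their global values. If overrides instead replace whole sections, the workspace block would need to repeat every key it relies on.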
Next: Chapter 18: Live Canvas. How to use OpenClaw's shared drawing and diagramming surface for visual collaboration.