Chapter 17: Voice Mode

OpenClaw's Voice Mode lets users speak to your AI agent and hear spoken responses back. Instead of typing a message, a user can send a voice note on WhatsApp or Telegram; the agent transcribes it, processes it, and replies with both text and an audio message. This chapter explains how to set up and tune Voice Mode, including the four reply modes, the supported transcription and synthesis providers, and the built-in Talk interface.

How Voice Mode Works

User sends voice note
        ↓
OpenClaw downloads audio
        ↓
Speech-to-Text (transcription)
        ↓
Agent processes transcribed text
        ↓
Text-to-Speech (synthesis) — if mode requires it
        ↓
Agent sends text reply + audio message

Both directions are configurable: you can enable transcription only, synthesis only, or both.
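
The flow above can be sketched in a few lines of Python. This is only an illustration of the stages, not OpenClaw's actual implementation: the function name handle_voice_note, the VoiceReply type, and the three callables are all hypothetical, and passing None for a stage stands in for disabling it in the config.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceReply:
    text: str
    audio: Optional[bytes] = None  # None means a text-only reply


def handle_voice_note(
    audio_in: bytes,
    transcribe: Optional[Callable[[bytes], str]],
    run_agent: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]],
) -> Optional[VoiceReply]:
    """Hypothetical pipeline: speech-to-text, agent, then optional text-to-speech."""
    if transcribe is None:
        # Transcription disabled: the voice note cannot be turned into a prompt.
        return None
    prompt = transcribe(audio_in)
    reply_text = run_agent(prompt)
    # Synthesis is optional; without it the reply stays text-only.
    audio_out = synthesize(reply_text) if synthesize is not None else None
    return VoiceReply(text=reply_text, audio=audio_out)
```

Calling it with synthesize=None models a transcription-only setup: the user can speak, but the agent always answers in text.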

Channel Support

Channel	Receives Voice	Sends Voice	Format
WhatsApp	Yes	Yes	Opus OGG (48kHz, 64kbps)
Telegram	Yes	Yes	Opus OGG
Discord	Yes	Yes	MP3
Signal	Yes	Yes	OGG
Matrix	Yes	Yes	OGG
Slack	Yes	Partial (text fallback)	—
iMessage	No	No	—
Teams	No	No	—

Output format: WhatsApp requires Opus-encoded OGG at 48kHz/64kbps. OpenClaw handles this encoding automatically regardless of which TTS provider you use.
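
OpenClaw performs this conversion for you, so nothing below is required. For readers who want to see what the target format means in practice, here is a rough equivalent using the ffmpeg command-line tool. The helper name opus_ogg_command is hypothetical, the mono setting is an assumption, and this is not necessarily how OpenClaw encodes audio internally.

```python
from typing import List


def opus_ogg_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that encodes src as Opus-in-OGG at 48 kHz, 64 kbps."""
    return [
        "ffmpeg",
        "-y",               # overwrite dst if it already exists
        "-i", src,          # input file in any format ffmpeg can decode
        "-c:a", "libopus",  # Opus audio codec
        "-b:a", "64k",      # 64 kbps target bitrate
        "-ar", "48000",     # 48 kHz sample rate
        "-ac", "1",         # mono (assumption; typical for voice notes)
        dst,
    ]
```

With ffmpeg installed, the list can be passed to subprocess.run(..., check=True) to convert, for example, an MP3 from a TTS provider into reply.ogg.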

The Four Voice Reply Modes

The most important voice setting is mode, which controls when the agent replies with audio. Making the agent speak is easy; the hard part is deciding when it should.

Mode	Behavior	When to Use
`always`	Every reply becomes a voice note	Dedicated voice-first bots
`inbound`	If the user sent a voice note, reply with voice	Recommended for production
`tagged`	Agent decides — adds `[[tts]]` marker when voice is appropriate	Smart mixed-mode bots
`off`	Text only, no voice output	Default

{
  "voice": {
    "mode": "inbound"
  }
}

The inbound mode is the recommended setting for production because it mirrors the user's own modality: if they spoke to you, speak back; if they typed, type back. Note that the default mode is off, so you must set inbound explicitly.
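
The four modes reduce to a small decision function. The sketch below is one reading of the table, not OpenClaw's source: the names should_speak and strip_marker are hypothetical, and removing the [[tts]] marker before sending is an assumed behavior.

```python
TTS_MARKER = "[[tts]]"


def should_speak(mode: str, inbound_was_voice: bool, reply_text: str) -> bool:
    """Return True if the reply should be sent as a voice note under the given mode."""
    if mode == "always":
        return True
    if mode == "inbound":
        return inbound_was_voice
    if mode == "tagged":
        # The agent opts in per reply by including the marker.
        return TTS_MARKER in reply_text
    if mode == "off":
        return False
    raise ValueError(f"unknown voice mode: {mode!r}")


def strip_marker(reply_text: str) -> str:
    """Remove the [[tts]] marker so it is never shown or spoken (assumed behavior)."""
    return reply_text.replace(TTS_MARKER, "").strip()
```

Under inbound, a typed question gets a typed answer even if the agent's reply is short enough to speak; under tagged, the same reply is spoken only when the agent marked it.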

Enabling Voice Mode

{
  "voice": {
    "enabled": true,
    "mode": "inbound",
    "transcription": {
      "provider": "openai-whisper",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "whisper-1",
      "language": "auto"
    },
    "synthesis": {
      "provider": "openai-tts",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "tts-1",
      "voice": "nova",
      "speed": 1.0
    },
    "maxVoiceChars": 1500
  }
}

Character limit: Responses longer than maxVoiceChars (default 1,500 characters) are automatically summarized before synthesis, or are sent as text only. This avoids audio messages too long for users to sit through.
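
One plausible way to implement the character limit is shown below. The function name text_for_synthesis and the summarize callable are hypothetical, and the order of operations (try a summary first, then fall back to text only) is an assumption about how the two outcomes relate.

```python
from typing import Callable, Optional


def text_for_synthesis(
    reply: str,
    max_voice_chars: int = 1500,
    summarize: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the text to speak, or None if the reply should be sent as text only."""
    if len(reply) <= max_voice_chars:
        return reply
    if summarize is not None:
        summary = summarize(reply)
        # Only use the summary if it actually fits within the limit.
        if len(summary) <= max_voice_chars:
            return summary
    return None  # too long to speak: fall back to a text-only reply
```

A None result means the caller skips synthesis entirely and sends the full reply as text.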

Transcription Providers

OpenAI Whisper

{
  "transcription": {
    "provider": "openai-whisper",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "whisper-1",
    "language": "auto"
  }
}

Set "language": "auto" to auto-detect the spoken language, or pass a specific language code such as "ur" for Urdu or "en" for English.

Deepgram

{
  "transcription": {
    "provider": "deepgram",
    "apiKey": "${DEEPGRAM_API_KEY}",
    "model": "nova-2",
    "language": "en-US",
    "punctuate": true,
    "smartFormat": true
  }
}

Local Whisper

Run Whisper locally for zero API cost and full privacy, since audio never leaves your machine:

{
  "transcription": {
    "provider": "whisper-local",
    "model": "base",
    "device": "cpu"
  }
}

Install local Whisper:

pip install openai-whisper

Model	Parameters	Speed	Accuracy
`tiny`	39M	Very fast	Basic
`base`	74M	Fast	Good
`small`	244M	Moderate	Better
`medium`	769M	Slow	Very good
`large`	1550M	Very slow	Best
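
To check that the local install works independently of OpenClaw, the openai-whisper package can be called directly. The sketch below uses the package's load_model and transcribe functions; the helper names are hypothetical, and mapping the config value "auto" to None (Whisper's way of requesting language detection) is an assumption about how OpenClaw treats that setting. The package also needs ffmpeg available on the system path.

```python
from typing import Optional


def whisper_language(config_value: str) -> Optional[str]:
    """Map the config's "auto" to None, which makes Whisper detect the language."""
    return None if config_value == "auto" else config_value


def transcribe_file(path: str, model_name: str = "base", language: str = "auto") -> str:
    """Transcribe an audio file on the CPU with the openai-whisper package."""
    import whisper  # imported lazily so the helper above works without the package

    model = whisper.load_model(model_name, device="cpu")
    # fp16=False avoids the half-precision warning when running on CPU.
    result = model.transcribe(path, language=whisper_language(language), fp16=False)
    return result["text"].strip()
```

Running transcribe_file on a saved voice note with model_name set to "tiny" and then "small" is a quick way to compare the speed and accuracy trade-off from the table on your own hardware.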

Text-to-Speech Providers

Microsoft Edge TTS (Free — No API Key)

Edge TTS is the easiest way to get started. It uses the TTS engine from Microsoft's Edge browser and is free, with no API key required:

{
  "synthesis": {
    "provider": "edge-tts",
    "voice": "en-US-AriaNeural",
    "speed": 1.0
  }
}

Popular Edge TTS voices:

Voice	Language	Style
`en-US-AriaNeural`	English (US)	Friendly, natural
`en-US-GuyNeural`	English (US)	Professional
`ur-PK-AsadNeural`	Urdu	Clear
`ar-SA-ZariyahNeural`	Arabic	Warm
`zh-CN-XiaoxiaoNeural`	Chinese	Expressive

List all available voices: openclaw voice list-voices --provider edge-tts
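
To audition a voice outside OpenClaw, the community edge-tts Python package (pip install edge-tts) exposes the same voice names. Whether OpenClaw uses this package internally is not stated here, so treat the following as a standalone sketch. The helper names are hypothetical; the package expresses speed as a signed percentage string rather than a multiplier, so the first function converts between the two.

```python
def speed_to_rate(speed: float) -> str:
    """Convert a speed multiplier such as 1.2 into an edge-tts rate string such as "+20%"."""
    percent = round((speed - 1.0) * 100)
    return f"{percent:+d}%"


async def synthesize(text: str, voice: str, speed: float, out_path: str) -> None:
    """Synthesize text with the edge-tts package and save the audio to out_path."""
    import edge_tts  # imported lazily; requires: pip install edge-tts

    communicate = edge_tts.Communicate(text, voice, rate=speed_to_rate(speed))
    await communicate.save(out_path)  # edge-tts writes MP3 audio
```

A call such as asyncio.run(synthesize("Hello there", "en-US-AriaNeural", 1.0, "hello.mp3")) writes a sample you can play back before committing the voice to your config.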

OpenAI TTS

{
  "synthesis": {
    "provider": "openai-tts",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "tts-1-hd",
    "voice": "nova",
    "speed": 1.0
  }
}

Available voices: alloy, echo, fable, onyx, nova, shimmer

Use tts-1-hd for higher quality at twice the cost: ~$0.015/1K chars for tts-1 versus ~$0.030/1K chars for tts-1-hd.

ElevenLabs

{
  "synthesis": {
    "provider": "elevenlabs",
    "apiKey": "${ELEVENLABS_API_KEY}",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "modelId": "eleven_turbo_v2",
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}

ElevenLabs produces the most natural-sounding voices of the providers listed here and supports voice cloning. The newer ElevenLabs v3 model offers even more expressive output.

MiniMax

MiniMax is a Chinese provider with high-quality multilingual voices, particularly strong for Asian languages:

{
  "synthesis": {
    "provider": "minimax",
    "apiKey": "${MINIMAX_API_KEY}",
    "model": "speech-01-turbo",
    "voiceId": "Calm_Woman"
  }
}

Local TTS (Coqui)

{
  "synthesis": {
    "provider": "coqui",
    "model": "tts_models/en/ljspeech/tacotron2-DDC",
    "device": "cpu"
  }
}

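Provider Cost Comparison

Text-to-speech providers are priced per 1,000 characters synthesized; transcription providers (the last three rows) are priced per minute of audio.

Provider	Cost	Quality	API Key?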
Edge TTS	Free	Good	No
OpenAI tts-1	$0.015/1K chars	Great	Yes
OpenAI tts-1-hd	$0.030/1K chars	Excellent	Yes
ElevenLabs	$0.18/1K chars	Best	Yes
MiniMax	~$0.01/1K chars	Very good	Yes
Coqui (local)	Free (CPU)	Moderate	No
OpenAI Whisper	$0.006/min	—	Yes
Deepgram Nova	$0.0043/min	—	Yes
Local Whisper	Free	Good	No
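
The table's rates can be turned into a rough monthly estimate. The sketch below copies the figures from the table as they stand (the MiniMax rate is approximate, and provider prices change); the dictionary keys and the function name monthly_cost are hypothetical labels, not OpenClaw identifiers.

```python
# Rates copied from the comparison table (USD); keys are illustrative labels.
TTS_RATE_PER_1K_CHARS = {
    "edge-tts": 0.0,
    "openai-tts-1": 0.015,
    "openai-tts-1-hd": 0.030,
    "elevenlabs": 0.18,
    "minimax": 0.01,  # approximate in the table
    "coqui": 0.0,
}
STT_RATE_PER_MINUTE = {
    "openai-whisper": 0.006,
    "deepgram": 0.0043,
    "whisper-local": 0.0,
}


def monthly_cost(tts: str, stt: str, chars_synthesized: int, minutes_transcribed: float) -> float:
    """Estimate the combined synthesis and transcription cost in USD, rounded to cents."""
    tts_cost = TTS_RATE_PER_1K_CHARS[tts] * chars_synthesized / 1000
    stt_cost = STT_RATE_PER_MINUTE[stt] * minutes_transcribed
    return round(tts_cost + stt_cost, 2)
```

For example, 500,000 synthesized characters and 1,000 transcribed minutes in a month come to about $13.50 with OpenAI tts-1 and OpenAI Whisper, and about $96.00 if ElevenLabs replaces tts-1.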

Talk Mode (Control UI)

In addition to voice over messaging channels, OpenClaw supports real-time Talk Mode in the browser Control UI: a live voice conversation directly in your browser, similar to ChatGPT's Advanced Voice Mode.

Access it at http://127.0.0.1:18789/ → Talk tab.

Supported Talk providers:

OpenAI Realtime API
Google Live API

{
  "voice": {
    "talk": {
      "enabled": true,
      "provider": "openai-realtime",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "gpt-4o-realtime-preview",
      "voice": "alloy"
    }
  }
}

Talk Mode uses a browser WebSocket connection directly to the provider's realtime API, so responses arrive with much lower latency than the transcribe-then-synthesize pipeline used for voice notes.

User Commands for Voice

Command	Effect
`/voice on`	Enable voice replies for this session
`/voice off`	Disable voice replies, text only
`/voice mode inbound`	Switch to inbound mode
`/voice mode always`	Switch to always-on voice
`/voice speed 1.2`	Set playback speed (0.5–2.0)
`/tts latest`	Switch to the latest available TTS model
`/transcribe`	Show transcription of the last voice note without responding
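
The /voice commands in the table map naturally onto a small per-session state object. The following is a sketch of how such commands could be parsed and validated, not OpenClaw's actual handler: VoiceSession and apply_voice_command are hypothetical names, and /tts and /transcribe are left out for brevity.

```python
from dataclasses import dataclass

VALID_MODES = {"always", "inbound", "tagged", "off"}


@dataclass
class VoiceSession:
    enabled: bool = True
    mode: str = "inbound"
    speed: float = 1.0


def apply_voice_command(session: VoiceSession, command: str) -> VoiceSession:
    """Apply one /voice command to the session, raising ValueError on bad input."""
    parts = command.strip().split()
    if not parts or parts[0] != "/voice":
        raise ValueError(f"not a /voice command: {command!r}")
    args = parts[1:]
    if args == ["on"]:
        session.enabled = True
    elif args == ["off"]:
        session.enabled = False
    elif len(args) == 2 and args[0] == "mode" and args[1] in VALID_MODES:
        session.mode = args[1]
    elif len(args) == 2 and args[0] == "speed":
        speed = float(args[1])  # raises ValueError for non-numeric input
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        session.speed = speed
    else:
        raise ValueError(f"unrecognized /voice command: {command!r}")
    return session
```

Rejecting out-of-range speeds at parse time keeps an invalid value from ever reaching the TTS provider.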

Per-Workspace Voice Config

Override voice settings per workspace:

{
  "workspaces": [
    {
      "id": "voice-first",
      "agent": "balanced",
      "voice": {
        "mode": "always",
        "synthesis": {
          "provider": "edge-tts",
          "voice": "en-US-AriaNeural"
        },
        "maxVoiceChars": 800
      }
    },
    {
      "id": "text-only",
      "agent": "fast",
      "voice": {
        "mode": "off"
      }
    }
  ]
}
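
How a workspace override combines with the global voice block is worth spelling out. The sketch below assumes a recursive merge in which workspace keys win and unspecified keys are inherited from the global config; that inheritance rule is an assumption made for illustration, and merge_voice_config is a hypothetical name rather than an OpenClaw function.

```python
import copy
from typing import Any, Dict


def merge_voice_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override applied; nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections such as "synthesis" instead of replacing them.
            merged[key] = merge_voice_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Under this assumption, the voice-first workspace above would take its mode, provider, voice, and maxVoiceChars from its own block while inheriting anything it does not mention, such as a global speed setting.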

Next: Chapter 18 — Live Canvas — How to use OpenClaw's shared drawing and diagramming surface for visual collaboration.