Voice & Canvas · Chapter 17 of 33

Chapter 17: Voice Mode

OpenClaw's Voice Mode lets users speak to your AI agent and hear spoken responses back. Instead of typing a message, a user can send a voice note on WhatsApp or Telegram, and the agent transcribes it, processes it, and replies with both text and an audio message. This chapter explains how to set up and tune Voice Mode, including the four reply modes, all supported providers, and the built-in Talk interface.


How Voice Mode Works

User sends voice note
        ↓
OpenClaw downloads audio
        ↓
Speech-to-Text (transcription)
        ↓
Agent processes transcribed text
        ↓
Text-to-Speech (synthesis), if the mode requires it
        ↓
Agent sends text reply + audio message

Both directions are configurable: you can enable transcription only, synthesis only, or both.
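The flow above can be sketched as a small orchestration function. This is a minimal sketch, not OpenClaw's internals: transcribe, run_agent, and synthesize are stand-ins for real STT, agent, and TTS calls:

```python
def transcribe(audio: bytes) -> str:
    # Stub: a real implementation calls an STT provider (Whisper, Deepgram, ...).
    return "what's the weather today?"

def run_agent(text: str) -> str:
    # Stub: a real implementation runs the agent on the transcribed text.
    return f"You asked: {text}"

def synthesize(text: str) -> bytes:
    # Stub: a real implementation calls a TTS provider and returns encoded audio.
    return b"OggS"

def handle_voice_note(audio: bytes, mode: str) -> dict:
    """Full pipeline for an incoming voice note: STT -> agent -> optional TTS."""
    reply = run_agent(transcribe(audio))
    result = {"text": reply, "audio": None}
    # The incoming message was voice, so both "always" and "inbound" speak back.
    if mode in ("always", "inbound"):
        result["audio"] = synthesize(reply)
    return result

out = handle_voice_note(b"...opus bytes...", mode="inbound")
print(out["text"])
```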


Channel Support

| Channel | Receives Voice | Sends Voice | Format |
|---------|----------------|-------------|--------|
| WhatsApp | Yes | Yes | Opus OGG (48kHz, 64kbps) |
| Telegram | Yes | Yes | Opus OGG |
| Discord | Yes | Yes | MP3 |
| Signal | Yes | Yes | OGG |
| Matrix | Yes | Yes | OGG |
| Slack | Yes | Partial (text fallback) | — |
| iMessage | No | No | — |
| Teams | No | No | — |

Output format: WhatsApp requires Opus-encoded OGG at 48kHz/64kbps. OpenClaw handles this encoding automatically regardless of which TTS provider you use.
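For reference, the same encoding can be reproduced with ffmpeg. This is an illustrative sketch, not what OpenClaw runs internally; it just shows the equivalent libopus flags:

```python
def opus_encode_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command for WhatsApp-compatible Opus OGG output."""
    return [
        "ffmpeg", "-i", src,
        "-c:a", "libopus",  # Opus codec
        "-b:a", "64k",      # 64 kbps bitrate
        "-ar", "48000",     # 48 kHz sample rate
        dst,
    ]

# To actually run it:
#   subprocess.run(opus_encode_args("reply.mp3", "reply.ogg"), check=True)
print(" ".join(opus_encode_args("reply.mp3", "reply.ogg")))
```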


The Four Voice Reply Modes

The most important voice setting is mode: it controls when the agent speaks back. The hard part is not making the agent speak, but knowing when it should.

| Mode | Behavior | When to Use |
|------|----------|-------------|
| always | Every reply becomes a voice note | Dedicated voice-first bots |
| inbound | If the user sent a voice note, reply with voice | Recommended production default |
| tagged | Agent decides: adds a [[tts]] marker when voice is appropriate | Smart mixed-mode bots |
| off | Text only, no voice output | Default |
{
  "voice": {
    "mode": "inbound"
  }
}

The inbound mode is the recommended default because it mirrors the user's own modality: if they spoke to you, speak back; if they typed, type back.
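The mode logic described above boils down to a small decision function. A sketch, assuming tagged replies carry the [[tts]] marker verbatim:

```python
def should_speak(mode: str, user_sent_voice: bool, reply_text: str) -> bool:
    """Decide whether a reply should be synthesized to audio."""
    if mode == "always":
        return True
    if mode == "inbound":
        return user_sent_voice          # mirror the user's modality
    if mode == "tagged":
        return "[[tts]]" in reply_text  # the agent opted in
    return False                        # "off" (and any unknown mode)
```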


Enabling Voice Mode

{
  "voice": {
    "enabled": true,
    "mode": "inbound",
    "transcription": {
      "provider": "openai-whisper",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "whisper-1",
      "language": "auto"
    },
    "synthesis": {
      "provider": "openai-tts",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "tts-1",
      "voice": "nova",
      "speed": 1.0
    },
    "maxVoiceChars": 1500
  }
}

Character limit: Responses over maxVoiceChars (default 1,500) are automatically summarized before synthesis, or fall back to text-only. This prevents extremely long audio files that users won't sit through.
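The guard described above can be sketched as follows; prepare_for_tts and the optional summarize callback are hypothetical names, not OpenClaw's API:

```python
def prepare_for_tts(reply: str, max_chars: int = 1500, summarize=None):
    """Apply the maxVoiceChars guard before synthesis.

    If the reply fits, synthesize it as-is; if not, summarize it when a
    summarizer is available, otherwise return None to fall back to text-only.
    """
    if len(reply) <= max_chars:
        return reply
    if summarize is not None:
        return summarize(reply)[:max_chars]
    return None  # caller sends text only
```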


Transcription Providers

OpenAI Whisper

{
  "transcription": {
    "provider": "openai-whisper",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "whisper-1",
    "language": "auto"
  }
}

Set "language": "auto" to auto-detect the spoken language, or use a specific code like "ur" for Urdu, "en" for English.

Deepgram

{
  "transcription": {
    "provider": "deepgram",
    "apiKey": "${DEEPGRAM_API_KEY}",
    "model": "nova-2",
    "language": "en-US",
    "punctuate": true,
    "smartFormat": true
  }
}

Local Whisper

Run Whisper locally for zero cost and full privacy:

{
  "transcription": {
    "provider": "whisper-local",
    "model": "base",
    "device": "cpu"
  }
}

Install local Whisper:

pip install openai-whisper

| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39M | Very fast | Basic |
| base | 74M | Fast | Good |
| small | 244M | Moderate | Better |
| medium | 769M | Slow | Very good |
| large | 1.5G | Very slow | Best |

Text-to-Speech Providers

Microsoft Edge TTS (Free, No API Key)

The easiest way to get started. Uses Microsoft's Edge browser TTS engine, completely free with no API key:

{
  "synthesis": {
    "provider": "edge-tts",
    "voice": "en-US-AriaNeural",
    "speed": 1.0
  }
}

Popular Edge TTS voices:

| Voice | Language | Style |
|-------|----------|-------|
| en-US-AriaNeural | English (US) | Friendly, natural |
| en-US-GuyNeural | English (US) | Professional |
| ur-PK-AsadNeural | Urdu | Clear |
| ar-SA-ZariyahNeural | Arabic | Warm |
| zh-CN-XiaoxiaoNeural | Chinese | Expressive |

List all available voices: openclaw voice list-voices --provider edge-tts

OpenAI TTS

{
  "synthesis": {
    "provider": "openai-tts",
    "apiKey": "${OPENAI_API_KEY}",
    "model": "tts-1-hd",
    "voice": "nova",
    "speed": 1.0
  }
}

Available voices: alloy, echo, fable, onyx, nova, shimmer

Use tts-1-hd for higher quality at slightly higher cost. Cost: ~$0.015/1K chars (tts-1), ~$0.030/1K chars (tts-1-hd).
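Using the per-character rates above, it is easy to estimate synthesis spend. A small cost calculator with the prices hard-coded from the figures quoted here (check current OpenAI pricing before budgeting):

```python
# Per-1K-character prices in USD, as quoted above.
TTS_PRICE_PER_1K_CHARS = {"tts-1": 0.015, "tts-1-hd": 0.030}

def openai_tts_cost(chars: int, model: str = "tts-1") -> float:
    """Estimate OpenAI TTS cost in USD for a given character count."""
    return chars / 1000 * TTS_PRICE_PER_1K_CHARS[model]

# 100 replies of ~500 characters each:
print(round(openai_tts_cost(100 * 500, "tts-1"), 2))     # 0.75
print(round(openai_tts_cost(100 * 500, "tts-1-hd"), 2))  # 1.5
```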

ElevenLabs

{
  "synthesis": {
    "provider": "elevenlabs",
    "apiKey": "${ELEVENLABS_API_KEY}",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "modelId": "eleven_turbo_v2",
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}

ElevenLabs produces the most natural-sounding voices and supports voice cloning. ElevenLabs v3 (latest) offers even more expressive output.

MiniMax

Chinese provider with high-quality multilingual voices, particularly strong for Asian languages:

{
  "synthesis": {
    "provider": "minimax",
    "apiKey": "${MINIMAX_API_KEY}",
    "model": "speech-01-turbo",
    "voiceId": "Calm_Woman"
  }
}

Local TTS (Coqui)

{
  "synthesis": {
    "provider": "coqui",
    "model": "tts_models/en/ljspeech/tacotron2-DDC",
    "device": "cpu"
  }
}

Provider Cost Comparison

| Provider | Cost | Quality | API Key? |
|----------|------|---------|----------|
| Edge TTS | Free | Good | No |
| OpenAI tts-1 | $0.015/1K chars | Great | Yes |
| OpenAI tts-1-hd | $0.030/1K chars | Excellent | Yes |
| ElevenLabs | $0.18/1K chars | Best | Yes |
| MiniMax | ~$0.01/1K chars | Very good | Yes |
| Coqui (local) | Free (CPU) | Moderate | No |
| OpenAI Whisper (STT) | $0.006/min | — | Yes |
| Deepgram Nova (STT) | $0.0043/min | — | Yes |
| Local Whisper (STT) | Free | Good | No |

Talk Mode (Control UI)

In addition to messaging-channel voice, OpenClaw supports real-time Talk Mode in the browser Control UI: a live voice conversation directly in your browser, similar to ChatGPT's Advanced Voice Mode.

Access it at http://127.0.0.1:18789/ → Talk tab.

Supported Talk providers:

  • OpenAI Realtime API
  • Google Live API
{
  "voice": {
    "talk": {
      "enabled": true,
      "provider": "openai-realtime",
      "apiKey": "${OPENAI_API_KEY}",
      "model": "gpt-4o-realtime-preview",
      "voice": "alloy"
    }
  }
}

Talk Mode uses a browser WebSocket connection directly to the provider's realtime API, so responses feel instantaneous compared to TTS synthesis.


User Commands for Voice

| Command | Effect |
|---------|--------|
| /voice on | Enable voice replies for this session |
| /voice off | Disable voice replies, text only |
| /voice mode inbound | Switch to inbound mode |
| /voice mode always | Switch to always-on voice |
| /voice speed 1.2 | Set playback speed (0.5–2.0) |
| /tts latest | Switch to the latest available TTS model |
| /transcribe | Show transcription of the last voice note without responding |
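A sketch of how these commands might be parsed into session-setting updates; parse_voice_command and the returned dict shape are hypothetical, not OpenClaw's actual handler:

```python
def parse_voice_command(text: str):
    """Parse a /voice session command into a settings update, else None."""
    parts = text.strip().split()
    if not parts or parts[0] != "/voice":
        return None
    if parts[1:] == ["on"]:
        return {"enabled": True}
    if parts[1:] == ["off"]:
        return {"enabled": False}
    if len(parts) == 3 and parts[1] == "mode" and parts[2] in ("always", "inbound", "tagged", "off"):
        return {"mode": parts[2]}
    if len(parts) == 3 and parts[1] == "speed":
        try:
            speed = float(parts[2])
        except ValueError:
            return None
        if 0.5 <= speed <= 2.0:  # allowed playback range
            return {"speed": speed}
    return None
```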

Per-Workspace Voice Config

Override voice settings per workspace:

{
  "workspaces": [
    {
      "id": "voice-first",
      "agent": "balanced",
      "voice": {
        "mode": "always",
        "synthesis": {
          "provider": "edge-tts",
          "voice": "en-US-AriaNeural"
        },
        "maxVoiceChars": 800
      }
    },
    {
      "id": "text-only",
      "agent": "fast",
      "voice": {
        "mode": "off"
      }
    }
  ]
}
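Conceptually, a workspace's voice block is layered over the global one. A sketch of that resolution under the assumption that workspace keys win and nested objects merge recursively:

```python
def effective_voice_config(global_voice: dict, workspace_voice) -> dict:
    """Deep-merge a workspace voice override over the global voice config."""
    if not workspace_voice:
        return dict(global_voice)
    merged = dict(global_voice)
    for key, value in workspace_voice.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = effective_voice_config(merged[key], value)  # recurse
        else:
            merged[key] = value  # workspace wins
    return merged

g = {"mode": "inbound", "synthesis": {"provider": "openai-tts", "voice": "nova"}}
w = {"mode": "always", "synthesis": {"provider": "edge-tts", "voice": "en-US-AriaNeural"}}
print(effective_voice_config(g, w))
```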

Next: Chapter 18 (Live Canvas): how to use OpenClaw's shared drawing and diagramming surface for visual collaboration.