# Discord Voice Plugin for Clawdbot

Real-time voice conversations in Discord voice channels. Join a voice channel, speak, and have your words transcribed, processed by Claude, and spoken back.

## Features

- **Join/Leave Voice Channels**: Via slash commands, CLI, or agent tool
- **Voice Activity Detection (VAD)**: Automatically detects when users are speaking
- **Speech-to-Text**: Whisper API (OpenAI), Deepgram, or Local Whisper (Offline)
- **Streaming STT**: Real-time transcription with Deepgram WebSocket (~1s latency reduction)
- **Agent Integration**: Transcribed speech is routed through the Clawdbot agent
- **Text-to-Speech**: OpenAI TTS, ElevenLabs, or Kokoro (Local/Offline)
- **Audio Playback**: Responses are spoken back in the voice channel
- **Barge-in Support**: Stops speaking immediately when user starts talking
- **Auto-reconnect**: Automatic heartbeat monitoring and reconnection on disconnect

## Requirements

- Discord bot with voice permissions (Connect, Speak, Use Voice Activity)
- API keys for STT and TTS providers
- System dependencies for voice:
  - `ffmpeg` (audio processing)
  - Native build tools for `@discordjs/opus` and `sodium-native`

## Installation

### 1. Install System Dependencies

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg build-essential python3

# Fedora/RHEL
sudo dnf install ffmpeg gcc-c++ make python3

# macOS
brew install ffmpeg
```

### 2. Install via ClawdHub

```bash
clawdhub install discord-voice
```

Or manually:

```bash
cd ~/.clawdbot/extensions
git clone <repository-url> discord-voice
cd discord-voice
npm install
```

### 3. Configure in clawdbot.json

```json5
{
  plugins: {
    entries: {
      "discord-voice": {
        enabled: true,
        config: {
          sttProvider: "local-whisper",
          ttsProvider: "openai",
          ttsVoice: "nova",
          vadSensitivity: "medium",
          allowedUsers: [], // Empty = allow all users
          silenceThresholdMs: 1500,
          maxRecordingMs: 30000,
          openai: {
            apiKey: "sk-...", // Or use OPENAI_API_KEY env var
          },
        },
      },
    },
  },
}
```

### 4. Discord Bot Setup

Ensure your Discord bot has these permissions:

- **Connect** - Join voice channels
- **Speak** - Play audio
- **Use Voice Activity** - Detect when users speak

Add these to your bot's OAuth2 URL, or configure them in the Discord Developer Portal.

## Configuration

| Option                | Type     | Default           | Description                                     |
| --------------------- | -------- | ----------------- | ----------------------------------------------- |
| `enabled`             | boolean  | `true`            | Enable/disable the plugin                       |
| `sttProvider`         | string   | `"local-whisper"` | `"whisper"`, `"deepgram"`, or `"local-whisper"` |
| `streamingSTT`        | boolean  | `true`            | Use streaming STT (Deepgram only, ~1s faster)   |
| `ttsProvider`         | string   | `"openai"`        | `"openai"` or `"elevenlabs"`                    |
| `ttsVoice`            | string   | `"nova"`          | Voice ID for TTS                                |
| `vadSensitivity`      | string   | `"medium"`        | `"low"`, `"medium"`, or `"high"`                |
| `bargeIn`             | boolean  | `true`            | Stop speaking when user talks                   |
| `allowedUsers`        | string[] | `[]`              | User IDs allowed (empty = all)                  |
| `silenceThresholdMs`  | number   | `1500`            | Silence before processing (ms)                  |
| `maxRecordingMs`      | number   | `30000`           | Max recording length (ms)                       |
| `heartbeatIntervalMs` | number   | `30000`           | Connection health check interval (ms)           |
| `autoJoinChannel`     | string   | `undefined`       | Channel ID to auto-join on startup              |
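For readers who want to reason about the options in code, the table above can be written down as a type with defaults. This is only a sketch: the names `DiscordVoiceConfig`, `DEFAULT_CONFIG`, `resolveConfig`, and `isUserAllowed` are hypothetical and are not taken from the plugin's source. The provider unions contain only the values the table lists.

```typescript
// Hypothetical typing of the options table; not the plugin's actual source.
type SttProvider = "whisper" | "deepgram" | "local-whisper";
type TtsProvider = "openai" | "elevenlabs"; // only the values listed in the table
type VadSensitivity = "low" | "medium" | "high";

interface DiscordVoiceConfig {
  enabled: boolean;
  sttProvider: SttProvider;
  streamingSTT: boolean;
  ttsProvider: TtsProvider;
  ttsVoice: string;
  vadSensitivity: VadSensitivity;
  bargeIn: boolean;
  allowedUsers: string[];
  silenceThresholdMs: number;
  maxRecordingMs: number;
  heartbeatIntervalMs: number;
  autoJoinChannel?: string; // omitted = do not auto-join
}

// Defaults copied from the table.
const DEFAULT_CONFIG: DiscordVoiceConfig = {
  enabled: true,
  sttProvider: "local-whisper",
  streamingSTT: true,
  ttsProvider: "openai",
  ttsVoice: "nova",
  vadSensitivity: "medium",
  bargeIn: true,
  allowedUsers: [],
  silenceThresholdMs: 1500,
  maxRecordingMs: 30000,
  heartbeatIntervalMs: 30000,
};

// Overlay user-supplied options on top of the defaults.
function resolveConfig(user: Partial<DiscordVoiceConfig>): DiscordVoiceConfig {
  return { ...DEFAULT_CONFIG, ...user };
}

// An empty allowedUsers list means every user is allowed.
function isUserAllowed(config: DiscordVoiceConfig, userId: string): boolean {
  return config.allowedUsers.length === 0 || config.allowedUsers.indexOf(userId) !== -1;
}
```

With this sketch, `resolveConfig({ sttProvider: "deepgram" })` keeps every other default, and `isUserAllowed` only starts rejecting users once `allowedUsers` is non-empty.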

### Provider Configuration

#### OpenAI (Whisper + TTS)

```json5
{
  openai: {
    apiKey: "sk-...",
    whisperModel: "whisper-1",
    ttsModel: "tts-1",
  },
}
```

#### ElevenLabs (TTS only)

```json5
{
  elevenlabs: {
    apiKey: "...",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
    modelId: "eleven_multilingual_v2",
  },
}
```

#### Deepgram (STT only)

```json5
{
  deepgram: {
    apiKey: "...",
    model: "nova-2",
  },
}
```

## Usage

### Slash Commands (Discord)

Once registered with Discord, use these commands:

- `/discord_voice join <channel>` - Join a voice channel
- `/discord_voice leave` - Leave the current voice channel
- `/discord_voice status` - Show voice connection status

### CLI Commands

```bash
# Join a voice channel
clawdbot discord_voice join <channelId>

# Leave voice
clawdbot discord_voice leave --guild <guildId>

# Check status
clawdbot discord_voice status
```

### Agent Tool

The agent can use the `discord_voice` tool:

```
Join voice channel 1234567890
```

The tool supports actions:

- `join` - Join a voice channel (requires channelId)
- `leave` - Leave voice channel
- `speak` - Speak text in the voice channel
- `status` - Get current voice status
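The four actions above can be modelled as a discriminated union, which makes the "`join` requires `channelId`" rule explicit. This is an illustrative sketch only: the type `VoiceToolAction`, the function `parseVoiceToolCall`, and the `text` field for `speak` are assumptions, since the tool's real parameter schema is not shown here.

```typescript
// Hypothetical shape of a discord_voice tool call; field names other than
// `channelId` are assumptions made for illustration.
type VoiceToolAction =
  | { action: "join"; channelId: string }
  | { action: "leave" }
  | { action: "speak"; text: string }
  | { action: "status" };

interface RawToolInput {
  action?: string;
  channelId?: string;
  text?: string;
}

// Validate loosely typed input and narrow it to one of the four actions.
function parseVoiceToolCall(input: RawToolInput): VoiceToolAction {
  switch (input.action) {
    case "join":
      if (!input.channelId) throw new Error("join requires channelId");
      return { action: "join", channelId: input.channelId };
    case "leave":
      return { action: "leave" };
    case "speak":
      if (!input.text) throw new Error("speak requires text");
      return { action: "speak", text: input.text };
    case "status":
      return { action: "status" };
    default:
      throw new Error("unknown action: " + String(input.action));
  }
}
```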

## How It Works

1. **Join**: Bot joins the specified voice channel
2. **Listen**: VAD detects when users start/stop speaking
3. **Record**: Audio is buffered while user speaks
4. **Transcribe**: On silence, audio is sent to STT provider
5. **Process**: Transcribed text is sent to Clawdbot agent
6. **Synthesize**: Agent response is converted to audio via TTS
7. **Play**: Audio is played back in the voice channel
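The steps above can be sketched as two small pieces: a rule for deciding when a recording is finished (steps 3 and 4) and a function that chains transcription, agent, TTS, and playback (steps 4 to 7). Everything named here (`VoiceProviders`, `shouldFinalize`, `handleUtterance`) is hypothetical, and the providers are injected as plain functions rather than real API clients. Skipping empty transcripts is an assumption, not documented behaviour.

```typescript
// Hypothetical provider hooks; real STT/TTS clients would sit behind these.
interface VoiceProviders {
  transcribe(audio: Uint8Array): Promise<string>;
  askAgent(text: string): Promise<string>;
  synthesize(text: string): Promise<Uint8Array>;
  play(audio: Uint8Array): Promise<void>;
}

// A recording ends after enough silence, or when it hits the maximum length.
// The defaults mirror silenceThresholdMs and maxRecordingMs.
function shouldFinalize(
  nowMs: number,
  startedAtMs: number,
  lastVoiceAtMs: number,
  silenceThresholdMs: number = 1500,
  maxRecordingMs: number = 30000,
): boolean {
  const silentFor = nowMs - lastVoiceAtMs;
  const recordedFor = nowMs - startedAtMs;
  return silentFor >= silenceThresholdMs || recordedFor >= maxRecordingMs;
}

// Transcribe -> Process -> Synthesize -> Play, in that order.
async function handleUtterance(
  audio: Uint8Array,
  providers: VoiceProviders,
): Promise<string | null> {
  const text = (await providers.transcribe(audio)).trim();
  if (text.length === 0) return null; // assumption: ignore empty transcripts
  const reply = await providers.askAgent(text);
  const speech = await providers.synthesize(reply);
  await providers.play(speech);
  return reply;
}
```

Injecting the providers keeps the ordering testable without network access; each stage can be replaced by a stub.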

## Streaming STT (Deepgram)

When using Deepgram as your STT provider, streaming mode is enabled by default. This provides:

- **~1 second faster** end-to-end latency
- **Real-time feedback** with interim transcription results
- **Automatic keep-alive** to prevent connection timeouts
- **Fallback** to batch transcription if streaming fails

To use streaming STT:

```json5
{
  sttProvider: "deepgram",
  streamingSTT: true, // default
  deepgram: {
    apiKey: "...",
    model: "nova-2",
  },
}
```
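The fallback behaviour described above follows a simple pattern: try the streaming path, and if it fails, hand the same audio to the batch path. The sketch below shows only that pattern. `Transcriber` and `transcribeWithFallback` are hypothetical names, and a real streaming client would send audio in chunks over a WebSocket rather than as a single buffer.

```typescript
// Hypothetical transcriber signature; both paths are injected by the caller.
type Transcriber = (audio: Uint8Array) => Promise<string>;

interface TranscriptResult {
  text: string;
  mode: "streaming" | "batch";
}

// Prefer streaming; on any error, retry the same audio with batch STT.
async function transcribeWithFallback(
  audio: Uint8Array,
  streaming: Transcriber,
  batch: Transcriber,
): Promise<TranscriptResult> {
  try {
    const text = await streaming(audio);
    return { text: text, mode: "streaming" };
  } catch (err) {
    const text = await batch(audio);
    return { text: text, mode: "batch" };
  }
}
```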

## Barge-in Support

When enabled (default), the bot will immediately stop speaking if a user starts talking. This creates a more natural conversational flow where you can interrupt the bot.

To disable (let the bot finish speaking):

```json5
{
  bargeIn: false,
}
```
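As a rough illustration of the barge-in rule, the sketch below tracks whether the bot is currently speaking and stops playback when a user starts talking, but only if `bargeIn` is enabled. The class `BargeInController` and its method names are hypothetical, and `stopPlayback` stands in for whatever the audio player actually exposes.

```typescript
// Hypothetical barge-in controller; not the plugin's actual implementation.
class BargeInController {
  private speaking: boolean = false;
  private readonly bargeIn: boolean;
  private readonly stopPlayback: () => void;

  constructor(bargeIn: boolean, stopPlayback: () => void) {
    this.bargeIn = bargeIn;
    this.stopPlayback = stopPlayback;
  }

  onPlaybackStart(): void {
    this.speaking = true;
  }

  onPlaybackEnd(): void {
    this.speaking = false;
  }

  // Returns true when the bot was interrupted by the user.
  onUserSpeechStart(): boolean {
    if (this.bargeIn && this.speaking) {
      this.stopPlayback();
      this.speaking = false;
      return true;
    }
    return false;
  }
}
```

With `bargeIn: false`, `onUserSpeechStart()` never calls `stopPlayback`, so the bot finishes its sentence.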

## Auto-reconnect

The plugin includes automatic connection health monitoring:

- **Heartbeat checks** every 30 seconds (configurable)
- **Auto-reconnect** on disconnect with exponential backoff
- **Max 3 attempts** before giving up

If the connection drops, you'll see logs like:

```
[discord-voice] Disconnected from voice channel
[discord-voice] Reconnection attempt 1/3
[discord-voice] Reconnected successfully
```
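A reconnect loop with exponential backoff and a cap of three attempts might look like the sketch below. The 1000 ms base delay and the doubling factor are assumptions, since the README states only "exponential backoff" and "max 3 attempts". `backoffDelayMs` and `reconnectWithBackoff` are hypothetical names, and `connect`, `sleep`, and `log` are injected so the loop can run without a real voice connection.

```typescript
// Delay before the next try after a failed attempt (1-based).
// Base delay and factor are illustrative assumptions.
function backoffDelayMs(attempt: number, baseMs: number = 1000, factor: number = 2): number {
  return baseMs * Math.pow(factor, attempt - 1);
}

// Try to reconnect up to maxAttempts times, waiting longer after each failure.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  log: (line: string) => void,
  maxAttempts: number = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    log("[discord-voice] Reconnection attempt " + attempt + "/" + maxAttempts);
    try {
      await connect();
      log("[discord-voice] Reconnected successfully");
      return true;
    } catch (err) {
      // No point waiting after the final failed attempt.
      if (attempt < maxAttempts) await sleep(backoffDelayMs(attempt));
    }
  }
  return false; // gave up after maxAttempts
}
```

In production code `sleep` would typically wrap `setTimeout`; passing it in keeps the example deterministic.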

## VAD Sensitivity

- **low**: Picks up quiet speech, may trigger on background noise
- **medium**: Balanced (recommended)
- **high**: Requires louder, clearer speech
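One common way to implement levels like these is an energy threshold on each audio frame: the quieter the speech a level should accept, the lower its threshold. The sketch below illustrates that idea only. The threshold numbers are invented for the example and the plugin may detect speech differently; `RMS_THRESHOLDS`, `rms`, and `isSpeech` are hypothetical names.

```typescript
type VadLevel = "low" | "medium" | "high";

// Hypothetical RMS thresholds for samples normalised to [-1, 1].
// Following the list above, "low" accepts the quietest speech.
const RMS_THRESHOLDS: Record<VadLevel, number> = {
  low: 0.01,
  medium: 0.02,
  high: 0.05,
};

// Root-mean-square energy of one audio frame.
function rms(samples: readonly number[]): number {
  if (samples.length === 0) return 0;
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / samples.length);
}

// A frame counts as speech when its energy reaches the level's threshold.
function isSpeech(samples: readonly number[], level: VadLevel): boolean {
  return rms(samples) >= RMS_THRESHOLDS[level];
}
```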

## Troubleshooting

### "Discord client not available"

Ensure the Discord channel is configured and the bot is connected before using voice.

### Opus/Sodium build errors

Install build tools:

```bash
npm install -g node-gyp
npm rebuild @discordjs/opus sodium-native
```

### No audio heard

1. Check bot has Connect + Speak permissions
2. Check bot isn't server muted
3. Verify TTS API key is valid

### Transcription not working

1. Check STT API key is valid
2. Check audio is being recorded (see debug logs)
3. Try adjusting VAD sensitivity

### Enable debug logging

```bash
DEBUG=discord-voice clawdbot gateway start
```

## Environment Variables

| Variable             | Description                    |
| -------------------- | ------------------------------ |
| `DISCORD_TOKEN`      | Discord bot token (required)   |
| `OPENAI_API_KEY`     | OpenAI API key (Whisper + TTS) |
| `ELEVENLABS_API_KEY` | ElevenLabs API key             |
| `DEEPGRAM_API_KEY`   | Deepgram API key               |
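Since an API key can come either from the plugin config or from one of the variables above, a lookup helper might look like the following sketch. The precedence (config value first, environment variable second) is an assumption, and `resolveApiKey` is a hypothetical name. The environment is passed in as a plain object so the function stays easy to test; in Node you would pass `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Return the key from config if set, otherwise from the named env variable.
// Assumption: config takes precedence over the environment.
function resolveApiKey(configKey: string | undefined, envVar: string, env: Env): string {
  const key = configKey || env[envVar];
  if (!key) {
    throw new Error("Missing API key: set it in the plugin config or export " + envVar);
  }
  return key;
}
```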

## Limitations

- Only one voice channel per guild at a time
- Maximum recording length: 30 seconds (configurable)
- Requires stable network for real-time audio
- TTS output may have slight delay due to synthesis

## License

MIT
