# Qwen TTS

Local text-to-speech using the Qwen3-TTS-12Hz-1.7B-CustomVoice model, hosted on Hugging Face.

## Quick Start

Generate speech from text:

```bash
scripts/tts.py "Ciao, come va?" -l Italian -o output.wav
```

With voice instruction (emotion/style):

```bash
scripts/tts.py "Sono felice!" -i "Parla con entusiasmo" -l Italian -o happy.wav
```

Different speaker:

```bash
scripts/tts.py "Hello world" -s Ryan -l English -o hello.wav
```

## Installation

**First-time setup:**

```bash
cd skills/public/qwen-tts
bash scripts/setup.sh
```

This creates a local virtual environment and installs the `qwen-tts` package (~500MB).

**Note:** The first synthesis automatically downloads the ~1.7GB model from Hugging Face.

## Usage

```bash
scripts/tts.py [options] "Text to speak"
```

### Options

- `-o, --output PATH` - Output file path (default: `qwen_output.wav`)
- `-s, --speaker NAME` - Speaker voice (default: Vivian)
- `-l, --language LANG` - Language (default: Auto)
- `-i, --instruct TEXT` - Voice instruction (emotion, style, tone)
- `--list-speakers` - Show available speakers
- `--model NAME` - Model name (default: CustomVoice 1.7B)
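
The options above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a command line could be declared. It is not the actual contents of `scripts/tts.py`. The function name `build_parser` is hypothetical, and the `--model` default shown is an assumption (the full model ID from the Model Details section).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options of scripts/tts.py."""
    parser = argparse.ArgumentParser(description="Local Qwen3-TTS synthesis")
    # The text is optional so that --list-speakers can run without it.
    parser.add_argument("text", nargs="?", help="Text to speak")
    parser.add_argument("-o", "--output", default="qwen_output.wav", help="Output file path")
    parser.add_argument("-s", "--speaker", default="Vivian", help="Speaker voice")
    parser.add_argument("-l", "--language", default="Auto", help="Language")
    parser.add_argument("-i", "--instruct", default=None, help="Voice instruction")
    parser.add_argument("--list-speakers", action="store_true", help="Show available speakers")
    # Assumption: the default resolves to the full Hugging Face model ID.
    parser.add_argument("--model", default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", help="Model name")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["Ciao", "-l", "Italian"])
    print(args.speaker, args.language, args.output)
```

Making the positional `text` optional is a design choice in this sketch; a real script would still need to reject a missing text when `--list-speakers` is not given.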

### Examples

**Basic Italian speech:**

```bash
scripts/tts.py "Benvenuto nel futuro del text-to-speech" -l Italian -o welcome.wav
```

**With emotion/instruction:**

```bash
scripts/tts.py "Sono molto felice di vederti!" -i "Parla con entusiasmo e gioia" -l Italian -o happy.wav
```

**Different speaker:**

```bash
scripts/tts.py "Hello, nice to meet you" -s Ryan -l English -o ryan.wav
```

**List available speakers:**

```bash
scripts/tts.py --list-speakers
```

## Available Speakers

The CustomVoice model includes 9 premium voices:

| Speaker | Language | Description |
|---------|----------|-------------|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light nimble |
| Sohee | Korean | Warm female, rich emotion |

**Recommendation:** Use each speaker's native language for best quality, though all speakers support all 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian).
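
The recommendation to match speaker and language can be automated with a small lookup. The sketch below copies the speaker table into a dictionary; the names `SPEAKER_LANGUAGE`, `native_speakers`, and `pick_speaker` are hypothetical helpers for illustration and are not part of the `qwen-tts` package.

```python
# Native language of each built-in speaker, copied from the speaker table.
SPEAKER_LANGUAGE = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",    # Beijing dialect
    "Eric": "Chinese",     # Sichuan dialect
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def native_speakers(language: str) -> list[str]:
    """Return the speakers whose native language matches, in table order."""
    return [name for name, lang in SPEAKER_LANGUAGE.items() if lang == language]


def pick_speaker(language: str, fallback: str = "Vivian") -> str:
    """Pick the first native speaker, or the documented default if none exists."""
    matches = native_speakers(language)
    return matches[0] if matches else fallback
```

For a language with no native voice in the table, such as Italian, this sketch simply returns the default speaker, since every speaker supports all 10 languages.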

## Voice Instructions

Use `-i, --instruct` to control emotion, tone, and style:

**Italian examples:**

- `"Parla con entusiasmo"`
- `"Tono serio e professionale"`
- `"Voce calma e rilassante"`
- `"Leggi come un narratore"`

**English examples:**

- `"Speak with excitement"`
- `"Very happy and energetic"`
- `"Calm and soothing voice"`
- `"Read like a narrator"`
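
To compare several instructions on the same sentence, it can help to generate one command per style. The following sketch only assembles command lines from the documented flags and prints them; `build_command` and the `STYLES` mapping are hypothetical names introduced for illustration.

```python
import shlex


def build_command(text, output, instruct=None, language="Auto", speaker="Vivian"):
    """Assemble an argv list for scripts/tts.py using only the documented flags."""
    cmd = ["scripts/tts.py", text, "-s", speaker, "-l", language, "-o", output]
    if instruct:
        cmd += ["-i", instruct]
    return cmd


# Hypothetical style names mapped to instructions from the English examples.
STYLES = {
    "excited": "Speak with excitement",
    "calm": "Calm and soothing voice",
    "narrator": "Read like a narrator",
}

if __name__ == "__main__":
    for name, instruct in STYLES.items():
        cmd = build_command("Hello world", f"hello_{name}.wav", instruct, "English", "Ryan")
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(cmd))
```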

## Integration with OpenClaw

The script outputs the audio file path to stdout (last line), making it compatible with OpenClaw's TTS workflow:

```bash
# OpenClaw captures the output path
cd skills/public/qwen-tts
OUTPUT=$(scripts/tts.py "Ciao" -s Vivian -l Italian -o /tmp/audio.wav 2>/dev/null)
# OUTPUT = /tmp/audio.wav
```
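
The same capture can be done from Python. This is a minimal sketch, assuming the script prints the output path as the last line of stdout as described above; `last_line` and `synthesize` are hypothetical helper names, and the `cwd` default is the skill directory used elsewhere in this document.

```python
import subprocess


def last_line(stdout: str) -> str:
    """Return the last non-empty line of captured stdout, stripped."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output captured")
    return lines[-1]


def synthesize(text: str, output: str, cwd: str = "skills/public/qwen-tts") -> str:
    """Run scripts/tts.py and return the path it reports on stdout."""
    result = subprocess.run(
        ["scripts/tts.py", text, "-o", output],
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null
        text=True,
        check=True,                 # raise CalledProcessError on a non-zero exit
    )
    return last_line(result.stdout)
```

Taking only the last non-empty line keeps the result usable even if the script prints progress messages to stdout before the path.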

## Performance

- **GPU (CUDA):** ~1-3 seconds for short phrases
- **CPU:** ~10-30 seconds for short phrases
- **Model size:** ~1.7GB (auto-downloads on first run)
- **Venv size:** ~500MB (installed dependencies)

## Troubleshooting

**Setup fails:**

```bash
# Ensure Python 3.10-3.12 is available
python3.12 --version

# Re-run setup
cd skills/public/qwen-tts
rm -rf venv
bash scripts/setup.sh
```

**Model download slow/fails:**

```bash
# Use mirror (China mainland)
export HF_ENDPOINT=https://hf-mirror.com
scripts/tts.py "Test" -o test.wav
```

**Out of memory (GPU):** The model automatically falls back to CPU if GPU memory is insufficient.
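
The source does not show how the fallback is implemented. One common pattern is to try each device in order and move on when loading fails, sketched below under that assumption. `load_with_fallback` and the `loader` callable are hypothetical; in practice `loader` would wrap whatever model-loading call the script uses.

```python
def load_with_fallback(loader, devices=("cuda", "cpu")):
    """Try loader(device) on each device in order; return (model, device)."""
    last_error = None
    for device in devices:
        try:
            return loader(device), device
        except (RuntimeError, MemoryError) as exc:
            # Assumption: the framework signals out-of-memory with RuntimeError
            # (or a subclass), as PyTorch does for CUDA allocations.
            last_error = exc
    raise RuntimeError(f"could not load on any of {devices}") from last_error
```

Catching `RuntimeError` broadly also swallows unrelated load failures on the first device, so a real implementation might inspect the exception before falling back.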

**Audio quality issues:**

- Try different speaker: `--list-speakers`
- Add instruction: `-i "Speak clearly and slowly"`
- Check language matches text: `-l Italian` for Italian text

## Model Details

- **Model:** Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- **Source:** Hugging Face (https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)
- **License:** Check model card for current license terms
- **Sample Rate:** 16kHz
- **Output Format:** WAV (uncompressed)
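
The sample rate and format of a generated file can be checked with Python's standard `wave` module. This is a minimal sketch, assuming the output is a PCM-encoded WAV file (the `wave` module does not read floating-point WAV data); `wav_info` is a hypothetical helper name.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic header fields from a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


if __name__ == "__main__":
    import sys

    # Usage: python wav_info.py output.wav
    if len(sys.argv) > 1:
        print(wav_info(sys.argv[1]))
```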
