Qwen3-TTS: Multilingual Speech Generation with Instruction-Based Voice Control

Voice interfaces are evolving from basic audio output into expressive, real-time communication.
Qwen3-TTS is an end-to-end text-to-speech model on AIOZ AI, designed for real-time, multilingual speech generation with fine-grained voice control.
It combines streaming performance, instruction-based modulation, and multiple voice-generation workflows into a single model family.
About Qwen3-TTS
Qwen3-TTS supports speech generation across 10 major languages, including Chinese, Japanese, Korean, and Russian. Users can control key aspects of speech, such as timbre, emotion, prosody, and rhythm, through natural-language instructions.
Trained on over 5 million hours of speech data across these languages, Qwen3-TTS also supports real-time voice generation and voice cloning from roughly 3 seconds of reference audio.
How Qwen3-TTS Works
Qwen3-TTS is built on a discrete multi-codebook language model architecture that enables end-to-end speech generation.
It is powered by the Qwen3-TTS-Tokenizer-12Hz, which:
- Applies efficient acoustic compression at 12 Hz
- Supports high-dimensional semantic modeling
This design enables speech generation that balances efficiency and expressive control.
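To make the 12 Hz figure concrete, here is a minimal Python sketch of how many token frames a clip of a given length would occupy. It assumes the 12 Hz in the tokenizer's name is the frame rate and that each frame carries one token per codebook. The number of codebooks is not stated here, so num_codebooks is a purely illustrative parameter, and the function names are hypothetical rather than part of any Qwen3-TTS API.

```python
import math

# Assumption: the "12Hz" in the tokenizer name is the token frame rate.
FRAME_RATE_HZ = 12


def frames_for_audio(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover duration_s seconds of audio."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * frame_rate_hz)


def tokens_for_audio(duration_s: float, num_codebooks: int) -> int:
    """Total discrete tokens, assuming one token per codebook per frame.

    num_codebooks is a hypothetical value supplied by the caller.
    """
    if num_codebooks < 1:
        raise ValueError("num_codebooks must be at least 1")
    return frames_for_audio(duration_s) * num_codebooks


if __name__ == "__main__":
    # A 3-second reference clip at 12 frames per second.
    print(frames_for_audio(3.0))
    # The same clip with an illustrative 4 codebooks.
    print(tokens_for_audio(3.0, num_codebooks=4))
```

Under these assumptions, a 3-second clip maps to 36 frames, which illustrates why a low frame rate keeps sequences short for the language model.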
Real-Time Streaming Performance
Qwen3-TTS supports streaming speech generation with:
- End-to-end latency as low as 97 ms
- A Dual-Track hybrid streaming architecture
- Support for both streaming and non-streaming generation within a single model
These features enable real-time interaction for voice-driven applications.
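For readers who want to check latency figures like these in their own setup, the following Python sketch shows one way to measure time to first audio chunk for any streaming source. It does not call Qwen3-TTS: fake_stream_tts is a hypothetical stand-in generator, and measure_stream simply times whatever iterable of audio chunks it is given.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_stream_tts(text: str, chunk_chars: int = 16) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer.

    Yields one placeholder chunk per slice of the input text; a real
    client would yield audio bytes as the model produces them.
    """
    for start in range(0, len(text), chunk_chars):
        yield text[start:start + chunk_chars].encode("utf-8")


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (time_to_first_chunk_s, total_time_s, chunk_count)."""
    start = clock()
    first_chunk_s = None
    count = 0
    for _ in chunks:
        if first_chunk_s is None:
            # Latency a listener perceives before any audio can play.
            first_chunk_s = clock() - start
        count += 1
    total_s = clock() - start
    if first_chunk_s is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_s, total_s, count


if __name__ == "__main__":
    first, total, n = measure_stream(fake_stream_tts("Hello from a streaming voice."))
    print(f"first chunk after {first * 1000:.3f} ms, {n} chunks in {total * 1000:.3f} ms")
```

In a non-streaming call, the first and total times are effectively the same, which is why time to first chunk is the more useful number for interactive applications.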
Key Capabilities
Qwen3-TTS supports a range of voice generation workflows:
- Instruction-driven speech generation
- Voice design from text descriptions
- Rapid voice cloning from ~3-second audio samples
- Multiple timbre options across gender, age, and dialect combinations
These capabilities allow users to generate speech with controlled style and expression.

Model Variants
Qwen3-TTS is released as a model family with multiple configurations:
- 0.6B and 1.7B parameter sizes
- Variants including Base, CustomVoice, and VoiceDesign
The 1.7B variants support instruction control, while the 0.6B variants are optimized for lightweight deployment.
All variants support streaming generation.
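The selection rule described above can be sketched as a small Python helper. This is only an illustration of the stated trade-off (1.7B for instruction control, 0.6B for lightweight deployment). ModelChoice and choose_model are hypothetical names, and the sketch does not assume that every size and variant combination is actually published.

```python
from dataclasses import dataclass

SIZES = ("0.6B", "1.7B")
VARIANTS = ("Base", "CustomVoice", "VoiceDesign")


@dataclass(frozen=True)
class ModelChoice:
    size: str
    variant: str
    streaming: bool = True  # all variants are described as streaming-capable


def choose_model(variant: str, needs_instruction_control: bool) -> ModelChoice:
    """Pick a parameter size for the requested variant.

    Illustrative rule only: prefer 1.7B when instruction control is
    required, otherwise fall back to the lighter 0.6B size.
    """
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    size = "1.7B" if needs_instruction_control else "0.6B"
    return ModelChoice(size=size, variant=variant)


if __name__ == "__main__":
    print(choose_model("CustomVoice", needs_instruction_control=True))
    print(choose_model("Base", needs_instruction_control=False))
```

Before relying on a particular pairing, check the published model list, since availability of each size and variant combination is an assumption here.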
Ideal Use Cases
Qwen3-TTS can be applied across a wide range of voice-first applications, including:
- Voice assistants: Real-time, multilingual, and streaming-ready.
- Dubbing and localization: Speech generation across 10 languages with natural prosody.
- Voice cloning: Rapid cloning from as little as 3 seconds of audio.
- Audiobook and narration: Expressive, long-form speech with fine-grained style control.
- Voice-enabled AI agents: Low-latency streaming for real-time conversational systems.
Its combination of streaming performance and controllable speech output makes it suitable for both interactive systems and content production workflows.
Try Qwen3-TTS on AIOZ AI
Qwen3-TTS combines multilingual support, real-time streaming, and instruction-based control within a single model family.
Explore this model on AIOZ AI today and start building voice experiences that feel more natural, responsive, and global.