Supertonic 2: On-Device Multilingual TTS for Real-Time Voice Generation

Text-to-speech is often easy to demonstrate but harder to ship at production quality when latency, privacy, and multilingual support are all required simultaneously.
Supertonic 2 is built for this exact constraint set. It is an on-device multilingual text-to-speech (TTS) model focused on fast inference and practical deployment, so you can generate speech locally without adding cloud API dependencies to your voice stack.
About Supertonic 2
Supertonic 2 is a compact 66M-parameter TTS model designed for real-time multilingual synthesis on local hardware.
It supports five languages (English, Korean, Spanish, Portuguese, and French) and correctly handles complex real-world expressions that trip up major cloud TTS APIs.
The model is designed for real-time voice generation and local deployment, such as offline assistants, accessibility tools, and enterprise workflows that require tighter control over speech data.
The core value is a single, on-device path for multilingual voice output that does not rely on any external services.
How It Works
Supertonic 2 follows a local inference workflow:
- Receive text input in a supported language
- Run synthesis through a local inference pipeline
- Generate speech output without external API calls
This architecture helps reduce latency variability and keeps audio generation closer to your product runtime.
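The three-step workflow above can be sketched as a small synthesis function. This is a minimal illustration, not the actual Supertonic 2 API: the language codes, the output sample rate, and the `run_model` hook (standing in for the on-device inference session) are all assumptions.

```python
from dataclasses import dataclass

# Languages listed in the section above, as ISO-style codes (an assumption).
SUPPORTED_LANGUAGES = {"en", "ko", "es", "pt", "fr"}

@dataclass
class SynthesisResult:
    audio: bytes        # raw audio bytes (placeholder representation)
    sample_rate: int    # assumed output rate; not a documented value

def synthesize(text: str, language: str, run_model=None) -> SynthesisResult:
    """Run the three-step local pipeline: validate the text input,
    run local inference, and return audio with no network call."""
    # Step 1: receive text input in a supported language.
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    # Step 2: run synthesis through a local inference pipeline.
    # `run_model` stands in for the on-device session; a stub is used
    # here so the sketch is self-contained and runnable.
    run_model = run_model or (lambda t, lang: b"\x00" * 16 * len(t))
    audio = run_model(text, language)
    # Step 3: audio is returned directly -- no external API call.
    return SynthesisResult(audio=audio, sample_rate=24_000)
```

In a real integration, `run_model` would wrap the loaded on-device model rather than a stub.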
Key Capabilities
Supertonic 2 combines speed, portability, and multilingual consistency to deliver the following:
- Local execution: Runs fully on-device with no external TTS service required.
- Multilingual output: Five languages share a single architecture and inference pipeline.
- Low-latency design: Synthesizes speech faster than real time with zero network dependency.
- Efficient runtime: Deploys across 11 platforms with no speed degradation across languages.
Core Technical Profile
The model includes:
- 66M parameter architecture: Compact enough for on-device deployment while delivering production-quality output.
- ONNX Runtime: Powers on-device inference across all 11 supported platforms.
- Shared pipeline: All supported languages use the same model architecture and inference path.
- PyTorch training: Model weights are trained with PyTorch and released under the BSD 3-Clause license.
- GPU and Apple Silicon: Optimized for RTX-class GPUs and M-series CPU/WebGPU runtimes.
This profile is purpose-built for teams that need a single, compact multilingual TTS path rather than separate language-specific stacks.
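Since inference runs on ONNX Runtime across the hardware listed above, an application typically picks an execution provider per device (CUDA for RTX-class GPUs, CoreML for Apple Silicon, CPU as the universal fallback). The provider names below are standard ONNX Runtime identifiers, but the preference order is an assumption, not a documented Supertonic 2 default.

```python
# Preference order is an assumption for illustration; provider names
# are standard ONNX Runtime execution provider identifiers.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",    # RTX-class NVIDIA GPUs
    "CoreMLExecutionProvider",  # Apple Silicon (M-series)
    "CPUExecutionProvider",     # universal fallback
]

def pick_provider(available: list[str]) -> str:
    """Return the first preferred provider the local runtime offers.
    In a real app, `available` would come from
    onnxruntime.get_available_providers()."""
    for provider in PREFERRED_PROVIDERS:
        if provider in available:
            return provider
    raise RuntimeError("no usable execution provider found")
```

The selected provider would then be passed to the ONNX Runtime session at load time.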
Performance Indicators
Supertonic 2 delivers leading real-time performance across hardware:
- Up to 167× faster than real-time inference
- Around 12,164 characters/second on an RTX 4090
- Around 1,263 characters/second on an M4 Pro CPU
These figures point to a model aimed at low-latency production use, not only offline experimentation.
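When validating figures like these on your own hardware, the two standard metrics are the real-time factor (seconds of audio produced per second of compute) and raw character throughput. The helpers below are generic metric calculations, not measurements tied to the numbers above.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute.
    A value above 1.0 means faster-than-real-time synthesis;
    '167x faster than real time' corresponds to a factor of 167."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis time must be positive")
    return audio_seconds / synthesis_seconds

def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput in the same units as the characters/second figures above."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis time must be positive")
    return num_chars / synthesis_seconds
```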
Best-Fit Use Cases
Supertonic 2 is especially relevant for:
- Offline voice assistants
- Real-time dubbing and narration
- Accessibility applications
- Privacy-sensitive enterprise TTS pipelines
Get Started
For teams building real-time voice products, Supertonic 2 provides a focused path to on-device multilingual TTS with fewer infrastructure dependencies.
Start with a multilingual test set and validate latency, language consistency, and output quality with your own prompts before scaling into production workflows.
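A pre-production latency check like the one suggested above can be a few lines of timing code. This is a minimal sketch: `synth` is any callable that turns text into audio (a no-op stub is used here so the harness runs on its own), and the prompts are illustrative, not an official test set.

```python
import time
from statistics import median

def measure_latency(synth, prompts: dict[str, str], runs: int = 3) -> dict[str, float]:
    """Return median synthesis latency in seconds, keyed by language."""
    results = {}
    for language, text in prompts.items():
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            synth(text, language)  # synthesize and discard the audio
            timings.append(time.perf_counter() - start)
        results[language] = median(timings)
    return results

# Illustrative multilingual prompts with date/time expressions,
# the kind of real-world input worth validating per language.
prompts = {
    "en": "The meeting starts at 3:45 p.m. on 12/05/2026.",
    "es": "La reunión empieza a las 15:45.",
    "fr": "La réunion commence à 15 h 45.",
}
latencies = measure_latency(lambda text, lang: None, prompts, runs=1)
```

Swap the stub for your actual synthesis call, then compare the per-language medians against your product's latency budget.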