Supertonic 2: On-Device Multilingual TTS for Real-Time Voice Generation

Text-to-speech is often easy to demonstrate but harder to ship at production quality when latency, privacy, and multilingual support are all required simultaneously.

Supertonic 2 is built for this exact constraint set. It is an on-device multilingual text-to-speech (TTS) model focused on fast inference and practical deployment, so you can generate speech locally without adding cloud API dependencies to your voice stack.

About Supertonic 2

Supertonic 2 is a compact 66M-parameter TTS model designed for real-time multilingual synthesis on local hardware.

It supports five languages (English, Korean, Spanish, Portuguese, and French) and is built to handle the complex real-world expressions that often trip up major cloud TTS APIs.

The model targets real-time voice generation in local deployments such as offline assistants, accessibility tools, and enterprise workflows that require tighter control over speech data.

The core value is a single on-device path for multilingual voice output with no reliance on external services.

How It Works

Supertonic 2 follows a local inference workflow:

  1. Receive text input in a supported language
  2. Run synthesis through a local inference pipeline
  3. Generate speech output without external API calls

This architecture helps reduce latency variability and keeps audio generation closer to your product runtime.
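The three-step workflow above can be sketched as a thin wrapper around any locally running synthesis backend. This is an illustrative sketch, not the actual Supertonic 2 API: the `LocalTTS` class, the `synthesize_fn` callable, and the stub backend are all hypothetical stand-ins for a real on-device model.

```python
# Illustrative sketch of a local TTS pipeline (not the actual Supertonic 2 API).
# In practice, the backend callable would wrap a locally loaded model session.

SUPPORTED_LANGUAGES = {"en", "ko", "es", "pt", "fr"}

class LocalTTS:
    def __init__(self, synthesize_fn):
        # synthesize_fn: any locally running callable mapping text -> audio samples.
        self._synthesize = synthesize_fn

    def speak(self, text: str, language: str) -> list[float]:
        # 1. Receive text input in a supported language
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {language}")
        # 2. Run synthesis through a local inference pipeline
        audio = self._synthesize(text)
        # 3. Return speech output without any external API call
        return audio

# Stub backend standing in for the on-device model.
def fake_model(text: str) -> list[float]:
    return [0.0] * (len(text) * 160)  # placeholder audio samples

tts = LocalTTS(fake_model)
samples = tts.speak("Hello, world", "en")
```

The point of the wrapper is that every step, from text validation to audio output, stays inside the process: there is no network hop to introduce latency variability.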

Key Capabilities

Supertonic 2 combines speed, portability, and multilingual consistency to deliver the following:

  • Local execution: Runs fully on-device with no external TTS service required.
  • Multilingual output: Five languages share a single architecture and inference pipeline.
  • Low-latency design: Optimized for fast, predictable inference with zero network dependency.
  • Efficient runtime: Deploys across 11 platforms with no speed degradation across languages.

Core Technical Profile

The model includes:

  • 66M parameter architecture: Compact enough for on-device deployment while delivering production-quality output.
  • ONNX Runtime: Powers on-device inference across all 11 supported platforms.
  • Shared pipeline: All supported languages use the same model architecture and inference path.
  • PyTorch training: Model weights are trained with PyTorch under a BSD 3-Clause license.
  • GPU and Apple Silicon: Optimized for RTX-class GPUs and M-series CPU/WebGPU runtimes.

This profile is purpose-built for teams that need a single, compact multilingual TTS path rather than separate language-specific stacks.

Performance Indicators

Supertonic 2 delivers leading real-time performance across hardware:

  • Inference up to 167× faster than real time
  • Around 12,164 characters/second on an RTX 4090
  • Around 1,263 characters/second on an M4 Pro CPU

These figures point to a model aimed at low-latency production use, not only offline experimentation.
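As a rough back-of-the-envelope check, the throughput figures above can be translated into per-utterance synthesis time. The 200-character utterance length here is an assumed example, not a published benchmark:

```python
# Back-of-the-envelope latency from the published throughput figures.
# The 200-character utterance length is an assumption for illustration.
def synthesis_time_ms(chars: int, chars_per_second: float) -> float:
    return chars / chars_per_second * 1000.0

utterance = 200  # characters (assumed example length)
rtx4090_ms = synthesis_time_ms(utterance, 12_164)  # ~16 ms
m4pro_ms = synthesis_time_ms(utterance, 1_263)     # ~158 ms
```

Under this assumption, a sentence-length input would synthesize in tens of milliseconds on desktop GPU hardware and well under a quarter second on laptop-class CPUs, which is what makes interactive use plausible.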

Best-Fit Use Cases

Supertonic 2 is especially relevant for:

  • Offline voice assistants
  • Real-time dubbing and narration
  • Accessibility applications
  • Privacy-sensitive enterprise TTS pipelines

Get Started

For teams building real-time voice products, Supertonic 2 provides a focused path to on-device multilingual TTS with fewer infrastructure dependencies.

Start with a multilingual test set and validate latency, language consistency, and output quality with your own prompts before scaling into production workflows.
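A minimal validation harness for that multilingual test set might look like the sketch below. The `synthesize` function is a stub standing in for the real local model, and the prompts are example inputs, not an official test set:

```python
import time

# Minimal validation harness sketch: measure per-language synthesis latency
# over your own prompts. `synthesize` is a stand-in for the real local model.
def synthesize(text: str, language: str) -> list[float]:
    return [0.0] * (len(text) * 160)  # stub audio

PROMPTS = {  # example prompts, one per supported language
    "en": "The meeting is at 3:45 PM on March 3rd.",
    "ko": "회의는 오후 3시 45분에 시작합니다.",
    "es": "La reunión es a las 15:45.",
    "pt": "A reunião é às 15h45.",
    "fr": "La réunion est à 15 h 45.",
}

def measure_latency(prompts: dict[str, str]) -> dict[str, float]:
    results = {}
    for lang, text in prompts.items():
        start = time.perf_counter()
        audio = synthesize(text, lang)
        results[lang] = (time.perf_counter() - start) * 1000.0  # ms
        assert len(audio) > 0, f"empty output for {lang}"
    return results

latencies = measure_latency(PROMPTS)
```

Swapping the stub for the real model turns this into a quick per-language latency and sanity check before committing to a production rollout.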