
OmniVoice is a state-of-the-art massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages. Built on a novel diffusion language model-style architecture, it generates high-quality speech with superior inference speed, supporting voice cloning and voice design.
Contents: Key Features | Installation | Quick Start | Python API | Command-Line Tools | Training & Evaluation | Discussion | Citation
OmniVoice supports inline non-verbal symbols (e.g., [laughter]) and pronunciation correction via pinyin or phonemes.
Choose one of the following methods: pip or uv.
We recommend using a fresh virtual environment (e.g., conda or venv) to avoid conflicts.
Step 1: Install PyTorch
<details>
<summary>NVIDIA GPU</summary>
# Install PyTorch built for your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
</details>
<details>
<summary>Apple Silicon</summary>
See the official PyTorch site for installing other versions.
pip install torch==2.8.0 torchaudio==2.8.0
</details>
<details>
<summary>Intel Arc GPU (XPU)</summary>
Intel Arc GPUs (Alchemist and Battlemage architectures) are supported via PyTorch's XPU backend.
Install the Intel GPU drivers for your OS.
Install PyTorch with XPU support from Intel's wheel index:
pip install torch torchaudio --index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
See Intel's PyTorch XPU guide for version-specific instructions.
Verify the installation:
python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
Notes:
- flash_attn is not available on XPU; the model automatically falls back to SDPA.
- flex_attention has partial XPU support; single-GPU SDPA training should work.
</details>
Step 2: Install OmniVoice (choose one)
# From PyPI (stable release)
pip install omnivoice
# From the latest source on GitHub (no need to clone)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# For development (clone first, editable install)
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
Clone the repository and sync dependencies:
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
Tip: You can use a PyPI mirror with
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
Try OmniVoice without coding:
Launch the local web UI: omnivoice-demo --ip 0.0.0.0 --port 8001
Or try it directly on HuggingFace Space
If you have trouble connecting to HuggingFace when downloading the pre-trained models, set the following before running:
export HF_ENDPOINT="https://hf-mirror.com"
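The mirror setting above can also be applied from Python instead of the shell. This is a minimal sketch, assuming the variable is set before any Hugging Face library is imported (huggingface_hub reads HF_ENDPOINT when it is first imported); the helper name use_hf_mirror is hypothetical and not part of OmniVoice.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Point Hugging Face downloads at a mirror unless one is already set.

    Hypothetical helper: call it before importing omnivoice or any other
    library that pulls in huggingface_hub.
    """
    # setdefault keeps a value that was already exported in the shell.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


active_endpoint = use_hf_mirror()
print(f"HF_ENDPOINT={active_endpoint}")
```

Using setdefault rather than plain assignment means a value exported in the shell still takes precedence.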
For full usage, see the Python API and Command-Line Tools sections below.
OmniVoice supports three generation modes. All features in this section are also available via command-line tools.
Clone a voice from a short reference audio. Provide ref_audio and ref_text:
from omnivoice import OmniVoice
import soundfile as sf
import torch
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)
# Apple Silicon users: use device_map="mps" instead
# Intel Arc GPU users: use device_map="xpu" instead
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # audio is a list of `np.ndarray` with shape (T,) at 24 kHz.
# `ref_text` is optional: if it is omitted, the model uses Whisper ASR
# to transcribe the reference audio automatically.
sf.write("out.wav", audio[0], 24000)
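The device comments in the example above can be turned into a small probe. This is a sketch, not part of OmniVoice: pick_device_map is a hypothetical helper, and the "cpu" fallback is an assumption, since CPU inference is not documented here.

```python
def pick_device_map() -> str:
    """Return "cuda:0", "mps", "xpu", or "cpu" based on what PyTorch reports.

    Hypothetical helper; the "cpu" fallback is an assumption.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to probe.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # Older PyTorch builds may lack these backends, hence the getattr guards.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


device_map = pick_device_map()
print(device_map)
```

The result can be passed as device_map to OmniVoice.from_pretrained in place of the hard-coded "cuda:0".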
Tips
- Use a 3–10 second reference audio clip. Longer audio slows down inference and may degrade cloning quality.
- For standard pronunciation, use a reference audio in the same language as the target speech. In cross-lingual voice cloning (i.e., the reference audio and target speech are in different languages), the generated speech will carry an accent from the reference audio's language.
- For better results with Arabic numerals, normalize them to words first (e.g., "123" → "one hundred twenty-three") with text normalization tools (e.g., WeTextProcessing).
For more tips, see docs/tips.md.
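The 3–10 second recommendation above can be checked before running inference. This is a minimal sketch using only the standard library; it assumes the reference clip is an uncompressed PCM WAV file (the wave module cannot read other formats), and the helper names are hypothetical.

```python
import os
import tempfile
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # range recommended in the tips above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path: str) -> bool:
    """True if the clip falls inside the recommended 3-10 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS


# Demo: write 5 seconds of 16-bit mono silence at 24 kHz and check it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ref.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample
        wav.setframerate(24000)
        wav.writeframes(b"\x00\x00" * 24000 * 5)
    demo_ok = is_good_reference_length(demo_path)

print(demo_ok)
```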
Describe the desired voice with speaker attributes — no reference audio needed. Supported attributes: gender (male/female), age (child to elderly), pitch (very low to very high), style (whisper), English accent (American, British, etc.), and Chinese dialect (四川话, 陕西话, etc.). Attributes are comma-separated and freely combinable across categories.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
Note: The model is primarily trained on the voice cloning task, so voice cloning is the most stable mode. Voice design is trained on Chinese and English data only. It can generalize to other languages, but may produce unstable results for some low-resource languages or edge cases.
See docs/voice-design.md for the full attribute reference, Chinese equivalents, and usage tips.
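Because instruct is a comma-separated string, it can be assembled from named attributes. This is a sketch only: build_instruct is a hypothetical helper, its parameter names mirror the categories listed above, and it does not validate values against docs/voice-design.md.

```python
from typing import Optional


def build_instruct(
    gender: Optional[str] = None,
    age: Optional[str] = None,
    pitch: Optional[str] = None,
    style: Optional[str] = None,
    accent: Optional[str] = None,
    dialect: Optional[str] = None,
) -> str:
    """Join the provided attributes into a comma-separated instruct string.

    Hypothetical helper; the fixed attribute order is an arbitrary choice.
    """
    parts = [gender, age, pitch, style, accent, dialect]
    return ", ".join(p.strip() for p in parts if p and p.strip())


instruct = build_instruct(gender="female", pitch="low pitch", accent="british accent")
print(instruct)  # female, low pitch, british accent
```

The resulting string is passed as model.generate(text=..., instruct=instruct).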
Let the model choose a voice automatically:
audio = model.generate(text="This is a sentence without any voice prompt.")
All three modes above share the same model.generate() API. You can further control the generation behavior via keyword arguments:
audio = model.generate(
    text="...",
    num_step=32,    # diffusion steps (use 16 for faster inference)
    speed=1.0,      # speed factor (>1.0 faster, <1.0 slower)
    duration=10.0,  # fixed output duration in seconds (overrides speed)
    # ... more options
)
For more detailed control, see docs/generation-parameters.md.
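Since duration overrides speed, it can help to make that precedence explicit when assembling arguments. This is a sketch under that assumption: generation_kwargs is a hypothetical helper, and the positivity checks are its own, not documented OmniVoice behavior.

```python
from typing import Any, Dict, Optional


def generation_kwargs(
    num_step: int = 32,
    speed: Optional[float] = None,
    duration: Optional[float] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for model.generate() with either duration or speed."""
    if num_step <= 0:
        raise ValueError("num_step must be positive")
    kwargs: Dict[str, Any] = {"num_step": num_step}
    if duration is not None:
        if duration <= 0:
            raise ValueError("duration must be positive")
        # duration overrides speed, so speed is dropped to make that visible.
        kwargs["duration"] = duration
    elif speed is not None:
        if speed <= 0:
            raise ValueError("speed must be positive")
        kwargs["speed"] = speed
    return kwargs


fast = generation_kwargs(num_step=16, speed=1.2)
fixed = generation_kwargs(speed=1.2, duration=10.0)
print(fast)   # {'num_step': 16, 'speed': 1.2}
print(fixed)  # {'num_step': 32, 'duration': 10.0}
```

Either dictionary can then be unpacked into the call, e.g. model.generate(text="...", **fixed).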
OmniVoice supports inline non-verbal symbols and pronunciation correction within the input text.
Non-verbal symbols: Insert tags like [laughter] directly in the text to add expressive non-verbal sounds.
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
Supported tags: [laughter], [sigh], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn].
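A mistyped tag is easy to miss, so input text can be checked against the list above before synthesis. This is a sketch: unknown_tags is a hypothetical helper, and how the model treats an unrecognized tag is not documented here.

```python
import re
from typing import List

SUPPORTED_TAGS = frozenset({
    "laughter", "sigh", "confirmation-en", "question-en", "question-ah",
    "question-oh", "question-ei", "question-yi", "surprise-ah", "surprise-oh",
    "surprise-wa", "surprise-yo", "dissatisfaction-hnn",
})

# Lowercase letters and hyphens only, so uppercase phoneme brackets are skipped.
_TAG_PATTERN = re.compile(r"\[([a-z]+(?:-[a-z]+)*)\]")


def unknown_tags(text: str) -> List[str]:
    """Return bracketed lowercase tags in text that are not in SUPPORTED_TAGS."""
    return [tag for tag in _TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]


print(unknown_tags("[laughter] Nice one. [giggle] He plays the [B EY1 S] guitar."))
# ['giggle']
```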
Pronunciation control (Chinese): Use pinyin with tone numbers to correct specific character pronunciations.
audio = model.generate(text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。")
Pronunciation control (English): Use phonemes from the CMU Pronouncing Dictionary (uppercase, in brackets) to override default English pronunciations.
audio = model.generate(text="He plays the [B EY1 S] guitar while catching a [B AE1 S] fish.")
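The bracketed overrides in the example above can be generated with a small formatter. This is a sketch: phoneme_override is a hypothetical helper that only checks the shape of each symbol (one or two uppercase letters plus an optional stress digit), not whether it is a real CMU dictionary phoneme.

```python
import re

# Shape of an ARPAbet-style symbol: 1-2 uppercase letters, optional stress digit.
_PHONEME = re.compile(r"[A-Z]{1,2}[0-2]?")


def phoneme_override(*phonemes: str) -> str:
    """Format phonemes as a bracketed override such as [B EY1 S]."""
    if not phonemes:
        raise ValueError("at least one phoneme is required")
    for p in phonemes:
        if not _PHONEME.fullmatch(p):
            raise ValueError(f"not an uppercase ARPAbet-style symbol: {p!r}")
    return "[" + " ".join(phonemes) + "]"


text = f"He plays the {phoneme_override('B', 'EY1', 'S')} guitar."
print(text)  # He plays the [B EY1 S] guitar.
```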
Three CLI entry points are provided. The CLI tools support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.) — all controlled via command-line arguments.
| Command | Description | Source |
|---|---|---|
| omnivoice-demo | Interactive Gradio web demo | omnivoice/cli/demo.py |
| omnivoice-infer | Single-item inference | omnivoice/cli/infer.py |
| omnivoice-infer-batch | Batch inference across multiple GPUs | omnivoice/cli/infer_batch.py |
omnivoice-demo --ip 0.0.0.0 --port 8001
Provides a web UI for voice cloning and voice design. See omnivoice-demo --help for all options.
# Voice Cloning
# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav
# Voice Design
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav
# Auto Voice
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
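When calling omnivoice-infer from a script, building the argument list in Python avoids shell quoting mistakes. This is a sketch that uses only the flags shown above; infer_command is a hypothetical helper, and it prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def infer_command(
    text: str,
    output: str,
    model: str = "k2-fsa/OmniVoice",
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    instruct: Optional[str] = None,
) -> List[str]:
    """Build an omnivoice-infer argument list; optional flags are added only if given."""
    cmd = ["omnivoice-infer", "--model", model, "--text", text]
    if ref_audio is not None:
        cmd += ["--ref_audio", ref_audio]
    if ref_text is not None:
        cmd += ["--ref_text", ref_text]
    if instruct is not None:
        cmd += ["--instruct", instruct]
    cmd += ["--output", output]
    return cmd


cmd = infer_command(
    "This is a test for text to speech.",
    "hello.wav",
    instruct="male, British accent",
)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run skips the shell entirely, so spaces and quotes in the text need no escaping.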
omnivoice-infer-batch distributes batch inference across multiple GPUs and is designed for large-scale TTS tasks.
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
The test list is a JSONL file where each line is a JSON object:
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "duration": 10.0, "speed": 1.0}
Only id and text are mandatory fields. ref_audio and ref_text are used in voice cloning mode. instruct is used in voice design mode. If neither reference audio nor instruct is provided, the model generates speech in a random voice.
language_id, duration, and speed are optional. duration (in seconds) fixes the output length; speed controls the speaking rate. If duration and speed are both provided, speed will be ignored.
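A test list in the format described above can be generated programmatically. This is a minimal sketch using only the standard library; write_test_list is a hypothetical helper, and the sample items are illustrative.

```python
import json
import os
import tempfile
from typing import Dict, Iterable


def write_test_list(path: str, items: Iterable[Dict[str, object]]) -> int:
    """Write items as JSONL, checking the mandatory id and text fields.

    Returns the number of lines written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            missing = [key for key in ("id", "text") if not item.get(key)]
            if missing:
                raise ValueError(f"item {count} is missing mandatory field(s): {missing}")
            # ensure_ascii=False keeps non-Latin text readable in the file.
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


items = [
    # Voice cloning: reference audio plus its transcript.
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"},
    # Voice design: attributes only.
    {"id": "sample_002", "text": "Hello again", "instruct": "female, british accent"},
    # No prompt: a random voice is used.
    {"id": "sample_003", "text": "No prompt here."},
]

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "test.jsonl")
    written = write_test_list(list_path, items)
    with open(list_path, encoding="utf-8") as f:
        round_trip = [json.loads(line) for line in f]

print(written)  # 3
```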
See examples/ for the complete pipeline — from data preparation to training, evaluation, and finetuning.
You can start a discussion on GitHub Issues.
You can also scan the QR code to join our WeChat group or follow our WeChat official account.
| WeChat Group | WeChat Official Account |
| ------------ | ----------------------- |
| (QR code image) | (QR code image) |
OmniVoice is supported by a growing ecosystem of community projects. Explore them in Community Projects.
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}
Users are strictly prohibited from using this model for unauthorized voice cloning, voice impersonation, fraud, scams, or any other illegal or unethical activities. All users shall ensure full compliance with applicable local laws, regulations, and ethical standards. The developers assume no liability for any misuse of this model and advocate for responsible AI development and use, encouraging the community to uphold safety and ethical principles in AI research and applications.