Real-time dictation, zero-shot voice cloning, and cinematic video dubbing — all on your desktop.
Open-source, no API keys, fully local. 646 languages.
Quickstart · Features · Why OmniVoice Studio? · TTS Engines · ASR Engines · Contributing · Discord · 简体中文
> [!WARNING]
> OmniVoice Studio is in active beta. Things may break between releases. For the latest features and fixes, clone the repo and run from source rather than using pre-built installers. Bug reports and PRs are very welcome: open an issue or join Discord.
| Feature | Description |
|---|---|
| 🎙️ Voice Cloning | 3-second clip → mirror any voice. |
| 🎨 Voice Design | Gender, age, accent, pitch, speed, … |
| 🎬 Video Dubbing | YouTube URL or file → transcribe → … |
| ⌨️ Dictation Widget | |
| 🔊 Vocal Isolation | Demucs-powered. Splits speech … |
| 👥 Speaker Diarization | Pyannote + WhisperX. |
| 📦 Batch Queue | Drop 50 videos, walk away. |
| 🤖 MCP Server | Use OmniVoice from Claude, … |
| 🛡️ AI Watermark | AudioSeal (Meta). Invisible, … |
| 🔐 100% Local | No keys, no cloud, no accounts. |
| ⚡ GPU Auto-Detect | CUDA · MPS · ROCm · CPU. |
| 🧩 Extensible | Subclass … |
Per-OS install guides — pick yours and follow it end-to-end:
Stuck? See docs/install/troubleshooting.md for the top 10 install errors. The in-app error UI deeplinks to those entries when something breaks at runtime.
For Hugging Face token setup, see docs/setup/huggingface-token.md. For diarization-specific gating, see docs/features/diarization.md.
| Screen | What it shows |
|---|---|
| Voice Clone | Drop a 3-second clip → mirror any voice. 646 languages, zero-shot. |
| Voice Design | Build new voices from scratch: gender, age, accent, pitch, style. |
| Video Dubbing | Upload or paste a YouTube URL. Transcribe, translate, re-voice, export. |
| Voice Gallery | Search YouTube, browse categories, download clips, build your library. |
| Settings → Models | 15 models. One-click install. Auto-detects your platform (CUDA / MPS / CPU). |
| Projects | Dub projects, voice profiles, generation history, exports, all searchable. |
| Settings → Logs | Live backend, frontend, and Tauri runtime logs. Filter, refresh, clear. |
ElevenLabs charges $5–$330/mo and processes your audio on their servers. OmniVoice Studio runs on your hardware, with no usage limits.
| | ElevenLabs | OmniVoice Studio |
|---|---|---|
| Pricing | $5–$330/mo, per-character billing | Free for personal use · Commercial license for business |
| Voice Cloning | ✅ 3s clip | ✅ 3s clip, zero-shot |
| Voice Design | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| Languages | 32 | 646 |
| Video Dubbing | ✅ Cloud-only | ✅ Fully local |
| Data Privacy | Audio sent to cloud | Nothing leaves your machine |
| API Keys | Required | Not needed |
| GPU Support | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| Desktop App | ❌ | ✅ macOS · Windows · Linux |
| Customizable | ❌ Closed | ✅ Fork it, extend it, ship it |
OmniVoice Studio gives you professional-grade AI tools without the subscription or the cloud.
| | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 12+, Ubuntu 20.04+ | Any modern 64-bit OS |
| RAM | 8 GB | 16 GB+ |
| VRAM (GPU) | 4 GB (auto-offloads TTS to CPU) | 8 GB+ (NVIDIA RTX 3060+) |
| Disk | 10 GB free (models + cache) | 20 GB+ SSD |
| Python | 3.10+ (managed by uv) | 3.11–3.12 |
| GPU | Optional — CPU works | NVIDIA CUDA · Apple Silicon MPS · AMD ROCm |
> [!TIP]
> On GPUs with ≤8 GB VRAM, OmniVoice automatically offloads TTS to CPU during transcription, with no config needed. A dedicated GPU is not required; the entire pipeline runs on CPU (just slower).
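The offloading rule in the tip above can be pictured as a small placement decision. The sketch below is illustrative only: `plan_placement`, the `Placement` type, and the use of 8 GB as an exact cutoff are assumptions drawn from the tip's wording, not OmniVoice Studio's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff, taken from the "<= 8 GB VRAM" wording in the tip.
LOW_VRAM_THRESHOLD_GB = 8.0


@dataclass(frozen=True)
class Placement:
    """Where each model should live for the current step (hypothetical type)."""
    asr_device: str
    tts_device: str


def plan_placement(vram_gb: Optional[float], transcribing: bool) -> Placement:
    """Pick devices for ASR and TTS (hypothetical helper).

    vram_gb is None when no GPU is present; everything then runs on CPU.
    """
    if vram_gb is None:
        return Placement(asr_device="cpu", tts_device="cpu")
    if transcribing and vram_gb <= LOW_VRAM_THRESHOLD_GB:
        # Free VRAM for the ASR model by parking TTS on the CPU.
        return Placement(asr_device="gpu", tts_device="cpu")
    return Placement(asr_device="gpu", tts_device="gpu")
```

For example, a 6 GB card keeps TTS on the GPU while idle and moves it to the CPU only while a transcription is running.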
OmniVoice ships a multi-engine TTS backend. The default engine (OmniVoice) is always available; additional engines are opt-in and auto-detected. Switch engines in Settings → TTS Engine or via the OMNIVOICE_TTS_BACKEND env var.
| Engine | Languages | Clone | Instruct | Linux | macOS ARM | Windows | License |
|---|---|---|---|---|---|---|---|
| OmniVoice (default) | 600+ | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Built-in |
| CosyVoice 3 | 9 + 18 dialects | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MLX-Audio (Kokoro, Qwen3-TTS, CSM, Dia, …) | Multi | Varies | Varies | ❌ | ✅ Native | ❌ | Varies |
| VoxCPM2 | 30 | ✅ | ✅ | ✅ CUDA/CPU | ✅ MPS | ✅ CUDA/CPU | Apache-2.0 |
| MOSS-TTS-Nano | 20 | ✅ | ❌ | ✅ CUDA/CPU | ✅ CPU | ✅ CUDA/CPU | Apache-2.0 |
| KittenTTS | English | ❌ | ❌ | ✅ CPU | ✅ CPU | ✅ CPU | MIT |
CUDA = GPU-accelerated · MPS = Apple Silicon Metal · CPU = runs everywhere, slower for large models · KittenTTS and MOSS-TTS-Nano run realtime on CPU · MLX-Audio is Apple Silicon only.
OmniVoice ships a multi-engine ASR (speech-to-text) backend that powers dictation, video dubbing, and subtitle generation — all fully local. WhisperX is the cross-platform default; the rest are opt-in and auto-detected. Switch in Settings → ASR Engine or via the OMNIVOICE_ASR_BACKEND env var.
| Engine | OMNIVOICE_ASR_BACKEND | Languages | Best for |
|---|---|---|---|
| WhisperX (default) | whisperx | ~100 | Dubbing & subtitles — word-level timing via wav2vec2 forced alignment |
| Faster-Whisper | faster-whisper | ~100 | Fast transcription on Linux / macOS / Windows (CTranslate2) |
| MLX Whisper | mlx-whisper | ~100 | Native Apple Silicon speed (Apple MLX / Metal) |
| PyTorch Whisper | pytorch-whisper | ~100 | CUDA / CPU fallback via 🤗 Transformers |
| Parakeet TDT | nemo-parakeet | English + 25 EU | SOTA English accuracy, auto language detection (NVIDIA NeMo, GPU only) |
| Moonshine | moonshine | English | Edge / low-latency, ONNX |
| FunASR | funasr | 50+ | All-in-one multilingual — built-in VAD + inline speaker diarization (SenseVoice) |
Whisper-family engines cover ~100 languages; FunASR / SenseVoice adds an all-in-one multilingual path with built-in voice-activity detection and inline speaker diarization. Every engine runs on-device — no API keys, no cloud.
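As a rough illustration of the `OMNIVOICE_ASR_BACKEND` switch, the sketch below validates the variable against the values listed in the table and falls back to `whisperx` when it is unset. The helper name `resolve_asr_backend`, the case-insensitive matching, and the choice to raise on unknown values are assumptions; the application may handle this differently.

```python
import os
from typing import Mapping

# Values copied from the OMNIVOICE_ASR_BACKEND column of the engine table.
ASR_BACKENDS = (
    "whisperx",
    "faster-whisper",
    "mlx-whisper",
    "pytorch-whisper",
    "nemo-parakeet",
    "moonshine",
    "funasr",
)
DEFAULT_ASR_BACKEND = "whisperx"


def resolve_asr_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return the ASR backend name selected by the environment (hypothetical helper)."""
    raw = env.get("OMNIVOICE_ASR_BACKEND", "").strip().lower()
    if not raw:
        # Unset or empty: use the documented cross-platform default.
        return DEFAULT_ASR_BACKEND
    if raw not in ASR_BACKENDS:
        raise ValueError(
            f"Unknown OMNIVOICE_ASR_BACKEND {raw!r}; expected one of {', '.join(ASR_BACKENDS)}"
        )
    return raw
```

Failing loudly on a typo such as `fasterwhisper` is a design choice in this sketch; silently falling back to the default would hide the mistake.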
┌─────────────────────────────────────────────────┐
│ Frontend (React) │
│ DubTab · VoicePreview · BatchQueue · Gallery │
├─────────────────────────────────────────────────┤
│ Backend (FastAPI) │
│ 97 API endpoints · SSE streaming · SQLite │
├──────────┬──────────┬──────────┬────────────────┤
│ WhisperX │ Demucs │OmniVoice │ Pyannote │
│ ASR │ Source │ TTS │ Diarization │
│ │ Sep. │ │ │
└──────────┴──────────┴──────────┴────────────────┘
CUDA / MPS / ROCm / CPU (auto-detected)
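One common way to implement the auto-detection shown at the bottom of the diagram is to probe PyTorch. This is a minimal sketch under that assumption; `pick_device` and `detect_device` are hypothetical names, and the project's own detection logic and priority order may differ.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string from probe results, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for an accelerator; fall back to CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # ROCm builds of PyTorch report AMD GPUs through the torch.cuda API,
    # so a separate ROCm branch is not needed here.
    cuda = torch.cuda.is_available()
    mps_backend = getattr(torch.backends, "mps", None)
    mps = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(cuda, mps)
```

Keeping the decision in `pick_device` separate from the probing makes the priority order easy to test on a machine without any GPU.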
| Category | Features |
|---|---|
| Dubbing | Full pipeline (transcribe→translate→synthesize→mux), scene-aware splitting, lip-sync scoring, streaming TTS |
| Voice | Zero-shot cloning, voice design, A/B comparison, voice preview widget, gallery with favorites/tags |
| Audio | Demucs vocal isolation, per-segment gain, selective track export, stem/SRT/VTT/MP3 export |
| Multi-Lang | Multi-language batch picker, batch dubbing queue with sequential GPU execution |
| Diarization | Pyannote ML diarization, auto speaker clone extraction, per-speaker voice assignment |
| Infra | Docker deployment, CUDA/MPS/ROCm auto-detect, cuDNN 8 compat, VRAM-aware model offloading |
| AI Provenance | AudioSeal invisible watermarking (SynthID-like), video logo overlay, watermark detection API |
| UX | Undo/redo, keyboard shortcuts, drag-and-drop, session persistence, glassmorphism design system |
| Real-time Events | WebSocket event bus — instant sidebar refresh on data mutations, exponential backoff reconnect |
| State Management | Zustand store migration — uiSlice, pillSlice, dubSlice, generateSlice, prefsSlice, glossarySlice |
| Desktop | Cross-platform Tauri installers (macOS DMG, Windows MSI, Linux deb/AppImage), auto-update infrastructure |
| Windows Hardening | Cross-platform log paths, Triton workaround, HF symlink bypass, 300s health check timeout |
| Dictation | Global system-wide hotkey (⌘+⇧+Space), frameless floating widget, streaming ASR via WebSocket, auto-paste |
| Batch Pipeline | Full batch TTS: extract → transcribe → translate → generate → mix → export, with live progress tracking |
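The "exponential backoff reconnect" in the Real-time Events row refers to a standard retry pattern: wait a little longer after each failed reconnect, up to a ceiling. The sketch below shows only the delay schedule. The base delay, growth factor, and cap are assumed values, not the ones the app uses.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield an endless sequence of reconnect delays in seconds, growing by `factor` up to `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def preview(n: int, **kwargs: float) -> list[float]:
    """Return the first n delays, for inspection (hypothetical helper)."""
    return list(islice(backoff_delays(**kwargs), n))
```

A client would sleep for the next delay after each failed connection attempt and start a fresh generator once a connection succeeds. Production clients often add random jitter so that many clients do not reconnect at the same instant.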
| Channel | What happens there |
|---|---|
| #showcase | Members share their dubs, clones, and voice designs |
| #help | Setup issues, GPU troubleshooting, model questions |
| #feature-requests | Vote on what gets built next |
| #dev | Architecture discussions, PR reviews, engine integrations |
| #announcements | Release notes, breaking changes, early access |
→ Join the Discord — we respond to setup questions within hours, not days.
We welcome contributions of all kinds — bug fixes, new TTS engine adapters, UI improvements, docs, and translations.
To add a new TTS engine, subclass `TTSBackend` in `backend/services/tts_backend.py` and add it to the `_REGISTRY` dictionary at the bottom of that file. Six engines are built in: OmniVoice, CosyVoice, MLX-Audio (14+ sub-engines), VoxCPM2, MOSS-TTS-Nano, and KittenTTS. See the TTS Engines section for details.
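The subclass-and-register pattern named above might look roughly like the sketch below. Only the names `TTSBackend` and `_REGISTRY` come from this README. The base class here is a stand-in, and the `synthesize` signature, the `SilenceBackend` toy engine, and the `get_backend` lookup are assumptions made for illustration. Check `backend/services/tts_backend.py` for the real interface before writing an adapter.

```python
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Stand-in for the real base class; its actual methods may differ."""

    @abstractmethod
    def synthesize(self, text: str, sample_rate: int = 24_000) -> bytes:
        """Return raw 16-bit mono PCM for `text` (assumed contract)."""


class SilenceBackend(TTSBackend):
    """Toy engine that emits 50 ms of silence per input character."""

    def synthesize(self, text: str, sample_rate: int = 24_000) -> bytes:
        samples = (sample_rate // 20) * len(text)  # 1/20 s = 50 ms per character
        return b"\x00\x00" * samples  # two zero bytes per 16-bit sample


# Maps an engine name to its class; the key format is an assumption.
_REGISTRY: dict[str, type[TTSBackend]] = {
    "silence": SilenceBackend,
}


def get_backend(name: str) -> TTSBackend:
    """Instantiate a registered engine by name (hypothetical lookup helper)."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown TTS backend: {name!r}") from None
```

A registry dictionary keeps engine selection to a single lookup, so a new adapter needs only a class and one added entry.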
OmniVoice Studio is source-available under the Functional Source License (FSL-1.1-ALv2).
Free for personal, educational, research, internal team, and non-commercial use. Each release converts to Apache 2.0 automatically two years after publication.
Business / enterprise users building a competing product or service on top of OmniVoice Studio need a commercial license. Pricing tiers coming soon. For inquiries in the meantime, reach out at OmniVoice@palash.dev.
See LICENSE for the full terms.
OmniVoice Studio is built on the shoulders of exceptional open-source work:
| Project | Role |
|---|---|
| OmniVoice (k2-fsa) | Zero-shot diffusion TTS engine — the core voice synthesis model |
| WhisperX | Word-level speech recognition and alignment |
| Demucs (Meta) | Music source separation for vocal isolation |
| Pyannote | Speaker diarization — who said what |
| CTranslate2 | Optimized Transformer inference on CPU and GPU |
| AudioSeal (Meta) | Invisible neural audio watermarking for AI provenance |
| Tauri | Native desktop app framework |
If you read this far, you're our kind of person.
⭐ Star this repo so others can find it too.
💬 Join the Discord to share what you build.