Voicebox: The Local-First Voice Cloning Studio That Runs Entirely on Your Mac
Voicebox is what happens when someone looks at ElevenLabs and asks: why does my voice need to live on someone else's server? Built by Jamie Pine — the developer behind the Spacedrive file manager — Voicebox is an MIT-licensed desktop app that clones voices, generates speech, and composes multi-track voice projects without ever sending a byte to the cloud.
The project has quietly crossed 2,700 GitHub stars and released 13 versions since shipping, with v0.1.12 landing at the end of January 2026. And unlike most "local alternative to X" projects, Voicebox is a complete studio product — not a CLI wrapper or a half-finished demo.
What Voicebox Actually Is
At its core, Voicebox is a native desktop app built with Tauri (Rust) and a FastAPI Python backend, powered by Alibaba's Qwen3-TTS voice synthesis model. You download the app, point it at a few seconds of audio, and it produces a voice profile you can use to generate arbitrary speech — the same core workflow ElevenLabs offers, except everything runs on your machine.
The punch list of features reads like a subscription-tier SaaS product:
- Voice cloning from a few seconds of audio, with multi-sample support for higher-fidelity profiles
- Multi-track stories editor — a DAW-style timeline for composing conversations and narratives
- In-app recording with system audio capture on macOS and Windows
- Automatic transcription via Whisper
- Generation history with search, filter, and one-click regeneration
- Full REST API so you can pipe voice synthesis into your own apps
- Remote mode for running the inference engine on a GPU server and the UI on your laptop
The whole thing is distributed under the MIT license. No subscription, no usage caps, no "you've hit your monthly character limit" modal.
The MLX Trick on Apple Silicon
Here's where Voicebox gets technically interesting. Most local AI apps on Mac run PyTorch through Metal and call it a day. Voicebox ships a dedicated MLX backend — Apple's array framework built around Apple silicon's unified memory and Metal GPU — alongside the standard PyTorch path.
The README claims 4–5× faster inference on Apple Silicon versus the PyTorch fallback. That's not marketing hype; it's what happens when you use the actual framework Apple built for this hardware instead of going through a generic GPU abstraction. For M1/M2/M3 users, the difference between "watch a progress bar" and "instant playback" is exactly the difference between a toy and a usable tool.
The fallback matrix covers everyone else:
- Apple Silicon: MLX backend, native Metal acceleration
- Windows/Linux: PyTorch backend, CUDA GPU strongly recommended
- Intel Mac: PyTorch with CPU inference (slower but functional)
Linux builds are listed as "coming soon, currently blocked by GitHub runner disk space limitations" — a very normal open-source constraint when you're bundling a multi-gigabyte model with the app.
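The fallback matrix above boils down to a platform check. Here's a minimal sketch of that selection logic in Python; the function name and return labels are hypothetical, not taken from Voicebox's actual code:

```python
import platform

def choose_backend(system: str, machine: str, cuda_available: bool = False) -> str:
    """Mirror the support matrix: MLX on Apple Silicon, PyTorch elsewhere.

    Hypothetical reconstruction of the fallback logic, not Voicebox's internals.
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"           # Apple Silicon: unified memory + Metal via MLX
    if cuda_available and system in ("Windows", "Linux"):
        return "pytorch-cuda"  # NVIDIA GPU strongly recommended here
    return "pytorch-cpu"       # Intel Mac / no GPU: slower but functional

# In practice you'd feed in the live platform:
backend = choose_backend(platform.system(), platform.machine())
```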
Getting It Running
Voicebox skips the usual "clone this repo, install Python, fight with CUDA drivers" onboarding. You grab a binary from the GitHub Releases page and double-click.
For the v0.1.0 release (later versions follow the same naming scheme), the downloads are:
- macOS (Apple Silicon): voicebox_aarch64.app.tar.gz
- macOS (Intel): voicebox_x64.app.tar.gz
- Windows (MSI): voicebox_0.1.0_x64_en-US.msi
- Windows (Setup): voicebox_0.1.0_x64-setup.exe
If you want to hack on it locally, the development setup uses Bun for package management:
```bash
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
make setup
make dev
```
Prerequisites: Bun, Rust (rustup), Python 3.11+, and Xcode on macOS. The Makefile handles the painful glue — make help shows every available command.
The API Is the Real Story
For power users and integration-minded developers, the feature that actually matters is the REST API. Voicebox boots a FastAPI server on localhost:8000 and exposes endpoints for generation, profile management, and history — all documented via an auto-generated OpenAPI schema at /docs.
Here's the absolute minimum workflow:
```bash
# Generate speech from a cloned voice
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'

# List all voice profiles
curl http://localhost:8000/profiles

# Create a new profile
curl -X POST http://localhost:8000/profiles \
  -H "Content-Type: application/json" \
  -d '{"name": "My Voice", "language": "en"}'
```
That unlocks use cases where sending audio to a third-party server is a non-starter: game dialogue systems that generate NPC lines at runtime, podcast production pipelines that need to iterate on narration without re-uploading everything, accessibility tools that can't rely on cloud availability, and content automation scripts where you don't want every generated clip billing against an API quota.
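For the game-dialogue case in particular, a thin caching layer keeps you from regenerating the same line twice. This is an illustrative sketch, assuming a generate callable that wraps a POST to the local /generate endpoint and returns audio bytes; the cache layout is my own, not part of Voicebox:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("voice_cache")

def cache_key(text: str, profile_id: str) -> str:
    """Stable filename for a (line, voice) pair."""
    return hashlib.sha256(f"{profile_id}:{text}".encode()).hexdigest()[:16]

def get_or_generate(text: str, profile_id: str, generate) -> bytes:
    """Return cached audio for this line, generating it at most once.

    `generate` is any callable returning audio bytes, e.g. a wrapper
    around a POST to Voicebox's /generate endpoint (hypothetical glue).
    """
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(text, profile_id)}.wav"
    if path.exists():
        return path.read_bytes()
    audio = generate(text, profile_id)  # hits the local API only on a miss
    path.write_bytes(audio)
    return audio
```

The same pattern works for narration pipelines: hash the script segment, regenerate only the paragraphs that changed.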
Where It Falls Short
Voicebox is clearly still in early access, and the roadmap is honest about what's missing:
- Real-time streaming — right now you wait for full generations to complete rather than hearing audio as it's produced
- More models — currently Qwen3-TTS only; XTTS and Bark support are listed as "coming soon"
- Conversation mode with automatic speaker turn-taking
- Voice effects — pitch shift, reverb, and similar post-processing live on the roadmap, not in the product
- Word-level timeline editing — the current editor works at the clip level
- Linux builds — blocked on CI infrastructure, not a deep technical problem
- Mobile companion — remote control from a phone is a "future vision" item
The GitHub Issues page shows 49 open issues at the time of writing, which is reasonable for a v0.1.x project but signals you should treat this as alpha-quality software. If your use case depends on absolute reliability — say, a customer-facing production product — you're probably still better off with a commercial service until Voicebox hits a v1.0.
How It Stacks Against the Alternatives
If you already know the voice-synthesis landscape, the natural question is how Voicebox compares.
- ElevenLabs wins on model quality and polish, and loses on price, privacy, and customization. If your voice samples are sensitive (licensed actors, legal recordings, internal executive audio), the cloud model is a non-starter.
- Coqui XTTS is the open-source incumbent. It's more mature as a model but leaves you assembling a UI, a project system, and an API yourself. Voicebox is the studio Coqui never shipped.
- Mistral's open-weight Voxtral models sit on the other side of the pipeline: transcription and audio understanding rather than synthesis. Like most open-weight speech releases, they're building blocks. Voicebox is the finished house.
Voicebox's real competitive move is bundling a model, a GUI, a timeline editor, recording, transcription, and an API into a single Tauri binary. That's a hard package to beat when you care about owning the full stack end-to-end.
The Bottom Line
Voicebox is the voice-cloning product the open-source ecosystem has been missing: not a model, not a Python script, but a local-first studio app that behaves like something you'd pay for. The MLX backend makes it genuinely fast on Apple Silicon, the API makes it genuinely embeddable, and the MIT license makes it genuinely yours.
It's early. It's alpha. It's missing features a paid competitor would include. But for a single-contributor project that ships working binaries for two platforms and exposes a usable REST API, it's a preview of where local AI tooling is headed — away from "run this Colab notebook" and toward "double-click an app icon." That's the direction the rest of the ecosystem should be moving, too.


