llama.cpp Merges Audio Support
On April 12, 2026, a pull request adding audio processing capabilities was merged into llama.cpp's main branch. For the thousands of developers already running llama.cpp in production, this means automatic speech recognition (ASR) is now available without any additional services, APIs, or vendor lock-in.
Until now, llama.cpp handled text and vision tasks well. Audio was the last major gap separating it from commercial solutions like OpenAI's Whisper API or Google Cloud Speech-to-Text.
Supported Models
Audio processing currently works with two Google Gemma 4 models:
- Gemma-4-E4B-it — 4.5B effective parameters (8B with embeddings), requires ~10 GB VRAM
- Gemma-4-E2B-it — 2.3B effective parameters (5.1B with embeddings), suitable for consumer hardware
Both models support 35+ languages, a 128k context window, and multimodal input — text, images, video, and now audio — within a single open architecture.
How It Works
Users download quantized GGUF model files from Hugging Face; the Q8_0 quantization is recommended for the best accuracy-to-memory balance. Audio input is currently capped at roughly 30 seconds, though longer segments sometimes process successfully.
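As a sketch, the end-to-end local workflow might look like the following. The tool name llama-mtmd-cli, its --audio flag, and every file name below are assumptions for illustration, not details confirmed by this write-up:

```shell
# Sketch of a local transcription run (assumptions: llama.cpp's multimodal CLI
# is llama-mtmd-cli and accepts an --audio flag; the GGUF and projector file
# names are placeholders for files downloaded from Hugging Face).
MODEL="gemma-Q8_0.gguf"        # Q8_0 quantization, per the recommendation above
MMPROJ="mmproj-gemma.gguf"     # multimodal projector shipped alongside the model
AUDIO="clip.wav"               # keep under ~30 seconds for now

CMD="llama-mtmd-cli -m $MODEL --mmproj $MMPROJ --audio $AUDIO -p \"Transcribe this audio\""
echo "$CMD"                    # print the assembled command rather than run it here
```

Removing the final echo and invoking the command directly would run the transcription, provided your llama.cpp build includes the new audio support.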
Simon Willison demonstrated running the Gemma 4 E2B model via MLX — Apple's optimized computation framework for Apple Silicon — with a single command:
```shell
uv run --with mlx_vlm mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio recording.wav \
  --prompt "Transcribe this audio"
```
This means a MacBook Pro with an M-series chip can now transcribe speech entirely locally, without any internet connection.
What This Means for European Developers and Businesses
Privacy and GDPR compliance. When audio processing runs on your own server, client voice data never leaves your infrastructure. For healthcare, legal, and financial services, the sectors where data residency matters most, this is a significant advantage. EU-based companies running Gemma 4 locally on their own servers avoid cross-border data transfers entirely.
Cost. OpenAI's Whisper API costs $0.006 per minute. For organizations with high audio volume — call centers, transcription services, voice assistants — this adds up to thousands of euros per month. A local alternative means a one-time hardware investment, not an ongoing API bill.
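To make that arithmetic concrete, here is a back-of-the-envelope estimate. The daily volume is an invented example; only the $0.006-per-minute rate comes from the pricing cited above:

```shell
# Back-of-the-envelope monthly Whisper API cost at $0.006/min.
# 20,000 audio minutes/day is an illustrative call-center volume, not a quoted figure.
RATE_TENTHS_OF_CENT=6          # $0.006/min expressed as tenths of a cent, to stay in integers
MINUTES_PER_DAY=20000
DAYS=30
MONTHLY_CENTS=$(( MINUTES_PER_DAY * DAYS * RATE_TENTHS_OF_CENT / 10 ))
echo "Monthly API cost: \$$(( MONTHLY_CENTS / 100 ))"
```

At that volume the bill lands around $3,600 a month, squarely in the "thousands per month" range described above, while a local deployment's cost is a fixed hardware purchase plus electricity.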
Voice AI without cloud dependency. Baltic and European companies building voice bots and transcription tools can now architect fully on-premise solutions. No reliance on third-party APIs that can change pricing, terms, or availability.
What's Next
The 30-second audio limit is the main constraint to watch. Once llama.cpp's maintainers lift it, the approach becomes viable for longer recordings: meetings, lectures, customer calls.
Google's decision to release Gemma 4 as a multimodal open model capable of text, vision, and audio from a single architecture signals a clear direction: open-source models are catching up to proprietary ones in capability, not just benchmark scores.
According to Habr's analysis, the implementation requires specific launch parameters (-b 1024 -ub 1024) for stability, with a public demo already live. Simon Willison's write-up confirms the MLX route works well on macOS.
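For reference, those parameters would be passed when the server is launched. This is only a sketch: llama.cpp does ship a llama-server binary with -b and -ub flags, but the model file names here are placeholders and the multimodal projector flag is an assumption for this setup:

```shell
# Launch sketch with the batch sizes the Habr write-up reports as needed for stability.
#   -b 1024    logical batch size
#   -ub 1024   physical (micro-)batch size
# Model and projector paths are placeholders.
llama-server -m gemma-Q8_0.gguf --mmproj mmproj-gemma.gguf -b 1024 -ub 1024
```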
Conclusion
llama.cpp gaining audio support is a practical milestone, not just a benchmark achievement. Speech AI is moving on-premise, and the combination of Gemma 4's multilingual capabilities with llama.cpp's deployment flexibility gives developers a serious local alternative to cloud transcription. For anyone building voice assistants or transcription tools in the Baltic region, this development is worth tracking closely.