llama.cpp Merges Audio Support
On April 12, 2026, a pull request adding audio processing capabilities was merged into llama.cpp's main branch. For the thousands of developers already running llama.cpp in production, this means automatic speech recognition (ASR) is now available without any additional services, APIs, or vendor lock-in.
Until now, llama.cpp handled text and vision tasks well. Audio was the last major gap separating it from commercial solutions like OpenAI's Whisper API or Google Cloud Speech-to-Text.
Supported Models
Audio processing currently works with two Google Gemma 4 models:
- Gemma-4-E4B-it — 4.5B effective parameters (8B with embeddings), requires ~10 GB VRAM
- Gemma-4-E2B-it — 2.3B effective parameters (5.1B with embeddings), suitable for consumer hardware
Both models support 35+ languages, a 128k context window, and multimodal input — text, images, video, and now audio — within a single open architecture.
How It Works
Users download quantized GGUF model files from Hugging Face; the Q8_0 quantization is recommended for the best accuracy-to-memory balance. Audio input is currently capped at roughly 30 seconds, though longer segments sometimes process successfully.
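As a sketch, the end-to-end local workflow might look like the following. The tool name llama-mtmd-cli, its --audio flag, and every file name below are assumptions for illustration, not details confirmed by this write-up:

```shell
# Sketch of a local transcription run (assumptions: llama.cpp's multimodal CLI
# is llama-mtmd-cli and accepts an --audio flag; the GGUF and projector file
# names are placeholders for files downloaded from Hugging Face).
MODEL="gemma-Q8_0.gguf"        # Q8_0 quantization, per the recommendation above
MMPROJ="mmproj-gemma.gguf"     # multimodal projector shipped alongside the model
AUDIO="clip.wav"               # keep under ~30 seconds for now

CMD="llama-mtmd-cli -m $MODEL --mmproj $MMPROJ --audio $AUDIO -p \"Transcribe this audio\""
echo "$CMD"                    # print the assembled command rather than run it here
```

Removing the final echo and invoking the command directly would run the transcription, provided your llama.cpp build includes the new audio support.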
Simon Willison demonstrated running the Gemma 4 E2B model via MLX — Apple's optimized computation framework for Apple Silicon — with a single command:
```shell
uv run --with mlx_vlm mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio recording.wav \
  --prompt "Transcribe this audio"
```
This means a MacBook Pro with an M-series chip can now transcribe speech entirely locally, without any internet connection.
What This Means for European Developers and Businesses
Privacy and GDPR compliance. When audio processing runs on your own server, client voice data never leaves your infrastructure. For healthcare, legal, and financial services, the sectors where data residency matters most, this is a significant advantage. EU-based companies running Gemma 4 locally on their own servers avoid cross-border data transfers entirely.
Cost. OpenAI's Whisper API costs $0.006 per minute. For organizations with high audio volume — call centers, transcription services, voice assistants — this adds up to thousands of euros per month. A local alternative means a one-time hardware investment, not an ongoing API bill.
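To make that arithmetic concrete, here is a back-of-the-envelope estimate. The daily volume is an invented example; only the $0.006-per-minute rate comes from the pricing cited above:

```shell
# Back-of-the-envelope monthly Whisper API cost at $0.006/min.
# 20,000 audio minutes/day is an illustrative call-center volume, not a quoted figure.
RATE_TENTHS_OF_CENT=6          # $0.006/min expressed as tenths of a cent, to stay in integers
MINUTES_PER_DAY=20000
DAYS=30
MONTHLY_CENTS=$(( MINUTES_PER_DAY * DAYS * RATE_TENTHS_OF_CENT / 10 ))
echo "Monthly API cost: \$$(( MONTHLY_CENTS / 100 ))"
```

At that volume the bill lands around $3,600 a month, squarely in the "thousands per month" range described above, while a local deployment's cost is a fixed hardware purchase plus electricity.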
Voice AI without cloud dependency. Baltic and European companies building voice bots and transcription tools can now architect fully on-premise solutions. No reliance on third-party APIs that can change pricing, terms, or availability.
What's Next
The 30-second audio limit is the main constraint to watch. Once llama.cpp's maintainers lift it, the approach becomes viable for longer recordings: meetings, lectures, customer calls.
Google's decision to release Gemma 4 as a multimodal open model capable of text, vision, and audio from a single architecture signals a clear direction: open-source models are catching up to proprietary ones in capability, not just benchmark scores.
According to Habr's analysis, the implementation requires specific launch parameters (-b 1024 -ub 1024) for stability, with a public demo already live. Simon Willison's write-up confirms the MLX route works well on macOS.
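For reference, those parameters would be passed when the server is launched. This is only a sketch: llama.cpp does ship a llama-server binary with -b and -ub flags, but the model file names here are placeholders and the multimodal projector flag is an assumption for this setup:

```shell
# Launch sketch with the batch sizes the Habr write-up reports as needed for stability.
#   -b 1024    logical batch size
#   -ub 1024   physical (micro-)batch size
# Model and projector paths are placeholders.
llama-server -m gemma-Q8_0.gguf --mmproj mmproj-gemma.gguf -b 1024 -ub 1024
```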
Conclusion
llama.cpp gaining audio support is a practical milestone, not just a benchmark achievement. Speech AI is moving on-premise, and the combination of Gemma 4's multilingual capabilities with llama.cpp's deployment flexibility gives developers a serious local alternative to cloud transcription. For anyone building voice assistants or transcription tools in the Baltic region, this development is worth tracking closely.