Ronan from Trelis Research demonstrates Voice Loop, an open-source local voice agent defined in under 500 lines of Python. It runs entirely on a Mac using a cascade of small, efficient models — Moonshine Base for speech-to-text, Gemma 4B (via MLX) for reasoning, and Kokoro for text-to-speech. The video opens with a live conversation demo, then dives deep into the technical architecture: the STT → LLM → TTS pipeline, turn detection via PipeCat's Smart Turn V3, echo cancellation for interruption handling, memory persistence, audio input mode, and the future direction toward end-to-end models.
The video opens with an unscripted live conversation between Ronan and Voice Loop. The agent introduces itself, asks for the user's name, and remembers it throughout the session. The demo showcases several capabilities in rapid succession:
Name memory — "What should I call you?" → remembers "Ronan" for the rest of the conversation.
Instruction following — counts to ten, counts backwards, skips every second number — all correctly.
Hesitation handling — when Ronan hesitates ("I'm wondering if maybe…"), the agent waits patiently instead of jumping in.
Interruption detection — when Ronan interrupts mid-response, the agent stops speaking and listens.
Task execution — provides a pancake recipe, asks clarifying follow-up questions ("How many people?").
Key insight: The demo is deliberately conversational and imperfect — it showcases the real behavior of turn detection, hesitation handling, and interruption detection, not a polished scripted demo.
Voice Loop is a local voice agent that runs on a Mac, defined by fewer than 500 lines of Python code. It includes advanced functionalities that are only possible with recently released models:
Turn detection — knowing when the user has finished speaking, powered by models like PipeCat's Smart Turn V3.
Interruption handling — allowing users to cut in mid-response.
Modular design — plug in different models for STT, LLM, and TTS while keeping the codebase compact.
Developer-focused — designed as a starting point for building custom voice agents, not a finished product.
The entire project is available on GitHub under the Trelis Research organization.
The fundamental architecture is a three-stage cascade pipeline:
Voice In → Moonshine Base (STT) — microphone audio is transcribed to text using Moonshine Base, a small and fast transcription model. The transcription is "very fast because Moonshine Base is a very small model."
Text → Gemma 4B (LLM) — the transcribed text is passed to Gemma 4B (running locally via MLX), which generates a text response. Gemma is "also fairly fast" for reasoning.
Text → Kokoro (TTS) — the LLM's text output is converted back to speech using Kokoro. This step can introduce latency: "if there's a lot of words in the response, there can be more of a delay."
Key insight: Each model in the pipeline is deliberately small and efficient — Moonshine Base for near-instant transcription, Gemma 4B (not a frontier model) for solid local reasoning, and Kokoro for natural-sounding speech. The trade-off is latency vs. running entirely locally on consumer hardware.
4 Turn Detection: Knowing When the User Is Done Speaking ▶ 3:20
Simply piping the three models together isn't enough. The first major challenge is turn detection — determining when the user has actually finished speaking. Without it, the agent would respond to every pause, including hesitations.
Voice Loop solves this with a two-stage pre-processing pipeline before the audio reaches the LLM:
Voice Activity Detection (VAD) — determines whether the incoming audio is speech or background noise.
Smart Turn V3 (PipeCat) — an open-source model that predicts the probability that the user has finished speaking. It outputs a percentage (e.g., "the probability of Ronan being finished is only 4%").
While the probability of the user being finished is low, the agent holds back. When the user completes a phrase that sounds final, or when silence extends long enough, the turn probability increases and the agent responds.
Ronan acknowledges the system isn't perfect: in one case during the demo, he said "I'm wondering if…" and the model briefly thought he was finished. But for most conversational scenarios, it handles turn detection well.
Key insight: "The smart turn model is basically saying the probability of Ronan being finished is only 4%, 1%, 6%, 4%. So while the probabilities of me being finished are low, it's not going to respond."
5 Echo Cancellation and Interruption Detection ▶ 5:23
The second major challenge is interruption detection. When the agent is speaking (playing TTS audio through the speaker), the microphone picks up that audio. A naive approach ("if you hear a noise, that's an interruption") would fail because the agent would constantly interrupt itself.
The solution is a combination of two techniques working together:
Echo cancellation — while TTS audio is playing, the system cancels the echo of its own voice from the microphone input. What remains should only be the user's actual voice.
VAD on the residual signal — after echo cancellation, Voice Activity Detection runs on whatever audio remains. If human speech is detected, it counts as a genuine interruption.
When an interruption is detected, the agent immediately stops playing TTS, starts listening to the user, and routes the new input back through Gemma for a fresh response.
Key insight: Echo cancellation + VAD is the key combination that makes interruption detection work. Without echo cancellation, the agent would think its own voice playing through the speaker is the user interrupting.
Ronan highlights one layer of nuance that Voice Loop does not yet handle: back-channeling. This is the distinction between a genuine interruption ("Wait, stop, I want to say something different") and a conversational affirmation ("Oh yeah, yeah, that's right") where the user is agreeing without actually wanting the agent to stop.
Currently, any detected voice during TTS playback is treated as an interruption. Models that can distinguish back-channeling from interruption exist, and Ronan plans to cover this in a future video and integrate it into Voice Loop.
In practice, even without back-channeling detection, the current interruption system captures most real interruptions. The worst case is occasionally stopping when the user was just agreeing — after which the conversation continues naturally.
Beyond the core pipeline, Voice Loop includes several configurable features:
Chime notification — an audio chime plays while waiting for the agent's response, so the user knows something is happening. Especially helpful when latency is longer.
Persistent memory — enabled with a --memory flag. After every turn, Gemma reviews the conversation and writes notable facts to a memory file (e.g., "the user's name is Ronan"). Every 5-10 turns, the memory is consolidated. On the next conversation, the agent remembers previous context.
No-TTS mode — run without text-to-speech for faster, text-only responses. Useful for scenarios where voice output isn't needed or for testing the reasoning pipeline in isolation.
Recording — record full conversations for playback and review.
Voice Loop is designed to be highly configurable. Ronan walks through the available flags and options:
Echo cancellation toggle — can be turned off, but interruption detection will stop working (the agent will keep getting interrupted by its own voice).
Kokoro voice selection — choose from multiple available voices for TTS output.
Model size — the default is Gemma E4B (4 billion parameters). A 2B model is available and works, but "the logic is not quite as strong." The 4B model handles reasonably advanced reasoning (counting backwards, skipping numbers, etc.).
Silence timeout — configurable duration of silence before the agent assumes the user is done and responds.
Recording mode — save full conversation recordings for later playback.
Gemma 4B supports audio input as well as text input. Voice Loop has an experimental audio mode that passes the raw audio directly to the LLM instead of first transcribing it with Moonshine. Ronan demonstrates this mode live — it works, with the agent correctly handling counting tasks and recipe requests.
However, there are significant limitations:
Reliability — audio input "can be not as reliable as first transcribing it." The text-first pipeline is more consistent.
Model size dependency — the 2B model is "not quite strong enough to handle the audio input directly." The 4B model handles it better but not perfectly.
Conversational drift — in audio mode, the model "sometimes starts to lose track" of the conversation, possibly because it's not trained on enough conversational audio data.
MLX library limitation — currently only allows injecting audio for the most recent turn. Conversation history still relies on Moonshine transcriptions as text. So the system runs both in parallel: Moonshine always transcribes to maintain history, while audio mode passes raw audio only for the latest turn.
Key insight: Audio mode runs Moonshine transcription in parallel with direct audio input — Moonshine keeps the text history, while the latest turn can optionally use raw audio. This dual-path approach hedges against the reliability gap of direct audio input.
10 The Future: From Cascade to End-to-End Models ▶ 13:37
Ronan reflects on the current state and future direction of voice agents. Today's systems, including Voice Loop, are fundamentally cascade models: speech-to-text → LLM → text-to-speech, with separate neural networks for each stage.
The direction of travel is toward more integrated, end-to-end models — fewer neural networks that handle the full audio-in-to-audio-out pipeline. In principle, direct audio input could eventually subsume turn detection and interruption detection (the model would understand from the audio alone when to respond and when to stop).
But there's a fundamental trade-off today: to get good reasoning, you need a strong LLM with extensive pre-training data. Integrated audio models don't yet match frontier-level intelligence, which is why the cascade approach persists — "you end up cobbling together these models so that you have a first transcription part, then a strong reasoning model, and then a model that will go back to speech."
Key insight: "These end-to-end models are still cascade models. But over time, the direction of travel probably is to have a more integrated flow where we have a fewer number of neural nets." The bottleneck: integrated models can't yet match the reasoning quality of dedicated LLMs.
🎯 Key Takeaways
Under 500 lines of Python — Voice Loop is deliberately compact and modular, serving as a developer starting point rather than a monolithic framework.
Three-model cascade — Moonshine Base (STT) → Gemma 4B via MLX (reasoning) → Kokoro (TTS), all running locally on a Mac.
Turn detection is critical — PipeCat's Smart Turn V3 predicts the probability that the user has finished speaking, preventing premature responses during hesitations.
Echo cancellation enables interruption — without cancelling the agent's own TTS output from the microphone input, interruption detection is impossible.
VAD + Smart Turn = two-stage gate — Voice Activity Detection filters noise first, then Smart Turn evaluates whether detected speech represents a completed turn.
Back-channeling is the next frontier — distinguishing "yeah, go on" from "stop, I want to say something" remains unsolved in Voice Loop.
Memory is simple but effective — Gemma extracts notable facts after each turn and writes them to a file, with periodic consolidation.
Audio input mode works but is unreliable — direct audio to Gemma 4B is possible but less consistent than the text-first pipeline, especially with smaller models.
Dual-path architecture for audio mode — Moonshine always runs in parallel to maintain text history, even when audio mode passes raw audio for the latest turn.
Cascade models persist because reasoning matters — end-to-end audio models can't yet match dedicated LLMs in reasoning quality, forcing the three-stage pipeline approach.
Model size trade-offs are real — 4B handles counting, reasoning, and audio input; 2B works but struggles with logic and direct audio.
Highly configurable — echo cancellation, voice selection, model size, silence timeout, chime, recording, and TTS can all be toggled via CLI flags.