Voice Loop — A Local Voice Agent

Voice Loop — A Local Voice Agent in ~500 Lines of Python

Ronan — Trelis Research ~14 min Watch on YouTube ↗

Overview

Ronan from Trelis Research demonstrates Voice Loop, an open-source local voice agent defined in under 500 lines of Python. It runs entirely on a Mac using a cascade of small, efficient models — Moonshine Base for speech-to-text, Gemma 4B (via MLX) for reasoning, and Kokoro for text-to-speech. The video opens with a live conversation demo, then dives deep into the technical architecture: the STT → LLM → TTS pipeline, turn detection via PipeCat's Smart Turn V3, echo cancellation for interruption handling, memory persistence, audio input mode, and the future direction toward end-to-end models.

1 Live Demo: Conversing with Voice Loop ▶ 0:00

The video opens with an unscripted live conversation between Ronan and Voice Loop. The agent introduces itself, asks for the user's name, and remembers it throughout the session. The demo showcases several capabilities in rapid succession:

Key insight: The demo is deliberately conversational and imperfect — it showcases the real behavior of turn detection, hesitation handling, and interruption detection, not a polished scripted demo.

2 What Is Voice Loop? ▶ 1:19

Voice Loop is a local voice agent that runs on a Mac, defined by fewer than 500 lines of Python code. It includes advanced functionalities that are only possible with recently released models:

The entire project is available on GitHub under the Trelis Research organization.

3 The Core Pipeline: STT → LLM → TTS ▶ 2:02

The fundamental architecture is a three-stage cascade pipeline:

  1. Voice In → Moonshine Base (STT) — microphone audio is transcribed to text using Moonshine Base, a small and fast transcription model. The transcription is "very fast because Moonshine Base is a very small model."
  2. Text → Gemma 4B (LLM) — the transcribed text is passed to Gemma 4B (running locally via MLX), which generates a text response. Gemma is "also fairly fast" for reasoning.
  3. Text → Kokoro (TTS) — the LLM's text output is converted back to speech using Kokoro. This step can introduce latency: "if there's a lot of words in the response, there can be more of a delay."
Key insight: Each model in the pipeline is deliberately small and efficient — Moonshine Base for near-instant transcription, Gemma 4B (not a frontier model) for solid local reasoning, and Kokoro for natural-sounding speech. The trade-off is latency vs. running entirely locally on consumer hardware.

4 Turn Detection: Knowing When the User Is Done Speaking ▶ 3:20

Simply piping the three models together isn't enough. The first major challenge is turn detection — determining when the user has actually finished speaking. Without it, the agent would respond to every pause, including hesitations.

Voice Loop solves this with a two-stage pre-processing pipeline before the audio reaches the LLM:

  1. Voice Activity Detection (VAD) — determines whether the incoming audio is speech or background noise.
  2. Smart Turn V3 (PipeCat) — an open-source model that predicts the probability that the user has finished speaking. It outputs a percentage (e.g., "the probability of Ronan being finished is only 4%").

While the probability of the user being finished is low, the agent holds back. When the user completes a phrase that sounds final, or when silence extends long enough, the turn probability increases and the agent responds.

Ronan acknowledges the system isn't perfect: in one case during the demo, he said "I'm wondering if…" and the model briefly thought he was finished. But for most conversational scenarios, it handles turn detection well.

Key insight: "The smart turn model is basically saying the probability of Ronan being finished is only 4%, 1%, 6%, 4%. So while the probabilities of me being finished are low, it's not going to respond."

5 Echo Cancellation and Interruption Detection ▶ 5:23

The second major challenge is interruption detection. When the agent is speaking (playing TTS audio through the speaker), the microphone picks up that audio. A naive approach ("if you hear a noise, that's an interruption") would fail because the agent would constantly interrupt itself.

The solution is a combination of two techniques working together:

When an interruption is detected, the agent immediately stops playing TTS, starts listening to the user, and routes the new input back through Gemma for a fresh response.

Key insight: Echo cancellation + VAD is the key combination that makes interruption detection work. Without echo cancellation, the agent would think its own voice playing through the speaker is the user interrupting.

6 The Back-Channeling Problem (Unsolved) ▶ 7:09

Ronan highlights one layer of nuance that Voice Loop does not yet handle: back-channeling. This is the distinction between a genuine interruption ("Wait, stop, I want to say something different") and a conversational affirmation ("Oh yeah, yeah, that's right") where the user is agreeing without actually wanting the agent to stop.

Currently, any detected voice during TTS playback is treated as an interruption. Models that can distinguish back-channeling from interruption exist, and Ronan plans to cover this in a future video and integrate it into Voice Loop.

In practice, even without back-channeling detection, the current interruption system captures most real interruptions. The worst case is occasionally stopping when the user was just agreeing — after which the conversation continues naturally.

7 Additional Features: Chime, Memory, No-TTS Mode ▶ 7:56

Beyond the core pipeline, Voice Loop includes several configurable features:

8 Configuration Options and Model Choices ▶ 10:01

Voice Loop is designed to be highly configurable. Ronan walks through the available flags and options:

9 Audio Input Mode: Direct Audio to LLM ▶ 10:58

Gemma 4B supports audio input as well as text input. Voice Loop has an experimental audio mode that passes the raw audio directly to the LLM instead of first transcribing it with Moonshine. Ronan demonstrates this mode live — it works, with the agent correctly handling counting tasks and recipe requests.

However, there are significant limitations:

Key insight: Audio mode runs Moonshine transcription in parallel with direct audio input — Moonshine keeps the text history, while the latest turn can optionally use raw audio. This dual-path approach hedges against the reliability gap of direct audio input.

10 The Future: From Cascade to End-to-End Models ▶ 13:37

Ronan reflects on the current state and future direction of voice agents. Today's systems, including Voice Loop, are fundamentally cascade models: speech-to-text → LLM → text-to-speech, with separate neural networks for each stage.

The direction of travel is toward more integrated, end-to-end models — fewer neural networks that handle the full audio-in-to-audio-out pipeline. In principle, direct audio input could eventually subsume turn detection and interruption detection (the model would understand from the audio alone when to respond and when to stop).

But there's a fundamental trade-off today: to get good reasoning, you need a strong LLM with extensive pre-training data. Integrated audio models don't yet match frontier-level intelligence, which is why the cascade approach persists — "you end up cobbling together these models so that you have a first transcription part, then a strong reasoning model, and then a model that will go back to speech."

Key insight: "These end-to-end models are still cascade models. But over time, the direction of travel probably is to have a more integrated flow where we have a fewer number of neural nets." The bottleneck: integrated models can't yet match the reasoning quality of dedicated LLMs.

🎯 Key Takeaways

  1. Under 500 lines of Python — Voice Loop is deliberately compact and modular, serving as a developer starting point rather than a monolithic framework.
  2. Three-model cascade — Moonshine Base (STT) → Gemma 4B via MLX (reasoning) → Kokoro (TTS), all running locally on a Mac.
  3. Turn detection is critical — PipeCat's Smart Turn V3 predicts the probability that the user has finished speaking, preventing premature responses during hesitations.
  4. Echo cancellation enables interruption — without cancelling the agent's own TTS output from the microphone input, interruption detection is impossible.
  5. VAD + Smart Turn = two-stage gate — Voice Activity Detection filters noise first, then Smart Turn evaluates whether detected speech represents a completed turn.
  6. Back-channeling is the next frontier — distinguishing "yeah, go on" from "stop, I want to say something" remains unsolved in Voice Loop.
  7. Memory is simple but effective — Gemma extracts notable facts after each turn and writes them to a file, with periodic consolidation.
  8. Audio input mode works but is unreliable — direct audio to Gemma 4B is possible but less consistent than the text-first pipeline, especially with smaller models.
  9. Dual-path architecture for audio mode — Moonshine always runs in parallel to maintain text history, even when audio mode passes raw audio for the latest turn.
  10. Cascade models persist because reasoning matters — end-to-end audio models can't yet match dedicated LLMs in reasoning quality, forcing the three-stage pipeline approach.
  11. Model size trade-offs are real — 4B handles counting, reasoning, and audio input; 2B works but struggles with logic and direct audio.
  12. Highly configurable — echo cancellation, voice selection, model size, silence timeout, chime, recording, and TTS can all be toggled via CLI flags.

⏱ Timestamp Index

▶ 0:00 Live demo — conversing with Voice Loop
▶ 1:19 What is Voice Loop? Overview and goals
▶ 2:02 Core pipeline: voice in → Moonshine → Gemma → Kokoro
▶ 2:35 Gemma 4B reasoning and Kokoro TTS latency
▶ 3:20 Turn detection problem explained
▶ 3:47 Hesitation handling in the transcript
▶ 4:17 Silence-based turn probability increase
▶ 4:31 VAD + Smart Turn V3 two-stage detection
▶ 4:59 Turn detection accuracy and edge cases
▶ 5:23 Interruption detection challenge
▶ 5:57 Echo cancellation explained
▶ 6:35 VAD on residual signal after echo cancellation
▶ 7:09 Back-channeling problem (unsolved)
▶ 7:56 Chime notification for UX
▶ 8:17 Persistent memory with Gemma
▶ 8:52 No-TTS mode for text-only responses
▶ 10:01 Echo cancellation toggle and config options
▶ 10:22 Kokoro voice selection and model size (4B vs 2B)
▶ 10:58 Audio input mode — direct audio to Gemma
▶ 12:13 Audio mode limitations and reliability
▶ 12:49 MLX library limitation — audio only for last turn
▶ 13:37 Future: cascade to end-to-end models