Utkarsh
Building a Voice-Controlled Local AI Agent with LangGraph and Mem0

I wanted to work with Local AI Agents; it turned out to be one of the more fun projects I've worked on, and a valuable primer on terminal UI/UX.

Why I Built This

The goal was simple enough on paper: build a voice-controlled AI agent that can accept audio input, figure out what the user wants, and execute local actions like creating files or summarizing content. The catch was that it had to run locally, use a HuggingFace STT model, and integrate Mem0 as the memory layer.

I could have taken the easy route and built a glorified if/else dispatcher that classifies intent and calls a function. Instead I decided to build a real agent with a proper graph, persistent memory, human-in-the-loop confirmation, and a terminal UI. Partly because the problem asked for it, and partly because I wanted to actually understand how these systems work at the seams.


How I Structured the Build

The worst thing you can do when building an agent is try to build all of it at once. You end up debugging three things simultaneously and not knowing which one is broken. I structured the build as four layers, where each layer had to work independently before I touched the next one.

Layer 1: The Dumb Pipeline

Before any LLM was involved, I built a straight line. Audio goes in, text comes out, files get read and written. No agent, no graph, no framework. Just Python functions.

For speech-to-text I used Whisper locally. The model gets loaded once at startup into a module-level cache variable. This matters because Whisper is not small, and loading it on every transcription call would mean a multi-second delay every time someone speaks. Load it once, reuse it forever.
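The caching pattern itself is a few lines. A sketch of it; the `loader` parameter is my addition (it makes the pattern testable and keeps the heavy `whisper` import deferred until first real use), not something from the project:

```python
_model_cache = {}

def get_stt_model(name="base", loader=None):
    """Return the STT model, loading it at most once per process.

    `loader` is injectable for testing; by default it lazily imports
    whisper, so the import and load cost is paid only on the first call.
    """
    if name not in _model_cache:
        if loader is None:
            import whisper  # deferred import: only runs on first use
            loader = whisper.load_model
        _model_cache[name] = loader(name)
    return _model_cache[name]
```

Every transcription call goes through `get_stt_model()`, so only the first one pays the load time.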

For microphone recording I used sounddevice instead of PyAudio. The reason is purely practical: sounddevice ships PortAudio binaries with the pip package, which means pip install sounddevice just works. PyAudio requires a separate system-level installation of PortAudio that fails differently on every operating system. When you want other people to actually run your project, every extra installation step is a place where they give up.

The three tools I built were read_file, write_to_file, and summarize_file. All three are plain Python functions at this stage. The summarizer calls a secondary Ollama model, Gemma3:1b, to generate an extended summary of the file contents before the main agent condenses it. Using a 1B parameter model for this required a much more structured system prompt than I would need for a larger model. Small models cannot fill in gaps from context the way larger models can, so you have to be explicit about everything: what format you want, what constraints apply, what the model should never do.
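To make "be explicit about everything" concrete, here is the shape of prompt a 1B model needs. The wording below is illustrative, not the project's actual prompt:

```python
# Illustrative system prompt for a small (1B) summarizer model.
# Small models need format, constraints, and prohibitions spelled out.
SUMMARIZER_SYSTEM_PROMPT = """You are a file summarizer. Follow these rules exactly.

OUTPUT FORMAT:
- Write 3 to 5 plain sentences. No headings, no bullet points, no markdown.

CONSTRAINTS:
- Mention only facts that appear in the file content provided below.
- If the file is code, state the language and what the code does.

NEVER:
- Never invent details that are not in the file.
- Never address the user or ask questions.
- Never output anything before or after the summary itself.
"""
```

A larger model would infer most of this from "summarize this file"; a 1B model will not.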

Every file operation is restricted to an output directory using pathlib's is_relative_to method. The idea is that even if the LLM hallucinates a path like ../../etc/passwd, the jail check raises a PermissionError before anything happens.
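The jail check is small. A sketch, with the directory name assumed:

```python
from pathlib import Path

OUTPUT_DIR = Path("agent_output").resolve()  # assumed directory name

def resolve_in_jail(user_path: str) -> Path:
    """Resolve a model-supplied path, refusing anything outside OUTPUT_DIR."""
    candidate = (OUTPUT_DIR / user_path).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise PermissionError(f"Path escapes output directory: {user_path}")
    return candidate
```

The `resolve()` call normalizes `..` segments before the check, which is what defeats traversal paths; `is_relative_to` requires Python 3.9+.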

I wrote test scripts for each piece and a test_pipeline.py that chains all of them together. The pipeline test records audio, transcribes it, writes the transcript to a file, reads it back, asserts the content matches, and then summarizes it. Layer 1 was done when this test passed end to end on real hardware.

Layer 2: The Agent

With the pipeline proven, I wrapped the tool functions with LangChain's @tool decorator and built a LangGraph StateGraph. The graph has two nodes: an agent node that calls Qwen3:4b with the tools bound, and a tool node that executes whatever tool the model decided to call. A conditional edge called should_continue checks whether the last message contains tool calls. If it does, route to the tool node. If not, we are done.
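Stripped of the framework, the conditional edge is just a check on the last message. A framework-free sketch with a stand-in message type (real LangChain AIMessages carry a tool_calls list; in the project this function is registered on the graph via add_conditional_edges):

```python
from dataclasses import dataclass, field

@dataclass
class Msg:
    """Stand-in for a LangChain AIMessage, which carries a tool_calls list."""
    content: str
    tool_calls: list = field(default_factory=list)

def should_continue(state: dict) -> str:
    """Route to the tool node while the model keeps requesting tools."""
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "end"
```

The graph loops agent → tools → agent until this function returns "end".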

The reason compound commands work in this architecture is that the loop keeps running until the model stops asking for tools. If you say "create a Python file with a bubble sort function and then summarize it," the agent writes the file, summarizes it, and only then produces a final response. A simple intent classifier stops after the first action. A real agent loop does not stop until the task is fully complete.

Human-in-the-loop confirmation is implemented using LangGraph's interrupt() primitive inside the tool node. Before executing any write operation, the graph pauses and surfaces the pending tool call to the user, showing the exact filename and content that would be written. The user approves or rejects. LangGraph's MemorySaver checkpoints the entire graph state at every step, so when the user makes a decision the graph resumes from exactly where it paused.
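Conceptually the gate around destructive tools looks like this. A framework-free sketch: in the project, LangGraph's interrupt() and MemorySaver do the pausing and resuming, and the `confirm` callable here is a stand-in for that mechanism:

```python
def gate_tool_call(tool_name: str, args: dict, confirm) -> str:
    """Run destructive tools only after an explicit user decision.

    `confirm` is a callable that shows the pending call and returns
    True/False; in the real app this is where interrupt() pauses the graph.
    """
    DESTRUCTIVE = {"write_to_file"}  # read-only tools skip confirmation
    if tool_name in DESTRUCTIVE and not confirm(tool_name, args):
        return f"User rejected {tool_name}"
    return f"Executed {tool_name}"  # placeholder for the real tool dispatch
```

The important property is that the rejection result flows back into the conversation, so the model knows the action was refused rather than silently dropped.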

Layer 3: Memory

Mem0 handles long-term memory. At the start of every agent invocation, the agent node searches Mem0 for relevant context from past conversations using the current user message as the search query. The results get injected into the system prompt before the model is called. After the model responds, the interaction gets saved back to Mem0.
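The per-turn cycle looks like this, sketched with a toy in-memory store standing in for Mem0 (the real client's search and add calls have different signatures):

```python
class ToyMemory:
    """Stand-in for Mem0: stores facts and does naive keyword search."""
    def __init__(self):
        self.facts = []

    def add(self, fact: str):
        self.facts.append(fact)

    def search(self, query: str):
        words = set(query.lower().split())
        return [f for f in self.facts if words & set(f.lower().split())]

def build_system_prompt(base: str, memory: ToyMemory, user_message: str) -> str:
    """Inject relevant past-session context before calling the model."""
    hits = memory.search(user_message)
    if not hits:
        return base
    return base + "\n\nRelevant memories:\n" + "\n".join(f"- {h}" for h in hits)
```

Mem0 replaces the keyword matching with semantic search over embeddings, but the control flow in the agent node is the same: search, inject, call the model, save.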

LangGraph's MemorySaver handles short-term memory within a session, persisting the message history across turns using a thread ID. The two systems are doing different things: LangGraph remembers what happened this session; Mem0 remembers what happened in every session before this one.

Layer 4: The TUI

The interface is built with Textual. Pressing 1 starts microphone recording in a background worker thread. Pressing Enter stops the recording, Whisper transcribes the audio, and the transcript gets pre-filled into a text input so the user can review and correct it before submitting. Pressing 2 asks for an audio file path. Pressing 3 opens a text input directly.

The agent runs in a worker thread so the UI does not freeze during inference. Updates from the worker thread get posted back to the UI using call_from_thread, which is Textual's mechanism for safely bridging blocking background work and the async event loop.

Each conversation turn renders as a widget with a collapsible trace panel that shows the tool call chain, and a RichLog below it where the agent response streams in. When a write operation triggers the human-in-the-loop interrupt, an approve/reject button bar appears at the bottom of the screen.


The Challenges That Actually Took Time

Python imports in a multi-package project. When you run a script directly, Python adds the script's directory to sys.path. When a module is imported as part of a package, that directory is no longer in the path. This caused ModuleNotFoundError in multiple places early on and the fix, setting up an editable install with pip install -e ., is not something you immediately reach for if you haven't built a project at this scale before.
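For reference, the editable install only needs a minimal pyproject.toml. The names below are illustrative, not the project's actual metadata:

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "voice-agent"   # illustrative name
version = "0.1.0"

[tool.setuptools.packages.find]
where = ["src"]        # assumes a src/ layout; drop for a flat layout
```

After `pip install -e .`, every package resolves through site-packages regardless of which script you run, so the sys.path behavior stops mattering.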

Small models need precise system prompts. Without explicitly telling Qwen3:4b to execute compound commands fully, it would stop after the first tool call and declare success. Without telling it that input is transcribed speech and might be informal, it would occasionally misinterpret natural spoken phrasing. The system prompt for a small local model is not optional boilerplate. It is load-bearing.

The create_agent detour. Midway through building the agent layer, I found LangChain's create_agent abstraction which promised human-in-the-loop support out of the box via HumanInTheLoopMiddleware. I spent time integrating it before realising that it sits on top of LangGraph and essentially hides the graph from you. That is fine for simple use cases, but the moment I needed to wire Mem0 into a specific node in the graph, the abstraction got in the way. There was no clean place to inject the memory fetch and save steps because create_agent manages the graph internally. I ended up scrapping it entirely and going back to a hand-built StateGraph, which gave me full control over exactly what each node does and in what order. The lesson here is that high-level abstractions are great until they are not, and knowing when to drop down a level is a skill worth developing early.

Mem0's v2 API search semantics. The search method requires filters to be passed as an explicit filters parameter rather than as a keyword argument. The error message it gives you when you get this wrong says "Filters are required and cannot be empty" which is not immediately helpful if you are following older documentation or examples. Once you find the right parameter it is a one line fix, but it costs time to get there.

Whisper on CPU. The base model on CPU takes a few seconds per transcription. This is acceptable for a demo but would not be acceptable in production. If your machine cannot run Whisper efficiently, the README documents how to swap it for a Groq-hosted STT endpoint.

Threading in a terminal UI. Textual is async. Whisper is blocking. LangGraph's stream is synchronous. Getting all three to coexist without freezing the interface required being deliberate about which operations run in thread workers versus async workers, and making sure that every UI update from a background thread goes through call_from_thread.


What I Would Do Differently

I would add silence detection to the recorder earlier. Right now the user presses Enter to stop recording, which works fine but feels slightly awkward in practice. Stopping on a half second of silence is better UX and is not especially hard to implement with numpy.
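The detection itself is a rolling RMS check over incoming audio chunks. A sketch; the threshold and window sizes are guesses to tune per microphone, not values from the project:

```python
import numpy as np

class SilenceDetector:
    """Stop recording after `max_silent` consecutive quiet chunks.

    With 16 kHz audio in 0.1 s chunks, max_silent=5 is half a second of
    silence. Threshold is RMS amplitude for float32 audio in [-1, 1].
    """
    def __init__(self, threshold=0.01, max_silent=5):
        self.threshold = threshold
        self.max_silent = max_silent
        self.silent_run = 0

    def feed(self, chunk: np.ndarray) -> bool:
        """Return True once enough consecutive silent chunks have arrived."""
        rms = float(np.sqrt(np.mean(np.square(chunk))))
        self.silent_run = self.silent_run + 1 if rms < self.threshold else 0
        return self.silent_run >= self.max_silent
```

The recording callback would call feed() on each chunk and stop the stream when it returns True, replacing the manual Enter press.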

I would also separate the Mem0 search and save into a dedicated memory node in the graph rather than handling both inside the agent node. It works as is but the agent node is doing two things now, and a node that does two things is harder to debug than two nodes that each do one thing.


The Stack

Whisper for speech-to-text. sounddevice for microphone recording. Qwen3:4b via Ollama as the main agent model with a Groq fallback. Gemma3:1b for summarization. LangGraph for the agent graph and short-term memory. Mem0 for long-term memory across sessions. Textual for the terminal interface.

The code is on GitHub. The layered build approach is the most transferable thing from this project. Build the dumbest possible version first, verify it completely, then add one layer of intelligence at a time.
