Kshitij Chauhan

Building a Fully Local Voice-Controlled AI Agent on an 8GB M1 Mac (Without Melting It)

Hey everyone! πŸ‘‹

Recently, I took on a challenging assignment for an AI/ML Developer Internship at Mem0. The goal was to build a local, voice-controlled AI agent that could transcribe audio, understand user intent, and execute local tools (like creating files or writing code) based on that intent.

The catch? It had to run locally, and my daily driver is an Apple M1 MacBook Air with only 8GB of RAM.

Running a Speech-to-Text (STT) model and a Large Language Model (LLM) simultaneously on 8GB of unified memory is like trying to pack a suitcase that’s already full. Here is a breakdown of how I built Project Mercury, the architecture I used, and how I bypassed a massive audio processing headache.
πŸ—οΈ The Architecture
From the beginning, I knew I had to keep the stack incredibly lightweight. I avoided heavy frontend frameworks and bloated backends.

Here is the stack I landed on:

Frontend: Vanilla HTML, CSS, and JavaScript. (Zero build steps, zero framework overhead).

Backend: FastAPI (Python). It’s asynchronous, blazingly fast, and acts as the perfect REST bridge between the UI and the local ML models.

Tool Execution: A strictly sandboxed Python execution environment that routes all generated code and text files into a safe output/ directory.
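That sandboxing step is mostly careful path resolution: never trust a model-suggested filename to stay where you put it. A minimal sketch of the idea (function and directory names here are illustrative, not the project's exact code):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    """Resolve a model-suggested filename and refuse anything that
    would escape the output/ sandbox (e.g. '../../etc/passwd')."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses '..' components, so a traversal attempt
    # lands outside OUTPUT_DIR and fails this check (Python 3.9+)
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Blocked path outside sandbox: {filename}")
    return candidate
```

Every file the agent wants to create goes through a check like this before a single write happens.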

🧠 Model Selection: The 8GB Unified Memory Diet
Because M1 Macs share memory between the CPU and GPU, I had to choose my models very carefully.

1. Speech-to-Text: mlx-whisper

Standard PyTorch implementations of Whisper would eat up too much RAM. Instead, I used mlx-community/whisper-base.en-mlx. Apple’s MLX framework is built specifically for Apple Silicon: it runs models on the GPU via Metal and works directly out of the shared unified memory pool, so the base.en model delivers near-instant transcriptions with a tiny memory footprint.

2. Intent & Generation: Ollama + qwen2.5:0.5b

I needed an LLM smart enough to accurately extract intents into JSON (e.g., classifying "Write a Python script" as a WRITE_CODE intent) but small enough to fit alongside Whisper. I chose Qwen 2.5 (0.5 billion parameters) running via Ollama. At under 400MB, it is astoundingly capable for its size: it handles both the JSON-based intent classification and the actual text/code generation without pushing my Mac into swap.
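The trickiest part of trusting a 0.5B model with intent extraction is its output format: small models occasionally wrap their JSON in prose or code fences, so the backend parses defensively and rejects anything it doesn't recognize. Here's the shape of that validation step (simplified; intent names other than WRITE_CODE and CREATE_FILE are illustrative):

```python
import json
import re

# The set of intents the backend actually knows how to execute;
# anything else from the model is rejected outright
ALLOWED_INTENTS = {"WRITE_CODE", "CREATE_FILE", "ANSWER"}

def parse_intent(raw: str) -> dict:
    """Pull the first JSON object out of the model's reply and
    check that the intent is one we can actually handle."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model output")
    data = json.loads(match.group(0))
    if data.get("intent") not in ALLOWED_INTENTS:
        raise ValueError(f"Unknown intent: {data.get('intent')!r}")
    return data
```

This way a chatty reply like `Sure! {"intent": "WRITE_CODE", ...}` still parses cleanly, and a hallucinated intent fails loudly instead of reaching the tool executor.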

🚧 The Biggest Challenge: The ffmpeg Nightmare
The hardest part of this project wasn't the AIβ€”it was the audio processing.

The Problem: When you record audio in a web browser using the MediaRecorder API, it defaults to creating a WebM/Opus file. Whisper, however, expects standard formats like WAV or MP3. Normally, backend developers solve this by installing ffmpeg on the server to decode the audio before passing it to Whisper.
But I wanted this project to be truly portable. Forcing users to deal with system-level ffmpeg installations via Homebrew is a horrible developer experience.

The Solution: I shifted the audio encoding entirely to the browser.
Instead of sending a WebM file to my FastAPI backend, I used the browser's native AudioContext API. When the user stops recording, the frontend takes the raw audio buffer, decodes it, and manually encodes it into a clean 16-bit PCM WAV (16 kHz, mono) file.
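The real encoder lives in browser JavaScript, but the logic is simple enough to mirror in Python with the stdlib wave module, which is also handy for verifying uploads server-side. A sketch of the same 16-bit PCM encoding (illustrative, not the actual frontend code):

```python
import struct
import wave
from io import BytesIO

def encode_wav(samples: list[float], sample_rate: int = 16000) -> bytes:
    """Encode float samples in [-1.0, 1.0] as a 16-bit PCM mono WAV,
    mirroring what the browser does with its decoded AudioBuffer."""
    buf = BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)         # mono
        wav.setsampwidth(2)         # 16-bit samples
        wav.setframerate(sample_rate)
        # Clamp each sample, then scale to the signed 16-bit range
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()
```

With this format agreed on both sides, the FastAPI backend can hand the bytes straight to Whisper with zero system dependencies. No ffmpeg. No Homebrew. It just works.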

πŸ›‘οΈ Safety First: Human-in-the-Loop (HITL)
Giving an LLM the ability to write files to your local hard drive is inherently dangerous.

To solve this, I implemented a strict Human-in-the-Loop (HITL) system. Whenever Qwen detects a destructive intent (like CREATE_FILE or WRITE_CODE), the backend halts execution. The frontend pops up a modal showing exactly what the AI wants to do, what file it wants to name, and waits for the user to explicitly click Approve before a single byte is written to the disk.
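Stripped of the FastAPI plumbing, the approval flow boils down to a pending-actions table that sits between the model and the tool executor (names simplified for illustration):

```python
import uuid

# Intents that must never run without explicit user approval
DESTRUCTIVE_INTENTS = {"CREATE_FILE", "WRITE_CODE"}

pending = {}  # action_id -> proposed action awaiting approval

def execute(action):
    # Placeholder for the real sandboxed file writer
    return {"status": "executed", "intent": action["intent"]}

def propose(action):
    """If the intent is destructive, park the action and return an id
    the frontend can show in an approval modal; otherwise run it now."""
    if action["intent"] in DESTRUCTIVE_INTENTS:
        action_id = str(uuid.uuid4())
        pending[action_id] = action
        return {"status": "pending_approval", "action_id": action_id}
    return execute(action)

def approve(action_id):
    """Called only after the user explicitly clicks Approve in the UI."""
    return execute(pending.pop(action_id))
```

The key property: `execute` is only ever reachable through a non-destructive intent or an explicit `approve` call tied to something the user saw on screen.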

πŸš€ Wrapping Up
Building Project Mercury taught me a ton about resource optimization, edge computing, and the incredible power of Apple's MLX framework. If you want to dig into the code or see it in action, check out the links below!

πŸ”— GitHub Repository: https://github.com/KN-lang/Project-Mercury
πŸ“Ί Video Demo: https://youtu.be/nee8HdI8ArI

Have you ever tried running local LLMs on lower-end hardware? Let me know what models you ended up using in the comments!
