<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manas Ranjan Jena</title>
    <description>The latest articles on DEV Community by Manas Ranjan Jena (@manas_ranjanjena_6946ef7).</description>
    <link>https://dev.to/manas_ranjanjena_6946ef7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874419%2Ff3b5f32b-bc91-4041-b641-05d2c8fb1dc8.jpg</url>
      <title>DEV Community: Manas Ranjan Jena</title>
      <link>https://dev.to/manas_ranjanjena_6946ef7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manas_ranjanjena_6946ef7"/>
    <language>en</language>
    <item>
      <title>Building EchoKernel: A Voice-Controlled AI Agent That Actually Does Things</title>
      <dc:creator>Manas Ranjan Jena</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:24:38 +0000</pubDate>
      <link>https://dev.to/manas_ranjanjena_6946ef7/building-echokernel-a-voice-controlled-ai-agent-that-actually-does-things-1l5a</link>
      <guid>https://dev.to/manas_ranjanjena_6946ef7/building-echokernel-a-voice-controlled-ai-agent-that-actually-does-things-1l5a</guid>
      <description>&lt;p&gt;I want to be upfront about something before we start: the phrase "local AI agent" is one of the most overloaded terms in the current AI landscape. Half the demos you'll find online are chatbots with a file-picker attached. The other half require a $3,000 workstation with 24GB of VRAM just to boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EchoKernel&lt;/strong&gt; is my attempt to build something in the middle — a voice-controlled agent that genuinely executes actions on your local machine (creating files, writing code, summarizing text), runs on any laptop without a GPU, and has a pipeline transparent enough that you can understand and modify every stage.&lt;/p&gt;

&lt;p&gt;This article walks through the full architecture, the reasoning behind every major decision, and the specific bugs that bit me hardest. The &lt;a href="https://github.com/ManasRanjanJena253/EchoKernel" rel="noopener noreferrer"&gt;source code is on GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;p&gt;You speak a command (or type one). EchoKernel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transcribes your audio to text using Groq's Whisper API&lt;/li&gt;
&lt;li&gt;Sends that transcript to LLaMA 3.3 70B to classify your intent as structured JSON&lt;/li&gt;
&lt;li&gt;Routes the intent to the right local tool — file creation, code generation, summarization, or chat&lt;/li&gt;
&lt;li&gt;Displays the transcription, detected intent, action taken, and output in a clean three-panel UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A full interaction looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;says&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that retries failed HTTP requests and save it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;Whisper&lt;/span&gt; &lt;span class="n"&gt;Large&lt;/span&gt; &lt;span class="n"&gt;v3&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;transcript&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;LLaMA&lt;/span&gt; &lt;span class="mf"&gt;3.3&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;    &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;    &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;generates&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writes&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;UI&lt;/span&gt;               &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;shows&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything generated lands in an &lt;code&gt;output/&lt;/code&gt; directory. Nothing touches the rest of your filesystem.&lt;/p&gt;




&lt;h2&gt;The Architecture&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser  (frontend/index.html)
     │
     │  multipart/form-data  (audio blob + session metadata)
     │  application/json     (text commands, confirmations)
     │
     ▼
FastAPI  (backend/main.py)
     │
     ├── [1] STT Service     →  Groq Whisper API       →  transcript text
     ├── [2] Intent Service  →  Groq LLaMA 3.3 70B     →  structured JSON intent
     ├── [3] Tool Executor   →  local Python functions  →  file / code / summary / chat
     └── [4] Memory Store    →  in-process dict         →  per-session chat history
     │
     ▼
output/   ← every file write is sandboxed here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline is deliberately sequential and single-responsibility. Each stage produces a typed Pydantic object and hands it to the next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TranscriptionResult  →  IntentResult  →  ToolResult
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means any stage can be replaced without touching the others. Want to swap Groq Whisper for a local &lt;code&gt;faster-whisper&lt;/code&gt; binary? You change exactly one async function in &lt;code&gt;stt.py&lt;/code&gt;. The rest of the pipeline doesn't know or care.&lt;/p&gt;




&lt;h2&gt;Stage 1: Speech-to-Text&lt;/h2&gt;

&lt;h3&gt;How it works&lt;/h3&gt;

&lt;p&gt;The browser records audio using the &lt;code&gt;MediaRecorder&lt;/code&gt; API and sends the raw blob to the &lt;code&gt;/agent/audio&lt;/code&gt; endpoint as a multipart upload. FastAPI reads the bytes and forwards them to Groq's &lt;code&gt;/v1/audio/transcriptions&lt;/code&gt; endpoint, which is OpenAI-API-compatible and runs Whisper Large v3 on Groq's LPU hardware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TranscriptionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verbose_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.groq.com/openai/v1/audio/transcriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TranscriptionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;The bug that wasted two hours&lt;/h3&gt;

&lt;p&gt;Browsers typically emit recorded audio as &lt;code&gt;audio/webm;codecs=opus&lt;/code&gt;. My original content-type validation used a Python set membership check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_AUDIO_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/webm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_AUDIO_TYPES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every microphone recording returned a 415 Unsupported Media Type. The issue: &lt;code&gt;"audio/webm;codecs=opus"&lt;/code&gt; is not equal to &lt;code&gt;"audio/webm"&lt;/code&gt;. The codec suffix makes them different strings.&lt;/p&gt;

&lt;p&gt;The fix was switching from exact-match to prefix-match, and separately stripping the codec suffix before forwarding to Groq (which also rejects the full string):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_AUDIO_PREFIXES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/webm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/ogg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_AUDIO_PREFIXES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="c1"&gt;# groq doesn't accept codec params — strip before forwarding
&lt;/span&gt;&lt;span class="n"&gt;clean_content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lesson: always treat browser-emitted MIME types as prefixes, not exact strings.&lt;/p&gt;
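&lt;p&gt;If you'd rather not hand-roll the splitting, the standard library already knows how to separate a media type from its parameters. This is a stdlib alternative, not what EchoKernel itself uses:&lt;/p&gt;

```python
from email.message import EmailMessage

def base_mime_type(content_type: str) -> str:
    """Return the media type with any ;codecs=... parameters stripped."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type
    return msg.get_content_type()  # lowercased, parameters dropped
```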

&lt;h3&gt;Why Whisper Large v3?&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;large-v3&lt;/code&gt; checkpoint is meaningfully better than &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;small&lt;/code&gt; on short, command-like utterances — exactly what a voice agent receives. Smaller checkpoints are more prone to hallucinating filler words or mis-hearing technical terms ("create a YAML file" becoming "create a yaml pile"). The latency difference on Groq's LPU is small enough (~100–200ms) that it's not worth compromising accuracy.&lt;/p&gt;




&lt;h2&gt;Stage 2: Intent Classification&lt;/h2&gt;

&lt;p&gt;This is the most architecturally interesting part of the system. The challenge is taking freeform transcribed text — which could be anything from "make a file called config dot yaml" to "write me a function that debounces events in JavaScript" — and turning it into a structured, typed object that the tool executor can act on reliably.&lt;/p&gt;

&lt;h3&gt;The prompt design&lt;/h3&gt;

&lt;p&gt;The system prompt instructs LLaMA 3.3 70B to return &lt;em&gt;only&lt;/em&gt; a JSON object — no preamble, no explanation, no markdown fences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;INTENT_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an intent classifier for a voice-controlled AI agent.
Analyze the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s transcribed speech and return ONLY a valid JSON object — no markdown, no explanation.

Intent categories:
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants to create an empty file or folder
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants code generated and saved to a file
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants text content summarized
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: general conversation, questions, or anything else
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: multiple distinct actions in one command

JSON schema to return:
{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;one of the five categories&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secondary_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;additional intents if compound, else empty&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;high|medium|low&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;suggested filename with extension if applicable, else null&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;text or topic the user wants to act on, if any&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;one sentence explaining your classification&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API call enforces this at the model level with &lt;code&gt;response_format: {"type": "json_object"}&lt;/code&gt;, which constrains generation to syntactically valid JSON. That means &lt;code&gt;json.loads()&lt;/code&gt; never throws — even if the model returns an unexpected schema, the output is still parseable, and &lt;code&gt;dict.get()&lt;/code&gt; with defaults handles missing fields cleanly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# low temp for deterministic classification
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature is set to &lt;code&gt;0.1&lt;/code&gt; rather than &lt;code&gt;0&lt;/code&gt; — this avoids the model getting stuck in degenerate outputs while still being close to deterministic for classification tasks.&lt;/p&gt;
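&lt;p&gt;Even with &lt;code&gt;json_object&lt;/code&gt; mode, it pays to parse defensively — the mode guarantees syntax, not schema. Here is a sketch of folding the reply into safe defaults; field names follow the prompt's schema, but the exact code in the repo may differ:&lt;/p&gt;

```python
import json

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat", "compound"}

def parse_intent(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to safe defaults."""
    payload = json.loads(raw)  # json_object mode guarantees this succeeds
    primary = payload.get("primary_intent", "chat")
    if primary not in VALID_INTENTS:
        primary = "chat"  # unknown labels degrade to plain conversation
    return {
        "primary_intent": primary,
        "secondary_intents": payload.get("secondary_intents") or [],
        "confidence": payload.get("confidence", "low"),
        "target_filename": payload.get("target_filename"),
        "extracted_content": payload.get("extracted_content"),
        "reasoning": payload.get("reasoning", ""),
    }
```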

&lt;h3&gt;Why the JSON schema has &lt;code&gt;target_filename&lt;/code&gt; and &lt;code&gt;extracted_content&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Early versions of the system made the tool executor re-parse the original transcript to figure out things like "what file did they want to name this?" That's fragile — the tool executor would have to implement its own mini NLP layer.&lt;/p&gt;

&lt;p&gt;Instead, the intent classifier does that work once and packages the results into structured fields. When &lt;code&gt;primary_intent&lt;/code&gt; is &lt;code&gt;write_code&lt;/code&gt;, &lt;code&gt;target_filename&lt;/code&gt; already contains something like &lt;code&gt;"retry_handler.py"&lt;/code&gt; and &lt;code&gt;extracted_content&lt;/code&gt; has the description of what to write. The tool executor receives everything it needs without touching the raw text again.&lt;/p&gt;

&lt;h3&gt;Session history as context&lt;/h3&gt;

&lt;p&gt;The last four turns of conversation history are injected into the intent call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solves a real usability problem: after generating a summary, a user might say "now save that to a file." Without context, this classifies as &lt;code&gt;chat&lt;/code&gt; (there's no explicit action in the phrase). With the prior turn in context, the model correctly classifies it as a compound &lt;code&gt;summarize&lt;/code&gt; + &lt;code&gt;create_file&lt;/code&gt; intent with the summary content extracted from the assistant's previous response.&lt;/p&gt;
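&lt;p&gt;The per-session memory backing this is just the in-process dict from the architecture diagram. A minimal sketch — the store shape and function names here are illustrative, not from the repo:&lt;/p&gt;

```python
from collections import defaultdict

# session_id -> list of {"role": ..., "content": ...} turns
_sessions: dict[str, list[dict]] = defaultdict(list)

def remember(session_id: str, role: str, content: str) -> None:
    _sessions[session_id].append({"role": role, "content": content})

def recent_turns(session_id: str, n: int = 4) -> list[dict]:
    """Last n turns, ready to splice into the intent-classification messages."""
    return _sessions[session_id][-n:]
```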

&lt;h3&gt;Why LLaMA 3.3 70B for classification?&lt;/h3&gt;

&lt;p&gt;I tested this with smaller models during development. The pattern that caused failures was edge cases near intent boundaries — commands like "create a Python file with a retry function" which could reasonably be &lt;code&gt;create_file&lt;/code&gt; OR &lt;code&gt;write_code&lt;/code&gt; (it's actually both — a compound intent). The 70B model consistently identifies these as compound. Models in the 8B–13B range tend to collapse compound utterances to a single intent and miss the secondary action.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Tool Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The routing pattern
&lt;/h3&gt;

&lt;p&gt;The tool executor is a dispatcher that maps &lt;code&gt;primary_intent&lt;/code&gt; strings to async handler functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IntentResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_intent&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_create_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_write_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_compound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each handler returns a &lt;code&gt;ToolResult&lt;/code&gt; — a typed Pydantic model with &lt;code&gt;success&lt;/code&gt;, &lt;code&gt;action_taken&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, and &lt;code&gt;code_content&lt;/code&gt; fields. The frontend renders different UI components depending on which fields are populated (a code block if &lt;code&gt;code_content&lt;/code&gt; is set, a download link if &lt;code&gt;file_path&lt;/code&gt; is set).&lt;/p&gt;
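&lt;p&gt;As a rough sketch of that model (field names come from the description above; the types and defaults are my assumptions, not the project's actual code):&lt;/p&gt;

```python
from typing import Optional
from pydantic import BaseModel

class ToolResult(BaseModel):
    # sketch of the result model described above; defaults are assumptions
    success: bool = True
    action_taken: str = ""
    output: str = ""
    file_path: Optional[str] = None     # set: frontend renders a download link
    code_content: Optional[str] = None  # set: frontend renders a code block

r = ToolResult(action_taken="wrote file", file_path="output/notes.txt")
print(r.success, r.file_path)
```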

&lt;h3&gt;
  
  
  The filesystem sandbox
&lt;/h3&gt;

&lt;p&gt;Every file operation goes through two validation layers before anything touches disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Filename sanitization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# strip anything that could escape the output directory
&lt;/span&gt;    &lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^\w.\-]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Path(name).name&lt;/code&gt; drops any directory components (so &lt;code&gt;"../../etc/passwd"&lt;/code&gt; becomes &lt;code&gt;"passwd"&lt;/code&gt;). The regex then strips anything that isn't a word character, dot, or hyphen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Resolved path validation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolve_output_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_safe_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;safe&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;relative_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# raises ValueError if outside
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After sanitization, the path is still resolved against the real filesystem and checked against &lt;code&gt;OUTPUT_DIR&lt;/code&gt;. If somehow a sanitized filename still resolves outside &lt;code&gt;output/&lt;/code&gt; (symlink attacks, OS-specific edge cases), this raises a &lt;code&gt;ValueError&lt;/code&gt; before any write happens. Two independent layers means a bypass of the first doesn't automatically mean a bypass of the second.&lt;/p&gt;
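&lt;p&gt;To see Layer 2 doing its job in isolation, here is a minimal sketch that deliberately skips sanitization and feeds a traversal path straight to the resolved-path check:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def check_inside_output(filename: str) -> Path:
    # Layer 2 only: resolve against the real filesystem and verify containment
    path = OUTPUT_DIR / filename
    path.resolve().relative_to(OUTPUT_DIR.resolve())  # raises ValueError if outside
    return path

try:
    check_inside_output("../escape.txt")
except ValueError:
    print("blocked: path escapes output/")

print(check_inside_output("notes.txt"))
```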

&lt;h3&gt;
  
  
  Code generation and the markdown fence problem
&lt;/h3&gt;

&lt;p&gt;LLMs are trained to format code inside markdown fences. Even when you explicitly instruct "return only raw code, no markdown fences," frontier models comply about 95% of the time — but that 5% writes invalid Python to disk because the file starts with &lt;code&gt;```python&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix is a post-processing strip that runs regardless of whether the model complied:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;code = await _call_llm(messages)

# strip markdown fences that some models sneak in despite instructions
code = re.sub(r"^```[\w]*\n?", "", code.strip())
code = re.sub(r"\n?```$", "", code.strip())

path.write_text(code, encoding="utf-8")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This costs a negligible regex pass and makes the code writing 100% reliable instead of 95%.&lt;/p&gt;
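&lt;p&gt;The two regexes are easy to verify on their own (same strip logic as above, wrapped in a helper for testing):&lt;/p&gt;

```python
import re

def strip_fences(code: str) -> str:
    # remove a leading ```lang line and a trailing ``` line, if present
    code = re.sub(r"^```[\w]*\n?", "", code.strip())
    code = re.sub(r"\n?```$", "", code.strip())
    return code

print(strip_fences("```python\nprint('hi')\n```"))  # print('hi')
print(strip_fences("print('hi')"))                  # unchanged
```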

&lt;h3&gt;
  
  
  Compound commands
&lt;/h3&gt;

&lt;p&gt;When the intent is &lt;code&gt;compound&lt;/code&gt;, the executor iterates over &lt;code&gt;secondary_intents&lt;/code&gt; and recursively calls &lt;code&gt;execute_tool&lt;/code&gt; for each sub-intent, stitching the results together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def _handle_compound(intent, transcribed_text, history):
    outputs = []
    for sub in intent.secondary_intents or ["chat"]:
        sub_intent = IntentResult(
            primary_intent=sub if sub in VALID_INTENTS else "chat",
            secondary_intents=[],
            target_filename=intent.target_filename,
            extracted_content=intent.extracted_content,
            ...
        )
        result = await execute_tool(sub_intent, transcribed_text, history)
        outputs.append(f"[{sub}] {result.output}")

    return ToolResult(
        action_taken="Executed compound command",
        output="\n\n".join(outputs),
        ...
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This means a command like "summarize this text and save it to notes.txt" executes as two sequential tool calls — first a summarization, then a file write — and the user sees both results merged in a single response card.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4: Session Memory
&lt;/h2&gt;

&lt;p&gt;The memory store is intentionally simple:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;_sessions: dict[str, SessionHistory] = {}

def append_message(session_id: str, role: str, content: str, intent: str | None = None):
    session = _sessions.setdefault(session_id, SessionHistory(session_id=session_id))
    session.messages.append(ChatMessage(role=role, content=content, intent=intent, ...))

def get_history(session_id: str) -&amp;gt; list[dict]:
    session = _sessions.get(session_id)
    return [{"role": m.role, "content": m.content} for m in session.messages] if session else []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A plain Python dict, no database, no Redis. For a single-user local agent this is exactly the right choice — zero setup friction, zero operational overhead, data scoped to the server process lifetime. The frontend generates a UUID on first response and attaches it to every subsequent request, so sessions are naturally isolated.&lt;/p&gt;

&lt;p&gt;The tradeoff is that sessions disappear when you restart the server. For a local development tool this is acceptable. For a production deployment you'd replace the dict with Redis or a lightweight SQLite write — and because of the single-responsibility design, that's a change to &lt;code&gt;memory.py&lt;/code&gt; only.&lt;/p&gt;
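&lt;p&gt;For illustration, a minimal synchronous sketch of that swap, using stdlib &lt;code&gt;sqlite3&lt;/code&gt; rather than the async &lt;code&gt;aiosqlite&lt;/code&gt; the real app would want; the schema and names here are hypothetical:&lt;/p&gt;

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; a real swap would use a file
# path so sessions survive restarts
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages ("
    "session_id TEXT, role TEXT, content TEXT, intent TEXT)"
)

def append_message(session_id, role, content, intent=None):
    conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
                 (session_id, role, content, intent))
    conn.commit()

def get_history(session_id):
    rows = conn.execute("SELECT role, content FROM messages "
                        "WHERE session_id = ?", (session_id,))
    return [{"role": r, "content": c} for r, c in rows]
```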




&lt;h2&gt;
  
  
  The Frontend
&lt;/h2&gt;

&lt;p&gt;The UI is a single HTML file with no build step, no framework, no node_modules. It opens directly in the browser with a double-click.&lt;/p&gt;

&lt;p&gt;The three-panel layout was chosen deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Left&lt;/strong&gt; — input controls (mic, file upload, text, toggles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Center&lt;/strong&gt; — scrollable conversation feed showing the full pipeline result for each interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right&lt;/strong&gt; — output file browser and session history log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most interesting frontend engineering challenge was scroll containment. With a CSS grid layout, if you set &lt;code&gt;overflow: hidden&lt;/code&gt; anywhere in the ancestor chain, the flex children can't scroll independently. The fix requires a specific combination of properties:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;#feed-col {
  display: flex;
  flex-direction: column;
  min-height: 0;   /* critical: without this, the column refuses to shrink */
}

#feed {
  flex: 1;
  overflow-y: auto;
  min-height: 0;   /* allows the feed to scroll rather than grow infinitely */
}

.msg-card {
  flex-shrink: 0;  /* prevents cards from being squished by the layout */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without &lt;code&gt;min-height: 0&lt;/code&gt; on both the column and the scrollable child, the feed grows to fit its content instead of scrolling — so after 3–4 messages the layout breaks and scroll stops working. This is a subtle CSS flexbox behaviour that isn't obvious from reading the spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;When the HITL toggle is on, the &lt;code&gt;/agent/audio&lt;/code&gt; endpoint returns HTTP 202 (Accepted but not yet acted upon) instead of executing immediately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if require_confirmation and detected_intent.primary_intent in ("create_file", "write_code"):
    return JSONResponse(
        status_code=202,
        content={
            "status": "awaiting_confirmation",
            "session_id": sid,
            "transcription": transcription.model_dump(),
            "intent": detected_intent.model_dump(),
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The frontend detects the 202, renders an Execute / Cancel prompt, and only calls the &lt;code&gt;/agent/confirm&lt;/code&gt; endpoint if the user approves. This follows HTTP semantics correctly — 202 Accepted means the request has been accepted for processing, but the processing has not been completed — rather than inventing a custom status code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model and Provider Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Groq over OpenAI, Anthropic, or local models?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;vs. OpenAI:&lt;/strong&gt; The API surface is identical — Groq implements the OpenAI spec, so switching would be a one-line URL change. The difference is latency. Groq's LPU (Language Processing Unit) is purpose-built silicon for transformer inference and delivers roughly 3–5× faster token throughput than GPU-based providers. For a voice pipeline, where you're waiting for STT + LLM sequentially, that difference is the gap between an interaction that feels responsive and one that feels like a web search from 2008.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs. local inference (Ollama, llama.cpp, LM Studio):&lt;/strong&gt; Running LLaMA 3.3 70B locally requires either 16–24GB of VRAM for acceptable GPU inference, or a 15–30 second response time on CPU. Neither is usable for a voice agent. The honest tradeoff is: Groq makes EchoKernel work on any laptop with an internet connection, at the cost of one API key and a few cents per day of usage. The architecture is designed so swapping back to local inference is 10 lines of code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stt.py — replace Groq call with faster-whisper
import faster_whisper
model = faster_whisper.WhisperModel("large-v3", device="cpu")
segments, _ = model.transcribe(audio_bytes_io)
transcript = " ".join(s.text for s in segments)

# intent.py — replace Groq call with Ollama
import ollama
response = ollama.chat(model="llama3.3", messages=messages)
raw = response["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;vs. Anthropic / Google:&lt;/strong&gt; Both offer excellent LLMs but neither provides a Whisper-equivalent STT API. You'd need two providers for one pipeline. Keeping everything on Groq means one API key, one base URL, one billing account.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The MIME type suffix problem&lt;/strong&gt; was the most frustrating bug — a 415 error with no obvious cause until I checked what &lt;code&gt;audio.content_type&lt;/code&gt; actually contained in the server logs. The browser string &lt;code&gt;audio/webm;codecs=opus&lt;/code&gt; is technically a valid MIME type with parameters (RFC 2045), but treating it as an exact match against a set of strings silently rejects every microphone recording.&lt;/p&gt;
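&lt;p&gt;The fix is to compare only the base media type and ignore MIME parameters. A sketch of that check (the allowed set here is illustrative, not the project's exact list):&lt;/p&gt;

```python
ALLOWED_AUDIO = {"audio/webm", "audio/wav", "audio/mpeg", "audio/ogg"}  # illustrative

def is_supported(content_type: str) -> bool:
    # per RFC 2045, parameters follow ';' and don't change the media type itself
    base = content_type.split(";", 1)[0].strip().lower()
    return base in ALLOWED_AUDIO

print(is_supported("audio/webm;codecs=opus"))  # True
print(is_supported("video/mp4"))               # False
```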

&lt;p&gt;&lt;strong&gt;Getting the intent classifier to reliably produce compound intents&lt;/strong&gt; required several prompt iterations. The model would correctly identify compound commands about 70% of the time with a basic prompt. Adding explicit examples of compound vs. single intents in the system prompt, and reducing temperature to 0.1, pushed this above 95%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSS scroll containment in a grid layout&lt;/strong&gt; took longer than it should have. The symptom was that the feed stopped scrolling after a few messages. The root cause was missing &lt;code&gt;min-height: 0&lt;/code&gt; on flex children inside a grid column — a property that's meaningless in most contexts but critical here. The browser's default &lt;code&gt;min-height: auto&lt;/code&gt; for flex items means they will never shrink below their content size, so the scrollable container grows instead of scrolling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;session_id&lt;/code&gt; 422 error&lt;/strong&gt; came from a Pydantic schema where &lt;code&gt;session_id: str&lt;/code&gt; was a required field with no default. The frontend correctly sends &lt;code&gt;null&lt;/code&gt; on first request (there's no session ID yet), but Pydantic rejected &lt;code&gt;null&lt;/code&gt; for a non-optional &lt;code&gt;str&lt;/code&gt;. The fix was &lt;code&gt;session_id: Optional[str] = None&lt;/code&gt;.&lt;/p&gt;
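&lt;p&gt;The failure mode is easy to reproduce in isolation; a sketch, where only the &lt;code&gt;session_id&lt;/code&gt; field is taken from the article:&lt;/p&gt;

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Before(BaseModel):
    session_id: str                    # required: null from the frontend -> 422

class After(BaseModel):
    session_id: Optional[str] = None   # the fix: null is now accepted

try:
    Before(session_id=None)
except ValidationError:
    print("rejected, as in the 422 bug")

print(After(session_id=None).session_id)  # None
```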




&lt;h2&gt;
  
  
  What I'd Build Next
&lt;/h2&gt;

&lt;p&gt;The architecture has four clean extension points:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent memory&lt;/strong&gt; — replace the in-process dict with SQLite via &lt;code&gt;aiosqlite&lt;/code&gt;. Sessions would survive server restarts and you could browse history across days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More tools&lt;/strong&gt; — the tool executor is just a dispatcher. Adding a &lt;code&gt;search_web&lt;/code&gt; tool, a &lt;code&gt;run_shell_command&lt;/code&gt; tool (with appropriate confirmation gates), or a &lt;code&gt;read_file&lt;/code&gt; tool for context injection are all isolated additions to &lt;code&gt;tools.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming responses&lt;/strong&gt; — the current architecture returns a complete response after the LLM finishes. Adding Server-Sent Events would let the UI render code token-by-token as the model generates it, which dramatically improves perceived latency for long outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local model benchmarking&lt;/strong&gt; — running the same test suite against &lt;code&gt;faster-whisper medium&lt;/code&gt; vs &lt;code&gt;large-v3&lt;/code&gt; on CPU, and &lt;code&gt;llama3.1 8B&lt;/code&gt; vs &lt;code&gt;llama3.3 70B&lt;/code&gt; on Ollama, would produce concrete latency/accuracy numbers that justify the current model choices with data rather than intuition.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ManasRanjanJena253/EchoKernel
cd EchoKernel

cp .env.example .env
# add your GROQ_API_KEY to .env

pip install -r backend/requirements.txt
python run.py

# then open frontend/index.html in your browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Get a free Groq API key at &lt;a href="https://console.groq.com" rel="noopener noreferrer"&gt;console.groq.com&lt;/a&gt;. The free tier is more than enough for development and demos.&lt;/p&gt;




&lt;p&gt;If you have questions about any part of the architecture or hit a bug I didn't cover, the GitHub issues are open. And if you end up extending it with new tools or a local model swap, I'd genuinely like to see it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
