Gad Shalev
How I Built a Multi-Agent AI Orchestrator with Voice Control (Architecture Deep Dive)

I've been working with AI coding agents — Claude Code, Codex CLI, Cursor — and hit a wall that I think a lot of developers are running into: managing multiple agents at once is a mess.

Three terminal windows. Three separate contexts. No shared memory. No way to talk to all of them without tab-switching and copy-pasting. I wanted to treat them like a team, so I built Jam — an open-source desktop app that orchestrates multiple AI agents from one interface, with voice control.

This post is a technical walkthrough of the architecture decisions, the hard problems, and what I learned building it.

The Architecture

Jam is a TypeScript monorepo built on Electron + React. Here's the high-level structure:

```
packages/
  core/           # Domain models, port interfaces, events
  eventbus/       # In-process pub/sub EventBus
  agent-runtime/  # PTY management, agent lifecycle, runtimes
  voice/          # STT/TTS providers, command parser
  memory/         # File-based agent memory & persistence
apps/
  desktop/        # Electron + React + Zustand desktop app
```

The key architectural decision was defining port interfaces in @jam/core with concrete implementations in separate packages. This means runtimes (Claude Code, Codex, etc.) and voice providers (Whisper, ElevenLabs, OpenAI) are pluggable via the Strategy pattern. Adding a new agent runtime means implementing an interface, not modifying core logic.
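As a rough sketch of that split — the names here (AgentRuntimePort, RuntimeRegistry, spawnSpec) are my own illustrations, not Jam's actual identifiers:

```typescript
// Hypothetical sketch of the port/implementation split; names are illustrative.

// Lives in @jam/core: the port knows nothing about any concrete CLI.
interface AgentRuntimePort {
  readonly id: string; // e.g. "claude-code", "codex"
  spawnSpec(workdir: string): { cmd: string; args: string[]; cwd: string };
}

// Lives in @jam/agent-runtime: one strategy per agent CLI.
class ClaudeCodeRuntime implements AgentRuntimePort {
  readonly id = "claude-code";
  spawnSpec(workdir: string) {
    return { cmd: "claude", args: [], cwd: workdir };
  }
}

// Core resolves runtimes by id; adding a new one is a register() call,
// not a change to core logic.
class RuntimeRegistry {
  private runtimes = new Map<string, AgentRuntimePort>();
  register(rt: AgentRuntimePort): void {
    this.runtimes.set(rt.id, rt);
  }
  get(id: string): AgentRuntimePort {
    const rt = this.runtimes.get(id);
    if (!rt) throw new Error(`Unknown runtime: ${id}`);
    return rt;
  }
}
```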

Problem #1: Real PTY Management

The first thing I got wrong was trying to wrap agent CLIs with HTTP/API calls. That strips away half their power — tool use, file system access, interactive prompts.

Instead, each agent gets a real pseudo-terminal (PTY) via node-pty. This means:

  • Agents run as actual CLI processes on your machine
  • Full tool use, shell access, and file editing capabilities preserved
  • No middleware stripping features
  • You see exactly what you'd see if you ran the CLI yourself

The AgentManager handles lifecycle — spawning, monitoring, and gracefully shutting down PTY processes. Each agent gets its own working directory, so they can operate on different projects simultaneously.

```typescript
// Simplified — each agent gets an isolated PTY
interface AgentRuntime {
  spawn(config: AgentConfig): PTYProcess;
  send(input: string): void;
  onOutput(callback: (data: string) => void): void;
  terminate(): Promise<void>;
}
```
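To make the lifecycle concrete, here's a toy manager in the same shape. The real implementation sits on node-pty; the FakePTY here is purely a stand-in, and every name in this sketch is hypothetical:

```typescript
// Toy lifecycle sketch — FakePTY stands in for a real node-pty process.
interface PTYLike {
  write(input: string): void;
  onData(cb: (data: string) => void): void;
  kill(): void;
}

class FakePTY implements PTYLike {
  private listeners: Array<(d: string) => void> = [];
  write(input: string) {
    // A real PTY would run the agent CLI; the fake just echoes.
    this.listeners.forEach((cb) => cb(`echo: ${input}`));
  }
  onData(cb: (d: string) => void) {
    this.listeners.push(cb);
  }
  kill() {
    this.listeners = [];
  }
}

class AgentManager {
  // Each agent owns a PTY plus its own working directory.
  private agents = new Map<string, { pty: PTYLike; workdir: string }>();

  spawn(name: string, workdir: string, pty: PTYLike = new FakePTY()): PTYLike {
    this.agents.set(name, { pty, workdir });
    return pty;
  }
  send(name: string, input: string): void {
    this.agents.get(name)?.pty.write(input);
  }
  terminate(name: string): void {
    this.agents.get(name)?.pty.kill();
    this.agents.delete(name);
  }
  list(): string[] {
    return [...this.agents.keys()];
  }
}
```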

Running 4-5 agents in parallel on a MacBook Pro works fine. The bottleneck isn't local compute — it's the API rate limits of the underlying models.

Problem #2: Voice Routing to the Right Agent

This was the most fun problem to solve. When you have 3 agents running and you say "Sue, refactor the auth middleware" — how does the system know to route it to Sue and not Max?

The pipeline:

  1. STT (Speech-to-Text): Audio → text via Whisper
  2. Command Parser: Extract agent name + intent from transcription
  3. Router: Match agent name, send command to the correct PTY
  4. TTS (Text-to-Speech): Agent response → audio via ElevenLabs or OpenAI

The command parser does name-based routing. It's surprisingly robust — even with Whisper's occasional transcription quirks, matching against a known list of agent names works well. Each agent can have a unique TTS voice, so you can tell them apart just by sound.
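A minimal version of that routing step might look like this — a sketch, not Jam's actual parser, but it shows why matching against a known name list is forgiving of transcription noise:

```typescript
// Sketch of name-based routing. Tolerates Whisper quirks like
// "Sue," vs "sue:" and case differences by anchoring on known names.
function routeCommand(
  transcript: string,
  agentNames: string[],
): { agent: string; command: string } | null {
  const cleaned = transcript.trim();
  for (const name of agentNames) {
    // Match the agent name at the start, followed by punctuation/space.
    const re = new RegExp(`^${name}[\\s,:]+(.*)$`, "i");
    const m = cleaned.match(re);
    if (m) return { agent: name, command: m[1].trim() };
  }
  return null; // No known agent addressed — ignore or ask for clarification.
}
```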

```typescript
interface VoiceService {
  startListening(): void;
  onCommand(handler: (agentName: string, command: string) => void): void;
  speak(agentName: string, text: string): Promise<void>;
}
```

Problem #3: Persistent Memory Without a Cloud

Most AI tools forget everything when you close the session. I wanted agents that remember.

The solution is dead simple: file-based persistence.

```
~/.jam/agents/sue/
├── SOUL.md              # Living personality — evolves over time
├── conversations/       # Daily JSONL conversation logs
│   ├── 2026-02-20.jsonl
│   └── 2026-02-21.jsonl
└── skills/              # Auto-generated reusable skill files
    ├── react-patterns.md
    └── deploy-staging.md
```

SOUL.md — Living Personalities

This is my favorite feature. Each agent has a SOUL.md that starts as a basic personality prompt but evolves as you work together. The agent updates its own soul file to reflect what it's learned — your coding conventions, your preferences, project-specific knowledge.

After a week of working with an agent, its SOUL.md contains institutional knowledge that makes it genuinely more useful. It's not RAG, it's not fine-tuning — it's just a markdown file that the agent reads at session start and writes to when it learns something worth remembering.
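The read/append mechanics are as plain as they sound. Here's a sketch with Node's fs module — the path layout matches the tree above, but the helper names and entry format are my assumptions:

```typescript
// Illustrative SOUL.md helpers; entry format and function names are assumptions.
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

const soulPath = (agent: string, base: string = path.join(os.homedir(), ".jam")) =>
  path.join(base, "agents", agent, "SOUL.md");

// Read the soul file at session start (empty personality if none exists yet).
function loadSoul(agent: string, base?: string): string {
  const p = soulPath(agent, base);
  return fs.existsSync(p) ? fs.readFileSync(p, "utf8") : "";
}

// Append a dated learning; the agent calls something like this when it
// decides a fact is worth remembering.
function remember(agent: string, note: string, base?: string): void {
  const p = soulPath(agent, base);
  fs.mkdirSync(path.dirname(p), { recursive: true });
  const stamp = new Date().toISOString().slice(0, 10);
  fs.appendFileSync(p, `\n- (${stamp}) ${note}`, "utf8");
}
```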

Dynamic Skills

When an agent figures out a recurring pattern — how to deploy your staging environment, your team's PR review process, your test conventions — it writes a skill file. These are markdown docs stored in the agent's skills/ directory. Next session, the agent (or any agent) can reference them.

This is emergent behavior you get from giving agents persistent, writable storage. They naturally start documenting what they learn.
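One way this composes at session start — again a sketch under my own assumptions about the prompt format, not Jam's actual code — is folding the soul plus any skill files into a single preamble:

```typescript
// Illustrative only: how skill files might be folded into a session preamble.
interface Skill {
  name: string; // e.g. derived from the .md filename, like "deploy-staging"
  body: string; // the markdown content of the skill file
}

function buildSessionPreamble(soul: string, skills: Skill[]): string {
  const sections = [soul.trim()];
  for (const s of skills) {
    sections.push(`## Skill: ${s.name}\n\n${s.body.trim()}`);
  }
  return sections.filter((x) => x.length > 0).join("\n\n");
}
```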

Problem #4: The UI — Chat vs. Stage View

Jam has two views:

  • Chat View: Unified conversation stream across all agents. Good for focused work with one agent at a time.
  • Stage View: A grid showing all agents' terminals simultaneously. This is the "mission control" view — you see what every agent is doing in real time.

Stage View is built with a responsive grid layout in React. Each cell is a terminal renderer connected to an agent's PTY output stream. The state management is Zustand — lightweight, no boilerplate, perfect for this kind of app where you need reactive updates from multiple async sources (PTY streams, voice events, etc.).
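The reactive shape is roughly this — a hand-rolled stand-in for Zustand's store (the real app would use zustand's `create()`; everything here, including the state shape, is illustrative):

```typescript
// Minimal vanilla-store sketch approximating the Zustand pattern:
// a single state object, setState merges, subscribers get notified.
type Listener<S> = (state: S) => void;

function createStore<S>(initial: S) {
  let state = initial;
  const listeners = new Set<Listener<S>>();
  return {
    getState: () => state,
    setState: (partial: Partial<S>) => {
      state = { ...state, ...partial };
      listeners.forEach((l) => l(state));
    },
    subscribe: (l: Listener<S>) => {
      listeners.add(l);
      return () => listeners.delete(l); // unsubscribe
    },
  };
}

// Stage View state: every PTY stream appends into its agent's buffer,
// and each grid cell re-renders from the slice it subscribes to.
interface StageState {
  outputs: Record<string, string>;
}
const stage = createStore<StageState>({ outputs: {} });

function appendOutput(agent: string, chunk: string) {
  const { outputs } = stage.getState();
  stage.setState({
    outputs: { ...outputs, [agent]: (outputs[agent] ?? "") + chunk },
  });
}
```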

What I'd Do Differently

  1. Start with the voice pipeline. I built agent management first and added voice later. Voice control changes the entire UX paradigm — I should have designed around it from day one.

  2. Test PTY management on Windows earlier. macOS and Linux PTY behavior is similar. Windows... is Windows. ConPTY works, but the edge cases are different.

  3. Invest in the SOUL.md format sooner. The living personality system is the feature that creates the most long-term value. I underestimated how useful persistent agent memory would be.

Try It

Jam is MIT licensed:

github.com/dag7/jam

Pre-built binaries for macOS, Windows, and Linux. Or:

```bash
git clone https://github.com/Dag7/jam.git
cd jam && ./scripts/setup.sh && yarn dev
```

If you're juggling multiple AI coding agents, give it a try. If you have ideas or want to contribute — issues and PRs are open.
