I got tired of pausing to look things up.
Reading a book and hitting a word I didn't know. Watching anime and missing a line. Every time, the same friction — stop what I'm doing, open a new tab, search, lose my place, lose the moment. I wanted something that would just... explain it to me. Out loud. Without me doing anything.
So I built Samuel. A voice AI that floats on my macOS desktop, watches my screen, listens to my audio, and talks to me in real time. No typing. No switching windows. I just keep doing what I'm doing and he's there when I need him.
But the thing that turned into the most interesting engineering problem wasn't the language teaching. It was this: what happens when I want Samuel to do something he doesn't know how to do yet?
The Problem With Fixed Tool Sets
Most AI agents — including ones built on OpenAI's Agents SDK — have a fixed set of tools defined at compile time. You add a tool, rebuild the app, restart. That's fine for most use cases.
But Samuel is a desktop companion. He's running continuously. Rebuilding every time I want a new capability breaks the whole experience.
I wanted to be able to say "Hey Samuel, add a weather tool" and have it work. Immediately. In the same conversation.
What I Built: Runtime Self-Modification
Here's what it actually looks like:
You: "Hey Samuel, add a weather tool"
Samuel: "I'll create a tool that fetches weather from wttr.in. [Approve] [Reject]"
You: *clicks Approve*
Samuel: *generates code via GPT-4o-mini → writes to disk → hot-loads into live session*
Samuel: "Done, sir. The weather tool is ready."
You: "What's the weather in Tokyo?"
Samuel: "Currently 18°C and partly cloudy in Tokyo, sir."
No rebuild. No restart. The new tool is live in the same voice conversation.
How It Works Under the Hood
The flow has four steps:
1. propose_plugin
When Samuel receives a request for a new capability, he calls propose_plugin with a description of what he'll build. This triggers a UI card in the frontend with Approve and Reject buttons and a description of the code that will be generated. Nothing runs until the user approves.
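The gating logic can be sketched like this. The function and field names (`proposePlugin`, `approve`, the card text) are my assumptions for illustration, not Samuel's actual code; the point is that a proposal is stored and surfaced for approval, and nothing executes until it's approved:

```javascript
// Illustrative names, not Samuel's real API. Proposals are stashed and
// surfaced for approval; no code runs at this stage.
const pendingProposals = new Map();

function proposePlugin({ name, description }) {
  const id = `proposal-${pendingProposals.size + 1}`;
  pendingProposals.set(id, { name, description, approved: false });
  // The returned card text is what the frontend renders with the buttons.
  return { id, card: `I'll create ${name}: ${description}. [Approve] [Reject]` };
}

function approve(id) {
  const proposal = pendingProposals.get(id);
  if (proposal) proposal.approved = true;
  return proposal;
}

const { id, card } = proposePlugin({
  name: 'weather',
  description: 'a tool that fetches weather from wttr.in',
});
const proposal = approve(id);
```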
2. User approval
The user clicks Approve, or says "yes" / "go ahead." Samuel is listening either way.
3. write_plugin
This is where the actual generation happens. Samuel calls a separate GPT-4o-mini completion (not the live Realtime session — keeping it separate avoids adding latency to the voice conversation) with a system prompt that includes the plugin schema and the available APIs. The output is a JavaScript plugin object with a standard shape: a name, a description, a parameters schema, and an async execute function. The generated file is saved to ~/.samuel/plugins/.
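For concreteness, here's a hedged sketch of what a generated plugin file might look like. The fields match the shape described above, but the exact schema Samuel's generator emits may differ:

```javascript
// Hypothetical generated plugin file; the real schema may differ.
const plugin = {
  name: 'weather',
  description: 'Fetch current weather for a city via wttr.in',
  parameters: {
    type: 'object',
    properties: {
      city: { type: 'string', description: 'City name, e.g. "Tokyo"' },
    },
    required: ['city'],
  },
  // The generated async execute function does the actual work at call time.
  async execute({ city }) {
    const res = await fetch(`https://wttr.in/${encodeURIComponent(city)}?format=j1`);
    const data = await res.json();
    const now = data.current_condition[0];
    return `Currently ${now.temp_C}°C and ${now.weatherDesc[0].value.toLowerCase()} in ${city}.`;
  },
};

module.exports = plugin;
```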
4. Hot-load into the live session
This is the part that took the most iteration. The plugin loader reads the file, executes it via new Function(), and registers the new tool with the live agent. The critical piece is session.updateAgent() from @openai/agents — this lets you swap the agent's tool set mid-conversation without dropping the Realtime API WebRTC session.
The new tool is live. No restart. The same conversation continues.
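A minimal sketch of the hot-load step. The `new Function()` mechanics are real; the `updateAgent` stub stands in for `session.updateAgent()` from `@openai/agents`, whose exact call shape I'm not reproducing here:

```javascript
// The generated source, as it might come back from the writer model.
const pluginSource = `
  return {
    name: 'echo',
    description: 'Echo the input back, uppercased',
    async execute({ text }) { return text.toUpperCase(); },
  };
`;

// new Function() evaluates the source and hands back the plugin object.
// Note: this runs with full JS access (see the sandboxing note later).
const plugin = new Function(pluginSource)();

// Stand-in for the live agent; the real code hands the updated tool set to
// session.updateAgent() so the Realtime WebRTC session keeps running.
const agent = { tools: new Map() };
function updateAgent(liveAgent, newPlugin) {
  liveAgent.tools.set(newPlugin.name, newPlugin);
}
updateAgent(agent, plugin);

// The tool is callable immediately, mid-"conversation".
agent.tools.get('echo').execute({ text: 'hello' }).then((r) => console.log(r));
```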
Self-Repair
The same architecture works for bug fixes. If a plugin throws an error, Samuel catches it, reads the error message, proposes a fix, and rewrites the file after approval. Old versions are backed up automatically before any overwrite.
Samuel: "Sir, the weather tool returned an error — the API endpoint changed.
I can rewrite it to use the v2 endpoint. [Approve] [Reject]"
You: *clicks Approve*
Samuel: *rewrites plugin → hot-loads → tests*
Samuel: "Fixed, sir."
This means Samuel can maintain his own plugin ecosystem over time without me touching the codebase.
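The repair loop can be sketched like this, simplified to synchronous execution for clarity. The helper names (`runWithRepair`, `proposeFix`) are invented for illustration, not Samuel's actual API:

```javascript
const backups = [];

// Run a plugin; on failure, back up its source and return a fix proposal
// for the approval card. Nothing is rewritten automatically.
function runWithRepair(plugin, args, proposeFix) {
  try {
    return { ok: true, value: plugin.execute(args) };
  } catch (err) {
    // Old version is backed up before any overwrite can happen.
    backups.push({ name: plugin.name, source: plugin.source });
    return { ok: false, proposal: proposeFix(plugin, err) };
  }
}

const broken = {
  name: 'weather',
  source: '/* old generated source */',
  execute() { throw new Error('API endpoint changed'); },
};

const outcome = runWithRepair(broken, {}, (p, err) =>
  `The ${p.name} tool returned an error: ${err.message}. I can rewrite it. [Approve] [Reject]`
);
```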
A Note on Sandboxing
new Function() is not OS-sandboxed. Plugins have full JavaScript access within the renderer process. That's a deliberate choice: the current security model is the user-approval flow. You see a description of what will be generated before anything runs, and you explicitly approve it.
OS-level sandboxing for plugins is on the roadmap. For now, the trust model is: you approve what gets generated, you can read the generated file at ~/.samuel/plugins/, and you can delete it at any time.
The Rest of Samuel
The self-modifying plugin system is the most novel piece, but Samuel does a lot more:
Always watching, always listening. He runs a continuous perception loop — GPT-4o Vision captures the screen every 20 seconds (with change detection to skip identical frames), and ScreenCaptureKit captures system audio for transcription, with PID-level filtering that excludes Samuel's own voice output. A triage classifier decides whether each observation warrants interruption.
Ambient language teaching. The original use case. Samuel watches your screen, hears the dialogue, and speaks vocabulary out loud without you doing anything. Drop a YouTube link and he fetches synced lyrics, annotates vocabulary and grammar, and embeds the player. Drop a manga screenshot and he OCRs it right-to-left and breaks down the words. Your flashcards are saved 20-second audio clips from the actual scene — not text cards.
Persistent memory. Three memory types, all local in ~/.samuel/memory.json: behavioral preferences ("be more concise"), corrections ("that explanation was wrong, don't repeat it"), and facts ("I'm intermediate at Japanese"). These load into every session. Say something once and he remembers it permanently.
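The three memory types map naturally onto a small JSON file whose entries get flattened into the system prompt at session start. The field names below are my guess at the shape, not the actual ~/.samuel/memory.json schema:

```javascript
// Hypothetical shape for the memory file; the entries are the examples
// from the text above.
const memory = {
  preferences: ['be more concise'],
  corrections: ["that explanation was wrong, don't repeat it"],
  facts: ["I'm intermediate at Japanese"],
};

// Flatten every entry into lines prepended to each session's prompt.
function memoryToPrompt(mem) {
  return Object.entries(mem)
    .flatMap(([kind, items]) => items.map((item) => `[${kind}] ${item}`))
    .join('\n');
}

const prompt = memoryToPrompt(memory);
```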
Voice-controlled UI. "Make yourself smaller." "Show cards less often." "Hide the romaji." Every UI setting is a voice command. No preferences panel exists.
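Mechanically, that can be a small table of utterance patterns mapped to setting mutations. The phrases and setting keys here are illustrative, not Samuel's real ones:

```javascript
// Illustrative settings object and command table, not Samuel's actual keys.
const settings = { scale: 1.0, cardFrequency: 'normal', showRomaji: true };

// Each entry pairs a recognizer with the mutation it applies.
const commands = [
  [/make yourself smaller/i, (s) => { s.scale = Math.max(0.5, s.scale - 0.25); }],
  [/show cards less often/i, (s) => { s.cardFrequency = 'low'; }],
  [/hide the romaji/i, (s) => { s.showRomaji = false; }],
];

function applyVoiceCommand(utterance) {
  for (const [pattern, apply] of commands) {
    if (pattern.test(utterance)) {
      apply(settings);
      return true;
    }
  }
  return false; // fall through to the normal conversation path
}

applyVoiceCommand('Hide the romaji');
```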
Architecture Overview
The flow is: wake word → Realtime API → tool dispatch → voice response, with a parallel perception loop running continuously in the background — screen capture every 20 seconds, system audio transcription, and a triage classifier deciding what's worth interrupting you for. Plugin state, secrets, and memory are all local. Nothing goes to a cloud storage layer.
Six models in total:
| Model | Purpose | Latency |
|---|---|---|
| OpenAI Realtime API | Voice conversation | ~500ms |
| GPT-4o Vision | Screen scanning | ~3–5s |
| GPT-4o-mini | Triage, annotation, plugin generation | ~1s |
| GPT-5.4 Computer Use | Visual UI navigation | ~5–10s/turn |
| gpt-4o-mini-transcribe | Wake word + ambient audio | ~1s |
| gpt-4o-transcribe | Recording mode | ~3–10s |
The stack is Tauri v2 (Rust + WebView) for the desktop layer, React 19 + Vite on the frontend, and @openai/agents for the agent framework.
What Doesn't Work Yet
Being honest about the rough edges:
- macOS only — ScreenCaptureKit and Peekaboo are Apple-specific. Cross-platform support requires a different screen/audio capture layer.
- GPT-5.4 access required for Computer Use (Apple Books navigation). If you don't have it, that feature won't work.
- Plugins are JS only — dynamic plugins can call web APIs but can't add native macOS capabilities. For that you still need a rebuild.
- Always-on costs — ambient mode costs roughly $0.02–0.05/min. That's real money if you leave it running all day.
- LRCLIB coverage — not all songs have synced lyrics.
Try It
The repo is open source (MIT): github.com/sambuild04/reading-ai-agent
You'll need macOS 14+, Node.js 20+, Rust, and an OpenAI API key with Realtime API access.
The runtime self-modification pattern is genuinely underexplored — I'd love to hear from anyone who's thought about this differently, especially around sandboxing approaches and session management in long-running Realtime API agents. Issues and PRs welcome.
Built by Sam Feng — solo project, work in progress.