hyunjun
I Built a macOS J.A.R.V.I.S. That Turns Voice Commands Into Search, Media, and Visual Knowledge Maps

For the last few days, I’ve been building something I originally treated like a fun side project:

A personal J.A.R.V.I.S.-style desktop assistant for macOS.

But somewhere in the middle of building it, it stopped feeling like a toy.

It started to feel like a real interface for thinking.

What it does

The core idea is simple:

You speak naturally, and the assistant can:

  • understand your command in real time
  • search the web
  • open media like YouTube
  • generate draggable result blocks
  • connect those blocks visually with lines
  • analyze relationships between them
  • respond with short voice acknowledgements and full text explanations

So instead of a normal chatbot UI, where everything collapses into one long scrolling conversation, the results become a kind of visual operating surface.

You don’t just ask.
You map.

Why I built it

Most AI assistants still feel like one of these two things:

  1. a chat window
  2. a voice layer on top of a normal app

I wanted something different.

I wanted an interface that felt closer to:

  • a command bridge
  • a visual intelligence desk
  • a block-based research surface
  • a system that can collect, arrange, and connect information spatially

Basically, I wanted something that feels less like “chatting with AI” and more like operating intelligence.

Current features

Here’s what currently works in my build:

1. Real-time voice command input

I can speak to the app naturally and it recognizes commands like:

  • play a song on YouTube
  • search the web for a topic
  • open news results
  • analyze relationships between generated blocks
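
The command side can be sketched as a small intent matcher that maps a transcribed phrase to an action. This is a minimal illustration, not the app's actual code — the `Intent` names, `parseCommand` function, and regex patterns are all assumptions:

```typescript
// Minimal intent matcher for transcribed voice commands (illustrative).
type Intent = "play_media" | "web_search" | "news_search" | "analyze" | "unknown";

interface Command {
  intent: Intent;
  query: string;
}

// Ordered keyword patterns; the first match wins.
const patterns: Array<[RegExp, Intent]> = [
  [/^play (.+?)(?: on youtube)?$/i, "play_media"],
  [/^search (?:the web )?for (.+)$/i, "web_search"],
  [/^open news (?:results )?(?:about |for )?(.*)$/i, "news_search"],
  [/analyz/i, "analyze"],
];

function parseCommand(transcript: string): Command {
  const text = transcript.trim();
  for (const [re, intent] of patterns) {
    const m = text.match(re);
    if (m) return { intent, query: (m[1] ?? "").trim() };
  }
  return { intent: "unknown", query: text };
}
```

In practice the real recognizer is fuzzier than a regex table, but the shape is the same: speech → text → intent + query → action.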

2. Web and news search

The assistant can pull search results and create visual result panels.

3. Media blocks

It can open media/video results and place them in the workspace.

4. Draggable UI blocks

Search results, analysis notes, and media blocks can be dragged around the screen.
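
The drag math behind this is simple enough to sketch as a pure function: given a pointer delta, compute the block's new top-left, clamped so it stays inside the workspace. The names here are hypothetical; in the renderer this would be wired to `pointerdown`/`pointermove`/`pointerup` events:

```typescript
// Pure drag math: new block position from a pointer delta,
// clamped so the block never leaves the workspace (illustrative).
interface Point { x: number; y: number }
interface Size { w: number; h: number }

function dragBlock(start: Point, delta: Point, block: Size, workspace: Size): Point {
  const clamp = (v: number, lo: number, hi: number) => Math.min(Math.max(v, lo), hi);
  return {
    x: clamp(start.x + delta.x, 0, workspace.w - block.w),
    y: clamp(start.y + delta.y, 0, workspace.h - block.h),
  };
}
```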

5. Connection lines between blocks

Blocks can be connected visually, which turns the UI into something closer to a spatial reasoning tool than a normal assistant.
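
A connection line boils down to geometry: take the two blocks' rectangles, find their centers, and feed the endpoints to something like an SVG `<line>`. A minimal sketch, with hypothetical names:

```typescript
// Endpoints of a center-to-center connection line between two blocks,
// ready for an SVG <line x1 y1 x2 y2> (illustrative).
interface Rect { x: number; y: number; w: number; h: number }

function connectionLine(a: Rect, b: Rect) {
  const center = (r: Rect) => ({ x: r.x + r.w / 2, y: r.y + r.h / 2 });
  const p1 = center(a);
  const p2 = center(b);
  return { x1: p1.x, y1: p1.y, x2: p2.x, y2: p2.y };
}
```

Recomputing these endpoints on every drag is what keeps the lines "attached" as blocks move.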

6. Relationship analysis

Once multiple blocks exist, the assistant can analyze how they relate and explain the connection in text.
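
One way this step can work is to pack every block's content into a single analysis prompt for the model. The prompt wording and the `Block` shape below are assumptions, not the app's real schema:

```typescript
// Sketch: serialize workspace blocks into one relationship-analysis
// prompt for an LLM (prompt wording is an assumption).
interface Block { id: string; kind: string; title: string; summary: string }

function buildAnalysisPrompt(blocks: Block[]): string {
  const items = blocks
    .map((b, i) => `${i + 1}. [${b.kind}] ${b.title}: ${b.summary}`)
    .join("\n");
  return [
    "You are analyzing items laid out on a visual workspace.",
    "Explain how the following items relate to each other:",
    items,
  ].join("\n\n");
}
```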

7. Hybrid response style

I found that long natural voice responses slowed the workflow down too much, so I changed the response model:

  • voice output = short acknowledgements like “Got it”, “On it”, “Done”
  • main explanation = text inside the interface

That made it feel much faster and more usable.
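
The split itself can be expressed as one tiny function: pick a short spoken acknowledgement for the current stage, and route the full explanation to the interface. A minimal sketch, assuming hypothetical names:

```typescript
// Hybrid response: short spoken ack + full text for the UI.
// The ack phrases mirror the ones used in the post (illustrative).
interface HybridResponse { spoken: string; text: string }

type Stage = "received" | "working" | "finished";

const acks: Record<Stage, string> = {
  received: "Got it.",
  working: "On it.",
  finished: "Done.",
};

function hybridResponse(fullText: string, stage: Stage): HybridResponse {
  return { spoken: acks[stage], text: fullText };
}
```

The `spoken` string goes to text-to-speech; the `text` string is rendered inside the relevant block.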

Tech direction

I’m building this as a macOS desktop app, not just a web toy.

Current stack and direction are roughly:

  • macOS desktop app
  • Electron-based app structure
  • real-time voice input
  • web/media interaction layer
  • draggable block UI
  • visual linking and relationship analysis
  • account/payment/download launcher site for distribution

I also built a landing page + auth flow + payment flow + downloadable .dmg access system around it, so it’s no longer just a local experiment.

What surprised me

The biggest surprise was this:

The UI itself changed how the assistant feels.

A normal assistant returns text.

This one makes it feel like information is being laid out in space.
That changes the whole experience.

The moment I added:

  • draggable blocks
  • connecting lines
  • floating analysis panels

…it stopped feeling like “an AI feature” and started feeling like a system.

What still feels unfinished

A lot.

The biggest issues right now are not raw functionality, but product friction:

  • unsigned macOS build friction
  • download trust issues
  • setup complexity
  • BYOK (bring-your-own-key) onboarding
  • deciding whether this should stay a power-user tool or become a polished mainstream app later

Right now I’m intentionally leaning toward builders / developers / AI power users first, not general consumers.

Product question I’m thinking about

I don’t think this is a mass-market assistant yet.

It feels more like a tool for people who want:

  • visual research
  • command-driven exploration
  • AI-assisted knowledge mapping
  • a more cinematic / operational interface for information work

So I’m currently asking myself:

Is this a niche but powerful tool for builders?
Or the beginning of a very different kind of AI desktop product?

What I’d love feedback on

I’d really love feedback from people here on these points:

  1. Does this feel like a real product category, or still a cool demo?
  2. Is the visual block + connection model actually useful, or just visually impressive?
  3. If you saw this as a developer/power user, what would be the first real use case you’d expect?
  4. Would you rather use this as:
    • a research workspace
    • a voice-controlled browser layer
    • a personal AI operations HUD
    • something else entirely?

Final thought

I started by trying to build “my own J.A.R.V.I.S.”

What I’m actually building now might be closer to:

a voice-controlled visual AI workspace for macOS.

And honestly, that feels more interesting.

If people want, I can post a follow-up with:

  • architecture decisions
  • interaction design choices
  • what worked / what broke during debugging
  • and how I handled the voice + visual response split
