Jackson Ly

Posted on Jun 29

Why I built my Mac assistant to run 100% on-device (and what local-first actually cost me)

#buildinpublic #showdev #privacy #ai

I'm building a proactive personal assistant for the Mac called recal. It watches how I work, learns my patterns, and starts doing the repetitive busywork for me. The one constraint I refused to bend on: it runs entirely on-device. No cloud, no account, no server. Zero bytes of my activity leave the machine.

This is the honest engineering version of that decision: why I made it, what it bought me, and what it genuinely cost. I'm the founder, building in public, and the product is pre-launch, so this is about the architecture, not a sales pitch.

Privacy by architecture, not by policy

Every AI productivity tool I tried wanted the same thing: my screen, my files, my activity history, living on someone else's server. For a tool whose entire job is to watch how I work all day, "trust our privacy policy" was never going to be enough. A policy is a promise. Architecture is a guarantee.

If the data never leaves the device, there is no server to breach, no account to leak, no policy to quietly change next quarter. You can pull the ethernet cable and it still works. That is a fundamentally different security model than "we encrypt it in transit."

The catch is that you have to actually build for it, and "local-first" stops being a marketing word the moment you hit the real engineering.

Capture is the scariest part, so it is default-deny

A tool that observes your work is one bad decision away from being spyware. The Microsoft Recall backlash made that vivid for everyone. So the capture layer is built default-deny: it does not record unless a specific, allowed signal says it is safe to.

The model I kept coming back to is Helen Nissenbaum's contextual integrity: information is not simply private or public, it is appropriate or inappropriate to a context. A password field, a banking tab, and an incognito window are all contexts where capture is never appropriate, no matter how useful the data would be. So sensitivity is a first-class tag on every observed event, decided at capture time, and anything uncertain gets dropped rather than stored. The expensive, paranoid default is the correct one here.
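A minimal sketch of what default-deny capture can look like, to make the rule concrete. This is not recal's actual code: the event fields, the role names, and the allowlists are hypothetical, chosen only to show sensitivity being tagged at capture time and everything uncertain being dropped.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Sensitivity(Enum):
    SAFE = "safe"
    SENSITIVE = "sensitive"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ObservedEvent:
    app: str               # hypothetical: identifier of the frontmost app
    field_role: str        # hypothetical: role of the focused UI element
    private_window: bool   # hypothetical: True for incognito/private windows
    text: str


# Hypothetical allowlists: only contexts listed here can ever be captured.
ALLOWED_APPS = {"com.example.editor", "com.example.notes"}
ALLOWED_ROLES = {"text_area", "document"}


def classify(event: ObservedEvent) -> Sensitivity:
    """Tag sensitivity at capture time; unlisted contexts are UNKNOWN, never SAFE."""
    if event.private_window or event.field_role == "secure_text_field":
        return Sensitivity.SENSITIVE
    if event.app in ALLOWED_APPS and event.field_role in ALLOWED_ROLES:
        return Sensitivity.SAFE
    return Sensitivity.UNKNOWN


def capture(event: ObservedEvent) -> Optional[Tuple[Sensitivity, ObservedEvent]]:
    """Default-deny: keep only SAFE events; drop sensitive and uncertain ones."""
    tag = classify(event)
    if tag is Sensitivity.SAFE:
        return (tag, event)
    return None
```

The point of the shape is that there is no "store it and filter later" path: an event from an app nobody has listed is dropped in the same way as a password field.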

Local retrieval without a cloud index

The fun objection to local-first AI is "but the good models live in data centers." True, and I am not pretending a Mac runs a frontier model. But most of what a personal assistant needs is not a giant model, it is your own context, retrievable fast.

That part runs entirely on-device: content gets embedded with a small ONNX model running locally, and retrieval is a mix of embeddings, recency, and plain filters. For personal re-finding ("what was I looking at last Tuesday before the meeting"), that combination beats a cloud round-trip, and it never ships your life to an index you do not control. The thing that surprised me: you need far less model than the industry implies, as long as you are retrieving over your own data instead of trying to reason about the whole world.
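As a rough illustration of mixing embeddings, recency, and plain filters, here is a small scoring sketch. The weights, the half-life, and the `Item` fields are assumptions made for the example, and the embeddings are hand-written vectors standing in for the output of a local model.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    embedding: list   # assumed to come from a local embedding model
    age_days: float   # time since the item was observed
    app: str


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, items, app=None, half_life_days=7.0, recency_weight=0.3, k=3):
    """Filter first, then rank by a weighted blend of similarity and recency."""
    scored = []
    for item in items:
        if app is not None and item.app != app:   # plain filter
            continue
        # Exponential decay: 1.0 for brand-new items, 0.5 after one half-life.
        recency = 0.5 ** (item.age_days / half_life_days)
        similarity = cosine(query_vec, item.embedding)
        score = (1 - recency_weight) * similarity + recency_weight * recency
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```

With these illustrative weights, an exact match from ten weeks ago still outranks a fresh but unrelated item, while two equally similar items are ordered by recency.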

The part I care about most: it proposes, I approve

"It did something on its own" is terrifying the first time it is wrong. So recal never acts on its own. It does the work in the background, then surfaces a finished result for one tap: approve, or do not.

That approve-gate is not a UX afterthought, it is the trust contract. It keeps a human as the one deciding while the machine does the doing. It also makes a wrong proposal cheap (you reject it and move on) instead of catastrophic (it already emailed the wrong person). Building the gate first, before the autonomy, was the single best product decision I made.
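One way to picture the gate in code: the background work produces a proposal that wraps the side effect, and only an explicit approval ever calls it. The class and method names below are hypothetical, a sketch of the pattern and not recal's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    summary: str                  # what the user sees before deciding
    execute: Callable[[], str]    # the side effect, deferred until approval
    status: Status = Status.PENDING


@dataclass
class ApprovalGate:
    queue: List[Proposal] = field(default_factory=list)

    def propose(self, summary: str, execute: Callable[[], str]) -> Proposal:
        """Background work ends here: the proposal is queued, nothing runs."""
        proposal = Proposal(summary, execute)
        self.queue.append(proposal)
        return proposal

    def approve(self, proposal: Proposal) -> str:
        """The only code path that triggers the side effect."""
        if proposal.status is not Status.PENDING:
            raise ValueError("proposal already decided")
        proposal.status = Status.APPROVED
        return proposal.execute()

    def reject(self, proposal: Proposal) -> None:
        """Rejecting is cheap: the side effect never ran."""
        if proposal.status is not Status.PENDING:
            raise ValueError("proposal already decided")
        proposal.status = Status.REJECTED
```

Because `execute` is only reachable through `approve`, a wrong proposal costs one rejection and nothing else.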

What it cost

Being honest, because that is the point of building in public.

No giant cloud model to lean on. Anything that genuinely needs frontier-scale reasoning has to be designed around, deferred, or made opt-in, and you feel that constraint daily.
Capture you can trust is slow to build. Default-deny means you are constantly saying no to data you would love to have, and writing the rules that decide "appropriate" is most of the work.
You carry the whole stack. There is no server to hotfix. The observation engine, the index, and the agent all ship to the user's machine in Swift and Rust, and a bug is a bug on their hardware, not a deploy you roll back.

Would I do it again? Yes, without hesitation. The cost is real, but what I get back is a tool I would actually trust to watch my whole working life, which is the only kind worth building.

If the on-device, local-first direction resonates, I am building recal in public and there is a waitlist at recal.so. But mostly I wanted to write the architecture down honestly, because I think more personal software should be built this way, and I would genuinely like to hear how others are drawing the same lines.

Top comments (9)

UnitBuilds • Jun 29

You should look at MCP-Lite on my git (there's also an old post on it). If you want your agent to use a browser, it's pretty good... Switching DOM scraping for AOM traversal makes a huge performance difference and improves your signal-to-noise.

Jackson Ly • Jun 30

Appreciate the pointer, I'll take a look. The AOM-over-DOM point matches what I've seen: the accessibility tree is a much cleaner signal than raw DOM because it's already collapsed to roles and labels, so you skip most of the markup noise. Funny timing too, I've been doing a lot of browser-driving lately and the gap between parsing the whole DOM and reading the a11y tree is basically the difference between fighting the page and working with it. Thanks for reading.

UnitBuilds • Jun 30

Exactly, AOM on sites like Git and Amazon provides clear, semantically labeled objects, whereas the DOM is 400k tokens' worth of HTML. It's night and day to your wallet when using cloud models.

Jackson Ly • Jul 1

Right, and that wallet point is the whole reason I went on-device in the first place. When every token is metered you feel a 400k-token DOM dump immediately; running locally on your own machine, a noisy context just costs you a bit of latency, not a bill. AOM keeping the signal tight is a nice win either way, cloud or local.

UnitBuilds • Jul 1

Running local, you still feel the context hit though. Chances are you aren't saving 400k tokens for a page: it'd fill your context window entirely, so a large chunk is just discarded when you run locally. Cloud models tend to use a sub-agent, which just runs the page, parses it, and returns the relevant data to the main agent, which means you pay for the 400k disposably, not cached.

Jackson Ly • Jul 1

Yeah, fair correction, and that's the sharper version of the point. Locally the currency isn't dollars, it's context budget and latency, but a 400k-token page will blow a local context window long before it blows your wallet, so the constraint is real either way.

Which is exactly why the sub-agent isolation pattern you're describing matters more locally, not less. You want a disposable worker whose whole job is: run the page, traverse the AOM, hand back the distilled semantic objects, then throw its context away. The main agent never sees the 400k of markup. AOM is what makes that worker's output small enough to return cheaply, so local + AOM + a throwaway sub-agent context is the combination that actually holds.

The one distinction I'd draw: cloud makes you pay for that 400k disposably per run, local makes you pay in tokens-per-second and a context window you have to be disciplined about. Same architecture, different meter. But you're right that "it's free locally" is too glib. The real win is not polluting the main context, and that's independent of where the model runs.
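A toy sketch of that disposable worker, assuming the accessibility tree arrives as nested dicts with `role`, `name`, and `children` keys. That shape, the `KEPT_ROLES` set, and the function names are illustrative assumptions, not a real browser or OS API.

```python
KEPT_ROLES = {"button", "link", "textbox", "heading"}  # illustrative subset


def distill(tree):
    """Walk the tree in document order and keep only labeled nodes we care about."""
    kept, stack = [], [tree]
    while stack:
        node = stack.pop()
        role = node.get("role", "")
        name = node.get("name", "").strip()
        if role in KEPT_ROLES and name:
            kept.append({"role": role, "name": name})
        # reversed() preserves document order when using a LIFO stack
        stack.extend(reversed(node.get("children", [])))
    return kept


def run_worker(load_tree):
    """Disposable worker: the raw tree exists only inside this call."""
    raw = load_tree()      # large and noisy; never returned to the caller
    return distill(raw)    # small and semantic; all the main agent ever sees
```

The main agent calls `run_worker` and receives a short list of role/name pairs; unlabeled wrappers and empty links never reach its context.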

VoltageGPU • Jul 1

Interesting take on local-first AI. I've been working on on-device ML inference for secure workloads, and the trade-offs between performance and privacy are always tricky. Have you considered leveraging Apple's Neural Engine more aggressively for lower power consumption during continuous monitoring? I've seen similar approaches with VoltageGPU for edge workloads.

Jackson Ly • Jul 2

Yes, and the honest answer is that you don't get to be as aggressive with the ANE as you'd like. You compile to Core ML and the runtime decides placement, so the real work happens at export time: fp16, static shapes, ANE-friendly ops. One unsupported op and a chunk of the graph silently falls back to GPU or CPU, and the power budget changes without any API telling you loudly.

For continuous monitoring specifically, the scheduler turned out to be a bigger power lever than raw ANE efficiency. The pattern that works for me is a cascade: a tiny always-on gate (cheap embeddings and heuristics) decides whether anything interesting happened, and the heavier models only wake on candidate events. Duty-cycling the expensive tier saves more than any amount of graph tuning on it. The embedding tier on the ANE at fp16 is nearly free; it's the generative tier you have to keep event-driven.
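The cascade can be sketched in a few lines. The gate below is a hypothetical word-difference heuristic standing in for the cheap embeddings and heuristics mentioned above, and `heavy` is any callable standing in for the expensive tier.

```python
from typing import Callable, Iterable, List, Tuple


def run_cascade(events: Iterable[str],
                gate: Callable[[str], bool],
                heavy: Callable[[str], str]) -> Tuple[List[str], int]:
    """Run the cheap gate on every event; wake the heavy tier only on candidates."""
    results, heavy_calls = [], 0
    for event in events:
        if not gate(event):    # always-on tier: cheap, sees every event
            continue
        heavy_calls += 1       # expensive tier: event-driven, never polled
        results.append(heavy(event))
    return results, heavy_calls


def changed_enough(threshold: int) -> Callable[[str], bool]:
    """Hypothetical gate: fire when the event differs enough from the previous one."""
    last = {"text": ""}

    def gate(event: str) -> bool:
        # Count words present in exactly one of the two events.
        diff = len(set(event.split()) ^ set(last["text"].split()))
        last["text"] = event
        return diff >= threshold

    return gate
```

Feeding it a stream where the same event repeats shows the duty cycle: the heavy tier runs once per change, not once per tick.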

Jackson Ly • Jul 2

yeah, and that's the part people miss about local: you can't let the main model eat the 400k, so you're forced to build the sub-agent tier yourself instead of the cloud hiding it for you. the pattern that survives locally is the same one you're describing, a cheap pass (AOM traversal, or bm25/embedding over the chunks, or a tiny extractor model) reads the 400k and hands back the 2k that matter, and the main context only ever sees the 2k.

the twist is the economics flip. in the cloud that disposable 400k pass is real money every single time. local, the discard is just compute you already paid for in electricity, not per-token, so eating a big throwaway pass is actually cheaper on-device. the constraint becomes latency and context window, not dollars. so local doesn't remove the sub-agent, it just moves the bill from cash to milliseconds and makes you own the harness.

which is kind of the whole local-first trade in one example: you give up the managed convenience and pay in engineering, but the marginal run is basically free and private.
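For the BM25 flavor of that cheap pass, a minimal sketch might look like this. It scores chunks with Okapi BM25 over whitespace tokens and keeps the best ones until a budget is spent. The whitespace tokenizer, the word-count token estimate, and the function names are simplifying assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25 over whitespace tokens."""
    if not chunks:
        return []
    docs = [chunk.lower().split() for chunk in chunks]
    avg_len = (sum(len(d) for d in docs) / len(docs)) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_within_budget(query, chunks, budget_tokens):
    """Keep the best-scoring matching chunks until the token budget is spent."""
    ranked = sorted(zip(bm25_scores(query, chunks), chunks),
                    key=lambda pair: pair[0], reverse=True)
    kept, used = [], 0
    for score, chunk in ranked:
        cost = len(chunk.split())   # crude token estimate: whitespace words
        if score > 0 and used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
    return kept
```

The main context only ever receives what `select_within_budget` returns; the rest of the page is scored once and thrown away.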