DEV Community: GaltRanch

Why I Built My Own AI: The Case for Self-Hosted Domain Agents (Kulvex)

GaltRanch — Thu, 21 May 2026 17:06:34 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

In 2024 I got tired of paying OpenAI to know everything about my house, my conversations, my calendar, my code, and my family. So I built Kulvex AI — a self-hosted AI platform that runs an 80-billion parameter model on a pair of consumer GPUs in my office, exposes 17 domain agents over a private API, and handles everything from Zigbee lights to messaging across eight platforms. Here's why I built it, what it actually does, and where the trade-offs land in 2026.

The premise that broke for me

For about eighteen months I lived inside the OpenAI / Anthropic ecosystem like everyone else. Then a few things piled up at once:

OpenAI deprecated three model versions in a single quarter. Prompts that had been stable for nine months suddenly produced different outputs.
I was paying close to $400/month across personal + work API usage.
A friend got rate-limited mid-presentation by Anthropic.
I started building accessibility software for my dad and clinical software for my kid's therapist. For both, "send the user's voice to OpenAI" was a non-starter. Once I'd done the on-device work for those, the "we have to use cloud" thinking for everything else stopped making sense.

The breaking point was simple: I wanted an AI that would still work in five years exactly the way it works today, on hardware I own, that I could point at any task without asking anyone for permission. That product didn't exist. So I built it.

What Kulvex actually is

Three pieces:

The server — runs on your own hardware (a workstation with at least one 24GB GPU, ideally two). Hosts a quantized 80B-parameter model via llama.cpp, exposes a Socket.IO API for clients, and orchestrates a set of domain agents.
The iOS client (in App Review as of writing) — talks to your Kulvex server over a private endpoint. Apple Foundation Models on-device for low-latency tasks, falls back to the server for anything requiring the 80B model.
Kulvex Code — terminal IDE with 15+ plugins. Sibling product to KCode, but for general dev work instead of security audit.

The 17 domain agents

"AI assistant" is too vague. What Kulvex actually does is split work across specialized agents:

🏠 Home Automation · 📱 Messaging Hub · 📅 Calendar · 📧 Email · 📰 News Curation · 🎵 Music · 🌤️ Weather · 📷 Cameras (Hikvision) · 💡 Lights (Zigbee + Tuya) · 🔌 Energy/Solar · 🍔 Food · 🚗 Vehicle · 📚 Research · 💰 Finances · 📝 Notes · 🛠️ Code · 🧬 Self-Evolution

Each agent has a narrow job, its own toolset, and its own context budget. When the user says "turn off the kitchen lights and tell me what's on the calendar tomorrow," the platform routes to two agents in parallel. Neither agent needs to know about the other.
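
A minimal sketch of that fan-out, assuming each agent is an async callable. The agent functions, the ROUTES table, and the keyword matching are all hypothetical stand-ins; the post does not describe how Kulvex actually classifies a request.

```python
import asyncio

# Hypothetical agents: each one has a narrow job and knows nothing about the others.
async def lights_agent(request: str) -> str:
    await asyncio.sleep(0)  # stands in for a Zigbee/Tuya call
    return "lights: kitchen off"

async def calendar_agent(request: str) -> str:
    await asyncio.sleep(0)  # stands in for a calendar lookup
    return "calendar: 2 events tomorrow"

# Keyword routing is a placeholder for whatever classifier the real router uses.
ROUTES = {
    "lights": lights_agent,
    "calendar": calendar_agent,
}

def pick_agents(request: str):
    text = request.lower()
    return [agent for keyword, agent in ROUTES.items() if keyword in text]

async def handle(request: str) -> list[str]:
    agents = pick_agents(request)
    # gather() runs the selected agents concurrently and returns
    # their results in the order the agents were passed in.
    return await asyncio.gather(*(agent(request) for agent in agents))

def run(request: str) -> list[str]:
    return asyncio.run(handle(request))
```

Calling run("turn off the kitchen lights and tell me what's on the calendar tomorrow") dispatches both stub agents at once; a request that matches no keyword returns an empty list.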

This design made Kulvex tractable for me as a solo developer. A general-purpose "do everything" assistant is hard to make good. Seventeen narrow agents, each replaceable independently, are much easier to ship.

The piece I'm proudest of: self-evolution

One of the 17 agents is called Self-Evolution. Its job is to read Kulvex's own codebase, identify bugs or improvements, write the fix, run the test suite, and, if everything passes, open a pull request so the change can be reviewed, merged, and deployed.

Three guardrails make this safe:

Sandboxed worktree. Changes apply to a clone, tested in isolation, only merged if CI passes.
I review every commit. Agent opens PRs against my GitHub; nothing lands without human approval. No write access to main.
Scope bounded by directive files. Agent reads OWNER-DIRECTIVES.md at the repo root, which tells it what it's allowed to modify (typos, dead code, small refactors, dependency upgrades) and what it's NOT (auth, payments, model selection logic, API surfaces).
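
The scope guardrail can be pictured as a path check that runs before a PR is opened. This is only a sketch: the glob patterns and function names below are hypothetical, and the post does not show the real format of OWNER-DIRECTIVES.md or how the agent parses it.

```python
from fnmatch import fnmatch

# Hypothetical protected globs. The post names the categories (auth, payments,
# model selection logic, API surfaces) but not the actual repository paths.
PROTECTED = ["src/auth/*", "src/payments/*", "src/model_select/*", "src/api/*"]

def out_of_scope(changed_files, protected=PROTECTED):
    """Return the changed files that match any protected pattern."""
    return [path for path in changed_files
            if any(fnmatch(path, pattern) for pattern in protected)]

def may_open_pr(changed_files):
    """A change is in scope only if it touches no protected path."""
    return not out_of_scope(changed_files)
```

A deny-list like this fails closed only for the paths it names, so human review of every PR remains the real backstop.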

In four months: ~340 PRs opened, ~280 merged. Meaningful chunk of the 200K-line codebase maintenance offloaded.

The hardware reality

Self-hosted AI in 2026 means buying GPUs:

RTX 5090 (32 GB VRAM) — primary inference. Runs the 80B model in 4-bit quantization.
RTX 4090 (24 GB VRAM) — secondary. Whisper-large + vision.
96 GB DDR5 system RAM for context cache.
~$5,500 total. Three-year amortization = about $153/month, less than half what I was paying OpenAI.

The math gets better the more you use it. Hardware costs the same whether I run 10 queries or 10,000. After breakeven, marginal cost approaches zero.
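
The amortization and break-even arithmetic can be checked directly. The figures come from this post; electricity and resale value are ignored, which is a simplification.

```python
hardware_cost = 5500   # USD, the total quoted for the parts list
months = 36            # three-year amortization
cloud_spend = 400      # USD/month, the approximate API bill mentioned earlier

monthly_hardware = hardware_cost / months
breakeven_months = hardware_cost / cloud_spend

print(f"amortized: ${monthly_hardware:.2f}/month")            # amortized: $152.78/month
print(f"break-even vs cloud: {breakeven_months:.2f} months")  # break-even vs cloud: 13.75 months
```

At that spend level the hardware pays for itself in a little over a year; at $50/month of API usage the same purchase would take more than nine years to break even.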

For users who don't want to run their own GPU, Kulvex has a hosted "Home" tier where we run the server on shared infrastructure — still no cloud LLM vendor in the loop, model weights pinned, but you're renting compute instead of buying.

What I didn't have to give up

The 80B model on the RTX 5090 produces output that, for 95% of tasks (code review, drafting, knowledge retrieval, agent orchestration), is roughly comparable to GPT-4-class. The frontier (Claude Opus 4.x, GPT-5) still beats it on hard reasoning. For those cases Kulvex's orchestrator can route to Claude or GPT if the user opts in. I rarely turn that on. Privacy + cost + reliability beat marginal capability for my workload.

What I did have to give up

Honest list:

Setup is harder. Kulvex server takes 30-60 minutes to install if you're comfortable with Docker and CUDA. Not great for non-technical users. Hosted tier exists for this.
Cold starts. Idle GPU + first query = 5-10s while model loads back into VRAM. Subsequent queries sub-second.
Image generation. Not bundled. Need DALL-E quality? Fall back to a hosted service. ComfyUI integration on roadmap.
Internet research. Local model has no live web (by design). Agents can fire search to SearXNG or Perplexity. Not as smooth as ChatGPT browsing.

For casual AI users, cloud is still better. Kulvex makes sense once your usage is heavy enough that the privacy + cost + reliability balance flips.

Pricing

Starter — free, hosted, capped.
Home — $19/month hosted, unmetered personal use.
Pro — $49/month hosted + on-prem option.
Enterprise — custom, on-prem with SLA.

The on-prem tier never phones home except for license validation.

The honest part

iOS client in App Review right now (build 4.3.6). Apple rejected previous builds for AI consent flow; resolved.
Real users on Home and Pro exist but small numbers.
Platform mature, product/GTM still finding shape. Biggest gap: install-and-onboard for non-technical users.
I use Kulvex personally every day. That's the proof. Whether it's a business — same as KCode, we'll know in two quarters.

Who this is for

People who already self-host (Home Assistant, Nextcloud, Jellyfin homelabbers).
Privacy/compliance constraints (healthcare, legal, defense, GDPR).
Heavy AI users — $50+/month in API spend.
People who believe intelligence-as-software should be ownable, not rented.

— Bruno Galtranch, founder, AstroLexis LLC. If you self-host or are considering it: contact@astrolexis.space.

Building a Clinical Speech-Therapy App With a Real SLP: 4 Lessons From PhoenixSteps

GaltRanch — Thu, 21 May 2026 14:35:36 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

My son's speech-language pathologist drives all the clinical content for the app I built around her practice. PhoenixSteps is what came out of it: a pediatric clinical app that does what existing apps don't because the protocols, the exercise catalog, and the success criteria come from a working SLP — not from a developer's imagination. Here are four lessons from co-designing it, including how we taught Apple's Vision framework to do something Apple flatly refused to.

How this started

My son has a speech sound disorder. Specifically, rotacismo (rhotacism): he struggles with /r/ and /rr/, which in Spanish are foundational phonemes that show up in roughly one in every six words. His speech-language pathologist is Stefania. We've been seeing her weekly for over a year and the progress has been real, but inconsistent: he'd nail a sound during a session and lose it by mid-week.

The gap was obvious to both of us. He'd do exercises with Stefa for forty minutes, then we'd go home and the exercises mostly stopped, because:

The "drill at home" sheet Stefa sent had no feedback loop. My kid would say "ratón" five times and have no idea if any of them were correct.
Existing pediatric speech-therapy apps in Spanish are either commercially mediocre (gamified versions of basic flashcards) or clinically rigid (built for adult speech rehab, not children).
The market for tools that actually run the clinical exercises a Spanish-speaking SLP would prescribe — with audio feedback, automatic scoring, and progress tracking the therapist can read — basically did not exist for a private practice working with a 4-year-old.

I asked Stefa if she'd want to co-design something. She said yes. That's how PhoenixSteps started — and the four lessons below are the ones I wish I'd known going in.

Lesson 1: A clinical co-creator changes everything about what you ship

I had built consumer iOS apps before. I had not built a clinical tool. The thing I underestimated was how much of the actual product is the protocol, not the software.

Stefa works from named, published clinical protocols — Borrás, Bosch, the AELFA articulation drills. When she prescribes an exercise, she's pulling from a tradition that has decades of consensus on order, dosage, and progression. "Lengua a la nariz" isn't a cute idea — it's Borrás Exercise 29, with specific instructions about duration, repetitions per day, and what to do if the child can't sustain the position.

Before working with Stefa, I would have built a "speech therapy app" that was basically a glorified flashcard deck with cute animations. With Stefa, the exercise catalog became:

Orofacial praxias — 7 exercises pulled directly from her clinical sheet, in the order she actually prescribes them.
R-group syllable warmups — "ra ra ra," "rrrr-on" — building muscle memory before tackling words.
R simple words — rosa, ratón, mira, perro — graded by Stefa for difficulty.
R-cluster words (sinfones) — bra, cra, dra, fra, gra, pra, tra. The hard ones.
Minimal pairs — R/RR, R/L, D/R, T/D. Auditory discrimination drills.
Carrier phrases — embedding the target sound in real sentences.
"Tren de la Risa" — a karaoke song Stefa wrote that hits every R context across 8 verses.

None of that comes out of an engineer's imagination. It comes out of a working SLP's notebook.

Lesson 1 distilled: if you're building a clinical product, the clinician is not a "domain advisor." They're the architect of what the product actually does. Compensate them properly, give them a real voice on the roadmap, and make their clinical judgment the spine of every feature.

Lesson 2: Apple won't give you what your patient needs. Build it yourself.

This is the technical story, and it's the one I'm most proud of.

One of the most prescribed praxias for kids working on /r/ is "lengua a la nariz" — extending the tongue tip toward the nose. The exercise builds the lingual elevation needed for the alveolar trill. Stefa wants the app to automatically verify the kid did the exercise correctly: tongue out, pointed up, sustained for 10 seconds.

This sounds like a job for ARKit. Apple has had face tracking with the TrueDepth camera since the iPhone X. ARFaceAnchor.blendShapes includes jawOpen, mouthSmileLeft, cheekPuff — and yes, tongueOut.

Except: tongueOut is a scalar. It's 0 when the tongue is in, and 1 when it's out. Apple does not tell you where the tongue is pointing. Up, down, left, right: they all read identically.

I emailed Apple developer support. The answer was: no, the tongue is not modeled as 3D geometry, and there's no API to detect tongue direction. Tongue tracking is inherently unstable (occlusion by teeth and lips), so Apple chose not to ship something they couldn't validate at Face ID precision.

So Stefa and I built the detector ourselves.

The pipeline

ARKit captures the camera frame on the TrueDepth camera at 60 fps.
We grab the raw frame.capturedImage — the YUV pixel buffer ARKit hands you for free.
Vision detects face landmarks: VNDetectFaceLandmarksRequest returns outerLips, innerLips, and nose as 2D polygons.
We define three regions of interest (ROIs) outside the lip polygon:
- UP ROI — rectangle between top of upper lip and bottom of nose
- LEFT ROI — extending leftward from the left corner of the lips
- RIGHT ROI — same, mirrored
Count pink/red pixels inside each ROI. The lip-skin transition is at Cr ≈ 18; the tongue is at Cr ≈ 25-50. We threshold Cr > 25 to filter out facial skin and pale lips.
If a ROI has > 400 "tongue-colored" pixels, the tongue is projecting in that direction. Cross-check with ARKit's tongueOut blendshape, mirror-compensate for the front-facing camera.
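
The counting and thresholding steps above can be sketched in a few lines. This is an illustration in Python, not the app's Swift code. It assumes the Cr values are signed offsets from the neutral chroma value, that the ROIs arrive as pixel rectangles, that a blendshape value under 0.5 means "tongue not out", and that an image-left ROI maps to the user's right on a mirrored front camera. The function names and the "center" fallback are hypothetical.

```python
CR_THRESHOLD = 25   # from the post: Cr > 25 counts as tongue-colored
MIN_PIXELS = 400    # from the post: more than 400 such pixels in an ROI

def count_tongue_pixels(cr_plane, roi):
    """cr_plane: 2D list of Cr offsets; roi: (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = roi
    return sum(1 for row in cr_plane[y0:y1] for cr in row[x0:x1]
               if cr > CR_THRESHOLD)

def tongue_direction(cr_plane, rois, tongue_out, mirrored=True):
    """rois: dict of 'up', 'left', 'right' rectangles; tongue_out: blendshape 0..1."""
    if tongue_out < 0.5:  # assumed cut-off for the ARKit cross-check
        return "notVisible", 0
    counts = {name: count_tongue_pixels(cr_plane, roi)
              for name, roi in rois.items()}
    name, pixels = max(counts.items(), key=lambda item: item[1])
    if pixels <= MIN_PIXELS:
        return "center", pixels  # tongue is out but not projecting into any ROI
    if mirrored and name in ("left", "right"):
        # Front camera: swap sides so the label matches the user's own left/right.
        name = "right" if name == "left" else "left"
    return name, pixels
```

Counting pixels against a fixed threshold is cheap enough to run on every sampled frame, which is presumably why a color heuristic was workable here without training a model.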

The detector reports up, down, left, right, center, or notVisible at 20 Hz with a confidence score. The first time I showed Stefa the demo (me sticking my tongue toward my nose and watching the screen say "ARRIBA conf 100% pix 3,974", ARRIBA meaning "up"), she didn't believe it was real until I sent her the source code.

Lesson 2 distilled: the most defensible technical work in a clinical product is the part Apple won't ship. If you can do something the platform doesn't expose — and it matters for the clinical outcome — that's your moat.

Lesson 3: Audio quality is a feature, not a detail

PhoenixSteps ships with about 325 pre-recorded voice prompts, all generated using OpenAI's gpt-4o-mini-tts with the "nova" voice. Why pre-recorded TTS instead of letting iOS synthesize on the fly?

Pediatric voice consistency. Kids learn faster when the audio prompt sounds the same every time.
Speed and articulation. Stefa wanted slower-than-normal pronunciation for warmups, regular pace for practice, a specific cadence for the song. Generating with explicit instructions ("habla en español neutro latinoamericano, ritmo lento y articulado, énfasis infantil sin caricaturizar", roughly "speak neutral Latin American Spanish, slow and articulated, with a child-friendly emphasis that doesn't turn cartoonish") gets us the exact register a real SLP would use.
Reliability. Pre-recorded audio works offline, doesn't depend on a phone's TTS pipeline being up, doesn't get interrupted by Siri.

We learned the hard way that the OpenAI API will occasionally return a truncated mp3 (we caught three files at 0.36s when they should have been 1.2s). The fix was a post-generation validation step: every newly generated mp3 has to pass a minimum-duration check.
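
A sketch of what such a validation step might look like. It assumes the durations have already been measured by some audio tool and that each prompt has a rough expected length; the function name, the file names in the usage note, and the 0.8 ratio are hypothetical.

```python
def find_truncated(durations, expected, min_ratio=0.8):
    """Return file names whose measured duration is below min_ratio * expected.

    durations: {file name: measured seconds}
    expected:  {file name: seconds the prompt should roughly last}
    """
    return sorted(name for name, seconds in durations.items()
                  if seconds < min_ratio * expected[name])
```

With durations {"raton.mp3": 0.36, "rosa.mp3": 1.18} and an expected length of 1.2 s for both, only raton.mp3 is flagged for regeneration.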

Lesson 3 distilled: for pediatric/clinical apps, audio is content. Pre-render every prompt with a consistent voice and pace. Validate audio duration before bundling.

Lesson 4: HIPAA-equivalent privacy isn't optional

The users of PhoenixSteps are children. Their voice recordings and progress data are protected health information.

Speech recognition on-device (WhisperKit). Voice never leaves the iPhone.
Face tracking on-device (ARKit + Vision).
Progress data in SwiftData, syncing to family's private iCloud.
No analytics, no third-party SDKs, no Crashlytics, no Facebook Pixel.
AI features gated by parental consent. Apple Foundation Models on-device, opt-in.

PhoenixSteps will never have a data breach involving children's voice samples, because there's no centralized data to breach.

Lesson 4 distilled: if you're building anything where the user is a minor or a patient, design as if the audit is happening tomorrow.

Where PhoenixSteps is right now

Not in the App Store yet. Build 28. Finishing the clinical pilot with Stefa.
Spanish-first. English localization on the roadmap once the clinical content is validated by an English-speaking SLP.
Free for parents, with an optional Pro tier for clinicians.
Stefa drives the clinical content. The exercise catalog, the protocols, the cadence, the success criteria — all hers. The app exists because she said yes to co-designing it.

If you're an SLP working with pediatric patients in Spanish, write us. We're going to add more clinical advisors as the product matures: contact@astrolexis.space.

— Bruno Galtranch, founder, AstroLexis LLC. Clinical content by Stefania, SLP.

Live Captions Without Sending Your Voice to the Cloud: Building ClearCaps

GaltRanch — Thu, 21 May 2026 14:27:21 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

My dad started losing his hearing about five years ago. Not catastrophically — just enough that family dinners turned into "what did she say?" and TV got a little louder every month. Off-the-shelf captioning apps existed but every single one required uploading audio to a vendor's cloud. For private family conversations, medical appointments, work calls — that wasn't going to fly. So I built ClearCaps. Here's the founder story and the technical pieces that make on-device live captioning actually work in 2026.

The motivating problem

Hearing loss is one of the most common chronic conditions on the planet. The WHO estimates over 430 million people worldwide live with disabling hearing loss — and that number is rising as the population ages. Most of them are not deaf; they hear, just less reliably. Sound gets muddier. Speech gets harder to parse, especially in noisy environments. Conversations become exhausting in a way that's invisible to anyone who hasn't experienced it.

The existing accessibility stack on iOS is genuinely good. Apple's Live Captions (built into iOS 16+) work in many contexts. Speech-to-text apps abound. But almost all of them have the same architecture: capture audio, send it to a server, get back text. For someone with hearing loss, this works fine in casual settings. It does not work for:

Medical appointments. HIPAA-protected health information, often deeply personal.
Therapy sessions. Same reasoning, plus the person on the other side might object to being recorded by a cloud service.
Family conversations. Nobody wants a vendor harvesting their kid's voice or their spouse's medical complaints.
Work meetings under NDA. The lawyer didn't sign off on routing audio through someone else's datacenter.
Anywhere there's no internet. Buses, trains, basements, planes, rural areas.

The market for "live captions that respect your privacy" was — for years — basically non-existent. The reason was technical: doing speech recognition well on a phone, in real time, with speaker identification and translation, wasn't feasible. The models were too big, the CPUs too slow, the batteries too weak. In 2026 that ceiling lifted.

What changed: on-device ASR finally got good

Three independent pieces of technology converged to make this viable on an iPhone:

WhisperKit. Argmax's optimized port of OpenAI's Whisper to the Apple Neural Engine. Whisper-small (about 244M parameters) runs in real time on any iPhone with an A14 or newer. Whisper-base is even faster. The accuracy is strikingly good, better than most cloud APIs for accented English and major non-English languages.
Apple's Translation framework. Built into iOS 17.4+, fully on-device, supports 10+ languages including English ↔ Spanish, Portuguese, French, German, Mandarin, Japanese, Korean. Latency is sub-second per sentence.
Pyannote speaker diarization, ported to Core ML. The piece that took the longest to get right.

None of these are mine. The work was integrating them — making them run together on a single iPhone, in real time, with low enough latency that the captions actually keep up with the conversation, without melting the battery in 20 minutes.

The architecture

ClearCaps splits the computation across every accelerator the chip has:

Apple Neural Engine (ANE): Whisper-small for automatic speech recognition. Runs exclusively on the ANE so it doesn't fight the GPU for memory bandwidth.
GPU: Pyannote embedder for speaker diarization. The embedder produces 256-dim vectors for short audio chunks; the GPU is the right place because the operations are big matmuls.
Audio DSP block: noise suppression, automatic gain control, acoustic echo cancellation. Apple's built-in voice processing, hardware-accelerated, doesn't touch ANE or GPU.
CPU: Pyannote segmenter, clusterer, voice activity detection, audio resampling, and SwiftUI rendering.

The split matters because if you naively run everything on the GPU, you bottleneck on memory bandwidth before you bottleneck on compute. By splitting across ANE + GPU + DSP, the chip's actual peak throughput becomes accessible. An iPhone 15 Pro or newer handles the full pipeline (ASR + diarization + UI) at ~30% CPU and ~15W package power. That's about half what watching a YouTube video draws.

The hard part: speaker diarization on-device

Automatic speech recognition has been a solved problem for cloud services since 2022 and for on-device since Whisper-small dropped. Diarization — figuring out who is speaking at any given moment — is much less mature.

The state of the art on the cloud side is pyannote.audio, a fantastic open-source library by Hervé Bredin. It's PyTorch under the hood, and the pretrained models assume you have a workstation GPU and Python at runtime. Neither of which exists on an iPhone.

Porting pyannote to run inside an iOS app required:

Converting the models to Core ML. The segmenter neural net (a 1D-CNN that ingests audio and outputs voice-activity + speaker-change probability per frame) and the embedder (which produces a 256-dim vector per active speaker segment) both convert cleanly. The clusterer is pure Python and gets reimplemented in Swift.
Streaming the inference. The pretrained pyannote models expect 10-second chunks. For live captioning, 10-second latency is unusable. We slide a 2-second window and re-cluster every 500ms. The clusters get stable after about 3-4 seconds of speech per speaker.
Handling cold-start. The first 2-3 seconds of any conversation have no diarization data. Captions show up immediately, just with a placeholder speaker label ("Speaker 1") until the clusterer locks on.
Naming speakers. The user can tap any speaker label and rename it. "Speaker 1" becomes "Doctor Rodríguez." The rename persists for the whole session and gets re-applied if the clusterer recovers the same speaker after a gap.
The "did someone address me?" signal. ClearCaps detects when a speaker directly addresses the user (a question containing "you" or the user's name, such as "Bruno") and triggers a haptic. The user doesn't have to stare at the screen; they can look at the person they're with and feel a buzz when something needs their attention. This came from talking to my dad: the worst part of hearing loss in conversation isn't missing words, it's missing when someone has just asked you a question.
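
To make the clustering and renaming steps concrete, here is a toy online clusterer. It is a stand-in, not the pyannote algorithm or the Swift reimplementation: it greedily assigns each embedding to the nearest running-mean centroid by cosine similarity, and the 0.7 threshold and class name are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SpeakerTracker:
    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # one running-mean embedding per speaker
        self.counts = []
        self.names = {}             # speaker index -> user-chosen name

    def assign(self, embedding):
        """Return the best-matching speaker index, creating a speaker if none is close."""
        best, best_sim = None, self.threshold
        for i, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        n = self.counts[best]
        self.centroids[best] = [(c * n + e) / (n + 1)
                                for c, e in zip(self.centroids[best], embedding)]
        self.counts[best] = n + 1
        return best

    def rename(self, index, name):
        self.names[index] = name

    def label(self, index):
        # Placeholder label until the user renames the speaker.
        return self.names.get(index, f"Speaker {index + 1}")
```

Because the name is stored against the speaker index rather than against a time range, a speaker who goes quiet and is later matched to the same centroid gets the same label back.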

Why on-device matters for accessibility specifically

I want to be careful here because accessibility tech often gets framed as a charity case, and that's the wrong frame. Hard-of-hearing people are paying customers. They have specific product requirements. They evaluate tools the same way anyone else evaluates tools.

The privacy-first architecture isn't a feel-good add-on for accessibility users. It's a product requirement that surfaces specifically in this market:

Medical conversations. A captioning app that requires uploading audio to a cloud service is incompatible with patient privacy expectations in most jurisdictions.
Family privacy. Spouse discussing health symptoms over dinner. Kid asking about something embarrassing at school. The captioning user doesn't want that going into anyone's training dataset.
The recipient's consent. When you're using captions in a conversation, the other person hasn't consented to a cloud service capturing their voice. On-device captions sidestep this entirely — the audio never leaves your phone.
Offline reliability. Hearing-loss users need captions most when they're most stressed, which is often in environments where wifi is bad: hospitals, public transit, large crowded events.

The first time my dad used ClearCaps in a real conversation, the thing he commented on wasn't the accuracy — it was that it kept working when the wifi flickered. That's the architectural payoff.

The AI assistant on top

ClearCaps ships with an optional AI layer on top of the captions, powered by a 3B-parameter LLM running through Apple Foundation Models on iOS 26+. The model does four things:

Cleans up the transcript. Whisper is great but it captures every "um" and "uh" and false start. The cleanup pass produces a readable version.
Summarizes long sessions. A 90-minute consultation becomes a one-page bullet summary.
Identifies speakers by context. If "Doctor Rodríguez" appears in the conversation naming themselves, the assistant infers that label automatically.
Visual context. Take a photo during the conversation (a whiteboard, a prescription, a slide) and the LLM describes it.

All of this runs on-device. The LLM is Apple's, the framework is Foundation Models, and there's a privacy manifest in the app bundle that auditors can verify.

Where ClearCaps falls short (and where it's heading)

Honest assessment:

Heavy accents. Whisper-small degrades on heavy regional accents in Spanish (rural Caribbean, Andalusian) and English (Glaswegian, deep Southern US). Whisper-medium would help but doubles the memory footprint.
Crosstalk in groups bigger than 4. Pyannote handles 2-4 speakers cleanly. Above that, clusters merge and split.
Sign-language input. Not in scope yet. ASL/LSE/LSA via camera is on the roadmap but the recognition stack isn't there.
iPad / Mac versions. iPhone only at launch.

The product

ClearCaps is on the App Store. iOS 26+, free download with a paid AI tier ($2.99/month or $19.99/year). The captioning itself — ASR + diarization + translation — is free forever.

I made it free for the captioning because of who the users are. Hard-of-hearing people are often on fixed incomes (older population), and the captioning is a basic accessibility tool that I felt strongly should be available without payment. The AI features are nice-to-have, not need-to-have, and that's where the monetization lives.

— Bruno Galtranch, founder, AstroLexis LLC. If you have feedback or a use case we missed: contact@astrolexis.space.

Apple Silicon as a Serious AI Dev Box: What an M4 Max Actually Does With a 70B Model

GaltRanch — Thu, 21 May 2026 14:22:15 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

If you're shopping for an LLM workstation in 2026, the default mental model is still "NVIDIA GPU, lots of VRAM, big tower." That's not wrong, but it's also not the only correct answer anymore. Apple Silicon — M3, M4, M5 — has quietly become one of the best local AI development boxes on the market, and almost nobody outside of MLX twitter is talking about the actual numbers. Here's what an M4 Max really does, where it crushes NVIDIA, where it doesn't, and why I built SiliconMon to see what's happening underneath.

The thesis: unified memory changes the math

The single architectural decision that makes Apple Silicon competitive for AI workloads is unified memory. On a typical NVIDIA system, the model weights live in dedicated GPU VRAM, separate from system RAM, connected by a PCIe bus. On Apple Silicon, there's one pool of memory — say, 128 GB on an M4 Max — and the CPU, GPU, and Neural Engine all see the same physical pages. No copy between host and device, no PCIe bottleneck on transfers, no juggling layers between cards.

For LLM inference, this matters more than people initially expect:

You can load a 70B parameter model in 4-bit quantization (~40 GB) directly into the unified pool, addressable by the GPU, without renting an enterprise card.
Context window expansion is cheap. Going from 4K to 32K context tokens doesn't require swapping or specialized layer offloading — it just uses more of the same pool.
Multimodal workloads (vision encoder + LLM + speech) coexist in one address space. ClearCaps' on-device captioning pipeline runs WhisperKit, an LLM, and Pyannote-based speaker diarization on the same chip with no inter-device coordination.

The trade-off: total memory bandwidth on Apple Silicon (around 400-800 GB/s depending on chip tier) is below a top-tier NVIDIA card (HBM3 cards push north of 3 TB/s). For pure inference throughput on small models that fit easily in a 4090, NVIDIA still wins. For anything larger than ~20B parameters where you'd otherwise need multi-GPU setups, Apple's unified pool starts looking very attractive.
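
A back-of-the-envelope way to see why bandwidth sets the ceiling: during decode, a dense model has to read essentially all of its weights for every generated token, so tokens per second cannot exceed bandwidth divided by model size. The sketch below uses the figures quoted in this post; the function names are made up, and real throughput lands below this bound once KV-cache reads and runtime overhead are counted.

```python
def weights_gb(params_billion, bits_per_weight):
    """Raw weight size in GB (1 GB = 1e9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8

def decode_ceiling_tok_s(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb

print(weights_gb(70, 4))              # 35.0 GB raw; the post budgets ~40 GB
print(decode_ceiling_tok_s(400, 40))  # 10.0
print(decode_ceiling_tok_s(800, 40))  # 20.0
```

So a 40 GB model on a 400-800 GB/s chip tops out somewhere between 10 and 20 tokens/second, no matter how much compute is available.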

Real numbers on M-series for LLM inference

The tokens-per-second numbers depend heavily on quantization, framework (MLX vs llama.cpp), and whether you're measuring prefill or decode. Here's a rough baseline for decode speed on the most common configurations, with 4-bit quantized weights running on MLX:

Chip	Unified RAM	7B model	13B model	30B model	70B model
M2 Pro	32 GB	~45 tok/s	~22 tok/s	~8 tok/s	not viable
M3 Max	64 GB	~75 tok/s	~38 tok/s	~16 tok/s	~5 tok/s
M4 Max	128 GB	~110 tok/s	~55 tok/s	~28 tok/s	~10 tok/s
M3 Ultra	192 GB	~130 tok/s	~70 tok/s	~36 tok/s	~14 tok/s

For interactive use, anything above 15 tokens/second feels "instant" to a human reader. That means an M3 Max comfortably handles 30B models for interactive chat, and an M4 Max handles 70B models if you're patient on long generations.

The number that matters most for indie developers: a base M4 Mac mini at $1,400 with 24 GB unified memory runs quantized 13B models at 50+ tokens/second. That's a usable AI workstation for the price of a mid-range laptop, with zero noise, zero rack space, and 20W idle power draw.

Where Apple Silicon wins

Models that don't fit on a single consumer NVIDIA card. A 70B model in 4-bit needs ~40 GB. The biggest consumer NVIDIA card (5090) ships with 32 GB. You can split across multiple cards, but inter-card communication becomes the bottleneck. M4 Max with 128 GB swallows the whole model and has headroom for 32K context.
Power efficiency. An M4 Max under sustained inference load draws 30-50W. The equivalent NVIDIA workstation can pull 600-900W. If you're paying for electricity (anyone running 24/7 self-hosted inference) the OpEx delta is enormous.
Acoustic profile. Mac Studio is silent. Mac mini is silent. A workstation with two RTX cards is a lawnmower. For anyone working from home, this is non-negotiable.
Out-of-the-box experience. macOS + MLX + Homebrew + Ollama installs in twenty minutes and just works. CUDA-on-Linux remains a persistent source of pain.
Multimodal workflows. Unified memory means you can pipeline speech-to-text, LLM, and TTS without ever materializing intermediate buffers across PCIe.
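
The power-efficiency point can be turned into a monthly figure. The wattages are the ones quoted above; the $0.15/kWh price and the assumption of sustained load 24 hours a day are mine, and both inflate the absolute numbers for a machine that idles most of the time.

```python
HOURS_PER_MONTH = 24 * 30

def monthly_kwh(watts):
    """Energy used in a 30-day month at a constant draw."""
    return watts * HOURS_PER_MONTH / 1000

def monthly_cost(watts, usd_per_kwh=0.15):  # electricity price is an assumption
    return monthly_kwh(watts) * usd_per_kwh

for label, watts in [("M4 Max at 50 W", 50),
                     ("NVIDIA workstation at 600 W", 600),
                     ("NVIDIA workstation at 900 W", 900)]:
    print(f"{label}: {monthly_kwh(watts):.0f} kWh, ${monthly_cost(watts):.2f}")
```

Under those assumptions the gap is roughly 400 to 600 kWh per month, which matters more the closer the box runs to full load around the clock.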

Where Apple Silicon loses

Training and fine-tuning. Mac is great for inference but the training stack (PyTorch on MPS, MLX training APIs) is still meaningfully behind CUDA. Anything beyond LoRA on small models is faster on NVIDIA.
Throughput per dollar at scale. If you're running production serving with hundreds of concurrent requests, a rack of L40S cards beats a fleet of Mac Studios on raw cost-per-token. Apple wins for development; NVIDIA wins for production serving above a certain volume.
Software ecosystem for very new research. Cutting-edge research code lands on CUDA first. The Mac port arrives weeks to months later, sometimes with reduced functionality.
Tooling visibility. NVIDIA gives you nvidia-smi, nvtop, NVIDIA Nsight, profiling tools that work on day one. macOS gives you Activity Monitor and a vague sense of where your watts are going. This last gap is why I ended up writing SiliconMon.

What you can't see (and why SiliconMon exists)

When you fire up Ollama, llama.cpp, MLX, LM Studio, ComfyUI, or vLLM on a Mac, the operating system shows you almost nothing useful. Activity Monitor reports CPU% per process, but the GPU and Neural Engine residency are invisible. Memory pressure is a single colored bar. Power draw is hidden behind powermetrics, which requires sudo and outputs an unreadable wall of text.

I'd been running multiple local LLM stacks for over a year and had no way to answer simple questions:

When I run Ollama and ComfyUI simultaneously, are they sharing the GPU or fighting for it?
Is my 70B model actually using the Neural Engine, or is it entirely on the GPU?
What's the package power draw during inference vs idle? Am I thermal throttling on a long generation?
Why does the system feel sluggish — am I swapping unified memory, or is something else going on?

Existing tools each gave fragments. asitop wraps powermetrics, so it needs sudo; it is also command-line only and goes unmaintained for long stretches. macmon and mactop are similar terminal tools. Stats and iStat Menus are general-purpose and don't know what an MLX process is. None of them detect "this Python process is actually serving Llama 4 via vLLM" or "this is Ollama loading a Qwen3 quantization."
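To make that kind of detection concrete, here is a minimal sketch of command-line matching in Python. The pattern table and the `classify_process` helper are illustrative assumptions, not SiliconMon's actual rules, and a real monitor would also need to read process lists and loaded libraries from the OS.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not SiliconMon's rule set.
RUNTIME_PATTERNS = [
    ("Ollama", re.compile(r"(^|/)ollama(\s|$)")),
    ("llama.cpp", re.compile(r"(^|/)llama-(server|cli)(\s|$)")),
    ("vLLM", re.compile(r"\bvllm(\.entrypoints|\s+serve)\b")),
    ("MLX", re.compile(r"\bmlx_lm\b")),
    ("ComfyUI", re.compile(r"comfyui", re.IGNORECASE)),
]


def classify_process(command_line: str) -> Optional[str]:
    """Return the first runtime whose pattern matches the command line, else None."""
    for name, pattern in RUNTIME_PATTERNS:
        if pattern.search(command_line):
            return name
    return None


print(classify_process("python -m vllm.entrypoints.openai.api_server --model x"))
print(classify_process("/opt/homebrew/bin/ollama serve"))
print(classify_process("python train.py"))
```

Matching on the full command line rather than the process name is what separates a generic `python` process from one that is serving a model.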

So I built SiliconMon. It does three things the others don't:

AI workload detection. SiliconMon recognizes the canonical names and command-line patterns of MLX, Ollama, llama.cpp, LM Studio, ComfyUI, vLLM, and Hugging Face's transformers stack. When you see "Inference 47% • Ollama: qwen3-32b" in the menu bar, that's because the detector matched the process name, command line arguments, and loaded library set.
IOReport-based residency. Real CPU/GPU/ANE residency numbers from Apple's IOReport private framework, the same source Apple uses internally. Sampled once per second, no sudo required, sub-1% CPU footprint at idle.
Energy unit correctness across chip generations. M5 Max ships IOReport channels with mixed energy units — millijoules, nanojoules, microjoules — in the same response. Getting the conversion wrong is a 30× error on power numbers. SiliconMon has explicit per-channel unit handling and a regression test for every M-series chip we support.
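The unit-handling problem in the last item can be sketched as follows. This is a minimal illustration, assuming each sample arrives as a (channel name, energy delta, unit label) tuple; the channel names and the tuple shape are hypothetical and do not reflect the IOReport API.

```python
# Conversion factors to joules for the three units named above.
UNIT_TO_JOULES = {"mJ": 1e-3, "uJ": 1e-6, "nJ": 1e-9}


def average_watts(energy_delta: float, unit: str, interval_s: float) -> float:
    """Convert an energy delta over a sampling interval into average watts."""
    if unit not in UNIT_TO_JOULES:
        raise ValueError(f"unknown energy unit: {unit!r}")
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return energy_delta * UNIT_TO_JOULES[unit] / interval_s


def total_watts(samples, interval_s: float) -> float:
    """Sum power across channels, converting each with its own unit."""
    return sum(average_watts(delta, unit, interval_s) for _, delta, unit in samples)


# Hypothetical channels reporting in different units within one response.
samples = [
    ("CPU Energy", 4000.0, "mJ"),
    ("GPU Energy", 9_000_000.0, "uJ"),
    ("ANE Energy", 500_000_000.0, "nJ"),
]
print(total_watts(samples, interval_s=1.0))
```

Refusing unknown unit labels, instead of assuming a default, is the point of the design: a silent fallback is how a misread unit ends up in a power graph.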

How to think about buying a Mac for local AI

Rough buying guide based on what I'd actually recommend to friends asking:

Hobbyist / curious: M4 Mac mini, 24 GB unified, $1,400. Runs 7B and 13B models smoothly. Won't handle 30B+ comfortably. Best dollar-for-LLM machine on the market for non-pros.

Developer running local LLMs daily: M3 Max MacBook Pro 14"/16" with 64 GB unified, $3,200-3,600. Handles 30B models for interactive use, fine for 70B if you're patient.

Serious indie / small team self-hosted AI: M3 Ultra Mac Studio with 256 GB unified, $5,500-7,500. Runs 70B comfortably and 120B+ models in quantized form. Silent, sits under a desk, draws less power than a microwave. Sweet spot for self-hosted AI assistants like Kulvex AI.

Production / training: Use NVIDIA. The Mac isn't the right tool for serving at scale or training large models.
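A rough way to sanity-check these tiers is to estimate how much memory the weights alone need: parameters times bits per weight, divided by eight. The sketch below does that arithmetic. The 25% overhead margin for KV cache and runtime is an assumption for illustration, not a measured figure, and macOS does not make all unified memory available to the GPU.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         overhead_fraction: float = 0.25) -> bool:
    """True if weights plus an assumed overhead margin fit in memory_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= memory_gb


for params in (13, 30, 70):
    print(params, "B at 4-bit:", weight_memory_gb(params, 4), "GB of weights")

print(fits(70, 4, 24))  # 70B against a 24 GB machine
print(fits(70, 4, 64))  # 70B against a 64 GB machine
```

This is also the arithmetic behind "buy the unified memory, not the cores": the model either fits or it doesn't, and no number of cores changes that.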

Software stack: what to install on day one

# Homebrew (if you don't have it)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Ollama — easiest entry point
brew install ollama
ollama serve &
ollama run qwen3:14b

# MLX — for Python-side LLM work
pip install mlx mlx-lm
# Any 4-bit model from the mlx-community hub works here
python -m mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Hello, world"

# llama.cpp
brew install llama.cpp
# llama.cpp needs GGUF weights (mlx-community repos are MLX format)
llama-server -hf Qwen/Qwen3-32B-GGUF:Q4_K_M

# LM Studio — GUI alternative
# Download from https://lmstudio.ai

# SiliconMon — see what's actually happening
open https://astrolexis.space/siliconmon

The honest take

If you're already invested in CUDA, building Linux workstations, and serving inference at scale: Apple Silicon is probably not for you, and that's fine. NVIDIA's lead on production infrastructure is real and not closing soon.

If you're an indie developer, a researcher who needs to iterate locally, a security-conscious team that can't ship code to the cloud, or anyone who values a quiet, low-power, easy-to-set-up AI workstation — Apple Silicon is dramatically better than its reputation. The M4 generation is the inflection point. The M5 Max coming later this year extends the lead.

Buy the unified memory, not the cores. If you're agonizing between the cheaper config and the next tier up, always go for more RAM. Models grow, context windows grow, and you can't upgrade Mac memory after purchase.

— Bruno Galtranch, founder, AstroLexis LLC. Questions on Apple Silicon for AI: contact@astrolexis.space.

Static Analysis Without Sending Your Code to the Cloud: Building KCode

GaltRanch — Thu, 21 May 2026 14:13:20 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

Every modern SAST tool — Snyk, SonarQube Cloud, GitHub Advanced Security, Semgrep AppSec Platform — asks the same thing: ship your source code to us, we'll tell you what's wrong with it. For a non-trivial number of teams, that's a non-starter. Here's how we built KCode, the static analysis tool that runs the LLM verifier on your own hardware, and what we learned about getting machine-grade precision out of a local model.

The day SAST became my problem

I'm Bruno, founder of AstroLexis. About a year before we started building KCode, I was the only engineer on a codebase that didn't tolerate uploading source. The reasons were the usual mix: enterprise customers with NDAs that explicitly forbade third-party SaaS code scanning, defense-adjacent contracts, jurisdictional restrictions that made any non-EU data residency a paperwork nightmare. The work was real, the policies were real, and the tooling we needed wasn't.

The market for "static analysis you can actually deploy on-prem" turned out to be remarkably bad. Snyk, SonarQube Cloud, and GitHub Advanced Security are SaaS-first. The on-prem versions exist but are priced for Fortune 500 and ship with the kind of installation playbook that needs a dedicated DevSecOps engineer to maintain. Semgrep has an open-source core, which is great, but the rule set that catches real bugs lives in their commercial platform. Local linters (ESLint, Pylint, Bandit, gosec) catch surface-level issues but miss anything that requires reasoning across files or distinguishing between "this looks scary" and "this actually exploits."

And then LLMs arrived and complicated everything. Suddenly you could ask Claude or GPT-4 about a file and get genuinely insightful security analysis. The catch: that file just went to someone else's datacenter. For the work I was doing, that wasn't a tradeoff — it was a deal-breaker.

So we built the tool we needed.

What KCode actually does

The architecture is intentionally boring:

Deterministic pre-filter. 414 hand-curated patterns across 20+ languages (C, C++, Rust, Go, Python, TypeScript, JavaScript, Java, Kotlin, Swift, Ruby, PHP, Bash, SQL, YAML, HCL, and more). 372 of them are regex, 27 are AST-based for the rules that need structural awareness (control flow, taint, scope). The patterns generate candidates: files and line ranges that look like they might be a problem.
Local LLM verifier. The candidates get fed to a local LLM (we recommend a 24GB+ GPU running a 30B-parameter model in 4-bit quantization). The model's job is to confirm or reject: "is this candidate actually exploitable given the surrounding code, or is it a false positive?" The verifier sees only the relevant code snippets — it doesn't need the whole repo in context.
Output. SARIF format for CI integration, Markdown reports for humans, optional PDF for stakeholders.

That's it. Two stages, deterministic plus probabilistic. The cleverness is in the patterns and in how we prompt the verifier — not in trying to make the LLM do everything from scratch.
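The two-stage shape can be sketched in a few lines of Python. The rules, the `Candidate` type, and the `verify` callback below are hypothetical stand-ins chosen for illustration; they are not KCode's patterns or its verifier interface.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    rule_id: str
    line_no: int
    snippet: str


# Hypothetical rules for illustration; not KCode's pattern set.
RULES = {
    "c-strcpy": re.compile(r"\bstrcpy\s*\("),
    "py-sql-fstring": re.compile(r"execute\(\s*f['\"]"),
}


def prefilter(source: str) -> List[Candidate]:
    """Stage 1: deterministic pass that flags lines matching any rule."""
    candidates = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule_id, pattern in RULES.items():
            if pattern.search(line):
                candidates.append(Candidate(rule_id, line_no, line.strip()))
    return candidates


def audit(source: str, verify: Callable[[Candidate], bool]) -> List[Candidate]:
    """Stage 2: keep only the candidates the verifier confirms."""
    return [c for c in prefilter(source) if verify(c)]


source = 'strcpy(dst, src);\nint x = 1;\ncursor.execute(f"SELECT {x}")\n'
print(prefilter(source))
```

In a real setup `verify` would wrap a call to the local model. Keeping it a plain callable means the deterministic stage can be tested without a GPU.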

Benchmarks on the SAST validation suite:

100% precision
92.3% recall
F1 score: 0.96
414 hand-curated patterns across 20+ languages
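For readers who want to check the F1 figure, it is the harmonic mean of the precision and recall listed above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(1.0, 0.923), 2))  # prints 0.96
```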

Why the architecture matters

People who haven't shipped a SAST tool tend to underestimate how much of the difficulty is false positive management. A scanner that finds 500 issues, of which 30 are real, doesn't actually help anyone. Developers stop opening the report after the third Tuesday. The signal-to-noise ratio kills adoption faster than missed bugs do.

This is where the local LLM earns its keep. Regex and AST patterns can identify shape — "this function calls strcpy with a user-controlled buffer", "this SQL string interpolates a variable" — but they can't reason about context. Does the buffer get bounded earlier? Is the variable sanitized at the controller layer? Is the entire function only reachable from a test fixture?

The LLM verifier handles exactly that contextual judgment, and it's good at it. In our benchmarks, the verifier rejects roughly 60-75% of the candidates that the deterministic pre-filter raises. The ones that survive are the real findings.

Crucially, the LLM never has to find the bug from scratch. The deterministic pre-filter narrows the search space from "scan a million lines of code" to "evaluate 800 candidates." That makes the inference budget manageable: a full audit of a 500K-line codebase runs in about 10,000 tokens of verifier input, not 300K+. We can run that on a single consumer GPU in minutes.

The benchmark that mattered: NASA IDF

Public benchmarks are great for marketing slides. Real validation comes from running against actual codebases written by people who weren't grading themselves.

We ran KCode against NASA's IDF — a piece of flight-software-adjacent open source. The IDF repo isn't toy code: it's instrumentation infrastructure used in real telemetry pipelines, written in C++ and Python, maintained by people whose job titles include "Senior Software Engineer, Flight Systems".

KCode opened PR #107 against the repo, identifying 28 bugs across the codebase. The breakdown:

Buffer overflows from unchecked string operations (the C++ classics).
Missing null checks on pointers returned from allocation paths.
Integer truncation in size calculations that would silently corrupt under specific input ranges.
Race conditions in concurrent state mutation that the linter had missed because the relevant globals were declared three files away.
A handful of Python issues around exception handling that swallowed errors silently.

The NASA team merged the changes. That's the validation that matters: real bugs, in real production-adjacent code, accepted by maintainers who know the codebase.

What we got wrong (and how we fixed it)

The first version of KCode was a mess. The verifier was hallucinating. The pre-filter was over-firing. Our F1 on the validation suite was a depressing 0.71 for months. Three things turned it around:

1. Cascade verification

A single LLM verifier has a measurable false-positive rate. We could either (a) lower the temperature and pray, or (b) chain two verifiers with different model families and only accept findings both confirm. We picked (b). The current production setup runs Grok + Claude Opus in an ensemble: both have to agree the candidate is real before it lands in the report. False positives dropped by 60%. The cost is roughly 2× verifier tokens, which on local hardware costs nothing meaningful.
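The agreement rule in option (b) is simple to express. Below is a minimal sketch, assuming each verifier is a function that takes a candidate and returns True or False; the `cascade` helper is hypothetical and not KCode's API.

```python
from typing import Any, Callable, Iterable, List

Verifier = Callable[[Any], bool]


def cascade(candidates: Iterable[Any], verifiers: List[Verifier]) -> List[Any]:
    """Keep a candidate only if every verifier confirms it."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    # all() short-circuits, so later verifiers only run on candidates
    # that every earlier verifier accepted.
    return [c for c in candidates if all(v(c) for v in verifiers)]


# Toy verifiers standing in for two different model families.
def accepts_even(n: int) -> bool:
    return n % 2 == 0


def accepts_multiple_of_three(n: int) -> bool:
    return n % 3 == 0


print(cascade([1, 2, 3, 4, 6], [accepts_even, accepts_multiple_of_three]))  # prints [6]
```

Requiring unanimous agreement trades recall for precision: a real finding that either verifier rejects is dropped, so the rule only pays off when false positives cost more than misses.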

2. Output filter for "prompt rules miss"

The LLM verifier will occasionally produce output that looks like a valid finding but is structurally malformed for SARIF — wrong line numbers, missing severity, weird character escaping. We built a strict output filter that rejects malformed verifier output and re-prompts. This sounds boring; it's actually one of the most load-bearing pieces of the system. Without it, ~3% of findings showed up as garbage. With it, the SARIF output is parseable by every downstream tool we've tried (GitHub Code Scanning, SonarQube import, custom dashboards).
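A strict filter of this kind might look like the sketch below. It assumes the verifier replies with a JSON object carrying `line`, `severity`, and `message` fields, and that `ask` is a function that re-prompts the model and returns its raw reply. Both the field names and the severity set are assumptions for illustration.

```python
import json
from typing import Callable, Optional

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed set


def parse_finding(raw: str, max_line: int) -> Optional[dict]:
    """Return the finding if it is structurally valid, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    line = data.get("line")
    if isinstance(line, bool) or not isinstance(line, int) or not 1 <= line <= max_line:
        return None
    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        return None
    message = data.get("message")
    if not isinstance(message, str) or not message.strip():
        return None
    return data


def verified_finding(ask: Callable[[], str], max_line: int,
                     retries: int = 2) -> Optional[dict]:
    """Call the verifier, re-prompting up to `retries` times on malformed output."""
    for _ in range(retries + 1):
        finding = parse_finding(ask(), max_line)
        if finding is not None:
            return finding
    return None


replies = iter(["not json", '{"line": 3, "severity": "high", "message": "unbounded copy"}'])
print(verified_finding(lambda: next(replies), max_line=10))
```

Bounding the retries matters: a verifier that keeps producing garbage should surface as a dropped candidate, not as a hang.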

3. The "audit your auditor" week

For one full week, we ran KCode against itself and pointed another tool (Inquisitor, our agent QA daemon) at KCode. The goal was to find every silent failure in our own pipeline before customers did. Inquisitor surfaced 8+ silent-failure bugs that week: hallucinated tool results that propagated through the pipeline, exit-code-0 hangs that no human or test suite had caught, edge cases where verifier rejection was masked as success. Every one of those is now a test case in our CI.

If you ship developer tooling, audit your auditor. It's the highest-leverage week of QA you can do.

How to install and use it

KCode is distributed as binaries (Linux x64/ARM64, macOS Apple Silicon) and an npm package. Three install paths:

# Option A: one-line install (recommended for local use)
curl -fsSL https://kulvex.ai/kcode/install.sh | sh

# Option B: npm
npm install -g @astrolexisai/kcode

# Option C: GitHub Action (drop into .github/workflows)
- uses: AstrolexisAI/kcode-action@v1
  with:
    target: ./src
    severity: medium

For CI integration, the GitHub Action publishes SARIF to GitHub Code Scanning, which means the findings show up in the Security tab and as inline PR comments. No additional dashboard required.
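For reference, a SARIF file is plain JSON. The sketch below builds a minimal SARIF 2.1.0 log from a list of findings; the `to_sarif` helper, the tuple layout, and the sample finding are hypothetical, and real tool output normally carries more metadata, such as rule descriptions.

```python
import json


def to_sarif(findings, tool_name="KCode"):
    """Build a minimal SARIF 2.1.0 log from (rule_id, level, message, path, line) tuples."""
    results = []
    for rule_id, level, message, path, line in findings:
        results.append({
            "ruleId": rule_id,
            "level": level,  # SARIF levels: "none", "note", "warning", "error"
            "message": {"text": message},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": path},
                    "region": {"startLine": line},
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Made-up finding for illustration.
log = to_sarif([("c-strcpy", "error", "Unbounded strcpy into fixed buffer", "src/io.c", 42)])
print(json.dumps(log, indent=2))
```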

For local development, kcode scan ./src --verifier-model qwen3.6-heretic runs a full pass and writes the report to stdout. If you have a Mac with 32GB+ unified memory, MLX serves the verifier directly. If you have a GPU server, point KCode at any OpenAI-compatible endpoint serving the model you want.

Free tier is permissive: full feature set, no source-code upload, you bring your own model. Pro at $19/month adds priority pattern updates, the curated weekly verifier model release, and access to the cascade ensemble pre-configured. Pricing details and binaries.

The honest part: where we are with revenue

I'm not going to pretend KCode is a runaway hit. Here's where we actually are:

Revenue: $0. No confirmed Pro subscribers as of this writing. The free tier has users (actual installs, actual scans, actual SARIF reports landing in CI), but Pro conversion hasn't started.
Phase 1 goal: 10 paying subs or 2 paid audit engagements. That's the bar we set for "this is a real product."
What we know works: the technical core. Precision is real, the patterns are good, the verifier doesn't hallucinate, the SARIF output is clean. The bugs we found in NASA's code weren't a one-off.
What we're testing: whether the buyer who can't ship code to Snyk actually exists in the volume we hope. Our hypothesis is yes — defense, healthcare, EU SaaS, anyone with GDPR data residency, anyone with NDA constraints. We're going to find out over the next two quarters.

I'm sharing this because the indie software world is full of "we're crushing it" posts that don't match the financial reality, and that makes it harder for anyone building something legitimate to talk straight. KCode is a real tool that solves a real problem. We don't yet know if it'll be a business. That's where we are.

Who this is for

If your team is in any of these buckets, KCode is built for you:

You have source code that contractually cannot leave your infrastructure. Defense, healthcare, financial services with strict residency.
You run on-prem CI and the SaaS SAST tools don't ship a self-hosted edition you can actually afford.
You've tried Snyk/SonarQube/GHAS and find the noise level untenable. You want a tool that fires less and lands more.
You're philosophically opposed to your code training someone else's model. Reasonable position.
You're a security consultant doing one-off engagements and want a tool that runs on your laptop without phoning home.

If your team is happily on a SaaS SAST and your auditors don't care, KCode is probably not for you. That's fine. We're not trying to displace the SaaS market — we're serving the chunk of it that can't use SaaS at all.

— Bruno Galtranch, founder, AstroLexis LLC. If you're evaluating KCode for your team or want to talk about a paid audit engagement: contact@astrolexis.space.

Why We Run LLMs On-Device in 2026

GaltRanch — Thu, 21 May 2026 14:13:19 +0000

Originally published on the AstroLexis blog. Cross-posted here for the community.

For most of the last three years, "AI" has meant calling someone else's API. Your prompt leaves your machine, hits a datacenter, and a response comes back. In 2026 that's no longer the only sensible architecture. Here's the case for running LLMs on your own hardware — and what we ship at AstroLexis to make it actually work.

The cloud isn't the only place for AI anymore

When OpenAI shipped GPT-3.5 in late 2022, running an LLM locally was an exotic hobby. The smallest useful models needed a workstation, the tooling barely worked outside a research lab, and inference was slow enough that real-time use was out of reach. The cloud was the only practical option.

That's not the world we live in anymore. As of mid-2026:

An Apple M4 Pro Mac mini ($1,400) runs a quantized 30B parameter model at 25-40 tokens/second using MLX.
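A consumer RTX 5090 (32GB VRAM) runs 30B-class models in 4-bit quantization with comfortable headroom for context windows. A 70B model at 4-bit needs roughly 35GB for the weights alone, so it takes partial CPU offload or a second card.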
Apple's own Foundation Models (built into iOS 26 and macOS) ship a 3B-parameter on-device LLM that's available to every app through a system framework.
Llama 4, Qwen 3.6, Mistral Small 3.1 and Gemma 4 all ship 4-bit weights designed to run on commodity hardware.

The cost-performance curve has crossed a line where, for a large class of real applications, running locally is now better — not just feasible. The question stopped being "can we run this without the cloud?" and became "why are we still sending this to someone else's datacenter?"

Cost: the math has flipped

Cloud LLM pricing in 2024 was an order of magnitude cheaper than running your own inference. By 2026, for any sustained workload, the math is the opposite.

Take a concrete example: a static code analysis pipeline that scans 500 commits per day against a 1M-line codebase. With KCode we measured:

OpenAI o4-mini, hosted API: ~$340/month, plus the latency overhead of going to the cloud per file.
Local Qwen3.6-Heretic 30B on a single RTX 5090: roughly $0 marginal cost after the GPU is purchased, with a sub-second turnaround per file because the model is warm in VRAM and there's no network hop.

The capex is real — a workstation isn't free. But for any team doing real volume, the breakeven against API pricing arrives in 4-8 months. After that, every additional run is essentially free. The same calculus applies to support agents, document classification pipelines, voice transcription, image captioning, anything that runs at scale.
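The breakeven arithmetic is easy to rerun with your own numbers. In the sketch below, the $340/month API figure comes from the example above; the hardware prices and the monthly electricity cost are placeholder assumptions, not measurements.

```python
import math


def breakeven_months(hardware_cost: float, api_cost_per_month: float,
                     running_cost_per_month: float = 0.0) -> int:
    """Whole months until cumulative API spend exceeds the hardware outlay."""
    monthly_saving = api_cost_per_month - running_cost_per_month
    if monthly_saving <= 0:
        raise ValueError("local setup never pays back at these rates")
    return math.ceil(hardware_cost / monthly_saving)


# Assumed: a $2,000 GPU alone, then a $2,500 workstation with $15/month of power.
print(breakeven_months(2000, 340))      # prints 6
print(breakeven_months(2500, 340, 15))  # prints 8
```

The result is sensitive to volume: halve the API bill and the payback period roughly doubles, so the calculation only favors local hardware for sustained workloads.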

Privacy: your data is your data

The privacy story is easy to explain, even to a non-technical user: if your data never leaves your machine, no one can lose it, sell it, or train on it.

This matters more in some contexts than others. We ship products on both ends of the privacy spectrum:

ClearCaps generates live captions and diarized transcripts for users with hearing loss. The audio is profoundly personal — medical conversations, family calls, work meetings. Running speech recognition (WhisperKit) and speaker diarization on-device means there's nothing for an attacker to intercept or a vendor to monetize.
PhoenixSteps is a clinical speech-therapy companion for pediatric patients. The users are children. Their speech recordings are protected health information under HIPAA and equivalent frameworks in most jurisdictions. There's no possible "cloud version" that we'd ship.
Kulvex AI is a self-hosted assistant. It runs on hardware the user owns, in their home, on their network. We never see the conversations.

This isn't ideology. It's a product constraint. There are categories of software — health, legal, family, identity — where shipping to a cloud LLM is a non-starter. On-device is the only viable architecture.

Latency: 50ms vs 800ms

A cloud LLM round-trip is at minimum the network latency (50-200ms) plus the time-to-first-token (200-1000ms depending on load) plus the streaming of the response. For a short reply that's a 1-2 second user-facing delay.

An on-device model on Apple Silicon, with the weights already memory-mapped into RAM, can start producing tokens in under 50ms and stream at 30+ tokens/second for a 7B model. For interactive UX — autocomplete, voice assistants, real-time captions — this is the difference between "feels native" and "feels like a web form."

We're working with this constraint right now on our iOS apps. The Apple Foundation Models framework gives us a 3B-parameter LLM that responds in 100-200ms total on an iPhone 16. That's fast enough that the user never sees a spinner. The same query against an OpenAI API would feel slower even if it produced a higher-quality answer — because the perceived speed of UI dominates short interactions.

Freedom: no vendor lock-in

This is the underappreciated one. Every cloud LLM you build on top of is a dependency on someone else's roadmap, pricing, and content policy. They can deprecate the model you're using, double the price overnight, refuse to serve your jurisdiction, or decide that your use case violates their terms.

We've watched this play out repeatedly:

The original GPT-4 API was deprecated and replaced with new versions that broke established prompt patterns for thousands of products.
Anthropic, OpenAI, and Google have all rejected or rate-limited use cases at various points (security tooling, certain medical applications, anything touching content moderation).
Hosted prices have moved up and down without warning, making it impossible to model unit economics.

On-device, you can pin the model version forever. Llama 4 will run on your 5090 in 2030 the same way it runs today. No one can take it away. Your customers' workflows don't break because a vendor changed their mind.

The on-device weights become a real asset. It's the opposite of "renting" intelligence.

What we ship at AstroLexis

Everything we build runs locally by default. The full lineup:

Kulvex AI — self-hosted AI platform with 17 domain agents (home automation, messaging across 8 platforms, voice control). Runs on your own GPU.
KCode — deterministic security audit tool with 414 hand-curated patterns across 20+ languages. Pre-filters with regex/AST, verifies with a local LLM. Your source code never leaves your machine. SARIF output, GitHub Action.
ClearCaps — live captions and speaker diarization on iPhone. WhisperKit + Apple SpeakerKit, all on-device.
SiliconMon — Apple Silicon system monitor for macOS. Shows you exactly what your GPU, ANE, and unified memory are doing while you run MLX, Ollama, llama.cpp, or LM Studio locally.
PhoenixSteps — clinical speech-therapy companion for pediatric SLPs. iOS-only, MLX-based.
Vela — memory companion for adults with memory impairment. iOS-only, on-device.
Tutto — conversational practice for English and Spanish learners. In development.

The common thread isn't a particular AI framework or model. It's the architectural commitment: the user owns the inference. We don't sit in the middle.

How to start

If you're building software in 2026 and considering whether to make an on-device version, our take:

Start with the right hardware target. Apple Silicon is the most underrated AI dev box on the market. An M2 Pro or newer Mac with 32GB+ unified memory handles 7-13B parameter models comfortably. For server work, a single 24-32GB consumer GPU (RTX 4090/5090) handles 30B models.
Pick a model family and stay on it. Llama 4, Qwen 3.6, Mistral Small, Gemma 4. All ship 4-bit quantizations. All have stable APIs through MLX, llama.cpp, or vLLM. Don't chase weekly model releases — pick one, learn its quirks, ship.
Treat the local LLM as a tool, not a magic box. Wrap it in deterministic pre-processing and post-processing. KCode does this: regex/AST patterns find candidates, the LLM verifies. The local model doesn't have to be GPT-5-level to be useful — it has to be reliable for a narrow task.
Measure honestly. Track tokens-per-second, time-to-first-token, memory footprint, and battery impact on real devices. The numbers you see on a research blog don't match what you'll see on a customer's M1 Air.
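For the first two metrics, a small harness around any token stream is enough. The sketch below is one way to do it, assuming the runtime exposes generated tokens as a Python iterable; the `measure_stream` helper is hypothetical, and memory footprint and battery impact need platform tools that it does not cover.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Measure time-to-first-token and decode rate for a token stream."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "tokens_per_s": 0.0}
    decode_time = last - first
    # Rate is measured after the first token, so prompt processing
    # does not distort the decode figure.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"tokens": count, "ttft_s": first - start, "tokens_per_s": rate}


def fake_stream():
    """Stand-in for a model's token generator."""
    for word in ["on", "device", "inference"]:
        time.sleep(0.01)
        yield word


print(measure_stream(fake_stream()))
```

Reporting time-to-first-token separately from decode rate keeps the two bottlenecks apart: the first is dominated by prompt processing, the second by memory bandwidth.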

— Bruno, founder, AstroLexis LLC. If you build in this space, drop a line: contact@astrolexis.space.