<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GaltRanch</title>
    <description>The latest articles on DEV Community by GaltRanch (@galtranch).</description>
    <link>https://dev.to/galtranch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3944273%2F322c6642-58a7-4aae-af05-0395cbe135ef.jpeg</url>
      <title>DEV Community: GaltRanch</title>
      <link>https://dev.to/galtranch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/galtranch"/>
    <language>en</language>
    <item>
      <title>Building a Clinical Speech-Therapy App With a Real SLP: 4 Lessons From PhoenixSteps</title>
      <dc:creator>GaltRanch</dc:creator>
      <pubDate>Thu, 21 May 2026 14:35:36 +0000</pubDate>
      <link>https://dev.to/galtranch/building-a-clinical-speech-therapy-app-with-a-real-slp-4-lessons-from-phoenixsteps-3n6h</link>
      <guid>https://dev.to/galtranch/building-a-clinical-speech-therapy-app-with-a-real-slp-4-lessons-from-phoenixsteps-3n6h</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the &lt;a href="https://astrolexis.space/blog/clinical-speech-therapy-app-phoenixsteps/" rel="noopener noreferrer"&gt;AstroLexis blog&lt;/a&gt;. Cross-posted here for the community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My son's speech-language pathologist became my co-founder. &lt;a href="https://astrolexis.space/phoenixsteps" rel="noopener noreferrer"&gt;PhoenixSteps&lt;/a&gt; is what came out of it: a pediatric clinical app that does what existing apps don't because we built it together — engineer plus therapist plus actual patient (also my kid). Here are four lessons from the last six months, including how we taught Apple's Vision framework to do something Apple flatly refused to.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this started
&lt;/h2&gt;

&lt;p&gt;My son has a speech sound disorder. Specifically, rhotacism (rotacismo in Spanish): he struggles with /r/ and /rr/, which in Spanish are foundational phonemes that show up in roughly one in every six words. His speech-language pathologist is Stefania, Stefa for short. We've been seeing her weekly for over a year and the progress has been real but inconsistent: he'd nail a sound during a session and lose it by mid-week.&lt;/p&gt;

&lt;p&gt;The gap was obvious to both of us. He'd do exercises with Stefa for forty minutes, then we'd go home and the exercises mostly stopped, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The "drill at home" sheet Stefa sent had no feedback loop. My kid would say "ratón" five times and have no idea if any of them were correct.&lt;/li&gt;
&lt;li&gt;Existing pediatric speech-therapy apps in Spanish are either commercially mediocre (gamified versions of basic flashcards) or clinically rigid (built for adult speech rehab, not children).&lt;/li&gt;
&lt;li&gt;The market for tools that actually run the clinical exercises a Spanish-speaking SLP would prescribe — with audio feedback, automatic scoring, and progress tracking the therapist can read — basically did not exist for a private practice working with a 4-year-old.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I asked Stefa if she'd want to co-design something. She said yes. That's how PhoenixSteps started — and the four lessons below are the ones I wish I'd known going in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: A clinical co-creator changes everything about what you ship
&lt;/h2&gt;

&lt;p&gt;I had built consumer iOS apps before. I had not built a clinical tool. The thing I underestimated was how much of the actual product is the &lt;em&gt;protocol&lt;/em&gt;, not the software.&lt;/p&gt;

&lt;p&gt;Stefa works from named, published clinical protocols — Borrás, Bosch, the AELFA articulation drills. When she prescribes an exercise, she's pulling from a tradition that has decades of consensus on order, dosage, and progression. "Lengua a la nariz" isn't a cute idea — it's Borrás Exercise 29, with specific instructions about duration, repetitions per day, and what to do if the child can't sustain the position.&lt;/p&gt;

&lt;p&gt;Before working with Stefa, I would have built a "speech therapy app" that was basically a glorified flashcard deck with cute animations. With Stefa, the exercise catalog became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orofacial praxias&lt;/strong&gt; — 7 exercises pulled directly from her clinical sheet, in the order she actually prescribes them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R-group syllable warmups&lt;/strong&gt; — "ra ra ra," "rrrr-on" — building muscle memory before tackling words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R simple words&lt;/strong&gt; — rosa, ratón, mira, perro — graded by Stefa for difficulty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R-cluster words (sinfones)&lt;/strong&gt; — bra, cra, dra, fra, gra, pra, tra. The hard ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal pairs&lt;/strong&gt; — R/RR, R/L, D/R, T/D. Auditory discrimination drills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carrier phrases&lt;/strong&gt; — embedding the target sound in real sentences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Tren de la Risa"&lt;/strong&gt; — a karaoke song Stefa wrote that hits every R context across 8 verses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that comes out of an engineer's imagination. It comes out of a working SLP's notebook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 1 distilled&lt;/strong&gt;: if you're building a clinical product, the clinician is not a "domain advisor." They're a co-founder. Bring them on, give them equity, and give them a real voice on the product roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: Apple won't give you what your patient needs. Build it yourself.
&lt;/h2&gt;

&lt;p&gt;This is the technical story, and it's the one I'm most proud of.&lt;/p&gt;

&lt;p&gt;One of the most prescribed praxias for kids working on /r/ is "lengua a la nariz" — extending the tongue tip toward the nose. The exercise builds the lingual elevation needed for the alveolar trill. Stefa wants the app to &lt;em&gt;automatically verify&lt;/em&gt; the kid did the exercise correctly: tongue out, pointed up, sustained for 10 seconds.&lt;/p&gt;

&lt;p&gt;This sounds like a job for ARKit. Apple has had face tracking with the TrueDepth camera since the iPhone X. &lt;code&gt;ARFaceAnchor.blendShapes&lt;/code&gt; includes &lt;code&gt;jawOpen&lt;/code&gt;, &lt;code&gt;mouthSmileLeft&lt;/code&gt;, &lt;code&gt;cheekPuff&lt;/code&gt; — and yes, &lt;code&gt;tongueOut&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Except: &lt;code&gt;tongueOut&lt;/code&gt; is a single scalar coefficient. It sits near 0 when the tongue is in and approaches 1 when it's fully out. Apple does not tell you &lt;em&gt;where&lt;/em&gt; the tongue is pointing. Up, down, left, right: they all read identically.&lt;/p&gt;

&lt;p&gt;I emailed Apple developer support. The answer was: no, the tongue is not modeled as 3D geometry, and there's no API to detect tongue direction. Tongue tracking is inherently unstable (occlusion by teeth and lips), so Apple chose not to ship something they couldn't validate at Face ID precision.&lt;/p&gt;

&lt;p&gt;So Stefa and I built the detector ourselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  The pipeline
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ARKit captures the camera frame&lt;/strong&gt; on the TrueDepth camera at 60 fps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We grab the raw &lt;code&gt;frame.capturedImage&lt;/code&gt;&lt;/strong&gt; — the YUV pixel buffer ARKit hands you for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision detects face landmarks&lt;/strong&gt;: &lt;code&gt;VNDetectFaceLandmarksRequest&lt;/code&gt; returns &lt;code&gt;outerLips&lt;/code&gt;, &lt;code&gt;innerLips&lt;/code&gt;, and &lt;code&gt;nose&lt;/code&gt; as 2D polygons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three Regions of Interest outside the lip polygon&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UP ROI&lt;/strong&gt; — rectangle between top of upper lip and bottom of nose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LEFT ROI&lt;/strong&gt; — extending leftward from the left corner of the lips&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RIGHT ROI&lt;/strong&gt; — same, mirrored&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count pink/red pixels inside each ROI&lt;/strong&gt;. The lip-skin transition is at Cr ≈ 18; the tongue is at Cr ≈ 25-50. We threshold Cr &amp;gt; 25 to filter out facial skin and pale lips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If a ROI has &amp;gt; 400 "tongue-colored" pixels, the tongue is projecting in that direction.&lt;/strong&gt; Cross-check with ARKit's &lt;code&gt;tongueOut&lt;/code&gt; blendshape, mirror-compensate for the front-facing camera.&lt;/li&gt;
&lt;/ol&gt;
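&lt;p&gt;The pixel-counting step (steps 4 to 6 above) can be sketched in a few lines. This is a minimal illustration in Python rather than the app's Swift code, and it makes several assumptions: the Cr plane is already extracted into a plain 2D list in the same units as the thresholds quoted above, the ROI rectangles are hypothetical hand-picked values rather than ones derived from Vision landmarks, and the &lt;code&gt;tongueOut&lt;/code&gt; cross-check and mirror compensation are left out.&lt;/p&gt;

```python
from typing import Dict, List, Tuple

Roi = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive pixel coordinates

CR_THRESHOLD = 25  # from the article: tongue reads above this, skin and pale lips below
MIN_PIXELS = 400   # from the article: tongue-colored pixels needed for a positive ROI


def count_tongue_pixels(cr_plane: List[List[int]], roi: Roi) -> int:
    """Count pixels inside the ROI whose Cr value exceeds the threshold."""
    x0, y0, x1, y1 = roi
    return sum(
        1
        for row in cr_plane[y0:y1]
        for value in row[x0:x1]
        if value > CR_THRESHOLD
    )


def detect_direction(cr_plane: List[List[int]], rois: Dict[str, Roi]) -> str:
    """Return the ROI name with the most tongue-colored pixels, or 'notVisible'."""
    counts = {name: count_tongue_pixels(cr_plane, roi) for name, roi in rois.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > MIN_PIXELS else "notVisible"


# Synthetic 100x100 frame: skin-like Cr everywhere, a tongue-like patch above the lips.
plane = [[10] * 100 for _ in range(100)]
for y in range(0, 30):
    for x in range(30, 70):
        plane[y][x] = 40

# Hypothetical ROI rectangles; the real ones come from the lip and nose landmarks.
rois = {
    "up": (30, 0, 70, 30),
    "left": (0, 40, 30, 60),
    "right": (70, 40, 100, 60),
}

print(detect_direction(plane, rois))  # prints "up"
```

&lt;p&gt;Picking the ROI with the highest count, instead of the first one over the threshold, is a design choice of this sketch: it avoids ambiguity when the tongue spills into two regions at once.&lt;/p&gt;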

&lt;p&gt;The detector reports &lt;code&gt;up&lt;/code&gt;, &lt;code&gt;down&lt;/code&gt;, &lt;code&gt;left&lt;/code&gt;, &lt;code&gt;right&lt;/code&gt;, &lt;code&gt;center&lt;/code&gt;, or &lt;code&gt;notVisible&lt;/code&gt; at 20 Hz with a confidence score. The first time I showed Stefa the demo (me sticking my tongue toward my nose and watching the screen say "ARRIBA conf 100% pix 3,974"), she didn't believe it was real until I sent her the source code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 2 distilled&lt;/strong&gt;: the most defensible technical work in a clinical product is the part Apple won't ship. If you can do something the platform doesn't expose — and it matters for the clinical outcome — that's your moat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 3: Audio quality is a feature, not a detail
&lt;/h2&gt;

&lt;p&gt;PhoenixSteps ships with about 325 pre-recorded voice prompts, all generated using OpenAI's &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt; with the "nova" voice. Why pre-recorded TTS instead of letting iOS synthesize on the fly?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pediatric voice consistency.&lt;/strong&gt; Kids learn faster when the audio prompt sounds the same every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed and articulation.&lt;/strong&gt; Stefa wanted slower-than-normal pronunciation for warmups, regular pace for practice, a specific cadence for the song. Generating with explicit instructions ("habla en español neutro latinoamericano, ritmo lento y articulado, énfasis infantil sin caricaturizar", roughly "speak in neutral Latin American Spanish, at a slow, articulated pace, with child-directed emphasis but no caricature") gets us the exact register a real SLP would use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability.&lt;/strong&gt; Pre-recorded audio works offline, doesn't depend on a phone's TTS pipeline being up, doesn't get interrupted by Siri.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We learned the hard way that the OpenAI API will occasionally return a truncated mp3 (we caught three files at 0.36s when they should have been 1.2s). The fix was a post-generation validation step: every newly generated mp3 has to pass a minimum-duration check.&lt;/p&gt;
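&lt;p&gt;A minimal sketch of that validation step, in Python. It assumes the durations have already been measured by whatever audio tool the build uses, and that each prompt has a hand-set minimum plausible duration. The file names, the &lt;code&gt;Clip&lt;/code&gt; type, and the 0.9-second floor are all hypothetical; only the 0.36 s and 1.2 s figures come from the incident above.&lt;/p&gt;

```python
from typing import Dict, List, NamedTuple


class Clip(NamedTuple):
    measured_s: float  # duration read from the generated file
    minimum_s: float   # shortest plausible duration for this prompt (hypothetical floor)


def find_truncated(clips: Dict[str, Clip]) -> List[str]:
    """Return the names of clips whose measured duration is under their floor."""
    return sorted(
        name for name, clip in clips.items() if clip.minimum_s > clip.measured_s
    )


clips = {
    "raton.mp3": Clip(measured_s=0.36, minimum_s=0.9),  # truncated response
    "rosa.mp3": Clip(measured_s=1.2, minimum_s=0.9),    # healthy file
}

print(find_truncated(clips))  # prints ['raton.mp3']
```

&lt;p&gt;Anything this returns gets regenerated before bundling. A per-prompt floor is stricter than one global minimum, which a long phrase cut in half could still pass.&lt;/p&gt;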

&lt;p&gt;&lt;strong&gt;Lesson 3 distilled&lt;/strong&gt;: for pediatric/clinical apps, audio is content. Pre-render every prompt with a consistent voice and pace. Validate audio duration before bundling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 4: HIPAA-equivalent privacy isn't optional
&lt;/h2&gt;

&lt;p&gt;The users of PhoenixSteps are children. Their voice recordings and progress data are protected health information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech recognition on-device&lt;/strong&gt; (WhisperKit). Voice never leaves the iPhone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Face tracking on-device&lt;/strong&gt; (ARKit + Vision).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress data in SwiftData&lt;/strong&gt;, syncing to the family's private iCloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No analytics, no third-party SDKs, no Crashlytics, no Facebook Pixel.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI features gated by parental consent.&lt;/strong&gt; Apple Foundation Models on-device, opt-in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PhoenixSteps will never have a data breach involving children's voice samples, because there's no centralized data to breach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 4 distilled&lt;/strong&gt;: if you're building anything where the user is a minor or a patient, design as if the audit is happening tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where PhoenixSteps is right now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not in the App Store yet.&lt;/strong&gt; Build 28. Finishing the clinical pilot with Stefa.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spanish-first.&lt;/strong&gt; English localization on the roadmap once the clinical content is validated by an English-speaking SLP.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free for parents, with an optional Pro tier for clinicians.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stefa is a co-founder.&lt;/strong&gt; Equity, not consulting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're an SLP working with pediatric patients in Spanish, write us. We're going to add more clinical advisors as the product matures: &lt;a href="mailto:contact@astrolexis.space"&gt;contact@astrolexis.space&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;— Bruno Galtranch, founder, &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis LLC&lt;/a&gt;. With Stefania, SLP and co-founder.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>a11y</category>
      <category>ai</category>
      <category>healthcare</category>
    </item>
    <item>
      <title>Live Captions Without Sending Your Voice to the Cloud: Building ClearCaps</title>
      <dc:creator>GaltRanch</dc:creator>
      <pubDate>Thu, 21 May 2026 14:27:21 +0000</pubDate>
      <link>https://dev.to/galtranch/live-captions-without-sending-your-voice-to-the-cloud-building-clearcaps-13k7</link>
      <guid>https://dev.to/galtranch/live-captions-without-sending-your-voice-to-the-cloud-building-clearcaps-13k7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the &lt;a href="https://astrolexis.space/blog/live-captions-clearcaps/" rel="noopener noreferrer"&gt;AstroLexis blog&lt;/a&gt;. Cross-posted here for the community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My dad started losing his hearing about five years ago. Not catastrophically — just enough that family dinners turned into "what did she say?" and TV got a little louder every month. Off-the-shelf captioning apps existed but every single one required uploading audio to a vendor's cloud. For private family conversations, medical appointments, work calls — that wasn't going to fly. So I built &lt;a href="https://astrolexis.space/clearcaps" rel="noopener noreferrer"&gt;ClearCaps&lt;/a&gt;. Here's the founder story and the technical pieces that make on-device live captioning actually work in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The motivating problem
&lt;/h2&gt;

&lt;p&gt;Hearing loss is one of the most common chronic conditions on the planet. The WHO estimates over 430 million people worldwide live with disabling hearing loss — and that number is rising as the population ages. Most of them are not deaf; they hear, just less reliably. Sound gets muddier. Speech gets harder to parse, especially in noisy environments. Conversations become exhausting in a way that's invisible to anyone who hasn't experienced it.&lt;/p&gt;

&lt;p&gt;The existing accessibility stack on iOS is genuinely good. Apple's Live Captions (built into iOS 16+) work in many contexts. Speech-to-text apps abound. But almost all of them have the same architecture: capture audio, send it to a server, get back text. For someone with hearing loss, this works fine in casual settings. It does not work for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical appointments.&lt;/strong&gt; HIPAA-protected health information, often deeply personal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Therapy sessions.&lt;/strong&gt; Same reasoning, plus the person on the other side might object to being recorded by a cloud service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Family conversations.&lt;/strong&gt; Nobody wants a vendor harvesting their kid's voice or their spouse's medical complaints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Work meetings under NDA.&lt;/strong&gt; The lawyer didn't sign off on routing audio through someone else's datacenter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anywhere there's no internet.&lt;/strong&gt; Buses, trains, basements, planes, rural areas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The market for "live captions that respect your privacy" was — for years — basically non-existent. The reason was technical: doing speech recognition well on a phone, in real time, with speaker identification and translation, wasn't feasible. The models were too big, the CPUs too slow, the batteries too weak. In 2026 that ceiling lifted.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed: on-device ASR finally got good
&lt;/h2&gt;

&lt;p&gt;Three independent pieces of technology converged to make this viable on an iPhone:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/argmaxinc/WhisperKit" rel="noopener noreferrer"&gt;WhisperKit&lt;/a&gt;&lt;/strong&gt;. Argmax's optimized port of OpenAI's Whisper to the Apple Neural Engine. Whisper-small (240M parameters) runs in real time on any iPhone with an A14 or newer. Whisper-base is even faster. The accuracy is strikingly good — better than most cloud APIs for accented English and major non-English languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple's Translation framework&lt;/strong&gt;. Built into iOS 17.4+, fully on-device, supports 10+ languages including English ↔ Spanish, Portuguese, French, German, Mandarin, Japanese, Korean. Latency is sub-second per sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pyannote speaker diarization, ported to Core ML&lt;/strong&gt;. The piece that took the longest to get right.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are mine. The work was integrating them — making them run together on a single iPhone, in real time, with low enough latency that the captions actually keep up with the conversation, without melting the battery in 20 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;ClearCaps splits the computation across every accelerator the chip has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apple Neural Engine (ANE)&lt;/strong&gt;: Whisper-small for automatic speech recognition. Runs exclusively on the ANE so it doesn't fight the GPU for memory bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: Pyannote embedder for speaker diarization. The embedder produces 256-dim vectors for short audio chunks; the GPU is the right place because the operations are big matmuls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio DSP block&lt;/strong&gt;: noise suppression, automatic gain control, acoustic echo cancellation. Apple's built-in voice processing, hardware-accelerated, doesn't touch ANE or GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: Pyannote segmenter, clusterer, voice activity detection, audio resampling, and SwiftUI rendering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The split matters because if you naively run everything on the GPU, you bottleneck on memory bandwidth before you bottleneck on compute. By splitting across ANE + GPU + DSP, the chip's actual peak throughput becomes accessible. An iPhone 15 Pro or newer handles the full pipeline (ASR + diarization + UI) at ~30% CPU and ~15W package power. That's about half what watching a YouTube video draws.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard part: speaker diarization on-device
&lt;/h2&gt;

&lt;p&gt;Automatic speech recognition has been a solved problem for cloud services since 2022 and for on-device since Whisper-small dropped. &lt;em&gt;Diarization&lt;/em&gt; — figuring out who is speaking at any given moment — is much less mature.&lt;/p&gt;

&lt;p&gt;The state of the art on the cloud side is &lt;a href="https://github.com/pyannote/pyannote-audio" rel="noopener noreferrer"&gt;pyannote.audio&lt;/a&gt;, a fantastic open-source library by Hervé Bredin. It's PyTorch under the hood, and the pretrained models assume you have a workstation GPU and Python at runtime. Neither of which exists on an iPhone.&lt;/p&gt;

&lt;p&gt;Porting pyannote to run inside an iOS app required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Converting the segmenter and embedder to Core ML.&lt;/strong&gt; The segmenter neural net (a 1D-CNN that ingests audio and outputs voice-activity + speaker-change probability per frame) and the embedder (which produces a 256-dim vector per active speaker segment) both convert cleanly. The clusterer is pure Python and gets reimplemented in Swift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming the inference&lt;/strong&gt;. The pretrained pyannote models expect 10-second chunks. For live captioning, 10-second latency is unusable. We slide a 2-second window and re-cluster every 500ms. The clusters get stable after about 3-4 seconds of speech per speaker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling cold-start&lt;/strong&gt;. The first 2-3 seconds of any conversation have no diarization data. Captions show up immediately, just with a placeholder speaker label ("Speaker 1") until the clusterer locks on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming speakers.&lt;/strong&gt; The user can tap any speaker label and rename it. "Speaker 1" becomes "Doctor Rodríguez." The rename persists for the whole session and gets re-applied if the clusterer recovers the same speaker after a gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "did someone address me?" signal&lt;/strong&gt;. ClearCaps detects when a speaker directly addresses the user (questions tagged "&lt;em&gt;You&lt;/em&gt;" or "&lt;em&gt;Bruno&lt;/em&gt;") and triggers a haptic. The user doesn't have to stare at the screen — they can look at the person they're with and feel a buzz when something needs their attention. This came from talking to my dad: the worst part of hearing loss in conversation isn't missing words, it's missing when someone has just asked &lt;em&gt;you&lt;/em&gt; a question.&lt;/li&gt;
&lt;/ol&gt;
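&lt;p&gt;The streaming schedule from step 2 (a 2-second window, re-clustered every 500 ms) is easy to get subtly wrong, so here is a small Python sketch of just the windowing. It is an illustration, not the app's Swift code: the function name is hypothetical, and the segmenter, embedder, and clusterer that would consume each window are omitted.&lt;/p&gt;

```python
from typing import List, Tuple


def window_schedule(
    total_ms: int, window_ms: int = 2000, hop_ms: int = 500
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for every full window that fits in the audio so far."""
    last_start = total_ms - window_ms
    return [(start, start + window_ms) for start in range(0, last_start + 1, hop_ms)]


print(window_schedule(3000))  # prints [(0, 2000), (500, 2500), (1000, 3000)]
print(window_schedule(1500))  # prints []
```

&lt;p&gt;The empty result for 1.5 seconds of audio is the cold-start case from step 3: until the first full window exists there is nothing to cluster, so captions fall back to the placeholder speaker label.&lt;/p&gt;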

&lt;h2&gt;
  
  
  Why on-device matters for accessibility specifically
&lt;/h2&gt;

&lt;p&gt;I want to be careful here because accessibility tech often gets framed as a charity case, and that's the wrong frame. Hard-of-hearing people are paying customers. They have specific product requirements. They evaluate tools the same way anyone else evaluates tools.&lt;/p&gt;

&lt;p&gt;The privacy-first architecture isn't a feel-good add-on for accessibility users. It's a product requirement that surfaces specifically in this market:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical conversations.&lt;/strong&gt; A captioning app that requires uploading audio to a cloud service is incompatible with patient privacy expectations in most jurisdictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Family privacy.&lt;/strong&gt; Spouse discussing health symptoms over dinner. Kid asking about something embarrassing at school. The captioning user doesn't want that going into anyone's training dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The recipient's consent&lt;/strong&gt;. When you're using captions in a conversation, the other person hasn't consented to a cloud service capturing their voice. On-device captions sidestep this entirely — the audio never leaves &lt;em&gt;your&lt;/em&gt; phone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline reliability&lt;/strong&gt;. Hearing-loss users need captions most when they're &lt;em&gt;most stressed&lt;/em&gt;, which is often in environments where wifi is bad: hospitals, public transit, large crowded events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first time my dad used ClearCaps in a real conversation, the thing he commented on wasn't the accuracy — it was that it kept working when the wifi flickered. That's the architectural payoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI assistant on top
&lt;/h2&gt;

&lt;p&gt;ClearCaps ships with an optional AI layer on top of the captions, powered by a 3B-parameter LLM running through Apple Foundation Models on iOS 26+. The model does four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cleans up the transcript&lt;/strong&gt;. Whisper is great but it captures every "um" and "uh" and false start. The cleanup pass produces a readable version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarizes long sessions&lt;/strong&gt;. A 90-minute consultation becomes a one-page bullet summary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identifies speakers by context&lt;/strong&gt;. If "Doctor Rodríguez" appears in the conversation naming themselves, the assistant infers that label automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual context&lt;/strong&gt;. Take a photo during the conversation (a whiteboard, a prescription, a slide) and the LLM describes it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this runs on-device. The LLM is Apple's, the framework is Foundation Models, and there's a privacy manifest in the app bundle that auditors can verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ClearCaps falls short (and where it's heading)
&lt;/h2&gt;

&lt;p&gt;Honest assessment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy accents&lt;/strong&gt;. Whisper-small degrades on heavy regional accents in Spanish (rural Caribbean, Andalusian) and English (Glaswegian, deep Southern US). Whisper-medium would help but doubles the memory footprint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crosstalk in groups bigger than 4&lt;/strong&gt;. Pyannote handles 2-4 speakers cleanly. Above that, clusters merge and split.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sign-language input&lt;/strong&gt;. Not in scope yet. ASL/LSE/LSA via camera is on the roadmap but the recognition stack isn't there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iPad / Mac versions&lt;/strong&gt;. iPhone only at launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The product
&lt;/h2&gt;

&lt;p&gt;ClearCaps is on the App Store. iOS 26+, free download with a paid AI tier ($2.99/month or $19.99/year). The captioning itself — ASR + diarization + translation — is free forever.&lt;/p&gt;

&lt;p&gt;I made it free for the captioning because of who the users are. Hard-of-hearing people are often on fixed incomes (older population), and the captioning is a basic accessibility tool that I felt strongly should be available without payment. The AI features are nice-to-have, not need-to-have, and that's where the monetization lives.&lt;/p&gt;




&lt;p&gt;— Bruno Galtranch, founder, &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis LLC&lt;/a&gt;. If you have feedback or a use case we missed: &lt;a href="mailto:contact@astrolexis.space"&gt;contact@astrolexis.space&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>ios</category>
      <category>ai</category>
      <category>whisper</category>
    </item>
    <item>
      <title>Apple Silicon as a Serious AI Dev Box: What an M4 Max Actually Does With a 70B Model</title>
      <dc:creator>GaltRanch</dc:creator>
      <pubDate>Thu, 21 May 2026 14:22:15 +0000</pubDate>
      <link>https://dev.to/galtranch/apple-silicon-as-a-serious-ai-dev-box-what-an-m4-max-actually-does-with-a-70b-model-316b</link>
      <guid>https://dev.to/galtranch/apple-silicon-as-a-serious-ai-dev-box-what-an-m4-max-actually-does-with-a-70b-model-316b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the &lt;a href="https://astrolexis.space/blog/apple-silicon-ai-dev-box/" rel="noopener noreferrer"&gt;AstroLexis blog&lt;/a&gt;. Cross-posted here for the community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're shopping for an LLM workstation in 2026, the default mental model is still "NVIDIA GPU, lots of VRAM, big tower." That's not wrong, but it's also not the only correct answer anymore. Apple Silicon — M3, M4, M5 — has quietly become one of the best local AI development boxes on the market, and almost nobody outside of MLX twitter is talking about the actual numbers. Here's what an M4 Max really does, where it crushes NVIDIA, where it doesn't, and why I built &lt;a href="https://astrolexis.space/siliconmon" rel="noopener noreferrer"&gt;SiliconMon&lt;/a&gt; to see what's happening underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thesis: unified memory changes the math
&lt;/h2&gt;

&lt;p&gt;The single architectural decision that makes Apple Silicon competitive for AI workloads is unified memory. On a typical NVIDIA system, the model weights live in dedicated GPU VRAM, separate from system RAM, connected by a PCIe bus. On Apple Silicon, there's one pool of memory — say, 128 GB on an M4 Max — and the CPU, GPU, and Neural Engine all see the same physical pages. No copy between host and device, no PCIe bottleneck on transfers, no juggling layers between cards.&lt;/p&gt;

&lt;p&gt;For LLM inference, this matters more than people initially expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can load a 70B parameter model in 4-bit quantization (~40 GB) directly into the unified pool, addressable by the GPU, without renting an enterprise card.&lt;/li&gt;
&lt;li&gt;Context window expansion is cheap. Going from 4K to 32K context tokens doesn't require swapping or specialized layer offloading — it just uses more of the same pool.&lt;/li&gt;
&lt;li&gt;Multimodal workloads (vision encoder + LLM + speech) coexist in one address space. ClearCaps' on-device captioning pipeline runs WhisperKit, an LLM, and a Core ML port of pyannote speaker diarization on the same chip with no inter-device coordination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off: total memory bandwidth on Apple Silicon (around 400-800 GB/s depending on chip tier) is below a top-tier NVIDIA card (HBM3 cards push north of 3 TB/s). For pure inference throughput on small models that fit easily in a 4090, NVIDIA still wins. For anything larger than ~20B parameters where you'd otherwise need multi-GPU setups, Apple's unified pool starts looking very attractive.&lt;/p&gt;
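&lt;p&gt;The arithmetic behind those figures fits in a few lines of Python. The sketch below leans on a common rule of thumb that is an assumption here, not something measured in this post: during decode, each generated token has to read all the weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. It ignores KV-cache traffic, compute limits, and framework overhead.&lt;/p&gt;

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


print(weights_gb(70, 4))              # prints 35.0
print(decode_ceiling_tok_s(400, 40))  # prints 10.0
print(decode_ceiling_tok_s(800, 40))  # prints 20.0
```

&lt;p&gt;The raw 35 GB lands at roughly 40 GB once quantization scales and runtime buffers are added. At 400 to 800 GB/s, a 40 GB model tops out somewhere around 10 to 20 tokens per second under this model, which is why bandwidth, not compute, is the number to watch for large-model decode.&lt;/p&gt;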

&lt;h2&gt;
  
  
  Real numbers on M-series for LLM inference
&lt;/h2&gt;

&lt;p&gt;The tokens-per-second numbers depend heavily on quantization, framework (MLX vs llama.cpp), and whether you're measuring prefill or decode. Here's a rough baseline for decode speed on the most common configurations, with 4-bit quantized weights running on MLX:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chip&lt;/th&gt;
&lt;th&gt;Unified RAM&lt;/th&gt;
&lt;th&gt;7B model&lt;/th&gt;
&lt;th&gt;13B model&lt;/th&gt;
&lt;th&gt;30B model&lt;/th&gt;
&lt;th&gt;70B model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M2 Pro&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;~45 tok/s&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;~8 tok/s&lt;/td&gt;
&lt;td&gt;not viable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M3 Max&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;~75 tok/s&lt;/td&gt;
&lt;td&gt;~38 tok/s&lt;/td&gt;
&lt;td&gt;~16 tok/s&lt;/td&gt;
&lt;td&gt;~5 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M4 Max&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;td&gt;~110 tok/s&lt;/td&gt;
&lt;td&gt;~55 tok/s&lt;/td&gt;
&lt;td&gt;~28 tok/s&lt;/td&gt;
&lt;td&gt;~10 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M3 Ultra&lt;/td&gt;
&lt;td&gt;192 GB&lt;/td&gt;
&lt;td&gt;~130 tok/s&lt;/td&gt;
&lt;td&gt;~70 tok/s&lt;/td&gt;
&lt;td&gt;~36 tok/s&lt;/td&gt;
&lt;td&gt;~14 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For interactive use, anything above 15 tokens/second feels "instant" to a human reader. That means an M3 Max comfortably handles 30B models for interactive chat, and an M4 Max handles 70B models if you're patient on long generations.&lt;/p&gt;

&lt;p&gt;The number that matters most for indie developers: a base M4 Pro Mac mini at &lt;strong&gt;$1,400&lt;/strong&gt; with 24 GB unified memory runs quantized 13B models at 50+ tokens/second. That's a usable AI workstation for the price of a mid-range laptop, with zero noise, zero rack space, and 20W idle power draw.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Apple Silicon wins
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Models that don't fit on a single consumer NVIDIA card.&lt;/strong&gt; A 70B model in 4-bit needs ~40 GB. The biggest consumer NVIDIA card (5090) ships with 32 GB. You can split across multiple cards, but inter-card communication becomes the bottleneck. M4 Max with 128 GB swallows the whole model and has headroom for 32K context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power efficiency.&lt;/strong&gt; An M4 Max under sustained inference load draws 30-50W. The equivalent NVIDIA workstation can pull 600-900W. If you're paying for electricity (anyone running 24/7 self-hosted inference) the OpEx delta is enormous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acoustic profile.&lt;/strong&gt; Mac Studio is silent. Mac mini is silent. A workstation with two RTX cards is a lawnmower. For anyone working from home, this is non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-the-box experience.&lt;/strong&gt; macOS + MLX + Homebrew + Ollama installs in twenty minutes and just works. CUDA-on-Linux remains a persistent source of pain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal workflows.&lt;/strong&gt; Unified memory means you can pipeline speech-to-text, LLM, and TTS without ever copying intermediate buffers across PCIe.&lt;/li&gt;
&lt;/ol&gt;
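
&lt;p&gt;The "~40 GB for a 70B model in 4-bit" figure can be sanity-checked with a small Python sketch. The 15% overhead factor is an assumption (quantization scales and runtime buffers), and the helper names are hypothetical; KV cache for long contexts is not modeled.&lt;/p&gt;

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   memory_gb: float, overhead_fraction: float = 0.15) -> bool:
    """True if weights plus an assumed overhead fit in memory_gb.

    overhead_fraction is an assumption, not a measured value; it does not
    account for the KV cache that long contexts need.
    """
    needed_gb = weight_footprint_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return memory_gb >= needed_gb


if __name__ == "__main__":
    print(weight_footprint_gb(70, 4))    # raw 4-bit weights for a 70B model
    print(fits_in_memory(70, 4, 32))     # single 32 GB card
    print(fits_in_memory(70, 4, 128))    # 128 GB unified memory
```

&lt;p&gt;Raw 4-bit weights for 70B parameters come to 35 GB; with the assumed overhead that lands near 40 GB, which is why a 32 GB card cannot hold the model on its own.&lt;/p&gt;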

&lt;h2&gt;
  
  
  Where Apple Silicon loses
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Training and fine-tuning.&lt;/strong&gt; The Mac is great for inference, but the training stack (PyTorch on MPS, MLX training APIs) is still meaningfully behind CUDA. Anything beyond LoRA on small models is faster on NVIDIA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput per dollar at scale.&lt;/strong&gt; If you're running production serving with hundreds of concurrent requests, a rack of L40S cards beats a fleet of Mac Studios on raw cost-per-token. Apple wins for development; NVIDIA wins for production serving above a certain volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software ecosystem for very new research.&lt;/strong&gt; Cutting-edge research code lands on CUDA first. The Mac port arrives weeks to months later, sometimes with reduced functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling visibility.&lt;/strong&gt; NVIDIA gives you &lt;code&gt;nvidia-smi&lt;/code&gt;, &lt;code&gt;nvtop&lt;/code&gt;, NVIDIA Nsight, and other profiling tools that work on day one. macOS gives you Activity Monitor and a vague sense of where your watts are going. This last gap is why I ended up writing SiliconMon.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What you can't see (and why SiliconMon exists)
&lt;/h2&gt;

&lt;p&gt;When you fire up Ollama, llama.cpp, MLX, LM Studio, ComfyUI, or vLLM on a Mac, the operating system shows you almost nothing useful. Activity Monitor reports CPU% per process, but the GPU and Neural Engine residency are invisible. Memory pressure is a single colored bar. Power draw is hidden behind &lt;code&gt;powermetrics&lt;/code&gt;, which requires sudo and outputs an unreadable wall of text.&lt;/p&gt;

&lt;p&gt;I'd been running multiple local LLM stacks for over a year and had no way to answer simple questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When I run Ollama and ComfyUI simultaneously, are they sharing the GPU or fighting for it?&lt;/li&gt;
&lt;li&gt;Is my 70B model actually using the Neural Engine, or is it entirely on the GPU?&lt;/li&gt;
&lt;li&gt;What's the package power draw during inference vs idle? Am I thermal throttling on a long generation?&lt;/li&gt;
&lt;li&gt;Why does the system feel sluggish — am I swapping unified memory, or is something else going on?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Existing tools each gave fragments. &lt;code&gt;asitop&lt;/code&gt; wraps &lt;code&gt;powermetrics&lt;/code&gt; (so it needs sudo), is command-line only, and goes unmaintained for long stretches. &lt;code&gt;macmon&lt;/code&gt; and &lt;code&gt;mactop&lt;/code&gt; are also terminal-only. Stats and iStat Menus are general-purpose and don't know what an MLX process is. None of them detect "this Python process is actually serving Llama 4 via vLLM" or "this is Ollama loading a Qwen3 quantization."&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://astrolexis.space/siliconmon" rel="noopener noreferrer"&gt;SiliconMon&lt;/a&gt;. It does three things the others don't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI workload detection.&lt;/strong&gt; SiliconMon recognizes the canonical names and command-line patterns of MLX, Ollama, llama.cpp, LM Studio, ComfyUI, vLLM, and Hugging Face's transformers stack. When you see "Inference 47% • Ollama: qwen3-32b" in the menu bar, that's because the detector matched the process name, command line arguments, and loaded library set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOReport-based residency.&lt;/strong&gt; Real CPU/GPU/ANE residency numbers from Apple's IOReport private framework, the same source Apple uses internally. Sampled once per second, no sudo required, sub-1% CPU footprint at idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy unit correctness across chip generations.&lt;/strong&gt; M5 Max ships IOReport channels with mixed energy units — millijoules, nanojoules, microjoules — in the same response. Getting the conversion wrong is a 30× error on power numbers. SiliconMon has explicit per-channel unit handling and a regression test for every M-series chip we support.&lt;/li&gt;
&lt;/ol&gt;
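
&lt;p&gt;To illustrate the per-channel unit handling from the third point, here is a minimal Python sketch. It is not SiliconMon's code: the unit label strings (&lt;code&gt;mJ&lt;/code&gt;, &lt;code&gt;uJ&lt;/code&gt;, &lt;code&gt;nJ&lt;/code&gt;) and the &lt;code&gt;(energy_delta, unit_label)&lt;/code&gt; sample shape are assumptions made for the example.&lt;/p&gt;

```python
# Multipliers from a unit label to joules. The label strings are assumed.
UNIT_TO_JOULES = {"mJ": 1e-3, "uJ": 1e-6, "nJ": 1e-9}


def channel_watts(energy_delta: float, unit_label: str, interval_s: float) -> float:
    """Average power for one channel over one sampling interval."""
    if unit_label not in UNIT_TO_JOULES:
        # Refuse to guess: a wrong unit skews the result by orders of magnitude.
        raise ValueError(f"unknown energy unit: {unit_label!r}")
    if not interval_s > 0:
        raise ValueError("interval_s must be positive")
    return energy_delta * UNIT_TO_JOULES[unit_label] / interval_s


def package_watts(samples, interval_s: float) -> float:
    """Sum per-channel power; samples is an iterable of (energy_delta, unit_label)."""
    return sum(channel_watts(delta, unit, interval_s) for delta, unit in samples)


if __name__ == "__main__":
    # Hypothetical one-second sample with three channels in three different units.
    samples = [(5000, "mJ"), (2_000_000, "uJ"), (1_000_000_000, "nJ")]
    print(f"{package_watts(samples, 1.0):.2f} W")
```

&lt;p&gt;Converting each channel with its own unit before summing, and failing loudly on an unknown label, keeps a mixed-unit response from silently producing a wrong total.&lt;/p&gt;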

&lt;h2&gt;
  
  
  How to think about buying a Mac for local AI
&lt;/h2&gt;

&lt;p&gt;Rough buying guide based on what I'd actually recommend to friends asking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hobbyist / curious&lt;/strong&gt;: M4 Pro Mac mini, 24 GB unified, $1,400. Runs 7B and 13B models smoothly. Won't handle 30B+ comfortably. Best dollar-for-LLM machine on the market for non-pros.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer running local LLMs daily&lt;/strong&gt;: M3 Max MacBook Pro 14"/16" with 64 GB unified, $3,200-3,600. Handles 30B models for interactive use, fine for 70B if you're patient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serious indie / small team self-hosted AI&lt;/strong&gt;: M3 Ultra Mac Studio with 192 GB unified, $5,500-7,500. Runs 70B comfortably and 120B+ models in quantized form. Silent, sits under a desk, draws less power than a microwave. Sweet spot for self-hosted AI assistants like &lt;a href="https://kulvex.ai" rel="noopener noreferrer"&gt;Kulvex AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production / training&lt;/strong&gt;: Use NVIDIA. The Mac isn't the right tool for serving at scale or training large models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software stack: what to install on day one
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Homebrew (if you don't have it)&lt;/span&gt;
/bin/bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Ollama — easiest entry point&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama serve &amp;amp;
ollama run qwen3:14b

&lt;span class="c"&gt;# MLX — for Python-side LLM work&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mlx mlx-lm
python &lt;span class="nt"&gt;-m&lt;/span&gt; mlx_lm.generate &lt;span class="nt"&gt;--model&lt;/span&gt; mlx-community/Llama-4-7B-Instruct-4bit &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Hello, world"&lt;/span&gt;

&lt;span class="c"&gt;# llama.cpp&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp
llama-server &lt;span class="nt"&gt;-hf&lt;/span&gt; Qwen/Qwen3-32B-GGUF

&lt;span class="c"&gt;# LM Studio — GUI alternative&lt;/span&gt;
&lt;span class="c"&gt;# Download from https://lmstudio.ai&lt;/span&gt;

&lt;span class="c"&gt;# SiliconMon — see what's actually happening&lt;/span&gt;
open https://astrolexis.space/siliconmon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest take
&lt;/h2&gt;

&lt;p&gt;If you're already invested in CUDA, building Linux workstations, and serving inference at scale: Apple Silicon is probably not for you, and that's fine. NVIDIA's lead on production infrastructure is real and not closing soon.&lt;/p&gt;

&lt;p&gt;If you're an indie developer, a researcher who needs to iterate locally, a security-conscious team that can't ship code to the cloud, or anyone who values a quiet, low-power, easy-to-set-up AI workstation — Apple Silicon is dramatically better than its reputation. The M4 generation is the inflection point. The M5 Max coming later this year extends the lead.&lt;/p&gt;

&lt;p&gt;Buy the unified memory, not the cores. If you're agonizing between the cheaper config and the next tier up, always go for more RAM. Models grow, context windows grow, and you can't upgrade Mac memory after purchase.&lt;/p&gt;




&lt;p&gt;— Bruno Galtranch, founder, &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis LLC&lt;/a&gt;. Questions on Apple Silicon for AI: &lt;a href="mailto:contact@astrolexis.space"&gt;contact@astrolexis.space&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>apple</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>mac</category>
    </item>
    <item>
      <title>Static Analysis Without Sending Your Code to the Cloud: Building KCode</title>
      <dc:creator>GaltRanch</dc:creator>
      <pubDate>Thu, 21 May 2026 14:13:20 +0000</pubDate>
      <link>https://dev.to/galtranch/static-analysis-without-sending-your-code-to-the-cloud-building-kcode-19im</link>
      <guid>https://dev.to/galtranch/static-analysis-without-sending-your-code-to-the-cloud-building-kcode-19im</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the &lt;a href="https://astrolexis.space/blog/static-analysis-local-llm-kcode/" rel="noopener noreferrer"&gt;AstroLexis blog&lt;/a&gt;. Cross-posted here for the community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every modern SAST tool — Snyk, SonarQube Cloud, GitHub Advanced Security, Semgrep AppSec Platform — asks the same thing: ship your source code to us, we'll tell you what's wrong with it. For a non-trivial number of teams, that's a non-starter. Here's how we built &lt;a href="https://kulvex.ai/kcode" rel="noopener noreferrer"&gt;KCode&lt;/a&gt;, the static analysis tool that runs the LLM verifier on your own hardware, and what we learned about getting machine-grade precision out of a local model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The day SAST became my problem
&lt;/h2&gt;

&lt;p&gt;I'm Bruno, founder of &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis&lt;/a&gt;. About a year before we started building KCode, I was the only engineer on a codebase whose source could not be uploaded to any third party. The reasons were the usual mix: enterprise customers with NDAs that explicitly forbade third-party SaaS code scanning, defense-adjacent contracts, jurisdictional restrictions that made any non-EU data residency a paperwork nightmare. The work was real, the policies were real, and the tooling we needed wasn't.&lt;/p&gt;

&lt;p&gt;The market for "static analysis you can actually deploy on-prem" turned out to be remarkably bad. &lt;strong&gt;Snyk&lt;/strong&gt;, &lt;strong&gt;SonarQube Cloud&lt;/strong&gt;, and &lt;strong&gt;GitHub Advanced Security&lt;/strong&gt; are SaaS-first. The on-prem versions exist but are priced for Fortune 500 and ship with the kind of installation playbook that needs a dedicated DevSecOps engineer to maintain. &lt;strong&gt;Semgrep&lt;/strong&gt; has an open-source core, which is great, but the rule set that catches real bugs lives in their commercial platform. Local &lt;strong&gt;linters&lt;/strong&gt; (ESLint, Pylint, Bandit, gosec) catch surface-level issues but miss anything that requires reasoning across files or distinguishing between "this looks scary" and "this actually exploits."&lt;/p&gt;

&lt;p&gt;And then LLMs arrived and complicated everything. Suddenly you could ask Claude or GPT-4 about a file and get genuinely insightful security analysis. The catch: &lt;em&gt;that file just went to someone else's datacenter&lt;/em&gt;. For the work I was doing, that wasn't a tradeoff — it was a deal-breaker.&lt;/p&gt;

&lt;p&gt;So we built the tool we needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What KCode actually does
&lt;/h2&gt;

&lt;p&gt;The architecture is intentionally boring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic pre-filter.&lt;/strong&gt; 414 hand-curated patterns across 20+ languages (C, C++, Rust, Go, Python, TypeScript, JavaScript, Java, Kotlin, Swift, Ruby, PHP, Bash, SQL, YAML, HCL, and more). 372 of them are regex, 27 are AST-based for the rules that need structural awareness (control flow, taint, scope). The patterns generate &lt;em&gt;candidates&lt;/em&gt;: files and line ranges that look like they might be a problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM verifier.&lt;/strong&gt; The candidates get fed to a local LLM (we recommend a 24GB+ GPU running a 30B-parameter model in 4-bit quantization). The model's job is to confirm or reject: "is this candidate actually exploitable given the surrounding code, or is it a false positive?" The verifier sees only the relevant code snippets — it doesn't need the whole repo in context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output.&lt;/strong&gt; SARIF format for CI integration, Markdown reports for humans, optional PDF for stakeholders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Two stages, deterministic plus probabilistic. The cleverness is in the patterns and in how we prompt the verifier — not in trying to make the LLM do everything from scratch.&lt;/p&gt;
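
&lt;p&gt;The two-stage shape can be sketched in a few lines of Python. This is an illustration under stated assumptions, not KCode's implementation: the two regex patterns, the &lt;code&gt;Candidate&lt;/code&gt; type, and the &lt;code&gt;verifier&lt;/code&gt; callable (a stand-in for the local LLM call) are all hypothetical.&lt;/p&gt;

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical patterns for illustration; not taken from KCode's rule set.
PATTERNS = {
    "unbounded-strcpy": re.compile(r"\bstrcpy\s*\("),
    "unbounded-gets": re.compile(r"\bgets\s*\("),
}


@dataclass
class Candidate:
    rule: str
    line: int      # 1-based line number of the match
    snippet: str   # matched line plus a little surrounding context


def prefilter(source: str, context: int = 2) -> list[Candidate]:
    """Stage 1: deterministic scan that only proposes candidates."""
    lines = source.splitlines()
    found = []
    for idx, text in enumerate(lines):
        for rule, pattern in PATTERNS.items():
            if pattern.search(text):
                start = max(0, idx - context)
                snippet = "\n".join(lines[start: idx + context + 1])
                found.append(Candidate(rule, idx + 1, snippet))
    return found


def audit(source: str, verifier: Callable[[Candidate], bool]) -> list[Candidate]:
    """Stage 2: keep only the candidates the verifier confirms."""
    return [c for c in prefilter(source) if verifier(c)]
```

&lt;p&gt;The verifier only ever receives the snippet attached to a candidate, which is what keeps the context it needs small.&lt;/p&gt;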

&lt;p&gt;&lt;strong&gt;Benchmarks on the SAST validation suite:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% precision&lt;/li&gt;
&lt;li&gt;92.3% recall&lt;/li&gt;
&lt;li&gt;F1 score: 0.96&lt;/li&gt;
&lt;li&gt;414 hand-curated patterns across 20+ languages&lt;/li&gt;
&lt;/ul&gt;
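
&lt;p&gt;For readers who want to check how the three numbers relate: F1 is the harmonic mean of precision and recall. A short Python sketch using the figures above:&lt;/p&gt;

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported above.
    print(round(f1_score(1.0, 0.923), 2))
```

&lt;p&gt;With 100% precision and 92.3% recall this rounds to the reported 0.96.&lt;/p&gt;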

&lt;h2&gt;
  
  
  Why the architecture matters
&lt;/h2&gt;

&lt;p&gt;People who haven't shipped a SAST tool tend to underestimate how much of the difficulty is false positive management. A scanner that finds 500 issues, of which 30 are real, doesn't actually help anyone. Developers stop opening the report after the third Tuesday. The signal-to-noise ratio kills adoption faster than missed bugs do.&lt;/p&gt;

&lt;p&gt;This is where the local LLM earns its keep. Regex and AST patterns can identify &lt;em&gt;shape&lt;/em&gt; — "this function calls strcpy with a user-controlled buffer", "this SQL string interpolates a variable" — but they can't reason about &lt;em&gt;context&lt;/em&gt;. Does the buffer get bounded earlier? Is the variable sanitized at the controller layer? Is the entire function only reachable from a test fixture?&lt;/p&gt;

&lt;p&gt;The LLM verifier handles exactly that contextual judgment, and it's good at it. In our benchmarks, the verifier rejects roughly 60-75% of the candidates that the deterministic pre-filter raises. The ones that survive are the real findings.&lt;/p&gt;

&lt;p&gt;Crucially, the LLM never has to find the bug from scratch. The deterministic pre-filter narrows the search space from "scan a million lines of code" to "evaluate 800 candidates." That makes the inference budget manageable: &lt;strong&gt;a full audit of a 500K-line codebase runs in about 10,000 tokens of verifier input&lt;/strong&gt;, not 300K+. We can run that on a single consumer GPU in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark that mattered: NASA IDF
&lt;/h2&gt;

&lt;p&gt;Public benchmarks are great for marketing slides. Real validation comes from running against actual codebases written by people who weren't grading themselves.&lt;/p&gt;

&lt;p&gt;We ran KCode against &lt;a href="https://github.com/nasa/IDF" rel="noopener noreferrer"&gt;NASA's IDF&lt;/a&gt; — a piece of flight-software-adjacent open source. The IDF repo isn't toy code: it's instrumentation infrastructure used in real telemetry pipelines, written in C++ and Python, maintained by people whose job titles include "Senior Software Engineer, Flight Systems".&lt;/p&gt;

&lt;p&gt;KCode opened &lt;strong&gt;PR #107&lt;/strong&gt; against the repo, identifying &lt;strong&gt;28 bugs&lt;/strong&gt; across the codebase. The breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buffer overflows from unchecked string operations (the C++ classics).&lt;/li&gt;
&lt;li&gt;Missing null checks on pointers returned from allocation paths.&lt;/li&gt;
&lt;li&gt;Integer truncation in size calculations that would silently corrupt under specific input ranges.&lt;/li&gt;
&lt;li&gt;Race conditions in concurrent state mutation that the linter had missed because the relevant globals were declared three files away.&lt;/li&gt;
&lt;li&gt;A handful of Python issues around exception handling that swallowed errors silently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The NASA team merged the changes. That's the validation that matters: real bugs, in real production-adjacent code, accepted by maintainers who know the codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we got wrong (and how we fixed it)
&lt;/h2&gt;

&lt;p&gt;The first version of KCode was a mess. The verifier was hallucinating. The pre-filter was over-firing. Our F1 on the validation suite was a depressing 0.71 for months. Three things turned it around:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cascade verification
&lt;/h3&gt;

&lt;p&gt;A single LLM verifier has a measurable false-positive rate. We could either (a) lower the temperature and pray, or (b) chain two verifiers with different model families and only accept findings both confirm. We picked (b). The current production setup runs Grok + Claude Opus in an ensemble: &lt;em&gt;both&lt;/em&gt; have to agree the candidate is real before it lands in the report. False positives dropped by 60%. The cost is roughly 2× verifier tokens, which on local hardware costs nothing meaningful.&lt;/p&gt;
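
&lt;p&gt;The agreement rule itself is small. A minimal Python sketch, assuming each verifier is wrapped as a callable that takes a candidate snippet and returns a boolean (the &lt;code&gt;Verifier&lt;/code&gt; alias and function name are hypothetical, and the model calls are left out):&lt;/p&gt;

```python
from typing import Callable, Sequence

# Hypothetical wrapper type: takes a candidate snippet, returns "is it real?"
Verifier = Callable[[str], bool]


def cascade_confirms(snippet: str, verifiers: Sequence[Verifier]) -> bool:
    """Accept a candidate only if every verifier independently confirms it.

    all() short-circuits, so later verifiers are skipped once one rejects.
    """
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return all(verify(snippet) for verify in verifiers)
```

&lt;p&gt;Because evaluation stops at the first rejection, the second verifier only runs on candidates the first one confirmed, so the extra token cost is bounded by the first verifier's acceptance rate.&lt;/p&gt;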

&lt;h3&gt;
  
  
  2. Output filter for "prompt rules miss"
&lt;/h3&gt;

&lt;p&gt;The LLM verifier will occasionally produce output that &lt;em&gt;looks like&lt;/em&gt; a valid finding but is structurally malformed for SARIF — wrong line numbers, missing severity, weird character escaping. We built a strict output filter that rejects malformed verifier output and re-prompts. This sounds boring; it's actually one of the most load-bearing pieces of the system. Without it, ~3% of findings showed up as garbage. With it, the SARIF output is parseable by every downstream tool we've tried (GitHub Code Scanning, SonarQube import, custom dashboards).&lt;/p&gt;
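
&lt;p&gt;A strict filter of this kind can be sketched as validate-then-retry. The Python below is an assumption-laden illustration, not KCode's filter: it supposes the verifier replies with a flat JSON object carrying &lt;code&gt;line&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, and &lt;code&gt;message&lt;/code&gt;, and &lt;code&gt;ask&lt;/code&gt; is a hypothetical callable that re-prompts the model each time it is called.&lt;/p&gt;

```python
import json
from typing import Callable, Optional

# Values SARIF allows for a result's "level" property.
SARIF_LEVELS = {"none", "note", "warning", "error"}


def parse_finding(raw: str, total_lines: int) -> Optional[dict]:
    """Return the finding as a dict if it is well-formed, else None."""
    try:
        finding = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(finding, dict):
        return None
    line = finding.get("line")
    if not isinstance(line, int) or isinstance(line, bool):
        return None
    if not (line >= 1 and total_lines >= line):
        return None  # line number falls outside the scanned file
    if finding.get("level") not in SARIF_LEVELS:
        return None
    message = finding.get("message")
    if not isinstance(message, str) or not message.strip():
        return None
    return finding


def verified_finding(ask: Callable[[], str], total_lines: int,
                     max_attempts: int = 3) -> Optional[dict]:
    """Call ask() again until the output validates or attempts run out."""
    for _ in range(max_attempts):
        finding = parse_finding(ask(), total_lines)
        if finding is not None:
            return finding
    return None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; after the last attempt, rather than passing a malformed finding through, is what keeps the report parseable downstream.&lt;/p&gt;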

&lt;h3&gt;
  
  
  3. The "audit your auditor" week
&lt;/h3&gt;

&lt;p&gt;For one full week, we ran KCode against &lt;em&gt;itself&lt;/em&gt; and another tool (&lt;a href="https://astrolexis.space/inquisitor" rel="noopener noreferrer"&gt;Inquisitor&lt;/a&gt;, our agent QA daemon) against KCode. The goal was to find every silent failure in our own pipeline before customers did. Inquisitor surfaced &lt;strong&gt;8+ silent-failure bugs in the first week&lt;/strong&gt;: hallucinated tool results that propagated through the pipeline, exit-code-0 hangs that no human or test suite had caught, edge cases where verifier rejection was masked as success. Every one of those is now a test case in our CI.&lt;/p&gt;

&lt;p&gt;If you ship developer tooling, audit your auditor. It's the highest-leverage week of QA you can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to install and use it
&lt;/h2&gt;

&lt;p&gt;KCode is distributed as binaries (Linux x64/ARM64, macOS Apple Silicon) and an npm package. Three install paths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option A: one-line install (recommended for local use)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://kulvex.ai/kcode/install.sh | sh

&lt;span class="c"&gt;# Option B: npm&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @astrolexisai/kcode

&lt;span class="c"&gt;# Option C: GitHub Action (YAML workflow step, not a shell command; drop into .github/workflows)&lt;/span&gt;
- uses: AstrolexisAI/kcode-action@v1
  with:
    target: ./src
    severity: medium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CI integration, the GitHub Action publishes SARIF to GitHub Code Scanning, which means the findings show up in the Security tab and as inline PR comments. No additional dashboard required.&lt;/p&gt;

&lt;p&gt;For local development, &lt;code&gt;kcode scan ./src --verifier-model qwen3.6-heretic&lt;/code&gt; runs a full pass and writes the report to stdout. If you have a Mac with 32GB+ unified memory, MLX serves the verifier directly. If you have a GPU server, point KCode at any OpenAI-compatible endpoint serving the model you want.&lt;/p&gt;

&lt;p&gt;Free tier is permissive: full feature set, no source-code upload, you bring your own model. &lt;strong&gt;Pro at $19/month&lt;/strong&gt; adds priority pattern updates, the curated weekly verifier model release, and access to the cascade ensemble pre-configured. &lt;a href="https://kulvex.ai/kcode" rel="noopener noreferrer"&gt;Pricing details and binaries&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest part: where we are with revenue
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend KCode is a runaway hit. Here's where we actually are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Revenue: $0, with no confirmed Pro subscribers&lt;/strong&gt; as of this writing. The free tier has users (actual installs, actual scans, actual SARIF reports landing in CI), but Pro conversion hasn't started.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 goal&lt;/strong&gt;: 10 paying subs &lt;em&gt;or&lt;/em&gt; 2 paid audit engagements. That's the bar we set for "this is a real product."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What we know works&lt;/strong&gt;: the technical core. Precision is real, the patterns are good, the verifier doesn't hallucinate, the SARIF output is clean. The bugs we found in NASA's code weren't a one-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What we're testing&lt;/strong&gt;: whether the buyer who can't ship code to Snyk actually exists in the volume we hope. Our hypothesis is yes — defense, healthcare, EU SaaS, anyone with GDPR data residency, anyone with NDA constraints. We're going to find out over the next two quarters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm sharing this because the indie software world is full of "we're crushing it" posts that don't match the financial reality, and that makes it harder for anyone building something legitimate to talk straight. KCode is a real tool that solves a real problem. We don't yet know if it'll be a business. That's where we are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;p&gt;If your team is in any of these buckets, KCode is built for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have source code that contractually cannot leave your infrastructure. Defense, healthcare, financial services with strict residency.&lt;/li&gt;
&lt;li&gt;You run on-prem CI and the SaaS SAST tools don't ship a self-hosted edition you can actually afford.&lt;/li&gt;
&lt;li&gt;You've tried Snyk/SonarQube/GHAS and find the noise level untenable. You want a tool that fires less and lands more.&lt;/li&gt;
&lt;li&gt;You're philosophically opposed to your code training someone else's model. Reasonable position.&lt;/li&gt;
&lt;li&gt;You're a security consultant doing one-off engagements and want a tool that runs on your laptop without phoning home.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your team is happily on a SaaS SAST and your auditors don't care, KCode is probably not for you. That's fine. We're not trying to displace the SaaS market — we're serving the chunk of it that can't use SaaS at all.&lt;/p&gt;




&lt;p&gt;— Bruno Galtranch, founder, &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis LLC&lt;/a&gt;. If you're evaluating KCode for your team or want to talk about a paid audit engagement: &lt;a href="mailto:contact@astrolexis.space"&gt;contact@astrolexis.space&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>sast</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why We Run LLMs On-Device in 2026</title>
      <dc:creator>GaltRanch</dc:creator>
      <pubDate>Thu, 21 May 2026 14:13:19 +0000</pubDate>
      <link>https://dev.to/galtranch/why-we-run-llms-on-device-in-2026-1bbh</link>
      <guid>https://dev.to/galtranch/why-we-run-llms-on-device-in-2026-1bbh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the &lt;a href="https://astrolexis.space/blog/why-on-device-llms-2026/" rel="noopener noreferrer"&gt;AstroLexis blog&lt;/a&gt;. Cross-posted here for the community.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For most of the last three years, "AI" has meant calling someone else's API. Your prompt leaves your machine, hits a datacenter, and a response comes back. In 2026 that's no longer the only sensible architecture. Here's the case for running LLMs on your own hardware — and what we ship at AstroLexis to make it actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cloud isn't the only place for AI anymore
&lt;/h2&gt;

&lt;p&gt;When OpenAI shipped GPT-3.5 in late 2022, running an LLM locally was an exotic hobby. The smallest useful models needed a workstation, the tooling barely worked outside a research lab, and inference was slow enough that real-time use was out of reach. The cloud was the only practical option.&lt;/p&gt;

&lt;p&gt;That's not the world we live in anymore. As of mid-2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Apple M4 Pro Mac mini ($1,400) runs a quantized 30B parameter model at 25-40 tokens/second using &lt;a href="https://github.com/ml-explore/mlx" rel="noopener noreferrer"&gt;MLX&lt;/a&gt;.&lt;/li&gt;
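&lt;li&gt;A consumer RTX 5090 (32GB VRAM) handles 30B models in 4-bit quantization with comfortable headroom for context windows; a 70B model at 4-bit needs roughly 40GB, so it has to be split across cards or partially offloaded to system RAM.&lt;/li&gt;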
&lt;li&gt;Apple's own Foundation Models (built into iOS 26 and macOS) ship a 3B-parameter on-device LLM that's available to every app through a system framework.&lt;/li&gt;
&lt;li&gt;Llama 4, Qwen 3.6, Mistral Small 3.1 and Gemma 4 all ship 4-bit weights designed to run on commodity hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost-performance curve has crossed a line where, for a large class of real applications, running locally is now &lt;em&gt;better&lt;/em&gt; — not just feasible. The question stopped being "can we run this without the cloud?" and became "why are we still sending this to someone else's datacenter?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost: the math has flipped
&lt;/h2&gt;

&lt;p&gt;Cloud LLM pricing in 2024 was an order of magnitude cheaper than running your own inference. By 2026, for any sustained workload, the math is the opposite.&lt;/p&gt;

&lt;p&gt;Take a concrete example. A static code analysis pipeline that scans 500 commits per day against a 1M-line codebase. With &lt;a href="https://kulvex.ai/kcode" rel="noopener noreferrer"&gt;KCode&lt;/a&gt; we measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI o4-mini, hosted API&lt;/strong&gt;: ~$340/month, plus the latency overhead of going to the cloud per file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Qwen3.6-Heretic 30B on a single RTX 5090&lt;/strong&gt;: roughly $0 marginal cost after the GPU is purchased, with a sub-second turnaround per file because the model is warm in VRAM and there's no network hop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The capex is real — a workstation isn't free. But for any team doing real volume, the breakeven against API pricing arrives in 4-8 months. After that, every additional run is essentially free. The same calculus applies to support agents, document classification pipelines, voice transcription, image captioning, anything that runs at scale.&lt;/p&gt;
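
&lt;p&gt;The breakeven arithmetic is simple enough to write down. In this Python sketch the $340/month API figure comes from the example above; the $2,000 hardware cost and the zero local running cost are placeholder assumptions, so substitute your own numbers.&lt;/p&gt;

```python
def breakeven_months(hardware_cost: float, api_monthly: float,
                     local_monthly: float = 0.0) -> float:
    """Months until cumulative API spend exceeds the hardware purchase."""
    monthly_saving = api_monthly - local_monthly
    if not monthly_saving > 0:
        raise ValueError("local running cost must be lower than the API bill")
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # hardware_cost is a hypothetical figure; api_monthly is from the example above.
    months = breakeven_months(hardware_cost=2000, api_monthly=340)
    print(f"breakeven after about {months:.1f} months")
```

&lt;p&gt;Adding electricity as &lt;code&gt;local_monthly&lt;/code&gt; pushes the breakeven out, which is one reason the range quoted above is a span rather than a single number.&lt;/p&gt;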

&lt;h2&gt;
  
  
  Privacy: your data is your data
&lt;/h2&gt;

&lt;p&gt;The privacy story is easier to explain when the user is non-technical: &lt;em&gt;if your data never leaves your machine, no one can lose it, sell it, or train on it&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This matters more in some contexts than others. We ship products on both ends of the privacy spectrum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://astrolexis.space/clearcaps" rel="noopener noreferrer"&gt;&lt;strong&gt;ClearCaps&lt;/strong&gt;&lt;/a&gt; generates live captions and diarized transcripts for users with hearing loss. The audio is profoundly personal — medical conversations, family calls, work meetings. Running speech recognition (WhisperKit) and speaker diarization on-device means there's nothing for an attacker to intercept or a vendor to monetize.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://astrolexis.space/phoenixsteps" rel="noopener noreferrer"&gt;&lt;strong&gt;PhoenixSteps&lt;/strong&gt;&lt;/a&gt; is a clinical speech-therapy companion for pediatric patients. The users are children. Their speech recordings are protected health information under HIPAA-equivalent frameworks across most jurisdictions. There's no possible "cloud version" that we'd ship.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kulvex.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Kulvex AI&lt;/strong&gt;&lt;/a&gt; is a self-hosted assistant. It runs on hardware the user owns, in their home, on their network. We never see the conversations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't ideology. It's a product constraint. There are categories of software — health, legal, family, identity — where shipping to a cloud LLM is a non-starter. On-device is the only viable architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: 50ms vs 800ms
&lt;/h2&gt;

&lt;p&gt;A cloud LLM round-trip is at minimum the network latency (50-200ms) plus the time-to-first-token (200-1000ms depending on load) plus the streaming of the response. For a short reply that's a 1-2 second user-facing delay.&lt;/p&gt;

&lt;p&gt;An on-device model on Apple Silicon, with the weights already memory-mapped into RAM, can start producing tokens in under 50ms and stream at 30+ tokens/second for a 7B model. For interactive UX — autocomplete, voice assistants, real-time captions — this is the difference between "feels native" and "feels like a web form."&lt;/p&gt;
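
&lt;p&gt;Putting the two paragraphs side by side, the wait before the first token can be compared with a few lines of Python. Every number here is taken from the ranges quoted above; the constant and function names are just labels for this sketch.&lt;/p&gt;

```python
ON_DEVICE_FIRST_TOKEN_MS = 50     # "under 50ms" from the text, used as an upper bound
CLOUD_NETWORK_MS = (50, 200)      # network latency range from the text
CLOUD_TTFT_MS = (200, 1000)       # server-side time-to-first-token range from the text


def cloud_first_token_range_ms() -> tuple[int, int]:
    """Best and worst case wait before the first cloud token reaches the user."""
    best = CLOUD_NETWORK_MS[0] + CLOUD_TTFT_MS[0]
    worst = CLOUD_NETWORK_MS[1] + CLOUD_TTFT_MS[1]
    return best, worst


if __name__ == "__main__":
    best, worst = cloud_first_token_range_ms()
    print(f"cloud: {best}-{worst} ms, on-device: up to {ON_DEVICE_FIRST_TOKEN_MS} ms")
```

&lt;p&gt;Even in the best case the cloud path spends about 250ms before anything appears, roughly five times the on-device bound, and that is before any response tokens are streamed.&lt;/p&gt;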

&lt;p&gt;We're working with this constraint right now on our iOS apps. The Apple Foundation Models framework gives us a 3B-parameter LLM that responds in 100-200ms total on an iPhone 16. That's fast enough that the user never sees a spinner. The same query against an OpenAI API would feel slower even if it produced a higher-quality answer — because the perceived speed of UI dominates short interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Freedom: no vendor lock-in
&lt;/h2&gt;

&lt;p&gt;This is the underappreciated one. Every cloud LLM you build on top of is a dependency on someone else's roadmap, pricing, and content policy. They can deprecate the model you're using, double the price overnight, refuse to serve your jurisdiction, or decide that your use case violates their terms.&lt;/p&gt;

&lt;p&gt;We've watched this play out repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The original GPT-4 API was deprecated and replaced with new versions that broke established prompt patterns for thousands of products.&lt;/li&gt;
&lt;li&gt;Anthropic, OpenAI, and Google have all rejected or rate-limited use cases at various points (security tooling, certain medical applications, anything touching content moderation).&lt;/li&gt;
&lt;li&gt;Hosted prices have moved up &lt;em&gt;and&lt;/em&gt; down without warning, making it impossible to model unit economics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On-device, you can pin the model version forever. Llama 4 will run on your 5090 in 2030 the same way it runs today. No one can take it away. Your customers' workflows don't break because a vendor changed their mind.&lt;/p&gt;

&lt;p&gt;The on-device weights become a real asset. It's the opposite of "renting" intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we ship at AstroLexis
&lt;/h2&gt;

&lt;p&gt;Everything we build runs locally by default. The full lineup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://kulvex.ai" rel="noopener noreferrer"&gt;Kulvex AI&lt;/a&gt;&lt;/strong&gt; — self-hosted AI platform with 17 domain agents (home automation, messaging across 8 platforms, voice control). Runs on your own GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://kulvex.ai/kcode" rel="noopener noreferrer"&gt;KCode&lt;/a&gt;&lt;/strong&gt; — deterministic security audit tool with 414 hand-curated patterns across 20+ languages. Pre-filters with regex/AST, verifies with a local LLM. Your source code never leaves your machine. SARIF output, GitHub Action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://astrolexis.space/clearcaps" rel="noopener noreferrer"&gt;ClearCaps&lt;/a&gt;&lt;/strong&gt; — live captions and speaker diarization on iPhone. WhisperKit + Apple SpeakerKit, all on-device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://astrolexis.space/siliconmon" rel="noopener noreferrer"&gt;SiliconMon&lt;/a&gt;&lt;/strong&gt; — Apple Silicon system monitor for macOS. Shows you exactly what your GPU, ANE, and unified memory are doing while you run MLX, Ollama, llama.cpp, or LM Studio locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://astrolexis.space/phoenixsteps" rel="noopener noreferrer"&gt;PhoenixSteps&lt;/a&gt;&lt;/strong&gt; — clinical speech-therapy companion for pediatric SLPs. iOS-only, MLX-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://astrolexis.space/vela" rel="noopener noreferrer"&gt;Vela&lt;/a&gt;&lt;/strong&gt; — memory companion for adults with memory impairment. iOS-only, on-device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tutto&lt;/strong&gt; — conversational practice for English and Spanish learners. In development.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread isn't a particular AI framework or model. It's the architectural commitment: the user owns the inference. We don't sit in the middle.&lt;/p&gt;
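
&lt;p&gt;As a rough illustration of the flow described for KCode in the lineup (a deterministic pre-filter finds candidates, a local LLM verifies them), here is a minimal sketch. The two rules, the &lt;code&gt;Candidate&lt;/code&gt; type, and &lt;code&gt;stub_verify&lt;/code&gt; are hypothetical stand-ins: these are not KCode's patterns, and a real verifier would call a local model instead of a string check.&lt;/p&gt;

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Candidate:
    rule: str
    line_number: int
    text: str


# Hypothetical rules for illustration; these are not KCode's patterns.
RULES = {
    "hardcoded-secret": re.compile(
        r"(?i)(password|api_key)\s*=\s*['\"][^'\"]+['\"]"
    ),
    "shell-injection": re.compile(r"os\.system\("),
}


def find_candidates(source: str) -> Iterator[Candidate]:
    """Deterministic pre-filter: cheap regex scan, one line at a time."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                yield Candidate(rule, line_number, line.strip())


def audit(source: str, verify: Callable[[Candidate], bool]) -> List[Candidate]:
    """Keep only the candidates that the verifier confirms."""
    return [c for c in find_candidates(source) if verify(c)]


def stub_verify(candidate: Candidate) -> bool:
    """Placeholder for the local-LLM call: rejects obvious doc examples."""
    return "example" not in candidate.text.lower()


SAMPLE = "\n".join([
    'password = "hunter2"',
    'api_key = "example-key-for-docs"',
    'os.system("ls " + user_input)',
])

for finding in audit(SAMPLE, stub_verify):
    print(f"{finding.rule} at line {finding.line_number}: {finding.text}")
```

&lt;p&gt;The design point is that the model only ever sees a handful of narrow yes/no questions, so a small local model can be adequate, and swapping &lt;code&gt;stub_verify&lt;/code&gt; for a real inference call does not change the rest of the pipeline.&lt;/p&gt;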

&lt;h2&gt;
  
  
  How to start
&lt;/h2&gt;

&lt;p&gt;If you're building software in 2026 and considering whether to make an on-device version, our take:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the right hardware target.&lt;/strong&gt; Apple Silicon is the most underrated AI dev box on the market. An M2 Pro or newer Mac with 32GB+ unified memory handles 7-13B parameter models comfortably. For server work, a single consumer GPU (24GB RTX 4090 or 32GB RTX 5090) handles 30B models at 4-bit quantization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick a model family and stay on it.&lt;/strong&gt; Llama 4, Qwen 3.6, Mistral Small, Gemma 4. All ship 4-bit quantizations. All have stable APIs through MLX, llama.cpp, or vLLM. Don't chase weekly model releases — pick one, learn its quirks, ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat the local LLM as a tool, not a magic box.&lt;/strong&gt; Wrap it in deterministic pre-processing and post-processing. KCode does this: regex/AST patterns find candidates, the LLM verifies. The local model doesn't have to be GPT-5-level to be useful — it has to be reliable for a narrow task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure honestly.&lt;/strong&gt; Track tokens-per-second, time-to-first-token, memory footprint, and battery impact on real devices. The numbers you see on a research blog don't match what you'll see on a customer's M1 Air.&lt;/li&gt;
&lt;/ol&gt;
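
&lt;p&gt;For the measurement step, a minimal timing harness might look like the sketch below. It assumes your runtime exposes generation as a lazy iterator of tokens that starts work when first iterated; &lt;code&gt;measure_stream&lt;/code&gt; and &lt;code&gt;fake_stream&lt;/code&gt; are hypothetical names, and the fake stream only simulates delays so the example runs without a model. Memory footprint and battery impact need platform profiling tools and are not covered here.&lt;/p&gt;

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Consume a token iterator and report time-to-first-token and decode rate."""
    start = clock()
    first_token_at = None
    last_token_at = None
    count = 0
    for _ in tokens:
        last_token_at = clock()
        if first_token_at is None:
            first_token_at = last_token_at
        count += 1

    if first_token_at is None:
        return {"tokens": 0, "ttft_ms": None, "tokens_per_second": None}

    # Decode rate counts only the tokens that arrived after the first one,
    # so a slow first token does not distort the streaming speed.
    decode_seconds = last_token_at - first_token_at
    rate = (count - 1) / decode_seconds if decode_seconds > 0 else None
    return {
        "tokens": count,
        "ttft_ms": (first_token_at - start) * 1000,
        "tokens_per_second": rate,
    }


def fake_stream(n, first_delay, step_delay):
    """Stand-in for a real model: sleeps to simulate prefill and decode."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(step_delay)
        yield f"tok{i}"


result = measure_stream(fake_stream(5, first_delay=0.02, step_delay=0.005))
print(result)
```

&lt;p&gt;Run the same harness on the slowest device you intend to support, with the model cold and warm, and record both numbers rather than only the best run.&lt;/p&gt;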




&lt;p&gt;— Bruno, founder, &lt;a href="https://astrolexis.space" rel="noopener noreferrer"&gt;AstroLexis LLC&lt;/a&gt;. If you build in this space, drop a line: &lt;a href="mailto:contact@astrolexis.space"&gt;contact@astrolexis.space&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>privacy</category>
      <category>indiedev</category>
    </item>
  </channel>
</rss>
