<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Donald Cruver</title>
    <description>The latest articles on DEV Community by Donald Cruver (@dcruver).</description>
    <link>https://dev.to/dcruver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3743028%2F1a0b4824-e43b-40cc-a2ef-c1b72faf4bdf.png</url>
      <title>DEV Community: Donald Cruver</title>
      <link>https://dev.to/dcruver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dcruver"/>
    <language>en</language>
    <item>
      <title>The FCC Just Validated What Homelabbers Have Known for Years</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Tue, 24 Mar 2026 00:38:06 +0000</pubDate>
      <link>https://dev.to/dcruver/the-fcc-just-validated-what-homelabbers-have-known-for-years-3gn5</link>
      <guid>https://dev.to/dcruver/the-fcc-just-validated-what-homelabbers-have-known-for-years-3gn5</guid>
      <description>&lt;p&gt;The FCC this week designated all consumer routers manufactured outside the US as a national security risk, banning new foreign-made models from sale. The rule catches not just TP-Link but Eero, Netgear, and Google Nest too, since they all manufacture overseas.&lt;/p&gt;

&lt;p&gt;The move has a concrete backstory. Three Chinese nation-state groups (Volt Typhoon, Flax Typhoon, and Salt Typhoon) used compromised SOHO routers to build botnets targeting US critical infrastructure over several years. The FBI and DOJ shut down one of these botnets in 2024. The FCC's National Security Determination names all three operations explicitly as justification for the ban.&lt;/p&gt;

&lt;p&gt;I've been running OPNsense on a fanless mini-PC as my home router since 2022. It runs on an Intel N5105 with dual NICs, 8GB RAM, and a 250GB NVMe drive. The hardware cost about $200. The software is open source and BSD-licensed. There is no cloud account in the chain, no manufacturer firmware to trust or distrust, and no update process I didn't initiate. It routes packets and does exactly what the configuration specifies.&lt;/p&gt;

&lt;p&gt;The FCC's concern is about foreign firmware as an attack surface for nation-state actors. My concern in 2022 was simpler: I wanted to know what was on my own network without a consumer device deciding that for me. The concerns are identical, and so is the fix.&lt;/p&gt;

&lt;p&gt;Full writeup: &lt;a href="https://hullabalooing.cruver.ai/homelab/posts/it-turns-out-my-router-was-a-national-security-decision-1774311445/" rel="noopener noreferrer"&gt;It Turns Out My Router Was a National Security Decision&lt;/a&gt;&lt;/p&gt;

</description>
      <category>homelab</category>
      <category>networking</category>
      <category>opnsense</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Three Production Apps, Zero Code: What Keip Actually Looks Like in Practice</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Fri, 13 Mar 2026 21:43:43 +0000</pubDate>
      <link>https://dev.to/dcruver/three-production-apps-zero-code-what-keip-actually-looks-like-in-practice-3iic</link>
      <guid>https://dev.to/dcruver/three-production-apps-zero-code-what-keip-actually-looks-like-in-practice-3iic</guid>
      <description>&lt;h2&gt;Three Production Apps, Zero Code: What Keip Actually Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;In the past month I shipped three integration apps to my home cluster: a Spanish translator that replies in audio, a health tracking bot I can query from any room in my house, and a camera alert system that decides for itself what's worth telling me about. Each one took under an hour to get running. None of them required me to write a single line of application code.&lt;/p&gt;

&lt;p&gt;That last part is the actual story. The apps are configuration, end to end. There is no glue script. An AI assistant produced most of that configuration in the time it would have taken me to scaffold a project and write the first class.&lt;/p&gt;

&lt;h2&gt;A Quick Recap&lt;/h2&gt;

&lt;p&gt;I wrote about &lt;a href="https://hullabalooing.cruver.ai/writing/posts/keip-eip-ai-kubernetes" rel="noopener noreferrer"&gt;Keip and Enterprise Integration Patterns&lt;/a&gt; a few weeks ago. The short version: Keip is an open source platform built on &lt;a href="https://spring.io/projects/spring-integration" rel="noopener noreferrer"&gt;Spring Integration&lt;/a&gt; that turns &lt;a href="https://www.enterpriseintegrationpatterns.com/" rel="noopener noreferrer"&gt;Enterprise Integration Patterns&lt;/a&gt; into Kubernetes resources. Instead of writing a Spring Boot application to wire services together, I write an XML route config and Keip deploys it as a pod.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keip.codice.org/v1alpha2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IntegrationRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-route&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keip&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;routeConfigMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-route-xml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The route config lives in a ConfigMap. There is no Dockerfile, no application entrypoint, and no build step. Keip handles all of that.&lt;/p&gt;

&lt;p&gt;Installing Keip on an existing cluster is a single command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/codice/keip/releases/latest/download/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That installs the operator, CRDs, and controller. From there, deploying a route is just a &lt;code&gt;kubectl apply&lt;/code&gt; on a ConfigMap and an &lt;code&gt;IntegrationRoute&lt;/code&gt; resource.&lt;/p&gt;
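&lt;p&gt;To make that concrete, deploying the &lt;code&gt;my-route&lt;/code&gt; example above is two commands. This is a sketch: the file names and the ConfigMap creation flags here are placeholders, not Keip's documented contract.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Package the Spring Integration XML as the ConfigMap the route references&lt;/span&gt;
kubectl create configmap my-route-xml &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-route.xml &lt;span class="nt"&gt;-n&lt;/span&gt; keip

&lt;span class="c"&gt;# Apply the IntegrationRoute; the operator turns it into a running pod&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; my-route.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;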

&lt;p&gt;What I've been building on top of this is keip-connect, a library of protocol adapters that connect integration routes to the services I actually use: Matrix chat, ntfy push notifications, and an Anthropic Claude chat model. Each adapter follows the same Spring Integration channel adapter pattern: an inbound adapter that produces messages onto a channel, or an outbound gateway that sends them somewhere and optionally waits for a reply. (keip-connect is not yet publicly released.)&lt;/p&gt;

&lt;p&gt;With those pieces in place, here's what I built.&lt;/p&gt;

&lt;h2&gt;App 1: Personal Translator App&lt;/h2&gt;

&lt;p&gt;My wife and I are visiting Mexico soon. My Spanish is nearly nonexistent, and I wanted something faster than stopping to open a separate app. I also wanted it to respond in audio when translating to Spanish, so I could hear how things should sound rather than just reading them.&lt;/p&gt;

&lt;p&gt;The obvious alternative is Google Translate. The reason I didn't use it is the same one that runs through everything else in this stack: I don't want my conversations in someone else's logs. A translation request is a conversation fragment, and sending it to a third-party API means it leaves my network. The local model is fast and the text stays on my hardware.&lt;/p&gt;

&lt;p&gt;There's a practical reason too. Matrix is already the interface I use for everything: talking to my AI assistant, checking health data, getting camera alerts. The translator is just another room in the same app I already have open. There is no separate tool to install and no context switching.&lt;/p&gt;

&lt;p&gt;The route listens on a private Matrix room. Messages arrive as text or voice. It detects the language, translates in the appropriate direction, and when the output is Spanish, generates a voice reply using a local TTS model running on my own hardware. English output stays as text.&lt;/p&gt;

&lt;p&gt;The XML that does all of this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;beans&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/beans"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int-http=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration/http"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:matrix=&lt;/span&gt;&lt;span class="s"&gt;"http://cruver.ai/schema/keip-connect/matrix"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Receive messages from the translation room --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;matrix:inbound-channel-adapter&lt;/span&gt;
      &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"matrixClient"&lt;/span&gt;
      &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"rawInput"&lt;/span&gt;
      &lt;span class="na"&gt;room-ids=&lt;/span&gt;&lt;span class="s"&gt;"!MpzjvTEceOoscPvapJ:matrix.cruver.network"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Audio: transcribe via Whisper before translating --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:filter&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"rawInput"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"audioInput"&lt;/span&gt;
      &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"headers['matrix_content_type'] == 'm.audio'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"audioInput"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"textInput"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${whisper.url}/v1/audio/transcriptions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Text input arrives directly --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:channel&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"textInput"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Call LLM: detect language and translate; returns JSON
       {"direction":"en→es|es→en","translation":"..."} --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"textInput"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmCall"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmCall"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmResponse"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${llm.url}/v1/chat/completions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;
      &lt;span class="na"&gt;mapped-request-headers=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Parse direction from LLM response --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:router&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmResponse"&lt;/span&gt;
      &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"new com.fasterxml.jackson.databind.ObjectMapper()
                      .readTree(payload).get('direction').asText()
                      .startsWith('en') ? 'toSpanishAudio' : 'textReply'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- ES→EN: return translated text directly --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:channel&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"textReply"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;matrix:outbound-gateway&lt;/span&gt; &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"matrixClient"&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"textReply"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- EN→ES: call XTTS for audio, then send WAV to room --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:channel&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"toSpanishAudio"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"toSpanishAudio"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"audioReply"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${xtts.url}/v1/audio/speech"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"byte[]"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;matrix:outbound-gateway&lt;/span&gt; &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"matrixClient"&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"audioReply"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/beans&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The services it talks to (Whisper for transcription, a local LLM for translation, and XTTS for speech) all run on my own hardware. Nothing leaves my network. The route is the plumbing, and the services are the intelligence.&lt;/p&gt;

&lt;p&gt;From "I want a translator bot" to a working app in the Matrix room took about 45 minutes. Most of that was getting the Matrix room configured and E2E encryption keys sorted. The route itself was maybe 15 minutes, most of which was an AI assistant drafting the XML while I reviewed it.&lt;/p&gt;

&lt;p&gt;The translator isn't the point. It's just one thing the pattern makes easy. The next two examples are completely different problems, and the approach is the same.&lt;/p&gt;

&lt;h2&gt;App 2: Health Tracker&lt;/h2&gt;

&lt;p&gt;I track glucose, ketones, weight, and blood pressure, among other things, in InfluxDB. The data is useful, but querying it has always required either a Grafana dashboard or writing a Flux query by hand. I wanted to ask questions in plain language and get answers.&lt;/p&gt;

&lt;p&gt;The route listens on a dedicated Matrix room and forwards messages to a local LLM with context about what data is available. The LLM generates a Flux query, the route executes it against InfluxDB, and the result comes back as a natural language summary.&lt;/p&gt;

&lt;p&gt;I can send messages like "How have my ketones been this week?" or "What was my average glucose yesterday?" and get a direct answer.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;beans&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/beans"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int-http=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration/http"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:matrix=&lt;/span&gt;&lt;span class="s"&gt;"http://cruver.ai/schema/keip-connect/matrix"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;matrix:inbound-channel-adapter&lt;/span&gt;
      &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"matrixClient"&lt;/span&gt;
      &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"healthInput"&lt;/span&gt;
      &lt;span class="na"&gt;room-ids=&lt;/span&gt;&lt;span class="s"&gt;"!yZRhRvDISqwNnBokVM:matrix.cruver.network"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Route: slash commands log data; plain questions query it --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:router&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"healthInput"&lt;/span&gt;
      &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"payload.startsWith('/') ? 'logChannel' : 'queryChannel'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Log path: LLM converts slash command to InfluxDB line protocol --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"logChannel"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmLogCall"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmLogCall"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"lineProtocol"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${llm.url}/v1/chat/completions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Write line protocol to InfluxDB --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"lineProtocol"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"influxWrite"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"'Token ' + '${influx.token}'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-channel-adapter&lt;/span&gt;
      &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"influxWrite"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${influx.url}/api/v2/write?org=${influx.org}&amp;amp;amp;bucket=${influx.bucket}"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;mapped-request-headers=&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Query path: LLM turns the question into a Flux query --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"queryChannel"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmQueryCall"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"llmQueryCall"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"fluxQuery"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${llm.url}/v1/chat/completions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="c"&gt;&amp;lt;!-- Execute the generated Flux against InfluxDB --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"fluxQuery"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"influxQuery"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt; &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"'Token ' + '${influx.token}'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"influxQuery"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"rawData"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${influx.url}/api/v2/query?org=${influx.org}"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- LLM summarizes raw CSV into plain language --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"rawData"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"replies"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${llm.url}/v1/chat/completions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;matrix:outbound-gateway&lt;/span&gt; &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"matrixClient"&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"replies"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/beans&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one took about 40 minutes end to end. The hardest part was writing a clear system prompt that reliably produces valid Flux syntax. The route itself was straightforward, and again, AI drafted it.&lt;/p&gt;

&lt;h2&gt;App 3: Intelligent Camera Alerts&lt;/h2&gt;

&lt;p&gt;I have six cameras running through Frigate, my local NVR. Frigate does motion detection and object recognition with a Google Coral TPU. What it doesn't do is decide whether a detection is actually interesting.&lt;/p&gt;

&lt;p&gt;A car parked in the street shouldn't page me. A delivery truck at the door should. A person I don't recognize in the backyard at 2 AM definitely should. Making that distinction requires judgment, and that's what the integration layer handles.&lt;/p&gt;

&lt;p&gt;The route subscribes to Frigate's MQTT event stream. When a detection comes in, it grabs the camera snapshot, sends it to a local vision model for analysis, and acts on the verdict. Routine activity is dropped silently. Anything worth knowing about triggers a push notification through ntfy with a description of what the model saw.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;beans&lt;/span&gt; &lt;span class="na"&gt;xmlns=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/beans"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int-http=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration/http"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:int-mqtt=&lt;/span&gt;&lt;span class="s"&gt;"http://www.springframework.org/schema/integration/mqtt"&lt;/span&gt;
       &lt;span class="na"&gt;xmlns:ntfy=&lt;/span&gt;&lt;span class="s"&gt;"http://cruver.ai/schema/keip-connect/ntfy"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Subscribe to Frigate detection events --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-mqtt:message-driven-channel-adapter&lt;/span&gt;
      &lt;span class="na"&gt;client-ref=&lt;/span&gt;&lt;span class="s"&gt;"mqttClient"&lt;/span&gt;
      &lt;span class="na"&gt;topics=&lt;/span&gt;&lt;span class="s"&gt;"frigate/events"&lt;/span&gt;
      &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"rawEvents"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Filter to high-confidence detections only --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:filter&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"rawEvents"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"significantEvents"&lt;/span&gt;
      &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"#jsonPath(payload, '$.score') &amp;gt;= 0.65"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Fetch camera snapshot from Frigate API --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"significantEvents"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"withSnapshot"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${frigate.url}/api/{camera}/latest.jpg?h=720"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"byte[]"&lt;/span&gt;
      &lt;span class="na"&gt;uri-variables-expression=&lt;/span&gt;&lt;span class="s"&gt;"headers"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Ask vision model: is this worth reporting?
       Returns JSON {"verdict":"ALERT|ROUTINE","description":"..."} --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:header-enricher&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"withSnapshot"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"visionCall"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;int:header&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/int:header-enricher&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int-http:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"visionCall"&lt;/span&gt; &lt;span class="na"&gt;reply-channel=&lt;/span&gt;&lt;span class="s"&gt;"analyzed"&lt;/span&gt;
      &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${vision.url}/v1/chat/completions"&lt;/span&gt;
      &lt;span class="na"&gt;http-method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt; &lt;span class="na"&gt;expected-response-type=&lt;/span&gt;&lt;span class="s"&gt;"java.lang.String"&lt;/span&gt;
      &lt;span class="na"&gt;mapped-request-headers=&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Filter out ROUTINE judgments --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;int:filter&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"analyzed"&lt;/span&gt; &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"alerts"&lt;/span&gt;
      &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"#jsonPath(payload, '$.verdict') == 'ALERT'"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Send push notification --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;ntfy:outbound-gateway&lt;/span&gt;
      &lt;span class="na"&gt;client-factory-ref=&lt;/span&gt;&lt;span class="s"&gt;"ntfyClient"&lt;/span&gt;
      &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"alerts"&lt;/span&gt;
      &lt;span class="na"&gt;default-topic=&lt;/span&gt;&lt;span class="s"&gt;"camera-analysis"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/beans&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vision model runs locally on the same GPU stack as everything else. The push notification goes to my phone via a self-hosted ntfy server. The route has no idea what's in the images; it moves them to the right place and acts on the verdict.&lt;/p&gt;
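&lt;p&gt;One detail the route elides is the body of the vision call. The snapshot gateway returns raw JPEG bytes, and an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint expects them base64-encoded inside a chat request, so a transformer ahead of the vision gateway (omitted above for brevity) has to build that envelope. Roughly, the payload looks like this; the model name and prompt text are placeholders, not the config I actually run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "model": "local-vision-model",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text",
       "text": "Is this detection worth reporting? Answer with JSON: {\"verdict\":\"ALERT|ROUTINE\",\"description\":\"...\"}"},
      {"type": "image_url",
       "image_url": {"url": "data:image/jpeg;base64,&amp;lt;snapshot bytes&amp;gt;"}}
    ]
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;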

&lt;h2&gt;What This Means in Practice&lt;/h2&gt;

&lt;p&gt;Each of these apps has real logic: language detection, LLM prompting, vision model inference, database queries. None of that logic lives in the route. It lives in the services the route calls: a Whisper server, a local LLM, a vision model endpoint, an InfluxDB instance. The integration layer connects those services to each other and to the world, deciding what triggers what, what data flows where, and what happens when something fails.&lt;/p&gt;

&lt;p&gt;The routes are readable. Looking at any of those XML configs, the structure is clear in thirty seconds, because the primitives are named for what they do: filter, transform, route, split, aggregate. There's no application framework to understand, no dependency injection container to trace through, no build system to run.&lt;/p&gt;

&lt;p&gt;Every service those routes call runs on my hardware. The translations, the health queries, the camera analysis; none of it touches an external API. When the internet goes down, the translator still works. When a provider changes its terms or pricing, nothing breaks. That's the practical side of building on infrastructure I own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flywheel Effect
&lt;/h2&gt;

&lt;p&gt;All of this has produced a flywheel effect. Camera alerts go to ntfy, and that's useful by itself. But ntfy topics are just another event source, and any other route can subscribe to the same topics and build on them. It would be trivial to create a route that logs alert frequency to InfluxDB, one that aggregates overnight detections into a morning digest delivered to Matrix, or one that silences notifications during a window set from a Matrix message. None of those exist yet, but every piece needed to build them is already running. Adding them is just a matter of creating the IntegrationRoutes in k8s.&lt;/p&gt;
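&lt;p&gt;For a sense of scale, the alert-frequency idea would be a handful of elements: subscribe to the same ntfy topic and write a count into InfluxDB. This is a sketch, not a running route; the &lt;code&gt;ntfy:inbound-channel-adapter&lt;/code&gt; and &lt;code&gt;influxdb:outbound-channel-adapter&lt;/code&gt; elements here are stand-ins for whatever connectors the cluster actually has installed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hypothetical route: log alert frequency to InfluxDB --&amp;gt;
&amp;lt;ntfy:inbound-channel-adapter topic="camera-analysis" channel="alertEvents"/&amp;gt;

&amp;lt;int:transformer input-channel="alertEvents" output-channel="points"
    expression="'camera_alerts,source=ntfy count=1'"/&amp;gt;

&amp;lt;influxdb:outbound-channel-adapter channel="points" bucket="homelab"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point isn't the exact elements; it's that the whole addition reads as three declarative pieces, with no new application code.&lt;/p&gt;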

&lt;p&gt;The same pattern holds across all three apps. The health tracker produces InfluxDB data. The translator produces Matrix messages. The camera watcher produces ntfy events. Each output is something another route can consume. The routes compound on each other.&lt;/p&gt;

&lt;p&gt;What this means in practice is that the ceiling is not complexity or integration friction; it's compute. Adding a new route costs nothing in subscription fees, nothing in API quotas, and gives up nothing in data sovereignty. On a cluster with local GPU inference, the only real question becomes whether I want the thing, not whether building it is worth the overhead. And the more I build, the easier it gets to add something new. The Matrix client, the LLM endpoint, and the ntfy connection are already running. A new route potentially inherits all of it.&lt;/p&gt;

&lt;p&gt;The next step is making the routes themselves even easier to create. That's a topic for a future post.&lt;/p&gt;

</description>
      <category>keip</category>
      <category>kubernetes</category>
      <category>homelab</category>
      <category>ai</category>
    </item>
    <item>
      <title>My Data, My Stack</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Tue, 10 Mar 2026 04:22:31 +0000</pubDate>
      <link>https://dev.to/dcruver/my-data-my-stack-f9f</link>
      <guid>https://dev.to/dcruver/my-data-my-stack-f9f</guid>
      <description>&lt;p&gt;I want to introduce a concept I'll be coming back to throughout this blog: Personal Data Sovereignty, or PDS. It's the idea that individuals can and should have meaningful control over where their data lives, who can access it, and what happens to it. Not as a legal right, as something you actually build.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've been reading this blog for a while, some of what follows will look familiar. I've written about the network setup, the GPU server, the second brain, the AI inference stack. This post isn't about new technical ground, it's more philosophical. PDS is the concept I've been building toward without naming it directly, and I wanted to put it in one place before the series goes any further.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Personal Data Sovereignty
&lt;/h1&gt;

&lt;p&gt;PDS isn't a legal framework, or a political position. It's a simple idea: that individuals can and arguably should have meaningful control over where their data lives, how it's processed, and who can access it.&lt;/p&gt;

&lt;p&gt;The challenge is that "data sovereignty" usually stops at the level of rights and policy. You have the &lt;em&gt;right&lt;/em&gt; to your data, but a right you can only exercise by asking nicely isn't sovereignty. If your data lives on someone else's servers, runs through someone else's software, and is governed by someone else's terms, then their decisions are your reality. A privacy policy doesn't change that. At best it creates a new chore: find time to read it, parse the legal language, figure out what it actually permits, and trust that they'll follow it. Most people don't, which is a rational response to being handed a 5-page document before they can use a service. That's not sovereignty, it's asking permission and then doing homework about it.&lt;/p&gt;

&lt;p&gt;There's a common observation about the internet: &lt;em&gt;if you aren't paying for the product, you are the product.&lt;/em&gt; It's a useful heuristic for understanding ad-supported services, search engines, social media, free email. Your attention and your data are what's being sold.&lt;/p&gt;

&lt;p&gt;But the real principle isn't about who's paying, it's about who's in control. If the infrastructure belongs to someone else, then the rules belong to someone else. Here are some examples of what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Amazon Ring had a webpage where law enforcement could fill out a form, claim a life-threatening emergency, and access your footage without your consent, a court order, or a warrant. Ring customers paid for their hardware. They paid a subscription for cloud storage. They were, by any normal definition, paying for the product. It didn't matter. The footage still flowed to law enforcement on request. Amazon has since updated their policy to require warrants in most cases, but an emergency exception remains, and the infrastructure that made warrantless access possible in the first place hasn't changed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flock Safety runs automated license plate reader cameras mounted on street lights, utility poles, HOA entrances, apartment complexes, and private businesses across the country. Their network aggregates vehicle movement data into a shared system where any subscribed party gets alerts when a tracked vehicle is spotted. Law enforcement, cities, and private organizations are all customers. You don't opt into it. You don't know which intersections, driveways, or parking lots have cameras. Your movements, when you leave, when you come home, where you go, are being catalogued and made available to whoever has a subscription.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The surveillance angle isn't the only one. In April 2025, Google announced it was ending support for first and second generation Nest Learning Thermostats, effective October 25, 2025. The thermostats hadn't broken. The hardware hadn't changed. Google had simply decided they were done with them, and because the devices depended on Google's cloud to function, that decision was theirs to make. Backlash was significant enough that Google offered compensation toward a replacement, but the end of support went ahead as planned. There is now a class action arbitration being organized against them. The device was in your home. The kill switch belonged to someone else.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Nest situation isn't unique, any cloud-dependent device carries the same risk. But for most categories of home automation, there are alternatives. Home Assistant runs locally and supports thousands of devices without any vendor cloud. Thermostats like the ecobee or Z-wave units can be controlled entirely on your own network. Cameras can run Frigate, which does all its processing on your hardware. There's usually a community-maintained solution, and where there isn't, there's often a commercial option that can be isolated on a separate VLAN and prevented from phoning home. The ecosystem isn't perfect, but it's far more capable than it was five years ago.&lt;/p&gt;

&lt;p&gt;These are examples, not the whole story. Your search history, your email, your files, your health data, all of it is subject to the same dynamic. Every service you don't control is a service that can change its terms, respond to a subpoena, get acquired, or decide that your data is useful for something you didn't anticipate.&lt;/p&gt;

&lt;p&gt;What actually gets you to PDS is owning the infrastructure. Not necessarily the same infrastructure I own, but infrastructure you control. &lt;/p&gt;

&lt;h1&gt;
  
  
  A Note on Isolation
&lt;/h1&gt;

&lt;p&gt;Complete isolation isn't the goal, and it wouldn't be very useful if it were. The internet is built on connecting to other systems. Email crosses networks by design. Search requires access to an index that would be expensive and difficult to maintain yourself.&lt;/p&gt;

&lt;p&gt;What matters is &lt;em&gt;how&lt;/em&gt; those connections happen. SearXNG still reaches out to Google, but the query is anonymized. Google sees a request, not a person. Proton handles email that travels across the open internet, but end-to-end encryption means the content stays private. I'm still in the middle of moving my email there from Gmail, which is exactly how these transitions work: gradually, service by service.&lt;/p&gt;

&lt;p&gt;The goal isn't a sealed box, it's external connections that respect your privacy rather than exploit it.&lt;/p&gt;

&lt;h1&gt;
  
  
  What I Actually Run
&lt;/h1&gt;

&lt;p&gt;Before getting into the deep dives, I want to give a plain description of what I'm actually running and why each layer is there.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Network:&lt;/em&gt; An OPNsense firewall running on a fanless mini-PC, with VLANs for network segmentation and internal DNS so every service gets a proper hostname instead of an IP address. I covered this in more detail &lt;a href="https://hullabalooing.cruver.ai/homelab/posts/data-sovereignty-and-my-network-router" rel="noopener noreferrer"&gt;in a separate post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compute:&lt;/em&gt; A Proxmox hypervisor running LXC containers and VMs for most services, and a separate workstation with a pair of AMD MI60 GPUs for running local AI workloads. The MI60s have 32GB of HBM2 memory each, which is enough to run larger language models comfortably.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Storage:&lt;/em&gt; ZFS across the storage nodes, which gives me snapshots, replication, and data integrity checksums. I've had drives fail without losing data, which is the point.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI:&lt;/em&gt; I run my own LLM inference server using vLLM, and a local AI-powered search tool called Perplexica. Perplexica uses my own models and runs entirely on my hardware, so queries don't leave my network. I've also been running Nabu, an AI assistant built on top of this stack, which is a topic for its own post.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Services:&lt;/em&gt; Home Assistant for home automation, Frigate for local camera recording, Syncthing for file sync across devices, n8n for workflow automation, and an org-roam knowledge base that I use as a second brain. These are things I use every day, and I'd be reaching for them even if PDS wasn't a consideration.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Route I Chose
&lt;/h1&gt;

&lt;p&gt;I want to be clear that my hardware choices aren't the point.&lt;/p&gt;

&lt;p&gt;MI60 GPUs are not the only way to run local AI. Proxmox is not the only hypervisor. OPNsense is not the only firewall. I landed on these things through a combination of research, opportunity, and the particular shape of my needs. Someone else pursuing PDS might do it on a Raspberry Pi cluster, or a single NUC, or a secondhand server from eBay. The principle scales. The hardware doesn't have to match.&lt;/p&gt;

&lt;p&gt;What I'm documenting here is my Sovereign Stack. The decisions that led to it, the trade-offs I made, and what it actually looks like to maintain it. Take what's useful. Ignore what isn't.&lt;/p&gt;

&lt;h1&gt;
  
  
  What It Costs
&lt;/h1&gt;

&lt;p&gt;The main cost is time. Most of this runs without much attention, but things break occasionally and it's on me to fix them. There's a real learning curve early on that takes a while to get through, and it never fully disappears.&lt;/p&gt;

&lt;p&gt;The other cost is hardware. I've spent real money building this out over several years, a mix of new and used equipment. The argument that self-hosting pays for itself in avoided subscriptions is true over the (very) long run, but it requires upfront investment that not everyone can make.&lt;/p&gt;

&lt;h1&gt;
  
  
  What You Get Back
&lt;/h1&gt;

&lt;p&gt;I know where my data is, and what software is processing it. When I search for something using Perplexica, the query goes to my hardware and nowhere else. That's the practical side of PDS, and it's not abstract.&lt;/p&gt;

&lt;p&gt;The less obvious benefit is that running your own infrastructure teaches you things. I understand networking better because I had to configure it. I understand LLM inference better because I had to get it working. That knowledge builds up over time in a way that just using a service doesn't.&lt;/p&gt;

&lt;h1&gt;
  
  
  What's Coming
&lt;/h1&gt;

&lt;p&gt;This post is the overview. The deep dives are coming, one layer at a time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;The network layer&lt;/em&gt; -- OPNsense, VLANs, internal DNS, and how I expose services safely (already published)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Compute&lt;/em&gt; -- Proxmox, the GPU server, and why bare metal still matters&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Storage&lt;/em&gt; -- ZFS and what it means to actually trust your storage&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;AI&lt;/em&gt; -- Local inference, Perplexica, and what PDS looks like when your assistant doesn't phone home&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Services&lt;/em&gt; -- The glue layer: everything I run and why&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each post stands alone. Read them in order or jump to whatever layer interests you. The goal isn't to convince you to replicate my stack. It's to show you what's possible when you start treating your infrastructure as something you own.&lt;/p&gt;

&lt;p&gt;The stack will also keep changing. Hardware gets replaced, better software comes along, requirements shift. I'll document that as it happens.&lt;/p&gt;

</description>
      <category>selfhosted</category>
      <category>homelab</category>
      <category>privacy</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Pushed Local LLMs Harder. Here's What Two Models Actually Did.</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Mon, 02 Mar 2026 21:51:26 +0000</pubDate>
      <link>https://dev.to/dcruver/i-pushed-local-llms-harder-heres-what-two-models-actually-did-3dlp</link>
      <guid>https://dev.to/dcruver/i-pushed-local-llms-harder-heres-what-two-models-actually-did-3dlp</guid>
      <description>&lt;p&gt;In Part 1 of this series, I set up Claude Code against local LLMs on dual MI60 GPUs and watched it scaffold a Flask application from scratch. Small tasks worked. Complex ones did not. I ended with three ideas I wanted to test: running a dense model, trying Claude Code's agent teams feature, and building a persistent memory layer for coding sessions.&lt;/p&gt;

&lt;p&gt;I started the experiment that mattered most: giving a local LLM a project with real scope and seeing what happened. I ran the same project against two different models. The results were instructive, and not in the direction I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project
&lt;/h2&gt;

&lt;p&gt;The model's goal was to build a Python CLI tool called health-correlate. It would connect to my InfluxDB health database, retrieve time-series data for metrics like glucose readings, blood ketone levels, blood pressure, body weight, and subjective wellbeing scores, resample everything to daily aggregates, and run Pearson correlation analysis with configurable time-lag support. The output: a ranked table of correlations with p-values. A weather bucket fed from a Weatherflow sensor extended the scope further. The tool could also correlate outdoor conditions against health metrics.&lt;/p&gt;

&lt;p&gt;That scope requires several interdependent modules: a data layer for InfluxDB Flux queries, a statistics layer for the correlation math, a CLI layer with multiple subcommands, and a visualization layer. Enough moving parts that the model would need to maintain architectural coherence across files and across iterations.&lt;/p&gt;
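&lt;p&gt;The core of that statistics layer is small enough to sketch. This is not the model's output, just a minimal illustration of daily resampling plus lagged Pearson correlation with pandas (the function and column names are mine):&lt;/p&gt;

```python
import pandas as pd

def lagged_pearson(x: pd.Series, y: pd.Series, max_lag: int = 7) -> dict:
    """Correlate daily aggregates of x against y at each time lag.

    Both series need a DatetimeIndex. Lag k compares today's x with
    y from k days earlier. Returns a {lag: pearson_r} mapping.
    """
    xd = x.resample("1D").mean()
    yd = y.resample("1D").mean()
    results = {}
    for lag in range(max_lag + 1):
        # Align the two daily series, drop days missing either value
        pair = pd.concat([xd, yd.shift(lag)], axis=1, keys=["x", "y"]).dropna()
        results[lag] = pair["x"].corr(pair["y"])  # Pearson by default
    return results
```

&lt;p&gt;The real tool layers p-values and ranked output on top of a loop like this one.&lt;/p&gt;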

&lt;p&gt;I structured each session using the subagent pattern. Three Claude Code subagent prompt files handled the phases in sequence: data layer first, then the correlation engine, then the CLI and visualization. Each subagent prompt included a mandatory test-fix loop: write, run, fail, fix, repeat until tests pass before declaring the phase done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt
&lt;/h2&gt;

&lt;p&gt;The top-level prompt I gave Claude Code was short. Its job was to delegate work to the subagents, not to do it directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Build health-correlate using Subagents&lt;/span&gt;

Build a Python CLI tool that finds correlations between health metrics in InfluxDB.

&lt;span class="gu"&gt;## Strategy&lt;/span&gt;
Use subagents to keep each phase in its own context window:
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Phase 1**&lt;/span&gt;: Delegate to @phase1-influx to build the InfluxDB data layer
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Phase 2**&lt;/span&gt;: Delegate to @phase2-correlation to add statistical analysis
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Phase 3**&lt;/span&gt;: Delegate to @phase3-cli to create CLI and visualization

&lt;span class="gu"&gt;## Critical Requirement: Test-Fix Loop&lt;/span&gt;

Each subagent MUST:
&lt;span class="p"&gt;1.&lt;/span&gt; Write tests for their code
&lt;span class="p"&gt;2.&lt;/span&gt; Run tests after each function/module
&lt;span class="p"&gt;3.&lt;/span&gt; If tests fail → debug and fix → re-test
&lt;span class="p"&gt;4.&lt;/span&gt; Only report "Done" when ALL tests pass

Do NOT accept a phase as complete if tests are failing.

&lt;span class="gu"&gt;## Coordination&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Each subagent writes/updates PROGRESS.md when finished
&lt;span class="p"&gt;-&lt;/span&gt; Wait for each phase to complete before starting the next
&lt;span class="p"&gt;-&lt;/span&gt; Verify tests pass before proceeding: &lt;span class="sb"&gt;`python -m pytest tests/ -v`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not implement phases yourself - delegate to the subagents

&lt;span class="gu"&gt;## Environment&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; INFLUXDB_TOKEN is set in the environment
&lt;span class="p"&gt;-&lt;/span&gt; InfluxDB at influxdb.cruver.network:30086
&lt;span class="p"&gt;-&lt;/span&gt; Virtual environment at ./venv (activate with &lt;span class="sb"&gt;`source venv/bin/activate`&lt;/span&gt;)

&lt;span class="gu"&gt;## Begin&lt;/span&gt;
Start by delegating Phase 1 to @phase1-influx.
After it completes, verify tests pass, then delegate Phase 2.
After Phase 2 completes, verify tests pass, then delegate Phase 3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each subagent lived in &lt;code&gt;.claude/agents/&lt;/code&gt; as a markdown file. The files defined the tools available to the subagent, the specific deliverables, and critically, the test-fix loop requirement. The Phase 1 subagent prompt looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;phase1-influx&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;InfluxDB&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;layer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;health-correlate."&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read, Write, Edit, Bash, Glob, Grep&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inherit&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Phase 1: InfluxDB Data Layer&lt;/span&gt;

...

&lt;span class="gu"&gt;## Test-Fix Loop (MANDATORY)&lt;/span&gt;

After writing each function:
&lt;span class="p"&gt;1.&lt;/span&gt; Create a test in tests/test_influx.py
&lt;span class="p"&gt;2.&lt;/span&gt; Run: &lt;span class="sb"&gt;`python -m pytest tests/test_influx.py -v`&lt;/span&gt;
&lt;span class="p"&gt;3.&lt;/span&gt; If test fails:
&lt;span class="p"&gt;   -&lt;/span&gt; Read the error message carefully
&lt;span class="p"&gt;   -&lt;/span&gt; Fix the code
&lt;span class="p"&gt;   -&lt;/span&gt; Re-run the test
&lt;span class="p"&gt;   -&lt;/span&gt; Repeat until ALL tests pass
&lt;span class="p"&gt;4.&lt;/span&gt; Do NOT proceed to the next function until current tests pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern repeats across all three phases. The subagent does not consider its work done until tests pass. This is the mechanism that produced the automated bug-fixing: the model was not doing anything clever; it was following the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model One: Qwen3-Coder-Next 80B
&lt;/h2&gt;

&lt;p&gt;The first model was Qwen3-Coder-Next: 80 billion total parameters, 3 billion active per forward pass, 512 experts with 10 activated per token. I ran an AWQ-quantized version under vLLM across both MI60s in tensor-parallel mode.&lt;/p&gt;

&lt;p&gt;The vLLM container command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--model cyankiwi/Qwen3-Coder-Next-AWQ-4bit
--tensor-parallel-size 2
--max-model-len 65536
--gpu-memory-utilization 0.95
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LiteLLM translated Claude API calls to OpenAI format. Pointing Claude Code at the local stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://feynman.cruver.network:4000"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-litellm-local-dev"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first attempt hit a hard wall at 32,768 tokens. Agentic sessions accumulate context fast: every file read, every tool call result, every test output appends to the history. By the time the model was writing the CLI module, the session had consumed 33,839 tokens and vLLM returned a context overflow error.&lt;/p&gt;

&lt;p&gt;I bumped the window to 65,536 and ran again. The data layer completed. The correlation engine completed. The test-fix loop worked as designed: the model generated code, ran the tests, read the failures, and corrected them. Five bugs were caught and fixed automatically before either module was signed off.&lt;/p&gt;

&lt;p&gt;The CLI phase did not finish. Not from another hard limit, but from what I started calling the context snowball problem. At 65,000 tokens, processing the incoming prompt takes several minutes per turn. The GPUs peg at 100%, but no generation is happening. Local inference does not have attention caching the way cloud providers do. Every turn reprocesses the entire conversation history from scratch. The model was not thinking; it was rereading.&lt;/p&gt;
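&lt;p&gt;The cost is easy to put numbers on. Without a prefix cache, the total prompt tokens prefilled over a session grow quadratically with the number of turns. A back-of-the-envelope model (my own rough arithmetic, not a measurement):&lt;/p&gt;

```python
def total_prompt_tokens(turns: int, tokens_per_turn: int, cached: bool = False) -> int:
    """Total prompt tokens prefilled across an agentic session.

    Without a prefix cache, turn k re-reads all prior turns plus its
    own input, so the total is quadratic in the turn count. With
    caching, only each turn's new tokens are processed.
    """
    if cached:
        return turns * tokens_per_turn
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

# 40 turns at ~1,500 new tokens each is only 60k of history, but an
# uncached server prefills 1.23M tokens over the whole session.
```
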

&lt;p&gt;The results from Qwen3-Coder-Next: the architecture was sound, the correlation math was correct, the Flux queries for InfluxDB were valid. The project just ran out of runway before finishing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tool-Calling Flag That Is Not Optional
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;--tool-call-parser qwen3_coder&lt;/code&gt; flag in vLLM matters. Without it, Qwen3-Coder-Next generates tool calls in a format Claude Code cannot parse. The model produces output that looks like a valid tool call, Claude Code processes it, and nothing happens. No error, no message; the tool call evaporates. Adding the flag fixed it immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model Switch
&lt;/h2&gt;

&lt;p&gt;Qwen3.5-35B-A3B was released while the Qwen3-Coder-Next experiment was still in progress. The naming is confusing, but it is a 35 billion parameter MoE with 3 billion active per token. Smaller headline count than Coder-Next, but one difference mattered more for this use case: it runs in GGUF format under llama.cpp, which made 131,072 tokens of context per instance practical.&lt;/p&gt;

&lt;p&gt;The bigger change was the GPU architecture. Instead of tensor-parallel across both GPUs for one model instance, the new configuration runs two independent llama-server instances, one per GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;llama-server-0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# GPU0: Sonnet tier, no thinking&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;HIP_VISIBLE_DEVICES=0&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;--ctx-size &lt;/span&gt;&lt;span class="m"&gt;131072&lt;/span&gt;
    &lt;span class="s"&gt;--flash-attn on&lt;/span&gt;
    &lt;span class="s"&gt;--jinja&lt;/span&gt;
    &lt;span class="s"&gt;--reasoning-budget &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;

&lt;span class="na"&gt;llama-server-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# GPU1: Opus tier, extended thinking&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;HIP_VISIBLE_DEVICES=1&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;--ctx-size &lt;/span&gt;&lt;span class="m"&gt;131072&lt;/span&gt;
    &lt;span class="s"&gt;--flash-attn on&lt;/span&gt;
    &lt;span class="s"&gt;--jinja&lt;/span&gt;
    &lt;span class="s"&gt;--reasoning-budget -1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LiteLLM routes &lt;code&gt;claude-opus-*&lt;/code&gt; requests to GPU1 and everything else to GPU0. Both instances serve the same GGUF file from a shared volume. Agentic coding sessions are mostly sequential, so one instance handles the full session while the other stays free for interactive chat.&lt;/p&gt;
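&lt;p&gt;In LiteLLM's proxy config, that split is a short model list. A sketch of the shape only; the hostnames and model names are illustrative, and wildcard-matching behavior varies across LiteLLM versions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;model_list:
  - model_name: "claude-opus-*"          # extended-thinking tier -&amp;gt; GPU1
    litellm_params:
      model: "openai/qwen3.5-35b-a3b"
      api_base: "http://llama-server-1:8080/v1"
  - model_name: "*"                      # everything else -&amp;gt; GPU0
    litellm_params:
      model: "openai/qwen3.5-35b-a3b"
      api_base: "http://llama-server-0:8080/v1"
&lt;/code&gt;&lt;/pre&gt;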

&lt;p&gt;Two flags turned out to be essential discoveries.&lt;/p&gt;

&lt;p&gt;The first is &lt;code&gt;--jinja&lt;/code&gt;. Without it, Qwen3.5 tool calls fail silently, the same failure mode as Qwen3-Coder-Next without &lt;code&gt;--tool-call-parser&lt;/code&gt;. The Jinja2-based chat template handles the tool call formatting that Claude Code expects. This is the llama.cpp equivalent of vLLM's &lt;code&gt;--tool-call-parser&lt;/code&gt; flag. It is not documented prominently.&lt;/p&gt;

&lt;p&gt;The second is Claude Code's &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; flag. For unattended agentic sessions over a remote connection, Claude Code's permission prompts appear in a TUI that isn't visible over SSH. The first Qwen3.5 session stalled for sixteen minutes while Claude Code waited for bash approval input: the code was written, the tests were queued, and the process was sleeping. The flag eliminates the prompts entirely.&lt;/p&gt;

&lt;p&gt;One other finding from monitoring the running sessions: Claude Code makes outbound HTTPS connections to Anthropic infrastructure even when &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; points to a local endpoint. Model API calls route correctly to the local LiteLLM proxy. Process authentication and telemetry go to Anthropic's servers directly. A valid Anthropic account is still required; fully offline deployments are not possible without network-level blocking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Two: Qwen3.5-35B-A3B Results
&lt;/h2&gt;

&lt;p&gt;Same project, same subagent prompt files, same InfluxDB target.&lt;/p&gt;

&lt;p&gt;Phase 1 (data layer): complete. The model wrote valid Flux queries for the InfluxDB client library on the first attempt. Flux syntax is unusual enough that I expected at least one hallucinated API call. There were none.&lt;/p&gt;

&lt;p&gt;Phase 2 (correlation engine): complete. 25 tests passing. Four bugs were caught and fixed by the test-fix loop: a numpy bool versus Python bool type mismatch, a NaN handling order issue before correlation, a sorting column reference error, and a list length mismatch in the test suite. The model found them, diagnosed them, and fixed them without any intervention.&lt;/p&gt;

&lt;p&gt;Phase 3 (CLI and visualization): incomplete. The test suite had a structural mock issue: the patch decorators targeted &lt;code&gt;fetch_weather&lt;/code&gt; and &lt;code&gt;fetch_metric&lt;/code&gt;, but &lt;code&gt;get_influx_client&lt;/code&gt; was called first in the execution path, so the mocks never fired. The production code itself was correct. Running the CLI against the live InfluxDB database: &lt;code&gt;list-metrics&lt;/code&gt; returned real metric types, &lt;code&gt;correlate glucose ketone&lt;/code&gt; computed a real Pearson coefficient, &lt;code&gt;correlate-all&lt;/code&gt; ranked glucose against all available weather metrics. The visualization module was not written before the session ended.&lt;/p&gt;

&lt;p&gt;One hallucination appeared in Phase 3: &lt;code&gt;df.to_json(default=str)&lt;/code&gt;. The &lt;code&gt;default&lt;/code&gt; parameter belongs to &lt;code&gt;json.dumps&lt;/code&gt;, not &lt;code&gt;DataFrame.to_json&lt;/code&gt;. The model conflated two similar APIs. It surfaced immediately under test and would not have survived a type checker either.&lt;/p&gt;
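&lt;p&gt;The confusion is understandable; the two APIs sit one keyword apart, and pandas does offer an equivalent escape hatch under a different name. A quick illustration:&lt;/p&gt;

```python
import json
import pandas as pd

df = pd.DataFrame({"when": [pd.Timestamp("2026-03-02")], "glucose": [92]})

# json.dumps takes `default`, called for objects it can't serialize:
json.dumps({"when": df["when"][0]}, default=str)

# DataFrame.to_json has no `default` kwarg; its analog is `default_handler`:
df.to_json(default_handler=str)

# What the model wrote raises TypeError (unexpected keyword argument):
# df.to_json(default=str)
```
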

&lt;h2&gt;
  
  
  What the Configuration Changes Actually Mean
&lt;/h2&gt;

&lt;p&gt;Context window size was the single most significant variable between the two runs. 65,536 tokens is not enough runway for a multi-module Python project built in a single agentic session. 131,072 is enough to mostly finish one, with the subagent pattern distributing the load across isolated context windows.&lt;/p&gt;

&lt;p&gt;The GPU split architecture change is also meaningful beyond raw numbers. Tensor-parallel across two GPUs keeps both GPUs active for every request, but introduces inter-GPU communication overhead for each forward pass. Two independent instances run each GPU at full throughput for its own requests, with LiteLLM load-balancing between them. For sequential agentic work, this is a better fit than tensor-parallel.&lt;/p&gt;

&lt;p&gt;The context snowball problem does not disappear with a wider window or a different model. It is structural: local inference without attention caching reprocesses the entire history on every turn. The mitigations are a wide enough context ceiling that the snowball does not hit it, and Claude Code's &lt;code&gt;/compact&lt;/code&gt; command to compress history mid-session when it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Things Stand
&lt;/h2&gt;

&lt;p&gt;Qwen3-Coder-Next showed that the agentic loop produces correct code and the subagent architecture handles multi-phase projects. It ran out of context before finishing.&lt;/p&gt;

&lt;p&gt;Qwen3.5 showed that with more headroom and the right flags, the same approach produces a mostly functional tool from a clean cold start: valid database queries, correct statistical analysis, a working CLI with real data, and only one meaningful hallucination across three phases.&lt;/p&gt;

&lt;p&gt;"Mostly functional" is an honest description. Phase 3 was incomplete. The test suite had fixable issues I had to understand to fix. The visualization module was not written. This is not a one-shot codebase generator. It is a capable assistant that gets significant work done and then needs a human to close the gap.&lt;/p&gt;

&lt;p&gt;That is better than where Part 1 left things. For a tool built mostly autonomously from a cold start, it is a more useful outcome than I expected.&lt;/p&gt;

&lt;p&gt;The same Qwen3.5 stack turned out to be useful beyond coding sessions. I have been running it as the backend for a fully local AI assistant, handling everything from daily digests to email triage to camera alerts. That is a different kind of test, and it deserves its own post.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>vllm</category>
      <category>selfhosted</category>
      <category>amd</category>
    </item>
    <item>
      <title>Enterprise Integration Patterns Aren't Dead; They're Running on Kubernetes and Orchestrating AI</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Tue, 24 Feb 2026 22:16:13 +0000</pubDate>
      <link>https://dev.to/dcruver/enterprise-integration-patterns-arent-dead-theyre-running-on-kubernetes-and-orchestrating-ai-30ol</link>
      <guid>https://dev.to/dcruver/enterprise-integration-patterns-arent-dead-theyre-running-on-kubernetes-and-orchestrating-ai-30ol</guid>
      <description>&lt;p&gt;Keip is a Kubernetes operator for Enterprise Integration Patterns. Gregor Hohpe and Bobby Woolf documented these patterns in 2003: content-based routers, message transformers, splitters, aggregators, dead letter channels. Spring Integration has implemented them for years, but deploying Spring Integration on Kubernetes has always been harder than it should be. Keip fixes that. It turns integration routes into native Kubernetes resources, and I use it to run an LLM-powered media analysis pipeline on my home cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I Built It
&lt;/h3&gt;

&lt;p&gt;The problem is deploying Spring Integration in Kubernetes. The traditional workflow is: write Java, compile, build a container image, push to a registry, deploy. For every route change, the whole cycle repeats. Adding a filter step, restructuring routing logic, or wiring in a new output channel is indistinguishable from an application code change: same build cycle, same container push, same deployment. The integration logic is buried inside a Java application, and the deployment artifact tells you nothing about what the route does or why.&lt;/p&gt;

&lt;p&gt;Keip separates the integration logic from the application lifecycle. The route XML is the source of truth. The operator handles everything else. The route definition is the documentation, because there's nothing else to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Does
&lt;/h3&gt;

&lt;p&gt;An integration route in Keip is an XML definition inside a Kubernetes custom resource. The operator reads it, creates a Spring Boot application, deploys it as a pod, and manages its lifecycle. Here's what a route resource looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keip.codice.org/v1alpha2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IntegrationRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npr-world-news&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitea.cruver.network/dcruver/signal-scope/keip-custom:dev&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;routeConfigMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npr-world-news-xml&lt;/span&gt;
  &lt;span class="na"&gt;propSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npr-world-news-props&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The route logic itself is Spring Integration XML. This one ingests an RSS feed, normalizes the content, deduplicates against the database, classifies topics, and persists to PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- RSS Inbound Channel Adapter --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;int-feed:inbound-channel-adapter&lt;/span&gt;
    &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"nprWorldNewsFeedAdapter"&lt;/span&gt;
    &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"${rss.feed.url:https://feeds.npr.org/1004/rss.xml}"&lt;/span&gt;
    &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"rawArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;auto-startup=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;int:poller&lt;/span&gt; &lt;span class="na"&gt;fixed-delay=&lt;/span&gt;&lt;span class="s"&gt;"${rss.poll.rate:300000}"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/int-feed:inbound-channel-adapter&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- Transform SyndEntry to Map --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;int:transformer&lt;/span&gt;
    &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"syndEntryTransformer"&lt;/span&gt;
    &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"rawArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"transformedRssChannel"&lt;/span&gt;
    &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"{ 'id': T(java.util.UUID).randomUUID().toString(),
                  'sourceUrl': payload.link,
                  'title': payload.title ?: 'Untitled',
                  'body': payload.description?.value ?: '',
                  'feedSource': 'npr-world-news' }"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- Deduplication Filter --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;int:filter&lt;/span&gt;
    &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"deduplicationFilter"&lt;/span&gt;
    &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"normalizedArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;output-channel=&lt;/span&gt;&lt;span class="s"&gt;"dedupedArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;discard-channel=&lt;/span&gt;&lt;span class="s"&gt;"duplicateArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"@deduplicationService.isNew(payload)"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- JDBC: Insert into database --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;int-jdbc:outbound-gateway&lt;/span&gt;
    &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"articlePersister"&lt;/span&gt;
    &lt;span class="na"&gt;request-channel=&lt;/span&gt;&lt;span class="s"&gt;"transformedArticlesChannel"&lt;/span&gt;
    &lt;span class="na"&gt;data-source=&lt;/span&gt;&lt;span class="s"&gt;"dataSource"&lt;/span&gt;
    &lt;span class="na"&gt;update=&lt;/span&gt;&lt;span class="s"&gt;"INSERT INTO articles
            (article_guid, source_url, title, body, feed_source)
            VALUES
            (:payload[id], :payload[sourceUrl], :payload[title],
             :payload[body], :payload[feedSource])"&lt;/span&gt;
    &lt;span class="na"&gt;keys-generated=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a real route from SignalScope, my media analysis system. RSS ingestion, transformation, deduplication, topic classification via an HTTP call to a separate service, and persistence to Postgres. All defined in XML, deployed as a Kubernetes resource.&lt;/p&gt;

&lt;p&gt;Changing the poll rate, adding a new feed source, or swapping the classification endpoint is a config edit. No Java compilation, no container rebuild. Update the ConfigMap, the operator reconciles, and the new behavior is live.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Spring Integration Brings
&lt;/h3&gt;

&lt;p&gt;Spring Integration has been around since 2007 and implements every pattern in the EIP book. But the part that matters for Keip is the connector library. Out of the box, Spring Integration provides adapters for HTTP, JMS, Kafka, AMQP, FTP, JDBC, RSS/Atom feeds, file systems, MQTT, TCP/UDP, mail, and many more. Each adapter is a few lines of XML.&lt;/p&gt;

&lt;p&gt;The RSS feed route above is a good example. The &lt;code&gt;int-feed:inbound-channel-adapter&lt;/code&gt; handles polling, parsing Atom and RSS formats, and tracking which entries have already been seen. The &lt;code&gt;int-jdbc:outbound-gateway&lt;/code&gt; handles connection pooling, parameterized queries, and transaction management. None of that is custom code. It's configuration pointing at well-tested library components.&lt;/p&gt;

&lt;p&gt;The pattern library is equally important. Content-based routers send messages down different paths based on payload or header inspection. Filters drop messages that don't meet criteria. Splitters break a single message into many; aggregators collect them back. Wire taps copy messages to a secondary channel for monitoring without affecting the main flow. Dead letter channels catch failures. All of these are declarative XML elements.&lt;/p&gt;

&lt;p&gt;The error handling deserves its own mention. Every channel in Spring Integration can have an error channel. Failed messages route automatically to error handlers, retry policies, or dead letter queues. In most hand-rolled LLM pipelines, error handling is an afterthought bolted on after the first production incident. With Spring Integration, it's built into the messaging model.&lt;/p&gt;

&lt;p&gt;Because each Keip container is a Spring Boot application, the entire Spring ecosystem is available inside it. Spring Data repositories, Spring Security, Micrometer metrics, Spring AI: anything that can be wired as a bean works here. There is no separate plugin system or adapter API to learn. The integration infrastructure is the application.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Kubernetes Brings
&lt;/h3&gt;

&lt;p&gt;Keip deploys each integration route as its own Kubernetes deployment. This is the fundamental scaling advantage.&lt;/p&gt;

&lt;p&gt;A route polling an RSS feed every five minutes needs minimal resources. A scoring worker running LLM inference on GPU hardware needs a completely different resource profile. In a traditional integration platform, both routes run inside the same application process and scale together. If the scoring worker needs more capacity, you scale the whole application, feed pollers included. If the scoring worker crashes, it can take everything with it. With Keip, they're independent deployments with their own resource limits, health checks, and scaling policies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keip.codice.org/v1alpha2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IntegrationRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scoring-worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitea.cruver.network/dcruver/signal-scope/keip-custom:dev&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;routeConfigMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scoring-worker-xml&lt;/span&gt;
  &lt;span class="na"&gt;propSources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scoring-worker-props&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scaling a route means changing the replica count or attaching a Horizontal Pod Autoscaler. Three scoring worker replicas means three articles pulled and scored in parallel without touching the feed pollers. Rolling updates to one route don't touch the others. If a scoring worker crashes, Kubernetes restarts it. If a feed poller falls behind, it can scale out independently. Standard Kubernetes primitives handle all of this without any custom orchestration layer.&lt;/p&gt;

&lt;p&gt;Observability comes free. Pod logs, Prometheus metrics, liveness and readiness probes all work through standard K8s tooling. No separate monitoring stack for the integration layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  SignalScope: The Real Test
&lt;/h3&gt;

&lt;p&gt;SignalScope processes content from 11 RSS feeds and 10 YouTube channels through LLM-powered scoring pipelines. Each feed source has its own integration route running as a separate Kubernetes deployment. The scoring workers are separate routes that pull unscored articles from the database and send them through a local vLLM instance.&lt;/p&gt;

&lt;p&gt;The scheduling is the part I like most. My GPUs serve other workloads during the day, so LLM scoring needs to run between 1 and 5 AM. I use the ControlBus pattern for this. ControlBus is one of the less well-known EIP patterns; it lets a system inspect and modify its own integration routes at runtime. The implementation is two channel adapters wired to a control bus: one starts the scoring adapter at 1 AM, the other stops it at 5 AM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;int:control-bus&lt;/span&gt; &lt;span class="na"&gt;input-channel=&lt;/span&gt;&lt;span class="s"&gt;"controlBusChannel"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;int:inbound-channel-adapter&lt;/span&gt;
    &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"controlBusChannel"&lt;/span&gt;
    &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"'@scoringInboundAdapter.start()'"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;int:poller&lt;/span&gt; &lt;span class="na"&gt;cron=&lt;/span&gt;&lt;span class="s"&gt;"0 0 1 * * *"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/int:inbound-channel-adapter&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;int:inbound-channel-adapter&lt;/span&gt;
    &lt;span class="na"&gt;channel=&lt;/span&gt;&lt;span class="s"&gt;"controlBusChannel"&lt;/span&gt;
    &lt;span class="na"&gt;expression=&lt;/span&gt;&lt;span class="s"&gt;"'@scoringInboundAdapter.stop()'"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;int:poller&lt;/span&gt; &lt;span class="na"&gt;cron=&lt;/span&gt;&lt;span class="s"&gt;"0 0 5 * * *"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/int:inbound-channel-adapter&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No external cron jobs. No schedulers. The integration infrastructure manages itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Argument for EIP in AI Pipelines
&lt;/h3&gt;

&lt;p&gt;The LLM pipelines I've built follow the same basic shape. Content arrives, gets routed somewhere based on its characteristics, passes through transformations between stages, and lands in different places depending on the outcome. Failures need to go somewhere for retry or human review. The instinct is to wire all of this together with custom scripts and ad hoc error handling.&lt;/p&gt;

&lt;p&gt;These are all patterns that Hohpe and Woolf named in 2003. The mapping is direct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An intent classifier that selects different logical branches based on the user's prompt is a &lt;strong&gt;content-based router&lt;/strong&gt;. Spring Integration's &lt;code&gt;router&lt;/code&gt; element handles the inspection and branching declaratively, routing to different channels based on payload content or message headers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An agent system that distributes subtasks to specialized workers and combines results is a &lt;strong&gt;splitter-aggregator&lt;/strong&gt;. The splitter breaks a message into parts, each part flows independently through the pipeline, and the aggregator collects them based on a correlation strategy. Spring Integration handles the correlation, timeout, and partial-result logic that most hand-rolled implementations get wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A pipeline stage that reshapes data between an API response and the next model's expected input format is a &lt;strong&gt;message transformer&lt;/strong&gt;. Instead of writing conversion code inline, the transformation is a declared step in the route with its own error handling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A model endpoint that stops responding or starts returning errors triggers a &lt;strong&gt;circuit breaker&lt;/strong&gt;. After a threshold of failures, the circuit opens and requests route to a fallback path automatically. When the endpoint recovers, the circuit closes. Spring Integration's &lt;code&gt;request-handler-advice-chain&lt;/code&gt; provides this out of the box.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failed inference calls that need retry logic land in a &lt;strong&gt;dead letter channel&lt;/strong&gt;. The message is preserved, the failure is logged, and a separate route can attempt reprocessing on a schedule or with different parameters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring an AI pipeline without modifying it uses a &lt;strong&gt;wire tap&lt;/strong&gt;. Messages are copied to a secondary channel for logging, metrics, or debugging while the primary flow continues unaffected.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference with Keip is that all of this runs as Kubernetes-native resources. The patterns come from Spring Integration, in production for over fifteen years. The scaling and lifecycle management come from Kubernetes. The operator connects them so that an AI pipeline is a set of declarative route definitions, not a pile of glue code.&lt;/p&gt;

&lt;p&gt;The patterns are already everywhere. Every product-specific node in n8n, every trigger in Zapier, every action in Make is a Channel Adapter, translating the source protocol into a message the rest of the integration can work with. The no-code integration industry built product businesses by packaging EIP patterns and giving them brand names.&lt;/p&gt;

&lt;p&gt;What's changed is that generating a Channel Adapter for anything is now trivial. An HTTP endpoint with no library support, a proprietary data format, a legacy system nobody has written an adapter for: describe it to an LLM and get back a working Spring Integration adapter. Keip's custom container support means those generated adapters plug directly into the same infrastructure managing scaling, health checks, and routing. The catalog goes from "whatever's in the library" to "whatever you can describe."&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;Keip is open source under Apache 2.0. The operator and framework are at &lt;a href="https://github.com/codice/keip" rel="noopener noreferrer"&gt;codice/keip&lt;/a&gt; on GitHub. Issues and contributions are welcome. If the ControlBus scheduling pattern or the EIP-to-AI mappings above look useful for something you're building, I'd like to hear about it.&lt;/p&gt;

&lt;p&gt;In a follow-up post on Kairos, I'll cover the AI-native components for Spring Integration that bring LLM-powered routing, transformation, and orchestration directly into integration routes.&lt;/p&gt;

</description>
      <category>java</category>
      <category>kubernetes</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I Ran Claude Code on Local LLMs for a Month. Here's What Worked for Me.</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Tue, 17 Feb 2026 04:02:12 +0000</pubDate>
      <link>https://dev.to/dcruver/i-ran-claude-code-on-local-llms-for-a-month-heres-what-worked-for-me-520h</link>
      <guid>https://dev.to/dcruver/i-ran-claude-code-on-local-llms-for-a-month-heres-what-worked-for-me-520h</guid>
      <description>&lt;p&gt;I have been running local LLMs on dual AMD MI60 GPUs for over a year, and recently pointed Claude Code at them to see how close I could get to frontier-quality agentic coding on my own hardware. I wanted to know: how close can local models actually get to Anthropic's Claude for building real software?&lt;/p&gt;

&lt;p&gt;This is the first post in a series documenting that experiment. Future posts will cover specific agentic coding sessions, hardware and model setup, and the tools I am building to close the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My setup:&lt;/strong&gt; Dual AMD MI60 GPUs with 64GB of total VRAM, running models through vLLM and LiteLLM. If that sounds unfamiliar, I have written about &lt;a href="https://hullabalooing.cruver.ai/gpu-ai/posts/an-affordable-ai-server-1768704467/" rel="noopener noreferrer"&gt;the hardware&lt;/a&gt;, &lt;a href="https://hullabalooing.cruver.ai/gpu-ai/posts/mi60-hardware-setup" rel="noopener noreferrer"&gt;the MI60 setup&lt;/a&gt;, and &lt;a href="https://hullabalooing.cruver.ai/gpu-ai/posts/claude-code-local-llms-1768966311/" rel="noopener noreferrer"&gt;getting Claude Code working with local models&lt;/a&gt;. The short version: about $1,000 in used enterprise GPUs (two MI60s at $500 each on eBay), enough VRAM to run 70-80B parameter models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Tasks Work. Complex Ones Do Not.
&lt;/h2&gt;

&lt;p&gt;I got Claude Code working against a local Qwen3-Coder-Next:80B MoE model. It successfully scaffolded a complete Flask application with SQLite, modern CSS, and responsive design. The agentic workflow created files, ran commands, iterated on errors. For small, well-scoped tasks it is genuinely usable.&lt;/p&gt;

&lt;p&gt;The key phrase there is "small, well-scoped."&lt;/p&gt;

&lt;p&gt;Building a large codebase with interdependent modules, maintaining consistency across dozens of files, recovering from errors mid-task: this is where local models break down for me. The individual completions are often fine. But the agentic loop, where the model plans, executes, evaluates, and iterates, requires a level of sustained coherence that I have not been able to get from local models yet.&lt;/p&gt;

&lt;p&gt;What surprised me is that scaling up parameters does not straightforwardly fix this. A 70B model produces fewer bugs than a 7B model, certainly. But the jump from "can scaffold a small app" to "can architect and build a complex multi-module project" has not materialized for me yet, even with the largest models I can run locally. The gap between local and frontier for agentic coding is not incremental. It is qualitative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Gap Exists
&lt;/h2&gt;

&lt;p&gt;The gap stems from several compounding technical factors, none of which explain it alone.&lt;/p&gt;

&lt;p&gt;That 80B model is a Mixture of Experts architecture. It has 80 billion total parameters, but only about 3 billion are active for any given token; the rest sit idle. It is 80B of storage, not 80B of thinking. I have a dense Llama 3.3 70B running on the same hardware doing about 26 tokens per second, compared to about 35 for the MoE. The MoE is faster because it is doing less work per token, but the dense model uses all 70 billion parameters on every forward pass. That is roughly 23 times more compute per token.&lt;/p&gt;

&lt;p&gt;I have not yet run the dense 70B through Claude Code on the same agentic tasks. That experiment is next on the list. If the results are dramatically better, it confirms that active parameter count matters more than headline size. If they are not, it tells me the gap is more about training and alignment than raw compute. I will cover the hardware setup and benchmarks in detail in an upcoming post.&lt;/p&gt;
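&lt;p&gt;The "23 times" figure falls out of a standard rule of thumb: a forward pass costs roughly two FLOPs per active parameter per token, so the compute ratio is just the ratio of active parameter counts:&lt;/p&gt;

```python
# Rule of thumb: ~2 FLOPs per active parameter per generated token.
dense_active = 70e9   # Llama 3.3 70B: every parameter active per token
moe_active = 3e9      # Qwen3-Coder-Next 80B MoE: ~3B active per token

dense_flops = 2 * dense_active
moe_flops = 2 * moe_active
print(dense_flops / moe_flops)  # ~23.3x more compute per token
```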

&lt;p&gt;Training investment widens the gap further. Frontier labs invest orders of magnitude more compute into training than open-source model creators. More compute means more passes over more data, which translates directly into better generalization, fewer blind spots, and stronger reasoning chains. The data matters as much as the compute; frontier labs invest heavily in data curation, filtering, decontamination, and synthetic data generation with proprietary pipelines that smaller labs cannot replicate.&lt;/p&gt;

&lt;p&gt;Post-training alignment is where reasoning quality really separates. Frontier labs do extensive reinforcement learning from human feedback, constitutional AI training, and iterative red-teaming. This is not just safety work. It is what teaches the model to reason carefully, follow complex multi-step instructions, and maintain coherence over long outputs. Open-source models get some of this, but not at the same depth or investment level.&lt;/p&gt;

&lt;p&gt;Then there is quantization. My local MoE runs at Q4_0 precision via llama.cpp; my dense Llama 3.3 70B uses AWQ 4-bit via vLLM. Both are lossy compressions of the original weights. On simple tasks the loss is negligible. On complex reasoning where subtle weight differences compound across layers, quantization degrades output quality in ways that are difficult to measure but easy to feel. The model becomes slightly less precise at every step, and over a long chain of reasoning those small losses accumulate.&lt;/p&gt;

&lt;p&gt;Context window is another moving target. The Qwen3-Coder-Next supports 256K tokens natively; my dense Llama 3.3 70B supports 128K. But the ceiling keeps rising; the latest generation has jumped to 1M tokens. Longer context means the model can hold an entire codebase in memory at once.&lt;/p&gt;

&lt;p&gt;No single one of these factors explains the gap on its own. Less active compute per token, less training investment, less alignment work, lossy weight compression, and smaller context windows all stack on top of each other. Each one costs a few percentage points of quality. Together they produce that qualitative difference between a local model that works 70% of the time and a frontier model that works 95% of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Gap
&lt;/h2&gt;

&lt;p&gt;I am not done experimenting. The most immediate test is running my dense Llama 3.3 70B through Claude Code on a real agentic task, the same kind of multi-module project that the MoE struggled with.&lt;/p&gt;

&lt;p&gt;Beyond that, Claude Code recently introduced agent teams, where a lead agent decomposes a task and delegates sub-tasks to specialist agents that work in parallel. If teams decompose complex projects into smaller, focused units, each individual agent stays within what a 70B model handles well. The lead agent coordinates, but each worker has a narrower scope and a shorter context. I have not tested this with local models yet, but the architecture is promising because it plays to local model strengths rather than fighting their weaknesses.&lt;/p&gt;

&lt;p&gt;There is also the Ralph Wiggum technique: running an AI coding agent in an autonomous loop against a specification until all tasks pass. A model that succeeds 70% of the time on a given task might seem unreliable. But a model that gets three attempts at the same task, with the ability to see its own errors, has a much higher effective success rate. The loop compensates for inconsistency with persistence. Combined with agent teams, the approach becomes: decompose the project into small tasks, let agents loop on each one until it passes, and coordinate the results. I plan to write about this experiment in a future post.&lt;/p&gt;
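&lt;p&gt;The retry arithmetic behind that claim, assuming independent attempts (in practice the agent sees its own errors between attempts, so real retries should do better than independent ones):&lt;/p&gt;

```python
def effective_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts

print(effective_success(0.70, 1))  # ~0.70
print(effective_success(0.70, 3))  # ~0.973
print(effective_success(0.70, 5))  # ~0.998
```

&lt;p&gt;Three attempts turn a 70% model into a 97% loop, which is the entire bet behind running agents against a specification until the tasks pass.&lt;/p&gt;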

&lt;p&gt;The third angle is giving local models better memory. Right now, Claude Code starts every session cold. It reads the codebase but has no record of past decisions, failed approaches, or architectural reasoning from previous sessions. An agent that remembers "we tried approach X and it failed because of Y" is dramatically more useful than one that suggests approach X again. I have started building a tool called project-brain that gives coding agents persistent project context: searchable documentation from codebases using local embeddings, session memory, decision logs, and pre-session context assembly. A local model that can search its own project history is working with more information than one that starts fresh every time. Memory does not make the model smarter, but it might make it effective enough to close part of the gap.&lt;/p&gt;
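&lt;p&gt;To make the idea concrete, here is a toy sketch of the retrieval half: store past decisions, embed them, and recall the closest ones before a session starts. The bag-of-words vectors below are a stand-in for real local embeddings, and the decision-log entries are invented, not project-brain's actual data:&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag of words. A real implementation would use a
    # local sentence-embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    def norm(v):
        return math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Invented decision-log entries of the kind a memory tool would store.
decisions = [
    "tried sqlite for session storage, failed on concurrent writers",
    "switched feed parsing to feedparser after rome dropped entries",
    "chose influxdb line protocol over json ingestion for throughput",
]

def recall(query: str, k: int = 1) -> list[str]:
    ranked = sorted(decisions,
                    key=lambda d: cosine(embed(query), embed(d)),
                    reverse=True)
    return ranked[:k]

print(recall("should we use sqlite to store session state?"))
# the sqlite decision ranks first
```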

&lt;p&gt;These three ideas (teams, loops, and memory) are all attempts to solve the same underlying problem from different angles. I do not know yet which combination will work, or whether any of them will close the gap enough to matter. But that is the experiment I am running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Am Now
&lt;/h2&gt;

&lt;p&gt;For agentic coding, I still use frontier Claude for anything complex. Local models handle small, well-scoped tasks. That is the honest answer today.&lt;/p&gt;

&lt;p&gt;But every month the local options get a little better. New models drop, quantization methods improve, context windows grow, and the scaffolding tools get more sophisticated. I plan to be there when the gap closes, because closing it is the whole point of this series.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>vllm</category>
      <category>ai</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>I Built an AI Health Coach That Actually Knows Me</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Wed, 11 Feb 2026 15:11:03 +0000</pubDate>
      <link>https://dev.to/dcruver/i-built-an-ai-health-coach-that-actually-knows-me-2hji</link>
      <guid>https://dev.to/dcruver/i-built-an-ai-health-coach-that-actually-knows-me-2hji</guid>
      <description>&lt;p&gt;Every health app wants to coach you, but none of them know you. They don't know your baseline. They don't know that your blood pressure spikes after certain meals, or that your ketones tank when you're stressed. They offer generic advice based on population averages and call it personalized.&lt;/p&gt;

&lt;p&gt;So I built my own. It runs on my infrastructure, stores everything in a time-series database I control, and uses Claude to understand what I'm eating, estimate the macros, and give me context on every measurement I log. Not generic advice. Insights based on &lt;em&gt;my&lt;/em&gt; data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build This?
&lt;/h2&gt;

&lt;p&gt;For years, my lab results have told the same story: HDL too low, LDL too high, triglycerides elevated. The kind of numbers that make doctors reach for the prescription pad.&lt;/p&gt;

&lt;p&gt;A decade ago, I tried keto and saw good results. My labs improved dramatically. But life happened, I fell off, and the numbers crept back. Now I'm trying again, but this time I want data. Real data. Not scattered across six apps, but unified, queryable, and analyzed by something smart enough to spot patterns I'd miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Fits In
&lt;/h2&gt;

&lt;p&gt;This isn't just a database with a chatbot bolted on. Claude is woven into every interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meal Tracking Without the Tedium
&lt;/h3&gt;

&lt;p&gt;Traditional calorie tracking is tedious. You either scan barcodes (hope you're eating packaged food), search through databases (is this "grilled chicken breast" or "chicken breast, grilled, skinless"?), or give up and guess.&lt;/p&gt;

&lt;p&gt;I just describe what I ate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Breakfast: 3 eggs scrambled with cheese and butter, 4 strips of bacon, coffee with heavy cream"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude parses that, estimates the macros (protein, fat, carbs, calories), and records everything to InfluxDB. Then it responds with a summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Logged breakfast: ~640 cal, 48g fat, 42g protein, 2g net carbs. Solid keto-friendly start. You're at 640/1800 calories for the day."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No searching. No scanning. Just natural language and an AI that understands food.&lt;/p&gt;
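&lt;p&gt;Under the hood, a logged meal just becomes a time-series point. A minimal sketch of the write, with illustrative measurement and field names rather than my actual schema:&lt;/p&gt;

```python
# Sketch: serializing AI-estimated macros as an InfluxDB line-protocol point.
# Measurement, tag, and field names are illustrative, not the real schema.

def meal_to_line_protocol(meal: str, macros: dict, timestamp_ns: int) -> str:
    """Build a line-protocol string: measurement,tags fields timestamp."""
    fields = ",".join(f"{k}={v}" for k, v in macros.items())
    # Spaces and commas in tag values must be backslash-escaped
    meal_tag = meal.replace(" ", "\\ ").replace(",", "\\,")
    return f"meals,meal={meal_tag} {fields} {timestamp_ns}"

point = meal_to_line_protocol(
    "breakfast",
    {"calories": 640, "fat_g": 48, "protein_g": 42, "net_carbs_g": 2},
    1739286663000000000,
)
print(point)
# meals,meal=breakfast calories=640,fat_g=48,protein_g=42,net_carbs_g=2 1739286663000000000
```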

&lt;h3&gt;
  
  
  Context on Every Measurement
&lt;/h3&gt;

&lt;p&gt;When I log my morning ketones or blood pressure, Claude doesn't just acknowledge the number. It puts it in context:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Ketones 2.2, up from your 7-day average of 1.9. You've been consistently above 2.0 for three days now. Whatever you're doing is working."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Blood pressure 128/82. That's higher than your recent trend. Yesterday was 118/76. Anything different today?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The feedback is immediate. When I can see that my ketones dropped after a particular meal, I think twice about eating it again. That's the behavior change that apps promising "insights" never delivered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ask Questions, Get Answers
&lt;/h3&gt;

&lt;p&gt;Claude has access to query the database directly. So I can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How's my blood pressure trending this month?"&lt;/p&gt;

&lt;p&gt;"What did I eat last time my ketones dropped below 0.5?"&lt;/p&gt;

&lt;p&gt;"Show me my average daily carbs for the past week."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It pulls the data, analyzes it, and responds in plain English. No dashboards to click through. No exports to spreadsheets. Just a conversation with something that has my complete health history.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rest of the Stack
&lt;/h2&gt;

&lt;p&gt;Claude handles the intelligence. Here's what handles the plumbing:&lt;/p&gt;

&lt;h3&gt;
  
  
  Matrix for Input
&lt;/h3&gt;

&lt;p&gt;I use Element as my daily chat client anyway. Rather than building a separate app, I set up a bot that accepts health data via natural conversation. Quick, frictionless, always accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  n8n for Routing
&lt;/h3&gt;

&lt;p&gt;n8n is a self-hosted workflow automation tool. Think Zapier, but on your own infrastructure. It receives messages from Matrix, routes them to Claude for processing, and handles the database writes. It also manages weekly summaries and could easily be extended to send anomaly alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  InfluxDB for Storage
&lt;/h3&gt;

&lt;p&gt;All metrics land in InfluxDB, a time-series database. Perfect for health data: everything has a timestamp, queries are usually by time range, and calculating rolling averages is trivial.&lt;/p&gt;
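&lt;p&gt;For example, a 7-day moving average of morning ketones is a few lines of Flux. The bucket, measurement, and field names here are illustrative, not my actual schema:&lt;/p&gt;

```python
# Sketch of a Flux query for a 7-day moving average of ketone readings.
# Bucket/measurement/field names are assumptions for illustration.
bucket = "health"
query = f'''
from(bucket: "{bucket}")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "ketones" and r._field == "mmol")
  |> aggregateWindow(every: 1d, fn: mean)
  |> timedMovingAverage(every: 1d, period: 7d)
'''
print(query)
```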

&lt;h3&gt;
  
  
  Grafana for Visualization
&lt;/h3&gt;

&lt;p&gt;When I want the big picture (trends over weeks or months, overlaid metrics, drill-downs into specific periods), Grafana turns the raw data into dashboards. The 7-day moving averages smooth out daily noise and make real trends visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Tracking
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Meals:&lt;/strong&gt; Natural language descriptions, AI-estimated macros, stored for correlation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blood ketones:&lt;/strong&gt; Morning measurements to verify I'm staying in ketosis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blood pressure and pulse:&lt;/strong&gt; Watching for improvements as weight drops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight:&lt;/strong&gt; Weekly, to track trends without daily obsession&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lab work:&lt;/strong&gt; Periodic lipid panels via &lt;a href="https://ownyourlabs.com" rel="noopener noreferrer"&gt;OwnYourLabs&lt;/a&gt;, which lets you order tests without a doctor's visit&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Apps?
&lt;/h2&gt;

&lt;p&gt;The health app graveyard is full of services that customers depended on until the plug got pulled. Jawbone went bankrupt and bricked thousands of fitness trackers. Pebble got acquired and shut down. Google keeps retreating from health features. Every one of these left users with devices that no longer served their purpose and data they couldn't easily export.&lt;/p&gt;

&lt;p&gt;Even when the apps survive, they're silos. You've got one app for meal tracking, another for weight, a third for workouts, and none of them talk to each other. Want to correlate your food intake with your lab results? Most health apps don't even have a place to store lab data. You end up juggling multiple services, manually cross-referencing, and hoping none of them get acquired or discontinued.&lt;/p&gt;

&lt;p&gt;With my setup, everything flows into one database I control. If a tool stops working, I replace it. The data stays.&lt;/p&gt;

&lt;h2&gt;
  
  
  On Data and Privacy
&lt;/h2&gt;

&lt;p&gt;Health data is personal. I want it consolidated, not scattered across a dozen app companies with a dozen privacy policies.&lt;/p&gt;

&lt;p&gt;With this architecture, the data lives on my servers. Claude processes it during conversations, but Anthropic's API doesn't retain prompts or use them for training. That's a meaningful difference from apps that permanently store your data and mine it for their own purposes.&lt;/p&gt;

&lt;p&gt;For those who want complete sovereignty, the architecture works with local LLMs too. I wrote about &lt;a href="https://dev.to/dcruver/running-claude-code-with-local-llms-via-vllm-and-litellm-30il"&gt;running Claude Code with local models via vLLM&lt;/a&gt;. The same infrastructure powers this system, and swapping to a fully self-hosted model should be a simple configuration change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Early Results
&lt;/h2&gt;

&lt;p&gt;It's been a few weeks, and having an AI that actually knows my history changes behavior more than any notification ever did. It can connect today's meal to tomorrow's ketone reading. The feedback is immediate, personal, and relevant.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're interested in running Claude on your own infrastructure, I wrote about &lt;a href="https://dev.to/dcruver/running-claude-code-with-local-llms-via-vllm-and-litellm-30il"&gt;setting up Claude Code with local LLMs via vLLM&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>health</category>
      <category>quantifiedself</category>
    </item>
    <item>
      <title>Running Claude Code with Local LLMs via vLLM and LiteLLM</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Thu, 05 Feb 2026 02:12:49 +0000</pubDate>
      <link>https://dev.to/dcruver/running-claude-code-with-local-llms-via-vllm-and-litellm-599b</link>
      <guid>https://dev.to/dcruver/running-claude-code-with-local-llms-via-vllm-and-litellm-599b</guid>
      <description>&lt;p&gt;Every query to Claude Code means sending my source code to Anthropic's servers. For proprietary codebases, that's a non-starter. With vLLM and LiteLLM, I can point Claude Code at my own hardware - keeping my code on my network while maintaining the same workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The trick is that Claude Code expects the Anthropic Messages API, but local inference servers speak OpenAI's API format. LiteLLM bridges this gap. It accepts Anthropic-formatted requests and translates them to OpenAI format for my local vLLM instance.&lt;/p&gt;

&lt;p&gt;The stack looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Local GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One environment variable makes it work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code now sends all requests to my local LiteLLM proxy, which forwards them to vLLM running my model of choice.&lt;/p&gt;
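&lt;p&gt;Stripped to its essentials, the translation looks roughly like this. This is a simplified sketch of the reshaping, not LiteLLM's implementation; the real proxy also handles tool calls, streaming, and structured content blocks:&lt;/p&gt;

```python
# Simplified sketch of the Anthropic-to-OpenAI payload reshaping that the
# proxy performs. Real translation covers much more (tools, streaming, etc.).

def anthropic_to_openai(payload: dict) -> dict:
    messages = []
    if "system" in payload:
        # Anthropic keeps the system prompt outside the messages array;
        # OpenAI format expects it as the first message.
        messages.append({"role": "system", "content": payload["system"]})
    messages.extend(payload["messages"])
    return {
        "model": payload["model"],
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

req = anthropic_to_openai({
    "model": "claude-sonnet",
    "system": "You are a coding assistant.",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Write a haiku about YAML."}],
})
print(req["messages"][0]["role"])  # system
```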

&lt;h2&gt;
  
  
  The vLLM Configuration
&lt;/h2&gt;

&lt;p&gt;I'm running Qwen3-Coder 30B A3B, a Mixture-of-Experts model with 30 billion total parameters but only 3 billion active per forward pass. The AWQ quantization brings memory requirements down enough to split it across my dual MI60 GPUs using tensor parallelism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vllm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
    &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/kfd:/dev/kfd&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/card1:/dev/dri/card1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/card2:/dev/dri/card2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/renderD128:/dev/dri/renderD128&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/renderD129:/dev/dri/renderD129&lt;/span&gt;
    &lt;span class="na"&gt;shm_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16g&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HIP_VISIBLE_DEVICES=0,1&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;python&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vllm.entrypoints.openai.api_server&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-model-len&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;65536"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.9"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;qwen3_coder&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--enable-auto-tool-choice&lt;/code&gt; and &lt;code&gt;--tool-call-parser qwen3_coder&lt;/code&gt; flags are essential for agentic use. They let the model emit tool calls that Claude Code expects.&lt;/p&gt;
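&lt;p&gt;What the parser has to produce is a standard OpenAI-style tool call that LiteLLM can translate back for Claude Code. A sketch of that structure with illustrative values (the tool name and fields here are examples, not a spec):&lt;/p&gt;

```python
import json

# Sketch of the structured tool call the parser must emit. vLLM converts
# the model's text-format tool invocation into this shape; the tool name
# and argument fields below are illustrative examples.
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "Write",
        # OpenAI format carries arguments as a JSON-encoded string
        "arguments": json.dumps({"file_path": "app.py",
                                 "content": 'print("hello")'}),
    },
}

args = json.loads(tool_call["function"]["arguments"])
print(tool_call["function"]["name"], args["file_path"])  # Write app.py
```

Without the parser flags, the model's tool invocations stay embedded in plain text, and Claude Code never sees them as actionable calls.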

&lt;h2&gt;
  
  
  The LiteLLM Translation Layer
&lt;/h2&gt;

&lt;p&gt;LiteLLM maps Claude model names to the local vLLM endpoint. The wildcard pattern catches any model Claude Code requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-*&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hosted_vllm/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vllm:8000/v1&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed"&lt;/span&gt;
    &lt;span class="na"&gt;model_info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;65536&lt;/span&gt;
      &lt;span class="na"&gt;max_input_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;57344&lt;/span&gt;
      &lt;span class="na"&gt;max_output_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;drop_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;request_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
  &lt;span class="na"&gt;modify_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;disable_key_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few settings to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;drop_params: true&lt;/code&gt; silently ignores Anthropic-specific parameters that don't translate to OpenAI format&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;modify_params: true&lt;/code&gt; allows LiteLLM to adjust parameters as needed for the target API&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;disable_key_check: true&lt;/code&gt; skips API key validation since we're running locally&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Usage
&lt;/h2&gt;

&lt;p&gt;With everything running, Claude Code works exactly as normal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;

&lt;span class="nb"&gt;cd &lt;/span&gt;my-project
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The experience is nearly identical to using Anthropic's API, with a few caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Token throughput&lt;/strong&gt;: My dual MI60 setup does roughly 25-30 tokens/second with ~175ms time-to-first-token, slower than Anthropic's hosted API, though with no rate limiting, no queue times, and no network latency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Context limits&lt;/strong&gt;: I cap at 64K tokens. Claude Opus can handle 200K.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Model capability&lt;/strong&gt;: Qwen3-Coder is excellent for coding tasks, but Claude has broader knowledge and better instruction following.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The upside is obvious: zero API costs, complete data sovereignty, and the ability to run Claude Code on air-gapped networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agentic File Creation
&lt;/h2&gt;

&lt;p&gt;The real test of Claude Code compatibility isn't chat. It's whether the model can create files, run commands, and iterate on a codebase. The &lt;code&gt;--tool-call-parser qwen3_coder&lt;/code&gt; flag handles the translation between Qwen's XML-style tool calls and the OpenAI tool format that LiteLLM expects.&lt;/p&gt;

&lt;p&gt;To verify this works end-to-end, I asked Claude Code to build a complete Flask application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;

&lt;span class="nb"&gt;cd&lt;/span&gt; /tmp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir &lt;/span&gt;flask-test &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;flask-test
claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Build a Flask todo app with SQLite persistence, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   modern UI with gradients and animations, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   mobile responsive design, and full CRUD operations."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model created a complete project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flask_todo_app/
├── app.py              # Flask routes and SQLite setup
├── requirements.txt    # Dependencies
├── run_app.sh          # Launch script
├── static/
│   ├── css/
│   │   └── style.css   # Gradients, animations, hover effects
│   └── js/
│       └── script.js   # Client-side interactions
└── templates/
    └── index.html      # Jinja2 template with responsive layout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated &lt;code&gt;app.py&lt;/code&gt; includes proper SQLite initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url_for&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;todos.db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE IF NOT EXISTS todos
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  task TEXT NOT NULL,
                  completed BOOLEAN DEFAULT FALSE)&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;init_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;todos.db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SELECT id, task, completed FROM todos ORDER BY id DESC&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;todos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;index.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;todos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;todos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CSS includes gradients, glass-morphism effects, and animations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;body&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;'Poppins'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sans-serif&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;linear-gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;135deg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;#667eea&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;#764ba2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nl"&gt;min-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100vh&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nc"&gt;.container&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;800px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="nb"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nc"&gt;.header&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;text-align&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;center&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;40px&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="no"&gt;white&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;text-shadow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;2px&lt;/span&gt; &lt;span class="m"&gt;4px&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After activating the venv and running the app, everything works. Add a task, toggle it complete, delete it. The database persists across restarts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbya612bceb7hwon93yxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbya612bceb7hwon93yxz.png" alt="Flask Todo App generated by Claude Code with local LLM" width="800" height="591"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;flask_todo_app
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
python app.py
&lt;span class="c"&gt;# Visit http://localhost:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full generation took about five minutes across multiple agentic iterations. Each file is a separate tool call: the model generates, Claude Code executes, the result comes back, and the model plans the next step. The 91% prefix cache hit rate shows vLLM efficiently reusing context across the multi-turn loop.&lt;/p&gt;

&lt;p&gt;This confirms the agentic workflow functions correctly. The model reads the prompt, plans a file structure, emits tool calls to create directories and write files, and produces a functional application. All inference happens locally on the MI60s. No code leaves my network.&lt;/p&gt;

&lt;p&gt;I have not yet tested this on a larger codebase. A small Flask app is one thing; a multi-thousand-line refactor is another. The 64K context limit will eventually become a constraint, and I expect the model to struggle with complex architectural decisions that the real Claude handles gracefully. For now, this works well for focused, scoped tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Model
&lt;/h2&gt;

&lt;p&gt;For Claude Code compatibility, you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strong tool use&lt;/strong&gt;: The model must emit structured tool calls reliably&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Code focus&lt;/strong&gt;: Qwen3-Coder works well; DeepSeek Coder and CodeLlama variants should also be viable&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sufficient context&lt;/strong&gt;: I used 64K; smaller context windows may work but I haven't tested them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my testing, Qwen3-Coder-30B-A3B handles straightforward coding tasks well. For complex refactoring or architectural decisions, the real Claude API is still the better choice.&lt;/p&gt;

&lt;p&gt;If you don't have 64GB of VRAM, smaller models like Qwen2.5-Coder-7B or Qwen3-8B should fit on a single 16GB or 24GB card. I haven't tested these configurations, so I can't speak to their context limits or how well they handle Claude Code's agentic workflows.&lt;/p&gt;
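&lt;p&gt;An untested sketch of what a single-GPU invocation might look like, mirroring the flags above minus tensor parallelism. For Qwen2.5-family models, the &lt;code&gt;hermes&lt;/code&gt; tool-call parser is what vLLM documents, but I haven't verified this configuration myself:&lt;/p&gt;

```shell
# Untested sketch: serving a smaller coder model on a single 16-24GB card.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```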

&lt;p&gt;In any case, the key is adjusting your workflow: instead of broad "refactor this module" prompts, break the work into tighter, more focused requests. More prompts of narrower scope play to a smaller model's strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Stack
&lt;/h2&gt;

&lt;p&gt;The full configuration lives in a single compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vllm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/kfd:/dev/kfd&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/card1:/dev/dri/card1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/card2:/dev/dri/card2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/renderD128:/dev/dri/renderD128&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/dev/dri/renderD129:/dev/dri/renderD129&lt;/span&gt;
    &lt;span class="na"&gt;group_add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;44"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;992"&lt;/span&gt;
    &lt;span class="na"&gt;shm_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16g&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/cache/huggingface:/root/.cache/huggingface:rw&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HIP_VISIBLE_DEVICES=0,1&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;python&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vllm.entrypoints.openai.api_server&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-model-len&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;65536"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.9"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--host&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;qwen3_coder&lt;/span&gt;

  &lt;span class="na"&gt;litellm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm/litellm:v1.80.15-stable&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./litellm-config.yaml:/app/config.yaml:ro&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--config&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/app/config.yaml&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--host&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
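
&lt;p&gt;The compose file mounts &lt;code&gt;./litellm-config.yaml&lt;/code&gt; but doesn't show it. Here's a minimal sketch of what such a config can look like; the alias under &lt;code&gt;model_name&lt;/code&gt; is illustrative, and the &lt;code&gt;openai/&lt;/code&gt; prefix is LiteLLM's way of routing to any OpenAI-compatible endpoint, which is what vLLM exposes:&lt;/p&gt;

```yaml
# litellm-config.yaml -- illustrative sketch, not the exact file from
# this setup. model_name is the alias clients request; litellm_params
# routes it to vLLM's OpenAI-compatible server inside the compose network.
model_list:
  - model_name: qwen3-coder
    litellm_params:
      model: openai/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      api_base: http://vllm:8000/v1
      api_key: "none"   # vLLM doesn't check keys by default
```

&lt;p&gt;Whatever alias you choose here is the model name your clients (including Claude Code, if you override its model setting) should ask for.&lt;/p&gt;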



&lt;p&gt;Start it with &lt;code&gt;nerdctl&lt;/code&gt; (or &lt;code&gt;docker&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nerdctl compose &lt;span class="nt"&gt;-f&lt;/span&gt; coder.yaml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From any machine on my network, I can point Claude Code at Feynman (my GPU workstation) and get local inference. When I'm done, I tear the stack down with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nerdctl compose &lt;span class="nt"&gt;-f&lt;/span&gt; coder.yaml down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
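
&lt;p&gt;Pointing Claude Code at the proxy is a matter of environment variables. A sketch of what that looks like from a client machine; the hostname is mine, and the token only has to be non-empty unless you configured real keys in LiteLLM:&lt;/p&gt;

```shell
# Send Claude Code's API traffic to the LiteLLM proxy instead of
# Anthropic. "feynman" is my GPU workstation; substitute your host.
export ANTHROPIC_BASE_URL="http://feynman:4000"
# LiteLLM still expects an Authorization header; any non-empty value
# works unless the proxy is configured with real keys.
export ANTHROPIC_AUTH_TOKEN="local-dummy-key"
claude
```

&lt;p&gt;If the proxy exposes an alias rather than an Anthropic model name, &lt;code&gt;ANTHROPIC_MODEL&lt;/code&gt; can override which model Claude Code requests.&lt;/p&gt;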



&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;This setup won't replace the Claude API for everyone. If you need maximum capability, Anthropic's hosted models are still the best option. But for those of us who care about where our code goes, local inference means complete data sovereignty. Proprietary code never leaves my network. Plus there's something satisfying about seeing your own GPUs light up every time you ask Claude Code a question.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>vllm</category>
      <category>selfhosted</category>
      <category>ai</category>
    </item>
    <item>
      <title>A Second Brain That My AI and I Share</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Mon, 02 Feb 2026 14:58:25 +0000</pubDate>
      <link>https://dev.to/dcruver/a-second-brain-that-my-ai-and-i-share-1747</link>
      <guid>https://dev.to/dcruver/a-second-brain-that-my-ai-and-i-share-1747</guid>
      <description>&lt;p&gt;My AI assistant and I share the same brain. Not metaphorically, we literally read and write to the same knowledge base. When I update a project note in Emacs, Nabu (my AI) sees the change. When Nabu logs something it learned, I see it in my daily file. This post is about how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Nabu connects to my self-hosted Matrix server and has access to my org-roam knowledge base via an MCP server I built. But it's not just access; it's &lt;em&gt;integration&lt;/em&gt;. Nabu treats org-roam as its source of truth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before answering, search.&lt;/strong&gt; When I ask about a project or a person, Nabu searches org-roam first. It doesn't hallucinate details; it looks them up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning gets recorded.&lt;/strong&gt; When Nabu learns something worth keeping (a decision we made, a fix we applied), it writes it to the knowledge base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory.&lt;/strong&gt; We have a context that persists across conversations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, while writing this post, I asked Nabu to find examples of our shared brain in action. It searched the org-roam daily notes and memory files, looking for instances where it had looked something up to help me. The example it found was that very interaction.&lt;/p&gt;

&lt;p&gt;Another example: when I asked Nabu to troubleshoot my camera alert system, I did not explain how it worked. Nabu searched org-roam and found the Camera Alerts Pipeline note with the camera ID mapping, MQTT topics, container names, and alert rules. It diagnosed the issue and fixed the configuration without me having to re-explain my setup.&lt;/p&gt;

&lt;p&gt;Similarly, when my Home Assistant dashboard had gone stale, I asked Nabu to fix it up. It found the sensor names, device configurations, and integration details in my notes, then updated the dashboard accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Configuration
&lt;/h2&gt;

&lt;p&gt;The magic is in how you instruct the agent. In OpenClaw, this happens via workspace files that the agent reads at startup.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; tells the agent its role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Org-Roam Knowledge Base&lt;/span&gt;

Org-roam is your primary knowledge base. Search it before answering
questions about projects, people, or decisions. Update it when you
learn something worth keeping.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MEMORY.md&lt;/code&gt; establishes the relationship:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Org-Roam Knowledge Base: My Primary Role&lt;/span&gt;

I am the live interface for Don's org-roam second brain. I can read,
search, create, edit, and reorganize notes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent internalizes these instructions. It knows where to look and what to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Why Emacs
&lt;/h2&gt;

&lt;p&gt;I haven't tried every note-taking tool out there. I used TriliumNext for a while. I've used Evernote. I looked at Obsidian and a few others, but they either weren't self-hostable or they locked me into their UI. I keep coming back to Emacs and org-mode for a few reasons.&lt;/p&gt;

&lt;p&gt;First, my notes are just text files. No proprietary database, no cloud lock-in, no wondering if the company will exist in five years. I can read them with &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt; them, version control them with &lt;code&gt;git&lt;/code&gt;. And importantly, LLMs can read them too. When I paste a note into Claude, it just works. No export step, no format conversion.&lt;/p&gt;

&lt;p&gt;Second, org-roam adds what plain org-mode lacks: backlinks and a graph structure. When I mention a person or project, it becomes a link. Over time, connections emerge that I didn't plan. The structure grows organically from the content.&lt;/p&gt;

&lt;p&gt;Third, Emacs is programmable in a way that no other tool matches. When I want a new workflow, I write it. When something annoys me, I fix it. The tool bends to how I think, not the other way around.&lt;/p&gt;

&lt;p&gt;Finally, there's data sovereignty. My notes live on my machine and sync via &lt;code&gt;git&lt;/code&gt; to my own server. No cloud service has a copy. No company can discontinue access. This matters more to me the longer I do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pieces
&lt;/h2&gt;

&lt;p&gt;Two components make the shared brain possible:&lt;/p&gt;

&lt;h3&gt;
  
  
  org-roam-second-brain (Emacs Package)
&lt;/h3&gt;

&lt;p&gt;Vanilla org-roam is great, but I wanted more structure. This package adds structured node types (people, projects, ideas, admin tasks), each with its own template. It adds semantic search via vector embeddings stored in the org files themselves, generated locally with no cloud APIs. And it adds proactive surfacing: a daily digest that shows active projects, stale items, pending follow-ups, and dangling links.&lt;/p&gt;

&lt;h3&gt;
  
  
  org-roam-mcp (Python Server)
&lt;/h3&gt;

&lt;p&gt;The MCP server exposes 30 tools via JSON-RPC: search (semantic, contextual, keyword), CRUD operations, task state management, and surfacing functions. It runs locally on port 8001. Any tool that can make HTTP requests can interact with my knowledge base, including AI agents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"jsonrpc":"2.0","id":1,"method":"tools/call",
       "params":{"name":"semantic_search",
                 "arguments":{"query":"container networking"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
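
&lt;p&gt;The same call works from any language. Here's a minimal Python sketch of the request above; the tool name and port come from this setup, but the helper names are mine, not part of the server:&lt;/p&gt;

```python
import json
import urllib.request

MCP_URL = "http://localhost:8001"  # org-roam-mcp, as configured above

def build_request(name, arguments, req_id=1):
    """Build a JSON-RPC 2.0 tools/call payload for the MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

def call_tool(name, arguments):
    """POST the payload and decode the JSON-RPC response."""
    data = json.dumps(build_request(name, arguments)).encode()
    req = urllib.request.Request(
        MCP_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# e.g. call_tool("semantic_search", {"query": "container networking"})
```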



&lt;h2&gt;
  
  
  How to Set This Up
&lt;/h2&gt;

&lt;p&gt;If you want to replicate this, the process is straightforward:&lt;/p&gt;

&lt;p&gt;Install the Emacs package via straight.el:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight common_lisp"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;straight-use-package&lt;/span&gt;
  &lt;span class="o"&gt;'&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;org-roam-second-brain&lt;/span&gt; &lt;span class="ss"&gt;:host&lt;/span&gt; &lt;span class="nv"&gt;github&lt;/span&gt; &lt;span class="ss"&gt;:repo&lt;/span&gt; &lt;span class="s"&gt;"dcruver/org-roam-second-brain"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="ss"&gt;'org-roam-second-brain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="ss"&gt;'org-roam-api&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run an embedding server: either Infinity with Docker (&lt;code&gt;docker run -d -p 8080:7997 michaelf34/infinity:latest --model-id nomic-ai/nomic-embed-text-v1.5&lt;/code&gt;) or Ollama (&lt;code&gt;ollama pull nomic-embed-text&lt;/code&gt;).&lt;/p&gt;
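
&lt;p&gt;Infinity serves an OpenAI-compatible API, so it's worth a quick smoke test before wiring anything else up. The &lt;code&gt;/embeddings&lt;/code&gt; path is an assumption based on that compatibility; adjust it if your version routes differently:&lt;/p&gt;

```shell
# Ask the embedding server for a vector. A JSON response containing an
# "embedding" array means the model is loaded and serving.
curl -s http://localhost:8080/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-ai/nomic-embed-text-v1.5", "input": ["hello world"]}'
```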

&lt;p&gt;Start the MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;org-roam-mcp
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;EMACS_SERVER_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/emacs-server/server
org-roam-mcp &lt;span class="nt"&gt;--port&lt;/span&gt; 8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure your agent to use org-roam as its source of truth. The specifics depend on your agent framework. See the &lt;a href="https://github.com/dcruver/org-roam-second-brain/blob/main/SETUP.md" rel="noopener noreferrer"&gt;full setup guide&lt;/a&gt; for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This setup is not for everyone. It requires comfort with Emacs, willingness to run local services, and some patience for configuration. But for me, it solves real problems: my notes are mine forever, in a format that will outlast any company. I can find things by meaning, not just keywords. My AI assistant and I share the same context. The knowledge compounds over time.&lt;/p&gt;

&lt;p&gt;If any of that resonates, the code is &lt;a href="https://github.com/dcruver/org-roam-second-brain" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;. For the AI integration, check out &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;. Take what's useful, ignore the rest.&lt;/p&gt;

</description>
      <category>emacs</category>
      <category>ai</category>
      <category>productivity</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>An Affordable AI Server</title>
      <dc:creator>Donald Cruver</dc:creator>
      <pubDate>Sat, 31 Jan 2026 17:44:29 +0000</pubDate>
      <link>https://dev.to/dcruver/an-affordable-ai-server-3dba</link>
      <guid>https://dev.to/dcruver/an-affordable-ai-server-3dba</guid>
      <description>&lt;p&gt;Two AMD MI60s from eBay cost me about $1,000 total and gave me 64GB of VRAM. That's enough to run Llama 3.3 70B at home with a 32K context window.&lt;/p&gt;

&lt;p&gt;When I started looking into running large language models locally, the obvious limiting factor was VRAM. Consumer GPUs top out at 24GB, and even that requires an RTX 4090. I wanted to run 70B-parameter models locally, on hardware I own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Datacenter Castoff, Homelab Treasure
&lt;/h2&gt;

&lt;p&gt;The MI60 is a 2018 server GPU that AMD built for datacenters. It has 32GB of HBM2 memory, the same high-bandwidth memory you find in modern AI accelerators, and you can pick one up for around $500 on eBay. Two of them give you 64GB of VRAM, more than enough for Llama 3.3 70B.&lt;/p&gt;

&lt;p&gt;One problem: they're passively cooled cards designed for server chassis with serious airflow. Plug one into a regular PC case and it'll thermal throttle within minutes. I ended up 3D printing a duct and running a push-pull configuration: a 120mm fan inside blowing air across the heatsinks, and a 92mm fan on the rear pulling hot air out. A custom fan-controller script keeps the fans in sync with GPU utilization, maintaining junction temps around 80°C instead of the 97°C I saw before I figured out cooling.&lt;/p&gt;
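
&lt;p&gt;The controller script itself is tied to my specific fans and headers, so here's just the core idea: map junction temperature to a PWM duty cycle. This is a hypothetical sketch; the thresholds and the 30% floor are illustrative, not my actual values:&lt;/p&gt;

```python
def fan_duty(temp_c, idle=35, full=85):
    """Map junction temperature (deg C) to a PWM duty cycle (0-255).

    At or below `idle` the fans hold a 30% floor to keep air moving;
    at `full` and above they run flat out; in between, duty ramps
    linearly with temperature.
    """
    floor, ceil = 77, 255                      # 77/255 is roughly 30%
    frac = min(max((temp_c - idle) / (full - idle), 0.0), 1.0)
    return int(floor + frac * (ceil - floor))
```

&lt;p&gt;In practice the loop polls the temperature (e.g. via &lt;code&gt;rocm-smi&lt;/code&gt;) every few seconds and writes the result to the fan's hwmon &lt;code&gt;pwm&lt;/code&gt; node; those paths vary by motherboard.&lt;/p&gt;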

&lt;h2&gt;
  
  
  Why Not Just Use NVIDIA?
&lt;/h2&gt;

&lt;p&gt;NVIDIA has better software support, more documentation, and CUDA is everywhere. But the MI60 has 32GB of HBM2. An RTX 3090 has 24GB of GDDR6X and costs significantly more on the secondary market. The MI60 gives me more memory for less money, and for inference workloads, that memory matters more than raw compute throughput. The MI60's HBM2 delivers higher theoretical memory bandwidth than GDDR6X. For inference, which is memory-bound, that helps. The tradeoff: with two cards doing tensor parallelism, PCIe becomes the bottleneck.&lt;/p&gt;
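
&lt;p&gt;The "memory-bound" claim is easy to sanity-check with arithmetic: during decode, each generated token has to stream the card's resident weights from VRAM once. A back-of-envelope bound, using the MI60's spec-sheet bandwidth (~1,024 GB/s) and a rough ~35GB for a 4-bit-quantized 70B; both are ballpark assumptions, not measurements:&lt;/p&gt;

```python
def tokens_per_sec_bound(bandwidth_gbs, weights_gb):
    """Ideal decode ceiling for memory-bound inference: one full pass
    over the weights resident on the card per generated token."""
    return bandwidth_gbs / weights_gb

# Two-way tensor parallelism puts ~17.5GB of the ~35GB on each card,
# and both cards stream their half concurrently.
ceiling = tokens_per_sec_bound(1024, 35 / 2)
```

&lt;p&gt;That works out to roughly 59 tokens/sec in the ideal case, against the ~26 I actually see for Llama 3.3 70B, which is consistent with PCIe synchronization eating much of the theoretical headroom.&lt;/p&gt;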

&lt;p&gt;The software situation is workable, with caveats. The MI60 uses AMD's gfx906 architecture. AMD stopped actively developing for it, but backward compatibility carries forward. I'm running ROCm 6.3 without issues. The upside is that years of bug fixes have made the platform stable. I'm building on well-established code.&lt;/p&gt;

&lt;p&gt;vLLM has been my best experience. I tried Ollama first, but performance was noticeably worse and tensor parallelism across both GPUs wasn't as smooth. vLLM gives me better speeds, but switching models isn't as simple as Ollama's pull-and-run. I built a solution for that, which I'll cover in another post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can It Actually Do?
&lt;/h2&gt;

&lt;p&gt;Here are some real numbers from my setup, running vLLM with AWQ-quantized models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;th&gt;GPUs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;~90&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 32B&lt;/td&gt;
&lt;td&gt;~31&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;td&gt;~26&lt;/td&gt;
&lt;td&gt;2 (tensor parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 8B and 32B models respond quickly, and even the 70B is very usable.&lt;/p&gt;

&lt;p&gt;Most dual-GPU consumer setups max out at 48GB. Two MI60s give you 64GB for around $1,000. You'll need to solve cooling (see above), but it's a one-time fix.&lt;/p&gt;

&lt;p&gt;I'll be writing more about this setup: the cooling solution, the software stack, and how I switch between model configurations. Spoiler: Stable Diffusion still locks up the GPU, and I haven't gotten Whisper working yet.&lt;/p&gt;

&lt;p&gt;The MI60 isn't the only option: there are MI50s, MI100s, and various NVIDIA Tesla cards floating around the secondary market. Whatever you pick, weigh memory capacity, compute, and software support against the price.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>llm</category>
      <category>amd</category>
    </item>
  </channel>
</rss>
