<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nic Lydon</title>
    <description>The latest articles on DEV Community by Nic Lydon (@niclydon).</description>
    <link>https://dev.to/niclydon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908081%2F03e413b1-9c57-4561-9b6e-3ee8b60bc188.png</url>
      <title>DEV Community: Nic Lydon</title>
      <link>https://dev.to/niclydon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/niclydon"/>
    <language>en</language>
    <item>
      <title>25 Months of Waiting, 12 Hours of Work</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Mon, 04 May 2026 19:02:59 +0000</pubDate>
      <link>https://dev.to/niclydon/25-months-of-waiting-12-hours-of-work-14ch</link>
      <guid>https://dev.to/niclydon/25-months-of-waiting-12-hours-of-work-14ch</guid>
      <description>&lt;p&gt;For two years, the ring whispered to me.&lt;/p&gt;

&lt;p&gt;Not literally. But every few weeks, an email would arrive. A Kickstarter update. A shipping delay. A manufacturing setback. A promise that it was almost ready. Each one a small reminder that somewhere in South Korea, a team was trying to fit a microphone, a Bluetooth radio, and an IMA ADPCM audio codec into a titanium band that fits on your finger.&lt;/p&gt;

&lt;p&gt;This is the story of the WIZPR Ring: how I found it, how I waited for it, and how I reverse-engineered its entire undocumented BLE protocol the night it arrived.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Whisper
&lt;/h2&gt;

&lt;p&gt;In February 2024, I put down a $5 deposit on a pre-launch page for something called the WHSP Ring. A voice-interaction wearable. Press a button on your finger, whisper a command, and an AI assistant on your phone processes it. The form factor was the thing that caught me. Not another watch, not another earbud. A ring.&lt;/p&gt;

&lt;p&gt;A month later, they renamed it. "WHSP RING is becoming WIZPR RING," the email said. The Kickstarter launched March 20th. I backed it March 21st.&lt;/p&gt;

&lt;p&gt;The campaign funded successfully. 1,084 backers, $163K raised. Surveys went out. I picked my size, my color, my shipping address.&lt;/p&gt;

&lt;p&gt;And then the waiting began.&lt;/p&gt;

&lt;h2&gt;
  
  
  43 Updates
&lt;/h2&gt;

&lt;p&gt;If you've ever backed hardware on Kickstarter, you know the arc. The early updates are optimistic. Tooling has begun. CNC machining looks great. The app is coming along.&lt;/p&gt;

&lt;p&gt;Then reality sets in.&lt;/p&gt;

&lt;p&gt;Update #10 (August 2024): "We regret to inform you that the promised delivery date has arrived, but we have not yet shipped your orders."&lt;/p&gt;

&lt;p&gt;Update #12 (September 2024): Antenna redesign required. Hardware changes.&lt;/p&gt;

&lt;p&gt;Update #15 (November 2024): "Important Update on Shipping Delays and Our Sincere Apology."&lt;/p&gt;

&lt;p&gt;Update #24 (August 2025): "Shipping was promised for Q3 2024, yet we are now almost a full year late."&lt;/p&gt;

&lt;p&gt;Update #36 (December 2025): "It breaks our hearts and fills us with a deep sense of regret to think that many of you supported our project with the hope of receiving the Wizpr Ring as a gift."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo4104xo52liolgapmpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo4104xo52liolgapmpp.png" alt="Kickstarter Notifications" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Forty-three updates over twenty-five months. Antenna issues, titanium PVD coating problems, a charging pin redesign, a full manufacturing partner switch. Each email another whisper. Still here. Still coming. Not yet.&lt;/p&gt;

&lt;p&gt;I never asked for a refund. The ring had its hooks in me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Saturday, 3:53 PM
&lt;/h2&gt;

&lt;p&gt;On May 2nd, 2026, my building's package notification system sent me an email: "A package for Nicholas has arrived to the package room."&lt;/p&gt;

&lt;p&gt;The carrier was YunTrack, with a last-mile handoff to GOFO. Twenty-five months and eleven days after I backed it on Kickstarter, the ring was in my hands.&lt;/p&gt;

&lt;p&gt;I charged the case. I paired it to my phone. I opened the official WIZPR app, pressed the button, spoke a command, and watched it work.&lt;/p&gt;

&lt;p&gt;And then I did what any reasonable person would do.&lt;/p&gt;

&lt;p&gt;I opened my laptop and started taking the ring apart, digitally.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Was Known
&lt;/h2&gt;

&lt;p&gt;The first thing I did was search. Someone, somewhere, had to have looked at this thing over BLE already.&lt;/p&gt;

&lt;p&gt;I found &lt;a href="https://github.com/R-D-BioTech-Alaska/Wizpr-Suite" rel="noopener noreferrer"&gt;R-D-BioTech-Alaska/Wizpr-Suite&lt;/a&gt; on GitHub. A small project that had done the genuinely hard first step: figuring out how to connect to the ring over BLE at all, building a GATT inspector, wiring up the &lt;code&gt;bleak&lt;/code&gt; Python library on macOS, and recognizing that the ring's protocol was completely undocumented. Their framing was clear: the path forward is user-controlled reverse engineering.&lt;/p&gt;

&lt;p&gt;I forked it, cloned it, and started reading.&lt;/p&gt;

&lt;p&gt;The ring uses a single BLE service with seven characteristics. Some notify. Some accept writes. None of them are documented by the manufacturer. The official iOS app connects, does its thing, and doesn't explain how. The repo had the scaffolding to connect and listen. What it didn't have was a map of what the ring was actually saying.&lt;/p&gt;

&lt;p&gt;That became the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overnight
&lt;/h2&gt;

&lt;p&gt;By 3:34 PM, I had my first commit: fixing dataclass decorators in the forked code so it would actually run. By 3:56 PM, the BLE scanner was filtering for WIZPR RING devices, connecting, and dumping GATT characteristics to the console.&lt;/p&gt;

&lt;p&gt;What I found was surprisingly clean. The ring speaks plain ASCII text on characteristic &lt;code&gt;00000007&lt;/code&gt;. Press the button, and it sends &lt;code&gt;CLICK&lt;/code&gt;. Raise your hand to your mouth, and it sends &lt;code&gt;MIC_PRE_ON&lt;/code&gt;, then &lt;code&gt;MIC_ON&lt;/code&gt;. Lower your hand, &lt;code&gt;MIC_OFF&lt;/code&gt;. Send it &lt;code&gt;BATTERY&lt;/code&gt; and it replies with the voltage. Send &lt;code&gt;GET_VERSION&lt;/code&gt; and it tells you its firmware version. Four commands in, six notifications out. No binary protocol, no handshake, no session negotiation. Connect, subscribe, listen.&lt;/p&gt;
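The notification grammar is simple enough that a client can dispatch on the first token. A minimal sketch, using the event names listed above; the BATTERY payload in the example call is illustrative, not a captured value:

```python
# Sketch: dispatch on the ring's plain-ASCII notifications from
# characteristic 00000007. Event names mirror the observed protocol;
# the argument format in the BATTERY example below is illustrative.
KNOWN_EVENTS = {"CLICK", "MIC_PRE_ON", "MIC_ON", "MIC_OFF", "BATTERY", "VER"}

def parse_notification(data):
    text = data.decode("ascii", errors="replace").strip()
    event, _, arg = text.partition(" ")
    if event not in KNOWN_EVENTS:
        return ("UNKNOWN", text)
    return (event, arg or None)

parse_notification(b"CLICK")           # ('CLICK', None)
parse_notification(b"BATTERY 3.9(V)")  # ('BATTERY', '3.9(V)')
```

Anything unrecognized is passed through whole rather than dropped, which is the behavior you want while a protocol is still being mapped.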

&lt;p&gt;The audio was the interesting part. While the mic is on, characteristic &lt;code&gt;00000001&lt;/code&gt; streams a steady 35.4 packets per second, each one 224 bytes. The question was: what codec?&lt;/p&gt;

&lt;p&gt;I wrote a hypothesis tester. Captured a session of myself speaking, saved every packet timestamped in a JSON file, then ran the same data through every plausible decoder: Opus, mu-law, A-law, raw PCM at various rates, and IMA ADPCM at 8 kHz and 16 kHz. Most produced noise. One produced my voice.&lt;/p&gt;

&lt;p&gt;IMA ADPCM, 16 kHz, mono, continuous state across packets. 224 bytes per packet gives you 448 samples at 4-bit depth, which is exactly 28 milliseconds of audio per BLE notification. The key detail that cost me an hour: the decoder state has to carry across packets. Reset it per-notification and you get static. Keep it running and you get clean speech.&lt;/p&gt;
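That decoder fits in a few dozen lines. A sketch using the standard IMA step and index tables, with the predictor and step index held on the object so they survive across notifications; the low-nibble-first byte order follows the common IMA convention and is an assumption a capture comparison would confirm:

```python
# Pure-Python IMA ADPCM decoder. The crucial property for the ring's
# stream: predictor and index persist across packets, so construct one
# decoder per session, not one per BLE notification. Bitwise steps are
# written as // and % for clarity (and identical results).
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707,
    1878, 2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871,
    5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8]

class ImaAdpcmDecoder:
    def __init__(self):
        self.predictor = 0  # last decoded sample; stream has no header
        self.index = 0      # position in STEP_TABLE

    def _decode_nibble(self, nibble):
        step = STEP_TABLE[self.index]
        diff = step // 8
        if nibble % 2:
            diff += step // 4
        if (nibble // 2) % 2:
            diff += step // 2
        if (nibble // 4) % 2:
            diff += step
        if nibble // 8:  # sign bit
            self.predictor -= diff
        else:
            self.predictor += diff
        self.predictor = max(-32768, min(32767, self.predictor))
        self.index = max(0, min(88, self.index + INDEX_TABLE[nibble % 8]))
        return self.predictor

    def decode(self, packet):
        samples = []
        for byte in packet:
            samples.append(self._decode_nibble(byte % 16))   # low nibble first
            samples.append(self._decode_nibble(byte // 16))  # then high
        return samples
```

Feed each 224-byte notification to the same instance and you get 448 samples back per packet; constructing a fresh decoder per packet is exactly the reset-per-notification mistake that produces static.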

&lt;p&gt;By 10:30 PM, the audio codec was identified and documented. By midnight, I'd built a guided capture tool with a PySide6 UI, a standalone probing script for interactive characteristic exploration, and a ring daemon that holds a persistent BLE connection and accepts commands over a named pipe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2qgu08dc2yr3nxbbcar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2qgu08dc2yr3nxbbcar.png" alt="Dark living room with laptop in foreground, TV on in background and dark grey cat nearby" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Between midnight and 4 AM, I ran a systematic probe campaign on the unmapped write-only characteristics. Four of them silently accept arbitrary data with no observable effect. One controls the ring's purple LED, but only indirectly: the LED fires on BLE connection and can't be triggered independently. The ring has no vibration motor. No haptic feedback channel. No way to signal the wearer from software.&lt;/p&gt;

&lt;p&gt;At 4:05 AM, I closed the probe campaign and wrote the documentation. The protocol was fully mapped.&lt;/p&gt;

&lt;p&gt;At 5:09 AM, I started a new repo. A native macOS menubar app in Swift, consuming the protocol I'd just reverse-engineered. Hand-rolled IMA ADPCM decoder (Apple's built-in AudioToolbox does ADPCM, but it expects Apple's variant with 34-byte frames, not the ring's 224-byte continuous stream). By mid-morning, the Mac app had a working BLE client with auto-reconnect, tested ADPCM decoder, and an audio pipeline design spec.&lt;/p&gt;

&lt;p&gt;I went to a two-year-old's Cars-themed birthday party that afternoon. There were checkered flags and Lightning McQueen balloons. I sang happy birthday. I had not slept.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh08op9jwpoip75j2dqzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh08op9jwpoip75j2dqzf.png" alt="POV - Sitting at table eating cake at Lightning McQueen themed birthday" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Ring Actually Is
&lt;/h2&gt;

&lt;p&gt;Here's the complete protocol, because someone searching for this in eighteen months deserves to find it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ring → Phone (notifications on char 00000007)
CLICK           button pressed
MIC_PRE_ON      raise-to-speak gesture detected
MIC_ON          mic active, audio streaming on char 00000001
MIC_OFF         mic deactivated
BATTERY N(V)    battery voltage response
VER XXXX        firmware version response

Phone → Ring (writes to char 00000007)
LOCK            disable ring input (hard mute)
BATTERY         query battery level
GET_VERSION     query firmware version
RESET           reboot ring (kills BLE connection)

Audio (char 00000001, while MIC_ON)
Codec:    IMA ADPCM, 4-bit, 16 kHz mono
Frame:    224 bytes = 448 samples = 28 ms
Rate:     ~35.4 packets/second
State:    continuous across packets (do NOT reset per notification)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pairing required. No authentication. No session handshake. If you can see it, you can talk to it.&lt;/p&gt;

&lt;p&gt;The ring accepts exactly one BLE connection at a time. If your iPhone has the WIZPR app running, the ring is connected to it and won't advertise. Disconnect the phone first.&lt;/p&gt;
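With the protocol above, a complete capture client is connect, subscribe, decode, write. The last step takes nothing beyond the standard library; a minimal sketch, assuming samples holds signed 16-bit integers from an ADPCM decoder:

```python
import wave

# Sketch: persist decoded ring audio using only the stdlib wave module.
# `samples` is assumed to be a list of signed 16-bit ints, i.e. the
# output of an IMA ADPCM decoder running on the 16 kHz mono stream.
def write_wav(path, samples, rate=16000):
    pcm = b"".join(s.to_bytes(2, "little", signed=True) for s in samples)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)     # mono
        f.setsampwidth(2)     # 16-bit PCM
        f.setframerate(rate)  # the ring streams at 16 kHz
        f.writeframes(pcm)
```

WAV wants little-endian PCM, so the samples are packed explicitly rather than relying on the machine's native byte order.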

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The WIZPR Ring ships with an app that routes your voice through their cloud for processing. That's fine. It works. But the ring itself is just a microphone and a button on your finger with a BLE radio. There's no reason the audio has to go through their servers.&lt;/p&gt;

&lt;p&gt;With the protocol mapped, the ring becomes a general-purpose voice input device. A tactile, always-on-your-hand trigger for anything that can listen to a BLE characteristic and decode ADPCM audio. For me, that means feeding it into my own local AI stack. For someone else, it might mean accessibility tooling, or voice-triggered home automation, or a wearable dictation device that never touches the cloud.&lt;/p&gt;

&lt;p&gt;The official app is one client. Now anyone can write another.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Repo
&lt;/h2&gt;

&lt;p&gt;Everything is at &lt;a href="https://github.com/niclydon/wizpr-tools" rel="noopener noreferrer"&gt;niclydon/wizpr-tools&lt;/a&gt;. The protocol reference, the audio codec documentation, the capture tool, the probing scripts, and a 50-line quickstart that connects to the ring and records a WAV file.&lt;/p&gt;

&lt;p&gt;It wouldn't exist without the upstream work from &lt;a href="https://github.com/R-D-BioTech-Alaska/Wizpr-Suite" rel="noopener noreferrer"&gt;R-D-BioTech-Alaska/Wizpr-Suite&lt;/a&gt;. They did the hard part. I mapped what they found.&lt;/p&gt;




&lt;p&gt;The ring is on my desk right now, sitting on its charging cradle, its purple LED dark. It's not connected to anything. But I know it's listening for a connection, cycling through its advertisement packets every few seconds, waiting for someone to subscribe.&lt;/p&gt;

&lt;p&gt;It waited twenty-five months to reach me. I couldn't put it down.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>iot</category>
      <category>wearables</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>I have NFC implants in both hands. I only installed one of them myself.</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Mon, 04 May 2026 14:55:24 +0000</pubDate>
      <link>https://dev.to/niclydon/i-have-nfc-implants-in-both-hands-i-only-installed-one-of-them-myself-21ep</link>
      <guid>https://dev.to/niclydon/i-have-nfc-implants-in-both-hands-i-only-installed-one-of-them-myself-21ep</guid>
      <description>&lt;p&gt;I have two NFC chip implants. One in each hand. I put the one in my left hand in myself, on a Saturday afternoon in January 2024, after spending two or three minutes holding the needle against my skin and slowly working out the exact spot. The one in my right hand I drove an hour out to Worcester for, paid a guy $150, and held still while he did the install in a tattoo and piercing studio.&lt;/p&gt;

&lt;p&gt;Same technology, more or less. Two completely different installation paths. There's a reason for that, and the reason is also the reason one of them has a blue LED that lights up when I scan it and the other one doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first implant: left hand, just in case
&lt;/h2&gt;

&lt;p&gt;I'm right-handed. So when I ordered my first implant in January 2024 (a Dangerous Things &lt;code&gt;NExT&lt;/code&gt;, which is a dual-chip module containing both a 125 kHz RFID transponder and a 13.56 MHz NTAG216 NFC chip), I knew it was going in my left hand. If something went wrong, if I hit a tendon, if the chip migrated weird, if I got an infection, I wanted my dominant hand to be the unaffected one.&lt;/p&gt;

&lt;p&gt;The original plan was to find someone to install it for me. I called a tattoo and piercing place a friend recommended. They said no, it wasn't legal in Massachusetts for a body piercer to perform implant procedures. I called a guy in Worcester from the Dangerous Things partner list. Same answer. He told me to look at Connecticut or Rhode Island.&lt;/p&gt;

&lt;p&gt;I'm not driving to a different state for this. So I watched a YouTube video of someone doing a self-install end to end, grabbed some alcohol wipes, and decided to do it myself. As I told a friend at the time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I might end up doing it myself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few hours later:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I implanted a chip in my hand today. Do you want to see the video?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I filmed the whole thing on my phone. The injector that comes with the NExT kit is essentially a very large gauge hypodermic needle preloaded with the bioglass chip. The needle is wide. It is also, as I noted afterward in real time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;That was a very large gauge needle and not very sharp.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=NDz6vDsVTAY" rel="noopener noreferrer"&gt;Youtube: Self implanting NExT RFID &amp;amp; NFC chip&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The actual install was less dramatic than I expected. I cleaned the area, marked the entry point, pinched the skin between thumb and forefinger of my right hand, and pressed the needle against my left hand at a shallow angle. Then I waited. I held it there for two or three minutes, adjusting my grip, making sure I had the exact angle and depth I wanted, before committing.&lt;/p&gt;

&lt;p&gt;That part isn't in any of the videos. The videos cut from "needle approaches skin" to "needle in skin" without showing the slow buildup of confidence. But that buildup is most of what makes the difference between a clean install and a bad one. Once I was certain, the needle pushed right in. Almost no blood. The whole thing took maybe ninety seconds of actual contact, surrounded by an hour of preparation.&lt;/p&gt;

&lt;h2&gt;
  
  
  A planning constraint most people don't have
&lt;/h2&gt;

&lt;p&gt;One detail in the timing that's specific to me: I take Humira (adalimumab) every other Friday for an autoimmune condition. Humira is an immunosuppressant, which means the days right after each dose are not the days you want to be healing a fresh wound. So I scheduled the self-install for the off weekend, the one when my immune system would be at full strength.&lt;/p&gt;

&lt;p&gt;That's the kind of planning constraint that doesn't show up in any biohacking blog because most biohackers aren't on biologics. But if you're reading this and you're on any kind of immunosuppressive therapy, it's worth thinking about. Tiny puncture, but still a puncture, and your healing window matters.&lt;/p&gt;

&lt;p&gt;The healing took about ten days. There was some bandage irritation, a tiny scab, and a faint bruise. By the end of the first week I could already read both chips through the bandage. The badge clone for my office front door was working before the wound was fully closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part where the medical system doesn't have a checkbox for this
&lt;/h2&gt;

&lt;p&gt;Before ordering the second implant, I wanted to confirm the first one was sitting where it should be. Seemed responsible. I asked my doctor for an x-ray of my left hand.&lt;/p&gt;

&lt;p&gt;She was not thrilled. I don't think she'd encountered this before, and her initial response made it pretty clear she wasn't sure what to do with a patient who'd voluntarily implanted an RFID chip in his own hand and was now asking for imaging to check on it. I get it. There's no protocol for this. There's no ICD code for "patient self-implanted microchip, requests follow-up imaging, reports no symptoms." The conversation had a tone.&lt;/p&gt;

&lt;p&gt;My position was simple: I had a foreign object inside of my hand. I'd like to know where it is and whether it's sitting right. That seems like exactly the kind of thing an x-ray is for. I ended up getting the imaging. Everything looked fine. No migration, no issues, nothing unexpected. Good news, because I'd already bought the second chip.&lt;/p&gt;

&lt;p&gt;In hindsight, having the conversation with your doctor before the first install is probably the smarter move. I went in after the fact and that made the interaction harder than it needed to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second implant: right hand, professionally installed
&lt;/h2&gt;

&lt;p&gt;By mid-February I'd been living with the NExT for a month. The hand worked fine. The chip worked fine. The healing had been clean. So I started thinking about a second one.&lt;/p&gt;

&lt;p&gt;I knew I wanted an &lt;code&gt;xSIID&lt;/code&gt;. Same NFC functionality as the NTAG side of the NExT, but with one critical addition: a tiny LED that pulls power from the field when you scan it and lights up. Mine is blue. When I tap my hand on a phone or reader, my knuckle glows.&lt;/p&gt;

&lt;p&gt;This time I made a different decision about installation. From a text I sent that February:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm getting my first LED implant. It arrives on Tuesday. I'm having it put in my right hand, and I'm a righty, so I'm going to go out to Worcester and have that guy do it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three things changed between the first install and the second. First, the right hand is my dominant hand, and an LED implant only pays off if it's in a hand you actually wave around. The whole point is the show-and-tell, and you show with your dominant hand. Second, I was past the &lt;em&gt;"can I prove I can do this myself"&lt;/em&gt; stage. The first install validated the technique. The second install was about getting the placement perfect, because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I want the LED in P1 or P2 because it will show up better under the thinner skin.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The "P0/P1/P2" notation is from the Dangerous Things community standard for hand implant locations. P0 is the webbing between thumb and index finger. P1 is the back of the hand near the index knuckle. The thinner the skin over the LED, the more visible the glow. I wasn't going to risk getting that placement wrong on my dominant hand by doing it myself.&lt;/p&gt;

&lt;p&gt;Third, the practical issue. I needed a driver. From the same conversation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm getting it in my right hand, so I won't be able to drive afterwards.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Worcester is about an hour from where I live. The installer charged $150. He knew exactly what he was doing. He had implanted magnets in the back of his own hand that he used as tool holders during the procedure, sticking the needles and scalpels to his hand instead of laying them down on the table. He took one look at the small star tattoo on my right index finger knuckle and said, &lt;em&gt;"as soon as I saw the star tattoos, I was hoping you were gonna tell me it was going under that."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xg8z9awl5v6tgs8jf4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xg8z9awl5v6tgs8jf4p.png" alt="NFC Implant Install Illustration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The xSIID went in. It hurt about as much as the first one. The LED took several days to be visible, because there's a lot of trapped blood at the install site that takes time to clear. By the second week, I could see the blue glow when I scanned my hand, and now, two years later, it's a solid clear blue every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually do with them
&lt;/h2&gt;

&lt;p&gt;People ask if you can pay for things with these. You can't, not really, not in the US. The implants don't have the secure element that contactless payment cards use. Anyone telling you otherwise is either using a workaround that expires or is in Europe.&lt;/p&gt;

&lt;p&gt;Here's what mine actually do, on a normal Tuesday:&lt;/p&gt;

&lt;p&gt;The NExT in my left hand is a clone of my office building access badge. I tap my left hand against the reader, the door unlocks. I haven't needed to carry a physical badge in two years.&lt;/p&gt;

&lt;p&gt;The xSIID in my right hand handles what the Dangerous Things community calls a "fistbump login." I have a small USB device on my desk, a Dangerous Things &lt;code&gt;KBR1&lt;/code&gt;. It's a 13.56 MHz reader that presents itself to the computer as a USB HID keyboard. When I tap my right hand on it, it types out the chip's UID followed by Enter. My laptop password is that UID concatenated with a memorized suffix, so the login flow is: tap the reader, then type the rest. Something I know, plus something embedded in my hand. Two factors, one motion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4bapc1pzmxu0fwqq5hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4bapc1pzmxu0fwqq5hx.png" alt="NFC Implanted Chip hand against NFC scanner" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is that proper cryptographic multi-factor authentication? No. The UID is a public identifier. Anyone with a 13.56 MHz reader and physical proximity to my hand could capture it. Anyone with a Flipper Zero could clone it onto a Magic NTAG card. What it actually is is a long random string I never have to remember and never type, that an attacker can't get from a phishing page or a keylogger alone.&lt;/p&gt;
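To put a number on that "long random string" (my arithmetic, with the assumptions in the comments): an NTAG216 UID is 7 bytes, and the first byte is NXP's fixed manufacturer code, so even treating the remaining bytes as random, which is generous since UIDs are assigned rather than random, the ceiling is modest:

```python
import math

# Rough upper bound on the entropy the UID portion adds to a password.
# Assumptions: NTAG216 UIDs are 7 bytes; the first byte is NXP's fixed
# manufacturer code (0x04), so at most 6 bytes vary. Real UIDs are
# assigned, not random, so this is an optimistic ceiling.
uid_bits = 6 * 8  # 48

# Equivalent length in random lowercase-alphanumeric characters,
# each worth log2(36) bits.
equivalent_chars = math.ceil(uid_bits / math.log2(36))  # 10
```

Roughly the strength of ten extra random lowercase-alphanumeric characters bolted onto the memorized part, at best.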

&lt;p&gt;I'm comfortable with the tradeoff. The threat model on my personal laptop doesn't include nation-state actors close enough to my hand to scan it. If it did, I'd add a Yubikey. For everyday use, the convenience and the small entropy boost together are a real upgrade over a memorable password.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Two things. First, I'd consider doing the install on the LED implant earlier, while I was still in the &lt;em&gt;"I just installed one myself"&lt;/em&gt; mindset. The Worcester install was easy in retrospect. Splitting the two installs by six weeks broke a streak that I might have been able to ride.&lt;/p&gt;

&lt;p&gt;Second, I'd think harder about the LED placement. P1 is right at the index knuckle, which is the most expressive part of the hand and the part you naturally point with. Mine looks great, but the chip is large enough that I notice it when I make a fist. P0, in the webbing, would have been less visible day-to-day but maybe more comfortable for fine motor work. If you're considering an LED implant, the visibility-vs-comfort tradeoff is real and you should think about it before, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;Two years on, the implants are part of how I interact with the physical world. I forget about them most of the time. Then I tap my hand on a reader, the LED catches, and I remember. Whatever the future of personal identification looks like, I've spent two years living with a version of it.&lt;/p&gt;

&lt;p&gt;The next round of projects is about building my own readers. I have a 3D printer and a soldering iron arriving this week, a stack of ESP32 boards, and a &lt;code&gt;PN5180&lt;/code&gt; long-range NFC module on the way. The first build is a bedside reader. The second is a USB-HID desk reader to replace the KBR1 on a second workstation. The one after that is, well, more theatrical.&lt;/p&gt;

&lt;p&gt;But that's a different post.&lt;/p&gt;

</description>
      <category>devjournal</category>
      <category>discuss</category>
      <category>iot</category>
      <category>science</category>
    </item>
    <item>
      <title>The Machine Zone: Ignition</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Sat, 02 May 2026 15:40:40 +0000</pubDate>
      <link>https://dev.to/niclydon/the-machine-zone-ignition-4p2k</link>
      <guid>https://dev.to/niclydon/the-machine-zone-ignition-4p2k</guid>
      <description>&lt;p&gt;I installed Claude Code on March 12. Thirty-five days later I had written 557,000 lines of code across fifteen repositories that had not existed before. None of them had existed in my name before March 17. I had never owned a git repository. I typed &lt;code&gt;git init&lt;/code&gt; on something of my own for the first time in my life on a Tuesday morning at 11:46 a.m. Eastern. Four weeks later I had fifteen repositories and half a million lines.&lt;/p&gt;

&lt;p&gt;I work in cybersecurity. I am forty-five years old. I have been working in technology for twenty years. I know what sustainable output looks like. This was not it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0bm6gxrqyqml7ge10fb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0bm6gxrqyqml7ge10fb.png" alt=" " width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The largest single project is ARIA (Adaptive Responsive Intelligent Assistant), my personal assistant and behavioral-DNA tool, at 1,033 commits and 220,536 net lines. Second is Nexus, a centralized data lake that stitches my iMessage, Gmail, health, and calendar history together, at 369 commits. Third is Chancery, an agent-orchestration and observability layer, at 322 commits. Fourth is niclydon.com, a redesign of my personal site, at 164 commits. Forge, my home-lab LLM gateway, runs on a stack I rebuilt inside the window. Broadside, an AI drafting pipeline, reads from CHANGES.md and git history across every project I maintain and writes posts in my voice.&lt;/p&gt;

&lt;p&gt;These are not side projects. This is my whole stack, remade. They function. They save me time. Some of them save my family time. I am not going to stand here and say I regret what I built, because I do not regret what I built.&lt;/p&gt;

&lt;p&gt;What I regret, insofar as I regret anything, is the rate at which this happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanic
&lt;/h2&gt;

&lt;p&gt;Claude Code is an agentic coding assistant that runs in your terminal. You describe what you want to build, and it writes the code, runs tests, fixes bugs, and commits the results. The interaction is conversational: you type a request, it takes actions, you see the results, you respond. Each exchange takes seconds.&lt;/p&gt;

&lt;p&gt;The pattern that emerged was simple: I would ask for a feature. It would build it. I would see something adjacent that needed fixing. I would ask for that. It would fix it. I would notice something else. The next request was always one keystroke away. The gap between wanting the next action and getting it rounded to whatever the API latency was — usually under two seconds.&lt;/p&gt;

&lt;p&gt;This is what I mean by "the loop." Not a metaphor. The literal interaction pattern: request, response, next request. Variable-ratio reinforcement with near-zero latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the body said
&lt;/h2&gt;

&lt;p&gt;I have the Apple Health export. The shell history. The git metadata. The Claude Code session transcripts. The billing records. All of it lives on disk. I pulled the numbers because adjectives rot and numbers are harder to argue with.&lt;/p&gt;

&lt;p&gt;Baseline window is January 16 through March 11. The ignition week is March 12 through March 19.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps.&lt;/strong&gt; Baseline median: 12,250 per day. Ignition week median: 1,636.5 per day. That is an &lt;strong&gt;86.6 percent drop&lt;/strong&gt;. The single lowest day in my ninety-day record is March 19 at &lt;strong&gt;243 steps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sleep.&lt;/strong&gt; Baseline median nightly sleep: 5.88 hours. Five of the eight nights in the ignition week have &lt;strong&gt;no primary sleep detected at all&lt;/strong&gt;. The Apple Watch could not find a contiguous block long enough to call it a night. My longest bracketed gap without meaningful sleep during the week is approximately forty-eight hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sleep midpoint.&lt;/strong&gt; Baseline median: 3:39 a.m. Ignition median: 5:43 a.m. A shift of two hours and four minutes. Wake time moved four hours and forty-eight minutes later. Sleep time moved twelve minutes later. I was not going to bed earlier and sleeping in. I was going to bed at roughly my old clock and not waking up until much further into the morning. The body was compensating in the one direction it had left.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heart rate variability.&lt;/strong&gt; Baseline median: 74.7 ms. Ignition median: 66.4 ms. An eleven percent drop, and no recovery thirty days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Photos taken.&lt;/strong&gt; Baseline median: 104 per day. Ignition median: three per day. A &lt;strong&gt;97.2 percent drop in life-documentation activity over one week&lt;/strong&gt;. The away-from-home share collapsed harder: 87 percent of my baseline photos were taken somewhere other than my house. During the ignition week: 25 percent.&lt;/p&gt;

&lt;p&gt;The phone stopped going places.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxye51cxp8asbgqwvsn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxye51cxp8asbgqwvsn.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The receipts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;March 12, 9:19 p.m. ET.&lt;/strong&gt; Anthropic welcome email. "Ship your first commit in 5 minutes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 13, 8:56 p.m. ET.&lt;/strong&gt; First API credit cutoff. I had been running the agent for less than twenty-four hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 13, 9:23 p.m. ET.&lt;/strong&gt; $95.63 API credit top-up. Twenty-seven minutes after the cutoff. I did not reflect. I bought more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 17, 11:46 a.m. ET.&lt;/strong&gt; The first git commit in the history of any repository I have ever owned. I have shipped code at work, on contract, as a hobbyist. I had never run &lt;code&gt;git init&lt;/code&gt; on my own machine and lived with the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 17, later that day.&lt;/strong&gt; Apple receipt for Claude Max 20x at $249.99. I pivot from Pro to Max mid-morning on a Monday. The commitment is made with a tap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 19.&lt;/strong&gt; 243 steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 22, 9:31 p.m. ET.&lt;/strong&gt; Third API cutoff. &lt;strong&gt;March 22, 9:33 p.m. ET.&lt;/strong&gt; Next top-up. Seventy-eight seconds.&lt;/p&gt;

&lt;p&gt;Across the first ten days: $305.52 in API top-ups on top of the $249.99 Max subscription. I spent more on Claude credit that week than I spent on groceries that month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 2, 1:50 a.m. ET.&lt;/strong&gt; My aunt, who was in hospice, passed away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 4&lt;/strong&gt;, two days after her death, is the highest-activity Claude Code day in my entire ninety-day record. 23,476 events. 21.7 active hours. I slept 102 minutes. I took one photo. My longest unbroken session ran from April 3 at 8:34 p.m. to April 4 at 11:16 p.m. The session crossed the day-after-her-death barrier without pausing.&lt;/p&gt;

&lt;p&gt;I shipped production code that day that is still running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;April 5, 9:02 p.m. ET.&lt;/strong&gt; The largest single API credit top-up of the thirty-five days, $106.25, goes through. Three days after my aunt's death.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y3u6o46szmv3plbg80e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y3u6o46szmv3plbg80e.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am not going to dramatize any of this. I am listing it because when I say "the loop compounds with grief rather than interrupting for it," these are the receipts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The machine zone
&lt;/h2&gt;

&lt;p&gt;There is a researcher named Natasha Dow Schüll who spent more than a decade inside Las Vegas casinos watching slot machine players. Her book is called &lt;a href="https://a.co/d/04SLvwvA" rel="noopener noreferrer"&gt;&lt;em&gt;Addiction by Design&lt;/em&gt;&lt;/a&gt;. The thing she names is not addiction in the chemical sense. It is a state players call "the zone." A suspension of self, a narrowing of attention, a sense of the outside world fading. The machines are engineered for it. Variable reward schedules, near-misses that register as almost-wins, sensory feedback tuned to a frequency just below conscious attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.smithsonianmag.com/science-nature/bf-skinner-the-man-who-taught-pigeons-to-play-ping-pong-and-rats-to-pull-levers-5363946/" rel="noopener noreferrer"&gt;B.F. Skinner demonstrated the mechanism in the 1950s with pigeons and a lever.&lt;/a&gt; Variable-ratio reinforcement — reward coming at unpredictable intervals — produces more persistent behavior than any other schedule. Persistent meaning: the pigeon will keep pressing the lever long after the reward has stopped. Harder to extinguish. More compulsive.&lt;/p&gt;

&lt;p&gt;The Civilization loop is a variable-ratio reinforcement schedule with a progress bar attached. Every turn yields something, some turns yield a great deal, and the next turn is always one click away. The human who plays it is not broken. The human is operating correctly inside a system that was engineered to produce exactly this behavior.&lt;/p&gt;

&lt;p&gt;Claude Code is a variable-ratio reinforcement schedule with a diff attached. Every tool call yields something, some yield the feature you were trying to build, and the next tool call is always one keystroke away.&lt;/p&gt;

&lt;p&gt;I ran the numbers on all 2,009 Claude Code session transcripts on my two machines. The cache-read token share — the fraction of context tokens served from Anthropic's prompt cache instead of a full re-encode — is 98.32 percent across the cohort. During the week my aunt died it was 99.25 percent. The gap between "I want the next action" and "the next action arrives" is whatever the cache-read latency is. My local compute cost per additional turn has rounded to zero.&lt;/p&gt;
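&lt;p&gt;For what it's worth, the measurement itself is a few lines. This is a sketch of how I compute that share, assuming the transcripts are JSONL with an Anthropic-style &lt;code&gt;usage&lt;/code&gt; block per message; the exact file layout on your machine may differ.&lt;/p&gt;

```python
import json
from pathlib import Path

def cache_read_share(transcript_dir):
    # Sum token-usage counters across every session transcript.
    # The counter names match the Anthropic API "usage" block; the
    # JSONL layout is an assumption about the transcript format.
    cached = fresh = 0
    for path in Path(transcript_dir).glob("**/*.jsonl"):
        for line in path.read_text().splitlines():
            usage = json.loads(line).get("message", {}).get("usage")
            if usage:
                cached += usage.get("cache_read_input_tokens", 0)
                fresh += usage.get("input_tokens", 0)
                fresh += usage.get("cache_creation_input_tokens", 0)
    total = cached + fresh
    return cached / total if total else 0.0
```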

&lt;p&gt;The same work at sixty percent of the throughput would have ended with my body still calibrated, my inner circle still met on weekends, my camera roll full of my cat instead of empty, and my April 2 free for my aunt. The same work was available at that rate. The loop is not what made the work happen. The loop is what made me do it at a speed that did damage.&lt;/p&gt;

&lt;p&gt;Both of those things are true and they sit inside the same person and they are not going to resolve into the simpler version.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I wrote next
&lt;/h2&gt;

&lt;p&gt;This was the first of three. The other two are on Substack, since they go further into territory that isn't really dev.to-shaped:&lt;/p&gt;

&lt;p&gt;If you want to read about what the loop did to my closest relationships, that's &lt;strong&gt;&lt;a href="https://niclydon.substack.com/p/the-machine-zone-twenty-eight-times" rel="noopener noreferrer"&gt;Part II: Twenty-eight Times Slower&lt;/a&gt;&lt;/strong&gt;. My median text reply time to the people closest to me went from 1.1 minutes to 31.2 minutes in twelve days. The interesting part wasn't that it dropped. It was the shape — broadcast, not silence.&lt;/p&gt;

&lt;p&gt;If you want to read about what the loop kept building even after I thought it was over — agents that talked like me, a system writing biographies of the person building it — that's &lt;strong&gt;&lt;a href="https://niclydon.substack.com/p/the-machine-zone-the-rate" rel="noopener noreferrer"&gt;Part III: The Rate&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>mentalhealth</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>I Fixed My LLM OOM Crashes by Shrinking the Draft Model (Speculative Decoding on Real Hardware)</title>
      <dc:creator>Nic Lydon</dc:creator>
      <pubDate>Fri, 01 May 2026 23:27:56 +0000</pubDate>
      <link>https://dev.to/niclydon/i-fixed-my-llm-oom-crashes-by-shrinking-the-draft-model-speculative-decoding-on-real-hardware-1afb</link>
      <guid>https://dev.to/niclydon/i-fixed-my-llm-oom-crashes-by-shrinking-the-draft-model-speculative-decoding-on-real-hardware-1afb</guid>
      <description>&lt;p&gt;The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4B draft → 0.6B draft&lt;/td&gt;
&lt;td&gt;~2 GiB saved, same MoE throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding parallelism 16 → 8&lt;/td&gt;
&lt;td&gt;~8 GiB freed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;td&gt;Dropped from ~97 GiB to ~87.7 GiB, no more OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spec decode isn't free. You're paying VRAM for both models simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I run a local LLM inference gateway on two AMD-based mini PCs — GMKTec EVO-X2 boxes with Strix Halo APUs and 160 GB of unified memory each. The gateway serves around 20 models through &lt;code&gt;llama-swap&lt;/code&gt;, a process manager that loads and evicts models on demand behind an OpenAI-compatible API. Think of it as a poor man's model router: one port per logical model, &lt;code&gt;llama-swap&lt;/code&gt; starts the right &lt;code&gt;llama.cpp&lt;/code&gt; process on request, and idle models get evicted when memory gets tight.&lt;/p&gt;
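&lt;p&gt;If you haven't used it, a &lt;code&gt;llama-swap&lt;/code&gt; config is a small YAML file mapping model names to the command that serves them. This is an illustrative sketch, not my actual config; the keys follow the llama-swap README, and the paths and flags are placeholders:&lt;/p&gt;

```yaml
# Illustrative llama-swap config (paths and flags are placeholders)
models:
  "qwen3.5-122b-a10b":
    cmd: llama-server --port ${PORT} -m /models/qwen3.5-122b-Q4_K_M.gguf
    ttl: 300   # seconds of idle time before the model is unloaded
  "embed":
    cmd: llama-server --port ${PORT} -m /models/embed.gguf --embeddings
```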




&lt;h2&gt;
  
  
  Speculative Decoding (Quick Context)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb7j2fg1ysfnwgkavghu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb7j2fg1ysfnwgkavghu.png" alt="Speculative decoding diagram" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speculative decoding pairs a large target model with a smaller draft model. The draft proposes tokens cheaply; the target verifies them in a single forward pass. When the draft is right — and for well-matched model families, it often is — you get roughly 1.5–2× throughput. The important detail that bites people: both models are resident in memory at the same time.&lt;/p&gt;
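&lt;p&gt;The control flow, stripped of real models, looks like this. A greedy-decoding sketch with toy stand-in "models" (any function from a token sequence to a next token); real engines batch the verification into one forward pass, which is where the speedup comes from.&lt;/p&gt;

```python
def speculative_decode(target, draft, prompt, k=4, steps=8):
    # Greedy speculative decoding sketch. `target` and `draft` are
    # stand-ins for model forward passes: sequence -> next token.
    out = list(prompt)
    for _ in range(steps):
        # 1. The draft proposes k tokens cheaply.
        proposed = []
        for _ in range(k):
            proposed.append(draft(out + proposed))
        # 2. The target verifies each position. (A real engine does
        #    this in ONE batched forward pass, which is the speedup.)
        accepted = []
        for i in range(k):
            verified = target(out + accepted)
            accepted.append(verified)
            if verified != proposed[i]:
                break   # first mismatch: discard the rest of the draft
        out.extend(accepted)
    return out
```

&lt;p&gt;With a well-matched draft the accepted run is long and the target pass is amortized over several tokens; with a mismatched draft you fall back to roughly one verified token per step. Either way the output is identical to running the target alone, which is why draft choice is purely a speed/memory decision.&lt;/p&gt;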




&lt;h2&gt;
  
  
  The Bad Assumption
&lt;/h2&gt;

&lt;p&gt;I was running a blanket policy: every Qwen3-family model gets the Qwen3-4B draft. Four billion parameters felt like the safe middle ground — big enough to draft well, small enough to fit. Or so I thought.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Crash
&lt;/h2&gt;

&lt;p&gt;The problem surfaced when I tried to load &lt;code&gt;qwen3.5-122b-a10b&lt;/code&gt; (roughly 71 GiB at Q4_K_M) alongside my always-resident embedding model. On paper, the embedding model was supposed to run around 16 GiB. In practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embed:             ~23.8 GiB
122B + 4B draft:   ~73.6 GiB
─────────────────────────────
total:             ~97.4 GiB
available:         ~96.0 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intermittent OOM crashes followed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Diagnosis
&lt;/h2&gt;

&lt;p&gt;Pulling real numbers from &lt;code&gt;rocm-smi&lt;/code&gt; told a different story than my estimates. The embedding model was actually consuming 23.8 GiB, not 16. The culprit was KV cache pre-allocation: with parallelism set to 16 and context at 8,192 tokens, the runtime was pre-allocating 16 full-context-length KV cache slots simultaneously, and that adds up fast.&lt;/p&gt;
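&lt;p&gt;You can sanity-check that with a back-of-envelope formula. The dimensions below are hypothetical (a generic dense model with grouped-query attention and an fp16 KV cache), not the real embedding model's config; the point is how slot count multiplies the budget.&lt;/p&gt;

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, n_slots,
                 bytes_per_elem=2):
    # K and V each hold one head_dim vector per layer, per KV head,
    # per cached token; fp16 means 2 bytes per element.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_slots * ctx * per_token / 2**30

# Hypothetical dims, NOT the real model's config:
# 32 layers, 8 KV heads (GQA), head_dim 128, 8,192-token slots.
print(kv_cache_gib(32, 8, 128, 8192, n_slots=16))  # 16.0 GiB
print(kv_cache_gib(32, 8, 128, 8192, n_slots=8))   # 8.0 GiB
```

&lt;p&gt;Slot count multiplies the whole KV budget, so halving parallelism halves the pre-allocation.&lt;/p&gt;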




&lt;h2&gt;
  
  
  Two Knobs, Both Pulled
&lt;/h2&gt;

&lt;p&gt;At that point I had two levers: reduce embedding parallelism, or shrink the draft model. I did both.&lt;/p&gt;

&lt;p&gt;Dropping embedding parallelism from 16 to 8 freed roughly 8 GiB while keeping context length at 8,192 tokens, which still comfortably covers my p99 usage around 2,532 tokens. On the draft side, the key insight was that not every model needs the same draft. A 0.6B draft — about 0.4 GiB — performs nearly as well as the 4B for MoE architectures, where sparse activation already limits how much a larger draft model can contribute. Total consumption dropped from roughly 97 GiB to around 87.7 GiB. Stable, no crashes.&lt;/p&gt;
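&lt;p&gt;In &lt;code&gt;llama-server&lt;/code&gt; terms, the two levers look roughly like this. Flag names are from recent llama.cpp builds and the file names are placeholders; check &lt;code&gt;llama-server --help&lt;/code&gt; on your version, since context/slot semantics vary between releases.&lt;/p&gt;

```shell
# Knob 1: embedding server, parallelism 16 -> 8
llama-server -m /models/embed.gguf --embeddings \
  --parallel 8 --ctx-size 8192

# Knob 2: swap the 4B draft for the 0.6B draft
llama-server -m /models/qwen3.5-122b-a10b-Q4_K_M.gguf \
  --model-draft /models/qwen3-0.6b-Q8_0.gguf --draft-max 16
```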

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8wlwfue6kkom7cibroi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8wlwfue6kkom7cibroi.png" alt="VRAM usage after fix" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Measure actual VRAM usage, not estimated usage. They are not the same number.&lt;/li&gt;
&lt;li&gt;Draft model sizing should follow model architecture, not a one-size-fits-all policy.&lt;/li&gt;
&lt;li&gt;KV cache pre-allocation scales with parallelism — and it will surprise you.&lt;/li&gt;
&lt;li&gt;Spec decode costs memory. Budget for two models, not one.&lt;/li&gt;
&lt;li&gt;Working inside tight constraints forces you to understand your system at a level that comfortable headroom never would.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
