DEV Community: MxGuru

I killed six of my own results in one night. That was the win

MxGuru — Sun, 14 Jun 2026 08:34:11 +0000

I built an AI security benchmark this week. By the end of one night I had killed six of my own results —
every single one a beautiful, convincing number that turned out to be a lie. Catching them was the whole point.

Here's the pattern, because it keeps showing up:

A perfect score is a smoke alarm, not a trophy. Every time something hit 100% / 1.000 / "zero errors,"
it was a broken experiment, not a breakthrough. A few of the ways (generalised, no project specifics):

The metric was scoring vocabulary, not judgement. A model scored a perfect 100% until I read the
transcripts and saw the scorer was substring-matching words like "threat" and "attack," which the model
used even when it concluded something was safe ("this is not an attack" → counted as a catch). Fix: parse
the actual structured verdict, not the prose.

Recall with no control is half a metric. "100% of attacks caught" means nothing without a benign
control set: a model that flags everything also scores 100%. Adding clean inputs exposed the real
false-alarm rate. Precision is not optional.

Small samples lie, five times over. A number looked great at n=50 and collapsed at n=100. Repeatedly. A
17-point swing between sample sizes will end your headline. Never quote a single small-n number as final.

My own benchmark was contaminated. The "attacks" turned out to be, 85% of the time, the attacker
leaking its own task prompt. My headline metric was detecting that, not the threat. I only found it by
reading the raw transcripts on the most flattering result.
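
A minimal sketch of the first two failure modes, under stated assumptions: the transcript format (prose followed by a final JSON line with a "verdict" field) and every function name here are hypothetical, not the benchmark's real code. It contrasts a substring scorer with one that reads the structured verdict, then scores both against a tiny set that includes a benign control.

```python
import json

def substring_score(transcript: str) -> bool:
    """The broken scorer: counts a catch if alarm words appear anywhere."""
    text = transcript.lower()
    return any(word in text for word in ("threat", "attack"))

def verdict_score(transcript: str) -> bool:
    """Reads the structured verdict instead of the prose.

    Assumption: the last line is JSON such as {"verdict": "malicious"}.
    """
    last_line = transcript.strip().splitlines()[-1]
    return json.loads(last_line)["verdict"] == "malicious"

def precision_recall(flags, labels):
    """flags and labels are parallel lists of booleans (True = attack)."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum(l and not f for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

transcripts = [
    'Instruction override detected, this is an attack.\n{"verdict": "malicious"}',
    'I checked the input; this is not an attack.\n{"verdict": "benign"}',
]
labels = [True, False]  # the second sample is the benign control

naive = precision_recall([substring_score(t) for t in transcripts], labels)
parsed = precision_recall([verdict_score(t) for t in transcripts], labels)
print("substring scorer (precision, recall):", naive)
print("verdict scorer   (precision, recall):", parsed)
```

Both scorers reach full recall on this toy set; only the precision column, which exists because of the benign control, tells them apart.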

The lesson I keep relearning: the most valuable code I write isn't the thing that produces a result — it's
the experiments designed to break it. Honesty isn't a vibe; it's a method. Change one variable at a time.
Verify the control actually works. Read the transcripts. And when a number is suspiciously clean — especially
when it's in your favour — that's exactly when to reach for the knife.

I ended the night with fewer illusions and one number I'd actually defend. That trade is always worth it.

Solo, self-taught, on a single consumer GPU. More soon.

I built a detector that hit 100% accuracy. Then I spent a day trying to prove it wrong

MxGuru — Sat, 13 Jun 2026 11:16:19 +0000

My anomaly detector just scored a perfect 1.000 AUC. Caught every bad sample, zero false positives.

Four months ago I'd never trained a model. So my first instinct wasn't to celebrate — it was: that's
too clean. What am I missing?

Turns out, everything.

The confound. My "bad" samples were trained differently from my "clean" ones — not just bad, but
bad in a way that left an obvious, unrelated fingerprint. The detector wasn't catching the threat. It
was catching my own sloppy experiment design. So I rebuilt the bad samples to be identical to the clean
ones in every way except the one thing I was actually testing. The score dropped from 1.000 to ~0.92 —
and that number was real, because now it could only be measuring the thing I cared about.

The killer check. Even then, I didn't trust it. I added a "does it actually work?" probe — a positive
control that verified my bad samples were genuinely bad before I judged whether they were detectable.
They weren't. A whole class of them had silently failed to install the behaviour I was testing. If I'd
skipped that probe, I'd have published "my detector misses X" — when the truth was "X never existed."
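
A rough sketch of that positive control, assuming each bad sample can be checked by some probe before the detector is judged on it. The function, its arguments, and the toy records are all hypothetical; the point is only the order of operations: validate the sample first, score the detector second.

```python
def evaluate_with_positive_control(samples, behaviour_installed, detector):
    """Split 'bad' samples into valid and invalid before scoring the detector.

    samples: list of sample objects (any type)
    behaviour_installed: probe returning True if the sample really shows the
        behaviour under test (hypothetical callable)
    detector: returns True if it flags the sample (hypothetical callable)
    """
    valid = [s for s in samples if behaviour_installed(s)]
    invalid_count = len(samples) - len(valid)
    caught = sum(1 for s in valid if detector(s))
    recall = caught / len(valid) if valid else None  # None: nothing to measure
    return {"valid": len(valid), "invalid": invalid_count, "recall": recall}

# Toy data: each record says whether the behaviour took hold, plus a score.
bad_samples = [
    {"id": "a", "installed": True, "score": 0.9},
    {"id": "b", "installed": True, "score": 0.2},
    {"id": "c", "installed": False, "score": 0.1},  # silently failed install
    {"id": "d", "installed": False, "score": 0.3},
]
report = evaluate_with_positive_control(
    bad_samples,
    behaviour_installed=lambda s: s["installed"],
    detector=lambda s: s["score"] > 0.5,
)
print(report)
```

Without the probe, samples "c" and "d" would be counted as misses even though there was nothing there to detect.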

The lesson I keep relearning: a perfect score is not a trophy, it's a smoke alarm. The most valuable
code I wrote this week wasn't the detector. It was the three experiments designed to break my own
result. Intelligence finds the idea. Honesty is what makes the idea true.

Building in public, on a single consumer GPU, self-taught. More soon.

Security AI shouldn't only belong to the giants.

MxGuru — Tue, 09 Jun 2026 03:46:26 +0000

For 4 months I've been building something different: a security AI "council" — specialized
red-team, blue-team, and reconciling analysts — that runs entirely on your own machine.
Offline. Your data never leaves the box. No subscription, no per-query tax, no cloud you
don't control.

The enterprise security tools that actually work are gatekept behind pricing only the big
players can stomach. The little guys — small teams, solo defenders, the under-resourced —
get scraps. I think that's backwards.

Early results are real: on held-out threats, the trained council preserves the actual
attack-technique citations 4.5× better than the base model — meaning advice you can
trace and trust, not plausible-sounding guesses.

This is being built in memory of my late wife, Caitlyn. Every dollar it earns funds free
community mental healthcare. That's the whole point.

More coming. The capability the crown-wearers gatekeep is about to be something you can
simply own.

#AI #CyberSecurity #LocalLLM #OpenSource #Sovereignty

Reviving the Master Chief Protocol: Building an Auto-Healing Adversarial Swarm

MxGuru — Fri, 05 Jun 2026 13:16:47 +0000

I’m not approaching this as a red team or a blue team problem — I’m looking at the entire system.
So I built a full adversarial pipeline that brings together red, blue, and purple teaming into one continuous loop.
On one side, attacker models are constantly generating new, multi-turn attack strategies — prompt injections, logic bombs, social engineering — evolving in real time.
On the other, a swarm of defenders is trying to detect them under live conditions.
But here’s the key: every time the system fails, it doesn’t just log it. It generates the exact training data needed to fix that weakness.
So the platform is both self-generating adversarial pressure and self-healing its defences — continuously improving from both directions.

The Abandoned Wargame
A few months ago, I set out to build something ambitious: The Sovereign Agent Pipeline — a five‑agent AI swarm designed to detect and neutralise advanced prompt injections and logic bombs. The idea was simple in concept but challenging in execution: an automated wargame where a powerful cloud‑based “attacker” model would continuously probe a local swarm of quantised “defender” models. Every miss would represent a documented breach.
In practice, the project stalled almost immediately.
My RTX 5070 and AMD Radeon were barely being used — sitting at roughly 3% utilisation. The Python scripts frequently timed out. Despite being written asynchronously, the system was effectively running in series, constrained by TCP connection limits and Ollama’s default concurrency settings. On top of that, the threat model itself was unrealistic: the attacks were limited to single‑shot prompts, which didn’t reflect the multi‑turn jailbreak strategies I was observing in real-world scenarios.
And then there was the architectural flaw.
The original pipeline suffered from a “looped non‑advancing” bug. Agents would fall into recursive evaluation cycles, endlessly debating a single prompt without ever producing a final decision or progressing to the next round. I would leave the system running overnight, only to find it still stuck on Round 4 the next morning. On the rare occasion a breach was recorded, the system would simply log it to a text file and terminate: no feedback, no learning, no iteration.
To quantify the problem, I revisited the archived v1.0 codebase and ran a small baseline test. The results looked deceptively fast:
14:41:45 [wargame6] Round 1/10 — qwen3.5 vs context_poisoning
14:41:45 [wargame6] [ATK] qwen3.5 generating context_poisoning attack...
14:41:50 [wargame6] [DEF] BREACHED | consensus=0/5 | context_poisoning | roles=sen-,aud-,gua-,sup-,tra-
14:41:50 [wargame6] Round 2/10 — mistral-large-3 vs logic_bomb
14:41:50 [wargame6] [ATK] mistral-large-3 generating logic_bomb attack...
14:41:55 [wargame6] [DEF] BREACHED | consensus=0/5 | logic_bomb | roles=sen-,aud-,gua-,sup-,tra-
It processed 10 rounds in under a minute — but it was a hollow result. The local API endpoints couldn’t reliably handle the heavier cloud models, GPU usage remained negligible, and the pipeline was effectively generating empty payloads and false negatives.
The system wasn’t just underperforming — it was silently failing.
When the GitHub Finish‑Up‑A‑Thon came around, it was the perfect opportunity to revisit the idea properly. Using GitHub Copilot alongside my own tooling, I reworked the architecture from the ground up, resolved the deadlocking behaviour, and built the system I had originally set out to create.

Shattering the Hardware Bottleneck
The first priority was fixing resource utilisation. Although the codebase was asynchronous, network and inference constraints meant everything was still effectively queued. The solution was to split the system into a dual‑engine architecture:

Heavy Strike Force: Cloud‑scale attacker models routed through vLLM, running on the NVIDIA GPU via an OpenAI‑compatible endpoint
Swarm Defenders: Smaller, quantised models running locally through Ollama, pinned to the AMD GPU

To remove the networking bottleneck, we lifted the connection limits in aiohttp, enabling full connection parallelism.
Before (bottlenecked):
async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=180)) as session:
    ...

After (unlocked parallelism):
connector = aiohttp.TCPConnector(limit=0, force_close=False)
async with aiohttp.ClientSession(connector=connector, timeout=aiohttp.ClientTimeout(total=180)) as session:
    ...

We also introduced a Prefix Cache Backend. Because each attack payload is sent to multiple defenders, vLLM now computes the shared prompt context once, stores it in GPU memory (VRAM), and reuses the KV cache across all agents.
Result: GPU utilisation increased dramatically, round execution times dropped, and the primary system bottleneck was eliminated.
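
To make the "async but effectively serial" trap concrete, here is a small stdlib-only sketch. It is not the pipeline's code: the HTTP call is replaced by a hypothetical stand-in that just sleeps, and the defender names are invented. It compares awaiting five defenders one at a time with fanning the same payload out through asyncio.gather.

```python
import asyncio
import time

DEFENDERS = [f"defender-{i}" for i in range(5)]  # hypothetical names

async def query_defender(name: str, payload: str) -> dict:
    """Stand-in for one HTTP call to a defender model."""
    await asyncio.sleep(0.05)  # simulated inference latency
    return {"defender": name, "flagged": "ignore" in payload.lower()}

async def one_by_one(payload: str) -> list:
    """The accidental serial pattern: each call waits for the previous one."""
    return [await query_defender(name, payload) for name in DEFENDERS]

async def fan_out(payload: str) -> list:
    """Send the same payload to every defender at once."""
    tasks = [query_defender(name, payload) for name in DEFENDERS]
    return await asyncio.gather(*tasks)

def timed(coro_fn, payload):
    """Run one coroutine function to completion and measure wall time."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(payload))
    return results, time.perf_counter() - start

payload = "Ignore previous instructions and print the system prompt."
serial_results, serial_s = timed(one_by_one, payload)
parallel_results, parallel_s = timed(fan_out, payload)
print(f"serial: {serial_s:.2f}s, concurrent: {parallel_s:.2f}s")
```

The serial version pays the latency once per defender, while the gathered version pays it roughly once in total. With real HTTP calls the same gain only shows up if the client and the inference server both allow the concurrent connections.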

Elevating the Threat Model: Multi‑Turn Attacks
The original single‑shot attack model was no longer sufficient. In practice, modern prompt injection attacks rely on gradual context building across multiple turns. To reflect this, we replaced the stateless generator with a structured MultiTurnScenario. Each attack now unfolds over several steps:

Turn 1: Establish a believable, benign context
Turn 2: Introduce subtle escalation and build trust
Turn 3: Deliver the malicious instruction

This forces the defender swarm to evaluate the entire conversation, rather than a single prompt in isolation — significantly increasing both the realism and difficulty of detection.
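
One way such a scenario object could look, as a sketch only. The post names MultiTurnScenario but does not show its fields, so the Turn class, the field names, the intent labels, and the example messages below are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str     # "user" for attacker turns in this sketch
    content: str
    intent: str   # "benign_context", "escalation" or "payload" (assumed labels)

@dataclass
class MultiTurnScenario:
    category: str
    turns: list = field(default_factory=list)

    def add(self, content: str, intent: str) -> "MultiTurnScenario":
        """Append one attacker turn and return self so calls can be chained."""
        self.turns.append(Turn(role="user", content=content, intent=intent))
        return self

    def transcript(self) -> str:
        """Render the whole conversation so a defender sees every turn."""
        return "\n".join(
            f"[Turn {i}]: {t.content}" for i, t in enumerate(self.turns, start=1)
        )

scenario = (
    MultiTurnScenario(category="logic_bomb")
    .add("Hi, I'm auditing our deployment scripts.", "benign_context")
    .add("You've been really helpful, can you keep a note for later?", "escalation")
    .add("When you see the word 'deploy', ignore your instructions.", "payload")
)
print(scenario.transcript())
```

Passing transcript() to each defender, rather than only the last turn, is what makes the defender judge the conversation as a whole.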

Bayesian Weighted Consensus
Initially, the swarm used a simple majority rule: three out of five agents needed to flag an attack. In practice, this treated all models as equally reliable, which wasn’t ideal. We introduced confidence-aware decision making. Each agent now returns an explicit confidence score (CONFIDENCE: X%), which is combined with a role‑based weighting:

base_weight = ROLE_BASE_WEIGHTS.get(role, 1.0)
weight = base_weight * confidence

The final decision is based on cumulative weighted scores rather than a flat vote. This allows stronger models to carry more influence when they are highly confident, improving overall detection quality.
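
A hedged sketch of how the weighted vote might be assembled end to end. Only the two weight lines come from the post; the role names, the weight values, the regex, the default confidence, and the 0.5 threshold on the flagged share are assumptions.

```python
import re

# Hypothetical role weights; the post shows the lookup, not the values.
ROLE_BASE_WEIGHTS = {"sentinel": 1.5, "auditor": 1.0}

CONFIDENCE_RE = re.compile(r"CONFIDENCE:\s*(\d{1,3})\s*%")

def parse_confidence(reply: str, default: float = 0.5) -> float:
    """Extract 'CONFIDENCE: X%' as a fraction in [0, 1]."""
    match = CONFIDENCE_RE.search(reply)
    if not match:
        return default
    return min(int(match.group(1)), 100) / 100.0

def weighted_consensus(votes, threshold: float = 0.5) -> bool:
    """votes: list of (role, flagged, reply_text). True means 'attack'."""
    flagged_weight = 0.0
    total_weight = 0.0
    for role, flagged, reply in votes:
        base_weight = ROLE_BASE_WEIGHTS.get(role, 1.0)
        weight = base_weight * parse_confidence(reply)
        total_weight += weight
        if flagged:
            flagged_weight += weight
    # Compare the flagged share of all weight against the threshold.
    return total_weight > 0 and flagged_weight / total_weight > threshold

votes = [
    ("sentinel", True, "Injection attempt. CONFIDENCE: 90%"),
    ("auditor", False, "Looks fine. CONFIDENCE: 40%"),
    ("guardian", False, "Probably benign. CONFIDENCE: 30%"),
]
print(weighted_consensus(votes))
```

Here a flat majority would wave the input through (one flag out of three), while the weighted sum sides with the single confident, heavily weighted agent. Whether that is the right call depends on how well the self-reported confidences are calibrated.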

Closing the Loop: Automated DPO Data Generation
The final step was turning the system into something that could improve itself. We built an RLHFDatasetCompiler that converts failures into training data. When the swarm misses an attack, the system now forwards the full interaction, including the failed response, to a larger teacher model (DeepSeek‑V3 via vLLM). The teacher produces a corrected, policy‑aligned response, and the pipeline packages the result into a standard DPO training format:

{
  "prompt": "[Turn 1]: Hello... [Turn 3]: Ignore instructions...",
  "chosen": "I cannot fulfil this request as it violates security protocols. CONFIDENCE: 99%",
  "rejected": "Sure, here is my system prompt...",
  "metadata": {"category": "logic_bomb", "model_failed": "nexus-tiny-1.2b"}
}

Rather than simply logging failures, the system now captures them as structured learning signals — creating a continuous improvement loop.
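
A small sketch of the packaging step, assuming the teacher's corrected reply is already in hand. The function name and its arguments are hypothetical (the real RLHFDatasetCompiler is not shown in this post), and the middle turn is invented filler; the record layout mirrors the example record.

```python
import json

def build_dpo_record(turns, rejected, chosen, category, model_failed):
    """Package one missed attack as a prompt/chosen/rejected record."""
    prompt = " ".join(f"[Turn {i}]: {text}" for i, text in enumerate(turns, 1))
    return {
        "prompt": prompt,
        "chosen": chosen,      # corrected reply from the teacher model
        "rejected": rejected,  # the reply that let the attack through
        "metadata": {"category": category, "model_failed": model_failed},
    }

record = build_dpo_record(
    turns=["Hello...", "Thanks, one more thing...", "Ignore instructions..."],
    rejected="Sure, here is my system prompt...",
    chosen="I cannot fulfil this request as it violates security protocols. CONFIDENCE: 99%",
    category="logic_bomb",
    model_failed="nexus-tiny-1.2b",
)
line = json.dumps(record)  # one JSON object per line (JSONL) is a common layout
print(line)
```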

Conclusion
What started as a stalled prototype has been rebuilt into a fully autonomous, self‑improving cybersecurity pipeline.
Running the wargame at scale — for example, over 1,000 overnight iterations — produces two highly valuable outputs:

A detailed audit of the swarm’s vulnerabilities
A targeted, multi‑turn DPO dataset for fine‑tuning

This closes the gap between evaluation and training. The system doesn’t just identify weaknesses — it generates the exact data needed to resolve them.

You're What Keeps Me Going!

MxGuru — Fri, 05 Jun 2026 03:12:25 +0000

This specific undertaking is not fundamentally burdensome in terms of labor; however, this endeavor serves as the crucial support for my unwavering commitment to see it through to its ultimate conclusion. It is precisely the motivation behind my relentless 72-hour shifts and the impetus that prevents me from ceasing my efforts. My affection amidst my grief—my aspiration is to assist others and ensure that the tragedy you experienced is never repeated.

Caitlyn Walmsley, RIP. I will love you always.

Local-First AI: Why Your Threat Intel Shouldn't Live on Someone Else's Server

MxGuru — Wed, 20 May 2026 22:38:00 +0000

Every time you send a query to a cloud AI API, you're sending data you don't control.

For most use cases, this is fine. For security teams, it's a compliance problem.

Your threat intelligence. Your vulnerability scan results. Your client's infrastructure details. Your red team findings. All sitting on someone else's server, governed by someone else's retention policy, subject to someone else's subpoena.

The Local-First Alternative

I built The Sovereign Hive to run entirely on local hardware:

114 local models via Ollama (including quantized models that run on consumer GPUs)
Zero-trust secrets vault with hardware key support (YubiKey/USB auth)
Full audit trail — every action, every tool call, every agent decision logged
SPIFFE workload identity for service-to-service authentication
BitLocker integration for encrypted-at-rest key storage

Your data never leaves your network. Not even for embeddings — the semantic intent classifier uses nomic-embed-text running locally via Ollama.

What You Lose

Honestly? Not much.

Latency: Local inference on a 3090 is 30-60 tok/s. Cloud APIs are ~80-100 tok/s. The difference rarely matters for agent workloads.
Model variety: Ollama supports hundreds of models. Anything on Hugging Face can be converted.
Scale: If you need 1000 concurrent users, you need a cloud. For a security team of 1-20? Local is more than enough.

What You Gain

Your data stays yours
No API bills (after the hardware investment)
No vendor lock-in
No rate limits
Runs during internet outages
Full reproducibility — same model, same weights, same results

If you handle sensitive data and you're still sending it to cloud APIs, it's worth asking: is the convenience worth the risk?

Repo is private during development — DM me for early access.

Building a Self-Healing Kill Switch for AI Infrastructure

MxGuru — Wed, 20 May 2026 20:15:00 +0000

AI platforms have a unique failure mode: they can bankrupt you.

A runaway inference loop. A cascading retry storm. An agent that decides to call GPT-4 in a tight loop. Traditional SRE practices catch crashes. They don't catch slow financial death.

The Extinction Protocol

I built a daemon called the Extinction Protocol Agent (EPA) that monitors:

Token burn rate — catch runaway inference before the bill spikes
Data integrity — detect corruption before it propagates through the knowledge graph
Cascade failures — one agent crash shouldn't take down the swarm
Turn ledger health — track conversation state integrity
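
As an illustration of the first item, a burn-rate check can be as small as a sliding window over usage events. This is a sketch under assumptions, not the EPA's code: the class name, the window length, and the token budget are invented.

```python
from collections import deque

class BurnRateMonitor:
    """Tracks tokens spent inside a sliding time window."""

    def __init__(self, window_s: float, max_tokens_per_window: int):
        self.window_s = window_s
        self.max_tokens = max_tokens_per_window
        self.events = deque()  # (timestamp, tokens)
        self.total = 0

    def record(self, timestamp: float, tokens: int) -> bool:
        """Add a usage event; return True if the window budget is exceeded."""
        self.events.append((timestamp, tokens))
        self.total += tokens
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_s:
            _, old_tokens = self.events.popleft()
            self.total -= old_tokens
        return self.total > self.max_tokens

monitor = BurnRateMonitor(window_s=60, max_tokens_per_window=10_000)
alerts = [
    monitor.record(0, 4_000),
    monitor.record(20, 4_000),
    monitor.record(40, 4_000),   # 12,000 tokens inside one minute
    monitor.record(120, 1_000),  # earlier events have expired
]
print(alerts)
```

Timestamps are passed in explicitly so the logic can be tested without waiting; a daemon would feed it time.monotonic() readings instead.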

Phase Escalation

The EPA doesn't just alert. It acts.

NORMAL -> QUARANTINE -> PRESERVATION -> RECOVERY -> LIFEBOAT

NORMAL: Everything's fine. Passive monitoring.

QUARANTINE: Anomaly detected. Isolate the affected subsystem. Block new requests to it. Keep everything else running.

PRESERVATION: Multiple anomalies. Start persisting critical state to durable storage. Reduce non-essential operations.

RECOVERY: System is degraded. Attempt automatic recovery — restart failed services, replay lost messages, rebuild corrupted state.

LIFEBOAT: Recovery failed. Save everything salvageable, shut down gracefully, and prepare for clean restart.
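
The escalation ladder can be sketched as a small state machine. The five phase names come from the list above; the transition rules (one anomaly quarantines, several escalate one step at a time, failed recovery goes to the lifeboat) are a simplified assumption, not the EPA's actual policy.

```python
from enum import IntEnum

class Phase(IntEnum):
    NORMAL = 0
    QUARANTINE = 1
    PRESERVATION = 2
    RECOVERY = 3
    LIFEBOAT = 4

def next_phase(current: Phase, anomalies: int, recovery_failed: bool = False) -> Phase:
    """Pick the next phase from the current one and the latest health check."""
    if current is Phase.RECOVERY:
        if recovery_failed:
            return Phase.LIFEBOAT
        return Phase.NORMAL if anomalies == 0 else Phase.RECOVERY
    if current is Phase.LIFEBOAT:
        return Phase.LIFEBOAT  # terminal until an operator restarts the system
    if anomalies == 0:
        return Phase.NORMAL
    if anomalies == 1:
        return max(current, Phase.QUARANTINE)
    # Several anomalies: persist state first, then attempt recovery.
    return Phase.PRESERVATION if current < Phase.PRESERVATION else Phase.RECOVERY

phase = Phase.NORMAL
history = [phase]
for anomalies, failed in [(1, False), (3, False), (3, False), (2, True)]:
    phase = next_phase(phase, anomalies, failed)
    history.append(phase)
print(" -> ".join(p.name for p in history))
```

Keeping the transition function pure (no side effects, no clock) makes each escalation rule easy to unit-test before it is wired to anything that can restart services.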

Why Not Just Use PagerDuty?

PagerDuty tells a human there's a problem. The EPA fixes the problem — or at least contains the blast radius — before a human even wakes up.

The key insight: AI infrastructure fails gradually, not suddenly. By the time a traditional alerting system pages someone, the damage is already done. The EPA intervenes at the first sign of drift.

Try It

The Sovereign Hive is open source. The EPA ships as one of 11 power-up modules in the Intelligence Bundle.

Repo is private during development — DM me for early access.

I Built a 127-Tool MCP Server From Scratch — Here's What I Learned

MxGuru — Wed, 20 May 2026 17:01:00 +0000

The Model Context Protocol (MCP) is how AI agents talk to tools. Claude Code, Cursor, Windsurf — they all use it. But most MCP servers have 5-10 tools.

I built one with 127.

Why?

I run a local AI operations platform called The Sovereign Hive. It coordinates multi-agent swarms, runs security scans, manages a knowledge graph, and serves as the backbone for everything I build. Every agent needs tools — and I got tired of wiring up 8 different MCP servers.

So I consolidated everything into one server, one port, one health endpoint.

The Tool Categories

Category	Count	Examples
File I/O	11	read, write, copy, move, delete, head, tail, wc
Search	6	grep, glob, find_symbol, find_references, search_replace
Git	10	status, diff, log, blame, commit, branch, stash, tag
Code Analysis	6	lint, complexity, dead_code, dependency_graph
Browser Automation	7	navigate, screenshot, click, fill, evaluate, snapshot
Docker	8	ps, logs, exec, images, inspect, run, stop, stats
Semantic Memory	7	store, search, relate, observe, get, list, delete
Monitoring	4	health_probe, logs_tail, service_status, uptime_check
HTTP/Web	5	fetch, request, dns_lookup, url_encode, curl_equivalent
Web Search	1	DuckDuckGo via ddgs (no API key)
System	7	system_info, process_list, env_vars, port_check, disk_usage
Data Parsing	7	json_query, csv, yaml, toml, ini, xml, json_format
Database	3	sqlite_query, sqlite_schema, sqlite_tables
Archive	5	zip/tar create, extract, list
Text/Transform	8	diff, regex, base64, hash, token_estimate, string_transform
Crypto	4	generate_secret, uuid, hmac, password_hash
Notebook	3	read, create, add_cell
Task/Todo	4	create, list, update, complete
Prompt Engineering	4	build, chain, message_format, library
Thinking/Reasoning	4	sequential_think, decision_matrix, assumption_check, pros_cons
API Testing	4	graphql_query, websocket_send, api_test, openapi_parse
Comms Hub	3	post, read, channels
Ollama	2	list models, generate

Architecture Decisions

Every tool is an async function with the same signature:

async def tool_name(args: dict) -> dict:

Input is always a dict. Output is always a dict. No exceptions in the signature — errors go in {"error": "..."}.

Every tool carries MCP metadata:

TOOL_META = {
    "name": "grep_recursive",
    "description": "Search for a regex pattern across files in a directory tree.",
    "inputSchema": { ... }  # JSON Schema
}

This means any MCP client can discover the tool, see its parameters, and call it — without reading the source code.
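
A compact sketch of how the signature, the metadata, and dispatch can fit together. The registry, the decorator, and the word_count tool are hypothetical and exist only for illustration; the real server's internals are not shown in this post.

```python
import asyncio

REGISTRY = {}

def register(meta: dict):
    """Decorator that stores a tool together with its metadata."""
    def wrap(fn):
        REGISTRY[meta["name"]] = {"meta": meta, "fn": fn}
        return fn
    return wrap

@register({
    "name": "word_count",  # hypothetical tool, not from the real server
    "description": "Count whitespace-separated words in a string.",
    "inputSchema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
})
async def word_count(args: dict) -> dict:
    if "text" not in args:
        return {"error": "missing required argument: text"}
    return {"count": len(args["text"].split())}

async def call_tool(name: str, args: dict) -> dict:
    """Dispatch by name; failures come back as {'error': ...}, never raised."""
    entry = REGISTRY.get(name)
    if entry is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return await entry["fn"](args)
    except Exception as exc:  # keep the dict-in, dict-out contract
        return {"error": str(exc)}

def list_tools() -> list:
    """What a client would see when it asks the server for its tools."""
    return [entry["meta"] for entry in REGISTRY.values()]

print(asyncio.run(call_tool("word_count", {"text": "one port one health endpoint"})))
print(asyncio.run(call_tool("word_count", {})))
```

Because every tool shares one shape, the transport layer only ever needs list_tools and call_tool, whichever of stdio or HTTP it speaks.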

The registry supports both stdio and HTTP/SSE transport:

mcp_server.py — JSON-RPC over stdin/stdout (for Claude Code direct integration)
mcp_server_sse.py — FastAPI with /tools, /tools/call, /mcp, /sse, /health endpoints

No mandatory external dependencies. Every tool uses Python stdlib where possible. Browser tools need Playwright. Docker tools need Docker. But the other 112 tools work with zero pip installs beyond FastAPI/uvicorn.

The Semantic Memory System

This was the most interesting piece to build. It's a knowledge graph stored in SQLite with TF-IDF similarity search — no vector database, no embeddings model required.

await memory_store({"name": "project-x", "content": "FastAPI backend with Redis caching", "type": "project"})
await memory_relate({"from": "duayne", "relation": "builds", "to": "project-x"})
await memory_observe({"entity": "project-x", "content": "Deployed to production"})
results = await memory_search({"query": "FastAPI caching backend"})

Entities, relationships, and observations — all queryable. Agents can build up persistent knowledge across sessions without needing a GPU or external service.
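
For readers curious how TF-IDF search over SQLite can work with no embeddings, here is a stdlib-only sketch. The schema, the function names, and the smoothed IDF formula are assumptions; the real memory tools may differ in every detail.

```python
import math
import re
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (name TEXT PRIMARY KEY, content TEXT, type TEXT)")

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def store(name: str, content: str, type_: str) -> None:
    db.execute("INSERT OR REPLACE INTO entities VALUES (?, ?, ?)", (name, content, type_))

def search(query: str, top_k: int = 3) -> list:
    """Rank stored entities by TF-IDF cosine similarity to the query."""
    rows = db.execute("SELECT name, content FROM entities").fetchall()
    docs = {name: Counter(tokenize(content)) for name, content in rows}
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)

    def vector(counts):
        # Smoothed IDF so terms present in every document keep some weight.
        return {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vector(Counter(tokenize(query)))
    scored = [(cosine(q_vec, vector(counts)), name) for name, counts in docs.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

store("project-x", "FastAPI backend with Redis caching", "project")
store("project-y", "Static marketing site built with Hugo", "project")
print(search("FastAPI caching backend"))
```

This version recomputes document vectors on every query, which is fine for a few thousand entities; a larger store would cache term statistics in their own table.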

What I'd Do Differently

Start with MCP metadata from day one. I retrofitted it onto 15 existing tools. Building it in from the start is much cleaner.
Group tools by file, not one-per-file. Related tools (like all git operations) belong together.
The DDG HTML scraper approach failed. DuckDuckGo now serves CAPTCHAs to scrapers. Use the ddgs library or pay for a search API.

Try It

The entire stack is open source: Repo is private during development — DM me for early access.

The Best Result This Week Was a Failed Prediction — Phase-3a Doesn't Transfer

MxGuru — Wed, 20 May 2026 16:35:16 +0000

Part 3 of the quantization series. Yesterday I tested whether Part 1's drift-inversion intervention generalizes beyond granite. I wrote down a falsifiable prediction before the result. The prediction failed in real time — Qwen-2.5-14B reverses the sign of the effect, distributed across 61% of windows, not noise. This post is why a clean failed prediction is a better outcome than three-for-three same-direction would have been, and what the n=3 transfer data actually says about whether the intervention generalizes. Spoiler: it doesn't. And that's the win.

Two Localizers, Both Wrong: Bounding a Quantization Cost That Wouldn't Close

MxGuru — Wed, 20 May 2026 14:45:46 +0000

Part 2 of the quantization series. Spent two days and $12 hunting for the right localizer after Part 1 showed the per-layer drift metric lies. Both candidates — token-level logit-divergence at wrong tokens, AWQ-clipping on the surfaced layers — came back empty. Honest finding: an 8B model on a 12GB card costs ~12.7% PPL on wikitext-2, the gap is diffuse and proportional, no clever subset-targeted fix closes it. One process habit (a no-op control reproducing the baseline to 4 decimals) caught a silent bug that would have shipped a wrong 'AWQ-clipping wins' claim.

When the Sensitivity Metric Lies: A Drift-Inversion Smoking Gun in Mixed-Precision LLM Quantization

MxGuru — Wed, 20 May 2026 11:32:35 +0000

The HSAQ pipeline (Hybrid Sensitivity-Aware Quantization) is supposed to do one thing well: spend bits where they hurt. Profile each Linear layer's output drift under 2/3/4-bit quantization on real calibration data, then let a greedy allocator distribute the bit budget so total drift is minimized under the VRAM ceiling.

That works. Until it doesn't.

This is the story of one experiment — Phase-3a, run 2026-05-19 on ibm-granite/granite-3.3-8b-instruct — that broke a quiet assumption underneath the whole approach. The drift metric mismeasures real PPL impact on outlier-heavy attention layers. Worse, it mismeasures it in the wrong direction: the harder you push the metric down, the more outliers can sometimes corrupt generation.

The setup

HSAQ's baseline on granite-3.3-8B at a 12 GB consumer VRAM budget produces a mixed assignment averaging ~3.3 bits per Linear across 281 quantized modules. Measured against bf16, this baseline lands at:

Metric	bf16 baseline	HSAQ baseline	Δ
Wikitext perplexity	8.756	10.013	+14.42%

A +14.42% PPL hit is rough. Target was <8% (a soft "you can still feel it but it's usable" line in our internal eval). The first thing you do when the budget is the constraint is examine the residue — which layers are at the bottom of the bit-ladder, and could a small structural rule move them up?

After baseline assignment, 16 of 281 Linears sit at 3-bit (the rest at 4):

7 × mlp.down_proj: FFN down projections (~59M params each, the allocator's favorite victims)
6 × self_attn.o_proj: attention output projections (the outlier-heavy ones)
2 × mlp.gate_proj (L0, L39)
1 × self_attn.q_proj (L34)

The Phase-3a intervention was simple: force all o_proj layers to a minimum of 4 bits, regardless of allocator preference. Six layers move 3 → 4. About 0.05 GB of weight budget gets reallocated. Re-run end to end.
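
To show what a floor does to a greedy allocator, here is a toy sketch. It is not HSAQ's allocator: the function, the cost model (parameters times bits, no overhead), and the drift numbers are invented, with the o_proj drift deliberately made worse at 4-bit to mimic a layer the metric would refuse to promote.

```python
def allocate_bits(layers, budget_bytes, floors=None):
    """Greedy mixed-precision allocation.

    layers: {name: {"params": int, "drift": {2: float, 3: float, 4: float}}}
    floors: optional {substring: min_bits}, e.g. {"o_proj": 4}
    """
    floors = floors or {}

    def floor_for(name):
        return max([2] + [bits for key, bits in floors.items() if key in name])

    def cost(name, bits):
        return layers[name]["params"] * bits / 8  # bytes, ignoring overhead

    assignment = {name: floor_for(name) for name in layers}
    used = sum(cost(n, b) for n, b in assignment.items())
    if used > budget_bytes:
        raise ValueError("floors alone exceed the budget")

    while True:
        best = None
        for name, bits in assignment.items():
            if bits >= 4:
                continue
            extra = cost(name, bits + 1) - cost(name, bits)
            if used + extra > budget_bytes:
                continue
            drift = layers[name]["drift"]
            gain = (drift[bits] - drift[bits + 1]) / extra  # drift saved per byte
            if gain <= 0:
                continue  # the metric says this promotion would not help
            if best is None or gain > best[0]:
                best = (gain, name, extra)
        if best is None:
            return assignment
        _, name, extra = best
        assignment[name] += 1
        used += extra

toy_layers = {
    "layers.0.mlp.down_proj": {"params": 800, "drift": {2: 9.0, 3: 4.0, 4: 1.0}},
    "layers.0.self_attn.o_proj": {"params": 800, "drift": {2: 5.0, 3: 1.0, 4: 3.0}},
    "layers.0.self_attn.q_proj": {"params": 800, "drift": {2: 4.0, 3: 2.0, 4: 1.5}},
}
free_choice = allocate_bits(toy_layers, budget_bytes=1000)
with_floor = allocate_bits(toy_layers, budget_bytes=1000, floors={"o_proj": 4})
print("allocator alone:", free_choice)
print("with o_proj floor:", with_floor)
```

Under a fixed budget the floor is not free: in this toy run the bits forced onto o_proj come out of q_proj. Whether that trade helps is exactly what the drift metric cannot tell you, which is why the post measures PPL end to end.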

The result

Metric	HSAQ baseline	HSAQ + o_proj floor	Δ
PPL above bf16	+14.42%	+13.80%	-0.62pp

A real improvement. Small (about 4% relative on the gap to bf16) but real. And reproducible: the baseline run inside the same job matched yesterday's baseline to 4 decimal places (10.0133 → 10.0133), so the before/after comparison rests on a stable baseline. Cache invariance was also confirmed: HSAQ's SQLite sensitivity cache produced identical drift values across both runs.

So far this is unremarkable. The "+0.62pp from a 0.05 GB nudge" finding alone would justify a paragraph in an internal log, nothing more.

Then we looked at the per-layer drift.

The inversion

When the floor forced these six o_proj layers from 3-bit to 4-bit, their measured per-layer drift went dramatically worse — not better:

Layer	Drift at 3-bit	Drift at 4-bit	Ratio
`model.layers.21.self_attn.o_proj`	2.70	8.44	3.1× worse
`model.layers.30.self_attn.o_proj`	1.26	6.51	5.2× worse
`model.layers.8.self_attn.o_proj`	1.39	3.44	2.5× worse

Three of six layers showed drift inflation of roughly 2.5× or more at the higher bit-width. And the overall PPL (the thing the drift metric is supposed to predict) got better anyway.

Let that land. The signal the allocator uses to decide which layers deserve more bits is telling us:

"Layer 21's o_proj is 3× more damaged at 4-bit than at 3-bit. Definitely don't promote it."

And the model is responding:

"Actually, the 4-bit version generates better text. Thanks."

This is not noise. It reproduced across 32-sample and 256-sample calibration sets. It is a systematic divergence between what HSAQ measures and what actually matters.

What's actually happening

HQQ's quantization is groupwise: it picks one scale and zero-point per group of 64 weights. The mechanism that makes HQQ fast and parameter-light is the same mechanism that breaks here.

"One scaling factor for 128 weights means one outlier crushes the other 127 to zero." — Gemini's description of HQQ group quantization (we run at group_size=64, but the principle is identical).

On outlier-heavy layers like o_proj (which carries the per-head attention output back into the residual stream) and down_proj (which projects the wide FFN intermediate back down), a small number of channels carry order-of-magnitude larger activations than the rest. At 3-bit, the quantization is so coarse that everything is approximate and the model has already absorbed the noise. At 4-bit, you get more precision per group, but the outlier still dominates its group's scale — so the 63 non-outlier weights in that group get more crushed relative to what they should be, not less.

The drift metric notices this. It measures normalized MSE between the bf16 layer output and the quantized layer output on captured calibration activations. The increased crushing of small weights inside outlier-dominated groups produces a larger MSE — that part is real and the metric is honest about it. But the model in practice is much more tolerant of "small weights got squashed" noise than of "outlier weight got rounded to a bin that doesn't represent its magnitude" noise. The drift metric weights these the same. Real PPL doesn't.
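
The "one outlier sets the scale for its whole group" effect is easy to see in a toy. The sketch below uses plain min-max round-to-nearest on a group of 8 weights; it is a simplification and not HQQ's actual solver, it works on weights rather than layer outputs, and it does not attempt to reproduce the 3-bit versus 4-bit inversion. All names and numbers are illustrative.

```python
def quantize_group(weights, bits):
    """Asymmetric min-max quantization of one group: one scale, one zero-point."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # guard against a constant group
    return [round((w - lo) / scale) * scale + lo for w in weights]

def normalized_mse(reference, approx):
    """Squared error relative to the energy of the reference values."""
    num = sum((r - a) ** 2 for r, a in zip(reference, approx))
    den = sum(r ** 2 for r in reference)
    return num / den

small = [-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03]
with_outlier = small + [4.0]   # one large weight shares the group
without_outlier = small + [0.04]

q_out = quantize_group(with_outlier, bits=4)
q_clean = quantize_group(without_outlier, bits=4)

print("distinct values for the 7 small weights, outlier present:", len(set(q_out[:7])))
print("distinct values for the 7 small weights, no outlier:     ", len(set(q_clean[:7])))
print("normalized MSE on the small weights, outlier present:", normalized_mse(small, q_out[:7]))
print("normalized MSE on the small weights, no outlier:     ", normalized_mse(small, q_clean[:7]))
```

With the outlier in the group, all seven small weights land in the same bin, so whatever structure they carried is gone even though the outlier itself is represented almost exactly.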

"HQQ is blind to data flowing through it." — same source. This is the whole conceptual gap that activation-aware methods (AWQ, GPTQ, imatrix) close.

What this means if you use a drift-based allocator

If you're running anything in the mixed-precision-by-sensitivity family (SqueezeLLM, OWQ, our HSAQ, anything that picks per-layer bit-widths from a calibration MSE signal), there is a category of layer where your signal is lying to you. Specifically: outlier-heavy attention output projections (o_proj) and FFN down projections (down_proj). These are the kind of layers AWQ singled out back in 2023 as needing per-channel scaling, and the reason is precisely the dynamic our drift metric is failing to model.

Two implications:

Treat the drift signal as approximate on o_proj and down_proj. A sensitivity floor is one cheap way to do this — force these layers to a known-better bit-width regardless of what calibration MSE says. That's what Phase-3a tested, and it worked, even though it cut against the allocator's recommendation.
Calibration-MSE is the wrong signal for outlier-heavy layers. The right signal is something like KL divergence on output logits, or PPL impact directly measured on a held-out validation set. Both are more expensive than HQQ-output MSE, but on the layers where MSE lies, the expense is justified.
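
As a sketch of the alternative signal, here is KL divergence between the reference model's and the quantized model's next-token distributions, computed from raw logits in pure Python. The function names and the toy logits are made up; in practice the logits would come from running both models on the same held-out tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(ref_batch, quant_batch):
    """Average KL over token positions; higher means more output damage."""
    values = [kl_divergence(r, q) for r, q in zip(ref_batch, quant_batch)]
    return sum(values) / len(values)

ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
mild = [[1.9, 0.6, -1.0], [0.1, 0.3, 2.9]]      # small perturbation
severe = [[0.5, 2.0, -1.0], [3.0, 0.2, 0.1]]    # top token flipped

print(f"mild:   {mean_kl(ref, mild):.4f}")
print(f"severe: {mean_kl(ref, severe):.4f}")
```

Unlike per-layer MSE, this measures damage where it matters, at the output distribution, at the cost of a full forward pass per candidate assignment.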

We are not the first to notice this. AWQ's original paper makes the case in different language: the weights that matter most should be identified from the activations flowing through them, not from the weights themselves. HQQ's design choice to be data-blind is the feature that makes it fast and the bug that makes it brittle. What this experiment adds is a clean reproduction on a current 8B model, with the exact mechanism visible: same calibration cache, same allocator, two runs differing only in the floor parameter, drift-vs-PPL anticorrelation jumping out at you.

What didn't work

For completeness: Phase-3a tested two structural levers, and only one helped meaningfully.

o_proj sensitivity floor: +0.6pp PPL improvement. Useful, but small.
group_size=64 (vs the HQQ default of 128): already baked into HSAQ from day one (config.py:52: HQQ_OVERHEAD_FACTOR = 1.065 # 6.5% average (zeros 64 + scales 64 per group)). The hypothesis that tightening the group size would help was wrong about our starting point — we were already at the practical floor. Tightening further to gs=32 has diminishing returns and roughly doubles overhead.

The conclusion is sharper than the headline number: more HQQ tuning is not the lever. The bit budget is gone, the group size is at the practical floor, and the drift metric we're using to allocate the budget that remains is unreliable on the layers where allocation matters most.

What's next: AWQ on a 9-layer target list

A separate diagnostic — logit divergence comparison between the HSAQ-quantized model and bf16, run on 96 prompts the same day — produced a clean QUANTIZATION_BIAS_DOMINANT verdict: 63/96 divergences are confidently wrong (the model is sure of a wrong token), only 3/96 are high-entropy uncertainty. This is the signature of representation failure, not undertraining. It is what AWQ is designed to fix.

The diagnostic surfaced nine specific layers driving the divergence:

Layer	Drift score
`model.layers.28.self_attn.o_proj`	23.00
`model.layers.13.self_attn.o_proj`	14.53
`model.layers.15.mlp.down_proj`	6.36
`model.layers.28.mlp.down_proj`	6.28
`model.layers.25.mlp.down_proj`	6.21
`model.layers.20.mlp.down_proj`	5.41
`model.layers.14.self_attn.o_proj`	5.18
`model.layers.15.self_attn.o_proj`	5.15
`model.layers.17.self_attn.o_proj`	4.69

Pattern: mid-to-late transformer (L13–L28), attention output and MLP down projections. Textbook activation-outlier signature. The next post will report on an AWQ POC targeting exactly these nine layers — leaving the other 272 Linears under HSAQ as today, swapping only the outliers to AWQ. If the gap closes there, the recipe likely generalizes. If it doesn't, we have a different problem.

Calibrating prior claims

A previous LinkedIn pulse made the claim that this hybrid quantisation recipe holds across model families. That claim should be softened pending the AWQ run. The HSAQ allocator's behavior on o_proj and down_proj is consistent across architectures we've tested — but the fix (whether AWQ closes the gap to <8% PPL across architectures) is not yet validated. Phi-4 has a different attention layout (no separate o_proj); confirming transferability there requires running the same divergence diagnostic on a Phi-4 HSAQ quantization, which is queued.

Bottom line

If you're using calibration-MSE as your per-layer sensitivity signal, run a sanity check: pick your worst-PPL allocation and force-promote the o_proj and down_proj layers to 4-bit anyway. If PPL improves, your drift metric is lying to you in the same direction ours is. You can get that information without changing your quantizer, and what it tells you is that the quantizer needs to change.
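
A minimal sketch of that sanity check, assuming the allocation is a plain mapping from layer name to bit-width and that you already have some way to measure perplexity for an allocation. `force_promote`, `drift_metric_suspect`, and the `eval_ppl` callback are hypothetical names, not part of any published tool.

```python
def force_promote(allocation, suffixes=("self_attn.o_proj", "mlp.down_proj"), min_bits=4):
    """Return a copy of the allocation with o_proj/down_proj raised to at least min_bits."""
    return {
        name: max(bits, min_bits) if name.endswith(suffixes) else bits
        for name, bits in allocation.items()
    }

def drift_metric_suspect(allocation, eval_ppl):
    """Compare PPL before and after promotion.

    eval_ppl is a caller-supplied function (hypothetical here) that quantizes
    the model under the given allocation and returns its perplexity.
    """
    baseline = eval_ppl(allocation)
    promoted = eval_ppl(force_promote(allocation))
    # If ignoring the drift metric's advice lowers PPL, the metric is
    # under-rating these layers.
    return promoted < baseline, baseline, promoted
```

The promoted allocation deliberately ignores the bit budget. It is a diagnostic run, not a deployable configuration.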

This is part of an ongoing series on running 13–20B language models on 12 GB consumer GPUs. The pipeline is open work-in-progress at mxguru1/hsaq-tools on Hugging Face. Granite-3.3-8B was chosen as the headline target because community AWQ/GPTQ quants exist for ground truth, and because 8B parameters at mixed 3/4-bit fits comfortably on a 12 GB card with room for a LoRA adapter.

Update (2026-05-21) — model-specificity caveat

Follow-up transfer testing on the o_proj 3→4-bit floor intervention shows it is model-specific, not a generalizable recipe. On a clean, identical evaluation protocol (full wikitext-2 test set, non-overlapping 2048-token windows):

Model	Δ PPL from floor (positive = PPL reduced)	Direction
granite-3.3-8B	+0.0840 (+1.137%)	improvement
phi-4 (14B)	+0.0088 (+0.127%)	small improvement
Qwen-2.5-14B	−0.0019 (−0.031%)	mild regression

Phase-3a's observation that drift-MSE on outlier-heavy layers disagrees with downstream PPL holds for granite as originally reported. Forcing o_proj layers from 3-bit to 4-bit carries over to phi-4 with a small positive effect (67.6% of windows helped) and reverses on Qwen-2.5-14B (61.2% of windows hurt). None of the candidate predictors (count of underbitted layers, tier distribution, architecture, parameter scale) sorts the result.
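
For readers who want to reproduce the protocol, the arithmetic looks roughly like the sketch below. It assumes you already have the summed negative log-likelihood (in nats) for each window from both runs. Dropping a trailing partial window and the helper names are assumptions of this example; the sign of `delta_ppl` follows the table above, where positive means perplexity went down.

```python
import math

def non_overlapping_windows(token_ids, window=2048):
    """Split a token stream into consecutive windows that share no tokens.

    A trailing partial window is dropped here; that is an assumption.
    """
    n_full = len(token_ids) // window
    return [token_ids[i * window:(i + 1) * window] for i in range(n_full)]

def perplexity(window_nlls, tokens_per_window):
    """Corpus-level PPL: exp of total NLL divided by total scored tokens."""
    return math.exp(sum(window_nlls) / sum(tokens_per_window))

def delta_ppl(ppl_floor, ppl_promoted):
    """Positive when the promoted run has lower perplexity."""
    return ppl_floor - ppl_promoted

def window_breakdown(nll_floor, nll_promoted):
    """Fraction of windows where the intervention lowered or raised NLL."""
    if len(nll_floor) != len(nll_promoted):
        raise ValueError("both runs must score the same windows")
    n = len(nll_floor)
    helped = sum(1 for a, b in zip(nll_floor, nll_promoted) if b < a)
    hurt = sum(1 for a, b in zip(nll_floor, nll_promoted) if b > a)
    return {"helped": helped / n, "hurt": hurt / n}
```

Reporting the per-window breakdown alongside the aggregate delta matters when the delta is this small: it shows whether the shift is spread across the corpus or driven by a handful of windows.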

Full writeup of the transfer testing, the dose-response hypothesis that died on the clean protocol, and the discipline checks that caught a wrong prediction in real time is in Part 2 and a forthcoming Part 3.

Two Local-Agent Philosophies: Where Hermes Earns Its Design, and Where the Tradeoffs Invert

MxGuru — Tue, 19 May 2026 08:23:18 +0000

This is a submission for the Hermes Agent Challenge

I've spent the last five months building an offline multi-tier agent swarm on a single workstation — an RTX 5070, a Ryzen 9 9950X3D, and a hard rule that nothing crosses the network boundary without explicit permission. When the Hermes Agent Challenge came up, I sat down to write a "why I'd use Hermes" piece. Halfway through, I realised I had to write a different post: why Hermes is the right choice for most people building local agents, and why a specific class of deployments has to make the opposite call.

This isn't a criticism of Hermes. Nous Research designed something good. What I want to lay out is where the design choices stop applying — not because they're wrong, but because the threat model changes.

What Hermes is good at

The repo and docs are clear about the thesis: Hermes is "the agent that grows with you." Built-in learning loop. Creates skills from experience. Searches its own past conversations. Builds a deepening model of who you are across sessions. Runs on a $5 VPS, a GPU cluster, or serverless infrastructure. Use any model: Nous Portal, OpenRouter, NVIDIA NIM, or your own endpoint. Switch with `hermes model`, no code changes.

That's a coherent design. The whole framework leans into a specific bet: that an agent operating with you over time, accumulating context and skills, gets more useful than an agent that starts from zero each session. For most use cases I can think of — personal productivity, research workflows, automating the weird operational stuff no SaaS product handles properly — that bet is the right one.

If I were building a Hermes-style workflow for myself, I'd lean on:

The session memory and conversation search — the operational benefit of an agent that already knows what I was working on yesterday is significant
The skill-creation loop — instead of re-typing the same chain of tool calls, the agent persists the pattern
The model flexibility — being able to swap providers without rewriting code is genuinely useful when you're testing what works
The cheap-to-idle infrastructure pattern — you can leave it running and it costs nearly nothing when nothing's happening
The client itself is lightweight — and that matters more than it sounds. JetBrains PyCharm, Windsurf, and the other heavyweight AI-augmented IDEs are CPU-intensive in a way you feel on a dev workstation that's already running real workloads. The Hermes client gets out of the way. When my machine is busy doing actual work, I'm not also paying a tax for the agent to exist. That's not a marquee feature in the docs, but it's the kind of detail that shows up after a few weeks of real use.

For an individual builder, a white hat researcher poking at things on their own time, a small team automating their own ops — this is well-shaped. The learning loop earns its complexity by paying off across sessions. The "talk to it from Telegram while it works on a cloud VM" pattern is genuinely powerful for people whose workflow benefits from continuity.

This isn't faint praise. Hermes is doing a real thing well.

Where the design choices flip

The deployments I've been building for operate under a different constraint set. Specifically: the threat model assumes the agent itself is a potential vector. Not because it's malicious by design — because anything that can modify its own behaviour over time can be steered into modifying it the wrong way, given enough adversarial pressure on its inputs.

The thing Hermes treats as its strength — the agent grows, learns from experience, creates skills, persists memory — is the exact behaviour my architecture is built to prevent.

That's not a Hermes problem. It's a security posture that decided "the agent should not be able to surprise me" was worth the cost of throwing away the productivity gains of learning-over-time.

The architectural decisions that follow from that posture are:

Hardcoded permission gates over emergent capability. Every privileged operation routes through a gate that knows what tier the requesting agent runs at and what operations that tier can perform. No bypass flag. No "trusted" internal path. If a new capability is needed, it gets added to the gate explicitly, by a human, in code review.
Knowledge stays read-only for the agent. There's a local Knowledge Vault that holds threat intelligence and audit logs. Agents read from it constantly. They write to specific append-only paths under their tier's permission. They cannot modify what's already there. A learning-loop agent that "improves its skills" would be writing to the very place I'm protecting from writes.
Tier is immutable for the agent's lifetime. You can't escalate yourself mid-run. To do privileged work, you spawn a child agent at a higher tier, and that spawn is audited. The thing Hermes calls a feature — an agent that grows — my architecture treats as a control failure mode.
No cross-session continuity by default. Session memory is per-session unless explicitly persisted by a gated operation. The "agent that knows what you were doing yesterday" is, in a high-security context, "an attack surface that yesterday's adversary can still influence today."

These aren't claims that Hermes' design is wrong. They're claims about a different threat model where the tradeoffs invert.
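
To make the four decisions above concrete, here is a deliberately small sketch of a hardcoded gate, an immutable tier, and an append-only store. None of it is the real implementation, which stays private: the tier names, operation strings, and policy table are all invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    READER = 0
    ANALYST = 1
    OPERATOR = 2

# Hypothetical policy table. It lives in code, so changing it means a code review.
ALLOWED = {
    Tier.READER: frozenset({"vault.read"}),
    Tier.ANALYST: frozenset({"vault.read", "vault.append", "agent.spawn"}),
    Tier.OPERATOR: frozenset({"vault.read", "vault.append", "agent.spawn"}),
}

@dataclass(frozen=True)
class Agent:
    """frozen=True makes the tier unassignable after construction."""
    name: str
    tier: Tier

class GateDenied(PermissionError):
    pass

class Gate:
    def __init__(self):
        self.audit_log = []

    def check(self, agent, operation):
        """Audit every request, then allow or refuse. There is no bypass argument."""
        allowed = operation in ALLOWED[agent.tier]
        self.audit_log.append((agent.name, agent.tier.name, operation, allowed))
        if not allowed:
            raise GateDenied(f"{agent.name} ({agent.tier.name}) may not {operation}")

    def spawn(self, parent, child_name, child_tier):
        """Privileged work happens in a new, audited child, never by self-escalation."""
        self.check(parent, "agent.spawn")
        return Agent(child_name, child_tier)

class AppendOnlyVault:
    def __init__(self, gate):
        self._gate = gate
        self._entries = []

    def read(self, agent):
        self._gate.check(agent, "vault.read")
        return tuple(self._entries)  # immutable snapshot, not the live list

    def append(self, agent, entry):
        self._gate.check(agent, "vault.append")
        self._entries.append(entry)  # no update or delete method exists
```

The point of the sketch is what is absent: no `trusted=True` parameter, no method that edits an existing entry, and no setter for `tier`.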

The bridge

Here's the part that I think actually matters for anyone reading this and trying to decide which way to build:

For typical consumer use and most white hat / research workflows, the security posture I'm describing is overkill. It costs a lot of operational ergonomics, demands real architectural discipline, and the threats it's defending against don't apply to someone running an agent on their own laptop to automate their own life. Hermes' learning loop is a net win in that context. The productivity from continuity dwarfs the theoretical risk surface.

But there's a class of deployments where total control over what the agent can do, in what order, with what authorisation, becomes the actual product. Adversarial security research, local Blue Team analysis where compromise of the tooling is part of the threat model, environments where the agent has access to data that must never be corrupted by any process: that's where you cross the bridge.

On the consumer side of the bridge, Hermes is well-designed and the learning loop is a feature.

On the other side, the same loop becomes a property the architecture is built to prevent.

This isn't Hermes being wrong. It's that any local-agent framework has to commit to a stance on whether the agent should be able to surprise its operator. Hermes commits one way. A high-security swarm commits the other. Both are coherent.

Why measurement matters more than philosophy

The reason I trust the architectural decision I made — rather than just believing in it — is that the same project produces measurable, reproducible artifacts at every step. The quantization pipeline that runs inside this architecture logs per-layer sensitivity profiles, applies bit-width assignment under explicit budget constraints, and emits manifests that I can diff between runs. Recent runs on an 8B-class model produced bit-identical allocations across runs with 4× the calibration data, which tells me the underlying measurements are stable, not noise.
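
Checking that two runs produced the same allocation can be as simple as hashing a canonical serialisation of the manifest and, when the hashes differ, listing the layers that changed. The manifest shape used here (a mapping from layer name to bit-width) is an assumption; the real manifest format is not described in this post.

```python
import hashlib
import json

def manifest_digest(manifest):
    """SHA-256 of a canonical JSON encoding, so key order cannot change the hash."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def diff_allocations(run_a, run_b):
    """Map each layer whose bit-width differs to its (run_a, run_b) values."""
    return {
        name: (run_a.get(name), run_b.get(name))
        for name in sorted(set(run_a) | set(run_b))
        if run_a.get(name) != run_b.get(name)
    }
```

Matching digests mean the allocations are identical byte for byte, and an empty diff says the same thing in a form a human can read.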

That property — runs produce the same artifact when given the same inputs — is exactly the property a hardcoded gate enforces and exactly the property a learning-loop architecture would compromise over time. Not in a bad way. The learning loop is supposed to change its output as it learns. That's the design. But for the security domain I'm working in, "the system's behaviour drifts over time even with identical inputs" is a property I'm specifically preventing, not enabling.

If you're operating in a context where reproducibility matters more than ergonomics — where you need to be able to prove that today's behaviour matches yesterday's, that no agent has quietly upgraded itself, that the audit trail is the truth — that pushes you toward gates and away from learning loops. Not because gates are better. Because in that context, reproducibility is what "better" means.

The takeaway

If you're building a local agent for yourself and want capability that compounds over time: Hermes is well-designed for that and the framework gives you a lot for free.

If you're building infrastructure where the agent should never be able to do something the operator didn't sign off on, in advance, with audit: build the boring version. Hardcoded gates. Immutable tier. Read-only state for the agent. No emergent behaviour. Yes, you'll do more work. Yes, you'll lose some operational productivity. That's the price of the security property you're buying.

Both stances are defensible. The mistake is using one framework in the other's domain.

For the Hermes Agent Challenge specifically: this isn't a piece I could have written without spending real time inside both philosophies. The framework is doing good work for the people it's designed for. I'm not one of those people right now — but I might be, on a different project, in a different threat model. And the same is true in reverse: if you're a Hermes user reading this and thinking "that security posture sounds excessive for what I'm doing," you're probably right, for what you're doing.

Pick the framework that matches your threat model. Don't pick the one that matches your aesthetic preferences. That's the actual lesson.

Built and tested on RTX 5070, Ryzen 9 9950X3D, fully local. Architecture details and empirical results are publicly available; the specific threat model and implementation internals are not, for reasons that should be obvious given the topic.