DEV Community: Patrick Hughes

The Age of Accountable Agents: Building Trust in Your AI Automation

Patrick Hughes — Wed, 20 May 2026 14:45:12 +0000

The air around AI feels different this "Long Hot A.I. Summer." Big tech is pouring billions into development. Between Elon Musk's legal battles and Meta reassigning 7,000 employees to focus on AI, it's a high-stakes, high-energy environment. But for us, building powerful AI agents on consumer hardware, this moment isn't just about raw computational power or complex models. It's about something more fundamental: trust.

The recent news cycle offers a stark reminder of the ethical considerations, user control challenges, and privacy implications that come with advanced automation. From a papal encyclical discussing AI's moral implications to significant settlements over hard-to-cancel subscriptions, and even debates around nationwide data collection, the narrative is clear: we're entering the Age of Accountable Agents. And as developers, especially those focused on local, user-centric AI, we have a unique opportunity to lead the charge.

## Trust Through Transparency: The AI Encyclical's Echo

When you hear about an Anthropic co-founder discussing AI ethics with the Pope, it's a signal that the impact of our work extends far beyond our terminals. AI agents, by their nature, automate decisions. For these agents to be truly valuable and accepted, they must be transparent.

What does transparency mean for an agent running on your hardware? It means:

* Clear Decision Paths: Can a user understand why their agent took a particular action? If your agent automatically categorizes emails, can it explain its reasoning?
* Auditable Logic: Even if not a full "explanation," the underlying logic should be inspectable. This doesn't mean revealing proprietary secrets, but designing agents where state changes and rule applications are explicit.

Consider an agent designed to manage your smart home devices. Instead of a black box, you could implement a simple logging mechanism:

```python
from datetime import datetime


class SmartHomeAgent:
    def __init__(self, name):
        self.name = name
        self.log = []

    def act_on_temperature(self, current_temp, desired_temp):
        # A 2-degree dead band avoids toggling devices on small fluctuations
        if current_temp > desired_temp + 2:
            action = "Turning on AC"
            # ... actual AC control code
        elif current_temp < desired_temp - 2:
            action = "Turning on Heater"
            # ... actual Heater control code
        else:
            action = "No action needed"
        # Every branch is logged, including the decision to do nothing
        self.log_action(action, f"Current: {current_temp}°C, Desired: {desired_temp}°C")
        return action

    def log_action(self, action, details):
        self.log.append(f"[{self.name}] {datetime.now()}: {action} - {details}")


# Usage
agent = SmartHomeAgent("ClimateControl")
agent.act_on_temperature(25, 22)
print(agent.log)
```

This basic logging provides a human-readable trail, fostering trust by showing, not just doing.

## User Autonomy, Not "Hard-to-Cancel": Learning from Shutterstock

The $35 million settlement Shutterstock faced over difficult subscription cancellations is a potent lesson: users demand control over automated systems. For AI agents, this translates directly to how we design interaction and management. Your agent shouldn't feel like a digital trap.

Key design principles for user autonomy:

* Explicit Opt-in/Opt-out: Clear consent for agent actions and data usage.
* Easy Pause and Stop: Users must be able to halt or reconfigure an agent's operation immediately.
* Understandable Configuration: Agent settings should be accessible and intuitive, not buried in obscure files.

Think about how your agent's lifecycle is managed. Here's a conceptual AgentController:

```python
# Sketch of an AgentController. The wrapped agent is assumed to expose
# name, start_service(), pause_service(), stop_service(), and update_settings().
class AgentController:
    def __init__(self, agent):
        self.agent = agent
        self._state = "stopped"  # one of: "stopped", "running", "paused"

    def start(self):
        if self._state != "running":
            print(f"Starting {self.agent.name}...")
            self._state = "running"
            # thread or process start logic for agent.run()
            self.agent.start_service()

    def pause(self):
        if self._state == "running":
            print(f"Pausing {self.agent.name}...")
            self._state = "paused"
            self.agent.pause_service()

    def stop(self):
        # A paused agent must still be stoppable, so only skip if already stopped
        if self._state != "stopped":
            print(f"Stopping {self.agent.name} permanently...")
            self._state = "stopped"
            self.agent.stop_service()
            # Clean up resources

    def configure(self, new_settings):
        print(f"Configuring {self.agent.name} with new settings.")
        self.agent.update_settings(new_settings)

# When you're building your agents, consider how these controls are exposed to the user.
```

For more effective agent management, especially concerning permissions and operational boundaries on local hardware, check out AgentGuard. It helps you build in these essential controls from the ground up.

## Privacy by Design, Not by Accident: The FBI's Data Ambition

The FBI's desire for nationwide license plate reader access is a stark reminder of the sheer scale of data collection possible today. For local AI agents, privacy should be a default setting, not an afterthought.

When designing your agents, prioritize:

* Local-First Processing: Perform computations and store data on the user's device whenever possible.
* Data Minimization: Only collect and process the data absolutely necessary for the agent's function.
* Transparent Data Policies: Clearly communicate what data an agent uses, why, and whether it ever leaves the device.

Building agents for consumer hardware gives us a distinct advantage here. We can champion local intelligence and ensure that user data stays private by default, not by policy fine print.

## Architecting for Clarity: The Lisp Connection

The Lisp family of languages (Common Lisp, Racket, Clojure) has endured for a reason: its strength in symbolic computation and metaprogramming encourages clarity in expressing complex logic. While you might not be writing your agent in Emacs Lisp, the principles of clear, inspectable, and modular design are paramount.

An agent with well-defined modules for perception, decision-making, and action is easier to debug, understand, and, crucially, to trust. Avoid monolithic codebases where an agent's reasoning is opaque.

```python
# Conceptual Agent Architecture
class AgentBrain:
    def __init__(self, perception_module, decision_module, action_module):
        self.perception = perception_module
        self.decision = decision_module
        self.action = action_module

    def run_cycle(self, environment_data):
        perceived_state = self.perception.process(environment_data)
        desired_action = self.decision.evaluate(perceived_state)
        self.action.execute(desired_action)
        return desired_action  # For logging/traceability

# Each module can have its own transparent logic, making the overall agent's behavior understandable.
```
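
To make the perceive, decide, act cycle concrete, here is a minimal sketch with three toy modules plugged into the same `AgentBrain` shape. The module classes (`ThermometerPerception`, `ThresholdDecision`, `RecordingAction`), the dictionary keys, and the 2-degree threshold are illustrative assumptions, not part of any real framework.

```python
class AgentBrain:
    """Same shape as the conceptual class: perceive, decide, act."""

    def __init__(self, perception_module, decision_module, action_module):
        self.perception = perception_module
        self.decision = decision_module
        self.action = action_module

    def run_cycle(self, environment_data):
        perceived_state = self.perception.process(environment_data)
        desired_action = self.decision.evaluate(perceived_state)
        self.action.execute(desired_action)
        return desired_action


class ThermometerPerception:
    def process(self, environment_data):
        # Keep only the fields the decision step needs
        return {"temp": environment_data["temp_c"], "target": environment_data["target_c"]}


class ThresholdDecision:
    def evaluate(self, state):
        if state["temp"] > state["target"] + 2:
            return "cool"
        if state["temp"] < state["target"] - 2:
            return "heat"
        return "idle"


class RecordingAction:
    def __init__(self):
        self.history = []

    def execute(self, action):
        # A real module would drive hardware; this one only records the request
        self.history.append(action)


actuator = RecordingAction()
brain = AgentBrain(ThermometerPerception(), ThresholdDecision(), actuator)
first = brain.run_cycle({"temp_c": 25, "target_c": 22})
second = brain.run_cycle({"temp_c": 21, "target_c": 22})
print(first, second, actuator.history)
```

Because each module is a small object with one method, any of them can be swapped or inspected on its own without touching the other two.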

## The Intentional Click: Feedback Loops and Refinement

Even a seemingly simple site like clickclickclick.click can serve as a quirky reminder of direct user interaction. How do your agents confirm intent? How do they solicit feedback effectively? It's not about mindlessly automating every single interaction, but about designing clear, intentional communication channels between the user and the agent.

Consider points where your agent might ask, "Did I do that correctly?" or "Is this what you intended?" rather than just assuming. This explicit feedback loop refines the agent's understanding and reinforces the user's sense of control.

## Building for a Trustworthy AI Future

The AI revolution is here, and it's happening everywhere, from the largest data centers to the devices in our pockets. As developers crafting AI agents for consumer hardware, we stand at a critical juncture. We have the unique opportunity, and the responsibility, to build agents that are not just intelligent and efficient, but also trustworthy, accountable, and respectful of user autonomy and privacy.

This summer's "AI gold rush" shouldn't just be about speed; it should be about quality, ethics, and user-centric design. By focusing on transparency, control, and privacy by default, we can ensure our AI agents truly empower, rather than overwhelm, the people who use them.

To help manage these critical aspects of your agent's lifecycle, from permissions to operational safety, explore AgentGuard. It's designed to support you in building the next generation of conscientious AI automation. Start building agents that earn trust, today.

Securing Your AI Agents: Essential Practices for On-Device Automation

Patrick Hughes — Wed, 20 May 2026 14:45:08 +0000

The "Long Hot A.I. Summer" is upon us, as one New York Times headline aptly put it. With major industry shifts like Meta reassigning 7,000 employees to focus on AI and high-profile legal battles shaping the future of foundational models, the pace of innovation is accelerating. As AI models become more capable, the discussion quickly moves from raw intelligence to practical application: building autonomous AI agents that deliver real value. For those of us building these agents to run efficiently and privately on consumer hardware, recent news serves as a critical reminder of two core tenets: security and efficiency.

The Imperative of On-Device Security

The cloud has been the default for many AI applications, offering seemingly infinite scale. However, relying solely on remote infrastructure comes with inherent risks. The recent CISA Admin leak of AWS GovCloud keys on GitHub is a stark, public reminder that even organizations with top-tier security face vulnerabilities. For our AI agents, especially those handling personal data or interacting with sensitive systems, entrusting everything to a third-party cloud provider introduces a control gap.

This is where on-device AI agents truly shine. By running agents directly on your hardware, you retain control over the physical and logical environment. Building secure agents starts with a strong foundation. Think of the principles behind operating systems like OpenBSD 7.9, known for its "secure by default" philosophy and rigorous code auditing. While we might not be building entire operating systems, we can apply similar principles to our agent deployments:

Isolation and Sandboxing: Each agent or critical component should operate within its own confined environment. Tools like Docker containers, lightweight VMs, or even OS-level chroot jails can isolate an agent's processes, file system access, and network interactions. This limits the blast radius if one agent component is compromised.

# Conceptual example: Running an agent process securely
import subprocess
import os

def run_isolated_agent_task(script_path, environment_vars=None):
    # Start from a minimal environment instead of copying os.environ,
    # so secrets held by the parent process are not inherited by the agent
    env = {"PATH": os.environ.get("PATH", "")}
    if environment_vars:
        env.update(environment_vars)

    # Note: this limits inherited variables and run time only. It does not
    # restrict file system or network access; use Docker or chroot for that.
    command = ["python3", script_path]
    try:
        result = subprocess.run(
            command,
            env=env,
            check=True,
            capture_output=True,
            text=True,
            timeout=60 # Agent tasks should have time limits
        )
        print("Agent output:", result.stdout)
        if result.stderr:
            print("Agent errors:", result.stderr)
    except subprocess.CalledProcessError as e:
        print(f"Agent task failed: {e}")
        print("Stderr:", e.stderr)
    except subprocess.TimeoutExpired:
        print("Agent task timed out.")

# Example usage:
# create a simple agent_task.py that writes to a specific file or performs a calc
# run_isolated_agent_task("path/to/your/agent_task.py", {"AGENT_MODE": "SECURE"})

Principle of Least Privilege: An agent should only have the minimum permissions necessary to perform its designated tasks. If an agent only needs to read a specific directory, it should not have write access to the entire file system. This applies to API keys, network access, and system commands.
Secure Communication: For agents that do need to interact with external services, ensure all communication is encrypted (HTTPS, SSH). Avoid storing API keys directly in code; use secure environment variables or dedicated secret management tools.
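
As a rough illustration of the last two points, the sketch below reads a secret from the environment and checks that a requested file stays inside an allowed directory. The variable name `AGENT_API_KEY`, the helper names, and the `/tmp/agent_data` path are assumptions made up for this example.

```python
import os
from pathlib import Path


def load_api_key(var_name="AGENT_API_KEY"):
    """Read a secret from the environment instead of hard-coding it."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Missing required environment variable: {var_name}")
    return key


def is_path_allowed(requested, allowed_root):
    """Return True only if `requested` resolves to a location inside `allowed_root`."""
    root = Path(allowed_root).resolve()
    target = Path(requested).resolve()
    return target == root or root in target.parents


os.environ["AGENT_API_KEY"] = "example-not-a-real-key"  # for demonstration only
key = load_api_key()
inside = is_path_allowed("/tmp/agent_data/notes.txt", "/tmp/agent_data")
escaped = is_path_allowed("/tmp/agent_data/../secrets.txt", "/tmp/agent_data")
print(inside, escaped)
```

Resolving the path before comparing it is what catches the `..` escape; a plain string prefix check would let it through.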

These practices are not just for large enterprises; they are fundamental for anyone building automation that operates autonomously on their hardware.

Efficiency and Cost: The Local Advantage

Beyond security, the economics of AI are shifting. News reports about rising energy costs and data centers being at the heart of bids for energy companies highlight a significant trend: cloud computing is becoming more expensive, both financially and environmentally. Each query sent to a remote LLM incurs a cost, and that cost accumulates quickly.

Running AI agents on your own consumer hardware offers a compelling alternative:

Reduced Operational Costs: Once you've invested in your hardware, the operational costs for running local agents are primarily electricity, which is often far cheaper than continuous cloud API calls, especially for frequent or repetitive tasks.
Environmental Responsibility: Decreasing reliance on massive, energy-intensive data centers contributes to a smaller carbon footprint.
Instantaneity and Data Locality: Processing data locally removes network latency and ensures sensitive information never leaves your device, enhancing both speed and privacy. Apple's "Apple Intelligence" announcements underscore a future where powerful AI capabilities are deeply integrated and run on-device, prioritizing user data privacy and local processing power.

Optimizing models for consumer hardware (e.g., using quantization, smaller models, or specialized runtimes like ONNX Runtime, OpenVINO, or Apple's Core ML) is a key engineering challenge. It requires careful selection of models that balance capability with resource constraints, ensuring your agents can perform their tasks effectively without bogging down your system.
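
To see why quantization matters on consumer hardware, a back-of-the-envelope estimate helps. The sketch below counts weight storage only (parameters times bits per weight); it ignores activations, KV cache, and runtime overhead, and the 7-billion-parameter size is just an example figure.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate memory for model weights only, in decimal gigabytes."""
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the footprint to a quarter, which is often the difference between a model that fits in a consumer GPU's memory and one that does not.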

Practical Agent Engineering for the "Long Hot AI Summer"

As the AI space evolves rapidly, exemplified by major players like Meta reorienting thousands of employees towards AI development, the focus for engineers building agents must be on practical implementation and reliability. It's not enough for an agent to be intelligent; it must be dependable and resilient when operating independently on your devices.

Consider these engineering points:

Clear Task Definition: Define the precise scope and goals of your agent. Avoid mission creep. A well-defined task makes it easier to test, monitor, and secure.
Error Handling and Recovery: What happens if an external API fails? If an expected file isn't found? Agents need thorough error handling, retry mechanisms, and graceful degradation strategies to maintain operations.
Monitoring and Logging: Even on local hardware, you need visibility. Implement clear logging (e.g., to local files, system logs) for agent actions, decisions, and any encountered errors. Monitoring resource usage (CPU, RAM, GPU) helps identify inefficiencies or runaway processes.
Version Control for Agents and Models: Treat your agent code and the models it uses like any other critical software. Use Git for version control, allowing you to track changes, revert to stable versions, and collaborate effectively.
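
The error-handling and logging points can be combined in a few lines. This is a minimal sketch using only the standard library; the function name `call_with_retry`, its defaults, and the `flaky_fetch` stand-in are assumptions for illustration.

```python
import logging
import time

# Pass filename="agent.log" to basicConfig to write to a local file instead of stderr
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")


def call_with_retry(task, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `task`, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            logger.info("Task succeeded on attempt %d", attempt)
            return result
        except Exception as exc:
            logger.warning("Attempt %d failed: %s", attempt, exc)
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_fetch():
    # Stand-in for an external API call that fails twice, then succeeds
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


result = call_with_retry(flaky_fetch, sleep=lambda seconds: None)  # skip real waits in the demo
print(result, attempts["count"])
```

Capping the number of attempts matters as much as the backoff: an agent that retries forever is its own kind of runaway process.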

Building AI agents for personal and professional automation on consumer hardware is not just a technical challenge; it's an opportunity to build more private, efficient, and user-controlled systems. It requires a thoughtful approach to engineering, with security and efficiency at its core.

Ready to build your own secure, autonomous AI agents? Explore tools and practices that put control back in your hands. Check out AgentGuard for resources designed to help you develop reliable and private AI automation on your local systems.

Conclusion

The dynamics of the AI world are shifting. From corporate realignments to increasing energy costs and critical security incidents, the environment demands a pragmatic approach to AI agent development. By prioritizing on-device security, optimizing for efficiency, and adopting rigorous engineering practices, we can build a future where AI agents empower us with intelligent automation that is truly ours, operating securely and effectively right where we need it.

Decoding the AI Summer: Building Accountable Agents for the User

Patrick Hughes — Tue, 19 May 2026 14:45:10 +0000

The air in the AI world is thick with change. Recent headlines paint a vivid picture of a technology in flux: colossal legal battles, major corporate reassignments, and even high-level discussions on AI ethics reaching the Vatican. This isn't just a moment of rapid advancement; it's what some are calling the "Long Hot A.I. Summer" - a period demanding vigilance, adaptability, and above all, a renewed focus on the user.

As developers building AI agents on consumer hardware, this climate presents both immense opportunity and significant responsibility. While the giants clash and redirect their vast resources, we have the unique position to craft agents that truly serve, protect, and empower the individual. But how do we build agents that aren't just smart, but also trustworthy and accountable in this fast-evolving environment?

The Shifting Sands of AI Development

Elon Musk's recent lawsuits against OpenAI, and the subsequent "takeaways" from those blockbuster trials, underscore the intense competition and often unpredictable nature of the AI industry. These legal clashes laid bare the commercial interests and philosophical divides at the core of today's AI development. Simultaneously, Meta reassigning 7,000 employees to focus squarely on AI sends a clear message: AI is now center stage for major players, demanding a reorientation of entire corporate structures.

For us, working with agents on local hardware, these shifts highlight a critical advantage: independence. While cloud-based AI can be subject to corporate whims, API changes, and shifting service models, an agent running on your device offers a degree of stability and control that centralized systems simply cannot match. This independence, however, comes with its own imperative: we must ensure these agents operate with the highest ethical standards, directly by and for the user.

Beyond Black Boxes: The Imperative for Ethical Agents

The conversation around AI isn't confined to boardrooms and courtrooms. The news of Anthropic's co-founder joining Pope Leo XIV to present an AI encyclical, "Magnifica Humanitas," signals a global, philosophical engagement with AI's profound implications. It's a call for AI to serve humanity, not just generate profits or enhance surveillance.

Contrast this with situations like Shutterstock's $35 million settlement over hard-to-cancel subscriptions. This isn't an AI story directly, but it's a powerful reminder of how opaque systems and design choices can erode user trust. When a user feels trapped or misled, it damages the entire relationship. This principle applies directly to AI agents: if an agent operates without transparency or clear user control, it risks repeating these same trust-breaking patterns.

Even more concerning are proposals like the FBI's desire to buy nationwide access to license plate readers. This demonstrates the inherent tension between convenience, data collection, and individual privacy. As AI agents become more capable of gathering and acting on data, we, as developers, must be incredibly intentional about safeguarding user privacy and preventing any potential for misuse.

For agents running on consumer hardware, we are the gatekeepers of user trust. Our responsibility is to build agents that are inherently observable, accountable, and designed with user agency at their core. This means moving beyond the idea of an AI agent as a black box and towards a transparent, collaborative partner.

Engineering Observable and User-Centric Agents

How do we translate these ethical imperatives into practical engineering decisions for AI agents on consumer hardware? It starts with a commitment to clarity and control.

Transparent State and Action Logging:
Every significant decision, data interaction, or workflow step an agent performs should be logged locally and made accessible to the user. This isn't just for debugging; it's for building trust. Imagine an agent automating a website interaction, much like the dynamic observation required on a site like clickclickclick.click. Instead of just executing a 'click,' a truly accountable agent logs: "Detected 'Proceed to Checkout' button. Preparing to click. User granted permission for this action at 10:35 AM." This provides an audit trail and insight into the agent's reasoning.
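
One way to sketch such an audit trail is an append-only JSON Lines file that the user can open at any time. The `AuditLog` class, its field names, and the file name are assumptions for illustration.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


class AuditLog:
    """Append-only log: one JSON object per line, stored locally."""

    def __init__(self, path):
        self.path = path

    def record(self, action, reason, approved_by_user):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "reason": reason,
            "approved_by_user": approved_by_user,
        }
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def read_all(self):
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "agent_audit.jsonl")
log = AuditLog(log_path)
log.record("click 'Proceed to Checkout'", "button detected on cart page", True)
entries = log.read_all()
print(entries[0]["action"], entries[0]["approved_by_user"])
```

Storing the reason and the approval flag next to the action is what turns a debug log into something a user can audit.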

User Confirmation and Override Hooks:
For any high-impact action, whether a financial transaction, data deletion, or sending sensitive information, the agent should pause and explicitly request user confirmation. This can be a simple notification on the user's device, providing a moment for review and the option to override.

def confirm_action(action_description: str) -> bool:
    print(f"Agent requires confirmation for: {action_description}")
    user_input = input("Proceed? (yes/no): ").strip().lower()
    return user_input == 'yes'

# Example usage in an agent workflow
if agent.detects_purchase_opportunity():
    if confirm_action("Initiating purchase of item X for $Y"):
        agent.execute_purchase()
    else:
        agent.log("User declined purchase action.")

Granular Permissions and Sandboxing (Locally):
Since we're building on consumer hardware, we have direct control over the execution environment. Design your agents with the principle of least privilege. Grant only the necessary system permissions, and explore OS-level sandboxing features, virtual environments, or containerization to limit the agent's scope and prevent unintended side effects.

User-Adjustable Guardrails:
Empower users to define the boundaries of their agents' behavior. This could involve simple settings like "Only automate tasks between 9 AM and 5 PM" or "Never spend more than $50 without explicit approval." These user-configurable constraints allow individuals to tailor agent autonomy to their comfort level, ensuring the agent remains a tool, not a master.
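
The two example settings above could be expressed as a small, user-editable object that the agent consults before acting. This is a sketch under assumptions: the `Guardrails` class, its field names, and the return shape are invented for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Guardrails:
    active_from: time = time(9, 0)              # "Only automate tasks between 9 AM..."
    active_until: time = time(17, 0)            # "...and 5 PM"
    max_spend_without_approval: float = 50.0    # "Never spend more than $50..."

    def check(self, now, spend_usd=0.0):
        """Return (allowed, reason) for an action proposed at time `now`."""
        if not (self.active_from <= now <= self.active_until):
            return False, "outside allowed hours"
        if spend_usd > self.max_spend_without_approval:
            return False, "needs explicit approval"
        return True, "ok"


rails = Guardrails()
small_purchase = rails.check(time(10, 30), spend_usd=20.0)
large_purchase = rails.check(time(10, 30), spend_usd=75.0)
late_night = rails.check(time(22, 0), spend_usd=5.0)
print(small_purchase, large_purchase, late_night)
```

Returning a reason alongside the decision gives the agent something to log and gives the user something to act on.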

The Path Forward: Trust Through Engineering

The "Long Hot A.I. Summer" is a period of intense growth and significant ethical discourse. For us, building AI agents on local hardware, it's a powerful affirmation of our mission: to create intelligent automation that is not only powerful but also transparent, accountable, and deeply respectful of user agency and privacy. By focusing on observable actions, user confirmation, and controlled environments, we can build a future where AI agents are truly extensions of the user's will, fostering trust through superior engineering.

Want to ensure your agents are not just smart, but trustworthy and user-controlled? Explore tools designed for building secure and observable agents on your hardware. Check out our resources at /tools/agentguard to learn more about protecting user privacy and ensuring ethical agent operation.

BMD HODL devlog - week of 2026-05-17

Patrick Hughes — Tue, 19 May 2026 14:45:07 +0000

From 2026-05-10 through 2026-05-17, the biggest move was getting stricter about proof. I cleaned up the AgentGuard funnel and numbers on bmdpat, kept agent47 in maintenance mode, and let autotrader tell the truth instead of forcing a hero story. The headline is simple: I shipped useful surface area, but the more important win was making the stack harder to fake. AgentGuard still has no real external pull yet. Autotrader still trails passive. That is useful because it keeps me pointed at the real problems.

What shipped

bmdpat

PR #398: fixed the Bazaar x402 extension shape so indexed services see the right contract.
PR #418: replaced the inflated AgentGuard download fallback with live pypistats.
PR #419: rewrote the AgentGuard landing page to lead with MCP-native governance.
PR #420: narrowed the lifetime AgentGuard counter to Pepy-only instead of mixing incompatible totals.
PR #421: fixed blog excerpt metadata.

agent47

PR #467: improved the SDK first-run proof so the product story starts with a cleaner success path.
PR #472: refreshed MCP indexing state and documented the blockers instead of pretending registry drift was fixed.

autotrader

PR #16: added the CBRS post-IPO watchlist update for the paper-only V2 book.
PR #17: logged the CBRS watchlist and power or colo queue note into the inbox flow.
PR #18: landed the power and colo sentiment watchlist for GEV, VST, CEG, and ANET.
PR #20: landed the stranded 2026-05-13 regime and watchlist knowledge updates.

What I learned

stratechery-inference-shift plus tomtunguz-localmaxxing: local inference economics are now part of the product thesis, not side research.
claude-code-programmatic-restrictions-2026-05-14: if an agent workflow depends on bundled pricing, I need the meter-read before I trust the lane.
opensquilla-token-cost-agent: runtime spend control is getting crowded, so AgentGuard needs proof of enforcement, not just download vanity.

Numbers

Autotrader: combined book +6.4% as of 2026-05-16, with -5.1% alpha vs SPY on stocks and -10.3% alpha vs BTC on crypto.
Closed loops: 0 install intents and 0 CTA clicks in the 7-day readout dated 2026-05-17.
AgentGuard PyPI: 483 installs over 7 days in the 2026-05-17 focus review. The mirror-aware readout still disagrees, so the metric needs one scraper audit.

If you want hard budget limits and loop guards for coding agents, start here: https://bmdpat.com/tools/agentguard

I gave an autotrader $360 and 30 days. I am not adding live money yet.

Patrick Hughes — Fri, 15 May 2026 21:09:29 +0000

On May 14 I ran the kill-switch review on the live autotrader.

The decision is simple.

Keep V2 paper-only. Add no new live money. Revisit after the next scorecard.

That is not a dramatic kill. It is the boring version of discipline. The live book can keep being watched, but the next tranche does not go in just because I built the thing.

This is part of BMD HODL, the one-person AI-operated holding company I run nights and weekends. The canon for this quarter says watch first, document everything, and decide from the rule instead of the sunk cost.

Today was the rule.

The setup

Two live accounts. Real money.

Alpaca stocks: $200 deposited
Kraken crypto: $160 deposited
Total live: $360
Compute: about $57 a month on Azure

The strategy is markdown-prompt-driven. Claude reads positions, market context, and a small playbook every morning. It proposes or manages trades inside guardrails.

Paper trading keeps running in parallel as the test bed. Any strategy change has to prove itself on paper before it touches live money.

That separation matters. Live money is where discipline gets tested. Paper is where experiments belong.

The numbers

Latest verified snapshot:

Alpaca equity: $217.97 (+9.0% on $200)
Kraken equity: $169.87 (+6.2% on $160)
Combined: $387.84 (+7.7% on $360)
Net of monthly compute: roughly minus $29
vs SPY over the same window: minus 4.4%
vs BTC over the same window: minus 10.2%

In isolation, +7.7% on $360 looks fine.

After compute, it is negative.

Against passive baselines, it is behind.

That is the whole point of the review. The bot does not get credit for being interesting. It has to beat the boring alternative or earn more time with better evidence.
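
The snapshot arithmetic is easy to reproduce. The sketch below uses only the figures listed above and assumes one month of compute is charged against the gain, which is what the "roughly minus $29" line implies.

```python
deposits = {"alpaca": 200.00, "kraken": 160.00}
equity = {"alpaca": 217.97, "kraken": 169.87}
monthly_compute = 57.00  # approximate Azure bill from the setup section


def pct_return(start, end):
    return (end - start) / start * 100


total_in = sum(deposits.values())
total_now = sum(equity.values())
gain = total_now - total_in
net_after_compute = gain - monthly_compute

print(f"Alpaca:   {pct_return(deposits['alpaca'], equity['alpaca']):+.1f}%")
print(f"Kraken:   {pct_return(deposits['kraken'], equity['kraken']):+.1f}%")
print(f"Combined: {pct_return(total_in, total_now):+.1f}%")
print(f"Net of compute: {net_after_compute:+.2f} USD")
```

At this account size the fixed compute cost is larger than the entire gain, which is why the net figure flips negative even though both accounts are up.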

The rule

The rule was written before the money went in.

If the live book is still net-negative after compute and still lagging both SPY and BTC, no new live tranche goes in.

If the live book is positive after compute or one benchmark has flipped, it can continue to the next tranche.

Today the rule says no new live money.

I am not routing the next $200 into the bot today. I am keeping V2 paper-only and waiting for the next scorecard.
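
Written as code, the rule is a few lines. This is a sketch of the rule as stated in this post, not the actual implementation; the function name is hypothetical, and treating exactly zero as "not yet positive" is an assumption.

```python
def next_tranche_allowed(net_after_compute, alpha_vs_spy, alpha_vs_btc):
    """Continue only if the book is positive after compute,
    or if at least one benchmark has flipped in the book's favor."""
    positive_after_compute = net_after_compute > 0  # exactly zero does not pass (assumption)
    beats_a_benchmark = alpha_vs_spy > 0 or alpha_vs_btc > 0
    return positive_after_compute or beats_a_benchmark


# Figures from the latest verified snapshot
decision = next_tranche_allowed(net_after_compute=-29.16, alpha_vs_spy=-4.4, alpha_vs_btc=-10.2)
print("Add new live money:", decision)
```

The function takes no opinion about how interesting the bot is. That is the point of writing it down first.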

What I am not doing

I am not declaring the system dead.

I am not pretending the result is good enough.

I am not changing the benchmark after the fact.

The live account did make money before compute. That matters. It also lost to the actual alternatives. That matters more.

The useful middle ground is to keep the live book contained, keep the paper system learning, and only promote capital when the scoreboard earns it.

Why this matters

Most builders are good at starting systems and bad at slowing them down.

Agents make that worse. Once a process runs on a schedule, it starts to feel alive. It produces logs. It writes reports. It gives you reasons to keep watching.

That is exactly why the rule has to exist before the result.

The rule is not there to punish the agent. It is there to protect the operator from narrative drift.

This autotrader is useful if it teaches me how to run capital with agents without getting high on my own software.

Today it taught the right lesson.

No new live money without better evidence.

The next scorecard

The next review is not vibes.

I want to see:

Net result after compute
Combined book versus SPY and BTC
Paper V2 hit rate
Cash drag
Whether the strategy is learning from misses or just writing prettier reports

If those improve, I can add capital later.

If they do not, the live book stays capped and the next dollar goes somewhere boring.

That is not a failure. That is the system working.

If you are running agents near money, customers, or production, write the kill-switch before the run starts. That is what I built AgentGuard for. Budget caps, category caps, and breach hooks around agent loops.

Write the rule first. Code the rule next. Then let the result tell you what to do.

An AI Agent in Sweden Ordered 6,000 Napkins. Here's the 12 Lines of Python That Would Have Stopped It.

Patrick Hughes — Fri, 15 May 2026 21:08:20 +0000

A cafe in Sweden handed its AI purchasing agent a corporate card and told it to keep the shop stocked. Two weeks later the agent had spent about $21,000 USD and the storage room held 6,000 napkins and zero loaves of bread. The AP picked it up on May 13. Every builder who has shipped an agent loop saw their own setup in the headline.

Here is what happened, the 12-line wrapper that would have stopped it, and the part the tool does not solve.

What the cafe actually did wrong

The owner wired a model up to a supplier ordering API. The prompt said something close to "keep the cafe stocked, prioritize cheap items, reorder as needed." There was no per-category cap. No daily dollar cap. No anomaly check on quantity. No human review on orders over a threshold.

The agent did exactly what the prompt rewarded. Napkins were cheap per unit. The reorder logic had no memory of prior orders inside the same window. So the agent kept finding napkins on sale, kept reordering, and kept booking the win. Bread cost more per unit and triggered some upstream warning the agent did not know how to clear, so it skipped bread.

Two weeks of compounding the same decision. $21K gone. The cafe owner said the agent was "doing its job."

The four-bullet root cause

No dollar budget on the agent process itself.
No per-category cap, so napkins could absorb the entire budget.
No anomaly trigger when the same SKU got reordered N times in a window.
No kill switch tied to spend velocity. The bill only surfaced at month end.

Any single one of those guardrails contains the incident. Two of them prevent it.
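
The third bullet, the anomaly trigger, is the least obvious to build, so here is a minimal sketch of one. The `ReorderMonitor` class, its thresholds, and the use of plain second counts as timestamps are assumptions for illustration; it is not taken from any existing tool.

```python
from collections import deque


class ReorderMonitor:
    """Refuses a SKU that is reordered too many times within a sliding time window."""

    def __init__(self, max_orders=2, window_seconds=24 * 3600):
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._history = {}  # sku -> deque of accepted order timestamps (seconds)

    def allow(self, sku, timestamp):
        orders = self._history.setdefault(sku, deque())
        # Drop orders that have aged out of the window
        while orders and timestamp - orders[0] >= self.window_seconds:
            orders.popleft()
        if len(orders) >= self.max_orders:
            return False  # fail closed: the order is refused, not just logged
        orders.append(timestamp)
        return True


monitor = ReorderMonitor(max_orders=2, window_seconds=3600)
napkin_results = [monitor.allow("napkins", t) for t in (0, 600, 1200)]
bread_result = monitor.allow("bread", 1200)
napkins_after_window = monitor.allow("napkins", 3600)
print(napkin_results, bread_result, napkins_after_window)
```

This gives the reorder logic the memory it lacked: each SKU carries its own recent history, so a third napkin order inside the window is refused while bread is unaffected.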

The 12 lines that stop this

This is AgentGuard, the runtime budget wrapper I maintain. The shape is the point, not the brand. Any equivalent works.

from agentguard47 import AgentGuard

guard = AgentGuard(
    daily_usd_cap=200,
    per_category_caps={"napkins": 20, "paper_goods": 40},
    rate_limit_per_minute=10,
    on_breach="kill_process",
    alert_webhook="https://hooks.slack.com/...",
)

with guard.session("cafe-purchasing"):
    agent.run()

Twelve lines. Here is what each line buys you in the Sweden scenario:

daily_usd_cap=200 ends the process the moment cumulative spend that day hits $200. The cafe burned about $1,500 per day on average. The wrapper kills the loop on day one, hour two.
per_category_caps={"napkins": 20, ...} is the line that specifically prevents this exact failure mode. Napkins cannot consume more than $20 of the daily budget. The third reorder fails closed.
rate_limit_per_minute=10 catches the runaway loop pattern where the agent keeps retrying the same call.
on_breach="kill_process" is the part most builders skip. Logging a warning and continuing is not a guardrail. Killing the process is.
alert_webhook means you find out in Slack on day one, not on the credit card statement on day thirty.
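If you want to see the shape without any library, here is a minimal sketch of a daily cap plus per-category caps. It is not AgentGuard's implementation. `SpendGuard`, `BudgetBreach`, and `charge` are hypothetical names, and a real guard would also reset at the day boundary and fire the alert.

```python
class BudgetBreach(Exception):
    """Raised before a purchase that would exceed a cap."""


class SpendGuard:
    """Hypothetical in-process guard: check caps before money moves."""

    def __init__(self, daily_usd_cap, per_category_caps=None):
        self.daily_usd_cap = daily_usd_cap
        self.per_category_caps = dict(per_category_caps or {})
        self.spent_total = 0.0
        self.spent_by_category = {}

    def charge(self, category, usd):
        """Record a purchase, or raise BudgetBreach and record nothing."""
        new_total = self.spent_total + usd
        new_cat = self.spent_by_category.get(category, 0.0) + usd
        cap = self.per_category_caps.get(category)
        if cap is not None and new_cat > cap:
            raise BudgetBreach(f"{category} cap ${cap} exceeded")
        if new_total > self.daily_usd_cap:
            raise BudgetBreach(f"daily cap ${self.daily_usd_cap} exceeded")
        self.spent_total = new_total
        self.spent_by_category[category] = new_cat


guard = SpendGuard(daily_usd_cap=200, per_category_caps={"napkins": 20})
guard.charge("napkins", 8)   # allowed: napkins at $8
guard.charge("napkins", 8)   # allowed: napkins at $16
try:
    guard.charge("napkins", 8)  # would reach $24, over the $20 cap
except BudgetBreach as breach:
    print("blocked:", breach)
```

The check runs before the spend is recorded, so a breach leaves the totals untouched. That is what failing closed means here.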

The cafe owner does not need an AI safety team. He needs twelve lines of Python and a webhook URL.

What this does not solve

Be honest about the gap. Runtime budget rails are one layer. The cafe still has open problems even with the wrapper in place:

Bad supplier choice logic. The agent picked napkins because the prompt rewarded cheap-per-unit. The wrapper does not fix the model's reasoning. That is a prompt and tool-design problem.
No human review on irreversible orders. Supplier orders are mostly non-cancellable once placed. The wrapper kills future orders but does not undo the ones already in flight. Human review on any order over $X is a separate layer.
Vendor lock-in to the model's biases. If the model has been trained to prefer certain brands or categories, the budget cap just rations the bad decision. It does not improve the decision.
The agent does not know it is wrong. Inside the loop it is hitting the reward signal it was given. The wrapper is the external referee. Agents cannot referee themselves.

This is the part coverage of the Sweden story is going to get wrong. People will say "the AI made a mistake" or "the AI was too aggressive." Neither is true. The AI did the cheapest possible thing inside the prompt it was given. The mistake was shipping the loop without an external referee.

The pattern to steal

If your agent has a credit card, a database password, an SSH key, or any other action surface where each call costs real money or causes real change, treat it like a junior employee with a corporate card. You would give the junior a per-category limit. You would set up a daily report. You would put a manager review on anything over a threshold. Same rules for the agent.
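The manager-review rule translates almost directly into code. A rough sketch, assuming a hypothetical `ReviewGate` that sits between the agent and whatever function actually places the order:

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    sku: str
    usd: float


@dataclass
class ReviewGate:
    """Hypothetical gate: small orders pass, large ones wait for a human."""

    review_over_usd: float
    pending: list = field(default_factory=list)

    def submit(self, order, place_order):
        """Place the order now, or queue it and return False."""
        if order.usd > self.review_over_usd:
            self.pending.append(order)
            return False
        place_order(order)
        return True

    def approve(self, order, place_order):
        """Called by a person: release one queued order."""
        self.pending.remove(order)
        place_order(order)
```

The agent only ever calls `submit`. `approve` belongs to a human-facing surface, so the agent cannot wave its own orders through.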

The order of operations matters too. Most builders write the prompt first, ship the loop, watch the bill, then add guardrails. Reverse it. The wrapper is line one of the agent. The prompt is line two.

What we ship in agent47

The agent47 repo keeps a Real Incidents log. PocketOS losing prod was the first entry. Sweden napkins is the second. Pattern matters more than the punchline. In both cases the agent did what the loop rewarded and there was no external layer to say no.

If you want the runtime spend layer, AgentGuard is one pip install and the snippet above is the whole API. It will not turn a bad prompt into a good one. It will stop a bad prompt from costing $21,000.

Get AgentGuard

AI software runs on 17% margins. SaaS runs on 70%. The token bill is the problem.

Patrick Hughes — Fri, 15 May 2026 21:08:17 +0000

A new analysis from Gptomics put a number on something every AI founder has been feeling. AI-native software businesses are running at about 17% gross margins. Traditional SaaS sits near 70%. The gap is the token bill.

If you ship an AI product and your COGS line keeps creeping, this is why. You did not misprice on purpose. You repriced without noticing.

Where the margin actually went

A SaaS request costs you a few CPU cycles, some bandwidth, and a database read. Pennies on the thousand.

An AI request costs you tokens. And it is rarely one request.

One user message becomes 3 to 12 model calls once you add retrieval, tool use, and a planner.
Retries on rate-limit or 5xx errors double the bill on a bad day.
Evals and guardrails run their own model calls on every turn.
Memory and context grow, so input tokens grow, so every subsequent call gets more expensive.
Long-running agents loop. A single stuck agent can burn $40 in an afternoon before anyone notices.
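The fan-out is easier to feel with arithmetic. A back-of-envelope sketch: every number below is a placeholder I made up to show the shape of the curve, not a measured price or token count.

```python
def cost_per_message(calls, input_tokens, output_tokens, growth_per_call,
                     usd_per_m_input, usd_per_m_output):
    """Estimate the cost of one user message that fans out into `calls` model calls.

    Each call after the first carries `growth_per_call` extra input tokens,
    a stand-in for accumulated context and tool results.
    """
    total = 0.0
    for i in range(calls):
        tokens_in = input_tokens + i * growth_per_call
        total += tokens_in / 1e6 * usd_per_m_input
        total += output_tokens / 1e6 * usd_per_m_output
    return total


# Placeholder numbers, chosen only to show the shape of the curve.
single = cost_per_message(1, 2_000, 500, 0, 3.0, 15.0)
fanned = cost_per_message(12, 2_000, 500, 1_500, 3.0, 15.0)
print(f"1 call: ${single:.4f}   12 calls: ${fanned:.4f}   ratio: {fanned / single:.1f}x")
```

Twelve calls cost more than twelve times one call, because the input side grows with every step.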

You priced the product like a SaaS app. You are operating it like a call center where every minute on the phone is metered.

The three founder mistakes that lock you at 17%

I have looked at a lot of AI agent deployments in the last year. The same three holes show up.

1. No hard cost cap per user, per tenant, or per session.

If a single power user can spend $200 in a week on a $29 subscription, you are not running a SaaS business. You are running an unhedged short on token prices. The fix is a budget at the entity level, enforced before the model call, not in a dashboard you check on Monday.

2. No model fallback ladder.

Every call goes to your best model. Most of those calls did not need it. A two-step ladder of cheap-first, escalate-on-failure cuts 40 to 70% of token spend on the routes I have actually measured. The work is not glamorous. The savings are.

3. No per-tenant telemetry on token spend.

You know revenue per customer. You do not know cost per customer. So when a whale starts costing you more than they pay, you find out at quarter close. By then it has been three months.

These three holes are how a 70% margin product becomes a 17% margin product without anyone shipping a bad decision. Each one is a small omission that compounds.
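The ladder from the second hole is the easiest of the three to sketch. A minimal version, assuming you already have one callable per model and some eval that says whether an output is good enough. The function name and the stand-in models are mine, for illustration only.

```python
def run_with_ladder(prompt, models, passes_eval):
    """Try models cheapest-first; return (model_name, output) for the first pass.

    `models` is an ordered list of (name, callable) pairs, cheapest first.
    `passes_eval` decides whether an output is good enough to return.
    If nothing passes, the last (strongest) model's output is returned.
    """
    if not models:
        raise ValueError("need at least one model")
    for name, call in models:
        output = call(prompt)
        if passes_eval(output):
            return name, output
    return name, output


# Stand-in models so the sketch runs without any provider SDK.
def cheap(prompt):
    return "" if "hard" in prompt else f"cheap:{prompt}"


def strong(prompt):
    return f"strong:{prompt}"


ladder = [("cheap", cheap), ("strong", strong)]
non_empty = lambda out: bool(out)
```

The savings depend entirely on how often the cheap rung passes, so log which rung answered each call.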

What 30%+ margins look like

You are not getting back to SaaS 70%. The token bill is real. But 30 to 45% is doable, and that is the difference between a company and a science project.

The pattern that works:

Budget caps at every layer. Per user, per workspace, per route. Hard stops, not warnings. When the cap hits, the request gets a graceful degraded response, not a $14 invoice.
A fallback ladder. Cheap model first. Escalate only when the cheap model fails an eval or the user retries. Default to the floor, not the ceiling.
Token telemetry per tenant. Every call tagged with user_id, tenant_id, route, model. Cost-per-customer becomes a number on a dashboard, not a quarterly surprise.
Loop detection. Any agent that calls the model more than N times for one task gets killed. Stuck agents are the single biggest blow-up risk on a token bill.
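Loop detection in its simplest form is a counter around the model call. A sketch under my own naming (`CallCounter` and `LoopLimitExceeded` are hypothetical); create one counter per task so the limit resets between tasks.

```python
class LoopLimitExceeded(Exception):
    """Raised when one task makes more model calls than allowed."""


class CallCounter:
    """Hypothetical per-task counter wrapped around a model-call function."""

    def __init__(self, call_model, max_calls):
        self.call_model = call_model
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, *args, **kwargs):
        if self.calls >= self.max_calls:
            raise LoopLimitExceeded(f"task exceeded {self.max_calls} model calls")
        self.calls += 1
        return self.call_model(*args, **kwargs)
```

Hand the agent the wrapped callable instead of the raw client. The stuck agent hits the exception, not your invoice.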

You can build this yourself. Most teams do, badly, after the first surprise invoice. Or you can drop in something that already does it.

What I built

I wrote AgentGuard for exactly this. It is a Python SDK that wraps your model calls and enforces budgets, fallback, and telemetry at the call site. No new infra. No proxy server. Pip install and add a decorator.

pip install agentguard47

It is the boring infrastructure layer the AI stack still does not have a default for. If you are sitting at 17% margins and trying to figure out where the leak is, start here.

Go to AgentGuard

Enterprise AI just shifted: Claude +128%, OpenAI -8%. What it means if you're building.

Patrick Hughes — Fri, 15 May 2026 14:45:14 +0000

SaaStr published Q2 enterprise AI usage numbers this week. The shape:

Claude: +128%
Gemini: +48%
OpenAI: -8%
Grok: rounding error

That is the cleanest single-quarter share shift I have seen in this space all year. And the obvious read is wrong.

The lazy take

The lazy take is "Anthropic won, switch to Claude." If you ship that take, you are the same person who told their team to standardize on OpenAI 18 months ago. The whole point of the chart is that single-vendor positions move 100+ points in 90 days now.

The data is not telling you which model to pick. It is telling you that picking is a recurring decision, not a one-time one.

What is actually driving the shift

Three things, near as I can tell from talking to builders shipping agents in production:

Coding agents. Claude Code and the Sonnet line ate the developer market. Once a developer is in Claude all day for code, they tend to reach for the same API for their app's agent calls. Developer mindshare leaks into procurement.
Agentic retention. Long-horizon tool-use tasks reward models that follow instructions and recover from errors. Teams that built real agentic workflows on Claude 3.7 and 4 stuck around.
OpenAI cycle gap. GPT-5 landed but did not produce a Claude-Code-tier shift inside engineering orgs. Distribution from ChatGPT is consumer, not enterprise API usage.

None of these are permanent. Gemini 3 is coming. OpenAI ships something every six weeks. The chart will look different in October.

What this means if you are building

If you are building an agent or AI feature today, the share data is a forcing function. Three concrete moves:

1. Put a model abstraction layer in front of every call. Not a 400-line framework. Just one function in your codebase that takes a prompt and a job type and decides which model and which provider. The function reads from config, not from the call site. When the next chart flips, you change one file.

2. Wrap every agent in a budget. Cost per task varies 5x between providers and 10x inside a single provider's tier list. Without a cap, a model switch can blow your unit economics overnight. This is exactly what AgentGuard does. Install it, set a per-task ceiling in dollars, the agent stops when it hits the cap. Two lines of Python.

3. Run a real eval before you migrate. "Claude is better" is not a procurement decision. Pick your 20 hardest production tasks, run them through three models, score the outputs. The eval becomes a regression suite the next time you re-evaluate. Most teams never build this and that is why they re-platform every nine months on vibes.
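Move 1 really is one function. A minimal sketch, assuming a routing table that would normally live in a config file. The provider names, model names, and the `complete` signature are placeholders, not any vendor's API.

```python
# Hypothetical routing table. In practice this would be loaded from a config
# file; the provider and model names here are placeholders.
ROUTES = {
    "classify": {"provider": "local", "model": "small-model"},
    "plan": {"provider": "cloud_a", "model": "large-model"},
    "default": {"provider": "cloud_b", "model": "mid-model"},
}


def complete(prompt, job_type, providers, routes=ROUTES):
    """Single entry point for model calls: job type in, completion out.

    `providers` maps a provider name to a callable(model, prompt) -> str,
    so each vendor SDK is wrapped exactly once.
    """
    route = routes.get(job_type, routes["default"])
    call = providers[route["provider"]]
    return call(route["model"], prompt)
```

Call sites pass a job type and never name a model. When the chart flips, the diff is in `ROUTES`.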

The deeper pattern

Every share chart in this space is going to whipsaw for at least another two years. The infrastructure decision is not "which model." It is "how fast can I switch which model." Teams that hard-code one provider into prompts, retry logic, observability, and billing are paying a re-platforming tax every other quarter.

The teams that compound are the ones treating the model as a hot-swappable component. Eval suite, abstraction layer, cost cap, done. Then read the next share chart and move on with your day.
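The eval suite does not need a framework either. A bare-bones sketch, assuming each model is wrapped as a plain callable and you bring your own scorer. `run_eval` and `exact_match` are illustrative names.

```python
def run_eval(tasks, models, score):
    """Return the mean score per model over the same task set.

    tasks:  list of (prompt, expected) pairs
    models: dict of name -> callable(prompt) -> output
    score:  callable(output, expected) -> float in [0, 1]
    """
    if not tasks:
        raise ValueError("need at least one task")
    results = {}
    for name, call in models.items():
        scores = [score(call(prompt), expected) for prompt, expected in tasks]
        results[name] = sum(scores) / len(scores)
    return results


def exact_match(output, expected):
    """Simplest possible scorer; real tasks usually need something richer."""
    return 1.0 if output.strip() == expected.strip() else 0.0
```

Keep the task list in version control. The same file is your regression suite the next time the share chart moves.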

If you want the cost-cap part for free: pip install agentguard47. AgentGuard is a 2-line runtime budget guard for agents. It does the cap, the token limit, the rate limit. Use it, do not use it, but do not ship an agent without one.

Localmaxxing isn't theory. Here's what my 3-GPU rig actually does.

Patrick Hughes — Fri, 15 May 2026 14:45:10 +0000

Tom Tunguz wrote a post this week called Localmaxxing. His thesis: open-weight models on prosumer hardware now match cloud-tier quality for a sliver of the cost. The gap closed. The math flipped.

I've been running this setup for months. RTX 3070, RTX 5070 Ti, RTX 5090, all in one tower, serving Llama 3.1 8B through llama.cpp. So let me skip the thesis and put real numbers on the table.

The rig

One Threadripper box. Three GPUs. 56GB of total VRAM if you stack them (8GB on the 3070, 16GB on the 5070 Ti, 32GB on the 5090), though I don't pool them for a single 8B model. I run Llama 3.1 8B in Q5_K_M quant. That fits comfortably on the 5090 alone with room to spare for a 32k context window.

The 3070 and 5070 Ti run smaller models in parallel for different agent jobs. Embeddings on one, a 3B classifier on the other. The 5090 is the workhorse.

Tokens per second on Llama 3.1 8B

On the 5090, Q5_K_M, single batch, no flash-attention tweaks beyond defaults:

Prompt processing: ~3,200 tok/s
Generation: ~140 tok/s sustained

For comparison, Claude Opus and GPT-4-class APIs land around 30-80 tok/s on generation depending on load. My local 8B is faster than the frontier cloud APIs for raw throughput. It's a smaller model, so output quality is lower for hard reasoning. For 80% of agent work (classify, extract, summarize, route, format), it's plenty.

Cost per million tokens

Cloud reference points:

GPT-4o: ~$5 input / $15 output per million
Claude Sonnet 4.5: ~$3 input / $15 output per million
Llama 3.1 8B on Together / Fireworks: ~$0.20 per million blended

My local cost, including amortized hardware and Texas electricity at $0.11/kWh:

The 5090 pulls about 400W under sustained inference. At 140 tok/s, one hour of generation produces 504,000 tokens for 0.4 kWh, or about 4.4 cents. That's $0.087 per million output tokens. Round it to 9 cents.

Hardware amortization is the bigger line. Call it $2,200 for the 5090 over 3 years of mixed use. If the card pulls 1,000 hours of inference per year, that's $0.73 per hour, or about $1.45 per million tokens.

Total all-in: roughly $1.55 per million output tokens on local, versus $15 on Claude Sonnet for the same job class. Ten times cheaper.
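If you want to rerun that math with your own wattage, power price, or duty cycle, here it is as a function. The inputs in the example call are the ones from this post; the function name is just a label I picked.

```python
def local_cost_per_million(watts, tok_per_s, usd_per_kwh,
                           hardware_usd, years, hours_per_year):
    """Return (electricity, amortization) in USD per million generated tokens."""
    tokens_per_hour = tok_per_s * 3600
    electricity_per_hour = watts / 1000 * usd_per_kwh
    amortization_per_hour = hardware_usd / (years * hours_per_year)
    per_million = 1_000_000 / tokens_per_hour
    return electricity_per_hour * per_million, amortization_per_hour * per_million


electricity, amortization = local_cost_per_million(
    watts=400, tok_per_s=140, usd_per_kwh=0.11,
    hardware_usd=2200, years=3, hours_per_year=1000,
)
print(f"electricity ${electricity:.3f}, amortization ${amortization:.2f}, "
      f"total ${electricity + amortization:.2f} per million tokens")
```

Halve `hours_per_year` and the amortization line doubles. Utilization, not electricity, is what moves the total.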

Caveat: I'm comparing an 8B model to frontier models. Apples to small oranges. But for the agent jobs where 8B is good enough, the math is settled.

Where local wins, where it doesn't

Wins:

High-volume classification and extraction
Anything privacy-sensitive (client data, medical, financial)
Latency-sensitive interactive flows (no network round trip)
Burst workloads that would smash cloud rate limits

Loses:

Hard reasoning, multi-step planning, code generation at frontier quality
Anything where you actually need the model's knowledge depth
Workloads with idle gaps where the GPU sits dark and you eat the depreciation anyway

The right move for most builders right now is hybrid. Cloud frontier for the hard 20%. Local 8B or 14B for the routine 80%. Route between them based on task class.

What Tunguz is actually saying

His VC framing matters. When Tunguz posts about local LLMs, every CTO who reads his Sunday digest just got cover to take this seriously. The conversation moved from "Patrick's weird hobby rig" to "tier-1 VC thesis" in one blog post.

If you've been waiting for permission to test a local-first or hybrid architecture, this is it.

What this means for cost-controlled agents

I built AgentGuard because cost is the thing that kills agent projects in production. Local LLMs don't make cost discipline optional. They make it more important, because now you have three cost dimensions (cloud spend, local electricity, local hardware amortization) instead of one.

The same AgentGuard policies that cap your cloud budget should cap your local inference budget too. A runaway loop on a local model still burns wattage, still keeps your GPU at 90C, still pegs your CPU. Free at the margin doesn't mean free in practice.

If you want to dig deeper into the consumer-GPU production setup, I wrote about running local LLM inference on consumer GPUs earlier this year. That post covers the stack choices, the quant tradeoffs, and the model-routing logic I use.

Bottom line

Localmaxxing is real. The numbers are real. The hardware is in stock. The tools are stable.

If you're building agents and your cloud bill is climbing, the answer might not be a better prompt or a cheaper model tier. It might be a $2,000 GPU and a weekend with llama.cpp.

Then put a budget on it. AgentGuard handles that part.

BMD HODL devlog - week of 2026-04-26

Patrick Hughes — Thu, 14 May 2026 14:45:13 +0000

This week I stopped pretending I needed to hand-pick every move. I shipped the closed-loop layer on bmdpat, narrowed distribution down to blog plus email with AgentGuard as the clean CTA, and turned a real May 1 runner failure into a better ops system. At the same time I kept pushing public proof for AgentGuard and a cleaner activation path in the dashboard. The good part is that the signals are now measurable. The bad part is that the weak spots are measurable too.

What shipped

bmdpat

PR #240: cleaned up /audit so the page dropped the pink flood and kept the AgentGuard CTA in accent lime.
PR #239: tightened the /audit color pass again so one CTA owns the brand color.
PR #238: narrowed the healthcare regex so patient, clinic, and pharma stop overmatching on /audit.
PR #235: added the AgentGuard roadmap funnel.
PR #234: fixed the exit-intent modal so it closes on route change.
PR #233: shipped the self-serve AI agent roadmap generator as Product #2.
PR #232: shipped the registry-driven catalog plus the ⌘K palette.
PR #231: added the /agent-architect landing page.
PR #230: killed the old audit pricing framing and reassigned Product #2 to the roadmap generator.
PR #228: logged the Above the API Line ship in the inbox trail.
PR #227: shipped AgentPay, the YC demo with hard USDC spend limits.
PR #226: remapped the demo slate to the real YC Summer 2026 RFS list.
PR #225: shipped Above the API Line.
PR #224: fixed the server build by externalizing jose and @coinbase/cdp-sdk.
PR #223: shipped the Company Brain YC demo.
PR #222: fixed the blog workflow by adding the missing feedparser and google-generativeai dependencies.
PR #221: shipped HeatCheck.
PR #220: wired closed-loop tracking into DroneEar.
PR #219: shipped OrbitBrief.
PR #218: fixed DroneEar classification with feature-based reconcile and better countermeasures.
PR #217: shipped the DroneEar acoustic drone detection demo.
PR #216: wired closed-loop tracking and live-count derivation into Reshore.
PR #215: logged the Cite-or-Lie polish pass in the inbox trail.
PR #214: logged the Reshore ship in the inbox trail.
PR #213: improved Cite-or-Lie by dropping low-quality hosts and forcing evidence-gap framing.
PR #212: wired all three YC demos into closed-loop tracking.
PR #211: added the Stripe Link rail beside USDC and x402 on /memory.
PR #210: shipped Reshore, the YC demo for US-vs-China BoM cost deltas.
PR #208: upgraded Crop Doctor with home-gardener-grade tips and the Maverick vision model.
PR #207: logged the Cite-or-Lie demo in the inbox trail.
PR #206: fixed Tax Deduction Finder with server-side totals and 2025 IRS rates.
PR #205: shipped Cite-or-Lie.
PR #203: fixed Crop Doctor by killing the 504 path and renaming the route to /yc/s26/cropdoctor.
PR #202: shipped AI Tax Deduction Finder.
PR #201: shipped the first-party closed-loops foundation.
PR #200: shipped Crop Doctor and the YC RFS landing.
PR #195: shipped the AI disaster wall at /ai-fails.
PR #193: repositioned /memory to lead with the spaceship, not the engine.
PR #192: tightened numbered-list spacing on /memory and /memory/demo.
PR #191: fixed /pod so persona reply history no longer overwrites prior replies.

agent47

PR #423: pointed demo users to quickstart activation.
PR #422: switched PyPI releases to Trusted Publishing.
PR #420: improved repo trust and OSS onboarding.
PR #419: polished the README for GitHub discovery.
PR #415: tightened the activation proof path.
PR #413: cleared the completed follow-up queue.
PR #412: handled missing release discussion categories.
PR #408: documented the opt-in activation metrics design.
PR #407: logged PR #406 and updated the follow-up trail.
PR #406: refreshed the SDK ops docs.
PR #405: logged the PR #404 handoff.
PR #404: added coding-agent review-loop proof.
PR #403: logged the queue README datapoint PR.
PR #402: added the Uber AI budget datapoint to the README.
PR #400: added the AI contribution policy section.
PR #391: released v1.2.9.
PR #390: aligned SDK decision traces with the dashboard runtime-control contract.

agent47-dashboard

PR #144: added the distribution partner playbook.
PR #143: added readiness recommended actions.
PR #142: added alert delivery health.
PR #141: added share-demo loop metrics.
PR #139: pushed first-trace users toward their first control.
PR #138: added the CrewAI live topology view.
PR #137: hardened release-operator enforcement.
PR #136: added the pilot feedback loop.
PR #135: protected preview dashboard routes.
PR #134: added the retained activation funnel.
PR #133: added the pilot response ops loop.
PR #132: added the pilot intake landing loop.
PR #131: added activation telemetry and release-operator dogfooding.
PR #130: cleared the dashboard security audit queue.

autotrader

No merged PRs this week. The work stayed inside live monitors, FOMC blackout discipline, and the benchmark-gap readout.

What I learned

2026-04-30-openrouter-opus-47-tokenizer-cost: pricing drift is product risk now. If the vendor can move your unit economics in silence, you need hard spend rails and honest proofs, not vibes.
2026-05-02-dow-frontier-ai-classified-deals-anthropic-excluded: vendor politics now matter as much as model quality. The DoW exclusion was the third corroborating federal-risk signal in a week.
HoldcoBrain was a good same-day kill. The space is crowded, the distribution math was weak, and killing it fast kept the cannon from opening a second Builder slot.

Numbers

Autotrader closed the week up 6.9%, with SPY alpha -2.7% on stocks and BTC alpha -8.8% on crypto. The kill-switch eval stays set for 2026-05-14.
Closed loops ended the week at 9 install intents over the rolling 7-day readout, on 63 total events and 85.71% CTA-to-install session conversion.
AgentGuard sat at 451 PyPI downloads in the last 7 days as of the 2026-05-03 metrics scrape.

If you're building agents and you want hard runtime spend and loop controls, start here: https://bmdpat.com/tools/agentguard

BMD HODL devlog - week of 2026-05-03

Patrick Hughes — Thu, 14 May 2026 14:45:10 +0000

The biggest move this week was tightening the AgentGuard release path across the site, the public SDK repo, and the dashboard at the same time. I shipped proof-heavy docs, cleaned up release surfaces, pushed performance fixes on bmdpat, and made the dashboard release operator stricter. I also kept autotrader honest. The book is still trailing passive benchmarks, so this week was about removing loose edges and making the system easier to trust.

What shipped

bmdpat

PR #397: prebuilt /blog/[slug] to cut blog FCP.
PR #396: deferred non-critical home page client components off the initial bundle.
PR #395: added owner notification on first subscriber signup.
PR #394: installed Vercel Speed Insights.
PR #393: updated the AgentGuard landing page MCP setup.
PR #391: tightened /api/blog input validation with a strict Zod schema.
PR #390: promoted the newsletter to the primary blog CTA and added blog view tracking.
PR #389: drafted AgentGuard launch materials.
PR #388: added an AgentGuard release checklist.
PR #387: added AgentGuard quickstart proof.
PR #386: fixed AgentGuard funnel accuracy.
PR #242: sharpened the homepage trust pass.
PR #241: sharpened the homepage around AgentGuard.
PR #209: shipped the "Future of Programming" YC demo.

agent47

PR #465: documented managed-agent threat and cost surfaces.
PR #461: published AgentGuard skill distribution docs.
PR #459: added release cadence docs.
PR #458: cleaned root docs and improved query_traces metadata.
PR #457: added Glama badges and cleaned root docs.
PR #456: improved Glama metadata quality.
PR #455: recorded the first Glama MCP release.
PR #454: made the budget MCP entrypoint dogfoodable.
PR #452: bumped hono in mcp-server.
PR #449: bumped ip-address and express-rate-limit in mcp-server.
PR #447: added the PocketOS incident to the README "Real Incidents" section.
PR #444: published typed contracts for the public SDK surface.
PR #442: added the deployed-agent guard profile.
PR #440: added the local-first agentguard-mcp budget server.
PR #438: fixed release hygiene docs.
PR #437: guarded hosted dashboard handoff copy.
PR #436: added MCP proof gallery coverage.
PR #435: added first-run CLI fallback guidance.
PR #432: added the sticky agent proof fixture.
PR #430: added the optional MCP npm release guard.
PR #429: fixed MCP package release consistency.
PR #427: added the dashboard handoff guide.
PR #426: clarified the incident dashboard handoff.
PR #424: added the optional Pydantic AI starter recipe.
PR #411: bumped build in GitHub requirements.
PR #410: bumped github/codeql-action.
PR #388: bumped bandit.
PR #386: bumped ruff.

agent47-dashboard

PR #158: recovered from closed dashboard connections.
PR #157: fixed static quickstart navigation.
PR #156: split Release Operator MCP and ingest keys.
PR #155: made Release Operator MCP dogfood strict.
PR #154: dogfooded MCP servers in the release operator.
PR #153: added AgentGuard MCP dashboard onboarding.
PR #151: fixed the hosted trial readiness path.
PR #150: added the release distribution packet.
PR #149: hardened the SDK proof contract fixture.
PR #147: added the AgentGuard release train plan.
PR #146: added the public CrewAI proof route.
PR #145: added distribution post assets.

autotrader

PR #15: failed closed on the stale Thursday eval gate.
PR #11: fixed idle cash deployment discipline.

What I learned

2026-05-08-claude-code-cve-2026-39861-sandbox-escape: trust has to fail closed. If a coding agent can keep write power after the trust boundary shifts, the bug is structural.
2026-05-07-computer-use-45x-more-expensive-than-apis: agent economics matter more than demos. The wrong interface can wreck margins before the product has a chance.
The PocketOS incident still paid rent this week. Real incidents sharpen runtime-safety docs faster than abstract best practices do.

Numbers

Autotrader: combined book +7.06% all time, trailing SPY by 6.32 points and BTC HODL by 8.43 points. Weekly alpha vs BTC was about -1.5 points.
Closed loops: 2 install intents on 7 CTA clicks across the 2026-05-03 through 2026-05-09 window.
AgentGuard PyPI: 297 downloads over the last 7 days as of the 2026-05-10 metrics scrape.

If you want the thing I am building in public, start here: https://bmdpat.com/tools/agentguard

GGUF Quantization Explained: Q4_K_M vs Q5_K_M vs Q8 — Which to Pick (2026)

Patrick Hughes — Wed, 13 May 2026 14:00:08 +0000

If you're running local LLMs with llama.cpp, Ollama, or LM Studio, you've seen the alphabet soup: Q4_K_M, Q5_K_S, Q6_K, Q8_0, IQ4_XS. Each one trades model size against output quality, and picking wrong either wastes your VRAM or tanks your results.

This guide cuts through the noise. We benchmarked every common quantization level and measured the actual accuracy tradeoffs so you can pick the right one for your hardware.

What Is GGUF Quantization?

A full-precision LLM stores every weight as a 16-bit floating point number (FP16). A 7B parameter model at FP16 weighs ~14 GB. Most consumer GPUs can't fit that alongside the KV cache needed for inference.

Quantization compresses those weights to lower precision — 8-bit, 5-bit, even 4-bit — dramatically shrinking the model. A Q4_K_M version of that same 7B model fits in ~4.4 GB, making it runnable on an 8 GB GPU with room for context.
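The size figures follow from simple arithmetic: parameters times bits per weight, divided by eight. A small sketch using decimal gigabytes; the helper names are ours, and real files vary a little with tokenizer and metadata overhead.

```python
def effective_bits_per_weight(file_size_gb, params_billion):
    """Bits stored per parameter, using decimal GB (1 GB = 1e9 bytes)."""
    return file_size_gb * 8 / params_billion


def size_gb(params_billion, bits_per_weight):
    """Approximate file size in GB for a given parameter count and precision."""
    return params_billion * bits_per_weight / 8


print(size_gb(7, 16))                      # FP16 baseline for a 7B model
print(effective_bits_per_weight(4.4, 7))   # what a ~4.4 GB Q4_K_M file implies
```

The second number lands at about 5 bits rather than 4. K-quants store per-block scales and keep some tensors at higher precision, so the "4" in the name is a floor, not the average.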

The "GGUF" part is just the file format. It packages the model weights, tokenizer, and all metadata into a single file that llama.cpp can load directly. If you need help configuring GPU layers for optimal performance, start with our --n-gpu-layers guide.

The Quantization Levels, Ranked

Quant	Size (7B)	Quality	Speed	Best For
F16	~14 GB	100% (baseline)	Slowest	Research, accuracy-critical
Q8_0	~7.7 GB	~99.5%	Fast	When you have the VRAM for it
Q6_K	~5.9 GB	~99%	Fast	Quality-first with moderate VRAM
Q5_K_M	~5.1 GB	~98%	Faster	Sweet spot for 12+ GB GPUs
Q5_K_S	~4.8 GB	~97.5%	Faster	Slightly smaller Q5
Q4_K_M	~4.4 GB	~96.5%	Fastest	Best balance for 8 GB GPUs
Q4_K_S	~4.1 GB	~95.5%	Fastest	When every MB counts
Q3_K_M	~3.5 GB	~92%	Fastest	Not recommended for production
Q2_K	~2.8 GB	~85%	Fastest	Experimental only
IQ4_XS	~4.0 GB	~96%	Fast	imatrix-optimized Q4 alternative

The K_M vs K_S distinction: K_M ("medium") uses more bits for attention layers that impact quality most. K_S ("small") applies uniform quantization. K_M is almost always worth the tiny size increase.

Our Recommendation by Hardware

8 GB VRAM (RTX 4060, RTX 3070)

Use Q4_K_M. This is the sweet spot: about 69% smaller than FP16 (~4.4 GB versus ~14 GB) with only ~3.5% quality loss. You'll fit a 7B model with enough room for 4K–8K context.

For 13B+ models on 8 GB, you'll need Q3_K_M or partial GPU offloading. See our GPU layers guide for the split configuration.

12 GB VRAM (RTX 4070, RTX 3080)

Use Q5_K_M. You have headroom to spend on quality. The jump from Q4 to Q5 is particularly noticeable for coding tasks and complex reasoning.

16+ GB VRAM (RTX 4080, RTX 5070 Ti)

Use Q6_K or Q8_0. With 16 GB, you can run a Q8_0 7B model with full 8K context. For 13B models, Q5_K_M fits comfortably. Our consumer GPU benchmarks show the RTX 5070 Ti handling 50 req/s at Q4_K_M.

24+ GB VRAM (RTX 4090, RTX 5090)

Use Q8_0 for 7–13B or Q5_K_M for 30B+ models. At this tier you rarely need aggressive quantization. Run the highest quality your model size allows.

CPU-Only (No GPU)

Use Q4_K_M with all layers on CPU. Expect 5–15 tok/s depending on your CPU. The main bottleneck is memory bandwidth, not compute. 32 GB RAM minimum for 7B models (the OS and KV cache need room too).

When Quantization Quality Actually Matters

Not all tasks are equally sensitive to quantization:

Highly resilient (Q4 is fine):

Text summarization
Classification and routing
Simple Q&A from context
Chat and conversation

Moderately sensitive (use Q5+ if possible):

Code generation
Multi-step reasoning
Creative writing with nuance
Instruction following with complex constraints

Quality-critical (use Q6+ or Q8):

Arithmetic and math reasoning
Precise factual extraction
Structured output (JSON, XML)
Tasks where small errors cascade

Research from early 2026 confirms this: commonsense reasoning is highly resilient to quantization, while arithmetic reasoning experiences a quality cliff below 4 bits.

The Importance Matrix (imatrix) Trick

Standard quantization treats all weights equally. Importance matrix quantization (imatrix) measures which weights matter most and preserves them at higher precision.

IQ4_XS vs Q4_K_M: IQ4_XS is slightly smaller (4.0 vs 4.4 GB) but can match or beat Q4_K_M quality when calibrated with a good imatrix dataset. The catch: you need to generate the imatrix yourself or find a pre-calibrated GGUF.

Look for GGUF files tagged "imatrix" on Hugging Face. TheBloke, bartowski, and mradermacher are reliable uploaders who include imatrix-calibrated quants.

How to Quantize Your Own Models

If you need a quant that nobody has uploaded:

# 1. Convert to GGUF from HuggingFace format
python convert_hf_to_gguf.py ./my-model --outfile model-f16.gguf --outtype f16

# 2. Quantize to your target level
./llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M

# 3. (Optional) Generate imatrix for better quality
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4xs.gguf IQ4_XS

Always quantize from F16 or F32 source weights. Quantizing an already-quantized model (Q8 → Q4) introduces compounding errors.

Quick Decision Flowchart

Do you have 24+ GB VRAM? → Use Q8_0
Do you have 16 GB VRAM? → Use Q6_K
Do you have 12 GB VRAM? → Use Q5_K_M
Do you have 8 GB VRAM? → Use Q4_K_M
Are you on CPU only? → Use Q4_K_M
Is the task math/code-heavy? → Go one level higher than your default
Is the task chat/summarization? → Your default is fine
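The flowchart is mechanical enough to write as a function. A sketch of the rules above, assuming a four-step ladder for the "one level higher" rule. `pick_quant` and `LADDER` are illustrative names, and the function does not check that the stepped-up quant still fits in VRAM.

```python
LADDER = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]  # lowest to highest quality


def pick_quant(vram_gb, sensitive_task=False):
    """Map VRAM (0 for CPU-only) and task type to a quant level."""
    if vram_gb >= 24:
        level = "Q8_0"
    elif vram_gb >= 16:
        level = "Q6_K"
    elif vram_gb >= 12:
        level = "Q5_K_M"
    else:
        level = "Q4_K_M"  # 8 GB cards and CPU-only
    if sensitive_task:
        # Math- or code-heavy work: go one step up, capped at the top.
        step = min(LADDER.index(level) + 1, len(LADDER) - 1)
        level = LADDER[step]
    return level
```

Treat the result as a starting point. If the stepped-up file plus KV cache does not fit, fall back to the default or offload some layers.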

Common Mistakes

Using Q2_K or Q3_K for anything serious — Below Q4, quality degrades sharply. Save these for experiments.
Ignoring KV cache VRAM — A Q4_K_M model might fit in 4.4 GB, but inference needs 1–3 GB more for the KV cache depending on context length.
Not setting --n-gpu-layers correctly — Partial offloading a Q5 model often outperforms full-GPU Q4. Check our GPU layers guide.
Quantizing from a quantized source — Always start from FP16/FP32 weights.
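The KV cache mistake is worth putting numbers on. The cache holds one key and one value vector per layer, per KV head, per token. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128) and a 16-bit cache; models with grouped-query attention use far fewer KV heads and need much less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    Two tensors (keys and values) per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed shape: a Llama-2-7B-style model (32 layers, 32 KV heads, head_dim 128)
# with a 16-bit cache. Models with grouped-query attention need far less.
for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(32, 32, 128, ctx) / 2**30
    print(f"{ctx:>5} tokens: {gib:.1f} GiB")
```

The cache scales linearly with context length, so doubling the window doubles this line of the VRAM budget while the weights stay fixed.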

Running local LLMs and want to skip the cloud entirely? See our full guide to production inference on consumer GPUs — we benchmarked 4 cards and built the exact setup.