<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Hughes</title>
    <description>The latest articles on DEV Community by Patrick Hughes (@pat9000).</description>
    <link>https://dev.to/pat9000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3763138%2Fa7736e79-1b96-4f55-a9f7-9ddd8775eb09.jpg</url>
      <title>DEV Community: Patrick Hughes</title>
      <link>https://dev.to/pat9000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pat9000"/>
    <language>en</language>
    <item>
      <title>How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Tue, 09 Jun 2026 14:45:13 +0000</pubDate>
      <link>https://dev.to/pat9000/how-to-tune-llamacpp-n-gpu-layers-a-practical-vram-guide-2026-m8i</link>
      <guid>https://dev.to/pat9000/how-to-tune-llamacpp-n-gpu-layers-a-practical-vram-guide-2026-m8i</guid>
      <description>&lt;p&gt;You already know what &lt;code&gt;--n-gpu-layers&lt;/code&gt; does. It moves transformer layers onto your GPU. This post is the next step: how to actually pick the number.&lt;/p&gt;

&lt;p&gt;If you want the basics first, read the original: &lt;a href="https://bmdpat.com/blog/llama-cpp-n-gpu-layers-explained-2026" rel="noopener noreferrer"&gt;llama.cpp n-gpu-layers explained&lt;/a&gt;. This is the tuning guide that follows it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one rule that matters
&lt;/h2&gt;

&lt;p&gt;A model has a fixed number of layers. A 7B model might have 32. A 70B might have 80. The &lt;code&gt;--n-gpu-layers&lt;/code&gt; flag (often shortened to ngl) says how many of those go on the GPU. The rest stay on the CPU and run in system RAM.&lt;/p&gt;

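&lt;p&gt;Full GPU means fast. Full CPU means slow. Partial means somewhere in between, but not evenly. Time per token falls roughly linearly as layers move to the GPU, and the layers left on the CPU dominate that time. Offload half the layers and you typically get well under half the speedup. The biggest jump comes from the last few layers.&lt;/p&gt;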

&lt;p&gt;So the goal is simple. Put as many layers on the GPU as your VRAM allows. Not one more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The VRAM math
&lt;/h2&gt;

&lt;p&gt;Each layer costs roughly the same amount of VRAM. You can estimate it.&lt;/p&gt;

&lt;p&gt;Take the model file size on disk. Divide by the layer count. That gives you a rough per-layer cost.&lt;/p&gt;

&lt;p&gt;A 7B model quantized to Q4 is around 4 GB. Split across 32 layers, that is about 125 MB per layer. Offload 24 layers and you spend roughly 3 GB on weights.&lt;/p&gt;
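
&lt;p&gt;Here is that arithmetic as a minimal Python sketch. The function names are made up for illustration, and it assumes every layer costs the same, which is only approximately true.&lt;/p&gt;

```python
def per_layer_mb(file_size_gb, n_layers):
    """Average weight cost of one layer in MB (decimal units: 1 GB = 1000 MB)."""
    return file_size_gb * 1000 / n_layers


def weights_on_gpu_gb(file_size_gb, n_layers, ngl):
    """Rough VRAM spent on weights when ngl layers are offloaded."""
    ngl = min(ngl, n_layers)  # values like 99 just mean "all layers"
    return file_size_gb * ngl / n_layers


# The 7B Q4 example from the text: about 4 GB across 32 layers.
print(per_layer_mb(4.0, 32))           # 125.0
print(weights_on_gpu_gb(4.0, 32, 24))  # 3.0
```

&lt;p&gt;Swap in your own file size and layer count. The result is a planning number, not a measurement.&lt;/p&gt;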

&lt;p&gt;This is an estimate, not a promise. The token embedding and output tensors sit outside the repeating layers, so dividing the whole file by the layer count slightly overstates the per-layer cost. But the average holds well enough to plan with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do not forget the KV cache
&lt;/h2&gt;

&lt;p&gt;Weights are only part of the bill. The KV cache also lives on the GPU when you offload, and it grows with context length.&lt;/p&gt;

&lt;p&gt;Longer context means a bigger cache. Double the context window and you roughly double the cache size. On a tight card, a long context can push you into OOM even when the weights fit.&lt;/p&gt;

&lt;p&gt;So budget VRAM in two buckets. Weights first. Then leave headroom for the KV cache at the context length you actually plan to run.&lt;/p&gt;
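
&lt;p&gt;To put a number on the second bucket, here is a rough sketch of the standard KV cache size formula. The model shape is an assumption: 32 layers, 32 KV heads of width 128, and an fp16 cache, roughly what a Llama-2-7B-style model uses. Models with grouped-query attention have fewer KV heads and a much smaller cache, so check your model's metadata.&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the K and V caches: one K and one V vector per layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Assumed shape: 32 layers, 32 KV heads of width 128, fp16 cache (2 bytes per value).
for n_ctx in (4096, 8192):
    size = kv_cache_bytes(32, 32, 128, n_ctx)
    print(n_ctx, size / GIB)  # prints 2.0 for 4096 and 4.0 for 8192
```

&lt;p&gt;Under those assumptions, doubling the context doubles the cache, which is the rule of thumb above.&lt;/p&gt;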

&lt;h2&gt;
  
  
  Reading OOM symptoms
&lt;/h2&gt;

&lt;p&gt;When you ask for too many layers, llama.cpp fails at load time with a CUDA out of memory error. It will not silently fall back. It stops.&lt;/p&gt;

&lt;p&gt;The fix is to drop ngl by a few and reload. Step down until it loads. If you are right at the edge, shave 2 or 3 layers and try again.&lt;/p&gt;

&lt;p&gt;Watch your VRAM with &lt;code&gt;nvidia-smi&lt;/code&gt; while the model loads. You want a buffer left over, not a card pinned at 100 percent. Other apps, your desktop, and the KV cache all want a slice.&lt;/p&gt;

&lt;h2&gt;
  
  
  A fast tuning loop
&lt;/h2&gt;

&lt;p&gt;You do not need to calculate everything. You can probe.&lt;/p&gt;

&lt;p&gt;Start with ngl set to a high number. Many people use 99 to mean "offload everything." If it loads, you are done. The whole model fits.&lt;/p&gt;

&lt;p&gt;If it OOMs, step down. For a 32-layer model, try 28, then 24, then 20. Each reload tells you where the ceiling is. Five minutes of trial beats an hour of spreadsheet math.&lt;/p&gt;

&lt;p&gt;Once it loads cleanly, run a real prompt at your target context length. If that OOMs mid-generation, the KV cache pushed you over. Drop a few more layers and leave room.&lt;/p&gt;
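
&lt;p&gt;The loop is simple enough to script. This is a sketch only: &lt;code&gt;try_load&lt;/code&gt; is a hypothetical callable you would write yourself, for example one that launches llama.cpp with a given layer count and your real context length, then reports whether the run succeeded. The fake loader below stands in for it.&lt;/p&gt;

```python
def find_max_ngl(try_load, start, step=4, floor=0):
    """Step ngl down from start until try_load(ngl) succeeds.

    try_load is any callable that returns True when the model loads and a
    test prompt completes at that layer count.
    """
    ngl = start
    while True:
        if try_load(ngl):
            return ngl
        if ngl == floor:
            return None  # not even the floor fits
        ngl = max(ngl - step, floor)


# Stand-in loader for illustration: pretend anything above 22 layers runs out of memory.
def fake_load(ngl):
    return 22 >= ngl


print(find_max_ngl(fake_load, start=32))  # 20
```

&lt;p&gt;A step of 4 lands on 20 even though 22 would fit. That is fine for a first pass. Climb back up one layer at a time if you want the last bit of speed.&lt;/p&gt;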

&lt;h2&gt;
  
  
  Quick starting points by card
&lt;/h2&gt;

&lt;p&gt;These are rough anchors, not guarantees. Your quant, context, and model size all move the number.&lt;/p&gt;

&lt;p&gt;On an 8 GB card, a 7B Q4 model usually offloads fully. A 13B at Q4 will only fit partially.&lt;/p&gt;

&lt;p&gt;On a 12 GB card, 13B models at Q4 fit comfortably and you have room for context.&lt;/p&gt;

&lt;p&gt;On 16 GB or more, you can run larger models or push context length hard. A 24 GB card handles most single-GPU local work without much tuning at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  How quant choice feeds in
&lt;/h2&gt;

&lt;p&gt;Smaller quant means smaller weights means more layers fit. If you cannot offload a model fully, dropping from Q5 to Q4 might get you there. That tradeoff is its own decision, and it pairs directly with this one.&lt;/p&gt;

&lt;p&gt;If you are weighing which quant to run, read the companion post: &lt;a href="https://bmdpat.com/blog/gguf-quant-which-to-pick-2026" rel="noopener noreferrer"&gt;which GGUF quant should you actually pick&lt;/a&gt;. Tune ngl and quant together. They share the same VRAM budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  The instinct underneath all of this
&lt;/h2&gt;

&lt;p&gt;Running models locally is a cost move. Every token you serve on your own GPU is a token you did not pay an API for. Tuning ngl is just squeezing more value out of hardware you already own.&lt;/p&gt;

&lt;p&gt;That same instinct, watching the meter and refusing to overspend, is what AgentGuard does for AI agents. It caps token spend, rate limits calls, and stops a runaway loop before it burns your budget. Local inference cuts your fixed cost. AgentGuard caps your variable cost.&lt;/p&gt;

&lt;p&gt;If you are running agents and want a hard ceiling on spend, &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;check out AgentGuard&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>llamacpp</category>
      <category>gpu</category>
      <category>vram</category>
    </item>
    <item>
      <title>Which GGUF Quant Should You Actually Pick? Q4 vs Q5 vs Q6 vs Q8 (2026)</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Tue, 09 Jun 2026 14:45:10 +0000</pubDate>
      <link>https://dev.to/pat9000/which-gguf-quant-should-you-actually-pick-q4-vs-q5-vs-q6-vs-q8-2026-3kek</link>
      <guid>https://dev.to/pat9000/which-gguf-quant-should-you-actually-pick-q4-vs-q5-vs-q6-vs-q8-2026-3kek</guid>
      <description>&lt;p&gt;You know what Q4, Q5, and Q8 mean. Now the real question: which one do you actually download?&lt;/p&gt;

&lt;p&gt;If you need the background on what these numbers represent, start with the original: &lt;a href="https://bmdpat.com/blog/gguf-quantization-q4-q5-q8-explained-2026" rel="noopener noreferrer"&gt;GGUF quantization Q4 Q5 Q8 explained&lt;/a&gt;. This post is the decision guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoff in one line
&lt;/h2&gt;

&lt;p&gt;Lower number means smaller file, less VRAM, faster load, and slightly worse output. Higher number means bigger file, more VRAM, and output closer to the original model.&lt;/p&gt;

&lt;p&gt;That is the whole game. You are trading quality for size. The trick is knowing how much quality you actually lose, and the answer is: less than you think at the high end, more than you think at the low end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with what fits
&lt;/h2&gt;

&lt;p&gt;The first filter is not quality. It is VRAM.&lt;/p&gt;

&lt;p&gt;Pick the largest quant that fits on your GPU with room for context. A model you can fully offload at Q4 will run faster than the same model at Q5 that spills onto the CPU. Fit beats precision when fit decides speed.&lt;/p&gt;

&lt;p&gt;So measure your VRAM, subtract headroom for the KV cache, and see which quant lands under that ceiling. That narrows the choice fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the K-quants give you
&lt;/h2&gt;

&lt;p&gt;You will see names like Q4_K_M and Q5_K_M. The K means K-quants. They are smarter than the old flat quants.&lt;/p&gt;

&lt;p&gt;K-quants spend more bits on the parts of the model that matter most and fewer bits on the rest. For the same file size, a K-quant holds quality better than a plain one. This is why Q4_K_M became the default many people reach for.&lt;/p&gt;

&lt;p&gt;The M and S suffixes mean medium and small. M keeps more quality. S shrinks further. When in doubt, pick M.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical ladder
&lt;/h2&gt;

&lt;p&gt;Here is how the common options stack up for someone on a consumer GPU.&lt;/p&gt;

&lt;p&gt;Q4_K_M is the workhorse. Smallest footprint that still feels like the real model. If you are tight on VRAM, start here.&lt;/p&gt;

&lt;p&gt;Q5_K_M is the safe upgrade. A noticeable quality bump over Q4 for a modest size increase. If it fits, many people prefer it.&lt;/p&gt;

&lt;p&gt;Q6_K is close to lossless for most tasks. Bigger, but the quality gap to the full model is small. Good when you have VRAM to spare and want margin.&lt;/p&gt;

&lt;p&gt;Q8_0 is near the original. The difference from full precision is hard to notice in normal use. It is large, so you only pick it when size is not a concern.&lt;/p&gt;

&lt;h2&gt;
  
  
  How quality actually falls off
&lt;/h2&gt;

&lt;p&gt;Think of it as a curve, not a line.&lt;/p&gt;

&lt;p&gt;Going from Q8 down to Q5, the quality loss is small. The model barely changes for most prompts. You get a big size win for almost no cost.&lt;/p&gt;

&lt;p&gt;Going below Q4, the loss grows fast. Q3 and Q2 start making real mistakes: weaker reasoning, more repetition, shakier instruction following. They exist for cases where a model simply will not fit otherwise.&lt;/p&gt;

&lt;p&gt;So the sweet spot for most people sits between Q4_K_M and Q6_K. Above that you pay size for little gain. Below that you lose quality faster than you save space.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple decision flow
&lt;/h2&gt;

&lt;p&gt;Can you fit Q6_K with your context? Take it. Near-lossless, done.&lt;/p&gt;

&lt;p&gt;Cannot fit Q6 but can fit Q5_K_M? Take that. Strong quality, smaller.&lt;/p&gt;

&lt;p&gt;Tight on VRAM? Q4_K_M is the floor that still feels right.&lt;/p&gt;

&lt;p&gt;Cannot even fit Q4? Drop to a smaller model at Q4 before you drop to Q3 of a bigger one. A smaller model at a healthy quant usually beats a big model crushed too hard.&lt;/p&gt;
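
&lt;p&gt;The flow above is easy to turn into a quick check. This is a sketch under loud assumptions: the bits-per-weight figures are rounded approximations that vary by model, the headroom is whatever you reserve for the KV cache and overhead, and the function names are made up for illustration.&lt;/p&gt;

```python
# Approximate bits per weight for each quant (assumed, rounded; varies by model).
BITS_PER_WEIGHT = {"Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.9}


def weights_gb(params_billion, quant):
    """Estimated weight size in GB: parameters times bits per weight, over 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def pick_quant(params_billion, vram_gb, headroom_gb):
    """Walk the ladder from Q6_K down and return the first quant that fits."""
    budget = vram_gb - headroom_gb
    for quant in ("Q6_K", "Q5_K_M", "Q4_K_M"):
        if budget >= weights_gb(params_billion, quant):
            return quant
    return None  # nothing fits: drop to a smaller model instead of Q3


print(pick_quant(13, 12, 2))    # Q5_K_M
print(pick_quant(7, 8, 1.5))    # Q6_K
print(pick_quant(13, 8, 1.5))   # None
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result is the last step of the flow: pick a smaller model at Q4 rather than crushing the big one.&lt;/p&gt;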

&lt;h2&gt;
  
  
  Quant and offload are the same budget
&lt;/h2&gt;

&lt;p&gt;Your quant choice and your &lt;code&gt;--n-gpu-layers&lt;/code&gt; setting pull from the same VRAM pool. A smaller quant frees room to offload more layers, which is what makes the model fast.&lt;/p&gt;

&lt;p&gt;If you have not tuned your layer offload yet, read the companion post: &lt;a href="https://bmdpat.com/blog/llama-cpp-n-gpu-layers-tuning-guide-2026" rel="noopener noreferrer"&gt;the n-gpu-layers tuning guide&lt;/a&gt;. Pick the quant and the layer count together. They are one decision wearing two hats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why any of this matters
&lt;/h2&gt;

&lt;p&gt;Running local models is a cost play. Every prompt you answer on your own card is a prompt you did not pay an API to handle. Picking the right quant means more capable output per dollar of hardware you already bought.&lt;/p&gt;

&lt;p&gt;That same discipline, getting the most value while keeping spend capped, is what AgentGuard enforces for AI agents. It sets hard limits on tokens, cost, and call rate so a loop cannot run up a bill while you sleep. Local inference trims your fixed cost. AgentGuard caps your variable cost.&lt;/p&gt;

&lt;p&gt;If you run agents and want a real ceiling on spend, &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;check out AgentGuard&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>gguf</category>
      <category>quantization</category>
      <category>gpu</category>
    </item>
    <item>
      <title>How to Tune --n-gpu-layers for Your VRAM Budget</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Mon, 08 Jun 2026 14:45:11 +0000</pubDate>
      <link>https://dev.to/pat9000/how-to-tune-n-gpu-layers-for-your-vram-budget-4o79</link>
      <guid>https://dev.to/pat9000/how-to-tune-n-gpu-layers-for-your-vram-budget-4o79</guid>
      <description>&lt;h1&gt;
  
  
  How to Tune --n-gpu-layers for Your VRAM Budget
&lt;/h1&gt;

&lt;p&gt;I wrote &lt;a href="https://bmdpat.com/blog/llama-cpp-n-gpu-layers-explained-2026" rel="noopener noreferrer"&gt;an explainer on llama.cpp's --n-gpu-layers flag&lt;/a&gt; and it keeps pulling traffic. The explainer covers what the flag does. This post covers the part people actually struggle with: how to pick the right number, do the offload math, split across two GPUs, and stop the out-of-memory crashes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the flag really controls
&lt;/h2&gt;

&lt;p&gt;A model is a stack of transformer layers. &lt;code&gt;--n-gpu-layers&lt;/code&gt; (or &lt;code&gt;-ngl&lt;/code&gt;) tells llama.cpp how many of those layers to put on the GPU. The rest run on the CPU.&lt;/p&gt;

&lt;p&gt;Layers on the GPU run fast. Layers on the CPU run slow. So your goal is simple: put as many layers on the GPU as will fit, and not one more. One layer too many and you get an out-of-memory crash or a silent spill that tanks your speed.&lt;/p&gt;

&lt;p&gt;If the whole model fits, just set &lt;code&gt;-ngl 99&lt;/code&gt; and forget it. The number only matters when the model is bigger than your VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The offload math
&lt;/h2&gt;

&lt;p&gt;Each layer takes roughly the same amount of memory. So the math is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vram-per-layer = model-weights-GB / total-layers
layers-that-fit = (free-vram-GB - overhead) / vram-per-layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Work an example. A 13B model at Q4 is about 7.5 GB of weights across 40 layers. That is roughly 0.19 GB per layer.&lt;/p&gt;

&lt;p&gt;You have an 8 GB card. Reserve about 1.5 GB for the KV cache and overhead. That leaves 6.5 GB for layers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6.5 / 0.19 = ~34 layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So start at &lt;code&gt;-ngl 34&lt;/code&gt; for that model on that card. The other 6 layers run on CPU. You get most of the speed of a full GPU load without the crash.&lt;/p&gt;
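
&lt;p&gt;If you would rather not do this by hand, the two formulas above fit in a few lines of Python. This is a minimal sketch with a made-up function name. It rounds down, because a fraction of a layer cannot be offloaded.&lt;/p&gt;

```python
import math


def layers_that_fit(weights_gb, total_layers, free_vram_gb, overhead_gb):
    """Apply the two formulas above and round down to a whole layer count."""
    vram_per_layer = weights_gb / total_layers
    usable = max(free_vram_gb - overhead_gb, 0.0)
    return min(math.floor(usable / vram_per_layer), total_layers)


# The worked example: 13B Q4, 7.5 GB over 40 layers, 8 GB card, 1.5 GB reserved.
print(layers_that_fit(7.5, 40, 8.0, 1.5))  # 34
```

&lt;p&gt;Treat the output as a starting value for &lt;code&gt;-ngl&lt;/code&gt;, then confirm it on the real card.&lt;/p&gt;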

&lt;h2&gt;
  
  
  Find the real number fast
&lt;/h2&gt;

&lt;p&gt;The math gets you close. Then you tune by hand. Watch VRAM in one terminal and step the number in another.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# terminal 1&lt;/span&gt;
nvidia-smi &lt;span class="nt"&gt;-l&lt;/span&gt; 1

&lt;span class="c"&gt;# terminal 2: start lower than the math says, then climb&lt;/span&gt;
./llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; model-q4.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 30 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"test prompt"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Climb by 2 or 3 layers each run. Watch &lt;code&gt;nvidia-smi&lt;/code&gt;. When VRAM hits about 90 percent, stop. Leave headroom. The KV cache grows as the context fills, so a load that fits an empty prompt can crash 3000 tokens later.&lt;/p&gt;

&lt;p&gt;That last point is the number one cause of OOM crashes. People tune with a tiny prompt, see it fit, ship it, then crash on a real long input. Always tune at the context length you will actually use, set with &lt;code&gt;-c&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common OOM mistakes
&lt;/h2&gt;

&lt;p&gt;You set -ngl too high and forgot the KV cache. The cache is not free. At an 8K context with an fp16 cache it can eat several GB on a 13B that lacks grouped-query attention. Reserve for it.&lt;/p&gt;

&lt;p&gt;You raised the context length and kept the old -ngl. Bigger context means a bigger cache means less room for layers. Re-tune when you change &lt;code&gt;-c&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You loaded a second model on the same card. Two models share one pool of VRAM. The first one's -ngl no longer fits.&lt;/p&gt;

&lt;p&gt;You assumed Q8 fits because Q4 did. Q8 is nearly double the weight memory. The layer math changes completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting across two GPUs
&lt;/h2&gt;

&lt;p&gt;If you have two cards, llama.cpp can split the model across both. Use &lt;code&gt;--tensor-split&lt;/code&gt; to set the ratio.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# two cards, 24 GB and 32 GB: weight the bigger card heavier&lt;/span&gt;
./llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; big-model-q5.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;--tensor-split&lt;/span&gt; 24,32 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers are a ratio, not gigabytes, but matching them to your VRAM sizes is a good start. With &lt;code&gt;-ngl 99&lt;/code&gt; and a split, llama.cpp puts all layers on the GPUs and divides them by the ratio. Now a 34B that fits on neither card alone fits across both.&lt;/p&gt;

&lt;p&gt;One catch. Splitting adds a little cross-GPU traffic, so two 16 GB cards are a touch slower than one 32 GB card at the same total memory. Still far faster than spilling to CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this runs inside an agent
&lt;/h2&gt;

&lt;p&gt;Tuning -ngl gets a single run fast. But if a local model sits behind an agent that calls it in a loop, a stuck loop can peg both GPUs for hours and run your power bill up overnight. Local does not mean free.&lt;/p&gt;

&lt;p&gt;That is why I built &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt;. It is an open-source runtime budget, token, and rate limiter for AI agents, and it caps your agent loop whether the model is a cloud API or a local GGUF on your own cards. &lt;code&gt;pip install agentguard&lt;/code&gt;, wrap the loop, set a cap, and a runaway agent stops before it costs you a night of compute.&lt;/p&gt;

&lt;p&gt;Do the layer math, tune at your real context length, leave headroom for the cache, and split across cards when one is not enough. That is the whole game.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>llamacpp</category>
      <category>gpu</category>
      <category>vram</category>
    </item>
    <item>
      <title>llama.cpp Multi-GPU: Splitting a Model Across Cards with --tensor-split</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Mon, 08 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/llamacpp-multi-gpu-splitting-a-model-across-cards-with-tensor-split-5767</link>
      <guid>https://dev.to/pat9000/llamacpp-multi-gpu-splitting-a-model-across-cards-with-tensor-split-5767</guid>
      <description>&lt;p&gt;If you are still tuning a single card, start here first: &lt;a href="https://bmdpat.com/blog/llama-cpp-n-gpu-layers-explained-2026" rel="noopener noreferrer"&gt;llama.cpp n-gpu-layers explained&lt;/a&gt;. That post covers how &lt;code&gt;--n-gpu-layers&lt;/code&gt; moves layers onto one GPU and the VRAM math behind it.&lt;/p&gt;

&lt;p&gt;This one is the next step. Once a model no longer fits on one card, you split it across several. A 70B model at Q4 needs about 40 GB on disk plus a few GB for the KV cache. No single consumer card has that much VRAM. But three cards together do. This is where &lt;code&gt;--tensor-split&lt;/code&gt; comes in.&lt;/p&gt;

&lt;p&gt;Most people who go multi-GPU are not running a one-off prompt. They are standing up a local model to feed an agent: a loop that calls the model over and over to read, plan, and act. That changes what matters. A bigger model on more cards is the capacity side of the problem. The other side is keeping that agent loop from running the rig hot all night on a task that should have stopped. Both sides need a plan before you load the weights. We will set up the hardware here and come back to the loop control &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; gives you once the model is running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core idea
&lt;/h2&gt;

&lt;p&gt;With one GPU, you only decide how many layers go on the card. With multiple GPUs, you decide how the layers get divided between them. llama.cpp loads the model once and spreads the weights across every GPU you point it at. Each card holds a slice. During inference, activations pass from one card to the next over the PCIe bus.&lt;/p&gt;

&lt;p&gt;Three flags control this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--tensor-split&lt;/code&gt; sets the ratio of the model that lands on each GPU.&lt;/li&gt;
&lt;li&gt;
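&lt;code&gt;--main-gpu&lt;/code&gt; picks the card used for a single-GPU run (&lt;code&gt;--split-mode none&lt;/code&gt;), or the card that holds intermediate results and the KV cache in row mode.&lt;/li&gt;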
&lt;li&gt;
&lt;code&gt;--split-mode&lt;/code&gt; chooses how the work is divided: by layer or by row.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  --tensor-split: the ratio
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--tensor-split&lt;/code&gt; takes a comma-separated list, one number per GPU. The numbers are weights, not gigabytes. llama.cpp normalizes them into proportions.&lt;/p&gt;

&lt;p&gt;Say you have a 24 GB card and a 16 GB card. You want the bigger card to hold more of the model. A split of &lt;code&gt;24,16&lt;/code&gt; puts 60 percent on GPU 0 and 40 percent on GPU 1. The exact integers do not matter, only the ratio. &lt;code&gt;24,16&lt;/code&gt; and &lt;code&gt;3,2&lt;/code&gt; do the same thing.&lt;/p&gt;
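
&lt;p&gt;You can check a ratio before you load anything. This sketch only models the normalization described above. It is not llama.cpp's own code, and the function name is made up for illustration.&lt;/p&gt;

```python
def split_fractions(weights):
    """Normalize --tensor-split style weights into per-GPU fractions."""
    total = sum(weights)
    return [w / total for w in weights]


print(split_fractions([24, 16]))  # [0.6, 0.4]
print(split_fractions([3, 2]))    # [0.6, 0.4]
```

&lt;p&gt;Both inputs give the same fractions, which is why only the ratio matters.&lt;/p&gt;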

&lt;p&gt;The goal is to match the split to each card's free VRAM. If you overload the smaller card, you get an out-of-memory crash mid-load. Leave headroom on every card that holds KV cache, because that buffer scales with context length. In the default layer mode, each card stores the cache for its own layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  --main-gpu and the KV cache
&lt;/h2&gt;

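&lt;p&gt;&lt;code&gt;--main-gpu&lt;/code&gt; defaults to GPU 0. Per the llama.cpp help text, it selects the card for the whole model when &lt;code&gt;--split-mode none&lt;/code&gt; is set, and the card for intermediate results and the KV cache in row mode. In the default layer mode, the KV cache is split across cards along with the layers, so no single card carries all of it.&lt;/p&gt;

&lt;p&gt;A long context can still add several gigabytes in total. In row mode, point &lt;code&gt;--main-gpu&lt;/code&gt; at your largest card. If your main GPU is also your smallest card, a high context setting can run it out of memory even though the weights fit. In layer mode, leave headroom on each card in proportion to its share of the layers.&lt;/p&gt;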

&lt;h2&gt;
  
  
  --split-mode: layer vs row
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--split-mode layer&lt;/code&gt; is the default and the right choice on consumer hardware. Each GPU owns a contiguous block of layers. Cross-card traffic happens only at the layer boundaries, so PCIe bandwidth stays low.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--split-mode row&lt;/code&gt; splits individual tensors across cards. It can be faster, but only when the GPUs talk over a fast link like NVLink. Recent consumer cards do not have NVLink. On a plain PCIe rig, row mode floods the bus and usually runs slower than layer mode. Stick with layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A worked example on a real rig
&lt;/h2&gt;

&lt;p&gt;Here is a mixed consumer setup: an RTX 5090 (32 GB), an RTX 5070 Ti (16 GB), and an RTX 3070 (8 GB). Total is about 56 GB of VRAM. That is enough to run a 70B model at Q4 fully on GPU, with the ~40 GB of weights spread across all three cards and room left for the cache.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Split weight&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5090&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;main-gpu, model slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5070 Ti&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;model slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3070&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;model slice (leave headroom)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note the 3070 gets a weight of 6, not 8. Leave a margin on the smallest card so a context spike does not push it over.&lt;/p&gt;

&lt;p&gt;The command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; models/llama-3.1-70b-instruct-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-split&lt;/span&gt; 32,16,6 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--main-gpu&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--split-mode&lt;/span&gt; layer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"Explain tensor parallelism in one paragraph."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--n-gpu-layers 999&lt;/code&gt; still means "all layers on GPU." The difference now is that &lt;code&gt;--tensor-split&lt;/code&gt; decides which GPU each layer lands on. Watch the load logs. llama.cpp prints how much VRAM it assigns to each device. If one card is near its limit, adjust the ratio down and rerun.&lt;/p&gt;
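
&lt;p&gt;Before the first load, it helps to estimate what each card will carry. This is a rough sketch using the figures from the rig above. It assumes weights divide exactly by the ratio, which the real loader only approximates because it assigns whole layers. The function name is made up for illustration.&lt;/p&gt;

```python
def plan_split(weights_gb, split, vram_gb):
    """Return (share_gb, headroom_gb) per card for a given split ratio."""
    total = sum(split)
    plan = []
    for ratio, vram in zip(split, vram_gb):
        share = weights_gb * ratio / total
        plan.append((round(share, 1), round(vram - share, 1)))
    return plan


# Figures from the rig above: ~40 GB of weights, split 32,16,6, cards of 32/16/8 GB.
for name, (share, headroom) in zip(
    ("RTX 5090", "RTX 5070 Ti", "RTX 3070"),
    plan_split(40, (32, 16, 6), (32, 16, 8)),
):
    print(f"{name}: {share} GB of weights, {headroom} GB left")
```

&lt;p&gt;The leftover on each card has to cover the KV cache, compute buffers, and anything else using that GPU. Compare the estimate with the per-device numbers in the load logs and adjust.&lt;/p&gt;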

&lt;h2&gt;
  
  
  When multi-GPU helps and when it hurts
&lt;/h2&gt;

&lt;p&gt;Multi-GPU helps when a model does not fit on your largest single card. Splitting a 70B across three cards turns "impossible" into "runs at usable speed." That is the whole point.&lt;/p&gt;

&lt;p&gt;It hurts when the model already fits on one card. Splitting it then adds PCIe round-trips between cards for no benefit, and throughput drops. If your model fits on the 5090 alone, run it there and leave the other cards free.&lt;/p&gt;

&lt;p&gt;Expect throughput to scale below linear. Three cards do not give you 3x the speed of one. Activations still hop across PCIe between layer blocks, and that adds latency. The win is capacity, not raw speed. You are trading some tokens per second for the ability to run a much larger model at all. A 70B at Q4 across this rig lands in a usable interactive range, slower than an 8B on a single card but far smarter.&lt;/p&gt;

&lt;p&gt;That slower throughput is exactly where an agent loop gets dangerous. A 70B across three cards might do 10 to 15 tokens a second. An agent that retries, re-reads context, and re-plans can run for hours and you would not notice until the room is warm. The model is free to run, but the time and power are not, and a stuck loop produces nothing. This is the operational cost of local inference: not a per-token bill, but wasted hours on a runaway task. &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; sits around the calls and enforces a hard ceiling on tokens and wall-clock spend, so the loop stops itself instead of grinding until morning. Set the cap before you point an agent at a multi-GPU model, not after the first overnight surprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;Single-card &lt;code&gt;--n-gpu-layers&lt;/code&gt; is step one: how many layers fit on one GPU. &lt;code&gt;--tensor-split&lt;/code&gt; is how you scale past one card: the ratio of the model each GPU holds. Set the ratio to match free VRAM, leave headroom on each card for its share of the KV cache, and keep &lt;code&gt;--split-mode layer&lt;/code&gt; unless you have NVLink. For the broader picture on running local models in production, see &lt;a href="https://bmdpat.com/blog/local-llm-inference-consumer-gpu-production-2026" rel="noopener noreferrer"&gt;local LLM inference on consumer GPUs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That gets the model running. The capacity side is solved. Before you wire the model into an agent, solve the loop side too: put &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; around the calls so a runaway agent on your slow-but-smart 70B stops at a budget instead of burning the whole night.&lt;/p&gt;

</description>
      <category>llamacpp</category>
      <category>localllm</category>
      <category>gpu</category>
      <category>multigpu</category>
    </item>
    <item>
      <title>What Uber's $1,500/Developer AI Cap Tells You About Your Own Bill</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sun, 07 Jun 2026 14:45:13 +0000</pubDate>
      <link>https://dev.to/pat9000/what-ubers-1500developer-ai-cap-tells-you-about-your-own-bill-214i</link>
      <guid>https://dev.to/pat9000/what-ubers-1500developer-ai-cap-tells-you-about-your-own-bill-214i</guid>
      <description>&lt;p&gt;Bloomberg reported this week that Uber now caps every employee at $1,500 per month, per AI coding tool. Simon Willison &lt;a href="https://simonwillison.net/2026/Jun/3/uber-caps-usage/#atom-everything" rel="noopener noreferrer"&gt;picked it up on June 3&lt;/a&gt;. The number matters less than the move. A Fortune 50 company with a real finance team just admitted it cannot predict what its developers spend on AI.&lt;/p&gt;

&lt;p&gt;This is the follow-up to a story I wrote about in May. Uber &lt;a href="https://bmdpat.com/blog/uber-2026-ai-budget-claude-code" rel="noopener noreferrer"&gt;burned its entire 2026 AI coding budget in four months&lt;/a&gt;, running at roughly 3x the naive projection. The $1,500 cap is the response. Burn first, cap second. That order tells you everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math is looser than it sounds
&lt;/h2&gt;

&lt;p&gt;Read the policy carefully. The cap is per tool, per employee, per month. It is not a single monthly total per person.&lt;/p&gt;

&lt;p&gt;So one developer running Claude Code, Cursor, and Copilot can spend $1,500 on each before any limit fires. That is $4,500 a month, per head, inside policy. Multiply by a team. The org-level number is still wide open.&lt;/p&gt;
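
&lt;p&gt;The same arithmetic as a tiny sketch. The team size of ten is a hypothetical figure for illustration, not something from the Bloomberg report.&lt;/p&gt;

```python
def policy_ceiling_usd(cap_per_tool, n_tools, n_developers=1):
    """Maximum monthly spend that stays inside a per-tool, per-employee cap."""
    return cap_per_tool * n_tools * n_developers


print(policy_ceiling_usd(1500, 3))      # 4500
print(policy_ceiling_usd(1500, 3, 10))  # 45000 for a hypothetical team of ten
```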

&lt;p&gt;A per-tool cap is a speed bump, not a wall. It slows the worst offenders. It does not give you a real budget. Uber knows this. It is the best they could ship fast, and fast was the constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means if you are not Uber
&lt;/h2&gt;

&lt;p&gt;Here is the part that hits small shops and solo builders.&lt;/p&gt;

&lt;p&gt;If a company with thousands of engineers and a dedicated FinOps function cannot forecast coding-agent spend, you cannot either. Not because you are bad at it. Because the spend is non-deterministic. An agent in a loop, a long context window, a retry storm, a contractor who left a script running over the weekend. None of that shows up until the bill does.&lt;/p&gt;

&lt;p&gt;You probably have one of these problems right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A $200 Claude Code subscription that quietly metered out and is now billing API rates.&lt;/li&gt;
&lt;li&gt;Three contractors sharing one API key with no per-person ceiling.&lt;/li&gt;
&lt;li&gt;An automation agent that retries on failure and occasionally retries forever.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A policy memo does not stop any of these. Uber's lesson is that the fix has to live at the call site, not in a spreadsheet. You want the limit enforced in code, before the request goes out, per identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Uber's policy looks like as runtime code
&lt;/h2&gt;

&lt;p&gt;This is the whole idea behind AgentGuard. It is the open-source primitive Uber is reinventing in-house, except you get it in three lines.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentguard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BudgetGuard&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the identity behind that guard crosses $1,500 for the month, the next call raises instead of charging you. No memo. No quarterly surprise. The limit is a fact of the runtime, not a guideline somebody might ignore.&lt;/p&gt;

&lt;p&gt;You can scope it per developer, per contractor, per agent, or per customer if you resell access. That is the difference between Uber's blunt per-tool cap and what you actually want: a per-identity ceiling that knows who is spending.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why in-process beats a proxy
&lt;/h2&gt;

&lt;p&gt;Most cost tools sit in front of your calls as a proxy or router. That works until it does not. A proxy adds a hop, a single point of failure, and another thing to operate. It also cannot see intent. It sees traffic.&lt;/p&gt;

&lt;p&gt;AgentGuard runs in your process, around your client. It counts real spend against the identity making the call and stops before the request leaves. No extra infrastructure. No second bill to control your first bill.&lt;/p&gt;

&lt;p&gt;Uber proved the demand. A hard per-identity dollar cap on AI tooling is now something the largest engineering orgs ship by hand. You do not have to build it from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Two cost-cap stories landed in 24 hours this week: Uber's $1,500 cap and Copilot moving to usage-based pricing. The direction is set. AI coding spend is going from flat-rate to metered, and metered means someone has to own the meter.&lt;/p&gt;

&lt;p&gt;If you wait for the bill to tell you, you have already lost the month. Put the cap in the code.&lt;/p&gt;

&lt;p&gt;AgentGuard is free and open source. Three lines to a hard budget that actually fires. &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;Try it here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aicostcontrol</category>
      <category>agentguard</category>
      <category>claudecode</category>
      <category>runtimegovernance</category>
    </item>
    <item>
      <title>Your AI Agent's Retry Loop Is a Cost Bug Waiting to Happen</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sun, 07 Jun 2026 14:45:09 +0000</pubDate>
      <link>https://dev.to/pat9000/your-ai-agents-retry-loop-is-a-cost-bug-waiting-to-happen-24h7</link>
      <guid>https://dev.to/pat9000/your-ai-agents-retry-loop-is-a-cost-bug-waiting-to-happen-24h7</guid>
      <description>&lt;p&gt;This morning a small piece of my own automation got stuck. A repair agent tried to fix one blog draft. It failed the same check 27 times in a row. Each attempt was a full model call. The loop never asked the obvious question: if 26 tries did not work, why would the 27th?&lt;/p&gt;

&lt;p&gt;That is a retry loop with no circuit breaker. And it is one of the most common ways AI agents quietly waste money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retries are not free
&lt;/h2&gt;

&lt;p&gt;In normal code, a retry is cheap. You hit a flaky network call, you try again, you move on. The cost of one extra attempt is a few milliseconds.&lt;/p&gt;

&lt;p&gt;Agent retries are different. Every attempt is a model call. Every model call costs tokens. A loop that retries 27 times is 27 paid attempts at the same task. If the task is impossible the way it is framed, you pay 27 times to learn nothing.&lt;/p&gt;

&lt;p&gt;The error handling looks responsible. It catches the failure. It tries again. It logs the attempt. But "try again" without "and stop at some point" is not error handling. It is a slow leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick cost example
&lt;/h2&gt;

&lt;p&gt;Say each attempt is a 4,000-token call. At a few dollars per million tokens, one attempt costs about a cent. Twenty-seven attempts is still small change. Now scale it. A loop that runs on a 40,000-token context, across ten agents, several times a day, adds up. The per-attempt cost hides the total. That is the danger. Small numbers times a loop with no ceiling become a large number you never chose to spend.&lt;/p&gt;
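
&lt;p&gt;Here is that scaling as rough arithmetic. It is a sketch only: the $2 per million tokens price, the five runs a day, and the 30-day month are assumptions picked for illustration, and it pretends every run gets stuck for 27 attempts.&lt;/p&gt;

```python
# Back-of-envelope retry cost. Price and schedule are assumed values.
PRICE_PER_MILLION_TOKENS_USD = 2.00  # assumption, varies by model
ATTEMPTS_PER_STUCK_LOOP = 27


def attempt_cost_usd(tokens: int) -> float:
    """Cost of one model call of the given size at the assumed price."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


# One stuck loop on a small context.
one_small_loop = ATTEMPTS_PER_STUCK_LOOP * attempt_cost_usd(4_000)

# The same loop on a bigger context, across a fleet, for a month.
agents = 10
runs_per_day = 5  # "several times a day", assumed
days = 30
monthly_total = (
    ATTEMPTS_PER_STUCK_LOOP * attempt_cost_usd(40_000) * agents * runs_per_day * days
)

print(f"One stuck loop at 4,000 tokens: ${one_small_loop:.2f}")
print(f"Ten agents at 40,000 tokens for a month: ${monthly_total:,.2f}")
```

&lt;p&gt;Under those assumptions the single loop costs about 22 cents and the scaled version costs about $3,240. The inputs are invented, but the shape is the point: the multiplier is the loop, not the call.&lt;/p&gt;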

&lt;h2&gt;
  
  
  Why agents thrash
&lt;/h2&gt;

&lt;p&gt;Traditional retries assume the failure is transient. The server was busy. The connection dropped. Wait a beat and the same input works.&lt;/p&gt;

&lt;p&gt;Agent failures are often not transient. The model produced output that breaks a hard rule. A paragraph too long. A forbidden word. A schema mismatch. Feed the same prompt back and you get a slightly different wrong answer. The constraint that blocked attempt one blocks attempt 27.&lt;/p&gt;

&lt;p&gt;So the loop runs until something else stops it. A timeout. A token budget. A human noticing the bill. None of those are good stopping conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix is a counter and a ceiling
&lt;/h2&gt;

&lt;p&gt;You do not need anything fancy. You need two things most retry loops skip.&lt;/p&gt;

&lt;p&gt;First, count the attempts. Not just "did it fail," but "how many times in a row." A simple integer.&lt;/p&gt;

&lt;p&gt;Second, set a ceiling. After N tries, stop. Escalate. Write the failure somewhere a human will see it. Hand the work to a different path. Anything except trying again forever.&lt;/p&gt;

&lt;p&gt;In my case the right move was obvious in hindsight. After three failed repairs, the draft should have been flagged for a human and the loop should have stopped. Instead it ground through 27 attempts before a separate check caught it.&lt;/p&gt;
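
&lt;p&gt;A counter and a ceiling can be sketched in a few lines of plain Python. The names &lt;code&gt;run_with_ceiling&lt;/code&gt; and &lt;code&gt;RetryCeilingReached&lt;/code&gt; are hypothetical, and the repair step here is a stand-in that always fails. This is not the code from my pipeline, just the shape it should have had.&lt;/p&gt;

```python
from typing import Callable, Optional


class RetryCeilingReached(Exception):
    """Raised when a task has failed max_attempts times in a row."""


def run_with_ceiling(attempt: Callable[[], Optional[str]], max_attempts: int = 3) -> str:
    """Call attempt() until it returns a result or the ceiling is hit.

    attempt() returns the repaired text on success and None on failure.
    """
    failures = 0
    while True:
        result = attempt()
        if result is not None:
            return result
        failures += 1  # the counter
        if failures >= max_attempts:  # the ceiling
            raise RetryCeilingReached(f"gave up after {failures} failed attempts")


# Demo: a repair step that can never pass its check.
attempt_log = []


def hopeless_repair() -> Optional[str]:
    attempt_log.append("tried")
    return None


try:
    run_with_ceiling(hopeless_repair, max_attempts=3)
    escalated = False
except RetryCeilingReached as err:
    escalated = True
    print(f"Flag for a human: {err}")
```

&lt;p&gt;The loop pays for three attempts, then raises. What happens in the &lt;code&gt;except&lt;/code&gt; branch is up to you: a ticket, a message, a different code path. The only rule is that it does not call the model again.&lt;/p&gt;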

&lt;h2&gt;
  
  
  Make the ceiling visible
&lt;/h2&gt;

&lt;p&gt;The trap is that an uncapped loop looks fine from the outside. The agent is busy. Logs are filling. Work appears to be happening. Nobody sees the problem until the invoice arrives or a quota runs dry.&lt;/p&gt;

&lt;p&gt;So make the cap explicit and loud. Log when you hit it. Count retries as their own metric, separate from successes and failures. A spike in retries is an early warning that something is stuck, long before it shows up as cost.&lt;/p&gt;
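
&lt;p&gt;One way to make retries their own metric, sketched with the standard library. The 50 percent threshold and the logger name are arbitrary assumptions. In a real setup you would feed the same counts into whatever metrics system you already run.&lt;/p&gt;

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("agent.retries")

metrics = Counter()  # separate buckets: success, failure, retry


def record(outcome: str) -> None:
    """Count one event under its own outcome label."""
    metrics[outcome] += 1


def retry_share() -> float:
    """Fraction of recorded events that were retries (0.0 when empty)."""
    total = sum(metrics.values())
    return metrics["retry"] / total if total else 0.0


def check_retry_spike(threshold: float = 0.5) -> bool:
    """Log loudly and return True when retries dominate recent activity."""
    share = retry_share()
    if share >= threshold:
        log.warning("retry spike: %.0f%% of events are retries", share * 100)
        return True
    return False


# Demo: four of the six recorded events are retries.
for outcome in ["success", "retry", "retry", "retry", "failure", "retry"]:
    record(outcome)

spiking = check_retry_spike()
```

&lt;p&gt;Keeping retries in their own bucket is the design choice that matters. Folded into failures, a stuck loop looks like ordinary noise.&lt;/p&gt;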

&lt;p&gt;If you run more than one agent, track this across all of them. One stuck loop is annoying. Ten stuck loops at once is a real bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is the same idea as a budget
&lt;/h2&gt;

&lt;p&gt;A retry ceiling and a token budget are the same instinct. Both say an autonomous process should have a hard stop it cannot cross. The agent does not get to decide it needs one more try, one more call, one more dollar. The ceiling decides.&lt;/p&gt;

&lt;p&gt;That is why I built loop and budget limits into AgentGuard in the first place. Agents are good at doing things over and over. They are bad at knowing when to quit. The stop has to come from outside the agent.&lt;/p&gt;

&lt;p&gt;If your agents retry, audit those loops today. Find the ones with no counter. Add a ceiling. Make the ceiling visible. The 27-attempt loop I hit cost me cents because the task was tiny. The next one might not be.&lt;/p&gt;

&lt;p&gt;Want hard caps on retries, tokens, and spend for your agents? Start here: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>costcontrol</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>When JPMorgan Turns On AI Bank-Wide, Who Controls the Bill?</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sat, 06 Jun 2026 14:45:09 +0000</pubDate>
      <link>https://dev.to/pat9000/when-jpmorgan-turns-on-ai-bank-wide-who-controls-the-bill-5fe6</link>
      <guid>https://dev.to/pat9000/when-jpmorgan-turns-on-ai-bank-wide-who-controls-the-bill-5fe6</guid>
      <description>&lt;p&gt;JPMorgan just turned on AI for its entire global investment bank. Every employee, all 250,000 of them, now has access to AI tools. Microsoft says pitch deck generation dropped from four hours to about thirty seconds. Jamie Dimon put it plainly: more AI people and fewer bankers.&lt;/p&gt;

&lt;p&gt;This is the moment a lot of us have been waiting on. Not the layoffs part. The cost part. When a bank that size flips AI on for everyone, the bill stops being a rounding error. It becomes a line item the board asks about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bill nobody budgeted for
&lt;/h2&gt;

&lt;p&gt;Here is the quiet story under the headline. Bloomberg reported that bankers' Claude usage is racking up fees. Not a pilot. Not a sandbox. Real people doing real work, sending real tokens, all day, every day.&lt;/p&gt;

&lt;p&gt;That is the part most enterprise AI plans skip. You approve the rollout. You celebrate the four-hours-to-thirty-seconds win. Then the invoice shows up and nobody can explain why it is what it is. Who used what. Which team. Which workflow. Which prompt got run 40,000 times because someone wired it into a loop.&lt;/p&gt;

&lt;p&gt;JPMorgan is the first big bank to go bank-wide, but the pattern is everywhere now. Goldman rolled out an AI assistant to more than 10,000 workers. Morgan Stanley's AskResearchGPT sits on top of 70,000 research reports. Standard Chartered is cutting 8,000 jobs by 2030. A Citigroup study found 54% of financial jobs have high potential for automation.&lt;/p&gt;

&lt;p&gt;Every one of those numbers is a usage number waiting to happen. More seats means more calls. More calls means more spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why enterprise AI spend runs away
&lt;/h2&gt;

&lt;p&gt;Token spend is not like a software license. A license is a fixed cost. You pay for the seat whether the person logs in or not.&lt;/p&gt;

&lt;p&gt;AI is the opposite. You pay for what gets used, and usage is invisible until you measure it. A single power user can cost more than a whole team. A badly written automation can spend a month of budget in a weekend. Nobody is being reckless. The meter just runs faster than anyone expects.&lt;/p&gt;

&lt;p&gt;Three things make it worse in a big company:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fan-out.&lt;/strong&gt; One workflow gets adopted by a department. Now it runs thousands of times a day instead of ten.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-team visibility.&lt;/strong&gt; The bill comes in as one number. You cannot tell sales from research from ops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No ceiling.&lt;/strong&gt; Most teams have no hard cap. Spend climbs until someone notices the invoice, which is always after the fact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the real problem. By the time finance flags it, the money is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cost control actually looks like
&lt;/h2&gt;

&lt;p&gt;You do not fix this with a spreadsheet review once a quarter. You fix it at the point where the tokens get spent, in the code, before the call goes out.&lt;/p&gt;

&lt;p&gt;That means a few concrete things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A budget per agent, per team, per workflow.&lt;/strong&gt; Not a suggestion. A real limit the code respects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A stop.&lt;/strong&gt; When a workflow hits its cap, it stops instead of quietly spending the next department's money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visibility you can read without a data team.&lt;/strong&gt; Who spent what, broken down the way your org is actually shaped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is plumbing, not strategy. But it is the plumbing that decides whether your AI rollout looks smart in six months or shows up as a surprise on the wrong side of a budget meeting.&lt;/p&gt;
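
&lt;p&gt;Those three pieces fit in one small class. This is a plain-Python sketch, not AgentGuard's API: &lt;code&gt;TeamLedger&lt;/code&gt; and &lt;code&gt;TeamBudgetExceeded&lt;/code&gt; are hypothetical names, and the team names and dollar caps are invented.&lt;/p&gt;

```python
from collections import defaultdict


class TeamBudgetExceeded(Exception):
    """Raised before a charge that would push a team past its cap."""


class TeamLedger:
    def __init__(self, caps_usd: dict):
        self.caps_usd = dict(caps_usd)
        self.spent_usd = defaultdict(float)

    def charge(self, team: str, cost_usd: float) -> None:
        """Record spend for a team, or stop if it would cross the cap."""
        if self.spent_usd[team] + cost_usd > self.caps_usd[team]:
            raise TeamBudgetExceeded(f"{team} would pass its ${self.caps_usd[team]:.2f} cap")
        self.spent_usd[team] += cost_usd

    def report(self) -> dict:
        """Who spent what, keyed the way the org is shaped."""
        return {team: round(spent, 2) for team, spent in self.spent_usd.items()}


ledger = TeamLedger({"research": 100.00, "sales": 50.00})
ledger.charge("research", 60.00)
ledger.charge("sales", 20.00)

try:
    ledger.charge("research", 50.00)  # 60 + 50 would pass the 100 cap
    stopped = False
except TeamBudgetExceeded:
    stopped = True

print(ledger.report())
```

&lt;p&gt;The refused charge never lands on the ledger, so research stays at 60 and sales keeps its own budget untouched.&lt;/p&gt;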

&lt;h2&gt;
  
  
  The lesson for the rest of us
&lt;/h2&gt;

&lt;p&gt;You are not JPMorgan. You are not turning on AI for 250,000 people. But the math is the same at every scale.&lt;/p&gt;

&lt;p&gt;If you are a small business wiring an AI agent into customer support, or a solo builder running a research pipeline overnight, the failure mode is identical. Usage you cannot see. A bill you did not predict. A single bad loop that burns a week of budget while you sleep.&lt;/p&gt;

&lt;p&gt;The banks are just hitting it first, and bigger. They have the seats and the volume to make the problem loud. Watch what they do next, because the cost-control tooling they buy is the tooling everyone ends up needing.&lt;/p&gt;

&lt;p&gt;The good news: you can put the controls in before your bill gets loud. A budget cap, a hard stop, and per-workflow visibility cost you an afternoon to wire up. A runaway invoice costs you a lot more than that, and it costs you the trust of whoever signed off on the rollout.&lt;/p&gt;

&lt;p&gt;Dimon wants more AI people. Fine. Be the AI person who also knows where the money goes. That is the one who keeps getting to build.&lt;/p&gt;

&lt;p&gt;If you are running AI agents and you want a budget cap and a hard stop before the bill surprises you, &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt; does exactly that. It is a runtime budget, token, and rate limiter for AI agents. Set a ceiling, and the agent stops instead of spending past it.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>enterpriseai</category>
      <category>costcontrol</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>What Anthropic's MITRE ATT&amp;CK Report Means for Teams Running AI Agents</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Sat, 06 Jun 2026 14:45:06 +0000</pubDate>
      <link>https://dev.to/pat9000/what-anthropics-mitre-attck-report-means-for-teams-running-ai-agents-fhl</link>
      <guid>https://dev.to/pat9000/what-anthropics-mitre-attck-report-means-for-teams-running-ai-agents-fhl</guid>
      <description>&lt;p&gt;Anthropic just published a year of threat intel on AI-enabled attacks. It covers March 2025 to March 2026. They banned 832 accounts for malicious cyber activity and mapped what those accounts did to MITRE ATT&amp;amp;CK, the same framework enterprise security teams use to describe attacker behavior. They co-released it with Verizon's 2026 DBIR.&lt;/p&gt;

&lt;p&gt;If you run AI agents in production, this is primary-source data. Not a vendor scare deck. Here is what actually matters for the people building and shipping agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The attack work moved past writing code
&lt;/h2&gt;

&lt;p&gt;The headline number: 560 of those 832 accounts, about 67 percent, used Claude for malware writing. That is the expected one. Models are good at code, including bad code.&lt;/p&gt;

&lt;p&gt;The number that should change how you think is smaller. 54 accounts, about 6.5 percent, used Claude for lateral movement. That is a kill-chain stage that used to be hand-driven. Lateral movement is what an attacker does after they are already inside, hunting for the next box to compromise. Account discovery went up 8.9 percent. AI-assisted phishing actually dropped 8.6 percent.&lt;/p&gt;

&lt;p&gt;Read that shift plainly. The work moved away from initial access and toward post-compromise. Attackers are not just generating payloads. They are using AI to make real-time decisions deeper inside systems they already breached.&lt;/p&gt;

&lt;p&gt;For you, the takeaway is direct. Your input filter is not the main event. Your agent's blast radius is. The question is not only "can a bad prompt get in." It is "if this agent is compromised or coerced, how far can it reach and how long does it run before anyone notices."&lt;/p&gt;

&lt;h2&gt;
  
  
  The risk mix is getting worse, not just bigger
&lt;/h2&gt;

&lt;p&gt;Anthropic split the year into two six-month windows. Medium-or-higher-risk actors went from 33 percent of cases to 56 percent. The pool is not just growing. It is concentrating toward serious operators.&lt;/p&gt;

&lt;p&gt;One more finding worth sitting with. Technique count and platform type stopped predicting how dangerous an actor is. The high-risk ones do not spray a hundred techniques. They put AI on the operationally hard stuff and skip the easy parts. So "we saw a lot of weird activity" is no longer a clean severity signal. Volume tells you less than it used to.&lt;/p&gt;

&lt;h2&gt;
  
  
  MITRE ATT&amp;amp;CK does not yet cover agentic orchestration
&lt;/h2&gt;

&lt;p&gt;Here is the part that matters most if you ship agents. Anthropic says the attackers are chaining ATT&amp;amp;CK stages with minimal human input. Autonomous orchestration. And they say it directly: MITRE ATT&amp;amp;CK does not yet capture agentic orchestration. The framework that underwrites enterprise security operations is being outgrown by the threat. Anthropic is working with MITRE to evolve it.&lt;/p&gt;

&lt;p&gt;If you build agents, sit with that. The standard model of how attacks work was written for human-paced kill chains. Your own agents already run faster than that model assumes. So do the malicious ones. You are operating in a place the reference frameworks have not fully described yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What enterprises running agents should take from this
&lt;/h2&gt;

&lt;p&gt;Three concrete moves.&lt;/p&gt;

&lt;p&gt;First, treat agentic orchestration as its own threat category, not a footnote on your existing controls. An agent that can call tools, read data, and act in a loop is not a chatbot with extra steps. The thing that makes it useful, autonomy across many steps, is the thing that makes a compromise expensive.&lt;/p&gt;

&lt;p&gt;Second, lean on inference-time safeguards. Anthropic detects malware development and exfiltration patterns at the model layer. If you build on a frontier API, you inherit that floor for free. That is a real reason to keep production agent work on a monitored frontier model instead of an uncensored local model where no one is watching the traffic.&lt;/p&gt;

&lt;p&gt;Third, and this is the one most builders skip: cap the blast radius at runtime. The expensive failure with an agent is rarely a single bad call. It is an agent that runs unattended for hours, burning tokens, hitting APIs, doing the wrong thing at machine speed while you sleep. Lateral movement, in attacker terms. Runaway spend, in yours. Same shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cheapest control you can install today
&lt;/h2&gt;

&lt;p&gt;You cannot stop a nation-state actor with a config file. That is not the goal. The goal is limiting how much a compromised or misbehaving agent can cost you before you notice.&lt;/p&gt;

&lt;p&gt;That means hard limits at runtime. A budget cap so a loop cannot burn your whole month in a night. A token cap per task. A rate limit so one agent cannot hammer an API into a five-figure bill. These are boring controls. They are also the ones that actually save you, because they work whether the cause is an attacker, a bad prompt, or your own buggy code.&lt;/p&gt;
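
&lt;p&gt;Of the three, the rate limit is the one people most often leave out. A minimal sliding-window sketch, assuming a single process and a limit of two calls per minute chosen only to keep the demo short. &lt;code&gt;CallRateLimit&lt;/code&gt; is a hypothetical name, and the clock is injectable so the example runs without sleeping.&lt;/p&gt;

```python
import time
from collections import deque


class RateLimitExceeded(Exception):
    """Raised when an agent asks for more calls than the window allows."""


class CallRateLimit:
    def __init__(self, max_calls: int, per_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.clock = clock  # injectable so the limit can be tested without sleeping
        self.recent = deque()  # timestamps of calls inside the window

    def check(self) -> None:
        """Call this right before each outbound request."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.recent and now - self.recent[0] >= self.per_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_calls:
            raise RateLimitExceeded(f"more than {self.max_calls} calls in {self.per_seconds}s")
        self.recent.append(now)


# Demo with a fake clock: two calls allowed per 60 seconds.
fake_now = [0.0]
limit = CallRateLimit(max_calls=2, per_seconds=60, clock=lambda: fake_now[0])
limit.check()
limit.check()

try:
    limit.check()  # third call inside the same minute
    blocked = False
except RateLimitExceeded:
    blocked = True

fake_now[0] = 61.0  # the window has passed
limit.check()  # allowed again
```

&lt;p&gt;The limiter does not care why the calls are coming. An attacker, a bad prompt, and a buggy loop all hit the same wall.&lt;/p&gt;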

&lt;p&gt;That is exactly what AgentGuard does: runtime budget, token, and rate limits for AI agents, in a few lines. If you run agents in production, put a ceiling on them before you need one. Start here: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aisecurity</category>
      <category>agents</category>
      <category>anthropic</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>What GitHub Copilot Users Wish They Had a Week Ago</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:45:15 +0000</pubDate>
      <link>https://dev.to/pat9000/what-github-copilot-users-wish-they-had-a-week-ago-19ch</link>
      <guid>https://dev.to/pat9000/what-github-copilot-users-wish-they-had-a-week-ago-19ch</guid>
      <description>&lt;p&gt;GitHub Copilot moved to usage-based pricing, and the bills landed fast. Ars Technica covered the reaction: developers reporting that they burned through a full month of credits in a single day under the new model (&lt;a href="https://arstechnica.com/ai/2026/06/ai-costs-how-much-github-copilot-users-react-to-new-usage-based-pricing-system/" rel="noopener noreferrer"&gt;Ars Technica&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;If you have shipped anything with an AI coding tool lately, you know the feeling. The flat monthly fee was predictable. You knew the number. Now the number moves with how hard you work, and nobody told you where the ceiling is.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed
&lt;/h2&gt;

&lt;p&gt;Copilot went from flat-rate to usage-based. There is no firm public quota tier you can point at and say "I will never pass this." Power users hit a wall they could not see coming. Thirty days of nominal credits, gone in one focused day of work.&lt;/p&gt;

&lt;p&gt;The same week, Microsoft pushed small cheap models as the stay-under-budget option inside the same product. That is the tell. When a vendor ships a "use this to spend less" feature alongside a pricing change, the cost problem is real and they know it.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is the failure mode, not a Copilot problem
&lt;/h2&gt;

&lt;p&gt;Copilot is the headline. The pattern is bigger. Every AI dev tool that charges per token or per request has this shape now. You write code, the meter runs, and you find out the total at the end of the cycle.&lt;/p&gt;

&lt;p&gt;Uber reportedly caps AI coding spend at $1,500 per developer, per tool, per month. That cap exists because without it, the number runs. A big company can absorb a surprise. A solo builder or a small shop cannot.&lt;/p&gt;

&lt;p&gt;The fix is not "use the tool less" or "switch to the cheap model and hope." The fix is a budget envelope at the call site. A hard limit that lives in your code, watches spend in real time, and stops the run before it overshoots.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a runtime budget cap looks like
&lt;/h2&gt;

&lt;p&gt;This is the wedge for AgentGuard, the open-source budget and rate limiter I maintain (&lt;code&gt;pip install agentguard&lt;/code&gt;). It wraps your AI calls and enforces a ceiling you set. When you hit it, the run stops. No surprise invoice.&lt;/p&gt;

&lt;p&gt;Here is the shape of it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentguard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BudgetGuard&lt;/span&gt;

&lt;span class="c1"&gt;# Hard ceiling for this session. Pick a number you can defend.
&lt;/span&gt;&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_exceeded&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# kill the run, do not keep spending
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@guard.track&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# your normal AI call, any provider
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run your agent loop as usual.
# When spend crosses $5.00, the guard raises and stops.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;work_queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BudgetExceeded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hit the cap. Spent &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent_usd&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Stopping.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About twenty lines, give or take. The point is not the syntax. The point is that the limit lives in your code, not in a billing dashboard you check after the damage is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the call site matters
&lt;/h2&gt;

&lt;p&gt;You can set alerts in a vendor console. Alerts tell you after the fact. By the time the email arrives, the credits are spent. A runtime cap is different. It runs in the same loop as your work, counts every call, and refuses to make the call that would put you over.&lt;/p&gt;
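
&lt;p&gt;"Refuses to make the call that would put you over" implies a check before the request, not after. One way to sketch that is a worst-case estimate from the prompt size and the output limit. This is an illustration in plain Python, not how AgentGuard is implemented: &lt;code&gt;PreflightCap&lt;/code&gt; is a hypothetical name, and the flat $10 per million tokens price is an assumption (real pricing usually differs for input and output).&lt;/p&gt;

```python
class WouldExceedCap(Exception):
    """Raised instead of making a call that could push spend past the cap."""


class PreflightCap:
    def __init__(self, max_usd: float, usd_per_million_tokens: float):
        self.max_usd = max_usd
        self.usd_per_million_tokens = usd_per_million_tokens  # assumed flat price
        self.spent_usd = 0.0

    def worst_case_usd(self, prompt_tokens: int, max_output_tokens: int) -> float:
        """Upper bound on one call: full prompt plus the full output limit."""
        return (prompt_tokens + max_output_tokens) / 1_000_000 * self.usd_per_million_tokens

    def call(self, make_request, prompt_tokens: int, max_output_tokens: int):
        """Run make_request() only if its worst case still fits under the cap."""
        estimate = self.worst_case_usd(prompt_tokens, max_output_tokens)
        if self.spent_usd + estimate > self.max_usd:
            raise WouldExceedCap(f"spent ${self.spent_usd:.2f}, next call may cost ${estimate:.2f}")
        result = make_request()
        self.spent_usd += estimate  # a real guard would record actual usage here
        return result


cap = PreflightCap(max_usd=0.05, usd_per_million_tokens=10.00)
sent = []


def fake_request() -> str:
    sent.append("request")
    return "ok"


cap.call(fake_request, prompt_tokens=2_000, max_output_tokens=1_000)  # estimate $0.03

try:
    cap.call(fake_request, prompt_tokens=2_000, max_output_tokens=1_000)  # would reach $0.06
    refused = False
except WouldExceedCap:
    refused = True
```

&lt;p&gt;The second request never leaves the process. That is the difference from an alert: the refusal happens before the spend, so there is nothing to claw back.&lt;/p&gt;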

&lt;p&gt;It is also cross-provider. Copilot today, some other tool next quarter. If your budget logic lives in your code instead of one vendor's settings page, you do not start over every time you switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Usage-based pricing is not going away. It is the default direction for AI dev tools, and Copilot just made the cost real for a lot of people in one news cycle. Predictable spend is now something you build, not something the vendor hands you.&lt;/p&gt;

&lt;p&gt;Put the limit at the call site. Set a number. Let the code enforce it. That is the difference between a tool you control and a meter you watch.&lt;/p&gt;

&lt;p&gt;If you want the runtime budget cap without writing it yourself, that is exactly what AgentGuard does: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;bmdpat.com/tools/agentguard&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>costcontrol</category>
      <category>githubcopilot</category>
      <category>agentguard</category>
    </item>
    <item>
      <title>When Not to Use an AI Agent</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:45:11 +0000</pubDate>
      <link>https://dev.to/pat9000/when-not-to-use-an-ai-agent-59m4</link>
      <guid>https://dev.to/pat9000/when-not-to-use-an-ai-agent-59m4</guid>
      <description>&lt;p&gt;When everyone is shipping AI agents, the useful question is the opposite one: when should you not?&lt;/p&gt;

&lt;p&gt;I run a one-person operation with a fleet of agents doing real work every day. Writing drafts, scoring leads, checking that scheduled jobs actually did their job. They earn their keep. But I have also wired up agents for tasks that a plain script would have done better, cheaper, and with less drama. Here is where I draw the line now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a script, not an agent, when the steps never change
&lt;/h2&gt;

&lt;p&gt;An agent is worth it when the input is messy and the right next step depends on judgment. If the steps are fixed, you do not need a model in the loop. You need a function.&lt;/p&gt;

&lt;p&gt;I once had an agent renaming files and moving them into dated folders. It worked. It also cost tokens, took ten seconds instead of ten milliseconds, and failed in a new creative way about once a week. A six-line script replaced it and has not been touched since. The rule: if you can write the if-statements, write the if-statements.&lt;/p&gt;
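
&lt;p&gt;For a sense of scale, here is roughly what such a script can look like. It is a reconstruction for illustration, not the script I actually run: the function name, the use of the file's modified date, and the folder layout are all assumptions. The demo works in a temporary directory so nothing real gets moved.&lt;/p&gt;

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def file_into_dated_folder(path: Path, root: Path) -> Path:
    """Move path into root/YYYY-MM-DD/, using the file's modified date."""
    day = date.fromtimestamp(path.stat().st_mtime).isoformat()
    target_dir = root / day
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(path), str(target_dir / path.name)))


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    report = inbox / "report.txt"
    report.write_text("hello")

    moved = file_into_dated_folder(report, Path(tmp) / "archive")
    moved_ok = moved.exists() and not report.exists()
    folder_name = moved.parent.name
```

&lt;p&gt;No model, no tokens, no judgment. Every branch is visible on the page, which is exactly why it does not need an agent.&lt;/p&gt;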

&lt;h2&gt;
  
  
  Skip the agent when a wrong answer is expensive and silent
&lt;/h2&gt;

&lt;p&gt;Agents are confident even when they are wrong. That is fine when a human reviews the output before it matters. It is dangerous when the output flows straight into something irreversible.&lt;/p&gt;

&lt;p&gt;Money movement, production deletes, sending email to real customers. I let agents draft those actions. I do not let them commit those actions without a gate. A wrong trade or a wrong DELETE does not announce itself. By the time you notice, the damage is done. Put a human or a hard rule between the agent and the irreversible step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do not reach for an agent to dodge a decision you have not made
&lt;/h2&gt;

&lt;p&gt;This is the one that bit me most. When I did not actually know what I wanted, I would hand the fuzzy problem to an agent and hope it would figure out the goal. It cannot. It will pick a goal, usually a plausible wrong one, and pursue it with energy.&lt;/p&gt;

&lt;p&gt;An agent amplifies a clear intent. It does not supply one. If you cannot write the success condition in one sentence, the agent is not your problem. The missing decision is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the cost of the loop, not the cost of one call
&lt;/h2&gt;

&lt;p&gt;A single model call is cheap. An agent that retries, reflects, and calls tools in a loop is not. The cost is the loop, and loops can run away.&lt;/p&gt;

&lt;p&gt;I learned this the boring way. A draft in my own publishing pipeline failed a length check, so the repair agent fixed it, resubmitted, failed again, and did that twenty-five times before anything flagged it. No single run looked expensive. The total was real, and the task was dead on arrival. Unbounded retries are how a helpful agent quietly burns your budget.&lt;/p&gt;

&lt;p&gt;That last failure mode is exactly why I built AgentGuard. It caps spend, token use, and call counts per run, so a stuck loop stops itself instead of running until you happen to look. An agent should fail loud and cheap, not silent and expensive.&lt;/p&gt;
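&lt;p&gt;The cap itself is a small amount of code. Here is a minimal Python sketch of a bounded repair loop. It is not AgentGuard's API: &lt;code&gt;repair_until_valid&lt;/code&gt;, &lt;code&gt;RetryBudgetExceeded&lt;/code&gt;, and the default of three attempts are hypothetical names and values chosen for illustration.&lt;/p&gt;

```python
class RetryBudgetExceeded(Exception):
    """Raised when the repair loop runs out of attempts."""


def repair_until_valid(draft, repair, is_valid, max_attempts=3):
    """Apply `repair` until `is_valid` passes or the attempt cap is hit.

    Returns (draft, repairs_used). Raises RetryBudgetExceeded instead of
    looping forever, so a stuck task fails loud and cheap.
    """
    for attempt in range(1, max_attempts + 1):
        if is_valid(draft):
            return draft, attempt - 1  # repairs actually spent so far
        draft = repair(draft)          # hypothetical: one paid agent call
    if is_valid(draft):
        return draft, max_attempts
    raise RetryBudgetExceeded(f"still invalid after {max_attempts} repairs")


if __name__ == "__main__":
    # Toy stand-ins: the "repair" pads the draft, the check wants 3+ chars.
    fixed, used = repair_until_valid("x", lambda d: d + "x", lambda d: len(d) >= 3)
    print(fixed, used)
```

&lt;p&gt;The point of the design is that the cap lives in the loop, not in the agent's instructions. With a cap like this, the twenty-five resubmissions above would have stopped at whatever number you chose, with an exception you can alert on.&lt;/p&gt;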

&lt;h2&gt;
  
  
  A short test before you build one
&lt;/h2&gt;

&lt;p&gt;Ask three questions. Does the task need judgment, or just rules? If a step goes wrong, will someone notice before it hurts? Can I state the goal in one sentence?&lt;/p&gt;

&lt;p&gt;If the answers are rules, no, and no, you do not want an agent. You want a script with a human nearby. Agents are good. They are not the answer to every box on the board, and pretending otherwise is the fastest way to a surprising bill and a quiet mistake.&lt;/p&gt;

&lt;p&gt;Build the agent when the task is genuinely fuzzy and the stakes are reviewable. Bound every loop. And if you want hard budget, token, and rate limits around your agents so a runaway loop cannot drain your account, that is what I built AgentGuard for: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>agentguard</category>
      <category>automation</category>
    </item>
    <item>
      <title>llama.cpp ngl: when -ngl 99 still runs on your CPU</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 04 Jun 2026 23:25:21 +0000</pubDate>
      <link>https://dev.to/pat9000/llamacpp-ngl-when-ngl-99-still-runs-on-your-cpu-46im</link>
      <guid>https://dev.to/pat9000/llamacpp-ngl-when-ngl-99-still-runs-on-your-cpu-46im</guid>
      <description>

&lt;p&gt;You passed &lt;code&gt;-ngl 99&lt;/code&gt;. You expected the GPU to light up. Instead llama.cpp generates at 9 tokens per second and your fans stay quiet. The flag did nothing.&lt;/p&gt;

&lt;p&gt;I have hit this on every machine in my rig: a 3070, a 5070 Ti, and a 5090, all serving Llama 3.1 8B through llama.cpp. The &lt;code&gt;ngl&lt;/code&gt; flag is almost never the problem. The build you are running, the wheel pip installed, or the VRAM you do not actually have is the problem. Here is how to find out which, in about 30 seconds, then the five real causes ranked by how often they bite.&lt;/p&gt;

&lt;p&gt;If you do not yet know what the flag controls, read the companion post first: &lt;a href="https://bmdpat.com/blog/llama-cpp-n-gpu-layers-explained-2026" rel="noopener noreferrer"&gt;llama.cpp n-gpu-layers explained&lt;/a&gt;. This post assumes you know what &lt;code&gt;-ngl&lt;/code&gt; is supposed to do and want to know why it is not doing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 30-second diagnostic: read the load log
&lt;/h2&gt;

&lt;p&gt;Stop guessing. llama.cpp tells you exactly what it offloaded. When the model loads, look for this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;load_tensors: offloaded 33/33 layers to GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire diagnosis. Two numbers, three possible outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;33/33&lt;/code&gt; means every layer is on the GPU. If you are still slow, your problem is downstream (context, sampling, a CPU-bound KV cache).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0/33&lt;/code&gt; means nothing offloaded. Your &lt;code&gt;-ngl&lt;/code&gt; flag was accepted and ignored. This is almost always a build problem, and occasionally a driver mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;22/33&lt;/code&gt; means partial offload. The model did not fit. This is a VRAM problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything below is just "which of those three did you get, and what to do about it."&lt;/p&gt;
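&lt;p&gt;If you start llama.cpp from a script, the same triage can be automated. This is a minimal Python sketch that assumes the log line keeps the exact wording shown above. The function name &lt;code&gt;classify_offload&lt;/code&gt; and its return labels are made up for illustration.&lt;/p&gt;

```python
import re

# Matches the offload summary printed at model load, for example:
# "load_tensors: offloaded 33/33 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def classify_offload(log_text):
    """Return 'full', 'none', 'partial', or 'unknown' for a load log."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return "unknown"  # line not found: wrong log or different wording
    offloaded, total = int(match.group(1)), int(match.group(2))
    if offloaded == 0:
        return "none"     # nothing on the GPU: check the build first
    if offloaded == total:
        return "full"     # the flag is innocent: look downstream
    return "partial"      # the model did not fit: a VRAM problem


if __name__ == "__main__":
    sample = "load_tensors: offloaded 22/33 layers to GPU"
    print(classify_offload(sample))  # partial
```

&lt;p&gt;Feed it the captured stderr of your llama.cpp process and branch on the result instead of eyeballing the log.&lt;/p&gt;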

&lt;h2&gt;
  
  
  Cause 1: pip installed a CPU-only wheel (the most common by far)
&lt;/h2&gt;

&lt;p&gt;This one cost me the most time. A plain &lt;code&gt;pip install llama-cpp-python&lt;/code&gt; gives you a CPU-only build by default. No CUDA. The GPU layer setting (&lt;code&gt;n_gpu_layers&lt;/code&gt; in the Python API, the counterpart of &lt;code&gt;-ngl&lt;/code&gt;) is still accepted, parsed, and then silently does nothing because there is no GPU backend compiled in. You get &lt;code&gt;offloaded 0/33&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix is to reinstall with the CUDA backend turned on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CMAKE_ARGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-DGGML_CUDA=on"&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;llama-cpp-python &lt;span class="nt"&gt;--force-reinstall&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;CMAKE_ARGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-DGGML_CUDA=on"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;llama-cpp-python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--force-reinstall&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not have the CUDA toolkit set up for a source build, use the prebuilt wheel index instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llama-cpp-python &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://abetlen.github.io/llama-cpp-python/whl/cu124
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After reinstall, the load log on my 5070 Ti went from &lt;code&gt;offloaded 0/33&lt;/code&gt; at 9 tokens per second to &lt;code&gt;offloaded 33/33&lt;/code&gt; at 95 tokens per second on Llama 3.1 8B Q4_K_M. Same flag. Same model. The only thing that changed was the backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause 2: you built llama.cpp from source without a GPU flag
&lt;/h2&gt;

&lt;p&gt;Same failure, raw binary version. If you cloned the repo and ran &lt;code&gt;cmake -B build&lt;/code&gt; with no backend flag on Linux or Windows, you built CPU-only (macOS builds turn Metal on by default). &lt;code&gt;-ngl&lt;/code&gt; is ignored.&lt;/p&gt;

&lt;p&gt;Rebuild with the backend for your hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVIDIA&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release

&lt;span class="c"&gt;# Apple Silicon&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release

&lt;span class="c"&gt;# AMD (ROCm)&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_HIPBLAS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm it took: &lt;code&gt;./build/bin/llama-cli --version&lt;/code&gt; (the path the cmake commands above write to) should print your backend, and the startup banner lists the GPU device. If it says &lt;code&gt;CPU&lt;/code&gt; only, the build flag did not apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cause 3: the model does not fit, so llama.cpp offloads what it can
&lt;/h2&gt;

&lt;p&gt;You set &lt;code&gt;-ngl 99&lt;/code&gt; but the load log says &lt;code&gt;offloaded 22/33&lt;/code&gt;. The flag worked. The VRAM did not. llama.cpp loaded as many layers as fit and put the rest on the CPU. Those CPU layers drag the whole generation down because every token still waits on them.&lt;/p&gt;

&lt;p&gt;Two things eat VRAM you forgot about: the KV cache and the compute buffers. Both grow with context. With the default f16 cache, Llama 3.1 8B needs about 128 KiB of KV cache per token, so a 32K context window costs roughly 4 GiB on top of the weights.&lt;/p&gt;

&lt;p&gt;Quick wins, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shrink the context: &lt;code&gt;--ctx-size 4096&lt;/code&gt; instead of 32768.&lt;/li&gt;
&lt;li&gt;Drop a quant level: Q4_K_M to Q3_K_M shrinks the weights by roughly 20 percent.&lt;/li&gt;
&lt;li&gt;Check what actually fits before you download anything with the &lt;a href="https://dev.to/tools/quant-compare"&gt;quant comparison tool&lt;/a&gt;. Pick GPU, model, and quant, and it does the VRAM math for you.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On my 3070 (8 GB) a Llama 3.1 8B Q4_K_M fits fully only if I keep the context under about 8K. Push it higher and I am back to partial offload without changing the flag at all.&lt;/p&gt;
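&lt;p&gt;The context cost is plain arithmetic, so it is worth running before you pick a &lt;code&gt;--ctx-size&lt;/code&gt;. This is a rough Python sketch, and the function name &lt;code&gt;kv_cache_bytes&lt;/code&gt; is hypothetical. The defaults assume Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and an f16 cache at 2 bytes per value. It ignores compute buffers and anything else the GPU is already holding, so treat the result as a floor.&lt;/p&gt;

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


if __name__ == "__main__":
    for n_ctx in (4096, 8192, 32768):
        gib = kv_cache_bytes(n_ctx) / 2**30
        print(f"{n_ctx:>6} tokens: {gib:.1f} GiB")
```

&lt;p&gt;With these defaults the estimate is 0.5 GiB at 4K, 1 GiB at 8K, and 4 GiB at 32K. Add the weight file size on top and compare the total against your card.&lt;/p&gt;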

&lt;h2&gt;
  
  
  Cause 4: the wrong binary is first on your PATH
&lt;/h2&gt;

&lt;p&gt;If you have ever installed Ollama, LM Studio, and a hand-built llama.cpp on the same box, you have three &lt;code&gt;llama&lt;/code&gt; binaries. The one your shell finds first might be the CPU build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which llama-cli        &lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
where.exe llama-cli    &lt;span class="c"&gt;# Windows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that path is not the CUDA build you just compiled, call the right one with a full path or fix your PATH order. I lost an afternoon to this once because a stale binary in &lt;code&gt;~/.local/bin&lt;/code&gt; shadowed the new one.&lt;/p&gt;
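&lt;p&gt;&lt;code&gt;which&lt;/code&gt; only shows the winner. To see every copy, including the ones being shadowed, you can walk PATH yourself. This is a small Python sketch, and the helper name &lt;code&gt;find_all_on_path&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    # Windows resolves bare names through extensions such as .exe.
    suffixes = ["", ".exe"] if os.name == "nt" else [""]
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for suffix in suffixes:
            candidate = os.path.join(directory, name + suffix)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                if candidate not in hits:
                    hits.append(candidate)
    return hits


if __name__ == "__main__":
    for index, hit in enumerate(find_all_on_path("llama-cli")):
        marker = "used by the shell" if index == 0 else "shadowed"
        print(f"{hit}  ({marker})")
```

&lt;p&gt;If the list has more than one entry, the first one is what runs when you type &lt;code&gt;llama-cli&lt;/code&gt;.&lt;/p&gt;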

&lt;h2&gt;
  
  
  Cause 5: a CUDA or driver mismatch falls back to CPU
&lt;/h2&gt;

&lt;p&gt;A wheel built for CUDA 12.4 against a driver that only supports 12.0 can fail to initialize the GPU and quietly fall back to CPU. The tell is a warning during load, often swallowed in noisy logs.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;nvidia-smi&lt;/code&gt; and check the "CUDA Version" in the top right. Match your wheel or build to that or lower. When in doubt, the prebuilt &lt;code&gt;cu121&lt;/code&gt; wheel is the most forgiving across older drivers.&lt;/p&gt;
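&lt;p&gt;The "that or lower" rule is easy to script. This is a minimal Python sketch that reads the &lt;code&gt;CUDA Version: X.Y&lt;/code&gt; field from &lt;code&gt;nvidia-smi&lt;/code&gt; output and compares it with a wheel tag such as &lt;code&gt;cu124&lt;/code&gt;. The function names are hypothetical, and the sketch assumes tags of the form &lt;code&gt;cu&lt;/code&gt; plus a major version and a single-digit minor version.&lt;/p&gt;

```python
import re


def driver_cuda_version(nvidia_smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def wheel_cuda_version(tag):
    """Turn a wheel tag such as 'cu124' into (12, 4)."""
    digits = tag.removeprefix("cu")  # needs Python 3.9 or newer
    return int(digits[:-1]), int(digits[-1])


def wheel_fits_driver(tag, nvidia_smi_output):
    """True when the wheel's CUDA version is at or below the driver's."""
    driver = driver_cuda_version(nvidia_smi_output)
    if driver is None:
        return False  # no version found: treat as not usable
    return driver >= wheel_cuda_version(tag)


if __name__ == "__main__":
    header = "Driver Version: 535.00    CUDA Version: 12.2"  # sample text
    for tag in ("cu121", "cu124"):
        print(tag, wheel_fits_driver(tag, header))
```

&lt;p&gt;In real use, pass in the text captured from running &lt;code&gt;nvidia-smi&lt;/code&gt; rather than the sample string.&lt;/p&gt;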

&lt;h2&gt;
  
  
  The order I actually debug this
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Read the load log. &lt;code&gt;0/33&lt;/code&gt;, &lt;code&gt;22/33&lt;/code&gt;, or &lt;code&gt;33/33&lt;/code&gt; decides everything.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0/33&lt;/code&gt;: rebuild or reinstall with the GPU backend. This is the answer 80 percent of the time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;22/33&lt;/code&gt;: shrink context, drop a quant level, or use a smaller model.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;33/33&lt;/code&gt; and still slow: the flag is innocent. Look at context size, batch size, and whether your KV cache is on CPU.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;-ngl&lt;/code&gt; flag is one of the most blamed and least guilty settings in local inference. It is a passthrough. When it looks broken, something upstream of it is.&lt;/p&gt;

&lt;p&gt;If you run local models inside an agent loop, the next thing that will surprise you is cost, not speed. A retry storm or a runaway loop can burn tokens and watts long after you stopped watching. That is why I built &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;AgentGuard&lt;/a&gt;: a budget and rate limiter you wrap around your agent so a bad night caps out instead of running until morning. Free to install, and it works the same whether the model is local or an API.&lt;/p&gt;

&lt;p&gt;What is your load log telling you: &lt;code&gt;0/33&lt;/code&gt;, partial, or full?&lt;/p&gt;

</description>
      <category>llamacpp</category>
      <category>localllm</category>
      <category>gpuoffloading</category>
      <category>ngpulayers</category>
    </item>
    <item>
      <title>I made my blog API reject its own writer</title>
      <dc:creator>Patrick Hughes</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:45:08 +0000</pubDate>
      <link>https://dev.to/pat9000/i-made-my-blog-api-reject-its-own-writer-25lk</link>
      <guid>https://dev.to/pat9000/i-made-my-blog-api-reject-its-own-writer-25lk</guid>
      <description>

&lt;p&gt;I run a small content pipeline. A Codex reviewer reads drafts at 08:30 CT. A repair loop fixes the rejections at 08:55. A publisher ships approved drafts at 09:30. The whole thing lives in scheduled tasks on a Windows box in my office.&lt;/p&gt;

&lt;p&gt;It worked. Right up until I added a same-day heal path that skipped the reviewer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The heal path
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;brain think&lt;/code&gt; finds no post went live today and there is nothing approved in the review folder, it fires a heal agent. That agent picks a topic, writes the post, and ships it. Same hour.&lt;/p&gt;

&lt;p&gt;The whole point of the heal path is speed. The whole problem with the heal path is also speed.&lt;/p&gt;

&lt;p&gt;If the reviewer is the only thing stopping bad posts, and the heal path skips the reviewer, then on any day the regular pipeline stalls I have a fast lane straight to production with no QA in front of it. Not great.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first instinct (wrong)
&lt;/h2&gt;

&lt;p&gt;My first instinct was to add stronger instructions to the heal prompt. "Run a QA check before publishing. No exceptions." Capital letters. Three reminders. The works.&lt;/p&gt;

&lt;p&gt;This is the prompt-engineer version of yelling at a developer to remember to run the tests. It works until it does not.&lt;/p&gt;

&lt;p&gt;A few days later I caught a post that shipped without the QA step. The model decided the post was clean and skipped its own gate. The instructions said do not skip. The model skipped anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix that actually held
&lt;/h2&gt;

&lt;p&gt;I moved the gate out of the prompt and into the API.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/blog&lt;/code&gt; now requires two extra fields on any &lt;code&gt;published: true&lt;/code&gt; request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"qa_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"approved"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"qa_reviewer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"codex"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If either field is missing, the server returns HTTP 422 with code &lt;code&gt;qa_provenance_required&lt;/code&gt;. The post does not get written.&lt;/p&gt;

&lt;p&gt;The fields do not prove the QA actually happened. A lying client can send them anyway. But that is not the threat. The threat is a careless writer who forgets the gate exists. Forgetting now produces a loud error instead of a silent shipped post.&lt;/p&gt;
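&lt;p&gt;The server-side check is only a few lines. This is a minimal Python sketch of the gate, not the actual handler. The function name &lt;code&gt;check_publish_gate&lt;/code&gt;, the &lt;code&gt;missing&lt;/code&gt; key in the error body, and the success response are assumptions for illustration. Only the 422 status, the &lt;code&gt;qa_provenance_required&lt;/code&gt; code, and the two field names come from the real API.&lt;/p&gt;

```python
REQUIRED_QA_FIELDS = ("qa_status", "qa_reviewer")


def check_publish_gate(payload):
    """Return (http_status, body) for a POST /api/blog payload.

    Drafts pass through untouched. A publish request must carry both
    QA fields, otherwise it is rejected before anything is written.
    """
    if payload.get("published") is True:
        missing = [name for name in REQUIRED_QA_FIELDS if not payload.get(name)]
        if missing:
            return 422, {"code": "qa_provenance_required", "missing": missing}
    return 200, {"ok": True}  # hypothetical success body


if __name__ == "__main__":
    careless = {"title": "t", "content": "c", "published": True}
    print(check_publish_gate(careless))
```

&lt;p&gt;A request that forgets the fields gets the 422 back immediately, which is exactly the loud failure the gate is there to produce.&lt;/p&gt;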

&lt;p&gt;Then I updated the heal prompt to require the QA subagent run BEFORE the POST, with the result written to those two fields. The prompt and the API now agree on what a publishable post looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works when prompt instructions did not
&lt;/h2&gt;

&lt;p&gt;A prompt is a request. An API contract is a wall.&lt;/p&gt;

&lt;p&gt;When the gate lives in the prompt, the model is the judge of whether the gate was met. Models are generous judges of their own work. They will skip the gate and tell you the gate was met. You will not know until you read the post.&lt;/p&gt;

&lt;p&gt;When the gate lives in the API, the model is no longer the judge. The server is. The server does not care how confident the model is. The server checks the payload. If the fields are not there, the post does not exist.&lt;/p&gt;

&lt;p&gt;This is the same reason CI reruns the checks your git hooks already ran locally: anyone can skip a local hook with &lt;code&gt;git commit --no-verify&lt;/code&gt;. The local check is a courtesy. The remote check is the law.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general pattern
&lt;/h2&gt;

&lt;p&gt;If you are running agents that produce real artifacts (blog posts, PRs, emails, code commits), look at where your quality gates live.&lt;/p&gt;

&lt;p&gt;If your gates live in the prompt, you are trusting the agent to be its own judge. This works most of the time. The times it does not work are the times you find out by reading the production output.&lt;/p&gt;

&lt;p&gt;If your gates live in the system the agent talks to (the API, the CI, the deploy pipeline), the gate runs regardless of what the agent thinks. The agent can be careless, distracted, or wrong. The gate still fires.&lt;/p&gt;

&lt;p&gt;Move the rules out of the writer and into the reader. Anything else is a trust system masquerading as a control system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost of the change
&lt;/h2&gt;

&lt;p&gt;Maybe 40 lines of code in the API handler. One new error code. Two new fields on the POST payload. Five minutes to update the heal prompt to include those fields.&lt;/p&gt;

&lt;p&gt;The pipeline still ships posts at roughly the same speed. The only difference is that the day the model forgets the gate, the post does not silently go live with no review. The POST returns 422. The heal agent has to actually run the QA step to ship.&lt;/p&gt;

&lt;p&gt;That is the entire change. It is small. It holds.&lt;/p&gt;

&lt;p&gt;If you build agents that touch production, push the quality checks down the stack into systems the agent cannot route around. Trust no writer, including the one you wrote.&lt;/p&gt;

&lt;p&gt;If you want hard budget limits and loop guards for your own agents, start here: &lt;a href="https://bmdpat.com/tools/agentguard" rel="noopener noreferrer"&gt;https://bmdpat.com/tools/agentguard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agenticcoding</category>
      <category>aiagentops</category>
      <category>apidesign</category>
      <category>qualitygates</category>
    </item>
  </channel>
</rss>
