<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: praveenlavu</title>
    <description>The latest articles on DEV Community by praveenlavu (@praveenlavu).</description>
    <link>https://dev.to/praveenlavu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3991566%2Ff7152a58-11e0-4256-b1d8-a564907bd5a1.png</url>
      <title>DEV Community: praveenlavu</title>
      <link>https://dev.to/praveenlavu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/praveenlavu"/>
    <language>en</language>
    <item>
      <title>A green test suite proves less than you think</title>
      <dc:creator>praveenlavu</dc:creator>
      <pubDate>Thu, 18 Jun 2026 22:20:26 +0000</pubDate>
      <link>https://dev.to/praveenlavu/a-green-test-suite-proves-less-than-you-think-59cg</link>
      <guid>https://dev.to/praveenlavu/a-green-test-suite-proves-less-than-you-think-59cg</guid>
      <description>&lt;h1&gt;
  
  
  A green test suite proves less than you think
&lt;/h1&gt;

&lt;p&gt;The test that scared me was the one that passed.&lt;/p&gt;

&lt;p&gt;I had an integration test for a routing agent, the kind that takes a task and picks a capability to handle it. The test registered a new capability at runtime and then checked that the router would eventually route to it. Green run after run. Solid. I trusted it.&lt;/p&gt;

&lt;p&gt;Then I read it properly. It reused the same task string on every iteration of the loop. My scorer was deterministic by design: it hashed the task and indexed into the capability list, so a fixed string mapped to a fixed slot, and the newly registered capability lived at a different slot that the fixed string could never reach. The test claimed to check that the new capability got selected. The new capability was structurally unreachable. And the assertion passed anyway, because all the weak version of the check actually demanded was that the loop land on something registered, which it did every time.&lt;/p&gt;

&lt;p&gt;The test was not testing what it said it was testing. It was green for a reason that had nothing to do with the thing I cared about. The fix was almost insulting in its smallness: vary the task strings so the hash spreads across every slot, including the new one. Suddenly the test could fail when the feature was broken, which is the entire point of a test. One line. I had been shipping false confidence behind a checkmark.&lt;/p&gt;
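
&lt;p&gt;A minimal Python sketch of that failure, with stand-ins rather than the real router: the SHA-256 scorer, the capability names, and the task strings are all assumptions made for illustration. It shows why one fixed string can only ever reach one slot, and why the weak assertion stays green regardless.&lt;/p&gt;

```python
import hashlib


def pick_slot(task: str, n_slots: int) -> int:
    """Deterministic scorer (assumed): hash the task, index into the list."""
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_slots


def route(task: str, capabilities: list[str]) -> str:
    return capabilities[pick_slot(task, len(capabilities))]


capabilities = ["search", "summarize", "translate"]
capabilities.append("new_capability")  # registered at runtime

# Weak version: the same string on every iteration reaches exactly one slot,
# so "it routed to something registered" passes no matter what.
fixed_hits = {route("same task", capabilities) for _ in range(100)}
assert len(fixed_hits) == 1
assert all(hit in capabilities for hit in fixed_hits)

# Stronger version: varied strings spread across the slots, so the check
# can fail if the new capability is not actually reachable.
varied_hits = {route(f"task-{i}", capabilities) for i in range(200)}
assert "new_capability" in varied_hits
```

&lt;p&gt;The second assertion is the one that would go red if registration were broken. The first cannot, which is the whole problem.&lt;/p&gt;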

&lt;p&gt;That is the moment this whole piece is about. Not the bug. The checkmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that lies
&lt;/h2&gt;

&lt;p&gt;Here is the setup that produces this every time. You build an agent. You write unit tests. You watch line coverage climb to ninety-something percent. CI turns green. You deploy. And within a week the thing is making nonsensical decisions under load, falling over on inputs you never imagined a user would send, and getting stuck in loops your loop detector cannot see because two threads stepped on each other's state at the same instant.&lt;/p&gt;

&lt;p&gt;The unit tests were not lying to you. The functions genuinely worked in isolation. That is the trap. Line coverage measures whether your tests executed a line, not whether they cornered it. You can run every line in a file and assert nothing that matters about any of them, exactly like my integration test ran its loop and asserted the wrong thing. A green suite built on coverage tells you your tests touched the code. It tells you almost nothing about whether the code survives contact with production.&lt;/p&gt;

&lt;p&gt;And autonomous systems, agents that route, retry, fall back, remember, do not fail in isolated functions. They fail in the seams between functions. They fail where two modules meet and disagree about a type. They fail on the input the author never pictured. They fail when two requests arrive at once. They fail when a dependency dies and the system panics instead of limping. They fail on the edge case nobody wrote down. Coverage walks straight past all five, because every one of those failures lives in territory a unit test is structurally built to avoid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five seams, five suites
&lt;/h2&gt;

&lt;p&gt;The shift that changed how I test agents was to stop asking "did my tests run the code" and start asking "what are the distinct ways this system actually breaks, and do I have a suite aimed at each one." Five answers came back, and they are genuinely distinct failure classes, not five flavors of the same check. None of these dimensions is mine to claim as an invention. They are long-standing testing practice: integration testing, adversarial and fuzz testing, concurrency testing, fault injection, and property-based testing each have decades of prior art behind them. The engineering point is narrower and more honest: an autonomous agent needs all five aimed at it at once, because it can fail in all five ways in a single week, and a coverage number cannot stand in for any of them.&lt;/p&gt;

&lt;p&gt;The first seam is &lt;strong&gt;integration&lt;/strong&gt;, where modules compose. The most common bug in a multi-module system is not "function X has wrong logic," it is "X works fine but Y expected a different type," or "A only works if B was set up first." Mocks paper over exactly this: they return what you told them to and never enforce the real interface. The result is a test that passes for the wrong reason, the same way my same-string test slept through a real defect.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;adversarial input&lt;/strong&gt;, the gap between the task you imagined and the task a real user sends, the hundred-thousand-character string, the embedded newline carrying a fake directive, the injection attempt, the empty string, the wall of emoji. The contract is not that nothing weird arrives. It is that weird input gets a safe answer or an honest error, never a crash and never a leak.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;concurrency&lt;/strong&gt;, the races that only appear when many requests hit shared state at once. A history list, a registry, a loop detector, anything two threads can write without a lock, will silently corrupt under load in a way no single-threaded test will ever reproduce.&lt;/p&gt;
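
&lt;p&gt;The shape of a test aimed at this seam, as a hedged Python sketch. The &lt;code&gt;History&lt;/code&gt; class and the thread counts are hypothetical, not taken from the agent itself. The test hammers shared state from several threads and then checks that no update was lost.&lt;/p&gt;

```python
import threading


class History:
    """Hypothetical shared routing history, guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0
        self._items: list[str] = []

    def record(self, item: str) -> None:
        # Read-modify-write on shared state. Without the lock, two threads
        # can interleave between the read and the write and lose an update.
        with self._lock:
            self._count = self._count + 1
            self._items.append(item)

    def snapshot(self) -> tuple[int, int]:
        with self._lock:
            return self._count, len(self._items)


def hammer(history: History, n_threads: int = 8, per_thread: int = 2000) -> tuple[int, int]:
    """Write to the history from many threads at once, then report totals."""

    def worker(thread_id: int) -> None:
        for i in range(per_thread):
            history.record(f"{thread_id}:{i}")

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return history.snapshot()


count, length = hammer(History())
assert count == 8 * 2000
assert length == 8 * 2000
```

&lt;p&gt;With the lock in place both totals are exact. Remove it and the counter can come up short, though only on some runs, which is why a race suite usually repeats the hammer many times rather than trusting a single pass.&lt;/p&gt;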

&lt;p&gt;The fourth is &lt;strong&gt;failure cascade&lt;/strong&gt;, what happens when the pieces an agent depends on, the registry, the scorer, the loop detector, start dying. A naive build lets any one failure crash the whole call. A real one degrades, and the failure you actually have to test is all of them dying at once, because real outages are correlated and take down several things together.&lt;/p&gt;
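
&lt;p&gt;One way to picture the test, sketched in Python under stated assumptions: the router signature, the &lt;code&gt;SAFE_DEFAULT&lt;/code&gt; capability, and the choice to keep routing when the loop detector is down are all illustrative, not the real design. The interesting case is the last one, where every dependency raises at once.&lt;/p&gt;

```python
from typing import Callable

SAFE_DEFAULT = "fallback_capability"  # hypothetical last-resort capability


def route_with_degradation(
    task: str,
    list_capabilities: Callable[[], list[str]],
    score: Callable[[str, str], float],
    is_looping: Callable[[str], bool],
) -> str:
    """Route a task, degrading step by step instead of crashing."""
    try:
        if is_looping(task):
            return SAFE_DEFAULT
    except Exception:
        pass  # loop detector down: carry on without it (an assumed policy)

    try:
        candidates = list_capabilities()
    except Exception:
        return SAFE_DEFAULT  # registry down: nothing to choose from
    if not candidates:
        return SAFE_DEFAULT

    try:
        return max(candidates, key=lambda cap: score(task, cap))
    except Exception:
        return candidates[0]  # scorer down: fixed but still registered choice


def dead(*_args):
    """Stand-in for any dependency that is currently failing."""
    raise RuntimeError("dependency unavailable")


# Correlated outage: registry, scorer, and loop detector all fail together.
all_down = route_with_degradation("summarize this", dead, dead, dead)
assert all_down == SAFE_DEFAULT

# Partial outage: only the scorer is dead.
scorer_down = route_with_degradation(
    "summarize this", lambda: ["search", "summarize"], dead, lambda task: False
)
assert scorer_down == "search"
```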

&lt;p&gt;The fifth is &lt;strong&gt;property-based testing&lt;/strong&gt;, where instead of writing examples you state an invariant and let a generator hunt thousands of inputs for the one that breaks it. The invariants that look obvious, "routing always returns a real capability or a clean error, nothing in between," are exactly the ones a generated single-character task or a Unicode combining sequence quietly violates.&lt;/p&gt;
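
&lt;p&gt;A bare-bones version of the idea using only the Python standard library. The router, the alphabet, and the &lt;code&gt;RoutingError&lt;/code&gt; type are hypothetical. A dedicated framework such as Hypothesis adds smarter generation and shrinking of failing inputs; this sketch shows only the core move of stating the invariant once and letting generated inputs hunt for a violation.&lt;/p&gt;

```python
import hashlib
import random


class RoutingError(Exception):
    """The one clean error the router is allowed to raise."""


CAPABILITIES = ["search", "summarize", "translate"]


def route(task: str) -> str:
    """Hypothetical router under test."""
    if not task.strip():
        raise RoutingError("empty task")
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    return CAPABILITIES[digest[0] % len(CAPABILITIES)]


# Characters chosen to reach the awkward corners: whitespace, a combining
# accent, a zero-width joiner, an emoji, and injection-flavored punctuation.
ALPHABET = ["a", "Z", "7", " ", "\n", "\u0301", "\u200d", "\U0001F600", "'", ";"]


def random_task(rng: random.Random) -> str:
    length = rng.choice([0, 1, 1, 2, 5, 50])
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def count_violations(trials: int = 2000, seed: int = 0) -> int:
    """Invariant: a real capability or a clean RoutingError, nothing else."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        task = random_task(rng)
        try:
            result = route(task)
        except RoutingError:
            continue  # clean error: allowed
        except Exception:
            violations += 1  # any other exception breaks the contract
            continue
        if result not in CAPABILITIES:
            violations += 1
    return violations


violations = count_violations()
assert violations == 0
```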

&lt;h2&gt;
  
  
  What the checkmark should mean
&lt;/h2&gt;

&lt;p&gt;No single one of these dimensions catches everything, and that is the whole argument. Integration finds the type-contract and setup-order bugs and is blind to races. Adversarial finds the injection and the boundary crash and never sees a component failure. Concurrency finds the race and ignores the malformed input. Failure-cascade finds the un-graceful crash and says nothing about invariants. Property-based finds the violated invariant and the boundary input and cannot see a thread race. The value is not in any one suite. It is in the combination, five independent nets under the trapeze, each catching what the others structurally drop.&lt;/p&gt;

&lt;p&gt;Each suite is its own short read, linked below, with the one specific failure it exists to catch and the cheapest test that catches it.&lt;/p&gt;

&lt;p&gt;I am not going to pretend this makes a system bulletproof. It does not. There are failures none of these five see, and there will be a sixth seam I learn about the hard way at 2am some night that is already coming. But the difference between the green checkmark before and the green checkmark after is the difference between a number and a sentence. Before, green meant "the tests ran the code." After, green means "this composes correctly, handles hostile input without leaking, holds together under concurrent load, degrades instead of crashing when its dependencies die, and keeps its invariants across inputs I never thought to write down."&lt;/p&gt;

&lt;p&gt;That is a checkmark worth trusting. The first one was just a string that happened to match.&lt;/p&gt;

&lt;p&gt;This is the hub for a five-part series, one short read per dimension:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/dispatch/mocks-pass-for-the-wrong-reason"&gt;Mocks let your integration test pass for the wrong reason&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/dispatch/happy-path-never-met-a-user"&gt;Your happy-path test never met a real user&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/dispatch/race-your-tests-never-find"&gt;The race your single-threaded test will never find&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/dispatch/test-the-correlated-outage"&gt;Test the outage where everything fails at once&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/dispatch/state-invariants-not-examples"&gt;Stop writing examples, start stating invariants&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devjournal</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>LLM Self-Preference Bias: How Anonymized Peer Review Fixes It</title>
      <dc:creator>praveenlavu</dc:creator>
      <pubDate>Thu, 18 Jun 2026 22:20:23 +0000</pubDate>
      <link>https://dev.to/praveenlavu/llm-self-preference-bias-how-anonymized-peer-review-fixes-it-3hh8</link>
      <guid>https://dev.to/praveenlavu/llm-self-preference-bias-how-anonymized-peer-review-fixes-it-3hh8</guid>
      <description>&lt;h1&gt;
  
  
  LLM Self-Preference Bias: How Anonymized Peer Review Fixes It
&lt;/h1&gt;

&lt;p&gt;The panel had been agreeing with itself for a week before I noticed, and the worst part is that the logs looked healthy the whole time.&lt;/p&gt;

&lt;p&gt;I had built what felt like a clean idea. Several frontier models, different families, each one judging a pool of candidate outputs and ranking them best to worst. A jury of machines. I would generate a handful of answers, let the panel vote, take the winner, and trust that five independent opinions beat one. That was the whole pitch I had sold myself at 1am, and for a few days it ran without complaint. The rankings came in. A winner emerged every round. The dashboard was green.&lt;/p&gt;

&lt;p&gt;Then I started actually reading what won.&lt;/p&gt;

&lt;p&gt;The outputs the panel kept crowning were not the sharpest. They were the ones that sounded a particular way. Numbered lists where the content did not need numbering. A certain rhythm to the sentences. A house style. I stared at it for a while before the shape of it landed, and when it did it was a little sickening: my panel was not selecting for quality. It was selecting for resemblance. The judges were rewarding the candidates that wrote the way the judges write. I had built a popularity contest and dressed it up as an evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing nobody tells you you assumed
&lt;/h2&gt;

&lt;p&gt;The premise underneath every multi-model panel is that the judges are neutral. You assume a model reading an unlabeled answer scores it on merit. It does not. Panickssery and colleagues measured this directly in 2024, in a NeurIPS paper with the unambiguous title "LLM Evaluators Recognize and Favor Their Own Generations." They found GPT-4 preferred its own output at a pairwise win rate above 0.90 on summarization tasks. In over ninety percent of head-to-head comparisons, the model picked the answer it had written. Not because it was better. Because it was its own.&lt;/p&gt;

&lt;p&gt;The effect is directional across families. Prose in one model family's house style reads better to a judge from that same family. A more hedged, more structured answer reads better to a judge that writes that way. So when I assembled a panel and let it vote on a pool that included its own members' outputs, what I actually measured was which style happened to be most common among my evaluators. The highest-scoring answer was the one whose fingerprint matched the room. I had spent the planning at 1am congratulating myself on independence, and built the opposite.&lt;/p&gt;

&lt;p&gt;And it is not only the obvious bias. Once I went looking, there were three of them stacked on top of each other. Self-preference was the loud one. Underneath it sat verbosity bias, where models score longer answers higher because length reads as effort and authority, even when the extra words say nothing. So my selection criterion was quietly drifting toward "writes the most" rather than "answers best." And under that sat position bias, where the first answer in an ordered list anchors the judgment, the same anchoring documented in human juries, so whichever candidate happened to appear first carried a structural head start that had nothing to do with being right.&lt;/p&gt;

&lt;p&gt;Three biases, one panel, all of them invisible in a green dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrong fix I reached for first
&lt;/h2&gt;

&lt;p&gt;My first instinct was to out-engineer it. Add a rubric. Tell every judge, in the prompt, to ignore style and length and score only on correctness. Lecture the jury about fairness before it deliberates.&lt;/p&gt;

&lt;p&gt;It did almost nothing, and in hindsight it could not have. You cannot instruct a model out of a preference it does not know it has. The recognition is happening below the level the prompt can reach. The judge is not consciously thinking "this is mine, I shall reward it." It is reading prose that matches its own training distribution and finding it more fluent, more correct-feeling, more right. Asking it to be fair is asking it to notice a bias it cannot see. I was trying to argue a model out of its own reflection.&lt;/p&gt;

&lt;p&gt;The real problem was not that the judges were biased. It was that the judges could tell whose work they were reading. The bias needed information to operate, and I was handing that information over for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The turn
&lt;/h2&gt;

&lt;p&gt;The fix was not mine, and I want to be clear about that, because the elegant part was already sitting in public when I got there. Andrej Karpathy had published a small project called llm-council that solves exactly this, and the mechanism is almost insultingly simple: do not let the judges know whose output they are reading.&lt;/p&gt;

&lt;p&gt;That is the entire idea. Before the panel votes, you strip every identity off the candidates. The first answer becomes "response A," the second "response B," and so on. No model name. No provider. No tell. The server keeps a private mapping of which label belongs to which model, a clean one-to-one assignment in both directions, so that after the votes are in you can reverse it and reconstruct exactly who scored what. The judges see only neutral labels and the text. The information the bias needs to operate is simply absent during the vote.&lt;/p&gt;
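
&lt;p&gt;The mechanism is small enough to sketch in a few lines of Python. This is an illustration of the idea, not the llm-council source: the function names, the model names, and the shuffle are assumptions of the sketch.&lt;/p&gt;

```python
import random
import string


def anonymize(
    outputs: dict[str, str], rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (label -> text) for the judges and a private (label -> model) map."""
    models = list(outputs)
    rng.shuffle(models)  # assumed: a model should not always draw the same label
    labels = [f"response {string.ascii_uppercase[i]}" for i in range(len(models))]
    label_to_model = dict(zip(labels, models))
    blinded = {label: outputs[model] for label, model in label_to_model.items()}
    return blinded, label_to_model


def deanonymize(ranking: list[str], label_to_model: dict[str, str]) -> list[str]:
    """Reverse the mapping after the vote to see which model placed where."""
    return [label_to_model[label] for label in ranking]


# Hypothetical candidate pool; the keys never reach a judge.
outputs = {
    "model-x": "Answer one.",
    "model-y": "Answer two.",
    "model-z": "Answer three.",
}
blinded, label_to_model = anonymize(outputs, random.Random(7))

# Judges see only neutral labels and text.
assert sorted(blinded) == ["response A", "response B", "response C"]
# The private mapping is one-to-one, so every vote can be traced back.
assert sorted(label_to_model.values()) == sorted(outputs)
```

&lt;p&gt;The sketch supports up to twenty-six candidates, since it draws labels from the uppercase alphabet.&lt;/p&gt;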

&lt;p&gt;It works because you cannot favor what you cannot identify. Self-preference dies the moment the judge does not know which answer is its own. Hiding the names also strips the most obvious recognition signal, which dents style bias too, though not all the way, because if a model writes in an unmistakable rhythm its identity is still legible in the prose itself. Anonymization breaks the label, not the fingerprint. But the label was doing most of the damage, and removing it changed the room.&lt;/p&gt;

&lt;p&gt;The first time I rewired my panel to run blind and watched the rankings come back, the winners were different. The house-style answers stopped sweeping. The thing that had been quietly rigging my evaluation for a week was just gone, because I had taken away the one piece of information it ran on. That is a strange and specific kind of satisfaction, watching a bias evaporate not because you argued with it but because you starved it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counting the votes honestly
&lt;/h2&gt;

&lt;p&gt;Hiding the names fixes who wins a comparison. There is a second question underneath it: how you turn a panel of rankings into a single decision. Karpathy's project keeps that part as plain as the anonymization. Each judge ranks the anonymized pool, and the project aggregates those rankings by average rank position. You take every judge's placement for a given candidate, average them, and the answer with the best average ranking across the panel wins. That is it. No weighting, no points table, just the mean of where each judge put each answer.&lt;/p&gt;

&lt;p&gt;What I like about averaging the rank is what it captures and what it ignores. It does not care how many judges put an answer first, which is the trap of a naive plurality vote. Counting only first-place picks can crown an answer that two judges adored and the rest ranked near the bottom, because a thin lead still counts as a win. Average rank position cannot do that. A candidate that four of five judges place second and one judge places first lands at a mean rank of 1.8, and the panel correctly reads it as broadly acceptable rather than narrowly adored. Broad acceptability is exactly the signal you want when you are picking the single best output from a pool, and the mean of the rankings is what surfaces it.&lt;/p&gt;

&lt;p&gt;If I were extending the project I would probably reach for something like Borda-style scoring on top, turning each placement into points and summing them. One caveat: when every judge ranks every candidate and the points fall linearly with position, a Borda count orders the pool exactly as the mean rank does, so the refinement only changes outcomes with a non-linear points table, where a near-miss second place can carry explicit extra weight, or with partial rankings. That is my own refinement, not what the repo ships. What llm-council actually does is the simpler and honestly sufficient thing: anonymize, rank, average the positions, take the winner. The discipline is in the order of operations, not in any clever counting.&lt;/p&gt;
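
&lt;p&gt;Both counting schemes fit in a short Python sketch. The function names and the five-judge example are illustrative, and the sketch assumes every judge ranks every candidate. The example pool is the case described above: candidate B is placed second by four judges and first by one.&lt;/p&gt;

```python
def mean_rank(rankings: list[list[str]]) -> dict[str, float]:
    """Average 1-based position of each candidate (lower is better)."""
    totals: dict[str, int] = {}
    for ranking in rankings:
        for position, candidate in enumerate(ranking, start=1):
            totals[candidate] = totals.get(candidate, 0) + position
    return {candidate: total / len(rankings) for candidate, total in totals.items()}


def borda(rankings: list[list[str]]) -> dict[str, int]:
    """Linear Borda count: n-1 points for first place down to 0 for last."""
    scores: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return scores


# Five judges, three anonymized candidates, each ranking best to worst.
rankings = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["C", "B", "A"],
    ["C", "B", "A"],
    ["B", "A", "C"],
]

averages = mean_rank(rankings)   # {"A": 2.0, "B": 1.8, "C": 2.2}
points = borda(rankings)         # {"A": 5, "B": 6, "C": 4}

by_mean = sorted(averages, key=averages.get)
by_borda = sorted(points, key=points.get, reverse=True)
assert by_mean == ["B", "A", "C"]
assert by_borda == by_mean
```

&lt;p&gt;B wins on mean rank even though it collected only one first-place vote, while A and C collected two each. The linear Borda count lands on the same order, as it must on complete rankings, because each Borda score is just a rescaling of the mean rank.&lt;/p&gt;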

&lt;h2&gt;
  
  
  Where this is not enough
&lt;/h2&gt;

&lt;p&gt;I want to be honest about what anonymization does not fix, because I shipped it feeling like I had solved the panel, and I had solved one third of it.&lt;/p&gt;

&lt;p&gt;Self-preference is gone. Two biases are still in the room.&lt;/p&gt;

&lt;p&gt;Verbosity bias is completely untouched. A longer answer is longer regardless of what label it wears. Anonymization operates on identity, not length, so if your task rewards thoroughness the panel will keep favoring the candidate that simply wrote more. The only real mitigations are a rubric that explicitly penalizes length without substance, or normalizing every candidate to the same length before the vote. Neither comes for free.&lt;/p&gt;

&lt;p&gt;Position bias is only half-addressed. Randomizing which model draws which label between rounds helps, so no single model always sits in the anchor slot. But within any one judge's view, response A still appears before response B, and on the marginal calls, which is most of the interesting ones, first-listed still wins a little more often. The honest fix is an independent random ordering per judge, not just per round.&lt;/p&gt;
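
&lt;p&gt;A per-judge shuffle is only a few lines of Python. The judge names, the labels, and the seed below are hypothetical. Each judge gets its own presentation order, so no candidate sits in the anchor slot for the whole panel.&lt;/p&gt;

```python
import random


def per_judge_orderings(
    labels: list[str], judges: list[str], seed: int
) -> dict[str, list[str]]:
    """Give every judge an independently shuffled presentation order."""
    rng = random.Random(seed)
    orderings: dict[str, list[str]] = {}
    for judge in judges:
        order = list(labels)  # copy so the caller's list is left untouched
        rng.shuffle(order)
        orderings[judge] = order
    return orderings


labels = ["response A", "response B", "response C", "response D"]
judges = ["judge-1", "judge-2", "judge-3"]
orderings = per_judge_orderings(labels, judges, seed=11)

# Every judge still sees the full pool, only in its own order.
assert all(sorted(order) == sorted(labels) for order in orderings.values())
```

&lt;p&gt;Keeping the seed in the run log makes a contested ranking reproducible, since the exact order each judge saw can be rebuilt afterward.&lt;/p&gt;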

&lt;p&gt;And there is a quieter trap I walked into while feeling clever about diversity. A five-judge panel built from five models in the same family is not five opinions. Shared training lineage means shared preferences, so in the limit a fully correlated panel of five is one judge counted five times wearing different name tags. Anonymization cannot save you from that, because the bias is in the composition, not the labels. The fix is upstream: compose the panel from genuinely different architectures, or measure how often your judges disagree and weight accordingly. A panel that always agrees is not a panel. It is an echo with a quorum.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;The mechanism is the right foundation even with those three caveats, and the reason is structural. You do not coax a biased judge into fairness with a better prompt. You remove the information the bias needs to operate, so it cannot operate, and then you treat the residue as second-order cleanup. Structure the problem so the failure mode is impossible rather than asking the model nicely not to fail.&lt;/p&gt;

&lt;p&gt;That is the part I keep coming back to. I lost a week to a panel that looked healthy while it voted for its own reflection, and the fix was not a clever model or a longer rubric. It was taking away a name tag. Karpathy had already shipped the idea, plainly, and the only work left for me was recognizing my own problem in it and admitting the version I had built was a popularity contest. If you are wiring models to judge models, run the panel blind before you trust a single ranking it gives you. Mine looked fine for a week. It was quietly rigged the whole time.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;llm-council&lt;/strong&gt;, Andrej Karpathy, 2025. The label-anonymization design that this piece leans on, which aggregates the anonymized rankings by average rank position. Source: github.com/karpathy/llm-council&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Evaluators Recognize and Favor Their Own Generations&lt;/strong&gt;, Arjun Panickssery, Samuel R. Bowman, Shi Feng. NeurIPS 2024. The source of the GPT-4 self-preference win rate above 0.90. arXiv: 2404.13076&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Drift Detection for LLM Routing: Catching Silent Model Degradation</title>
      <dc:creator>praveenlavu</dc:creator>
      <pubDate>Thu, 18 Jun 2026 22:14:32 +0000</pubDate>
      <link>https://dev.to/praveenlavu/drift-detection-for-llm-routing-catching-silent-model-degradation-4dj9</link>
      <guid>https://dev.to/praveenlavu/drift-detection-for-llm-routing-catching-silent-model-degradation-4dj9</guid>
      <description>&lt;h1&gt;
  
  
  Drift Detection for LLM Routing: Catching Silent Model Degradation
&lt;/h1&gt;

&lt;p&gt;It's 2am and I am staring at a routing layer I spent weeks tuning, running a thought experiment that will not let me sleep. The router is doing exactly what I built it to do. Nothing in my code would change, nothing in my config would change, and yet I can see, plain as day, the night this system goes confidently, repeatedly wrong while every line of it stays correct. The failure is already baked in. I just have not been bitten by it yet.&lt;/p&gt;

&lt;p&gt;The setup is simple. I route incoming tasks across four capabilities: a fast cheap model, a slow expensive one, a retrieval tool, and a code-execution agent. Each task goes to one of them, and I watch a single binary signal, did the output pass the quality gate or not. Run that for a few thousand calls and the policy converges, the weights stabilize, the dispatcher learns which arm wins. For a while, life is good. And that good stretch is exactly the trap.&lt;/p&gt;

&lt;p&gt;Here is the scenario that keeps me up. The fast cheap model gets silently updated by its vendor, and its accuracy on my tasks quietly collapses. My router has no idea. It is carrying a high historical success estimate for that arm, earned honestly over weeks of good performance, and it keeps routing there because three weeks ago that was the right call. The dispatcher would not be broken. It would be right about a world that no longer existed. It would be wrong because it remembered too well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The assumption nobody tells you you made
&lt;/h2&gt;

&lt;p&gt;Multi-armed bandit routing, from Thompson Sampling to UCB to plain epsilon-greedy, rests on one quiet premise: each arm's true success rate is fixed. Pick the arm with the best estimate, keep nudging it as outcomes arrive, converge on the best option. Clean, and it works right up until the ground moves. In production LLM routing the ground moves constantly. Models get updated, prompts age, the external API you lean on degrades. The question was never whether drift would hit me. It was whether my routing layer would notice before my users did. By default, it wouldn't.&lt;/p&gt;

&lt;p&gt;The cruel detail is the inertia. With a slow learning rate the running estimate remembers roughly the last twenty observations, which feels fast enough until you picture an arm that has served two thousand calls at 85 percent. A moving average with that much history behind it does not flinch when the truth drops to 30 percent. It takes dozens of fresh failures just to halve the gap, and far longer to close it, and every one of those failures is a task routed straight into the hole.&lt;/p&gt;

&lt;p&gt;My first instinct was the obvious one, and it was wrong: turn up the learning rate, make the estimate forget faster. A faster learning rate makes every estimate jittery all the time, even on arms where nothing has changed. I would be trading one silent failure for a router that twitches at noise. That is not a fix, it is a different bug. What I actually needed was narrower. Not a shorter memory everywhere, but a way to forget on purpose, surgically, only on the one arm that had genuinely shifted, and only when a real shift had occurred. A tripwire per arm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The turn
&lt;/h2&gt;

&lt;p&gt;The tool already existed, and it had existed since 2007. ADWIN, for adaptive windowing, published by Bifet and Gavaldà at SIAM SDM 2007, does exactly the surgical thing I was reaching for. It watches a single stream of binary outcomes and keeps a window over them. After every new result it asks one question: is there a point inside this window where the older stretch and the recent stretch look like two different distributions, too far apart to be the same thing wearing noise?&lt;/p&gt;

&lt;p&gt;If no, the window just grows. Stable periods accumulate evidence, and a bigger window makes the test harder to fool, so it does not trip on ordinary variance. If yes, ADWIN declares drift, throws away everything before the split, keeps only the recent post-shift stretch, and tells you, so you reset the arm's estimate using only what survived. That collapse is the whole idea. A fixed window can only notice a shift once it has been present for about half its length, so you are always looking backward at a horizon you had to guess in advance. ADWIN's window grows without bound while things are calm, building the power to resist false alarms, then collapses hard the instant a real shift lands. The window size is an answer the data gives you, not a knob you set and pray over. For a bandit with several arms, I run one independent ADWIN per reward stream; they share nothing, because arms do not generally degrade at the same moment, and each one watching its own stream in isolation is not a simplification, it is the correct model.&lt;/p&gt;
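
&lt;p&gt;To make the grow-then-collapse behavior concrete, here is a deliberately naive Python sketch. It is not river's implementation, which compresses the window into buckets and uses a different bound. It keeps every observation, checks every split point, and uses the simpler Hoeffding-style cut from the 2007 paper as I read it. The class name and the periodic test streams are made up for the illustration, and the streams are deterministic stand-ins for Bernoulli outcomes so that the run reproduces exactly.&lt;/p&gt;

```python
import math


class NaiveAdwin:
    """Simplified adaptive window: O(n) work per update, for illustration only."""

    def __init__(self, delta: float = 0.002) -> None:
        self.delta = delta
        self.window: list[float] = []

    def mean(self) -> float:
        return sum(self.window) / len(self.window)

    def update(self, outcome: float) -> bool:
        """Add one outcome; return True if stale history was dropped."""
        self.window.append(float(outcome))
        drifted = False
        while self._drop_stale_prefix():
            drifted = True
        return drifted

    def _drop_stale_prefix(self) -> bool:
        n = len(self.window)
        total = sum(self.window)
        older_sum = 0.0
        for split in range(1, n):
            older_sum += self.window[split - 1]
            n_old, n_new = split, n - split
            m = 1.0 / (1.0 / n_old + 1.0 / n_new)
            # Hoeffding-style cut threshold, with delta spread over n splits.
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            gap = abs(older_sum / n_old - (total - older_sum) / n_new)
            if gap >= eps_cut:
                del self.window[:split]  # keep only the recent stretch
                return True
        return False


stable = [1, 1, 1, 1, 0] * 60   # 300 outcomes, success rate 0.80
degraded = [0, 0, 0, 1] * 10    # 40 outcomes, success rate 0.25

detector = NaiveAdwin(delta=0.002)
drift_steps = [
    step
    for step, outcome in enumerate(stable + degraded, start=1)
    if detector.update(outcome)
]

assert drift_steps and drift_steps[0] > 300  # silent while stable, fires after the shift
assert 0.5 >= detector.mean()                # surviving window reflects the new regime
```

&lt;p&gt;While the stream is stable the window only grows. Once enough post-shift outcomes accumulate, the old prefix is dropped in one motion and the window mean becomes a usable estimate of the new success rate.&lt;/p&gt;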

&lt;p&gt;A single sensitivity knob governs how eager the test is to fire, really the false-positive rate you will tolerate on a stable stream. Under the hood its tolerance band tightens when the split is balanced and is nearly impossible to trip when only one or two observations sit on one side, so a single fresh result never declares drift on its own; it carries a gentle penalty for checking every possible split point; and it tightens further on a low-variance arm that almost always succeeds, so real degradation on a near-perfect arm gets caught sooner rather than hiding in slack. The river library uses a variance-aware form of the bound that runs tighter for large windows and high-quality arms than the simpler version in the original 2007 paper; the two agree closely when the window is very small. For routing I settled on a sensitivity of 0.002. With a window around a thousand observations that keeps spurious firings well under one per five hundred evaluations per arm, which across four arms at a hundred routing decisions an hour is a false drift event roughly once every thirty hours, low enough not to pollute the policy, high enough that I am not waiting weeks to catch the real thing.&lt;/p&gt;

&lt;p&gt;The original authors prove two guarantees, and both held up. On a stationary stream the odds of a false alarm stay bounded by the sensitivity setting. And once a real shift of a given size lands, ADWIN catches it within a number of observations that scales inversely with the square of the shift magnitude, so a big drop is caught fast and a subtle one takes proportionally longer. Both bounds are tight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching it work
&lt;/h2&gt;

&lt;p&gt;I did not want to trust any of this on faith, so I built a small synthetic run to validate the design before it ever touched a real reward path. Four arms, five hundred steps, a fixed seed so it reproduces. Three arms hold steady at roughly 0.70, 0.65, and 0.60. The fourth, capability A, starts strong at 0.85 and then, at step 300, drops off a cliff to 0.25, standing in for exactly the silent vendor update I had been losing sleep over.&lt;/p&gt;

&lt;p&gt;The first time I saw the log line appear, the feeling was disproportionate to a synthetic test. Around step 318, eighteen steps after the true shift, ADWIN fired on capability A. Its estimate dropped from about 0.85 to about 0.28 as the window collapsed from over three hundred observations down to roughly a dozen, dumping the stale high-accuracy history in one motion. A second, smaller event near step 412 was just the window settling onto the new regime. By the end, capability A's routing weight had fallen from near 0.40 before the drift to around 0.11, its honest post-drift share, while the three stable arms held near 0.25 to 0.29 each. The eighteen-step lag lines up with the guarantee: the shift here is 0.60, which puts the theoretical floor at only a handful of observations, and constants and warm-up account for the rest.&lt;/p&gt;
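
&lt;p&gt;The size of that lag can be sanity-checked with a few lines of arithmetic. The Python sketch below assumes the simpler Hoeffding-style cut from the 2007 paper rather than river's variance-aware bound, and it treats the gap between old and new means as exactly the shift size, which a noisy stream only approximates. The function name is hypothetical.&lt;/p&gt;

```python
import math


def recent_needed(shift: float, n_old: int, delta: float) -> int:
    """Smallest number of post-shift outcomes at which a gap of `shift`
    between the old and new means reaches the cut threshold."""
    n_new = 1
    while True:
        n = n_old + n_new
        m = 1.0 / (1.0 / n_old + 1.0 / n_new)
        eps_cut = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
        if shift >= eps_cut:
            return n_new
        n_new += 1


bare_scaling = 1.0 / 0.60 ** 2            # about 2.8: the inverse-square term alone
fast = recent_needed(0.60, 300, 0.002)    # the drop used in the synthetic run
slow = recent_needed(0.30, 300, 0.002)    # a shift half as large

assert slow > 3 * fast
```

&lt;p&gt;Under these assumptions the inverse-square term alone is under three observations, and the logarithmic factor in the threshold lifts the requirement to roughly twenty, the same order as the lag in the run. Halving the shift multiplies the requirement several times over, which is the inverse-square behavior at work.&lt;/p&gt;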

&lt;p&gt;One design choice mattered more than the rest. The arm's estimate has to be decoupled from ADWIN's internal window: when drift fires, the monitor reads the fresh collapsed window's mean and adopts it as the new estimate, and that is the fast policy refresh the whole exercise exists to produce. The other thing I refused to give up was that no arm is ever zeroed out. A tiny floor on every weight keeps a sliver of exploration alive even for a failing arm, and a new arm's estimate is seeded at an uninformed 0.5, so the system keeps probing every arm and can notice the day a degraded one recovers. A degraded arm is not a dead arm. Sometimes the vendor ships a fix and you want to find out.&lt;/p&gt;
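
&lt;p&gt;A hedged Python sketch of those two rules. The proportional weighting, the floor value of 0.02, and the function names are assumptions for illustration and not necessarily how the companion implementation turns estimates into weights. What the sketch shows is the shape: a drift signal overwrites one arm's estimate, and the floor guarantees every arm a minimum share.&lt;/p&gt;

```python
UNINFORMED_PRIOR = 0.5  # starting estimate for an arm with no history


def on_drift(estimates: dict[str, float], arm: str, window_mean: float) -> None:
    """Adopt the collapsed window's mean as the arm's new estimate."""
    estimates[arm] = window_mean


def routing_weights(estimates: dict[str, float], floor: float = 0.02) -> dict[str, float]:
    """Mix a uniform floor with shares proportional to the estimates.

    Every weight is at least `floor` and the weights sum to 1.
    """
    k = len(estimates)
    if floor * k > 1.0:
        raise ValueError("floor is too large for this many arms")
    total = sum(estimates.values())
    weights = {}
    for arm, estimate in estimates.items():
        share = estimate / total if total > 0 else 1.0 / k
        weights[arm] = floor + (1.0 - floor * k) * share
    return weights


estimates = {"A": 0.85, "B": 0.70, "C": 0.65, "D": 0.60}
before = routing_weights(estimates)

on_drift(estimates, "A", window_mean=0.28)   # drift fired on capability A
after = routing_weights(estimates)

estimates["E"] = UNINFORMED_PRIOR            # a newly registered arm
with_new_arm = routing_weights(estimates)

assert before["A"] > after["A"] >= 0.02
assert abs(sum(after.values()) - 1.0) <= 1e-9
```

&lt;p&gt;Even if an arm's estimate fell all the way to zero, its weight would bottom out at the floor rather than vanish, which is what lets the router notice a later recovery.&lt;/p&gt;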

&lt;h2&gt;
  
  
  Where it does not belong
&lt;/h2&gt;

&lt;p&gt;A clean synthetic result does not make ADWIN a hammer for every problem, and pretending otherwise is how you get burned somewhere else.&lt;/p&gt;

&lt;p&gt;If an arm sees fewer than thirty observations an hour, the tolerance band is enormous, only a total collapse trips it, and the post-drift estimate is too noisy to trust anyway. Aggregate at a coarser granularity or reach for a Bayesian change-point detector with an informative prior.&lt;/p&gt;

&lt;p&gt;If the drift is gradual, ADWIN is the wrong tool by design. It is built for abrupt shifts, a model update, an endpoint degrading, a sudden change in the prompt mix. For an arm whose success rate decays a percent a week as its world knowledge ages, the window grows slowly, the gap between old and recent stays narrow, and detection lags by months. A scheduled two-sample test on rolling seven-day buckets is what that job wants.&lt;/p&gt;

&lt;p&gt;And if your non-stationarity is structural, if reward correlates with time of day or task type or session by design, ADWIN will fire constantly and churn your estimates into noise. That is not a routing system with drift detection. That is a contextual bandit in denial, and the fix is to model the context explicitly rather than detect it as drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;You don't have to implement any of this yourself. A maintained ADWIN implementation ships in the river online machine learning library; the work is wiring one detector per arm into the reward path and letting a drift signal reset that arm's estimate to its post-collapse window. I kept the companion implementation small: one monitor, one synthetic run, an optional plot, and a test suite above 80 percent coverage, with river as the only real dependency, so the whole thing reproduces from a single command.&lt;/p&gt;

&lt;p&gt;But the wiring is not the lesson. The lesson is the one that kept me up at 2am: a router that learns is also a router that can be confidently, durably wrong the moment the world it learned stops being true. Convergence is not the finish line. A converged policy is a strong opinion about a fixed reality, and production has no fixed reality. A dispatcher that was right three weeks ago and wrong tonight is not malfunctioning. It is believing its own history a little too hard, and nothing in the loop is watching for the day that history stops being true.&lt;/p&gt;

&lt;p&gt;So now something is. One tripwire per arm, quietly asking after every outcome whether the past still predicts the present, ready to forget on purpose the moment it doesn't. If you route anything across capabilities that can change underneath you, and in this field everything can, you want that tripwire in the loop before the page arrives, not after.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bifet, A., &amp;amp; Gavaldà, R. (2007).&lt;/strong&gt; Learning from time-changing data with adaptive windowing. In &lt;em&gt;Proceedings of the 2007 SIAM International Conference on Data Mining&lt;/em&gt; (SDM 2007), pp. 443-448. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972771.42&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;river Python library.&lt;/strong&gt; Online machine learning in Python. BSD-3-Clause license. Source: github.com/online-ml/river. ADWIN implementation: &lt;code&gt;river.drift.ADWIN&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Two queues for local-LLM fleets</title>
      <dc:creator>praveenlavu</dc:creator>
      <pubDate>Thu, 18 Jun 2026 22:13:18 +0000</pubDate>
      <link>https://dev.to/praveenlavu/two-queues-for-local-llm-fleets-3jdp</link>
      <guid>https://dev.to/praveenlavu/two-queues-for-local-llm-fleets-3jdp</guid>
      <description>&lt;h1&gt;
  
  
  Two queues for local-LLM fleets
&lt;/h1&gt;

&lt;p&gt;Two ollama pulls, plus an LM Studio Llama 70B load, plus two subagents hitting a cloud LLM provider's API, plus seven daemons running scheduled scans. All at once. 2026-05-13, 10:58 UTC. Kernel panic.&lt;/p&gt;

&lt;p&gt;I'd triggered all of them myself, carelessly, inside ten minutes. The ollama pulls were fetching the latest quantized weights for two different oracle models. LM Studio was loading a 70B parameter model into resident memory for a council review. Two subagents were dispatched via my orchestration layer, each making concurrent calls to a frontier model API. Seven launchd daemons fired on schedule because it was the top of the hour. The machine had 96GB of unified memory. It wasn't enough.&lt;/p&gt;

&lt;p&gt;The postmortem produced a rule I now follow religiously: &lt;strong&gt;local-heavy tasks run serially, one at a time; remote-API fleet tasks run with bounded concurrency; never cross-mix the two.&lt;/strong&gt; This is the two-queue discipline.&lt;/p&gt;

&lt;p&gt;A quick vocabulary stop before we go further. By &lt;strong&gt;fleet&lt;/strong&gt; I mean a set of cloud-LLM agent calls running in parallel against remote APIs. Each agent is a thin process; the work happens server-side. By &lt;strong&gt;oracle&lt;/strong&gt; I mean a heavy local model I load into resident memory for high-stakes reasoning that has to happen on the machine itself, usually for IP or latency reasons. Both are part of the same agent-orchestration pattern, but they have completely different resource profiles. That difference is exactly why mixing them blows up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two task classes
&lt;/h2&gt;

&lt;p&gt;When you're running local LLMs alongside cloud APIs, the work splits into two classes based on &lt;em&gt;where the saturation happens&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local-heavy tasks&lt;/strong&gt; saturate your machine's resources directly. Model downloads via ollama pull. Loading a 30-70GB quantized model into LM Studio or ollama with &lt;code&gt;keep_alive&lt;/code&gt; set to hold it resident. Running a full pytest sweep across your entire repo. Oracle council dispatch where you're loading massive models into RAM for inference. These compete for unified memory bandwidth, disk I/O, and CPU cores for dequantization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote-API fleet tasks&lt;/strong&gt; saturate neither your memory nor your CPU. They saturate network connections and your cloud provider's rate limits. Subagent dispatch via an orchestration layer where each agent calls a frontier model via API. Parallel web scrapes. Batch processing jobs that fan out to external services. These tasks are I/O-bound on network latency and remote throughput, not local compute.&lt;/p&gt;

&lt;p&gt;The distinction matters because the two classes fail differently. Local-heavy tasks cause memory pressure, thermal throttling, and in the worst case, kernel panics when the memory allocator can't satisfy a request. Remote-API tasks cause connection pool exhaustion, rate limit errors, and cascading timeouts when you overwhelm either your network stack or the remote service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why mixing them saturates faster than you'd expect
&lt;/h2&gt;

&lt;p&gt;Here's the non-obvious part. On Apple Silicon, the memory is unified: the same physical RAM serves CPU, GPU, and the Neural Engine. When ollama loads a 67GB quantized model with &lt;code&gt;keep_alive &amp;gt; 0&lt;/code&gt;, it's not just occupying 67GB. It's pinning that allocation in a way that fragments the address space for everything else. The OS can't trivially reclaim it because the model needs to stay resident for fast inference.&lt;/p&gt;

&lt;p&gt;Now add two ollama pulls in parallel. Each one is streaming gigabytes of weights from the network and writing them to disk while simultaneously validating checksums and decompressing blobs. That's sustained disk I/O and memory allocation churn. The system is juggling resident model memory, in-flight download buffers, filesystem cache pressure, and whatever else is running.&lt;/p&gt;

&lt;p&gt;Then add an LM Studio load: another 30-70GB allocation request. LM Studio is trying to &lt;code&gt;mmap&lt;/code&gt; the model file into a contiguous region. If the address space is already fragmented by the ollama allocations, the kernel has to work harder to find a suitable range. On a system with 96GB total and a 67GB model already loaded, the remaining headroom is only ~29GB. That's before fragmentation overhead, filesystem cache, and kernel allocations eat into it. A 30-70GB request collides with that headroom immediately. At the low end, a smaller model squeezes in with zero margin for spikes; at the high end, the request fails outright. Either way the kernel starts swapping, which on a machine built for low-latency inference is effectively a soft hang.&lt;/p&gt;

&lt;p&gt;Now add two subagents making concurrent API calls. They're not heavy on memory, but they are allocating connection state, buffers for HTTP responses, and JSON parsing overhead. Not huge individually, but enough to tip the balance when the system is already under memory pressure from the local-heavy work.&lt;/p&gt;

&lt;p&gt;Add seven daemons firing on schedule. Maybe they're just doing lightweight scans, but each one spawns a process, allocates a stack, opens file descriptors, and touches the filesystem. Again, not huge individually. But in aggregate, on a system already saturated, it's the cumulative load with no single smoking gun. Each spawn adds overhead the kernel can't reclaim fast enough.&lt;/p&gt;

&lt;p&gt;The kernel panic isn't random. It's deterministic. You've exceeded the practical working set the unified memory architecture can sustain under concurrent pressure. The math isn't "96GB total, models fit if they sum to less than 96GB." The math is "96GB total minus fragmentation overhead minus in-flight buffers minus filesystem cache minus kernel allocations minus margin for allocation spikes." You hit the limit well before the total RAM. Once you account for everything else the system is doing, the practical working set is significantly smaller.&lt;/p&gt;
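
&lt;p&gt;That subtraction is easy to write down. The sketch below uses the 96GB total and the 67GB resident model from this incident; every overhead figure is a hypothetical placeholder, not a measurement, and the class name &lt;code&gt;MemoryBudget&lt;/code&gt; is invented for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    total_gb: float
    resident_models_gb: float
    inflight_buffers_gb: float = 0.0
    fs_cache_gb: float = 0.0
    kernel_gb: float = 0.0
    spike_margin_gb: float = 0.0

    def headroom_gb(self) -> float:
        """Practical working set left after everything already claimed."""
        return (
            self.total_gb
            - self.resident_models_gb
            - self.inflight_buffers_gb
            - self.fs_cache_gb
            - self.kernel_gb
            - self.spike_margin_gb
        )

    def fits(self, candidate_gb: float) -> bool:
        """True only if the candidate allocation fits inside the headroom."""
        return self.headroom_gb() >= candidate_gb


# Naive view: only the resident model counts.
naive = MemoryBudget(total_gb=96, resident_models_gb=67)

# Same machine with hypothetical overheads filled in.
realistic = MemoryBudget(
    total_gb=96,
    resident_models_gb=67,
    inflight_buffers_gb=4,
    fs_cache_gb=6,
    kernel_gb=4,
    spike_margin_gb=6,
)

print(naive.headroom_gb(), realistic.headroom_gb())
```

&lt;p&gt;With these placeholder numbers a 20GB load looks safe on the naive budget and is refused on the realistic one, which is the gap between "total RAM" and "practical working set" described above.&lt;/p&gt;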

&lt;h2&gt;
  
  
  The two-queue rule
&lt;/h2&gt;

&lt;p&gt;Don't mix the classes. Run them in separate queues with separate concurrency limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local-heavy tasks: serial only.&lt;/strong&gt; One at a time. No exceptions. If you're pulling a model via ollama, nothing else heavy runs until it's done. If you're loading an oracle model into LM Studio, no other model loads or downloads run concurrently. If you're running a full test sweep, no oracle dispatch, no model pulls. Serial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote-API fleet tasks: bounded concurrency, ≤5 concurrent by default.&lt;/strong&gt; Five subagents hitting cloud APIs in parallel is fine. They're network-bound, not memory-bound. You can saturate your rate limit before you saturate your machine. But don't go unbounded. Connection pool exhaustion is real, and most cloud providers have per-account concurrency limits anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never cross-mix.&lt;/strong&gt; If a local-heavy task is in flight, the remote-API queue is paused. If the remote-API fleet is running, no local-heavy tasks start. This is the hard rule. It eliminates the saturation-interaction failure mode entirely.&lt;/p&gt;
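
&lt;p&gt;The three rules can be sketched as a single admission gate. This is an illustration using &lt;code&gt;asyncio&lt;/code&gt;, assuming all tasks are dispatched from one Python process; the class &lt;code&gt;TwoQueueGate&lt;/code&gt; and the &lt;code&gt;demo&lt;/code&gt; coroutine are hypothetical names, and the sleeps stand in for real work.&lt;/p&gt;

```python
import asyncio
from contextlib import asynccontextmanager


class TwoQueueGate:
    """Admission gate: serial local-heavy work, bounded remote work, no mixing."""

    def __init__(self, remote_limit: int = 5) -> None:
        self.remote_limit = remote_limit
        self.local_active = False
        self.remote_active = 0
        self._cond = asyncio.Condition()

    @asynccontextmanager
    async def local_heavy(self):
        async with self._cond:
            # Wait until no local task and no remote task is in flight.
            await self._cond.wait_for(
                lambda: not self.local_active and self.remote_active == 0
            )
            self.local_active = True
        try:
            yield
        finally:
            async with self._cond:
                self.local_active = False
                self._cond.notify_all()

    @asynccontextmanager
    async def remote_api(self):
        async with self._cond:
            # Wait until no local task runs and a remote slot is free.
            await self._cond.wait_for(
                lambda: not self.local_active
                and self.remote_limit > self.remote_active
            )
            self.remote_active += 1
        try:
            yield
        finally:
            async with self._cond:
                self.remote_active -= 1
                self._cond.notify_all()


async def demo() -> dict:
    gate = TwoQueueGate(remote_limit=5)
    seen = {"max_remote": 0, "max_local": 0, "mixed": False}
    local_running = 0

    async def local_task() -> None:
        nonlocal local_running
        async with gate.local_heavy():
            local_running += 1
            seen["max_local"] = max(seen["max_local"], local_running)
            seen["mixed"] = seen["mixed"] or gate.remote_active > 0
            await asyncio.sleep(0.01)  # stand-in for a model pull or load
            local_running -= 1

    async def remote_task() -> None:
        async with gate.remote_api():
            seen["max_remote"] = max(seen["max_remote"], gate.remote_active)
            seen["mixed"] = seen["mixed"] or gate.local_active
            await asyncio.sleep(0.01)  # stand-in for an API call

    jobs = [local_task() for _ in range(3)] + [remote_task() for _ in range(12)]
    await asyncio.gather(*jobs)
    return seen
```

&lt;p&gt;Running &lt;code&gt;asyncio.run(demo())&lt;/code&gt; records the peak concurrency per class and whether the two classes ever overlapped. One caveat of this simple gate: a steady stream of remote tasks can starve a waiting local task, so a production version would likely need a fairness rule on top.&lt;/p&gt;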

&lt;h2&gt;
  
  
  The pre-flight gate
&lt;/h2&gt;

&lt;p&gt;Before starting any heavy task, I check three things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load average.&lt;/strong&gt; &lt;code&gt;uptime&lt;/code&gt; shows the 1-minute load average. If it's above 4.0 on my 12-core machine, something is already saturated. Wait or kill. Don't pile on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free disk space.&lt;/strong&gt; &lt;code&gt;df ~&lt;/code&gt; shows available space on the home volume. I need at least 30GB free before pulling a large model. Ollama writes to a temp location before moving the final file, so you need roughly 2× the model size in transient headroom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-flight heavy task check.&lt;/strong&gt; I explicitly verify no other local-heavy task is running. I run &lt;code&gt;ps aux | grep -E 'ollama.*pull|lmstudio|pytest|oracle_dispatch'&lt;/code&gt; as a sanity check, ignoring the line for the grep process itself, which always matches its own pattern. It's manual, it's primitive, but it works. Automation can come later. Discipline comes first.&lt;/p&gt;

&lt;p&gt;If any gate fails, I don't proceed. I reschedule or I kill the conflicting task. The pre-flight check takes ten seconds. Recovery from a kernel panic takes ten minutes and loses whatever state was in flight.&lt;/p&gt;
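
&lt;p&gt;The three checks codify easily. The sketch below is one way to do it, assuming a Unix-like machine; the thresholds and the process pattern are the ones from this section, while the function names &lt;code&gt;gate_failures&lt;/code&gt;, &lt;code&gt;find_heavy&lt;/code&gt;, and &lt;code&gt;preflight&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import os
import re
import shutil
import subprocess
from typing import List

HEAVY_PATTERN = re.compile(r"ollama.*pull|lmstudio|pytest|oracle_dispatch")


def find_heavy(process_lines: List[str]) -> List[str]:
    """Return the command lines that look like local-heavy tasks."""
    return [line for line in process_lines if HEAVY_PATTERN.search(line)]


def gate_failures(
    load_1m: float,
    free_gb: float,
    heavy_lines: List[str],
    max_load: float = 4.0,
    min_free_gb: float = 30.0,
) -> List[str]:
    """Pure decision step: an empty list means every gate passed."""
    failures = []
    if load_1m > max_load:
        failures.append(f"load average {load_1m:.1f} is above {max_load}")
    if min_free_gb > free_gb:
        failures.append(f"only {free_gb:.0f}GB free, need {min_free_gb:.0f}GB")
    if heavy_lines:
        failures.append(f"{len(heavy_lines)} heavy task(s) already running")
    return failures


def preflight() -> List[str]:
    """Gather live readings and run them through the gate."""
    load_1m = os.getloadavg()[0]  # not available on Windows
    free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9
    ps = subprocess.run(
        ["ps", "-A", "-o", "command="], capture_output=True, text=True
    )
    return gate_failures(load_1m, free_gb, find_heavy(ps.stdout.splitlines()))
```

&lt;p&gt;Keeping the decision in &lt;code&gt;gate_failures&lt;/code&gt; separate from the measurement in &lt;code&gt;preflight&lt;/code&gt; makes the thresholds testable without touching the real machine. Note that a script launched from inside pytest would flag itself as a heavy task.&lt;/p&gt;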

&lt;h2&gt;
  
  
  The forbidden combinations
&lt;/h2&gt;

&lt;p&gt;Some combinations are always wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local-heavy + remote-API fleet active&lt;/strong&gt;: memory pressure + connection churn → saturation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleet active + oracle dispatch starting&lt;/strong&gt;: oracle loads a 30-70GB model while fleet holds HTTP state → OOM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle dispatch + ollama pull&lt;/strong&gt;: two large allocations competing for the same unified memory → kernel panic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two oracle models loaded simultaneously&lt;/strong&gt;: 67GB + 30GB &amp;gt; practical working set → swap death spiral&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More than 5 concurrent fleet dispatches without override&lt;/strong&gt;: connection pool exhaustion, cascading timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these surfaced as a real failure at some point. The rule isn't theoretical. It's scar tissue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The discipline
&lt;/h2&gt;

&lt;p&gt;A postmortem for a solo founder means writing rules your future self will hate following, until the day they save you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The two-queue rule is the floor, not the ceiling. Once the discipline is in place, four trajectories open up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-scheduling.&lt;/strong&gt; The pre-flight gate is currently ten seconds of manual checks. With the gate logic codified, you can wrap it in a scheduler that decides automatically: when an ollama pull finishes, fire the next queued heavy task. When the fleet rate-limit ceiling approaches, throttle. When load average rises, defer. Manual discipline becomes automatic policy. The hard part is not the scheduler. The hard part is encoding "what counts as heavy" precisely enough that the scheduler doesn't have to ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-machine fleet coordination.&lt;/strong&gt; The same logic extends to a multi-node setup. One machine handles oracle work, another handles fleet, a third runs daemons. The queues become network-coordinated and the rule generalizes: never let a node accept a task that would push it past its per-class concurrency cap. The interesting design question is where the queue state lives. A Redis sorted set works for two nodes. Past five nodes you start wanting a real durable queue, and now you have a different operations problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictive saturation modeling.&lt;/strong&gt; The kernel-panic math at the end of the section on why mixing saturates faster than you'd expect is a working-set predictor in disguise. Given a tuple of (free memory, in-flight model sizes, filesystem cache pressure, kernel allocation headroom), you can compute whether a candidate task will fit before dispatching. The math is there. What's missing is the wrapper that runs it on every dispatch and refuses the unsafe ones. That refusal is more valuable than the scheduler, because it stops you from making the mistake in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability that matches the discipline.&lt;/strong&gt; A single surface showing what's queued vs running per queue, real-time load average, projected memory headroom after the next dispatch. Not a separate observability project. Just the gate logic exposed as state, with a small UI on top. The reason this is worth building: when the rule fails it fails silently for several minutes before the kernel intervenes. A live view of "you're about to violate the rule" beats a postmortem of "you violated the rule" every time.&lt;/p&gt;

&lt;p&gt;Each of these builds on the two-queue rule without abandoning it. The rule stays the same: serial on local, bounded on remote, never cross-mix. What changes is how much of the discipline you have to hold in your head, and how much the system holds for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI Disclosure
&lt;/h2&gt;

&lt;p&gt;This artifact was prepared with assistance from generative AI tools, in accordance with COPE+STM "AI in scholarly publishing" guidance and the target venue's AI-use policy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drafting:&lt;/strong&gt; Author-Enthusiast agent via a frontier large language model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice + IP firewall:&lt;/strong&gt; Author-Human agent via a frontier large language model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-tell removal + readability:&lt;/strong&gt; Author-Humanizer agent via a frontier large language model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final responsibility:&lt;/strong&gt; The named human author has read and approved the
final content. No generative AI is listed as a co-author. Substantive
intellectual contribution remains with the human author.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
