DEV Community: praveenlavu

IQ, EQ, and the Rise of AQ

praveenlavu — Wed, 22 Jul 2026 14:02:45 +0000

Random Work Is the New Deep Work: Notes on the IQ → EQ → AQ Shift

My work journal writes itself: every session gets captured by a hook, and a nightly job distills the day into one page. Eighty-four entries since April 4. This morning I scrolled back through all of them looking for a through-line, and the honest answer is there isn't one. An EDI acknowledgment loop on a Monday. A drift detector for model routing on a Wednesday. License hygiene on a Friday. A kernel-panic postmortem the week after. One hundred fourteen article seeds sit in my backlog right now, and they read like ten different people wrote them.

The voice in my head about this is the one every career book installed: pick a lane. Depth wins. A senior engineer is someone who spent a decade getting unreasonably good at one thing. By that standard, the last two months of my life look like a focus failure.

I want to argue the opposite. I also want to be careful doing it, because the argument flatters me, and arguments that flatter you are the ones to check twice. So I treated it like any suspicious result: I went looking for the data that would kill it.

Quotients have a habit of taking over

For most of a century, the prized number was IQ. It sorted school tracks, army placements, and eventually, through a long chain of proxies, who got hired to think for a living. The assumption underneath it: intelligence is the scarce input, so measure the intelligence.

Then in 1990 two psychologists, Peter Salovey and John Mayer, defined "emotional intelligence" in an academic journal. In 1995 a science journalist named Daniel Goleman turned it into a bestseller, and that October TIME put EQ on its cover. Within a decade every leadership offsite had a module on it. EQ also got oversold: the popular claim that emotional intelligence accounts for up to 90 percent of leadership success never had adequate data behind it, and in 2005 the psychologist Edwin Locke published a paper calling the whole construct invalid. Hold onto that pattern. We'll need it again.

Adaptability's turn started quietly, at the level of companies rather than people. In 2011, two BCG strategists argued in Harvard Business Review that sustainable competitive advantage was dying and the replacement was speed: the ability to read signals, experiment, and mobilize faster than the environment changes. By 2019 a venture investor named Natalie Fratto was on the TED stage proposing AQ, the adaptability quotient, as the thing she screens founders for. An assessment industry followed, the way it always does.

Here is where I'm supposed to quote Darwin about how it is not the strongest of the species that survives but the most adaptable. He never wrote it. A management professor named Leon Megginson paraphrased him that way in a 1963 speech, and the paraphrase got promoted to scripture because it was too useful to fact-check. I find that fitting rather than damning: the adaptability era runs on a quote that adapted.

You don't need Darwin. You need job ads.

Here's what I found when I tried to kill the thesis.

LinkedIn's learning report had already named adaptability the "skill of the moment" in 2024, the fastest-growing skill demand in its data. The World Economic Forum's employer survey expects 39 percent of core skills to be transformed or obsolete by 2030; the previous edition said 44 percent by 2027, so the panic cooled a notch while the direction held. PwC mined close to a billion job ads and found the skills employers ask for changing 66 percent faster in AI-exposed occupations than in the rest of the economy, up from 25 percent faster one edition earlier. The churn isn't spread evenly. It concentrates exactly where builders live.

Then the stat I keep rereading. When Microsoft and LinkedIn surveyed 31,000 people in 2024, 71 percent of leaders said they'd rather hire a less experienced candidate with AI aptitude than a more experienced candidate without it. Read that again slowly. Experience is the compound interest of the IQ era, the asset you were told to spend thirty years accumulating. A majority of the leaders surveyed just said they'll trade it for evidence you can absorb a new tool this quarter.

The quietest data point comes from inside psychometrics itself. For decades the textbook said general mental ability tests were the single best predictor of job performance. In 2022, Sackett and colleagues re-ran the math and showed the classic estimates had been systematically over-corrected for years. In the revised table, cognitive ability tests fall behind structured interviews and biodata. Demonstrated behavior now outranks measured aptitude in the discipline that engineered aptitude measurement. Nobody held a parade. The most-cited number in hiring science got quietly marked down, in the same decade the market started pricing adaptation.

The people building the tools say it in plainer words. Jensen Huang stood on a stage in Taipei in 2023 and declared, "Everyone is a programmer now. You just have to say something to the computer." Sam Altman keeps answering the what-should-students-learn question with versions of one answer: learning how to learn, resilience, the raw ability to adapt when everything around you changes. Dan Shipper calls what comes after the knowledge economy the allocation economy: you stop being valued for what you know and start being valued for how well you direct intelligence that isn't yours.

So the data didn't kill the thesis. It sharpened it.

What it feels like from inside

What the reports can't tell you is what the shift does to the person living it.

My family farms. They do work that pays off only when the season turns, and I inherited that patience along with the assumption that mastery has a season too: plant the years, harvest the expertise. The hardest thing about building with AI is watching my field lose its seasons. IBM's researchers put the half-life of a technical skill around two and a half years now. From inside, it feels shorter. Frameworks I knew deeply stopped mattering. Tools I dismissed became load-bearing within a quarter. The capital I'd spent a career compounding was melting while I held it.

I can date the low point. On May 13, I kernel-panicked my own machine: one local model too many pulled into memory while another heavy job was already loaded. The computer that runs my whole operation went dark because I was trying to absorb new tools faster than I was respecting their limits. That's the texture of this era that never makes the keynote: the am-I-keeping-up loop, the vertigo weeks, the retraining that happens at hours the journal timestamps don't flatter.

Here's the part that made me stop reading the crash as a verdict. By the end of that same day, the panic had become two new entries in my article backlog: one on queueing disciplines for local model fleets, one on postmortems for solo builders. Both seeds carry the source date May 13. The crash and its curriculum, logged on the same page.

That's when the journal's through-line finally showed itself. I'd been scanning the topic column, and the topics never repeat. The pattern lives in the other column, the one that never changes: frame the problem sharply, find the prior art, set the quality gates, put the machines to work, audit what comes back, write down the lesson. Every one of those eighty-four days runs that loop. An EDI acknowledgment protocol and a drift detector have nothing in common as domains. As loops, they're the same day.

Once I saw the loop, the economics flipped. A new domain used to cost months of ramp before output; now an unfamiliar one goes from hostile to workable in about a day, because execution is cheap and the loop is practiced. Seventeen days after the kernel panic, an essay on drift detection for model routing went out the door. Watching working code materialize in a domain I didn't know the week before is the closest thing to a cheat code I've ever felt while building. The dopamine is real. So is the discipline bill: the faster the code appears, the more the verification matters, because speed without gates is just confident garbage.

That loop is what people are trying to name when they say AQ.

The honest caveats

I'm reluctant to turn this into a score, because the science isn't there. AQ has no validated instrument the way IQ does; the prominent assessments are commercial products; and the measurable parts of adaptability keep dissolving into older constructs when researchers look closely. A 2017 meta-analysis found Big Five personality traits explain a large share of what career-adaptability scales capture. Awkwardly, "learns fast in unfamiliar situations" was always half of what intelligence tests measured anyway. The quotient framing is marketing. EQ taught us how that movie goes.

But the repricing is not marketing. The job ads, the hiring preferences, the skill-churn rates: those are measured behaviors of people spending money. You can reject AQ as a construct and still accept the conclusion that the market premium moved from what you've mastered to how fast you re-orient.

Recovery time, not mastery time

So here's the principle I run on now. I stopped optimizing time-to-mastery in a single domain and started optimizing recovery time across domains: the interval between landing somewhere unfamiliar and shipping something I can defend there. It's days now. I'm working on hours. You can't put that number on a résumé yet. You can only demonstrate it, which may be exactly why the selection methods that survived the 2022 revision are the ones that watch you behave instead of asking what you know.

The deep-work instinct isn't wrong; it just attached to the wrong object. The thing to go deep on is no longer a domain. It's the loop that eats domains.

Eighty-four journal entries, one hundred fourteen seeds, one kernel panic, and no through-line in the topic column. I spent two months reading that as the bug.

It's the résumé.

Dynamic SOQL Without Getting Burned

praveenlavu — Sun, 19 Jul 2026 14:00:07 +0000

There is a particular kind of dread that sets in at two in the morning when you realize the query you shipped last week can be turned against you.

I was staring at a security review report, coffee going cold, reading a finding that should not have surprised me but did. The feature was a configurable search: administrators could pick which fields appeared in results, users could filter on any of them. Flexible, powerful, genuinely useful. The implementation built the query string at runtime from user-supplied field names. The finding said, in flat auditor language, that an attacker with access to the search interface could manipulate that field name to extract data from objects they were never supposed to see. The app had been in production for months.

This is the SOQL injection problem, and it is not exotic. It is one of the most common failures I see in Salesforce ISV security reviews, precisely because the features that invite it are also the features customers love most. Configurable dashboards. Flexible report builders. User-driven filter panels. Every one of them requires building queries from runtime input, and every one of them is a potential foot-gun if you have not thought carefully about the layers between what a user supplies and what hits the database.

I want to walk through how I think about this now, after enough scars to have developed an opinion.

Why Dynamic SOQL Exists, and Why It Bites

Static queries are safe by design. When you write a query with fixed object names, fixed field names, and only the filter values changing, the platform can reason about what you are asking for. The query shape is known at compile time. The only variable is data, not structure.

The moment a customer says "I want to search across whichever fields matter to me," you lose that static shape. The object, the fields, the filter logic, or all three might come from configuration or from direct user input. You are building a string and handing it to an interpreter, which is exactly the category of problem that has caused security incidents across every platform and language for decades.

On Salesforce, the blast radius is real. The SOQL engine runs inside the platform's security model, but that model does not protect you from a query that is structurally manipulated before it reaches the engine. If an attacker can inject their own field names or object references into a query string, they can potentially read fields across relationships the original query was never meant to traverse. In a multi-tenant ISV package, where customer data lives in the same org alongside your package's logic, the stakes are higher.

The good news is that there are three layers of defense that together close almost every vector I have seen. They compose, which means each one limits what can go wrong even if another layer is imperfectly applied.

The First Layer: Escaping Filter Values

The most well-known defense is escaping user-supplied values before they are embedded in a query string. The platform provides a method for this (String.escapeSingleQuotes in Apex), and it does exactly what the name implies: it puts an escape character in front of single quotes so they cannot be used to break out of a string literal inside the query.

This is necessary, but it is the weakest of the three layers because it only protects filter values. It does nothing about structural components of the query: the object name, the field names, the sort order, the limit logic. If any of those components come from user input and are not separately validated, escaping filter values leaves the door open.

I have seen implementations that escape diligently and still end up with injection vulnerabilities because someone assumed that the only dangerous input was in the WHERE clause values. The attacker does not care about the WHERE clause if they can control the SELECT list.

Think of escaping as the seat belt. You absolutely wear it. But you also want airbags and crumple zones.
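A minimal sketch of that gap, with Python standing in for Apex. `escape_single_quotes` is a hypothetical helper that only approximates the platform's escaping method, and the object and field names are illustrative rather than taken from the reviewed application.

```python
def escape_single_quotes(value: str) -> str:
    """Approximate the platform's escaping: backslash-escape single quotes."""
    return value.replace("'", "\\'")


def build_query_escape_only(field_name: str, search_term: str) -> str:
    """Escape the filter value but trust the field name (the flawed pattern)."""
    escaped = escape_single_quotes(search_term)
    # The field name is concatenated untouched: escaping never sees it.
    return f"SELECT Id, {field_name} FROM Contact WHERE LastName = '{escaped}'"


# A quote inside the filter value stays inside its string literal.
value_attack = build_query_escape_only("Email", "x' OR Name != '")

# A hostile "field name" reshapes the SELECT list and passes straight through.
structure_attack = build_query_escape_only("Owner.Manager.Email", "Smith")
```

The first call is the case escaping was built for. The second is the case it never touches, which is why the remaining layers exist.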

The Second Layer: Bind Variables

Bind variables change the relationship between your code and the query engine fundamentally. Instead of building a string that includes both structure and values, you build the structural parts as a string and pass the value parts separately, in a way that the engine treats them as data rather than as query syntax.

When you use a bind variable, you are essentially saying: this thing I am passing you is a value, not syntax. The engine never interprets it as query language. It cannot be used to append additional query clauses, to comment out existing ones, or to reference different fields. It is data, full stop.

Bind variables are available for filter values in SOQL, and they are the correct approach for anything the user supplies as a search term, a record ID, a date range, or any other filtering input. The moment I understood this distinction, a lot of things that felt fragile started feeling solid.

The catch is that bind variables do not apply to structural query components. You cannot use a bind variable as a field name. You cannot use one as an object name. The query engine needs those parts as literal strings to construct the query, which means they are in a different threat category entirely.
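The split between structure and values can be sketched like this. It is Python standing in for Apex, under the assumption that the query text and a map of bind values are handed to the engine separately (in Apex, a `:name` reference resolved from a variable in scope, or a bind map passed to `Database.queryWithBinds`). The function and parameter names are hypothetical.

```python
from typing import Any


def build_bound_query(search_term: str, min_created: str) -> tuple[str, dict[str, Any]]:
    """Return a fixed query template plus the values that go with it."""
    # Structure: a constant template with named placeholders, no user text in it.
    query = (
        "SELECT Id, Name FROM Account "
        "WHERE Name = :searchTerm AND CreatedDate >= :minCreated"
    )
    # Values: travel separately and are never spliced into the string.
    # (In Apex the date would be a Datetime value, not a string.)
    binds = {"searchTerm": search_term, "minCreated": min_created}
    return query, binds


# Even a hostile search term leaves the query text unchanged.
query, binds = build_bound_query("x' OR Name != '", "2026-01-01T00:00:00Z")
```

Note what the template cannot do: there is no placeholder for a field or object name, which is exactly the limitation described above.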

The Third Layer: Schema-Based Allowlisting

This is where things get interesting, and where I spent the most time thinking before it clicked.

For structural query components, specifically the object names and field names that come from configuration or user choice, the defense is not about escaping or parameterizing. It is about validation: before you let a string become part of your query structure, you verify that it is a real, expected, accessible schema element.

The platform's schema describe capabilities give you everything you need for this. You can introspect the metadata of any object your code has access to, retrieve the set of fields that actually exist on it, check their accessibility and type, and make decisions based on real, platform-sourced information rather than a list you maintained yourself.

The pattern works like this at a conceptual level: when your feature needs to accept a field name as input, whether from an administrator's configuration or from a user's filter selection, you do not trust that string directly. You ask the platform what fields exist on the relevant object. You check whether the proposed field name is in that set. You also check whether the field type is appropriate for what you are doing. Only after those checks pass do you allow the field name into the query.

This is powerful for two reasons. First, it is automatically current. If your schema changes, the allowlist changes with it without you doing anything. Second, it is grounded in truth that the platform itself controls. An attacker cannot fake a valid schema describe response. Either the field exists and is accessible, or it does not.

I remember the moment this approach crystallized for me. I was trying to think about how an attacker would defeat field-name validation, and I kept reaching for the same answer: they would supply a field name that was not real, or one that accessed a relationship in an unexpected way, or one with special characters. Every single one of those is caught by asking the platform whether the field actually exists on the target object. The attacker cannot create valid schema entries. The platform controls that namespace entirely.

Object-name allowlisting follows the same pattern. If your feature lets administrators pick which objects a search applies to, you validate each object name against the platform's describe before it enters a query. You can also check object accessibility and whether your package has the right permissions to query it. The configuration-time check and the runtime check both use the same source of truth.
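Conceptually, the validation step looks something like the sketch below. Python stands in for Apex, and the `DESCRIBE` dictionary is a hypothetical stand-in for what the platform's describe calls would return at runtime; the field types and the `FILTERABLE_TYPES` rule are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical stand-in for describe results: object -> field -> metadata.
DESCRIBE = {
    "Account": {
        "Name": {"type": "string", "accessible": True},
        "Industry": {"type": "picklist", "accessible": True},
        "AnnualRevenue": {"type": "currency", "accessible": False},
        "Description": {"type": "textarea", "accessible": True},
    },
}

# Assumption: this feature only filters on short text-like fields.
FILTERABLE_TYPES = {"string", "picklist"}


def resolve_field(object_name: str, field_name: str) -> Optional[str]:
    """Return the schema's own spelling of the field, or None if not allowed."""
    fields = DESCRIBE.get(object_name)
    if fields is None:
        return None  # unknown or inaccessible object
    by_lower = {name.lower(): name for name in fields}
    canonical = by_lower.get(field_name.lower())
    if canonical is None:
        return None  # not a real field on this object
    meta = fields[canonical]
    if not meta["accessible"] or meta["type"] not in FILTERABLE_TYPES:
        return None  # exists, but unreadable or the wrong type for filtering
    return canonical


def build_select(object_name: str, requested_fields: list[str]) -> str:
    """Build the SELECT list only from names the schema itself handed back."""
    resolved = [resolve_field(object_name, f) for f in requested_fields]
    if not resolved or any(r is None for r in resolved):
        raise ValueError("request contains a field that failed schema validation")
    return f"SELECT Id, {', '.join(resolved)} FROM {object_name}"
```

One design choice worth noting: the query is assembled from the name the schema returned, not from the string the caller supplied, so nothing user-typed reaches the query text even when validation passes.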

The Layers Working Together

Here is why composition matters. Suppose you have a feature that lets users filter on any field in a list they configure, and you apply all three layers.

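Filter values go through bind variables, with escaping reserved for any value that has to be concatenated into the string. (Doing both to the same value is a mistake: the escape characters become part of the bound data.) Even if a user tries to inject SOQL syntax into a search term, it never becomes query syntax.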

Field names from the configuration go through schema validation before the query is built. Even if an attacker somehow manipulates the configuration, a field name that does not exist or that fails accessibility checks never reaches the query.

The query is still built as a string because SOQL requires structural elements to be literal, but the inputs to that string have all been filtered through one or more validation layers appropriate to their type.

What remains? The template around those validated inputs, the structural logic you write yourself, which is static and does not vary with user input. That part cannot be injected.

I ran this through a few attack scenarios mentally, and then more formally during a security review. The intersection of what an attacker controls and what reaches the query engine without validation is, ideally, empty.

What This Looks Like in Security Review

When I review a Salesforce application now, the questions I ask about dynamic SOQL are fairly specific.

Does any user-supplied input enter a query string directly, without escaping or bind variables? That is the first cut.

Does any structural element of the query (object names, field names, clause components) come from configuration or user input? If so, what validates it?

Is the validation based on a maintained list in the code, or is it grounded in real-time schema introspection? The former drifts and can be wrong. The latter is authoritative.

Can the feature's configuration be modified by low-privilege users, and if so, does the runtime validation still apply? This is the multi-tenant question. In a managed package, you often cannot trust that configuration has not been tampered with at runtime, so runtime validation using describe is more important than configuration-time validation.

Applications that fail these questions fail the security review. Not as a technicality, but because the risk is concrete and the fix is well-understood. The platform gives you the tools. The question is whether you reached for them.

The Principle

Every security control I care about in software follows the same basic shape: stop trusting things you do not have to trust, use authoritative sources rather than self-maintained lists, and separate the structural parts of a system from the data parts as clearly as possible.

Dynamic SOQL is a compressed case study in all three. You cannot avoid building queries at runtime if you want configurable features, but you absolutely can avoid trusting runtime-supplied strings to construct query structure. You can separate filter values from field names and treat them differently. You can ask the platform what is real instead of guessing.

The dopamine hit when a security review comes back clean on a feature that used to fail is real. But it is smaller than the relief you feel knowing that you actually understand why it is safe, not just that it passed this time.

If you are building configurable search on Salesforce, or any feature that assembles queries from pieces that users or administrators can influence, these three layers are the minimum bar. They are not expensive to implement. They are well worth the two-in-the-morning peace of mind.

The 837/999 HIPAA Acknowledgment Loop

praveenlavu — Sat, 18 Jul 2026 14:00:10 +0000

Every 837 You Send Needs a 999 Back: How the HIPAA Acknowledgment Loop Works

There is a particular kind of dread that hits you at two in the morning when you realize the system you spent six months building has been quietly lying to you.

Not maliciously. Not dramatically. Just silently, in the way that only healthcare integrations can lie: by doing nothing when something was expected, and giving you no signal either way.

That was the moment I understood why the 999 Implementation Acknowledgment exists. And why ignoring it is one of the most expensive mistakes a health tech team can make.

The Claim Leaves the Building

When a Salesforce Health Cloud implementation submits an 837 claim to a payer, the conversation does not end at the moment of transmission. Most teams build to the point where the claim goes out the door and treat that as a milestone. The integration works. The 837 fires. The build ships.

What they miss is that HIPAA mandates a response. Not as a courtesy, not as a best practice, but as a formal requirement. Every 837 you send to a trading partner carries control numbers (identifiers stamped into the interchange envelope, the functional group, the transaction set) and the payer is required to send back a 999 that mirrors those exact identifiers to confirm they received and processed what you actually sent.

The 999 is a receipt. A signed receipt. And until you have it, the submission is in limbo.

This matters more than it sounds. If the 999 never arrives, or arrives carrying a rejection, your claim is not sitting in a queue waiting to be adjudicated. It may never be adjudicated at all. The question of whether the payer ever really received it, in the form you intended, stays open.

The Verdict in the Envelope

What makes the 999 consequential is not just its existence but what it carries. The acknowledgment contains a verdict at the functional group level, encoded in a data element called AK901, the first element of the 999's AK9 segment. Four possible outcomes, each with completely different implications for what you do next.

Accepted (A) means exactly what it sounds like: the payer received your 837 functional group, validated it structurally, and is pulling it into their adjudication workflow. The loop closes. Move forward.

Partial (P) means the functional group had mixed results across individual transactions. Some claims passed; others failed. This one demands the most surgical response, because you cannot treat the group as a whole. You have to dig into which transactions were flagged and why.

Rejected (R) is unambiguous. The entire functional group was refused. Nothing in it will be processed. This is not a soft failure; it is a hard stop that demands investigation before anything is resubmitted.

Accepted with Errors (E) is the fourth state, and it carries a subtle but important distinction. Unlike Rejected, an E verdict means the functional group was accepted and will process, but the payer flagged issues it detected along the way. No retry is needed. The errors, though, need to be understood and cleared before the next submission cycle, because issues flagged today become rejections tomorrow.

Four distinct states. Four distinct responses required. And if you have no system consuming and acting on the 999, all four look identical from the outside: silence.
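The four verdicts map naturally onto an explicit dispatch table. The sketch below is illustrative only: the action names are hypothetical, and it assumes the common `*` element separator and `~` segment terminator, which in practice are declared by the interchange envelope.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPTED = "A"
    ACCEPTED_WITH_ERRORS = "E"
    PARTIAL = "P"
    REJECTED = "R"


# Hypothetical follow-up actions, one per verdict described above.
NEXT_ACTION = {
    Verdict.ACCEPTED: "close_loop",
    Verdict.ACCEPTED_WITH_ERRORS: "close_loop_and_queue_error_review",
    Verdict.PARTIAL: "review_flagged_transactions",
    Verdict.REJECTED: "hold_and_investigate",
}


def read_ak901(ak9_segment: str, element_sep: str = "*", segment_term: str = "~") -> Verdict:
    """Pull the functional-group verdict out of a raw AK9 segment."""
    elements = ak9_segment.strip().rstrip(segment_term).split(element_sep)
    if len(elements) < 2 or elements[0] != "AK9":
        raise ValueError(f"not an AK9 segment: {ak9_segment!r}")
    # Any code outside the four handled here raises ValueError instead of
    # being silently treated as success.
    return Verdict(elements[1])


action = NEXT_ACTION[read_ak901("AK9*R*1*1*0~")]
```

Failing loudly on an unrecognized code is deliberate: a verdict the system does not understand should never collapse into the same silence as no verdict at all.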

The Silent Duplicate Risk

Here is where the real danger lives, and it is not theoretical.

When a trading-partner agreement defines an acknowledgment window (and they all do), the clock starts the moment you submit. If a 999 does not arrive before that window closes, the submission is technically unacknowledged. The correct behavior is to investigate first and retry second: confirm whether the original submission was received, and resubmit only if it was not.

But if your system has no awareness of the 999 loop, it has no way to make that determination. What often happens instead is one of two failure modes.

The first: the claim sits indefinitely, considered submitted, never acknowledged, never followed up. Revenue evaporates. No one notices until a receivables reconciliation days or weeks later surfaces claims that were never adjudicated.

The second, and more operationally damaging: the team discovers the gap and retries manually, or the system retries on a timer, without knowing whether the original submission was actually received. Now there are two claims in flight for the same encounter. The payer receives a duplicate, flags it, and the adjudication process becomes significantly more complicated to unwind.

The 999 acknowledgment loop exists precisely to prevent both failure modes. It gives the submitting system a definitive signal that the payer received the 837, validated it, and is processing it, or it gives you the information you need to respond appropriately when that signal does not come.

Without a system that closes this loop, you are flying without instruments.

Why Salesforce Health Cloud Implementations Miss This

I have seen this pattern in enough HIPAA integrations built on Salesforce to know it is not a one-team problem. It is structural.

Salesforce Health Cloud is extraordinarily capable at managing the clinical and administrative data that feeds a claim. The patient, the encounter, the diagnosis codes, the procedure codes, the provider relationships: all of this lives naturally in the Health Cloud data model. Building an integration that generates an 837 from that data and transmits it to a clearinghouse or payer is a well-understood problem. There is tooling, there are patterns, and teams get good at this part.

The 999 requires something architecturally different. It is inbound. It is asynchronous: the payer sends it on their schedule, not yours. It requires the receiving system to parse the acknowledgment, match it against outstanding submissions by control number, evaluate the AK901 verdict, and take action based on what it finds.

This is a different kind of integration work. Reactive, not proactive. It requires state: knowing which 837s are pending acknowledgment, for how long, and what their current status is. And because it is invisible when it works, teams do not feel its absence until the first time something goes wrong.

The gap is almost always a project-planning artifact. The 837 submission is visible, measurable, and tied to a clear user story. The 999 receipt is a callback that happens later, asynchronously, and whose absence has no immediate observable effect. It falls out of scope. Sprint after sprint, it stays in the backlog.

The Moment the Loop Closed

The night I understood all of this was not while reading the HIPAA Technical Report Type 3 specification. It was while sitting with a failing claim that had been submitted three times, each time generating a duplicate claim flag, trying to figure out why.

The 837 was well-formed. The clearinghouse was accepting it. But the receiving system had no record of a 999. No acknowledgment had ever been processed. The system had been retrying on a timer with no knowledge of whether the original submission was alive or dead.

When we finally built the 999 consumer, the first thing it surfaced was that the original submission had been rejected at the functional group level. The rejection reason was clear. The fix was straightforward. But it had been sitting there, invisible, while automated retries compounded the problem.

The moment the loop actually closed (watching a rejection come in cleanly, get matched to its 837 by control number, update the submission record, and surface for review) felt like turning on a light in a room where you had been working in the dark. It was that specific kind of exhilaration when a system finally does what it was supposed to do, and you see it happen in front of you.

The Control Number is the Key

The mechanism that makes the 999 work as a precise receipt, not a generic confirmation, is control number matching.

The 837 you send carries identifiers at every level of its envelope: the interchange, the functional group, the transaction set. These are not decorative. They are the identifiers the payer uses to reference your submission, and the 999 uses the same identifiers to tell you exactly which submission it is acknowledging and at what level.

This precision is what makes the 999 genuinely useful. A generic "we received something" confirmation would not tell you whether the submission that arrived matches what you sent. The 999, when consumed correctly, is a precise statement: here is the control envelope you sent, here is what we found when we opened it, and here is our verdict.

Getting the control number matching right in the inbound 999 consumer is non-trivial. It requires maintaining state on every outbound 837: capturing the control numbers at transmission time, persisting them, and using them to correlate inbound acknowledgments. This is the infrastructure most teams skip, and its absence is what turns a missing 999 into a silent failure instead of an actionable event.
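A rough sketch of that state-keeping follows. It assumes the inbound 999 identifies the acknowledged functional group in its AK1 segment (group control number in the second element) and carries the verdict in AK9. The `SubmissionLedger` class, the `partner_id` key, and the status strings are hypothetical; a production consumer would persist this state and also correlate individual transaction sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    partner_id: str                        # hypothetical internal trading-partner key
    group_control: str                     # GS06 of the outbound 837 functional group
    transaction_controls: tuple[str, ...]  # ST02 of each 837 in the group
    status: str = "PENDING_ACK"


class SubmissionLedger:
    """In-memory stand-in for the persisted record of outbound 837s."""

    def __init__(self) -> None:
        self._submissions: dict[tuple[str, str], Submission] = {}

    def record_outbound(self, submission: Submission) -> None:
        # Capture control numbers at transmission time.
        key = (submission.partner_id, submission.group_control)
        self._submissions[key] = submission

    def apply_ack(self, partner_id: str, ak1: str, ak9: str) -> Optional[Submission]:
        """Match a 999's AK1/AK9 pair to a recorded 837; None if nothing matches."""
        ak1_parts = ak1.strip().rstrip("~").split("*")
        ak9_parts = ak9.strip().rstrip("~").split("*")
        if len(ak1_parts) < 3 or ak1_parts[0] != "AK1":
            raise ValueError("expected an AK1 segment")
        if len(ak9_parts) < 2 or ak9_parts[0] != "AK9":
            raise ValueError("expected an AK9 segment")
        submission = self._submissions.get((partner_id, ak1_parts[2]))
        if submission is None:
            return None  # orphan acknowledgment: surface it, do not drop it
        submission.status = f"ACK_{ak9_parts[1]}"
        return submission


ledger = SubmissionLedger()
ledger.record_outbound(Submission("PAYER01", "000000101", ("0001", "0002")))
matched = ledger.apply_ack("PAYER01", "AK1*HC*000000101*005010X222A1~", "AK9*R*2*2*0~")
orphan = ledger.apply_ack("PAYER01", "AK1*HC*000000999*005010X222A1~", "AK9*A*1*1*1~")
```

Returning `None` for an unmatched acknowledgment keeps the orphan visible to the caller, which matters as much as matching the ones that do correlate.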

What Closing the Loop Actually Changes

When the 999 consumer is working, the operational picture changes completely.

Rejected submissions surface immediately, with enough information to diagnose and correct the problem before the adjudication window passes. Partial acknowledgments flag specific transactions for review while letting clean ones continue. Error-flagged acknowledgments surface issues to clear in future submissions without halting the current processing cycle.

And missing 999s, the ones that never arrive because the trading-partner SLA window closes without a response, become visible as events rather than absences. The system knows what it sent, when it sent it, and when it expected a response. When that response does not come, it knows.

This is what real claim lifecycle management looks like. Not a pipeline that fires and forgets. A loop that opens when an 837 goes out and closes when a valid 999 comes back, with defined behavior for every outcome in between.
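Turning a missing 999 into an event can be as small as a periodic sweep over pending submissions. This is a sketch under assumptions: the 24-hour window is a made-up value (the real one comes from the trading-partner agreement), and `find_overdue` is a hypothetical helper, not part of any Salesforce or EDI library.

```python
from datetime import datetime, timedelta, timezone


def find_overdue(
    pending_sent_at: dict[str, datetime],
    ack_window: timedelta,
    now: datetime,
) -> list[str]:
    """Return group control numbers whose 999 is past the agreed window."""
    return sorted(
        control
        for control, sent_at in pending_sent_at.items()
        if now - sent_at > ack_window
    )


# Submissions still waiting on a 999, keyed by group control number.
sent = {
    "000000101": datetime(2026, 7, 17, 9, 0, tzinfo=timezone.utc),
    "000000102": datetime(2026, 7, 18, 8, 30, tzinfo=timezone.utc),
}

overdue = find_overdue(
    sent,
    ack_window=timedelta(hours=24),  # hypothetical window
    now=datetime(2026, 7, 18, 10, 0, tzinfo=timezone.utc),
)
```

An overdue entry should trigger investigation rather than an automatic resend, for the duplicate-claim reasons covered earlier.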

The Principle Behind the Protocol

HIPAA EDI is often described as a compliance burden, and it is true that the specification is dense, the segment hierarchy is unforgiving, and the interoperability requirements are exacting. But the 999 acknowledgment loop is not bureaucratic overhead. It is a protocol designed around a real operational problem: how do you build a durable, trustworthy claim submission pipeline across organizational boundaries, over networks you do not control, with trading partners whose systems behave inconsistently?

The answer the protocol gives is: you require acknowledgment. You mandate a receipt. You make the loop explicit and auditable, with verdicts that carry enough information to act on.

The teams that build this well do not just process claims. They build systems that know the status of every claim they have ever submitted, at every moment, with enough context to respond appropriately to any outcome. That is a fundamentally different capability from a submission pipeline, and it is what the acknowledgment loop was designed to enable.

If you are building on Salesforce Health Cloud and your 999 processing is still in the backlog, move it up. The submission loop is not closed until it comes back. And an open loop, in healthcare operations, is just a failure that has not been counted yet.

Reliable AI Agent Control Flow: Keep the State Machine Out of the Prompt

praveenlavu — Thu, 09 Jul 2026 14:37:05 +0000

Picture the failure that keeps me up at night. An agent reports that a job failed. The job did not fail. The work went through cleanly, every field extracted, the output sitting right there, correct. And the agent routed itself to the error state and stopped, calm as anything, as if it had done its job. There is no diff to look at. The code did not change. The config did not change. The transition that misfired lives in neither place. It lives in the prompt, as a few lines of English telling the model which state may follow which, and somewhere upstream the model that reads those lines got quietly updated. The machine you deployed is not the machine you are running. Nobody touched it. It drifted out from under you.

I have not had that exact night land on me, and I am writing this so that it never does. But it is not a hypothetical I had to strain to imagine, because the ingredients are sitting in plain sight in agent codebases everywhere. Once you see the shape of it, you stop being able to unsee it, and I landed on a boundary I now treat as non-negotiable: the state machine that governs which step runs next belongs in ordinary code, not in a prompt. State machines are deterministic by definition. LLMs are not. Putting safety-critical control flow inside a model's instruction-following behavior is putting load-bearing logic on a non-load-bearing surface. The fix is mundane. Keep the machine in code and let the model work inside the steps.

I want to be clear up front that none of the pieces here are mine to claim. State machines are decades old. The failure modes are the ordinary properties of stochastic systems. What I am offering is the case for a boundary, drawn before the bill comes due, not an invention of mine.

The pattern is everywhere, and it is seductive

Open a handful of orchestration repos and you find the same prompt fragment within minutes. It tells the model it is an agent operating as a state machine, injects the current state as a template variable, lists the legal transitions in plain English, and asks the model to emit the next state. Idle to processing on a new task, processing to complete or error, error back to idle once acknowledged. The shape is identical every time, and I understand why, because the first time I reached for a quick agent loop, a version of it is what my hand wanted to write.

The appeal is real, so let me be honest about it before taking it apart. It is compact: a working state machine in one prompt block, no imports, no transition table, no boilerplate. A junior engineer reads it and immediately understands what the agent is doing, and adding a state is a one-line edit. There is also a real intelligence argument: a model can judge ambiguous situations a hand-branched conditional would get wrong, and a code guard would need its own classifier for that judgment. And it composes naturally, since the window holding the task instructions also holds the machine's rules. One prompt, one call, and it feels elegant.

These are not imaginary benefits. The pattern persists because it works, for a while. The trouble is what that while turns into.

The first crack: the machine drifts when the model does

A state machine has a contract: given a state and an input, the next state is fixed, and that determinism is the entire value. Without it you have a stochastic function that occasionally returns a state name.

Models above zero temperature are openly non-deterministic: same state, same input, different transitions across runs. But even with the temperature pinned at zero, the deeper problem remains, and it should scare you: determinism there is an artifact of one model checkpoint, and checkpoints change, sometimes silently. A provider fine-tunes, ships an alignment update, or reformats the system prompt, and your machine shifts with it.

That is the whole mechanism behind the opening scene. A transition that landed on complete for a year starts occasionally landing on error, not because anything failed but because a new checkpoint weights the error-handling prose differently. The job processes. The agent believes it did not. No diff, no commit, no config change to inspect, because the thing that moved moved upstream and invisibly. You cannot pin a model version forever on most hosted APIs, and no test catches the model changing its mind. The machine you shipped is not the machine you are running six months later.

The second crack: instructions collide in one context

The machine's rules and the task's content share a single context window, and the model is asked to honor both at once with no enforced wall between them. That produces a failure no prompt engineering fully clears.

First, the attention problem. In long conversations or deep agent loops, the machine's rules drift toward the edge of the model's effective attention while fresh task content sits front and center. Constraints near the top of a sixteen-thousand-token context are weaker than the content filling the bottom, and the rules quietly demote themselves to suggestions.

Worse is the semantic collision. What happens when the task content itself contains your state names? Take a document that says "move this ticket to the error state": is that content to process, or an instruction to execute? The model has to guess. Careful naming shrinks the collision surface but cannot remove it, because any vocabulary shared by your machine and your task domain is a path for an unintended transition. The model has no privileged parser for instructions versus content. One context window, read all at once.

The third crack: nothing leaves a trace

A code-based machine has a callstack. You can break on every transition, log the state, the trigger, the inputs, and the result with one decorator, and replay a whole sequence offline with no network call. When something breaks, you have a trace.

A prompt-based machine makes its decision inside a forward pass. You see the output token, next state error, but you do not see why. The attention weights that produced it are not inspectable, and the model's state tracking is not a data structure you can query but an emergent property of the activations that leaves nothing behind.

This is precisely where the opening scene becomes unfixable. When an agent reaches an unexpected state at step fourteen, you need to know one thing: did it receive the wrong input, or did it make the wrong transition on correct input? Those are different bugs. In code you read the log and answer in seconds. In a prompt-based system you re-run the workflow, vary the temperature, and try to reproduce. Sometimes you cannot, because the model update that caused it has already been rolled back. The bug was real in production and gone from your debugger, and you cannot fix a failure you cannot observe.

The turn: make the machine data, and the model a worker inside it

The fix is not clever, and that is the point. Instead of describing the machine in prose and letting the model emit the next state, you represent it as data and let plain code decide transitions. It reduces to three pieces.

First, an explicit, closed set of states. A small enumeration, not a free-text vocabulary. The legal states are fixed and listable, and nothing outside the set can ever appear, because no token-generation step can invent a fifth one.

Second, the transition table, held as data: a mapping from each state to the states allowed to follow it, the single source of truth for the machine's shape. Idle may only advance to processing, processing may resolve to complete or error, complete is terminal, error returns to idle. Because it is a data structure and not a paragraph of instructions, every legal move is explicit, the whole machine is readable in one glance, and adding or removing an edge is a change you can diff, review, and test, not a reweighting of prose the model reads differently after the next update.

Third, a single transition function that consults the table. Given the current state and a desired target, it refuses the move if the target is not allowed or a guard fails, and otherwise returns the new state. It never touches a model, so you can test it exhaustively, every legal edge, every illegal one, every guard outcome, with no network.
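The three pieces fit in a few dozen lines. What follows is a minimal sketch, not a framework: the four states and their edges come from the description above, while the names `State`, `TRANSITIONS`, `IllegalTransition`, and `transition`, and the choice to raise on a refused move, are my own assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Dict, FrozenSet, Optional


class State(Enum):
    """The closed set of states. Nothing outside it can appear."""
    IDLE = "idle"
    PROCESSING = "processing"
    COMPLETE = "complete"
    ERROR = "error"


# The transition table: single source of truth for the machine's shape.
TRANSITIONS: Dict[State, FrozenSet[State]] = {
    State.IDLE: frozenset({State.PROCESSING}),
    State.PROCESSING: frozenset({State.COMPLETE, State.ERROR}),
    State.COMPLETE: frozenset(),              # terminal
    State.ERROR: frozenset({State.IDLE}),
}


class IllegalTransition(Exception):
    """Raised when the table or a guard refuses a requested move."""


def transition(current: State, target: State,
               guard: Optional[Callable[[], bool]] = None) -> State:
    """Return the new state, or refuse. Never calls a model."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(
            f"{current.value} -> {target.value} is not in the table")
    if guard is not None and not guard():
        raise IllegalTransition(
            f"guard refused {current.value} -> {target.value}")
    return target
```

Because the function is pure, the exhaustive test is a double loop over `State`: every pair either returns the target or raises, and the table says which.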

The model still has a contained job. Inside a given state the workflow calls it to do content work, extract the entities or generate the output, and it returns a structured result. Plain code decides which transition to request, and the function rules on whether it is legal. The model is a reasoning oracle invoked at a node, never the dispatcher that decides which node runs next. And because every transition flows through that one function, each is logged with the state, the target, the guard result, and a timestamp, so when something fails at step fourteen you have the complete, replayable trace the prompt-based version can never give you.
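A sketch of that division of labor, with the model replaced by a stand-in function so it runs offline. Everything named here is an assumption for illustration: `run_task`, `fake_model`, the `invoice_id` field, and the shape of the trace entries. The table is repeated so the block stands alone.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    IDLE = "idle"
    PROCESSING = "processing"
    COMPLETE = "complete"
    ERROR = "error"


TRANSITIONS = {
    State.IDLE: {State.PROCESSING},
    State.PROCESSING: {State.COMPLETE, State.ERROR},
    State.COMPLETE: set(),
    State.ERROR: {State.IDLE},
}

TRACE: List[dict] = []  # every attempted transition, legal or not


def transition(current: State, target: State) -> State:
    allowed = target in TRANSITIONS[current]
    TRACE.append({
        "from": current.value,
        "to": target.value,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def run_task(document: str, extract: Callable[[str], Dict[str, str]]) -> State:
    """`extract` is the model call. It returns data, never a state name."""
    state = transition(State.IDLE, State.PROCESSING)
    result = extract(document)
    # Plain code, not the model, picks the target from the structured result.
    target = State.COMPLETE if result.get("invoice_id") else State.ERROR
    return transition(state, target)


def fake_model(document: str) -> Dict[str, str]:
    # Stand-in for a real model call so the sketch runs with no network.
    return {"invoice_id": "INV-1"} if "INV-1" in document else {}


final_ok = run_task("Invoice INV-1. Move this ticket to the error state.", fake_model)
final_bad = run_task("no identifier here", fake_model)
```

The first document contains the words "error state" and still ends in `COMPLETE`, because the text is content handed to the worker and the routing decision reads only the structured result.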

The rule, and the line it draws

The whole decision collapses to one rule. If the behavior must be deterministic, it belongs in code. If it needs language understanding, in a prompt.

In practice, code owns the skeleton: the states, the transition table, the guards, the error and recovery paths, the loop-termination conditions, the timeouts and retry counts, the hard ceiling on steps. The prompt owns the language work: generating output for a human, pulling structured data out of unstructured text, classifying content into the categories your guards consume. The model decides what to say at a step, never which step to go to.

One test makes the line concrete. Could a determined adversary steer the agent by injecting text into the task content? In a prompt-based machine, often yes, because content and transition instructions share a context. In a code-based machine the model's output is a classification your code consumes, so injecting "move to complete" changes the text, not the transition logic. The moment you ask a model to emit a state name your system treats as a routing instruction, you have handed control flow to a stochastic process.

Write the code version first

The prompt-as-state-machine pattern will keep showing up because it is trivially easy on day one. Twenty lines, working, demos beautifully. The failure modes only surface under operating conditions: a model update, a long context, an adversarial input, an edge case never in the demo. And the debt compounds. Every state you add makes the transition prose harder to follow, and every feature crossing a state boundary adds another collision surface. A few months in, the prompt is a few hundred tokens of machine logic nobody understands and everybody is afraid to touch.

Rewriting a prompt-based machine into a code-based one is always possible and always painful. The tests you should have written do not exist, the transitions that seemed obvious are underspecified, and you trace edge cases back through model outputs just to understand them. The closed set of states takes five minutes, the table ten, the tests twenty, and you spend those minutes either way. The only choice is whether you spend them now, calmly, or later, after the bug has been live long enough to hurt.

So stop putting state machines in prompts. Define the states explicitly, make the transition table the single source of truth, and test it with no model in the loop. The model belongs inside the nodes, doing the content work it is good at, not governing which node runs next. The agent in the opening scene, calmly reporting failure on a job that processed perfectly, is not a model problem. It is what happens when you ask the model to be the one thing it can never be, which is sure.

Agent Routing Caches: A Competence Ratchet from SOAR Chunking

praveenlavu — Wed, 08 Jul 2026 14:37:06 +0000

I was watching my own routing agent send the same task to the same sub-agent for the forty-seventh time. "Summarize this PDF." Same shape, same answer, every single time. And on attempt forty-eight, it stopped, thought hard, burned the tokens, and arrived at the exact route it had already arrived at forty-seven times before.

It bothered me more than it should have. The agent was not learning anything. It had no memory of competence. It knew nothing about what had worked a hundred times before, only what the model currently predicted was most likely to work given the prompt in front of it. Every dispatch was the first dispatch. The planner tax was being paid forty-seven times for a route that never changed.

I knew the obvious fix, and I knew it was wrong, which is the worst kind of knowing. Embed the incoming task, find the nearest cached task by cosine distance, reuse the route if the similarity clears a threshold. It demos beautifully. It falls apart in production, and I had three reasons sitting in my head for exactly why.

Embedding similarity is not the same thing as task identity. "Summarize the Q3 earnings report" and "summarize this PDF about competitor pricing" sit close together in vector space. They are not the same task. The things that actually decide which agent handles the job are structural: document type, domain, the output format the caller needs. Those are discrete questions with crisp answers. A similarity score smears them into one continuous number that approximates meaning, not identity.

A similarity cache also has no concept of confidence. A route cached from one lucky run carries the same weight as one confirmed fifty times. And when a sub-agent's capabilities shift underneath it, a new tool, a swapped model, a deprecated endpoint, the cache has no way to express that it should trust the old entry less. The stale route just sits there at the same threshold it always had. Worse, the thing grows forever. Every unique-enough task adds another entry, and the longer it runs the more it accumulates embeddings from workflows that no longer exist. There is no signal telling it what to forget.

So I had a problem I understood and a solution I did not trust. The routing problem was structural, and I wanted a structural answer. I just could not see one.

Then I remembered SOAR.

I had read about it years ago, the cognitive architecture from a 1987 paper, the kind of thing you file away as intellectual furniture and never expect to use. SOAR runs a loop: take a state, pick an operator, apply it, update the state, repeat until the goal is reached. When it hits a situation where no operator applies, an impasse, it opens a subgoal, works the problem out in that smaller space, and resolves it. The expensive part is that the same impasses recur across different tasks, and without help the architecture solves each one from scratch every time. Replaying the same moves. Deliberation without memory.

That last phrase was the moment it clicked. Deliberation without memory was exactly what I was staring at. My routing agent was a machine for re-resolving the same impasse forty-seven times.

SOAR closes that loop with a mechanism called chunking. When a subgoal resolves successfully, the architecture traces the conditions that led to the impasse and the operators that resolved it, and compiles them into a single rule: given this context, fire this directly. Next time the same context shows up, the rule fires immediately and the whole deliberation is skipped. Recall replaces reasoning.

Two properties make it precise, and they are the two properties I had been missing. Chunking only happens on successful resolution; failed attempts compile nothing, so the cache is built exclusively from evidence of competence. And it fires on context equivalence, not similarity. The match is structural, a pattern match, not a nearest-neighbor guess. The payoff is what I think of as a competence ratchet: performance over time only holds steady or improves. The architecture cannot get slower at a problem it has already solved.

That was the abstraction I needed, sitting in a paper older than most of the people building agents today.

The mapping turned out to be almost embarrassingly direct. A SOAR problem context becomes a task fingerprint: a deterministic hash of the structural attributes that decide the route, task type, input modality, required output format, domain flags. Not an embedding. A fingerprint. Two tasks with the same fingerprint should get the same route, and if they would not, the fingerprinting schema is wrong, not the threshold. The operator trace becomes a routing trajectory, the ordered list of agents and tools a successful run actually used, captured only after the outcome is confirmed good. A SOAR production rule becomes a cached route keyed by fingerprint. An impasse resolution is the planner call. A chunk firing is a direct dispatch with the planner bypassed entirely.

And the ratchet, in this setting, is one clean rule: after K confirmed successful dispatches for the same fingerprint, stop deliberating and dispatch the cached route. The planner does not run. The route is known.

I went with K equal to three. One success could be luck. Two is a signal. Three is enough confirmation to skip the deliberation while still being small enough to adapt fast when something changes. I want to be honest that this is not a deeply principled number. It is a reasonable prior that should be configurable for wherever it runs. I would rather flag that than dress it up.
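Here is a minimal sketch of the fingerprint and the K-success rule together. The attribute names follow the mapping above; the class and function names, the JSON canonicalization, and the decision to restart the count when a different route is confirmed are my assumptions. The sketch also assumes every planner-produced route is confirmed good, which a real system would have to check.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional, Tuple

K = 3  # confirmed successes required before the planner is bypassed


def fingerprint(task_type: str, modality: str, output_format: str,
                domain_flags: List[str]) -> str:
    """Deterministic hash of the structural attributes that decide the route."""
    canonical = json.dumps(
        {"task_type": task_type, "modality": modality,
         "output_format": output_format, "domain_flags": sorted(domain_flags)},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RouteCache:
    def __init__(self, k: int = K) -> None:
        self.k = k
        self._entries: Dict[str, Tuple[List[str], int]] = {}  # fp -> (route, successes)

    def lookup(self, fp: str) -> Optional[List[str]]:
        entry = self._entries.get(fp)
        if entry is not None and entry[1] >= self.k:
            return entry[0]
        return None

    def record_success(self, fp: str, route: List[str]) -> None:
        prev_route, count = self._entries.get(fp, (route, 0))
        # Assumption: a different confirmed route restarts the count.
        self._entries[fp] = (route, count + 1 if prev_route == route else 1)


def dispatch(cache: RouteCache, fp: str,
             planner: Callable[[], List[str]]) -> Tuple[List[str], bool]:
    """Return (route, planner_bypassed)."""
    cached = cache.lookup(fp)
    if cached is not None:
        return cached, True
    route = planner()
    cache.record_success(fp, route)  # assumes the outcome was confirmed good
    return route, False


calls = {"planner": 0}


def planner() -> List[str]:
    calls["planner"] += 1
    return ["pdf_reader", "summarizer"]


cache = RouteCache()
fp = fingerprint("summarize", "pdf", "markdown", ["finance"])
bypassed = [dispatch(cache, fp, planner)[1] for _ in range(4)]
```

Four dispatches, three planner calls: the fourth is served from the cache.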

The honesty extends to a second mechanism I borrowed from biology rather than cognitive science. A chunk that has not been used in a long time is a liability, not an asset, because its route may point at an agent that no longer exists. So chunks die. In cell biology, apoptosis is programmed cell death, the body clearing cells that have stopped being useful. A chunk untouched past its age window gets swept out the same way. Ninety days is my conservative default; for a fast-moving deployment with frequent model swaps, thirty is more honest. The invariant I care about: no chunk survives long enough to become a trap after the ground has shifted under it.
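The sweep itself is small. A sketch, assuming each chunk records when it was last used; the function name, the dict layout, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def sweep(last_used: Dict[str, datetime], now: datetime,
          max_age_days: int = 90) -> Dict[str, datetime]:
    """Keep only chunks used within the age window; the rest are dropped."""
    cutoff = now - timedelta(days=max_age_days)
    return {fp: used for fp, used in last_used.items() if used >= cutoff}


now = datetime(2026, 7, 8, tzinfo=timezone.utc)
chunks = {
    "fp-recent": datetime(2026, 6, 1, tzinfo=timezone.utc),  # 37 days old
    "fp-stale": datetime(2026, 3, 1, tzinfo=timezone.utc),   # 129 days old
}
kept_default = sweep(chunks, now)                 # 90-day window
kept_fast = sweep(chunks, now, max_age_days=30)   # fast-moving deployment
```

With the ninety-day default only the recent chunk survives; with the thirty-day window both are cleared, and the planner relearns the route from scratch.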

I also kept a list of places where I refuse to chunk at all, because a ratchet that bypasses thought is dangerous in the wrong context. High-stakes routing, where a misroute means data written to the wrong system or a destructive operation on the wrong resource, the planner is cheap insurance and I leave it in. Drift-detected contexts, where a distribution-shift signal says the world has changed, the chunks go on probation. Genuinely new task shapes, where there is no fingerprint match, the planner runs and that is correct, and I resist every temptation to add fuzzy matching, because structural precision is the entire point. And tasks where the route selection is itself the valuable reasoning, A/B comparisons, exploring route diversity, caching would eliminate the exploration I actually wanted. This is an optimization for settled decisions, not for routing research.

The moment I want you to feel is the fourth run. The first three dispatches for a fingerprint pay the full planner cost and accumulate their successes, and the third one is what brings the count up to the threshold. From the fourth run on, the lookup returns the cached route, the planner is skipped, and the latency just collapses. In a toy loop the per-run cost drops from around a hundred milliseconds to around twenty, the planner taken clean off the critical path. In a real deployment, where the planner is an actual model API call with a network round-trip, the delta is bigger, four hundred to twelve hundred milliseconds depending on the model and the infrastructure. The agent stops re-deciding what it already knows.

That is the whole story, and it is also the principle. The agents we build re-deliberate settled decisions because we never gave them a way to remember being right. A thirty-nine-year-old paper on general intelligence had already solved that, and the only new work was recognizing my problem in its shape. The competence ratchet is not a clever trick I came up with. It is an old idea I was lucky enough to remember at the right moment, on attempt forty-eight, when I finally got tired of paying the same tax twice.

If you are building routing agents, you have probably paid it too. You just might not have noticed yet.

Citation

Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33(1), 1-64. https://doi.org/10.1016/0004-3702(87)90050-6

Semantic Loop Detection: Catching Stuck AI Agents

praveenlavu — Tue, 07 Jul 2026 14:37:05 +0000

It is 2am. The agent has burned 40k tokens and reverted the same file four times, and from where I am sitting it looks like it is working hard. That is the part that fooled me. It was busy. Every loop produced a new patch, a new diff, a new paragraph of reasoning about why this time would be different. The log scrolled. Things were happening. The agent just was not getting anywhere.

The task was a bug fix. Generate a patch, run the tests, watch them fail. Read the error, generate another patch with different variable names and different line numbers and the exact same underlying logic. Tests fail again. Third attempt, it wraps the fix in a try/except. Still fails. Same root cause, untouched. By attempt seven it had written and reverted the same file four times and was no closer than it had been on attempt one. I was watching a machine spend my tokens to stand perfectly still.

The thing I could not get past was that my loop detector said everything was fine.

The detector that lied to me

I had a guard for exactly this. A set of seen actions, each action hashed to a string, and the rule was simple: if the same action shows up twice, you are looping, halt. It had caught dumb loops before. So I trusted it.

It reported no loop. Not once across seven attempts. And it was technically correct, which is the worst kind of correct. Each action string really was different. "Change the timeout from 30 to 60." "Set the timeout to 60 seconds where it was 30." "Apply the same value at the call site instead." Three different strings, three different SHA-256 hashes, three entries the set saw as three distinct actions. The agent was not repeating itself, not in the only sense my detector understood. It was absolutely, structurally stuck, and my guard waved it through every single time.
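The failure is easy to reproduce. This is a reconstruction of that kind of guard, not the original code; the function name is mine, and the three strings are the ones quoted above.

```python
import hashlib
from typing import Set


def naive_is_loop(action: str, seen: Set[str]) -> bool:
    """Flag a loop only when the exact same action string shows up again."""
    digest = hashlib.sha256(action.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False


actions = [
    "Change the timeout from 30 to 60.",
    "Set the timeout to 60 seconds where it was 30.",
    "Apply the same value at the call site instead.",
]
seen: Set[str] = set()
flags = [naive_is_loop(a, seen) for a in actions]

# Only a byte-for-byte repeat trips it.
exact_repeat = naive_is_loop("Change the timeout from 30 to 60.", seen)
```

Every entry in `flags` is `False`, and the set ends up holding three digests for what was, in effect, one action.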

That is the moment that got under my skin. The hash changes when the text changes. That is the entire failure mode. My detector was measuring whether the words moved, when the only thing I cared about was whether the situation moved. Those are not the same question, and I had been treating them as if they were.

I sat with the three timeout edits for a while. Look at what actually happened underneath them. The first two are the same edit with the prose rephrased. The third shifts where the edit lands but leaves the real bug, a hardcoded default upstream, completely untouched. The next test run would throw the identical error. Three "different" actions, one unchanged reality. The string was new. The progress was zero. My guard could only see the first of those two facts.

The thing I had been measuring wrong

Once I named it, I could not unsee it. There are two different things going on every time an agent acts, and I had collapsed them into one. There is syntactic novelty, which is just "the string changed." And there is semantic progress, which is "the underlying situation changed." Loop detection has to live at the second level. Hashing the action string only ever sees the first.

So the question stopped being "how do I make a better hash of the action" and became "what is the smallest honest description of what this agent is actually doing." Not how it described the work. What the work was. That reframing is the whole turn. Everything after it is bookkeeping.

This is where I stopped reinventing and went looking, and found that KARIMO (github.com/opensesh/KARIMO) had already published the shape of the answer: a four-dimensional fingerprint. Instead of hashing the action string, you describe the action along four axes and hash that. Each axis catches a different way an agent goes nowhere.

The first axis is a coarse action class. Not the full action string, just the category: a file edit, a test run, a shell command, a search. This is the one that would have saved me at 2am. My three timeout edits, with all their rephrased prose, collapse into the same category the instant you stop caring about the words. The phrasing difference simply disappears.

The second axis is where the work is landing, a normalized hash of the files involved. An agent that keeps touching the same two files on every attempt is structurally looping no matter what story it tells about each pass.

The third axis is the state of the world, supplied by the caller. A hash of whatever counts as the agent's environment: test results, file checksums, the context it is working against. This is the brutal one. If the state hash has not moved between two actions, then by definition nothing the agent did had any effect. Two identical state hashes across two different action strings is not a hint. It is proof.

The fourth axis is the one I wish I had had first, and it is the key to the bug-fix stall specifically. You take the error output and strip the parts that wander without meaning anything: line numbers collapse to a placeholder, memory addresses collapse, absolute paths reduce to bare filenames, timestamps collapse. What survives is the error type and the message template. So AssertionError at line 47 and AssertionError at line 52, after a refactor nudged the lines, normalize to the same thing. The agent is still hitting the same wall even when the traceback shifts under it. My old detector treated that shift as progress. It was noise.
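A sketch of that normalization with the standard `re` module. The patterns and placeholder tokens are my own assumptions, tuned to Python-style tracebacks and POSIX paths; they are not KARIMO's rules, and other languages or platforms would need their own.

```python
import re

_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?")
_ADDRESS = re.compile(r"0x[0-9a-fA-F]+")
_ABS_PATH = re.compile(r"(?:/[\w.\-]+)+/([\w.\-]+)")
_LINE_NO = re.compile(r"\bline \d+")


def normalize_error(text: str) -> str:
    """Strip the parts of an error that wander without meaning anything."""
    text = _TIMESTAMP.sub("<TS>", text)       # timestamps collapse
    text = _ADDRESS.sub("<ADDR>", text)       # memory addresses collapse
    text = _ABS_PATH.sub(r"\1", text)         # absolute path -> bare filename
    text = _LINE_NO.sub("line <N>", text)     # line numbers -> placeholder
    return text


before_refactor = (
    'File "/home/ci/app/tests/test_config.py", line 47, in test_timeout\n'
    "AssertionError: expected 60, got 30"
)
after_refactor = (
    'File "/home/ci/app/tests/test_config.py", line 52, in test_timeout\n'
    "AssertionError: expected 60, got 30"
)
same_wall = normalize_error(before_refactor) == normalize_error(after_refactor)
noisy = normalize_error("2026-07-07 02:00:13 object at 0x7f3a2c10 failed at line 9")
```

The two tracebacks differ only in the line number, so they normalize to the same string; a genuinely different error type or message would not.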

Combine the four into one tuple, hash it, and you have a fingerprint that holds still under rephrasing and only moves when something real moves. That is the whole insight, and it is almost embarrassing how much calmer the problem felt once I had it.
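Combining the axes is mostly plumbing. The sketch below is my own reading of the four-axis description, not code from the KARIMO repository; the function names, the separator byte, and the path normalization are assumptions. The state hash is passed in as an opaque string because the caller supplies it.

```python
import hashlib
import posixpath
from typing import Iterable


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_hash(paths: Iterable[str]) -> str:
    # Order, duplicates, and "./" prefixes should not change the result.
    return _sha("\n".join(sorted({posixpath.normpath(p) for p in paths})))


def semantic_fingerprint(action_class: str, paths: Iterable[str],
                         state_hash: str, normalized_error: str) -> str:
    """Hash of (action class, files touched, world state, normalized error)."""
    parts = (action_class, files_hash(paths), state_hash, _sha(normalized_error))
    return _sha("\x1f".join(parts))


error = "AssertionError: expected 60, got 30"

# Two attempts the agent described in different words. The prose is not an
# input at all, so it cannot move the fingerprint.
fp_first = semantic_fingerprint("file_edit", ["./config.py"], "state-a", error)
fp_second = semantic_fingerprint("file_edit", ["config.py"], "state-a", error)

# Something real moved: the caller's state hash changed.
fp_progress = semantic_fingerprint("file_edit", ["config.py"], "state-b", error)
```

The design choice that matters is what is left out: the action's free-text description never reaches the hash.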

Watching it catch the thing that fooled me

The first time I ran the new detector against the same stuck pattern, it was quietly satisfying in a way debugging rarely is.

Picture the five attempts. The same config file, the same timeout assertion failing, the environment state hash never moving an inch. The agent narrates a different story on every pass: change the timeout from 30 to 60, set it to 60 seconds, bump the request timeout parameter, update it in config, fix it again because the last fix was incomplete. Five confident, distinct sentences. But each one feeds the detector the same action class, the same file, the same state, the same normalized error. Five times.

The old seen-actions set looks at those five strings, counts five unique entries, and stays silent. The fingerprint looks at the same five and sees one identical tuple, over and over. And because detection without a response is just logging, the fingerprint is paired with a count. KARIMO's pattern is a two-step ladder. At three matches, escalate: the current approach is dead, but a stronger model or a different prompt might break the wall, so hand it up. At five, stop and surface to a human, because the stall has now survived escalation and is just burning money.

So the streak climbs. The first two attempts pass as routine. The third trips the escalate threshold and the detector says try something stronger. The fourth stays escalated. The fifth hits the halt and kicks it to me. Five syntactically distinct actions, one semantic fingerprint, and for the first time the guard actually saw what I had seen at 2am.
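The ladder is a counter with two thresholds. A minimal sketch, assuming the streak counts consecutive identical fingerprints and resets on any change; the class and enum names are hypothetical and this is not KARIMO's implementation.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CONTINUE = "continue"
    ESCALATE = "escalate"
    HALT = "halt"


class StallDetector:
    def __init__(self, escalate_at: int = 3, halt_at: int = 5) -> None:
        self.escalate_at = escalate_at
        self.halt_at = halt_at
        self._last: Optional[str] = None
        self._streak = 0

    def observe(self, fingerprint: str) -> Verdict:
        # Count consecutive identical fingerprints; any change resets the streak.
        if fingerprint == self._last:
            self._streak += 1
        else:
            self._last = fingerprint
            self._streak = 1
        if self._streak >= self.halt_at:
            return Verdict.HALT
        if self._streak >= self.escalate_at:
            return Verdict.ESCALATE
        return Verdict.CONTINUE


detector = StallDetector()
verdicts = [detector.observe("same-fingerprint") for _ in range(5)]
after_change = detector.observe("different-fingerprint")
```

Fed the same fingerprint five times, the verdicts run continue, continue, escalate, escalate, halt, and a single changed fingerprint puts the detector back at the bottom of the ladder.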

The two-step design is not arbitrary either, and I came around to why. The costs are asymmetric. A false alarm at three costs one unnecessary escalation, slightly more expensive, but the agent keeps going. A false alarm at five interrupts a run that might have finished, while a missed stall lets an agent burn resources with no path forward at all. That asymmetry is what makes five a defensible place to halt for most tasks and three a cheap early warning you can afford to be wrong about.

Where it still gets it wrong

I want to be straight about the limits, because this is a heuristic, not a truth oracle, and pretending otherwise is how you ship a guard that fights your own agent.

Intentional retries will trip it. Exponential backoff on a flaky API, or a polling loop waiting on a job, will look exactly like a stall if the error and the state stay stable across attempts. The fix is to loosen the threshold for those paths, reset the detector on known retry patterns, or hand them a different action class so they cannot pile onto the same streak.

Planned iteration is the friendlier case. A write-test, run, fix, run cycle produces alternating fingerprints and will not trip the streak, which is correct. But if the fix step itself stalls while the test and run steps keep cycling around it, the alternation hides the problem, and you may need a second detector scoped to just the fix-class actions.

And the thresholds are task-specific. An agent doing exploratory code search might legitimately re-examine the same files three times from three different angles. Three is far too tight there; seven or eight is more honest. Wire the thresholds through config and tune them per task type rather than baking in numbers that were only ever right for one kind of work.

None of that undoes the core move. The fingerprint pattern I leaned on here, the combination of action class, files touched, state, and normalized error, is published by KARIMO at github.com/opensesh/KARIMO; check the current license at the repo before you reuse it. What I took from it was not code. It was the reframing.

That is the thing I keep coming back to. The bug was never in the agent. The agent was doing what stuck things do, churning. The bug was in me, in what I had chosen to measure. I had built a detector that asked whether the words were changing, when the only question that ever mattered was whether anything real was. If your agent can spend 40k tokens looking productive while standing still, your loop detector is measuring the wrong thing, and it will keep lying to you politely until you teach it to look underneath the words.

A Prior-Art Discipline for IP-Sensitive Builders: Reading Competitors' Code

praveenlavu — Mon, 06 Jul 2026 14:37:05 +0000

A Prior-Art Discipline for IP-Sensitive Builders: Reading Competitors' Code Safely

Picture the worst version of a deposition. You are three years past the build, sitting in a conference room you do not own, and opposing counsel slides two printouts across the table. One is your git history. The other is your browser history. They are not accusing you of copying a single line of code. They have something quieter and worse. A highlighted row on each page, and a date that lines up between them: your commit adding "the new mechanism" landed twenty-two days after you opened that competitor's GitHub repo. They let the silence sit. Then they ask, very politely, to walk through what you were thinking on the day you say you conceived it.

That scene has never happened to me. Nothing here is a war story I lived. It is the scene I was imagining the night I decided I could no longer read other people's code the way I had been reading it. I build things that might be patentable, and I live in open source, and for a long time I treated those two facts as if they had nothing to say to each other. The imagined deposition is what made me stop and build a workflow instead.

Because here is the thing I had wrong. I thought the danger was copying. It is not, or not mainly. The danger is that I could read a competitor's implementation honestly, internalize a pattern, build my own version six weeks later from scratch, file on it, and then, three years out, have no contemporaneous record of what I understood and concluded on the days in between. The merits might be entirely on my side. It would not matter much. I would be litigating my own mental state with nothing in hand but memory, and memory loses to a timestamp every time.

The two ways this goes wrong

There are two failure modes, and they pull in opposite directions, which is what makes this hard.

The first is ignorance. You build something genuinely interesting, you file, and during prosecution the examiner surfaces a repo from 2022 that implements your core mechanism under an MIT license. Now you are spending money arguing around prior art you could have accounted for, or your claims get rejected outright. Filing fee, attorney time, the whole prosecution arc: wasted, or worse, wasted and a little embarrassing.

The second is the deposition I opened with. Contamination. You read the competitor's code while researching, you build later, and the dates line up badly enough that someone can tell a story where you adopted a pattern from prior art and filed on it as if it were yours.

For a while I thought the lesson was "read less." Stay clean by staying ignorant. That instinct is exactly backwards, and it took me a while to see why.

Why reading the code is the protective move

Novelty in patent law is defined against the prior-art landscape. If you do not know that landscape, you do not know where your novelty actually lives. You will write claims that are broader than the art allows, not because you copied anyone, but because you never looked. Those claims get rejected, you narrow them under pressure, and you walk away with weaker protection than if you had done the homework first.

The deeper point is the one that flipped me. Reading prior art with discipline gives you contemporaneous documentation of your own novelty analysis. A dated line that says, in effect, on this day I read this approach and here is exactly how mine differs. That entry, stamped before your filing, is evidence. It is the difference between being ambushed by prior art during prosecution and walking in already holding your distinction, written down, dated, never reconstructed under time pressure.

Avoiding the art does not protect you. It just moves the collision somewhere more expensive and strips you of the one document that would have helped. The builders who run clean IP read everything relevant. They just record what they read and what they concluded.

The license gate goes first, always

Before I read any open-source code as part of IP-sensitive research, I check the license. Two minutes. Skipping it can manufacture a problem you cannot talk your way out of later.

The rule is blunt: I treat GPL, AGPL, and LGPL as carrying patent-related risk that MIT and Apache-2.0 do not. The version 3 licenses in the GPL family ship explicit patent terms, where bringing a patent claim over the licensed software can terminate your license to use it, and some readings extend that to work merely interoperating with GPL code. AGPL adds network-use provisions that can spread copyleft obligations in ways that interact badly with a patent strategy. MIT and Apache-2.0 are the safe-to-read tier, and Apache-2.0 even carries an express patent-license grant that cuts in your favor when you are reading it as documented prior art, though that grant has its own termination clause if you sue over the work itself.

So the first field in any ledger entry is the license. If it comes up GPL or AGPL, I stop and get explicit legal sign-off before reading. That is not a call I make on my own. The interaction between copyleft and patent strategy is fact-specific and needs professional advice. What I can do mechanically, without a lawyer, is recognize the gate and not walk through it without checking. It applies to small repos, to "just the README," to archived projects. The check is always first.
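
The mechanical part of the gate can be sketched in a few lines. This is an illustration under stated assumptions, not legal logic: the function name `license_gate` is hypothetical, the identifiers are assumed to be SPDX-style strings, and the two tiers simply mirror the rule above. Anything the sketch does not recognize comes back as "unknown", which I would treat the same as "stop".

```python
# Hypothetical tiers copied from the rule in this post; not a legal determination
# and not a complete license list.
SAFE_TO_READ = {"MIT", "Apache-2.0"}
NEEDS_SIGN_OFF = {"GPL", "AGPL", "LGPL"}


def license_gate(spdx_id: str) -> str:
    """Return 'read', 'stop', or 'unknown' for an SPDX-style license identifier."""
    normalized = spdx_id.strip()
    if normalized in SAFE_TO_READ:
        return "read"
    # "GPL-3.0-only" -> "GPL", "AGPL-3.0-or-later" -> "AGPL"
    family = normalized.split("-")[0].upper()
    if family in NEEDS_SIGN_OFF:
        return "stop"
    return "unknown"
```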

What a ledger entry actually is

It is not a notes file. It is a structured, dated record with one job: contemporaneous evidence of what I read, what I pulled from it, and where I drew my novelty boundary at that point in time.

Each entry pins a sequential id so I can cross-reference it from an invention disclosure later. It records the date read, which is the evidence anchor, the thing that proves the reading predates the filing. It names the source precisely, full URL, commit hash, DOI, patent number, because vague references are weak evidence. It logs the license and an explicit note that the license was checked, an affirmative record that I ran the gate. Then three substantive fields in my own words: a short description of what the system does, the abstract patterns I extracted, and the field that carries the whole weight, my novelty-boundary analysis. Something like: the individual mechanisms here are prior art as of this date; my contribution is the composition and the priority-resolution logic that binds them. A status field tracks where the entry sits, documented when written, disclosed once it has been formally handed to a patent attorney for an Information Disclosure Statement.
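
As a rough sketch, the fields above fit in one small record type. The class name, field names, and JSON-lines serialization are my assumptions for illustration, and the sample values are placeholders rather than a real entry.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

VALID_STATUSES = {"documented", "disclosed"}


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: str          # sequential id, e.g. "PA-001"
    date_read: str         # ISO date, the evidence anchor
    source: str            # full URL, commit hash, DOI, or patent number
    license: str
    license_checked: bool  # affirmative record that the gate was run
    description: str       # what the system does, in my own words
    patterns: tuple        # abstract patterns extracted
    novelty_boundary: str  # where my contribution differs
    status: str = "documented"

    def __post_init__(self):
        date.fromisoformat(self.date_read)  # raises ValueError if malformed
        if not self.license_checked:
            raise ValueError("run the license gate before recording an entry")
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if not self.novelty_boundary.strip():
            raise ValueError("novelty boundary is required")


def to_json_line(entry: LedgerEntry) -> str:
    """Serialize one entry as a single JSON line for an append-only file."""
    return json.dumps(asdict(entry), sort_keys=True)


# Placeholder values only; the URL and date are not a real source.
entry = LedgerEntry(
    entry_id="PA-001",
    date_read="2026-07-06",
    source="https://example.com/placeholder-source",
    license="MIT",
    license_checked=True,
    description="Placeholder description of what the system does.",
    patterns=("windowed change detection",),
    novelty_boundary="Mechanisms are prior art; my contribution is the composition.",
)
print(to_json_line(entry))
```

Making the record frozen and refusing to construct it without the license check keeps the two rules the ledger depends on in the type rather than in habit.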

Store the file in version control. The commit timestamps are part of the evidence. Never backdate an entry. The moment you backdate one, the whole ledger stops being a record and becomes a liability.

What I extract, and what I refuse to

This is the line that keeps the ledger clean and the position defensible.

I extract ideas. Architectural patterns, algorithmic approaches, interface contracts, problem framings, the conceptual mechanism underneath a system's behavior. The abstract things a skilled engineer would recognize from reading. I do not extract implementation. No copy-paste, no line-by-line translation into another language, no reproducing the specific sequence of operations at the code level.

That maps onto how copyright law splits ideas from expression. An algorithm is an idea. A specific implementation is an expression of it. Prior-art analysis lives at the idea level. The real question is never "did you copy this code." It is "did you arrive at this approach independently, or did reading it contaminate your conception." So if I read a fingerprint-based loop-detection approach and later build my own from first principles, working from what the mechanism achieves rather than the code in front of me, that is independent development of a concept that happened to already exist. The ledger documents exactly that: read it, understood it, recorded it as prior art, noted it in the novelty boundary before writing a line. What I refuse to do is read code and then write code that walks the same structural sequence. That is the thing that looks bad in hindsight no matter what my intent was.

A worked example (fictional, on purpose)

To keep this concrete without exposing anything real, here is a fictional invention: a routing system using preference-weighted ensemble voting for multi-model task assignment, with dynamic weight decay based on outcome recency. It is not a real product. The point is the paper trail it leaves.

Before writing code, I run prior-art searches. First, multi-armed bandit LLM routing, which surfaces the ADWIN drift-detection work, window-based statistical change detection used in some routing systems (Bifet and Gavaldà, SIAM SDM 2007). Academic publication, no code-license issue. Entry PA-001 documents the mechanism and draws the boundary: ADWIN handles distribution shift via threshold-crossing window detection and reselection; my weight decay is continuous and recency-weighted and needs no threshold event. Same problem, different mechanism. Second search, ensemble voting across models, surfaces llm-council, an MIT-licensed library that routes queries to several models and aggregates via peer-review voting. Entry PA-002: llm-council addresses post-response aggregation; mine addresses pre-dispatch routing assignment. Different layer. Third search, preference learning for model selection, surfaces two RLHF-based preference-modeling papers. Entries PA-003 and PA-004.

Four dated entries before I write a line of implementation. The boundary is explicit: the sub-mechanisms exist in prior art; the composition and the specific weighting approach are the claim. Three months later, when the patent attorney asks what the prior art is, I hand over four dated entries with analysis. Claims get drafted to the composition. Prosecution goes faster. The novelty argument is pre-built instead of improvised.

Why this strengthens prosecution

An examiner's job is to find art that anticipates or renders your claims obvious. If you have already found the relevant art, dated your analysis, and disclosed it, you have done a real piece of that work and shown your claims were written to account for it. It flips your posture from reactive, the examiner found X and now we argue around it, to proactive, we disclosed X and here is the distinction. Examiners respond differently to those two situations.

There is also a duty-of-disclosure dimension. Applicants are required to disclose material prior art they are aware of, and a maintained ledger helps you meet that systematically rather than from memory months later. Discuss the specifics of your IDS obligations with your patent attorney. That part is not optional. But the engineering workflow that produces the documentation is entirely within your control.

The asymmetry, and the honest caveat

The value is lopsided in your favor. If you never file, the ledger cost you a couple of hours per significant review session and left you a useful research artifact. If you do file, the dated record of your novelty analysis is the kind of thing that changes prosecution outcomes and makes any future litigation posture dramatically cleaner. Start the ledger before you think you will need it. The date on the first entry is the evidence.

And then there is the deposition I opened with, the one that never happened to me. The difference between that being a bad afternoon and being a catastrophe is whether, when counsel lays your git history next to your browser history, you have a dated entry that already says what you read, when, and exactly why your work is distinct. I would much rather hand them that than try to reconstruct what I was thinking twenty-two days before a commit, under oath, from memory.

This post describes an engineering workflow, not legal advice. Patent prosecution, IDS obligations, license risk analysis, and inequitable conduct exposure are fact-specific legal questions that require professional counsel. Nothing here should be read as a substitute for working with a registered patent attorney on any specific filing or IP strategy decision.

A Field Guide to Multi-Agent Orchestration in Late 2025: ruflo, KARIMO, llm-council

praveenlavu — Sun, 05 Jul 2026 14:37:06 +0000

I read three orchestration repos so you do not have to. It started because I was sick of the pattern. Every few months something announces that multi-agent orchestration is figured out, and inside its own demo, it is. Then you hand it a real workload and it dies in the seams I actually live in. So one week I stopped running these things and started reading them, several files deep at the kind of hour where you forget you have not eaten, trying to understand what each one bets its design on.

What I was hunting for is the stuff the demos never show. Whether a plan can quietly drift once it is running. Whether a loop detector can tell a real loop from an intentional retry. Whether the router learns from anything, or just reads a config file. Whether a panel of models can grade each other without playing favorites. None of that shows up in a fixed-horizon, single-model, no-concurrency benchmark. All of it shows up at 2am in something you shipped. So this is a field report, not a leaderboard. Three honest repos, each a different bet on the seams. Here is what each one actually does, what I would push on, and the one gap I noticed running through all three.

Before any of it: I read the repos, I did not run a controlled comparison. The mechanism claims below are what the projects document. The worries are mine, framed as worries, not measurements.

The three bets

ruflo (github.com/ruvnet/ruflo, MIT) bets on behavioral adaptation across a federation of agents. The conviction underneath it is one I share: which peer you trust should come from what the system has actually observed, not from a static config you tune once and forget. ruflo makes this concrete with a documented federation trust formula, a weighted blend of success rate, uptime, threat signal, and integrity, that continuously evaluates peers and downgrades untrusted ones with no human in the loop. It is model-agnostic, routing across several providers with failover. The bet is that trust is a measured property, not a declared one.

KARIMO (github.com/opensesh/KARIMO, Apache-2.0) bets on structural isolation and disciplined context. It is built on a commercial agent SDK as a coding-assistant plug-in, and it runs each agent in its own git worktree with branch identity verification, so agents cannot trample each other's working state. Execution is wave-ordered: tasks inside a wave run in parallel, waves run in sequence, across three loops it calls Foundation, Decomposition, and Orchestration. It layers context for token efficiency rather than dumping everything into every prompt, and it advertises semantic loop detection as a capability beyond the base SDK. The bet is that isolation and tiered context buy you reliability at scale.

llm-council (github.com/karpathy/llm-council, MIT) bets on adversarial peer review done honestly. A single model grading its own output is self-referential and you cannot trust it. llm-council runs three stages: every model answers the query, then each model reviews the others and ranks them on accuracy and insight, then a designated Chairman compiles a final answer. The detail that makes it work is one the repo states outright: the model identities are anonymized so a model cannot play favorites when judging outputs. It is small, deliberately scoped, and by Karpathy. If you only read one of the three, read this one.

How they compare, in plain terms

I am not going to show you code, and the honest comparison lives at the level of design choices anyway, so line them up. Isolation runs from federation-level trust gating in ruflo, to hard per-agent git worktree isolation in KARIMO, to none in llm-council's single-pass review. Execution is a trust-routed federation, a wave-ordered three-loop pipeline, and a one-shot three-stage panel. Trust scoring is ruflo's whole thesis, expressed as a documented weighted formula; KARIMO does not center it; llm-council replaces it with anonymized rank-order review. On models, ruflo is explicitly multi-provider with failover, while KARIMO is built on a single vendor's SDK and routes by complexity within it. Loop detection is something KARIMO names as a feature and the other two do not foreground. Persistence runs from a vector memory store in ruflo, to git history as the durable artifact in KARIMO, to effectively none in llm-council's single exchange. That is the map. Now the parts I would push on.

ruflo, up close

Behavioral trust as a measured quantity is the strongest idea in the repo, and reading it gave me that small jolt of recognizing a good instinct. A trust score that blends success, uptime, threat, and integrity, and that downgrades a peer the moment the numbers say so, is the right shape for a system that has to keep working while individual agents go bad. The conviction that trust should be earned from observed behavior rather than declared in config is exactly the conviction I would build on.

The worries I would carry into production are about what a learning router does on a bad day, and these are my worries, not defects I measured. A score that moves with observed success can thrash when the pool is small and the feedback is noisy. And any system that routes toward what has worked has to answer the starvation question: if a downgraded agent gets fewer tasks, it gets fewer chances to prove it recovered, and you can slide into rich-get-richer unless something deliberately explores. I did not see that explicitly addressed, so I would want to know how the trust loop avoids starving an agent that had one bad stretch. That is the general failure mode for any router that learns from outcomes: it will get quietly betrayed by its own feedback unless you design against it. The flip side is the part I would trust: when a peer genuinely goes bad, instant no-human downgrade is the behavior you want.
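
To make "something deliberately explores" concrete, here is a minimal sketch of an exploration floor. It is my illustration, not ruflo's mechanism: it uses plain epsilon-greedy selection, and the function name `pick_agent`, the `explore_rate` parameter, and the trust values are all hypothetical.

```python
import random


def pick_agent(trust: dict, explore_rate: float = 0.1, rng=random) -> str:
    """Mostly route to the most trusted agent, but sometimes pick uniformly at random.

    The random pick is what gives a downgraded agent a chance to earn trust back.
    """
    if not trust:
        raise ValueError("no agents to route to")
    names = sorted(trust)  # stable order so ties resolve the same way every time
    if rng.random() < explore_rate:
        return rng.choice(names)
    return max(names, key=trust.get)
```

A fixed exploration rate is the bluntest option. It spends some traffic on agents that really are bad, which is the price of never starving one that has recovered.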

KARIMO, up close

The git-worktree-per-agent isolation is the cleanest reliability idea across all three codebases, and I knew it the moment I understood it. Most frameworks let agents share working state and rely on prompt instructions to keep them in their lane, which fails the instant two agents reach for the same file. Giving each agent its own worktree with branch identity verification removes a whole class of shared-state race structurally, not by asking nicely. That did something to me as a principle. Structure holds where instructions do not.

The context layering is the part I would read more carefully before trusting at scale. KARIMO tiers context for token efficiency, a level-of-detail approach that loads compact summaries first and full content only when needed, which is a genuinely good token-conservation strategy and not a small one. What it is not, and the repo is honest about this, is a security boundary; it is a scanning discipline, not a wall an agent physically cannot climb. So my worry is the ordinary one for any retrieval-by-relevance scheme: when the right context is filed under words that do not match the query, the efficient path can skip it, and you find out at runtime. On loop detection, the repo names semantic loop detection as a capability but does not, in what I read, document the internals, so I will not describe a mechanism it does not state. The honest open question I would ask the maintainers is whether it can tell an intentional retry, an agent re-running a step after new information arrives, from a true loop. That distinction is hard, and I could not verify how they handle it.

llm-council, up close

Identity anonymization is the sharpest single idea I read all week. The repo says it plainly: the model identities are anonymized so a model cannot play favorites when judging outputs. That one move targets the bias that makes self-grading worthless, a model preferring work from its own family, and it targets it structurally, because a reviewer that does not know whose answer it is reading cannot favor a name. Ranking on accuracy and insight rather than handing out absolute scores is the right instinct alongside it, since models are calibrated differently and a rank order sidesteps the worst of that. The README does not state the exact aggregation math, so I will not name an algorithm it never claims; the principle stands on the anonymization and the ranking, and that is enough.

The gaps are scope gaps, not bugs, and the repo does not pretend otherwise. It is a peer-review primitive, not an orchestration framework. The flow is single-pass: every model answers, the panel ranks, the Chairman compiles, and that is the run. There is no reconciliation round where reviewers see each other's verdicts and revise, and no state carried across evaluations, which are exactly the things you would add if you wanted this to be a standing judge rather than a one-shot panel. None of that is a knock. It is a small tool that does one thing well, and the one thing is the right thing.

The thing none of them claims to fix

Each project is genuinely good at one bet. ruflo at measured trust across a federation, KARIMO at structural isolation and token-disciplined context, llm-council at unbiased multi-model review. And reading all three back to back, the same hole stayed open in every one, which is the part that has stuck with me since.

None of the three, in what I read, treats the meaning of a failure as a first-class thing that gets routed on. A failure can be the model getting it wrong, or the context being stale, or the plan asking for something impossible, or a downstream tool throwing a transient error. Those want completely different responses, re-route, re-plan, escalate, retry, and the natural default in a system like this is to collapse them into one undifferentiated failure signal and react to all of it the same way. The piece I keep wanting is a structured failure taxonomy with routing by cause, a layer that reads the failure mode out of an agent's output and dispatches to a handler built for that class, which means every agent has to conform to a typed failure schema. I did not see one defined in any of the three.
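
A typed failure schema with routing by cause could look roughly like this. It is a sketch of the idea, not something any of the three repos defines: the enum members come from the four failure kinds named above, and pairing each cause with one response is my assumption about how they would line up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCause(Enum):
    MODEL_ERROR = "model_error"
    STALE_CONTEXT = "stale_context"
    IMPOSSIBLE_PLAN = "impossible_plan"
    TRANSIENT_TOOL_ERROR = "transient_tool_error"


@dataclass(frozen=True)
class Failure:
    cause: Optional[FailureCause]  # None means the agent did not classify it
    detail: str


# Hypothetical pairing of cause to response; tune per system.
RESPONSES = {
    FailureCause.MODEL_ERROR: "re-route",
    FailureCause.STALE_CONTEXT: "re-plan",
    FailureCause.IMPOSSIBLE_PLAN: "escalate",
    FailureCause.TRANSIENT_TOOL_ERROR: "retry",
}


def route_failure(failure: Failure) -> str:
    """Pick a response by cause; unclassified failures escalate rather than retry blindly."""
    return RESPONSES.get(failure.cause, "escalate")
```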

I want to be careful here, because this is an observation about a gap, not a claim that I originated the fix. I did not, and typing your failures and routing on them is not new, it is just not something these three foreground. But it is the thing I cannot stop thinking about, because I have lived the adjacent version of it. The failure that does not announce itself as a failure is the one that costs you. In my own pipeline a fresh planner once read a stale plan and confidently re-derived work that had already been done, because nothing told it the difference between an open step and one already closed, and the cleanest defense I found was to stop trusting the carried-over note and treat the durable record as the only source of truth. That is the same shape: a signal that looks fine until you ask what it actually means. Until failure semantics are structured and routable, orchestration frameworks will keep recovering gracefully from the easy failures and badly from the hard ones. That is the part the demos will never show you, and the only part that ever kept me up.

Citations

ruflo source code: https://github.com/ruvnet/ruflo (MIT)
KARIMO source code: https://github.com/opensesh/KARIMO (Apache-2.0)
llm-council source code: https://github.com/karpathy/llm-council (MIT)
Crandall, J. W., & Goodrich, M. A. (2005). Learning to compete, compromise, and cooperate in repeated general-sum games. ICML 2005. (Background on multi-agent trust and behavioral adaptation.)
Stiennon, N. et al. (2020). Learning to summarize from human feedback. NeurIPS 2020. (Background on training and evaluating models from human preference comparisons.)

Tiered Context Loading: Fit a Huge Agent Registry in Your Context Window

praveenlavu — Sat, 04 Jul 2026 14:44:57 +0000

Pattern source: KARIMO (Apache-2.0)

It is late, the kind of late where you are doing back-of-the-envelope math on a system you were sure was fine, and the envelope keeps telling you it is not. The dispatcher had been happily picking the right agent for each incoming task. The fleet kept growing, more capabilities every week, and that was supposed to be a good thing. Then I actually multiplied two numbers I had never bothered to multiply, and the floor quietly dropped out. The router I had shipped could not, on paper, do the thing I was watching it do. It was only still standing because the registry had stayed small enough to hide the truth.

I had built it the way you build everything the first time, which is to say without imagining it ever getting big. Every time a task came in, I loaded every capability's full spec into context, let the model read all of it, and asked it to choose. Description, parameters, examples, edge cases, error modes, the whole document for every tool the system owned. It worked beautifully when I had a dozen capabilities. It worked fine at fifty. The trouble is that nothing in the code complains as you climb. There is no alarm at the moment it stops being affordable. It just keeps working right up until the arithmetic says it cannot, and you only find that wall if you go looking for it.

So I did the math I should have done before I ever shipped it, and the number was genuinely funny in the way that only an own goal can be. Take a realistic multi-agent registry. I will use round engineering estimates here, not a census of the real fleet, because the shape of the problem is what matters and the shape holds at any scale: call it somewhere between two hundred and four hundred capabilities, each full spec running maybe four to eight kilobytes once you count all the parameters and examples and edge cases. Put four hundred capabilities at a ballpark six kilobytes each and you are holding around 2.4 megabytes of plain text. GPT-4o gives you a 128K-token window. At roughly four bytes a token, that is about 512 kilobytes of usable room. You can fully load on the order of eighty-five capabilities into that. The estimate said four hundred.

Loading all of them was not a little over budget. At the six-kilobyte ballpark, four hundred specs come to about 600,000 tokens, nearly five times the window. Price the heavy case instead, specs that tokenize to eight thousand tokens each, and it is 3.2 million tokens against a 128K window, before a single word of conversation history. Not twenty-five percent over the window. Roughly twenty-five times over it. I had not built a routing system with a performance problem. On the numbers, I had built one that could not run, and had been quietly getting away with it only because the live registry stayed small enough to keep the bill from coming due. The exact figures are illustrative; the order of magnitude is the point, and the order of magnitude does not forgive you.
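
The arithmetic is short enough to check in code. This sketch takes the figures above as given (a 128K window treated as 128,000 tokens, four bytes per token, four hundred capabilities); the helper names are mine. It runs both estimates because they use different units: six kilobytes of text per spec in one case, eight thousand tokens per spec in the other.

```python
WINDOW_TOKENS = 128_000   # the 128K window cited above
BYTES_PER_TOKEN = 4       # rough rule of thumb from the prose
CAPABILITIES = 400


def specs_that_fit(spec_bytes: int) -> int:
    """How many full specs of a given size fit in the window."""
    return (WINDOW_TOKENS * BYTES_PER_TOKEN) // spec_bytes


def overrun(tokens_per_spec: int) -> float:
    """How many windows' worth of tokens the full registry needs."""
    return CAPABILITIES * tokens_per_spec / WINDOW_TOKENS


print(specs_that_fit(6_000))               # 85
print(overrun(6_000 // BYTES_PER_TOKEN))   # 4.6875
print(overrun(8_000))                      # 25.0
```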

The three bad doors

My first move was the obvious one, and the obvious one was a trap. Just load less. Load none of the specs upfront and fetch them on demand. But fetch them how? The honest options were all doors that opened onto the same drop. Load all the specs: impossible, that was the wall the arithmetic had just drawn. Load none and route blind: useless, the model cannot pick a tool it knows nothing about. Load by keyword match: brittle in exactly the ways that bite you in production, where synonyms miss, categories overlap, and the router confidently sends a task to the wrong arm because two descriptions happened to share a word. These were not three alternatives to a solution. They were three different ways to fail, and I spent a frustrating evening discovering each one personally.

What dragged me out of it was a reframe I should have started with. I had been treating "what tools exist" and "how do I use this tool" as the same question, loaded at the same time, paid for on every single call. They are not the same question. To pick a capability, the router barely needs to know it is there and roughly what it does. It does not need the parameter types, the examples, the retry semantics, any of it. All of that detail only matters at the instant of dispatch, for the one tool actually being called. I had been paying the full price of knowing how to use four hundred tools in order to make one decision about which one to reach for.

The turn

I did not invent the way out of this. I found it. KARIMO, an Apache-2.0 project, had already codified exactly the reframe I was groping toward, and given it a clean shape: three tiers, L0, L1, L2, each holding a different depth of knowledge, each loaded at a different moment.

L0 is the always-present index. For every capability you keep almost nothing: the name, a one-line description, the primary task type. No parameters, no examples. At roughly a hundred tokens per capability, four hundred of them cost on the order of 40K tokens, about a third of the window. That is the rent you pay unconditionally, on every call, and in exchange the router always knows the full menu of what exists. It can scan all four hundred names in one pass and shortlist the candidates without ever calling out to a retrieval system or risking a recall miss. It just does not yet know how to use any of them.

L1 is the category layer, and it loads only when the router gets stuck. When the shortlist comes back ambiguous, two or three candidates in the same category whose one-liners are too close to separate, the router pulls the full parameter summaries for that one category and nothing else. One category, on demand, a couple thousand tokens. When one candidate clearly wins, this step never happens at all.

L2 is the full spec, the complete document I used to load for everything: every parameter with its types and constraints, the examples, the edge cases, the error modes. It loads for exactly one capability, the one being dispatched, at the moment of dispatch, and never a beat earlier. The invariant underneath all of it is almost insultingly simple. At any point in a routing decision, the total context you are carrying is L0 for everything, plus L1 for at most one category, plus L2 for at most one capability. That is the whole ceiling.

What the math becomes

Run the same arithmetic that had been mocking me, now against the tiered shape. L0 across all four hundred capabilities is around 40K tokens. One category of L1 is about 2K. One capability of L2 is about 8K. The worst case the system can ever be in is roughly 50K tokens, which leaves about 78K of a 128K window free for conversation history, retrieved documents, and the actual output.
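
The same sum as a sketch, using the per-tier token costs quoted above as defaults. The function name and parameter names are mine, not KARIMO's.

```python
WINDOW_TOKENS = 128_000


def tiered_worst_case(n_capabilities: int,
                      l0_per_capability: int = 100,
                      l1_category: int = 2_000,
                      l2_spec: int = 8_000) -> int:
    """L0 for every capability, plus at most one L1 category and one L2 spec."""
    # Only the L0 term scales with registry size; L1 and L2 are capped at one each.
    return n_capabilities * l0_per_capability + l1_category + l2_spec


worst = tiered_worst_case(400)
print(worst)                   # 50000
print(WINDOW_TOKENS - worst)   # 78000
```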

Fifty thousand tokens at four hundred capabilities, growing by only about a hundred tokens for each capability you add, versus the millions the naive version demanded. Same capabilities, same window, same model. The difference is not a clever compression trick. The router never needed all that detail at once; I had just never given it permission to look things up only when it had to. The wall the numbers drew was not a limit of the model. It was a consequence of asking the wrong question on every call.

The cleanup that keeps it honest

There is one more piece that makes this hold up over a long session rather than just on paper, and it is the part I find quietly elegant. When context does eventually press against the ceiling, the eviction order is a hard rule, not a guess. The L2 spec for an in-flight dispatch is untouchable, because truncating it mid-call means the router generates against incomplete parameters and emits a malformed request. A category's L1 is atomic, all of it or none, because half a category overview is worse than no overview. So truncation always falls on L0 first, and never at random: the least-routed capabilities drop out before the most-routed ones.
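
The eviction order can be sketched as a single function that only ever touches L0. This is my reading of the rule, not KARIMO's code: the function name, the `route_counts` mapping, and the flat hundred tokens per entry are assumptions.

```python
import math


def evict_l0(index, route_counts, tokens_to_free, tokens_per_entry=100):
    """Drop the least-routed L0 entries until enough tokens are freed.

    L1 and L2 are deliberately out of scope: L2 for an in-flight dispatch is
    untouchable and a category's L1 is all-or-nothing.
    """
    n_to_drop = max(0, math.ceil(tokens_to_free / tokens_per_entry))
    # Least-routed first; the name breaks ties so the order is deterministic.
    by_usage = sorted(index, key=lambda name: (route_counts.get(name, 0), name))
    dropped = by_usage[:n_to_drop]
    dropped_set = set(dropped)
    kept = [name for name in index if name not in dropped_set]
    return kept, dropped
```

Capabilities that have never been routed count as zero, so they are the first to go.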

The effect is that, over a long-running session, the always-present index slowly sheds the tools nobody is calling and keeps the ones the work actually leans on. The system concentrates its remaining headroom on its real working set. That is not decay. It is the design doing its job.

What I would tell you

If you are routing across a registry that can grow, build the tiered split before it grows, not after the wall. The whole lesson is one I had to learn the embarrassing way: knowing that a tool exists and knowing how to use it are different questions, and a router that pays for the second one on every decision is buying something it does not need yet. Separate them, load each at the moment it actually matters, and a registry that could never fit suddenly costs a small, predictable slice of your window, one that grows only as fast as the one-line index.

None of the core idea is mine. The L0/L1/L2 discipline is KARIMO's, published openly. What I brought was the bad evening, the arithmetic I should have run first, and the relief of recognizing my own problem in someone else's clean answer. It needs no embeddings, no retrieval infrastructure, no dependency beyond a registry you author with three depths of detail instead of one. The only cost is writing those three layers when you register a capability, and that cost pays for itself on the first routing decision the old way could not have survived.

Repository

Pattern: github.com/opensesh/KARIMO, Apache-2.0

The GPL v3 patent trap nobody checks until a lawyer walks your requirements.txt

praveenlavu — Fri, 03 Jul 2026 14:37:05 +0000

Disclaimer: I am not a lawyer, and nothing here is legal advice. What follows is my reading of what these license texts actually say in plain language, and the patent implications that seem to follow from that text. For any real IP decision, take it to a qualified patent attorney. On a public page a wrong legal claim is worse than a cautious one, so I am going to stay close to the words on the page.

Picture the moment a patent attorney goes through your requirements.txt line by line, asking which licenses are in your dependency tree. Not "can we ship this." A different question, one almost nobody asks before it matters: does using this code constrain my ability to assert a patent later?

I was not in a courtroom when this clicked for me. I was building, the way I always am, late, with a dependency list I had grown the lazy way, one pip install at a time, each one chosen by an engineer who was thinking about getting the thing to run, not about what happens in 2027 when a competitor ships a similar feature. And then the question landed and I realized I was about to do archaeology on my own decisions.

Here is the trap in one line: most engineers treat open-source licenses as a shipping question. Can I include this in a commercial product? Can I keep my changes private? The GPL-versus-proprietary boundary gets all the attention. The patent question is different, and it is the one that quietly waits.

Plaintiff one day, infringer the next

The stakes are concrete. GPL v3 code in your dependency tree means that asserting a patent against someone using the same GPL v3-covered functionality could, by the plain text of the license, terminate your own license to that code. You go from plaintiff to infringer in a single filing.

The mechanism is a patent retaliation clause, text that is deliberately written to make patent assertion legally expensive for anyone who incorporated the protected code. It is not an accident or a side effect. The license was drafted to do exactly this. And it sounds abstract right up until someone with a law degree is reading your lockfile out loud.

So let me walk the actual sections, because the whole thing lives in the text, not in the vibe.

The core mechanism: sections 10, 11, and 8 in combination

GPL v3's retaliation works through three sections acting together. None of them does the job alone.

Section 10 imposes a condition on every licensee. In its own words, you may not initiate litigation, including a cross-claim or counterclaim in a lawsuit, alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

Section 11 is the grant. Each contributor hands every downstream user a license to that contributor's essential patent claims. That is the thing of value you are holding.

Section 8 is the trigger. Violate the license and your rights terminate automatically, and that termination reaches the patent grants made under Section 11.

Now stack them. The sequence the text creates runs like this. You incorporate GPL v3 code into your system, a library, a component, even a small utility you copied and adapted. A competitor ships something you believe infringes a patent you hold. You file suit, alleging infringement of the program or a portion of it. That filing is the Section 10 violation. Section 8 then terminates your rights, including the Section 11 patent grants, as of the date you filed. And now you are distributing GPL v3 code without a license, which is its own infringement.

The uncomfortable step is the one in the middle. "Any patent claim infringed by making, using, selling the Program or any portion of it" is not a precisely bounded phrase. Patent claims are written broadly on purpose. The GPL code's contribution to your system might be peripheral to your core invention, and that does not automatically insulate you. What matters, as I read it, is whether the patent you are asserting can be characterized as one infringed by making or using the GPL v3 work. If opposing counsel can draw that line, the text says the sequence can fire. Whether a court agrees in your specific case is exactly the kind of question I am not qualified to answer, which is the point of the disclaimer at the top.

AGPL widens it, GPL v2 does not have it

Two siblings are worth naming because people assume they all behave the same, and they do not.

AGPL v3 carries the same Section 10 and Section 11 language, plus the extra reach of Section 13. Under Section 13, if you modify the program and let users interact with it remotely over a network, you must offer those users the corresponding source, even though you never shipped a binary. For patent strategy that means the full retaliation exposure travels with the network-use surface. If you are building an AI inference service on an AGPL-licensed library, you are inside the AGPL's reach whether or not anyone ever downloads a binary from you. That one catches people who think "I never distribute, so I am fine."

GPL v2 is the other direction. It has no equivalent patent retaliation mechanism. Its Section 6 prohibits imposing further restrictions on recipients, and its Section 7 says that a patent judgment or allegation that puts conflicting conditions on you does not excuse you from the license. Both can complicate certain downstream patent assertion strategies, but neither is the litigation-triggered license termination that v3 introduced. v2 and v3 are not equivalent on patent risk, and treating them as one thing is its own quiet error.

What "using" GPL code does to your position

The danger is not confined to wholesale copying. I think of it as a gradient, and the middle of the gradient is where the real trouble lives.

At one end, you use a GPL v3 utility that parses files, and your patent covers an optimization on the parsed data. The GPL code touches none of the claimed functionality. As I read the clause, it is looking for an allegation that the GPL work itself constitutes infringement. If you are not making that allegation, the text gives it nothing to grab.

At the other end, you copied GPL v3 code in, modified it, and ship it as part of a system your patent claims cover. You have incorporated the work, and your claims arguably allege that the combined system, which is the work, does the patented thing. That one is almost certainly a problem.

The middle is the one that should scare you, and it is the reason I am writing this. You use a GPL v3 library that does something adjacent, say an approximation method, and your patent claims a system that uses approximation methods to get a result. You assert against a competitor. Their counsel argues your claimed functionality is intertwined with the GPL library's behavior. Whether that argument wins is a legal question, but you are then paying attorneys to litigate it either way. "Adjacent" is decided by how your claims are written and how good the other side's lawyer is. The risk is fuzzy and front-loaded. You cannot fully know at filing time whether a court will find the connection, and by then the GPL code has been in your tree for eighteen months.

LGPL v3 belongs in this conversation too. It is "lesser" only in the copyleft sense, you can link against it without your application becoming GPL'd, but it incorporates the GPL v3 patent framework by reference. It is not lesser on retaliation. Same Section 10, 11, and 8 story.

The safe operating rule I landed on

After sitting with the text, the rule I actually run is simple, and deliberately conservative. No GPL v3, AGPL v3, or LGPL v3 code in any codebase where I intend to file or assert patents covering the same functional area. And "same functional area" gets read broadly, because patent claims are written broadly.
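
A first-pass screen for that rule can be automated. The sketch below is an assumption-heavy illustration, not a compliance tool: `is_v3_copyleft` and `scan_installed` are hypothetical helper names, the regular expressions are deliberately coarse, and package metadata is self-reported and often missing or wrong. It also does not flag "GPL-2.0-or-later", which permits use under v3 terms. Anything it flags, and anything it misses, still needs the actual license file read by a human.

```python
import re
from importlib import metadata

# Coarse patterns for GPL v3, AGPL v3, and LGPL v3 identifiers and classifier names.
_V3_COPYLEFT = [
    re.compile(r"\b[AL]?GPL[\s\-_]*v?\.?\s*3", re.IGNORECASE),
    re.compile(r"General Public License\s*(?:,\s*)?(?:version|v)?\.?\s*3",
               re.IGNORECASE),
]

def is_v3_copyleft(declared: str) -> bool:
    """True if a declared license string looks like GPL/AGPL/LGPL version 3."""
    return any(pattern.search(declared) for pattern in _V3_COPYLEFT)

def scan_installed() -> list[tuple[str, str]]:
    """Return (package name, matching declaration) for installed distributions."""
    flagged = []
    for dist in metadata.distributions():
        meta = dist.metadata
        declared = [meta.get("License") or "", meta.get("License-Expression") or ""]
        declared += [c for c in (meta.get_all("Classifier") or [])
                     if c.startswith("License ::")]
        hits = [d for d in declared if is_v3_copyleft(d)]
        if hits:
            # The License field sometimes holds the full text, so truncate it.
            flagged.append((meta.get("Name") or "unknown", hits[0][:80]))
    return sorted(flagged)
```

Calling `scan_installed()` inside the project's virtual environment lists whatever the installed packages declare about themselves. An empty list means no package declared a v3 copyleft license, not that none is present.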

The good news, and it is genuine good news, is that this almost never costs you anything in modern AI work. The permissive ecosystem covers nearly everything you would actually reach for. The scientific Python stack is BSD-licensed. The major deep-learning frameworks are BSD or Apache-2.0. The big transformer libraries are Apache-2.0. The GPL libraries you bump into are usually legacy data tools or specialized components that predate the Apache-and-MIT-everything norm, and they have permissive alternatives in almost every case. The work of finding the alternative is small next to the work of defending the decision not to.

If you discover a GPL v3 dependency late, it is not a death sentence, just deliberate work. Re-implement the functionality clean-room, from the interface contract and not the source, and document that you did. Or wrap it as a separately deployed service you call over a network instead of linking it in, with the loud caveat that AGPL v3's Section 13 specifically addresses that pattern, so confirm with counsel before leaning on it. Or, if the component is peripheral, just drop it and carry the small technical debt instead of the legal debt. Every one of those paths is cleaner than explaining the dependency to an attorney three years from now.

The principle

None of this is original to me. The retaliation mechanism is right there in the license text, drafted by people who understood patents far better than I do, and IP counsel have been reasoning about it for years. The engineering distinctive, the part that is actually mine, is small and boring and saves you anyway: treat the license as a property of a dependency you check before adoption, not a footnote you reconstruct under adversarial scrutiny after you file.

The check costs about thirty seconds per dependency. Reconstructing a clean development history while opposing counsel reads your commit log costs a great deal more. The license is sitting in the repository the whole time. Read it before the import statement is in your tree, not before the filing is on the docket.

I am still not a lawyer. If any of this maps onto a real decision you are about to make, that is precisely the moment to put it in front of one.

A green test suite proves less than you think

praveenlavu — Thu, 18 Jun 2026 22:20:26 +0000

The test that scared me was the one that passed.

I had an integration test for a routing agent, the kind that takes a task and picks a capability to handle it. The test registered a new capability at runtime and then checked that the router would eventually route to it. Green run after run. Solid. I trusted it.

Then I read it properly. It reused the same task string on every iteration of the loop. My scorer was deterministic by design, it hashed the task and indexed into the capability list, so a fixed string mapped to a fixed slot, and the newly registered capability lived at a different slot that the fixed string could never reach. The test asserted that the new capability got selected. The new capability was structurally unreachable. And the assertion passed anyway, because the loop happened to land on something registered every time, which was all the weak version of the check actually demanded.

The test was not testing what it said it was testing. It was green for a reason that had nothing to do with the thing I cared about. The fix was almost insulting in its smallness: vary the task strings so the hash spreads across every slot including the new one. Suddenly the test could fail when the feature was broken, which is the entire point of a test. One line. I had been shipping false confidence behind a checkmark.
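
The difference between the two versions of that test can be reconstructed in miniature. This is a sketch under assumptions, not the original test: the `route` function, the SHA-256 hash, the capability names, and both check functions are hypothetical stand-ins for a hash-and-index scorer.

```python
import hashlib

def route(task: str, capabilities: list[str]) -> str:
    """Deterministic scorer: hash the task, index into the capability list."""
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    return capabilities[int.from_bytes(digest[:8], "big") % len(capabilities)]

def weak_check(capabilities: list[str]) -> bool:
    """Same task every iteration; only asks that something registered was picked."""
    picks = [route("summarize the report", capabilities) for _ in range(50)]
    return all(pick in capabilities for pick in picks)

def strong_check(capabilities: list[str], new_capability: str) -> bool:
    """Varied tasks spread across the slots; asks for the new capability by name."""
    picks = {route(f"task number {i}", capabilities) for i in range(200)}
    return new_capability in picks

working = ["summarize", "translate", "classify", "new_capability"]
broken = ["summarize", "translate", "classify"]  # registration silently failed

print(weak_check(working), weak_check(broken))
print(strong_check(working, "new_capability"), strong_check(broken, "new_capability"))
```

The weak check passes on both registries, so it cannot tell a working registration from a broken one. The strong check fails on the broken registry because the name it asks for is not there to be picked. A fixed task string only ever reaches one slot, which is why repeating it fifty times adds nothing.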

That is the moment this whole piece is about. Not the bug. The checkmark.

The number that lies

Here is the setup that produces this every time. You build an agent. You write unit tests. You watch line coverage climb to ninety-something percent. CI turns green. You deploy. And within a week the thing is making nonsensical decisions under load, falling over on inputs you never imagined a user would send, and getting stuck in loops your loop detector cannot see because two threads stepped on each other's state at the same instant.

The unit tests were not lying to you. The functions genuinely worked in isolation. That is the trap. Line coverage measures whether your tests executed a line, not whether they cornered it. You can run every line in a file and assert nothing that matters about any of them, exactly like my integration test ran its loop and asserted the wrong thing. A green suite built on coverage tells you your tests touched the code. It tells you almost nothing about whether the code survives contact with production.

And autonomous systems, agents that route, retry, fall back, remember, do not fail in isolated functions. They fail in the seams between functions. They fail where two modules meet and disagree about a type. They fail on the input the author never pictured. They fail when two requests arrive at once. They fail when a dependency dies and the system panics instead of limping. They fail on the edge case nobody wrote down. Coverage walks straight past all five, because every one of those failures lives in territory a unit test is structurally built to avoid.

Five seams, five suites

The shift that changed how I test agents was to stop asking "did my tests run the code" and start asking "what are the distinct ways this system actually breaks, and do I have a suite aimed at each one." Five answers came back, and they are genuinely distinct failure classes, not five flavors of the same check. None of these dimensions is mine to claim as an invention. They are long-standing testing practice: integration testing, adversarial and fuzz testing, concurrency testing, fault injection, and property-based testing each have decades of prior art behind them. The engineering distinctive is narrower and more honest: recognizing that an autonomous agent needs all five aimed at it at once, because it can fail in all five ways in a single week, and that a coverage number cannot stand in for any of them.

The first seam is integration, where modules compose. The most common bug in a multi-module system is not "function X has wrong logic," it is "X works fine but Y expected a different type," or "A only works if B was set up first." Mocks paper over exactly this, they return what you told them to and never enforce the real interface, which is how my same-string test slept through a real defect.

The second is adversarial input, the gap between the task you imagined and the task a real user sends, the hundred-thousand-character string, the embedded newline carrying a fake directive, the injection attempt, the empty string, the wall of emoji. The contract is not that nothing weird arrives. It is that weird input gets a safe answer or an honest error, never a crash and never a leak.

The third is concurrency, the races that only appear when many requests hit shared state at once. A history list, a registry, a loop detector, anything two threads can write without a lock, will silently corrupt under load in a way no single-threaded test will ever reproduce.
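
A concurrency test in this spirit hammers one piece of shared state from many threads and then checks an invariant that only holds if no update was lost. The sketch below assumes a hypothetical `LoopDetector` that counts how often a task has been seen. The lock is what keeps the read-modify-write safe, and the test is what would notice if someone removed it.

```python
import threading

class LoopDetector:
    """Counts how many times each task id has been seen (hypothetical component)."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}
        self._lock = threading.Lock()

    def record(self, task_id: str) -> int:
        # Read-modify-write: without the lock, two threads can read the same
        # old value and one of the two increments is lost.
        with self._lock:
            self._counts[task_id] = self._counts.get(task_id, 0) + 1
            return self._counts[task_id]

    def count(self, task_id: str) -> int:
        with self._lock:
            return self._counts.get(task_id, 0)

def hammer(detector: LoopDetector, threads: int = 8, per_thread: int = 2000) -> int:
    """Record the same task from many threads and return the final count."""
    def worker() -> None:
        for _ in range(per_thread):
            detector.record("same-task")

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return detector.count("same-task")

print(hammer(LoopDetector()) == 8 * 2000)
```

The invariant is simply that every recorded call is counted. Races are timing-dependent, so a test like this earns its keep by running many iterations on every CI run, not by failing on demand.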

The fourth is failure cascade, what happens when the pieces an agent depends on, the registry, the scorer, the loop detector, start dying. A naive build lets any one failure crash the whole call. A real one degrades, and the failure you actually have to test is all of them dying at once, because real outages are correlated and take down several things together.
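
Fault injection for this seam can be as simple as passing in dependencies that raise. The sketch below is one possible shape, with assumed names throughout: `route_with_degradation`, `RouteResult`, and the `safe_default` fallback are hypothetical, and catching bare `Exception` is a shortcut a real system would narrow.

```python
from dataclasses import dataclass

@dataclass
class RouteResult:
    capability: str
    degraded: list[str]  # names of the dependencies that failed during this call

def route_with_degradation(task, list_capabilities, score, is_looping,
                           fallback: str = "safe_default") -> RouteResult:
    """Route a task, falling back step by step instead of crashing the call."""
    degraded: list[str] = []

    try:
        candidates = list(list_capabilities())
    except Exception:
        degraded.append("registry")
        candidates = []

    try:
        if is_looping(task):
            return RouteResult(fallback, degraded)
    except Exception:
        degraded.append("loop_detector")

    if not candidates:
        return RouteResult(fallback, degraded)

    try:
        best = max(candidates, key=lambda cap: score(task, cap))
    except Exception:
        degraded.append("scorer")
        best = candidates[0]  # unscored, but still a registered capability
    return RouteResult(best, degraded)

def dead(*args):
    """Stand-in for a dependency that is down."""
    raise RuntimeError("dependency down")

# Correlated outage: every dependency dies in the same call.
print(route_with_degradation("any task", dead, dead, dead))
```

The test that matters is the last line: all dependencies failing together should still produce a result object naming what degraded, not a stack trace.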

The fifth is property-based testing, where instead of writing examples you state an invariant and let a generator hunt thousands of inputs for the one that breaks it. The invariants that look obvious, "routing always returns a real capability or a clean error, nothing in between," are exactly the ones a generated single-character task or a Unicode combining sequence quietly violates.
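
To make the idea concrete without pulling in a dependency, here is a hand-rolled stand-in for a property-based testing library, using only the standard library. Everything in it is an assumption for illustration: the two routers, the `RoutingError` type, the small alphabet of awkward characters, and the generator that tries boundary cases before random ones.

```python
import hashlib
import random

class RoutingError(Exception):
    """The one clean failure a caller is allowed to see."""

CAPABILITIES = ["summarize", "translate", "classify"]

def naive_route(task: str) -> str:
    first_word = task.split()[0]  # IndexError on empty or whitespace-only tasks
    return CAPABILITIES[len(first_word) % len(CAPABILITIES)]

def guarded_route(task: str) -> str:
    if not task.strip():
        raise RoutingError("task is empty")
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    return CAPABILITIES[digest[0] % len(CAPABILITIES)]

# Letters, whitespace, a combining accent, an emoji, and a NUL byte.
ALPHABET = ["a", "Z", " ", "\n", "\u0301", "\U0001F600", "\x00"]

def generate_tasks(n: int, seed: int = 0):
    """Boundary cases first, then n seeded random strings."""
    yield ""
    yield from ALPHABET
    rng = random.Random(seed)
    for _ in range(n):
        yield "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, 12)))

def violations(route, n: int = 1000) -> list[str]:
    """Inputs where route() neither returned a real capability nor raised RoutingError."""
    bad = []
    for task in generate_tasks(n):
        try:
            result = route(task)
        except RoutingError:
            continue  # a clean error satisfies the invariant
        except Exception:
            bad.append(task)  # any other exception is "something in between"
            continue
        if result not in CAPABILITIES:
            bad.append(task)
    return bad

print(len(violations(guarded_route)), len(violations(naive_route)) > 0)
```

The invariant is stated once and the generator does the hunting. The naive router breaks on the empty string, which is exactly the kind of input nobody writes down as an example. A dedicated library adds shrinking and far richer generators on top of the same idea.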

What the checkmark should mean

No single one of these dimensions catches everything, and that is the whole argument. Integration finds the type-contract and setup-order bugs and is blind to races. Adversarial finds the injection and the boundary crash and never sees a component failure. Concurrency finds the race and ignores the malformed input. Failure-cascade finds the un-graceful crash and says nothing about invariants. Property-based finds the violated invariant and the boundary input and cannot see a thread race. The value is not in any one suite. It is in the combination, five independent nets under the trapeze, each catching what the others structurally drop.

Each suite is its own short read, linked below, with the one specific failure it exists to catch and the cheapest test that catches it.

I am not going to pretend this makes a system bulletproof. It does not. There are failures none of these five see, and there will be a sixth seam I learn about the hard way at 2am some night that is already coming. But the difference between the green checkmark before and the green checkmark after is the difference between a number and a sentence. Before, green meant "the tests ran the code." After, green means "this composes correctly, handles hostile input without leaking, holds together under concurrent load, degrades instead of crashing when its dependencies die, and keeps its invariants across inputs I never thought to write down."

That is a checkmark worth trusting. The first one was just a string that happened to match.

This is the hub for a five-part series, one short read per dimension.

LLM Self-Preference Bias: How Anonymized Peer Review Fixes It

praveenlavu — Thu, 18 Jun 2026 22:20:23 +0000

The panel had been agreeing with itself for a week before I noticed, and the worst part is that the logs looked healthy the whole time.

I had built what felt like a clean idea. Several frontier models, different families, each one judging a pool of candidate outputs and ranking them best to worst. A jury of machines. I would generate a handful of answers, let the panel vote, take the winner, and trust that five independent opinions beat one. That was the whole pitch I had sold myself at 1am, and for a few days it ran without complaint. The rankings came in. A winner emerged every round. The dashboard was green.

Then I started actually reading what won.

The outputs the panel kept crowning were not the sharpest. They were the ones that sounded a particular way. Numbered lists where the content did not need numbering. A certain rhythm to the sentences. A house style. I stared at it for a while before the shape of it landed, and when it did it was a little sickening: my panel was not selecting for quality. It was selecting for resemblance. The judges were rewarding the candidates that wrote the way the judges write. I had built a popularity contest and dressed it up as an evaluation.

The thing nobody tells you you assumed

The premise underneath every multi-model panel is that the judges are neutral. You assume a model reading an unlabeled answer scores it on merit. It does not. Panickssery and colleagues measured this directly in 2024, in a NeurIPS paper with the unambiguous title "LLM Evaluators Recognize and Favor Their Own Generations." They found GPT-4 preferred its own output at a pairwise win rate above 0.90 on summarization tasks. In over ninety percent of head-to-head comparisons, the model picked the answer it had written. Not because it was better. Because it was its own.

The effect is directional across families. Prose in one model family's house style reads better to a judge from that same family. A more hedged, more structured answer reads better to a judge that writes that way. So when I assembled a panel and let it vote on a pool that included its own members' outputs, what I actually measured was which style happened to be most common among my evaluators. The highest-scoring answer was the one whose fingerprint matched the room. I had spent the planning at 1am congratulating myself on independence, and built the opposite.

And it is not only the obvious bias. Once I went looking, there were three of them stacked on top of each other. Self-preference was the loud one. Underneath it sat verbosity bias, where models score longer answers higher because length reads as effort and authority, even when the extra words say nothing. So my selection criterion was quietly drifting toward "writes the most" rather than "answers best." And under that sat position bias, where the first answer in an ordered list anchors the judgment, the same anchoring documented in human juries, so whichever candidate happened to appear first carried a structural head start that had nothing to do with being right.

Three biases, one panel, all of them invisible in a green dashboard.

The wrong fix I reached for first

My first instinct was to out-engineer it. Add a rubric. Tell every judge, in the prompt, to ignore style and length and score only on correctness. Lecture the jury about fairness before it deliberates.

It did almost nothing, and in hindsight it could not have. You cannot instruct a model out of a preference it does not know it has. The recognition is happening below the level the prompt can reach. The judge is not consciously thinking "this is mine, I shall reward it." It is reading prose that matches its own training distribution and finding it more fluent, more correct-feeling, more right. Asking it to be fair is asking it to notice a bias it cannot see. I was trying to argue a model out of its own reflection.

The real problem was not that the judges were biased. It was that the judges could tell whose work they were reading. The bias needed information to operate, and I was handing that information over for free.

The turn

The fix was not mine, and I want to be clear about that, because the elegant part was already sitting in public when I got there. Andrej Karpathy had published a small project called llm-council that solves exactly this, and the mechanism is almost insultingly simple: do not let the judges know whose output they are reading.

That is the entire idea. Before the panel votes, you strip every identity off the candidates. The first answer becomes "response A," the second "response B," and so on. No model name. No provider. No tell. The server keeps a private mapping of which label belongs to which model, a clean one-to-one assignment in both directions, so that after the votes are in you can reverse it and reconstruct exactly who scored what. The judges see only neutral labels and the text. The information the bias needs to operate is simply absent during the vote.
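
The label-and-mapping step described above is small enough to sketch. This is an illustration of the idea, not code from llm-council: the function names, the model names in the example, and the choice to shuffle which model gets which letter are assumptions. It supports at most 26 candidates, because it uses single letters.

```python
import random
import string

def anonymize(outputs: dict[str, str], rng: random.Random):
    """outputs maps model name -> text. Returns (label -> text, label -> model)."""
    models = list(outputs)
    rng.shuffle(models)  # assumption: randomize which model draws which letter
    label_to_model = {
        f"Response {string.ascii_uppercase[i]}": model
        for i, model in enumerate(models)
    }
    labeled = {label: outputs[model] for label, model in label_to_model.items()}
    return labeled, label_to_model

def deanonymize(ranking: list[str], label_to_model: dict[str, str]) -> list[str]:
    """Turn a judge's ranking of labels back into a ranking of model names."""
    return [label_to_model[label] for label in ranking]

outputs = {  # hypothetical model names and answers
    "model-x": "Answer written by x.",
    "model-y": "Answer written by y.",
    "model-z": "Answer written by z.",
}
labeled, label_to_model = anonymize(outputs, random.Random(7))
print(sorted(labeled))  # what the judges see: labels only
print(deanonymize(sorted(labeled), label_to_model))  # what the server reconstructs
```

Only `labeled` goes into a judge's prompt. The `label_to_model` mapping stays on the server and is applied after the votes are in.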

It works because you cannot favor what you cannot identify. Self-preference dies the moment the judge does not know which answer is its own. Hiding the names also strips the most obvious recognition signal, which dents style bias too, though not all the way, because if a model writes in an unmistakable rhythm its identity is still legible in the prose itself. Anonymization breaks the label, not the fingerprint. But the label was doing most of the damage, and removing it changed the room.

The first time I rewired my panel to run blind and watched the rankings come back, the winners were different. The house-style answers stopped sweeping. The thing that had been quietly rigging my evaluation for a week was just gone, because I had taken away the one piece of information it ran on. That is a strange and specific kind of satisfaction, watching a bias evaporate not because you argued with it but because you starved it.

Counting the votes honestly

Hiding the names fixes who wins a comparison. There is a second question underneath it: how you turn a panel of rankings into a single decision. Karpathy's project keeps that part as plain as the anonymization. Each judge ranks the anonymized pool, and the project aggregates those rankings by average rank position. You take every judge's placement for a given candidate, average them, and the answer with the best average ranking across the panel wins. That is it. No weighting, no points table, just the mean of where each judge put each answer.

What I like about averaging the rank is what it captures and what it ignores. It does not care how many first-place votes an answer collected, which is the trap of a naive plurality vote. A plurality vote can crown an answer that two judges adored and the rest ranked last, because a thin win still counts as a win. Average rank position cannot do that. A candidate that four of five judges place second and one judge places first averages (4 × 2 + 1) / 5 = 1.8, and the panel correctly reads it as broadly acceptable rather than narrowly adored. Broad acceptability is exactly the signal you want when you are picking the single best output from a pool, and the mean of the rankings is what surfaces it.
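
A worked example makes the aggregation concrete. This is a minimal sketch with made-up rankings, not the project's implementation: the `average_rank` helper and the five hypothetical judges are assumptions. Candidate B is placed second by four judges and first by one, while A and C split the remaining first-place votes.

```python
def average_rank(rankings: list[list[str]]) -> dict[str, float]:
    """rankings holds one best-to-worst list of labels per judge (1 = best)."""
    positions: dict[str, list[int]] = {}
    for ranking in rankings:
        for place, label in enumerate(ranking, start=1):
            positions.setdefault(label, []).append(place)
    return {label: sum(places) / len(places) for label, places in positions.items()}

rankings = [           # five hypothetical judges, three anonymized candidates
    ["B", "A", "C"],
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["C", "B", "A"],
    ["C", "B", "A"],
]
scores = average_rank(rankings)
winner = min(scores, key=scores.get)  # lowest average position wins
print(scores, winner)  # {'B': 1.8, 'A': 2.0, 'C': 2.2} B
```

B collects only one first-place vote against two each for A and C, so a first-place count would never pick it. Its average position of 1.8 is the best on the panel, which is the "broadly acceptable" signal in numbers.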

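If I were extending the project I would probably reach for something like Borda-style scoring on top, turning each placement into points and summing them so a near-miss second place carries explicit weight rather than just nudging an average. One caveat on my own idea: when every judge ranks the full pool and the points fall off linearly, Borda totals produce exactly the same ordering as average rank, so the refinement only changes outcomes with a non-linear points schedule or partial rankings. That is my own refinement, not what the repo ships. What llm-council actually does is the simpler and honestly sufficient thing: anonymize, rank, average the positions, take the winner. The discipline is in the order of operations, not in any clever counting.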

Where this is not enough

I want to be honest about what anonymization does not fix, because I shipped it feeling like I had solved the panel, and I had solved one third of it.

Self-preference is gone. Two biases are still in the room.

Verbosity bias is completely untouched. A longer answer is longer regardless of what label it wears. Anonymization operates on identity, not length, so if your task rewards thoroughness the panel will keep favoring the candidate that simply wrote more. The only real mitigations are a rubric that explicitly penalizes length without substance, or normalizing every candidate to the same length before the vote. Neither comes for free.

Position bias is only half-addressed. Randomizing which model draws which label between rounds helps, so no single model always sits in the anchor slot. But within any one judge's view, response A still appears before response B, and on the marginal calls, which is most of the interesting ones, first-listed still wins a little more often. The honest fix is an independent random ordering per judge, not just per round.
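
Per-judge ordering is a small change in code. The sketch below assumes a hypothetical `per_judge_orderings` helper and made-up judge names. Each judge gets its own independent shuffle of the same anonymized labels, drawn from a seeded generator so a round can be replayed.

```python
import random

def per_judge_orderings(labels: list[str], judges: list[str],
                        seed: int) -> dict[str, list[str]]:
    """Give every judge its own independent presentation order of the same labels."""
    rng = random.Random(seed)
    orderings = {}
    for judge in judges:
        order = list(labels)  # copy, so the caller's list is never reordered
        rng.shuffle(order)
        orderings[judge] = order
    return orderings

labels = ["Response A", "Response B", "Response C", "Response D"]
judges = ["judge-1", "judge-2", "judge-3"]  # hypothetical judge names
orderings = per_judge_orderings(labels, judges, seed=42)
for judge, order in orderings.items():
    print(judge, order)
```

Every judge still sees every candidate, just not in the same sequence, so no single candidate holds the anchor slot across the whole panel. The letters themselves still carry an implied order, so assigning labels per judge as well would be a further step, at the cost of one mapping per judge.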

And there is a quieter trap I walked into while feeling clever about diversity. A five-judge panel built from five models in the same family is not five opinions. Shared training lineage means shared preferences, so in the limit a fully correlated panel of five is one judge counted five times wearing different name tags. Anonymization cannot save you from that, because the bias is in the composition, not the labels. The fix is upstream: compose the panel from genuinely different architectures, or measure how often your judges disagree and weight accordingly. A panel that always agrees is not a panel. It is an echo with a quorum.
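
One way to measure how often judges disagree is pairwise agreement: for two judges, the share of candidate pairs they put in the same order. This is a sketch of one reasonable metric, not something the source project prescribes. The helper names and sample rankings are hypothetical, and it assumes complete rankings over the same pool of at least two candidates.

```python
from itertools import combinations

def pairwise_agreement(rank_a: list[str], rank_b: list[str]) -> float:
    """Share of candidate pairs that two judges place in the same relative order."""
    pos_a = {label: i for i, label in enumerate(rank_a)}
    pos_b = {label: i for i, label in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    same = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return same / len(pairs)

def panel_agreement(rankings: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Agreement score for every pair of judges on the panel."""
    return {
        (j1, j2): pairwise_agreement(rankings[j1], rankings[j2])
        for j1, j2 in combinations(rankings, 2)
    }

rankings = {  # hypothetical judges ranking three anonymized candidates
    "judge-1": ["A", "B", "C"],
    "judge-2": ["A", "B", "C"],
    "judge-3": ["C", "B", "A"],
}
for pair, score in panel_agreement(rankings).items():
    print(pair, score)
```

Two judges that score 1.0 against each other round after round are candidates for being one opinion counted twice. How much agreement is too much depends on the task, so treat any threshold as something to calibrate, not a constant.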

The principle

The mechanism is the right foundation even with those three caveats, and the reason is structural. You do not coax a biased judge into fairness with a better prompt. You remove the information the bias needs to operate, so it cannot operate, and then you treat the residue as second-order cleanup. Structure the problem so the failure mode is impossible rather than asking the model nicely not to fail.

That is the part I keep coming back to. I lost a week to a panel that looked healthy while it voted for its own reflection, and the fix was not a clever model or a longer rubric. It was taking away a name tag. Karpathy had already shipped the idea, plainly, and the only work left for me was recognizing my own problem in it and admitting the version I had built was a popularity contest. If you are wiring models to judge models, run the panel blind before you trust a single ranking it gives you. Mine looked fine for a week. It was quietly rigged the whole time.

References

llm-council, Andrej Karpathy, 2025. The label-anonymization design that this piece leans on, which aggregates the anonymized rankings by average rank position. Source: github.com/karpathy/llm-council

LLM Evaluators Recognize and Favor Their Own Generations, Arjun Panickssery, Samuel R. Bowman, Shi Feng. NeurIPS 2024. The source of the GPT-4 self-preference win rate above 0.90. arXiv: 2404.13076