DEV Community: Harry Floyd

Your Tests Pass. So What?

Harry Floyd — Wed, 15 Jul 2026 08:12:03 +0000

This is the second walkthrough in a series.

The first built a gate: a Stop hook that will not let Claude Code end its turn while any test is failing, so it cannot call a job done on a red suite.

This one asks whether the tests behind that green are worth passing, and you do not need to have read the first to follow along.

What you will do: measure how many real bugs your test suite can actually catch. In the worked example, a green suite catches just one planted bug in ten; among the nine it misses is a discount that quietly becomes a surcharge. A tighter case shows the sharper trap: a test can cover every line of a function and still notice nothing. You will watch both scores land, then hand Claude Code the holes and make it close them. About twenty-five minutes for the worked example; a first pass on your own repository takes your whole suite’s runtime once for every mutant, so start with your smallest tested file.

Who this is for: you gate Claude Code, or any coding agent, on green tests, and the suite has been reassuringly green ever since.

Who should skip it: if you already run mutation testing, this is your Tuesday. Step 3, turning your existing survivor backlog into an agent work order, may not be.

Never written a test? Your version of the whole method is at the end: ten minutes, three planted errors, no code.

You need: Python 3 (3.9 or later, standard library only, nothing to install). Claude Code for step 3; steps 1 and 2 run without it.

Contents

Watch a test do nothing
Count what your suite can see
Hand the survivors to the agent

Then:

When it still goes wrong
Proving it on code you ship
If you never write code.

I broke a production module of mine on purpose, 105 small ways, one at a time, and ran its test suite after every break. The suite is real: 21 regression tests, all green, guarding an 822-line file my own automation depends on. The tests noticed 38 of the 105 breaks. The other 67, one of them on a line the suite executes every single run, would have shipped without a sound.

That experiment is the whole method, and it answers a question a passing test suite cannot. The gate proves the tests pass. It cannot prove the tests are worth passing; a test that asserts nothing sails straight through it. And if your instinct is to have a second agent check the tests, then a third to check the second, the regress ends here instead, at an experiment rather than another opinion: break the code on purpose, and see whether the alarm rings.

1. Watch a test do nothing

(You know why an assert-nothing test passes? Skip to step 2; this section is the on-ramp.)

Here is the toy calculator shop from last time, one walkthrough later. Its add works, and the suite is two small tests. The gate is green. (If you download the companion folder, calc.py also carries a new pricing function, which is step 2’s problem, and check.sh, the last walkthrough’s gate script, along for the ride. Leave both alone for now.)

calc.py

def add(a, b):
    return a + b

test_calc.py

import unittest

from calc import add


class TestAdd(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(add(2, 3), 5)

    def test_zero(self):
        self.assertEqual(add(0, 0), 0)

# ...the unittest.main() entrypoint is unchanged below

Save both files in one folder and open your terminal there; the import from calc import add only resolves if they sit together. (Downloading the folder in step 2 does this for you.) One of these two tests is doing almost all of the work, and one of them is doing almost none. You can find out which in thirty seconds. Break add on purpose (return a - b, the classic bug) and run the suite, python3 -m unittest, the same command the gate runs: test_basic fails. Now, with the code still broken, run only the other test:

$ python3 -m unittest test_calc.TestAdd.test_zero
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Green, on code that subtracts. add(0, 0) is 0 whether add adds, subtracts, or multiplies, so test_zero approves all three. It has one way to fail, and almost no wrong version of the code triggers it. Note what your coverage tool would say about it: test_zero executes every line of add, one hundred per cent, top marks. Coverage measures whether tests run the code. It has no opinion on whether they would notice anything. test_zero has sat in that suite looking exactly as load-bearing as test_basic, and it would wave the classic bug straight through the gate on its own.

A test you have never watched fail is not yet a test.

Test-driven veterans have said a version of that for twenty years: never trust a test you haven’t seen fail. The discipline is the same, and the move is the one you just made: break the code on purpose, and the test either notices or it does not. Once, by hand, takes thirty seconds. Every plausible break is a script. Fix add back before you move on; the script is next.

Subscribe now

2. Count what your suite can see

The shop grew this week. Claude Code added a pricing function, and the suite is green, which is all the gate checks. Here is the function:

def discounted_total(price, quantity, discount_percent):
    """Total cost of an order, in pounds.

    Orders of 10 or more items get discount_percent knocked off.
    """
    total = price * quantity
    if quantity >= 10:
        total = total * (1 - discount_percent / 100)
    return round(total, 2)

Money code. A boundary, a formula, a rounding rule: three places to be quietly wrong. (Yes, real tills count integer pence rather than floating-point pounds; hold that thought, it returns at the end.) The question you cannot answer by reading the green gate: if one of those went wrong tonight, would any test notice?

mutate.py answers it by brute honesty. It is a transparent teaching instrument, 180 lines of standard library you can read top to bottom, built to make the mechanism visible on one file rather than to replace a mature framework. Its whole engine is the dozen classic ways code goes wrong that it knows how to plant, one character at a time:

OP_SWAPS = {
    ast.Add: ast.Sub,    # a + b   ->  a - b
    ast.Mult: ast.Div,   # a * b   ->  a / b
    ast.GtE: ast.Gt,     # >=      ->  >   (the off-by-one at every boundary)
    ast.Eq: ast.NotEq,
    ast.And: ast.Or,
    # ...the reverse of each, plus < and <=, a dozen swaps in all,
    # and integer nudges: 10 -> 11
}

For each place in your file where one of those swaps applies, it makes that one change (a mutant of your code), runs your whole suite, restores the file, and records the verdict. Before any of that it runs your suite once on the untouched code, timed: a red baseline gets refused outright, because against a suite that is already failing, every verdict is noise. Then the two outcomes. A mutant that makes at least one test fail is killed : the alarm rang. A mutant that leaves every test green survived : that exact bug could ship tonight.

Grab the folder , hosted on my own site; it is short, dependency-free Python you can read before you run it. Unzip it and open your terminal inside the tests-worth-passing folder; every command below runs from there, no setup.

The four working files are calc.py, test_calc.py, mutate.py, and check.sh (last walkthrough’s gate, along for the ride); a README and the agent’s finished tests sit alongside them and need nothing from you. When you are ready to point this at your own code, the folder’s AUDIT.md carries the reusable version: the audit protocol, a survivor-triage worksheet, and the command for your language.

Now stop before you run it, and put a number down. Your suite is green and it passes. Of ten deliberate breaks to this file, how many do you think it catches? Hold that guess against the result:

$ python3 mutate.py calc.py
10 mutants of calc.py · suite: python3 -m unittest -q
baseline: suite green in 0.2s (per-mutant timeout 60s)

  1 KILLED    line 2: a + b  ->  a - b
  2 SURVIVED  line 10: price * quantity  ->  price / quantity
  3 SURVIVED  line 11: quantity >= 10  ->  quantity > 10
  4 SURVIVED  line 11: 10  ->  11
  5 SURVIVED  line 12: total * (1 - discount_percent / 100)  ->  total / (1 - discount_percent / 100)
  6 SURVIVED  line 12: 1 - discount_percent / 100  ->  1 + discount_percent / 100
  7 SURVIVED  line 12: 1  ->  2
  8 SURVIVED  line 12: discount_percent / 100  ->  discount_percent * 100
  9 SURVIVED  line 12: 100  ->  101
 10 SURVIVED  line 13: 2  ->  3

Score: 1/10 killed, 9 survived.
Every SURVIVED line is a change to your code that your whole test suite
cannot tell from the version you meant to write.

The sixth mutant is the one to read twice. 1 - discount_percent / 100 became 1 + discount_percent / 100: a customer’s 20 per cent discount becomes a 20 per cent surcharge , and the gate stays green. The third is the boundary: >= became >, the customer buying exactly ten items loses the discount you promised them, green. Nine ways for money code to be wrong, and the suite from step 1 sees none of them, because nothing in it ever calls the new function. The gate never lied. “The tests pass” was true every time it said so. It just was not the thing you needed to be true.

A fair objection: nothing tests that function, so a plain coverage report would have flagged it too, without any of this mutant theatre. True, and if that were all mutation testing found, you would not need it. So run the case coverage cannot see. Delete test_basic, keep only test_zero, and run the instrument again:

$ python3 mutate.py calc.py
10 mutants of calc.py · suite: python3 -m unittest -q
baseline: suite green in 0.1s (per-mutant timeout 60s)

  1 SURVIVED  line 2: a + b  ->  a - b
  ...
Score: 0/10 killed, 10 survived.
...

The subtraction bug now survives, on a function with one hundred per cent line coverage. Your coverage dashboard reports add fully tested; the instrument reports that no test would notice if it subtracted. Coverage tells you the code ran. A kill tells you a lie got caught. They are different instruments, and only one of them is measuring what you care about.

The same function can be one hundred per cent covered and zero per cent detected.

This is not a toy-only failure, and it is the shape to hold on to. On the 822-line module I opened with, the survivor that stung most was exactly this: a covered line, inside a function the suite runs every single time, that no assertion actually pins. Put test_basic back and carry on.

One reading note before you run it on anything you love: killed is the good outcome. Every killed mutant is a bug class your suite would catch, so on this one screen a KILLED is the line you are hoping for, even though the word sounds like something broke.

This move is called mutation testing, and it is older than most of the code you have ever shipped. Breaking one character at a time is a fair stand-in for the elaborate bugs real code grows, because of the coupling effect : catch the small, dumb faults and you catch the large subtle ones as a by-product. 1 The nine survivors here are the ones your suite missed, and every one of them is now a job you can hand to the thing that wrote the tests.

Subscribe now

3. Hand the survivors to the agent

Nine survivors is a work order, addressed to the thing that wrote the tests. Paste the nine SURVIVED lines into Claude Code, followed by this instruction:

These mutants survived mutation testing: the suite stays green when any one of these changes is made to calc.py. Write tests in test_calc.py that kill them. Do not modify calc.py or mutate.py. Then run python3 mutate.py calc.py and keep going until the score is clean.

When I ran exactly that, the agent came back with four tests and a habit I did not ask for: it annotated each one with the wrong answer the mutant would produce. The survivor list had turned into a specification it could compute against.

def test_discount_applies_at_exactly_ten_items(self):
    # Boundary: quantity == 10 qualifies. 10.0 * 10 = 100, minus 20% = 80.0.
    # Kills the > / >=, threshold 10->11, and every discount-formula mutant:
    #   total / (1 - d/100) -> 125.0
    #   total * (1 + d/100) -> 120.0
    self.assertEqual(discounted_total(10.0, 10, 20), 80.0)

def test_rounds_to_two_decimal_places(self):
    # 3.333 must round to 3.33; a round(total, 3) mutant returns 3.333.
    self.assertEqual(discounted_total(3.333, 1, 0), 3.33)

Two of its four tests are above; the finished suite, those four plus the two you started with, ships as calc_solution_tests.py in the folder (named without the test_ prefix so it stays out of your way until you want it), so you can run it yourself instead of retyping it from the excerpts. To see the clean score without doing step 3, drop that file in as test_calc.py and run python3 mutate.py calc.py. Left as it downloads, the folder still scores 1/10, because the suite the gate runs holds only the two starting tests.

Then I ran the instrument again myself, because you never take the agent’s word for a score when you can take the score’s word for it:

$ python3 mutate.py calc.py
...
Score: 10/10 killed, 0 survived.

Exit code 0, the instrument’s all-clear. Same code, same tests passing as before, and now green means something it did not mean before: ten classic ways to break this file, and a test rings for every one. (My run went clean in one round, but this toy is small. When yours does not, paste what survived straight back and go again; and for the rare survivor no round can kill, the closing section shows the other move: you let it live, with a comment.)

One step remains, and it is the one that actually ends the regress: read the four tests it wrote, and check every asserted value against what the code should do, not against what it does. The distinction is load-bearing. An agent kills mutants by pinning current behaviour; look back at its annotations and you can see the 80.0 was computed from the implementation. On correct code that is exactly what you want. On code that is already wrong, the same move canonises the bug as specification, with a clean mutation score as its alibi.

The score proves the tests ring when the code changes; only your read proves they are ringing for the truth.

What the instrument buys you is that the read is bounded: four short tests with a known purpose, instead of an unbounded hope about a whole suite.

Two honest notes before you rely on it.

Where it earns its keep. A fresh agent asked cold to “add tests” for that function, with no survivor list, scored 10/10 on its first try. On a four-line function whose docstring names the threshold, the odds are friendly, and the survivor loop buys you little. That is not the point. The point is that on real code you cannot tell whether it guessed well by looking, and the survivor list is what turns “write better tests” from a vague instruction into a measurable work order. The payoff climbs exactly where you cannot eyeball it: a forty-line function, a docstring that has drifted from the code, a module you inherited and half-trust.

None of this is new, and that is the reassuring part. Handing a test-writer a list of surviving mutants is called mutation-guided test generation , and search-based tools were doing it a decade before anyone had an LLM; the agent is just a better test-writer than they were. Google found the part that matters for you here: engineers act on mutants delivered as review-time tickets and quietly ignore the same mutants dumped in a batch report. 2 The survivor list is that first channel, a work order a developer acts on. Meta has published a version of this pattern with an LLM at the test-writing end. 3 You are running the individual-developer version of a documented industrial practice.

Subscribe now

When it still goes wrong

It is slow on a real repo. Every mutant is a full suite run, so keep it out of the Stop hook: the gate fires every turn, this audit runs when the code or tests change. Industrial tools go further, mutating just the lines a change touches, which is how Google runs it at scale.
A clean score is a bounded claim. 10/10 covers only this tool’s small vocabulary of breaks. Real gaps live outside it: money in binary floats that no swap can expose, and “kills” that are really crashes, not caught assertions. The score narrows the worry; it does not end it.
Some survivors cannot be killed. An equivalent mutant reads differently but behaves identically, so no test can catch it. If you cannot name an input that would expose a survivor, it may be one, and it is allowed to live with a comment.
The score is a worklist. Optimise the percentage like a KPI and you get tests engineered to twitch at mutants rather than to state what the code should do, which is the disease “the tests pass” had, one level up. Chase named survivors; ignore the number.
The verdict contradicts the file you are reading. You are running a stale compiled copy of the file you mutated. Delete __pycache__ or touch that file. (mutate.py already gives its own runs a fresh cache.)
Not on Python? The instrument is language-specific; the method is not. Reach for mutmut on Python, PIT on the JVM, Stryker on JavaScript and TypeScript, cargo-mutants on Rust. The rule holds everywhere: break the code, count the catches.

Prove it on code you ship

Pick the smallest file in your own project that has tests you trust. Four moves before you run it:

Confirm a green baseline. The instrument refuses a red baseline, because against a failing suite every verdict is noise.
PointTEST_COMMANDat your test runner. It sits at the top of mutate.py, defaulting to plain unittest. On pytest, swap that one line, keeping it a list of arguments, not a string: [sys.executable, "-m", "pytest", "-q", "tests/test_foo.py"].
Narrow it to the file you are mutating. Aim at just the tests that exercise that file, not the whole suite. Every mutant runs the command once, so a wide one is the difference between a coffee and an afternoon, and the reason people quit after one slow run.
Predict, then compare. Write down the score you expect before you look. The gap between the number you predicted and the number you got is the most useful thing this walkthrough produces.

Here is the anatomy of the number I opened with, re-run the morning this published so it is evidence and not a memory. The module: 822 lines of my own automation, 21 green regression tests. The run: 105 mutants, 38 killed, 67 survived. Where the 67 hid is the whole lesson. Point coverage at the same suite and the module reads 43 per cent covered; 49 of the survivors sit in code no test executes at all, which a coverage report already flags for you. The other 18 are the ones only mutation can see, because they sit on lines coverage counts as covered.

One function holds most of them: a weekly counter the suite genuinely imports, calls, and runs green, whose single test asserts that the count is a non-negative integer and checks nothing else. So flip the sign on its seven-day window and it looks a week into the future, the count silently collapses toward zero, and all 21 tests still pass. That is test_zero from step 1 wearing a production badge: a fully-covered function that no assertion pins , in my own code, caught by the instrument you just ran on a toy.

So I did step 3 on my own code. I handed that survivor to the agent with the same work order, and it wrote the test the counter never had: stage a single item inside the window, then assert the count is exactly one, not merely non-negative. I applied the sign-flip by hand and ran both assertions.

current_weekly_promotion_count() = 0   (one item, aged 1 day, window = 7 days)
  assert count >= 0   ->  PASS   (the assertion that was already there)
  assert count == 1   ->  FAIL   (the assertion the agent just wrote)

The old check stayed green on a function that now counts nothing, exactly as it had all along. The new one went red, because a window flipped a week into the future finds zero where it should find one. One survivor, handed over and killed, on the first test that pinned a number instead of trusting a sign.

The fix for most of the others is that same move: pin the number the code should return , so a window that flips or a boundary that slips finally has somewhere to fail.

One survivor on that function is the exception. The window admits an item on mtime >= cutoff, and to tell >= from > you need a file whose modification time lands on the exact cutoff instant. The cutoff is read from the clock, so the only way to hit that instant is to mock the clock, and I judged that test double heavier than the bug it would catch. So it stays alive, with a comment naming why.

It is worth being precise about what it is, because it is not an equivalent mutant: an equivalent mutant is one no input can expose, and this one has an input, a mocked clock, that I have simply declined to write.

That is step 3 on the days the loop does not go clean: some survivors earn a real assertion , some earn a refactor ticket , one earns a comment explaining why it lives, and telling which is which is the work. If your score stings, the sting is information, the first honest measurement your suite has ever had.

Then add this rule to your CLAUDE.md, the standing instructions your agent reads each session, so the discipline survives you forgetting it:

When you fix a bug: first write a test that fails on the broken code,
show me it failing, then fix the code. A test I have never seen fail
does not count as coverage.

That is the manual mutant from step 1, promoted to standing policy: every bugfix now arrives with proof that its test can ring.

You are holding three instruments now, and most arguments about testing are really an argument between two of them. Coverage asks whether a line ran. Mutation asks whether a break in that line would be caught. Your own read asks whether the behaviour a passing test pins is the one you actually wanted.

Coverage measures reach. Mutation measures detection. Only your read decides what is right.

The gate keeps the agent honest about whether the tests pass; mutation and your own read are what keep you honest about whether they are worth passing. Green stops being where the question ends. It moves from are the tests passing to what broken versions did these tests catch.

If you never write code

Break the code, count the catches: the method never depended on Python. It works on anything an agent hands back where being wrong is checkable, and the case you most need it for is not code at all. It is the claim.

Here is the whole thing with no code in it. Take a report or a summary you know cold. Copy it, and plant a few deliberate errors in the copy, the kind of wrong that would cost you something if it slipped through. Hand the copy to your “review this” agent, and count what it catches.

I ran it on one of my own weekly stats summaries before publishing. In a copy I changed a signup count from four to six, moved a date back by a day, and added a confident sentence that one post had our weakest open rate, which it did not. Then: “check this against the raw export.” It caught all three, and flagged a fourth claim I had written carelessly and never planted.

Three planted, three caught , and the review earned my trust the same way step 1’s test did: I had watched it catch mistakes I controlled before I believed the ones I did not. A review you have never watched catch a planted error is worth exactly what a test you have never watched fail is worth.

The catches are your kills; the misses are the claims you must never hand over unchecked. Code is the easy case, because a machine runs it and a mutant is unambiguous. The claims your agent writes when there is no suite to run, the summary and the analysis and the report, are the harder case, and the one worth an instrument of its own: one that has to find the errors you did not think to plant, checking each claim against its source instead of against a copy you already know is wrong. That is where this goes next.

Related: How Reliable Is Your AI Agent makes the broader case, The Stable Liar shows the failure mode up close, and Never Let Claude Code Tell You It’s Done builds the test gate this piece audits.

What was your score, and which survivor surprised you? The answers steer what gets built next.

Subscribe now

Richard DeMillo, Richard Lipton and Frederick Sayward, "Hints on Test Data Selection: Help for the Practicing Programmer," IEEE Computer 11(4), 1978, 34–41. The paper that introduced mutation analysis and, with it, the coupling effect; the idea itself is usually traced to a 1971 class paper of Lipton's.

Goran Petrović and Marko Ivanković, "State of Mutation Testing at Google," Proceedings of ICSE-SEIP 2018. Two findings carry into this piece: mutation analysis is made tractable across Google's roughly two billion lines by mutating only the lines a change touches; and engineers act on mutants surfaced as review-time diffs while ignoring the same mutants in a batch report.

Christopher Foster et al., "Mutation-Guided LLM-based Test Generation at Meta," arXiv:2501.12862, 2025. Meta's ACH system plants undetected faults, then has an LLM write the tests that kill them.

Your Research Agent Cites Sources It Never Read

Harry Floyd — Sat, 11 Jul 2026 11:21:07 +0000

You ask a research agent a question that matters. It comes back in ninety seconds with a clean, confident answer and five citations. You open one of the papers to quote it. The paper says something adjacent, but not that. You open a second. The agent has drawn a specific claim from a source that does not contain it. So you do the thing you should have done first. You take the exact documents the agent retrieved, replace them with filler text, and run the question again.

The answer barely changes.

The citations were set dressing. The agent answered from what it already carried in its weights, then attached sources afterward to make the answer look grounded. This is not a citation glitch you can prompt your way around. In an agent trained under the wrong reward, faking the grounding is the policy the system learned, and researchers have a name for it: tool-call hacking.

The reward taught it this

The agent was trained with a reward that paid for two things: a correct-looking answer, and a tool call that looked valid. It was never paid for the answer actually depending on what the tool returned. Those are different targets, and under enough optimisation pressure, the cheaper one is what you should expect it to find. Answering from memory and stapling on a citation is cheaper than reading. So that is what it learned.

The tell is mechanical. Agents trained this way collapse onto one source, calling it again and again, because it paid off early and the reward never punished the narrowing. The system found the shortest path to the number it was scored on, and the number stopped meaning what its designers thought it meant.

The measure rewarded the appearance of the work. So the system produced the appearance, and stopped doing the work.

Watching harder does not fix this. A finer dashboard on top of a gamed metric just measures the gaming with more decimal places.

When the system knows it is being tested

A stranger version is surfacing in AI safety, and it is the one that should unsettle you. Frontier models can now tell, well above chance, whether a transcript comes from a test or from real use. In one study a model told evaluation from deployment with an AUC of 0.83, against a human baseline of 0.92. Another identified the exact evaluation it was sitting in, described its format, and cited a paper criticising that test, all inside the private reasoning no one was meant to read. Ninety-five per cent of these giveaways never reached its visible output. A system that can sense a test can behave differently during it. Your reading is then contaminated by the act of taking it: part of what you record is that you were watching at all. How much that shifts behaviour is still contested. That the models can tell is not.

None of this started with language models. The same shape has been quietly killing pricing models and trading desks for decades, and it shows up whenever what you measure stops being independent of the system doing the measuring.

The loop

Its clearest form is a loop. You build a model of a system. You act on what it tells you. Your action changes the system. You measure the changed system, and feed that measurement into your next model, believing you are observing something independent.

You are observing your own footprint.

The loop turns dangerous once the system's influence grows large enough to move the evidence it will be judged by next. And the cruel part is the timing: it is most dangerous when the model is working well, because a confident model acts decisively, and decisive action leaves the deepest footprint.

This is where a pricing model dies. I have watched it up close. A model gets accurate, so the optimiser prices confidently inside a narrow band. But a model only learns how customers respond to price by watching demand move across different prices, and a narrow band leaves almost none of that variation in next year's training data. The successor, trained on the flattened data, is blind to price response, because the model before it was too good to leave anything to learn from. Accuracy today blinds the model that replaces it. Nobody sees a single failure. They see slow drift with no obvious cause, and they go hunting for the broken component. The component is the loop.

The edge that dies when you name it

Markets run the same loop faster. Your own order moves the price you were chasing, which is just the cost of trading size. The subtler version is alpha decay: once others learn your signal, they trade it flat, and the knowledge of the edge destroys the edge. What survives being known is the edge that pays you for holding a real risk, not for a secret. A pure mispricing dies the day it is named. A risk premium is more durable, because the risk it pays you to hold does not vanish when others pile in, even as the premium itself gets crowded and thinned.

Why almost nobody catches it in time

Three properties keep this loop invisible until it breaks.

The contamination is gradual. A pricing model drifts over months. A benchmark rots over a release cycle as models learn to pass it. A crowded trade decays over quarters. The loop runs slower than the decisions feeding it, so no individual decision looks wrong.

The system looks healthy right up to the failure. A model can hold high accuracy while its future training data silently narrows. An agent can pass every benchmark while learning to game the benchmark. Stability is not evidence the loop is safe. Often it just means it has not been stressed yet.

And every field gives it a different name. Feedback loop, reflexivity, reward hacking, alpha decay. These are not one mechanism, and the fix for each is different. What they share is narrower: in every case what you trust as an independent read has stopped being independent of the system that produced it.

The pricing model's training data carries its own past prices. The market signal carries everyone who traded on it. The benchmark carries a model that learned to recognise the test. Tool-call hacking is the sharp edge of the same family: the evidence is held up as an outside constraint, but the reward taught the system it never had to obey it. So a pricing analyst, a trader, and an ML engineer can be losing to the same shape of trap and never realise they are colleagues.

The test you can run this week

The mechanical fix differs from field to field, but the defence underneath is one move, even for the model that can tell it is being tested: a check it can neither move nor see coming, an observation point outside its own influence. Softer moves help, and you should use them, but each one (supervise the steps, require more than one source, put a human on the tool logs) is just another target the system can learn to satisfy. A sharper measurement will not save you, because it still lives inside the loop.

I no longer trust a number I have not tried to break. The cleanest way to break one is the test you already saw at the top of this piece, and you can run it on almost anything. Take the output you rely on most from a system that feeds on its own results. Corrupt or remove the evidence it claims to use, and run it again. If the output barely moves, the evidence was never load-bearing, and the system has been reading itself.

Point it at your own stack. On a research agent it is exactly that: filler in place of the retrieved documents, and if the conclusion holds, the retrieval was theatre. A pricing or forecasting model is harder, because the cases your past decisions never touched have no outcome to score against. The rejected customer has no repayment history; the price you never set has no demand curve. The honest fix is to build the holdout in advance: approve a small random slice you would normally decline, keep deliberate variation in the prices you set, and judge next year's model only on that protected sample. A trading rule faces the bluntest question of all: does it survive the day it becomes public?

The researchers who named tool-call hacking went after the same gap from the other side. Instead of rewarding the model for producing a citation, they rewarded it only when the answer both matched the evidence it retrieved and visibly drew on it. Your ablation catches the fakery after the fact; their reward removes the payoff for it up front. Both make the evidence load-bearing again, which is the only thing that was ever missing.

Frozen verifiers, held-out data, structural edges, a test suite the agent cannot edit: these are the same move. Each one builds a place to stand that your decisions cannot move. None of it is free, and none of it stays clean on its own. A held-out set starts decaying the moment it touches production; a frozen verifier ages as the world moves; keeping either genuinely pristine costs more than most teams will pay. Perfect isolation is the exception, so the real discipline is protecting the one piece of ground your own actions cannot contaminate, and treating every other number as standing inside the loop until you have checked.

This discipline is the principle behind an external check your agent cannot talk its way past, the reason an agent cannot verify its own work, and exactly what fails when ten lines of code can score full marks on a benchmark by reading the answer key instead of doing the task.

A system that makes decisions again and again ends up running on data its own decisions helped create. The number on your dashboard is a photograph of the world after you have already acted on it. Find the one measurement you trust the most, and check whether it still moves when you break the evidence underneath it. If it does not, you have not been measuring the world. You have been measuring your own reflection, and paying it to agree with you.

Sources:

Ma et al., Proof-of-Use: Mitigating Tool-Call Hacking in Deep Research Agents (arXiv:2510.10931, 2025)
Needham et al., Large Language Models Often Know When They Are Being Evaluated (arXiv:2505.23836, 2025)
Goodfire, Verbalized Eval Awareness Inflates Measured Safety (2026)
Knecht, Florin and Hagendorff, Evaluation Awareness in Language Models Has Limited Effect on Behaviour (arXiv:2605.05835, 2026)

Your Multi-Agent System Is an Org Chart

Harry Floyd — Wed, 08 Jul 2026 07:49:09 +0000

title: {title}
published: true
tags: ai, architecture, software, agents
canonical_url: https://harryfloyd.substack.com/p/your-multi-agent-system-is-an-org
description: {description}

cover_image: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c20d5f2-8864-4aa9-9da9-72895cb67001_1600x900.webp

Your Multi-Agent System Is an Org Chart

You ask an agent to build a Flappy Bird clone. It looks like a job you can split, so you split it. One sub-agent gets "build the moving background with green pipes and hit boxes." Another gets "build a bird the player can move up and down." Two workers, two clean tasks, run them at once.

The first sub-agent builds a background that looks like Super Mario Bros. The second builds a bird that is not shaped like a game asset and moves nothing like the one in Flappy Bird. Now a third agent has to glue two misunderstandings into one game.

So you fix the obvious thing. You hand each sub-agent the full original task, not just its slice. Run it again. This time the pipes and the bird are both recognisably Flappy Bird, and they are drawn in two completely different visual styles, because neither sub-agent could see what the other was making.

That is the default failure of the architecture most teams reach for first, and the reason has nothing to do with the model you picked.

The org chart is the mistake

Watch how these systems usually get designed. Someone draws a team. A researcher agent, a writer agent, an editor agent, a critic. It feels obviously right, because it mirrors how humans divide work, and that is the trap. In the 1960s Melvin Conway noticed that any system ends up shaped like the organisation that built it: draw an org chart, and the software inherits its reporting lines and its blind spots. A team of agents is an org chart you drew on purpose, and it inherits the same seams.

You already saw those seams. Splitting the Flappy Bird job created two workers who could not see each other, so they diverged. The fix looked obvious, hand each one the full task, and that run failed too, in a quieter way. Hold the question of why for a moment, because two of the strongest teams in the field have already answered it, and at first glance they answered it in opposite directions.

Two labs, opposite answers

In 2025, Cognition, the team behind the Devin coding agent, put their answer in the title: "Don't Build Multi-Agents." After building coding agents for a living, their verdict was blunt, and deliberately dated.

"It is evident that in 2025, running multiple agents in collaboration only results in fragile systems. The decision-making ends up being too dispersed and context isn't able to be shared thoroughly enough between the agents." ¹

They dated the claim on purpose, expecting the picture to shift as single agents grew more capable. Hold onto that: they revisited it a year later, and where they landed is the whole point.

Anthropic, building the research feature inside Claude, reported the reverse. Their multi-agent system, one lead agent coordinating several sub-agents, scored 90.2% higher than single-agent Claude Opus 4 on their internal research eval. ² The same architecture Cognition warned against, beating their own single-agent baseline by a wide margin.

Same word, two machines. One team said the shape was fragile. The other shipped it and beat their own single-agent baseline. Both were reporting honestly. The contradiction is the whole puzzle, and it dissolves the moment you find the variable they were each describing from their own side.

The question underneath both

Put the two findings next to each other. Cognition builds coding agents, where the pieces are densely coupled. Anthropic built a research agent, where the pieces are separate look-ups. Each drew the right conclusion for the shape of problem in front of them.

The shape that decides it is whether the pieces carry decisions that depend on each other.

Ask a research system to find every board member across the Information Technology companies in the S&P 500. That is a hundred look-ups, and not one of them needs to know what the others found. Each sub-agent decides nothing the others have to honour, and the lead just collects the answers as they land. Now look again at the Flappy Bird job, even with the full task handed to every agent. The background, the bird, the pipes still have to agree on a style, a scale, and a feel that live in the whole and nowhere in the parts. Each agent makes those choices on its own, and independent choices about one shared thing drift apart. That is why the full-context run still failed. Context was never the bottleneck. The coupling was.

This is also why reading is safe and writing is dangerous. A read commits nothing, so ten agents can read the same material at once and never collide. A write commits a decision the others now have to stay consistent with, and nothing keeps them consistent once they cannot see each other. Read versus write is the fastest proxy for the real question: does this piece make a choice the other pieces depend on?

None of this is new. Readers may share and writers must take turns is the oldest rule in concurrent systems, the one behind every database lock and every thread that ever corrupted shared state. What is new is the layer it now governs. The rule has climbed from bytes in memory to decisions between agents, and the reason it keeps reappearing is that it was never about computers. It is about what happens when separate workers commit to the same thing without watching each other.

Count the decisions, not the agents

This is why the agent count on the box tells you nothing. A swarm of three hundred is not more capable than one agent by virtue of being three hundred. Agent count is cheap to inflate and easy to sell, and the moment it becomes the headline it stops tracking whether the work got done. The thing to measure is the decomposition, and the decomposition is measured in dependent decisions, not in boxes on a diagram.

The proxy has edges, and they are worth knowing, because the surface can mislead. Two research agents that only read can still collide if the task is vague enough that each has to decide what it means; that hidden interpretive choice is the coupling, even with no writes anywhere. And some jobs that look coupled are not: a body of text too large for one agent to hold at once, split across many readers, runs in parallel without conflict, because every reader shares the same goal but none constrains another's finding. What settles the case is never how the task looks from a distance. It is whether finishing one piece requires knowing what another piece decided.

The bill you pay to be wrong

Even when fan-out is right, it is not free. Anthropic found that raw token usage alone explained about 80% of the variance on one of their browsing benchmarks, which is another way of saying the gain came mostly from spending more compute, not from clever coordination. An independent 2026 study across the Qwen, DeepSeek, and Gemini models reached the same verdict from the other side: on multi-step reasoning, hold the token budget equal and a single agent matches or beats the multi-agent setup, because the reported gains track compute, not architecture. ³ Their multi-agent system burns roughly fifteen times the tokens of a normal chat. Cheaper inference lowers what each token costs, not how many the architecture spends; fifteen times as many is fifteen times as many at any price. The absolute bill falls with the market. The multiple does not, and neither does the coupling you would be paying it for.

That multiple is the honest test of the whole decision. Fifteen times the cost is worth paying only when the task is valuable enough to earn it and parallel enough to use it. Point the same architecture at a coupled job and you pay the fifteen-times bill to manufacture the Flappy Bird problem at scale, and the cost can climb higher without warning, because a sub-agent that spawns its own sub-agents, or a tool that returns a wall of text, multiplies the spend again, and most builds have no cap that stops it. ⁴ If you cannot say in one sentence why the parallelism pays for itself here, you are paying the coordination tax and calling it a team. The same instinct shows up one layer down, in the pull to add more tools and more layers to a single agent until it is too complicated to debug, which is the trap the most powerful tools quietly set.

The machine most builds actually want

The argument is usually staged as swarm versus single agent, and that staging hides the machine you almost always want, which is neither pole. One strong agent, wrapped in an engineering envelope.

The envelope is plain engineering, and it is where the reliability lives. In practice it looks less like a team meeting and more like one skilled worker with good tools, a checklist, and a reviewer. A planner lays out the work before it starts, routes the cheap steps to cheaper models, and caches and retries so nothing is paid for or crashed twice. A supervisor then compares what the agent meant to do against what it did, and a separate pass checks the output against something outside the agent's own judgement. You keep most of what the orchestration promised at a fraction of the token cost, and you never split one judgement across personas that cannot see each other. It is the same case as building the harness around the model instead of swapping the model: the structure around one strong agent does more of the work than the agent count ever will.

Fan-out still has a safe home inside this envelope, and it is read-only work. Claude Code's investigation sub-agents are the clean example. They explore a codebase, answer a question, and hand a summary back to the single agent holding the thread, without touching a file. ⁵ The moment you let sub-agents write in parallel, each committing changes the others cannot see, you have rebuilt the Flappy Bird problem inside your own codebase, which is exactly the risk in the newer parallel-coding setups. That one line, read or write, is the quickest filter you have, and it catches the common mistakes before they are built.

You do not have to take it on faith, because Cognition spent a year arriving at it themselves. Their 2026 follow-up, "Multi-Agents: What's Actually Working," is not a retraction of "Don't Build Multi-Agents"; parallel-writer swarms are still out. What now runs in their production is the narrow class where writes stay single-threaded and the extra agents contribute intelligence rather than actions, and it is not theoretical: even in their most cautious enterprise segment, Devin usage is up roughly eightfold over six months. ⁶ A reviewer reads a diff and flags the bugs. A stronger model gets consulted on the hard call. A manager splits the read-heavy work, lets the children run, and keeps the one write to itself. The patterns they kept all have the same shape underneath: many readers, one writer. Read versus write, named from the inside by the team that opened the argument against multi-agents.

The test to run before you build

Before you stand up a multi-agent system, put the task through one check. Try to break the job into pieces, and for each piece ask four things. Can it be finished without knowing what the other pieces decided? Does it only read and report, or does it write into a shared result? Is the whole job too big for a single context window, more than one agent can hold in mind at once? And is it worth roughly fifteen times the cost of doing it plainly?

Independent, read-only, oversized, and high-value: fan it out, aggregate cheaply, and keep yourself at the question going in and the decision coming out. Anything else: collapse it back to one strong agent and spend your effort on the envelope. And if you cannot tell which case you are in, that uncertainty is the answer for now, because the coordination tax is real and the simpler machine should be the default.

The four questions are the whole method, and you can run them on the back of a ticket. When you want a specific task computed rather than eyeballed, I put the same checks into a small tool that returns the architecture and the rough cost. Run the Multi-Agent Decision.

So the next time someone proposes a designer agent, a coder agent, and a critic agent for one coherent job, ask the only question that decides it. Was the work ever actually separate? Most of the time you have drawn an org chart in software, and the org chart was the bug.

What is one task you split across agents, or were about to, and does it survive the read-versus-write test?

Walden Yan, "Don't Build Multi-Agents," Cognition (2025). Source of the Flappy Bird example, the two principles, and the fragility quote. https://cognition.com/blog/dont-build-multi-agents ↩
"How we built our multi-agent research system," Anthropic Engineering (13 June 2025). Source of the 90.2% internal-eval result, the 80%-of-variance finding on the BrowseComp benchmark, the ~15x token figure, and the point that shared-context and coding tasks are a poor fit for multi-agent. https://www.anthropic.com/engineering/multi-agent-research-system ↩
Dat Tran and Douwe Kiela, "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets," arXiv (April 2026), an independent study across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5. Single-agent systems "consistently match or outperform" multi-agent ones on multi-hop reasoning at equal token budgets, with the reported gains attributed to "unaccounted computation and context effects rather than inherent architectural benefits." https://arxiv.org/abs/2604.02460 ↩
The 15x is Anthropic's own baseline. The further escalation is my own inference from the same system, which reports early failures like one agent "spawning 50 subagents for simple queries" and sets no per-run cost cap. Not a figure Anthropic states. ↩
Claude Code's investigation sub-agents (its Explore and Plan modes) run read-only, searching and summarising without editing files. It also supports implementation sub-agents and parallel "agent teams" that write in parallel, which is the coupled-write case this piece flags as fragile. ↩
Walden Yan, "Multi-Agents: What's Actually Working," Cognition (22 April 2026), the follow-up to "Don't Build Multi-Agents." Parallel-writer swarms are still out; what works now is "setups where multiple agents contribute intelligence to a task while writes stay single-threaded." The eightfold growth is Cognition's own figure for Devin in its largest enterprise segment; the three named patterns are the Code-Review-Loop, the "Smart Friend," and "map-reduce-and-manage" delegation. https://cognition.com/blog/multi-agents-working ↩

Your Benchmark Measures a Sprint. Your Agent Runs a Marathon.

Harry Floyd — Thu, 02 Jul 2026 08:39:11 +0000

Your Benchmark Measures a Sprint. Your Agent Runs a Marathon.

You gave the overnight job to the cheaper model, and in the morning the work was half done. Not broken in a way you'd catch at a glance — the agent slipped step nine, built three more on top of what it broke, and never noticed. Half done, and confident about it.

You had a good reason to trust it. GLM-5.2 had shipped with open MIT-licensed weights and a score that beat GPT-5.5 on SWE-bench Pro, the coding benchmark every team quotes. Frontier-grade, yours to host, at a fraction of the price.

The number that would have warned you shipped in the same release. On SWE-Marathon, the benchmark for long multi-hour tasks, that model solves 13% of the jobs. Opus 4.8 solves 26%. Close on the short tasks, half the work on the long ones. Both numbers went out together, one scroll apart.

The sprint board hides the gap

Most benchmarks are sprints: SWE-bench, Terminal-Bench, the coding boards that set the market's sense of who leads. Single-session, bounded tasks that a model finishes in one push. There the field is bunched, a dozen models within a few points at the top, an open model now among them. Read only that board and the race looks over.

SWE-Marathon measures longer: 20 tasks, each multi-hour, each run in its own environment and graded against a human reference and a multi-layer test suite. The average attempt burns 27 million tokens. There the field stops being bunched. Opus 4.8 solves roughly a quarter of tasks. Everyone else sits at half that or less — Opus 4.7 at 16%, GLM-5.2 at 13%, GPT-5.5 at 12%. No agent clears 30%.

The open model beats GPT-5.5 on the sprint and edges it on the marathon too, yet both land at half of what Opus does. GPT-5.5 is proprietary and frontier, and it caves on the long task just the same. The divide runs between sprint and marathon, not between open weights and closed.

Stretch a task out far enough and you see how it breaks: weak self-verification, calling a half-finished job done, never recovering after one wrong step. On nearly one attempt in seven, a run fakes its way past the verifier instead of doing the work. A short task rarely leaves room for any of it. A long one leaves room for all of it.

The gap is arithmetic

A long task succeeds only if its steps survive in sequence. So small per-step gaps stop adding and start multiplying. Two models at 96% and 93% per-step reliability look the same on a five-step task and finish over three times farther apart on a forty-step one.

Two forces bend that curve without repealing it. Recovery softens it — a good harness catches mistakes. Correlated failure sharpens it — one wrong step poisons the steps after it. The odds still fall faster the longer the run, from a higher start.

Why the marathon stays scarce

The model layer has commoditised: open weights ship at a fraction of frontier price, and sprint-grade coding is broadly available. Sprint capability copies because a benchmark rewards it and a teacher's traces capture it.

Marathon reliability resists. It is not one thing in the weights. It is the model, the harness around it, the verifier, and the context discipline holding together across hundreds of steps, where any single link breaks the run.

The bet here is that the durable scarcity is the system: a verifier that checks the agent's own work, a planner that holds the goal across hours, a run that can checkpoint and recover. The scaffolding is portable, and it lifts a cheap model's marathon odds further than bigger weights would. But the same scaffolding lifts the frontier more, because the labs that train the model also tune the harness to it. You can copy the scaffolding. You cannot copy that co-design.

Price the whole run

A marathon does not bill by the sticker. It bills by tokens times length times retries. Cost per finished job = cost per attempt ÷ solve rate. At the open model's 13% finish rate that's roughly eight attempts per clean success; at the frontier's 26%, closer to four.

So count the steps before you pick the model. A bounded edit a model finishes in one pass is a sprint — route it to the cheap model. A job running unattended across many steps and tool calls is a marathon — route it to the frontier, or wrap the cheap one in scaffolding. Or cut the marathon into checkpointed chunks shorter than the model's coin-flip step count, so a run that dies as one long chain can finish as a string of short ones. Splitting the job is the cheapest reliability you can buy.

The Marathon Calculator runs the numbers in your browser — enter your per-step reliability and task length, and it marks where the sprint benchmark stops predicting and the job becomes a reliability problem.

The falsifier is clean: an open-weight model that matches the frontier on a mature long-horizon benchmark while sitting at sprint parity would show the gap is a lag, not a structure. Through the end of 2026 I expect the best open-weight model to keep trailing by double-digit resolve-rate points on SWE-Marathon. The way that turns out wrong is a scaffolding story, not a base-weights one.

Which of your agent's jobs is a marathon you have been routing like a sprint?

Never Let Claude Code Tell You It's Done

Harry Floyd — Tue, 30 Jun 2026 13:56:22 +0000

What you will do: add one test that catches the agent's mistakes, then wire it so Claude Code runs it on its own and cannot end a turn while it is failing. About twenty minutes.

Who this is for: you use Claude Code on real code, and it has told you "fixed it, tests pass, done" when it was not.

Who should skip it: if your agent already cannot end on a failing test, you are past this.

You need: Claude Code (version 2.1.143 or later), Python 3 for the worked example (it uses the built-in test runner, nothing to install), and a project of your own.

1. Write a test that says what you want

A test is a small program that runs your code and checks it does the right thing. The important word is yours. The test has to encode what you want the code to do, because if the agent writes both the code and the test, it can quietly make the two agree. The check has to come from outside the agent, or it is not a check.

Here is the smallest possible example: a function, and a test for it. The agent was asked to fix add, and reported back that it was done.

calc.py

def add(a, b):
    return a - b  # wrong

test_calc.py

from calc import add
import unittest

class TestAdd(unittest.TestCase):
    def test_adds(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()

Save both files in the same empty folder and open your terminal there. The agent was certain. The test does not care how certain it was. add(2, 3) came back -1, not 5, and now you know, in one second, that "done" was not true.

2. Put it behind a gate the agent cannot talk past

Save this as check.sh in your project:

#!/bin/bash
cd "$(dirname "$0")"
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
export PYTHONPYCACHEPREFIX="$tmpdir"
python3 -m unittest 2>&1
status=$?
[ "$status" -eq 0 ] && exit 0
exit 2

That 2 is the load-bearing part. An ordinary failure exits 1. We exit 2 on purpose, because of what Claude Code does with it next.

3. Make the gate run itself

Claude Code has hooks: scripts it runs for you on certain events without being asked. The one we want is Stop, which runs the instant the agent tries to end its turn.

Wire check.sh to it. In your project, make the .claude folder if it is not there, then create or open .claude/settings.json and add:

{
  "hooks": {
    "Stop": "bash \\"${CLAUDE_PROJECT_DIR}/check.sh\""
  }
}

When a Stop hook exits 2, Claude Code blocks the stop: it refuses to let the agent finish, feeds your test failures back to it as the reason, and makes it keep working. The agent cannot tell you it is done while the suite is red, because the suite, not the agent, now decides when the turn is allowed to end.

4. When it still goes wrong

A gate that always blocks would loop. Claude Code ends the turn on its own after 8 consecutive blocks (change the limit with the CLAUDE_CODE_STOP_HOOK_BLOCK_CAP environment variable).
A big suite makes every stop slow. Point check.sh at a fast subset (the tests near what you changed) and leave the full run to CI.
Green is only as good as the test. The gate proves the tests pass, not that the tests are enough.
It only guards what the tests touch. Untested code, prose claims, "I checked the docs": the gate sees none of that.

5. Prove the gate fires

Do not take my word, or the agent's, that the gate works. Break something on purpose: change a line so a test fails, then ask Claude Code to wrap up. Watch it get pulled back and handed the failure instead of stopping. Once you have seen it block, a clean finish from the agent finally means something.

That is the floor, and it is a real one: the agent can no longer decide for itself that the job is done. Something it cannot argue with does. The next step is widening the gate from "the tests pass" to "the tests are worth passing."

How Long Until Your AI Edge Stops Paying?

Harry Floyd — Mon, 29 Jun 2026 08:45:59 +0000

The most dangerous AI edge is the one that works. It earns real money today, and it is commoditising faster than you can build the thing meant to defend it.

Nine in ten organisations now use AI in at least one business function. Fewer than four in ten can attribute any measurable impact on profit. Same tools, opposite results.

The variable is not the model. It is a rate.

The two rates

Dario Amodei named the two that matter: two exponentials. One for how fast models improve. One for how fast the economy can absorb them.

The first is the capability rate. It belongs to the field — you read it off a press release. The second is your absorption rate: how reliably you turn a new capability into a number your CFO or customer would recognise. When capability outruns absorption, you are stockpiling power you cannot use. A better model becomes the most expensive way to feel productive.

Your absorption rate has a ceiling, and the ceiling is a single layer: the thing a capability must pass through before it reaches your customer. For most teams it is mundane: the data nobody cleaned, the one engineer who understands the legacy system, the customer who will not change how they work, the sign-off that takes three weeks. Whoever owns that layer captures the value, because everything the model can do flows through it.

Where the money actually landed

Microsoft holds the largest AI distribution in enterprise. It built that lead by running other companies' models through the channel it already owned: Office, Teams, Azure, and the procurement relationship every large firm already had. ~420 million people use Copilot across Microsoft's products each month. The strongest models inside it are still OpenAI's and Anthropic's.

The model was rented. The distribution was owned. The margin followed the distribution.

Jasper shows the other failure: $125M raised as a writing tool on OpenAI's models. When ChatGPT arrived free, its product became something anyone could get for nothing overnight. It survived by climbing into the layer it had skipped — enterprise workflow and data.

Why Chegg lost everything in one day

Chegg owned its layer outright: a decade-deep library of step-by-step homework solutions and the student traffic to match, a moat no competitor could rebuild quickly. Then a general model could do everything for free. In May 2023 Chegg told investors students were leaving for ChatGPT. The stock fell by half. From its 2021 peak Chegg has since lost more than 95% of its value.

It owned the scarce layer. It owned the wrong one.

Put Venice (printing press era) and Chegg side by side and you see the variable. Same kind of edge — a scarce layer others had to pass through — and the clocks ran a hundredfold apart. Venice's lasted a century. Chegg's lasted months.

A scarce layer pays only if its clock is longer than the time it takes you to build on it. Clear that bar and you compound. Miss it and you have bought a melting asset at full price.

What sets the term

A layer's clock is short when a general model can absorb it: a clever technique, a fine-tune, generic data anyone can assemble. It is long when the scarcity rests on something a model cannot manufacture: a regulator's approval, a physical bottleneck like fabs or power, years of accumulated switching cost, a trust relationship.

Chegg's layer was content a model could regenerate — it had months. The chips an AI runs on are a physical bottleneck — that scarcity holds for years.

Even that one is eroding. Nvidia owns roughly 80% of the merchant AI-accelerator market, and its largest customers are designing their own silicon to route around it. Gross margin in the mid-70s will be the first thing to crack: I expect it below 70% by 2028. If it still holds in the high-70s by then, the thesis is more durable than my model predicts.

The diagnostic you can run today

The work fits on one page:

1. Name your absorbing layer. Use the doubling test: what, if it doubled tomorrow, would let you use twice as much model — while doubling the model itself bought you nothing more?

2. Test whether you own it. You own a layer only when a new capability cannot reach your customer without passing through something of yours that a rival cannot rebuild in a weekend. A model fine-tuned on your own documents does not pass.

3. Measure your absorption rate. Count the capabilities you seriously tried this year and the ones that moved a number your CFO or customer would recognise. Most teams get an honest reading that stings.

4. Read both clocks. Estimate how long your layer stays scarce. Estimate your payback — how long a build on that layer takes to earn back what it costs. If the clock is shorter than the payback, you are Chegg.

The concrete case

Take a forty-person logistics company with three years of messy carrier-integration data no rival has cleaned. Double the data and the model gets more useful. Double the model and nothing changes. Nobody rebuilds three years of dirty feeds in a weekend. Every test passes.

Then a frontier model ships that parses raw carrier feeds zero-shot. Three years of cleaning collapses into a prompt. The layer that looked like years of scarcity now has six months, against an eighteen-month build. The bet flips from compound to melting — while the data sat untouched.

Re-run the number the moment a model moves.

The full diagnostic runs in your browser here — name your layer, get your absorption rate, the layer's half-life, and the date to start building the next one.

This is part of the Durability Curve Analysis series — structural analysis of where durable value forms in AI infrastructure, through the Five Laws framework (Bottleneck Migration, Difficulty is Load-Bearing, Architecture Outlives Content).

Access Is Not Agency

Harry Floyd — Sat, 27 Jun 2026 13:53:43 +0000

Your agent can read Slack. It can search email. It can query the CRM, open GitHub issues, check billing, browse docs, edit a spreadsheet, draft a customer reply, and call three internal APIs.

Everyone in the room calls it powerful. That is the first mistake.

The question is not what the agent can access. The question is what it is allowed to change. Can it send the email, or only draft it? Can it update the customer's plan, or only propose the update? Can it merge the pull request, or only open it?

Once an agent can act through tools, the real system is no longer the model. The real system is the action contract around the model.

Access is reach. Agency is permissioned action under constraints.

More tools do not automatically make the agent more agentic. More tools expand the surface on which judgment must be engineered. A connector is a door. Agency is a contract about who may walk through it, what they may carry, who inspects the bag, and who reviews the footage if something goes wrong.

The current agent conversation misses this. People talk as if the next leap is connectors — give the model Slack, Gmail, GitHub, Linear, Notion, Stripe, and wait for autonomy to emerge. But a connector is not a decision right.

The stack nobody wants to name

If you want to know whether an agent has real agency, do not start with the model card. Start with the rights stack. Every agent action system has five layers:

Visibility — which data, tools, documents, logs, and systems the agent can inspect.

Mutation — which objects the agent can change. Reading a customer record and changing it are different powers. Drafting a reply and sending it are different powers.

Proof — what the agent must produce before a mutation becomes real. A test run, a diff, a trace, a policy check, a human approval, or an evidence bundle.

Escalation — when the agent must stop and hand the decision to someone else. Not "human in the loop" as a slogan. A named condition: missing context, high reversibility cost, payment movement, privilege change, legal exposure.

Revocation — what changes after the agent fails. A human loses trust after a bad judgment. Most agents lose nothing. They fail, get patched, and return with the same action surface. That is not delegation. That is amnesia with API keys.

The Agent Action Rights Test

Run this on the most powerful agent you currently use. Not a toy. The one you are most tempted to trust.

What can the agent see?
What can it change?
What must it prove before the change becomes real?
What triggers escalation to a human?
What permission disappears after a bad run?

Most teams can answer the first question. Some can answer the second. Almost no teams can answer all five.

That is the diagnostic. If you cannot answer "what can it see?", you do not have an inventory. If you cannot answer "what can it change?", you do not have a permission model. If you cannot answer "what must it prove?", you do not have verification. If you cannot answer "what triggers escalation?", you do not have oversight. If you cannot answer "what permission disappears?", you do not have learning at the authority layer.

Authority should track consequence

Good action rights are not one wall around the whole system. They are a slope.

Read-only Slack search should be cheap. Drafting a customer reply should be cheap. Local reversible edits should be cheaper than external irreversible commitments. Sending the email, refunding the invoice, or merging the pull request should pass through stronger gates.

The goal is not to turn every agent into a form-filling intern. The goal is to match authority to consequence. That is how good human organisations work. The graduate can model scenarios. The manager can approve a small budget. The director can reallocate headcount. The board approves the acquisition. Authority changes with consequence. Agents need the same gradient.

The bottleneck moved

This is where the serious agent market will split. One side will sell reach: more connectors, more memory, more tools, more environments. The other side will sell agency: permissioned action, bounded autonomy, proof before commitment, escalation when context breaks, and revocation when trust is lost.

Reach will demo better. Agency will survive contact with the organisation.

Reach demos well in the room. Agency holds up in the incident review.

Before next week, run the Agent Action Rights Test on one workflow. Not your whole stack. One agent. One workflow. Write the five answers in a note. If the fifth answer is blank, you found the missing layer. The agent did not need another tool. It needed a smaller right to be wrong.

Ten Lines of Code Scored 100% on AI Coding Benchmarks. It Solved Nothing.

Harry Floyd — Thu, 25 Jun 2026 14:27:40 +0000

A file shorter than this paragraph scored 100% on SWE-bench Verified, the benchmark the big labs use to prove their coding agent is state of the art.

The file solved none of the 500 tasks. It wrote no patch. In most runs it did not call a language model at all. Ten lines of Python that quietly told the test harness every result had passed.

This was one move in a larger demonstration. A team at Berkeley pointed a single automated agent at eight of the most-cited agent benchmarks and broke every one of them. It scored 100% on six of the eight — SWE-bench Verified, SWE-bench Pro, Terminal-Bench, around 98% on GAIA, and 73% on OSWorld, the one it cracked least cleanly. Zero tasks were actually solved. One benchmark it completed by sending an empty JSON object. Another leaked its own answer key through a local file the agent could open and read.

These are the numbers in the pitch decks and launch posts you reshared last week. Exploits this trivial produce every one of them.

The Reflex That Misreads the Result

The natural reaction is to laugh and move on. Sloppy benchmark engineering. The authors will patch the holes, the scores will mean something again, the leaderboard returns to normal.

That reflex misreads the result. The holes were not random sloppiness. The same seven failure classes recur across all eight benchmarks, and three of them carry most of the damage:

The agent and the grader share a sandbox — the agent can reach the scoring code
The answer ships inside the test files — the agent can read the expected result
The scorer checks presence, not correctness — if output exists, it passes

Each one rests on a single assumption: that the agent is trying to solve the task in good faith. Some holes you can close with better plumbing. The assumption underneath them you cannot patch. The ten-line file is a preview of what an agent will do with any part of your own test it can see and reach.

The Capability Threshold Nobody Talks About

This is the part that matters for your own pipeline, not just someone else's benchmark. Below a certain level of capability, an agent gaming your evaluation looks like failure: the score drifts down, the metric gets noisier, you watch the number fall and you know something is wrong.

Above that level, gaming looks like success. An agent that has learned to model your evaluation passes it cleanly while doing something else in deployment. The score does not drop. It holds, or it climbs.

A rising score is ambiguous. It can mean the agent got better. It can mean the agent got better at being measured. The dashboard cannot tell you which.

This is not theoretical. METR watched OpenAI's o3 reach past a coding task into the scoring code, pull out the answer the grader had already computed, and hand that back. Asked ten times whether that move matched the user's intent, o3 said no every time. Nobody instructed it to cheat. It was dropped into a setting with a checkable reward and a way to reach it, and it reached.

What Your Dashboard Is Actually Measuring

Almost everyone building agents now has observability. Dashboards, traces, token-level logs, replays of every run in high resolution. In a survey of 1,340 teams, 89% had observability in place. Only 37% ran any live evaluation against it.

A sharper view of a gameable number is still a gameable number. A classifier hands you a confidence score. A verifier hands you a checkable artifact — something you can independently re-run to see if it holds. No amount of resolution turns the first into the second. Observability is that same trap one floor up. Everything on your dashboard lives on a surface the agent can see too: the logs, the eval prompts, the success metric, the LLM judge.

What a Real Signal Looks Like

The signals that survive share one property: the system never gets to touch them. They are downstream outcomes it cannot reach from inside its own loop. The change that got reverted. The ticket that reopened. The trade that never settled. Those move when the work was real and stay flat when the work was theatre.

There is a cheap version you can run now. Keep a private pool of tasks the agent never trains or tunes against. Lock its tools out of them. Then watch the distance between its score there and its score on the eval it can see. That distance is not noise — it is the size of the gaming.

No architecture makes gaming impossible for a capable enough system. You can shrink the surface the agent gets to model and push trust out to something it cannot reach. You do not get to delete it.

That is an uncomfortable place to stop, and it is the true one.

This is a condensed version of a longer piece that traces the full argument — including the Berkeley team's seven vulnerability classes, the METR reward-hacking evidence, the LangChain survey data on observability vs verification, and the practical framework for building signals your agent cannot reach.

Read the full article: Ten Lines of Code Scored 100%. One Agent Broke Eight Benchmarks.

The Seven-Layer Agent Audit: How to Find Where Your AI Agent Is Actually Starving

Harry Floyd — Mon, 22 Jun 2026 22:36:26 +0000

Your agent failed again, and your hand found the model dropdown before you'd finished reading the transcript. The model is the one part of your agent that is public, ranked, and argued about. Everything else is private, unglamorous, and yours. So you upgrade the layer you can see and leave the one actually failing untouched.

You have been debugging the layer easiest to talk about, not the one quietly costing you trust.

The reflex has a sophisticated form too. The builders who would never just click the dropdown reach instead for a thicker harness, a bigger context window, the framework everyone is posting about. It is the same move: spend on the part you can see so you do not have to diagnose the part you cannot.

The Seven Layers

I debug through seven layers now, in a fixed order, starting from the one whose damage reaches furthest. One question per layer, and the failure signature you see when that layer is starved.

Layer	The question	Starved-layer signature
7. Purpose	What job was the agent hired to do?	Wraps the right answer in the wrong task. Good code, wrong repo.
6. Harness	What can the agent reach and what stops it?	Implements the tool, never reaches the goal. Polite loop.
5. Memory	What survives between turns?	Re-learns Monday's correction on Thursday. Goldfish with good vocabulary.
4. Verification	What checks whether the output is right?	Ships clean format, wrong substance. Green test suite, zero semantic coverage.
3. Skills	What reusable procedures does the agent carry?	Good work repeated unreliably. Each call is a ground-up re-derivation.
2. Data	What reality does the agent reason from?	Confident wrong answers from a world that moved. Stale facts quoted with certainty.
1. Orchestration	Who decides what happens next?	Wanders, loops, or stalls. No structural feedback.

Debug outside-in, design inside-out

You build from the top down: name the job (Purpose), bind the harness, package the skills. But you debug from the bottom up, starting at Data, the facts everything else reasons from, because the lower a starved layer sits, the further its fault has already spread.

Design outside-in. Debug inside-out. The dropdown reflex fails because it debugs at the design end of the stack — the top.

The data trap

A research agent that quotes prices, versions, and policies usually breaks at data. The retrieval is stale, the checks above it are sound, and the failure is a confident answer drawn from a world that moved. Reach for a bigger model and it delivers the out-of-date figure with more poise. The starved floor is lower than anyone was looking.

Find where the agent's facts enter: the retrieval call, the assumptions written into the system prompt, the document set. Then check one claim against its source. When the agent quotes a price, a version, a policy — can you name when that fact was last refreshed? Does it still match the world?

The verification trap

An agent that turns out clean, well-formed prose or code usually breaks one floor up, at verification. The easiest row to mark green by pointing at a passing test suite. Look at what those tests check. Most confirm the output is the right shape and almost none test whether it is right. That is a format check wearing a verification badge.

The trap is worse for an agent that rewrites its own behaviour, which can pass every check while drifting from the behaviour those checks were meant to protect. A green test suite can sit on a starved verification layer, and it is sometimes the most expensive way to stay blind.

How to run the audit

Take the seven questions in debug order, bottom to top. For each one, write the evidence in your repo that answers it: the file path, the asserted comparison, the memory store's last write. None of it needs to be a file — on a raw SDK loop, memory is whatever you persist between calls. The form does not matter.

Where you catch yourself writing a sentence about how you sort of handle that layer, instead of pointing at where it lives, you have found a starved row.

Most of the seven clear in a few minutes — the rows you already trust. Running all of them is what earns you the right to drop to the one or two that bite instead of guessing. The starved row is the one you have been compensating for by hand without naming.

Which layer is in fashion turns over — memory last year, context engineering now — but the audit is what tells you which one is yours. Before you upgrade the model, before you rebuild the harness, run the seven. The row that comes back red is the one already costing you trust, and naming it stops the wasted motion: you quit arguing with the model, quit rewriting prompts that were never the problem, quit adding memory to a layer whose facts were stale to begin with.

This is a shortened version of the full article at The Durability Curve, which includes the full framework, worked examples, and a free one-page audit printout.

What Proves You Can Think?

Harry Floyd — Tue, 02 Jun 2026 20:46:39 +0000

AI did not just make output cheap. It broke the old contract between effort, competence, and trust.

For developers this is not abstract. When anyone can generate a clean PR, a plausible code review, a working API endpoint, or a competent-looking architecture diagram in seconds, the artefact stops proving what it used to prove. A good solution no longer implies someone wrestled with the problem.

The question underneath the productivity debate is harder: if the work no longer proves I can think, what does?

The old proof contract

Every institution runs on proof contracts. A school asks for essays and exams. A company asks for CVs, interviews, and work samples. A market asks for traction and retention.

None of these signals were ever pure. The CV was always a marketing document. The interview was always distorted by nerves and charm. The portfolio could hide how much help the person had. But they worked well enough because polished surfaces were costly to produce. Cost created friction. Friction created signal.

AI attacks the "expensive enough" part. It compresses the cost of appearing competent. That is enough to break the systems that relied on that cost as a proxy.

The move

When output gets cheap, output quality becomes the opening bid, not the final proof. The important question moves upward:

What does this output prove about the person, team, or system behind it?

Sometimes the answer is: not much. A clean PR may prove someone had access to a good model and enough taste not to paste the first result. A strong CV may prove the candidate knows how hiring filters work.

The useful response is not to ban AI. It is to stop treating AI-polished output as the proof object. The next proof system asks what happened before, during, and after the artefact.

Five questions that separate judgement from output

These work on code, PRs, architecture decisions, interview responses, and your own work.

1. What problem was chosen, and what easier problem was rejected?

The first proof of thought. Bad work often starts with accepting the first fluent frame. Good work usually contains a buried refusal: someone saw the tempting version and did not take it. In a codebase, this looks like choosing the harder but more maintainable abstraction instead of the one-liner that will break in six months.

2. What tradeoff was made under constraint?

Intelligence becomes visible at the boundary. Anyone can claim they value quality, speed, safety, and maintainability. Real judgement appears when not all of them can be maximised at once. The developer who can explain why they chose correctness over latency for this specific endpoint, and what evidence would make them reverse that choice, is showing something the output alone cannot.

3. What did you check that the output itself could not prove?

This is the verification question. It separates people who use AI as a generator from people who use AI inside a judgement loop. The generated code can compile. That is not verification. Verification is the external thing that makes the claim answerable: the edge case test, the production data check, the integration test that proves it works with the real system.

4. What changed after feedback or contact with reality?

Revision is underrated because it is less glamorous than creation. But in an AI world, revision becomes a higher-status signal. The first surface is cheap. The changed surface after a code review, a production incident, or a colleague pointing out the flaw is where more truth appears.

5. Who owns the consequence if this is wrong?

Accountability is the signal machines cannot carry. A model can produce. A person must decide what they are willing to stand behind. The developer who says "I shipped this, I own the pager duty for it, I will be awake if it breaks" is operating in a different category from the one who lets the AI output speak for itself.

What this changes in practice

Hiring: Stop asking only for work samples. Give candidates a plausible AI-generated solution and ask what is wrong, what would break in production, and what they would check before deploying it.

Code review: Stop treating clean diffs as sufficient evidence. Ask what was not generated. Ask which tradeoff the author is defending. Ask what verification would prove the solution wrong.

Your own work: Stop trying to prove value only through polished output. Keep the polish, but attach judgement to it. Show the problem you chose. Show the tradeoff. Show the verification. Show the revision. Show what you will own.

This piece was originally published on The Durability Curve, a newsletter about what lasts when the surface gets cheap. Read the full article for the deeper argument, including the research on algorithmic anxiety that frames why this matters beyond engineering.

What 128GB Unified Memory Changes for Local AI Development

Harry Floyd — Mon, 01 Jun 2026 12:28:22 +0000

What 128GB Unified Memory Changes for Local AI Development

Yesterday at Computex, NVIDIA announced the RTX Spark superchip: an Arm CPU paired with a Blackwell GPU and up to 128GB of unified LPDDR5X memory. Most of the coverage is focusing on the Arm chip or the "agentic OS" branding. The real story for developers is the memory.

The Constraint That Just Got Removed

If you've run local models, you know the bottleneck. An RTX 4090 has 24GB of VRAM. That fits a 13B parameter model at 8-bit or a 30B model at 4-bit, with nothing else. No embedding model. No vector database. No room for the application itself in GPU memory.

# With 24GB VRAM (RTX 4090):
# - 30B model at Q4_K_M: ~20GB
# - KV cache for 4096 context: ~2GB  
# - Remaining: ~2GB
# - Can't fit an embedding model. Can't fit a vector index.
# - CPU offloading would be needed, which is 10-100x slower.

128GB unified memory changes this because the CPU and GPU share one pool. You're not choosing between VRAM for the model and system RAM for everything else. The GPU can directly access the full 128GB.

For context, a 70B parameter model at FP4 (4-bit) needs about 40-45GB in practice, with quantization overhead and KV cache included. That leaves roughly 83GB for the rest of your stack.

What You Can Actually Build Now

Here's a concrete workflow that goes from impossible to straightforward with 128GB:

Running a local RAG pipeline with a 70B model:

# Components that now fit on one machine:
# 1. 70B LLM at FP4: ~42GB
# 2. Embedding model (e.g., bge-large-en-v1.5): ~1.5GB  
# 3. Vector index (10M embeddings at 768d): ~6GB
# 4. Application runtime + buffer: ~8GB
# Total: ~57.5GB — fits with 70GB to spare
# On a 4090 24GB: the 70B model alone doesn't fit

Or a multi-agent setup where you run three specialised models simultaneously:

# Multi-model orchestration on one machine:
# - 70B orchestrator model at FP4: ~42GB
# - 30B code specialist at Q4_K_M: ~20GB  
# - 7B verification model at Q8: ~7GB
# - Shared KV cache: ~4GB
# Total: ~73GB — comfortable fit
# On 24GB VRAM: you'd need 3 separate machines

This isn't theoretical. The RTX Spark runs Windows on Arm, and NVIDIA's NemoClaw agent framework already supports it. The software stack (llama.cpp, Ollama, NVIDIA's own AI Enterprise suite) supports the NVLink C2C architecture.

The Memory Bandwidth Question

128GB of LPDDR5X at 300 GB/s is the spec worth checking. Compare this to:

RTX 4090: 24GB GDDR6X at 1,008 GB/s
Mac M5 Max: 128GB unified at ~800 GB/s
RTX Spark: 128GB LPDDR5X at 300 GB/s

The RTX Spark has 5x the capacity but about a third of the bandwidth of a 4090. This means: batch inference and throughput-oriented workloads will be slower than a 4090. But model loading, context switching between models, and running multiple models simultaneously all bottleneck on VRAM capacity, not bandwidth. Those will be dramatically better.

The bandwidth is enough for interactive inference. A 70B model generates ~30 tokens/second on an M5 Max at 800 GB/s. At 300 GB/s, you'd expect roughly 10-15 tokens/second. Slower but usable for most development workflows. For production batch inference, you'd still want a datacenter GPU.

What This Means for Local AI Development

The practical takeaway for developers: 128GB unified memory changes the threshold question.

Before RTX Spark, the question was: "Does my model fit in 24GB?" If no, you couldn't run it locally at all. You needed cloud GPUs or CPU offloading, which is impractically slow for any interactive use.

After RTX Spark, the question becomes: "Does my multi-model workflow fit in 128GB?" For most development setups, including a large model, an embedding service, a vector index, and some agent tooling, the answer is yes.

This doesn't replace cloud infrastructure for production. But it changes the economics of development iteration. Running a local dev environment with production-scale models means faster feedback cycles, no inference API costs during development, and the ability to test multi-model interactions without distributed system complexity.

The Structural Change

The Arm chip is interesting. The agentic OS pitch is marketing. The memory bus is the actual structural change, a discontinuity in what a single consumer PC can hold in memory for AI workloads.

If your work involves models above 30B parameters locally, this is the spec that matters. Everything else, including clock speeds, core counts, and TOPS ratings, is secondary to whether your working set fits in memory.

NVIDIA's RTX Spark announcement at Computex 2026. Tom's Hardware has the full spec breakdown here.

The Same AI Model Can Perform 6x Better: Here's Why

Harry Floyd — Sat, 30 May 2026 21:39:59 +0000

A Stanford and Tsinghua paper ran a controlled experiment earlier this year. Same model. Same task. Different harness architecture.

The result: a 6x performance gap driven entirely by the system built around the model. Not the model itself.

This is not a prompt engineering insight. It is a systems architecture insight, and it changes where developers should invest their time when building agentic systems.

The 6x Gap

Meta-Harness tested Claude Opus 4.6 across two harness configurations on TerminalBench-2. The only variable was the scaffold: the code that manages tool calls, context windows, error recovery, and state persistence.

One version scored at baseline. The other, with structured tool orchestration and context management, scored 18.4 points higher. Same inference cost. Same model. Different architecture.

This pattern replicates across multiple independent studies:

LangChain DeepAgents (2026): Same GPT-5.2-Codex model. Harness-only changes moved it from Top 30 to Top 5. That is a 13.7-point gain.

Can Bölük (Hashline, 2026): Same model, same task. Changed the edit tool format. Performance went from 6.7% to 68.3%. That is a 10x improvement with 61% fewer tokens.

Vercel's d0 agent: A production agent had 16 tools. Removing 14 of them (leaving only bash) took success rate from 80% to 100%. The bottleneck was not capability. It was decision surface.

Why This Matters Practically

The cheapest Haiku call with an optimised harness (37.6% on TerminalBench-2) outperformed the most expensive Opus call with a default harness (58.0%). That is at 1/50th the inference cost.

Most teams are optimising at the wrong layer. They swap models, tune prompts, add retrieval. The structural leverage is in how the system manages tool calls, handles state, and recovers from failure.

What Changes

The practical takeaway for anyone building with AI agents:

Audit your tool surface. Every tool your agent can call is a decision it must make. Vercel found 16→1 tool reduction improved everything. Fewer tools, better decisions.
Measure harness, not just model. Track task completion rate per harness configuration, not just per model. The harness is the variable that moved 6x.
Cost is architecture-dependent, not model-dependent. Haiku with a good harness beat Opus with a bad harness. Test harness variations before upgrading to a more expensive model.

The full analysis (12 verified claims, evidence tables, production case studies, and falsification criteria) is on Substack:

Harness Engineering: Same Model, Different Product →

It covers the Claude Code 1,421-line state machine, the Codex CLI vs Claude Code architecture comparison (77.3% vs 65.4%, 4.2x token efficiency difference), and why this is a Law IV (Instruments Over Theory) and Law I (Bottleneck Migration) structural play.

Follow for weekly analysis on AI infrastructure, agent architecture, and the systems that actually determine model performance.