<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harry Floyd</title>
    <description>The latest articles on DEV Community by Harry Floyd (@harryfloyd).</description>
    <link>https://dev.to/harryfloyd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3933548%2Fa644e757-fdc0-4213-a2d0-37774cbe6730.png</url>
      <title>DEV Community: Harry Floyd</title>
      <link>https://dev.to/harryfloyd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harryfloyd"/>
    <language>en</language>
    <item>
      <title>Access Is Not Agency</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Sat, 27 Jun 2026 13:53:43 +0000</pubDate>
      <link>https://dev.to/harryfloyd/access-is-not-agency-3hni</link>
      <guid>https://dev.to/harryfloyd/access-is-not-agency-3hni</guid>
      <description>&lt;p&gt;Your agent can read Slack. It can search email. It can query the CRM, open GitHub issues, check billing, browse docs, edit a spreadsheet, draft a customer reply, and call three internal APIs.&lt;/p&gt;

&lt;p&gt;Everyone in the room calls it powerful. That is the first mistake.&lt;/p&gt;

&lt;p&gt;The question is not what the agent can access. The question is what it is allowed to change. Can it send the email, or only draft it? Can it update the customer's plan, or only propose the update? Can it merge the pull request, or only open it?&lt;/p&gt;

&lt;p&gt;Once an agent can act through tools, the real system is no longer the model. The real system is the action contract around the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access is reach. Agency is permissioned action under constraints.
&lt;/h2&gt;

&lt;p&gt;More tools do not automatically make the agent more agentic. More tools expand the surface on which judgment must be engineered. A connector is a door. Agency is a contract about who may walk through it, what they may carry, who inspects the bag, and who reviews the footage if something goes wrong.&lt;/p&gt;

&lt;p&gt;The current agent conversation misses this. People talk as if the next leap is connectors — give the model Slack, Gmail, GitHub, Linear, Notion, Stripe, and wait for autonomy to emerge. But a connector is not a decision right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack nobody wants to name
&lt;/h2&gt;

&lt;p&gt;If you want to know whether an agent has real agency, do not start with the model card. Start with the rights stack. Every agent action system has five layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility&lt;/strong&gt; — which data, tools, documents, logs, and systems the agent can inspect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mutation&lt;/strong&gt; — which objects the agent can change. Reading a customer record and changing it are different powers. Drafting a reply and sending it are different powers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proof&lt;/strong&gt; — what the agent must produce before a mutation becomes real. A test run, a diff, a trace, a policy check, a human approval, or an evidence bundle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalation&lt;/strong&gt; — when the agent must stop and hand the decision to someone else. Not "human in the loop" as a slogan. A named condition: missing context, a high cost of reversal, payment movement, privilege change, legal exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revocation&lt;/strong&gt; — what changes after the agent fails. A human loses trust after a bad judgment. Most agents lose nothing. They fail, get patched, and return with the same action surface. That is not delegation. That is amnesia with API keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Action Rights Test
&lt;/h2&gt;

&lt;p&gt;Run this on the most powerful agent you currently use. Not a toy. The one you are most tempted to trust.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What can the agent see?&lt;/li&gt;
&lt;li&gt;What can it change?&lt;/li&gt;
&lt;li&gt;What must it prove before the change becomes real?&lt;/li&gt;
&lt;li&gt;What triggers escalation to a human?&lt;/li&gt;
&lt;li&gt;What permission disappears after a bad run?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams can answer the first question. Some can answer the second. Almost no teams can answer all five.&lt;/p&gt;

&lt;p&gt;That is the diagnostic. If you cannot answer "what can it see?", you do not have an inventory. If you cannot answer "what can it change?", you do not have a permission model. If you cannot answer "what must it prove?", you do not have verification. If you cannot answer "what triggers escalation?", you do not have oversight. If you cannot answer "what permission disappears?", you do not have learning at the authority layer.&lt;/p&gt;
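&lt;p&gt;One way to make the five answers concrete is to write them down as data and let a script report the blanks. The sketch below is a minimal illustration, not a standard: &lt;code&gt;ActionRights&lt;/code&gt;, &lt;code&gt;missing_layers&lt;/code&gt;, and the permission strings are hypothetical names chosen for this example.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical structure: one field per layer of the rights stack.
@dataclass
class ActionRights:
    visibility: list[str] = field(default_factory=list)  # what the agent can see
    mutation: list[str] = field(default_factory=list)    # what it can change
    proof: list[str] = field(default_factory=list)       # evidence required before a change lands
    escalation: list[str] = field(default_factory=list)  # named conditions that hand off to a human
    revocation: list[str] = field(default_factory=list)  # permissions removed after a bad run

# The diagnosis attached to each unanswered question.
DIAGNOSIS = {
    "visibility": "no inventory",
    "mutation": "no permission model",
    "proof": "no verification",
    "escalation": "no oversight",
    "revocation": "no learning at the authority layer",
}

def missing_layers(rights: ActionRights) -> dict[str, str]:
    """Return the diagnosis for every layer that has no written answer."""
    return {layer: gap for layer, gap in DIAGNOSIS.items() if not getattr(rights, layer)}

# Illustrative agent: four layers answered, the fifth left blank.
support_agent = ActionRights(
    visibility=["crm.read", "slack.search"],
    mutation=["email.draft"],
    proof=["human approval before send"],
    escalation=["payment movement", "privilege change"],
)
print(missing_layers(support_agent))
```

&lt;p&gt;For this example agent the only entry returned is the revocation layer, which is the blank fifth answer the test is designed to surface.&lt;/p&gt;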

&lt;h2&gt;
  
  
  Authority should track consequence
&lt;/h2&gt;

&lt;p&gt;Good action rights are not one wall around the whole system. They are a slope.&lt;/p&gt;

&lt;p&gt;Read-only Slack search should be cheap. Drafting a customer reply should be cheap. Local reversible edits should be cheaper than external irreversible commitments. Sending the email, refunding the invoice, or merging the pull request should pass through stronger gates.&lt;/p&gt;

&lt;p&gt;The goal is not to turn every agent into a form-filling intern. The goal is to match authority to consequence. That is how good human organisations work. The graduate can model scenarios. The manager can approve a small budget. The director can reallocate headcount. The board approves the acquisition. Authority changes with consequence. Agents need the same gradient.&lt;/p&gt;
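&lt;p&gt;The gradient can be sketched as a lookup from consequence to gate. Everything here is an assumption made for illustration: the action names, the reversible/external classification of each action, and the three gate levels. Your own catalogue will classify actions differently.&lt;/p&gt;

```python
from enum import IntEnum

class Gate(IntEnum):
    NONE = 0            # run freely
    PROOF = 1           # attach a diff, test run, or trace first
    HUMAN_APPROVAL = 2  # stop and wait for a named approver

# Hypothetical action catalogue: (reversible, external) per action.
ACTIONS = {
    "slack.search": (True, False),
    "email.draft": (True, False),
    "repo.local_edit": (True, False),
    "pr.merge": (False, False),
    "email.send": (False, True),
    "invoice.refund": (False, True),
}

def required_gate(action: str) -> Gate:
    """Map consequence to authority: less reversible and more external means a stronger gate."""
    if action not in ACTIONS:
        return Gate.HUMAN_APPROVAL  # unknown actions get the strongest gate
    reversible, external = ACTIONS[action]
    if reversible and not external:
        return Gate.NONE
    if not reversible and not external:
        return Gate.PROOF
    return Gate.HUMAN_APPROVAL

for name in ACTIONS:
    print(name, required_gate(name).name)
```

&lt;p&gt;Defaulting unknown actions to the strongest gate is a deliberate choice in this sketch: a new connector should start at the steep end of the slope and earn its way down.&lt;/p&gt;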

&lt;h2&gt;
  
  
  The bottleneck moved
&lt;/h2&gt;

&lt;p&gt;This is where the serious agent market will split. One side will sell reach: more connectors, more memory, more tools, more environments. The other side will sell agency: permissioned action, bounded autonomy, proof before commitment, escalation when context breaks, and revocation when trust is lost.&lt;/p&gt;

&lt;p&gt;Reach will demo better. Agency will survive contact with the organisation.&lt;/p&gt;

&lt;p&gt;Reach demos well in the room. Agency holds up in the incident review.&lt;/p&gt;

&lt;p&gt;Before next week, run the Agent Action Rights Test on one workflow. Not your whole stack. One agent. One workflow. Write the five answers in a note. If the fifth answer is blank, you found the missing layer. The agent did not need another tool. It needed a smaller right to be wrong.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Ten Lines of Code Scored 100% on AI Coding Benchmarks. It Solved Nothing.</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:27:40 +0000</pubDate>
      <link>https://dev.to/harryfloyd/ten-lines-of-code-scored-100-on-ai-coding-benchmarks-it-solved-nothing-58p2</link>
      <guid>https://dev.to/harryfloyd/ten-lines-of-code-scored-100-on-ai-coding-benchmarks-it-solved-nothing-58p2</guid>
      <description>&lt;p&gt;A file shorter than this paragraph scored 100% on SWE-bench Verified, the benchmark the big labs use to prove their coding agent is state of the art.&lt;/p&gt;

&lt;p&gt;The file solved none of the 500 tasks. It wrote no patch. In most runs it did not call a language model at all. Ten lines of Python that quietly told the test harness every result had passed.&lt;/p&gt;

&lt;p&gt;This was one move in a larger demonstration. A team at Berkeley pointed a single automated agent at eight of the most-cited agent benchmarks and broke every one of them. It scored 100% on six of the eight, among them SWE-bench Verified, SWE-bench Pro, and Terminal-Bench. It reached around 98% on GAIA and 73% on OSWorld, the one it cracked least cleanly. Zero tasks were actually solved. One benchmark it completed by sending an empty JSON object. Another leaked its own answer key through a local file the agent could open and read.&lt;/p&gt;

&lt;p&gt;These are the numbers in the pitch decks and launch posts you reshared last week. Exploits this trivial can produce every one of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reflex That Misreads the Result
&lt;/h2&gt;

&lt;p&gt;The natural reaction is to laugh and move on. Sloppy benchmark engineering. The authors will patch the holes, the scores will mean something again, the leaderboard returns to normal.&lt;/p&gt;

&lt;p&gt;That reflex misreads the result. The holes were not random sloppiness. The same seven failure classes recur across all eight benchmarks, and three of them carry most of the damage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The agent and the grader share a sandbox&lt;/strong&gt; — the agent can reach the scoring code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The answer ships inside the test files&lt;/strong&gt; — the agent can read the expected result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The scorer checks presence, not correctness&lt;/strong&gt; — if output exists, it passes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each one rests on a single assumption: that the agent is trying to solve the task in good faith. Some holes you can close with better plumbing. The assumption underneath them you cannot patch. The ten-line file is a preview of what an agent will do with any part of your own test it can see and reach.&lt;/p&gt;
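&lt;p&gt;The third failure class is easy to reproduce in miniature. This is a toy illustration, not the Berkeley team's code or any benchmark's real scorer; the function names and the expected answer are invented for the example.&lt;/p&gt;

```python
def presence_scorer(output: str) -> bool:
    """Passes whenever any output exists, regardless of what it says."""
    return bool(output.strip())

def correctness_scorer(output: str, expected: str) -> bool:
    """Passes only when the output matches an answer held outside the agent's reach."""
    return output.strip() == expected.strip()

gamed_output = "{}"          # an empty JSON object, no work done
expected = '{"total": 42}'   # hypothetical correct answer

print(presence_scorer(gamed_output))               # True
print(correctness_scorer(gamed_output, expected))  # False
```

&lt;p&gt;The same empty object that passes the first scorer fails the second. The second scorer only helps if &lt;code&gt;expected&lt;/code&gt; lives somewhere the agent cannot read, which is the first two failure classes again.&lt;/p&gt;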

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fymi8hvi1jy2utcqh7mei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fymi8hvi1jy2utcqh7mei.png" alt="The capability-threshold inversion — below the threshold, the score tracks reality; above it, the same number could mean either improvement or gaming, and the dashboard cannot tell which" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Capability Threshold Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;This is the part that matters for your own pipeline, not just someone else's benchmark. Below a certain level of capability, an agent gaming your evaluation looks like failure: the score drifts down, the metric gets noisier, you watch the number fall and you know something is wrong.&lt;/p&gt;

&lt;p&gt;Above that level, gaming looks like success. An agent that has learned to model your evaluation passes it cleanly while doing something else in deployment. The score does not drop. It holds, or it climbs.&lt;/p&gt;

&lt;p&gt;A rising score is ambiguous. It can mean the agent got better. It can mean the agent got better at being measured. The dashboard cannot tell you which.&lt;/p&gt;

&lt;p&gt;This is not theoretical. METR watched OpenAI's o3 reach past a coding task into the scoring code, pull out the answer the grader had already computed, and hand that back. Asked ten times whether that move matched the user's intent, o3 said no every time. Nobody instructed it to cheat. It was dropped into a setting with a checkable reward and a way to reach it, and it reached.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Your Dashboard Is Actually Measuring
&lt;/h2&gt;

&lt;p&gt;Almost everyone building agents now has observability. Dashboards, traces, token-level logs, replays of every run in high resolution. In LangChain's survey of 1,340 teams, 89% had observability in place. Only 37% ran any live evaluation against it.&lt;/p&gt;

&lt;p&gt;A sharper view of a gameable number is still a gameable number. A classifier hands you a confidence score. A verifier hands you a checkable artifact — something you can independently re-run to see if it holds. No amount of resolution turns the first into the second. Observability is that same trap one floor up. Everything on your dashboard lives on a surface the agent can see too: the logs, the eval prompts, the success metric, the LLM judge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3dgtypnn0mveuq1ek2x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3dgtypnn0mveuq1ek2x5.png" alt="The observability trap — you watch the front of the board while the agent writes the back, and the real outcome sits off the screen" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Real Signal Looks Like
&lt;/h2&gt;

&lt;p&gt;The signals that survive share one property: the system never gets to touch them. They are downstream outcomes it cannot reach from inside its own loop. The change that got reverted. The ticket that reopened. The trade that never settled. Those move when the work was real and stay flat when the work was theatre.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn1wl6aklviqiivdodw5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn1wl6aklviqiivdodw5f.png" alt="The reach test — everything in reach turns green on command. The signal worth trusting is the one the agent's hands never touch" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a cheap version you can run now. Keep a private pool of tasks the agent never trains or tunes against. Lock its tools out of them. Then watch the distance between its score there and its score on the eval it can see. That distance is not noise — it is the size of the gaming.&lt;/p&gt;
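&lt;p&gt;The arithmetic behind that distance is small enough to keep in a notebook. A minimal sketch, assuming each task yields a simple pass or fail; the function names and the sample results are made up for illustration.&lt;/p&gt;

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed."""
    return sum(results) / len(results)

def gaming_gap(visible_results: list[bool], private_results: list[bool]) -> float:
    """Pass rate on the eval the agent can see minus pass rate on the private pool."""
    return pass_rate(visible_results) - pass_rate(private_results)

# Illustrative numbers only: 19 of 20 on the visible eval, 12 of 20 on the private pool.
visible = [True] * 19 + [False]
private = [True] * 12 + [False] * 8

gap = gaming_gap(visible, private)
print(f"visible {pass_rate(visible):.2f}, private {pass_rate(private):.2f}, gap {gap:.2f}")
```

&lt;p&gt;Track the gap per release rather than the visible score alone. A visible score that climbs while the gap widens is the pattern described above.&lt;/p&gt;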

&lt;p&gt;No architecture makes gaming impossible for a capable enough system. You can shrink the surface the agent gets to model and push trust out to something it cannot reach. You do not get to delete it.&lt;/p&gt;

&lt;p&gt;That is an uncomfortable place to stop, and it is the true one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a condensed version of a longer piece that traces the full argument — including the Berkeley team's seven vulnerability classes, the METR reward-hacking evidence, the LangChain survey data on observability vs verification, and the practical framework for building signals your agent cannot reach.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read the full article: &lt;a href="https://harryfloyd.substack.com/p/ten-lines-of-code" rel="noopener noreferrer"&gt;Ten Lines of Code Scored 100%. One Agent Broke Eight Benchmarks.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>evaluation</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Seven-Layer Agent Audit: How to Find Where Your AI Agent Is Actually Starving</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Mon, 22 Jun 2026 22:36:26 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-seven-layer-agent-audit-how-to-find-where-your-ai-agent-is-actually-starving-1gh4</link>
      <guid>https://dev.to/harryfloyd/the-seven-layer-agent-audit-how-to-find-where-your-ai-agent-is-actually-starving-1gh4</guid>
      <description>&lt;p&gt;Your agent failed again, and your hand found the model dropdown before you'd finished reading the transcript. The model is the one part of your agent that is public, ranked, and argued about. Everything else is private, unglamorous, and yours. So you upgrade the layer you can see and leave the one actually failing untouched.&lt;/p&gt;

&lt;p&gt;You have been debugging the layer easiest to talk about, not the one quietly costing you trust.&lt;/p&gt;

&lt;p&gt;The reflex has a sophisticated form too. The builders who would never just click the dropdown reach instead for a thicker harness, a bigger context window, the framework everyone is posting about. It is the same move: spend on the part you can see so you do not have to diagnose the part you cannot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Seven Layers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I debug through seven layers now, in a fixed order, starting from the one whose damage reaches furthest. One question per layer, and the failure signature you see when that layer is starved.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;The question&lt;/th&gt;
&lt;th&gt;Starved-layer signature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7. Purpose&lt;/td&gt;
&lt;td&gt;What job was the agent hired to do?&lt;/td&gt;
&lt;td&gt;Wraps the right answer in the wrong task. Good code, wrong repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6. Harness&lt;/td&gt;
&lt;td&gt;What can the agent reach and what stops it?&lt;/td&gt;
&lt;td&gt;Implements the tool, never reaches the goal. Polite loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. Memory&lt;/td&gt;
&lt;td&gt;What survives between turns?&lt;/td&gt;
&lt;td&gt;Re-learns Monday's correction on Thursday. Goldfish with good vocabulary.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Verification&lt;/td&gt;
&lt;td&gt;What checks whether the output is right?&lt;/td&gt;
&lt;td&gt;Ships clean format, wrong substance. Green test suite, zero semantic coverage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Skills&lt;/td&gt;
&lt;td&gt;What reusable procedures does the agent carry?&lt;/td&gt;
&lt;td&gt;Good work repeated unreliably. Each call is a ground-up re-derivation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Data&lt;/td&gt;
&lt;td&gt;What reality does the agent reason from?&lt;/td&gt;
&lt;td&gt;Confident wrong answers from a world that moved. Stale facts quoted with certainty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1. Orchestration&lt;/td&gt;
&lt;td&gt;Who decides what happens next?&lt;/td&gt;
&lt;td&gt;Wanders, loops, or stalls. No structural feedback.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Design outside-in, debug inside-out&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You build from the top down: name the job (Purpose), bind the harness, package the skills. But you debug from the bottom up, starting at Data, the facts everything else reasons from, because the lower a starved layer sits, the further its fault has already spread.&lt;/p&gt;

&lt;p&gt;Design outside-in. Debug inside-out. The dropdown reflex fails because it debugs at the design end of the stack — the top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data trap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A research agent that quotes prices, versions, and policies usually breaks at data. The retrieval is stale, the checks above it are sound, and the failure is a confident answer drawn from a world that moved. Reach for a bigger model and it delivers the out-of-date figure with more poise. The starved floor is lower than anyone was looking.&lt;/p&gt;

&lt;p&gt;Find where the agent's facts enter: the retrieval call, the assumptions written into the system prompt, the document set. Then check one claim against its source. When the agent quotes a price, a version, a policy — can you name when that fact was last refreshed? Does it still match the world?&lt;/p&gt;
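&lt;p&gt;If each fact the agent can quote carries the time its source was last refreshed, the freshness question becomes a query instead of a guess. The sketch below assumes such a record exists; the fact names, values, dates, and the 30-day limit are all hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fact records: each claim carries the time its source was last refreshed.
FACTS = {
    "pro_plan_price": {"value": "$49/month", "refreshed": datetime(2026, 1, 10, tzinfo=timezone.utc)},
    "sdk_version": {"value": "4.2.1", "refreshed": datetime(2026, 6, 20, tzinfo=timezone.utc)},
}

def stale_facts(facts: dict, now: datetime, max_age: timedelta) -> list[str]:
    """Return the names of facts whose source is older than max_age."""
    return [name for name, fact in facts.items() if now - fact["refreshed"] > max_age]

now = datetime(2026, 6, 22, tzinfo=timezone.utc)
print(stale_facts(FACTS, now, max_age=timedelta(days=30)))
```

&lt;p&gt;A freshness timestamp does not prove the fact still matches the world. It only tells you which claims to check against their source first.&lt;/p&gt;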

&lt;p&gt;&lt;strong&gt;The verification trap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent that turns out clean, well-formed prose or code usually breaks further up, at verification. It is the easiest row to mark green: you point at a passing test suite. Look at what those tests check. Most confirm the output is the right shape and almost none test whether it is &lt;em&gt;right&lt;/em&gt;. That is a format check wearing a verification badge.&lt;/p&gt;

&lt;p&gt;The trap is worse for an agent that rewrites its own behaviour, which can pass every check while drifting from the behaviour those checks were meant to protect. A green test suite can sit on a starved verification layer, and it is sometimes the most expensive way to stay blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to run the audit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take the seven questions in debug order, bottom to top. For each one, write the evidence in your repo that answers it: the file path, the asserted comparison, the memory store's last write. None of it needs to be a file — on a raw SDK loop, memory is whatever you persist between calls. The form does not matter.&lt;/p&gt;

&lt;p&gt;Where you catch yourself writing a sentence about how you &lt;em&gt;sort of&lt;/em&gt; handle that layer, instead of pointing at where it lives, you have found a starved row.&lt;/p&gt;

&lt;p&gt;Most of the seven clear in a few minutes — the rows you already trust. Running all of them is what earns you the right to drop to the one or two that bite instead of guessing. The starved row is the one you have been compensating for by hand without naming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which layer is in fashion turns over&lt;/strong&gt; — memory last year, context engineering now — but the audit is what tells you which one is &lt;em&gt;yours&lt;/em&gt;. Before you upgrade the model, before you rebuild the harness, run the seven. The row that comes back red is the one already costing you trust, and naming it stops the wasted motion: you quit arguing with the model, quit rewriting prompts that were never the problem, quit adding memory to a layer whose facts were stale to begin with.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a shortened version of the full article at &lt;a href="https://harryfloyd.substack.com/p/the-seven-layer-agent-audit" rel="noopener noreferrer"&gt;The Durability Curve&lt;/a&gt;, which includes the full framework, worked examples, and a free one-page audit printout.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>debugging</category>
    </item>
    <item>
      <title>What Proves You Can Think?</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Tue, 02 Jun 2026 20:46:39 +0000</pubDate>
      <link>https://dev.to/harryfloyd/what-proves-you-can-think-4hmh</link>
      <guid>https://dev.to/harryfloyd/what-proves-you-can-think-4hmh</guid>
      <description>&lt;p&gt;AI did not just make output cheap. It broke the old contract between effort, competence, and trust.&lt;/p&gt;

&lt;p&gt;For developers this is not abstract. When anyone can generate a clean PR, a plausible code review, a working API endpoint, or a competent-looking architecture diagram in seconds, the artefact stops proving what it used to prove. A good solution no longer implies someone wrestled with the problem.&lt;/p&gt;

&lt;p&gt;The question underneath the productivity debate is harder: if the work no longer proves I can think, what does?&lt;/p&gt;

&lt;h3&gt;
  
  
  The old proof contract
&lt;/h3&gt;

&lt;p&gt;Every institution runs on proof contracts. A school asks for essays and exams. A company asks for CVs, interviews, and work samples. A market asks for traction and retention.&lt;/p&gt;

&lt;p&gt;None of these signals were ever pure. The CV was always a marketing document. The interview was always distorted by nerves and charm. The portfolio could hide how much help the person had. But they worked well enough because polished surfaces were costly to produce. Cost created friction. Friction created signal.&lt;/p&gt;

&lt;p&gt;AI attacks the "costly to produce" part. It compresses the cost of appearing competent. That is enough to break the systems that relied on that cost as a proxy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The move
&lt;/h3&gt;

&lt;p&gt;When output gets cheap, output quality becomes the opening bid, not the final proof. The important question moves upward:&lt;/p&gt;

&lt;p&gt;What does this output prove about the person, team, or system behind it?&lt;/p&gt;

&lt;p&gt;Sometimes the answer is: not much. A clean PR may prove someone had access to a good model and enough taste not to paste the first result. A strong CV may prove the candidate knows how hiring filters work.&lt;/p&gt;

&lt;p&gt;The useful response is not to ban AI. It is to stop treating AI-polished output as the proof object. The next proof system asks what happened before, during, and after the artefact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Five questions that separate judgement from output
&lt;/h3&gt;

&lt;p&gt;These work on code, PRs, architecture decisions, interview responses, and your own work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What problem was chosen, and what easier problem was rejected?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first proof of thought. Bad work often starts with accepting the first fluent frame. Good work usually contains a buried refusal: someone saw the tempting version and did not take it. In a codebase, this looks like choosing the harder but more maintainable abstraction instead of the one-liner that will break in six months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What tradeoff was made under constraint?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intelligence becomes visible at the boundary. Anyone can claim they value quality, speed, safety, and maintainability. Real judgement appears when not all of them can be maximised at once. The developer who can explain why they chose correctness over latency for this specific endpoint, and what evidence would make them reverse that choice, is showing something the output alone cannot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What did you check that the output itself could not prove?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the verification question. It separates people who use AI as a generator from people who use AI inside a judgement loop. The generated code can compile. That is not verification. Verification is the external thing that makes the claim answerable: the edge case test, the production data check, the integration test that proves it works with the real system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. What changed after feedback or contact with reality?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Revision is underrated because it is less glamorous than creation. But in an AI world, revision becomes a higher-status signal. The first surface is cheap. The changed surface after a code review, a production incident, or a colleague pointing out the flaw is where more truth appears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Who owns the consequence if this is wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accountability is the signal machines cannot carry. A model can produce. A person must decide what they are willing to stand behind. The developer who says "I shipped this, I own the pager duty for it, I will be awake if it breaks" is operating in a different category from the one who lets the AI output speak for itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this changes in practice
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hiring:&lt;/strong&gt; Stop asking only for work samples. Give candidates a plausible AI-generated solution and ask what is wrong, what would break in production, and what they would check before deploying it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code review:&lt;/strong&gt; Stop treating clean diffs as sufficient evidence. Ask what was not generated. Ask which tradeoff the author is defending. Ask what verification would prove the solution wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your own work:&lt;/strong&gt; Stop trying to prove value only through polished output. Keep the polish, but attach judgement to it. Show the problem you chose. Show the tradeoff. Show the verification. Show the revision. Show what you will own.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This piece was originally published on The Durability Curve, a newsletter about what lasts when the surface gets cheap. &lt;a href="https://harryfloyd.substack.com/p/what-proves-you-can-think" rel="noopener noreferrer"&gt;Read the full article&lt;/a&gt; for the deeper argument, including the research on algorithmic anxiety that frames why this matters beyond engineering.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>career</category>
      <category>architecture</category>
    </item>
    <item>
      <title>What 128GB Unified Memory Changes for Local AI Development</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:28:22 +0000</pubDate>
      <link>https://dev.to/harryfloyd/what-128gb-unified-memory-changes-for-local-ai-development-kam</link>
      <guid>https://dev.to/harryfloyd/what-128gb-unified-memory-changes-for-local-ai-development-kam</guid>
      <description>&lt;h1&gt;
  
  
  What 128GB Unified Memory Changes for Local AI Development
&lt;/h1&gt;

&lt;p&gt;Yesterday at Computex, NVIDIA announced the RTX Spark superchip: an Arm CPU paired with a Blackwell GPU and up to &lt;strong&gt;128GB of unified LPDDR5X memory&lt;/strong&gt;. Most of the coverage is focusing on the Arm chip or the "agentic OS" branding. The real story for developers is the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Constraint That Just Got Removed
&lt;/h2&gt;

&lt;p&gt;If you've run local models, you know the bottleneck. An RTX 4090 has 24GB of VRAM. That fits a 13B parameter model at 8-bit or a 30B model at 4-bit, with nothing else. No embedding model. No vector database. No room for the application itself in GPU memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# With 24GB VRAM (RTX 4090):
# - 30B model at Q4_K_M: ~20GB
# - KV cache for 4096 context: ~2GB  
# - Remaining: ~2GB
# - Can't fit an embedding model. Can't fit a vector index.
# - CPU offloading would be needed, which is 10-100x slower.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;128GB unified memory changes this because the CPU and GPU share one pool. You're not choosing between VRAM for the model and system RAM for everything else. The GPU can directly access the full 128GB.&lt;/p&gt;

&lt;p&gt;For context, a 70B parameter model at FP4 (4-bit) needs about 40-45GB in practice, with quantization overhead and KV cache included. That leaves roughly 83-88GB for the rest of your stack.&lt;/p&gt;
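&lt;p&gt;The estimate can be reproduced in a few lines. Raw weight storage is parameters times bits per weight divided by eight. The 1.2 overhead factor is an assumption picked to land inside the 40-45GB range quoted above, not a measured value, and real overhead depends on the quantization format and context length.&lt;/p&gt;

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

raw = weights_gb(70, 4)        # 35.0 GB of raw 4-bit weights
OVERHEAD = 1.2                 # assumed allowance for quantization scales and KV cache
in_practice = raw * OVERHEAD   # about 42 GB
headroom = 128 - in_practice   # about 86 GB left in a 128GB pool

print(f"raw {raw:.1f} GB, in practice {in_practice:.1f} GB, headroom {headroom:.1f} GB")
```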

&lt;h2&gt;
  
  
  What You Can Actually Build Now
&lt;/h2&gt;

&lt;p&gt;Here's a concrete workflow that goes from impossible to straightforward with 128GB:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running a local RAG pipeline with a 70B model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Components that now fit on one machine:
# 1. 70B LLM at FP4: ~42GB
# 2. Embedding model (e.g., bge-large-en-v1.5): ~1.5GB  
# 3. Vector index (10M embeddings at 768d, int8-quantized): ~8GB
# 4. Application runtime + buffer: ~8GB
# Total: ~59.5GB, fits with ~68GB to spare
# On a 4090 24GB: the 70B model alone doesn't fit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
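&lt;p&gt;The vector index line is the one most sensitive to assumptions, because its size depends on the precision the vectors are stored at. A rough sketch of the raw vector storage alone, ignoring whatever extra structure a particular index adds on top:&lt;/p&gt;

```python
def vectors_gb(count: int, dims: int, bytes_per_value: int) -> float:
    """Raw storage for the vectors alone, in GB (1 GB = 1e9 bytes); index overhead is extra."""
    return count * dims * bytes_per_value / 1e9

# 10M embeddings at 768 dimensions, at three common storage widths.
for label, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {vectors_gb(10_000_000, 768, width):.2f} GB")
```

&lt;p&gt;At full float32 precision the same 10M vectors take roughly 31GB, and int8 storage takes under 8GB. A single-digit figure therefore implies a compressed index, and anything below the int8 size needs further compression such as product quantization. Either way the pipeline still fits in 128GB.&lt;/p&gt;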



&lt;p&gt;Or a multi-agent setup where you run three specialised models simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Multi-model orchestration on one machine:
# - 70B orchestrator model at FP4: ~42GB
# - 30B code specialist at Q4_K_M: ~20GB  
# - 7B verification model at Q8: ~7GB
# - Shared KV cache: ~4GB
# Total: ~73GB — comfortable fit
# On 24GB VRAM: the 42GB orchestrator alone exceeds a single card
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't theoretical. The RTX Spark runs Windows on Arm, and NVIDIA's NemoClaw agent framework already supports it. The software stack (llama.cpp, Ollama, NVIDIA's own AI Enterprise suite) supports the NVLink C2C architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Bandwidth Question
&lt;/h2&gt;

&lt;p&gt;128GB of LPDDR5X at 300 GB/s is the spec worth checking. Compare this to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 4090: 24GB GDDR6X at 1,008 GB/s&lt;/li&gt;
&lt;li&gt;Mac M5 Max: 128GB unified at ~800 GB/s&lt;/li&gt;
&lt;li&gt;RTX Spark: 128GB LPDDR5X at 300 GB/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The RTX Spark has more than 5x the capacity but under a third of the bandwidth of a 4090 (300 vs 1,008 GB/s). This means bandwidth-bound work, such as batch inference and throughput-oriented workloads on models that already fit in 24GB, will be slower than on a 4090. But keeping large models resident, switching between models, and running multiple models simultaneously are all limited by memory capacity, not bandwidth. Those will be dramatically better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bandwidth is enough for interactive inference.&lt;/strong&gt; A 70B model generates ~30 tokens/second on an M5 Max at 800 GB/s. Token generation is largely memory-bandwidth-bound, so at 300 GB/s you'd expect roughly 300/800 of that: about 11 tokens/second, or somewhere in the 10-15 range. Slower but usable for most development workflows. For production batch inference, you'd still want a datacenter GPU.&lt;/p&gt;
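
&lt;p&gt;The estimate above is a linear scaling of a reference speed by the bandwidth ratio. A minimal sketch of that arithmetic, assuming decode speed is memory-bandwidth-bound and that the quoted ~30 tokens/second reference figure holds; &lt;code&gt;scaled_tokens_per_second&lt;/code&gt; is a hypothetical helper, and real throughput also depends on compute, KV cache traffic and the inference runtime.&lt;/p&gt;

```python
def scaled_tokens_per_second(ref_tps: float, ref_bw_gbs: float, target_bw_gbs: float) -> float:
    """Scale a reference decode speed linearly with memory bandwidth."""
    return ref_tps * target_bw_gbs / ref_bw_gbs


# Reference point from the article: ~30 tokens/second at 800 GB/s.
estimate = scaled_tokens_per_second(ref_tps=30.0, ref_bw_gbs=800.0, target_bw_gbs=300.0)
print(f"{estimate:.2f} tokens/second at 300 GB/s")
```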

&lt;h2&gt;
  
  
  What This Means for Local AI Development
&lt;/h2&gt;

&lt;p&gt;The practical takeaway for developers: &lt;strong&gt;128GB unified memory changes the threshold question.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before RTX Spark, the question was: "Does my model fit in 24GB?" If not, you couldn't run it locally at interactive speeds. You needed cloud GPUs or CPU offloading, and offloading is impractically slow for any interactive use.&lt;/p&gt;

&lt;p&gt;After RTX Spark, the question becomes: "Does my multi-model workflow fit in 128GB?" For most development setups, including a large model, an embedding service, a vector index, and some agent tooling, the answer is yes.&lt;/p&gt;

&lt;p&gt;This doesn't replace cloud infrastructure for production. But it changes the economics of development iteration. Running a local dev environment with production-scale models means faster feedback cycles, no inference API costs during development, and the ability to test multi-model interactions without distributed system complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Structural Change
&lt;/h2&gt;

&lt;p&gt;The Arm chip is interesting. The agentic OS pitch is marketing. The unified 128GB memory pool is the actual structural change: a discontinuity in what a single consumer PC can hold in memory for AI workloads.&lt;/p&gt;

&lt;p&gt;If your work involves models above 30B parameters locally, this is the spec that matters. Everything else, including clock speeds, core counts, and TOPS ratings, is secondary to whether your working set fits in memory.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;NVIDIA's RTX Spark announcement at Computex 2026. Tom's Hardware has the full spec breakdown &lt;a href="https://www.tomshardware.com/laptops/nvidia-unveils-rtx-spark-superchip-at-computex-2026-new-platform-promises-to-turn-windows-into-an-agentic-ai-os-with-arm-cpu-blackwell-gpu-and-128gb-unified-memory" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Same AI Model Can Perform 6x Better: Here's Why</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Sat, 30 May 2026 21:39:59 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-same-ai-model-can-perform-6x-better-heres-why-440o</link>
      <guid>https://dev.to/harryfloyd/the-same-ai-model-can-perform-6x-better-heres-why-440o</guid>
      <description>&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2603.28052" rel="noopener noreferrer"&gt;Stanford and Tsinghua paper&lt;/a&gt; ran a controlled experiment earlier this year. Same model. Same task. Different harness architecture.&lt;/p&gt;

&lt;p&gt;The result: a 6x performance gap driven entirely by the system built &lt;em&gt;around&lt;/em&gt; the model. Not the model itself.&lt;/p&gt;

&lt;p&gt;This is not a prompt engineering insight. It is a systems architecture insight, and it changes where developers should invest their time when building agentic systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 6x Gap
&lt;/h2&gt;

&lt;p&gt;Meta-Harness tested Claude Opus 4.6 across two harness configurations on TerminalBench-2. The only variable was the scaffold: the code that manages tool calls, context windows, error recovery, and state persistence.&lt;/p&gt;

&lt;p&gt;One version scored at baseline. The other, with structured tool orchestration and context management, scored 18.4 points higher. Same inference cost. Same model. Different architecture.&lt;/p&gt;

&lt;p&gt;This pattern replicates across multiple independent studies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/langchain-ai/deepagents" rel="noopener noreferrer"&gt;LangChain DeepAgents&lt;/a&gt; (2026):&lt;/strong&gt; Same GPT-5.2-Codex model. Harness-only changes moved it from Top 30 to Top 5. That is a 13.7-point gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.can.ac/2026/02/12/the-harness-problem/" rel="noopener noreferrer"&gt;Can Bölük&lt;/a&gt; (Hashline, 2026):&lt;/strong&gt; Same model, same task. Changed the edit tool format. Performance went from 6.7% to 68.3%. That is a 10x improvement with 61% fewer tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools" rel="noopener noreferrer"&gt;Vercel's d0 agent&lt;/a&gt;:&lt;/strong&gt; A production agent had 16 tools. Removing 14 of them (leaving only bash) took success rate from 80% to 100%. The bottleneck was not capability. It was decision surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Practically
&lt;/h2&gt;

&lt;p&gt;The cheapest Haiku call with an optimised harness scored 37.6% on TerminalBench-2, against 58.0% for the most expensive Opus call with a default harness. That is roughly two-thirds of the score at 1/50th the inference cost.&lt;/p&gt;

&lt;p&gt;Most teams are optimising at the wrong layer. They swap models, tune prompts, add retrieval. The structural leverage is in how the system manages tool calls, handles state, and recovers from failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes
&lt;/h2&gt;

&lt;p&gt;The practical takeaway for anyone building with AI agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit your tool surface.&lt;/strong&gt; Every tool your agent can call is a decision it must make. Vercel cut a 16-tool surface down to little more than bash and success rate rose from 80% to 100%. Fewer tools, better decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure harness, not just model.&lt;/strong&gt; Track task completion rate per harness configuration, not just per model. The harness is the variable that moved 6x.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost is architecture-dependent, not model-dependent.&lt;/strong&gt; A cheap model with a good harness can get most of the way to an expensive model with a bad one, at a fraction of the cost. Test harness variations before upgrading to a more expensive model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
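
&lt;p&gt;Tracking completion rate per harness configuration, as point 2 suggests, needs very little machinery. A minimal sketch, assuming you log one record per agent run; the record layout, the model and harness names, and the outcomes below are all hypothetical placeholders rather than results from any of the studies cited.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical run log: (model, harness, task_succeeded). Illustrative values only.
runs = [
    ("model-a", "default", False),
    ("model-a", "default", True),
    ("model-a", "structured", True),
    ("model-a", "structured", True),
    ("model-b", "default", True),
    ("model-b", "default", False),
]


def completion_rates(records):
    """Return {(model, harness): fraction of successful runs}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for model, harness, ok in records:
        totals[(model, harness)] += 1
        successes[(model, harness)] += int(ok)
    return {key: successes[key] / totals[key] for key in totals}


for (model, harness), rate in sorted(completion_rates(runs).items()):
    print(f"{model:8s} {harness:11s} {rate:.0%}")
```

&lt;p&gt;Keying on the pair rather than the model alone is the point: it keeps a harness change from being misread as a model change.&lt;/p&gt;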

&lt;p&gt;The full analysis (12 verified claims, evidence tables, production case studies, and falsification criteria) is on Substack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://harryfloyd.substack.com/p/harness-engineering-same-model-different-product" rel="noopener noreferrer"&gt;Harness Engineering: Same Model, Different Product →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It covers the Claude Code 1,421-line state machine, the Codex CLI vs Claude Code architecture comparison (77.3% vs 65.4%, 4.2x token efficiency difference), and why this is a Law IV (Instruments Over Theory) and Law I (Bottleneck Migration) structural play.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for weekly analysis on AI infrastructure, agent architecture, and the systems that actually determine model performance.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Decision Subtraction Framework: How to Evaluate Any AI Tool</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Thu, 28 May 2026 10:39:34 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-decision-subtraction-framework-how-to-evaluate-any-ai-tool-1o1l</link>
      <guid>https://dev.to/harryfloyd/the-decision-subtraction-framework-how-to-evaluate-any-ai-tool-1o1l</guid>
      <description>&lt;p&gt;Last week someone asked me which AI tools they should be using. The question hides a problem that costs real money: there are more capable AI tools available than any single person can evaluate.&lt;/p&gt;

&lt;p&gt;ChatGPT Plus at $20/month. Claude at $20. Grok at $30. Cursor at $20. Copilot at $10. Each with a $100, $200, or $300 tier above it. Each claims to earn its place.&lt;/p&gt;

&lt;p&gt;The real question is not which tool is best. The real question is: which tools subtract more decisions than they add?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Lenses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Replacement Ratio
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; decisions replaced by the tool ÷ decisions it creates.&lt;/p&gt;

&lt;p&gt;List every decision the tool makes for you. Then list every new decision it forces you to make. Divide the first by the second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thresholds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;≥ 2.0 → Keep&lt;/li&gt;
&lt;li&gt;1.0 to under 2.0 → Evaluate&lt;/li&gt;
&lt;li&gt;&amp;lt; 1.0 → Drop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A code completion tool that writes a function body (replaces 5 decisions about syntax, structure, naming) but requires review (adds 2 decisions about correctness) has a ratio of 2.5. It passes.&lt;/p&gt;

&lt;p&gt;A meeting summariser that replaces 1 decision (should I re-listen?) but creates 3 (verify accuracy, add context, decide what to share) has a ratio of 0.33. It fails.&lt;/p&gt;
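
&lt;p&gt;The ratio and its thresholds fit in a few lines. A minimal sketch, assuming you have already counted the decisions by hand; &lt;code&gt;replacement_ratio&lt;/code&gt; and &lt;code&gt;verdict&lt;/code&gt; are hypothetical helper names, and a ratio of exactly 2.0 is treated as "keep".&lt;/p&gt;

```python
def replacement_ratio(replaced: int, created: int) -> float:
    """Decisions the tool makes for you divided by decisions it adds."""
    if created == 0:
        return float("inf")  # the tool adds no new decisions at all
    return replaced / created


def verdict(ratio: float) -> str:
    """Map a ratio onto the keep / evaluate / drop thresholds."""
    if ratio >= 2.0:
        return "keep"
    if ratio >= 1.0:
        return "evaluate"
    return "drop"


# The two examples from the text.
for name, replaced, created in [("code completion", 5, 2), ("meeting summariser", 1, 3)]:
    ratio = replacement_ratio(replaced, created)
    print(f"{name}: {ratio:.2f} -> {verdict(ratio)}")
```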

&lt;h3&gt;
  
  
  2. Friction Delta
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; time without the tool ÷ time with the tool.&lt;/p&gt;

&lt;p&gt;Include onboarding time amortised over your first 10 uses. A tool that saves 30 minutes per use but took 2 hours to learn breaks even at 4 uses. After that, it is pure gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold:&lt;/strong&gt; Break-even within 5 uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch:&lt;/strong&gt; This lens breaks for tools that enable tasks you could not do at all before. A drug discovery simulation has infinite Friction Delta because the alternative is impossible. Score those as "can't evaluate on this lens" and rely on the others.&lt;/p&gt;
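
&lt;p&gt;The break-even check is onboarding time divided by time saved per use. A minimal sketch using the 2-hour, 30-minute example from above; &lt;code&gt;break_even_uses&lt;/code&gt; and &lt;code&gt;passes_friction_delta&lt;/code&gt; are hypothetical names, and a tool that saves no time is treated as never breaking even.&lt;/p&gt;

```python
import math


def break_even_uses(onboarding_minutes: float, saved_minutes_per_use: float) -> float:
    """Number of uses needed before time saved covers onboarding time."""
    if saved_minutes_per_use > 0:
        return math.ceil(onboarding_minutes / saved_minutes_per_use)
    return math.inf  # never breaks even


def passes_friction_delta(onboarding_minutes: float, saved_minutes_per_use: float,
                          max_uses: int = 5) -> bool:
    """Apply the 'break-even within 5 uses' threshold."""
    return not break_even_uses(onboarding_minutes, saved_minutes_per_use) > max_uses


print(break_even_uses(120, 30), passes_friction_delta(120, 30))
```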

&lt;h3&gt;
  
  
  3. Attention ROI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; output quality ÷ attention consumed.&lt;/p&gt;

&lt;p&gt;Estimate cognitive load per use on a simple scale: 1 (fire and forget) to 4 (full attention required). Track whether it goes up or down over 10 uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold:&lt;/strong&gt; Attention per use should decrease over time. If you need to watch it more closely after ten uses than after one, something is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Framework Lies to You
&lt;/h2&gt;

&lt;p&gt;I tested this framework against the hardest cases I could find. It failed in five ways. Knowing them makes it useful:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decision quality matters more than quantity.&lt;/strong&gt; One high-stakes judgment (should I deploy?) outweighs 10 trivial picks (camelCase or snake_case?). Weight each decision by its stakes before you count it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Friction Delta can't measure capability expansion.&lt;/strong&gt; If a tool lets you do something new rather than just faster, skip this lens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attention ROI rewards deskilling.&lt;/strong&gt; The descending attention threshold is a Goodhart target — it rewards tools that train you to rubber-stamp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Erasure cost is invisible.&lt;/strong&gt; The framework never asks: if I use this for a year, what can I no longer do without it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error asymmetry is invisible.&lt;/strong&gt; Two tools can score identically while producing catastrophically different results when they fail.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Fourth Lens: Erasure Cost
&lt;/h2&gt;

&lt;p&gt;Ask: "If I use this tool for six months and then stop, what skill will I have lost?"&lt;/p&gt;

&lt;p&gt;Score it: 1 (nothing lost) to 4 (core competency outsourced). Score 1-2 is safe. Score 3 is a deliberate trade. Score 4 is dependency, not tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Apply: Monday Morning
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;List every AI tool you have used in the last 30 days&lt;/li&gt;
&lt;li&gt;Score Replacement Ratio and Friction Delta for each&lt;/li&gt;
&lt;li&gt;Both pass → Keep. One fails → 7-day trial. Both fail → Cancel&lt;/li&gt;
&lt;li&gt;Score Erasure Cost for the survivors&lt;/li&gt;
&lt;li&gt;When evaluating a new tool: score it before subscribing&lt;/li&gt;
&lt;/ol&gt;
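
&lt;p&gt;Steps 3 and 4 are a small decision table. A minimal sketch, assuming each lens has already been scored pass or fail; the function names are hypothetical, and the inputs reuse the scores this article assigns to its own worked examples.&lt;/p&gt;

```python
def triage(ratio_passes: bool, friction_passes: bool) -> str:
    """Step 3: both pass -> keep, one fails -> trial, both fail -> cancel."""
    passed = int(ratio_passes) + int(friction_passes)
    return {2: "keep", 1: "7-day trial", 0: "cancel"}[passed]


def erasure_label(score: int) -> str:
    """Step 4: interpret an Erasure Cost score from 1 to 4."""
    return {1: "safe", 2: "safe", 3: "deliberate trade", 4: "dependency"}[score]


# (Replacement Ratio passes, Friction Delta passes, Erasure Cost score)
tools = {
    "ChatGPT Plus": (True, True, 2),
    "Cursor Pro": (True, True, 3),
    "Meeting summariser": (False, False, 2),
}

for name, (ratio_ok, friction_ok, erasure) in tools.items():
    decision = triage(ratio_ok, friction_ok)
    note = erasure_label(erasure) if decision == "keep" else "n/a"
    print(f"{name}: {decision} (erasure: {note})")
```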

&lt;h2&gt;
  
  
  Worked Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ChatGPT Plus ($20/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replacement Ratio:&lt;/strong&gt; 3.5. Replaces research lookups, drafting, formatting. Creates verification and prompt decisions. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friction Delta:&lt;/strong&gt; Break-even in 2-3 uses. Shallow learning curve. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention ROI:&lt;/strong&gt; Decreasing. Gets faster as you learn its patterns. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erasure Cost:&lt;/strong&gt; 2. The underlying skill (structuring an argument) is reinforced, not replaced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict:&lt;/strong&gt; Keep.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cursor Pro ($20/month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replacement Ratio:&lt;/strong&gt; 4.0. Replaces syntax lookups, boilerplate, function structure. Creates code review decisions. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friction Delta:&lt;/strong&gt; Break-even in 1-2 uses. Tab completion is instant. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention ROI:&lt;/strong&gt; Steeply decreasing. Pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erasure Cost:&lt;/strong&gt; 3. Heavy users report difficulty writing syntax without it after 3+ months. A deliberate trade worth making.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict:&lt;/strong&gt; Keep for daily coding. Monitor erasure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Meeting Summariser ($20/month, anonymised)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replacement Ratio:&lt;/strong&gt; 0.33. Replaces 1 decision. Creates 3. Fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friction Delta:&lt;/strong&gt; Never breaks even. Still attend meetings, still verify. Fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention ROI:&lt;/strong&gt; Flat. Must check every summary at same level. Fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Erasure Cost:&lt;/strong&gt; 2. Minor skill atrophy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verdict:&lt;/strong&gt; Cancel.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This framework connects to a deeper structural principle: a tool's value is the difficulty it removes. If it creates new difficulty of a different kind, it is not a tool. It is a job.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full framework with diagram: &lt;a href="https://telegra.ph/The-Decision-Subtraction-Framework-How-to-Evaluate-Any-AI-Tool-05-28" rel="noopener noreferrer"&gt;https://telegra.ph/The-Decision-Subtraction-Framework-How-to-Evaluate-Any-AI-Tool-05-28&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>frameworks</category>
      <category>decisionmaking</category>
    </item>
    <item>
      <title>The Verification Bottleneck: Why Testing AI Agents Is Harder Than Building Them</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Sat, 23 May 2026 01:54:26 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-verification-bottleneck-why-testing-ai-agents-is-harder-than-building-them-3716</link>
      <guid>https://dev.to/harryfloyd/the-verification-bottleneck-why-testing-ai-agents-is-harder-than-building-them-3716</guid>
      <description>&lt;h1&gt;
  
  
  The Verification Bottleneck: Why Testing AI Agents Is Harder Than Building Them
&lt;/h1&gt;

&lt;p&gt;The AI industry has a supply problem. Not with chips, not with models, not with capital. It is a verification problem.&lt;/p&gt;

&lt;p&gt;Every week, another agent framework launches. Every month, another company announces autonomous task completion. The build rate is accelerating. But the verification rate, the speed at which we can determine whether an agent’s output is correct, safe and worth trusting, is not keeping pace.&lt;/p&gt;

&lt;p&gt;This gap between build speed and verify speed is structural. It is not going to close on its own. And it is the bottleneck the market is not yet pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build-Verify Asymmetry
&lt;/h2&gt;

&lt;p&gt;Building an AI agent is getting cheaper. Open-weight models, orchestration frameworks like LangGraph and Semantic Kernel, managed inference APIs. The cost of standing up a functional agent pipeline has dropped by an order of magnitude in 18 months.&lt;/p&gt;

&lt;p&gt;Verification has not followed the same curve.&lt;/p&gt;

&lt;p&gt;To verify a single agent run, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ground-truth data for the task domain&lt;/li&gt;
&lt;li&gt;A mechanism for comparing structured and unstructured outputs&lt;/li&gt;
&lt;li&gt;Edge-case coverage for the long tail of user inputs&lt;/li&gt;
&lt;li&gt;Human review loops for uncertain cases&lt;/li&gt;
&lt;li&gt;Regression test suites that survive model updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is expensive to build and maintain. And unlike the agent’s inference cost (which drops predictably with hardware improvements), verification cost is labour-proportional. Human review scales with headcount. Ground-truth data requires domain expertise to curate.&lt;/p&gt;

&lt;p&gt;This creates an asymmetry that compounds with scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Building&lt;/th&gt;
&lt;th&gt;Verifying&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost trajectory&lt;/td&gt;
&lt;td&gt;Dropping (compute + models)&lt;/td&gt;
&lt;td&gt;Flat or rising (labour + data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling method&lt;/td&gt;
&lt;td&gt;More compute&lt;/td&gt;
&lt;td&gt;More humans or better instruments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automation potential&lt;/td&gt;
&lt;td&gt;High (the agent itself)&lt;/td&gt;
&lt;td&gt;Low (ground truth is domain-specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;No agent&lt;/td&gt;
&lt;td&gt;Wrong agent output that looks correct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fourth row is the dangerous one. A false negative in verification (approving an incorrect agent output) has no visible failure signal until the downstream damage is done. A false positive (rejecting a correct output) creates immediate friction and user frustration. Because only the false positives are visible, verification systems get tuned to reduce them, and the invisible false negatives are left to accumulate.&lt;/p&gt;
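
&lt;p&gt;One way to make the invisible side visible is to audit a sample of outputs by hand and count the two error types separately. A minimal sketch, assuming a small labelled audit sample; the data and the helper names are hypothetical, and "positive" here means the verifier flagged a problem.&lt;/p&gt;

```python
# Hypothetical audit sample: (output_is_correct, verifier_approved).
audit = [
    (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, True),
]


def rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def verifier_error_rates(samples):
    """Return (false_negative_rate, false_positive_rate) for the verifier."""
    approved_when_wrong = [approved for correct, approved in samples if not correct]
    rejected_when_right = [not approved for correct, approved in samples if correct]
    return rate(approved_when_wrong), rate(rejected_when_right)


fn_rate, fp_rate = verifier_error_rates(audit)
print(f"incorrect outputs approved: {fn_rate:.0%}")
print(f"correct outputs rejected:   {fp_rate:.0%}")
```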

&lt;h2&gt;
  
  
  What the Market Is Missing
&lt;/h2&gt;

&lt;p&gt;When developers talk about agent reliability, the conversation usually lands on one of three things: chain-of-thought prompting, retrieval-augmented generation quality, or model fine-tuning. These are useful but they are not verification. They are attempts to make the agent less likely to produce wrong outputs in the first place.&lt;/p&gt;

&lt;p&gt;Verification is a separate problem. It is the instrument that detects whether the output, regardless of how it was produced, is correct.&lt;/p&gt;

&lt;p&gt;This is a Law IV problem. Law IV of the Durability Curve framework says that hidden structure stays hidden until you build the instrument to observe it. The hidden structure in AI agents is their failure modes. We are deploying agents without instruments to observe failures at scale because those instruments do not exist yet in any reliable form.&lt;/p&gt;

&lt;p&gt;The companies that build those instruments will capture value that currently sits unclaimed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Verification Infrastructure Looks Like
&lt;/h2&gt;

&lt;p&gt;The verification problem decomposes into layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural verification.&lt;/strong&gt; Does the agent’s output conform to a known schema? JSON parsers, Pydantic models, and structured output constraints handle this today. This layer is the most mature but only catches format errors, not semantic ones.&lt;/p&gt;
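
&lt;p&gt;To show how little this layer checks, here is a minimal structural verifier. It is a sketch that uses only the standard library rather than Pydantic, and the schema, its field names and &lt;code&gt;structural_errors&lt;/code&gt; are hypothetical. Note that the first reply passes whether or not a refund was the right action: format is verified, meaning is not.&lt;/p&gt;

```python
import json

# Hypothetical schema for an agent reply: field name -> required Python type.
REPLY_SCHEMA = {"action": str, "ticket_id": int, "confidence": float}


def structural_errors(raw: str, schema: dict = REPLY_SCHEMA) -> list:
    """Return a list of format problems; an empty list means the shape is valid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


print(structural_errors('{"action": "refund", "ticket_id": 42, "confidence": 0.99}'))
print(structural_errors('{"action": "refund"}'))
```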

&lt;p&gt;&lt;strong&gt;Semantic verification.&lt;/strong&gt; Does the output mean what we think it means? This is where the hard problems live. For a code-generating agent, does the produced code actually solve the user’s problem? For a document-analysis agent, are the extracted facts correct? This requires a second model, a verifier, running in evaluation mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioural verification.&lt;/strong&gt; Does the agent behave appropriately across a distribution of inputs? Not just single-shot accuracy but conversation-level coherence, safety boundary adherence, and refusal calibration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; Can you trace what the agent did, why it did it, and where it went wrong? This is the instrumentation layer: how tools, prompts and agent steps create signal. Datadog and ServiceNow are building in this space, but the landscape is fragmented and the standards are immature.&lt;/p&gt;

&lt;p&gt;The market currently prices the first layer (structural verification) as solved, which it mostly is. It ignores the existence of the second and third layers. And it treats the fourth layer as a monitoring problem rather than a verification problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Instrument-Making Opportunity
&lt;/h2&gt;

&lt;p&gt;The history of technology markets suggests a pattern: the layer that controls verification captures a disproportionate share of value.&lt;/p&gt;

&lt;p&gt;In software, testing and observability tools (New Relic, Datadog, Selenium) grew into businesses and ecosystems larger than many of the products they tested. In hardware, inspection and metrology (KLA, ASML’s metrology) is a multi-billion-dollar, high-margin segment alongside fabrication equipment.&lt;/p&gt;

&lt;p&gt;The same dynamic is unfolding in AI. The companies building agent-verification infrastructure (whether through evaluation frameworks, structured-output tooling or agent-observability platforms) are building instruments for a structure the market does not yet see clearly.&lt;/p&gt;

&lt;p&gt;The falsification condition for this thesis is straightforward: if existing evaluation approaches (benchmarks, human review, test suites) prove sufficient for production agent deployment, the verification bottleneck does not materialise. But signals from production deployments, including the proliferation of guardrails, the emergence of dedicated evaluation teams at frontier labs and the growing literature on agent failure modes, suggest the opposite.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Should Watch
&lt;/h2&gt;

&lt;p&gt;Three signals indicate whether the verification layer is becoming load-bearing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal one: agent deployment velocity vs. incident rate.&lt;/strong&gt; If agents are deployed faster but incident rates are not rising proportionally, verification is keeping pace. If incidents are accelerating faster than deployment, the bottleneck is tightening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal two: emergence of dedicated agent-evaluation roles.&lt;/strong&gt; The first companies to hire “agent verifier” or “AI evaluation engineer” as a distinct role, outside QA, are signalling that verification is not a subset of testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal three: consolidation around evaluation standards.&lt;/strong&gt; If the ecosystem converges on one or two evaluation frameworks (beyond simple benchmark suites) within the next 12 months, the instrument-making phase is accelerating.&lt;/p&gt;

&lt;p&gt;The build-verify asymmetry is not permanent. It is a market inefficiency that will correct, either through better verification infrastructure or through a pullback in agent deployment when undetected failures accumulate. Which correction path wins depends on whether the instrument makers move faster than the failure curve.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>analysis</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>NVIDIA's $81.6B Quarter Confirms the Networking Bottleneck — Here's What Developers Should Know</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Thu, 21 May 2026 08:39:53 +0000</pubDate>
      <link>https://dev.to/harryfloyd/nvidias-816b-quarter-confirms-the-networking-bottleneck-heres-what-developers-should-know-5hal</link>
      <guid>https://dev.to/harryfloyd/nvidias-816b-quarter-confirms-the-networking-bottleneck-heres-what-developers-should-know-5hal</guid>
      <description>&lt;p&gt;NVIDIA reported $81.6 billion in revenue for Q1 FY2027 yesterday. That's an 85% year-over-year increase. Non-GAAP EPS of $1.87 beat consensus by 6%. Q2 guidance of $91 billion is 4% above what analysts expected.&lt;/p&gt;

&lt;p&gt;By every headline measure, this was a clean beat and raise. The stock closed up 1.8% and was flat after hours.&lt;/p&gt;

&lt;p&gt;The pattern is now five out of six quarters where NVIDIA beats the numbers and the stock doesn't rally. The headline numbers are priced before the print. The signal is in the &lt;em&gt;composition&lt;/em&gt; of revenue, not the total.&lt;/p&gt;

&lt;h2&gt;
  
  
  The networking number that changes the story
&lt;/h2&gt;

&lt;p&gt;Data Center networking revenue reached &lt;strong&gt;$14.8 billion&lt;/strong&gt; — a record, up &lt;strong&gt;199% year-over-year&lt;/strong&gt; and 35% sequentially. Compare that to Data Center compute revenue of $60.4 billion, which grew 77% year-over-year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The networking segment is growing at 2.6x the rate of the compute segment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the clearest signal yet that the bottleneck in AI training is migrating from GPU FLOPs to inter-GPU bandwidth. As clusters scale past 50,000 devices, the wall-clock constraint shifts from "how fast can the GPU multiply matrices" to "how fast can the network move gradients between GPUs."&lt;/p&gt;

&lt;p&gt;Two years ago, networking was roughly 12% of Data Center revenue. It is now 20% and accelerating.&lt;/p&gt;
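
&lt;p&gt;Both derived figures, the 2.6x growth multiple and the 20% share, follow from the reported numbers. A minimal sketch of the arithmetic, assuming Data Center revenue is the sum of the networking and compute figures quoted above; the variable names are illustrative.&lt;/p&gt;

```python
# Figures quoted in this article, in billions of USD.
networking_bn = 14.8
compute_bn = 60.4

# Year-over-year growth rates quoted in this article, in percent.
networking_growth_pct = 199
compute_growth_pct = 77

data_center_bn = networking_bn + compute_bn
growth_multiple = networking_growth_pct / compute_growth_pct
networking_share = networking_bn / data_center_bn

print(f"Data Center total: ${data_center_bn:.1f}B")
print(f"networking growth vs compute growth: {growth_multiple:.1f}x")
print(f"networking share of Data Center: {networking_share:.1%}")
```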

&lt;h2&gt;
  
  
  The full-stack moat is protecting margins
&lt;/h2&gt;

&lt;p&gt;GAAP gross margin reached &lt;strong&gt;74.9%&lt;/strong&gt; — up from 60.6% a year ago. This directly contradicts the narrative that Blackwell volume production would compress margins through CoWoS packaging and HBM memory costs.&lt;/p&gt;

&lt;p&gt;Margins are expanding because the full stack (CUDA + NVLink + Spectrum-X + Blackwell silicon) creates pricing power that chip-design-alone cannot match. No competitor currently replicates this combination of scale and margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key number:&lt;/strong&gt; Operating income of ~$53.3 billion, up 147% year-over-year. GAAP net income tripled to $58.3 billion.&lt;/p&gt;

&lt;h2&gt;
  
  
  $48.6 billion in free cash flow — in one quarter
&lt;/h2&gt;

&lt;p&gt;That's a 60% FCF margin. Only about 15 companies in the world generate more net profit in an entire year than NVIDIA generates in cash in three months.&lt;/p&gt;

&lt;p&gt;Management's confidence shows in capital returns: dividend from $0.01 to $0.25 per share (25x increase), new $80 billion buyback authorization, and ~$20 billion returned to shareholders during the quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vera Rubin timeline de-risks the transition
&lt;/h2&gt;

&lt;p&gt;Confirmed: on track for H2 2026, starting Q3, volume ramp in Q4. Architectural transitions are the highest-risk moment for any semiconductor company. This quarter's confirmation substantially de-risks the handoff.&lt;/p&gt;

&lt;p&gt;Blackwell demand is so strong it's driving up secondary-market prices for older Hopper and Ampere GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the bottleneck matters for infrastructure engineers
&lt;/h2&gt;

&lt;p&gt;If you're designing training clusters, the implication is direct: the binding constraint on throughput is shifting from GPU selection to network topology. Spectrum-X Ethernet and InfiniBand choices matter more per dollar than GPU generation increments at the margin.&lt;/p&gt;

&lt;p&gt;For inference workloads, the bottleneck migration has different dynamics — memory bandwidth and model distribution across nodes become the constraint. But the pattern holds: the easy gains are no longer at the compute layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one risk
&lt;/h2&gt;

&lt;p&gt;Data Center now accounts for 92% of revenue. This is not a company risk. It is a regime risk. If hyperscaler capex turns down, 92% of revenue faces the same headwind.&lt;/p&gt;

&lt;h2&gt;
  
  
  All falsification triggers are green
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Q2 guide below $85B → Guided $91B&lt;/li&gt;
&lt;li&gt;Vera Rubin delayed beyond Q3 → On track Q3&lt;/li&gt;
&lt;li&gt;Gross margin below 73% → 74.9%&lt;/li&gt;
&lt;li&gt;Networking growth &amp;lt; compute growth → 199% vs 77%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thesis is intact and the data is strengthening it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://durability-curve.pages.dev/blog/nvda-q1-fy2027-the-networking-number-that-changes-the-story/" rel="noopener noreferrer"&gt;The Durability Curve&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>architecture</category>
      <category>investing</category>
    </item>
    <item>
      <title>The Indium Phosphide Bottleneck: What the Market Missed</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Wed, 20 May 2026 11:19:09 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-indium-phosphide-bottleneck-what-the-market-missed-44lc</link>
      <guid>https://dev.to/harryfloyd/the-indium-phosphide-bottleneck-what-the-market-missed-44lc</guid>
      <description>&lt;p&gt;A UK semiconductor company lost a quarter of its value in a single day this week. The market assumed Huawei exposure was catastrophic. The Huawei exposure is under five percent of revenue.&lt;/p&gt;

&lt;p&gt;The panic was about the wrong thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Indium Phosphide Fits in the AI Stack
&lt;/h2&gt;

&lt;p&gt;Every 1.6T optical transceiver shipping into an AI cluster relies on lasers made from indium phosphide (InP). Silicon has an indirect bandgap and cannot lase efficiently. InP, a direct-bandgap material, can.&lt;/p&gt;

&lt;p&gt;The supply chain:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw indium (zinc byproduct)
  → InP boule → sliced into substrates (AXT Inc)
    → Epitaxial wafer growth (IQE)
      → Laser fabrication (Lumentum, Coherent, MACOM)
        → Transceiver assembly
          → AI GPU clusters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IQE sits at the third step. It is the only independent large-scale epiwafer foundry with qualified InP capacity globally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Switching Is Hard
&lt;/h2&gt;

&lt;p&gt;Qualifying a new epiwafer supplier takes 12-24 months of sample testing, device fabrication trials, and reliability qualification. Once a supplier is qualified, switching away from it is enormously costly. There are approximately five qualified InP epiwafer suppliers worldwide. IQE is the only independent one at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MACOM Signal
&lt;/h2&gt;

&lt;p&gt;In April 2026, MACOM invested £81M into IQE at 19.8p per share. MACOM received 11.5% ownership and two board seats, plus a long-term strategic supply agreement. The message: MACOM cannot secure InP capacity without IQE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Silicon Photonics Does Not Kill This
&lt;/h2&gt;

&lt;p&gt;The common objection: silicon photonics will replace InP. This misunderstands the technology. Even silicon photonic ICs need InP lasers, integrated heterogeneously. Monolithic lasers-on-silicon are not expected before 2030. InP is needed regardless of which platform wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tension
&lt;/h2&gt;

&lt;p&gt;IQE is not a clean story. Gross margins were under 4% last year on £118M revenue because reactors run at roughly half capacity. The wireless segment (57% of revenue) is declining. Three CEOs in five years. The current CEO also serves as CFO.&lt;/p&gt;

&lt;p&gt;But the underlying structural thesis is clean: the InP bottleneck is real and tightening. AXT Inc is doubling capacity for 2027. Lumentum reported 90% revenue growth. The optical interconnect buildout for AI is a capital expenditure cycle backed by the largest technology companies in the world.&lt;/p&gt;

&lt;p&gt;The pattern has historical precedent. As Marc Levinson documents in &lt;a href="https://www.amazon.co.uk/dp/0691136408/?tag=giftfndr0d8-21" rel="noopener noreferrer"&gt;The Box&lt;/a&gt;, the shipping container created a modular interface that collapsed freight costs. The same structural pattern is now playing out in optical interconnects.&lt;/p&gt;

&lt;p&gt;The market panicked about Huawei and ignored the structural story underneath. That is the pattern: value migrates upward as lower layers commoditise, and the market is slow to update on where the new bottleneck sits.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://telegra.ph/The-Indium-Phosphide-Bottleneck-05-20" rel="noopener noreferrer"&gt;Telegraph&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>analysis</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Earnings Report Is Not the Signal</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Wed, 20 May 2026 09:15:00 +0000</pubDate>
      <link>https://dev.to/harryfloyd/the-earnings-report-is-not-the-signal-3l6f</link>
      <guid>https://dev.to/harryfloyd/the-earnings-report-is-not-the-signal-3l6f</guid>
      <description>&lt;p&gt;NVDA reports $80B+ tonight. The headlines will focus on the number. Here is why it will not tell you what you need to know about the AI buildout paradigm.&lt;/p&gt;

&lt;p&gt;Every earnings season, the same pattern plays out. A company reports a number. The market moves. Analysts revise targets. By the next morning, everyone is asking the wrong question.&lt;/p&gt;

&lt;p&gt;The question is not whether NVIDIA beat or missed. The question is what the structure of the quarter reveals about the migration path of the bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Earnings Number Is a Lagging Indicator
&lt;/h2&gt;

&lt;p&gt;A revenue beat tells you what happened in the past ninety days. The structural variables live in the forward-looking text, not the headline number.&lt;/p&gt;

&lt;p&gt;Consider three data points from NVIDIA's last quarter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue of $68 billion. A beat. Priced in hours.&lt;/li&gt;
&lt;li&gt;Supply commitments doubled to $95 billion. A signal. Took weeks to fully price.&lt;/li&gt;
&lt;li&gt;A $500 million investment in a glass company. A map. Months later, it is still being understood.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Three Kinds of Data in Every Report
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Signals&lt;/strong&gt; — Forward-looking structural data that changes the probability distribution. Supply commitments. Lead times. Capacity expansion timelines. Customer concentration shifts. Rare. Worth a position.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noise&lt;/strong&gt; — Beats and misses within expected range. Drive the overnight move. Mean nothing for the thesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Echoes&lt;/strong&gt; — Lagging confirmation of a known trend. Useful for calibration. Add no new information.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch Tonight
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Data Center narrative&lt;/strong&gt; — is the mix shifting from training to inference? That is Law I migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain language&lt;/strong&gt; — optical interconnects and glass substrates in prepared remarks are structural signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply commitment growth rate&lt;/strong&gt;: if it exceeds revenue growth, NVIDIA is building against a constraint it sees coming.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why This Framework Matters
&lt;/h2&gt;

&lt;p&gt;The durability of an investment thesis is determined by how well you distinguish signal from noise from echo. The market is designed to make everything look equally important. The structure that separates durable value from temporary noise is invisible to the real-time feed.&lt;/p&gt;

&lt;p&gt;The report is not the signal. The structure behind the report is the signal.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://telegra.ph/The-Earnings-Report-Is-Not-the-Signal-05-20" rel="noopener noreferrer"&gt;The Durability Curve&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>analysis</category>
      <category>architecture</category>
    </item>
    <item>
      <title>3 Infrastructure Bottlenecks That Exist Beyond Any Single Earnings Report</title>
      <dc:creator>Harry Floyd</dc:creator>
      <pubDate>Tue, 19 May 2026 09:25:37 +0000</pubDate>
      <link>https://dev.to/harryfloyd/3-infrastructure-bottlenecks-that-exist-beyond-any-single-earnings-report-2nd0</link>
      <guid>https://dev.to/harryfloyd/3-infrastructure-bottlenecks-that-exist-beyond-any-single-earnings-report-2nd0</guid>
      <description>&lt;p&gt;NVIDIA reports Q1 FY2027 earnings tomorrow. The consensus sits at roughly $79B in revenue with a 7% options move priced in. Every analyst note and headline will be about whether the number clears the bar.&lt;/p&gt;

&lt;p&gt;None of that matters for the three bottlenecks that will define compute infrastructure investment for the next three years.&lt;/p&gt;

&lt;p&gt;These constraints are structurally decoupled from NVIDIA’s quarterly variance. They exist whether the beat is 3% or 8%. They exist because of a first-principles property of large-scale systems: value migrates upward as lower layers commoditise.&lt;/p&gt;

&lt;p&gt;The GPU layer is the lower layer. These three sit above it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Grid Transformer: 128 Weeks and No Substitute
&lt;/h2&gt;

&lt;p&gt;The electrical transformer that steps voltage from transmission lines to data center distribution levels has a procurement lead time of 80 to 128 weeks globally. This is a structural ceiling on how fast AI infrastructure can physically be built.&lt;/p&gt;

&lt;p&gt;This constraint exists regardless of GPU supply. You can have every Blackwell GPU allocated and paid for. If the transformer cabinet is not bolted to a concrete pad with an energized feed from the grid, those GPUs are room-temperature silicon.&lt;/p&gt;

&lt;p&gt;The market is slowly noticing. ABB, Siemens Energy, and Schneider Electric have rerated upward. But the real asymmetry sits two layers deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Optical Link: Data Movement, Not Compute
&lt;/h2&gt;

&lt;p&gt;Inside every large AI training cluster, data moves between GPUs at terabit speeds. As clusters scale to 100,000+ GPUs, the distance between nodes forces a fundamental migration from electrical to optical interconnects.&lt;/p&gt;

&lt;p&gt;NVIDIA spent approximately $4B securing supply from Lumentum and Coherent. The photonics supply chain is tight through at least 2027.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The Evaluation Layer: Trust as Infrastructure
&lt;/h2&gt;

&lt;p&gt;As AI agents move from demos into production workflows, the binding constraint shifts from “can the model do this?” to “can we prove it did it correctly?”&lt;/p&gt;

&lt;p&gt;Some 2-3% of AI-generated code passes tests but still contains subtle errors. Catching that minority requires evaluation infrastructure that most teams do not have.&lt;/p&gt;

&lt;p&gt;This market barely exists today. No dominant platform for AI evaluation exists. That absence is itself the pattern: the layer below is commoditising, and the layer above is where value forms next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why These Three
&lt;/h2&gt;

&lt;p&gt;Each bottleneck sits at a layer above the GPU in the infrastructure stack. The GPU is being efficiently priced by the market. The constraints above it are where the market’s attention has not yet reached.&lt;/p&gt;

&lt;p&gt;NVIDIA will beat or miss tomorrow. By Thursday the financial media will have moved on. These three bottlenecks will still be tightening.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;36-page NVIDIA earnings research report: &lt;a href="https://harryfloyd.gumroad.com/l/nvda-q1-fy2027-earnings-research-report?utm_source=devto" rel="noopener noreferrer"&gt;Gumroad&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Free weekly analysis: &lt;a href="https://harryfloyd.substack.com?utm_source=devto" rel="noopener noreferrer"&gt;The Durability Curve&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>architecture</category>
      <category>analysis</category>
    </item>
  </channel>
</rss>
