<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Tuszynski</title>
    <description>The latest articles on DEV Community by Michael Tuszynski (@michaeltuszynski).</description>
    <link>https://dev.to/michaeltuszynski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1447774%2Fa99eea93-7845-4764-9fce-b1755bcfa456.png</url>
      <title>DEV Community: Michael Tuszynski</title>
      <link>https://dev.to/michaeltuszynski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaeltuszynski"/>
    <language>en</language>
    <item>
      <title>How to Run an Agent Loop Without Burning Your Token Budget</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 16 Jun 2026 02:43:55 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/how-to-run-an-agent-loop-without-burning-your-token-budget-1ffp</link>
      <guid>https://dev.to/michaeltuszynski/how-to-run-an-agent-loop-without-burning-your-token-budget-1ffp</guid>
      <description>&lt;p&gt;There's a diagram making the rounds that splits the world into "prompt engineering" and "loop engineering." The pitch: stop writing prompts one at a time and let the agent drive. Set a goal, fire a trigger, let the agent act, check whether the goal is met, and repeat until it is.&lt;/p&gt;

&lt;p&gt;The pattern is real. An autonomous agent running its own act-check-repeat cycle is &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;the core loop behind every agentic coding tool&lt;/a&gt;, including the one I use every day. The shape isn't new either: the &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;reason-act-observe loop was formalized back in 2022&lt;/a&gt;. What the picture skips are the two decisions that determine whether a loop earns its keep or quietly drains your account.&lt;/p&gt;

&lt;p&gt;A loop is cheap to start and expensive to run. The moment you hand an agent the keys, you trade a problem you understand — writing the next prompt — for two that are harder: framing the goal so the agent can actually reach it, and defining the exit so it stops before the cost outruns the value. Get those wrong and you haven't automated your work. You've built a slot machine and pointed it at your API bill.&lt;/p&gt;

&lt;h2&gt;The question is the hard part, not the loop&lt;/h2&gt;

&lt;p&gt;Look at that flowchart again and find the diamond labeled "Goal met?" That one box does all the work, and it's drawn as a trivial yes/no. It isn't. The entire economics of the loop live there.&lt;/p&gt;

&lt;p&gt;A goal an agent can check is a goal an agent can finish. "Make every test in &lt;code&gt;e2e/&lt;/code&gt; pass without editing the test files" has a built-in pass/fail. The agent runs the suite, reads the result, and knows where it stands. "Make the dashboard better" has nothing. There's no signal that says done, so the agent either spins forever or declares victory on a target it invented.&lt;/p&gt;

&lt;p&gt;So the skill that replaces prompt-writing isn't loop design. It's writing a goal with a test attached. Before you start a loop, answer one question: how will the agent know it succeeded, without me reading the output? If the only judge is your own eyes, you don't have a loop. You have an assistant that never learned to stop.&lt;/p&gt;

&lt;p&gt;The cheap, trustworthy judges are the ones you already own, such as a test suite, a type checker, a linter, a schema validator, or a diff against expected output. The expensive judges are the ones that cost as much as the work itself — another model grading the first model, or you. When checking the answer is as hard as producing it, a loop adds cost and removes nothing.&lt;/p&gt;

&lt;h2&gt;A loop needs three limits, not one&lt;/h2&gt;

&lt;p&gt;The diagram has one exit: goal met. Real loops need three, because "goal met" is the exit that might never fire.&lt;/p&gt;

&lt;p&gt;An iteration cap. A hard ceiling on turns. Not because you expect to hit it, but because the runs that blow up your bill are the ones that never converge, and a cap is the only thing standing between "didn't converge" and "didn't stop." Pick a number — ten, twenty — and treat hitting it as a failure to investigate, not a budget to spend.&lt;/p&gt;

&lt;p&gt;A cost ceiling. Count tokens, not just turns. An agent that re-reads a 100,000-token context on every pass has spent two million input tokens over twenty iterations before doing any real work. Set a budget in tokens or dollars and stop when you reach it, the same way you'd put a timeout on a network call you don't fully trust.&lt;/p&gt;

&lt;p&gt;A no-progress detector. This is the one people skip, and it's the one that saves the most. Track the success signal across iterations. If the last three passes didn't move it — same test count failing, same lint errors — the agent is stuck, and ten more turns won't help. Stop on stall, not just on success or on the cap.&lt;/p&gt;
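
&lt;p&gt;The three limits can be sketched as one loop with four exits. This is a minimal Python sketch, not any particular tool's implementation: &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;check&lt;/code&gt; are hypothetical callables standing in for one agent turn (returning the tokens it used) and for the success signal (returning a failing-test count), and the default limits are illustrative.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    reason: str        # "goal_met", "cost_ceiling", "stalled", or "iteration_cap"
    iterations: int
    tokens_spent: int


def run_bounded_loop(
    step: Callable[[], int],     # hypothetical: runs one agent turn, returns tokens used
    check: Callable[[], int],    # hypothetical: returns failing-test count, 0 means done
    max_iterations: int = 20,
    max_tokens: int = 2_000_000,
    stall_window: int = 3,
) -> LoopResult:
    tokens_spent = 0
    history: list[int] = []
    for i in range(1, max_iterations + 1):
        tokens_spent += step()
        failures = check()
        if failures == 0:
            return LoopResult("goal_met", i, tokens_spent)
        if tokens_spent >= max_tokens:
            return LoopResult("cost_ceiling", i, tokens_spent)
        # Stall: the success signal has not moved across the last few passes.
        history.append(failures)
        recent = history[-stall_window:]
        if len(recent) == stall_window and len(set(recent)) == 1:
            return LoopResult("stalled", i, tokens_spent)
    return LoopResult("iteration_cap", max_iterations, tokens_spent)
```

&lt;p&gt;Each exit carries its own reason, so a run that ends on &lt;code&gt;stalled&lt;/code&gt; or &lt;code&gt;iteration_cap&lt;/code&gt; can be investigated as a failure rather than counted as spend.&lt;/p&gt;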

&lt;h2&gt;When a loop is worth it&lt;/h2&gt;

&lt;p&gt;Loops are worth it when verification is cheaper than the work and the task genuinely needs iteration. Fixing a failing test suite fits: running the tests is fast and certain, and the work is real trial and error. Migrating a few hundred files to a new API fits, if you can check each one mechanically.&lt;/p&gt;

&lt;p&gt;Loops are waste when the task is one-shot or has no honest pass/fail. Looping a model on "write a good post" doesn't produce a good post. It produces however many drafts your cap allows, and then you read all of them anyway — which is the human review you were trying to skip, now multiplied by the iteration count.&lt;/p&gt;

&lt;p&gt;This is the same lesson as enterprise AI more broadly. &lt;a href="https://www.mpt.solutions/enterprise-ai-has-an-80-failure-rate-the-models-arent-the-problem-what-is/" rel="noopener noreferrer"&gt;The models were rarely the reason projects failed&lt;/a&gt; — the discipline around them was. An agent loop concentrates that truth into a single design decision and then bills you for getting it wrong.&lt;/p&gt;

&lt;h2&gt;What this looks like in practice&lt;/h2&gt;

&lt;p&gt;Before I start an agent loop, I write down three things: the goal, the check that proves the goal, and the three limits. If I can't name the check, I stop and fix the goal first, because a loop without a check isn't ready to run. If the check costs as much as the task, I don't loop at all — I do it once and review it myself.&lt;/p&gt;

&lt;p&gt;None of this is exotic. It's the same discipline you'd apply to any process that runs without a human watching it: define done, bound the cost, detect when it's stuck. The agent is new. The engineering isn't.&lt;/p&gt;

</description>
      <category>agentengineering</category>
      <category>platformengineering</category>
      <category>aitooling</category>
      <category>developertools</category>
    </item>
    <item>
      <title>On-Prem AI Is a Bet on Control</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Sun, 14 Jun 2026 16:30:08 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/on-prem-ai-is-a-bet-on-control-317c</link>
      <guid>https://dev.to/michaeltuszynski/on-prem-ai-is-a-bet-on-control-317c</guid>
      <description>&lt;p&gt;Dell booked $12.3 billion in AI server orders in a single quarter — $30 billion year to date, with an $18.4 billion backlog and a five-quarter pipeline the company says is multiples of that, per &lt;a href="https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-delivers-third-quarter-fiscal-2026-financial-results.htm" rel="noopener noreferrer"&gt;Dell's Q3 FY2026 results&lt;/a&gt;. Full-year AI shipment guidance is roughly $25 billion, up over 150 percent year over year. That is not a vendor hedging against an on-prem niche. That is a hardware company bulking up for AI workloads moving back into buildings the customer owns.&lt;/p&gt;

&lt;p&gt;The easy read is cost: GPUs got cheap enough to own instead of rent. The math exists — Dell claims break-even against public-cloud API pricing in as little as three months — but that figure is footnoted vendor arithmetic with a stack of utilization assumptions under it. The real driver is control. Data residency, predictable latency, and a roadmap that is not gated by someone else's capacity queue. The operators winning this shift are the ones who can run the same serving stack in their own racks and in a cloud without rewriting it.&lt;/p&gt;

&lt;h2&gt;The Shift Is Bigger Than One Earnings Call&lt;/h2&gt;

&lt;p&gt;Dell's own survey, presented at Dell Technologies World in May, found that &lt;a href="https://www.nextplatform.com/compute/2026/05/19/dell-bulks-up-hardware-as-ai-infrastructure-shifts-to-on-premises/5242811" rel="noopener noreferrer"&gt;67 percent of AI workloads already run outside the public cloud&lt;/a&gt; — on-premises, on devices, at the edge, or in colocation — and 88 percent of surveyed organizations run at least one AI workload on-premises. A survey from the vendor selling the conclusion deserves a discount. The order book is harder to discount: the backlog mix spans neocloud, sovereign, and enterprise customers, and sovereigns buy racks for exactly one reason. They cannot wait in someone else's queue, and they cannot let the workload leave the jurisdiction.&lt;/p&gt;

&lt;p&gt;Michael Dell said the operative part out loud in his keynote: "The risk is not the cloud. The risk is losing control of your data, your cost, your security, your intellectual property, and your speed." Read that list again. Cost is one item out of five, and it is not first. The vendor whose entire business is selling on-prem iron is pitching control, not unit economics. When the pitch and the spreadsheet diverge, the pitch tells you what is closing deals.&lt;/p&gt;

&lt;h2&gt;What Control Buys, Concretely&lt;/h2&gt;

&lt;p&gt;Eli Lilly is the cleanest case. &lt;a href="https://blogs.nvidia.com/blog/lilly-ai-factory-nvidia-blackwell-dgx-superpod/" rel="noopener noreferrer"&gt;LillyPod&lt;/a&gt; is the largest AI factory wholly owned and operated by a pharmaceutical company — the first NVIDIA DGX SuperPOD built on DGX B300 systems, with 1,016 Blackwell Ultra GPUs delivering over 9,000 petaflops. The justification fits in one sentence: the models train on $1 billion worth of Lilly's proprietary drug-discovery data. That data is the company. Lilly was never going to park it in a multi-tenant region and hope the contract language held. The deployment is part of a $50 billion US manufacturing and R&amp;amp;D commitment, which tells you how Lilly categorizes it — not as IT spend, but as production capacity.&lt;/p&gt;

&lt;p&gt;The second thing control buys is latency you can plan around. DDN's enterprise AI infrastructure guide, written with NVIDIA, names the cloud failure mode precisely: fragmented per-line-of-business cloud subscriptions produce &lt;a href="https://www.ddn.com/resources/research/guide-to-enterprise-ai-infrastructure/" rel="noopener noreferrer"&gt;"significant subscription costs, substantial data transit costs... introduced latencies"&lt;/a&gt; and, the line that matters most for production inference, a lack of performance SLAs. An inference service feeding a manufacturing line or a clinical workflow needs a latency budget somebody actually owns. In your own racks, the distance between the data and the GPU is a cable you bought.&lt;/p&gt;

&lt;p&gt;The third is the capacity queue. Rent your compute and your roadmap inherits your provider's allocation decisions. The hottest GPUs go to the biggest committed spenders, and a mid-size enterprise's Q3 launch waits behind a hyperscaler's internal training run. Owning the floor does not make capacity infinite. It makes capacity yours, on a depreciation schedule you control instead of an allocation email you wait for.&lt;/p&gt;

&lt;h2&gt;Portability Is the Actual Discipline&lt;/h2&gt;

&lt;p&gt;The most interesting announcement out of Dell Technologies World was not a box. It was &lt;a href="https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2026~05~dell-technologies-closes-the-gap-between-ai-ambition-and-ai-outcomes.htm" rel="noopener noreferrer"&gt;Gemini 3 Flash running on Google Distributed Cloud atop Dell PowerEdge XE9780 servers&lt;/a&gt;, inside a confidential-computing envelope built for data protection, residency, and sovereignty requirements. The same model, the same serving stack, running in a hyperscaler's cloud and in your own datacenter. Palantir's Foundry and AIP are coming on-prem through the same program. Reflection's open-weight frontier models too. Dell Enterprise Hub on Hugging Face ships DeepSeek-V4, Kimi K2.6, and GLM 5.1 optimized for the same iron, deployed where the data lives.&lt;/p&gt;

&lt;p&gt;That is the part of this story most coverage skips. The on-prem bet only pays if the stack is portable. An enterprise that builds a bespoke serving layer welded to its own racks has traded one lock-in for another — it has just moved the lock-in into its own building. The operators getting this right treat their racks as one more deployment target for a stack that also runs in a cloud region: same model artifacts, same orchestration, same observability, different floor.&lt;/p&gt;

&lt;p&gt;This is the same discipline frontier scale demands. Labs serving models across heterogeneous fleets cannot afford a rewrite per environment, so the serving layer abstracts the floor it runs on. Enterprises pulling workloads back on-prem are converging on the identical requirement from the opposite direction. Hybrid is not a compromise position between cloud and on-prem. Hybrid is the engineering standard, and on-prem is one target it compiles to.&lt;/p&gt;

&lt;h2&gt;The Caveats That Keep This Honest&lt;/h2&gt;

&lt;p&gt;Dell sells servers. DDN sells storage. NVIDIA sells everything underneath both. Every vendor quoted here profits from the conclusion, and the three-month break-even claim should be treated as marketing until your own utilization numbers reproduce it. A GPU you own earns its keep only when it is busy; a half-idle cluster loses to an API on cost every time, and plenty of enterprises will buy racks for workloads that never fill them.&lt;/p&gt;

&lt;p&gt;The non-circular evidence is narrower but solid: a named pharmaceutical company chose to own and operate a thousand-GPU factory because of what its data is worth, and a hardware vendor's filing-grade order numbers show tens of billions in demand from customers making the same call. Neither datapoint depends on survey framing or TCO calculators.&lt;/p&gt;

&lt;p&gt;The cost story will keep getting the headlines because it is easy to chart. But watch what the buyers say when they explain themselves. Lilly talks about its data. Sovereigns talk about jurisdiction. Michael Dell, given a keynote stage and every incentive to talk about price, named five things a buyer stands to lose control of, and cost was only one of them. The workloads moving back on-prem are the ones where the data, the latency budget, or the timeline is too valuable to put in someone else's queue. Owning the racks is how you hold those variables. Keeping the stack portable is how you avoid building a new cage out of your own concrete.&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>onpremai</category>
      <category>enterpriseai</category>
      <category>datasovereignty</category>
    </item>
    <item>
      <title>Loading Tool Schemas on Demand Is How Agents Scale</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 12 Jun 2026 16:30:06 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/loading-tool-schemas-on-demand-is-how-agents-scale-363o</link>
      <guid>https://dev.to/michaeltuszynski/loading-tool-schemas-on-demand-is-how-agents-scale-363o</guid>
      <description>&lt;p&gt;An agent connected to five MCP servers can burn 55,000 tokens on tool schemas before it reads the first word of your request. GitHub's server ships 35 tools at roughly 26K tokens. Slack adds 11 more at about 21K. Sentry, Grafana, and Splunk tack on a few thousand each. Add Jira — about 17K on its own — and you are pushing 100K tokens of pure definitions. Anthropic measured &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;134K tokens of tool definitions&lt;/a&gt; in its own internal setup before it optimized anything. None of that is work. It is the bill for capabilities the agent might use, paid in full on every request.&lt;/p&gt;

&lt;p&gt;That is importing your whole dependency tree to call one function. No engineer would defend that pattern in code. We have lazy loading, tree shaking, and dynamic imports precisely because paying for everything up front does not scale. Agents need the same move, and as of late 2025 the platform layer supports it: schemas load on demand, when the work actually calls for them.&lt;/p&gt;

&lt;h2&gt;The Upfront Bill Has Two Line Items&lt;/h2&gt;

&lt;p&gt;The token cost is the visible half. The hidden half is accuracy. The &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool" rel="noopener noreferrer"&gt;Claude API documentation&lt;/a&gt; is blunt about it: tool selection accuracy degrades significantly once an agent has more than 30 to 50 tools in front of it. Put two tools named &lt;code&gt;notification-send-user&lt;/code&gt; and &lt;code&gt;notification-send-channel&lt;/code&gt; in a pile of 58 definitions and you have a recipe for the model grabbing the wrong one. The context is not just expensive. It is noisy, and the noise makes the model worse at its job.&lt;/p&gt;

&lt;p&gt;So the upfront-loading pattern costs you twice. You pay tokens for schemas you will not use this turn, and you pay accuracy because the schemas you will use are buried in the ones you will not.&lt;/p&gt;

&lt;h2&gt;Deferred Loading Flips the Default&lt;/h2&gt;

&lt;p&gt;Anthropic shipped the Tool Search Tool and &lt;code&gt;defer_loading&lt;/code&gt; on the Claude Developer Platform in November 2025. The mechanics are simple. Tools marked &lt;code&gt;defer_loading: true&lt;/code&gt; stay out of the context window entirely. The model gets one small search tool — about 500 tokens — and when the task needs a capability, it searches, the matching schemas load, and work proceeds. A typical request discovers three to five tools, around 3K tokens. The full library stays reachable. Almost none of it is resident.&lt;/p&gt;

&lt;p&gt;The before-and-after numbers make the case. A 58-tool setup that consumed roughly 77K tokens of context before any work now consumes about 8.7K — the agent keeps roughly 95% of its context window for the actual task. Deferred tools also stay out of the initial prompt, so prompt caching survives instead of getting invalidated every time the tool roster changes.&lt;/p&gt;
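
&lt;p&gt;The resident-versus-reachable distinction can be modeled in a few lines. The sketch below is a toy simulation, not the Claude API: &lt;code&gt;ToolDef&lt;/code&gt;, &lt;code&gt;ToolRegistry&lt;/code&gt;, the keyword matching, and the per-tool token figures are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDef:
    name: str
    description: str
    schema_tokens: int            # assumed, pre-measured size of the full schema
    defer_loading: bool = True


class ToolRegistry:
    """Keeps deferred schemas out of the context until a search asks for them."""

    def __init__(self, tools: list[ToolDef]) -> None:
        self._tools = {t.name: t for t in tools}
        self._loaded = {t.name for t in tools if not t.defer_loading}

    def search(self, query: str, limit: int = 5) -> list[str]:
        """Naive keyword match; the schemas of returned tools become resident."""
        words = query.lower().split()
        hits = [
            t.name for t in self._tools.values()
            if any(w in t.name.lower() or w in t.description.lower() for w in words)
        ][:limit]
        self._loaded.update(hits)
        return hits

    def resident_tokens(self) -> int:
        return sum(self._tools[name].schema_tokens for name in self._loaded)


# Hypothetical tool library; only the search tool is loaded up front.
registry = ToolRegistry([
    ToolDef("tool_search", "Find tools by keyword", 500, defer_loading=False),
    ToolDef("github_create_pull_request", "Open a pull request", 800),
    ToolDef("slack_post_message", "Post a message to a channel", 700),
    ToolDef("jira_create_issue", "Create an issue in a project", 900),
])
print(registry.resident_tokens())        # 500
print(registry.search("pull request"))   # ['github_create_pull_request']
print(registry.resident_tokens())        # 1300
```

&lt;p&gt;The whole library stays reachable through &lt;code&gt;search&lt;/code&gt;, but the context only pays for what a search has actually returned.&lt;/p&gt;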

&lt;p&gt;The part worth dwelling on is what happened to accuracy. On MCP evaluations with large tool libraries, Opus 4 went from 49% to 74% with tool search enabled. Opus 4.5 went from 79.5% to 88.1%. Trimming the working set did not cost capability. It removed distraction. A model choosing between four relevant tools outperforms the same model choosing between 58 mostly irrelevant ones.&lt;/p&gt;

&lt;h2&gt;The Extreme Case Is Reading Tool Definitions Like Files&lt;/h2&gt;

&lt;p&gt;Anthropic's &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;code execution with MCP&lt;/a&gt; work pushes the principle to its limit. Instead of presenting tools as schemas at all, the agent gets a filesystem: each server is a directory, each tool a file, and the agent reads the definition only when it needs that tool. A workflow that cost 150,000 tokens dropped to 2,000 — a 98.7% reduction.&lt;/p&gt;

&lt;p&gt;The same post names the other half of the context tax: intermediate results. A two-hour meeting transcript fetched by one tool and passed to another is roughly 50K tokens moving through the model twice, even when the model never needed to see it. On-demand discovery for schemas and code-level handling for intermediate data attack the same problem from both ends. The model's context holds what the model needs to reason about, and nothing else rides along.&lt;/p&gt;

&lt;h2&gt;Skills Apply the Same Principle to Procedure&lt;/h2&gt;

&lt;p&gt;Tool schemas are not the only capability payload. Procedures are too — the how-to knowledge for filling out a brand template, processing a spreadsheet, following a deploy checklist. &lt;a href="https://www.anthropic.com/news/skills" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;, shipped October 2025, package that knowledge as folders of instructions, scripts, and resources that the model loads when needed, in the same format across Claude apps, Claude Code, and the API.&lt;/p&gt;

&lt;p&gt;The loading model mirrors deferred tools exactly. Anthropic's &lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;engineering write-up&lt;/a&gt; names the design principle: progressive disclosure. At startup, only each skill's name and description enter the system prompt — tens of tokens per skill. The full instruction body loads when the task matches. Bundled files load only if the instructions reference them. A library of fifty skills costs you a table of contents, not fifty manuals.&lt;/p&gt;
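
&lt;p&gt;Progressive disclosure is easy to picture as file reads. A rough sketch, assuming each skill is a folder holding a &lt;code&gt;SKILL.md&lt;/code&gt; whose header between two &lt;code&gt;---&lt;/code&gt; lines carries a name and a description, and assuming the folder name matches the skill name; the helper functions are hypothetical and the header parsing is deliberately naive.&lt;/p&gt;

```python
from pathlib import Path


def read_front_matter(skill_file: Path) -> dict[str, str]:
    """Read only the 'key: value' header between the first two '---' lines."""
    meta: dict[str, str] = {}
    with skill_file.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break  # stop before the instruction body is ever read
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def build_index(skills_dir: Path) -> dict[str, str]:
    """Startup cost: one name and one description per skill, nothing more."""
    index: dict[str, str] = {}
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_front_matter(skill_file)
        if "name" in meta:
            index[meta["name"]] = meta.get("description", "")
    return index


def load_body(skills_dir: Path, name: str) -> str:
    """Full instructions are read only when a task matches the skill."""
    text = (skills_dir / name / "SKILL.md").read_text(encoding="utf-8")
    return text.split("---", 2)[2].strip()
```

&lt;p&gt;&lt;code&gt;build_index&lt;/code&gt; is the table of contents; &lt;code&gt;load_body&lt;/code&gt; is the manual, opened one skill at a time.&lt;/p&gt;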

&lt;p&gt;Schemas on demand cover what the agent can call. Skills on demand cover what the agent knows how to do. Together they mean an agent's capability surface can grow without its per-request context bill growing with it. That decoupling is the scaling property. Capability count and context cost used to be the same axis. Now they are two.&lt;/p&gt;

&lt;h2&gt;When Upfront Loading Is Still Fine&lt;/h2&gt;

&lt;p&gt;Honest caveat: deferred loading adds a discovery round-trip. The agent has to search before it can call, and that costs latency. Anthropic's own guidance is that tool search pays off when definitions exceed roughly 10K tokens or when you see the model picking wrong tools. Below that, load everything up front and skip the indirection. A five-tool agent does not need a search layer any more than a 200-line script needs a plugin architecture.&lt;/p&gt;

&lt;p&gt;The pattern matters when the library is large and the per-task working set is small. That describes most serious agent deployments — and the gap between library size and working-set size only widens as teams connect more systems.&lt;/p&gt;

&lt;h2&gt;This Is a Platform Contract, Not a Trick&lt;/h2&gt;

&lt;p&gt;The platform engineering field spent 2025 asking what changed and closed the year asking what agents need from infrastructure. The &lt;a href="https://platformengineering.org/events/platform-engineering-in-2025-what-changed-ai-and-the-future-of-platforms-2025-12-09" rel="noopener noreferrer"&gt;year-in-review discussion&lt;/a&gt; put agentic developer platforms at the top of the 2026 agenda, with the open question being which interfaces and what infrastructure agents actually require. One panelist's warning fits here: an agentic platform still needs a product mindset — AI is not going to save the day on its own.&lt;/p&gt;

&lt;p&gt;On-demand capability loading is one concrete answer to that question. The contract between an agent and its platform should look like the contract between a program and its module system: declare what exists cheaply, resolve what is needed lazily, and never charge the running process for the parts of the library it did not touch. Teams wiring up agents today should treat the 30-to-50-tool threshold as a design constraint, mark everything past the core set as deferred, and ship procedures as skills with one-line descriptions. The agents that scale will be the ones that pay for context the way good code pays for dependencies: only when called.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>platformengineering</category>
      <category>mcp</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>Org-Wide Dependency Visibility Is Integrated Work</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 10 Jun 2026 23:30:18 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/org-wide-dependency-visibility-is-integrated-work-39ih</link>
      <guid>https://dev.to/michaeltuszynski/org-wide-dependency-visibility-is-integrated-work-39ih</guid>
      <description>&lt;p&gt;Every large engineering org eventually wants the same thing: one view of every dependency across hundreds of repositories. Who uses lodash 4.17.20. Which services still pull in the bad Log4j. What breaks if the shared client library gets bumped. The market has answered that demand the same way for a decade — another scanning tool, another dashboard, another seat license. AI just collapsed the price of that answer to roughly zero. What it did not collapse is the judgment about which dependency actually matters to the business, and that distinction should change how you staff.&lt;/p&gt;

&lt;h2&gt;The Intern Test&lt;/h2&gt;

&lt;p&gt;A few weeks ago &lt;a href="https://www.reddit.com/r/devops/comments/1tinjem/existing_toolsarchitectures_for_orgwide/" rel="noopener noreferrer"&gt;a thread on r/devops&lt;/a&gt; laid the problem out in its purest form: an org with hundreds of repos, an intern, and eight weeks to proof-of-concept org-wide dependency visibility. The question was which existing tools to start from. The replies named the usual suspects — Backstage, Dependency-Track, OWASP Dependency-Check, SBOM aggregation, Azure DevOps Advanced Security.&lt;/p&gt;

&lt;p&gt;The most-repeated advice was not a tool recommendation. It was a warning: don't build a platform around this problem, because the maintenance and curation overhead becomes the real project. One commenter described the graveyard exactly — great metadata systems that slowly drift because ownership and update discipline become unclear.&lt;/p&gt;

&lt;p&gt;That thread is the whole thesis in miniature. The scanning and graph-building layer is now intern-plus-AI work. An agent can walk hundreds of repos, parse the manifests and lockfiles, and emit a dependency graph in an afternoon. The thing that decays — the thing every reply warned about — is curation. Knowing which entries in the graph are load-bearing. That was never in the tool, and it still isn't.&lt;/p&gt;
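
&lt;p&gt;The inventory half of that work really is small. A minimal sketch, assuming every repository is already checked out as a directory under one root and declares its dependencies in a top-level &lt;code&gt;package.json&lt;/code&gt;; lockfiles, other ecosystems, and transitive dependencies are left out, and &lt;code&gt;build_reverse_index&lt;/code&gt; is a hypothetical helper, not an existing tool.&lt;/p&gt;

```python
import json
from collections import defaultdict
from pathlib import Path


def build_reverse_index(root: Path) -> dict[str, list[tuple[str, str]]]:
    """Map each package name to the (repo, version spec) pairs that declare it."""
    index: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for manifest in sorted(root.glob("*/package.json")):
        data = json.loads(manifest.read_text(encoding="utf-8"))
        for section in ("dependencies", "devDependencies"):
            for name, spec in data.get(section, {}).items():
                # The repo is identified by its checkout directory name.
                index[name].append((manifest.parent.name, spec))
    return dict(index)
```

&lt;p&gt;The result answers "who uses lodash" with a dictionary lookup. It says nothing about which of those entries is load-bearing.&lt;/p&gt;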

&lt;h2&gt;Visibility Was Never the Bottleneck&lt;/h2&gt;

&lt;p&gt;If scanning solved the problem, Log4Shell would have been closed out in a quarter. The entire industry pointed every scanner it owned at Log4j in December 2021. &lt;a href="https://www.veracode.com/blog/state-of-log4j-vulnerabilities-how-much-did-log4shell-change/" rel="noopener noreferrer"&gt;Two years later, Veracode found&lt;/a&gt; that 38% of 38,278 applications across 3,866 orgs were still running vulnerable Log4j versions. Worse, 32% sat on Log4j 1.2.x — end-of-life since August 2015.&lt;/p&gt;

&lt;p&gt;The underlying behavior explains it: 79% of the time, developers never update a third-party library after adding it to a codebase. And the single biggest predictor of slow remediation was not scanner coverage. Developers who lacked context on how a vulnerable library related to their application took more than seven months to fix half their backlog. Teams with that context fixed theirs in weeks.&lt;/p&gt;

&lt;p&gt;Read that gap again. Seven-plus months versus weeks, and the variable was not tooling. It was integrated knowledge of the system — which call paths touch the library, what the upgrade breaks, who owns the blast radius. Every org in that dataset had visibility. The ones that remediated fast had people who understood what they were looking at.&lt;/p&gt;

&lt;h2&gt;The Scan Got Free and Less Trustworthy at the Same Time&lt;/h2&gt;

&lt;p&gt;AI agents now build dependency graphs for free — and they also pollute them. &lt;a href="https://www.endorlabs.com/lp/state-of-dependency-management-2025" rel="noopener noreferrer"&gt;Endor Labs' State of Dependency Management 2025&lt;/a&gt; analyzed 10,663 MCP server repositories plus AI dependency recommendations across PyPI, npm, Maven, and NuGet. The findings: AI coding agents import vulnerable and hallucinated dependencies at scale, 75% of MCP servers are built by individuals, 41% carry no license at all, and 82% touch sensitive APIs. Endor's recommendation is blunt — treat AI-generated code as untrusted third-party input.&lt;/p&gt;

&lt;p&gt;The people using these tools already sense it. The &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025/" rel="noopener noreferrer"&gt;2025 DORA report&lt;/a&gt; puts AI adoption among software professionals at 90%, up 14 points year over year, with a median of two hours a day. More than 80% report productivity gains, yet 30% trust the output "a little" or not at all. DORA's own framing: AI is "a supportive tool... rather than a full substitute for human judgment."&lt;/p&gt;

&lt;p&gt;So the economics flipped in both directions at once. The access-rent layer — paying a vendor to tell you what is in your own repositories — is collapsing, because an agent does the inventory for free. The judgment layer got more valuable at the same moment, because the same agents that build the graph inject noise into it, and somebody has to know which edges are real and which ones matter.&lt;/p&gt;

&lt;h2&gt;The Upgrade Nobody Documented&lt;/h2&gt;

&lt;p&gt;A concrete instance from my own infrastructure. I run a set of production cron jobs on a Mac mini that depend on better-sqlite3, a native Node module compiled against a particular runtime ABI. A routine Homebrew update moved the default node from v22 to v26. Every job that touched the database started crash-looping, silently, until I noticed the output had stopped.&lt;/p&gt;

&lt;p&gt;No scanner flags that. The dependency graph was accurate the whole time — better-sqlite3 was right there in the manifest, version pinned, zero CVEs. What broke was a relationship between a native module's ABI and a runtime version that no manifest expresses. The fix took five minutes because I knew the system. Someone without that context loses a day to it, and the scanner contributes zero useful minutes either way.&lt;/p&gt;

&lt;p&gt;Now scale that to a 400-repo org and multiply by every undocumented coupling: the service that pins an old client because the new one changed retry semantics, the EOL logging library that looks scary but is unreachable from any network input, the "low" severity finding that sits directly on the auth path and is the actual emergency.&lt;/p&gt;

&lt;h2&gt;Staff for the Judgment, Not the Scan&lt;/h2&gt;

&lt;p&gt;Platform engineering spent 2025 &lt;a href="https://platformengineering.org/events/platform-engineering-in-2025-what-changed-ai-and-the-future-of-platforms-2025-12-09" rel="noopener noreferrer"&gt;taking stock of what AI changed&lt;/a&gt; about its own discipline. For dependency work, the honest ledger looks like this.&lt;/p&gt;

&lt;p&gt;AI commoditized: repo crawling, manifest parsing, SBOM generation, graph assembly, CVE matching, even drafting the upgrade PR.&lt;/p&gt;

&lt;p&gt;AI did not commoditize: knowing which transitive dependency carries business risk, which upgrade breaks the thing nobody wrote down, and which finding deserves this week's attention. That is integrated expertise — it accumulates in people who have lived inside the system, and it does not transfer through a dashboard.&lt;/p&gt;

&lt;p&gt;Three practical moves follow. First, stop funding the visibility layer as if it were the hard part; an agent plus an intern can stand it up, exactly as that r/devops thread suspected. Second, assign senior, named owners to the graph's curation — the drift the commenters warned about is an ownership failure, not a tooling gap. Third, measure context instead of coverage: the Veracode data says the team that understands how a library relates to its application remediates an order of magnitude faster than the team that merely detects it.&lt;/p&gt;

&lt;p&gt;The single pane of glass is finally cheap. It was never the product. The reading of it is, and the people who can read it are the ones worth paying for.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>softwaresupplychain</category>
      <category>dependencymanagement</category>
      <category>aiengineering</category>
    </item>
    <item>
      <title>Telemetry Is the Context Your Coding Agent Is Missing</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:56:45 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/telemetry-is-the-context-your-coding-agent-is-missing-2o1d</link>
      <guid>https://dev.to/michaeltuszynski/telemetry-is-the-context-your-coding-agent-is-missing-2o1d</guid>
      <description>&lt;p&gt;&lt;a href="https://www.reddit.com/r/kubernetes/comments/1tiu0ig/anyone_using_telemetry_data_in_tandem_with_ai/" rel="noopener noreferrer"&gt;The question being asked in r/kubernetes&lt;/a&gt; this week is whether to feed observability data to coding agents. Traces, metrics, logs, the whole stream the system already emits — should the agent that writes and fixes your code get to read it? The answer is yes. The agent that can see what production actually does will fix the right thing more often than the agent staring at a static repo. But the answer comes with a condition that the question usually skips: the telemetry has to be shaped by someone who runs the system before it is worth anything to the agent.&lt;/p&gt;

&lt;p&gt;Raw telemetry is not context. It is access-rent input. A trace tells you a request took 800 milliseconds and touched nine services. It does not tell you whether 800 milliseconds is normal for that path or a five-alarm regression. A spike in a metric is a number going up. Whether the number going up matters depends on what the baseline is, what the team has agreed to care about, and what the system was doing last Tuesday when the same shape showed up and turned out to be a batch job. None of that lives in the data. It lives in the heads of the people on call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Reads the Stream, Not the Meaning
&lt;/h2&gt;

&lt;p&gt;The direction of travel here is clear. AI development environments are turning into &lt;a href="https://arxiv.org/abs/2506.11019" rel="noopener noreferrer"&gt;observability-first platforms that wire real-time telemetry, prompt traces, and evaluation feedback into the developer workflow&lt;/a&gt;, connected to the editor over MCP. The plumbing is getting solved. You can hand an agent a live feed of what your system is doing, and the agent can read it. That is a real capability and it is arriving fast.&lt;/p&gt;

&lt;p&gt;The plumbing is not the hard part. The hard part is that the agent reads the stream and not the meaning. A coding agent given raw traces will treat every anomaly as equally interesting because it has no way to know which ones the team has already triaged, which ones are load-bearing, and which ones are noise the on-call rotation learned to ignore eight months ago. Feed it everything and you get an agent that is confidently wrong about what is broken.&lt;/p&gt;

&lt;p&gt;This is the same failure mode the observability vendors are already warning about at the protocol layer. Honeycomb's engineering team points out that piping raw context to an agent backfires: &lt;a href="https://www.honeycomb.io/blog/what-s-special-about-mcp" rel="noopener noreferrer"&gt;each MCP server the user configures clogs up the context with instructions and tool definitions, whether the agent needs them for this conversation or not&lt;/a&gt;. More telemetry in the context window is not more signal. Past a point it is less, because the agent spends its attention budget on data nobody decided was relevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Healthy Baseline Is a Decision, Not a Reading
&lt;/h2&gt;

&lt;p&gt;So which signals should the agent read? That is not a question the platform answers. It is a question an operator answers.&lt;/p&gt;

&lt;p&gt;Consider what it takes to say "this is healthy." Specifically: the p99 latency on the checkout path should sit under 300 milliseconds — except during the nightly reindex, when 900 is fine and expected. Error rate on the auth service above 0.5% is a page; the same rate on the recommendations service is a shrug because it fails open. The queue depth metric that the dashboard renders in red is actually the metric the team stopped trusting after the last migration, and the real signal moved to a different counter that the dashboard does not even show. Every one of those is a judgment call. Every one of them was made by someone who got paged, traced the incident, and decided what the line should be.&lt;/p&gt;
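
&lt;p&gt;Those judgment calls can be written down in a form an agent can read. The sketch below is one hedged way to do it, not a standard: the signal names, the 01:00 to 02:00 reindex window, and the &lt;code&gt;Baseline&lt;/code&gt; and &lt;code&gt;Window&lt;/code&gt; classes are all hypothetical, and the window logic assumes the exception does not cross midnight.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class Window:
    start: time
    end: time
    limit: float  # ceiling that applies only inside this window

    def contains(self, at: time) -> bool:
        # Assumes the window does not cross midnight.
        return self.end > at >= self.start


@dataclass(frozen=True)
class Baseline:
    signal: str
    limit: float   # normal ceiling
    action: str    # what a breach means to the team: "page" or "ignore"
    exception: Optional[Window] = None

    def verdict(self, value: float, at: time) -> str:
        limit = self.limit
        if self.exception is not None and self.exception.contains(at):
            limit = self.exception.limit
        return self.action if value > limit else "ok"


BASELINES = {
    "checkout.p99_ms": Baseline(
        "checkout.p99_ms", 300, "page",
        Window(time(1, 0), time(2, 0), 900),  # nightly reindex (assumed time)
    ),
    "auth.error_rate": Baseline("auth.error_rate", 0.005, "page"),
    "recs.error_rate": Baseline("recs.error_rate", 0.005, "ignore"),  # fails open
}
```

&lt;p&gt;The same 800 millisecond reading is a page at 14:00 and unremarkable at 01:30, and the same 1% error rate is a page on auth and a shrug on recommendations. None of that is derivable from the stream; it is encoded because someone decided it.&lt;/p&gt;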

&lt;p&gt;That is why the telemetry standard an agent reads against has to be a praxis output. It comes from the people who run the system, written down from the actual experience of operating it — not handed down by a platform team that provisions the dashboards and never gets paged. The platform team can give you the stream. They cannot tell you that a spike in the third service is meaningless and a flutter in the first one means drop everything, because they have never been on the wrong end of either at 2 a.m.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shaping Role Is Underbuilt
&lt;/h2&gt;

&lt;p&gt;The infrastructure to do this is further along than the discipline to use it well. &lt;a href="https://platformengineering.org/events/platform-engineering-in-2025-what-changed-ai-and-the-future-of-platforms-2025-12-09" rel="noopener noreferrer"&gt;Nearly 90 percent of enterprises now run internal platforms&lt;/a&gt;, with observability sitting inside the dozen platform capabilities that are now table stakes, according to the field's 2025 year-end review drawing on the State of Platform Engineering and the 2025 DORA report. The same review found that AI adoption and platform maturity now drive each other, which means more agents reading more telemetry, soon.&lt;/p&gt;

&lt;p&gt;But only about a quarter of those organizations have a dedicated platform product manager. The platform exists. The role that decides what the platform's signals mean — the person who turns raw telemetry into a standard an agent can act on — is mostly not staffed. We are wiring agents into observability streams faster than we are deciding who owns the question of which signals matter. That gap is where the bad agent behavior is going to come from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standard Is Being Written by Practitioners
&lt;/h2&gt;

&lt;p&gt;There is a healthier version of this already taking shape, and it is being built by the people who operate the systems, not specified from above. OpenTelemetry's AI-agent observability work, &lt;a href="https://opentelemetry.io/blog/2025/ai-agent-observability/" rel="noopener noreferrer"&gt;driven by its GenAI special interest group&lt;/a&gt;, is an effort to standardize the shape of the telemetry that agent apps emit. The point is explicit in their own framing: for a non-deterministic agent, telemetry is a feedback loop used to evaluate and improve the agent's behavior, and the community needs to set standards around that telemetry's shape to avoid lock-in.&lt;/p&gt;

&lt;p&gt;That is the model. The standard for what an agent reads gets authored by practitioners — maintainers, on-call engineers, the people who feel the consequences — and it gets written down where the agent can use it. The research backs this up: the documented patterns for feeding telemetry to agents already include local prompt iteration, CI-based optimization, and autonomous agents that adapt using telemetry, built on named frameworks like DSPy and PromptWizard. This is not speculation about a future capability. The architecture exists. What is still missing on most teams is the human decision about what the agent should pay attention to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Actually Do
&lt;/h2&gt;

&lt;p&gt;If you are deciding whether to give your coding agent observability data, give it. The agent that sees production is better than the agent that does not. But do not hand it the raw firehose and call it context.&lt;/p&gt;

&lt;p&gt;Write down the baselines. Which paths are hot, what normal looks like on each, which alerts are real and which are theater. Name the anomalies that matter and the ones the team learned to ignore. Make that document the thing the agent reads against, not the unfiltered metric stream. And put it in the hands of someone who actually runs the system, because the value of telemetry to your agent is exactly equal to the quality of the judgment that shaped it. The data is access-rent. The standard is the asset.&lt;/p&gt;
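
&lt;p&gt;As a rough illustration of "the document the agent reads against," the sketch below filters raw anomalies through an operator-written ignore list before anything reaches the agent. The signal names, the reasons, and the &lt;code&gt;shape_for_agent&lt;/code&gt; function are assumptions made up for this example.&lt;/p&gt;

```python
import json

# Hypothetical operator-authored notes; in practice this would be a reviewed file.
KNOWN_NOISE = {
    "queue.depth": "untrusted since the last migration",
    "recs.error_rate": "service fails open",
}


def shape_for_agent(anomalies: list[dict], known_noise: dict[str, str]) -> str:
    """Split anomalies into ones worth investigating and ones the team ignores."""
    investigate = [a for a in anomalies if a["signal"] not in known_noise]
    suppressed = {
        a["signal"]: known_noise[a["signal"]]
        for a in anomalies
        if a["signal"] in known_noise
    }
    # Keep the suppressed signals and their reasons so the agent knows they
    # were seen and deliberately set aside, not missed.
    return json.dumps(
        {"investigate": investigate, "suppressed": suppressed}, indent=2
    )
```

&lt;p&gt;The output is what goes into the agent's context in place of the unfiltered stream. Keeping the reasons alongside the suppressed signals is a design choice: it lets the agent, and the next person on call, see why something was ruled out.&lt;/p&gt;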

</description>
      <category>platformengineering</category>
      <category>observability</category>
      <category>aiengineering</category>
      <category>developerproductivity</category>
    </item>
    <item>
      <title>Some Problems Are Too Big for One Context Window</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:37:57 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/some-problems-are-too-big-for-one-context-window-4cdj</link>
      <guid>https://dev.to/michaeltuszynski/some-problems-are-too-big-for-one-context-window-4cdj</guid>
      <description>&lt;p&gt;Internal platforms used to be the exotic option. Now they are the default. A December 2025 Platform Engineering survey of 518 organizations found that &lt;a href="https://platformengineering.org/events/platform-engineering-in-2025-what-changed-ai-and-the-future-of-platforms-2025-12-09" rel="noopener noreferrer"&gt;nearly 90% of enterprises run internal platforms&lt;/a&gt;, with a dedicated platform-product-manager role appearing in a quarter of them. The interesting part is not the adoption curve. It is what teams are starting to run on those platforms: not single AI calls, but coordinated pipelines of AI agents doing work that no single agent could finish.&lt;/p&gt;

&lt;p&gt;That shift has a forcing function, and it is mechanical, not philosophical. Some problems do not fit in one context window. When you hit that wall, the fix is not a better prompt. It is orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wall Is Capacity, Not Cleverness
&lt;/h2&gt;

&lt;p&gt;Here is the number that settles the argument. Anthropic ran a multi-agent system (an Opus 4 lead delegating to Sonnet 4 workers) against a single Opus 4 agent on their internal research eval. The multi-agent system &lt;a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" rel="noopener noreferrer"&gt;outperformed the single agent by 90.2%&lt;/a&gt;. The single agent's failure mode was concrete: asked to identify all the board members of the companies in the Information Technology sector of the S&amp;amp;P 500, it ground through companies one at a time and never finished. It was not dumb. It ran out of room.&lt;/p&gt;

&lt;p&gt;The same write-up makes the underlying mechanic explicit. Raw token usage explains 80% of the performance variance on their browsing benchmark. Subagents help by "operating in parallel with their own context windows," then condensing the most important tokens back to the lead. You are not making the model smarter. You are giving the problem more total working memory than one session holds.&lt;/p&gt;

&lt;p&gt;That is the whole thesis in one line. Comprehensiveness is a capacity problem. A task like "review these fifty files the same way" or "search this entire surface and miss nothing" does not fit in a single conversation. You need parallel contexts and a way to coordinate them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Orchestration Actually Buys You
&lt;/h2&gt;

&lt;p&gt;There are two distinct wins here, and people conflate them.&lt;/p&gt;

&lt;p&gt;The first is isolation. When you hand work to a subagent, it runs with &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;fresh context and a separate cache&lt;/a&gt;, and the verbose output — test logs, file dumps, intermediate reasoning — "stays in the subagent's context while only the relevant summary returns to your main conversation." This is why "review fifty files the same way" is an orchestration problem. You spawn a reviewer per file. Each one drowns in its own file's detail. Your main session only ever sees fifty clean summaries. The noise never touches the conversation that has to hold the final answer.&lt;/p&gt;

&lt;p&gt;The second win is determinism, and this is the part that turns a clever trick into infrastructure. A dynamic workflow runs &lt;a href="https://code.claude.com/docs/en/workflows" rel="noopener noreferrer"&gt;from a script Claude writes and you can rerun&lt;/a&gt;. The script — not the model, turn by turn — holds "the loop, the branching, and the intermediate results itself, so Claude's context holds only the final answer." The control flow lives in JavaScript with structured-output schemas. The model fills in the work; the code decides what runs next.&lt;/p&gt;

&lt;p&gt;That distinction matters because it makes the pipeline repeatable. Run it Monday on fifty files, run it Friday on a different fifty, and the loop behaves identically. You are not re-explaining the process in prose every time and hoping the model interprets it the same way twice. The process is code.&lt;/p&gt;
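
&lt;p&gt;The shape of such a script can be sketched in a few lines. The workflows described above are written in JavaScript; the version below uses Python purely for illustration, and &lt;code&gt;review_file&lt;/code&gt; is a hypothetical stand-in for whatever call starts a subagent. The part to notice is that the loop and the result schema live in code, and only &lt;code&gt;Summary&lt;/code&gt; objects come back.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """The structured result each reviewer must return; nothing else escapes."""
    path: str
    verdict: str  # "ok" or "needs-work"
    note: str


def review_file(path: str) -> Summary:
    """Hypothetical stand-in for a subagent call.

    A real version would start a fresh agent session here. Whatever verbose
    output that session produces stays inside this function.
    """
    verdict = "needs-work" if path.endswith("_legacy.py") else "ok"
    return Summary(path, verdict, f"reviewed {path}")


def review_all(paths: list[str], max_workers: int = 8) -> list[Summary]:
    """Fan out one reviewer per file; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(review_file, paths))
```

&lt;p&gt;Calling &lt;code&gt;review_all&lt;/code&gt; on Monday's fifty files and on Friday's fifty runs the same fan-out both times. What varies is the work inside each reviewer, not the process around it.&lt;/p&gt;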

&lt;h2&gt;
  
  
  Adversarial Verification Is the Killer Feature
&lt;/h2&gt;

&lt;p&gt;The workflows model lets you do something a single session cannot honestly do: have &lt;a href="https://code.claude.com/docs/en/workflows" rel="noopener noreferrer"&gt;independent agents adversarially review each other's findings before they are reported&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Think about why a single conversation is bad at checking its own work. It already produced the answer. It has every reason — mechanically, in how attention weights the tokens already on the page — to be consistent with what it just said. Asking it to find its own errors is asking it to argue against its own context.&lt;/p&gt;

&lt;p&gt;A separate agent has no such loyalty. It gets a fresh context, the prompt you pass, and one job: find the holes. The script wires the producer's structured output into the verifier's input, collects both, and only then surfaces a result. That is a real audit, not a model grading its own homework. For anything where being wrong is expensive — a security pass, a financial reconciliation, a research claim that is going to get quoted — that second adversarial context is the difference between a draft and something you would put your name on.&lt;/p&gt;
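
&lt;p&gt;The wiring itself is small. The sketch below assumes two callables, &lt;code&gt;produce&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt;, each standing in for a separate agent session; those names, and the &lt;code&gt;Claim&lt;/code&gt; and &lt;code&gt;Review&lt;/code&gt; types, are hypothetical and chosen only for this example.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Claim:
    statement: str
    evidence: str


@dataclass(frozen=True)
class Review:
    accepted: bool
    objection: str


def verified(
    task: str,
    produce: Callable[[str], list[Claim]],
    verify: Callable[[Claim], Review],
) -> list[Claim]:
    """Run the producer, pass each claim to an independent verifier,
    and surface only the claims that survive."""
    survivors = []
    for claim in produce(task):
        # The verifier receives the claim alone, never the producer's transcript.
        review = verify(claim)
        if review.accepted:
            survivors.append(claim)
    return survivors
```

&lt;p&gt;Because &lt;code&gt;verify&lt;/code&gt; only ever receives a &lt;code&gt;Claim&lt;/code&gt;, it has no access to the reasoning that produced it, which is what keeps the second opinion independent.&lt;/p&gt;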

&lt;h2&gt;
  
  
  Which Problems Are Just Prompts
&lt;/h2&gt;

&lt;p&gt;Here is where the skill lives, because orchestration is not free and reaching for it reflexively is its own failure.&lt;/p&gt;

&lt;p&gt;The same Anthropic analysis is blunt about the cost. Multi-agent systems &lt;a href="https://www.anthropic.com/engineering/built-multi-agent-research-system" rel="noopener noreferrer"&gt;burn about 15× the tokens of a chat&lt;/a&gt; — even a single agent runs roughly 4× a normal conversation. So the economics only work on "valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools." They explicitly call out where it does not fit: most coding tasks, where the subtasks are not really parallelizable and the agents have tight dependencies on each other's output. If step two needs step one's result, you do not have a fan-out. You have a sequence, and a sequence belongs in one context.&lt;/p&gt;

&lt;p&gt;The cautionary tale is theirs too. Early versions of their agents would "spawn 50 subagents for simple queries" — fan-out applied to a problem that never needed it, paying 15× for an answer one prompt would have produced. That is the tell. When you find yourself orchestrating something a single well-written prompt would have handled, you have mistaken a prompt for a pipeline.&lt;/p&gt;

&lt;p&gt;So the decision rule is small and concrete. Reach for orchestration when the work is wide (many independent items, fifty files, a whole search surface), when it exceeds what one context can hold, or when you genuinely need one agent to check another. Stay in a single session when the work is narrow, sequential, or dependency-heavy. The deterministic script does not make a small problem better. It makes a big problem possible.&lt;/p&gt;
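
&lt;p&gt;Written as code, the rule looks roughly like the sketch below. The &lt;code&gt;Task&lt;/code&gt; fields and the &lt;code&gt;wide_threshold&lt;/code&gt; of ten items are assumptions picked for illustration, not numbers from the Anthropic analysis; the multipliers are the rough 4× and 15× figures quoted above.&lt;/p&gt;

```python
from dataclasses import dataclass

# Rough token cost relative to a plain chat, as quoted above.
TOKEN_MULTIPLIER = {"single_session": 4, "orchestrate": 15}


@dataclass(frozen=True)
class Task:
    independent_items: int        # pieces that do not need each other's output
    fits_one_context: bool        # would everything fit in a single window?
    needs_independent_check: bool
    sequential: bool              # does each step need the previous result?


def choose_mode(task: Task, wide_threshold: int = 10) -> str:
    """Return "orchestrate" or "single_session" for a task."""
    # Many items only count as "wide" if they can actually run independently.
    wide = task.independent_items >= wide_threshold and not task.sequential
    if wide or not task.fits_one_context or task.needs_independent_check:
        return "orchestrate"
    return "single_session"


def relative_cost(task: Task) -> int:
    """Approximate token multiple versus a plain chat for the chosen mode."""
    return TOKEN_MULTIPLIER[choose_mode(task)]
```

&lt;p&gt;Fifty independent files come out as "orchestrate" at roughly 15×, while three dependent steps that fit in one window stay in a single session at roughly 4×. The threshold is a knob to tune, not a constant to trust.&lt;/p&gt;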

&lt;h2&gt;
  
  
  The Real Boundary
&lt;/h2&gt;

&lt;p&gt;The platform-engineering numbers tell you orchestration is becoming an infra-team concern rather than a novelty. But the adoption stat is not the point. The point is the boundary it forces people to learn.&lt;/p&gt;

&lt;p&gt;The valuable judgment is no longer "can I write a good prompt." It is "is this an orchestration problem or a prompt problem." Get that wrong in the expensive direction and you pay 15× for nothing. Get it wrong in the cheap direction and you stuff fifty files into one window and watch the model lose the thread halfway through.&lt;/p&gt;

&lt;p&gt;The script holds the loop. The subagents hold the parallel work. Your context holds only the answer. Knowing when that shape is worth building is the actual skill — and it is the one worth getting right, because the problems that need it are exactly the ones too big to fix any other way.&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>multiagent</category>
      <category>claudecode</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Renting Compute From Three Clouds Is the Default Now</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Mon, 08 Jun 2026 16:55:11 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/renting-compute-from-three-clouds-is-the-default-now-4gd4</link>
      <guid>https://dev.to/michaeltuszynski/renting-compute-from-three-clouds-is-the-default-now-4gd4</guid>
      <description>&lt;p&gt;The companies with the most control over chip supply on the planet still rent across three cloud providers. That is the fact that should reset how a platform team thinks about AI infrastructure. If a frontier lab with custom silicon deals and over a million of its own accelerators cannot single-source compute, the 200-person team running model-serving in production has no business betting on one provider either.&lt;/p&gt;

&lt;p&gt;Read the numbers from the lab itself. Anthropic states plainly that it runs Claude across three silicon families and three clouds at the same time: "We train and run Claude on a range of AI hardware — AWS Trainium, Google TPUs, and NVIDIA GPUs… Claude remains the only frontier AI model available to customers on all three of the world's largest cloud platforms: AWS (Bedrock), Google Cloud (Vertex AI), and Microsoft Azure (Foundry)." That is from &lt;a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" rel="noopener noreferrer"&gt;Anthropic's own partnership announcement&lt;/a&gt;. They do not frame it as insurance. They frame it as matching workloads to the chips best suited for them, which buys better performance and more resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Money Says This Is the Baseline, Not a Side Bet
&lt;/h2&gt;

&lt;p&gt;Hedging is small. What Anthropic is doing is not small.&lt;/p&gt;

&lt;p&gt;On the AWS side, the commitment runs &lt;a href="https://www.anthropic.com/news/anthropic-amazon-compute" rel="noopener noreferrer"&gt;over $100 billion and up to 5 gigawatts&lt;/a&gt; across a ten-year span. More than a million Trainium2 chips are already training and serving Claude through Project Rainier, and AWS is named the primary training and cloud provider. That spans Graviton CPUs and the Trainium2-through-Trainium4 custom silicon line.&lt;/p&gt;

&lt;p&gt;On the Azure side, &lt;a href="https://blogs.nvidia.com/blog/microsoft-nvidia-anthropic-announce-partnership/" rel="noopener noreferrer"&gt;Anthropic committed $30 billion in compute&lt;/a&gt; plus up to a gigawatt of NVIDIA Grace Blackwell and Vera Rubin capacity. In the same deal, Microsoft and NVIDIA are investing up to $5 billion and up to $10 billion in Anthropic, respectively. And there is a multi-gigawatt Google and Broadcom TPU buildout coming online in 2027 on top of that.&lt;/p&gt;

&lt;p&gt;Stack those up. Over $100 billion on AWS, $30 billion on Azure, multi-gigawatt on Google. A company does not spread that kind of capital across three vendors as a defensive crouch. It does it because that is what running serious AI workloads at scale actually requires. Anthropic's run-rate revenue passed $30 billion this year, up from roughly $9 billion at the end of 2025. They are diversifying providers while they scale, not because anyone is forcing their hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Silicon Layer Is Multi-Vendor Too
&lt;/h2&gt;

&lt;p&gt;It is tempting to read "multi-cloud" as a billing decision — three vendors, three invoices, one abstraction over commodity GPUs underneath. That is not what is happening here. The diversification goes all the way down to the chip.&lt;/p&gt;

&lt;p&gt;The hardware list is AWS Trainium2 through Trainium4 and Graviton, Google TPUs built with Broadcom, and NVIDIA Grace Blackwell and Vera Rubin. And the supplier set is still growing. Anthropic is now reportedly &lt;a href="https://www.techmeme.com/260521/p20#a260521p20" rel="noopener noreferrer"&gt;in talks to rent servers running on Microsoft-designed chips&lt;/a&gt;, with Azure usage rising since November 2025, per The Information. That is a fourth distinct silicon path entering the mix.&lt;/p&gt;

&lt;p&gt;Different chips have different strengths for different parts of the workload. Trainium is cost-efficient for large training runs. TPUs have their own profile for certain matrix shapes. NVIDIA's parts lead on raw flexibility and tooling maturity. Routing the right workload to the right silicon is an engineering decision with real performance and cost consequences, and it only works if your serving layer can target more than one backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for a 200-Person Platform Team
&lt;/h2&gt;

&lt;p&gt;The lesson transfers directly, and it cuts against a posture you still hear in platform-engineering circles: pick one cloud, go deep, standardize everything on its managed services, and treat portability as premature optimization. For most of the stack, that posture is defensible. The managed database, the queue, the object store — going deep on one provider there saves real time.&lt;/p&gt;

&lt;p&gt;The AI-serving layer is the exception, and the frontier labs just told you why. If the company with the most control over its own chip supply still cannot single-source compute or silicon, your model-serving layer cannot bet on a single backend either. The constraints that force diversification at the top — capacity availability, price per token, chip-to-workload fit, supply timing — show up at every scale below it. You will not get a million chips allocated, but you will hit GPU availability walls in a region, price changes on a managed inference endpoint, and a quota that does not move when you need it to.&lt;/p&gt;

&lt;p&gt;So treat portability of the serving layer as an architecture requirement, the same way you treat authentication or observability as a requirement. Concretely, that means a few things. Keep an inference abstraction between your application code and any single provider's SDK, so swapping the backend is a config change and not a rewrite. Avoid building hard dependencies on one vendor's proprietary serving features unless you have a deliberate reason and an exit plan. Keep your model weights and serving stack in a form you can stand up on more than one provider's accelerators. Run at least a smoke-test path on a second backend continuously, so "we could move" is a tested claim and not a hope.&lt;/p&gt;
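
&lt;p&gt;A minimal sketch of that abstraction follows, assuming a single &lt;code&gt;complete&lt;/code&gt; method is enough for the application. The &lt;code&gt;InferenceBackend&lt;/code&gt; protocol, the &lt;code&gt;EchoBackend&lt;/code&gt; stub, and the &lt;code&gt;inference_backend&lt;/code&gt; config key are all hypothetical; a real wrapper would call a provider SDK where the stub echoes its input.&lt;/p&gt;

```python
from typing import Callable, Protocol


class InferenceBackend(Protocol):
    """The only surface application code is allowed to depend on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Hypothetical stand-in for a wrapper around one provider's SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Factories keyed by the name that appears in config.
BACKENDS: dict[str, Callable[[], InferenceBackend]] = {
    "primary": lambda: EchoBackend("primary"),
    "secondary": lambda: EchoBackend("secondary"),
}


def backend_from_config(config: dict) -> InferenceBackend:
    """Swapping providers means changing one config value."""
    return BACKENDS[config["inference_backend"]]()


def smoke_test(backend: InferenceBackend) -> bool:
    """Cheap check that a backend still answers at all."""
    try:
        return bool(backend.complete("ping").strip())
    except Exception:
        return False
```

&lt;p&gt;Running &lt;code&gt;smoke_test&lt;/code&gt; against every entry in &lt;code&gt;BACKENDS&lt;/code&gt; on a schedule is what turns "we could move" into a tested claim. Application code only ever sees &lt;code&gt;InferenceBackend&lt;/code&gt;, so the provider-specific details stay behind the wrapper.&lt;/p&gt;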

&lt;p&gt;This is not a call to run everything everywhere all the time. Multi-cloud as a blanket strategy is expensive and usually a mistake. The point is narrower and load-bearing: the inference path is the one place where single-provider lock-in is now a standing liability, because the supply dynamics above you guarantee you will eventually need to move some of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Default Has Already Shifted
&lt;/h2&gt;

&lt;p&gt;A year ago, spreading inference across providers and chip families read like something only the largest labs could justify. The receipts say it is now the operating baseline for anyone running frontier models — stated in the lab's own words, backed by more than $130 billion in committed capacity across three clouds and four silicon paths.&lt;/p&gt;

&lt;p&gt;When the baseline at the top of the market moves, the architecture expectations below it move with it. Single-cloud AI strategy used to be the safe default. It is now the position you have to justify. Build the serving layer so the backend is a choice you keep making, not a decision you made once and cannot revisit.&lt;/p&gt;

</description>
      <category>cloudarchitecture</category>
      <category>platformengineering</category>
      <category>aiinfrastructure</category>
      <category>multicloud</category>
    </item>
    <item>
      <title>The Million-Token Context Window Changes What You Put In It</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 02 Jun 2026 16:30:05 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-million-token-context-window-changes-what-you-put-in-it-245p</link>
      <guid>https://dev.to/michaeltuszynski/the-million-token-context-window-changes-what-you-put-in-it-245p</guid>
      <description>&lt;p&gt;The 1M-token context window is here. Opus 4.8, 4.7, and 4.6, plus Sonnet 4.6, now carry the full million-token context on the Claude API, Amazon Bedrock, and Vertex AI — no surcharge, generally available. A single request can hold up to 600 images or PDF pages. The reflex, the second the number lands, is to point the tool at the whole repo and let it rip.&lt;/p&gt;

&lt;p&gt;That is the wrong move. The operators who get real value out of the bigger window treat it as a curated working set, not a junk drawer. What you load decides what the model reasons about, and a million tokens of noise still produces noisy output.&lt;/p&gt;

&lt;p&gt;The cleanest statement of this comes from the vendor's own documentation. Anthropic's docs say the 1M window's retrieval gains "depend on what's in context, not just how much fits" (&lt;a href="https://platform.claude.com/docs/en/build-with-claude/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt;). Read that sentence twice. The platform that just handed you a million tokens is telling you, in the same breath, that capacity and effective use are different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capacity Is Not the Same as Attention
&lt;/h2&gt;

&lt;p&gt;Here is the mistake the dump-the-repo reflex makes. It assumes the model reads a million tokens the way a database reads a million rows — uniformly, with equal fidelity from the first byte to the last. It does not.&lt;/p&gt;

&lt;p&gt;Chroma's "Context Rot" research evaluated 18 frontier models — Claude 4, Gemini 2.5, Qwen3, and a dozen others — and found that performance "grows increasingly unreliable as input length grows" (&lt;a href="https://www.trychroma.com/research/context-rot" rel="noopener noreferrer"&gt;Context Rot&lt;/a&gt;). Models do not process long context uniformly. They degrade, and they degrade "even on simple tasks." A task the model nails at 10K tokens gets less reliable at 800K, holding everything else constant.&lt;/p&gt;

&lt;p&gt;The type of filler matters too. Chroma found that locally-cancelling operations hurt more than neutral print statements, and topically-related distractors degrade answers non-uniformly. Translate that to a codebase: loading the wrong 800K does not just waste space. It actively poisons the model's reasoning over the right 200K. The half-relevant module, the deprecated helper, the three abandoned migration scripts that look load-bearing — those are not free passengers. They are distractors with a vote.&lt;/p&gt;

&lt;p&gt;So the question is never "will it fit." With a million tokens, almost everything fits. The question is "what does loading this do to the model's attention on the part that matters."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Smallest High-Signal Set
&lt;/h2&gt;

&lt;p&gt;Anthropic's applied AI team frames the discipline directly. Good context engineering means finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome" (&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;effective context engineering&lt;/a&gt;). Context is "a finite resource with diminishing marginal returns." Every token you load spends from a shared attention budget.&lt;/p&gt;

&lt;p&gt;That last point is the one to internalize. The attention budget is shared. The 50 lines of code that actually contain the bug compete for the model's focus with every other token in the window. Pad the window with the rest of the repo and you have not given the model more help. You have given the 50 lines more competition.&lt;/p&gt;

&lt;p&gt;This reframes the whole job. The operator's task is not "assemble everything that could conceivably be relevant." It is "curate the smallest set that makes the answer likely." Those are opposite instincts. The first is collection. The second is editing. The bigger window rewards editors and punishes collectors, because the collector's reflex scales straight into the rot.&lt;/p&gt;

&lt;p&gt;In practice the curated set looks like intent, not coverage. The file with the bug, its direct callers, the test that fails, the relevant interface, the one config that governs the behavior. Maybe 200K of the right tokens, assembled because you decided each one earns its place. Not a million tokens assembled because the window happened to be that big.&lt;/p&gt;
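
&lt;p&gt;The editing instinct can be sketched as a small selection routine: rank candidates by how central they are, then admit them until a deliberate budget is spent. This is an illustration under loud assumptions. The &lt;code&gt;Candidate&lt;/code&gt; type and priorities are hypothetical, and the four-characters-per-token estimate is a crude heuristic standing in for a real tokenizer.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    text: str
    priority: int  # 1 = the file with the bug; larger = less central


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about four characters per token); use a real tokenizer."""
    return max(1, len(text) // 4)


def curate(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Admit candidates in priority order until the token budget is spent."""
    chosen = []
    spent = 0
    for candidate in sorted(candidates, key=lambda c: c.priority):
        cost = estimate_tokens(candidate.text)
        if spent + cost > budget:
            continue  # skip what does not fit instead of truncating it
        chosen.append(candidate)
        spent += cost
    return chosen
```

&lt;p&gt;The budget passed to &lt;code&gt;curate&lt;/code&gt; is a decision, not the window size. Setting it well below the maximum is what keeps the low-priority bulk from competing with the lines that matter.&lt;/p&gt;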

&lt;h2&gt;
  
  
  What the Headroom Is Actually For
&lt;/h2&gt;

&lt;p&gt;So if the answer is "feed it the right 200K," what is the other 800K for? It is not for padding. It is for the cases that genuinely need it — the codebase-wide refactor that legitimately touches forty files, the migration that has to reason across a sprawling schema, the incident where the relevant signal really is distributed across a large surface. Those exist. The headroom is there to serve them when they show up, deliberately, not to be the default fill level for every request.&lt;/p&gt;

&lt;p&gt;Anthropic's tooling makes the "manage it, don't fill it" stance concrete. The platform ships server-side compaction, context editing that clears stale tool results and thinking blocks, and "context awareness" — models that track their own remaining token budget rather than guessing how many tokens remain (&lt;a href="https://platform.claude.com/docs/en/build-with-claude/context-windows" rel="noopener noreferrer"&gt;context windows&lt;/a&gt;). That is an architecture for spending a finite budget on purpose. None of it would make sense if the design intent were "load everything and let the window sort it out."&lt;/p&gt;

&lt;p&gt;The pattern across all three primary sources is the same. The vendor gives you a million tokens, builds the tooling to help you spend them carefully, and states in the docs that what you put in decides what you get out. The empirical research from outside the vendor confirms the failure mode the docs are warning about: more input, less reliability, even on simple tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Discipline the Bigger Window Demands
&lt;/h2&gt;

&lt;p&gt;The 1M window is a real capability gain. The forty-file refactor that used to require chunking and stitching can now happen in one pass. That is worth having. But the gain only materializes for operators who bring intent to what they load.&lt;/p&gt;

&lt;p&gt;Treat the window as a working set you curate. Feed it the right 200K because you decided each token earns its place. Reach for the headroom when the task genuinely spans a large surface, and reach for the compaction and context-editing tools to keep the working set clean as the session runs long. The discipline is the same one good engineers already apply to a code review: the question is not how much you can put in front of someone, it is what they actually need to see to make the call.&lt;/p&gt;

&lt;p&gt;The number on the box went up by 5x. The skill that turns it into output did not change. If anything, the bigger window raises the price of getting it wrong — a million tokens of noise is a much louder distraction than 200K of it ever was. Spend the budget like it is scarce, because for the model's attention, it still is.&lt;/p&gt;

</description>
      <category>aiengineering</category>
      <category>contextengineering</category>
      <category>llmops</category>
      <category>developerproductivity</category>
    </item>
    <item>
      <title>Absorption Rate Is a Praxis Problem</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Fri, 29 May 2026 16:30:06 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/absorption-rate-is-a-praxis-problem-1849</link>
      <guid>https://dev.to/michaeltuszynski/absorption-rate-is-a-praxis-problem-1849</guid>
      <description>&lt;p&gt;A LinkedIn post circulating this week made an observation worth sitting with. AI write-throughput in a serious engineering team can hit fifty pull requests a day. The team's merge-throughput is two. The author's own bug-finder product had hit the same wall on the customer side: initial deliveries of twenty findings per session, sometimes fifty, came back as a request for two a day. The findings were fine. The cost was elsewhere. Reviewing, validating, merging, and regression-watching cost hours per item, and once that is multiplied by every engineer on the team, the math falls apart fast. The bottleneck moved. The new ceiling is what the team can absorb.&lt;/p&gt;

&lt;p&gt;The diagnosis is right. The prescription the bug-finder category will produce — "rank better, surface fewer, prioritize harder" — repeats the mistake the cloud-runtime layer made before it. The right answer to "which two findings today" is not a smarter ranker. It is a different category of product entirely, and the path to getting there is the same one &lt;a href="https://www.mpt.solutions/usage-standards-are-a-praxis-problem/" rel="noopener noreferrer"&gt;yesterday's piece on enterprise distribution&lt;/a&gt; walked at the policy layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Which Two" Is a Values Question
&lt;/h2&gt;

&lt;p&gt;The reason a ranking algorithm cannot answer "which two findings should this team look at today" is that the answer depends on information the ranker cannot see. The team had a customer-facing incident last Thursday that traced to a specific subsystem; findings in that subsystem are worth more this week than they will be in three weeks. There is a marquee launch on Wednesday; findings touching the deploy path or the feature flag system are urgent in a way no static rule will capture. The payments code is in calibration mode; the team treats finding-class A as cosmetic in payments and catastrophic in identity. The new hire is reviewing in their first week and the standards have been deliberately relaxed for their first three PRs so the team can see how they reason about ambiguity.&lt;/p&gt;

&lt;p&gt;None of that is in the codebase. It lives in the team's lived context — incident logs, deal calendars, code-area maturity assessments, hiring decisions, the conversation in last Friday's retro. A ranker that looks at the AST and the static analysis output and the commit history is solving the prior problem, which was "find more bugs." The product problem now is "let the team encode its own values in a way the finder can rank against." That is an authoring problem. Calling it a ranking problem will produce a generation of finders that compete on signal-to-noise and lose every renewal at the eighteen-month mark when the customer realizes the signal-to-noise was never the bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Absorb Rate Actually Depends On
&lt;/h2&gt;

&lt;p&gt;A team that absorbs ten changes a day without a regression has done specific things to make that possible. The CODEOWNERS file is calibrated so the right person sees the right change without a thread of "who should review this?" The review template asks the questions that surface blast radius before line-level critique — what does this touch, what could fail under load, what is the rollback path. The merge gate is configured to catch the failure modes the team has actually hit in the last six months, not the ones some other team hit in a different stack. The on-call rotation is staffed at a level where the engineer who merges at four pm is not the engineer who pages at midnight, so the merger has slack to think.&lt;/p&gt;

&lt;p&gt;All of those are practice artifacts. None of them are product features. A team that has done this work absorbs more change with less risk than a team that has not. The output of the work is review craft, codeowner judgment, lint config that matches the team's specific failure history, and a culture in which the question "what could break this?" gets asked before "is this style-correct?" The ceiling moves. The work to move it is craft work, and the category-level honesty about it is what bug-finder vendors will eventually be forced into when their growth model bumps the absorb-rate cap on every account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Customer Feedback Was Pointing
&lt;/h2&gt;

&lt;p&gt;The post's most useful detail is the customer telemetry. Twenty findings per session, sometimes fifty, came back as a request for two a day. Read that as a UX complaint and the answer is a ranking algorithm. Read it as a domain truth and the answer is different. Customers are saying that the unit of useful work in their environment is "two items per day that fit the team's actual attentional budget after the rest of the day's responsibilities," and that the input format of "fifty findings ranked by static severity" is the wrong shape for that unit. The product question is not "how do we surface the right two." The product question is "how do we let the team author the filter that determines what the right two means for them this week."&lt;/p&gt;

&lt;p&gt;That is the &lt;a href="https://www.mpt.solutions/three-memory-systems-under-one-login-stop-picking-sides/" rel="noopener noreferrer"&gt;wrapper-pattern argument&lt;/a&gt; at the bug-finder layer. The vendor provides the surface; the operator authors the standard that runs against the surface. The vendor that ships a configurable filter — backed by the team's own incident log, codeowner-weighted blast-radius scores, deploy-calendar awareness, individualized risk profiles per code area — is shipping for the actual market. The vendor that ships a "smarter default ranker" is shipping for a market that exists only in the spec doc.&lt;/p&gt;
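
&lt;p&gt;A minimal sketch of what "the operator authors the standard" could mean in code, assuming the finder exposes a static severity score per finding and the team supplies its own weight per code area. The &lt;code&gt;Finding&lt;/code&gt; type, the &lt;code&gt;pick_for_today&lt;/code&gt; function, the area names, and the weights are all hypothetical illustrations, not any vendor's API.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    area: str
    severity: float  # the finder's own static score, 0.0 to 1.0


def pick_for_today(findings, area_weights, daily_budget=2):
    """Rank by static severity scaled by the team's weight for that code area."""
    def score(finding):
        # Areas the team has not weighted keep the finder's default ranking.
        return finding.severity * area_weights.get(finding.area, 1.0)

    return sorted(findings, key=score, reverse=True)[:daily_budget]


# Hypothetical team-authored weights for this week.
this_week = {"payments": 0.2, "identity": 3.0, "deploy": 2.0}

findings = [
    Finding("rounding in invoice total", "payments", 0.9),
    Finding("token refresh race", "identity", 0.4),
    Finding("flag default on rollout", "deploy", 0.5),
    Finding("slow query in search", "search", 0.8),
]
todays_two = pick_for_today(findings, this_week)
```

&lt;p&gt;With these weights the two findings that surface are the identity and deploy ones, even though the payments finding has the highest static severity. The weights table is the part the team edits from week to week; the ranker itself stays trivial.&lt;/p&gt;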

&lt;h2&gt;
  
  
  The Generation-Absorption Gap Is the Signal
&lt;/h2&gt;

&lt;p&gt;The gap in "AI ships 50, your team merges 2" is being read in the category press as a problem to close. The category press has it backward. The gap is the signal of where the high-yield work now lives. The work has moved from writing code to judging code, to deciding which changes are worth the absorb cost given everything the team is doing this week. Judgment is not a product feature. Judgment is a praxis output, accumulated from sustained time in the actual work, encoded in the artifacts the team produces and the conversations they have at standup.&lt;/p&gt;

&lt;p&gt;The bug-finder category will split over the next eighteen months. The vendors that recognize they are selling into a judgment surface will build the authoring layer the operators need. The vendors that keep selling "more findings, better ranked" will hit the absorb-rate ceiling on every account and the renewal conversations will turn into pricing pressure. Cloud runtimes already played this out at a different layer of the stack. The pattern repeats.&lt;/p&gt;

&lt;p&gt;The next layer at this scale is the same as the next layer at the enterprise scale. Authorship. Which operator, with what depth of context, encodes the rules that decide what gets attention today. The vendors did their part. The hard part is whether the team's judgment gets a place to live inside the product.&lt;/p&gt;

</description>
      <category>engineeringleadership</category>
      <category>developerproductivity</category>
      <category>aiengineering</category>
      <category>codereview</category>
    </item>
    <item>
      <title>Usage Standards Are a Praxis Problem</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Thu, 28 May 2026 16:30:10 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/usage-standards-are-a-praxis-problem-5n6</link>
      <guid>https://dev.to/michaeltuszynski/usage-standards-are-a-praxis-problem-5n6</guid>
      <description>&lt;p&gt;The cloud-runtime question for enterprise Claude is settled. AWS shops route through &lt;a href="https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt;. Google Cloud shops route through Vertex AI. Direct deployments now sit inside the customer's perimeter via &lt;a href="https://www.infoq.com/news/2026/05/claude-mcp-tunnels/" rel="noopener noreferrer"&gt;self-hosted sandboxes and MCP tunnels&lt;/a&gt; that shipped this month. The question "where should Claude run" has a workable answer for every common stack, and a circulating LinkedIn post this week argued — correctly — that the debate has shifted from "should we use Claude" to "where do we wire it in."&lt;/p&gt;

&lt;p&gt;The diagnosis is right. The prescription that usually follows — "now write the usage standards" — is the part most enterprises will get wrong, for the same reason most enterprise AI rollouts have already gotten this wrong on every prior tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Word "Standards" Is Doing Work
&lt;/h2&gt;

&lt;p&gt;Calling the next layer "usage standards" makes it sound like a policy artifact. A document the central AI committee drafts, approves, and circulates. A page on the wiki. A two-hour all-hands. That is what most enterprises will produce in Q3. It will not survive contact with the actual work, and the divergence between the document and the operator behavior will start within sixty days of publication.&lt;/p&gt;

&lt;p&gt;The reason is structural and the corpus has been arguing it from a different angle for six weeks. The &lt;a href="https://www.mpt.solutions/you-cant-co-design-what-you-dont-operate/" rel="noopener noreferrer"&gt;May 20 piece on faculty AI governance&lt;/a&gt; laid out the version of this argument applied to higher education. The mechanism is identical in the enterprise. Standards that hold under stress are authored by operators who have worked through the actual edge cases in their actual workflows for weeks. Standards that fail are authored by committees of policy interpreters who have used the tool in artificial training contexts and are extrapolating from surface familiarity to operational judgment.&lt;/p&gt;

&lt;p&gt;The output of those two processes carries the same word — "standards" — and looks the same in PDF. One survives the first stress case. The other becomes the canonical example of the gap between policy and practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Edge Cases Actually Look Like
&lt;/h2&gt;

&lt;p&gt;The reason a central standards document fails is that the standards that matter are discipline-specific. Marketing's prompts touch customer data and outbound brand voice; the failure modes are leakage of unpublished campaigns and tonal mismatch on cross-channel publish. Finance's prompts touch material non-public information and account-level identifiers; the failure modes are inadvertent disclosure into model context and downstream training. Legal's prompts touch privileged communications; the failure modes are waiver and the discoverability of LLM transcripts in subsequent litigation. Engineering's prompts touch production data, customer PII, and intellectual property in source form; the failure modes are everything the security team already knows plus the new class of model-egress risks.&lt;/p&gt;

&lt;p&gt;A central usage-standards document compresses all of these into a paragraph each, written by someone whose closest experience is the third-party security review of the procurement contract. The compression itself is the failure. The marketing operator who has actually been red-teaming customer-data prompts for six weeks knows the failure mode that matters in their work. The compression in the central document does not capture it because the compression was done by someone who was never in the seat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Real Standards Come From
&lt;/h2&gt;

&lt;p&gt;The enterprises that will end up with standards worth defending are the ones who treat the authoring layer the way safety-critical industries treat their procedure manuals. Operators in each domain spend sustained time in the actual workflow with the tool, surface specific failures, document them, and write the discipline-specific rules. The central function's job is to aggregate, not to author. The marketing team's three pages on outbound-brand prompts is the canonical document for marketing. The finance team's two pages on MNPI handling is the canonical document for finance. The aggregated central artifact is a table of contents, not a policy.&lt;/p&gt;

&lt;p&gt;This is the &lt;a href="https://www.mpt.solutions/three-memory-systems-under-one-login-stop-picking-sides/" rel="noopener noreferrer"&gt;wrapper-pattern argument&lt;/a&gt; at the enterprise scale. Same shape as the personal version. The operator writes the rules they encode in their own hooks and lint configurations because they are the only ones who hit the edge cases the rules need to handle. The central function holds the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Procurement Bite Lands at the Runtime Layer Too
&lt;/h2&gt;

&lt;p&gt;The cloud-runtime decision treated as settled is partially a usage-standards decision in disguise. Bedrock has one default for training-data carve-outs and prompt-logging retention; Vertex AI has another; the new self-hosted-sandbox direct path has its own. The choice of runtime has policy buried in it. Calling that decision "settled" because every common stack has a workable answer understates how much of the standards work was already decided in the contract before any committee met.&lt;/p&gt;

&lt;p&gt;This is the same pattern as the &lt;a href="https://www.mpt.solutions/your-marketing-team-is-now-a-software-vendor/" rel="noopener noreferrer"&gt;May 18 piece on shadow IT 2.0&lt;/a&gt;. The procurement door is where the actual policy gets written. The committee door is where the discussion document gets written. The enterprises that staff operators into both doors get standards that match the runtime. The enterprises that staff operators into only one door get standards that diverge from the runtime within a quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "The Holding Pattern Is Expensive" Is Half True
&lt;/h2&gt;

&lt;p&gt;The source piece is correct that delay pushes adoption outside governed environments. The implicit claim is that an enterprise standard would have kept things governed. It would not have. Central standards never govern operator behavior. They govern the policy document. What actually governs operator behavior in a sustainable way is the operator's own standards — encoded in their own tooling, their own hooks, their own lint, their own playbooks for the workflows they actually run. The holding pattern is in fact the period when operators with that posture are building standards that will be the de facto rules by the time the official document ships.&lt;/p&gt;

&lt;p&gt;This was the prescriptive close of yesterday's &lt;a href="https://www.mpt.solutions/inside-the-stack-i-ship-from-daily/" rel="noopener noreferrer"&gt;piece on opening the personal stack&lt;/a&gt; at the individual scale. The same logic applies at the team scale. The marketing team that has spent the "holding pattern" period encoding their actual prompt guardrails into their own internal MCP server has authored their standards. The marketing team that has spent the same period waiting for the central policy document has authored nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Next Layer Is Authorship
&lt;/h2&gt;

&lt;p&gt;The vendors solved distribution. The next layer is authorship — which operators, with what depth of praxis, are allowed to write the standards that will actually govern the work. The enterprise that ships a twelve-page AI Usage Policy in Q3 will discover in Q4 that the operators who built their own standards in Q2 are the ones who set the de facto rules. The policy document and the lived standard will diverge. Every prior wave of tool adoption — version control, container runtimes, IDE choice, CI pipelines — has produced the same divergence, and the resolution has always been the same: the operator-authored standard wins, the policy document catches up, and the gap between them is the cost of pretending the central function can author what only the operator can.&lt;/p&gt;

&lt;p&gt;Match the layer to the right author. Distribution to procurement. Runtime to platform engineering. Standards to the operators in each discipline who have done the work. The vendors did their part. The hard part — and the part that has not been "solved" by anything Anthropic shipped in the last sixty days — is whether the authoring chair gets assigned correctly inside the customer.&lt;/p&gt;

</description>
      <category>enterpriseai</category>
      <category>aigovernance</category>
      <category>claudecode</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Wealth Management's Coming Agent Shock</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Wed, 27 May 2026 16:30:05 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/wealth-managements-coming-agent-shock-596m</link>
      <guid>https://dev.to/michaeltuszynski/wealth-managements-coming-agent-shock-596m</guid>
      <description>&lt;p&gt;The &lt;a href="https://www.mpt.solutions/what-looks-like-busywork-is-mostly-rent/" rel="noopener noreferrer"&gt;May 15 piece on access-rents&lt;/a&gt; drew a line through every services industry. On one side: access-rents — work that consists of operating an interface the customer cannot operate themselves. On the other: integrated expertise — work that consists of telling the customer which question to ask. AI eats both, with opposite policy implications.&lt;/p&gt;

&lt;p&gt;Wealth management is the industry that has the largest mix of both, sitting in the same company, frequently in the same advisor. The agent shock that is coming for this industry will be uneven in a way the public discussion has not started to model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow Surface
&lt;/h2&gt;

&lt;p&gt;A wealth-management workflow, observed from the operator's side, has roughly twelve recurring operations.&lt;/p&gt;

&lt;p&gt;Account aggregation across custodians — pulling positions, balances, and transactions from Schwab, Fidelity, Janney, IBKR, Empower, smaller firms. Transaction categorization for tax and reporting. Performance attribution against benchmarks. Rebalancing signal generation against a target allocation. Tax-loss harvesting candidate identification. Cost-basis tracking across in-kind transfers and corporate actions. ACATS execution. RMD calculation. Backdoor Roth conversion mechanics. Estate-planning vehicle structure across multiple entities. Client behavioral coaching during drawdowns. Strategy revision after a life event.&lt;/p&gt;

&lt;p&gt;Half of those are mechanical. Half of those are judgment. The mechanical half is where agents already work today. The judgment half is where they do not, and where the durable expertise lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Agents Already Work
&lt;/h2&gt;

&lt;p&gt;Account aggregation has been agent-tractable for ten years. &lt;a href="https://plaid.com/docs/" rel="noopener noreferrer"&gt;Plaid&lt;/a&gt; and its competitors have been doing it under different vendor names. The underlying technology — credential vaulting, OAuth-where-supported, scraping-where-not, transaction normalization — is mature. The integration cost is non-trivial but bounded.&lt;/p&gt;

&lt;p&gt;Transaction categorization is a solved supervised-learning problem. Consumer apps have been categorizing personal transactions since Mint shipped in 2007. The same models work for advisor-facing categorization at higher quality with feedback loops.&lt;/p&gt;

&lt;p&gt;Performance attribution is table joins and arithmetic. Benchmark selection has some judgment in it; the math against the chosen benchmark is mechanical.&lt;/p&gt;

&lt;p&gt;Rules-based rebalancing — if allocation drift exceeds threshold, generate the trade list — is a working agent today inside every major robo platform. Wealthfront, Betterment, Schwab Intelligent Portfolios all run automated rebalancing as the core product.&lt;/p&gt;
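
&lt;p&gt;The rule in that paragraph is small enough to write down. The sketch below assumes a single account, dollar-valued holdings, target weights that sum to 1.0, and a hypothetical five-point drift threshold. It ignores taxes, lot selection, and trading costs, and it is not how any named platform implements rebalancing.&lt;/p&gt;

```python
def rebalance_trades(holdings, targets, drift_threshold=0.05):
    """Return dollar trades that restore targets, or {} if drift is in bounds.

    holdings: asset -> current dollar value
    targets:  asset -> target weight (weights assumed to sum to 1.0)
    """
    total = sum(holdings.values())
    if total == 0:
        return {}
    # Drift is the current weight minus the target weight for each asset.
    drifts = {
        asset: holdings.get(asset, 0.0) / total - weight
        for asset, weight in targets.items()
    }
    if max(abs(d) for d in drifts.values()) > drift_threshold:
        # Positive amount means buy, negative means sell.
        return {asset: round(-d * total, 2) for asset, d in drifts.items()}
    return {}


trades = rebalance_trades(
    {"stocks": 70_000.0, "bonds": 30_000.0},
    {"stocks": 0.60, "bonds": 0.40},
)
```

&lt;p&gt;For a 70/30 portfolio against a 60/40 target, the result is a sale of 10,000 in stocks and a purchase of 10,000 in bonds. Nothing in the function needs to know who the client is, which is why this part of the workflow commoditized first.&lt;/p&gt;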

&lt;p&gt;The agent layer for these workflows is not a future problem. It is a present commodity. The advisors whose primary value-add is operating these systems on a client's behalf are doing access-rent work. Their position is the same position the travel agent occupied in 2002.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Agents Still Do Not Work
&lt;/h2&gt;

&lt;p&gt;Multi-entity tax strategy across a household — when to convert, how much, against which marginal rate, against which projected future bracket, against which Roth conversion ladder, accounting for state-level interactions — is integrated expertise. Models can generate options. The judgment that selects between options under specific client constraints is not in the model. The advisor who runs this kind of optimization for a client has a moat, not a rent.&lt;/p&gt;

&lt;p&gt;Rebalancing under tax constraints crosses the threshold. A rules-based rebalance is mechanical; a rebalance that knows not to rebuy the harvested position in the IRA, because the wash-sale rule reaches across accounts, and that schedules the gain realization across calendar years to avoid an NIIT spike, requires the kind of context that lives in the advisor's head.&lt;/p&gt;

&lt;p&gt;Behavioral coaching during drawdowns is the part the academic literature has been studying for thirty years and that DALBAR keeps measuring as the largest single source of advisor value. The client who calls in March 2020 and asks to move everything to cash is not asking for an answer; they are asking for a counterparty. The advisor who is that counterparty has work no agent currently replaces.&lt;/p&gt;

&lt;p&gt;Estate planning across vehicles — a revocable trust, a charitable remainder trust, a 529 with appreciated stock from an employer ESPP, a defined-benefit plan that has been in run-off mode for six years — is the kind of multi-entity, multi-rule integrated expertise that no current model has the working context to author. Generating draft documents is easy. Deciding which draft to use is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Procurement Bite
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.mpt.solutions/your-marketing-team-is-now-a-software-vendor/" rel="noopener noreferrer"&gt;May 17 piece on substrate&lt;/a&gt; argued that the policy surface for AI inside organizations is decided at the procurement layer, not the policy committee. The wealth management version of this is the data-feed layer.&lt;/p&gt;

&lt;p&gt;Plaid TTLs are short by industry standard. Bank credential reauth typically resolves in days. Brokerage credential reauth is often shorter — most retail brokerage feeds require monthly reauth at minimum, some weekly. The actual durability of the agent layer that operates against any household's accounts depends on the worst reauth cycle across all the accounts in that household. If even one custodian has an aggressive reauth policy, the whole stack inherits that policy.&lt;/p&gt;

&lt;p&gt;Custodian feeds vary wildly. Schwab through its standard Plaid path has different rate limits, different transaction-history depth, and different reauth cycles than Fidelity through theirs. Empower's institutional feed for 401(k) data has its own rules. Janney's broker-dealer feed has a much shorter TTL than the consumer-bank standard — recurring re-auth requests are documented as expected behavior, not as a defect. The agent that operates across all of these inherits the most restrictive constraint of the bundle.&lt;/p&gt;
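
&lt;p&gt;The "most restrictive constraint of the bundle" claim reduces to taking a minimum. A minimal sketch, with placeholder custodian names and made-up intervals that do not describe any real feed's policy:&lt;/p&gt;

```python
from datetime import timedelta


def household_reauth_interval(feed_intervals):
    """Return the feed with the shortest reauth interval, and that interval.

    feed_intervals: feed name -> timedelta between required reauths.
    The household-level agent can run unattended only that long.
    """
    if not feed_intervals:
        raise ValueError("household has no linked feeds")
    tightest = min(feed_intervals, key=feed_intervals.get)
    return tightest, feed_intervals[tightest]


# Illustrative values only; not real custodian policies.
feeds = {
    "custodian_a": timedelta(days=90),
    "custodian_b": timedelta(days=30),
    "custodian_c": timedelta(days=7),
}
binding_feed, interval = household_reauth_interval(feeds)
```

&lt;p&gt;Adding a feed can only keep the household interval the same or shorten it, never lengthen it. That is the sense in which one aggressive custodian sets the policy for the whole stack.&lt;/p&gt;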

&lt;p&gt;The advisor-tech firms that decide today which custodian feeds to integrate first, and which corporate actions to normalize across feeds, are deciding the policy surface for the entire household-level agent layer of the industry. By the time the official agent rollout from any individual brokerage arrives, the cross-custodian agent layer will already be sitting on top of it, and the brokerage that locked down its feed too aggressively will find itself routed around.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Advisor Split
&lt;/h2&gt;

&lt;p&gt;The advisors most exposed to substitution are the ones whose primary value-add sits between the client and a custodian API. The advisor whose deliverable is a quarterly performance report, an annual rebalance, and an occasional ACATS — that work is access-rent. The robo platforms are already running it for ten basis points. The full-service advisor's spread above the robo is paying for relationship continuity and occasional judgment. The judgment fraction of that work shrinks as agents handle more of the mechanical surround.&lt;/p&gt;

&lt;p&gt;The advisors least exposed are the ones whose primary value-add is multi-vehicle, multi-rule, multi-stakeholder judgment under tax and estate constraints. A family with a closely-held business, a defined-benefit plan, a real-estate concentration, and three generations of heirs needs the kind of integrated reasoning that does not collapse into an agent prompt because the constraints do not fit into a prompt. The fee structure for that work survives. The fee structure for the access-rent work does not.&lt;/p&gt;

&lt;p&gt;This is the same split the &lt;a href="https://www.mpt.solutions/brand-on-the-cv-is-a-2018-heuristic/" rel="noopener noreferrer"&gt;May 13 piece on AWS Cloud Support engineers&lt;/a&gt; described in a different industry: the L1 work falls to agents first, the integrated-expertise work falls to agents last, and the badge of having operated the L1 interface for eighteen months loses signal value almost immediately. The wealth-management equivalent is happening on a parallel curve.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in 2027
&lt;/h2&gt;

&lt;p&gt;The robo platforms will offer richer agentic layers, with the Plaid-style feeds plumbed in by default. The full-service brokerages will offer hybrid agent-plus-advisor products at lower spreads. The independent RIA shops that built integrated-expertise practice — multi-entity tax, estate-across-vehicles, behavioral coaching, business-owner work — will continue to charge premium fees and grow market share. The independent RIA shops that built access-rent practice will lose market share at the lower-fee end of their book and try to move upmarket.&lt;/p&gt;

&lt;p&gt;The cross-custodian agent layer that operates across multiple feeds — household-level rebalancing, tax-aware harvesting across accounts, RMD coordination — will exist as third-party software before any single custodian builds it natively. The firms that own that layer will have the same relationship to the custodians that Plaid has to the banks. Some of them will get acquired. Some will not.&lt;/p&gt;

&lt;p&gt;The decade-long erosion that hit travel agents between 2000 and 2010 (&lt;a href="https://www.latimes.com/archives/la-xpm-2002-mar-19-fi-comp19-story.html" rel="noopener noreferrer"&gt;the LA Times documented the 2002 commission elimination as the inflection&lt;/a&gt;) is the rough analog. The mechanical work fell to interfaces the customer could operate themselves. The advice work survived where it had been integrated expertise; it died where it had been a rent on the access surface.&lt;/p&gt;

&lt;p&gt;Wealth management is in the 2002 moment now. The five years that follow are the redistribution.&lt;/p&gt;

</description>
      <category>wealthmanagement</category>
      <category>fintechai</category>
      <category>financialadvisors</category>
      <category>agentdesign</category>
    </item>
    <item>
      <title>The Five Failures That Shaped My Personal AI Stack</title>
      <dc:creator>Michael Tuszynski</dc:creator>
      <pubDate>Tue, 26 May 2026 16:30:04 +0000</pubDate>
      <link>https://dev.to/michaeltuszynski/the-five-failures-that-shaped-my-personal-ai-stack-lno</link>
      <guid>https://dev.to/michaeltuszynski/the-five-failures-that-shaped-my-personal-ai-stack-lno</guid>
      <description>&lt;p&gt;Every working stack is the residue of failures the operator did not see coming. The &lt;a href="https://www.mpt.solutions/inside-the-stack-i-ship-from-daily/" rel="noopener noreferrer"&gt;Saturday piece&lt;/a&gt; showed the architecture as it stands now. This piece is the inverse — the five specific incidents that produced the current shape. Each one started as a quiet bug and ended as a permanent change in how the system runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 1: The Eleven-Day Stale Lock
&lt;/h2&gt;

&lt;p&gt;On May 15 the session-end auto-commit hook tried to commit pending changes and failed. The commit attempt collided with a &lt;code&gt;.git/index.lock&lt;/code&gt; file that had been sitting in the repo since May 3 — a zero-byte file created by a crashed git process eleven days earlier. The hook had been quietly failing every session in between, and nobody had noticed because the failure mode was silent.&lt;/p&gt;

&lt;p&gt;Root cause: the hook had no defense against orphaned lock files. The original code assumed any &lt;code&gt;.git/index.lock&lt;/code&gt; it encountered was held by a live git process, which is true ninety-nine times out of a hundred. The hundredth time was a process that died without releasing the lock.&lt;/p&gt;

&lt;p&gt;Fix: a five-line stale-lock cleanup block. The hook checks for &lt;code&gt;.git/index.lock&lt;/code&gt; before attempting the commit. If the lock exists, it checks the file's mtime against the current time — a lock older than five minutes is suspicious. If the mtime is old, the hook then verifies via &lt;code&gt;lsof&lt;/code&gt; that no live process holds the file. Both conditions true: delete the lock. Either condition false: preserve it.&lt;/p&gt;

&lt;p&gt;Healthy auto-commits complete in under a second. The five-minute threshold cannot race a real concurrent run. Tested across three scenarios — no lock, old lock with no holder, fresh lock with a live holder — before the change shipped.&lt;/p&gt;
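
&lt;p&gt;The original hook is not shown here, so the following is a sketch of the same two-condition check in Python rather than the actual five-line block. The function names are hypothetical. The &lt;code&gt;lsof&lt;/code&gt; call relies on its documented behavior of exiting non-zero when no process has the file open, and a missing &lt;code&gt;lsof&lt;/code&gt; binary is treated as "someone might hold it" so the sketch errs toward preserving the lock.&lt;/p&gt;

```python
import os
import subprocess
import time

STALE_AFTER_SECONDS = 5 * 60  # the five-minute threshold from the text


def lsof_reports_holder(lock_path):
    """Return True if lsof lists any process holding the file open."""
    try:
        result = subprocess.run(
            ["lsof", lock_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return True  # no lsof available: assume a holder and keep the lock
    return result.returncode == 0


def clear_stale_lock(lock_path, has_holder=lsof_reports_holder, now=None):
    """Delete lock_path only if it is old AND no live process holds it."""
    if not os.path.exists(lock_path):
        return "no-lock"
    now = time.time() if now is None else now
    age = now - os.path.getmtime(lock_path)
    # The holder check runs only when the age check has already passed.
    if age > STALE_AFTER_SECONDS and not has_holder(lock_path):
        os.remove(lock_path)
        return "removed"
    return "preserved"
```

&lt;p&gt;Passing the holder check in as a parameter is a choice made for this sketch. It lets the three scenarios from the paragraph above (no lock, old lock with no holder, fresh lock with a live holder) be exercised without spawning a real git process.&lt;/p&gt;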

&lt;p&gt;The general lesson: hooks accumulate edge cases. The version of the hook that survives a year of daily use is the version that handles the failure modes you discovered along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 2: The Silently Forked Database
&lt;/h2&gt;

&lt;p&gt;For eight days between May 4 and May 12, the content engine was writing to two different SQLite databases at the same time without anyone noticing. The cron pipeline wrote new topics from the daily trend-scan to &lt;code&gt;~/services-local/content-engine/data/content.db&lt;/code&gt;, and the manual publish scripts in the same directory wrote there too. But a separate copy of the same database file at &lt;code&gt;~/.local/share/nexus/services-db/content-engine/content.db&lt;/code&gt;, which a broken Synology XSym symlink in the nexus path silently resolved to, was receiving the older trend-scan rows from the AI-driven path.&lt;/p&gt;

&lt;p&gt;Both files had &lt;code&gt;content&lt;/code&gt; rows, both had &lt;code&gt;topics&lt;/code&gt; rows, both had &lt;code&gt;publications&lt;/code&gt; rows, and the ID ranges overlapped. This was not immediately catastrophic because the divergence was bounded: the handoff between the two files happened cleanly on May 4 when the manual sprint began, so there were no genuine ID collisions, only orphaned rows on each side that the other side did not know about.&lt;/p&gt;

&lt;p&gt;Root cause: a Synology XSym pointer in the nexus directory that resolved one of the source files to a different location than the canonical one. The XSym format does not behave the same way as a POSIX symlink across mount boundaries, and nothing surfaced the difference.&lt;/p&gt;

&lt;p&gt;Fix: an ID-offset merge that brought the orphaned rows from the older file into the canonical one (topics +1000, content/research/publications +100). The counters in &lt;code&gt;sqlite_sequence&lt;/code&gt; were bumped to match. &lt;code&gt;PRAGMA foreign_key_check&lt;/code&gt; came back clean. Backups of both source databases were saved before the merge. The broken XSym symlink was replaced with a real POSIX symlink to the canonical path.&lt;/p&gt;
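
&lt;p&gt;The offset step can be sketched in TypeScript on in-memory rows. This is an illustration, not the script that ran: the row shapes are hypothetical, only two of the tables are shown, and it assumes each content row's &lt;code&gt;topic_id&lt;/code&gt; has to move by the topic offset so references stay intact.&lt;/p&gt;

```typescript
// Hypothetical row shapes; the real schema is not shown in this piece.
type TopicRow = { id: number; title: string };
type ContentRow = { id: number; topic_id: number | null; title: string };

// Offsets quoted in the prose: topics +1000, content +100.
const TOPIC_OFFSET = 1000;
const CONTENT_OFFSET = 100;

// Move orphaned topic rows out of the canonical file's ID range.
function shiftTopics(rows: TopicRow[]): TopicRow[] {
  return rows.map((row) => ({ ...row, id: row.id + TOPIC_OFFSET }));
}

// Shift content IDs, and shift each topic reference by the topic offset
// so it keeps pointing at the same (now renumbered) topic.
function shiftContent(rows: ContentRow[]): ContentRow[] {
  return rows.map((row) => ({
    ...row,
    id: row.id + CONTENT_OFFSET,
    topic_id: row.topic_id === null ? null : row.topic_id + TOPIC_OFFSET,
  }));
}

// In-memory stand-in for a foreign-key check: content rows whose topic is missing.
function danglingContent(topics: TopicRow[], content: ContentRow[]): ContentRow[] {
  const topicIds = new Set(topics.map((t) => t.id));
  return content.filter((c) => (c.topic_id === null ? false : !topicIds.has(c.topic_id)));
}
```

&lt;p&gt;Shifting primary keys and the foreign keys that reference them in one pass is what lets a post-merge integrity check come back empty.&lt;/p&gt;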

&lt;p&gt;The general lesson: silent forks are the worst class of incident because they degrade trust in the data retroactively. Anything that reports counts, dedupes, or makes scheduling decisions against the table is suspect until reconciled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 3: The Re-Generated Drafts
&lt;/h2&gt;

&lt;p&gt;On May 13 the 10 AM &lt;code&gt;draft.ts&lt;/code&gt; cron produced two &lt;code&gt;pending_review&lt;/code&gt; drafts for titles that had already been published in April. The system was about to ship a second copy of two pieces that had been live for weeks. The drafts sat in Slack for review and got caught before they shipped, but the failure mode was that the cron pipeline would have happily generated them again the next day and the day after that until someone noticed.&lt;/p&gt;

&lt;p&gt;Root cause: two compounding gaps in the state machine. The &lt;code&gt;content_approve&lt;/code&gt; handler in &lt;code&gt;review-workflow.ts&lt;/code&gt; only advanced the content status; the topic status stayed at whatever the draft-runner left it, which meant a successfully published piece could leave its topic in &lt;code&gt;drafted&lt;/code&gt; (happy path) or &lt;code&gt;approved&lt;/code&gt; (if the Slack post mid-draft failed). Trend-scanner had a &lt;code&gt;getPublishedContentTitles()&lt;/code&gt; dedupe; &lt;code&gt;draft.ts&lt;/code&gt; did not. Then the May 12 DB merge brought two topics from the forked database in at &lt;code&gt;status='approved'&lt;/code&gt;, and the next day's 10 AM cron drained them.&lt;/p&gt;

&lt;p&gt;The fix has two parts. The first is a defensive guard in &lt;code&gt;draft-runner.ts&lt;/code&gt; that imports &lt;code&gt;getPublishedContentTitles&lt;/code&gt;, builds a lowercase Set once per run, and skips and archives any topic whose title matches an already-published title. With the guard in place, re-drafting a published title becomes structurally impossible regardless of upstream state-machine leaks. The second is a state-machine fix in &lt;code&gt;review-workflow.ts&lt;/code&gt; that calls &lt;code&gt;updateTopicStatus(content.topic_id, 'archived')&lt;/code&gt; when the &lt;code&gt;content_approve&lt;/code&gt; case fires with a non-null &lt;code&gt;topic_id&lt;/code&gt;.&lt;/p&gt;
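
&lt;p&gt;The defensive guard might look roughly like this. It is a sketch under assumptions: the &lt;code&gt;Topic&lt;/code&gt; shape and the &lt;code&gt;partitionTopics&lt;/code&gt; name are hypothetical, and the published titles are passed in as a plain array, standing in for whatever &lt;code&gt;getPublishedContentTitles()&lt;/code&gt; actually returns.&lt;/p&gt;

```typescript
// Hypothetical topic shape; the real row has more fields.
type Topic = { id: number; title: string };

// Split pending topics into ones safe to draft and ones to skip and archive.
function partitionTopics(topics: Topic[], publishedTitles: string[]) {
  // Build the lowercase set once per run, not once per topic.
  const published = new Set(publishedTitles.map((title) => title.toLowerCase()));
  const toDraft: Topic[] = [];
  const toArchive: Topic[] = [];
  for (const topic of topics) {
    if (published.has(topic.title.toLowerCase())) {
      toArchive.push(topic); // already live: never draft it again
    } else {
      toDraft.push(topic);
    }
  }
  return { toDraft, toArchive };
}
```

&lt;p&gt;Because the check runs against published titles and not topic status, it holds even when a topic arrives in an unexpected state, such as a merged row marked &lt;code&gt;approved&lt;/code&gt;.&lt;/p&gt;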

&lt;p&gt;The general lesson: a state machine is only safe when the invariants hold from both directions. The trend-scanner had the dedupe; the drafter did not. Now both do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 4: The 409 That Was a Success
&lt;/h2&gt;

&lt;p&gt;On May 2 the Instagram carousel publish for a T3 piece returned an HTTP 409 from Late.dev — "exact content already scheduled," with an &lt;code&gt;existingPostId&lt;/code&gt; field pointing at the post the request had just created. The carousel had successfully scheduled. The response said it had failed.&lt;/p&gt;

&lt;p&gt;Root cause: Late.dev's API was returning a duplicate-detection error against requests it had itself just enqueued, before its internal scheduler reconciled them. The 409 was a race condition between insert and dedup-check.&lt;/p&gt;

&lt;p&gt;Fix: a try/catch around the IG publish call that catches the 409, parses the &lt;code&gt;existingPostId&lt;/code&gt; from the error response, and treats it as success — inserts a publication row pointing at the returned ID, marks the content row as &lt;code&gt;status='published'&lt;/code&gt;. The fix lives in &lt;code&gt;publisher.ts &amp;gt; publishToInstagram&lt;/code&gt;.&lt;/p&gt;
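
&lt;p&gt;A sketch of that wrapper in TypeScript. The error shape (a thrown object with &lt;code&gt;status&lt;/code&gt; and &lt;code&gt;body.existingPostId&lt;/code&gt;) and the names &lt;code&gt;existingPostIdFrom409&lt;/code&gt; and &lt;code&gt;publishTreating409AsSuccess&lt;/code&gt; are assumptions for illustration; Late.dev's actual client may surface the response differently.&lt;/p&gt;

```typescript
// Returns the existing post ID when the error is a 409 that carries one, else null.
// Assumed error shape: { status: number, body: { existingPostId?: string } }.
function existingPostIdFrom409(err: unknown): string | null {
  if (typeof err !== "object" || err === null) {
    return null;
  }
  const e = err as { status?: unknown; body?: { existingPostId?: unknown } | null };
  if (e.status !== 409) {
    return null;
  }
  const id = e.body ? e.body.existingPostId : undefined;
  return typeof id === "string" ? id : null;
}

// `publish` is a hypothetical callback that resolves to the new post ID.
async function publishTreating409AsSuccess(publish: () => unknown) {
  try {
    const postId = await publish();
    if (typeof postId !== "string") {
      throw new Error("publish did not return a post ID");
    }
    return { postId, recovered: false };
  } catch (err) {
    const existing = existingPostIdFrom409(err);
    if (existing === null) {
      throw err; // any other failure is still a failure
    }
    // The vendor says "duplicate", but the post exists: record it as published.
    return { postId: existing, recovered: true };
  }
}
```

&lt;p&gt;Either way the caller gets a post ID back, so the publication row and the &lt;code&gt;status='published'&lt;/code&gt; update follow the same path for a clean response and a recovered 409.&lt;/p&gt;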

&lt;p&gt;The general lesson: integrations with vendor APIs accumulate vendor-specific quirks. The fix is not to file a support ticket and wait. The fix is to handle the quirk inside your wrapper and move on. The May 2 incident produced &lt;a href="https://www.mpt.solutions/your-agents-compliments-are-a-confession/" rel="noopener noreferrer"&gt;Hard-Won Lesson #21&lt;/a&gt; — the corpus reference to the broader pattern of catching false negatives at the integration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 5: The Named Foil
&lt;/h2&gt;

&lt;p&gt;On May 4, the contrarian piece "Agentic Coding Isn't the Trap. Supervising From Your Head Is." named the writer of the original argument I was rebutting and proceeded to characterize their position in ways that pushed beyond what they had actually written. Twelve days later, on May 16, the author of the original piece pushed back publicly in the LinkedIn comments — quoting their own piece to show they had never advocated the specific thing I had implied they advocated.&lt;/p&gt;

&lt;p&gt;The pushback was fair. The strawman risk had been highest precisely because their position was close enough to mine that the extrapolation felt safe. I acknowledged the correction publicly on LinkedIn, added an editor's note at the top of the original Ghost post linking back to the comment, and shipped a new reusable script (&lt;code&gt;scripts/add-editors-note-faye.ts&lt;/code&gt;) that uses the Ghost JWT auth pattern to add notes idempotently to any post.&lt;/p&gt;

&lt;p&gt;Root cause: a voice-and-discipline gap, not a code gap. Two patterns compounded — naming a foil author in the prose, and using the negative-parallelism title pattern ("X Isn't Y. Z is.") that depends on a strawman to work.&lt;/p&gt;

&lt;p&gt;Fix: two new entries in the feedback memory. The first bans the "X isn't Y. Z is." title and lede pattern across the corpus. The second bans naming the contrarian target in prose: the link to the source piece can stay and the URL slug can carry the author's name, but the prose itself no longer attributes the argument by name. Both rules are now part of the auto-loaded session context. Subsequent pieces (the May 19 Goodhart piece responding to a field guide, the May 20 co-design piece responding to an academic article) followed both rules and shipped clean.&lt;/p&gt;

&lt;p&gt;The general lesson: the corpus is the residue of editor's notes. Every voice-discipline rule worth keeping was learned from a specific incident where shipping without it produced a public correction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Survives
&lt;/h2&gt;

&lt;p&gt;The current stack is the survivor of these five and a dozen smaller incidents I am not writing up. The pieces of it that look obvious in retrospect (the stale-lock defense, the canonical-DB symlink, the dedupe guard in the drafter, the 409 catch in the publisher, the named-foil ban in the feedback memory) each came from a specific incident the original design did not anticipate.&lt;/p&gt;

&lt;p&gt;The stack is not what I planned. It is what is left after the failures pruned the parts that did not work. Anyone reading the &lt;a href="https://www.mpt.solutions/inside-the-stack-i-ship-from-daily/" rel="noopener noreferrer"&gt;Saturday architecture piece&lt;/a&gt; is looking at the convex hull of those five corrections, plus the smaller ones, plus the parts that worked the first time.&lt;/p&gt;

&lt;p&gt;Show your stack. Show the failures that shaped it. Show the editor's notes. The thing that ships is the thing that survived.&lt;/p&gt;

</description>
      <category>postmortem</category>
      <category>personalaistack</category>
      <category>developertools</category>
      <category>claudecode</category>
    </item>
  </channel>
</rss>
