<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: claire nguyen</title>
    <description>The latest articles on DEV Community by claire nguyen (@claire_nguyen).</description>
    <link>https://dev.to/claire_nguyen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864932%2F406e92e1-2c8d-4d65-a1f4-a8d4e8c2fd1d.jpg</url>
      <title>DEV Community: claire nguyen</title>
      <link>https://dev.to/claire_nguyen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/claire_nguyen"/>
    <language>en</language>
    <item>
      <title>MCP in Production: Reality vs the Spec</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:58:18 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/mcp-in-production-reality-vs-the-spec-3f04</link>
      <guid>https://dev.to/claire_nguyen/mcp-in-production-reality-vs-the-spec-3f04</guid>
      <description>&lt;p&gt;Been building against MCP for the last four months and the gap between what vendors claim and what the spec actually supports is getting hard to ignore.&lt;/p&gt;

&lt;p&gt;If you have not read the official roadmap yet, it is worth your time. The document published by AAIF in March lays things out clearly and honestly. The list of what is still missing is longer than many people in the ecosystem seem willing to admit.&lt;/p&gt;

&lt;h2&gt;Stateless Streaming Is Not Here Yet&lt;/h2&gt;

&lt;p&gt;Stateless Streamable HTTP is still marked as in progress. That has real consequences.&lt;/p&gt;

&lt;p&gt;Today, if you want to scale horizontally, you are dealing with sticky sessions or putting a stateful proxy in front of your servers. This is not a small implementation detail. It directly affects reliability, cost, and operational complexity.&lt;/p&gt;

&lt;p&gt;Every "MCP native at scale" pitch I have seen quietly works around this with a custom session layer. That may be practical for now, but it is not what people assume when they hear "stateless protocol."&lt;/p&gt;
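
&lt;p&gt;To make that concrete, here is a minimal sketch of the session-pinning shim teams end up writing. It assumes the Streamable HTTP transport's Mcp-Session-Id header; the backend list and function names are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sticky-routing shim: pin each MCP session to one replica.
# Relies on the Streamable HTTP transport's Mcp-Session-Id header; the
# backend list and names here are illustrative, not from any SDK.
import hashlib

BACKENDS = ["http://mcp-0:8080", "http://mcp-1:8080", "http://mcp-2:8080"]

def pick_backend(session_id):
    # Deterministic hash so every request in a session hits the same node.
    digest = hashlib.sha256(session_id.encode()).digest()
    return BACKENDS[digest[0] % len(BACKENDS)]

# The catch: lose that node and you lose the session. This is the state
# the protocol does not yet let you avoid.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;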

&lt;h2&gt;Async Work Is Still DIY&lt;/h2&gt;

&lt;p&gt;The Tasks primitive for async and long-running operations is also in progress.&lt;/p&gt;

&lt;p&gt;In practice, this means any agent doing multi-minute work is faking async. Most teams end up with polling endpoints, custom retry logic, and their own definitions of job state.&lt;/p&gt;

&lt;p&gt;The problem is not just inconvenience. It is fragmentation. Each implementation behaves slightly differently, which makes interoperability harder before it even begins.&lt;/p&gt;
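
&lt;p&gt;A sketch of what that DIY layer usually looks like. The endpoint shape, job states, and backoff policy are all local conventions, which is the point: none of this comes from the spec.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical polling shim for a long-running tool call. Every team
# writes some version of this, each with its own states and timeouts.
import time
import httpx

def wait_for_job(base_url, job_id, timeout_s=600):
    deadline = time.monotonic() + timeout_s
    delay = 1.0
    while time.monotonic() &lt; deadline:
        job = httpx.get(f"{base_url}/jobs/{job_id}").json()
        if job["state"] in ("succeeded", "failed"):  # our states, not the spec's
            return job
        time.sleep(delay)
        delay = min(delay * 2, 30)  # homegrown backoff policy
    raise TimeoutError(f"job {job_id} still running after {timeout_s}s")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;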

&lt;h2&gt;Discovery Is Still Manual&lt;/h2&gt;

&lt;p&gt;Server discovery is another gap that shows up quickly.&lt;/p&gt;

&lt;p&gt;The idea of Server Cards exposed via .well-known URLs is promising, but not available yet. Right now, you cannot know what an MCP server can do without connecting to it first.&lt;/p&gt;

&lt;p&gt;The Registry preview from late 2025 helps, but it is not a replacement for protocol level discovery. You still end up writing glue code just to answer basic capability questions.&lt;/p&gt;
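
&lt;p&gt;The glue code in question tends to look like this: connect, run the JSON-RPC handshake, and ask for the tool list just to learn what the server offers. A minimal sketch; the /mcp path is an assumption, and real servers may answer over SSE rather than plain JSON.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal capability probe: today you cannot know what a server can do
# without connecting first. Endpoint path and response handling are
# simplified assumptions.
import httpx

def probe_tools(server):
    init = {"jsonrpc": "2.0", "id": 1, "method": "initialize",
            "params": {"protocolVersion": "2025-03-26", "capabilities": {},
                       "clientInfo": {"name": "probe", "version": "0"}}}
    headers = {"Accept": "application/json, text/event-stream"}
    with httpx.Client(base_url=server, headers=headers) as client:
        first = client.post("/mcp", json=init)
        sid = first.headers.get("Mcp-Session-Id", "")
        resp = client.post("/mcp", headers={"Mcp-Session-Id": sid},
                           json={"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
        return [tool["name"] for tool in resp.json()["result"]["tools"]]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;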

&lt;h2&gt;Enterprise Auth Is Not Ready&lt;/h2&gt;

&lt;p&gt;Authentication is where things feel especially incomplete for real world use.&lt;/p&gt;

&lt;p&gt;Most implementations today rely on static client secrets. That works for prototypes, but does not align with how larger organizations manage identity and access.&lt;/p&gt;

&lt;p&gt;The roadmap calls out SSO-integrated, cross-app access as a priority. That is exactly what is needed. Until it lands, teams are building their own auth layers on top.&lt;/p&gt;
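
&lt;p&gt;In the meantime, "building their own auth layers" usually means a shim like the one below: trade the static secret for short-lived tokens so the credential can at least expire. This is plain OAuth2 client credentials; nothing about it is specified by MCP.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Stopgap token cache: exchange a static client secret for short-lived
# access tokens. Generic OAuth2, not MCP; all URLs are yours to supply.
import time
import httpx

class TokenCache:
    def __init__(self, token_url, client_id, client_secret):
        self.token_url = token_url
        self.auth = (client_id, client_secret)
        self.token, self.expires_at = None, 0.0

    def get(self):
        if time.time() &gt;= self.expires_at - 30:  # refresh 30s early
            resp = httpx.post(self.token_url, auth=self.auth,
                              data={"grant_type": "client_credentials"})
            body = resp.json()
            self.token = body["access_token"]
            self.expires_at = time.time() + body.get("expires_in", 300)
        return self.token
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;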

&lt;h2&gt;The Hidden Cost: Rewrites Later&lt;/h2&gt;

&lt;p&gt;Put all of this together and a pattern emerges.&lt;/p&gt;

&lt;p&gt;If you are building serious MCP infrastructure today, you are not just implementing the spec. You are filling in gaps around session management, async orchestration, discovery, and authentication.&lt;/p&gt;

&lt;p&gt;Those gaps come with a cost. Once these features land in the official spec, a lot of today's custom infrastructure will need to be reworked or replaced. Some abstractions will survive. Many will not.&lt;/p&gt;

&lt;p&gt;If you are designing systems now, it is worth being explicit about where you are deviating from the spec and how hard it will be to unwind later.&lt;/p&gt;

&lt;h2&gt;About Those "Production Ready" Claims&lt;/h2&gt;

&lt;p&gt;This also makes it hard to take "production ready" MCP gateway claims at face value in April 2026.&lt;/p&gt;

&lt;p&gt;There are usually two possibilities. Either the deployment is small enough that these issues have not surfaced yet, or the vendor has built proprietary extensions on top of MCP.&lt;/p&gt;

&lt;p&gt;Neither is inherently wrong, but both are very different from what the marketing suggests.&lt;/p&gt;

&lt;h2&gt;The Good News&lt;/h2&gt;

&lt;p&gt;None of this is a knock on MCP itself.&lt;/p&gt;

&lt;p&gt;The shape of the protocol feels right. The direction is solid. The roadmap is transparent about what is missing, which is more than can be said for many standards at this stage.&lt;/p&gt;

&lt;p&gt;But the reality is simple. Production grade tooling is still catching up.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>sre</category>
    </item>
    <item>
      <title>Are Execution-First Models Getting Underrated for Agent Workflows?</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Mon, 27 Apr 2026 04:50:33 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/are-execution-first-models-getting-underrated-for-agent-workflows-2egj</link>
      <guid>https://dev.to/claire_nguyen/are-execution-first-models-getting-underrated-for-agent-workflows-2egj</guid>
      <description>&lt;p&gt;A lot of model discussion still gravitates toward benchmark screenshots, clever chat demos, or long reasoning traces that look impressive at first glance. Those things are easy to share and easy to evaluate in isolation.&lt;/p&gt;

&lt;p&gt;But once a model is actually embedded inside a product or an agent workflow, I am not convinced those are the most useful ways to think about performance anymore.&lt;/p&gt;

&lt;p&gt;The question I keep coming back to is much simpler: &lt;strong&gt;how much useful work does the model actually get done per token, per step, and per retry?&lt;/strong&gt;&lt;/p&gt;
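
&lt;p&gt;One way to make that question measurable is to score whole traces rather than single turns. A rough sketch of the metric I mean, with made-up trace fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough "execution per token" metric over agent traces. A trace here is
# a dict with tokens spent, retries taken, and whether the task actually
# completed; the field names are made up for illustration.
def execution_per_token(traces):
    tokens = sum(t["total_tokens"] for t in traces)
    retries = sum(t["retries"] for t in traces)
    done = sum(1 for t in traces if t["completed"])
    return {
        "tasks_per_1k_tokens": 1000 * done / max(tokens, 1),
        "retries_per_task": retries / max(done, 1),
        "completion_rate": done / max(len(traces), 1),
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;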

&lt;h2&gt;Shifting the lens to execution&lt;/h2&gt;

&lt;p&gt;This is where models like Ling-2.6-1T start to stand out.&lt;/p&gt;

&lt;p&gt;What makes it interesting is not just its size. It is the way it seems to be positioned. It feels much more execution-first. The focus appears to be on precise instruction following, handling long context without falling apart, fitting naturally into tool use, and maintaining tighter token discipline.&lt;/p&gt;

&lt;p&gt;It is less about producing responses that &lt;em&gt;look&lt;/em&gt; impressive and more about moving tasks forward efficiently.&lt;/p&gt;

&lt;p&gt;That distinction matters more than it might seem.&lt;/p&gt;

&lt;h2&gt;Where real workflows break down&lt;/h2&gt;

&lt;p&gt;In real systems, the biggest pain points are rarely about whether a model seems reflective or articulate.&lt;/p&gt;

&lt;p&gt;The issues tend to show up elsewhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chains start to drift off task&lt;/li&gt;
&lt;li&gt;Retries become expensive and unpredictable&lt;/li&gt;
&lt;li&gt;Intermediate steps consume too many tokens&lt;/li&gt;
&lt;li&gt;Tool calls become inconsistent or brittle&lt;/li&gt;
&lt;li&gt;Multi-step workflows lose structure over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Individually, these problems feel small. But at scale, they compound quickly. What starts as a minor inefficiency turns into real cost, latency, and operational friction.&lt;/p&gt;

&lt;p&gt;In that environment, a model that is slightly more disciplined and slightly more direct can outperform one that is more “impressive” in a single turn.&lt;/p&gt;

&lt;h2&gt;The hidden cost of reasoning overhead&lt;/h2&gt;

&lt;p&gt;There is also a subtle cost to models that lean heavily into visible reasoning.&lt;/p&gt;

&lt;p&gt;Long reasoning traces can be useful for debugging or transparency. But they also introduce overhead. More tokens, more latency, and more surface area for errors or drift.&lt;/p&gt;

&lt;p&gt;If that extra reasoning does not translate into better execution across steps, it becomes hard to justify in production workflows.&lt;/p&gt;

&lt;p&gt;Execution-first models, on the other hand, tend to optimize for forward progress. They aim to do the right thing with fewer steps, fewer tokens, and fewer retries.&lt;/p&gt;

&lt;p&gt;That tradeoff is not always obvious in demos, but it becomes very clear in sustained usage.&lt;/p&gt;

&lt;h2&gt;Rethinking what “good” looks like&lt;/h2&gt;

&lt;p&gt;If the goal is to build reliable agent systems, the definition of a “good” model might need to shift.&lt;/p&gt;

&lt;p&gt;It is less about maximum reasoning depth in a single interaction, and more about consistency across many interactions.&lt;/p&gt;

&lt;p&gt;Can the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read messy, real-world context without losing track?&lt;/li&gt;
&lt;li&gt;Preserve task structure across multiple steps?&lt;/li&gt;
&lt;li&gt;Call tools correctly and at the right time?&lt;/li&gt;
&lt;li&gt;Avoid unnecessary verbosity and token waste?&lt;/li&gt;
&lt;li&gt;Recover cleanly when something goes wrong?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not the most glamorous benchmarks, but they are the ones that determine whether a system actually works.&lt;/p&gt;

&lt;h2&gt;Open question&lt;/h2&gt;

&lt;p&gt;So I am curious how others are thinking about this.&lt;/p&gt;

&lt;p&gt;If the real objective is to move multi-step work forward reliably, are we still overvaluing maximum reasoning depth?&lt;/p&gt;

&lt;p&gt;And on the flip side, are we undervaluing execution-per-token as a core metric for agent workflows?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>automation</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>The Actual Cost of Self-Hosting Your LLM (Nobody Does This Math First)</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 11:28:37 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/the-actual-cost-of-self-hosting-your-llm-nobody-does-this-math-first-1k88</link>
      <guid>https://dev.to/claire_nguyen/the-actual-cost-of-self-hosting-your-llm-nobody-does-this-math-first-1k88</guid>
      <description>


&lt;p&gt;&lt;strong&gt;TL;DR: Self-hosting an LLM looks cheaper until you add up everything else. This post walks through the real numbers — compute, storage, networking, and the ops overhead you will absolutely pay. Know the full picture before you spin up that g5 instance.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The Decision That Seems Obvious&lt;/h2&gt;

&lt;p&gt;Your team is spending $4k/month on OpenAI. Someone runs the numbers on a g5.xlarge. "We could host our own model and cut costs by 70%." Sounds great on paper.&lt;/p&gt;

&lt;p&gt;Then you actually do it.&lt;/p&gt;

&lt;p&gt;I've watched this play out a few times. The initial estimate is real. The full picture is not. So let me walk you through what the bill actually looks like twelve months in.&lt;/p&gt;
&lt;h2&gt;Compute: The Number You Actually Know&lt;/h2&gt;

&lt;p&gt;A g5.xlarge runs about $1.006/hour on-demand. Running inference 24/7 for a mid-sized model, that's roughly $740/month per instance. For any real load, you'll need more than one.&lt;/p&gt;

&lt;p&gt;Go with a g5.12xlarge for a serious 7B model? You're at $5.67/hour. That's $4,082/month before you've touched anything else.&lt;/p&gt;

&lt;p&gt;Spot instances help. You can get 50–70% off. But spot instances get interrupted, and your inference service needs to handle that gracefully or users see errors. Not impossible, but that's engineering work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EKS node group config for spot GPU instances&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;

&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-spot&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;g5.xlarge"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;g5.2xlarge"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;spot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;workload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-inference&lt;/span&gt;
    &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
    &lt;span class="na"&gt;iam&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;withAddonPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;autoScaler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The autoscaler handles scaling down. You still need logic to drain in-flight requests before a spot interruption kills the node. AWS gives you a two-minute warning, so your request timeout has to live inside that window.&lt;/p&gt;
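
&lt;p&gt;The watcher is only a few lines. The metadata URL below is real AWS behaviour (it returns 404 until a notice exists); the two drain hooks are placeholders for whatever your server exposes, and IMDSv2 token handling is omitted for brevity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Watch for the spot interruption notice and drain before the node dies.
# stop_accepting_requests / finish_in_flight are hypothetical hooks into
# your own inference server.
import time
import httpx

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(stop_accepting_requests, finish_in_flight):
    while True:
        resp = httpx.get(NOTICE_URL, timeout=1.0)
        if resp.status_code == 200:          # notice issued: ~2 minutes left
            stop_accepting_requests()         # fail fast on new work
            finish_in_flight(deadline_s=90)   # leave headroom inside the window
            return
        time.sleep(5)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;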

&lt;h2&gt;Storage, Networking, and the Bits Everyone Forgets&lt;/h2&gt;

&lt;p&gt;Model weights are not small. Llama 3 8B in fp16 is about 16GB. Mixtral 8x7B hits 87GB. You're storing these in S3 or EFS and loading them on pod startup.&lt;/p&gt;

&lt;p&gt;EFS for shared model weights runs roughly $0.30/GB/month. For a 50GB model, that's $15/month. Fine. But EFS read throughput matters at startup. Loading 50GB cold takes minutes per pod. Pre-warm your nodes or bake model weights into an EBS snapshot on your AMI.&lt;/p&gt;

&lt;p&gt;Networking is where it gets sneaky. Data transfer out of AWS costs $0.09/GB once you're past the monthly free allowance. If you're serving inference to users outside AWS — which is most teams — every response adds up. A million requests/month at 1MB average response is about a terabyte, roughly $90 in egress. At 10 million requests, that's $900, and nobody put it in the budget.&lt;/p&gt;
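
&lt;p&gt;The arithmetic is worth writing down, because it scales linearly with traffic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope egress cost, assuming $0.09/GB past the free allowance.
def egress_cost(requests_per_month, avg_response_mb, price_per_gb=0.09):
    gb = requests_per_month * avg_response_mb / 1024
    return gb * price_per_gb

print(egress_cost(1_000_000, 1.0))   # ~$88/month
print(egress_cost(10_000_000, 1.0))  # ~$879/month
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;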

&lt;h2&gt;The Ops Tax Is Real&lt;/h2&gt;

&lt;p&gt;This is the one that bites hardest. You now own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model versioning and rollout&lt;/li&gt;
&lt;li&gt;GPU driver updates&lt;/li&gt;
&lt;li&gt;CUDA compatibility between driver and runtime&lt;/li&gt;
&lt;li&gt;OOM debugging (GPU OOM behaves nothing like CPU OOM)&lt;/li&gt;
&lt;li&gt;Observability for inference latency, token throughput, GPU utilisation&lt;/li&gt;
&lt;li&gt;On-call when the model service falls over at 2am&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the real cost. Human time. If your team hasn't run LLM inference at scale before, budget a few months of learning. GPU OOM crashes are cryptic. NCCL errors for multi-GPU setups will make you question your life choices.&lt;/p&gt;

&lt;h2&gt;Self-Hosted vs Managed API: The Honest Table&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Self-Hosted&lt;/th&gt;
&lt;th&gt;Managed API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Compute cost&lt;/td&gt;
&lt;td&gt;Medium (but tunable)&lt;/td&gt;
&lt;td&gt;High per token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ops burden&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency control&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model flexibility&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Vendor-locked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data privacy&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Depends on vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first token&lt;/td&gt;
&lt;td&gt;Days to weeks setup&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling complexity&lt;/td&gt;
&lt;td&gt;You own it&lt;/td&gt;
&lt;td&gt;Handled&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No universally right answer here. High-volume batch workloads with infrastructure chops? Self-hosting wins. Need to move fast? Managed APIs are fine.&lt;/p&gt;

&lt;p&gt;Some teams go hybrid: managed API for interactive use, self-hosted batch for cost-sensitive workloads. Tools like Bifrost help route between providers and track actual spend across both, which makes the hybrid model easier to reason about.&lt;/p&gt;

&lt;h2&gt;Trade-offs and Limitations&lt;/h2&gt;

&lt;p&gt;Self-hosting wins on cost only when GPU utilisation is high. A GPU sitting idle at 20% utilisation is expensive. You need to be hitting 60–70% to beat per-token API pricing.&lt;/p&gt;

&lt;p&gt;Spot instances cut compute cost but add operational complexity. Async batch jobs can tolerate interruptions. Real-time inference is harder to design around.&lt;/p&gt;

&lt;p&gt;Multi-GPU setups for larger models introduce inter-node networking costs and NCCL configuration pain. Worth it at scale. Not worth it for a team of five still figuring out their workload shape.&lt;/p&gt;

&lt;p&gt;Model quantisation (int4, int8) can cut memory requirements dramatically. Running Llama 3 70B in int4 on two A10G GPUs is real. The quality trade-off is acceptable for most tasks, but always benchmark before committing to a quantisation level in production.&lt;/p&gt;
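
&lt;p&gt;For reference, 4-bit loading through Transformers looks roughly like this (bitsandbytes under the hood; treat it as a starting point, not a tuned config):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Load a model in int4 with bitsandbytes via Transformers. Cuts weight
# memory roughly 4x vs fp16; benchmark quality before shipping.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
quant = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",   # shard across both A10Gs automatically
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;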

&lt;h2&gt;The Actual Recommendation&lt;/h2&gt;

&lt;p&gt;Run the full numbers. Not the compute number. The full number: compute, storage, networking, engineering time, on-call time.&lt;/p&gt;
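
&lt;p&gt;A back-of-envelope version of that math, with every figure an assumption pulled from this post that you should replace with your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Full-picture monthly cost sketch. Every figure is an assumption;
# substitute your own before deciding anything.
def self_hosted_monthly(
    instances=2, instance_hr=1.006,     # g5.xlarge on-demand
    storage=15.0,                       # EFS for ~50GB of weights
    egress=90.0,                        # 1M requests at ~1MB each
    eng_hours=20, eng_hr_cost=100.0,    # the ops tax, honestly counted
):
    compute = instances * instance_hr * 730
    return compute + storage + egress + eng_hours * eng_hr_cost

print(f"${self_hosted_monthly():,.0f}/month")  # ~$3,574 vs a $4k API bill
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;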

&lt;p&gt;If the math works and your team has the capacity, self-hosting is genuinely solid. You get control, data privacy, and cost efficiency at volume.&lt;/p&gt;

&lt;p&gt;If you're not sure yet, start with a managed API. Get your usage patterns nailed down. Then revisit.&lt;/p&gt;

&lt;p&gt;No worries either way. Both paths are valid. The mistake is assuming one is clearly better without doing the math first.&lt;/p&gt;

&lt;h2&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;AWS EC2 GPU Instance Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM: High-throughput LLM Serving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/spot/instance-advisor/" rel="noopener noreferrer"&gt;AWS Spot Instance Advisor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/" rel="noopener noreferrer"&gt;NVIDIA Triton Inference Server Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ray-project/llmperf/" rel="noopener noreferrer"&gt;LLMPerf: Benchmarking LLM Inference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>ML Infrastructure Renaissance: What Everyone's Missing About GPU Orchestration</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Sun, 19 Apr 2026 20:25:34 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/ml-infrastructure-renaissance-what-everyones-missing-about-gpu-orchestration-5fd5</link>
      <guid>https://dev.to/claire_nguyen/ml-infrastructure-renaissance-what-everyones-missing-about-gpu-orchestration-5fd5</guid>
      <description>&lt;h2&gt;
  
  
  What's Your Experience?
&lt;/h2&gt;

&lt;p&gt;I'm curious to hear from other infrastructure engineers: what's the most overlooked challenge you're seeing in ML infrastructure today? Are there tools or patterns that have worked particularly well for your teams?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>When Your LLM Provider Pulls the Rug: Lessons from Anthropic's OAuth Shutdown</title>
      <dc:creator>claire nguyen</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:27:09 +0000</pubDate>
      <link>https://dev.to/claire_nguyen/when-your-llm-provider-pulls-the-rug-lessons-from-anthropics-oauth-shutdown-31hh</link>
      <guid>https://dev.to/claire_nguyen/when-your-llm-provider-pulls-the-rug-lessons-from-anthropics-oauth-shutdown-31hh</guid>
      <description>&lt;p&gt;On April 4th, 2026, Anthropic revoked OAuth access for OpenClaw, killing over 135,000 integrations overnight. Developers who built workflows around Claude Pro/Max subscriptions woke up to broken pipelines and a 10-50x cost increase to stay on Anthropic via API.&lt;/p&gt;

&lt;p&gt;This wasn't just another API change—it was a stark reminder that when your authentication and routing go through a mechanism that your provider controls exclusively, they can pull the rug whenever their business priorities shift.&lt;/p&gt;

&lt;h2&gt;The Hidden Dependency Trap&lt;/h2&gt;

&lt;p&gt;We had two internal services that authenticated directly against Anthropic's API. When the OpenClaw situation happened, we weren't directly affected, but it prompted a thorough audit of our LLM infrastructure. What we found was concerning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardcoded provider logic&lt;/strong&gt;: Model names baked into agent configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different auth patterns&lt;/strong&gt;: Each provider has unique token refresh flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incompatible error handling&lt;/strong&gt;: Different auth failure codes across providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost lock-in&lt;/strong&gt;: Subscription models vs API pricing created massive cost disparities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swapping from Anthropic to OpenAI or Gemini wasn't just a model change—it required rewriting authentication layers, updating response parsing, and redeploying services.&lt;/p&gt;

&lt;h2&gt;The Gateway Pattern Solution&lt;/h2&gt;

&lt;p&gt;We moved to routing everything through a centralized gateway that manages provider credentials. Our application code now authenticates against our own gateway API keys, and the gateway handles per-provider authentication.&lt;/p&gt;
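
&lt;p&gt;From the application side the change is small: code holds one of our gateway keys and a logical model name, and nothing provider-specific. A hypothetical sketch of a caller; the endpoint shape is ours, not a standard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Application code after the move: one credential (ours), one logical
# model name. Provider choice, auth, and failover live in the gateway.
import os
import httpx

GATEWAY_URL = "https://llm-gateway.internal/v1/complete"  # our endpoint

def complete(prompt, model="default-chat"):
    resp = httpx.post(
        GATEWAY_URL,
        headers={"Authorization": f"Bearer {os.environ['GATEWAY_API_KEY']}"},
        json={"model": model, "prompt": prompt},  # logical name, not a provider's
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()["text"]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;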

&lt;p&gt;&lt;strong&gt;Benefits we gained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider agnosticism&lt;/strong&gt;: Swap models without touching application code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized monitoring&lt;/strong&gt;: Single point for health checks and metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic failover&lt;/strong&gt;: Gateway can reroute requests during outages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt;: Route to cheapest provider that meets quality thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Implementation Strategy&lt;/h2&gt;

&lt;p&gt;Here's how we structured our gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified gateway routing logic
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMGateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;AnthropicProvider&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;GeminiProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_checker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HealthChecker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cheapest_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Check provider health and availability
&lt;/span&gt;        &lt;span class="n"&gt;healthy_providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;health_checker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_healthy_providers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Apply routing strategy
&lt;/span&gt;        &lt;span class="n"&gt;selected_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_select_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;healthy_providers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;fallback_strategy&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Abstract provider specifics&lt;/strong&gt;: Don't let provider-specific logic leak into your application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assume instability&lt;/strong&gt;: Treat LLM providers as unreliable dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build for portability&lt;/strong&gt;: Make swapping providers a config change, not a code change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor everything&lt;/strong&gt;: Health checks should be continuous, not reactive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for failure&lt;/strong&gt;: Have automatic fallback routes for every critical path&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;The Bigger Picture&lt;/h2&gt;

&lt;p&gt;This pattern extends beyond LLMs. Any third-party service that could change its pricing, access model, or business priorities should be abstracted through a gateway. Whether it's payment processors, email services, or cloud storage—the principle remains the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most resilient systems aren't the ones that never fail. They're the ones that fail gracefully and recover automatically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By treating LLM providers as interchangeable components rather than foundational infrastructure, we build systems that can adapt to market changes without requiring massive rewrites.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What provider abstraction patterns are you using in your AI infrastructure? Share your experiences in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
  </channel>
</rss>
