<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Cloudstar</title>
    <description>The latest articles on DEV Community by Alex Cloudstar (@alexcloudstar).</description>
    <link>https://dev.to/alexcloudstar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1190670%2F18910089-3a37-4072-9b4c-289211f053eb.JPG</url>
      <title>DEV Community: Alex Cloudstar</title>
      <link>https://dev.to/alexcloudstar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexcloudstar"/>
    <language>en</language>
    <item>
      <title>Durable AI Workflows in 2026: Why Your Next AI Feature Needs Orchestration</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:47:28 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/durable-ai-workflows-in-2026-why-your-next-ai-feature-needs-orchestration-1djd</link>
      <guid>https://dev.to/alexcloudstar/durable-ai-workflows-in-2026-why-your-next-ai-feature-needs-orchestration-1djd</guid>
      <description>&lt;p&gt;I shipped an AI feature last fall that took an input document, called a large language model to extract structured data, called a second model to validate it, posted the results to a webhook, and then emailed the user. The whole thing took between 40 seconds and 3 minutes depending on the document size.&lt;/p&gt;

&lt;p&gt;It worked perfectly in testing. It worked for the first hundred users in production. Then a network hiccup took out the LLM provider for 90 seconds during a busy afternoon, and I discovered the hard way that I had built a very expensive way to lose data.&lt;/p&gt;

&lt;p&gt;My serverless function timed out. The retry was another full run from scratch, which hit the LLM a second time for tokens I had already paid for. Users saw errors. Some of them got two emails. A few of them got neither because the second run failed at a different step and the retry count hit zero.&lt;/p&gt;

&lt;p&gt;I spent the next weekend rewriting the whole thing on top of a durable workflow engine. The problem was not that I had bad code. The problem was that I was using request-response infrastructure to run a multi-step, long-running, stateful process. That is not what serverless functions are for, and pretending it is leads to exactly the kind of failure I walked into.&lt;/p&gt;

&lt;p&gt;This post is the guide I wish I had before I shipped that feature. It covers what durable workflows are, why AI features need them more than almost any other category of work, and how to choose between Inngest, Trigger.dev, and Vercel Workflow in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Breaks When AI Meets Serverless
&lt;/h2&gt;

&lt;p&gt;The default pattern for shipping a feature in 2026 looks something like this: Next.js or a similar framework, an API route that handles a request, some business logic, maybe a database call, and a response. This pattern is fast, cheap, and covers 90 percent of what most web apps do.&lt;/p&gt;

&lt;p&gt;It also breaks in predictable ways when AI gets involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeouts.&lt;/strong&gt; LLM calls are slow. A single Claude or GPT call is typically a few seconds. A chain of them can take minutes. Vercel raised the default function timeout to 300 seconds in 2025, which helps, but a multi-step agent can easily exceed that. If your function times out mid-run, you lose the work in progress and any external side effects you already triggered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries.&lt;/strong&gt; When an LLM provider has an outage or rate limits you, you need to retry. Naive retries cause duplicate emails, duplicate database writes, and duplicate bills. Smart retries require keeping track of which steps have already succeeded so you can resume from where you left off instead of starting over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Every retry on an LLM call costs real money. A workflow that reruns from scratch on every failure can 2x or 3x your AI costs during a bad day with a provider. For features where each run is cheap this is tolerable. For agentic workflows that use 50,000 tokens per run, it is a budget problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; When a multi-step AI workflow fails, you need to know which step failed, with what input, and with what output from the previous steps. Tracing this in a standard logging setup is painful. You end up grepping logs across multiple function invocations, trying to correlate request IDs that may not even exist on retries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency.&lt;/strong&gt; If a user kicks off ten AI workflows at once, you want to throttle them so you do not blow up your rate limits with your LLM provider. Standard serverless functions have no built-in way to do this without building your own queue, and &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization in production&lt;/a&gt; depends on getting this right.&lt;/p&gt;
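&lt;p&gt;To make the concurrency point concrete, here is the kind of gate you end up hand-rolling without an engine. This is an illustrative in-memory sketch, not any tool's API; the point of the tools below is that this becomes a line of configuration instead of code you maintain.&lt;/p&gt;

```typescript
// Hand-rolled concurrency gate: at most `limit` tasks in flight at once.
// Purely illustrative; workflow engines ship this as configuration.
function createLimiter(limit: number) {
  let active = 0;
  const waiters: (() => void)[] = [];

  return async function limited(task: () => unknown) {
    if (active >= limit) {
      // Park this caller until a running task releases a slot.
      await new Promise((resolve) => waiters.push(() => resolve(undefined)));
    }
    active += 1;
    try {
      return await task();
    } finally {
      active -= 1;
      const next = waiters.shift();
      if (next) next(); // wake exactly one queued caller
    }
  };
}
```

&lt;p&gt;Wrapping each LLM call in &lt;code&gt;limited&lt;/code&gt; keeps a burst of ten user requests from becoming ten simultaneous provider calls, which is the same property the engines below give you declaratively.&lt;/p&gt;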

&lt;p&gt;These are not edge cases. They are the default failure modes for any AI feature that does more than a single one-shot completion. The moment you chain two LLM calls together, or mix an LLM call with an external API, or run something that takes longer than a normal HTTP request, you are in workflow territory whether you planned for it or not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Durable Workflows Actually Are
&lt;/h2&gt;

&lt;p&gt;The term "durable workflow" sounds like enterprise jargon, but the idea is simple.&lt;/p&gt;

&lt;p&gt;A durable workflow is a function where each step is checkpointed. When a step succeeds, the result is persisted. If the workflow fails partway through, the engine resumes from the last successful step instead of starting over. The function can take minutes, hours, or days to complete. It can pause to wait for external events. It can sleep for a week and then resume. All of this is handled by the engine, not by you.&lt;/p&gt;

&lt;p&gt;The programming model looks almost identical to normal async code. You write a function with steps. Each step is a regular async operation. The engine wraps each step to persist its result and provide the persisted result on replay if the step has already run.&lt;/p&gt;
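&lt;p&gt;The core of that contract fits in a few lines. The sketch below is an in-memory toy, assuming only that step names are unique within a run; real engines persist the checkpoints durably and replay across processes and deploys.&lt;/p&gt;

```typescript
// Toy checkpointing: a step's result is saved under its name, and a
// replayed run returns the saved result instead of executing again.
function createRun(saved: { [name: string]: unknown }) {
  return {
    async step(name: string, fn: () => unknown) {
      if (name in saved) return saved[name]; // already ran: skip the work
      const result = await fn();
      saved[name] = result; // checkpoint before moving to the next step
      return result;
    },
  };
}
```

&lt;p&gt;A retried run that is handed the same checkpoint store never re-executes a step that already succeeded, which is exactly the property that keeps a provider blip in step 3 from re-billing steps 1 and 2.&lt;/p&gt;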

&lt;p&gt;The magic is that failures become survivable. A network blip in step 3 of a 5 step workflow does not lose the work from steps 1 and 2. A provider outage does not double bill you. A deploy in the middle of a running workflow does not drop it on the floor. These are not optimizations. They are the baseline behavior.&lt;/p&gt;

&lt;p&gt;This is the model Temporal popularized in the enterprise. What changed in 2026 is that the pattern finally got accessible to indie developers and small teams, with tools that work natively with Next.js, serverless functions, and modern TypeScript stacks. You no longer need a dedicated worker infrastructure to run durable workflows. You can run them on the same platform as the rest of your app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inngest: The Mature Choice
&lt;/h2&gt;

&lt;p&gt;Inngest has been in the durable workflow space longer than most of the current competitors. It is a hosted service with a TypeScript SDK that defines workflows as functions with steps, using a familiar async pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The developer experience is polished. Defining a workflow looks like writing a regular async function with a few wrapper calls. You call &lt;code&gt;step.run&lt;/code&gt; for operations that should be checkpointed, &lt;code&gt;step.sleep&lt;/code&gt; for delays, and &lt;code&gt;step.waitForEvent&lt;/code&gt; for waiting on external triggers. There is no special syntax to learn and the types are strong.&lt;/p&gt;
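&lt;p&gt;For a feel of the shape, here is a sketch in that style. The &lt;code&gt;step.run&lt;/code&gt; and &lt;code&gt;step.sleep&lt;/code&gt; calls mirror the step API named above, but the &lt;code&gt;step&lt;/code&gt; object here is a local stub so the sketch is self-contained, and the step names and fake LLM results are made up; in real code the step object is handed to your function by the SDK.&lt;/p&gt;

```typescript
// Stub standing in for the step object the SDK passes to your function.
// The real SDK checkpoints each step.run result and durably suspends on
// step.sleep; the stub just executes, so the workflow shape is runnable.
const step = {
  async run(name: string, fn: () => unknown) {
    return fn();
  },
  async sleep(name: string, duration: string) {
    return undefined;
  },
};

// Workflow body: each expensive call is its own named, checkpointable step.
async function handleUpload(documentId: string) {
  const extracted = await step.run("extract", async () => "fields for " + documentId);
  const validated = await step.run("validate", async () => "validated " + extracted);
  await step.sleep("cool-off", "10s"); // durable pause, not a blocked function
  await step.run("notify", async () => "emailed: " + validated);
  return validated;
}
```

&lt;p&gt;The body reads like ordinary async code, which is the point: the durability lives in the wrapper, not in your logic.&lt;/p&gt;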

&lt;p&gt;Event-driven triggers are a first-class concept. Instead of calling a workflow directly, you emit an event, and Inngest decides which workflows should run based on event matching rules. This is the right pattern for anything that involves user actions triggering background work, and it composes cleanly as your app grows.&lt;/p&gt;

&lt;p&gt;The local development story is good. Inngest has a local dev server that mirrors production behavior, so you can iterate on workflows without deploying. The dashboard shows you every run, every step, every input, every output. When something goes wrong, you can see exactly what happened and often just click to replay from a failed step.&lt;/p&gt;

&lt;p&gt;Concurrency and rate limiting are built in. You can limit a workflow to process at most 5 runs concurrently per user, or throttle invocations to 10 per second per integration, or back off exponentially on retry. For AI features that need to stay under LLM rate limits, this is the feature you did not know you needed until you shipped without it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The hosted pricing can get expensive for high-volume workflows. Inngest charges based on step executions and concurrency, and both scale with how chatty your workflows are. For a workflow that checkpoints a lot of small steps, the bill adds up.&lt;/p&gt;

&lt;p&gt;Self-hosting is possible but more involved than the managed service suggests. If you want to run Inngest on your own infrastructure to control costs or compliance, expect to spend time on the deployment.&lt;/p&gt;

&lt;p&gt;The abstraction is opinionated about event-driven triggers. If your mental model is "call this workflow now and wait for the result," Inngest supports it but the ergonomics lean toward async event-driven patterns. This is usually the right pattern, but it can feel foreign if you are coming from a simpler background job queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Inngest is the right choice if you are building an event-driven system, care about first-class concurrency controls, and want a polished managed service. It is also the choice with the longest track record, so if you are risk averse, it is the safe pick.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trigger.dev: The Open Source Friendly Pick
&lt;/h2&gt;

&lt;p&gt;Trigger.dev took a different path. It is open source, self-hostable from day one, and focuses on making background jobs and workflows accessible with a minimum of ceremony. Version 3, which is the version you should be using in 2026, is a full rewrite that added durable execution and significantly improved the developer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The setup is the fastest of the three tools I tested. You install the SDK, define a task with a simple wrapper function, and it is ready to run. For quick prototyping or for developers who want to minimize the conceptual overhead of adopting a new tool, Trigger.dev is the lightest lift.&lt;/p&gt;

&lt;p&gt;The self-hosting story is first class. The open source version of Trigger.dev runs as a Docker container and has feature parity with the managed cloud product. For teams that need to own their infrastructure for compliance or cost reasons, this is a significant advantage over the more managed-first alternatives.&lt;/p&gt;

&lt;p&gt;The dashboard is genuinely nice. You get a live view of running tasks, a history of past runs, the ability to replay from any step, and polished tooling for debugging failed runs. For AI workflows specifically, being able to see exactly what each LLM call received and returned is invaluable when you are tracking down a bad completion.&lt;/p&gt;

&lt;p&gt;The SDK handles common AI patterns well. There is built in support for streaming responses, long running inference calls, and checkpointing expensive LLM outputs so you do not rerun them on retry. This is the kind of domain-specific polish that separates a tool that merely works for AI from one that was designed for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The platform is younger than Inngest. Some advanced features like sophisticated event matching, complex concurrency policies, and multi-tenant controls are either newer or still in development. For a simple AI workflow this does not matter. For a complex multi-tenant SaaS with intricate routing needs, it might.&lt;/p&gt;

&lt;p&gt;The managed cloud pricing is competitive, but the tool is still finding its positioning. I have seen several pricing adjustments in the last year, which is normal for a product at this stage but worth knowing if you are trying to budget.&lt;/p&gt;

&lt;p&gt;The ecosystem around triggers and integrations is smaller than Inngest's. Inngest has invested heavily in pre-built integrations with common services. Trigger.dev leans on you to wire up the integrations yourself, which is fine but slightly more work.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Trigger.dev is the right choice if you value open source, want the fastest possible setup, need to self-host, or want a tool that was designed with AI workloads in mind from the start. It is especially strong for indie developers building &lt;a href="https://dev.to/blog/one-person-startup-scaling-2026"&gt;one person startups&lt;/a&gt; who want to control their infrastructure without managing it full time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vercel Workflow: The Native Vercel Pick
&lt;/h2&gt;

&lt;p&gt;Vercel Workflow, sometimes called Vercel Workflow DevKit or WDK, is Vercel's answer to the durable workflow problem. It launched in 2025 and matured throughout 2026 as part of Vercel's broader push to own more of the backend runtime. It runs on Fluid Compute, integrates with the rest of the Vercel platform, and requires no separate infrastructure if you are already deploying on Vercel.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The integration with the Vercel platform is seamless. If your app is already on Vercel, adding a workflow is a matter of creating a new function file with the workflow pattern. No separate service, no additional dashboard, no new billing relationship. Everything shows up in your existing Vercel project.&lt;/p&gt;

&lt;p&gt;The programming model is clean. You write a workflow as a regular async function, mark steps that should be checkpointed, and the runtime handles persistence. The API feels like a natural extension of Next.js rather than an external tool bolted on.&lt;/p&gt;

&lt;p&gt;Cost efficiency is genuinely different. Because Vercel Workflow runs on Fluid Compute, you get the benefits of &lt;a href="https://dev.to/blog/ai-sdk-v6-developer-guide-2026"&gt;function instance reuse and active CPU pricing&lt;/a&gt;. For AI workflows that spend most of their time waiting on LLM responses, you are not paying for idle time the way you would with traditional serverless invocation counts.&lt;/p&gt;

&lt;p&gt;The observability tie-in is strong. Workflow runs show up in the Vercel dashboard alongside your deployments, logs, and other platform metrics. When a workflow fails, you can trace it back to the specific deployment, look at the runtime logs, and see the preview environment context all in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;It only works on Vercel. This is the obvious limitation and it is not going to change. If you are on AWS, Render, Fly, Cloudflare, or self-hosted, Vercel Workflow is not available.&lt;/p&gt;

&lt;p&gt;It is newer than the alternatives. Inngest and Trigger.dev have years of production usage across thousands of applications. Vercel Workflow is production-ready but has less battle-tested coverage of edge cases. For straightforward AI workflows this is fine. For complex orchestration with unusual patterns, you may run into rough edges.&lt;/p&gt;

&lt;p&gt;The ecosystem of patterns, examples, and integrations is smaller. Inngest and Trigger.dev both have mature libraries of patterns for common use cases. Vercel Workflow is catching up but you will sometimes end up implementing things from first principles.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Vercel Workflow is the right choice if you are already on Vercel and want the tightest possible integration with your existing stack. For AI features that are part of a larger Next.js app, the zero-configuration setup and platform-native observability are hard to beat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;After running all three on real projects for the last few months, here is the framework I use to decide which one to reach for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you on Vercel and shipping Next.js?&lt;/strong&gt; Start with Vercel Workflow. The integration is seamless and the setup cost is effectively zero. If you hit a limitation, switching to one of the others is always an option, but most AI features do not hit those limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need to self-host?&lt;/strong&gt; Trigger.dev is the pick. Inngest can be self-hosted but the experience is more involved. Vercel Workflow is not an option off platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is your workflow fundamentally event-driven?&lt;/strong&gt; Inngest is the pick. The event routing and matching features are first class in a way the others are not. For systems where many different triggers can kick off related workflows, Inngest's model is the cleanest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you optimizing for the fastest possible setup?&lt;/strong&gt; Trigger.dev is the pick. The cognitive overhead is the lowest of the three, and for a solo developer trying to ship an AI feature quickly, this matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you care about long term track record and maturity?&lt;/strong&gt; Inngest is the pick. It has been at this the longest and has the largest set of real-world production deployments to learn from.&lt;/p&gt;

&lt;p&gt;For most of my current projects, I end up running Vercel Workflow for the AI features that live inside a Vercel-hosted app, and Trigger.dev for anything that needs to run off platform or where I want to control my own infrastructure. I have stopped reaching for Inngest on new projects mostly because the pricing for the kind of chatty workflows I write adds up faster than the alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Patterns for AI Workflows
&lt;/h2&gt;

&lt;p&gt;Here are a few patterns I learned the hard way. They apply regardless of which tool you pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint LLM calls aggressively.&lt;/strong&gt; Every LLM call should be its own checkpointed step. If the call succeeds, you never want to run it again, because it costs money and the output is not deterministic anyway. Every durable workflow engine handles this well if you mark the step correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store the raw LLM output, not just the parsed version.&lt;/strong&gt; When an LLM call succeeds but the parsing fails, you want to be able to fix the parser and replay without rerunning the LLM. This requires persisting the raw completion, not just the structured result you extracted from it.&lt;/p&gt;
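&lt;p&gt;Concretely, the pattern looks like this sketch. The checkpoint store, the canned completion, and the brittle parser are all stand-ins; the point is only that the expensive step saves raw text, and the parse is a separate, cheap, replayable step.&lt;/p&gt;

```typescript
// Checkpoint the raw completion so a parser fix can replay for free.
const checkpoints: { [name: string]: string } = {};

async function llmStep(prompt: string) {
  if ("llm" in checkpoints) return checkpoints["llm"]; // replay: no second bill
  // Stand-in for the real provider call.
  const raw = 'Sure, here is the data: {"amount": 42}';
  checkpoints["llm"] = raw; // persist the raw text, not the parsed shape
  return raw;
}

// Parsing lives outside the checkpointed step, so fixing a parsing bug
// and replaying never touches the LLM again.
function parseCompletion(raw: string) {
  const start = raw.indexOf("{");
  if (start === -1) throw new Error("no JSON object in completion");
  return JSON.parse(raw.slice(start, raw.lastIndexOf("}") + 1));
}
```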

&lt;p&gt;&lt;strong&gt;Use the workflow engine's native rate limiting.&lt;/strong&gt; Do not build your own throttling layer on top of a workflow engine. Every tool I have covered has built in primitives for this. Use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design steps for idempotency.&lt;/strong&gt; Even with durable workflows, steps can retry. If a step sends an email, sends a webhook, or charges a card, make sure running it twice has the same effect as running it once. Idempotency keys, deduplication tokens, and "has this been done already" checks all matter.&lt;/p&gt;
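&lt;p&gt;The cheapest version of that check is a delivery key derived from the run and the step. In this sketch the key set is a stand-in for a database table with a unique constraint, and the counter stands in for the real email provider call.&lt;/p&gt;

```typescript
// Idempotent side effect: the same (run, step, recipient) key can only
// ever produce one delivery, no matter how many times the step retries.
const sentKeys = new Set();
let deliveries = 0;

function sendEmailOnce(runId: string, stepName: string, to: string) {
  const key = runId + ":" + stepName + ":" + to;
  if (sentKeys.has(key)) return false; // retry hit: already delivered
  sentKeys.add(key);
  deliveries += 1; // stand-in for the actual provider call
  return true;
}
```

&lt;p&gt;Running the step twice with the same key has the same effect as running it once, which is the property the retry machinery depends on.&lt;/p&gt;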

&lt;p&gt;&lt;strong&gt;Keep step inputs small.&lt;/strong&gt; Every step's inputs get persisted. If you pass a large payload to a step, you are paying to serialize, store, and deserialize that payload on every retry. Pass references to stored data rather than the data itself when possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log the prompts and the responses.&lt;/strong&gt; For debugging AI workflows, the prompt-response pair is the source of truth. Log both, correlate them to the workflow run, and make sure you can replay any failed step with the exact same prompt that caused the failure. &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability&lt;/a&gt; is the companion discipline that makes durable workflows debuggable in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you are shipping an AI feature that does more than a single one-shot completion, you need a durable workflow engine. The alternative is not "simpler code." The alternative is a production incident that you will write a blog post about, and the blog post will be shaped a lot like this one.&lt;/p&gt;

&lt;p&gt;Inngest is mature and event-driven. Trigger.dev is open source and fast to adopt. Vercel Workflow is native to Vercel and uses Fluid Compute to keep costs down on long running AI workloads. All three are production ready and all three solve the core problem of multi-step, long-running, stateful AI work.&lt;/p&gt;

&lt;p&gt;The wrong answer is to keep running AI workflows on plain serverless functions and hope that your users never hit a provider outage. The provider outage is coming. The only question is whether your code is ready for it.&lt;/p&gt;

&lt;p&gt;I ended up migrating the feature that ate my weekend to a durable workflow in a single afternoon. The rewrite was smaller than the original implementation because most of the retry logic and state tracking I had built by hand got replaced by the engine. Six months later the feature has weathered three LLM provider incidents without dropping a single run. That is the whole pitch.&lt;/p&gt;

&lt;p&gt;Pick a tool. Migrate your AI workflows. Get your weekends back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>saas</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Code Review Tools in 2026: CodeRabbit vs Greptile vs Vercel Agent</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:46:55 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/ai-code-review-tools-in-2026-coderabbit-vs-greptile-vs-vercel-agent-jdc</link>
      <guid>https://dev.to/alexcloudstar/ai-code-review-tools-in-2026-coderabbit-vs-greptile-vs-vercel-agent-jdc</guid>
      <description>&lt;p&gt;I merged a pull request last month that introduced a race condition in a background worker. Two reviewers had approved it. The tests passed. The staging environment looked fine. The bug surfaced three days later when traffic picked up on a Monday morning, and I spent most of that day unwinding state that had been corrupted across several thousand rows.&lt;/p&gt;

&lt;p&gt;The kicker was that I had an AI code reviewer enabled on the repo. It had flagged exactly the pattern that caused the incident, buried in a list of twelve other comments that were mostly noise. I had trained myself to skim past its output because most of what it said was wrong or pedantic. The one time it was right, I missed it.&lt;/p&gt;

&lt;p&gt;That experience sent me down a rabbit hole. I spent the next six weeks running CodeRabbit, Greptile, and Vercel Agent side by side on three different codebases: a Next.js SaaS, a Bun-based API, and a messy TypeScript monorepo. I wanted to know which one actually catches real bugs without burying them under style nits, and which one is worth paying for when you are a solo developer or a small team.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Code Review Became Table Stakes in 2026
&lt;/h2&gt;

&lt;p&gt;The shift happened faster than I expected. Two years ago, AI code review was a curiosity. Tools like CodeRabbit existed but felt more like linters with LLM sprinkles. By mid 2026, roughly 60 percent of teams with a CI pipeline run some form of automated AI review on every pull request. For solo developers and small teams, adoption is even higher.&lt;/p&gt;

&lt;p&gt;The driver is not hype. It is math. If &lt;a href="https://dev.to/blog/ai-generated-code-technical-debt-2026"&gt;51 percent of GitHub commits are now AI assisted&lt;/a&gt; and bug density in AI generated code runs 35 to 40 percent higher in error paths and boundary conditions, human review alone cannot keep up. You either add more reviewers, which solo developers cannot do, or you add a second set of eyes that scales with commit volume instead of headcount.&lt;/p&gt;

&lt;p&gt;That is the job AI code review is actually doing in 2026. It is not replacing senior engineers. It is catching the boring stuff so human review can focus on architecture, product decisions, and the subtle bugs that require context a tool does not have.&lt;/p&gt;

&lt;p&gt;The question is which tool actually does that job well.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Tools That Matter
&lt;/h2&gt;

&lt;p&gt;There are a dozen AI code review products on the market right now. Most of them are thin wrappers around GPT-4 or Claude with a webhook receiver and a Stripe integration. Three are worth taking seriously because they have either market share, technical differentiation, or native platform integration that the others lack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CodeRabbit&lt;/strong&gt; is the incumbent. It launched in 2023, has the largest install base, and works on every major code host. If you walk into a random startup that has AI review set up, there is a two-in-three chance it is CodeRabbit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greptile&lt;/strong&gt; is the technical favorite. It builds a graph of your codebase and uses that to reason about how changes ripple through the system. Developers who care about review quality over breadth of features tend to end up here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vercel Agent&lt;/strong&gt; is the newcomer. It is part of Vercel's broader push to own the development loop on their platform, and it leans heavily on context about your deployments, runtime logs, and infrastructure to inform reviews. It is in public beta as of early 2026 but improving quickly.&lt;/p&gt;

&lt;p&gt;I ran all three on the same three repos, on the same pull requests, for six weeks. Here is how each one performed.&lt;/p&gt;




&lt;h2&gt;
  
  
  CodeRabbit: The Market Leader
&lt;/h2&gt;

&lt;p&gt;CodeRabbit is the tool most developers have tried and the one most teams are actively using. It integrates with GitHub, GitLab, Bitbucket, and Azure DevOps. It posts inline comments on pull requests, offers a summary of changes, and lets you chat back to clarify or push back on its suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;Setup takes about three minutes. You install the GitHub app, authorize it on the repos you want, and it starts reviewing. No configuration required. The default behavior is sensible and you can tune it later if you want.&lt;/p&gt;

&lt;p&gt;The pull request summaries are genuinely useful. For any PR over a hundred lines, having a TLDR at the top of the thread saves real time during review. I have a bad habit of submitting PRs with sparse descriptions, and CodeRabbit's summary often ends up being a better description than what I wrote.&lt;/p&gt;

&lt;p&gt;The chat feature is the thing I use most. Instead of leaving a comment and waiting for a human reviewer, I can ask CodeRabbit why it flagged something, ask for alternatives, or push back when it is wrong. This back and forth catches maybe one in five false positives and clarifies another one in five.&lt;/p&gt;

&lt;p&gt;Integration breadth is unmatched. It works with Linear, Jira, Notion, Slack, and several of the major CI providers. If you have an existing toolchain, CodeRabbit probably speaks it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The noise problem is real. On a PR with thirty lines of changes, I routinely get eight to fifteen comments. Maybe two or three are genuinely useful. The rest range from "consider renaming this variable" to "this function could return early" to outright wrong suggestions that would break the code if applied.&lt;/p&gt;

&lt;p&gt;You can tune this with configuration, but the tuning is fiddly. The default verbosity is calibrated for teams that want lots of signals and are willing to filter. For solo developers who want fewer, higher quality comments, the defaults are exhausting.&lt;/p&gt;

&lt;p&gt;Context is shallow. CodeRabbit reads the diff and some of the surrounding files, but it does not build a deep model of your codebase. That means it misses bugs that require understanding how a change interacts with code elsewhere in the repo. The race condition I mentioned earlier is a category CodeRabbit is structurally weak at catching.&lt;/p&gt;

&lt;p&gt;Pricing gets aggressive fast. The free tier covers public repos. Paid plans start at 15 dollars per developer per month and scale with PR volume and lines of code. For a small team, the bill adds up quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;CodeRabbit is the best tool if you want broad coverage, fast setup, and integration with an existing toolchain. It is not the best if you want high signal to noise or deep code understanding. Use it for teams that value breadth, skip it if you want precision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Greptile: The Precision Pick
&lt;/h2&gt;

&lt;p&gt;Greptile takes a different architectural approach. Instead of reading the diff and some surrounding files, it indexes your entire codebase and builds a graph of how functions, modules, and types relate to each other. When you submit a PR, it uses that graph to reason about the change in context.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The bug catching is noticeably better. On the same pull requests I ran through CodeRabbit, Greptile caught issues that required understanding code outside the diff. A function signature change that broke a call site three files away. An async pattern that conflicted with how the caller was handling errors. A type narrowing assumption that held in one context but not another.&lt;/p&gt;

&lt;p&gt;Noise is dramatically lower. On a typical PR I get two to four comments. Almost all of them are worth reading. When Greptile flags something, I have trained myself to actually read it, which is the opposite of my experience with most AI reviewers.&lt;/p&gt;

&lt;p&gt;The summaries are precise rather than exhaustive. It does not try to describe everything the PR does. It focuses on the parts that have meaningful implications, including downstream effects that a human reviewer might miss on a first pass.&lt;/p&gt;

&lt;p&gt;Greptile also understands your codebase's conventions over time. After a few weeks on a repo, its suggestions start matching the style and patterns the team uses. CodeRabbit's suggestions feel more generic regardless of how long it has been running on your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;Setup is heavier. Indexing a large codebase takes time and costs compute, which is reflected in pricing. For a small repo, this is not an issue. For a monorepo with millions of lines, the initial indexing can take an hour or more.&lt;/p&gt;

&lt;p&gt;Integration breadth is narrower. Greptile works well with GitHub. GitLab support exists but feels secondary. Bitbucket and Azure DevOps are limited. If you are not on GitHub, CodeRabbit is a more comfortable fit.&lt;/p&gt;

&lt;p&gt;The chat experience is less polished. You can leave comments asking for clarification, but the conversational back-and-forth feels rougher than CodeRabbit's. This is improving, but it is worth noting.&lt;/p&gt;

&lt;p&gt;Pricing is positioned at the higher end. Plans start around 30 dollars per developer per month. The value is real if you care about review quality, but it is not the budget option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;Greptile is the best tool if you want precision over breadth. It catches bugs other tools miss, the noise level is manageable, and the codebase awareness compounds over time. Use it for teams that prioritize quality, skip it if integration breadth or price sensitivity matters more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vercel Agent: The Native Platform Pick
&lt;/h2&gt;

&lt;p&gt;Vercel Agent sits in a slightly different category. It is not just a code reviewer. It is part of Vercel's broader AI layer, which also includes production investigation, automated incident diagnosis, and deployment analysis. The code review feature uses context from your Vercel deployments, runtime logs, and preview environments to inform its suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The production context is genuinely unique. When Vercel Agent reviews a PR, it knows about the preview deployment, which environment variables are set, what the runtime logs show during preview traffic, and whether any errors surfaced in the preview environment. No other AI reviewer has this data.&lt;/p&gt;

&lt;p&gt;This leads to categories of feedback the others cannot provide. Vercel Agent has flagged regressions in preview environments that were not obvious in the code diff. It has surfaced performance changes between commits based on actual deployment metrics. On one PR, it caught a cold start regression that would have been invisible to any tool that only reads the diff.&lt;/p&gt;

&lt;p&gt;Integration with the Vercel ecosystem is seamless. If you are already on Vercel, enabling Agent is a toggle. No app install, no webhook configuration, no separate dashboard. It shows up on your PRs and in your Vercel project overview.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability&lt;/a&gt; angle is interesting. Agent's suggestions often include links to relevant logs, traces, or specific requests that triggered the behavior it is commenting on. That context shortens the time from "this looks like a bug" to "yes, here is exactly what broke."&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;It only works if you are on Vercel. This is the obvious limitation and it is not going away. If your production runs on Render, Fly, AWS, or anywhere else, Vercel Agent is not an option.&lt;/p&gt;

&lt;p&gt;It is still in public beta. The review quality is good but inconsistent. Some PRs get sharp, context-aware feedback. Others get generic comments that feel like any other AI reviewer. This is improving monthly but it is not yet as reliable as the mature tools.&lt;/p&gt;

&lt;p&gt;It optimizes for the Vercel runtime and patterns. If your codebase does weird things that deviate from typical Next.js or Vercel Function conventions, Agent can get confused or miss issues that a more agnostic tool would catch.&lt;/p&gt;

&lt;p&gt;Pricing is bundled into Vercel's usage-based model, which is both good and annoying depending on your perspective. You do not pay a separate per-developer fee, but your Vercel bill does absorb the cost of Agent's reviews and investigations. For heavy users, this is a meaningful line item.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;Vercel Agent is the best tool if you are already on Vercel and care about connecting code review to production behavior. It is not the best if you are on a different platform or if you need a tool that has been battle tested at scale. Use it for Vercel-native teams that want the tightest possible dev loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side by Side: Where Each Tool Wins
&lt;/h2&gt;

&lt;p&gt;Here is how the three stacked up across the dimensions I actually cared about after six weeks of daily use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug catching accuracy.&lt;/strong&gt; Greptile wins. It caught the most real bugs, with the fewest false positives, across all three codebases. Vercel Agent was close for anything involving runtime or deployment context. CodeRabbit trailed on precision but covered more surface area in total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal to noise ratio.&lt;/strong&gt; Greptile wins clearly. Its comment volume is low and its hit rate is high. CodeRabbit produces the most comments overall and has the worst noise ratio on default settings. Vercel Agent is in between and improving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup time.&lt;/strong&gt; CodeRabbit wins. Install the app, authorize, done. Greptile takes longer for the initial index. Vercel Agent is fastest if you are already on Vercel and not an option if you are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration breadth.&lt;/strong&gt; CodeRabbit wins by a significant margin. Greptile covers the essentials. Vercel Agent only works on Vercel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production context.&lt;/strong&gt; Vercel Agent wins. No other tool has access to runtime data, deployment metrics, and preview environment logs. This is a category of value the others structurally cannot match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; CodeRabbit and Vercel Agent are comparable depending on usage. Greptile is the most expensive on a per-developer basis but cheaper when you account for the reviewer time it saves by producing less noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which One Should You Actually Use
&lt;/h2&gt;

&lt;p&gt;If you are a solo developer on a tight budget and your repo is on GitHub, the honest answer is to start with CodeRabbit's free tier or Greptile's trial. CodeRabbit is easier to try and will give you a feel for what AI review does. Greptile is the upgrade if you find yourself ignoring most of CodeRabbit's output.&lt;/p&gt;

&lt;p&gt;If you are a small team of two to five engineers and review quality matters more than integration breadth, Greptile is the pick. The noise reduction alone is worth the higher per-developer cost, and the deep code understanding pays compounding dividends on a stable codebase.&lt;/p&gt;

&lt;p&gt;If you are already on Vercel and shipping Next.js or Vercel Functions as your core stack, add Vercel Agent on top of whatever else you are using. It catches a category of issues the others cannot, and the integration cost is effectively zero. Running Greptile and Vercel Agent together is actually the setup I settled on for my main SaaS project.&lt;/p&gt;

&lt;p&gt;If you are on AWS, Render, Fly, Cloudflare, or any non-Vercel platform, Vercel Agent is out. Choose between CodeRabbit and Greptile based on whether you value breadth or precision.&lt;/p&gt;

&lt;p&gt;Do not run all three on the same repo. The comment overlap creates exactly the noise problem you are trying to avoid. Pick one primary reviewer, maybe add a second if it covers a distinct axis like production context, and trust the signal you get from that setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI Code Review Does Not Replace
&lt;/h2&gt;

&lt;p&gt;One thing is worth being blunt about: none of these tools replace human review on non-trivial changes. They catch common issues, surface obvious problems, and reduce the cognitive load of reading a large diff. They do not understand your product, your customers, or the decisions behind a feature.&lt;/p&gt;

&lt;p&gt;A tool can tell you that a function is inefficient. It cannot tell you that the feature itself is the wrong thing to build. A tool can catch a type error. It cannot tell you that the abstraction you are introducing will make the next three features harder to ship.&lt;/p&gt;

&lt;p&gt;That is the part human review still has to do, and it is the part that does not scale with AI. Treat these tools as a first pass that frees up human attention for the things that actually require human judgment. If you use them to replace all human review, you will ship faster for a few weeks and then hit the exact class of bugs that AI review cannot catch.&lt;/p&gt;

&lt;p&gt;The teams I have seen use these tools well treat them as infrastructure. You set them up, you let them do their job, and you reserve human review for the changes where human judgment actually matters. The teams I have seen struggle treat them as decision makers or try to automate away review entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up AI Code Review the Right Way
&lt;/h2&gt;

&lt;p&gt;A few practical lessons from six weeks of comparing these tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune the verbosity on day one.&lt;/strong&gt; Every tool has a noise problem at default settings. Turn off style suggestions, turn off pedantic comments, and focus the tool on the categories of issue you actually want to catch. Correctness and security issues first, everything else second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create an ignore file for your conventions.&lt;/strong&gt; If your codebase has patterns the tool keeps flagging as issues, document them. CodeRabbit and Greptile both support repo-level configuration that teaches the tool what to stop complaining about. Ten minutes of setup here saves hours of ignored comments later.&lt;/p&gt;
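&lt;p&gt;As a sketch of what that configuration can look like: CodeRabbit reads a &lt;code&gt;.coderabbit.yaml&lt;/code&gt; at the repo root. The keys below are illustrative of the pattern, not a verified schema; check the tool's current docs before relying on any of them.&lt;/p&gt;

```yaml
# .coderabbit.yaml (illustrative sketch; verify keys against current docs)
reviews:
  # Less chatty review style; cuts most stylistic nitpicks.
  profile: chill
  # Skip generated and vendored code entirely.
  path_filters:
    - "!dist/**"
    - "!**/*.generated.ts"
  # Teach the reviewer a repo convention instead of letting it flag it forever.
  path_instructions:
    - path: "src/**/*.ts"
      instructions: "We intentionally use barrel exports; do not flag re-exports from index.ts."
```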

&lt;p&gt;&lt;strong&gt;Review the tool's comments critically.&lt;/strong&gt; Do not blindly apply suggestions. AI review is right often enough to be useful and wrong often enough to cause real damage if you merge without reading. Treat every comment as a suggestion, not an instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combine AI review with &lt;a href="https://dev.to/blog/testing-ai-generated-code-developer-guide-2026"&gt;testing strategies for AI generated code&lt;/a&gt;.&lt;/strong&gt; AI review catches issues at commit time. Tests catch them at runtime. Neither is sufficient alone. The combination is what actually keeps quality up as commit volume increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure whether it is helping.&lt;/strong&gt; After a month of running one of these tools, look at your bug reports. Are you catching things earlier? Are you shipping with fewer post-merge hotfixes? If the answer is no, the tool is not earning its cost and you should either tune it more aggressively or switch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI code review in 2026 is not a future technology. It is now a mandatory piece of infrastructure for any team shipping at meaningful velocity. The question is no longer whether to use it. The question is which one, and how to configure it so it helps instead of generating noise you will learn to ignore.&lt;/p&gt;

&lt;p&gt;CodeRabbit is the safe pick for breadth and integration. Greptile is the precision pick when review quality is the priority. Vercel Agent is the native pick for anyone on the Vercel platform who wants runtime context in their reviews.&lt;/p&gt;

&lt;p&gt;Pick one, tune it for signal, and let it do its job. The cost is real but the cost of a race condition that ships to production on a Friday afternoon is much higher. I know, because I merged one of those, and the AI that flagged it was drowned out by the eleven comments it generated that week that I had already learned to ignore.&lt;/p&gt;

&lt;p&gt;The tool does not save you. The tool plus a minute of attention to its output does.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Cursor vs Windsurf vs Zed: The AI IDE Showdown (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:41:09 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/cursor-vs-windsurf-vs-zed-the-ai-ide-showdown-2026-44eo</link>
      <guid>https://dev.to/alexcloudstar/cursor-vs-windsurf-vs-zed-the-ai-ide-showdown-2026-44eo</guid>
      <description>&lt;p&gt;I have a bad habit of switching editors the moment something shinier appears on my timeline.&lt;/p&gt;

&lt;p&gt;Over the last six months I have used Cursor as my daily driver for two features, Windsurf for one side project, and Zed for the last month with Claude Code wired in. I have opinions. Most of them are different from the opinions I had at the start.&lt;/p&gt;

&lt;p&gt;The short version: these are three genuinely different tools that happen to all call themselves AI IDEs. Picking between Cursor, Windsurf, and Zed in 2026 is not about which one is best in the abstract. It is about which tradeoffs match how you actually work. If you pick wrong, you will spend the first week fighting the editor instead of shipping.&lt;/p&gt;

&lt;p&gt;Here is how they actually compare when you use them for real work, what surprised me, and how I would pick today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Landscape in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Cursor is the AI IDE that won 2024 and 2025 by being the best VS Code fork with AI baked in. Windsurf is Codeium's more autonomous cousin, also a VS Code fork, with an agentic model called Cascade that tries to do more of the work for you. Zed is the odd one out: a Rust-based editor built from scratch for speed, with AI features layered on top (and increasingly, Claude Code as the agentic companion).&lt;/p&gt;

&lt;p&gt;Everyone else (Cline, Aider, Copilot in VS Code, Antigravity, Kiro, Trae) is either a plugin inside another editor or a niche tool worth its own post. The three that most developers are actually picking between are these.&lt;/p&gt;

&lt;p&gt;If you have been using GitHub Copilot inside VS Code and are wondering whether to switch, I wrote about the broader decision in &lt;a href="https://dev.to/blog/claude-code-vs-cursor-vs-github-copilot-2026"&gt;Claude Code vs Cursor vs GitHub Copilot&lt;/a&gt;. This post is specifically about the three AI-native editors most likely to replace your current setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cursor: The Polish Leader
&lt;/h2&gt;

&lt;p&gt;Cursor is what happens when you take VS Code, optimize everything about the AI experience, and charge $20 per month for it.&lt;/p&gt;

&lt;p&gt;The things Cursor does better than everyone else: tab completion that feels telepathic once you get used to it, a chat panel with real codebase understanding via the @codebase command, and multi-file edits that actually work for non-trivial refactors. The UI is familiar to anyone who has used VS Code, all your extensions work, and the learning curve is close to zero.&lt;/p&gt;

&lt;p&gt;After using it daily for months, the workflow that clicks for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Command-K for inline edits on small scoped changes. Highlight five lines, describe the change, accept.&lt;/li&gt;
&lt;li&gt;The agent panel (Cmd-I) for multi-file changes where I have a clear spec. Feed it the spec, review the plan, let it run, inspect the diff.&lt;/li&gt;
&lt;li&gt;@codebase in chat when I need to ask "where does X live" or "how does Y work" without leaving the editor.&lt;/li&gt;
&lt;li&gt;Tab completion everywhere else, all day, constantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where Cursor disappoints: the agent mode still gets confused on large multi-file changes in unfamiliar code. It will confidently produce a plan that looks right and then edit files in ways that compile but miss the point. When this happens, the iteration loop (review, reject, reprompt) is worse than writing it yourself.&lt;/p&gt;

&lt;p&gt;The pricing is straightforward: $20 per month for Pro, which gets you fast requests to the best models. You will hit the fast-request limit if you use it heavily; after that you are on slow requests, which still work but feel noticeably worse. For most professional developers, $20 per month is negligible next to the productivity gain.&lt;/p&gt;

&lt;p&gt;Where Cursor wins the comparison: you can be productive on day one. No learning curve. No surprising behavior. All your VS Code extensions work. If you just want a better VS Code with AI that does not fight you, this is the default answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Windsurf: The Agent-First Bet
&lt;/h2&gt;

&lt;p&gt;Windsurf is the editor you pick when you want the AI to do more of the work, not just the typing.&lt;/p&gt;

&lt;p&gt;Its headline feature is Cascade, the agentic model that can plan and execute multi-step changes across your codebase with less hand-holding than Cursor's agent mode. In practice Cascade feels like you are delegating to a junior developer who occasionally overreaches but gets the easy stuff done while you focus on the hard parts.&lt;/p&gt;

&lt;p&gt;A task I regularly hand to Cascade: "add a rate limiter to the user endpoint with a 60-second window and 100 requests per window, update the tests, add the new middleware to the router." This is three files of coordinated changes. Cascade usually nails it first try. When Cursor's agent does the same task, I get closer to 50% first-pass success.&lt;/p&gt;
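&lt;p&gt;For reference, the task is small but coordinated, which is exactly why first-pass success matters. Here is a minimal, framework-agnostic sketch of the limiter itself, with illustrative names of my own (this is not Cascade output):&lt;/p&gt;

```typescript
// Minimal fixed-window rate limiter: 100 requests per 60-second window per key.
// Names (RateLimiter, allow) are illustrative, not tied to any framework.
class RateLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private readonly windowMs: number = 60_000,
    private readonly maxRequests: number = 100,
  ) {}

  // Returns true if the request is allowed, false if the caller is over the limit.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // New window: reset the counter for this key.
      this.counts.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.maxRequests) return false;
    entry.count += 1;
    return true;
  }
}

// Usage: call from middleware and respond with HTTP 429 when allow() is false.
const limiter = new RateLimiter();
console.log(limiter.allow("user-1")); // first request in the window passes
```

&lt;p&gt;The middleware wiring and the test updates are the other two files; the coordination across them, not the limiter logic, is what agents tend to fumble.&lt;/p&gt;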

&lt;p&gt;The pricing model is different from Cursor in a way that matters. Windsurf offers unlimited autocomplete on its free tier, which is rare. The paid individual plan at $15 per month is cheaper than Cursor and includes Cascade access. Team plans at $30 per user include zero data retention and enterprise features.&lt;/p&gt;

&lt;p&gt;What makes me not use Windsurf as my daily driver despite liking Cascade: the non-AI parts of the editor feel a beat behind Cursor. The tab completion is good but not as uncanny. The inline edit experience is fine but less polished. Extensions mostly work but I have hit a few that do not.&lt;/p&gt;

&lt;p&gt;There is also a trust issue. Cascade will sometimes make changes I did not expect and would not have chosen. Agentic behavior sits on a spectrum between "does the minimum I asked for" and "reshapes the code according to its own judgment." Cascade leans further toward the second end than I prefer. If you like more autonomy from your AI, this is a feature. If you like an AI that stays in its lane, this is friction.&lt;/p&gt;

&lt;p&gt;Where Windsurf wins the comparison: if agentic workflows are the point for you, and you want to hand off entire tasks rather than speed up typing, Cascade is genuinely better than the Cursor agent today. Pair it with a disciplined review process and you will ship more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Zed: The Speed Play
&lt;/h2&gt;

&lt;p&gt;Zed is the editor you pick when you have tried the others, found them slower than your brain, and are willing to give up some polish for raw responsiveness.&lt;/p&gt;

&lt;p&gt;Zed is written in Rust. It starts in under half a second on my machine. Input latency is under 2 milliseconds. Large files open without thinking. Syntax highlighting and autocomplete never hitch. After spending time in Zed, going back to a VS Code fork feels like wading through molasses.&lt;/p&gt;

&lt;p&gt;The AI story in Zed has evolved fast. The built-in assistant is capable and integrates cleanly with Claude, OpenAI, and other providers via API keys. You get inline AI edits (similar to Cursor's Command-K), a chat panel, and agent features that have improved significantly over the last year. But where Zed really shines for AI development in 2026 is its integration with Claude Code as an external agent.&lt;/p&gt;

&lt;p&gt;My current setup: Zed for editing, reading, and navigating. Claude Code running in a side terminal for larger agentic tasks. The editor stays fast because it is not also trying to be an agent. The agent is agentic because it is not also trying to be an editor. The combination is more productive for me than any all-in-one tool has been.&lt;/p&gt;

&lt;p&gt;The downsides of Zed are real. The extension ecosystem is a fraction of VS Code's. Some languages and frameworks have first-class support; others feel underserved. If your daily workflow depends on a specific VS Code extension, you may find Zed cannot replace it yet. The AI features, while good, are not as polished as Cursor's. Tab completion is noticeably less magical.&lt;/p&gt;

&lt;p&gt;The collaboration story is Zed's underappreciated trick. Real-time pair programming is built in. If you are on a small team and occasionally want to mob-program on a gnarly problem, the native multiplayer is genuinely useful.&lt;/p&gt;

&lt;p&gt;Pricing is the easiest of the three: Zed itself is free and open source. You bring your own API key for AI (or use the Zed-hosted offering). If you already pay for Claude Pro or have an API budget, the total monthly cost is often lower than the AI-IDE subscriptions.&lt;/p&gt;

&lt;p&gt;Where Zed wins the comparison: if speed matters to you, if you like minimal tools you can configure, and if you are comfortable with a BYO-agent model where Claude Code or similar runs outside the editor, Zed is the most satisfying choice of the three. It feels like the future of "editor plus agent" even though the editor itself is decidedly traditional.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Feature Comparison That Actually Matters
&lt;/h2&gt;

&lt;p&gt;Every AI IDE comparison online has a feature grid. Most of them are useless because they list features, not behavior. Here is what actually matters when you sit down to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tab Completion Quality
&lt;/h3&gt;

&lt;p&gt;Cursor: best in class. The ghost text feels like it knows what you are about to type.&lt;/p&gt;

&lt;p&gt;Windsurf: very good. Not quite Cursor, but close enough for most work.&lt;/p&gt;

&lt;p&gt;Zed: good. Noticeably a step behind the two VS Code forks for this specific feature.&lt;/p&gt;

&lt;p&gt;If tab completion is the AI feature you use most (it is for many developers), this alone may decide the comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline Edit (Command-K Style)
&lt;/h3&gt;

&lt;p&gt;Cursor: polished. Accept or reject with a keystroke, diff view is clean, works reliably on scoped changes.&lt;/p&gt;

&lt;p&gt;Windsurf: nearly identical to Cursor. Same experience.&lt;/p&gt;

&lt;p&gt;Zed: works well, slightly less polished UI, but functionally equivalent for most edits.&lt;/p&gt;

&lt;p&gt;All three are good at this. Not a deciding factor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Mode (Multi-File Tasks)
&lt;/h3&gt;

&lt;p&gt;Cursor: good on well-scoped tasks, struggles on large or unfamiliar code.&lt;/p&gt;

&lt;p&gt;Windsurf (Cascade): best of the three for autonomous execution. Also the most likely to overreach.&lt;/p&gt;

&lt;p&gt;Zed + Claude Code: the Claude Code agent is state of the art, but it is external to the editor. Integration is via a terminal, not inline.&lt;/p&gt;

&lt;p&gt;If you want to hand off entire tasks, Cascade or Claude-Code-beside-Zed is where you should be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codebase Understanding
&lt;/h3&gt;

&lt;p&gt;Cursor: @codebase is well-tuned and fast. Works on large repos.&lt;/p&gt;

&lt;p&gt;Windsurf: similar capability, slightly less refined UX.&lt;/p&gt;

&lt;p&gt;Zed: depends heavily on which agent you pair with. Claude Code has excellent codebase understanding but requires you to launch it separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed and Responsiveness
&lt;/h3&gt;

&lt;p&gt;Cursor: fine. Occasional slowness on very large files. Startup is slow by Zed standards.&lt;/p&gt;

&lt;p&gt;Windsurf: fine. Similar to Cursor.&lt;/p&gt;

&lt;p&gt;Zed: in a different category. If you have ever been annoyed by editor lag, this is the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Compatibility
&lt;/h3&gt;

&lt;p&gt;Cursor: full VS Code extension ecosystem.&lt;/p&gt;

&lt;p&gt;Windsurf: full VS Code extension ecosystem.&lt;/p&gt;

&lt;p&gt;Zed: its own ecosystem, smaller but growing. Language server support is good. Specific productivity extensions may not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing (2026)
&lt;/h3&gt;

&lt;p&gt;Cursor: $20/month Pro. Team plans scale up.&lt;/p&gt;

&lt;p&gt;Windsurf: $15/month individual, $30/user team. Free tier with unlimited autocomplete.&lt;/p&gt;

&lt;p&gt;Zed: free (editor). AI costs come from your provider API key or the Zed-hosted plan.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;The bad answer to "which one should I use" is "it depends." The useful answer is to have a default based on how you work, and only deviate if you have a specific reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cursor if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the least friction to get productive with AI.&lt;/li&gt;
&lt;li&gt;Tab completion is the AI feature you value most.&lt;/li&gt;
&lt;li&gt;You rely on specific VS Code extensions.&lt;/li&gt;
&lt;li&gt;You want a single-subscription tool that handles everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Windsurf if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to hand off larger tasks to the AI.&lt;/li&gt;
&lt;li&gt;Cost matters (the free tier and cheaper paid plan are real advantages).&lt;/li&gt;
&lt;li&gt;You are comfortable reviewing more AI-initiated changes before they merge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Zed (with Claude Code or similar) if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Editor speed actively matters to you.&lt;/li&gt;
&lt;li&gt;You prefer the editor and the agent to be separate tools.&lt;/li&gt;
&lt;li&gt;You already pay for Claude Pro or have an API budget.&lt;/li&gt;
&lt;li&gt;You like minimal, configurable tools over all-in-one environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most developers, Cursor is the right default. It is the least-surprising, fastest-to-productive option, and the one I recommend when someone asks "where do I start."&lt;/p&gt;

&lt;p&gt;For developers who have been doing this for a while and know what they like, the Zed-plus-Claude-Code combination is where I have personally landed. It respects my muscle memory for a fast editor while giving me a best-in-class agent for the tasks where agents matter.&lt;/p&gt;

&lt;p&gt;Windsurf is the wild card. If your work pattern is "describe what I want, let the AI take a real stab at it, iterate," Cascade is the best tool today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does Not Matter (Even Though the Marketing Says It Does)
&lt;/h2&gt;

&lt;p&gt;A few things keep showing up in AI IDE comparisons that do not actually affect daily work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model choice.&lt;/strong&gt; All three tools let you use Claude, GPT, or other frontier models. The model matters for quality, but the editor choice does not meaningfully gate which model you use. Pick the editor based on ergonomics and pair with whichever model you prefer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Built in Rust" vs "built on Electron."&lt;/strong&gt; This affects startup time and memory, and that shows up in Zed's superior responsiveness. But if your current editor already feels fast enough, the underlying implementation does not matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent autonomy scores.&lt;/strong&gt; The marketing around "most autonomous AI coder" is a race to make the AI do more with less input. In practice, the bottleneck is almost never autonomy. It is correctness. An agent that does more of the wrong thing is worse than one that does less of the right thing. Do not optimize for autonomy at the expense of quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telemetry and privacy positioning.&lt;/strong&gt; All three offer a private or enterprise tier with zero data retention if you need it. For solo developers working on non-sensitive projects, the default tiers are fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Meta Takeaway
&lt;/h2&gt;

&lt;p&gt;Switching editors is expensive. Not in dollars, in focus. Every time I have switched, I lost a week of momentum learning keybindings, restoring my workflow, and discovering what was missing.&lt;/p&gt;

&lt;p&gt;The best strategy is to pick one as your default and live in it for at least a month before evaluating whether another tool is meaningfully better for you. A week is not enough. Two weeks is not enough. A month of actual work, including at least one hard debugging session and one large multi-file refactor, is the minimum bar for forming an opinion.&lt;/p&gt;

&lt;p&gt;If you are currently happy with your editor, the honest answer is that none of these will 10x your productivity. They will incrementally improve specific parts of your workflow. If you are already using one of the three and feeling resistance, switching might close that gap, or might not. You have to try.&lt;/p&gt;

&lt;p&gt;If you are coming from plain VS Code with no AI, or from a non-AI editor, switching to any of these will be a step change. In that case, start with Cursor. Get comfortable with the workflows. Reassess in three months.&lt;/p&gt;

&lt;p&gt;The bigger shift I think most developers underestimate is not between these three editors. It is between "editor with AI features" and "editor plus dedicated AI agent as separate tool." That is the divide I have come out on the other side of, and the reason my daily driver is Zed with Claude Code instead of one of the more obvious picks.&lt;/p&gt;

&lt;p&gt;For where the industry is heading on that shift specifically, I wrote more about it in &lt;a href="https://dev.to/blog/agentic-coding-2026"&gt;agentic coding in 2026&lt;/a&gt;, which covers the broader pattern of agents as first-class development tools rather than IDE features.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Actual Recommendation
&lt;/h2&gt;

&lt;p&gt;If I had to pick one and only one for a developer who asked me today, with no other context about their work, I would say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor, for the next three months.&lt;/strong&gt; Get the fundamentals. Get the reflexes. Learn what an AI IDE feels like when it is working well.&lt;/p&gt;

&lt;p&gt;Then, once you have that baseline, try Zed with Claude Code for a month. See if the editor-plus-separate-agent model clicks. If it does, you have found your long-term setup. If it does not, you are left with a tool you understand deeply and will get more out of because you evaluated the alternative.&lt;/p&gt;

&lt;p&gt;Windsurf is worth a try if the cost matters or if Cascade's specific style of autonomy appeals to you. It is a real contender, just not my daily driver.&lt;/p&gt;

&lt;p&gt;The only wrong answer is paralysis. These tools are good enough that any of the three, used consistently, will move you forward. The tools are getting better faster than you can evaluate them. Pick one, ship work with it, and switch only when you have a specific reason.&lt;/p&gt;

&lt;p&gt;The work is the point. The editor is the thing that gets out of your way.&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>productivity</category>
      <category>editors</category>
    </item>
    <item>
      <title>AI Agent Observability: Debugging Production Agents Without Going Insane (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:41:08 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/ai-agent-observability-debugging-production-agents-without-going-insane-2026-53km</link>
      <guid>https://dev.to/alexcloudstar/ai-agent-observability-debugging-production-agents-without-going-insane-2026-53km</guid>
      <description>&lt;p&gt;The first time I shipped an AI agent to production, I watched it do something in the logs that I could not reproduce locally, could not explain, and could not fix.&lt;/p&gt;

&lt;p&gt;A user asked it to summarize a meeting. It responded with a confident paragraph that referenced three people who were not in the meeting, a decision that was never made, and a date that did not exist. Everything about the response looked plausible. None of it was true.&lt;/p&gt;

&lt;p&gt;I had logs. I had the prompt. I had the final response. What I did not have was any visibility into the sixteen tool calls, three retries, one silent fallback to a cheaper model, and two places where the context was truncated before the model ever saw the real input. The bug was somewhere in that middle. I could not see the middle.&lt;/p&gt;

&lt;p&gt;That was the moment I understood that agent observability is not a nice-to-have. It is the difference between shipping agents and shipping prayers.&lt;/p&gt;

&lt;p&gt;This is the setup I wish I had that day. It works for solo developers, it costs less than you would guess, and it turns the black box of agent execution into something you can actually debug.&lt;/p&gt;




&lt;h2&gt;Why Agent Observability Is Different From Regular Logging&lt;/h2&gt;

&lt;p&gt;If you are coming from web development, your instinct is to reach for the logging tools you already know. Sentry for errors. Datadog for metrics. A structured logger for requests. These are great tools and you should still use them. But they do not tell you what you need to know about an agent.&lt;/p&gt;

&lt;p&gt;The problem is that an agent failure is rarely a single event. It is a chain of events, each of which looks fine on its own.&lt;/p&gt;

&lt;p&gt;A tool call returns valid JSON. The model reads that JSON and makes a reasonable next decision. The next step executes without errors. Eventually the agent returns an answer. Every individual step passes validation. The final answer is wrong.&lt;/p&gt;

&lt;p&gt;If you log these steps independently, you see a sequence of successful operations. If you trace them together, you see that the second tool call returned stale data that the model then built a confident hallucination on top of for the next eight turns. The root cause is invisible at the individual log line. It only appears in the full causal chain.&lt;/p&gt;

&lt;p&gt;This is why the observability stack for agents looks different. You need three things that traditional logging tools do not give you by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session traces.&lt;/strong&gt; The full sequence of prompts, completions, tool calls, retries, and state changes that make up a single agent execution, linked together as one object you can inspect top to bottom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost attribution.&lt;/strong&gt; Which parts of your agent are spending the tokens, and therefore the money. Without this you cannot find the prompt that accidentally got 40x more expensive after a refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evals that run in production.&lt;/strong&gt; Offline evals catch the bugs you thought to test for. Production evals catch the bugs your users ran into first.&lt;/p&gt;

&lt;p&gt;Get those three right and you will solve 90% of the problems that make solo developers afraid to ship agents. Let me walk through each one.&lt;/p&gt;




&lt;h2&gt;Session Traces: The One Thing You Cannot Live Without&lt;/h2&gt;

&lt;p&gt;A trace is a single, inspectable view of an entire agent run. For a typical agent, a trace might include the user input, the system prompt, the first model completion, the first tool call and its response, the updated context, the second model completion, and so on until the agent stops.&lt;/p&gt;

&lt;p&gt;You want this view because agent failures are contextual. The question is never just "what did the model say" but "what did the model say given this specific context after this specific history of events." Without the full trace, you are reading the punchline without the setup.&lt;/p&gt;

&lt;p&gt;The tools I have tried and would recommend for solo founders and small teams in 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; is the open-source option with a generous free tier and a self-hostable option if you want to keep your data. It supports any framework through a simple SDK. Traces render as a nested tree where you can click into each span, see the full prompt and completion, inspect tool inputs and outputs, and compare runs side by side. If you just want to try something and get value quickly, this is where I would start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; from LangChain is the most polished of the observability platforms if you are building with LangChain or LangGraph, though it now works framework-agnostically. The trace UI renders the execution tree beautifully and has the best prompt engineering workflow (you can edit a prompt in the playground, run it against the same input, and compare the result). The free tier is enough for early development, and the paid tiers scale with volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust&lt;/strong&gt; is worth considering if you want observability and evals in the same tool with a product focus on experimentation. The trace view is clean, and the "playground" workflow for iterating on prompts inside real traces is genuinely excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; is the lightest-weight option if you do not want another SDK. It works as a proxy in front of your LLM provider, so you change one URL and suddenly have observability. For simple agents this is a great "just make the pain stop" option.&lt;/p&gt;

&lt;p&gt;Pick one and integrate it into your agent on day one, not the day you ship to production. The cost of adding observability later is much higher than the cost of adding it at the start.&lt;/p&gt;
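
&lt;p&gt;As a concrete sketch of what that day-one integration buys you, here is a toy in-memory tracer. Every name here is illustrative; real SDKs such as Langfuse or LangSmith expose comparable span and trace primitives through their own APIs:&lt;/p&gt;

```python
# Toy tracer: every wrapped function call becomes one span in a trace.
# Illustrative only; not any specific SDK's API.
import functools
import time
import uuid

TRACES = {}  # trace_id -> ordered list of span dicts

def start_trace():
    trace_id = str(uuid.uuid4())
    TRACES[trace_id] = []
    return trace_id

def span(trace_id, name):
    """Decorator that records inputs, output, and latency as one span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.monotonic()
            result = fn(*args, **kwargs)
            TRACES[trace_id].append({
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": round(time.monotonic() - t0, 3),
            })
            return result
        return inner
    return wrap

trace_id = start_trace()

@span(trace_id, "tool:search")
def search(query):
    return ["doc-1", "doc-2"]  # stand-in for a real tool call

search("quarterly numbers")
print([s["name"] for s in TRACES[trace_id]])  # → ['tool:search']
```

The real win over plain logging is that every span carries the trace ID, so the whole run can be read top to bottom as one object.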

&lt;h3&gt;What to Actually Log in Each Trace&lt;/h3&gt;

&lt;p&gt;Once you have the tool picked, what you log in each trace matters more than the tool choice. The defaults are usually not enough.&lt;/p&gt;

&lt;p&gt;For each agent run, log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The raw user input (not a summarized or preprocessed version)&lt;/li&gt;
&lt;li&gt;The full system prompt as sent to the model (not the template you intended to send)&lt;/li&gt;
&lt;li&gt;Every tool call input and output, including failed calls&lt;/li&gt;
&lt;li&gt;Every model call with its model name, temperature, and token counts&lt;/li&gt;
&lt;li&gt;Any retries, including why they happened&lt;/li&gt;
&lt;li&gt;Any fallback to a cheaper or different model&lt;/li&gt;
&lt;li&gt;The final response as the user saw it&lt;/li&gt;
&lt;li&gt;A session ID and user ID (or anonymous user hash) so you can correlate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two specific things that you will forget and then regret: the exact model version string (not just "gpt-5" but the full identifier with date stamp if available) and the full prompt after template substitution. The bug is often in the substitution.&lt;/p&gt;
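
&lt;p&gt;To make the substitution point concrete, here is a sketch of a fully materialized trace record. The field names and the dated model string are illustrative, not any platform's schema; the habit that matters is storing the rendered prompt and the exact model identifier:&lt;/p&gt;

```python
# Sketch of one trace record. Store the *rendered* prompt and the full
# model identifier, not the template and the model family name.
PROMPT_TEMPLATE = "Summarize the meeting for {user_name}:\n{transcript}"

def render_prompt(user_name, transcript):
    return PROMPT_TEMPLATE.format(user_name=user_name, transcript=transcript)

def build_trace_record(session_id, user_id, user_input, model_id, rendered_prompt):
    return {
        "session_id": session_id,
        "user_id": user_id,
        "raw_input": user_input,            # not a preprocessed version
        "prompt_as_sent": rendered_prompt,  # after template substitution
        "model_id": model_id,               # full dated identifier
        "tool_calls": [],
        "retries": [],
        "fallbacks": [],
    }

record = build_trace_record(
    session_id="s-123",
    user_id="u-456",
    user_input="summarize my meeting",
    model_id="gpt-5-2026-01-15",  # hypothetical dated version string
    rendered_prompt=render_prompt("Ada", "meeting transcript text"),
)
print(record["model_id"])  # → gpt-5-2026-01-15
```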




&lt;h2&gt;Naming and Structuring Traces So You Can Actually Find Things&lt;/h2&gt;

&lt;p&gt;The failure mode I see most often with agent observability is not having too little data. It is having data you cannot query. A million traces are useless if you cannot find the ten that broke a user's workflow.&lt;/p&gt;

&lt;p&gt;Here is the structure that has saved me the most time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use consistent span names.&lt;/strong&gt; Every tool call should be named after the tool, not a generic "tool_call" label. Every retry should be named "retry_1", "retry_2" so you can filter on retries specifically. Every model call should include the model name in the span.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag sessions with user context.&lt;/strong&gt; Add metadata tags for user ID, account plan, feature flag state, and any other dimension you might want to filter on later. "Show me all failed agent runs for paid users in the last 24 hours where the feature flag X was on" is a query you will want to run and cannot answer without tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag sessions with success or failure signals.&lt;/strong&gt; The model cannot reliably judge whether its own run was successful, but downstream user behavior often can. If the user copied the response, they probably liked it. If they asked a clarifying question immediately after, they probably did not. Log these signals back to the trace as tags. You will use this data for evals later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture the decision points explicitly.&lt;/strong&gt; If your agent branches ("should I use the search tool or answer from memory"), log the decision and the reason, not just the action taken. When something goes wrong you want to know the agent chose path A when it should have chosen path B, and you want to know why.&lt;/p&gt;
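
&lt;p&gt;A minimal sketch of the last two habits together, with illustrative helper names rather than a real SDK's API:&lt;/p&gt;

```python
# Sketch: filterable session tags plus an explicit branch-decision span
# that records the chosen path, the alternatives, and the reason.
def tag_session(trace, **tags):
    trace.setdefault("tags", {}).update(tags)

def log_decision(trace, chosen, alternatives, reason):
    trace.setdefault("spans", []).append({
        "name": "decision:tool_choice",
        "chosen": chosen,
        "alternatives": alternatives,
        "reason": reason,
    })

trace = {}
tag_session(trace, user_id="u-456", plan="paid", flag_x=True)
log_decision(
    trace,
    chosen="search_tool",
    alternatives=["answer_from_memory"],
    reason="query mentions a date outside the model's training window",
)
print(trace["tags"]["plan"])  # → paid
```

With tags like these in place, the "failed runs for paid users with flag X on" query becomes a filter instead of a grep expedition.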




&lt;h2&gt;Evals: The Part Most Developers Skip&lt;/h2&gt;

&lt;p&gt;I wrote a whole post on &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; that covers the basics, but observability is where evals start paying off.&lt;/p&gt;

&lt;p&gt;Evals are the automated tests for your agent. Unit tests for deterministic code ask "does this function return 42 for input X." Evals for agents ask "does this agent's response meet the quality bar for input X." Quality bar is fuzzy, so evals use a mix of deterministic checks (did it call the right tool), LLM-as-judge checks (is the answer factually grounded in the provided context), and sometimes human review.&lt;/p&gt;

&lt;p&gt;The thing I want to hammer on here: evals and observability are the same workflow.&lt;/p&gt;

&lt;p&gt;When you have a trace, you have an input and an output and all the intermediate steps. That is exactly what an eval consumes. Good observability tools make the round trip from "I saw a bad trace in production" to "I added this case to my eval suite and confirmed my fix works on it" a single-click operation.&lt;/p&gt;

&lt;p&gt;The workflow that actually works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You see a bad agent run in production (tagged by a negative signal or reported by a user).&lt;/li&gt;
&lt;li&gt;You open the trace and see what went wrong.&lt;/li&gt;
&lt;li&gt;You save the input as a test case in your eval suite with an expected behavior or quality criterion.&lt;/li&gt;
&lt;li&gt;You make a change to the prompt, the tool, or the model choice.&lt;/li&gt;
&lt;li&gt;You rerun the eval suite to see that your fix works without breaking other cases.&lt;/li&gt;
&lt;li&gt;You ship.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is what separates agents that improve over time from agents that keep making the same mistakes.&lt;/p&gt;
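
&lt;p&gt;The loop above can be sketched in a few lines. &lt;code&gt;fixed_agent&lt;/code&gt; and the pass criterion are stand-ins for your real agent and quality bar:&lt;/p&gt;

```python
# Sketch of the trace-to-eval round trip: a bad production trace becomes
# a permanent test case, then the fix is verified against the suite.
EVAL_SUITE = []

def save_as_eval_case(trace, criterion):
    EVAL_SUITE.append({"input": trace["raw_input"], "criterion": criterion})

def run_suite(agent):
    results = []
    for case in EVAL_SUITE:
        output = agent(case["input"])
        results.append({"input": case["input"], "passed": case["criterion"](output)})
    return results

# Step 1-3: the bad run's input becomes a test case with a quality criterion.
bad_trace = {"raw_input": "who attended the standup?"}
save_as_eval_case(bad_trace, criterion=lambda out: "attendee" in out)

# Step 4-5: after changing the prompt or tool, rerun the suite.
def fixed_agent(text):
    return "attendee list: Ada, Grace"  # pretend this is the fixed agent

print(all(r["passed"] for r in run_suite(fixed_agent)))  # → True
```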




&lt;h2&gt;Running Evals in Production (Not Just Before Deploy)&lt;/h2&gt;

&lt;p&gt;Most developers treat evals as a pre-deploy check. Run them in CI, make sure they pass, ship. This is good but not enough.&lt;/p&gt;

&lt;p&gt;The problem is that production traffic is not a clean superset of your eval set. Users will do things you did not imagine. Edge cases you never thought of will hit your agent on day one. A 95% eval pass rate on your test suite means almost nothing if your test suite is missing the cases that actually break.&lt;/p&gt;

&lt;p&gt;Production evals fix this by running a subset of your eval logic on real production traffic. You sample a percentage of runs, pipe them through an LLM-as-judge or a deterministic check, and log the results as a quality metric.&lt;/p&gt;

&lt;p&gt;What this buys you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A quality dashboard.&lt;/strong&gt; You see the percentage of agent runs that met your quality bar over the last day, week, month. This lets you detect regressions that would otherwise only show up in customer complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster incident detection.&lt;/strong&gt; When you push a prompt change and the production quality metric drops by 15%, you know something is wrong within an hour instead of three days later when support tickets pile up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous eval set growth.&lt;/strong&gt; Every production run that fails the judge becomes a candidate for your permanent eval suite. Your test set grows with your real usage.&lt;/p&gt;

&lt;p&gt;Starting point: sample 5% of runs, pipe them through a cheap LLM-as-judge that scores them on two or three dimensions that matter for your product (factual grounding, tool use correctness, response helpfulness). Put the scores on a dashboard. That is it. The dashboard will tell you when something is wrong and which traces to look at.&lt;/p&gt;
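
&lt;p&gt;That starting point fits in a page of code. &lt;code&gt;judge_score&lt;/code&gt; here is a stub standing in for a cheap LLM-as-judge call:&lt;/p&gt;

```python
# Sketch: sample a fraction of production runs through a judge and log
# the scores as quality metrics. All names are illustrative.
import random

SAMPLE_RATE = 0.05  # judge 5% of runs

def judge_score(trace):
    # Real version: send prompt + response to a cheap model and ask for
    # 0-1 scores on the dimensions that matter for your product.
    return {"grounding": 0.9, "tool_use": 1.0, "helpfulness": 0.8}

def maybe_evaluate(trace, quality_log, rng=random.random):
    if rng() > SAMPLE_RATE:
        return None  # not sampled; the common case
    scores = judge_score(trace)
    quality_log.append({"trace_id": trace["id"], "scores": scores})
    return scores

log = []
maybe_evaluate({"id": "t-1"}, log, rng=lambda: 0.01)  # forced into the sample
print(len(log))  # → 1
```

The `quality_log` is what feeds the dashboard: aggregate the scores per day and alert when the average drops.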




&lt;h2&gt;Cost and Token Observability&lt;/h2&gt;

&lt;p&gt;The other thing that bites solo developers shipping agents is cost. You build something, ship it, and a month later see a bill that does not match your mental model of what the agent does.&lt;/p&gt;

&lt;p&gt;The cost comes from places you do not expect. A tool that returns a 50 KB blob of JSON, which the model then re-reads on every subsequent turn. A system prompt that grew 500 tokens during a refactor and now runs on every single call. A retry loop that happens silently when the model returns malformed JSON, doubling or tripling your per-request cost.&lt;/p&gt;

&lt;p&gt;Every agent observability tool I mentioned above tracks tokens by default. Use this.&lt;/p&gt;

&lt;p&gt;Specifically, build a dashboard that answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which spans (tools, model calls) account for most of the tokens?&lt;/li&gt;
&lt;li&gt;What is the average token count per session, and has it drifted over the last month?&lt;/li&gt;
&lt;li&gt;What is the p99 token count per session? (This is often where the cost overruns hide.)&lt;/li&gt;
&lt;li&gt;Which users or accounts are the most expensive? (A power user is fine; a runaway loop is not.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A specific pattern I use now: a sanity check at the start of every agent run that rejects requests where the prefilled context is already over a threshold (say 100k tokens). This has caught bugs where an edge case was pulling way more context than intended, and the agent would then run expensively and slowly for no reason.&lt;/p&gt;
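
&lt;p&gt;A sketch of that guard, using a crude characters-divided-by-four token estimate; swap in a real tokenizer in practice:&lt;/p&gt;

```python
# Sketch of the pre-run sanity check: reject runs whose prefilled context
# is already past a token budget, before paying for a model call.
MAX_CONTEXT_TOKENS = 100_000

def estimate_tokens(text):
    return len(text) // 4  # crude heuristic, good enough for a guardrail

class ContextBudgetExceeded(Exception):
    pass

def check_context_budget(messages):
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total > MAX_CONTEXT_TOKENS:
        raise ContextBudgetExceeded(
            f"context is {total} tokens, budget is {MAX_CONTEXT_TOKENS}"
        )
    return total

print(check_context_budget([{"content": "x" * 400}]))  # → 100
```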

&lt;p&gt;For the broader picture on keeping AI costs sane at production scale, I went into this in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization in production&lt;/a&gt;, which pairs well with the token observability approach above.&lt;/p&gt;




&lt;h2&gt;Debugging Non-Deterministic Failures&lt;/h2&gt;

&lt;p&gt;The hardest agent bugs are the ones that only happen sometimes. You run the same input and get the right answer. A user runs it five times and gets the wrong answer twice. What do you do?&lt;/p&gt;

&lt;p&gt;This is where good tracing changes the game.&lt;/p&gt;

&lt;p&gt;First, you need to know these failures are happening. Set up a trace filter that flags sessions where the same user retried a request within 30 seconds. That is a strong signal they did not like the first answer.&lt;/p&gt;
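
&lt;p&gt;The retry filter is simple enough to sketch directly. Timestamps are epoch seconds and the event shape is illustrative:&lt;/p&gt;

```python
# Sketch: flag sessions where the same user reissued a request within
# the retry window, a strong "did not like the first answer" signal.
RETRY_WINDOW_S = 30

def flag_rapid_retries(events):
    """events: list of {user, ts} dicts sorted by ts. Returns flagged events."""
    last_seen = {}
    flagged = []
    for e in events:
        prev = last_seen.get(e["user"])
        if prev is not None and RETRY_WINDOW_S >= e["ts"] - prev:
            flagged.append(e)
        last_seen[e["user"]] = e["ts"]
    return flagged

events = [
    {"user": "u-1", "ts": 100},
    {"user": "u-1", "ts": 112},  # 12s after the first: likely a retry
    {"user": "u-2", "ts": 500},
]
print(len(flag_rapid_retries(events)))  # → 1
```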

&lt;p&gt;Then, you need to compare runs. A good observability tool lets you diff two traces side by side. You want to see what was different. Often the difference is in the context window, not the input. The first run had one set of prior messages. The second run had a subtly different set because of some intermediate state change. The model responded differently because the context was different, not because the prompt was worse.&lt;/p&gt;

&lt;p&gt;Once you see the diff, you can usually fix the problem. Sometimes it is a prompt change. Sometimes it is a retrieval or memory change. Sometimes it is a model version pin (the provider silently updated the model and your prompt is now slightly off). The fix is downstream. The discovery is only possible with good observability.&lt;/p&gt;

&lt;p&gt;For the harder cases where the non-determinism is baked into the model itself, techniques like temperature 0, structured output, and caching can help. I covered some of these in &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;AI agent memory and state persistence&lt;/a&gt;, where the state layer is often where non-determinism sneaks in.&lt;/p&gt;




&lt;h2&gt;Privacy and Data Handling&lt;/h2&gt;

&lt;p&gt;One thing I should flag because it comes up a lot in solo-founder-shipping-to-enterprise territory: observability by default captures everything, including things you may not be allowed to capture.&lt;/p&gt;

&lt;p&gt;If your agent processes user data that is sensitive (health records, financial information, personal messages), the observability layer becomes a compliance surface. Logging the raw prompts and completions means your observability provider now has copies of that data.&lt;/p&gt;

&lt;p&gt;Three things that help:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PII redaction.&lt;/strong&gt; Most observability platforms have built-in redactors for email addresses, phone numbers, credit card numbers. Turn this on. Better to lose some debugging information than to accidentally log a user's SSN.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosting.&lt;/strong&gt; Langfuse and a few others offer a self-hostable version. If you need the data to never leave your infrastructure, this is the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling and retention policies.&lt;/strong&gt; You do not need to keep every trace forever. A policy like "keep all failed traces for 90 days, sample 5% of successful traces for 30 days" gives you enough data to debug while limiting exposure.&lt;/p&gt;
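
&lt;p&gt;If your platform does not redact for you, a regex pass before traces leave your process is a reasonable floor. These two patterns (emails and US-style phone numbers) are only a starting set, not a complete PII policy:&lt;/p&gt;

```python
# Sketch of a redaction pass applied to prompts and completions before
# they are sent to the observability backend.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
]

def redact(text):
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Reach Ada at ada@example.com or 555-123-4567"))
# → Reach Ada at [EMAIL] or [PHONE]
```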

&lt;p&gt;None of this is a substitute for actual compliance review for regulated industries. But for the common case of a SaaS product that handles user data that you would not want in a breach, these steps get you to a reasonable place.&lt;/p&gt;




&lt;h2&gt;The Minimal Setup That Actually Works&lt;/h2&gt;

&lt;p&gt;If I had to start from zero tomorrow and wanted the smallest possible observability setup that would still catch most real problems, here is what I would do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One observability tool&lt;/strong&gt; (Langfuse if I wanted free and self-hostable, LangSmith if I was already on LangChain, Braintrust if evals are my priority). Integrate it on day one of the project, not day one of production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces for every agent run&lt;/strong&gt;, logging the full input, system prompt, tool calls, and response. Tagged with user ID, session ID, and any feature flags relevant to the behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A quality signal&lt;/strong&gt; captured for each run. In the simplest case, this is whether the user replied positively, retried, or abandoned. You can refine later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A basic eval suite&lt;/strong&gt; of 20 to 50 representative cases, run in CI on every prompt or model change, and sampled on 1% to 5% of production runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A token dashboard&lt;/strong&gt; showing total tokens per day, average tokens per session, and top token consumers. Check it weekly.&lt;/p&gt;

&lt;p&gt;That setup takes a weekend to build if you are starting fresh. It takes a couple of hours to add to an existing agent. It turns every production bug from a mystery into a specific trace you can look at, reproduce, and fix.&lt;/p&gt;




&lt;h2&gt;The Mindset Shift That Matters&lt;/h2&gt;

&lt;p&gt;The last thing I want to leave you with is not a tool recommendation. It is a way of thinking.&lt;/p&gt;

&lt;p&gt;Traditional software development treats observability as a production concern. You write the code, test it, ship it, and add monitoring to see how it behaves in the wild.&lt;/p&gt;

&lt;p&gt;Agent development flips this. The model is the largest source of uncertainty in the system. You cannot unit test your way to confidence because the thing you are testing is non-deterministic by design. You cannot code review your way to correctness because the logic is in weights, not lines.&lt;/p&gt;

&lt;p&gt;The only way to know if your agent works is to watch it work. In development. In staging. In production. Continuously. With enough detail that you can diagnose any failure in minutes instead of days.&lt;/p&gt;

&lt;p&gt;Observability is not a nice-to-have layer you add once the core features are built. It is the development environment itself. The sooner you build this mindset, the sooner you go from shipping agents that mysteriously disappoint users to shipping agents that get measurably better every week.&lt;/p&gt;

&lt;p&gt;Your traces are the new IDE. Your eval set is the new test suite. Your quality dashboard is the new build pipeline.&lt;/p&gt;

&lt;p&gt;Build them first. Everything else gets easier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>observability</category>
      <category>agents</category>
    </item>
    <item>
      <title>Better Auth vs Clerk vs Supabase Auth: Which Should Solo Devs Pick in 2026?</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:58:20 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/better-auth-vs-clerk-vs-supabase-auth-which-should-solo-devs-pick-in-2026-mf1</link>
      <guid>https://dev.to/alexcloudstar/better-auth-vs-clerk-vs-supabase-auth-which-should-solo-devs-pick-in-2026-mf1</guid>
      <description>&lt;p&gt;Authentication is the decision you make once, forget about, and then curse quietly two years later when you need to do something the vendor does not support.&lt;/p&gt;

&lt;p&gt;I have shipped products on three different auth stacks in the last four years. Each time I picked the one that seemed obvious at the time. Each time, the trade-off I was not thinking about turned into the thing that mattered most.&lt;/p&gt;

&lt;p&gt;The landscape in 2026 is different enough that anyone building a new product should stop and think about this for a second, instead of reaching for whichever provider they used last time. The default has shifted. The self-hosted option is actually good. The pricing math has changed. And there is one provider that has quietly become the right answer for a specific kind of product that most solo devs are building.&lt;/p&gt;

&lt;p&gt;Here is how I think about the choice in 2026 between Better Auth, Clerk, and Supabase Auth. These are the three options worth seriously considering. Everyone else is either too expensive, too niche, or not worth the switching cost anymore.&lt;/p&gt;




&lt;h2&gt;Why This Decision Matters More Than It Feels Like It Does&lt;/h2&gt;

&lt;p&gt;Auth is the most load-bearing piece of your product that you rarely think about.&lt;/p&gt;

&lt;p&gt;It sits in front of every request. It shapes your user model. It decides how you handle billing, teams, roles, sessions, invitations, password resets, and every compliance conversation you will ever have. When it is invisible it feels free. When it breaks or has to change, it is weeks of rewriting.&lt;/p&gt;

&lt;p&gt;The switching cost is where people get burned. Going from one auth provider to another means migrating user records, session tokens, password hashes (or not, if your new provider does not accept the old hash format), third-party OAuth connections, and every webhook integration downstream. Most products never switch because the cost never justifies the benefit. You live with what you picked, so picking well is worth an hour of thought.&lt;/p&gt;

&lt;p&gt;Three questions decide which provider fits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How much of the auth UI do you want to own?&lt;/li&gt;
&lt;li&gt;How much are you willing to pay per active user?&lt;/li&gt;
&lt;li&gt;What is your tolerance for lock-in?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The answers cluster into three patterns. Each provider is the right answer to one of those patterns.&lt;/p&gt;




&lt;h2&gt;Clerk: The Polished Default&lt;/h2&gt;

&lt;p&gt;Clerk is the closest thing to a "just works" auth provider in 2026. You drop their components into your app, configure a few things in their dashboard, and you have a complete auth system including sign-in, sign-up, password reset, social providers, MFA, email verification, a profile UI, and a user management dashboard.&lt;/p&gt;

&lt;p&gt;The quality of the components is the thing that keeps people on Clerk. The &lt;code&gt;&amp;lt;SignIn /&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;UserProfile /&amp;gt;&lt;/code&gt; components are not just functional. They look good out of the box. They handle edge cases that people forget exist. They ship with localization, theming, and accessibility built in. You cannot build this yourself in a reasonable time frame, which is the whole value proposition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Clerk wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are building a consumer or prosumer product where auth UX matters. Sign-in friction costs you conversion. The quality of your password reset flow is a real competitive detail. The difference between a janky MFA prompt and a polished one is measurable.&lt;/p&gt;

&lt;p&gt;You want organizations, invitations, and role-based access out of the box. Clerk's org model is one of the best pre-built ones available. If your product needs teams, this saves weeks.&lt;/p&gt;

&lt;p&gt;You are okay paying for active users once you grow. Clerk's pricing starts generous and gets expensive fast. At small scale it is effectively free. At 10,000 active users on a business product, you are paying enough per month that it shows up on the accounting summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Clerk loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The lock-in is real. You do not own your user records in a useful way. You can export them, but the sessions, the verification state, and the OAuth connections all live inside Clerk. Migrating out is a project, not a weekend. This is the same pattern I covered in the &lt;a href="https://dev.to/blog/supabase-vs-firebase-2026"&gt;Supabase vs Firebase breakdown&lt;/a&gt;, and it has gotten worse for managed auth over the last two years.&lt;/p&gt;

&lt;p&gt;Pricing is unpredictable if your app has viral moments or traffic spikes. Clerk charges on monthly active users. A Product Hunt launch that brings 5,000 curious clickers with one-session visits gets counted. Some competitors count differently. Read the pricing page twice before committing.&lt;/p&gt;

&lt;p&gt;You cannot customize the flows beyond what they expose. If your product needs a non-standard sign-up experience, a unique verification step, or integration with an internal identity system, Clerk is the wrong answer. You will hit the walls of their abstraction and there is no escape hatch.&lt;/p&gt;




&lt;h2&gt;Supabase Auth: The Pragmatic Bundle&lt;/h2&gt;

&lt;p&gt;Supabase Auth is the auth layer that ships with Supabase. You cannot really evaluate it in isolation, because the reason to pick it is that you are using Supabase for other things too.&lt;/p&gt;

&lt;p&gt;If you are using Supabase Postgres, Supabase Storage, or Supabase Realtime, Supabase Auth is the path of least resistance. User records live in your own Postgres database. Row-level security policies reference the authenticated user directly. The auth state flows into your realtime subscriptions automatically. It is the tightest integration available between auth and data in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Supabase Auth wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are building a product where Postgres is the database and Supabase is the backend. The integration with row-level security alone justifies the choice. You can write a policy like "users can only see their own rows" and enforce it at the database layer. No middleware, no manual check in every API route. The auth identity is a first-class concept in your data layer.&lt;/p&gt;

&lt;p&gt;Your cost profile favors flat platform pricing over per-user auth pricing. Supabase charges for database size and API bandwidth. Auth itself is effectively unlimited for most projects. If you expect to grow past a few thousand active users and do not want per-user fees, this is the cheapest provider by a wide margin.&lt;/p&gt;

&lt;p&gt;You want to own your user data. Users live in a Postgres table that you can query, join, and back up like any other data. No export ritual. No vendor-shaped user model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Supabase Auth loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The UI components are functional but not polished. They look like open-source components from 2022, because that is roughly what they are. You will end up building your own sign-in pages if design quality matters for your product. That is not bad, it is just more work than Clerk requires.&lt;/p&gt;

&lt;p&gt;The organization and team features are basic. You can build teams on top of Supabase Auth, but you are building them. Invitations, role-based permissions, and multi-tenant support are all DIY. If your product needs those out of the box, you will write them yourself.&lt;/p&gt;

&lt;p&gt;Edge cases around advanced flows are where its relative youth shows. Some auth providers have a decade of accumulated work on rare-but-important scenarios. Supabase Auth is newer and less thorough on those. For most products this does not matter. For some it does.&lt;/p&gt;




&lt;h2&gt;Better Auth: The Open-Source Answer That Changed the Conversation&lt;/h2&gt;

&lt;p&gt;Better Auth is the one that has shifted what the default answer should be in 2026. It is an open-source auth library for TypeScript, self-hosted by default, with first-class support for every framework worth caring about.&lt;/p&gt;

&lt;p&gt;It is not a managed service. You install it, configure it, and run it in your own application. It stores user data in your own database. It issues your own session tokens. There is no external service to depend on, no per-user billing, and no risk of a vendor sunsetting a feature you rely on.&lt;/p&gt;

&lt;p&gt;A year ago, the pitch for self-hosted auth was "it is cheaper and you own everything, but you will spend a month making it work." That tradeoff has changed. Better Auth is good enough out of the box that the setup cost is comparable to a managed provider, and the cost curve past the first thousand users bends in your favor forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Better Auth wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are shipping a TypeScript product and have strong preferences about your stack. Better Auth gives you hooks for every step of the auth lifecycle. You can plug in your own email sender, your own session store, your own rate limiter, your own password policy. If you already have opinions, it does not fight you.&lt;/p&gt;

&lt;p&gt;Your cost profile is long-tail. You are building something that could have a lot of users but not a lot of revenue per user. Newsletter tools, community products, developer tools with free tiers. Managed auth priced per active user will eat your margin. Better Auth priced per server costs the same at 100 users and 100,000 users.&lt;/p&gt;

&lt;p&gt;You want to avoid lock-in entirely. The user table is your user table. The sessions are your sessions. If Better Auth changes direction or a new library comes out that is better, you can migrate in a week because your data is already yours.&lt;/p&gt;

&lt;p&gt;You value reading the code. When something breaks, you can step through the auth library itself. When a new social provider launches, you can add it yourself without waiting for a vendor. This is the same reason a lot of developers prefer &lt;a href="https://dev.to/blog/drizzle-orm-vs-prisma-2026"&gt;Drizzle over Prisma&lt;/a&gt; in the ORM layer. Owning the layer means you can fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Better Auth loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to operate it. That means thinking about session storage, rate limiting, monitoring, and making sure your database migrations do not lock the users table during a busy hour. None of this is hard. All of it is work that Clerk does for you and Better Auth does not.&lt;/p&gt;

&lt;p&gt;No polished components. You are building your own sign-in UI. This is fine if you have taste and time. If you are a backend developer shipping a product alone and design is not your thing, the quality of your auth pages will show. Clerk wins on this dimension, cleanly.&lt;/p&gt;

&lt;p&gt;The ecosystem is newer. Integrations with specific services, documentation for obscure edge cases, and Stack Overflow answers to weird problems are thinner than Clerk or Supabase. You will occasionally have to read source code or ask in a Discord. That is fine for most developers and a dealbreaker for some.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework I Actually Use
&lt;/h2&gt;

&lt;p&gt;The marketing pitches all sound good. Here is how I pick between the three for real projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the product is UI-sensitive and growth-sensitive, pick Clerk.&lt;/strong&gt; Consumer products, prosumer tools, anything where sign-in friction visibly matters. Pay the per-user cost for the better conversion and faster launch. The expensive auth bill later is a sign the product is working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the stack is already Supabase or Postgres with row-level security, pick Supabase Auth.&lt;/strong&gt; The database integration is worth more than any other feature. Do not fight it. Use the provider that is already in your data layer. This is especially true for products where data access is the core complexity and auth is a supporting cast member.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the product is TypeScript, margin-sensitive, and you have at least some taste for design, pick Better Auth.&lt;/strong&gt; Newsletter products, community tools, developer platforms, internal apps, anything where per-user pricing at scale would hurt. The setup cost is a weekend. The ownership is permanent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fourth option I use sometimes:&lt;/strong&gt; start with a managed provider, then migrate to Better Auth when the cost crosses a threshold. Clerk for the first six months. Better Auth once you have validated the product and growth is real enough to justify the migration project. This has the highest optionality but requires the discipline to actually migrate when the time comes. Most people never do.&lt;/p&gt;
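&lt;p&gt;To make "migrate when the cost crosses a threshold" concrete, here is a back-of-the-envelope break-even sketch. The prices below are hypothetical placeholders, not any vendor's actual rates; plug in the real numbers from the pricing pages before deciding:&lt;/p&gt;

```typescript
// Hypothetical numbers for illustration only; check real vendor pricing.
const MANAGED_PRICE_PER_MAU = 0.02; // $ per monthly active user past the free tier
const MANAGED_FREE_TIER = 10_000;   // free monthly active users
const SELF_HOSTED_FLAT = 25;        // $ per month for a small server plus database

// Monthly bill from a managed provider at a given MAU count.
function managedCost(mau: number): number {
  return Math.max(0, mau - MANAGED_FREE_TIER) * MANAGED_PRICE_PER_MAU;
}

// Smallest MAU count at which self-hosting becomes cheaper.
function breakEvenMau(): number {
  return MANAGED_FREE_TIER + SELF_HOSTED_FLAT / MANAGED_PRICE_PER_MAU;
}

console.log(managedCost(50_000)); // ≈ 800
console.log(breakEvenMau());      // ≈ 11250
```

&lt;p&gt;With these placeholder numbers the lines cross at roughly 11,000 monthly active users. The point is not the exact figure; it is that a flat server cost and a per-user price always cross somewhere, and you should know where before the bill tells you.&lt;/p&gt;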




&lt;h2&gt;
  
  
  The Lock-In Math
&lt;/h2&gt;

&lt;p&gt;The part nobody talks about clearly is the cost of leaving each provider in two years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Clerk&lt;/strong&gt; means exporting your users, which you can do, and then figuring out how to handle sessions, OAuth connections, and password hashes in whatever you migrate to. Clerk hashes passwords with bcrypt, which most systems accept. OAuth connections and MFA factors do not transfer cleanly. In practice, a Clerk migration means asking every user to re-verify and reconnect social providers. That is a real UX cost and a real drop in active users during the transition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Supabase Auth&lt;/strong&gt; is easier on paper because the data is in your Postgres. In practice, Supabase Auth uses its own password hashing and session model, and the hashes can be migrated with care. You lose the integration with RLS policies that referenced the auth identity, so any migration needs to rethink the data access layer. The users table is yours. The glue is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Better Auth&lt;/strong&gt; is trivial by comparison. Your user table is your user table. You can plug a different auth library into the same schema. Sessions live in your database. There is no meaningful lock-in to unwind.&lt;/p&gt;

&lt;p&gt;If you are building something long-term and not sure what you will need in three years, lower lock-in is more valuable than lower effort today. That is the argument for Better Auth in any situation where the other providers do not have a clear advantage.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About Auth0, Firebase Auth, and Everyone Else?
&lt;/h2&gt;

&lt;p&gt;Worth a quick mention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth0&lt;/strong&gt; is enterprise auth. It is priced for companies with budgets, not solo developers. If you are building a business product that will sell to enterprises, it is reasonable. For everything else, it is overkill. Okta acquired Auth0 years ago, which added polish and also added price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firebase Auth&lt;/strong&gt; is still fine, but Firebase as a whole has lost momentum relative to Supabase for new projects. Google has kept it alive but not modernized it in a way that keeps pace with what developers want in 2026. I would not start a new project on it unless I was already committed to Firebase elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NextAuth / Auth.js&lt;/strong&gt; is in a weird place. It pioneered the open-source TypeScript auth space but has struggled with direction changes and breaking upgrades. Better Auth is the spiritual successor and has captured most of the energy NextAuth had two years ago. If you are on NextAuth and it works, fine. For a new project in 2026, Better Auth is the more active and better-designed choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WorkOS&lt;/strong&gt; is SSO-first, priced for B2B companies shipping to enterprises. If you need SAML, SCIM, and enterprise SSO from day one, it is the right answer. For consumer or prosumer products, it is the wrong shape.&lt;/p&gt;

&lt;p&gt;Everyone else is either a niche solution, an abandoned project, or marketing material. The three I covered in detail cover the real decision space for solo developers in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing I Wish Someone Had Told Me Earlier
&lt;/h2&gt;

&lt;p&gt;The dimension I underweighted every time I picked auth is how much the provider shapes the way I think about users.&lt;/p&gt;

&lt;p&gt;On Clerk, I think about users as records in their system that I reference by ID. On Supabase Auth, users are rows in a table I can query. On Better Auth, users are whatever my application says they are. These feel like implementation details until you are a year in and trying to do something the provider did not anticipate, like merging accounts, supporting a weird login method, or building a multi-tenant feature where one user belongs to many workspaces.&lt;/p&gt;

&lt;p&gt;Providers that own less of your user model give you more flexibility later. Providers that own more give you a faster start. Neither is wrong. Both have a price.&lt;/p&gt;

&lt;p&gt;The mistake I made twice was picking the faster start every time, without noticing that the flexibility cost was being paid later in cramped, frustrated workarounds. For my next product, I am picking the flexibility. Your calculus may be different, but this is the dimension most people do not weight properly.&lt;/p&gt;

&lt;p&gt;Pick auth like you pick a database. It is going to be there for a long time. The switching cost is higher than the setup cost. The feature differences matter less than the shape of what you can build on top of it.&lt;/p&gt;

&lt;p&gt;Whatever you pick, make sure you understand what you are actually committing to. If you cannot describe in one sentence what you would do to migrate off your auth provider, you do not understand the commitment you are making. That is true at 10 users and it is true at 10,000. The difference is only how much work it is to fix once it matters.&lt;/p&gt;

&lt;p&gt;For most solo developers shipping in 2026, my default is now Better Auth with Postgres and a simple sessions table. Clerk when UI polish is a business requirement. Supabase Auth when the whole product is already running on Supabase. That is the shape of the decision in one paragraph. The rest is detail. The &lt;a href="https://dev.to/blog/shipping-speed-only-strategy-2026"&gt;shipping speed question&lt;/a&gt; will push you toward whichever one lets you move fastest on your specific product. Use that instinct, but weight the long-term lock-in cost at least as much.&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>startup</category>
      <category>saas</category>
    </item>
    <item>
      <title>AI SDK v6: The Practical Guide to Shipping AI Features Without Vendor Lock-In (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:58:19 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/ai-sdk-v6-the-practical-guide-to-shipping-ai-features-without-vendor-lock-in-2026-17m0</link>
      <guid>https://dev.to/alexcloudstar/ai-sdk-v6-the-practical-guide-to-shipping-ai-features-without-vendor-lock-in-2026-17m0</guid>
      <description>&lt;p&gt;I spent most of last year bolting AI features onto products the wrong way.&lt;/p&gt;

&lt;p&gt;Direct provider SDK. Hardcoded model string. A streaming response handler I copy-pasted from a blog post. It worked. It also meant that every time a new model came out, switching took half a day of untangling types and rewriting stream parsers that I did not remember writing in the first place.&lt;/p&gt;

&lt;p&gt;The AI SDK v6 is the fix I wish I had a year ago.&lt;/p&gt;

&lt;p&gt;If you have been putting off building AI features into your product because the ecosystem felt chaotic, or if you tried the AI SDK a year ago and bounced off it, this is the update worth coming back to. The abstractions finally match how people actually build. The streaming story is coherent. Provider switching is a one-line change. And the tool-use and agent patterns are good enough that you can ship real features instead of demos.&lt;/p&gt;

&lt;p&gt;Here is what actually changed, what it unlocks, and the patterns I use day to day now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the AI SDK Actually Is
&lt;/h2&gt;

&lt;p&gt;Before the v6 specifics, a quick grounding for anyone who has heard the name and never used it.&lt;/p&gt;

&lt;p&gt;The AI SDK is a TypeScript library that gives you one API for talking to every major model provider. You write your code once against the SDK. Underneath, it talks to OpenAI, Anthropic, Google, Mistral, open-source models via Ollama or Together, and anything else with a compatible adapter. Switching providers is a string change, not a rewrite.&lt;/p&gt;

&lt;p&gt;It also handles the parts of AI work that are annoying to get right: streaming tokens to the browser, tool calls, structured output, chat state, retries, tracing. You can write those yourself. I have. You are not going to do a better job than the SDK for most products, and you will spend weeks on plumbing that does not differentiate your product from anyone else's.&lt;/p&gt;

&lt;p&gt;The v6 release refined all of that and added a few things that change what is reasonable to build solo.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is New in AI SDK v6
&lt;/h2&gt;

&lt;p&gt;The headline changes matter less individually than they do together. Each one looks like a polish pass on its own. Used together, they change the shape of what feels worth building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, the provider abstraction got simpler.&lt;/strong&gt; You used to import a provider package and configure it. Now you can pass a plain &lt;code&gt;"provider/model"&lt;/code&gt; string through the AI Gateway and the SDK handles the wiring. Switching from &lt;code&gt;"anthropic/claude-opus-4-7"&lt;/code&gt; to &lt;code&gt;"openai/gpt-5"&lt;/code&gt; is a string edit. Fallbacks between providers are first-class. If you have been watching the model landscape thrash around and wanted to avoid committing to one, this is the feature that matters.&lt;/p&gt;
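&lt;p&gt;As a mental model for what first-class fallback buys you, this is roughly the loop the gateway runs on your behalf. An illustrative sketch of the behavior, not the SDK's actual implementation:&lt;/p&gt;

```typescript
// Illustrative sketch of gateway-style fallback: try each "provider/model"
// string in order and move on when a provider errors out. Not the SDK's
// real internals.
async function generateWithFallback(
  models: string[],
  generate: (model: string) => Promise<string>,
): Promise<string> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await generate(model);
    } catch (err) {
      lastError = err; // provider outage or rate limit: try the next model
    }
  }
  throw lastError;
}
```

&lt;p&gt;The value of having this built in is that the ordered list of model strings becomes configuration, and an outage at one lab degrades to a slower answer instead of an error page.&lt;/p&gt;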

&lt;p&gt;&lt;strong&gt;Second, tools and agents are real primitives.&lt;/strong&gt; The &lt;code&gt;tool()&lt;/code&gt; helper and the agent loop are good enough that you can build agentic features without importing a separate framework. I used to reach for LangChain for anything with more than two steps. I do not anymore. The SDK's agent pattern is simpler, more debuggable, and stays out of your way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, streaming got cleaner.&lt;/strong&gt; The &lt;code&gt;streamText&lt;/code&gt;, &lt;code&gt;streamObject&lt;/code&gt;, and &lt;code&gt;streamUI&lt;/code&gt; APIs converged on a consistent shape. The React hooks (&lt;code&gt;useChat&lt;/code&gt;, &lt;code&gt;useObject&lt;/code&gt;, &lt;code&gt;useCompletion&lt;/code&gt;) work against the same streaming protocol the server sends. If you are using Next.js or any other framework, the end-to-end flow is the most boring it has ever been, which is the highest praise an AI streaming story has ever deserved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, structured output is actually trustworthy.&lt;/strong&gt; &lt;code&gt;generateObject&lt;/code&gt; and &lt;code&gt;streamObject&lt;/code&gt; with a Zod schema produce outputs that match your types. Not "probably match." Match. The SDK retries and reprompts if the model drifts. You get validated TypeScript objects out the other side, and you can rely on them in your product code without a second layer of defensive parsing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fifth, observability is built in.&lt;/strong&gt; OpenTelemetry tracing is not an afterthought. You can see every prompt, every model call, every tool invocation, and every retry in a tracing UI without writing your own logger. When something goes wrong in production, you can actually see what happened.&lt;/p&gt;

&lt;p&gt;These are the load-bearing changes. Everything else is cleanup around the edges.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model That Makes the SDK Click
&lt;/h2&gt;

&lt;p&gt;The thing that took me too long to internalize is that the AI SDK is not trying to be a framework. It is trying to be a standard library for AI features, the way &lt;code&gt;fetch&lt;/code&gt; is a standard library for HTTP.&lt;/p&gt;

&lt;p&gt;Once you look at it that way, the API makes sense. There are four core functions you use 90% of the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;generateText&lt;/code&gt; when you want a string back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;streamText&lt;/code&gt; when you want to stream a string back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generateObject&lt;/code&gt; when you want a typed object back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;streamObject&lt;/code&gt; when you want to stream a typed object back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else is a variation on those four themes. Tools attach to any of them. Agents are &lt;code&gt;streamText&lt;/code&gt; in a loop. Chat is &lt;code&gt;streamText&lt;/code&gt; with message history threaded through. Structured output is &lt;code&gt;generateObject&lt;/code&gt; with a schema.&lt;/p&gt;

&lt;p&gt;If you understand the four core functions, you understand the SDK. The rest is ergonomics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Starting From Scratch: The Minimum Viable Setup
&lt;/h2&gt;

&lt;p&gt;Here is the smallest example that does something real. A Next.js App Router route that streams a chat response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/api/chat/route.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDataStreamResponse&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole backend, a handful of lines. On the client side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleSubmit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useChat&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt; &lt;span class="na"&gt;onSubmit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleSubmit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;: &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;onChange&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that makes the SDK worth using. The hook knows the protocol. The protocol is standardized. The server streams. The client renders. No custom fetch logic, no SSE parser, no state machine to debug.&lt;/p&gt;

&lt;p&gt;You can build a working chat UI in about ten minutes. That is not hype. It is the first time I have been able to write that sentence honestly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Provider Switching Without the Wincing
&lt;/h2&gt;

&lt;p&gt;One of the realities of shipping AI features in 2026 is that the best model for your use case changes every few weeks. GPT leads on one benchmark, Claude leads on another, and a Chinese open-source model nobody had heard of last month is suddenly competitive for half the price.&lt;/p&gt;

&lt;p&gt;If your code is coupled to one provider, you either ignore those changes and fall behind or you eat the rewrite cost every time you switch. Both options are bad.&lt;/p&gt;

&lt;p&gt;The AI SDK solves this by making model selection a configuration value rather than a structural dependency. With the Vercel AI Gateway, you can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CHAT_MODEL&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now switching models is an environment variable change. No code deploy. No provider package swap. If you want A/B testing between providers, you can set the model per request based on user ID, cohort, or feature flag.&lt;/p&gt;
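&lt;p&gt;Per-request selection is a few lines of plain code. Here is one way to do the cohort split: a stable hash of the user ID, so each user always sees the same model across sessions. The two model strings and the 50/50 split are illustrative choices, not recommendations:&lt;/p&gt;

```typescript
// Deterministic model assignment for per-user A/B testing: the same user
// always gets the same model. Model strings and the 50/50 split are
// illustrative, not recommendations.
function modelForUser(userId: string): string {
  // Tiny stable string hash (FNV-1a); any stable hash works here.
  let hash = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return (hash >>> 0) % 2 === 0
    ? 'anthropic/claude-opus-4-7'
    : 'openai/gpt-5';
}
```

&lt;p&gt;Pass the result as the &lt;code&gt;model&lt;/code&gt; value and log it alongside your quality metrics, and you have a provider experiment without touching infrastructure.&lt;/p&gt;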

&lt;p&gt;This is more important than it sounds. It means your product is not tied to the fortunes of any single lab. If Anthropic raises prices or OpenAI ships a better model, you can move without an engineering project. That is the main reason I default to gateway-style model strings now rather than direct provider packages, even though both work.&lt;/p&gt;

&lt;p&gt;The only time I reach for a provider-specific package is when I need a feature that is not in the gateway abstraction yet, like specific fine-tuning hooks. For 95% of product work, the gateway is the right default.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tools: Where the SDK Stops Being a Demo Library
&lt;/h2&gt;

&lt;p&gt;The real leap for products comes from tools. If you have only used the SDK for chat completions, you have seen the easy half. Tools are what turn a language model into something that can do work in your application.&lt;/p&gt;

&lt;p&gt;The pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchOrders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Find orders for the current user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;shipped&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delivered&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;searchOrders&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;maxSteps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model decides when to call the tool. You provide the schema and the implementation. The SDK handles the back-and-forth protocol, validates the arguments, calls your function, feeds the result back to the model, and continues the conversation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;maxSteps&lt;/code&gt; parameter is important. Without it, the SDK runs a single step: the model can request a tool call, but the loop ends before the model sees the result. With it, you get multi-step reasoning. The model can call a tool, see the result, decide to call another tool, and keep going until it has what it needs to answer.&lt;/p&gt;

&lt;p&gt;This is where the line between "chatbot with API calls" and "agent" starts to blur. If you set &lt;code&gt;maxSteps&lt;/code&gt; to 10 and give the model a few well-designed tools, you have built an agent. There is no separate agent framework to learn. The surface area is the same as the chat surface area, with tools attached.&lt;/p&gt;
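&lt;p&gt;If the loop feels abstract, this sketch shows the control flow that &lt;code&gt;maxSteps&lt;/code&gt; bounds. The types here are stand-ins for illustration, not the SDK's actual API:&lt;/p&gt;

```typescript
// Illustrative control flow of the agent loop that maxSteps bounds.
// The real SDK implements this protocol for you; the types here are
// stand-ins, not the SDK's actual API.
type ModelTurn =
  | { type: 'text'; text: string }
  | { type: 'tool-call'; tool: string; args: unknown };

async function agentLoop(
  model: (history: unknown[]) => Promise<ModelTurn>,
  tools: Record<string, (args: unknown) => Promise<unknown>>,
  maxSteps: number,
): Promise<string> {
  const history: unknown[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(history);
    if (turn.type === 'text') return turn.text; // model decided it is done
    // Model requested a tool: run it and feed the result back in.
    const result = await tools[turn.tool](turn.args);
    history.push({ call: turn, result });
  }
  throw new Error('maxSteps exhausted before the model finished');
}
```

&lt;p&gt;Every agent framework is some dressed-up version of this loop. The SDK's version just happens to share its surface area with the chat primitives you already know.&lt;/p&gt;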

&lt;p&gt;I covered the broader question of what agent memory and state management looks like in &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;my guide on AI agent memory&lt;/a&gt; if you want to go deeper on the stateful side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Structured Output: The Feature That Changes What You Build
&lt;/h2&gt;

&lt;p&gt;Most of the AI features I see in products do not need chat at all. They need a structured result. Classify this ticket. Extract the invoice fields. Summarize this page into three bullet points. Generate a title, a description, and three tags for this upload.&lt;/p&gt;

&lt;p&gt;For those, &lt;code&gt;generateObject&lt;/code&gt; is the function you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Summarize this article: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;article&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// object is typed as { title: string; tags: string[]; summary: string }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You pass a Zod schema. You get back a validated object that matches it. The SDK handles the prompting, retries invalid outputs, and gives you something your TypeScript compiler is happy with.&lt;/p&gt;

&lt;p&gt;This changes what is worth building. A year ago, adding an AI feature to a product meant writing a prompt, parsing freeform text, and dealing with edge cases where the model wrapped its response in markdown or added commentary. Today, it means writing a schema and a prompt.&lt;/p&gt;

&lt;p&gt;The reliability question matters here. If you have tried this before and been burned by the model returning invalid output, the v6 retry logic is meaningfully better. The SDK reprompts with the validation error included, and modern models are good at correcting themselves on the second pass. In my experience, structured output with a reasonable schema succeeds on the first try over 95% of the time, and the retry catches most of the rest.&lt;/p&gt;

&lt;p&gt;If your schema is extremely strict or the task is ambiguous, you can still get failures. Keep schemas lenient where they can be, and design prompts so the model has room to succeed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Streaming UI: When You Want More Than Text
&lt;/h2&gt;

&lt;p&gt;For a long time, AI output in apps meant streaming text into a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;. That is still the right answer for chat. For more structured experiences, the SDK gives you two better options.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;streamObject&lt;/code&gt; streams a structured object as it is being generated. You see partial data arrive as JSON fields fill in. If you are generating a form, a table, or a card layout, this is the right primitive. The user sees the skeleton fill in rather than waiting for the whole thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;partialObjectStream&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;heading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;})),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Generate a blog post about...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;partialObjectStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// { title: "The ...", sections: [{ heading: "Why..." }] }&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;streamUI&lt;/code&gt; (in frameworks that support it) lets you stream actual React components. The model picks which component to render and what props to pass. You define the components. This is the shape of the experience if you have used v0 or similar tools. It is powerful and it is niche. For 90% of products, &lt;code&gt;streamObject&lt;/code&gt; plus your own rendering layer is simpler and easier to reason about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Patterns That Keep Working
&lt;/h2&gt;

&lt;p&gt;After a year of building features with the SDK, a few patterns have settled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put the model string in config, not code.&lt;/strong&gt; Do this even if you have no plans to switch. Future-you will thank present-you when a better model ships and you want to try it in five minutes instead of an afternoon.&lt;/p&gt;
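&lt;p&gt;A sketch of what that looks like. The env var names and fallback model ids here are illustrative, not SDK requirements:&lt;/p&gt;

```typescript
// Resolve model ids from config rather than hard-coding them at call sites.
// Env var names and fallback ids below are illustrative placeholders.
export function modelFromConfig(
  env: Record<string, string | undefined>,
  role: "reasoning" | "cheap",
): string {
  const defaults = {
    reasoning: "anthropic/claude-opus-4-7", // override via AI_REASONING_MODEL
    cheap: "anthropic/claude-haiku",        // override via AI_CHEAP_MODEL
  } as const;
  const key = role === "reasoning" ? "AI_REASONING_MODEL" : "AI_CHEAP_MODEL";
  return env[key] ?? defaults[role];
}
```

Call sites then ask for a role (`modelFromConfig(process.env, "reasoning")`) and never mention a model id, so trying a new model is a one-line config change.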

&lt;p&gt;&lt;strong&gt;Start with &lt;code&gt;generateObject&lt;/code&gt;, not &lt;code&gt;generateText&lt;/code&gt;.&lt;/strong&gt; Every time I have written &lt;code&gt;generateText&lt;/code&gt; in product code, I have eventually rewritten it as &lt;code&gt;generateObject&lt;/code&gt; because I needed structure. Skip the intermediate step. If the output is going anywhere other than a chat bubble, it should be structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use tools sparingly and name them well.&lt;/strong&gt; A model with three well-named tools outperforms a model with fifteen confusingly named ones. Each tool is a decision point for the model. More tools means more chances to pick wrong.&lt;/p&gt;
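&lt;p&gt;Concretely, "well-named" means the name plus description tells the model exactly when to reach for it. A plain-object sketch (the real SDK wraps these in its &lt;code&gt;tool()&lt;/code&gt; helper with schemas and an &lt;code&gt;execute&lt;/code&gt; function; names here are hypothetical):&lt;/p&gt;

```typescript
// Three narrowly scoped, clearly named tools beat fifteen vague ones.
// Plain objects for illustration only -- the SDK adds input schemas
// and execute handlers on top of this shape.
const tools = {
  searchOrders: { description: "Find orders by customer email or order id" },
  refundOrder: { description: "Issue a refund for a single order id" },
  emailCustomer: { description: "Send a templated email to one customer" },
} as const;

// Anti-pattern: names like `doStuff`, `helper2`, or `processData` give
// the model no signal about which tool fits the current step.
```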

&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;maxSteps&lt;/code&gt; on every agentic call.&lt;/strong&gt; The default is 1, which is safe. Pick a number that matches your use case. Higher means more capability and more cost. I usually start at 5 and adjust from there based on traces.&lt;/p&gt;
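&lt;p&gt;The loop behind that cap is simple, and seeing it makes clear why the number bounds both cost and latency. A sketch with a stubbed step function; the real SDK drives model and tool round-trips, and &lt;code&gt;runAgent&lt;/code&gt; here is hypothetical:&lt;/p&gt;

```typescript
// Each "step" is one model round-trip that may request another tool call.
// The agent stops when the model says it is done OR when the step budget
// runs out -- whichever comes first.
type StepResult = { done: boolean; output: string };

function runAgent(
  step: (stepIndex: number) => StepResult,
  maxSteps = 5,
): { output: string; steps: number } {
  let last: StepResult = { done: false, output: "" };
  let steps = 0;
  while (steps < maxSteps && !last.done) {
    last = step(steps);
    steps++;
  }
  return { output: last.output, steps };
}
```

Every extra step is another paid model call, which is why starting low and raising the cap based on traces is the safer direction.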

&lt;p&gt;&lt;strong&gt;Add tracing before you need it.&lt;/strong&gt; Enable OpenTelemetry from day one. The cost of setup is an hour. The value the first time something goes wrong in production is weeks. Observability for AI features is not optional if you care about reliability. I go deeper on this in &lt;a href="https://dev.to/blog/production-observability-solo-developer-2026"&gt;my production observability guide for solo developers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat model output as untrusted input.&lt;/strong&gt; Sanitize anything you send to the browser, the database, or another system. The model will sometimes return something weird. That is not a bug in the model. It is the nature of the work. Validate at the boundary.&lt;/p&gt;
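&lt;p&gt;For browser-bound output, that means escaping before render, same as any user input. A minimal escaper for illustration; in production, prefer your framework's auto-escaping or a vetted library:&lt;/p&gt;

```typescript
// Model output headed for the DOM gets escaped like any other untrusted
// input. Minimal sketch -- frameworks like React do this automatically
// for text content, and a vetted sanitizer is the right tool for HTML.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

The same principle applies at the database and downstream-API boundaries: parameterized queries and schema validation, never string interpolation of raw model output.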




&lt;h2&gt;
  
  
  What It Costs, and How to Keep That Under Control
&lt;/h2&gt;

&lt;p&gt;The question that kills more AI features than anything else is not "does it work?" It is "does it pay for itself?"&lt;/p&gt;

&lt;p&gt;The AI SDK does not change the cost of the models you call. A Claude request costs what a Claude request costs. What it does give you is the tooling to keep those costs visible and manageable.&lt;/p&gt;

&lt;p&gt;Three things keep my costs predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching identical prompts.&lt;/strong&gt; If the same prompt is going to run many times, cache the result. The SDK has integrations for this. You can also do it yourself with a hash of the input. For any feature where the input space is bounded, caching is free performance and free money saved.&lt;/p&gt;
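&lt;p&gt;The do-it-yourself version is a few lines. A sketch with an in-memory &lt;code&gt;Map&lt;/code&gt; and a stand-in &lt;code&gt;generate&lt;/code&gt; function; in production you would back this with Redis or similar and include every input that affects the output in the hash:&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Cache results keyed by a hash of the full input (model + prompt), so
// identical requests never pay for a second generation. In-memory Map
// for illustration; swap in a shared store for real deployments.
const cache = new Map<string, string>();

function cachedGenerate(
  model: string,
  prompt: string,
  generate: (model: string, prompt: string) => string, // stand-in for the real call
): string {
  const key = createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: zero tokens spent
  const result = generate(model, prompt);
  cache.set(key, result);
  return result;
}
```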

&lt;p&gt;&lt;strong&gt;Using cheap models for cheap tasks.&lt;/strong&gt; Not every AI call needs the biggest model. Classification tasks, simple extraction, and routing logic work fine on smaller, cheaper models. I default to the big model only for tasks that need real reasoning. Everything else goes to the cheap tier.&lt;/p&gt;
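&lt;p&gt;This routing can be a single function at the boundary. A sketch; the task categories and model ids are placeholders for whatever your product actually uses:&lt;/p&gt;

```typescript
// Route by task complexity, not by habit. Only tasks that need real
// reasoning get the big model; everything else goes to the cheap tier.
// Model ids are illustrative placeholders.
type Task = "classify" | "extract" | "route" | "reason";

function modelFor(task: Task): string {
  switch (task) {
    case "reason":
      return "anthropic/claude-opus-4-7"; // big model: real reasoning only
    default:
      return "anthropic/claude-haiku"; // cheap tier: classification, extraction, routing
  }
}
```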

&lt;p&gt;&lt;strong&gt;Rate limiting and spend caps per user.&lt;/strong&gt; If your product has AI features available to users, set limits. A single user burning through your budget because they found a prompt injection or a runaway loop is a pattern I have seen too many times. The AI Gateway has spend caps built in. Use them. I wrote about this in more detail in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization for production&lt;/a&gt;.&lt;/p&gt;
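&lt;p&gt;Even if you rely on the Gateway's caps, the shape of the check is worth having in your own code too. A sketch with illustrative numbers and an in-memory store:&lt;/p&gt;

```typescript
// Track estimated spend per user and refuse calls past a cap. The cap,
// the cost estimate, and the Map store are all illustrative -- a real
// deployment persists this and reconciles against provider billing.
const spentCents = new Map<string, number>();

function tryCharge(userId: string, costCents: number, capCents = 500): boolean {
  const current = spentCents.get(userId) ?? 0;
  if (current + costCents > capCents) {
    return false; // over budget: reject BEFORE calling the model
  }
  spentCents.set(userId, current + costCents);
  return true;
}
```

The important detail is that the check runs before the model call, so a runaway loop burns at most one budget, not an unbounded bill.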

&lt;p&gt;Once you have those three patterns in place, AI features become a predictable line item rather than a variance risk. That is the state you want to be in if you are shipping anything that runs in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Not to Use the AI SDK
&lt;/h2&gt;

&lt;p&gt;It is a boring take, but worth saying out loud. The SDK is not the right choice for everything.&lt;/p&gt;

&lt;p&gt;If you are building a pure chatbot against one provider and you are certain you will never switch, you can get by with that provider's SDK directly. You will save one layer of indirection and lose some TypeScript ergonomics. For most products this is a wash. For extremely minimal integrations, it is simpler.&lt;/p&gt;

&lt;p&gt;If you need a feature that only one provider has and the SDK has not abstracted it yet, use the provider SDK. You can still use the AI SDK for 90% of calls and drop down to the raw SDK for the 10% that need the specific capability.&lt;/p&gt;

&lt;p&gt;If you are doing heavy fine-tuning, custom inference, or deploying your own models, the SDK is not really the layer you need. It is a client library, not a model ops platform.&lt;/p&gt;

&lt;p&gt;For everything else, which is most of what anyone is shipping, the AI SDK is the right default.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reason This Matters More in 2026 Than It Did Last Year
&lt;/h2&gt;

&lt;p&gt;The economics of AI features changed in the last twelve months.&lt;/p&gt;

&lt;p&gt;A year ago, shipping a real AI feature meant spending two weeks on plumbing for every week spent on the feature itself. Streaming, tool calls, retries, observability, switching providers, handling structured output. Each one was a small project. Together they added up to a tax that made AI features feel expensive relative to what they delivered.&lt;/p&gt;

&lt;p&gt;Today, the plumbing is solved. You write a prompt, a schema, maybe a tool, and you ship. The SDK absorbs the infrastructure work. That means the economics tilt back toward the feature itself. You can prototype in a day. You can ship in a week. You can iterate on prompts and models without touching architecture.&lt;/p&gt;

&lt;p&gt;This is the unglamorous version of the AI revolution, and it is the one that actually changes what gets built. Not the demos. The features that ship in products your users never think of as "AI features" because they just work.&lt;/p&gt;

&lt;p&gt;If you have been watching the AI space and waiting for the right moment to actually build, the tooling has caught up. The remaining blocker is picking a problem worth solving. That part is on you.&lt;/p&gt;

&lt;p&gt;For what it is worth, the problems worth solving right now are boring ones. Automated tagging. Smarter search. Better onboarding. Things that were impossible or too expensive to build with traditional code, now trivial. Pick one of those, ship it in a week, and see what it does for your product before you try to build anything more ambitious.&lt;/p&gt;

&lt;p&gt;The tools are ready. The cost is manageable. The patterns are clear. The only thing left is building the thing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Claude Design vs Figma: What the 7% Stock Drop Actually Signals (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:47:34 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/claude-design-vs-figma-what-the-7-stock-drop-actually-signals-2026-4kdd</link>
      <guid>https://dev.to/alexcloudstar/claude-design-vs-figma-what-the-7-stock-drop-actually-signals-2026-4kdd</guid>
      <description>&lt;p&gt;The morning Claude Design went live, Figma's stock dropped approximately 7%.&lt;/p&gt;

&lt;p&gt;Three days earlier, Mike Krieger had resigned from Figma's board.&lt;/p&gt;

&lt;p&gt;Neither of those things is proof of anything on its own. Together, they are a market signal worth reading carefully rather than dismissing as coincidence or overreacting to as inevitability.&lt;/p&gt;

&lt;p&gt;Here is what I think is actually happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a 7% Drop Means and What It Does Not
&lt;/h2&gt;

&lt;p&gt;A 7% single-day move on a specific product launch is a directional signal, not a verdict.&lt;/p&gt;

&lt;p&gt;Markets are not always right, but they are often early. The drop says that enough investors believe Claude Design poses a credible threat to enough of Figma's business to reprice the risk. It does not say Figma is finished. It says the competitive picture just became more complicated, and the people with money on the line updated their view.&lt;/p&gt;

&lt;p&gt;Krieger's timing makes the signal harder to dismiss. He is the co-founder of Instagram and was one of Figma's more prominent board members. Board members resign for many reasons. A resignation three days before a competitor launches a product that moves your stock 7% is the kind of coincidence worth noting without over-interpreting.&lt;/p&gt;

&lt;p&gt;What I am more interested in is the underlying question the market was actually asking: Is Figma's moat what we thought it was?&lt;/p&gt;




&lt;h2&gt;
  
  
  How Figma Built Its Position
&lt;/h2&gt;

&lt;p&gt;Figma's moat was never the drawing tools. It was collaboration.&lt;/p&gt;

&lt;p&gt;When Figma launched, it was the first design tool that let multiple people work on the same file at the same time in a browser. That sounds obvious now. In 2016 it was a genuine unlock. It turned design from a handoff problem into a shared space problem, and it built a network effect around teams. Once your design system lived in Figma, everything else had to work with Figma. The tool became infrastructure.&lt;/p&gt;

&lt;p&gt;Adobe recognized this when it tried to acquire Figma for $20 billion in 2022. The deal was abandoned in late 2023 under regulatory pressure, but the attempted acquisition told you everything about how seriously the industry took Figma's position. You do not offer $20 billion for a drawing tool. You offer it for a platform that sits at the center of how product teams work.&lt;/p&gt;

&lt;p&gt;That position is what Claude Design is challenging, and it is challenging it from an angle Figma did not see coming.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Claude Design Threatens Figma's Moat
&lt;/h2&gt;

&lt;p&gt;Figma's moat is built around designers. The collaboration features, the component libraries, the design tokens, the developer handoff workflows. All of it assumes that someone on your team knows how to use Figma, and that the output needs to flow back into a design system.&lt;/p&gt;

&lt;p&gt;Claude Design does not care about design systems. It cares about output.&lt;/p&gt;

&lt;p&gt;If you search for a direct Claude Design vs Figma comparison right now, you will find almost nothing, because the two tools are not competing on the same terrain yet. The workflow is: describe what you need in plain language, get a layout, refine it in conversation, export it or hand it to Claude Code. There is no layer of design expertise required. There is no component library to maintain. The whole model is built for people who need design output but are not designers.&lt;/p&gt;

&lt;p&gt;That is a different customer than Figma has always served. And it is a much larger market.&lt;/p&gt;

&lt;p&gt;Canva figured this out years ago on the consumer end. Drag-and-drop templates for people who need something credible in 15 minutes. Canva's growth proved that there is enormous demand for design output from people who are not designers. Figma watched that happen and stayed in its lane, focused on professional design teams.&lt;/p&gt;

&lt;p&gt;Claude Design is not Canva. But it is coming at the same large market from the other direction: not simpler templates, but smarter generation. The output ceiling is higher. The entry barrier is lower. And critically, it integrates into the development workflow in a way that neither Figma nor Canva does. I covered what that integration actually looks like in &lt;a href="https://dev.to/blog/claude-design-review-developer-first-look-2026"&gt;my first-impressions piece on Claude Design&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Shifts Arriving at the Same Time
&lt;/h2&gt;

&lt;p&gt;What makes this moment interesting is not just Claude Design. It is three things arriving simultaneously.&lt;/p&gt;

&lt;p&gt;First, generation quality has cleared the credibility threshold for many use cases. Eighteen months ago, AI-generated designs looked like AI-generated designs. The 3x vision resolution in &lt;a href="https://dev.to/blog/claude-opus-4-7-review-benchmarks-developer-guide-2026"&gt;Claude Opus 4.7&lt;/a&gt; and the generation quality of Claude Design's output are good enough for landing pages, onboarding flows, and product marketing pages. Once that threshold clears, the category changes.&lt;/p&gt;

&lt;p&gt;Second, the handoff problem is being solved differently. Figma's answer to "how does the design get into the product" is developer mode and Figma APIs. Claude Design's answer is: the same model that generated the design can write the code for it. That is not a better handoff workflow. It is a different mental model entirely.&lt;/p&gt;

&lt;p&gt;Third, the tools are converging into platforms. Claude Code, Claude Design, Claude.ai, and Canva export are building a loop that keeps context inside one ecosystem. Every piece makes the others more useful. Figma is a powerful tool inside a workflow. Claude is becoming the workflow.&lt;/p&gt;

&lt;p&gt;None of these shifts are complete. But all three are moving in the same direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Figma Is Actually Doing
&lt;/h2&gt;

&lt;p&gt;Figma has not been standing still. The product has been adding AI capabilities steadily: generative features, auto-layout assistance, AI-powered prototyping suggestions. These are real additions that improve the experience for designers already inside the product.&lt;/p&gt;

&lt;p&gt;The problem is the architecture. Figma's AI features are capabilities layered onto an existing design-system workflow. They make Figma better at what Figma already does. Claude Design is generative from the ground up. It does not help you work inside a design system faster. It skips the design system entirely for a large class of use cases.&lt;/p&gt;

&lt;p&gt;Figma's response to this moment needs to be more than feature additions. The question is whether they can reorient the product around a workflow that starts with language rather than canvas. That is a harder problem than shipping a generation plugin, and the timeline for it matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Developers Should Pay Attention To
&lt;/h2&gt;

&lt;p&gt;The stock drop will get most of the attention. The more important signal for developers is the handoff change.&lt;/p&gt;

&lt;p&gt;If the model that generates your design can also generate your components, the concept of a design-to-development handoff starts to dissolve. You do not hand off from Figma to an engineer. You iterate in conversation until the output is both visually right and code-ready, then ship it.&lt;/p&gt;

&lt;p&gt;That workflow does not exist cleanly yet. Claude Design has bugs. The Claude Code handoff is promising but not seamless on complex components. The token allowance is separate and finite. None of this is production-ready for teams running complex design systems.&lt;/p&gt;

&lt;p&gt;But the direction is clear. The question for developers is not whether this replaces your current tooling today. It is whether the tooling you are investing in now will be the tooling you are investing in two years from now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do With This Information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you are a developer who has avoided learning Figma:&lt;/strong&gt; You were probably right to wait. The path from description to component is getting short enough that Figma proficiency may not be the skill worth acquiring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are a designer:&lt;/strong&gt; The threat is not that Claude Design replaces you. It is that the floor for design output rises, which compresses the value of mid-level design work while making senior design judgment more important. The developers you work with will start arriving with better starting points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are building a product on top of Figma's API:&lt;/strong&gt; Understand what part of your integration relies on Figma as infrastructure versus Figma as a preference. The infrastructure parts deserve more attention than they are probably getting right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you hold Figma stock:&lt;/strong&gt; The 7% drop was the market repricing the risk. Not a verdict. The company has strong fundamentals, a real network effect, and years of runway to respond. But the response needs to be visible soon. AI features bolted onto the existing model will not be enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Signals for Design Tooling in 2026
&lt;/h2&gt;

&lt;p&gt;The Figma moment is not the end of Figma. It is the beginning of a period where the assumptions that built Figma's moat need to be reexamined.&lt;/p&gt;

&lt;p&gt;The assumption that design expertise is required to produce credible design output is already weakening. The assumption that the design-to-development handoff requires a dedicated step is being challenged. The assumption that collaboration lives in a dedicated design tool rather than in a shared AI context is less obvious than it was two years ago.&lt;/p&gt;

&lt;p&gt;Those are not Figma-specific problems. They are questions about what the design tooling category even is in 2026.&lt;/p&gt;

&lt;p&gt;The 7% drop on a single launch day is the market asking those questions out loud. Whether Figma has good answers is the more interesting question, and we will know a lot more in the next two quarters than we do right now.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>design</category>
    </item>
    <item>
      <title>Claude Design Review: First Impressions for Developers (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:47:34 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/claude-design-review-first-impressions-for-developers-2026-273l</link>
      <guid>https://dev.to/alexcloudstar/claude-design-review-first-impressions-for-developers-2026-273l</guid>
      <description>&lt;p&gt;I opened Claude.ai on Friday morning and there was a new palette icon in the left nav. No announcement email, no push notification. Just a new product sitting there, available to Pro subscribers.&lt;/p&gt;

&lt;p&gt;That is how I found out Claude Design had launched.&lt;/p&gt;

&lt;p&gt;I have spent the last two days testing it. Not for polished marketing pages or investor decks. For the kind of design work I actually do as a solo developer: landing pages, onboarding flows, dashboard layouts, feature announcements. Work where I need something that looks credible but I am not going to hire a designer for.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Design Actually Is
&lt;/h2&gt;

&lt;p&gt;Claude Design is Anthropic's generative design tool — their entry into a category that has been quietly filling up with AI-first competitors. It runs on &lt;a href="https://dev.to/blog/claude-opus-4-7-review-benchmarks-developer-guide-2026"&gt;Claude Opus 4.7&lt;/a&gt;, the model that shipped two days before it with a 3x vision resolution upgrade. It is in research preview right now, available to Pro, Max, Team, and Enterprise subscribers.&lt;/p&gt;

&lt;p&gt;The workflow is simpler than it sounds. You describe what you want in chat. A canvas appears. You refine by continuing the conversation, leaving inline comments, editing directly, or using custom sliders that Claude builds for your specific design context. There is no separate app to install and no Figma plugin to configure. It lives inside Claude.ai.&lt;/p&gt;

&lt;p&gt;The output options are comprehensive: zip, PDF, PPTX, HTML, Canva export, or direct handoff to Claude Code. That last one matters more than it sounds, and I will get to it.&lt;/p&gt;

&lt;p&gt;Onboarding asks you to connect your codebase and existing design files. It reads them to build a design system that reflects your actual project. Whether that works well in practice depends heavily on what you feed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers That Tell the Real Story
&lt;/h2&gt;

&lt;p&gt;Two early case studies from Anthropic are worth paying attention to because they set realistic expectations better than any spec sheet.&lt;/p&gt;

&lt;p&gt;Brilliant, the interactive learning platform, replicated pages in 2 prompts that previously required 20 or more in other tools. Datadog compressed a week-long design cycle into a single conversation.&lt;/p&gt;

&lt;p&gt;Those numbers are interesting. They are also from two well-resourced teams with clear briefs and existing design systems. For a solo developer running this for the first time, expect the prompt count to be higher and the cycle to be longer than Datadog's. The tool is genuinely capable. Your inputs still determine your outputs.&lt;/p&gt;

&lt;p&gt;The number that landed harder than any case study: Figma's stock dropped approximately 7% on launch day. Three days before that, Mike Krieger resigned from Figma's board. Whether or not his departure was related, the market read the timing as a signal. A 7% single-day drop on a specific product launch is not noise. That is people making a bet about where design tooling is heading.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Workflow Actually Feels Like
&lt;/h2&gt;

&lt;p&gt;The conversation-to-canvas loop is the part that works best.&lt;/p&gt;

&lt;p&gt;You describe a page, Claude generates it, you push back in natural language. "Make the hero section less cluttered." "Move the CTA above the fold." "The font feels too corporate." The model understands design feedback phrased the way a non-designer would phrase it, which is the entire point.&lt;/p&gt;

&lt;p&gt;The custom sliders were the feature I did not expect to care about and ended up finding genuinely useful. Claude builds controls specific to what you are designing, things like "density," "formality," or "contrast," tuned to the particular design rather than a generic preset list. It is a small thing but it makes iteration faster because you are not rewriting the same prompt four times to nudge something by a few degrees.&lt;/p&gt;

&lt;p&gt;The Claude Code handoff is where the developer-specific value shows up. When you finish a design you are happy with, you pass it directly into Claude Code and get working component code out the other side. The loop closes: describe it, design it, build it, inside one context. For anyone already running an &lt;a href="https://dev.to/blog/agentic-coding-2026"&gt;agentic coding workflow&lt;/a&gt;, this is the feature that makes Claude Design more than a standalone design exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Limitations
&lt;/h2&gt;

&lt;p&gt;The separate weekly token allowance is the constraint that will hit you first.&lt;/p&gt;

&lt;p&gt;Claude Design does not share tokens with your existing chat or Claude Code usage. It has its own pool, reset weekly. In my first two days of testing I have not hit the ceiling, but I also have not been running it at full capacity on a real project. For anyone doing production design work across a full week, the token separation matters. You are not getting unlimited design generation on your existing subscription. You are getting a fixed design token budget sitting alongside your existing fixed budget for everything else.&lt;/p&gt;

&lt;p&gt;The known bugs are worth flagging before you build any workflow around this. Inline comments sometimes disappear mid-session. Compact view has a save error bug. These are research preview issues and they will get fixed, but if you are evaluating this for team use right now, treat the current build as early access software, not a stable product.&lt;/p&gt;

&lt;p&gt;The codebase onboarding works better the more organized your project is. If you have a coherent design system already documented, Claude Design will pick it up and apply it consistently. If your codebase is a mix of ad-hoc styles with no design token structure, the generated designs will be technically valid but will not feel like they belong to your product. The model reflects what you feed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Is Not
&lt;/h2&gt;

&lt;p&gt;Anthropic has been clear that Claude Design is not trying to replace Canva. It is built to complement and export into Canva. If you are doing social graphics, branded content, or presentation templates, Canva still wins on template depth and collaborative editing.&lt;/p&gt;

&lt;p&gt;Claude Design is for generative layout work where you are starting from a brief rather than a template. Landing pages, feature walkthroughs, onboarding screens, product marketing pages. Work where you need to go from a concept to something visually credible without a designer in the loop.&lt;/p&gt;

&lt;p&gt;It is also not trying to compete on the same ground as tools like Framer AI or Galileo, which operate closer to the design system and component layer. Claude Design's advantage is not template intelligence or component variety. It is that the model generating your layout is the same model writing your code. That integration is the moat, not the design quality in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Should Developers Use Claude Design Right Now?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you are a solo developer or indie founder who ships without a design budget:&lt;/strong&gt; This is worth your time right now. The 2-prompt capability that Brilliant demonstrated is not universal, but the baseline quality is high enough to produce credible landing pages without a design background. The Claude Code handoff makes it part of a real development workflow rather than a standalone design exercise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are on a Pro plan and curious:&lt;/strong&gt; The palette icon is already in your left nav. Give it one real task from your current project. The cost is time, not tokens, as long as you stay within the weekly allowance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are evaluating this for your team:&lt;/strong&gt; Wait one or two builds. The inline comment bug and compact view save error are the kind of friction that will slow down shared workflows. Research preview means the product is usable, not that it is stable. Test it, but do not build your Q2 design process around it yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are a Figma user:&lt;/strong&gt; The stock drop is worth paying attention to, but not a reason to change your tooling today. Claude Design and Figma are not direct substitutes at this point. One generates layouts from a conversation. The other manages component libraries and design systems at scale. They are adjacent, not identical.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;What Anthropic is building with Claude Design is not a design tool in isolation. It is a loop.&lt;/p&gt;

&lt;p&gt;Claude Code handles the build. Claude.ai handles the thinking. Claude Design handles the layout. Canva handles the branded content. They all export to each other, hand off to each other, and the context stays inside the Anthropic ecosystem.&lt;/p&gt;

&lt;p&gt;That loop is worth paying attention to as a developer, not because it is complete right now but because it is the direction. Every new Claude product makes the others more useful and raises the switching cost. Whether you call that a platform or a walled garden depends on how much you value the convenience versus the lock-in. I wrote more about what that lock-in strategy actually means for developers in &lt;a href="https://dev.to/blog/claude-design-vs-figma-stock-drop-signals-2026"&gt;the companion piece to this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For right now: Claude Design works, the bugs are real, and the Claude Code handoff is the feature that makes it worth testing if you are a developer specifically. Give it one real task. The palette icon is already there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>design</category>
    </item>
    <item>
      <title>Testing AI-Generated Code: How to Actually Know If It Works</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:48:51 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/testing-ai-generated-code-how-to-actually-know-if-it-works-16di</link>
      <guid>https://dev.to/alexcloudstar/testing-ai-generated-code-how-to-actually-know-if-it-works-16di</guid>
      <description>&lt;p&gt;I shipped a bug to production in January that embarrassed me. Not a subtle bug. A bug where a rate limiting function the AI wrote silently swallowed errors and returned true for every request, which meant our rate limiter was not actually rate limiting anything. The function looked fine on a visual scan. The TypeScript compiled. My quick test of the happy path worked. I merged it.&lt;/p&gt;

&lt;p&gt;The rate limiter failure showed up a week later when someone ran 4,000 requests in two minutes and our costs spiked.&lt;/p&gt;

&lt;p&gt;Here is what made it worse: the AI (Claude Code, in this case) had actually written a comment in the code that the error handling was a placeholder. I had not read that line carefully. I trusted the overall shape of the code without really testing it.&lt;/p&gt;

&lt;p&gt;That experience changed how I approach &lt;a href="https://dev.to/blog/ai-code-review-bottleneck-2026"&gt;AI-generated code review&lt;/a&gt;. Not by trusting AI less, but by building a real testing process instead of relying on "it looks right."&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Code Needs Different Testing Habits
&lt;/h2&gt;

&lt;p&gt;The danger is not that AI writes bad code. It writes good code most of the time. The danger is that it writes confident-looking code consistently, which dulls your instinct to check carefully.&lt;/p&gt;

&lt;p&gt;When a junior developer writes code that does something unexpected, there is a natural flag in your brain. This person is learning. There might be edge cases they missed. You slow down.&lt;/p&gt;

&lt;p&gt;AI code does not trigger that flag. It looks like polished, professional code. The variable names are sensible, the function structure is clean, the comments are there. So you scan it the way you would scan code from a senior engineer you trust, not the way you would test code from someone who confidently writes plausible-but-wrong implementations a few percent of the time.&lt;/p&gt;

&lt;p&gt;But that few percent matters. On a large codebase where &lt;a href="https://dev.to/blog/agentic-coding-2026"&gt;agentic coding tools&lt;/a&gt; are writing hundreds of functions per week, a few percent failure rate is a lot of bugs in flight.&lt;/p&gt;

&lt;p&gt;The other thing that changes with AI code: the failure modes are different. Human bugs tend to cluster around the things humans find cognitively hard. Off-by-one errors. Race conditions. Forgetting to handle a specific edge case the author did not think of.&lt;/p&gt;

&lt;p&gt;AI bugs are often more subtle. The AI knows the edge cases. It will handle them, but sometimes with logic that is plausible rather than correct. It handles the error case by returning a default value that happens to be wrong in production context. It implements a security check correctly for the example in its training data but misses an edge case specific to your implementation.&lt;/p&gt;

&lt;p&gt;This means you need tests, not just code review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Testing Gap in AI-Assisted Development
&lt;/h2&gt;

&lt;p&gt;There is a pattern I keep seeing. Developers use AI to write application code at 3x to 5x their previous speed. Then they use AI to write tests, but in a way that just adds more code, not more confidence.&lt;/p&gt;

&lt;p&gt;"Write tests for this function" produces tests that test the same logic the function implements. The happy path passes. The cases the AI thought of are covered. But the tests are written by the same reasoning process that wrote the code, which means they share the same blind spots.&lt;/p&gt;

&lt;p&gt;This is the testing gap in AI-assisted development. You have more code, but not more verification. The test suite looks comprehensive and provides almost no additional safety beyond TypeScript compilation.&lt;/p&gt;

&lt;p&gt;Real testing for AI-generated code requires something different: testing driven by your understanding of the problem, not AI's understanding of the code it just wrote.&lt;/p&gt;




&lt;h2&gt;
  
  
  Static Analysis First
&lt;/h2&gt;

&lt;p&gt;The cheapest form of testing is static analysis. It costs almost nothing to set up and it catches a real category of AI bugs before they reach your test suite.&lt;/p&gt;

&lt;p&gt;TypeScript is your first layer, but you need to use it properly. This means strict mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compilerOptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"strict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"noUncheckedIndexedAccess"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exactOptionalPropertyTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;noUncheckedIndexedAccess&lt;/code&gt; flag is particularly useful for AI-generated code because AI often writes array access patterns that look correct but do not handle the undefined case. Turning this flag on surfaces those issues immediately.&lt;/p&gt;

&lt;p&gt;ESLint with relevant plugins catches things TypeScript misses. If you are writing Node.js backend code, &lt;code&gt;eslint-plugin-security&lt;/code&gt; flags common security anti-patterns that AI sometimes introduces. If you are writing React, &lt;code&gt;eslint-plugin-react-hooks&lt;/code&gt; catches dependency array mistakes that Claude gets wrong maybe 10% of the time.&lt;/p&gt;

&lt;p&gt;Biome is worth considering as a replacement for the ESLint setup if you want a faster, more opinionated static analysis tool. It ships with 200+ built-in rules and runs fast enough to use in a pre-commit hook without slowing down your workflow.&lt;/p&gt;

&lt;p&gt;The point is not to run any of these manually. Put them in your CI pipeline and run them as a pre-commit hook locally. AI-generated code should pass static analysis before it is even reviewed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing What AI Gets Wrong
&lt;/h2&gt;

&lt;p&gt;Once you have static analysis in place, you need tests that specifically target the failure modes of AI-generated code. This means thinking adversarially about what the AI might have gotten wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Boundary and Edge Case Testing
&lt;/h3&gt;

&lt;p&gt;AI code often handles the happy path and the most obvious edge cases correctly. It struggles with the boundaries that are specific to your system rather than the general category of problem.&lt;/p&gt;

&lt;p&gt;For any function that processes user input or external data, write tests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The minimum and maximum valid values&lt;/li&gt;
&lt;li&gt;One step outside each boundary (what happens with -1 when 0 is the minimum valid value?)&lt;/li&gt;
&lt;li&gt;Empty and null inputs, even if the type system says they should not exist&lt;/li&gt;
&lt;li&gt;Inputs that are technically valid but semantically unusual (an email address that is 254 characters long, which is valid per spec)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have a pattern I call "assume it is wrong at the edges." For every AI-generated function that transforms data, I write at least three tests for inputs outside the expected range before I look at the implementation. This forces me to think about the contract rather than the implementation, and it often catches places where the contract is not what I assumed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling Verification
&lt;/h3&gt;

&lt;p&gt;The bug I described at the start was an error handling bug. The function swallowed an exception and returned a default value. This is one of the most common AI bug patterns I have seen: technically valid error handling that is semantically wrong.&lt;/p&gt;

&lt;p&gt;Write explicit tests that verify error propagation. Do not just test that the function returns the right thing in the success case. Test that it fails the right way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;throws when the rate limit store is unavailable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;mockRedis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mockRejectedValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Connection refused&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nx"&gt;rejects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Connection refused&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;does not allow requests through when the store check fails&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;mockRedis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mockRejectedValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Timeout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// fail closed, not fail open&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is different from just testing the happy path. You are testing the failure contract. AI code that swallows errors and returns defaults will fail these tests immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrency and Race Conditions
&lt;/h3&gt;

&lt;p&gt;This is the failure mode most likely to survive review and reach production. AI code often handles single-threaded logic correctly while introducing subtle race conditions in concurrent scenarios.&lt;/p&gt;

&lt;p&gt;If you are writing any code that deals with shared state, queues, or async operations that could run in parallel, write tests that explicitly check concurrent behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;correctly handles concurrent rate limit checks for the same user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;checkRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;remainingCounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;remainingCounts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;toBeGreaterThan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;remainingCounts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concurrency bugs are hard to reliably reproduce through testing, but making the intent explicit in your test suite at least forces you to think about the concurrent behavior and documents the expected contract.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration Tests for AI-Written Modules
&lt;/h2&gt;

&lt;p&gt;Unit tests catch individual function bugs. Integration tests catch the bugs that emerge when AI-written code interacts with your actual system.&lt;/p&gt;

&lt;p&gt;The place AI code most commonly breaks in integration is at the boundary with external services. The AI knows the general shape of how an API works. It might not know your specific version's behavior, your account's limits, or the edge cases in how the service responds to malformed requests.&lt;/p&gt;

&lt;p&gt;Write integration tests that actually hit your services in a staging environment. Not mocked versions. Real calls.&lt;/p&gt;

&lt;p&gt;For database operations specifically, this means tests that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually write to and read from a test database&lt;/li&gt;
&lt;li&gt;Check that transactions roll back correctly when something fails mid-way&lt;/li&gt;
&lt;li&gt;Verify that index usage is correct by checking query plans for slow-path queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For external API calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run against a sandbox or staging environment where the API supports it&lt;/li&gt;
&lt;li&gt;Test response handling with actual API responses, not hardcoded response bodies the AI invented&lt;/li&gt;
&lt;li&gt;Verify that retry logic works by intentionally inducing failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I know this is more setup than mocking. It is worth it. &lt;a href="https://dev.to/blog/production-observability-solo-developer-2026"&gt;Production observability&lt;/a&gt; catches bugs after they hit users. Integration tests against real services catch a class of bugs that unit tests cannot, before they ship at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Property-Based Testing for Complex Logic
&lt;/h2&gt;

&lt;p&gt;If you have never used property-based testing, AI-generated code is a good reason to start. The idea is simple: instead of writing specific test cases, you describe properties that should always hold, and the testing framework generates hundreds of random inputs to verify those properties.&lt;/p&gt;

&lt;p&gt;For AI-generated parsing, validation, or transformation code, property-based tests are particularly powerful because the AI's blind spots tend to be in the input space, not the logic space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate limiter never allows more requests than the limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asyncProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="nx"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="nx"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;minLength&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;requestCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createRateLimiter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;windowMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;allowedCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;requestCount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;limiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;allowedCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allowedCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test generates hundreds of random combinations of limits, request counts, and user identifiers (fast-check runs 100 cases per property by default, and you can raise that with &lt;code&gt;numRuns&lt;/code&gt;). If your rate limiter ever allows more requests than the configured limit for any generated combination, the property fails. This is a much stronger guarantee than writing five specific test cases.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;fast-check&lt;/code&gt; library is the best TypeScript option. For Python, &lt;code&gt;hypothesis&lt;/code&gt; is the standard. Both integrate well with standard testing frameworks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Code Review Step That Actually Matters
&lt;/h2&gt;

&lt;p&gt;Tests catch what you specify. Code review catches what you did not think to specify.&lt;/p&gt;

&lt;p&gt;The AI code review that matters is not checking for style or obvious bugs. TypeScript and linting handle those. The review that matters is checking semantic correctness: does this code do what we actually need it to do?&lt;/p&gt;

&lt;p&gt;This requires reading the code with the problem in mind, not the implementation. Ask yourself:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is this code supposed to guarantee?&lt;/strong&gt; Not what does it do, but what invariant does it enforce? If you cannot state that clearly, the code is not ready to ship regardless of whether it looks right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens to the user if this fails?&lt;/strong&gt; If the function returns a wrong value, what does the downstream code do? If it throws an exception, where does that get caught? Tracing the failure path through the system surfaces bugs that isolated code review misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changed compared to what was there before?&lt;/strong&gt; Diff review is natural for human-written code. With AI-generated code, especially when an agent refactors or extends existing functionality, the diff can be large and the subtle behavioral change can be in a small part of a big diff. Read the diff. Do not just read the final file.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/blog/ai-generated-code-technical-debt-2026"&gt;technical debt&lt;/a&gt; that accumulates from AI-generated code that was not reviewed properly is particularly insidious. It looks like clean code. It behaves mostly correctly. And it hides architectural problems that compound over months.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Testing Habit Into Your Workflow
&lt;/h2&gt;

&lt;p&gt;The testing process I have described is not a one-time thing you do when you remember. It needs to be part of how you work with AI coding tools, not something you bolt on afterward.&lt;/p&gt;

&lt;p&gt;Here is the workflow I have settled on after months of intensive AI-assisted development:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before asking the AI to write code&lt;/strong&gt;, write the test specification. Not the tests themselves, but a list of what behaviors the tests will need to verify. This forces you to think about the contract before you see the implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;While the AI is writing the code&lt;/strong&gt;, write the edge case tests. You know the problem. You know where the boundaries are. Write those tests before you read what the AI produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When reviewing the AI output&lt;/strong&gt;, run the tests you wrote first. Do not start by reading the code. See which tests pass and which fail. Then read the code to understand why the failing tests failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before merging&lt;/strong&gt;, run static analysis, your unit tests, and at least the integration tests that cover the new code. Not as a formality, as an actual gate.&lt;/p&gt;

&lt;p&gt;This sounds like more process than it is. Once it becomes habit, the overhead is maybe 20% additional time compared to just accepting AI output. The time saved from not debugging production bugs more than pays for that overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI Evals Connection
&lt;/h2&gt;

&lt;p&gt;There is a related skill that goes beyond testing individual functions: evaluating AI behavior at the system level. If you have built any AI-powered features into your product, you need a way to measure whether those features are actually working correctly across a range of inputs.&lt;/p&gt;

&lt;p&gt;I wrote about &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; in more depth, but the principle is the same: systematic verification beats eyeballing outputs. The same instinct that makes you write unit tests for your application code should make you want structured evaluation for any AI behavior you are shipping to users.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/blog/ai-generated-code-security-risks-2026"&gt;security risks specific to AI-generated code&lt;/a&gt; are also worth understanding as a separate category. Not all security issues will be caught by the testing approaches I described here. Prompt injection, over-permissioned tool access, and data leakage through AI context are security categories that need their own review practices.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do Right Now
&lt;/h2&gt;

&lt;p&gt;If you are using AI coding tools heavily and do not have a real testing process in place, start here.&lt;/p&gt;

&lt;p&gt;First: add &lt;code&gt;strict: true&lt;/code&gt; and &lt;code&gt;noUncheckedIndexedAccess: true&lt;/code&gt; to your TypeScript config. Run the build. Fix the errors. These are bugs, not style choices.&lt;/p&gt;

&lt;p&gt;Second: pick the three most important functions that AI has written in your codebase in the last month. Write explicit error propagation tests for each one. If any of them fail, you now know you have a bug in production.&lt;/p&gt;

&lt;p&gt;Third: add &lt;code&gt;fast-check&lt;/code&gt; to your test dependencies and write one property-based test for the most complex data transformation in your codebase. Run it and see what happens.&lt;/p&gt;

&lt;p&gt;The goal is not 100% coverage or a comprehensive testing strategy document. The goal is to stop trusting AI code purely on visual inspection and start having enough automated verification that you can be confident the code does what you need it to do.&lt;/p&gt;

&lt;p&gt;Fast AI code and reliable AI code are not the same thing. A real testing process is what bridges the gap.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>SaaS Churn Is Killing Your Business. Here Is What to Do About It (Without a Support Team)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:48:50 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/saas-churn-is-killing-your-business-here-is-what-to-do-about-it-without-a-support-team-jo4</link>
      <guid>https://dev.to/alexcloudstar/saas-churn-is-killing-your-business-here-is-what-to-do-about-it-without-a-support-team-jo4</guid>
      <description>&lt;p&gt;I knew something was wrong with my first SaaS product when I checked the dashboard one morning and found a dozen new sign-ups alongside a dozen cancellations. The sign-ups felt great. The cancellations felt like getting kicked in the stomach.&lt;/p&gt;

&lt;p&gt;What I did not understand at the time was the math. I was growing by adding new users. I was also growing a hole in the bottom of the bucket at exactly the same rate. Six months later I had the same number of customers I started with and twice the stress.&lt;/p&gt;

&lt;p&gt;Here is the number that changed how I think about this: reduce monthly churn from 8% to 3% and your average customer stays nearly three times as long. The same product. The same acquisition cost. Nearly three times the lifetime value.&lt;/p&gt;

&lt;p&gt;That is not a marginal improvement. That is the difference between a SaaS business that works and one that requires you to sprint harder every month just to stay in place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Churn Equation Every Solo Founder Needs to Know
&lt;/h2&gt;

&lt;p&gt;Before you can fix churn, you need to understand what is actually happening.&lt;/p&gt;

&lt;p&gt;Monthly churn rate is the percentage of customers who cancel in a given month. If you have 100 customers and 8 cancel, your monthly churn rate is 8%. Sounds manageable. But run that math forward: an 8% monthly churn means the average customer stays about 12 months. A 3% monthly churn means they stay 33 months.&lt;/p&gt;

&lt;p&gt;For a $49/month product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At 8% churn: average LTV is $612&lt;/li&gt;
&lt;li&gt;At 3% churn: average LTV is $1,617&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same product. Same price. Same acquisition cost. Your marketing budget goes 2.6 times further when churn is under control.&lt;/p&gt;

&lt;p&gt;The metric that captures this most usefully is Net Revenue Retention (NRR). NRR measures whether the revenue from your existing customers is growing or shrinking over time, accounting for upgrades, downgrades, and cancellations. Good bootstrapped SaaS businesses target 100%+ NRR. That means your existing customer base is generating the same or more revenue than last month even before you add a single new customer.&lt;/p&gt;

&lt;p&gt;If your NRR is below 100%, you are in a leaky bucket situation. No amount of acquisition will save you from the math.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Two Types of Churn (and Why They Need Different Fixes)
&lt;/h2&gt;

&lt;p&gt;Not all churn is the same. Treating it as one thing is why most retention efforts fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voluntary churn&lt;/strong&gt; is when a customer actively decides to leave. They cancel their subscription because the product is not delivering enough value, because they found a better alternative, or because their situation changed and they no longer need what you offer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Involuntary churn&lt;/strong&gt; is when a customer loses their subscription due to payment failure. Expired credit cards, insufficient funds, card number changes after a bank reissue. No decision, just a failed transaction.&lt;/p&gt;

&lt;p&gt;For most solo SaaS founders, involuntary churn is a significant and underestimated portion of total churn. Research consistently finds that 20% to 40% of churn for subscription businesses is involuntary. Some businesses see even higher rates.&lt;/p&gt;

&lt;p&gt;This matters because the fix is completely different. You cannot onboard your way out of involuntary churn. You cannot improve your product features to prevent a credit card from expiring. You need a dedicated payment recovery system.&lt;/p&gt;

&lt;p&gt;Fix involuntary churn first. It is the highest-return, lowest-effort lever available to a solo founder because you are not fixing a product or positioning problem. You are fixing an infrastructure problem that has nothing to do with how good your product is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing Involuntary Churn: Payment Recovery
&lt;/h2&gt;

&lt;p&gt;If you are using Stripe, you already have the tools for basic payment recovery. The question is whether you have configured them properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart Retries
&lt;/h3&gt;

&lt;p&gt;Stripe's Smart Retries uses machine learning to retry failed payments at optimal times based on historical payment patterns. This alone recovers a significant portion of failed payments without any additional work. Make sure this is enabled in your Stripe settings. It is off by default for some account types.&lt;/p&gt;

&lt;p&gt;The default retry schedule matters. If you let Stripe retry aggressively over 48 hours and then mark the subscription as canceled, you are leaving money on the table. Extend your retry window to 7 to 14 days. Most involuntary churn from card issues resolves within that window if the customer notices and updates their payment method.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dunning Emails
&lt;/h3&gt;

&lt;p&gt;Smart retries handle the automatic recovery. Dunning emails handle the cases where the customer needs to take action to update their payment information.&lt;/p&gt;

&lt;p&gt;The sequence that works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email 1 (Payment failure day):&lt;/strong&gt; Friendly, no-blame notification. "Hey, we tried to process your payment and hit a snag. Your card ending in 4242 may have expired or had a different issue. Update your payment info here."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email 2 (3 days after failure):&lt;/strong&gt; Slightly more urgent, but still helpful. "Your account is still active but we have not been able to process your payment. Here is the link to update your payment method."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email 3 (7 days after failure):&lt;/strong&gt; Clear consequence statement. "Your subscription will be paused on [date] if we cannot process payment. This takes 60 seconds to fix."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email 4 (1 day before cancellation):&lt;/strong&gt; Final chance framing. Not guilt-inducing. Just clear. "Last reminder before your subscription pauses tomorrow."&lt;/p&gt;

&lt;p&gt;Three things make dunning emails work better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalize with the customer's name and the specific card details (last four digits, expiration date).&lt;/li&gt;
&lt;li&gt;Make the link to update payment information directly accessible, without requiring them to log in and navigate.&lt;/li&gt;
&lt;li&gt;Send at the right time of day; mid-morning in the customer's timezone tends to get the best response rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Win-Back for Lapsed Customers
&lt;/h3&gt;

&lt;p&gt;For customers who slipped through your recovery window and fully churned due to payment failure, set up a win-back sequence that fires 30 to 60 days after cancellation. These are customers who wanted to keep using your product but did not take action in time. A simple "we would love to have you back" email with an easy re-subscribe link recovers a surprising number of these.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing Voluntary Churn: The Onboarding Problem
&lt;/h2&gt;

&lt;p&gt;The majority of voluntary churn does not happen because your product eventually disappoints customers. It happens because customers never got far enough to see the value in the first place.&lt;/p&gt;

&lt;p&gt;If you check your churn timing data (Stripe and most billing tools will show you the distribution), you will likely find a spike in cancellations within the first 30 to 90 days. Sometimes within the first two weeks. This is the onboarding churn problem.&lt;/p&gt;

&lt;p&gt;The customer signed up with a specific problem in mind. They poked around your product for a bit, did not immediately see a clear path to solving that problem, and quietly decided it was not for them. By the time they cancel, they have probably not logged in for two weeks.&lt;/p&gt;

&lt;p&gt;The fix is not a better feature set. The fix is a better path from sign-up to first value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Define Your Aha Moment
&lt;/h3&gt;

&lt;p&gt;Every successful SaaS product has a moment where a new user thinks "oh, this is what it does." This is the aha moment. The job of your onboarding is to get every new user to that moment as fast as possible.&lt;/p&gt;

&lt;p&gt;For a project management tool, the aha moment might be completing their first task and seeing it move columns. For an analytics tool, it might be seeing their first chart populated with real data. For a communication tool, it might be the first reply from a teammate.&lt;/p&gt;

&lt;p&gt;What is the aha moment for your product? If you cannot describe it in one sentence, your onboarding is probably not optimized for it.&lt;/p&gt;

&lt;p&gt;Once you know the moment, look at your analytics (or even just manual user interviews) to understand how many users reach it. If 60% of users never complete the first task that triggers the aha moment, that is your most important product problem. Not the feature list. Not the UI. The path to first value.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Activation Email Sequence
&lt;/h3&gt;

&lt;p&gt;For solo founders without the bandwidth for live onboarding calls with every user, an email sequence that guides users toward the aha moment is the highest-leverage retention tool available.&lt;/p&gt;

&lt;p&gt;Day 1 email: Welcome and a single, specific first action. Not "get started" (vague). "Create your first [specific thing] by clicking this link." The link should deep-link them directly to the relevant part of the UI.&lt;/p&gt;

&lt;p&gt;Day 3 email: Triggered only if they have not completed the first action. "Here is a 90-second video showing how [the specific thing from day 1] works." Video beats text for activation because it shows the product working, not just describes it.&lt;/p&gt;

&lt;p&gt;Day 7 email: For users who have completed the first action but not taken the second one. Guides them to the next logical step in your intended user journey.&lt;/p&gt;

&lt;p&gt;Day 14 email: A "how is it going" check-in that invites them to reply. This one has a specific goal: identifying users who are struggling silently. The users who reply with "I'm confused about X" are users you can save. The ones who do not engage at all are candidates for proactive outreach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proactive Outreach at Risk Signals
&lt;/h3&gt;

&lt;p&gt;You do not need a dedicated customer success team to do proactive outreach. You need a few triggers that tell you when a customer is at risk.&lt;/p&gt;

&lt;p&gt;The simplest risk signal: the customer has not logged in for 14 days. For most SaaS products, a two-week absence is a strong predictor of cancellation. Set up an automation that sends you a Slack message or email when this happens. Then send a personal note to that customer.&lt;/p&gt;

&lt;p&gt;Not a template. A real personal note. Something like: "Hey [Name], noticed you have not been in [product] for a bit. Is there anything I can help you with? Even just pointing you to a specific feature?"&lt;/p&gt;

&lt;p&gt;This takes three minutes. It saves customers who would have quietly canceled. I have converted 30% to 40% of these outreach conversations into saved subscriptions, because the customer usually has a specific problem that is easy to solve and just needs someone to answer it.&lt;/p&gt;

&lt;p&gt;Other risk signals worth watching:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature adoption gaps.&lt;/strong&gt; If you have a feature that strongly correlates with long-term retention (most products do), watch for customers who have never used it. Reach out and specifically point them to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support ticket patterns.&lt;/strong&gt; Customers who open the same type of support ticket repeatedly are telling you something is confusing or broken. Fix the root cause, not just the individual tickets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan downgrade requests.&lt;/strong&gt; A customer asking to downgrade is not already gone. They are telling you they want to stay but at a different price point. Engage with what they are trying to accomplish and see if there is a conversation about value worth having.&lt;/p&gt;




&lt;h2&gt;
  
  
  Retention After the First 90 Days
&lt;/h2&gt;

&lt;p&gt;If a customer makes it through the first 90 days, voluntary churn drops significantly. But it does not disappear. Long-term customers churn for different reasons.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Value Reset Problem
&lt;/h3&gt;

&lt;p&gt;Customers who have been with you for 6 to 12 months often develop what I call the value reset problem. They remember roughly what the product costs. They have stopped noticing the value it delivers because the value has become part of their routine. When renewal time comes, the cost is salient and the value is not.&lt;/p&gt;

&lt;p&gt;The fix is regular value reinforcement. A quarterly email that says "here is what you accomplished with [product] this quarter" using whatever metrics you can pull from their account data is surprisingly effective. People like to be reminded of what they are getting.&lt;/p&gt;

&lt;p&gt;If you cannot pull personalized metrics, personalized by segment works too. "Teams using [product] typically save X hours per month on Y task." Not as good as account-specific data, but better than nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exit Interviews (And What to Do With Them)
&lt;/h3&gt;

&lt;p&gt;When a customer cancels, ask them why. Not in a guilt-inducing way. Genuinely.&lt;/p&gt;

&lt;p&gt;A simple off-boarding survey with one question ("What's the main reason you're canceling?"), five to eight predefined options, and a free-text field gives you data you will not get any other way. Cancellation reasons tend to cluster into a handful of categories, and understanding those categories tells you more about your product roadmap than any feature request list.&lt;/p&gt;

&lt;p&gt;Set up the survey with a tool like Typeform, triggered automatically when a subscription cancels in Stripe. Review the responses weekly. After a few months, patterns will emerge that point directly at fixable problems.&lt;/p&gt;

&lt;p&gt;The most actionable insight from exit interviews is often not what you expect. When I ran these for a previous product, the most common reason was not "too expensive" or "missing features." It was "I wasn't using it enough to justify the cost." That is a usage problem, which is an onboarding and engagement problem, which is fixable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retention Tech Stack for a Solo Founder
&lt;/h2&gt;

&lt;p&gt;You do not need an enterprise customer success platform. Here is what actually works at solo founder scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stripe + ChartMogul or Baremetrics&lt;/strong&gt; for your churn metrics. You need to see your churn rate by cohort, your MRR movements, and your LTV trends. Baremetrics is simpler and cheaper. ChartMogul is more powerful for segmentation. Pick one and actually look at it weekly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stripe's dunning configuration&lt;/strong&gt; plus a simple drip campaign in whatever email tool you use for your transactional emails. The dunning sequence does not need to be sophisticated. It needs to exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostHog or Mixpanel&lt;/strong&gt; for product analytics. You need to know which features correlate with retention and which customers are at risk based on usage patterns. PostHog has a generous free tier and is solid for solo founder scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simple CRM spreadsheet or Notion database&lt;/strong&gt; for tracking proactive outreach. You do not need Salesforce. You need a place to write "reached out to [customer] on [date], they said [thing], following up on [date]."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer.io or ConvertKit&lt;/strong&gt; for your onboarding email sequences. Both have the behavioral triggers you need (user logs in, user activates a feature, user goes dormant) without requiring enterprise setup.&lt;/p&gt;

&lt;p&gt;The total cost of this stack at early stage: $50 to $150 per month. The return on investment from reducing churn by even 2 percentage points on a $5,000 MRR business is enormous compared to that cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  Churn and Pricing Are Connected
&lt;/h2&gt;

&lt;p&gt;One thing most solo founders miss: churn rate and pricing are not independent variables. A product priced at $9 per month attracts a different type of customer than one priced at $79 per month, and the $9 customer churns at a much higher rate.&lt;/p&gt;

&lt;p&gt;Price-sensitive customers have the lowest switching cost and the weakest commitment. They sign up when there is a discount, they downgrade at the first friction, and they cancel as soon as a cheaper alternative appears. Raising your prices does not just improve margin. It selectively filters for customers who are more committed to the product because they have more skin in the game.&lt;/p&gt;

&lt;p&gt;If you are seeing high churn at a low price point, raising prices is often the counterintuitive lever that improves both. I covered the specifics of &lt;a href="https://dev.to/blog/saas-pricing-indie-hackers-2026"&gt;SaaS pricing for indie hackers&lt;/a&gt; in more detail, including how to handle the transition without losing your existing customer base.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do Right Now
&lt;/h2&gt;

&lt;p&gt;If you have a SaaS product with paying customers and have not thought seriously about retention, here is the order of operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First week:&lt;/strong&gt; Get your actual churn data. Pull it from Stripe or Baremetrics. Calculate your monthly churn rate for the last three months. Look at when in the customer lifecycle churn is most concentrated. You cannot fix what you have not measured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second week:&lt;/strong&gt; Set up proper dunning emails in Stripe. Enable Smart Retries if it is not already on. Enable the customer portal so customers can update their payment information without contacting you. This will immediately recover some involuntary churn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third week:&lt;/strong&gt; Look at your onboarding. Is there a clear path from sign-up to aha moment? Is there an email sequence that guides new users? If not, write and set up a three-email sequence (day 1, day 3, day 7) with specific calls to action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth week:&lt;/strong&gt; Set up your dormant user alert. Any customer who has not logged in for 14 days gets flagged. Send personal outreach to the first batch. See what you learn.&lt;/p&gt;

&lt;p&gt;Then do exit interviews for every cancellation going forward.&lt;/p&gt;

&lt;p&gt;The whole process does not require a team. It requires a few hours of setup and a weekly habit of looking at the data and acting on what it tells you. That is the job. Doing it consistently is what separates the &lt;a href="https://dev.to/blog/micro-saas-playbook-developer-guide-2026"&gt;micro SaaS products&lt;/a&gt; that compound over time from the ones that stay on the treadmill.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Math Is on Your Side
&lt;/h2&gt;

&lt;p&gt;Here is the way I think about the return on retention work now.&lt;/p&gt;

&lt;p&gt;Acquiring a new customer through paid ads, content, or cold outreach costs time and money. Retaining a customer who was about to leave costs time and a personal email. The economics are wildly different.&lt;/p&gt;

&lt;p&gt;If you spend three hours this week setting up a proper dunning sequence and it recovers two customers per month at $49 each, that is $1,176 in recovered annual revenue from three hours of work. No ad spend. No new features. Just plumbing that was missing.&lt;/p&gt;

&lt;p&gt;The same logic applies across the whole retention system. The &lt;a href="https://dev.to/blog/side-project-to-first-dollar-developer-monetization-2026"&gt;side project to first dollar&lt;/a&gt; moment is exciting. But the compounding that happens when you stop losing customers as fast as you gain them is where the real leverage lives.&lt;/p&gt;

&lt;p&gt;Fix the bucket before you fill it faster. The math will take care of the rest.&lt;/p&gt;

</description>
      <category>startup</category>
      <category>indiehacking</category>
      <category>saas</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Test AI-Generated Code Without Losing Your Mind (or Your Users)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:25:28 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/how-to-test-ai-generated-code-without-losing-your-mind-or-your-users-4lm6</link>
      <guid>https://dev.to/alexcloudstar/how-to-test-ai-generated-code-without-losing-your-mind-or-your-users-4lm6</guid>
      <description>&lt;p&gt;Last month I shipped a feature that looked perfect. The AI agent wrote the implementation in eight minutes. It generated a full test suite. Every test passed. The code review looked clean. I merged it on a Friday afternoon because I felt confident.&lt;/p&gt;

&lt;p&gt;By Monday morning, three users had reported corrupted data in their dashboards. The bug was in a data transformation function that silently rounded decimal values when they exceeded a specific precision threshold. The function worked correctly for 95% of inputs. The AI-generated tests only covered inputs that fell within the safe range. The AI that wrote the code and the AI that wrote the tests shared the same blind spot.&lt;/p&gt;

&lt;p&gt;That incident changed how I think about testing entirely. Not because AI tools are bad. I use them every day and I have &lt;a href="https://dev.to/blog/claude-code-vs-cursor-vs-github-copilot-2026/"&gt;written extensively about why&lt;/a&gt;. But because the testing instincts I developed over years of writing code by hand do not transfer cleanly to a workflow where AI generates most of the implementation.&lt;/p&gt;

&lt;p&gt;The problem is not that AI code is untestable. The problem is that most developers are testing it wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers That Should Worry You
&lt;/h2&gt;

&lt;p&gt;Before I get into strategy, you need to understand the scale of what we are dealing with.&lt;/p&gt;

&lt;p&gt;CodeRabbit analyzed thousands of pull requests across production systems and found that AI-generated code introduces 1.7 times more total issues than human-written code. Logic and correctness errors, the kind that actually break things for users, appear 75% more often. That is 194 additional logic errors per hundred pull requests compared to human-written code.&lt;/p&gt;

&lt;p&gt;The number that keeps me up at night is this one: 60% of AI code faults are silent failures. The code compiles. It passes tests. It looks correct during review. But it produces wrong results in production. You do not get an error message. You do not get a stack trace. You get corrupted data, wrong calculations, or incorrect behavior that users might not notice for days or weeks.&lt;/p&gt;

&lt;p&gt;VentureBeat reported that 43% of AI-generated code changes require manual debugging in production even after passing QA and staging tests. Veracode found that &lt;a href="https://dev.to/blog/ai-generated-code-security-risks-2026"&gt;45% of AI-generated code introduces security flaws&lt;/a&gt;. And a Sonar survey of developers confirmed what many of us already suspected: 96% do not fully trust the functional accuracy of AI-generated code.&lt;/p&gt;

&lt;p&gt;These numbers are not an argument against using AI tools. I still use them for the majority of my development work. But they are an argument for fundamentally rethinking how you test when AI is writing the code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Blind Spot Problem
&lt;/h2&gt;

&lt;p&gt;This is the core issue and it is surprisingly simple once you see it.&lt;/p&gt;

&lt;p&gt;When the same AI writes both the code and the tests, both outputs share the same assumptions about what "correct" means. The AI generates an implementation based on its understanding of the requirement. Then it generates tests that verify the implementation does what the implementation does. Not what the implementation should do. What it does.&lt;/p&gt;

&lt;p&gt;This creates tautological tests. Tests that pass by definition because they were reverse-engineered from the code they are testing. The test says "given input X, expect output Y" where Y is literally what the code already produces. If the code has a subtle logic error, the test will encode that same error as the expected behavior.&lt;/p&gt;

&lt;p&gt;I have seen this pattern dozens of times in my own work. The AI writes a sorting function that handles the common case but breaks on empty arrays. The AI-generated test suite includes fifteen test cases, all with non-empty arrays. The coverage report says 90%. The function ships. And the first user who hits the empty state gets a crash.&lt;/p&gt;

&lt;p&gt;The blind spot is not random. It is systematic. AI models are trained on code patterns and tend to test the patterns they generate. Edge cases, boundary conditions, and unusual inputs, the exact things that cause production failures, are consistently underrepresented in AI-generated test suites.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://dev.to/blog/ai-code-review-bottleneck-2026"&gt;traditional code review is also struggling&lt;/a&gt; with AI output. The code looks plausible. The tests look comprehensive. The review feels thorough. But the verification is circular.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Your Old Testing Habits Break Down
&lt;/h2&gt;

&lt;p&gt;If you learned to code before AI tools became standard, your testing habits were shaped by a specific workflow. You write code. You understand every line because you wrote it. You write tests that cover the cases you worried about while writing the implementation. Your tests reflect your mental model of the code.&lt;/p&gt;

&lt;p&gt;This workflow assumes deep comprehension of the implementation. When AI generates the code, that assumption breaks. You did not write it. You scanned it. You probably understood the general approach. But you did not make every micro-decision about error handling, type coercion, boundary conditions, and edge case coverage. The AI made those decisions, and it did not tell you which ones it was unsure about.&lt;/p&gt;

&lt;p&gt;The second habit that breaks is writing tests after the implementation. In a human-written workflow, tests-after-code works reasonably well because you remember the tricky parts. You think "I should test that null case because I almost forgot to handle it." With AI code, there is no "almost forgot" moment. The code appeared fully formed. You do not know which parts were tricky for the model and which were straightforward.&lt;/p&gt;

&lt;p&gt;The third habit that breaks is trusting coverage metrics. Eighty percent code coverage means something specific when a human writes the tests: someone thought about which lines matter and wrote assertions that exercise them. When AI generates tests to hit coverage targets, it can achieve 90% coverage with tests that verify almost nothing meaningful. The coverage number becomes a vanity metric.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test-First Is Not Optional Anymore
&lt;/h2&gt;

&lt;p&gt;Here is where I landed after months of getting burned: if AI is writing the implementation, a human needs to write the test expectations first. Not the full test code. The expectations. The "what should this thing actually do" part.&lt;/p&gt;

&lt;p&gt;This is test-driven development, but adapted for the AI workflow. The process looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step one: Write the test descriptions in plain language.&lt;/strong&gt; Before you prompt the AI to write anything, write down what the function or feature should do. Not how. What. Include the edge cases you care about. Include the inputs that would be embarrassing if they broke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step two: Convert those descriptions into test stubs or assertions.&lt;/strong&gt; You can use AI to help with the boilerplate, but the assertion values come from your understanding of the requirement, not from the AI's understanding of its own code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step three: Let the AI generate the implementation.&lt;/strong&gt; Now the agent has something to code against. The tests become the specification. If the implementation does not pass, the AI can iterate until it does. But the target, the definition of correct, came from you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step four: Review the implementation anyway.&lt;/strong&gt; Passing tests is necessary but not sufficient. You still need to check that the approach is sane, the &lt;a href="https://dev.to/blog/vibe-ceiling-ai-code-decision-framework-2026"&gt;architecture decisions are sound&lt;/a&gt;, and there are no security issues the tests would not catch.&lt;/p&gt;

&lt;p&gt;This workflow takes more time upfront than letting the AI generate everything. But it catches the category of bugs that matters most: silent logic errors that pass AI-generated tests and make it to production.&lt;/p&gt;

&lt;p&gt;The developers I talk to who have adopted this approach report a specific experience. The first week feels slower. By the third week, they are catching bugs that would have taken hours to debug in production. By the second month, the total time from feature request to shipped-and-stable is actually shorter because the debugging-in-production phase mostly disappears.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Six-Layer Testing Strategy
&lt;/h2&gt;

&lt;p&gt;Test-first development is the foundation, but it is not the complete picture. Here is the full strategy I use, layered from fastest feedback to slowest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Static Analysis on Every Save
&lt;/h3&gt;

&lt;p&gt;Before any test runs, static analysis catches entire categories of problems automatically. ESLint for JavaScript and TypeScript, Semgrep for security patterns, and the TypeScript compiler itself in strict mode (which you should be using).&lt;/p&gt;

&lt;p&gt;AI-generated code is three times more likely to have readability issues and significantly more likely to introduce patterns that static analysis tools flag. Running these on save, not just on commit, means you catch problems before they enter your mental model as "probably fine."&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Human-Written Test Expectations
&lt;/h3&gt;

&lt;p&gt;This is the test-first layer described above. You define what correct behavior looks like. The AI implements to meet that definition. The assertions are yours. The implementation is the AI's.&lt;/p&gt;

&lt;p&gt;For pure functions, this is straightforward. For more complex features, write acceptance criteria as test descriptions and let those guide what gets implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: AI-Generated Tests as a Supplement
&lt;/h3&gt;

&lt;p&gt;After the implementation passes your human-written tests, ask the AI to generate additional tests. These are useful for catching cases you did not think of. But treat them as suggestions, not as proof of correctness. Review the assertions. Check that they test meaningful behavior, not just "the function returns what the function returns."&lt;/p&gt;

&lt;p&gt;The goal here is coverage breadth, not coverage depth. The human-written tests provide depth on the cases that matter. The AI-generated tests provide breadth across the cases you might have missed.&lt;/p&gt;
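
&lt;p&gt;A contrived example of the difference, with a hypothetical &lt;code&gt;applyDiscount&lt;/code&gt; function:&lt;/p&gt;

```typescript
// A hypothetical pure function the AI has just implemented.
function applyDiscount(price: number, percent: number): number {
  if (percent > 100) throw new RangeError("percent cannot exceed 100");
  if (0 > percent) throw new RangeError("percent cannot be negative");
  return price - (price * percent) / 100;
}

// What AI-generated suites often produce: an assertion that derives its
// expected value from the implementation itself, so it can never fail.
const tautological = applyDiscount(200, 10) === applyDiscount(200, 10);

// What to keep: an assertion whose expected value was computed
// independently of the code under test.
if (applyDiscount(200, 10) !== 180) throw new Error("10% off 200 must be 180");
```

&lt;p&gt;When reviewing AI-generated tests, the question to ask of every assertion is: where did the expected value come from? If the answer is "from running the code," the test proves nothing.&lt;/p&gt;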

&lt;h3&gt;
  
  
  Layer 4: Adversarial Review with a Separate Prompt
&lt;/h3&gt;

&lt;p&gt;This is the practice that catches the most subtle bugs in my experience. After the code and tests are written, open a fresh context and prompt a different AI session to review the code specifically for bugs, edge cases, and security issues.&lt;/p&gt;

&lt;p&gt;The fresh context matters. The original AI session that wrote the code has accumulated assumptions about what "correct" means. A new session approaches the code like a code reviewer who has never seen it before. Prompt it to be adversarial: "Find bugs, edge cases, and security vulnerabilities in this code. Assume the implementation has at least one subtle logic error."&lt;/p&gt;

&lt;p&gt;This is the AI equivalent of getting a second pair of eyes on a pull request. It does not catch everything, but it catches things the original session's blind spots would miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Integration and End-to-End Tests
&lt;/h3&gt;

&lt;p&gt;Unit tests verify that individual functions work correctly. Integration and E2E tests verify that the system works correctly when all the pieces connect. AI-generated code is particularly prone to integration-level bugs because the model generates each piece in relative isolation.&lt;/p&gt;

&lt;p&gt;For any feature that touches data flow, external APIs, or multi-step user workflows, integration tests are not optional. These are the tests that catch "each function works perfectly but the system is broken" failures.&lt;/p&gt;
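
&lt;p&gt;A toy illustration of that failure class. Both functions below are hypothetical, and each is individually correct; the bug only exists at the seam between them, which is exactly where an integration test looks.&lt;/p&gt;

```typescript
// Unit-correct in isolation: returns the cart total in cents.
function cartTotalCents(prices: number[]): number {
  return prices.reduce((sum, p) => sum + p, 0);
}

// Also unit-correct in isolation: expects dollars.
function formatInvoice(totalDollars: number): string {
  return `$${totalDollars.toFixed(2)}`;
}

// The integration point. Without the unit conversion at this boundary,
// every unit test still passes while every invoice is 100x too large.
function invoiceFor(pricesCents: number[]): string {
  return formatInvoice(cartTotalCents(pricesCents) / 100);
}

// Integration test: exercises the wiring, not the pieces.
if (invoiceFor([1999, 500]) !== "$24.99") {
  throw new Error("integration contract broken");
}
```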

&lt;h3&gt;
  
  
  Layer 6: Production Monitoring Segmented by Code Origin
&lt;/h3&gt;

&lt;p&gt;This is the layer most teams skip, and it is the one that closes the loop. If you &lt;a href="https://dev.to/blog/production-observability-solo-developer-2026"&gt;track production errors&lt;/a&gt; and can identify which code was AI-generated versus human-written, you can measure the actual bug rate difference in your specific codebase.&lt;/p&gt;

&lt;p&gt;Not every team needs this level of granularity. But if you are shipping fast with AI tools and want to know whether your testing strategy is actually working, production monitoring is the only source of truth.&lt;/p&gt;
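
&lt;p&gt;The mechanics can be as simple as one extra tag on your error reports. The &lt;code&gt;reportError&lt;/code&gt; helper and the origin labels below are assumptions for the sake of the sketch, not a real SDK's API; in practice you would pass the origin as a tag to whatever error tracker you already use.&lt;/p&gt;

```typescript
type CodeOrigin = "ai" | "human" | "mixed";

interface ErrorReport {
  message: string;
  origin: CodeOrigin; // set per module, e.g. from a review annotation
  module: string;
}

const reports: ErrorReport[] = [];

// Stand-in for a real error-tracking SDK call; a tool like Sentry
// would take the origin as a tag on the event instead.
function reportError(err: Error, origin: CodeOrigin, module: string): void {
  reports.push({ message: err.message, origin, module });
}

// Segment production error counts by origin to measure the actual
// bug-rate difference in your own codebase.
function errorCountByOrigin(origin: CodeOrigin): number {
  return reports.filter((r) => r.origin === origin).length;
}
```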




&lt;h2&gt;
  
  
  What AI-Generated Tests Consistently Get Wrong
&lt;/h2&gt;

&lt;p&gt;After reviewing hundreds of AI-generated test suites, I see the same patterns repeatedly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing boundary tests.&lt;/strong&gt; The AI tests the middle of the range but not the edges. Arrays with zero or one element. Strings at the maximum length. Numbers at integer overflow boundaries. Dates at daylight saving transitions. These are where production bugs live, and AI-generated tests consistently do not go there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy path bias.&lt;/strong&gt; AI-generated tests are overwhelmingly positive-path tests. They verify that the function works when everything is correct. They rarely test what happens when the input is malformed, the network fails, the database is slow, or the user does something unexpected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mocking that hides bugs.&lt;/strong&gt; AI loves to mock dependencies. Sometimes that makes sense. But when the AI mocks the exact behavior it assumes the dependency has, and the real dependency behaves differently, the test passes and production fails. This is especially dangerous with database queries, API calls, and third-party libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing implementation details instead of behavior.&lt;/strong&gt; AI-generated tests frequently assert on internal state, call order, or specific implementation choices rather than observable behavior. These tests break when you refactor even if the behavior is identical. They verify that the code is structured a specific way, not that it does the right thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insufficient error path coverage.&lt;/strong&gt; The Sonar 2026 State of Code survey found that error handling deficiencies appear nearly twice as often in AI-generated code. The tests reflect this same gap. When the AI does not properly handle an error case in the implementation, it also does not test for it.&lt;/p&gt;
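
&lt;p&gt;To make the boundary-test gap concrete, here is what the missing cases look like for a hypothetical &lt;code&gt;chunk&lt;/code&gt; utility. A typical AI-generated suite covers the four-element case and stops there.&lt;/p&gt;

```typescript
// A hypothetical utility: split an array into fixed-size chunks.
function chunk(items: number[], size: number): number[][] {
  if (1 > size) throw new RangeError("size must be at least 1");
  const out: number[][] = [];
  let i = 0;
  while (items.length > i) {
    out.push(items.slice(i, i + size));
    i += size;
  }
  return out;
}

// Boundary tests: the edges AI-generated suites tend to skip.
const boundaries = [
  { input: [], size: 3, expected: 0 },           // empty array
  { input: [1], size: 3, expected: 1 },          // single element
  { input: [1, 2, 3], size: 3, expected: 1 },    // exact fit
  { input: [1, 2, 3, 4], size: 3, expected: 2 }, // one element spills over
];

for (const b of boundaries) {
  if (chunk(b.input, b.size).length !== b.expected) {
    throw new Error(`chunk of ${b.input.length} by ${b.size} failed`);
  }
}
```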




&lt;h2&gt;
  
  
  The Amazon Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;In early March 2026, Amazon suffered two major outages within three days. The first disrupted service for nearly six hours and resulted in 120,000 lost orders. The second was worse: six hours of downtime, a 99% drop in US order volume, and approximately 6.3 million lost orders.&lt;/p&gt;

&lt;p&gt;Both incidents were traced to AI-assisted code changes deployed to production without proper approval workflows.&lt;/p&gt;

&lt;p&gt;Whether the root cause was insufficient testing, inadequate review, or broken deployment controls, the pattern is the same one I see in smaller codebases every week: AI-generated code that looked correct, passed automated checks, and made it to production, where it failed at scale.&lt;/p&gt;

&lt;p&gt;The lesson is not "do not use AI to write code." The lesson is that the verification layer needs to be proportional to the risk. Code that serves millions of users requires a different testing standard than code that serves your side project. But even for side projects, the &lt;a href="https://dev.to/blog/ai-generated-code-technical-debt-2026"&gt;technical debt from unverified AI code&lt;/a&gt; accumulates faster than most developers realize.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making This Practical
&lt;/h2&gt;

&lt;p&gt;I know what you are thinking. Six layers of testing sounds like a lot of overhead for a workflow that is supposed to make you faster.&lt;/p&gt;

&lt;p&gt;Here is how I actually apply this in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For green-zone code&lt;/strong&gt; (UI components, boilerplate, configuration, formatting): Layers 1 and 3 only. Static analysis and AI-generated tests as a quick sanity check. Do not over-invest in testing code that has low blast radius.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For yellow-zone code&lt;/strong&gt; (data transformations, API integrations, state management): Layers 1 through 4. Write test expectations first. Let AI implement. Generate supplemental tests. Do an adversarial review. This covers the majority of day-to-day development work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For red-zone code&lt;/strong&gt; (auth, payments, security, data at scale): All six layers. Write test expectations yourself. Review every line of the implementation. Adversarial review. Integration tests. Production monitoring. The cost of a bug in these areas justifies every minute of testing effort.&lt;/p&gt;

&lt;p&gt;This maps directly to the &lt;a href="https://dev.to/blog/vibe-ceiling-ai-code-decision-framework-2026"&gt;code-type classification system&lt;/a&gt; I wrote about previously. The testing investment should match the risk profile, not a blanket standard applied to everything.&lt;/p&gt;

&lt;p&gt;If you are using &lt;a href="https://dev.to/blog/spec-driven-development-2026"&gt;spec-driven development&lt;/a&gt;, the spec itself becomes the source of truth for Layer 2. Your spec defines the behavior. Your tests encode the spec. The AI implements to pass the tests. The loop is tight and the blind spots are minimized.&lt;/p&gt;

&lt;p&gt;For better &lt;a href="https://dev.to/blog/context-engineering-ai-coding-2026"&gt;context engineering&lt;/a&gt;, include your test file in the AI's context when it generates the implementation. The model produces significantly better code when it can see the tests it needs to pass. This is the simplest lever you have for improving AI output quality, and most developers do not use it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Summary
&lt;/h2&gt;

&lt;p&gt;Testing AI-generated code is harder than testing code you wrote yourself. That is just the reality. When you write code, you carry the context of every decision into the testing phase. When AI writes code, you are testing something you did not fully create, against assumptions you might not fully share.&lt;/p&gt;

&lt;p&gt;The solution is not more tests. It is better-positioned tests. Human-written expectations that define correctness independently from the implementation. Static analysis that catches pattern-level problems automatically. Adversarial review that breaks the blind spot cycle. And production monitoring that tells you when everything else missed something.&lt;/p&gt;

&lt;p&gt;Test-first development went from "nice practice that senior developers recommend" to "the minimum viable workflow for shipping AI-generated code responsibly." That is not a philosophical position. It is what the data says and what my own production incidents confirmed.&lt;/p&gt;

&lt;p&gt;The developers who will thrive with AI coding tools are not the ones who ship the fastest. They are the ones whose shipped code stays shipped. Testing is how you get there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>REST vs GraphQL vs tRPC: What I Actually Use and Why in 2026</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:25:27 +0000</pubDate>
      <link>https://dev.to/alexcloudstar/rest-vs-graphql-vs-trpc-what-i-actually-use-and-why-in-2026-395i</link>
      <guid>https://dev.to/alexcloudstar/rest-vs-graphql-vs-trpc-what-i-actually-use-and-why-in-2026-395i</guid>
      <description>&lt;p&gt;Two years ago I migrated a perfectly functional REST API to GraphQL because I read too many blog posts about how REST was a relic and GraphQL was the future. The migration took three weeks. I spent another two weeks debugging the N+1 query problems that came with it. Then I spent a week implementing DataLoader to fix the performance issues I had introduced by migrating.&lt;/p&gt;

&lt;p&gt;The API served a simple dashboard with six endpoints. It did not need flexible query capabilities. It did not aggregate data from multiple sources. It did not serve a mobile client with different data needs than the web client. It served one frontend with predictable data requirements. REST was the right choice from the start.&lt;/p&gt;

&lt;p&gt;I learned something from that experience that no comparison article had told me: the best API layer is not the one with the most features. It is the one that matches your actual constraints. And most developers, including me at the time, choose based on hype instead of constraints.&lt;/p&gt;

&lt;p&gt;This is the decision framework I wish I had before that migration.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Comparison Table Trap
&lt;/h2&gt;

&lt;p&gt;Every "REST vs GraphQL vs tRPC" article starts with a comparison table. Features on the left, checkmarks on the right. GraphQL gets the checkmark for flexible queries. REST gets the checkmark for caching. tRPC gets the checkmark for type safety.&lt;/p&gt;

&lt;p&gt;These tables are technically accurate and practically useless.&lt;/p&gt;

&lt;p&gt;They tell you what each technology can do. They do not tell you which problems you actually have. And the gap between "this technology supports feature X" and "I need feature X in my specific project" is where most bad architecture decisions live.&lt;/p&gt;

&lt;p&gt;The developer building a public API for third-party integrations has fundamentally different constraints than the developer building a full-stack TypeScript app in a monorepo. The developer aggregating data from twelve microservices has different needs than the developer building CRUD endpoints for a dashboard. Putting all of them in front of the same comparison table and expecting them to reach the right conclusion is how you end up with over-engineered GraphQL servers for apps that needed five REST endpoints.&lt;/p&gt;

&lt;p&gt;The question is not "which is best?" The question is "who is consuming this API, what are their constraints, and what does my team already know?"&lt;/p&gt;




&lt;h2&gt;
  
  
  REST Is Not Dead (and Calling It Legacy Is Wrong)
&lt;/h2&gt;

&lt;p&gt;I have seen developers dismiss REST as outdated in the same breath they struggle to implement basic caching in their GraphQL server. REST is boring. REST is also the most battle-tested, best-understood, and most widely supported API pattern in existence.&lt;/p&gt;

&lt;p&gt;Here is what REST gives you that the newer options do not:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native HTTP caching.&lt;/strong&gt; REST's URL-per-resource model aligns perfectly with HTTP caching semantics. CDNs, browser caches, and proxy caches all understand &lt;code&gt;GET /products/123&lt;/code&gt;. This is not a minor advantage. For read-heavy applications, HTTP caching can eliminate entire categories of performance problems without writing a single line of cache management code.&lt;/p&gt;
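
&lt;p&gt;A sketch of what that alignment looks like in a handler for &lt;code&gt;GET /products/:id&lt;/code&gt;. The response shape and product data here are illustrative, not a specific framework's API; the cache headers are the standard HTTP ones.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hypothetical product store; the data is illustrative.
const products = new Map();
products.set("123", { id: "123", name: "Widget", price: 999 });

// Because each resource lives at a stable URL, CDNs, browsers, and
// proxies can cache the response using only standard HTTP semantics.
function productResponse(id: string) {
  const product = products.get(id);
  if (product === undefined) {
    return { status: 404, headers: { "Cache-Control": "no-store" }, body: "" };
  }
  const body = JSON.stringify(product);
  // ETag lets clients revalidate cheaply (304 Not Modified) after max-age.
  const etag = `"${createHash("sha1").update(body).digest("hex")}"`;
  return {
    status: 200,
    headers: { "Cache-Control": "public, max-age=60", ETag: etag },
    body,
  };
}
```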

&lt;p&gt;&lt;strong&gt;Universal tooling support.&lt;/strong&gt; Every programming language, every HTTP client, every monitoring tool, every API gateway understands REST. Your API will work with clients written in Python, Go, Rust, JavaScript, Swift, or anything else without any additional setup. When you build a public API, this matters enormously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple mental model.&lt;/strong&gt; Resources map to URLs. HTTP methods map to operations. Status codes map to outcomes. A junior developer can understand a REST API in hours. The onboarding cost is nearly zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proven at scale.&lt;/strong&gt; The largest APIs on the internet (Stripe, Twilio, GitHub's REST v3) are REST APIs. Not because those companies did not consider alternatives, but because REST's simplicity scales well operationally.&lt;/p&gt;

&lt;p&gt;The tradeoffs of REST are real. Over-fetching and under-fetching are genuine problems when your frontend needs vary significantly from your resource structure. Versioning REST APIs is painful. And building a good REST API still requires discipline around consistent response shapes, proper error handling, and meaningful status codes.&lt;/p&gt;

&lt;p&gt;But the solution to those tradeoffs is not always GraphQL. Sometimes it is a better-designed REST API. The &lt;a href="https://dev.to/blog/stop-obsessing-over-the-perfect-stack"&gt;stop obsessing over the perfect stack&lt;/a&gt; advice applies here too. A well-designed REST API beats a poorly implemented GraphQL server every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  GraphQL: Incredible When It Fits, Painful When It Does Not
&lt;/h2&gt;

&lt;p&gt;GraphQL enterprise adoption has grown 340% since 2023. That number is impressive and also misleading, because it does not tell you how many of those adoptions went smoothly.&lt;/p&gt;

&lt;p&gt;I have used GraphQL on three production projects. One was a great fit. Two were not. The one that worked was a product with a complex, relational data model, multiple client types (web, mobile, third-party), and a dedicated frontend team that iterated rapidly on data requirements. GraphQL's ability to let the client specify exactly what it needs was genuinely valuable there.&lt;/p&gt;

&lt;p&gt;The two that were not great fits were simpler applications where the data requirements were predictable and stable. In both cases, GraphQL added complexity without adding proportional value.&lt;/p&gt;

&lt;p&gt;Here is where GraphQL genuinely shines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex, relational data with multiple consumers.&lt;/strong&gt; When your frontend needs to traverse relationships (for example, loading a user with their posts, comments, and notification preferences in a single request), GraphQL eliminates the waterfall of REST calls. The client describes the data shape it needs and gets exactly that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend-for-Frontend patterns.&lt;/strong&gt; When you have multiple frontend clients with different data needs, GraphQL lets each client request exactly what it needs without building separate endpoints. A mobile client that needs minimal data and a desktop client that needs everything can share the same API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rapid frontend iteration.&lt;/strong&gt; When the frontend team is iterating faster than the backend team can ship new endpoints, GraphQL removes the dependency. The frontend team can adjust its queries without waiting for backend changes. On teams where this bottleneck is real, GraphQL's productivity impact is significant.&lt;/p&gt;

&lt;p&gt;Here is where GraphQL costs more than it saves:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N+1 queries.&lt;/strong&gt; Without DataLoader or equivalent batching, a query that loads 100 users with their posts generates 101 database round trips. This is not a theoretical problem. It is the most common performance issue in production GraphQL systems. Solving it requires additional infrastructure that REST simply does not need.&lt;/p&gt;
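
&lt;p&gt;The batching idea behind DataLoader is worth seeing in miniature. This is a hand-rolled sketch of the concept, not DataLoader's actual API, and the fake database is illustrative: every key requested in the same tick is collected, then resolved with one batched query instead of one query per key.&lt;/p&gt;

```typescript
// Fake database that counts round trips; illustrative only.
const db = {
  calls: 0,
  async postsByAuthors(authorIds: string[]) {
    db.calls += 1; // one round trip, no matter how many authors
    const result = new Map();
    for (const id of authorIds) result.set(id, [`post-by-${id}`]);
    return result;
  },
};

let pending: { id: string; resolve: (posts: string[]) => void }[] = [];

function loadPosts(authorId: string) {
  return new Promise((resolve: (posts: string[]) => void) => {
    pending.push({ id: authorId, resolve });
    if (pending.length === 1) {
      // The first key requested this tick schedules one flush for the
      // whole batch; everything queued by then shares a single query.
      queueMicrotask(async () => {
        const batch = pending;
        pending = [];
        const byId = await db.postsByAuthors(batch.map((p) => p.id));
        for (const p of batch) p.resolve(byId.get(p.id) ?? []);
      });
    }
  });
}
```

&lt;p&gt;A resolver that loads 100 users and calls &lt;code&gt;loadPosts&lt;/code&gt; for each one now triggers one database round trip instead of 100. That is the infrastructure GraphQL requires and REST simply does not.&lt;/p&gt;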

&lt;p&gt;&lt;strong&gt;Caching is hard.&lt;/strong&gt; GraphQL uses POST requests by default. POST requests bypass HTTP caching. You need specialized caching strategies (persisted queries, response-level caching, or CDN-specific integrations) to get caching behavior that REST gives you for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security surface area.&lt;/strong&gt; Arbitrary query depth and complexity mean that a malicious or careless client can construct queries that overload your server. You need query complexity analysis, depth limiting, and rate limiting at the query level. This is operational overhead that scales with the flexibility you are offering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Onboarding cost.&lt;/strong&gt; REST takes days to learn. GraphQL takes weeks. The schema definition language, resolvers, DataLoader, fragments, subscriptions, and the mental model of graph traversal are all concepts that need to be understood before a developer is productive. For small teams, this cost is not trivial.&lt;/p&gt;

&lt;p&gt;The 2026 consensus among teams I talk to is that GraphQL has found its niche. It is not replacing REST universally. About 67% of large organizations use both GraphQL and REST, treating them as complementary rather than competing. GraphQL for client-facing BFF layers. REST for everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  tRPC: The Best API Layer Nobody Outside TypeScript Knows About
&lt;/h2&gt;

&lt;p&gt;If you write TypeScript on both your frontend and backend, tRPC might be the most productive API layer available in 2026. And if you do not write TypeScript on both sides, tRPC is not an option at all. That constraint defines everything about when to use it.&lt;/p&gt;

&lt;p&gt;tRPC eliminates the API layer as a concept. You define functions on the server. You call those functions from the client. TypeScript infers the types across the boundary automatically. No schema file. No code generation. No runtime validation layer between client and server. You change a function signature on the server and your IDE immediately shows type errors on every client call site.&lt;/p&gt;

&lt;p&gt;The experience of using tRPC in a monorepo is hard to describe until you have tried it. The boundary between frontend and backend stops feeling like a boundary. You are invoking functions, not calling APIs. The feedback loop is instant because the type checker catches contract violations at compile time, not at runtime.&lt;/p&gt;
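
&lt;p&gt;The core of that experience can be approximated in plain TypeScript. This is the shape of the idea, not tRPC's actual API: the server exports plain functions, the only shared artifact is an inferred type, and the compiler enforces the contract at every call site.&lt;/p&gt;

```typescript
// Server side: plain functions, no schema file, no codegen.
const api = {
  async getUser(id: string) {
    return { id, name: "Ada", posts: 3 };
  },
  async renameUser(id: string, name: string) {
    return { id, name, posts: 3 };
  },
};

// The only shared artifact is a type, inferred from the implementation.
// Change a signature above and every call site below fails to compile.
export type Api = typeof api;

// Client side: calls look like local function calls. The compiler
// enforces the contract, so there is no runtime validation boundary.
async function showUser(client: Api, id: string) {
  const user = await client.getUser(id);
  return `${user.name} (${user.posts} posts)`;
}
```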

&lt;p&gt;Here is what makes tRPC compelling:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-cost type safety.&lt;/strong&gt; There is no schema to maintain, no types to generate, no codegen step to run. The types are inferred from the actual implementation. This means the types are always correct because they are the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer experience in monorepos.&lt;/strong&gt; tRPC's adoption surge tracks with the maturation of monorepo tooling. Turborepo, Nx, and pnpm workspaces made it practical for small teams to run frontend and backend together. tRPC makes it productive. The combination is genuinely transformative for &lt;a href="https://dev.to/blog/one-person-startup-scaling-2026"&gt;one-person startup scaling&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance.&lt;/strong&gt; tRPC adds minimal overhead. It is essentially a thin wrapper around HTTP. No query parsing, no schema validation at runtime, no resolver chain. For internal APIs where the client and server are both under your control, the performance characteristics are excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growing ecosystem.&lt;/strong&gt; The T3 Stack (Next.js, tRPC, Prisma, Tailwind) has become one of the most popular full-stack patterns in the TypeScript ecosystem. Over 38,000 GitHub stars and a large community of developers building real products with it.&lt;/p&gt;

&lt;p&gt;Here is what limits tRPC:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript only.&lt;/strong&gt; If your client is not TypeScript, tRPC's core value proposition disappears. Mobile apps in Swift or Kotlin cannot consume tRPC endpoints without additional tooling. External developers integrating with your API cannot use tRPC unless they are also in the TypeScript ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not for public APIs.&lt;/strong&gt; tRPC is designed for internal communication between your own frontend and backend. It does not generate OpenAPI specs. It does not produce documentation that external developers can consume. If you need a public API, REST with OpenAPI is still the standard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tight coupling by design.&lt;/strong&gt; tRPC works because the client and server share types. This means they need to be in the same repository or share packages. For distributed teams working across separate repositories, this coupling becomes a constraint rather than a feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curve for the pattern.&lt;/strong&gt; While the API itself is simple, the mental model of "server functions called from the client" requires developers to unlearn some REST habits. Input validation with Zod, middleware chains, and context patterns have their own learning curve.&lt;/p&gt;

&lt;p&gt;If you are building a TypeScript full-stack application and your API is consumed only by your own frontend, tRPC is probably the right default choice in 2026. The developer experience advantage is significant enough that it changes how fast you can iterate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework That Actually Works
&lt;/h2&gt;

&lt;p&gt;Stop asking "which is best?" Start asking these four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question one: Who is consuming your API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If external developers, third-party integrations, or non-TypeScript clients need to consume your API, use REST with OpenAPI documentation. Full stop. The ecosystem support, documentation tooling, and universal compatibility are not optional for public APIs.&lt;/p&gt;

&lt;p&gt;If only your own frontend team consumes the API and everyone writes TypeScript, tRPC is the strongest choice for internal communication.&lt;/p&gt;

&lt;p&gt;If multiple client types with significantly different data needs consume the API, GraphQL is worth the complexity cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question two: How complex is your data model?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your data is mostly flat resources with simple relationships (users, products, orders with straightforward joins), REST handles this cleanly. A well-designed REST API with a few compound endpoints covers most CRUD applications.&lt;/p&gt;

&lt;p&gt;If your data is deeply relational and clients need to traverse multiple levels of relationships in unpredictable ways, GraphQL's query flexibility provides real value.&lt;/p&gt;

&lt;p&gt;If your data model is complex but the query patterns are predictable (you know which relationships clients will need), tRPC or REST with purpose-built endpoints works better than GraphQL because you can optimize the specific queries you know you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question three: What does your team already know?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This matters more than most architects want to admit. A team that knows REST well will build a better REST API than a mediocre GraphQL implementation. A team fluent in TypeScript will get more out of tRPC than a team still learning TypeScript fundamentals.&lt;/p&gt;

&lt;p&gt;The productivity cost of learning a new API paradigm is real and often underestimated. If your team has six months of runway and needs to ship fast, use what they know. Optimize the architecture later when you have more information about your actual constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question four: What is your scaling trajectory?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building a &lt;a href="https://dev.to/blog/side-project-to-first-dollar-developer-monetization-2026"&gt;side project or early-stage product&lt;/a&gt;, optimize for speed. tRPC in a monorepo if you are TypeScript-only. REST if you need flexibility. Do not add GraphQL unless you have a specific problem it solves.&lt;/p&gt;

&lt;p&gt;If you are building infrastructure that needs to last years and serve many consumers, invest in REST with OpenAPI for public surfaces and evaluate GraphQL for internal BFF layers where the data complexity justifies it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About gRPC?
&lt;/h2&gt;

&lt;p&gt;I am not covering gRPC in depth because it serves a different use case than the other three. gRPC is optimized for service-to-service communication where bandwidth efficiency and performance matter more than developer experience. It uses Protocol Buffers for serialization, which is significantly more efficient than JSON but requires a compilation step and has a steeper learning curve.&lt;/p&gt;

&lt;p&gt;If you are building microservices that need to communicate with low latency and high throughput, gRPC is excellent. If you are building a web application with a frontend, gRPC is not the right tool for the client-facing layer.&lt;/p&gt;

&lt;p&gt;Most production architectures in 2026 that use gRPC use it alongside REST or GraphQL: gRPC for internal service communication, REST or GraphQL for client-facing APIs. It is a complement, not a competitor, to the other options.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Multi-Protocol Reality
&lt;/h2&gt;

&lt;p&gt;The most useful thing I have learned about API design in 2026 is that the answer is almost never "use one protocol for everything."&lt;/p&gt;

&lt;p&gt;The pattern I see in well-architected systems looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public API&lt;/strong&gt;: REST with OpenAPI documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal frontend-to-backend&lt;/strong&gt;: tRPC (TypeScript projects) or GraphQL (multi-client projects)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service-to-service&lt;/strong&gt;: gRPC or REST, depending on performance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI agent integration&lt;/strong&gt;: MCP (Model Context Protocol) for &lt;a href="https://dev.to/blog/mcp-model-context-protocol-developer-guide-2026"&gt;tool exposure&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each protocol serves a different boundary with different requirements. Trying to use one protocol everywhere is like using one database for everything. It works until it does not, and then the migration is painful.&lt;/p&gt;

&lt;p&gt;The developers I know who are most productive with API design are the ones who can pick the right tool for each boundary without over-thinking it. They use REST where REST is strong, tRPC where type safety matters, and GraphQL where data flexibility is a real requirement. They do not spend weeks debating which is "better" in the abstract.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Use on My Projects
&lt;/h2&gt;

&lt;p&gt;For full transparency, here is what I use and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personal projects and side projects&lt;/strong&gt;: tRPC with Next.js in a monorepo. The developer experience is unmatched for solo development. I do not need a public API. I do not need multi-language client support. I need to move fast with full type safety, and tRPC delivers that better than anything else I have tried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client work with API requirements&lt;/strong&gt;: REST with OpenAPI. When clients need documentation, when external teams will integrate, when the API needs to outlast my involvement in the project, REST is the responsible choice. I use &lt;a href="https://dev.to/blog/honojs-vs-express-2026"&gt;Hono&lt;/a&gt; for the server and generate OpenAPI specs automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex data products&lt;/strong&gt;: GraphQL, but only when the data model and client requirements genuinely warrant it. I have learned to be honest about whether I actually need GraphQL's flexibility or whether I just want to use it because it feels more sophisticated.&lt;/p&gt;

&lt;p&gt;The honest pattern is: tRPC for speed when I control both sides, REST for durability when I do not, GraphQL only when the data complexity demands it. Everything else is over-engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;The API design landscape in 2026 is more settled than the blog posts make it seem. REST is not going anywhere. GraphQL found its niche and stopped trying to replace everything. tRPC is the breakout winner for TypeScript teams who do not need public APIs.&lt;/p&gt;

&lt;p&gt;The developers who struggle with API design are not the ones who pick the "wrong" technology. They are the ones who pick a technology for the wrong reasons. Choosing GraphQL because it is trendy, or avoiding REST because it feels old, or adopting tRPC because a YouTube video made it look easy. Those are all bad reasons.&lt;/p&gt;

&lt;p&gt;Good reasons sound like: "Our mobile and web clients need different data shapes from the same backend, so GraphQL's client-driven queries save us from maintaining parallel endpoints." Or: "We are a two-person TypeScript team shipping fast in a monorepo, so tRPC gives us type safety without the overhead of schema management." Or: "We need a public API that any HTTP client can consume, so REST with OpenAPI is the obvious choice."&lt;/p&gt;

&lt;p&gt;Start with the constraints. Pick the tool that fits. Move on and build the actual product. The API layer is important, but it is not the product. The product is the product.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
