SAR

Posted on Jul 4

The 3 AI Engineering Problems Nobody Solved at the World's Fair

#ai #programming #webdev

I just spent three days at the AI Engineer World's Fair in San Francisco. 7,000 engineers, dozens of tracks, every major brand-name sponsor you can think of. The energy was insane — like someone injected 2021 crypto-conference hype into actual working software.

And honestly? I came home with mixed feelings.

On one hand, the progress is real. Agents are no longer demo-ware. Uber showed uReview, its internal system where agents autonomously review PRs, spin up test suites, catch edge cases, and commit fixes back before a human even looks at the branch. That's not a prototype. That's production, handling real code.

On the other hand, the conference floor was a masterclass in avoiding the hard questions.

I spent most of my time in the hallways and breakout sessions asking a simple question: "What's actually still broken?" After about 20 conversations — with engineers from Anthropic, Google DeepMind, Vercel, and a dozen startups you haven't heard of yet — three patterns kept coming up. Nobody had good answers for any of them.

Here's what I learned.

1. The Frontier Trap — Everyone's Burning Money and Pretending It's Fine

The most talked-about debate at the fair wasn't about agents or sandboxes or open models. It was about loops — whether we can finally take humans out of the coding loop and let AI run autonomously.

Geoff Huntley from Latent Patterns argued yes, comparing it to the early Kubernetes days. Messy at first, revolutionary once we figure it out. Dex Horthy from HumanLayer argued no, showing real data from his experiments where taking humans out resulted in disasters. The audience vote was close but tipped toward "not yet."

But here's what nobody in that debate acknowledged: the real bottleneck isn't the human. It's the model choice.

Ryan Swift, another attendee I tracked down after his workshop, put it bluntly: "Most engineers simply refuse to consider any model other than the latest and most powerful frontier for their day-to-day tasks. I spend an inordinate amount of time trying to convince people that frontier models aren't always necessary."

He's right. I saw it everywhere.

Teams running Claude Opus 4 to check the weather. Companies burning $2,000/day on GPT-5 for tasks a fine-tuned 8B model could handle. The default answer to "which model?" was always "the biggest one."

Here's the data Ryan shared in his session:

Sonnet 4.6 performs comparably to Opus 4.1 on 90% of coding tasks
Gemini Flash 3.5 competes with Gemini Pro 3.1
GPT-5.4 Mini matches GPT-5.1 on every common benchmark
Fast models cost 5-15x less and are 3-8x faster

The math is obvious. But engineers don't trust it. They'd rather pay for a sledgehammer when they need a screwdriver because getting it wrong once feels worse than overpaying a thousand times.
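To put rough numbers on it, here is a back-of-the-envelope sketch using only the figures quoted above: the $2,000/day spend and the 5-15x cost gap. It assumes, optimistically, that the whole workload can move to the cheaper model. The function name is mine, not anything shown at the fair.

```python
def daily_cost_after_switch(frontier_cost_per_day: float, cost_ratio: float) -> float:
    """Daily spend if the same workload ran on a model that is cost_ratio times cheaper."""
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return frontier_cost_per_day / cost_ratio


frontier = 2000.0  # dollars per day, the figure quoted above

for ratio in (5, 15):  # the low and high end of the quoted 5-15x range
    cheap = daily_cost_after_switch(frontier, ratio)
    saved = frontier - cheap
    print(f"{ratio}x cheaper: ${cheap:,.0f}/day, saving ${saved:,.0f}/day")

# 5x cheaper: $400/day, saving $1,600/day
# 15x cheaper: $133/day, saving $1,867/day
```

In practice only part of the traffic would move, so treat these as an upper bound on the savings.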

The real problem: nobody's built the tooling to prove which model is sufficient for a given task. Teams are making gut decisions based on vibes, the same "vibe-based engineering" they claim to have left behind. Until we have automated routing systems that can say "your query needs Sonnet, not Opus" with demonstrable confidence, we're going to keep burning cash on frontier models for work that doesn't need them.
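One way to make "sufficient" measurable, sketched under my own assumptions: run each candidate model on a labeled sample of your real tasks, then pick the cheapest one whose pass rate clears your bar even at the pessimistic end of a confidence interval. The model labels, costs, and pass counts below are made up for illustration. The Wilson score interval is a standard statistical tool, not something anyone at the fair presented.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str             # hypothetical model label
    cost_per_task: float  # dollars, from your own measurements
    passed: int           # tasks passed on your labeled sample
    total: int            # tasks attempted


def wilson_lower_bound(passed: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0
    p = passed / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def cheapest_sufficient(results, required_pass_rate):
    """Cheapest model whose pessimistic pass rate still clears the bar, or None."""
    ok = [r for r in results
          if wilson_lower_bound(r.passed, r.total) >= required_pass_rate]
    return min(ok, key=lambda r: r.cost_per_task) if ok else None


# Illustrative numbers only.
results = [
    ModelResult("small-model", 0.002, passed=186, total=200),
    ModelResult("mid-model", 0.010, passed=196, total=200),
    ModelResult("frontier-model", 0.050, passed=198, total=200),
]

choice = cheapest_sufficient(results, required_pass_rate=0.90)
print(choice.name if choice else "no model is demonstrably sufficient")
```

With these made-up numbers the small model's raw 93% looks fine, but its lower bound falls under 90%, so the sketch picks the mid-tier model. That is the gap between a vibe and a claim you can defend.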

2. The Evaluation Chasm — Vibe Checks Are Dead, But Nothing's Replaced Them

Ben Halpern, DEV's founder, was walking the fair cataloging exactly this. His conclusion: "vibe-based engineering is dead." Reviewing a few outputs, deciding they look reasonable, and shipping to production is no longer acceptable.

The new standard, he wrote, involves spinning up isolated virtual environments — temporary sandboxes with mock databases and network access — and letting an agent attempt a complex task. The framework doesn't grade style; it checks if the task was completed, counts the steps, and verifies security protocols weren't violated.
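A minimal sketch of that grading logic, assuming the sandbox already exists and hands back a record of what the agent did. The `AgentRun` shape, the `grade` function, and the forbidden-command list are hypothetical names of mine, not part of any framework mentioned at the fair.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    """Hypothetical record of one agent attempt inside a sandbox."""
    commands: list[str]                                        # everything the agent executed
    final_state: dict[str, str] = field(default_factory=dict)  # e.g. check name -> result


@dataclass
class Verdict:
    completed: bool
    steps: int
    violations: list[str]

    @property
    def passed(self) -> bool:
        # Finishing the task is not enough if a security rule was broken.
        return self.completed and not self.violations


def grade(run: AgentRun,
          is_done: Callable[[dict[str, str]], bool],
          forbidden: tuple[str, ...]) -> Verdict:
    """Check the outcome, count the steps, flag forbidden commands. Style is ignored."""
    violations = [cmd for cmd in run.commands
                  if any(bad in cmd for bad in forbidden)]
    return Verdict(completed=is_done(run.final_state),
                   steps=len(run.commands),
                   violations=violations)


run = AgentRun(
    commands=["pytest -q", "curl http://internal/secrets", "git commit -m fix"],
    final_state={"tests": "green"},
)
verdict = grade(run,
                is_done=lambda state: state.get("tests") == "green",
                forbidden=("curl ", "rm -rf"))
print(verdict.completed, verdict.steps, verdict.passed)  # True 3 False
```

The run above finished the task in three steps and still fails, because one command touched something it shouldn't have.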

Sounds great. Except almost nobody at the fair had actually implemented this.

I talked to a team from a well-funded AI startup who admitted they're still evaluating their agents by "having a senior engineer read 20 outputs and grade them on a curve." Another team from a Fortune 500 company said they use a simple pass/fail script that checks if the output JSON is valid. That's it. If it parses, it ships.

The gap between "what we know we should do" and "what anyone has actually built" is enormous. Here's why:

Good evals are expensive. Spinning up a micro-sandbox for every agent interaction costs compute and time. For a chat application handling millions of requests, running a full evaluation pipeline on every response is financially infeasible.

Ground truth is unclear. What does "success" look like for an agent that writes documentation? Or refactors a codebase? Or replies to a customer email? We can't even agree on the evaluation criteria, let alone automate it.

Regression testing is early-stage. One engineer showed me their eval framework — it looked like a collection of 40 loosely related Python scripts, each testing a different agent capability, maintained by whoever had time that week. When a new model dropped, they'd run the scripts manually and compare numbers in a spreadsheet.

This is where the industry is right now. We've moved past "does it feel right?" but we haven't landed on "does it work?" yet. And that limbo is dangerous: teams are shipping agentic systems to production with evaluation setups less rigorous than a first-year CS project's test suite.
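The manual spreadsheet comparison described above is the easiest piece to automate. A rough sketch, assuming each eval script already produces a pass rate per capability. The capability names, the numbers, and the 2-point tolerance are placeholders of mine.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Return capabilities whose pass rate dropped by more than `tolerance`.

    A capability missing from the candidate run is treated as 0.0,
    so a script that silently stopped running shows up as a regression.
    """
    regressions = {}
    for capability, old in baseline.items():
        new = candidate.get(capability, 0.0)
        if old - new > tolerance:
            regressions[capability] = (old, new)
    return regressions


# Illustrative pass rates for the current model and a newly released one.
baseline = {"refactor": 0.92, "write_tests": 0.85, "fix_bug": 0.78}
candidate = {"refactor": 0.93, "write_tests": 0.70, "fix_bug": 0.77}

for capability, (old, new) in find_regressions(baseline, candidate).items():
    print(f"{capability}: {old:.0%} -> {new:.0%}")  # write_tests: 85% -> 70%
```

Wire this into CI so a new model release fails the build instead of waiting for someone to eyeball two columns.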

3. The Trust Ceiling — Agents Can Do Everything Except Convince You They're Safe

Giving an agent the authority to write code, modify files, and run terminal commands introduces risks that most teams are only beginning to understand.

The industry standard is coalescing around micro-sandboxes — lightweight, ephemeral micro-VMs from providers like E2B or Docker that spin up in milliseconds, handle a specific computation, and are immediately destroyed. Secure by design. Container escape risks minimized. File system tampering neutralized.

But security isn't the real trust problem. The trust problem is deeper.

A senior engineer from a major cloud provider told me something that stuck: "We can make agents technically secure. What we can't do is make engineers feel safe using them."

There's a difference between being safe and feeling safe. And the AI industry is terrible at the second part.

The conference covered credential masking extensively — protocols like AAuth that grant agents mission-bounded authority to call a tool without ever seeing the raw API keys. This neutralizes prompt injection leaks. It's good engineering. But when a developer watches an agent autonomously modify production infrastructure, the question isn't "is the credential safe?" It's "do I trust this thing not to delete my database?"

That question doesn't have a technical answer yet. It's earned through reliability, predictability, and time — three things the current AI engineering cycle doesn't give you.
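For readers who haven't seen the credential-masking pattern mentioned above, here is a minimal sketch of the general idea. It is not AAuth, whose details I'm not reproducing here. The `CredentialBroker` class and its method names are hypothetical, and a real deployment would run the broker in a separate process, because Python attribute privacy is not a security boundary.

```python
import secrets
import time
from typing import Callable


class CredentialBroker:
    """Holds the real secrets; the agent only ever sees opaque, scoped, expiring grants."""

    def __init__(self, api_keys: dict[str, str]):
        self._api_keys = api_keys                         # tool name -> raw key, never returned
        self._grants: dict[str, tuple[str, float]] = {}   # grant id -> (tool, expiry time)

    def grant(self, tool: str, ttl_seconds: float) -> str:
        """Issue a grant limited to one tool and one time window."""
        if tool not in self._api_keys:
            raise KeyError(f"unknown tool: {tool}")
        grant_id = secrets.token_urlsafe(16)
        self._grants[grant_id] = (tool, time.monotonic() + ttl_seconds)
        return grant_id

    def call(self, grant_id: str, tool: str, send: Callable[[str], str]) -> str:
        """Run the trusted `send(raw_key)` on the agent's behalf if the grant covers `tool`."""
        allowed_tool, expiry = self._grants.get(grant_id, (None, 0.0))
        if allowed_tool != tool or time.monotonic() > expiry:
            raise PermissionError("grant missing, expired, or out of scope")
        return send(self._api_keys[tool])


def fake_billing_request(raw_key: str) -> str:
    """Stand-in for a platform-owned HTTP call; the agent never runs this directly."""
    return "200 OK" if raw_key else "401 Unauthorized"


broker = CredentialBroker({"billing": "sk-real-key"})
grant_id = broker.grant("billing", ttl_seconds=60)   # this is all the agent receives
result = broker.call(grant_id, "billing", fake_billing_request)
print(result)  # 200 OK
```

Even if a prompt injection convinces the agent to dump everything it knows, the worst it can leak is a short-lived grant for a single tool.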

Every agent framework at the fair had a "human in the loop" fallback. Every single one. Because nobody (not the vendors, not the platform teams, not the most bullish loop advocates) actually trusts agents to run fully autonomously in production. The debate at the closing session wasn't "should we have a human in the loop?" It was "can we eventually remove them, and when?"

That's the honest state of AI engineering in mid-2026. We've built agents that can do almost anything. We just can't trust them to do it alone.
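The human-in-the-loop fallback mentioned above can start out very small. This is a sketch under my own assumptions, not how any framework at the fair implements it: the marker list is a crude, easily bypassed denylist I made up, and `execute` and `ask_human` are callbacks you would supply.

```python
from typing import Callable

# Hypothetical denylist. A real system would prefer an allowlist plus a sandbox.
RISKY_MARKERS = ("rm -rf", "drop table", "terraform destroy")


def needs_approval(command: str) -> bool:
    """True if the command contains anything on the risky list (case-insensitive)."""
    lowered = command.lower()
    return any(marker in lowered for marker in RISKY_MARKERS)


def run_with_human_gate(command: str,
                        execute: Callable[[str], str],
                        ask_human: Callable[[str], bool]) -> str:
    """Run safe commands directly; pause risky ones until a human says yes."""
    if needs_approval(command) and not ask_human(command):
        return f"blocked: {command}"
    return execute(command)


executed = []


def fake_execute(command: str) -> str:
    """Stand-in for a real shell or API call."""
    executed.append(command)
    return f"ran: {command}"


def deny_all(command: str) -> bool:
    """Stand-in reviewer who rejects everything."""
    return False


print(run_with_human_gate("ls -la", fake_execute, deny_all))            # ran: ls -la
print(run_with_human_gate("DROP TABLE users", fake_execute, deny_all))  # blocked: DROP TABLE users
```

The interesting engineering question is how short that risky list can get over time, which is the "can we eventually remove them" debate in miniature.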

Disclosure: Some of the links in this article are affiliate links. If you purchase through them, I may earn a commission at no extra cost to you. I only recommend products I genuinely find useful.

What This Actually Means

I'm not writing this to dunk on the conference. The AI Engineer World's Fair was genuinely impressive — the energy, the technical depth, the sheer number of people building real things. It's easy to focus on what's broken and miss that this is still the most exciting time to be building software since 2007.

But the hype cycle has a way of papering over hard problems, and the three I've laid out here are the ones that will separate real engineering from demo-ware.

Here's my honest advice if you're building in this space right now:

Stop defaulting to frontier models. Take a weekend to benchmark a smaller, faster model against your actual workload. The savings alone could fund your eval infrastructure.

Build a shitty eval first. Don't wait for the perfect framework. Write five test cases that represent your most common agent tasks, automate them, and track pass rates over time. You can refine later.

Assume your agent isn't safe enough. Over-invest in sandboxing, credential isolation, and kill switches. The day your agent accidentally does something destructive — and it's not a matter of if but when — you'll be grateful for every precaution you took.
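To make the "shitty eval first" advice concrete, here is roughly the smallest harness I can imagine: five cases, one pass rate, one line appended to a history file per run. The stub agent, the prompts, and the checks are placeholders of mine. Swap in your own agent call and your own most common tasks.

```python
import json
import time
from pathlib import Path
from typing import Callable

Case = tuple[str, Callable[[str], bool]]  # (prompt, check on the agent's output)


def run_evals(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case and return the fraction that passed."""
    passed = 0
    for prompt, check in cases:
        try:
            passed += bool(check(agent(prompt)))
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


def record(history_path: Path, label: str, pass_rate: float) -> None:
    """Append one JSON line per run so the trend is easy to plot or diff later."""
    entry = {"ts": time.time(), "label": label, "pass_rate": pass_rate}
    with history_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def stub_agent(prompt: str) -> str:
    """Placeholder for a real agent call; it just upper-cases the prompt."""
    return prompt.upper()


CASES: list[Case] = [
    ("refund policy", lambda out: "REFUND" in out),
    ("reset password", lambda out: "PASSWORD" in out),
    ("cancel order", lambda out: out.startswith("CANCEL")),
    ("track parcel", lambda out: len(out) < 50),
    ("say thanks", lambda out: "please" in out),  # deliberately failing case
]

rate = run_evals(stub_agent, CASES)
print(f"pass rate: {rate:.0%}")  # pass rate: 80%
```

Calling `record(Path("eval_history.jsonl"), "nightly", rate)` after each run gives you a pass-rate history you can chart once you care enough to chart it.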

The three problems nobody solved at the World's Fair won't fix themselves. But they're fixable. Just not with hype.

I was at the AI Engineer World's Fair 2026 in San Francisco on July 2-4. Some names have been omitted to protect the honest.