<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniel Botha</title>
    <description>The latest articles on DEV Community by Daniel Botha (@octpsprsn).</description>
    <link>https://dev.to/octpsprsn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3418987%2Fdc570d31-e25d-4a13-9622-e8d0f1c3e249.jpeg</url>
      <title>DEV Community: Daniel Botha</title>
      <link>https://dev.to/octpsprsn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/octpsprsn"/>
    <language>en</language>
    <item>
      <title>Games as Model Eval: 1-Click Deploy AI Town on Fly.io</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 12:11:01 +0000</pubDate>
      <link>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1</link>
      <guid>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1</guid>
      <description>&lt;p&gt;Recently, I suggested that &lt;a href="https://fly.io/blog/the-future-isn-t-model-agnostic/" rel="noopener noreferrer"&gt;The Future Isn't Model Agnostic&lt;/a&gt;, that it's better to pick one model that works for your project and build around it, rather than engineering for model flexibility. If you buy that, you also have to acknowledge how important comprehensive model evaluation becomes.&lt;/p&gt;

&lt;p&gt;Benchmarks tell us almost nothing about how a model will actually behave in the wild, especially with long contexts, or when trusted to deliver the tone and feel that defines the UX we’re shooting for. Even the best evaluation pipelines usually end in subjective, side-by-side output comparisons. Not especially rigorous, and more importantly, boring af.&lt;/p&gt;

&lt;p&gt;Can we gamify model evaluation? Oh yes. And not just because we get to have some fun for once. Google backed me up this week when it announced the &lt;a href="https://blog.google/technology/ai/kaggle-game-arena/" rel="noopener noreferrer"&gt;Kaggle Game Arena&lt;/a&gt;, a public platform where we can watch AI models duke it out in a variety of classic games. Quoting Google: "Current AI benchmarks are struggling to keep pace with modern models... it can be hard to know if models trained on internet data are actually solving problems or just remembering answers they've already seen."&lt;/p&gt;

&lt;p&gt;When models boss reading comprehension tests, or ace math problems, we pay attention. But when they fail to navigate a simple conversation with a virtual character or completely botch a strategic decision in a game environment, we tell ourselves we're not building a game anyway and develop strategic short-term memory loss. &lt;br&gt;
Just like I've told my mom a thousand times, games are great at testing brains, and it's time we take this seriously when it comes to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Games Don't Lie
&lt;/h2&gt;

&lt;p&gt;Games provide what benchmarks can't: "a clear, unambiguous signal of success." They give us observable behavior in dynamic environments, the kind that would be extremely difficult (and tedious) to simulate with prompt engineering alone.&lt;/p&gt;

&lt;p&gt;Games force models to demonstrate the skills we actually care about: strategic reasoning, long-term planning, and dynamic adaptation in interactions with an opponent or a collaborator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pixel Art Meets Effective Model Evaluation - AI Town on Fly.io
&lt;/h2&gt;

&lt;p&gt;AI Town is a brilliant project by &lt;a href="https://github.com/a16z-infra" rel="noopener noreferrer"&gt;a16z-infra&lt;/a&gt;, based on the mind-bending paper &lt;a href="https://arxiv.org/pdf/2304.03442" rel="noopener noreferrer"&gt;Generative Agents: Interactive Simulacra of Human Behavior&lt;/a&gt;. It's a beautifully rendered little town in which tiny people with AI brains and engineered personalities go about their lives, interacting with each other and their environment. Characters need to remember past conversations, maintain relationships, react dynamically to new situations, and stay in character while doing it all.&lt;/p&gt;

&lt;p&gt;I challenge you to find a more entertaining way of evaluating conversational models. &lt;/p&gt;

&lt;p&gt;I've &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;forked the project&lt;/a&gt; to make it absurdly easy to spin up your own AI Town on Fly Machines. You get a single deploy script that sets everything up for you, plus some built-in cost and performance optimizations, with our handy scale-to-zero functionality as standard (so you only pay for the time spent running it). This makes it easy to share with your team, your friends and your mom.&lt;/p&gt;

&lt;p&gt;In its current state, the fork makes it as easy as possible to test any OpenAI-compatible service, any model on Together.ai and even custom embedding models. Simply set the relevant API key in your secrets.&lt;/p&gt;

&lt;p&gt;Games like AI Town give us a window into how models actually think, adapt, and behave beyond the context of our prompts. You move past performance metrics and begin to understand a model’s personality, quirks, strengths, and weaknesses; all factors that ultimately shape your project's UX. &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Future Isn't Model Agnostic</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 12:01:24 +0000</pubDate>
      <link>https://dev.to/flyio/the-future-isnt-model-agnostic-46po</link>
      <guid>https://dev.to/flyio/the-future-isnt-model-agnostic-46po</guid>
      <description>&lt;p&gt;Your users don't care that your AI project is model 
agnostic. &lt;/p&gt;

&lt;p&gt;In my last project, I spent countless hours ensuring that the LLMs running my services could be swapped out as easily as possible. I couldn't touch a device with an internet connection without hearing about the latest benchmark-breaking model, and it felt like a clear priority to ensure I could hot-swap models with minimal collateral damage.&lt;/p&gt;

&lt;p&gt;So yeah. That was a waste of time.&lt;/p&gt;

&lt;p&gt;The hype around new model announcements feels more manufactured with each release. In reality, improvements are becoming incremental. As major providers converge on the same baseline, the days of one company holding a decisive lead are numbered.&lt;/p&gt;

&lt;p&gt;In a world of model parity, the differentiation moves entirely to the product layer. Winning isn't about ensuring you're using the best model; it's about understanding your chosen model deeply enough to build experiences that feel magical: knowing exactly how to prompt for consistency, which edge cases to avoid, and how to design workflows that play to your model's particular strengths.&lt;/p&gt;

&lt;p&gt;Model agnosticism isn't just inefficient, it's misguided. Fact is, swapping out your model is not just changing an endpoint. It's rewriting prompts, rerunning evals, users telling you things just feel... different. And if you've won users on the way it feels to use your product, that last one is a really big deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model &amp;lt; Product
&lt;/h2&gt;

&lt;p&gt;Recently, something happened that fully solidified this idea in my head. Claude Code is winning among people building real things with AI. We even have evangelists in the Fly.io engineering team, and those guys are weird smart. Elsewhere, whole communities have formed to share and compare claude.md's and fight each other over which MCP servers are the coolest to use with Claude.&lt;/p&gt;

&lt;p&gt;Enter stage right, Qwen 3 Coder. It takes Claude to the cleaners in benchmarks. But the response from the Claude Code user base? A collective meh.&lt;/p&gt;

&lt;p&gt;This is nothing like 2024, when everyone would have dropped everything to get the hot new model running in Cursor. And it's not because we've learned that benchmarks are performance theater for people who've never shipped a product.&lt;/p&gt;

&lt;p&gt;It's because products like Claude Code are irrefutable evidence that the model isn't the product. We've felt it first hand when our pair programmer's behaviour changes in subtle ways. The product is in the rituals. The trust. The predictability. It's precisely because Claude Code's model behavior, UI, and user expectations are so tightly coupled that its users don't really care that a better model might exist.&lt;/p&gt;

&lt;p&gt;I'm not trying to praise Anthropic here. The point is, engineering for model agnosticism is a trap that will eat up time that could be better spent … anywhere else.&lt;/p&gt;

&lt;p&gt;Sure, if you're building infra or anything else that lives close to the metal, model optionality still matters. But people trusting legwork to AI tools are building deeper relationships with, and expectations of, those tools than they care to admit. AI product success stories are written when products become invisible parts of users' daily rituals, not showcases for engineering flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make One Model Your Own
&lt;/h2&gt;

&lt;p&gt;As builders, it's time we stop hedging our bets and embrace the convergence reality. Every startup pitch deck with 'model-agnostic' as a feature should become a red flag for investors who understand product-market fit. Stop putting 'works with any LLM' in your one-liner. It screams 'we don't know what we're building.'&lt;/p&gt;

&lt;p&gt;If you're still building model-agnostic AI tools in 2025, you're optimizing for the wrong thing. Users don't want flexibility; they want reliability. And in a converged model landscape, reliability comes from deep specialization, not broad compatibility.&lt;/p&gt;

&lt;p&gt;Pick your model like you pick your therapist: for the long haul. Find the right model, tune deeply, get close enough to understand its quirks and make them work for you. Stop architecting for the mythical future where you'll seamlessly swap models. That future doesn't exist, and chasing it is costing you the present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus level: All-in On One Model Means All-out On Eval
&lt;/h2&gt;

&lt;p&gt;If any of this is landing for you, you'll agree that we have to start thinking of model evaluation as architecture, not an afterthought. The good news is, rigorous model eval doesn't have to be mind-numbing anymore.&lt;/p&gt;

&lt;p&gt;Turns out, games are really great eval tools! Now you can spin up your very own little &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;AI Town&lt;/a&gt; on Fly.io with a single click deploy to test different models as pixel people in an evolving environment. I discuss the idea further in &lt;a href="https://fly.io/blog/games-as-model-eval/" rel="noopener noreferrer"&gt;Games as Model Eval: 1-Click Deploy AI Town on Fly.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>models</category>
    </item>
    <item>
      <title>Build Better Agents With MorphLLM</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 26 Aug 2025 11:34:45 +0000</pubDate>
      <link>https://dev.to/flyio/build-better-agents-with-morphllm-4mc8</link>
      <guid>https://dev.to/flyio/build-better-agents-with-morphllm-4mc8</guid>
      <description>&lt;p&gt;I'm an audiophile, which is a nice way to describe someone who spends their children's college fund on equipment that yields no audible improvement in sound quality. As such, I refused to use wireless headphones for the longest time. The fun thing about wired headphones is when you forget they're on and you stand up, you simultaneously cause irreparable neck injuries and extensive property damage. This eventually prompted me to buy good wireless headphones and, you know what, I break fewer things now. I can also stand up from my desk and not be exposed to the aural horrors of the real world. &lt;/p&gt;

&lt;p&gt;This is all to say, sometimes you don't know how big a problem is until you solve it. This week, I chatted to the fine people building &lt;a href="https://morphllm.com/" rel="noopener noreferrer"&gt;MorphLLM&lt;/a&gt;, which is exactly that kind of solution for AI agent builders. &lt;/p&gt;

&lt;h2&gt;
  
  
  Slow, Wasteful and Expensive AI Code Changes
&lt;/h2&gt;

&lt;p&gt;If you’re building AI agents that write or edit code, you’re probably accepting the following as "the way it is": your agent needs to correct a single line of code, but rewrites an entire file to do it. Search-and-replace, right? It’s fragile, breaks formatting, silently fails, or straight up leaves important functions out. The result is slow, inaccurate code changes, excessive token use, and an agent that feels incompetent and unreliable.&lt;/p&gt;

&lt;p&gt;Full file rewrites are context-blind and prone to hallucinations, especially when editing that 3000+ line file that you've been meaning to refactor. And every failure and iteration is wasted compute, wasted money and worst of all, wasted time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Aren’t Thinking About This (or why I wasn't)
&lt;/h2&gt;

&lt;p&gt;AI workflows are still new to everyone. Best practices are still just opinions and most tooling is focused on model quality, not developer velocity or cost. This is a big part of why we feel that slow, wasteful code edits are just the price of admission for AI-powered development.&lt;/p&gt;

&lt;p&gt;In reality, these inefficiencies become a real bottleneck for coding agent tools. The hidden tax on every code edit adds up and your users pay with their time, especially as teams scale and projects grow more complex.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better, Faster AI Code Edits with Morph Fast Apply
&lt;/h2&gt;

&lt;p&gt;MorphLLM's core innovation is Morph Fast Apply. It's an edit merge tool that is semantic, structure-aware and designed specifically for code. Those are big words to describe a tool that will empower your agents to make single-line changes without rewriting whole files or relying on brittle search-and-replace. Instead, your agent applies precise, context-aware edits, and it does it ridiculously fast.&lt;/p&gt;

&lt;p&gt;It works like this: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You add an &lt;code&gt;edit_file&lt;/code&gt; tool to your agent's tools.&lt;/li&gt;
&lt;li&gt;Your agent outputs tiny &lt;code&gt;edit_file&lt;/code&gt; snippets, using &lt;code&gt;//...existing code...&lt;/code&gt; placeholders to indicate unchanged code.&lt;/li&gt;
&lt;li&gt;Your backend calls Morph’s Apply API, which merges the changes semantically. It doesn't just replace raw text, it makes targeted merges with the code base as context. &lt;/li&gt;
&lt;li&gt;You write back the precisely edited file. No manual patching, no painful conflict resolution, no context lost.&lt;/li&gt;
&lt;/ul&gt;
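&lt;p&gt;As a rough sketch of that loop, here's how the snippet and request might be assembled. The model name, prompt framing and payload shape below are my own illustrative assumptions, not Morph's documented API; check docs.morphllm.com for the real formats.&lt;/p&gt;

```python
# Illustrative sketch only: the model name and prompt framing are
# assumptions, not Morph's documented API (see docs.morphllm.com).

PLACEHOLDER = "// ... existing code ..."

def build_edit_snippet(changed_lines):
    """Wrap just the changed lines in placeholders so the merge model
    knows everything else in the file is untouched."""
    return "\n".join([PLACEHOLDER, *changed_lines, PLACEHOLDER])

def build_apply_request(original_file, edit_snippet):
    """Assemble an OpenAI-style chat payload pairing the full original
    file with the tiny edit snippet for a semantic merge."""
    return {
        "model": "morph-fast-apply",  # hypothetical model name
        "messages": [{
            "role": "user",
            "content": "ORIGINAL FILE:\n" + original_file
                       + "\nEDIT SNIPPET:\n" + edit_snippet,
        }],
    }
```

&lt;p&gt;In a real agent, your backend would POST this payload to the Apply API with your key, then write the merged file it returns back to disk.&lt;/p&gt;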

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;MorphLLM's Apply API processes over 4,500 tokens per second and their benchmark results are nuts. We're talking 98% accuracy in ~6 seconds per file. Compare this to 35s (with error corrections) at 86% accuracy for traditional search-and-replace systems. Files up to 9k tokens in size take ~4 seconds to process. &lt;/p&gt;

&lt;p&gt;Just look at the damn &lt;a href="https://morphllm.com/benchmarks" rel="noopener noreferrer"&gt;graph&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbpmtj7k71u04j2qzqkx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbpmtj7k71u04j2qzqkx.webp" alt="time performance graph on MorphLLM" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are game-changing numbers for agent builders. Real-time code UIs become possible. Dynamic codebases can self-adapt in seconds, not minutes. Scale to multi-file edits, documentation, and even large asset transformations without sacrificing speed or accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get in on the MorphLLM Action
&lt;/h2&gt;

&lt;p&gt;Integration with your project is easy peasy. MorphLLM is API-compatible with OpenAI, Vercel AI SDK, MCP, and OpenRouter. You can run it in the cloud, self-host, or go on-prem with enterprise-grade guarantees. &lt;/p&gt;

&lt;p&gt;I want to cloud host mine, if only I could think of somewhere I could quickly and easily deploy wherever I want and only pay when I'm using the infra 😉.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Morphed
&lt;/h2&gt;

&lt;p&gt;MorphLLM feels like a plug-in upgrade for code agent projects that will instantly make them faster and more accurate. Check out the docs, benchmarks, and integration guides at &lt;a href="https://docs.morphllm.com/" rel="noopener noreferrer"&gt;docs.morphllm.com&lt;/a&gt;. Get started for free at &lt;a href="https://morphllm.com/dashboard" rel="noopener noreferrer"&gt;https://morphllm.com/dashboard&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Games as Model Eval: 1-Click Deploy AI Town on Fly.io</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 19 Aug 2025 08:46:14 +0000</pubDate>
      <link>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-13n</link>
      <guid>https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-13n</guid>
      <description>&lt;p&gt;Recently, I suggested that [The Future Isn't Model Agnostic](https://fly.io/blog/the-future-isn-t-model-agnostic/), that it's better to pick one model that works for your project and build around it, rather than engineering for model flexibility. If you buy that, you also have to acknowledge how important comprehensive model evaluation becomes. &lt;/p&gt;

&lt;p&gt;Benchmarks tell us almost nothing about how a model will actually behave in the wild, especially with long contexts, or when trusted to deliver the tone and feel that defines the UX we’re shooting for. Even the best evaluation pipelines usually end in subjective, side-by-side output comparisons. Not especially rigorous, and more importantly, boring af.&lt;/p&gt;

&lt;p&gt;Can we gamify model evaluation? Oh yes. And not just because we get to have some fun for once. Google backed me up this week when it announced the &lt;a href="https://blog.google/technology/ai/kaggle-game-arena/" rel="noopener noreferrer"&gt;Kaggle Game Arena&lt;/a&gt;, a public platform where we can watch AI models duke it out in a variety of classic games. Quoting Google: "Current AI benchmarks are struggling to keep pace with modern models... it can be hard to know if models trained on internet data are actually solving problems or just remembering answers they've already seen."&lt;/p&gt;

&lt;p&gt;When models boss reading comprehension tests, or ace math problems, we pay attention. But when they fail to navigate a simple conversation with a virtual character or completely botch a strategic decision in a game environment, we tell ourselves we're not building a game anyway and develop strategic short-term memory loss. &lt;br&gt;
Just like I've told my mom a thousand times, games are great at testing brains, and it's time we take this seriously when it comes to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Games Don't Lie
&lt;/h2&gt;

&lt;p&gt;Games provide what benchmarks can't: "a clear, unambiguous signal of success." They give us observable behavior in dynamic environments, the kind that would be extremely difficult (and tedious) to simulate with prompt engineering alone.&lt;/p&gt;

&lt;p&gt;Games force models to demonstrate the skills we actually care about: strategic reasoning, long-term planning, and dynamic adaptation in interactions with an opponent or a collaborator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pixel Art Meets Effective Model Evaluation - AI Town on Fly.io
&lt;/h2&gt;

&lt;p&gt;AI Town is a brilliant project by &lt;a href="https://github.com/a16z-infra" rel="noopener noreferrer"&gt;a16z-infra&lt;/a&gt;, based on the mind-bending paper &lt;a href="https://arxiv.org/pdf/2304.03442" rel="noopener noreferrer"&gt;Generative Agents: Interactive Simulacra of Human Behavior&lt;/a&gt;. It's a beautifully rendered little town in which tiny people with AI brains and engineered personalities go about their lives, interacting with each other and their environment. Characters need to remember past conversations, maintain relationships, react dynamically to new situations, and stay in character while doing it all.&lt;/p&gt;

&lt;p&gt;I challenge you to find a more entertaining way of evaluating conversational models. &lt;/p&gt;

&lt;p&gt;I've &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;forked the project&lt;/a&gt; to make it absurdly easy to spin up your own AI Town on Fly Machines. You get a single deploy script that sets everything up for you, plus some built-in cost and performance optimizations, with our handy scale-to-zero functionality as standard (so you only pay for the time spent running it). This makes it easy to share with your team, your friends and your mom.&lt;/p&gt;

&lt;p&gt;In its current state, the fork makes it as easy as possible to test any OpenAI-compatible service, any model on Together.ai and even custom embedding models. Simply set the relevant API key in your secrets.&lt;/p&gt;

&lt;p&gt;Games like AI Town give us a window into how models actually think, adapt, and behave beyond the context of our prompts. You move past performance metrics and begin to understand a model’s personality, quirks, strengths, and weaknesses; all factors that ultimately shape your project's UX. &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>testing</category>
    </item>
    <item>
      <title>Trust Calibration for AI Software Builders</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 19 Aug 2025 08:42:45 +0000</pubDate>
      <link>https://dev.to/flyio/trust-calibration-for-ai-software-builders-1ifj</link>
      <guid>https://dev.to/flyio/trust-calibration-for-ai-software-builders-1ifj</guid>
      <description>&lt;p&gt; Trust calibration is a concept from the world of human-machine interaction design, one that is super relevant to AI software builders. Trust calibration is the practice of aligning the level of trust that users have in our products with its actual capabilities. &lt;/p&gt;

&lt;p&gt;If we build things that our users trust too blindly, we risk facilitating dangerous or destructive interactions that can permanently turn users off. If they don't trust our product enough, it will feel useless or less capable than it actually is. &lt;/p&gt;

&lt;p&gt;So what does trust calibration look like in practice, and how do we achieve it? A 2023 study reviewed over 1,000 papers on trust and trust calibration in human-automated systems (properly referenced at the end of this article). It holds some pretty eye-opening insights – and some inconvenient truths – for people building AI software. I've tried to extract just the juicy bits below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limiting Trust
&lt;/h2&gt;

&lt;p&gt;Let's begin with a critical point. There is a limit to how deeply we want users to trust our products. Designing for calibrated trust is the goal, not more trust at any cost. Shoddy trust calibration leads to two equally undesirable outcomes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Over-trust&lt;/strong&gt; causes users to rely on AI systems in situations where they shouldn't (I told my code assistant to fix a bug in prod and went to bed).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under-trust&lt;/strong&gt; causes users to reject AI assistance even when it would be beneficial, resulting in reduced perception of value and increased user workload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does calibrated trust look like for your product? It’s important to understand that determining this is less about trying to diagram a set of abstract trust parameters and more about helping users develop accurate mental models of your product's capabilities and limitations. In most cases, this requires thinking beyond the trust calibration mechanisms we default to, like confidence scores. &lt;/p&gt;

&lt;p&gt;For example, Cursor's most prominent trust calibration mechanism is its change suggestion highlighting. The code that the model suggests we change is highlighted in red, followed by suggested changes highlighted in green. This immediately communicates that "this is a suggestion, not a command."&lt;/p&gt;

&lt;p&gt;In contrast, Tesla's Autopilot is a delegative system. It must calibrate trust differently through detailed capability explanations, clear operational boundaries (only on highways), and prominent disengagement alerts when conditions exceed system limits. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building Cooperative Systems
&lt;/h2&gt;

&lt;p&gt;Perhaps the most fundamental consideration in determining high-level trust calibration objectives is deciding whether your project is designed to be a cooperative or a delegative tool.&lt;/p&gt;

&lt;p&gt;Cooperative systems generally call for lower levels of trust because users can choose whether to accept or reject AI suggestions. But these systems also face a unique risk. It’s easy for over-trust to gradually transform user complacency into over-reliance, effectively transforming what we designed as a cooperative relationship into a delegative one, only without any of the required safeguards.&lt;/p&gt;

&lt;p&gt;If you're building a coding assistant, content generator, or design tool, implement visible "suggestion boundaries" which make it clear when the AI is offering ideas versus making decisions. Grammarly does this well by underlining suggestions rather than auto-correcting, and showing rationale on hover. &lt;/p&gt;

&lt;p&gt;For higher-stakes interactions, consider introducing friction. Require explicit confirmation before applying AI suggestions to production code or publishing AI-generated content.&lt;/p&gt;
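&lt;p&gt;A minimal sketch of that friction, with all names hypothetical: the gate simply refuses to apply a suggestion to a production target unless an explicit confirmation callback says yes.&lt;/p&gt;

```python
def apply_suggestion(suggestion, target, confirm):
    """Gate high-stakes AI edits behind explicit confirmation.

    'confirm' is a callable supplied by the UI layer (dialog, CLI
    prompt, review step), so this logic stays independent of
    presentation. All names here are hypothetical, for illustration.
    """
    # Only production targets get the extra friction; everything else
    # is applied directly, preserving the cooperative feel.
    if target.get("environment") == "production" and not confirm(suggestion):
        return {"applied": False, "reason": "user declined"}
    target["content"] = suggestion
    return {"applied": True, "reason": None}
```

&lt;p&gt;Keeping the confirmation step as a callback also makes the gate trivial to test and to tune per deployment.&lt;/p&gt;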

&lt;h2&gt;
  
  
  Building Delegative Systems
&lt;/h2&gt;

&lt;p&gt;In contrast, users expect delegative systems to replace human action entirely. Blind trust in the system is a requirement for it to be considered valuable at all. &lt;/p&gt;

&lt;p&gt;If you're building automation tools, smart scheduling, or decision-making systems, invest heavily in capability communication and boundary setting. Calendly's smart scheduling works because it clearly communicates what it will and won't do (I'll find times that work for both of us vs. I'll reschedule your existing meetings). Build robust fallback mechanisms and make system limitations prominent in your onboarding.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Timing Is Everything
&lt;/h2&gt;

&lt;p&gt;The study suggests that when we make trust calibrations is at least as important as how we make them. There are three critical windows for trust calibration, each with its own opportunities and challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-interaction calibration&lt;/strong&gt; happens before users engage with the system. Docs and tutorials fall into this category. Setting expectations up front can prevent initial over-trust, which is disproportionately more difficult to correct later. &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Pre-interaction calibrations could look like capability-focused onboarding that shows both successes and failures. Rather than just demonstrating perfect AI outputs, show users examples where the AI makes mistakes and how to catch them. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;During-interaction calibration&lt;/strong&gt; is trust adjustment through real-time feedback. In the study, dynamically updated cues calibrated trust better than static displays, and adaptive mechanisms that respond to user behavior outperformed systems that showed everyone the same information. &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Build confidence indicators that update based on context, not just model confidence. For example, if you're building a document AI, show higher confidence for standard document types the system has seen thousands of times, and lower confidence for unusual formats.&lt;/p&gt;
&lt;/blockquote&gt;
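&lt;p&gt;One way to sketch that kind of context-aware indicator. The discount factor and familiarity cutoff below are illustrative assumptions, not numbers from the study:&lt;/p&gt;

```python
def displayed_confidence(model_confidence, doc_type, seen_counts, rare_cutoff=50):
    """Blend raw model confidence with how familiar this document type is.

    'seen_counts' maps document types to how many examples the system
    has processed. The 0.5 floor and 'rare_cutoff' are illustrative
    assumptions: unseen types show half the raw confidence, well-known
    types show it unchanged.
    """
    seen = seen_counts.get(doc_type, 0)
    familiarity = min(seen / rare_cutoff, 1.0)  # 0 for unseen, 1 once common
    return model_confidence * (0.5 + 0.5 * familiarity)
```

&lt;p&gt;The same raw score then reads differently to the user depending on how often the system has actually seen that kind of input.&lt;/p&gt;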

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Post-interaction calibration&lt;/strong&gt; focuses on learning and adjustment that helps users understand successes and failures in the system after interactions. These aren’t reliable, since by the time users receive the information, their trust patterns are set and hard to change. &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Post-interaction feedback can still be valuable for teaching. Create "reflection moments" after significant interactions. Midjourney does this by letting users rate image outputs, helping users learn what prompts work best while calibrating their expectations for future generations. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Trust is front-loaded and habit-driven. The most effective calibration happens before and during use, when expectations are still forming and behaviors can still be shifted. Any later and you’re mostly fighting entrenched patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance vs. Process Information
&lt;/h2&gt;

&lt;p&gt;Users can be guided through performance-oriented signals (what the system can do) or process-oriented signals (how it works). The real challenge is matching the right kind of explanation to the right user, at the right moment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance-oriented calibration&lt;/strong&gt; focuses on communicating capability through mechanisms like reliability statistics, confidence scores, and clear capability boundaries. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process-oriented calibration&lt;/strong&gt; offers detailed explanations of decision-making processes, breakdowns of which factors influenced decisions, and reasoning transparency. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Process transparency seems like the obvious go-to at first glance, but the effectiveness of process explanations varies wildly based on user expertise and domain knowledge. If we are designing for a set of users that may fall anywhere on this spectrum, we have to avoid creating information overload for novice users while providing sufficient information to expert users who want the detail.  &lt;/p&gt;

&lt;p&gt;The most effective systems in the study combined both approaches, providing layered information that allows users to access the level of detail most appropriate for their expertise and current needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Static vs. Adaptive Calibration
&lt;/h2&gt;

&lt;p&gt;I really wanted to ignore this part, because it feels like the study’s authors are passive-aggressively adding todos to my projects. In a nutshell, adaptive calibration, where a system actively monitors user behavior and adjusts its communication accordingly, is orders of magnitude more effective than static calibration, which delivers the same information to every user regardless of differences in expertise, trust propensity, or behavior.&lt;/p&gt;

&lt;p&gt;Static calibration mechanisms are easy to build and maintain, which is why we like them. But the stark reality is that they put the burden of appropriate calibration entirely on our users. We’re making it their job to adapt their behaviour based on generic information.&lt;/p&gt;

&lt;p&gt;This finding has zero respect for our time or mental health, but it also reveals a legit opportunity for clever builders to truly separate their product from the herd.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical adaptive calibration techniques
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral adaptation:&lt;/strong&gt; Track how often users accept vs. reject suggestions and adjust confidence thresholds accordingly. If a user consistently rejects high-confidence suggestions, lower the threshold for showing uncertainty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context awareness:&lt;/strong&gt; Adjust trust signals based on use context. A writing AI might show higher confidence for grammar fixes than creative suggestions, or lower confidence late at night when users might be tired.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect expertise:&lt;/strong&gt; Users who frequently make sophisticated edits to AI output probably want more detailed explanations than those who typically accept entire file rewrites.&lt;/li&gt;
&lt;/ul&gt;
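&lt;p&gt;The behavioral adaptation bullet can be sketched in a few lines of Python. Everything here (the class name, the thresholds, the 0.9 cutoff for "high confidence") is illustrative, not a prescription from the study.&lt;/p&gt;

```python
# Hypothetical sketch: decide when to surface uncertainty based on how a
# user actually responds to suggestions, rather than using one fixed rule.

class TrustCalibrator:
    def __init__(self, uncertainty_threshold=0.8):
        # Suggestions below this confidence get an explicit uncertainty note.
        self.uncertainty_threshold = uncertainty_threshold
        self.high_conf_shown = 0
        self.high_conf_rejected = 0

    def record(self, confidence, accepted):
        """Track how the user treats high-confidence suggestions."""
        if confidence > 0.9:
            self.high_conf_shown += 1
            if not accepted:
                self.high_conf_rejected += 1
        self._adapt()

    def _adapt(self):
        # If the user keeps rejecting "sure thing" suggestions, start
        # flagging uncertainty earlier by raising the threshold.
        if self.high_conf_shown >= 5:
            rejection_rate = self.high_conf_rejected / self.high_conf_shown
            if rejection_rate > 0.5:
                self.uncertainty_threshold = min(
                    0.95, self.uncertainty_threshold + 0.05
                )

    def should_flag_uncertainty(self, confidence):
        return self.uncertainty_threshold > confidence
```

&lt;p&gt;Context awareness and expertise detection slot into the same shape: extra signals feeding the same threshold adjustment.&lt;/p&gt;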

&lt;h2&gt;
  
  
  The Transparency Paradox
&lt;/h2&gt;

&lt;p&gt;The idea that transparency and explainability can actually harm trust calibration is easily the point that hit me the hardest. While explanations can improve user understanding, they can also create information overload that reduces users' ability to detect and correct trash output. What's worse, explanations can create a whole new layer of trust calibration issues, with users over-trusting the explanation mechanism itself, rather than critically evaluating the actual output.&lt;/p&gt;

&lt;p&gt;This suggests that quality over quantity should be our design philosophy when it comes to transparency. We should provide carefully crafted, relevant information rather than comprehensive but overwhelming detail. The goal should be enabling better decision-making rather than simply satisfying user curiosity about system internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropomorphism and Unwarranted Trust
&lt;/h2&gt;

&lt;p&gt;It seems obvious that we should make interactions with our AI project feel as human as possible. Well, it turns out that systems that appear more human-like through design, language, or interaction patterns are notoriously good at increasing user trust beyond actual system capabilities. &lt;/p&gt;

&lt;p&gt;So it’s entirely possible that building more traditional human-computer interactions can actually make our AI projects safer to use and, therefore, more user-friendly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use tool-like language:&lt;/strong&gt; Frame outputs as "analysis suggests" rather than "I think" or "I believe"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embrace machine-like precision:&lt;/strong&gt; Show exact confidence percentages rather than human-like hedging ("I'm pretty sure that...")&lt;/li&gt;
&lt;/ul&gt;
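&lt;p&gt;Both bullets boil down to a formatting decision you can make in one place. A toy sketch, with wording that's purely illustrative:&lt;/p&gt;

```python
# Frame model output as a tool result ("Analysis suggests ...") with an
# exact confidence figure, instead of first-person hedging ("I think ...").

def frame_output(finding, confidence):
    # confidence is a float in [0, 1]; rendered as a precise percentage.
    return f"Analysis suggests: {finding} (confidence: {confidence:.0%})"
```

&lt;p&gt;Centralizing the framing like this also makes it trivial to A/B test tool-like vs. human-like language later.&lt;/p&gt;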

&lt;h2&gt;
  
  
  Trust Falls Faster Than It Climbs
&lt;/h2&gt;

&lt;p&gt;Nothing particularly groundbreaking here, but the findings are worth mentioning if only to reinforce what we think we know. &lt;/p&gt;

&lt;p&gt;Early interactions are critically important. Users form mental models quickly and then react slowly to changes in system reliability.&lt;/p&gt;

&lt;p&gt;More critically, trust drops much faster from system failures than it builds from successes. These asymmetries suggest that we should invest disproportionately in onboarding and first-use experiences, even if they come with higher development costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measurement is an Opportunity for Innovation
&lt;/h2&gt;

&lt;p&gt;The study revealed gaping voids where effective measurement mechanisms and protocols should be, for both researchers and builders. There is a clear need to move beyond simple user satisfaction metrics or adoption rates to developing measurement frameworks that can actively detect miscalibrated trust patterns. &lt;/p&gt;

&lt;p&gt;The ideal measurement approach would combine multiple indicators. A few examples of viable indicators are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral signals:&lt;/strong&gt; Track acceptance rates for different confidence levels. Well-calibrated trust should show higher acceptance rates for high-confidence outputs and lower rates for low-confidence ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-specific metrics:&lt;/strong&gt; Measure trust calibration separately for different use cases. Users might be well-calibrated for simple tasks but poorly calibrated for complex ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User self-reporting:&lt;/strong&gt; Regular pulse surveys asking "How confident are you in your ability to tell when this AI makes mistakes?" can reveal calibration gaps.&lt;/li&gt;
&lt;/ul&gt;
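&lt;p&gt;The behavioral-signals idea is cheap to prototype: bucket logged interactions by model confidence and compare acceptance rates. The bucket boundaries and names below are arbitrary choices of mine, not from the study.&lt;/p&gt;

```python
# Sketch: compute acceptance rate per confidence bucket from usage logs.
# Well-calibrated trust should show rates rising with confidence; a flat
# or inverted profile suggests users are ignoring the confidence signal.

def acceptance_by_confidence(events):
    """events: iterable of (confidence, accepted) pairs from usage logs."""
    stats = {"low": [0, 0], "mid": [0, 0], "high": [0, 0]}  # [accepted, shown]
    for confidence, accepted in events:
        if confidence >= 0.8:
            key = "high"
        elif confidence >= 0.5:
            key = "mid"
        else:
            key = "low"
        stats[key][1] += 1
        if accepted:
            stats[key][0] += 1
    return {k: (acc / shown if shown else None) for k, (acc, shown) in stats.items()}

def looks_miscalibrated(rates):
    # Red flag: users accept low-confidence output as readily as
    # high-confidence output, or more so.
    if rates["high"] is None or rates["low"] is None:
        return False
    return rates["low"] >= rates["high"]
```

&lt;p&gt;Split the same computation by task type and you get the context-specific metric from the second bullet almost for free.&lt;/p&gt;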

&lt;h2&gt;
  
  
  The Calibrated Conclusion
&lt;/h2&gt;

&lt;p&gt;It's clear, at least from this study, that there’s no universal formula, or single feature that will effectively calibrate trust. It's up to every builder to define and understand their project's trust goals and to balance timing, content, adaptivity, and transparency accordingly. That’s what makes it both hard and worth doing. Trust calibration has to be a core part of our product’s identity, not a piglet we only start chasing once it has escaped the barn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Study:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Magdalena Wischnewski, Nicole Krämer, and Emmanuel Müller. 2023. Measuring and Understanding Trust Calibrations for Automated Systems: A Survey of the State-Of-The-Art and Future Directions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23–28, 2023, Hamburg, Germany. ACM, New York, NY, USA, 16 pages. &lt;a href="https://doi.org/10.1145/3544548.3581197" rel="noopener noreferrer"&gt;https://doi.org/10.1145/3544548.3581197&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>ux</category>
    </item>
    <item>
      <title>The Future Isn't Model Agnostic</title>
      <dc:creator>Daniel Botha</dc:creator>
      <pubDate>Tue, 12 Aug 2025 09:31:28 +0000</pubDate>
      <link>https://dev.to/flyio/the-future-isnt-model-agnostic-447k</link>
      <guid>https://dev.to/flyio/the-future-isnt-model-agnostic-447k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your users don't care that your AI project is model &lt;br&gt;
agnostic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my last project, I spent countless hours ensuring that the LLMs running my services could be swapped out as easily as possible. I couldn't touch a device with an internet connection without hearing about the latest benchmark-breaking model, and it felt like a clear priority to ensure I could hot-swap models with minimal collateral damage.&lt;/p&gt;

&lt;p&gt;So yeah. That was a waste of time.&lt;/p&gt;

&lt;p&gt;The hype around new model announcements feels more manufactured with each release. In reality, improvements are becoming incremental. As major providers converge on the same baseline, the days of one company holding a decisive lead are numbered.&lt;/p&gt;

&lt;p&gt;In a world of model parity, the differentiation moves entirely to the product layer. Winning isn't about ensuring you're using the best model, it's about understanding your chosen model deeply enough to build experiences that feel magical: knowing exactly how to prompt for consistency, which edge cases to avoid, and how to design workflows that play to your model's particular strengths.&lt;/p&gt;

&lt;p&gt;Model agnosticism isn't just inefficient, it's misguided. Fact is, swapping out your model is not just changing an endpoint. It's rewriting prompts, rerunning evals, users telling you things just feel... different. And if you've won users on the way it feels to use your product, that last one is a really big deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model &amp;lt; Product
&lt;/h2&gt;

&lt;p&gt;Recently, something happened that fully solidified this idea in my head. Claude Code is winning among people building real things with AI. We even have evangelists in the Fly.io engineering team, and those guys are weird smart. Elsewhere, whole communities have formed to share and compare claude.md's and fight each other over which MCP servers are the coolest to use with Claude.&lt;/p&gt;

&lt;p&gt;Enter stage right, Qwen 3 Coder. It takes Claude to the cleaners in benchmarks. But the response from the Claude Code user base? A collective meh.&lt;/p&gt;

&lt;p&gt;This is nothing like 2024, when everyone would have dropped everything to get the hot new model running in Cursor. And it's not because we've learned that benchmarks are performance theater for people who've never shipped a product.&lt;/p&gt;

&lt;p&gt;It's because products like Claude Code are irrefutable evidence that the model isn't the product. We've felt it first hand when our pair programmer's behaviour changes in subtle ways. The product is in the rituals. The trust. The predictability. It's precisely because Claude Code's model behavior, UI, and user expectations are so tightly coupled that its users don't really care that a better model might exist.&lt;/p&gt;

&lt;p&gt;I'm not trying to praise Anthropic here. The point is, engineering for model agnosticism is a trap that will eat up time that could be better spent … anywhere else.&lt;/p&gt;

&lt;p&gt;Sure, if you're building infra or anything else that lives close to the metal, model optionality still matters. But people entrusting legwork to AI tools are building deeper relationships with, and higher expectations of, those tools than they care to admit. AI product success stories are written when products become invisible parts of users' daily rituals, not showcases for engineering flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make One Model Your Own
&lt;/h2&gt;

&lt;p&gt;As builders, it's time we stop hedging our bets and embrace the convergence reality. Every startup pitch deck with 'model-agnostic' as a feature should become a red flag for investors who understand product-market fit. Stop putting 'works with any LLM' in your one-liner. It screams 'we don't know what we're building.'&lt;/p&gt;

&lt;p&gt;If you're still building model-agnostic AI tools in 2025, you're optimizing for the wrong thing. Users don't want flexibility; they want reliability. And in a converged model landscape, reliability comes from deep specialization, not broad compatibility.&lt;/p&gt;

&lt;p&gt;Pick your model like you pick your therapist: for the long haul. Find the right model, tune deeply, and get close enough to understand its quirks and make them work for you. Stop architecting for the mythical future where you'll seamlessly swap models. That future doesn't exist, and chasing it is costing you the present.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus level: All-in On One Model Means All-out On Eval
&lt;/h2&gt;

&lt;p&gt;If any of this is landing for you, you'll agree that we have to start thinking of model evaluation as architecture, not an afterthought. The good news is, rigorous model eval doesn't have to be mind-numbing anymore. &lt;/p&gt;

&lt;p&gt;Turns out, games are really great eval tools! Now you can spin up your very own little &lt;a href="https://github.com/fly-apps/ai-town_on_fly.io" rel="noopener noreferrer"&gt;AI Town&lt;/a&gt; on Fly.io with a single-click deploy to test different models as pixel people in an evolving environment. I discuss the idea further in &lt;a href="https://dev.to/flyio/games-as-model-eval-1-click-deploy-ai-town-on-flyio-29o1" rel="noopener noreferrer"&gt;Games as Model Eval: 1-Click Deploy AI Town on Fly.io&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
