<devtips/>

I build AI features like a final boss: the 7 dangerous assumptions teams keep making

Shipping AI looked easy. That’s how I knew something was wrong.


The first time we shipped an AI feature, it felt… smooth.

Too smooth.

No red dashboards.
No angry messages.
No “hey, is prod supposed to do this?” energy.

The demo worked. The output looked smart. Someone in the meeting actually said, “Wow, this feels magical.”
Which, if you’ve been doing this long enough, is usually the moment the game switches from tutorial mode to the boss fight.

Because AI features don’t fail like normal features.

They don’t crash loudly. They don’t throw obvious errors. They don’t wake you up with alerts. They sit there, smiling, answering confidently while quietly teaching users the wrong thing… or burning money in the background… or degrading just slowly enough that no one notices until trust is already gone.

That’s the trap.

Right now, everyone is “adding AI.” Copilots. Assistants. Chat boxes duct-taped onto products that were never designed for them. And honestly? I get it. The tools are good. The APIs are friendly. You can go from idea to demo faster than ever.

Which is exactly why this is dangerous.

Most AI features don’t break because the model is bad. They break because of assumptions teams don’t even realize they’re making: assumptions that feel reasonable, sound experienced, and still quietly wreck the product later.

I’ve made all of these. Some confidently. Some while explaining to someone else why we were “being careful.”

TL;DR
Shipping AI isn’t about picking the right model. It’s about everything around it: cost, UX, failure modes, trust, and long-term decay. The real final boss isn’t the tech; it’s the assumptions you carry into production without noticing.

“The model is the hard part”

This is the first assumption almost everyone makes. And honestly, I don’t blame anyone for it.

You spend days, sometimes weeks, arguing about models. GPT-4 vs GPT-4.1. Temperature tweaks. Context window math. You finally land on something that produces good answers. Clean answers. Answers that make the demo feel impressive instead of awkward.

At that point, there’s a very human urge to relax.
The hard part is done. Right?

That’s usually when the real problems start.

In our case, the model itself was fine. Solid, predictable, boring in the best way. The failures came from everything around it. Latency spikes we hadn’t load-tested properly. Requests timing out under real traffic. One slow response blocking an entire user flow because we assumed the model would “basically always respond.”

It turns out “basically always” is not a reliability strategy.

We had moments where nothing was technically broken. No errors. No crashes. Just enough delay to make users hesitate. Click again. Lose confidence. From the outside, it looked like “AI being flaky.” Internally, it was just missing retries, weak fallbacks, and optimistic assumptions about how fast the world would be.

This is the part people don’t like to talk about: once models hit a certain quality floor, the model choice matters less than your system design. Timeouts matter. Backpressure matters. What you do when the model doesn’t answer matters more than how smart it is when it does.

If your AI feature can’t fail politely, it will fail socially.
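
Here’s roughly what failing politely can look like. This is a minimal sketch, not our production code: `callModel` stands in for whatever client you actually use, and the timeout and retry numbers are made up. The shape is the point: a deadline, one retry, and a fallback the UI can render instead of a spinner that never ends.

```typescript
// A minimal sketch of "failing politely": timeout, one retry, and a
// graceful fallback instead of letting one slow call block the whole flow.
// `callModel` is a stand-in for whatever client you actually use.

type CallModel = (prompt: string, signal: AbortSignal) => Promise<string>;

async function askWithFallback(
  callModel: CallModel,
  prompt: string,
  { timeoutMs = 8_000, retries = 1 } = {}
): Promise<{ text: string; degraded: boolean }> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const text = await callModel(prompt, controller.signal);
      return { text, degraded: false };
    } catch {
      // Slow or unavailable: fall through to the next attempt.
    } finally {
      clearTimeout(timer);
    }
  }
  // The UI decides what "degraded" looks like: a cached answer, plain search,
  // or an honest "this is taking longer than usual" message.
  return { text: "", degraded: true };
}
```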

Ask yourself one uncomfortable question before shipping:

What does the user experience look like when the model is slow, unavailable, or confused?

If the answer is “uh… hopefully that doesn’t happen,” congrats, you’ve met the real final boss.

“Users know how to prompt”

They don’t.
They really, really don’t.

This assumption usually sneaks in quietly. You’ve been testing the feature yourself. Your teammates try it. Everyone knows roughly what to type because everyone already knows what the system can do. The AI feels smart. Responsive. Helpful.

Then real users show up.

And they type things like:

Fix this

Make it better

Why is this wrong

No context. No input. Just vibes and hope.

The first time I watched session replays of this, it clicked: prompting isn’t a user skill. It’s a UX problem we keep pretending belongs to someone else. We ship a text box, label it “Ask anything,” and act surprised when people… ask badly.

Of course they do. They’re not living in prompt threads. They don’t know what context the model has. They don’t know what it can or can’t see. They assume it’s smarter than it is because that’s what the UI implies.

We had flows where the AI technically worked, but only if you asked in the exact right way. Which meant power users got value and everyone else bounced. From their perspective, the feature wasn’t “hard to use.” It was just inconsistent. Sometimes helpful, sometimes dumb, sometimes confusing.

That’s not an AI problem. That’s on us.

The teams that get this right don’t brag about clever prompts. They hide them. They guide the user without making it feel like guidance. Examples, defaults, constraints, and guardrails quietly do the heavy lifting so the user feels smart instead of blamed.
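
To make that concrete, here’s a rough sketch of what hiding the prompt can look like. Everything in it is hypothetical (the field names, the editor context, the constraints), but the idea is that a vague request gets wrapped in the context and rules the model actually needs before it ever leaves the backend.

```typescript
// A sketch of hiding the prompt work from the user: the UI collects a vague
// request ("fix this") and the backend adds the context, constraints, and
// examples the model actually needs. All field names here are made up.

interface EditorContext {
  fileName: string;
  selectedCode: string;
  language: string;
}

function buildPrompt(userInput: string, ctx: EditorContext): string {
  return [
    `You are helping with a ${ctx.language} file named ${ctx.fileName}.`,
    `The user selected this code:\n${ctx.selectedCode}`,
    `Their request (may be vague): "${userInput}"`,
    "Constraints:",
    "- Only change the selected code.",
    "- If the request is too vague to act on, ask one clarifying question instead.",
  ].join("\n\n");
}

// "Fix this" now arrives at the model with the file, the selection, and
// explicit rules. The user never sees any of it.
const prompt = buildPrompt("Fix this", {
  fileName: "checkout.ts",
  selectedCode: "const total = items.reduce((sum, item) => sum + item.price, 0);",
  language: "TypeScript",
});
```

The user typed three words. The model got a file, a selection, and explicit instructions. That’s the gap the UX has to close.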

If your AI feature needs a tutorial to be used correctly, you didn’t ship intelligence. You shipped homework.

One question worth asking before launch:

Did we design how users should talk to this, or did we just hope they’d figure it out?

“Hallucinations are edge cases”

This one sounds reasonable. That’s what makes it dangerous.

Models are better now. Way better. They hallucinate less, recover faster, and usually stay on track if you give them enough context. So it’s easy to tell yourself that wrong answers are rare, an acceptable tradeoff for speed and magic.

They’re not rare. They’re just quiet.

The real problem isn’t that AI gets things wrong. It’s that it gets them wrong confidently. Clean sentences. Polished tone. Zero hesitation. From a user’s perspective, that reads as authority.

We shipped a feature that summarized internal docs. Looked great in testing. Clear language, bullet points, even helpful suggestions. Then we noticed something subtle: every so often, it would invent details that felt right. Not absurd. Not obviously broken. Just… false.

No errors. No warnings. Nothing that told the user, “Hey, double-check this.”

That’s worse than a crash.

When software crashes, users know something went wrong. When AI hallucinates, users walk away with the wrong answer and no reason to doubt it. Trust erodes quietly, one confident mistake at a time.

This is why hallucinations aren’t edge cases. They’re a default failure mode. If you don’t design around them with retrieval, citations, scoped knowledge, or explicit uncertainty, you’re shipping a system that teaches users to trust it right up until they shouldn’t.
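
One way to make that concrete, sketched with made-up types rather than any particular SDK: the model only answers from retrieved passages, it has to cite them by id, and anything uncited gets rendered as “I don’t know” instead of as a fact.

```typescript
// A sketch of designing around hallucination instead of hoping it's rare:
// the model only sees retrieved passages, must cite them by id, and the UI
// treats an uncited or empty answer as "we don't know", not as a fact.
// The shape of `Answer` is an assumption, not any particular SDK's output.

interface Passage {
  id: string;
  text: string;
}

interface Answer {
  text: string;
  citedIds: string[];
}

function renderAnswer(answer: Answer, passages: Passage[]): string {
  const known = new Set(passages.map((p) => p.id));
  const validCitations = answer.citedIds.filter((id) => known.has(id));

  // No grounded citations? Say so plainly instead of presenting prose as fact.
  if (answer.text.trim() === "" || validCitations.length === 0) {
    return "I couldn't find this in the docs you have access to.";
  }
  return `${answer.text}\n\nSources: ${validCitations.join(", ")}`;
}
```

It’s crude, but it turns “the model might be wrong” from a vibe into a code path.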

A hard but necessary question:

How does a user know when your AI might be wrong?

If the answer is “they’ll probably notice,” you’re playing roulette with credibility.


“Cost will sort itself out later”

This assumption usually shows up wearing a very calm face.

The feature is working. Users like it. Usage is growing. Someone says, “We’ll optimize later once we see real numbers.” Which sounds responsible. Mature, even.

And then the bill arrives.

AI costs don’t scale the way most teams expect. They don’t rise smoothly with traffic. They jump. One enthusiastic user pasting large inputs can do more damage than a hundred casual ones. A single prompt change can quietly double token usage. A retry loop you forgot about can turn into a money printer you never asked for.

We had a month where traffic stayed flat and spend didn’t. Finance wasn’t impressed by explanations about “better reasoning” or “richer outputs.” From their point of view, the graph went up for no obvious business reason, and they weren’t wrong.

The tricky part is that AI cost often hides behind success. The feature is being used. That’s good. But without hard limits, budgets, and visibility, “popular” quickly becomes “unsustainable.”

This feels a lot like early cloud days. EC2 felt infinite. S3 felt cheap. Until someone left something running and learned what “usage-based pricing” really means.

If you don’t know your worst-case monthly cost, you’re not budgeting. You’re hoping.
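
The envelope math isn’t complicated. Here’s a sketch with every number invented; plug in the caps you actually enforce and your real pricing. The point is that the answer exists before launch, not after the invoice.

```typescript
// A back-of-envelope worst-case cost calculation. Every number below is a
// placeholder, not real pricing or real traffic.

const assumptions = {
  pricePerMillionInputTokens: 3.0,   // USD, hypothetical
  pricePerMillionOutputTokens: 15.0, // USD, hypothetical
  maxInputTokensPerRequest: 8_000,   // an enforced cap, not an average
  maxOutputTokensPerRequest: 1_000,
  maxRequestsPerUserPerDay: 200,     // a rate limit you actually enforce
  activeUsers: 5_000,
};

const costPerRequestUSD =
  (assumptions.maxInputTokensPerRequest * assumptions.pricePerMillionInputTokens +
    assumptions.maxOutputTokensPerRequest * assumptions.pricePerMillionOutputTokens) /
  1_000_000;

const worstCaseMonthlyUSD =
  costPerRequestUSD *
  assumptions.maxRequestsPerUserPerDay *
  assumptions.activeUsers *
  30;

console.log(`Worst case: ~$${Math.round(worstCaseMonthlyUSD).toLocaleString()}/month`);
```

If you can’t fill in `maxRequestsPerUserPerDay` because there is no limit, that’s the finding.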

One question worth answering before launch:

What’s the most expensive way a user can use this feature?

If that number makes you uncomfortable, users will find it for you.

“Alignment is just polish”

This one usually gets deferred with good intentions.

The feature works. The answers are technically correct. Someone says, “We’ll clean up tone and edge cases later.” Which sounds reasonable until users start interacting with it like it’s a product, not a demo.

Because alignment issues don’t feel like AI bugs to users. They feel like you being careless.

We had outputs that were accurate but uncomfortable. Overly verbose. Slightly condescending. Sometimes way too confident in situations where hesitation would’ve been healthier. Nothing was “wrong” in a technical sense, but something felt off.

And users notice that instantly.

Tone, refusal behavior, and boundaries aren’t polish. They’re part of the contract you’re making with the person on the other side of the screen. If your AI refuses a request rudely, users don’t think “ah yes, a safety constraint.” They think the product is broken or hostile.

Worse, misalignment leaks trust fast. One weird response is enough to make people hesitate the next time. Two or three, and they stop relying on it entirely.

This is the part where AI starts to feel less like a feature and more like a coworker with no filter.

If you wouldn’t ship an API without auth or rate limits, don’t ship an AI without guardrails. Alignment isn’t about being safe in theory. It’s about being predictable in practice.
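
Even a crude, explicit refusal layer beats hoping the raw model declines gracefully on its own. A sketch, with invented categories and copy, of making “won’t answer” a designed state with its own wording:

```typescript
// A sketch of making refusals predictable instead of hoping the raw model
// declines gracefully. The categories and wording here are invented; the
// point is that "won't answer" is a designed state with its own copy.

const outOfScope: { pattern: RegExp; reply: string }[] = [
  {
    pattern: /\b(legal|medical)\s+advice\b/i,
    reply:
      "I can't help with legal or medical advice, but I can point you to our docs on this topic.",
  },
  {
    pattern: /\b(password|api key|credentials)\b/i,
    reply: "I can't help with credentials. Try your workspace admin instead.",
  },
];

function checkGuardrails(userInput: string): string | null {
  for (const rule of outOfScope) {
    if (rule.pattern.test(userInput)) return rule.reply;
  }
  return null; // Safe to forward to the model.
}
```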

A question worth asking before launch:

What does this AI do when it shouldn’t answer at all?

If the answer is vague, users will discover the edges for you.

“AI features age like normal code”

They don’t.

Normal code mostly breaks when you touch it. AI features break when you don’t.

This assumption sneaks in because nothing looks wrong at first. The feature is live. The metrics are stable. No alerts are firing. So it quietly slides down the priority list, lumped in with “we’ll refactor later.”

Months pass.

Then someone says, “Did this get worse?”

Not broken. Just… worse. Answers feel less sharp. Tone drifts. Users rephrase more often. Support tickets show up that don’t point to any single bug. Nothing is technically on fire, which makes it harder to justify digging in.

We had a prompt that worked beautifully at launch. It wasn’t changed. The model version shifted. User behavior changed. Expectations moved. The system stayed frozen in time while everything around it evolved.

AI features rot faster than CRUD because they’re coupled to things you don’t fully control: models, data distributions, and human trust. If no one owns quality over time, decay is guaranteed.
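
Ownership can start embarrassingly small. Here’s a sketch of a scheduled eval: a fixed set of prompts, some deliberately dumb checks, and a score you compare against last week’s. The expectations are hypothetical; the habit is the point.

```typescript
// A sketch of owning quality over time: a small fixed set of prompts with
// simple expectations, run on a schedule, compared against the last run.
// The checks are deliberately crude; the goal is noticing drift, not grading it.

type CallModel = (prompt: string) => Promise<string>;

interface EvalCase {
  prompt: string;
  mustContain: string[];   // crude assertions, hypothetical expectations
  maxLengthChars: number;  // catches the "suddenly very verbose" kind of drift
}

async function runEvals(callModel: CallModel, cases: EvalCase[]): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await callModel(c.prompt);
    const ok =
      c.mustContain.every((s) => output.toLowerCase().includes(s.toLowerCase())) &&
      output.length <= c.maxLengthChars;
    if (ok) passed++;
  }
  return passed / cases.length; // Alert if this drops below last week's score.
}
```

It won’t tell you why things got worse. It will tell you when, which is the part nobody notices otherwise.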

This is where “set it and forget it” quietly turns into “why don’t people use this anymore?”

A question every team should answer:

Who is responsible for noticing when this feature gets subtly worse?

If the answer is “we’ll know,” you probably won’t.


“Someone else will own it later”

This is the quietest assumption of all. And probably the most dangerous.

AI features often start life as experiments. Side projects. Enhancements. Someone hacks something together, it works surprisingly well, and suddenly it’s in production. Users rely on it. Sales demos include it. Support tickets mention it.

But ownership never really solidifies.

Who updates the prompt when it starts drifting?
Who watches cost when usage patterns change?
Who decides when an answer is wrong enough to be a problem?

We’ve all been in meetings where someone asks, “Who owns this?” and the room goes quiet for just a beat too long. Not because nobody cares, but because everyone assumed it would eventually become someone else’s responsibility.

AI makes this worse because failures are fuzzy. It’s not down. It’s not throwing errors. It’s just… not great sometimes. That’s hard to page someone for. Easy to ignore. Perfect conditions for slow decay.

We had an AI feature that quietly went from “nice differentiator” to “something people stopped trusting.” No single incident. No big outage. Just a steady erosion that nobody was explicitly accountable for catching.

AI systems don’t fail fast. They fail socially.

If an AI feature matters enough to ship, it matters enough to own. That means clear responsibility for quality, cost, and behavior over time, not just at launch.

One uncomfortable but necessary question:

When this gives a wrong or harmful answer, whose problem is it?

If the answer isn’t immediate, that’s not a process gap. That’s a risk.

The real final boss isn’t the model

If there’s one thing shipping AI has forced me to relearn, it’s this: AI doesn’t fail loudly. It fails politely.

No crashes. No alarms. No obvious “we messed up” moment.

Just slightly worse answers.
A little more hesitation from users.
A quiet loss of trust that never shows up as a single metric you can point at.

That’s what makes these assumptions so dangerous. Every one of them sounds reasonable in isolation. Every one of them feels like something you can “deal with later.” And every one of them, left unchecked, slowly turns an impressive AI feature into something people stop relying on.

The teams that struggle with AI aren’t usually the ones picking bad models. They’re the ones treating AI like a feature instead of a system. Something you ship once instead of something you operate. Something that’s “basically fine” instead of something that needs ongoing ownership.

Building AI like a final boss doesn’t mean overengineering everything. It means assuming the fight is longer than the demo. It means designing for failure, misuse, drift, and cost before users teach you the hard way. It means being honest about what you don’t know yet.

What’s coming next isn’t smarter models alone. It’s teams that get boring things right: UX, guardrails, ownership, and trust. The kind of stuff that never makes launch tweets but decides whether your AI feature survives six months later.

If you’ve been burned by one of these assumptions, you’re not behind. You’re just learning the real mechanics of the game.

Drop your worst one in the comments. I guarantee someone else is fighting the same boss.
