DEV Community

Juan M. Altamirano for Cloud(x);

What Backend Engineers Get Wrong About AI Integration

So you've been asked to "add AI" to your project. Maybe it's a chatbot, maybe it's a smart search, maybe it's just your PM watching too many YouTube videos about agents. Either way, here you are.

And look, as backend engineers, we're actually well-positioned for this. We know how to design APIs, handle async flows, think about failure modes. The problem is that LLMs don't behave like anything we've integrated before, and our instincts sometimes work against us.

Below are eight common mistakes I've seen (and made myself).


1. Treating the LLM like a deterministic function

This is the big one.

We're used to thinking in terms of: same input β†’ same output. Call a database, get a row. Call an API, get a JSON response. Write a unit test, pin the behavior forever.

LLMs don't work like that. Under the hood, they're predicting the next most probable token based on everything that came before, which means there's inherent randomness added into every response. Same prompt, different temperature, different day, and you might get a slightly different answer. Or a wildly different one.

Think of it less like calling a function and more like asking a very smart contractor to do some work for you. They'll usually get it right, but you need to review the output, define what "right" looks like, and have a plan for when they hand you something unexpected.

The practical consequence: don't build flows that assume the model always returns what you expect. Validate the output. Have fallbacks. Don't let a malformed response crash your service.
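As a minimal sketch of that "validate, then fall back" idea (the `safe_parse` helper and the `summary` field are illustrative, not from any particular SDK):

```python
import json

def safe_parse(raw: str, fallback: dict) -> dict:
    """Parse an LLM response defensively: validate before trusting it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    # Check the fields we actually depend on before using them.
    if not isinstance(data, dict) or "summary" not in data:
        return fallback
    return data

# A well-formed response passes through...
ok = safe_parse('{"summary": "all good"}', fallback={"summary": "unavailable"})
# ...and a malformed one falls back instead of crashing the service.
bad = safe_parse("Sure! Here's your JSON:", fallback={"summary": "unavailable"})
```

The key point is that the fallback path is designed up front, not bolted on after the first production incident.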


2. Not handling retries and timeouts

If you call an external REST API and it times out, you probably have retry logic. Maybe exponential backoff. Maybe a circuit breaker.

For some reason, when we start calling LLM APIs, all of that discipline disappears.

LLM calls are slow. We're talking seconds, sometimes tens of seconds, depending on the model and the response length. And they fail: rate limits, provider outages, network issues all happen.

Set a timeout. Retry with backoff. If you're building something user-facing, stream the response so it doesn't feel like the app is frozen. Treat the LLM provider like what it is: an external dependency that can and will go down.
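Here's a rough sketch of that discipline applied to an LLM call. `call_with_retries` and `flaky_llm` are made-up names for illustration; in real code `fn` would wrap your provider's SDK call:

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.5, timeout=30.0):
    """Call a flaky external dependency with a timeout and exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff plus jitter so retries don't stampede.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Simulated provider: times out twice, then succeeds on the third try.
attempts = {"n": 0}
def flaky_llm(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("provider timed out")
    return "ok"

result = call_with_retries(flaky_llm, base_delay=0.05)
```

Nothing here is LLM-specific, and that's the point: the same patterns you'd use for any unreliable upstream apply unchanged.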


3. Ignoring token costs in loops

Token costs have a way of sneaking up on you.

A common pattern that seems harmless: you have a list of items, you loop over them, you call the LLM for each one. 50 items? 50 calls. Each one dragging along the same massive system prompt like dead weight. The model isn't doing 50x the thinking. You're just paying 50x for the setup.

It's like printing the entire company handbook before every meeting just to discuss one item.

Some things to think about:

  • Can you batch items into a single prompt? Sometimes yes, sometimes the quality drops.
  • Is the system prompt really 2000 tokens? Can it be 400?
  • Are you sending context the model doesn't need for that specific call?
  • Could you cache responses for identical or near-identical inputs?

Token costs are easy to ignore in development (small dataset, free tier) and painful to discover in production.
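The caching idea from the list above can be as simple as memoizing on the input. This is a toy sketch where `call_llm` is a stub standing in for a real API call, so we can count how many calls the loop actually makes:

```python
from functools import lru_cache

calls = {"n": 0}

def call_llm(prompt: str) -> str:
    """Stub for a real LLM API call; counts invocations for illustration."""
    calls["n"] += 1
    return f"summary of: {prompt}"

@lru_cache(maxsize=1024)
def cached_llm(prompt: str) -> str:
    return call_llm(prompt)

items = ["invoice 1", "invoice 2", "invoice 1", "invoice 2", "invoice 1"]
results = [cached_llm(item) for item in items]
# Five loop iterations, but only two real calls: duplicates hit the cache.
```

In production you'd likely want a shared cache (Redis, say) rather than an in-process `lru_cache`, but the cost math is the same.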


4. Breaking prompt caching without realizing it

This one is especially relevant if you're building agents.

Most providers cache prompts automatically and give you significant discounts when a request matches a previously seen prefix. In an agent loop where you're continuously resending the system prompt, full conversation history and tool calls, that cache is what keeps your bill reasonable.

The catch: think of it like a prefix match. The moment something differs from the previous call, the cache stops there and you pay full price from that point on.

A classic example is injecting the current timestamp into your system prompt. Seems harmless, but since it changes on every call, nothing after it ever gets cached, and you're paying full price every time.

Treat everything before the dynamic part of your prompt as sacred. Same order, same content, same whitespace. Push dynamic context as far down the message history as possible, after the static parts that can be cached.

This leads to decisions that look counterintuitive: sometimes you deliberately send more tokens just to keep the cacheable prefix intact. A 2000-token cached call is cheaper than a 1500-token cache miss. Once you internalize that, it changes how you structure your payloads entirely.
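A sketch of what cache-friendly message ordering looks like in practice. The prompt text and `build_messages` helper are invented for illustration; the structure is what matters:

```python
# Static, cacheable prefix: identical bytes on every call.
SYSTEM_PROMPT = (
    "You are a support assistant for Acme Corp. "
    "Answer using the provided context only."
)

def build_messages(history: list[dict], user_query: str, now: str) -> list[dict]:
    """Keep the static prefix byte-identical; push dynamic data to the end."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # cache hit on every call
        *history,                                      # grows append-only
        # Dynamic values (like the timestamp) go last, so they only
        # invalidate the tail of the prompt, never the cached prefix.
        {"role": "user", "content": f"[{now}] {user_query}"},
    ]

msgs = build_messages([], "Where is my order?", now="2024-06-01T12:00:00Z")
```

Had the timestamp gone into `SYSTEM_PROMPT` instead, every call would be a cache miss from token one.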


5. Treating prompt engineering as someone else's problem

"The AI team handles the prompts."

If you're building the integration, you need to understand how prompts work. Not at a research level, but enough to know why your endpoint is returning garbage.

The prompt is part of your application logic. It deserves the same attention as your SQL queries or your request validation. A bad prompt will produce bad output no matter how clean your surrounding code is.

At minimum, know the difference between system and user messages, understand what temperature does to your outputs, and learn when to use structured outputs instead of trying to parse free-form text. Which brings me to...


6. Trusting free-form text when you don’t have to

I've seen code like this in the wild:

response = llm.complete("Analyze this and return: risk level, explanation, recommendation")
lines = response.split("\n")
risk = lines[0].split(":")[1].strip()

Please don't do this.

Modern LLM APIs support structured outputs β€” you define a schema, the model returns valid JSON that matches it. Every time. No parsing, no IndexError because the model decided to add an extra line.

In Python, pair it with Pydantic. In TypeScript, use Zod for validation and parsing. In Go, define a struct and unmarshal directly. Your future self will thank you.

from typing import Literal
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    risk_level: Literal["HIGH", "MEDIUM", "LOW", "NONE"]
    explanation: str
    recommendation: str

result = llm.complete_structured(prompt, schema=AnalysisResult)
# result.risk_level is always there. Always a string. Always one of the four values.

7. Putting everything in the context window and calling it RAG

RAG (Retrieval Augmented Generation) is genuinely useful. The idea is simple: instead of stuffing all your knowledge into the prompt, you retrieve only the relevant bits and include those.

But a lot of implementations I've seen are just... dumping documents into the prompt and calling it RAG.

Real RAG has a retrieval step. You embed the user's query, find the most semantically similar chunks from your knowledge base, and only send those. If you're sending 50 pages of documentation on every call, you're not doing RAG, you're doing expensive copy-paste.

The retrieval quality matters as much as the generation. Bad retrieval means irrelevant context, and irrelevant context means bad answers (no matter how good your model is).
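To make the retrieval step concrete, here's a deliberately tiny sketch with hand-written three-dimensional embeddings. Real systems use an embedding model and a vector store; the chunks, vectors, and `retrieve` helper below are all invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy knowledge base: (chunk, embedding) pairs.
kb = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.",      [0.1, 0.9, 0.0]),
    ("Shipping to Europe takes 3-7 days.",            [0.0, 0.2, 0.9]),
]

def retrieve(query_embedding, top_k=1):
    """Send only the most relevant chunks, not the whole knowledge base."""
    ranked = sorted(kb, key=lambda c: cosine(query_embedding, c[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# A query embedded near the "refunds" region pulls only the refund chunk.
context = retrieve([0.85, 0.15, 0.05])
```

The retrieval step is the whole game: the prompt that reaches the model contains one relevant sentence instead of the entire knowledge base.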


8. Not thinking about what happens when the model is wrong

LLMs hallucinate. They confidently state things that are false. This isn't a bug that will get patched, it's just how these models work.

So the question isn't "what if the model is wrong?"... it's "what happens in my system when the model is wrong?"

If you're using an LLM to triage support tickets, a wrong answer means someone spends extra time investigating. Annoying, not catastrophic. But if you're using it to generate financial reports or medical summaries, wrong answers have a very different impact.

Design your system with the assumption that the model will sometimes be wrong. Add human review steps where it matters. Log the inputs and outputs so you can audit. Don't pipe raw LLM output into anything irreversible.
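One way to sketch that "human review where it matters" step. The `risk_level` and `confidence` fields and the routing rules are assumptions for illustration; the real thresholds depend on what's at stake in your system:

```python
def handle_model_output(result: dict, confidence_threshold: float = 0.8) -> dict:
    """Route high-stakes or low-confidence outputs to a human instead of acting."""
    high_risk = result.get("risk_level") == "HIGH"
    low_confidence = result.get("confidence", 0.0) < confidence_threshold
    if high_risk or low_confidence:
        return {"action": "queue_for_human_review", "payload": result}
    return {"action": "auto_apply", "payload": result}

# Confident, low-risk output flows through automatically...
auto = handle_model_output({"risk_level": "LOW", "confidence": 0.95})
# ...while a high-risk result is held for a human, never piped straight downstream.
held = handle_model_output({"risk_level": "HIGH", "confidence": 0.99})
```

Note that the gate sits between the model and anything irreversible, which also gives you a natural place to log inputs and outputs for auditing.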


Wrapping up

None of this means LLMs aren't useful, they absolutely are. But they're a new kind of dependency with failure modes we're not used to. The good news is that most of the solid engineering practices we already have (validation, retries, observability, defensive design) apply here too. We just need to remember to actually use them.

One last thing: this space moves fast. What's a workaround today might be a built-in feature tomorrow. If you're already the kind of engineer who keeps up with tech, just make sure AI topics are in the mix. It's worth it.

If you've run into other footguns that aren't on this list, drop them in the comments. I'm curious what patterns other backend engineers have hit.
