Marko Frei

Posted on Jun 5

The Limits of AI Models: What LLMs Still Can't Do (And Why)

#ai #llm #machinelearning #programming

It's easy to be either a hype believer or a reflexive cynic about AI models. The more useful position is the boring one in the middle: these tools are genuinely powerful and they have hard, well-understood limits. If you build with them, knowing exactly where they break is what separates a robust product from a demo that falls apart in front of a real user.

So here's a tour of the real limits of today's models, and — more importantly — why each one exists. The "why" is what lets you predict failures instead of being surprised by them.

1. They make things up (and can't tell when they have)

The most famous limit: hallucination. A model will state a fabricated fact, a fake citation, or a nonexistent API method with exactly the same fluent confidence it uses for correct answers.

Why: An LLM is trained to produce plausible continuations of text, not true ones. There's no internal fact-checker and no concept of "I don't actually know this." If a confident-sounding wrong answer is statistically likely given the prompt, the model generates it just as smoothly as a right one. Confidence and correctness are completely decoupled.

Working around it: Treat every factual claim as unverified. Use retrieval (RAG) to ground answers in real documents, ask for sources you can check, and never put a raw model output in front of a user where a confident lie causes real harm.

2. Their knowledge is frozen in time

A model only knows what was in its training data, up to a cutoff date. Ask about anything newer and it will either admit ignorance or, worse, confidently improvise.

Why: Training is a discrete, enormously expensive event. The model's "knowledge" is baked into its weights at that moment and doesn't update as the world changes.

Working around it: This is exactly why web search and retrieval pipelines exist — you inject fresh information into the prompt at runtime rather than relying on the frozen weights.

3. They have no memory between calls

Each API request is independent. The model doesn't remember your last conversation, your preferences, or what it told you five minutes ago. Chat apps fake continuity by resending the entire history with every message.

Why: The model is a stateless function. Its only "memory" is whatever text you put in the current prompt. Nothing persists on its side.

Working around it: Any persistence — user memory, conversation history, learned preferences — is something you have to build and re-inject. The model won't do it for you.

4. The context window is finite, and it degrades

Everything the model can "see" — your prompt, instructions, history, retrieved documents — has to fit inside a fixed token budget. And even within that budget, models don't use all of it equally well. Information buried in the middle of a long context tends to get less attention than material at the start or end (the "lost in the middle" effect).

Why: Attention operates over a bounded sequence, and its quality isn't uniform across very long inputs. More context is not automatically better context.

Working around it: Put the most important instructions and data near the beginning or end. Summarize or chunk long material rather than dumping it all in. Don't assume a giant context window means the model is actually reasoning over all of it equally.

5. They fall apart on long, multi-step tasks

A model can nail a single step and still fail a task that requires fifty steps in sequence — the classic "great at the demo, unreliable in the agent" problem.

Why: Errors compound. Even a small per-step error rate becomes a near-certain failure over a long chain. Worse, recent research highlights a self-conditioning effect: once a model has made a mistake earlier in its own output, it becomes more likely to make further mistakes, because it's now predicting tokens conditioned on its own flawed work. The longer the horizon, the worse it gets.

Working around it: Decompose big tasks into small, independently verifiable steps. Validate between steps. Keep a human or a deterministic checker in the loop for anything where a wrong step silently corrupts everything downstream.

6. "Reasoning" is more brittle than it looks

Models can produce genuinely impressive chains of reasoning — and then fail a logically identical problem because you changed the names or numbers. They're sensitive to surface phrasing in ways a person who truly understood the problem wouldn't be.

Why: A lot of what looks like reasoning is sophisticated pattern-matching over things the model has seen. When a problem is close to its training distribution, it shines. Push it genuinely outside that distribution and the apparent reasoning can collapse. Newer "thinking" models that spend more compute at inference time help here, but they don't erase the underlying brittleness.

Working around it: Don't trust reasoning on novel or high-stakes problems without verification. Give the model the structure (the plan, the constraints) rather than hoping it invents reliable logic from scratch.

7. No grounding in the physical world

Models learned everything from text and images, not from living in the world. They have no senses, no body, and no real causal model of physics. They can describe how to ride a bike perfectly and have zero idea what balancing actually feels like.

Why: There's no embodiment and no real-world feedback loop in training. The model knows how people write about the world, which is not the same as understanding the world.

Working around it: Be cautious anywhere true physical-world reasoning, causality, or common-sense grounding matters. Text fluency about a domain is not competence in it.

8. They inherit the biases and gaps of their data

A model reflects its training data — including that data's biases, blind spots, and overrepresented viewpoints. It can quietly encode stereotypes or be systematically weaker on topics, languages, and cultures that were underrepresented online.

Why: The model is a compression of its corpus. Whatever skews exist in the data tend to show up, sometimes amplified, in the output.

Working around it: Test across diverse cases, don't assume neutrality, and be especially careful using these systems for consequential decisions about people.

9. Brute-force scaling is hitting diminishing returns

For years the recipe was "make it bigger." That's slowing. Frontier models show smaller gains on key benchmarks despite enormous increases in training budget, high-quality training text is becoming scarce, and the compute and energy costs are staggering.

Why: Scaling laws give diminishing returns — each doubling of resources buys a smaller improvement — and you eventually run low on both fresh high-quality data and economically sane amounts of compute.

Working around it: As a builder, this is mostly good news: the action is shifting toward better techniques, smaller specialized models, fine-tuning, and clever inference-time methods. You rarely need the biggest model; you need the right one for the job.

10. We can't fully explain what they're doing

Even the people who build these models can't fully explain why a given output appeared. The reasoning lives in billions of opaque numbers, and interpretability research, while advancing, is far from giving us a clear account.

Why: The behavior is emergent from training, not designed rule-by-rule. There's no readable program inside to inspect.

Working around it: Don't treat the model as an auditable decision-maker in domains that require explainability. If you need to justify why a decision was made, the model's confident-sounding explanation is itself just generated text, not a true trace of its reasoning.

The actual takeaway

None of this means AI models aren't useful — they obviously are. It means they're a specific kind of tool with a specific failure surface. The mental model that keeps you out of trouble:

Trust them for drafting, transforming, summarizing, and exploring, where a human reviews the output.
Distrust them for ground truth, long autonomous chains, novel reasoning, and explainable decisions, unless you've built verification around them. The engineers who build great things with AI aren't the ones who think it's magic or the ones who think it's useless. They're the ones who know precisely where the edges are and design around them.

Which of these limits has bitten you hardest in a real project? I'm especially curious about war stories from people building agents, since that's where so many of these failures stack up at once.

DEV Community