Series: AI Isn’t an Engineering Problem Anymore (Part 6)
It’s a cost problem—and most teams don’t realize it yet.
In the last post, I talked about why simple caching doesn’t solve most LLM inefficiency.
Caching works well for:
exact prompts
repeated workflows
deterministic pipelines
But that’s not how humans actually use LLMs.
Humans don’t think in exact matches.
We think in:
approximations
iterations
refinements
revisits
And that changes everything.
The problem with exact matching
Traditional caching systems work by asking:
“Have I seen this exact request before?”
That works great for:
static assets
repeated queries
deterministic systems
But LLM usage is rarely exact.
You ask:
“Why is this deployment failing?”
Then later:
“Could this Railway deployment issue be related to env validation?”
Different prompts.
Very similar intent.
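Here's a minimal sketch of why an exact-match cache misses that. The key scheme and prompts are just illustrative:

```python
import hashlib

cache = {}

def cache_key(prompt: str) -> str:
    # Exact-match caching: the key is a hash of the literal prompt text.
    return hashlib.sha256(prompt.encode()).hexdigest()

def lookup(prompt: str):
    return cache.get(cache_key(prompt))

# First request: computed once, then stored.
cache[cache_key("Why is this deployment failing?")] = "…model response…"

# Later request: same intent, different wording -> different hash -> cache miss.
print(lookup("Could this Railway deployment issue be related to env validation?"))  # None
```

Same underlying question, but the hash changes, so the cache sees a brand-new request.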
Humans think in similarity, not identity
This is the part I think most systems still struggle with.
When humans solve problems, we rarely:
repeat the exact same sentence
follow the exact same reasoning path
or preserve identical context
Instead we:
circle around ideas
refine wording
revisit earlier thoughts
approach the same problem from different angles
From a human perspective:
these are connected
From most systems’ perspectives:
they are completely unrelated requests
Why this matters
Because if the system only recognizes:
exact matches
Then most real-world repetition gets missed.
And that means:
recomputation continues
context keeps growing
costs keep compounding
A pattern I kept noticing
One thing I started noticing while working across different systems was this:
The model would often regenerate reasoning I had effectively already paid for.
Not identical wording.
But the same underlying explanation:
same deployment issue
same architectural tradeoff
same debugging pattern
same reasoning path
Just wrapped in a slightly different context.
The hidden inefficiency
This creates a strange situation.
The user feels like:
they are progressing through a problem
But the system may actually be:
repeatedly recomputing overlapping reasoning
That overlap is hard to notice because:
humans naturally think iteratively
conversations evolve gradually
context changes slightly every step
Why similarity is difficult
Similarity sounds easy conceptually.
It isn’t.
Because similarity is:
contextual
probabilistic
semantic
constantly shifting
Two prompts can:
look different syntactically
but mean almost the same thing
Or:
look similar
but require completely different reasoning
That makes reuse much harder than traditional caching.
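To make that concrete, here's a rough sketch of comparing prompts by meaning instead of by text. I'm using sentence-transformers as one example embedding model; any embedding API would do, and the exact scores you get will vary:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Different wording, similar intent.
a, b = model.encode([
    "Why is this deployment failing?",
    "Could this Railway deployment issue be related to env validation?",
])
print(cosine(a, b))  # likely fairly high, even though the strings share few words

# Similar wording, different intent.
c, d = model.encode([
    "Why is this deployment failing?",
    "Why is this test failing?",
])
print(cosine(c, d))  # can also score high, which is exactly what makes reuse risky
```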
The deeper issue
At this point, the problem stops being:
“Can we store responses?”
And becomes:
“Can we recognize when work is fundamentally overlapping?”
That is a very different system problem.
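One hedged sketch of what that could look like: a cache keyed by embeddings rather than strings, reusing an answer only when similarity clears some threshold. The 0.85 threshold and the model choice are assumptions, not recommendations:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, threshold: float = 0.85):  # threshold is a guess, needs tuning
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, stored response)

    def store(self, prompt: str, response: str):
        self.entries.append((model.encode(prompt), response))

    def lookup(self, prompt: str):
        if not self.entries:
            return None
        query = model.encode(prompt)
        vec, response = max(self.entries, key=lambda e: cosine(e[0], query))
        # Reuse only if the best match is "close enough"; otherwise treat as new work.
        return response if cosine(vec, query) >= self.threshold else None
```

The lookup itself is easy. The hard part is deciding when "close enough" is actually safe to reuse.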
Context makes this worse
Long threads amplify the issue even more.
Because each new message carries:
previous reasoning
prior attempts
accumulated context
So even when the new information is small:
the system still processes the entire growing state around it
That’s one reason costs quietly compound over time.
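A back-of-the-envelope sketch of how that adds up. Every number here is invented:

```python
# Each turn re-sends the whole history, so input tokens grow roughly
# quadratically over a conversation.
tokens_per_message = 300   # assumed average size of one exchange
turns = 30

total_input_tokens = 0
history = 0
for turn in range(turns):
    total_input_tokens += history + tokens_per_message  # prior context + the new message
    history += tokens_per_message                        # history keeps growing

print(total_input_tokens)  # 139,500 tokens processed, though only 9,000 of them are new
```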
This changes how I think about memory
A lot of current AI workflows treat memory like:
“store more context”
But I’m starting to think the more important question is:
“What context actually needs to survive?”
Because not all memory is useful.
Some memory:
reinforces direction
Other memory:
introduces drift
redundancy
noise
repeated reasoning loops
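Here's a deliberately naive sketch of what "what context needs to survive" might mean in code: score stored items against the current task and keep only the most relevant ones. Word overlap stands in for a real relevance model, and the cutoff is arbitrary:

```python
# Rank stored context items by relevance to the current task and keep the top few.
def relevance(item: str, task: str) -> float:
    a, b = set(item.lower().split()), set(task.lower().split())
    return len(a & b) / max(len(a | b), 1)

def prune_context(items: list[str], task: str, keep: int = 3) -> list[str]:
    ranked = sorted(items, key=lambda i: relevance(i, task), reverse=True)
    return ranked[:keep]  # everything else is treated as drift, noise, or redundancy

history = [
    "Debugged a Railway deployment failure caused by env validation",
    "Discussed tab vs space preferences",
    "Chose Postgres over SQLite for the deployment",
    "Explained the same env validation issue again in different words",
]
print(prune_context(history, "Why is the Railway deployment failing?"))
```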
The systems that win
I think the systems that eventually win won’t just be:
the smartest models
or the largest context windows
They’ll be the systems that:
understand relevance
recognize overlap
manage context intelligently
and avoid unnecessary recomputation
What I’m trying to understand
At this point, the question I keep coming back to is:
How much intelligence is actually new computation?
Humans are dynamic in behavior and thought.
We:
revisit ideas
refine reasoning
circle back
change direction constantly
But systems tend toward determinism, because the more deterministic a system is,
the easier it becomes to:
optimize
predict
cache
and run efficiently
So the tension is:
How intelligent can AI become while still remaining efficient to run?
Because intelligence may naturally resist exact repetition, while efficiency depends on it.
What I’ll explore next
In the next post, I’ll go deeper into this idea:
why bigger context windows alone may not actually solve the problem
👉 Part 5 is here: https://dev.to/joshua_chukwu_ccb92f05a94/i-tried-caching-llm-responses-it-didnt-work-the-way-i-expected-aj
Closing thought
Humans don’t think in exact matches.
We think in related ideas, partial overlaps, revisits, and refinements.
But most LLM systems still treat those interactions as completely new work.
And I think that gap is where a lot of the hidden inefficiency lives.