DEV Community: Joshua Chukwu

What’s actually missing in most AI stacks

Joshua Chukwu — Tue, 19 May 2026 13:44:21 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 9)
It’s a cost problem—and most teams don’t realize it yet.

Over the last few posts, I talked a lot about the differences between how humans process information and how systems process information.

For example:

When humans take a multiple-choice quiz and several answers look similar, we naturally slow down.

We:

scan carefully
compare patterns
revisit assumptions
and spend more time deciding what is actually relevant

And when there’s a timer involved, we suddenly become aware of the actual cost of processing information:

TIME.

Systems process information differently, albeit much faster.

But the underlying tradeoff still exists:

more context
more comparisons
more ambiguity
more computation

Which raises an interesting question:

should we feed systems information the same way humans naturally think about problems?

Or should we instead optimize workflows around the strengths and limitations of the systems themselves?

The more I think about it, the more I feel like most AI stacks today are missing an entire layer.

Not:

smarter models
larger context windows
or more API access

A different layer entirely.

The current AI stack is incomplete

Right now, many AI workflows still look like this:

User → Prompt → Model → Response

At small scale, this works perfectly fine.

But once usage compounds across:

teams
organizations
agents
workflows
and long-running projects

the cracks start appearing.

Because the system has very little understanding of:

reuse
coordination
attribution
cost efficiency
workflow overlap
or memory lifecycle management

What’s missing

I think most AI systems are currently missing:

an operational intelligence layer.

Something sitting between:

humans and raw model inference

A layer responsible for:

orchestration
routing
observability
context optimization
memory lifecycle management
intelligent reuse
governance
and compute efficiency

Not to replace the models.

But to make large-scale AI usage sustainable.

Right now, most systems are reactive

Most current workflows only react after:

costs spike
limits get hit
latency grows
context becomes bloated
or workflows become chaotic

But by then:

the inefficiency has already compounded.

The cloud parallel keeps showing up

Cloud computing followed a similar pattern.

At first:

compute availability was the breakthrough.

Later:

orchestration mattered
observability mattered
governance mattered
cost attribution mattered
optimization mattered

I think AI is entering a similar phase now.

Intelligence alone does not create efficiency

This is the part I keep coming back to.

A smarter model does not automatically solve:

duplicated reasoning
overlapping workflows
repeated context
unnecessary inference
or organizational inefficiency

Those problems exist above the model layer.

The dangerous illusion

Larger context windows and more capable models can sometimes create the illusion that:

scaling problems are solved.

But in many cases:

the system may simply be brute-forcing more compute through increasingly messy workflows.

That works temporarily.

Until scale compounds.

What organizations will eventually ask

I think organizations will increasingly start asking questions like:

Where is our AI spend actually going?
Which workflows are inefficient?
Which teams generate the most repeated reasoning?
What context should persist?
What should expire?
What should never hit the model at all?
What work are we recomputing unnecessarily?

Those are operational questions.

Not purely model questions.

The next phase of AI infrastructure

The first phase of AI was:

access to intelligence.

The next phase may become:

efficient coordination of intelligence.

And I think that changes the infrastructure conversation completely.

My opinion

I think the companies that win long term won’t necessarily be:

the companies generating the most tokens

But the companies that best understand:

orchestration
reuse
observability
memory management
and intelligent compute allocation

What I’ll explore next

In the final post of this series, I’ll talk about the conclusion this entire journey eventually led me to:

The current setup and workflow I use to manage AI context, repeated reasoning, memory drift, and operational inefficiency across long-running projects.

👉 Part 8 is here: https://dev.to/joshua_chukwu_ccb92f05a94/why-you-cant-just-cache-everything-privacy-safety-and-reality-5943

Closing thought

Most AI conversations today focus on:

model intelligence.

But I think the harder long-term problem may become:

operational intelligence.

Why you can’t just cache everything (privacy, safety, and reality)

Joshua Chukwu — Mon, 18 May 2026 12:05:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 8)
It’s a cost problem—and most teams don’t realize it yet.

In the last few posts, I talked a lot about:
repeated reasoning
workflow duplication
context growth
reuse
and AI control planes.

At first glance, the solution feels obvious:
“Why not just cache everything?”
If organizations repeatedly ask similar questions, why recompute the same reasoning over and over again?
The idea sounds simple.
Reality is not.

The hidden assumption

A lot of discussions around AI optimization assume:
every request is safely reusable.
But once AI moves from:
hobby projects
to:
organizations
enterprises
production workflows
internal tooling
the problem changes completely.
Because now the system starts interacting with:
private codebases
internal documents
customer data
financial information
credentials
legal discussions
deployment infrastructure
and operational decisions
That changes the optimization equation immediately.

The trust boundary problem

This is where things become difficult.
Two prompts may look semantically similar, or even identical:
“Why is my deployment failing?”

But underneath, the contexts may contain:
completely different infrastructure
different secrets
different environments
different permissions
different organizations
Which means:
Similarity alone is not enough.
A system cannot blindly reuse reasoning across trust boundaries.
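
To make the trust boundary concrete, here is a minimal sketch of a cache whose key includes who is asking, not just what was asked. The `Scope` fields (`org_id`, `environment`), the `ScopedCache` class, and the organization names are all hypothetical, chosen only for illustration. A real system would likely need more dimensions, such as permissions or data sensitivity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical trust boundary: which organization and environment a request belongs to."""
    org_id: str
    environment: str


class ScopedCache:
    """Toy cache keyed by (scope, prompt), so entries never cross a scope."""

    def __init__(self):
        self._store = {}

    def get(self, scope, prompt):
        # A lookup only succeeds when both the scope and the prompt match.
        return self._store.get((scope, prompt))

    def put(self, scope, prompt, response):
        self._store[(scope, prompt)] = response


cache = ScopedCache()
acme_prod = Scope(org_id="acme", environment="production")
globex_prod = Scope(org_id="globex", environment="production")
prompt = "Why is my deployment failing?"

cache.put(acme_prod, prompt, "acme-specific diagnosis")

print(cache.get(acme_prod, prompt))    # same scope: the stored answer is returned
print(cache.get(globex_prod, prompt))  # different organization: nothing is reused
```

The prompt string is identical in both lookups. Only the scope decides whether reuse is allowed.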

The dangerous version of optimization

This is the part I think many people underestimate.
Aggressive optimization without governance can quietly become:
a privacy problem
a security problem
or a trust problem
Especially once:
organizations
teams
or multiple users
share the same AI infrastructure layer.
Because now the system must answer questions like:
What can safely be reused?
What should remain isolated?
Which contexts are sensitive?
Who owns the generated reasoning?
What should expire?
What should never persist at all?
Those are not just engineering problems anymore.
They become:
organizational
legal
operational
and ethical problems.

Why enterprise AI becomes harder

At small scale, people mostly think about:
“making the model smarter.”
At organizational scale, companies start worrying about:
observability
compliance
governance
attribution
auditability
and trust boundaries
Which is why scaling AI usage inside organizations becomes much more complicated than:
“Just increase the context window.”

Human behavior complicates this further

Humans are messy.
We:
paste sensitive logs
include unnecessary context
reuse prompts carelessly
carry old information forward
and mix unrelated workflows together constantly
That means optimization systems cannot simply assume:
more memory = better.
Sometimes:
more memory increases risk.

The difficult tradeoff

This creates a difficult system tension.
Organizations want:
lower cost
faster workflows
more reuse
less recomputation
But they also need:
isolation
privacy
security
and trust
And those goals often push against each other.

My Opinion

I believe that the long-term winners in AI infrastructure won’t just be:
the companies with the largest models
or the cheapest inference
But the companies that best understand:
orchestration
trust boundaries
memory lifecycle
governance
and intelligent reuse under constraints

Optimization without control becomes dangerous

One thing cloud infrastructure taught us is this:
efficiency without governance eventually creates chaos.
I think AI infrastructure may be heading toward the same lesson.

What I’ll explore next

In the next post, I’ll summarize what I think is currently missing across most AI stacks:
the missing layer between model intelligence and operational efficiency.

👉 Part 7 is here: https://dev.to/joshua_chukwu_ccb92f05a94/why-ai-products-need-a-control-plane-not-just-api-calls-26ne?comments_sort=top#toggle-comments-sort-dropdown

Closing thought

The challenge is no longer:
“Can AI generate useful responses?”
The harder challenge may become:
“How do we optimize intelligence without breaking trust?”

Why AI products need a “control plane” (not just API calls)

Joshua Chukwu — Fri, 15 May 2026 12:00:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 7)
It’s a cost problem—and most teams don’t realize it yet.

What is a Control Plane?

Imagine a world with hundreds of thousands of planes in the sky, but no airports, no central communication systems, and no coordination for routing, landing, or takeoff scheduling.
Sounds chaotic, right?
Air traffic control acts as a control plane.
The planes do the actual work, but the control systems coordinate:
routing
scheduling
communication
and safety
Now let’s bring this back to AI.

Most AI products today are surprisingly simple underneath.
At their core, many of them are essentially:
User input → LLM API call → Response

At a small scale, this works perfectly fine.
But as AI adoption grows, raw API calls alone stop being enough.

The hidden assumption

A lot of current AI workflows assume:
“Since the model is smart, the system will scale.”
But intelligence alone does not solve:
cost visibility
repeated reasoning
workflow duplication
routing
attribution
memory growth
governance
or context management
Those are infrastructural problems.

What happens inside organizations

Most businesses already use AI to some extent.
Maybe not officially at the organizational level yet, but when:
developers use Codex
teams use ChatGPT
support uses Claude
designers use AI generation tools
employees automate tasks independently
then the organization is already relying on AI operationally.
The problem is:
Most of this usage is happening without coordination.

The duplication problem scales quietly

At small scale:
repeated prompts
retries
overlapping reasoning
duplicate workflows
feel harmless.
At an organizational scale, they compound.
Different people may:
solve the same issue repeatedly
regenerate similar reasoning
feed the same context multiple times
independently rediscover the same solution paths
Without realizing it.
This starts looking less like “usage”
And more like:
distributed inefficiency.

Why APIs alone are insufficient

An API call only answers:
“Can the model respond?”
It does not answer:
Have we solved something similar before?
Should this request even hit the model?
Is this the optimal model for this task?
Is this repeated work?
Is this context unnecessarily large?
Which team is driving the cost?
Which workflows are inefficient?
What should persist?
What should expire?
Those questions exist above the model layer.

This is where a control plane becomes important

AI systems eventually need something closer to a CONTROL PLANE, not just raw model access.
A layer responsible for:
routing
caching
attribution
observability
governance
context optimization
and intelligent reuse
Because once AI becomes operational infrastructure, organizations eventually need visibility into:
where compute is going
why it is being consumed
and whether the work being performed is actually necessary
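
As a rough illustration of what such a layer could look like, here is a minimal sketch that sits in front of the model call and handles three of the responsibilities above: reuse, routing, and attribution. Everything here is an assumption made for the example: the `ControlPlane` class, the word-count routing rule, the team names, and the lambda functions standing in for real model calls.

```python
from collections import defaultdict


class ControlPlane:
    """Toy layer in front of model calls: reuse, routing, and attribution."""

    def __init__(self, models, small_prompt_words=20):
        self.models = models                      # model name -> callable(prompt) -> str
        self.small_prompt_words = small_prompt_words
        self.cache = {}
        self.calls_by_team = defaultdict(int)     # requests that reached a model
        self.reused_by_team = defaultdict(int)    # requests answered from prior work

    def route(self, prompt):
        # Hypothetical routing rule: short prompts go to the cheaper model.
        if len(prompt.split()) <= self.small_prompt_words:
            return "small"
        return "large"

    def handle(self, team, prompt):
        # Reuse is scoped per team here. Whether reuse may cross teams
        # is a governance decision, not something this sketch settles.
        key = (team, prompt)
        if key in self.cache:
            self.reused_by_team[team] += 1
            return self.cache[key]
        model_name = self.route(prompt)
        self.calls_by_team[team] += 1
        response = self.models[model_name](prompt)
        self.cache[key] = response
        return response


# Stand-ins for real model calls.
plane = ControlPlane({
    "small": lambda p: f"[small] {p}",
    "large": lambda p: f"[large] {p}",
})

plane.handle("platform", "Why is my deployment failing?")
plane.handle("platform", "Why is my deployment failing?")  # repeated work, reused
plane.handle("support", "Why is my deployment failing?")   # different team, recomputed

print(dict(plane.calls_by_team))
print(dict(plane.reused_by_team))
```

Even this toy version can answer two of the questions a raw API call cannot: which team is driving the calls, and how much of the work was repeated.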

The cloud parallel

Cloud computing went through something similar.
At first:
spinning up compute felt magical.
Then eventually organizations realized:
compute sprawl exists
waste compounds
visibility matters
governance matters
optimization matters
I think AI is approaching a similar phase.

The difficult part

We have previously shown that human interaction with AI is inherently messy.
Humans:
revisit ideas
refine prompts
change direction
retry tasks
explore uncertainty
So unlike traditional deterministic systems, the boundaries between:
“new work”
and
“repeated work”
become much harder to detect.

Why this matters

Because if organizations do not eventually solve:
reuse
coordination
attribution
and context efficiency
then scaling AI usage may become significantly more expensive than most people currently expect.
Especially for:
startups
vibecoders
small engineering teams
and companies heavily integrating AI into daily workflows

My Opinion

I think the companies that win long term won’t just be the companies with the best models,
But the companies that best understand:
orchestration
memory management
workflow efficiency
and intelligent compute allocation

What I’ll explore next

In the next post, I’ll talk about something that makes this even harder:
why you can’t simply cache everything
Especially once:
privacy
enterprise trust
and sensitive data
enter the picture.

👉 Part 6 is here: (https://dev.to/joshua_chukwu_ccb92f05a94/why-similarity-matters-more-than-exact-matches-in-llm-systems-46pa)

Closing thought

The first phase of AI adoption was:
“Can we make this work?”
The next phase may become:
“Can we make this scale efficiently?”

Why similarity matters more than exact matches in LLM systems

Joshua Chukwu — Wed, 13 May 2026 12:00:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 6)
It’s a cost problem—and most teams don’t realize it yet.

In the last post, I talked about why simple caching doesn’t solve most LLM inefficiency.
Caching works well for:
exact prompts
repeated workflows
deterministic pipelines
But that’s not how humans actually use LLMs.
Humans don’t think in exact matches.
We think in:
approximations
iterations
refinements
revisits
And that changes everything.

The problem with exact matching

Traditional caching systems work by asking:
“Have I seen this exact request before?”
That works great for:
static assets
repeated queries
deterministic systems
But LLM usage is rarely exact.
You ask:
“Why is this deployment failing?”
Then later:
“Could this Railway deployment issue be related to env validation?”
Different prompts.
Very similar intent.

Humans think in similarity, not identity

This is the part I think most systems still struggle with.
When humans solve problems, we rarely:
repeat the exact same sentence
follow the exact same reasoning path
or preserve identical context
Instead we:
circle around ideas
refine wording
revisit earlier thoughts
approach the same problem from different angles
From a human perspective:
these are connected
From most systems’ perspectives:
they are completely unrelated requests

Why this matters

Because if the system only recognizes:
exact matches
Then most real-world repetition gets missed.
And that means:
recomputation continues
context keeps growing
costs keep compounding

A pattern I kept noticing

One thing I started noticing while working across different systems was this:
The model would often regenerate reasoning I had effectively already paid for before.
Not identical wording.
But the same underlying explanation:
same deployment issue
same architectural tradeoff
same debugging pattern
same reasoning path
Just wrapped in a slightly different context.

The hidden inefficiency

This creates a strange situation.
The user feels like:
they are progressing through a problem
But the system may actually be:
repeatedly recomputing overlapping reasoning
That overlap is hard to notice because:
humans naturally think iteratively
conversations evolve gradually
context changes slightly every step

Why similarity is difficult

Similarity sounds easy conceptually.
It isn’t.
Because similarity is:
contextual
probabilistic
semantic
constantly shifting
Two prompts can:
look different syntactically
but mean almost the same thing
Or:
look similar
but require completely different reasoning
That makes reuse much harder than traditional caching.
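
Both failure modes can be seen with a deliberately naive similarity measure. The sketch below uses word-overlap (Jaccard) similarity, which is a swapped-in toy technique and not how a production system would measure intent. The first pair of prompts comes from earlier in this post. The second pair is a hypothetical example added for contrast.

```python
import re


def tokens(text):
    """Lowercase a prompt and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Different wording, very similar intent (prompts from this post).
same_intent = jaccard(
    "Why is this deployment failing?",
    "Could this Railway deployment issue be related to env validation?",
)

# Similar wording, different reasoning required (hypothetical pair).
lookalike = jaccard(
    "Why is this deployment failing?",
    "Why is this test failing?",
)

print(round(same_intent, 3), round(lookalike, 3))
```

Under this measure the related prompts score lower than the unrelated lookalikes, which is exactly backwards. Any fixed threshold on a score like this would both miss real overlap and reuse answers where it should not.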

The deeper issue

At this point, the problem stops being:
“Can we store responses?”
And becomes:
“Can we recognize when work is fundamentally overlapping?”
That is a very different system problem.

Context makes this worse

Long threads amplify the issue even more.
Because each new message carries:
previous reasoning
prior attempts
accumulated context
So even when the new information is small:
the system still processes the entire growing state around it
That’s one reason costs quietly compound over time.

This changes how I think about memory

A lot of current AI workflows treat memory like:
“store more context”
But I’m starting to think the more important question is:
“What context actually needs to survive?”
Because not all memory is useful.
Some memory:
reinforces direction
Other memory:
introduces drift
redundancy
noise
repeated reasoning loops

The systems that win

I think the systems that eventually win won’t just be:
the smartest models
or the largest context windows
They’ll be the systems that:
understand relevance
recognize overlap
manage context intelligently
and avoid unnecessary recomputation

What I’m trying to understand

At this point, the question I keep coming back to is:
How much intelligence is actually new computation?
Humans are dynamic in behavior and thought.
We:
revisit ideas
refine reasoning
circle back
change direction constantly
But systems tend toward determinism, because the more deterministic a system is,
the easier it becomes to:
optimize
predict
cache
and run efficiently
So the tension is:
How intelligent can AI become while still achieving the ultimate goal of efficiency?
Because intelligence may naturally resist exact repetition, while efficiency depends on it.

What I’ll explore next

In the next post, I’ll go deeper into this idea:
why bigger context windows alone may not actually solve the problem

👉 Part 5 is here: https://dev.to/joshua_chukwu_ccb92f05a94/i-tried-caching-llm-responses-it-didnt-work-the-way-i-expected-aj

Closing thought

Humans don’t think in exact matches.
We think in related ideas, partial overlaps, revisits, and refinements.
But most LLM systems still treat those interactions as completely new work.
And I think that gap is where a lot of the hidden inefficiency lives.

I tried caching LLM responses. It didn’t work the way I expected.

Joshua Chukwu — Mon, 11 May 2026 12:30:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 5)
It’s a cost problem—and most teams don’t realize it yet.

In the last few posts, I’ve been exploring how LLM usage behaves in practice:
it’s iterative
it’s repetitive
and it compounds through growing context
So the obvious question becomes:
why not just cache the responses?

The simple idea

At first, this feels straightforward.
If you’ve already asked something before:
just store the response and reuse it
No recomputation.
No extra cost.

The reality

It works… but only in very limited cases.
Specifically:
exact matches
If the prompt is identical:
same wording
same structure
same input
Then yes, caching works.

The problem with exact matches

That’s not how people use LLMs.
In practice, prompts look like this:
slightly reworded
slightly extended
slightly more context
Same intent.
Different string.

A simple example

You ask:
“Why is my rover not turning in place?”
Later, you ask:
“What could cause a skid-steer robot to fail a zero-radius turn?”
These are clearly related.
But to a basic cache:
they are completely different requests
So the system:
misses the cache
recomputes the answer
charges again
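
The behavior above can be reproduced with a minimal sketch of an exact-match cache. The `ExactMatchCache` class and the `fake_model` function are hypothetical stand-ins: `fake_model` represents a paid LLM call, and the cache key is simply a hash of the exact prompt string.

```python
import hashlib


class ExactMatchCache:
    """Toy response cache keyed by a hash of the exact prompt string."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, compute):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Any difference in the string, however small, lands here.
        self.misses += 1
        response = compute(prompt)
        self._store[key] = response
        return response


def fake_model(prompt):
    # Hypothetical stand-in for a paid LLM call.
    return f"answer to: {prompt}"


cache = ExactMatchCache()
cache.get_or_compute("Why is my rover not turning in place?", fake_model)
cache.get_or_compute("Why is my rover not turning in place?", fake_model)
cache.get_or_compute(
    "What could cause a skid-steer robot to fail a zero-radius turn?", fake_model
)

print(cache.hits, cache.misses)
```

The identical repeat is served from the cache. The reworded question about the same rover problem is treated as brand-new work.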

Why this matters

Because most repetition isn’t:
exact
It’s:
conceptual

What caching actually solves

Basic caching helps with:
identical API calls
repeated automated workflows
fixed prompt pipelines
That’s useful.
But it only captures a small portion of real-world usage.

What it doesn’t solve

It doesn’t handle:
rephrased prompts
debugging loops
evolving context
team-level overlap
Which is where a lot of the cost actually comes from.

The deeper issue

At this point, the problem isn’t:
“how do we store responses?”
It’s:
“how do we recognize when two requests are actually the same work?”

A different framing

Instead of thinking in terms of:
prompt → response
It becomes:
intent → reasoning → output
Caching only works at the prompt level.
But the repetition happens at the intent level.

Why this is hard

Because intent isn’t explicit.
It’s:
inferred
contextual
and often slightly changing
Which makes it difficult to:
detect overlap
reuse prior work
or avoid recomputation

What this leads to

So even after adding caching:
costs still grow
repetition still happens
inefficiency remains
Just slightly reduced.

What I’m trying to understand

At this point, the question becomes:
what would it take to reuse work beyond exact matches?

What I’ll explore next

In the next post, I’ll go deeper into this:
what a system would need to actually recognize and reuse similar work

Part 4 is here: (https://dev.to/joshua_chukwu_ccb92f05a94/youre-probably-paying-twice-for-the-same-llm-response-481e?preview=c6a3bad002bb14076c2a13b65fc8db1237dfc016b3f1a582c0e448db6511dde5856630f2bb7a1d75865b66f23c5afdb373018f530b75a78e57b11a64)

Closing thought

Caching feels like the obvious solution.
And it helps.
But it doesn’t address the core issue:
most of the repetition in LLM usage isn’t identical—it’s just similar
And until we handle that,
we’ll keep recomputing the same ideas
over and over again.

You’re probably paying twice for the same LLM response

Joshua Chukwu — Fri, 08 May 2026 12:06:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 4)
It’s a cost problem—and most teams don’t realize it yet.

In the last post, I talked about where AI costs actually come from—and how context growth quietly compounds them.
But there’s another pattern that’s even more uncomfortable once you notice it:
you’re probably paying twice for the same underlying answer
Not literally the exact same string.
But functionally, the same work.

The obvious version

Let’s start with something simple.
You ask:
“Why can’t my rover perform a zero-radius turn?”
Then later you ask:
“What could cause skid-steer instability at low speeds?”
Different wording.
Same underlying problem.

The subtle version (this is where it gets interesting)

Now imagine a debugging session.
You go through a loop:
ask a question
get a partial answer
refine your prompt
ask again
Each step feels like progress.
But look at it from a system perspective:
you’re repeatedly asking for overlapping reasoning
And each time:
the model recomputes it
the system bills it
nothing is reused

Why this happens naturally

Because this is how humans think.
We don’t ask perfect questions.
We:
explore
rephrase
iterate
So the system ends up doing:
slightly different versions of the same work
over and over again.

The team version (this is worse)

Now scale this beyond one person.
Multiple engineers might:
debug similar issues
build similar features
ask similar questions
But there’s no shared layer that says:
“we’ve already solved something like this”
So the same reasoning gets recomputed across:
people
time
workflows

Why simple caching doesn’t solve it

At this point, the obvious solution seems like:
“just cache the response”
But that only works for:
exact matches
In reality, most prompts are:
slightly reworded
slightly extended
slightly different
So traditional caching misses most of the problem.

A better way to think about it

Instead of asking:
“is this the same prompt?”
The better question is:
“is this the same intent?”
Because that’s what actually matters.

A quick analogy

It’s a very human pattern.
You’ve probably done this before: while looking for something, say a key, you check the same place twice, even though you already know it’s not there.
Not because it makes sense.
But because you’re searching, refining, trying again.
That’s exactly how we interact with LLMs.

The hidden cost

This is where things connect back to Part 3.
If:
a large portion of your usage is repetitive
and context keeps growing
and nothing is reused
Then:
you’re not just paying for usage
you’re paying for recomputation
And computation has always had a cost.
We’re only noticing it now because LLMs make that cost visible at scale.

A simple mental model

Think of it like this:
Every time you ask a similar question, the system:
re-derives the reasoning
re-generates the explanation
re-processes the context
Even if:
you’ve effectively already “paid” for that knowledge before

Why this matters more than it seems

At small scale, this is fine.
At larger scale, it becomes:
expensive
inefficient
invisible
Because it doesn’t show up as:
“duplicate cost”
It shows up as:
normal usage

The uncomfortable realization

Most teams don’t have a way to:
detect repetition
reuse prior work
or even measure how much overlap exists
So they default to:
keep asking, keep recomputing, keep paying

What I’m trying to understand

At this point, the question becomes:
how much of LLM usage is actually new?
And more importantly:
how much of it needs to be recomputed?

What I’ll explore next

In the next post, I’ll go deeper into this:
what a system would actually need in order to reuse work effectively (beyond simple caching)

Part 3 is here: (https://dev.to/joshua_chukwu_ccb92f05a94/where-your-ai-budget-is-actually-going-its-not-what-you-think-3bi0?preview=856c097d9c26551c1a836539589fd827398a31aa4efacf6a4b5ce551f241a957c369cff2b5e18f6c0187f9f94b07430dd024cdd47bc200dd6501ba1d)

Closing thought

AI gives you answers instantly.
But under the hood, every answer is being computed from scratch—again and again.
And unless we rethink how that work is reused,
we’re going to keep paying for the same intelligence
multiple times.

Where your AI budget is actually going (it’s not what you think)

Joshua Chukwu — Wed, 06 May 2026 12:30:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 3)
It’s a cost problem–and most teams don’t realize it yet.

In the last post, I talked about how most LLM usage isn’t as “new” as it feels.
A lot of it is:
iterative
repetitive
overlapping
That’s interesting on its own.
But it becomes a lot more important when you start looking at it through a different lens:
cost.

The assumption most people make

When people think about AI costs, they usually assume the costs come from:
heavy usage
complex and heavy queries
large models
high traffic
Which is partially true, but incomplete.

What actually adds up

In practice, a significant portion of usage comes from things like:
retrying prompts
slightly reworded questions
debugging loops
near-duplicate workflows
None of these feel expensive individually.
But together, they add up.
In my experience from robotics and even outside engineering, anything that compounds tends to spiral faster than expected.

The part most people miss

There’s another layer that makes this worse:
context growth
As conversations get longer, the model doesn’t just process your latest message.
It processes:
your current prompt
plus everything that came before it (within the context window)
So each new message isn’t just:
“one more request”
It’s:
“one more request plus an increasing amount of prior context”.

Why this compounds quickly

Think about a long debugging session.
Message 1: What is A
small context
relatively cheap
Message 10: What is A made up of
includes previous messages
more tokens
Message 30: Compound characteristics of sample A (plus messages 1 to 29)
includes a large conversation history
significantly more tokens
Now combine that with:
iteration loops
retries
near-duplicate prompts
And you get a pattern where:
cost doesn’t just grow linearly—it compounds with usage patterns.
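
A back-of-the-envelope sketch makes the growth visible. It assumes every message adds the same number of tokens (200 is an arbitrary, made-up figure), that the full history is resent with each request, and it ignores model replies, which would add even more to the history. The function names are hypothetical.

```python
def input_tokens_for_request(n, tokens_per_message):
    """Input tokens processed on the n-th request when the full history is resent."""
    return n * tokens_per_message


def cumulative_input_tokens(num_messages, tokens_per_message):
    """Total input tokens processed across the whole conversation so far."""
    return sum(
        input_tokens_for_request(n, tokens_per_message)
        for n in range(1, num_messages + 1)
    )


TOKENS_PER_MESSAGE = 200  # assumed average, for illustration only

for n in (1, 10, 30):
    per_request = input_tokens_for_request(n, TOKENS_PER_MESSAGE)
    total = cumulative_input_tokens(n, TOKENS_PER_MESSAGE)
    print(f"message {n}: {per_request} tokens this request, {total} tokens so far")

# Comparison: 30 independent messages with no history attached.
print(30 * TOKENS_PER_MESSAGE)
```

Under these assumptions, the 30-message thread processes 93,000 input tokens in total, against 6,000 if each message had been sent on its own. The per-request cost grows linearly, so the running total grows roughly with the square of the conversation length.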

A rough mental model

Imagine your usage looks like this:
40% - genuinely new work
30% - variations of the same request
20% - retries / debugging loops
10% - other
Now layer in context growth.
Even if each request seems small:
later requests are more expensive than earlier ones
And all of it is still treated as:
new work.

Why this is easy to miss

Because cost doesn’t show up per thought.
It shows up per request.
And each request feels justified.

The compounding effect

Now scale this:
across a team
across features
across users
What starts as:
“a bit of iteration”
becomes:
a large portion of your AI spend.

The hidden problem

The issue isn’t just cost.
It’s visibility.
Most teams don’t know:
where their AI usage is going
how much is repeated
how context is affecting cost
which workflows are inefficient
So they default to:
keep building, keep shipping, keep paying.

The shift

At some point, the question changes from:
“What can AI do?”
to:
“What is AI costing us?”
This is where AI stops being:
just an engineering problem
And becomes:
a financial one.

A different way to think about it

Knowledge is often described as power.
But knowledge on its own is more like potential energy.
It only becomes power when it’s applied—when it turns into decision-making and control.
In the context of AI:
access to models is knowledge
usage patterns are behavior
but efficiency is applied understanding
The teams that will win aren’t just the ones using AI.
They’re the ones who:
understand how their usage behaves
maximize what they already have
and design systems that avoid unnecessary repetition

A broader implication

In this age, AI is becoming something that almost every team, and even every household, will use.
But just like any other resource:
usage without visibility leads to waste
There needs to be some form of:
awareness
control
and, effectively, a “meter” on how it’s being used
Not to restrict usage, but to understand it.

What I’m trying to understand

At this point, the question isn’t just:
how do we use AI?
It’s:
how do we make AI usage economically sustainable?

What I’ll explore next

In the next post, I’ll go deeper into one specific piece of this:
why simple caching approaches don’t fully solve the problem
👉 Part 2 is here: (https://dev.to/joshua_chukwu_ccb92f05a94/why-most-llm-api-usage-is-quietly-inefficient-4eko?preview=401f3ac7119c46d21f585a03fcb4a625008594ab67a937a4cdfafeebd060d28d70ff4a0887e0f29bc789f100e99c71b343e3223a6756451f5f83dc94)

Closing thought

AI makes it easier to build.
But it also makes it easier to spend.
And unless we start thinking about how usage behaves and not just what models can do,
that spend can grow in ways that are hard to see until it’s too late.

Why most LLM API usage is quietly inefficient

Joshua Chukwu — Mon, 04 May 2026 12:00:00 +0000

Series: AI Isn’t an Engineering Problem Anymore (Part 2)
It’s a cost problem—and most teams don’t realize it yet.

In the last post, I talked about hitting a usage limit while debugging my robot and realizing how repetitive my own AI usage had become.
At the time, it felt like a personal workflow issue.
But the more I thought about it, the more it became clear:
This isn’t just a “me problem.”
It’s a pattern.

The illusion of “new” work

When we use LLMs, whether through APIs or tools, it feels like every request is new.
You type something different.
You add more context.
You refine your question.
But under the hood, a lot of those requests are doing very similar work.
For example:
debugging the same issue from different angles
rewording prompts to get a better answer
retrying when output isn’t quite right
asking for clarification on something you already asked
Each one feels justified.
And most of them are.

Where inefficiency actually comes from

The inefficiency isn’t from using AI too much.
It comes from how we naturally interact with it.
A few patterns show up almost everywhere:

1. Iteration loops

You don’t ask once, you iterate.
“Try this approach”
“That didn’t work, what about this?”
“What if I change this parameter?”
Each step builds on the last, but often overlaps heavily.

2. Near-duplicate prompts

These are the most interesting ones.
They’re not identical, but they’re close:
same intent
slightly different phrasing
maybe a bit more context
To a human, they’re obviously related.
To most systems, they’re treated as completely new.
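
Here is a minimal sketch of how "close but not identical" could be scored. It uses plain word overlap (Jaccard similarity), which is my assumption for illustration and not a method the post describes. The function names and the example prompts are hypothetical.

```python
import re

def tokens(prompt: str) -> set[str]:
    """Lowercase the prompt and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", prompt.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two prompts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty prompts count as identical
    return len(ta & tb) / len(ta | tb)

# Hypothetical prompts: the second adds a bit more context to the first.
first = "Why does my robot struggle to rotate in place?"
reworded = "Why does my robot struggle to rotate in place on snow?"
unrelated = "Write a haiku about autumn leaves"

print(round(jaccard(first, reworded), 2))   # high overlap
print(round(jaccard(first, unrelated), 2))  # no shared words
```

Word overlap only catches prompts that share vocabulary. Two prompts with the same intent and very different phrasing would still score low, so a real system would likely need something stronger than this.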

3. Retry behavior

Sometimes you just don’t like the answer.
So you try again.
same prompt
or a slightly modified one
This is normal.
But it means the same underlying request can be executed multiple times.

4. Team-level duplication

This gets amplified in teams.
Multiple developers might:
debug similar issues
build similar features
ask similar questions
But there’s no shared memory between them.
So the same work gets repeated across people.

Why this is hard to notice

The tricky part is:
None of this feels inefficient in the moment.
It feels like:
progress
exploration
iteration
And that’s because it is.

A quick analogy (this is exactly how it happens)

My dog gained 15 pounds in a year without us realizing.
What was happening?
Every weekday:
I fed him before leaving for work at 6:30am
my girlfriend fed him again at 8:30am
From each of our perspectives:
“I only fed him once.”
But at the system level:
He was getting fed twice, every single day.
We only noticed when something else didn’t add up,
a 25-pound bag of dog food disappearing way too fast.

Back to LLM usage

That’s exactly how LLM usage behaves.
Individually:
each request feels justified
each interaction feels necessary
But at the system level:
similar work is being recomputed
similar responses are being regenerated
similar costs are being incurred
Over and over again.
There’s also a more subtle version of this that most people don’t think about.
Even the smallest additions to prompts, things that feel natural or polite, are still part of the computation.

How many extra tokens do you think you generate just from “please” and “thanks”?

This is obviously a lighthearted example that seems trivial at first.

But multiply it across millions of requests…
Now it’s a systems problem.
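
To put a rough number on the "please" and "thanks" question, here is a back-of-the-envelope sketch. It counts one token per whitespace-separated word, which is only a crude proxy since real tokenizers split text differently. The prompts and the request volume are hypothetical.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word.
    Real tokenizers differ, so treat the result as a ballpark only."""
    return len(text.split())

# Hypothetical prompts: same request, with and without the polite wrapper.
plain = "Summarize this error log."
polite = "Please summarize this error log. Thanks!"

extra_per_request = rough_token_count(polite) - rough_token_count(plain)
requests = 1_000_000  # hypothetical volume

print(extra_per_request)             # extra tokens on one request
print(extra_per_request * requests)  # extra tokens across all requests
```

Two extra words on one request is noise. Across a million requests, the same estimate comes out to two million extra tokens.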

A simple mental model

Think of LLM usage like this:
Every request is treated as a completely new computation, even when it’s not.
There’s no built-in concept of:
“we’ve already solved something like this”
“this looks similar to a previous request”
“we could reuse part of this result”
So the system does exactly what it’s designed to do:
recompute everything
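
To make the mental model concrete, here is a small sketch of what a "we've already solved this" check could look like in its simplest form. Everything in it is an assumption for illustration: fake_model stands in for a real LLM call, and the lookup only matches prompts that are identical after lowercasing and whitespace cleanup.

```python
calls = 0  # how many times the stand-in model actually ran

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    global calls
    calls += 1
    return f"answer to: {prompt}"

seen: dict[str, str] = {}  # normalized prompt -> earlier answer

def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(prompt.lower().split())

def ask_with_reuse(prompt: str) -> str:
    """Reuse an earlier answer when the normalized prompt was seen before."""
    key = normalize(prompt)
    if key not in seen:
        seen[key] = fake_model(prompt)
    return seen[key]

prompts = [
    "Why is my rover not turning?",
    "why is my  rover not turning?",  # same prompt, different case and spacing
    "Why does my robot struggle to rotate in place?",  # reworded, same intent
]
for p in prompts:
    ask_with_reuse(p)

print(calls)  # the first two share one call; the reworded one does not
```

Three requests turn into two model calls. The reworded prompt still gets recomputed, because an exact-match lookup has no idea that it is asking about the same problem.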

When this becomes a real problem

If you’re just experimenting, this isn’t a big deal.
But once you start:
building products
scaling usage
or running teams
This pattern starts to matter.
Because now you have:
more requests
more iteration
more overlap
And all of it compounds.

This is where the shift happens

At some point, the problem stops being:
“How do we use AI effectively?”
And starts becoming:
“How do we use AI efficiently?”
That’s a very different question.

Why this isn’t obvious (yet)

Most of the conversation around AI is still focused on:
model quality
capabilities
performance
Not:
usage patterns
repetition
system-level efficiency
So a lot of teams don’t even look for this problem.

What I’m trying to understand

After noticing this pattern, the question I’ve been thinking about is:
How much of LLM usage is actually new… and how much is just repetition in disguise?
And more importantly:
If a meaningful portion is repetitive, what should we do about it?

What I’ll explore next

In the next post, I’ll go deeper into one specific part of this:
why you’re probably paying twice for the same LLM response, even when the prompts aren’t identical.

👉 Part 1 is here:
(https://dev.to/joshua_chukwu_ccb92f05a94/youve-hit-your-chatgpt-usage-limit-and-what-it-actually-reveals-about-llm-usage-700)

“You’ve hit your ChatGPT usage limit” — and what it actually reveals about LLM usage

Joshua Chukwu — Fri, 01 May 2026 00:21:36 +0000

“You’ve hit your ChatGPT usage limit.”

I didn’t expect that message to mean anything beyond mild inconvenience.
But it ended up revealing something much deeper about how I was using AI, and how most of us probably are.

Background

I’ve been working on building an autonomous snow-clearing robot since 2023.
It’s one of those projects where everything sounds straightforward until you actually try to make it work:
motor control
traction
turning dynamics
real-world constraints
Things got a lot more interesting once AI tools became part of my workflow. Suddenly:
debugging got faster
ideas came quicker
I could iterate without getting stuck for hours
It genuinely felt like I was getting closer to something I had been chasing for a while.

The turning point

Then I made what felt like a small decision at the time:
I bought a set of cheap motors from a manufacturer.
Bad idea.
The software was glitchy.
The behavior was inconsistent.
And my rover couldn’t perform a proper zero-radius turn.
So I did what most of us do now:
I leaned heavily on ChatGPT.

The usage spiral

At first, I was on the free plan.
That lasted… not very long.
I’d start debugging in the morning, and before noon:
“You’ve hit your usage limit.”
That alone should have been a signal.
Instead, I upgraded.

The upgrade (and addiction phase)

When the “try free for 1 month” plan rolled out, I jumped on it.
And honestly, it changed everything.
I wasn’t just using it for debugging anymore:
I started automating parts of my workflow
I used it at work
I even used it for things I used to avoid—like CAD design
It stopped feeling like a tool.
It started feeling like a multiplier.

The moment that stuck

Then one day, I hit the limit again.
But this time the message was different:
“Your usage limit will reset in 7125 minutes.”
7125 minutes.
Such a strangely specific number that I had to calculate it.
divide by 60 → hours
divide by 24 → days
It came out to roughly 5 days.
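
For anyone who wants to check the arithmetic, the conversion is just two divisions:

```python
minutes = 7125       # the number from the limit message
hours = minutes / 60
days = hours / 24

print(f"{hours:.2f} hours, {days:.2f} days")
```

That works out to 118.75 hours, or just under 5 days.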
That’s when it hit me:
I had gotten so used to having this capability on demand that being cut off for a few days felt… unreal.
Like I had to come back down to earth after living somewhere else for a while.

What I started noticing

After that moment, I started paying closer attention to how I was actually using ChatGPT.
Not in a formal, instrumented way, just observing my own behavior.
A few patterns stood out almost immediately.

1. I was asking the same question… differently

When debugging the motor issue on my rover, I wasn’t asking completely new questions each time.
It was more like:
slight variations of the same prompt
reworded explanations
retrying when the answer didn’t feel quite right
Something like:
“Why can’t my rover perform a zero-radius turn?”
would turn into:
“What could cause skid-steer instability at low speeds?”
“Could motor torque limits affect turning radius?”
“Why does my robot struggle to rotate in place?”
Different wording.
Same underlying problem.
And every time, it was treated as a fresh request.

2. Debugging creates loops

The nature of debugging makes this worse.
You don’t just ask once and move on—you iterate:
test something
observe behavior
come back with slightly more context
ask again
That loop might happen 10–20 times for a single issue.
And each iteration:
feels necessary
feels new
but often overlaps heavily with previous ones

3. I wasn’t aware of how much I was repeating

At no point did it feel like I was “wasting” usage.
It felt like I was:
making progress
refining my understanding
getting closer to the answer
But in reality, I was often:
revisiting the same concepts
re-triggering similar responses
paying (in usage) for near-duplicate work

The realization

That’s when the earlier message started to make more sense:
“You’ve hit your ChatGPT usage limit.”
It wasn’t just about “using too much AI.”
It was about how I was using it.

The uncomfortable question

If this is how I was using it as a single person working on one project…
What does this look like for:
a small team of developers
multiple engineers debugging in parallel
a product that has users triggering similar workflows

A simple thought experiment

Imagine a team of 5 engineers (a real case from my office).
Each one:
debugs with AI
iterates through similar prompts
retries and rephrases
Even if 30–40% of their prompts overlap conceptually, there’s no mechanism to:
recognize that overlap
reuse prior results
or even measure it
Every request is treated as completely new work
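
To get a feel for the scale, here is a rough sketch of the thought experiment in numbers. The team size and the 30–40% overlap come from the text above. The 40 prompts per engineer per day is a made-up figure purely for illustration.

```python
def overlapping_prompts(engineers: int, prompts_each: int, overlap_pct: int) -> int:
    """Daily prompts that conceptually repeat work already done."""
    total = engineers * prompts_each
    return total * overlap_pct // 100

ENGINEERS = 5       # from the thought experiment
PROMPTS_EACH = 40   # hypothetical: prompts per engineer per day

low = overlapping_prompts(ENGINEERS, PROMPTS_EACH, 30)
high = overlapping_prompts(ENGINEERS, PROMPTS_EACH, 40)

print(f"{low} to {high} of {ENGINEERS * PROMPTS_EACH} daily prompts overlap")
```

Under those assumptions, 60 to 80 of the team's 200 daily prompts would be repeating something already asked, and nobody would see it.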

Why this matters (even if you’re not thinking about cost)

At the time, I wasn’t thinking: “I’m wasting tokens”
I was thinking: “I need to fix this rover”
And that’s the point.
Most usage doesn’t feel wasteful in the moment.
It feels productive.

The shift in perspective

But once you zoom out, a different pattern appears:
a lot of AI usage is iterative
a lot of that iteration is repetitive
and that repetition is invisible while you’re in it

What this post is really about

This isn’t about:
ChatGPT limits
free vs paid plans
or even just cost
It’s about something more subtle:
How easily we fall into patterns of repeated AI usage without realizing it

Where this goes next

In my case, this started as frustration:
glitchy motors
endless debugging
hitting limits at the worst possible time
But it led to a much more interesting question:
How much of AI usage is actually new… and how much is just repetition in disguise?
That’s what I’ll dig into next.

(Next post)
Why most LLM API usage is quietly inefficient