Why the hidden cost of low-intent AI systems is not just more tokens, but weaker architecture.

From prompt optimization to intent optimization: higher ICR means more useful work, fewer wasted tokens, and better token economics.
Introduction
We talk a lot about tokens.
Input tokens.
Output tokens.
Context windows.
Pricing.
Latency.
Caching.
But I think we are missing the deeper question.
Why are we using so many tokens in the first place?
Most conversations about token optimization start too late. They focus on trimming prompts, compressing context, caching responses, or choosing cheaper models.
Those things matter.
But they treat symptoms.
The deeper issue is often not the model.
It is the interaction architecture.
A weak interaction architecture forces the user to explain too much, repeat too much, clarify too much, and carry too much system knowledge in every request.
That is not just a usability problem.
It is an economic problem.
Because in AI-native systems, tokens are no longer just text.
Tokens are infrastructure.
They affect cost.
They affect latency.
They affect context capacity.
They affect reliability.
They affect how much useful work an AI system can perform before the interaction collapses under its own verbosity.
This is where Intent Compression Ratio (ICR) becomes useful.
In my previous article, “Intent Compression Ratio: Measuring the Power of Intent,” I looked at ICR as a way to understand how much procedural complexity moves from the developer to the system.
In this post, I want to look at the same idea from another angle:
What happens to token economics when intent is poorly compressed?
The Token Problem Is Not Just Token Count
It is tempting to think token optimization means making prompts shorter.
That is only partially true.
A short prompt can be bad if it is ambiguous.
A long prompt can be necessary if the system has no context.
The real question is simpler:
How many tokens does the system need to faithfully understand and fulfill the user’s intent?
That is the economic unit that matters.
Not prompt length alone.
Not output length alone.
But the total token footprint required to move from intent to outcome.
That includes:
- The first prompt.
- The clarification turns.
- The repeated context.
- The hidden system instructions.
- The tool call explanations.
- The retries.
- The corrections.
- The familiar “no, that is not what I meant” loop.
When you add all of that up, a small human request can become a very large machine interaction.
I call this token amplification.
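To make the footprint concrete, here is a minimal sketch that adds up the pieces listed above for a single intent. The `Interaction` class, its field names, and the whitespace-based `estimate_tokens` helper are all illustrative assumptions; a real tokenizer will count differently, so treat the numbers as relative rather than billable.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class Interaction:
    """Everything exchanged while moving one intent to one outcome (hypothetical model)."""
    first_prompt: str
    system_instructions: str = ""
    clarification_turns: list[str] = field(default_factory=list)
    repeated_context: list[str] = field(default_factory=list)
    tool_call_explanations: list[str] = field(default_factory=list)
    retries: list[str] = field(default_factory=list)
    corrections: list[str] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Sum the estimated tokens of every message in the interaction."""
        parts = [self.first_prompt, self.system_instructions]
        for group in (
            self.clarification_turns,
            self.repeated_context,
            self.tool_call_explanations,
            self.retries,
            self.corrections,
        ):
            parts.extend(group)
        return sum(estimate_tokens(p) for p in parts)

    def amplification(self) -> float:
        """Total footprint divided by the size of the original request."""
        return self.total_tokens() / estimate_tokens(self.first_prompt)


interaction = Interaction(
    first_prompt="Deploy the payment service with autoscaling and observability.",
    system_instructions="Follow team deployment standards and naming conventions",
    clarification_turns=["Which environment should I target?", "Staging first, then production."],
    corrections=["No, that is not what I meant. Use the standard rollback policy."],
)
print(interaction.total_tokens(), interaction.amplification())
```

An amplification close to 1 means the request was nearly enough on its own; the further it climbs, the more of the footprint came from everything around the intent rather than the intent itself.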
Token Amplification
Token amplification happens when compact human intent expands into a disproportionately large token footprint.
For example, a developer may say:
Deploy the payment service with autoscaling and observability.
That is a compact intent.
But in a low-ICR system, the developer may need to explain:
- What “service” means.
- Which environment.
- Which runtime.
- Which scaling rules.
- Which observability stack.
- Which naming conventions.
- Which security constraints.
- Which deployment policy.
- Which rollback behavior.
- Which team standards.
Then the system asks clarifying questions.
Then the user repeats context.
Then the system generates something close, but not quite right.
Then the user corrects it.
By the end, the original intent has become a long token trail.
The system did not use more tokens because it understood more.
It used more tokens because it understood less.
That distinction matters.
ICR as an Economic Signal
For this discussion, I use a deliberately simple version of ICR:
ICR = Intent Fulfilled / Total Tokens
Higher ICR means more useful work completed per token consumed.
Lower ICR means the system burns more tokens to achieve the same outcome.
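The formula leaves "Intent Fulfilled" without a unit, so any calculation has to pick one. The sketch below assumes a simple checklist: the fraction of stated requirements the outcome actually met. That choice, the `icr` function name, and the per-1,000-token scaling are assumptions made for illustration, not part of the definition above.

```python
def icr(requirements_met: int, requirements_total: int, total_tokens: int) -> float:
    """Intent fulfilled per 1,000 tokens, with fulfilment as a checklist fraction.

    Assumption: "intent fulfilled" is approximated by requirements_met / requirements_total.
    """
    if requirements_total <= 0 or total_tokens <= 0:
        raise ValueError("requirements_total and total_tokens must be positive")
    if not 0 <= requirements_met <= requirements_total:
        raise ValueError("requirements_met must be between 0 and requirements_total")
    fulfilled = requirements_met / requirements_total
    return fulfilled * 1000 / total_tokens


# Same outcome (all three requirements met), very different token footprints.
low = icr(requirements_met=3, requirements_total=3, total_tokens=4000)
high = icr(requirements_met=3, requirements_total=3, total_tokens=500)
print(f"low-ICR interaction:  {low:.2f}")
print(f"high-ICR interaction: {high:.2f}")
```

Because the numerator is capped at full fulfilment, the only way to raise the score for a completed task is to shrink the denominator, which is the economic point of the metric.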
This makes ICR more than a developer experience idea.
It is also a cost metric.
A high-ICR system does not merely feel better to use.
It is more economical.
It reduces repeated explanations.
It reduces clarification loops.
It reduces redundant context transfer.
It reduces orchestration overhead.
It reduces the probability of retries.
That means token optimization is not only a prompt engineering problem.
It is an intent design problem.
Prompt Verbosity Is Often a Tax on Missing Intent
We have all seen this pattern.
A prompt starts simple.
Then it grows.
“Act as a senior engineer…”
“Follow these conventions…”
“Use this format…”
“Do not forget…”
“Before answering…”
“Assume the following context…”
“Here are the constraints…”
“Here are examples…”
“Here are exceptions to the examples…”
Some of this is useful.
But often, prompt verbosity is compensating for missing system memory, missing context, missing abstractions, or missing intent contracts.
The user is not just expressing the goal.
The user is rebuilding the world around the goal.
Every time.
That is expensive.
Not only in dollars, but in cognition.
The developer becomes responsible for carrying the system model in their head and serializing it into text repeatedly.
That is not intelligence.
That is manual context transport.
Skills Are Where Intent Becomes Executable
Intent by itself is not enough.
If I say:
Deploy the payment service.
The system still needs to know what “deploy” means in my environment.
Does it mean Kubernetes? Snowpark Container Services? Terraform? A CI/CD pipeline? A staging rollout first? Which observability defaults? Which governance rules? Which rollback policy?
This is why skills are becoming such an important primitive in agentic systems.
Claude has Skills. Cortex Code has Skills. Gemini has Gems. Codex has Skills.
Different names, but they point in a similar direction.
A skill is not just a saved prompt.
At its best, a skill is compressed operational knowledge.
It can encode the steps, constraints, conventions, tools, expected output, safety boundaries, and domain-specific judgment required to complete a recurring task.
Without skills, the user has to serialize all of that context into the prompt.
With skills, the system can load the right operational pattern and execute the intent with less repeated explanation.
That is ICR in practice.
The human expression gets smaller, but the executable meaning gets richer.
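As a rough illustration of that idea, the sketch below expands a compact intent into an explicit plan by looking up stored operational knowledge. The `Skill` class, the `SKILLS` registry, and the steps inside it are hypothetical; they do not reflect the actual skill format of Claude, Cortex Code, Gemini, or Codex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical container for recurring operational knowledge."""
    name: str
    steps: tuple[str, ...]
    constraints: tuple[str, ...]


# Illustrative registry; a real environment would define its own steps and rules.
SKILLS = {
    "deploy": Skill(
        name="deploy",
        steps=("build image", "roll out to staging", "run smoke tests", "promote to production"),
        constraints=("rollback on failed health check", "emit metrics and traces"),
    ),
}


def expand(intent: str) -> list[str]:
    """Turn a compact intent into an explicit plan using a stored skill."""
    verb, _, target = intent.strip().rstrip(".").partition(" ")
    skill = SKILLS.get(verb.lower())
    if skill is None:
        # No shared context for this verb: surface the gap instead of guessing.
        raise LookupError(f"no skill registered for {verb!r}")
    plan = [f"{step} ({target})" for step in skill.steps]
    plan += [f"constraint: {c}" for c in skill.constraints]
    return plan


for line in expand("Deploy the payment service."):
    print(line)
```

The user typed four words; the plan that comes back carries the steps and constraints they would otherwise have had to spell out in the prompt.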
The Economics of Repetition
The point is not to make every prompt shorter.
The point is to stop repeating what the system should already know.
A lot of token waste does not come from the user’s intent.
It comes from everything wrapped around the intent.
The same architectural context.
The same coding standards.
The same deployment rules.
The same governance constraints.
The same output format.
The same safety instructions.
Again and again.
In a single prompt, this feels harmless.
Across a workflow, it becomes expensive.
Across agents, tools, and retries, it becomes amplification.
This is the economics of repetition.
Low-ICR systems make humans restate context.
High-ICR systems turn repeated context into reusable capability.
That is why skills matter.
A skill is one way to stop paying the repetition tax every time. It packages recurring operational knowledge so the user can express the outcome instead of re-explaining the procedure.
The token savings are not just from fewer words.
They come from moving repeated meaning out of the prompt path and into the system.
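The repetition tax is easy to put into arithmetic. This sketch compares restating the same context on every turn with loading it once and referring to it by a short handle. The function names, the `reference_tokens` default, and the sample numbers are assumptions; whether a given system truly pays for shared context only once depends on how it stores or caches that context.

```python
def restated_cost(context_tokens: int, intent_tokens: int, turns: int) -> int:
    """User re-serializes the full context alongside the intent on every turn."""
    return turns * (context_tokens + intent_tokens)


def reusable_cost(context_tokens: int, intent_tokens: int, turns: int,
                  reference_tokens: int = 5) -> int:
    """Context is paid for once, then each turn carries the intent plus a short reference.

    Assumption: the system can reuse stored context without re-reading it in full.
    """
    return context_tokens + turns * (intent_tokens + reference_tokens)


context, intent, turns = 400, 20, 10
repeated = restated_cost(context, intent, turns)
reused = reusable_cost(context, intent, turns)
print(f"restated every turn: {repeated} tokens")
print(f"loaded once:         {reused} tokens")
print(f"repetition tax:      {repeated - reused} tokens")
```

In this model the gap grows linearly with the number of turns, which is why repetition that looks harmless in one prompt becomes expensive across a workflow.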
But compression creates a second responsibility: transparency.
A Small Experiment: ICR Lab
To make this visible, I built a small open-source Streamlit app called ICR Lab.
The app is intentionally simple. It does not call a live LLM, benchmark models, or try to be an agent framework.
It simulates token behavior across different interaction styles so the economics are easier to see.
The current version compares five modes:
- Verbose Prompting: large prompts with repeated context.
- Clarification Heavy: multiple rounds of back-and-forth before the system can act.
- Context-Aware: the system reuses some context and reduces repeated explanation.
- Intent-Optimized: the user expresses compact semantic intent and the system assumes more structured responsibility.
- Over-Compressed: the user strips so much context that the system misinterprets the intent, triggering correction rounds that amplify tokens beyond what a verbose prompt would have cost.
The line I keep coming back to is this:
Same intent. Different interaction architecture. Different token economics.

Figure: ICR Lab visualizes how token amplification changes across interaction modes.
That is the entire point of the demo.
The task does not change.
The model does not change.
The interaction pattern changes.
And the token footprint changes with it.
The fifth mode exists to prevent the wrong takeaway. Over-compression is not optimization. When the system lacks enough shared context to decompress a terse intent, it guesses wrong — and the correction cost exceeds the savings.
ICR is not “make prompts shorter”. It is “make meaning denser”.
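For readers who want the shape of such a comparison without opening the app, here is a minimal deterministic sketch of the five modes. It is not the ICR Lab source code: the `Mode` fields and every number in `MODES` are invented for illustration and chosen only to reproduce the qualitative ordering described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """Hypothetical parameters describing one interaction style."""
    name: str
    prompt_factor: float   # first prompt size relative to the bare intent
    context_tokens: int    # context the user restates in the first prompt
    followups: int         # clarification or correction rounds
    followup_tokens: int   # tokens spent per follow-up round


# Invented values, not measurements.
MODES = [
    Mode("Verbose Prompting", 1.0, 300, 0, 0),
    Mode("Clarification Heavy", 1.0, 50, 4, 80),
    Mode("Context-Aware", 1.0, 100, 1, 40),
    Mode("Intent-Optimized", 1.0, 20, 0, 0),
    Mode("Over-Compressed", 0.5, 0, 3, 150),
]


def simulate(intent_tokens: int, mode: Mode) -> int:
    """Total tokens for one task under one mode; same input always gives same output."""
    first_prompt = round(intent_tokens * mode.prompt_factor) + mode.context_tokens
    return first_prompt + mode.followups * mode.followup_tokens


intent_tokens = 10
totals = {mode.name: simulate(intent_tokens, mode) for mode in MODES}
for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:<20} {total:>4} tokens  ({total / intent_tokens:.1f}x the intent)")
```

With these made-up parameters the terse Over-Compressed prompt ends up costlier than Verbose Prompting, because its savings on the first message are outweighed by the correction rounds that follow.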
The Wrong Lesson: Just Make Prompts Shorter
There is a dangerous interpretation here.
Someone might say:
So high ICR means shorter prompts?
Not exactly.
A short prompt with no shared context is just ambiguity.
A high-ICR interaction is not merely brief.
It is dense.
It carries more meaning per token because the surrounding system understands the domain, constraints, policies, tools, and desired state.
That understanding can come from many places:
- Platform profiles.
- Reusable skills.
- Semantic memory.
- Declarative manifests.
- Typed tool interfaces.
- Repository context.
- Organizational conventions.
The point is not to delete words.
The point is to stop repeating what the system should already know.
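One way to picture shared context doing this work is a typed intent whose unspecified fields are filled from a platform profile. The sketch below is an assumption-laden illustration: `DeployIntent`, `PLATFORM_PROFILE`, and the default values are hypothetical and stand in for whatever profiles, manifests, or conventions a real system holds.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class DeployIntent:
    """Hypothetical typed intent; None means the user left the field unstated."""
    service: str
    environment: Optional[str] = None
    runtime: Optional[str] = None
    observability: Optional[str] = None


# Illustrative organizational defaults, not recommendations.
PLATFORM_PROFILE = {
    "environment": "staging",
    "runtime": "kubernetes",
    "observability": "opentelemetry",
}


def resolve(intent: DeployIntent, profile: dict[str, str]) -> DeployIntent:
    """Fill unstated fields from the profile; explicit user values always win."""
    missing = {
        f.name: profile[f.name]
        for f in fields(intent)
        if getattr(intent, f.name) is None and f.name in profile
    }
    return replace(intent, **missing)


compact = DeployIntent(service="payment", environment="production")
print(resolve(compact, PLATFORM_PROFILE))
```

The user states only what differs from the defaults, and the resolved object still shows every value the system will act on, so nothing has been silently assumed.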
Compression Without Visibility Is Opacity
There is another trap.
High compression can become dangerous when it hides too much.
If I say:
Deploy the connector.
And the system creates users, roles, policies, network rules, secrets, and tasks behind the scenes, I may get a great experience.
Until something breaks.
Then I need to know what was created, in what order, with what constraints, which step failed, what can be retried safely, and what state was left behind.
This is why ICR must be paired with transparency.
The system should absorb complexity, but not erase visibility.
High ICR without transparency is magic.
High ICR with transparency is engineering.
That distinction matters even more when money enters the picture.
Token-efficient systems that are impossible to debug only move the cost elsewhere.
They save tokens on the happy path and burn engineering hours during failure.
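A minimal sketch of what pairing compression with visibility can look like: every step the system takes on the user's behalf is recorded, so a failure shows what was created, which step broke, and what never ran. The `StepRecord` type, the `run_with_trace` function, and the connector steps are hypothetical, and a real system would also need to record whether each step is safe to retry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepRecord:
    """One line of the execution trace."""
    name: str
    status: str        # "ok", "failed", or "skipped"
    detail: str = ""


def run_with_trace(steps: list[tuple[str, Callable[[], None]]]) -> list[StepRecord]:
    """Run steps in order, stop acting after the first failure, and record everything."""
    trace: list[StepRecord] = []
    failed = False
    for name, action in steps:
        if failed:
            trace.append(StepRecord(name, "skipped"))
            continue
        try:
            action()
            trace.append(StepRecord(name, "ok"))
        except Exception as exc:
            failed = True
            trace.append(StepRecord(name, "failed", str(exc)))
    return trace


def create_network_rule() -> None:
    # Simulated failure for the example.
    raise RuntimeError("egress policy rejected")


connector_steps = [
    ("create role", lambda: None),
    ("create network rule", create_network_rule),
    ("create secret", lambda: None),
]

for record in run_with_trace(connector_steps):
    print(f"{record.name:<22} {record.status:<8} {record.detail}")
```

The user still only said "Deploy the connector", but the trace answers the questions that matter once something breaks: what exists, what failed, and what was left undone.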
From Prompt Optimization to Intent Optimization
Prompt optimization asks:
How do I phrase this better?
Intent optimization asks:
Why does the user need to phrase this much at all?
That is the deeper shift.
The future of AI systems will not be defined only by larger context windows.
Larger context windows help, but they can also hide bad architecture.
If the system can accept more context, we may simply dump more context into it: more documents, more history, more instructions, more examples, more logs, more everything.
But stuffing the context window is not the same as understanding intent.
A bigger backpack does not make you organized.
It just lets you carry more mess.
The better direction is often not:
Give the model more tokens.
It is:
Require fewer tokens to preserve the same intent.
That is intent optimization.
What ICR Starts to Reveal
Once you look at AI systems through ICR, a few things become obvious.
A verbose prompt is not always rich intent.
Sometimes it is a workaround for missing abstraction.
A clarification loop is not always intelligence.
Sometimes it is the system externalizing its uncertainty back onto the user.
A large context window is not always better architecture.
Sometimes it is a bigger place to store repeated noise.
An agent workflow is not automatically efficient.
Sometimes agents multiply token usage by passing bloated context between themselves.
And a skill is not just automation.
It is a compressed unit of repeatable intent.
That last point matters.
In “Intent Driven Development,” I argued that developers are not becoming less important because AI can generate code.
They are becoming more important because the bottleneck has moved.
The bottleneck is no longer syntax.
It is clarity of intent.
Token economics reinforces that shift.
A vague developer with AI creates token sprawl.
A precise developer with intent-aware systems creates leverage.
After seeing these patterns, the next question is natural:
Why simulate this instead of using a live LLM?
Why ICR Lab Uses Simulation
ICR Lab v1 uses deterministic simulation instead of live model calls.
That is deliberate.
The goal is not to benchmark OpenAI, Anthropic, Snowflake Cortex, local models, or anything else.
The goal is to isolate the interaction pattern.
If the app used live LLM calls immediately, the conversation would drift into model quality.
Which model is cheaper?
Which model answers better?
Which model counts tokens differently?
Which model follows instructions more reliably?
Those are valid questions.
But they are not the question I wanted to explore.
The question is simpler:
How does interaction architecture change token economics?
Simulation keeps that visible.
Same input.
Same modes.
Same curves.
Same comparison.
No API keys.
No model variance.
No hidden backend.
Live model integrations can come later.
The concept needs to be clear before the instrumentation becomes real.
Closing Thought
The future may not belong to systems that consume the most tokens.
It may belong to systems that need the fewest tokens to understand us correctly.
That does not mean smaller models.
It does not mean shorter prompts.
It does not mean hiding complexity behind magical interfaces.
It means better intent architecture.
Systems that know the context.
Systems that preserve constraints.
Systems that compress repeated patterns.
Systems that expose what they did.
Systems that turn verbose human instruction into transparent machine execution.
That is where ICR and token economics meet.
Not in counting tokens for the sake of counting tokens.
But in asking a better question:
How much useful intent did this system fulfill per token consumed?
That question changes the optimization target.
And once you see it, token cost stops looking like a pricing problem alone.
It starts looking like a design problem.
Try the Demo
I built a small companion app for this post: ICR Lab.
It is open source and runs as a Streamlit app.
The goal is simple:
Enter a task.
Compare interaction modes.
Watch token amplification happen.
Then watch it collapse as intent gets better compressed.
It is not a benchmark.
It is a playground for building intuition.
Because sometimes the best way to understand a systems problem is to make the invisible curve visible.
Related Reading
- ICR Lab GitHub Repo
- ICR Lab Live App
- Infrastructure as Intent: The Field Velocity Blueprint
- The Ghost in the Machine: Why AI Needs the Spirit of UML
- Intent Driven Development: The Shift Developers Can’t Ignore
- Intent Compression Ratio: Measuring the Power of Intent
About the Author
Kamesh Sampath is a Lead Developer Advocate at Snowflake, author, and long-time open-source contributor with 25+ years in enterprise software. He works across data engineering and AI with developer communities, helping practitioners turn modern data platforms into systems that hold up in production.
Through talks, writing, and hands-on demos, Kamesh makes cloud, data, and AI topics easier to understand and apply — grounded in real-world constraints. His sessions mix deep technical detail with practical patterns that developers and data teams can apply right away.
Lately, he’s been speaking about Apache NiFi (Snowflake Openflow), AI (Snowflake Cortex), and PostgreSQL.
He believes technology becomes powerful when it is shared, taught, and built together.