Google I/O 2026 had the headline stuff. Gemma 4 in four sizes. The new agent-friendly Gemini surfaces. Genie. Project Mariner stuff. All worth talking about.
But the announcement that's actually going to change which agents make sense to build, and which ones I kill on the spreadsheet, isn't any of those. It's the cache-discount pricing tier on Gemini 2.5 Flash.
I'm the person who maintains GeminiLens, an open-source observability layer for Gemini agents. Cost calc is a core part of that lib. I had to rewrite the calc the day after the keynote. Here's why.
The old math
Before I/O 2026, my mental model for a high-frequency Gemini agent looked like this:
- Input tokens: $0.30 per million
- Output tokens: $2.50 per million
For a research agent that calls Gemini 30+ times in one run with a large system prompt and a growing context, the input token cost dominates. Two-thirds of every dollar goes to re-sending the same system prompt and tool list over and over.
You can't fix that with a smaller prompt. The tool list IS the agent. Trimming it makes the agent dumber.
So the choice was usually: either pay the full input bill, or rebuild the agent around a smaller model that's bad at tools.
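To make that input-heavy bill concrete, here's a rough sketch of the flat-rate math for one agent run. The two prices are the ones listed above. The workload (an 8K-token prompt plus tool list, about 300 tokens of new context per step, about 600 output tokens per step) is made up for illustration and is not a measurement.

```python
INPUT_PRICE = 0.30 / 1_000_000   # USD per input token (from the list above)
OUTPUT_PRICE = 2.50 / 1_000_000  # USD per output token (from the list above)


def flat_run_cost(steps, base_prompt_tokens, growth_per_step, output_tokens_per_step):
    """Return (input_cost, output_cost) for one run when every input token is billed at full price."""
    # Each step re-sends the base prompt plus everything accumulated so far.
    input_tokens = sum(base_prompt_tokens + growth_per_step * i for i in range(steps))
    output_tokens = steps * output_tokens_per_step
    return input_tokens * INPUT_PRICE, output_tokens * OUTPUT_PRICE


# Hypothetical workload, chosen only to illustrate the shape of the bill.
input_cost, output_cost = flat_run_cost(
    steps=30, base_prompt_tokens=8_000, growth_per_step=300, output_tokens_per_step=600
)
input_share = input_cost / (input_cost + output_cost)
print(f"input ${input_cost:.4f}, output ${output_cost:.4f}, input share {input_share:.0%}")
```

With these assumed numbers the input side lands at roughly 70% of the bill, even though output tokens cost more than 8x as much per token.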
The new math
The cache-discount tier (announced at I/O, live in the API now) introduces a third price for input tokens: cached input. Tokens you've already sent recently (within the cache TTL window) cost roughly an order of magnitude less than fresh input tokens.
For an agent loop where the system prompt + tool list never changes turn-to-turn, ~95% of "input tokens" on every call after the first one are now cached. The cost graph collapses.
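Here's a minimal sketch of what the third price does to a single call. The fresh input and output prices are the ones from the old math. The cached price is a placeholder: I've assumed exactly 10x cheaper to stand in for "roughly an order of magnitude", so check the current pricing page before trusting the output. The 10K/500 token counts are also hypothetical.

```python
INPUT_PRICE_FRESH = 0.30 / 1_000_000   # USD per fresh input token (from the old math)
INPUT_PRICE_CACHED = 0.03 / 1_000_000  # ASSUMPTION: 10x cheaper than fresh, not an official figure
OUTPUT_PRICE = 2.50 / 1_000_000        # USD per output token (from the old math)


def call_cost(input_tokens, output_tokens, cached_fraction):
    """Cost of one call when `cached_fraction` of its input tokens are billed at the cached rate."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (
        fresh * INPUT_PRICE_FRESH
        + cached * INPUT_PRICE_CACHED
        + output_tokens * OUTPUT_PRICE
    )


# Hypothetical call: 10K input tokens, 500 output tokens.
cold = call_cost(10_000, 500, cached_fraction=0.0)   # first call, nothing cached yet
hot = call_cost(10_000, 500, cached_fraction=0.95)   # later call, ~95% cache hits
print(f"cold ${cold:.5f}  hot ${hot:.5f}")
```

Output tokens are never discounted in this model, so once the input side is cache-hot the output cost becomes the floor for each call.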
I redid my favorite stress test scenario: a 30-step research agent with a 4K-token system prompt and a 12-tool function-calling schema.
| | Old pricing | New pricing |
|---|---|---|
| Avg cost per step | $0.0048 | $0.0011 |
| Cost per 30-step run | $0.14 | $0.033 |
| Cost per 100k daily runs | $14,400 | $3,300 |
That's roughly a 4.4x reduction ($0.0048 / $0.0011), not from "use a smaller model" or "be smarter about prompts" but from the pricing change alone.
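The per-run and per-day rows follow directly from the per-step averages, so the table is easy to re-derive for your own step counts and volumes. A small sketch, using only the numbers already in the table:

```python
STEPS_PER_RUN = 30
RUNS_PER_DAY = 100_000

# Average cost per step in USD, copied from the table above.
per_step = {"old": 0.0048, "new": 0.0011}

for label, step_cost in per_step.items():
    run_cost = step_cost * STEPS_PER_RUN
    daily_cost = run_cost * RUNS_PER_DAY
    print(f"{label}: ${run_cost:.3f} per run, ${daily_cost:,.0f} per day")

# Ratio of the two per-step averages; the same ratio holds at every scale.
reduction = per_step["old"] / per_step["new"]
print(f"reduction: {reduction:.2f}x")
```

Swap in your own `STEPS_PER_RUN` and `RUNS_PER_DAY` to see where your agent lands.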
What this means for which agents make sense
I keep a list of "agent ideas I'd love to build but the unit economics kill them." After the cache-discount tier, three of them moved from no to maybe:
- Real-time security alert triage where every new alert kicks off a 20-step Gemini investigation. Old math: ~$1.50 per alert. New math: ~$0.35. Suddenly viable at high alert volumes.
- Daily-per-user research digests for a B2B SaaS. Old math: $1.10 per user per day. New math: ~$0.25. Now defensible at $20/seat.
- Long-running monitoring agents that wake up every 5 minutes, re-evaluate state, decide nothing changed, and go back to sleep. Old math made this look stupid. New math: a fully chatty Gemini agent on a cron schedule is now cheaper than a small heuristic Lambda.
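The digest example is the easiest one to sanity-check against a seat price. This sketch assumes the $20/seat figure is a monthly price and that a month has 30 billable days; neither assumption is stated above, so adjust both to your own plan.

```python
DAYS_PER_MONTH = 30          # ASSUMPTION: 30 billable days per month
SEAT_PRICE_PER_MONTH = 20.0  # ASSUMPTION: the $20/seat figure is a monthly price


def monthly_margin(cost_per_user_per_day):
    """Seat revenue minus model spend for one user over one month, in USD."""
    return SEAT_PRICE_PER_MONTH - cost_per_user_per_day * DAYS_PER_MONTH


old_margin = monthly_margin(1.10)  # old math from the digest example
new_margin = monthly_margin(0.25)  # new math from the digest example
print(f"old: {old_margin:+.2f} USD/seat/month, new: {new_margin:+.2f} USD/seat/month")
```

Under those assumptions the old pricing loses money on every seat, while the new pricing leaves room for everything else a seat has to pay for.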
None of these are flashy. None of them got mentioned in a keynote slide. But for any team building unit-economics-driven Gemini agents, the cache-discount tier is the I/O announcement that changes the model.
What I'm changing in GeminiLens
I had to update three things in GeminiLens the same day:
1. Cost calc now tracks token classes separately.
```python
# before
total_cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# after
total_cost = (
    fresh_input_tokens * INPUT_PRICE_FRESH +
    cached_input_tokens * INPUT_PRICE_CACHED +
    output_tokens * OUTPUT_PRICE
)
```
The naive sum was hiding which calls were cache-hot vs cold. Now the JSONL audit log carries fresh_input_tokens and cached_input_tokens as separate fields. The Streamlit dashboard renders a per-call cache-hit ratio.
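For anyone rolling their own logging, here's a minimal sketch of what such a record could look like. This is not the GeminiLens source. `fresh_input_tokens` and `cached_input_tokens` are the field names mentioned above; `call_id`, `cache_hit_ratio`, and the `audit_record` helper are hypothetical names I'm using for illustration.

```python
import json


def audit_record(call_id, fresh_input_tokens, cached_input_tokens, output_tokens):
    """Build one audit-log entry that keeps the two input token classes apart."""
    total_input = fresh_input_tokens + cached_input_tokens
    return {
        "call_id": call_id,  # hypothetical field name
        "fresh_input_tokens": fresh_input_tokens,
        "cached_input_tokens": cached_input_tokens,
        "output_tokens": output_tokens,
        # Share of input tokens served from cache; 0.0 when there was no input at all.
        "cache_hit_ratio": cached_input_tokens / total_input if total_input else 0.0,
    }


record = audit_record("step-07", fresh_input_tokens=500, cached_input_tokens=9_500, output_tokens=400)
line = json.dumps(record)  # one JSON object per line is what makes the log JSONL
print(line)
```

Storing the raw counts rather than only the ratio means costs can be recomputed later if the prices change again.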
2. Dashboard now shows "could-have-been-cached" warnings.
If I see a 30-step run where every call has zero cached tokens, that's a bug — the cache TTL is probably set wrong or the prompt is being reshuffled. Now flagged as a warning.
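A check like this can be a few lines. The sketch below is one way to write it, not the actual dashboard code: `could_have_been_cached` is a hypothetical helper, and it skips the first call on the assumption that a cold first call is expected rather than suspicious.

```python
def could_have_been_cached(calls, min_calls=2):
    """Return True when a multi-call run never reported a single cached input token.

    `calls` is a list of dicts with a `cached_input_tokens` key, in call order.
    The first call is skipped because it is expected to be cold.
    """
    if len(calls) < min_calls:
        return False  # a single call has nothing to reuse
    return all(call["cached_input_tokens"] == 0 for call in calls[1:])


# Hypothetical runs: one that never hits the cache, one that warms up after the first call.
cold_run = [{"cached_input_tokens": 0} for _ in range(30)]
warm_run = [{"cached_input_tokens": 0}] + [{"cached_input_tokens": 9_000} for _ in range(29)]
print(could_have_been_cached(cold_run), could_have_been_cached(warm_run))
```

The first run gets flagged and the second does not. A stricter variant could also flag runs where the cached share drops partway through, which is what a reshuffled prompt tends to look like.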
3. Cost-per-run estimator splits cold-start vs steady-state.
For agent loops, the first call is cold (full input price), and steady-state is hot. Reporting the average flattens the picture and makes optimization decisions harder. The new dashboard shows both numbers separately.
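Here's a small sketch of that split, assuming you already have a per-call cost list for one run. The function name and the example costs are hypothetical; the point is only that the blended average sits between two quite different numbers.

```python
from statistics import mean


def split_cold_and_steady(call_costs):
    """Return (cold_start_cost, steady_state_mean) for one run's per-call costs in USD."""
    if not call_costs:
        raise ValueError("need at least one call")
    cold = call_costs[0]
    rest = call_costs[1:]
    steady = mean(rest) if rest else None  # a one-call run has no steady state
    return cold, steady


# Hypothetical per-call costs: one cold call followed by 29 cache-hot calls.
costs = [0.0040] + [0.0010] * 29
cold, steady = split_cold_and_steady(costs)
print(f"cold ${cold:.4f}, steady ${steady:.4f}, blended ${mean(costs):.4f}")
```

For short runs the cold call dominates, and for long runs the steady-state number does, which is why a single average is a poor guide to where to optimize.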
The under-the-radar thing
Most I/O coverage I've seen frames the cache tier as "an API optimization" — paragraph 4 of a TechCrunch post, two-thirds down the Gemini docs page. It's actually a unit-economics step change for one specific shape of agent: the kind where the prompt stays large and the loop stays long.
If you're shipping that kind of agent, redo your spreadsheet. The math you had on Friday May 16 is no longer the math.
This is my entry for the Google I/O 2026 Writing Challenge. I work on open-source AI agent reliability tooling under @MukundaKatta. GeminiLens is on PyPI: pip install geminilens.