Regnard Raquedan

Token Consumption Anxiety and the Open Source App I Built to Solve It

Thanks to AI, I've spent more time architecting and building apps, which means I spend a lot of time looking at frontier models and agonizing over token use. I’ve also been battling a very modern affliction: token consumption anxiety.

It feels like modern AI-powered app architecture slaps an LLM at the front door. You want to dynamically pick the best model for a specific task? Great, the industry standard is to call an expensive, heavy model just to decide whether the prompt should go to Claude, Gemini, or a smaller open-source model. We're burning latency and spending tokens at near-absurd levels.

I got tired of this cycle. I wanted a model picker with exactly zero models in the request path. So, I fired up Antigravity, let the AI (a trio of Gemini, Codex, and Claude) do the coding while I directed the architecture, and built a tool to solve my own headache.

The result is RightModel. It's a tool that evaluates your task and recommends the ideal model—but the way it gets there is entirely different. Let’s walk through the architecture.

Handling the request

When you submit a task to RightModel, there are zero LLM calls in the default path. The system evaluates your parameters, computes the ideal model against a pre-existing ruleset, and returns the response instantly.

Here's an example response payload:

```json
{
  "task_type": "code_generation",
  "recommended_model": "claude-3-5-sonnet",
  "reason": "High complexity context matched; tier 1 code model selected."
}
```

Everything interesting happens before the request, not during it.
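
To make that concrete, here's a minimal sketch of what a zero-LLM default path could look like. Everything here is illustrative: `TaskParams`, `Rule`, and `recommend()` are hypothetical names I'm using for the article, not RightModel's actual internals.

```typescript
// Minimal sketch only -- the point is that the default path is pure,
// synchronous code: no network calls, no LLM.

interface TaskParams {
  taskType: string; // e.g. "code_generation"
  complexity: "low" | "medium" | "high";
}

interface Recommendation {
  task_type: string;
  recommended_model: string;
  reason: string;
}

interface Rule {
  matches: (p: TaskParams) => boolean;
  model: string;
  reason: string;
}

function recommend(params: TaskParams, rules: Rule[]): Recommendation {
  // Rules are pre-ordered by priority; the first match wins.
  const rule = rules.find((r) => r.matches(params));
  if (!rule) {
    // No clean match: flag low confidence instead of silently guessing.
    throw new Error("Ambiguous task: offer the Deep Analysis escalation");
  }
  return {
    task_type: params.taskType,
    recommended_model: rule.model,
    reason: rule.reason,
  };
}
```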

The "intelligence" at runtime

The core of the app is the ruleset. It contains task-type classification rules, model-tier mapping, and tie-breakers.

While I used AI to help author these rules initially, the final artifact is human-reviewable and human-owned. I’m not relying on an LLM to make a black-box runtime decision; I’m executing code.
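
To give a feel for what a human-reviewable artifact might look like, here's a hypothetical excerpt. The schema is my illustration for this post, not RightModel's actual file format.

```typescript
// Hypothetical ruleset excerpt -- illustrative schema, not the real one.
// The artifact is plain data: diffable in a PR, versioned in git, human-owned.
const ruleset = {
  version: "2025-01-15",
  classification: [
    { ifTaskMentions: ["refactor", "unit test"], taskType: "code_generation" },
    { ifTaskMentions: ["summarize", "tl;dr"], taskType: "summarization" },
  ],
  tiers: {
    code_generation: { tier1: "claude-3-5-sonnet", tier2: "gemini-2.5-flash" },
  },
  tieBreakers: ["lowest_cost", "lowest_latency"],
};
```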

Solving the staleness problem

The LLM landscape moves fast, so a static ruleset needs to stay up to date. To keep RightModel accurate without making live API calls during a user request, the app pulls fresh pricing data from OpenRouter on a schedule, triggered by Google Cloud Scheduler (another scheduling service would work just as well, depending on your architecture).
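
Here's a rough sketch of what that refresh job could look like as a scheduled HTTP target. OpenRouter's public `/api/v1/models` endpoint is real, but treat the exact response fields and the `refreshPricing` wiring as assumptions to verify against their docs.

```typescript
// Sketch of the scheduled refresh job (e.g. a Cloud Scheduler -> Cloud
// Function target). Response fields are assumptions worth verifying.
import { writeFile } from "node:fs/promises";

async function refreshPricing(): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`OpenRouter returned ${res.status}`);
  const { data } = (await res.json()) as {
    data: { id: string; pricing: { prompt: string; completion: string } }[];
  };

  // Only the pricing data is regenerated; the rule logic stays hand-authored.
  const snapshot = {
    refreshedAt: new Date().toISOString(), // surfaced in the app's footer
    prices: Object.fromEntries(data.map((m) => [m.id, m.pricing])),
  };
  await writeFile("pricing.json", JSON.stringify(snapshot, null, 2));
}
```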

Notice what gets regenerated: the pricing data, not the rule logic. The logic remains a curated, human-authored layer. For transparency, I also surface the staleness directly, with a footer stating exactly when the data was last refreshed.

AI as an escalation path

Sometimes, requests don't fit cleanly into a ruleset. A task might trigger an "ambiguous" or "low confidence" flag.

When this happens, RightModel doesn't perform a silent fallback or an automatic, expensive upgrade. Instead, the user sees an explicit "Deep Analysis" button. This LLM call is powered by Gemini 2.5 Flash, but I plan to tweak this based on user feedback and technology updates.
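
A sketch of how that explicit escalation might be wired. The endpoint below is Gemini's REST `generateContent` API, but double-check the exact request and response shapes against the current docs; the key point is that nothing fires until the user clicks the button.

```typescript
// Sketch of the user-triggered escalation. Nothing runs automatically:
// this function is called only when someone clicks "Deep Analysis".
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent";

async function deepAnalysis(taskDescription: string): Promise<string> {
  const res = await fetch(`${GEMINI_URL}?key=${process.env.GEMINI_API_KEY}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      contents: [
        { parts: [{ text: `Recommend the best LLM for this task: ${taskDescription}` }] },
      ],
    }),
  });
  if (!res.ok) throw new Error(`Gemini returned ${res.status}`);
  const json = await res.json();
  return json.candidates[0].content.parts[0].text;
}
```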

Enter: Precomputed AI

Building this app made me realize this architecture isn't limited to picking models. A happy accident, really: I've been calling this pattern Precomputed AI.

At its core, Precomputed AI shifts LLM reasoning out of the real-time request path and into an asynchronous build pipeline. It requires three specific properties, all of which power RightModel:

  • A versioned artifact (the ruleset)
  • A regeneration cadence (the pricing cron and visible staleness)
  • A declared escalation path (the Deep Analysis button)
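
One hypothetical way to make those three properties concrete in types (my illustration, not a spec):

```typescript
// Illustrative shape for a Precomputed AI artifact -- the three properties
// from the list above, made concrete.
interface PrecomputedArtifact<T> {
  version: string;        // 1. versioned artifact (the ruleset)
  generatedAt: string;    // feeds the visible staleness footer
  refreshCadence: string; // 2. regeneration cadence, e.g. "daily"
  payload: T;             // the ruleset plus the pricing snapshot
  escalation: {           // 3. declared escalation path
    trigger: "ambiguous" | "low_confidence";
    action: "deep_analysis_button";
  };
}
```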

What do you think?

If you're shipping LLM-powered tools right now, I challenge you to ask yourself: which parts of your reasoning actually need to be live?

You can read more at the Precomputed AI website, and try out the RightModel app. I'd particularly value feedback from people creating AI-powered apps and solutions.
