I started using AI APIs about a year ago for side projects I was hacking on in the evenings. Nothing production scale.
By month three I was running up about $80 a month in charges. Not wild, but when I broke it down, I was spending way more than I needed to. Half of what I was doing could have run on a cheap model for pennies. I was just lazy.
Here's what I actually changed:
First, I stopped using the flagship for everything. My defaults were Claude 3.5 Sonnet and GPT-4o. Both great. Both way overpowered for half of what I asked them.
I had a little utility that turned a messy chunk of text into a clean title. Take in a paragraph, return one sentence. I was using Sonnet at $3 input and $15 output per million tokens. For a task a much simpler model could handle.
Swapping that one call to Gemini 2.5 Flash Lite at $0.10 input and $0.40 output cut the per-request cost by about 30x. Output quality was identical.
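To see where that ~30x comes from, here's the arithmetic for a hypothetical call. The token counts are my assumption (roughly 400 in, 30 out for a title-cleanup request); the per-million prices are the ones quoted above:

```python
# Prices in USD per million tokens, from the quotes above.
# Token counts for the title-cleanup call are a guess: ~400 in, ~30 out.
sonnet = 400 / 1e6 * 3.00 + 30 / 1e6 * 15.00      # $0.00165 per request
flash_lite = 400 / 1e6 * 0.10 + 30 / 1e6 * 0.40   # $0.000052 per request

print(f"ratio: {sonnet / flash_lite:.0f}x")        # roughly the 30x above
```

The exact multiple shifts with the input/output split, because output tokens cost several times more than input tokens on both models, but it stays in the same ballpark for short transforms.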
Rule I follow now: if the task is "transform this text a little," try a budget model first. Only reach for a flagship if the budget one actually fails.
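That rule is easy to encode as a tiny router. This is a sketch under my own assumptions: `cheap_model` and `flagship` stand in for real API calls, and `looks_ok` is whatever cheap validity check your task allows (length, format, a required keyword):

```python
def route(prompt, cheap_model, flagship, looks_ok):
    """Try the budget model first; escalate only if its output fails a check.

    cheap_model / flagship are callables taking a prompt and returning text;
    looks_ok is a cheap validator (format check, length check, etc.).
    """
    draft = cheap_model(prompt)
    if looks_ok(draft):
        return draft
    return flagship(prompt)

# Toy usage: the "models" are stand-in functions, the check is
# "one non-empty line" (a plausible check for a title-cleanup task).
cheap = lambda p: p.strip().splitlines()[0] if p.strip() else ""
smart = lambda p: "A carefully considered title"
one_line = lambda s: bool(s) and "\n" not in s

print(route("Messy paragraph about routing...\nmore text", cheap, smart, one_line))
```

The validator is doing the real work here; the cheaper your check, the more calls you can safely keep on the budget model.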
Second, I cached and trimmed my system prompts. Every major provider offers prompt caching now. Anthropic gives you 90 percent off cached tokens. OpenAI does it automatically once your prompt goes over 1,024 tokens.
At 3,000 calls a month with a 600 token system prompt, that prompt alone was costing me $5.40 on Sonnet. With caching, 54 cents.
While I was in there, I actually read my prompt for the first time in months. It was a mess. "Please provide a response." "It would be helpful if you could." Politeness costs tokens. I cut it from 600 tokens to 300. Half the input cost on that prompt, forever.
Read your system prompt out loud. If it sounds like a cover letter, it's too long.
Third, I got tired of doing the math. For every new model I wanted to try, I was running the same spreadsheet. Input tokens times price per million. Output tokens times price per million. Add. Check caching. It took long enough that I'd just pick something and hope.
So I built it into a tool. It's at quantacost.com. Paste text, pick a model, see what it costs. Compare 39 models side by side. Free, no signup. Prices are verified every morning against the official pricing pages, because I got burned once using someone else's calculator with numbers that were a year stale.
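The spreadsheet math the tool replaces is just this. The price table below only contains the two models mentioned in this post, copied from the numbers above; treat it as a snapshot, since prices change:

```python
# Per-million-token prices mentioned in this post (USD), as of writing.
PRICES = {
    "claude-3.5-sonnet": (3.00, 15.00),      # (input, output)
    "gemini-2.5-flash-lite": (0.10, 0.40),
}

def compare(in_tokens, out_tokens, prices=PRICES):
    """Return (model, cost_usd) pairs for one request, sorted cheapest-first."""
    costs = {
        model: in_tokens / 1e6 * inp + out_tokens / 1e6 * out
        for model, (inp, out) in prices.items()
    }
    return sorted(costs.items(), key=lambda kv: kv[1])

for model, cost in compare(400, 30):
    print(f"{model:24s} ${cost:.6f}")
```

Multiply any row by your monthly call volume and the "which model should I use" question usually answers itself.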
The right model for most tasks is not the smartest one. It's the cheapest one that doesn't fail.
Top comments (1)
This resonates. There's something quietly humbling about realizing you've been using a scalpel to spread butter.
The part that stuck with me isn't the caching or the prompt trimming—though both are solid—it's the psychological layer of it. We default to the "best" model not because we've evaluated the task, but because using the flagship feels like we're doing better work. There's a weird, subconscious prestige tied to picking GPT-4o or Sonnet, even when the output is functionally identical. It's like driving a sports car to check the mail.
Your rule about "transform this text a little" is a good heuristic. It makes me wonder how many of my own API calls are just glorified `sed` commands with a polite voice attached. We've gotten so good at offloading thinking to the cloud that we forgot some things just need a blunt instrument, not a philosopher.

Curious: did you notice any change in your own writing or code once you stopped expecting the model to handle all the nuance? I found that when I switched to cheaper models, I started being more precise with my prompts because I couldn't lean on the model's raw intelligence to fill in my vague gaps. It was an accidental discipline.