Choosing a model shouldn't mean opening five browser tabs and doing math. But that's the current state of things: every provider has its own pricing page, its own token definitions, and its own fine print.
llmprices is a CLI that cuts through this. Give it a prompt or a token count, and it returns a sorted cost comparison across providers in seconds. There's also a web version at llmcost.run.
# Find best price-to-quality ratio
llm-cost calc "Build a Python REST API" --sort value
# Filter by capability tier
llm-cost list --tier advanced
# Combine both
llm-cost calc "Code task" --tier advanced --sort value --top 5
The Problem
LLM pricing is more complex than it looks at first glance.
Input and output tokens are priced separately, and the gap between them is significant: output can cost anywhere from 3x to 20x more than input, depending on the provider. Context window size affects whether a model is even viable for a given task, regardless of its per-token rate. And perhaps most importantly, the cheapest model isn't always the best choice: a $0.06/1M input model is excellent for classification pipelines, but the wrong tool for complex reasoning or code generation.
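To see how those two rates combine, here's a minimal sketch of the arithmetic, using made-up per-million prices rather than any real provider's rates:

# Hypothetical rates, for illustration only: $0.50 per 1M input tokens,
# $5.00 per 1M output tokens (a 10x output premium).
INPUT_RATE = 0.50 / 1_000_000
OUTPUT_RATE = 5.00 / 1_000_000

def cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A short prompt with a long answer is dominated by output cost...
print(cost(200, 2_000))   # 0.0101 -> ~99% of the total is output
# ...while the same token volume in reverse is ~5x cheaper.
print(cost(2_000, 200))   # 0.0020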
There's no neutral, up-to-date source that puts all of this in one place. llmprices is an attempt to be that.
Core Commands
pip install llmprices
calc
Estimates the cost of a prompt across all models and sorts by price.
llm-cost calc "Build a Python REST API" --output 500
╭── Cost estimate · 7 input + 500 output ─────────────────────╮
│ #  Provider     Model               Total cost  vs cheapest │
│ 1  Mistral AI   Mistral Small 3.2   $0.000090   cheapest    │
│ 2  DeepSeek     DeepSeek V4 Flash   $0.000141   1.6x        │
│ 3  Google       Gemini 2.5 Flash-L  $0.000200   2.2x        │
│ 4  xAI          Grok 4.1 Fast       $0.000251   2.8x        │
│ 5  OpenAI       GPT-5.4 Nano        $0.000626   7.0x        │
│ 6  Anthropic    Claude Haiku 4.5    $0.002507   27.9x       │
│ 7  Google       Gemini 3.1 Pro      $0.006014   66.8x       │
│ 8  OpenAI       GPT-5.5             $0.015035   167.1x      │
╰─────────────────────────────────────────────────────────────╯
You can pass a raw text prompt (tokens are counted automatically) or skip it and specify token counts directly:
llm-cost calc --input 4000 --output 1000
llm-cost calc --input 10000 --output 2000 --top 5 --provider google
compare
Direct head-to-head comparison between specific models.
llm-cost compare gpt-5-5 claude-opus-4-7 gemini-3-1-pro
llm-cost compare claude-sonnet-4-6 gpt-5-4 gemini-3-flash --input 5000 --output 1000
llm-cost compare gpt-5-5 claude-opus-4-7 --prompt "Explain how transformers work"
list
Browse the full catalog with filtering and sorting.
llm-cost list
llm-cost list --provider anthropic
llm-cost list --sort output
llm-cost list --search gemini
Efficiency Tiers
Sorting by price is useful, but it doesn't account for capability. A budget model ranked first for cost might produce unusable output for a complex task. To address this, models are grouped into four tiers based on capability level — independently of price.
Flagship — GPT-5.5, Claude Opus 4.7, o3. Suitable for complex analytical tasks, critical decision-making, and high-quality creative work where output quality matters more than cost.
Advanced — GPT-5.4, Claude Sonnet 4.6, DeepSeek R1, o4 Mini. A solid default for most professional workloads: code generation, detailed analysis, structured output.
Standard — GPT-5, Claude Haiku 4.5, Gemini Flash, Mistral Large 3. Reliable for everyday tasks, basic text processing, and simple question-answering at scale.
Budget — Mistral Small 3.2, Command R7B, DeepSeek V4 Flash. Best for high-volume pipelines, classification, and prototyping where task complexity is low.
llm-cost list --tier budget
llm-cost calc "Code review" --tier advanced --sort value
llm-cost calc "Complex analysis" --tier flagship --sort value
Some Things the Data Makes Clear
Running calc across different workloads reveals a few patterns worth keeping in mind.
The cost spread is wider than most people expect. For the same token count, the cheapest available model is often 100–200x less expensive than the most capable one. The vs cheapest column in the output makes this concrete rather than abstract.
At scale, tier selection matters more than model selection within a tier. For a pipeline processing 1 million classification requests at 100 input + 50 output tokens, budget-tier models land at $11–28 total. The same volume on GPT-5.5 costs around $2,000. For tasks that don't require flagship reasoning, that's a 99%+ cost reduction with no meaningful quality tradeoff.
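Here's that back-of-envelope math as a short sketch. The rates below are illustrative stand-ins (the authoritative numbers live in prices.yaml), chosen to roughly match the figures above:

# Pipeline: 1,000,000 requests, each ~100 input + 50 output tokens.
REQUESTS = 1_000_000
IN_TOKENS, OUT_TOKENS = 100, 50

def pipeline_cost(input_rate_per_m: float, output_rate_per_m: float) -> float:
    total_in = REQUESTS * IN_TOKENS    # 100M input tokens
    total_out = REQUESTS * OUT_TOKENS  # 50M output tokens
    return total_in / 1e6 * input_rate_per_m + total_out / 1e6 * output_rate_per_m

# Illustrative per-1M-token rates only:
print(pipeline_cost(0.06, 0.20))   # budget-tier model   -> $16
print(pipeline_cost(5.00, 30.00))  # flagship-tier model -> $2,000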
Context window size is a hidden cost variable. A model with a lower per-token rate but a 128K context limit can end up more expensive than a pricier 1M-context model once you account for chunking logic, additional requests, and the engineering overhead of working around the constraint.
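A rough sketch of the chunking effect, assuming a naive sliding-window split where every extra chunk re-sends a block of shared instructions or context:

import math

def chunked_requests(doc_tokens: int, context_limit: int, shared_tokens: int) -> tuple[int, int]:
    # Returns (number of requests, total input tokens actually billed).
    usable = context_limit - shared_tokens
    requests = math.ceil(doc_tokens / usable)
    return requests, doc_tokens + (requests - 1) * shared_tokens

# Hypothetical 600K-token corpus with 8K tokens of shared context per chunk.
print(chunked_requests(600_000, 1_000_000, 8_000))  # (1, 600000)  one pass
print(chunked_requests(600_000, 128_000, 8_000))    # (5, 632000)  5 requests plus a merge step

The extra tokens are modest in this toy case; the real cost is the additional requests and the merge logic you now have to build and maintain.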
The practical workflow is: start with advanced, drop to budget if the output quality is sufficient for the task, and escalate to flagship only when the task genuinely requires it.
Contributing
Prices are stored in a plain YAML file at llm_cost/data/prices.yaml. Adding a new model takes four fields:
my-new-model:
  name: My New Model
  input: 1.50       # $ per 1M input tokens
  output: 6.00      # $ per 1M output tokens
  context: 200000   # context window in tokens
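As a standalone illustration (not the package's actual code path), those fields are enough to reproduce a cost estimate by hand:

import yaml  # pip install pyyaml

entry = yaml.safe_load("""
my-new-model:
  name: My New Model
  input: 1.50
  output: 6.00
  context: 200000
""")["my-new-model"]

def estimate(input_tokens: int, output_tokens: int) -> float:
    # input/output rates in the file are dollars per 1M tokens
    return (input_tokens * entry["input"] + output_tokens * entry["output"]) / 1e6

print(f"${estimate(4000, 1000):.6f}")  # $0.012000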
PyPI: pypi.org/project/llmprices
Web: llmcost.run
Source: github.com/madeburo/llmcost



