Choosing a model shouldn't mean opening five browser tabs and doing math. But that's the current state of things: every provider has its own pricing page, its own token definitions, and its own fine print.
llmprices is a CLI that cuts through this. Give it a prompt or a token count, and it returns a sorted cost comparison across providers in seconds. There's also a web version at llmcost.run.
# Find best price-to-quality ratio
llm-cost calc "Build a Python REST API" --sort value
# Filter by capability tier
llm-cost list --tier advanced
# Combine both
llm-cost calc "Code task" --tier advanced --sort value --top 5
The Problem
LLM pricing is more complex than it looks at first glance.
Input and output tokens are priced separately, and the gap between them is significant: output can cost anywhere from 3x to 20x more than input, depending on the provider. Context window size affects whether a model is even viable for a given task, regardless of its per-token rate. And perhaps most importantly, the cheapest model isn't always the best choice: a $0.06/1M input model is excellent for classification pipelines, but the wrong tool for complex reasoning or code generation.
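To see how those two rates combine, here's a minimal sketch of the arithmetic, using made-up per-million prices rather than any real provider's rates:

# Hypothetical rates, for illustration only: $0.50 per 1M input tokens,
# $5.00 per 1M output tokens (a 10x output premium).
INPUT_RATE = 0.50 / 1_000_000
OUTPUT_RATE = 5.00 / 1_000_000

def cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A short prompt with a long answer is dominated by output cost...
print(cost(200, 2_000))   # 0.0101 -> ~99% of the total is output
# ...while the same token volume in reverse is ~5x cheaper.
print(cost(2_000, 200))   # 0.0020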
There's no neutral, up-to-date source that puts all of this in one place. llmprices is an attempt to be that.
Core Commands
pip install llmprices
calc
Estimates the cost of a prompt across all models and sorts by price.
llm-cost calc "Build a Python REST API" --output 500
╭── Cost estimate · 7 input + 500 output ─────────────────────╮
│ #  Provider     Model               Total cost  vs cheapest │
│ 1  Mistral AI   Mistral Small 3.2   $0.000090   cheapest    │
│ 2  DeepSeek     DeepSeek V4 Flash   $0.000141   1.6x        │
│ 3  Google       Gemini 2.5 Flash-L  $0.000200   2.2x        │
│ 4  xAI          Grok 4.1 Fast       $0.000251   2.8x        │
│ 5  OpenAI       GPT-5.4 Nano        $0.000626   7.0x        │
│ 6  Anthropic    Claude Haiku 4.5    $0.002507   27.9x       │
│ 7  Google       Gemini 3.1 Pro      $0.006014   66.8x       │
│ 8  OpenAI       GPT-5.5             $0.015035   167.1x      │
╰─────────────────────────────────────────────────────────────╯
You can pass a raw text prompt (tokens are counted automatically) or skip it and specify token counts directly:
llm-cost calc --input 4000 --output 1000
llm-cost calc --input 10000 --output 2000 --top 5 --provider google
compare
Direct head-to-head comparison between specific models.
llm-cost compare gpt-5-5 claude-opus-4-7 gemini-3-1-pro
llm-cost compare claude-sonnet-4-6 gpt-5-4 gemini-3-flash --input 5000 --output 1000
llm-cost compare gpt-5-5 claude-opus-4-7 --prompt "Explain how transformers work"
list
Browse the full catalog with filtering and sorting.
llm-cost list
llm-cost list --provider anthropic
llm-cost list --sort output
llm-cost list --search gemini
Efficiency Tiers
Sorting by price is useful, but it doesn't account for capability. A budget model ranked first for cost might produce unusable output for a complex task. To address this, models are grouped into four tiers based on capability level — independently of price.
Flagship — GPT-5.5, Claude Opus 4.7, o3. Suitable for complex analytical tasks, critical decision-making, and high-quality creative work where output quality matters more than cost.
Advanced — GPT-5.4, Claude Sonnet 4.6, DeepSeek R1, o4 Mini. A solid default for most professional workloads: code generation, detailed analysis, structured output.
Standard — GPT-5, Claude Haiku 4.5, Gemini Flash, Mistral Large 3. Reliable for everyday tasks, basic text processing, and simple question-answering at scale.
Budget — Mistral Small 3.2, Command R7B, DeepSeek V4 Flash. Best for high-volume pipelines, classification, and prototyping where task complexity is low.
llm-cost list --tier budget
llm-cost calc "Code review" --tier advanced --sort value
llm-cost calc "Complex analysis" --tier flagship --sort value
Some Things the Data Makes Clear
Running calc across different workloads reveals a few patterns worth keeping in mind.
The cost spread is wider than most people expect. For the same token count, the cheapest available model is often 100–200x less expensive than the most capable one. The vs cheapest column in the output makes this concrete rather than abstract.
At scale, tier selection matters more than model selection within a tier. For a pipeline processing 1 million classification requests at 100 input + 50 output tokens, budget-tier models land at $11–28 total. The same volume on GPT-5.5 costs around $2,000. For tasks that don't require flagship reasoning, that's a 99%+ cost reduction with no meaningful quality tradeoff.
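Here's that back-of-envelope math as a short sketch. The rates below are illustrative stand-ins (the authoritative numbers live in prices.yaml), chosen to roughly match the figures above:

# Pipeline: 1,000,000 requests, each ~100 input + 50 output tokens.
REQUESTS = 1_000_000
IN_TOKENS, OUT_TOKENS = 100, 50

def pipeline_cost(input_rate_per_m: float, output_rate_per_m: float) -> float:
    total_in = REQUESTS * IN_TOKENS    # 100M input tokens
    total_out = REQUESTS * OUT_TOKENS  # 50M output tokens
    return total_in / 1e6 * input_rate_per_m + total_out / 1e6 * output_rate_per_m

# Illustrative per-1M-token rates only:
print(pipeline_cost(0.06, 0.20))   # budget-tier model   -> $16
print(pipeline_cost(5.00, 30.00))  # flagship-tier model -> $2,000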
Context window size is a hidden cost variable. A model with a lower per-token rate but a 128K context limit can end up more expensive than a pricier 1M-context model once you account for chunking logic, additional requests, and the engineering overhead of working around the constraint.
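A rough sketch of the chunking effect, assuming a naive sliding-window split where every extra chunk re-sends a block of shared instructions or context:

import math

def chunked_requests(doc_tokens: int, context_limit: int, shared_tokens: int) -> tuple[int, int]:
    # Returns (number of requests, total input tokens actually billed).
    usable = context_limit - shared_tokens
    requests = math.ceil(doc_tokens / usable)
    return requests, doc_tokens + (requests - 1) * shared_tokens

# Hypothetical 600K-token corpus with 8K tokens of shared context per chunk.
print(chunked_requests(600_000, 1_000_000, 8_000))  # (1, 600000)  one pass
print(chunked_requests(600_000, 128_000, 8_000))    # (5, 632000)  5 requests plus a merge step

The extra tokens are modest in this toy case; the real cost is the additional requests and the merge logic you now have to build and maintain.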
The practical workflow is: start with advanced, drop to budget if the output quality is sufficient for the task, and escalate to flagship only when the task genuinely requires it.
Contributing
Prices are stored in a plain YAML file at llm_cost/data/prices.yaml. Adding a new model takes four fields:
my-new-model:
  name: My New Model
  input: 1.50       # $ per 1M input tokens
  output: 6.00      # $ per 1M output tokens
  context: 200000   # context window in tokens
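As a standalone illustration (not the package's actual code path), those fields are enough to reproduce a cost estimate by hand:

import yaml  # pip install pyyaml

entry = yaml.safe_load("""
my-new-model:
  name: My New Model
  input: 1.50
  output: 6.00
  context: 200000
""")["my-new-model"]

def estimate(input_tokens: int, output_tokens: int) -> float:
    # input/output rates in the file are dollars per 1M tokens
    return (input_tokens * entry["input"] + output_tokens * entry["output"]) / 1e6

print(f"${estimate(4000, 1000):.6f}")  # $0.012000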
PyPI: pypi.org/project/llmprices
Web: llmcost.run
Source: github.com/madeburo/llmcost



