Most LLM providers bill by the token. For simple chat queries this feels predictable, but once you start building agents, processing documents, or maintaining multi-turn context, token counts grow quickly and costs scale with every character you send. Understanding how token-based pricing works is the first step toward optimizing your inference spend.
What Are Tokens in LLM Pricing?
Tokens are the atomic units of text that language models process. A token can represent a single character, a fragment of a word, or a common word sequence. Providers like OpenAI, Together AI, Fireworks AI, and OpenRouter charge by counting the tokens in your API request and in the model's response. Typically, pricing is split into two buckets: input tokens (everything you send, including the system prompt, conversation history, and any document context) and output tokens (the generated completion).
This distinction matters because input and output rates often differ. Output tokens usually cost more per unit than input tokens, but for long-context applications, input volume dominates the bill.
How Costs Scale with Context and Conversation
Token-based pricing is linear. If you double the prompt length, you double the input cost. If your application retains a rolling conversation history, every previous turn gets resent on each new request. A 4,000-token conversation becomes 8,000 tokens on the second turn, 12,000 on the third, and so on until you hit the model's context limit.
Agents compound this further. A single agentic loop might include a system prompt with tool definitions, a user query, retrieved documents from a vector store, prior reasoning steps, and function results appended back into context. Each loop can consume tens of thousands of tokens. Because token-based providers bill per token, these workloads become expensive to run at scale.
Estimating Token Count in Practice
Developers often underestimate how quickly text converts to tokens. As a rough rule, one token covers about three to four characters of English text, but code, multilingual text, or special characters can consume more.
Here is a simple Python snippet that demonstrates how a modest prompt payload balloons once you include system instructions, conversation history, and retrieved context:
import tiktoken
def estimate_cost(text, model="gpt-4"):
# Approximate encoding
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
system_prompt = (
"You are a senior software engineer. Review the code for bugs, "
"performance issues, and security vulnerabilities. Respond with structured JSON."
)
user_query = "Refactor this authentication module."
# Simulated RAG context
retrieved_docs = "\n\n".join([f"Document {i}: ..." for i in range(5)])
# Simulated conversation history
conversation_history = "\n".join(
[f"User: Question {i}\nAssistant: Answer {i}" for i in range(6)]
)
full_prompt = f"{system_prompt}\n{retrieved_docs}\n{conversation_history}\nUser: {user_query}\nAssistant:"
total_tokens = estimate_cost(full_prompt)
print(f"Estimated input tokens: {total_tokens}")
In this example, a routine RAG plus conversation flow can easily reach several thousand input tokens before the model generates a single response token. Over hundreds or thousands of requests, this linear scaling directly impacts your monthly bill.
Request-Based Pricing as a Flat Alternative
Oxlo.ai takes a different approach. Instead of metering every token, Oxlo.ai charges one flat cost per API request regardless of prompt length. That means your bill does not scale with input size, conversation history depth, or the number of documents you stuff into context.
For long-context and agentic workloads, this structure can be significantly cheaper than token-based alternatives. Whether you send a 200-token greeting or a 20,000-token payload with tool schemas and retrieved chunks, the cost per call stays the same. You can focus on building richer prompts and more capable agents without watching a meter run on every token.
Oxlo.ai offers 45+ open-source and proprietary models across seven categories, including LLMs like DeepSeek R1 671B MoE, Llama 3.3 70B, and Kimi K2.6, all fully OpenAI SDK compatible. There are no cold starts on popular models, and the platform supports streaming, function calling, JSON mode, and vision. You can explore the exact request-based rates on the Oxlo.ai pricing page.
When to Optimize Your Pricing Model
Token-based pricing works fine for short, stateless queries where input and output volumes are small and predictable. If your application mostly sends brief prompts and receives brief answers, the per-token difference may not matter.
However, if you are building any of the following, a flat per-request model removes the penalty for context richness:
- Document analysis pipelines that ingest entire PDFs
- Multi-turn chatbots with retained memory
- Agentic systems that loop through tool calls and reasoning steps
- Code generation tools with large file contexts
Oxlo.ai is built specifically for these workloads. With request-based pricing, you get predictable costs and the freedom to use the full context window of models like DeepSeek V4 Flash (1M context) or Kimi K2.6 (131K context) without linear cost growth.
Conclusion
Token-based pricing is straightforward on the surface, but its linear nature makes it expensive for the complex, long-context workloads that modern AI applications demand. By understanding how tokens accumulate, you can make informed decisions about your inference provider.
If your project involves agents, large context windows, or high-frequency API calls with heavy prompts, switching to a request-based provider like Oxlo.ai gives you predictable costs and removes the tax on prompt engineering. The API is a drop-in replacement for the OpenAI SDK, so you can test the difference with minimal migration effort.
Check out the Oxlo.ai pricing page to compare plans, or make your first request through the OpenAI-compatible endpoint at https://api.oxlo.ai/v1.
Top comments (0)