Mohamed Hamed

Posted on • Originally published at mohamedhamed.io

Part 8 — Token-by-Token: Why AI Generates Text One Word at a Time (And Why It Costs 4x More)

THE HIDDEN TAX OF AI — Output Is King

  Input cost:  $2.50 per 1M tokens (GPT-4o)
  Output cost: $10.00 per 1M tokens (GPT-4o) — 4x more

The reason? The AI writes very slowly on the inside — one token at a time.

Last article we saw the Transformer architecture. Today we watch it in action during live generation — and discover why the output side is 4x more expensive.

Here's something that surprises most developers when they first hear it:

ChatGPT doesn't compose its answer in advance and then display it.

It predicts one token. Then another. Then another. Each prediction uses the previous ones as context. It's not writing — it's recursively predicting.

Remember how the Transformer reads everything in parallel (previous article)? Generation flips that on its head — now it's forced to be sequential because each new token depends on the last.

And understanding this one fact changes how you design prompts, control API costs, build streaming UIs, and debug unexpected AI behavior.


The One Thing an LLM Actually Does

Strip away all the complexity and a large language model does exactly one thing:

Given all the tokens it has seen so far, predict the single most likely next token.

  📜 CONTEXT → 🤖 LLM → ✨ ONE TOKEN

  ↩ RECURSIVE LOOP — output fed back as next input

Think of it like predictive text on your phone — except instead of suggesting 3 words, it's choosing from 100,000+ possible tokens, and it does this thousands of times to build a complete response.
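The loop itself is tiny. Here's a minimal Python sketch of that recursive loop; `predict_next_token` and its hard-coded lookup table are toy stand-ins for the real model, which would score ~100K candidate tokens at each step:

```python
# Toy "model": maps a context to its hard-coded most-likely next token.
TOY_MODEL = {
    (): "Ray",
    ("Ray",): "-",
    ("Ray", "-"): "Ban",
    ("Ray", "-", "Ban"): "[END]",
}

def predict_next_token(context: tuple) -> str:
    # A real LLM produces a probability distribution over ~100K tokens here;
    # we fake it with a dictionary lookup.
    return TOY_MODEL.get(context, "[END]")

def generate(max_tokens: int = 10) -> list:
    tokens = []
    for _ in range(max_tokens):
        next_token = predict_next_token(tuple(tokens))  # predict ONE token
        if next_token == "[END]":
            break
        tokens.append(next_token)  # feed the output back in as context
    return tokens

print(generate())  # → ['Ray', '-', 'Ban']
```

The key point is the feedback: each predicted token is appended to the context before the next prediction, which is exactly the recursive loop in the diagram above.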


Step-by-Step: How Generation Actually Works

Let's trace through a real example. Prompt: "What's the best smart glasses?"

  Step 1: "What's the best smart glasses?" + [START] → "Ray" (35%) ⭐
  Step 2: "What's...glasses?" + "Ray" → "-" (85%) ⭐
  Step 3: "What's...glasses?" + "Ray-" → "Ban" (95%) ⭐
  Steps 4+: "...Ray-Ban" + all previous → "Meta" → "Ultra" → "because" → ... → [END]

✅ Final response (assembled from sequential predictions):
"Ray-Ban Meta Ultra — lightweight, 48MP camera, translates 40 languages, full-day battery."
Generated token-by-token — never computed all at once.

Autoregressive Generation: The Mathematical Reality

The formal name for this process is autoregressive generation — each output token becomes part of the input for the next prediction. (This is the same "next-token prediction" that the training loop from Article 4 taught the model to do — except now it's happening live during inference.)

This creates a critical asymmetry in how the model works:

| Response Length | Generation Steps | Implication |
|---|---|---|
| 10 tokens (~8 words) | 10 sequential predictions | Fast, cheap |
| 100 tokens (~75 words) | 100 sequential predictions | Moderate |
| 1,000 tokens (~750 words) | 1,000 sequential predictions | Slow, expensive |
| 4,000 tokens (a blog post) | 4,000 sequential predictions | Very slow, very expensive |

This is why output tokens cost 4x more than input tokens. Reading your 10,000-token prompt can be largely parallelized. But generating each output token requires a sequential forward pass through the full model — there's no way to batch or parallelize this without changing the output.


The Probability Distribution: Every Token Is a Vote

At each generation step, the model doesn't just know the one "right" answer. It produces a probability distribution over its entire vocabulary — every possible next token, each with a likelihood score.

Probability distribution after "What's the best smart...":

| Next Token | Probability |
|---|---|
| "Ray" | 35% |
| "Apple" | 20% |
| "Meta" | 15% |
| "currently" | 8% |
| ...all others (~100K tokens) | ~22% |

⚠️ The model doesn't always pick the highest-probability token — that's controlled by Temperature (a topic for another article).

This is the same softmax activation we saw inside the neuron (Article 3) and Transformer block (Article 6) — here it converts raw logits over the full vocabulary into a probability distribution over what to say next.

The model selects one token, appends it to the context, and runs the entire prediction process again. This continues until it generates an [END] token or hits a maximum length.
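To make that softmax step concrete, here's a small pure-Python sketch that turns logits for a five-entry toy vocabulary into a probability distribution. The logit values are invented for illustration, not real model outputs:

```python
import math

def softmax(logits: list) -> list:
    # Subtract the max for numerical stability, then exponentiate and normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a tiny 5-entry "vocabulary" (invented numbers)
vocab = ['"Ray"', '"Apple"', '"Meta"', '"currently"', "...others"]
logits = [3.0, 2.4, 2.1, 1.5, 2.5]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{token:12s} {p:.1%}")

# Greedy decoding takes the argmax; temperature and sampling strategies
# let the model pick lower-probability tokens instead.
greedy_pick = vocab[probs.index(max(probs))]
print("greedy pick:", greedy_pick)  # → greedy pick: "Ray"
```

Whatever the logit values are, softmax guarantees the outputs are positive and sum to 1 — a proper distribution the model can sample from.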


KV Cache: The Hidden Optimization That Makes This Workable

Here's the obvious problem with autoregressive generation: if each new token requires the model to re-read the entire context (your prompt + all previous outputs), the computation time would grow quadratically. A 1,000-token response from a 10,000-token prompt would be impossibly slow.

The solution is the KV Cache (Key-Value Cache):

❌ Without KV Cache — every new token requires reprocessing the entire context from scratch:

  Token 1: read all 10K input tokens
  Token 2: read all 10K input tokens again
  Token 3: read all 10K input tokens again
  ... (~10,000x overhead per token)

Slow + Expensive 💸

⚡ With KV Cache — the attention Keys and Values for processed tokens are stored and reused:

  Input: compute K/V once, cache
  Token 1: only compute the new token
  Token 2: only compute the new token
  ... (reuse cached K/Vs)

Fast + Smart 🚀

How it works technically: During the Transformer's attention computation, every token produces a Key (K) and Value (V) vector. These don't change for tokens already processed. The KV Cache stores them in GPU memory, so each new generation step only needs to compute the K and V for the one new token. This reuse is only possible because of the self-attention mechanism from the previous article — without Q/K/V, there would be nothing to cache.

This is also why reading input is cheaper than generating output — the entire input can be processed in one forward pass with full parallelization, while output tokens must be generated one at a time even with the cache.
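A quick back-of-envelope sketch (my own toy accounting, not a real profiler) shows why the cache matters. It counts how many K/V vector computations a 10,000-token prompt plus 1,000-token response needs with and without caching:

```python
def kv_work(prompt_len: int, output_len: int, cached: bool) -> int:
    """Count K/V vector computations needed during generation.

    Toy accounting: without a cache, every step recomputes K/V for the
    whole context so far; with a cache, each token's K/V is computed
    exactly once and reused afterwards.
    """
    if cached:
        # Prompt K/V computed once, then one new K/V per generated token.
        return prompt_len + output_len
    # Each of the `output_len` steps reprocesses the entire context so far.
    return sum(prompt_len + i for i in range(output_len))

prompt, output = 10_000, 1_000
without = kv_work(prompt, output, cached=False)
with_cache = kv_work(prompt, output, cached=True)
print(f"without cache: {without:,} K/V computations")   # → 10,499,500
print(f"with cache:    {with_cache:,} K/V computations")  # → 11,000
print(f"reduction:     ~{without // with_cache}x")
```

With these numbers the cached path does roughly a thousandth of the work — which is why every serving stack implements it.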

How This All Fits Inside the Transformer We Just Learned

Every generation step is a partial forward pass through the full Transformer stack:

1. The new token passes through Positional Encoding (Article 6) — it gets a position vector so the model knows it's token #347, not #1.

2. Multi-Head Self-Attention runs — but with the KV Cache, only the new token's Q is computed fresh; all previous K/V pairs are retrieved from cache.

3. The result flows through the Feed-Forward layers (where the neurons from Article 3 live) — all 96 layers, stacked.

4. The final layer outputs a probability distribution via softmax over the 100K+ vocabulary — one token is selected, appended, and the loop repeats.

TTFT and Throughput: The Two Metrics That Matter

When building AI applications, two performance metrics dominate:

⚡ TTFT (Time to First Token)
How long before the user sees the first word of the response.
Dominated by: Input processing time. Bigger prompts = longer TTFT.
Why it matters: Users perceive TTFT as "responsiveness." A 3-second TTFT feels laggy even if generation speed is fast.


📊 Throughput (Tokens/Second)
How fast the model generates tokens after the first one appears.
Dominated by: Model size, hardware, and batch efficiency.
Why it matters: For long responses, throughput determines total completion time. GPT-4o: ~100-150 tok/s. Gemini Flash: ~300+ tok/s.

⚡ Developer tip: TTFT is your user experience problem
If your system prompt is 5,000 tokens and your users are sending 2,000-token prompts, your TTFT could be 2-4 seconds before the response even begins. Consider prompt caching, smaller system prompts, or showing a loading indicator that accounts for TTFT specifically.


The Real Cost Impact: A Developer's Calculator

Since output generation is 4x more expensive than input processing, how you instruct the model affects your bill more than how much data you send.

Scenario: 1,000 API calls per day on GPT-4o

❌ "Write a detailed response" — 500 output tokens per call:
  500 tok × 1,000 calls = 500K tokens
  500K ÷ 1M × $10.00 = $5.00/day → $150/month

✅ "Be concise, 1-2 sentences" — 100 output tokens per call:
  100 tok × 1,000 calls = 100K tokens
  100K ÷ 1M × $10.00 = $1.00/day → $30/month

Same task. Same quality. 5x cost difference — just by controlling output length in your prompt.

Streaming: The UX Secret

Since the model generates token-by-token anyway, streaming is free — you can show each token to the user as it's produced instead of waiting for the complete response.

Without Streaming
User stares at a blank screen for 5 seconds. Then the entire 400-word response appears at once. Perceived as "slow AI."


With Streaming
User sees the first word appear in 0.5 seconds, then watches the response build. Perceived as "fast, responsive AI" — even if total time is the same.
```python
from openai import OpenAI

client = OpenAI()

# Streaming example — show tokens as they arrive
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the best smart glasses in 2026?"}],
    stream=True,    # ← This is all you need — stream=True is free and transforms UX
    max_tokens=150  # ← Control output length = control cost
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Output appears token by token, not all at once
```

This is literally how ChatGPT's web interface works — the streaming appearance is the natural behavior of the model surfaced directly to the user.


The One-Way Street: Generation Is Irreversible

Here's the most important practical implication of token-by-token generation:

The model cannot go back and correct a previous token.

Once "Ray" is generated and added to the context, the model is committed. Every subsequent token is conditioned on "Ray" appearing there. If the model had wanted to say "Apple" but statistical chance led it to generate "Ray" first, it now has to generate something coherent following "Ray" — it cannot reconsider.

Practical implication 1: Prompt quality matters more than you think
If your prompt is ambiguous, the model might generate an early token that commits it to the wrong interpretation. It will then generate a coherent-but-wrong response. Better prompts → better first tokens → better entire responses.


Practical implication 2: This is why hallucination happens
If the model generates a confident-sounding but wrong fact early in a response, it doesn't "realize" the mistake — it just continues generating tokens that are consistent with the wrong fact. This is why early hallucinations are so hard to fix — by the time the model "knows" it's on the wrong track, it has already committed 50 tokens to a false premise. The next article covers hallucination in depth.


Practical implication 3: Output format instructions help
If you instruct the model to output JSON or markdown at the start, it will generate the opening `{` or `#` token first, which statistically primes all subsequent tokens to follow that format. Prompts like "respond in JSON" work because they shape the first-token probability distribution.

Developer Quick Reference

| Concept | What It Means | Action For You |
|---|---|---|
| Autoregressive | Each token depends on all previous tokens | Longer outputs = more time + more cost |
| Output costs 4x | Generating is sequential; reading is parallel | Use max_tokens; prompt for conciseness |
| KV Cache | Input attention Keys/Values are cached and reused | Enable prompt caching for repeated system prompts |
| TTFT | Time to first token — perceived as "speed" | Keep prompts lean; always use streaming |
| Streaming | Show tokens as they're generated | Always enable in user-facing apps |
| Irreversible | The model can't backtrack and fix errors | Use clear prompts; consider structured outputs |
| Cost Formula | Total cost = (input_tokens × input_price) + (output_tokens × output_price), where output_price is ~4x input_price | Always estimate output tokens before building at scale |

The Core Insight

ChatGPT doesn't think — it predicts
The illusion of intelligent, fluent text is produced by a model that makes one probabilistic choice at a time, each choice constrained by everything that came before. There is no thinking ahead. There is no revision. There is only: given all of this, what word comes next? Done 100, 500, 2,000 times — very fast, very convincingly.


Pro Tips for Builders

💡 What Knowing Token Generation Changes For You

1. Always enable streaming in user-facing apps. It costs nothing extra and makes responses feel 3-5x faster to users. The perceived latency drop is the biggest free UX win in AI development.

2. Output length is your biggest cost lever. The difference between "explain in detail" and "explain in 2 sentences" can be a 5-10x cost reduction with no quality loss for many tasks.

3. Put output format instructions first. "Respond in JSON:" as the first line of your prompt statistically primes the first token to be {, which propagates through every subsequent token. The model doesn't plan ahead — it just follows the path its first token started.

4. Enable prompt caching for repeated system prompts. Anthropic and OpenAI both offer prompt caching — if your system prompt is 5,000 tokens and you send 10,000 requests/day, caching can cut your input costs by 80-90%.
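To put tip 4 in numbers, here's a hedged back-of-envelope calculator. The GPT-4o input price comes from the top of this article; the 90% cached-read discount is an assumption in line with the 80-90% range above, so check your provider's current pricing before relying on it:

```python
def caching_savings(system_tokens: int, requests_per_day: int,
                    input_price_per_m: float = 2.50,
                    cache_discount: float = 0.90) -> dict:
    """Estimate daily input-cost savings from prompt caching.

    Assumes the system prompt is served from cache on (essentially) every
    request, and that cached reads are billed at (1 - cache_discount) of
    the normal input price. Prices and discounts are illustrative.
    """
    full = system_tokens * requests_per_day / 1_000_000 * input_price_per_m
    cached = full * (1 - cache_discount)
    return {
        "without_caching": round(full, 2),
        "with_caching": round(cached, 2),
        "daily_savings": round(full - cached, 2),
    }

# The scenario from tip 4: 5,000-token system prompt, 10,000 requests/day
print(caching_savings(system_tokens=5_000, requests_per_day=10_000))
# → {'without_caching': 125.0, 'with_caching': 12.5, 'daily_savings': 112.5}
```

At that volume the system prompt alone is $125/day uncached — the kind of line item worth checking before it shows up on the bill.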

Try It Yourself

Experiment 1: Count the cost before building

```python
import tiktoken

def estimate_cost(prompt: str, expected_output_words: int) -> dict:
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    output_tokens = int(expected_output_words * 1.33)  # ~0.75 words per token

    input_cost = (input_tokens / 1_000_000) * 2.50
    output_cost = (output_tokens / 1_000_000) * 10.00

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_per_call": f"${input_cost + output_cost:.6f}",
        "daily_1000_calls": f"${(input_cost + output_cost) * 1000:.2f}"
    }

# Test it
result = estimate_cost(
    prompt="You are a helpful assistant. What are the best smart glasses?",
    expected_output_words=200
)
print(result)
# → roughly {'daily_1000_calls': '$2.70', ...} — the output cost dominates
```

Experiment 2: Visualize generation timing
Add timestamps to streaming output to see TTFT vs token rate:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.time()
first_token = True

for chunk in client.chat.completions.create(model="gpt-4o",
                                            messages=[...], stream=True):
    if chunk.choices[0].delta.content:
        if first_token:
            print(f"\n⚡ TTFT: {time.time() - start:.2f}s")
            first_token = False
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Experiment 3: stream=True vs stream=False — feel the UX difference

```python
import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Write a 3-paragraph summary of how LLMs work."}]

# Without streaming — user waits for the full response
start = time.time()
response = client.chat.completions.create(model="gpt-4o", messages=prompt, stream=False)
print(f"Non-streaming total wait: {time.time() - start:.2f}s")
print(response.choices[0].message.content)

# With streaming — user sees first token almost immediately
print("\n--- Streaming version ---")
start = time.time()
first_token_time = None
for chunk in client.chat.completions.create(model="gpt-4o", messages=prompt, stream=True):
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start
            print(f"⚡ TTFT: {first_token_time:.2f}s")
        print(chunk.choices[0].delta.content, end="", flush=True)
# Same total time — but perceived as much faster because content starts immediately
```
