📚 Tech Acronyms Reference
Quick reference for acronyms used in this article:
- AI - Artificial Intelligence
- API - Application Programming Interface
- BERT - Bidirectional Encoder Representations from Transformers
- BPE - Byte-Pair Encoding
- DB - Database
- GPT - Generative Pre-trained Transformer
- GPU - Graphics Processing Unit
- JSON - JavaScript Object Notation
- LLM - Large Language Model
- NLP - Natural Language Processing
- Q&A - Question and Answer
- RAG - Retrieval-Augmented Generation
- ROI - Return on Investment
- SQL - Structured Query Language
- TF-IDF - Term Frequency-Inverse Document Frequency
- XML - Extensible Markup Language
🎯 Introduction: Beyond the Hidden
Let's be real: most engineers interact with Large Language Models (LLMs) through a thin wrapper that hides what's actually happening. You send a string, you get a string back. It feels like magic.
But here's the thing—if you're building production LLM systems, especially as a data engineer responsible for pipelines that process millions of requests, you need to understand what's under the hood.
As a data engineer, you already know how to build pipelines, optimize queries, and manage infrastructure at scale. Now it's time to apply that same rigor to Artificial Intelligence (AI) systems—and understand the fundamentals that separate expensive experiments from Return on Investment (ROI)-positive production systems.
This isn't about reading research papers or implementing transformers from scratch. It's about understanding the three fundamental controls that determine:
- How much you'll pay (tokens)
- What quality you'll get (temperature)
- What constraints you're working within (context windows)
Miss these fundamentals, and you'll either blow your budget, ship unreliable systems, or both.
Let me show you why these three concepts matter, starting from first principles.
💡 Data Engineer's ROI Lens
Throughout this article, we'll view every concept through three questions:
- How does this impact cost? (Token efficiency, compute, storage)
- How does this affect reliability? (Consistency, error rates, failures)
- How does this scale? (Batch processing, throughput, latency)
These aren't just theoretical concepts—they're the levers that determine whether your AI initiative delivers value or burns budget.
🔤 Part 1: Tokenization Deep-Dive
What Actually IS a Token?
Here's what most people think: "A token is a word."
Wrong.
A token is a subword unit created through a process called Byte-Pair Encoding (BPE). It's the fundamental unit that Large Language Models (LLMs) process—not characters, not words, but something in between.
Why Subword Tokenization?
Think about it from a data engineering perspective. If we treated every unique word as a token, we'd have problems:
Problem 1: Vocabulary Explosion
- English has ~170,000 words in common use
- Add technical terms, proper nouns, typos, slang → millions of possible "words"
- Storing and computing with a multi-million token vocabulary? Computationally expensive and memory-intensive.
Problem 2: Out-of-Vocabulary Words
- What happens when the model sees "ChatGPT" but was only trained on "chat" and "GPT" separately?
- With word-level tokenization, you'd get an unknown token ([UNK]). Information lost.
The BPE Solution:
BPE builds a vocabulary by iteratively merging the most frequent character pairs.
Here's the intuition:
- Start with individual characters: ['h', 'e', 'l', 'l', 'o']
- Find the most frequent pair: 'l' + 'l' → merge into 'll'
- Continue: 'he' + 'llo' → 'hello' (if frequent enough)
- Common words become single tokens; rare words split into subwords (a toy version of this merge loop is sketched below)
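Here's that merge loop as a toy Python sketch. It's not a real tokenizer—production BPE works at the byte level over a huge training corpus—but it shows the core idea: count adjacent pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def get_pair_counts(corpus):
    # corpus: list of token sequences, e.g. [['h', 'e', 'l', 'l', 'o'], ...]
    counts = Counter()
    for tokens in corpus:
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += 1
    return counts

def merge_pair(tokens, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus = [list("hello"), list("hell"), list("help")]
for step in range(3):
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = [merge_pair(tokens, best) for tokens in corpus]
    print(f"merge {step + 1}: {best} -> {corpus}")
```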
Real Example:
Let's tokenize these strings (using GPT tokenizer):
"Hello World" → ["Hello", " World"] = 2 tokens
"Hello, World!" → ["Hello", ",", " World", "!"] = 4 tokens
"HelloWorld" → ["Hello", "World"] = 2 tokens
"hello world" → ["hello", " world"] = 2 tokens
Notice:
- Capitalization affects tokenization
- Punctuation often becomes separate tokens
- Spaces are part of tokens (notice " World" with leading space)
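You can check these splits yourself with OpenAI's tiktoken library, which exposes the tokenizers used by GPT models. A quick sketch (exact splits depend on which encoding you load):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 / GPT-3.5-turbo

for text in ["Hello World", "Hello, World!", "HelloWorld", "hello world"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]  # decode each id to see the split
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```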
The Stop Words Question: Do LLMs Care?
If you've worked with traditional Natural Language Processing (NLP) (think Term Frequency-Inverse Document Frequency (TF-IDF), bag-of-words), you know about stop words—common words like "the", "is", "at", "which" that are often filtered out because they carry little semantic meaning.
Here's the interesting part: LLMs don't use stop word lists. They tokenize everything.
Why?
Traditional NLP (Natural Language Processing) reasoning:
"The cat sat on the mat" → Remove stop words → "cat sat mat" → Easier processing, less noise
LLM (Large Language Model) reasoning:
"The cat sat on the mat" has grammatical structure. Those "meaningless" words actually encode relationships, tense, and context that matter for understanding.
Example:
- "The contract is valid" (present tense, current state)
- "The contract was valid" (past tense, no longer true)
That "is" vs "was" changes everything. Stop words matter.
But here's the tokenization insight:
Common words like "the", "is", "and" are so frequent that BPE assigns them single tokens. Rare words get split into multiple tokens.
"The" → 1 token (very common)
"Constantinople" → 4-5 tokens (less common)
"Antidisestablishmentarianism" → 8-10 tokens (rare)
So while LLMs don't filter stop words, they handle them efficiently through tokenization. Common words = cheap (1 token). Rare words = expensive (multiple tokens).
Data Engineering Implication:
When estimating token costs for text processing pipelines, documents with lots of common English words will be cheaper per character than documents with:
- Technical jargon
- Domain-specific terminology
- Non-English text
- Proper nouns and neologisms
A 1,000-word customer support ticket in plain English might be 1,300 tokens. A 1,000-word legal document with Latin phrases and case names might be 1,800+ tokens.
The Multilingual Problem
Here's where it gets expensive for data engineers building global systems:
English: "Hello" → 1 token
Japanese: "こんにちは" → 3-4 tokens (depending on tokenizer)
Arabic: "مرحبا" → 3-5 tokens
Code: `def hello_world():` → 5-7 tokens
Why?
Most LLM tokenizers (like OpenAI's) are trained primarily on English text. Non-Latin scripts get broken into smaller byte-level tokens, inflating token count.
Cost Impact for Data Engineers:
If you're processing customer support tickets in 10 languages:
- English baseline: 1,000 tokens/ticket
- Japanese: 2,500 tokens/ticket (2.5x multiplier)
- Arabic: 2,200 tokens/ticket (2.2x multiplier)
At $0.002 per 1K input tokens and $0.006 per 1K output tokens (and assuming output length roughly tracks input length):
- English: $0.002 input + $0.006 output = $0.008/ticket
- Japanese: $0.005 input + $0.015 output = $0.020/ticket
Scaling to 1M tickets/month: That's $8K vs $20K—a $12K/month difference just from tokenization.
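That arithmetic is easy to turn into a quick estimator. A minimal sketch—the per-language token counts and prices are the illustrative numbers from this example, not universal constants:

```python
PRICE_IN = 0.002 / 1000   # $ per input token
PRICE_OUT = 0.006 / 1000  # $ per output token

tickets_per_month = 1_000_000
token_profiles = {"English": 1000, "Japanese": 2500, "Arabic": 2200}  # tokens/ticket

for lang, tokens in token_profiles.items():
    # Assumes output length roughly matches input length
    per_ticket = tokens * PRICE_IN + tokens * PRICE_OUT
    print(f"{lang}: ${per_ticket:.3f}/ticket -> ${per_ticket * tickets_per_month:,.0f}/month")
```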
Real-World ROI Example:
A fintech company processing multilingual loan applications learned this the hard way:
Before understanding tokenization:
- Estimated: 1,000 applications/day × $0.05/application = $50/day
- Budget: ~$18K/year
Reality check (production launch):
- Multilingual documents (Spanish, Portuguese, Chinese)
- JSON structured output requirements
- Actual cost: $0.12/application = $120/day = $44K/year
Ouch. 2.4x over budget.
After optimization:
- Implemented dynamic batching (16 docs per API call)
- Used sliding context windows (reduced history bloat)
- Switched to cheaper models for extraction, premium for analysis
- Result: $0.04/application = $40/day = $15K/year
Annual impact: $44K → $15K = $29K saved (66% cost reduction)
This is why understanding tokens, temperature, and context windows isn't academic—it's the difference between a profitable AI system and an expensive mistake.
The Token Count Isn't What You Think
Common mistake: Estimating tokens by word count.
Rule of thumb: 1 token ≈ 4 characters in English
But this breaks for:
- Code (lots of special characters)
- Non-English languages
- Text with heavy punctuation
- Structured data (JavaScript Object Notation (JSON), Extensible Markup Language (XML))
Example with JSON:
{"name": "John", "age": 30}
You might think: "That's like 6 words, so ~6 tokens."
Actual token count: around 11-12 with the GPT tokenizer, split roughly like this:
["{", "name", "\":", " \"", "John", "\",", " \"", "age", "\":", " ", "30", "}"]
Every brace, colon, quote—they often become separate tokens.
Lesson for Data Engineers: When building LLM pipelines that output structured data, account for the token overhead of formatting. A 100-word natural language response might be 125 tokens, but the same information as JSON could be 180+ tokens.
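You can measure this overhead directly with tiktoken. A small sketch—the record and prose sentence are made-up examples:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record = {"name": "John", "age": 30, "status": "active"}
as_json = json.dumps(record)
as_prose = "John is 30 years old and his status is active."

# Compare how many tokens the same information costs in each format
print("JSON :", len(enc.encode(as_json)), "tokens")
print("Prose:", len(enc.encode(as_prose)), "tokens")
```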
Vocabulary Size Trade-offs
Modern LLMs use vocabularies of 50K-100K tokens.
GPT (Generative Pre-trained Transformer)-3: ~50K tokens; GPT-4: ~100K tokens
LLaMA (Large Language Model Meta AI): ~32K tokens
PaLM (Pathways Language Model): ~256K tokens
Why not bigger?
The final layer of an LLM computes probabilities over the entire vocabulary. With 50K tokens and a hidden dimension of 12,288 (GPT-3 scale), that's a matrix of:
50,000 × 12,288 = 614,400,000 parameters
Just for the final projection layer. Larger vocabularies = more parameters = more compute.
Why not smaller?
Smaller vocabularies mean longer token sequences for the same text. Remember, attention mechanisms scale at O(n²) with sequence length. More tokens = more computation.
There's a sweet spot, and most modern LLMs landed on 50K-100K.
🌡️ Part 2: Temperature and Sampling Strategies
The Probability Distribution Problem
Here's what's actually happening when an LLM generates text:
Step 1: The model processes your input and produces logits (raw scores) for every token in its vocabulary.
logits = {
"the": 4.2,
"a": 3.8,
"an": 2.1,
"hello": 1.5,
...
"zebra": -3.2
}
These aren't probabilities yet—they're unbounded scores.
Step 2: Apply softmax to convert logits into a probability distribution:
P(token) = e^(logit) / Σ(e^(logit_i))
This gives us:
probabilities = {
"the": 0.45,
"a": 0.38,
"an": 0.10,
"hello": 0.05,
...
"zebra": 0.0001
}
Now we have a valid probability distribution (sums to 1.0).
Step 3: Sample from this distribution to pick the next token.
What Temperature Actually Does
Temperature is applied before the softmax:
P(token) = e^(logit/T) / Σ(e^(logit_i/T))
Where T is temperature.
Temperature = 1.0 (default):
- No modification to logits
- Standard probability distribution
Temperature = 0.0 (deterministic):
- Effectively becomes argmax (always pick highest logit)
- Same input → same output (mostly—more on this later)
Temperature > 1.0 (e.g., 1.5):
- Dividing by T > 1 shrinks the logits, flattening the distribution
- Lower-probability tokens get more of a chance
Temperature < 1.0 (e.g., 0.3):
- Dividing by T < 1 scales the logits up, sharpening the distribution
- Higher-probability tokens dominate even more
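Here's the formula as a minimal Python sketch, using the illustrative logits from Step 1. Because this toy vocabulary has only four tokens, the exact percentages won't match a real model's full-vocabulary distribution, but the sharpening and flattening behavior is the same:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Note: temperature -> 0 degenerates to argmax; callers handle that case separately
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"the": 4.2, "a": 3.8, "an": 2.1, "hello": 1.5}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```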
Visualizing Temperature
Let's say we have these logits for the next token:
Original logits:
"the": 4.0
"a": 3.0
"an": 2.0
"hello": 0.5
At Temperature = 1.0:
After softmax:
"the": 0.53 (53% chance)
"a": 0.20 (20% chance)
"an": 0.07 (7% chance)
"hello": 0.016 (1.6% chance)
At Temperature = 0.5 (sharper):
Divide logits by 0.5 (= multiply by 2):
"the": 8.0
"a": 6.0
"an": 4.0
"hello": 1.0
After softmax:
"the": 0.84 (84% chance) ← Much more confident
"a": 0.11 (11% chance)
"an": 0.04 (4% chance)
"hello": 0.007 (0.7% chance)
At Temperature = 2.0 (flatter):
Divide logits by 2.0:
"the": 2.0
"a": 1.5
"an": 1.0
"hello": 0.25
After softmax:
"the": 0.36 (36% chance) ← Less confident
"a": 0.22 (22% chance)
"an": 0.13 (13% chance)
"hello": 0.06 (6% chance)
Key Insight: Temperature doesn't change the order of probabilities—"the" is always most likely. It changes how much more likely the top choice is compared to others.
When to Use Each Temperature
Temperature = 0.0: Deterministic Tasks
- Structured Query Language (SQL) query generation
- Data extraction from text
- Classification tasks
- Any time you need consistency across runs
Temperature = 0.3-0.5: Focused but Varied
- Technical documentation
- Code generation (with some creativity)
- Summarization where facts matter
Temperature = 0.7-0.9: Balanced Creativity
- Conversational Artificial Intelligence (AI)
- Question and Answer (Q&A) systems
- Content generation with personality
Temperature = 1.0+: High Creativity
- Creative writing
- Brainstorming
- Generating diverse options
Real-World Temperature ROI:
A legal tech company building a contract analysis tool discovered the hard way that temperature matters:
Initial approach (temp=0.7):
- Used for Structured Query Language (SQL) query generation from natural language
- Failure rate: 43% of generated queries had syntax errors
- Manual review required for every query
- Cost: Developer time reviewing = $50/hour
After understanding temperature (temp=0.0):
- Same task, temp=0 for deterministic SQL generation
- Failure rate: 3% (mostly edge cases)
- Manual review only on failures
- Result: 93% reduction in review time
ROI Impact:
- 1,000 queries/day × 2 min review/query × $50/hour = $1,667/day wasted
- After optimization: 30 queries/day × 2 min review × $50/hour = $50/day
- Annual savings: $590K
One parameter change. Massive return on investment (ROI).
Beyond Temperature: Top-p and Top-k
Temperature alone isn't enough. Even at temp=0.7, you might sample a very low-probability token (the "zebra" with 0.01% chance).
Top-k Sampling:
Only consider the top k most likely tokens. Set the rest to probability 0, then renormalize.
Top-k = 3 means only consider the 3 most likely tokens:
"the": 0.53 → renormalized to 0.66
"a": 0.20 → renormalized to 0.25
"an": 0.07 → renormalized to 0.09
"hello": 0.016 → ignored (probability = 0)
Top-p (Nucleus) Sampling:
More adaptive. Instead of fixed k, include the smallest set of tokens whose cumulative probability exceeds p.
Top-p = 0.9 means include tokens until cumulative probability ≥ 90%:
"the": 0.53 (cumulative: 53%)
"a": 0.20 (cumulative: 73%)
"an": 0.07 (cumulative: 80%)
"hello": 0.016 (cumulative: 81.6%)
... keep adding until cumulative ≥ 90%
Why Top-p > Top-k:
Top-k is rigid. If the model is very confident, maybe only 2 tokens are reasonable, but you're forcing it to consider 50. If it's uncertain, maybe 100 tokens are plausible, but you're limiting to 50.
Top-p adapts to the model's confidence. High confidence? Small nucleus. Low confidence? Larger nucleus.
Most production systems use: temperature=0.7, top_p=0.9, top_k=0 (disabled)
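A minimal sketch of how top-k and top-p filtering restrict the distribution before sampling. This operates on a plain dict of probabilities for readability; real implementations work on logit tensors:

```python
import random

def top_k_filter(probs, k):
    # Keep only the k most likely tokens, then renormalize
    top = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(top.values())
    return {tok: p / total for tok, p in top.items()}

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p, then renormalize
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

probs = {"the": 0.53, "a": 0.20, "an": 0.07, "hello": 0.016, "zebra": 0.0001}
print(top_k_filter(probs, 3))       # matches the renormalized top-k values above
nucleus = top_p_filter(probs, 0.75) # 0.75 because this truncated example doesn't sum to 1.0
print(random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0])
```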
The "Temperature = 0 Isn't Deterministic" Gotcha
You'd think temp=0 always gives the same output for the same input.
Not quite.
Even at temp=0:
- Floating point precision: Different hardware might round differently
- Top-p still applies: If you have top_p=0.9 with temp=0, you're still sampling from the top 90% mass
- Non-deterministic operations: Some implementations use non-deterministic Graphics Processing Unit (GPU) operations
For true determinism: Set temperature=0, top_p=1.0, seed=42 (and pray the API supports seeded generation).
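With the OpenAI Python SDK, for example, you can pin those parameters explicitly. The `seed` parameter exists on recent chat models but is documented as best-effort, so treat reproducibility as likely rather than guaranteed; the model name below is just an example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whatever you actually run
    messages=[{"role": "user", "content": "Extract the invoice total as JSON."}],
    temperature=0,        # greedy-ish decoding
    top_p=1.0,            # don't truncate the distribution
    seed=42,              # best-effort reproducibility
)
print(response.choices[0].message.content)
```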
🪟 Part 3: Context Windows and Memory Constraints
What IS a Context Window?
The context window is the maximum number of tokens an LLM can process in a single request (input + output combined).
Common context windows:
- GPT-3.5: 4K tokens (~3,000 words)
- GPT-4: 8K tokens (base), 32K tokens (extended)
- GPT-4 Turbo: 128K tokens (~96,000 words)
- Claude 2 (Anthropic): 100K tokens
- Claude 3 (Anthropic): 200K tokens
But here's what data engineers need to understand: It's not just about "how much text fits." It's about computational complexity.
The O(n²) Problem
Transformers use self-attention, which computes relationships between every token and every other token.
For a sequence of length n, that's:
n × n = n² comparisons
Example:
- 1,000 tokens: 1,000,000 attention computations
- 2,000 tokens: 4,000,000 attention computations (4x)
- 4,000 tokens: 16,000,000 attention computations (16x)
Quadratic scaling is brutal.
This is why longer context windows are:
- More expensive (more compute per request)
- Slower (more operations to process)
- More memory-intensive (need to store that n×n attention matrix)
Why Context Windows Exist
It's not an arbitrary limit. It's a memory and compute constraint.
During training, transformers are trained on sequences of a fixed maximum length (e.g., 8,192 tokens). The model learns positional encodings for positions 0 to 8,191.
What happens at position 8,192?
The model has never seen it. Positional encodings break down. Attention patterns become unreliable.
Modern techniques (like ALiBi, rotary embeddings) help extend beyond training length, but there are still practical limits.
Token Counting in Context
Critical for data engineers: Context window includes input + output.
Context window: 8,192 tokens
Your prompt: 7,000 tokens
Model's max output: 1,192 tokens
If the model tries to generate more than 1,192 tokens, it'll hit the limit mid-generation and truncate.
Even worse: Some APIs reserve tokens for special markers, formatting, system messages. Your effective context might be 8,192 - 500 = 7,692 tokens.
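Before each call, it's worth computing the remaining output budget explicitly. A minimal sketch with tiktoken—the 500-token reserve is an assumed safety margin, not a documented number:

```python
import tiktoken

CONTEXT_WINDOW = 8192
RESERVED = 500  # assumed buffer for system messages and formatting overhead

enc = tiktoken.get_encoding("cl100k_base")

def max_output_tokens(prompt: str) -> int:
    prompt_tokens = len(enc.encode(prompt))
    budget = CONTEXT_WINDOW - RESERVED - prompt_tokens
    if budget <= 0:
        raise ValueError(f"Prompt uses {prompt_tokens} tokens; no room left for output")
    return budget

print(max_output_tokens("Summarize the following ticket: ..."))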
Context Management Strategies
Strategy 1: Sliding Windows
Instead of keeping full conversation history, maintain a sliding window:
Window size: 2,000 tokens
New message: 300 tokens
Option A: Drop oldest messages until total ≤ 2,000
Option B: Keep first message (system context) + last N messages
Option C: Keep first + last, drop middle (risky—loses context)
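Here's a minimal sketch of Option B: keep the system message, then walk backwards through the history keeping the most recent messages that still fit in the token budget. `count_tokens` is an assumed helper (e.g., built on tiktoken):

```python
def sliding_window(messages, count_tokens, max_tokens=2000):
    """Keep the system message plus the most recent messages that fit the budget."""
    system, history = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system["content"])

    kept = []
    for msg in reversed(history):            # walk backwards, newest first
        cost = count_tokens(msg["content"])
        if cost > budget:
            break                            # older messages no longer fit
        kept.append(msg)
        budget -= cost

    return [system] + list(reversed(kept))   # restore chronological order
```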
Strategy 2: Summarization
Periodically summarize old messages:
Messages 1-10: "User asked about product features. We discussed pricing, integrations, and support."
Messages 11-15: [keep full text]
Trade-off: Summarization costs tokens (you need to generate the summary), but saves tokens long-term.
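One way to wire this up, as a hedged sketch: once the history grows past a threshold, replace everything except the most recent messages with a single summary message. The `summarize_history` helper (an LLM call that returns a short summary string) and the thresholds are assumptions for illustration:

```python
def compact_history(messages, summarize_history, keep_last=5, max_messages=15):
    # messages[0] is the system message; summarize_history is an assumed helper
    # that calls a cheap model and returns a one-paragraph summary string.
    if len(messages) <= max_messages:
        return messages

    system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize_history(old)}
    return [system, summary] + recent
```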
Strategy 3: Retrieval-Augmented Generation (RAG)
Don't put everything in context. Store information externally (vector Database (DB)), retrieve relevant chunks, inject into context.
User query: "What's our refund policy?"
→ Retrieve top 3 relevant docs (500 tokens)
→ Include only those in context
→ Generate response
This pattern allows you to work with unlimited knowledge bases while staying within context window constraints.
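As a sketch, the pattern looks like this. `vector_db.search`, the chunk objects, and the prompt template are placeholders; any vector store with a similarity-search call and any LLM client fit the same shape:

```python
def answer_with_rag(query, vector_db, llm, top_k=3):
    # 1. Retrieve only the chunks relevant to this query (assumed vector-store API)
    chunks = vector_db.search(query, top_k=top_k)

    # 2. Inject just those chunks into the context, not the whole knowledge base
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 3. Generate the response within a small, predictable token budget
    return llm(prompt)
```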
Batch Processing Implications for Data Engineers
If you're processing millions of documents, context windows create batch size constraints.
Example: Embedding Generation
You want to embed 100,000 customer support tickets (avg 500 tokens each).
Naive approach:
```python
for ticket in tickets:
    embedding = embed(ticket)  # 1 API call per ticket
```
Result: 100,000 API calls. Slow. Rate-limited. Expensive.
Batch approach:
```python
batch_size = 16  # fit within the context window
for batch in chunks(tickets, batch_size):
    embeddings = embed(batch)  # 1 API call for 16 tickets
```
Result: 6,250 API calls. Much better.
But there's a catch: If your context window is 8K tokens, and you batch 16 tickets at 500 tokens each = 8,000 tokens, you're at the limit. If one ticket is 600 tokens, you overflow.
Solution: Dynamic batching based on token count, not fixed batch size.
```python
# Dynamic batching sketch: count_tokens() and embed() are assumed helpers
# (e.g., tiktoken for counting, your embedding client for embed()).
current_batch = []
current_tokens = 0
max_batch_tokens = 7500  # leave a buffer below the 8K context window

for ticket in tickets:
    ticket_tokens = count_tokens(ticket)
    if current_tokens + ticket_tokens > max_batch_tokens:
        # Current batch is full: process it, then start a new one with this ticket
        embeddings = embed(current_batch)
        current_batch = [ticket]
        current_tokens = ticket_tokens
    else:
        current_batch.append(ticket)
        current_tokens += ticket_tokens

# Don't forget the final, partially filled batch
if current_batch:
    embeddings = embed(current_batch)
```
This is basic data engineering—but it matters for LLM pipelines.
Cost Implications
Context windows directly impact cost.
OpenAI Pricing (GPT-4):
- Input: $0.03 per 1K tokens
- Output: $0.06 per 1K tokens
Scenario: Customer support chatbot
Average conversation:
- System message: 200 tokens
- Conversation history: 1,500 tokens
- User message: 100 tokens
- Response: 200 tokens
Input tokens per message: 200 + 1,500 + 100 = 1,800
Output tokens per message: 200
Cost per message: (1.8 × $0.03) + (0.2 × $0.06) = $0.054 + $0.012 = $0.066
At 100,000 messages/month: $6,600/month
Optimization: Sliding window (keep last 500 tokens of history)
Input tokens per message: 200 + 500 + 100 = 800
Cost per message: (0.8 × $0.03) + (0.2 × $0.06) = $0.024 + $0.012 = $0.036
At 100,000 messages/month: $3,600/month
Savings: $3,000/month just from context management.
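The same arithmetic as a tiny cost model, so you can plug in your own message mix. Prices are the GPT-4 example rates above:

```python
PRICE_IN = 0.03 / 1000   # $ per input token (GPT-4 example rate)
PRICE_OUT = 0.06 / 1000  # $ per output token

def monthly_cost(system_toks, history_toks, user_toks, output_toks, messages_per_month):
    per_message = ((system_toks + history_toks + user_toks) * PRICE_IN
                   + output_toks * PRICE_OUT)
    return per_message * messages_per_month

print(f"${monthly_cost(200, 1500, 100, 200, 100_000):,.0f}/month")  # full history
print(f"${monthly_cost(200, 500, 100, 200, 100_000):,.0f}/month")   # sliding window
```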
🎯 Conclusion: The Foundation of ROI-Positive AI Systems
Understanding tokens, temperature, and context windows isn't academic—it's the foundation of every cost optimization, quality improvement, and scaling decision you'll make in production.
As a data engineer, you know that small inefficiencies compound at scale. A 20% optimization in query performance isn't just "nice to have"—it's millions of dollars when you're processing petabytes. The same principle applies to Large Language Model (LLM) systems.
The Business Impact:
These three fundamentals directly control:
💰 Cost:
- Token efficiency across languages and formats (2-3x cost difference)
- Context window optimization ($3K/month savings from simple sliding windows)
- Batch processing strategies (6,250 API calls vs 100,000)
- Temperature selection (one parameter = $590K annual savings)
📊 Quality:
- Appropriate temperature for your use case (43% → 3% error rate)
- Sampling strategies (top-p, top-k for controlled creativity)
- Maintaining context in multi-turn interactions (user experience)
⚡ Performance:
- Quadratic scaling of attention with context length (understand before you scale)
- Batch size constraints from token limits (throughput optimization)
- Rate limiting and throughput planning (production readiness)
The ROI Pattern:
Every example we've seen follows the same pattern:
- Underestimate complexity → Budget overruns or quality issues
- Understand fundamentals → Make informed architecture decisions
- Optimize systematically → 60-90% cost reductions, 10x quality improvements
This is your competitive advantage. Most teams treat LLMs as black boxes and pay the price in production. You'll understand the levers that matter.
Key Takeaways for Data Engineers
On Tokens:
- Tokens ≠ words. They're Byte-Pair Encoding (BPE) subword units.
- Common English words = 1 token. Rare words, non-English text, code = multiple tokens.
- Action: Always count tokens programmatically, never estimate by word count.
- ROI Impact: Multilingual support can cost 2-3x more than estimated.
On Temperature:
- Temperature controls probability distribution sharpness, not "randomness."
- temp=0 for deterministic tasks (SQL, extraction). temp=0.7-0.9 for creative tasks.
- Combine with top-p (nucleus sampling) for adaptive token selection.
- Action: Match temperature to your use case's consistency requirements.
- ROI Impact: Wrong temperature = 40%+ error rates. Right temperature = 3%.
On Context Windows:
- Context = input + output tokens combined. It's a compute constraint, not arbitrary.
- Attention scales at O(n²). Double the context = 4x the compute cost.
- Manage proactively: sliding windows, summarization, Retrieval-Augmented Generation (RAG).
- Action: Monitor token usage per request. Optimize before it becomes expensive.
- ROI Impact: Context management alone can save $3K+/month at moderate scale.
Found this helpful? Drop a comment with the biggest "aha!" moment you had, or share how you're applying these concepts in your production systems.