Mike Anderson

What Are Tokens and Temperature in AI Models?

Practical guidance for managers and engineers who need predictable, cost-aware, and useful AI outputs.

Last reviewed: May 16, 2026


Opening

When people start working with AI models, they often focus on the model name first.

Is Claude Opus better than Claude Sonnet? Should I use Gemini, Llama, Gemma, or Qwen? Which model is best for coding, summarization, security analysis, or business reporting?

Those are valid questions, but two smaller settings often determine whether the output is useful, predictable, and cost-effective:

  • tokens
  • temperature

Tokens affect how much text the model can read and write. Temperature affects how predictable or creative the answer will be.

If you are a manager, tokens influence cost, latency, and whether a long report gets cut off. If you are an engineer, tokens affect context windows, output limits, prompt design, API behavior, and JSON reliability. If you are using AI in cybersecurity, cloud operations, compliance, customer support, development, or analytics, these two concepts matter every day.

This article explains tokens and temperature using practical examples across models such as Claude Opus 4.7, Claude Sonnet 4.6, Llama 3.3, Gemma, Gemini, and Qwen 2.5. It also includes small Python examples you can adapt for real projects.


First, what is a token?

A token is the unit of text that an AI model processes.

A token may be:

  • A full word
  • Part of a word
  • A punctuation mark
  • A number
  • A symbol
  • A piece of whitespace
  • A byte-like unit for unusual characters

For English prose, one token is often roughly three to four characters, or about three-quarters of a word. That is only a rough estimate. The exact token count depends on the model, tokenizer, language, formatting, and character set.

Example:

I love AWS!

A model might split it approximately like this:

["I", " love", " AWS", "!"]

That would be four tokens.

A longer sentence:

CloudTrail shows DeleteTrail from an unusual IP address.

Might become something like:

["Cloud", "Trail", " shows", " Delete", "Trail", " from", " an", " unusual", " IP", " address", "."]

The exact split will vary, but the idea is the same: the model does not process text as full paragraphs. It processes sequences of tokens.
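
If you want a rough count before sending a prompt, you can run a tokenizer locally. The sketch below uses the tiktoken library, which implements OpenAI-style tokenizers; Claude, Llama, Gemma, and Qwen use their own tokenizers, so treat the numbers as estimates rather than exact counts for those models.

# Rough token-counting sketch. Assumes `pip install tiktoken`.
# tiktoken implements OpenAI tokenizers, so counts for other model
# families will differ slightly; use this only as an estimate.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for text in ["I love AWS!",
             "CloudTrail shows DeleteTrail from an unusual IP address."]:
    tokens = encoding.encode(text)
    print(f"{len(tokens):3d} tokens  <- {text!r}")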


Why tokens matter

Tokens matter for four reasons.

1. Context window

The context window is how much information the model can consider at one time. It usually includes the system prompt, user prompt, attached text, tool results, conversation history, retrieved documents, and any other provider-visible context. Some platforms account for reasoning or intermediate steps separately, so always check the provider's token accounting.

A large context window lets you provide more material, such as:

  • Long contracts
  • CloudTrail logs
  • Incident timelines
  • Source code files
  • Architecture documents
  • Security findings
  • Meeting transcripts
  • Policy documents

But a larger context window does not automatically mean better output. If you dump too much irrelevant text into the prompt, the model may become slower, more expensive, and less focused.
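
One practical habit is to give retrieved material a token budget instead of pasting everything. Below is a minimal sketch, assuming the rough four-characters-per-token estimate from earlier and a hypothetical list of retrieved documents; a real implementation would count tokens with the provider's tokenizer.

# Context-budget sketch (hypothetical helper, not a library API).
# Uses the rough "about four characters per token" estimate.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_to_budget(documents: list[str], budget_tokens: int) -> list[str]:
    """Keep documents, most relevant first, until the budget is spent."""
    kept, used = [], 0
    for doc in documents:
        cost = estimate_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

retrieved = ["most relevant log excerpt", "older incident note", "full policy document"]
context = fit_to_budget(retrieved, budget_tokens=2000)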

2. Output length

Most APIs let you set a maximum output length. This is usually called something like:

max_tokens
max_output_tokens
max_new_tokens
num_predict

The name depends on the provider.

This setting controls how much the model is allowed to generate. If the limit is too small, the model may stop mid-answer.

That matters when you ask for structured output.

Bad result:

{
  "summary": "The alert indicates suspicious IAM activity",
  "severity": "high",
  "next_steps": [
    "Review CloudTrail logs",

The JSON is broken because the model ran out of output tokens.

For structured SOC notes, compliance summaries, or API responses, an output limit that is too small can break downstream automation.
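
With the Anthropic messages API you can detect this case instead of passing broken JSON downstream, because the response reports why generation stopped. A minimal sketch, assuming the claude-sonnet-4-6 model string used later in this article; other providers expose a similar finish or stop reason under a different name.

# Truncation-check sketch for the Anthropic messages API.
# stop_reason == "max_tokens" means the model hit the output limit.
import json
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",   # verify the current model name before production use
    max_tokens=200,
    temperature=0.1,
    messages=[{"role": "user", "content": "Return the triage note as JSON."}],
)

if response.stop_reason == "max_tokens":
    print("Output was truncated; increase max_tokens before parsing.")
else:
    data = json.loads(response.content[0].text)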

3. Cost

Most commercial models charge by token. Input tokens and output tokens may have different prices. Longer prompts and longer answers usually cost more.

For managers, this matters because a workflow that looks cheap in a prototype can become expensive in production if every request includes thousands of unnecessary tokens.

For engineers, this means prompt design is not just style. It is cost control.

One important thing to understand: running locally does not remove the need to tune tokens and temperature.

Tokens still control how much the model can read and write, and they affect memory, speed, truncation, and throughput.

Temperature still controls how predictable or creative the model is, and it directly affects consistency, JSON reliability, and operational usefulness.
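
A quick back-of-the-envelope calculation makes the cost conversation concrete. This is only a sketch: the prices below are placeholders, not real rates for any model, so plug in the current numbers from your provider's pricing page.

# Cost-estimation sketch. PRICE_IN / PRICE_OUT are placeholder values,
# not real prices; check your provider's pricing page.
PRICE_IN_PER_MTOK = 3.00    # USD per million input tokens (placeholder)
PRICE_OUT_PER_MTOK = 15.00  # USD per million output tokens (placeholder)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * PRICE_IN_PER_MTOK \
         + (output_tokens / 1_000_000) * PRICE_OUT_PER_MTOK

# 5,000 requests a day, each with a 3,000-token prompt and an 800-token answer
daily = 5_000 * estimate_cost(3_000, 800)
print(f"Estimated daily spend: ${daily:,.2f}")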

4. Latency

More tokens usually means more processing time.

A short prompt with a short answer may return quickly. A long prompt with retrieved documents, logs, and a large output budget may take noticeably longer.

For real-time user interfaces, alert triage, chatbots, and coding assistants, token planning directly affects user experience.


Input tokens vs output tokens

Think of a model call like this:

input tokens  →  model  →  output tokens

Input tokens are what you send to the model.

Examples:

  • System instructions
  • User question
  • Chat history
  • Retrieved documents
  • Log snippets
  • Tool results
  • Code files

Output tokens are what the model generates.

Examples:

  • A summary
  • A JSON response
  • A Python function
  • A security triage note
  • A business recommendation
  • A rewritten email

A common mistake is thinking max_tokens controls the full request. In many APIs, it controls only the output. The input is governed by the model’s context window.

So this setup:

max_tokens = 800

Usually means:

The model may generate up to 800 output tokens.

It does not mean:

The full prompt plus answer is limited to 800 tokens.

Always check the specific provider’s naming.
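
Most APIs also report the actual token usage on the response, which is the easiest way to see what you are really sending and receiving. A minimal sketch using the Anthropic SDK's usage fields; other providers expose similar usage metadata under different names.

# Usage-reporting sketch for the Anthropic messages API.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",   # verify the current model name before production use
    max_tokens=800,              # limits the output only, not the prompt
    messages=[{"role": "user", "content": "Summarize tokens vs temperature in two sentences."}],
)

print("input tokens: ", response.usage.input_tokens)
print("output tokens:", response.usage.output_tokens)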


What is temperature?

Temperature controls how random or predictable the model’s output is.

A low temperature makes the model favor higher-probability next tokens. The output becomes more stable, conservative, and repeatable.

A high temperature gives the model more freedom to choose less obvious tokens. The output becomes more varied, creative, and sometimes less reliable.

A practical range looks like this:

Temperature | Behavior | Good for
0.0 to 0.2 | Very focused and conservative | JSON, security analysis, compliance, extraction, classification
0.3 to 0.5 | Balanced | Technical explanations, documentation, summaries
0.6 to 0.8 | More varied | Brainstorming, marketing drafts, alternative phrasings
0.9+ | Highly creative and less predictable | Fiction, ideation, playful content

For most business and engineering workflows, I rarely start above 0.5.

For cybersecurity, compliance, incident response, finance, legal review, or anything that needs repeatability, I usually start around:

temperature = 0.1 or 0.2

That does not make the model perfectly deterministic. Some systems can still produce slight variation even at low temperature. But it reduces unnecessary creativity.
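
If repeatability matters for your workflow, measure it instead of assuming it. The sketch below sends the same prompt a few times at low temperature through a local Ollama model (the same HTTP API used in the local examples later in this article) and reports how many distinct answers came back.

# Consistency-check sketch using the local Ollama HTTP API.
import requests

def ask(prompt: str, temperature: float) -> str:
    payload = {
        "model": "llama3.3",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": 80},
    }
    r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["response"].strip()

prompt = "In one sentence, what does a CreateAccessKey event from an unusual IP mean?"
answers = {ask(prompt, temperature=0.1) for _ in range(3)}
print(f"{len(answers)} distinct answer(s) across 3 runs")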


Temperature example in plain English

Imagine this prompt:

Summarize this alert in one sentence:
An IAM user created a new access key from an unusual IP address.

At low temperature, the model might answer:

An IAM user created a new access key from an unusual IP address, which may indicate credential misuse or unauthorized access.

At higher temperature, it might answer:

This alert suggests a potentially risky identity event where a user generated fresh AWS credentials from an unfamiliar network location.

Both may be acceptable.

But for a SOC workflow, the first answer is often better because it is direct and easier to compare across alerts.

Now imagine a JSON workflow. Low temperature matters even more.

You want this:

{
  "severity": "high",
  "confidence": "medium",
  "next_checks": [
    "Review CloudTrail activity for the user",
    "Check whether the access key creation was approved",
    "Verify the source IP address"
  ]
}

You do not want this:

This is definitely scary. I would probably investigate immediately.

For structured workflows, keep temperature low.

For production JSON workflows, low temperature helps, but it is not enough. Use provider-supported structured outputs where available, validate responses against a schema, and retry or fail safely when validation fails.
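
A minimal validation-and-retry sketch is shown below. It assumes a hypothetical call_model function that wraps whichever API you use and returns the model's text; real code would also use the provider's structured-output or JSON mode where available.

# JSON validation sketch. `call_model` is a hypothetical wrapper around
# your provider's API; swap in the real call.
import json

REQUIRED_KEYS = {"severity", "confidence", "next_checks"}

def parse_triage(raw: str) -> dict:
    data = json.loads(raw)                  # raises if the JSON is broken
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

def triage_with_retry(prompt: str, attempts: int = 2) -> dict:
    for _ in range(attempts):
        try:
            return parse_triage(call_model(prompt, temperature=0.1))
        except (json.JSONDecodeError, ValueError):
            continue
    raise RuntimeError("Model did not return valid triage JSON")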


How tokens and temperature work together

Tokens and temperature solve different problems.

Setting | Controls | Main risk if wrong
Max tokens | Output length | Truncated answers, broken JSON, incomplete reports
Context window | Input + working context capacity | Missing information or too much irrelevant information
Temperature | Randomness | Hallucination, inconsistency, boring output, or weak creativity

For a security incident report, you may use:

max_tokens = 2000
temperature = 0.1

Why?

  • The output needs enough room for a complete triage note.
  • The output should be consistent and evidence-based.
  • The model should not invent indicators or exaggerate severity.

For a marketing brainstorm, you may use:

max_tokens = 1200
temperature = 0.7

Why?

  • The output does not need to be perfectly deterministic.
  • You want variety.
  • You can manually review and select the best ideas.

Model examples: how the idea applies across Claude, Gemini, Llama, Gemma, and Qwen

Note: Model names, availability, context limits, and API parameters change frequently. Verify the current provider documentation before using any model string in production.

The concepts are similar across model families, but the API parameter names and behavior can differ.

Claude Opus 4.7

Claude Opus 4.7 is positioned for complex reasoning and agentic coding. Use it when the task is difficult, high-value, or requires deeper analysis.

Good use cases:

  • Complex architecture review
  • Advanced code reasoning
  • Long incident analysis
  • Executive briefing generation
  • Multi-step technical planning

Practical token and temperature guidance:

temperature: 0.1–0.3 for technical or security work
max_tokens: high enough for the expected report

Use lower temperature when asking for structured output or factual analysis.

Claude Sonnet 4.6

Claude Sonnet 4.6 is better when you want a strong balance of speed, intelligence, and cost.

Good use cases:

  • Engineering documentation
  • SOC alert summaries
  • Business analysis
  • Internal knowledge assistants
  • Code review and explanation
  • Repeatable production workflows

Practical guidance:

temperature: 0.1–0.4
max_tokens: sized to the output format

For many production workflows, Sonnet-class models are often a better default than Opus-class models because they balance quality and operational cost.

Gemini Pro-class models

As of May 16, 2026, Google's Gemini model list shows Gemini 3 Pro Preview as shut down and recommends migration to newer Gemini 3-series options such as Gemini 3.1 Pro Preview. The concept remains the same: use lower temperature for factual, structured work and higher temperature for creative work.

Good use cases for Gemini Pro-class models:

  • Multimodal analysis
  • Research-style reasoning
  • Long technical explanations
  • Document understanding
  • Complex coding or planning tasks

Practical guidance:

temperature: 0.1–0.3 for structured or factual outputs
max_output_tokens: enough for the final answer

Llama 3.3

Llama 3.3 is commonly used as a strong open-weight model for local or self-hosted workloads, especially in the 70B instruct variant.

Good use cases:

  • Local AI assistants
  • Internal summarization
  • RAG prototypes
  • Engineering support
  • Private document analysis

Practical guidance:

temperature: 0.1–0.4 for enterprise tasks
num_predict or max tokens: based on the runtime

When running locally, hardware matters. A 70B model usually needs significantly more memory and compute than smaller models.

Gemma

Gemma is Google's open-weight model family. Gemma models are useful when you want local or controlled deployment with smaller model sizes.

Good use cases:

  • Lightweight assistants
  • Summarization
  • Internal tools
  • Edge or workstation experiments
  • Cost-sensitive workloads

Practical guidance:

temperature: 0.2–0.5 for normal technical writing
lower for extraction or classification

Smaller Gemma models may be fast and practical, but they may not reason as deeply as larger hosted models.

Qwen 2.5

Qwen 2.5 remains a strong open-weight family with variants for general chat, coding, math, and long-context tasks. For newer deployments, also check the current Qwen3 options available in your runtime.

Good use cases:

  • Coding assistance
  • Structured output
  • Local technical analysis
  • Multilingual tasks
  • Log and text summarization
  • RAG-style workflows

Practical guidance:

temperature: 0.1–0.3 for code and JSON
temperature: 0.3–0.5 for explanations

Qwen 2.5 Coder variants are especially useful for engineering workflows where code quality and structure matter.


Python example: Claude Opus 4.7 or Sonnet 4.6

This example uses the Anthropic SDK pattern. Set your API key as an environment variable before running it.

export ANTHROPIC_API_KEY="your_api_key_here"

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    temperature=0.1,
    messages=[
        {
            "role": "user",
            "content": """
Summarize this security alert for a manager.

Alert:
An AWS IAM user created a new access key from an unusual IP address.
CloudTrail shows no recent approved change ticket.
Return:
1. Business risk
2. Technical meaning
3. Recommended next checks
"""
        }
    ],
)

print(response.content[0].text)

For a more complex task, switch the model:

model="claude-opus-4-7"

Use Opus when the task requires deeper reasoning. Use Sonnet when you need a strong production default with better speed and cost balance.


Python example: Gemini Pro-class model

Google’s model names change over time, so always check the current Gemini model list before production use. If Gemini 3 Pro is unavailable in your environment, use the currently recommended Pro-class model, such as Gemini 3.1 Pro Preview or Gemini 2.5 Pro where appropriate.

export GEMINI_API_KEY="your_api_key_here"

import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="""
Explain tokens and temperature to an engineering manager.
Keep it practical and include one example about JSON output.
""",
    config={
        "temperature": 0.2,
        "max_output_tokens": 700,
    },
)

print(response.text)

For factual or operational content, keep temperature low. For brainstorming, increase it carefully.


Python example: local Llama 3.3 with Ollama

If you run local models through Ollama, you can call the local HTTP API from Python.

First pull the model:

ollama pull llama3.3

Then call it:

import requests

payload = {
    "model": "llama3.3",
    "prompt": """
Explain what max tokens and temperature mean.
Use an example from cloud security alert triage.
""",
    "stream": False,
    "options": {
        "temperature": 0.2,
        "num_predict": 500
    }
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json=payload,
    timeout=120
)

response.raise_for_status()
print(response.json()["response"])

In Ollama, num_predict controls how many tokens the model is allowed to generate.


Python example: local Gemma with Ollama

Pull a Gemma model that fits your machine:

ollama pull gemma3:4b

Then call it:

import requests

prompt = """
Rewrite this technical note for a manager:

Temperature controls randomness. Low temperature gives consistent answers.
High temperature gives more creative but less predictable answers.
"""

payload = {
    "model": "gemma3:4b",
    "prompt": prompt,
    "stream": False,
    "options": {
        "temperature": 0.4,
        "num_predict": 350
    }
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json=payload,
    timeout=120
)

response.raise_for_status()
print(response.json()["response"])

A moderate temperature works well for rewriting and explanation. Use lower temperature for extraction, classification, or JSON.


Python example: local Qwen 2.5 with Ollama

Qwen 2.5 is useful when you want strong local technical behavior.

Pull a model:

ollama pull qwen2.5:7b

Then run:

import requests
import json

alert = {
    "source": "CloudTrail",
    "eventName": "CreateAccessKey",
    "user": "svc-reporting",
    "sourceIPAddress": "203.0.113.10",
    "change_ticket": None
}

prompt = f"""
Analyze this alert using only the evidence provided.

Return valid JSON with:
- summary
- severity
- confidence
- key_evidence
- next_checks

Alert:
{json.dumps(alert, indent=2)}
"""

payload = {
    "model": "qwen2.5:7b",
    "prompt": prompt,
    "stream": False,
    "options": {
        "temperature": 0.1,
        "num_predict": 700
    }
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json=payload,
    timeout=120
)

response.raise_for_status()
print(response.json()["response"])

For JSON output, use low temperature. If the JSON is still inconsistent, simplify the schema, use provider-supported structured output where available, and validate the response in your application.


Recommended settings by use case

Use this as a starting point.

Use case | Temperature | Output token guidance
JSON extraction | 0.0–0.2 | Enough for full schema
SOC alert triage | 0.1–0.2 | 800–2000 tokens
Executive summary | 0.2–0.4 | 400–1200 tokens
Code generation | 0.1–0.3 | Depends on file size
Brainstorming | 0.6–0.8 | 800–2000 tokens
Marketing copy | 0.5–0.8 | 500–1500 tokens
Compliance analysis | 0.1–0.2 | 1000–3000 tokens
Long incident report | 0.1–0.3 | 2000+ tokens

These are not fixed rules. They are good starting points.


Practical advice for managers

For managers, the key point is simple:

Tokens affect cost and completeness. Temperature affects reliability and tone.

When evaluating an AI workflow, ask these questions:

  • How many input tokens are we sending per request?
  • How many output tokens do we allow?
  • Are we paying for unnecessary context?
  • Are answers being truncated?
  • Is the model producing consistent outputs?
  • Is the temperature appropriate for the task?
  • Are we using low temperature for high-risk decisions?
  • Are humans reviewing outputs before action?

For operational workflows, consistency is usually more valuable than creativity.


Practical advice for engineers

For engineers, the practical controls are:

  • Keep prompts focused.
  • Do not send irrelevant context.
  • Set output limits high enough for the expected answer.
  • Use low temperature for structured output.
  • Validate JSON outputs with code.
  • Monitor token usage in production.
  • Keep separate settings for different workflows.
  • Test with realistic examples, not only happy-path prompts.

A good production system does not use one default setting for everything.

For example:

TASK_SETTINGS = {
    "json_extraction": {"temperature": 0.1, "max_tokens": 800},
    "incident_summary": {"temperature": 0.2, "max_tokens": 2000},
    "brainstorming": {"temperature": 0.7, "max_tokens": 1200},
    "code_review": {"temperature": 0.2, "max_tokens": 1800},
}

Different jobs need different controls.
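
Looking those settings up per task keeps the call sites simple. A short usage sketch, assuming the Anthropic client pattern from the earlier example and a prompt built elsewhere:

# Usage sketch: pull per-task settings and pass them to the API call.
incident_prompt = "Summarize this incident for the on-call manager: ..."
settings = TASK_SETTINGS["incident_summary"]

response = client.messages.create(
    model="claude-sonnet-4-6",   # verify the current model name before production use
    max_tokens=settings["max_tokens"],
    temperature=settings["temperature"],
    messages=[{"role": "user", "content": incident_prompt}],
)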


Common mistakes

Mistake 1: Setting max tokens too low

This causes incomplete reports, broken JSON, and missing recommendations.

Mistake 2: Using high temperature for security or compliance

High temperature may produce more interesting answers, but interesting is not the same as correct.

Mistake 3: Sending too much context

More context can help, but only if it is relevant. A smaller, cleaner prompt often performs better than a large noisy one.

Mistake 4: Assuming temperature 0 is perfectly deterministic

Many platforms can still show minor variation because of infrastructure, model behavior, or inference details. Treat low temperature as more consistent, not mathematically guaranteed.

Mistake 5: Reusing one setting for every task

A creative writing assistant and an incident triage assistant should not use the same settings.


A simple mental model

Use this analogy:

Max tokens = how much paper the model has to write on.
Context window = how much material the model can keep on the desk.
Temperature = how adventurous the model is while choosing words.

For a security report, you want enough paper and a disciplined writer.

For brainstorming, you may want a more adventurous writer.

For JSON, you want the writer to follow the form exactly.


Final takeaway

Tokens and temperature are small settings with large operational impact.

Tokens decide how much the model can read and write. They affect cost, latency, completeness, and whether structured outputs survive intact.

Temperature decides how predictable or creative the model will be. It affects consistency, tone, and risk.

For manager-level reporting, use enough output tokens to avoid shallow summaries and keep temperature moderate. For engineering workflows, tune settings by task and validate the output. For cybersecurity, compliance, and production automation, keep temperature low, keep prompts focused, and never allow the model to take high-impact action without human review.

The model matters, but the settings matter too.

A strong model with poorly chosen token and temperature settings can still produce weak results. A well-chosen model with disciplined settings can become a reliable assistant for real business and engineering work.
