How Messages Get Rendered into Model Inputs
In an AI Agent, context is usually carried by messages. We send a request to the model service with messages in the payload so the LLM can “see” the current state of the task. For example:
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.openai.com/v1",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python implementation of quicksort"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.7,
)

print(response.choices[0].message.content)
```
We know an LLM is an autoregressive model, and what it actually consumes is a sequence of tokens produced by a tokenizer over plain text. So the natural question is: how do structured messages become tokens, and what does the corresponding raw text look like?
Understanding this makes the “under the hood” of an Agent much less fuzzy.
It’s a safe bet the API server does some transformation: it turns the structured info in messages into whatever input format the model expects. For closed models we can’t see that implementation, but for open-source models we can.
One direct reference is vLLM’s serving logic: it includes the full flow for many popular open models, from receiving an API request to returning generated output. vLLM is fairly heavy though. If you only care about the step from structured messages to the actual input text, there’s a faster route.
Rendering Logic from chat_template.jinja
Open-source models typically ship their model files on Hugging Face, and in the repository you can usually find a file named chat_template.jinja.
For some models, the chat template also lives in tokenizer_config.json under the chat_template field, like Qwen3.
In practice, API servers often render messages into the model’s real input using this Jinja template.
Let’s use Qwen3.5’s chat_template.jinja as an example. Download the template and run a small script to see the rendered output:
```python
import json
from pathlib import Path

from jinja2 import Environment, StrictUndefined


def raise_exception(message: str) -> None:
    raise ValueError(message)


messages = [
    {"role": "system", "content": "You are an experienced software engineer"},
    {"role": "user", "content": "help me fix the following bugs: xxx"},
]

env = Environment(undefined=StrictUndefined, trim_blocks=True, lstrip_blocks=True)
env.globals["raise_exception"] = raise_exception
env.filters["tojson"] = lambda v: json.dumps(v, ensure_ascii=False)

tpl = env.from_string(Path("./chat_template.jinja").read_text(encoding="utf-8"))

print(
    tpl.render(
        messages=messages,
        tools=None,
        add_generation_prompt=True,
        add_vision_id=False,
        enable_thinking=True,
    )
)
```
Running it gives:
```
<|im_start|>system
You are an experienced software engineer<|im_end|>
<|im_start|>user
help me fix the following bugs: xxx<|im_end|>
<|im_start|>assistant
<think>
```
That’s the raw text that actually gets fed into the model. Of course it still goes through the tokenizer and becomes token IDs. Some special markers (like <|im_start|>) map to dedicated token IDs as well.
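To make that last point concrete, here is a toy sketch of how a tokenizer might treat the special markers differently from ordinary text. The token IDs and the word-hashing stand-in are invented for illustration; a real tokenizer (like Qwen's) defines its own special-token IDs and uses a subword algorithm such as BPE for everything else:

```python
import re

# Invented IDs for illustration only; real models define their own special-token IDs.
SPECIAL_TOKENS = {"<|im_start|>": 151644, "<|im_end|>": 151645}


def toy_encode(text: str) -> list:
    """Split out the special markers first, then 'tokenize' the rest naively."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])  # one dedicated ID per special marker
        elif piece:
            # Stand-in for a real subword tokenizer (e.g. BPE) over ordinary text.
            ids.extend(hash(word) % 50000 for word in piece.split())
    return ids


ids = toy_encode("<|im_start|>system\nYou are helpful<|im_end|>")
```

The key property this illustrates: `<|im_start|>` is never split into smaller pieces; it always maps to a single dedicated ID, which is what lets the model reliably recognize turn boundaries.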
From the output it’s pretty clear what happened: role and content in messages were rendered into something like:
```jinja
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
```
This format is what the model was trained on, and it varies across models.
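Stripped of Jinja syntax, the core of that loop is plain string concatenation. A heavily reduced pure-Python equivalent of the ChatML rendering (ignoring tools, thinking, and the template's other branches) might look like:

```python
def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Minimal sketch of the chat template's main loop (ChatML-style formats only)."""
    text = ""
    for message in messages:
        text += (
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "hi"},
])
print(prompt)
```

This also shows why `add_generation_prompt` matters: without it, the rendered text ends after the last closed turn and gives the model no cue to start replying.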
What enable_thinking Looks Like
In the example above we only provided two messages: system and user. But the rendered result ends with an extra role:
```
<|im_start|>assistant
<think>
```
This role has no content and it isn’t closed by <|im_end|>. It’s basically the server telling the model: “You’ve got the system and user context. Now start producing the assistant reply. I’ve already set up the format. See <think>? Include your reasoning.”
If you set enable_thinking to False in the script, the tail becomes:
```
<|im_start|>system
You are an experienced software engineer<|im_end|>
<|im_start|>user
help me fix the following bugs: xxx<|im_end|>
<|im_start|>assistant
<think>
</think>
```
Now it’s more like: “Time to produce the assistant content. See <think></think>? The door is closed for you. You don’t need to output a thinking section; just answer.”
So whether the model “thinks” is largely steered by these input-token patterns. Nothing mystical: during training the model learns to predict the next token given these patterns, and at inference time it continues them in the same way.
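The steering can be reduced to a tiny sketch: the only difference between the two modes is what the server appends after the last user turn. This is a simplified imitation of the behavior shown above, not the actual template logic:

```python
def generation_tail(enable_thinking: bool) -> str:
    """Return the text appended after the last user turn (simplified sketch)."""
    tail = "<|im_start|>assistant\n"
    if enable_thinking:
        tail += "<think>\n"            # open tag: the model is expected to reason first
    else:
        tail += "<think>\n</think>\n"  # closed, empty block: skip straight to the answer
    return tail
```

Either way the model just continues the pattern it was trained on; the open or closed `<think>` block is what decides whether a reasoning section gets generated.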
Beyond this minimal example, each model has its own rendering templates and logic for tool schemas, control tokens, and newer patterns like interleaved thinking. I won’t go deeper here—you can explore it yourself using the same approach.