Jasmin Virdi

Posted on Jun 9

The Messages Array, in 4 GIFs

#llm #ai #javascript #webdev

Hidden token costs of chat history

This is the third post in the Building TinyAgent series, where we build a small agent from scratch in Node.js with no frameworks, just API calls.

Post 1 made one API call. Post 2 streamed the response. This post turns those calls into a conversation.

It sounds simple, but the moment your agent has memory, your bill starts growing fast, and almost no tutorial warns you about it.

Let's see why. 🧐

1. The array grows

The API is stateless: it remembers nothing between calls. So you keep a messages array yourself and send the whole thing on every call.

Each turn adds two messages: the user's question and the model's reply. So:

Turn 1  →  1 message  in array,   1 sent
Turn 2  →  3 messages in array,   3 sent
Turn 5  →  9 messages in array,   9 sent
Turn 10 → 19 messages in array,  19 sent

Turn 10 sends 19 messages, not 1. Every prior turn is in the request.

There is no "memory". It's just an array, and you are the one holding it.
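To make the growth concrete, here is a minimal sketch that simulates the loop without calling any API. `fakeReply` and `simulate` are hypothetical names made up for this illustration; the only point is to count what would be sent on each turn.

```javascript
// fakeReply is a hypothetical stand-in for the real model call.
function fakeReply(messages) {
  return `reply #${(messages.length + 1) / 2}`;
}

function simulate(turns) {
  const messages = [];
  const sentPerTurn = [];
  for (let turn = 1; turn <= turns; turn++) {
    messages.push({ role: "user", content: `question ${turn}` });
    sentPerTurn.push(messages.length); // the whole array goes out with this request
    messages.push({ role: "assistant", content: fakeReply(messages) });
  }
  return sentPerTurn;
}

console.log(simulate(10)); // [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

The count at turn n is 2n - 1: every earlier question and reply, plus the new question.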

2. The cost curve

Here's where it hurts.

A 30-turn conversation isn't 30× the cost of one turn. It's more than 100×, because every turn pays for every turn before it.

Quick math. Say each user message is ~30 tokens and each reply is ~300 tokens. Using Claude Sonnet 4.5 pricing ($3 per million input tokens, $15 per million output tokens), and counting both the input and the 300-token reply in each turn's cost:

Turn 1  input:    30 tokens  →  $0.005   (this turn)
Turn 10 input: 3,000 tokens  →  $0.014   (this turn)
Turn 30 input: 9,600 tokens  →  $0.033   (this turn)

Cumulative through turn 1:   ~$0.005
Cumulative through turn 30:  ~$0.57
ratio: ~120×

Your exact multiplier depends on how long your messages are. 😨

The cost curve isn't a line. It's a curve that bends sharply upward and every new turn makes every future turn more expensive too.
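The arithmetic above can be reproduced with a few lines of code. This is a sketch under the same stated assumptions (30-token questions, 300-token replies, $3 and $15 per million tokens); the function names are made up for the example, and real conversations will vary.

```javascript
// Assumptions from the example above, not measured values.
const USER_TOKENS = 30;
const REPLY_TOKENS = 300;
const INPUT_PRICE = 3 / 1e6;   // dollars per input token
const OUTPUT_PRICE = 15 / 1e6; // dollars per output token

// Input for turn n = this question + every earlier question and reply.
function inputTokensAtTurn(n) {
  return USER_TOKENS + (n - 1) * (USER_TOKENS + REPLY_TOKENS);
}

// Cost of one turn: the resent history plus one freshly generated reply.
function turnCost(n) {
  return inputTokensAtTurn(n) * INPUT_PRICE + REPLY_TOKENS * OUTPUT_PRICE;
}

function cumulativeCost(turns) {
  let total = 0;
  for (let n = 1; n <= turns; n++) total += turnCost(n);
  return total;
}

console.log(inputTokensAtTurn(30));                        // 9600
console.log(cumulativeCost(30).toFixed(2));                // 0.57
console.log(Math.round(cumulativeCost(30) / turnCost(1))); // 124
```

Change the two token constants to match your own traffic and the multiplier moves with them.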

A few things follow:

The first 10 turns feel free. That's the trap. They're cheap enough to ignore, but with the numbers above the running total at turn 20 is roughly 3× the total at turn 10, and at turn 30 roughly 6×.

Long system prompts compound. A 2,000-token prompt is fine once, but by turn 30 you've sent it 30 times (60,000 input tokens).

Output is linear. Input is quadratic. Replies cost 5× more per token, but you only generate each one once. The history you resend every turn, so input dominates the bill as conversations get longer.

This is why an agent that costs pennies in dev can cost hundreds of dollars in production. Dev sessions are short, whereas real users have long ones.
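Here is a small sketch of the linear-versus-quadratic split, using the same assumed message sizes and prices as the quick math earlier. `splitCost` is a hypothetical helper, and `systemTokens` models a system prompt that is resent on every turn.

```javascript
// Same assumed numbers as the quick math: 30-token questions, 300-token replies.
function splitCost(turns, systemTokens = 0) {
  const userTokens = 30;
  const replyTokens = 300;
  let inputTokens = 0;
  for (let n = 1; n <= turns; n++) {
    // Each request carries the system prompt, the new question, and all history.
    inputTokens += systemTokens + userTokens + (n - 1) * (userTokens + replyTokens);
  }
  const outputTokens = turns * replyTokens;
  return {
    input: (inputTokens * 3) / 1e6,    // dollars, grows with the square of turns
    output: (outputTokens * 15) / 1e6, // dollars, grows linearly with turns
  };
}

for (const turns of [5, 10, 30]) {
  const { input, output } = splitCost(turns);
  const share = Math.round((100 * input) / (input + output));
  console.log(`${turns} turns: input is ${share}% of the bill`);
}

// Extra input cost of a 2,000-token system prompt over 30 turns:
console.log((splitCost(30, 2000).input - splitCost(30).input).toFixed(2)); // 0.18
```

Under these assumptions input is about a third of the bill at 5 turns, about half at 10, and about three quarters at 30.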

3. Three ways out

Three patterns, plus a bonus trick. Pick based on how long your conversations actually are.

Full history. Send everything, every turn. It's the simplest option, but costs grow fast. Fine for short chats (under ~10 turns). Don't optimize until you measure a real problem.

Sliding window. Keep only the last N turns. Per-turn cost stops growing once the window is full. The trade-off: the agent forgets early turns. Good for one-off tasks, bad when older context matters.

const recent = messages.slice(-9); // current question + previous 4 turns; an odd count keeps a user message first
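One detail worth guarding against: at request time the array ends with a user message, so its length is odd, and an even-sized slice would open on an assistant message. The sketch below assumes the API expects the first message to be a user message (check the current docs); `slidingWindow` is a hypothetical helper name.

```javascript
// Keeps the current question plus the previous (maxTurns - 1) full turns.
// Assumes messages alternate user/assistant and end with a user message.
function slidingWindow(messages, maxTurns = 5) {
  const keep = 2 * Math.max(1, maxTurns) - 1; // odd count, so the slice starts on a user message
  const recent = messages.slice(-keep);
  // Defensive: if the array was not strictly alternating, drop leading assistant messages.
  while (recent.length > 0 && recent[0].role !== "user") recent.shift();
  return recent;
}

// Usage: 8 questions, 7 replies so far (15 messages).
const history = [];
for (let i = 1; i <= 8; i++) {
  history.push({ role: "user", content: `q${i}` });
  if (i < 8) history.push({ role: "assistant", content: `a${i}` });
}
const windowed = slidingWindow(history, 5);
console.log(windowed.length, windowed[0].content); // 9 q4
```

The stored array is left untouched; only the copy that goes into the request is trimmed.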

Summarization. When the array gets long, use a cheaper model to compress old turns into a paragraph and replace them with the summary. Costs stay bounded and important context survives. You pay for one extra model call per compression, and the summary can miss things you wanted kept. Best for long sessions.

const messages = [
  { role: "user", content: "Earlier in our conversation: <summary>" },
  { role: "assistant", content: "Got it — continuing from there." },
  // turn 16 onwards
  { role: "user", content: "..." },
];

Put the summary in a user message, not the system prompt. That keeps the system prompt cacheable. The short assistant reply keeps the user/assistant alternation valid.
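A minimal sketch of the swap itself, assuming the summary text has already been produced by a separate, cheaper model call (not shown). `replaceWithSummary` and `keepLast` are hypothetical names for this illustration.

```javascript
// Replaces everything except the last `keepLast` messages with a summary pair.
function replaceWithSummary(messages, summaryText, keepLast = 4) {
  if (messages.length <= keepLast) return messages;
  let tail = messages.slice(messages.length - keepLast);
  // The summary pair ends with an assistant message, so the tail must open with a user message.
  while (tail.length > 0 && tail[0].role !== "user") tail = tail.slice(1);
  return [
    { role: "user", content: `Earlier in our conversation: ${summaryText}` },
    { role: "assistant", content: "Got it, continuing from there." },
    ...tail,
  ];
}

// Usage: 4 questions and 3 replies, compacted down to the most recent exchange.
const longChat = [
  { role: "user", content: "q1" }, { role: "assistant", content: "a1" },
  { role: "user", content: "q2" }, { role: "assistant", content: "a2" },
  { role: "user", content: "q3" }, { role: "assistant", content: "a3" },
  { role: "user", content: "q4" },
];
const compacted = replaceWithSummary(longChat, "we discussed q1 and q2", 4);
console.log(compacted.map((m) => m.role).join(","));
// user,assistant,user,assistant,user
```

Because the tail is trimmed to start on a user message, asking for 4 messages here keeps 3 (q3, a3, q4).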

There's no universal best pattern. Measure your real conversations, then pick.

4. Prompt caching

The details here are Anthropic-specific, but caching changes the math on pattern #1 a lot.

Mark a prefix of your messages as cacheable, and Anthropic stores the processed state for 5 minutes. Any call that reuses the same prefix pays only 10% of the input cost for the cached part. The first call pays a small premium (25% extra) to write the cache.

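// Request body: the Messages API takes the system prompt as a top-level
// `system` field, not as a message with role "system".
{
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [{
    type: "text",
    text: longSystemPrompt,
    cache_control: { type: "ephemeral" }
  }],
  messages
}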

The new cost formula:

cost = uncached_input × $3.00/M    // never cached
     + cache_creation × $3.75/M    // first call only, 1.25×
     + cache_read     × $0.30/M    // every later call, 0.10×
     + output         × $15.00/M

The response shows you what hit the cache: check usage.cache_creation_input_tokens and usage.cache_read_input_tokens. With caching on, usage.input_tokens counts only the uncached part.
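The formula translates directly into a helper that prices one response from its usage object. This is a sketch: the prices are the ones in the formula above, `requestCost` is a made-up name, the two sample usage objects are illustrative numbers rather than real API output, and it assumes `input_tokens` excludes cached tokens.

```javascript
// Prices in dollars per million tokens, copied from the formula above.
const PRICES = { input: 3.0, cacheWrite: 3.75, cacheRead: 0.3, output: 15.0 };

// usage is the `usage` object from a Messages API response.
function requestCost(usage) {
  const perMillion =
    (usage.input_tokens ?? 0) * PRICES.input +
    (usage.cache_creation_input_tokens ?? 0) * PRICES.cacheWrite +
    (usage.cache_read_input_tokens ?? 0) * PRICES.cacheRead +
    (usage.output_tokens ?? 0) * PRICES.output;
  return perMillion / 1e6;
}

// Illustrative numbers: a 2,000-token prefix written on the first call, read on a later one.
const firstCall = { input_tokens: 30, cache_creation_input_tokens: 2000, cache_read_input_tokens: 0, output_tokens: 300 };
const laterCall = { input_tokens: 360, cache_creation_input_tokens: 0, cache_read_input_tokens: 2000, output_tokens: 300 };

console.log(requestCost(firstCall).toFixed(5)); // 0.01209
console.log(requestCost(laterCall).toFixed(5)); // 0.00618
```

In chat.js this would replace the two-term cost line once caching is switched on, since the two-term version ignores the cache fields.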

A few things to know:

Long system prompts are close to free after the first call: cached tokens are billed at 10% of the input price. Cold start has more latency (the prefix has to be processed once), but every call after that is cheap.

You can cache message history too. As long as you only append new messages, the cached prefix stays valid.

Where you put the cache marker decides what stays cached. Cache the system prompt only → sliding window works fine. Cache deeper into the messages → any earlier edit (summarization, window drop) breaks the cache from that point.

The cache expires after 5 minutes of no activity. There's also a 1-hour tier (in beta at the time of writing) if you need it.

This is why "just send full history" works better on Anthropic than on most APIs. Cache the system prompt, append messages as they come, and pattern #1 often beats pattern #2 or #3 for sessions under an hour.
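One way to cache the growing history is to put a marker on the last message of each request. The sketch below is an assumption about how you might wire that up, not the only layout: `markLastForCaching` is a hypothetical helper, and it relies on message content being allowed as an array of text blocks that carry `cache_control`.

```javascript
// Returns a copy of messages with a cache breakpoint on the final message.
// String content is wrapped in a text block; block-array content is left untouched.
function markLastForCaching(messages) {
  if (messages.length === 0) return messages;
  const last = messages[messages.length - 1];
  if (typeof last.content !== "string") return messages;
  const marked = {
    role: last.role,
    content: [
      { type: "text", text: last.content, cache_control: { type: "ephemeral" } },
    ],
  };
  return [...messages.slice(0, -1), marked];
}

// Usage: mark only the outgoing copy, never the stored array.
const stored = [
  { role: "user", content: "hi" },
  { role: "assistant", content: "hello" },
  { role: "user", content: "tell me more" },
];
const outgoing = markLastForCaching(stored);
console.log(JSON.stringify(outgoing[2].content[0].cache_control)); // {"type":"ephemeral"}
```

Marking a copy matters because the number of breakpoints per request is limited; if the marker were written into the stored array, every turn would add another one.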

5. The whole thing, in ~50 lines

Here's chat.js. It reads input in a loop, keeps a messages array, calls the API each turn, and prints the running cost. Use --strategy=window to keep only the last 5 turns. (It's non-streaming for simplicity. Swap in the SSE loop from Post 2 for live token output.)

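// Run as an ES module (for example "type": "module" in package.json) on Node 18+ for global fetch.
import readline from "node:readline";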

const strategy = (process.argv.find(a => a.startsWith("--strategy="))
                  ?.split("=")[1]) ?? "full";

const messages = [];
let turn = 0, inTokens = 0, outTokens = 0;

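function prepare() {
  // Window: the current question plus the previous 4 turns. An odd count keeps
  // a user message first; a window that opens on an assistant message may be rejected.
  if (strategy === "window") return messages.slice(-9);
  return messages;
}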

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

function ask() {
  rl.question("you: ", async (text) => {
    if (!text.trim()) return ask();
    if (text === "/exit") return rl.close();

    messages.push({ role: "user", content: text });

    const res = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "x-api-key": process.env.ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        messages: prepare(),
      }),
    });

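    const data = await res.json();
    if (!res.ok) {
      // API errors come back as JSON with an `error` object instead of `content`.
      console.error(`API error ${res.status}: ${data.error?.message ?? "unknown"}`);
      messages.pop(); // drop the unanswered question so the array stays alternating
      return ask();
    }
    const reply = data.content[0].text;
    messages.push({ role: "assistant", content: reply });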

    turn += 1;
    inTokens  += data.usage.input_tokens;
    outTokens += data.usage.output_tokens;
    const cost = (inTokens / 1e6) * 3 + (outTokens / 1e6) * 15;

    console.log(`\nclaude: ${reply}`);
    console.log(`[turn ${turn} · ${inTokens} in / ${outTokens} out · $${cost.toFixed(4)}]\n`);
    ask();
  });
}

ask();

ANTHROPIC_API_KEY=sk-... node chat.js
ANTHROPIC_API_KEY=sk-... node chat.js --strategy=window

Run the code above, have a real conversation, and watch the "in" number climb every turn. That number is your bill.

To know the token count before sending, use /v1/messages/count_tokens. No inference call is required. Useful for gating, trimming, or logging.
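A sketch of what that call could look like, reusing the headers from chat.js. The helper names are made up, and the response shape (`{ input_tokens: <number> }`) is an assumption to verify against the API reference.

```javascript
// Builds the request for the token-counting endpoint; nothing is sent here.
function buildCountTokensRequest(messages, model = "claude-sonnet-4-5") {
  return {
    url: "https://api.anthropic.com/v1/messages/count_tokens",
    options: {
      method: "POST",
      headers: {
        "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Sends the request and returns the count. Assumes Node 18+ for global fetch
// and a response body of the form { input_tokens: <number> }.
async function countTokens(messages) {
  const { url, options } = buildCountTokensRequest(messages);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`count_tokens failed: ${res.status}`);
  return (await res.json()).input_tokens;
}
```

In chat.js you could await `countTokens(prepare())` before the main request and switch to the window strategy, or trigger a summary, once the count passes a budget you choose.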

Three things to try before the next post

Run a 15-turn conversation and watch the input token count grow.
Switch to --strategy=window and have the same chat, then at turn 14 ask about something from turn 2. The agent will have no idea. That's the trade-off.
Add cache_control to your system prompt. Run 5 turns. Check usage.cache_read_input_tokens. It should light up from turn 2 onwards, provided the prompt is long enough to be cached. This one field can be the difference between a $50/month feature and a $500/month one. 🚀

What's next

TinyAgent has memory now. Next in the series, the agent learns to do things by running functions, not just talk.

Happy Coding! 👩‍💻

Top comments (18)

Alex Shev • Jun 12

Visualizing the messages array is helpful because most bugs in LLM apps are not in the model call itself. They are in what got included, in what order, and under which role. A small animation can make that state visible in a way a JSON dump usually does not, especially for people new to chat-based architectures.

Jasmin Virdi • Jun 12

Thanks @alexshev !

Exactly, the thoughts I had in mind while writing this series

Alex Shev • Jun 13

Exactly. The messages array is where small misunderstandings turn into real behavior. Visualizing it makes the contract feel concrete: roles, order, context, and tool output all become things you can inspect instead of vague prompt magic.

Manuel Bruña • Jun 15

The messages array looks simple until you debug a real agent trace. Keeping roles, tool calls, and intermediate reasoning boundaries clear makes later failures much easier to inspect. Bad message hygiene becomes bad system behavior.

Jasmin Virdi • Jun 16

Thanks for adding that, @tecnomanu

Manuel Bruña • Jul 9

Thanks for the reply. The main point for me is that message arrays become useful when they preserve roles and tool boundaries clearly. Once those blur, agent debugging gets much harder.

Mudassir Khan • Jun 16

the "long system prompts compound" line is the one most teams hit hardest tbh. in our prod RAG setup we inject retrieved docs into the system prompt — at 3k tokens average, that's 90k input tokens just from the system message by turn 30, before any conversation happens.

went full sliding window after that. trade off is the agent loses early context on long sessions, so we snapshot key facts into a context_summary block instead of pure recency truncation.

curious if you're doing hard truncation or summarization to handle the curve?

Marcus Chen • Jun 11

Nice way to make the state visible. The messages array gets real interesting in voice, where every turn you keep is latency you pay at synthesis time. We ended up summarizing everything older than a few turns into one system-side note and keeping only the recent turns verbatim, which felt wrong until we measured that response quality barely moved and time-to-first-token dropped noticeably. The array is a budget, not just a log

Jasmin Virdi • Jun 11

Thanks @realmarcuschen

That's a great analysis. How much difference did you observe in cost and output rendering time?

Mininglamp • Jun 11

The messages array abstraction works fine for simple chatbots but starts breaking when you need agents that maintain state across tool calls. The moment you have parallel tool execution or need to inject system context mid-conversation, the linear array model gets awkward fast. Most agent frameworks end up building a graph on top of it anyway.

Jasmin Virdi • Jun 11

Thanks @mininglamp

Nice addition! I'll definitely include this point in the upcoming post on tool calls.

Yves Jutard • Jun 10

Exactly the questions I had today. Thanks for this write-up + the cool illustrations 👏

Jasmin Virdi • Jun 10

Thanks @yvem

Sloan the DEV Moderator • Jun 9

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

Nazar Boyko • Jun 10

Great series! 👏
The caching section is the part most agent tutorials skip entirely. Worth adding: you're not limited to a single cache marker. Anthropic gives you up to 4 cache_control breakpoints, so a common layout is one on the tools, one on the system prompt, one on the last message. Since a read checks for the longest matching cached prefix, append-only history keeps hitting the cache without you re-marking every turn. The gotcha you nailed is the real one though! Any edit before a breakpoint (a window drop, a summary swap) busts everything after it, which is exactly why window and caching fight each other.