This is the third post in the Building TinyAgent series, where we build a small agent from scratch in Node.js with no frameworks, just API calls.
Post 1 made one API call. Post 2 streamed the response. This post turns those calls into a conversation.
It sounds simple, but the moment your agent has memory, your bill starts growing fast, and almost no tutorial warns you about it.
Let's see why.
1. The array grows
The API is stateless: it remembers nothing between calls. So you keep a messages array yourself and send the whole thing on every call.
Each turn adds two messages: the user's question and the model's reply. So:
Turn 1 → 1 message in array, 1 sent
Turn 2 → 3 messages in array, 3 sent
Turn 5 → 9 messages in array, 9 sent
Turn 10 → 19 messages in array, 19 sent
Turn 10 sends 19 messages, not 1. Every prior turn is in the request.
There is no "memory" on the server. It is just an array, and you are the one holding it.
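Here's a minimal sketch of that growth. It only simulates the bookkeeping, with no API calls, and `fakeReply` is a hypothetical stand-in for the model's answer:

```javascript
// Simulate the bookkeeping only; no network calls are made.
const messages = [];
const sentPerTurn = [];

// Hypothetical stand-in for the model's reply.
function fakeReply(turn) {
  return { role: "assistant", content: `reply ${turn}` };
}

for (let turn = 1; turn <= 10; turn++) {
  messages.push({ role: "user", content: `question ${turn}` });
  sentPerTurn.push(messages.length); // the whole array goes out with each request
  messages.push(fakeReply(turn));
}

console.log(sentPerTurn); // [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

Turn n sends 2n - 1 messages: every earlier question and reply, plus the new question.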
2. The cost curve
Here's where it hurts.
A 30-turn conversation isn't 30× the cost of one turn. It's more than 100×, because every turn pays for every turn before it.
Quick math. Say each user message is ~30 tokens and each reply is ~300 tokens. Using Claude Sonnet 4.5 pricing ($3 per million input, $15 per million output):
Turn 1 input: 30 tokens → $0.005 (this turn)
Turn 10 input: 3,000 tokens → $0.014 (this turn)
Turn 30 input: 9,600 tokens → $0.033 (this turn)
Cumulative through turn 1: ~$0.005
Cumulative through turn 30: ~$0.57
ratio: ~120×
Your exact multiplier depends on how long your messages are.
The cost curve isn't a line. It's a curve that bends sharply upward and every new turn makes every future turn more expensive too.
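If you want to check the numbers yourself, here's a small sketch of the arithmetic. It assumes the same figures as the quick math (about 30 tokens per question, 300 per reply, $3 and $15 per million tokens); the function names are just for illustration:

```javascript
// Assumptions from the text: ~30 tokens per user message, ~300 per reply,
// $3 per million input tokens, $15 per million output tokens.
const USER_TOKENS = 30;
const REPLY_TOKENS = 300;
const INPUT_PRICE = 3 / 1e6;
const OUTPUT_PRICE = 15 / 1e6;

// Input for turn n = every earlier question and reply, plus the new question.
function inputTokens(turn) {
  return (turn - 1) * (USER_TOKENS + REPLY_TOKENS) + USER_TOKENS;
}

// Cost of one turn: the resent history plus one freshly generated reply.
function turnCost(turn) {
  return inputTokens(turn) * INPUT_PRICE + REPLY_TOKENS * OUTPUT_PRICE;
}

// Total spent from turn 1 through the given turn.
function cumulativeCost(turns) {
  let total = 0;
  for (let t = 1; t <= turns; t++) total += turnCost(t);
  return total;
}

for (const t of [1, 10, 30]) {
  console.log(
    `turn ${t}: ${inputTokens(t)} input tokens, ` +
    `$${turnCost(t).toFixed(4)} this turn, $${cumulativeCost(t).toFixed(4)} so far`
  );
}
```

Change the two token constants to match your own traffic and the multiplier moves with them.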
A few things follow:
The first 10 turns feel free. That's the trap. They're cheap enough to ignore until turn 20, when the bill suddenly jumps.
Long system prompts compound. A 2,000-token prompt is fine once but by turn 30, you've sent it 30 times.
Output is linear. Input is quadratic. Replies cost 5× more per token, but you only generate them once. You resend the history every turn, so input dominates the bill as conversations get longer.
This is why an agent that costs pennies in dev costs hundreds of dollars in production. Dev sessions are short whereas real users have long ones.
3. Three ways out
Three patterns, plus a bonus trick. Pick based on how long your conversations actually are.
Full history. Send everything, every turn. It's the simplest approach, but costs grow fast. Fine for short chats (under ~10 turns). Don't optimize until you measure a real problem.
Sliding window. Keep only the last N turns. Cost stays flat forever. The trade-off: the agent forgets early turns. Good for one-off tasks, bad when older context matters.
const recent = messages.slice(-10); // last 5 turns
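One thing to watch: if you slice right after pushing the new user message, the array has an odd length, so an even-sized window starts with an assistant reply. Here's a slightly more defensive sketch, assuming the API wants the first message to come from the user (`slidingWindow` is a hypothetical helper name):

```javascript
// Keep the last `maxMessages` messages, then drop a leading assistant reply
// so the window always opens with a user message.
function slidingWindow(messages, maxMessages = 10) {
  const recent = messages.slice(-maxMessages);
  return recent[0]?.role === "assistant" ? recent.slice(1) : recent;
}

// 11 messages: 5 finished turns plus the question for turn 6.
const history = [];
for (let turn = 1; turn <= 6; turn++) {
  history.push({ role: "user", content: `question ${turn}` });
  if (turn < 6) history.push({ role: "assistant", content: `reply ${turn}` });
}

const windowed = slidingWindow(history);
console.log(windowed.length, windowed[0].content); // 9 question 2
```

The window ends up one message shorter than asked for, which is usually a fair price for a request that always starts on a user turn.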
Summarization. When the array gets long, use a cheaper model to compress old turns into a paragraph and replace them with the summary. Costs stay bounded and important context survives. You pay for one extra model call per compression, and the summary can miss things you wanted to keep. Best for long sessions.
const messages = [
  { role: "user", content: "Earlier in our conversation: <summary>" },
  { role: "assistant", content: "Got it, continuing from there." },
  // turn 16 onwards
  { role: "user", content: "..." },
];
Put the summary in a user message, not in the system prompt. That keeps the system prompt unchanged and therefore cacheable. The short assistant reply keeps the user/assistant alternation valid.
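Here's one way the compaction step could look. Treat it as a sketch: `compact` and its thresholds are made up for illustration, and `summarize` is a hypothetical async function you supply (for example, a call to a cheaper model) that takes old messages and returns a string. It assumes the array strictly alternates user/assistant, starting with a user message.

```javascript
// Compact old turns into a summary once the history passes a threshold.
// `summarize` is a hypothetical async function: (messages) => Promise<string>.
async function compact(messages, summarize, { maxMessages = 30, keep = 10 } = {}) {
  if (messages.length <= maxMessages) return messages;

  // Cut on an even index so the kept tail starts with a user message.
  let cut = messages.length - keep;
  if (cut % 2 !== 0) cut -= 1;
  if (cut <= 0) return messages;

  const summary = await summarize(messages.slice(0, cut));
  return [
    { role: "user", content: `Earlier in our conversation: ${summary}` },
    { role: "assistant", content: "Got it, continuing from there." },
    ...messages.slice(cut),
  ];
}
```

You would call it just before each request, something like `const toSend = await compact(messages, summarize)`, and keep the original array around if you ever need the full transcript.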
There's no universal best pattern. Measure your real conversations, then pick.
4. Prompt caching
Anthropic-specific, but it changes the math on pattern #1 a lot.
Mark a prefix of your messages as cacheable, and Anthropic stores the processed state for 5 minutes. Any call that reuses the same prefix pays only 10% of the input cost for the cached part. The first call pays a small premium (25% extra) to write the cache.
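// In the request body, next to model and messages.
// The Messages API takes the system prompt as a top-level field, not as a message.
system: [
  {
    type: "text",
    text: longSystemPrompt,
    cache_control: { type: "ephemeral" },
  },
],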
The new cost formula:
cost = uncached_input × $3.00/M   // never cached
     + cache_creation × $3.75/M   // first call only, 1.25×
     + cache_read     × $0.30/M   // every later call, 0.10×
     + output         × $15.00/M
The response shows you what hit the cache: check usage.cache_creation_input_tokens and usage.cache_read_input_tokens.
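Here's a rough sketch of that formula as a function over the response's usage object. The prices are the Sonnet figures used in this post, and `costFromUsage` is a hypothetical helper name:

```javascript
// Prices in dollars per million tokens, taken from the formula in this post.
const PRICES = { input: 3.0, cacheWrite: 3.75, cacheRead: 0.3, output: 15.0 };

// `usage` is the usage object from a Messages API response.
// Missing or null cache fields are treated as zero.
function costFromUsage(usage) {
  const input = usage.input_tokens ?? 0;
  const cacheWrite = usage.cache_creation_input_tokens ?? 0;
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  const output = usage.output_tokens ?? 0;
  return (
    input * PRICES.input +
    cacheWrite * PRICES.cacheWrite +
    cacheRead * PRICES.cacheRead +
    output * PRICES.output
  ) / 1e6;
}

// First call writes a 2,000-token prompt to the cache; the second call reads it.
const firstCall = { input_tokens: 30, cache_creation_input_tokens: 2000, cache_read_input_tokens: 0, output_tokens: 300 };
const laterCall = { input_tokens: 30, cache_creation_input_tokens: 0, cache_read_input_tokens: 2000, output_tokens: 300 };
console.log(costFromUsage(firstCall).toFixed(5), costFromUsage(laterCall).toFixed(5));
```

In this made-up example the later call costs less than half of the first one, and almost all of what remains is output.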
A few things to know:
Long system prompts are basically free after the first call. Cold start has more latency (the prefix has to be processed once) but every call after is cheap.
You can cache message history too. As long as you only append new messages, the cached prefix stays valid.
Where you put the cache marker decides what stays cached. Cache the system prompt only → sliding window works fine. Cache deeper into the messages → any earlier edit (summarization, window drop) breaks the cache from that point.
The cache expires after 5 minutes of no activity. There's a 1-hour tier in beta if you need it.
This is why "just send full history" works better on Anthropic than on most APIs. Cache the system prompt, append messages as they come, and pattern #1 often beats pattern #2 or #3 for sessions under an hour.
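To cache the history as well as the system prompt, one option is to put a cache marker on the last message of each request. Here's a sketch, assuming every message's content is a plain string as in this series (`withCacheBreakpoint` is a hypothetical helper name):

```javascript
// Return a copy of `messages` with a cache marker on the final message.
// Assumes each message's content is a plain string.
function withCacheBreakpoint(messages) {
  if (messages.length === 0) return messages;
  const last = messages[messages.length - 1];
  return [
    ...messages.slice(0, -1),
    {
      role: last.role,
      // A string content becomes a single text block so it can carry cache_control.
      content: [
        { type: "text", text: last.content, cache_control: { type: "ephemeral" } },
      ],
    },
  ];
}

const marked = withCacheBreakpoint([
  { role: "user", content: "hello" },
  { role: "assistant", content: "hi there" },
  { role: "user", content: "what did I just say?" },
]);
console.log(JSON.stringify(marked[2], null, 2));
```

Because it returns a copy, your stored array stays as plain strings and the marker simply moves forward with each new turn.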
5. The whole thing, in ~50 lines
Here's chat.js. It reads input in a loop, keeps a messages array, calls the API each turn, and prints the running cost. Use --strategy=window to keep only the last 5 turns. (Non-streaming for simplicity; swap in the SSE loop from Post 2 for live token output.)
import readline from "node:readline";

// --strategy=full (default) or --strategy=window
const strategy = process.argv
  .find((a) => a.startsWith("--strategy="))
  ?.split("=")[1] ?? "full";

const messages = [];
let turn = 0, inTokens = 0, outTokens = 0;

// Pick the messages to send this turn.
function prepare() {
  if (strategy !== "window") return messages;
  const recent = messages.slice(-10); // roughly the last 5 turns
  // The request should open with a user message, so drop a leading assistant reply.
  return recent[0].role === "assistant" ? recent.slice(1) : recent;
}

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

function ask() {
  rl.question("you: ", async (text) => {
    if (!text.trim()) return ask();
    if (text === "/exit") return rl.close();

    messages.push({ role: "user", content: text });

    const res = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "x-api-key": process.env.ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        messages: prepare(),
      }),
    });
    const data = await res.json();

    // On an API error, report it and drop the unanswered user message
    // so the array keeps alternating user/assistant.
    if (!res.ok) {
      console.error(`API error ${res.status}: ${data.error?.message ?? "unknown error"}`);
      messages.pop();
      return ask();
    }

    const reply = data.content[0].text;
    messages.push({ role: "assistant", content: reply });

    // input_tokens already covers the whole history sent this turn.
    turn += 1;
    inTokens += data.usage.input_tokens;
    outTokens += data.usage.output_tokens;
    const cost = (inTokens / 1e6) * 3 + (outTokens / 1e6) * 15;

    console.log(`\nclaude: ${reply}`);
    console.log(`[turn ${turn} · ${inTokens} in / ${outTokens} out · $${cost.toFixed(4)}]\n`);
    ask();
  });
}

ask();
ANTHROPIC_API_KEY=sk-... node chat.js
ANTHROPIC_API_KEY=sk-... node chat.js --strategy=window
Run it, have a real conversation, and watch the "in" number climb every turn. That number is your bill.
To know the token count before sending, use /v1/messages/count_tokens. It requires no inference call and is useful for gating, trimming, or logging.
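Here's a small sketch of that call. It reuses the headers from chat.js and assumes the endpoint answers with an `input_tokens` field; `countTokens` is a hypothetical helper, and its `fetchImpl` option exists only so you can try the function without hitting the network:

```javascript
// Count input tokens without running inference.
// `fetchImpl` is injectable so the function can be exercised offline.
async function countTokens(messages, { model = "claude-sonnet-4-5", fetchImpl = fetch } = {}) {
  const res = await fetchImpl("https://api.anthropic.com/v1/messages/count_tokens", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  });
  if (!res.ok) throw new Error(`count_tokens failed with status ${res.status}`);
  const data = await res.json();
  return data.input_tokens;
}
```

A typical use is a gate before each request: count first, and if the result is over your budget, trim or summarize before sending.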
#Three things to try before the next post
Run a 15-turn conversation and watch the input token count grow.
Switch to --strategy=window and have the same chat. At turn 14, ask about something from turn 2. The agent will have no idea; that's the trade-off.
Add cache_control to your system prompt. Run 5 turns. Check usage.cache_read_input_tokens. It lights up from turn 2 onwards, as long as the prompt is long enough to be cacheable. This one field is the difference between a $50/month feature and a $500/month one.
#What's next
TinyAgent has memory now. Next in the series, we teach the agent to do things by running functions, not just talk.
Happy Coding!



