Why Your Production Agent Costs 5× More on Mondays (And the Fix)

Gabriel Anhaia


You open the cost dashboard on Tuesday morning. The Monday bar is taller than the rest of the week put together. Closer to five times the Wednesday-through-Friday average — not double, not triple. Saturday and Sunday are flat lines near the floor. Then Monday lights up and stays lit until late afternoon, after which the curve falls back to whatever you priced the feature at.

A team I talked to spent two weeks on this. They thought it was a regression. The model had not changed, the prompts had not changed, the rate limits had not changed. The thing that had changed was the shape of the traffic on Monday morning, and five different causes were stacking on top of each other in a way that none of them noticed individually.

Each cause has a detection signal in the usage field your provider already returns, plus a fix that does not require ripping out your agent.

The five suspects, in priority order

Cost spikes that follow a calendar are almost always one of these. Run through the list top to bottom because the cheap detections come first.

  1. Cron-scheduled batch jobs that fire at 9 AM Monday.
  2. A Sunday-night deploy that invalidated the prompt cache.
  3. Longer conversations because users return to half-finished sessions.
  4. More retries because a downstream API is flaky after weekend maintenance.
  5. Bigger inputs because Monday is when people import the weekend's reports.

The order matters. A cron job is one query away. A cache miss takes a hit-rate calculation. The other three need traces. Walk down the list, do not jump.

Cause 1: a cron job you forgot about

Someone, somewhere on your team, scheduled a job that touches the agent on a cadence. Maybe it is a weekly report generator, or a "catch up on the weekend" digest. Sometimes it is a customer-side automation you do not own. Whichever it is, it fires at 9 AM Monday on a clean cron line and chews through tokens before any human is awake.

Detection is one query against your traces. Group by minute-of-day on Mondays only and look for a wall:

select
  date_trunc('minute', start_time) as minute,
  sum(input_tokens + output_tokens) as tokens
from spans
where service_name = 'agent'
  and extract(dow from start_time) = 1
  and start_time >= now() - interval '4 weeks'
group by 1
order by tokens desc
limit 20;

If the top of that list is a single minute (09:00:00, 09:05:00, 09:15:00) with twenty times the tokens of the surrounding minutes, you have a cron job. The fix is not to make the agent cheaper. The fix is to find the scheduler and either spread the job across the morning (0 9-12 * * 1 instead of 0 9 * * 1, with each run taking a slice of the batch) or move it off the agent entirely if it does not need an LLM. A weekly digest of the past week's tickets does not need a model on the hot path; it needs a SQL query and a template.
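A sketch of what "off the agent" can look like, assuming a hypothetical tickets table and psycopg2 (the table, columns, and connection string are illustrative, not from any real schema):

import psycopg2
from string import Template

# One Postgres query and one template replace the LLM on the hot path.
DIGEST = Template("Weekly digest: $total tickets opened, $resolved resolved.")

def weekly_digest(conn) -> str:
    with conn.cursor() as cur:
        cur.execute("""
            select count(*),
                   count(*) filter (where status = 'resolved')
            from tickets
            where created_at >= now() - interval '7 days'
        """)
        total, resolved = cur.fetchone()
    return DIGEST.substitute(total=total, resolved=resolved)

# weekly_digest(psycopg2.connect("dbname=support")) runs in milliseconds
# and costs zero tokens.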

Cause 2: the prompt cache went cold over the weekend

If your agent uses prompt caching, your effective cost on a cache-hot prompt is a fraction of the cost on a cache-cold one. A Sunday-night deploy that touched the system prompt, the tool list, the model identifier, or the cache breakpoints invalidates that cache for every conversation that starts on Monday morning. You pay full input price for the first call of every fresh session until the cache warms up again.

You can see this in the four token classes Anthropic returns on every response (see the prompt-caching docs). The relevant ones are cache_read_input_tokens (cheap, roughly a tenth of base input price) and cache_creation_input_tokens (more expensive than fresh input, roughly 1.25×). Cache hit rate by hour, plotted against day of week, tells the story:

select
  date_trunc('hour', start_time) as hour,
  extract(dow from start_time) as day_of_week,
  sum(cache_read_input_tokens)::float
    / nullif(sum(cache_read_input_tokens
       + cache_creation_input_tokens
       + input_tokens), 0) as hit_rate
from spans
where service_name = 'agent'
  and start_time >= now() - interval '4 weeks'
group by 1, 2
order by 1;

If Monday morning hit rate is 0.05 and Friday afternoon hit rate is 0.85, your cache is being invalidated by your release cadence. The fix is one of: stop deploying on Sunday night, restructure the prompt so the volatile parts come after the cache breakpoint instead of before, or run a one-shot warmer that re-establishes the cache for your top N system prompts before the Monday traffic arrives.
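If you go the warmer route, here is a minimal sketch against the Anthropic Messages API. MODEL and SYSTEM_PROMPTS are placeholders for your own values, and note that the default ephemeral cache expires after roughly five minutes, so the warmer has to run on a schedule just ahead of the Monday traffic, not once overnight:

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"     # placeholder: use your production model
SYSTEM_PROMPTS: list[str] = []  # fill with your top N system prompts, verbatim

def warm_cache() -> None:
    for prompt in SYSTEM_PROMPTS:
        # A one-token request is enough to write the cache entry.
        # cache_control must sit on the exact block your agent sends,
        # or the warmed prefix will not match.
        client.messages.create(
            model=MODEL,
            max_tokens=1,
            system=[{
                "type": "text",
                "text": prompt,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )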

Cause 3: users return to long-running conversations

Some products keep conversation history across sessions. The user closes the tab on Friday at 5 PM with a thread that has eight turns in it. They open it again on Monday at 9 AM and the next message is turn nine. That ninth turn carries every previous turn in its messages[], which means the input cost of each user's first Monday call is roughly proportional to how much history they accumulated before the weekend.

Detection is input_tokens distribution by day of week and turn position:

select
  extract(dow from start_time) as day_of_week,
  case
    when turn_index = 1 then 'first'
    when turn_index <= 5 then 'early'
    else 'deep'
  end as turn_bucket,
  avg(input_tokens) as avg_input
from spans
where service_name = 'agent'
group by 1, 2
order by 1, 2;

If Monday's deep row is much larger than other days' deep rows, users are resuming. The fix is conversation summarisation. After turn N, replace the oldest K turns with a model-generated summary in a single message, and continue from there. The break-even point depends on your prompt structure. Measure it before turning it on. A conversation that gets summarised at turn 10 and again at turn 20 typically costs less than one that drags every original turn forward forever.
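A sketch of that rollover, assuming plain-string message content and an Anthropic-style client (the thresholds are illustrative; tune them against your own break-even measurement):

def maybe_summarise(messages, client, model,
                    max_turns=20, keep_recent=6):
    # Once the thread passes max_turns, fold everything except the
    # newest keep_recent turns into a single summary message.
    # Pick keep_recent so the retained slice still starts with a
    # user turn, keeping the role alternation the API expects.
    if len(messages) <= max_turns:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=model,
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": "Summarise this conversation in under 200 words, "
                       "keeping decisions and open questions:\n\n" + transcript,
        }],
    ).content[0].text
    return [{"role": "user",
             "content": "[Summary of earlier turns] " + summary}] + recent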

Cause 4: retries against a flaky downstream API

If your agent calls tools, and one of those tools is a downstream API that runs maintenance over the weekend, Monday morning is when that API is least healthy. Every failed tool call that the agent retries costs you the model's input tokens for the retry turn. A 30 percent tool error rate on Monday morning, against a 2 percent rate the rest of the week, can quietly inflate the agent's bill by a meaningful multiple.

Detection: count tool spans with non-success status, grouped by tool and day:

select
  extract(dow from start_time) as day_of_week,
  attributes->>'tool.name' as tool,
  count(*) filter (
    where status_code = 'ERROR'
  )::float / count(*) as error_rate
from spans
where span_kind = 'CLIENT'
  and parent_service = 'agent'
  and start_time >= now() - interval '4 weeks'
group by 1, 2
having count(*) > 100
order by 1, 3 desc;

If a specific tool's error rate is 5× higher on Monday than on Thursday, the fix is in the tool layer, not the agent. Add a circuit breaker around that tool, surface a clean "this lookup is unavailable" tool result to the model instead of retrying, and let the agent route around it for the morning. Retrying a dead API with the model in the loop is the most expensive way to discover the API is dead.
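A minimal breaker sketch (the threshold, cooldown, and fallback message are all illustrative):

import time

class ToolBreaker:
    # Opens after `threshold` consecutive failures; while open, the
    # tool returns a clean "unavailable" result instead of calling out,
    # so the model stops burning retry turns on a dead dependency.
    def __init__(self, threshold=5, cooldown_seconds=300):
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                return {"error": "this lookup is unavailable right now"}
            self.opened_at = None  # cooldown elapsed, try again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            return {"error": "this lookup is unavailable right now"}
        self.failures = 0
        return result

# Wrap the flaky tool once, e.g. breaker.call(lookup, customer_id),
# where lookup is whatever HTTP call the tool makes today.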

Cause 5: the weekend's data finally lands on Monday

Monday morning is when people upload reports. The PDF that summarises last week, the CSV exported from the warehouse on Sunday night, a digest the support team pasted in from a Slack channel. Whichever it is, the average prompt size on Monday is bigger because the average input is bigger, and you are paying for tokens that did not exist on Friday.

Detection looks like Cause 3, except the bucket is the first turn of a conversation rather than the deep turns:

select
  extract(dow from start_time) as day_of_week,
  percentile_cont(0.5) within group (
    order by input_tokens
  ) as p50_input,
  percentile_cont(0.95) within group (
    order by input_tokens
  ) as p95_input
from spans
where service_name = 'agent'
  and turn_index = 1
group by 1
order by 1;

If Monday's p95 first-turn input is 3× the rest of the week's, users are pasting in larger documents. The fix is one of: pre-process the document outside the model (extract the table you actually need with a parser), apply a length cap with a clean error message, or charge for the tokens at the product layer if the user is on a metered plan.
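As a sketch of the pre-processing option, assuming the weekend upload is a CSV and the agent only needs a few columns (the column names and the cap are hypothetical; calibrate the cap against your own p95):

import csv
import io

NEEDED = ("ticket_id", "status", "resolution_hours")  # hypothetical columns
MAX_PROMPT_CHARS = 50_000  # illustrative cap, not a recommendation

def shrink_report(raw_csv: str) -> str:
    # Keep only the columns the agent actually reasons over,
    # then enforce the length cap with a clean error.
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=NEEDED)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row.get(k, "") for k in NEEDED})
    slimmed = out.getvalue()
    if len(slimmed) > MAX_PROMPT_CHARS:
        raise ValueError("Report too large for chat; trim it to the rows you need.")
    return slimmed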

The dashboard query that ties it together

When you want one query that surfaces "is today expensive, and which suspect should I check first," break cost out by day of week across the four token classes:

select
  extract(dow from start_time) as dow,
  count(*) as calls,
  avg(input_tokens) as avg_input,
  avg(output_tokens) as avg_output,
  avg(cache_read_input_tokens) as avg_cache_hit,
  avg(cache_creation_input_tokens)
    as avg_cache_write,
  sum(input_tokens
    + output_tokens
    + cache_creation_input_tokens
    + cache_read_input_tokens) as total_tokens
from spans
where service_name = 'agent'
  and start_time >= now() - interval '4 weeks'
group by 1
order by 1;

Read it column by column:

  • total_tokens up with avg_input flat and calls up → Cause 1.
  • avg_cache_write up with avg_cache_hit collapsed → Cause 2.
  • avg_input up only on deep turns → Cause 3.
  • calls per conversation up → Cause 4.
  • avg_input up on first turns → Cause 5.

The OTel side is the same numbers, recorded as histograms on the agent span: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, plus two custom attributes for the cache fields. Once they are histograms, the day-of-week breakdown is a Grafana query against your metrics backend, not a database query against your trace store, and you can wire alerts off the p95 of each one.
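A sketch of that recording path with the Python OTel metrics API. The first two names follow the GenAI semantic conventions; the two cache histograms are custom names I am assuming here, since the semconv does not define the cache token classes:

from opentelemetry import metrics

meter = metrics.get_meter("agent")

input_tokens = meter.create_histogram(
    "gen_ai.usage.input_tokens", unit="{token}")
output_tokens = meter.create_histogram(
    "gen_ai.usage.output_tokens", unit="{token}")
cache_read = meter.create_histogram(
    "llm.usage.cache_read_input_tokens", unit="{token}")      # custom name
cache_write = meter.create_histogram(
    "llm.usage.cache_creation_input_tokens", unit="{token}")  # custom name

def record_usage(usage, model: str) -> None:
    # `usage` is the provider's usage object from each response.
    attrs = {"gen_ai.request.model": model}
    input_tokens.record(usage.input_tokens, attrs)
    output_tokens.record(usage.output_tokens, attrs)
    cache_read.record(usage.cache_read_input_tokens or 0, attrs)
    cache_write.record(usage.cache_creation_input_tokens or 0, attrs)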

What a good week looks like afterwards

Monday is still going to be your highest day. People work on Mondays. Aim for Monday at 1.2× to 1.5× a midweek day. Anything past 2× without a known business reason is one of these five suspects, and the table above is the order to walk through them.

The agent does not know what your bill looks like. The cron job does not know either. Your dashboard does, and the five queries above turn that into a triage path you can run in fifteen minutes the next time someone asks why Monday cost what it cost.


If this was useful

The LLM Observability Pocket Guide walks through the OTLP attributes the queries above lean on (gen_ai.usage.*, the four cache token classes, span-kind conventions for tool calls), and how to build the day-of-week dashboard on top of whatever tracing backend you already have. The chapters on cost attribution and span-level token accounting pair directly with the triage list above.

LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
