Jamie Cole

I Ran the Same LLM Prompt Every Day for 30 Days: Here's What Changed

I ran the exact same prompt every day for 30 days on GPT-4o. I measured output similarity, cost, and latency. Here is what I found.

The Experiment

  • Same prompt: "Summarize this: [random 500-word article]"
  • Same model: GPT-4o
  • Same 10 articles per day
  • 30 days of data
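The daily loop itself was simple. Here is a minimal sketch of what such a harness might look like; the file names and the `summarize` stub are hypothetical, standing in for the real article corpus and the GPT-4o call:

```python
# Hypothetical harness: file names and the summarize stub stand in for
# the real article corpus and the GPT-4o call.
ARTICLES = [f"article_{i}.txt" for i in range(10)]

def run_daily_batch(summarize):
    """Run the fixed prompt over all 10 articles once per day."""
    results = []
    for path in ARTICLES:
        text = f"(contents of {path})"  # the real run would read the file here
        results.append({
            "article": path,
            "summary": summarize(f"Summarize this: {text}"),
        })
    return results

batch = run_daily_batch(lambda prompt: prompt[:40])  # stub instead of GPT-4o
print(len(batch))  # 10 records per day
```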

I measured:

  • Cosine similarity between outputs (was the meaning the same?)
  • Token count (was the length the same?)
  • Latency (was the speed the same?)
  • Cost (was it consistent?)
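Cosine similarity compares the embedding vectors of two outputs: the dot product divided by the product of magnitudes, so identical directions score 1.0 and scores fall as meaning diverges. A minimal sketch, where the toy vectors are placeholders for real embeddings from an embedding model:

```python
import math

def cosine_similarity(a, b):
    """a.b / (|a||b|): 1.0 means same direction, lower means drift."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of two days' summaries
day_1 = [0.1, 0.3, 0.5]
day_2 = [0.1, 0.2, 0.6]
print(round(cosine_similarity(day_1, day_2), 3))  # → 0.977
```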

The Results

Output Similarity

The outputs were semantically similar but not identical. On a scale of 0 (completely different) to 1 (identical), daily similarity averaged 0.87.

On two days, similarity dropped below 0.75. Both times, the model had clearly shifted its summarization style toward shorter summaries and more direct language.

Token Count

  • Average output tokens: 78
  • Standard deviation: 12
  • Range: 52-124 tokens

Day 12 had unusually short outputs (52-61 tokens). Day 23 had unusually long outputs (98-124 tokens).
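Days like 12 and 23 are easy to flag automatically by comparing each day's average against the baseline. A sketch using the stats above; the per-day values are illustrative, not my raw data:

```python
# Illustrative per-day token averages (not my raw data);
# mean and stdev are the baseline figures from the run above.
daily_tokens = {11: 79, 12: 56, 13: 81, 22: 80, 23: 110, 24: 77}
MEAN, STDEV = 78, 12

# Flag days more than 1.5 standard deviations from the baseline mean
outliers = sorted(d for d, t in daily_tokens.items() if abs(t - MEAN) > 1.5 * STDEV)
print(outliers)  # → [12, 23]
```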

Latency

  • Average latency: 2.3 seconds
  • Standard deviation: 0.8 seconds
  • Range: 1.1s - 4.7s

No obvious pattern by day of week or time of day.
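Latency is straightforward to capture by timing each call with a monotonic clock. A sketch with a sleep-based stub in place of the real API call:

```python
import statistics
import time

def call_model(prompt):
    time.sleep(0.01)  # stub standing in for the real API call
    return "summary"

latencies = []
for _ in range(5):
    start = time.perf_counter()  # monotonic clock, suitable for intervals
    call_model("Summarize this: ...")
    latencies.append(time.perf_counter() - start)

print(f"mean={statistics.mean(latencies):.3f}s stdev={statistics.stdev(latencies):.3f}s")
```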

Cost

Cost was perfectly consistent: $0.003 per call at this prompt size, every day.
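That consistency is expected: per-call cost is just tokens times the per-token rate. The rates below are assumptions for illustration (check your provider's current pricing), but with roughly 700 input tokens and 78 output tokens they land near the $0.003 figure:

```python
# Assumed per-token rates for illustration; check current provider pricing.
INPUT_RATE = 2.50 / 1_000_000    # $ per input token (assumed)
OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token (assumed)

def call_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# ~700 input tokens (prompt + 500-word article), ~78 output tokens
print(f"${call_cost(700, 78):.4f}")  # → $0.0025
```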

What This Means

For simple, deterministic tasks: LLMs are consistent.

For generative tasks: there is measurable variance day-to-day.

This is why you need baseline monitoring. If I had not measured, I would not have known the outputs were changing.

The Practical Implication

If you are building LLM features that depend on consistent output length, format, or style, you need automated monitoring.

A 15% variance in output length could break your UI. A 10% variance in summarization quality might not be noticeable until a user complains.
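A monitor does not need to be elaborate. Here is a minimal daily drift check against the baseline from this experiment; the function and thresholds are hypothetical, not DriftWatch's API:

```python
BASELINE_TOKENS = 78      # baseline average output length from the 30-day run
SIMILARITY_FLOOR = 0.75   # the two anomalous days dipped below this
LENGTH_TOLERANCE = 0.15   # 15% length variance can break a UI

def check_drift(similarity, avg_tokens):
    """Hypothetical daily check; returns a list of alert strings."""
    alerts = []
    if similarity < SIMILARITY_FLOOR:
        alerts.append("style drift: similarity below floor")
    if abs(avg_tokens - BASELINE_TOKENS) / BASELINE_TOKENS > LENGTH_TOLERANCE:
        alerts.append("length drift: outside tolerance band")
    return alerts

print(check_drift(0.72, 110))  # both alerts fire
```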


Measure everything. You cannot fix what you are not tracking.

Try DriftWatch for automated LLM monitoring, from £9.90/mo.
