I ran the exact same prompt every day for 30 days on GPT-4o. I measured output similarity, cost, and latency. Here is what I found.
The Experiment
- Same prompt: "Summarize this: [random 500-word article]"
- Same model: GPT-4o
- Same 10 articles per day
- 30 days of data
I measured:
- Cosine similarity between outputs (was the meaning the same?)
- Token count (was the length the same?)
- Latency (was the speed the same?)
- Cost (was it consistent?)
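The similarity metric is just the cosine between embedding vectors of two outputs. Here is a minimal sketch of that computation; the embedding values are hypothetical stand-ins (in practice you would embed each summary with a sentence-embedding model first):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: day-1 baseline summary vs. today's summary.
baseline = [0.10, 0.30, 0.50]
today = [0.12, 0.28, 0.52]
score = cosine_similarity(baseline, today)
print(f"{score:.3f}")
```

A score of 1.0 means identical direction in embedding space; 0 means unrelated. Real embedding vectors have hundreds of dimensions, but the formula is the same.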
The Results
Output Similarity
The outputs were semantically similar but not identical. On a scale of 0 (completely different) to 1 (identical), daily similarity averaged 0.87.
On two days, similarity dropped below 0.75. Both times the model had clearly changed its summarization style: shorter summaries, more direct language.
Token Count
Average output tokens: 78
Standard deviation: 12
Range: 52-124 tokens
Day 12 had unusually short outputs (52-61 tokens). Day 23 had unusually long outputs (98-124 tokens).
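The length statistics above come from basic descriptive stats over the per-call token counts. A sketch with the standard library, using illustrative sample values rather than my raw data:

```python
import statistics

# Hypothetical token counts for one day's 10 calls (not my actual raw data).
token_counts = [78, 81, 65, 90, 72, 84, 76, 70, 88, 75]

mean = statistics.mean(token_counts)
stdev = statistics.stdev(token_counts)  # sample standard deviation
low, high = min(token_counts), max(token_counts)

print(f"mean={mean:.1f} stdev={stdev:.1f} range={low}-{high}")
```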
Latency
Average latency: 2.3 seconds
Standard deviation: 0.8 seconds
Range: 1.1s - 4.7s
No obvious pattern by day of week or time of day.
Cost
Always $0.003 per call for this prompt size. Consistent.
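Cost is consistent because it is purely mechanical: token counts times per-token rates. A sketch of the calculation; the rates below are placeholder assumptions, since published pricing changes over time:

```python
# Assumed per-1M-token rates -- substitute the current published pricing.
INPUT_RATE = 2.50 / 1_000_000    # $ per input token (assumption)
OUTPUT_RATE = 10.00 / 1_000_000  # $ per output token (assumption)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-call cost as a simple linear function of token counts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A ~500-word article is roughly 700 input tokens; outputs averaged 78.
cost = call_cost(700, 78)
print(f"${cost:.4f}")
```

As long as the prompt size and pricing stay fixed, this number cannot drift, which is why cost was the one metric with zero variance.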
What This Means
For fixed, mechanical properties like per-call cost: fully consistent.
For generative properties like content, length, and style: measurable day-to-day variance.
This is why you need baseline monitoring. If I had not measured, I would not have known the outputs were changing.
The Practical Implication
If you are building LLM features that depend on consistent output length, format, or style: you need automated monitoring.
A 15% variance in output length could break your UI. A 10% variance in summarization quality might not be noticeable until a user complains.
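A minimal sketch of that kind of monitor, assuming you keep a rolling window of recent output lengths and flag anything more than two standard deviations from the baseline (the window and threshold here are illustrative choices, not a recommendation):

```python
import statistics

def length_alert(baseline: list, new_length: int, z_threshold: float = 2.0) -> bool:
    """Return True if new_length deviates from the baseline mean by more
    than z_threshold standard deviations."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return new_length != mean
    z = abs(new_length - mean) / stdev
    return z > z_threshold

# Hypothetical history of recent output lengths.
baseline_lengths = [78, 81, 65, 90, 72, 84, 76, 70, 88, 75]

normal = length_alert(baseline_lengths, 79)    # within the usual range
outlier = length_alert(baseline_lengths, 130)  # far outside: should alert
```

The same pattern works for any scalar you track: latency, similarity score, or token count. The point is to have a baseline to compare against before something drifts.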
Measure everything. You cannot fix what you are not tracking.