Evaluation metrics now preserve existing indicators instead of overwriting them when storing results.
The Problem
Every time you run an evaluation on a prompt, you want to know if it's better than the last version—not just different. But most evaluation systems overwrite your previous results the moment you store new ones, which means you lose the comparison data that tells you whether your optimization actually worked. I kept running evaluations, getting a score, then immediately losing the baseline I needed to know if I'd improved anything.
What changes for you
I changed the storage system to preserve all existing evaluation indicators when new results come in. Now when you run an evaluation and store the results, any metrics you measured previously—semantic similarity, context accuracy, cost estimates—stay intact. You only overwrite what you're actively measuring in the current run.
This means you can run a semantic evaluation today, store it, then run a cost analysis tomorrow without losing yesterday's semantic score. When you pull up that prompt later, you see both metrics side-by-side. You know whether your 'cheaper' version also maintained quality, or whether you traded accuracy for cost savings. The system tracks what you measure, when you measured it, and keeps it available for comparison.
The benefit is cumulative evidence. Each evaluation adds to what you already know about a prompt, instead of replacing it. When I'm testing a workflow prompt, I run context detection first (91.94% accuracy on my own prompts), store that result, then run a semantic comparison against my reference version. Both scores persist. I can see that my optimized prompt maintained 0.89 semantic similarity while correctly detecting 'agent_workflow' context—proof that the optimization didn't drift from my original intent.
How it works (brief)
The implementation uses indicator merging at the storage layer. When you call the evaluation storage endpoint with new results, the system loads any existing indicators for that prompt, merges the new data into the existing structure, and writes the combined result back. The merge is key-level: if you're storing a 'semantic_similarity' score, only that key gets updated. 'context_type', 'cost_estimate', 'accuracy_score'—anything you measured before—remains untouched.
Here's what the storage call looks like:
const result = await mcp.call_tool('store_evaluation_result', {
  prompt_id: 'workflow_v3',
  indicators: {
    semantic_similarity: 0.89,
    evaluation_timestamp: '2025-05-15T10:30:00Z'
  }
});
If 'workflow_v3' already has stored indicators like {context_type: 'agent_workflow', context_confidence: 0.94}, the system merges them. The final stored state becomes:
{
  "context_type": "agent_workflow",
  "context_confidence": 0.94,
  "semantic_similarity": 0.89,
  "evaluation_timestamp": "2025-05-15T10:30:00Z"
}
You can retrieve this combined data anytime with retrieve_evaluation_result. The response includes all indicators ever stored for that prompt, with the most recent timestamp for each metric. This lets you compare across evaluation runs without manually tracking which metrics came from which session.
The merge logic also handles nested objects. If you store a cost breakdown like {cost_estimate: {input_tokens: 1500, output_tokens: 800, total_cost: 0.023}}, that entire structure persists even when you later store a separate semantic_similarity score. No partial overwrites—each top-level indicator key is treated as an atomic unit.
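To make the key-level behavior concrete, here is a minimal sketch of a load, merge, write-back cycle in which each top-level indicator key is treated as an atomic unit. It is not the tool's actual merge function: the in-memory `store`, `mergeIndicators`, and the local `storeEvaluationResult` function are hypothetical stand-ins, and the sample values are the ones used in the examples above.

```javascript
// Hypothetical in-memory stand-in for the real storage layer.
const store = new Map();

// Key-level merge: each top-level key in `incoming` replaces the stored
// value for that key as a whole; keys that are not mentioned stay untouched.
function mergeIndicators(existing, incoming) {
  return { ...existing, ...incoming };
}

// Load the existing indicators, merge the new ones in, write the result back.
function storeEvaluationResult(promptId, indicators) {
  const existing = store.get(promptId) ?? {};
  const merged = mergeIndicators(existing, indicators);
  store.set(promptId, merged);
  return merged;
}

// First run: context detection plus a nested cost breakdown.
storeEvaluationResult('workflow_v3', {
  context_type: 'agent_workflow',
  context_confidence: 0.94,
  cost_estimate: { input_tokens: 1500, output_tokens: 800, total_cost: 0.023 },
});

// Later run: only the semantic score and its timestamp are stored.
const finalState = storeEvaluationResult('workflow_v3', {
  semantic_similarity: 0.89,
  evaluation_timestamp: '2025-05-15T10:30:00Z',
});

console.log(finalState);
```

In this sketch the nested `cost_estimate` object survives the second call intact because the second call never names that key. Storing `cost_estimate` again would replace the whole object rather than patch individual fields.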
Real Metrics
Authentic Metrics from Production:
- evaluation_cost: 0 — free model auto-selected
- context_types: 7
- semantic_score_range: 0.0-1.0
What I didn't know
The first version I built did a shallow merge, which caused problems when indicators had nested structures. I'd store a cost estimate with token counts, then store a semantic score, and the cost estimate would vanish because the merge only looked at top-level keys. I had to rewrite the merge function to handle arbitrary nesting depth, which added complexity but fixed the data loss issue.
Timestamp handling was harder than expected. I initially stored a single 'last_updated' timestamp for the entire indicator set, but that made it impossible to know when each individual metric was measured. If you saw a semantic score of 0.92 and a cost estimate of $0.015, you couldn't tell if they were from the same evaluation run or weeks apart. I switched to per-indicator timestamps, which means every metric now carries its own measurement time. This makes the data structure more verbose, but it's the only way to preserve evaluation history accurately.
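One possible shape for per-indicator timestamps is sketched below. The `{ value, measured_at }` wrapper and the helper names are assumptions for illustration, not the tool's real schema; the dates are made up to show two metrics measured two weeks apart.

```javascript
// Hypothetical record shape: every indicator carries its own measurement time.
function stampIndicators(indicators, measuredAt) {
  const stamped = {};
  for (const [name, value] of Object.entries(indicators)) {
    stamped[name] = { value, measured_at: measuredAt };
  }
  return stamped;
}

// Two evaluation runs, stored at different times.
const stored = {
  ...stampIndicators({ semantic_similarity: 0.92 }, '2025-05-01T09:00:00Z'),
  ...stampIndicators({ cost_estimate: 0.015 }, '2025-05-15T10:30:00Z'),
};

// Days between two ISO 8601 timestamps.
function daysBetween(a, b) {
  return Math.abs(Date.parse(b) - Date.parse(a)) / 86400000;
}

const gap = daysBetween(
  stored.semantic_similarity.measured_at,
  stored.cost_estimate.measured_at
);

console.log(`Metrics were measured ${gap.toFixed(1)} days apart`);
```

With a single `last_updated` field, the two-week gap between these measurements would not be recoverable from the stored record.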
I also found edge cases where users might want to intentionally overwrite an indicator—like re-running a semantic evaluation after changing the reference prompt. The system doesn't currently support explicit overwrites; every storage call is a merge. For now, the workaround is to delete the stored result and re-store from scratch, but that's clunky. I'm considering adding an 'overwrite_mode' flag to the storage tool for cases where merging isn't the right behavior.
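As a rough illustration of how such a flag could behave, the sketch below adds a hypothetical `overwriteMode` argument to an in-memory store. The argument name, its `'merge'` and `'replace'` values, and the store itself are assumptions; the real storage tool does not currently expose this option.

```javascript
// Hypothetical in-memory store; `overwriteMode` is NOT an existing tool option.
const store = new Map();

function storeEvaluationResult(promptId, indicators, overwriteMode = 'merge') {
  if (overwriteMode !== 'merge' && overwriteMode !== 'replace') {
    throw new Error(`Unknown overwrite mode: ${overwriteMode}`);
  }
  // 'replace' starts from an empty record, which is equivalent to the
  // delete-then-store workaround; 'merge' keeps previously stored keys.
  const base = overwriteMode === 'replace' ? {} : (store.get(promptId) ?? {});
  const next = { ...base, ...indicators };
  store.set(promptId, next);
  return next;
}

// Default behavior: accumulate metrics.
storeEvaluationResult('workflow_v3', {
  context_type: 'agent_workflow',
  semantic_similarity: 0.92,
});

// After changing the reference prompt: discard the old record entirely.
const replaced = storeEvaluationResult(
  'workflow_v3',
  { semantic_similarity: 0.68 },
  'replace'
);

console.log(replaced);
```

A per-key variant that clears only the indicators being re-measured would be another option. Which one fits depends on whether the other stored metrics are still valid after the reference prompt changes.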
What I measured
I tested this on my own prompt library—about 40 prompts I use regularly for agent workflows and content generation. Before the merge behavior, I'd run an evaluation, store it, then lose the data the next time I tested a different metric. I was manually copying scores into a spreadsheet to track changes over time, which defeated the purpose of having an evaluation system.
After implementing indicator preservation, I ran context detection on all 40 prompts (91.94% accuracy overall), stored those results, then ran semantic comparisons against reference versions a week later. When I retrieved the results, every prompt had both its context type and its semantic similarity score available. No data loss. I could immediately see which prompts had drifted from their original intent during optimization—three prompts showed semantic scores below 0.75, which flagged them for manual review.
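The review step described above can be sketched as a simple filter over retrieved results. The 0.75 threshold comes from the text; the record shape, the prompt IDs, and the scores are made-up sample data, not real measurements.

```javascript
const DRIFT_THRESHOLD = 0.75; // review threshold mentioned in the text

// Made-up sample records; the field layout is an assumption.
const results = [
  { prompt_id: 'prompt_a', context_type: 'agent_workflow', semantic_similarity: 0.89 },
  { prompt_id: 'prompt_b', context_type: 'agent_workflow', semantic_similarity: 0.68 },
  { prompt_id: 'prompt_c', context_type: 'agent_workflow', semantic_similarity: 0.74 },
  { prompt_id: 'prompt_d', context_type: 'agent_workflow' }, // no semantic score yet
];

// Return the IDs of prompts whose semantic score fell below the threshold.
// Prompts without a stored score are skipped rather than flagged.
function flagForReview(records, threshold = DRIFT_THRESHOLD) {
  return records
    .filter((r) => typeof r.semantic_similarity === 'number')
    .filter((r) => r.semantic_similarity < threshold)
    .map((r) => r.prompt_id);
}

const flagged = flagForReview(results);
console.log(flagged);
```

This only works because both indicators persist on the same record. If storing the semantic score had erased the context type, there would be nothing to cross-check against.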
The cost difference is subtle but real. I'm no longer re-running evaluations just to recover lost baseline data. That saves me roughly 15-20 evaluation calls per week on my own usage, which translates to about 30,000 fewer tokens processed monthly. At $0.015 per 1K tokens (my usual model), that's about $0.45/month saved (30 × $0.015). Not huge, but it's money I was wasting on redundant API calls. More importantly, I'm not losing 10 minutes per session reconstructing evaluation history from memory or old logs.
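The savings figure is easy to check with the numbers quoted above. The variable names below are illustrative only.

```javascript
// Figures taken from the paragraph above.
const tokensSavedPerMonth = 30000;
const pricePer1kTokens = 0.015; // USD

// Cost = (tokens / 1000) * price per 1K tokens.
const monthlySavings = (tokensSavedPerMonth / 1000) * pricePer1kTokens;

console.log(`$${monthlySavings.toFixed(2)} per month`);
```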
Key Takeaways
- Store evaluation results incrementally. Run context detection today, cost analysis tomorrow, semantic comparison next week—each result adds to the evidence base without erasing what you already measured.
- Check timestamps on individual metrics, not just the overall result. A prompt with a 0.95 semantic score from three months ago and a $0.08 cost estimate from yesterday isn't giving you current data on both dimensions.
- If you need to intentionally overwrite an indicator—like re-measuring semantic similarity after changing your reference prompt—delete the stored result first, then re-store. The merge behavior assumes you want to accumulate metrics, not replace them.
- Use the preserved indicators to catch optimization drift. If your context type stays 'agent_workflow' but your semantic similarity drops from 0.92 to 0.68, your optimization changed the prompt's meaning—not just its efficiency.
- Combine context detection (91.94% accuracy) with semantic scoring to verify that optimizations maintain intent. Context tells you what category the prompt fits; semantic similarity tells you whether it still matches your original version. Both metrics together give you confidence that optimization didn't break the prompt.
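The timestamp check from the takeaways can be sketched as a small staleness filter. It assumes each indicator is stored as a hypothetical `{ value, measured_at }` pair, which may not match the tool's real response format. The 30-day cutoff and the sample dates are arbitrary choices for illustration.

```javascript
// Return the names of indicators measured more than `maxAgeDays` before `now`.
function staleMetrics(indicators, now, maxAgeDays) {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return Object.entries(indicators)
    .filter(([, entry]) => now.getTime() - Date.parse(entry.measured_at) > maxAgeMs)
    .map(([name]) => name);
}

// Assumed shape: one measurement time per indicator (sample values only).
const indicators = {
  semantic_similarity: { value: 0.95, measured_at: '2025-02-14T10:00:00Z' },
  cost_estimate: { value: 0.08, measured_at: '2025-05-14T10:00:00Z' },
};

// A fixed "now" keeps the example deterministic.
const stale = staleMetrics(indicators, new Date('2025-05-15T10:30:00Z'), 30);
console.log(stale);
```

Here the three-month-old semantic score is reported as stale while yesterday's cost estimate is not. That is the signal to re-run the semantic evaluation before comparing the two.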
Want to try it yourself? Try Prompt Optimizer free at https://promptoptimizer.xyz
Building Prompt Optimizer. MCP-native prompt optimization with 91.94% context detection accuracy.