You rewrote your prompt. The output looks different. But is it actually better?
Most of us have been there — reading prompt engineering best practices, tweaking instructions, and hoping the changes help. But without comparison, you're just guessing.
## The Problem
When you improve a prompt, you typically:
- Run the new version
- Look at the output
- Think "yeah, this seems better"
But you're comparing against your memory of the old output. Different runs produce different results anyway. How do you know the improvement came from your changes and not just LLM variance?
## What I Built
rashomon is a Claude Code plugin that focuses on one practical question: "Did my instruction change actually affect the result?"
It analyzes your prompt, generates an optimized version, runs both in isolated environments, and compares the actual results.
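The mechanics are easy to picture. Here is a minimal TypeScript sketch of the isolated-run-and-compare idea — it is not rashomon's implementation; the worktree layout, the headless `claude -p` invocation, and the placeholder prompt are assumptions for illustration:

```typescript
// Minimal sketch (not rashomon's internals): run a prompt against an isolated
// git worktree, capture what the agent changed, and compare two runs.
// The worktree paths and the headless `claude -p` call are illustrative assumptions.
import { execSync } from "node:child_process";

function runInIsolation(label: string, prompt: string): string {
  const dir = `.rashomon-runs/${label}`;
  execSync(`git worktree add --detach ${dir}`, { stdio: "inherit" });
  try {
    // Let the agent act on the isolated copy of the repository.
    execSync(`claude -p "${prompt.replace(/"/g, '\\"')}"`, { cwd: dir, stdio: "inherit" });
    // The diff is the "actual result" we care about.
    return execSync("git diff", { cwd: dir }).toString();
  } finally {
    execSync(`git worktree remove --force ${dir}`);
  }
}

const originalDiff = runInIsolation("original", "Add logging to track function calling usage");
const optimizedDiff = runInIsolation("optimized", "<your optimized prompt>");
console.log(originalDiff === optimizedDiff ? "No meaningful difference" : "The runs diverged");
```

Because both runs start from the same commit, anything that differs in the resulting diffs comes from either the prompt change or run-to-run variance, which is exactly the distinction the comparison is meant to surface.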
## Real Example
I ran this prompt through rashomon:
```
Add logging to track function calling usage
```
A reasonable instruction. But vague.
### What rashomon detected
| Issue | Detail |
|---|---|
| Vague instructions | What, where, and why to log are unclear |
| No output format | Log structure not specified |
| Missing context | No project architecture information |
### The optimized prompt
```
## Context

This is a Slack bot using Google Gemini API with function calling.
The project uses a shared `logger` utility with structured logging.

Function calling flows through:
1. `GeminiService.executeWithRetry()` - detects function calls
2. `FunctionHandler.handleFunctionCall()` - executes them

## Task

Add logging to track function calling usage for analytics and debugging.

## Requirements

At Function Call Detection (GeminiService):
- Function name(s) detected
- Number of function calls in response

At Function Execution (FunctionHandler):
- Parameters passed (sanitized - exclude sensitive data)
- Execution duration
- Result status (success/failure)

## Output Format

logger.info('Function call detected', {
  functionName: 'executeWithRetry',
  detectedFunctions: ['searchNotionPages'],
  functionCallCount: 1
})
```
### What changed
| Aspect | Original | Optimized |
|---|---|---|
| Logging Scope | 1 stage (execution only) | 2 stages (detection + execution) |
| Parameter Sanitization | None | Passwords, tokens, secrets redacted |
| Files Modified | 2 | 2 |
The original prompt looked reasonable, but led the agent to log at only one point. The optimized version covered both detection and execution — with security considerations the original didn't address.
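To make that security difference concrete, the execution-stage logging the optimized prompt asks for might end up looking something like this. This is a hypothetical sketch: the logger import path, redaction list, and field names are my assumptions, not rashomon's output.

```typescript
// Hypothetical sketch of the execution-stage log with parameter sanitization.
// The import path and redaction rules are illustrative assumptions.
import { logger } from "./utils/logger"; // the project's shared structured logger

const SENSITIVE_KEYS = ["password", "token", "secret", "apikey"];

function sanitize(params: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(params).map(([key, value]) =>
      SENSITIVE_KEYS.some((k) => key.toLowerCase().includes(k))
        ? [key, "[REDACTED]"]
        : [key, value]
    )
  );
}

const start = Date.now();
// ... FunctionHandler.handleFunctionCall() executes the detected function here ...
logger.info("Function call executed", {
  functionName: "searchNotionPages",
  parameters: sanitize({ query: "Q3 roadmap", notionToken: "secret_abc123" }),
  durationMs: Date.now() - start,
  status: "success",
});
```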
**Classification: Structural Improvement**
## About Variance
Not every difference is an improvement. rashomon distinguishes between structural gains and mere variance.
I tried to create a Variance example — a prompt so clear that optimization wouldn't matter. I couldn't. In practice, the same vague prompt sometimes works beautifully, sometimes completely misses the point.
rashomon just makes that inconsistency visible.
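One way to reason about it: before crediting the prompt change, check how much two runs of the *same* prompt already differ. A rough illustration, reusing `runInIsolation` from the earlier sketch — the distance metric here is a deliberately crude assumption, not how rashomon classifies results:

```typescript
// Rough variance baseline (illustrative only, not rashomon's classifier).
// Reuses runInIsolation() from the earlier sketch.

// Crude line-level distance between two diffs.
function diffDistance(a: string, b: string): number {
  const linesA = new Set(a.split("\n"));
  const linesB = new Set(b.split("\n"));
  let changed = 0;
  for (const line of linesA) if (!linesB.has(line)) changed++;
  for (const line of linesB) if (!linesA.has(line)) changed++;
  return changed;
}

const original = "Add logging to track function calling usage";
const runA = runInIsolation("baseline-a", original);
const runB = runInIsolation("baseline-b", original);
const optimized = runInIsolation("optimized", "<your optimized prompt>");

// If the optimized run differs from baseline by no more than the two baseline
// runs differ from each other, the "improvement" is probably just variance.
const variance = diffDistance(runA, runB);
const improvement = diffDistance(runA, optimized);
console.log(improvement > variance ? "Likely structural" : "Likely variance");
```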
## Try It
Requires Claude Code.
```
claude
/plugin marketplace add shinpr/rashomon
/plugin install rashomon@rashomon
# Restart session
/rashomon Your prompt here
```
See what actually changes when you improve your prompts — not just different wording.
## Why rashomon?
Inspired by the Rashomon effect: the idea that the same event can yield different accounts depending on perspective. rashomon makes those differences explicit and comparable.
- Spending too much time on trial-and-error with prompts?
- Read best practices but not sure how they apply to your case?
- Want proof that your changes actually made things better?
rashomon analyzes, improves, and compares prompts, so you can see what actually changed and whether it matters.
## Who Is This For?
rashomon is designed for:
- Developers using Claude Code daily
- Teams iterating on complex prompts (coding, analysis, writing)
- Anyone who wants evidence, not vibes, when improving prompts
Not ideal if:
- You don't use git
- You want one-shot prompt rewriting without comparison
## Quick Example

```
/rashomon Write a function to sort an array
```

### What You Get
1. Detected Issues
- BP-002…
