
Aditya Pandey

Stop "Vibe Checking" Your AI. Use Snapshot Testing Instead.

We have all been there.

  1. You write a perfect prompt for your AI feature.
  2. It works great on your machine.
  3. Two weeks later, you tweak the system prompt slightly.
  4. Suddenly, your bot starts replying with 5-paragraph essays instead of one-liners.

In traditional web development, we solved this years ago with Snapshot Testing. If a UI component changes by one pixel, the test fails until you approve it.

Why aren't we doing this for AI?

Most of us are still "Vibe Checking": manually running the prompt, reading the output, and saying, "Yeah, seems okay."

I built a tool to fix this.

Introducing SafeStar

SafeStar is a zero-dependency CLI tool that brings the "Snapshot & Diff" workflow to AI engineering.

It doesn't care if you use Python, Node, or curl. It treats your AI as a black box and answers one question: "Did the behavior change compared to last time?"

How it works

It follows a Git-like workflow:

  1. Snapshot a baseline of "good" behavior.
  2. Run your current code.
  3. Diff the results to detect drift.
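
In command form (the same commands we'll use in the Quick Start below), that loop looks roughly like this:

npx safestar baseline refund_bot          # 1. freeze a known-good baseline
# 2. ...tweak your prompt, swap the model, refactor the agent...
npx safestar diff scenarios/refund.yaml   # 3. compare current behavior against the baseline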

Quick Start

You can try it right now without changing your code.

1. Install it

npm install --save-dev safestar

2. Define a Scenario
Create a file scenarios/refund.yaml. Tell SafeStar how to run your script using the exec key.

name: refund_bot
prompt: "I want a refund immediately."

# Your actual code command
exec: "python3 my_agent.py"

# Run it 5 times to catch randomness/instability
runs: 5

# Simple guardrails
checks:
  max_length: 200
  must_not_contain:
    - "I am just an AI"

3. Create a Baseline
Run it until you get an output you like, then "freeze" it.

npx safestar baseline refund_bot

4. Check for Drift in CI
Now, whenever you change your prompt or model, run this:

npx safestar diff scenarios/refund.yaml

If your model drifts, SafeStar screams at you:

--- SAFESTAR REPORT ---
Status: FAIL

Metrics:
  Avg Length: 45 chars -> 120 chars
  Drift:      +166% vs baseline (WARNING)
  Variance:   0.2 -> 9.8 (High instability)

Why I built this

I was tired of complex evaluation dashboards that give me a "correctness score" of 87/100. I don't care about the score. I care about regressions.

If my bot was working yesterday, I just want to know if it is different today.

SafeStar is open source, local-first, and fits right into GitHub Actions.
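
If you want to wire it into CI, a minimal GitHub Actions sketch could look like this (hypothetical workflow; it assumes the baseline is committed to the repo and that your agent's own dependencies are installed in an earlier step):

# .github/workflows/safestar.yml -- hypothetical CI sketch
name: SafeStar drift check
on: [pull_request]

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Fails the job if behavior drifted from the committed baseline
      - run: npx safestar diff scenarios/refund.yaml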

Links
