How to use Charm's VHS to create GIF-based visual regression tests for your CLI's terminal output — catching formatting bugs that unit tests miss.
Your CLI's unit tests verify that the right data comes out. But they don't test what the user actually sees.
A missing newline. A table column that wraps at 80 characters. A progress spinner that bleeds into the output. An ANSI color code that renders as garbage on a light terminal theme. These are visual bugs that pass every unit test but make your CLI look broken.
VHS by Charm solves this by recording your terminal as a GIF from a script — and you can use those GIFs as visual regression tests.
Using VHS
VHS reads a .tape file that describes terminal interactions:
# demo.tape
Output demo.gif
Set Width 1200
Set Height 600
Set Theme "Monokai"
Type "stave apply --controls ./controls --observations ./obs --format text"
Enter
Sleep 2s
Run it:
vhs demo.tape
Output: demo.gif — a pixel-perfect recording of what the terminal looks like when that command runs.
How This Differs from Asciinema
| | Asciinema (.cast) | VHS (.gif/.png) |
|---|---|---|
| Output | Text-based replay (NDJSON) | Pixel-based image (GIF/PNG/WebM) |
| Renders | In a JavaScript player | As a static image anywhere |
| Tests | Text content correctness | Visual formatting correctness |
| Use case | Documentation, interactive replay | README badges, visual regression |
| File size | Small (text) | Large (image) |
| Searchable | Yes (it's text) | No (it's pixels) |
Asciinema answers: "What text does the CLI produce?"
VHS answers: "What does the CLI look like?"
Both are useful. They test different things.
Visual Regression Testing Pattern
Step 1: Create a .tape file per workflow
# tapes/apply-violation.tape
Output testdata/screenshots/apply-violation.gif
Set Width 1200
Set Height 600
Set FontSize 14
Set Theme "Catppuccin Mocha"
Type "stave apply --controls controls/s3 --observations observations --now 2026-01-15T00:00:00Z --format text"
Enter
Sleep 3s
Step 2: Generate the baseline
vhs tapes/apply-violation.tape
Commit testdata/screenshots/apply-violation.gif as the golden file.
Step 3: Compare in CI
# .github/workflows/visual.yml
- name: Generate screenshots
  run: |
    for tape in tapes/*.tape; do
      vhs "$tape"
    done
- name: Check for visual changes
  run: |
    git diff --exit-code testdata/screenshots/
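If any GIF changes, the diff catches it. The developer reviews the visual change and either updates the golden file or fixes the formatting bug. Because git diff compares bytes, this check is only reliable when a recording is reproducible byte for byte; any jitter in frame timing or rendering between runs shows up as a false positive.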
Step 4: Review with PR comments
For GitHub PRs, you can post the before/after GIF directly in a comment:
- name: Post visual diff
  if: failure()
  run: |
    echo "Visual regression detected. See the updated screenshots below."
    # Upload artifacts or post to PR
What Visual Tests Catch That Unit Tests Miss
Table alignment
CONTROL_ID          ASSET_ID              STATUS
CTL.S3.PUBLIC.001   my-very-long-bucket   NON_COMPLIANT
                    -name-that-wraps
A unit test checks that the data is correct. A visual test catches that the column wraps and breaks the alignment.
Color and formatting
[PASS] CTL.S3.ENCRYPT.001 — Server-Side Encryption
[FAIL] CTL.S3.PUBLIC.001 — No Public Read Access
A unit test sees [PASS] and [FAIL]. A visual test sees whether the ANSI color codes render correctly — green for pass, red for fail — or whether they produce \033[32m[PASS]\033[0m garbage.
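Whether the colors look right is a visual question, but two related properties can be checked on the raw text: that the output minus its escape sequences equals the expected plain text, and that a colored line resets its style before it ends. A minimal sketch, assuming the CLI only emits SGR sequences of the form ESC [ ... m; stripANSI and leaksStyle are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
)

// sgr matches ANSI SGR (color and style) sequences such as "\x1b[32m" or "\x1b[0m".
var sgr = regexp.MustCompile(`\x1b\[[0-9;]*m`)

// stripANSI removes every SGR sequence from s.
func stripANSI(s string) string {
	return sgr.ReplaceAllString(s, "")
}

// leaksStyle reports whether line sets a style and does not end with a
// reset, which would let the color bleed into whatever is printed next.
func leaksStyle(line string) bool {
	codes := sgr.FindAllString(line, -1)
	if len(codes) == 0 {
		return false
	}
	last := codes[len(codes)-1]
	return last != "\x1b[0m" && last != "\x1b[m"
}

func main() {
	pass := "\x1b[32m[PASS]\x1b[0m CTL.S3.ENCRYPT.001"
	fail := "\x1b[31m[FAIL] CTL.S3.PUBLIC.001" // reset is missing

	fmt.Printf("%q\n", stripANSI(pass))
	fmt.Println(leaksStyle(pass), leaksStyle(fail))
}
```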
Progress indicators
Running: evaluating controls... ⠋
A spinner can work in a real terminal but bleed into piped output. A visual test with a fixed terminal size catches this.
Help text layout
Usage:
  stave apply [flags]

Flags:
  -i, --controls string       Path to control definitions (default "controls/s3")
  -o, --observations string
        Path to observation snapshots (default "observations")
Does the flag help wrap correctly? Are the defaults aligned? Is the long description properly indented? Unit tests don't check layout. VHS checks layout.
VHS .tape Cheat Sheet
Output file.gif # Output file (.gif, .mp4, .webm, or a directory for PNG frames)
Set Width 1200 # Terminal width in pixels
Set Height 600 # Terminal height in pixels
Set FontSize 14 # Font size in pixels
Set Theme "Dracula" # Terminal theme
Set TypingSpeed 50ms # Delay between keystrokes
Type "command" # Type text (simulated keystrokes)
Enter # Press Enter
Sleep 2s # Wait for output
Ctrl+C # Send interrupt
Tab # Press Tab (for completion testing)
Backspace 5 # Delete 5 characters
Hide # Stop recording (for setup commands)
Show # Resume recording
Combining Both Tools
For a complete CLI testing strategy:
| Layer | Tool | Tests |
|---|---|---|
| Unit tests | go test | Data correctness, error handling, exit codes |
| E2E golden files | go test + JSON comparison | Full output correctness, determinism |
| Text recordings | Custom asciicast generator | Documentation accuracy, demo freshness |
| Visual regression | VHS | Formatting, alignment, colors, layout |
Each layer catches different bugs. Unit tests catch logic errors. Golden files catch output regressions. Asciicast recordings catch documentation drift. VHS catches visual formatting bugs.
Getting Started
# Install VHS (macOS)
brew install charmbracelet/tap/vhs
# Install VHS with Go (e.g. on Linux); ttyd and ffmpeg must also be installed and on your PATH
go install github.com/charmbracelet/vhs@latest
# Create your first tape
cat > hello.tape << 'EOF'
Output hello.gif
Set Width 1200
Set Height 600
Type "echo 'Hello from VHS'"
Enter
Sleep 1s
EOF
# Record
vhs hello.tape
The GIF is your visual test. Commit it, compare it in CI, review it in PRs.
Stave uses programmatic asciicast generation for documentation recordings and Go-based golden file testing for output correctness. VHS is the natural next step for visual regression testing of the text-formatted output.
Top comments (4)
It is a useful framing. Most CLI testing stops at the data layer and assumes formatting just works.
The table wrapping edge case is where this really earns its keep. A unit test sees the right data, but a wrapped column breaks the entire visual hierarchy. That is the kind of bug users notice immediately, but tests never catch.
Curious about the CI workflow. Do you find the GIF file sizes cause any issues with artifact storage or PR load times? A 120x40 terminal recording can get surprisingly heavy compared to a plain text fixture.
Good catch — the size cost is real (a 120×40 / ~3s recording lands at 150–200 KB; twenty tapes is ~4 MB of committed binaries that git history won't deduplicate). But the bigger problem I've since hit is that VHS GIFs aren't byte-deterministic across runs — font hinting, frame timing, and cursor-blink phase all jitter the output — so the git diff --exit-code workflow in the article fails spuriously. To do real visual regression you need a perceptual diff (SSIM or frame-extracted comparison), not git diff. Honest update: VHS is best used as automated documentation recording, not as a regression-testing tool. For layout regressions, plain-text golden files with fixed-width rendering catch the same bugs deterministically and cheaply.
Really appreciate the honest update. You nailed it. That jitter makes binary diffs useless for visual stuff, so text golden files are definitely the right call for reliable regression testing.
Do you have a specific script to generate those golden files, or do you manage them manually?
Thank you!
You are welcome. For Stave I use Go's testscript package with .txtar files: each test bundles the command sequence and the expected stdout/stderr in a single text file. To regenerate, you run the tests with go test -update (or -rewrite in newer versions), which re-captures the actual output into the expected sections. The file is plain text so a regenerated golden shows up as a normal text diff in the PR, reviewable line-by-line. No separate generation script is needed; the test runner is the generator. The trick that makes it deterministic is freezing time with a --now flag on the CLI and pinning terminal width via COLUMNS, so the same input always produces the same bytes. Two short pieces are the best starting points if you want to wire it up: the testscript docs in rogpeppe/go-internal and Russ Cox's Quick Testing with Test Scripts post (research.swtch.com/testing).