Bala Paranj

Visual Regression Testing for CLIs with VHS

How to use Charm's VHS to create GIF-based visual regression tests for your CLI's terminal output — catching formatting bugs that unit tests miss.

Your CLI's unit tests verify that the right data comes out. But they don't test what the user actually sees.

A missing newline. A table column that wraps at 80 characters. A progress spinner that bleeds into the output. An ANSI color code that renders as garbage on a light terminal theme. These are visual bugs that pass every unit test but make your CLI look broken.

VHS by Charm solves this by recording your terminal as a GIF from a script — and you can use those GIFs as visual regression tests.

Using VHS

VHS reads a .tape file that describes terminal interactions:

# demo.tape
Output demo.gif
# Width and Height are in pixels, not columns and rows
Set Width 1200
Set Height 600
Set Theme "Monokai"

Type "stave apply --controls ./controls --observations ./obs --format text"
Enter
Sleep 2s

Run it:

vhs demo.tape

Output: demo.gif — a pixel-perfect recording of what the terminal looks like when that command runs.

How This Differs from Asciinema

| | Asciinema (.cast) | VHS (.gif/.png) |
|---|---|---|
| Output | Text-based replay (NDJSON) | Pixel-based image (GIF/PNG/WebM) |
| Renders | In a JavaScript player | As a static image anywhere |
| Tests | Text content correctness | Visual formatting correctness |
| Use case | Documentation, interactive replay | README badges, visual regression |
| File size | Small (text) | Large (image) |
| Searchable | Yes (it's text) | No (it's pixels) |

Asciinema answers: "What text does the CLI produce?"
VHS answers: "What does the CLI look like?"

Both are useful. They test different things.

Visual Regression Testing Pattern

Step 1: Create a .tape file per workflow

# tapes/apply-violation.tape
Output testdata/screenshots/apply-violation.gif
# Width and Height are in pixels, not columns and rows
Set Width 1200
Set Height 600
Set FontSize 14
Set Theme "Catppuccin Mocha"

Type "stave apply --controls controls/s3 --observations observations --now 2026-01-15T00:00:00Z --format text"
Enter
Sleep 3s

Step 2: Generate the baseline

vhs tapes/apply-violation.tape

Commit testdata/screenshots/apply-violation.gif as the golden file.

Step 3: Compare in CI

# .github/workflows/visual.yml
- name: Generate screenshots
  run: |
    for tape in tapes/*.tape; do
      vhs "$tape"
    done

- name: Check for visual changes
  run: |
    git diff --exit-code testdata/screenshots/

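If the bytes of any GIF change, git diff --exit-code exits non-zero and the job fails. The developer reviews the visual change and either updates the golden file or fixes the formatting bug. This check is byte-level, so it relies on a re-recording of unchanged output reproducing the file exactly. The author's follow-up in the comments below reports that VHS GIFs do not always do so.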

Step 4: Review with PR comments

For GitHub PRs, you can post the before/after GIF directly in a comment:

- name: Post visual diff
  if: failure()
  run: |
    echo "Visual regression detected. See the updated screenshots below."
    # Upload artifacts or post to PR

What Visual Tests Catch That Unit Tests Miss

Table alignment

CONTROL_ID          ASSET_ID              STATUS
CTL.S3.PUBLIC.001   my-very-long-bucket   NON_COMPLIANT
                    -name-that-wraps

A unit test checks that the data is correct. A visual test catches that the column wraps and breaks the alignment.
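
A wrapped cell like the one above can also be flagged from the plain text, as a cheap complement to the recording. The Go sketch below counts columns per row and reports rows that disagree with the header. `raggedRows` is a hypothetical helper, and it assumes columns are separated by at least two spaces and that no cell contains two consecutive spaces.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// columnGap matches the run of two or more whitespace characters that
// separates table columns (an assumption about the table format).
var columnGap = regexp.MustCompile(`\s{2,}`)

// raggedRows returns the 1-based line numbers of rows whose column count
// differs from the header row on line 1.
func raggedRows(table string) []int {
	lines := strings.Split(strings.TrimRight(table, "\n"), "\n")
	want := len(columnGap.Split(strings.TrimSpace(lines[0]), -1))
	var bad []int
	for i, line := range lines[1:] {
		got := len(columnGap.Split(strings.TrimSpace(line), -1))
		if got != want {
			bad = append(bad, i+2) // i is 0-based and skips the header
		}
	}
	return bad
}

func main() {
	table := "CONTROL_ID          ASSET_ID              STATUS\n" +
		"CTL.S3.PUBLIC.001   my-very-long-bucket   NON_COMPLIANT\n" +
		"                    -name-that-wraps\n"
	fmt.Println(raggedRows(table))
}
```

The continuation line has one column where the header has three, so it is the row that gets reported.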

Color and formatting

[PASS] CTL.S3.ENCRYPT.001 — Server-Side Encryption
[FAIL] CTL.S3.PUBLIC.001 — No Public Read Access

A unit test sees [PASS] and [FAIL]. A visual test sees whether the ANSI color codes render correctly — green for pass, red for fail — or whether they produce \033[32m[PASS]\033[0m garbage.
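
The color codes in question are SGR escape sequences: the ESC byte, `[`, numeric parameters separated by semicolons, then `m`. The Go sketch below detects and strips them. That is useful when a text-level test should assert that output written to a file or pipe carries no color codes. The helper names `hasSGR` and `stripSGR` are hypothetical, and the pattern covers only SGR sequences, not cursor movement or other escape codes.

```go
package main

import (
	"fmt"
	"regexp"
)

// sgr matches SGR (color and style) sequences: ESC [ parameters m.
var sgr = regexp.MustCompile("\x1b\\[[0-9;]*m")

// hasSGR reports whether s contains at least one SGR sequence.
func hasSGR(s string) bool { return sgr.MatchString(s) }

// stripSGR removes every SGR sequence from s, leaving the visible text.
func stripSGR(s string) string { return sgr.ReplaceAllString(s, "") }

func main() {
	colored := "\x1b[32m[PASS]\x1b[0m CTL.S3.ENCRYPT.001"
	fmt.Println(hasSGR(colored))   // true
	fmt.Println(stripSGR(colored)) // [PASS] CTL.S3.ENCRYPT.001
}
```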

Progress indicators

Running: evaluating controls... ⠋

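A spinner can work in an interactive terminal yet bleed into the final output, or into piped output where no terminal is attached. A recording at a fixed terminal size shows any frames left on screen. Catching the piped case requires a tape that actually pipes the command.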

Help text layout

Usage:
  stave apply [flags]

Flags:
  -i, --controls string   Path to control definitions (default "controls/s3")
  -o, --observations string
                          Path to observation snapshots (default "observations")

Does the flag help wrap correctly? Are the defaults aligned? Is the long description properly indented? Unit tests don't check layout. VHS checks layout.
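
One part of the layout question, whether any line exceeds the target width, can be checked from the text alone. The Go sketch below reports overlong lines. `overlongLines` is a hypothetical helper, and it assumes one rune occupies one column, so it ignores wide characters and escape sequences.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// overlongLines returns the 1-based numbers of lines in text that contain
// more than width runes.
func overlongLines(text string, width int) []int {
	var bad []int
	for i, line := range strings.Split(text, "\n") {
		if utf8.RuneCountInString(line) > width {
			bad = append(bad, i+1)
		}
	}
	return bad
}

func main() {
	help := "Usage:\n  stave apply [flags]\n\nFlags:\n" +
		"  -i, --controls string   Path to control definitions (default \"controls/s3\")\n"
	// 80 columns is an assumed target width for this example.
	fmt.Println(overlongLines(help, 80))
}
```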

VHS .tape Cheat Sheet

Output file.gif              # Output file (.gif, .mp4, .webm, or a directory of PNG frames)
Set Width 1200               # Terminal width in pixels
Set Height 600               # Terminal height in pixels
Set FontSize 14              # Font size in pixels
Set Theme "Dracula"          # Terminal theme
Set TypingSpeed 50ms         # Delay between keystrokes

Type "command"               # Type text (simulated keystrokes)
Enter                        # Press Enter
Sleep 2s                     # Wait for output
Ctrl+C                       # Send interrupt
Tab                          # Press Tab (for completion testing)
Backspace 5                  # Delete 5 characters

Hide                         # Stop recording (for setup commands)
Show                         # Resume recording

Combining Both Tools

For a complete CLI testing strategy:

| Layer | Tool | Tests |
|---|---|---|
| Unit tests | go test | Data correctness, error handling, exit codes |
| E2E golden files | go test + JSON comparison | Full output correctness, determinism |
| Text recordings | Custom asciicast generator | Documentation accuracy, demo freshness |
| Visual regression | VHS | Formatting, alignment, colors, layout |

Each layer catches different bugs. Unit tests catch logic errors. Golden files catch output regressions. Asciicast recordings catch documentation drift. VHS catches visual formatting bugs.

Getting Started

# Install VHS (macOS)
brew install charmbracelet/tap/vhs

# Install VHS with Go (Linux and others; ttyd and ffmpeg must also be installed)
go install github.com/charmbracelet/vhs@latest

# Create your first tape
cat > hello.tape << 'EOF'
Output hello.gif
Set Width 800
Set Height 400
Type "echo 'Hello from VHS'"
Enter
Sleep 1s
EOF

# Record
vhs hello.tape

The GIF is your visual test. Commit it, compare it in CI, review it in PRs.


Stave uses programmatic asciicast generation for documentation recordings and Go-based golden file testing for output correctness. VHS is the natural next step for visual regression testing of the text-formatted output.

Top comments (4)

Gilder Miller

It is a useful framing. Most CLI testing stops at the data layer and assumes formatting just works.
The table wrapping edge case is where this really earns its keep. A unit test sees the right data, but a wrapped column breaks the entire visual hierarchy. That is the kind of bug users notice immediately, but tests never catch.
Curious about the CI workflow. Do you find the GIF file sizes cause any issues with artifact storage or PR load times? A 120x40 terminal recording can get surprisingly heavy compared to a plain text fixture.

Bala Paranj

Good catch — the size cost is real (a 120×40 / ~3s recording lands at 150–200 KB; twenty tapes is ~4 MB of committed binaries that git history won't deduplicate). But the bigger problem I've since hit is that VHS GIFs aren't byte-deterministic across runs — font hinting, frame timing, and cursor-blink phase all jitter the output — so the git diff --exit-code workflow in the article fails spuriously. To do real visual regression you need a perceptual diff (SSIM or frame-extracted comparison), not git diff. Honest update: VHS is best used as automated documentation recording, not as a regression-testing tool. For layout regressions, plain-text golden files with fixed-width rendering catch the same bugs deterministically and cheaply.

Gilder Miller

Really appreciate the honest update. You nailed it. That jitter makes binary diffs useless for visual stuff, so text golden files are definitely the right call for reliable regression testing.
Do you have a specific script to generate those golden files, or do you manage them manually?

Thank you!

Bala Paranj

You are welcome. For Stave I use Go's testscript package with .txtar files — each test bundles the command sequence and the expected stdout/stderr in a single text file. To regenerate, you run the tests with go test -update (or -rewrite in newer versions), which re-captures the actual output into the expected sections. The file is plain text so a regenerated golden shows up as a normal text diff in the PR, reviewable line-by-line. No separate generation script — the test runner is the generator. The trick that makes it deterministic is freezing time with a --now flag on the CLI and pinning terminal width via COLUMNS, so the same input always produces the same bytes. Two short pieces in the rogpeppe/go-internal docs (testscript research.swtch.com/testing) and Russ Cox's Quick Testing with Test Scripts post are the best starting points if you want to wire it up.