Bala Paranj

Posted on Apr 7

Designing Errors Out of Your Go CLI

#cli #codequality #design #go

Most Go CLIs have too many error checks. Not because error handling is wrong — because the errors themselves are wrong.

I read the Power of Go Tools by John Arundel. Based on his book, I audited a 50,000-line Go CLI for functions that return errors unnecessarily. The result: 10 functions refactored, 50+ if err != nil checks eliminated, and the remaining error checks now mean something.

This is John Ousterhout's philosophy from A Philosophy of Software Design: don't handle errors better — design them out of existence.

The Four Patterns

I searched for four specific anti-patterns:

Idempotent No-ops — returning an error for a state that already matches the desired outcome
Empty-Set Errors — returning an error when a slice is empty instead of treating it as "no work to do"
Recoverable Internals — returning an error when the function could make a sensible default decision
Crash-Only Candidates — returning an error from initialization code where failure is truly fatal

Here's what I found and how I fixed each one.

Pattern 1: Idempotent No-ops

The temp file cleanup

// BEFORE: returns error if temp file is already gone
func IsDirectoryWritable(dir string) error {
    f, err := os.CreateTemp(dir, ".stave-write-check-*")
    if err != nil {
        return err
    }
    path := f.Name()
    _ = f.Close()
    return os.Remove(path) // ← why does this error matter?
}

The function's job is to check if a directory is writable. If os.Remove fails because the temp file is already gone, the check already passed. The post-condition is met.

// AFTER: cleanup errors are irrelevant
func IsDirectoryWritable(dir string) error {
    f, err := os.CreateTemp(dir, ".stave-write-check-*")
    if err != nil {
        return err
    }
    path := f.Name()
    _ = f.Close()
    _ = os.Remove(path)
    return nil
}

Rule: If the post-condition is satisfied, the function succeeds. Period.

Pattern 2: Empty-Set Safety

The empty registry

// BEFORE: empty registry is treated as an error
func NewIndex(data []byte) (*Index, error) {
    var idx registryIndex
    if err := yaml.Unmarshal(data, &idx); err != nil {
        return nil, err
    }
    if len(idx.Packs) == 0 {
        return nil, ErrEmptyRegistry // ← is zero packs really an error?
    }
    // ...
}

An empty registry means "no packs enabled." That's a valid state — the caller already checks len(packs) before iterating. The error just forces every caller to handle a case that isn't exceptional.

// AFTER: empty registry is a valid zero-work state
func NewIndex(data []byte) (*Index, error) {
    var idx registryIndex
    if err := yaml.Unmarshal(data, &idx); err != nil {
        return nil, err
    }
    // Zero packs is valid — the registry simply has nothing enabled.
    r := &Index{
        version: strings.TrimSpace(idx.Version),
        // ...
    }
    // ...
}

Rule: If an empty input means "nothing to do," the function should succeed with a zero-value result, not fail.

The empty predicate

// BEFORE: empty expression is an error
if expr == "" {
    return CompiledPredicate{}, fmt.Errorf("predicate produced empty expression")
}

// AFTER: empty predicate = always-pass (no rules = compliant)
if expr == "" {
    expr = "true"
}

An empty predicate in a security tool means "no rules defined for this control." That's not an error — it's compliance by default.

Pattern 3: Recoverable Internals

The nil dependency

// BEFORE: nil loader is a fatal error
func ListSnapshotFilesFlat(ctx context.Context, dir string, opts ScannerOptions) ([]SnapshotFile, error) {
    if opts.MetadataLoader == nil {
        return nil, errMetadataLoaderRequired
    }
    // ...
}

The MetadataLoader extracts timestamps from snapshot files. If it's nil, there's a perfectly good fallback: use the file's modification time.

// AFTER: nil loader falls back to filesystem metadata
if opts.MetadataLoader == nil {
    opts.MetadataLoader = func(filePath, _ string) (time.Time, error) {
        fi, err := os.Stat(filePath)
        if err != nil {
            return time.Time{}, err
        }
        return fi.ModTime(), nil
    }
}

Rule: If a sensible default exists, use it instead of forcing the caller to provide a value they probably don't care about.

The zero time

// BEFORE: zero time is rejected
func NewActiveWindow(openedAt time.Time) (ExposureWindow, error) {
    if openedAt.IsZero() {
        return ExposureWindow{}, fmt.Errorf("opened_at is required")
    }
    return ExposureWindow{openedAt: openedAt, active: true}, nil
}

A zero time means "not specified." The window should still be created — the caller can check w.OpenedAt().IsZero() if they care. Most don't.

// AFTER: zero time is accepted — callers check if they care
func NewActiveWindow(openedAt time.Time) ExposureWindow {
    return ExposureWindow{openedAt: openedAt, active: true}
}

This one removed error checks from 30+ call sites. Most were already using tl, _ := NewActiveWindow(...) — the Go equivalent of saying "I know this error doesn't matter."

Pattern 4: Crash-Only Candidates

The programming error

// BEFORE: returns error that no caller can recover from
func NewExposureLifecycle(a Asset) (*ExposureLifecycle, error) {
    if a.ID.IsEmpty() {
        return nil, fmt.Errorf("asset ID must not be empty")
    }
    return &ExposureLifecycle{ID: a.ID, asset: a}, nil
}

An empty asset ID is a programming error at the call site — the caller constructed the asset. If the ID is empty, there's a bug upstream. No caller can meaningfully recover from this. They all either t.Fatal(err) in tests or propagate it up the stack until someone logs it and exits.

// AFTER: panic on contract violation
func NewExposureLifecycle(a Asset) *ExposureLifecycle {
    if a.ID.IsEmpty() {
        panic("contract violated: NewExposureLifecycle requires non-empty asset ID")
    }
    return &ExposureLifecycle{ID: a.ID, asset: a}
}

This matches the checkContracts() pattern already used in the same file for internal state validation. It eliminated error-check boilerplate from 40+ call sites.

Rule: If the error means "you have a bug," use panic. If the error means "the user gave bad input," return an error. Don't conflate programming errors with operational errors.

The Impact

Before:

10 functions returning errors
50+ if err != nil checks across the codebase
Many callers using tl, _ := ... (ignoring the error anyway)

After:

10 functions with clear contracts
50+ fewer error checks
Every remaining if err != nil represents a truly exceptional event

How to Find These in Your CLI

Search for these patterns:

# Functions where callers ignore the error
grep -rn ', _ :=' --include='*.go' | grep -v vendor/ | sort | uniq -c | sort -rn

# Empty-set errors
grep -rn 'len(.*) == 0' --include='*.go' | grep 'Errorf\|errors.New'

# Existing panic patterns (your team already uses crash-only where appropriate)
grep -rn 'panic(' --include='*.go' | grep -v vendor/ | grep -v _test.go

The first command is the most revealing. If you see 40 call sites ignoring an error, the error shouldn't exist.

When NOT to Do This

User input boundaries: Always validate and return errors for user-supplied data
Network/IO operations: These fail for real external reasons
Cross-package contracts: If the caller is in a different package, errors are the right contract
Data integrity: Schema validation, hash mismatches, malformed input — these are real errors

The goal isn't to eliminate error handling. It's to make every if err != nil mean something. When a developer sees an error check, they should know it represents a truly exceptional event — not a no-op, not an empty set, not a sensible default, and not a programming bug.

These refactorings were applied to Stave, an offline configuration safety evaluator.

DEV Community