DEV Community

SEN LLC


Writing a YAML Linter in Rust Because YAML Is Secretly Terrible

A small Rust CLI that catches the YAML problems nobody warns you about — tabs, duplicate keys, BOMs, indent mixing — with a --fix mode that's conservative enough to put in a pre-commit hook. Two runtime dependencies, ≤20 MB Docker image.

📦 GitHub: https://github.com/sen-ltd/yaml-lint


The pitch in one paragraph

YAML looks like it's just "JSON for humans". In practice, every team I've worked on has shipped at least one broken YAML file to CI and learned the hard way that YAML has a surprising number of foot-guns. Tabs are forbidden in indentation. Duplicate keys are "implementation-defined" under the YAML 1.2 spec, which means one parser silently keeps the first, another silently keeps the last. A leading BOM will make some YAML libraries refuse to load the file. Trailing whitespace doesn't break anything but makes diff review painful. CRLF endings confuse a surprising number of tools. And the classic "Norway problem" — country: no parsing as the boolean false — lives on in YAML 1.1 parsers.

yamllint (Python) is the well-known tool for this, but it's heavy, slow to start under a Docker entrypoint, and its ruleset has grown large enough that every repo needs a config file. yq parses YAML beautifully but doesn't lint. I wanted a tiny, single-binary, zero-config Rust CLI I could drop into a pre-commit hook on every repo and forget about. This is yaml-lint.

Concretely, the tool implements nine rules, three output formats, a conservative --fix mode, and three exit codes, in about 900 lines of Rust across six source files. Two runtime dependencies: clap and serde_yaml. The release image is 9.55 MB on Alpine. There are 61 tests.

The rest of this article is the design of that tool: which rules are line-based and which need a real parser, why --fix is deliberately narrow, and a few specific Rust patterns that turned out to matter.

The module shape

I split the code into six small files on purpose:

src/
├── main.rs        # CLI entry, file IO, dispatch, exit codes
├── cli.rs         # clap definitions
├── rules.rs       # line-based lint rules (pure functions over text)
├── structural.rs  # serde_yaml-backed checks (duplicate-key, empty-value)
├── fixer.rs       # mechanical rewriter (pure: text -> text)
└── formatters.rs  # text / json / github output

rules.rs and fixer.rs take &str and return Vec<Finding> or String respectively. No file IO, no panics, no global state. That makes them trivially testable — and it means the integration tests in tests/cli.rs are really only for things you can only test through the CLI (argument parsing, exit codes, actual file writes).

Rule 1: tabs — pure, line-based, and unexpectedly fundamental

Tabs are the simplest rule and also the most important, because the YAML spec forbids tabs in indentation:

YAML does not rely on any particular context-dependent interpretation for its tokens, including indentation which is always counted in number of spaces. Tab characters must not be used.

Most parsers error out on tab-indented files, but with a confusing error message and a line number that's sometimes wrong. Catching it ahead of time is cheap:

pub fn check_tabs(text: &str) -> Vec<Finding> {
    let mut out = Vec::new();
    for (i, line) in text.split('\n').enumerate() {
        if let Some(col) = line.find('\t') {
            out.push(Finding::new(
                "tabs",
                Severity::Error,
                i + 1,
                col + 1,
                "tab character; YAML forbids tabs in indentation",
            ));
        }
    }
    out
}

Two details that matter:

  1. split('\n') not lines(). str::lines() is nice, but it loses information — you can't tell whether the file ended with a final newline or not, which is itself a rule (no-trailing-newline). Consistently using split('\n') across all rules means every rule sees the same byte positions.
  2. Byte find is fine here only because \t is ASCII. For the long-line rule I use chars().count() instead, because a 50-character CJK string is 150 bytes, and flagging that as a "long line" would make the linter useless for Japanese YAML. Small thing; important in practice.
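Here's that distinction in miniature (a std-only snippet for illustration, not code from the actual repo):

```rust
fn main() {
    // 50 CJK characters, 3 bytes each in UTF-8: 150 bytes total.
    let cjk: String = "日".repeat(50);
    assert_eq!(cjk.len(), 150);          // byte length
    assert_eq!(cjk.chars().count(), 50); // character count

    // At a 120-column limit, a byte-based check would wrongly flag
    // this line; the character-based check correctly lets it through.
    assert!(cjk.len() > 120);
    assert!(cjk.chars().count() <= 120);
}
```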

The whole rules.rs module follows this shape: one pure function per rule, all called from a single run_all(text, &disabled, max_line_length) that sorts and returns findings. Unit tests live in the same file, one per rule, with a positive case and a negative case. That gives me 14 of the 61 tests just from the rules module.
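To give a sense of that shape, here's a toy run_all with two of the rules and a simplified Finding (a sketch of the dispatcher pattern, not the real rules.rs):

```rust
#[derive(Debug, PartialEq)]
struct Finding {
    rule: &'static str,
    line: usize,
    col: usize,
}

// Each rule is a pure function over the raw text.
fn check_tabs(text: &str) -> Vec<Finding> {
    text.split('\n')
        .enumerate()
        .filter_map(|(i, l)| l.find('\t').map(|c| Finding { rule: "tabs", line: i + 1, col: c + 1 }))
        .collect()
}

fn check_trailing_ws(text: &str) -> Vec<Finding> {
    text.split('\n')
        .enumerate()
        .filter(|(_, l)| l.len() > l.trim_end().len())
        .map(|(i, l)| Finding { rule: "trailing-whitespace", line: i + 1, col: l.trim_end().len() + 1 })
        .collect()
}

// The dispatcher: run every rule, drop disabled ones, sort by position.
fn run_all(text: &str, disabled: &[&str]) -> Vec<Finding> {
    let rules: &[fn(&str) -> Vec<Finding>] = &[check_tabs, check_trailing_ws];
    let mut out: Vec<Finding> = rules.iter().flat_map(|r| r(text)).collect();
    out.retain(|f| !disabled.contains(&f.rule));
    out.sort_by_key(|f| (f.line, f.col));
    out
}

fn main() {
    let findings = run_all("a: 1  \n\tb: 2\n", &[]);
    assert_eq!(findings.len(), 2);
    assert_eq!(findings[0].rule, "trailing-whitespace"); // line 1
    assert_eq!(findings[1].rule, "tabs");                // line 2
}
```

Because every rule has the same signature, adding a rule is one function plus one entry in the slice.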

Rule 2: duplicate-key — why serde_yaml isn't enough

This is the rule where the architecture gets interesting.

A duplicate key is two instances of the same key at the same indentation level in the same mapping:

server:
  port: 8080
  port: 9090   # which one wins?

The YAML 1.2 spec says this is implementation-defined. Most parsers silently keep one of the two (serde_yaml actually errors, which is unusual — more on that below). Either way, if your config file has a duplicate key, it's almost certainly a bug, and you want to catch it at lint time.

The naive approach is: parse the YAML with serde_yaml, walk the Value::Mapping tree, look for duplicates. It doesn't work, for two reasons:

  1. serde_yaml::Value::Mapping is backed by a deduplicating map. By the time you have a Value in hand, the duplicates are already gone. You literally cannot see them.
  2. serde_yaml::from_str errors on duplicates before returning a Value at all. The error message includes the key name but has no line number for either occurrence.

So serde_yaml fails twice: once because it drops information, and once because it rejects the input I'm trying to analyze. The fix is to do duplicate-key detection on the raw source text with a hand-rolled scanner, then only call serde_yaml for the things it's good at (structural walking of a valid tree).

Here's the essence of the scanner. It keeps a stack of (indent, seen_keys) frames and reports the second occurrence of a key at the same indent:

pub fn find_duplicate_keys(text: &str) -> Vec<Finding> {
    let mut out = Vec::new();
    let mut stack: Vec<(usize, Vec<String>)> = Vec::new();

    for (i, line) in text.split('\n').enumerate() {
        let trimmed = line.trim_start();
        if trimmed.is_empty()
            || trimmed.starts_with('#')
            || trimmed.starts_with("---")
            || trimmed.starts_with("- ")
            || trimmed == "-"
        {
            continue;
        }
        let indent = line.len() - trimmed.len();

        // Pop frames that are deeper than the current line.
        while let Some((top, _)) = stack.last() {
            if *top > indent {
                stack.pop();
            } else {
                break;
            }
        }

        let Some(key) = extract_key(trimmed) else { continue };

        if stack.last().map(|(ind, _)| *ind) != Some(indent) {
            stack.push((indent, Vec::new()));
        }
        let frame = stack.last_mut().unwrap();
        if frame.1.contains(&key) {
            out.push(Finding::new(
                "duplicate-key",
                Severity::Error,
                i + 1,
                indent + 1,
                format!("duplicate key `{}` at this indentation level", key),
            ));
        } else {
            frame.1.push(key);
        }
    }
    out
}

extract_key handles the two common cases (quoted keys like "weird key": 1 and unquoted ones like name:) and deliberately rejects flow-style mappings ({a: 1, a: 2}) because you'd need a real tokenizer to handle those correctly, and in practice nobody writes duplicate keys inside flow style — it only happens in block style where the visual distance between the two hides the mistake. YAGNI.
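For the curious, a simplified extract_key covering those two cases might look like this (a sketch under the same assumptions, not the actual function):

```rust
/// Extract a block-style mapping key from an already-trimmed line.
/// Handles `name: value` and `"weird key": value`; deliberately
/// ignores flow style. Simplified sketch, not the real extract_key.
fn extract_key(trimmed: &str) -> Option<String> {
    if trimmed.starts_with('{') || trimmed.starts_with('[') {
        return None; // flow style: out of scope
    }
    if let Some(rest) = trimmed.strip_prefix('"') {
        // Quoted key: everything up to the closing quote, which must
        // be immediately followed by ':'.
        let end = rest.find('"')?;
        if rest[end + 1..].starts_with(':') {
            return Some(rest[..end].to_string());
        }
        return None;
    }
    // Unquoted key: text before a ':' that is followed by whitespace
    // or end of line (so `https://...` is not treated as a key).
    let colon = trimmed.find(':')?;
    let after = &trimmed[colon + 1..];
    if after.is_empty() || after.starts_with(' ') {
        return Some(trimmed[..colon].trim_end().to_string());
    }
    None
}

fn main() {
    assert_eq!(extract_key("name: alice"), Some("name".to_string()));
    assert_eq!(extract_key("name:"), Some("name".to_string()));
    assert_eq!(extract_key("\"weird key\": 1"), Some("weird key".to_string()));
    assert_eq!(extract_key("{a: 1, a: 2}"), None);
    assert_eq!(extract_key("https://example.com"), None);
}
```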

The tests for this one are interesting because they include negative cases: name: x inside two different parents is not a duplicate, even though the word "name" appears twice. The stack-based approach gets that right.

Rule 3: empty-value — and the parse-error fallback

empty-value wants to catch:

api_key:
rate_limit: 100

where api_key: is probably a typo — the author meant to put a value there and never did. This is a case where serde_yaml actually helps: if parsing succeeds, I walk the Value::Mapping and look for Value::Null children:

fn walk_empty(v: &Value, out: &mut Vec<Finding>) {
    match v {
        Value::Mapping(m) => {
            for (k, val) in m {
                if matches!(val, Value::Null) {
                    let key_name = match k {
                        Value::String(s) => s.clone(),
                        other => format!("{:?}", other),
                    };
                    out.push(Finding::new(
                        "empty-value",
                        Severity::Info,
                        0,
                        0,
                        format!("key `{}` has an empty value", key_name),
                    ));
                }
                walk_empty(val, out);
            }
        }
        Value::Sequence(s) => s.iter().for_each(|i| walk_empty(i, out)),
        _ => {}
    }
}

Note the line and column of 0 in that finding: serde_yaml::Value loses source positions, so the finding can't point at a line. That's fine when the parse succeeds; it's a severity-info finding and the rule and key name are enough to grep for.

But what happens when the parse fails? That's where it gets subtle. If the file has tabs or duplicate keys, serde_yaml::from_str errors out, and I lose my ability to walk the tree — which means a file that has both a duplicate key and an empty value would only report the duplicate and silently miss the empty. That's a bad user experience: fix one thing, run again, get a new finding.

The fallback is a second, much dumber empty-value scanner that runs whenever the parse fails. It just looks at lines matching ^\s*key:\s*$ and checks that the next non-blank line isn't more indented (otherwise it's a mapping header, not an empty value):

fn scan_empty_values_text(text: &str) -> Vec<Finding> {
    // ... extract `key:` lines with no non-comment content after ':'
    // ... for each, look at the next non-blank line
    //     if it's more indented, this is a mapping header, skip
    //     otherwise, emit an empty-value finding
}

This is imprecise — it can't handle quoted-key edge cases as well as serde_yaml — but it's a fallback, not a replacement: when the parser is happy I still get the accurate walk. The result is that running the linter on a file with tabs, duplicate keys, a BOM, and an empty value reports all four in one pass, which is what you want from a linter.
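The fallback described above can be sketched like this (simplified for illustration: it returns (line, key) pairs instead of Findings, and the real scan_empty_values_text is stricter about comments and quoting):

```rust
/// Text-only fallback for empty-value detection, for when the real
/// parse fails. A sketch of the approach, not the actual function.
fn scan_empty_values(text: &str) -> Vec<(usize, String)> {
    let lines: Vec<&str> = text.split('\n').collect();
    let mut out = Vec::new();
    for (i, line) in lines.iter().enumerate() {
        let trimmed = line.trim_start();
        // Only lines of the form `key:` with nothing after the colon.
        let Some(key) = trimmed.strip_suffix(':') else { continue };
        if key.is_empty() || key.contains(' ') || trimmed.starts_with('#') {
            continue;
        }
        let indent = line.len() - trimmed.len();
        // If the next non-blank line is more indented, this is a
        // mapping header, not an empty value.
        let has_deeper_child = lines[i + 1..]
            .iter()
            .find(|l| !l.trim().is_empty())
            .map(|l| l.len() - l.trim_start().len() > indent)
            .unwrap_or(false);
        if !has_deeper_child {
            out.push((i + 1, key.to_string()));
        }
    }
    out
}

fn main() {
    let doc = "api_key:\nrate_limit: 100\nserver:\n  port: 8080\n";
    // `api_key:` is followed by a sibling, so it's empty;
    // `server:` has a more-indented child, so it's a mapping header.
    assert_eq!(scan_empty_values(doc), vec![(1, "api_key".to_string())]);
}
```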

Rule 4: --fix and the conservative scope

The fixer module is the smallest and the one I'm most opinionated about. It's a pure function: fn fix(text: &str) -> String. It performs five mechanical rewrites:

  1. Strip a leading BOM if present.
  2. Replace CRLF with LF, and drop any stray CR.
  3. Replace each leading tab with 2 spaces.
  4. Trim trailing whitespace from every line.
  5. Append a final newline if missing.
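Those five rewrites compose into a single pure pass. A minimal sketch of the shape (assumptions of mine, not the actual fixer.rs):

```rust
/// Whitespace-only fixer: BOM, line endings, leading tabs, trailing
/// whitespace, final newline. A sketch of the shape, not the real code.
fn fix(text: &str) -> String {
    // 1. Strip a leading BOM if present.
    let text = text.strip_prefix('\u{feff}').unwrap_or(text);
    // 2. Replace CRLF with LF and drop any stray CR.
    let text = text.replace("\r\n", "\n").replace('\r', "");

    // 3 + 4. Per line: expand leading tabs to 2 spaces, trim trailing
    // whitespace. Tabs are 1 byte, so byte slicing is safe here.
    let mut fixed = text
        .split('\n')
        .map(|line| {
            let n_tabs = line.chars().take_while(|&c| c == '\t').count();
            format!("{}{}", "  ".repeat(n_tabs), line[n_tabs..].trim_end())
        })
        .collect::<Vec<_>>()
        .join("\n");

    // 5. Append a final newline if missing.
    if !fixed.ends_with('\n') {
        fixed.push('\n');
    }
    fixed
}

fn main() {
    let input = "\u{feff}\tname: alice   \r\nversion: 1";
    assert_eq!(fix(input), "  name: alice\nversion: 1\n");
    // The idempotency property from the tests below.
    assert_eq!(fix(&fix(input)), fix(input));
}
```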

Notice what it doesn't do:

  • It doesn't fix duplicate keys, because you have to pick which one to keep, and the tool has no basis for that.
  • It doesn't fix indent-mix, because re-indenting correctly requires parsing, which means a round-trip through serde_yaml, which means losing comments. The cure is worse than the disease.
  • It doesn't fix long lines, because the right place to wrap a YAML string literal is a judgment call.
  • It doesn't touch values, only whitespace.

The property I care about most is idempotency: fix(fix(x)) == fix(x). There's a test for it:

#[test]
fn fix_idempotent() {
    let input = "\u{feff}\tname: alice   \r\n";
    let once = fix(input);
    let twice = fix(&once);
    assert_eq!(once, twice);
}

And a stronger one — round-tripping through the linter proves the fixer really fixes what it claims:

#[test]
fn fix_round_trip_reaches_clean_state() {
    let input = "\u{feff}\tname: alice   \nversion: 1\r\n";
    let fixed = fix(input);
    let rule_findings = run_all(&fixed, &[], 120);
    let struct_findings = structural::run(&fixed, &[]);
    let fixable = [
        "bom", "crlf", "tabs",
        "trailing-whitespace", "no-trailing-newline",
    ];
    for f in rule_findings.iter().chain(struct_findings.iter()) {
        assert!(!fixable.contains(&f.rule),
                "unexpected finding after fix: {:?}", f);
    }
}

After one pass, any rule in the "fixable" set must not fire. Any finding that remains must be in the "unfixable" set (duplicate-key, indent-mix, etc.) — and those are what human review is for.

The pattern here is: separate the "things a machine can correct without risk" from the "things that need a human", be explicit about which is which, and don't let the former quietly grow into the latter. Every time a fixer starts to resolve things that need judgment, it starts producing diffs that surprise the author, and then people turn off the fixer. A narrow, predictable fixer stays on.

Exit codes and --fail-on

The CLI has three exit codes — 0, 1, 2 — and a --fail-on flag that decides what counts as "bad" for exit 1:

0  No findings at or above --fail-on (default: error)
1  Findings at or above --fail-on
2  Bad arguments, IO error, unknown rule id

This matches shellcheck, yamllint, and the classic Unix convention. --fail-on warning is the setting I actually run in CI — it's strict enough to force tab cleanup and CRLF normalization but still loose enough to skip long-line and empty-value (which are info severity and often deliberate).
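The decision itself is one comparison once Severity is ordered. A sketch assuming a derived ordering (exit 2 is handled earlier, on argument and IO errors, so it doesn't appear here):

```rust
// Declaration order gives the ordering: Info < Warning < Error.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Warning,
    Error,
}

/// Exit 1 if any finding is at or above the --fail-on threshold.
fn exit_code(findings: &[Severity], fail_on: Severity) -> i32 {
    if findings.iter().any(|&s| s >= fail_on) { 1 } else { 0 }
}

fn main() {
    use Severity::*;
    // Default --fail-on error: warnings alone don't fail the build.
    assert_eq!(exit_code(&[Warning, Info], Error), 0);
    // --fail-on warning: now they do.
    assert_eq!(exit_code(&[Warning, Info], Warning), 1);
    assert_eq!(exit_code(&[], Info), 0);
}
```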

The --fail-on info variant is what you want when running on an unfamiliar codebase for the first time, to get the full picture before deciding which rules to disable.

Three output formats

  • text: path:line:col: severity [rule] message, the default.
  • json: one object per file, for piping into jq or for editor integrations. I rolled a tiny hand-written JSON writer instead of pulling serde_json just for this — three deps would have been one too many.
  • github: GitHub Actions annotation syntax (::error file=...,line=...,col=...::message), so findings surface inline on PRs when you run it in a workflow.
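The hand-rolled JSON writer is mostly a string-escaping problem; here's the essence of an escaper like the one I mean (a sketch, not the actual formatters code):

```rust
/// Minimal JSON string escaper: quotes, backslashes, and control
/// characters. The core of a hand-rolled JSON writer; a sketch.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters get the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    assert_eq!(json_escape("tab\there"), "\"tab\\there\"");
    assert_eq!(json_escape("say \"hi\""), "\"say \\\"hi\\\"\"");
}
```

Finding messages are the only strings that can contain surprises (they quote key names from user files), so escaping them correctly is the whole job.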

The formatters module is pure and takes (&str, &[Finding]) -> String, which means the CLI emit path is basically:

fn emit(path: &str, findings: &[Finding], format: FormatArg) {
    let s = match format {
        FormatArg::Text   => formatters::text(path, findings),
        FormatArg::Json   => format!("{}\n", formatters::json(path, findings)),
        FormatArg::Github => formatters::github(path, findings),
    };
    if !s.is_empty() { print!("{}", s); }
}

That's the whole emit path. Everything interesting happens in the formatters module.

Tradeoffs and what's out of scope

A few things I'm deliberately not doing:

  • Schema validation. That's a different problem with a different tool: use check-jsonschema. yaml-lint is about YAML-the-format, not YAML-the-data.
  • The Norway problem. country: no parsing as boolean false is a YAML 1.1 quirk; serde_yaml targets YAML 1.2, which changes the boolean set to just true/false/True/False/TRUE/FALSE. So no stays a string and the footgun is gone — no rule needed.
  • Comment-preserving --fix for structural issues. To rewrite indent-mix correctly you'd need to go through an AST library that keeps comments, which in the serde_yaml/yaml-rust2 world is surprisingly hard. A text-based fixer is honest about what it can and cannot do.
  • Stdin mode. Every other linter accepts - for stdin; I skipped it because --fix needs a file path, and supporting two different IO modes doubles the test matrix for not much benefit. It can come back if enough people ask.
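To make the Norway bullet concrete, here are the two boolean sets side by side (illustrative only; the 1.1 set follows the old YAML type repository, and real 1.1 parsers vary):

```rust
/// YAML 1.1 booleans per the old type repository: the set that makes
/// `country: no` parse as false.
fn is_bool_1_1(s: &str) -> bool {
    matches!(
        s,
        "y" | "Y" | "yes" | "Yes" | "YES" | "n" | "N" | "no" | "No" | "NO"
            | "true" | "True" | "TRUE" | "false" | "False" | "FALSE"
            | "on" | "On" | "ON" | "off" | "Off" | "OFF"
    )
}

/// YAML 1.2: the boolean set shrinks to true/false variants only.
fn is_bool_1_2(s: &str) -> bool {
    matches!(s, "true" | "True" | "TRUE" | "false" | "False" | "FALSE")
}

fn main() {
    // `country: no` is a boolean under 1.1, a plain string under 1.2.
    assert!(is_bool_1_1("no"));
    assert!(!is_bool_1_2("no"));
}
```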

Try it in 30 seconds

git clone https://github.com/sen-ltd/yaml-lint
cd yaml-lint
docker build -t yaml-lint .
docker run --rm yaml-lint --help

# Real use:
docker run --rm -v "$PWD":/work yaml-lint /work/config.yaml
docker run --rm -v "$PWD":/work yaml-lint /work/config.yaml --fix
docker run --rm -v "$PWD":/work yaml-lint /work/config.yaml --format github

Or as a pre-commit hook:

repos:
  - repo: local
    hooks:
      - id: yaml-lint
        name: yaml-lint
        entry: yaml-lint --fail-on warning
        language: system
        files: \.ya?ml$

The binary is small enough that you can drop it into every CI image without flinching, and the rules are focused enough that you don't need a .yamllint.yaml config file to get useful output on day one.

Takeaways

  1. YAML has more foot-guns than you think, and most of them are catchable by a line-based scanner. You don't need to understand YAML to lint YAML; you just need to understand what can go wrong.
  2. serde_yaml is the wrong tool for some YAML questions. For duplicate-key detection especially, you need to work on raw text, because the parser either deduplicates or rejects the input. Pick the right level of abstraction.
  3. Conservative --fix scope is a feature, not a limitation. A fixer that only touches whitespace can be turned on in pre-commit without fear. A fixer that rewrites structural issues gets disabled the first time it surprises someone.
  4. Pure functions over text are trivially testable. The rules module has one unit test per rule, positive and negative, with no IO and no mocks. That's why the test count could get to 61 without anyone calling it heroic effort.

Source: https://github.com/sen-ltd/yaml-lint. MIT license, two runtime deps, ≤20 MB Docker image. Drop it into your next project's pre-commit and see what falls out.
