gron Was Right: Flat JSON Is Easier to Grep (A Rust Take With Four Formats and a Round-Trip)
A small Rust CLI that flattens nested JSON into path = value lines — plus JSONPath, env, TSV, and JSON output modes, and a bidirectional --invert that rebuilds the original document. Written because I found myself reaching for gron constantly, and I wanted a single-binary Rust version with a couple of extra formats and fewer surprises.
📦 GitHub: https://github.com/sen-ltd/json-flatten
I spend a lot of time looking at JSON on the command line. APIs, config files, webhook payloads, kubectl get -o json, a misbehaving service's log line that somebody decided to JSON-encode. Pretty-printed output is easy on the eyes for one document but hostile to everything else you want to do with JSON at the shell: grepping it, diffing two versions of it, loading it into a spreadsheet, piping it into cut.
gron is the classic answer to this. It flattens JSON into greppable assignments and it's great. It's also written in Go, which is fine — except Go tools don't always feel at home in a Rust project, they don't always share the musl-alpine minimal-image story, and I wanted a version where I could tweak the path quoting rules and add a few extra output formats without having to learn Go first.
So I wrote json-flatten. It's roughly 600 lines of Rust, three dependencies (clap, serde_json, glob), and it drops into a 9.48 MB alpine container. This article is a tour of why the "flat JSON" format matters, how the flattener and ungron are actually implemented, and the tradeoffs I picked on purpose.
The problem: pretty JSON is hostile to pipes
Here's a tiny API response:
{
"users": [
{"name": "alice", "age": 30, "role": "admin"},
{"name": "bob", "age": 25, "role": "user"}
],
"total": 2,
"version": "1.4.0"
}
Now grep it for alice:
$ grep alice response.json
{"name": "alice", "age": 30, "role": "admin"},
That's a matching line, but it doesn't tell you where alice lives. Is she a top-level key? A user? An admin? You have to look at the file to find out. For a 50 KB webhook payload with half a dozen nested arrays, that's the kind of sixty-second interruption that compounds over a debugging session.
Flatten it and the problem disappears:
$ json-flatten response.json | grep alice
users[0].name = "alice"
Every line carries the full path from the root, so grep is location-aware by construction. The same trick works for diffs. diff a.json b.json on pretty-printed output is order-sensitive and noisy; diff <(json-flatten --sort-keys a.json) <(json-flatten --sort-keys b.json) is a handful of added/removed lines that you can actually read.
And then there are the other cases — TSV into a spreadsheet, env lines into a .env file, an explicit JSON-typed stream for another JSON tool. Those aren't what gron does, but they all want the same underlying operation: walk the tree, emit one row per leaf, give each row a path.
Design
The flattener, in 40 lines
The core is a recursive walk over serde_json::Value. Scalars and empty containers are leaves; non-empty objects and arrays recurse. The path is passed down as a mutable Vec<Segment> and pushed/popped on each step — no allocation per recursion level except for the keys themselves.
fn walk(
value: &Value,
path: &mut Path,
depth: usize,
opts: FlattenOptions,
out: &mut Vec<Entry>,
) {
// Depth cap: emit the whole subtree as one literal entry.
if let Some(max) = opts.max_depth {
if depth >= max {
out.push(Entry { path: path.clone(), value: value.clone() });
return;
}
}
match value {
Value::Object(map) => {
if map.is_empty() {
out.push(Entry { path: path.clone(), value: Value::Object(Map::new()) });
return;
}
for (k, v) in map.iter() {
path.push_key(k.clone());
walk(v, path, depth + 1, opts, out);
path.pop();
}
}
Value::Array(arr) => {
if arr.is_empty() {
out.push(Entry { path: path.clone(), value: Value::Array(Vec::new()) });
return;
}
for (i, v) in arr.iter().enumerate() {
path.push_index(i);
walk(v, path, depth + 1, opts, out);
path.pop();
}
}
_ => {
out.push(Entry { path: path.clone(), value: value.clone() });
}
}
}
A few non-obvious choices here:
- Empty containers are leaves. {"a": {}} becomes one entry, a = {}, not zero entries. If you drop them, the ungron round-trip loses them and the output can't represent a document with an explicitly-empty object.
- --depth N emits a literal. When recursion stops, the remaining subtree is emitted verbatim as a serde_json::Value. The format layer serializes it back to JSON, so you get a.b = {"c": 1} and you can still read it with jq.
- The flattener takes no I/O, no config, no format choice. It returns Vec<Entry>. Filtering, sorting, and output formatting all live downstream. This is worth doing even for a 600-line tool — the integration tests spin up the binary, but the unit tests can hit flatten() directly with serde_json::json!({...}) literals, and that's where the bulk of the coverage lives.
Path escaping: the part you can't punt on
The obvious output format is users[0].name = "alice", but the moment you allow arbitrary JSON you have to decide what to do with keys that contain dots, brackets, quotes, spaces, or nothing at all. Real webhook payloads contain all of these.
The rule I settled on:
A key is rendered unquoted if and only if it's a "bare identifier": starts with a letter or underscore, contains only letters, digits, and underscores. Everything else gets bracket-quoted as a JSON-escaped string.
pub fn is_bare_identifier(k: &str) -> bool {
let mut chars = k.chars();
match chars.next() {
Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
_ => return false,
}
chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
So foo stays foo, foo_bar stays foo_bar, but a.b becomes ["a.b"], with "quote" becomes ["with \"quote\""], and 日本語 becomes ["日本語"]. The quoting mirrors JSON string syntax, which matters for two reasons:
- Humans can read it. If you know JSON strings, you know the escape rules.
- The --invert parser can unambiguously recover the original key. That's a correctness property: for anything the flattener produces, flatten → invert == original.
The alternative I considered was gron-style backticks or Python-style single quotes. Both are fine for display but both collide with common shell characters. Double-quoted JSON-escaped keys survive being re-parsed by other JSON tools without further trouble.
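To make the rule concrete, here's a self-contained sketch of the bracket-quoting described above. is_bare_identifier is the article's function; render_key and its hand-rolled escaper are illustrative stand-ins for whatever the real tool does internally (which could just as well call serde_json::to_string on the key).

```rust
fn is_bare_identifier(k: &str) -> bool {
    let mut chars = k.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

// Hypothetical helper: render one object key as a path segment —
// bare if it's an identifier, otherwise bracket-quoted as a JSON string.
fn render_key(k: &str) -> String {
    if is_bare_identifier(k) {
        k.to_string()
    } else {
        let mut s = String::from("[\"");
        for c in k.chars() {
            match c {
                '"' => s.push_str("\\\""),
                '\\' => s.push_str("\\\\"),
                '\n' => s.push_str("\\n"),
                c => s.push(c), // non-ASCII passes through unescaped
            }
        }
        s.push_str("\"]");
        s
    }
}

fn main() {
    assert_eq!(render_key("foo_bar"), "foo_bar");
    assert_eq!(render_key("a.b"), "[\"a.b\"]");
    assert_eq!(render_key("日本語"), "[\"日本語\"]");
    println!("ok");
}
```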
Ungron: parse the same output back
--invert is probably the feature I use most often. The workflow is: flatten → grep or sed to modify → ungron to rebuild. The parser is straightforward because we designed the output format to be parseable:
pub fn unflatten(input: &str) -> Result<Value, String> {
let mut root: Value = Value::Null;
let mut seen_root_assign = false;
for (lineno, raw_line) in input.lines().enumerate() {
let line = strip_ansi(raw_line);
let line = line.trim();
if line.is_empty() || line.starts_with('#') { continue; }
let eq = line.find('=').ok_or_else(|| {
format!("line {}: no '=' found: {}", lineno + 1, line)
})?;
let path_part = line[..eq].trim();
let value_part = line[eq + 1..].trim();
let path = parse_dots(path_part)
.map_err(|e| format!("line {}: bad path: {}", lineno + 1, e))?;
let value: Value = serde_json::from_str(value_part)
.map_err(|e| format!("line {}: bad value: {}", lineno + 1, e))?;
if path.is_empty() {
root = value;
seen_root_assign = true;
continue;
}
seen_root_assign = true;
set_at(&mut root, path.segments(), value, lineno + 1)?;
}
if !seen_root_assign {
return Err("no entries found in input".to_string());
}
Ok(root)
}
Two things the parser has to handle that aren't immediately obvious:
- ANSI escapes. The default flatten output is colored when stdout is a TTY, and it's tempting to json-flatten file.json | json-flatten - --invert in a shell. You don't want to force users to remember --no-color. So the parser strips ANSI SGR sequences before parsing each line.
- Comments and blanks. I allow #-prefixed comment lines and blank lines through the parser so you can hand-edit an intermediate file. (This is also what makes it pleasant to use as a half-way "data definition" format for small config files.)
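For reference, here's a minimal sketch of what an ANSI stripper can look like — the real strip_ansi may differ. This one drops any CSI sequence (ESC, '[', parameters, then a final byte in 0x40–0x7E), which covers the SGR color codes the flattener emits.

```rust
// Strip CSI escape sequences (including SGR color codes) from a line.
// A sketch, not the tool's actual implementation.
fn strip_ansi(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1b}' && chars.peek() == Some(&'[') {
            chars.next(); // consume '['
            // skip parameter bytes until the final byte terminates the sequence
            while let Some(&n) = chars.peek() {
                chars.next();
                if ('\u{40}'..='\u{7e}').contains(&n) {
                    break;
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // A bare '[' (as in users[0]) is untouched; only ESC-[ sequences go.
    let colored = "\u{1b}[32musers[0].name\u{1b}[0m = \"alice\"";
    assert_eq!(strip_ansi(colored), "users[0].name = \"alice\"");
    println!("ok");
}
```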
The write-at-path logic is a small recursive function that grows objects and arrays as it walks. The interesting bit is that arrays are grown by padding with null until the index exists, so xs[3] = 10 in the middle of a stream implicitly creates xs[0..=2] = null. That matches how the flattener emits arrays in order, but it's a choice — you could also require dense input and error on gaps.
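Here's a self-contained sketch of that padding behavior, using a stand-in Value enum (array indices only, object keys omitted) instead of serde_json, so the null-fill logic is visible in isolation:

```rust
// Stand-in types for illustration; the real tool uses serde_json::Value
// and a richer Segment with both Key and Index variants.
#[derive(Debug, PartialEq, Clone)]
enum Value {
    Null,
    Num(i64),
    Array(Vec<Value>),
}

enum Segment {
    Index(usize),
}

// Write `value` at `path`, growing arrays as we walk: xs[3] = 10
// implicitly creates xs[0..=2] = null, matching the article's choice.
fn set_at(root: &mut Value, path: &[Segment], value: Value) {
    match path {
        [] => *root = value,
        [Segment::Index(i), rest @ ..] => {
            if !matches!(root, Value::Array(_)) {
                *root = Value::Array(Vec::new());
            }
            if let Value::Array(arr) = root {
                while arr.len() <= *i {
                    arr.push(Value::Null); // pad the gap with nulls
                }
                set_at(&mut arr[*i], rest, value);
            }
        }
    }
}

fn main() {
    let mut root = Value::Null;
    set_at(&mut root, &[Segment::Index(3)], Value::Num(10));
    assert_eq!(
        root,
        Value::Array(vec![Value::Null, Value::Null, Value::Null, Value::Num(10)])
    );
    println!("ok");
}
```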
I ran this round-trip on a bunch of real JSON: Kubernetes objects, GitHub webhook payloads, OpenAPI specs, npm ls --json. As long as you don't use --depth (which is deliberately lossy) and you don't use --sort-keys on a document where key order is significant (arguably it never should be, but it sometimes is), the round-trip is exact.
The env format collision problem
One format I almost didn't include, and still think twice about every time someone uses it, is --format env:
$ echo '{"db":{"host":"localhost","port":5432}}' | json-flatten - --format env
DB_HOST=localhost
DB_PORT=5432
This is great for piping into source or docker run --env-file or whatever you're using to push config into a process. It's also lossy by design. The key transformation is "uppercase every alphanumeric, replace everything else with underscore, join segments with underscore". That map is not injective:
- {"db.host": "a"} and {"db": {"host": "a"}} both produce DB_HOST=a.
- {"db_host": "a"} does too.
- So does {"db-host": "a"}.
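The mapping is easy to reproduce. This sketch (env_key is an illustrative name, not the tool's actual API) shows all four shapes collapsing onto the same key:

```rust
// Env-key transform described above: uppercase ASCII alphanumerics,
// replace everything else with '_', join segments with '_'.
fn env_key(segments: &[&str]) -> String {
    segments
        .iter()
        .map(|seg| {
            seg.chars()
                .map(|c| {
                    if c.is_ascii_alphanumeric() {
                        c.to_ascii_uppercase()
                    } else {
                        '_'
                    }
                })
                .collect::<String>()
        })
        .collect::<Vec<_>>()
        .join("_")
}

fn main() {
    // Nested, dotted, underscored, hyphenated: the map is not injective.
    assert_eq!(env_key(&["db", "host"]), "DB_HOST");
    assert_eq!(env_key(&["db.host"]), "DB_HOST");
    assert_eq!(env_key(&["db_host"]), "DB_HOST");
    assert_eq!(env_key(&["db-host"]), "DB_HOST");
    println!("ok");
}
```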
If you then try to round-trip the env output back into JSON, you can't. There isn't enough information to reconstruct whether DB_HOST was nested, flat, hyphenated, or underscored. The README calls this out explicitly: use env for one-way piping, not for round-tripping.
I considered making the collision detectable — walk the output and error if two distinct JSON paths collide on the same env key. In the end I didn't, because the common case is flat config objects where collisions never happen, and the rare case is someone deliberately putting a dotted key into their JSON, in which case they already know what they're doing. But it's on the list.
JSONPath vs dots vs TSV vs JSON
Four output formats might sound like overkill but each of them solves a real problem:
- dots is the default. It's what gron does, roughly. Best for grep and diff.
- jsonpath is strict JSONPath ($.users[0].name). This matters because some downstream tools are strict about the $ prefix and won't accept the bare dotted form. I use it when I want to paste a path into a config file that takes JSONPath expressions, or when I'm teaching somebody who already knows JSONPath.
- tsv is two columns, tab-separated, value stringified. Drops straight into a spreadsheet or awk '{print $2}'.
- env is the uppercase shell-style form above.
- json is an array of {path, value, type} objects. The type field is the most useful part — the Rust side knows whether a leaf is a string, number, bool, or null, and it's a waste to throw that away. jq consumers love it.
I didn't add a CSV format. TSV is less ambiguous (tabs don't appear in JSON leaf values unless somebody's done something strange) and tools that want CSV can pipe through tr '\t' ',' or use mlr.
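The type field is cheap to emit because the leaf's variant is already known at flatten time. A sketch with a stand-in leaf enum — the type-name strings here are illustrative, not necessarily the tool's exact output:

```rust
// Stand-in for the leaf variants a flattener sees; the real tool
// matches on serde_json::Value instead.
enum Leaf {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
}

// Map a leaf to the "type" string carried in the json output format.
fn type_name(v: &Leaf) -> &'static str {
    match v {
        Leaf::Null => "null",
        Leaf::Bool(_) => "boolean",
        Leaf::Num(_) => "number",
        Leaf::Str(_) => "string",
    }
}

fn main() {
    assert_eq!(type_name(&Leaf::Bool(true)), "boolean");
    assert_eq!(type_name(&Leaf::Num(2.0)), "number");
    assert_eq!(type_name(&Leaf::Str("x".to_string())), "string");
    assert_eq!(type_name(&Leaf::Null), "null");
    println!("ok");
}
```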
Tradeoffs I picked on purpose
- No streaming. The whole document goes into serde_json::Value. For 99% of JSON you see in the wild, this is fine. For GB-scale logs, use jq -c with split. Adding a streaming mode would double the complexity and I don't need it.
- No JSON5 / JSONC. Strict JSON only. serde_json is strict by default and I'm not going to pull in a second parser.
- No custom value rendering. Strings are always JSON-quoted in the output. This is a deliberate choice to match the gron convention and to make --invert parseable. It does mean that foo = "bar" looks slightly uglier than foo = bar for bare strings, and yes, I've gotten used to it.
- Default output sorts object keys. This is a serde_json default (without the preserve_order feature) and I decided it was a feature, not a bug: predictable output is easier to diff. --sort-keys is still there as an explicit opt-in for when you want to force it.
- Path quoting uses double quotes, not backticks. Backticks are friendlier in Markdown but they're ambiguous in the shell and many terminals. JSON-style double quotes are what every JSON tool already knows how to parse.
Tests
57 tests, split about two-thirds unit and one-third CLI integration. The unit tests pound on flatten(), path escaping, parsing, format rendering, and the ungron round-trip. The integration tests spin up the binary via assert_cmd and drive each output format through stdin → stdout, verify the exit codes (0 success, 1 bad JSON, 2 bad args), and run the full flatten-then-invert pipeline end to end.
The round-trip test is the one I'm happiest with:
#[test]
fn ungron_roundtrip_known() {
let original = json!({
"users": [
{"name": "alice", "age": 30},
{"name": "bob", "age": 25}
],
"total": 2,
"active": true
});
let flat = crate::flatten::flatten(&original, Default::default());
let mut text = String::new();
for e in &flat {
text.push_str(&e.path.to_dots());
text.push_str(" = ");
text.push_str(&serde_json::to_string(&e.value).unwrap());
text.push('\n');
}
let back = unflatten(&text).unwrap();
assert_eq!(back, original);
}
It flattens a non-trivial document, manually renders the flat format exactly like the CLI does, runs the unflattener on the result, and asserts byte-for-byte equality with the original serde_json::Value. If this test ever fails, something is wrong with either the escaping or the path parser — and both have bitten me during development, so it's earned its keep.
Try it in thirty seconds
# Build and run
docker build -t json-flatten https://github.com/sen-ltd/json-flatten.git
# Flatten a small object
echo '{"users":[{"name":"alice","age":30}],"total":1}' \
| docker run --rm -i json-flatten -
# All four formats
docker run --rm -i json-flatten - --format jsonpath < payload.json
docker run --rm -i json-flatten - --format env < config.json
docker run --rm -i json-flatten - --format tsv < data.json
docker run --rm -i json-flatten - --format json < response.json
# Round-trip
echo '{"a":{"b":{"c":1}}}' \
| docker run --rm -i json-flatten - \
| docker run --rm -i json-flatten - --invert
And if you're going to use it daily:
alias jf=json-flatten
Closing
gron was right: flat JSON is easier to grep. It was also right about a lot of the specific design decisions — colon-less assignments are human-readable, bracket-quoted escape rules are unambiguous, and ungron is a surprisingly useful affordance. json-flatten is my Rust take on the same idea with a handful of extra output formats, stricter path escaping for weird keys, and a round-trip test I trust.
Source: https://github.com/sen-ltd/json-flatten. Fork it, steal the path-escaping logic, add the formats you wish I'd added. MIT.
