
SEN LLC

Writing a tar Inspector in Rust with Three Dependencies


tar -tzf archive.tar.gz | grep something and tar -Oxzf archive.tar.gz some/file.txt both work. They also both feel like reading hieroglyphs. I wanted something where "what's in this thing?" and "show me this one file" were two obviously different commands.

The hard part of writing a tar tool is not parsing tar. tar is an almost comically simple format — a sequence of 512-byte headers, each followed by the file's bytes padded to a 512-byte boundary, terminated by two 512-byte blocks of zeros. The hard part is deciding what not to implement so the result stays small and boring.
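The byte arithmetic in that description is small enough to write down. This is just the padding rule, not code from the tar crate:

```rust
// Each tar header is 512 bytes; the file data after it is padded up to
// the next 512-byte boundary. Rounding up is one line of arithmetic.
fn padded_len(file_size: u64) -> u64 {
    (file_size + 511) / 512 * 512
}

fn main() {
    assert_eq!(padded_len(0), 0);      // empty file: no data blocks
    assert_eq!(padded_len(1), 512);    // 1 byte still occupies a full block
    assert_eq!(padded_len(512), 512);  // exact fit, no padding
    assert_eq!(padded_len(513), 1024); // spills into a second block
}
```

So a reader can walk an archive with nothing but "read 512 bytes of header, skip padded_len(size) bytes, repeat until two zero blocks".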

I ended up with a Rust CLI called tar-inspect that does four things — list, stat, cat, search — over plain tar and gzip-compressed tar, and whose entire dependency closure is three crates: clap, tar, flate2. The final release binary is under 2 MB, the Alpine-based Docker image is 9.5 MB, and every feature was chosen because reaching for standard tar(1) flags annoyed me one too many times.

📦 GitHub: https://github.com/sen-ltd/tar-inspect


The problem: tar's UX is a time capsule

If you want to list the contents of a tar.gz, you type:

tar -tzf archive.tar.gz

If you want to print a single file from it to stdout:

tar -Oxzf archive.tar.gz path/to/file.txt

Those flags are not mnemonics. They are a historical accident from the days when every byte of the CLI parser was carved into stone. t is "list the table of contents", z is "gzip", f is "from file", and O is "extract to stdout instead of to disk". Every single person I know looks this up. Every time.

And then when you do look it up, you run into subtle portability quirks: flag order matters on some implementations, --file can confuse the short-flag parser, and how -- stops flag parsing varies by platform. None of that is hard to learn — it's just friction that accumulates every day.

I wanted the following syntax instead:

tar-inspect list    archive.tar.gz
tar-inspect list    archive.tar.gz --filter '*.rs'
tar-inspect cat     archive.tar.gz ./README.md
tar-inspect stat    archive.tar.gz ./Cargo.toml --format json
tar-inspect search  archive.tar.gz 'src/*.rs'

Four verbs, zero one-letter flags, and --format json so that jq can enter the chat.

Design: three crates and nothing else

Rust has a crate for everything tar-shaped. The good one is tar itself, by Alex Crichton — it's been stable since forever, it handles exactly the tar format (no compression, no fancy compression detection, just headers and blocks), and its API is a straightforward iterator. Compression is a separate crate, and for tar the overwhelmingly dominant compression is gzip, which means flate2.

I was tempted by compress-tools, which wraps libarchive and can do everything — gzip, bzip2, xz, zstd, and cpio and zip while it's at it. I ruled it out for three reasons:

  1. It needs libarchive-dev at build time, which means a bigger Docker base and a C toolchain.
  2. libarchive has its own quirks and bugs — if something goes wrong, I'm debugging C via FFI.
  3. Covering every format is not the point. Covering the 90% case with 3 deps is.

Once I committed to gzip-only, I had to decide whether the missing bz2 and xz support would count as a regression compared to GNU tar. I decided it wouldn't — if someone has a .tar.xz they can pipe it:

xzcat foo.tar.xz | tar-inspect list -
bzcat foo.tar.bz2 | tar-inspect list -

The - for stdin is the tell that I meant for this to be composable. "We support exactly gzip natively and everything else through pipes" is a better pitch than "we support three more compressions you'll rarely see".

The reader: sniffing gzip from two bytes

You cannot tell whether a file is a tar or a tar.gz by looking at its extension. People rename things. People receive files over Slack with no extension at all. Real file(1) does it by magic bytes — and so can we, with two of them.

Gzip starts with 1f 8b. Plain tar has no magic at offset zero — the first bytes of a header are just the entry's name, ASCII (the "ustar" magic sits at offset 257, far deeper than we need to peek). So a two-byte peek is enough:

pub fn detect_from_magic(head: &[u8]) -> Compression {
    if head.len() >= 2 && head[0] == 0x1f && head[1] == 0x8b {
        Compression::Gzip
    } else {
        Compression::None
    }
}

Separating detection into a pure function means tests don't need temp files โ€” they just hand it literal slices. Good little win.
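As an illustration, here is what those slice-only tests look like, with the enum and function re-declared so the snippet stands alone (Compression derives PartialEq purely for the asserts):

```rust
#[derive(Debug, PartialEq)]
pub enum Compression { Gzip, None }

pub fn detect_from_magic(head: &[u8]) -> Compression {
    if head.len() >= 2 && head[0] == 0x1f && head[1] == 0x8b {
        Compression::Gzip
    } else {
        Compression::None
    }
}

fn main() {
    // The two-byte gzip magic.
    assert_eq!(detect_from_magic(&[0x1f, 0x8b]), Compression::Gzip);
    // Plain tar starts with a filename, not a magic.
    assert_eq!(detect_from_magic(b"hello.txt"), Compression::None);
    // A short read (e.g. an empty input) falls through to None.
    assert_eq!(detect_from_magic(&[0x1f]), Compression::None);
}
```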

The full open() for files uses that:

pub fn open(path: &str) -> io::Result<ArchiveReader> {
    if path == "-" {
        return open_stdin();
    }

    let mut f = File::open(Path::new(path))?;
    let mut magic = [0u8; 2];
    let n = read_fill(&mut f, &mut magic)?;
    let detected = detect_from_magic(&magic[..n]);

    // Re-open: simpler than seeking, and File::open is cheap on a cached inode.
    let f2 = File::open(Path::new(path))?;
    wrap(Box::new(BufReader::new(f2)), detected)
}

fn wrap(inner: Box<dyn Read>, c: Compression) -> io::Result<ArchiveReader> {
    match c {
        Compression::None => Ok(inner),
        Compression::Gzip => Ok(Box::new(GzDecoder::new(inner))),
    }
}

Two design choices worth flagging here:

  • Box<dyn Read>. The whole program wants one interface: "a thing I can read tar bytes from". Making the reader type erased lets list, stat, cat, and search all take the same thing regardless of compression. It's one heap allocation per invocation, which is free.
  • Re-open instead of seek. Once you've consumed two bytes from a File, you could seek(SeekFrom::Start(0)) and hand it to the gzip decoder. But re-opening is simpler, the inode is still hot, and it composes naturally with the stdin case (which can't seek at all).

Stdin gets a slightly different treatment because it's not seekable:

fn open_stdin() -> io::Result<ArchiveReader> {
    let mut stdin = io::stdin();
    let mut magic = [0u8; 2];
    let n = read_fill(&mut stdin, &mut magic)?;
    let detected = detect_from_magic(&magic[..n]);
    let head = std::io::Cursor::new(magic[..n].to_vec());
    let chained: Box<dyn Read> = Box::new(head.chain(stdin));
    wrap(chained, detected)
}

The trick is Read::chain: consume the two bytes into a Cursor, then glue the cursor in front of the rest of stdin. To the gzip decoder, it looks like one continuous stream. This is the kind of tiny standard-library gem that makes Rust's IO primitives feel good once you get used to them.
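If the chain trick is unfamiliar, this stripped-down version shows the whole move with in-memory buffers instead of stdin:

```rust
use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // Pretend this is stdin: a gzip magic followed by the rest of the data.
    let mut stream = Cursor::new(b"\x1f\x8brest of the stream".to_vec());

    // Peek the first two bytes (this consumes them from the stream).
    let mut magic = [0u8; 2];
    stream.read_exact(&mut magic)?;
    assert_eq!(magic, [0x1f, 0x8b]);

    // Glue the consumed bytes back in front of the remainder.
    let mut rejoined = Cursor::new(magic.to_vec()).chain(stream);
    let mut all = Vec::new();
    rejoined.read_to_end(&mut all)?;

    // Downstream readers see one continuous, unmodified stream.
    assert_eq!(all, b"\x1f\x8brest of the stream".to_vec());
    Ok(())
}
```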

The inspector: tar is a one-shot iterator

The tar crate exposes an archive as an iterator over entries:

let mut archive = tar::Archive::new(reader);
for entry in archive.entries()? {
    let entry = entry?;
    // `entry` contains the header + a reader positioned at the file bytes
}

What the docs undersell is that this iterator is strictly one-shot when your reader is a decompressor. Once you've advanced past an entry, you cannot go back. For a seekable plain tar you technically could seek, but tar::Archive doesn't expose that, and it has no way to know the decompressed offsets for gzip anyway. So every operation, including "give me the metadata of the one file I care about", starts from the beginning of the archive and walks forward until it finds the target or runs out.

This is fine. Archives are usually small, reads are sequential, and the alternative (building an index) would require a second crate and some state management for a feature nobody needs. Here's what stat looks like:

pub fn stat(reader: ArchiveReader, target: &str) -> io::Result<Option<EntryInfo>> {
    let mut archive = tar::Archive::new(reader);
    for entry in archive.entries()? {
        let entry = entry?;
        let info = info_from(&entry)?;
        if path_matches(&info.path, target) {
            return Ok(Some(info));
        }
    }
    Ok(None)
}

And cat, which streams rather than buffers:

pub fn cat<W: Write>(
    reader: ArchiveReader,
    target: &str,
    max_bytes: u64,
    out: &mut W,
) -> io::Result<Option<u64>> {
    let mut archive = tar::Archive::new(reader);
    for entry in archive.entries()? {
        let mut entry = entry?;
        let path = entry.path()?.to_string_lossy().into_owned();
        if !path_matches(&path, target) {
            continue;
        }
        if entry.header().entry_type() == tar::EntryType::Directory {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("{} is a directory", path),
            ));
        }

        let mut buf = [0u8; 8192];
        let mut written: u64 = 0;
        loop {
            if written >= max_bytes {
                break;
            }
            let want = ((max_bytes - written).min(buf.len() as u64)) as usize;
            let n = entry.read(&mut buf[..want])?;
            if n == 0 {
                break;
            }
            out.write_all(&buf[..n])?;
            written += n as u64;
        }
        return Ok(Some(written));
    }
    Ok(None)
}

The streaming matters. If you had a 2 GB file inside the archive and implemented cat as "read into Vec<u8>, then write out", you'd happily try to allocate 2 GB and hand it to stdout in one go. With the bounded 8 KB buffer, memory usage is constant regardless of how big the target file is, and --max-size gives the user a sane default cap (1 MiB) so a cat on a giant binary blob doesn't flood their terminal for a minute.
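The bounded loop generalizes beyond tar entries; the same shape works against any Read source. A standalone sketch with in-memory buffers:

```rust
use std::io::{Cursor, Read, Write};

// Copy at most `max_bytes` from `src` to `out` through a fixed 8 KB buffer,
// returning how many bytes were written. Same loop shape as cat above.
fn copy_bounded<R: Read, W: Write>(mut src: R, out: &mut W, max_bytes: u64) -> std::io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut written: u64 = 0;
    while written < max_bytes {
        let want = ((max_bytes - written).min(buf.len() as u64)) as usize;
        let n = src.read(&mut buf[..want])?;
        if n == 0 { break; } // source exhausted before hitting the cap
        out.write_all(&buf[..n])?;
        written += n as u64;
    }
    Ok(written)
}

fn main() -> std::io::Result<()> {
    let data = vec![b'x'; 3000];
    let mut out = Vec::new();
    // Cap at 1024: memory stays at one 8 KB buffer no matter how big `data` is.
    let written = copy_bounded(Cursor::new(&data), &mut out, 1024)?;
    assert_eq!(written, 1024);
    assert_eq!(out.len(), 1024);
    Ok(())
}
```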

The "./" gotcha that every tar tool hits

There's one wart in tar's file paths that I learned about the hard way, and it's worth calling out because it'll bite you exactly once. When you build an archive with:

tar -cf archive.tar -C src .

Every entry in the archive has a leading ./. Paths are stored literally as written, so hello.txt becomes ./hello.txt. If the user then runs tar-inspect cat archive.tar hello.txt, the naive implementation misses it.

I fixed this in one place with a tiny normalization helper:

fn path_matches(stored: &str, target: &str) -> bool {
    let s = strip_leading_dot(stored);
    let t = strip_leading_dot(target);
    s == t
}

fn strip_leading_dot(p: &str) -> &str {
    p.strip_prefix("./").unwrap_or(p)
}

Both sides get normalized, so ./hello.txt matches hello.txt and vice versa. This is the same trick that GNU tar uses internally, and it's the kind of thing you can only discover by actually running the tool against a tar you built yourself five minutes ago.
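The normalization is easy to check in isolation; these asserts mirror the behavior described above:

```rust
fn strip_leading_dot(p: &str) -> &str {
    p.strip_prefix("./").unwrap_or(p)
}

fn path_matches(stored: &str, target: &str) -> bool {
    strip_leading_dot(stored) == strip_leading_dot(target)
}

fn main() {
    // The `tar -C src .` case: stored paths carry a leading "./".
    assert!(path_matches("./hello.txt", "hello.txt"));
    // And the reverse: the user types "./", the archive stored it bare.
    assert!(path_matches("hello.txt", "./hello.txt"));
    // Only the leading "./" is stripped; deeper paths still have to match.
    assert!(!path_matches("./sub/hello.txt", "hello.txt"));
}
```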

The glob matcher: I wrote it instead of pulling in another crate

search and --filter need glob matching. The obvious answer is the glob crate, but that would be dependency number four, and the matching I actually need is roughly ten lines:

pub fn matches(pattern: &str, text: &str) -> bool {
    let pat = pattern.as_bytes();
    let txt = text.as_bytes();
    let mut pi = 0usize;
    let mut ti = 0usize;
    let mut star_pi: Option<usize> = None;
    let mut star_ti = 0usize;

    while ti < txt.len() {
        if pi < pat.len() && (pat[pi] == b'?' || pat[pi] == txt[ti]) {
            pi += 1; ti += 1;
        } else if pi < pat.len() && pat[pi] == b'*' {
            star_pi = Some(pi);
            star_ti = ti;
            pi += 1;
        } else if let Some(sp) = star_pi {
            pi = sp + 1;
            star_ti += 1;
            ti = star_ti;
        } else {
            return false;
        }
    }
    while pi < pat.len() && pat[pi] == b'*' {
        pi += 1;
    }
    pi == pat.len()
}

It's classic iterative backtracking: remember the most recent * position, and when a later mismatch happens, rewind the pattern to the star and advance the text by one. Worst case O(n·m), but pattern lengths are tiny and path lengths are hundreds of bytes at most, so it's irrelevant. Seven unit tests cover the *, ?, literal, empty, and combo cases.

One conscious deviation from shell globs: * matches slashes. I want src/*.rs to find src/a/b/main.rs, because that is the useful semantic for "find Rust files in src". If you need finer control, pass a more specific pattern.
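The slash-crossing semantic is easy to demonstrate with a standalone copy of the matcher:

```rust
pub fn matches(pattern: &str, text: &str) -> bool {
    let pat = pattern.as_bytes();
    let txt = text.as_bytes();
    let mut pi = 0usize;
    let mut ti = 0usize;
    let mut star_pi: Option<usize> = None;
    let mut star_ti = 0usize;

    while ti < txt.len() {
        if pi < pat.len() && (pat[pi] == b'?' || pat[pi] == txt[ti]) {
            pi += 1; ti += 1;
        } else if pi < pat.len() && pat[pi] == b'*' {
            star_pi = Some(pi);
            star_ti = ti;
            pi += 1;
        } else if let Some(sp) = star_pi {
            pi = sp + 1;
            star_ti += 1;
            ti = star_ti;
        } else {
            return false;
        }
    }
    while pi < pat.len() && pat[pi] == b'*' {
        pi += 1;
    }
    pi == pat.len()
}

fn main() {
    // `*` deliberately crosses directory separators.
    assert!(matches("src/*.rs", "src/a/b/main.rs"));
    assert!(matches("*.txt", "hello.txt"));
    assert!(!matches("*.rs", "README.md"));
    // `?` matches exactly one byte.
    assert!(matches("sr?/*.rs", "src/main.rs"));
}
```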

Tradeoffs, written down

  • No bz2, no xz, no zstd. Workaround with pipes. If this is a dealbreaker for your use case, you probably want a libarchive wrapper, not this.
  • No in-place modification. tar-inspect is read-only. If you want to edit an archive, use tar or rebuild it.
  • cat on a 10 GB file streams, but list, stat, and search still walk the whole archive. That's unavoidable with a streaming format — there is no index to jump to.
  • Unit tests run in-memory, integration tests build tar archives in a tempdir with the tar crate's Builder API. No binary fixtures in git, no pre-recorded blobs, nothing to go stale. Every test runs both the plain and gzip paths where relevant.

Try it in 30 seconds

git clone https://github.com/sen-ltd/tar-inspect
cd tar-inspect
docker build -t tar-inspect .

# Make a sample archive and inspect it.
mkdir -p /tmp/demo/src/sub
echo "hello world" > /tmp/demo/src/hello.txt
echo "nested"      > /tmp/demo/src/sub/nested.txt
tar -czf /tmp/demo/sample.tar.gz -C /tmp/demo/src .

docker run --rm -v /tmp/demo:/work tar-inspect list   /work/sample.tar.gz
docker run --rm -v /tmp/demo:/work tar-inspect cat    /work/sample.tar.gz ./hello.txt
docker run --rm -v /tmp/demo:/work tar-inspect search /work/sample.tar.gz '*.txt'
docker run --rm -v /tmp/demo:/work tar-inspect stat   /work/sample.tar.gz ./hello.txt --format json

Pipe --format json output through jq and you have a scriptable archive query tool in a 9.5 MB image with three dependencies. That's my kind of afternoon project.

Why bother writing it

You could argue "just memorize the tar flags", and honestly, you'd be right. But tar-inspect took one afternoon, it's a read-only inspection tool I can drop into any Dockerfile, and the act of building it taught me three concrete things I didn't previously know: the two-byte gzip magic, the Read::chain trick for prepending bytes to a stream, and the ./ path convention that every tar tool has to handle. Writing small things is a pretty reliable way to find out that a well-known format is smaller than you thought it was. tar turns out to be one of those.
