Highpass Studio

Your logs are still a text file

Every incident investigation starts the same way:

zgrep "user_id=51013" logs/*.gz

...and you wait.

30 seconds. A minute.

You tweak the query. Run it again. Another minute.

Same files. Same decompression. Same full scan.

After ten queries, you've spent ten minutes rereading the same data.

What if grep could remember?

I built xgrep for that.

# One-time: build index (~2 min for 1.7GB)
xgrep --build-index logs/*.gz

# Every query after that
xgrep "user_id=51013" logs/*.gz    # 25ms
xgrep "ERROR" logs/*.gz            # 25ms
xgrep "timeout.*conn" logs/*.gz    # 25ms

Instead of decompressing everything every time, xgrep:

  • splits logs into 64KB blocks
  • builds a bloom filter per block
  • only reads blocks that might match

Everything else is skipped.

The result: read 1% of the data instead of 100%.

That's the whole idea.
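The block-skip idea fits in a few lines. Below is a minimal Rust sketch of a per-block bloom filter; it is illustrative only, and xgrep's real filter size, hash scheme, and tokenization almost certainly differ:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 8192;

/// One bloom filter per block: every token seen in the block is inserted.
struct BlockFilter {
    bits: [u64; FILTER_BITS / 64],
}

impl BlockFilter {
    fn new() -> Self {
        Self { bits: [0; FILTER_BITS / 64] }
    }

    // Derive three bit positions per token by seeding the hasher differently.
    fn positions(token: &str) -> [usize; 3] {
        let mut out = [0usize; 3];
        for (i, slot) in out.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h);
            token.hash(&mut h);
            *slot = (h.finish() as usize) % FILTER_BITS;
        }
        out
    }

    fn insert(&mut self, token: &str) {
        for p in Self::positions(token) {
            self.bits[p / 64] |= 1 << (p % 64);
        }
    }

    /// false => token is definitely absent, so the block is skipped unread.
    /// true  => token *might* be present, so the block must be scanned.
    fn might_contain(&self, token: &str) -> bool {
        Self::positions(token)
            .iter()
            .all(|&p| self.bits[p / 64] & (1 << (p % 64)) != 0)
    }
}

fn main() {
    let mut f = BlockFilter::new();
    for tok in ["ERROR", "user_id=51013", "timeout"] {
        f.insert(tok);
    }
    assert!(f.might_contain("ERROR"));
    // A token never inserted is almost certainly reported absent:
    assert!(!f.might_contain("user_id=99999"));
    println!("ok");
}
```

The filter can say "definitely not here" but never "definitely here", which is exactly the right shape for pruning: a false positive only costs one wasted block scan, never a missed match.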

Benchmarks on real production logs

Datasets from LogHub: Hadoop (HDFS), Blue Gene/L (BGL), and Spark.

HDFS — 7.5GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Block ID | 30ms | 27s | 913x |
| WARN | 28ms | 25s | 907x |
| INFO (very common) | 23ms | 28s | 1,217x |

BGL — 5.0GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Node ID | 26ms | 17s | 655x |
| FATAL | 25ms | 17s | 708x |

Spark — 3,852 files

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Executor | 5s | 10m | 118x |
| ERROR | 2.7s | 10m | 220x |

These are repeated-query (cached) results.

First query is still ~18x faster than zgrep (parallel decompression), but the real win is every query after that — which is how incident debugging actually works.


JSON logs

zcat logs.json.gz | jq 'select(.user_id == "42042")' is the standard workflow. It works. It's also brutally slow — full decompression, full JSON parse, zero skipping, on every query.

xgrep's -j flag does field-aware search on NDJSON/JSONL logs.

Benchmark

1M NDJSON lines, 244MB uncompressed, 22MB gzip.

| Query | Matches | Baseline | xgrep -j | Speedup | Block skip |
|---|---|---|---|---|---|
| user_id=42042 | 9 | 40.6s | 0.22s | 188x | 97% |
| status=503 | 111,130 | 40.6s | 1.75s | 23x | 0% |
| level=error status=503 | 15,838 | 40.6s | 1.71s | 24x | 0% |

Baseline: zcat logs.json.gz | jq 'select(...)'

Every count matches jq exactly. 9/9, 111,130/111,130, 15,838/15,838. No approximations, no missed lines.

How it works

During index build, xgrep hashes three things per JSON field into each block's bloom filter: the field name, the value, and the field-value pair. When you query user_id=42042, the bloom can distinguish "42042 appears in the user_id field" from "42042 appears somewhere in the line." That precision is what drives the skip rate.
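The triple-token scheme described above is easy to picture. A hedged sketch of what gets derived from one JSON field (not xgrep's actual code; the real tokenizer surely handles nesting, escapes, and types):

```rust
use std::collections::HashSet;

/// For a field like "user_id": "42042", produce the three tokens the
/// index hashes into the block's bloom filter: the field name, the
/// bare value, and the name=value pair.
fn field_tokens(name: &str, value: &str) -> [String; 3] {
    [
        name.to_string(),
        value.to_string(),
        format!("{name}={value}"),
    ]
}

fn main() {
    let toks: HashSet<String> = field_tokens("user_id", "42042").into_iter().collect();
    // A query `user_id=42042` probes the pair token, so a block where
    // 42042 only appears in some *other* field is still skippable.
    assert!(toks.contains("user_id=42042"));
    assert!(toks.contains("user_id"));
    assert!(toks.contains("42042"));
    println!("ok");
}
```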


Why it's still fast at 0% skip

The selective query (188x) is the classic block-pruning win — 97% of blocks never get read. But the broad queries are the interesting result. At 0% skip, every block is searched, and xgrep is still 23x faster than zcat | jq. That's because jq parses every line into a full JSON AST and evaluates an expression tree. xgrep does a targeted field lookup — no AST, no expression evaluator, just hash check then verify.

Two advantages compound: I/O avoidance (skip blocks) and CPU avoidance (lighter evaluation). Even when the first one doesn't apply, the second one still delivers.
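To make the CPU-side difference concrete, here is a sketch of a targeted field lookup versus a full parse. This is not xgrep's code: it assumes compact `"key":value` serialization and ignores string escapes, which a real matcher must handle.

```rust
/// Check whether `line` contains `"key":want` without building a JSON
/// AST: find the key's byte pattern, then compare the raw value slice.
/// Illustrative only; assumes compact NDJSON and a non-string value.
fn field_eq_number(line: &str, key: &str, want: &str) -> bool {
    let needle = format!("\"{key}\":");
    match line.find(&needle) {
        Some(i) => {
            let rest = &line[i + needle.len()..];
            // Value ends at the next comma or closing brace.
            let end = rest
                .find(|c: char| c == ',' || c == '}')
                .unwrap_or(rest.len());
            rest[..end].trim() == want
        }
        None => false,
    }
}

fn main() {
    let line = r#"{"ts":"2024-01-01T00:00:00Z","level":"error","status":503}"#;
    assert!(field_eq_number(line, "status", "503"));
    assert!(!field_eq_number(line, "status", "200"));
    println!("ok");
}
```

A `jq` filter deserializes the entire line into values and walks an expression tree before it can answer the same question; the substring-then-verify shape above is where the 23x at 0% skip plausibly comes from.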

How it works (short version)

  1. Index: decompress once, split into blocks, build bloom filters
  2. Query: check filters, read only candidate blocks
  3. Execution: memory-mapped, OS loads only what's needed

The key metric isn't speed. It's bytes touched per query.

  • zgrep: 100% every time
  • xgrep: 0.1-1%

That's why the gap grows with data size.

Tradeoffs (honest)

  • Cache size: ~5x compressed size (stores decompressed data)
  • First run: ~2 min index build (amortized quickly)
  • Not universal grep: built for compressed logs + repeated search
  • For plain text: use ripgrep.

Who this is for

If you've ever:

  • waited on zgrep during an incident
  • rerun the same search 10 times
  • dealt with rotated .gz logs
  • wanted log-platform speed without log-platform overhead

...then xgrep was built for you.

Try it

cargo install xgrep-cli
xgrep "ERROR" logs/*.gz

github.com/HighpassStudio/xgrep

Deep dive

Architecture + benchmark methodology: ARCHITECTURE.md


xgrep is Apache-2.0 licensed. Built with Rust, rayon, memchr, and flate2.
