Highpass Studio

Your logs are still a text file

Every incident investigation starts the same way:

zgrep "user_id=51013" logs/*.gz

...and you wait.

30 seconds. A minute.

You tweak the query. Run it again. Another minute.

Same files. Same decompression. Same full scan.

After ten queries, you've spent ten minutes rereading the same data.

What if grep could remember?

I built xgrep for that.

# One-time: build index (~2 min for 1.7GB)
xgrep --build-index logs/*.gz

# Every query after that
xgrep "user_id=51013" logs/*.gz    # 25ms
xgrep "ERROR" logs/*.gz            # 25ms
xgrep "timeout.*conn" logs/*.gz    # 25ms

Instead of decompressing everything every time, xgrep:

  • splits logs into 64KB blocks
  • builds a bloom filter per block
  • only reads blocks that might match

Everything else is skipped.

The result: read 1% of the data instead of 100%.

That's the whole idea.
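The block-skip idea fits in a few lines. Below is a minimal Rust sketch of a per-block bloom filter; it is illustrative only, and xgrep's real filter size, hash scheme, and tokenization almost certainly differ:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 8192;

/// One bloom filter per block: every token seen in the block is inserted.
struct BlockFilter {
    bits: [u64; FILTER_BITS / 64],
}

impl BlockFilter {
    fn new() -> Self {
        Self { bits: [0; FILTER_BITS / 64] }
    }

    // Derive three bit positions per token by seeding the hasher differently.
    fn positions(token: &str) -> [usize; 3] {
        let mut out = [0usize; 3];
        for (i, slot) in out.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (i as u64).hash(&mut h);
            token.hash(&mut h);
            *slot = (h.finish() as usize) % FILTER_BITS;
        }
        out
    }

    fn insert(&mut self, token: &str) {
        for p in Self::positions(token) {
            self.bits[p / 64] |= 1 << (p % 64);
        }
    }

    /// false => token is definitely absent, so the block is skipped unread.
    /// true  => token *might* be present, so the block must be scanned.
    fn might_contain(&self, token: &str) -> bool {
        Self::positions(token)
            .iter()
            .all(|&p| self.bits[p / 64] & (1 << (p % 64)) != 0)
    }
}

fn main() {
    let mut f = BlockFilter::new();
    for tok in ["ERROR", "user_id=51013", "timeout"] {
        f.insert(tok);
    }
    assert!(f.might_contain("ERROR"));
    // A token never inserted is almost certainly reported absent:
    assert!(!f.might_contain("user_id=99999"));
    println!("ok");
}
```

The filter can say "definitely not here" but never "definitely here", which is exactly the right shape for pruning: a false positive only costs one wasted block scan, never a missed match.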

Benchmarks on real production logs

Datasets from LogHub: Hadoop (HDFS), Blue Gene/L (BGL), and Spark.

HDFS — 7.5GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Block ID | 30ms | 27s | 913x |
| WARN | 28ms | 25s | 907x |
| INFO (very common) | 23ms | 28s | 1,217x |

BGL — 5.0GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Node ID | 26ms | 17s | 655x |
| FATAL | 25ms | 17s | 708x |

Spark — 3,852 files

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Executor | 5s | 10m | 118x |
| ERROR | 2.7s | 10m | 220x |

These are repeated-query (cached) results.

First query is still ~18x faster than zgrep (parallel decompression), but the real win is every query after that — which is how incident debugging actually works.


JSON logs

zcat logs.json.gz | jq 'select(.user_id == "42042")' is the standard workflow. It works. It's also brutally slow — full decompression, full JSON parse, zero skipping, on every query.

xgrep's -j flag does field-aware search on NDJSON/JSONL logs.

Benchmark

1M NDJSON lines, 244MB uncompressed, 22MB gzip.

| Query | Matches | Baseline | xgrep -j | Speedup | Block skip |
|---|---|---|---|---|---|
| user_id=42042 | 9 | 40.6s | 0.22s | 188x | 97% |
| status=503 | 111,130 | 40.6s | 1.75s | 23x | 0% |
| level=error status=503 | 15,838 | 40.6s | 1.71s | 24x | 0% |

Baseline: zcat logs.json.gz | jq 'select(...)'

Every count matches jq exactly. 9/9, 111,130/111,130, 15,838/15,838. No approximations, no missed lines.

How it works

During index build, xgrep hashes three things per JSON field into each block's bloom filter: the field name, the value, and the field-value pair. When you query user_id=42042, the bloom can distinguish "42042 appears in the user_id field" from "42042 appears somewhere in the line." That precision is what drives the skip rate.
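The triple-token scheme described above is easy to picture. A hedged sketch of what gets derived from one JSON field (not xgrep's actual code; the real tokenizer surely handles nesting, escapes, and types):

```rust
use std::collections::HashSet;

/// For a field like "user_id": "42042", produce the three tokens the
/// index hashes into the block's bloom filter: the field name, the
/// bare value, and the name=value pair.
fn field_tokens(name: &str, value: &str) -> [String; 3] {
    [
        name.to_string(),
        value.to_string(),
        format!("{name}={value}"),
    ]
}

fn main() {
    let toks: HashSet<String> = field_tokens("user_id", "42042").into_iter().collect();
    // A query `user_id=42042` probes the pair token, so a block where
    // 42042 only appears in some *other* field is still skippable.
    assert!(toks.contains("user_id=42042"));
    assert!(toks.contains("user_id"));
    assert!(toks.contains("42042"));
    println!("ok");
}
```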


Why it's still fast at 0% skip

The selective query (188x) is the classic block-pruning win — 97% of blocks never get read. But the broad queries are the interesting result. At 0% skip, every block is searched, and xgrep is still 23x faster than zcat | jq. That's because jq parses every line into a full JSON AST and evaluates an expression tree. xgrep does a targeted field lookup — no AST, no expression evaluator, just hash check then verify.

Two advantages compound: I/O avoidance (skip blocks) and CPU avoidance (lighter evaluation). Even when the first one doesn't apply, the second one still delivers.
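To make the CPU-side difference concrete, here is a sketch of a targeted field lookup versus a full parse. This is not xgrep's code: it assumes compact `"key":value` serialization and ignores string escapes, which a real matcher must handle.

```rust
/// Check whether `line` contains `"key":want` without building a JSON
/// AST: find the key's byte pattern, then compare the raw value slice.
/// Illustrative only; assumes compact NDJSON and a non-string value.
fn field_eq_number(line: &str, key: &str, want: &str) -> bool {
    let needle = format!("\"{key}\":");
    match line.find(&needle) {
        Some(i) => {
            let rest = &line[i + needle.len()..];
            // Value ends at the next comma or closing brace.
            let end = rest
                .find(|c: char| c == ',' || c == '}')
                .unwrap_or(rest.len());
            rest[..end].trim() == want
        }
        None => false,
    }
}

fn main() {
    let line = r#"{"ts":"2024-01-01T00:00:00Z","level":"error","status":503}"#;
    assert!(field_eq_number(line, "status", "503"));
    assert!(!field_eq_number(line, "status", "200"));
    println!("ok");
}
```

A `jq` filter deserializes the entire line into values and walks an expression tree before it can answer the same question; the substring-then-verify shape above is where the 23x at 0% skip plausibly comes from.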

How it works (short version)

  1. Index: decompress once, split into blocks, build bloom filters
  2. Query: check filters, read only candidate blocks
  3. Execution: memory-mapped, OS loads only what's needed

The key metric isn't speed. It's bytes touched per query.

  • zgrep: 100% every time
  • xgrep: 0.1-1%

That's why the gap grows with data size.

Tradeoffs (honest)

  • Cache size: ~5x compressed size (stores decompressed data)
  • First run: ~2 min index build (amortized quickly)
  • Not universal grep: built for compressed logs + repeated search
  • For plain text: use ripgrep.

Who this is for

If you've ever:

  • waited on zgrep during an incident
  • rerun the same search 10 times
  • dealt with rotated .gz logs
  • wanted log-platform speed without log-platform overhead

...then xgrep was built for you.

Try it

cargo install xgrep-cli
xgrep "ERROR" logs/*.gz

github.com/HighpassStudio/xgrep

Deep dive

Architecture + benchmark methodology: ARCHITECTURE.md


xgrep is Apache-2.0 licensed. Built with Rust, rayon, memchr, and flate2.
