Your logs are still a text file
Every incident investigation starts the same way:
```shell
zgrep "user_id=51013" logs/*.gz
```
...and you wait.
30 seconds. A minute.
You tweak the query. Run it again. Another minute.
Same files. Same decompression. Same full scan.
After ten queries, you've spent ten minutes rereading the same data.
What if grep could remember?
I built xgrep for that.
```shell
# One-time: build index (~2 min for 1.7GB)
xgrep --build-index logs/*.gz

# Every query after that
xgrep "user_id=51013" logs/*.gz    # 25ms
xgrep "ERROR" logs/*.gz            # 25ms
xgrep "timeout.*conn" logs/*.gz    # 25ms
```
Instead of decompressing everything every time, xgrep:
- splits logs into 64KB blocks
- builds a bloom filter per block
- only reads blocks that might match
Everything else is skipped.
The result: read 1% of the data instead of 100%.
That's the whole idea.
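The skip test above can be sketched in Rust. This is a toy version: the filter size, the byte-trigram keys, and `DefaultHasher` are all illustrative assumptions, not xgrep's actual parameters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 1 << 20; // illustrative size, not xgrep's real parameter

/// Toy per-block bloom filter over byte trigrams.
struct BlockFilter {
    bits: Vec<u64>,
}

impl BlockFilter {
    /// Built once at index time: set a bit for every trigram in the block.
    fn build(block: &[u8]) -> Self {
        let mut bits = vec![0u64; FILTER_BITS / 64];
        for gram in block.windows(3) {
            let i = Self::bit_index(gram);
            bits[i / 64] |= 1 << (i % 64);
        }
        BlockFilter { bits }
    }

    fn bit_index(bytes: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        bytes.hash(&mut h);
        (h.finish() as usize) % FILTER_BITS
    }

    /// `false` = the pattern definitely isn't in this block (skip it);
    /// `true`  = it *might* be (decompress and search for real).
    fn might_contain(&self, pattern: &[u8]) -> bool {
        pattern.windows(3).all(|gram| {
            let i = Self::bit_index(gram);
            self.bits[i / 64] & (1 << (i % 64)) != 0
        })
    }
}

fn main() {
    let block = b"2024-01-01 INFO user_id=51013 login ok";
    let filter = BlockFilter::build(block);
    assert!(filter.might_contain(b"user_id=51013")); // maybe-match: read the block
    assert!(!filter.might_contain(b"wxqzv_totally_absent")); // definite miss: skip
}
```

A bloom filter can say "maybe" when the answer is no, but never "no" when the answer is yes, so skipping on `false` never drops a match.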
## Benchmarks on real production logs
Datasets from LogHub: Hadoop (HDFS), Blue Gene/L (BGL), and Spark.
### HDFS — 7.5GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Block ID | 30ms | 27s | 913x |
| WARN | 28ms | 25s | 907x |
| INFO (very common) | 23ms | 28s | 1,217x |
### BGL — 5.0GB decompressed

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Node ID | 26ms | 17s | 655x |
| FATAL | 25ms | 17s | 708x |
### Spark — 3,852 files

| Query | xgrep | zgrep | Speedup |
|---|---|---|---|
| Executor | 5s | 10m | 118x |
| ERROR | 2.7s | 10m | 220x |
These are repeated-query (cached) results.
First query is still ~18x faster than zgrep (parallel decompression), but the real win is every query after that — which is how incident debugging actually works.
## JSON logs
`zcat logs.json.gz | jq 'select(.user_id == "42042")'` is the standard workflow. It works. It's also brutally slow — full decompression, a full JSON parse, zero skipping, on every query.

xgrep's `-j` flag does field-aware search on NDJSON/JSONL logs.
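Presumably invoked along these lines — the query strings below are taken from the benchmark table, but the exact `-j` syntax is an assumption:

```shell
# field-aware queries over NDJSON (query syntax inferred from the benchmarks)
xgrep -j 'user_id=42042' logs.json.gz
xgrep -j 'level=error status=503' logs.json.gz
```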
### Benchmark
1M NDJSON lines, 244MB uncompressed, 22MB gzip.
| Query | Matches | Baseline | xgrep -j | Speedup | Block skip |
|---|---|---|---|---|---|
| user_id=42042 | 9 | 40.6s | 0.22s | 188x | 97% |
| status=503 | 111,130 | 40.6s | 1.75s | 23x | 0% |
| level=error status=503 | 15,838 | 40.6s | 1.71s | 24x | 0% |
Baseline: `zcat logs.json.gz | jq 'select(...)'`
Every count matches jq exactly. 9/9, 111,130/111,130, 15,838/15,838. No approximations, no missed lines.
### How it works
During index build, xgrep hashes three things per JSON field into each block's bloom filter: the field name, the value, and the field-value pair. When you query user_id=42042, the bloom can distinguish "42042 appears in the user_id field" from "42042 appears somewhere in the line." That precision is what drives the skip rate.
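In code, the three keys per field might look like this. A sketch only: the key encoding is assumed, and a `HashSet` stands in for the real bloom-filter bit array.

```rust
use std::collections::HashSet;

/// Idealized "bloom filter": a HashSet standing in for the real bit array.
/// For each JSON field, insert three keys: the name, the value, and the pair.
fn insert_field(bloom: &mut HashSet<String>, field: &str, value: &str) {
    bloom.insert(field.to_string());          // "user_id"
    bloom.insert(value.to_string());          // "42042"
    bloom.insert(format!("{field}={value}")); // "user_id=42042"
}

fn main() {
    // A block containing {"user_id":"42042","msg":"checkout"}:
    let mut bloom = HashSet::new();
    insert_field(&mut bloom, "user_id", "42042");
    insert_field(&mut bloom, "msg", "checkout");

    // A block where 42042 appears only in an unrelated field:
    let mut other = HashSet::new();
    insert_field(&mut other, "order_id", "42042");

    // Querying the *pair* key distinguishes the two cases:
    assert!(bloom.contains("user_id=42042"));  // field-level hit: read block
    assert!(!other.contains("user_id=42042")); // substring-only: block skipped
}
```

Without the pair key, any block containing the bare string `42042` anywhere would survive pruning; with it, only blocks where the value sits in the right field do.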
### Why it's still fast at 0% skip
The selective query (188x) is the classic block-pruning win — 97% of blocks never get read. But the broad queries are the interesting result. At 0% skip, every block is searched, and xgrep is still 23x faster than zcat | jq. That's because jq parses every line into a full JSON AST and evaluates an expression tree. xgrep does a targeted field lookup — no AST, no expression evaluator, just hash check then verify.
Two advantages compound: I/O avoidance (skip blocks) and CPU avoidance (lighter evaluation). Even when the first one doesn't apply, the second one still delivers.
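That lighter per-line check can be sketched as a direct substring probe. Assumed details: compact serialization with no spaces after colons, and no escape handling — the real verifier is surely more careful.

```rust
/// Check a candidate NDJSON line for field=value without building a JSON AST:
/// search for the serialized pair directly, covering quoted and bare values.
/// A sketch only — it assumes compact `"k":"v"` output and ignores escaping.
fn line_matches(line: &str, field: &str, value: &str) -> bool {
    let quoted = format!("\"{field}\":\"{value}\"");
    let bare = format!("\"{field}\":{value}"); // numeric / unquoted values
    line.contains(&quoted) || line.contains(&bare)
}

fn main() {
    let line = r#"{"level":"error","status":503,"user_id":"42042"}"#;
    assert!(line_matches(line, "user_id", "42042"));
    assert!(line_matches(line, "status", "503"));
    assert!(!line_matches(line, "order_id", "42042"));
}
```

No allocation per input line beyond the two needles, no tokenizer, no expression tree — which is where the 23x at 0% skip plausibly comes from.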
## How it works (short version)
- Index: decompress once, split into blocks, build bloom filters
- Query: check filters, read only candidate blocks
- Execution: memory-mapped, OS loads only what's needed
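The three steps above, in miniature. The filter here is an idealized trigram set rather than a real bloom filter, and the memory map is elided — a plain byte slice stands in for the mapped region.

```rust
use std::collections::HashSet;
use std::ops::Range;

const BLOCK_SIZE: usize = 64 * 1024; // 64KB blocks, as in the post

/// Idealized per-block filter: a set of byte trigrams (a real bloom filter
/// is a fixed-size bit array with the same maybe/never semantics).
struct Block {
    range: Range<usize>,
    trigrams: HashSet<[u8; 3]>,
}

/// Index: split into blocks, build one filter per block.
fn index(data: &[u8]) -> Vec<Block> {
    data.chunks(BLOCK_SIZE)
        .enumerate()
        .map(|(i, chunk)| Block {
            range: i * BLOCK_SIZE..i * BLOCK_SIZE + chunk.len(),
            trigrams: chunk.windows(3).map(|w| [w[0], w[1], w[2]]).collect(),
        })
        .collect()
}

/// Query: check filters, return only candidate blocks; the rest are never read.
fn query<'a>(blocks: &[Block], data: &'a [u8], pattern: &[u8]) -> Vec<&'a [u8]> {
    blocks
        .iter()
        .filter(|b| {
            pattern
                .windows(3)
                .all(|w| b.trigrams.contains(&[w[0], w[1], w[2]]))
        })
        .map(|b| &data[b.range.clone()])
        .collect()
}

fn main() {
    let data = b"ERROR timeout on conn 7".as_slice();
    let blocks = index(data);
    let candidates = query(&blocks, data, b"timeout");
    assert_eq!(candidates.len(), 1); // one small block, and it's a candidate
}
```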
The key metric isn't speed. It's bytes touched per query.
- zgrep: 100% every time
- xgrep: 0.1-1%
That's why the gap grows with data size.
## Tradeoffs (honest)
- Cache size: ~5x compressed size (stores decompressed data)
- First run: ~2 min index build (amortized quickly)
- Not universal grep: built for compressed logs + repeated search
- For plain text: use ripgrep.
## Who this is for

If you've ever:

- waited on `zgrep` during an incident
- rerun the same search 10 times
- dealt with rotated `.gz` logs
- wanted log-platform speed without log-platform overhead
## Try it

```shell
cargo install xgrep-cli
xgrep "ERROR" logs/*.gz
```
github.com/HighpassStudio/xgrep
## Deep dive
Architecture + benchmark methodology: ARCHITECTURE.md
xgrep is Apache-2.0 licensed. Built with Rust, rayon, memchr, and flate2.