How a pure-Python jq ended up 40x faster than the C bindings

#python #performance #json #opensource

I spent yesterday building purejq, a
pure-Python implementation of jq. I expected it to be the slow-but-portable
option. Then I benchmarked it against the jq package on PyPI (the C
bindings everyone uses to run jq from Python) and got this, on a 100k-object
array, in-process:

workload	purejq	jq PyPI (C bindings)
field-access stream	9 ms	368 ms
filter + count	55 ms	442 ms
map + aggregate	18 ms	444 ms
group_by	112 ms	704 ms
transform + sort	136 ms	899 ms

Pure Python, roughly 6-40x faster than the C extension. That number looked wrong to
me too, so before publishing anything I made the benchmark script verify
every output against the actual jq binary first (tools/bench.py --verify),
re-ran everything as median-of-7, and gave the bindings their best-case API.
The gap is real. Here's why.

The serialization tax

The C bindings wrap real jq, and real jq only speaks JSON. So every call
does this:

your dicts -> JSON text -> C parser -> jq evaluates -> JSON text -> dicts

That round trip costs about 350-450 ms for 100k small objects on my
machine, before any actual filtering happens. You can see it in the numbers:
even a trivial field access pays the same ~400 ms floor as a group_by.
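
If you want to see the tax on your own machine, here is a minimal sketch. It uses only the stdlib json module as a stand-in for the marshaling step, which is an assumption: the bindings go through jq's own C parser, so the absolute numbers will differ. The data set and the helper names (median_ms, round_trip) are made up for illustration.

```python
import json
import statistics
import time

def median_ms(fn, runs=7):
    """Median wall-clock time of fn() over `runs` runs, in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Hypothetical stand-in data: 100k small objects (not the post's benchmark set)
data = [{"id": i, "team": f"t{i % 10}", "score": i % 100} for i in range(100_000)]

def round_trip():
    text = json.dumps(data)   # dicts -> JSON text
    return json.loads(text)   # JSON text -> dicts

assert round_trip() == data   # nothing is lost in the trip, only time
print(f"round trip: {median_ms(round_trip):.0f} ms (median of 7)")
```

No filtering happens anywhere in that script. Whatever it prints is pure boundary cost.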

purejq skips the trip entirely. It compiles the jq program once into Python
closures and walks your dicts and lists directly:

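import purejq

# Sample input: a plain list of dicts, never serialized to JSON text
data = [{"team": "a", "score": 1}, {"team": "b", "score": 2}, {"team": "a", "score": 3}]

# Compile once; the compiled program can be reused across inputs
prog = purejq.compile("group_by(.team) | map({team: .[0].team, n: length})")
prog.first(data)   # operates on your objects, no serialization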
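
To make "compiles into Python closures" concrete, here is a toy sketch of the idea. It is not purejq's code: compile_path and compile_pipe are hypothetical names, and the mini-language only understands ".field" paths, "length", and "|". The point is that all string handling happens once at compile time, and the returned function does nothing but dict lookups.

```python
from typing import Any, Callable

# Toy illustration only: handles ".field" paths and "length", nothing else.
# compile_path and compile_pipe are hypothetical, not purejq internals.
def compile_path(expr: str) -> Callable[[Any], Any]:
    if expr == "length":
        return len
    if expr == ".":
        return lambda value: value
    keys = expr.lstrip(".").split(".")   # parsing happens once, here

    def run(value: Any) -> Any:
        # At run time there is no parsing and no AST, only dict lookups.
        # Simplification: anything that is not a dict yields None.
        for key in keys:
            value = value.get(key) if isinstance(value, dict) else None
        return value

    return run

def compile_pipe(program: str) -> Callable[[Any], Any]:
    # Each stage is compiled up front; run() just chains the closures
    stages = [compile_path(part.strip()) for part in program.split("|")]

    def run(value: Any) -> Any:
        for stage in stages:
            value = stage(value)
        return value

    return run

count_members = compile_pipe(".team.members | length")
print(count_members({"team": {"members": ["ann", "bo", "cy"]}}))  # 3
```

Every stage here returns exactly one value, so plain function calls are enough. Real jq expressions can yield zero or many outputs, which is where generators come in.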

The lesson generalizes beyond jq: when you embed a C library that has its
own data model, the marshaling boundary is often more expensive than the
work. An interpreter written in your language gets to skip the boundary,
and that can buy back an order of magnitude.

Surprise number two: the CLI beats the jq binary on big files

This one I really didn't expect. End to end on a 93 MB file (1M objects),
parse + filter + output:

workload	purejq CLI	jq 1.8.1 binary
single lookup	0.51 s	1.68 s
filter + count	1.08 s	1.96 s
group_by	2.32 s	3.89 s

No trick here either, just arithmetic: on large files, most of the wall-clock
time goes to parsing JSON, and CPython's C-backed json module parses
at ~130 MB/s on my machine (orjson does ~220 MB/s, and purejq uses it when
installed). jq's built-in parser is slower than both. purejq's actual
filter evaluation is slower than jq's C engine, but it's sitting behind a
faster parser, and the parser dominates.
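
Taking the quoted rates at face value, parsing 93 MB works out to roughly 0.7 s with json and 0.4 s with orjson (which of the two the table used isn't stated). Here is a minimal sketch for checking parser throughput yourself. The payload is a made-up stand-in of a few MB, not the 93 MB file, and the try/except fallback is only my guess at what "uses it when installed" looks like, not purejq's actual import logic.

```python
import json
import time

try:
    import orjson  # optional third-party parser
    loads = orjson.loads
    parser_name = "orjson"
except ImportError:
    loads = json.loads
    parser_name = "json"

# Hypothetical stand-in payload (a few MB), not the post's 93 MB file
records = [{"id": i, "name": f"user{i}", "score": i % 100} for i in range(200_000)]
payload = json.dumps(records).encode("utf-8")

start = time.perf_counter()
parsed = loads(payload)
elapsed = time.perf_counter() - start

size_mb = len(payload) / 1_000_000
print(f"{parser_name}: {size_mb:.1f} MB in {elapsed:.3f} s = {size_mb / elapsed:.0f} MB/s")
```

A single run on a small payload is noisy, so treat the printed rate as a ballpark and repeat it before drawing conclusions.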

To be fair to jq: on already-parsed streams in a shell pipeline, or small
inputs in a tight loop, the C binary still wins comfortably. If that's
your workload, keep using jq.

What keeps the Python side from being embarrassing

A few things mattered more than I expected:

Compile once, run many. Programs become nested Python closures; evaluation never touches the AST again.
Static binding. If a program never redefines select, the call is resolved at compile time instead of walking scopes at runtime.
Single-output fast paths. Things like .score * 2 + 1 provably yield exactly one value, so they compile to plain function calls instead of generators. Object literals with constant keys skip the generator product entirely.
Let C do the sorting. When sort keys are uniformly strings or numbers, sort_by/group_by/unique fall through to Python's native sort instead of a comparison callback. That one change was worth 5x on sort-heavy workloads.
PyPy for free. Pure Python means PyPy just works: another 2-9x on heavy workloads (map+aggregate drops from 18 ms to 2 ms).
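
The sorting point is easy to reproduce in isolation. jq defines an ordering across all JSON types, so a general implementation needs a comparison function; when every key is the same primitive type, the keys can be handed straight to the built-in sort. The sketch below is not purejq's code (compare, timed, and the data are hypothetical), and the speedup you see will depend on your machine and data.

```python
import functools
import random
import time

random.seed(0)
rows = [{"name": f"user{i}", "score": random.random()} for i in range(200_000)]

def compare(a, b):
    """Generic three-way comparison callback, invoked from Python per pair."""
    ka, kb = a["score"], b["score"]
    return (ka > kb) - (ka < kb)

def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Slow path: every comparison re-enters the Python callback
slow, t_slow = timed(lambda: sorted(rows, key=functools.cmp_to_key(compare)))

# Fast path: keys are uniformly floats, so extract each key once and let
# the built-in sort compare them natively
fast, t_fast = timed(lambda: sorted(rows, key=lambda r: r["score"]))

assert slow == fast   # both sorts are stable, so the order is identical
print(f"callback: {t_slow:.3f} s, native keys: {t_fast:.3f} s")
```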

Trust, but verify

Claiming "it's jq" is easy, so the repo vendors jq's official test suite and
runs it in CI on CPython 3.9-3.14 and PyPy. 751 of 781 cases pass (96.2%),
and the 30 failures are listed in a file with reasons: no module system
yet, integers stay arbitrary-precision instead of rounding to doubles, and
a few error-message wordings.
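
The integer difference is visible with the stdlib alone. This sketch shows only the Python side, using float() as a stand-in for what an IEEE 754 double can represent; it does not run jq.

```python
import json

big = json.loads("9007199254740993")   # 2**53 + 1
print(type(big).__name__, big)          # int 9007199254740993
print(big == 2**53 + 1)                 # True: Python ints are exact

as_double = float(big)                  # nearest value a double can hold
print(int(as_double))                   # 9007199254740992: off by one
```

Any engine that stores numbers as doubles can't tell those two integers apart, so an implementation that keeps Python ints will disagree with it on cases like this.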

One more disclosure, since the commit history shows it anyway: I'm a
product manager, not a programmer, and I built this with Claude in a day.
My role was picking the target, insisting on jq's own test suite as the
acceptance bar, and being suspicious of every benchmark number until it
had a verification path. I can't review the code line by line; I can read
a conformance percentage and a --verify output. Make of that what you
will.