This is a submission for the GitHub Finish-Up-A-Thon Challenge
What I Built
pq — jq for Parquet. A 50 MB Rust single binary that wraps DuckDB's query engine in a jq-style expression DSL, optimized for terminal one-liners and unix pipes.
$ pq sales.parquet 'group_by .country | sum .revenue | top 3 by sum_revenue'
┌─────────┬─────────────┐
│ country ┆ sum_revenue │
╞═════════╪═════════════╡
│ US ┆ 19065.00 │
│ FR ┆ 999.99 │
│ DE ┆ 312.00 │
└─────────┴─────────────┘
Where it started. I work in adtech. I look at parquet files dozens of times a day — campaign deliveries, partner exports, audience snapshots. Every existing option was painful:
| Tool | Pain |
|---|---|
pyarrow / pandas
|
5-second cold start, 200 MB virtualenv |
parquet-tools |
JVM, slow, no query support |
pqrs |
Inspector only — can't filter or project |
duckdb CLI |
Great engine, but SELECT email FROM 'file.parquet' WHERE country='US' is too verbose to type 50 times a day |
| Spark | Are you serious |
pq is the tool I actually want — single binary, no JVM, no Python, jq-style syntax for piping into the rest of the unix toolbox. It's been my default cat for parquet since v0.5.
Demo
- Repo: github.com/thehwang/parq
- Latest release: v0.14.0 (this submission)
-
Install:
brew install thehwang/parq/pq - Tutorial: doc/tutorial.md — 30-minute hands-on walkthrough
A taste of what shipped in v0.14:
# Streaming JSON output (was the only buffered format until v0.14)
$ pq big.parquet '.id, .country' -o json | head -c 200
# returns instantly even on a 40 GB file
# Schema-drift gate for CI
$ pq diff baseline.parquet candidate.parquet
# Schema diff
- a: `baseline.parquet`
- b: `candidate.parquet`
## Added (1)
| column | type | nullable |
|-----------|---------|----------|
| `country` | VARCHAR | yes |
$ echo $?
1 # exits non-zero on drift, slots into CI without scripting
And the new TUI Explain panel — press capital E for EXPLAIN ANALYZE, get row-group pruning per scan (this is exactly the panel you see on the cover image at the top of this post):
Explain · ANALYZE 22.0 ms
1 scan(s) • 833.3k actual rows • 1 filter(s) pushed • 2 projection(s)
✓ predicate pushdown: country='US'
✓ projection pushdown: 2 col(s) user_id, country
● actual 833.3k rows (estimated ~833.3k)
● pruned: 83% (833.3k/5.0M rows) ← new in v0.14, color-coded gauge
Color cues: green ≥ 50% pruned, gold any > 0%, dim 0%. The dim case fires a heuristic: "filter country = 'US' didn't prune any row groups — column may lack min/max stats (common for STRING from older Spark writers)". That's the kind of hint I used to need a DuckDB profile + a calculator to compute.
The Comeback Story
Three TODOs, three weeks of "almost done"
pq had hit v0.13 in May — solid big-file support (streaming output, Ctrl-C interrupt, metadata-only count --lite / stats --lite, async TUI preview, stderr spinner). Then three v0.14 issues sat in my GitHub project board for three weeks, slowly accreting // TODO comments in my notes:
-
#2: streaming JSON output — the one output format that still buffered the entire result into a
Vecbecause the writer needed to wrap it in[ … ]. - #3: row-group pruning ratio in the Explain panel — the most-requested observability feature ("did my filter actually help?").
-
#4:
pq diff— schema-drift detection I'd been wanting myself for a CI gate at work.
Each was "small but tedious" finishing work — the kind that's easy to put off when there's a more interesting feature to start. Classic 90% / 90% problem.
Then the hashtag showed up
GitHub announced the Finish-Up-A-Thon. Hashtag-shaming worked. Over the weekend I cleared the whole milestone:
-
PR #6 (commit) closes #2: streaming JSON via a hand-written incremental array writer (
[, then row, then,\n+ row, then]). Memory stays flat,head -c 200returns instantly. -
PR #9 (commit) closes #3: row-group pruning extracted from DuckDB's JSON profile, merged with
parquet_file_metadata(...)for the file's true row count. This one had a story — see Copilot section below. -
PR #10 (commit) closes #4:
pq diffas a new subcommand, markdown by default, JSON for tooling, exit 1 on drift. Detects added / dropped / type-changed columns, plus nullability changes (which are breaking for downstream consumers and easy to miss). -
PR #8 (commit) was an unplanned bonus: a CI infrastructure fix. The
tui smoke (vhs)job started failing mid-weekend because Ubuntu 24.04 runners stopped shippingttydin default repos. I'd have hit this on the next merge anyway; the Finish-Up-A-Thon was a forcing function to actually fix it instead of[skip ci]-ing around it. -
PR #11 (commit) closes #5 (the v0.14 tracking issue): bump
Cargo.tomlto 0.14.0, README v0.14 section, full reference manual entry (doc/reference.md§14), tutorial Lesson 6.
By the numbers:
- 5 PRs merged, 5 issues closed
- Test count: 204 → 215 (+8 unit, +3 integration)
- All CI green: macOS, Ubuntu,
tui smoke (vhs) -
v0.14.0tagged, release workflow building macOS arm64/x86_64 + Linux musl + Windows binaries + Homebrew bottle
What I'm proudest of isn't the line count — it's that the milestone is empty. No "ship it and clean up later" comments left in the code. README's "What's coming" section now has nothing in the v0.14 row to delete. That itch is gone.
Postscript: the cover image caught DuckDB lying
The cover image at the top of this post? It almost wasn't.
After tagging v0.14.0 I tried to record a custom cover showing off the new pruning gauge — a 5M-row parquet file with id < 1M, expecting a satisfying green pruned: 80% line. What VHS captured instead was a wall of JSON debris over the panels, plus this:
● pruned: 0% (25.0M/5.0M rows)
25 million scanned rows out of a 5 million row file. That's not a typo — it was wrong twice over.
Bug #1: a silent PRAGMA. The pruning code sets DuckDB's enable_profiling='json' to grab the JSON profile, then "resets" with PRAGMA disable_profiling afterward. Against DuckDB 1.10.501 that pragma is a silent no-op — accepts the call without erroring but doesn't actually flip the bit. Subsequent EXPLAIN ANALYZE calls (and the TUI runs one on every preview tick) kept returning JSON in column 1, which then bled into the rendered panel as garbage. The documented inverse enable_profiling='no_output' is what actually works. I only found this out by writing a Python probe and trying every reset spelling DuckDB's docs hinted at.
Bug #2: the wrong field. With JSON no longer leaking, the cover redrew — and the numbers still made no sense (25.0M/5.0M). Turns out operator_rows_scanned from DuckDB's JSON profile is roughly 10× a parquet scan's actual row count (likely an internal multi-pass / per-thread accumulator). The correct field for "rows out of scan after pushdown" is operator_cardinality. Same query against the same file: cardinality 1.0M, total 5.0M, ratio 0.8 — the green 80% I was expecting in the first place.
Both went out as v0.14.1 (PR #14, closes #12) with a regression test that opens a real DuckDB connection, runs the round-trip, and asserts the next plain EXPLAIN returns text — not JSON. Test count tipped to 216.
The cover at the top is the after-shot. The pre-fix one is in my recycle bin.
What was meant to be three TODOs ended up as five PRs and a patch release. The challenge wanted a finish; it got a finish and a postscript. Worth it.
My Experience with GitHub Copilot
This was a tale of two halves.
The first half — Copilot at its best. PR #6 (streaming JSON) was end-to-end Copilot Chat. The codebase already had stream_ndjson and stream_csv as templates; the only thing missing was the matching stream_json with the array-bracket bookkeeping. I handed Copilot the existing functions plus the buffered print_json it was replacing, and it produced a clean refactor on the first try — opening bracket on first row, comma-newline on subsequent rows, closing bracket at EOF, plus the right error handling for the partial-write case. The commit ships with a real Co-authored-by: Copilot trailer.
That's the sweet spot: when a codebase has a clear pattern to extend, Copilot pattern-matches and writes near-perfect code. I wrote the issue, gave it 30 seconds of context, and the PR was open inside 10 minutes.
The second half — where Copilot hits its ceiling. PR #9 (pruning ratio) was the opposite. The "obvious" approach — set DuckDB's profile_output PRAGMA to write a JSON profile to a temp file, then read it back — turns out not to work for EXPLAIN ANALYZE. The PRAGMA only writes the file for top-level statements; EXPLAIN ANALYZE returns the profile inline in column 1 of its own result. There's no obvious way to know that without reading DuckDB source or doing what I ended up doing: spinning up Python and probing EXPLAIN ANALYZE against a real parquet file, dumping the row, and discovering the JSON was already there.
Copilot's first attempt for #3 wrote ~170 lines using the temp-file approach. It also:
- ran
EXPLAIN ANALYZEtwice (wasteful — the second run executes the full query again), - didn't reset profiling PRAGMAs after the call (so subsequent preview ticks got JSON-shaped output),
- assumed
Filename(s)was comma-split (it isn't — DuckDB returns the literal glob string), - merged JSON scans into text scans by index without checking the lengths matched.
I caught most of this in a 7-bug code review with Copilot, then ran out of credits before we could iterate on the architectural changes. With the WIP unmerged and Copilot Chat unavailable, I fell back to Cursor (Claude) and rewrote the feature from scratch — same goal, correct architecture, ~230 lines including tests. The whole rewrite took about 25 minutes; the architectural debugging during the Copilot review took 90.
Honest takeaway: Copilot was great for PR #6 because the shape was already in the repo. For PR #9 the architecture itself was unknown territory and it hit a wall — directionally right (use the JSON profile!), but wrong in every detail that mattered. The most useful thing the Copilot Chat history gave me was the list of bugs to avoid on the rewrite. I still ship that list as ground truth in the PR #9 commit message.
The combination ended up working: Copilot for the pattern-completion sprints (#2), Cursor for the empirical / architectural reasoning (#3, #4, docs). I'd run the same playbook again. Maybe with Copilot Pro next time so I don't run out mid-debug.
Repo: github.com/thehwang/parq · brew install thehwang/parq/pq · pq --help
Top comments (0)