thehwang

Posted on May 30

Three TODOs, three weeks, one weekend: finishing pq v0.14

#devchallenge #githubchallenge #rust #cli

GitHub “Finish-Up-A-Thon” Challenge Submission

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

pq — jq for Parquet. A 50 MB Rust single binary that wraps DuckDB's query engine in a jq-style expression DSL, optimized for terminal one-liners and unix pipes.

$ pq sales.parquet 'group_by .country | sum .revenue | top 3 by sum_revenue'
┌─────────┬─────────────┐
│ country ┆ sum_revenue │
╞═════════╪═════════════╡
│ US      ┆ 19065.00    │
│ FR      ┆ 999.99      │
│ DE      ┆ 312.00      │
└─────────┴─────────────┘

Where it started. I work in adtech. I look at parquet files dozens of times a day — campaign deliveries, partner exports, audience snapshots. Every existing option was painful:

Tool	Pain
`pyarrow` / `pandas`	5-second cold start, 200 MB virtualenv
`parquet-tools`	JVM, slow, no query support
`pqrs`	Inspector only — can't filter or project
`duckdb` CLI	Great engine, but `SELECT email FROM 'file.parquet' WHERE country='US'` is too verbose to type 50 times a day
Spark	Are you serious

pq is the tool I actually want — single binary, no JVM, no Python, jq-style syntax for piping into the rest of the unix toolbox. It's been my default cat for parquet since v0.5.

Demo

Repo: github.com/thehwang/parq
Latest release: v0.14.0 (this submission)
Install: brew install thehwang/parq/pq
Tutorial: doc/tutorial.md — 30-minute hands-on walkthrough

A taste of what shipped in v0.14:

# Streaming JSON output (was the only buffered format until v0.14)
$ pq big.parquet '.id, .country' -o json | head -c 200
# returns instantly even on a 40 GB file

# Schema-drift gate for CI
$ pq diff baseline.parquet candidate.parquet
# Schema diff
- a: `baseline.parquet`
- b: `candidate.parquet`

## Added (1)
| column    | type    | nullable |
|-----------|---------|----------|
| `country` | VARCHAR | yes      |

$ echo $?
1   # exits non-zero on drift, slots into CI without scripting

And the new TUI Explain panel — press capital E for EXPLAIN ANALYZE, get row-group pruning per scan (this is exactly the panel you see on the cover image at the top of this post):

Explain · ANALYZE  22.0 ms
1 scan(s)  •  833.3k actual rows  •  1 filter(s) pushed  •  2 projection(s)
  ✓ predicate pushdown: country='US'
  ✓ projection pushdown: 2 col(s) user_id, country
  ● actual 833.3k rows  (estimated ~833.3k)
  ● pruned: 83% (833.3k/5.0M rows)        ← new in v0.14, color-coded gauge

Color cues: green ≥ 50% pruned, gold any > 0%, dim 0%. The dim case fires a heuristic: "filter country = 'US' didn't prune any row groups — column may lack min/max stats (common for STRING from older Spark writers)". That's the kind of hint I used to need a DuckDB profile + a calculator to compute.

The Comeback Story

Three TODOs, three weeks of "almost done"

pq had hit v0.13 in May — solid big-file support (streaming output, Ctrl-C interrupt, metadata-only count --lite / stats --lite, async TUI preview, stderr spinner). Then three v0.14 issues sat in my GitHub project board for three weeks, slowly accreting // TODO comments in my notes:

#2: streaming JSON output — the one output format that still buffered the entire result into a Vec because the writer needed to wrap it in [ … ].
#3: row-group pruning ratio in the Explain panel — the most-requested observability feature ("did my filter actually help?").
#4: pq diff — schema-drift detection I'd been wanting myself for a CI gate at work.

Each was "small but tedious" finishing work — the kind that's easy to put off when there's a more interesting feature to start. Classic 90% / 90% problem.

Then the hashtag showed up

GitHub announced the Finish-Up-A-Thon. Hashtag-shaming worked. Over the weekend I cleared the whole milestone:

PR #6 (commit) closes #2: streaming JSON via a hand-written incremental array writer ([, then row, then ,\n + row, then ]). Memory stays flat, head -c 200 returns instantly.
PR #9 (commit) closes #3: row-group pruning extracted from DuckDB's JSON profile, merged with parquet_file_metadata(...) for the file's true row count. This one had a story — see Copilot section below.
PR #10 (commit) closes #4: pq diff as a new subcommand, markdown by default, JSON for tooling, exit 1 on drift. Detects added / dropped / type-changed columns, plus nullability changes (which are breaking for downstream consumers and easy to miss).
PR #8 (commit) was an unplanned bonus: a CI infrastructure fix. The tui smoke (vhs) job started failing mid-weekend because Ubuntu 24.04 runners stopped shipping ttyd in default repos. I'd have hit this on the next merge anyway; the Finish-Up-A-Thon was a forcing function to actually fix it instead of [skip ci]-ing around it.
PR #11 (commit) closes #5 (the v0.14 tracking issue): bump Cargo.toml to 0.14.0, README v0.14 section, full reference manual entry (doc/reference.md §14), tutorial Lesson 6.

By the numbers:

5 PRs merged, 5 issues closed
Test count: 204 → 215 (+8 unit, +3 integration)
All CI green: macOS, Ubuntu, tui smoke (vhs)
v0.14.0 tagged, release workflow building macOS arm64/x86_64 + Linux musl + Windows binaries + Homebrew bottle

What I'm proudest of isn't the line count — it's that the milestone is empty. No "ship it and clean up later" comments left in the code. README's "What's coming" section now has nothing in the v0.14 row to delete. That itch is gone.

Postscript: the cover image caught DuckDB lying

The cover image at the top of this post? It almost wasn't.

After tagging v0.14.0 I tried to record a custom cover showing off the new pruning gauge — a 5M-row parquet file with id < 1M, expecting a satisfying green pruned: 80% line. What VHS captured instead was a wall of JSON debris over the panels, plus this:

● pruned: 0% (25.0M/5.0M rows)

25 million scanned rows out of a 5 million row file. That's not a typo — it was wrong twice over.

Bug #1: a silent PRAGMA. The pruning code sets DuckDB's enable_profiling='json' to grab the JSON profile, then "resets" with PRAGMA disable_profiling afterward. Against DuckDB 1.10.501 that pragma is a silent no-op — accepts the call without erroring but doesn't actually flip the bit. Subsequent EXPLAIN ANALYZE calls (and the TUI runs one on every preview tick) kept returning JSON in column 1, which then bled into the rendered panel as garbage. The documented inverse enable_profiling='no_output' is what actually works. I only found this out by writing a Python probe and trying every reset spelling DuckDB's docs hinted at.

Bug #2: the wrong field. With JSON no longer leaking, the cover redrew — and the numbers still made no sense (25.0M/5.0M). Turns out operator_rows_scanned from DuckDB's JSON profile is roughly 10× a parquet scan's actual row count (likely an internal multi-pass / per-thread accumulator). The correct field for "rows out of scan after pushdown" is operator_cardinality. Same query against the same file: cardinality 1.0M, total 5.0M, ratio 0.8 — the green 80% I was expecting in the first place.

Both went out as v0.14.1 (PR #14, closes #12) with a regression test that opens a real DuckDB connection, runs the round-trip, and asserts the next plain EXPLAIN returns text — not JSON. Test count tipped to 216.

The cover at the top is the after-shot. The pre-fix one is in my recycle bin.

What was meant to be three TODOs ended up as five PRs and a patch release. The challenge wanted a finish; it got a finish and a postscript. Worth it.

My Experience with GitHub Copilot

This was a tale of two halves.

The first half — Copilot at its best. PR #6 (streaming JSON) was end-to-end Copilot Chat. The codebase already had stream_ndjson and stream_csv as templates; the only thing missing was the matching stream_json with the array-bracket bookkeeping. I handed Copilot the existing functions plus the buffered print_json it was replacing, and it produced a clean refactor on the first try — opening bracket on first row, comma-newline on subsequent rows, closing bracket at EOF, plus the right error handling for the partial-write case. The commit ships with a real Co-authored-by: Copilot trailer.

That's the sweet spot: when a codebase has a clear pattern to extend, Copilot pattern-matches and writes near-perfect code. I wrote the issue, gave it 30 seconds of context, and the PR was open inside 10 minutes.

The second half — where Copilot hits its ceiling. PR #9 (pruning ratio) was the opposite. The "obvious" approach — set DuckDB's profile_output PRAGMA to write a JSON profile to a temp file, then read it back — turns out not to work for EXPLAIN ANALYZE. The PRAGMA only writes the file for top-level statements; EXPLAIN ANALYZE returns the profile inline in column 1 of its own result. There's no obvious way to know that without reading DuckDB source or doing what I ended up doing: spinning up Python and probing EXPLAIN ANALYZE against a real parquet file, dumping the row, and discovering the JSON was already there.

Copilot's first attempt for #3 wrote ~170 lines using the temp-file approach. It also:

ran EXPLAIN ANALYZE twice (wasteful — the second run executes the full query again),
didn't reset profiling PRAGMAs after the call (so subsequent preview ticks got JSON-shaped output),
assumed Filename(s) was comma-split (it isn't — DuckDB returns the literal glob string),
merged JSON scans into text scans by index without checking the lengths matched.

I caught most of this in a 7-bug code review with Copilot, then ran out of credits before we could iterate on the architectural changes. With the WIP unmerged and Copilot Chat unavailable, I fell back to Cursor (Claude) and rewrote the feature from scratch — same goal, correct architecture, ~230 lines including tests. The whole rewrite took about 25 minutes; the architectural debugging during the Copilot review took 90.

Honest takeaway: Copilot was great for PR #6 because the shape was already in the repo. For PR #9 the architecture itself was unknown territory and it hit a wall — directionally right (use the JSON profile!), but wrong in every detail that mattered. The most useful thing the Copilot Chat history gave me was the list of bugs to avoid on the rewrite. I still ship that list as ground truth in the PR #9 commit message.

The combination ended up working: Copilot for the pattern-completion sprints (#2), Cursor for the empirical / architectural reasoning (#3, #4, docs). I'd run the same playbook again. Maybe with Copilot Pro next time so I don't run out mid-debug.

Repo: github.com/thehwang/parq · brew install thehwang/parq/pq · pq --help

Top comments (8)

Valentyn Kit • Jun 26

The under-celebrated call here is not writing a parquet reader: wrapping DuckDB means you inherit predicate pushdown and row-group pruning that'd take years to match, and you just own the DSL surface. That's the opposite of the usual "rewrite the engine in Rust" reflex, and it's the right one. Does the jq-style DSL compile down to DuckDB SQL, or do you walk the parquet yourself for simple projections?

thehwang • Jun 27

Yep, you nailed it. it compiles straight to DuckDB SQL, no separate parquet-walking path even for simple projections. parser.rs is literally a "pq DSL to DuckDB SQL" string compiler: every query becomes a QueryPlan that renders to one SQL string over read_parquet('…'). So .tags[0] to tags1, .metadata["plan"] to element_at(metadata,'plan')[1], .events[] to a hoisted UNNEST. And that's exactly why pushdown is free . I emit a plain WHERE/projection and DuckDB's scan does the predicate pushdown + row-group pruning. The moment I hand-walked parquet for the "easy" cases I'd own a second, worse execution path and lose all of that.

Harjot Singh • May 31

"Three TODOs, three weeks, one weekend" is a relatable arc and a quietly useful data point about how TODOs actually behave: they sit untouched not because they're three weeks of work but because starting has a fixed activation cost, and then once you sit down it's a weekend. The three-week delay was procrastination/context-loss tax, not engineering time. The lesson I take from this pattern (and I hit it constantly) is that the gap between "I should do X" and "I'm doing X" is where most of the calendar disappears, and the fix is shrinking the cost of starting, not the work itself.

That's a big part of why I build the way I do - collapse the activation cost. Moonshift, the thing I work on, is a multi-agent pipeline that takes a prompt to a deployed SaaS, so the "ugh, set up the project, wire the boring parts" friction that makes a TODO sit for three weeks is mostly gone - you describe it and there's something running. Lowering the cost-to-start is how the weekend's worth of work stops waiting three weeks. Multi-model routing keeps a build ~$3 flat, first run free no card. Nice ship on v0.14. Genuine question: was the three-week delay the boring setup/context-reload part, or the actual problem being hard? In my experience it's almost always the former, which is the part worth automating away.

thehwang • May 31

Mostly the former, but not entirely, and the exception is the interesting part.

The two boring ones (streaming JSON, schema diff) were pure activation cost: hours of work, three weeks of not-starting. You're right about where the calendar went.

The third (pruning ratio) was genuinely hard — it hinged on a DuckDB profiling quirk that took most of the weekend to understand. That cost wasn't at the start, so nothing about cheaper starting would've touched it. Lowering cost-to-start clears the tedious TODOs (most of them); the hard ones sit because they're hard, and that's the part that doesn't
automate away.

Harjot Singh • May 31

That distinction is the whole thing. The two boring ones weren't hard, they were just expensive to start, so they aged three weeks for psychological reasons, not technical ones. The pruning-ratio one is a different beast: it didn't sit because of activation cost, it sat because it was genuinely unsolved, you couldn't start until you'd thought it through. Those need opposite fixes. Activation-cost TODOs want a forcing function (timebox, just-start). Genuinely-hard ones want thinking time and are dangerous to rush. The trap is treating them the same, either grinding for hours on a 2-hour task or speed-running a real design decision. It's exactly why I have Moonshift's agents take the activation-cost work (boring-but-mechanical) so the human's scarce attention goes to the pruning-ratio class. What made the pruning ratio finally click?

thehwang • May 31

The click was a method switch, not an insight，almost anticlimactic.

I'd been trying to reason about what DuckDB should do from the docs: set profile_output, point it at a temp file, read the JSON back. Spent a while there because that's what the documentation implies. It doesn't work for EXPLAIN ANALYZE — the profile comes back inline in column 1 of the query's own result, not in the file.

What unstuck it was giving up on reasoning and just probing. Opened a Python REPL, ran EXPLAIN ANALYZE against a real parquet file, printed the raw row. The JSON was just sitting there in row[1]. Five minutes of looking at actual output beat hours of reasoning about intended behavior.

And it kept paying off: the two follow-up bugs I shipped a patch for — a PRAGMA reset that's silently a no-op, and a row-count field reporting 10x the real number — were both caught the same way. Print what DuckDB actually returns, not what I expect. Every thing I got wrong, I got wrong by trusting the docs; every fix came from print(row).

So the transferable bit: the "thinking time" the hard ones need isn't always abstract thought. Sometimes it's just building a feedback loop tight enough to see ground truth, then actually looking.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.