Running Ad-Hoc Video Analytics on Parquet Exports With DuckDB

#duckdb #parquet #analytics #sqlite

The query that doesn't belong in your app database

A partner emailed me a deceptively small question about DailyWatch: which channels drove the most watch-time growth in our gaming category over the last 90 days, broken down by week and by country? On paper that is one GROUP BY. In practice it is the kind of query that, run against a live application database during peak traffic, gets you paged at 2 a.m.

Our production store is SQLite — yes, SQLite, with an FTS5 index for search, sitting behind PHP 8.4 on LiteSpeed and fronted by Cloudflare. It is excellent at what it does: serving read-mostly requests for video pages, search, and category listings with sub-millisecond latency. What it is not built for is a self-join across tens of millions of watch events, three window functions deep, while the same file is fielding live traffic. SQLite uses a single writer lock, and a long analytical scan evicts the page cache that your hot read path depends on.

For a long time my answer to "I need to slice the data five different ways" was either build yet another reporting table plus a cron to maintain it, or copy the database somewhere and pray. Both are bad. Reporting tables ossify: the moment someone wants a sixth dimension you are writing a migration and backfilling. And copying a live SQLite file mid-write gives you a torn snapshot unless you go through the backup API.

DuckDB rearranged this entire workflow for me. What follows is the actual process I run: export operational data to Parquet on a schedule, then point DuckDB at those files for any ad-hoc question — no load step, no separate server, and zero impact on the database that serves the site.

Why DuckDB instead of more SQLite

The two get conflated because both are embedded, serverless, single-binary databases you drop into a project. The difference that matters is storage and execution:

SQLite is row-oriented, tuned for OLTP — read this row, update that row, by key.
DuckDB is column-oriented and vectorized — scan hundreds of millions of values in one column and aggregate them across every core on the machine.

For analytics that distinction is everything. When I ask for "sum of seconds watched grouped by country," DuckDB reads only the country and seconds_watched columns off disk and ignores the other thirty. SQLite has to walk full rows. On our watch-events data the same aggregate is routinely 20–50× faster in DuckDB, and it never touches the production file.

The other reason is Parquet. DuckDB treats Parquet as a first-class storage format — it reads it natively, exploits its column statistics, and pushes filters down into the file so it can skip whole row groups. That means I can keep cheap, compressed, immutable snapshots of our data on disk (or in object storage) and query them like tables without ever importing anything.

A few properties make it a genuinely safe tool to keep in the operational toolbox:

Embedded. It is a library and a CLI, not a service to run, secure, and patch.
Read-only against my exports. Ad-hoc analysis cannot corrupt anything because it never writes to the source.
Composable. It speaks Parquet, CSV, JSON, SQLite, Postgres, and object storage in one query.

Exporting SQLite to Parquet without a torn snapshot

The first job is getting a clean, consistent copy of the operational tables out of SQLite. DuckDB can attach a SQLite file directly through its sqlite extension and COPY any query result to Parquet. I run this from PHP because that is what our cron tooling is already written in, but it is really just shelling out to the duckdb CLI.

<?php
// export_parquet.php — dump SQLite tables to Parquet via the DuckDB CLI
declare(strict_types=1);

$dbPath = __DIR__ . '/data/videos.sqlite';
$outDir = __DIR__ . '/exports/' . date('Y-m-d');
@mkdir($outDir, 0775, true);

$tables = ['videos', 'channels', 'watch_events'];

foreach ($tables as $table) {
    $sql = sprintf(
        "INSTALL sqlite; LOAD sqlite; " .
        "ATTACH '%s' AS app (TYPE sqlite, READ_ONLY); " .
        "COPY (SELECT * FROM app.%s) " .
        "TO '%s/%s.parquet' (FORMAT parquet, COMPRESSION zstd);",
        $dbPath, $table, $outDir, $table
    );

    $cmd = 'duckdb -c ' . escapeshellarg($sql);
    exec($cmd, $out, $code);

    if ($code !== 0) {
        fwrite(STDERR, "export failed for {$table}\n");
        exit(1);
    }
    echo "wrote {$outDir}/{$table}.parquet\n";
}

A few things worth calling out:

I write into a date-stamped directory (exports/2026-06-18/). Each run is an immutable snapshot, which makes "compare this week to last" trivial and makes the whole job idempotent — re-running it the same day just overwrites today's folder.
COMPRESSION zstd typically gives me 6–10× smaller files than the raw SQLite pages, and DuckDB reads zstd Parquet fast.
Attaching the live file READ_ONLY is safe for a consistent read, but if a table is being written heavily I point this at a copy made with SQLite's .backup (or after a WAL checkpoint) so the snapshot is genuinely point-in-time clean.

This runs in well under a minute on our dataset and produces files I can hand to anyone — an analyst, a notebook, a Go service — without giving them access to production.

Querying Parquet directly — there is no load step

This is the part that still feels like cheating. There is no "import the Parquet into a table" step. You point a query at the file and DuckDB does the rest. Here is the partner's question, almost verbatim, as SQL:

-- channels_by_watchtime.sql
SELECT
    c.name                              AS channel,
    e.country,
    date_trunc('week', e.watched_at)    AS week,
    sum(e.seconds_watched) / 3600.0     AS hours
FROM read_parquet('exports/2026-06-18/watch_events.parquet') e
JOIN read_parquet('exports/2026-06-18/channels.parquet')     c
  ON c.id = e.channel_id
WHERE e.category = 'gaming'
  AND e.watched_at >= now() - INTERVAL 90 DAY
GROUP BY ALL
ORDER BY week, hours DESC;

read_parquet() is a table function; it returns rows you can join, filter, and aggregate exactly like a table. Because Parquet stores per-row-group min/max statistics, the WHERE e.category = 'gaming' and the date filter let DuckDB skip row groups that cannot match before it reads them — predicate pushdown. The narrow SELECT list means only the referenced columns are read off disk — projection pushdown. On a 40-million-row export this query returns in a couple of seconds on a laptop, against files I can keep in object storage or git-lfs.

GROUP BY ALL is one of those DuckDB quality-of-life features that removes a whole class of "column 3 must appear in GROUP BY" annoyances — it groups by every non-aggregated column in the select list. Small thing, but when you are iterating on an ad-hoc query twenty times in a row, it adds up.

Globbing across partitioned exports

Because each daily run writes to its own folder, a glob pattern turns a pile of snapshots into one logical table. This is where DuckDB earns its keep for trend questions. Here is the growth analysis in Python — the language I reach for when I want to chart or export the result:

import duckdb

con = duckdb.connect()  # in-memory, nothing to provision
con.execute("SET memory_limit = '2GB'; SET threads = 4;")

# Glob across every daily export; DuckDB reads only the columns it needs.
rows = con.execute("""
    WITH weekly AS (
        SELECT
            channel_id,
            date_trunc('week', watched_at) AS week,
            sum(seconds_watched) / 3600.0  AS hours
        FROM read_parquet('exports/*/watch_events.parquet',
                          union_by_name = true)
        WHERE category = 'gaming'
        GROUP BY channel_id, week
    )
    SELECT
        channel_id,
        week,
        hours,
        hours - lag(hours) OVER (
            PARTITION BY channel_id ORDER BY week
        ) AS wow_delta
    FROM weekly
    QUALIFY wow_delta IS NOT NULL
    ORDER BY wow_delta DESC
    LIMIT 25
""").fetchall()

for channel_id, week, hours, delta in rows:
    print(f"{week:%Y-%m-%d}  channel={channel_id:>6}  {hours:8.1f}h  delta {delta:+.1f}")

Notes on what is happening:

read_parquet('exports/*/watch_events.parquet', union_by_name = true) reads every daily export and stitches them together. union_by_name means I don't get burned if I added a column to the schema partway through the quarter — DuckDB aligns by column name and fills missing ones with NULL instead of erroring.
The lag() window function gives me week-over-week deltas per channel without a self-join.
QUALIFY filters on the window result directly — no wrapping subquery just to drop the first week where the delta is NULL. It is the HAVING equivalent for window functions, and most engineers have never seen it.
The connection is in-memory (duckdb.connect() with no path). Nothing is provisioned, nothing persists, and the process exits clean.

I genuinely do most of my exploratory work in this loop: tweak SQL, fetchall(), eyeball, repeat. When the shape of the answer is right I either dump it to CSV for the partner or push a summary back toward the app.

From a one-off query to a number in the app

Exploration is half the job. The other half is taking the query that produced a useful answer and wiring its result into something durable — a small dashboard, a JSON fragment the front end reads, a reporting row. We have a couple of Go services in the stack, and the go-duckdb driver speaks the standard database/sql interface, so a summary job is unremarkable:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/marcboeker/go-duckdb"
)

func main() {
    db, err := sql.Open("duckdb", "") // in-memory instance
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const q = `
        SELECT country,
               count(*)                    AS plays,
               sum(seconds_watched)/3600.0 AS hours
        FROM read_parquet('exports/*/watch_events.parquet',
                          union_by_name = true)
        WHERE watched_at >= now() - INTERVAL 7 DAY
        GROUP BY country
        ORDER BY hours DESC
        LIMIT 10`

    rows, err := db.Query(q)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var country string
        var plays int64
        var hours float64
        if err := rows.Scan(&country, &plays, &hours); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%-3s %8d plays %10.1f h\n", country, plays, hours)
    }
}

This job computes a 7-day country leaderboard and prints it, but in production it writes the result to a small JSON file that the PHP front end serves as a cached dashboard fragment. Because that fragment changes at most once per export, Cloudflare caches it at the edge and LiteSpeed serves it from its own cache on a miss — the dashboard costs us essentially nothing to serve, and the heavy DuckDB scan happens once, offline, not per request.

The pattern that matters here is the separation of concerns: DuckDB does the analytical heavy lifting against immutable Parquet, and the result that lands back in the serving path is tiny and cacheable. The production SQLite database never participates in analytics at all. It keeps doing the one thing it is superb at — answering page requests behind FTS5 search.

Keeping it cheap, bounded, and safe

A few operational guardrails I have learned to set, because an unbounded analytical query on a small box will happily eat all of your RAM:

Cap memory and threads. SET memory_limit = '2GB'; SET threads = 4; keeps a runaway query from starving everything else on a shared host.
Stay read-only. Analysis only ever reads Parquet snapshots, so there is nothing to corrupt and no lock contention with the live site.
Reach object storage when exports grow. INSTALL httpfs; LOAD httpfs; lets DuckDB read s3:// or Cloudflare R2 paths with the same read_parquet() glob. Our older snapshots live in R2 and a single query spans local plus remote folders transparently.
Let the format do the pruning. Filter on the columns you partitioned or sorted by, keep zstd compression on, and lean on row-group statistics rather than scanning everything.
Treat exports as immutable. Never mutate a snapshot; write a new dated folder. Reproducibility for free, and a trivial rollback if an export run was bad.

The economics are the quiet win. There is no warehouse to pay for, no cluster to keep warm, no per-query billing. The export cron runs in under a minute, the Parquet files are small, and DuckDB is a single binary I already have on every box. Ad-hoc analytics went from "schedule a meeting and build a pipeline" to "open a terminal and ask."

What I'd tell myself a year ago

If you are running a read-heavy app on SQLite — or honestly Postgres, or MySQL — and you keep getting dragged into one-off analytical questions, stop building a reporting table for each one. Export to Parquet on a schedule, point DuckDB at the files, and answer questions interactively. You get columnar speed, window functions, QUALIFY, and GROUP BY ALL, and you get all of it without putting a single extra byte of load on the database that keeps your site up.

The rule we settled on is simple: SQLite serves the site, DuckDB answers the questions, and Parquet is the boundary between them. The two never fight over the same file, and I have not been paged for a runaway report since.