DEV Community

ahmet gedik
ahmet gedik

Posted on

Running Ad-Hoc Video Analytics on Parquet Exports with DuckDB

We track every trending video across roughly thirty European markets. Every few hours a fetch job pulls fresh metrics — view count, likes, comments, and the rank a clip held in a given country — and appends a row to a video_snapshots table in SQLite. SQLite in WAL mode handles that write load beautifully: concurrent readers never block the writer, the database file lives on the same LiteSpeed box that serves the site, and there is no second daemon to babysit. For serving pages and feeding the API, it is close to perfect.

Then someone on the editorial side asks a question like "which categories went viral fastest in Poland and the Netherlands last month, and how did that compare to the same clips in Germany?" — and the whole arrangement falls over. That query touches tens of millions of rows, groups across regions and dates, and computes per-video growth curves with window functions. Run it against the live SQLite file and you either stall the page-cache warmers or wait four minutes for a single answer. Run it twice in parallel and an analyst has effectively DoS'd production.

For a long time the standard advice here is "stand up a warehouse." But we are a lean operation at ViralVidVault, and spinning up ClickHouse or shipping data into BigQuery introduces a second copy of personal-ish data to govern under GDPR, a network hop, egress cost, and an ops surface we do not want. DuckDB turned out to be the answer that fits the existing stack: it is an in-process analytical engine, it reads Parquet natively, and it never touches the live database unless we explicitly point it there. This post walks through how we export SQLite to partitioned Parquet, run genuinely heavy ad-hoc queries on it, and wire the whole thing behind a thin PHP endpoint — without the production database ever feeling it.

Why DuckDB and not a warehouse

The mental model that made this click for me: DuckDB is to OLAP what SQLite is to OLTP. It is a single library, no server, no cluster, no config files. You embed it, point it at files, and run SQL. For an ad-hoc analytics layer that gets queried a few hundred times a day by humans, that profile is exactly right.

The properties that mattered for us specifically:

  • It reads Parquet directly. No load step, no ingestion pipeline. The Parquet files on disk are the table. Add a new day's export and it is queryable immediately.
  • Columnar + vectorized execution. Aggregations that scan one or two columns out of fifteen only read those columns off disk. A SELECT region, avg(view_count) over 40M rows reads megabytes, not gigabytes.
  • It can read SQLite itself. The sqlite extension lets DuckDB attach a .db file read-only, which makes the export step a one-liner instead of a custom serializer.
  • Zero new attack surface. There is no listening port, no auth layer to misconfigure, no separate box to patch. It runs as a CLI invocation or an embedded library and exits.
  • Hive-style partition pruning. Partition by date and region on disk, and a query filtered to region = 'PL' for last week physically skips every other folder.

The trade-off is that DuckDB is not built for high-concurrency point lookups or transactional writes — that is still SQLite's job. We are not replacing the operational database; we are giving analysts a read-only mirror they cannot hurt.

Exporting SQLite to Parquet without blocking ingestion

The export is where most of the GDPR and performance thinking lives, so it is worth doing carefully. The key insight is that DuckDB can do the whole export by itself: attach the SQLite file in read-only mode, SELECT the columns we actually want for analytics (dropping anything PII-adjacent), and COPY straight to partitioned Parquet.

Running it read-only is what keeps production safe. WAL mode already allows readers alongside the writer, and READ_ONLY guarantees the export can never take a write lock or corrupt the file even if the process is killed mid-run.

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")

# READ_ONLY is the whole safety story: this can never lock or mutate prod.
con.execute("ATTACH '/srv/vvv/data/viral.db' AS src (TYPE sqlite, READ_ONLY);")

con.execute("""
COPY (
    SELECT
        video_id,
        region,
        category_id,
        view_count,
        like_count,
        comment_count,
        rank_position,
        captured_at,
        CAST(captured_at AS DATE) AS capture_date
    FROM src.video_snapshots
    WHERE captured_at >= now() - INTERVAL 2 DAY
)
TO '/srv/vvv/exports/snapshots'
(
    FORMAT parquet,
    PARTITION_BY (capture_date, region),
    COMPRESSION zstd,
    OVERWRITE_OR_IGNORE true
);
""")

print("export complete")
Enter fullscreen mode Exit fullscreen mode

A few decisions worth calling out:

  • We project, we do not SELECT *. The export query is the natural choke point for data minimisation. There is no user_id, no IP, no session — none of it ever reaches the analytics layer, so an analyst literally cannot query personal data that is not there. That is GDPR data-minimisation enforced by schema, not by policy.
  • Partition by capture_date and region. This produces a directory tree like capture_date=2026-06-01/region=DE/data.parquet. DuckDB reads the partition values from the path, so date- and region-filtered queries prune entire folders.
  • ZSTD compression. Snapshot data is highly repetitive (the same video_id and category_id recur thousands of times), so ZSTD typically gets us 8-12x over the raw rows while staying fast to decompress.
  • Rolling two-day window. We re-export the last two days each night and let OVERWRITE_OR_IGNORE replace those partitions. Older partitions are immutable once written, which makes the export idempotent and cheap.

We schedule it as a plain cron entry next to the existing fetch jobs. There is nothing clever here, and that is the point:

# /etc/cron.d/vvv-export
# Nightly Parquet export, 03:17 to dodge the on-the-hour fetch jobs.
17 3 * * *  www-data  /usr/bin/python3 /srv/vvv/bin/export_snapshots.py >> /var/log/vvv/export.log 2>&1
Enter fullscreen mode Exit fullscreen mode

Because the export reads SQLite read-only and writes to a separate exports/ directory, it can run while the fetch cron is appending new snapshots and while LiteSpeed is serving traffic. Nobody contends with anybody.

Queries that would melt SQLite

Now the fun part. Here is the actual shape of the "fastest-growing categories per market" question that started this whole thing. It uses a window function to compute each video's day-over-day view gain, then aggregates that into per-category medians and a 95th percentile — the kind of analytical SQL that SQLite technically supports but executes at a glacial pace over tens of millions of rows.

-- Fastest-growing categories per market, last 30 days.
WITH growth AS (
    SELECT
        video_id,
        region,
        category_id,
        capture_date,
        view_count,
        view_count - lag(view_count) OVER (
            PARTITION BY video_id, region
            ORDER BY capture_date
        ) AS daily_gain
    FROM read_parquet(
        '/srv/vvv/exports/snapshots/**/*.parquet',
        hive_partitioning = true
    )
    WHERE capture_date >= today() - 30
      AND region IN ('PL', 'NL', 'DE')
)
SELECT
    region,
    category_id,
    count(DISTINCT video_id)                AS clips,
    round(median(daily_gain))               AS median_daily_views,
    round(quantile_cont(daily_gain, 0.95))  AS p95_daily_views
FROM growth
WHERE daily_gain > 0
GROUP BY region, category_id
ORDER BY region, median_daily_views DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

The hive_partitioning = true flag is what makes this fast. The WHERE capture_date >= today() - 30 AND region IN ('PL','NL','DE') predicate is matched against the directory structure before any file is opened, so DuckDB only touches ~90 partition folders out of the thousands on disk. On our dataset this query returns in well under a second on a single shared LiteSpeed box, against a SQLite equivalent that took minutes and pinned a core the whole time.

A couple of DuckDB-specific niceties that show up constantly in ad-hoc work:

  • quantile_cont and median are built in. No extension, no custom aggregate. Percentile latency-style analysis on view gains is one expression.
  • read_parquet accepts globs and lists. You can point at **/*.parquet, a single file, or an explicit list, and DuckDB unions them with a consistent schema.
  • EXPLAIN ANALYZE prints how many partitions and bytes were actually scanned — invaluable for confirming pruning is doing its job rather than silently reading everything.

Wrapping DuckDB behind a PHP endpoint

Most of our app is PHP 8.4, and analysts do not write SQL into a terminal — they hit an internal dashboard. So we need a thin server-side wrapper. There is no need for a PHP extension here; the DuckDB CLI with the -json flag gives us machine-readable output, and proc_open lets us feed SQL over stdin so nothing ever lands on the shell command line.

The important safety property: we open DuckDB against an in-memory database (:memory:) and create a view over the Parquet files. The session can read the exports and nothing else — there is no attached writable database, so even a malicious query cannot mutate state.

<?php
declare(strict_types=1);

final class DuckQuery
{
    public function __construct(
        private readonly string $binary = '/usr/local/bin/duckdb',
        private readonly string $exportRoot = '/srv/vvv/exports/snapshots',
    ) {}

    /** @return list<array<string,mixed>> */
    public function run(string $sql, int $timeoutSeconds = 20): array
    {
        $glob = $this->exportRoot . '/**/*.parquet';
        $bootstrap = sprintf(
            "CREATE VIEW snapshots AS "
            . "SELECT * FROM read_parquet('%s', hive_partitioning = true);",
            str_replace("'", "''", $glob)
        );

        $descriptors = [
            0 => ['pipe', 'r'],
            1 => ['pipe', 'w'],
            2 => ['pipe', 'w'],
        ];

        // :memory: + -json => read-only against Parquet, machine-readable rows.
        $proc = proc_open(
            [$this->binary, '-json', ':memory:'],
            $descriptors,
            $pipes
        );

        if (!is_resource($proc)) {
            throw new RuntimeException('Could not start DuckDB');
        }

        fwrite($pipes[0], $bootstrap . "\n" . rtrim($sql, "; ") . ";\n");
        fclose($pipes[0]);

        $out = stream_get_contents($pipes[1]);
        $err = stream_get_contents($pipes[2]);
        fclose($pipes[1]);
        fclose($pipes[2]);

        if (proc_close($proc) !== 0) {
            throw new RuntimeException("DuckDB failed: {$err}");
        }

        /** @var list<array<string,mixed>> $rows */
        $rows = json_decode($out ?: '[]', true, flags: JSON_THROW_ON_ERROR);
        return $rows;
    }
}

// Usage from a dashboard controller:
$duck = new DuckQuery();
$rows = $duck->run(
    "SELECT region, count(DISTINCT video_id) AS clips
       FROM snapshots
      WHERE capture_date >= today() - 7
      GROUP BY region
      ORDER BY clips DESC"
);
header('Content-Type: application/json');
echo json_encode($rows, JSON_THROW_ON_ERROR);
Enter fullscreen mode Exit fullscreen mode

The dashboard only ever sends parameterised, server-built queries — analysts pick filters from dropdowns, they do not type raw SQL — so the snapshots view plus a query-allowlist on our side keeps this from becoming an injection vector. For the cases where we genuinely want free-form SQL (internal, authenticated, trusted users only), the in-memory read-only session is the backstop: the worst a bad query can do is spin a CPU for up to timeoutSeconds and then get killed.

In front of this endpoint we let Cloudflare Workers cache the common dashboard queries for a few minutes keyed on the filter parameters. The cardinality of "region + date-range + category" combinations the editorial team actually uses is small, so a short edge TTL absorbs most of the load and the PHP layer barely sees repeat traffic.

Batch jobs in Go for the heavy lifters

Not everything runs interactively. Some rollups — the daily "top movers" email, the weekly trend digest — are batch jobs, and we run a few of those in Go using the marcboeker/go-duckdb driver. It speaks the standard database/sql interface, so the code looks like any other Go SQL job, except the "database" is a pile of Parquet files.

package main

import (
    "database/sql"
    "encoding/json"
    "log"
    "os"

    _ "github.com/marcboeker/go-duckdb"
)

func main() {
    db, err := sql.Open("duckdb", "") // empty DSN => in-memory
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const q = `
        SELECT region,
               count(DISTINCT video_id) AS videos,
               round(avg(view_count))   AS avg_views
        FROM read_parquet('/srv/vvv/exports/snapshots/**/*.parquet',
                          hive_partitioning = true)
        WHERE capture_date >= today() - 7
        GROUP BY region
        ORDER BY avg_views DESC`

    rows, err := db.Query(q)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    type stat struct {
        Region   string  `json:"region"`
        Videos   int64   `json:"videos"`
        AvgViews float64 `json:"avg_views"`
    }

    var out []stat
    for rows.Next() {
        var s stat
        if err := rows.Scan(&s.Region, &s.Videos, &s.AvgViews); err != nil {
            log.Fatal(err)
        }
        out = append(out, s)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }

    if err := json.NewEncoder(os.Stdout).Encode(out); err != nil {
        log.Fatal(err)
    }
}
Enter fullscreen mode Exit fullscreen mode

The nice thing about a static Go binary here is that it ships with DuckDB compiled in. The digest job is a single executable on a cron, reading the same Parquet exports the dashboard reads. No driver to install on the box, no runtime to keep in sync.

Keeping GDPR boring

The most reassuring part of this design is how little GDPR work it actually requires, precisely because the boundaries are physical rather than procedural:

  • PII never enters the analytics layer. The export query projects a fixed column list. There is no user identifier, IP, or session token in any Parquet file. Data minimisation is enforced by the export schema, not by reviewer discipline.
  • The data we keep is aggregate signal, not personal data. A view count and a category for a public video is not personal data about our visitors. We are analysing video performance, not people.
  • Partition layout makes retention trivial. Because everything is partitioned by capture_date, our retention job is a glorified rm -rf of folders older than the policy window. No DELETE over a live table, no vacuum, no fragmentation.
  • No third country, no extra processor. Because DuckDB is in-process and the Parquet files sit on the same EU-hosted infrastructure as the rest of the stack, there is no cross-border transfer and no new sub-processor to add to the records of processing.

If we ever did need a derived per-user metric, the right move would be to hash or pseudonymise it in the export SELECT so the raw identifier never reaches Parquet — but so far we have not needed to, and the cleanest PII is the kind that does not exist in your analytics store.

What it actually bought us

The before/after is stark. Queries that took minutes and risked production stability now run in well under a second against an isolated, read-only dataset. The export adds a few seconds to a nightly cron and a few hundred megabytes of compressed Parquet to disk. There is no new server, no new port, no new credential to rotate, and nothing an analyst can do from the dashboard that touches the operational SQLite file.

If you are running a small-to-medium product on SQLite and feeling the pain the moment analytical questions show up, this is the pattern I would reach for first:

  • Keep SQLite (in WAL mode) as the operational source of truth.
  • Export a projected, partitioned, ZSTD-compressed Parquet snapshot on a cron, reading the database read-only.
  • Query it with DuckDB — from the CLI, from PHP via proc_open, or from a Go batch job via the driver.
  • Let partition pruning and columnar scans do the heavy lifting, and cache the hot dashboard queries at the edge.

You get warehouse-class ad-hoc analytics without a warehouse, and the production database never even knows the analysts are there. For a lean European team that has to think about every byte of data it stores, that combination of speed, simplicity, and a small data footprint is hard to beat.

Top comments (0)