ahmet gedik

Posted on Jun 28

Ad-Hoc Video Analytics with DuckDB on Parquet Exports from Production SQLite

#duckdb #sql #analytics #php

Every couple of hours, a cron job on a cheap LiteSpeed shared host wakes up, hits the YouTube Data API for eight regions — US, GB, DE, FR, IN, BR, AU, CA — and writes trending-video metadata into a single SQLite database. That is the entire ingestion pipeline behind TrendVidStream, a multi-region video streaming discovery site I run. PHP 8.4 renders the pages, SQLite FTS5 powers search, and deploys happen over plain FTP because that is what the hosting gives you. It works, it is boring, and it costs almost nothing.

What this setup does not handle gracefully is me, at 11pm, asking questions like: "Which videos trended in four or more regions this week, how fast did they climb, and is the music category quietly eating everything else?" Those are analytical questions — full scans, window functions, cross-snapshot joins. Running them against the production SQLite file means either hammering a shared-host CPU that is also serving traffic, or copying a multi-hundred-megabyte .db file to my laptop every time I get curious.

The fix that stuck: a nightly export of the hot tables to hive-partitioned Parquet, pulled down over the same FTP channel the deploys already use, and queried locally with DuckDB. The whole thing is about 120 lines of PHP and SQL, and it turned "I wonder if..." questions from a 20-minute chore into a 4-second query.

Why not just query SQLite directly

SQLite is genuinely great at what the site needs: point lookups by video ID, FTS5 full-text search, small ordered scans for category pages. It is a row store with a single-writer model, and the access patterns of a discovery site fit that perfectly.

Analytical queries are a different animal:

They scan everything. "View-count delta per video per region over 14 days" touches every row in the window. A row store reads entire rows even when you need three columns out of eighteen.
They run on the wrong machine. Shared hosting gives you a slice of a CPU and a 180-second execution cap. A 40-second aggregate query is both slow and rude to your own visitors.
They want history. The production table is a current state table — each fetch upserts the latest stats. The moment you ask "how fast is this climbing," you need snapshots, and bolting a history table onto the production schema bloats the file that FTS5 and the page cache live in.
They are exploratory. Half my queries are dead ends. Dead ends belong on a laptop, not on a box serving requests.

DuckDB is the obvious counterpart: an in-process columnar engine that reads Parquet natively, does proper window functions and grouping sets, and installs with pip install duckdb or a single static binary. No server, no cluster, no infrastructure to babysit — which matters when your entire ops story is cron plus FTP.

First contact: pointing DuckDB at the SQLite file

Before building any export pipeline, it is worth knowing that DuckDB can read SQLite databases directly through its sqlite extension. For one-off exploration against a copied-down .db file, this is all you need:

INSTALL sqlite;
LOAD sqlite;
ATTACH 'app.db' AS prod (TYPE sqlite, READ_ONLY);

SELECT region, count(*) AS videos, max(fetched_at) AS last_fetch
FROM prod.videos
GROUP BY region
ORDER BY videos DESC;

This works, and it is how I sanity-check the export pipeline itself. But as a daily workflow it has two problems. First, you are still shipping the whole database file around — mine is dominated by FTS5 index pages and description text that analytics never touches. Second, SQLite's type affinity leaks through: my timestamps are stored as TEXT (2026-06-12 14:20:01), integers that were ever written as strings stay strings, and every query ends up wrapped in casts.

Parquet snapshots solve both. You project only the analytical columns, cast once at export time, and get real types plus columnar compression. My daily snapshot of ~40,000 rows across eight regions lands at about 2.5 MB zstd-compressed — small enough that pulling it over FTP is a non-event.

The export cron, in PHP

The site already has a cron entry point that loops over regions, so the export hangs off the end of that. PHP does not have a native Parquet writer worth using, and I did not want one — the trick is to ship the DuckDB CLI binary (a single ~40 MB static executable) alongside the app and let PHP orchestrate it. The CLI reads the SQLite file through the same extension shown above and writes partitioned Parquet in one COPY statement:

<?php
declare(strict_types=1);

// cron/export_parquet.php — runs nightly after the last fetch cycle

const DB_PATH   = __DIR__ . '/../data/app.db';
const EXPORT_DIR = __DIR__ . '/../data/exports';
const DUCKDB_BIN = __DIR__ . '/../bin/duckdb';

$snapshotDate = new DateTimeImmutable('today')->format('Y-m-d'); // PHP 8.4: new in expression

$sql = <<<SQL
INSTALL sqlite; LOAD sqlite;
ATTACH '%s' AS prod (TYPE sqlite, READ_ONLY);
COPY (
    SELECT
        video_id,
        title,
        channel_id,
        channel_title,
        category_id,
        region,
        CAST(view_count AS BIGINT)  AS view_count,
        CAST(like_count AS BIGINT)  AS like_count,
        strptime(published_at, '%%Y-%%m-%%d %%H:%%M:%%S') AS published_at,
        strptime(fetched_at,   '%%Y-%%m-%%d %%H:%%M:%%S') AS fetched_at,
        DATE '%s' AS snapshot_date
    FROM prod.videos
    WHERE fetched_at >= strftime(now() - INTERVAL 1 DAY, '%%Y-%%m-%%d %%H:%%M:%%S')
) TO '%s' (
    FORMAT parquet,
    COMPRESSION zstd,
    PARTITION_BY (snapshot_date, region),
    OVERWRITE_OR_IGNORE true
);
SQL;

$query = sprintf($sql, DB_PATH, $snapshotDate, EXPORT_DIR);

$proc = proc_open(
    [DUCKDB_BIN, ':memory:'],
    [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']],
    $pipes
);

fwrite($pipes[0], $query);
fclose($pipes[0]);
$stderr = stream_get_contents($pipes[2]);
$exit = proc_close($proc);

if ($exit !== 0) {
    error_log('[export_parquet] FAILED: ' . trim($stderr));
    exit(1);
}

echo 'Exported snapshot ' . $snapshotDate . PHP_EOL;

A few deliberate choices in there:

READ_ONLY on the attach. DuckDB takes a read snapshot and never touches the WAL in write mode, so the running site is unaffected even if a fetch cycle overlaps.
All casting happens here. strptime turns SQLite's text timestamps into real TIMESTAMP columns once, at export time. Every downstream query gets clean types for free.
PARTITION_BY (snapshot_date, region) writes a hive layout: exports/snapshot_date=2026-06-12/region=DE/data_0.parquet. That layout is the whole reason later queries are fast — DuckDB prunes partitions from the file path before reading a single byte.
The WHERE clause makes the export incremental. Only rows touched in the last day are snapshotted, so the export takes about two seconds on shared hosting instead of rewriting history every night.

Getting the files off the server

Deploys for this site go out over FTP with lftp, so the analytics pull reuses the exact same mechanism in reverse. One line in a local cron:

lftp -e 'mirror --only-newer /data/exports ./exports; quit' -u $FTP_USER,$FTP_PASS $FTP_HOST

--only-newer means each morning I transfer one day's partitions — eight small Parquet files — not the whole archive. The hive directory structure survives the mirror untouched, which matters because DuckDB derives the snapshot_date and region columns from those directory names. After a few months the local exports/ tree holds the full history that the production database deliberately does not.

There is a nice asymmetry here: the production box only ever keeps the latest day or two of exports (a cleanup step deletes older partition directories), so the shared host stays lean while the laptop accumulates the long tail.

The queries that pay rent

With the data local, everything happens through read_parquet with hive_partitioning enabled. These three queries are the ones I actually run weekly, not demo material.

-- 1. Cross-region overlap: videos trending in 3+ regions on one day.
--    This is the core editorial signal for a multi-region discovery site.
SELECT
    video_id,
    any_value(title)            AS title,
    count(DISTINCT region)      AS region_count,
    list(DISTINCT region)       AS regions,
    max(view_count)             AS peak_views
FROM read_parquet('exports/**/*.parquet', hive_partitioning = true)
WHERE snapshot_date = DATE '2026-06-12'
GROUP BY video_id
HAVING count(DISTINCT region) >= 3
ORDER BY region_count DESC, peak_views DESC
LIMIT 25;

-- 2. Velocity: 24h view-count delta per video per region.
WITH daily AS (
    SELECT video_id, region, snapshot_date,
           max(view_count) AS views
    FROM read_parquet('exports/**/*.parquet', hive_partitioning = true)
    WHERE snapshot_date >= current_date - INTERVAL 7 DAY
    GROUP BY ALL
)
SELECT video_id, region, snapshot_date,
       views - lag(views) OVER w AS views_24h,
       round(100.0 * (views - lag(views) OVER w) / nullif(lag(views) OVER w, 0), 1) AS pct_24h
FROM daily
WINDOW w AS (PARTITION BY video_id, region ORDER BY snapshot_date)
QUALIFY views_24h IS NOT NULL
ORDER BY views_24h DESC
LIMIT 50;

-- 3. Category drift: week-over-week share of trending slots per category.
SELECT category_id,
       date_trunc('week', snapshot_date) AS week,
       count(*) AS slots,
       round(100.0 * count(*) / sum(count(*)) OVER (PARTITION BY date_trunc('week', snapshot_date)), 1) AS share_pct
FROM read_parquet('exports/**/*.parquet', hive_partitioning = true)
GROUP BY ALL
ORDER BY week DESC, share_pct DESC;

Things worth noticing if you are coming from SQLite:

QUALIFY filters on window-function results without an extra subquery layer. Query 2 in SQLite is a three-level nesting exercise; here it reads top to bottom.
GROUP BY ALL groups by every non-aggregated column. Small thing, huge quality-of-life for exploratory work where you keep adding columns.
list(DISTINCT region) returns an actual array. Seeing [US, GB, DE, IN] inline beats a second lookup query.
Partition pruning is visible. Query 1 filters on snapshot_date, which is a directory name, so DuckDB opens only that day's eight files. EXPLAIN confirms it. On a year of history, the query cost stays flat.

On my laptop, all three queries over ~90 days of snapshots (about 3.5 million rows) complete in well under a second. The same overlap query against the production SQLite schema — before I gave up on running it there — took around 30 seconds and pinned the CPU.

Automating the weekly digest in Python

Once the same four or five queries proved useful repeatedly, they graduated into a small Python script that runs after the morning FTP pull and writes a markdown digest. DuckDB's Python API makes this almost embarrassingly short:

#!/usr/bin/env python3
"""weekly_digest.py — cross-region trending summary from Parquet exports."""
import duckdb
from datetime import date, timedelta
from pathlib import Path

GLOB = 'exports/**/*.parquet'
since = date.today() - timedelta(days=7)

con = duckdb.connect()  # in-memory; the Parquet files ARE the database

overlap = con.execute("""
    SELECT video_id, any_value(title) AS title,
           count(DISTINCT region) AS regions,
           max(view_count) AS peak_views
    FROM read_parquet($glob, hive_partitioning = true)
    WHERE snapshot_date >= $since
    GROUP BY video_id
    HAVING count(DISTINCT region) >= 3
    ORDER BY regions DESC, peak_views DESC
    LIMIT 15
""", {'glob': GLOB, 'since': since}).fetchall()

by_region = con.execute("""
    SELECT region, count(DISTINCT video_id) AS uniques,
           round(avg(view_count)) AS avg_views
    FROM read_parquet($glob, hive_partitioning = true)
    WHERE snapshot_date >= $since
    GROUP BY region ORDER BY uniques DESC
""", {'glob': GLOB, 'since': since}).fetchall()

lines = [f'# Trending digest — week of {since}', '', '## Multi-region hits', '']
for vid, title, regions, views in overlap:
    lines.append(f'- **{title}** — {regions} regions, peak {views:,} views (`{vid}`)')

lines += ['', '## Region health', '']
for region, uniques, avg_views in by_region:
    lines.append(f'- {region}: {uniques} unique trending videos, avg {int(avg_views):,} views')

Path(f'digests/{since}.md').write_text('\n'.join(lines) + '\n')
print(f'wrote digests/{since}.md ({len(overlap)} multi-region hits)')

Note the parameterized queries — $glob and $since — which keep the SQL copy-pasteable between the CLI and the script. The digest tells me in thirty seconds whether a region's API fetch silently degraded (uniques drop), whether a category is surging, and what the genuinely global videos were this week. The third one feeds directly back into the product.

Closing the loop back to production

The analytics layer would be a vanity project if nothing flowed back to the site. The piece that does: the multi-region overlap query gets serialized to a small JSON file and pushed to each host over the existing FTP deploy channel. On the PHP side, rendering it is deliberately dumb — no DuckDB, no SQLite involvement, just a cached file read:

<?php
declare(strict_types=1);

final readonly class GlobalTrending
{
    public function __construct(private string $path) {}

    /** @return list<array{video_id: string, title: string, regions: int}> */
    public function top(int $limit = 10): array
    {
        if (!is_file($this->path) || filemtime($this->path) < time() - 86400 * 2) {
            return []; // stale or missing → widget simply doesn't render
        }
        $rows = json_decode(file_get_contents($this->path), true, flags: JSON_THROW_ON_ERROR);
        return array_slice($rows, 0, $limit);
    }
}

The staleness check is the important line. If my laptop is closed for a week and no fresh JSON arrives, the "trending worldwide" widget disappears instead of showing fossilized data. The production site never depends on the analytics pipeline being alive — it only gets enhanced by it. For a one-person operation, that failure mode is the difference between a hobby-friendly system and a pager.

Gotchas I hit so you don't have to

SQLite type affinity will bite exactly once. A handful of my early view_count values were written as text by an old fetcher version. DuckDB's sqlite extension surfaced them as VARCHAR and the first export died on the cast. Fix it at export time with TRY_CAST and a logged count of nulls, not by mutating production data mid-flight.
Normalize timezones before comparing regions. Eight regions means fetch timestamps that look misaligned if any layer applied a local offset. I store everything UTC in SQLite and declared the export's strptime results as UTC; do this on day one, because retrofitting timezone hygiene across months of Parquet partitions is miserable.
OVERWRITE_OR_IGNORE matters for reruns. Without it, re-running the export after a partial failure either errors or duplicates rows inside a partition. With it, the day's partition is rewritten atomically and the cron is safely idempotent.
Don't glob more than you need. read_parquet('exports/**/*.parquet') is fine with a snapshot_date filter (pruning saves you), but an unfiltered SELECT count(*) over a year of partitions opens every file. Listing thousands of small files has fixed overhead even when each one is tiny — keep daily partitions per region as single files, resist the urge to export hourly.
Pin the DuckDB version in both places. The CLI binary on the host writes the Parquet; the Python package on the laptop reads it. Parquet itself is stable, but I keep both on the same release anyway so behavior differences in strptime and casting never become a debugging session.
Dedup within snapshots, not across them. The same video_id legitimately appears in multiple regions and on multiple days — that's the signal, not noise. Every aggregate needs to be explicit about which axis it collapses, hence the count(DISTINCT region) and max(view_count) patterns above rather than naive counts.

Conclusion

The pattern here is small and reusable: keep the transactional store (SQLite) doing exactly what it is good at, snapshot the analytical slice to hive-partitioned Parquet on a schedule, move files with whatever boring transport you already trust — FTP, in my case — and let DuckDB do the heavy reading on hardware you don't share with your visitors. No warehouse, no streaming infra, no monthly bill. For a multi-region discovery site fed by cron jobs, it answers real product questions in seconds, and the production box never even notices the questions were asked.

If you are running anything on SQLite and finding yourself writing painful analytical queries against it, try the ATTACH ... (TYPE sqlite) one-liner first. Five minutes with that is usually all it takes to be convinced.

DEV Community