Building Video Heatmap Analytics with HyperLogLog in Postgres

#postgres #analytics #hyperloglog #php

The problem: counting unique viewers per second is a row explosion

A viewer scrubs to 4:12 of a 9-minute trending clip, watches for 40 seconds, jumps back to the intro, then bounces. Multiply that by the few hundred thousand sessions a day that hit a mid-size aggregator and you get the question every product person eventually asks: which parts of this video do people actually watch, and how many distinct people watched each part?

The naive answer is a watch_events table: one row per (user, video, second). It works until it doesn't. A 9-minute video is 540 seconds. One viewer who watches the whole thing generates 540 rows. A million viewers across our catalog generate hundreds of millions of rows per day, and the only query anyone runs against them is COUNT(DISTINCT user_id) GROUP BY second. That COUNT(DISTINCT) is a sort-or-hash over the entire partition every single time someone opens the analytics tab.

At TopVideoHub we aggregate trending video across Asia-Pacific, so a single popular clip can spike from zero to half a million sessions in an afternoon when it lands in the JP and KR feeds simultaneously. We did not want a fact table that grew by hundreds of millions of rows a day to answer a question whose answer is approximately fine. "Roughly 41,000 unique viewers saw the hook at 0:08" is just as actionable as "41,287". That tolerance for approximation is exactly what HyperLogLog is built for, and Postgres has a battle-tested extension for it.

This post is the design we landed on: fixed-size HLL sketches, one per (video, time_bucket), that you can merge, slice, and union across regions in milliseconds. The main app is PHP 8.4 on LiteSpeed behind Cloudflare, with our search layer on SQLite FTS5; the analytics store is a separate Postgres instance, and HLL is what made that store affordable.

Why HyperLogLog instead of COUNT(DISTINCT)

HyperLogLog estimates the cardinality of a set using a fixed amount of memory regardless of how many elements you throw at it. The intuition: hash every element, look at the longest run of leading zeros you've seen, and use that to estimate how many distinct things must have passed through to produce a run that long. Real implementations split the hash space into many registers and average them with a bias-corrected harmonic mean, which is where the accuracy comes from.

The properties that matter for heatmaps:

Fixed size. A sketch with log2m = 12 (4096 registers) is about 2.5 KB and answers for one viewer or fifty million viewers. Our per-bucket storage is constant, not linear in traffic.
Mergeable (unionable). The union of two HLL sketches is itself a valid HLL sketch. This is the killer feature: I can keep one sketch per region per bucket and union them on read to get a global heatmap, or union adjacent buckets to zoom out from per-second to per-10-second resolution. No re-scanning raw events.
Bounded, tunable error. At log2m = 12 the standard error is around 1.6%. For a watch heatmap that is invisible to the human eye.

We use the postgresql-hll extension (the Citus/Aggregate Knowledge implementation). Install and enable it:

-- as a superuser, once per database
CREATE EXTENSION IF NOT EXISTS hll;

-- one sketch per (video, time bucket, region).
-- bucket_sec is the second offset into the video, snapped to a resolution.
CREATE TABLE video_heatmap (
    video_id    text        NOT NULL,
    region      text        NOT NULL,      -- 'JP', 'KR', 'GLOBAL', ...
    bucket_sec  integer     NOT NULL,      -- 0, 1, 2, ... up to video duration
    viewers     hll         NOT NULL,      -- the sketch
    PRIMARY KEY (video_id, region, bucket_sec)
);

-- a covering index for the common 'one video, one region' read
CREATE INDEX video_heatmap_lookup
    ON video_heatmap (video_id, region, bucket_sec)
    INCLUDE (viewers);

The hll type stores the serialized sketch directly in the row. Postgres ships aggregate functions (hll_add_agg, hll_union_agg) and scalar functions (hll_add, hll_cardinality, || for union) that operate on it.

Ingesting watch progress without an event table

The client sends a heartbeat: every few seconds the player reports the range of seconds the viewer has been present for since the last beat, plus an anonymous, salted viewer id. We do not store the heartbeat. We fold it straight into the sketches.

The core write is "add this viewer's hash to every bucket they watched." In SQL that's hll_add per bucket, but doing it per-second per-request is chatty. Instead we batch on the ingestion worker. Here is the Python ingester that consumes heartbeats off a queue and applies them in one statement per heartbeat using generate_series to expand the watched range:

import hashlib
import os
import psycopg

DB = os.environ["ANALYTICS_DSN"]  # e.g. postgresql://app@db/analytics
VIEWER_SALT = os.environ["VIEWER_SALT"].encode()

# 64-bit hash of the viewer id. hll_hash_bigint expects a bigint.
def viewer_hash(viewer_id: str) -> int:
    digest = hashlib.blake2b(viewer_id.encode() + VIEWER_SALT, digest_size=8).digest()
    # interpret as signed 64-bit; Postgres bigint is signed
    return int.from_bytes(digest, "big", signed=True)

def record_watch(conn, video_id: str, region: str,
                 start_sec: int, end_sec: int, viewer_id: str) -> None:
    h = viewer_hash(viewer_id)
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO video_heatmap (video_id, region, bucket_sec, viewers)
            SELECT %(vid)s, %(region)s, g.b,
                   hll_add(hll_empty(), hll_hash_bigint(%(h)s))
            FROM generate_series(%(start)s, %(end)s) AS g(b)
            ON CONFLICT (video_id, region, bucket_sec)
            DO UPDATE SET viewers = video_heatmap.viewers
                                    || hll_add(hll_empty(),
                                               hll_hash_bigint(%(h)s));
            """,
            {"vid": video_id, "region": region, "h": h,
             "start": start_sec, "end": end_sec},
        )

if __name__ == "__main__":
    with psycopg.connect(DB, autocommit=False) as conn:
        # in production these come off Redis/SQS; one demo row here
        record_watch(conn, "yt_abc123", "JP", start_sec=8, end_sec=47,
                     viewer_id="sess_91f2")
        conn.commit()

A few deliberate choices in that statement:

hll_hash_bigint is the extension's own hash. Always hash through it rather than feeding raw ids; HLL accuracy depends on a well-distributed hash, and mixing hash functions across writes corrupts the sketch. We pre-hash the viewer id with BLAKE2b only to anonymize and to fit a bigint, then let hll_hash_bigint do the HLL-specific mixing.
|| is union. video_heatmap.viewers || hll_add(...) folds the new viewer into the existing sketch. Adding the same viewer to the same bucket twice is a no-op for the cardinality estimate, which is exactly the dedup behavior we want for "unique viewers."
generate_series(start, end) expands the watched range into one row per bucket in a single round trip. A viewer present for 40 seconds touches 40 buckets in one statement.

The salt is important and easy to get wrong: it must be stable for the retention window, or the same person counts as many. We rotate it daily and keep heatmaps at daily granularity, then union days on read for weekly/monthly views — which works precisely because unions are free.

Reading the heatmap

Now the part that pays for itself. The heatmap read is a scan of one video's buckets with a cardinality estimate per bucket. Because each sketch is fixed-size and pre-aggregated, this is a tight index scan, not a COUNT(DISTINCT) over raw rows.

If a region is requested, read that region directly. If "global" is requested, union across regions per bucket with hll_union_agg:

-- per-second heatmap for one video, unioned across all regions
SELECT bucket_sec,
       hll_cardinality(hll_union_agg(viewers))::int AS unique_viewers
FROM   video_heatmap
WHERE  video_id = $1
GROUP  BY bucket_sec
ORDER  BY bucket_sec;

To zoom out from per-second to per-10-second resolution, union the buckets — no loss of correctness, because unioning sketches and then estimating is the right order of operations (estimating first and summing would double-count anyone present across buckets):

-- 10-second resolution, JP only
SELECT (bucket_sec / 10) * 10 AS bucket_start,
       hll_cardinality(hll_union_agg(viewers))::int AS unique_viewers
FROM   video_heatmap
WHERE  video_id = $1
  AND  region = 'JP'
GROUP  BY bucket_sec / 10
ORDER  BY bucket_start;

That second query is the one I want to stress, because it is the whole argument for HLL. With a raw event table, changing the bucket resolution means re-running COUNT(DISTINCT user_id) with a different GROUP BY over every raw row — and you cannot precompute per-second distinct counts and sum them, because a viewer present at second 3 and second 7 would be counted twice in a 10-second bucket. HLL solves both problems at once: precompute fine-grained sketches, union them at any granularity on read, estimate last.

Wiring it into the PHP app

The public site is PHP 8.4. The analytics dashboard endpoint reads from Postgres and returns a normalized array the frontend renders as a bar strip under the player. We compute a retention curve (cumulative unique viewers who reached each point) and a raw per-bucket count in one pass:

<?php
declare(strict_types=1);

final class HeatmapRepository
{
    public function __construct(private readonly \PDO $pdo) {}

    /**
     * @return list<array{bucket:int, viewers:int, retention:float}>
     */
    public function forVideo(string $videoId, string $region = 'GLOBAL', int $resolution = 5): array
    {
        $sql = <<<SQL
            SELECT (bucket_sec / :res) * :res AS bucket,
                   hll_cardinality(hll_union_agg(viewers))::int AS viewers
            FROM   video_heatmap
            WHERE  video_id = :vid
              AND  (:region = 'GLOBAL' OR region = :region)
            GROUP  BY bucket_sec / :res
            ORDER  BY bucket
        SQL;

        $stmt = $this->pdo->prepare($sql);
        $stmt->execute(['vid' => $videoId, 'region' => $region, 'res' => $resolution]);
        $rows = $stmt->fetchAll(\PDO::FETCH_ASSOC);

        // peak unique viewers is our denominator for the retention curve
        $peak = 0;
        foreach ($rows as $r) {
            $peak = max($peak, (int) $r['viewers']);
        }
        $peak = max($peak, 1);

        return array_map(static fn(array $r): array => [
            'bucket'    => (int) $r['bucket'],
            'viewers'   => (int) $r['viewers'],
            'retention' => round((int) $r['viewers'] / $peak, 4),
        ], $rows);
    }
}

// usage inside a controller
$pdo = new \PDO(getenv('ANALYTICS_DSN'), options: [
    \PDO::ATTR_ERRMODE => \PDO::ERRMODE_EXCEPTION,
]);
$repo = new HeatmapRepository($pdo);
header('Content-Type: application/json');
header('Cache-Control: public, max-age=300');   // Cloudflare/LiteSpeed will cache the strip
echo json_encode($repo->forVideo($_GET['v'], $_GET['region'] ?? 'GLOBAL'), JSON_THROW_ON_ERROR);

Note the Cache-Control: public, max-age=300. Heatmaps move slowly and the estimate is approximate anyway, so we let LiteSpeed and Cloudflare cache the JSON strip for five minutes. A trending video getting hammered serves the heatmap almost entirely from edge cache; Postgres sees one read per video per five minutes, not one per viewer.

Compaction: hot writes versus cold reads

There is one operational wrinkle. The ON CONFLICT DO UPDATE write path rewrites the sketch on every heartbeat, and a hot video produces a lot of contention on the same (video, region, bucket) rows. Two things keep this healthy.

First, shard the write, union on read. Instead of every worker updating the same row, each ingestion worker writes to its own partition keyed by a worker id, and a periodic compaction job unions partitions down into the canonical row. The read query already uses hll_union_agg, so reading across partitions needs no change — you just drop the worker-id predicate.

Second, let HLL pick its own representation. The extension stores small sketches in a sparse "explicit/sparse" mode and only promotes to the full dense representation once a bucket has accumulated enough distinct viewers. Buckets on the long tail of your catalog — the videos nobody scrubbed past second 2 — stay tiny. You can tune the promotion thresholds on the type itself:

-- log2m=12 (4096 registers, ~1.6% error), regwidth=5,
-- sparse->dense threshold and explicit threshold left at defaults.
ALTER TABLE video_heatmap
    ALTER COLUMN viewers TYPE hll(12, 5);

-- nightly compaction: fold worker shards into the canonical region rows
INSERT INTO video_heatmap (video_id, region, bucket_sec, viewers)
SELECT video_id, region, bucket_sec, hll_union_agg(viewers)
FROM   video_heatmap_shards
WHERE  captured_on = current_date - 1
GROUP  BY video_id, region, bucket_sec
ON CONFLICT (video_id, region, bucket_sec)
DO UPDATE SET viewers = video_heatmap.viewers || EXCLUDED.viewers;

TRUNCATE video_heatmap_shards;   -- or DELETE the compacted partition

With log2m = 12 every dense sketch is ~2.6 KB. A 10-minute video at per-second resolution across 8 regions is 600 * 8 ≈ 4800 rows, well under 13 MB even if every bucket went dense — which they don't, because the tail stays sparse. Compare that to the raw event table, where a single popular video could be tens of millions of rows for the same query.

What we gave up, and why it was fine

HLL is an estimate, so a few capabilities are simply off the table and you should know them going in:

No exact counts. If finance needs a billing-grade unique count, HLL is the wrong tool. For product analytics, the ~1.6% error is noise.
No per-user drill-down. You cannot ask "did viewer X watch the ending" — the sketch threw away identities by design. That's a feature for privacy (we keep no per-viewer watch log) but a constraint if you wanted cohort replay.
No set difference. HLL unions cleanly but does not support intersection or subtraction accurately. "Viewers who saw the intro but not the ending" is not answerable from sketches alone; inclusion-exclusion on HLL compounds error badly. We keep a separate, sampled raw stream for the rare questions that need it.

None of those mattered for the heatmap, which only ever asks "how many distinct people were present at this point."

Takeaways

If you're building watch heatmaps or any "unique-things-per-bucket-over-time" analytics, HyperLogLog in Postgres turns an unbounded, ever-scanning fact table into a fixed-size, pre-aggregated, mergeable store:

One hll sketch per (video, region, time_bucket) keeps storage constant in traffic.
hll_union_agg lets you union across regions and re-bucket on read with no re-scan and no double-counting — union first, estimate last.
Hash through hll_hash_bigint, keep your anonymizing salt stable across the retention window, and let the sparse/dense representation handle the long tail.
Shard writes, compact nightly, and cache the JSON strip at the edge so hot videos barely touch the database.

We shipped this against a Postgres instance an order of magnitude smaller than the raw-events design would have demanded, and the analytics tab renders the per-second curve in well under a frame budget. For a metric that's allowed to be approximately right, that's the trade you want.