ahmet gedik

Posted on Jun 24

Building Graph-Based Video Relationship Queries With Apache AGE

#postgres #graph #php #sql

The related-videos query that brought our read replica to its knees

At ViralVidVault we track viral video trends across European markets, and the single most expensive query in our stack was never the trend ranking or the GDPR-safe analytics rollups. It was the innocent-looking "show me videos related to this one" panel. Related, for us, means something fuzzy and multi-hop: videos that share a channel, share tags, OR are frequently co-viewed in the same anonymized session window. On SQLite that turned into a pile of self-joins and a WITH RECURSIVE block that walked the co-view table two and three hops deep. At one hop it was fine. At two hops, a video with a few thousand co-view edges fanned out to hundreds of thousands of intermediate rows, and our LiteSpeed workers started timing out behind Cloudflare.

The data is a graph. We were pretending it was a set of tables and paying for the lie on every page load. This post is about how we moved the relationship layer to Apache AGE, the openCypher extension for PostgreSQL, kept the rest of the stack (PHP 8.4 application, SQLite as the local trend store, edge caching at Cloudflare) exactly where it was, and cut p95 on the related panel from ~900ms to under 40ms.

I want to be precise about scope: AGE did not replace our primary store. It became a purpose-built read model for traversals. That distinction matters, so I will keep coming back to it.

Why a graph model actually fits viral video data

Relational joins are great when the number of hops is fixed and known at query-design time. Video relationships are neither. Consider the questions our product team kept asking:

Which videos are two hops away through shared channels but NOT directly co-viewed? (Discovery candidates we have not surfaced yet.)
Given a video that is spiking, what is the shortest co-view path to last week's winner? (Trend lineage.)
For a creator, which other creators sit in the same co-view cluster? (Partnership signals.)

Every one of those is a variable-length path query. In SQL you express variable-length traversal with recursion, and recursion over a many-to-many edge table is where the planner stops helping you. In a property graph you express it as a pattern. The difference is not cosmetic; the engine indexes adjacency and stops expanding paths that cannot match.

Apache AGE is attractive specifically because it lives inside PostgreSQL. We did not have to run Neo4j as a separate operational concern with its own backup story, its own auth, its own GDPR data-subject-deletion procedure. AGE is a Postgres extension. A graph is a schema. Cypher runs through a function called cypher() that returns normal rows. Our existing Postgres backup, our existing row-level deletion tooling, and our existing connection pool all kept working.

Installing AGE and modeling the graph

AGE ships as a loadable extension. On a recent Postgres you build it against your server headers or use a packaged build. Once installed, the setup is small. We model two node labels — Video and Channel — and three edge types: UPLOADED_BY, SHARES_TAG, and CO_VIEWED. The co-view edge carries a weight so we can rank paths later.

-- Load the extension and put its catalog on the search path
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- One named graph holds the whole video relationship model
SELECT create_graph('vault');

-- Labels are created lazily on first insert, but we create them
-- explicitly so we can attach property indexes up front.
SELECT create_vlabel('vault', 'Video');
SELECT create_vlabel('vault', 'Channel');
SELECT create_elabel('vault', 'CO_VIEWED');
SELECT create_elabel('vault', 'UPLOADED_BY');
SELECT create_elabel('vault', 'SHARES_TAG');

-- AGE stores properties as agtype (a JSON superset). Index the id
-- property we use as the lookup key for every traversal entrypoint.
CREATE INDEX video_id_idx
  ON vault."Video"
  USING btree (agtype_access_operator(properties, '"id"'::agtype));

That last index is the part people skip and then complain AGE is slow. Every traversal starts by matching a node on a property — here the external video id we already use everywhere else. Without an index on properties->'id', that anchor match is a sequential scan over the whole label, and the fastest traversal in the world cannot save a query that begins with a table scan. Index the entrypoint property and the picture changes completely.

A quick note on agtype: it is AGE's value type, a JSON-compatible superset that also models integers and floats without precision loss. You will see it in every result set. When you pull values back into PHP or Go you are parsing agtype, which for our purposes behaves like slightly stricter JSON.

Loading edges from the existing trend store

Our trend data lives in SQLite with WAL enabled — it is small, fast, and embarrassingly reliable for the ingest side. The graph is a derived projection, rebuilt incrementally. The loader reads new co-view aggregates out of SQLite and upserts nodes and edges into AGE. I wrote it in Python because the nightly trend pipeline is already Python.

The key idea: use MERGE so the load is idempotent. Re-running the loader must not create duplicate nodes or double-counted edges. We batch parameters and let Cypher's MERGE do the dedupe.

import sqlite3
import psycopg

PG = "host=127.0.0.1 dbname=vault user=vault"
SQLITE = "/var/data/trends.db"

UPSERT = """
SELECT * FROM cypher('vault', $$
    MERGE (a:Video {id: $src})
    MERGE (b:Video {id: $dst})
    MERGE (a)-[e:CO_VIEWED]->(b)
    SET e.weight = $w, e.window = $win
$$, %s) AS (v agtype);
"""

def load_co_views(batch_size=500):
    src = sqlite3.connect(SQLITE)
    src.row_factory = sqlite3.Row
    rows = src.execute(
        "SELECT src_id, dst_id, weight, window_day "
        "FROM co_view_edges WHERE exported = 0"
    )

    with psycopg.connect(PG) as pg:
        pg.execute("LOAD 'age';")
        pg.execute('SET search_path = ag_catalog, "$user", public;')
        cur = pg.cursor()
        batch = []
        for r in rows:
            params = {
                "src": r["src_id"], "dst": r["dst_id"],
                "w": float(r["weight"]), "win": r["window_day"],
            }
            # AGE takes Cypher params as a single agtype JSON argument
            cur.execute(UPSERT, (psycopg.types.json.Json(params),))
            batch.append(r["src_id"])
            if len(batch) >= batch_size:
                pg.commit()
                batch.clear()
        pg.commit()
    print("co-view load complete")

if __name__ == "__main__":
    load_co_views()

Two things worth calling out. First, AGE passes Cypher parameters as one agtype object, not as positional placeholders inside the Cypher string — that is why everything goes through a single JSON argument and is referenced as $src, $dst, and so on inside the $$ ... $$ block. Second, we keep the directionality honest: CO_VIEWED is directed from the earlier-watched video to the later one, which lets us ask lineage questions later. If you only care about symmetric similarity, store both directions or query with an undirected pattern.

Querying related videos from PHP

The application is PHP 8.4. We already talk to Postgres for a few subsystems through PDO, so the related-videos lookup is just another prepared statement — except the statement wraps a Cypher block. Here is the production shape of it, trimmed of logging and the Cloudflare cache headers.

<?php
declare(strict_types=1);

final class RelatedVideos
{
    public function __construct(private \PDO $pdo) {}

    /**
     * Videos within two hops via channel OR co-view, excluding the seed.
     * @return list<array{id:string, score:float}>
     */
    public function forVideo(string $videoId, int $limit = 12): array
    {
        $this->pdo->exec("LOAD 'age'");
        $this->pdo->exec('SET search_path = ag_catalog, "$user", public');

        $cypher = <<<CYPHER
            SELECT * FROM cypher('vault', \$\$
                MATCH (seed:Video {id: \$vid})
                MATCH (seed)-[r:CO_VIEWED*1..2]->(rel:Video)
                WHERE rel.id <> \$vid
                WITH rel, reduce(s = 1.0, e IN r | s * e.weight) AS path_score
                RETURN rel.id AS id, max(path_score) AS score
                ORDER BY score DESC
                LIMIT \$lim
            \$\$, :params) AS (id agtype, score agtype);
        CYPHER;

        $params = json_encode(['vid' => $videoId, 'lim' => $limit]);
        $stmt = $this->pdo->prepare($cypher);
        $stmt->bindParam(':params', $params);
        $stmt->execute();

        $out = [];
        foreach ($stmt->fetchAll(\PDO::FETCH_ASSOC) as $row) {
            // agtype strings arrive JSON-quoted; trim and cast.
            $out[] = [
                'id'    => trim($row['id'], '"'),
                'score' => (float) $row['score'],
            ];
        }
        return $out;
    }
}

The escaping is the ugly part. PHP heredoc plus Cypher's $$ delimiters plus AGE's $param syntax means you are escaping dollar signs a lot. My advice from doing this in anger: keep the Cypher in a heredoc, escape exactly the three things — the $$ block delimiters and any literal $ that belongs to Cypher params — and resist the urge to build the query with string concatenation. The :params bind is a real bound parameter, so the user-supplied videoId never touches the SQL text. That is your injection boundary and it stays clean.

Notice CO_VIEWED*1..2 — that is a variable-length pattern, one or two hops. The reduce() multiplies edge weights along each path into a decaying score, and max() keeps the strongest path per destination. Expressing "strongest two-hop path" took one line. The equivalent recursive SQL was forty.

Ranking by weighted co-view paths

The naive version above multiplies raw weights, which biases toward short, heavy edges. For the actual product we wanted a score that rewards multiple independent paths to the same destination — a video reachable three different ways is a better recommendation than one reachable through a single strong fluke edge. AGE handles that because Cypher path patterns naturally enumerate every matching path, and we aggregate across them.

SELECT * FROM cypher('vault', $$
    MATCH (seed:Video {id: 'vid_88213'})
    MATCH p = (seed)-[r:CO_VIEWED*1..2]->(rel:Video)
    WHERE rel.id <> 'vid_88213'
    WITH rel,
         count(p)                                   AS path_count,
         reduce(s = 1.0, e IN r | s * e.weight)      AS this_path
    WITH rel,
         sum(this_path) * (1 + log(path_count + 1))  AS score
    RETURN rel.id AS id, score
    ORDER BY score DESC
    LIMIT 12
$$) AS (id agtype, score agtype);

The count(p) plus log() term is the multi-path bonus. This is the kind of scoring you tune against click-through data, and the point is that the tuning happens in a readable query rather than buried in application code stitching together join results. When a data scientist wants to change the decay or the path-count weighting, they edit Cypher, not a recursive CTE nobody on the team fully trusts.

One caveat learned in production: unbounded variable-length patterns (*1.. with no upper bound) on a dense co-view graph will try to enumerate enormous numbers of paths. Always set an upper bound. We use *1..2 for the live panel and *1..3 only for offline lineage analysis where latency does not matter.

Serving it at the edge without leaking PII

Because we are GDPR-first, the graph deliberately contains no personal data. Nodes are videos and channels; edges are aggregated co-view counts over anonymized session windows, never user identifiers. There is no individual to subject-access or erase inside the graph, which makes the whole thing dramatically simpler to reason about under GDPR — the erasure obligation lives in the raw event store with a short retention window, and the graph only ever sees k-anonymized aggregates.

The related panel is cacheable per video, so it fits naturally at the edge. A thin Go service sits between Cloudflare Workers and Postgres, returning a compact JSON payload that the Worker caches. Go because it pools Postgres connections cleanly and the JSON marshaling is trivial.

package main

import (
    "context"
    "encoding/json"
    "net/http"

    "github.com/jackc/pgx/v5/pgxpool"
)

type Related struct {
    ID    string  `json:"id"`
    Score float64 `json:"score"`
}

func handler(pool *pgxpool.Pool) http.HandlerFunc {
    const q = `SELECT id::text, score::text FROM cypher('vault', $$
        MATCH (s:Video {id: $vid})-[r:CO_VIEWED*1..2]->(rel:Video)
        WHERE rel.id <> $vid
        WITH rel, reduce(x = 1.0, e IN r | x * e.weight) AS sc
        RETURN rel.id, max(sc) ORDER BY max(sc) DESC LIMIT 12
    $$, $1) AS (id agtype, score agtype)`

    return func(w http.ResponseWriter, req *http.Request) {
        vid := req.URL.Query().Get("v")
        params, _ := json.Marshal(map[string]string{"vid": vid})

        rows, err := pool.Query(context.Background(), q, string(params))
        if err != nil {
            http.Error(w, "query failed", 500)
            return
        }
        defer rows.Close()

        out := []Related{}
        for rows.Next() {
            var id, score string
            if err := rows.Scan(&id, &score); err == nil {
                var r Related
                json.Unmarshal([]byte(id), &r.ID)
                json.Unmarshal([]byte(score), &r.Score)
                out = append(out, r)
            }
        }

        // Let Cloudflare cache the panel; it carries no PII.
        w.Header().Set("Cache-Control", "public, max-age=300")
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(out)
    }
}

The agtype values come back as JSON-encoded text when you cast with ::text, which is why json.Unmarshal cleanly strips the quotes from the id and parses the score. A 300-second edge cache absorbs the overwhelming majority of reads, so Postgres only sees a request per video per five minutes per cache node. The graph traversal is cheap; the cache makes it nearly free.

What we measured

Numbers from our European production traffic after the migration, comparing the old recursive-SQL path against AGE:

Related-panel p95 latency: ~900ms down to ~38ms at the origin, sub-millisecond on edge cache hits.
Two-hop traversals that previously materialized 200k+ intermediate rows now expand bounded path sets and return in single-digit milliseconds.
Operational surface stayed flat: no new database to back up, monitor, or secure. AGE rides on the Postgres we already run.
Query readability: the core recommendation logic went from a ~40-line recursive CTE to a ~6-line Cypher pattern that a new engineer can read in one sitting.

Things I would warn you about. AGE's planner is younger than core Postgres's; always EXPLAIN your Cypher and confirm the entrypoint match uses your property index rather than a label scan. Pin your AGE version to your Postgres major version deliberately — mismatches are the most common install failure. And treat the graph as a rebuildable projection, not a source of truth; ours can be dropped and reloaded from the SQLite trend store in minutes, which removes a whole category of operational fear.

Conclusion

The lesson was not "graphs beat SQL." It was that video relationship data is genuinely a graph, variable-length traversal is its native operation, and forcing it through recursive joins meant fighting the query planner on every request. Apache AGE let us adopt a property-graph model and openCypher without leaving the PostgreSQL ecosystem we already trust for backups, security, and GDPR-compliant deletion. We kept SQLite for ingest, kept PHP 8.4 for the app, kept Cloudflare for the edge, and slotted AGE in as a focused read model for one hard problem. If you have a relationship query that fans out across multiple hops and your recursive CTE is starting to scare you, build a small AGE projection beside your existing store and measure it. The migration is far smaller than standing up a separate graph database, and the queries read like the questions your product team actually asks.

DEV Community