ahmet gedik

Posted on Jul 3

Querying Video Relationships with Apache AGE Graph Extension on Postgres

#postgres #graphdatabase #php #sql

The recommendation query that SQLite could not answer cheaply

We run discovery for a catalog of streaming titles spread across eight regions. For a long time the core data lived in SQLite with FTS5, and that combination is genuinely excellent for what it does: full-text search over titles, descriptions, and cast lists is fast, the database is a single file, and our FTP-based deploy pipeline ships it without any server-side database daemon to babysit. At TrendVidStream the constraint that finally broke this model was not text search at all — it was relationships.

The question users actually ask, implicitly, through their clicks is: "I just watched this thriller that streams on three platforms in Germany, with this director and two of these actors — what else is two or three hops away from it?" That is a graph traversal. In SQL you express it as a pile of self-joins against a junction table, and each additional hop is another join. Two hops is tolerable. Three hops with filtering on region availability and platform is a query that I did not want to maintain, and that planned badly the moment the join fan-out got wide.

So we moved the relationship layer — and only the relationship layer — onto PostgreSQL with the Apache AGE extension, which gives you openCypher graph queries inside Postgres. FTS5 still does what it is good at. This post is about why that split works, how the graph is modeled, and the exact queries that replaced the join pyramids.

What Apache AGE actually gives you

Apache AGE ("A Graph Extension") is a Postgres extension that adds a property-graph layer and the cypher() function so you can run openCypher inside a normal SQL statement. The important properties for a backend engineer:

It is just Postgres. Same connection, same transactions, same backup tooling, same psql. You are not adopting a separate database product like Neo4j with its own operational surface.
Graph data lives in a graph (a named schema of vertex and edge tables). You can mix Cypher results back into SQL with a regular JOIN.
Vertices and edges carry JSON-ish properties (agtype), so a video node can hold title, year, and runtime without a rigid column-per-attribute schema.
Traversals of variable depth (-[*1..3]-) are first-class, which is the entire reason we are here.

The trade-off is honest: AGE is not as fast as a dedicated native graph engine at billion-edge scale, and agtype casting is verbose. For a catalog in the low millions of edges — which is where a regional streaming discovery graph sits — it is more than enough, and the operational simplicity of "it's still Postgres" wins.

Installing and creating the graph

AGE ships as an extension you load per session (or preload). Once installed against your Postgres build, setup is three statements:

-- Load the extension and put its functions on the search path
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Create a named graph that will hold our video relationship model
SELECT create_graph('discovery');

Every Cypher statement is wrapped in the cypher() function and must declare the shape of its return columns. That declaration is the part that trips people up, so here is the minimal create-and-read pattern:

-- Create two video vertices and a SIMILAR_TO edge between them
SELECT * FROM cypher('discovery', $$
    CREATE (a:Video {ext_id: 'vid_8842', title: 'Cold Harbor', year: 2024, runtime: 118}),
           (b:Video {ext_id: 'vid_9001', title: 'Night Freight', year: 2023, runtime: 104}),
           (a)-[:SIMILAR_TO {score: 0.82}]->(b)
$$) AS (created agtype);

-- Read it back
SELECT * FROM cypher('discovery', $$
    MATCH (a:Video)-[r:SIMILAR_TO]->(b:Video)
    RETURN a.title, r.score, b.title
$$) AS (source agtype, score agtype, target agtype);

The AS (...) column list is mandatory and must match the arity of RETURN. Forget a column and Postgres throws immediately — annoying at first, then a useful guardrail.

The graph model for a multi-region catalog

Our relationship model has four vertex labels and a handful of edge types. Keeping the label set small is deliberate; every extra label is another thing to index and reason about.

Vertex labels:

Video — the title itself, keyed by the ext_id we already use in SQLite so the two stores stay reconcilable.
Person — directors and actors, with a role carried on the edge, not the node, because the same person both directs and acts.
Platform — Netflix, Prime, a regional broadcaster, etc.
Region — one node per region code (US, DE, JP, …), which lets availability be a graph edge instead of a denormalized column.

Edge types:

(:Person)-[:WORKED_ON {role}]->(:Video)
(:Video)-[:SIMILAR_TO {score}]->(:Video) — our computed similarity, directional so we can store asymmetric scores.
(:Video)-[:AVAILABLE_IN]->(:Region)
(:Video)-[:STREAMS_ON {region}]->(:Platform) — platform availability is region-scoped, so the region lives on the edge.

Indexes matter as much here as in plain Postgres. AGE stores vertices in a table per label, and you index their properties via the underlying storage. The lookup we do constantly is "find the node for this ext_id", so:

-- Index the ext_id property on Video vertices for O(log n) entry-point lookups
CREATE INDEX video_ext_id_idx
    ON discovery."Video"
    USING btree (agtype_access_operator(properties, '"ext_id"'::agtype));

Without that index, every traversal that starts from a known title does a sequential scan to find its starting vertex, and that single mistake will dominate your query time. Index the entry points; the traversal itself is the cheap part once you are inside the graph.

The query that replaced the join pyramid

Here is the real payoff. The "more like this" query for a watch page: starting from one video, walk up to three relationship hops, but only return candidates that are actually available in the viewer's region. In SQL this was four self-joins against junction tables plus a WHERE on the region table. In Cypher it reads almost like the sentence I used to describe it:

SELECT * FROM cypher('discovery', $$
    MATCH (seed:Video {ext_id: 'vid_8842'})
    MATCH (seed)-[:SIMILAR_TO|WORKED_ON*1..3]-(rec:Video)-[:AVAILABLE_IN]->(:Region {code: 'DE'})
    WHERE rec.ext_id <> 'vid_8842'
    RETURN DISTINCT rec.ext_id, rec.title, rec.year
    LIMIT 24
$$) AS (ext_id agtype, title agtype, year agtype);

The *1..3 is the variable-length traversal. The pipe in :SIMILAR_TO|WORKED_ON means "follow either edge type", so a recommendation can arrive via similarity or via a shared director/actor, in any combination, within three hops. The region filter is just another pattern in the same MATCH. Adding a fourth hop is changing one digit, not writing another join.

The DISTINCT matters: a popular title is reachable by many paths, and without it you get the same ext_id back a dozen times. Push the de-dup and the LIMIT into Cypher so the SQL layer never sees the fan-out.

Calling it from PHP 8.4

Our application is PHP, and the graph layer is reached over PDO exactly like any other Postgres query. The one wrinkle is agtype: results come back as JSON-ish strings (often quoted), so a thin decode helper keeps the calling code clean. PHP 8.4 property hooks and readonly make the result object pleasant.

<?php
declare(strict_types=1);

final class VideoGraph
{
    public function __construct(private readonly \PDO $pdo) {}

    /**
     * Return up to $limit recommended ext_ids reachable within 3 hops
     * of $seedExtId and available in $regionCode.
     *
     * @return list<array{ext_id: string, title: string, year: int}>
     */
    public function recommendations(string $seedExtId, string $regionCode, int $limit = 24): array
    {
        $cypher = <<<CYPHER
            MATCH (seed:Video {ext_id: \$seed})
            MATCH (seed)-[:SIMILAR_TO|WORKED_ON*1..3]-(rec:Video)-[:AVAILABLE_IN]->(:Region {code: \$region})
            WHERE rec.ext_id <> \$seed
            RETURN DISTINCT rec.ext_id, rec.title, rec.year
            LIMIT \$limit
        CYPHER;

        // AGE passes Cypher parameters as a single JSON object argument.
        $params = json_encode([
            'seed'   => $seedExtId,
            'region' => $regionCode,
            'limit'  => $limit,
        ], JSON_THROW_ON_ERROR);

        $sql = "SELECT * FROM cypher('discovery', \$cy\$ {$cypher} \$cy\$, :params) "
             . 'AS (ext_id agtype, title agtype, year agtype)';

        $stmt = $this->pdo->prepare($sql);
        $stmt->execute([':params' => $params]);

        return array_map(
            static fn(array $row): array => [
                'ext_id' => self::decode($row['ext_id']),
                'title'  => self::decode($row['title']),
                'year'   => (int) self::decode($row['year']),
            ],
            $stmt->fetchAll(\PDO::FETCH_ASSOC),
        );
    }

    /** agtype scalars arrive JSON-encoded; strip the quoting. */
    private static function decode(string $raw): string
    {
        return (string) json_decode($raw, false, 512, JSON_THROW_ON_ERROR);
    }
}

Two things worth calling out. First, AGE supports a parameter object as the third argument to cypher(), so user-supplied values (the seed id, the region) never get string-concatenated into the query — that is your injection boundary. Second, the dollar-quoted $cy$ ... $cy$ delimiter avoids fighting with single quotes inside the Cypher body.

Keeping the graph in sync with the source of truth

The relationship data is derived. SQLite plus our regional cron jobs remain the source of truth for what titles exist and where they stream; the graph is a projection we rebuild and incrementally update. We sync it with a small Python job that runs at the tail of the same multi-region cron that refreshes the catalog. It pulls changed rows and upserts vertices and edges idempotently with Cypher MERGE.

import json
import psycopg

GRAPH = "discovery"

UPSERT_VIDEO = """
SELECT * FROM cypher(%(graph)s, $$
    MERGE (v:Video {ext_id: $ext_id})
    SET v.title = $title, v.year = $year, v.runtime = $runtime
    RETURN v.ext_id
$$, %(params)s) AS (ext_id agtype);
"""

UPSERT_AVAIL = """
SELECT * FROM cypher(%(graph)s, $$
    MATCH (v:Video {ext_id: $ext_id})
    MERGE (r:Region {code: $region})
    MERGE (v)-[:AVAILABLE_IN]->(r)
    RETURN r.code
$$, %(params)s) AS (code agtype);
"""


def sync_video(cur, row: dict) -> None:
    """row comes straight from the SQLite export for one changed title."""
    cur.execute(UPSERT_VIDEO, {
        "graph": GRAPH,
        "params": json.dumps({
            "ext_id": row["ext_id"],
            "title": row["title"],
            "year": int(row["year"]),
            "runtime": int(row["runtime"]),
        }),
    })
    for region in row["regions"]:  # e.g. ['US', 'DE', 'JP']
        cur.execute(UPSERT_AVAIL, {
            "graph": GRAPH,
            "params": json.dumps({"ext_id": row["ext_id"], "region": region}),
        })


def run(dsn: str, changed_rows: list[dict]) -> int:
    with psycopg.connect(dsn) as conn:
        conn.execute("LOAD 'age';")
        conn.execute('SET search_path = ag_catalog, "$user", public;')
        with conn.cursor() as cur:
            for row in changed_rows:
                sync_video(cur, row)
        conn.commit()
    return len(changed_rows)

MERGE is the idempotent verb: "match this pattern, or create it if absent." Running the sync twice produces the same graph, which means a half-failed cron run is safe to simply re-run. We process changed rows only — full rebuilds happen weekly during a quiet window — so the per-cycle cost stays proportional to churn, not catalog size.

One operational note specific to our FTP-based deploy world: the graph lives on the Postgres host, not in the shipped application bundle, so it is decoupled from the FTP deploy entirely. Code deploys do not touch graph state, and graph rebuilds do not require a deploy. That separation removed a whole class of "did the data and the code ship together?" anxiety.

Mixing Cypher results back into SQL

The feature I underestimated is that a cypher() call is just a table-returning function, so you can join its output against ordinary relational tables. We keep view-count and rating metrics in regular Postgres tables (they update far too often to live as graph properties). Ranking graph recommendations by popularity is one statement:

WITH recs AS (
    SELECT (rec->>'ext_id') AS ext_id
    FROM cypher('discovery', $$
        MATCH (seed:Video {ext_id: 'vid_8842'})
        MATCH (seed)-[:SIMILAR_TO|WORKED_ON*1..3]-(rec:Video)
        RETURN DISTINCT rec
    $$) AS (rec agtype)
)
SELECT r.ext_id, m.view_count_7d, m.avg_rating
FROM recs r
JOIN video_metrics m ON m.ext_id = r.ext_id
WHERE m.region = 'DE'
ORDER BY m.view_count_7d DESC
LIMIT 24;

Graph does the connectivity — what is related to what. Relational does the aggregates — how popular each candidate is right now. Trying to force either job onto the wrong tool is what made the original all-SQLite design painful. Letting each layer do what it is good at, in the same database engine, is the whole point.

What I would tell my past self

If you are weighing this move, the decision points that actually mattered for us:

Adopt AGE only for genuine multi-hop relationship queries. If your hardest query is two joins, you do not need a graph; keep it relational. The win shows up at depth three and beyond, and with variable-length or alternating-edge traversals.
Index your entry-point properties first. The single biggest performance cliff is an un-indexed ext_id lookup that forces a sequential scan before the traversal even begins.
Keep the graph as a projection. Treat your existing store as the source of truth and rebuild the graph idempotently with MERGE. This makes failed syncs trivially recoverable and keeps the two stores reconcilable by shared keys.
Push DISTINCT and LIMIT into Cypher. Popular nodes are reachable by many paths; de-duplicate inside the traversal, not in application code.
Parameterize Cypher with the params object. Never concatenate user input into the query body; AGE gives you a proper parameter boundary.

We did not replace SQLite and FTS5 — text search is still theirs, and the single-file deploy story is too good to give up. We added a graph layer for the one shape of question that relational joins handle badly, and it stayed boring to operate because it never stopped being Postgres. That combination — the right tool for connectivity, the familiar tool for everything else — is the version of this I would build again.

DEV Community