ahmet gedik

Posted on May 31

Why Apache AGE Beat Recursive CTEs for Our Video Recommendation Graph

#postgres #graphdb #cypher #php

The recommendation endpoint started timing out at 1.4s p95 once our catalog crossed 180k videos. The culprit was an honest five-level recursive CTE that walked channel → collaboration → channel → video → tag → similar_video. PostgreSQL handles recursive CTEs well in isolation, but they collapse once your branching factor passes ~30 and you have non-trivial filter predicates at every level. We tried indexed materialized views and a nightly-denormalized related_videos table. Both worked. Both went stale within hours on a site where regional trending shifts every few hours.

I run TopVideoHub, an Asia-Pacific multi-language video aggregator covering nine regions (JP, KR, TW, HK, SG, VN, TH, ID, PH) with a heavy CJK content mix. The serving stack is plain: PHP 8.4 on LiteSpeed behind Cloudflare, SQLite with FTS5 and a CJK tokenizer for search, and PostgreSQL for everything that needs joins. The graph problem is not about scale in the absolute sense; we have ~220k videos, ~12k channels, ~40k tags. It is about traversal: "find videos similar to X that are also trending in the viewer's region and have a Mandarin or Japanese variant."

That class of query is what pushed us to Apache AGE. This post walks through the setup, the data model that worked, the PHP integration, and the actual Cypher queries that replaced about 600 lines of recursive SQL.

Why a graph extension and not Neo4j

The short answer: I already have PostgreSQL in production, I already have backup and replication tooling for it, and I do not want a second stateful service in my deploy script. Apache AGE is a Postgres extension that adds property graph storage and openCypher query support. It lives in the same database as the relational tables. I can join a Cypher result back to a regular table in one query. That is hard to give up.

The tradeoffs are real and worth naming up front:

AGE's planner is less mature than Neo4j's. Complex variable-length patterns can pick bad plans
Cypher support is openCypher, not the full Neo4j dialect. No APOC procedures, no Graph Data Science library
Writes go through SQL-wrapped Cypher (SELECT * FROM cypher(...)) which adds a small parse overhead per call
Schema evolution is informal; AGE does not enforce label or property constraints the way Neo4j does

For our workload — read-heavy, mostly two-to-four hop traversals, ingested once per crawl cycle — none of those bit hard enough to justify operating a separate database.

Installing AGE on Postgres 16

The version packaged in Debian's apt is old. I built from source, which takes about two minutes on a modest VPS. Install postgresql-server-dev-16, build-essential, bison, and flex, then clone apache/age, check out the release/PG16/1.5.0 tag, and run make followed by sudo make install. After that, in the target database:

CREATE EXTENSION age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

SELECT create_graph('video_graph');

The LOAD 'age' and SET search_path statements must run on every connection. With PHP's PDO that means putting them in an init query, or wrapping every request in a small bootstrap. I went with the bootstrap path — persistent connections caused stale graph handles once after a failover drill, and the per-request cost is negligible compared to the actual graph traversal.

The graph model

I spent more time on the model than on the code. The wrong shape makes Cypher queries look like SQL — lots of conditions, few traversals. The right shape pushes filters onto edge types instead of property predicates.

Node labels:

Video — (id, yt_id, title, lang, duration_s, published_at, views)
Channel — (id, yt_id, name, country, lang)
Tag — (name) — normalized lowercase, no language variants
Topic — (slug, name) — our curated 40 topics, more stable than raw tags
Region — (code) — the nine ISO region codes

Edge types — this is where the leverage is:

(:Channel)-[:PUBLISHED]->(:Video) — direct authorship
(:Video)-[:TAGGED]->(:Tag) — many-to-many
(:Video)-[:ABOUT]->(:Topic) — curated mapping, often via tag voting
(:Video)-[:TRENDING_IN {rank, snapshot_at}]->(:Region) — refreshed each crawl
(:Video)-[:LANG_VARIANT_OF]->(:Video) — same content, different audio or subs
(:Channel)-[:COLLAB_WITH {count}]->(:Channel) — derived from descriptions and titles

LANG_VARIANT_OF is the edge that makes the multi-language story work. Most "Japanese version of this K-pop clip" relationships were buried in title patterns or in YouTube's own relatedToVideoId results. We extract them in a nightly batch and write the edges explicitly so the recommendation query can follow them with a single hop, instead of fuzzy-matching titles at query time.

Loading data from SQLite into AGE

The crawler writes to SQLite first — it is local to the LiteSpeed app server, so the read path in app/Indexing.php has zero network latency. A separate Python worker tails the SQLite changes and incrementally upserts into Postgres after each crawl finishes.

import sqlite3
import psycopg
from itertools import islice

SRC = sqlite3.connect("/var/tvh/data/videos.db")
SRC.row_factory = sqlite3.Row

DST = psycopg.connect("dbname=tvh_graph user=tvh", autocommit=False)

def batched(it, n):
    it = iter(it)
    while batch := list(islice(it, n)):
        yield batch

def upsert_videos(cur, rows):
    cypher = """
    SELECT * FROM cypher('video_graph', $$
        UNWIND $rows AS r
        MERGE (v:Video {id: r.id})
        SET v.yt_id = r.yt_id,
            v.title = r.title,
            v.lang = r.lang,
            v.duration_s = r.duration_s,
            v.views = r.views
    $$, %s) AS (v agtype)
    """
    cur.execute(cypher, (psycopg.types.json.Jsonb({"rows": rows}),))

with DST.cursor() as cur:
    cur.execute("LOAD 'age'")
    cur.execute("SET search_path = ag_catalog, public")

    src_rows = SRC.execute("""
        SELECT id, yt_id, title, lang, duration_s, views
        FROM videos
        WHERE updated_at > strftime('%s', 'now', '-24 hours')
    """)

    for batch in batched((dict(r) for r in src_rows), 500):
        upsert_videos(cur, batch)
        DST.commit()

A few things worth flagging from production:

AGE accepts parameters as agtype, which psycopg serializes for you via Jsonb — that is the supported contract
MERGE is the upsert. It matches on the labeled node plus key properties, then creates if absent
Batch size of 500 was the sweet spot for us. Larger batches hit the AGE parameter parser's overhead; smaller batches paid more per-statement cost
Do not put MATCH inside an UNWIND if you can avoid it — the planner often cannot push the unwind through and you end up with a Cartesian product

Edges are loaded in a second pass after all nodes exist. For trending edges, I send the ranked video list per region and MERGE the edge with its rank property in one statement:

def link_trending(cur, region_code, ranked_video_ids):
    rows = [{"vid": vid, "rank": i + 1} for i, vid in enumerate(ranked_video_ids)]
    cur.execute("""
        SELECT * FROM cypher('video_graph', $$
            MATCH (r:Region {code: $region})
            UNWIND $rows AS row
            MATCH (v:Video {id: row.vid})
            MERGE (v)-[t:TRENDING_IN]->(r)
            SET t.rank = row.rank,
                t.snapshot_at = timestamp()
        $$, %s) AS (e agtype)
    """, (psycopg.types.json.Jsonb({
        "region": region_code,
        "rows": rows
    }),))

The trending edge has a small operational wart: yesterday's TRENDING_IN edges keep existing unless we delete them. I run a separate sweep at the end of ingest that removes any TRENDING_IN edge with snapshot_at older than the current crawl's start timestamp. Forgetting that sweep was the first incident in production — we briefly recommended videos that had stopped trending three days earlier.

The query that pays for everything

Here is the recommendation query that replaced the recursive CTE. The intent: given a video the user just watched, return up to 20 similar videos that are trending in their region and ideally available in their preferred language.

SELECT * FROM cypher('video_graph', $$
    MATCH (seed:Video {id: $seed_id})
    MATCH (seed)-[:ABOUT]->(t:Topic)<-[:ABOUT]-(candidate:Video)
    WHERE candidate.id <> seed.id

    MATCH (candidate)-[trending:TRENDING_IN]->(:Region {code: $region})

    OPTIONAL MATCH (candidate)-[:LANG_VARIANT_OF*0..1]-(lv:Video)
    WHERE lv.lang = $pref_lang

    WITH candidate, lv, trending.rank AS rank,
         CASE WHEN lv IS NOT NULL THEN 0 ELSE 1 END AS lang_penalty
    RETURN COALESCE(lv.id, candidate.id) AS video_id,
           COALESCE(lv.title, candidate.title) AS title,
           rank, lang_penalty
    ORDER BY lang_penalty ASC, rank ASC
    LIMIT 20
$$, $1) AS (video_id agtype, title agtype, rank agtype, lang_penalty agtype);

The LANG_VARIANT_OF*0..1 is the trick that makes the multi-language story collapse to one query. Zero-or-one hop means: prefer the language-matched variant if it exists, otherwise fall back to the original candidate. COALESCE then picks whichever node is non-null. The ORDER BY puts language-matched results first, then sorts by trending rank inside each group.

Production p95: 88ms cold, 14ms when LiteSpeed's page cache hits the same parameter set. The old CTE was 1.4s p95 with no L1 cache hit, because the URL shape varied too much for the cache key to match across users.

Calling AGE from PHP 8.4

PHP 8.4 does not ship a native AGE adapter, but PDO over pgsql handles it fine. The thin wrapper I use:

<?php
declare(strict_types=1);

final class GraphClient
{
    public function __construct(private \PDO $pdo)
    {
        $this->pdo->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec("LOAD 'age'");
        $this->pdo->exec("SET search_path = ag_catalog, public");
    }

    public function cypher(string $graph, string $query, array $params = []): array
    {
        $sql = sprintf(
            "SELECT * FROM cypher(%s, \$cypher\$%s\$cypher\$, %s) AS (result agtype)",
            $this->pdo->quote($graph),
            $query,
            $this->pdo->quote(json_encode($params, JSON_THROW_ON_ERROR))
        );

        $stmt = $this->pdo->query($sql);
        $rows = [];
        foreach ($stmt as $row) {
            $rows[] = $this->parseAgtype($row['result']);
        }
        return $rows;
    }

    private function parseAgtype(string $raw): mixed
    {
        if (preg_match('/^(.*?)::(?:vertex|edge|path|numeric|text)$/', $raw, $m)) {
            $raw = $m[1];
        }
        return json_decode($raw, true, flags: JSON_THROW_ON_ERROR);
    }
}

// Usage in a controller
$graph = new GraphClient($pdo);
$results = $graph->cypher('video_graph', <<<CYPHER
    MATCH (v:Video {id: \$seed_id})-[:ABOUT]->(t:Topic)<-[:ABOUT]-(rec:Video)
    WHERE rec.id <> v.id
    RETURN rec.id AS id, rec.title AS title
    LIMIT 12
CYPHER, ['seed_id' => $seedId]);

Footguns worth knowing about before you ship this:

AGE appends type tags like ::vertex to every returned value. The parser strips them before json_decode
The $cypher$ … $cypher$ dollar-quoting avoids escape-hell when your query contains single quotes or embedded JSON
Do not use PDO::prepare for the outer SQL. AGE's cypher() function requires the graph name and the Cypher source to be literals at parse time, not bind parameters
Do parameterize the JSON payload (the third argument). That is where user input goes, and PDO::quote is the boundary that prevents Cypher injection

The user-controlled values are inside the JSON blob; the Cypher query body only references parameter names like $seed_id. As long as the JSON encoding is correct, the Cypher parser cannot see attacker-controlled tokens.

Caching layer

LiteSpeed sits in front of PHP-FPM with our page cache. For graph-backed endpoints I use two TTLs:

The /watch/{id} recommendation strip is cached for 2 hours per (video_id, region, lang) tuple, with those three values forming the cache key
The /trending/{region} page caches for 15 minutes. It only varies by region

Cloudflare in front uses s-maxage=600 for the watch page and s-maxage=120 for trending. The graph is the slow tier; everything else has to absorb the rest of the traffic. With this layering, fewer than 4% of requests hit AGE on a normal day. The 96% that hit cache do not care how fast Cypher is.

When the crawler runs (every 4 hours on this site), I purge the graph-dependent cache entries explicitly rather than waiting for TTL expiry. Tagged purges in our page cache and a BAN to Cloudflare's cache rules clean both layers within seconds, and the next request repopulates them with fresh trending data.

What I would skip if doing it again

Three things I burned time on:

Variable-length path patterns longer than four hops. AGE's planner gets unhappy. If you need a six-hop discovery query, materialize it into a derived edge type during ingestion and traverse that
Storing region as a property on Video instead of as an edge to Region. I started that way to keep the schema flat. Re-modeling TRENDING_IN as an edge cut the most common query from 230ms to 30ms because the planner could finally use the relationship index
Using MERGE with too many properties on the matched node. MERGE matches on every property you set in the match pattern — if even one differs, you get a duplicate node. Always MERGE on the unique key alone, and SET everything else in a separate clause

The other lesson — and this applies to any graph database, not just AGE — was to write the ingest pipeline before writing the queries. We had two weeks where the queries were beautiful and the results were wrong because LANG_VARIANT_OF was being computed from noisy title matches. Fixing the ingest fixed the recommendations more than any query tuning ever did.

When this approach does not fit

If your relationship structure is shallow (one or two joins), you do not need a graph. Regular SQL with proper indexes will be faster and simpler. If your write rate exceeds your read rate by a wide margin, AGE's per-statement parse overhead will hurt — a write-heavy event log is the wrong shape for it. If you need OLAP-style analytics over the graph (PageRank, betweenness centrality, community detection), AGE has none of that built in; you would want Neo4j with the Graph Data Science library, or to export the graph to a separate analytics tool.

For a read-heavy aggregator with multi-hop similarity queries and a need to keep operational surface area small, it has been a clean win. The 600 lines of recursive SQL got replaced by about 80 lines of Cypher, the p95 dropped by an order of magnitude, and Postgres is still the only stateful service I run.

Conclusion

Apache AGE is not a Neo4j replacement and it does not try to be. It is a graph layer for people who already have Postgres and want to keep their stack honest. For our Asia-Pacific multi-language video catalog, modeling the relationship structure as a graph — with LANG_VARIANT_OF and TRENDING_IN as first-class edges — turned the recommendation problem from a slow recursive join into a fast labeled traversal. If you are already in Postgres and your bottleneck is traversal rather than scale, it is worth the afternoon of setup.

DEV Community