Using SurrealDB Graph Queries to Power Video Recommendations

#surrealdb #php #database #webdev

Our recommendation logic at ViralVidVault used to be a 280-line SQLite query with six self-joins. Every time a European visitor finished a viral clip, we wanted to answer one question: what should play next? The honest version of that question is a graph traversal — "find videos watched by people who watched this video, weighted by recency and region" — but we were forcing it through a relational engine that treats relationships as join tables. The query took 400ms on a warm cache and timed out on cold ones. Worse, adding a new signal (creator affinity, tag overlap, watch-completion rate) meant another join and another 50ms.

The relational model wasn't wrong, it was just the wrong shape for the data. Recommendations are inherently a graph problem: users, videos, creators, and tags are nodes, and "watched", "published", and "tagged" are edges you want to walk. So we moved that one subsystem to SurrealDB, kept everything else on our PHP 8.4 / SQLite WAL stack, and the next-video query dropped to a consistent 12-30ms. This post is the practical version of how that works, including the GDPR constraints that shaped the schema.

Why a graph database for "watch next"

The core recommendation pattern is collaborative filtering by traversal. Given video A, you want:

Users who watched A (one hop out)
Other videos those users watched (second hop)
Ranked by how often they co-occur, recency, and regional relevance

In SQL that's two joins through a watches table plus aggregation. In a graph database it's a single traversal expression, because the edges are the index. SurrealDB is interesting here specifically because it's multi-model — it gives you graph edges (RELATE), document fields, and SQL-like querying in one engine, over an HTTP API that slots into a PHP app without a new driver ecosystem.

A few things mattered for us beyond raw speed:

Edges carry properties. The watched edge stores completion, region, and a coarse timestamp, so the traversal itself can filter and weight without a separate lookup.
Schemafull tables let us pin types and reject anything we don't want stored — important when the data is behavioral and lives under GDPR.
Embedded or server mode. You can run SurrealDB as a single binary with a RocksDB backend, which fits our "no heavy managed DB" preference.

Modeling the graph (GDPR-first)

The schema is the place where compliance is won or lost. We don't store personal identifiers on the behavioral graph at all. A "viewer" node is a rotating pseudonymous session token, not an account, and it carries a hard expires field. The recommendation quality comes from aggregate co-watch patterns, not from profiling a named individual — which is both better for privacy and, conveniently, what the GDPR's data-minimization principle pushes you toward.

-- SurrealQL schema definition
DEFINE TABLE video SCHEMAFULL;
DEFINE FIELD title       ON video TYPE string;
DEFINE FIELD region      ON video TYPE string ASSERT $value INSIDE ['EU','UK','US','GLOBAL'];
DEFINE FIELD published   ON video TYPE datetime;
DEFINE FIELD tags        ON video TYPE array<string>;
DEFINE INDEX video_region ON video FIELDS region;

-- Pseudonymous viewer: no name, no email, hard TTL
DEFINE TABLE viewer SCHEMAFULL;
DEFINE FIELD session_hash ON viewer TYPE string;
DEFINE FIELD region       ON viewer TYPE string;
DEFINE FIELD expires      ON viewer TYPE datetime;

-- The edge that does the work. Properties live ON the relationship.
DEFINE TABLE watched SCHEMAFULL TYPE RELATION FROM viewer TO video;
DEFINE FIELD completion ON watched TYPE float ASSERT $value >= 0 AND $value <= 1;
DEFINE FIELD region     ON watched TYPE string;
DEFINE FIELD at         ON watched TYPE datetime;

Writing a watch event is a RELATE statement. Notice the edge is created in one call and the completion ratio is stored on the edge, not duplicated onto a node:

-- Record that a viewer watched a video to 80% completion
RELATE viewer:7f3a->watched->video:k9x2 SET
    completion = 0.80,
    region     = 'EU',
    at         = time::now();

Because the viewer table has an expires field, a daily job deletes stale pseudonymous nodes and their edges, which keeps the graph small and satisfies storage-limitation requirements. The recommendation signal degrades gracefully — older co-watch patterns simply age out, which is what you want for a viral video site where relevance has a half-life of days.

The traversal query

Here is the query that replaced the 280-line SQL monster. It reads almost like the English description of the algorithm. Start at the current video, walk back along watched edges to find viewers who watched it well (completion over 60%), then walk forward from those viewers to the other videos they watched, and aggregate.

-- "Watch next" for video:k9x2, scoped to EU region
LET $seed = video:k9x2;

SELECT
    id,
    title,
    region,
    count() AS co_watches,
    math::mean(<-watched.completion) AS avg_completion
FROM (
    SELECT
        ->watched->video.* AS recs
    FROM (
        -- viewers who watched the seed video to >60%
        SELECT in AS viewer
        FROM watched
        WHERE out = $seed
          AND completion > 0.6
    ).viewer
)
SPLIT recs
WHERE recs.id != $seed
  AND recs.region IN ['EU','GLOBAL']
GROUP BY id, title, region
ORDER BY co_watches DESC, avg_completion DESC
LIMIT 12;

The parts that matter:

->watched->video is the second hop. SurrealDB's arrow syntax walks edges directly; there is no join table to name.
WHERE completion > 0.6 filters on the edge property during traversal, so we never score videos that people bailed on.
recs.region IN ['EU','GLOBAL'] enforces our European focus at query time, using the index we defined on region.
SPLIT flattens the array of recommended videos into rows so we can group and count co-occurrences.

On our dataset (~2.1M watch edges, 90-day window) this returns in 12-30ms warm. The cold-start case — a brand-new video with no edges yet — returns nothing, so we fall back to a trending query, covered below.

Wiring it into PHP 8.4

We didn't want a new persistence layer or an ORM. SurrealDB exposes an HTTP /sql endpoint that takes raw SurrealQL and returns JSON, so a thin client over cURL is all you need. Here's the actual shape of our client, using PHP 8.4 features (constructor promotion, readonly props, typed everything):

<?php
declare(strict_types=1);

final class SurrealClient
{
    public function __construct(
        private readonly string $endpoint = 'http://127.0.0.1:8000/sql',
        private readonly string $namespace = 'vvv',
        private readonly string $database  = 'recommendations',
        private readonly string $auth      = 'cm9vdDpyb290', // base64 user:pass
    ) {}

    /** @return array<int, array<string, mixed>> */
    public function query(string $surql): array
    {
        $ch = curl_init($this->endpoint);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $surql,
            CURLOPT_TIMEOUT_MS     => 800,
            CURLOPT_HTTPHEADER     => [
                'Accept: application/json',
                'NS: ' . $this->namespace,
                'DB: ' . $this->database,
                'Authorization: Basic ' . $this->auth,
            ],
        ]);

        $raw  = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($raw === false || $code !== 200) {
            throw new RuntimeException("SurrealDB query failed (HTTP $code)");
        }

        // SurrealDB returns an array of result-sets; we want the last statement's rows
        $decoded = json_decode($raw, true, flags: JSON_THROW_ON_ERROR);
        $last    = end($decoded);

        return $last['result'] ?? [];
    }
}

And the recommendation service that uses it. The key production detail is the fallback chain: graph traversal first, trending second, never an empty player.

<?php
declare(strict_types=1);

final class RecommendationService
{
    public function __construct(
        private readonly SurrealClient $db,
    ) {}

    /** @return array<int, array<string, mixed>> */
    public function watchNext(string $videoId, string $region = 'EU'): array
    {
        // Whitelist the region to avoid query injection on a string interpolation
        $region = in_array($region, ['EU', 'UK', 'GLOBAL'], true) ? $region : 'GLOBAL';
        $id     = preg_replace('/[^a-z0-9:]/', '', strtolower($videoId));

        $rows = $this->db->query(<<<SURQL
            LET \$seed = $id;
            SELECT id, title, count() AS co_watches
            FROM (
                SELECT ->watched->video.* AS recs
                FROM (SELECT in AS v FROM watched
                      WHERE out = \$seed AND completion > 0.6).v
            )
            SPLIT recs
            WHERE recs.id != \$seed AND recs.region IN ['$region','GLOBAL']
            GROUP BY id, title
            ORDER BY co_watches DESC
            LIMIT 12;
            SURQL);

        return $rows !== [] ? $rows : $this->trendingFallback($region);
    }

    /** @return array<int, array<string, mixed>> */
    private function trendingFallback(string $region): array
    {
        return $this->db->query(<<<SURQL
            SELECT id, title, count(<-watched) AS heat
            FROM video
            WHERE region IN ['$region','GLOBAL']
              AND published > time::now() - 7d
            ORDER BY heat DESC
            LIMIT 12;
            SURQL);
    }
}

Two things I'd flag for anyone copying this:

Sanitize interpolated identifiers. SurrealDB supports parameterized queries via the WebSocket/RPC protocol, but over the simple HTTP /sql endpoint you're concatenating. We whitelist the region and strip the video id to [a-z0-9:]. If you take user input anywhere else, use the RPC query with bound $vars instead.
Set a hard CURLOPT_TIMEOUT_MS. A recommendation widget should never block page render. We give it 800ms and fall through to a cached trending list on timeout.

Edge caching with Cloudflare Workers

Most of our traffic is anonymous and regional, which means the same "watch next" response is valid for thousands of EU visitors hitting the same video. There's no reason to recompute the traversal per request. We put a Cloudflare Worker in front of the recommendation endpoint and cache by video_id + region for 90 seconds — short enough that viral spikes still surface fresh co-watch data, long enough to absorb a thundering herd.

export default {
  async fetch(request, env, ctx) {
    const url    = new URL(request.url);
    const video  = url.searchParams.get('v');
    const region = request.cf?.country === 'GB' ? 'UK'
                 : (request.cf?.continent === 'EU' ? 'EU' : 'GLOBAL');

    if (!video) return new Response('missing v', { status: 400 });

    // Cache key includes region so EU and GLOBAL get distinct results
    const cacheKey = new Request(`https://rec.cache/${video}/${region}`, request);
    const cache    = caches.default;

    let res = await cache.match(cacheKey);
    if (res) return res;

    res = await fetch(`${env.ORIGIN}/recommend?v=${video}&region=${region}`);
    res = new Response(res.body, res);
    res.headers.set('Cache-Control', 'public, max-age=90');
    ctx.waitUntil(cache.put(cacheKey, res.clone()));
    return res;
  },
};

The request.cf.continent check is how we derive region without storing IP addresses — Cloudflare gives us the coarse geo at the edge, we map it to one of three buckets, and the origin never sees the visitor's IP for recommendation purposes. That keeps the GDPR surface tiny: no IP logging, no fingerprint, just a continent bucket that's already public metadata of the request.

What got better, and the honest trade-offs

After the migration, the numbers that moved:

Latency: next-video query went from 400ms (SQLite, warm) to 12-30ms (SurrealDB traversal), and the cold path no longer times out.
Maintainability: adding a signal — say, weighting by creator affinity — is a third hop in the traversal, not a fourth join. We added tag-overlap scoring in about 20 lines.
Click-through: the "watch next" CTR rose noticeably once recommendations stopped being region-blind, though I won't pretend I can fully attribute that to the engine versus the region filter we added at the same time.

The trade-offs are real and worth stating plainly:

Operational surface. It's another stateful service to run, back up, and monitor. We kept SQLite WAL as the system of record for everything non-behavioral (videos, blog content, config) and treat the SurrealDB graph as a derived, disposable store. If it falls over, we replay watch events from logs and the site still serves trending.
Query maturity. SurrealQL is expressive but younger than SQL; some aggregations need restructuring and the error messages can be terse. Read the version notes before relying on a function.
It's not a silver bullet. Graph traversal wins when your access pattern is a traversal. For our straightforward lookups and full-text search, SQLite is still faster and simpler. We only moved the one subsystem that was genuinely graph-shaped.

Conclusion

The lesson wasn't "graph databases are faster." It was that we'd been modeling a graph problem relationally and paying for the mismatch on every page view. Moving only the recommendation traversal to SurrealDB — while keeping PHP 8.4, SQLite WAL, and Cloudflare exactly where they were — gave us a 10x latency win on the query that mattered and made new ranking signals cheap to add. The GDPR posture actually improved in the process, because a recommendation graph built on pseudonymous, expiring co-watch edges stores less personal data than the user-profile approach most people reach for first. If you have one query in your stack that's secretly a graph traversal wearing six joins, it's worth isolating and reshaping rather than optimizing in place.