The related-videos rail that ate our query planner
For about two years the "related videos" rail on TrendVidStream was a SQLite query with four self-joins over a views table, filtered by region, ordered by a hand-tuned co-occurrence score. It worked when we had a few hundred thousand view rows. It stopped working the week we crossed eight regions and started ingesting trending videos on a per-region cron. The query that powered "people who watched this also watched" went from 30ms to 900ms, the planner started picking the wrong index depending on which region's data was hot, and every attempt to make it region-aware added another JOIN and another WHERE region = ? that the optimizer had to reason about.
The root problem is that recommendation is not a relational problem pretending to be hard. It is a graph problem being forced through a relational engine. "Viewers who watched A also watched B" is a two-hop traversal: video → viewers → videos. In SQL that is a self-join on the bridge table, and every extra hop is another join. In a graph database it is a path expression you write once.
We kept SQLite FTS5 for full-text search — it is genuinely excellent for that and our PHP 8.4 stack is built around it — but we moved the recommendation graph into SurrealDB. This post is the honest version of how that went: the schema, the traversal queries, how we call them from PHP, how the multi-region cron feeds the graph, and the parts that bit us.
Why a graph model actually fits video discovery
Video discovery is a web of typed relationships, not a set of rows. A viewer watched a video. A video belongs to a channel. A channel publishes in a region. A video is tagged with topics. The recommendation you want is a walk over those edges: start at the video the user is on, walk to the viewers who watched it, walk forward to what else those viewers watched, then bias the result by shared tags and the user's region.
SurrealDB models this directly. Records live in tables, but relationships are first-class edge records created with RELATE, and you traverse them with arrow syntax: ->watched->video. The arrow direction is the edge direction, and you can chain arrows to express multi-hop paths inline. That single feature is what collapses our four-join SQL into one readable line.
There are three properties that made it worth the operational cost of running a second datastore:
-
Traversal cost is proportional to the edges you walk, not the size of the tables. A two-hop recommendation touches the neighborhood of one video, not the whole
viewstable. -
Edges carry data. A
watchededge can store watch percentage, timestamp, and region, so we can weight a recommendation by how much of a video someone actually watched rather than treating a 3-second bounce the same as a full view. -
Queries read like the question. When the on-call engineer reads
->watched<-watched->video, they can see it is "co-viewers' other videos" without reverse-engineering a join graph.
Defining the schema and the edges
SurrealDB is schemaless by default, but for a recommendation graph you want a schema so that bad writes from a flaky cron run get rejected instead of silently corrupting traversals. Here is the SurrealQL we run on migration. Note the edge tables (watched, tagged) defined with TYPE RELATION and constrained endpoints.
-- Run once via the /sql HTTP endpoint or the CLI
DEFINE TABLE video SCHEMAFULL;
DEFINE FIELD title ON video TYPE string;
DEFINE FIELD channel ON video TYPE record<channel>;
DEFINE FIELD region ON video TYPE string ASSERT $value INSIDE ['us','gb','de','fr','in','br','au','ca'];
DEFINE FIELD duration ON video TYPE int;
DEFINE FIELD published ON video TYPE datetime;
DEFINE INDEX video_region ON video FIELDS region;
DEFINE TABLE viewer SCHEMAFULL;
DEFINE FIELD region ON viewer TYPE string;
-- Edge: viewer -> watched -> video, carrying engagement data
DEFINE TABLE watched TYPE RELATION IN viewer OUT video SCHEMAFULL;
DEFINE FIELD percent ON watched TYPE float ASSERT $value >= 0 AND $value <= 1;
DEFINE FIELD region ON watched TYPE string;
DEFINE FIELD at ON watched TYPE datetime;
-- Edge: video -> tagged -> topic
DEFINE TABLE topic SCHEMAFULL;
DEFINE TABLE tagged TYPE RELATION IN video OUT topic SCHEMAFULL;
Writing an edge is a RELATE statement. When our ingest sees a watch event, it does one upsert per video and one relate per view:
-- Ensure the nodes exist (idempotent), then connect them
UPSERT video:dQw4 SET title = 'Region trends recap', region = 'gb', duration = 612;
UPSERT viewer:ix92 SET region = 'gb';
RELATE viewer:ix92 -> watched -> video:dQw4
SET percent = 0.83, region = 'gb', at = time::now();
The RELATE creates a watched record whose in is the viewer and whose out is the video. From now on the edge is traversable in both directions: forward from the viewer with ->watched->video, and backward from the video with <-watched<-viewer.
The recommendation query in one traversal
Here is the query that replaced the four-join SQL. Read it as the sentence it is: starting from a video, walk back to the viewers who watched it with real engagement, then forward to the other videos those viewers watched, count how often each appears, and rank.
-- $vid is the video the user is currently watching, e.g. video:dQw4
-- $region is the requesting user's region
LET $rec = (
SELECT
out AS video,
count() AS co_views,
math::mean(percent) AS avg_engagement
FROM (
-- co-viewers' other watches, excluding the source video
SELECT ->watched.out AS out, ->watched.percent AS percent
FROM video:dQw4<-watched<-viewer->watched
WHERE percent >= 0.4
)
WHERE out != video:dQw4
GROUP BY video
);
SELECT
video,
co_views,
avg_engagement,
-- region match is a soft boost, not a hard filter
(co_views * avg_engagement) * IF video.region = $region { 1.5 } ELSE { 1.0 } AS score
FROM $rec
ORDER BY score DESC
LIMIT 12;
A few things worth calling out, because they are the difference between a demo and something that survives production:
-
The
percent >= 0.4filter on the edge is doing real work. It means a co-view only counts if the co-viewer actually watched 40% of the source. Drive-by clicks do not pollute recommendations. -
Region is a soft boost (
* 1.5), not aWHEREclause. This was a deliberate change from the SQL version. Hard-filtering by region gave users in smaller regions an empty rail. A multiplier keeps global hits visible while favoring local content. - The grouping and scoring happen server-side in one round trip. The PHP layer receives 12 ranked rows, not a result set it has to sort.
Calling it from PHP 8.4
Our application layer is PHP 8.4, deployed over FTP to LiteSpeed hosts, so we do not get to install a fancy SurrealDB extension on every box. That is fine — SurrealDB exposes an HTTP /sql endpoint that takes raw SurrealQL and returns JSON, so plain curl is all we need. Here is the client we actually use, trimmed to the relevant parts. PHP 8.4's typed properties and constructor promotion keep it tidy.
<?php
declare(strict_types=1);
final class SurrealClient
{
public function __construct(
private readonly string $endpoint = 'http://127.0.0.1:8000/sql',
private readonly string $namespace = 'trendvid',
private readonly string $database = 'discovery',
private readonly string $user = 'rec_ro',
private readonly string $pass = '',
) {}
/** @return list<array<string,mixed>> rows from the final statement */
public function query(string $surql, array $vars = []): array
{
$ch = curl_init($this->endpoint);
// SurrealDB takes bound vars as headers prefixed with the var name,
// but for dynamic values we interpolate server-side LET is safer.
curl_setopt_array($ch, [
CURLOPT_POST => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_TIMEOUT => 3,
CURLOPT_HTTPHEADER => [
'Accept: application/json',
'Content-Type: text/plain',
'NS: ' . $this->namespace,
'DB: ' . $this->database,
],
CURLOPT_USERPWD => "{$this->user}:{$this->pass}",
CURLOPT_POSTFIELDS => $surql,
]);
$raw = curl_exec($ch);
if ($raw === false) {
throw new RuntimeException('SurrealDB unreachable: ' . curl_error($ch));
}
curl_close($ch);
/** @var list<array{status:string,result:mixed}> $batch */
$batch = json_decode($raw, true, flags: JSON_THROW_ON_ERROR);
$last = end($batch) ?: ['result' => []];
if (($last['status'] ?? '') !== 'OK') {
throw new RuntimeException('Query failed: ' . json_encode($last));
}
return is_array($last['result']) ? $last['result'] : [];
}
}
function relatedVideos(SurrealClient $db, string $videoId, string $region): array
{
// SurrealDB returns one result block per statement; we read the last.
$surql = <<<SURQL
LET \$vid = type::thing('video', '{$videoId}');
SELECT video, co_views, avg_engagement,
(co_views * avg_engagement) *
IF video.region = '{$region}' { 1.5 } ELSE { 1.0 } AS score
FROM (
SELECT out AS video, count() AS co_views, math::mean(percent) AS avg_engagement
FROM (
SELECT ->watched.out AS out, ->watched.percent AS percent
FROM \$vid<-watched<-viewer->watched WHERE percent >= 0.4
) WHERE out != \$vid GROUP BY video
)
ORDER BY score DESC LIMIT 12;
SURQL;
return $db->query($surql);
}
One real lesson encoded here: never interpolate user-supplied IDs as bare record literals. We pass them through type::thing('video', $id) so SurrealDB parses the value as data, not as query syntax. Treat it exactly like you treat parameterized SQL — a record link built from request input is an injection vector if you string-concatenate it into a path.
Feeding the graph from the multi-region cron
Our ingest already runs as a per-region cron that pulls trending videos. The same job now writes edges into SurrealDB. We do it in Python because the cron orchestration was already Python, and we batch the writes — sending one RELATE per event over HTTP would melt under a region's worth of events.
import requests
SURREAL = "http://127.0.0.1:8000/sql"
HEADERS = {"Accept": "application/json", "Content-Type": "text/plain",
"NS": "trendvid", "DB": "discovery"}
AUTH = ("rec_rw", "")
def flush_watch_events(events: list[dict], region: str) -> None:
"""events: [{viewer, video, percent}], all for one region."""
if not events:
return
# Build one transaction so a mid-batch failure rolls back cleanly.
stmts = ["BEGIN TRANSACTION;"]
for e in events:
v, vid, pct = e["viewer"], e["video"], float(e["percent"])
stmts.append(f"UPSERT viewer:{v} SET region = '{region}';")
stmts.append(
f"RELATE viewer:{v} -> watched -> video:{vid} "
f"SET percent = {pct:.3f}, region = '{region}', at = time::now();"
)
stmts.append("COMMIT TRANSACTION;")
resp = requests.post(SURREAL, data="\n".join(stmts),
headers=HEADERS, auth=AUTH, timeout=10)
resp.raise_for_status()
blocks = resp.json()
failed = [b for b in blocks if b.get("status") != "OK"]
if failed:
raise RuntimeError(f"{len(failed)} statements failed in {region} batch")
The transaction matters more than it looks. Our cron runs eight region jobs that can overlap, and an unbatched failure used to leave half-written view data that skewed recommendations for hours. Wrapping each batch in BEGIN/COMMIT means a region either lands fully or not at all, and the next run retries the whole batch.
Precomputing hot rails with a Go worker
The live traversal is fast — single-digit milliseconds for a normal video — but our homepage shows recommendation rails for the top few hundred trending videos per region, and computing those on every request is wasteful when they barely change between cron runs. So we precompute them with a small Go worker that runs after each ingest, writes the result back into SurrealDB as a materialized rail record, and lets the PHP layer read a single row instead of running the traversal.
package main
import (
"bytes"
"fmt"
"io"
"net/http"
"time"
)
const surreal = "http://127.0.0.1:8000/sql"
func surql(query string) ([]byte, error) {
req, _ := http.NewRequest("POST", surreal, bytes.NewBufferString(query))
req.Header.Set("Accept", "application/json")
req.Header.Set("Content-Type", "text/plain")
req.Header.Set("NS", "trendvid")
req.Header.Set("DB", "discovery")
req.SetBasicAuth("rec_rw", "")
client := &http.Client{Timeout: 8 * time.Second}
resp, err := client.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
return io.ReadAll(resp.Body)
}
// precomputeRail runs the traversal and stores the top results on the video
// itself, so reads are a single SELECT with no graph walk.
func precomputeRail(videoID, region string) error {
q := fmt.Sprintf(`
LET $vid = type::thing('video', '%s');
LET $rec = (
SELECT out AS video, count() AS co_views, math::mean(percent) AS eng
FROM (SELECT ->watched.out AS out, ->watched.percent AS percent
FROM $vid<-watched<-viewer->watched WHERE percent >= 0.4)
WHERE out != $vid GROUP BY video
ORDER BY (co_views * eng) DESC LIMIT 12
);
UPDATE $vid SET rail = $rec, rail_built = time::now(), rail_region = '%s';`,
videoID, region)
_, err := surql(q)
return err
}
func main() {
// In production this list comes from the trending query per region.
hot := []struct{ id, region string }{
{"dQw4", "gb"}, {"a8Bz", "us"}, {"k0Lm", "de"},
}
for _, v := range hot {
if err := precomputeRail(v.id, v.region); err != nil {
fmt.Printf("rail %s failed: %v\n", v.id, err)
}
}
}
Now the homepage read is SELECT rail FROM video:dQw4 — one record, no traversal — and the expensive walk only happens for long-tail videos that are not in the precomputed set. This is the same split we use everywhere: cheap reads for the common case, live computation for the tail.
Keeping SQLite FTS5 for what it is good at
We did not throw away SQLite. Search — "find videos matching these words" — is still FTS5, because a graph database is the wrong tool for ranked text matching and FTS5 is genuinely fast and zero-ops. The clean division we settled on:
- SQLite FTS5 owns search, filtering, and anything text- or attribute-based. It is the source of truth for video metadata.
- SurrealDB owns relationships: who watched what, what is similar to what, what to recommend next.
- The cron writes to both. Metadata goes to SQLite; the watch edges go to SurrealDB.
The two never need to join across the wire because the recommendation query returns video record IDs, and PHP hydrates the display data from SQLite by ID — a primary-key lookup, the cheapest read there is.
What bit us
A few things we learned the expensive way, so you do not have to:
-
Unbounded traversals will find your most popular video and never come back. A globally trending video has hundreds of thousands of co-viewers; a naive
<-watched<-viewer->watchedover it walks an enormous neighborhood. Thepercent >= 0.4edge filter and theLIMITare not optional — they are what keep the query bounded. -
Schemaless writes from a flaky cron silently rot the graph. Before we made the edge tables
SCHEMAFULL, a bug that wrotepercentas a string instead of a float mademath::meanskip rows without erroring. Define your fields and assert their ranges. - Region as a hard filter starves small regions. Soft-boosting with a multiplier was the single change that most improved perceived quality for users outside the US and GB.
-
Don't interpolate record IDs. Use
type::thing(). A graph path built from raw request input is an injection vector exactly like raw SQL. - Materialize the hot set. The live traversal is fast enough for the tail, but precomputing rails for trending videos cut homepage database time by more than half.
Conclusion
The shift that mattered was conceptual, not just operational: recommendation stopped being a query we tuned and became a graph we traverse. The four-join SQL that fell over at eight regions is now a single arrow-path expression that reads like the question it answers, runs in single-digit milliseconds for the common case, and stays bounded because the engagement filter and limits are baked into the path. SQLite FTS5 still owns search, SurrealDB owns the relationship graph, and the multi-region cron feeds both without a cross-store join in sight. If you are still expressing "viewers who watched this also watched" as a self-join that grows a clause every time a product manager asks for one more signal, try modeling it as edges and walking them — the query gets shorter as the requirements get more complex, which is the opposite of what happens in SQL.
Top comments (0)