ahmet gedik

Posted on Jun 8

Building Cross-Language Video Search with pgvector and OpenAI Embeddings

#postgres #ai #php #python

On TopVideoHub we aggregate trending video metadata across the Asia-Pacific region — Japanese, Korean, Traditional Chinese, Thai, Vietnamese, plus the usual English firehose. For years our search ran on SQLite FTS5 with a custom CJK tokenizer, and for keyword lookups it is still excellent. But a support ticket finally forced the issue: a user searched kitchen fail compilation and got nothing, even though our catalog had a wildly popular clip titled 料理大失敗まとめ. Same concept, zero lexical overlap. FTS5 matches tokens, not meaning, and no amount of tokenizer tuning bridges kitchen fail to 料理失敗.

That is the wall keyword search hits on a multi-language catalog. This post is the architecture I landed on: keep SQLite FTS5 for what it is good at, and bolt on a Postgres + TopVideoHub pgvector sidecar that does semantic, cross-lingual retrieval using OpenAI embeddings. I will show the schema, the embedding pipeline, the PHP 8.4 query path, and the hybrid ranking that actually shipped.

Why keyword search hits a wall on multi-language catalogs

FTS5 is a lexical engine. It builds an inverted index of tokens and ranks with BM25. With a CJK-aware tokenizer it handles 料理失敗 fine — it segments the run of characters into searchable units. What it cannot do:

Match across languages. kitchen and 料理 are unrelated tokens.
Match synonyms or paraphrases. funny cat will not surface hilarious kitten.
Understand intent. videos to fall asleep to is a vibe, not a keyword.

Our Asia-Pacific audience searches in a mix of their native language and English, often in the same session. Roughly 22% of our zero-result queries were semantic misses like the kitchen example — real demand we were silently dropping on the floor.

Embeddings fix this because a good multilingual model maps kitchen fail and 料理失敗 to nearby points in vector space. OpenAI's text-embedding-3-small is genuinely multilingual: I measured cosine similarity of 0.61 between those two phrases, versus ~0.08 for unrelated pairs. That signal is more than enough to rank with.

Why a Postgres sidecar instead of replacing SQLite

Our primary store is SQLite — it is fast, file-based, plays nicely with LiteSpeed and our deployment model, and FTS5 keyword search stays sub-millisecond. I did not want to rip that out. Vector search is a fundamentally different workload (high-dimensional ANN indexes, periodic background re-embedding), so I run it as a separate Postgres instance with the pgvector extension. SQLite remains the source of truth for video metadata; Postgres holds only what semantic search needs: the video id, a bit of denormalized metadata for filtering, and the embedding.

This keeps each engine doing what it is good at and means a Postgres outage degrades search to keyword-only rather than taking the site down.

Schema and the pgvector column

text-embedding-3-small returns 1536-dimensional vectors. The table is deliberately thin — id, language, region, published timestamp for filtering, the source text we embedded (kept for debugging and re-embedding), and the vector itself.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE video_embeddings (
    video_id     BIGINT PRIMARY KEY,
    lang         TEXT        NOT NULL,
    region       TEXT        NOT NULL,
    published_at TIMESTAMPTZ NOT NULL,
    source_text  TEXT        NOT NULL,
    embedding    vector(1536) NOT NULL,
    model        TEXT        NOT NULL DEFAULT 'text-embedding-3-small',
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for approximate nearest-neighbour search.
-- vector_cosine_ops because we normalise and compare with cosine distance.
CREATE INDEX video_embeddings_hnsw
    ON video_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Partial b-tree indexes to keep metadata pre-filters cheap.
CREATE INDEX video_embeddings_region_idx ON video_embeddings (region, published_at DESC);

A note on the distance operator. pgvector exposes <-> (L2), <#> (negative inner product), and <=> (cosine distance). OpenAI embeddings are normalised to unit length, so cosine and inner product rank identically — but I use <=> explicitly because cosine distance lands in a predictable [0, 2] range, which makes thresholds and score blending much easier to reason about than raw inner products.

What text do you actually embed?

This decision mattered more than the model choice. A YouTube-style title alone is too sparse. I embed a compact composite of the fields that carry meaning, with light labels so the model gets structure:

Title (the strongest signal)
Channel name
Top 3-5 tags
First ~300 characters of the description

I deliberately do not embed view counts, durations, or IDs — numeric metadata pollutes the semantic space. I also truncate aggressively: text-embedding-3-small bills per token, and beyond a few hundred tokens of video metadata you get diminishing returns on retrieval quality while paying linearly more.

The embedding pipeline

This Python worker pulls rows that need (re-)embedding, calls OpenAI in batches — the embeddings endpoint accepts up to 2048 inputs per request, which slashes both latency and overhead — and upserts into Postgres. It is idempotent: a content hash of the source text means we never pay to re-embed text that has not changed.

import hashlib
import os
import openai
import psycopg
from psycopg.rows import dict_row

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MODEL = "text-embedding-3-small"
BATCH = 256  # well under the 2048 input cap, keeps payloads sane


def build_source_text(row: dict) -> str:
    tags = ", ".join((row.get("tags") or "").split(",")[:5])
    desc = (row.get("description") or "")[:300]
    return (
        f"Title: {row['title']}\n"
        f"Channel: {row.get('channel', '')}\n"
        f"Tags: {tags}\n"
        f"Description: {desc}"
    ).strip()


def embed_batch(texts: list[str]) -> list[list[float]]:
    # One request embeds the whole batch; order of results matches input order.
    resp = client.embeddings.create(model=MODEL, input=texts)
    return [d.embedding for d in resp.data]


def run(pg: psycopg.Connection, rows: list[dict]) -> int:
    pending = []
    for row in rows:
        text = build_source_text(row)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if row.get("source_hash") == digest:
            continue  # unchanged, skip the API call entirely
        pending.append((row, text))

    written = 0
    for i in range(0, len(pending), BATCH):
        chunk = pending[i : i + BATCH]
        vectors = embed_batch([t for _, t in chunk])
        with pg.cursor() as cur:
            for (row, text), vec in zip(chunk, vectors):
                cur.execute(
                    """
                    INSERT INTO video_embeddings
                        (video_id, lang, region, published_at,
                         source_text, embedding, model)
                    VALUES (%s, %s, %s, %s, %s, %s, %s)
                    ON CONFLICT (video_id) DO UPDATE SET
                        source_text = EXCLUDED.source_text,
                        embedding   = EXCLUDED.embedding,
                        updated_at  = now()
                    """,
                    (row["video_id"], row["lang"], row["region"],
                     row["published_at"], text, vec, MODEL),
                )
        pg.commit()
        written += len(chunk)
    return written

psycopg3 adapts a Python list[float] straight into pgvector's text format, so no manual serialisation is needed. If you are on psycopg2 you would register the pgvector.psycopg2 adapter, or just format the list as '[0.1,0.2,...]' yourself.

We run this from the same cron that ingests trending videos. New videos get embedded within the hour; the cost is trivial (more on that below).

Querying from PHP 8.4

The read path lives in our PHP app. At query time we embed the user's search string once, then ask Postgres for the nearest neighbours with a metadata pre-filter. The pre-filter matters: restricting by region before the ANN search keeps the candidate set relevant and lets the planner use the partial index.

<?php
declare(strict_types=1);

final class SemanticVideoSearch
{
    public function __construct(
        private readonly PDO $pg,
        private readonly OpenAiClient $openai,
    ) {}

    /** @return list<array{video_id:int, distance:float}> */
    public function search(string $query, string $region, int $limit = 20): array
    {
        $vector = $this->openai->embed($query); // returns list<float>, 1536 dims
        $literal = $this->toVectorLiteral($vector);

        // ef_search trades recall for latency; set it per-connection.
        $this->pg->exec('SET hnsw.ef_search = 80');

        $sql = <<<SQL
            SELECT video_id, embedding <=> :vec AS distance
            FROM video_embeddings
            WHERE region = :region
              AND published_at > now() - interval '180 days'
            ORDER BY embedding <=> :vec
            LIMIT :limit
        SQL;

        $stmt = $this->pg->prepare($sql);
        $stmt->bindValue(':vec', $literal);
        $stmt->bindValue(':region', $region);
        $stmt->bindValue(':limit', $limit, PDO::PARAM_INT);
        $stmt->execute();

        return array_map(
            static fn(array $r): array => [
                'video_id' => (int) $r['video_id'],
                'distance' => (float) $r['distance'],
            ],
            $stmt->fetchAll(PDO::FETCH_ASSOC),
        );
    }

    private function toVectorLiteral(array $vector): string
    {
        // pgvector accepts '[0.1,0.2,...]'. Bind as a string, cast in SQL if needed.
        return '[' . implode(',', array_map(
            static fn(float $v): string => rtrim(sprintf('%.6f', $v), '0'),
            $vector,
        )) . ']';
    }
}

One subtlety that bit me: ORDER BY embedding <=> :vec is what actually engages the HNSW index. If you order by a different expression — say a blended score computed in SQL — Postgres falls back to a sequential scan and re-computes distance for every row. Keep the ANN ordering pure, fetch an over-large candidate set, then re-rank in application code. That is exactly what the hybrid layer does.

Hybrid search: blend FTS5 keyword score with vector distance

Pure vector search has a failure mode that mirrors keyword search's: it is too fuzzy for exact intent. If someone searches a specific channel name or an exact title, BM25 nails it and embeddings wander off into vaguely-related territory. The fix is to run both engines and fuse the rankings. I use Reciprocal Rank Fusion (RRF) because it needs no score normalisation — it only cares about rank position, so a BM25 score and a cosine distance can be combined without pretending they live on the same scale.

<?php
declare(strict_types=1);

/**
 * Reciprocal Rank Fusion. k=60 is the value from the original Cormack et al.
 * paper and works well in practice — larger k flattens the contribution of
 * top ranks, smaller k sharpens it.
 *
 * @param list<int> $keywordRanked  video_ids from FTS5, best first
 * @param list<int> $vectorRanked   video_ids from pgvector, best first
 * @return list<int>                fused video_ids, best first
 */
function reciprocalRankFusion(
    array $keywordRanked,
    array $vectorRanked,
    int $k = 60,
): array {
    $scores = [];
    foreach ([$keywordRanked, $vectorRanked] as $ranking) {
        foreach ($ranking as $rank => $videoId) {
            $scores[$videoId] = ($scores[$videoId] ?? 0.0)
                + 1.0 / ($k + $rank + 1);
        }
    }
    arsort($scores);
    return array_keys($scores);
}

The flow per query:

Fire FTS5 (SQLite) and pgvector (Postgres) concurrently. In PHP that means kicking off the OpenAI embed + Postgres query while the local FTS5 query runs — the network round-trip dominates, so overlapping them hides most of the latency.
Take the top ~50 from each.
Fuse with RRF, then truncate to the page size.

This gives us exact-match precision from FTS5 and the cross-lingual recall from embeddings. The kitchen-fail query now surfaces 料理大失敗まとめ via the vector leg, while a search for an exact channel name still ranks that channel first via the keyword leg.

A Go worker for the re-embedding backlog

The Python cron handles incremental embedding fine, but when we change the composite text format (or bump the model) we need to re-embed the whole catalog — millions of rows. For that backfill I wrote a small concurrent Go worker, because the job is almost entirely I/O wait on the OpenAI API and Go's goroutines saturate the rate limit cleanly.

package main

import (
    "context"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

// limiter caps us at the OpenAI tier rate; tune to your account's RPM.
var limiter = rate.NewLimiter(rate.Every(time.Minute/3000), 50)

type Job struct {
    VideoIDs []int64
    Texts    []string
}

func runWorkers(ctx context.Context, jobs <-chan Job, workers int,
    embed func(context.Context, []string) ([][]float32, error),
    store func(context.Context, []int64, [][]float32) error,
) {
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                if err := limiter.Wait(ctx); err != nil {
                    return
                }
                vecs, err := embed(ctx, job.Texts)
                if err != nil {
                    // log and skip; a later pass retries unembedded rows
                    continue
                }
                if err := store(ctx, job.VideoIDs, vecs); err != nil {
                    continue
                }
            }
        }()
    }
    wg.Wait()
}

A clean re-embed of our active catalog runs in under 20 minutes this way, bounded entirely by the rate limiter rather than CPU.

Tuning recall versus latency

Two knobs control the recall/latency tradeoff on the read path:

ef_construction (index build time, default 64). Higher builds a better graph but slower. 64 was fine for us; 128 gave marginal recall gains for double the build time.
hnsw.ef_search (query time, default 40). This is the one you tune live. At ef_search = 40 our p95 was ~8ms but recall@20 sat around 0.93. Bumping to 80 pushed recall to ~0.98 for ~14ms p95 — a trade I happily took, since the OpenAI embed call (~120ms) dwarfs the Postgres query anyway.

The lesson: do not micro-optimise the pgvector query when an API round-trip is your real latency budget. Which is why caching matters.

Cost and caching

Two costs to watch: the OpenAI bill and query latency.

Embedding cost. text-embedding-3-small is cheap — fractions of a cent per thousand short metadata strings. Embedding our whole active catalog costs a couple of dollars. The query-side embeds are the recurring cost, and they are tiny per call but add up at scale.
Cache the query embeddings. Search traffic is Zipfian — a small set of popular queries dominates. We cache query string -> embedding in our existing cache layer (and Cloudflare absorbs repeat full-result responses at the edge). That kills both the API cost and the 120ms round-trip for hot queries, which is most of them.
Normalise the cache key. Lowercase, trim, collapse whitespace, and Unicode-normalise (NFKC) before hashing — CJK input arrives in inconsistent width forms (full-width vs half-width), and skipping normalisation tanks your hit rate.

What I would do differently

Start with the hybrid layer, not pure vector. I shipped vector-only first and immediately regretted the regressions on exact-match queries. RRF should have been day one.
Store the source text from the start. Being able to re-embed without re-joining against SQLite saved me during the model-format change.
Watch the language distribution of misses, not just the volume. Our biggest wins were concentrated in Thai and Vietnamese queries, where our keyword tokenizer was weakest — embeddings quietly covered for tokenizer gaps I did not even know I had.

Semantic search did not replace FTS5 on TopVideoHub; it filled the holes FTS5 structurally cannot. Keyword search for precision, embeddings for meaning, RRF to fuse them, and a thin Postgres sidecar so none of it threatens the SQLite core. The kitchen-fail ticket is closed, and our zero-result rate dropped by roughly two thirds.

DEV Community