DEV Community

Matías
ONNX Runtime + pgvector in Django: semantic search without PyTorch or external APIs

Exogram is an open-source social network for Kindle readers. There is a recurring tension in the design of small-to-medium web applications that need semantic search: the easiest path—calling an external embedding API—introduces costs, latency, and privacy concerns that are often disproportionate to the scale of the problem. The harder path—running a model locally—has historically meant pulling PyTorch into your Docker image and accepting a bloated, fragile deployment. This article documents a third option: running inference with ONNX Runtime, backed by pgvector for storage, on standard Django infrastructure. No external API calls, no separate vector database, no PyTorch in production.

The problem with "just call the API"

The reflex to reach for OpenAI's embedding API is understandable. You get high-quality embeddings with one HTTP call, no model management, and results that work immediately. For a prototype, that tradeoff is usually correct.

For a production app that processes user data—specifically, personal reading highlights and notes—the calculus changes. Every search query and every imported highlight would leave the system. The cost is small per request but real and unbounded at scale. The latency of a round-trip to an external service is added to every write path. And if the service is down or rate-limited, your feature breaks. For a small app where the dataset per user sits in the range of hundreds to tens of thousands of highlights, paying these costs is hard to justify.

The ADR in this repository (docs/en/adr/0003-onnx-runtime-local.md) states the reasoning plainly: OpenAI embeddings were discarded for privacy and cost reasons. The user's highlights, including personal notes, should not leave the system for processing. That constraint is load-bearing for every subsequent decision.

What ONNX Runtime actually solves

ONNX (Open Neural Network Exchange) is a serialization format for neural network weights and computation graphs. A model trained in PyTorch or TensorFlow can be exported to ONNX format once, and then run by any compatible runtime—without needing the original framework. ONNX Runtime is Microsoft's inference engine for that format: it handles the execution on CPU or GPU, manages memory, and exposes a simple API to pass inputs and get outputs.

The practical consequence for deployment is significant. Running sentence-transformers with PyTorch in a Docker image requires roughly 1.5 GB of Python dependencies. ONNX Runtime is the lean alternative: you ship the model file (a single .onnx binary) and the runtime library, and you drop the entire PyTorch dependency tree. The Exogram Celery worker loads the model with onnxruntime==1.20.1, tokenizes with tokenizers==0.21.0 (a Rust-backed library from Hugging Face that does not require PyTorch), and nothing else.

The model itself is paraphrase-multilingual-MiniLM-L12-v2, exported to ONNX format by the Xenova project on Hugging Face. It produces 384-dimensional embeddings, supports 50+ languages (critical for an app where highlights are frequently in Spanish), and runs at roughly 50ms per text on CPU. The file weighs 470 MB on disk and occupies about 1 GB of RAM at runtime.

Loading the model

The entire inference stack lives in backend/books/embeddings.py. The class PureONNXEmbeddingModel handles download, validation, and session initialization:

def _load_model(self):
    """Downloads (if needed) and loads the ONNX model + tokenizer."""
    try:
        self._download_file(self.MODEL_URL, self.model_path)
        self._download_file(self.TOKENIZER_URL, self.tokenizer_path)

        # Limit internal ONNX Runtime threads to avoid saturating the CPU
        # in a Docker environment shared with other processes.
        # 2 threads is enough for reasonable throughput without freezing the host.
        sess_options = ort.SessionOptions()
        sess_options.intra_op_num_threads = 2
        sess_options.inter_op_num_threads = 1

        self.session = ort.InferenceSession(
            str(self.model_path),
            sess_options=sess_options,
            providers=['CPUExecutionProvider']
        )

        # Rust tokenizer — fast, no Python overhead
        self.tokenizer = Tokenizer.from_file(str(self.tokenizer_path))

    except EmbeddingModelUnavailable:
        raise
    except Exception as e:
        raise EmbeddingModelUnavailable(f"Could not load ONNX model: {e}") from e

The thread count configuration deserves a note. ONNX Runtime defaults to using all available CPU cores, which is aggressive behavior when the model runs inside a Celery worker that shares a host with Django, PostgreSQL, and Redis. Capping intra_op_num_threads at 2 gives the model enough parallelism without making the container a bad neighbor. This is a concrete tradeoff: you leave some inference speed on the table to preserve host stability.

The model is initialized once per worker process through a thread-safe singleton:

_model_lock = threading.Lock()
_model = None
_model_error: Exception | None = None

def get_model() -> PureONNXEmbeddingModel:
    """
    Returns the ONNX model with lazy initialization per process.

    If loading fails, the exception is cached to avoid retrying on every task.
    To reset, restart the Celery worker.
    """
    global _model, _model_error
    if _model is not None:
        return _model
    if _model_error is not None:
        raise _model_error
    with _model_lock:
        if _model is not None:
            return _model
        if _model_error is not None:
            raise _model_error
        try:
            _model = PureONNXEmbeddingModel()
        except EmbeddingModelUnavailable as e:
            _model_error = e
            raise
    return _model

The error caching decision is worth explaining. If the model file is missing or corrupted on startup, the first task will fail and cache the exception. All subsequent tasks in that worker process will fail fast without retrying the expensive initialization. The fix is to restore the model file and restart the worker—an intentional operational checkpoint rather than silent partial degradation.

Generating an embedding

Once the session is loaded, the encode path is three steps: tokenize, run the ONNX inference session, and apply mean pooling.

def encode(self, texts: Union[str, List[str]], normalize: bool = True) -> np.ndarray:
    """Generates 384-dimensional embeddings for one or more texts."""
    if isinstance(texts, str):
        texts = [texts]

    embeddings = []

    for text in texts:
        inputs = self._tokenize(text)

        inputs_onnx = {
            'input_ids': inputs['input_ids'],
            'attention_mask': inputs['attention_mask'],
        }

        # Some Xenova ONNX exports require token_type_ids as an explicit input
        input_names = [i.name for i in self.session.get_inputs()]
        if 'token_type_ids' in input_names:
            inputs_onnx['token_type_ids'] = np.zeros_like(inputs['input_ids'])

        outputs = self.session.run(None, inputs_onnx)

        # Mean pooling weighted by attention mask.
        # Rather than using the [CLS] token output directly (outputs[0][0][0]), we
        # compute a weighted average over all token embeddings. This model family
        # (paraphrase-multilingual-MiniLM-L12-v2) was fine-tuned with mean pooling as
        # the intended reduction strategy—using [CLS] alone would produce degraded
        # sentence representations because the training objective never concentrated
        # semantic content into that single position. Padding tokens are naturally
        # excluded: their attention mask value is 0, so they contribute zero to both
        # the numerator (sum_embeddings) and denominator (sum_mask).
        token_embeddings = outputs[0][0]  # shape: [seq_len, hidden_size]
        attention_mask = inputs['attention_mask'][0]

        mask_expanded = np.expand_dims(attention_mask, -1).astype(float)
        sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=0)
        sum_mask = np.sum(mask_expanded, axis=0)
        embedding = sum_embeddings / np.maximum(sum_mask, 1e-9)

        embeddings.append(embedding)

    embeddings = np.array(embeddings)

    if normalize:
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / np.maximum(norms, 1e-9)

    return embeddings

Mean pooling is the standard reduction step for this model family: rather than taking only the [CLS] token's representation, you average all token embeddings weighted by the attention mask (so padding tokens contribute zero). L2 normalization is applied by default because cosine similarity between two normalized vectors reduces to a dot product, which is both mathematically cleaner and what pgvector's cosine distance operator expects.
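
Stripped of the session plumbing, pooling and normalization are a few lines of NumPy. The following standalone sketch uses toy shapes (4 tokens, 3 hidden dims instead of the real 128×384) to show why padding positions drop out of the average, and why normalized vectors turn cosine similarity into a plain dot product:

```python
import numpy as np

# Toy "token embeddings": 4 positions, 3 hidden dims; the last position is padding.
token_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [9.0, 9.0, 9.0],   # padding junk — must not affect the result
])
attention_mask = np.array([1, 1, 1, 0])

# Mean pooling weighted by the mask: padding contributes zero to both sums.
mask = np.expand_dims(attention_mask, -1).astype(float)
pooled = (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(axis=0), 1e-9)
# pooled == [1/3, 1/3, 1/3]: the mean of the three real tokens, padding ignored.

# After L2 normalization, cosine similarity between vectors is just np.dot.
unit = pooled / np.linalg.norm(pooled)
assert np.isclose(np.dot(unit, unit), 1.0)
```

The same arithmetic runs in the real encode path; only the shapes differ.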

The tokenizer truncates to 128 tokens, which is sufficient for Kindle highlights—a sentence to a few paragraphs at most. That limit keeps inference fast and avoids the quadratic memory cost of longer sequences.

Storing vectors in PostgreSQL

The Highlight model in backend/books/models.py carries a single extra column:

from pgvector.django import HnswIndex, VectorField

class Highlight(models.Model):
    user = models.ForeignKey(Profile, on_delete=models.CASCADE)
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    content = models.TextField()
    # ...

    # Embedding vector for semantic similarity
    # Model: paraphrase-multilingual-MiniLM-L12-v2 — 384 dimensions, multilingual
    embedding = VectorField(dimensions=384, null=True, blank=True)

    class Meta:
        indexes = [
            # ...
            HnswIndex(
                name='highlight_embedding_hnsw',
                fields=['embedding'],
                m=16,
                ef_construction=64,
                opclasses=['vector_cosine_ops'],
            ),
        ]

The VectorField from pgvector-django is a first-class Django field: it participates in migrations, can be filtered and annotated like any other column, and stores the 384-dimensional float array as a native PostgreSQL vector type. The HnswIndex is the crucial performance piece—HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest neighbor structure that makes similarity search roughly O(log n) instead of the O(n) of a brute-force scan. The opclasses=['vector_cosine_ops'] parameter tells PostgreSQL that this index should be built for cosine distance queries specifically.
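
Under the hood, the migration for that index emits DDL roughly equivalent to the following (sketched from pgvector's documented syntax, not copied from the generated migration):

```sql
CREATE INDEX highlight_embedding_hnsw
    ON books_highlight
 USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

m controls the number of graph connections per node and ef_construction the size of the candidate list during index build; both trade build time and index size for recall.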

Embeddings are generated asynchronously in Celery, after the import transaction commits:

@shared_task(
    autoretry_for=(Exception,),
    max_retries=3,
    retry_backoff=True,
    retry_backoff_max=300,
)
def batch_generate_embeddings(highlight_ids: list):
    """
    Batch processes multiple highlights.

    Only processes highlights with embedding__isnull=True (idempotent).
    Sub-batches of 16 to avoid blocking the database.
    """
    highlights = list(
        Highlight.objects.filter(id__in=highlight_ids, embedding__isnull=True)
    )
    # ...
    for i in range(0, total, batch_size):
        batch = highlights[i:i + batch_size]
        contents = [h.content for h in batch]

        embeddings = encode_batch(contents)  # returns (batch_size, 384) array

        for highlight, embedding in zip(batch, embeddings):
            highlight.embedding = embedding.tolist()
            highlight.save(update_fields=['embedding'])

Two deliberate choices here: the idempotency guard (embedding__isnull=True) means this task can be safely retried or replayed without double-processing anything. And the sub-batch size of 16 is a practical limit to avoid holding large result sets in memory while the model runs. The import endpoint does not wait for embeddings—it returns a 201 immediately after the database transaction commits, and the user's highlights become searchable once the worker completes. A separate GET /api/highlights/embedding-status/ endpoint lets the frontend show progress.
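
The sub-batching itself is plain list slicing. A minimal standalone helper (chunked is an illustrative name, not from the codebase) behaves like the loop above:

```python
def chunked(items: list, size: int = 16):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(chunked(list(range(40)), size=16))
print([len(b) for b in batches])  # [16, 16, 8]
```

Each slice becomes one encode_batch call, so at most 16 result vectors are held in memory at a time.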

Querying for semantic similarity

The search view in backend/books/similarity_views.py is where the pgvector payoff becomes concrete:

from pgvector.django import CosineDistance

class SemanticSearchView(views.APIView):
    def post(self, request):
        query = request.data.get('query', '').strip()

        # Encode the search query with the same model used at write time
        query_embedding = encode_text(query).tolist()

        qs = Highlight.objects.exclude(embedding__isnull=True)

        if scope == 'mine':
            qs = qs.filter(user=request.user.profile)
        # ... additional scope/privacy filters ...

        # Cosine distance search with threshold
        results = qs.annotate(
            distance=CosineDistance('embedding', query_embedding)
        ).filter(
            distance__lt=0.45   # roughly 0.55 cosine similarity
        ).order_by('distance')[:limit]

This Django ORM expression translates to a single PostgreSQL query using pgvector's <=> operator—the cosine distance operator:

SELECT *, (embedding <=> '[0.021, -0.045, ...]'::vector) AS distance
FROM books_highlight
WHERE user_id = 7
  AND embedding IS NOT NULL
  AND (embedding <=> '[0.021, -0.045, ...]'::vector) < 0.45
ORDER BY distance
LIMIT 10;

The HNSW index is used automatically for this query type. The threshold of 0.45 filters out results that are too distant—cosine distance of 0 means identical vectors, 1 means orthogonal. A threshold of 0.45 corresponds to approximately 0.55 cosine similarity, which in practice means "thematically related" rather than just "uses similar words." The exact cutoff is tunable; 0.45 was chosen empirically.
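
The distance-to-similarity relationship is simply distance = 1 − similarity. A small pure-NumPy check, mirroring what pgvector's <=> operator computes, makes the threshold arithmetic concrete:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity, as pgvector's <=> operator computes it."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

v = np.array([1.0, 0.0, 0.0])
print(cosine_distance(v, v))                          # 0.0 — identical vectors
print(cosine_distance(v, np.array([0.0, 1.0, 0.0])))  # 1.0 — orthogonal vectors
# A result passes the 0.45 distance threshold iff its similarity exceeds 0.55.
```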

The same CosineDistance annotation powers a second endpoint, GET /api/highlights/<id>/similar/, which finds highlights similar to a given one—useful for discovering unexpected connections across different books. There is also a discovery feed in the affinity app that computes per-user centroids (the mean of all their highlight embeddings) and finds users with similar reading profiles using the same cosine distance query against a UserCluster table.
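
The per-user centroid reduces to a mean over the user's embedding matrix. A hedged sketch (the real implementation lives in the affinity app and may differ in details):

```python
import numpy as np

def user_centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean of a user's (n, 384) highlight embeddings, re-normalized so the
    centroid can be compared with CosineDistance like any other vector."""
    c = embeddings.mean(axis=0)
    return c / np.maximum(np.linalg.norm(c), 1e-9)

vecs = np.eye(3)       # three toy unit vectors, one per axis
c = user_centroid(vecs)
print(np.round(c, 3))  # [0.577 0.577 0.577] — equal pull toward each
```

Because the centroid is a unit vector in the same 384-dimensional space, the same HNSW-indexed query works unchanged against the UserCluster table.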

Where this approach reaches its limits

This stack works well for the problem it was designed to solve: per-user semantic search over a personal reading corpus of hundreds to tens of thousands of highlights, running on a single server. It would not be the right choice in several scenarios.

The first is scale. pgvector with HNSW indexes performs well up to tens of millions of vectors on adequately provisioned hardware, but it is not a distributed system. If you need to shard your vector store horizontally or handle millions of concurrent queries, a dedicated system like Qdrant or Weaviate is the correct path. The ADR is explicit about this: "If the volume grows to tens of millions of vectors, migrating to a dedicated vector store is the natural path."

The second is model quality. paraphrase-multilingual-MiniLM-L12-v2 is a strong general-purpose multilingual model for its size, but it is not state-of-the-art. For domains with highly specialized vocabulary—medical, legal, scientific—a domain-fine-tuned model would produce meaningfully better results. ONNX Runtime does not constrain your model choice; you could export a fine-tuned model to ONNX and swap it in without changing any other part of this architecture.

The third is the CPU-only constraint. Running inference on CPU at ~50ms per text is acceptable when embedding generation happens asynchronously in a background worker. It would not be acceptable if you needed sub-10ms search latency on the read path—for example, if you were re-encoding the query inside a synchronous request handler at high QPS. In this application, the query embedding is generated synchronously (the search endpoint calls encode_text directly), which adds ~50ms to each search request. For the current scale, that is acceptable; at higher traffic it would require either a GPU execution provider in ONNX Runtime, a caching layer for common queries, or a different architecture.
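
One of those mitigations—caching embeddings for repeated queries—can be sketched with functools.lru_cache. Everything here is illustrative, not from the codebase; the _encode stub stands in for the real ONNX encode path:

```python
from functools import lru_cache

import numpy as np

def _encode(query: str) -> np.ndarray:
    # Stand-in for the real ONNX inference call (~50 ms on CPU).
    return np.random.rand(384).astype(np.float32)

@lru_cache(maxsize=1024)
def cached_query_embedding(query: str) -> np.ndarray:
    return _encode(query)

a = cached_query_embedding("books about stoicism")
b = cached_query_embedding("books about stoicism")
assert a is b  # cache hit: the second call skips inference entirely
```

A per-process cache like this only helps with repeated queries, and the cached arrays must be treated as read-only; a shared cache (e.g. Redis) would be the next step if query repetition across workers mattered.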

The fourth is operational. The 470 MB model file needs to be on a persistent volume and available before the first embedding task runs. The download_onnx_model management command handles this, and the Docker Compose configuration mounts a shared models_cache volume. In a Kubernetes environment with ephemeral pods this requires a proper init container or a pre-baked image. It is manageable but not invisible.

The actual tradeoff

The core decision documented in these ADRs is not "ONNX Runtime vs. PyTorch" or "pgvector vs. Pinecone." It is: at what scale does the operational simplicity of a single-datastore, no-external-API architecture stop being an asset and start being a liability? For Exogram—a personal reading app with a small, privacy-sensitive dataset per user—the answer is clearly in favor of keeping everything local. The user's reading data does not leave the system. There is no per-query cost. The only infrastructure required is what the app already runs: PostgreSQL, Redis, and Celery workers.

If the requirements were different—millions of users, real-time indexing, cross-user recommendations at scale, sub-second search over billions of documents—this architecture would need to evolve. That is not a failure of the approach; it is the correct scoping of a solution to the actual problem. Knowing when a simpler stack is sufficient is as important an engineering judgment as knowing when it is not.
