BM25 + Vector Search in One Query: kreuzberg-surrealdb + SurrealDB v3

Author: Varun Tandon, Software Engineer at Kreuzberg.

Hybrid Search in 40 Lines: kreuzberg-surrealdb + SurrealDB v3

Every hybrid search tutorial starts with clean text already in the database: ten toy documents, never scanned, never duplicated, never OCR'd. Real pipelines start somewhere else: a directory of client PDFs, some scanned, some protected, plus legacy DOCX files and an ingestion layer you've assembled from LangChain loaders, Unstructured subprocesses, and filename-based IDs that inflate your vector store on every re-run.

kreuzberg-surrealdb replaces that entire pre-query layer. Two calls get you to a working hybrid search pipeline: setup_schema() creates the HNSW vector index and BM25 full-text index in SurrealDB; ingest_directory() handles format detection, OCR, chunking, embedding, and deduplication across 88+ file formats. Then SurrealDB's search::rrf() runs hybrid BM25 + vector search in a single query. It requires SurrealDB v3.

Quick Start: Connection to Hybrid Search

pip install kreuzberg-surrealdb
# Requires SurrealDB v3: surreal start --user root --pass root
import asyncio
from surrealdb import AsyncSurreal
from kreuzberg_surrealdb import DocumentPipeline

async def main():
    async with AsyncSurreal("ws://localhost:8000/rpc") as db:
        await db.signin({"username": "root", "password": "root"})
        await db.use("myns", "mydb")

        # "balanced" preset = bge-base-en-v1.5, 768 dims
        pipeline = DocumentPipeline(db=db, embedding_model="balanced")

        # Creates HNSW vector index + BM25 full-text index
        await pipeline.setup_schema()

        # Format detection, OCR, chunking, embedding, dedup
        await pipeline.ingest_directory("./docs")

        query = "regulatory compliance Q4 2025"
        embedding = await pipeline.embed_query(query)

        # search::rrf() is SurrealDB — not kreuzberg-surrealdb
        results = await db.query("""
            SELECT * FROM search::rrf([
              (SELECT id, content FROM chunks
               WHERE embedding <|10,COSINE|> $embedding),
              (SELECT id, content, search::score(1) AS score FROM chunks
               WHERE content @1@ $query
               ORDER BY score DESC LIMIT 10)
            ], 10, 60);
        """, {"query": query, "embedding": embedding})

        for row in results[0]:
            # The subqueries above select only id and content, so print those.
            print(row.get("content", "")[:200])
            print("---")

asyncio.run(main())

Building Ingestion for SurrealDB

Hybrid search with RRF improved Mean Reciprocal Rank from 0.410 to 0.486: an 18.5% gain over single-mode retrieval in a production RAG system. That gain depends entirely on both indexes being correctly populated. Getting there from scratch means solving four problems.

Format extraction. LangChain's PDFLoader returns empty strings or raises errors on scanned PDFs (GitHub issue #6376). LibreOffice in Unstructured runs single-threaded, so concurrent ingestion creates silent race conditions on file handles. Missing libmagic on the host causes DOCX files to be misidentified as application/zip, bypassing all DOCX-specific extraction logic.

HNSW DDL. You specify DIMENSION, DIST, EFC, and M manually. Wrong values silently produce an underperforming index. DimensionMismatchError fires at insert time, not schema creation. Switch embedding models after writing records and every subsequent insert fails.

Deduplication. LlamaIndex document IDs default to filename-based hashing (GitHub issue #13461). Re-running the pipeline on unchanged files creates new vector records, triggers re-embedding API calls, and inflates the vector store. Content-hash dedup isn't in LlamaIndex's default configuration.

Retrieval-only integrations. The LangChain SurrealDBVectorStore covers retrieval only. Schema creation, chunking, embedding, and batched inserts remain on you.

Setting Up kreuzberg-surrealdb

pip install kreuzberg-surrealdb

You own the AsyncSurreal connection — authenticate, select namespace and database, then pass it to DocumentPipeline:

from surrealdb import AsyncSurreal
from kreuzberg_surrealdb import DocumentPipeline

async with AsyncSurreal("ws://localhost:8000/rpc") as db:
    await db.signin({"username": "root", "password": "root"})
    await db.use("myns", "mydb")

    pipeline = DocumentPipeline(db=db, embedding_model="balanced")
    await pipeline.setup_schema()

What setup_schema() Generates

One call creates everything SurrealDB needs to run both retrievers:

-- documents table
DEFINE TABLE documents SCHEMAFULL;
DEFINE FIELD source        ON documents TYPE string;
DEFINE FIELD content       ON documents TYPE string;
DEFINE FIELD content_hash  ON documents TYPE string;
DEFINE FIELD ingested_at   ON documents TYPE datetime;
DEFINE FIELD quality_score ON documents TYPE float;
  -- OCR confidence (0.0–1.0) for scanned content
DEFINE FIELD title         ON documents TYPE string;
DEFINE FIELD authors       ON documents TYPE array;
DEFINE FIELD metadata      ON documents TYPE object FLEXIBLE;

-- chunks table
DEFINE TABLE chunks SCHEMAFULL;
DEFINE FIELD document    ON chunks TYPE record<documents>;
DEFINE FIELD content     ON chunks TYPE string;
DEFINE FIELD embedding   ON chunks TYPE array<float>;
DEFINE FIELD chunk_index ON chunks TYPE int;
DEFINE FIELD word_count  ON chunks TYPE int;
DEFINE FIELD page_number ON chunks TYPE int;
DEFINE FIELD char_start  ON chunks TYPE int;
DEFINE FIELD char_end    ON chunks TYPE int;

-- HNSW vector index
DEFINE INDEX idx_chunk_embedding ON chunks FIELDS embedding
  HNSW DIMENSION 768 TYPE F32 DIST COSINE EFC 150 M 12;

-- BM25 full-text index
DEFINE ANALYZER text_analyzer TOKENIZERS class
  FILTERS lowercase, stemmer(english);
DEFINE INDEX idx_chunk_content ON chunks FIELDS content
  SEARCH ANALYZER text_analyzer BM25(1.2, 0.75);

Embedding Presets

The preset determines the DIMENSION value in the HNSW DDL — you never specify it manually:

| Preset | Model | Dimensions | Use case |
| --- | --- | --- | --- |
| "fast" | all-MiniLM-L6-v2 | 384 | Low-latency, resource-constrained |
| "balanced" | bge-base-en-v1.5 | 768 | Default; best general-purpose trade-off |
| "quality" | bge-large-en-v1.5 | 1024 | High-recall when compute is available |
| "multilingual" | multilingual-e5-base | 768 | Non-English or mixed-language corpora |

For a custom model:

from kreuzberg_surrealdb import EmbeddingModelType

model = EmbeddingModelType.fastembed("BAAI/bge-small-en-v1.5", embedding_dimensions=384)
pipeline = DocumentPipeline(db=db, embedding_model=model)

One important constraint: SurrealDB enforces vector dimension server-wide. All tables on the same instance must use the same dimension. Pick the preset before first ingest — changing it later means dropping the HNSW index, re-running setup_schema(), and re-embedding the entire corpus.

Chunking Configuration

from kreuzberg import ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=512,   # Smaller = more precise retrieval, more records
        max_overlap=100  # Prevents context loss at chunk boundaries
    )
)
pipeline = DocumentPipeline(db=db, config=config, embedding_model="balanced")
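To build intuition for what max_chars and max_overlap control, here is a minimal character-window sketch. This is not kreuzberg's internal chunking algorithm (which is token- and structure-aware); it only illustrates how the two parameters interact:

```python
# Minimal illustration of character-based chunking with overlap.
# NOT kreuzberg's actual algorithm -- just a sketch of what
# max_chars and max_overlap control conceptually.

def chunk_text(text: str, max_chars: int = 512, max_overlap: int = 100) -> list[str]:
    """Split text into windows of max_chars, each starting
    max_chars - max_overlap characters after the previous one."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - max_overlap
    return [text[i:i + max_chars] for i in range(0, len(text) - max_overlap, step)]

chunks = chunk_text("x" * 1000, max_chars=512, max_overlap=100)
# Adjacent chunks share 100 characters, so text near a chunk
# boundary has surrounding context in at least one chunk.
```

Smaller max_chars gives more precise retrieval at the cost of more records to index; the overlap prevents a sentence split at a boundary from losing its context entirely.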

Ingesting a Mixed Document Corpus

ingest_directory() walks the directory, detects each file's format, extracts text (with OCR for scanned content), chunks, embeds, and writes to SurrealDB. No Tesseract configuration required.

await pipeline.ingest_directory("./docs", glob="**/*")

The glob parameter follows pathlib.Path.glob() syntax — **/* walks all subdirectories recursively (default), **/*.pdf scopes to PDFs only.

For targeted ingestion or upload flows:

# Single file
await pipeline.ingest_file("./reports/q4-2025.pdf")

# Bytes — e.g. from an HTTP upload handler
await pipeline.ingest_bytes(
    data=pdf_bytes,
    mime_type="application/pdf",
    source="upload://q4-2025.pdf"
)

Deduplication

kreuzberg-surrealdb computes a SHA-256 hash from each chunk's extracted text content and uses it as the SurrealDB RecordID (pattern: {content_hash}_{chunk_index}). All inserts use INSERT IGNORE. Running ingest_directory() twice on unchanged content is a complete no-op: zero new records, zero re-embedding calls.

This differs meaningfully from LlamaIndex's filename_as_id=True default. When you re-ingest the same file from a different path, LlamaIndex generates a new RecordID from the new path and creates a duplicate. kreuzberg-surrealdb hashes the content itself — same text from any path, same RecordID, same no-op.
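The ID scheme described above can be sketched in a few lines. The exact text normalization kreuzberg-surrealdb applies before hashing is an assumption here; the point is the determinism of the {content_hash}_{chunk_index} pattern:

```python
import hashlib

def chunk_record_id(chunk_text: str, chunk_index: int) -> str:
    """Derive a deterministic record ID from chunk content,
    mirroring the {content_hash}_{chunk_index} pattern described
    above. Any pre-hash normalization is assumed, not confirmed."""
    content_hash = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return f"{content_hash}_{chunk_index}"

# Same text from any path -> same RecordID -> INSERT IGNORE is a no-op.
assert chunk_record_id("Q4 revenue grew 12%", 0) == chunk_record_id("Q4 revenue grew 12%", 0)
```

A filename-derived ID changes when the path changes; a content-derived ID only changes when the text does, which is exactly the property re-runs need.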

Honest limitations:

  • Sequential ingestion. ingest_files() and ingest_directory() process files in a sequential loop. For high-throughput pipelines, use a queue-based architecture where independent workers each call ingest_file() per document.
  • No orphan deletion. Files removed from the source directory stay in the database. Manual cleanup: DELETE FROM documents WHERE source NOT IN $active_sources;
  • Exact-match dedup only. Two slightly different versions of the same document create two separate records. Near-duplicate detection isn't supported.
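The queue-based pattern from the first limitation can be sketched with asyncio. Whether a single DocumentPipeline instance tolerates concurrent ingest_file() calls is an assumption here; separate worker processes with their own connections are the safer reading of "independent workers":

```python
import asyncio
from pathlib import Path

async def ingest_with_workers(pipeline, paths: list[Path], workers: int = 4) -> None:
    """Fan file-level ingestion out to N workers, each calling
    pipeline.ingest_file() per document as suggested above.
    Sketch only -- error handling and retries are omitted."""
    queue: asyncio.Queue = asyncio.Queue()
    for p in paths:
        queue.put_nowait(p)

    async def worker() -> None:
        while True:
            try:
                path = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained; worker exits
            await pipeline.ingest_file(path)
            queue.task_done()

    await asyncio.gather(*(worker() for _ in range(workers)))
```

Because dedup is content-hashed with INSERT IGNORE, a crashed-and-restarted run of this loop re-processes files but writes nothing twice.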

The Hybrid Search Payoff: How search::rrf() Works

Both indexes are now in SurrealDB — HNSW for vector retrieval, BM25 for full-text. SurrealDB's search::rrf() combines them in a single query using Reciprocal Rank Fusion (RRF).

Because RRF operates on ranked positions rather than raw scores, BM25's unbounded values and cosine similarity's 0–1 range are never directly compared. No score normalization. No alpha parameter. The formula (Cormack, Clarke & Buettcher, SIGIR 2009):

RRF_score(d) = Σ 1 / (k + rank_i(d))

k=60 is the smoothing constant from the original paper — not a tunable weight. It prevents top-ranked documents from dominating when they appear near rank 1 in only one list.
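The fusion step is small enough to verify by hand. This client-side sketch mirrors the formula above (search::rrf() does the equivalent server-side); the example document IDs are made up:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over ranked lists
    of 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vector_top = ["c3", "c1", "c7"]   # ranked by cosine similarity
bm25_top   = ["c1", "c9", "c3"]   # ranked by BM25 score
fused = rrf_fuse([vector_top, bm25_top])
# c1 and c3 appear in both lists, so they outrank c7 and c9,
# each of which appears in only one.
```

Note that only positions enter the formula: the raw cosine and BM25 scores never meet, which is why no normalization step exists.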

Attribution: What Owns What

| Step | Owner |
| --- | --- |
| Extraction from 88+ formats, OCR | kreuzberg (via kreuzberg-surrealdb) |
| Chunking and embedding | kreuzberg (via kreuzberg-surrealdb) |
| HNSW + BM25 index creation | kreuzberg-surrealdb (setup_schema()) |
| Consistent query embedding | kreuzberg-surrealdb (embed_query()) |
| Hybrid fusion | SurrealDB (search::rrf()) |
| Vector + full-text retrieval | SurrealDB |
| Chunk → document traversal | SurrealDB record links |

All Three Search Modes

Always call embed_query() before a vector or hybrid search. It ensures the query vector uses the same model and dimension as stored chunk embeddings. A mismatch causes cosine similarity scores to become meaningless without raising an error.
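A cheap guard turns that silent failure into a loud one. The dimensions come from the preset table above; the guard itself is a local sanity check, not part of the kreuzberg-surrealdb API:

```python
# Preset dimensions from the table above. This guard is a local
# sanity check, not a kreuzberg-surrealdb feature.
PRESET_DIMS = {"fast": 384, "balanced": 768, "quality": 1024, "multilingual": 768}

def check_query_embedding(embedding: list[float], preset: str) -> list[float]:
    """Fail fast on a dimension mismatch instead of letting
    cosine scores become silently meaningless."""
    expected = PRESET_DIMS[preset]
    if len(embedding) != expected:
        raise ValueError(
            f"Query embedding has {len(embedding)} dims; "
            f"preset '{preset}' stores {expected}-dim vectors"
        )
    return embedding
```

Dropping this check into the query path costs nothing and catches the classic mistake of embedding queries with a different model than the corpus.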

Hybrid (BM25 + vector):

query = "regulatory compliance Q4 2025"
embedding = await pipeline.embed_query(query)

results = await db.query("""
    SELECT * FROM search::rrf([
      (SELECT id, content FROM chunks
       WHERE embedding <|10,COSINE|> $embedding),
      (SELECT id, content, search::score(1) AS score FROM chunks
       WHERE content @1@ $query
       ORDER BY score DESC LIMIT 10)
    ], 10, 60);
""", {"query": query, "embedding": embedding})

Vector-only:

SELECT *, vector::distance::knn() AS distance FROM chunks
WHERE embedding <|10,COSINE|> $embedding ORDER BY distance;

BM25-only:

SELECT *, search::score(1) AS score FROM chunks
WHERE content @1@ $query
ORDER BY score DESC LIMIT 10;

Chunk → parent document traversal (no JOIN, no second query):

SELECT *, document.source, document.quality_score FROM chunks
WHERE content @1@ $query LIMIT 5;

The document field on each chunk is a SurrealDB record link — dot notation traverses it inline.

Where Each Retriever Fails

BM25 fails on: paraphrased queries, vocabulary mismatch ("car" vs "automobile"), semantic synonyms, conceptual proximity without term overlap.

Vector fails on: exact product codes, named entities, precise version strings, rare technical terms, regulation IDs, serial numbers.

Hybrid RRF covers both.

Filtering by OCR Quality

Low-quality extraction degrades both retrievers. Filter on quality_score before retrieval:

SELECT * FROM search::rrf([
  (SELECT id, content FROM chunks WHERE embedding <|10,COSINE|> $embedding),
  (SELECT id, content, search::score(1) AS score FROM chunks
   WHERE content @1@ $query ORDER BY score DESC LIMIT 10)
], 10, 60)
WHERE document.quality_score > 0.7;

Tuning HNSW and BM25 Parameters

setup_schema() exposes four tunable parameters. The defaults work well for 256–512 token chunks in typical document corpora.

await pipeline.setup_schema(
    hnsw_efc=200,             # Higher = better recall, slower index build
    hnsw_m=16,                # Higher = better recall, more memory per node
    distance_metric="COSINE",
    bm25_k1=1.5,              # Term-frequency saturation
    bm25_b=0.5                # Length normalization
)
| Parameter | Default | When to tune |
| --- | --- | --- |
| hnsw_efc | 150 | Large corpora (100K+ docs) where recall matters more than indexing speed |
| hnsw_m | 12 | High-dimensional embeddings (1024-dim); memory is available |
| bm25_k1 | 1.2 | Technical corpora with high term repetition (code, legal docs) |
| bm25_b | 0.75 | Corpora with highly variable document lengths |

Parameters are fixed at index creation time. Changing them requires dropping and redefining the affected index, which rebuilds it over every stored chunk (re-embedding is only needed if you also change the embedding model). Pick before ingesting production data.

Why Not pgvector + Qdrant?

Running pgvector and Qdrant separately means two write paths, two uptime SLAs, and no ACID guarantees across them. Here's a failure mode every engineer hits eventually: a Qdrant write succeeds; a Postgres write fails during a network partition. The vector store now holds an embedding whose parent document record doesn't exist. Your search returns a chunk with no context — no source, no metadata, no document link. The retry wrapper is still on the backlog.

kreuzberg-surrealdb's ingest_directory() writes to documents and chunks in the same database. Both the HNSW index and the BM25 index are maintained within the same transaction. search::rrf() runs inside that same database — no cross-service retrieval latency, no dual-write coordination. The record link from chunks.document to documents is always consistent because both were written in the same transaction.

The LangChain EnsembleRetriever compounds the problem: two separate HTTP calls to two separate systems, merged in Python with a hardcoded weights parameter. Weights don't apply to a rank-based algorithm; that mismatch is baked into the design. search::rrf() doesn't have this problem.

Honest trade-offs: SurrealDB isn't Elasticsearch. At very large scale — hundreds of millions of vectors — specialized vector databases have more managed hosting options and operational tooling. ingest_files() is sequential; high-throughput batch ingestion requires a queue-based architecture regardless of which database you're using. As of SurrealDB v3, there's no managed cloud option at scale. Verify current hosting options before adopting this stack for production infrastructure.

Frequently Asked Questions

What SurrealDB version is required? search::rrf() requires SurrealDB v3. It is not available in v2. BM25 and vector search work separately on v2, but not the combined hybrid query.

Can I use a custom embedding model? Yes, via EmbeddingModelType.fastembed() or EmbeddingModelType.custom(). You must provide embedding_dimensions explicitly. All chunks and queries must use the same model and dimension, since SurrealDB enforces dimension consistency server-wide.

Is ingestion concurrent or sequential? ingest_files() and ingest_directory() are sequential. For high-throughput pipelines, use a queue-based architecture with one worker per document. ingest_bytes() can be called concurrently from multiple coroutines.

What happens to records for deleted files? Nothing automatic. Records remain until manually removed. See orphan cleanup above.
