SKasagar

Posted on • Originally published at caseonix.ca

Privacy-first RAG on Cloudflare's edge — here's everything I changed from the naïve baseline.

Live app: localmind.caseonix.ca · originally posted as a 3-part series at caseonix.ca/notes

I built a privacy-first document intelligence platform called LocalMind. Documents are uploaded, classified, reviewed, and searched at the Cloudflare edge — content never leaves Cloudflare. It uses Vectorize for vector similarity, Workers AI for embeddings, and Google Gemma 4 (26B) for summaries, reviews, comparisons, and chat. Multi-tenant with namespace isolation per team.

This post stitches together my three lab notes on the RAG side of LocalMind:

  1. The pipeline — how chunking → embedding → vector index → retrieval → generation is wired.
  2. Quality improvements — eleven things I changed to move beyond the naïve "split, embed, top-K, stuff" baseline.
  3. The NLP layer — the classical and LLM-based NLP that runs alongside RAG (PII, NER, table flattening, classification, structured analysis).

If you're shipping RAG on the edge, want to compare notes on contextual chunking, or are doing PIPEDA-aware document handling for a Canadian deployment — this is a long one but the diagrams should help.


Part 1 — The pipeline

How I wired retrieval-augmented generation end-to-end. Everything runs on the Cloudflare edge — Workers AI for inference, Vectorize for ANN, D1 for the canonical chunk text, R2 for the original blob. I use teamId as the Vectorize namespace on every read and write to keep tenants isolated.

Ingest pipeline (write path)

I put the whole ingest pipeline inside a per-team Durable Object so it can run async, hold rate-limit state, and use alarms to verify the vector index.

flowchart LR
    A[R2 file] --> B[parseDocument<br/>PDF/DOCX/XLSX/OCR]
    B --> C[flattenTables<br/>rejoin headers]
    C --> D[chunkText<br/>≤512 tok, 50 overlap]
    D --> E[redactPII<br/>regex + LLM]
    E --> F[(D1: document_chunks<br/>display copy)]
    E --> G[generateChunkContext<br/>Gemma — 1-2 sentence header]
    G --> H[generateEmbeddings<br/>BGE-Small, batch 32]
    H --> I[(Vectorize.upsert<br/>namespace=teamId)]
    I --> J[DO alarm — verify<br/>retry 5× / 90s]

Orchestration lives in a per-team Durable Object's processDocument. The main twist is a dual-text split: D1 stores the redacted-but-otherwise-original chunk that users see in citations, and Vectorize stores a contextually enriched version where Gemma 4 has prepended a 1-2 sentence header that says where the chunk fits in the document. The header improves recall for chunks that would otherwise be too short or too back-referencing to retrieve on their own, and it never shows up in the user-facing citation.

Chunking

The chunker is small and deterministic. It splits the text on paragraph breaks (\n{2,}) and packs whole paragraphs into a buffer until the buffer would go over MAX_CHUNK_TOKENS (512 tokens, approximated as 4 chars/token). When a paragraph won't fit, I flush the buffer and carry a 50-token tail forward as overlap so context isn't cut off in the middle of a thought. If a single paragraph is bigger than the cap on its own, I fall back to a sentence regex (/[^.!?]+[.!?]+\s*/g) and pack sentences instead. No tokenizer call, no LLM — paragraph structure is the main signal, and the token math is conservative on purpose.
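The packing logic above can be sketched in a few lines. This is a minimal reconstruction from the description, not the production source; the constant names follow the tunables table, and the token math is the same 4-chars-per-token approximation.

```typescript
const MAX_CHUNK_TOKENS = 512;
const CHUNK_OVERLAP_TOKENS = 50;
const CHARS_PER_TOKEN = 4; // conservative approximation, no tokenizer call

const tokens = (s: string) => Math.ceil(s.length / CHARS_PER_TOKEN);

function chunkText(text: string): string[] {
  const paragraphs = text.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let buffer = "";

  const flush = () => {
    if (!buffer) return;
    chunks.push(buffer);
    // carry a ~50-token tail forward so context survives the boundary
    buffer = buffer.slice(-CHUNK_OVERLAP_TOKENS * CHARS_PER_TOKEN);
  };

  for (const para of paragraphs) {
    // oversized paragraph → fall back to sentence units (joined with the
    // same separator here, a simplification of the real chunker)
    const units =
      tokens(para) > MAX_CHUNK_TOKENS
        ? para.match(/[^.!?]+[.!?]+\s*/g) ?? [para]
        : [para];
    for (const unit of units) {
      if (buffer && tokens(buffer) + tokens(unit) > MAX_CHUNK_TOKENS) flush();
      buffer = buffer ? `${buffer}\n\n${unit}` : unit;
    }
  }
  if (buffer) chunks.push(buffer);
  return chunks;
}
```

Because the flush happens *before* a unit is appended, every emitted chunk stays under the 512-token cap, and each new chunk opens with the previous chunk's 50-token tail.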

Before chunks ever touch D1 or Vectorize, I run them through two PII passes (regex + LLM). The redaction is applied to both the display copy in D1 and the version that gets embedded — privacy was a hard requirement for me, so nothing identifiable leaves the chunk pipeline. (Detail in Part 3.)

Embedding

The embedding service is small on purpose — a single short module with two functions. It calls @cf/baai/bge-small-en-v1.5 via Workers AI for both the document side and the query side. On the query path I prepend BGE's asymmetric prefix "Represent this sentence for searching relevant passages: ". That's how BGE was trained on the query side, and using it raises cosine similarity for relevant chunks above the noise floor.
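A sketch of the asymmetric encoding, assuming an `AiLike` interface and an `embed` helper (both illustrative names); the model ID and prefix string are the ones described above:

```typescript
const BGE_QUERY_PREFIX =
  "Represent this sentence for searching relevant passages: ";

interface AiLike {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}

// The prefix is applied on the query side only, never to documents,
// matching how BGE was trained.
async function embed(ai: AiLike, texts: string[], isQuery: boolean) {
  const input = isQuery ? texts.map(t => BGE_QUERY_PREFIX + t) : texts;
  const out = await ai.run("@cf/baai/bge-small-en-v1.5", { text: input });
  return out.data; // 384-dim vectors
}
```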

In the DO, I batch embeddings at 32, then call Vectorize.upsert in batches of 100. Vectorize is eventually consistent, so once the upsert returns I schedule a Durable Object alarm 90 seconds out. The alarm re-queries the first and last chunk IDs to check they're indexed, retries up to 5 times, and only marks the document ready for search once both ends are visible. If verification fails after the final retry, I flip the document to error so the user knows search won't work for it.
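The verification check itself reduces to a small function. This is a sketch under assumed names (`VectorIndexLike`, `verifyIndexed`); in the real pipeline each retry is a fresh Durable Object alarm 90 seconds out, collapsed here into an attempt counter:

```typescript
interface VectorIndexLike {
  getByIds(ids: string[]): Promise<{ id: string }[]>;
}

const MAX_RETRIES = 5;

// "ready" once both sentinel chunk IDs are queryable, "retry" to schedule
// another alarm, "error" after the final attempt fails.
async function verifyIndexed(
  index: VectorIndexLike,
  firstChunkId: string,
  lastChunkId: string,
  attempt = 0
): Promise<"ready" | "retry" | "error"> {
  const found = await index.getByIds([firstChunkId, lastChunkId]);
  const visible = new Set(found.map(v => v.id));
  if (visible.has(firstChunkId) && visible.has(lastChunkId)) return "ready";
  return attempt + 1 >= MAX_RETRIES ? "error" : "retry";
}
```

Sampling only the first and last IDs is the design choice worth noting: it is a cheap proxy for "the whole batch landed" without re-querying every chunk.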

Retrieval & generation (read path)

flowchart TB
    Q[user query] --> E1[expandQuery<br/>Gemma → 2-3 variants]
    E1 --> E2[generateQueryEmbedding ×N<br/>BGE prefix applied]
    E2 --> V[Vectorize.query ×N<br/>namespace=teamId, topK=15]
    V --> M[merge + dedupe<br/>max score per chunk]
    M --> H[hydrate from D1<br/>drop sim < 0.3]
    H --> R[rerankChunks<br/>Gemma 1-5, keep ≥3, top 7]
    R --> C[buildContextFromChunks<br/>≤6000 tok, ≤2/doc team-wide]
    C --> S[synthesizeAnswer<br/>Gemma + history + cite]
    S --> A[answer + sources]

Step by step:

  1. Query expansion. Gemma rewrites the user's query into 2-3 variants under a JSON schema. The original query is always kept, and the total is capped at 3. If the LLM call fails I fall back to the original query only.
  2. Multi-vector search. I embed each variant with the BGE query prefix and dispatch all of them in parallel to Vectorize.query, scoped to the team's namespace with topK = 15.
  3. Merge and dedupe. I union results by chunk ID and keep the highest cosine score across variants.
  4. Hydrate and threshold. I load chunk text from D1 in a single inArray query, join it to document names, and drop anything below MIN_SIMILARITY = 0.3. (BGE-Small puts truly relevant chunks in the 0.35-0.55 band on my corpus, so 0.3 is the tuned floor.)
  5. LLM rerank. I send the top 15 candidates to Gemma with a strict 1-5 relevance rubric, again JSON-schema constrained. I cut anything below 3 and keep the top 7. If fewer than 2 chunks pass, I fall back to the vector ranking.
  6. Context assembly. I concatenate chunks with [N] filename headers and --- separators, stopping at MAX_CONTEXT_TOKENS = 6000. For team-wide searches I cap each document at 2 chunks so one chatty document can't crowd out the others.
  7. Synthesis. The system prompt binds Gemma to the supplied context, requires [Document: filename] citations, and forbids external knowledge. Conversation history is walked newest-first and trimmed at 1500 tokens.
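The merge-and-dedupe in steps 3-4 is worth seeing as code. A minimal sketch, with `Match` and `mergeMatches` as illustrative names:

```typescript
interface Match { id: string; score: number }

// Union results from the N query variants: a chunk surfaced by multiple
// phrasings keeps its best cosine score but appears only once.
function mergeMatches(variantResults: Match[][]): Match[] {
  const best = new Map<string, number>();
  for (const matches of variantResults)
    for (const m of matches)
      best.set(m.id, Math.max(best.get(m.id) ?? -Infinity, m.score));
  return [...best.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```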

Tunables — single source of truth

| Knob | Value | What it controls |
| --- | --- | --- |
| MAX_CHUNK_TOKENS | 512 | chunker upper bound |
| CHUNK_OVERLAP_TOKENS | 50 | tail carried between chunks |
| TOP_K | 15 | vectors retrieved per query variant |
| MIN_SIMILARITY | 0.3 | post-vector floor |
| QUERY_EXPANSION_MAX | 3 | LLM rewrites per query |
| RERANK_TOP_K | 7 | chunks kept after rerank |
| RERANK_MIN_SCORE | 3 | Gemma relevance cutoff (1-5) |
| MAX_CONTEXT_TOKENS | 6000 | budget for synthesis prompt |
| MAX_CHUNKS_PER_DOC_TEAM_SEARCH | 2 | per-doc cap in team-wide context |

Models

  • Embedder: @cf/baai/bge-small-en-v1.5 (BGE-Small, 384-dim) on Workers AI.
  • Generator / reranker / classifier: @cf/google/gemma-4-26b-a4b-it, configurable.
  • Vector index: Cloudflare Vectorize, partitioned by teamId namespace.
  • Chunk store: Cloudflare D1 (SQLite) via Drizzle.
  • Blob store: Cloudflare R2.

Part 2 — How I improved RAG quality

A naïve "split by N tokens, embed, top-K cosine, stuff into prompt" pipeline is the baseline most RAG tutorials give you. I moved away from that baseline at almost every layer of LocalMind because each step had a real failure mode in my testing.

Baseline vs. what I built

flowchart TB
    subgraph Baseline["Naïve baseline RAG"]
        B1[fixed-window<br/>chunking] --> B2[embed<br/>raw chunks]
        B2 --> B3[single-query<br/>top-K cosine]
        B3 --> B4[stuff into<br/>prompt]
    end
    subgraph LocalMind["LocalMind RAG"]
        L1[paragraph-aware chunk<br/>+ overlap<br/>+ table flatten<br/>+ PII redact] --> L2[contextual prepend<br/>+ BGE query prefix<br/>+ embed]
        L2 --> L3[query expansion ×N<br/>+ multi-vector merge<br/>+ similarity floor 0.3<br/>+ Gemma rerank 1-5]
        L3 --> L4[token-budgeted context<br/>+ per-doc fairness cap<br/>+ citation-bound synthesis<br/>+ history trim]
    end
    Baseline -.evolved into.-> LocalMind

Below, each change is shown as the problem it fixes.

Indexing-side improvements

flowchart LR
    R[raw text] --> C1[paragraph-aware<br/>chunking]
    C1 --> C2[table<br/>flattening]
    C2 --> C3[PII redaction<br/>before embed]
    C3 --> C4[contextual<br/>prepend]
    C4 --> C5[BGE query/doc<br/>asymmetry]
    C5 --> V[(Vectorize)]

Paragraph-aware chunking instead of fixed windows. Naïve chunkers slice every N characters, which often cuts across sentence and paragraph boundaries and embeds half-thoughts. My chunker packs whole paragraphs first, only falls back to sentence splitting when a single paragraph overflows, and carries a 50-token overlap on every flush so cross-boundary references survive. Embeddings stay tied to coherent semantic units, and overlap covers the cases where a discussion does cross paragraphs.

Table flattening before chunking. PDFs and DOCX exports lose tabular structure on extraction — column headers end up in row 1 and the data rows below them lose all meaning ("$4.2M" with no header is useless to a retriever). My table flattener walks the parsed text and re-attaches headers to each row so a query like "Q3 revenue" has something to match against in the embedding space. (Detail in Part 3.)

Contextual chunking (the biggest single win). I applied Anthropic's contextual-retrieval idea here. For each chunk, Gemma writes a 1-2 sentence header that says where the chunk fits in the broader document — for example, "This chunk is from the risk-factors section of Acme's 2024 10-K and discusses supply-chain exposure to…". I prepend that header to the chunk before embedding, so the vector encodes both the local content and its global context. The display copy in D1 stays untouched, so users still see the clean original text in citations.

flowchart LR
    OC[original chunk] --> D1[(D1 — display copy)]
    OC --> CTX[Gemma writes<br/>1-2 sentence header]
    CTX --> ENR[contextualized chunk<br/>= header + original]
    ENR --> EMB[BGE embed]
    EMB --> VZ[(Vectorize — retrieval copy)]

This single change rescues a lot of short chunks and back-referencing chunks ("the company also reported…" — which company?) that vanilla embeddings can't disambiguate.

BGE asymmetric query encoding. BGE-Small was trained with a specific instruction prefix on the query side. Most pipelines ignore it. I prepend "Represent this sentence for searching relevant passages: " to every query before embedding, which matches the model's training distribution and raises cosine similarity for relevant chunks above the noise level.

PII redaction before embedding. Privacy-first is the headline requirement of LocalMind, but it's also a retrieval lever: I redact chunks before they're embedded, so PII never gets encoded into the vector space. A side benefit is that chunks cluster by topic instead of by random identifiers like phone numbers.

Retrieval-side improvements

flowchart TB
    Q[user query] --> QE[query expansion<br/>2-3 variants]
    QE --> MV[multi-vector search<br/>parallel topK=15]
    MV --> MD[merge + dedupe<br/>max score per chunk]
    MD --> TH[similarity floor<br/>≥ 0.3]
    TH --> RR[LLM rerank<br/>Gemma 1-5]
    RR --> CT[token-budgeted context<br/>+ per-doc fairness cap]
    CT --> SY[citation-bound<br/>synthesis]

Query expansion + multi-vector search. A single query phrasing often misses chunks that use different vocabulary ("revenue" vs. "top line", "termination" vs. "wrongful dismissal"). I have Gemma rewrite the query into 2-3 alternative phrasings, embed each one with the BGE prefix, and run all variants in parallel against Vectorize. I merge results by chunk ID, keeping the max score across variants, so a chunk surfaced by multiple phrasings gets credit but isn't double-counted. Recall goes up without a precision hit.

Tuned similarity floor. A naïve top-K returns K chunks no matter what — including garbage when the corpus has nothing relevant. I profiled BGE-Small's score distribution on my actual corpus: relevant chunks land at 0.35-0.55, noise sits below 0.3. My MIN_SIMILARITY = 0.3 floor cuts off the noise band, and if zero chunks pass I short-circuit to "I couldn't find relevant information" instead of letting the model hallucinate from low-quality context.

Two-stage retrieval — vector then LLM rerank. Cosine similarity is fast but coarse. I retrieve a wide candidate set (top 15 per variant) then hand it to Gemma with a strict 1-5 relevance rubric and a JSON-schema output. Anything scoring below 3 is dropped; I keep the top 7. Reranking boosts chunks that answer the question over chunks that just share vocabulary with it, which is the failure mode pure-vector retrieval is most prone to. I added a fallback: if fewer than 2 chunks pass the rerank threshold I revert to vector ordering rather than ship a near-empty context.

Per-document fairness cap on team-wide search. Without a cap, one document with high lexical overlap can take over the top 7 and crowd out other documents with more diverse but still-relevant context. MAX_CHUNKS_PER_DOC_TEAM_SEARCH = 2 enforces breadth across the result set when the search spans multiple documents.

Token-budgeted context assembly with citations. The context builder enforces a 6000-token cap and writes each chunk with a [N] filename header and --- separator. The synthesis system prompt binds Gemma to the supplied context, forbids external knowledge, and requires [Document: filename] citations — so answers are grounded and checkable, not filled in from the model's training data.
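The fairness cap and token budget compose into one selection loop. A sketch under assumed names (`Chunk`, `buildContext`); the constants are the ones from the tunables table:

```typescript
interface Chunk { docId: string; text: string }

const MAX_CONTEXT_TOKENS = 6000;
const MAX_CHUNKS_PER_DOC = 2;
const tokens = (s: string) => Math.ceil(s.length / 4);

// Walk chunks in rank order; skip any doc that already has 2 chunks in a
// team-wide search, and stop once the 6000-token budget is spent.
function buildContext(chunks: Chunk[], teamWide: boolean): Chunk[] {
  const perDoc = new Map<string, number>();
  const picked: Chunk[] = [];
  let budget = 0;
  for (const c of chunks) {
    const used = perDoc.get(c.docId) ?? 0;
    if (teamWide && used >= MAX_CHUNKS_PER_DOC) continue;
    if (budget + tokens(c.text) > MAX_CONTEXT_TOKENS) break;
    perDoc.set(c.docId, used + 1);
    picked.push(c);
    budget += tokens(c.text);
  }
  return picked;
}
```

Note the asymmetry: hitting the per-doc cap *skips* a chunk (a later doc may still fit), while hitting the token budget *stops* the loop, since everything after the break would rank lower anyway.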

Conversation history trim. Long chat sessions can blow the context window. I walk the history newest-first and trim it at MAX_HISTORY_TOKENS = 1500, so older turns drop off but the recent thread stays intact.
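A sketch of that trim, with `Turn` and `trimHistory` as illustrative names:

```typescript
interface Turn { role: "user" | "assistant"; content: string }

const MAX_HISTORY_TOKENS = 1500;

// Walk newest-first, keep turns until the budget is spent, then restore
// chronological order so the prompt reads naturally.
function trimHistory(history: Turn[]): Turn[] {
  const kept: Turn[] = [];
  let budget = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = Math.ceil(history[i].content.length / 4);
    if (budget + cost > MAX_HISTORY_TOKENS) break;
    kept.unshift(history[i]);
    budget += cost;
  }
  return kept;
}
```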

Reliability improvements

Eventual-consistency handling. Vectorize is eventually consistent — the upsert returning successfully doesn't mean the chunks are queryable yet. A document marked "ready" before its vectors are indexed produces empty searches and looks like a quality bug from the user's seat. I sample first/last chunk IDs after upsert via a Durable Object alarm, retry on failure (5 attempts × 90s), and only flip status to ready once both ends are visible — or to error if they never show up.

Fallback at every LLM call. Each LLM step (query expansion, reranking, synthesis) has an explicit fallback path. Expansion failure → original query only. Rerank failure or fewer than 2 passing chunks → vector ranking. Synthesis failure surfaces as a clear error rather than partial output. The pipeline never quietly produces lower-quality results without saying so in the logs.

Score-keeping

| Failure mode I observed | What I changed |
| --- | --- |
| Half-sentence chunks | Paragraph-aware chunker + overlap |
| Detached table values | Table flattener re-attaches headers |
| Back-referencing / short chunks unfindable | Contextual prepend before embed |
| Weak query embeddings | BGE instruction prefix |
| Vocabulary-mismatch misses | Query expansion + multi-vector merge |
| Low-quality top-K | 0.3 similarity floor |
| Vocab-match ≠ answer-match | Gemma rerank with 1-5 rubric |
| One doc dominates results | Per-document fairness cap |
| Hallucination | Citation-bound system prompt |
| Long chats blow context | History token cap |
| Vectors not yet indexed | DO alarm verification loop |

Part 3 — The NLP layer alongside RAG

RAG is the headline feature, but a fair amount of classical and LLM-based NLP runs alongside it during ingest. Each pass produces structured fields I persist on the documents row and surface in the UI, and several of them feed back into retrieval quality.

NLP fan-out during ingest

flowchart TB
    P[parseDocument<br/>PDF/DOCX/XLSX/OCR] --> TF[tableFlattener<br/>re-attach headers]
    TF --> R1[regex PII redactor<br/>SSN/SIN/CC/addr/...]
    R1 --> R2[LLM PII detector<br/>medical/financial/...]
    R2 --> CL[redacted text]
    CL --> N1[regex NER<br/>email/phone/date/$/%/url]
    CL --> N2[document classifier<br/>contract/financial/policy/<br/>hr/invoice/general]
    CL --> N3[document analysis<br/>title + description + topics +<br/>key points + people + orgs +<br/>sentiment + risk flags]
    CL --> N4[document review agent<br/>per-type checklist + risks]
    N1 --> M[merge entities<br/>regex + LLM]
    N3 --> M
    M --> DB[(D1: documents row)]
    N2 --> DB
    N3 --> DB
    N4 --> DB

Layered PII redaction (regex + LLM)

Privacy is the headline requirement of LocalMind, so PII handling has two layers — deterministic patterns first, LLM second.

flowchart LR
    T[chunk text] --> R[regex pass<br/>12 categories]
    R --> P[LLM pass<br/>10 sensitive categories]
    P --> O[fully redacted text]
    R -.detected types.-> M[piiTypes set]
    P -.detected categories.-> M
    M --> DB[(documents.pii_types)]

Regex pass

I wrote a pattern bank for the 12 deterministic PII categories: SSN, SIN, credit card, IP address, DOB, Canadian postal code, US ZIP, street address, PO box, health card, passport number, driver's license. Three redaction modes: full ([SSN REDACTED]), partial (***-**-1234), and strip (delete the match).

Partial mode is the default. It keeps the last few digits or characters visible so a human reviewer can still cross-reference (****-****-****-9876) without exposing the full value. Each pattern has its own masking function so the partial form makes sense for the data type — for a postal code I keep the FSA, for an address I keep the street type word, for a phone-like field I keep the last 4.
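A sketch of per-type partial masking, assuming hypothetical `maskers` and `redactPartial` names and simplified patterns (the real pattern bank covers 12 categories):

```typescript
// One masking function per category, so the partial form makes sense
// for the data type. Patterns here are simplified for illustration.
const maskers: Record<string, (m: string) => string> = {
  creditCard: m => "****-****-****-" + m.replace(/\D/g, "").slice(-4),
  ssn: m => "***-**-" + m.replace(/\D/g, "").slice(-4),
  postalCode: m => m.slice(0, 3).toUpperCase() + " ***", // keep the FSA
};

function redactPartial(text: string): string {
  return text
    .replace(/\b(?:\d[ -]?){13,16}\b/g, m => maskers.creditCard(m))
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, m => maskers.ssn(m))
    .replace(/\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b/g, m => maskers.postalCode(m));
}
```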

LLM pass

Regex can't catch unstructured sensitive PII like "diagnosed with Type 2 diabetes" or "donated to the Liberal Party". For that I run Gemma over the regex-cleaned text with a JSON schema that enforces 10 sensitive categories: medical condition, medication, financial detail, ethnic/racial, religious, sexual orientation, criminal/legal, biometric, family relationship, political opinion.

The model returns exact text spans (not paraphrases), which I apply via replaceAll. That keeps redaction lossless and the labels predictable ([MEDICAL CONDITION REDACTED], etc). I framed the categories against PIPEDA / PHIPA so the output maps to a real Canadian compliance posture.

Both passes union their detected types into the documents.pii_types array, which the UI shows as a chip strip.

Hybrid named-entity recognition

Two NER passes run on every document and their outputs are merged before persisting.

Regex NER

Six deterministic categories with a pattern bank:

| Type | Pattern handles |
| --- | --- |
| email | standard RFC-ish addresses, lowercased on normalize |
| phone | NANP with optional country code, separators stripped on normalize |
| date | ISO 2024-03-15, slashed 3/15/24, written March 15, 2024, fiscal Q3 2024 |
| currency | $, €, £ with K/M/B suffixes ($4.2M) |
| percentage | 12.5% |
| url | http(s)://... |

After matching, each (type, normalized-value) pair is counted, and the result is sorted by frequency. The most-mentioned entities float to the top of the entities list — useful for "what is this document about" at a glance.
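The count-and-sort step can be sketched like this, with simplified patterns for three of the six categories (names and regexes are illustrative, not the production bank):

```typescript
interface Entity { type: string; value: string; count: number }

const NER_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.\w+/g,
  percentage: /\d+(?:\.\d+)?%/g,
  currency: /[$€£]\d[\d,.]*[KMB]?/g,
};

// Count each (type, normalized value) pair, then sort by frequency so the
// most-mentioned entities float to the top.
function extractEntities(text: string): Entity[] {
  const counts = new Map<string, number>();
  for (const [type, re] of Object.entries(NER_PATTERNS)) {
    for (const m of text.matchAll(re)) {
      const key = `${type}\u0000${m[0].toLowerCase()}`; // normalize: lowercase
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([key, count]) => {
      const [type, value] = key.split("\u0000");
      return { type, value, count };
    })
    .sort((a, b) => b.count - a.count);
}
```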

LLM NER

The document analysis pass also returns people and organizations arrays — the fuzzy entities that regex can't reliably extract. These get merged with the regex entities and stored on documents.entities. So the entities list ends up being a hybrid: deterministic stuff with frequency counts plus LLM-extracted names.

Table structure recovery

This is the bit I'm proudest of, and it's a pure-NLP problem disguised as text processing.

Document parsers (PDF text extraction, XLSX) destroy table semantics: column headers end up on one line, data rows on subsequent lines, and after chunking the headers and values are completely disconnected. A row that originally read "Q3 2024 | Revenue | $4.2M | +18%" comes out of the parser as a strip of numbers with no header context, which is useless for both retrieval and reading.

The flattener detects two table shapes and re-emits each row as a Header: Value | Header: Value | … line.

flowchart TB
    L[parsed lines] --> D{table?}
    D -->|"CSV: 3+ comma fields, 2+ rows"| FC[flatten CSV]
    D -->|"whitespace: 3+ cols, 3+ rows"| FW[flatten whitespace]
    D -->|no| K[keep line]
    FW --> G["skip if numeric headers<br/>or wide financial"]
    FC --> O["Header1: Val1 #124; Header2: Val2 #124; ..."]
    FW --> O
    G --> K
    K --> R[output text]
    O --> R

The interesting parts:

  • CSV detection guards against $32,100 false positives — if any field has a run of 2+ whitespace inside it, the commas are probably inside numbers, not delimiters. The detector rejects.
  • Whitespace-aligned table detection finds gap regions (positions where 2+ consecutive characters are spaces across all rows) to infer column boundaries. It needs at least 3 rows so single-line false positives don't slip through.
  • Financial-statement guard. Wide whitespace tables where the "headers" are mostly numeric (column dates, $ totals) get skipped — flattening makes them harder to read, not easier. Narrow invoice/pricing tables (3-4 columns) still get flattened.
  • Sanity check. If the flattened output is more than 2× the input length, something went wrong with boundary detection — fall back to the original.
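The CSV false-positive guard from the first bullet fits in a few lines. A sketch with `looksLikeCsvRow` as an illustrative name:

```typescript
// Commas inside numbers like "$32,100" tend to co-occur with runs of
// alignment whitespace in the same "field"; real CSV fields don't.
function looksLikeCsvRow(line: string): boolean {
  const fields = line.split(",");
  if (fields.length < 3) return false;        // need 3+ comma-delimited fields
  return !fields.some(f => /\s{2,}/.test(f)); // padded field → commas are in numbers
}
```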

Each flattened row becomes a self-contained semantic unit that embeds well, so a query like "Q3 revenue" can match against the actual data.

Document classification

A single Gemma call buckets each document into one of six types: contract, financial, policy, hr, invoice, general. Falls through to general if the model misbehaves.

The classification is the routing key for the downstream review agent — each document type gets a different review checklist. So classification isn't just a UI tag, it's a control-flow decision.

Document analysis (summary + topics + sentiment)

A single Gemma call (generateDocumentAnalysis) produces seven structured fields under one JSON schema:

| Field | Type | What it is |
| --- | --- | --- |
| title | string | descriptive title |
| description | string | 1-2 sentence overview |
| keyTopics | string[] | 3-5 short topic strings |
| keyPoints | string[] | 3-6 bullet points of the most important facts |
| people | string[] | person names (LLM NER) |
| organizations | string[] | company/org names (LLM NER) |
| sentiment | object | structured sentiment, see below |

Sentiment itself is structured:

{
  "label": "positive | negative | neutral | mixed",
  "confidence": 0.0,
  "riskFlags": ["legal risk", "compliance concern", "..."]
}

The free-form riskFlags array is the design choice I want to highlight. Instead of forcing the model to pick from a fixed risk taxonomy, I let it call out concerning content in its own words, then let the UI surface those flags. That way the model can flag a "supply-chain concentration risk" without needing me to predefine that bucket.

There's a plain-text fallback path: if the schema-constrained call fails, I try a simple "summarize this document in 2-3 sentences" prompt and persist that as the description. I'd rather degrade gracefully than leave a document with no summary at all.

Document review agent

Layered on top of classification, the review agent runs a per-document-type review (checklist items + extracted fields + risk brief), persisted as review_json. This is where domain knowledge gets injected — different rubrics per document type rather than a one-size-fits-all summary. A contract gets a contract checklist; an invoice gets an invoice checklist.

JSON-schema constraints everywhere

Every LLM call I wrote (PII detection, query expansion, reranking, document analysis, review) uses Workers AI's response_format: { type: 'json_schema', json_schema: ... } to force well-typed output.

This matters for three reasons:

  1. No regex-parsing of model output. Either JSON.parse succeeds and the schema validates, or I fall back to a simpler path.
  2. Failure modes are explicit. A schema validation failure is a clear log line, not a silent quality regression.
  3. Prompts stay shorter. The schema documents the contract, so the system prompt doesn't need to repeat "respond with JSON in the format…".
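Using the classifier as the example, a schema-constrained call plus fallback looks roughly like this. The prompt wording and schema body are illustrative, not lifted from the production source; the `response_format` shape is the one described above:

```typescript
interface AiLike {
  run(model: string, input: unknown): Promise<{ response?: string }>;
}

const GEMMA_MODEL = "@cf/google/gemma-4-26b-a4b-it";
const DOC_TYPES = ["contract", "financial", "policy", "hr", "invoice", "general"];

async function classifyDocument(ai: AiLike, text: string): Promise<string> {
  try {
    const res = await ai.run(GEMMA_MODEL, {
      messages: [
        { role: "user", content: `Classify this document into one type.\n\n${text}` },
      ],
      response_format: {
        type: "json_schema",
        json_schema: {
          type: "object",
          properties: { type: { type: "string", enum: DOC_TYPES } },
          required: ["type"],
        },
      },
    });
    const parsed = JSON.parse(res.response ?? "");
    // either the schema held and the value validates, or we fall back
    return DOC_TYPES.includes(parsed.type) ? parsed.type : "general";
  } catch {
    return "general"; // model misbehaved → explicit fallback, clear log line
  }
}
```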

Where it all lands

Everything above writes to columns on the documents row:

| Column | Source |
| --- | --- |
| summary | generateDocumentAnalysis (title + description + key topics + key points) |
| pii_redacted, pii_types, pii_count | regex + LLM PII passes |
| entities | regex NER + LLM NER, merged |
| sentiment_label, sentiment_json | generateDocumentAnalysis |
| review_json | runDocumentReview (classifier + per-type rubric) |
| processing_stage | DO updates as the pipeline progresses (scanning_pii, embedding, analyzing, finalizing) |

The frontend reads these straight off the documents API and renders them as panels on the document detail page, so the NLP work is visible to users as soon as ingest finishes.


What's next

  • Hybrid retrieval (BM25 alongside vectors) is the obvious next move — at the 0.3 BGE floor I'm sometimes missing exact-keyword queries.
  • Caching contextual headers per (doc, paragraph hash) would cut ingest latency on large docs significantly.
  • Dedicated cross-encoder reranker (e.g. bge-reranker-base) would be cheaper and likely better-calibrated than reusing Gemma 4.
  • A retrieval eval harness — even a small hand-labeled set of 50 query/relevant-chunk pairs — would let me regress these knobs against a number instead of intuition.

If you're working on RAG, edge AI, or PIPEDA-aware document handling, I'd love to compare notes — drop a comment or open a thread on caseonix.ca.

Live app: localmind.caseonix.ca
