<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Budanov</title>
    <description>The latest articles on DEV Community by Alexander Budanov (@alexbudanov).</description>
    <link>https://dev.to/alexbudanov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875125%2Fe9a416a4-9548-479a-b548-1361cfe29aa3.png</url>
      <title>DEV Community: Alexander Budanov</title>
      <link>https://dev.to/alexbudanov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexbudanov"/>
    <language>en</language>
    <item>
      <title>How I Chose an Embedding Model for Bug Report Deduplication</title>
      <dc:creator>Alexander Budanov</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:58:09 +0000</pubDate>
      <link>https://dev.to/alexbudanov/how-i-chose-an-embedding-model-for-bug-report-deduplication-1lgb</link>
      <guid>https://dev.to/alexbudanov/how-i-chose-an-embedding-model-for-bug-report-deduplication-1lgb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I'm the founder of &lt;a href="https://apexbridge.tech" rel="noopener noreferrer"&gt;Apex Bridge Technology&lt;/a&gt; and the creator of &lt;a href="https://bugspotter.io" rel="noopener noreferrer"&gt;BugSpotter&lt;/a&gt;. This benchmark was built to make a real product decision — which embedding model ships with BugSpotter. I'm publishing the methodology, data, and code so you can verify the numbers and make your own call.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I benchmarked 6 self-hosted embedding models against TF-IDF/BM25 baselines on 650 bug reports — including 250 real SDK captures collected via Playwright from the &lt;a href="https://github.com/apex-bridge/bugspotter" rel="noopener noreferrer"&gt;BugSpotter&lt;/a&gt; demo app — and cross-validated on 407 real Mozilla Bugzilla bugs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plain BM25 beats most embedding models on real-world data.&lt;/strong&gt; On Mozilla Bugzilla, whitespace-tokenized BM25 scores F1=0.954 — beating &lt;code&gt;bge-m3&lt;/code&gt; (0.948), &lt;code&gt;nomic&lt;/code&gt; (0.894), and &lt;code&gt;snowflake&lt;/code&gt; (0.872), narrowly beating &lt;code&gt;all-minilm&lt;/code&gt; (0.952), and losing only to &lt;code&gt;qwen3&lt;/code&gt; (0.966) and &lt;code&gt;mxbai&lt;/code&gt; (0.962). &amp;lt;1ms per pair, no Ollama, no vector DB. &lt;strong&gt;If your bug reports are English plain text, BM25 is probably the right answer&lt;/strong&gt; — reach for embeddings only for multilingual matching or vague UI-interaction bugs. See Cross-Validation on Mozilla Bugzilla for the full ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Among embedding models: qwen3 leads; bge-m3 and mxbai tied for second.&lt;/strong&gt; qwen3 (CV F1=0.990) beats bge-m3 (0.986) and mxbai (0.984) by ~0.004-0.006 F1 — small but consistent across 3 seeds. bge-m3 and mxbai overlap on bootstrap CIs; you can't rank them. Pick qwen3 if you can afford 2.7s latency; otherwise choose between bge-m3 and mxbai on deployment constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Field-weighting BM25 overfits, even at small grid size.&lt;/strong&gt; "Tuned" BM25F with grid-searched field weights scores 0.923 oracle F1 but only 0.872 ± 0.012 under proper 5-fold CV — a 5-point overfitting gap from just 6 weight configs on 4,475 pairs. Plain BM25 (no fields) at 0.951 beats every BM25F variant. The simpler the lexical method, the better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What the SDK captures beats what users type.&lt;/strong&gt; Machine-captured fields (console errors, network logs, stack traces) take F1 from 0.951 (title only) to 0.990 (full capture).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thresholds don't transfer between models &lt;em&gt;or&lt;/em&gt; datasets.&lt;/strong&gt; Optimal on my synthetic set: 0.62–0.73. Optimal on Mozilla Bugzilla: 0.27–0.62. The commonly-cited 0.9 misses 42–78% of duplicates on my data. Tune on your own labeled pairs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip to Recommendations →&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All code, data, and results: &lt;a href="https://github.com/apex-bridge/bugspotter-embedding-benchmark" rel="noopener noreferrer"&gt;github.com/apex-bridge/bugspotter-embedding-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;If you run a bug tracker — whether it's Jira, Linear, a self-hosted tool, or something you built yourself — you've seen this: the same bug reported three times by three different people, in three different ways.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BUG-1041:&lt;/strong&gt; Checkout button doesn't work after coupon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUG-1042:&lt;/strong&gt; Can't complete purchase with promo code active.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUG-1043:&lt;/strong&gt; klick on 'place order' does nothing with cupon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same bug. Three tickets. Three engineers triage it. One fixes it, two discover it's already fixed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The embedding approach
&lt;/h3&gt;

&lt;p&gt;The modern solution is vector similarity: embed each bug report into a high-dimensional vector, store it in a vector database, and when a new report comes in — find the nearest neighbors. If the cosine similarity is above some threshold, flag it as a potential duplicate. (Cosine similarity measures how close two vectors are — 1.0 = identical, 0.0 = unrelated. F1 score balances precision and recall — 1.0 = perfect, and you want it as high as possible.)&lt;/p&gt;
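&lt;p&gt;The core decision reduces to a few lines. A minimal sketch in NumPy — the 3-dimensional vectors are purely illustrative (real embeddings have 384–4096 dims), and 0.73 is just one model's optimum from the results below:&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (maximally similar), 0.0 = orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_vec: np.ndarray, existing_vec: np.ndarray,
                 threshold: float = 0.73) -> bool:
    # The threshold is model-specific — tune it on your own labeled pairs
    return cosine_similarity(new_vec, existing_vec) >= threshold

a = np.array([0.2, 0.9, 0.1])     # "checkout broken with coupon"
b = np.array([0.25, 0.85, 0.15])  # paraphrase of the same bug
c = np.array([0.9, -0.1, 0.4])    # unrelated bug
```

&lt;p&gt;In production the "find nearest neighbors" step is the vector database's job; the threshold decision stays yours.&lt;/p&gt;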

&lt;p&gt;Simple, elegant, and well-understood. Except for three questions nobody answers well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Which embedding model?&lt;/strong&gt; MTEB leaderboards rank models on academic benchmarks (STS, NLI, retrieval). Bug reports are none of these — they're short, technical, full of stack traces and error codes. A model that tops MTEB might fail on &lt;code&gt;TypeError: Cannot read properties of undefined&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What threshold?&lt;/strong&gt; The commonly cited threshold of 0.9 for near-duplicate detection comes from general-purpose NLP — not from bug reports with console errors and stack traces. Is 0.9 actually right for this domain?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What text to embed?&lt;/strong&gt; A bug report has a title, description, console logs, network errors, stack traces, browser info. Which parts should go into the embedding? Just the title? Everything? Does including the stack trace help or hurt?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why self-hosted matters
&lt;/h3&gt;

&lt;p&gt;You could use OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; and call it a day. But if you're building a self-hosted tool — or your users care about data privacy — think about what's in a bug report. Stack traces contain file paths, variable names, internal URLs. Console errors reveal your tech stack. For regulated industries, sending this to an external API is a non-starter.&lt;/p&gt;

&lt;p&gt;I needed a model that runs locally via Ollama, fits on a budget server, and gives production-quality dedup — without any data leaving the network. Sentry has &lt;a href="https://blog.sentry.io/how-sentry-decreased-issue-noise-with-ai/" rel="noopener noreferrer"&gt;validated embedding-based grouping at scale&lt;/a&gt; — but on a narrower task (error signature grouping, not free-text bug dedup) with a custom fine-tuned model, not off-the-shelf embeddings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this benchmark answers
&lt;/h3&gt;

&lt;p&gt;Existing benchmarks (&lt;a href="https://arxiv.org/abs/2308.09193" rel="noopener noreferrer"&gt;Patil et al. 2023&lt;/a&gt;, &lt;a href="https://dl.acm.org/doi/abs/10.1145/3576042" rel="noopener noreferrer"&gt;Zhang et al. 2023&lt;/a&gt;) evaluate on Mozilla/Eclipse where bug reports are plain-text title + description — no structured fields. &lt;a href="https://arxiv.org/abs/2412.14802" rel="noopener noreferrer"&gt;Shibaev et al. 2025&lt;/a&gt; covers stack traces specifically but not full bug context. My benchmark complements these by testing on the kind of structured captures modern SDKs emit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 models&lt;/strong&gt; from 22M to 7.6B parameters, all running through Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;650 bug reports&lt;/strong&gt; — 100 real GitHub issues + 300 synthetic paraphrases (30 archetypes x 10) + 250 real SDK captures (25 bugs x 10, collected via Playwright from the BugSpotter demo app)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4,475 labeled pairs&lt;/strong&gt; including hard negatives (different bugs that look similar)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 vector stores&lt;/strong&gt; compared head-to-head&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything reproducible&lt;/strong&gt; — one script, one €25/mo server, MIT-licensed code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: a practical answer to "which model, what threshold, what text, what store" — backed by numbers, not opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup &amp;amp; Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hardware
&lt;/h3&gt;

&lt;p&gt;No GPU. Hetzner CPX42 — 8 vCPU (AMD EPYC), 16 GB RAM, €25/mo. The kind of box most self-hosted setups actually run on. I ran the full pipeline 3 times on identical instances (seeds 42/123/456) to verify stability.&lt;/p&gt;

&lt;p&gt;Everything runs in Docker: Ollama for embedding inference, PostgreSQL 16 + pgvector for vector storage, Qdrant for comparison. ChromaDB and sqlite-vec run embedded in Python — no additional containers.&lt;/p&gt;

&lt;p&gt;The entire stack starts with &lt;code&gt;docker compose up -d&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6 models
&lt;/h3&gt;

&lt;p&gt;All models run through Ollama's &lt;code&gt;/api/embed&lt;/code&gt; endpoint. No fine-tuning, no custom configurations — just pull and run. Each report is embedded individually (batch size 1), using default &lt;code&gt;num_ctx&lt;/code&gt;. Note: Ollama has had documented embedding consistency issues across versions (&lt;a href="https://github.com/ollama/ollama/issues/3777" rel="noopener noreferrer"&gt;#3777&lt;/a&gt;, &lt;a href="https://github.com/ollama/ollama/issues/4207" rel="noopener noreferrer"&gt;#4207&lt;/a&gt;). I used Ollama v0.20.7 and Python 3.12. Results may differ on other versions — I recommend pinning the Ollama version in production.&lt;/p&gt;
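&lt;p&gt;For reference, embedding one report through that endpoint takes only a few lines (stdlib only; assumes a local Ollama server with the model already pulled):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"

def build_embed_request(model: str, text: str) -> dict:
    # "input" accepts a single string (batch size 1, as in the benchmark)
    # or a list of strings for batching
    return {"model": model, "input": text}

def embed(model: str, text: str) -> list[float]:
    """POST one bug report to a local Ollama server, return its vector."""
    payload = json.dumps(build_embed_request(model, text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"][0]

# Requires `ollama pull bge-m3` and a running server:
# vec = embed("bge-m3", "Checkout button unresponsive after coupon applied")
```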

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;Max Tokens&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22M&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;334M&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6B&lt;/td&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why these six: they span 22M to 7.6B parameters, they're all available in the Ollama registry, and they represent different architectures and training approaches. all-minilm is the baseline (smallest, fastest — it's what BugSpotter originally shipped with, and this benchmark is the reason we moved off it). qwen3-embedding is the ceiling (SOTA on MTEB, but quantized to fit in 16GB RAM). Note that qwen3 is architecturally different — it's a decoder-based model using last-token pooling from a full LLM, not a BERT-style encoder. This explains the 95x latency gap vs all-minilm.&lt;/p&gt;

&lt;p&gt;Notable absence: &lt;code&gt;qwen3-embedding:4b&lt;/code&gt; — the model most likely to hit the sweet spot between all-minilm and qwen3-embedding:7.6B. I'll add it when Ollama publishes the weights.&lt;/p&gt;

&lt;p&gt;Note: nomic-embed-text MRL truncation requires &lt;strong&gt;v1.5+&lt;/strong&gt;. If you pull the default &lt;code&gt;nomic-embed-text&lt;/code&gt; without specifying the version, you may get v1 with fixed 768 dims and no truncation support.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;There's no public benchmark for duplicate bug report detection with structured fields (console logs, network errors, stack traces). Existing academic datasets (Mozilla Bugzilla, Eclipse) contain only title + description as plain text. So I built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;650 bug reports from 3 sources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 1: Real GitHub Issues (100 reports).&lt;/strong&gt; Scraped from major open-source repositories using the GitHub API. Only issues labeled &lt;code&gt;bug&lt;/code&gt; that contain error messages or stack traces in code blocks. These provide realistic vocabulary and formatting diversity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 2: Synthetic bug reports (300 reports).&lt;/strong&gt; 30 bug archetypes — each representing a common frontend error pattern (checkout failures, CORS errors, memory leaks, hydration mismatches, useEffect loops, etc.) — with 10 variations each: the original plus 9 paraphrases. The paraphrases are semantically equivalent but lexically different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original: &lt;em&gt;"Checkout button unresponsive after coupon applied"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Paraphrase: &lt;em&gt;"Cannot complete purchase with promo code active"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Noisy: &lt;em&gt;"klick on 'place order' does nothing with cupon"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Truncated: &lt;em&gt;"Coupon breaks checkout CTA"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paraphrases were generated with AI assistance and manually reviewed to ensure semantic equivalence while maximizing lexical diversity. I acknowledge this limits stylistic diversity compared to real multi-author bug reports (see Limitations). This forces models to understand meaning, not just match words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 3: Real SDK captures (250 reports).&lt;/strong&gt; 25 bugs x 10 variations each, captured via Playwright from the BugSpotter demo app. These are real bug reports in the SDK's structured format — with console errors, network logs, stack traces, and browser metadata. Unlike v1's 40 synthetic SDK captures, these are genuine captures from a running application, making them far more realistic. These bugs are deliberately placed in the &lt;strong&gt;same components&lt;/strong&gt; as the archetypes (checkout, auth, feed, modal) but describe &lt;strong&gt;different problems&lt;/strong&gt;. This creates natural hard negatives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dataset caveat:&lt;/strong&gt; 300 of 650 reports are synthetic paraphrases. This likely makes F1 scores higher than you'd see on real multi-author bug reports. Treat the numbers as relative model rankings, not production expectations. This benchmark tests semantic similarity between bug reports — a proxy for, but not identical to, real-world duplicate detection where two reports about the same bug may be textually very different.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Ground truth: 4,475 labeled pairs
&lt;/h3&gt;

&lt;p&gt;Every pair is labeled as &lt;code&gt;duplicate&lt;/code&gt; or &lt;code&gt;not_duplicate&lt;/code&gt; across four difficulty levels. D3 is the critical category — "Checkout button broken with coupon" vs "Checkout total shows NaN after removing last item" — both mention checkout, both are bugs, but they're different problems. A model that only matches keywords will fail here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding text construction
&lt;/h3&gt;

&lt;p&gt;Each bug report is converted to a single text string for embedding, matching the production &lt;code&gt;build_embedding_text()&lt;/code&gt; function. I tested four simpler strategies (title only, title + description, etc.) — results in the Deep Dives section.&lt;/p&gt;

&lt;p&gt;Full embedding text format:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;title
| description
| console_errors (up to 5)
| failed_network_requests (up to 3)
| Browser: X
| OS: Y
| Page: /path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;All parts joined with &lt;code&gt;|&lt;/code&gt; (pipe separator was inherited from production code; not ablated in this benchmark).&lt;/p&gt;
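&lt;p&gt;A sketch of that construction — field names and caps are as described above, but this is an illustration, not the real &lt;code&gt;build_embedding_text()&lt;/code&gt; (that lives in the benchmark repo):&lt;/p&gt;

```python
def build_embedding_text(report: dict) -> str:
    """Flatten a structured bug report into one pipe-joined string,
    following the field order above. Caps mirror the benchmark:
    at most 5 console errors, at most 3 failed network requests."""
    parts = [report["title"]]
    if report.get("description"):
        parts.append(report["description"])
    parts.extend(report.get("console_errors", [])[:5])
    parts.extend(report.get("failed_network_requests", [])[:3])
    for label, key in (("Browser", "browser"), ("OS", "os"), ("Page", "page")):
        if report.get(key):
            parts.append(f"{label}: {report[key]}")
    return " | ".join(parts)

report = {
    "title": "Checkout button unresponsive after coupon applied",
    "description": "Clicking 'Place order' does nothing once a promo code is active.",
    "console_errors": ["TypeError: Cannot read properties of undefined"],
    "failed_network_requests": ["POST /api/checkout 500"],
    "browser": "Chrome 124",
    "page": "/checkout",
}
text = build_embedding_text(report)
```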

&lt;h3&gt;
  
  
  Evaluation pipeline
&lt;/h3&gt;

&lt;p&gt;For each of the 6 models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt; all 650 reports (3 passes — 1 cold, 2 warm, take median latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt; cosine similarity for all 4,475 pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sweep&lt;/strong&gt; threshold from 0.50 to 0.99 (step 0.01), compute precision, recall, F1 at each&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find&lt;/strong&gt; optimal threshold (max F1) and compute ROC-AUC&lt;/li&gt;
&lt;/ol&gt;
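&lt;p&gt;Steps 3–4 are a plain grid search. A sketch in pure Python — &lt;code&gt;pairs&lt;/code&gt; holds illustrative (similarity, is_duplicate) tuples, not benchmark data:&lt;/p&gt;

```python
def f1_at(threshold, pairs):
    """Precision/recall/F1 treating sim >= threshold as 'duplicate'."""
    tp = sum(1 for sim, dup in pairs if sim >= threshold and dup)
    fp = sum(1 for sim, dup in pairs if sim >= threshold and not dup)
    fn = sum(1 for sim, dup in pairs if sim < threshold and dup)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def sweep(pairs, lo=0.50, hi=0.99, step=0.01):
    """Grid-search the threshold, return (best_threshold, best_f1).
    [0.50, 0.99] suits cosine similarities; normalized lexical scores
    (BM25) sit much lower and need lo=0.00."""
    n = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 2) for i in range(n + 1)]
    return max(((t, f1_at(t, pairs)) for t in grid), key=lambda x: x[1])

# Illustrative pairs: (cosine similarity, labeled duplicate?)
pairs = [(0.95, True), (0.82, True), (0.71, False), (0.40, False)]
best_threshold, best_f1 = sweep(pairs)
```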

&lt;p&gt;Between models, I explicitly unload the previous model from Ollama to prevent memory cross-contamination. Latency variance across warm runs was low (&amp;lt;5% CV for all models), confirming median values are representative.&lt;/p&gt;

&lt;p&gt;To verify reproducibility, I ran the full pipeline 3 times on separate Hetzner CPX42 instances (seeds 42, 123, 456 for synthetic data generation). Total cost: 3 × ~€0.20 = €0.60, ~5 hours per VM. The seeds affect only minor noise injection in paraphrases — the models, SDK captures, and core dataset are identical across runs. Ollama embeddings are deterministic at batch size 1 in this configuration (I verified by embedding the same report multiple times and getting identical vectors). All results in this article report mean ± std across these 3 runs.&lt;/p&gt;

&lt;p&gt;A TF-IDF baseline (sklearn TfidfVectorizer with sublinear_tf and bigrams) runs on the same pairs for comparison — same input text, different vectorization method.&lt;/p&gt;
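&lt;p&gt;A minimal version of that baseline (scikit-learn; toy documents here — in the benchmark the input text is identical to what the embedding models see):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Checkout button unresponsive after coupon applied",        # original
    "Cannot complete purchase when the coupon code is active",  # paraphrase
    "Login page renders blank on refresh",                      # different bug
]

# sublinear_tf + word bigrams, matching the baseline described above
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
sims = cosine_similarity(vectorizer.fit_transform(docs))

dup_score, unrelated_score = sims[0, 1], sims[0, 2]
```

&lt;p&gt;On this toy corpus the paraphrase pair outscores the unrelated pair purely through the shared "coupon" token — exactly the keyword dependence embeddings are supposed to transcend.&lt;/p&gt;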

&lt;p&gt;The full pipeline runs end-to-end with one command: &lt;code&gt;./deploy/run_clean.sh --seed 42&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: The Numbers
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads-up:&lt;/strong&gt; this section ranks the 6 embedding models plus lexical baselines on the synthetic benchmark. The headline finding — &lt;em&gt;whether you need embeddings at all&lt;/em&gt; — comes from the independent Mozilla Bugzilla data in Cross-Validation on Mozilla Bugzilla. If you only want to know "BM25 or embeddings?", jump there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The main table
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;F1 is 5-fold cross-validated: threshold is picked on 4 train folds, F1 is measured on the held-out fold, results are averaged across the 5 folds. Threshold column shows the mean train-fold optimum. All values mean across 3 seeds.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;CV F1&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;&lt;a href="mailto:Recall@0.9"&gt;Recall@0.9&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6B&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;2,662ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;29%&lt;/td&gt;
&lt;td&gt;268ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;0.984&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;td&gt;224ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;all-minilm&lt;/strong&gt; †&lt;/td&gt;
&lt;td&gt;22M&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;334M&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;220ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;td&gt;82ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.17&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.0%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.13&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.1%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25F (field-weighted, default)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.923&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.09&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.7%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25F tuned (5-fold CV)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.872&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.05&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.0%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;† all-minilm is evaluated on 4,415 pairs (the other models on 4,475). Sixty pairs are dropped because 76 reports exceed all-minilm's 256-token context window — including all 10 reports each from &lt;code&gt;sdk_json_parse_crash&lt;/code&gt;, &lt;code&gt;sdk_rate_limit_429&lt;/code&gt;, and &lt;code&gt;sdk_zindex_conflict&lt;/code&gt;, plus 12 GitHub issues. all-minilm cannot embed these at all. In production that means all-minilm will silently fail on long bug reports; the F1 here is an optimistic upper bound over the pairs it can handle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Four things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. qwen3 leads; bge-m3 and mxbai tied for second.&lt;/strong&gt; Bootstrap 95% CIs (1000 resamples over pairs, evaluated at the CV-picked threshold) give qwen3 [0.989–0.992], bge-m3 [0.985–0.988], and mxbai [0.983–0.987]. qwen3's lower bound sits at or above bge-m3's upper bound — qwen3's lead over the #2/#3 pair is small (~0.004) but consistent across seeds. bge-m3 and mxbai overlap — you can't rank them from this data. The bottom tier (all-minilm, snowflake, nomic) sits cleanly below. Archetype-level CV (holding out entire bug types, not random pairs) drops F1 by only 0.003–0.005. Pick from the top tier based on latency, dimensions, and hard-negative error rates — not on the headline F1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lexical baselines are surprisingly strong — but only the simplest ones.&lt;/strong&gt; TF-IDF scores F1=0.973, plain BM25 scores 0.951. The best embedding (qwen3) beats TF-IDF by ~1.7 points and the #2 embedding (bge-m3) by only ~1.3. Embeddings still win, but not by much. The lexical picture gets more interesting in the other direction: &lt;strong&gt;adding field weighting actively hurts&lt;/strong&gt;. Default BM25F (weights: title=3, desc=1, console=2, network=1.5) scores 0.923 — 5 points &lt;em&gt;below&lt;/em&gt; plain BM25. And "tuning" the weights by grid search under proper 5-fold CV scores only 0.872 ± 0.012 — 5 more points below default. The grid-searched weights look good on the fold they were picked on (oracle F1 = 0.923, essentially matching default) but don't generalize to held-out folds.&lt;/p&gt;

&lt;p&gt;So on this data, the best lexical method is the simplest one: plain BM25 with whitespace tokenization and no fields at all. Porter stemming makes it collapse to F1=0.038 because it turns "undefined" into "undefin" and "CORS" into "cor", destroying the exact-token matching that makes stack traces and error IDs distinctive. &lt;strong&gt;And on Mozilla Bugzilla (next section), plain BM25 &lt;em&gt;beats&lt;/em&gt; most of the embedding models&lt;/strong&gt; — so "use BM25, skip the stemming and field weights" is a surprisingly defensible default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Latency varies ~95x.&lt;/strong&gt; all-minilm embeds a bug report in 28ms. qwen3 takes 2,662ms — nearly 3 seconds. For real-time "is this a duplicate?" on bug submit, that's the difference between invisible and noticeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Recall@0.9 column is the most important one.&lt;/strong&gt; It shows what happens if you use the commonly cited 0.9 threshold. Even the best model (qwen3) would catch only 58% of duplicates. The worst (snowflake) catches 22%. At threshold 0.9, your dedup system is silently missing a large share of duplicates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeqt86vaw1osb2ok4m76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeqt86vaw1osb2ok4m76.png" alt="Embedding Models for Bug Deduplication: F1 vs Latency" width="800" height="577"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Each bubble is a model. X = median latency, Y = F1 score, size = parameter count. The top-left corner (high F1, low latency) is the sweet spot — bge-m3 and mxbai come closest.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on absolute numbers:&lt;/strong&gt; these F1 scores reflect this benchmark dataset with controlled synthetic paraphrases. Real-world performance will be lower — treat these as relative rankings between models, not expected production metrics.&lt;/p&gt;
&lt;/blockquote&gt;
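&lt;p&gt;The winning lexical baseline above — plain BM25, whitespace tokens, no stemming, no field weights — is short enough to write from scratch. A sketch of the standard Okapi scoring (toy corpus; the benchmark's implementation may differ in normalization details):&lt;/p&gt;

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 with whitespace tokenization — no stemming, no fields."""
    docs = [doc.split() for doc in corpus]         # whitespace tokens only
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "TypeError: Cannot read properties of undefined in checkout.js",
    "CORS error on POST /api/checkout",
    "Login session expires immediately after refresh",
]
scores = bm25_scores("Cannot read properties of undefined", corpus)
```

&lt;p&gt;Whitespace tokenization is what keeps exact tokens like &lt;code&gt;checkout.js&lt;/code&gt; and &lt;code&gt;undefined&lt;/code&gt; intact — run the same corpus through a Porter stemmer and the distinctive tokens blur together, which is the F1=0.038 collapse described above.&lt;/p&gt;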

&lt;h3&gt;
  
  
  Robustness: hard pairs only (D2 + D3)
&lt;/h3&gt;

&lt;p&gt;A fair critique: the full-set F1 includes 1,398 D4 "easy negative" pairs (different bugs, different components) that any sensible model should nail. Are the rankings just being propped up by the easy cases?&lt;/p&gt;

&lt;p&gt;I re-swept thresholds on the D2 + D3 subset only (paraphrases + hard negatives, 3,062 pairs per seed). Rankings are unchanged and F1s are stable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;F1 (full, re-swept)&lt;/th&gt;
&lt;th&gt;F1 (D2+D3 only, re-swept)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.991&lt;/td&gt;
&lt;td&gt;0.992&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.987&lt;/td&gt;
&lt;td&gt;0.989&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;0.987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;0.974&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.978&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.960&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Both columns are F1 at the optimal threshold for their respective pair set — not the 5-fold CV number from the main Results table. The comparison is internally consistent (same protocol on both subsets); the ~0.001 gap to the main table's CV F1 is the usual oracle-vs-CV difference.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everything goes slightly &lt;em&gt;up&lt;/em&gt; on the harder subset, because the threshold sweep finds a tighter optimum when easy pairs no longer anchor it. The ~2-point embedding-vs-lexical gap holds. If you had expected TF-IDF to collapse when easy negatives are removed, this benchmark doesn't show that effect on this data.&lt;/p&gt;

&lt;h3&gt;
  
  
  📝 A bug I caught during development
&lt;/h3&gt;

&lt;p&gt;When I first ran this benchmark I got TF-IDF=0.774, BM25=0.388, BM25F=0.499 — numbers suspiciously worse than any baseline has a right to be. All three were wrong, for two separate reasons.&lt;/p&gt;

&lt;p&gt;The threshold-sweep function was copied from the embedding-similarity code, which scans &lt;code&gt;[0.50, 1.00]&lt;/code&gt; — the right range for cosine similarities. But BM25's normalized scores cluster in &lt;code&gt;[0.05, 0.25]&lt;/code&gt;, so the sweep was searching empty space and returning whatever F1 happened to land at 0.50. Fixing the range to &lt;code&gt;[0.00, 1.00]&lt;/code&gt; gave the numbers you see above.&lt;/p&gt;

&lt;p&gt;The F1=0.038 I initially got for "tuned BM25F" was a second bug: I was stemming with Porter, which turned &lt;code&gt;undefined&lt;/code&gt; into &lt;code&gt;undefin&lt;/code&gt;, &lt;code&gt;CORS&lt;/code&gt; into &lt;code&gt;cor&lt;/code&gt;, &lt;code&gt;processPayment&lt;/code&gt; into &lt;code&gt;processpay&lt;/code&gt; — destroying the exact-token matching that makes stack traces and error IDs distinctive. Removing stemming and grid-searching field weights on all 4,475 pairs lifted F1 to 0.923 — but under proper 5-fold CV (weights picked on train folds, F1 measured on held-out) it drops to 0.872 ± 0.012. That's a 5-point overfitting gap: field weights selected on any fold don't generalize to the next. Even at 6 configs × 4,475 pairs, grid search is enough to overfit. Plain BM25 at 0.951 beats both versions of BM25F — the lesson is that field weighting doesn't earn its complexity on this data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson: if your baseline is way worse than a baseline has any right to be, suspect yourself first.&lt;/strong&gt; I'm sharing this before the main Recommendations because catching this kind of bug is the most practically useful thing I can teach anyone reading a retrieval benchmark — and it almost shipped a "21-point gap between embeddings and lexical methods" claim that was a methodology bug, not a finding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why threshold matters more than model choice
&lt;/h3&gt;

&lt;p&gt;The difference between the best and worst model is 1.7% F1 (0.990 vs 0.973). Using 0.9 instead of each model's optimal threshold costs you &lt;strong&gt;42–78% of your duplicates&lt;/strong&gt;. Threshold selection is the single highest-leverage decision in a dedup pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdy3s0ahvaj4uxzwahh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdy3s0ahvaj4uxzwahh.png" alt="Precision-Recall Curves" width="800" height="644"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Precision-Recall curves for all 6 models. The dots mark each model's optimal threshold. Notice how the curves separate mainly in the high-recall region — that's where threshold choice matters most.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each model has its own optimal operating point. There's no universal "good threshold" — it depends on the model's embedding space:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Optimal Threshold&lt;/th&gt;
&lt;th&gt;What 0.9 would cost you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;Miss 63% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;Miss 78% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;Miss 71% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 53% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic v1.5&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 64% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3 7.6B&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 42% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; don't hardcode a threshold — especially not mine. These thresholds were tuned on the same dataset used for evaluation. Run a sweep on your own data, even a small one (50–100 labeled pairs) — the optimal threshold for your model + your data will likely differ.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How models see duplicates vs non-duplicates
&lt;/h3&gt;

&lt;p&gt;The violin plot below shows the distribution of cosine similarity scores for each pair type, per model. The key question: &lt;strong&gt;is there a clean gap between the duplicate distribution (D1, D2) and the non-duplicate distribution (D3, D4)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8trzqm091qnklprkb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8trzqm091qnklprkb8.png" alt="Cosine Similarity Distributions by Pair Type" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What this reveals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;D1 (exact duplicates)&lt;/strong&gt; cluster at 0.90–1.0 for all models — no surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D2 (paraphrases)&lt;/strong&gt; spread across 0.65–0.95 — this is where models differ. mxbai and qwen3 push D2 higher (tighter clusters), making them easier to separate from negatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D3 (hard negatives)&lt;/strong&gt; are the problem. They overlap with D2 in the 0.55–0.75 range for weaker models (nomic, snowflake). This overlap is exactly where false positives come from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D4 (easy negatives)&lt;/strong&gt; sit below 0.5 for most models — these are trivial to filter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wider the gap between D2 and D3, the easier it is to pick a threshold that works. mxbai has the cleanest separation. nomic has the most overlap — which explains its lower F1.&lt;/p&gt;
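
&lt;p&gt;You don't need the plot to measure that gap. A small sketch (the dict layout and names are mine) that summarizes each pair type and reports the D2-vs-D3 separation margin:&lt;/p&gt;

```python
from statistics import mean

def separation_margin(sims_by_type):
    """Gap between the hardest duplicates (D2) and hardest negatives (D3).

    sims_by_type: {"D1": [...], "D2": [...], "D3": [...], "D4": [...]}
    maps each pair type to its cosine similarities.
    """
    summary = {t: (min(s), mean(s), max(s)) for t, s in sims_by_type.items()}
    # Positive margin: some threshold cleanly splits D2 from D3.
    # Negative margin: the distributions overlap, and any threshold
    # trades false negatives against false positives.
    margin = min(sims_by_type["D2"]) - max(sims_by_type["D3"])
    return summary, margin
```

&lt;p&gt;mxbai's clean separation would show up as a margin near or above zero; nomic's overlap as a clearly negative one.&lt;/p&gt;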

&lt;h2&gt;
  
  
  Deep Dives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What to embed: the input strategy experiment
&lt;/h3&gt;

&lt;p&gt;A bug report contains many fields. Which ones should go into the embedding? I tested four strategies using mxbai-embed-large (other models may rank strategies differently, but the direction should hold):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;What's included&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Avg words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A: Title only&lt;/td&gt;
&lt;td&gt;title&lt;/td&gt;
&lt;td&gt;0.951&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;7.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B: Title + Desc&lt;/td&gt;
&lt;td&gt;title + description&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;28.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C: + Console&lt;/td&gt;
&lt;td&gt;title + desc + first_error&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;td&gt;34.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;D: Full capture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;title + desc + errors + network + env&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.990&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteu2or9zizaw4z7b45dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteu2or9zizaw4z7b45dp.png" alt="What to Embed? Input Strategy Comparison" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The progression tells a clear story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Description is the biggest single contributor&lt;/strong&gt; (+2.7% F1 over title only). This makes sense — the title is a summary, the description contains the actual technical detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Console errors show diminishing returns at this level&lt;/strong&gt; (-0.2% F1 vs title+desc). On this dataset, adding the first console error alone doesn't help beyond what the description already provides. However, the full capture strategy (D) shows that the combination of all machine-captured fields together pushes F1 to 0.990.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full capture is what matters&lt;/strong&gt; (+1.2% F1 over title+desc). Adding all machine-captured fields — console errors, network logs, and environment info — together provides the signal needed for the best dedup quality. Individual fields may not move the needle, but the combination does.&lt;/p&gt;
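
&lt;p&gt;For reference, Strategy D is just a field concatenation. A sketch of how such an embedding input might be assembled — the key names are illustrative, not BugSpotter's actual capture schema:&lt;/p&gt;

```python
def build_embedding_input(report):
    """Assemble the "full capture" (Strategy D) embedding input.

    report: dict with optional keys — title, description,
            console_errors (list of str), network_errors (list of dict),
            env (dict). Missing fields are skipped, so the same function
    covers strategies A–D depending on what the report contains.
    """
    parts = [report.get("title", ""), report.get("description", "")]
    parts += report.get("console_errors", [])
    parts += [f"{e.get('status')} {e.get('url')}"
              for e in report.get("network_errors", [])]
    env = report.get("env", {})
    if env:
        parts.append(" ".join(f"{k}={v}" for k, v in sorted(env.items())))
    return "\n".join(p for p in parts if p)
```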

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you're building an SDK that captures bug context, this is why structured fields matter more than you'd think:&lt;/strong&gt; machine-captured data (console errors, network logs, stack traces) is deterministic — two users hitting the same bug generate the same console error, even if one writes "checkout broken" and the other writes "can't buy stuff lol." Free-text descriptions can't give an embedding model that kind of signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat: this finding is partially optimistic by construction.&lt;/strong&gt; The 250 SDK captures in my dataset are 25 bugs × 10 variations, where each of the 10 variations shares the &lt;em&gt;exact same&lt;/em&gt; console logs, stack traces, URLs, browser metadata, and timestamps — they came from one Playwright capture per bug, with only the title and description varied afterward. That's realistic for "two users on identical setups hit the same bug," but real multi-author reports have different browsers, different user IDs, different stack trace line numbers, different network timing. Some of the +1.2% F1 from Strategy B → D is the embedding matching on identical structured strings that wouldn't be identical in production. The &lt;em&gt;direction&lt;/em&gt; of the finding (structured fields help) holds across the per-category Bugzilla analysis too, but the &lt;em&gt;magnitude&lt;/em&gt; here should be read as an upper bound on what you'd see with real-world-varied captures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Hard negatives: where models actually differ
&lt;/h3&gt;

&lt;p&gt;Overall F1 scores are close (0.979–0.990). But the models diverge on the hardest task: distinguishing different bugs that share vocabulary. I analyzed the D3 pairs (different bugs in the same component) specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;D3 False Positives&lt;/th&gt;
&lt;th&gt;D2 False Negatives&lt;/th&gt;
&lt;th&gt;Total errors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;49&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The larger, more realistic dataset changed the picture.&lt;/strong&gt; In v1 (540 reports, 40 synthetic SDK captures), qwen3 had zero false positives on hard negatives. In v2 (650 reports, 250 real SDK captures), every model has significant false positives. The real SDK captures create much harder negative pairs — real console errors and stack traces share more vocabulary across different bugs than synthetic ones did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bge-m3 and qwen3 make the fewest total errors&lt;/strong&gt; (48 and 49 respectively). bge-m3 has 29 FP + 19 FN; qwen3 has 32 FP + 17 FN. Both catch most real duplicates while keeping false positives manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;all-minilm makes 2x more errors&lt;/strong&gt; (97 total) — 34 false positives and 63 false negatives. The headline F1 gap of 1% hides this 2x difference on hard cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake struggles the most.&lt;/strong&gt; 43 false positives and 48 false negatives — 91 total errors.&lt;/p&gt;

&lt;p&gt;The practical implication: on realistic data, &lt;strong&gt;no model achieves zero false positives&lt;/strong&gt;. If your dedup system auto-merges without human review, you need a confidence threshold above the optimal F1 threshold to reduce false positives — at the cost of missing more duplicates. Human review remains important.&lt;/p&gt;
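
&lt;p&gt;For an auto-merge path, the sweep changes shape: instead of maximizing F1, take the lowest threshold whose precision clears a floor you set, and route everything below it to human review. A minimal sketch (names are mine):&lt;/p&gt;

```python
def threshold_for_precision(scores, labels, min_precision=0.99, steps=101):
    """Lowest threshold whose precision meets the floor, or None.

    Trades recall for precision: pairs above the returned threshold
    are safe enough to auto-merge; the rest go to review.
    """
    for i in range(steps):
        t = i / (steps - 1)
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp and tp / (tp + fp) >= min_precision:
            return t
    return None  # no threshold reaches the floor on this data
```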

&lt;p&gt;Some real examples of what models got wrong:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Confused pair&lt;/th&gt;
&lt;th&gt;Why it's tricky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Memory leak on infinite scroll" vs "Images fail to load on fast scroll"&lt;/td&gt;
&lt;td&gt;Both mention scrolling + performance issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Body scroll not locked in modal" vs "Focus escapes dialog to background"&lt;/td&gt;
&lt;td&gt;Both about modal behavior on iOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"CORS error after subdomain change" vs "Avatar images blocked by CDN CORS"&lt;/td&gt;
&lt;td&gt;Both contain "CORS" + "blocked"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  MRL dimension truncation
&lt;/h3&gt;

&lt;p&gt;Qwen3 and nomic-embed-text support Matryoshka Representation Learning (MRL), meaning you can truncate their embeddings to fewer dimensions without retraining. This trades storage for quality.&lt;/p&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Storage per 100K&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;td&gt;49 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;98 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;195 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;768 (full)&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;td&gt;293 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;49 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;98 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;0.991&lt;/td&gt;
&lt;td&gt;195 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;4096 (full)&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;1,563 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdo5f81pvssz3y6t7r90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdo5f81pvssz3y6t7r90.png" alt="MRL Dimension Truncation: F1 vs Storage" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;nomic loses very little from truncation.&lt;/strong&gt; From 768 to 128 dims, F1 drops only from 0.981 to 0.972. At 512 dims (0.980), quality is nearly identical to full — 34% less storage for effectively the same performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;qwen3 is remarkably stable across all dimensions.&lt;/strong&gt; F1 stays at 0.990–0.991 from 128 dims all the way to 4096. This means you can truncate qwen3 to 128 dims (49 MB per 100K records) with zero quality loss — a major practical advantage. It also sidesteps pgvector's 2000-dimension limit entirely: just truncate to 128 or 256 dims and use pgvector without issue.&lt;/p&gt;
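
&lt;p&gt;Using a truncated MRL embedding is mechanically simple: slice the first &lt;code&gt;dims&lt;/code&gt; components and L2-renormalize before computing cosine similarity. Re-normalizing matters, because the truncated vector is no longer unit-length. A sketch with plain lists (names are mine):&lt;/p&gt;

```python
import math

def truncate_mrl(vec, dims):
    """Keep the first `dims` components and L2-renormalize.

    Only valid for MRL-trained models (e.g. qwen3, nomic v1.5),
    where the leading dimensions carry a usable coarse representation.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    # For unit vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))
```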

&lt;h3&gt;
  
  
  F1 by bug category
&lt;/h3&gt;

&lt;p&gt;Not all bugs are equally easy to deduplicate. I broke down F1 by error type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiawi9nxhe9w94rpl7p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiawi9nxhe9w94rpl7p0.png" alt="F1 by Model x Bug Category" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heatmap shows that &lt;strong&gt;UI interaction bugs are hardest&lt;/strong&gt; for all models — they tend to have vague descriptions ("button doesn't work") and share vocabulary across different issues. &lt;strong&gt;Network errors are easiest&lt;/strong&gt; — they come with specific HTTP status codes and endpoint URLs that are highly discriminative. (The "GitHub Issues" column shows 0.00 because these 100 reports have no duplicate pairs — they provide vocabulary diversity, not dedup targets.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Where embeddings beat keyword matching most
&lt;/h3&gt;

&lt;p&gt;Per-category F1 gaps tell a more nuanced story than the overall ~2-point headline. Sorted by how much the best embedding beats the best lexical baseline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Best embedding&lt;/th&gt;
&lt;th&gt;TF-IDF&lt;/th&gt;
&lt;th&gt;BM25&lt;/th&gt;
&lt;th&gt;Gap (emb − best lex)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI interaction&lt;/td&gt;
&lt;td&gt;0.994&lt;/td&gt;
&lt;td&gt;0.947&lt;/td&gt;
&lt;td&gt;0.916&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.7 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State management&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.942&lt;/td&gt;
&lt;td&gt;0.932&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network errors&lt;/td&gt;
&lt;td&gt;0.994&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JS errors&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;0.962&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.2 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSS/UI&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;React-specific&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.4 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The pattern:&lt;/strong&gt; on technical bugs with distinctive error messages (React, CSS, performance, JS, network), TF-IDF sits within 0.4–2 F1 points of the best embedding — essentially a tie. The embedding advantage shows up where users describe symptoms in free-form prose — UI interaction (4.7 pts) and state management (2.5 pts) — where "button doesn't work" and "can't click submit" share little vocabulary despite describing the same bug. If most of your bugs come with structured error identifiers, embeddings don't help much. If most of your bugs come from users writing in their own words, embeddings earn their cost but not dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Validation on Mozilla Bugzilla
&lt;/h2&gt;

&lt;p&gt;Everything so far has been on my own dataset. The synthetic half is controlled-quality by construction and the SDK captures are 10 paraphrases per bug — not what real multi-author duplicates look like. The honest question: do these rankings survive on bug reports I had no hand in creating?&lt;/p&gt;

&lt;p&gt;I ran all 6 models and both lexical baselines against 407 Mozilla Bugzilla bugs (250 duplicate pairs, 100 hard negatives). These are plain-text title + description with no SDK captures, no stack traces, no structured fields — fundamentally different data from the synthetic benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqc4dmh3fpdntmfxb4sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqc4dmh3fpdntmfxb4sj.png" alt="Cross-Validation on Mozilla Bugzilla: All Methods Ranked" width="800" height="483"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Bugzilla F1 for every method tested. BM25 (diamond marker, 0.954) sits at #3 — ahead of &lt;code&gt;all-minilm&lt;/code&gt;, &lt;code&gt;bge-m3&lt;/code&gt;, &lt;code&gt;nomic&lt;/code&gt;, and &lt;code&gt;snowflake&lt;/code&gt;. The dashed line is BM25's score; four of the six embedding models fall below it.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The two-dataset picture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Synthetic CV F1&lt;/th&gt;
&lt;th&gt;Bugzilla F1&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Synthetic rank&lt;/th&gt;
&lt;th&gt;Bugzilla rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.966&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−0.024&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.984&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.962&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−0.022&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2 ↑&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;0.954&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;+0.003&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;7 (below all 6)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;3&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.952&lt;/td&gt;
&lt;td&gt;−0.026&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.950&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;−0.023&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;8&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;5&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.948&lt;/td&gt;
&lt;td&gt;−0.038&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;6 ↓↓↓↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.894&lt;/td&gt;
&lt;td&gt;−0.079&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.872&lt;/td&gt;
&lt;td&gt;−0.105&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the two datasets agree on
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. qwen3 is the strongest model on both.&lt;/strong&gt; It's #1 on synthetic and #1 on Bugzilla. If you can afford 7.6B parameters and 2.7-second latency, it's the safe pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Threshold 0.9 is terrible on both.&lt;/strong&gt; On synthetic, Recall@0.9 ranged 22–58%. On Bugzilla, the optimal thresholds are 0.27–0.62 — cosine 0.9 misses nearly every duplicate. The "embeddings are near-duplicates above 0.9" folklore doesn't survive either dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The bottom tier stays the bottom tier.&lt;/strong&gt; nomic and snowflake are the worst two models on both datasets, and by the biggest absolute margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. BM25 beats most of the embedding models on Bugzilla.&lt;/strong&gt; This is the uncomfortable finding. On the synthetic benchmark, BM25 scores 0.951 — below every embedding model. On Bugzilla, BM25 scores 0.954 (actually &lt;em&gt;slightly higher&lt;/em&gt; than its synthetic score) — and it beats &lt;code&gt;bge-m3&lt;/code&gt; (0.948), &lt;code&gt;nomic&lt;/code&gt; (0.894), &lt;code&gt;snowflake&lt;/code&gt; (0.872), narrowly beats &lt;code&gt;all-minilm&lt;/code&gt; (0.952), and loses only to the top two embedding models (qwen3 by 0.012, mxbai by 0.008). On single-language plain-text Mozilla bug reports, whitespace-tokenized BM25 is a serious competitor — use it as a sanity-check baseline before concluding that an embedding model is "good enough" for your domain. The same BM25 that costs &amp;lt;1ms per pair beats four of the six embedding models I tested. If your data looks like Bugzilla (English, plain text, experienced-author writing style), the honest recommendation might be "start with BM25 and only reach for embeddings if you need multilingual matching or are handling vague UI descriptions."&lt;/p&gt;

&lt;h3&gt;
  
  
  What they disagree on
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The middle of the leaderboard shuffles substantially.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bge-m3&lt;/code&gt; drops from #2 on synthetic to #4 among the embedding models on Bugzilla (#6 overall, behind BM25), losing 0.038 F1 — the largest drop in the top 3&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mxbai&lt;/code&gt; moves from #3 to #2, overtaking bge-m3&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;all-minilm&lt;/code&gt; climbs from #4 to #3 among the embedding models, overtaking bge-m3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read literally: if you had picked a model by ranking on the synthetic benchmark alone, you would have picked either qwen3 (still correct) or bge-m3 (now disputed). The "top 3 are tied" claim from Results is robust, but &lt;em&gt;which&lt;/em&gt; of the three is actually best depends on the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Thresholds do not transfer at all.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Synthetic optimal&lt;/th&gt;
&lt;th&gt;Bugzilla optimal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;td&gt;0.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Applying the synthetic-tuned threshold to Bugzilla costs every model 20–30 F1 points. The synthetic dataset's paraphrase-heavy structure produces tighter duplicate clusters (and therefore higher thresholds) than real multi-author bug reports do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Degradation is uneven.&lt;/strong&gt; qwen3, mxbai and all-minilm lose 0.022–0.026 F1 going from synthetic to Bugzilla — stable. bge-m3 loses 0.038 (~70% more than the others in the top tier). nomic loses 0.079. snowflake loses 0.105 — four times more than qwen3. Headline F1 is one thing; &lt;em&gt;how well a model degrades on data you didn't tune against&lt;/em&gt; is a different thing, and it's the one that matters in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd take from this
&lt;/h3&gt;

&lt;p&gt;If I'd published the synthetic results alone, I'd have recommended bge-m3 or qwen3 with some confidence. The Bugzilla data says something more nuanced: &lt;strong&gt;qwen3 and mxbai are robust across both datasets; bge-m3 falls to 6th place on Bugzilla and even loses to a BM25 baseline; and BM25 itself is the quiet winner on the robustness-per-dollar axis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One number worth sitting with: the F1 gap between qwen3 and bge-m3 on synthetic is 0.004. The F1 drop bge-m3 suffers going from synthetic to Bugzilla is 0.038 — ten times larger.&lt;/strong&gt; The out-of-distribution shift moves you further than any model-choice decision within the top tier. Optimizing which embedding model you pick, beyond the robustness heuristic below, is optimizing noise.&lt;/p&gt;

&lt;p&gt;Two practical heuristics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick a model that degrades gracefully on data you didn't tune against.&lt;/strong&gt; By that criterion, qwen3 and mxbai stay in the top tier on both datasets (losing 0.022–0.024 F1). bge-m3 loses 0.038 and drops 4 ranks. nomic and snowflake lose 3–4x more than the top tier. Headline F1 is one thing; robustness is what actually matters in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run BM25 as a baseline before committing to embeddings.&lt;/strong&gt; On Bugzilla-like data (plain text, single language, experienced authors), BM25 scores within 1.2 F1 points of the best embedding model and beats most of the field. If your data looks like that, embeddings may not be worth the infrastructure. The case for embeddings is strongest where BM25 is structurally weak: multilingual matching and semantically-vague bug descriptions (see the per-category table in Deep Dives — a ~5-point gap on UI-interaction bugs, which is the biggest per-category gap but still small in absolute terms).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
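
&lt;p&gt;That baseline costs almost nothing to run. A self-contained BM25 sketch: whitespace tokenization, lowercasing, no stemming (see the stemming bug above), and the common k1/b defaults rather than tuned values:&lt;/p&gt;

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score `query` against each doc with BM25 (Okapi variant).

    Whitespace tokenization, lowercased, no stemming: stemming mangles
    identifiers like CORS and processPayment into useless stems.
    """
    def tokenize(text):
        return text.lower().split()

    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in doc_tokens) / n
    df = Counter()  # document frequency per term
    for tokens in doc_tokens:
        df.update(set(tokens))

    def idf(term):
        return math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))

    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                num = tf[term] * (k1 + 1)
                den = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
                score += idf(term) * num / den
        scores.append(score)
    return scores
```

&lt;p&gt;If the best embedding model beats this by less than a couple of F1 points on your own labeled pairs, the embedding infrastructure probably isn't paying for itself.&lt;/p&gt;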

&lt;h2&gt;
  
  
  Vector Store Shootout
&lt;/h2&gt;

&lt;p&gt;You've chosen your embedding model. Now: where do you store the vectors? I loaded the same embeddings (qwen3-embedding, 4096 dims) into three stores and measured everything — first at 550 real records, then at synthetic scale from 1K to 100K with all four stores.&lt;/p&gt;

&lt;p&gt;Note: pgvector is excluded from the real-data test because qwen3's 4096-dim vectors exceed pgvector's 2000-dimension index limit. pgvector remains viable with models that output &amp;lt;=1024 dims, or with qwen3 using MRL truncation (see above).&lt;/p&gt;

&lt;h3&gt;
  
  
  At bug-tracker scale (550 records, qwen3 4096 dims)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;ChromaDB&lt;/th&gt;
&lt;th&gt;sqlite-vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Insert time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.69s&lt;/td&gt;
&lt;td&gt;0.63s&lt;/td&gt;
&lt;td&gt;0.41s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.21ms&lt;/td&gt;
&lt;td&gt;3.29ms&lt;/td&gt;
&lt;td&gt;5.52ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall@10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this scale, all three stores return perfect recall. ChromaDB is the fastest on queries (3.29ms), followed by sqlite-vec (5.52ms) and Qdrant (7.21ms).&lt;/p&gt;
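&lt;p&gt;For reference, recall@k here means: of the true top-k neighbors from an exact search, what fraction the store actually returns. A minimal sketch (an illustrative helper, not the benchmark's code):&lt;/p&gt;

```python
# Sketch: how recall@k is typically computed for an ANN store.
# ground_truth: exact top-k ids from brute-force search;
# approx: top-k ids returned by the index under test.
def recall_at_k(ground_truth, approx, k=10):
    hits = len(set(ground_truth[:k]).intersection(approx[:k]))
    return hits / k

exact = [3, 17, 42, 8, 99, 5, 61, 23, 70, 12]
ann   = [3, 17, 42, 8, 99, 5, 61, 23, 70, 11]  # one neighbor missed
print(recall_at_k(exact, ann))  # 0.9
```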

&lt;h3&gt;
  
  
  But what happens at scale?
&lt;/h3&gt;

&lt;p&gt;This is the question the 550-record benchmark can't answer. So I generated synthetic embeddings (4096 dims, random vectors for &lt;strong&gt;latency benchmarking only&lt;/strong&gt; — recall was validated on real data at 550 records; real-world HNSW recall at 100K depends on embedding distribution) and tested at 1K, 10K, 50K, and 100K records. pgvector is included here since synthetic data allows testing with any dimension count.&lt;/p&gt;
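&lt;p&gt;For the curious, random unit vectors for this kind of latency-only test need nothing beyond the standard library (a sketch, not the benchmark's actual generator):&lt;/p&gt;

```python
# Sketch: random unit vectors for latency-only benchmarking.
# Gaussian components normalized to length 1 are uniform on the
# sphere, fine for timing but NOT for recall measurements, since
# real embeddings cluster far more than random vectors do.
import math
import random

def random_unit_vector(dims=4096, seed=42):
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vec = random_unit_vector()
print(len(vec), round(math.sqrt(sum(x * x for x in vec)), 6))  # 4096 1.0
```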

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;ChromaDB&lt;/th&gt;
&lt;th&gt;sqlite-vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.71ms&lt;/td&gt;
&lt;td&gt;4.73ms&lt;/td&gt;
&lt;td&gt;2.21ms&lt;/td&gt;
&lt;td&gt;1.62ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.69ms&lt;/td&gt;
&lt;td&gt;4.85ms&lt;/td&gt;
&lt;td&gt;2.98ms&lt;/td&gt;
&lt;td&gt;18.54ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.51ms&lt;/td&gt;
&lt;td&gt;7.92ms&lt;/td&gt;
&lt;td&gt;3.47ms&lt;/td&gt;
&lt;td&gt;87.68ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.84ms&lt;/td&gt;
&lt;td&gt;7.70ms&lt;/td&gt;
&lt;td&gt;3.61ms&lt;/td&gt;
&lt;td&gt;167.41ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F421ip971ytjlcmo00bcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F421ip971ytjlcmo00bcj.png" alt="Vector Store Performance: 1K to 100K Records" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The picture changes completely at scale:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sqlite-vec falls off a cliff.&lt;/strong&gt; Brute-force search is fine at 1K (1.6ms) but at 100K it's 167ms — 100x slower. No index means linear scan. For a bug tracker that grows beyond a few thousand reports, sqlite-vec stops being viable for real-time queries.&lt;/p&gt;
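&lt;p&gt;"No index means linear scan" concretely: every query computes cosine similarity against every stored vector, so per-query cost grows linearly with corpus size. A toy sketch:&lt;/p&gt;

```python
# Sketch of what "no index" means: every query touches every vector.
# O(n * dims) per query, which is the 1.6ms-to-167ms curve above.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query, corpus, k=10):
    # no data structure to prune the scan: score all, sort, slice
    scored = [(cosine(query, v), i) for i, v in enumerate(corpus)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_top_k([1.0, 0.1], corpus, k=2))  # [0, 2]
```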

&lt;p&gt;&lt;strong&gt;ChromaDB is the surprise winner on query latency.&lt;/strong&gt; 3.6ms at 100K, barely different from 2.2ms at 1K. Its HNSW implementation scales almost perfectly. If you only care about query speed, ChromaDB wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qdrant overtakes pgvector at 50K+.&lt;/strong&gt; At 100K, Qdrant (7.7ms) is 29% faster than pgvector (10.8ms). The Rust HNSW starts earning its keep. But the real story is insert time: pgvector takes &lt;strong&gt;27 minutes&lt;/strong&gt; to insert and index 100K records vs Qdrant's &lt;strong&gt;77 seconds&lt;/strong&gt;. That's a 20x difference — pgvector's HNSW index build is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgvector is still the pragmatic choice for most teams.&lt;/strong&gt; 10.8ms at 100K records is fast enough for any bug tracker. The insert time penalty matters only for bulk imports — not for the one-at-a-time inserts that happen when users file bugs. And you get SQL, transactions, backups, and zero additional infrastructure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Store&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Watch out for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams already on PostgreSQL, &amp;lt;100K records&lt;/td&gt;
&lt;td&gt;Slow bulk index build at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qdrant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt;50K records, need filtered search&lt;/td&gt;
&lt;td&gt;Extra Docker service, REST API complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ChromaDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest queries at any scale, prototyping&lt;/td&gt;
&lt;td&gt;30MB RAM overhead, no SQL, weak production tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sqlite-vec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;5K records, zero dependencies&lt;/td&gt;
&lt;td&gt;Linear scan kills performance at 10K+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  pgvector's 2000-dimension limit
&lt;/h3&gt;

&lt;p&gt;One practical constraint I hit: pgvector cannot create HNSW or IVFFlat indexes on vectors with more than 2000 dimensions (or 4000 with &lt;code&gt;halfvec&lt;/code&gt; in pgvector 0.7.0+). Qwen3-embedding outputs 4096 dims, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;halfvec&lt;/code&gt; type (pgvector 0.7.0+) to index up to 4000 dims at float16 precision&lt;/li&gt;
&lt;li&gt;Or truncate to ≤2000 dims via MRL before indexing (qwen3 loses zero quality — see above)&lt;/li&gt;
&lt;li&gt;Or use brute-force search (no index) — fine for &amp;lt;10K records but won't scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For models with ≤1024 dims (mxbai, bge-m3, nomic, all-minilm, snowflake), this is a non-issue.&lt;/p&gt;
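&lt;p&gt;MRL truncation itself is trivial: keep a prefix of the vector and re-normalize it. A sketch (the helper name is mine):&lt;/p&gt;

```python
# Sketch: MRL truncation to fit qwen3's 4096-dim vectors under
# pgvector's 2000-dim index limit. MRL-trained models front-load
# information, so you keep a prefix and re-normalize.
import math

def mrl_truncate(vec, dims=1024):
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5] * 4096           # stand-in for a real qwen3 embedding
short = mrl_truncate(full, 1024)
print(len(short))             # 1024
```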

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; I tested at 550 real records and up to 100K synthetic records. For a typical bug tracker (&amp;lt;50K reports), pgvector handles everything under 11ms. Beyond 50K, Qdrant's insert speed and query latency start to justify the extra infrastructure. sqlite-vec is great for tiny projects but doesn't scale. ChromaDB is the fastest at every scale but lacks production maturity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwllbvstfu0pb98t9m8aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwllbvstfu0pb98t9m8aa.png" alt="Which Vector Store Should You Use?" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I chose: bge-m3 (for BugSpotter specifically)
&lt;/h3&gt;

&lt;p&gt;Based on this benchmark, I switched BugSpotter's default embedding model from all-minilm to &lt;strong&gt;bge-m3&lt;/strong&gt; (the change landed before this article was published). The choice is driven by deployment constraints, not by headline F1 — and the Bugzilla data complicates the picture in a way worth being honest about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the data actually shows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the synthetic benchmark, the top 3 (qwen3, bge-m3, mxbai) are statistically tied — their bootstrap 95% CIs overlap, so F1 alone can't pick a winner.&lt;/li&gt;
&lt;li&gt;On Mozilla Bugzilla (real multi-author duplicates), qwen3 leads (0.966), mxbai is #2 (0.962), &lt;strong&gt;BM25 is #3 (0.954), and bge-m3 drops to #6 (0.948) — below a whitespace-BM25 baseline&lt;/strong&gt;. On English plain-text bugs, bge-m3 offers no F1 advantage over BM25.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why bge-m3 anyway, for BugSpotter — multilingual is now the load-bearing reason:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bugzilla result forces an honest reframe. If I'd seen this table before picking a model, my case for bge-m3 wouldn't have rested on four roughly-equal factors — it would have rested on one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual support (the real dealbreaker).&lt;/strong&gt; bge-m3 was trained on 100+ languages. BM25 is English-centric by construction — its tokenization, IDF estimation, and stopword assumptions all break on non-English text, and it can't match a Russian bug report to its English duplicate at all. mxbai is English-only. BugSpotter serves users globally; users &lt;em&gt;will&lt;/em&gt; file bugs in their native language. This is why I'm not switching to BM25 despite its 0.954 vs bge-m3's 0.948 on Bugzilla.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic matching on vague UI bugs.&lt;/strong&gt; The per-category table in Deep Dives shows BM25 trails the best embedding by ~4.7 F1 points on UI-interaction bugs and ~2.5 on state-management — the categories where users describe symptoms in free-form prose. The gap is under 2 points on technical bugs with distinctive error identifiers. Bugzilla's contributor base writes more distinctive technical language than typical end users do, which is part of why BM25 holds up so well there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector compatibility without truncation.&lt;/strong&gt; bge-m3's 1024 dims fit HNSW directly. qwen3's 4096 dims exceed pgvector's 2000-dim limit and need MRL truncation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async architecture.&lt;/strong&gt; Embedding in BugSpotter runs in a BullMQ background worker — the user never waits. So the 224ms vs 268ms latency difference is invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single €25/mo server.&lt;/strong&gt; bge-m3 runs comfortably alongside pgvector and the API on one Hetzner CPX42.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I were picking for an &lt;strong&gt;English-only product with plain-text bug reports&lt;/strong&gt;, the honest answer from this data is: &lt;strong&gt;start with BM25&lt;/strong&gt;. It's &amp;lt;1ms per pair, no infrastructure, and scores 0.954 on Bugzilla — only 0.012 below qwen3. Reach for embeddings only if you need multilingual matching (bge-m3) or are handling vague UI descriptions where BM25 structurally loses (mxbai or qwen3).&lt;/p&gt;

&lt;p&gt;If I had &lt;strong&gt;GPU budget&lt;/strong&gt;, &lt;strong&gt;qwen3&lt;/strong&gt; wins both datasets and is the only model with a comfortable margin over BM25 on Bugzilla.&lt;/p&gt;

&lt;p&gt;For BugSpotter specifically (multilingual, self-hosted CPU-only, pgvector-compatible), bge-m3's deployment profile still wins — but it's because of the languages argument, not the F1 argument.&lt;/p&gt;

&lt;p&gt;The meta-lesson: pick a model against your &lt;em&gt;deployment&lt;/em&gt; constraints and your own data, and &lt;strong&gt;always run BM25 as a baseline&lt;/strong&gt; before concluding an embedding is worth the infrastructure.&lt;/p&gt;
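&lt;p&gt;Running that BM25 baseline takes a dozen lines. A minimal Okapi BM25 sketch with whitespace tokenization, the same setup as the plain-BM25 baseline here (not the benchmark's exact code):&lt;/p&gt;

```python
# Minimal BM25 (Okapi variant) sketch for a quick baseline before
# committing to any embedding infrastructure.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]   # whitespace tokens
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "TypeError: cannot read properties of undefined in checkout flow",
    "Payment button unresponsive after form validation",
]
scores = bm25_scores("undefined error in checkout", docs)
print(scores[0] > scores[1])  # True
```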

&lt;h3&gt;
  
  
  If bge-m3 isn't right for you
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your priority&lt;/th&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;th&gt;F1 (synthetic / Bugzilla)&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English-only plain-text reports, minimal infra&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.951 / 0.954&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;No model, no Ollama, no vector DB. Beats 4 of 6 embedding models on Bugzilla. Start here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Absolute minimum latency (embedding-based)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.978 / 0.952&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;td&gt;10x faster than bge-m3, but 2x more errors on hard cases; can't embed long reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max quality, latency irrelevant&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;qwen3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.990 / 0.966&lt;/td&gt;
&lt;td&gt;2.7s&lt;/td&gt;
&lt;td&gt;Best F1 on both datasets, but needs truncation for pgvector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balance of quality + speed (English)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.984 / 0.962&lt;/td&gt;
&lt;td&gt;224ms&lt;/td&gt;
&lt;td&gt;Tied with bge-m3 on synthetic, ahead on Bugzilla, English-only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What threshold should you set?
&lt;/h3&gt;

&lt;p&gt;Don't use 0.9. Don't use any number from a blog post. Do this instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Label 50–100 pairs from your own bug database (duplicate or not)&lt;/li&gt;
&lt;li&gt;Run a threshold sweep from 0.5 to 0.9 in steps of 0.01&lt;/li&gt;
&lt;li&gt;Pick the threshold that maximizes F1 (or bias toward precision if false positives are costly)&lt;/li&gt;
&lt;/ol&gt;
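&lt;p&gt;The sweep itself is a few lines. A sketch, assuming you have labeled (similarity, is_duplicate) pairs from your own tracker:&lt;/p&gt;

```python
# Sketch of the threshold sweep: pick the cutoff that maximizes F1
# on labeled pairs. Bias toward precision instead if false positives
# are costly in your workflow.
def sweep(pairs, lo=0.50, hi=0.90, step=0.01):
    best_t, best_f1 = lo, 0.0
    t = lo
    while hi + 1e-9 > t:
        tp = sum(1 for s, dup in pairs if s >= t and dup)
        fp = sum(1 for s, dup in pairs if s >= t and not dup)
        fn = sum(1 for s, dup in pairs if t > s and dup)
        if tp:
            p = tp / (tp + fp)
            r = tp / (tp + fn)
            f1 = 2 * p * r / (p + r)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        t += step
    return best_t, best_f1

pairs = [(0.91, True), (0.84, True), (0.79, False), (0.62, False)]
t, f1 = sweep(pairs)
print(round(f1, 2))  # 1.0
```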

&lt;p&gt;If you can't label data yet, use these starting points from this benchmark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Start with&lt;/th&gt;
&lt;th&gt;Tune range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.55–0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.65–0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.68&lt;/td&gt;
&lt;td&gt;0.60–0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.68–0.82&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What text should you embed?
&lt;/h3&gt;

&lt;p&gt;Include everything the SDK captures. The experiment showed how much each field contributes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;F1 contribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;Baseline (0.951)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Description&lt;/td&gt;
&lt;td&gt;+2.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ All fields (full capture)&lt;/td&gt;
&lt;td&gt;+3.9% total&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The machine-captured fields (console errors, network requests, stack traces) are the most reliable signal. They're deterministic — same bug produces the same error — while human-written descriptions vary wildly.&lt;/p&gt;
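&lt;p&gt;Concretely, "embed everything" looks like joining the captured fields into one string with a &lt;code&gt;|&lt;/code&gt; field separator. A sketch; the field names are illustrative, not BugSpotter's actual schema:&lt;/p&gt;

```python
# Sketch: building the embedding text from captured report fields,
# joined with a "|" separator. Field names here are hypothetical.
def embedding_text(report):
    fields = [
        report.get("title", ""),
        report.get("description", ""),
        " ".join(report.get("console_errors", [])),
        " ".join(report.get("failed_requests", [])),
    ]
    # skip empty fields so the separator carries real boundaries
    return " | ".join(f for f in fields if f)

report = {
    "title": "Checkout crashes on submit",
    "console_errors": ["TypeError: cart is undefined"],
}
print(embedding_text(report))
# Checkout crashes on submit | TypeError: cart is undefined
```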

&lt;h3&gt;
  
  
  What vector store?
&lt;/h3&gt;

&lt;p&gt;If you already have PostgreSQL and your model has ≤1024 dims: &lt;strong&gt;pgvector&lt;/strong&gt;. Zero additional infrastructure. 10.8ms at 100K records — fast enough. Just watch out for slow bulk index builds and the 2000-dim limit (rules out full qwen3).&lt;/p&gt;

&lt;p&gt;If you have &amp;lt;5K records and want zero dependencies: &lt;strong&gt;sqlite-vec&lt;/strong&gt;. One &lt;code&gt;.db&lt;/code&gt; file, sub-2ms queries. But it uses brute-force search — at 10K records it's 19ms, at 100K it's 167ms. Don't grow into it.&lt;/p&gt;

&lt;p&gt;If you're prototyping or need the fastest queries: &lt;strong&gt;ChromaDB&lt;/strong&gt;. &lt;code&gt;pip install chromadb&lt;/code&gt; and go. Fastest queries at every scale I tested (3.6ms at 100K). But weak production tooling limits its use beyond prototypes.&lt;/p&gt;

&lt;p&gt;If you have &amp;gt;50K records, need fast bulk imports, or use high-dim models like qwen3: &lt;strong&gt;Qdrant&lt;/strong&gt;. It inserts 100K records in 77 seconds vs pgvector's 27 minutes. No dimension limit. Worth the extra Docker service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations &amp;amp; Future Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methodology strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;250 real SDK captures via Playwright (not synthetic), 100 GitHub issues, 300 synthetic paraphrases&lt;/li&gt;
&lt;li&gt;600 hard negative pairs (different bugs, same component)&lt;/li&gt;
&lt;li&gt;Two levels of cross-validation: pair-level CV (F1 drops 0.000–0.002) and archetype-level CV (holds out entire bug types — F1 drops 0.003–0.005). Thresholds generalize to unseen bug categories&lt;/li&gt;
&lt;li&gt;3 runs on separate VMs — seeds vary synthetic noise and negative-pair sampling; Ollama embeddings are deterministic at v0.20.7&lt;/li&gt;
&lt;li&gt;TF-IDF, BM25, and BM25F lexical baselines (naive + tuned under 5-fold CV) — embeddings outperform the best (TF-IDF at 0.973) by ~1.7 points on synthetic data; BM25 overtakes most embeddings on Bugzilla&lt;/li&gt;
&lt;li&gt;Independent cross-validation on 407 Mozilla Bugzilla bugs (see the Cross-Validation section) — rankings partially shuffle but the top tier holds&lt;/li&gt;
&lt;li&gt;Embedding text format matches production code&lt;/li&gt;
&lt;li&gt;Fully reproducible: one script, MIT license, €0.60 total cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I got wrong (or didn't test)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;650 reports is still small.&lt;/strong&gt; Real bug trackers have 10K–100K+ reports. I validated vector store scaling separately on synthetic data up to 100K records (see Vector Store Shootout), but embedding quality at that scale remains untested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-author synthetic paraphrases.&lt;/strong&gt; The 300 synthetic reports and their paraphrases were generated with AI assistance and reviewed by one person. Real bug reports come from dozens of people with different writing styles, languages, and technical vocabulary. This likely makes the D2 (paraphrase) pairs easier than real-world duplicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend/web only.&lt;/strong&gt; All reports come from the JavaScript/TypeScript ecosystem. Backend bugs (Java exceptions, database deadlocks), mobile-native crashes, and infrastructure issues have different vocabulary — these rankings may not transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;English only.&lt;/strong&gt; All reports are in English (with minor Russian text in a few variations). For multilingual teams, bge-m3's 100+ language support would likely give it a bigger advantage over the English-focused models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU only.&lt;/strong&gt; I tested on one hardware config (Hetzner CPX42, 8 vCPU, 16GB RAM). GPU inference would change the latency rankings dramatically — qwen3's 2.7 seconds would drop to ~100ms on an RTX 4000, making it competitive with the smaller models on speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing models.&lt;/strong&gt; &lt;code&gt;qwen3-embedding:4b&lt;/code&gt; wasn't available in Ollama at test time. This is the model most likely to hit the sweet spot between quality and efficiency. I also didn't test API models (OpenAI, Cohere, Voyage) — intentionally, since the focus is self-hosted, but a comparison would add context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-encoder baseline.&lt;/strong&gt; Cross-encoder rerankers (e.g., &lt;code&gt;bge-reranker-v2-m3&lt;/code&gt;, &lt;code&gt;ms-marco-MiniLM&lt;/code&gt;) score sentence pairs directly rather than projecting to a shared vector space first, and they almost always beat bi-encoder embeddings on pairwise classification tasks like this one. 4,475 pairs is trivially small for a cross-encoder — a single rerank pass would take seconds on CPU. The natural production setup is a two-stage pipeline: bi-encoder (fast recall, embedding) → cross-encoder (accurate rerank on top-k candidates). This benchmark tests only the first stage. A cross-encoder reranked against qwen3's top-10 candidates would very likely push F1 above 0.99 on both synthetic and Bugzilla. Future work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No separator ablation.&lt;/strong&gt; I use &lt;code&gt;|&lt;/code&gt; as the field separator in embedding text (inherited from production code). I didn't test alternatives (&lt;code&gt;\n&lt;/code&gt;, &lt;code&gt;[SEP]&lt;/code&gt;, markdown headers) or structured formats (JSON, XML). This could affect results — structured payloads may help models parse field boundaries more reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lexical baselines: tuning actively hurts; stemming hurts much more.&lt;/strong&gt; I tested naive BM25F (default field weights, whitespace tokenization) and a "tuned" version (grid-searched field weights, camelCase/snake_case splitting, &lt;em&gt;no&lt;/em&gt; Porter stemming). The oracle F1 from grid-searching weights on all 4,475 pairs reaches only 0.923 — no better than default BM25F's 0.923, so tuning has no headroom to begin with. And under proper 5-fold CV (weights picked on train folds, F1 measured on the held-out fold), tuned BM25F drops to 0.872 ± 0.012 — a 5-point overfitting gap. Even 6 weight configs on ~3,580 train pairs is enough for the selection not to generalize. Plain BM25 (no field weights at all, whitespace tokenization) at 0.951 beats every BM25F variant. Adding Porter stemming on top of BM25F collapses F1 to 0.038, because "undefined" → "undefin", "CORS" → "cor", "processPayment" → "processpay" — the exact tokens that disambiguate one stack trace from another get mangled. &lt;a href="https://dl.acm.org/doi/abs/10.1145/3576042" rel="noopener noreferrer"&gt;Zhang et al. (2023)&lt;/a&gt; used Lucene-style tokenization on Mozilla/Eclipse data (plain-text bug reports), where stemming helps. On structured reports with error IDs and stack traces, standard NLP preprocessing is counterproductive.&lt;/p&gt;
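&lt;p&gt;The identifier-aware tokenization mentioned above (camelCase/snake_case splitting) can be sketched with one regex, and unlike stemming it keeps the disambiguating tokens intact:&lt;/p&gt;

```python
# Sketch: identifier-aware tokenization that splits snake_case on
# underscores and camelCase on case boundaries, without mangling
# error tokens the way Porter stemming does.
import re

def split_identifiers(text):
    text = text.replace("_", " ")
    # insert a space before each uppercase-then-lowercase boundary
    text = re.sub(r"(?=[A-Z][a-z])", " ", text)
    return text.lower().split()

print(split_identifiers("processPayment failed: cart_total undefined"))
# ['process', 'payment', 'failed:', 'cart', 'total', 'undefined']
```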

&lt;p&gt;&lt;strong&gt;No fine-tuning.&lt;/strong&gt; All models were used out of the box. Fine-tuning on bug report data (even with a small dataset) could significantly change the rankings — a fine-tuned all-minilm might outperform a generic mxbai-embed-large.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale test with real embeddings&lt;/strong&gt; — the 100K vector store runs used synthetic vectors for latency only; rerunning with real embeddings would validate HNSW recall at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add qwen3-embedding:4b&lt;/strong&gt; when available in Ollama.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU benchmark&lt;/strong&gt; — RTX 4000 SFF Ada on Hetzner GEX44 or Vast.ai spot instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning experiment&lt;/strong&gt; — fine-tune all-minilm on bug report data, compare with out-of-box bge-m3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language test&lt;/strong&gt; — add bug reports in Russian, Chinese, Spanish. Test bge-m3's multilingual advantage.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>embeddings</category>
      <category>vectorsearch</category>
      <category>devtools</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
