<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Budanov</title>
    <description>The latest articles on DEV Community by Alexander Budanov (@alexbudanov).</description>
    <link>https://dev.to/alexbudanov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875125%2Fe9a416a4-9548-479a-b548-1361cfe29aa3.png</url>
      <title>DEV Community: Alexander Budanov</title>
      <link>https://dev.to/alexbudanov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexbudanov"/>
    <language>en</language>
    <item>
      <title>How I Chose an Embedding Model for Bug Report Deduplication</title>
      <dc:creator>Alexander Budanov</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:58:09 +0000</pubDate>
      <link>https://dev.to/alexbudanov/how-i-chose-an-embedding-model-for-bug-report-deduplication-1lgb</link>
      <guid>https://dev.to/alexbudanov/how-i-chose-an-embedding-model-for-bug-report-deduplication-1lgb</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I'm the founder of &lt;a href="https://apexbridge.tech" rel="noopener noreferrer"&gt;Apex Bridge Technology&lt;/a&gt; and the creator of &lt;a href="https://bugspotter.io" rel="noopener noreferrer"&gt;BugSpotter&lt;/a&gt;. This benchmark was built to make a real product decision — which embedding model ships with BugSpotter. I'm publishing the methodology, data, and code so you can verify the numbers and make your own call.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I benchmarked 6 self-hosted embedding models against TF-IDF/BM25 baselines on 650 bug reports — including 250 real SDK captures collected via Playwright from the &lt;a href="https://github.com/apex-bridge/bugspotter" rel="noopener noreferrer"&gt;BugSpotter&lt;/a&gt; demo app — and cross-validated on 407 real Mozilla Bugzilla bugs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plain BM25 beats most embedding models on real-world data.&lt;/strong&gt; On Mozilla Bugzilla, whitespace-tokenized BM25 scores F1=0.954 — beating &lt;code&gt;bge-m3&lt;/code&gt; (0.948), &lt;code&gt;nomic&lt;/code&gt; (0.894), and &lt;code&gt;snowflake&lt;/code&gt; (0.872), narrowly beating &lt;code&gt;all-minilm&lt;/code&gt; (0.952), and losing only to &lt;code&gt;qwen3&lt;/code&gt; (0.966) and &lt;code&gt;mxbai&lt;/code&gt; (0.962). &amp;lt;1ms per pair, no Ollama, no vector DB. &lt;strong&gt;If your bug reports are English plain text, BM25 is probably the right answer&lt;/strong&gt; — reach for embeddings only for multilingual matching or vague UI-interaction bugs. See Cross-Validation on Mozilla Bugzilla for the full ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Among embedding models: qwen3 leads; bge-m3 and mxbai tied for second.&lt;/strong&gt; qwen3 (CV F1=0.990) beats bge-m3 (0.986) and mxbai (0.984) by ~0.004-0.006 F1 — small but consistent across 3 seeds. bge-m3 and mxbai overlap on bootstrap CIs; you can't rank them. Pick qwen3 if you can afford 2.7s latency; otherwise choose between bge-m3 and mxbai on deployment constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Field-weighting BM25 overfits, even at small grid size.&lt;/strong&gt; "Tuned" BM25F with grid-searched field weights scores 0.923 oracle F1 but only 0.872 ± 0.012 under proper 5-fold CV — a 5-point overfitting gap from just 6 weight configs on 4,475 pairs. Plain BM25 (no fields) at 0.951 beats every BM25F variant. The simpler the lexical method, the better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What the SDK captures beats what users type.&lt;/strong&gt; Machine-captured fields (console errors, network logs, stack traces) take F1 from 0.951 (title only) to 0.990 (full capture).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thresholds don't transfer between models &lt;em&gt;or&lt;/em&gt; datasets.&lt;/strong&gt; Optimal on my synthetic set: 0.62–0.73. Optimal on Mozilla Bugzilla: 0.27–0.62. The commonly-cited 0.9 misses 42–78% of duplicates on my data. Tune on your own labeled pairs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip to Recommendations →&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All code, data, and results: &lt;a href="https://github.com/apex-bridge/bugspotter-embedding-benchmark" rel="noopener noreferrer"&gt;github.com/apex-bridge/bugspotter-embedding-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;If you run a bug tracker — whether it's Jira, Linear, a self-hosted tool, or something you built yourself — you've seen this: the same bug reported three times by three different people, in three different ways.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BUG-1041:&lt;/strong&gt; Checkout button doesn't work after coupon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUG-1042:&lt;/strong&gt; Can't complete purchase with promo code active.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUG-1043:&lt;/strong&gt; klick on 'place order' does nothing with cupon.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same bug. Three tickets. Three engineers triage it. One fixes it, two discover it's already fixed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The embedding approach
&lt;/h3&gt;

&lt;p&gt;The modern solution is vector similarity: embed each bug report into a high-dimensional vector, store it in a vector database, and when a new report comes in — find the nearest neighbors. If the cosine similarity is above some threshold, flag it as a potential duplicate. (Cosine similarity measures how close two vectors are — 1.0 = identical, 0.0 = unrelated. F1 score balances precision and recall — 1.0 = perfect, and you want it as high as possible.)&lt;/p&gt;
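&lt;p&gt;The core decision reduces to a few lines. A minimal sketch in NumPy — the 3-dimensional vectors are purely illustrative (real embeddings have 384–4096 dims), and 0.73 is just one model's optimum from the results below:&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (maximally similar), 0.0 = orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_vec: np.ndarray, existing_vec: np.ndarray,
                 threshold: float = 0.73) -> bool:
    # The threshold is model-specific — tune it on your own labeled pairs
    return cosine_similarity(new_vec, existing_vec) >= threshold

a = np.array([0.2, 0.9, 0.1])     # "checkout broken with coupon"
b = np.array([0.25, 0.85, 0.15])  # paraphrase of the same bug
c = np.array([0.9, -0.1, 0.4])    # unrelated bug
```

&lt;p&gt;In production the "find nearest neighbors" step is the vector database's job; the threshold decision stays yours.&lt;/p&gt;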

&lt;p&gt;Simple, elegant, and well-understood. Except for three questions nobody answers well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Which embedding model?&lt;/strong&gt; MTEB leaderboards rank models on academic benchmarks (STS, NLI, retrieval). Bug reports are none of these — they're short, technical, full of stack traces and error codes. A model that tops MTEB might fail on &lt;code&gt;TypeError: Cannot read properties of undefined&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What threshold?&lt;/strong&gt; The commonly cited threshold of 0.9 for near-duplicate detection comes from general-purpose NLP — not from bug reports with console errors and stack traces. Is 0.9 actually right for this domain?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What text to embed?&lt;/strong&gt; A bug report has a title, description, console logs, network errors, stack traces, browser info. Which parts should go into the embedding? Just the title? Everything? Does including the stack trace help or hurt?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why self-hosted matters
&lt;/h3&gt;

&lt;p&gt;You could use OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; and call it a day. But if you're building a self-hosted tool — or your users care about data privacy — think about what's in a bug report. Stack traces contain file paths, variable names, internal URLs. Console errors reveal your tech stack. For regulated industries, sending this to an external API is a non-starter.&lt;/p&gt;

&lt;p&gt;I needed a model that runs locally via Ollama, fits on a budget server, and gives production-quality dedup — without any data leaving the network. Sentry has &lt;a href="https://blog.sentry.io/how-sentry-decreased-issue-noise-with-ai/" rel="noopener noreferrer"&gt;validated embedding-based grouping at scale&lt;/a&gt; — but on a narrower task (error signature grouping, not free-text bug dedup) with a custom fine-tuned model, not off-the-shelf embeddings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this benchmark answers
&lt;/h3&gt;

&lt;p&gt;Existing benchmarks (&lt;a href="https://arxiv.org/abs/2308.09193" rel="noopener noreferrer"&gt;Patil et al. 2023&lt;/a&gt;, &lt;a href="https://dl.acm.org/doi/abs/10.1145/3576042" rel="noopener noreferrer"&gt;Zhang et al. 2023&lt;/a&gt;) evaluate on Mozilla/Eclipse where bug reports are plain-text title + description — no structured fields. &lt;a href="https://arxiv.org/abs/2412.14802" rel="noopener noreferrer"&gt;Shibaev et al. 2025&lt;/a&gt; covers stack traces specifically but not full bug context. My benchmark complements these by testing on the kind of structured captures modern SDKs emit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 models&lt;/strong&gt; from 22M to 7.6B parameters, all running through Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;650 bug reports&lt;/strong&gt; — 100 real GitHub issues + 300 synthetic paraphrases (30 archetypes x 10) + 250 real SDK captures (25 bugs x 10, collected via Playwright from the BugSpotter demo app)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4,475 labeled pairs&lt;/strong&gt; including hard negatives (different bugs that look similar)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 vector stores&lt;/strong&gt; compared head-to-head&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything reproducible&lt;/strong&gt; — one script, one €25/mo server, MIT-licensed code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: a practical answer to "which model, what threshold, what text, what store" — backed by numbers, not opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup &amp;amp; Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hardware
&lt;/h3&gt;

&lt;p&gt;No GPU. Hetzner CPX42 — 8 vCPU (AMD EPYC), 16 GB RAM, €25/mo. The kind of box most self-hosted setups actually run on. I ran the full pipeline 3 times on identical instances (seeds 42/123/456) to verify stability.&lt;/p&gt;

&lt;p&gt;Everything runs in Docker: Ollama for embedding inference, PostgreSQL 16 + pgvector for vector storage, Qdrant for comparison. ChromaDB and sqlite-vec run embedded in Python — no additional containers.&lt;/p&gt;

&lt;p&gt;The entire stack starts with &lt;code&gt;docker compose up -d&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 6 models
&lt;/h3&gt;

&lt;p&gt;All models run through Ollama's &lt;code&gt;/api/embed&lt;/code&gt; endpoint. No fine-tuning, no custom configurations — just pull and run. Each report is embedded individually (batch size 1), using default &lt;code&gt;num_ctx&lt;/code&gt;. Note: Ollama has had documented embedding consistency issues across versions (&lt;a href="https://github.com/ollama/ollama/issues/3777" rel="noopener noreferrer"&gt;#3777&lt;/a&gt;, &lt;a href="https://github.com/ollama/ollama/issues/4207" rel="noopener noreferrer"&gt;#4207&lt;/a&gt;). I used Ollama v0.20.7 and Python 3.12. Results may differ on other versions — I recommend pinning the Ollama version in production.&lt;/p&gt;
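&lt;p&gt;For reference, embedding one report through that endpoint takes only a few lines (stdlib only; assumes a local Ollama server with the model already pulled):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"

def build_embed_request(model: str, text: str) -> dict:
    # "input" accepts a single string (batch size 1, as in the benchmark)
    # or a list of strings for batching
    return {"model": model, "input": text}

def embed(model: str, text: str) -> list[float]:
    """POST one bug report to a local Ollama server, return its vector."""
    payload = json.dumps(build_embed_request(model, text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"][0]

# Requires `ollama pull bge-m3` and a running server:
# vec = embed("bge-m3", "Checkout button unresponsive after coupon applied")
```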

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;Max Tokens&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22M&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;334M&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;F16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6B&lt;/td&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why these six: they span 22M to 7.6B parameters, they're all available in the Ollama registry, and they represent different architectures and training approaches. all-minilm is the baseline (smallest, fastest — it's what BugSpotter originally shipped with, and this benchmark is the reason we moved off it). qwen3-embedding is the ceiling (SOTA on MTEB, but quantized to fit in 16GB RAM). Note that qwen3 is architecturally different — it's a decoder-based model using last-token pooling from a full LLM, not a BERT-style encoder. This explains the 95x latency gap vs all-minilm.&lt;/p&gt;

&lt;p&gt;Notable absence: &lt;code&gt;qwen3-embedding:4b&lt;/code&gt; — the model most likely to hit the sweet spot between all-minilm and qwen3-embedding:7.6B. I'll add it when Ollama publishes the weights.&lt;/p&gt;

&lt;p&gt;Note: nomic-embed-text MRL truncation requires &lt;strong&gt;v1.5+&lt;/strong&gt;. If you pull the default &lt;code&gt;nomic-embed-text&lt;/code&gt; without specifying the version, you may get v1 with fixed 768 dims and no truncation support.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;There's no public benchmark for duplicate bug report detection with structured fields (console logs, network errors, stack traces). Existing academic datasets (Mozilla Bugzilla, Eclipse) contain only title + description as plain text. So I built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;650 bug reports from 3 sources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 1: Real GitHub Issues (100 reports).&lt;/strong&gt; Scraped from major open-source repositories using the GitHub API. Only issues labeled &lt;code&gt;bug&lt;/code&gt; that contain error messages or stack traces in code blocks. These provide realistic vocabulary and formatting diversity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 2: Synthetic bug reports (300 reports).&lt;/strong&gt; 30 bug archetypes — each representing a common frontend error pattern (checkout failures, CORS errors, memory leaks, hydration mismatches, useEffect loops, etc.) — with 10 variations each: the original plus 9 paraphrases. The paraphrases are semantically equivalent but lexically different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original: &lt;em&gt;"Checkout button unresponsive after coupon applied"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Paraphrase: &lt;em&gt;"Cannot complete purchase with promo code active"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Noisy: &lt;em&gt;"klick on 'place order' does nothing with cupon"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Truncated: &lt;em&gt;"Coupon breaks checkout CTA"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paraphrases were generated with AI assistance and manually reviewed to ensure semantic equivalence while maximizing lexical diversity. I acknowledge this limits stylistic diversity compared to real multi-author bug reports (see Limitations). This forces models to understand meaning, not just match words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source 3: Real SDK captures (250 reports).&lt;/strong&gt; 25 bugs x 10 variations each, captured via Playwright from the BugSpotter demo app. These are real bug reports in the SDK's structured format — with console errors, network logs, stack traces, and browser metadata. Unlike v1's 40 synthetic SDK captures, these are genuine captures from a running application, making them far more realistic. These bugs are deliberately placed in the &lt;strong&gt;same components&lt;/strong&gt; as the archetypes (checkout, auth, feed, modal) but describe &lt;strong&gt;different problems&lt;/strong&gt;. This creates natural hard negatives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dataset caveat:&lt;/strong&gt; 300 of 650 reports are synthetic paraphrases. This likely makes F1 scores higher than you'd see on real multi-author bug reports. Treat the numbers as relative model rankings, not production expectations. This benchmark tests semantic similarity between bug reports — a proxy for, but not identical to, real-world duplicate detection where two reports about the same bug may be textually very different.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Ground truth: 4,475 labeled pairs
&lt;/h3&gt;

&lt;p&gt;Every pair is labeled as &lt;code&gt;duplicate&lt;/code&gt; or &lt;code&gt;not_duplicate&lt;/code&gt; across four difficulty levels. D3 is the critical category — "Checkout button broken with coupon" vs "Checkout total shows NaN after removing last item" — both mention checkout, both are bugs, but they're different problems. A model that only matches keywords will fail here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding text construction
&lt;/h3&gt;

&lt;p&gt;Each bug report is converted to a single text string for embedding, matching the production &lt;code&gt;build_embedding_text()&lt;/code&gt; function. I tested four simpler strategies (title only, title + description, etc.) — results in the Deep Dives section.&lt;/p&gt;

&lt;p&gt;Full embedding text format:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;title
| description
| console_errors (up to 5)
| failed_network_requests (up to 3)
| Browser: X
| OS: Y
| Page: /path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;All parts joined with &lt;code&gt;|&lt;/code&gt; (pipe separator was inherited from production code; not ablated in this benchmark).&lt;/p&gt;
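&lt;p&gt;A sketch of that construction — field names and caps are as described above, but this is an illustration, not the real &lt;code&gt;build_embedding_text()&lt;/code&gt; (that lives in the benchmark repo):&lt;/p&gt;

```python
def build_embedding_text(report: dict) -> str:
    """Flatten a structured bug report into one pipe-joined string,
    following the field order above. Caps mirror the benchmark:
    at most 5 console errors, at most 3 failed network requests."""
    parts = [report["title"]]
    if report.get("description"):
        parts.append(report["description"])
    parts.extend(report.get("console_errors", [])[:5])
    parts.extend(report.get("failed_network_requests", [])[:3])
    for label, key in (("Browser", "browser"), ("OS", "os"), ("Page", "page")):
        if report.get(key):
            parts.append(f"{label}: {report[key]}")
    return " | ".join(parts)

report = {
    "title": "Checkout button unresponsive after coupon applied",
    "description": "Clicking 'Place order' does nothing once a promo code is active.",
    "console_errors": ["TypeError: Cannot read properties of undefined"],
    "failed_network_requests": ["POST /api/checkout 500"],
    "browser": "Chrome 124",
    "page": "/checkout",
}
text = build_embedding_text(report)
```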

&lt;h3&gt;
  
  
  Evaluation pipeline
&lt;/h3&gt;

&lt;p&gt;For each of the 6 models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt; all 650 reports (3 passes — 1 cold, 2 warm, take median latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt; cosine similarity for all 4,475 pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sweep&lt;/strong&gt; threshold from 0.50 to 0.99 (step 0.01), compute precision, recall, F1 at each&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find&lt;/strong&gt; optimal threshold (max F1) and compute ROC-AUC&lt;/li&gt;
&lt;/ol&gt;
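&lt;p&gt;Steps 3–4 are a plain grid search. A sketch in pure Python — &lt;code&gt;pairs&lt;/code&gt; holds illustrative (similarity, is_duplicate) tuples, not benchmark data:&lt;/p&gt;

```python
def f1_at(threshold, pairs):
    """Precision/recall/F1 treating sim >= threshold as 'duplicate'."""
    tp = sum(1 for sim, dup in pairs if sim >= threshold and dup)
    fp = sum(1 for sim, dup in pairs if sim >= threshold and not dup)
    fn = sum(1 for sim, dup in pairs if sim < threshold and dup)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def sweep(pairs, lo=0.50, hi=0.99, step=0.01):
    """Grid-search the threshold, return (best_threshold, best_f1).
    [0.50, 0.99] suits cosine similarities; normalized lexical scores
    (BM25) sit much lower and need lo=0.00."""
    n = int(round((hi - lo) / step))
    grid = [round(lo + i * step, 2) for i in range(n + 1)]
    return max(((t, f1_at(t, pairs)) for t in grid), key=lambda x: x[1])

# Illustrative pairs: (cosine similarity, labeled duplicate?)
pairs = [(0.95, True), (0.82, True), (0.71, False), (0.40, False)]
best_threshold, best_f1 = sweep(pairs)
```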

&lt;p&gt;Between models, I explicitly unload the previous model from Ollama to prevent memory cross-contamination. Latency variance across warm runs was low (&amp;lt;5% CV for all models), confirming median values are representative.&lt;/p&gt;

&lt;p&gt;To verify reproducibility, I ran the full pipeline 3 times on separate Hetzner CPX42 instances (seeds 42, 123, 456 for synthetic data generation). Total cost: 3 × ~€0.20 = €0.60, ~5 hours per VM. The seeds affect only minor noise injection in paraphrases — the models, SDK captures, and core dataset are identical across runs. Ollama embeddings are deterministic at batch size 1 in this configuration (I verified by embedding the same report multiple times and getting identical vectors). All results in this article report mean ± std across these 3 runs.&lt;/p&gt;

&lt;p&gt;A TF-IDF baseline (sklearn TfidfVectorizer with sublinear_tf and bigrams) runs on the same pairs for comparison — same input text, different vectorization method.&lt;/p&gt;
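&lt;p&gt;A minimal version of that baseline (scikit-learn; toy documents here — in the benchmark the input text is identical to what the embedding models see):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Checkout button unresponsive after coupon applied",        # original
    "Cannot complete purchase when the coupon code is active",  # paraphrase
    "Login page renders blank on refresh",                      # different bug
]

# sublinear_tf + word bigrams, matching the baseline described above
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
sims = cosine_similarity(vectorizer.fit_transform(docs))

dup_score, unrelated_score = sims[0, 1], sims[0, 2]
```

&lt;p&gt;On this toy corpus the paraphrase pair outscores the unrelated pair purely through the shared "coupon" token — exactly the keyword dependence embeddings are supposed to transcend.&lt;/p&gt;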

&lt;p&gt;The full pipeline runs end-to-end with one command: &lt;code&gt;./deploy/run_clean.sh --seed 42&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: The Numbers
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads-up:&lt;/strong&gt; this section ranks the 6 embedding models plus lexical baselines on the synthetic benchmark. The headline finding — &lt;em&gt;whether you need embeddings at all&lt;/em&gt; — comes from the independent Mozilla Bugzilla data in Cross-Validation on Mozilla Bugzilla. If you only want to know "BM25 or embeddings?", jump there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The main table
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;F1 is 5-fold cross-validated: threshold is picked on 4 train folds, F1 is measured on the held-out fold, results are averaged across the 5 folds. Threshold column shows the mean train-fold optimum. All values mean across 3 seeds.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;CV F1&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;&lt;a href="mailto:Recall@0.9"&gt;Recall@0.9&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.6B&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;2,662ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;29%&lt;/td&gt;
&lt;td&gt;268ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;0.984&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;td&gt;224ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;all-minilm&lt;/strong&gt; †&lt;/td&gt;
&lt;td&gt;22M&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;334M&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;220ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;td&gt;82ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.17&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.0%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.13&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.1%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25F (field-weighted, default)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.923&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.09&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.7%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;BM25F tuned (5-fold CV)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.872&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.05&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.0%&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;&amp;lt;1ms&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;† all-minilm is evaluated on 4,415 pairs (the other models on 4,475). Sixty pairs are dropped because 76 reports exceed all-minilm's 256-token context window — including all 10 reports each from &lt;code&gt;sdk_json_parse_crash&lt;/code&gt;, &lt;code&gt;sdk_rate_limit_429&lt;/code&gt;, and &lt;code&gt;sdk_zindex_conflict&lt;/code&gt;, plus 12 GitHub issues. all-minilm cannot embed these at all. In production that means all-minilm will silently fail on long bug reports; the F1 here is an optimistic upper bound over the pairs it can handle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Four things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. qwen3 leads; bge-m3 and mxbai tied for second.&lt;/strong&gt; Bootstrap 95% CIs (1000 resamples over pairs, evaluated at the CV-picked threshold) give qwen3 [0.989–0.992], bge-m3 [0.985–0.988], and mxbai [0.983–0.987]. qwen3's lower bound sits at or above bge-m3's upper bound — qwen3's lead over the #2/#3 pair is small (~0.004) but consistent across seeds. bge-m3 and mxbai overlap — you can't rank them from this data. The bottom tier (all-minilm, snowflake, nomic) sits cleanly below. Archetype-level CV (holding out entire bug types, not random pairs) drops F1 by only 0.003–0.005. Pick from the top tier based on latency, dimensions, and hard-negative error rates — not on the headline F1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Lexical baselines are surprisingly strong — but only the simplest ones.&lt;/strong&gt; TF-IDF scores F1=0.973, plain BM25 scores 0.951. The best embedding (qwen3) beats TF-IDF by ~1.7 points and the #2 embedding (bge-m3) by only ~1.3. Embeddings still win, but not by much. The lexical picture gets more interesting in the other direction: &lt;strong&gt;adding field weighting actively hurts&lt;/strong&gt;. Default BM25F (weights: title=3, desc=1, console=2, network=1.5) scores 0.923 — 5 points &lt;em&gt;below&lt;/em&gt; plain BM25. And "tuning" the weights by grid search under proper 5-fold CV scores only 0.872 ± 0.012 — 5 more points below default. The grid-searched weights look good on the fold they were picked on (oracle F1 = 0.923, essentially matching default) but don't generalize to held-out folds.&lt;/p&gt;

&lt;p&gt;So on this data, the best lexical method is the simplest one: plain BM25 with whitespace tokenization and no fields at all. Porter stemming makes it collapse to F1=0.038 because it turns "undefined" into "undefin" and "CORS" into "cor", destroying the exact-token matching that makes stack traces and error IDs distinctive. &lt;strong&gt;And on Mozilla Bugzilla (next section), plain BM25 &lt;em&gt;beats&lt;/em&gt; most of the embedding models&lt;/strong&gt; — so "use BM25, skip the stemming and field weights" is a surprisingly defensible default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Latency varies ~95x.&lt;/strong&gt; all-minilm embeds a bug report in 28ms. qwen3 takes 2,662ms — nearly 3 seconds. For real-time "is this a duplicate?" on bug submit, that's the difference between invisible and noticeable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Recall@0.9 column is the most important one.&lt;/strong&gt; It shows what happens if you use the commonly cited 0.9 threshold. Even the best model (qwen3) would catch only 58% of duplicates. The worst (snowflake) catches 22%. At threshold 0.9, your dedup system is silently missing a large share of duplicates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeqt86vaw1osb2ok4m76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeqt86vaw1osb2ok4m76.png" alt="Embedding Models for Bug Deduplication: F1 vs Latency" width="800" height="577"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Each bubble is a model. X = median latency, Y = F1 score, size = parameter count. The top-left corner (high F1, low latency) is the sweet spot — bge-m3 and mxbai come closest.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on absolute numbers:&lt;/strong&gt; these F1 scores reflect this benchmark dataset with controlled synthetic paraphrases. Real-world performance will be lower — treat these as relative rankings between models, not expected production metrics.&lt;/p&gt;
&lt;/blockquote&gt;
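&lt;p&gt;The winning lexical baseline above — plain BM25, whitespace tokens, no stemming, no field weights — is short enough to write from scratch. A sketch of the standard Okapi scoring (toy corpus; the benchmark's implementation may differ in normalization details):&lt;/p&gt;

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 with whitespace tokenization — no stemming, no fields."""
    docs = [doc.split() for doc in corpus]         # whitespace tokens only
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "TypeError: Cannot read properties of undefined in checkout.js",
    "CORS error on POST /api/checkout",
    "Login session expires immediately after refresh",
]
scores = bm25_scores("Cannot read properties of undefined", corpus)
```

&lt;p&gt;Whitespace tokenization is what keeps exact tokens like &lt;code&gt;checkout.js&lt;/code&gt; and &lt;code&gt;undefined&lt;/code&gt; intact — run the same corpus through a Porter stemmer and the distinctive tokens blur together, which is the F1=0.038 collapse described above.&lt;/p&gt;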

&lt;h3&gt;
  
  
  Robustness: hard pairs only (D2 + D3)
&lt;/h3&gt;

&lt;p&gt;A fair critique: the full-set F1 includes 1,398 D4 "easy negative" pairs (different bugs, different components) that any sensible model should nail. Are the rankings just being propped up by the easy cases?&lt;/p&gt;

&lt;p&gt;I re-swept thresholds on the D2 + D3 subset only (paraphrases + hard negatives, 3,062 pairs per seed). Rankings are unchanged and F1s are stable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;F1 (full, re-swept)&lt;/th&gt;
&lt;th&gt;F1 (D2+D3 only, re-swept)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.991&lt;/td&gt;
&lt;td&gt;0.992&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.987&lt;/td&gt;
&lt;td&gt;0.989&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;0.987&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;0.979&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;0.974&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.978&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.960&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Both columns are F1 at the optimal threshold for their respective pair set — not the 5-fold CV number from the main Results table. The comparison is internally consistent (same protocol on both subsets); the ~0.001 gap to the main table's CV F1 is the usual oracle-vs-CV difference.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Everything goes slightly &lt;em&gt;up&lt;/em&gt; on the harder subset, because the threshold sweep finds a tighter optimum when easy pairs no longer anchor it. The ~2-point embedding-vs-lexical gap holds. If you had expected TF-IDF to collapse when easy negatives are removed, this benchmark doesn't show that effect on this data.&lt;/p&gt;

&lt;h3&gt;
  
  
  📝 A bug I caught during development
&lt;/h3&gt;

&lt;p&gt;When I first ran this benchmark I got TF-IDF=0.774, BM25=0.388, BM25F=0.499 — numbers suspiciously worse than any baseline has a right to be. All three were wrong, for two separate reasons.&lt;/p&gt;

&lt;p&gt;The threshold-sweep function was copied from the embedding-similarity code, which scans &lt;code&gt;[0.50, 1.00]&lt;/code&gt; — the right range for cosine similarities. But BM25's normalized scores cluster in &lt;code&gt;[0.05, 0.25]&lt;/code&gt;, so the sweep was searching empty space and returning whatever F1 happened to land at 0.50. Fixing the range to &lt;code&gt;[0.00, 1.00]&lt;/code&gt; gave the numbers you see above.&lt;/p&gt;

&lt;p&gt;The F1=0.038 I initially got for "tuned BM25F" was a second bug: I was stemming with Porter, which turned &lt;code&gt;undefined&lt;/code&gt; into &lt;code&gt;undefin&lt;/code&gt;, &lt;code&gt;CORS&lt;/code&gt; into &lt;code&gt;cor&lt;/code&gt;, &lt;code&gt;processPayment&lt;/code&gt; into &lt;code&gt;processpay&lt;/code&gt; — destroying the exact-token matching that makes stack traces and error IDs distinctive. Removing stemming and grid-searching field weights on all 4,475 pairs lifted F1 to 0.923 — but under proper 5-fold CV (weights picked on train folds, F1 measured on held-out) it drops to 0.872 ± 0.012. That's a 5-point overfitting gap: field weights selected on any fold don't generalize to the next. Even at 6 configs × 4,475 pairs, grid search is enough to overfit. Plain BM25 at 0.951 beats both versions of BM25F — the lesson is that field weighting doesn't earn its complexity on this data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson: if your baseline is way worse than a baseline has any right to be, suspect yourself first.&lt;/strong&gt; I'm sharing this before the main Recommendations because catching this kind of bug is the most practically useful thing I can teach anyone reading a retrieval benchmark — and it almost shipped a "21-point gap between embeddings and lexical methods" claim that was a methodology bug, not a finding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why threshold matters more than model choice
&lt;/h3&gt;

&lt;p&gt;The difference between the best and worst model is 1.7% F1 (0.990 vs 0.973). Using 0.9 instead of each model's optimal threshold costs you &lt;strong&gt;42–78% of your duplicates&lt;/strong&gt;. Threshold selection is the single highest-leverage decision in a dedup pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdy3s0ahvaj4uxzwahh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdy3s0ahvaj4uxzwahh.png" alt="Precision-Recall Curves" width="800" height="644"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Precision-Recall curves for all 6 models. The dots mark each model's optimal threshold. Notice how the curves separate mainly in the high-recall region — that's where threshold choice matters most.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each model has its own optimal operating point. There's no universal "good threshold" — it depends on the model's embedding space:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Optimal Threshold&lt;/th&gt;
&lt;th&gt;What 0.9 would cost you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;Miss 63% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;Miss 78% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;Miss 71% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 53% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic v1.5&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 64% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3 7.6B&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Miss 42% of duplicates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; don't hardcode a threshold — especially not mine. These thresholds were tuned on the same dataset used for evaluation. Run a sweep on your own data, even a small one (50–100 labeled pairs) — the optimal threshold for your model + your data will likely differ.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How models see duplicates vs non-duplicates
&lt;/h3&gt;

&lt;p&gt;The violin plot below shows the distribution of cosine similarity scores for each pair type, per model. The key question: &lt;strong&gt;is there a clean gap between the duplicate distribution (D1, D2) and the non-duplicate distribution (D3, D4)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8trzqm091qnklprkb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8trzqm091qnklprkb8.png" alt="Cosine Similarity Distributions by Pair Type" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What this reveals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;D1 (exact duplicates)&lt;/strong&gt; cluster at 0.90–1.0 for all models — no surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D2 (paraphrases)&lt;/strong&gt; spread across 0.65–0.95 — this is where models differ. mxbai and qwen3 push D2 higher (tighter clusters), making them easier to separate from negatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D3 (hard negatives)&lt;/strong&gt; are the problem. They overlap with D2 in the 0.55–0.75 range for weaker models (nomic, snowflake). This overlap is exactly where false positives come from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D4 (easy negatives)&lt;/strong&gt; sit below 0.5 for most models — these are trivial to filter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wider the gap between D2 and D3, the easier it is to pick a threshold that works. mxbai has the cleanest separation. nomic has the most overlap — which explains its lower F1.&lt;/p&gt;
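
&lt;p&gt;You don't need the plot to measure that gap. A small sketch (the dict layout and names are mine) that summarizes each pair type and reports the D2-vs-D3 separation margin:&lt;/p&gt;

```python
from statistics import mean

def separation_margin(sims_by_type):
    """Gap between the hardest duplicates (D2) and hardest negatives (D3).

    sims_by_type: {"D1": [...], "D2": [...], "D3": [...], "D4": [...]}
    maps each pair type to its cosine similarities.
    """
    summary = {t: (min(s), mean(s), max(s)) for t, s in sims_by_type.items()}
    # Positive margin: some threshold cleanly splits D2 from D3.
    # Negative margin: the distributions overlap, and any threshold
    # trades false negatives against false positives.
    margin = min(sims_by_type["D2"]) - max(sims_by_type["D3"])
    return summary, margin
```

&lt;p&gt;mxbai's clean separation would show up as a margin near or above zero; nomic's overlap as a clearly negative one.&lt;/p&gt;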

&lt;h2&gt;
  
  
  Deep Dives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What to embed: the input strategy experiment
&lt;/h3&gt;

&lt;p&gt;A bug report contains many fields. Which ones should go into the embedding? I tested four strategies using mxbai-embed-large (other models may rank strategies differently, but the direction should hold):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;What's included&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Avg words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A: Title only&lt;/td&gt;
&lt;td&gt;title&lt;/td&gt;
&lt;td&gt;0.951&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;7.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B: Title + Desc&lt;/td&gt;
&lt;td&gt;title + description&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;28.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C: + Console&lt;/td&gt;
&lt;td&gt;title + desc + first_error&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;td&gt;34.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;D: Full capture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;title + desc + errors + network + env&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.990&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteu2or9zizaw4z7b45dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteu2or9zizaw4z7b45dp.png" alt="What to Embed? Input Strategy Comparison" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The progression tells a clear story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Description is the biggest single contributor&lt;/strong&gt; (+2.7% F1 over title only). This makes sense — the title is a summary, the description contains the actual technical detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Console errors show diminishing returns at this level&lt;/strong&gt; (-0.2% F1 vs title+desc). On this dataset, adding the first console error alone doesn't help beyond what the description already provides. However, the full capture strategy (D) shows that the combination of all machine-captured fields together pushes F1 to 0.990.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full capture is what matters&lt;/strong&gt; (+1.2% F1 over title+desc). Adding all machine-captured fields — console errors, network logs, and environment info — together provides the signal needed for the best dedup quality. Individual fields may not move the needle, but the combination does.&lt;/p&gt;
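
&lt;p&gt;For reference, Strategy D is just a field concatenation. A sketch of how such an embedding input might be assembled — the key names are illustrative, not BugSpotter's actual capture schema:&lt;/p&gt;

```python
def build_embedding_input(report):
    """Assemble the "full capture" (Strategy D) embedding input.

    report: dict with optional keys — title, description,
            console_errors (list of str), network_errors (list of dict),
            env (dict). Missing fields are skipped, so the same function
    covers strategies A–D depending on what the report contains.
    """
    parts = [report.get("title", ""), report.get("description", "")]
    parts += report.get("console_errors", [])
    parts += [f"{e.get('status')} {e.get('url')}"
              for e in report.get("network_errors", [])]
    env = report.get("env", {})
    if env:
        parts.append(" ".join(f"{k}={v}" for k, v in sorted(env.items())))
    return "\n".join(p for p in parts if p)
```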

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If you're building an SDK that captures bug context, this is why structured fields matter more than you'd think:&lt;/strong&gt; machine-captured data (console errors, network logs, stack traces) is deterministic — two users hitting the same bug generate the same console error, even if one writes "checkout broken" and the other writes "can't buy stuff lol." Free-text descriptions can't give an embedding model that kind of signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat: this finding is partially optimistic by construction.&lt;/strong&gt; The 250 SDK captures in my dataset are 25 bugs × 10 variations, where each of the 10 variations shares the &lt;em&gt;exact same&lt;/em&gt; console logs, stack traces, URLs, browser metadata, and timestamps — they came from one Playwright capture per bug, with only the title and description varied afterward. That's realistic for "two users on identical setups hit the same bug," but real multi-author reports have different browsers, different user IDs, different stack trace line numbers, different network timing. Some of the +1.2% F1 from Strategy B → D is the embedding matching on identical structured strings that wouldn't be identical in production. The &lt;em&gt;direction&lt;/em&gt; of the finding (structured fields help) holds across the per-category Bugzilla analysis too, but the &lt;em&gt;magnitude&lt;/em&gt; here should be read as an upper bound on what you'd see with real-world-varied captures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Hard negatives: where models actually differ
&lt;/h3&gt;

&lt;p&gt;Overall F1 scores are close (0.979–0.990). But the models diverge on the hardest task: distinguishing different bugs that share vocabulary. I analyzed the D3 pairs (different bugs in the same component) specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;D3 False Positives&lt;/th&gt;
&lt;th&gt;D2 False Negatives&lt;/th&gt;
&lt;th&gt;Total errors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;49&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The larger, more realistic dataset changed the picture.&lt;/strong&gt; In v1 (540 reports, 40 synthetic SDK captures), qwen3 had zero false positives on hard negatives. In v2 (650 reports, 250 real SDK captures), every model has significant false positives. The real SDK captures create much harder negative pairs — real console errors and stack traces share more vocabulary across different bugs than synthetic ones did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bge-m3 and qwen3 make the fewest total errors&lt;/strong&gt; (48 and 49 respectively). bge-m3 has 29 FP + 19 FN; qwen3 has 32 FP + 17 FN. Both catch most real duplicates while keeping false positives manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;all-minilm makes 2x more errors&lt;/strong&gt; (97 total) — 34 false positives and 63 false negatives. The headline F1 gap of 1% hides this 2x difference on hard cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake struggles the most.&lt;/strong&gt; 43 false positives and 48 false negatives — 91 total errors.&lt;/p&gt;

&lt;p&gt;The practical implication: on realistic data, &lt;strong&gt;no model achieves zero false positives&lt;/strong&gt;. If your dedup system auto-merges without human review, you need a confidence threshold above the optimal F1 threshold to reduce false positives — at the cost of missing more duplicates. Human review remains important.&lt;/p&gt;
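
&lt;p&gt;For an auto-merge path, the sweep changes shape: instead of maximizing F1, take the lowest threshold whose precision clears a floor you set, and route everything below it to human review. A minimal sketch (names are mine):&lt;/p&gt;

```python
def threshold_for_precision(scores, labels, min_precision=0.99, steps=101):
    """Lowest threshold whose precision meets the floor, or None.

    Trades recall for precision: pairs above the returned threshold
    are safe enough to auto-merge; the rest go to review.
    """
    for i in range(steps):
        t = i / (steps - 1)
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp and tp / (tp + fp) >= min_precision:
            return t
    return None  # no threshold reaches the floor on this data
```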

&lt;p&gt;Some real examples of what models got wrong:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Confused pair&lt;/th&gt;
&lt;th&gt;Why it's tricky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Memory leak on infinite scroll" vs "Images fail to load on fast scroll"&lt;/td&gt;
&lt;td&gt;Both mention scrolling + performance issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Body scroll not locked in modal" vs "Focus escapes dialog to background"&lt;/td&gt;
&lt;td&gt;Both about modal behavior on iOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"CORS error after subdomain change" vs "Avatar images blocked by CDN CORS"&lt;/td&gt;
&lt;td&gt;Both contain "CORS" + "blocked"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  MRL dimension truncation
&lt;/h3&gt;

&lt;p&gt;Qwen3 and nomic-embed-text support Matryoshka Representation Learning (MRL), meaning you can truncate their embeddings to fewer dimensions without retraining. This trades storage for quality.&lt;/p&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Storage per 100K&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;0.972&lt;/td&gt;
&lt;td&gt;49 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;98 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;195 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic&lt;/td&gt;
&lt;td&gt;768 (full)&lt;/td&gt;
&lt;td&gt;0.981&lt;/td&gt;
&lt;td&gt;293 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;49 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;98 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;0.991&lt;/td&gt;
&lt;td&gt;195 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3&lt;/td&gt;
&lt;td&gt;4096 (full)&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;1,563 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdo5f81pvssz3y6t7r90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdo5f81pvssz3y6t7r90.png" alt="MRL Dimension Truncation: F1 vs Storage" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;nomic loses very little from truncation.&lt;/strong&gt; From 768 to 128 dims, F1 drops only from 0.981 to 0.972. At 512 dims (0.980), quality is nearly identical to full — 34% less storage for effectively the same performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;qwen3 is remarkably stable across all dimensions.&lt;/strong&gt; F1 stays at 0.990–0.991 from 128 dims all the way to 4096. This means you can truncate qwen3 to 128 dims (49 MB per 100K records) with zero quality loss — a major practical advantage. It also sidesteps pgvector's 2000-dimension limit entirely: just truncate to 128 or 256 dims and use pgvector without issue.&lt;/p&gt;
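
&lt;p&gt;Using a truncated MRL embedding is mechanically simple: slice the first &lt;code&gt;dims&lt;/code&gt; components and L2-renormalize before computing cosine similarity. Re-normalizing matters, because the truncated vector is no longer unit-length. A sketch with plain lists (names are mine):&lt;/p&gt;

```python
import math

def truncate_mrl(vec, dims):
    """Keep the first `dims` components and L2-renormalize.

    Only valid for MRL-trained models (e.g. qwen3, nomic v1.5),
    where the leading dimensions carry a usable coarse representation.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    # For unit vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))
```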

&lt;h3&gt;
  
  
  F1 by bug category
&lt;/h3&gt;

&lt;p&gt;Not all bugs are equally easy to deduplicate. I broke down F1 by error type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiawi9nxhe9w94rpl7p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyiawi9nxhe9w94rpl7p0.png" alt="F1 by Model x Bug Category" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heatmap shows that &lt;strong&gt;UI interaction bugs are hardest&lt;/strong&gt; for all models — they tend to have vague descriptions ("button doesn't work") and share vocabulary across different issues. &lt;strong&gt;Network errors are easiest&lt;/strong&gt; — they come with specific HTTP status codes and endpoint URLs that are highly discriminative. (The "GitHub Issues" column shows 0.00 because these 100 reports have no duplicate pairs — they provide vocabulary diversity, not dedup targets.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Where embeddings beat keyword matching most
&lt;/h3&gt;

&lt;p&gt;Per-category F1 gaps tell a more nuanced story than the overall ~2-point headline. Sorted by how much the best embedding beats the best lexical baseline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Best embedding&lt;/th&gt;
&lt;th&gt;TF-IDF&lt;/th&gt;
&lt;th&gt;BM25&lt;/th&gt;
&lt;th&gt;Gap (emb − best lex)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI interaction&lt;/td&gt;
&lt;td&gt;0.994&lt;/td&gt;
&lt;td&gt;0.947&lt;/td&gt;
&lt;td&gt;0.916&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.7 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State management&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.942&lt;/td&gt;
&lt;td&gt;0.932&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network errors&lt;/td&gt;
&lt;td&gt;0.994&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JS errors&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;0.962&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.2 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSS/UI&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;React-specific&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.4 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The pattern:&lt;/strong&gt; on technical bugs with distinctive error messages (React, CSS, performance, JS, network), TF-IDF sits within 0.4–2 F1 points of the best embedding — essentially a tie. The embedding advantage shows up where users describe symptoms in free-form prose — UI interaction (4.7 pts) and state management (2.5 pts) — where "button doesn't work" and "can't click submit" share little vocabulary despite describing the same bug. If most of your bugs come with structured error identifiers, embeddings don't help much. If most of your bugs come from users writing in their own words, embeddings earn their cost but not dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Validation on Mozilla Bugzilla
&lt;/h2&gt;

&lt;p&gt;Everything so far has been on my own dataset. The synthetic half is controlled-quality by construction and the SDK captures are 10 paraphrases per bug — not what real multi-author duplicates look like. The honest question: do these rankings survive on bug reports I had no hand in creating?&lt;/p&gt;

&lt;p&gt;I ran all 6 models and both lexical baselines against 407 Mozilla Bugzilla bugs (250 duplicate pairs, 100 hard negatives). These are plain-text title + description with no SDK captures, no stack traces, no structured fields — fundamentally different data from the synthetic benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqc4dmh3fpdntmfxb4sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqc4dmh3fpdntmfxb4sj.png" alt="Cross-Validation on Mozilla Bugzilla: All Methods Ranked" width="800" height="483"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Bugzilla F1 for every method tested. BM25 (diamond marker, 0.954) sits at #3 — ahead of &lt;code&gt;all-minilm&lt;/code&gt;, &lt;code&gt;bge-m3&lt;/code&gt;, &lt;code&gt;nomic&lt;/code&gt;, and &lt;code&gt;snowflake&lt;/code&gt;. The dashed line is BM25's score; four of the six embedding models fall below it.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The two-dataset picture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Synthetic CV F1&lt;/th&gt;
&lt;th&gt;Bugzilla F1&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;Synthetic rank&lt;/th&gt;
&lt;th&gt;Bugzilla rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;qwen3-embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.990&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.966&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−0.024&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.984&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.962&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−0.022&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2 ↑&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;BM25 baseline&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.951&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;0.954&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;+0.003&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;7 (below all 6)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;3&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.978&lt;/td&gt;
&lt;td&gt;0.952&lt;/td&gt;
&lt;td&gt;−0.026&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;em&gt;TF-IDF baseline&lt;/em&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.973&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.950&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;−0.023&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;8&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;5&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.948&lt;/td&gt;
&lt;td&gt;−0.038&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;6 ↓↓↓↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.894&lt;/td&gt;
&lt;td&gt;−0.079&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake-arctic-embed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.872&lt;/td&gt;
&lt;td&gt;−0.105&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the two datasets agree on
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. qwen3 is the strongest model on both.&lt;/strong&gt; It's #1 on synthetic and #1 on Bugzilla. If you can afford 7.6B parameters and 2.7-second latency, it's the safe pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Threshold 0.9 is terrible on both.&lt;/strong&gt; On synthetic, Recall@0.9 ranged 22–58%. On Bugzilla, the optimal thresholds are 0.27–0.62 — cosine 0.9 misses nearly every duplicate. The "embeddings are near-duplicates above 0.9" folklore doesn't survive either dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The bottom tier stays the bottom tier.&lt;/strong&gt; nomic and snowflake are the worst two models on both datasets, and by the biggest absolute margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. BM25 beats most of the embedding models on Bugzilla.&lt;/strong&gt; This is the uncomfortable finding. On the synthetic benchmark, BM25 scores 0.951 — below every embedding model. On Bugzilla, BM25 scores 0.954 (actually &lt;em&gt;slightly higher&lt;/em&gt; than its synthetic score) — and it beats &lt;code&gt;bge-m3&lt;/code&gt; (0.948), &lt;code&gt;nomic&lt;/code&gt; (0.894), &lt;code&gt;snowflake&lt;/code&gt; (0.872), narrowly beats &lt;code&gt;all-minilm&lt;/code&gt; (0.952), and loses only to the top two embedding models (qwen3 by 0.012, mxbai by 0.008). On single-language plain-text Mozilla bug reports, whitespace-tokenized BM25 is a serious competitor — use it as a sanity-check baseline before concluding that an embedding model is "good enough" for your domain. The same BM25 that costs &amp;lt;1ms per pair beats four of the six embedding models I tested. If your data looks like Bugzilla (English, plain text, experienced-author writing style), the honest recommendation might be "start with BM25 and only reach for embeddings if you need multilingual matching or are handling vague UI descriptions."&lt;/p&gt;

&lt;h3&gt;
  
  
  What they disagree on
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The middle of the leaderboard shuffles substantially.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bge-m3&lt;/code&gt; drops from #2 on synthetic to #4 among the embedding models on Bugzilla (#6 overall, behind BM25), losing 0.038 F1 — the largest drop in the top 3&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mxbai&lt;/code&gt; moves from #3 to #2, overtaking bge-m3&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;all-minilm&lt;/code&gt; climbs from #4 to #3 among the embedding models, overtaking bge-m3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read literally: if you had picked a model by ranking on the synthetic benchmark alone, you would have picked either qwen3 (still correct) or bge-m3 (now disputed). The "top 3 are tied" claim from Results is robust, but &lt;em&gt;which&lt;/em&gt; of the three is actually best depends on the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Thresholds do not transfer at all.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Synthetic optimal&lt;/th&gt;
&lt;th&gt;Bugzilla optimal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;td&gt;0.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Applying the synthetic-tuned threshold to Bugzilla costs every model 20–30 F1 points. The synthetic dataset's paraphrase-heavy structure produces tighter duplicate clusters (and therefore higher thresholds) than real multi-author bug reports do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Degradation is uneven.&lt;/strong&gt; qwen3, mxbai and all-minilm lose 0.022–0.026 F1 going from synthetic to Bugzilla — stable. bge-m3 loses 0.038 (~70% more than the others in the top tier). nomic loses 0.079. snowflake loses 0.105 — four times more than qwen3. Headline F1 is one thing; &lt;em&gt;how well a model degrades on data you didn't tune against&lt;/em&gt; is a different thing, and it's the one that matters in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd take from this
&lt;/h3&gt;

&lt;p&gt;If I'd published the synthetic results alone, I'd have recommended bge-m3 or qwen3 with some confidence. The Bugzilla data says something more nuanced: &lt;strong&gt;qwen3 and mxbai are robust across both datasets; bge-m3 falls to 6th place on Bugzilla and even loses to a BM25 baseline; and BM25 itself is the quiet winner on the robustness-per-dollar axis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One number worth sitting with: the F1 gap between qwen3 and bge-m3 on synthetic is 0.004. The F1 drop bge-m3 suffers going from synthetic to Bugzilla is 0.038 — ten times larger.&lt;/strong&gt; The out-of-distribution shift moves you further than any model-choice decision within the top tier. Optimizing which embedding model you pick, beyond the robustness heuristic below, is optimizing noise.&lt;/p&gt;

&lt;p&gt;Two practical heuristics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick a model that degrades gracefully on data you didn't tune against.&lt;/strong&gt; By that criterion, qwen3 and mxbai stay in the top tier on both datasets (losing 0.022–0.024 F1). bge-m3 loses 0.038 and drops 4 ranks. nomic and snowflake lose 3–4x more than the top tier. Headline F1 is one thing; robustness is what actually matters in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run BM25 as a baseline before committing to embeddings.&lt;/strong&gt; On Bugzilla-like data (plain text, single language, experienced authors), BM25 scores within 1.2 F1 points of the best embedding model and beats most of the field. If your data looks like that, embeddings may not be worth the infrastructure. The case for embeddings is strongest where BM25 is structurally weak: multilingual matching and semantically-vague bug descriptions (see the per-category table in Deep Dives — a ~5-point gap on UI-interaction bugs, which is the biggest per-category gap but still small in absolute terms).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
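
&lt;p&gt;That baseline costs almost nothing to run. A self-contained BM25 sketch: whitespace tokenization, lowercasing, no stemming (see the stemming bug above), and the common k1/b defaults rather than tuned values:&lt;/p&gt;

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score `query` against each doc with BM25 (Okapi variant).

    Whitespace tokenization, lowercased, no stemming: stemming mangles
    identifiers like CORS and processPayment into useless stems.
    """
    def tokenize(text):
        return text.lower().split()

    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in doc_tokens) / n
    df = Counter()  # document frequency per term
    for tokens in doc_tokens:
        df.update(set(tokens))

    def idf(term):
        return math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))

    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                num = tf[term] * (k1 + 1)
                den = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
                score += idf(term) * num / den
        scores.append(score)
    return scores
```

&lt;p&gt;If the best embedding model beats this by less than a couple of F1 points on your own labeled pairs, the embedding infrastructure probably isn't paying for itself.&lt;/p&gt;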

&lt;h2&gt;
  
  
  Vector Store Shootout
&lt;/h2&gt;

&lt;p&gt;You've chosen your embedding model. Now: where do you store the vectors? I loaded the same embeddings (qwen3-embedding, 4096 dims) into three stores and measured everything — first at 550 real records, then at synthetic scale from 1K to 100K with all four stores.&lt;/p&gt;

&lt;p&gt;Note: pgvector is excluded from the real-data test because qwen3's 4096-dim vectors exceed pgvector's 2000-dimension index limit. pgvector remains viable with models that output &amp;lt;=1024 dims, or with qwen3 using MRL truncation (see above).&lt;/p&gt;

&lt;h3&gt;
  
  
  At bug-tracker scale (550 records, qwen3 4096 dims)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;ChromaDB&lt;/th&gt;
&lt;th&gt;sqlite-vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Insert time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.69s&lt;/td&gt;
&lt;td&gt;0.63s&lt;/td&gt;
&lt;td&gt;0.41s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.21ms&lt;/td&gt;
&lt;td&gt;3.29ms&lt;/td&gt;
&lt;td&gt;5.52ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall@10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this scale, all three stores return perfect recall. ChromaDB is the fastest on queries (3.29ms), followed by sqlite-vec (5.52ms) and Qdrant (7.21ms).&lt;/p&gt;
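&lt;p&gt;For reference, recall@k here means: of the true top-k neighbors from an exact search, what fraction the store actually returns. A minimal sketch (an illustrative helper, not the benchmark's code):&lt;/p&gt;

```python
# Sketch: how recall@k is typically computed for an ANN store.
# ground_truth: exact top-k ids from brute-force search;
# approx: top-k ids returned by the index under test.
def recall_at_k(ground_truth, approx, k=10):
    hits = len(set(ground_truth[:k]).intersection(approx[:k]))
    return hits / k

exact = [3, 17, 42, 8, 99, 5, 61, 23, 70, 12]
ann   = [3, 17, 42, 8, 99, 5, 61, 23, 70, 11]  # one neighbor missed
print(recall_at_k(exact, ann))  # 0.9
```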

&lt;h3&gt;
  
  
  But what happens at scale?
&lt;/h3&gt;

&lt;p&gt;This is the question the 550-record benchmark can't answer. So I generated synthetic embeddings (4096 dims, random vectors for &lt;strong&gt;latency benchmarking only&lt;/strong&gt; — recall was validated on real data at 550 records; real-world HNSW recall at 100K depends on embedding distribution) and tested at 1K, 10K, 50K, and 100K records. pgvector is included here since synthetic data allows testing with any dimension count.&lt;/p&gt;
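&lt;p&gt;For the curious, random unit vectors for this kind of latency-only test need nothing beyond the standard library (a sketch, not the benchmark's actual generator):&lt;/p&gt;

```python
# Sketch: random unit vectors for latency-only benchmarking.
# Gaussian components normalized to length 1 are uniform on the
# sphere, fine for timing but NOT for recall measurements, since
# real embeddings cluster far more than random vectors do.
import math
import random

def random_unit_vector(dims=4096, seed=42):
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vec = random_unit_vector()
print(len(vec), round(math.sqrt(sum(x * x for x in vec)), 6))  # 4096 1.0
```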

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;pgvector&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;ChromaDB&lt;/th&gt;
&lt;th&gt;sqlite-vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.71ms&lt;/td&gt;
&lt;td&gt;4.73ms&lt;/td&gt;
&lt;td&gt;2.21ms&lt;/td&gt;
&lt;td&gt;1.62ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.69ms&lt;/td&gt;
&lt;td&gt;4.85ms&lt;/td&gt;
&lt;td&gt;2.98ms&lt;/td&gt;
&lt;td&gt;18.54ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.51ms&lt;/td&gt;
&lt;td&gt;7.92ms&lt;/td&gt;
&lt;td&gt;3.47ms&lt;/td&gt;
&lt;td&gt;87.68ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100K p50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.84ms&lt;/td&gt;
&lt;td&gt;7.70ms&lt;/td&gt;
&lt;td&gt;3.61ms&lt;/td&gt;
&lt;td&gt;167.41ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F421ip971ytjlcmo00bcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F421ip971ytjlcmo00bcj.png" alt="Vector Store Performance: 1K to 100K Records" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The picture changes completely at scale:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sqlite-vec falls off a cliff.&lt;/strong&gt; Brute-force search is fine at 1K (1.6ms) but at 100K it's 167ms — 100x slower. No index means linear scan. For a bug tracker that grows beyond a few thousand reports, sqlite-vec stops being viable for real-time queries.&lt;/p&gt;
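&lt;p&gt;"No index means linear scan" concretely: every query computes cosine similarity against every stored vector, so per-query cost grows linearly with corpus size. A toy sketch:&lt;/p&gt;

```python
# Sketch of what "no index" means: every query touches every vector.
# O(n * dims) per query, which is the 1.6ms-to-167ms curve above.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_top_k(query, corpus, k=10):
    # no data structure to prune the scan: score all, sort, slice
    scored = [(cosine(query, v), i) for i, v in enumerate(corpus)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_top_k([1.0, 0.1], corpus, k=2))  # [0, 2]
```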

&lt;p&gt;&lt;strong&gt;ChromaDB is the surprise winner on query latency.&lt;/strong&gt; 3.6ms at 100K, barely different from 2.2ms at 1K. Its HNSW implementation scales almost perfectly. If you only care about query speed, ChromaDB wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qdrant overtakes pgvector at 50K+.&lt;/strong&gt; At 100K, Qdrant (7.7ms) is 29% faster than pgvector (10.8ms). The Rust HNSW starts earning its keep. But the real story is insert time: pgvector takes &lt;strong&gt;27 minutes&lt;/strong&gt; to insert and index 100K records vs Qdrant's &lt;strong&gt;77 seconds&lt;/strong&gt;. That's a 20x difference — pgvector's HNSW index build is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgvector is still the pragmatic choice for most teams.&lt;/strong&gt; 10.8ms at 100K records is fast enough for any bug tracker. The insert time penalty matters only for bulk imports — not for the one-at-a-time inserts that happen when users file bugs. And you get SQL, transactions, backups, and zero additional infrastructure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Store&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Watch out for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams already on PostgreSQL, &amp;lt;100K records&lt;/td&gt;
&lt;td&gt;Slow bulk index build at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qdrant&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt;50K records, need filtered search&lt;/td&gt;
&lt;td&gt;Extra Docker service, REST API complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ChromaDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest queries at any scale, prototyping&lt;/td&gt;
&lt;td&gt;30MB RAM overhead, no SQL, weak production tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sqlite-vec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;5K records, zero dependencies&lt;/td&gt;
&lt;td&gt;Linear scan kills performance at 10K+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  pgvector's 2000-dimension limit
&lt;/h3&gt;

&lt;p&gt;One practical constraint I hit: pgvector cannot create HNSW or IVFFlat indexes on vectors with more than 2000 dimensions (or 4000 with &lt;code&gt;halfvec&lt;/code&gt; in pgvector 0.7.0+). Qwen3-embedding outputs 4096 dims, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;halfvec&lt;/code&gt; type (pgvector 0.7.0+) to index up to 4000 dims at float16 precision&lt;/li&gt;
&lt;li&gt;Or truncate to ≤2000 dims via MRL before indexing (qwen3 loses zero quality — see above)&lt;/li&gt;
&lt;li&gt;Or use brute-force search (no index) — fine for &amp;lt;10K records but won't scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For models with ≤1024 dims (mxbai, bge-m3, nomic, all-minilm, snowflake), this is a non-issue.&lt;/p&gt;
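&lt;p&gt;MRL truncation itself is trivial: keep a prefix of the vector and re-normalize it. A sketch (the helper name is mine):&lt;/p&gt;

```python
# Sketch: MRL truncation to fit qwen3's 4096-dim vectors under
# pgvector's 2000-dim index limit. MRL-trained models front-load
# information, so you keep a prefix and re-normalize.
import math

def mrl_truncate(vec, dims=1024):
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5] * 4096           # stand-in for a real qwen3 embedding
short = mrl_truncate(full, 1024)
print(len(short))             # 1024
```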

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; I tested at 550 real records and up to 100K synthetic records. For a typical bug tracker (&amp;lt;50K reports), pgvector handles everything under 11ms. Beyond 50K, Qdrant's insert speed and query latency start to justify the extra infrastructure. sqlite-vec is great for tiny projects but doesn't scale. ChromaDB is the fastest at every scale but lacks production maturity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwllbvstfu0pb98t9m8aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwllbvstfu0pb98t9m8aa.png" alt="Which Vector Store Should You Use?" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I chose: bge-m3 (for BugSpotter specifically)
&lt;/h3&gt;

&lt;p&gt;Based on this benchmark, I switched BugSpotter's default embedding model from all-minilm to &lt;strong&gt;bge-m3&lt;/strong&gt; (the change landed before this article was published). The choice is driven by deployment constraints, not by headline F1 — and the Bugzilla data complicates the picture in a way worth being honest about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the data actually shows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the synthetic benchmark, the top 3 (qwen3, bge-m3, mxbai) are statistically tied — their bootstrap 95% CIs overlap, so F1 alone can't pick a winner.&lt;/li&gt;
&lt;li&gt;On Mozilla Bugzilla (real multi-author duplicates), qwen3 leads (0.966), mxbai is #2 (0.962), &lt;strong&gt;BM25 is #3 (0.954), and bge-m3 drops to #6 (0.948) — below a whitespace-BM25 baseline&lt;/strong&gt;. On English plain-text bugs, bge-m3 offers no F1 advantage over BM25.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why bge-m3 anyway, for BugSpotter — multilingual is now the load-bearing reason:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bugzilla result forces an honest reframe. If I'd seen this table before picking a model, my case for bge-m3 wouldn't have rested on four roughly-equal factors — it would have rested on one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual support (the real dealbreaker).&lt;/strong&gt; bge-m3 was trained on 100+ languages. BM25 is English-centric by construction — its tokenization, IDF estimation, and stopword assumptions all break on non-English text, and it can't match a Russian bug report to its English duplicate at all. mxbai is English-only. BugSpotter serves users globally; users &lt;em&gt;will&lt;/em&gt; file bugs in their native language. This is why I'm not switching to BM25 despite its 0.954 vs bge-m3's 0.948 on Bugzilla.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic matching on vague UI bugs.&lt;/strong&gt; The per-category table in Deep Dives shows BM25 trails the best embedding by ~4.7 F1 points on UI-interaction bugs and ~2.5 on state-management — the categories where users describe symptoms in free-form prose. The gap is under 2 points on technical bugs with distinctive error identifiers. Bugzilla's contributor base writes more distinctive technical language than typical end users do, which is part of why BM25 holds up so well there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgvector compatibility without truncation.&lt;/strong&gt; bge-m3's 1024 dims fit HNSW directly. qwen3's 4096 dims exceed pgvector's 2000-dim limit and need MRL truncation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async architecture.&lt;/strong&gt; Embedding in BugSpotter runs in a BullMQ background worker — the user never waits. So the 224ms vs 268ms latency difference is invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single €25/mo server.&lt;/strong&gt; bge-m3 runs comfortably alongside pgvector and the API on one Hetzner CPX42.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I were picking for an &lt;strong&gt;English-only product with plain-text bug reports&lt;/strong&gt;, the honest answer from this data is: &lt;strong&gt;start with BM25&lt;/strong&gt;. It's &amp;lt;1ms per pair, no infrastructure, and scores 0.954 on Bugzilla — only 0.012 below qwen3. Reach for embeddings only if you need multilingual matching (bge-m3) or are handling vague UI descriptions where BM25 structurally loses (mxbai or qwen3).&lt;/p&gt;

&lt;p&gt;If I had &lt;strong&gt;GPU budget&lt;/strong&gt;, &lt;strong&gt;qwen3&lt;/strong&gt; wins both datasets and is the only model with a comfortable margin over BM25 on Bugzilla.&lt;/p&gt;

&lt;p&gt;For BugSpotter specifically (multilingual, self-hosted CPU-only, pgvector-compatible), bge-m3's deployment profile still wins — but it's because of the languages argument, not the F1 argument.&lt;/p&gt;

&lt;p&gt;The meta-lesson: pick a model against your &lt;em&gt;deployment&lt;/em&gt; constraints and your own data, and &lt;strong&gt;always run BM25 as a baseline&lt;/strong&gt; before concluding an embedding is worth the infrastructure.&lt;/p&gt;
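&lt;p&gt;Running that BM25 baseline takes a dozen lines. A minimal Okapi BM25 sketch with whitespace tokenization, the same setup as the plain-BM25 baseline here (not the benchmark's exact code):&lt;/p&gt;

```python
# Minimal BM25 (Okapi variant) sketch for a quick baseline before
# committing to any embedding infrastructure.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]   # whitespace tokens
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "TypeError: cannot read properties of undefined in checkout flow",
    "Payment button unresponsive after form validation",
]
scores = bm25_scores("undefined error in checkout", docs)
print(scores[0] > scores[1])  # True
```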

&lt;h3&gt;
  
  
  If bge-m3 isn't right for you
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your priority&lt;/th&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;th&gt;F1 (synthetic / Bugzilla)&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English-only plain-text reports, minimal infra&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.951 / 0.954&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;No model, no Ollama, no vector DB. Beats 4 of 6 embedding models on Bugzilla. Start here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Absolute minimum latency (embedding-based)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;all-minilm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.978 / 0.952&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;td&gt;10x faster than bge-m3, but 2x more errors on hard cases; can't embed long reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max quality, latency irrelevant&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;qwen3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.990 / 0.966&lt;/td&gt;
&lt;td&gt;2.7s&lt;/td&gt;
&lt;td&gt;Best F1 on both datasets, but needs truncation for pgvector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balance of quality + speed (English)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.984 / 0.962&lt;/td&gt;
&lt;td&gt;224ms&lt;/td&gt;
&lt;td&gt;Tied with bge-m3 on synthetic, ahead on Bugzilla, English-only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What threshold should you set?
&lt;/h3&gt;

&lt;p&gt;Don't use 0.9. Don't use any number from a blog post. Do this instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Label 50–100 pairs from your own bug database (duplicate or not)&lt;/li&gt;
&lt;li&gt;Run a threshold sweep from 0.5 to 0.9 in steps of 0.01&lt;/li&gt;
&lt;li&gt;Pick the threshold that maximizes F1 (or bias toward precision if false positives are costly)&lt;/li&gt;
&lt;/ol&gt;
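&lt;p&gt;The sweep itself is a few lines. A sketch, assuming you have labeled (similarity, is_duplicate) pairs from your own tracker:&lt;/p&gt;

```python
# Sketch of the threshold sweep: pick the cutoff that maximizes F1
# on labeled pairs. Bias toward precision instead if false positives
# are costly in your workflow.
def sweep(pairs, lo=0.50, hi=0.90, step=0.01):
    best_t, best_f1 = lo, 0.0
    t = lo
    while hi + 1e-9 > t:
        tp = sum(1 for s, dup in pairs if s >= t and dup)
        fp = sum(1 for s, dup in pairs if s >= t and not dup)
        fn = sum(1 for s, dup in pairs if t > s and dup)
        if tp:
            p = tp / (tp + fp)
            r = tp / (tp + fn)
            f1 = 2 * p * r / (p + r)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        t += step
    return best_t, best_f1

pairs = [(0.91, True), (0.84, True), (0.79, False), (0.62, False)]
t, f1 = sweep(pairs)
print(round(f1, 2))  # 1.0
```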

&lt;p&gt;If you can't label data yet, use these starting points from this benchmark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Start with&lt;/th&gt;
&lt;th&gt;Tune range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.55–0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.65–0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.68&lt;/td&gt;
&lt;td&gt;0.60–0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.68–0.82&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What text should you embed?
&lt;/h3&gt;

&lt;p&gt;Include everything the SDK captures. The experiment showed how much each field contributes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;F1 contribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;Baseline (0.951)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Description&lt;/td&gt;
&lt;td&gt;+2.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ All fields (full capture)&lt;/td&gt;
&lt;td&gt;+3.9% total&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The machine-captured fields (console errors, network requests, stack traces) are the most reliable signal. They're deterministic — same bug produces the same error — while human-written descriptions vary wildly.&lt;/p&gt;
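&lt;p&gt;Concretely, "embed everything" looks like joining the captured fields into one string with a &lt;code&gt;|&lt;/code&gt; field separator. A sketch; the field names are illustrative, not BugSpotter's actual schema:&lt;/p&gt;

```python
# Sketch: building the embedding text from captured report fields,
# joined with a "|" separator. Field names here are hypothetical.
def embedding_text(report):
    fields = [
        report.get("title", ""),
        report.get("description", ""),
        " ".join(report.get("console_errors", [])),
        " ".join(report.get("failed_requests", [])),
    ]
    # skip empty fields so the separator carries real boundaries
    return " | ".join(f for f in fields if f)

report = {
    "title": "Checkout crashes on submit",
    "console_errors": ["TypeError: cart is undefined"],
}
print(embedding_text(report))
# Checkout crashes on submit | TypeError: cart is undefined
```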

&lt;h3&gt;
  
  
  What vector store?
&lt;/h3&gt;

&lt;p&gt;If you already have PostgreSQL and your model has ≤1024 dims: &lt;strong&gt;pgvector&lt;/strong&gt;. Zero additional infrastructure. 10.8ms at 100K records — fast enough. Just watch out for slow bulk index builds and the 2000-dim limit (rules out full qwen3).&lt;/p&gt;

&lt;p&gt;If you have &amp;lt;5K records and want zero dependencies: &lt;strong&gt;sqlite-vec&lt;/strong&gt;. One &lt;code&gt;.db&lt;/code&gt; file, sub-2ms queries. But it uses brute-force search — at 10K records it's 19ms, at 100K it's 167ms. Don't grow into it.&lt;/p&gt;

&lt;p&gt;If you're prototyping or need the fastest queries: &lt;strong&gt;ChromaDB&lt;/strong&gt;. &lt;code&gt;pip install chromadb&lt;/code&gt; and go. Fastest queries at every scale I tested (3.6ms at 100K). But weak production tooling limits its use beyond prototypes.&lt;/p&gt;

&lt;p&gt;If you have &amp;gt;50K records, need fast bulk imports, or use high-dim models like qwen3: &lt;strong&gt;Qdrant&lt;/strong&gt;. It inserts 100K records in 77 seconds vs pgvector's 27 minutes. No dimension limit. Worth the extra Docker service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations &amp;amp; Future Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methodology strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;250 real SDK captures via Playwright (not synthetic), 100 GitHub issues, 300 synthetic paraphrases&lt;/li&gt;
&lt;li&gt;600 hard negative pairs (different bugs, same component)&lt;/li&gt;
&lt;li&gt;Two levels of cross-validation: pair-level CV (F1 drops 0.000–0.002) and archetype-level CV (holds out entire bug types — F1 drops 0.003–0.005). Thresholds generalize to unseen bug categories&lt;/li&gt;
&lt;li&gt;3 runs on separate VMs — seeds vary synthetic noise and negative-pair sampling; Ollama embeddings are deterministic at v0.20.7&lt;/li&gt;
&lt;li&gt;TF-IDF, BM25, and BM25F lexical baselines (naive + tuned under 5-fold CV) — embeddings outperform the best (TF-IDF at 0.973) by ~1.7 points on synthetic data; BM25 overtakes most embeddings on Bugzilla&lt;/li&gt;
&lt;li&gt;Independent cross-validation on 407 Mozilla Bugzilla bugs (see the Cross-Validation section) — rankings partially shuffle but the top tier holds&lt;/li&gt;
&lt;li&gt;Embedding text format matches production code&lt;/li&gt;
&lt;li&gt;Fully reproducible: one script, MIT license, €0.60 total cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I got wrong (or didn't test)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;650 reports is still small.&lt;/strong&gt; Real bug trackers have 10K–100K+ reports. I validated vector store scaling separately on synthetic data up to 100K records (see Vector Store Shootout), but embedding quality at that scale remains untested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-author synthetic paraphrases.&lt;/strong&gt; The 300 synthetic reports and their paraphrases were generated with AI assistance and reviewed by one person. Real bug reports come from dozens of people with different writing styles, languages, and technical vocabulary. This likely makes the D2 (paraphrase) pairs easier than real-world duplicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend/web only.&lt;/strong&gt; All reports come from the JavaScript/TypeScript ecosystem. Backend bugs (Java exceptions, database deadlocks), mobile-native crashes, and infrastructure issues have different vocabulary — these rankings may not transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;English only.&lt;/strong&gt; All reports are in English (with minor Russian text in a few variations). For multilingual teams, bge-m3's 100+ language support would likely give it a bigger advantage over the English-focused models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU only.&lt;/strong&gt; I tested on one hardware config (Hetzner CPX42, 8 vCPU, 16GB RAM). GPU inference would change the latency rankings dramatically — qwen3's 2.7 seconds would drop to ~100ms on an RTX 4000, making it competitive with the smaller models on speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing models.&lt;/strong&gt; &lt;code&gt;qwen3-embedding:4b&lt;/code&gt; wasn't available in Ollama at test time. This is the model most likely to hit the sweet spot between quality and efficiency. I also didn't test API models (OpenAI, Cohere, Voyage) — intentionally, since the focus is self-hosted, but a comparison would add context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-encoder baseline.&lt;/strong&gt; Cross-encoder rerankers (e.g., &lt;code&gt;bge-reranker-v2-m3&lt;/code&gt;, &lt;code&gt;ms-marco-MiniLM&lt;/code&gt;) score sentence pairs directly rather than projecting to a shared vector space first, and they almost always beat bi-encoder embeddings on pairwise classification tasks like this one. 4,475 pairs is trivially small for a cross-encoder — a single rerank pass would take seconds on CPU. The natural production setup is a two-stage pipeline: bi-encoder (fast recall, embedding) → cross-encoder (accurate rerank on top-k candidates). This benchmark tests only the first stage. A cross-encoder reranked against qwen3's top-10 candidates would very likely push F1 above 0.99 on both synthetic and Bugzilla. Future work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No separator ablation.&lt;/strong&gt; I use &lt;code&gt;|&lt;/code&gt; as the field separator in embedding text (inherited from production code). I didn't test alternatives (&lt;code&gt;\n&lt;/code&gt;, &lt;code&gt;[SEP]&lt;/code&gt;, markdown headers) or structured formats (JSON, XML). This could affect results — structured payloads may help models parse field boundaries more reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lexical baselines: tuning actively hurts; stemming hurts much more.&lt;/strong&gt; I tested naive BM25F (default field weights, whitespace tokenization) and a "tuned" version (grid-searched field weights, camelCase/snake_case splitting, &lt;em&gt;no&lt;/em&gt; Porter stemming). The oracle F1 from grid-searching weights on all 4,475 pairs reaches only 0.923 — no better than default BM25F's 0.923, so tuning has no headroom to begin with. And under proper 5-fold CV (weights picked on train folds, F1 measured on the held-out fold), tuned BM25F drops to 0.872 ± 0.012 — a 5-point overfitting gap. Even 6 weight configs on ~3,580 train pairs is enough for the selection not to generalize. Plain BM25 (no field weights at all, whitespace tokenization) at 0.951 beats every BM25F variant. Adding Porter stemming on top of BM25F collapses F1 to 0.038, because "undefined" → "undefin", "CORS" → "cor", "processPayment" → "processpay" — the exact tokens that disambiguate one stack trace from another get mangled. &lt;a href="https://dl.acm.org/doi/abs/10.1145/3576042" rel="noopener noreferrer"&gt;Zhang et al. (2023)&lt;/a&gt; used Lucene-style tokenization on Mozilla/Eclipse data (plain-text bug reports), where stemming helps. On structured reports with error IDs and stack traces, standard NLP preprocessing is counterproductive.&lt;/p&gt;
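&lt;p&gt;The identifier-aware tokenization mentioned above (camelCase/snake_case splitting) can be sketched with one regex, and unlike stemming it keeps the disambiguating tokens intact:&lt;/p&gt;

```python
# Sketch: identifier-aware tokenization that splits snake_case on
# underscores and camelCase on case boundaries, without mangling
# error tokens the way Porter stemming does.
import re

def split_identifiers(text):
    text = text.replace("_", " ")
    # insert a space before each uppercase-then-lowercase boundary
    text = re.sub(r"(?=[A-Z][a-z])", " ", text)
    return text.lower().split()

print(split_identifiers("processPayment failed: cart_total undefined"))
# ['process', 'payment', 'failed:', 'cart', 'total', 'undefined']
```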

&lt;p&gt;&lt;strong&gt;No fine-tuning.&lt;/strong&gt; All models were used out of the box. Fine-tuning on bug report data (even with a small dataset) could significantly change the rankings — a fine-tuned all-minilm might outperform a generic mxbai-embed-large.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale test with real embeddings&lt;/strong&gt; — the 100K vector store runs used synthetic vectors for latency only; rerunning with real embeddings would validate HNSW recall at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add qwen3-embedding:4b&lt;/strong&gt; when available in Ollama.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU benchmark&lt;/strong&gt; — RTX 4000 SFF Ada on Hetzner GEX44 or Vast.ai spot instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning experiment&lt;/strong&gt; — fine-tune all-minilm on bug report data, compare with out-of-box bge-m3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language test&lt;/strong&gt; — add bug reports in Russian, Chinese, Spanish. Test bge-m3's multilingual advantage.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>embeddings</category>
      <category>vectorsearch</category>
      <category>devtools</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
