DEV Community: ComparEdge

Why 70% of RAG projects never reach production in 2026

Oleh Kem — Wed, 08 Jul 2026 11:14:45 +0000

The demo lies because the dataset is polite

A 50-document RAG demo is almost designed to succeed. The documents are clean, the questions are friendly, and the person demoing the system knows what to ask.

Production is not polite. Documents are duplicated, stale, scanned, badly formatted, cross-referenced, and full of tables. Users search by clause number, invoice ID, acronym, and half-remembered phrase. The model answers confidently even when retrieval missed the one paragraph that mattered.

The vector database is rarely the first thing that breaks. The pipeline around it usually breaks first.

ANN search in one pass

Vector search is approximate search. That word matters.

Flat search compares the query against every vector. It gives perfect recall, but query time grows linearly with the dataset.

HNSW builds a graph and walks toward near neighbors. It is fast and accurate, but memory-heavy. At 10 million vectors with 1536 dimensions, the raw float32 vectors alone take about 61 GB, and memory stops being a footnote.
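To put a number on the memory claim, here is a back-of-the-envelope estimate. It assumes float32 storage, 4-byte neighbor IDs, and a graph parameter M = 16 with up to 2*M links per vector at the bottom layer; real indexes add their own overhead, so treat the output as a floor, not a quote. The function names are illustrative only.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def hnsw_link_bytes(num_vectors: int, m: int = 16, bytes_per_link: int = 4) -> int:
    """Rough size of the bottom-layer neighbor lists: up to 2*M links per vector.

    Upper layers are ignored because they hold only a small fraction of nodes.
    """
    return num_vectors * 2 * m * bytes_per_link


vectors = raw_vector_bytes(10_000_000, 1536)
links = hnsw_link_bytes(10_000_000)
print(f"raw vectors: {vectors / 1e9:.1f} GB")  # raw vectors: 61.4 GB
print(f"graph links: {links / 1e9:.1f} GB")    # graph links: 1.3 GB
```

The vectors dominate, which is why quantization or lower-dimensional embeddings usually move the memory bill more than graph tuning does.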

IVF clusters the vector space and searches selected buckets. It uses less memory, but recall depends on clustering quality and how many buckets you probe.

Every production system chooses a tradeoff between recall, latency, memory, cost, and operational complexity. A missed chunk can become a wrong answer no matter how smart the LLM is.

At ComparEdge, I keep vector databases separate from general databases because the buyer is usually asking about retrieval quality, latency, metadata filtering, and RAG cost. That is a different question from "which database do we already know?"

Chunking beats database selection

The default tutorial strategy, 512-token chunks with overlap, is fine for a demo. It is dangerous for contracts, policies, invoices, API docs, and anything with tables.

A legal clause may depend on a definition in section 1.2, a condition in section 4.7, and an exception in appendix B. Fixed chunking splits the relationship. The embedding sees fragments. The retriever returns a partial answer. The LLM writes it nicely. The business trusts it.

Document-aware chunking is harder. It respects headings, tables, lists, and cross-references. It takes parsing work that nobody wants to schedule. That is why teams skip it and then spend months trying to fix accuracy by swapping databases.
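A minimal sketch of what "respects headings and cross-references" can mean in code. It assumes numbered headings such as "4.7 Sublicensing" and references written as "section 1.2" or "clause 12.4"; both patterns, and the `Chunk` type, are hypothetical. Real contracts need a proper parser, and a body line that happens to start with a number would fool this one.

```python
import re
from dataclasses import dataclass

# Assumed heading shape: a section number such as "4.7" followed by a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")
# Assumed cross-reference shape: "section 1.2" or "clause 12.4".
XREF = re.compile(r"\b(?:section|clause)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


@dataclass
class Chunk:
    section: str
    title: str
    text: str
    references: list[str]


def chunk_by_heading(document: str) -> list[Chunk]:
    """Split on numbered headings and record which sections each chunk cites."""
    chunks: list[Chunk] = []
    section, title, body = "", "", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(body).strip()
        if text:
            refs = sorted(set(XREF.findall(text)) - {section})
            chunks.append(Chunk(section, title, text, refs))

    for line in document.splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()
            section, title = match.group(1), match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = """1.2 Definitions
"Affiliate" means any entity controlled by a party.
4.7 Sublicensing
Sublicensing is prohibited except as set out in section 1.2 and clause 12.4.
"""
chunks = chunk_by_heading(doc)
print(chunks[1].references)  # ['1.2', '12.4']
```

Storing the references as metadata lets the retriever pull section 1.2 alongside 4.7 instead of hoping the embedding finds it.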

Wrong layer.

Hybrid search is not optional for real users

Embeddings are good at meaning. They are weak at exact identifiers.

"How do I terminate my subscription?" and "what are the cancellation terms?" are semantic matches. Dense retrieval works.

"Clause 7.3.2", "INV-2024-0847", "SOC2 Type II", and "customer_id 18492" need keyword matching. BM25 still earns its keep.

Hybrid retrieval combines dense semantic search with sparse exact matching. In many real workloads, pgvector plus PostgreSQL full-text search can beat a more expensive dense-only setup because it retrieves the exact thing the user asked for.
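One common way to combine the two result lists is reciprocal rank fusion. The post does not name a fusion method, so this is an assumption, and the document IDs are made up. The sketch only merges rankings; producing the dense and keyword lists is left to whatever engine you run.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["cancel-policy", "refund-faq", "clause-7.3.2"]
keyword_hits = ["clause-7.3.2", "clause-7.3.1"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused[0])  # clause-7.3.2
```

The exact clause ranks third for dense retrieval alone but first after fusion, because both retrievers agree on it.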

Database	What it does	Complexity	Main weakness
Pinecone	Managed vector search with metadata filtering	Low	Cost at scale and dense-first defaults
Weaviate	Vector plus keyword hybrid search	Medium	Self-hosting can be resource-heavy
Qdrant	High-performance vector search and filtering	Medium	Smaller ecosystem than older tools
Chroma	Embedded vector store for local/prototype use	Very low	Not built for serious horizontal scale
pgvector	PostgreSQL extension for vector similarity	Low if Postgres is already used	Performance ceiling at larger scale
MongoDB Atlas	Vector search inside a document database	Low if MongoDB is already used	Less mature ANN tuning

When pgvector is enough

Most RAG systems do not have 100 million vectors or 10,000 queries per second. They have a few hundred thousand chunks, internal users, and a Postgres database already running.

For those teams, pgvector is often the adult choice. It avoids another vendor, keeps metadata and relational filters close, and lets the team move slowly until scale proves otherwise.

Pinecone becomes easier to justify when vectors pass the 10 million range, latency requirements get strict, QPS is high, or multi-region availability matters. Weaviate is a better conversation when hybrid search and schema flexibility are central. Qdrant fits teams that care about filtering, performance, and keeping operational control closer to the engineering team.

The expensive wrong answer

A contract assistant says sublicensing is not allowed. Sales closes the deal. Six months later, legal finds the exception in appendix C, cross-referenced from clause 12.4. The system missed it because chunking split the context and retrieval returned only the restrictive clause.

The vector database did not "fail" in isolation. The product failed: no citations strong enough to inspect, no confidence threshold, no human review for high-stakes decisions, no eval set that tested cross-reference retrieval.

RAG output needs sources, confidence behavior, and escalation paths. Otherwise it is a confident intern with a nice API.

Procurement should not compare these products as if they were the same database with different logos. Pinecone pricing needs query volume, read units, namespaces, and storage growth next to it. Weaviate pricing depends on whether the team wants Cloud, self-hosting, or hybrid deployment. Qdrant pricing should be read against managed cluster size, filtering load, and whether self-hosting is realistic.

I care about how those checks are done, because vector database cost is rarely one line item. It is ingestion, re-indexing, embedding refreshes, metadata filters, backups, replicas, and the engineering time spent proving recall did not get worse.

What actually determines RAG quality

The hierarchy I see in practice:

1. Chunking and parsing
2. Embedding model choice
3. Retrieval strategy, especially hybrid search
4. Prompting and answer policy
5. Database selection

If a team spends three months comparing vector databases while using naive chunking, it is optimizing the least useful part first.

Start with documents. Build evals. Measure retrieval recall. Add hybrid search. Then pick the simplest database that meets the scale you actually have.
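"Measure retrieval recall" can start as small as this. The sketch assumes you have labeled, per question, which chunk IDs must be retrieved; the IDs and the eval cases below are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("each eval case needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


eval_set = [
    # Cross-reference case: the clause and its appendix must both come back.
    {"retrieved": ["c12.4", "c3.1", "appendix-c"], "relevant": {"c12.4", "appendix-c"}},
    # Exact-identifier case that dense retrieval missed.
    {"retrieved": ["c7.1", "c7.2", "c9.0"], "relevant": {"c7.3.2"}},
]
scores = [recall_at_k(case["retrieved"], case["relevant"], k=3) for case in eval_set]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # 0.5
```

Tracking this number per query type (semantic, identifier, cross-reference) shows which layer to fix before anyone opens a database comparison.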

Tools mentioned:

Pinecone - managed vector search
Weaviate - hybrid vector and keyword search
Qdrant - vector search and filtering
Chroma - embedded local vector store
pgvector - PostgreSQL vector extension
MongoDB Atlas - document database with vector search

Choosing an LLM API for production in 2026: not benchmarks

Oleh Kem — Tue, 07 Jul 2026 10:57:32 +0000

Leaderboards are a bad procurement tool

A model can win a benchmark and still be wrong for your production system.

Production asks less glamorous questions. How fast is the first token? What happens to P99 latency when the queue is full? Where does customer data go? How much does a workflow cost after retries, tool calls, and long prompts? Can you switch providers without rewriting three months of prompt work?

When I compare this category on ComparEdge, I treat LLM pricing as infrastructure math, not a leaderboard footnote. If nobody on the team can explain cost per workflow, I would run the case through an LLM cost calculator before the architecture settles around one provider.

TTFT and total generation are different problems

Time to first token controls whether a chat product feels alive. Total generation time controls how long a batch job or API workflow takes to finish.

Those are not the same metric. One model can start quickly and then crawl through a long answer. Another can start slowly but finish a complex response cleanly. A production system has to know which delay users actually feel.

P50 latency is the demo number. P99 is the support-ticket number. If one in a hundred requests takes eight seconds, a service handling a million requests a day produces ten thousand slow ones, and users will notice.
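A small sketch of how the two numbers diverge, using the nearest-rank percentile definition (other definitions interpolate and give slightly different values). The latency samples are synthetic: 98 fast requests and 2 slow ones.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies = [0.9] * 98 + [8.0] * 2  # seconds
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(p50, p99)  # 0.9 8.0
```

The median never sees the slow tail, which is why a dashboard built on P50 can stay green while the support queue fills up.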

Batch API is cheap until it shapes the architecture

Batch APIs are useful for document processing, nightly enrichment, and offline analysis. A 50% discount is real money once the bill is large enough.

The mistake is pretending that batch is only a cheaper endpoint. Your system now has a live path and a delayed path. When the business later asks for "the batch thing" to work in 30 seconds, you are changing prompts, retries, timeouts, monitoring, and the promise the product made to users.

For teams under roughly $20K/month in spend, batch savings can be smaller than the engineering overhead. Above that, the math may flip.
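The break-even can be sketched with two small functions. The 50% discount comes from the text above; the batchable share, build cost, and upkeep are placeholder numbers you would replace with your own.

```python
def monthly_batch_savings(monthly_spend: float, batchable_share: float,
                          discount: float = 0.5) -> float:
    """Savings from moving the batchable share of spend onto a discounted batch API."""
    if not 0 <= batchable_share <= 1 or not 0 <= discount <= 1:
        raise ValueError("shares and discounts must be between 0 and 1")
    return monthly_spend * batchable_share * discount


def payback_months(build_cost: float, monthly_savings: float, monthly_upkeep: float) -> float:
    """Months until the batch path pays for itself; infinity if it never does."""
    net = monthly_savings - monthly_upkeep
    return build_cost / net if net > 0 else float("inf")


# Hypothetical team: $20K/month spend, 30% of it tolerant of delayed results.
savings = monthly_batch_savings(20_000, batchable_share=0.3)  # about 3,000 per month
months = payback_months(build_cost=30_000, monthly_savings=savings, monthly_upkeep=1_000)
print(round(months))  # 15
```

With these assumed inputs the second path takes over a year to pay back, which is the kind of result that should be on the table before the architecture splits.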

Switching providers is not changing a URL

Lock-in usually hides in the boring places.

Prompts are tuned to a model's quirks. JSON reliability differs. System prompts behave differently. Tool calling differs. Edge cases differ. Moving 35 production prompts can easily mean weeks of engineering work.

Fine-tunes are worse. A fine-tuned model usually lives inside one provider's infrastructure. You do not carry it across the street like a database dump.

Embeddings are where a casual provider switch can turn into a project. Ten million documents embedded with one model live in that model's vector space. Moving to another embedding model means re-embedding, recalibrating thresholds, retesting retrieval, and rebuilding confidence.

Evals can lock you in too. A regression suite may encode one model's behavior as "correct." A new model can be better and still fail old tests because the old tests measured compatibility, not quality.

Long context is not a RAG replacement

Dumping 500K tokens into context feels liberating until the bill arrives. At $3 per million input tokens, that is $1.50 before the model writes a word. A thousand queries per day becomes $45K/month in input tokens alone.

A decent RAG pipeline might send 2,000 relevant tokens. That costs a rounding error by comparison.
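The arithmetic, spelled out. It uses the figures from the text ($3 per million input tokens, a thousand queries per day) and assumes a 30-day month; output tokens, caching, and retries are deliberately left out.

```python
def monthly_input_cost(tokens_per_query: int, queries_per_day: int,
                       usd_per_million_tokens: float, days: int = 30) -> float:
    """Input-token spend per month, ignoring output tokens, caching, and retries."""
    per_query = tokens_per_query / 1_000_000 * usd_per_million_tokens
    return per_query * queries_per_day * days


long_context = monthly_input_cost(500_000, 1_000, 3.0)
rag = monthly_input_cost(2_000, 1_000, 3.0)
print(f"long context: ${long_context:,.0f}/month")  # long context: $45,000/month
print(f"RAG:          ${rag:,.0f}/month")           # RAG:          $180/month
```

The ratio is simply the ratio of tokens sent per query, 250 to 1, so it holds at any price point.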

Long context is great for one-off analysis of a large document or codebase. It is usually a bad default for repeated production queries over stable data.

Data residency can eliminate options

If EU customer data must stay in the EU, many default API paths become awkward. Azure OpenAI can solve part of the GPT story through EU regions. Bedrock can change the Anthropic deployment story. Mistral has an obvious advantage for some European buyers.

The legal exposure is not theoretical. A 20% API premium can look cheap next to a GDPR complaint tied to avoidable data transfer.

The flagship model is often the wrong model

For classification, extraction, routing, and many support workflows, the expensive model is often a tax on architecture nobody designed.

Model routing is usually the practical answer. Small fast models handle simple tasks. Stronger models handle complex work. Batch handles delayed jobs. Regional routing handles data residency. A thin classifier can cut 60-80% of spend in many systems.
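A rough sketch of the routing decision itself. The tier names ("small-fast", "strong"), the region labels, and the task labels are placeholders, not real model IDs, and the classifier that produces the task label is assumed to exist upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # placeholder tier name, not a real model ID
    region: str
    batch: bool


def route_request(task: str, eu_data: bool = False, deferrable: bool = False) -> Route:
    """Pick a model tier, region, and live/batch path from simple request traits."""
    simple_tasks = {"classification", "extraction", "routing"}
    model = "small-fast" if task in simple_tasks else "strong"
    region = "eu" if eu_data else "default"
    return Route(model=model, region=region, batch=deferrable)


print(route_request("classification"))
print(route_request("contract-analysis", eu_data=True, deferrable=True))
```

Keeping the routing rules in one function makes the spend policy reviewable, instead of scattering model names across call sites.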

OpenAI API is often the first comparison point when a team wants ecosystem depth, fine-tuning, embeddings, and broad SDK support. Claude API usually enters the discussion when long-form reasoning, safer writing, or Bedrock deployment matters. Google AI Studio makes more sense when the team is already close to Gemini, Vertex AI, or Google Cloud deployment paths.

Provider	P50 latency signal	Data privacy signal	Price per 1M tokens	Lock-in risk
OpenAI API	Strong general latency	US default, Azure EU option	Varies by model	High due to fine-tuning and embeddings ecosystem
Claude API	Strong reasoning latency profile	US and UK direct, Bedrock EU option	Varies by model	Medium
Google AI Studio	Fast Gemini options	Vertex regional deployment	Varies by model	Medium, tied to GCP paths
Groq	Very low latency for open models	Region options vary	Often low	Low if using portable open models
Mistral	Strong EU positioning	EU-hosted options	Varies by model	Lower with open-weight options

Pricing changes expose lazy architecture

If your provider raises prices 40% with 30 days' notice, you will learn whether you have a provider strategy or a provider dependency.

A team with prompts, embeddings, fine-tunes, and evals tied to one provider cannot migrate in a month. It absorbs the increase and starts a three-month migration under pressure.

Before procurement signs off, I would read OpenAI API pricing with batch jobs, embeddings, cached input, fine-tuning, and eval traffic in the same spreadsheet. Claude API pricing needs the context window, Bedrock route, and reasoning latency next to it. Google AI Studio pricing belongs beside Vertex region assumptions and Gemini routing plans, not in a separate tab nobody opens.

I also care about how those pricing checks are done, because vendor pages often make clean model comparisons while production bills come from retries, failed JSON, embeddings, storage, moderation, and fallback logic.

The insurance is provider compatibility from day one: an abstraction layer, evals against at least two providers, and prompts written to survive more than one model family.
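What "an abstraction layer plus evals against two providers" might look like at its thinnest. `ChatProvider`, `FakeProvider`, and `run_eval` are hypothetical names; in practice each provider class would wrap a real SDK call, which is omitted here so the sketch runs on its own.

```python
from typing import Callable, Protocol


class ChatProvider(Protocol):
    name: str

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Stand-in for a real SDK wrapper; returns a canned reply."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self._reply = reply

    def complete(self, prompt: str) -> str:
        return self._reply


def run_eval(providers: list[ChatProvider], prompt: str,
             passes: Callable[[str], bool]) -> dict[str, bool]:
    """Run one eval case against every provider and record pass/fail per provider."""
    return {p.name: passes(p.complete(prompt)) for p in providers}


results = run_eval(
    [FakeProvider("provider-a", '{"label": "refund"}'),
     FakeProvider("provider-b", "Sure! The label is refund.")],
    prompt="Classify this ticket. Reply with JSON only.",
    passes=lambda text: text.strip().startswith("{"),
)
print(results)  # {'provider-a': True, 'provider-b': False}
```

A failing second provider in CI is cheap information; the same failure discovered during a forced migration is not.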

For a CTO, the decision is a provider strategy, not a trophy pick. Run one provider in production, keep another close enough to fail over or migrate, and test a third quarterly.

For an ML engineer, the habit to break is writing prompts that only work on one model. Clever provider-specific prompt hacks are technical debt unless they sit behind tests.

For finance, total model spend is too blunt. "We spend $47K/month" is trivia. "$0.12 per document, $0.003 per classification, $0.45 per complex analysis" is the number people can manage.
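Getting from a monthly total to a per-unit number mostly means logging a workflow name and a unit ID with every call. A minimal sketch, assuming a call log of (workflow, unit ID, cost in USD) tuples; the log format and the sample rows are invented.

```python
from collections import defaultdict


def unit_costs(calls: list[tuple[str, str, float]]) -> dict[str, float]:
    """Average cost per unit of work, counting retries against the unit they belong to."""
    totals: dict[str, float] = defaultdict(float)
    units: dict[str, set[str]] = defaultdict(set)
    for workflow, unit_id, cost_usd in calls:
        totals[workflow] += cost_usd
        units[workflow].add(unit_id)
    return {name: totals[name] / len(units[name]) for name in totals}


calls = [
    ("document", "doc-1", 0.10),
    ("document", "doc-1", 0.04),  # retry after a failed JSON parse
    ("document", "doc-2", 0.10),
    ("classification", "ticket-1", 0.003),
]
costs = unit_costs(calls)
print(f"{costs['document']:.2f}")  # 0.12
```

Dividing by distinct units rather than by calls is the point: retries raise the cost per document instead of hiding inside a larger call count.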

Tools mentioned:

OpenAI API - GPT models, fine-tuning, embeddings
Claude API - Anthropic models and Bedrock paths
Google AI Studio - Gemini API and Vertex options
Groq - low-latency inference for open models
Mistral - EU-oriented and open-weight options

Cursor vs Windsurf vs Copilot: real ROI for engineering teams

Oleh Kem — Mon, 06 Jul 2026 21:27:11 +0000

Faster typing is not the same as faster engineering

AI coding tools are good enough now that pretending otherwise is silly. They autocomplete, explain code, generate tests, refactor files, and sometimes carry a task across a repo with less hand-holding than expected.

The problem is measurement. Vendor studies usually measure task completion in clean conditions. Production engineering has old code, unclear requirements, missing tests, security constraints, and reviewers who are already overloaded.

GitClear's 2024 analysis found a 39% increase in code churn after AI coding adoption. That does not prove AI tools are bad. It does suggest teams may be writing more code that later gets rewritten or deleted. More output is not automatically more progress.

When I compare this category on ComparEdge, I separate AI coding tools by context model, workflow, pricing model, deployment constraints, and review risk. The tool that feels fastest in a demo is not always the one that saves the team time after review.

RAG inside an IDE

Every coding assistant has the same constraint: your repository is bigger than the model context.

Copilot often starts from the open file, nearby files, recent context, and repository structure. That is fast and often useful for local work. It struggles more when the answer lives three directories away.

Cursor leans harder into full-codebase indexing. It retrieves chunks from across the project and lets the engineer pin files, docs, or symbols explicitly. That helps with cross-file changes, but it also means retrieval quality becomes part of the product.

Windsurf's Cascade tries to keep a more persistent understanding of the project and current work. That can reduce repeated context setup. It can also make the workflow feel more opaque if you want strict control over what the model sees.

None of these tools understands a codebase the way a senior engineer does after two years of production incidents. They approximate understanding through retrieval, context, and pattern matching. The approximation is useful. It fails in predictable ways.

Benchmarks miss review cost

Benchmarks ask whether a tool can finish a task. Teams need to ask what happens after the tool finishes.

Does the PR get larger? Does review quality drop? Are tests meaningful? Does the tool create duplicated patterns instead of finding existing abstractions? Does it follow the old codebase's conventions, or does it import modern patterns into a system that cannot support them?

AI tools help most with greenfield work, tests, docs, boilerplate, and boring refactors. They help least when the work depends on history: why this service has a weird retry policy, why the billing system uses a strange enum, why a migration cannot run during European business hours.

Cursor, Windsurf, Copilot, and the workflow split

The split is really about workflow. Cursor makes the most sense when codebase indexing and multi-file edits are the daily job. Windsurf is more about a persistent agentic flow around the current task. GitHub Copilot is still the low-friction default for completions and IDE chat.

Once the tool starts planning, editing, testing, and retrying across files, the buyer is drifting from autocomplete into AI agents. That is where review policy matters more than the vendor's demo video.

Tool	What it is good for	Where it can disappoint
Cursor	VS Code fork with codebase indexing and multi-model support	Indexing and subscription cost matter on larger teams
Windsurf	Editor with persistent Cascade agent workflow	Less explicit context control
GitHub Copilot	Inline completions and IDE chat	Shallower cross-file context
Codeium	Free tier, completions, chat, broad IDE support	Retrieval and agent depth vary
Cline AI	Open-source agentic coding with local/cloud models	Configuration and model choice affect quality
Aider	Terminal-based git-native coding agent	CLI workflow is not for every team

AI-generated code still belongs to you

The uncomfortable ownership question is not philosophical. It is operational.

If an AI tool introduces a SQL injection vulnerability, the customer will not sue the autocomplete box. The organization shipped the code. The reviewer approved it. The process allowed it.

That means AI-generated code should be treated like untrusted input. Run SAST and DAST. Require smaller PRs, not larger ones. Apply security review to authentication, authorization, payments, data access, and API boundaries. Do not let the AI's speed outrun review capacity.

The 47-file SQL injection failure

An agent modifies 200 files for a new feature. Tests pass. Reviewers skim because the diff is large and the feature appears to work. In 47 files, the tool used string concatenation around user input instead of parameterized queries.
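The pattern in question, reduced to a few lines with Python's built-in `sqlite3` module. The table and the input string are made up; the difference between the two queries is the part reviewers need to spot in a large diff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@example.com",), ("b@example.com",)])

user_input = "nobody@example.com' OR '1'='1"

# Vulnerable: user input is spliced into the SQL text and changes the query.
unsafe_sql = "SELECT email FROM users WHERE email = '" + user_input + "'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value, so it is only ever compared as data.
safe = conn.execute("SELECT email FROM users WHERE email = ?", (user_input,)).fetchall()

print(len(leaked), len(safe))  # 2 0
```

Both versions pass a happy-path test with a normal email address, which is how 47 copies can survive a green test run.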

The root cause is not that AI is uniquely bad at security. Humans write bad code too. The root cause is mismatch: code production got faster, but review stayed the same size.

The fix is not banning AI tools. The fix is adapting the workflow around them.

What leaders should measure

The CTO should ignore lines of code generated. Measure cycle time, defect escape rate, code review load, PR size, rework, and incident count after adoption.

Engineering managers should set rules around PR size and sensitive code paths. AI can generate a large change. That does not mean the team should review it as one large change.

Procurement should also read pricing against workflow, not seats alone. Cursor pricing changes meaning if engineers rely heavily on agent loops and premium model requests. Windsurf pricing should be checked against Cascade usage and credit limits. GitHub Copilot pricing looks simple until premium requests, enterprise controls, and usage policy enter the conversation.

I care about how those pricing checks are done, because AI coding ROI can disappear quietly when the subscription is cheap but the review load, rework, and hidden usage limits grow.

Individual engineers should use the tool where it saves attention and stay skeptical where it demands judgment. Boilerplate, test scaffolds, migration drafts, and docs are good uses. Security-sensitive code deserves a slower hand.

Claude Opus 4.8: What Developers Need to Know About Anthropic's New Flagship

Oleh Kem — Thu, 28 May 2026 17:20:58 +0000

Anthropic shipped Claude Opus 4.8 today. It is priced the same as Opus 4.7, and fast mode runs at 2.5x speed while costing 3x less than fast mode on previous models. Alongside the model release: dynamic workflows in Claude Code and effort control in claude.ai.

This post covers the benchmark numbers, the practical changes for coding and agents, and what teams building on Claude should pay attention to.

Benchmark Numbers

The numbers that matter most for developers:

SWE-Bench Pro (agentic coding): Opus 4.8 = 69.2%, Opus 4.7 = 64.3%, GPT-5.5 = 58.6%, Gemini 3.1 Pro = 54.2%. A 4.9 point gain over the previous version and a 10.6 point lead over GPT-5.5.

Terminal-Bench 2.1 (agentic terminal coding): Opus 4.8 = 74.6%, GPT-5.5 = 78.2%, Gemini 3.1 Pro = 70.3%. GPT-5.5 leads this benchmark. Opus 4.8 still jumps 8.5 points over Opus 4.7's 66.1%.

OSWorld-Verified (agentic computer use): Opus 4.8 = 83.4%, GPT-5.5 = 78.7%. As a browser agent, it hits 84% on Online-Mind2Web, beating both Opus 4.7 and GPT-5.5.

Humanity's Last Exam (reasoning, with tools): Opus 4.8 = 57.9%, GPT-5.5 = 52.2%, Gemini 3.1 Pro = 51.4%.

Finance Agent v2: Opus 4.8 = 53.9%, GPT-5.5 = 51.8%. It is also the first model to break 10% on the all-pass Legal Agent Benchmark.

For cost comparisons across models and workloads, the LLM calculator on ComparEdge is useful for running specific scenarios.

What Changed for Code Quality and Tool Calling

The most relevant change for daily work: Opus 4.8 is roughly 4x less likely than Opus 4.7 to let code flaws pass unremarked. It catches its own mistakes more often, and it pushes back when a plan has problems.

Devin's team confirmed the improvements directly: "Claude Opus 4.8 uses tools cleanly and follows instructions with the consistency our autonomous engineering workloads need to keep running unattended. It improves on Opus 4.6 and fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7."

CursorBench reported that Opus 4.8 exceeds prior Opus models across every effort level, with more efficient tool calling overall.

Tom Pritchard, Staff Engineer at Shopify: "Claude Opus 4.8 has noticeably better judgment. In Claude Code, it asks the right questions, catches its own mistakes, pushes back when a plan isn't sound, and builds up confidence around complex, multi-service explorations before making big changes. It's a great model to build with."

Kay Zhu, Co-Founder and CTO: "On our Super-Agent benchmark, Claude Opus 4.8 is the only model to complete every case end-to-end, beating prior Opus models and GPT-5.5 at parity on cost."

Dynamic Workflows in Claude Code

The biggest feature launch alongside the model: dynamic workflows, available as a research preview in Claude Code. The model plans work and runs hundreds of parallel subagents in a single session. Anthropic says this enables codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge.

Available for Enterprise, Team, and Max plans.

This is particularly relevant for large refactors, framework migrations, and cross-service changes where manual orchestration of multiple Claude sessions was previously the only option.

Alignment Improvements

Misaligned behavior (deception, cooperation with misuse) is substantially lower than in Opus 4.7. Opus 4.8 scores near 1.83 on Anthropic's misalignment metric, comparable to Mythos Preview (their best-aligned model). Opus 4.7 scored 2.47. This matters for teams running autonomous agents where the model operates without constant human review.

Pricing

Same price as Opus 4.7. Fast mode at 2.5x speed, 3x cheaper than fast mode on previous models. Databricks reported 61% cheaper token cost for their Genie agent compared to Opus 4.7.

I Built a Tool to Stop Guessing LLM API Costs. Here Is What I Learned.

Oleh Kem — Thu, 28 May 2026 16:56:44 +0000

You know that moment when you check your API dashboard and the number has an extra digit you were not expecting? That is where this project started.

We were comparing models for a production pipeline, nothing exotic, just document processing, and realized we had no reliable way to answer a basic question: which model actually costs less for our workload?

So we built one: LLM Calculator. Here is what the build taught us.

The Math Problem Nobody Talks About

LLM pricing looks simple until you try to calculate it for real.

First, input and output tokens have different prices. Most models charge 2 to 5x more for output. A summarization task (lots of input, little output) and a code generation task (little input, lots of output) can have wildly different costs on the same model. The "cheapest model" depends entirely on what you are doing with it.
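A short sketch of why the ranking flips with the workload. The two models and their prices are hypothetical (USD per million tokens, with output at 5x and 2x the input price); only the blending formula is the point.

```python
def blended_cost_per_million(input_price: float, output_price: float,
                             input_share: float) -> float:
    """Cost of one million tokens when input_share of them are input tokens."""
    if not 0 <= input_share <= 1:
        raise ValueError("input_share must be between 0 and 1")
    return input_price * input_share + output_price * (1 - input_share)


# Hypothetical prices in USD per million tokens: (input, output).
models = {"model-a": (1.0, 5.0), "model-b": (2.0, 4.0)}


def cheapest(input_share: float) -> str:
    """Name of the model with the lowest blended cost at this input/output mix."""
    return min(models, key=lambda m: blended_cost_per_million(*models[m], input_share))


print(cheapest(0.9))  # model-a  (summarization: mostly input)
print(cheapest(0.1))  # model-b  (code generation: mostly output)
```

Neither model is "the cheap one"; the answer depends on where the workload sits on the input/output axis.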

Then there is batch pricing. OpenAI gives 50% off for batch API calls. If your workload can handle async, that reshuffles the entire ranking. Same story with cached pricing: Anthropic's prompt caching can cut input costs by 90% on repeated prefixes. Are you factoring that in? Most people are not.
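The two discounts can be folded into the same kind of calculation. This sketch uses the 50% and 90% figures from the paragraph above and a hypothetical $3 per million input tokens; it ignores any surcharge for writing to the cache and does not assume the discounts stack, since that varies by provider.

```python
def cached_input_cost(tokens: int, usd_per_million: float, cached_share: float,
                      cache_discount: float = 0.9) -> float:
    """Input cost when cached_share of the tokens are billed at a discounted cached rate."""
    if not 0 <= cached_share <= 1:
        raise ValueError("cached_share must be between 0 and 1")
    full = tokens / 1_000_000 * usd_per_million
    return full * (1 - cached_share) + full * cached_share * (1 - cache_discount)


def batch_cost(cost: float, batch_discount: float = 0.5) -> float:
    """Cost of the same work sent through a discounted batch endpoint."""
    return cost * (1 - batch_discount)


baseline = 1_000_000 / 1_000_000 * 3.0                  # 3.00
with_cache = cached_input_cost(1_000_000, 3.0, 0.8)     # about 0.84
with_batch = batch_cost(baseline)                       # 1.50
print(f"{baseline:.2f} {with_cache:.2f} {with_batch:.2f}")  # 3.00 0.84 1.50
```

With an 80% repeated prefix, caching beats batch here, which is the kind of reshuffle a flat price table never shows.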

Now multiply this across 16 providers and 110+ models: OpenAI, Anthropic, Google, DeepSeek, Groq, Mistral, Meta, Cohere, Together, Perplexity, xAI, Fireworks, Replicate, AI21, Cloudflare, Amazon Bedrock. Prices change constantly. Your mental model of "GPT-4o costs about X" is probably already outdated.

What We Built

A free LLM token cost calculator at comparedge.com/llm-calculator, part of ComparEdge (independent, no vendor affiliations).

Feature tour, dev-to-dev:

Input/output ratio slider. Drag to match your workload profile. Rankings reshuffle in real time. This single feature changed more model decisions than anything else in our testing.

Batch and cache toggles. One click each. Toggle batch pricing for async-tolerant workloads, cached pricing for repeated-prefix scenarios. The cost landscape changes dramatically.

Stack and Compare mode. Pick up to 5 models, see them side-by-side with pricing, context windows, and cost per million tokens for your specific ratio. The "final boss" view for making a decision.

Budget filter. Set a ceiling. Everything over it disappears. Useful when you need to narrow 110 options fast.

10 export formats. PDF and CSV, sure. But also: LiteLLM JSON (for proxy configs), OpenRouter JSON, Python Dict, .env Snippet, Cursor Rules, Markdown, HTML, Plain Text. The output should drop into your actual workflow.

What We Learned Building This

Pricing data is a moving target. We thought the hard part would be the UI. It was not. It was keeping pricing accurate across 16 providers who update at different times, in different formats, with different definitions of what a token even means. Maintenance is the real product.

"Cheapest" is the wrong question. The right question is: cheapest for my specific input/output ratio, with or without batch/cache, within my context window requirements. That is a much harder question, but it is the one that actually saves money.

People do not want more data; they want fewer options. Early versions showed everything. Users were overwhelmed. The budget filter and compare mode exist because people need to go from 110 models to 3 candidates fast.

The Forecasting Problem

Here is what we have not solved yet: predicting future costs.

We are working on a forecasting mode combining growth multiplier, agent overhead, and Pareto concentration factor. The agent overhead part is the tricky bit. Agentic workflows multiply token consumption in ways that are hard to model because the agent decides how many calls to make.

We do not want to ship a forecasting tool that is just "multiply current cost by a number you pick." That is a spreadsheet. We want something that accounts for how LLM usage actually scales. Still in progress.

Try It

Free at comparedge.com/llm-calculator. PDF export works without an account. If you use it and have feedback, especially on what export formats are missing or what the compare mode gets wrong, I would genuinely like to hear it in the comments.