DEV Community: embeddings

Embeddings Magic

Boussaden Taha — Sat, 27 Jun 2026 10:37:21 +0000

Transforming language into geometry.

Introduction

Embeddings are one of the most important building blocks of modern AI applications, yet they're often treated as a black box.

In this article, I'll demystify embeddings by exploring what they are, how they are created, and why they make semantic search possible.

The Problem With Traditional Search

Imagine searching for the phrase:

"How do I reset my password?"

A traditional keyword search looks for exact or similar words. If a document instead says:

"Steps to recover your account credentials"

the search may fail because the wording is completely different.

Humans immediately recognize that both sentences describe the same intent. Computers, on the other hand, need a different way to represent meaning, and this is where embeddings come in.

What Is an Embedding?

An embedding is a dense vector: a list of numbers that represents the semantic meaning of a piece of text. Put more simply, it is an array of numerical values, usually floating-point numbers, in which almost every position holds meaningful information.

Instead of treating text as a sequence of characters or words, an embedding model maps it into a high-dimensional vector space.

For example:

"cat"
↓

[0.18, -0.42, 0.91, ...]

The numbers themselves have no intuitive meaning.

What matters is where the vector is located relative to other vectors.

Meaning Comes From Position

Imagine a map where cities that are geographically close tend to share borders, climates, and transportation links.

Embeddings work similarly: texts with similar meanings are placed near one another in vector space.

For example:

Dog
      ●

Cat
      ●

Puppy
       ●


Car                         ●

Engine                       ●

Truck                          ●

The actual space may have hundreds or thousands of dimensions instead of two, but the intuition remains the same: distance represents semantic similarity.

Similar Meaning, Different Words

This is where embeddings show their strength.

Consider the sentences below:

"Reset my password"
"Recover my account"
"Can't log in"
"Forgot my credentials"

They share very few keywords, yet an embedding model places them close together because they express similar ideas.

This enables semantic search, where results are retrieved based on meaning rather than exact wording.

How Similarity Is Measured

Once text has been converted into vectors, we need a way to compare them. The most common metric is cosine similarity.

Rather than comparing the individual numbers, cosine similarity measures the angle between two vectors.

Small angle → highly similar
Large angle → less similar

This works surprisingly well because embedding models are trained to organize semantically related content in nearby regions of the vector space.
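As a rough illustration of the angle idea, here is a minimal sketch in plain Python. The three 2-D vectors are hand-picked, hypothetical "embeddings"; a real model would produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between a and b, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))

# Hypothetical 2-D vectors, chosen by hand for illustration only.
cat = [0.9, 0.1]
kitten = [0.8, 0.2]
truck = [0.1, 0.9]

print(angle_degrees(cat, kitten))  # small angle: similar direction
print(angle_degrees(cat, truck))   # large angle: different direction
```

Only the direction of each vector matters here, so scaling a vector up or down does not change the result.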

Why Embeddings Matter for RAG

Retrieval-Augmented Generation (RAG) depends heavily on embeddings. A typical pipeline looks like this:

Documents
    │
    ▼
Embedding Model
    │
    ▼
Vectors Stored in a Vector Database
    │
    ▼
User Query
    │
    ▼
Query Embedding
    │
    ▼
Similarity Search
    │
    ▼
Relevant Documents
    │
    ▼
LLM

Notice something important:

The LLM never searches your documents directly. Instead, the retrieval step searches the embedding space for documents whose vectors are closest to the query, and only those documents are passed to the LLM.

Conclusion

This article only scratches the surface of how these "numerical representations of text" work. Understanding embeddings is essential for anyone building LLM applications because they power everything from document retrieval to recommendation systems.

The real power of embeddings is not in storing vectors but in organizing them, and that is what makes them so effective.

Cornell Notes on Context Layer

Kat Padilla — Tue, 23 Jun 2026 14:50:34 +0000

An LLM can only reason about information inside its context window. It doesn't automatically know my docs, APIs, databases, Jira tickets or company data. Something has to decide what information gets shown to the model.

💡

That is the Context Layer.

Core Concepts

Cue	Notes
Context Layer?	My current mental model is that the Context Layer is responsible for deciding what information enters the model's working memory.
Context Package?	The final collection of facts, documents, relationships, and tool results that gets sent to the model. Ideally only the information needed to answer the question.
Why not send everything?	More context isn't automatically better. It increases cost, latency and noise. Relevance matters more than volume.
Context Engineering?	The practice of deciding what information should enter the context window, what should be excluded, how sources should be ranked and how information should be structured.
Where can context come from?	Documents, databases, knowledge graphs, APIs, tools, memory systems, and other knowledge sources.
What's RAG?	Retrieval-Augmented Generation. Instead of stuffing everything into context, retrieve only the relevant pieces when needed.
What problem does RAG solve?	Large collections of information that won't fit inside the context window.
Typical RAG sources	Confluence pages, PDFs, Jira tickets, documentation, wikis, and databases.
What are embeddings?	Numerical representations of meaning. Similar concepts end up near each other in vector space.
Why do embeddings matter?	Most semantic search and RAG systems depend on embeddings to find relevant information.
Vector database?	A database optimized for storing and searching embeddings.
Knowledge Graph?	A system that stores relationships between entities, not just the entities themselves.
Mental model for Knowledge Graphs	A database tells me a customer exists. A Knowledge Graph tells me how that customer relates to campaigns, products, segments, and other customers.
What problem does a Knowledge Graph solve?	Understanding how things connect and influence one another.
Is a Knowledge Graph the same as RAG?	No. RAG retrieves information. Knowledge Graphs model relationships.
Can RAG and Knowledge Graphs work together?	Yes. A Knowledge Graph can simply be another source used by the Context Layer.
What are tools?	Systems the model can call when it needs information it doesn't already have.
Examples of tools	Snowflake, Jira, GitHub, CRM systems, internal APIs and databases.
What is MCP?	Model Context Protocol. A standard way for AI systems to discover and interact with tools.
Why use tools instead of storing everything in context?	Some information changes constantly. It's usually better to fetch it on demand than try to keep it in memory.
Examples of tool usage	Current revenue, latest churn rate, open incidents, deployment status, customer profile lookups.
Separation of concerns	Facts should remain factual. Relationships belong in graphs. Live values come from tools. Interpretation happens in the model.

Diagram

Knowledge Sources
    │
    ├── Documents
    ├── Databases
    ├── Knowledge Graphs
    ├── APIs
    └── Tools
            ↓
      Context Layer
            ↓
      Context Package
            ↓
      Context Window
            ↓
            LLM
            ↓
         Answer

Summary

Context Layer is the orchestrator.
RAG retrieves information.
Knowledge Graphs provide relationships.
Tools provide real-time data.
The Context Layer decides what enters the model's working memory.
The LLM reasons over whatever makes it through.
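To make "deciding what enters the model's working memory" concrete, here is a minimal sketch of one possible selection step: rank candidates by relevance and keep what fits a token budget. The `Candidate` type, the relevance scores, and the token counts are all hypothetical; a real Context Layer would get them from retrievers, rankers, and a tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str       # e.g. "docs", "graph", "tool"
    text: str
    relevance: float  # hypothetical score from a retriever or ranker
    tokens: int       # estimated token count of `text`

def build_context_package(candidates, token_budget):
    """Greedily keep the most relevant candidates that fit the budget."""
    package, used = [], 0
    for c in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if used + c.tokens <= token_budget:
            package.append(c)
            used += c.tokens
    return package

candidates = [
    Candidate("docs", "Password reset steps", 0.92, 300),
    Candidate("tool", "Open incidents: 2", 0.75, 50),
    Candidate("docs", "Company holiday schedule", 0.10, 400),
    Candidate("graph", "Customer -> Segment: Enterprise", 0.60, 80),
]
package = build_context_package(candidates, token_budget=450)
print([c.source for c in package])
```

The low-relevance item is left out even though it is a valid document, which is the "relevance matters more than volume" point from the notes.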

Claude API Semantic Search: Embeddings Alternatives & RAG

Sangmin Lee — Mon, 22 Jun 2026 01:30:09 +0000

Originally published at claudeguide.io/claude-api-semantic-search

Claude API for Semantic Search: Embeddings Alternatives and RAG Patterns

Claude doesn't offer a native embeddings API, but Anthropic's recommended partner Voyage AI provides embeddings optimized for Claude — and combining Voyage embeddings with Claude's generation creates a powerful semantic search and RAG (Retrieval-Augmented Generation) pipeline that outperforms single-vendor solutions by 15-20% on retrieval accuracy benchmarks. This guide covers the full architecture: embedding, indexing, retrieval, and generation.

For model selection and cost trade-offs, see Haiku vs Sonnet vs Opus.

Architecture Overview

Query → Voyage AI (embed) → Vector DB (search) → Top-K docs → Claude (generate answer)

Component	Recommended	Alternative
Embeddings	Voyage AI `voyage-3`	OpenAI `text-embedding-3-small`
Vector DB	Pinecone / pgvector	Qdrant / Weaviate / ChromaDB
Generation	Claude Sonnet	Claude Haiku (for cost)
Reranker	Voyage `rerank-2`	Cohere Rerank

Step 1: Generate Embeddings with Voyage AI

import voyageai

vo = voyageai.Client()  # Uses VOYAGE_API_KEY env var

# Embed documents (batch)
documents = [
    "Claude API supports streaming responses via SSE",
    "Prompt caching reduces costs by up to 90%",
    "Tool use enables function calling with type safety",
]

doc_embeddings = vo.embed(
    documents,
    model="voyage-3",
    input_type="document"
).embeddings

# Embed a query
query_embedding = vo.embed(
    ["How do I reduce Claude API costs?"],
    model="voyage-3",
    input_type="query"
).embeddings[0]

Cost: Voyage AI voyage-3 costs $0.06 per 1M tokens — embedding 10,000 documents of ~500 tokens each costs approximately $0.30.
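The cost estimate above is simple arithmetic, sketched here so it can be re-run with other corpus sizes. The function name is hypothetical, and the price is the per-million-token figure quoted above, which may change.

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_usd):
    """Estimated one-off cost of embedding a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_usd

# 10,000 documents of ~500 tokens each at $0.06 per 1M tokens.
cost = embedding_cost_usd(10_000, 500, 0.06)
print(f"${cost:.2f}")
```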

Step 2: Store in a Vector Database

pgvector (PostgreSQL)

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1024),  -- voyage-3 dimensions
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

import psycopg2

conn = psycopg2.connect("postgresql://...")
cur = conn.cursor()

for doc, emb in zip(documents, doc_embeddings):
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (doc, emb)
    )
conn.commit()

Pinecone

from pinecone import Pinecone

pc = Pinecone()
index = pc.Index("claude-docs")

vectors = [
    {"id": f"doc-{i}", "values": emb, "metadata": {"text": doc}}
    for i, (doc, emb) in enumerate(zip(documents, doc_embeddings))
]
index.upsert(vectors=vectors)

Step 3: Retrieve and Generate with Claude

import anthropic

def semantic_search_and_answer(query: str, top_k: int = 5) -
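The listing above is cut off in the source after the function signature. As a stand-in, here is a minimal sketch of what a retrieve-then-generate function could look like. It is not the author's implementation: the embedding, search, and generation steps are passed in as plain callables (`embed_query`, `search`, `generate` are hypothetical names) so the control flow can be read and tested without any API keys.

```python
from typing import Callable, Sequence

def semantic_search_and_answer(
    query: str,
    embed_query: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Embed the query, fetch the top_k documents, and ask the model to answer."""
    query_vector = embed_query(query)
    docs = search(query_vector, top_k)
    # Number the documents so the model can refer to them.
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

In a real pipeline, `embed_query` would presumably wrap the `vo.embed(..., input_type="query")` call from Step 1, `search` would query pgvector or Pinecone, and `generate` would wrap a Claude Messages API call.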

Advanced: Hybrid Search

Combine vector similarity with keyword search for better results:

def hybrid_search(query: str, top_k: int = 10) -
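The `hybrid_search` listing is also cut off in the source, so the author's merging strategy is unknown. One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched below as an assumption rather than a reconstruction. The document ids are made up, and `k=60` is a commonly used default for the RRF constant.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ordered[:top_k]

# Hypothetical results from a vector search and a keyword search.
vector_hits = ["doc-3", "doc-1", "doc-7"]
keyword_hits = ["doc-1", "doc-9", "doc-3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits], top_k=3)
print(fused)
```

RRF only needs ranks, not raw scores, which avoids having to normalize cosine similarities against keyword scores.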

What embeddings are, explained by building one

I Want To Learn Programming — Sun, 14 Jun 2026 14:00:07 +0000

Embeddings are behind search, recommendations, and most of modern AI, and they are usually explained with intimidating diagrams. The core idea is simple and worth building yourself: an embedding turns a thing (a word, a document, a product) into a list of numbers (a vector) so that similar things end up close together in that number space.

Why turn things into vectors

Computers cannot compare meaning directly, but they can compare vectors. If "king" and "queen" are nearby points, and "king" and "banana" are far apart, then "closeness of vectors" becomes a usable stand-in for "similarity of meaning." Once your items are vectors, search and recommendation become geometry: find the nearest points.

Measuring closeness

The standard measure is cosine similarity, the cosine of the angle between two vectors. Identical direction scores 1, unrelated (perpendicular) vectors score near 0, and opposite direction scores -1.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (na * nb)

With just this, you can already build a tiny semantic search: embed your documents, embed the query, and return the documents with the highest cosine similarity.

A first embedding you can build by hand

You do not need a neural network to feel the idea. A simple bag-of-words vector already places similar documents near each other:

def embed(text, vocab):
    counts = {w: 0 for w in vocab}
    for w in text.lower().split():
        if w in counts:
            counts[w] += 1
    return [counts[w] for w in vocab]

Two documents about the same topic share words, so their vectors point in a similar direction, so cosine similarity is high. That is the whole mechanism, in miniature. Real embeddings (word2vec, or the ones inside large language models) learn far richer vectors where direction captures meaning, not just word overlap, but the principle is identical: similar things, nearby vectors.
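Putting the two pieces together, here is a small sketch of the "tiny semantic search" described earlier. The three documents and the `search` helper are made up for illustration; the vocabulary is simply every word that appears in the documents.

```python
import math

def embed(text, vocab):
    """Bag-of-words vector: one count per vocabulary word."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "reset your password from the login page",
    "the engine of the truck needs oil",
    "forgot password use the reset link",
]
vocab = sorted({w for d in docs for w in d.split()})

def search(query, top_k=2):
    """Return the top_k documents by cosine similarity to the query."""
    q = embed(query, vocab)
    scored = [(cosine(q, embed(d, vocab)), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True)[:top_k]]

print(search("password reset"))
```

This version only matches shared words, so a query like "can't log in" would miss the password documents entirely. That gap is what learned embeddings close.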

Why building it matters

Once you have built a vector space and searched it by cosine similarity, the buzzwords resolve: a "vector database" is a store of these vectors with fast nearest-neighbor search; "semantic search" is exactly what you just did; retrieval for AI is embedding your documents and finding the closest ones to a question. You will understand the systems instead of trusting them.

Build it for real

The AI and Deep Learning track builds embeddings from scratch, from counting vectors to learned representations and the attention that powers transformers, all graded in your browser. The first project is free.

Turn things into vectors, and a huge amount of modern AI becomes geometry you can reason about.

Memory and State in Claude Agents: Patterns That Scale

Sangmin Lee — Sat, 13 Jun 2026 01:31:36 +0000

Originally published at claudeguide.io/claude-agent-memory-patterns

Memory and State in Claude Agents: Patterns That Scale

Claude agents don't have persistent memory between API calls — each call starts fresh. Adding memory means deciding what to store, where to store it, and how much to bring back into context on the next call. The four patterns that cover 90% of production needs are: conversation history (in-context), summary compression (compressed context), external memory (vector search), and explicit state (structured data). This guide covers when to use each and how to implement them.

The Memory Problem

# Call 1
client.messages.create(messages=[{"role": "user", "content": "My name is Alex"}])
# Claude: "Hi Alex!"

# Call 2 — Claude has no memory of call 1
client.messages.create(messages=[{"role": "user", "content": "What's my name?"}])
# Claude: "I don't know your name."

Every conversation must carry its own context. The question is how much and in what form.

Pattern 1: Full Conversation History (In-Context)

The simplest approach — append every turn to a running messages list.


import anthropic

client = anthropic.Anthropic()


class ConversationAgent:
    """Agent that maintains full conversation history in context."""

    def __init__(self, system: str, max_turns: int = 50):
        self.system = system
        self.messages = []
        self.max_turns = max_turns

    def chat(self, user_message: str) -
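The class above is truncated in the source at the `chat` signature. The following is a minimal sketch of how the full-history pattern could be completed, not the author's code. The model call is injected as a hypothetical `send(system, messages)` callable so the history handling can be run without an API key, and the trimming rule (keep the last `max_turns` user/assistant pairs) is an assumption.

```python
from typing import Callable

class ConversationAgent:
    """Keeps the message history and replays it on every call."""

    def __init__(self, system: str, send: Callable[[str, list], str], max_turns: int = 50):
        self.system = system
        self.send = send  # hypothetical callable: (system, messages) -> reply text
        self.messages = []
        self.max_turns = max_turns

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        reply = self.send(self.system, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        # One turn is a user message plus the assistant reply; drop the oldest
        # turns in pairs so the history always starts with a user message.
        excess = len(self.messages) - 2 * self.max_turns
        if excess > 0:
            self.messages = self.messages[excess:]
        return reply
```

With the Anthropic SDK, `send` would typically wrap `client.messages.create(...)` with the `system` and `messages` arguments and return the text of the first content block.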


Build a RAG Pipeline From Scratch (Production Patterns That Actually Matter)

Umesh Malik — Fri, 12 Jun 2026 20:19:37 +0000

Most RAG tutorials stop at "embed your docs, do a similarity search, stuff the results in a prompt." That gets you a demo. It does not get you something that gives correct, grounded answers on real data — and the gap between those two is where all the actual engineering lives.

A RAG pipeline is a series of stages, and a weak link in any one of them caps the quality of the whole thing. You can have a frontier model and a beautiful prompt, and still ship garbage if your chunking is wrong. So this is the pipeline end to end, with the production patterns that decide whether it works — not just the happy-path demo.

If you're still deciding whether RAG is even the right tool versus fine-tuning, read RAG vs Fine-Tuning for LLMs first. This post assumes you've decided to retrieve.

TL;DR

RAG is a pipeline: ingest → chunk → embed → store → retrieve → generate. The output is only as good as the weakest stage.
Retrieval quality is everything. Most "the LLM hallucinated" bugs are actually "the right chunk never got retrieved" bugs.
Chunk on meaning, not character counts. Semantic boundaries plus light overlap beat fixed-size splits.
Don't rely on vector search alone. Hybrid (keyword + vector) retrieval with a reranker is the production default.
Ground the generation. Pass only retrieved context, require citations, and refuse when context is thin.
You can't improve what you don't measure. Build a retrieval eval before you tune anything.

The pipeline, stage by stage

1. Ingestion

Load your sources and clean them before anything else. Strip boilerplate, nav chrome, and duplicated headers/footers. Garbage in here propagates through every downstream stage and you'll never trace the bad answer back to it. Preserve structure — headings, lists, tables — because that structure is what makes good chunking possible.

2. Chunking — where most pipelines quietly fail

Chunking is the highest-leverage, most-underrated stage. The naive move is to split every document into fixed 500-character windows. Don't. Fixed-size splitting severs sentences and merges unrelated ideas, and then retrieval surfaces fragments that don't mean anything on their own.

Instead:

Split on semantic boundaries — headings, paragraphs, list items. Respect the document's own structure.
One idea per chunk. A chunk should be retrievable and self-contained.
Add light overlap so context isn't cut mid-thought between adjacent chunks.
Attach metadata to every chunk: source, title, section, date, URL. You'll use it for filtering and citations.

chunk = {
  id, text,
  metadata: { source, title, section, url, date }
}
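As one possible starting point, here is a sketch of paragraph-based chunking with light overlap that produces records in the shape shown above. The blank-line paragraph split, the naive `". "` sentence split, and the id format are assumptions chosen for brevity; real documents usually need a structure-aware parser.

```python
def chunk_document(text, metadata, overlap_sentences=1):
    """Split on blank lines (paragraphs) and prepend light overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs):
        body = para
        if i > 0 and overlap_sentences > 0:
            # Carry over the last sentence(s) of the previous paragraph so the
            # chunk does not start mid-thought. Naive split on ". ".
            prev_sentences = paragraphs[i - 1].split(". ")
            tail = ". ".join(prev_sentences[-overlap_sentences:])
            body = f"{tail}\n{para}"
        chunks.append({
            "id": f"{metadata['source']}#{i}",
            "text": body,
            "metadata": dict(metadata),  # copy so chunks do not share one dict
        })
    return chunks

doc = "Intro one. Intro two.\n\nSecond paragraph."
chunks = chunk_document(doc, {"source": "guide.md", "title": "Guide"})
print(chunks[1]["text"])
```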

💡 Key insight: If retrieval is bad, fix chunking before you touch the model or the prompt. The retriever can only find what chunking made findable.

3. Embedding

Turn each chunk into a vector with an embedding model. Two rules that save pain later:

Embed the same way at index time and query time. Same model, same preprocessing. A mismatch silently wrecks relevance.
Version your embeddings. When you change the embedding model, you must re-embed the whole corpus. Track which model produced which vectors so you know when a reindex is due.
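One lightweight way to track which model produced which vectors is to store a version key next to each vector. The sketch below is an assumption about how that could look; the key format and the `preprocessing_version` label are hypothetical.

```python
import hashlib

def embedding_key(chunk_text: str, model: str, preprocessing_version: str) -> str:
    """Stable id for 'this text, embedded by this model, with this preprocessing'."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return f"{model}:{preprocessing_version}:{digest}"

def needs_reembedding(stored_key: str, chunk_text: str, model: str,
                      preprocessing_version: str) -> bool:
    """True when the text, the model, or the preprocessing has changed."""
    return stored_key != embedding_key(chunk_text, model, preprocessing_version)

key = embedding_key("hello world", "model-a", "v1")
print(needs_reembedding(key, "hello world", "model-b", "v1"))
```

The same key doubles as a cache key: unchanged chunks keep their key, so they can be skipped on the next indexing run.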

4. Storage

Store vectors in an index that does fast similarity search with metadata filtering. You don't necessarily need a dedicated vector database — pgvector on the Postgres you already run handles a surprising amount before a specialized store (Qdrant, Weaviate, Pinecone) earns its keep.

What actually matters: filtering. "Search only this customer's docs" or "only documents from the last year" is a metadata WHERE clause combined with vector similarity. Without it, retrieval leaks across boundaries it shouldn't.
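The idea can be sketched in memory: apply the metadata filter first, then rank only the rows that survive. The row layout and the `tenant_id` field are hypothetical; in a real store the filter would be a WHERE clause or the vector database's filter parameter.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(query_vector, rows, tenant_id, top_k=3):
    """Filter by metadata first, then rank the survivors by similarity."""
    allowed = [r for r in rows if r["metadata"].get("tenant_id") == tenant_id]
    allowed.sort(key=lambda r: cosine(query_vector, r["embedding"]), reverse=True)
    return allowed[:top_k]

rows = [
    {"id": "a1", "embedding": [1.0, 0.0], "metadata": {"tenant_id": "acme"}},
    {"id": "b1", "embedding": [0.9, 0.1], "metadata": {"tenant_id": "globex"}},
    {"id": "a2", "embedding": [0.0, 1.0], "metadata": {"tenant_id": "acme"}},
]
results = filtered_search([0.9, 0.1], rows, tenant_id="acme")
print([r["id"] for r in results])
```

Row `b1` is the closest match to the query but belongs to another tenant, so it never appears in the results.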

5. Retrieval — go hybrid, then rerank

This is the stage that most separates a demo from a product.

Vector search alone is not enough. Embeddings are great at semantic similarity and bad at exact matches — error codes, product SKUs, proper nouns, acronyms. Keyword search (BM25) is the opposite. Hybrid retrieval runs both and merges the results, so you catch both "what they meant" and "the exact term they typed."

Then rerank. Initial retrieval optimizes for recall — pull a generous candidate set (say, top 20). A cross-encoder reranker then scores those candidates against the query far more precisely and keeps the top handful you'll actually pass to the model. Retrieve broad, rerank narrow.

candidates = vectorSearch(q, k=20) ∪ keywordSearch(q, k=20)
top = rerank(q, candidates)[:5]

6. Grounded generation

Now — and only now — the LLM. The job here is to keep it honest:

Pass only the retrieved context. Don't let the model fall back on parametric memory for facts it should be reading.
Require citations. Ask it to cite the chunk/source for each claim. Citations are both a UX feature and a hallucination check.
Give it permission to say "I don't know." If the retrieved context doesn't answer the question, the correct output is a refusal, not a confident guess. Tell it that explicitly.

System: Answer ONLY from the context below. Cite sources by id.
If the context doesn't contain the answer, say you don't know.

Context:
[1] {chunk_1}
[2] {chunk_2}
...

Question: {user_query}
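A small helper can render retrieved chunks into the template above. This is a sketch: the chunk shape (`{"text": ...}`) follows the chunk record from the chunking section, and the placeholder used when nothing was retrieved is an assumption.

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the context below. Cite sources by id.\n"
    "If the context doesn't contain the answer, say you don't know."
)

def build_grounded_prompt(chunks, user_query):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    if chunks:
        context = "\n".join(
            f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
        )
    else:
        # Make thin context explicit so the refusal instruction applies.
        context = "(no relevant context found)"
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {user_query}"

prompt = build_grounded_prompt(
    [{"text": "Refunds take 5 days."}, {"text": "Refunds go to the original card."}],
    "How long do refunds take?",
)
print(prompt)
```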

The patterns that separate prod from demo

Hybrid + rerank, not bare vector search. The single biggest quality jump.
Metadata filtering for security and scoping — never retrieve across tenant or permission boundaries.
Citations and refusal wired into the prompt, so wrong answers become "I don't know" instead of confident fiction.
Caching. Cache embeddings (don't re-embed unchanged chunks) and cache answers to repeated queries.
A retrieval eval set. A fixed set of question → expected-source pairs you can score on every change.
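The eval set in the last item can be scored with recall@k: the fraction of questions whose expected source shows up in the top k results. A minimal sketch follows, assuming a hypothetical `retrieve(question, k)` callable that returns a list of source ids.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected_source in eval_set:
        if expected_source in retrieve(question, k):
            hits += 1
    return hits / len(eval_set)

# Hypothetical eval set and a stub retriever standing in for the real one.
eval_set = [
    ("How do I reset my password?", "auth.md"),
    ("What is the refund window?", "billing.md"),
]

def stub_retrieve(question, k):
    index = {"How do I reset my password?": ["auth.md", "faq.md"]}
    return index.get(question, ["faq.md"])[:k]

print(recall_at_k(eval_set, stub_retrieve, k=5))
```

Running this on every change turns "it looks better" into a number that can regress visibly.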

Common mistakes

Fixed-size chunking. The default that quietly caps your ceiling. Chunk on meaning.
Vector-only retrieval. You'll miss exact-match queries every time. Add keyword search.
No reranking. Stuffing the raw top-k into the prompt wastes context on near-misses.
Tuning the prompt to fix a retrieval problem. If the right chunk isn't retrieved, the prompt is irrelevant. Diagnose retrieval first.
No evaluation. "It looks better" isn't a metric. Without an eval set you're guessing, and you'll regress silently.

Best practices

Measure retrieval separately from generation. Most failures are retrieval failures; isolate them. Track recall on your eval set.
Chunk on structure, then iterate. Start with semantic boundaries and light overlap; adjust based on retrieval scores.
Default to hybrid + rerank. Treat it as the baseline, not an optimization.
Filter by metadata for scope and security. Especially in multi-tenant systems.
Force grounding and citations. Answer only from context; cite; allow "I don't know."
Re-embed on model change. Version vectors so you know when a reindex is required.

Conclusion

RAG isn't one trick — it's a pipeline, and quality is set by its weakest stage. Get chunking right, retrieve hybrid and rerank, ground the generation, and measure retrieval so you're improving the right thing. Do that and you cross the line from impressive demo to a system people can trust with real questions.

Skip the engineering — relying on naive chunking and bare vector search — and you'll ship something that demos well and fails the moment real users ask real questions.

Go deeper across LLM Engineering — RAG, Fine-Tuning & Production LLMs, revisit the RAG vs Fine-Tuning decision framework, or explore AI Coding Agents for the agentic side of LLM systems.


Originally published at umesh-malik.com


LLM Response Caching: Redis, Semantic Cache, and 40–70% Savings on API Costs

Promptra Team — Wed, 10 Jun 2026 09:05:02 +0000

An LLM API is the most expensive dependency in the stack. On an FAQ bot handling 100K requests per day, Claude Opus 4.7 (350/1790 ₽ per 1M tokens) runs to 250–400 thousand ₽ per month. Caching can recover half of that: exact repeats often make up 20–40% of requests and close paraphrases another 20–30%, so 40–70% of requests should never reach the model at all. Through the unified Promptra gateway, prompt caching from Anthropic and OpenAI is passed through unchanged, which additionally cuts 60–80% of input cost for agents with a long system prompt.

This guide covers three levels of caching with ready-to-use code: an in-memory LRU as a micro-cache for the last N requests in a single process, a Redis exact-match cache with TTL as a distributed cache shared between workers, and a semantic cache built on embeddings and Qdrant for semantically similar requests. It includes real cost-savings benchmarks, invalidation, and protection against false hits. Payment is in rubles under contract, with a full set of closing documents.

TL;DR: three cache levels

Level	Where	Hit rate	When
L1 in-memory LRU	Process RAM	5–15%	Micro-cache of hot requests, single worker
L2 Redis exact-match	Distributed	15–35%	Exact repeats across workers
L3 semantic cache	Qdrant + embeddings	25–45% on top of L2	Paraphrased questions, FAQ bots
Bonus prompt cache	Provider side	60–80% input savings	Agents with a long system prompt

In total, with a good combination, 50–70% of requests never reach the flagship model.

L1: in-memory LRU for hot requests

The simplest level is a cache in the RAM of a single worker, via functools.lru_cache or its asyncio variant. It suits small single-process microservices. This article is a production-oriented extension of our pillar guide, the complete technical guide to LLM APIs in Python: tokens, function calling, streaming, RAG, async/batch.

from functools import lru_cache
import hashlib
import json

from openai import OpenAI

# Same client setup as in the L2 example below.
client = OpenAI(api_key="sk-promptra-...", base_url="https://api.promptra.ru/v1")

def messages_hash(messages: list[dict], model: str) -> str:
    """Stable hash of messages and model."""
    payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

@lru_cache(maxsize=1024)
def _cached_completion(key: str, model: str, messages_json: str) -> str:
    """Internal cache keyed by the hash. Returns the JSON-serialized response."""
    messages = json.loads(messages_json)
    response = client.chat.completions.create(model=model, messages=messages)
    return json.dumps({
        "content": response.choices[0].message.content,
        "usage": response.usage.model_dump(),
    })

def llm_with_l1(messages: list[dict], model: str) -> dict:
    key = messages_hash(messages, model)
    return json.loads(_cached_completion(key, model, json.dumps(messages, sort_keys=True)))

Pros: a one-line decorator, no external dependencies. Cons: the cache lives in a single process, is lost on restart, and does not scale to 10+ workers. It suits CLI tools, dev mode, and small pet projects.

L2: Redis exact-match with TTL

The basic production cache: the request hash is the key, the serialized response is the value, and the TTL depends on the content type. It works across all workers.

import hashlib
import json
import redis.asyncio as redis
from openai import AsyncOpenAI

r = redis.from_url("redis://localhost:6379/3")
# Async client, so the LLM call does not block the event loop.
client = AsyncOpenAI(api_key="sk-promptra-...", base_url="https://api.promptra.ru/v1")

CACHE_TTL_BY_TYPE = {
    "faq": 86400,        # 24 hours for static FAQs
    "summary": 14400,    # 4 hours for summaries
    "agent": 1800,       # 30 minutes for agent responses
    "news": 300,         # 5 minutes for news
}

def cache_key(messages: list[dict], model: str, kind: str) -> str:
    payload = json.dumps({"messages": messages, "model": model}, sort_keys=True, ensure_ascii=False)
    h = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"llm:{kind}:{model}:{h}"

async def llm_with_l2(messages: list[dict], model: str, kind: str = "agent") -> dict:
    key = cache_key(messages, model, kind)
    cached = await r.get(key)
    if cached:
        # Hit metric
        await r.incr(f"cache:hits:{kind}")
        return json.loads(cached)

    # Miss metric
    await r.incr(f"cache:misses:{kind}")

    response = await client.chat.completions.create(model=model, messages=messages)
    result = {
        "content": response.choices[0].message.content,
        "usage": {
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        },
        "model": response.model,
    }
    await r.setex(key, CACHE_TTL_BY_TYPE.get(kind, 1800), json.dumps(result, ensure_ascii=False))
    return result

Key points:

sort_keys=True + ensure_ascii=False: a stable hash regardless of key order and Unicode.
TTL by type: static FAQs live long, news lives briefly.
The llm:<kind>:<model>: prefix simplifies invalidation by type or model (SCAN llm:faq:* + DEL).
Hit/miss metrics are mandatory for measuring effectiveness.

Exact-match cache hit rate: usually 20–40% on an FAQ bot, 5–15% on agent pipelines, and close to 0 on creative generation. Paraphrased questions need the next level.

L3: semantic cache via embeddings

Exact-match does not catch paraphrases: "how do I get an API key" and "where do I get a token for the API" are different strings with the same meaning. A semantic cache solves this with embeddings.

Architecture:

For each request, compute an embedding (use a cheap model such as text-embedding-3-small or DeepSeek embeddings).
Search Qdrant for the top-1 point with a cosine similarity of at least 0.92.
If one is found, return the cached response.
If not, call the LLM and store the embedding and response in Qdrant.

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
import json

client = OpenAI(api_key="sk-promptra-...", base_url="https://api.promptra.ru/v1")
qdrant = QdrantClient(host="localhost", port=6333)

COLLECTION = "llm_semantic_cache"
EMBEDDING_MODEL = "text-embedding-3-small"  # 1536 dim
SIMILARITY_THRESHOLD = 0.92

# Once at startup: create the collection only if it is missing, so that a
# restart does not wipe the cache (recreate_collection drops existing points).
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model=EMBEDDING_MODEL, input=text)
    return resp.data[0].embedding

def query_text(messages: list[dict]) -> str:
    """Use the last user message as the query for the embedding."""
    for m in reversed(messages):
        if m["role"] == "user":
            return m["content"]
    return ""

def raw_llm(messages: list[dict], model: str) -> dict:
    """Call the model directly, bypassing the semantic cache."""
    response = client.chat.completions.create(model=model, messages=messages)
    return {"content": response.choices[0].message.content, "from_cache": False}

def llm_with_semantic_cache(messages: list[dict], model: str) -> dict:
    query = query_text(messages)
    if not query or len(query) < 10:
        # The query is too short for a semantic cache to be useful
        return raw_llm(messages, model)

    # Embed the query
    query_vector = embed(query)

    # Search Qdrant
    hits = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_vector,
        limit=1,
        with_payload=True,
    )

    if hits and hits[0].score >= SIMILARITY_THRESHOLD:
        # Hit
        return {
            "content": hits[0].payload["content"],
            "from_cache": True,
            "similarity": hits[0].score,
        }

    # Miss: go to the LLM
    content = raw_llm(messages, model)["content"]

    # Store the new answer
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(
            id=str(uuid.uuid4()),
            vector=query_vector,
            payload={"query": query, "content": content, "model": model},
        )],
    )

    return {"content": content, "from_cache": False}

Parameters to tune:

SIMILARITY_THRESHOLD = 0.92 is a starting point. Below 0.88 there are many false hits. Above 0.96 almost nothing hits. For FAQ, use 0.93. For technical questions, use 0.96.
EMBEDDING_MODEL: a cheap model is enough. text-embedding-3-small or DeepSeek embeddings (via Promptra) give quality comparable to LLM-pump for FAQ.
Query length of at least 10 characters: embeddings of very short queries are unstable.

For more on embeddings and the RAG stack, see Embeddings and Vector Search: the RAG Stack 2026. For choosing an embedding model for Russian, see Embeddings API in Russia.
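One way to choose the threshold empirically is to label a sample of query pairs by hand (same answer or not) and sweep candidate thresholds over them. The sketch below assumes such a labeled sample exists; the five pairs shown are invented for illustration and are not measurements.

```python
def sweep_thresholds(labeled_pairs, thresholds):
    """labeled_pairs: (similarity, same_answer) tuples from hand-labeled query pairs."""
    rows = []
    for t in thresholds:
        hits = [same for sim, same in labeled_pairs if sim >= t]
        false_hits = sum(1 for same in hits if not same)
        rows.append({
            "threshold": t,
            # Share of pairs that would be served from the cache.
            "hit_rate": len(hits) / len(labeled_pairs),
            # Share of cache hits that would return the wrong answer.
            "false_hit_rate": false_hits / len(hits) if hits else 0.0,
        })
    return rows

# Invented example data: (cosine similarity, do both queries need the same answer?)
pairs = [(0.98, True), (0.95, True), (0.93, False), (0.90, True), (0.85, False)]
for row in sweep_thresholds(pairs, [0.88, 0.92, 0.96]):
    print(row)
```

Raising the threshold trades hit rate for fewer wrong answers; the acceptable false-hit rate depends on how costly a wrong cached answer is.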

Protection against false hits

A semantic cache is dangerous when semantically close requests require different answers. Classic scenarios and solutions:

1. Requests with parameters or IDs. "Orders for customer 42" and "orders for customer 43" have nearly identical embeddings (0.97+), but the answers differ. The solution is to exclude such requests from the semantic cache:

import asyncio
import re

ID_PATTERNS = [
    r"\b\d{2,}\b",          # Numbers with 2+ digits (IDs such as "customer 42")
    r"\b[A-Z]{2,}\d{2,}\b", # SKU/article codes
    r"@\w+",                # Mentions
    r"https?://\S+",        # URLs
]

def has_parametric(text: str) -> bool:
    return any(re.search(p, text) for p in ID_PATTERNS)

async def llm_with_smart_cache(messages, model):
    query = query_text(messages)
    if has_parametric(query):
        # Parametric request: exact-match L2 only (llm_with_l2 is async)
        return await llm_with_l2(messages, model, kind="parametric")
    # Plain text: the semantic cache is safe; run the sync function in a thread
    return await asyncio.to_thread(llm_with_semantic_cache, messages, model)

2. Context-dependent queries. "Tell me more" in a chat means different things depending on the history. The fix is to cache only single-turn requests (messages of length 1–2), or to include a hash of the system prompt in the key.

3. Queries with a date or time. "Show me today's news": "today" keeps changing. Cache with a short TTL (5–15 minutes) and invalidate at midnight.

An additional safeguard is an intent whitelist: classify each query into one of N intents (FAQ, news, agent, code) and enable the semantic cache only for the FAQ intent.
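Points 2 and 3 could look roughly like this. It is a sketch under assumptions: the helper names and the "llm:" key layout are hypothetical, and messages are assumed to be OpenAI-style dicts with "role" and "content" keys.

```python
import hashlib
from datetime import datetime, timedelta


def is_single_turn(messages: list[dict]) -> bool:
    """True when the request is at most a system prompt plus one user message."""
    roles = [m["role"] for m in messages]
    return len(messages) <= 2 and roles.count("user") == 1 and "assistant" not in roles


def cache_key(messages: list[dict], model: str) -> str:
    """Exact-match key that changes whenever the model or the system prompt changes."""
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    user = "".join(m["content"] for m in messages if m["role"] == "user")
    system_hash = hashlib.sha256(system.encode("utf-8")).hexdigest()[:16]
    user_hash = hashlib.sha256(user.encode("utf-8")).hexdigest()[:16]
    return f"llm:{model}:{system_hash}:{user_hash}"


def ttl_for_dated_query(now: datetime, max_ttl_seconds: int = 15 * 60) -> int:
    """Short TTL that never outlives the current calendar day."""
    midnight = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return min(max_ttl_seconds, int((midnight - now).total_seconds()))
```

Capping the TTL at the time left until midnight gives the "invalidate at midnight" behaviour without a separate cleanup job.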

Prompt caching: a bonus from the provider

OpenAI and Anthropic cache the prompt prefix on their side. This is a different mechanism and should not be confused with your own cache:

Anthropic prompt caching: mark a system block or the last message with cache_control: {type: "ephemeral"}. The first request costs the regular price plus 25% for the cache write. Subsequent requests within 5 minutes cost ×0.1 of the input price. Documentation: Anthropic prompt caching.
OpenAI prompt caching: automatic for prompts of 1024 tokens or more. The prefix is cached for 5–10 minutes, and a repeat costs ×0.5 of the input price. The usage.prompt_tokens_details.cached_tokens field of the response shows the number of cached tokens.

Through Promptra, prompt caching works without changes; you simply pass the same parameters:

# Anthropic-style caching through Promptra
response = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[
        {
            "role": "system",
            "content": "You are an assistant for Promptra products. <Long system prompt, 5000+ tokens>",
            "cache_control": {"type": "ephemeral"},  # passed through as-is
        },
        {"role": "user", "content": user_message},
    ],
)

Savings for agents with a long system prompt:

Scenario	Without cache	With prompt cache	Savings
Agent with 8K system prompt, 200 RPM, GPT-5.5	8K × 200 × 350 ₽/M = 560 ₽/min on input	8K × 200 × 175 ₽/M = 280 ₽/min	50% of input
Agent with 15K system prompt, 50 RPM, Opus 4.7	15K × 50 × 350 ₽/M = 263 ₽/min	15K × 50 × 35 ₽/M = 26 ₽/min	90% of input

For agents with a rich system prompt, the savings from prompt caching alone reach 200K–600K ₽/month. See also LLM Price Comparison 2026 for a per-model calculation.
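The arithmetic in the table can be reproduced with a one-line formula. This sketch assumes the request rate is per minute (which is what makes the ₽/min figures work out) and that the cached price is expressed as a multiplier on the input price; the function name is hypothetical.

```python
def input_cost_per_minute(system_tokens: int, requests_per_minute: int,
                          price_per_million: float, cached_multiplier: float = 1.0) -> float:
    """Cost in rubles per minute of sending the system prompt with every request."""
    tokens_per_minute = system_tokens * requests_per_minute
    return tokens_per_minute * price_per_million * cached_multiplier / 1_000_000


# 8K-token system prompt at 200 requests per minute and 350 rubles per 1M input tokens
full_price = input_cost_per_minute(8_000, 200, 350)
half_price = input_cost_per_minute(8_000, 200, 350, cached_multiplier=0.5)
```

Swapping in cached_multiplier=0.1 gives the Anthropic-style row of the table.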

Real-world cost-savings benchmarks

Production measurements from an FAQ bot (Russian-speaking users, mostly support questions):

Scenario: 100K requests/day, GPT-5.5 (350/2150 ₽ per 1M input/output tokens)
Average request: 800 input + 400 output tokens

Baseline (no cache):
- 100K × (800 × 350 + 400 × 2150) / 1M = 28,000 + 86,000 = 114,000 ₽/day
- 3.4M ₽/month

+L2 Redis exact-match (28% hit rate):
- 72K LLM calls × average price = 114K × 0.72 = 82,080 ₽/day
- Savings of 28% = 31,920 ₽/day, 957,600 ₽/month

+L3 semantic cache (55% cumulative hit rate):
- 45K LLM calls × average price = 114K × 0.45 = 51,300 ₽/day
- Savings of 55% = 62,700 ₽/day, 1,881,000 ₽/month
- Minus embedding cost (for 45K hits + 100K queries × $0.00002) ≈ 700 ₽/day
- Net 62K ₽/day = 1.86M ₽/month

+Prompt caching on the system prompt (agent use case):
- 50% input savings on every call → roughly 6% more off the total
- Net savings of 1.91M ₽/month = 22.9M ₽/year

Real numbers depend on the nature of the traffic. For unique creative work (generating marketing copy) the hit rate is below 5% and savings are minimal. For FAQ bots, support chats, and intent classification, 50–70% is typical.
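To plug your own traffic into the same model, the baseline and hit-rate lines above reduce to a small function. This is a sketch with a hypothetical name; it assumes a cache hit costs nothing and ignores embedding and infrastructure costs.

```python
def daily_llm_cost(requests: int, input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float, hit_rate: float = 0.0) -> float:
    """Daily LLM spend in rubles; prices are per 1M tokens, cache hits are treated as free."""
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return requests * (1.0 - hit_rate) * per_request


baseline = daily_llm_cost(100_000, 800, 400, 350, 2150)
with_l2 = daily_llm_cost(100_000, 800, 400, 350, 2150, hit_rate=0.28)
with_l3 = daily_llm_cost(100_000, 800, 400, 350, 2150, hit_rate=0.55)
```

The three values correspond to the baseline, +L2, and +L3 lines of the benchmark.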

Semantic cache infrastructure:

Redis with 8 GB RAM: 600–1200 ₽/month (Yandex Cloud).
Qdrant with 1 vCPU + 2 GB RAM for 100K points: 500–800 ₽/month.
Embedding calls (text-embedding-3-small): about $0.02 per 1M tokens = 1.43 ₽/M. For 100K requests of 200 tokens each, that is 28.6 ₽/day.

In total, semantic cache infrastructure costs 1500–2000 ₽/month against savings of 60–200K ₽/month. ROI is positive from day one. For async batch as another way to save, see Async and Batch API for LLMs: 50% discount.

Invalidation: what to evict and when

A cache without invalidation goes stale over time. Standard triggers:

Time (TTL): built into Redis SETEX. For Qdrant, run a separate cron cleanup job on the created_at field.
Model change: all answers produced by the old model become outdated. The key prefix includes the model name → DROP COLLECTION or DELETE FROM ... WHERE model = old_model.
System prompt change: if you changed the agent's prompt, the old answers are invalid. Include a hash of the system prompt in the cache key.
Documentation update (for RAG): when you re-ingest the corpus, invalidate every cached answer that was based on the old documents.
Manual invalidation: an admin endpoint POST /admin/cache/invalidate with a filter (kind/model/pattern).

# Bulk invalidation by key pattern
async def invalidate_by_kind(kind: str):
    cursor = 0
    deleted = 0
    while True:
        cursor, keys = await r.scan(cursor, match=f"llm:{kind}:*", count=1000)
        if keys:
            deleted += await r.delete(*keys)
        if cursor == 0:
            break
    return deleted

# Qdrant invalidation by model
def invalidate_semantic_by_model(old_model: str):
    qdrant.delete(
        collection_name=COLLECTION,
        points_selector={"filter": {"must": [{"key": "model", "match": {"value": old_model}}]}},
    )

Production checklist

[ ] L2 Redis is mandatory in production: no direct LLM calls for repeated requests.
[ ] TTL by content type: FAQ 24h, summary 4h, agent 30min, news 5min.
[ ] Cache key includes the model, so migrating between models does not serve stale answers.
[ ] Hit/miss metrics by type: required to evaluate ROI.
[ ] L3 semantic cache for FAQ bots and support: 25–45% on top of L2.
[ ] SIMILARITY_THRESHOLD starts at 0.92; tune it for quality.
[ ] Parametric queries (with IDs, URLs): exclude from the semantic cache.
[ ] Prompt caching for agents with a long system prompt: 50–90% input savings.
[ ] Invalidation on model and system prompt changes is mandatory.
[ ] Qdrant cleanup once a day: points older than the TTL are deleted.
[ ] Admin endpoint for manual invalidation.
[ ] Alert on hit rate < 15%: either something is broken, or the traffic is unique and the cache is not needed.
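Two of the checklist items, TTL by content type and the hit-rate alert, can be sketched in plain Python. The names TTL_BY_KIND and HitRateMonitor and the min_samples guard are assumptions added for illustration; a production setup would more likely export these counters to a metrics system.

```python
from collections import Counter

# TTLs from the checklist, in seconds; the dictionary keys are hypothetical kind labels.
TTL_BY_KIND = {"faq": 24 * 3600, "summary": 4 * 3600, "agent": 30 * 60, "news": 5 * 60}


class HitRateMonitor:
    """Counts cache hits and misses per kind and flags a low hit rate."""

    def __init__(self, alert_below: float = 0.15, min_samples: int = 100):
        self.alert_below = alert_below
        self.min_samples = min_samples  # assumption: avoid alerting on a handful of requests
        self.hits = Counter()
        self.misses = Counter()

    def record(self, kind: str, hit: bool) -> None:
        (self.hits if hit else self.misses)[kind] += 1

    def hit_rate(self, kind: str) -> float:
        total = self.hits[kind] + self.misses[kind]
        return self.hits[kind] / total if total else 0.0

    def should_alert(self, kind: str) -> bool:
        total = self.hits[kind] + self.misses[kind]
        return total >= self.min_samples and self.hit_rate(kind) < self.alert_below
```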

Through Promptra, all providers are available via base_url="https://api.promptra.ru/v1", so the caching code works the same for Opus, GPT, and Gemini, which simplifies A/B testing of models while preserving the hit rate. For migrating between providers, see Migrating from OpenAI to Promptra in 10 minutes. For counting tokens before sending and optimizing the payload, see How to count LLM tokens.

Anti-patterns

Cache without TTL: data goes stale and answers become wrong.
Cache key without the model: after migrating to a new model you serve old answers.
Semantic cache on parametric queries: you answer about customer 42 when asked about customer 43.
SIMILARITY_THRESHOLD too low (< 0.88): many false hits, poor UX.
Threshold too high (> 0.97): almost no hits, infrastructure wasted.
Ignoring prompt caching: you lose 50–90% input savings on agents.
Caching write operations: never cache calls with tool calls that change state.
Not monitoring hit rate: you do not know how effective the cache is.

Fallback options

LangChain RedisCache / SemanticCache: ready-made integrations, but less flexible. Good for a prototype.
GPTCache: a standalone Python library with pluggable storage backends. Handy for research.
OpenAI Batch API: a 50% discount for offline processing instead of a cache. Suitable when a 24h delay is acceptable.
Anthropic prompt caching: a must for long system prompts, with 90% input savings.

For an FAQ bot handling 50K–200K requests a day, the optimal stack is Redis L2 + Qdrant semantic L3 + prompt caching. It pays for itself in 1–3 days and runs reliably for years.

Promptra — Russian LLM API aggregator. One OpenAI-compatible endpoint to all flagship models: OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Sonnet 4.6), Google (Gemini 3.1 Pro, 3.5 Flash), DeepSeek V4 Pro, Qwen 3.6 Plus.

Provider prices 1-to-1 at CBR rate — no markup on tokens. Ruble billing per contract, full closing documents through EDI. No VPN — legal B2B service in Russia.

Try: promptra.ru · model catalog · docs

Embeddings and vector search: the complete 2026 RAG stack for Russian-language projects

Promptra Team — Mon, 08 Jun 2026 15:37:55 +0000

RAG (Retrieval-Augmented Generation) is a pattern in which an LLM receives relevant context from your knowledge base before it formulates an answer. This is how ChatGPT with connected documents, Notion AI, and most LLM-based B2B chatbots know your specifics. Under the hood are embeddings (text is turned into a vector), a vector database (stores vectors and searches them by similarity), and an LLM (receives the top-k retrieved chunks and answers).

This guide covers the full 2026 stack for Russian-language projects through the single Promptra gateway: embedding models (OpenAI text-embedding-3, Cohere v3, Voyage 3), vector databases (pgvector, Qdrant, Chroma), chunking strategies, hybrid search, and retrieval quality metrics. Prices are in rubles at the CBR rate, payment is in rubles under a contract, with a full set of closing documents.

TL;DR: the RAG stack in 5 components

Chunking: cut documents into chunks of 500–1500 tokens with overlap or along semantic boundaries.
Embedding: turn each chunk into a 1024–3072-dim vector with an embedding model.
Vector DB: store vectors with metadata in Qdrant or pgvector, with an HNSW index for fast search.
Retrieval: compute the embedding of the user's query and find the top-k nearest vectors (cosine similarity).
Generation: pass the retrieved chunks as context to an LLM (Claude Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro), and the model answers.

This article is part of a pillar guide: the complete technical guide to the LLM API in Python, covering tokens, function calling, streaming, RAG, and batch.

With hybrid search and rerank, retrieval quality is about 2 times higher than with pure vector search.

Step 1. Embeddings: which model to choose

An embedding model turns text into a fixed-length vector. Texts that are close in meaning get close vectors (by cosine similarity). Everything depends on embedding quality: no LLM will save you if retrieval returns irrelevant chunks.
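For reference, cosine similarity itself is only a few lines. This is a dependency-free sketch for illustration (the function name is ours); vector databases compute the same quantity with optimized native code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 same direction, 0.0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, scaling a vector does not change its similarity to others.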

Top models of 2026 for Russian:

Model	Dim	Context	Price $/1M	Strengths
OpenAI text-embedding-3-large	3072	8K	$0.13	Flagship, Matryoshka (can be truncated)
OpenAI text-embedding-3-small	1536	8K	$0.02	Cheap, baseline, adequate for Russian
Cohere embed-multilingual-v3.0	1024	512	$0.10	Best on multilingual tasks
Voyage 3	1024	32K	$0.06	Premium retrieval quality
Qwen-embed-v2	1024	8K	$0.05	Strong alternative for Russian
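"Matryoshka" in the table means the leading components of the vector carry most of the signal, so a vector can be shortened after the fact. A minimal sketch of client-side truncation, with a hypothetical function name: keep the first dim components, then rescale to unit length so cosine and dot-product scores stay comparable. The OpenAI API also accepts a dimensions parameter for the text-embedding-3 models; whether a given gateway forwards it is an assumption to verify.

```python
import math


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Truncating 3072-dim vectors to 1024 would let them share a VECTOR(1024) column with the other models in the table, at some cost in quality.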

Through Promptra, all of them are available via an OpenAI-compatible endpoint; only model changes:

from openai import OpenAI

client = OpenAI(
    api_key="sk-promptra-...",
    base_url="https://api.promptra.ru/v1",
)

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["Text of the first document.", "Text of the second document."],
    encoding_format="float",
)

vectors = [d.embedding for d in response.data]
print(f"Got {len(vectors)} vectors of dimension {len(vectors[0])}")

A detailed comparison of models and prices is in "Embeddings API from Russia". The OpenAI embeddings specification is in the official guide; details on Voyage 3 are on the Voyage AI website.

Step 2. Chunking: splitting strategies

Documents rarely fit into a single vector. They need to be split into chunks sized so that each one is semantically coherent and fits into the embedding model's context.

Fixed-size with overlap: the baseline

The simplest approach is to cut every N characters with an overlap of M:

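def fixed_chunks(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # avoid a trailing chunk that only repeats the overlap
        start += size - overlap
    return chunks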

It works, but it cuts in the middle of sentences. Retrieval quality is mediocre.

Semantic chunking: by sentences/paragraphs

Use nltk or spacy to split the text into sentences, then assemble chunks up to the target size:

from nltk.tokenize import sent_tokenize

def semantic_chunks(text: str, target_size: int = 800, overlap_sentences: int = 2) -> list[str]:
    sentences = sent_tokenize(text, language="russian")
    chunks = []
    current = []
    current_size = 0

    for sent in sentences:
        if current_size + len(sent) > target_size and current:
            chunks.append(" ".join(current))
            # overlap: carry the last N sentences into the next chunk
            # (current[-0:] would keep the whole list, so handle 0 explicitly)
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_size = sum(len(s) for s in current)
        current.append(sent)
        current_size += len(sent)

    if current:
        chunks.append(" ".join(current))
    return chunks

Retrieval quality is noticeably higher because chunks always consist of whole sentences.

Hierarchical chunking: the best option in 2026

The idea: index small chunks (for precise search) but return their parents (for rich context):

from dataclasses import dataclass

@dataclass
class HierarchicalChunk:
    child_text: str   # 200-400 tokens, used for embedding and search
    parent_text: str  # 1500-2000 tokens, passed to the LLM
    parent_id: str
    child_id: str

def hierarchical_chunks(text: str) -> list[HierarchicalChunk]:
    parents = semantic_chunks(text, target_size=2000, overlap_sentences=0)
    result = []
    for p_idx, parent in enumerate(parents):
        children = semantic_chunks(parent, target_size=350, overlap_sentences=1)
        for c_idx, child in enumerate(children):
            result.append(HierarchicalChunk(
                child_text=child,
                parent_text=parent,
                parent_id=f"p{p_idx}",
                child_id=f"p{p_idx}-c{c_idx}",
            ))
    return result

This wins on about 90% of tasks: search precision comes from the small chunks, rich context from the large ones. For technical docs, chunk by markdown headings. For code, chunk by functions/classes. For dialogues, chunk by turns.
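Chunking by markdown headings, mentioned above for technical docs, could be sketched as follows. The function name and the output shape (a list of dicts with title and text) are assumptions. The sketch treats every line starting with 1 to 6 '#' characters as a heading, so a '#' comment inside a fenced code block would be misread; a real splitter should track fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_chunks(text: str) -> list[dict]:
    """Split a markdown document into one chunk per heading section."""
    chunks = []
    title, lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section, skipping sections with no body text
            if any(l.strip() for l in lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks
```

Keeping the heading as separate metadata makes it easy to prepend it to the chunk text before embedding, which often helps retrieval.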

Step 3. Vector database: pgvector / Qdrant / Chroma

Store vectors together with metadata (id, source, chunk_text) and search by cosine similarity.

pgvector: if you already run Postgres

CREATE EXTENSION vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    source TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    parent_text TEXT,
    embedding VECTOR(1024)   -- dimension for Voyage 3 / Cohere
);

CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Top-k query:

SELECT id, chunk_text, parent_text, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

From Python, use psycopg or asyncpg. Pros: ACID, ordinary SQL backups, JOINs with relational data. Cons: performance drops at hundreds of millions of vectors; at that scale you need Qdrant.

Qdrant: production-grade open source

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    Filter, FieldCondition, MatchValue,  # needed by the filtered search below
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# upsert
points = [
    PointStruct(
        id=i,
        vector=vector,
        payload={"source": doc.source, "chunk_text": doc.text, "parent_text": doc.parent},
    )
    for i, (vector, doc) in enumerate(zip(vectors, chunks))
]
client.upsert(collection_name="docs", points=points)

# search
hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    limit=10,
    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="manual"))]),
)

Qdrant is built by a team with Russian roots, is available on-prem, scales to hundreds of millions of vectors, filters by metadata out of the box, and has built-in hybrid search.

Chroma: for prototyping

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("docs")

collection.add(
    embeddings=vectors,
    documents=[c.text for c in chunks],
    metadatas=[{"source": c.source} for c in chunks],
    ids=[c.id for c in chunks],
)

results = collection.query(
    query_embeddings=[query_vector],
    n_results=10,
)

The only dependency is pip install chromadb; it runs in-memory or file-based. It is not suitable for production with serious traffic.

Step 4. Hybrid search: vector + BM25

Pure vector search loses on exact matches of names, codes, and dates. Hybrid search combines semantic and keyword retrieval:

# Qdrant has built-in hybrid search via named vectors
from qdrant_client.models import Prefetch, FusionQuery, Fusion

results = client.query_points(
    collection_name="docs",
    prefetch=[
        Prefetch(query=dense_query_vector, using="dense", limit=20),
        Prefetch(query=sparse_query_vector, using="bm25", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),   # Reciprocal Rank Fusion
    limit=10,
)

RRF (Reciprocal Rank Fusion) is the standard way to merge ranked lists. The formula is score = sum(1 / (k + rank_i)), where k is usually 60. A document that ranks near the top of both lists gets a higher score than one that tops only one of them.
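The formula is short enough to implement directly, which helps when fusing results from sources a database cannot combine for you. A minimal sketch with a hypothetical function name, assuming 1-based ranks and lists of document ids:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c"]   # ids ranked by vector search
sparse = ["b", "d", "a"]  # ids ranked by keyword search
fused = reciprocal_rank_fusion([dense, sparse])
```

In this toy example "b" ends up first: it is second in one list and first in the other, which beats "a" at first and third.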

On technical documents, hybrid search typically raises Recall@10 by 10–20%.

Step 5. Rerank: the final quality touch

On top of hybrid search you add a reranker: a model that produces a precise relevance score for every (query, chunk) pair. It is expensive (one inference per retrieved chunk), but it yields a better top 10 out of the 50–100 retrieved.

Top rerank models:

Cohere rerank-multilingual-v3.0: production-grade, low latency, ~$2 per 1M requests
Voyage rerank-2: premium quality, ~$3 per 1M
BGE-reranker-v2-m3: open source, can be self-hosted

# via the Cohere SDK on top of regular retrieval
retrieved = vector_search(query, k=50)
rerank_response = cohere_client.rerank(
    query=query,
    documents=[r.text for r in retrieved],
    top_n=10,
    model="rerank-multilingual-v3.0",
)
final = [retrieved[r.index] for r in rerank_response.results]

After rerank, Recall@10 is typically 15–25% higher than with hybrid search alone. The cumulative uplift from embedding → +chunking → +hybrid → +rerank is the difference between "works tolerably" and "works great".

Step 6. Generation: passing context to the LLM

The final step is to assemble the prompt and call the LLM:

def rag_answer(query: str, retrieved_chunks: list[str], model: str = "claude-opus-4-7") -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)
    system = (
        "You are a technical documentation assistant. "
        "Answer ONLY from the provided context. "
        "If the context does not contain the answer, say 'The documentation does not answer this question'. "
        "Do not make up facts."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer in detail."

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

What matters:

A strict system message forces the model not to invent anything outside the context. This reduces hallucinations.
Citing sources: ask the model to return [source_1]-style references to chunks, which lets the user verify the answer.
Context length vs cost: every retrieved chunk is input tokens. On Opus 4.7 (350 ₽ per 1M input), 10 chunks of 1500 tokens = 15K tokens = 5.25 ₽ per request. At 100K requests a month, that is 525K ₽. Prices are in "LLM Price Comparison 2026".
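The last two points can be sketched as helpers. Both names are hypothetical: build_context labels each chunk so the model has something to cite, and context_cost_rub reproduces the per-request arithmetic above.

```python
def build_context(chunks: list[str]) -> str:
    """Label each chunk so the model can cite it as [source_N]."""
    return "\n\n---\n\n".join(
        f"[source_{i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )


def context_cost_rub(n_chunks: int, tokens_per_chunk: int, input_price_per_million: float) -> float:
    """Input cost of the retrieved context for one request, in rubles."""
    return n_chunks * tokens_per_chunk * input_price_per_million / 1_000_000
```

To get citations back, the system message would also need an instruction such as "cite the [source_N] labels you relied on"; that wording is an assumption, not part of the original prompt.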

For cheap RAG, people use Gemini 3.1 Pro (140/860 ₽) or DeepSeek V4 Pro (30/60 ₽) on small tasks. For critical product answers, use Opus 4.7 or GPT-5.5.

Evaluation: without it, RAG is a black box

You cannot ship RAG to production without measuring quality. A minimal eval:

Collect 50–200 (query, ids_of_relevant_chunks) pairs by hand or with an LLM-as-judge.
Run query → embed → search → top-k and measure:
- Recall@k: the share of relevant chunks that made it into the top-k. Target Recall@10 > 0.85.
- MRR: Mean Reciprocal Rank, how high the first relevant result ranks. Target > 0.5.
- NDCG@k: accounts for the order of results; a more refined metric.
If you fall below the targets, change the embedding model or the chunking, add rerank, or tune the hybrid weight.
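The three metrics above fit in a few lines each. This sketch assumes binary relevance (a chunk is either relevant or not) and that each eval item is a pair of (retrieved_ids in ranked order, set of relevant ids); the function names are ours.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the gain of an ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average the metrics over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one eval item")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }
```

Running evaluate before and after each pipeline change (new chunking, added rerank) turns the targets above into a regression check.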

In addition, run an end-to-end eval: give an LLM judge (the same Opus 4.7 / GPT-5.5) triples of (query, generated_answer, ground_truth) and ask "how accurate is the answer and how well is it grounded in the context?". This catches regressions not only in retrieval but also in the wording of answers.

Cost of RAG on 1M documents

Example economics for a base of 1M documents averaging 500 tokens each, with 10K requests per day:

Item	Calculation	Amount
Indexing embeddings	500M tokens × $0.06 (Voyage 3)	~430K ₽ one-time
Vector DB	Qdrant on-prem (4 vCPU, 32 GB RAM)	~25K ₽/month
Query embeddings	10K × 50 tokens × 30 days	~650 ₽/month
Rerank	10K × 50 × 30 (Cohere v3)	~6500 ₽/month
LLM generation (Opus 4.7)	10K × 15K input + 800 output	~2.27M ₽/month
LLM on a cheaper model (Gemini 3.1 Pro)	10K × 15K input + 800 output	~700K ₽/month

The main cost item is LLM generation. If the answers do not require a flagship model, switching from Opus 4.7 to Gemini 3.1 Pro saves about 1.5M ₽/month. Through Promptra this is a one-line change of model.

Payment and closing documents

All components of the RAG stack (embeddings, rerank, LLM generation) are billed to a Russian legal entity that is a resident of the Russian Federation. The 5% service fee is charged only when topping up the balance; there is no markup on tokens, and prices strictly follow the CBR rate. The full set of closing documents (offer agreement, invoice, certificate of services rendered, VAT invoice, UPD) arrives through EDI: Diadoc, SBIS, Kontur. See the "Pricing" page for details.

What's next

RAG in 2026 is not a single LLM request with context glued on. It is a chain: chunking → embedding → vector DB → hybrid search → rerank → generation, with evaluation on top of every layer. With the right architecture, Recall@10 passes 0.9, the end-to-end judge score exceeds 4.5/5, and the product hallucinates far less. Useful next steps: "Function calling and tool use" for analyst agents on top of RAG, "Async calls and the Batch API" for indexing large bases, and "How to count tokens in an LLM" for estimating the cost of the context window. If you need help choosing a model for your RAG or connecting a key through a legal entity, write to the Promptra team on Telegram.

📚 Main guide on this topic: Best neural network 2026: which LLM to choose for your task, with related materials and an overview of the whole category.

Vector Database Tutorial: From Zero to RAG Agent in 2026

Iniyarajan — Mon, 08 Jun 2026 08:25:21 +0000

Common misconception: Vector databases are just fancy storage systems. The truth? They're the foundation that makes AI agents truly intelligent.

We're in 2026, and vector databases have become the backbone of every production RAG system. Whether you're building a customer support agent or a code assistant, understanding how vectors work isn't optional anymore. Let's walk through building a complete RAG agent together, starting from the basics.

What Makes Vector Databases Different
Setting Up Your First Vector Database
Building a RAG Pipeline
Creating an AI Agent with Memory
Production Considerations
Frequently Asked Questions

What Makes Vector Databases Different

Traditional databases store data in rows and columns. Vector databases store mathematical representations of data — embeddings — that capture semantic meaning. When we ask "How do I deploy my app?", a vector database doesn't just match keywords. It understands that this relates to deployment, DevOps, and infrastructure.

The magic happens in the similarity search. Vector databases use algorithms like HNSW (Hierarchical Navigable Small World) to find the most relevant documents in milliseconds, even with millions of entries.
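To see what an index like HNSW is approximating, here is the exact version of the same search: score every stored vector against the query and keep the best k. This is a brute-force sketch with hypothetical names and toy two-dimensional vectors; it is linear in the number of vectors, which is the cost approximate indexes are designed to avoid.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-d "embeddings"; real ones have hundreds or thousands of dimensions.
store = {"deploy-guide": [1.0, 0.0], "devops-faq": [0.9, 0.1], "recipes": [0.0, 1.0]}
top = exact_top_k([1.0, 0.0], store, k=2)
```

An exact search like this is also a handy ground truth when measuring how much recall an approximate index gives up.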

Here's where it gets interesting for AI agents. We can store not just documents, but conversation history, user preferences, and contextual information as vectors. This gives our agents semantic memory — they remember not just what happened, but what it means.

Setting Up Your First Vector Database

We'll use Pinecone for this vector database tutorial because it's production-ready and developer-friendly. But the concepts apply to any vector database.

First, let's create our vector space:

import pinecone
from openai import OpenAI

# Initialize Pinecone (pinecone-client 2.x style; newer clients use Pinecone(api_key=...) instead of init())
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")

# Create index with 1536 dimensions (OpenAI embeddings)
index_name = "rag-agent-memory"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine"
    )

index = pinecone.Index(index_name)
client = OpenAI()

def get_embedding(text):
    """Convert text to vector embedding"""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def store_document(doc_id, content, metadata=None):
    """Store document as vector in database"""
    embedding = get_embedding(content)
    index.upsert([
        {
            "id": doc_id,
            "values": embedding,
            "metadata": {"content": content, **(metadata or {})}
        }
    ])

def search_similar(query, top_k=5):
    """Find similar documents to query"""
    query_embedding = get_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return results.matches

This setup gives us the foundation for semantic search. But for a production RAG agent, we need more structure.

Building a RAG Pipeline

A robust RAG pipeline handles document preprocessing, chunking, and retrieval orchestration. Here's our complete system:

class RAGAgent:
    def __init__(self, index_name="rag-agent"):
        self.index = pinecone.Index(index_name)
        self.client = OpenAI()
        self.conversation_memory = []

    def add_documents(self, documents):
        """Add documents to vector database with chunking"""
        for i, doc in enumerate(documents):
            # Split into chunks (simple approach)
            chunks = self._chunk_text(doc["content"])

            for j, chunk in enumerate(chunks):
                doc_id = f"{doc['id']}_chunk_{j}"
                embedding = get_embedding(chunk)

                self.index.upsert([{
                    "id": doc_id,
                    "values": embedding,
                    "metadata": {
                        "content": chunk,
                        "source": doc["id"],
                        "chunk_index": j
                    }
                }])

    def _chunk_text(self, text, chunk_size=500, overlap=50):
        """Split text into overlapping chunks"""
        words = text.split()
        chunks = []

        for i in range(0, len(words), chunk_size - overlap):
            chunk = " ".join(words[i:i + chunk_size])
            chunks.append(chunk)

        return chunks

    def query(self, question):
        """Query with RAG pipeline"""
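        # Retrieve relevant context from this agent's own index
        # (the module-level search_similar() always queries the global index)
        results = self.index.query(
            vector=get_embedding(question),
            top_k=3,
            include_metadata=True
        )
        context = "\n\n".join([match.metadata["content"] for match in results.matches])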

        # Include conversation memory
        memory_context = "\n".join([
            f"User: {msg['user']}\nAssistant: {msg['assistant']}"
            for msg in self.conversation_memory[-3:]  # Last 3 exchanges
        ])

        # Generate response
        prompt = f"""
        Context from knowledge base:
        {context}

        Previous conversation:
        {memory_context}

        Current question: {question}

        Please provide a helpful response based on the context and conversation history.
        """

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                {"role": "user", "content": prompt}
            ]
        )

        answer = response.choices[0].message.content

        # Store in conversation memory
        self.conversation_memory.append({
            "user": question,
            "assistant": answer
        })

        return answer

What makes this different from a simple chatbot? The vector database gives our agent semantic understanding of your knowledge base, and the memory system maintains context across conversations.

Creating an AI Agent with Memory

Real AI agents need more than just document retrieval. They need episodic memory — remembering past interactions, user preferences, and learned behaviors. We can store all of this as vectors.

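import time  # timestamps for stored interactions
import uuid  # unique ids for stored interactions

class MemoryEnhancedAgent(RAGAgent):
    def __init__(self, index_name="memory-agent"):
        super().__init__(index_name)
        self.user_profile = {}

    def store_interaction(self, user_id, interaction_type, content):
        """Store user interaction as vector for future reference"""
        # A random suffix keeps ids unique; len(self.conversation_memory) never grows
        # in personalized_query, so it would overwrite the same record every time
        memory_id = f"{user_id}_{interaction_type}_{uuid.uuid4().hex}"
        embedding = get_embedding(content)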

        self.index.upsert([{
            "id": memory_id,
            "values": embedding,
            "metadata": {
                "user_id": user_id,
                "type": interaction_type,
                "content": content,
                "timestamp": int(time.time())
            }
        }])

    def get_user_context(self, user_id, query):
        """Retrieve relevant user history for personalized responses"""
        # Search for relevant past interactions
        results = self.index.query(
            vector=get_embedding(query),
            top_k=5,
            filter={"user_id": {"$eq": user_id}},
            include_metadata=True
        )

        return [match.metadata for match in results.matches]

    def personalized_query(self, user_id, question):
        """Answer with personalized context from user history"""
        # Get user's relevant history
        user_context = self.get_user_context(user_id, question)

        # Combine with knowledge base context from this agent's own index
        kb_context = self.index.query(
            vector=get_embedding(question),
            top_k=3,
            include_metadata=True
        ).matches

        # Generate personalized response
        context_text = "\n".join([
            f"User's past interaction: {ctx['content']}"
            for ctx in user_context[:2]
        ])

        kb_text = "\n".join([
            match.metadata["content"] for match in kb_context
        ])

        prompt = f"""
        User's relevant history:
        {context_text}

        Knowledge base context:
        {kb_text}

        Current question: {question}

        Provide a personalized response considering the user's history and preferences.
        """

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a personalized assistant that adapts to user preferences and history."},
                {"role": "user", "content": prompt}
            ]
        )

        answer = response.choices[0].message.content

        # Store this interaction for future reference
        self.store_interaction(user_id, "query_response", f"Q: {question}\nA: {answer}")

        return answer

This approach transforms our RAG system into a true AI agent. It learns from every interaction and becomes more helpful over time.

Production Considerations

Building production RAG agents requires thinking beyond the happy path. Here are the challenges we need to address:

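Embedding Model Selection: Different models excel at different tasks. text-embedding-ada-002 is an older general-purpose model; newer general-purpose models such as text-embedding-3-small and text-embedding-3-large generally retrieve better, and domain-specific models can help on specialized corpora. Switching models means re-embedding the whole index, since vectors from different models are not comparable.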

Vector Database Scaling: Pinecone handles scaling automatically, but self-hosted options like Weaviate or Qdrant require capacity planning. Consider your query volume and storage requirements.

Chunk Strategy: Simple text splitting isn't enough for complex documents. Consider semantic chunking that preserves context boundaries, or hierarchical chunking for structured data.

Evaluation and Monitoring: RAG systems can hallucinate or retrieve irrelevant context. Implement evaluation metrics like context relevance and answer faithfulness. Tools like LangSmith or Weights & Biases help track performance over time.

Privacy and Security: Vector embeddings can leak information about source documents. For sensitive data, consider techniques like differential privacy or encrypted vector search.

Cost Optimization: Embedding generation and vector storage costs add up. Batch embedding requests, use caching for frequent queries, and implement tiered storage for older data.
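
As a rough illustration of the caching point above, here is a minimal in-memory sketch. The `EmbeddingCache` class and the `fake_embed` stand-in are hypothetical names invented for this example; in a real system `embed_fn` would wrap your actual embedding call, and the store would likely be Redis or a database rather than a dict.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory cache keyed by a hash of (model name, text)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str):
        self._embed_fn = embed_fn  # hypothetical: wraps the real embedding call
        self._model = model
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the underlying call was made

    def _key(self, text: str) -> str:
        # Include the model name so vectors from different models never collide.
        raw = f"{self._model}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a paid API call; returns a toy 2-number "vector".
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.get("How do I reset my password?")
second = cache.get("How do I reset my password?")  # served from the cache
```

The second lookup never reaches `fake_embed`, which is the whole saving: repeated queries cost one embedding call instead of many.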

Frequently Asked Questions

Q: Which vector database should I choose for production?

For beginners, start with Pinecone for its managed service and excellent documentation. If you need self-hosted solutions, Weaviate offers great performance with GraphQL queries, while Qdrant provides Rust-based speed with Python APIs.

Q: How do I handle documents that are too large for embedding models?

Use hierarchical chunking: create summary embeddings for entire documents and detailed embeddings for chunks. Store both in your vector database with different metadata tags, then query summaries first and drill down to relevant chunks.

Q: Can vector databases replace traditional databases entirely?

No, they're complementary. Use vector databases for semantic search and similarity matching, but keep structured data in traditional databases. Many production systems use both, with vector databases handling AI features and SQL databases managing business logic.

Q: How do I evaluate if my RAG system is working well?

Track three key metrics: retrieval accuracy (are relevant documents found?), context relevance (is retrieved content useful?), and answer faithfulness (does the generated response stay true to the context?). Tools like RAGAS provide automated evaluation frameworks.

Vector databases have evolved from experimental technology to production necessity in 2026. They're the foundation that makes AI agents truly intelligent — capable of understanding context, remembering interactions, and providing personalized experiences.

The key insight? Don't think of vector databases as just storage. Think of them as the memory system that gives your AI agents the ability to learn, adapt, and become more helpful over time. That's what separates a simple chatbot from a truly intelligent agent.

Cluster-Aware Retrieval for RAG Systems

Alex Towell — Sun, 07 Jun 2026 03:03:46 +0000

Most RAG systems treat embedding spaces as flat, uniform distributions. They're not. Real knowledge bases contain distinct semantic clusters: database docs, frontend frameworks, DevOps practices, each with different internal structure. Ignoring this wastes retrieval precision.

The Problem with Flat Retrieval

A query about "React hooks optimization" should pull from the frontend cluster, not equally consider database or infrastructure docs that happen to share semantic overlap. Standard cosine similarity doesn't care about topical boundaries. You get results that are individually relevant but collectively unfocused.

Modeling Clusters with GMM

Gaussian Mixture Models assume your embeddings arise from $K$ underlying Gaussian distributions:

$$p(v) = \sum_{k=1}^K \pi_k \mathcal{N}(v \mid \mu_k, \Sigma_k)$$

For a query $q$, compute the posterior probability of each cluster:

$$p(k \mid q) = \frac{\pi_k \mathcal{N}(q \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(q \mid \mu_j, \Sigma_j)}$$

This gives you soft assignments: the probability that a query belongs to each semantic cluster.
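
To make the posterior concrete, here is a small pure-Python sketch. It assumes diagonal covariances (a simplification of the full $\Sigma_k$ in the formula) and works in log space so the normalization stays stable in high dimensions. The function names and toy parameters are made up for illustration.

```python
import math
from typing import List, Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance (assumption of this sketch)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def cluster_posteriors(q, weights, means, variances) -> List[float]:
    """p(k | q) for every cluster k, normalized with the log-sum-exp trick."""
    log_joint = [
        math.log(w) + log_gaussian_diag(q, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max before exponentiating
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy mixture: two equally weighted 2-D clusters with unit variance.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

posterior = cluster_posteriors([0.5, 0.2], weights, means, variances)
midpoint = cluster_posteriors([2.0, 2.0], weights, means, variances)
```

A query near the first mean gets almost all of its mass on cluster 0, while a query exactly between the two means splits evenly, which is the "ambiguous query" case where taking the top-2 clusters makes sense.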

Two-Stage Retrieval

Cluster selection: Pick cluster(s) with highest $p(k \mid q)$. Take top-2 for ambiguous queries.
Intra-cluster retrieval: Run k-NN within selected clusters.

The cluster boundaries act as a soft filter, avoiding the "dilution effect" where off-topic documents dominate results.
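
The two stages can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the cluster probabilities are taken as given, documents are dicts with made-up ids and 2-D vectors, and the intra-cluster search is brute-force cosine similarity rather than a real k-NN index.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def two_stage_retrieve(
    query: Sequence[float],
    cluster_probs: Sequence[float],
    docs: List[Dict],
    top_clusters: int = 1,
    top_k: int = 2,
) -> List[str]:
    # Stage 1: keep only the most probable cluster(s) for this query.
    ranked = sorted(range(len(cluster_probs)), key=lambda k: cluster_probs[k], reverse=True)
    selected = set(ranked[:top_clusters])
    # Stage 2: brute-force nearest neighbors inside the selected clusters.
    candidates = [d for d in docs if d["cluster_id"] in selected]
    candidates.sort(key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]


# Illustrative data only: cluster 0 is "frontend", cluster 1 is "database".
docs = [
    {"id": "react-hooks", "cluster_id": 0, "vector": [0.9, 0.1]},
    {"id": "react-context", "cluster_id": 0, "vector": [0.8, 0.3]},
    {"id": "postgres-index", "cluster_id": 1, "vector": [0.85, 0.2]},
]
hits = two_stage_retrieve([1.0, 0.15], cluster_probs=[0.9, 0.1], docs=docs)
```

In this toy data the database document is actually closer to the query by cosine than the second frontend document, so flat retrieval would return it; the cluster filter keeps the result set on topic.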

Mahalanobis Distance Per Cluster

Here's the underexplored part: different clusters can use different distance metrics. For a cluster modeled as $\mathcal{N}(\mu_k, \Sigma_k)$, the Mahalanobis distance accounts for the cluster's shape:

$$d_{\text{Mah}}(q, v) = \sqrt{(q - v)^T \Sigma_k^{-1} (q - v)}$$

Elongated clusters in certain semantic directions get stretched appropriately. Cosine similarity treats all directions equally. Mahalanobis adapts.
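
A tiny worked example may help here. The sketch below uses a 2-D covariance so the inverse can be written in closed form; real embedding covariances would need a linear solver (for example `numpy.linalg.solve`) instead. The function name and the toy covariance are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]


def mahalanobis_2d(q: Sequence[float], v: Sequence[float], cov: Matrix2) -> float:
    """Mahalanobis distance between q and v under a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = q[0] - v[0], q[1] - v[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# A cluster stretched along the x axis: variance 9 in x, variance 1 in y.
cov = ((9.0, 0.0), (0.0, 1.0))
origin = (0.0, 0.0)

along = mahalanobis_2d((3.0, 0.0), origin, cov)   # step along the long axis
across = mahalanobis_2d((0.0, 3.0), origin, cov)  # same Euclidean step, short axis
```

Both points are 3 units from the origin in Euclidean terms, but the step along the elongated direction counts as distance 1 and the step across it as distance 3. That is the sense in which the metric adapts to the cluster's shape.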

Clusters as Agent Tools

In agentic RAG, each cluster becomes a tool the agent can invoke:

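# ClusterRetrievalTool and topic_names are hypothetical placeholders:
# one retrieval tool per GMM cluster, named after that cluster's topic.
tools = [
    ClusterRetrievalTool(cluster_id=k, name=f"Search {topic_names[k]}")
    for k in range(K)
]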

The agent decides which clusters to search and in what order:

Query: "How does React's context API compare to Redux?"
Agent plan:
1. Search frontend cluster for React context
2. Search state management cluster for Redux patterns
3. Synthesize comparison

This beats flat retrieval for cross-topic synthesis.

Implementation

Fit GMM offline on document embeddings:

from sklearn.mixture import GaussianMixture

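# K (number of clusters) and document_embeddings (array of shape
# [n_docs, dim]) are assumed to be defined by your pipeline.
gmm = GaussianMixture(n_components=K, covariance_type='full')
gmm.fit(document_embeddings)

# For a query embedding q (1-D NumPy array of length dim):
cluster_probs = gmm.predict_proba(q.reshape(1, -1))[0]
selected_clusters = cluster_probs.argsort()[-2:][::-1]  # top-2, most probable first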

Store cluster assignments as metadata in your vector DB:

results = vector_db.query(
    query_embedding=q,
    # .tolist() turns the NumPy index array into plain ints the filter can serialize
    filter={"cluster_id": {"$in": selected_clusters.tolist()}},
    top_k=20
)

Key decisions:

Number of clusters: Use BIC/AIC or domain knowledge
Regularization: Add $\lambda I$ to covariance matrices to prevent singularities
Initialization: k-means++ for better convergence

When It Helps

Topically diverse corpora: Multi-product docs, cross-domain papers
Single-topic queries: Clear primary topic to route to
Noise reduction: Off-topic content that is superficially similar to the query no longer dilutes results

When it doesn't:

Homogeneous corpora
Very small datasets
Queries requiring extensive cross-topic synthesis (agentic patterns help here)

Limitations

Cluster boundaries: Queries near boundaries may be misrouted. Soft routing (weighted retrieval across clusters) helps.

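Scalability: With full covariance matrices, each cluster stores a d × d matrix (about 2.4 million entries at d = 1536), so GMM fitting doesn't scale well beyond roughly 100 clusters and millions of docs. Use diagonal covariances, hierarchical clustering, or vector DB partitioning for large systems.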

Benchmark first: Flat retrieval with strong reranking is a tough baseline. Always compare.

The core insight: embedding spaces have structure. Exploit it.

Embeddings API in Russia: Vector Search and RAG

Promptra Team — Sat, 06 Jun 2026 19:35:56 +0000

An embedding is a representation of text as a vector of numbers, in which texts with similar meaning end up close together in space. Embeddings power semantic search, RAG (answers grounded in your own knowledge base), classification, and duplicate detection. For most tasks a sensible default is text-embedding-3-small (cheap, 1536 dimensions); where accuracy matters, use text-embedding-3-large (3072 dimensions). Both models are called through a single OpenAI-compatible endpoint from Russia without a VPN: in code that uses the openai SDK, only base_url changes, to https://api.promptra.ru/v1, and billing is in rubles to a legal entity with closing documents.

Below is a plain-language walkthrough of what embeddings are and where they are used, which models and dimensions exist, how to get a vector through the API (Python and curl), how to assemble a RAG pipeline step by step (chunk → embedding → vector DB → search → answer), what it costs in rubles, and which mistakes come up most often. The tone is aimed at a developer who wants to understand the mechanics and the price rather than read marketing.

What Embeddings Are, in Plain Terms

A computer does not understand text the way a person does. For a machine to compare phrases by meaning, the text has to be converted into numbers. An embedding is exactly that conversion: an embedding model takes a string ("how to reset a password") and returns a list of hundreds or thousands of numbers, a vector. The vector has a fixed length that does not depend on the length of the text: a single word and a whole paragraph both become vectors of the same dimensionality. See also: migrating from the OpenAI SDK to Promptra in 10 minutes.

The key property of these vectors is that closeness in space reflects closeness in meaning. The phrases "how to reset a password" and "forgot my password, what do I do" produce vectors that sit near each other even though they share almost no words. "How to reset a password" and "borscht recipe" end up far apart. This is fundamentally different from ordinary keyword search, which matches characters, whereas semantic search matches meaning.

The distance between vectors is most often measured with cosine similarity, that is, how closely the "directions" of two vectors agree. A value near 1 means the texts are about almost the same thing; a value near 0 means they are unrelated. This metric underlies semantic search: to find relevant documents, we compute the similarity between the query vector and every document vector and take the closest ones.

Where Embeddings Are Used

Embeddings are an infrastructure building block that several practical tasks rest on:

Semantic search. Search by meaning rather than by words. A user asks "the confirmation email isn't arriving" and the system finds the article "email delivery problems", even if that article never contains the word "confirmation".
RAG (Retrieval-Augmented Generation). The most popular application. Before asking the language model a question, we find relevant fragments in our knowledge base (via embeddings) and put them in the prompt. The model answers from your documents rather than from memory, which is more accurate and less prone to fabrication.
Classification and routing. Tickets, requests, and leads can be sorted into categories by comparing their embeddings with reference examples, without training a separate model.
Clustering and deduplication. Group similar reviews, find duplicate products in a catalog, surface topics in a stream of messages.
Recommendations. "Similar products" and "similar articles" are a nearest-vector search around the current item.

All of these tasks share one thing: instead of comparing texts word by word, we compare them by meaning, and to do that we first turn each text into a vector.

Embedding Models and Dimensions

The most widely used embedding models are OpenAI's text-embedding-3 family. They are available through the OpenAI-compatible API and are the ones most often wired into projects in Russia. Key parameters:

Model	Default dimensions	Strength	OpenAI price (USD / 1M tokens)
`text-embedding-3-small`	1536	Cheap, fast, the default	$0.02
`text-embedding-3-large`	3072	Highest quality	$0.13
`text-embedding-ada-002` (legacy)	1536	Older, kept for compatibility	$0.10

USD prices are from OpenAI's official pricing page. Ruble estimates are covered in the pricing section below.

What dimensionality means. It is the length of the vector, that is, how many numbers it contains. text-embedding-3-small returns 1536 numbers by default and large returns 3072. The higher the dimensionality, the more "nuance" of meaning the model can encode, but the more space the vector takes in the database and the more it costs to store and compare. For most tasks the 1536 dimensions of small are more than enough.

The dimensions parameter is an important feature. The text-embedding-3 models (unlike the older ada-002) can return a shortened vector while keeping most of the quality. Pass dimensions=512 and you get a vector of 512 numbers instead of 1536. This saves memory in the vector database and speeds up search, and the quality drop on typical tasks is small. The technique is called Matryoshka representation: the vector is "nested" so that the first N numbers already carry most of the meaning. In practice, to save space people take 512 or 256 dimensions from large and often get quality no worse than full-length small.
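
The "first N numbers carry most of the meaning" idea can also be applied by hand to vectors you have already stored: keep the leading components and re-normalize to unit length. The sketch below is a pure-Python illustration on a toy 4-number vector; `shorten_embedding` is a name invented here, and whether a manually shortened vector matches what the API returns for `dimensions=N` is something to verify against your provider rather than assume.

```python
import math
from typing import List, Sequence


def shorten_embedding(vector: Sequence[float], dimensions: int) -> List[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


full = [0.6, 0.0, 0.8, 0.0]          # toy unit-length "embedding"
short = shorten_embedding(full, 2)   # leading 2 components, renormalized
short_norm = math.sqrt(sum(x * x for x in short))
```

Re-normalizing matters because cosine and dot-product search both assume comparable vector lengths; a truncated vector is shorter than 1 until it is rescaled.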

Which model to choose by default:

Start with text-embedding-3-small. It is cheap and fast, and for the large majority of tasks (knowledge-base search, ticket classification, recommendations) it is more than enough.
Use text-embedding-3-large when ranking accuracy matters: legal or medical search, a multilingual corpus, fine semantic distinctions. If you like, shorten the vector with dimensions so the database does not balloon.
ada-002 is only for compatibility with old code. It makes no sense for new projects: 3-small is five times cheaper and higher quality.

How to Get an Embedding Through the API

The good news: embeddings are called with the same openai SDK as chat; only the method differs (embeddings.create instead of chat.completions.create). To work from Russia without a VPN, exactly one parameter changes, base_url. The rest of the code stays the same as in any example from the OpenAI documentation.

Python

A minimal working example that gets a vector for a single string:

from openai import OpenAI

client = OpenAI(
 api_key="prm-xxxxxxxxxxxxxxxx",
 base_url="https://api.promptra.ru/v1", # the only change
)

resp = client.embeddings.create(
 model="text-embedding-3-small",
 input="How to reset the password for my account",
)

vector = resp.data[0].embedding
print(len(vector)) # 1536, the vector's dimensionality
print(resp.usage.total_tokens) # how many tokens were consumed

In the response, resp.data is a list (one element per input text) and resp.data[0].embedding is the vector itself: a list of 1536 floats. The resp.usage.total_tokens field shows the actual usage that billing is based on.

Batching: several texts in one request. This is an important technique for cutting the number of requests: instead of a thousand separate calls, send a list of strings and the model returns a list of vectors in the same order. This is how a knowledge base is indexed:

texts = [
 "How to reset a password",
 "The confirmation email is not arriving",
 "How to change my plan",
]

resp = client.embeddings.create(
 model="text-embedding-3-small",
 input=texts, # a list of strings is a batch
)

for i, item in enumerate(resp.data):
 print(i, len(item.embedding)) # one vector per text, in input order
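
For a corpus larger than one request can carry, the same idea is usually wrapped in a small batching loop. The sketch below is hedged: `embed_batch` is a hypothetical callable standing in for a wrapper around `client.embeddings.create(model=..., input=batch)`, and `fake_embed_batch` exists only so the example runs without network access.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed every text, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


calls: List[int] = []


def fake_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in for the real API call; records the size of each request.
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"doc {i}" for i in range(250)]
vectors = embed_all(texts, fake_embed_batch, batch_size=100)
```

Here 250 texts turn into three requests instead of 250, and the returned list lines up index by index with the input.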

A shortened vector via dimensions, for when you need to save on storage and search:

resp = client.embeddings.create(
 model="text-embedding-3-large",
 input="Query text",
 dimensions=512, # instead of the default 3072
)
print(len(resp.data[0].embedding)) # 512

The dimensions parameter is supported only by text-embedding-3 and newer models; the older ada-002 does not have it.

curl

You can check the endpoint without any SDK using a single request:

curl https://api.promptra.ru/v1/embeddings \
 -H "Authorization: Bearer prm-xxxxxxxxxxxxxxxx" \
 -H "Content-Type: application/json" \
 -d '{
 "model": "text-embedding-3-small",
 "input": "How to reset the password for my account"
 }'

If the response is JSON with a data field that contains an embedding array of numbers, the endpoint and key are fine and you can start indexing. A detailed walkthrough of changing base_url in different SDKs and languages is in the guide on migrating from the OpenAI SDK by changing base_url.

How to Build RAG and Semantic Search

RAG stands for Retrieval-Augmented Generation. The idea is simple: the language model does not know the contents of your internal documents, and if you ask it directly it will either answer in generalities or make something up. RAG addresses this by finding relevant fragments in your own knowledge base (via embeddings) before answering and placing them directly in the prompt. The model then answers from specific data.

The pipeline has two phases: indexing (once, ahead of time) and querying (on every user question).

Phase 1. Indexing the knowledge base

Chunking. Documents (articles, manuals, contracts) are cut into pieces, called chunks, of 200–800 tokens each. Chunks that are too large blur the meaning; chunks that are too small lose context. A typical starting point is paragraphs or fragments of 300–500 tokens with a small overlap.
Embedding. For each chunk, get a vector with embeddings.create (in batches, as shown above).
Storing in a vector DB. The vector, the original chunk text, and metadata (source, title) go into the vector database.
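
The chunking step above can be sketched as a sliding window with overlap. This example counts whitespace-separated words as a rough stand-in for tokens, which is an assumption made to keep it dependency-free; a real pipeline would count tokens with the model's tokenizer and would usually try to respect paragraph boundaries.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if chunk_size <= 0 or not (0 <= overlap < chunk_size):
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
parts = chunk_words(sample, chunk_size=4, overlap=1)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.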

Phase 2. User query

Query embedding. Turn the user's question into a vector with the same model.
Nearest-neighbor search (top-K). The vector DB finds the K chunks closest to the query by cosine similarity (usually K = 3–8).
Prompt assembly. Insert the retrieved chunks into the system prompt: "Answer the question using only these fragments: …".
Answer generation. Send the prompt to a chat model (for example GPT-5.5 or Claude Sonnet), which answers from the supplied data.

The key point: the embedding model and the chat model are different models that work together. The embedder (text-embedding-3-small) handles search; the chat model (gpt-5.5, claude-sonnet-4-6, and so on) handles wording the answer. Both are available through the same endpoint; only the value of model changes.

A minimal RAG loop in Python with no external vector DB (for a knowledge base of a few hundred chunks, in-memory search with numpy is enough):

import numpy as np
from openai import OpenAI

client = OpenAI(api_key="prm-xxxx", base_url="https://api.promptra.ru/v1")

# --- Indexing (once) ---
chunks = [
 "To reset your password, open Settings and click Reset.",
 "The confirmation email arrives within 5 minutes; check your Spam folder.",
 "You can change your plan in Billing; changes take effect immediately.",
]
emb = client.embeddings.create(model="text-embedding-3-small", input=chunks)
index = np.array([d.embedding for d in emb.data]) # matrix of vectors, one row per chunk

# --- Query ---
question = "can't log in, forgot my password"
q = client.embeddings.create(
 model="text-embedding-3-small", input=question
).data[0].embedding
q = np.array(q)

# cosine similarity and top-K
sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
top_k = sims.argsort()[::-1][:2] # argsort is a method: call it, then take the 2 best
context = "\n".join(chunks[i] for i in top_k)

# --- Generate an answer from the retrieved context ---
answer = client.chat.completions.create(
 model="gpt-5.5",
 messages=[
 {"role": "system", "content": f"Answer using only this data:\n{context}"},
 {"role": "user", "content": question},
 ],
)
print(answer.choices[0].message.content)

At larger volumes (tens of thousands of chunks and up), in-memory search is replaced with a dedicated vector database: pgvector (a PostgreSQL extension), Qdrant, Chroma, Milvus, or similar. The logic is the same: store the vectors, search for the nearest. Only the storage changes; embeddings still come from the same API.

Embedding Prices in Rubles

An important caveat: prices for embedding models are not yet listed as separate line items in our catalog (the catalog currently covers chat models). The ruble figures below are therefore a derived estimate: OpenAI's official dollar price multiplied by the Bank of Russia (CBR) rate of 71.668 ₽/$ (as of 2026-05-27, the same rate used for all models in the catalog). The actual bill is calculated at the CBR rate on the day of the top-up, with no markup on tokens; confirm exact embedding rates with the team when onboarding.

Model	OpenAI price (USD / 1M tokens)	Estimate in ₽ / 1M tokens (× 71.668)
`text-embedding-3-small`	$0.02	≈ 1.4 ₽
`text-embedding-3-large`	$0.13	≈ 9.3 ₽
`text-embedding-ada-002`	$0.10	≈ 7.2 ₽

The thing that stands out: embeddings are far cheaper than generation. A million tokens through text-embedding-3-small costs about 1.4 ₽, which is more than a thousand times cheaper than a million output tokens from a flagship chat model (for comparison, GPT-5.5 output is 2150 ₽ per 1M). The reason is that the embedder only "reads" the text and returns a single vector; it generates nothing. You pay only for input tokens; embeddings have no notion of "output tokens".

Let's estimate some realistic scenarios (at the estimated text-embedding-3-small rate of ≈ 1.4 ₽ per 1M):

Scenario	Volume	Approx. tokens	Cost (estimate)
Index a knowledge base of 1,000 articles	≈ 1000 × 800 tokens	0.8M	≈ 1.1 ₽
Index 100,000 catalog products	≈ 100K × 100 tokens	10M	≈ 14 ₽
1 million search queries per month	≈ 1M × 20 tokens	20M	≈ 28 ₽

Even indexing a large knowledge base plus a million queries a month adds up to tens of rubles. Embeddings are the line item in a RAG system where you usually do not need to economize: the bulk of the bill comes from the chat model that generates answers, not the embedder. So choosing large over small for quality barely affects the budget.

Important: the values in the tables are an estimate derived from OpenAI's dollar prices at the CBR rate, not a line from the catalog. Check the current rate with the team before budgeting.
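
The arithmetic behind the scenario table is simple enough to script. The sketch below hard-codes the dollar prices and the 71.668 ₽/$ rate quoted in this section; both are snapshot values from the text, not live data, and `embedding_cost_rub` is a helper name invented for this example.

```python
# Snapshot values taken from the tables above; update before real budgeting.
USD_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
RUB_PER_USD = 71.668


def embedding_cost_rub(
    items: int,
    avg_tokens: int,
    model: str = "text-embedding-3-small",
    rate: float = RUB_PER_USD,
) -> float:
    """Estimated cost in rubles of embedding `items` texts of `avg_tokens` each."""
    total_tokens = items * avg_tokens
    return total_tokens / 1_000_000 * USD_PER_MILLION_TOKENS[model] * rate


knowledge_base = embedding_cost_rub(1_000, 800)      # 0.8M tokens
catalog = embedding_cost_rub(100_000, 100)           # 10M tokens
monthly_queries = embedding_cost_rub(1_000_000, 20)  # 20M tokens
```

The results come out slightly above the table's figures because the table rounds the per-million rate to 1.4 ₽ before multiplying; the order of magnitude is the same.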

Common Mistakes When Working With Embeddings

A few pitfalls that people hit most often:

Different models for indexing and querying. Vectors from text-embedding-3-small and text-embedding-3-large live in different spaces and cannot be compared with each other. If you indexed the knowledge base with one model and embed queries with another, search returns garbage. The rule: the same model (and the same dimensionality) for indexing and for queries. If you change the model, re-index everything.

Chunks that are too large. If you cram a whole page into one chunk, its vector "averages" all the topics at once and search for a specific question becomes blurry. Split into meaningful pieces of 200–800 tokens. The opposite extreme, one-sentence chunks, loses context. The sweet spot is in between, usually a paragraph.

Ignoring the input length limit. Embedding models have a maximum number of tokens per input (8191 for text-embedding-3). Longer text is truncated or raises an error, so long documents must be split into chunks before embedding, not after.

Skipping batching. Indexing one text per request is slow and runs into rate limits. Send lists (batches of a hundred or two strings); it is faster and more resilient to limits.

Storing vectors as-is without normalization. Many vector DBs and metrics expect normalized vectors (length 1). If you compute similarity manually with a dot product, either normalize or use the full cosine formula that divides by the norms (as in the example above).

Embedding a highly multilingual corpus with the cheap model. On multilingual and highly specialized data, text-embedding-3-small sometimes falls noticeably behind large. If search "misses" on your corpus, the first thing to try is moving up to large, since the price difference is barely noticeable.
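
The normalization pitfall from the list above is easy to see on toy numbers. In this pure-Python sketch (function names are illustrative), two vectors point in exactly the same direction but have different lengths: the raw dot product is a large number that reflects magnitude, while the dot product of the normalized vectors equals the cosine similarity.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Rescale v to length 1 (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]   # length 5
b = [6.0, 8.0]   # same direction, length 10

raw_dot = dot(a, b)                               # depends on vector lengths
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # equals the cosine
```

Normalizing once at indexing time lets the database use the cheaper dot product at query time and still rank by cosine similarity.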

If you need not only to search the text but also to generate from it (answers, summaries, code), we covered choosing a chat model for the task and budget in the guide on LLMs for code, and gave an overview of the top models in the article on the top 5 LLMs of 2026. Chat models can be connected on the ChatGPT API page, with the same key and endpoint as embeddings.

Payment and Documents for Legal Entities

For companies, what matters is not only the technical side but also how API spending goes through accounting. Billing is to a Russian legal entity, with a full set of closing documents delivered through electronic document interchange (EDI): offer agreement, invoice, acceptance act, VAT invoice, and UPD (universal transfer document). Documents are posted to the accounting system automatically through EDI operators (Diadoc, SBIS).

The token price matches the provider's price list one-to-one, converted at the CBR rate with no markup on the tokens themselves; a 5% service fee is charged only when topping up the balance. Access works from Russia without a VPN: the request goes to the aggregator's endpoint, which communicates with the provider on its own side, so there is no need to tunnel traffic or mask your IP.

text-embedding-3-small Dimensions Explained: 1536 vs 1024 vs 512

Jenny Met — Sat, 06 Jun 2026 14:13:22 +0000

If you use text-embedding-3-small, one small setting can quietly affect your whole retrieval system: embedding dimensions.

The default vector length is 1536 dimensions. That is a good default. But it is not always the cheapest or fastest choice once you store millions of chunks in a vector database.

This guide explains what the dimensions of text-embedding-3-small are, when to keep 1536, when to test smaller vectors, and how to call an OpenAI-compatible embeddings endpoint with real code.

What are text-embedding-3-small dimensions?

An embedding turns text into a list of numbers. That list is a vector.

For text-embedding-3-small, the default vector has 1536 numbers. If you embed the sentence:

“API gateways help developers route model calls.”

The model returns one vector that represents the meaning of that whole input. The vector is not one number per word. It is one semantic representation for the input text you send.

You then store that vector in a vector database such as pgvector, Pinecone, Milvus, Weaviate, Chroma, or Qdrant. When a user searches, you embed the query and compare it against stored vectors.

Official OpenAI documentation states that text-embedding-3-small defaults to 1536 dimensions, while text-embedding-3-large defaults to 3072 dimensions. Both models support a dimensions parameter that can reduce the output vector length.

Default text-embedding-3-small dimensions: why 1536 is common

1536 dimensions is popular because it is the default. It is also a practical balance between quality and cost for many semantic search and RAG workloads.

Use the default 1536 dimensions when:

You are building your first retrieval system.
You do not have evaluation data yet.
Your dataset is small enough that vector storage is not painful.
Search quality matters more than a few gigabytes of storage.
You want fewer moving parts during the first launch.

That last point matters. If your app is still early, the biggest risk is usually not vector size. It is bad chunking, weak retrieval evaluation, missing metadata filters, or poor prompts.

Start simple. Then optimize.

The dimensions parameter: what changes and what does not

The dimensions parameter lets you request a shorter embedding vector.

For example, instead of asking for the default 1536-dimensional vector, you can request 1024, 768, or 512 dimensions if your provider supports it for that model.

What changes:

Area	1536 dimensions	1024 / 768 / 512 dimensions
Vector storage	Larger	Smaller
Index memory	Larger	Smaller
Search latency	Often higher	Often lower
Retrieval quality	Strong baseline	Must be tested
API input token cost	Usually unchanged	Usually unchanged

What does not usually change: the number of input tokens you send. Embedding API pricing is normally based on input tokens, not the final vector size.

That means smaller dimensions mainly help with storage, index memory, and retrieval speed. They are not a magic way to reduce the embedding generation bill.

Storage math: 1536 vs 1024 vs 512 dimensions

A float32 number uses 4 bytes. So the raw vector size is:

vector_size_bytes = dimensions × 4

For one vector:

Dimensions	Bytes per vector	Storage vs 1536
1536	6,144 bytes	Baseline
1024	4,096 bytes	~33% smaller
768	3,072 bytes	~50% smaller
512	2,048 bytes	~67% smaller

For 1 million chunks, raw float32 vector storage looks like this:

Dimensions	Raw vector storage	With rough 35% index overhead
1536	~5.72 GiB	~7.72 GiB
1024	~3.81 GiB	~5.15 GiB
768	~2.86 GiB	~3.86 GiB
512	~1.91 GiB	~2.57 GiB

This is why dimensions start to matter at scale. A small difference per vector becomes real infrastructure cost when you store millions of chunks.

Quick calculator for embedding dimensions

Here is a small Python tool you can use to estimate storage and rough generation cost.

#!/usr/bin/env python3
import argparse


def gib(n):
    return n / (1024 ** 3)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--documents", type=int, required=True)
    parser.add_argument("--avg-tokens", type=int, required=True)
    parser.add_argument("--dimensions", type=int, nargs="+", default=[1536, 1024, 768, 512])
    parser.add_argument("--price-per-million", type=float, default=0.02)
    args = parser.parse_args()

    total_tokens = args.documents * args.avg_tokens
    estimated_cost = total_tokens / 1_000_000 * args.price_per_million

    print(f"Documents: {args.documents:,}")
    print(f"Estimated input tokens: {total_tokens:,}")
    print(f"Embedding generation cost: ${estimated_cost:,.2f}")
    print()
    print("Dims  Raw GiB  With 35% index overhead")

    for dim in args.dimensions:
        raw_bytes = args.documents * dim * 4
        print(f"{dim:>4}  {gib(raw_bytes):>7.2f}  {gib(raw_bytes * 1.35):>24.2f}")


if __name__ == "__main__":
    main()

Example:

python3 embedding_dimension_calculator.py --documents 1000000 --avg-tokens 350

Example output:

Documents: 1,000,000
Estimated input tokens: 350,000,000
Embedding generation cost: $7.00

Dims  Raw GiB  With 35% index overhead
1536     5.72                      7.72
1024     3.81                      5.15
 768     2.86                      3.86
 512     1.91                      2.57

The important lesson: generation cost can stay small, while vector database cost and memory can grow quickly.

API example: default 1536 dimensions

Here is a standard OpenAI-compatible embeddings call.

curl https://crazyrouter.com/v1/embeddings \
  -H "Authorization: Bearer $CRAZYROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "Explain API gateway routing in one paragraph."
  }'

The response includes an embedding array. With the default setting, its length should be 1536.

You can use the same pattern with any OpenAI-compatible client. With Crazyrouter, you only change the base URL and API key:

Base URL: https://crazyrouter.com/v1
Endpoint: /embeddings
Auth: Authorization: Bearer YOUR_KEY

Python example: check the vector length

from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-api-key",
    base_url="https://crazyrouter.com/v1",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="A vector database stores embeddings for semantic search.",
)

vector = response.data[0].embedding
print(len(vector))  # usually 1536 by default
print(vector[:5])   # preview the first few values

Do not paste real keys into code. Use environment variables in production.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRAZYROUTER_API_KEY"],
    base_url="https://crazyrouter.com/v1",
)

Python example: request custom dimensions

If your embeddings provider supports the dimensions parameter for text-embedding-3-small, you can request a shorter vector.

from openai import OpenAI

client = OpenAI(
    api_key="your-crazyrouter-api-key",
    base_url="https://crazyrouter.com/v1",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Shorter embeddings can reduce vector database storage.",
    dimensions=1024,
)

vector = response.data[0].embedding
print(len(vector))  # expected: 1024 when supported

Important: do not mix dimensions in the same vector index. If your collection was created for 1536-dimensional vectors, a 1024-dimensional vector will usually fail at insert time.

Use one collection per dimension setting.
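
If your vector store does not enforce this for you, a small guard in your own ingestion code catches the mistake early. The sketch below is a toy in-memory index; `FixedDimIndex` is a hypothetical class written for this example, not the API of any particular vector database.

```python
from typing import Dict, List, Sequence


class FixedDimIndex:
    """Toy in-memory collection that accepts vectors of one length only."""

    def __init__(self, dimensions: int):
        self.dimensions = dimensions
        self._rows: Dict[str, List[float]] = {}

    def add(self, doc_id: str, vector: Sequence[float]) -> None:
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)} for {doc_id!r}"
            )
        self._rows[doc_id] = list(vector)

    def __len__(self) -> int:
        return len(self._rows)


index = FixedDimIndex(dimensions=1536)
index.add("chunk-a", [0.0] * 1536)

try:
    index.add("chunk-b", [0.0] * 1024)  # produced with dimensions=1024
    rejected = False
except ValueError:
    rejected = True
```

Failing loudly at insert time is cheaper than discovering mixed vector lengths later through broken similarity scores.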

Node.js example: embeddings with OpenAI-compatible base URL

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CRAZYROUTER_API_KEY,
  baseURL: "https://crazyrouter.com/v1",
});

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "Embeddings help search by meaning, not just keywords.",
});

const vector = response.data[0].embedding;
console.log(vector.length);

Which dimensions should you choose?

There is no universal best value. Choose based on evaluation, not vibes.

A practical starting point:

Use case	Suggested starting dimensions	Why
Prototype / small app	1536	Maximize quality while you learn
Support docs RAG	1536 or 1024	Quality matters, but storage can grow
Large FAQ search	1024	Often a good balance to test
High-volume semantic cache	768 or 512	Speed and memory may matter more
Legal / medical / financial retrieval	1536	Test carefully before reducing
Mobile / edge search	512 or 768	Smaller vectors are easier to move

For production, run an evaluation set. Take 50 to 200 real user queries. Label the best matching documents. Compare recall@5 or recall@10 for 1536, 1024, 768, and 512.
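Recall@k is simple to compute once you have labels. The sketch below assumes each query gives you a ranked list of retrieved document IDs and a list of labeled relevant IDs. The function names and the toy IDs are illustrative, not part of any library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant documents found in the top k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one labeled document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Toy data: two queries with hand-written results and labels.
runs = [
    (["d3", "d7", "d1", "d9", "d2"], ["d1", "d4"]),  # 1 of 2 labeled docs in top 5
    (["d5", "d6", "d8", "d0", "d4"], ["d5"]),        # 1 of 1 labeled docs in top 5
]
print(mean_recall_at_k(runs, k=5))  # 0.75
```

Run the same function over the results from each dimension setting so the numbers are directly comparable.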

If 1024 gives almost the same recall as 1536, you can cut raw vector storage and memory by about a third without hurting users.
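To see what is at stake, here is a rough storage estimate. It assumes vectors are stored as 32-bit floats and uses a hypothetical corpus of one million vectors. Index overhead and metadata are ignored, so real numbers will be higher.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def raw_vector_bytes(num_vectors, dimensions):
    """Raw size of the vectors alone, without index or metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32

NUM_VECTORS = 1_000_000  # hypothetical corpus size
for dim in (1536, 1024, 768, 512):
    size_gb = raw_vector_bytes(NUM_VECTORS, dim) / 1e9
    print(f"{dim} dimensions: {size_gb:.2f} GB")
```

Under these assumptions, 1536 dimensions costs about 6.14 GB per million vectors and 1024 about 4.10 GB.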

Common mistakes with text-embedding-3-small dimensions

Mistake 1: mixing 1536 and 1024 vectors in one index

Vector databases expect a fixed dimension per collection or index. If you change dimensions, create a new index and re-embed the corpus.

Mistake 2: optimizing dimensions before chunking

Bad chunking hurts retrieval more than a larger vector helps it.

Fix chunking first:

Keep chunks focused.
Add useful metadata.
Avoid huge mixed-topic chunks.
Test overlap instead of guessing.
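To test overlap, you need a chunker where overlap is a parameter. The sketch below splits on whitespace and counts words, which is a simplification. The function name and default sizes are assumptions for illustration, and a real pipeline would likely split on sentences or tokens.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))  # toy text: w0 w1 ... w9
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Re-chunk the same corpus with a few overlap values and compare recall on your evaluation set rather than picking one by feel.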

Mistake 3: assuming smaller dimensions reduce API cost

Embedding generation cost is usually based on input tokens. Smaller vectors reduce storage and search costs, not necessarily API call cost.

Mistake 4: choosing 512 without evaluation

512-dimensional vectors can work for some workloads, but they may lose recall on nuanced queries. Test them before moving production search to them.

Mistake 5: forgetting downstream schema changes

If you use pgvector, your schema may include a fixed dimension:

CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536)
);

If you switch to 1024, you need a different column or table, and the existing rows must be re-embedded at the new size:

CREATE TABLE documents_1024 (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1024)
);
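If you test several dimensions, generating the schema from one place keeps the table name and the vector size in sync. This is a sketch only: create_table_sql and ALLOWED_DIMS are hypothetical names, and the table naming scheme simply follows the documents_1024 pattern used here.

```python
ALLOWED_DIMS = {1536, 1024, 768, 512}  # the values compared in this article

def create_table_sql(dim):
    """Return pgvector DDL for a table dedicated to one dimension setting."""
    if dim not in ALLOWED_DIMS:
        raise ValueError(f"unsupported dimension: {dim}")
    return (
        f"CREATE TABLE documents_{dim} (\n"
        f"  id bigserial PRIMARY KEY,\n"
        f"  content text,\n"
        f"  embedding vector({dim})\n"
        f");"
    )

print(create_table_sql(1024))
```

The whitelist check also keeps arbitrary values out of the generated SQL.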

A simple evaluation workflow

Use this workflow before changing dimensions in production:

Pick 100 real user queries.
Label the correct documents for each query.
Create separate indexes for 1536, 1024, 768, and 512.
Run the same queries against each index.
Compare recall@5, recall@10, latency, and memory.
Choose the smallest dimension that does not hurt retrieval quality.

This is more reliable than reading a benchmark and hoping it matches your data.
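The last step of the workflow can be written down as a rule. The sketch below picks the smallest dimension whose recall stays within a tolerance of the 1536 baseline. The function name, the 0.01 tolerance, and the recall numbers are all hypothetical and only show the mechanics.

```python
def smallest_acceptable_dim(recall_by_dim, baseline_dim=1536, tolerance=0.01):
    """Smallest dimension whose recall is within `tolerance` of the baseline."""
    baseline = recall_by_dim[baseline_dim]
    acceptable = [
        dim for dim, recall in recall_by_dim.items()
        if recall >= baseline - tolerance
    ]
    return min(acceptable)  # the baseline itself always qualifies

# Hypothetical recall@10 values for illustration only, not measured results.
example = {1536: 0.91, 1024: 0.905, 768: 0.88, 512: 0.84}
print(smallest_acceptable_dim(example))  # 1024
```

Set the tolerance from your own quality target, and check latency and memory alongside recall before committing.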

Final recommendation

For most teams, the best first move is simple:

Start with text-embedding-3-small at 1536 dimensions.
Build a clean retrieval evaluation set.
Test 1024 once your corpus grows.
Try 768 or 512 only when storage, memory, or latency becomes important.

If you already use OpenAI-compatible tools, you can test this through Crazyrouter by setting your base URL to https://crazyrouter.com/v1 and calling /embeddings with your normal SDK.

The goal is not to use the smallest vector. The goal is to use the smallest vector that still retrieves the right answer.

FAQ: text-embedding-3-small dimensions

What are the default text-embedding-3-small dimensions?

The default text-embedding-3-small output is 1536 dimensions. That means each input text returns a vector with 1536 numeric values.

Can I change text-embedding-3-small dimensions?

Yes, when your provider supports the dimensions parameter, you can request a shorter vector. Common test values are 1024, 768, and 512.

Do smaller embedding dimensions reduce API cost?

Usually not directly. Embedding API cost is normally based on input tokens. Smaller dimensions mainly reduce vector storage, index memory, and search latency.

Is 512 dimensions enough for text-embedding-3-small?

Sometimes. It depends on your dataset and retrieval quality requirements. Use an evaluation set before using 512 dimensions in production.

Can I store 1536 and 1024 dimension vectors in the same database table?

Usually no. Most vector indexes require a fixed dimension. Create separate collections or tables when testing different dimensions.

Should I use text-embedding-3-small or text-embedding-3-large?

Use text-embedding-3-small for cost-effective general retrieval. Test text-embedding-3-large when retrieval quality is the main bottleneck and you can afford larger vectors.

What is the best dimension for RAG?

Start with 1536 for text-embedding-3-small. Then test 1024 and 768 against real queries. The best dimension is the smallest one that preserves your recall target.
