<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: embeddings</title>
    <description>The latest articles tagged 'embeddings' on DEV Community.</description>
    <link>https://dev.to/t/embeddings</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/embeddings"/>
    <language>en</language>
    <item>
      <title>Embeddings Magic</title>
      <dc:creator>Boussaden Taha</dc:creator>
      <pubDate>Sat, 27 Jun 2026 10:37:21 +0000</pubDate>
      <link>https://dev.to/tahaboussaden/embeddings-magic-2hlb</link>
      <guid>https://dev.to/tahaboussaden/embeddings-magic-2hlb</guid>
      <description>&lt;p&gt;Transforming language into geometry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Embeddings are one of the most important building blocks of modern AI applications, yet they're often treated as a black box.&lt;/p&gt;

&lt;p&gt;In this article, I'll demystify embeddings by exploring what they are, how they are created, and why they make semantic search possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Traditional Search
&lt;/h2&gt;

&lt;p&gt;Imagine searching for the phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How do I reset my password?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A traditional keyword search looks for exact or similar words. If a document instead says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Steps to recover your account credentials"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;the search may fail because the wording is completely different.&lt;/p&gt;

&lt;p&gt;Humans immediately recognize that both sentences describe the same intent, but computers on the other hand need a different way to represent meaning, and this is where embeddings come in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an Embedding?
&lt;/h2&gt;

&lt;p&gt;An embedding is a dense vector, a list of numbers that represents the semantic meaning of a piece of text. In a more simple way, an array of numerical values usually floating point numbers where almost every position holds meaningful information.&lt;/p&gt;

&lt;p&gt;Instead of treating text as a sequence of characters or words, an embedding model maps it into a high dimensional vector space.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="s2"&gt;"cat"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;↓&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers themselves have no intuitive meaning.&lt;/p&gt;

&lt;p&gt;What matters is &lt;strong&gt;where&lt;/strong&gt; the vector is &lt;strong&gt;located relative&lt;/strong&gt; to &lt;strong&gt;other vectors&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meaning Comes From Position
&lt;/h2&gt;

&lt;p&gt;Imagine a map where cities that are geographically close tend to share borders, climates, and transportation links.&lt;/p&gt;

&lt;p&gt;Well embeddings work similarly, texts with similar meanings are placed near one another in vector space.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dog
      ●

Cat
      ●

Puppy
       ●


Car                         ●

Engine                       ●

Truck                          ●
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual space may have hundreds or thousands of dimensions instead of two, but the intuition remains the same, so we conclude that the distance represents semantic similarity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Similar Meaning, Different Words
&lt;/h2&gt;

&lt;p&gt;This is where we can see embeddings stenght.&lt;/p&gt;

&lt;p&gt;In these sentences below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Reset my password"&lt;/li&gt;
&lt;li&gt;"Recover my account"&lt;/li&gt;
&lt;li&gt;"Can't log in"&lt;/li&gt;
&lt;li&gt;"Forgot my credentials"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They share very few keywords, yet an embedding model places them close together because they express similar ideas.&lt;/p&gt;

&lt;p&gt;This enables semantic search, where results are retrieved based on meaning rather than exact wording.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Similarity Is Measured
&lt;/h2&gt;

&lt;p&gt;Once text has been converted into vectors, we need a way to compare them and the most common metric is &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rather than comparing the individual numbers, cosine similarity measures the angle between two vectors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small angle → highly similar&lt;/li&gt;
&lt;li&gt;Large angle → less similar&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works surprisingly well because embedding models are trained to organize semantically related content in nearby regions of the vector space.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Embeddings Matter for RAG
&lt;/h2&gt;

&lt;p&gt;Retrieval Augmented Generation (RAG) depends heavily on embeddings, where a typical pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Documents
    │
    ▼
Embedding Model
    │
    ▼
Vectors Stored in a Vector Database
    │
    ▼
User Query
    │
    ▼
Query Embedding
    │
    ▼
Similarity Search
    │
    ▼
Relevant Documents
    │
    ▼
LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice something important:&lt;/p&gt;

&lt;p&gt;The LLM never searches your documents directly. Instead, it searches the embedding space for documents whose vectors are closest to the query.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that I scratched the surface on how these "numerical representations of text" work. Understanding embeddings is essential for anyone building LLM applications because they power everything from document retrieval to recommendation systems.&lt;/p&gt;

&lt;p&gt;Embeddings real power is not in storing vectors but in organizing them, and that what makes them so effective.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>embeddings</category>
    </item>
    <item>
      <title>Cornell Notes on Context Layer</title>
      <dc:creator>Kat Padilla</dc:creator>
      <pubDate>Tue, 23 Jun 2026 14:50:34 +0000</pubDate>
      <link>https://dev.to/katpadi/cornell-notes-on-context-layer-2ahm</link>
      <guid>https://dev.to/katpadi/cornell-notes-on-context-layer-2ahm</guid>
      <description>&lt;p&gt;An LLM can only reason about information inside its context window. It doesn't automatically know my docs, APIs, databases, Jira tickets or company data. Something has to decide what information gets shown to the model.&lt;/p&gt;

&lt;p&gt;💡&lt;/p&gt;

&lt;p&gt;That is the &lt;strong&gt;Context Layer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cue&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Layer?&lt;/td&gt;
&lt;td&gt;My current mental model is that the Context Layer is responsible for deciding what information enters the model's working memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Package?&lt;/td&gt;
&lt;td&gt;The final collection of facts, documents, relationships, and tool results that gets sent to the model. Ideally only the information needed to answer the question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Why not send everything?&lt;/td&gt;
&lt;td&gt;More context isn't automatically better. It increases cost, latency and noise. Relevance matters more than volume.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Engineering?&lt;/td&gt;
&lt;td&gt;The practice of deciding what information should enter the context window, what should be excluded, how sources should be ranked and how information should be structured.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where can context come from?&lt;/td&gt;
&lt;td&gt;Documents, databases, knowledge graphs, APIs, tools, memory systems, and other knowledge sources.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What's RAG?&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation. Instead of stuffing everything into context, retrieve only the relevant pieces when needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What problem does RAG solve?&lt;/td&gt;
&lt;td&gt;Large collections of information that won't fit inside the context window.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical RAG sources&lt;/td&gt;
&lt;td&gt;Confluence pages, PDFs, Jira tickets, documentation, wikis, and databases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What are embeddings?&lt;/td&gt;
&lt;td&gt;Numerical representations of meaning. Similar concepts end up near each other in vector space.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Why do embeddings matter?&lt;/td&gt;
&lt;td&gt;Most semantic search and RAG systems depend on embeddings to find relevant information.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vector database?&lt;/td&gt;
&lt;td&gt;A database optimized for storing and searching embeddings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge Graph?&lt;/td&gt;
&lt;td&gt;A system that stores relationships between entities, not just the entities themselves.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mental model for Knowledge Graphs&lt;/td&gt;
&lt;td&gt;A database tells me a customer exists. A Knowledge Graph tells me how that customer relates to campaigns, products, segments, and other customers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What problem does a Knowledge Graph solve?&lt;/td&gt;
&lt;td&gt;Understanding how things connect and influence one another.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is a Knowledge Graph the same as RAG?&lt;/td&gt;
&lt;td&gt;No. RAG retrieves information. Knowledge Graphs model relationships.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can RAG and Knowledge Graphs work together?&lt;/td&gt;
&lt;td&gt;Yes. A Knowledge Graph can simply be another source used by the Context Layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What are tools?&lt;/td&gt;
&lt;td&gt;Systems the model can call when it needs information it doesn't already have.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples of tools&lt;/td&gt;
&lt;td&gt;Snowflake, Jira, GitHub, CRM systems, internal APIs and databases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What is MCP?&lt;/td&gt;
&lt;td&gt;Model Context Protocol. A standard way for AI systems to discover and interact with tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Why use tools instead of storing everything in context?&lt;/td&gt;
&lt;td&gt;Some information changes constantly. It's usually better to fetch it on demand than try to keep it in memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Examples of tool usage&lt;/td&gt;
&lt;td&gt;Current revenue, latest churn rate, open incidents, deployment status, customer profile lookups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Separation of concerns&lt;/td&gt;
&lt;td&gt;Facts should remain factual. Relationships belong in graphs. Live values come from tools. Interpretation happens in the model.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Diagram
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Knowledge Sources
    │
    ├── Documents
    ├── Databases
    ├── Knowledge Graphs
    ├── APIs
    └── Tools
            ↓
      Context Layer
            ↓
      Context Package
            ↓
      Context Window
            ↓
            LLM
            ↓
         Answer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context Layer is the orchestrator.&lt;/li&gt;
&lt;li&gt;RAG retrieves information.&lt;/li&gt;
&lt;li&gt;Knowledge Graphs provide relationships.&lt;/li&gt;
&lt;li&gt;Tools provide real-time data.&lt;/li&gt;
&lt;li&gt;The Context Layer decides what enters the model's working memory.&lt;/li&gt;
&lt;li&gt;The LLM reasons over whatever makes it through.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>knowledgegraph</category>
      <category>embeddings</category>
    </item>
    <item>
      <title>Claude API Semantic Search: Embeddings Alternatives &amp; RAG</title>
      <dc:creator>Sangmin Lee</dc:creator>
      <pubDate>Mon, 22 Jun 2026 01:30:09 +0000</pubDate>
      <link>https://dev.to/claudeguide/claude-api-semantic-search-embeddings-alternatives-rag-49k5</link>
      <guid>https://dev.to/claudeguide/claude-api-semantic-search-embeddings-alternatives-rag-49k5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://claudeguide.io/claude-api-semantic-search?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=claude-api-semantic-search" rel="noopener noreferrer"&gt;claudeguide.io/claude-api-semantic-search&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Claude API for Semantic Search: Embeddings Alternatives and RAG Patterns
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Claude doesn't offer a native embeddings API, but Anthropic's recommended partner Voyage AI provides embeddings optimized for Claude — and combining Voyage embeddings with Claude's generation creates a powerful semantic search and RAG (Retrieval-Augmented Generation) pipeline that outperforms single-vendor solutions by 15-20% on retrieval accuracy benchmarks.&lt;/strong&gt; This guide covers the full architecture: embedding, indexing, retrieval, and generation.&lt;/p&gt;

&lt;p&gt;For model selection and cost trade-offs, see &lt;a href="https://dev.to/claude-haiku-sonnet-opus-which-model"&gt;Haiku vs Sonnet vs Opus&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query → Voyage AI (embed) → Vector DB (search) → Top-K docs → Claude (generate answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Voyage AI &lt;code&gt;voyage-3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;OpenAI &lt;code&gt;text-embedding-3-small&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB&lt;/td&gt;
&lt;td&gt;Pinecone / pgvector&lt;/td&gt;
&lt;td&gt;Qdrant / Weaviate / ChromaDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;Claude Sonnet&lt;/td&gt;
&lt;td&gt;Claude Haiku (for cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker&lt;/td&gt;
&lt;td&gt;Voyage &lt;code&gt;rerank-2&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Cohere Rerank&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1: Generate Embeddings with Voyage AI
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;voyageai&lt;/span&gt;

&lt;span class="n"&gt;vo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voyageai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Uses VOYAGE_API_KEY env var
&lt;/span&gt;
&lt;span class="c1"&gt;# Embed documents (batch)
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude API supports streaming responses via SSE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prompt caching reduces costs by up to 90%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool use enables function calling with type safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;doc_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;

&lt;span class="c1"&gt;# Embed a query
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I reduce Claude API costs?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Voyage AI &lt;code&gt;voyage-3&lt;/code&gt; costs $0.06 per 1M tokens — embedding 10,000 documents of ~500 tokens each costs approximately $0.30.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Store in a Vector Database
&lt;/h2&gt;

&lt;h3&gt;
  
  
  pgvector (PostgreSQL)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- voyage-3 dimensions&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO documents (content, embedding) VALUES (%s, %s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pinecone
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pinecone&lt;/span&gt;

&lt;span class="n"&gt;pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pinecone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Retrieve and Generate with Claude
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;semantic_search_and_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;## Advanced: Hybrid Search
&lt;/span&gt;
&lt;span class="n"&gt;Combine&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;better&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
python&lt;br&gt;
def hybrid_search(query: str, top_k: int = 10) -&lt;/p&gt;

</description>
      <category>embeddings</category>
      <category>rag</category>
      <category>retrieval</category>
    </item>
    <item>
      <title>What embeddings are, explained by building one</title>
      <dc:creator>I Want To Learn Programming</dc:creator>
      <pubDate>Sun, 14 Jun 2026 14:00:07 +0000</pubDate>
      <link>https://dev.to/iwtlp/what-embeddings-are-explained-by-building-one-4j5m</link>
      <guid>https://dev.to/iwtlp/what-embeddings-are-explained-by-building-one-4j5m</guid>
      <description>&lt;p&gt;Embeddings are behind search, recommendations, and most of modern AI, and they are usually explained with intimidating diagrams. The core idea is simple and worth building yourself: an embedding turns a thing (a word, a document, a product) into a list of numbers (a vector) so that similar things end up close together in that number space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why turn things into vectors
&lt;/h2&gt;

&lt;p&gt;Computers cannot compare meaning directly, but they can compare vectors. If "king" and "queen" are nearby points, and "king" and "banana" are far apart, then "closeness of vectors" becomes a usable stand-in for "similarity of meaning." Once your items are vectors, search and recommendation become geometry: find the nearest points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring closeness
&lt;/h2&gt;

&lt;p&gt;The standard measure is cosine similarity, the angle between two vectors. Identical direction scores 1, unrelated scores near 0, opposite scores -1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With just this, you can already build a tiny semantic search: embed your documents, embed the query, and return the documents with the highest cosine similarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  A first embedding you can build by hand
&lt;/h2&gt;

&lt;p&gt;You do not need a neural network to feel the idea. A simple bag-of-words vector already places similar documents near each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two documents about the same topic share words, so their vectors point in a similar direction, so cosine similarity is high. That is the whole mechanism, in miniature. Real embeddings (word2vec, or the ones inside large language models) learn far richer vectors where direction captures meaning, not just word overlap, but the principle is identical: similar things, nearby vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why building it matters
&lt;/h2&gt;

&lt;p&gt;Once you have built a vector space and searched it by cosine similarity, the buzzwords resolve: a "vector database" is a store of these vectors with fast nearest-neighbor search; "semantic search" is exactly what you just did; retrieval for AI is embedding your documents and finding the closest ones to a question. You will understand the systems instead of trusting them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build it for real
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://iwtlp.com/track/ai-python" rel="noopener noreferrer"&gt;AI and Deep Learning track&lt;/a&gt; builds embeddings from scratch, from counting vectors to learned representations and the attention that powers transformers, all graded in your browser. The first project is free.&lt;/p&gt;

&lt;p&gt;Turn things into vectors, and a huge amount of modern AI becomes geometry you can reason about.&lt;/p&gt;

</description>
      <category>embeddings</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Memory and State in Claude Agents: Patterns That Scale</title>
      <dc:creator>Sangmin Lee</dc:creator>
      <pubDate>Sat, 13 Jun 2026 01:31:36 +0000</pubDate>
      <link>https://dev.to/claudeguide/memory-and-state-in-claude-agents-patterns-that-scale-3f1f</link>
      <guid>https://dev.to/claudeguide/memory-and-state-in-claude-agents-patterns-that-scale-3f1f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://claudeguide.io/claude-agent-memory-patterns?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=claude-agent-memory-patterns" rel="noopener noreferrer"&gt;claudeguide.io/claude-agent-memory-patterns&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Memory and State in Claude Agents: Patterns That Scale
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Claude agents don't have persistent memory between API calls — each call starts fresh. Adding memory means deciding what to store, where to store it, and how much to bring back into context on the next call. The four patterns that cover 90% of production needs are: conversation history (in-context), summary compression (compressed context), external memory (vector search), and explicit state (structured data).&lt;/strong&gt; This guide covers when to use each and how to implement them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory Problem
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Call 1
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is Alex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
&lt;span class="c1"&gt;# Claude: "Hi Alex!"
&lt;/span&gt;
&lt;span class="c1"&gt;# Call 2 — Claude has no memory of call 1
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my name?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
&lt;span class="c1"&gt;# Claude: "I don't know your name."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every conversation must carry its own context. The question is how much and in what form.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pattern 1: Full Conversation History (In-Context)
&lt;/h2&gt;

&lt;p&gt;The simplest approach — append every turn to a running messages list.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
import anthropic

client = anthropic.Anthropic()


class ConversationAgent:
    """Agent that maintains full conversation history in context."""

    def __init__(self, system: str, max_turns: int = 50):
        self.system = system
        self.messages = []
        self.max_turns = max_turns

    def chat(self, user_message: str) -

[→ Get the Agent SDK Cookbook — $49](https://shoutfirst.gumroad.com/l/ogxhmy?utm_source=claudeguide&amp;amp;utm_medium=article&amp;amp;utm_campaign=claude-agent-memory-patterns)

*30-day money-back guarantee. Instant download.*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>memory</category>
      <category>embeddings</category>
      <category>production</category>
    </item>
    <item>
      <title>Build a RAG Pipeline From Scratch (Production Patterns That Actually Matter)</title>
      <dc:creator>Umesh Malik</dc:creator>
      <pubDate>Fri, 12 Jun 2026 20:19:37 +0000</pubDate>
      <link>https://dev.to/umesh_malik/build-a-rag-pipeline-from-scratch-production-patterns-that-actually-matter-2gh6</link>
      <guid>https://dev.to/umesh_malik/build-a-rag-pipeline-from-scratch-production-patterns-that-actually-matter-2gh6</guid>
      <description>&lt;p&gt;Most RAG tutorials stop at "embed your docs, do a similarity search, stuff the results in a prompt." That gets you a demo. It does not get you something that gives correct, grounded answers on real data — and the gap between those two is where all the actual engineering lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A RAG pipeline is a series of stages, and a weak link in any one of them caps the quality of the whole thing.&lt;/strong&gt; You can have a frontier model and a beautiful prompt, and still ship garbage if your chunking is wrong. So this is the pipeline end to end, with the production patterns that decide whether it works — not just the happy-path demo.&lt;/p&gt;

&lt;p&gt;If you're still deciding whether RAG is even the right tool versus fine-tuning, read &lt;a href="https://umesh-malik.com/blog/rag-vs-fine-tuning-llms-2026" rel="noopener noreferrer"&gt;RAG vs Fine-Tuning for LLMs&lt;/a&gt; first. This post assumes you've decided to retrieve.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RAG is a &lt;strong&gt;pipeline&lt;/strong&gt;: ingest → chunk → embed → store → retrieve → generate. The output is only as good as the weakest stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval quality is everything.&lt;/strong&gt; Most "the LLM hallucinated" bugs are actually "the right chunk never got retrieved" bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk on meaning, not character counts.&lt;/strong&gt; Semantic boundaries plus light overlap beat fixed-size splits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't rely on vector search alone.&lt;/strong&gt; Hybrid (keyword + vector) retrieval with a reranker is the production default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground the generation.&lt;/strong&gt; Pass only retrieved context, require citations, and refuse when context is thin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can't improve what you don't measure.&lt;/strong&gt; Build a retrieval eval before you tune anything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The pipeline, stage by stage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Ingestion
&lt;/h3&gt;

&lt;p&gt;Load your sources and &lt;strong&gt;clean them before anything else&lt;/strong&gt;. Strip boilerplate, nav chrome, and duplicated headers/footers. Garbage in here propagates through every downstream stage and you'll never trace the bad answer back to it. Preserve structure — headings, lists, tables — because that structure is what makes good chunking possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Chunking — where most pipelines quietly fail
&lt;/h3&gt;

&lt;p&gt;Chunking is the highest-leverage, most-underrated stage. The naive move is to split every document into fixed 500-character windows. Don't. Fixed-size splitting severs sentences and merges unrelated ideas, and then retrieval surfaces fragments that don't mean anything on their own.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Split on semantic boundaries&lt;/strong&gt; — headings, paragraphs, list items. Respect the document's own structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One idea per chunk.&lt;/strong&gt; A chunk should be retrievable and self-contained.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add light overlap&lt;/strong&gt; so context isn't cut mid-thought between adjacent chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attach metadata&lt;/strong&gt; to every chunk: source, title, section, date, URL. You'll use it for filtering and citations.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chunk = {
  id, text,
  metadata: { source, title, section, url, date }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Key insight&lt;/strong&gt;: If retrieval is bad, fix chunking before you touch the model or the prompt. The retriever can only find what chunking made findable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Embedding
&lt;/h3&gt;

&lt;p&gt;Turn each chunk into a vector with an embedding model. Two rules that save pain later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embed the same way at index time and query time.&lt;/strong&gt; Same model, same preprocessing. A mismatch silently wrecks relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your embeddings.&lt;/strong&gt; When you change the embedding model, you must re-embed the whole corpus. Track which model produced which vectors so you know when a reindex is due.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Storage
&lt;/h3&gt;

&lt;p&gt;Store vectors in an index that does fast similarity search &lt;strong&gt;with metadata filtering&lt;/strong&gt;. You don't necessarily need a dedicated vector database — &lt;code&gt;pgvector&lt;/code&gt; on the Postgres you already run handles a surprising amount before a specialized store (Qdrant, Weaviate, Pinecone) earns its keep.&lt;/p&gt;

&lt;p&gt;What actually matters: filtering. "Search only this customer's docs" or "only documents from the last year" is a metadata &lt;code&gt;WHERE&lt;/code&gt; clause combined with vector similarity. Without it, retrieval leaks across boundaries it shouldn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Retrieval — go hybrid, then rerank
&lt;/h3&gt;

&lt;p&gt;This is the stage that most separates a demo from a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector search alone is not enough.&lt;/strong&gt; Embeddings are great at semantic similarity and bad at exact matches — error codes, product SKUs, proper nouns, acronyms. Keyword search (BM25) is the opposite. &lt;strong&gt;Hybrid retrieval runs both and merges the results&lt;/strong&gt;, so you catch both "what they meant" and "the exact term they typed."&lt;/p&gt;

&lt;p&gt;Then &lt;strong&gt;rerank&lt;/strong&gt;. Initial retrieval optimizes for recall — pull a generous candidate set (say, top 20). A cross-encoder reranker then scores those candidates against the query far more precisely and keeps the top handful you'll actually pass to the model. Retrieve broad, rerank narrow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;candidates = vectorSearch(q, k=20) ∪ keywordSearch(q, k=20)
top = rerank(q, candidates)[:5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Grounded generation
&lt;/h3&gt;

&lt;p&gt;Now — and only now — the LLM. The job here is to keep it honest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pass only the retrieved context.&lt;/strong&gt; Don't let the model fall back on parametric memory for facts it should be reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require citations.&lt;/strong&gt; Ask it to cite the chunk/source for each claim. Citations are both a UX feature and a hallucination check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give it permission to say "I don't know."&lt;/strong&gt; If the retrieved context doesn't answer the question, the correct output is a refusal, not a confident guess. Tell it that explicitly.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System: Answer ONLY from the context below. Cite sources by id.
If the context doesn't contain the answer, say you don't know.

Context:
[1] {chunk_1}
[2] {chunk_2}
...

Question: {user_query}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The patterns that separate prod from demo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid + rerank&lt;/strong&gt;, not bare vector search. The single biggest quality jump.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata filtering&lt;/strong&gt; for security and scoping — never retrieve across tenant or permission boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citations and refusal&lt;/strong&gt; wired into the prompt, so wrong answers become "I don't know" instead of confident fiction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching.&lt;/strong&gt; Cache embeddings (don't re-embed unchanged chunks) and cache answers to repeated queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A retrieval eval set.&lt;/strong&gt; A fixed set of question → expected-source pairs you can score on every change.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed-size chunking.&lt;/strong&gt; The default that quietly caps your ceiling. Chunk on meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector-only retrieval.&lt;/strong&gt; You'll miss exact-match queries every time. Add keyword search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No reranking.&lt;/strong&gt; Stuffing the raw top-k into the prompt wastes context on near-misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuning the prompt to fix a retrieval problem.&lt;/strong&gt; If the right chunk isn't retrieved, the prompt is irrelevant. Diagnose retrieval first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No evaluation.&lt;/strong&gt; "It looks better" isn't a metric. Without an eval set you're guessing, and you'll regress silently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Measure retrieval separately from generation.&lt;/strong&gt; Most failures are retrieval failures; isolate them. Track recall on your eval set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk on structure, then iterate.&lt;/strong&gt; Start with semantic boundaries and light overlap; adjust based on retrieval scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default to hybrid + rerank.&lt;/strong&gt; Treat it as the baseline, not an optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter by metadata for scope and security.&lt;/strong&gt; Especially in multi-tenant systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force grounding and citations.&lt;/strong&gt; Answer only from context; cite; allow "I don't know."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-embed on model change.&lt;/strong&gt; Version vectors so you know when a reindex is required.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RAG isn't one trick — it's a pipeline, and quality is set by its weakest stage. Get chunking right, retrieve hybrid and rerank, ground the generation, and &lt;em&gt;measure retrieval&lt;/em&gt; so you're improving the right thing. Do that and you cross the line from impressive demo to a system people can trust with real questions.&lt;/p&gt;

&lt;p&gt;Skip the engineering — relying on naive chunking and bare vector search — and you'll ship something that demos well and fails the moment real users ask real questions.&lt;/p&gt;

&lt;p&gt;Go deeper across &lt;a href="https://umesh-malik.com/topics/llm-engineering" rel="noopener noreferrer"&gt;LLM Engineering — RAG, Fine-Tuning &amp;amp; Production LLMs&lt;/a&gt;, revisit the &lt;a href="https://umesh-malik.com/blog/rag-vs-fine-tuning-llms-2026" rel="noopener noreferrer"&gt;RAG vs Fine-Tuning decision framework&lt;/a&gt;, or explore &lt;a href="https://umesh-malik.com/topics/ai-coding-agents" rel="noopener noreferrer"&gt;AI Coding Agents&lt;/a&gt; for the agentic side of LLM systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explore more:&lt;/strong&gt; &lt;a href="https://umesh-malik.com/topics/llm-engineering" rel="noopener noreferrer"&gt;LLM Engineering&lt;/a&gt; · &lt;a href="https://umesh-malik.com/topics/ai-coding-agents" rel="noopener noreferrer"&gt;AI Coding Agents&lt;/a&gt; · &lt;a href="https://umesh-malik.com/topics/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://umesh-malik.com/blog/build-rag-pipeline-from-scratch" rel="noopener noreferrer"&gt;umesh-malik.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep reading on umesh-malik.com:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://umesh-malik.com/blog/rag-vs-fine-tuning-llms-2026" rel="noopener noreferrer"&gt;RAG vs Fine-Tuning for LLMs in 2026: A Production Decision Framework With Real Tradeoffs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://umesh-malik.com/blog/autonomous-ai-agents-production-gap-2026" rel="noopener noreferrer"&gt;AI Agents That Run the Business in 2026: Why 77% Never Reach Production (and What the 23% Do Differently)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://umesh-malik.com/blog/how-to-build-mcp-server" rel="noopener noreferrer"&gt;How to Build a Production MCP Server (I Added One to My Site)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>llmengineering</category>
      <category>vectordatabases</category>
      <category>embeddings</category>
    </item>
    <item>
      <title>Кэширование LLM-ответов: Redis, semantic cache и экономия 40-70% на API</title>
      <dc:creator>Promptra Team</dc:creator>
      <pubDate>Wed, 10 Jun 2026 09:05:02 +0000</pubDate>
      <link>https://dev.to/promptra-team/keshirovaniie-llm-otvietov-redis-semantic-cache-i-ekonomiia-40-70-na-api-57l2</link>
      <guid>https://dev.to/promptra-team/keshirovaniie-llm-otvietov-redis-semantic-cache-i-ekonomiia-40-70-na-api-57l2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf2z68o7byp0ekwxfvdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf2z68o7byp0ekwxfvdb.png" alt="Инфографика LLM-кэша: три уровня — in-memory LRU, Redis exact-match с TTL, semantic cache через embeddings и Qdrant, поверх график экономии 40-70% и счётчик сэкономленных рублей; плоский векторный стиль" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM-API — самая дорогая зависимость в стеке. На FAQ-боте с 100K запросов в день Claude Opus 4.7 (350/1790 ₽ за 1М токенов) выливается в 250–400 тыс ₽ в месяц. &lt;strong&gt;Половину этой суммы можно вернуть кэшированием&lt;/strong&gt;: точные повторы запросов часто составляют 20–40%, перефразированные близкие — ещё 20–30%, итого 40–70% запросов вообще не должны доходить до модели. Через единый шлюз &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;Promptra&lt;/a&gt; prompt caching от Anthropic и OpenAI пробрасывается без изменений, что плюсом срезает 60–80% input-стоимости для агентов с длинным system prompt.&lt;/p&gt;

&lt;p&gt;Этот гайд — три уровня кэширования с готовым кодом: &lt;strong&gt;in-memory LRU&lt;/strong&gt; для микро-кэша последних N запросов в одном процессе, &lt;strong&gt;Redis exact-match&lt;/strong&gt; с TTL для распределённого кэша между worker'ами, &lt;strong&gt;semantic cache&lt;/strong&gt; через embeddings и &lt;a href="https://qdrant.tech" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; для семантически близких запросов. С реальными бенчмарками cost savings, инвалидацией, защитой от ложных срабатываний. оплата в рублях по договору, полный пакет закрывающих документов.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR — три уровня кэша
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Уровень&lt;/th&gt;
&lt;th&gt;Где&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;Когда&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1 in-memory LRU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAM процесса&lt;/td&gt;
&lt;td&gt;5–15%&lt;/td&gt;
&lt;td&gt;Микро-кэш горячих запросов, один worker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2 Redis exact-match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Распределённый&lt;/td&gt;
&lt;td&gt;15–35%&lt;/td&gt;
&lt;td&gt;Точные повторы между worker'ами&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3 semantic cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qdrant + embeddings&lt;/td&gt;
&lt;td&gt;25–45% сверху L2&lt;/td&gt;
&lt;td&gt;Перефразированные вопросы, FAQ-боты&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bonus prompt cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;На стороне провайдера&lt;/td&gt;
&lt;td&gt;60–80% input savings&lt;/td&gt;
&lt;td&gt;Агенты с длинным system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Суммарно при удачной комбинации — 50–70% запросов не доходят до flagship-модели.&lt;/p&gt;

&lt;h2&gt;
  
  
  L1: in-memory LRU для горячих запросов
&lt;/h2&gt;

&lt;p&gt;Самый простой уровень — кэш в RAM одного worker'а через &lt;code&gt;functools.lru_cache&lt;/code&gt; или его asyncio-вариант. Подходит для маленьких микросервисов с одним процессом. Эта статья — production-расширение нашего pillar-гида &lt;a href="https://promptra.ru/blog/llm-api-na-python-polnyy-tehnicheskiy-gid-2026/" rel="noopener noreferrer"&gt;полный технический гид по LLM API на Python: токены, function calling, streaming, RAG, async/batch&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;messages_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Стабильный hash от messages и model.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_cached_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages_json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Внутренний кэш, ключ — hash. Содержит JSON-сериализованный ответ.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_with_l1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;messages_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_cached_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Плюсы: одна строка, нет внешних зависимостей. Минусы: кэш живёт только в одном процессе, теряется при рестарте, не масштабируется на 10+ worker'ов. Подходит для CLI-тулов, dev-режима, маленьких pet-проектов.&lt;/p&gt;

&lt;h2&gt;
  
  
  L2: Redis exact-match с TTL
&lt;/h2&gt;

&lt;p&gt;Базовый production-кэш: hash запроса как ключ, сериализованный ответ как значение, TTL по типу контента. Работает между всеми worker'ами.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis.asyncio&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379/3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-promptra-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.promptra.ru/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CACHE_TTL_BY_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# 24 часа для статичных FAQ
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;14400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# 4 часа для summary
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# 30 минут для агентских ответов
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# 5 минут для новостей
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_with_l2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Метрика hit
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:hits:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Метрика miss
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:misses:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CACHE_TTL_BY_TYPE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ключевые моменты:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sort_keys=True + ensure_ascii=False&lt;/strong&gt; — стабильный hash независимо от порядка ключей и unicode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL по типу&lt;/strong&gt; — статичные FAQ живут долго, новости — недолго.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Префикс &lt;code&gt;llm:&amp;lt;kind&amp;gt;:&amp;lt;model&amp;gt;:&lt;/code&gt;&lt;/strong&gt; — упрощает инвалидацию по типу или модели (&lt;code&gt;SCAN llm:faq:* + DEL&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Метрики hits/misses&lt;/strong&gt; — обязательны для оценки эффективности.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hit rate exact-match cache: на FAQ-боте обычно 20–40%, на агентских pipeline — 5–15%, на креативной генерации — почти 0. Для перефразированных вопросов нужен следующий уровень.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdannw4m5k9mdebzewtlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdannw4m5k9mdebzewtlf.png" alt="Схема L2 Redis exact-match cache: запрос приходит, хешируется SHA-256, ищется в Redis, при hit (примерно 25-35%) сразу возврат за 2 мс, при miss идёт в LLM, ответ сохраняется с TTL; терракотовая выноска «экономия 25% запросов»; заголовок «L2 exact-match cache»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  L3: semantic cache через embeddings
&lt;/h2&gt;

&lt;p&gt;Exact-match не ловит перефразированное: «как открыть API ключ» и «где взять токен для API» — разные строки, но один смысл. Semantic cache решает это через embeddings.&lt;/p&gt;

&lt;p&gt;Архитектура:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;На запрос считаем embedding (используем дешёвую модель вроде &lt;code&gt;text-embedding-3-small&lt;/code&gt; или DeepSeek embeddings).&lt;/li&gt;
&lt;li&gt;Ищем в Qdrant top-1 точку с косинусным сходством &amp;gt; 0.92.&lt;/li&gt;
&lt;li&gt;Если найдена — возвращаем кэшированный ответ.&lt;/li&gt;
&lt;li&gt;Если нет — вызываем LLM, сохраняем embedding+ответ в Qdrant.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointStruct&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-promptra-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.promptra.ru/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qdrant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6333&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_semantic_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;EMBEDDING_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# 1536 dim
&lt;/span&gt;&lt;span class="n"&gt;SIMILARITY_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;

&lt;span class="c1"&gt;# Один раз при старте
&lt;/span&gt;&lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recreate_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBEDDING_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Берём последнее user-сообщение как query для embedding.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_with_semantic_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Слишком короткий запрос — semantic cache бесполезен
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;raw_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Embedding запроса
&lt;/span&gt;    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Поиск в Qdrant
&lt;/span&gt;    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;with_payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;SIMILARITY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Hit
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Miss — идём в LLM
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="c1"&gt;# Сохраняем
&lt;/span&gt;    &lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Параметры на тюнинг:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SIMILARITY_THRESHOLD = 0.92&lt;/strong&gt; — стартовая точка. Ниже 0.88 — много ложных срабатываний. Выше 0.96 — почти ничего не хитится. Для FAQ — 0.93. Для технических вопросов — 0.96.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EMBEDDING_MODEL&lt;/strong&gt; — дешёвая модель достаточна. text-embedding-3-small или DeepSeek embeddings (через Promptra) дают качество, сопоставимое с LLM-pump для FAQ.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Длина query &amp;gt; 10 символов&lt;/strong&gt; — embeddings от слишком коротких запросов нестабильны.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Подробнее про embeddings и RAG-стек — &lt;a href="https://promptra.ru/blog/embeddings-i-vektornyy-poisk-rag-stek-2026" rel="noopener noreferrer"&gt;Embeddings и векторный поиск: RAG-стек 2026&lt;/a&gt;. Про подбор embedding-модели для русского — &lt;a href="https://promptra.ru/blog/embeddings-api-rossiya" rel="noopener noreferrer"&gt;Embeddings API в России&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikgwr38tw2mkic2fcx3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikgwr38tw2mkic2fcx3m.png" alt="Схема semantic cache: входной запрос «как взять API ключ?», embedding 1536-dim, поиск в Qdrant top-1 с порогом 0.92, найден похожий «где получить токен для API» с score 0.94, возврат закэшированного ответа без вызова LLM; заголовок «Semantic cache через embeddings»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Защита от ложных срабатываний
&lt;/h2&gt;

&lt;p&gt;Semantic cache опасен когда близкие по смыслу запросы требуют разных ответов. Классические сценарии и решения:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Запросы с параметрами / ID.&lt;/strong&gt; «Заказы клиента 42» и «заказы клиента 43» — embedding почти идентичен (0.97+), но ответы разные. Решение — исключить такие запросы из semantic cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="n"&gt;ID_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{3,}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Длинные числа
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b[A-Z]{2,}\d{2,}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# SKU/Article codes
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@\w+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Mentions
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https?://\S+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# URLs
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;has_parametric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ID_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_with_smart_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;has_parametric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Параметрический запрос — только exact-match L2
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;llm_with_l2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parametric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Чистый текст — можно semantic
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;llm_with_semantic_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Контекстно-зависимые запросы.&lt;/strong&gt; «Расскажи подробнее» в чате означает разное в зависимости от истории. Решение — кэшировать только single-turn запросы (длина messages = 1–2), или включать hash от system prompt в ключ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Запросы с датой/временем.&lt;/strong&gt; «Покажи новости сегодня» — сегодня меняется. Кэшировать на короткий TTL (5–15 минут) и инвалидировать в полночь.&lt;/p&gt;

&lt;p&gt;Дополнительная защита — &lt;strong&gt;whitelist по интентам&lt;/strong&gt;: классифицируете запрос в один из N интентов (FAQ, news, agent, code), и semantic cache работает только для FAQ-интента.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt caching: бонус от провайдера
&lt;/h2&gt;

&lt;p&gt;OpenAI и Anthropic кэшируют префикс промта на своей стороне. Это другой механизм, не путать с собственным кэшем:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic prompt caching&lt;/strong&gt;: пометка &lt;code&gt;cache_control: {type: "ephemeral"}&lt;/code&gt; на блоке system или последнем сообщении. Первый запрос — обычная цена + 25% за запись. Последующие в течение 5 минут — ×0.1 от input. Документация — &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic prompt caching&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI prompt caching&lt;/strong&gt;: автоматическое для промтов &amp;gt;1024 токенов. Префикс кэшируется на 5–10 минут, повтор стоит ×0.5 от input. Заголовок ответа &lt;code&gt;prompt_cache_hit_tokens&lt;/code&gt; показывает количество cached токенов.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Через Promptra prompt caching работает без изменений — просто передаёте те же параметры:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Anthropic-стиль через Promptra
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ты помощник по продуктам Promptra. &amp;lt;Длинный system prompt 5000+ токенов&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# пробрасывается напрямую
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Экономия для агентов с long system prompt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Сценарий&lt;/th&gt;
&lt;th&gt;Без cache&lt;/th&gt;
&lt;th&gt;С prompt cache&lt;/th&gt;
&lt;th&gt;Экономия&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent с 8K system, 200 RPS, GPT-5.5&lt;/td&gt;
&lt;td&gt;8K × 200 × 350 ₽/M = 560 ₽/мин на input&lt;/td&gt;
&lt;td&gt;8K × 200 × 175 ₽/M = 280 ₽/мин&lt;/td&gt;
&lt;td&gt;50% input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent с 15K system, 50 RPS, Opus 4.7&lt;/td&gt;
&lt;td&gt;15K × 50 × 350 ₽/M = 263 ₽/мин&lt;/td&gt;
&lt;td&gt;15K × 50 × 35 ₽/M = 26 ₽/мин&lt;/td&gt;
&lt;td&gt;90% input&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Для агентов с богатым system prompt экономия только на prompt caching достигает 200K-600K ₽/месяц. См. также &lt;a href="https://promptra.ru/blog/sravnenie-cen-llm-2026-tochnye-tarify-v-rublyah" rel="noopener noreferrer"&gt;Сравнение цен LLM 2026&lt;/a&gt; для расчёта по конкретной модели.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mxr58i3xrldwe8jf17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6mxr58i3xrldwe8jf17.png" alt="Сравнительная диаграмма стоимости с prompt caching: вертикальные столбцы для агента с 8K system prompt — «без кэша» 100% input cost, «OpenAI cache hit» 50%, «Anthropic ephemeral cache» 10%; терракотовый акцент на самом низком; заголовок «Prompt caching экономит 50-90% на input»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Реальные бенчмарки cost savings
&lt;/h2&gt;

&lt;p&gt;Production-замеры с FAQ-бота (русские пользователи, Mostly support questions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Сценарий: 100K запросов/день, GPT-5.5 (350/2150 ₽/M)
Средний запрос: 800 input + 400 output токенов

Baseline (без кэша):
- 100K × (800 × 350 + 400 × 2150) / 1M = 28 000 + 86 000 = 114 000 ₽/день
- 3.4 млн ₽/месяц

+L2 Redis exact-match (hit 28%):
- 72K вызовов LLM × средняя цена = 114K × 0.72 = 82 080 ₽/день
- Экономия 28% = 31 920 ₽/день, 957 600 ₽/месяц

+L3 semantic cache (cumulative hit 55%):
- 45K вызовов LLM × средняя цена = 114K × 0.45 = 51 300 ₽/день
- Экономия 55% = 62 700 ₽/день, 1 881 000 ₽/месяц
- Минус embedding-стоимость (для 45K hits + 100K queries × $0.00002) ≈ 700 ₽/день
- Net 62K ₽/день = 1.86M ₽/месяц

+Prompt caching на system prompt (агентский use case):
- На каждый вызов экономия 50% input → ещё минус 6% от итога
- Net 1.91M ₽/месяц экономии = 22.9M ₽/год
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Реальные числа зависят от природы трафика. Для уникального креатива (генерация маркетингового текста) hit rate &amp;lt; 5%, экономия минимальна. Для FAQ-ботов, support чатов и интент-классификации — 50–70% типично.&lt;/p&gt;

&lt;p&gt;Инфраструктура semantic cache:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis 8 GB RAM — 600–1200 ₽/мес (Yandex Cloud).&lt;/li&gt;
&lt;li&gt;Qdrant 1 vCPU + 2 GB RAM на 100K точек — 500–800 ₽/мес.&lt;/li&gt;
&lt;li&gt;Embedding-вызовы (text-embedding-3-small) — около $0.02 на 1M токенов = 1.43 ₽/M. На 100K запросов с 200 токенов = 28.6 ₽/день.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Итого инфра semantic cache — 1500–2000 ₽/мес против экономии 60–200K ₽/мес. ROI положительный с первого дня. Подробнее про async batch как ещё один способ экономии — &lt;a href="https://promptra.ru/blog/async-i-batch-api-llm-50-procentov-skidka-perfomance" rel="noopener noreferrer"&gt;Async и Batch API LLM: 50% скидка&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Инвалидация: что и когда выкидывать
&lt;/h2&gt;

&lt;p&gt;Кэш без инвалидации — это утечка времени. Стандартные триггеры:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Время (TTL)&lt;/strong&gt; — встроено в Redis SETEX. Для Qdrant — отдельный crontab cleanup по полю &lt;code&gt;created_at&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Смена модели&lt;/strong&gt; — все ответы под старой моделью становятся неактуальны. Префикс ключа включает имя модели → &lt;code&gt;DROP COLLECTION&lt;/code&gt; или &lt;code&gt;DELETE FROM ... WHERE model = old_model&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Смена system prompt&lt;/strong&gt; — если изменили промт агента, старые ответы инвалидируются. Включайте hash от system prompt в cache key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Обновление документации (для RAG)&lt;/strong&gt; — при reingest корпуса инвалидируете весь кэш ответов на основе старых документов.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ручная инвалидация&lt;/strong&gt; — админский endpoint &lt;code&gt;POST /admin/cache/invalidate&lt;/code&gt; с фильтром (kind/model/pattern).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bulk инвалидация по паттерну
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_by_kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;deleted&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deleted&lt;/span&gt;

&lt;span class="c1"&gt;# Qdrant инвалидация
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_semantic_by_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;points_selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;old_model&lt;/span&gt;&lt;span class="p"&gt;}}]}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Production-чеклист
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;L2 Redis&lt;/strong&gt; обязателен в production — никаких прямых вызовов LLM на повторяющиеся запросы.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;TTL по типу контента&lt;/strong&gt; — FAQ 24h, summary 4h, agent 30min, news 5min.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Cache key включает model&lt;/strong&gt; — миграция между моделями не отдаёт устаревшие ответы.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Метрики hits/misses&lt;/strong&gt; по типам — обязательны для оценки ROI.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;L3 semantic cache&lt;/strong&gt; для FAQ-ботов и support — 25–45% сверху L2.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;SIMILARITY_THRESHOLD 0.92&lt;/strong&gt; стартово, тюнить под качество.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Параметрические запросы (с ID, URL)&lt;/strong&gt; — исключать из semantic cache.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Prompt caching&lt;/strong&gt; для агентов с long system prompt — экономия 50–90% input.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Инвалидация по смене модели и system prompt&lt;/strong&gt; — обязательна.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Cleanup Qdrant раз в день&lt;/strong&gt; — точки старше TTL удаляются.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Admin endpoint&lt;/strong&gt; для ручной инвалидации.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Алерт на hit rate &amp;lt; 15%&lt;/strong&gt; — что-то сломалось, либо трафик уникален и кэш не нужен.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Через Promptra все провайдеры доступны через &lt;code&gt;base_url="https://api.promptra.ru/v1"&lt;/code&gt; — кэшированный код работает одинаково для Opus, GPT и Gemini, что упрощает A/B-тесты моделей с сохранением hit rate. Подробнее про миграцию между провайдерами — &lt;a href="https://promptra.ru/blog/migraciya-iz-openai-na-promptra-python-sdk-za-10-minut" rel="noopener noreferrer"&gt;Миграция с OpenAI на Promptra за 10 минут&lt;/a&gt;. Про подсчёт токенов до отправки и оптимизацию payload — &lt;a href="https://promptra.ru/blog/kak-schitat-tokeny-llm-tokenizer-stoimost-zaprosa" rel="noopener noreferrer"&gt;Как считать токены LLM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21toc5mi1u8rbvc15na6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21toc5mi1u8rbvc15na6.png" alt="Финальная инфографика 4 уровней кэша: вертикальные слои сверху вниз — «L1 in-memory LRU (5-15%)», «L2 Redis exact-match (15-35%)», «L3 semantic cache (25-45% сверху)», «Bonus prompt caching (50-90% на input)»; справа суммарный счётчик экономии «до 1.9M ₽/месяц»; заголовок «4 слоя кэша = 50-70% экономии»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Анти-паттерны
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Кэш без TTL&lt;/strong&gt; — данные устаревают, ответы становятся неверными.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache key без модели&lt;/strong&gt; — после миграции на новую модель отдаёте старые ответы.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic cache на параметрические запросы&lt;/strong&gt; — отвечаете про клиента 42 на запрос про клиента 43.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Слишком низкий SIMILARITY_THRESHOLD&lt;/strong&gt; (&amp;lt; 0.88) — много ложных hits, плохой UX.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Слишком высокий threshold&lt;/strong&gt; (&amp;gt; 0.97) — почти не хитится, инфра впустую.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Игнорировать prompt caching&lt;/strong&gt; — теряете 50–90% input savings на агентах.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Кэш на write-операциях&lt;/strong&gt; — нельзя кэшировать вызовы с tool calls которые меняют state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Не мониторить hit rate&lt;/strong&gt; — не знаете эффективность.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Запасные варианты
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain &lt;code&gt;RedisCache&lt;/code&gt; / &lt;code&gt;SemanticCache&lt;/code&gt;&lt;/strong&gt; — готовые интеграции, но менее гибкие. Подойдут для прототипа.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPTCache&lt;/strong&gt; — отдельная библиотека под Python с подключаемыми хранилищами. Удобно для исследования.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Batch API&lt;/strong&gt; — для офлайн-обработки 50% скидка вместо кэша. Подходит когда задержка 24ч приемлема.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic prompt caching&lt;/strong&gt; — обязательно при длинных system prompt, экономия 90% input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Для FAQ-бота на 50K-200K запросов в день оптимальный стек — Redis L2 + Qdrant semantic L3 + prompt caching. Окупается за 1–3 дня и стабильно работает годами.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u5z4wied30v04c7tqz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u5z4wied30v04c7tqz0.png" alt="Финальная сводная инфографика: горизонтальная стрелка времени, сверху три слоя кэша с маркерами hit rate, снизу столбчатая диаграмма «экономия по сценариям»: FAQ-bot 65%, support chat 50%, agent с system 75%, креатив 5%; терракотовый акцент на максимальной экономии; заголовок «Где кэш работает, а где нет»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Promptra&lt;/strong&gt; — Russian LLM API aggregator. One OpenAI-compatible endpoint to all flagship models: OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Sonnet 4.6), Google (Gemini 3.1 Pro, 3.5 Flash), DeepSeek V4 Pro, Qwen 3.6 Plus.&lt;/p&gt;

&lt;p&gt;Provider prices 1-to-1 at CBR rate — no markup on tokens. Ruble billing per contract, full closing documents through EDI. No VPN — legal B2B service in Russia.&lt;/p&gt;

&lt;p&gt;Try: &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;promptra.ru&lt;/a&gt; · &lt;a href="https://promptra.ru/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; · &lt;a href="https://promptra.ru/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cache</category>
      <category>redis</category>
      <category>semanticcache</category>
      <category>embeddings</category>
    </item>
    <item>
      <title>Embeddings и векторный поиск: полный RAG-стек 2026 для русскоязычных проектов</title>
      <dc:creator>Promptra Team</dc:creator>
      <pubDate>Mon, 08 Jun 2026 15:37:55 +0000</pubDate>
      <link>https://dev.to/promptra-team/embeddings-i-viektornyi-poisk-polnyi-rag-stiek-2026-dlia-russkoiazychnykh-proiektov-1198</link>
      <guid>https://dev.to/promptra-team/embeddings-i-viektornyi-poisk-polnyi-rag-stiek-2026-dlia-russkoiazychnykh-proiektov-1198</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzspwpnynx0uz6ny1zykz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzspwpnynx0uz6ny1zykz.png" alt="Архитектурная инфографика RAG-стека 2026: блок документов проходит chunking, embeddings модель, попадает в векторную БД, рядом блок query идёт через embedding в поиск top-k и передаётся в LLM с контекстом, плоский векторный стиль" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; — это паттерн, при котором LLM получает релевантный контекст из вашей базы знаний прежде, чем формулировать ответ. Это то, как ChatGPT с подключёнными документами, Notion AI и большинство B2B-чатботов на нейросетях знают вашу специфику. Под капотом — embeddings (текст превращается в вектор), векторная БД (хранит и ищет векторы по похожести), и LLM (получает топ-k найденных кусков и отвечает).&lt;/p&gt;

&lt;p&gt;В этом гайде — полный стек 2026 для русскоязычных проектов через единый шлюз &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;Promptra&lt;/a&gt;: модели embeddings (OpenAI text-embedding-3, Cohere v3, Voyage 3), векторные БД (pgvector, Qdrant, Chroma), стратегии chunking, hybrid search, и метрики оценки качества retrieval. Цены в рублях по курсу ЦБ, оплата в рублях по договору, полный пакет закрывающих документов.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR — RAG-стек за 5 компонентов
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt; — режем документы на куски 500–1500 токенов с overlap или по семантическим границам.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt; — превращаем каждый chunk в вектор 1024–3072 dim через embeddings модель.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector DB&lt;/strong&gt; — храним векторы с metadata в Qdrant или pgvector, индекс HNSW для быстрого поиска.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt; — на запрос пользователя считаем его embedding, ищем топ-k ближайших векторов (cosine similarity).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt; — отдаём найденные chunks как контекст в LLM (Claude Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro), модель отвечает. Эта статья — часть pillar-гида: &lt;a href="https://promptra.ru/blog/llm-api-na-python-polnyy-tehnicheskiy-gid-2026/" rel="noopener noreferrer"&gt;полный технический гид по LLM API на Python — токены, function calling, streaming, RAG, batch&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;С hybrid search и rerank качество retrieve в 2 раза выше чем у чистого vector search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 1. Embeddings — какую модель выбрать
&lt;/h2&gt;

&lt;p&gt;Embedding-модель превращает текст в вектор фиксированной длины. Близкие по смыслу тексты получают близкие векторы (по cosine similarity). От качества embeddings зависит &lt;strong&gt;всё&lt;/strong&gt; — никакой LLM не спасёт, если retrieve вернул нерелевантные куски.&lt;/p&gt;

&lt;p&gt;Топ-модели 2026 для русского языка:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Модель&lt;/th&gt;
&lt;th&gt;Dim&lt;/th&gt;
&lt;th&gt;Контекст&lt;/th&gt;
&lt;th&gt;Цена $/1M&lt;/th&gt;
&lt;th&gt;Сильные стороны&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI text-embedding-3-large&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;$0.13&lt;/td&gt;
&lt;td&gt;Флагман, Matryoshka (можно урезать)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI text-embedding-3-small&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;$0.02&lt;/td&gt;
&lt;td&gt;Дёшево, baseline, RU adequate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohere embed-multilingual-v3.0&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;Лучше всех на multilingual задачах&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voyage 3&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;$0.06&lt;/td&gt;
&lt;td&gt;Premium качество для retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen-embed-v2&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;Сильная альтернатива на русском&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Через Promptra все доступны по OpenAI-совместимому endpoint'у — меняется только &lt;code&gt;model&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-promptra-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.promptra.ru/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Текст первого документа.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Текст второго документа.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;encoding_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Получили &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; векторов размерности &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Подробное сравнение моделей и цен — в материале &lt;a href="https://promptra.ru/blog/embeddings-api-rossiya" rel="noopener noreferrer"&gt;«Embeddings API из России»&lt;/a&gt;. Спецификация OpenAI embeddings — в &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;официальном гайде&lt;/a&gt;; подробности Voyage 3 — на &lt;a href="https://docs.voyageai.com/docs/embeddings" rel="noopener noreferrer"&gt;сайте Voyage AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rn01e5vd72vqzm4oqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5rn01e5vd72vqzm4oqr.png" alt="Сравнительная инфографика моделей embeddings: 5 блоков-карточек в ряд с подписями «text-embed-3-large 3072 dim», «text-embed-3-small 1536», «Cohere v3 1024», «Voyage 3 1024», «Qwen-embed-v2 1024», у каждой звёздочки качества и цены за 1М в рублях; заголовок «Модели embeddings для русского, 2026»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 2. Chunking — стратегии разбиения
&lt;/h2&gt;

&lt;p&gt;Документы редко помещаются в один вектор. Нужно разбить на куски (chunks) такого размера, чтобы каждый был &lt;strong&gt;семантически целостным&lt;/strong&gt; и помещался в контекст embedding-модели.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fixed-size с overlap — baseline
&lt;/h3&gt;

&lt;p&gt;Самый простой подход: режем по N символов с overlap M:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fixed_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Работает, но режет посреди предложений. Качество retrieve среднее.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic chunking — по предложениям/параграфам
&lt;/h3&gt;

&lt;p&gt;Используем &lt;code&gt;nltk&lt;/code&gt; или &lt;code&gt;spacy&lt;/code&gt; для разбиения на предложения, потом собираем чанки до целевого размера:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sent_tokenize&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;semantic_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap_sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sent_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;russian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# overlap: берём последние N предложений в следующий chunk
&lt;/span&gt;            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;overlap_sentences&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="n"&gt;current_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_size&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Качество retrieve существенно выше — куски всегда целые.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hierarchical chunking — лучший на 2026
&lt;/h3&gt;

&lt;p&gt;Идея: индексируем мелкие chunks (для точного поиска), а возвращаем родительские (для богатого контекста):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HierarchicalChunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;child_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;   &lt;span class="c1"&gt;# 200-400 токенов — для embedding и поиска
&lt;/span&gt;    &lt;span class="n"&gt;parent_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# 1500-2000 токенов — отдаём в LLM
&lt;/span&gt;    &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;child_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hierarchical_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HierarchicalChunk&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;parents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;semantic_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;children&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;semantic_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HierarchicalChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;child_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;parent_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p_idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;child_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p_idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c_idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Это побеждает на 90% задач — точность поиска от маленьких chunks, богатство контекста от больших. Для технических доков — chunking по заголовкам markdown. Для кода — по функциям/классам. Для диалогов — по поворотам.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyvbrkr6msnr56bwiqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibyvbrkr6msnr56bwiqs.png" alt="Сравнение трёх стратегий chunking: «Fixed-size» — куски одного размера резко обрезанные, «Semantic» — куски по предложениям ровные, «Hierarchical» — большие parents и внутри них мелкие children с подписью «индексируем child, отдаём parent» терракотовая; заголовок «Стратегии разбиения документов»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 3. Векторная БД — pgvector / Qdrant / Chroma
&lt;/h2&gt;

&lt;p&gt;Сохраняем векторы вместе с metadata (id, source, chunk_text) и ищем по cosine similarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  pgvector — если у вас уже Postgres
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_text&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent_text&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;VECTOR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;-- размерность под Voyage 3 / Cohere&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ef_construction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Запрос top-k:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Из Python через psycopg или asyncpg. Плюс: ACID, обычные SQL-бэкапы, JOIN с реляционными данными. Минус: на сотнях миллионов векторов производительность падает — для такого масштаба нужен Qdrant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qdrant — production-grade open-source
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointStruct&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# upsert
&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parent_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# search
&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;FieldCondition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MatchValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))]),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Qdrant — российская команда, доступен on-prem, до сотен миллионов векторов, фильтрация по metadata из коробки, hybrid search встроенный.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chroma — для прототипирования
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Единственная зависимость &lt;code&gt;pip install chromadb&lt;/code&gt;, in-memory или file-based. Для прода с серьёзным трафиком не годится.&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 4. Hybrid search — vector + BM25
&lt;/h2&gt;

&lt;p&gt;Чистый vector search проигрывает на точных совпадениях имён, кодов, дат. Hybrid комбинирует semantic + keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# В Qdrant — встроенный hybrid через named vectors
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FusionQuery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Fusion&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dense_query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sparse_query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bm25&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FusionQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fusion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Fusion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RRF&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# Reciprocal Rank Fusion
&lt;/span&gt;    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RRF (Reciprocal Rank Fusion)&lt;/strong&gt; — стандартный способ объединять списки. Формула: &lt;code&gt;score = sum(1 / (k + rank_i))&lt;/code&gt;, где k обычно 60. Документ, который попал в топ обоих списков — получает больший score, чем тот, что в топе только одного.&lt;/p&gt;

&lt;p&gt;Качество retrieve на технических документах с hybrid обычно растёт на 10–20% Recall@10.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt1827tldaf6sd8hpjpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt1827tldaf6sd8hpjpg.png" alt="Схема hybrid search: запрос «как настроить токен» идёт двумя путями — вверх в «Semantic search → top 20 by cosine», вниз в «BM25 keyword search → top 20 by tf-idf», обе линии сходятся в блок «RRF fusion → top 10», финальная стрелка к «LLM с контекстом»; заголовок «Hybrid: семантика + ключевые слова»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 5. Rerank — последний штрих качества
&lt;/h2&gt;

&lt;p&gt;Поверх hybrid search ставится &lt;strong&gt;reranker&lt;/strong&gt;: модель, которая для каждой пары &lt;code&gt;(query, chunk)&lt;/code&gt; даёт точный relevance score. Это дорого (один inference на каждый retrieved chunk), но даёт лучший top-10 из retrieved 50–100.&lt;/p&gt;

&lt;p&gt;Топ-модели rerank:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cohere rerank-multilingual-v3.0&lt;/strong&gt; — production-grade, низкая latency, ~$2 за 1M запросов&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voyage rerank-2&lt;/strong&gt; — premium качество, ~$3 за 1M&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BGE-reranker-v2-m3&lt;/strong&gt; — open-source, можно self-host
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# через Cohere SDK поверх обычного retrieve
&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rerank_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohere_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank-multilingual-v3.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rerank_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recall@10 после rerank обычно +15–25% над hybrid. Совокупный uplift &lt;code&gt;embedding → +chunking → +hybrid → +rerank&lt;/code&gt; — это разница между «работает терпимо» и «работает отлично».&lt;/p&gt;

&lt;h2&gt;
  
  
  Шаг 6. Generation — отдаём контекст в LLM
&lt;/h2&gt;

&lt;p&gt;Финальный шаг — собираем prompt и вызываем LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ты — ассистент по технической документации. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Отвечай ТОЛЬКО на основе предоставленного контекста. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Если в контексте нет ответа — скажи &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;В документации нет ответа на этот вопрос&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Не выдумывай факты.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Контекст:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Вопрос: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Ответь подробно.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Что важно:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Жёсткий system message&lt;/strong&gt; — заставляет модель не выдумывать вне контекста. Это снижает галлюцинации.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Указание источников&lt;/strong&gt; — попросите модель отдавать &lt;code&gt;[source_1]&lt;/code&gt;-references на куски, это даёт пользователю верификацию.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Длина контекста vs стоимость&lt;/strong&gt; — каждый retrieved chunk это input-токены. На Opus 4.7 (350 ₽ за 1M input) 10 chunks по 1500 токенов = 15K токенов = 5.25 ₽ на запрос. Если запросов 100K в месяц — 525K ₽. Цены — в &lt;a href="https://promptra.ru/blog/sravnenie-cen-llm-2026-tochnye-tarify-v-rublyah" rel="noopener noreferrer"&gt;«Сравнение цен LLM 2026»&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Для дешёвого RAG используют Gemini 3.1 Pro (140/860 ₽) или DeepSeek V4 Pro (30/60 ₽) на маленьких задачах. Для критичных продуктовых ответов — Opus 4.7 или GPT-5.5.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation — без неё это чёрный ящик
&lt;/h2&gt;

&lt;p&gt;RAG нельзя выкатить в прод без замера качества. Минимальная eval:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Соберите 50–200 пар &lt;code&gt;(query, ids_of_relevant_chunks)&lt;/code&gt; вручную или через LLM-as-judge.&lt;/li&gt;
&lt;li&gt;Прогоните &lt;code&gt;query → embed → search → top-k&lt;/code&gt;, замерьте:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall@k&lt;/strong&gt; — доля релевантных, попавших в top-k. Цель Recall@10 &amp;gt; 0.85.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRR&lt;/strong&gt; — Mean Reciprocal Rank, насколько высоко стоит первый релевантный. Цель &amp;gt; 0.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NDCG@k&lt;/strong&gt; — учитывает порядок результатов, премиум-метрика.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Если ниже — меняйте модель embeddings, chunking, добавляйте rerank, тюньте hybrid weight.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Дополнительно — &lt;strong&gt;end-to-end eval&lt;/strong&gt;: даёте LLM-судье (через тот же Opus 4.7 / GPT-5.5) пары &lt;code&gt;(query, generated_answer, ground_truth)&lt;/code&gt;, спрашиваете «насколько ответ точен и обоснован контекстом». Это позволяет ловить регрессии не только в retrieve, но и в формулировке.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8e8ad8bcbw1z1yzs4c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8e8ad8bcbw1z1yzs4c6.png" alt="Дашборд retrieval-метрик: 4 карточки в ряд — «Recall@10: 0.87 цель», «MRR: 0.62», «NDCG@10: 0.79», «E2E judge score: 4.3/5», под ними график изменения метрик при апгрейдах: «baseline → +rerank → +hybrid» с восходящей терракотовой линией; заголовок «Evaluation: без неё это чёрный ящик»" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Стоимость RAG на 1M документов
&lt;/h2&gt;

&lt;p&gt;Пример экономики для базы 1M документов, средний размер 500 токенов, 10K запросов в день:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Статья&lt;/th&gt;
&lt;th&gt;Расчёт&lt;/th&gt;
&lt;th&gt;Сумма&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding индексации&lt;/td&gt;
&lt;td&gt;500M токенов × $0.06 (Voyage 3)&lt;/td&gt;
&lt;td&gt;~430K ₽ разово&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB&lt;/td&gt;
&lt;td&gt;Qdrant on-prem (4 vCPU, 32GB RAM)&lt;/td&gt;
&lt;td&gt;~25K ₽/мес&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding запросов&lt;/td&gt;
&lt;td&gt;10K × 50 токенов × 30 дней&lt;/td&gt;
&lt;td&gt;~650 ₽/мес&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rerank&lt;/td&gt;
&lt;td&gt;10K × 50 × 30 (Cohere v3)&lt;/td&gt;
&lt;td&gt;~6500 ₽/мес&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM generation (Opus 4.7)&lt;/td&gt;
&lt;td&gt;10K × 15K input + 800 output&lt;/td&gt;
&lt;td&gt;~2.27 М ₽/мес&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM на дешёвом (Gemini 3.1 Pro)&lt;/td&gt;
&lt;td&gt;10K × 15K input + 800 output&lt;/td&gt;
&lt;td&gt;~700K ₽/мес&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Главная статья — это LLM-generation. Если ответы не требуют флагмана — переход с Opus 4.7 на Gemini 3.1 Pro экономит 1.5 М ₽/мес. Через Promptra это смена &lt;code&gt;model&lt;/code&gt; в одной строке.&lt;/p&gt;

&lt;h2&gt;
  
  
  Оплата и закрывающие документы
&lt;/h2&gt;

&lt;p&gt;Все компоненты RAG-стека — embeddings, rerank, LLM generation — оплачиваются на юр.лицо российское юр.лицо, резидент РФ. Сервисная комиссия 5% берётся только при пополнении баланса, на токены наценки нет, цены строго по курсу ЦБ. Полный пакет закрывающих документов (договор-оферта, счёт на оплату, акт оказанных услуг, счёт-фактура, УПД) приходит через ЭДО — Диадок, СБИС, Контур. Подробнее — на странице &lt;a href="https://promptra.ru/pricing" rel="noopener noreferrer"&gt;«Тарифы»&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Что дальше
&lt;/h2&gt;

&lt;p&gt;RAG в 2026 — это &lt;strong&gt;не один запрос к LLM с приклеенным контекстом&lt;/strong&gt;. Это связка: chunking → embedding → vector DB → hybrid search → rerank → generation, и evaluation поверх каждого слоя. С правильной архитектурой Recall@10 переваливает 0.9, end-to-end judge score выходит за 4.5/5, и продукт перестаёт галлюцинировать. Полезные следующие шаги: &lt;a href="https://promptra.ru/blog/function-calling-tool-use-llm-2026-python-praktika" rel="noopener noreferrer"&gt;«Function calling и tool use»&lt;/a&gt; для агентов-аналитиков поверх RAG, &lt;a href="https://promptra.ru/blog/async-i-batch-api-llm-50-procentov-skidka-perfomance" rel="noopener noreferrer"&gt;«Async-вызовы и Batch API»&lt;/a&gt; для индексации больших баз и &lt;a href="https://promptra.ru/blog/kak-schitat-tokeny-llm-tokenizer-stoimost-zaprosa" rel="noopener noreferrer"&gt;«Как считать токены в LLM»&lt;/a&gt; для расчёта стоимости context window. Если нужно подобрать модель под ваш RAG или подключить ключ через юрлицо — &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;напишите команде Promptra в Telegram&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📚 &lt;strong&gt;Главный гайд по теме:&lt;/strong&gt; &lt;a href="https://promptra.ru/blog/luchshaya-neyroset-2026/" rel="noopener noreferrer"&gt;Лучшая нейросеть 2026: какую LLM выбрать под задачу&lt;/a&gt; — связанные материалы и обзор всей категории.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Promptra&lt;/strong&gt; — Russian LLM API aggregator. One OpenAI-compatible endpoint to all flagship models: OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Sonnet 4.6), Google (Gemini 3.1 Pro, 3.5 Flash), DeepSeek V4 Pro, Qwen 3.6 Plus.&lt;/p&gt;

&lt;p&gt;Provider prices 1-to-1 at CBR rate — no markup on tokens. Ruble billing per contract, full closing documents through EDI. No VPN — legal B2B service in Russia.&lt;/p&gt;

&lt;p&gt;Try: &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;promptra.ru&lt;/a&gt; · &lt;a href="https://promptra.ru/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; · &lt;a href="https://promptra.ru/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>embeddings</category>
      <category>rag</category>
      <category>qdrant</category>
      <category>pgvector</category>
    </item>
    <item>
      <title>Vector Database Tutorial: From Zero to RAG Agent in 2026</title>
      <dc:creator>Iniyarajan</dc:creator>
      <pubDate>Mon, 08 Jun 2026 08:25:21 +0000</pubDate>
      <link>https://dev.to/iniyarajan86/vector-database-tutorial-from-zero-to-rag-agent-in-2026-4b0c</link>
      <guid>https://dev.to/iniyarajan86/vector-database-tutorial-from-zero-to-rag-agent-in-2026-4b0c</guid>
      <description>&lt;p&gt;&lt;strong&gt;Common misconception&lt;/strong&gt;: Vector databases are just fancy storage systems. The truth? They're the foundation that makes AI agents truly intelligent.&lt;/p&gt;

&lt;p&gt;We're in 2026, and vector databases have become the backbone of every production RAG system. Whether you're building a customer support agent or a code assistant, understanding how vectors work isn't optional anymore. Let's walk through building a complete RAG agent together, starting from the basics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtega0rg0vi54fdous4f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtega0rg0vi54fdous4f.jpeg" alt="vector database" width="800" height="418"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://www.pexels.com/@brett-sayles" rel="noopener noreferrer"&gt;Brett Sayles&lt;/a&gt; on &lt;a href="https://pexels.com" rel="noopener noreferrer"&gt;Pexels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What Makes Vector Databases Different&lt;/li&gt;
&lt;li&gt;Setting Up Your First Vector Database&lt;/li&gt;
&lt;li&gt;Building a RAG Pipeline&lt;/li&gt;
&lt;li&gt;Creating an AI Agent with Memory&lt;/li&gt;
&lt;li&gt;Production Considerations&lt;/li&gt;
&lt;li&gt;Frequently Asked Questions&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What Makes Vector Databases Different
&lt;/h2&gt;

&lt;p&gt;Traditional databases store data in rows and columns. Vector databases store mathematical representations of data — embeddings — that capture semantic meaning. When we ask "How do I deploy my app?", a vector database doesn't just match keywords. It understands that this relates to deployment, DevOps, and infrastructure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Related&lt;/strong&gt;: &lt;a href="https://dev.to/iniyarajan86/vector-database-tutorial-building-smart-ai-agents-with-rag-2b50"&gt;Vector Database Tutorial: Building Smart AI Agents with RAG&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The magic happens in the similarity search. Vector databases use algorithms like HNSW (Hierarchical Navigable Small World) to find the most relevant documents in milliseconds, even with millions of entries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Also read&lt;/strong&gt;: &lt;a href="https://dev.to/iniyarajan86/building-robust-ai-agent-memory-systems-in-2026-173l"&gt;Building Robust AI Agent Memory Systems in 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICBBW_Cfk4QgUmF3IERvY3VtZW50c10gLS0-IEJb8J-noCBFbWJlZGRpbmcgTW9kZWxdCiAgQiAtLT4gQ1vwn5OKIFZlY3RvciBFbWJlZGRpbmdzXQogIEMgLS0-IERb8J-XhO-4jyBWZWN0b3IgRGF0YWJhc2VdCiAgRVvinZMgVXNlciBRdWVyeV0gLS0-IEIKICBCIC0tPiBGW_Cfk4ogUXVlcnkgVmVjdG9yXQogIEYgLS0-IEQKICBEIC0tPiBHW_CflI0gU2ltaWxhcml0eSBTZWFyY2hdCiAgRyAtLT4gSFvwn5OLIFJlbGV2YW50IERvY3VtZW50c10%3Ftheme%3Ddark%26bgColor%3D1a1a2e" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICBBW_Cfk4QgUmF3IERvY3VtZW50c10gLS0-IEJb8J-noCBFbWJlZGRpbmcgTW9kZWxdCiAgQiAtLT4gQ1vwn5OKIFZlY3RvciBFbWJlZGRpbmdzXQogIEMgLS0-IERb8J-XhO-4jyBWZWN0b3IgRGF0YWJhc2VdCiAgRVvinZMgVXNlciBRdWVyeV0gLS0-IEIKICBCIC0tPiBGW_Cfk4ogUXVlcnkgVmVjdG9yXQogIEYgLS0-IEQKICBEIC0tPiBHW_CflI0gU2ltaWxhcml0eSBTZWFyY2hdCiAgRyAtLT4gSFvwn5OLIFJlbGV2YW50IERvY3VtZW50c10%3Ftheme%3Ddark%26bgColor%3D1a1a2e" alt="System Architecture" width="467" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting for AI agents. We can store not just documents, but conversation history, user preferences, and contextual information as vectors. This gives our agents semantic memory — they remember not just what happened, but what it means.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting Up Your First Vector Database
&lt;/h2&gt;

&lt;p&gt;We'll use Pinecone for this vector database tutorial because it's production-ready and developer-friendly. But the concepts apply to any vector database.&lt;/p&gt;

&lt;p&gt;First, let's create our vector space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Pinecone
&lt;/span&gt;&lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west1-gcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create index with 1536 dimensions (OpenAI embeddings)
&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-agent-memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_indexes&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert text to vector embedding&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store document as vector in database&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{})}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find similar documents to query&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup gives us the foundation for semantic search. But for a production RAG agent, we need more structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;A robust RAG pipeline handles document preprocessing, chunking, and retrieval orchestration. Here's our complete system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICBBW_Cfk4EgRG9jdW1lbnRzXSAtLT4gQnvwn5OPIENodW5rIFNpemV9CiAgQiAtLT4gQ1vinILvuI8gVGV4dCBTcGxpdHRlcl0KICBDIC0tPiBEW_Cfp6AgR2VuZXJhdGUgRW1iZWRkaW5nc10KICBEIC0tPiBFW_Cfl4TvuI8gU3RvcmUgaW4gVmVjdG9yIERCXQogIEZb4p2TIFVzZXIgUXVlcnldIC0tPiBHW_CflI0gU2VtYW50aWMgU2VhcmNoXQogIEcgLS0-IEUKICBFIC0tPiBIW_Cfk4sgUmV0cmlldmVkIENvbnRleHRdCiAgSCAtLT4gSVvwn6SWIExMTSBHZW5lcmF0aW9uXQogIEkgLS0-IEpb4pyFIFJlc3BvbnNlXQ%3Ftheme%3Ddark%26bgColor%3D1a1a2e" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICBBW_Cfk4EgRG9jdW1lbnRzXSAtLT4gQnvwn5OPIENodW5rIFNpemV9CiAgQiAtLT4gQ1vinILvuI8gVGV4dCBTcGxpdHRlcl0KICBDIC0tPiBEW_Cfp6AgR2VuZXJhdGUgRW1iZWRkaW5nc10KICBEIC0tPiBFW_Cfl4TvuI8gU3RvcmUgaW4gVmVjdG9yIERCXQogIEZb4p2TIFVzZXIgUXVlcnldIC0tPiBHW_CflI0gU2VtYW50aWMgU2VhcmNoXQogIEcgLS0-IEUKICBFIC0tPiBIW_Cfk4sgUmV0cmlldmVkIENvbnRleHRdCiAgSCAtLT4gSVvwn6SWIExMTSBHZW5lcmF0aW9uXQogIEkgLS0-IEpb4pyFIFJlc3BvbnNlXQ%3Ftheme%3Ddark%26bgColor%3D1a1a2e" alt="Process Flowchart" width="1888" height="227"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RAGAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Add documents to vector database with chunking&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Split into chunks (simple approach)
&lt;/span&gt;            &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_chunk_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Split text into overlapping chunks&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Query with RAG pipeline&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Retrieve relevant context
&lt;/span&gt;        &lt;span class="n"&gt;context_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Include conversation memory
&lt;/span&gt;        &lt;span class="n"&gt;memory_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# Last 3 exchanges
&lt;/span&gt;        &lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Context from knowledge base:
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Previous conversation:
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Current question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Please provide a helpful response based on the context and conversation history.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that answers questions based on provided context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="c1"&gt;# Store in conversation memory
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What makes this different from a simple chatbot? The vector database gives our agent semantic understanding of your knowledge base, and the memory system maintains context across conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an AI Agent with Memory
&lt;/h2&gt;

&lt;p&gt;Real AI agents need more than just document retrieval. They need episodic memory — remembering past interactions, user preferences, and learned behaviors. We can store all of this as vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoryEnhancedAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAGAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_interaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store user interaction as vector for future reference&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;memory_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;interaction_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memory_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve relevant user history for personalized responses&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Search for relevant past interactions
&lt;/span&gt;        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$eq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
            &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;personalized_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer with personalized context from user history&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Get user's relevant history
&lt;/span&gt;        &lt;span class="n"&gt;user_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Combine with knowledge base context
&lt;/span&gt;        &lt;span class="n"&gt;kb_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_similar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate personalized response
&lt;/span&gt;        &lt;span class="n"&gt;context_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s past interaction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_context&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;kb_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;kb_context&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s relevant history:
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Knowledge base context:
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kb_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Current question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Provide a personalized response considering the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s history and preferences.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a personalized assistant that adapts to user preferences and history.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="c1"&gt;# Store this interaction for future reference
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_interaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;A: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach transforms our RAG system into a true AI agent. It learns from every interaction and becomes more helpful over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;p&gt;Building production RAG agents requires thinking beyond the happy path. Here are the challenges we need to address:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding Model Selection&lt;/strong&gt;: Different models excel at different tasks. &lt;code&gt;text-embedding-ada-002&lt;/code&gt; is general-purpose, but specialized models like &lt;code&gt;text-embedding-3-large&lt;/code&gt; offer better performance for specific domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database Scaling&lt;/strong&gt;: Pinecone handles scaling automatically, but self-hosted options like Weaviate or Qdrant require capacity planning. Consider your query volume and storage requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk Strategy&lt;/strong&gt;: Simple text splitting isn't enough for complex documents. Consider semantic chunking that preserves context boundaries, or hierarchical chunking for structured data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation and Monitoring&lt;/strong&gt;: RAG systems can hallucinate or retrieve irrelevant context. Implement evaluation metrics like context relevance and answer faithfulness. Tools like LangSmith or Weights &amp;amp; Biases help track performance over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy and Security&lt;/strong&gt;: Vector embeddings can leak information about source documents. For sensitive data, consider techniques like differential privacy or encrypted vector search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Optimization&lt;/strong&gt;: Embedding generation and vector storage costs add up. Batch embedding requests, use caching for frequent queries, and implement tiered storage for older data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Which vector database should I choose for production?
&lt;/h3&gt;

&lt;p&gt;For beginners, start with Pinecone for its managed service and excellent documentation. If you need self-hosted solutions, Weaviate offers great performance with GraphQL queries, while Qdrant provides Rust-based speed with Python APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How do I handle documents that are too large for embedding models?
&lt;/h3&gt;

&lt;p&gt;Use hierarchical chunking: create summary embeddings for entire documents and detailed embeddings for chunks. Store both in your vector database with different metadata tags, then query summaries first and drill down to relevant chunks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can vector databases replace traditional databases entirely?
&lt;/h3&gt;

&lt;p&gt;No, they're complementary. Use vector databases for semantic search and similarity matching, but keep structured data in traditional databases. Many production systems use both, with vector databases handling AI features and SQL databases managing business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How do I evaluate if my RAG system is working well?
&lt;/h3&gt;

&lt;p&gt;Track three key metrics: retrieval accuracy (are relevant documents found?), context relevance (is retrieved content useful?), and answer faithfulness (does the generated response stay true to the context?). Tools like RAGAS provide automated evaluation frameworks.&lt;/p&gt;

&lt;p&gt;Vector databases have evolved from experimental technology to production necessity in 2026. They're the foundation that makes AI agents truly intelligent — capable of understanding context, remembering interactions, and providing personalized experiences.&lt;/p&gt;

&lt;p&gt;The key insight? Don't think of vector databases as just storage. Think of them as the memory system that gives your AI agents the ability to learn, adapt, and become more helpful over time. That's what separates a simple chatbot from a truly intelligent agent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Need a server? &lt;a href="https://m.do.co/c/f0a5b173fd4c" rel="noopener noreferrer"&gt;Get $200 free credits on DigitalOcean&lt;/a&gt; to deploy your AI apps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Resources I Recommend
&lt;/h2&gt;

&lt;p&gt;If you're diving deeper into RAG and vector databases, &lt;a href="https://www.amazon.in/s?k=rag+vector+database+llm&amp;amp;tag=iniyarajan86-21" rel="noopener noreferrer"&gt;these RAG and vector database books&lt;/a&gt; provide comprehensive coverage of production patterns and advanced techniques that complement this tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Might Also Like
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/iniyarajan86/vector-database-tutorial-building-smart-ai-agents-with-rag-2b50"&gt;Vector Database Tutorial: Building Smart AI Agents with RAG&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/iniyarajan86/building-robust-ai-agent-memory-systems-in-2026-173l"&gt;Building Robust AI Agent Memory Systems in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/iniyarajan86/langchain-tutorial-for-beginners-build-your-first-ai-agent-3klo"&gt;LangChain Tutorial for Beginners: Build Your First AI Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📘 Go Deeper: Building AI Agents: A Practical Developer's Guide
&lt;/h2&gt;

&lt;p&gt;185 pages covering autonomous systems, RAG, multi-agent workflows, and production deployment — with complete code examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://iniyarajan.gumroad.com/l/building-ai-agents" rel="noopener noreferrer"&gt;Get the ebook →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Also check out: *&lt;/em&gt;&lt;a href="https://iniyarajan.gumroad.com/l/ai-ios-apps" rel="noopener noreferrer"&gt;AI-Powered iOS Apps: CoreML to Claude&lt;/a&gt;***&lt;/p&gt;

&lt;h2&gt;
  
  
  Enjoyed this article?
&lt;/h2&gt;

&lt;p&gt;I write daily about &lt;strong&gt;iOS development, AI, and modern tech&lt;/strong&gt; — practical tips you can use right away.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow me on &lt;a href="https://dev.to/iniyarajan86"&gt;Dev.to&lt;/a&gt; for daily articles&lt;/li&gt;
&lt;li&gt;Follow me on &lt;a href="https://iniyarajanhashnodedev.hashnode.dev" rel="noopener noreferrer"&gt;Hashnode&lt;/a&gt; for in-depth tutorials&lt;/li&gt;
&lt;li&gt;Follow me on &lt;a href="https://medium.com/@iniyarajan" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; for more stories&lt;/li&gt;
&lt;li&gt;Connect on &lt;a href="https://twitter.com/iniyaniOS" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; for quick tips&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If this helped you, drop a like and share it with a fellow developer!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>aiagents</category>
      <category>embeddings</category>
    </item>
    <item>
      <title>Cluster-Aware Retrieval for RAG Systems</title>
      <dc:creator>Alex Towell</dc:creator>
      <pubDate>Sun, 07 Jun 2026 03:03:46 +0000</pubDate>
      <link>https://dev.to/queelius/cluster-aware-retrieval-for-rag-systems-2o2n</link>
      <guid>https://dev.to/queelius/cluster-aware-retrieval-for-rag-systems-2o2n</guid>
      <description>&lt;p&gt;Most RAG systems treat embedding spaces as flat, uniform distributions. They're not. Real knowledge bases contain distinct semantic clusters: database docs, frontend frameworks, DevOps practices, each with different internal structure. Ignoring this wastes retrieval precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Flat Retrieval
&lt;/h2&gt;

&lt;p&gt;A query about "React hooks optimization" should pull from the frontend cluster, not equally consider database or infrastructure docs that happen to share semantic overlap. Standard cosine similarity doesn't care about topical boundaries. You get results that are individually relevant but collectively unfocused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling Clusters with GMM
&lt;/h2&gt;

&lt;p&gt;Gaussian Mixture Models assume your embeddings arise from (K) underlying Gaussian distributions:&lt;/p&gt;

&lt;p&gt;$$p(v) = \sum_{k=1}^K \pi_k \mathcal{N}(v \mid \mu_k, \Sigma_k)$$&lt;/p&gt;

&lt;p&gt;For a query (q), compute the posterior probability of each cluster:&lt;/p&gt;

&lt;p&gt;$$p(k \mid q) = \frac{\pi_k \mathcal{N}(q \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(q \mid \mu_j, \Sigma_j)}$$&lt;/p&gt;

&lt;p&gt;This gives you soft assignments: the probability that a query belongs to each semantic cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two-Stage Retrieval
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster selection&lt;/strong&gt;: Pick cluster(s) with highest (p(k \mid q)). Take top-2 for ambiguous queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intra-cluster retrieval&lt;/strong&gt;: Run k-NN within selected clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The cluster boundaries act as a soft filter, avoiding the "dilution effect" where off-topic documents dominate results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mahalanobis Distance Per Cluster
&lt;/h2&gt;

&lt;p&gt;Here's the underexplored part: different clusters can use different distance metrics. For a cluster modeled as (\mathcal{N}(\mu_k, \Sigma_k)), the Mahalanobis distance accounts for the cluster's shape:&lt;/p&gt;

&lt;p&gt;$$d_{\text{Mah}}(q, v) = \sqrt{(q - v)^T \Sigma_k^{-1} (q - v)}$$&lt;/p&gt;

&lt;p&gt;Elongated clusters in certain semantic directions get stretched appropriately. Cosine similarity treats all directions equally. Mahalanobis adapts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clusters as Agent Tools
&lt;/h2&gt;

&lt;p&gt;In agentic RAG, each cluster becomes a tool the agent can invoke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;ClusterRetrievalTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic_k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent decides which clusters to search and in what order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query: "How does React's context API compare to Redux?"&lt;/li&gt;
&lt;li&gt;Agent plan:

&lt;ol&gt;
&lt;li&gt;Search frontend cluster for React context&lt;/li&gt;
&lt;li&gt;Search state management cluster for Redux patterns&lt;/li&gt;
&lt;li&gt;Synthesize comparison&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This beats flat retrieval for cross-topic synthesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Fit GMM offline on document embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.mixture&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GaussianMixture&lt;/span&gt;

&lt;span class="n"&gt;gmm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GaussianMixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;covariance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gmm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# For query q:
&lt;/span&gt;&lt;span class="n"&gt;cluster_probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gmm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;selected_clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster_probs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# top-2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store cluster assignments as metadata in your vector DB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;selected_clusters&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Number of clusters&lt;/strong&gt;: Use BIC/AIC or domain knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization&lt;/strong&gt;: Add (\lambda I) to covariance matrices to prevent singularities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initialization&lt;/strong&gt;: k-means++ for better convergence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When It Helps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topically diverse corpora&lt;/strong&gt;: Multi-product docs, cross-domain papers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-topic queries&lt;/strong&gt;: Clear primary topic to route to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise reduction&lt;/strong&gt;: Distant-but-similar content diluting results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Homogeneous corpora&lt;/li&gt;
&lt;li&gt;Very small datasets&lt;/li&gt;
&lt;li&gt;Queries requiring extensive cross-topic synthesis (agentic patterns help here)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cluster boundaries&lt;/strong&gt;: Queries near boundaries may be misrouted. Soft routing (weighted retrieval across clusters) helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: GMM fitting doesn't scale well beyond roughly 100 clusters and millions of docs. Use hierarchical clustering or vector DB partitioning for large systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark first&lt;/strong&gt;: Flat retrieval with strong reranking is a tough baseline. Always compare.&lt;/p&gt;

&lt;p&gt;The core insight: embedding spaces have structure. Exploit it.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>machinelearning</category>
      <category>embeddings</category>
      <category>informationretrieval</category>
    </item>
    <item>
      <title>Embeddings API в России: векторный поиск и RAG</title>
      <dc:creator>Promptra Team</dc:creator>
      <pubDate>Sat, 06 Jun 2026 19:35:56 +0000</pubDate>
      <link>https://dev.to/promptra-team/embeddings-api-v-rossii-viektornyi-poisk-i-rag-2gpo</link>
      <guid>https://dev.to/promptra-team/embeddings-api-v-rossii-viektornyi-poisk-i-rag-2gpo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7naftitol93smr9owlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7naftitol93smr9owlk.png" alt="Схема RAG-пайплайна на тёплом фоне: текстовый документ разбивается на чанки, каждый превращается в вектор-эмбеддинг, векторы складываются в векторную базу, запрос пользователя ищет ближайшие по смыслу фрагменты и топ-K уходит в языковую модель, все блоки подписаны на русском, акцент терракотовым цветом" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Эмбеддинг — это представление текста в виде вектора чисел, где близкие по смыслу тексты оказываются рядом в пространстве.&lt;/strong&gt; На эмбеддингах строят семантический поиск, RAG (ответы по своей базе знаний), классификацию и поиск дубликатов. Для большинства задач разумный дефолт — &lt;strong&gt;text-embedding-3-small&lt;/strong&gt; (дёшево, 1536 измерений), а где важна точность — &lt;strong&gt;text-embedding-3-large&lt;/strong&gt; (3072 измерения). Обе модели вызываются через один &lt;strong&gt;OpenAI-совместимый эндпоинт&lt;/strong&gt; из России без VPN: в коде на &lt;code&gt;openai&lt;/code&gt; SDK меняется только &lt;code&gt;base_url&lt;/code&gt; на &lt;code&gt;https://api.promptra.ru/v1&lt;/code&gt;, оплата идёт в рублях на юр.лицо с закрывающими документами.&lt;/p&gt;

&lt;p&gt;Ниже — разбор простыми словами, что такое эмбеддинги и где они применяются, какие модели и размерности бывают, как получить вектор по API (Python и curl), как собрать RAG-пайплайн по шагам (чанк → эмбеддинг → векторная БД → поиск → ответ), сколько это стоит в рублях и какие ошибки встречаются чаще всего. Тон — для разработчика, который хочет понять механику и цену, а не читать маркетинг.&lt;/p&gt;

&lt;h2&gt;
  
  
  Что такое эмбеддинги простыми словами
&lt;/h2&gt;

&lt;p&gt;Компьютер не понимает текст так, как человек. Чтобы машина могла сравнивать фразы по смыслу, текст нужно перевести в числа. Эмбеддинг — это и есть такой перевод: модель-эмбеддер берёт строку («как сбросить пароль») и возвращает список из сотен или тысяч чисел — вектор. Вектор фиксированной длины, не зависящей от длины текста: и одно слово, и целый абзац превращаются в вектор одной и той же размерности. Подробнее — &lt;a href="https://promptra.ru/blog/migraciya-iz-openai-na-promptra-python-sdk-za-10-minut/" rel="noopener noreferrer"&gt;миграция с OpenAI SDK на Promptra за 10 минут&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Главное свойство этих векторов: &lt;strong&gt;близость в пространстве отражает близость по смыслу&lt;/strong&gt;. Фразы «как сбросить пароль» и «забыл пароль, что делать» дадут векторы, расположенные рядом, хотя в них почти нет общих слов. А «как сбросить пароль» и «рецепт борща» окажутся далеко друг от друга. Это принципиально отличается от обычного текстового поиска по ключевым словам: там совпадение ищется по буквам, а в семантическом поиске — по смыслу.&lt;/p&gt;

&lt;p&gt;Расстояние между векторами измеряют чаще всего косинусной близостью (cosine similarity) — насколько совпадает «направление» двух векторов. Значение около 1 — тексты почти про одно и то же, около 0 — никак не связаны. Именно эта метрика лежит в основе семантического поиска: чтобы найти релевантные документы, мы считаем близость вектора запроса к векторам всех документов и берём самые близкие.&lt;/p&gt;

&lt;h3&gt;
  
  
  Где применяются эмбеддинги
&lt;/h3&gt;

&lt;p&gt;Эмбеддинги — это инфраструктурный кирпич, на котором держится сразу несколько практических задач:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Семантический поиск.&lt;/strong&gt; Поиск по смыслу, а не по словам. Пользователь спрашивает «не приходит письмо подтверждения» — система находит статью «проблемы с доставкой email», даже если в ней нет слова «подтверждение».&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG (Retrieval-Augmented Generation).&lt;/strong&gt; Самое популярное применение. Перед тем как задать вопрос языковой модели, мы находим в своей базе знаний релевантные фрагменты (через эмбеддинги) и подкладываем их в промпт. Модель отвечает не из «головы», а по вашим документам — точнее и без выдумок.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Классификация и маршрутизация.&lt;/strong&gt; Тикеты, обращения, лиды можно раскладывать по категориям, сравнивая их эмбеддинги с эталонными. Без обучения отдельной модели.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Кластеризация и дедупликация.&lt;/strong&gt; Сгруппировать похожие отзывы, найти дубли товаров в каталоге, выделить темы в массиве сообщений.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Рекомендации.&lt;/strong&gt; «Похожие товары», «похожие статьи» — это поиск ближайших векторов к текущему объекту.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Объединяет все эти задачи одно: вместо того чтобы сравнивать тексты по словам, мы сравниваем их по смыслу — а для этого сначала превращаем каждый текст в вектор.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsql5gy6afxzuv8kgq5kh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsql5gy6afxzuv8kgq5kh.png" alt="Двумерная иллюстрация смыслового пространства эмбеддингов на тёплом фоне: точки-фразы про пароль и вход сгруппированы в одном углу, точки про оплату и счёт — в другом, точка про рецепт борща стоит далеко в стороне, между парой близких точек показан угол косинусной близости, подписи на русском" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Модели эмбеддингов и размерности
&lt;/h2&gt;

&lt;p&gt;Самые распространённые модели эмбеддингов — линейка OpenAI &lt;code&gt;text-embedding-3&lt;/code&gt;. Они доступны через OpenAI-совместимый API, и именно их чаще всего подключают в РФ-проектах. Ключевые параметры:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Модель&lt;/th&gt;
&lt;th&gt;Размерность по умолчанию&lt;/th&gt;
&lt;th&gt;Сильная сторона&lt;/th&gt;
&lt;th&gt;Цена OpenAI (USD / 1M токенов)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Дёшево, быстро, дефолт&lt;/td&gt;
&lt;td&gt;$0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;Максимальное качество&lt;/td&gt;
&lt;td&gt;$0.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;text-embedding-ada-002&lt;/code&gt; (legacy)&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Старая, для совместимости&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;USD-прайс — с &lt;a href="https://platform.openai.com/docs/pricing" rel="noopener noreferrer"&gt;официальной страницы OpenAI&lt;/a&gt;. Рублёвые оценки разберём в разделе про цену ниже.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Что значит размерность.&lt;/strong&gt; Это длина вектора — сколько чисел в нём. У &lt;code&gt;text-embedding-3-small&lt;/code&gt; по умолчанию 1536 чисел, у &lt;code&gt;large&lt;/code&gt; — 3072. Чем больше размерность, тем больше «нюансов» смысла модель может закодировать, но тем больше места занимает вектор в базе и тем дороже его хранить и сравнивать. Для большинства задач 1536 измерений &lt;code&gt;small&lt;/code&gt; — более чем достаточно.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Параметр &lt;code&gt;dimensions&lt;/code&gt; — важная фишка.&lt;/strong&gt; Модели &lt;code&gt;text-embedding-3&lt;/code&gt; (в отличие от старой &lt;code&gt;ada-002&lt;/code&gt;) умеют отдавать укороченный вектор без потери большей части качества. Передаёте &lt;code&gt;dimensions=512&lt;/code&gt; — и получаете вектор из 512 чисел вместо 1536. Это экономит память векторной базы и ускоряет поиск, а просадка в качестве на типичных задачах небольшая. Приём называется Matryoshka-представление: вектор «вложен» так, что первые N чисел уже несут основной смысл. На практике для экономии берут 512 или 256 измерений у &lt;code&gt;large&lt;/code&gt; и часто получают качество не хуже полного &lt;code&gt;small&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Какую модель выбрать по умолчанию:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Начинайте с &lt;code&gt;text-embedding-3-small&lt;/code&gt;.&lt;/strong&gt; Дёшево, быстро, на 90% задач (поиск по базе знаний, классификация тикетов, рекомендации) её хватает с запасом.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Берите &lt;code&gt;text-embedding-3-large&lt;/code&gt;, когда&lt;/strong&gt; важна точность ранжирования: юридический или медицинский поиск, многоязычный корпус, тонкие смысловые различия. При желании укоротите вектор через &lt;code&gt;dimensions&lt;/code&gt;, чтобы не раздувать базу.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ada-002&lt;/code&gt; — только для совместимости&lt;/strong&gt; со старым кодом. Для новых проектов смысла нет: &lt;code&gt;3-small&lt;/code&gt; дешевле в пять раз и качественнее.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzh4oeswzhq1dwrq7ug7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzh4oeswzhq1dwrq7ug7.png" alt="Сравнительная инфографика трёх моделей эмбеддингов на тёплом фоне: три карточки с подписями text-embedding-3-small, text-embedding-3-large и ada-002, у каждой размерность 1536, 3072, 1536 и цена в рублях за миллион токенов, карточка small выделена терракотовым как рекомендованная по умолчанию, подписи на русском" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Как получить эмбеддинг по API
&lt;/h2&gt;

&lt;p&gt;Хорошая новость: эмбеддинги вызываются тем же &lt;code&gt;openai&lt;/code&gt; SDK, что и чат — отличается только метод (&lt;code&gt;embeddings.create&lt;/code&gt; вместо &lt;code&gt;chat.completions.create&lt;/code&gt;). А чтобы работать из России без VPN, меняется ровно один параметр — &lt;code&gt;base_url&lt;/code&gt;. Остальной код остаётся как в любом примере из документации OpenAI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;

&lt;p&gt;Минимальный рабочий пример — получить вектор для одной строки:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prm-xxxxxxxxxxxxxxxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.promptra.ru/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# единственное изменение
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Как сбросить пароль от личного кабинета&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# 1536 — размерность вектора
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# сколько токенов потрачено
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;В ответе &lt;code&gt;resp.data&lt;/code&gt; — это список (по одному элементу на каждый входной текст), а &lt;code&gt;resp.data[0].embedding&lt;/code&gt; — сам вектор: список из 1536 чисел типа float. Поле &lt;code&gt;resp.usage.total_tokens&lt;/code&gt; показывает фактический расход, по которому считается оплата.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Батч — несколько текстов за один запрос.&lt;/strong&gt; Это важный приём для экономии: вместо тысячи отдельных запросов отправляйте список строк, и модель вернёт список векторов в том же порядке. Так индексируют базу знаний:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Как сбросить пароль&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Не приходит письмо с подтверждением&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Как изменить тариф&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# список строк — батч
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# вектор для каждого текста по порядку
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Укороченный вектор через &lt;code&gt;dimensions&lt;/code&gt;&lt;/strong&gt; — когда нужно сэкономить на хранении и поиске:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Текст запроса&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# вместо 3072 по умолчанию
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# 512
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Параметр &lt;code&gt;dimensions&lt;/code&gt; поддерживается только у моделей &lt;code&gt;text-embedding-3&lt;/code&gt; и новее — у старой &lt;code&gt;ada-002&lt;/code&gt; его нет.&lt;/p&gt;

&lt;h3&gt;
  
  
  curl
&lt;/h3&gt;

&lt;p&gt;Проверить эндпоинт без всякого SDK можно одним запросом:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.promptra.ru/v1/embeddings &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer prm-xxxxxxxxxxxxxxxx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
 "model": "text-embedding-3-small",
 "input": "Как сбросить пароль от личного кабинета"
 }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Если в ответе пришёл JSON с полем &lt;code&gt;data&lt;/code&gt;, внутри которого массив &lt;code&gt;embedding&lt;/code&gt; из чисел, — эндпоинт и ключ в порядке, можно индексировать базу. Подробный разбор того, как поменять &lt;code&gt;base_url&lt;/code&gt; в разных SDK и на разных языках, — в гайде &lt;a href="https://promptra.ru/blog/migraciya-openai-sdk-base-url" rel="noopener noreferrer"&gt;миграция с OpenAI SDK: меняем base_url&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Как построить RAG и семантический поиск
&lt;/h2&gt;

&lt;p&gt;RAG расшифровывается как Retrieval-Augmented Generation — «генерация с подмешиванием найденного». Идея простая: языковая модель не знает содержимого ваших внутренних документов, и если спросить её напрямую, она либо ответит общими словами, либо выдумает. RAG решает это так: перед ответом мы находим в своей базе релевантные фрагменты (через эмбеддинги) и кладём их прямо в промпт. Модель отвечает уже по конкретным данным.&lt;/p&gt;

&lt;p&gt;Пайплайн состоит из двух фаз — индексация (один раз, заранее) и запрос (на каждый вопрос пользователя).&lt;/p&gt;

&lt;h3&gt;
  
  
  Фаза 1. Индексация базы знаний
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Чанкинг.&lt;/strong&gt; Документы (статьи, инструкции, договоры) режутся на куски — чанки — по 200–800 токенов. Слишком большие чанки размывают смысл, слишком мелкие теряют контекст. Типичный старт — абзацы или фрагменты по 300–500 токенов с небольшим перекрытием.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Эмбеддинг.&lt;/strong&gt; Для каждого чанка получаем вектор через &lt;code&gt;embeddings.create&lt;/code&gt; (батчами, как показано выше).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Сохранение в векторную БД.&lt;/strong&gt; Вектор плюс исходный текст чанка и метаданные (источник, заголовок) складываются в векторную базу.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Фаза 2. Запрос пользователя
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Эмбеддинг запроса.&lt;/strong&gt; Вопрос пользователя превращаем в вектор той же моделью.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Поиск ближайших (топ-K).&lt;/strong&gt; Векторная БД находит K чанков, ближайших к запросу по косинусной близости (обычно K = 3–8).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Сборка промпта.&lt;/strong&gt; Найденные чанки вставляем в системный промпт: «Ответь на вопрос, используя только эти фрагменты: …».&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Генерация ответа.&lt;/strong&gt; Отправляем промпт в чат-модель (например, GPT-5.5 или Claude Sonnet) — она отвечает по подложенным данным.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ключевой момент: &lt;strong&gt;эмбеддер и чат-модель — это разные модели&lt;/strong&gt;, и они работают вместе. Эмбеддер (&lt;code&gt;text-embedding-3-small&lt;/code&gt;) отвечает за поиск, чат-модель (&lt;code&gt;gpt-5.5&lt;/code&gt;, &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; и т.д.) — за формулировку ответа. Обе доступны через один и тот же эндпоинт — меняется только значение &lt;code&gt;model&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Минимальный RAG-цикл на Python, без внешней векторной БД (для базы из нескольких сотен чанков хватает поиска в памяти через numpy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prm-xxxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.promptra.ru/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Индексация (один раз) ---
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Чтобы сбросить пароль, откройте раздел Настройки и нажмите Сбросить.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Письмо с подтверждением приходит в течение 5 минут, проверьте папку Спам.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Сменить тариф можно в Биллинге, изменения вступают в силу сразу.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# матрица векторов
&lt;/span&gt;
&lt;span class="c1"&gt;# --- Запрос ---
&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;не могу войти, забыл пароль&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# косинусная близость и топ-K
&lt;/span&gt;&lt;span class="n"&gt;sims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sims&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Генерация ответа по найденному контексту ---
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ответь, используя только эти данные:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;На больших объёмах (десятки тысяч чанков и больше) поиск в памяти заменяют на специализированную векторную БД — pgvector (расширение PostgreSQL), Qdrant, Chroma, Milvus или подобные. Логика та же: складываете векторы, ищете ближайшие. Меняется только хранилище — эмбеддинги по-прежнему берёте через тот же API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nesb7130ox5dwtjf5yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nesb7130ox5dwtjf5yy.png" alt="Двухфазная блок-схема RAG на тёплом фоне: верхняя дорожка индексация — документы, чанкинг, эмбеддинг, векторная база; нижняя дорожка запрос — вопрос пользователя, эмбеддинг запроса, поиск топ-K в базе, сборка промпта, ответ языковой модели; блоки эмбеддера и чат-модели подсвечены терракотовым, стрелки и подписи на русском" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Цена эмбеддингов в рублях
&lt;/h2&gt;

&lt;p&gt;Здесь важная оговорка: цены на модели эмбеддингов в нашем каталоге отдельной строкой пока не зафиксированы (каталог сейчас отражает чат-модели). Поэтому рублёвые значения ниже — &lt;strong&gt;производная оценка&lt;/strong&gt;: официальная цена OpenAI в долларах, умноженная на курс ЦБ РФ 71.668 ₽/$ (на 2026-05-27, тот же курс, что и для всех моделей в каталоге). Фактический счёт считается по курсу ЦБ на день пополнения и без наценки на токены; точные ставки по эмбеддингам уточняйте у команды при подключении.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Модель&lt;/th&gt;
&lt;th&gt;Цена OpenAI (USD / 1M)&lt;/th&gt;
&lt;th&gt;Оценка в ₽ / 1M (× 71.668)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.02&lt;/td&gt;
&lt;td&gt;≈ 1.4 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.13&lt;/td&gt;
&lt;td&gt;≈ 9.3 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text-embedding-ada-002&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;≈ 7.2 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Главное, что бросается в глаза: &lt;strong&gt;эмбеддинги несопоставимо дешевле генерации&lt;/strong&gt;. Миллион токенов через &lt;code&gt;text-embedding-3-small&lt;/code&gt; обходится примерно в 1.4 ₽ — это в сотни раз дешевле, чем тот же миллион токенов на выходе у чат-флагмана (для сравнения, выход GPT-5.5 — 2150 ₽ за 1M). Причина в том, что эмбеддер только «читает» текст и отдаёт один вектор, он ничего не генерирует. Платите вы только за входные токены — понятия «выходных токенов» у эмбеддингов нет.&lt;/p&gt;

&lt;p&gt;Прикинем реальные сценарии (по оценочной ставке &lt;code&gt;text-embedding-3-small&lt;/code&gt; ≈ 1.4 ₽ за 1M):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Сценарий&lt;/th&gt;
&lt;th&gt;Объём&lt;/th&gt;
&lt;th&gt;Примерно токенов&lt;/th&gt;
&lt;th&gt;Стоимость (оценка)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Проиндексировать базу знаний на 1000 статей&lt;/td&gt;
&lt;td&gt;≈ 1000 × 800 токенов&lt;/td&gt;
&lt;td&gt;0.8M&lt;/td&gt;
&lt;td&gt;≈ 1.1 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Проиндексировать 100 000 товаров каталога&lt;/td&gt;
&lt;td&gt;≈ 100K × 100 токенов&lt;/td&gt;
&lt;td&gt;10M&lt;/td&gt;
&lt;td&gt;≈ 14 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 млн поисковых запросов в месяц&lt;/td&gt;
&lt;td&gt;≈ 1M × 20 токенов&lt;/td&gt;
&lt;td&gt;20M&lt;/td&gt;
&lt;td&gt;≈ 28 ₽&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Даже индексация крупной базы и миллион запросов в месяц складываются в десятки рублей. Эмбеддинги — та статья расходов, по которой в RAG-системе экономить обычно не нужно: основной счёт формирует чат-модель, генерирующая ответы, а не эмбеддер. Поэтому выбор &lt;code&gt;large&lt;/code&gt; вместо &lt;code&gt;small&lt;/code&gt; ради качества почти не бьёт по бюджету.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Важно: значения в таблицах — производная оценка от долларового прайса OpenAI по курсу ЦБ, а не строка из каталога. Перед расчётом бюджета сверьте актуальную ставку с командой.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Типичные ошибки при работе с эмбеддингами
&lt;/h2&gt;

&lt;p&gt;Несколько граблей, на которые наступают чаще всего:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Разные модели для индексации и запроса.&lt;/strong&gt; Векторы от &lt;code&gt;text-embedding-3-small&lt;/code&gt; и &lt;code&gt;text-embedding-3-large&lt;/code&gt; живут в разных пространствах — их нельзя сравнивать между собой. Если вы проиндексировали базу одной моделью, а запросы считаете другой, поиск выдаст мусор. Правило: одна и та же модель (и одна и та же размерность) на индексацию и на запрос. Сменили модель — переиндексируйте всю базу.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Слишком крупные чанки.&lt;/strong&gt; Если затолкать в один чанк целую страницу, его вектор «усреднит» все темы сразу, и поиск по конкретному вопросу станет размытым. Дробите на смысловые куски по 200–800 токенов. Обратная крайность — чанки по одному предложению — теряют контекст. Истина посередине, обычно абзац.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Игнорирование лимита длины входа.&lt;/strong&gt; У моделей эмбеддингов есть максимум токенов на один вход (у &lt;code&gt;text-embedding-3&lt;/code&gt; это 8191 токен). Текст длиннее обрежется или вызовет ошибку — длинные документы нужно резать на чанки до эмбеддинга, а не после.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Отказ от батчинга.&lt;/strong&gt; Индексировать базу по одному тексту на запрос — медленно и упирается в rate limit. Отправляйте списком (батчами по сотне-другой строк) — это и быстрее, и устойчивее к лимитам.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Хранение векторов как есть без нормализации.&lt;/strong&gt; Многие векторные БД и метрики ожидают нормализованные векторы (длины 1). Если считаете близость вручную через скалярное произведение — либо нормализуйте, либо используйте честную косинусную формулу с делением на нормы (как в примере выше).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Эмбеддинг очень разноязычного корпуса дешёвой моделью.&lt;/strong&gt; На многоязычных и узкоспециальных данных &lt;code&gt;text-embedding-3-small&lt;/code&gt; иногда заметно уступает &lt;code&gt;large&lt;/code&gt;. Если поиск «промахивается» на вашем корпусе — первое, что стоит попробовать, это поднять модель до &lt;code&gt;large&lt;/code&gt;, благо по цене это почти незаметно.&lt;/p&gt;

&lt;p&gt;Если по тексту нужно не только искать, но и генерировать (ответы, резюме, код) — выбор чат-модели под задачу и бюджет мы разобрали в гайде &lt;a href="https://promptra.ru/blog/neyroset-dlya-koda" rel="noopener noreferrer"&gt;нейросеть для кода: какие LLM выбрать&lt;/a&gt;, а обзор топовых моделей — в материале &lt;a href="https://promptra.ru/blog/top-5-llm-2026" rel="noopener noreferrer"&gt;топ-5 LLM 2026&lt;/a&gt;. Подключить чат-модели можно на &lt;a href="https://promptra.ru/api/chatgpt" rel="noopener noreferrer"&gt;странице ChatGPT API&lt;/a&gt; — тем же ключом и эндпоинтом, что и эмбеддинги.&lt;/p&gt;

&lt;h2&gt;
  
  
  Оплата и документы для юр.лица
&lt;/h2&gt;

&lt;p&gt;Для компаний важна не только техническая сторона, но и то, как расходы на API проходят по бухгалтерии. Оплата идёт на юр.лицо — российское юр.лицо — с полным пакетом закрывающих документов через ЭДО: договор-оферта, счёт, акт, счёт-фактура, УПД. Документы автоматически проводятся в учётной системе через операторов ЭДО (Диадок, СБИС).&lt;/p&gt;

&lt;p&gt;Цена за токены — 1-в-1 с прайсом провайдера, пересчитанным по курсу ЦБ, без наценки на сами токены; сервисная комиссия 5% берётся только при пополнении баланса. Доступ работает из России без VPN: запрос уходит на эндпоинт агрегатора, а он связывается с провайдером со своей стороны — туннелировать трафик или маскировать IP не нужно.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Promptra&lt;/strong&gt; — Russian LLM API aggregator. One OpenAI-compatible endpoint to all flagship models: OpenAI (GPT-5.5, GPT-5.4), Anthropic (Claude Opus 4.7, Sonnet 4.6), Google (Gemini 3.1 Pro, 3.5 Flash), DeepSeek V4 Pro, Qwen 3.6 Plus.&lt;/p&gt;

&lt;p&gt;Provider prices 1-to-1 at CBR rate — no markup on tokens. Ruble billing per contract, full closing documents through EDI. No VPN — legal B2B service in Russia.&lt;/p&gt;

&lt;p&gt;Try: &lt;a href="https://promptra.ru" rel="noopener noreferrer"&gt;promptra.ru&lt;/a&gt; · &lt;a href="https://promptra.ru/models" rel="noopener noreferrer"&gt;model catalog&lt;/a&gt; · &lt;a href="https://promptra.ru/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>embeddings</category>
      <category>rag</category>
      <category>api</category>
      <category>openaiapi</category>
    </item>
    <item>
      <title>text-embedding-3-small Dimensions Explained: 1536 vs 1024 vs 512</title>
      <dc:creator>Jenny Met</dc:creator>
      <pubDate>Sat, 06 Jun 2026 14:13:22 +0000</pubDate>
      <link>https://dev.to/xujfcn/text-embedding-3-small-dimensions-explained-1536-vs-1024-vs-512-128a</link>
      <guid>https://dev.to/xujfcn/text-embedding-3-small-dimensions-explained-1536-vs-1024-vs-512-128a</guid>
      <description>&lt;h1&gt;
  
  
  text-embedding-3-small Dimensions Explained: 1536 vs 1024 vs 512
&lt;/h1&gt;

&lt;p&gt;If you use &lt;code&gt;text-embedding-3-small&lt;/code&gt;, one small setting can quietly affect your whole retrieval system: embedding dimensions.&lt;/p&gt;

&lt;p&gt;The default vector length is &lt;strong&gt;1536 dimensions&lt;/strong&gt;. That is a good default. But it is not always the cheapest or fastest choice once you store millions of chunks in a vector database.&lt;/p&gt;

&lt;p&gt;This guide explains what &lt;code&gt;text-embedding-3-small dimensions&lt;/code&gt; means, when to keep 1536, when to test smaller vectors, and how to call an OpenAI-compatible embeddings endpoint with real code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs4raqj9yd8qk8tch7fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs4raqj9yd8qk8tch7fh.png" alt="text-embedding-3-small dimensions visual guide" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are text-embedding-3-small dimensions?
&lt;/h2&gt;

&lt;p&gt;An embedding turns text into a list of numbers. That list is a vector.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;text-embedding-3-small&lt;/code&gt;, the default vector has &lt;strong&gt;1536 numbers&lt;/strong&gt;. If you embed the sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“API gateways help developers route model calls.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model returns one vector that represents the meaning of that whole input. The vector is not one number per word. It is one semantic representation for the input text you send.&lt;/p&gt;

&lt;p&gt;You then store that vector in a vector database such as pgvector, Pinecone, Milvus, Weaviate, Chroma, or Qdrant. When a user searches, you embed the query and compare it against stored vectors.&lt;/p&gt;

&lt;p&gt;Official OpenAI documentation states that &lt;code&gt;text-embedding-3-small&lt;/code&gt; defaults to 1536 dimensions, while &lt;code&gt;text-embedding-3-large&lt;/code&gt; defaults to 3072 dimensions. It also supports a &lt;code&gt;dimensions&lt;/code&gt; parameter that can reduce the output vector length.&lt;/p&gt;

&lt;p&gt;External references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/embeddings?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;OpenAI embeddings guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/models/text-embedding-3-small?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;OpenAI text-embedding-3-small model page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/75733749/openai-embeddings-api-how-to-change-the-embedding-output-dimension?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;Stack Overflow discussion on embedding dimensions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Default text-embedding-3-small dimensions: why 1536 is common
&lt;/h2&gt;

&lt;p&gt;1536 dimensions is popular because it is the default. It is also a practical balance between quality and cost for many semantic search and RAG workloads.&lt;/p&gt;

&lt;p&gt;Use the default 1536 dimensions when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building your first retrieval system.&lt;/li&gt;
&lt;li&gt;You do not have evaluation data yet.&lt;/li&gt;
&lt;li&gt;Your dataset is small enough that vector storage is not painful.&lt;/li&gt;
&lt;li&gt;Search quality matters more than a few gigabytes of storage.&lt;/li&gt;
&lt;li&gt;You want fewer moving parts during the first launch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters. If your app is still early, the biggest risk is usually not vector size. It is bad chunking, weak retrieval evaluation, missing metadata filters, or poor prompts.&lt;/p&gt;

&lt;p&gt;Start simple. Then optimize.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dimensions parameter: what changes and what does not
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;dimensions&lt;/code&gt; parameter lets you request a shorter embedding vector.&lt;/p&gt;

&lt;p&gt;For example, instead of asking for the default 1536-dimensional vector, you can request 1024, 768, or 512 dimensions if your provider supports it for that model.&lt;/p&gt;

&lt;p&gt;What changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;1536 dimensions&lt;/th&gt;
&lt;th&gt;1024 / 768 / 512 dimensions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector storage&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index memory&lt;/td&gt;
&lt;td&gt;Larger&lt;/td&gt;
&lt;td&gt;Smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search latency&lt;/td&gt;
&lt;td&gt;Often higher&lt;/td&gt;
&lt;td&gt;Often lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval quality&lt;/td&gt;
&lt;td&gt;Strong baseline&lt;/td&gt;
&lt;td&gt;Must be tested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API input token cost&lt;/td&gt;
&lt;td&gt;Usually unchanged&lt;/td&gt;
&lt;td&gt;Usually unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What does &lt;strong&gt;not&lt;/strong&gt; usually change: the number of input tokens you send. Embedding API pricing is normally based on input tokens, not the final vector size.&lt;/p&gt;

&lt;p&gt;That means smaller dimensions mainly help with storage, index memory, and retrieval speed. They are not a magic way to reduce the embedding generation bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage math: 1536 vs 1024 vs 512 dimensions
&lt;/h2&gt;

&lt;p&gt;A float32 number uses 4 bytes. So the raw vector size is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector_size_bytes = dimensions × 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For one vector:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Bytes per vector&lt;/th&gt;
&lt;th&gt;Storage vs 1536&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;6,144 bytes&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;4,096 bytes&lt;/td&gt;
&lt;td&gt;~33% smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;3,072 bytes&lt;/td&gt;
&lt;td&gt;~50% smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;2,048 bytes&lt;/td&gt;
&lt;td&gt;~67% smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For 1 million chunks, raw float32 vector storage looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Raw vector storage&lt;/th&gt;
&lt;th&gt;With rough 35% index overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;~5.72 GiB&lt;/td&gt;
&lt;td&gt;~7.72 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;~3.81 GiB&lt;/td&gt;
&lt;td&gt;~5.15 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;~2.86 GiB&lt;/td&gt;
&lt;td&gt;~3.86 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;~1.91 GiB&lt;/td&gt;
&lt;td&gt;~2.57 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why dimensions start to matter at scale. A small difference per vector becomes real infrastructure cost when you store millions of chunks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick calculator for embedding dimensions
&lt;/h2&gt;

&lt;p&gt;Here is a small Python tool you can use to estimate storage and rough generation cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--avg-tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--price-per-million&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_tokens&lt;/span&gt;
    &lt;span class="n"&gt;estimated_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_million&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Documents: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated input tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedding generation cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dims  Raw GiB  With 35% index overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;gib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;7.2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;gib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_bytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.35&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;24.2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 embedding_dimension_calculator.py &lt;span class="nt"&gt;--documents&lt;/span&gt; 1000000 &lt;span class="nt"&gt;--avg-tokens&lt;/span&gt; 350
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Embedding Dimension Calculator
================================
Documents/chunks: 1,000,000
Average tokens/chunk: 350
Estimated input tokens: 350,000,000
Embedding generation cost @ $0.02/1M tokens: $7.00

Dimension storage comparison
--------------------------------
  Dims  Bytes/vector     Raw GiB  With index GiB  Saved vs max
  1536         6,144        5.72            7.72            0%
  1024         4,096        3.81            5.15           33%
   768         3,072        2.86            3.86           50%
   512         2,048        1.91            2.57           67%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important lesson: generation cost can stay small, while vector database cost and memory can grow quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  API example: default 1536 dimensions
&lt;/h2&gt;

&lt;p&gt;Here is a standard OpenAI-compatible embeddings call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://crazyrouter.com/v1/embeddings &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$CRAZYROUTER_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "text-embedding-3-small",
    "input": "Explain API gateway routing in one paragraph."
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes an embedding array. With the default setting, its length should be 1536.&lt;/p&gt;

&lt;p&gt;You can use the same pattern with any OpenAI-compatible client. With Crazyrouter, you only change the base URL and API key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base URL: &lt;code&gt;https://crazyrouter.com/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Endpoint: &lt;code&gt;/embeddings&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Auth: &lt;code&gt;Authorization: Bearer YOUR_KEY&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Related internal guides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://crazyrouter.com/blog/openai-compatible-api-base-url-explained?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;OpenAI-compatible base URL explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crazyrouter.com/blog/ai-api-pricing-guide-2026?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;AI API pricing guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crazyrouter.com/blog/ai-api-cost-optimization-complete-guide-2026?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;AI API cost optimization guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crazyrouter.com/blog/structured-output-json-mode-ai-api-guide-2026?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;Structured output JSON mode guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crazyrouter.com/blog/rag-implementation-guide-2026?utm_source=crazyrouter_blog&amp;amp;utm_medium=article&amp;amp;utm_campaign=embedding_dimensions" rel="noopener noreferrer"&gt;RAG implementation guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python example: check the vector length
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-crazyrouter-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crazyrouter.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A vector database stores embeddings for semantic search.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# usually 1536 by default
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;   &lt;span class="c1"&gt;# preview the first few values
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not paste real keys into code. Use environment variables in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRAZYROUTER_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crazyrouter.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python example: request custom dimensions
&lt;/h2&gt;

&lt;p&gt;If your embeddings provider supports the &lt;code&gt;dimensions&lt;/code&gt; parameter for &lt;code&gt;text-embedding-3-small&lt;/code&gt;, you can request a shorter vector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-crazyrouter-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://crazyrouter.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shorter embeddings can reduce vector database storage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# expected: 1024 when supported
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important: do not mix dimensions in the same vector index. If your collection was created for 1536-dimensional vectors, a 1024-dimensional vector will usually fail at insert time.&lt;/p&gt;

&lt;p&gt;Use one collection per dimension setting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node.js example: embeddings with OpenAI-compatible base URL
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CRAZYROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://crazyrouter.com/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Embeddings help search by meaning, not just keywords.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Which dimensions should you choose?
&lt;/h2&gt;

&lt;p&gt;There is no universal best value. Choose based on evaluation, not vibes.&lt;/p&gt;

&lt;p&gt;A practical starting point:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Suggested starting dimensions&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prototype / small app&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Maximize quality while you learn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support docs RAG&lt;/td&gt;
&lt;td&gt;1536 or 1024&lt;/td&gt;
&lt;td&gt;Quality matters, but storage can grow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large FAQ search&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Often a good balance to test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-volume semantic cache&lt;/td&gt;
&lt;td&gt;768 or 512&lt;/td&gt;
&lt;td&gt;Speed and memory may matter more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal / medical / financial retrieval&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Test carefully before reducing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile / edge search&lt;/td&gt;
&lt;td&gt;512 or 768&lt;/td&gt;
&lt;td&gt;Smaller vectors are easier to move&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For production, run an evaluation set. Take 50 to 200 real user queries. Label the best matching documents. Compare recall@5 or recall@10 for 1536, 1024, 768, and 512.&lt;/p&gt;

&lt;p&gt;If 1024 gives almost the same recall as 1536, you can reduce storage and memory without hurting users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common mistakes with text-embedding-3-small dimensions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: mixing 1536 and 1024 vectors in one index
&lt;/h3&gt;

&lt;p&gt;Vector databases expect a fixed dimension per collection or index. If you change dimensions, create a new index and re-embed the corpus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2: optimizing dimensions before chunking
&lt;/h3&gt;

&lt;p&gt;Bad chunking hurts retrieval more than a larger vector helps it.&lt;/p&gt;

&lt;p&gt;Fix chunking first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep chunks focused.&lt;/li&gt;
&lt;li&gt;Add useful metadata.&lt;/li&gt;
&lt;li&gt;Avoid huge mixed-topic chunks.&lt;/li&gt;
&lt;li&gt;Test overlap instead of guessing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mistake 3: assuming smaller dimensions reduce API cost
&lt;/h3&gt;

&lt;p&gt;Embedding generation cost is usually based on input tokens. Smaller vectors reduce storage and search costs, not necessarily API call cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 4: choosing 512 without evaluation
&lt;/h3&gt;

&lt;p&gt;512-dimensional vectors can work for some workloads. But they may lose recall on nuanced queries. Test them before moving production search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 5: forgetting downstream schema changes
&lt;/h3&gt;

&lt;p&gt;If you use pgvector, your schema may include a fixed dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;bigserial&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you switch to 1024, you need a different column or table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents_1024&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;bigserial&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A simple evaluation workflow
&lt;/h2&gt;

&lt;p&gt;Use this workflow before changing dimensions in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick 100 real user queries.&lt;/li&gt;
&lt;li&gt;Label the correct documents for each query.&lt;/li&gt;
&lt;li&gt;Create separate indexes for 1536, 1024, 768, and 512.&lt;/li&gt;
&lt;li&gt;Run the same queries against each index.&lt;/li&gt;
&lt;li&gt;Compare recall@5, recall@10, latency, and memory.&lt;/li&gt;
&lt;li&gt;Choose the smallest dimension that does not hurt retrieval quality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is more reliable than reading a benchmark and hoping it matches your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final recommendation
&lt;/h2&gt;

&lt;p&gt;For most teams, the best first move is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with &lt;code&gt;text-embedding-3-small&lt;/code&gt; at 1536 dimensions.&lt;/li&gt;
&lt;li&gt;Build a clean retrieval evaluation set.&lt;/li&gt;
&lt;li&gt;Test 1024 once your corpus grows.&lt;/li&gt;
&lt;li&gt;Try 768 or 512 only when storage, memory, or latency becomes important.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you already use OpenAI-compatible tools, you can test this through Crazyrouter by setting your base URL to &lt;code&gt;https://crazyrouter.com/v1&lt;/code&gt; and calling &lt;code&gt;/embeddings&lt;/code&gt; with your normal SDK.&lt;/p&gt;

&lt;p&gt;The goal is not to use the smallest vector. The goal is to use the smallest vector that still retrieves the right answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ: text-embedding-3-small dimensions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are the default text-embedding-3-small dimensions?
&lt;/h3&gt;

&lt;p&gt;The default &lt;code&gt;text-embedding-3-small&lt;/code&gt; output is 1536 dimensions. That means each input text returns a vector with 1536 numeric values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I change text-embedding-3-small dimensions?
&lt;/h3&gt;

&lt;p&gt;Yes, when your provider supports the &lt;code&gt;dimensions&lt;/code&gt; parameter, you can request a shorter vector. Common test values are 1024, 768, and 512.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do smaller embedding dimensions reduce API cost?
&lt;/h3&gt;

&lt;p&gt;Usually not directly. Embedding API cost is normally based on input tokens. Smaller dimensions mainly reduce vector storage, index memory, and search latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is 512 dimensions enough for text-embedding-3-small?
&lt;/h3&gt;

&lt;p&gt;Sometimes. It depends on your dataset and retrieval quality requirements. Use an evaluation set before using 512 dimensions in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I store 1536 and 1024 dimension vectors in the same database table?
&lt;/h3&gt;

&lt;p&gt;Usually no. Most vector indexes require a fixed dimension. Create separate collections or tables when testing different dimensions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use text-embedding-3-small or text-embedding-3-large?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;text-embedding-3-small&lt;/code&gt; for cost-effective general retrieval. Test &lt;code&gt;text-embedding-3-large&lt;/code&gt; when retrieval quality is the main bottleneck and you can afford larger vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the best dimension for RAG?
&lt;/h3&gt;

&lt;p&gt;Start with 1536 for &lt;code&gt;text-embedding-3-small&lt;/code&gt;. Then test 1024 and 768 against real queries. The best dimension is the smallest one that preserves your recall target.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>embeddings</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
