<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Actian for Developers</title>
    <description>The latest articles on DEV Community by Actian for Developers (@actiandev).</description>
    <link>https://dev.to/actiandev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12290%2Fc208e611-715d-4932-a035-3285a56758fe.png</url>
      <title>DEV Community: Actian for Developers</title>
      <link>https://dev.to/actiandev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/actiandev"/>
    <language>en</language>
    <item>
      <title>Should You Use RAG or Fine-Tune Your LLM?</title>
      <dc:creator>Offisong Emmanuel</dc:creator>
      <pubDate>Tue, 19 May 2026 19:43:02 +0000</pubDate>
      <link>https://dev.to/actiandev/should-you-use-rag-or-fine-tune-your-llm-3g8j</link>
      <guid>https://dev.to/actiandev/should-you-use-rag-or-fine-tune-your-llm-3g8j</guid>
      <description>&lt;p&gt;The debate over retrieval-augmented generation (RAG) vs. fine-tuning appears simple at first glance. RAG pulls in external data at inference time. Fine-tuning modifies model weights during training. In production systems, that distinction alone is not enough to choose between them.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://menlovc.com/2024-the-state-of-generative-ai-in-the-enterprise/" rel="noopener noreferrer"&gt;Menlo Ventures 2024&lt;/a&gt; State of Generative AI in the Enterprise report, 51 percent of enterprise AI deployments use RAG in production. Only nine percent rely primarily on fine-tuning. Yet research such as the &lt;a href="https://shishirpatil.github.io/publications/raft-2024.pdf" rel="noopener noreferrer"&gt;RAFT study&lt;/a&gt; from UC Berkeley shows that hybrid systems combining retrieval and fine-tuning outperform either approach alone across benchmarks.&lt;/p&gt;

&lt;p&gt;If hybrid systems can produce better results, why does industry adoption favor only RAG? In this article, we’ll compare RAG, fine-tuning, and a hybrid architecture to understand the trade-offs and where each approach excels.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt;: Best for frequently changing knowledge and moderate traffic; easy to update without retraining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt;: Best for stable domains and high-volume or low-latency tasks; improves task-specific accuracy and formatting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid/RAFT&lt;/strong&gt;: Combines up-to-date retrieval with optimized model behavior for the highest accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key trade-off&lt;/strong&gt;: The choice depends on query volume, how often knowledge changes, and team expertise.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why the Standard RAG vs. Fine-Tuning Comparison Fails
&lt;/h2&gt;

&lt;p&gt;RAG is a method where the model dynamically pulls in external data at inference time. Each query retrieves relevant documents or knowledge chunks, which the system appends to the prompt, allowing the model to produce answers grounded in current information.&lt;/p&gt;

&lt;p&gt;Fine-tuning is the process of modifying a model’s weights during training using labeled data. Instead of relying on external retrieval, the model internalizes patterns directly, producing consistent outputs without querying external sources.&lt;/p&gt;

&lt;p&gt;While these definitions are technically correct, most standard comparisons miss the factors that actually drive decisions in production. In real-world systems, the choice between RAG and fine-tuning depends on variables like scale, query volume, and how often your data changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing variable 1: Context expansion at scale
&lt;/h3&gt;

&lt;p&gt;In many production RAG systems, every request appends hundreds of tokens of retrieved context. That added context changes how the model distributes attention across the prompt, even though the weights themselves stay fixed at inference time.&lt;/p&gt;

&lt;p&gt;Large retrieved contexts compete for attention with the prompt and instructions, which can dilute signal quality. Small retrieval errors or loosely relevant chunks can introduce formatting drift or shift reasoning in subtle ways. The system’s output becomes tightly coupled to retrieval quality.&lt;/p&gt;

&lt;p&gt;Fine-tuning works differently. Instead of injecting large volumes of text at inference time, it embeds patterns and constraints directly into the model during training. The distinction affects how the system behaves under real workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing variable 2: Retraining frequency
&lt;/h3&gt;

&lt;p&gt;The common advice says “use RAG if knowledge changes frequently” and “use fine-tuning if behavior is stable.” But how frequently is “frequently”?&lt;/p&gt;

&lt;p&gt;If your knowledge base changes daily, retraining pipelines may introduce operational friction. Evaluation cycles, dataset versioning, and deployment validation all add delay.&lt;/p&gt;

&lt;p&gt;Data preparation also matters. If your organization lacks structured, versioned, and clean datasets, the hidden cost of &lt;a href="https://www.actian.com/blog/generative-ai/data-preparation-guide-generative-ai-adoption/" rel="noopener noreferrer"&gt;preparing training data&lt;/a&gt; can exceed compute costs. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Math of RAG vs. Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Surface-level comparisons of RAG and fine-tuning often ignore the cost curves that determine long-term viability. In production systems, financial estimations are crucial in architectural decisions. To evaluate RAG vs. fine-tuning realistically, we need to examine three cost layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Token cost and context expansion&lt;/li&gt;
&lt;li&gt;Retrieval infrastructure cost&lt;/li&gt;
&lt;li&gt;Training infrastructure cost&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The cost structure of RAG
&lt;/h3&gt;

&lt;p&gt;RAG systems introduce a recurring operational cost because each query retrieves external information and injects it into the model’s prompt. That additional context is billed on every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context expansion&lt;/strong&gt;&lt;br&gt;
Production RAG systems append around 500 tokens of retrieved context to each query. The provider bills those tokens on every request.&lt;/p&gt;

&lt;p&gt;Using pricing similar to &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.2 at $1.75&lt;/a&gt; per million input tokens, the incremental cost works out as follows:&lt;/p&gt;

&lt;p&gt;Cost per query &lt;br&gt;
500 tokens × $1.75/1,000,000 = $0.000875 per query&lt;/p&gt;

&lt;p&gt;At a small scale, this cost appears negligible. However, because it applies to every query, the total overhead grows linearly with traffic.&lt;/p&gt;

&lt;p&gt;At different traffic levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monthly queries&lt;/th&gt;
&lt;th&gt;Context cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 million&lt;/td&gt;
&lt;td&gt;$8,750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 million&lt;/td&gt;
&lt;td&gt;$43,750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 million&lt;/td&gt;
&lt;td&gt;$87,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is context overhead alone. It does not include output tokens or base prompt tokens. At a sustained scale, what appears flexible and inexpensive becomes a significant recurring expense. &lt;/p&gt;
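
&lt;p&gt;The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch that assumes the figures quoted above (about 500 retrieved tokens per query and $1.75 per million input tokens); the function and constant names are illustrative, not part of any provider SDK.&lt;/p&gt;

```python
# Assumed inputs, taken from the figures quoted in the text above.
RETRIEVED_TOKENS_PER_QUERY = 500
PRICE_PER_MILLION_INPUT_TOKENS = 1.75  # US dollars


def context_cost_per_query(tokens=RETRIEVED_TOKENS_PER_QUERY,
                           price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Dollar cost of the retrieved context on a single request."""
    return tokens * price_per_million / 1_000_000


def monthly_context_cost(monthly_queries,
                         tokens=RETRIEVED_TOKENS_PER_QUERY,
                         price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Context overhead for one month of traffic; linear in query volume."""
    return monthly_queries * tokens * price_per_million / 1_000_000


for queries in (10_000_000, 50_000_000, 100_000_000):
    print(f"{queries:,} queries/month: ${monthly_context_cost(queries):,.2f}")
```

&lt;p&gt;Changing either input, for example a larger retrieved context or a different model price, scales every row of the table by the same factor.&lt;/p&gt;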

&lt;p&gt;&lt;strong&gt;Vector database and retrieval cost&lt;/strong&gt;&lt;br&gt;
Token cost is only one component of RAG costs. RAG also relies on a vector database for semantic search. The system must store, index, and query embeddings efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pinecone.io/pricing/estimate/" rel="noopener noreferrer"&gt;Public pricing&lt;/a&gt; of Pinecone lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage at approximately $0.33 per gigabyte per month&lt;/li&gt;
&lt;li&gt;Read units at approximately $16 per million&lt;/li&gt;
&lt;li&gt;Write units at approximately $4 per million&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, consider a system handling 50 million queries per month, where each query performs a single vector search (assuming a 1,024-dimension vector). That would result in 50 million read operations monthly. If the system also writes approximately six million records per month, the combined read and write activity would bring the total estimated monthly cost to around $1,532.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjb79lejvljq0xezfw1y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjb79lejvljq0xezfw1y.png" alt="Figure 1: Pinecone pricing for 50M vectors" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 200 million queries per month, the total expense rises to $9,000 per month.&lt;/p&gt;

&lt;p&gt;Two RAG systems serving identical traffic can therefore have materially different cost structures depending on &lt;a href="https://dev.to/actiandev/whats-changing-in-vector-databases-in-2026-3pbo"&gt;how the vector database is designed&lt;/a&gt; and optimized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure cost&lt;/strong&gt;&lt;br&gt;
RAG systems require storage and compute infrastructure to generate embeddings, store and index vectors, execute retrieval queries, and run inference. Each of these stages consumes compute resources, typically provisioned through cloud servers that must scale with traffic.&lt;/p&gt;

&lt;p&gt;For real-time or high-throughput applications, additional capacity is required to maintain low latency and system reliability. Replication, autoscaling, monitoring, and failover mechanisms all add operational complexity. These infrastructure layers are essential for production-grade RAG, but they expand the total cost footprint beyond token usage alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost structure of fine-tuning
&lt;/h3&gt;

&lt;p&gt;Fine-tuning introduces a different economic model from RAG systems. Instead of paying incremental costs on every request for external context, you invest upfront to modify the model’s internal behavior.&lt;/p&gt;

&lt;p&gt;That upfront investment can be broken into four primary cost categories: data, training compute, experimentation, and operational maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data preparation costs&lt;/strong&gt;&lt;br&gt;
High-quality labeled data is the foundation of effective fine-tuning. This includes collecting domain-specific examples, cleaning inconsistencies, formatting inputs and outputs correctly, and validating annotation quality.&lt;/p&gt;

&lt;p&gt;In many organizations, &lt;a href="https://www.centage.com/blog/how-to-calculate-the-roi-of-ai-a-guide-for-finance-leaders-2025-edition" rel="noopener noreferrer"&gt;data preparation consumes 20 to 40 percent&lt;/a&gt; of the total fine-tuning budget. Poorly curated data directly degrades model performance, leading to additional retraining cycles and wasted compute. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training compute costs&lt;/strong&gt;&lt;br&gt;
OpenAI lists fine-tuning at roughly $25 per million training tokens for GPT-4.1. A run using 20 million tokens would cost about $500 in direct training fees, with larger datasets or multiple runs increasing this total.&lt;/p&gt;

&lt;p&gt;For self-hosted training, costs depend on model size and hardware. High-performance GPUs such as A100 clusters can cost thousands of dollars per training epoch. Because fine-tuning is rarely a single-pass process, multiple epochs, evaluations, and retraining cycles are common, which further increases the overall cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimentation and validation costs&lt;/strong&gt;&lt;br&gt;
Fine-tuning is an iterative process that requires experimentation with hyperparameters, evaluation against baseline models, and testing across edge cases. These workflows require engineering time, infrastructure, and structured evaluation frameworks. Unlike prompt engineering, fine-tuning introduces a full ML lifecycle, adding ongoing operational overhead.&lt;/p&gt;

&lt;p&gt;This creates a non-linear cost curve. Fine-tuning concentrates cost at the beginning, while marginal cost per request remains relatively stable as traffic grows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmbyk0ospu9xjm8p9td0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmbyk0ospu9xjm8p9td0.png" alt="Figure 2: Non linear cost curve" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether that trade-off is advantageous depends on three variables: query volume, knowledge stability, and retraining frequency. Without modeling those explicitly, cost comparisons between RAG and fine-tuning remain incomplete.&lt;/p&gt;
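
&lt;p&gt;One rough way to model that trade-off is a break-even volume: the monthly query count at which RAG’s context overhead equals the amortized cost of a fine-tuning cycle. The sketch below is deliberately simplified. The $60,000 cycle cost is a hypothetical placeholder (data preparation, training runs, evaluation, and engineering time combined), not a figure from this article, and the model ignores retrieval infrastructure, inference price differences, and any quality difference between the two approaches.&lt;/p&gt;

```python
def amortized_monthly_cost(cost_per_training_cycle, months_between_retrains):
    """Spread one fine-tuning cycle over the months until the next retrain."""
    return cost_per_training_cycle / months_between_retrains


def break_even_monthly_queries(cost_per_training_cycle,
                               months_between_retrains,
                               context_cost_per_query):
    """Monthly volume where RAG context overhead equals amortized fine-tuning cost."""
    monthly = amortized_monthly_cost(cost_per_training_cycle,
                                     months_between_retrains)
    return monthly / context_cost_per_query


# Hypothetical input: total cost of one fine-tuning cycle, in dollars.
cycle_cost = 60_000
# From the text: 500 retrieved tokens at $1.75 per million input tokens.
per_query = 500 * 1.75 / 1_000_000

for months in (1, 6, 12):
    volume = break_even_monthly_queries(cycle_cost, months, per_query)
    print(f"retrain every {months} month(s): "
          f"break-even near {volume:,.0f} queries/month")
```

&lt;p&gt;The shape matters more than the specific numbers: the more often the model must be retrained, the higher the query volume needed before fine-tuning pays for itself.&lt;/p&gt;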

&lt;h2&gt;
  
  
  When RAG Wins
&lt;/h2&gt;

&lt;p&gt;Despite its scaling trade-offs, RAG remains the dominant production choice for a reason. In certain operating conditions, it is structurally more flexible, faster to iterate, and operationally safer than fine-tuning. RAG is suitable in the following scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. When knowledge changes frequently
&lt;/h3&gt;

&lt;p&gt;If your domain knowledge changes weekly or daily, fine-tuning becomes operationally expensive. Dataset updates, retraining, evaluation, and deployment introduce delays that can stretch from hours to weeks depending on governance requirements.&lt;/p&gt;

&lt;p&gt;Teams frequently underestimate the operational overhead of keeping a fine-tuned model synchronized with a rapidly evolving knowledge base. In these environments, RAG shifts the problem from model retraining to data indexing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. When you have extensive unstructured data but limited labeled data
&lt;/h3&gt;

&lt;p&gt;Many organizations possess terabytes of internal documents but lack high-quality supervised datasets. Building labeled training corpora requires annotation workflows, domain experts, and quality validation pipelines. In practice, this often becomes the most expensive part of fine-tuning projects.&lt;/p&gt;

&lt;p&gt;RAG bypasses this constraint by allowing models to operate directly on existing document corpora without constructing large labeled datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. When governance and data residency requirements are strict
&lt;/h3&gt;

&lt;p&gt;Once sensitive information is embedded in model weights, deletion and auditing become difficult. Removing a specific record from a fine-tuned model often requires retraining or maintaining complex dataset lineage.&lt;/p&gt;

&lt;p&gt;RAG architectures avoid this issue by keeping sensitive information in external storage systems where standard governance controls already exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. When query volume is moderate
&lt;/h3&gt;

&lt;p&gt;As shown in the earlier cost analysis, context expansion overhead grows with query volume, reaching approximately $43,750 per month at 50 million queries. At moderate traffic, RAG’s per-request costs are typically lower than the amortized expenses of fine-tuning, including training and ongoing maintenance. This makes RAG an attractive choice for organizations that want high-quality outputs without front-loading infrastructure and compute investments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use cases
&lt;/h3&gt;

&lt;p&gt;Large-scale examples illustrate RAG’s effectiveness at this volume. Notion’s Q&amp;amp;A assistant is &lt;a href="https://www.supervised.news/p/how-notion-is-tackling-one-of-ais" rel="noopener noreferrer"&gt;effectively a large-scale RAG system&lt;/a&gt; over workspace data. The difficult engineering problem was not retrieval itself, but enforcing identity and access controls during retrieval. When a user queries the assistant, the system must ensure the model only retrieves documents that the user is permitted to see. &lt;/p&gt;

&lt;p&gt;LinkedIn leveraged RAG and knowledge graphs to preserve the structure of their support cases. This system retrieved relevant subgraphs rather than isolated text chunks, improving retrieval accuracy by 77.6% and &lt;a href="https://arxiv.org/html/2404.17723v1" rel="noopener noreferrer"&gt;reducing median issue resolution time by 28.6%&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For systems at this scale, RAG combines cost efficiency with flexibility, allowing teams to update knowledge sources rapidly without retraining models, while still delivering high-quality results.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Fine-Tuning Wins
&lt;/h2&gt;

&lt;p&gt;Fine-tuning becomes structurally advantageous under different conditions. These conditions typically involve scale, stability, and behavioral precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. When query volume exceeds 100 million per month
&lt;/h3&gt;

&lt;p&gt;At very high traffic levels (100M+ queries per month), RAG’s per-request context overhead becomes significant. Each query adds hundreds of retrieved tokens that the model processes, causing costs to scale linearly with traffic. Large context windows can also increase latency, reduce throughput, and complicate infrastructure reliability.&lt;/p&gt;

&lt;p&gt;If domain knowledge is relatively stable, fine-tuning can become more efficient. By embedding knowledge directly into the model, organizations avoid repeated retrieval and token costs, leading to more predictable per-query expenses, better consistency, and simpler operations at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. When output structure is critical
&lt;/h3&gt;

&lt;p&gt;Fine-tuned models often excel in tasks that require strict adherence to structure or formal constraints. For example, &lt;a href="https://cosine.sh/" rel="noopener noreferrer"&gt;Cosine&lt;/a&gt;, an AI software engineering assistant that autonomously resolves bugs and builds features, achieved a &lt;a href="https://openai.com/index/gpt-4o-fine-tuning/" rel="noopener noreferrer"&gt;SOTA score of 43.8%&lt;/a&gt; on the SWE-bench Verified benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1j3yqxbcm6pq2cla833.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1j3yqxbcm6pq2cla833.png" alt="Figure 3: SWE-bench leaderboard" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, &lt;a href="https://distyl.ai/" rel="noopener noreferrer"&gt;Distyl&lt;/a&gt; secured the top position on the BIRD-SQL benchmark, widely regarded as the premier evaluation for text-to-SQL performance. Its fine-tuned GPT-4o model reached an execution accuracy of 71.83% on the leaderboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvksid1ptlanijujvvqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvksid1ptlanijujvvqu.png" alt="Figure 4: Execution accuracy leaderboard" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In applications where errors propagate downstream, into financial calculations, automated APIs, or compliance documents, behavioral consistency is mandatory. In these contexts, fine-tuning provides the reliability needed to minimize risk and maintain trust in automated outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. When latency requirements are strict
&lt;/h3&gt;

&lt;p&gt;RAG adds multiple steps to the inference pipeline that increase response time. Each query must go through embedding generation, vector search, and context injection before reaching the model.&lt;/p&gt;

&lt;p&gt;Fine-tuned models skip retrieval entirely. All necessary knowledge and reasoning patterns are internalized, allowing the model to generate outputs immediately. In applications where sub-100ms responses are required, such as live recommendation engines or high-frequency trading systems, removing the retrieval pipeline eliminates a major bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. When deep domain reasoning matters more than freshness
&lt;/h3&gt;

&lt;p&gt;A domain-specific agriculture benchmark study found that &lt;a href="https://cension.ai/blog/ai-rag-fine-tuning-cheaper-hallucinations/" rel="noopener noreferrer"&gt;fine-tuning improved model accuracy from 75% to 81%&lt;/a&gt;, while hybrid systems (fine-tuning + retrieval) reached 86%. Because the dataset focused on specialized agricultural knowledge and reasoning tasks, the improvement primarily reflects stronger domain reasoning, not simply better access to external information.&lt;/p&gt;

&lt;p&gt;In domains such as legal analysis or medical decision support, reasoning patterns can be complex. Fine-tuning enables models to internalize domain expertise rather than rely solely on retrieved context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;While RAG and fine-tuning each have clear advantages, research shows that combining them effectively can produce superior results, but only when done correctly. The RAFT (Retrieval Augmented Fine-Tuning) approach, developed by UC Berkeley, Microsoft, and Meta Research, &lt;a href="https://arxiv.org/pdf/2403.10131" rel="noopener noreferrer"&gt;demonstrates how to do this&lt;/a&gt; in practice.&lt;/p&gt;

&lt;p&gt;RAFT trains a model to operate in an “open-book” setting. It learns to process retrieved context, identify relevant passages, ignore distractors, and cite evidence accurately. Without this explicit training, simply layering RAG on top of a fine-tuned model often fails. For instance, a model fine-tuned on medical reasoning may be handed irrelevant journal articles by the retriever. If it hasn’t learned to filter and prioritize context, the result can be hallucinations or incorrect recommendations.&lt;/p&gt;

&lt;p&gt;RAFT addresses this with a &lt;a href="https://arxiv.org/html/2506.22644v1" rel="noopener noreferrer"&gt;structured 80/20 training split&lt;/a&gt;. 80% of training examples include oracle documents that the model should use, and 20% do not, forcing the model to learn when to trust retrieved data and when to rely on internalized knowledge. This operational detail is crucial for engineers evaluating whether their team can implement a hybrid approach successfully. It is not enough to just combine RAG and fine-tuning. The model must be trained to reason over the retrieved context.&lt;/p&gt;
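
&lt;p&gt;The context-assembly step of that split can be sketched as follows. This is an illustration of the 80/20 idea only, not the RAFT authors’ code: the function name, the dictionary keys, and the choice of three distractors per example are assumptions, and the sketch leaves out how the training answers themselves are written.&lt;/p&gt;

```python
import random


def build_raft_examples(qa_items, distractor_pool, oracle_fraction=0.8,
                        num_distractors=3, seed=0):
    """Attach context documents to each question, RAFT style.

    qa_items: list of dicts with "question", "answer", and "oracle_doc".
    About `oracle_fraction` of the examples keep their oracle document
    alongside distractors; the rest see distractors only.
    """
    rng = random.Random(seed)
    items = list(qa_items)
    rng.shuffle(items)
    n_with_oracle = round(len(items) * oracle_fraction)

    examples = []
    for index, item in enumerate(items):
        # Distractors are drawn from every document except the oracle.
        candidates = [d for d in distractor_pool if d != item["oracle_doc"]]
        context = rng.sample(candidates, num_distractors)
        has_oracle = index in range(n_with_oracle)
        if has_oracle:
            context.append(item["oracle_doc"])
        rng.shuffle(context)  # avoid a fixed position for the oracle
        examples.append({
            "question": item["question"],
            "context": context,
            "answer": item["answer"],
            "has_oracle": has_oracle,
        })
    return examples


# Placeholder data, purely for illustration.
docs = [f"doc{i}" for i in range(10)]
qa = [{"question": f"q{i}", "answer": f"a{i}", "oracle_doc": docs[i]}
      for i in range(10)]
dataset = build_raft_examples(qa, docs)
print(sum(example["has_oracle"] for example in dataset), "of", len(dataset))
```

&lt;p&gt;Shuffling the context keeps the oracle document from always appearing in the same position, so the model has to identify it by content and not by placement.&lt;/p&gt;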

&lt;p&gt;A common and practical pattern is “fine-tune for format, RAG for knowledge.” Fine-tuning shapes the model’s internal behavior, enforcing domain-specific reasoning, output structure, and style. RAG provides dynamic access to external information that changes frequently or is too large to store in the model weights. In healthcare, for example, fine-tuning ensures the model understands medical terminology, follows proper diagnostic reasoning, and formats outputs according to clinical documentation standards. RAG supplements this by retrieving the latest research, newly published treatment guidelines, or patient-specific records, keeping recommendations current without retraining the entire model.&lt;/p&gt;

&lt;p&gt;Similarly, Harvey AI &lt;a href="https://newsletter.himanshuramchandani.co/p/harvey-ai-5b-legal-fine-tuning-case-study" rel="noopener noreferrer"&gt;fine-tuned on 10 billion case law tokens&lt;/a&gt;, but still leverages RAG to handle current cases and updates. This pattern is widely used in other domains too. Legal systems fine-tune for statutory reasoning and citation style, then layer RAG to retrieve the most current case law; finance models fine-tune for portfolio analysis rules, then layer RAG for market updates and regulatory changes. It’s a way to balance the stability of learned behavior with the adaptability of retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Quantified Decision Framework for RAG vs. Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;The question is no longer “Which approach is better?” It is “Under what conditions does each approach make economic and operational sense?”&lt;/p&gt;

&lt;p&gt;Instead of defaulting to architectural preference, evaluate three measurable variables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowledge change frequency&lt;/li&gt;
&lt;li&gt;Monthly query volume&lt;/li&gt;
&lt;li&gt;Infrastructure capability and governance constraints&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When those variables are quantified, the decision becomes far clearer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Measure knowledge volatility
&lt;/h3&gt;

&lt;p&gt;Knowledge change frequency is often the fastest way to eliminate one option. If your domain knowledge changes weekly or daily, RAG is structurally favored. Updating an index is far simpler than retraining a fine-tuned model. The separation between model weights and external data enables real-time data retrieval without redeployment cycles.&lt;/p&gt;

&lt;p&gt;If knowledge remains stable for months at a time, fine-tuning becomes economically viable. Retraining frequency drops, and training cost can be amortized over longer intervals. In these environments, embedding domain-specific knowledge directly into model parameters may reduce long-term inference overhead.&lt;/p&gt;

&lt;p&gt;As a practical threshold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge changes more often than monthly → prioritize RAG&lt;/li&gt;
&lt;li&gt;Knowledge stable for multiple months → evaluate fine-tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Calculate context expansion cost
&lt;/h3&gt;

&lt;p&gt;The next variable is query volume. Large-scale RAG systems append hundreds of tokens to every query, and this context overhead scales linearly with traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantitative triggers&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monthly queries&lt;/th&gt;
&lt;th&gt;Guidance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;10M&lt;/td&gt;
&lt;td&gt;RAG is cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10–50M&lt;/td&gt;
&lt;td&gt;Evaluate fine-tuning vs. RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50–100M&lt;/td&gt;
&lt;td&gt;Fine-tuning or hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;100M&lt;/td&gt;
&lt;td&gt;Fine-tuning or hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
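
&lt;p&gt;For teams that want these triggers in a planning script, the table maps directly onto a threshold lookup. A minimal sketch, assuming the band edges above are treated as inclusive lower bounds (the table does not say which band an exact boundary value belongs to); &lt;code&gt;volume_guidance&lt;/code&gt; is an illustrative name.&lt;/p&gt;

```python
from bisect import bisect_right

# Band edges (monthly queries) and guidance copied from the table above.
VOLUME_BOUNDS = [10_000_000, 50_000_000, 100_000_000]
GUIDANCE = [
    "RAG is cheaper",                # below 10M
    "Evaluate fine-tuning vs. RAG",  # 10M to 50M
    "Fine-tuning or hybrid",         # 50M to 100M
    "Fine-tuning or hybrid",         # above 100M
]


def volume_guidance(monthly_queries):
    """Return the guidance for the band containing `monthly_queries`."""
    return GUIDANCE[bisect_right(VOLUME_BOUNDS, monthly_queries)]


for volume in (2_000_000, 30_000_000, 75_000_000, 250_000_000):
    print(f"{volume:,} queries/month: {volume_guidance(volume)}")
```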

&lt;h3&gt;
  
  
  Step 3: Assess infrastructure maturity
&lt;/h3&gt;

&lt;p&gt;Even if economics favor one approach, infrastructure capability may dictate feasibility.&lt;/p&gt;

&lt;p&gt;RAG requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong data engineering&lt;/li&gt;
&lt;li&gt;Reliable data pipelines&lt;/li&gt;
&lt;li&gt;Efficient vector database architecture&lt;/li&gt;
&lt;li&gt;Observability and monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fine-tuning requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-quality labeled data&lt;/li&gt;
&lt;li&gt;Machine learning expertise&lt;/li&gt;
&lt;li&gt;Compute resource allocation&lt;/li&gt;
&lt;li&gt;Evaluation discipline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When teams ignore their actual capabilities, architecture decisions collapse under scale. Many production failures blamed on “model quality” are really symptoms of immature infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision matrix
&lt;/h2&gt;

&lt;p&gt;The following matrix translates the analysis into practical guidance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Monthly queries&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Knowledge update frequency&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Rationale&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Domain knowledge updates weekly, moderate traffic&lt;/td&gt;
&lt;td&gt;10–50M&lt;/td&gt;
&lt;td&gt;Weekly/Daily&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Immediate indexing and low recurring cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-scale traffic, knowledge stable&lt;/td&gt;
&lt;td&gt;50–100M+&lt;/td&gt;
&lt;td&gt;&amp;lt;1 update/month&lt;/td&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;Avoids recurring context injection, reduces latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured output or code generation required&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;Embeds domain-specific rules and formatting internally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specialized reasoning + frequent updates&lt;/td&gt;
&lt;td&gt;10–50M&lt;/td&gt;
&lt;td&gt;Weekly/Daily&lt;/td&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Combines internalized reasoning with dynamic knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-domain systems with diverse knowledge update cycles&lt;/td&gt;
&lt;td&gt;10–100M&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Fine-tuning stabilizes core domains, RAG handles rapidly changing sources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this matrix, it becomes easier to decide whether to use RAG, fine-tune your LLM, or take the hybrid approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The debate between RAG and fine-tuning is often framed as a binary choice, but the more useful question is “If hybrid systems demonstrably outperform either approach alone, why does industry adoption still overwhelmingly favor RAG?” &lt;/p&gt;

&lt;p&gt;Hybrid requires both ML and data engineering capabilities simultaneously, a combination few organizations have. RAG remains the practical default, offering agility and transparency with less upfront complexity.&lt;/p&gt;

&lt;p&gt;The key takeaway is to choose the architecture that matches your knowledge volatility, query scale, and team capability. For teams exploring enterprise-scale retrieval systems, platforms like Actian &lt;a href="https://www.actian.com/databases/vectorai-db/#waitlist" rel="noopener noreferrer"&gt;VectorAI DB&lt;/a&gt; provide purpose-built vector database capabilities designed for performance and scalability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.com/invite/432A2M63Py" rel="noopener noreferrer"&gt;&lt;em&gt;Join the Discord community&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and learn how Actian fits into your AI strategy.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>rag</category>
    </item>
    <item>
      <title>What 37signals’ Cloud Repatriation Taught Us About AI Infrastructure</title>
      <dc:creator>Offisong Emmanuel</dc:creator>
      <pubDate>Tue, 19 May 2026 19:42:51 +0000</pubDate>
      <link>https://dev.to/actiandev/what-37signals-cloud-repatriation-taught-us-about-ai-infrastructure-2hp</link>
      <guid>https://dev.to/actiandev/what-37signals-cloud-repatriation-taught-us-about-ai-infrastructure-2hp</guid>
      <description>&lt;p&gt;In 2023, &lt;a href="https://37signals.com/" rel="noopener noreferrer"&gt;37signals&lt;/a&gt; announced that it had completely left the public cloud and followed up by publicly documenting its cloud repatriation process, providing one of the clearest real-world examples of on-premises economics at scale. By reversing its cloud migration and shifting workloads to private cloud infrastructure, the company drastically &lt;a href="https://www.datacenterdynamics.com/en/news/37signals-claims-it-saved-almost-2m-last-year-from-cloud-repatriation/" rel="noopener noreferrer"&gt;reduced its annual cloud infrastructure spend by almost $2 million&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The transparency of the numbers made the case compelling. In 2022, &lt;a href="https://dev.37signals.com/our-cloud-spend-in-2022/" rel="noopener noreferrer"&gt;37signals spent $3,201,564 on cloud services&lt;/a&gt;, which is about $266,797 per month. These detailed cost breakdowns, along with published hardware investment and payback timelines, provided a rare look into the financial mechanics of large-scale cloud repatriation.&lt;/p&gt;

&lt;p&gt;For commodity SaaS workloads, the math was clear. But the same logic raises an important question for the next generation of compute-heavy systems: “Does the economic argument extend to AI infrastructure as well?” This article examines that question.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;37signals spent ~$3.2M/year on AWS in 2022.&lt;/li&gt;
&lt;li&gt;After repatriating workloads to their own infrastructure, cloud spend dropped to ~$1.3M by 2024.&lt;/li&gt;
&lt;li&gt;The company invested roughly $700K–$800K in servers and paid them off in under 18 months.&lt;/li&gt;
&lt;li&gt;The same 10-person team still runs the entire infrastructure, with no additional operational overhead.&lt;/li&gt;
&lt;li&gt;The key takeaway is that at a sustained scale, owning infrastructure can be dramatically cheaper than renting it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 37signals Playbook: What Hansson Actually Documented
&lt;/h2&gt;

&lt;p&gt;In 2022, 37signals &lt;a href="https://dev.37signals.com/our-cloud-spend-in-2022/" rel="noopener noreferrer"&gt;spent $3.2 million on AWS&lt;/a&gt;. After &lt;a href="https://world.hey.com/dhh/we-have-left-the-cloud-251760fb" rel="noopener noreferrer"&gt;leaving the cloud in 2023&lt;/a&gt;, its annual costs dropped to approximately $1.3 million by 2024, a &lt;a href="https://www.datacenterdynamics.com/en/news/37signals-claims-it-saved-almost-2m-last-year-from-cloud-repatriation/" rel="noopener noreferrer"&gt;reduction of almost $2 million&lt;/a&gt; per year.&lt;/p&gt;

&lt;p&gt;The transition required a hardware investment of roughly &lt;a href="https://world.hey.com/dhh/the-big-cloud-exit-faq-20274010" rel="noopener noreferrer"&gt;$600,000 in Dell servers&lt;/a&gt;. The company fully recouped the investment in under 18 months, achieving complete payback in the second half of 2023 as their AWS reserved instance contracts expired. From that point forward, the savings flowed directly to operating margin rather than offsetting capital expense.&lt;/p&gt;

&lt;p&gt;To move storage off AWS, 37signals projected $1.5 million in hardware costs and roughly $200,000 per year in operating expenses. This shift replaces a recurring $1.3 million annual cloud storage bill with a one-time capital outlay plus a fraction of the ongoing operating cost. 37signals also revised its projected five-year savings upward, from $7 million to more than $10 million.&lt;/p&gt;
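
&lt;p&gt;The storage trade-off described above can be sketched as simple cumulative-cost arithmetic. This is a minimal illustration, assuming the three figures quoted in this section stay flat over five years; the function name and the five-year horizon are our own choices, and the result covers storage only, not the full $10 million projection.&lt;/p&gt;

```python
def cumulative_cost(upfront, annual, years):
    """Total spend after `years`: one-time outlay plus recurring cost."""
    return upfront + annual * years


YEARS = 5  # assumed horizon, chosen to match the five-year projection

# Figures quoted above; holding them flat for five years is an assumption.
cloud = cumulative_cost(0, 1_300_000, YEARS)        # recurring cloud storage bill
owned = cumulative_cost(1_500_000, 200_000, YEARS)  # hardware plus operating expenses

print(f"cloud: ${cloud:,}  owned: ${owned:,}  difference: ${cloud - owned:,}")
```

&lt;p&gt;Under these assumptions, the owned option costs more in year one and pulls ahead in year two, which is the general shape of every repatriation case discussed in this article.&lt;/p&gt;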

&lt;h3&gt;
  
  
  37signals cloud exit financials by year
&lt;/h3&gt;

&lt;p&gt;To illustrate the financial impact of 37signals’ cloud exit over time, the table below breaks down annual &lt;a href="https://www.actian.com/blog/databases/when-to-choose-on-premises-vs-cloud-for-vector-databases/" rel="noopener noreferrer"&gt;cloud spending, on-premises hardware investments&lt;/a&gt;, and operating costs, highlighting the resulting net savings and key operational notes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Year&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Cloud spend&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Hardware investment&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Operating costs&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022 Baseline&lt;/td&gt;
&lt;td&gt;~$3.2M&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Included in cloud spend&lt;/td&gt;
&lt;td&gt;Full cloud dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2023 Migration&lt;/td&gt;
&lt;td&gt;~$2M&lt;/td&gt;
&lt;td&gt;~$700–800K&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Hardware fully recouped in under 18 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2024+ Post-repatriation&lt;/td&gt;
&lt;td&gt;~$1.3M&lt;/td&gt;
&lt;td&gt;~$1.5M (storage)&lt;/td&gt;
&lt;td&gt;~$200K/year&lt;/td&gt;
&lt;td&gt;~$1.9M annual savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025+&lt;/td&gt;
&lt;td&gt;Minimal AWS dependency&lt;/td&gt;
&lt;td&gt;~$1.5M (Pure Storage, 18PB)&lt;/td&gt;
&lt;td&gt;~$200K/year&lt;/td&gt;
&lt;td&gt;$10M+ projected 5-year savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notably, the migration did not require expanding the operations team. A 10-person infrastructure team handled the entire repatriation without adding new staff. Addressing a common concern about operational overhead, &lt;a href="https://world.hey.com/dhh/our-cloud-exit-savings-will-now-top-ten-million-over-five-years-c7d9b5bd" rel="noopener noreferrer"&gt;37signals co-founder David Heinemeier Hansson noted&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We've been out for just over a year now, and the team managing everything is still the same. There were no hidden dragons of additional workload associated with the exit that required us to balloon the team, as some spectators speculated when we announced it. All the answers in our &lt;a href="https://world.hey.com/dhh/the-big-cloud-exit-faq-20274010" rel="noopener noreferrer"&gt;Big Cloud Exit FAQ&lt;/a&gt; continue to hold.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This directly challenges the common assumption that moving away from public cloud environments inevitably requires a significantly larger infrastructure team.&lt;/p&gt;

&lt;p&gt;Execution followed a “criticality ladder” strategy in which the team migrated lower-risk services first and more critical ones later. The team moved the HEY email system in stages, starting with caching, then the database, and finally job services. To minimize risk, they colocated infrastructure approximately one millisecond from the AWS region to preserve rollback capability during the cloud repatriation process. After stabilizing the system, they replaced managed services with substantial recurring costs, including RDS and managed Elasticsearch, which together exceeded $500,000 annually.&lt;/p&gt;

&lt;p&gt;What makes 37signals' case study consequential is the publicly documented cost efficiency. For organizations questioning long-term &lt;a href="https://www.actian.com/blog/cloud-data-warehouse/will-cloud-data-warehouses-really-help-you-cut-costs/" rel="noopener noreferrer"&gt;cloud adoption assumptions&lt;/a&gt;, particularly with regard to storage costs and managed services, the 37signals documentation provides a rare baseline for comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Infrastructure Economics Are Even More Extreme
&lt;/h2&gt;

&lt;p&gt;The lessons from 37signals’ cloud repatriation take on a sharper edge when applied to AI infrastructure. Higher GPU costs, predictable inference workloads, massive embedding storage, and stricter data regulations create financial and operational pressures that amplify the advantages of &lt;a href="https://www.actian.com/on-premises-data/" rel="noopener noreferrer"&gt;on-premises&lt;/a&gt; or hybrid cloud solutions that allow you to move workloads where they make the most sense. Below, we break down the key drivers.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI infrastructure cost comparison
&lt;/h3&gt;

&lt;p&gt;To evaluate the cost implications of different AI infrastructure approaches, the table below compares upfront setup costs, monthly operating expenses at varying workloads, and expected break-even timelines for cloud, on-premises, and hybrid configurations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Setup cost&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Monthly cost&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Break-even&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud GPU rental (AWS/Azure)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$2,900–3,500 (8h/day × $4–8/hour × 15 days)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud inference APIs (Lambda Labs)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$1,800–2,500 (8h/day × $3.67/hour × 15 days)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted GPU (8×H100 server)&lt;/td&gt;
&lt;td&gt;$200K–400K&lt;/td&gt;
&lt;td&gt;$1,500–2,000 (power + maintenance)&lt;/td&gt;
&lt;td&gt;&amp;lt;12 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid (Cloud training + On-Prem)&lt;/td&gt;
&lt;td&gt;$200K–400K&lt;/td&gt;
&lt;td&gt;Training only, inference minimal&lt;/td&gt;
&lt;td&gt;&amp;lt;12 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: For cloud GPU rental, we estimate monthly cost assuming eight hours/day per GPU. The cost scales linearly with utilization; it is not directly per-query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. GPU cloud markups are high
&lt;/h3&gt;

&lt;p&gt;AI workloads depend heavily on GPUs, and cloud providers charge far steeper premiums for GPU capacity than for typical CPU compute. On-demand AWS P5 instances with H100 GPUs cost roughly $4–8 per GPU-hour, while comparable Azure H100 instances are about $3.67 per hour. By contrast, spot markets and alternative providers such as &lt;a href="https://lambda.ai/" rel="noopener noreferrer"&gt;Lambda Labs&lt;/a&gt; offer similar GPU capacity for $1–2 per hour, or $1.85–2.49 per hour with reserved commitments. &lt;/p&gt;

&lt;p&gt;Measured against the $1-per-hour low end of the spot and specialized GPU cloud market, on-demand hyperscaler GPU capacity carries a 4–8× markup. In other words, the premium cloud providers charge for high-end AI compute is significantly larger than typical CPU cloud markups. For organizations running sustained inference workloads, this pricing gap quickly becomes the dominant cost driver in AI infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Predictable inference makes GPU ownership economical
&lt;/h3&gt;

&lt;p&gt;High GPU rental pricing makes outright purchase worth evaluating. A single H100 GPU costs roughly $25K–40K, while a complete 8×H100 server costs $200K–400K. &lt;a href="https://lenovopress.lenovo.com/lp2225-on-premise-vs-cloud-generative-ai-total-cost-of-ownership-2025-edition" rel="noopener noreferrer"&gt;Lenovo’s analysis&lt;/a&gt; shows that six or more hours of sustained daily usage reaches payback against AWS within the first year.&lt;/p&gt;

&lt;p&gt;The reason this break-even arrives so quickly is that AI inference workloads are unusually predictable. Unlike SaaS traffic which fluctuates throughout the day, production AI systems such as recommendation engines tend to process steady volumes of requests.&lt;/p&gt;

&lt;p&gt;Predictability changes the economics. When infrastructure runs at consistent utilization, owned hardware can be amortized efficiently across the workload. Paying cloud premiums for burst capacity that teams rarely use becomes unnecessary.&lt;/p&gt;

&lt;p&gt;For organizations running inference continuously, the hardware investment is often recouped in under 12 months. From that point forward, the savings follow the same pattern documented by 37signals: fixed infrastructure replaces an ongoing rental bill.&lt;/p&gt;
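
&lt;p&gt;The payback reasoning above can be sketched as a small calculation. This is a rough illustration, not a pricing model: the function names are our own, the inputs are midpoints or bounds of the ranges quoted in this article, and it assumes an 8-GPU server running around the clock at a flat on-demand rate. Real results depend heavily on the actual rate, utilization, and financing.&lt;/p&gt;

```python
def cloud_monthly_cost(gpus, rate_per_gpu_hour, hours_per_day, days_per_month=30):
    """Rental bill for a fleet billed per GPU-hour."""
    return gpus * rate_per_gpu_hour * hours_per_day * days_per_month


def payback_months(hardware_cost, cloud_monthly, owned_monthly):
    """Months until cumulative rental savings cover the hardware purchase."""
    monthly_savings = cloud_monthly - owned_monthly
    # If owning is not cheaper per month than renting, there is no payback.
    if max(monthly_savings, 0.0) == 0.0:
        return float("inf")
    return hardware_cost / monthly_savings


# Illustrative inputs (assumptions drawn from the ranges quoted in this article).
HARDWARE_COST = 300_000    # 8xH100 server, midpoint of $200K to $400K
RATE_PER_GPU_HOUR = 6.0    # midpoint of the $4 to $8 on-demand range
OWNED_MONTHLY = 2_000      # power and maintenance, upper bound of $1,500 to $2,000

rent = cloud_monthly_cost(gpus=8, rate_per_gpu_hour=RATE_PER_GPU_HOUR, hours_per_day=24)
months = payback_months(HARDWARE_COST, rent, OWNED_MONTHLY)
print(f"rent: ${rent:,.0f}/month, payback in {months:.1f} months")
```

&lt;p&gt;With these inputs the rental bill is about $34,560 per month and payback lands at roughly nine months. Lowering the hourly rate or the daily hours stretches the payback period quickly, which is why sustained utilization is the deciding variable.&lt;/p&gt;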

&lt;h3&gt;
  
  
  3. Embedding storage requirements are massive
&lt;/h3&gt;

&lt;p&gt;Even if GPU compute were optimized, AI systems introduce another rapidly growing cost layer: embedding storage. &lt;a href="https://dev.to/actiandev/whats-changing-in-vector-databases-in-2026-3pbo"&gt;Vector databases&lt;/a&gt; store high-dimensional embeddings used for search, retrieval, and recommendation. As datasets scale into millions or billions of records, storage requirements expand quickly.&lt;/p&gt;

&lt;p&gt;For instance, 10 million vectors at 1,536 dimensions require at least 58GB of raw storage, and often 200–300GB once indexes and metadata are included. Managed vector database services like Pinecone charge $0.33/GB/month, meaning 500GB could cost $165/month before any queries. Self-hosted solutions like PostgreSQL with pgvector dramatically reduce cloud spending while keeping sensitive data under direct control. Over time, these storage requirements compound infrastructure costs alongside GPU compute, further reinforcing the economic advantages of self-hosted or hybrid architectures.&lt;/p&gt;
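
&lt;p&gt;The sizing above follows from straightforward arithmetic. The sketch below assumes each dimension is stored as a 32-bit float (4 bytes), which is a common default but an assumption here; the function names are illustrative. Depending on whether you count in decimal gigabytes or binary gibibytes, the raw figure comes out slightly above or below the 58GB quoted above.&lt;/p&gt;

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of dense embeddings, excluding indexes and metadata."""
    return num_vectors * dimensions * bytes_per_value


def monthly_storage_cost(gigabytes, price_per_gb_month):
    """Flat per-GB storage charge, before any query costs."""
    return gigabytes * price_per_gb_month


# float32 assumed: 4 bytes per dimension.
raw = raw_vector_bytes(num_vectors=10_000_000, dimensions=1_536)
print(f"raw vectors: {raw / 1e9:.1f} GB ({raw / 2**30:.1f} GiB)")

# Storage price quoted above; 500 GB is the example volume from the text.
print(f"500 GB at $0.33/GB/month: ${monthly_storage_cost(500, 0.33):,.2f}")
```

&lt;p&gt;Halving the precision (for example, 2 bytes per value) halves the raw footprint, so quantization choices matter as much as vector count when estimating storage.&lt;/p&gt;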

&lt;h3&gt;
  
  
  4. Data sovereignty and compliance favor on-premises deployment
&lt;/h3&gt;

&lt;p&gt;Data residency and compliance are growing priorities as AI becomes more heavily regulated. Notably, the EU AI Act introduced strict rules for AI systems, and its prohibitions on certain AI use cases took effect in February 2025. On-premises deployment can simplify compliance by keeping data under direct control.&lt;/p&gt;

&lt;p&gt;For financial organizations navigating complex regulatory environments, solutions like &lt;a href="https://www.actian.com/data-intelligence/platform/" rel="noopener noreferrer"&gt;Actian’s Data Intelligence platform&lt;/a&gt; help enforce data governance and streamline compliance workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Infrastructure Case Studies 37signals Validated
&lt;/h2&gt;

&lt;p&gt;The financial transparency of 37signals’ cloud exit was unusual, but the repatriation itself was not an isolated occurrence. It was part of a growing trend among organizations trying to regain cost control and optimize their cloud infrastructure. Several high-profile case studies illustrate the scale and economics of moving workloads back from public clouds to owned or hybrid infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dropbox
&lt;/h3&gt;

&lt;p&gt;Dropbox pioneered enterprise cloud repatriation as early as 2015, completing the migration between 2016 and 2018. &lt;a href="https://www.wired.com/2016/03/epic-story-dropboxs-exodus-amazon-cloud-empire/" rel="noopener noreferrer"&gt;The company moved roughly 90% of customer data&lt;/a&gt;, reportedly over 500 petabytes, off AWS to three owned colocation facilities. The infrastructure investment totaled $53 million, yet Dropbox reported $74.6 million in operational savings over two years per its &lt;a href="https://www.sec.gov/Archives/edgar/data/1467623/000119312518055809/d451946ds1.htm" rel="noopener noreferrer"&gt;2018 S‑1 filing&lt;/a&gt;. A small portion of workloads, primarily European customers and specialized services, remain in AWS. Internally, the initiative was known as “Magic Pocket,” and it exemplifies how a well-executed &lt;a href="https://www.actian.com/hybrid-cloud-architecture/" rel="noopener noreferrer"&gt;hybrid cloud approach&lt;/a&gt; can deliver substantial savings while aligning with long-term business objectives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ahrefs
&lt;/h3&gt;

&lt;p&gt;Ahrefs, the SEO tools company, relied on a Singapore colocation setup with 850 servers. Its reported savings from avoiding public cloud were approximately $400 million over 2.5 years. The actual infrastructure cost was $39.5 million for 850 servers (~$1,500/server/month), versus an estimated $447.7 million if hosted entirely on AWS (~$17,557/server/month equivalent). As &lt;a href="https://tech.ahrefs.com/how-ahrefs-saved-us-400m-in-3-years-by-not-going-to-the-cloud-8939dd930af8#:~:text=Ahrefs%20wouldn%E2%80%99t%20be%20profitable%2C%20or%20even%20exist%2C%20if%20our%20products%20were%20100%25%20on%20AWS." rel="noopener noreferrer"&gt;Ahrefs put it&lt;/a&gt;: “We wouldn’t be profitable, or even exist, if our products were 100% on AWS.” While critics argue that Ahrefs inflated AWS estimates, the directional savings were undeniable, illustrating that cloud repatriation challenges can be surmounted at scale with careful planning.&lt;/p&gt;
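
&lt;p&gt;The per-server figures above can be reproduced from the totals. The short check below assumes 2.5 years means 30 months and that all 850 servers ran for the whole period; the function name is our own.&lt;/p&gt;

```python
def per_server_month(total_cost, servers, months):
    """Average cost of one server for one month."""
    return total_cost / (servers * months)


SERVERS = 850
MONTHS = 30  # 2.5 years, assuming the full fleet ran the whole time

owned = per_server_month(39_500_000, SERVERS, MONTHS)
aws = per_server_month(447_700_000, SERVERS, MONTHS)

print(f"colocation: ${owned:,.0f}/server/month")
print(f"AWS estimate: ${aws:,.0f}/server/month")
print(f"total difference: ${447_700_000 - 39_500_000:,}")
```

&lt;p&gt;The totals differ by about $408 million, which is where the rounded $400 million savings figure comes from.&lt;/p&gt;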

&lt;h3&gt;
  
  
  GEICO
&lt;/h3&gt;

&lt;p&gt;GEICO spent a decade migrating to multiple cloud providers, only for its costs to climb and exceed projections by 2.5×, reaching $300 million by 2022 across eight providers. In response, GEICO began moving workloads to a private cloud using OpenStack and Kubernetes, targeting over 50% repatriation by 2029. &lt;a href="https://youtu.be/fKd5j6A7Nt0?si=ZISpHuQkRXpvdM9M" rel="noopener noreferrer"&gt;Early results&lt;/a&gt; show a 50% reduction in compute costs and &lt;a href="https://www.thestack.technology/insurer-slashes-compute-costs-with-cloud-repatriation-shift-to-ocp-but/" rel="noopener noreferrer"&gt;a 60% reduction in per-gigabyte storage costs&lt;/a&gt; compared with public cloud services, demonstrating how a hybrid cloud architecture can deliver efficiency, compliance, and alignment with long-term business objectives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Akamai
&lt;/h3&gt;

&lt;p&gt;Akamai was on the &lt;a href="https://www.linkedin.com/posts/tleighton_how-akamai-regained-control-of-its-runaway-activity-7282843623774142465-LR4c/" rel="noopener noreferrer"&gt;path to spending over $100 million&lt;/a&gt; on third-party cloud services before migrating compute workloads to its own global edge network of 350,000+ servers. The migration delivered savings of roughly $100 million per year, a testament to the economics of repatriation when existing infrastructure and scale align.&lt;/p&gt;

&lt;p&gt;What these cases share is the same economic pattern documented by 37signals. Predictable, high-volume workloads eventually become cheaper to run on owned infrastructure than on hyperscaler clouds. &lt;/p&gt;

&lt;p&gt;These examples reflect a broader shift occurring across enterprise infrastructure strategies. &lt;a href="https://web.archive.org/web/20241123193805/https://substack-post-media.s3.us-east-1.amazonaws.com/post-files/152070619/7ea0372b-e710-4fcb-a41f-7f928549fb3d.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;amp;X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&amp;amp;X-Amz-Credential=ASIAUM3FPD6B2WS5FXT4%2F20241123%2Fus-east-1%2Fs3%2Faws4_request&amp;amp;X-Amz-Date=20241123T193805Z&amp;amp;X-Amz-Expires=3600&amp;amp;X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEIaCXVzLWVhc3QtMSJHMEUCIQC2kR555pFI8HmlydaNacbd9RPRu2BYjsS6BUtYIwpVmQIgIC5k5JRn0xlVZB2aZhNi0dpsodMbVuy0UiAYVgzoqYoqggQI2%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAEGgwzMDI0NzExMjY5MTUiDOeTbKFwEqOxR%2BttZCrWAwxAcmObVtjWpEZD5WO6PpOoZ3AHJwHU2Fi42m%2F6dZOWpMKZKr0gIvgXWjdexEN5NUFa5x2r4PiXbC%2FBEizoa4NEuUUVNHGSTQ9FB0rISgGfcqmZCi6L8aIT5IV%2FBRmwQQIwtZIJxVsDXdofktzdaeiMAa1DuA8ZWksfr2u936yuChersIvuBMcMw8izbrAbSL93AKYQhg5oxeBjagNc8%2BWKw1QLpkWeX4vonr%2F4Ili%2BuEJrchDRfQqJUkXK0vAi2zmeyRsNzA0OmhY3MKxZoanypr3D9v02mCrYmzvKTFd%2B0IfHAjDSTe%2FiUTfPxWnr%2BItyA9LNWyX7Fs4eYDPK6%2FvpVTEOlbkQWqUy5ukBPL6UDmmAT65x%2B7Ofhpr1CEMlBI8CiGc4gfbg4XBScKijmyp%2FTk%2FZD%2F%2BbzTohH8sxaRXot99gMt4DP2OQ7S5LJAoG5s6FsoD%2B5%2FeD0%2F2Gn6m4eIDmyh1LMwuKl99whqJTjybFAxGuK7SBfAtpz6%2Bg4e9VsA1eRFX8ckijgD80jFJOR7yLTFGxKNxmyoPgy2gAZWCgWO3Sp1Q0ixxphaXpg4v2F5CIqljcQwEMeUB9Ps%2F6iyHhtAUBANNmfpLqxlBMcM8Soo9GmXFOMNefiLoGOqUB58p6MbetrkYFc2LtsI9NvYH80vCh4fjHwaVTkvDIN5jmgiphqA6PwLPjv3It1nogoqYtk6snmGd58j%2FUNPkXYQpzQLp2olOZY1Ukb9HRgbYkcowP71cpDDTEL%2FryKutpNgx6WFBUhZmiJZL6t%2FGCXMNi8RbkC4dnr8mgTw3yburAGSK6mNKeLPLxjGlN%2FgOqlhjMX4GhWn%2BDuUpMzhNyvluMHFY6&amp;amp;X-Amz-Signature=c2a04d69b332738b503a91db32b88fa97ce966e59f4b536cc5dbbec0c7f15c78&amp;amp;X-Amz-SignedHeaders=host&amp;amp;response-content-disposition=attachment%3B%20filename%3D%22Barclays_Technology_2h24_Cio_Survey_2024.pdf%22&amp;amp;x-id=GetObject" rel="noopener noreferrer"&gt;Barclays’ Chief Information Officers (CIO) 
surveys&lt;/a&gt; show cloud repatriation trending upward in recent years, with sentiment peaking in the second half of 2024, when 86% of CIOs planned some form of repatriation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq188cccw2o7uu7s1gq2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq188cccw2o7uu7s1gq2q.png" alt="Figure 1: Barclays CIO survey showing 86% of CIOs planning cloud repatriation" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, this statistic does not mean that companies are abandoning public cloud environments completely. &lt;a href="https://www.idc.com/resource-center/blog/storm-clouds-ahead-missed-expectations-in-cloud-computing/#:~:text=only%208%2D9%25%20of%20companies%20plan%20full%20workload%20repatriation" rel="noopener noreferrer"&gt;According to IDC&lt;/a&gt;, only 8–9% of companies favor full repatriation with most preferring a hybrid approach that combines public and private clouds. Hybrid cloud infrastructure allows organizations to optimize workload placement by strategically allocating sensitive data and mission-critical applications on-premises while leveraging public cloud services for less critical workloads. As such, it has become increasingly important for teams exploring similar transitions to understand the nuances of &lt;a href="https://www.actian.com/blog/cloud-data-warehouse/hybrid-cloud-benefits-and-risks/" rel="noopener noreferrer"&gt;hybrid deployments and their associated risks.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Repatriation Statistics
&lt;/h2&gt;

&lt;p&gt;Cloud repatriation is accelerating at the same time as &lt;a href="https://hostingjournalist.com/news/idc-public-cloud-spending-to-pass-1-trillion-in-2026" rel="noopener noreferrer"&gt;public cloud spending keeps climbing&lt;/a&gt;. IDC projects global public cloud spend will reach $1.6 trillion in 2028, doubling from their 2024 prediction. Yet as mentioned earlier, 86% of CIOs are planning some form of repatriation &lt;a href="https://www.databank.com/resources/blogs/why-86-of-cios-are-rethinking-their-cloud-strategy/" rel="noopener noreferrer"&gt;according to Barclays&lt;/a&gt;. Both trends can be true because this is not a cloud exodus so much as a rebalancing. Enterprises are leaning towards a hybrid cloud model.&lt;/p&gt;

&lt;p&gt;AI is likely to accelerate that shift. AI workloads account for less than 10% of total cloud compute today, but &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-05-13-gartner-identifies-top-trends-shaping-the-future-of-cloud#:~:text=Gartner%20predicts%2050%25%20of%20cloud%20compute%20resources%20will%20be%20devoted%20to%20AI%20workloads%20by%202029" rel="noopener noreferrer"&gt;Gartner projects&lt;/a&gt; that this figure will approach 50% by 2029. Hyperscalers are responding with enormous capital investment: an &lt;a href="https://introl.com/blog/hyperscaler-capex-600b-2026-ai-infrastructure-debt-january-2026" rel="noopener noreferrer"&gt;estimated $600 billion in infrastructure spend&lt;/a&gt; in 2026, roughly three-quarters of it tied to AI. The assumption is clear: enterprises will rent that GPU capacity. But the 37signals math suggests that once AI workloads move from experimentation to steady production, ownership economics begin to dominate.&lt;/p&gt;

&lt;p&gt;Cost pressure is already driving behavior. &lt;a href="https://info.flexera.com/CM-REPORT-State-of-the-Cloud" rel="noopener noreferrer"&gt;Flexera reports that 27% of cloud resources are wasted&lt;/a&gt; or underutilized, and &lt;a href="https://www.flexera.com/about-us/press-center/new-flexera-report-finds-84-percent-of-organizations-struggle-to-manage-cloud-spend" rel="noopener noreferrer"&gt;21% of workloads have already been repatriated&lt;/a&gt;. The primary reason cited is cost exceeding projections, followed by performance concerns. With GPUs, the margin for inefficiency is thinner. There are fewer optimization levers, higher hourly rates, and faster budget burn.&lt;/p&gt;

&lt;p&gt;Regulation adds another layer. The EU AI Act, DORA for financial services, China’s PIPL, and India’s DPDP are tightening data governance requirements. &lt;a href="https://www.mimecast.com/blog/why-data-sovereignty-is-now-a-dealbreaker-in-cybersecurity/" rel="noopener noreferrer"&gt;Mimecast reports&lt;/a&gt; that 87% of organizations now factor data sovereignty into vendor decisions. For AI systems, sovereignty extends beyond data location to model provenance, audit trails, and compliance documentation. On-premises deployment does not eliminate regulatory complexity, but it centralizes control, and for many enterprises that simplicity is becoming strategically attractive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey79t7jet319mgg6zzjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey79t7jet319mgg6zzjo.png" alt="Figure 2: A bar chart showing why enterprises repatriate" width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counter-Arguments and When Cloud Providers Win
&lt;/h2&gt;

&lt;p&gt;Not all observers agree that cloud repatriation is the best path for every organization. Public cloud environments still deliver value in certain circumstances. But the arguments in their favor often weaken when applied to AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  When cloud wins vs. when on-premises wins
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Component&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Cloud advantage&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;On-prem advantage&lt;/strong&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload predictability&lt;/td&gt;
&lt;td&gt;Handles spiky or unpredictable workloads&lt;/td&gt;
&lt;td&gt;Predictable workloads cheaper to self-host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team expertise&lt;/td&gt;
&lt;td&gt;Requires minimal in-house infrastructure skill&lt;/td&gt;
&lt;td&gt;Strong IT teams can optimize and reduce vendor reliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale and growth&lt;/td&gt;
&lt;td&gt;Rapid scaling and global expansion&lt;/td&gt;
&lt;td&gt;Predictable growth enables cost-efficient hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory requirements&lt;/td&gt;
&lt;td&gt;Managed compliance, geo-redundancy&lt;/td&gt;
&lt;td&gt;Direct control simplifies regulatory alignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost and margins&lt;/td&gt;
&lt;td&gt;Pay-as-you-go reduces upfront spend&lt;/td&gt;
&lt;td&gt;Long-term savings from owned infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service quality&lt;/td&gt;
&lt;td&gt;Cloud SLAs ensure availability and performance&lt;/td&gt;
&lt;td&gt;Dedicated resources guarantee predictable uptime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Cloud “wrong usage” argument
&lt;/h3&gt;

&lt;p&gt;Jeremy Daly, a serverless advocate, &lt;a href="https://www.jeremydaly.com/the-cloud-isnt-the-issue-youre-just-using-it-wrong/" rel="noopener noreferrer"&gt;argues that&lt;/a&gt; “37signals was using the cloud wrong.” By treating cloud environments as virtual colocation, running VMs and Kubernetes, they were paying cloud premiums without capturing the value of serverless, managed services, and instant scaling. As Daly notes, “In the cloud, we should be renting services, not servers.” &lt;/p&gt;

&lt;p&gt;For SaaS workloads with highly variable or spiky traffic, this argument is compelling. Serverless infrastructure allows organizations to scale instantly and pay only for the compute they actually use. &lt;/p&gt;

&lt;p&gt;However, AI inference workloads often behave very differently. Production inference systems, such as recommendation models, copilots, and document processing pipelines, tend to run at steady, sustained utilization rather than unpredictable bursts. In these cases, the economic advantage of elastic cloud scaling diminishes. The premium paid for burst capacity still exists, but the workload itself rarely needs that burst capacity.&lt;/p&gt;

&lt;p&gt;Daly’s argument therefore holds for variable SaaS workloads, where elasticity is critical. For sustained AI inference workloads running at high utilization, paying a premium for burst capacity that is rarely used can make dedicated infrastructure or hybrid deployments more cost-efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full cost critique
&lt;/h3&gt;

&lt;p&gt;Some critics also question the financial assumptions behind 37signals’ approach. &lt;a href="https://wrld.tech/cloud-vs-on-premise-storage-a-cost-comparison-guide/#:~:text=Cloud%20storage%20is%20cost%2Deffective,%2C%20budget%2C%20and%20data%20requirements." rel="noopener noreferrer"&gt;They point out&lt;/a&gt; that hardware and software normally account for only about 20% of IT costs, with the remainder covering electricity, cooling, physical security, racking, uninterruptible power supplies (UPS), and opportunity costs. David Heinemeier Hansson’s analysis did not include all of these overheads because 37signals used colocation facilities rather than fully owned data centers. Even so, considering 37signals’ figures, it is reasonable to conclude that renting colocation space can still be far cheaper than relying on cloud services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Competence vs. growth framework
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://newsletter.goodtechthings.com/p/wait-is-cloud-bad" rel="noopener noreferrer"&gt;Forrest Brazeal’s IT competence versus growth aspirations framework&lt;/a&gt; provides additional nuance. He places 37signals in the &lt;em&gt;High Competence/Low Growth&lt;/em&gt; quadrant, ideal for self-hosting. &lt;em&gt;“Not every company has the competence (high) or growth aspirations (low) of 37signals,”&lt;/em&gt; he observes. Startups with uncertain or spiky workloads benefit from cloud flexibility, but AI companies running production inference at scale often combine high operational competence with steady growth. Such profiles (steady growth &amp;amp; high competence) are well suited to repatriation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying the Playbook to AI Infrastructure
&lt;/h2&gt;

&lt;p&gt;If 37signals provided the economic blueprint, AI infrastructure makes the economics more concrete. The decision is no longer abstract. It becomes a structured assessment grounded in workload behavior, utilization, and regulatory exposure.&lt;/p&gt;

&lt;p&gt;A practical four-question framework helps translate the 37signals logic into AI terms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is your inference workload predictable and sustained?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Unlike SaaS traffic spikes, most production AI systems such as recommendation engines, RAG pipelines, or fraud detection models process steady volumes with gradual growth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Are projected GPU utilization rates above 60–70%?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
At this threshold, owned hardware amortization typically undercuts public cloud GPU pricing within the first year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Are you processing more than 10–50 million queries per month?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
At this scale, per-token and per-query pricing from cloud APIs compound rapidly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do you face data sovereignty or strict compliance requirements?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For financial services, healthcare, or government workloads, regulatory mandates can tilt the decision toward controlled environments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer is “yes” to three or four of these, the repatriation economics tend to favor on-premises deployment for production inference.&lt;/p&gt;
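
&lt;p&gt;The four questions can be turned into a quick screening check. The sketch below is only an illustration: &lt;code&gt;WorkloadProfile&lt;/code&gt;, &lt;code&gt;count_yes_answers&lt;/code&gt;, and &lt;code&gt;favors_repatriation&lt;/code&gt; are hypothetical names, and the thresholds assume the lower bounds quoted above (60% GPU utilization and 10 million queries per month).&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical summary of an inference workload (illustrative fields)."""
    predictable_and_sustained: bool
    gpu_utilization: float      # projected fraction between 0.0 and 1.0
    queries_per_month: int
    strict_data_residency: bool


def count_yes_answers(profile: WorkloadProfile) -> int:
    """Count how many of the four screening questions are answered 'yes'."""
    answers = [
        profile.predictable_and_sustained,
        profile.gpu_utilization >= 0.60,          # lower bound of the 60-70% range
        profile.queries_per_month >= 10_000_000,  # lower bound of the 10-50M range
        profile.strict_data_residency,
    ]
    return sum(answers)


def favors_repatriation(profile: WorkloadProfile) -> bool:
    """Three or four 'yes' answers tilt the economics toward on-premises."""
    return count_yes_answers(profile) >= 3


# Example: a steady RAG service with no residency constraints.
rag_service = WorkloadProfile(True, 0.72, 25_000_000, False)
print(count_yes_answers(rag_service), favors_repatriation(rag_service))
```

&lt;p&gt;A check like this is a first filter, not a decision. It says nothing about staffing, hardware lead times, or contract terms, which still need their own analysis.&lt;/p&gt;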

&lt;h3&gt;
  
  
  Decision matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;
&lt;strong&gt;Workload stage&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Recommended environment&lt;/strong&gt;
&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Rationale&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model training&lt;/td&gt;
&lt;td&gt;Public cloud&lt;/td&gt;
&lt;td&gt;Compute-intensive; cloud GPUs handle burst workloads cost-effectively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experimentation and prototyping&lt;/td&gt;
&lt;td&gt;Public cloud&lt;/td&gt;
&lt;td&gt;Flexible, fast provisioning for early-stage iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production inference&lt;/td&gt;
&lt;td&gt;On-premises / Hybrid&lt;/td&gt;
&lt;td&gt;Steady workloads; owned hardware cheaper at 60–70%+ GPU utilization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector storage (embeddings)&lt;/td&gt;
&lt;td&gt;On-premises&lt;/td&gt;
&lt;td&gt;Reduces recurring managed-service costs and ensures data control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The hybrid AI pattern
&lt;/h3&gt;

&lt;p&gt;In practice, most AI organizations adopt a hybrid model rather than an all-or-nothing shift. Training remains in the cloud. Inference moves closer to owned infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lenovopress.lenovo.com/lp2225-on-premise-vs-cloud-generative-ai-total-cost-of-ownership-2025-edition#:~:text=will%20run%20you%20over%20%24483%20M%20in%20cloud%20costs" rel="noopener noreferrer"&gt;Lenovo documented&lt;/a&gt; that training Llama 3.1 at hyperscale (39.3 million GPU hours) in the cloud would exceed $483 million. That type of elastic, short-term scale is exactly where public cloud excels. Inference is different. Once a model is trained, serving it for three to five years becomes steady, predictable work. That is where amortized hardware economics has the upper hand.&lt;/p&gt;

&lt;p&gt;This split architecture also reduces data migration risk. Instead of relocating entire AI pipelines at once, organizations can migrate production inference workloads gradually while leaving experimentation and early-stage training in cloud environments. A controlled, phased migration limits operational disruption and gives teams time to validate the integration between cloud-based training and on-premises serving layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-hosted inference economics
&lt;/h3&gt;

&lt;p&gt;The economics of self-hosted inference depend heavily on utilization and token volume. According to enterprise deployment benchmarks, a 7B-parameter model running on an H100 GPU at roughly 70% utilization &lt;a href="https://blog.premai.io/private-llm-deployment-a-practical-guide-for-enterprise-teams-2026/" rel="noopener noreferrer"&gt;costs about $10,000 per year in spot nodes&lt;/a&gt; or hardware amortization. Power costs about $300 annually, bringing the total costs to about $10,300.&lt;/p&gt;

&lt;p&gt;Public LLM APIs, by contrast, typically charge per million tokens, with enterprise pricing in 2025 ranging from $0.25–$15 per million input tokens and $1.25–$75 per million output tokens depending on model tier and provider. &lt;/p&gt;

&lt;p&gt;At low usage levels, APIs remain the more economical option because infrastructure sits idle. However, the economics change as workloads scale. Industry analyses suggest that self-hosted deployment begins to break even at roughly two million tokens per day, after which the fixed cost of owned infrastructure is amortized across a large inference volume.&lt;/p&gt;

&lt;p&gt;At high volumes, self-hosted inference can reduce costs by up to 78%. &lt;a href="https://medium.com/artefact-engineering-and-data-science/llms-deployment-a-practical-cost-analysis-e0c1b8eb08ca#:~:text=8k%3C%20conversations%20per%20day" rel="noopener noreferrer"&gt;Artefact’s analysis&lt;/a&gt; found a break-even point of around 8,000 conversations per day. Below that threshold, managed cloud APIs remain more economical. Above it, ownership compounds savings. The pattern mirrors 37signals: predictable workload plus high utilization equals rapid payback.&lt;/p&gt;
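
&lt;p&gt;The break-even arithmetic can be reproduced in a few lines. This is a simplified sketch: it takes the roughly $10,300 annual self-hosted cost quoted above as the only fixed cost, ignores staffing and operations, and treats the API price as a single blended rate per million tokens. The function name &lt;code&gt;break_even_tokens_per_day&lt;/code&gt; and the sample prices are assumptions chosen from the cited pricing range.&lt;/p&gt;

```python
# Amortization plus power, taken from the figures quoted above.
ANNUAL_SELF_HOSTED_COST_USD = 10_000 + 300


def break_even_tokens_per_day(
    api_price_per_million_usd: float,
    annual_fixed_cost_usd: float = ANNUAL_SELF_HOSTED_COST_USD,
) -> float:
    """Daily token volume at which API spend equals the fixed self-hosted cost."""
    daily_fixed_cost = annual_fixed_cost_usd / 365
    return daily_fixed_cost / api_price_per_million_usd * 1_000_000


# Assumed blended API prices; substitute the rate from your own contract.
for price in (0.25, 5.0, 15.0):
    tokens = break_even_tokens_per_day(price)
    print(f"${price:.2f} per million tokens: break-even near {tokens:,.0f} tokens/day")
```

&lt;p&gt;At an assumed $15 per million tokens, the break-even point lands a little under two million tokens per day. Cheaper model tiers push the break-even volume much higher, so the result is very sensitive to which API tier you are comparing against.&lt;/p&gt;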

&lt;h3&gt;
  
  
  Vector databases
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.instacart.com/company/tech-innovation/how-instacart-built-a-modern-search-infrastructure-on-postgres" rel="noopener noreferrer"&gt;Instacart documented&lt;/a&gt; migrating from Elasticsearch plus FAISS to PostgreSQL with pgvector, achieving 80% cost savings and a 10× reduction in write amplification. Timescale’s pgvectorscale benchmarks show &lt;a href="https://www.tigerdata.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost" rel="noopener noreferrer"&gt;approximately 75% lower costs&lt;/a&gt; than managed vector services like Pinecone at comparable performance.&lt;/p&gt;

&lt;p&gt;For RAG systems handling millions of queries monthly, self-hosted vector infrastructure produces savings that resemble the 37signals S3 case: large recurring storage bills replaced by amortized hardware and open-source tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data sovereignty as a structural driver
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.grandviewresearch.com/industry-analysis/sovereign-cloud-market-report#:~:text=size%20was%20estimated%20at%20USD%2096.77%20billion%20in%202024%20and%20is%20projected%20to%20reach%20USD%20648.87%20billion%20by%202033%2C%20growing%20at%20a%20CAGR%20of%2023.8%25%20from%202025%20to%202033" rel="noopener noreferrer"&gt;Grandview research&lt;/a&gt; reports that the sovereign cloud market was worth 648.87 billion USD in 2025 and is projected to reach USD 648.87 billion by 2033. Also, &lt;a href="https://www.n-ix.com/data-sovereignty/#:~:text=By%202028%2C%20industry%20forecasts%20suggest%20that%2060%25%20of%20financial%20services%20firms%20outside%20the%20US%20will%20adopt%20sovereign%20cloud%20environments%20to%20comply%20with%20DORA%20and%20related%20data%20sovereignty%20regulations%20%5B2%5D." rel="noopener noreferrer"&gt;according to&lt;/a&gt; &lt;a href="https://www.n-ix.com/data-sovereignty/#:~:text=By%202028%2C%20industry%20forecasts%20suggest%20that%2060%25%20of%20financial%20services%20firms%20outside%20the%20US%20will%20adopt%20sovereign%20cloud%20environments%20to%20comply%20with%20DORA%20and%20related%20data%20sovereignty%20regulations%20%5B2%5D." rel="noopener noreferrer"&gt;Gartner&lt;/a&gt;, around 60% of financial firms outside the United States are expected to adopt sovereign or on-premises deployments by 2028.&lt;/p&gt;

&lt;p&gt;Frameworks such as the EU AI Act, China’s PIPL, and India’s DPDP mandate data localization and traceability. For organizations processing sensitive training datasets or proprietary inference logs, on-premises deployment inherently satisfies residency requirements because data never leaves jurisdictional boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;37signals showed that teams can measure, model, and defend cloud repatriation decisions with hard numbers. With AI infrastructure, the economics can be even more pronounced. If cloud repatriation saved roughly $10 million for Basecamp, an equivalent AI company running production inference at comparable scale could save multiples of that amount, given the much higher cost of GPU compute and embedding infrastructure.&lt;/p&gt;

&lt;p&gt;For organizations choosing to run AI workloads in controlled environments, platforms like &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; provide a purpose-built vector database designed for high-volume vector search and AI inference workloads. It can be deployed on-premises or in the cloud, allowing organizations to place vector infrastructure where it best fits their operational and economic requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.gg/432A2M63Py" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt; and learn more about Actian.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>cloud</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>How to Build a HIPAA Compliant AI Ecosystem Without the Cloud</title>
      <dc:creator>Offisong Emmanuel</dc:creator>
      <pubDate>Tue, 19 May 2026 19:42:41 +0000</pubDate>
      <link>https://dev.to/actiandev/how-to-build-a-hipaa-compliant-ai-ecosystem-without-the-cloud-2dih</link>
      <guid>https://dev.to/actiandev/how-to-build-a-hipaa-compliant-ai-ecosystem-without-the-cloud-2dih</guid>
      <description>&lt;p&gt;Healthcare cannot rely on cloud RAG because patient data leaves your network and your system logs, stores, and exposes it outside your control. You sign a Business Associate Agreement (BAA), connect your pipeline to a managed vector database, and assume compliance is complete. That assumption is wrong. The BAA covers the provider’s infrastructure. It does not cover what your application sends, logs, or exposes during retrieval and generation.&lt;/p&gt;

&lt;p&gt;You remain responsible for every path your system sends Protected Health Information (PHI) through. A clinician query can leak sensitive data through logs. A system prompt can include patient context that your system stores outside your boundary. Weak access control allows retrieval results to expose records across departments. These risks exist in your application layer, not in the cloud provider’s scope.&lt;/p&gt;

&lt;p&gt;American regulators now target this gap. In 2026, they &lt;a href="https://www.hakunamatatatech.com/our-resources/blog/hipaa-compliant-llm" rel="noopener noreferrer"&gt;flagged attack patterns like membership inference&lt;/a&gt;, where an adversary probes an AI system to confirm whether a patient’s data exists in the index. Cloud-hosted pipelines increase this risk because queries and embeddings move across external infrastructure. Audit requirements tighten further when logs live on third-party systems.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will build a clinical knowledge assistant that runs entirely on hospital infrastructure. It performs semantic search over clinical data, enforces role-based access at query time, and generates answers with clear citations. Every query stays inside your network, every access is logged locally, and no external API calls are required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BAA Is Not Enough
&lt;/h2&gt;

&lt;p&gt;A BAA protects the cloud provider’s infrastructure, not how your system handles PHI during queries, retrieval, and generation. You remain responsible for every place PHI appears, moves, or gets stored inside your pipeline. There are multiple failure modes that make your system non-compliant, even when you sign a BAA.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shared responsibility gap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The BAA stops at the infrastructure boundary. Your system controls what enters a prompt, what gets logged, and what leaves your network. If a clinician query includes PHI and your application logs it to an external service, you are responsible. If your retrieval step returns records across departments without strict filters, you have created an internal data breach. These failures happen in your code, not in the cloud provider’s scope.&lt;/p&gt;

&lt;p&gt;For example, a physician searches “Show me similar cases to John Doe with early stage lung cancer.” Your application logs the full query to a cloud logging service for debugging. That log now contains PHI outside your network. The cloud provider did not leak it. Your application sent it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Audit log ownership&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://conciergehealthcareattorneysllc.com/blog/hipaa-audit-trail-requirements-what-healthcare-practitioners-need-to-know/" rel="noopener noreferrer"&gt;HIPAA requires a complete audit trail&lt;/a&gt; for every access to PHI. When your vector database runs on third-party infrastructure, your system stores query logs and retrieval traces outside your control. You cannot guarantee completeness, retention, or isolation. Your security team cannot verify access patterns without relying on another provider’s system. That breaks your ability to enforce and prove compliance.&lt;/p&gt;

&lt;p&gt;For example, your compliance team asks for a report of every access to oncology patient records in the past 30 days. Your vector database provider stores query logs on its platform with limited retention. Some logs are missing and others lack user-level metadata. You cannot produce a complete audit trail.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Membership inference exposure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Attackers can probe your system with targeted queries to determine whether a specific patient’s data exists in your index. This attack class is now a regulatory concern. Cloud-hosted indexes increase this risk because they expose a remote interface for repeated probing. A locally hosted index removes that external interface and limits access to your internal network.&lt;/p&gt;

&lt;p&gt;For example, an attacker sends repeated queries like “Patients diagnosed with HIV in 2024 treated with drug X” and slightly modifies filters each time. They observe changes in response confidence and content. Over time, they infer whether a specific individual’s record exists in your dataset.&lt;/p&gt;

&lt;p&gt;These failures show that a BAA does not ensure compliance. An on-premises deployment &lt;a href="https://www.actian.com/blog/databases/when-to-choose-on-premises-vs-cloud-for-vector-databases/" rel="noopener noreferrer"&gt;removes the third-party surface entirely&lt;/a&gt; and gives you full control over data flow, access, and auditability. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk01kfpgjhgmv0qss5f2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk01kfpgjhgmv0qss5f2u.png" alt="Pasted image: image.png" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Split view showing Cloud vs On-premises RAG architecture&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Are Building
&lt;/h2&gt;

&lt;p&gt;In this section, you will build a RAG system with three layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ingestion layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You ingest clinical notes and treatment protocols into a controlled vector index with enforced data hygiene. You de-identify data before any processing. HIPAA Safe Harbor requires removal of the 18 specified identifiers, while Expert Determination allows a statistical approach. You apply one of these before ingestion, not after. You then chunk documents into 512-token segments with a 50-token overlap, generate embeddings using a local model, and store them in VectorAI DB with metadata.&lt;/p&gt;

&lt;p&gt;You define a strict schema for every record. Each chunk includes document type, department, date, and author role. This metadata is not optional. It enables access control at query time and prevents cross-department leakage. Do not store raw documents without structure.&lt;/p&gt;
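
&lt;p&gt;The chunking and schema rules above can be sketched as follows. This is a minimal illustration, not the tutorial's ingestion script: &lt;code&gt;chunk_tokens&lt;/code&gt; and &lt;code&gt;build_records&lt;/code&gt; are hypothetical helpers, and whitespace-separated words stand in for tokens, so real token counts from a model tokenizer will differ.&lt;/p&gt;

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def build_records(doc: dict, size: int = 512, overlap: int = 50) -> list[dict]:
    """Attach the required metadata fields to every chunk of one document."""
    words = doc["text"].split()  # assumption: whitespace words approximate tokens
    return [
        {
            "chunk_id": f"{doc['document_id']}_{index}",
            "document_type": doc["document_type"],
            "department": doc["department"],
            "date": doc["date"],
            "author_role": doc["author_role"],
            "text": " ".join(chunk),
        }
        for index, chunk in enumerate(chunk_tokens(words, size, overlap))
    ]


# Synthetic document used only to exercise the helpers.
sample = {
    "document_id": "card_protocol_001",
    "document_type": "treatment_protocol",
    "department": "cardiology",
    "date": "2025-01-10",
    "author_role": "department_head",
    "text": " ".join(f"word{n}" for n in range(1000)),
}
records = build_records(sample)
print(len(records), [len(r["text"].split()) for r in records])
```

&lt;p&gt;Because every record carries its department and author role, the query layer can filter on those fields without looking at the chunk text.&lt;/p&gt;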

&lt;h3&gt;
  
  
  &lt;strong&gt;Query layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You process clinician queries through a controlled retrieval pipeline. Every query passes through role-based access control before it reaches the index. A cardiology user can only retrieve cardiology data. A scheduling bot cannot access diagnosis notes. You enforce this with a MUST filter on department or patient cohort at the database level.&lt;/p&gt;

&lt;p&gt;Run hybrid search. Vector similarity retrieves semantically relevant chunks. Metadata filters restrict the result set. Pass the filtered context into a local LLM. The model generates an answer from retrieved data only and includes citations. Do not allow the model to invent or pull from external knowledge.&lt;/p&gt;
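
&lt;p&gt;The interaction between the access filter and similarity ranking is easier to see in code. The sketch below is a plain-Python stand-in that does not use the VectorAI DB client: the &lt;code&gt;ROLE_DEPARTMENTS&lt;/code&gt; mapping, the &lt;code&gt;search&lt;/code&gt; function, and the two-dimensional vectors are all hypothetical. In the real pipeline, the same filter would be pushed down to the database rather than applied in application code.&lt;/p&gt;

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical role-to-department mapping; a real deployment would load this
# from the hospital's identity provider.
ROLE_DEPARTMENTS = {
    "cardiologist": {"cardiology"},
    "psychiatrist": {"psychiatry"},
    "scheduling_bot": set(),
}


def search(query_vec: list[float], records: list[dict], role: str, top_k: int = 3) -> list[dict]:
    """Rank only the records the caller's role is allowed to see."""
    allowed = ROLE_DEPARTMENTS.get(role, set())  # unknown roles see nothing
    # Mandatory filter: drop out-of-scope records before any scoring happens.
    candidates = [r for r in records if r["department"] in allowed]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


records = [
    {"id": "card_note_001", "department": "cardiology", "vector": [0.9, 0.1]},
    {"id": "psych_note_001", "department": "psychiatry", "vector": [0.8, 0.2]},
]
print([r["id"] for r in search([1.0, 0.0], records, "cardiologist")])
print(search([1.0, 0.0], records, "scheduling_bot"))
```

&lt;p&gt;Filtering before scoring means a record outside the caller's scope can never appear in the context passed to the model, however similar it is to the query.&lt;/p&gt;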

&lt;h3&gt;
  
  
  &lt;strong&gt;Audit layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Log every interaction locally with full traceability. Each query writes a record that includes timestamp, user ID, department, query text, and retrieved document references. This log lives on your infrastructure with defined retention and access policies. You do not rely on external logging systems.&lt;/p&gt;

&lt;p&gt;You can reconstruct any access event from this log. You can answer who accessed what, when, and under which role. This satisfies audit requirements and gives your security team direct visibility into system behavior.&lt;/p&gt;
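
&lt;p&gt;A local audit trail with these fields can be as simple as an append-only JSON Lines file. The sketch below is one possible shape under stated assumptions: &lt;code&gt;write_audit_record&lt;/code&gt;, &lt;code&gt;read_audit_log&lt;/code&gt;, the field names, and the &lt;code&gt;audit_logs/queries.jsonl&lt;/code&gt; path are hypothetical, and a production system would add file permissions, rotation, and retention enforcement.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_audit_record(log_path, user_id, department, query_text, document_ids):
    """Append one access event to a local JSON Lines file and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "department": department,
        "query": query_text,
        "retrieved_documents": list(document_ids),
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_audit_log(log_path):
    """Load every recorded event, oldest first."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Assumed location on the host; adjust to wherever your audit volume lives.
LOG_FILE = Path("audit_logs") / "queries.jsonl"
write_audit_record(LOG_FILE, "dr_0042", "cardiology",
                   "first-line therapy for HFrEF", ["card_protocol_001"])
events = read_audit_log(LOG_FILE)
print(events[-1]["user_id"], events[-1]["retrieved_documents"])
```

&lt;p&gt;Filtering these events by user, department, or document ID answers who accessed what and when. Note that the log stores the query text itself, so the log file needs the same protection as the PHI it may contain.&lt;/p&gt;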

&lt;p&gt;The entire system runs on commodity hardware inside the hospital network. The end-to-end architecture of the system is shown in the image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1r8v9b2trg43nmkgr4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1r8v9b2trg43nmkgr4x.png" alt="Pasted image: image.png" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hospital RAG system architecture&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a HIPAA Compliant RAG Workflow
&lt;/h2&gt;

&lt;p&gt;In this section, you will build a fully local RAG system that ingests clinical data, enforces access control, answers queries, and logs every interaction. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To follow along, install the following tools on your local network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Docker and Docker Compose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python 3.10 or higher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pypi.org/project/pip/" rel="noopener noreferrer"&gt;PIP&lt;/a&gt; or &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;UV&lt;/a&gt;: This guide uses UV.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Deploy a vector database&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You deploy a local instance of &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; with persistent storage for both vector data and audit logs.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;services:
  vectorai:
    image: williamimoh/actian-vectorai-db:latest
    platform: linux/amd64
    container_name: vectorai_db
    ports:
      - &lt;span class="s2"&gt;"50052:50051"&lt;/span&gt;
    volumes:
      &lt;span class="c"&gt;# vector data persists across restarts&lt;/span&gt;
      - ./data:/app/data
      &lt;span class="c"&gt;# audit log lives on host — not inside the container&lt;/span&gt;
      - ./audit_logs:/app/audit_logs
    environment:
      - &lt;span class="nv"&gt;VECTORAI_LOG_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info
    restart: unless-stopped

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database starts and listens on port 50051 inside the container. Docker publishes it on host port 50052, which is the port your client connects to. Vector data persists in &lt;code&gt;./data&lt;/code&gt;. Audit logs write to &lt;code&gt;./audit_logs&lt;/code&gt; on the host, which keeps all access records inside your network boundary.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Apple Silicon machines (M3/M4) might encounter a gRPC disconnection error without any container logs. In this case, disable Rosetta in Docker Desktop.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;VectorAI DB is under active development.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Build the ingestion pipeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Install the client library and run the ingestion pipeline to convert clinical documents into embeddings and store them in your local vector database.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;uv&lt;/strong&gt; for dependency management and execution. It is fast, reproducible, and avoids global Python state.&lt;/p&gt;

&lt;p&gt;Download the &lt;a href="https://github.com/hackmamba-io/actian-vectorAI-db-beta/blob/main/actian_vectorai-0.1.0b2-py3-none-any.whl" rel="noopener noreferrer"&gt;Actian VectorAI client package&lt;/a&gt;. This gives you the wheel file &lt;code&gt;actian_vectorai-0.1.0b2-py3-none-any.whl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Initialize your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uv init .
uv venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After initialization, install the Actian VectorAI package from the downloaded wheel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uv&lt;/span&gt; &lt;span class="n"&gt;pip3&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;py3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;none&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the embedding model dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uv&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;transformers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a file &lt;code&gt;ingest.py&lt;/code&gt; with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointStruct&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="c1"&gt;# ── Config ────────────────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;VECTORAI_HOST&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:50052&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COLLECTION&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# 384-dim
&lt;/span&gt;&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_TOKENS&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="n"&gt;OVERLAP_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;

&lt;span class="c1"&gt;# ── Synthetic clinical notes (replace with real de-identified corpus) ─────────
&lt;/span&gt;&lt;span class="n"&gt;RAW_NOTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;card_note_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical_note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attending_physician&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Patient: [NAME REDACTED], DOB: [DATE REDACTED], MRN: [MRN REDACTED]
            Chief Complaint: Chest pain radiating to left arm, onset 2 hours ago.
            Assessment: Acute ST-elevation myocardial infarction confirmed on ECG.
            History: Hypertension and type 2 diabetes. Started on aspirin 325 mg,
            clopidogrel 600 mg loading dose, and heparin infusion per ACS protocol.
            Plan: Emergency PCI. Beta-blocker therapy with metoprolol succinate
            25 mg daily post-procedure. ACE inhibitor ramipril 5 mg daily initiated
            24 hours post-PCI. Follow-up echocardiography in 6 weeks.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;card_protocol_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;treatment_protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department_head&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Cardiology Protocol — Heart Failure with Reduced EF (HFrEF)
            First-line therapy:
            - ACE inhibitor: ramipril 2.5–10 mg daily (or ARB if ACE-intolerant).
            - Beta-blocker: bisoprolol 1.25–10 mg daily, carvedilol 3.125–25 mg BID,
              or metoprolol succinate 12.5–200 mg daily. Titrate every 2 weeks.
            - MRA: spironolactone 25–50 mg daily for NYHA class II–IV
              if eGFR &amp;gt; 30 and K+ &amp;lt; 5.0.
            Target: Symptomatic improvement. Reassess LVEF at 3–6 months.
            Device therapy (ICD/CRT) if LVEF ≤ 35% after 3 months optimal therapy.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psych_note_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical_note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psychiatry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psychiatrist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Psychiatry intake note — [NAME REDACTED], [AGE REDACTED]-year-old.
            Presenting with major depressive episode, PHQ-9 score 18 (severe).
            No current suicidal ideation. Started sertraline 50 mg daily.
            Psychotherapy referral placed. Follow-up in 2 weeks.
            Safety plan documented. Family support confirmed present.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;onco_note_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical_note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncologist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Oncology note — [NAME REDACTED].
            Diagnosis: Stage IIIA non-small cell lung cancer, adenocarcinoma.
            EGFR mutation positive (exon 19 deletion).
            Plan: Osimertinib 80 mg daily (first-line EGFR-targeted therapy).
            Baseline CT chest/abdomen/pelvis completed. Brain MRI negative.
            Next imaging review in 8 weeks. Antiemetics PRN, skin care for rash.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 1: De-identification ─────────────────────────────────────────────────
# For production use Presidio:
#   from presidio_analyzer import AnalyzerEngine
#   from presidio_anonymizer import AnonymizerEngine
#   analyzer, anonymizer = AnalyzerEngine(), AnonymizerEngine()
#   result = analyzer.analyze(text=raw, entities=[...], language="en")
#   clean  = anonymizer.anonymize(text=raw, analyzer_results=result).text
#
# This demo applies lightweight regex to already-synthetic notes.
&lt;/span&gt;
&lt;span class="n"&gt;_HIPAA_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{3}-\d{2}-\d{4}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SSN]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;# SSN
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bMRN[-:\s]*\d{4,10}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[MRN]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;# medical record #
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{1,2}/\d{1,2}/\d{2,4}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DATE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;# dates
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b[A-Z][a-z]+ [A-Z][a-z]+\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[NAME]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;# names (naive; also matches Title Case pairs such as Heart Failure)
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PHONE]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;        &lt;span class="c1"&gt;# phone
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.\w+\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[EMAIL]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# email
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{5}(?:-\d{4})?\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[ZIP]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;# zip
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(?:https?://)\S+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[URL]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;# URLs
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[IP]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;           &lt;span class="c1"&gt;# IP addresses
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deidentify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Replace common HIPAA Safe Harbor identifiers with placeholder tags (demo-grade regex, not exhaustive).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_HIPAA_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 2: Chunking ──────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CHUNK_TOKENS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OVERLAP_TOKENS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Split text into overlapping token windows (whitespace tokenisation).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
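    &lt;span class="c1"&gt;# Guard: with overlap &amp;gt;= size the window never advances and the loop would not terminate.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overlap must be smaller than size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;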
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 3: Embedding ─────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading embedding model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 4: Ingest into VectorAI DB ──────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_chunk_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Stable integer ID from (document_id, chunk_index).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 15 hex digits = 60 bits, so the ID fits in a signed 64-bit integer&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VECTORAI_HOST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Health check
&lt;/span&gt;        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VectorAI DB connected  version=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create collection (skip if already exists)
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cosine&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Collection &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; created  dim=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Collection &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; already exists — skipping create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;

        &lt;span class="n"&gt;total_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;note&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# De-identify FIRST — before chunking or embedding
&lt;/span&gt;            &lt;span class="n"&gt;clean_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deidentify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

            &lt;span class="c1"&gt;# Chunk second
&lt;/span&gt;            &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Embed third
&lt;/span&gt;            &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Build PointStruct records with strict metadata schema
&lt;/span&gt;            &lt;span class="c1"&gt;# All four metadata fields are REQUIRED — no optional fields.
&lt;/span&gt;            &lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_chunk_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="c1"&gt;# ── strict schema ──────────────────────────────────────
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# required
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;       &lt;span class="c1"&gt;# required — RBAC filter key
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;             &lt;span class="c1"&gt;# required
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;# required
&lt;/span&gt;                        &lt;span class="c1"&gt;# ── retrieval helpers ──────────────────────────────────
&lt;/span&gt;                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;                &lt;span class="c1"&gt;# de-identified chunk text
&lt;/span&gt;                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;total_chunks&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  dept=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  chunks=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Ingestion complete — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_chunks&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAW_NOTES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;De-identifies data:&lt;/strong&gt; The script strips HIPAA identifiers from the raw text before any other processing. It replaces names, dates, and other sensitive fields with placeholders to keep PHI out of the rest of the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunks the text:&lt;/strong&gt; The script splits the cleaned text into 512-token segments with a 50-token overlap. The overlap preserves context across chunk boundaries, which improves retrieval accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embeds the chunks:&lt;/strong&gt; The script converts each chunk into a numerical vector using a local sentence-transformers model. The vector captures the chunk's semantic meaning, and because the model runs locally, all processing stays within your network.&lt;/p&gt;&lt;/li&gt;
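&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stores with metadata:&lt;/strong&gt; The script writes each chunk and its vector to VectorAI DB, along with metadata fields such as &lt;code&gt;document_type&lt;/code&gt;, &lt;code&gt;department&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, and &lt;code&gt;author_role&lt;/code&gt;. Queries filter on these fields (for example, &lt;code&gt;department&lt;/code&gt;) to enforce access control.&lt;/p&gt;&lt;/li&gt;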
&lt;/ul&gt;
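
&lt;p&gt;To make the chunking step concrete, here is a minimal sketch of a sliding window with overlap. It is an illustration under stated assumptions, not the code from &lt;code&gt;ingest.py&lt;/code&gt;: the helper names &lt;code&gt;chunk_tokens&lt;/code&gt; and &lt;code&gt;chunk_text&lt;/code&gt; are hypothetical, and whitespace-separated words stand in for real tokens, so counts will differ from those of a model tokenizer.&lt;/p&gt;

```python
# Sketch only: whitespace "tokens" stand in for whatever tokenizer ingest.py uses.
# chunk_tokens and chunk_text are hypothetical helper names, not part of the article's code.

def chunk_tokens(tokens, size=512, overlap=50):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap  # each new window starts this many tokens later
    # No further window is needed once the previous one reached the end of the list.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[start:start + size] for start in range(0, stop, step)]


def chunk_text(text, size=512, overlap=50):
    """Chunk text on whitespace tokens and join each window back into a string."""
    return [" ".join(window) for window in chunk_tokens(text.split(), size, overlap)]


if __name__ == "__main__":
    words = [f"w{i}" for i in range(1000)]
    chunks = chunk_tokens(words)
    print([len(c) for c in chunks])  # [512, 512, 76]
```

&lt;p&gt;With the default settings, each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk reappear as the first 50 tokens of the next.&lt;/p&gt;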

&lt;p&gt;Run the script with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;uv run ingest.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After the command finishes, you should see output similar to the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ngacc2ml9vjldcfu1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0ngacc2ml9vjldcfu1l.png" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ingest.py execution&lt;/p&gt;

&lt;p&gt;The logs confirm that the ingestion pipeline wrote each document's chunks to VectorAI DB.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Run your queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Execute queries against your local RAG system and validate retrieval, access control, and audit logging.&lt;/p&gt;

&lt;p&gt;Create a file &lt;code&gt;query.py&lt;/code&gt; with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.error&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAIClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FilterBuilder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="c1"&gt;# ── Config ─────────────────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;VECTORAI_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:50052&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COLLECTION&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./audit_logs/queries.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# volume-mounted path
&lt;/span&gt;
&lt;span class="c1"&gt;# Ollama settings — set OLLAMA_ENABLED=True once `ollama serve` is running
&lt;/span&gt;&lt;span class="n"&gt;OLLAMA_ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;          &lt;span class="c1"&gt;# flip to True when Ollama is ready
&lt;/span&gt;&lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;OLLAMA_MODEL&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# or "llama3.2:3b" for lower hardware
&lt;/span&gt;
&lt;span class="c1"&gt;# ── RBAC: role → allowed departments ──────────────────────────────────────────
# Access is enforced as a MUST filter at the database level.
# A scheduling_bot cannot reach clinical notes; cardiology cannot see psychiatry.
&lt;/span&gt;&lt;span class="n"&gt;ROLE_PERMISSIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology_clinician&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncology_clinician&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general_practitioner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                 &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oncology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psychiatry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduling_bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# no clinical note access
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AccessDeniedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allowed_departments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ROLE_PERMISSIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AccessDeniedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown role &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; — access denied by default.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ROLE_PERMISSIONS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ── Embedding (reuse the same model as ingest.py) ─────────────────────────────
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading embedding model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 5: Search with department MUST filter ─────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Hybrid retrieval: vector similarity + metadata MUST filter.
    Results from departments outside the allowed list are impossible —
    the filter is applied at the database level, not in application code.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VECTORAI_HOST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dept&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="c1"&gt;# MUST filter — department equality enforced at DB level
&lt;/span&gt;                &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FilterBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dept&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ── Step 6: LLM answer via Ollama ─────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;_RAG_SYSTEM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a clinical decision support assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer ONLY using the context passages below. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do NOT use external knowledge or make assumptions. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cite each fact as [Doc N]. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If the context is insufficient, say: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;I cannot answer from the available documents.&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Doc &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, dept=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, role=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author_role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;OLLAMA_ENABLED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Return raw retrieved context when LLM is disabled
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LLM disabled — set OLLAMA_ENABLED=True]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_RAG_SYSTEM&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URLError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Ollama unreachable: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Retrieved context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_audit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── Public query entry point ───────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Execute a role-gated RAG query.

    Returns:
        {answer, retrieved_docs, access_denied}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# RBAC check — before anything else
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;allowed_departments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;AccessDeniedError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;write_audit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DENIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer_provided&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;denial_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access denied: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Embed → retrieve (with MUST filter) → generate
&lt;/span&gt;    &lt;span class="n"&gt;q_vec&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;doc_refs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Audit log — every query, regardless of outcome
&lt;/span&gt;    &lt;span class="nf"&gt;write_audit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;doc_refs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer_provided&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_refs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ── Demo runs ─────────────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;separator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;─&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Query 1: authorised cardiology query ────────────────────────────────
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QUERY 1 — cardiology_clinician (authorised)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dr_chen_007&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology_clinician&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What beta-blocker is recommended for heart failure with reduced ejection fraction?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;document_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]  dept=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  chunk=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Query 2: scheduling bot tries to access clinical notes ───────────────
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QUERY 2 — scheduling_bot (attempting clinical note access)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bot_sched_01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduling_bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the diagnosis notes for cardiology patients?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✗ Access denied (as expected): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Query 3: cardiology query that must NOT return psychiatry notes ───────
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QUERY 3 — cardiology_clinician (RBAC must exclude psychiatry)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dr_chen_007&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cardiology_clinician&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;antidepressant dosing and patient management&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;departments_returned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="n"&gt;cross_leak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psychiatry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;departments_returned&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Departments in results: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;departments_returned&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cross-department leak: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;✗ LEAK DETECTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cross_leak&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;✓ none — RBAC working correctly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Show audit log tail ───────────────────────────────────────────────────
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
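    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIT LOG  →  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;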
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AUDIT_LOG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;          &lt;span class="c1"&gt;# show last 3 entries
&lt;/span&gt;            &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs_accessed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No audit log found — run ingest.py first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script performs three core operations in a single flow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce access control:&lt;/strong&gt; The system checks the user’s role before retrieving any data. Each role maps to specific departments, and that mapping is applied as a mandatory filter at the database level. The authorization layer immediately blocks and logs requests from unauthorized roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retrieve and generate answers:&lt;/strong&gt; The system embeds the query and retrieves relevant document chunks using vector search, with strict department filters applied. The results are then passed to a local LLM. If the LLM is disabled, the retrieved context is returned directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write audit logs:&lt;/strong&gt; The system logs every query locally, including the user ID, role, query text, accessed documents, and access status. This creates a complete audit trail for compliance and review.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
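
&lt;p&gt;The three operations above can be sketched in a few lines. This is a minimal illustration, not the article's &lt;code&gt;query.py&lt;/code&gt;: the names &lt;code&gt;ROLE_DEPARTMENTS&lt;/code&gt;, &lt;code&gt;DOCS&lt;/code&gt;, and &lt;code&gt;handle_query&lt;/code&gt; and the &lt;code&gt;psychiatry_clinician&lt;/code&gt; role are hypothetical, and an in-memory list stands in for the filtered vector search and the LLM call.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Hypothetical role-to-department mapping (assumption, not taken from query.py).
ROLE_DEPARTMENTS = {
    "cardiology_clinician": {"cardiology"},
    "psychiatry_clinician": {"psychiatry"},
}

# Stand-in corpus; a real system would run a filtered vector search instead.
DOCS = [
    {"id": "c-1", "department": "cardiology", "text": "Beta-blocker guidance"},
    {"id": "p-1", "department": "psychiatry", "text": "Antidepressant dosing"},
]


def allowed_departments(role):
    """Return the departments a role may read, or an empty set if unknown."""
    return ROLE_DEPARTMENTS.get(role, set())


def retrieve(role):
    """Apply the department filter before any document is returned."""
    departments = allowed_departments(role)
    return [doc for doc in DOCS if doc["department"] in departments]


def audit_entry(user_id, role, query_text, docs):
    """Build one JSON-serializable audit record for a query."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "role": role,
        "query_text": query_text,
        "retrieved_docs": [doc["id"] for doc in docs],
        "access_denied": not allowed_departments(role),
    }


def handle_query(user_id, role, query_text):
    """Authorize, retrieve, and return (docs, audit record as a JSON line)."""
    docs = retrieve(role)
    entry = audit_entry(user_id, role, query_text, docs)
    return docs, json.dumps(entry)
```

&lt;p&gt;In this sketch, a role that is missing from the mapping gets an empty department set, so it retrieves nothing and its audit record is marked as denied. A cardiology role that asks a psychiatry question still sees only cardiology documents, because the filter is applied before retrieval rather than after.&lt;/p&gt;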

&lt;p&gt;Run the script with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;uv run query.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see the following results after running the command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qd9ra0blulrcurqrxn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qd9ra0blulrcurqrxn3.png" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;query.py execution&lt;/p&gt;

&lt;p&gt;The output shows three test cases: a valid clinician query, a denied access attempt, and a check for cross-department leakage. Together, they confirm that RBAC and audit logging work correctly before you move to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Configure the audit log&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Store every query locally by using the volume mapping defined during deployment.&lt;/p&gt;

&lt;p&gt;The Docker configuration mounts ./audit_logs from your host into the container. When you run queries, this creates a local folder named audit_logs containing a file named queries.jsonl.&lt;/p&gt;

&lt;p&gt;The file contains the following entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:02.552088+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dr_chen_007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology_clinician"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What beta-blocker is recommended for heart failure with reduced ejection fraction?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span 
class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:03.098346+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bot_sched_01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scheduling_bot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scheduling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What are the diagnosis notes for cardiology patients?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:03.663767+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dr_chen_007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology_clinician"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"antidepressant dosing and patient management"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:18.331657+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dr_chen_007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology_clinician"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What beta-blocker is recommended for heart failure with reduced ejection fraction?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span 
class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:18.492188+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bot_sched_01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scheduling_bot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scheduling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What are the diagnosis notes for cardiology patients?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T15:56:18.569824+00:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dr_chen_007"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology_clinician"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cardiology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"query_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"antidepressant dosing and patient management"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrieved_docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"answer_provided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"access_denied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line represents a single query event. The log captures who made the request, their role, the department scope, the query text, and whether access was allowed. This file lives entirely on your infrastructure and gives you a complete, verifiable audit trail for every interaction with PHI.&lt;/p&gt;
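
&lt;p&gt;Because each line is a standalone JSON object, the log is easy to review with a short script. The sketch below is one possible approach, assuming only the &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;access_denied&lt;/code&gt; fields shown above; the function name &lt;code&gt;summarize_audit_log&lt;/code&gt; and the sample records are made up for illustration.&lt;/p&gt;

```python
import json
from collections import Counter


def summarize_audit_log(jsonl_text):
    """Count queries and denied requests per user from JSONL audit text."""
    queries = Counter()
    denied = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines between records
        entry = json.loads(line)
        queries[entry["user_id"]] += 1
        if entry["access_denied"]:
            denied[entry["user_id"]] += 1
    return {"queries": dict(queries), "denied": dict(denied)}


# Two illustrative records that reuse the log's field names (values are made up).
SAMPLE = "\n".join([
    json.dumps({"user_id": "dr_chen_007", "role": "cardiology_clinician", "access_denied": False}),
    "",
    json.dumps({"user_id": "bot_sched_01", "role": "scheduling_bot", "access_denied": True}),
])

print(summarize_audit_log(SAMPLE))
```

&lt;p&gt;To run it against the real log, pass in the contents of the queries.jsonl file in the audit_logs folder instead of &lt;code&gt;SAMPLE&lt;/code&gt;.&lt;/p&gt;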

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;A BAA was never enough because it does not control how your application handles PHI. You solved that by keeping all data, queries, and logs inside your network.&lt;/p&gt;

&lt;p&gt;You now have a RAG system that enforces role-based access, retrieves only authorized data, and logs every interaction locally, without external APIs or third-party exposure.&lt;/p&gt;

&lt;p&gt;Apply this pattern to other regulated systems. Refer to &lt;a href="https://docs.vectoraidb.actian.com/" rel="noopener noreferrer"&gt;the&lt;/a&gt; &lt;a href="https://actianvectorai.mintlify.app/home/getting-started/overview" rel="noopener noreferrer"&gt;VectorAI DB documentation&lt;/a&gt; and &lt;a href="https://github.com/actiancorp" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for updates and implementation details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.gg/432A2M63Py" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt; and learn more about Actian.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
      <category>rag</category>
      <category>security</category>
    </item>
    <item>
      <title>When to choose on-premises vs. cloud for vector databases</title>
      <dc:creator>Offisong Emmanuel</dc:creator>
      <pubDate>Tue, 19 May 2026 19:42:27 +0000</pubDate>
      <link>https://dev.to/actiandev/when-to-choose-on-premises-vs-cloud-for-vector-databases-2bi1</link>
      <guid>https://dev.to/actiandev/when-to-choose-on-premises-vs-cloud-for-vector-databases-2bi1</guid>
      <description>&lt;p&gt;For most of the last decade, the on-premises vs. cloud debate felt settled. Cloud computing was cheaper, faster, and easier to adopt. Enterprises moved workloads from on-premises infrastructure to public cloud services, relying on major cloud providers to handle scalability, maintenance, and security.&lt;/p&gt;

&lt;p&gt;In 2026, that assumption is breaking, and cracks are showing up in legal reviews, financial projects, and SLA negotiations. Enterprises face increasing pressure from &lt;a href="https://securityboulevard.com/2025/12/the-global-data-residency-crisis-how-enterprises-can-navigate-geolocation-storage-and-privacy-compliance-without-sacrificing-performance/" rel="noopener noreferrer"&gt;data residency regulations&lt;/a&gt;, stricter enforcement, and closer scrutiny of cloud security models. Compliance constraints, data security requirements, cost predictability, and latency are forcing teams to reconsider on-premises solutions, private cloud computing, and hybrid cloud infrastructure.&lt;/p&gt;

&lt;p&gt;At the same time, AI is moving closer to where data is generated. Manufacturing sites, retail stores, and healthcare environments increasingly require offline capability and sub-100ms latency. That shift helps explain why &lt;a href="https://blogs.oracle.com/database/ga-of-oracle-ai-database-26ai-for-linux-x86-64-on-premises-platforms" rel="noopener noreferrer"&gt;Oracle released AI Database 26ai&lt;/a&gt; for on-premises deployment and why &lt;a href="https://docs.cloud.google.com/distributed-cloud/gemini-on-gdcc/latest/docs/overview" rel="noopener noreferrer"&gt;Google&lt;/a&gt; is pushing Gemini onto Distributed Cloud for air-gapped environments. This shift signals that large-scale enterprise AI no longer fits neatly in cloud environments.&lt;/p&gt;

&lt;p&gt;In this article, we’ll examine why on-premises infrastructure is resurging, what trade-offs you need to know, and how to make defensible deployment decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Driving The On-premises Resurgence
&lt;/h2&gt;

&lt;p&gt;The renewed interest in on-premises infrastructure is not about going back to old systems. It is a response to clear changes in how AI systems are being built and used in 2025 and 2026. For many enterprises, cloud-only vector databases no longer fit their compliance, cost, and reliability needs.&lt;/p&gt;

&lt;p&gt;Many factors drive this resurgence, but this article focuses on three key drivers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large vendors now support on-premises AI
&lt;/h3&gt;

&lt;p&gt;On-premises AI is no longer treated as an edge case by major vendors. Oracle’s release of AI Database 26ai and Google’s decision to run Gemini on Distributed Cloud show a clear shift in how enterprise AI is being packaged and delivered.&lt;/p&gt;

&lt;p&gt;These products are built for large enterprises, not early-stage experiments or research projects. That distinction matters. Large vendors do not invest in complex on-premises AI platforms unless there is strong and growing customer demand. These announcements confirm that many enterprises want to run AI systems inside their own environments, close to their data, and under their full operational control. Why is this?&lt;/p&gt;

&lt;h3&gt;
  
  
  Regulatory pressure is now a real blocker
&lt;/h3&gt;

&lt;p&gt;Teams used to plan for regulatory risk as a future possibility. Now it’s a day-to-day reality. GDPR enforcement reached record levels in 2025, with insufficient legal basis for data processing driving the largest penalties. That year alone, regulators issued &lt;a href="https://www.enforcementtracker.com/?insights" rel="noopener noreferrer"&gt;nearly 2,700 fines&lt;/a&gt; totaling billions of euros.&lt;/p&gt;

&lt;p&gt;From a data security perspective, GDPR enforcement has fundamentally changed how enterprises evaluate cloud services. While cloud service providers offer compliance tooling, legal teams are increasingly wary of relying on third-party providers for sensitive data storage and processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87kkko5v2rov6kqlb9ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87kkko5v2rov6kqlb9ed.png" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef6odffzvedoo7lbxz7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef6odffzvedoo7lbxz7h.png" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;HIPAA adds another layer of complexity. For example, in Florida, physicians must maintain medical records for five years after the last patient contact, whereas hospitals must maintain them for seven years under &lt;a href="https://www.hipaajournal.com/hipaa-retention-requirements/" rel="noopener noreferrer"&gt;state record-retention requirements&lt;/a&gt;. This makes repeated data movement risky and expensive. Financial services and government contractors face similar data sovereignty requirements that limit where data can be stored and processed. In these situations, cloud deployments add legal review, audit work, and ongoing risk. Keeping data on-premises is often the most straightforward way to meet these obligations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge AI requires local and offline operation
&lt;/h3&gt;

&lt;p&gt;AI workloads are increasingly deployed close to where data is created. Manufacturing facilities may operate in air-gapped environments or remote locations with limited connectivity. Retail systems must continue working during network outages. Healthcare applications often require very low latency for real-time decision support.&lt;/p&gt;

&lt;p&gt;In these environments, relying on a remote cloud service introduces risk. Network delays and outages directly affect system reliability. On-premises and edge deployments allow vector search and inference to run locally, without depending on constant network access. For many use cases, this local execution is not an optimization but a requirement.&lt;/p&gt;

&lt;p&gt;Together, these shifts explain why on-premises vector databases are gaining traction again. The change is driven by the practical realities of deploying production AI systems under real regulatory, cost, and reliability constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Calculus
&lt;/h2&gt;

&lt;p&gt;For many enterprises, compliance is the deciding factor in the on-premises versus cloud debate. While cloud providers offer compliance certifications, the real challenge is not whether a platform can be compliant in theory, but whether it can withstand legal review, audits, and long-term operational scrutiny in practice. Once vector databases move into production and begin storing sensitive or regulated data, these questions become unavoidable.&lt;/p&gt;

&lt;h3&gt;
  
  
  GDPR and the limits of cross-border transfers
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.pinsentmasons.com/out-law/guides/international-transfers-schrems-ii-gdpr" rel="noopener noreferrer"&gt;Schrems II ruling&lt;/a&gt; changed how European data can be processed outside the EU. Privacy Shield was invalidated, leaving Standard Contractual Clauses as the primary legal mechanism for cross-border data transfers. In highly regulated industries such as financial services and healthcare, many legal teams consider SCCs insufficient due to enforcement uncertainty and ongoing legal challenges.&lt;/p&gt;

&lt;p&gt;For vector databases, this matters because embeddings often contain derived personal data. Even if raw records are masked or tokenized, embeddings can still be considered personal data under GDPR. If data must remain within the EEA, or within a specific country, cloud deployments that rely on global infrastructure introduce legal risk. In these cases, on-premises or in-region deployment becomes a requirement rather than a preference.&lt;/p&gt;

&lt;h3&gt;
  
  
  HIPAA retention and the real cost of data movement
&lt;/h3&gt;

&lt;p&gt;HIPAA does not explicitly require data to stay on-premises, but it does require strict access controls, and the records it covers are usually subject to long retention periods under state law. When vector embeddings are built on top of this data, they inherit the same retention requirements. &lt;a href="https://www.actian.com/blog/data-governance/hipaa-data-governance/" rel="noopener noreferrer"&gt;HIPAA data governance&lt;/a&gt; must be enforced when considering on-premises or cloud vector databases.&lt;/p&gt;

&lt;p&gt;The cost impact becomes clear when egress fees are included. Consider a system storing 100 TB of embeddings in a cloud environment. At a common egress rate of $0.09 per GB, a single full transfer out of the cloud costs about $9,000. If the full dataset is moved out once a month over a seven-year retention period, the total is:&lt;/p&gt;

&lt;p&gt;100 TB × $0.09 per GB × 84 months = over $750,000 in egress costs alone&lt;/p&gt;
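
&lt;p&gt;The arithmetic above can be reproduced with a short sketch. It assumes decimal terabytes (1 TB = 1,000 GB) and that the full dataset leaves the cloud once per month, which is what the formula implies. The function name and its parameters are illustrative and do not come from any provider's billing API.&lt;/p&gt;

```python
# Hypothetical helper: assumes the full dataset leaves the cloud once per month.
GB_PER_TB = 1000  # decimal terabytes; switch to 1024 for binary units


def egress_cost_usd(dataset_tb, rate_per_gb, months, transfers_per_month=1):
    """Total egress cost for repeatedly moving a dataset out of the cloud."""
    gb_per_transfer = dataset_tb * GB_PER_TB
    cost_per_transfer = gb_per_transfer * rate_per_gb
    return cost_per_transfer * transfers_per_month * months


one_transfer = egress_cost_usd(100, 0.09, months=1)
seven_years = egress_cost_usd(100, 0.09, months=84)
print(f"One full transfer: ${one_transfer:,.0f}")
print(f"Monthly transfers over 84 months: ${seven_years:,.0f}")
```

&lt;p&gt;Changing &lt;code&gt;transfers_per_month&lt;/code&gt; shows how sensitive the total is to how often data actually leaves the provider's network. A system that exports its embeddings only once pays a small fraction of the seven-year figure.&lt;/p&gt;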

&lt;p&gt;This does not include compute, storage, or indexing costs. With this in mind, will &lt;a href="https://www.actian.com/blog/cloud-data-warehouse/will-cloud-data-warehouses-really-help-you-cut-costs/" rel="noopener noreferrer"&gt;cloud data warehouses&lt;/a&gt; really help you cut costs?&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial services and data sovereignty rules
&lt;/h3&gt;

&lt;p&gt;Financial institutions face additional constraints beyond GDPR. Rules such as &lt;a href="https://www.ftc.gov/business-guidance/privacy-security/gramm-leach-bliley-act" rel="noopener noreferrer"&gt;GLBA&lt;/a&gt; in the United States, prudential standards set by &lt;a href="https://www.apra.gov.au/" rel="noopener noreferrer"&gt;APRA&lt;/a&gt; in Australia, and regional data sovereignty mandates often require strict control over where customer data is stored and processed. Regulators may demand clear evidence of geographic boundaries, access controls, and auditability.&lt;/p&gt;

&lt;p&gt;Cloud services can meet some of these requirements, but they often introduce complex configurations, contractual dependencies, and ongoing compliance reviews. For many banks and insurers, on-premises deployment simplifies audits by keeping data within controlled infrastructure that regulators already understand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Government and public sector constraints
&lt;/h3&gt;

&lt;p&gt;Government contracts introduce some of the strictest infrastructure requirements. Standards such as &lt;a href="https://www.fedramp.gov/" rel="noopener noreferrer"&gt;FedRAMP&lt;/a&gt; often mandate US-only infrastructure, restricted access, and tightly controlled environments.&lt;/p&gt;

&lt;p&gt;In these cases, public cloud services are frequently disallowed or require extensive approvals. On-premises deployment is often the only viable option for running vector databases in support of government workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  When compliance makes cloud untenable
&lt;/h3&gt;

&lt;p&gt;If legal teams flag cross-border data transfers as unacceptable, cloud deployments quickly become impractical. Once data residency is mandatory, on-premises deployment is no longer a trade-off decision. It is a compliance requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hegevika082rpgbuyid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hegevika082rpgbuyid.png" width="800" height="915"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Breakdown Analysis
&lt;/h2&gt;

&lt;p&gt;Cost is often the reason teams revisit the on-premises versus cloud decision. To make a defensible decision, teams need to understand where costs diverge and when self-hosting becomes economically rational.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where self-hosting breaks even
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://openmetal.io/resources/blog/when-self-hosting-vector-databases-becomes-cheaper-than-saas/" rel="noopener noreferrer"&gt;Research from OpenMetal&lt;/a&gt; shows a consistent breakeven point for Pinecone vector databases at scale. Once workloads reach roughly 80 to 100 million queries per month, self-hosted deployments tend to be cheaper than managed cloud services. Below this range, cloud pricing is usually competitive. Above it, usage-based billing begins to dominate total cost.&lt;/p&gt;

&lt;p&gt;This threshold matters because many enterprise RAG systems cross it quickly. Customer support, document search, fraud detection, and recommendation systems often serve tens or hundreds of millions of queries each month once deployed across business units or regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The hidden cost in cloud pricing
&lt;/h3&gt;

&lt;p&gt;Cloud pricing is rarely just a per-query fee. Vector databases introduce several cost drivers that are easy to overlook during planning.&lt;/p&gt;

&lt;p&gt;Egress fees are a major factor. Most cloud providers charge around $0.09 per GB for data leaving their network. Moving embeddings between regions, exporting data for analytics, or migrating to another system all incur these fees. Over time, they become a meaningful portion of total spend.&lt;/p&gt;

&lt;p&gt;In addition, vector search does not scale linearly. As vector counts grow and dimensionality increases, query costs rise faster than expected. What looks affordable at 10 million vectors can become expensive at 500 million, even if query volume grows steadily.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-premises costs are fixed and predictable
&lt;/h3&gt;

&lt;p&gt;On-premises deployments have real costs, but they behave differently. Hardware is typically amortized over three to five years. Staffing requirements are stable once the system is running. Facilities and power costs are known in advance.&lt;/p&gt;

&lt;p&gt;The key difference is predictability. Costs do not spike because of usage patterns or data movement. Once the system is sized correctly, monthly spend remains largely flat, even as query volume increases.&lt;/p&gt;

&lt;h3&gt;
  
  
  A real-world example
&lt;/h3&gt;

&lt;p&gt;Consider a production e-commerce application with the following scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500M vectors&lt;/li&gt;
&lt;li&gt;200M queries every month&lt;/li&gt;
&lt;li&gt;1024 vector dimensions&lt;/li&gt;
&lt;li&gt;6M writes monthly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this scale, a typical managed Pinecone vector database costs around $8,500 per month once compute, storage, and rebuild overhead are included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Estimated monthly cost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Total Estimated Cost:&lt;/strong&gt;  &lt;strong&gt;$8,454 / month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usage:&lt;/strong&gt; 845 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt;  &lt;strong&gt;$279&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Query Costs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;24 &lt;strong&gt;b1 nodes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;4 shards × 6 replicas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assumption:&lt;/strong&gt; 1% filter selectivity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimated Cost:&lt;/strong&gt;   &lt;strong&gt;$8,074&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Note: Actual query cost may vary. Benchmark your workload on DRN for more accurate estimates.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Write Costs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Volume:&lt;/strong&gt; 30 million Write Units (WU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assumption:&lt;/strong&gt; Each write request consumes &lt;strong&gt;≥ 5 WU&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt;  &lt;strong&gt;$101&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo22upje65uddcdfkz4oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo22upje65uddcdfkz4oj.png" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
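
&lt;p&gt;As a quick sanity check, the three line items can be summed and expressed as shares of the total. The figures below are copied from the estimate above, and the variable names are illustrative. The point of the sketch is that query nodes account for roughly 95 percent of the bill, so query volume and node count are the levers that matter.&lt;/p&gt;

```python
# Line items copied from the estimate above (USD per month).
line_items = {"storage": 279, "queries": 8074, "writes": 101}

total = sum(line_items.values())
for name, cost in line_items.items():
    # Print each item with its share of the monthly total.
    print(f"{name:8s} ${cost:6,d}  {cost / total:6.1%}")
print(f"{'total':8s} ${total:6,d}")

# Write units: 6M monthly writes at the stated minimum of 5 WU per request.
monthly_writes = 6_000_000
min_wu_per_write = 5
write_units = monthly_writes * min_wu_per_write
print(f"write units: {write_units:,d}")
```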

&lt;p&gt;An equivalent on-premises deployment might cost approximately half of that after hardware amortization, assuming an 18-month payback period and one to two engineers supporting the system. After that payback period, costs drop further while capacity remains available.&lt;/p&gt;

&lt;p&gt;A study by &lt;a href="https://www.enterprisestorageforum.com/cloud/on-premise-vs-cloud-storage/" rel="noopener noreferrer"&gt;Enterprise Storage Forum&lt;/a&gt; compares projected costs for on-premises and cloud workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuywbo3k237j2l5a4bmc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuywbo3k237j2l5a4bmc3.png" width="600" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cost alone does not decide every deployment, but once vector workloads reach scale, the economics become difficult to ignore. Understanding where your system sits on this curve is essential before locking in a long-term vector database strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Latency And Connectivity Matter
&lt;/h2&gt;

&lt;p&gt;Latency and connectivity are often treated as secondary concerns in architecture decisions. For many AI workloads, they are decisive. Once vector databases support real-time systems, network round-trips and internet dependency can make cloud deployments impractical or unsafe.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time response requirements
&lt;/h3&gt;

&lt;p&gt;Some applications have strict response time limits. In healthcare, clinical decision support and diagnostic systems often require responses in under 50 milliseconds. This budget includes data retrieval, vector search, and model inference. Similarly, banks and other financial institutions often require very low latency to deliver the best possible user experience.&lt;/p&gt;

&lt;p&gt;Public cloud deployments add unavoidable network latency. Even within the same region, round-trip latency typically adds 20 to 80 milliseconds before any compute work begins. For applications with tight latency targets, this overhead alone can exceed the total allowed response time. On-premises deployments remove that network hop, allowing systems to meet real-time requirements consistently.&lt;/p&gt;
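
&lt;p&gt;The budget reasoning above can be made concrete with a small sketch. It uses the 50 millisecond target and the 20 to 80 millisecond network range from this section. The function name is hypothetical, and treating a local deployment as having zero network cost is a simplifying assumption.&lt;/p&gt;

```python
def remaining_budget_ms(total_budget_ms, network_rtt_ms):
    """Time left for retrieval, vector search, and inference after the network hop."""
    return total_budget_ms - network_rtt_ms


BUDGET_MS = 50  # response-time target from the healthcare example
for rtt_ms in (0, 20, 80):  # 0 models a local deployment with negligible network cost
    left = remaining_budget_ms(BUDGET_MS, rtt_ms)
    status = "feasible" if left > 0 else "budget exceeded"
    print(f"network {rtt_ms:2d} ms: {left:+d} ms left for compute ({status})")
```

&lt;p&gt;At the low end of the range, 30 milliseconds remain for all compute work. At the high end, the budget is spent before the query reaches the database.&lt;/p&gt;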

&lt;h3&gt;
  
  
  Systems that must work offline
&lt;/h3&gt;

&lt;p&gt;Many environments cannot rely on constant connectivity. Retail point-of-sale systems must continue operating during network outages. Manufacturing facilities are often located in remote areas with unstable connections. Military and maritime deployments may operate in fully disconnected or classified environments.&lt;/p&gt;

&lt;p&gt;In these scenarios, a cloud dependency is a single point of failure. If the network goes down, the AI system stops working. On-premises and edge deployments allow vector search and inference to run locally, ensuring the system continues to function even when external connectivity is unavailable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost of downtime
&lt;/h3&gt;

&lt;p&gt;It is no secret that downtime from cloud providers has increased. On November 18, 2025, a &lt;a href="https://blog.cloudflare.com/18-november-2025-outage/" rel="noopener noreferrer"&gt;Cloudflare outage&lt;/a&gt; disrupted large portions of the internet, causing &lt;a href="https://www.thenational.scot/news/national/uk-today/25630748.full-list-sites-affected-cloudflare-outage-like-x/" rel="noopener noreferrer"&gt;downtime across major platforms&lt;/a&gt; including X, Amazon Web Services, and Spotify. The impact of connectivity failures is not theoretical. In manufacturing, average downtime costs are &lt;a href="https://www.twi-institute.com/manufacturing-downtime/" rel="noopener noreferrer"&gt;estimated at $260,000 per hour&lt;/a&gt;. When AI systems support quality control, predictive maintenance, or process automation, any outage directly affects production.&lt;/p&gt;

&lt;p&gt;A cloud-only architecture introduces risk that is hard to justify in these environments. Even short network disruptions can lead to significant financial loss. On-premises deployments reduce this risk by removing external dependencies from critical execution paths.&lt;/p&gt;

&lt;p&gt;For workloads with strict latency targets or limited connectivity, the choice is often clear. Cloud-based vector databases may work during development, but they fail to meet operational requirements in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Operational Complexity Question
&lt;/h2&gt;

&lt;p&gt;The strongest argument for cloud vector databases is operational simplicity. Managed services remove the need to provision hardware, manage clusters, apply patches, or handle failures. For small teams or early-stage projects, this advantage is real and often decisive. Cloud deployments allow engineers to focus on application logic rather than infrastructure.&lt;/p&gt;

&lt;p&gt;It is also important to recognize that modern on-premises deployments look very different from those of a decade ago. This is not the world of manual server provisioning and fragile scripts. Kubernetes, infrastructure-as-code, and automated deployment pipelines have reduced operational overhead significantly. Rolling upgrades, automated scaling, and monitoring are now standard practices in on-premises environments as well as in the cloud.&lt;/p&gt;

&lt;p&gt;Many enterprises adopt hybrid approaches to balance speed and control. Development and experimentation happen in the cloud, where teams can move quickly and iterate. Production systems run on-premises, where costs are predictable and compliance is easier to enforce. This pattern allows teams to get the best of both models without committing fully to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Eight Questions
&lt;/h2&gt;

&lt;p&gt;The fastest way to make a defensible deployment decision is to walk through a small set of yes or no questions with engineering, legal, finance, and operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Does your data require geographic restrictions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Regulations such as GDPR, HIPAA, and financial services rules may limit where data can be stored or processed.&lt;/p&gt;

&lt;p&gt;If yes, on-premises should be strongly considered because it provides full control over data location. If no, cloud deployment remains viable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Do you have predictable, high-volume query patterns?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud vector database costs scale with usage. A simple check is monthly queries multiplied by the unit cost.&lt;/p&gt;

&lt;p&gt;If usage exceeds roughly 80 to 100 million queries per month, on-premises is often cheaper. Below that range, cloud pricing is usually more economical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Do you need offline capability?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some systems must continue working without network access, such as in manufacturing, retail, or edge environments.&lt;/p&gt;

&lt;p&gt;If yes, on-premises is required. If no, cloud remains an option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Can you tolerate additional latency?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud deployments add network latency, typically 20 to 80 milliseconds per round trip even within the same region.&lt;/p&gt;

&lt;p&gt;If your application cannot tolerate this, on-premises deployment is necessary. If it can, cloud performance may be acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Do you have existing infrastructure teams?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operational capacity matters.&lt;/p&gt;

&lt;p&gt;If you already run on-premises systems, the added burden is limited. If not, cloud-managed services provide a clear operational advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Is cost predictability important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Usage-based billing introduces cost variability.&lt;/p&gt;

&lt;p&gt;If predictable costs matter, on-premises provides stability. If flexibility matters more, cloud pricing may be a better fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Are you extending existing IT infrastructure?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deployment context affects the decision.&lt;/p&gt;

&lt;p&gt;If you are extending existing systems, on-premises leverages current investments. If you are building something new, cloud may be faster to deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. How large is your data footprint?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data volume and access frequency influence long-term cost.&lt;/p&gt;

&lt;p&gt;If you manage more than 10 TB with frequent access, on-premises becomes attractive. If your data is smaller, cloud is often sufficient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wz1bqfz6210hv7bkcyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wz1bqfz6210hv7bkcyv.png" width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When several answers point in the same direction, the decision becomes easy to explain and defend across engineering, legal, finance, and operations teams.&lt;/p&gt;
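
&lt;p&gt;One way to record the outcome of such a review is a simple tally. The sketch below maps each of the eight questions to the deployment model that a "yes" answer points to, following the guidance above. The equal weighting, the function names, and the example answers are assumptions for illustration only. In practice a single hard requirement, such as offline operation, can override the count.&lt;/p&gt;

```python
# Each entry: (question, deployment model that a "yes" answer points to).
# The mapping follows the eight questions above; the equal-weight tally is an assumption.
QUESTIONS = [
    ("geographic restrictions", "on-prem"),
    ("predictable high-volume queries", "on-prem"),
    ("offline capability needed", "on-prem"),
    ("can tolerate added latency", "cloud"),
    ("existing infrastructure team", "on-prem"),
    ("cost predictability important", "on-prem"),
    ("extending existing infrastructure", "on-prem"),
    ("large data footprint", "on-prem"),
]


def other(direction):
    """Return the opposite deployment model."""
    return "cloud" if direction == "on-prem" else "on-prem"


def tally(answers):
    """Count how many of the eight yes/no answers point to each deployment model."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    votes = {"on-prem": 0, "cloud": 0}
    for (_, yes_direction), answer in zip(QUESTIONS, answers):
        votes[yes_direction if answer else other(yes_direction)] += 1
    return votes


# Hypothetical team: regulated data, high volume, no offline need, latency-tolerant,
# has an infrastructure team, wants predictable costs, extends existing systems, large data.
example = [True, True, False, True, True, True, True, True]
print(tally(example))
```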

&lt;h2&gt;
  
  
  When Cloud Makes Sense
&lt;/h2&gt;

&lt;p&gt;On-premises deployment is not always the right answer. In many situations, cloud-based vector databases remain the better choice. Being clear about these cases helps avoid over-engineering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable scaling:&lt;/strong&gt; Startups and new products often face uncertain growth. Cloud platforms allow rapid scaling without long-term infrastructure commitments, which reduces risk when demand is unclear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small data volumes:&lt;/strong&gt; When total data is under 10 TB and query volume stays below about 50 million queries per month, cloud pricing usually works well and is simpler than self-hosting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid experimentation:&lt;/strong&gt; Proof-of-concepts, research projects, and early prototypes benefit from fast setup and easy teardown. Cloud services support quick iteration with minimal operational effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No compliance constraints:&lt;/strong&gt; If data residency, sovereignty, and regulatory requirements are not an issue, cloud deployment avoids legal complexity and speeds up delivery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited infrastructure expertise:&lt;/strong&gt; Teams focused on application logic rather than operations can rely on managed services instead of maintaining databases, clusters, and hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, cloud is the most effective and practical option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Deployment Strategies
&lt;/h2&gt;

&lt;p&gt;Hybrid deployments act as the middle ground for enterprises that need both speed and control. Rather than treating cloud and on-premises as mutually exclusive, teams place each part of the system where it performs best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud for iteration, on-prem for scale
&lt;/h3&gt;

&lt;p&gt;A common pattern is to develop and test in the cloud, where managed services and elastic infrastructure enable rapid iteration. Once models, indexes, or pipelines are stable, they are promoted into on-premises production environments to meet compliance, latency, and operational requirements. This preserves developer velocity without compromising production guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data segregation by risk and regulation
&lt;/h3&gt;

&lt;p&gt;Hybrid architectures also allow organizations to separate workloads by risk profile. Sensitive or regulated data stays on-premises, while analytics, training, or search over derived data runs in the cloud. The same logic applies regionally: EU data may remain on-premises or in sovereign environments, while US workloads run in public cloud regions, avoiding global systems being constrained by the strictest jurisdiction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost and migration flexibility
&lt;/h3&gt;

&lt;p&gt;Cost optimization is another driver. Frequently accessed vectors or low-latency services can be cheaper and more predictable on-premises, while cold storage and bursty workloads benefit from cloud pricing. Many teams start cloud-first, then selectively move components on-premises as scale or compliance pressures grow. Hybrid makes this a controlled evolution rather than a disruptive rewrite.&lt;/p&gt;

&lt;p&gt;Vendor offerings suggest this is a stable operating model. &lt;a href="https://cloud.google.com/distributed-cloud" rel="noopener noreferrer"&gt;Google Distributed Cloud&lt;/a&gt; and similar platforms explicitly frame hybrid as a long-term strategy, recognizing that modern systems are designed to span environments, not collapse them into one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actian’s Approach To On-premises Vector Databases
&lt;/h2&gt;

&lt;p&gt;For teams that conclude that on-premises is the right deployment model, the next question is: which platform can actually meet these requirements? &lt;a href="https://www.actian.com/" rel="noopener noreferrer"&gt;Actian’s approach&lt;/a&gt; is built specifically for this audience, without assuming the cloud is the default or the end state.&lt;/p&gt;

&lt;p&gt;Actian delivers an &lt;a href="https://www.actian.com/blog/databases/introducing-actian-vector-7-0/" rel="noopener noreferrer"&gt;enterprise-grade vector database&lt;/a&gt; that runs fully in your own data center or controlled environments. You retain full control over data placement, networking, and operations. There is no forced dependency on external cloud services, which simplifies audits and long-term system design.&lt;/p&gt;

&lt;p&gt;Compliance requirements are treated as baseline constraints. By keeping data local and eliminating egress paths, Actian aligns with GDPR, HIPAA, FedRAMP, and similar regulatory frameworks. This reduces the need for compensating controls or complex legal workarounds.&lt;/p&gt;

&lt;p&gt;Cost behavior is also predictable. Actian avoids usage-based pricing models that scale with queries or vector counts. This makes budgeting simpler and removes surprises as workloads grow.&lt;/p&gt;

&lt;p&gt;Actian also supports edge deployments. Its architecture supports offline operation and local inference, making it suitable for manufacturing sites, retail locations, and other environments where connectivity is limited or unreliable. The system is designed to keep working even when the network does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Choosing between cloud and on-premises for vector databases is about understanding your priorities. Cloud works well for small workloads, rapid experimentation, and teams without deep infrastructure expertise. On-premises makes sense when compliance, latency, cost predictability, or scale are critical.&lt;/p&gt;

&lt;p&gt;Many enterprises find a hybrid approach is the best balance, combining cloud flexibility with on-premises control. The key is making intentional decisions based on your data, workloads, and regulatory needs rather than following trends.&lt;/p&gt;

&lt;p&gt;As the data and AI division of &lt;a href="https://www.hcl-software.com/" rel="noopener noreferrer"&gt;HCLSoftware&lt;/a&gt;, Actian helps enterprises manage and govern data at scale across on-premises, cloud, and hybrid environments. Organizations trust Actian data management and data intelligence solutions to streamline complex data environments and accelerate the delivery of AI-ready data. Learn more about &lt;a href="https://www.actian.com/databases/vector-ai-db/#waitlist" rel="noopener noreferrer"&gt;Actian&lt;/a&gt; and how it fits into your on-premises AI strategy.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>database</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>How to Evaluate Vector Databases in 2026</title>
      <dc:creator>Odewole Babatunde Samson</dc:creator>
      <pubDate>Tue, 19 May 2026 09:12:21 +0000</pubDate>
      <link>https://dev.to/actiandev/how-to-evaluate-vector-databases-in-2026-213m</link>
      <guid>https://dev.to/actiandev/how-to-evaluate-vector-databases-in-2026-213m</guid>
      <description>&lt;p&gt;In 2026, a synthetic performance crisis challenges the vector database market. A GitHub search for “vector database benchmark” reveals polished repositories with dashboards and performance charts. However, vendors often build these tools to evaluate their own products and portray architecture-specific strengths as objective comparisons.&lt;/p&gt;

&lt;p&gt;Zilliz maintains VectorDBBench. Redis and Qdrant publish benchmark suites that highlight their own systems. Even widely cited Approximate Nearest Neighbor (ANN) evaluations, such as ANN-Benchmarks, rely on older, lower-dimensional image-descriptor datasets such as SIFT (Scale-Invariant Feature Transform, 128 dimensions) and GIST (960 dimensions). Modern Large Language Model (LLM) embeddings often reach 3,072 dimensions. These benchmarks do not reflect that reality.&lt;/p&gt;

&lt;p&gt;Leaderboards reward performance under static conditions, yet production systems must survive continuous writes, metadata filters, and concurrency spikes. As software engineer Simon Frey famously noted in a &lt;a href="https://simon-frey.com/blog/why-vector-database-are-a-scam/" rel="noopener noreferrer"&gt;viral post&lt;/a&gt;: “The best vector database is the one you already have.” This captures the 2026 market shift, prompting teams to move from specialized silos toward the databases they already trust and operate.&lt;/p&gt;

&lt;p&gt;This guide takes a production-first approach. We define the five critical tests for 2026 and explore why your optimal vector database may already exist within your current architecture, whether that is PostgreSQL with pgvector or an enterprise hybrid engine like Actian VectorAI DB.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The bias:&lt;/strong&gt; Most benchmark suites originate from vendors and optimize for narrow architectural advantages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; Production workloads include continuous ingestion, metadata filtering, and concurrency spikes that synthetic tests ignore.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The risk:&lt;/strong&gt; Tail latency (P99), index fragmentation, and write amplification degrade systems long before average QPS drops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The cost curve:&lt;/strong&gt; Managed vector services often introduce nonlinear pricing as the dataset size increases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The direction:&lt;/strong&gt; 2026 favors integrated platforms, from established relational extensions (PostgreSQL + pgvector) to enterprise hybrid systems (Actian VectorAI DB), over “vector-only” silos.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Every Benchmark You’ve Seen is Vendor-Optimized&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Benchmarks create a perception of objectivity but often encode architectural assumptions. Tools like VectorDBBench (Zilliz) reward distributed scaling, while Redis and Qdrant suites emphasize in-memory operations. To find objective data, architects must look to peer-reviewed academic conferences such as &lt;a href="https://neurips.cc/" rel="noopener noreferrer"&gt;NeurIPS&lt;/a&gt; and VLDB (Very Large Data Bases), which prioritize algorithmic rigor over marketing.&lt;/p&gt;

&lt;p&gt;Before examining what matters in production, it helps to understand how common benchmark tools shape outcomes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark tool&lt;/th&gt;
&lt;th&gt;Primary creator&lt;/th&gt;
&lt;th&gt;Optimization focus&lt;/th&gt;
&lt;th&gt;Typical bias&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VectorDBBench&lt;/td&gt;
&lt;td&gt;Zilliz (Milvus)&lt;/td&gt;
&lt;td&gt;High-throughput scaling&lt;/td&gt;
&lt;td&gt;Favors massive clusters; penalizes single-node systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vector-db-benchmark&lt;/td&gt;
&lt;td&gt;Redis/Qdrant&lt;/td&gt;
&lt;td&gt;In-memory operations&lt;/td&gt;
&lt;td&gt;Favors RAM-heavy architectures; ignores TCO of memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ANN-Benchmarks&lt;/td&gt;
&lt;td&gt;Academic&lt;/td&gt;
&lt;td&gt;Raw algorithm efficiency&lt;/td&gt;
&lt;td&gt;Uses outdated, low-dimensional datasets (SIFT/GIST).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NeurIPS / VLDB&lt;/td&gt;
&lt;td&gt;Academic Peers&lt;/td&gt;
&lt;td&gt;Algorithmic robustness&lt;/td&gt;
&lt;td&gt;Focuses on math/theory; ignores operational/SLA reality.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Hidden Rules of Benchmarking&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A significant hurdle is the “DeWitt Clause,” a legal provision in many End User License Agreements (&lt;a href="https://en.wikipedia.org/wiki/End-user_license_agreement" rel="noopener noreferrer"&gt;EULAs&lt;/a&gt;) that prohibits users from publishing independent benchmarks without the vendor’s permission. In 2024, &lt;a href="https://benchant.com/blog/vectordb-de-witt" rel="noopener noreferrer"&gt;BenchANT&lt;/a&gt; found that 30% of the major vector databases carry such a clause, which legally prevents users from disclosing that a product is slow.&lt;/p&gt;

&lt;p&gt;Furthermore, these benchmarks often operate at “Time Zero,” the artificial window immediately following ingestion but preceding live updates. In production, systems must constantly insert and delete data, forcing the index to re-optimize in real time. Vendor benchmarks often omit the Out-of-Memory (&lt;a href="https://en.wikipedia.org/wiki/Out_of_memory" rel="noopener noreferrer"&gt;OOM&lt;/a&gt;) failures that result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q2kar0qcprn5w1ihnl6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q2kar0qcprn5w1ihnl6.png" alt="circular validation loop" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Five Production Tests That Actually Matter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most benchmarks measure performance after loading data, before any real updates occur. But production is a nonstop, unpredictable process. To find a database that can handle real users, you should run these five stress tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Filtering under concurrent load&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pure vector similarity searches are rare in real life. In production, you’re more likely to search for something like “Product recommendations WHERE category is ‘shoes’ AND stock &amp;gt; 0.”&lt;/p&gt;

&lt;p&gt;Reddit’s engineering team, managing 340M+ vectors, identified &lt;a href="https://www.reddit.com/r/RedditEng/comments/1ozxnjc/choosing_a_vector_database_for_ann_search_at/" rel="noopener noreferrer"&gt;metadata filtering&lt;/a&gt; as the primary performance bottleneck in their 2025 deployment. They found that as concurrent users grew, the database spent more time resolving metadata filters than calculating similarity distances.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The reality: Production means 100+ concurrent clients hitting different metadata subsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The gap: VectorDBBench only tests with a single client. In real-world situations, moving data between the vector graph and the relational metadata store can cause P99 latency to jump by 10x, as the CPU waits for disk I/O.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
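
&lt;p&gt;One way to probe this gap is a small concurrent load harness. The sketch below is a minimal example under stated assumptions: &lt;code&gt;run_filtered_query&lt;/code&gt; is a hypothetical stand-in for your database client (here it only sleeps), threads approximate concurrent clients, and the category names and query counts are illustrative rather than taken from any benchmark.&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def run_filtered_query(category):
    """Hypothetical stand-in for a filtered vector query; replace with a real client call."""
    time.sleep(0.001)  # simulated work only
    return []


def timed_call(category):
    """Run one query and return its latency in milliseconds."""
    start = time.perf_counter()
    run_filtered_query(category)
    return (time.perf_counter() - start) * 1000.0


def measure_p99(categories, clients=100, queries_per_client=5):
    """Issue queries from many threads, cycling through the metadata filters."""
    total = clients * queries_per_client
    workload = [categories[i % len(categories)] for i in range(total)]
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(timed_call, workload))
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = quantiles(latencies, n=100, method="inclusive")[98]
    return latencies, p99


if __name__ == "__main__":
    samples, p99_ms = measure_p99(["shoes", "bags", "hats"])
    print(len(samples), "queries, P99 (ms):", round(p99_ms, 2))
```

&lt;p&gt;Swapping the stub for a real client call and comparing the result against a single-client run gives a rough view of how much tail latency the filter path adds under contention.&lt;/p&gt;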

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Performance degradation over time&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While archival retrieval-augmented generation (RAG) systems can technically use static knowledge bases, production-grade applications in 2026 must reflect real-time data, such as customer tickets or product inventory. As the engineering team at &lt;a href="https://milvus.io/blog/benchmarks-lie-vector-dbs-deserve-a-real-test.md" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; admitted, “Benchmarks test after data ingestion completes, but production data never stops flowing.” If the database cannot re-index as quickly as it ingests data, your AI may provide stale or incorrect answers for hours.&lt;/p&gt;

&lt;p&gt;Benchmarks that omit a “72-hour continuous write-and-query” test tell you little about steady-state behavior. You need to know whether query performance degrades after months of continuous index maintenance, not just how the index performs on day one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Tail latency under load (P95/P99)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Average latency can be misleading and doesn’t show what users really experience. For example, a 10ms average response time doesn’t help if your slowest 1% of queries (P99) take 800ms. This makes your AI agent seem slow and unreliable. Only high-concurrency tests reveal these spikes, which often happen during garbage collection or index locking.&lt;/p&gt;
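
&lt;p&gt;To make the gap between average and tail latency concrete, here is a small sketch using made-up latency samples (the numbers are illustrative assumptions, not measurements). It uses the nearest-rank definition of a percentile; other definitions interpolate and can give slightly different values.&lt;/p&gt;

```python
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Illustrative latencies in milliseconds: 985 fast queries and 15 slow ones.
latencies_ms = [2.0] * 985 + [550.0] * 15

print("mean:", round(mean(latencies_ms), 2))              # 10.22
print("P95:", percentile_nearest_rank(latencies_ms, 95))  # 2.0
print("P99:", percentile_nearest_rank(latencies_ms, 99))  # 550.0
```

&lt;p&gt;The mean and even P95 look healthy here, while P99 exposes the slow queries that users actually notice.&lt;/p&gt;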

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Total cost of ownership (TCO)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2025, managed vendors introduced complex “read unit” pricing. This created a “growth penalty”: if your index grows from 10GB to 100GB, you may pay 10x as much for the same query result.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale metric&lt;/th&gt;
&lt;th&gt;Managed Vector DB (usage-based)&lt;/th&gt;
&lt;th&gt;Integrated/Hybrid platform&lt;/th&gt;
&lt;th&gt;TCO impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Initial (10GB)&lt;/td&gt;
&lt;td&gt;High (Platform fee + usage)&lt;/td&gt;
&lt;td&gt;Moderate (Fixed resource)&lt;/td&gt;
&lt;td&gt;Integrated is ~40% lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Growth (100GB)&lt;/td&gt;
&lt;td&gt;High (Scales with volume)&lt;/td&gt;
&lt;td&gt;Low (Vertical scaling)&lt;/td&gt;
&lt;td&gt;8x cost gap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise (1TB+)&lt;/td&gt;
&lt;td&gt;Prohibitive (Linear growth)&lt;/td&gt;
&lt;td&gt;Optimized (Reserved capacity)&lt;/td&gt;
&lt;td&gt;90%+ long-term savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This economic reality primarily drives the market’s shift toward “Vector as a Feature,” in which teams prioritize &lt;a href="https://www.actian.com/on-premises-data/" rel="noopener noreferrer"&gt;on-premises&lt;/a&gt; capabilities and predictable scaling over usage-based silos.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Operational maturity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Benchmarks ignore the “Operational Support Tax,” which quantifies the cost and risk of maintaining specialized infrastructure. You can easily find a PostgreSQL expert because the community has thrived for 30 years, but hiring someone proficient in a niche, three-year-old vector database often creates a bottleneck.&lt;/p&gt;

&lt;p&gt;Evaluate the ecosystem: Does the database work with standard backup tools? Can it integrate with Prometheus? How long does it take to rebuild an index after a crash?&lt;/p&gt;

&lt;p&gt;Here’s how benchmark claims compare to production reality.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Benchmark focus&lt;/th&gt;
&lt;th&gt;Production reality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion&lt;/td&gt;
&lt;td&gt;Static QPS after completion&lt;/td&gt;
&lt;td&gt;Sustained QPS during continuous writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Average latency&lt;/td&gt;
&lt;td&gt;P95/P99 latency under concurrent load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filtering&lt;/td&gt;
&lt;td&gt;Single-client filtered search&lt;/td&gt;
&lt;td&gt;100+ concurrent metadata-filtered queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Infrastructure cost per query&lt;/td&gt;
&lt;td&gt;TCO at 100M+ queries/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewamr9w9ad8b6w70akzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewamr9w9ad8b6w70akzk.png" alt="the ingestion cliff" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spotting these hidden bottlenecks is the first step to building a strong system. In 2026, the answer is rarely to use a faster, specialized database. Instead, engineers are adding these features to the tools they already know and trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Consolidation Shift: Vector as a Feature&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Corey Quinn, Chief Cloud Economist, once said: “Vector is a feature, not a product.” This prediction shapes the 2026 market. Teams are moving away from specialized “Vector-Only” databases and choosing integrated “Vector-Also” platforms. Shifting data between a main database and a separate vector database often causes more problems than it fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The PostgreSQL renaissance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Engineers frequently argue on platforms like Hacker News that ~80% of RAG use cases (specifically those with fewer than 2M embeddings) do not require a specialized vector database. For these workloads, standalone silos often introduce more operational friction than they offer in performance gains. Instacart validated this at scale by migrating from &lt;a href="https://www.infoq.com/news/2025/08/instacart-elasticsearch-postgres/" rel="noopener noreferrer"&gt;Elasticsearch to PostgreSQL&lt;/a&gt;, achieving 80% cost savings and reducing write workload by 10x after eliminating the need to coordinate and reconcile data across fragmented architectures.&lt;/p&gt;

&lt;p&gt;Recently, pgvectorscale achieved 471 queries per second at &lt;a href="https://www.tigerdata.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;99% recall on 50 million vectors&lt;/a&gt;, outperforming Qdrant’s 41 QPS on identical AWS hardware. Vendor benchmarks often omit this result because it shows that most RAG applications don’t require a specialized vendor.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Performance metric&lt;/th&gt;
&lt;th&gt;PostgreSQL (pgvector + pgvectorscale)&lt;/th&gt;
&lt;th&gt;Qdrant (Specialized)&lt;/th&gt;
&lt;th&gt;The Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput (QPS)&lt;/td&gt;
&lt;td&gt;471.57&lt;/td&gt;
&lt;td&gt;41.47&lt;/td&gt;
&lt;td&gt;11.4x higher in Postgres&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;60.42 ms&lt;/td&gt;
&lt;td&gt;36.73 ms&lt;/td&gt;
&lt;td&gt;Qdrant is 39% faster at tail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency&lt;/td&gt;
&lt;td&gt;74.60 ms&lt;/td&gt;
&lt;td&gt;38.71 ms&lt;/td&gt;
&lt;td&gt;Qdrant is 48% faster at tail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;AWS r6id.4xlarge (16 vCPU)&lt;/td&gt;
&lt;td&gt;AWS r6id.4xlarge (16 vCPU)&lt;/td&gt;
&lt;td&gt;Parity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The integrated enterprise gap&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For workloads that exceed basic extensions, &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; bridges the gap by embedding a high-performance engine with native vector support. Teams can execute metadata filtering and similarity search within a single system, reducing data movement and simplifying query execution.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Architectural strategy&lt;/th&gt;
&lt;th&gt;Intended AI capability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actian VectorAI DB&lt;/td&gt;
&lt;td&gt;High-performance hybrid&lt;/td&gt;
&lt;td&gt;Engineered for integrated analytics + native vector support.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Integrated feature&lt;/td&gt;
&lt;td&gt;Leverages pgvector within standard SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS S3 Vectors&lt;/td&gt;
&lt;td&gt;Storage-centric&lt;/td&gt;
&lt;td&gt;Designed to query multi-billion vectors in object storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB Atlas&lt;/td&gt;
&lt;td&gt;Unified document/vector API&lt;/td&gt;
&lt;td&gt;Integrates native vector search directly into the existing document store workflow.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As the market consolidates, the way we evaluate databases shifts. Teams no longer ask, “Who has the fastest graph?” They ask, “Which architecture provides the most reliable query engine?” No universal winner exists. Teams instead face a spectrum of trade-offs between specialized speed and integrated reliability.&lt;/p&gt;

&lt;p&gt;The evaluation process now puts more weight on operational strength, real-world flexibility, and reliable query execution, especially as demand for hybrid search grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Hybrid Search Reality That Pure Vector Benchmarks Hide&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pure vector search often fails the “groundedness” test, which measures how strictly an AI’s response relies on provided source material. A high groundedness score ensures that the LLM avoids fabrication and adheres closely to your internal data.&lt;/p&gt;

&lt;p&gt;According to an analysis by the &lt;a href="https://techcommunity.microsoft.com/blog/azuredevcommunityblog/doing-rag-vector-search-is-not-enough/4161073" rel="noopener noreferrer"&gt;Microsoft Azure DevBlog&lt;/a&gt;, pure vector search alone struggles with factual accuracy, scoring a mediocre 2.79 out of 5 for groundedness. The solution is Hybrid Search, which blends semantic vector similarity with traditional keyword matching (BM25).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The 20–40% performance penalty&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Hybrid search demands significant computation. The database must rank results from two different engines, such as lexical and semantic, then merge them using a fusion algorithm. Production implementations typically see a &lt;a href="https://www.pinecone.io/blog/hybrid-search" rel="noopener noreferrer"&gt;20–40% performance penalty&lt;/a&gt; when moving from pure vector search to hybrid search. Reciprocal Rank Fusion (RRF) creates most of this “merge tax”, which, according to &lt;a href="https://www.elastic.co/search-labs/es/blog/improving-information-retrieval-elastic-stack-hybrid" rel="noopener noreferrer"&gt;Elastic’s research&lt;/a&gt;, can significantly increase query latency compared to single-index lookups.&lt;/p&gt;
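
&lt;p&gt;The fusion step itself is simple to sketch. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every result list it appears in. The example below is a minimal illustration, not any vendor’s implementation: the function name, the document IDs, and the choice of k = 60 (a commonly used constant) are assumptions.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list ordered by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["a", "b", "c"]   # e.g. BM25 results, best first
semantic_hits = ["b", "c", "d"]  # e.g. vector similarity results, best first

print(reciprocal_rank_fusion([lexical_hits, semantic_hits]))
# ['b', 'c', 'a', 'd']
```

&lt;p&gt;Documents that rank well in both lists rise to the top. The cost comes from running two searches and materializing both result lists before this merge can happen.&lt;/p&gt;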

&lt;p&gt;Databases that integrate vector search, filtering, and full-text search in a single engine can run a hybrid query as one statement. The query optimizer can evaluate metadata filters, full-text conditions, and vector similarity together, which lets it produce better execution plans and move less data.&lt;/p&gt;

&lt;p&gt;In contrast, specialized vector silos fragment the query path. Applications route requests across multiple systems and merge results outside the database. This increases system complexity and introduces unpredictable latency under load.&lt;/p&gt;

&lt;p&gt;Hybrid platforms such as Actian VectorAI DB address this problem by embedding vector search within the database engine. This design removes cross-system joins, simplifies operations, and reduces long-term architectural overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob6yxz2vdxje7662vpwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob6yxz2vdxje7662vpwd.png" alt="integrated query execution diagram" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Build Your Own Evaluation Framework&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Stop asking which database won a GitHub leaderboard. Start asking which architecture survives your constraints. In 2026, these constraints center on data residency, scale, and team expertise.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The case for hybrid and on-premises&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data residency is no longer optional for global companies. With EU AI Act penalties reaching up to €35 million or 7% of global annual turnover, &lt;a href="https://datastores.ai/cloud" rel="noopener noreferrer"&gt;cloud-only&lt;/a&gt; vector databases can be a legal non-starter for regulated industries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sovereignty: 60% of financial firms outside the US plan to adopt sovereign/on-premises vector solutions by 2028.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost: As query volumes hit 100M/month, the “cloud tax” becomes visible. Self-hosting or using hybrid platforms like Actian can cut your infrastructure bill in half.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maturity: If you already manage a relational database, your team possesses 90% of the required skills.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The 2026 architecture decision tree&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Does the data require on-premises storage for compliance? → Prioritize Actian VectorAI DB or self-hosted PostgreSQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Does your query volume exceed 100M/month? → Avoid managed usage-based pricing; use self-hosted or reserved capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do you require complex metadata filtering? → An integrated relational/vector engine is non-negotiable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
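
&lt;p&gt;The three questions above can be encoded as a small helper for design reviews. This is only a sketch of the list as written; the function and parameter names are hypothetical, and each answer is passed in as a plain boolean.&lt;/p&gt;

```python
def recommend_architecture(needs_on_prem, over_100m_queries_per_month,
                           needs_complex_filtering):
    """Return the recommendations from the decision tree that apply, in order."""
    recommendations = []
    if needs_on_prem:
        recommendations.append(
            "Prioritize Actian VectorAI DB or self-hosted PostgreSQL")
    if over_100m_queries_per_month:
        recommendations.append(
            "Avoid managed usage-based pricing; use self-hosted or reserved capacity")
    if needs_complex_filtering:
        recommendations.append(
            "Use an integrated relational/vector engine")
    return recommendations


for line in recommend_architecture(True, False, True):
    print("-", line)
```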

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcchgwu5uw1k2xentt1nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcchgwu5uw1k2xentt1nb.png" alt="architecture decision tree" width="800" height="762"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Evaluate the Evaluators&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To avoid letting vendor benchmarks mislead you, give the evaluation tool the same careful review you give the database. To spot a biased test, look past the headline QPS numbers and check the exact conditions that produced them.&lt;/p&gt;

&lt;p&gt;Use the following evaluation rubric to review any benchmark report before it shapes your architectural decisions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evaluation metric&lt;/th&gt;
&lt;th&gt;Red flag (Discard result)&lt;/th&gt;
&lt;th&gt;Green flag (Trustworthy result)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion state&lt;/td&gt;
&lt;td&gt;Queries run against a static, immutable index with zero background writes.&lt;/td&gt;
&lt;td&gt;“Read-while-Write” testing, where queries run during continuous data ingestion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware parity&lt;/td&gt;
&lt;td&gt;Vendor cloud “Optimized” vs. Competitor “Default” local/mismatched instances.&lt;/td&gt;
&lt;td&gt;Verified identical CPU, RAM, and Disk I/O configurations across all tested systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data selectivity&lt;/td&gt;
&lt;td&gt;“High Selectivity” filters (99% of data removed) that hide join/scan inefficiencies.&lt;/td&gt;
&lt;td&gt;“Low Selectivity” (10–20% filtered) tests that force the engine to handle large-scale index traversal.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dimensionality&lt;/td&gt;
&lt;td&gt;Testing on 128-dimension legacy datasets (SIFT/GIST).&lt;/td&gt;
&lt;td&gt;Testing on 1,536 or 3,072-dimension vectors that match modern LLM outputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency metric&lt;/td&gt;
&lt;td&gt;Focuses strictly on “Average Latency” or “Mean Response Time.”&lt;/td&gt;
&lt;td&gt;Clearly publishes P95 and P99 tail latency under high concurrent load.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pre-Commitment Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Test with production-representative high-dimensional embeddings (1,536d to 3,072d, matching your embedding model).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measure P99 latency with 100+ concurrent users hitting diverse metadata filters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate 3-year TCO, including storage growth, egress, and re-indexing fees.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confirm that your team can manage observability and backups for the new stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
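
&lt;p&gt;For the TCO item, a rough model is usually enough to compare options. The sketch below sums storage, egress, and periodic re-indexing costs over 36 months with compounding storage growth. Every rate and parameter name is a hypothetical input you would replace with figures from your own quotes; none of them come from a vendor price list.&lt;/p&gt;

```python
def three_year_tco(start_gb, monthly_growth, storage_usd_per_gb,
                   egress_usd_per_month, reindex_fee_usd, reindex_every_months):
    """Sum storage, egress, and periodic re-indexing costs over 36 months."""
    total = 0.0
    size_gb = float(start_gb)
    for month in range(1, 37):
        total += size_gb * storage_usd_per_gb + egress_usd_per_month
        if month % reindex_every_months == 0:
            total += reindex_fee_usd
        size_gb *= 1.0 + monthly_growth  # compound storage growth
    return total


# Hypothetical inputs: 10 GB growing 5% a month, re-indexed once a year.
estimate = three_year_tco(10, 0.05, 0.30, 5.0, 100.0, 12)
print("3-year TCO (USD):", round(estimate, 2))
```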

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real evaluation requires testing with your data, your patterns, and your scale. Load production-representative data, run a week-long stability test under concurrent load, and measure P99 latency and TCO.&lt;/p&gt;

&lt;p&gt;If your workload requires compliance, hybrid deployment, or production-grade operational maturity that managed vector databases don’t offer, then Actian VectorAI DB &lt;a href="https://www.actian.com/databases/vectorai-db/#waitlist" rel="noopener noreferrer"&gt;early access&lt;/a&gt; is the right next step.&lt;/p&gt;

&lt;p&gt;Join the &lt;a href="https://discord.gg/432A2M63Py" rel="noopener noreferrer"&gt;Actian community on Discord&lt;/a&gt; to discuss vector architecture with engineers solving real production problems.&lt;/p&gt;

</description>
      <category>database</category>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>ai</category>
    </item>
    <item>
      <title>The hidden cost of vector database pricing models</title>
      <dc:creator>Oluseye Jeremiah</dc:creator>
      <pubDate>Mon, 18 May 2026 14:47:31 +0000</pubDate>
      <link>https://dev.to/actiandev/the-hidden-cost-of-vector-database-pricing-models-n45</link>
      <guid>https://dev.to/actiandev/the-hidden-cost-of-vector-database-pricing-models-n45</guid>
      <description>&lt;p&gt;For a long time, usage-based pricing seemed like the safest way to run new infrastructure. The appeal was to start small, pay very little, and let costs rise only if the product proved itself. For teams experimenting with semantic search or early retrieval systems, that trade-off made sense, particularly when fixed infrastructure commitments felt riskier than uncertain usage patterns.&lt;/p&gt;

&lt;p&gt;That sense of safety began to fade in 2025 as several vector database providers introduced pricing floors and minimums. Pinecone announced a $50/month minimum, Weaviate implemented a $25/month floor, and similar changes rippled across the managed vector database market. &lt;/p&gt;

&lt;p&gt;Small, steady workloads suddenly experienced step changes in cost without any corresponding increase in activity, a pattern that reflected a broader shift across the SaaS landscape. Always-on vector database infrastructure no longer fits the economics of single-digit monthly pricing. &lt;a href="https://www.cio.com/article/4104365/saas-price-hikes-put-cios-budgets-in-a-bind.html" rel="noopener noreferrer"&gt;SaaS subscription&lt;/a&gt; costs from several large vendors rose between 10% and 20% in 2025, outpacing &lt;a href="https://www.cio.com/article/4104365/saas-price-hikes-put-cios-budgets-in-a-bind.html" rel="noopener noreferrer"&gt;IT budget&lt;/a&gt; growth projections of 2.8%, according to &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-07-15-gartner-forecasts-worldwide-it-spending-to-grow-7-point-9-percent-in-2025" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today, vector databases power production systems at scale. They run semantic search, recommendations, copilots, and internal knowledge tools. Data volumes stay relatively stable, and traffic patterns follow predictable curves. Yet for many organizations, vector search infrastructure has become one of the most volatile cost centers in the stack. Not because usage swings wildly, but because vector database pricing models behave differently once systems mature.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-native vector database pricing advertises low minimums and usage-based flexibility, but production costs tell a different story.&lt;/li&gt;
&lt;li&gt;Hidden fees (embeddings, reindexing, backups) can double your bill. &lt;/li&gt;
&lt;li&gt;Query costs scale with dataset size, meaning the same query becomes 10x more expensive as you grow from 10GB to 100GB. &lt;/li&gt;
&lt;li&gt;The October 2025 pricing shift introduced $50 minimums, forcing cost increases of 400% or more on stable workloads that previously cost under $10 a month.&lt;/li&gt;
&lt;li&gt;At 60-100M queries/month, self-hosting becomes 50-75% cheaper than cloud. &lt;/li&gt;
&lt;li&gt;Pricing model must be an architectural decision, not an afterthought.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What pricing pages leave out
&lt;/h3&gt;

&lt;p&gt;Vector database pricing pages summarize the offering rather than model long-term cost. Their job is to make adoption frictionless, not to walk you through how the bill is calculated after a system is live. Most pages spotlight a familiar set of numbers: storage per gigabyte, read and write units, and a low monthly minimum. &lt;a href="https://www.pinecone.io/pricing/estimate/" rel="noopener noreferrer"&gt;Free tiers&lt;/a&gt; are marketed as enough to get started, which makes experimentation feel low-risk.&lt;/p&gt;

&lt;p&gt;What these pages rarely explain is how those line items interact once usage stabilizes. They typically don't model how query costs change as datasets grow, how write activity accumulates over time, or how meaningful parts of the workflow sit entirely outside the database. &lt;a href="https://docs.pinecone.io/guides/manage-cost/understanding-cost" rel="noopener noreferrer"&gt;Pinecone's pricing&lt;/a&gt; examples exclude initial data import, inference for embeddings and reranking, and assistant usage. Weaviate's pricing calculator similarly omits backup costs and data egress fees. &lt;/p&gt;

&lt;p&gt;Qdrant's estimates don't account for reindexing overhead. The same vendors that dominate every comparison list now face questions about the sustainability of their pricing. These disclaimers are present but easy to skim past when you're focused on shipping a proof of concept.&lt;/p&gt;

&lt;p&gt;A predictable pattern repeats itself. Someone runs the calculator and sets a monthly budget. The system goes live. A few weeks later, the bill is two to four times higher than expected. Nothing broke, no traffic spike happened. The database is doing exactly what it was built to do. The pricing page simply didn't describe the total cost of operating it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How usage-based pricing works (and why it gets expensive)
&lt;/h3&gt;

&lt;p&gt;Usage-based pricing reduces risk during experimentation when traffic is unknown. The issue is that vector databases in production are rarely unpredictable. &lt;/p&gt;

&lt;p&gt;Once a system is live, most engineering groups have a reasonable understanding of data size and baseline query volume. What they lack is a reliable way to predict next month's bill, because managed vector databases charge across several dimensions simultaneously: storage, writes, and queries. &lt;/p&gt;

&lt;p&gt;Each cost grows on its own curve, and none maps cleanly to user value. The part that catches development teams off guard is query pricing. In many models, query cost rises as the dataset grows, even when the query itself stays the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three cost drivers you're actually paying for
&lt;/h3&gt;

&lt;p&gt;Managed vector databases bill across three primary dimensions, though the exact rates vary by provider:&lt;/p&gt;

&lt;h4&gt;
  
  
  Storage:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pinecone: $0.30/GB/month&lt;/li&gt;
&lt;li&gt;Weaviate: $0.095/GB/month&lt;/li&gt;
&lt;li&gt;Qdrant: $0.28/GB/month&lt;/li&gt;
&lt;li&gt;Scales linearly as your dataset grows&lt;/li&gt;
&lt;li&gt;More vector dimensions = larger bill&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Operations:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pinecone: Write units ($4/million), Read units ($16/million)&lt;/li&gt;
&lt;li&gt;Weaviate: Per compute unit hour (variable)&lt;/li&gt;
&lt;li&gt;Qdrant: Credit-based system&lt;/li&gt;
&lt;li&gt;Every upsert, update, and query consumes units&lt;/li&gt;
&lt;li&gt;Vector search operations accumulate quickly at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Additional services:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Embedding generation: Pinecone Inference ($0.08/million tokens)&lt;/li&gt;
&lt;li&gt;Weaviate/Qdrant: Require external services (OpenAI, Cohere)&lt;/li&gt;
&lt;li&gt;Reranking, backups, data transfer billed separately&lt;/li&gt;
&lt;li&gt;Adds another vendor relationship and cost stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each cost dimension scales independently, and their interaction creates compounding effects that pricing calculators rarely capture. Understanding why these costs compound requires looking at how vector search actually works, specifically HNSW indexing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn5kzw3b8ci8pj7k87m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn5kzw3b8ci8pj7k87m1.png" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why costs compound as you scale
&lt;/h3&gt;

&lt;p&gt;The cost increases stem directly from how vector search works under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  How HNSW works:
&lt;/h3&gt;

&lt;p&gt;Most production vector databases use approximate nearest neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World) to make searches tractable at scale. &lt;/p&gt;

&lt;p&gt;HNSW constructs a multi-layer graph in which each higher layer holds a progressively sparser subset of the vectors. A search starts coarse at the top layer and refines as it descends, which lets the index organize millions of vectors into an efficient structure.&lt;/p&gt;
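
&lt;p&gt;The core move HNSW repeats on each layer is a greedy walk: from the current node, step to whichever neighbor is closest to the query, and stop when no neighbor is closer. The sketch below shows only that walk on a single hand-built layer. The graph, vectors, and function name are illustrative assumptions; a real HNSW index also builds the layers, maintains neighbor lists on insert, and tracks multiple candidates per step.&lt;/p&gt;

```python
import math


def greedy_search(graph, vectors, entry, query):
    """Walk a proximity graph toward the query; return the node where the walk stops."""
    current = entry
    while True:
        # Listing the current node first means ties never cause a move.
        candidates = [current] + graph[current]
        best = min(candidates, key=lambda node: math.dist(vectors[node], query))
        if best == current:
            return current  # no neighbor is strictly closer
        current = best


# Four points on a line, each linked to its immediate neighbors.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(greedy_search(graph, vectors, entry=0, query=(2.9, 0.0)))  # 3
```

&lt;p&gt;Because the walk only follows local links, it can stop at a node that is close but not the true nearest neighbor, which is why the result is approximate.&lt;/p&gt;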

&lt;h3&gt;
  
  
  The cost impact:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.pinecone.io/guides/manage-cost/understanding-cost" rel="noopener noreferrer"&gt;Pinecone's documentation &lt;/a&gt;indicates that a query consumes 1 RU per 1 GB of namespace size, with a minimum of 0.25 RUs per query. As your dataset grows, so does the graph:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset size&lt;/th&gt;
&lt;th&gt;RU per query&lt;/th&gt;
&lt;th&gt;Cost at $16/M RU&lt;/th&gt;
&lt;th&gt;Same query, different cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 GB&lt;/td&gt;
&lt;td&gt;10 RU&lt;/td&gt;
&lt;td&gt;$0.00016&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 GB&lt;/td&gt;
&lt;td&gt;100 RU&lt;/td&gt;
&lt;td&gt;$0.0016&lt;/td&gt;
&lt;td&gt;10x more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 TB&lt;/td&gt;
&lt;td&gt;1,000 RU&lt;/td&gt;
&lt;td&gt;$0.016&lt;/td&gt;
&lt;td&gt;100x more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
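
&lt;p&gt;The arithmetic behind the table can be reproduced in a few lines. This sketch uses only the figures quoted above (1 RU per GB of namespace, a 0.25 RU minimum, and $16 per million RUs); the function names and the one-million-queries-per-month volume are assumptions for illustration, and actual billing may differ.&lt;/p&gt;

```python
READ_UNIT_PRICE_USD = 16 / 1_000_000  # $16 per million read units, as quoted above
MIN_RU_PER_QUERY = 0.25


def read_units_per_query(namespace_gb):
    """1 RU per GB of namespace size, never below the per-query minimum."""
    return max(MIN_RU_PER_QUERY, float(namespace_gb))


def query_cost_usd(namespace_gb):
    """Cost of a single query at the quoted read-unit price."""
    return read_units_per_query(namespace_gb) * READ_UNIT_PRICE_USD


def monthly_query_cost_usd(namespace_gb, queries_per_month):
    """Query spend for a month at a fixed namespace size."""
    return query_cost_usd(namespace_gb) * queries_per_month


for size_gb in (10, 100, 1000):
    per_query = round(query_cost_usd(size_gb), 6)
    per_month = round(monthly_query_cost_usd(size_gb, 1_000_000), 2)
    print(size_gb, "GB:", per_query, "USD per query,", per_month, "USD per month")
```

&lt;p&gt;At a steady one million queries a month, the same traffic costs about $160 against a 10 GB namespace and about $1,600 against 100 GB.&lt;/p&gt;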

&lt;h3&gt;
  
  
  Result:
&lt;/h3&gt;

&lt;p&gt;Ten times the cost, for the same query, delivering the same result quality.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.metacto.com/blogs/the-true-cost-of-pinecone-a-deep-dive-into-pricing-integration-and-maintenance" rel="noopener noreferrer"&gt;$16 per million&lt;/a&gt; read units, costs scale linearly with data growth, but the functionality delivered to users stays the same. A search query returns the same number of results with the same accuracy whether your index is 10 GB or 100 GB. Your users see no difference, but you pay 10x more. This is the moment growth starts to feel like a penalty. The index holds more vectors as it expands, and under size-based read units you pay for that growth on every query.&lt;/p&gt;

&lt;h3&gt;
  
  
  The free tier that isn't really free
&lt;/h3&gt;

&lt;p&gt;The free tier enables early experimentation but doesn't predict production economics. By the time you hit the limits, switching costs are no longer theoretical. Migration is perceived as expensive, and people accept pricing they would have questioned earlier.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Free tier limits&lt;/th&gt;
&lt;th&gt;Production reality&lt;/th&gt;
&lt;th&gt;Time to exceed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;2 GB, 1M reads, 2M writes (single region)&lt;/td&gt;
&lt;td&gt;60+ GB, 5M+ reads typical&lt;/td&gt;
&lt;td&gt;2-4 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;1M vectors, limited compute&lt;/td&gt;
&lt;td&gt;10M+ vectors standard&lt;/td&gt;
&lt;td&gt;1-3 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;1 GB storage&lt;/td&gt;
&lt;td&gt;60+ GB storage common&lt;/td&gt;
&lt;td&gt;1-2 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The October 2025 pricing shift that changed everything
&lt;/h3&gt;

&lt;p&gt;These structural issues became impossible to ignore when Pinecone made a significant pricing change. By late 2025, pricing changes across major vector database providers made it clear that the pay-as-you-go (PAYG) model did not always hold once systems reached steady production. The most visible signal came in October, when &lt;a href="https://www.metacto.com/blogs/the-true-cost-of-pinecone-a-deep-dive-into-pricing-integration-and-maintenance" rel="noopener noreferrer"&gt;Pinecone implemented&lt;/a&gt; a $50 monthly minimum across paid Standard plans.&lt;/p&gt;

&lt;p&gt;For organizations already spending well above that level, the change barely registered. For smaller but stable workloads, the situation was different. Some groups had intentionally designed their usage to stay under $10 per month. &lt;/p&gt;

&lt;p&gt;These weren't abandoned projects, but internal tools, early production features, and low-volume customer-facing systems that had already stabilized. Usage remained flat, but in some cases the introduction of pricing minimums led to five- to tenfold increases in monthly costs.&lt;/p&gt;

&lt;p&gt;What made the moment important was not the dollar amount. It was the introduction of a fixed floor into a model marketed as consumption-based. Low usage no longer guaranteed low cost. Once that assumption broke, minimums stopped feeling like an edge case and started looking like structural risk.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Previous monthly cost&lt;/th&gt;
&lt;th&gt;New minimum&lt;/th&gt;
&lt;th&gt;Increase&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;$8&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;td&gt;525%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;td&gt;317%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
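
&lt;p&gt;The percentages in the table follow from the standard percent-increase formula. As a quick check, here is a small sketch (the function name is ours) that reproduces them for the three previous bills listed:&lt;/p&gt;

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, expressed as a percentage."""
    return (new - old) / old * 100


NEW_MINIMUM = 50  # USD per month, the minimum described above

for previous in (8, 12, 25):
    increase = percent_increase(previous, NEW_MINIMUM)
    print(f"${previous} to ${NEW_MINIMUM}: {increase:.0f}% increase")
```

&lt;p&gt;The smaller the original bill, the steeper the jump, which is why stable low-volume workloads felt the change most.&lt;/p&gt;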

&lt;h3&gt;
  
  
  The migration it forced
&lt;/h3&gt;

&lt;p&gt;For anyone below the new $50 minimum, migration was rarely planned. It was reactive. Platform owners had to evaluate alternatives, export data, rebuild indexes, and validate query behavior under time pressure. In some cases, the engineering effort required to migrate exceeded the annual savings from switching providers. Many still moved anyway, because the alternative was committing to pricing that no longer matched the workload.&lt;/p&gt;

&lt;p&gt;The impact of the pricing change became visible across developer communities. One &lt;a href="https://maxrohde.com/2025/08/09/pinecone-price-increase-is-chroma-cloud-the-best-alternative/" rel="noopener noreferrer"&gt;developer documented&lt;/a&gt; their migration experience publicly, noting they had managed to keep bills under $10 per month by storing only essential data in the vector database. The September 2025 announcement requiring a $50 monthly minimum regardless of actual usage prompted an immediate search for alternatives.&lt;/p&gt;

&lt;p&gt;The migration calculus proved challenging. Moving to Chroma Cloud became the chosen path, but the process revealed deeper concerns about serverless pricing models. As the developer noted, they were seeking a truly serverless solution in which costs scale linearly with usage, starting at $0. The $50 minimum eliminated that possibility.&lt;/p&gt;

&lt;p&gt;This pattern repeated across Reddit threads and developer forums. A &lt;a href="https://dev.to/mxro/pinecone-price-increase-is-chroma-cloud-the-best-alternative-111h"&gt;discussion thread&lt;/a&gt; titled “Pinecone's new $50/mo minimum just nuked my hobby project” captured the broader sentiment. Teams running stable, low-volume production workloads faced a choice: accept a 400-500% cost increase or invest engineering time in migration.&lt;/p&gt;

&lt;p&gt;The issue wasn't the absolute dollar amount. For many teams, $50 per month remained affordable. The problem was precedent. If a vendor could introduce a minimum that quintupled costs without warning, what prevented future increases? The pricing change transformed vendor selection from a technical decision into a risk management calculation.&lt;/p&gt;

&lt;p&gt;A few patterns showed up repeatedly across these migrations. Pricing predictability started to matter more than managed convenience. Open source and self-hosted options re-entered discussions that had previously defaulted to cloud. Vendor pricing risk became a first-class architectural concern. These migrations were not driven by dissatisfaction with features or performance. They were driven by economics.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it reveals about vendor pricing power
&lt;/h3&gt;

&lt;p&gt;Once a vector database is deployed in production, vendors can adjust pricing in ways that materially affect customers, even if usage remains unchanged. &lt;/p&gt;

&lt;p&gt;Usage-based pricing lowers the barrier to adoption, but it increases switching costs over time as APIs become embedded, data formats solidify, and migrations grow expensive.&lt;/p&gt;

&lt;p&gt;For engineering leadership, the evaluation question shifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: "What does this cost today?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;: "How exposed are we to pricing changes once this is in production?"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-world cost scenarios (what you'll actually pay)
&lt;/h3&gt;

&lt;p&gt;Understanding these dynamics in the abstract is one thing. Seeing how they play out in actual production systems is another. &lt;/p&gt;

&lt;p&gt;To see the full picture, let's examine three common production scenarios and compare costs across major providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Customer support RAG system
&lt;/h3&gt;

&lt;p&gt;Imagine a customer support assistant built on historical tickets, internal documentation, and help articles. At this stage, you might be dealing with about 10 million vectors (typically 768 or 1,536 dimensions each) and around five million queries per month.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Queries&lt;/th&gt;
&lt;th&gt;Writes&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;$18&lt;/td&gt;
&lt;td&gt;$5 (but $50 min applies)&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;$40-60&lt;/td&gt;
&lt;td&gt;$20-30&lt;/td&gt;
&lt;td&gt;$350-500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;$6&lt;/td&gt;
&lt;td&gt;Compute: $40-60&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$40-60&lt;/td&gt;
&lt;td&gt;$15-25&lt;/td&gt;
&lt;td&gt;$300-400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;$17&lt;/td&gt;
&lt;td&gt;Credits: $30-50&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$40-60&lt;/td&gt;
&lt;td&gt;$15-25&lt;/td&gt;
&lt;td&gt;$280-380&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding:&lt;/strong&gt; Even at small scale, actual costs are 3-5x higher than base calculator estimates due to minimums and complex pricing structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: E-commerce recommendation engine
&lt;/h3&gt;

&lt;p&gt;As systems grow, the cost dynamics become more pronounced. With around 100 million vectors and tens of millions of queries per month, costs climb quickly. Product catalogs, user embeddings, and real-time personalization introduce sustained traffic and frequent updates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Queries&lt;/th&gt;
&lt;th&gt;Writes&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;$180&lt;/td&gt;
&lt;td&gt;$192&lt;/td&gt;
&lt;td&gt;$8&lt;/td&gt;
&lt;td&gt;$200-300&lt;/td&gt;
&lt;td&gt;$50-80&lt;/td&gt;
&lt;td&gt;$1,500-2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;$57&lt;/td&gt;
&lt;td&gt;Compute: $800-1,000&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$200-300&lt;/td&gt;
&lt;td&gt;$40-60&lt;/td&gt;
&lt;td&gt;$1,400-2,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;$168&lt;/td&gt;
&lt;td&gt;Credits: $600-900&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$200-300&lt;/td&gt;
&lt;td&gt;$40-60&lt;/td&gt;
&lt;td&gt;$1,300-2,100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding:&lt;/strong&gt; At mid-scale, costs converge across providers. Embedding fees often exceed base database costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: Multi-tenant SaaS platform
&lt;/h3&gt;

&lt;p&gt;The economics shift dramatically at enterprise scale. At 500 million vectors and 100 million queries per month, usage-based pricing becomes a structural part of the budget rather than a variable line item. These large datasets contain high-dimensional vector embeddings across many customers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Queries&lt;/th&gt;
&lt;th&gt;Writes&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;th&gt;Support&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;$921&lt;/td&gt;
&lt;td&gt;$1,200&lt;/td&gt;
&lt;td&gt;$100-150&lt;/td&gt;
&lt;td&gt;$500-700&lt;/td&gt;
&lt;td&gt;$300-500&lt;/td&gt;
&lt;td&gt;$2,500-4,000+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;$292&lt;/td&gt;
&lt;td&gt;Compute: $2,000-3,000&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$500-800&lt;/td&gt;
&lt;td&gt;$200-400&lt;/td&gt;
&lt;td&gt;$3,000-4,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;$860&lt;/td&gt;
&lt;td&gt;Credits: $1,500-2,200&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;$500-800&lt;/td&gt;
&lt;td&gt;$200-400&lt;/td&gt;
&lt;td&gt;$2,900-4,200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding:&lt;/strong&gt; At enterprise scale, annual costs reach $30,000-$54,000. This is where self-hosting economics become compelling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Side-by-side provider comparison
&lt;/h3&gt;

&lt;p&gt;To make the economics clearer, here's how the major vector database providers stack up across the dimensions that matter most for production deployments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pinecone&lt;/th&gt;
&lt;th&gt;Weaviate&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;PostgreSQL + pgvector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing model&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;Self-hosted (fixed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly minimum&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage cost&lt;/td&gt;
&lt;td&gt;$0.30/GB&lt;/td&gt;
&lt;td&gt;$0.095/GB&lt;/td&gt;
&lt;td&gt;$0.28/GB&lt;/td&gt;
&lt;td&gt;Hardware cost only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pricing&lt;/td&gt;
&lt;td&gt;Scales with data&lt;/td&gt;
&lt;td&gt;Compute-based&lt;/td&gt;
&lt;td&gt;Credit-based&lt;/td&gt;
&lt;td&gt;Free within capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Additional costs&lt;/td&gt;
&lt;td&gt;Many&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Some&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost predictability&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scenario 1 cost&lt;/td&gt;
&lt;td&gt;$350-500&lt;/td&gt;
&lt;td&gt;$300-400&lt;/td&gt;
&lt;td&gt;$280-380&lt;/td&gt;
&lt;td&gt;~$200-300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scenario 2 cost&lt;/td&gt;
&lt;td&gt;$1,500-2,500&lt;/td&gt;
&lt;td&gt;$1,400-2,200&lt;/td&gt;
&lt;td&gt;$1,300-2,100&lt;/td&gt;
&lt;td&gt;~$800-1,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scenario 3 cost&lt;/td&gt;
&lt;td&gt;$2,500-4,000+&lt;/td&gt;
&lt;td&gt;$3,000-4,500&lt;/td&gt;
&lt;td&gt;$2,900-4,200&lt;/td&gt;
&lt;td&gt;~$1,500-2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Fast prototyping&lt;/td&gt;
&lt;td&gt;Hybrid search&lt;/td&gt;
&lt;td&gt;K8s-native teams&lt;/td&gt;
&lt;td&gt;Stable, high-volume&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The hidden fees that aren't in the calculator
&lt;/h3&gt;

&lt;p&gt;These scenarios reveal a consistent pattern: the advertised pricing rarely captures the full cost. Production vector search systems incur costs that are rarely modeled comprehensively by calculators. Understanding these hidden costs is crucial for accurate budgeting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding and inference fees
&lt;/h3&gt;

&lt;p&gt;Pinecone Inference &lt;a href="https://docs.pinecone.io/guides/manage-cost/understanding-cost" rel="noopener noreferrer"&gt;charges&lt;/a&gt; $0.08 per million tokens for generating vector embeddings. Weaviate and Qdrant don't provide native embedding services, requiring you to use external providers like OpenAI (starting at $0.10 per million tokens) or Cohere.&lt;/p&gt;

&lt;p&gt;Converting documents to vectors costs extra beyond database operations across all platforms. Reranking adds additional per-request fees. Cohere-rerank-v3.5 has &lt;a href="https://www.metacto.com/blogs/the-true-cost-of-pinecone-a-deep-dive-into-pricing-integration-and-maintenance" rel="noopener noreferrer"&gt;no free requests&lt;/a&gt; on any tier, meaning every reranking operation is billed.&lt;/p&gt;

&lt;p&gt;These embedding and inference costs can match or exceed the database bill itself, depending on data churn and query patterns. Every time you generate new vector embeddings or update existing ones, you're paying separately from your core vector storage costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reindexing costs (the silent killer)
&lt;/h3&gt;

&lt;p&gt;The cost impact becomes especially severe when you need to change your approach. When you change embedding models, you must re-vectorize all data. For a 100-million-vector dataset, this could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding costs: $8,000-$15,000 one-time&lt;/li&gt;
&lt;li&gt;Increased write units during migration&lt;/li&gt;
&lt;li&gt;Processing time and compute overhead&lt;/li&gt;
&lt;/ul&gt;
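
&lt;p&gt;To see where a figure like this comes from, here is a rough estimator. It is a sketch under stated assumptions: the $0.08 per million tokens is the Pinecone Inference price cited earlier, while the average of 1,000 tokens per vector is a hypothetical value you should replace with your own chunk size. The function name is ours.&lt;/p&gt;

```python
def reembedding_cost(vector_count: int, avg_tokens_per_vector: int,
                     price_per_million_tokens: float) -> float:
    """One-time embedding cost of re-vectorizing an entire dataset."""
    total_tokens = vector_count * avg_tokens_per_vector
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumption: 1,000 tokens of source text behind each vector.
cost = reembedding_cost(
    vector_count=100_000_000,
    avg_tokens_per_vector=1_000,
    price_per_million_tokens=0.08,  # USD, price cited in the article
)
print(f"Estimated re-embedding cost: ${cost:,.0f}")
```

&lt;p&gt;Under these assumptions the estimate lands at the low end of the range above. Longer chunks or a pricier model push it up proportionally, and the figure excludes write units and compute.&lt;/p&gt;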

&lt;p&gt;Experimentation with models becomes prohibitively expensive, creating lock-in to initial embedding choices. The cost of generating vector embeddings at scale makes it risky to improve your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  The support tax
&lt;/h3&gt;

&lt;p&gt;Support tiers add meaningful costs across all managed providers. &lt;a href="https://www.metacto.com/blogs/the-true-cost-of-pinecone-a-deep-dive-into-pricing-integration-and-maintenance" rel="noopener noreferrer"&gt;Pinecone's support tiers&lt;/a&gt; run from free community forums to $499/month for 24/7 coverage. Weaviate charges $500/month for its Professional support tier. Qdrant's enterprise support starts at similar levels.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pinecone&lt;/th&gt;
&lt;th&gt;Weaviate&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Community only&lt;/td&gt;
&lt;td&gt;Community only&lt;/td&gt;
&lt;td&gt;Community only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer&lt;/td&gt;
&lt;td&gt;$29/month&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro/Enterprise&lt;/td&gt;
&lt;td&gt;$499/month&lt;/td&gt;
&lt;td&gt;$500/month&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Geographic distribution costs
&lt;/h3&gt;

&lt;p&gt;Multi-region deployment for latency optimization adds data transfer costs and regional infrastructure overhead, and it can increase base costs by 30-50% depending on configuration. Running vector search across multiple cloud provider regions compounds these expenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  When self-hosting becomes 75% cheaper
&lt;/h3&gt;

&lt;p&gt;Given these hidden costs and pricing volatility, many teams eventually reach a crossroads. There is a point where vector database pricing stops being a convenience question and becomes an economic one. That point usually arrives earlier than many people expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigerdata.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost" rel="noopener noreferrer"&gt;Timescale benchmarks&lt;/a&gt; show that PostgreSQL + pgvector is 75% cheaper than Pinecone, while also delivering 28x &lt;a href="https://medium.com/timescale/postgresql-and-pgvector-now-faster-than-pinecone-75-cheaper-and-100-open-source-0b0c2cca00c0" rel="noopener noreferrer"&gt;lower&lt;/a&gt; P95 latency than Pinecone's storage-optimized tier. The tipping point at which self-hosting becomes materially cheaper typically occurs between 60 and 100 million queries per month.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cost crossover point
&lt;/h3&gt;

&lt;p&gt;Breaking down the economics by scale reveals clear patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Below 10M queries/month:&lt;/strong&gt; Cloud is usually simpler. The operational overhead of self-hosting (DevOps time, monitoring, maintenance) outweighs potential savings. Managed services make sense here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10M-60M queries/month:&lt;/strong&gt; Economics converge. Self-hosting costs stabilize, whereas cloud costs continue to rise with usage. This is where many teams begin to seriously evaluate alternatives. The gap narrows to the point at which the decision depends more on team capabilities than on pure economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;60M-100M+ queries/month:&lt;/strong&gt; Self-hosting becomes 50-75% cheaper. &lt;a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;Self-hosted&lt;/a&gt; PostgreSQL costs approximately $835 per month on AWS EC2, compared to Pinecone's &lt;a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;$3,241&lt;/a&gt; per month for the storage-optimized index at a comparable scale. At this volume, the math becomes hard to ignore.&lt;/p&gt;
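
&lt;p&gt;The three bands and the headline saving can be written down in a few lines. This is a sketch only: the thresholds come straight from the bands above, the two monthly figures are the ones cited from the Timescale comparison, and the function name and return strings are ours.&lt;/p&gt;

```python
SELF_HOSTED_MONTHLY = 835   # USD, PostgreSQL on AWS EC2 (figure cited above)
PINECONE_MONTHLY = 3_241    # USD, storage-optimized index (figure cited above)


def volume_band(queries_per_month: int) -> str:
    """Map monthly query volume to the band described in the article."""
    if queries_per_month >= 60_000_000:
        return "self-hosting is often 50-75% cheaper"
    if queries_per_month >= 10_000_000:
        return "economics converge; team capability decides"
    return "managed cloud is usually simpler"


savings = 1 - SELF_HOSTED_MONTHLY / PINECONE_MONTHLY
print(f"Saving at comparable scale: {savings:.0%}")
print(volume_band(5_000_000))
print(volume_band(40_000_000))
print(volume_band(80_000_000))
```

&lt;p&gt;The computed saving is about 74%, which is where the rounded "75% cheaper" headline comes from.&lt;/p&gt;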

&lt;h3&gt;
  
  
  What self-hosting actually costs
&lt;/h3&gt;

&lt;p&gt;Breaking down the real economics reveals why this shift happens. Running your own vector search infrastructure on dedicated hardware involves several cost components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server: $400-$800/month (OpenMetal dedicated hardware or equivalent AWS EC2 instance optimized for vector workloads)&lt;/li&gt;
&lt;li&gt;Setup: About 40 hours initial effort ($4,000-$8,000 one-time at typical engineering rates)&lt;/li&gt;
&lt;li&gt;Ongoing maintenance: 10-15 hours/month (roughly $1,500-$2,250/month in engineering time)&lt;/li&gt;
&lt;li&gt;Monitoring stack: $50-$200/month (Prometheus, Grafana, alerting)&lt;/li&gt;
&lt;li&gt;Backup storage: $100-$300/month (S3 or equivalent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total (recurring):&lt;/strong&gt; About $2,050-$3,550/month, versus $5,000-$10,000+ per month for Pinecone at enterprise scale&lt;br&gt;
&lt;strong&gt;Net savings:&lt;/strong&gt; $2,950-$6,450/month, or roughly $35,000-$77,000/year, before the one-time setup cost&lt;/p&gt;
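
&lt;p&gt;The totals are simple sums of the recurring line items. The sketch below reproduces them so you can substitute your own numbers. It pairs the low end of each range with the low end of the managed estimate (and high with high), which is how the savings range above is derived, and it leaves out the one-time setup cost. The variable names are ours.&lt;/p&gt;

```python
# Recurring monthly cost ranges (low, high) in USD, from the list above.
MONTHLY_COSTS = {
    "server": (400, 800),
    "maintenance": (1_500, 2_250),
    "monitoring": (50, 200),
    "backup": (100, 300),
}
MANAGED_RANGE = (5_000, 10_000)  # USD per month, managed estimate at enterprise scale

self_hosted_low = sum(low for low, _ in MONTHLY_COSTS.values())
self_hosted_high = sum(high for _, high in MONTHLY_COSTS.values())

# Low paired with low, high paired with high.
savings_low = MANAGED_RANGE[0] - self_hosted_low
savings_high = MANAGED_RANGE[1] - self_hosted_high

print(f"Self-hosted: ${self_hosted_low:,}-${self_hosted_high:,}/month")
print(f"Net savings: ${savings_low:,}-${savings_high:,}/month")
print(f"Annualized:  ${savings_low * 12:,}-${savings_high * 12:,}/year")
```

&lt;p&gt;Maintenance time is the largest recurring item by a wide margin, so the estimate is most sensitive to how many engineering hours you actually spend each month.&lt;/p&gt;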

&lt;p&gt;The math gets more compelling as you scale. With large datasets containing hundreds of millions of vectors, the gap widens substantially.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance advantages beyond cost
&lt;/h3&gt;

&lt;p&gt;The economic case is strong, but performance matters too. &lt;a href="https://medium.com/timescale/postgresql-and-pgvector-now-faster-than-pinecone-75-cheaper-and-100-open-source-0b0c2cca00c0" rel="noopener noreferrer"&gt;Timescale benchmarks&lt;/a&gt; demonstrate that PostgreSQL with pgvector achieves a P95 latency 28x lower than Pinecone's storage-optimized tier: 63ms &lt;a href="https://medium.com/timescale/postgresql-and-pgvector-now-faster-than-pinecone-75-cheaper-and-100-open-source-0b0c2cca00c0" rel="noopener noreferrer"&gt;versus&lt;/a&gt; 1,763ms. Additionally, PostgreSQL &lt;a href="https://www.tigerdata.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;achieves&lt;/a&gt; 16x higher query throughput at 99% recall.&lt;/p&gt;

&lt;p&gt;Beyond performance, self-hosting provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control: Tune for your specific workload and vector dimensions&lt;/li&gt;
&lt;li&gt;No throttling or rate limits&lt;/li&gt;
&lt;li&gt;Data sovereignty and compliance benefits&lt;/li&gt;
&lt;li&gt;Predictable scaling where costs are tied to capacity, not usage&lt;/li&gt;
&lt;li&gt;Hybrid search flexibility to combine vector search with traditional queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The hidden cost of free and serverless
&lt;/h3&gt;

&lt;p&gt;Free tiers and serverless pricing are designed to feel safe. They lower friction, reduce upfront commitment, and make it easy to start building. In practice, they often delay cost visibility rather than eliminate it.&lt;/p&gt;

&lt;p&gt;Serverless does not mean infrastructure is free. It means infrastructure is abstracted and billed indirectly through usage. For steady workloads, that abstraction usually comes at a premium. Every query, every stored vector, every embedding refresh, and every background operation is metered. Over time, convenience replaces predictability.&lt;/p&gt;

&lt;p&gt;Free tiers follow a similar pattern. They are useful for experimentation, but they are not representative of production economics. By the time limits are reached, integration work is already done, APIs are embedded, and migration feels expensive. At that point, teams tend to accept pricing they would have challenged earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  A practical way to choose
&lt;/h3&gt;

&lt;p&gt;Once pricing volatility appears, the question is no longer which database is cheapest today. It becomes which pricing model still works once the system stabilizes.&lt;/p&gt;

&lt;p&gt;Three factors matter most:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; How many vectors you store, how many queries you run per month, and how quickly those numbers grow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictability:&lt;/strong&gt; Whether usage is bursty and uncertain, or steady and forecastable over the next six to twelve months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control:&lt;/strong&gt; How much operational responsibility your team can realistically take on, and how sensitive the business is to budget variance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Early on, managed cloud services usually make sense. They optimize for speed, experimentation, and unknown demand. As workloads stabilize and query volumes climb into the tens of millions per month, usage-based pricing begins to lose its advantage. Costs rise faster than value, and forecasting becomes harder, not easier.&lt;/p&gt;

&lt;p&gt;Beyond roughly 60–100 million queries per month, many teams reach a crossover point. At that scale, self-hosted or on-premises deployments are often materially cheaper and far more predictable, even after accounting for infrastructure and operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  When each option fits
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cloud-managed services work best when:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Traffic is unpredictable or highly bursty.&lt;/li&gt;
&lt;li&gt;Speed of iteration matters more than long-term cost.&lt;/li&gt;
&lt;li&gt;DevOps capacity is limited.&lt;/li&gt;
&lt;li&gt;Workloads are still exploratory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Self-hosted or on-premises deployments make sense when:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Query volume is high and stable.&lt;/li&gt;
&lt;li&gt;Cost predictability is a business requirement.&lt;/li&gt;
&lt;li&gt;Budgets must be defended in advance.&lt;/li&gt;
&lt;li&gt;Compliance or data residency matters.&lt;/li&gt;
&lt;li&gt;Performance targets are tight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends on matching your pricing model to your actual production behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision triggers that help
&lt;/h3&gt;

&lt;p&gt;Instead of debating architecture continuously, many teams define clear triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If monthly vector database spend exceeds $1,500, re-evaluate deployment options.&lt;/li&gt;
&lt;li&gt;If query volume exceeds 50 million per month, model total cost of ownership for owned infrastructure.&lt;/li&gt;
&lt;li&gt;If pricing changes exceed 20%, reassess vendor risk.&lt;/li&gt;
&lt;li&gt;If latency targets are consistently missed, evaluate alternatives.&lt;/li&gt;
&lt;/ul&gt;
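
&lt;p&gt;Triggers like these are easy to encode in a monthly review script. The following is a minimal sketch, assuming you already collect the four inputs somewhere; the &lt;code&gt;VectorDbMetrics&lt;/code&gt; class, its field names, and the function name are hypothetical, and the thresholds are the ones listed above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class VectorDbMetrics:
    monthly_spend_usd: float
    queries_per_month: int
    price_change_pct: float       # most recent vendor price change, in percent
    latency_target_missed: bool   # True if targets are consistently missed


def fired_triggers(m: VectorDbMetrics) -> list[str]:
    """Return the follow-up actions whose thresholds have been crossed."""
    actions = []
    if m.monthly_spend_usd > 1_500:
        actions.append("re-evaluate deployment options")
    if m.queries_per_month > 50_000_000:
        actions.append("model total cost of ownership for owned infrastructure")
    if m.price_change_pct > 20:
        actions.append("reassess vendor risk")
    if m.latency_target_missed:
        actions.append("evaluate alternatives")
    return actions


current = VectorDbMetrics(
    monthly_spend_usd=1_800,
    queries_per_month=20_000_000,
    price_change_pct=5.0,
    latency_target_missed=False,
)
for action in fired_triggers(current):
    print("Trigger fired:", action)
```

&lt;p&gt;Running a check like this on a schedule keeps the review tied to agreed thresholds rather than to the arrival of an unexpected invoice.&lt;/p&gt;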

&lt;p&gt;These triggers turn pricing from a surprise into a planned decision point.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bottom line
&lt;/h3&gt;

&lt;p&gt;Vector database pricing looks simple at the start. Free tiers, low minimums, and usage-based billing suggest you only pay for what you use. In production, the economics change. Costs compound across storage, queries, embeddings, and background operations. &lt;/p&gt;

&lt;p&gt;The same query gets more expensive as datasets grow, even when it delivers the same value. Predictability disappears at the stage where predictability matters most. For sustained workloads, there is a clear tipping point where ownership becomes cheaper and easier to justify. Teams that avoid bill shock are not the ones who negotiated better discounts; they are the ones who treated pricing as an architectural decision early.&lt;/p&gt;

&lt;p&gt;For organizations that value fixed budgets, predictable spend, and long-term control, this is why on-premises vector databases are re-entering serious architectural discussions. &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian’s on-premises&lt;/a&gt; vector database, designed around transparent licensing rather than usage-based volatility, reflects that shift.&lt;/p&gt;

&lt;p&gt;Do the cost math before you need to migrate. &lt;a href="https://www.actian.com/databases/vectorai-db/#waitlist" rel="noopener noreferrer"&gt;It is always cheaper that way.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>infrastructure</category>
      <category>saas</category>
    </item>
    <item>
      <title>5 Best Python Vector Database Libraries</title>
      <dc:creator>Tiioluwani</dc:creator>
      <pubDate>Mon, 18 May 2026 12:42:35 +0000</pubDate>
      <link>https://dev.to/actiandev/5-best-python-vector-database-libraries-2235</link>
      <guid>https://dev.to/actiandev/5-best-python-vector-database-libraries-2235</guid>
      <description>&lt;p&gt;Most comparisons of Python vector database libraries focus on retrieval speed, indexing algorithms, or benchmark results. These metrics matter, but production failures stem from various factors: installation inconsistencies, client packaging differences, version churn, and unexpected API changes. In reality, a different class of problems appears once the application leaves the notebook environment and runs inside a production service.&lt;/p&gt;

&lt;p&gt;A typical example occurs with embedded ChromaDB setups. A project may work perfectly during development, only to fail in production with an error such as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RuntimeError: Chroma running in http-only client mode&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A structural conflict between the &lt;code&gt;chromadb&lt;/code&gt; and &lt;code&gt;chromadb-client&lt;/code&gt; packages triggers this error because the client-only package lacks the default embedding functions the application depends on. Diagnosing this can take hours.&lt;/p&gt;
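
&lt;p&gt;One way to shorten that diagnosis is to check which of the two distributions is actually installed in the running environment. The sketch below uses only the standard library's &lt;code&gt;importlib.metadata&lt;/code&gt;; the helper names and the wording of the messages are ours, and it reports installation state only, without inspecting Chroma's configuration.&lt;/p&gt;

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def chroma_install_report() -> str:
    """Describe which Chroma distributions are present in this environment."""
    full = installed_version("chromadb")
    thin = installed_version("chromadb-client")
    if full and thin:
        return f"Both chromadb ({full}) and chromadb-client ({thin}) are installed."
    if thin:
        return f"Only chromadb-client ({thin}) is installed (HTTP client only)."
    if full:
        return f"Only chromadb ({full}) is installed."
    return "Neither chromadb nor chromadb-client is installed."


print(chroma_install_report())
```

&lt;p&gt;Running this at service startup, or in CI against the production image, surfaces a packaging mismatch before the first query does.&lt;/p&gt;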

&lt;p&gt;Client packaging choices and library design decisions, not retrieval quality or indexing performance, produce this type of failure.&lt;/p&gt;

&lt;p&gt;This article compares the leading Python vector database libraries from that perspective, examining &lt;a href="https://www.actian.com/blog/databases/5-edge-ai-architecture-patterns-for-disconnected-environments/" rel="noopener noreferrer"&gt;client architecture&lt;/a&gt;, installation stability, API design, and long-term maintainability, rather than benchmark numbers alone.&lt;/p&gt;

&lt;p&gt;TL;DR&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB:&lt;/strong&gt; Fastest setup for prototyping and notebook environments with minimal configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone:&lt;/strong&gt; Fully managed cloud solution with zero infrastructure management overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant:&lt;/strong&gt; Zero code changes from local development to production; the strongest open-source option for API stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate:&lt;/strong&gt; Hybrid search combining vector similarity and keyword filtering at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actian VectorAI DB:&lt;/strong&gt; On-premises deployment with the same architecture from laptop to production; designed for edge and air-gapped environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Python Landscape: Understanding the Options
&lt;/h2&gt;

&lt;p&gt;The relationship between a Python vector database library and its storage backend determines how you will develop, test, and eventually scale your application. The wrong library choice often triggers the environment-specific failures described above, since each architecture handles local and production environments differently.&lt;/p&gt;

&lt;p&gt;These differences generally fall into four distinct categories, each with its own approach to the interaction between infrastructure and code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four client architectures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud-only (e.g., Pinecone):&lt;/strong&gt; These clients act as a full API abstraction for serverless environments. The primary advantage is zero infrastructure management, but this requires an active internet connection and an API key for all local development and testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OSS with managed option (e.g., Qdrant, Weaviate, Milvus):&lt;/strong&gt; This set of tools uses the same API for both self-hosted Docker instances and managed cloud services. This provides excellent development-production parity, though it often requires managing a local server or a Docker container during development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embedded libraries (e.g., ChromaDB, FAISS):&lt;/strong&gt; These tools run in-process and embed the database logic in your Python application. While they are ideal for notebooks and rapid prototyping, their developers never designed them for distributed production environments, and they do not offer a well-defined migration path as the application scales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extension approach (e.g., pgvector via Timescale vector):&lt;/strong&gt; This model adds vector search capabilities to traditional relational databases. It allows existing PostgreSQL infrastructure to support vector similarity search. However, query performance varies with index configuration, dataset size, and workload characteristics; some scenarios benefit from the relational foundation, while others favor purpose-built vector architectures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
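
&lt;p&gt;For the OSS-with-managed-option category, development-production parity usually comes down to keeping connection details out of the code. The following is a library-agnostic sketch, not any vendor's API: the &lt;code&gt;VectorEndpoint&lt;/code&gt; class, the environment variable names, and the local default URL are all assumptions made for illustration.&lt;/p&gt;

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VectorEndpoint:
    url: str
    api_key: Optional[str] = None


def endpoint_from_env(env: Mapping[str, str] = os.environ) -> VectorEndpoint:
    """Resolve the endpoint from the environment, defaulting to a local server."""
    return VectorEndpoint(
        # Assumed default: a local Docker container listening on port 6333.
        url=env.get("VECTOR_DB_URL", "http://localhost:6333"),
        # No API key is expected for a local instance.
        api_key=env.get("VECTOR_DB_API_KEY"),
    )


endpoint = endpoint_from_env()
print(f"Connecting to {endpoint.url} (API key set: {endpoint.api_key is not None})")
```

&lt;p&gt;With this pattern, the application code that builds the client stays identical across environments; only the two environment variables change between a local container and a managed cluster.&lt;/p&gt;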

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2hef7yk38l2n2ahayfs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2hef7yk38l2n2ahayfs.jpeg" alt="Figure 1: Client architecture" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These four models describe how a client connects to storage, but they also show a practical separation between standalone search libraries and managed database systems. Choosing the wrong model generates some of the most persistent production friction in vector search applications.&lt;/p&gt;

&lt;p&gt;A vector database provides the infrastructure required for production readiness, going beyond what standalone libraries offer. Libraries such as &lt;a href="https://docs.langchain.com/oss/python/integrations/vectorstores/faiss" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; or &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/annoy" rel="noopener noreferrer"&gt;Annoy&lt;/a&gt; are static, in-memory tools focused on approximate nearest neighbor search across large datasets. They are highly efficient for similarity search within a fixed vector space, but they leave managing data over time, such as persistence, updates, and metadata, largely to the application.&lt;/p&gt;

&lt;p&gt;Specialized databases like Pinecone, Qdrant, or Milvus go further, providing full CRUD support, metadata-based filtering, and distributed persistence for large datasets.&lt;/p&gt;

&lt;p&gt;The table below summarizes where each architecture fits across common use cases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Primary trade-off&lt;/th&gt;
&lt;th&gt;Production migration path&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-only&lt;/td&gt;
&lt;td&gt;No infrastructure management; requires network connectivity and API authentication for all environments&lt;/td&gt;
&lt;td&gt;Same client code across development and production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSS + managed&lt;/td&gt;
&lt;td&gt;Identical API for local and cloud deployments; requires Docker or server setup for local development&lt;/td&gt;
&lt;td&gt;Zero code changes between local Docker instance and managed cloud service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedded&lt;/td&gt;
&lt;td&gt;In-process execution with minimal setup; limited to single-machine architecture&lt;/td&gt;
&lt;td&gt;Client class replacement required; distributed deployment needs architecture redesign&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extension&lt;/td&gt;
&lt;td&gt;Integrates with existing PostgreSQL infrastructure; performance depends on index configuration and dataset characteristics&lt;/td&gt;
&lt;td&gt;Current PostgreSQL setup and scale requirements determine the migration path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Client Comparison: Developer Experience Deep Dive
&lt;/h2&gt;

&lt;p&gt;Architecture does narrow your options, but the day-to-day experience of working with a Python vector database library comes down to how each client handles connection setup, version stability, and the friction points encountered during active development.&lt;/p&gt;

&lt;p&gt;The four clients below are compared on what developers actually deal with in real-world use.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pinecone Python client
&lt;/h3&gt;

&lt;p&gt;Pinecone offers one of the more polished connection experiences among cloud-only vector database clients, with extensive type hints and a straightforward initialization pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pinecone&lt;/span&gt;

    &lt;span class="n"&gt;pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pinecone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-index-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extensive type hints and IDE autocompletion support.&lt;/li&gt;
&lt;li&gt;AsyncIO support arrived in v6 via &lt;code&gt;PineconeAsyncio&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;gRPC mode offers higher throughput for demanding workloads.&lt;/li&gt;
&lt;li&gt;Well-maintained official documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pain points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pinecone shipped three major versions in 18 months (v5, v6, and v7), introducing breaking changes to the connection logic and renaming the package from &lt;code&gt;pinecone-client&lt;/code&gt; to &lt;code&gt;pinecone&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Historical confusion between the &lt;code&gt;pinecone&lt;/code&gt; and &lt;code&gt;pinecone-client&lt;/code&gt; packages.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;query_namespaces&lt;/code&gt; async operations under load require thread pool tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose Pinecone:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully managed infrastructure is a hard requirement.&lt;/li&gt;
&lt;li&gt;The team has no appetite for self-hosted database management.&lt;/li&gt;
&lt;li&gt;The budget allows for cloud-only pricing at the target scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to avoid Pinecone:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local development needs to work offline or without a live API key.&lt;/li&gt;
&lt;li&gt;API stability across versions is a priority.&lt;/li&gt;
&lt;li&gt;Cost at scale above 100M vectors is a constraint.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Weaviate Python client
&lt;/h3&gt;

&lt;p&gt;Weaviate's v4 client is a meaningful step forward from v3, adding typed classes and gRPC support that noticeably improve query performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect_to_local&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-collection-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gRPC mode delivers &lt;a href="https://weaviate.io/blog/grpc-performance-improvements" rel="noopener noreferrer"&gt;40-70%&lt;/a&gt; faster query performance than v3.&lt;/li&gt;
&lt;li&gt;Typed &lt;code&gt;Property&lt;/code&gt; and &lt;code&gt;DataType&lt;/code&gt; classes replace untyped v3 dictionaries.&lt;/li&gt;
&lt;li&gt;Built-in hybrid search combining vector and keyword search.&lt;/li&gt;
&lt;li&gt;Strong support for multi-tenancy workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pain points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weaviate has fully deprecated the &lt;a href="https://weaviate.io/blog/python-v3-client-deprecation" rel="noopener noreferrer"&gt;v3 API&lt;/a&gt;, and teams report that the migration takes weeks of work.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.weaviate.io/weaviate/api/grpc" rel="noopener noreferrer"&gt;gRPC requires port&lt;/a&gt; &lt;code&gt;50051&lt;/code&gt; to be open, which creates friction in restricted network environments.&lt;/li&gt;
&lt;li&gt;Batch API redesign caused significant confusion (&lt;a href="https://github.com/weaviate/weaviate-python-client/issues/433" rel="noopener noreferrer"&gt;Issue #433&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;LangChain did not ship v4 support until several months after Weaviate's release (&lt;a href="https://github.com/langchain-ai/langchain/issues/14531" rel="noopener noreferrer"&gt;Issue #14531&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose Weaviate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid search combining vector similarity and keyword filtering is a core requirement.&lt;/li&gt;
&lt;li&gt;gRPC performance gains justify the network configuration overhead.&lt;/li&gt;
&lt;li&gt;The team has the capacity to manage the v3-to-v4 migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to avoid Weaviate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migration resources are limited, and API stability is a priority.&lt;/li&gt;
&lt;li&gt;Network environments restrict non-standard port access.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. ChromaDB Python client
&lt;/h3&gt;

&lt;p&gt;ChromaDB offers one of the easiest onboarding experiences among Python vector database libraries, making it a natural starting point for notebooks and early-stage prototyping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;

    &lt;span class="c1"&gt;# In-memory mode
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Persistent mode
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/your/path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# HTTP client mode
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplest API surface of any client in this comparison.&lt;/li&gt;
&lt;li&gt;Mature LangChain integration with well-documented examples.&lt;/li&gt;
&lt;li&gt;In-memory mode requires zero configuration for notebook environments.&lt;/li&gt;
&lt;li&gt;Large and active open-source community.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pain points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.13 incompatibility (&lt;a href="https://github.com/chroma-core/chroma/issues/3651" rel="noopener noreferrer"&gt;Issue #3651&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Windows instability above 99 records (&lt;a href="https://github.com/chroma-core/chroma/issues/3058" rel="noopener noreferrer"&gt;Issue #3058&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Confusing &lt;code&gt;chromadb&lt;/code&gt; with &lt;code&gt;chromadb-client&lt;/code&gt; breaks production deployments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nmslib/hnswlib" rel="noopener noreferrer"&gt;hnswlib&lt;/a&gt; compilation errors on &lt;a href="https://github.com/chroma-core/chroma/issues/314" rel="noopener noreferrer"&gt;ARM Mac processors&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Requires &lt;a href="https://docs.trychroma.com/docs/overview/troubleshooting#sqlite" rel="noopener noreferrer"&gt;SQLite 3.35 or higher&lt;/a&gt;, creating an environment-specific setup overhead.&lt;/li&gt;
&lt;/ul&gt;
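
&lt;p&gt;The SQLite requirement in the last bullet can be checked before installing anything. The following is a minimal sketch that uses only the Python standard library; the helper name &lt;code&gt;sqlite_meets_requirement&lt;/code&gt; is an illustrative assumption, and the 3.35 threshold is taken from the list above rather than read from ChromaDB itself.&lt;/p&gt;

```python
import sqlite3

# Minimum SQLite version, taken from the pain-point list above.
REQUIRED_SQLITE = (3, 35, 0)


def sqlite_meets_requirement(found=None, required=REQUIRED_SQLITE):
    """Return True when the SQLite library linked into Python is new enough.

    `found` defaults to the version of the SQLite library that the running
    interpreter was built against, which is what an embedded client would use.
    """
    if found is None:
        found = sqlite3.sqlite_version_info
    return tuple(found) >= tuple(required)


if __name__ == "__main__":
    print("SQLite", sqlite3.sqlite_version, "ok:", sqlite_meets_requirement())
```

&lt;p&gt;Running this inside the target container or CI image, not only on a developer laptop, is what surfaces the environment-specific mismatch.&lt;/p&gt;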

&lt;p&gt;&lt;strong&gt;When to choose ChromaDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rapid prototyping in a notebook environment is the primary use case.&lt;/li&gt;
&lt;li&gt;The dataset fits comfortably within a single process.&lt;/li&gt;
&lt;li&gt;LangChain integration with minimal configuration is a priority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to avoid ChromaDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production deployment on Windows or Python 3.13 is required.&lt;/li&gt;
&lt;li&gt;The application needs to scale beyond a single machine.&lt;/li&gt;
&lt;li&gt;A clear and well-defined migration path to a distributed infrastructure is important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Qdrant Python client
&lt;/h3&gt;

&lt;p&gt;Developers value Qdrant for its local-to-production parity. The same client code runs against an in-memory instance during development and a fully managed cloud deployment in production, without requiring any modifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;

    &lt;span class="c1"&gt;# In-memory mode for local development
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Production mode - zero code changes required
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-cluster-url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;:memory:&lt;/code&gt; mode enables a zero-code-change local-to-production workflow.&lt;/li&gt;
&lt;li&gt;Qdrant introduced a native &lt;a href="https://python-client.qdrant.tech/qdrant_client.async_qdrant_client" rel="noopener noreferrer"&gt;AsyncQdrantClient&lt;/a&gt; for high-concurrency workloads.&lt;/li&gt;
&lt;li&gt;Pydantic model type safety throughout the client interface.&lt;/li&gt;
&lt;li&gt;Rust-backed implementation with lower memory overhead compared to JVM-based alternatives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pain points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers must explicitly set &lt;code&gt;prefer_grpc=True&lt;/code&gt; to enable gRPC, a step they often overlook.&lt;/li&gt;
&lt;li&gt;Port split between REST (6333) and gRPC (6334) requires careful network configuration.&lt;/li&gt;
&lt;li&gt;Pydantic version constraints: v1.10.x or v2.2.1 and above only.&lt;/li&gt;
&lt;li&gt;Cloud connection issues (&lt;a href="https://github.com/qdrant/qdrant-client/issues/112" rel="noopener noreferrer"&gt;Issue #112&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to choose Qdrant:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local-to-production parity is a priority, and zero code changes between environments matter.&lt;/li&gt;
&lt;li&gt;High-concurrency async workloads require native &lt;code&gt;AsyncQdrantClient&lt;/code&gt; support.&lt;/li&gt;
&lt;li&gt;You prefer a self-hosted, open-source vector database deployment over a managed cloud service.&lt;/li&gt;
&lt;li&gt;Hybrid search that combines dense and sparse vectors is a core requirement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to avoid Qdrant:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The team has no Docker experience and needs a simpler local setup.&lt;/li&gt;
&lt;li&gt;The target environment cannot support gRPC network configuration.&lt;/li&gt;
&lt;li&gt;Pydantic version constraints conflict with existing project dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each client brings distinct trade-offs to the connection layer. These differences also extend further into how each client handles installation and platform compatibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation and Environment Management
&lt;/h2&gt;

&lt;p&gt;In ideal environments, installing a Python vector database library is straightforward. In practice, the target platform, Python version, and existing package dependencies each introduce variables that can turn a simple pip install into a multi-hour debugging session. A quick compatibility check before committing to a client is worth the effort, since most of these issues only surface after the setup is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  The compatibility matrix
&lt;/h3&gt;

&lt;p&gt;The following matrix shows client behavior across Python 3.8–3.13 on macOS ARM, Windows, and Linux.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;macOS ARM (M1/M2)&lt;/th&gt;
&lt;th&gt;Windows&lt;/th&gt;
&lt;th&gt;Linux (Debian)&lt;/th&gt;
&lt;th&gt;Python 3.13&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;Requires Docker for gRPC&lt;/td&gt;
&lt;td&gt;✓ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChromaDB&lt;/td&gt;
&lt;td&gt;hnswlib compilation errors&lt;/td&gt;
&lt;td&gt;Instability above 99 records (&lt;a href="https://github.com/chroma-core/chroma/issues/3058" rel="noopener noreferrer"&gt;#3058&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;Requires Debian Bookworm+&lt;/td&gt;
&lt;td&gt;✗ Broken (&lt;a href="https://github.com/chroma-core/chroma/issues/3651" rel="noopener noreferrer"&gt;#3651&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Full support&lt;/td&gt;
&lt;td&gt;✓ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ChromaDB carries the heaviest compatibility burden of any client in this comparison. On macOS with ARM processors, hnswlib produces compilation errors during installation, forcing developers to manually pin Python to 3.11 or 3.12.&lt;/p&gt;

&lt;p&gt;On Windows, ChromaDB destabilizes once a collection exceeds 99 records, making the embedded client unsuitable for anything beyond early prototyping. On Linux, Debian-based distributions require Bookworm or later to install and run ChromaDB cleanly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual environment best practices
&lt;/h3&gt;

&lt;p&gt;Setting up a virtual environment before installing any vector database client saves a lot of debugging time, especially with ChromaDB, where package conflicts are common.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
    &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;chromadb 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pinning the client version in a &lt;code&gt;requirements.txt&lt;/code&gt; file matters equally, since several of these clients have a history of introducing breaking changes between minor releases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chromadb==0.4.x
    qdrant-client==1.7.x
    pinecone==3.x
    weaviate-client==4.x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ChromaDB's two-package architecture confuses many developers. When someone installs &lt;code&gt;chromadb-client&lt;/code&gt; instead of &lt;code&gt;chromadb&lt;/code&gt;, the application throws the following error on its first attempt to call the default embedding function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ValueError: You must provide an embedding &lt;span class="k"&gt;function&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;chromadb-client&lt;/code&gt; is a lightweight HTTP-only package that does not include &lt;code&gt;DefaultEmbeddingFunction&lt;/code&gt;. Applications that rely on the default embedding behavior need the full &lt;code&gt;chromadb&lt;/code&gt; package instead.&lt;/p&gt;
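
&lt;p&gt;A startup check can catch the wrong package before the first embedding call fails. The sketch below relies only on the standard library's &lt;code&gt;importlib.metadata&lt;/code&gt;; the helper names &lt;code&gt;installed_chroma_packages&lt;/code&gt; and &lt;code&gt;has_default_embeddings&lt;/code&gt; are hypothetical, and the rule they encode is simply the one described above.&lt;/p&gt;

```python
from importlib import metadata


def installed_chroma_packages(candidates=("chromadb", "chromadb-client")):
    """Map each installed Chroma distribution name to its version string."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            continue
    return found


def has_default_embeddings(found):
    """Per the text above, only the full chromadb package ships the default
    embedding function; chromadb-client alone does not."""
    return "chromadb" in found


if __name__ == "__main__":
    packages = installed_chroma_packages()
    print(packages, "default embeddings available:", has_default_embeddings(packages))
```

&lt;p&gt;The check inspects installed distributions rather than importing &lt;code&gt;chromadb&lt;/code&gt;, because both packages are described here as providing the same import name.&lt;/p&gt;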

&lt;h3&gt;
  
  
  Extras and optional dependencies
&lt;/h3&gt;

&lt;p&gt;Beyond the base installation, each client supports additional dependencies that can improve performance for production workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="c"&gt;# Pinecone with gRPC support&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;pinecone[grpc]

    &lt;span class="c"&gt;# Qdrant with FastEmbed for local embedding generation&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;qdrant-client[fastembed]

    &lt;span class="c"&gt;# ChromaDB with sentence-transformers for local embedding support&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;chromadb sentence-transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of the optional installs, gRPC has the largest impact on query performance. Weaviate sees &lt;a href="https://weaviate.io/blog/grpc-performance-improvements" rel="noopener noreferrer"&gt;40-70%&lt;/a&gt; faster queries over gRPC than over REST, while Qdrant gains roughly 15% in query speed. The trade-off is that gRPC requires additional network configuration, which may not be feasible in restricted environments.&lt;/p&gt;
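
&lt;p&gt;Before enabling gRPC, it can help to confirm that the relevant port is reachable from the machine running the client. The following standard-library sketch only tests whether a TCP connection can be opened; it does not prove that a gRPC service is answering. The port numbers are the ones mentioned in this article, and &lt;code&gt;port_reachable&lt;/code&gt; is an illustrative helper name.&lt;/p&gt;

```python
import socket

# gRPC ports named in this article; adjust if your deployment maps them differently.
GRPC_PORTS = {"weaviate": 50051, "qdrant": 6334}


def port_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage against a local Docker container:
#   for name, port in GRPC_PORTS.items():
#       print(name, port, port_reachable("localhost", port))
```

&lt;p&gt;If the check fails in a restricted network, falling back to REST is usually simpler than negotiating a firewall exception during development.&lt;/p&gt;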

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/fastembed/" rel="noopener noreferrer"&gt;FastEmbed&lt;/a&gt; and &lt;a href="https://sbert.net/" rel="noopener noreferrer"&gt;sentence-transformers&lt;/a&gt; both enable local embedding generation without an external API dependency, keeping latency and embedding costs down for semantic search and similarity search workloads.&lt;/p&gt;

&lt;p&gt;Qdrant's native &lt;a href="https://python-client.qdrant.tech/" rel="noopener noreferrer"&gt;AsyncQdrantClient&lt;/a&gt; and Pinecone's &lt;code&gt;PineconeAsyncio&lt;/code&gt; deliver 3–5x throughput improvements under high-concurrency workloads.&lt;/p&gt;
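
&lt;p&gt;The gain from async clients comes from keeping many requests in flight at once instead of waiting on each round trip. The sketch below shows that pattern with the standard library only: &lt;code&gt;fake_search&lt;/code&gt; is a hypothetical stand-in for an awaitable client call, not part of any client library, and the concurrency limit of 8 is an arbitrary assumption.&lt;/p&gt;

```python
import asyncio


async def fake_search(vector):
    """Hypothetical stand-in for an awaitable vector search call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"query": vector, "hits": []}


async def run_queries(vectors, limit=8, search=fake_search):
    """Run all queries concurrently, with at most `limit` requests in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(vector):
        async with gate:
            return await search(vector)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(v) for v in vectors))


if __name__ == "__main__":
    results = asyncio.run(run_queries([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
    print(len(results), "queries completed")
```

&lt;p&gt;With a real client, &lt;code&gt;search&lt;/code&gt; would wrap the async query method, and the limit would be tuned against the server's connection and rate limits.&lt;/p&gt;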

&lt;h2&gt;
  
  
  Local Development Workflows
&lt;/h2&gt;

&lt;p&gt;Developers make most vector database decisions in the local development environment. The critical question is: Which client requires the least code change when moving to production?&lt;/p&gt;

&lt;h3&gt;
  
  
  The migration path
&lt;/h3&gt;

&lt;p&gt;Here is how each client handles the move from local development to production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# Qdrant - zero code changes required
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# Development
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;                         &lt;span class="c1"&gt;# Production
&lt;/span&gt;        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-cluster-url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ChromaDB - client class change required
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                     &lt;span class="c1"&gt;# Development
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;                  &lt;span class="c1"&gt;# Production
&lt;/span&gt;        &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Pinecone - same code in both environments
&lt;/span&gt;    &lt;span class="n"&gt;pc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pinecone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Development and production
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-index-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Qdrant's &lt;code&gt;:memory:&lt;/code&gt; mode exposes the same client interface as a remote deployment, so only the constructor arguments differ between local development and production. The vector store configuration, cosine similarity settings, and HNSW index parameters remain the same across environments.&lt;/p&gt;

&lt;p&gt;ChromaDB requires a change to the client class when moving to production. The more widely the codebase uses the client, the more of the application this change touches.&lt;/p&gt;

&lt;p&gt;Pinecone uses the same code in both development and production since everything runs in the cloud regardless of the stage.&lt;/p&gt;
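
&lt;p&gt;One way to keep the environment switch out of application code is to build the constructor arguments from configuration. The sketch below does this for Qdrant using only the standard library. The environment variable names &lt;code&gt;QDRANT_URL&lt;/code&gt; and &lt;code&gt;QDRANT_API_KEY&lt;/code&gt; and the helper &lt;code&gt;qdrant_client_kwargs&lt;/code&gt; are assumptions made for illustration, not conventions defined by the client.&lt;/p&gt;

```python
import os


def qdrant_client_kwargs(env=None):
    """Build QdrantClient keyword arguments from environment variables.

    QDRANT_URL and QDRANT_API_KEY are hypothetical variable names chosen
    for this example.
    """
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    if not url:
        # No URL configured: fall back to the in-process instance.
        return {"location": ":memory:"}
    kwargs = {"url": url}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs


# Usage, assuming qdrant-client is installed:
#   from qdrant_client import QdrantClient
#   client = QdrantClient(**qdrant_client_kwargs())
```

&lt;p&gt;The same idea applies to ChromaDB, except that the factory would also have to choose between client classes rather than only between argument sets.&lt;/p&gt;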

&lt;p&gt;These migration differences stem from three distinct local development approaches: embedded mode, Docker, and cloud-only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedded mode
&lt;/h3&gt;

&lt;p&gt;ChromaDB’s default embedded client stores data only in memory. When the application stops running, the data is lost. For development work that needs persistent collections, &lt;code&gt;PersistentClient&lt;/code&gt; writes data to disk instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# In-memory only: data lost when process ends
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Persistent local storage
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/local/path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Qdrant’s &lt;code&gt;:memory:&lt;/code&gt; mode uses the same client interface as a production deployment. Whatever code works locally also works in production without any changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:memory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_collection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both clients work well for early prototyping and notebook environments, and the differences surface only at the production boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker for local development
&lt;/h3&gt;

&lt;p&gt;Docker runs the vector database in an isolated local container using the same configuration as in a production deployment. Qdrant and Weaviate are both open-source vector databases that support this approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qdrant&lt;/span&gt;
    docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 6333:6333 qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="c"&gt;# Weaviate&lt;/span&gt;
    docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;-p&lt;/span&gt; 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is running, the client connects to localhost the same way it would to a self-hosted vector database in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Qdrant
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Weaviate
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect_to_local&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main advantage is that vector index configuration behaves the same way locally and in production, and issues that surface locally are genuine issues rather than environment-specific artifacts. &lt;/p&gt;

&lt;p&gt;The trade-off is the overhead of Docker installation and port configuration, particularly Weaviate's requirement for both ports &lt;code&gt;8080&lt;/code&gt; and &lt;code&gt;50051&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud-only development
&lt;/h3&gt;

&lt;p&gt;Pinecone operates entirely in the cloud. Every operation, from index creation to vector upsert to real-time search, requires an active API key and a live network connection.  &lt;/p&gt;

&lt;p&gt;The setup overhead is minimal since there is no local infrastructure to configure or maintain. The same code runs across all environments, with API key management and network connectivity as the only constant requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30p0scs97bk0w2z3y318.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30p0scs97bk0w2z3y318.jpeg" alt="Figure 2: Local development workflow" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moving beyond local development, how each client integrates with the broader Python ecosystem, particularly LangChain and LlamaIndex, adds another layer to the comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration Ecosystem: LangChain and LlamaIndex
&lt;/h2&gt;

&lt;p&gt;LangChain and LlamaIndex sit at the center of most Python-based retrieval-augmented generation workflows. All four clients integrate with both frameworks, though the quality of these integrations varies.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain maturity
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_pinecone&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PineconeVectorStore&lt;/span&gt;       &lt;span class="c1"&gt;# Dedicated package
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_chroma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;                      &lt;span class="c1"&gt;# Mature, widely used
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_qdrant&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantVectorStore&lt;/span&gt;           &lt;span class="c1"&gt;# Actively maintained
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_weaviate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WeaviateVectorStore&lt;/span&gt;       &lt;span class="c1"&gt;# Lagged during v3 to v4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pinecone's LangChain integration benefits from thorough documentation and a stable release history across major version changes. Teams widely rely on ChromaDB's integration to prototype retrieval-augmented generation pipelines.&lt;/p&gt;

&lt;p&gt;Qdrant’s dedicated &lt;code&gt;langchain-qdrant&lt;/code&gt; package keeps pace closely with client releases. Weaviate is the exception. LangChain took several months to catch up with the v3-to-v4 migration, and the breaking changes forced many teams to pin their dependency versions in the meantime.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex support
&lt;/h3&gt;

&lt;p&gt;All four clients have LlamaIndex connectors through the &lt;code&gt;llama-index-vector-stores&lt;/code&gt; namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.vector_stores.qdrant&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantVectorStore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the support pattern is consistent across all four clients. When a client releases a new version, LlamaIndex integration typically follows within four to eight weeks. Any breaking changes in the client affect pipelines built on top of it in the meantime.&lt;/p&gt;

&lt;h3&gt;
  
  
  The integration tax
&lt;/h3&gt;

&lt;p&gt;Every time a vector database client releases a major update, a predictable sequence follows.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client releases a new version with breaking changes.&lt;/li&gt;
&lt;li&gt;LangChain and LlamaIndex update their integrations weeks later.&lt;/li&gt;
&lt;li&gt;Production pipelines built on top of those integrations break in the meantime.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pattern has played out with Weaviate’s v3-to-v4 migration, Pinecone’s three major releases in 18 months, and ChromaDB’s ongoing compatibility issues. &lt;/p&gt;

&lt;p&gt;API stability matters more than feature richness when selecting a Python vector database library for production retrieval-augmented pipelines. A client with fewer features but a stable API causes considerably less disruption than one with a rich feature set and frequent breaking changes.&lt;/p&gt;
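&lt;p&gt;One way to make the integration tax visible early is a startup check that compares installed package versions with the versions the pipeline was tested against. The sketch below uses only the standard library; the &lt;code&gt;PINNED&lt;/code&gt; mapping and the &lt;code&gt;find_version_drift&lt;/code&gt; helper are hypothetical names, and the version numbers are placeholders, not recommendations.&lt;/p&gt;

```python
from importlib import metadata

# Placeholder pins (assumption): replace with the versions your pipeline was tested against.
PINNED = {
    "qdrant-client": "1.9.1",
    "langchain-qdrant": "0.1.0",
}


def find_version_drift(pins, lookup=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match."""
    drift = {}
    for package, expected in pins.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            drift[package] = (expected, installed)
    return drift
```

&lt;p&gt;Calling &lt;code&gt;find_version_drift(PINNED)&lt;/code&gt; when the service starts, and refusing to start if the result is non-empty, turns a silent upgrade into an explicit failure before any query runs.&lt;/p&gt;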

&lt;h2&gt;
  
  
  Performance Considerations: Beyond Raw Speed
&lt;/h2&gt;

&lt;p&gt;Most client comparisons overlook three factors that significantly affect vector database performance: protocol choice, the quality of async support, and connection pooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Protocol choice: REST vs. gRPC
&lt;/h3&gt;

&lt;p&gt;gRPC and REST are the two transport protocols available across these clients. As mentioned earlier, Weaviate sees 40 to 80% faster queries over gRPC, and Qdrant gains roughly 15% in query speed with gRPC enabled. In restricted network environments where port &lt;code&gt;50051&lt;/code&gt; is not accessible, REST is the more practical option.&lt;/p&gt;
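&lt;p&gt;A service that may run in both open and restricted networks can probe the gRPC port once at startup and fall back to REST when it is blocked. The following is a minimal sketch using only the standard library; &lt;code&gt;grpc_port_reachable&lt;/code&gt; and &lt;code&gt;choose_transport&lt;/code&gt; are hypothetical helpers, and a successful TCP connection only shows that the port is open, not that a gRPC server is answering on it.&lt;/p&gt;

```python
import socket


def grpc_port_reachable(host, port=50051, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def choose_transport(host, grpc_port=50051):
    """Prefer gRPC when its port is reachable; otherwise fall back to REST."""
    return "grpc" if grpc_port_reachable(host, grpc_port) else "rest"
```

&lt;p&gt;The returned string can then decide which client options to pass, for example whether to enable the client's gRPC mode.&lt;/p&gt;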

&lt;h3&gt;
  
  
  Async support quality
&lt;/h3&gt;

&lt;p&gt;Most teams build production LLM applications on FastAPI or similar async frameworks, which makes async client support a meaningful performance consideration. Using a synchronous client within an async application results in blocking calls, which sharply reduces throughput.&lt;/p&gt;

&lt;p&gt;Qdrant’s native &lt;code&gt;AsyncQdrantClient&lt;/code&gt;, available since v1.6.1, provides a well-established async implementation. Pinecone introduced &lt;code&gt;PineconeAsyncio&lt;/code&gt; in v6, bringing proper async support to cloud-only vector search workloads. Weaviate added async support in v4.7, making it the most recent of the four to reach production-ready async capabilities. ChromaDB’s async support remains the most limited of the four.&lt;/p&gt;

&lt;p&gt;The throughput difference is substantial. For I/O-bound workloads where network latency is the bottleneck, &lt;a href="https://fastapi.tiangolo.com/async/" rel="noopener noreferrer"&gt;async clients&lt;/a&gt; typically deliver 3–5x higher throughput than their synchronous equivalents.&lt;/p&gt;
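&lt;p&gt;The effect of a blocking call inside an async application can be reproduced without any database. In the sketch below, &lt;code&gt;blocking_search&lt;/code&gt; and &lt;code&gt;async_search&lt;/code&gt; are hypothetical stand-ins for a synchronous and an async client call, and the 50 ms latency is an assumed value chosen only for illustration.&lt;/p&gt;

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # seconds; assumed stand-in for one network round trip


def blocking_search(query):
    """Stand-in for a synchronous client call; it blocks the whole event loop."""
    time.sleep(SIMULATED_LATENCY)
    return f"result for {query}"


async def async_search(query):
    """Stand-in for an async client call; it yields to the event loop while waiting."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"result for {query}"


async def run_blocking(queries):
    # Each call holds the event loop, so the requests run one after another.
    return [blocking_search(q) for q in queries]


async def run_concurrent(queries):
    # The waits overlap, so total time stays close to a single round trip.
    return await asyncio.gather(*(async_search(q) for q in queries))


def timed(coroutine):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start
```

&lt;p&gt;Timing &lt;code&gt;run_blocking&lt;/code&gt; against &lt;code&gt;run_concurrent&lt;/code&gt; for the same list of queries shows the serialized version growing linearly with the number of requests while the concurrent version stays roughly flat. Real gains depend on server capacity and connection limits.&lt;/p&gt;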

&lt;h3&gt;
  
  
  Connection pooling and resource management
&lt;/h3&gt;

&lt;p&gt;Connection pooling is one of the configuration areas where default settings tend to fall short in production. Qdrant and Pinecone both expose parameters that give more control over connection management under sustained production traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Qdrant connection pool configuration
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-cluster-url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pinecone connection pool configuration
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-index-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pool_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_pool_maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Pinecone, &lt;code&gt;query_namespaces&lt;/code&gt; requires tuning &lt;code&gt;pool_threads&lt;/code&gt; and &lt;code&gt;connection_pool_maxsize&lt;/code&gt; for production workloads. For Qdrant, increasing &lt;code&gt;pool_size&lt;/code&gt; above the default reduces connection contention for applications that handle large volumes of document embeddings in parallel.&lt;/p&gt;

&lt;p&gt;Teams that tune these settings before deployment avoid considerable debugging time when the application runs under load.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling and Debugging
&lt;/h2&gt;

&lt;p&gt;Vector database libraries handle a lot of complexity internally. When something fails, how clearly the client communicates that failure determines how quickly teams can fix it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error message quality
&lt;/h3&gt;

&lt;p&gt;The quality of error messages varies considerably across the four clients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinecone&lt;/strong&gt; produces clear, actionable error messages that typically suggest a solution alongside the failure description, reducing the time teams spend searching for the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qdrant&lt;/strong&gt; error messages are helpful and point directly to the source of the problem. The UnexpectedResponse exception includes a specific reason field that identifies exactly which parameter failed validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    qdrant_client.http.exceptions.UnexpectedResponse: Status 400, reason: &lt;span class="s2"&gt;"Wrong input: Vector dimension error: expected dim: 384, got 768"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
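&lt;p&gt;A dimension mismatch like the one in the Qdrant error above can also be caught on the client side, before the request is sent. This is a minimal sketch; &lt;code&gt;check_dimension&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;EXPECTED_DIM&lt;/code&gt; is assumed to match the size the collection was created with.&lt;/p&gt;

```python
EXPECTED_DIM = 384  # assumption: the size used when the collection was created


def check_dimension(vector, expected_dim=EXPECTED_DIM):
    """Raise a descriptive error before a mismatched vector reaches the database."""
    actual = len(vector)
    if actual != expected_dim:
        raise ValueError(
            f"Embedding has {actual} dimensions, but the collection expects {expected_dim}. "
            "Check that ingestion and query use the same embedding model."
        )
    return vector
```

&lt;p&gt;Wrapping every upsert and query vector in this check keeps the failure close to the code that produced the embedding.&lt;/p&gt;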



&lt;p&gt;&lt;strong&gt;ChromaDB&lt;/strong&gt; error messages are frequently vague and require a GitHub search to diagnose. When the two-package mix-up occurs, ChromaDB raises a ValueError about missing embedding functions instead of reporting the actual root cause. The SQLite version requirement produces a similarly unhelpful error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; 3.35.0.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This error is a common roadblock for Python developers deploying on older Amazon Linux 2 or Streamlit environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaviate&lt;/strong&gt; v3 failed silently, returning null objects or dictionaries with an &lt;code&gt;errors&lt;/code&gt; key that developers had to check manually. The &lt;a href="https://weaviate.io/blog/py-client-v4-release" rel="noopener noreferrer"&gt;v4 rewrite&lt;/a&gt; addressed this with typed exceptions, such as &lt;code&gt;WeaviateQueryError&lt;/code&gt; and &lt;code&gt;WeaviateGRPCUnavailableError&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging and observability
&lt;/h3&gt;

&lt;p&gt;Observability capabilities differ across the clients.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Qdrant supports structured logging, distributed tracing, and metrics without additional configuration, which makes it a strong fit for production machine learning applications that require visibility into vector search engine performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pinecone provides basic logging through its managed infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ChromaDB has limited logging with no structured output, which makes diagnosing issues in production AI applications considerably harder.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common pitfalls and solutions
&lt;/h3&gt;

&lt;p&gt;Three error patterns recur across all four clients in production environments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Client version mismatches cause frequent, unexpected failures, particularly amid Pinecone's three releases over 18 months and Weaviate's v3-to-v4 migration. Teams can control this by pinning client versions in a &lt;code&gt;requirements.txt&lt;/code&gt; file. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedding dimension mismatch occurs when the query embedding dimensions do not match the collection's expectations. Verifying that the embedding model's output size matches the collection configuration before deployment prevents this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rate limiting affects cloud-only deployments on Pinecone and Weaviate Cloud. Implementing exponential backoff on API calls is the standard solution for production workloads that approach rate limits under sustained traffic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
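&lt;p&gt;The exponential backoff mentioned for rate limiting can be sketched with the standard library alone. &lt;code&gt;RateLimitError&lt;/code&gt; here is a hypothetical stand-in for whatever exception a given client raises when it is rate limited, and the default delays are assumptions to tune per service.&lt;/p&gt;

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit (HTTP 429) exception."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # The delay ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so many clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

&lt;p&gt;A call such as &lt;code&gt;call_with_backoff(lambda: index.query(...))&lt;/code&gt; would wrap a single request. In real code, catch the client's own rate-limit exception instead of the placeholder class.&lt;/p&gt;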

&lt;p&gt;How often version mismatches surface, how broadly platform incompatibilities spread, and how clearly error messages communicate failures together determine a client's real maintenance cost in production. &lt;/p&gt;

&lt;p&gt;The version churn across Pinecone, Weaviate, and ChromaDB has left many production teams looking for a client that prioritizes operational stability over feature velocity. Actian VectorAI DB addresses this directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actian VectorAI DB
&lt;/h2&gt;

&lt;p&gt;Actian designed &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; for large and small-scale deployment of vector search in edge and on-premises environments, addressing operational stability through the following characteristics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture and design
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Same architecture and APIs across all environments.&lt;/li&gt;
&lt;li&gt;Docker-based deployment from local laptops to production infrastructure.&lt;/li&gt;
&lt;li&gt;HNSW-based indexing for approximate nearest neighbor search.&lt;/li&gt;
&lt;li&gt;Real-time indexing architecture with immediate update availability.&lt;/li&gt;
&lt;li&gt;Python and JavaScript SDKs with REST and SQL APIs.&lt;/li&gt;
&lt;li&gt;Native LangChain and LlamaIndex integration support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker containers for local development and production.&lt;/li&gt;
&lt;li&gt;Data center, private cloud, and public cloud infrastructure support.&lt;/li&gt;
&lt;li&gt;Edge, remote, and air-gapped environment capability with offline operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compliance design
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;On-premises deployment architecture for data sovereignty.&lt;/li&gt;
&lt;li&gt;Supports GDPR, HIPAA, and data residency requirements.&lt;/li&gt;
&lt;li&gt;Aligns with SOC 2 and ISO 27001 compliance frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance targets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sub-15ms query latency goal for local deployment.&lt;/li&gt;
&lt;li&gt;Uses HNSW to optimize for high recall accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These characteristics represent important considerations beyond features and benchmarks alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Choosing Your Python Client
&lt;/h2&gt;

&lt;p&gt;Selecting the right Python vector database library comes down to six criteria.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment model:&lt;/strong&gt; Cloud-only deployments point towards Pinecone. On-premises or self-hosted vector database requirements point towards Qdrant, Weaviate, or Actian. Teams already running PostgreSQL should test pgvector against their workload before adding a new dependency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Team experience:&lt;/strong&gt; Junior teams or teams new to vector databases benefit from ChromaDB, whose embedded mode needs no infrastructure to get started, or from Actian, where API stability reduces the learning curve. Senior teams comfortable with Docker and gRPC configuration get more out of Qdrant and Weaviate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scale:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Under 10 million vectors: ChromaDB or FAISS&lt;/li&gt;
&lt;li&gt;10 million to 100 million vectors: Qdrant, Pinecone, or Actian&lt;/li&gt;
&lt;li&gt;Over 100 million vectors: Milvus&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API stability:&lt;/strong&gt; Production environments with zero tolerance for breaking changes point towards Actian or Pinecone v7+. Teams that can absorb migrations can work with Weaviate or Qdrant. POC and experimental workloads can use ChromaDB or FAISS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration:&lt;/strong&gt; LangChain-dependent pipelines should verify version support before committing to a client. Weaviate's v3-to-v4 lag is the clearest example of what happens when this check is skipped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Pinecone's managed pricing runs from approximately $50 to $500 or more per month, depending on scale. Cost-conscious teams running large datasets on self-hosted infrastructure should evaluate Qdrant or pgvector. ChromaDB and FAISS are both free for local, open-source vector database workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Decision matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Pinecone&lt;/th&gt;
&lt;th&gt;Qdrant&lt;/th&gt;
&lt;th&gt;Weaviate&lt;/th&gt;
&lt;th&gt;ChromaDB&lt;/th&gt;
&lt;th&gt;Actian VectorAI DB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API stability&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Improving&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local development&lt;/td&gt;
&lt;td&gt;✗ No local mode&lt;/td&gt;
&lt;td&gt;✓ :memory: mode&lt;/td&gt;
&lt;td&gt;Docker required&lt;/td&gt;
&lt;td&gt;✓ Embedded&lt;/td&gt;
&lt;td&gt;✓ :memory: + SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Platform compatibility&lt;/td&gt;
&lt;td&gt;✓ Cloud only&lt;/td&gt;
&lt;td&gt;✓ All platforms&lt;/td&gt;
&lt;td&gt;✓ All platforms&lt;/td&gt;
&lt;td&gt;✗ Issues on ARM, Win&lt;/td&gt;
&lt;td&gt;✓ All platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async support&lt;/td&gt;
&lt;td&gt;✓ v6+&lt;/td&gt;
&lt;td&gt;✓ Native&lt;/td&gt;
&lt;td&gt;✓ v4.7+&lt;/td&gt;
&lt;td&gt;✗ Limited&lt;/td&gt;
&lt;td&gt;✓ Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$50-500+/mo&lt;/td&gt;
&lt;td&gt;Free / Self-hosted&lt;/td&gt;
&lt;td&gt;Free / Managed&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Enterprise pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The ChromaDB production failure from the opening example stems from client packaging issues that developers only encounter after deployment. This comparison helps avoid similar failures: platform incompatibilities, breaking changes from version migrations, and client class redesigns that propagate across codebases.&lt;/p&gt;

&lt;p&gt;ChromaDB gets projects started quickly but tends to show its limitations once the application moves to production. Pinecone is polished and well-managed, but version churn and permanent cloud dependencies are real costs. Qdrant is the strongest open-source option for teams that want local-to-production parity without code changes. Weaviate's v4 client significantly improves on v3 and is well-suited for teams that need hybrid search at scale.&lt;/p&gt;

&lt;p&gt;For teams where API stability and platform compatibility are critical, enterprise-grade clients like Actian VectorAI DB provide production-ready stability with verified cross-platform support.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Explore &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; for guaranteed production stability.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>python</category>
    </item>
    <item>
      <title>Can AI in Manufacturing Work Without the Cloud? A Guide</title>
      <dc:creator>Tiioluwani</dc:creator>
      <pubDate>Mon, 18 May 2026 12:40:42 +0000</pubDate>
      <link>https://dev.to/actiandev/can-ai-in-manufacturing-work-without-the-cloud-a-guide-48a8</link>
      <guid>https://dev.to/actiandev/can-ai-in-manufacturing-work-without-the-cloud-a-guide-48a8</guid>
      <description>&lt;p&gt;Keeping external traffic out of operational networks is a best practice that most manufacturing facilities build into their architecture from the ground up.&lt;/p&gt;

&lt;p&gt;Manufacturing networks use the &lt;a href="https://www.sans.org/blog/introduction-to-ics-security-part-2" rel="noopener noreferrer"&gt;Purdue Model&lt;/a&gt;, a layered reference architecture (Levels 0 through 5) that has shaped industrial network design for decades. The lowest levels hold the physical process and its controls: sensors, motors, and actuators at Level 0; real-time controllers and SCADA systems at Level 1; and supervisory servers and HMI systems at Level 2. Level 3 manages operations. Levels 4 and 5 connect to the enterprise network and to the internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.isa.org/standards-and-publications/isa-standards/isa-iec-62443-series-of-standards" rel="noopener noreferrer"&gt;IEC 62443&lt;/a&gt; enforces strict boundaries between these levels. Traffic from Level 2 does not reach the internet. For defense contractors, &lt;a href="https://www.pmddtc.state.gov/ddtc_public?id=ddtc_public_portal_itar_landing" rel="noopener noreferrer"&gt;ITAR&lt;/a&gt; compounds the problem. Technical data must stay on U.S. soil and remain accessible only to U.S. persons. Cloud-hosted vector databases like &lt;a href="https://www.pinecone.io" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt;, &lt;a href="https://console.weaviate.cloud" rel="noopener noreferrer"&gt;Weaviate Cloud&lt;/a&gt;, and &lt;a href="https://cloud.qdrant.io" rel="noopener noreferrer"&gt;Qdrant Cloud&lt;/a&gt; fail both requirements. Level 2 has no way to send a request to a cloud API, and &lt;a href="https://www.actian.com/blog/databases/what-37signals-cloud-repatriation-taught-us-about-ai-infrastructure/" rel="noopener noreferrer"&gt;other industries learned this lesson the hard way&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh14czsvaiz4gi3nyldes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh14czsvaiz4gi3nyldes.png" alt="Why cloud AI cannot reach the factory floor" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latency compounds the problem. &lt;a href="https://dev.to/actiandev/why-real-time-analytics-cant-depend-on-cloud-in-2026-1paj"&gt;Cloud round-trips&lt;/a&gt; average 50 to 500 milliseconds. PLC-level control loops require responses in under 10 milliseconds. Teams that need AI during outages use &lt;a href="https://www.actian.com/blog/databases/5-edge-ai-architecture-patterns-for-disconnected-environments/" rel="noopener noreferrer"&gt;edge deployment patterns designed for disconnected environments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cost adds another layer. &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;AWS standard egress&lt;/a&gt; starts at $0.09 per GB. At any serious production scale, sensor and vision data add up quickly, and the bill arrives faster than most teams expect.&lt;/p&gt;

&lt;p&gt;Architecture, latency, and cost all point in the same direction. AI on the factory floor needs to run where the data lives.&lt;/p&gt;

&lt;p&gt;This tutorial shows you how to build a local RAG pipeline that runs entirely on factory-floor hardware, where a technician can ask a question about any piece of equipment and get a cited answer from decades of maintenance records, with no internet connection required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Are Building
&lt;/h2&gt;

&lt;p&gt;You’ll build a three-layer RAG pipeline that runs fully inside your factory network. The ingestion layer processes PDF maintenance documents and stores them in Actian VectorAI DB. The query layer takes a technician's question and returns a cited answer fast enough for interactive use on factory-floor hardware. The audit layer records every ingestion and query event for traceability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion:&lt;/strong&gt; Reads the PDF maintenance documents, splits them into 256-token chunks with a 25-token overlap, generates embeddings using sentence-transformers on a CPU, and stores everything in VectorAI DB with metadata for equipment line, document date, and source file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query:&lt;/strong&gt; Takes a technician's question, embeds it with the same model, runs a hybrid search in VectorAI DB filtered by equipment line and date range, and sends the top results to a local LLM running with Ollama, which generates a cited answer in plain English.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit:&lt;/strong&gt; Logs every ingestion and query event as a structured JSON entry to &lt;code&gt;./data/audit.log&lt;/code&gt;, timestamped in UTC, and stored inside your security boundary to satisfy IEC 62443 traceability requirements.&lt;/li&gt;
&lt;/ul&gt;
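&lt;p&gt;The chunking rule in the ingestion layer (256-token windows with a 25-token overlap) can be sketched as a sliding window. The helper names below are hypothetical, and whitespace splitting is only a stand-in for the embedding model's tokenizer, which the real pipeline would use to count tokens.&lt;/p&gt;

```python
def chunk_tokens(tokens, chunk_size=256, overlap=25):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 231 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if not tokens[start + chunk_size:]:
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, chunk_size=256, overlap=25):
    """Whitespace tokenization is an assumed stand-in for the model's tokenizer."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), chunk_size, overlap)]
```

&lt;p&gt;The overlap means the last 25 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk.&lt;/p&gt;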

&lt;p&gt;VectorAI DB sits at the center of all three layers. It stores the embeddings that the ingestion layer produces and serves the searches that the query layer runs. Running it &lt;a href="https://www.actian.com/blog/databases/when-to-choose-on-premises-vs-cloud-for-vector-databases/" rel="noopener noreferrer"&gt;on-premises instead of in the cloud&lt;/a&gt; keeps the whole pipeline inside your security boundary.&lt;/p&gt;
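&lt;p&gt;The audit layer's structured entries could look like the following sketch, which appends one JSON object per line to &lt;code&gt;./data/audit.log&lt;/code&gt; with a UTC timestamp. The &lt;code&gt;log_event&lt;/code&gt; helper and the field names are assumptions for illustration, not the tutorial's final schema.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("./data/audit.log")


def log_event(event_type, details, log_path=AUDIT_LOG):
    """Append one JSON line describing an ingestion or query event."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # always UTC
        "event": event_type,
        "details": details,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

&lt;p&gt;For example, &lt;code&gt;log_event("query", {"equipment_line": "line-3", "results": 5})&lt;/code&gt; writes a single line that log tooling can parse without custom rules. The equipment line value here is a made-up example.&lt;/p&gt;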

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6eubnxzcycn6269k9jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6eubnxzcycn6269k9jr.png" alt="Pipeline architecture" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline runs on standard factory edge server hardware, with Ubuntu 22.04 LTS, 16 GB of RAM, and a 4-core CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Local RAG Pipeline with VectorAI DB
&lt;/h2&gt;

&lt;p&gt;Set up VectorAI DB, build the ingestion pipeline, run your first query, add hybrid filters, and connect a local LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Sign up for the &lt;a href="https://www.actian.com/databases/vectorai-db/community-edition/" rel="noopener noreferrer"&gt;VectorAI DB community edition&lt;/a&gt; before you start, then make sure you have these installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose&lt;/li&gt;
&lt;li&gt;Python 3.10 or higher&lt;/li&gt;
&lt;li&gt;uv package manager. Install with &lt;code&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ollama. Install from &lt;a href="http://ollama.com" rel="noopener noreferrer"&gt;Ollama.com&lt;/a&gt; and pull the model with &lt;code&gt;ollama pull llama3.2:3b&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Your machine needs at least 8 GB of RAM (16 GB or more recommended) and 10 GB of disk space (100 GB or more recommended) to run VectorAI DB. On Windows, the uv install command needs &lt;code&gt;sh&lt;/code&gt;, which PowerShell doesn't provide, so run all commands in WSL2 (Windows Subsystem for Linux). To set up WSL2, run &lt;code&gt;wsl --install&lt;/code&gt; in PowerShell, then use the Ubuntu terminal for this tutorial.&lt;/em&gt;&lt;/p&gt;
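&lt;p&gt;If you want a quick way to confirm the prerequisites, the sketch below checks the Python version and looks for the command-line tools on your &lt;code&gt;PATH&lt;/code&gt;. The helper name &lt;code&gt;missing_prerequisites&lt;/code&gt; is an assumption for illustration. It only confirms that the executables exist, not that Docker Compose is installed or that the Ollama model has been pulled.&lt;/p&gt;

```python
import shutil
import sys

REQUIRED_TOOLS = ("docker", "uv", "ollama")
MIN_PYTHON = (3, 10)


def missing_prerequisites(which=shutil.which, version=sys.version_info):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if MIN_PYTHON > tuple(version[:2]):
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


for problem in missing_prerequisites():
    print(f"[missing] {problem}")
```

&lt;p&gt;No output means the interpreter is recent enough and all three tools were found.&lt;/p&gt;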

&lt;h3&gt;
  
  
  Project structure
&lt;/h3&gt;

&lt;p&gt;Set up your project folder like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;factory-rag/
├── docker-compose.yml
├── data/
│   └── audit.log
├── config/
└── src/
    ├── healthcheck.py
    ├── ingest.py
    ├── query.py
    ├── llm.py
    ├── audit.py
    └── test_e2e.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; factory-rag/&lt;span class="o"&gt;{&lt;/span&gt;data,config,src&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;factory-rag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: Deploy VectorAI DB
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;docker-compose.yml&lt;/code&gt; in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vectorai-db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;williamimoh/actian-vectorai-db:latest&lt;/span&gt;
    &lt;span class="na"&gt;platform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linux/amd64&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vectorai-db&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50051:50051"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./data:/data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config:/config&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nc&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-z&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;50051&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;||&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;start_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the container with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
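&lt;p&gt;The container can take a few seconds to open its port after it starts. The compose healthcheck probes port 50051 with &lt;code&gt;nc -z&lt;/code&gt;, and the same TCP probe can be sketched from the host with the Python standard library. &lt;code&gt;wait_for_port&lt;/code&gt; is a hypothetical helper, not part of the VectorAI DB SDK, and an open port only shows that something is listening, not that the database is fully ready.&lt;/p&gt;

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

&lt;p&gt;Calling &lt;code&gt;wait_for_port("localhost", 50051)&lt;/code&gt; before the first SDK call avoids connecting during the container's startup window.&lt;/p&gt;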



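&lt;p&gt;Install the SDK from its wheel file. The command below expects the &lt;code&gt;.whl&lt;/code&gt; file in your project root and an existing uv project, so run &lt;code&gt;uv init&lt;/code&gt; first if the folder has no &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;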
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add actian_vectorai-0.1.0b2-py3-none-any.whl

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install these required libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add sentence-transformers pypdf

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the server is running. Create &lt;code&gt;src/healthcheck.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAIClient&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:50051&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ VectorAI DB is running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Title:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Version: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/healthcheck.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Terminal output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakqq6onpafd0d6f9tsfr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakqq6onpafd0d6f9tsfr.png" alt="Terminal output" width="633" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Build the ingestion pipeline
&lt;/h3&gt;

&lt;p&gt;Put your PDF maintenance documents in the &lt;code&gt;data/&lt;/code&gt; folder before running this step. Add any equipment maintenance records, inspection reports, or failure logs there.&lt;br&gt;
The pipeline embeds text with &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;, which needs less than 200 MB of RAM on CPU and produces 384-dimensional vectors. We split text into 256-token chunks, and each chunk repeats the last 25 tokens of the previous one so that sentences cut at a boundary keep their context.&lt;/p&gt;
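&lt;p&gt;To see what 256-token chunks with a 25-token overlap mean in practice, this sketch computes the token offsets of each chunk for a document of a given length. &lt;code&gt;chunk_windows&lt;/code&gt; is a hypothetical helper for illustration. It works on token counts only and does not call a tokenizer.&lt;/p&gt;

```python
def chunk_windows(total_tokens, chunk_size=256, overlap=25):
    """Return (start, end) token offsets for overlapping fixed-size chunks."""
    if not chunk_size > overlap >= 0:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 231 tokens with the defaults
    windows = []
    start = 0
    while total_tokens > start:
        end = min(start + chunk_size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break  # the last window reached the end of the document
        start += stride
    return windows


print(chunk_windows(600))
```

&lt;p&gt;For a 600-token document this prints &lt;code&gt;[(0, 256), (231, 487), (462, 600)]&lt;/code&gt;. Each chunk starts 231 tokens after the previous one, so neighboring chunks share 25 tokens.&lt;/p&gt;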

&lt;p&gt;Create &lt;code&gt;src/ingest.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorParams&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pypdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PdfReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;audit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;log_ingestion&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maintenance_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:50051&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;span class="n"&gt;OVERLAP_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CHUNK_TOKENS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OVERLAP_TOKENS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;token_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
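        &lt;span class="c1"&gt;# Step forward by chunk_size - overlap so neighboring chunks share `overlap` tokens.&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;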
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PdfReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;full_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [warn] No extractable text in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, skipping.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Embed all chunks in one batched call instead of one call per chunk.&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress_bar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="c1"&gt;# Deterministic ID: upserting the same file again replaces its chunks instead of duplicating them.&lt;/span&gt;
                &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAMESPACE_DNS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pdfs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pdfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No PDF files found in &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Add PDFs to ./data/ and retry.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading embedding model &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cosine&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created collection &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;VECTOR_DIM&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-dim, Cosine)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Collection &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; already exists, appending chunks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingesting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ingest_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks stored&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;log_ingestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Done. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total chunks stored in &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingest PDFs into VectorAI DB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--data-dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--equipment-line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--doc-date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the ingestion step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/ingest.py &lt;span class="nt"&gt;--equipment-line&lt;/span&gt; turbine-A &lt;span class="nt"&gt;--doc-date&lt;/span&gt; 2024-03-15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Expected output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqf44zjrgsr6kn7w8p6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqf44zjrgsr6kn7w8p6d.png" alt="Terminal output" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metadata schema saves the equipment line, document date, and source file with each chunk. This lets you filter searches by equipment line or date range without searching the whole collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Run your first query
&lt;/h3&gt;

&lt;p&gt;Your ingestion pipeline has stored the maintenance records in VectorAI DB, so the system can now answer questions. When a technician asks a question in plain English, the query script embeds it, searches the &lt;code&gt;maintenance_records&lt;/code&gt; collection, and returns the five most relevant chunks with their similarity scores.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;src/query.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;actian_vectorai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FilterBuilder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorAIClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;audit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;log_query&lt;/span&gt;

&lt;span class="n"&gt;COLLECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maintenance_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:50051&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;TOP_K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FilterBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gte&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lte&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;must&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_progress_bar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;query_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;VectorAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search maintenance records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Natural language question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--equipment-line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--doc-date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--doc-date-to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doc_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;doc_date_to&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;latency_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="nf"&gt;log_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equipment_line&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No results found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results for: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(chunk &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try your first query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/query.py &lt;span class="s2"&gt;"What caused the bearing failure?"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The search embeds the query with the same model used for ingestion, so the query and the stored vectors share one semantic space. For maintenance records embedded with this model, similarity scores between 0.4 and 0.6 indicate relevant matches.&lt;/p&gt;
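&lt;p&gt;One way to act on that score range is to drop weak matches before passing results downstream. The sketch below is a minimal illustration, assuming result dicts shaped like those returned by &lt;code&gt;search&lt;/code&gt; in &lt;code&gt;src/query.py&lt;/code&gt;. The &lt;code&gt;filter_by_score&lt;/code&gt; helper and the &lt;code&gt;MIN_SCORE&lt;/code&gt; cutoff are hypothetical names that are not part of the tutorial code, and the cutoff should be tuned on your own records.&lt;/p&gt;

```python
# Hypothetical cutoff taken from the low end of the 0.4 to 0.6 range; tune it on your data.
MIN_SCORE = 0.4


def filter_by_score(results, min_score=MIN_SCORE):
    """Return only the result dicts whose similarity score meets the cutoff."""
    return [r for r in results if r["score"] >= min_score]


if __name__ == "__main__":
    # Made-up sample results that mimic the dicts built by search().
    sample = [
        {"score": 0.58, "source_file": "report_a.pdf"},
        {"score": 0.21, "source_file": "report_b.pdf"},
    ]
    for r in filter_by_score(sample):
        print(r["score"], r["source_file"])
```

&lt;p&gt;Filtering after the search keeps the database query unchanged, at the cost of sometimes returning fewer than five results.&lt;/p&gt;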

&lt;h3&gt;
  
  
  Step 4: Add hybrid filters
&lt;/h3&gt;

&lt;p&gt;Filtering by equipment line and date keeps search results relevant to the technician's current work. Run the same query from Step 3, this time with an equipment-line filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/query.py &lt;span class="s2"&gt;"What caused the bearing failure?"&lt;/span&gt; &lt;span class="nt"&gt;--equipment-line&lt;/span&gt; turbine-A

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a date filter to narrow the results even more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/query.py &lt;span class="s2"&gt;"What caused the bearing failure?"&lt;/span&gt; &lt;span class="nt"&gt;--equipment-line&lt;/span&gt; turbine-A &lt;span class="nt"&gt;--doc-date&lt;/span&gt; 2024-03-15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Expected output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj590jqsg7zogf1dmj5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj590jqsg7zogf1dmj5m.png" alt="Terminal output" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;build_filter&lt;/code&gt; function uses &lt;code&gt;FilterBuilder&lt;/code&gt; to construct a metadata filter, and &lt;code&gt;search&lt;/code&gt; passes that filter alongside the query vector, so each hit must match on both vector similarity and metadata. A technician working on turbine-A only sees results from that equipment line, not from the entire maintenance history.&lt;/p&gt;
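&lt;p&gt;The branching in &lt;code&gt;build_filter&lt;/code&gt; can be read as the plain-Python predicate below. This is only an illustration of the intended matching rules, assuming each payload stores &lt;code&gt;doc_date&lt;/code&gt; as an ISO 8601 string (YYYY-MM-DD) as in the CLI examples. It does not show how VectorAI DB evaluates filters, and the &lt;code&gt;matches&lt;/code&gt; helper is hypothetical. As in &lt;code&gt;build_filter&lt;/code&gt;, a &lt;code&gt;doc_date_to&lt;/code&gt; value supplied without &lt;code&gt;doc_date&lt;/code&gt; adds no date constraint.&lt;/p&gt;

```python
def matches(payload, equipment_line=None, doc_date=None, doc_date_to=None):
    """Hypothetical in-memory mirror of the three branches in build_filter."""
    if equipment_line and payload.get("equipment_line") != equipment_line:
        return False
    stored = payload.get("doc_date", "")
    if doc_date and doc_date_to:
        # Inclusive range; ISO 8601 dates order correctly as plain strings.
        return doc_date_to >= stored >= doc_date
    if doc_date:
        # A single date means an exact match on that day.
        return stored == doc_date
    # No date constraint, including the case where only doc_date_to was given.
    return True


if __name__ == "__main__":
    # Made-up payload using the metadata keys read in src/query.py.
    record = {"equipment_line": "turbine-A", "doc_date": "2024-03-15"}
    print(matches(record, equipment_line="turbine-A"))
    print(matches(record, doc_date="2024-03-01", doc_date_to="2024-03-31"))
    print(matches(record, equipment_line="turbine-B"))
```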

&lt;h3&gt;
  
  
  Step 5: Connect the local LLM
&lt;/h3&gt;

&lt;p&gt;The search results feed into a local LLM running via Ollama, which generates a cited answer in plain English. The entire round trip runs on factory-floor hardware.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;src/llm.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="n"&gt;OLLAMA_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;OLLAMA_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2:3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MAX_NEW_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;span class="n"&gt;TEMPERATURE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
&lt;span class="n"&gt;TIMEOUT_SECONDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Answer: I have no relevant context to answer this question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;context_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;equip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;context_blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Equipment: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;equip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Date: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_blocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a maintenance records assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer the question using ONLY the provided context. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cite sources inline using [1], [2], etc. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If the context does not contain enough information, say so.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;options&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MAX_NEW_TOKENS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TEMPERATURE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TIMEOUT_SECONDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(chunk &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, score &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What maintenance was performed?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;dummy_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-03-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turbine-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performed scheduled bearing inspection on turbine-A. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Replaced worn bearing race on shaft 2. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Torque settings verified per spec TRB-004.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dummy_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
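&lt;p&gt;The &lt;code&gt;generate&lt;/code&gt; function above assumes Ollama is reachable. On an air-gapped edge server, the service may be stopped or still loading the model, so a thin wrapper that degrades gracefully can help. The following is a minimal sketch, not part of the original module: &lt;code&gt;safe_generate&lt;/code&gt;, &lt;code&gt;FALLBACK&lt;/code&gt;, and &lt;code&gt;fake_generate&lt;/code&gt; are hypothetical names, and the generate function is passed in as a parameter so the example runs without a live Ollama instance.&lt;/p&gt;

```python
import urllib.error
from typing import Callable

FALLBACK = "The local LLM is unavailable. Check that Ollama is running."


def safe_generate(generate_fn: Callable[..., str], question: str, results: list) -> str:
    """Call generate_fn and return a readable message on connection problems."""
    try:
        return generate_fn(question, results)
    except OSError as exc:
        # urllib.error.URLError (refused connection, HTTP error status)
        # and socket timeouts are all subclasses of OSError.
        reason = getattr(exc, "reason", exc)
        return f"{FALLBACK} ({reason})"


def fake_generate(question: str, results: list) -> str:
    # Stand-in that simulates Ollama not listening on its port.
    raise urllib.error.URLError("connection refused")


print(safe_generate(fake_generate, "What maintenance was performed?", []))
```

&lt;p&gt;In the real pipeline you would pass the &lt;code&gt;generate&lt;/code&gt; function from &lt;code&gt;src/llm.py&lt;/code&gt; in place of &lt;code&gt;fake_generate&lt;/code&gt;.&lt;/p&gt;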



&lt;p&gt;Wire everything together by creating &lt;code&gt;src/test_e2e.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What maintenance was performed on the gearbox?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turbine-A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the full pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python src/test_e2e.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt; fits in the memory of a standard factory edge server. The LLM receives only the retrieved chunks as context, not the full document collection, which keeps responses fast and grounded in cited sources.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Expected output:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclu3vqouovn332vvqg8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclu3vqouovn332vvqg8f.png" alt="Terminal output" width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline now runs end to end. A technician can ask a question and get a cited answer from local maintenance records without any internet connection.&lt;/p&gt;
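&lt;p&gt;Because the prompt asks the model to cite sources as &lt;code&gt;[1]&lt;/code&gt;, &lt;code&gt;[2]&lt;/code&gt;, and so on, it can be worth checking that every citation in a reply points at a chunk that was actually retrieved. Small models occasionally cite a number that does not exist. The sketch below is one possible check; &lt;code&gt;cited_indices&lt;/code&gt; and &lt;code&gt;invalid_citations&lt;/code&gt; are hypothetical helpers, and the sample reply is a made-up string, not real model output.&lt;/p&gt;

```python
import re


def cited_indices(reply: str) -> set:
    """Return the numbers used in [n]-style citations."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", reply)}


def invalid_citations(reply: str, num_results: int) -> set:
    """Return cited numbers that do not match any retrieved chunk."""
    valid = set(range(1, num_results + 1))
    return cited_indices(reply) - valid


# Made-up reply for illustration; only one chunk was retrieved.
reply = "The bearing race on shaft 2 was replaced [1]. Oil was changed [3]."

print(sorted(cited_indices(reply)))          # [1, 3]
print(sorted(invalid_citations(reply, 1)))   # [3]
```

&lt;p&gt;A non-empty result from &lt;code&gt;invalid_citations&lt;/code&gt; is a signal to flag the answer for review rather than show it to the technician as is.&lt;/p&gt;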

&lt;h3&gt;
  
  
  Step 6: Add audit logging
&lt;/h3&gt;

&lt;p&gt;IEC 62443 calls for auditable records of security-relevant events within the OT network. Without a local audit trail, your pipeline has no record of what was queried, when, or what it returned.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;src/audit.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;LOG_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/audit.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;LOG_PATH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FileHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOG_PATH&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actian_vectorai.audit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results_returned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_ingestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks_stored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equipment_line&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;equipment_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunks_stored&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunks_stored&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inspect the audit log with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;data/audit.log

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky2zwqw7hszw9ffv4nxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fky2zwqw7hszw9ffv4nxi.png" alt="Terminal output" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline now keeps a structured record of every ingestion and query event in &lt;code&gt;./data/audit.log&lt;/code&gt;, timestamped in UTC and stored inside your security boundary.&lt;/p&gt;
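&lt;p&gt;Because each record is a JSON object, the log can be summarized with a few lines of Python. The sketch below is illustrative only: it assumes the logger writes one bare JSON object per line with no formatter prefix, and the function name &lt;code&gt;summarize_audit&lt;/code&gt; and the sample records are hypothetical.&lt;/p&gt;

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def summarize_audit(lines: Iterable[str]) -> Dict[str, object]:
    """Count events and average query latency from JSON-lines audit records."""
    events: Counter = Counter()
    latencies: List[float] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        events[entry["event"]] += 1
        if entry["event"] == "query":
            latencies.append(entry["latency_ms"])
    average = sum(latencies) / len(latencies) if latencies else None
    return {"events": dict(events), "avg_query_latency_ms": average}


# Hypothetical records shaped like the entries built by log_ingestion and log_query.
sample_lines = [
    json.dumps({"event": "ingestion", "source_file": "pump_manual.pdf",
                "equipment_line": "line-3", "chunks_stored": 42}),
    json.dumps({"event": "query", "question": "Why does pump P-101 trip?",
                "equipment_line": "line-3", "results_returned": 5, "latency_ms": 120.5}),
    json.dumps({"event": "query", "question": "Torque spec for the coupling?",
                "equipment_line": "line-3", "results_returned": 3, "latency_ms": 79.5}),
]
summary = summarize_audit(sample_lines)
print(summary)
```

&lt;p&gt;To run it against the real file, pass an open file handle for &lt;code&gt;data/audit.log&lt;/code&gt; instead of &lt;code&gt;sample_lines&lt;/code&gt;. If your logging formatter adds a prefix to each line, strip it before calling &lt;code&gt;json.loads&lt;/code&gt;.&lt;/p&gt;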

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;You just built a local RAG pipeline that runs entirely on factory-floor hardware, serves queries during network outages, and returns cited answers from decades of maintenance records.&lt;/p&gt;

&lt;p&gt;AI in manufacturing can operate without a cloud connection. VectorAI DB enables this by running entirely within the IEC 62443 security boundary. Cut the internet connection, and the pipeline keeps working.&lt;/p&gt;

&lt;p&gt;Your pipeline ingests PDF maintenance documents, stores embeddings in the VectorAI DB at Level 2 of your OT network, and answers natural-language questions using a local LLM with no cloud dependency at any step. From here, you can extend the pipeline by adding more document types, tuning the embedding model for your specific equipment vocabulary, adding role-based query filtering by technician, or scaling ingestion across multiple equipment lines.&lt;/p&gt;

&lt;p&gt;Find the full &lt;a href="https://docs.vectoraidb.actian.com/" rel="noopener noreferrer"&gt;VectorAI DB documentation&lt;/a&gt; and the &lt;a href="https://github.com/hackmamba-io/actian-vectorAI-db-beta" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; to explore further.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Join the &lt;a href="https://discord.gg/432A2M63Py" rel="noopener noreferrer"&gt;community&lt;/a&gt; and &lt;a href="https://www.actian.com/databases/vectorai-db/community-edition/" rel="noopener noreferrer"&gt;learn more&lt;/a&gt; about Actian.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>5 Edge AI Architecture Patterns for Disconnected Environments</title>
      <dc:creator>Praise James</dc:creator>
      <pubDate>Mon, 18 May 2026 11:05:16 +0000</pubDate>
      <link>https://dev.to/actiandev/5-edge-ai-architecture-patterns-for-disconnected-environments-27of</link>
      <guid>https://dev.to/actiandev/5-edge-ai-architecture-patterns-for-disconnected-environments-27of</guid>
      <description>&lt;p&gt;A haul truck operating 200 miles from the nearest cellular tower does not pause when connectivity drops. An offshore wind turbine does not suspend fault detection because a satellite link fails in a storm. In these environments, inference, control loops, and safety systems must continue operating regardless of network status. Yet the dominant edge AI architecture still revolves around connectivity and cloud AI.&lt;/p&gt;

&lt;p&gt;Disconnected environments demand edge-native, offline-first architectures designed for operational autonomy. Market signals reinforce this reality.&lt;/p&gt;

&lt;p&gt;ABI Research projects &lt;a href="https://www.abiresearch.com/press/edge-server-spending-to-reach-us19-billion-by-2027-enabling-integration-of-edge-based-solutions-as-part-of-edge-to-cloud-orchestration-strategy" rel="noopener noreferrer"&gt;edge server spending&lt;/a&gt; to reach $19B by 2027, with on-premises deployments accounting for nearly $10.5B. In 2025, organizations deployed &lt;a href="https://www.vpnranks.com/resources/edge-computing-statistics/" rel="noopener noreferrer"&gt;approximately 815 million&lt;/a&gt; edge-enabled IoT devices globally.&lt;/p&gt;

&lt;p&gt;Most operational environments are inherently distributed, generating data far from centralized cloud systems. Edge deployment strategies that depend on sending that data back and forth for processing cause IoT systems to miss critical insights, increase latency, and introduce data loss. Yet proposed edge architectures still treat offline readiness as an add-on rather than the default.&lt;/p&gt;

&lt;p&gt;We present five edge AI deployment patterns that operate without assumed connectivity, covering their implementation tactics, real-world scenarios, trade-offs, and a decision framework for selecting the right pattern for your operational priorities.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Suitable use cases for each documented deployment pattern at a glance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;The drone (self-contained single-node edge AI)&lt;/td&gt;
&lt;td&gt;Autonomous mobile systems with strict energy budgets and zero cloud connection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The factory (multi-node edge AI with optional cloud)&lt;/td&gt;
&lt;td&gt;Facilities with local infrastructure in intermittent environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchical federated learning (client-edge-cloud)&lt;/td&gt;
&lt;td&gt;Privacy-sensitive distributed operations where data leakage risks are unacceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Store-and-forward disconnected inference&lt;/td&gt;
&lt;td&gt;Operations with scheduled connectivity windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The network (distributed edge-to-edge fabric)&lt;/td&gt;
&lt;td&gt;Distributed coordination without cloud dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why Disconnected Environments are an Edge AI Problem
&lt;/h2&gt;

&lt;p&gt;There is a structural blind spot for disconnected environments, driven by the assumption that industries using edge AI models are cloud-centric and operate under persistent connectivity. Where edge AI applications matter most, constant network access does not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  What disconnected actually means
&lt;/h3&gt;

&lt;p&gt;Disconnected environments are settings with unreliable or nonexistent connectivity, ranging from airgapped scenarios with complete network isolation to intermittent setups with frequent connectivity degradation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n9z1nrhyfrupmvoogjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n9z1nrhyfrupmvoogjq.png" alt="Figure 1: Connectivity spectrum" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In these operational settings, edge AI capabilities truly shine because they support the real-time data processing, low latency, bandwidth optimization, and data governance that disconnected environments require.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.precedenceresearch.com/edge-ai-market" rel="noopener noreferrer"&gt;Precedence Research&lt;/a&gt; estimates the global edge AI market will reach $143B by 2034, a potential 472% increase from $25B in 2025. For a significant portion of this market, constant cloud connectivity is not feasible. Yet inference, local data storage, and real-time decision-making must continue regardless of network status or location.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disconnection is where edge AI earns its value
&lt;/h3&gt;

&lt;p&gt;Disconnected environments such as mining sites, manufacturing plants, military operations, offshore wind farms, and smart cities expose the limitations of current edge AI deployment solutions.&lt;/p&gt;

&lt;p&gt;Rio Tinto operates mining sites up to &lt;a href="https://www.bbc.com/news/articles/cgej7gzg8l0o" rel="noopener noreferrer"&gt;930 miles&lt;/a&gt; from cellular coverage, where operators cannot rely on centralized infrastructure. They need autonomous inspection robots that use edge AI to track personnel and vehicles, interpreting data from 3D LiDAR, thermal imaging, and gas sensors in real time.&lt;/p&gt;

&lt;p&gt;At least &lt;a href="https://www.alcircle.com/news/rio-tinto-welcomes-300th-komatsu-autonomous-haulage-truck-at-pilbara-operations-wa-111740#:~:text=%E2%80%9CThe%20AHS%20fleet%20at%20Rio,Tinto%20takes%20with%20its%20suppliers.%22" rel="noopener noreferrer"&gt;300 autonomous haul trucks&lt;/a&gt; operate in Rio Tinto’s Pilbara region. Each truck processes roughly 5TB of data daily through subterranean tunnels with limited connectivity, requiring &lt;a href="https://www.rcrwireless.com/20180710/network-infrastructure/four-private-lte-use-cases#:~:text=According%20to%20a%20Qualcomm%20white%20paper%20on,and%20related%20facilities%20including%20transportation%20hubs%20and" rel="noopener noreferrer"&gt;private LTE networks&lt;/a&gt; for on-device IoT processing.&lt;/p&gt;

&lt;p&gt;Offshore wind farms face a similar constraint. Turbines and inspection vessels go offline when satellite connections fail due to harsh weather or line-of-sight blockage, and each turbine averages &lt;a href="https://www.groundcontrol.com/blog/wireless-connectivity-for-offshore-wind-farms/" rel="noopener noreferrer"&gt;approximately 8.3 failures per year&lt;/a&gt;. These farms need edge AI systems that detect issues early, monitor real-time maritime traffic, analyze local SCADA data, and trigger inspections based on immediate wind conditions.&lt;/p&gt;

&lt;p&gt;In remote manufacturing environments, plant managers also need edge AI to automate quality inspections, predict machine failures, and protect workforce health.&lt;/p&gt;

&lt;p&gt;A similar demand for local, secure processing drives military operations, where systems operate within airgapped networks in denied, disrupted, intermittent, and limited (DDIL) environments to maintain data confidentiality and integrity. Soldiers must communicate with command units and analyze real-time warfare data without relying on cloud data centers or large computing resources.&lt;/p&gt;

&lt;p&gt;These are the environments where edge AI deployment delivers the most impact. According to Dell, enterprise data processing will shift to &lt;a href="https://www.dell.com/en-us/blog/the-power-of-small-edge-ai-predictions-for-2026/" rel="noopener noreferrer"&gt;distributed data centers&lt;/a&gt; in 2026, but most documented architectures still emphasize transmitting data back to cloud data centers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Constrained hardware shapes model deployment
&lt;/h3&gt;

&lt;p&gt;The demands of AI compute and workload scaling at the edge also fuel the cloud-edge deployment recommendations.&lt;/p&gt;

&lt;p&gt;A deep learning model with &lt;a href="https://localaimaster.com/blog/ram-requirements-local-ai" rel="noopener noreferrer"&gt;3B parameters can require up to 4GB of RAM&lt;/a&gt;, but edge devices like microcontrollers and IoT sensors typically have &lt;a href="https://promwad.com/news/best-microcontrollers-low-power-iot-2025" rel="noopener noreferrer"&gt;less than 1GB&lt;/a&gt; for OS, workloads, and storage combined. Connected environment architectures assume large compute availability that doesn’t exist at the edge.&lt;/p&gt;

&lt;p&gt;Edge AI architectures must start with offline-first assumptions and hardware ceilings from day one. Retrofitting offline capability into cloud systems will not compensate for connectivity gaps and limited hardware resources. Below, we detail five architectural patterns tailored for disconnected environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: The Drone (Self-Contained Single-Node Edge AI)
&lt;/h2&gt;

&lt;p&gt;In environments where connectivity is unavailable and operational latency cannot tolerate network round-trips, the deployment boundary collapses to a single device. Inference cannot be delegated, synchronized, or deferred. Edge devices like drones, underwater vehicles, and remote inspection robots must make decisions using only locally available compute, memory, and sensor input.&lt;/p&gt;

&lt;p&gt;This constraint defines the drone architecture. All AI logic runs on a single device, without external orchestration or cloud offloading.&lt;/p&gt;

&lt;h3&gt;
  
  
  When the device is the entire stack
&lt;/h3&gt;

&lt;p&gt;Mobile systems that must function autonomously in disconnected environments benefit most from this pattern.&lt;/p&gt;

&lt;p&gt;With no external orchestration layer, data capturing, preprocessing, inference, storage, and control logic operate within a self-contained package. This package runs on a single node without networking with other nodes or distributing model training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk0484awih8oozg6fmxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvk0484awih8oozg6fmxf.png" alt="Figure 2: Single-node drone architecture" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Onboard decision logic means edge devices can execute predefined operations even when disconnected. Once a device captures data, it filters out redundant information, retaining only relevant data for eventual manual retrieval.&lt;/p&gt;
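&lt;p&gt;One simple way to filter redundant readings on the device is a deadband filter, which stores a value only when it differs meaningfully from the last stored value. The sketch below is an illustrative assumption rather than a prescribed method: the function name &lt;code&gt;deadband_filter&lt;/code&gt;, the threshold, and the sample readings are all hypothetical.&lt;/p&gt;

```python
from typing import Iterable, List


def deadband_filter(readings: Iterable[float], threshold: float) -> List[float]:
    """Keep a reading only when it moves more than `threshold` from the last kept value."""
    kept: List[float] = []
    for value in readings:
        if not kept or abs(value - kept[-1]) > threshold:
            kept.append(value)
    return kept


samples = [20.0, 20.1, 20.05, 21.5, 21.6, 19.0, 19.2]  # hypothetical temperature samples
retained = deadband_filter(samples, threshold=0.5)
print(retained)  # [20.0, 21.5, 19.0]
```

&lt;p&gt;The filter holds only the last kept value in working state, so it suits devices with tight memory budgets. The right threshold depends on sensor noise and on how much detail later analysis needs.&lt;/p&gt;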

&lt;p&gt;Autonomous drones that perform object detection and terrain classification in mining zones cannot pause execution while awaiting external inference. The drone architecture removes network dependency by focusing on on-device inference.&lt;/p&gt;

&lt;p&gt;This makes it the most viable pattern for DDIL environments where connectivity is actively denied or degraded. Defense drones cannot assume that the network will recover or that a command signal will arrive at all. Every battlefield coordination task must be executable from the device alone.&lt;/p&gt;

&lt;p&gt;GE Aerospace, which runs &lt;a href="https://www.geaerospace.com/news/press-releases/ge-aerospace-deploys-ai-driven-inspection-tool-maximize-narrowbody-engine-time-wing?utm_source=perplexity" rel="noopener noreferrer"&gt;45,000+ commercial aircraft engines&lt;/a&gt; and captures over &lt;a href="https://www.genpact.com/case-studies/soaring-toward-safer-skies-with-remote-engine-monitoring?utm_source=perplexity" rel="noopener noreferrer"&gt;480,000 data snapshots daily per aircraft&lt;/a&gt;, implements this architecture at scale. Onboard AI models handle predictive maintenance in strict accordance with DO-178C, which requires GE Aerospace to verify every airborne system against all possible failure conditions before it ever leaves the ground. This quality assurance aligns with the drone’s architectural requirement of no external support after model deployment.&lt;/p&gt;

&lt;p&gt;Single-node local processing requires machine learning models with small footprints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimizing intelligence for the edge
&lt;/h3&gt;

&lt;p&gt;Edge devices operate within strict memory and power ceilings measured in megabytes and milliwatts. When full-precision networks exceed available RAM or energy budgets, model capacity must be optimized before inference becomes feasible.&lt;/p&gt;

&lt;p&gt;Not every edge workload needs a neural network. In constrained environments like offshore wind farms, classical statistical methods, such as &lt;a href="https://medium.com/@aausafq/draft-rethinking-ai-for-the-edge-63c073dee59a" rel="noopener noreferrer"&gt;Welford’s algorithm and linear regression, often outperform neural networks&lt;/a&gt; on streaming data processing.&lt;/p&gt;

&lt;p&gt;A microcontroller computing sensor data with Welford’s algorithm updates statistics sequentially, without retaining past data points, which keeps memory and power consumption low. Before pushing a neural network to its hardware limit, consider whether the model class itself is suitable for the use case.&lt;/p&gt;
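&lt;p&gt;A minimal Python sketch of Welford’s algorithm shows why its memory footprint stays constant. It keeps only three numbers (count, mean, and a running sum of squared deviations) no matter how many readings arrive. The class name &lt;code&gt;RunningStats&lt;/code&gt; and the sample readings are hypothetical, and a real microcontroller would more likely run the same logic in C or MicroPython.&lt;/p&gt;

```python
import math


class RunningStats:
    """Welford's online algorithm: one pass over the stream, constant memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (value - self.mean)

    def variance(self) -> float:
        # Sample variance; defined only once two readings have arrived.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    def stddev(self) -> float:
        return math.sqrt(self.variance())


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical sensor readings
    stats.update(reading)

print(stats.count, stats.mean, round(stats.variance(), 4))  # 8 5.0 4.5714
```

&lt;p&gt;A device could flag a reading as anomalous when it falls more than a chosen number of standard deviations from the running mean, without ever storing the history.&lt;/p&gt;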

&lt;p&gt;When neural networks are the right fit for the workload, quantization addresses their hardware limitations by reducing the numerical precision of their weights, biases, and activations. Downsizing from 32-bit to 8-bit shrinks model size &lt;a href="https://www.edge-ai-vision.com/2024/02/quantization-of-convolutional-neural-networks-model-quantization/" rel="noopener noreferrer"&gt;by approximately 75%&lt;/a&gt; with less than 1% accuracy loss.&lt;/p&gt;
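&lt;p&gt;The size reduction follows directly from the storage width: an 8-bit integer takes one byte where a 32-bit float takes four. The sketch below illustrates symmetric per-tensor quantization in plain Python under simplifying assumptions. The weights are hypothetical, and production converters also use per-channel scales, zero points, and calibration data, none of which are modeled here.&lt;/p&gt;

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.03, 0.5, -0.64]  # hypothetical float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

bytes_fp32 = len(weights) * 4  # 32 bits per weight
bytes_int8 = len(weights) * 1  # 8 bits per weight
reduction = 1 - bytes_int8 / bytes_fp32
print(q, f"{reduction:.0%}")  # [82, -127, 3, 50, -64] 75%
```

&lt;p&gt;The rounding error per weight is bounded by half the scale, which is why accuracy often holds up when the weight distribution is well behaved. Real accuracy impact still has to be measured on the target model and data.&lt;/p&gt;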

&lt;p&gt;Another model compression technique, pruning, eliminates redundant parameters that contribute minimally to output accuracy. Pruning an object detection model like YOLOv5 can reduce its parameter count and &lt;a href="https://dl.acm.org/doi/10.1145/3762329.3762371" rel="noopener noreferrer"&gt;computational cost by 40%&lt;/a&gt; before deployment.&lt;/p&gt;
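&lt;p&gt;As a rough illustration of the idea, the sketch below applies unstructured magnitude pruning to a single list of weights: it zeroes the fraction with the smallest absolute values. This is a simplification chosen for clarity, not the method used in the cited YOLOv5 work. The function name &lt;code&gt;magnitude_prune&lt;/code&gt;, the weights, and the 40% sparsity target are hypothetical.&lt;/p&gt;

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


layer = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, 0.3, -0.001, 0.6, 0.08]  # hypothetical weights
sparse = magnitude_prune(layer, sparsity=0.4)
print(sparse)  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.3, 0.0, 0.6, 0.08]
```

&lt;p&gt;Zeroed weights reduce compute only when the runtime exploits sparsity or when whole channels are removed (structured pruning). Pruned models are usually fine-tuned afterward to recover accuracy.&lt;/p&gt;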

&lt;p&gt;TinyML frameworks such as TensorFlow Lite for Microcontrollers, ONNX Runtime, and PyTorch Mobile support compact model deployment. The following code shows an example quantization scenario with TensorFlow Lite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Post-training quantization using TFLite converter
# Converts 32-bit floats to 8-bit integers
# Assumes X_train (NumPy array of training inputs) and saved_model_dir
# (path to a TensorFlow SavedModel) are defined earlier
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;representative_dataset&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TFLiteConverter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_saved_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;saved_model_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Optimize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;representative_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;representative_dataset&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supported_ops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpsSet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TFLITE_BUILTINS_INT8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_input_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_output_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;

&lt;span class="n"&gt;tflite_quant_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with quantization for higher speedup rates without significant accuracy loss, followed by pruning to compress the model’s size further. For the drone architecture, the target size on a single microcontroller is &amp;lt;1MB. Plumerai’s person detection model demonstrates how compression techniques can achieve this goal. Built with binarized neural networks, the model occupies &lt;a href="https://blog.plumerai.com/2021/12/datacenter-ai-on-mcu/" rel="noopener noreferrer"&gt;737KB on an ARM Cortex-M7&lt;/a&gt; microcontroller with less than 256KB of on-chip RAM.&lt;/p&gt;

&lt;p&gt;At the hardware level, energy-efficient processors such as the NVIDIA Jetson Nano, Google Edge TPU, and ARM Cortex-M, which are purpose-built for computer vision and sensor fusion workloads, execute AI models directly on edge devices. ARM Cortex-M variants deliver up to &lt;a href="https://www.digikey.com/en/articles/how-and-why-microcontrollers-can-help-democratize-access-to-edge-ai#:~:text=Machine%20learning%20applications%20running%20on,and%20hardware%20components%20for%20inferencing" rel="noopener noreferrer"&gt;600 giga-operations per second (GOPS) with an energy efficiency averaging 3 tera-operations per second per watt (TOPS/W)&lt;/a&gt;, depending on configuration.&lt;/p&gt;

&lt;p&gt;Drone deployment introduces architectural rigidity. With limited runtime intervention, the architecture must anticipate every failure state during design. DO-178C reinforces this constraint by requiring full system validation before deployment. Teams must engineer every model update and behavioral correction with no orchestration window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: The Factory (Multi-Node Edge AI With Optional Cloud)
&lt;/h2&gt;

&lt;p&gt;During network outages in manufacturing and large retail facilities, inference must continue in-house across multiple machines. The factory architecture meets this requirement by distributing AI workloads across on-premises edge clusters, keeping operational control within the facility boundary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.actian.com/blog/data-management/sync-your-data-from-edge-to-cloud-with-actian-zen-easysync/" rel="noopener noreferrer"&gt;Cloud synchronization&lt;/a&gt; remains optional, used only for model retraining or batch analytics rather than as a runtime dependency. The priority is maintaining resilience and operational independence across all nodes, regardless of network availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference stays on the factory floor
&lt;/h3&gt;

&lt;p&gt;The factory architecture centers on three components: edge gateways, compute nodes, and local storage.&lt;/p&gt;

&lt;p&gt;An edge gateway routes sensor requests to edge nodes, which pull context from local edge databases like &lt;a href="https://www.actian.com/databases/zen/" rel="noopener noreferrer"&gt;Actian Zen&lt;/a&gt;, act on model inference, and write the results back to the database. Decision-making and local computing stay on-premises. Cloud systems only handle model updates periodically or on trigger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgux0vgy4ep4eqgc4k8tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgux0vgy4ep4eqgc4k8tq.png" alt="Figure 3: The factory architecture" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;
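&lt;p&gt;The request flow through the three components can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: an in-memory SQLite database stands in for the local edge database, a fixed threshold stands in for model inference, and the names &lt;code&gt;EdgeGateway&lt;/code&gt;, &lt;code&gt;Reading&lt;/code&gt;, and &lt;code&gt;threshold_model&lt;/code&gt; are hypothetical. The context lookup that a real node would perform before inference is omitted.&lt;/p&gt;

```python
import itertools
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Reading:
    sensor_id: str
    vibration_mm_s: float


def threshold_model(reading: Reading) -> str:
    # Hypothetical stand-in for real model inference on an edge node.
    return "alert" if reading.vibration_mm_s > 7.0 else "ok"


class EdgeGateway:
    """Routes readings round-robin to edge nodes and stores verdicts locally."""

    def __init__(self, nodes: List[Callable[[Reading], str]], db: sqlite3.Connection) -> None:
        self._nodes = itertools.cycle(list(enumerate(nodes)))
        self._db = db
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results (node_id INTEGER, sensor_id TEXT, verdict TEXT)"
        )

    def handle(self, reading: Reading) -> str:
        node_id, infer = next(self._nodes)  # pick the next node in rotation
        verdict = infer(reading)            # inference happens on-premises
        self._db.execute(
            "INSERT INTO results VALUES (?, ?, ?)", (node_id, reading.sensor_id, verdict)
        )
        self._db.commit()
        return verdict


db = sqlite3.connect(":memory:")  # in-memory SQLite stands in for the local edge database
gateway = EdgeGateway([threshold_model, threshold_model], db)
readings = [Reading("press-1", 3.2), Reading("press-2", 9.1), Reading("press-3", 6.9)]
verdicts = [gateway.handle(r) for r in readings]
rows = db.execute("SELECT node_id, sensor_id, verdict FROM results ORDER BY rowid").fetchall()
print(rows)  # [(0, 'press-1', 'ok'), (1, 'press-2', 'alert'), (0, 'press-3', 'ok')]
```

&lt;p&gt;Nothing in this loop touches the network, which is the point of the pattern: a cloud outage changes when models get updated, not whether inference runs.&lt;/p&gt;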

&lt;p&gt;Industrial environments generate continuous, high-volume telemetry data from sensors, controllers, and inspection systems. Distributing inference across multiple edge nodes maintains high inference throughput. But without a local orchestration layer managing distribution and the model lifecycle, edge nodes operate as isolated processors rather than a coordinated system.&lt;/p&gt;

&lt;p&gt;K3s, AWS IoT Greengrass, Azure IoT Edge, and Siemens Industrial Edge are popular orchestration tools for managing edge clusters. Each differs in how it handles model deployment and node management.&lt;/p&gt;

&lt;p&gt;K3s deploys containerized models as clusters of worker nodes with a control plane for health visibility. Configuring its datastore endpoint parameter enables teams to store local data in on-premises databases like PostgreSQL and Actian Zen, replacing the default SQLite. &lt;a href="https://dok.community/blog/persistence-at-the-edge/" rel="noopener noreferrer"&gt;Chick-fil-A&lt;/a&gt; uses K3s at the edge to process point-of-sale transactions across 3,000+ restaurants.&lt;/p&gt;

&lt;p&gt;AWS IoT Greengrass deploys cloud-compiled AI models as components with predefined inference functions to &lt;a href="https://aws.amazon.com/blogs/aws/new-machine-learning-inference-at-the-edge-using-aws-greengrass/#:~:text=Industrial%20Maintenance%20%E2%80%93%20Smart%2C%20local%20monitoring,predict%20failures%2C%20detect%20faulty%20equipment.&amp;amp;text=There%20are%20several%20different%20aspects,with%20a%20couple%20of%20clicks:" rel="noopener noreferrer"&gt;NVIDIA Jetson TX2, Intel Atom boards, and Raspberry Pi-powered devices&lt;/a&gt;. Inference remains on-premises, with data exported optionally to AWS IoT Core for model optimization. Pfizer manufacturing sites use &lt;a href="https://aws.amazon.com/blogs/industries/pfizer-boosts-bioreactor-efficiency-with-aws-industrial-edge-services/" rel="noopener noreferrer"&gt;AWS IoT Greengrass&lt;/a&gt; for near-real-time bioreactor monitoring to minimize contamination risk.&lt;/p&gt;

&lt;p&gt;Siemens Industrial Edge deploys Docker-containerized models directly on the shop floor, delivering &lt;a href="https://blog.siemens.com/2024/05/enhancing-productivity-with-siemens-industrial-edge/" rel="noopener noreferrer"&gt;real-time machine status&lt;/a&gt;. Siemens Electronics Factory Erlangen &lt;a href="https://aws.amazon.com/partners/success/siemens-electronics-factory-erlangen-siemens/" rel="noopener noreferrer"&gt;reduced model deployment time by 80% and false anomaly detection on printed circuit boards (PCBs) by 50%&lt;/a&gt; using this orchestrator. By running inference on PCB images locally and outsourcing only model retraining to the cloud, the factory has cut data storage costs by 90%.&lt;/p&gt;

&lt;p&gt;Azure IoT Edge uses a JSON deployment manifest to specify which containerized models to download to edge devices. Data processing happens at the edge with Azure IoT Hub providing centralized oversight while the devices maintain autonomy. &lt;a href="https://www.microsoft.com/en/customers/story/1601901070675086388-thomas-concrete-group-discrete-manufacturing-azure-en-united-states" rel="noopener noreferrer"&gt;Thomas Concrete Group&lt;/a&gt; uses Azure IoT Edge to collect data from sensors embedded in wet concrete, estimate the concrete’s hardening timeline, and send predictions to Azure IoT Hub.&lt;/p&gt;

&lt;p&gt;The table below highlights the differences between each orchestrator.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;K3s&lt;/th&gt;
&lt;th&gt;Azure IoT Edge&lt;/th&gt;
&lt;th&gt;AWS IoT Greengrass&lt;/th&gt;
&lt;th&gt;Siemens Industrial Edge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node management&lt;/td&gt;
&lt;td&gt;Manages nodes via a lightweight control plane&lt;/td&gt;
&lt;td&gt;Manages nodes remotely through Azure IoT Hub&lt;/td&gt;
&lt;td&gt;Manages nodes via AWS IoT Core&lt;/td&gt;
&lt;td&gt;Manages nodes via the Siemens Industrial Edge Management platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model deployment&lt;/td&gt;
&lt;td&gt;Deploys models as Kubernetes pods using standard container images&lt;/td&gt;
&lt;td&gt;Configures deployments via a JSON manifest that defines which modules, containing the trained models, run on which nodes&lt;/td&gt;
&lt;td&gt;Deploys models as components with predefined inference functions&lt;/td&gt;
&lt;td&gt;Deploys models directly on shop floors as Docker containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud integration&lt;/td&gt;
&lt;td&gt;Can be integrated with a central infrastructure&lt;/td&gt;
&lt;td&gt;Supported via Azure IoT Hub&lt;/td&gt;
&lt;td&gt;Integrates with AWS IoT Core&lt;/td&gt;
&lt;td&gt;Supports integration with AWS services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When the OT network is the security boundary
&lt;/h3&gt;

&lt;p&gt;Industrial companies converge their IT and operational technology (OT) networks to support on-premises AI and IoT integrations. But this convergence expands their attack surface. &lt;a href="https://zeronetworks.com/blog/ot-security-trends-2025-escalating-threats-evolving-tactics" rel="noopener noreferrer"&gt;75% of OT attacks&lt;/a&gt; originate in IT environments, and &lt;a href="https://techinformed.com/manufacturers-face-losses-up-to-2m-cyberattack/" rel="noopener noreferrer"&gt;80% of manufacturers&lt;/a&gt; report increasing security threats across their IT/OT networks.&lt;/p&gt;

&lt;p&gt;For teams considering factory deployment for industrial systems, network segmentation must become a top priority. Edge AI solutions should operate solely within the OT network in compliance with the Purdue model. Sensitive data and inference stay close to the machines, sensors, and Programmable Logic Controllers (PLCs) that need them. This security boundary minimizes lateral movement of threats from the IT network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: Hierarchical Federated Learning (Client-Edge-Cloud)
&lt;/h2&gt;

&lt;p&gt;Hierarchical federated learning (HFL) builds on a three-layer infrastructure for teams navigating data mobility restrictions at the edge.&lt;/p&gt;

&lt;p&gt;At the lowest layer, client devices perform local training, optimizing model parameters through local gradient descent. Edge servers at the intermediate layer aggregate updated model weights from all client devices for statistical coherence. A final aggregation round by a cloud server marks the top layer, producing a global model that the edge servers distribute back to the client devices. Since only parameter updates traverse this hierarchy, intermittent connectivity does not halt training progress.&lt;/p&gt;

&lt;p&gt;The image below captures this iteration, which continues until the global model reaches the desired accuracy or converges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz5qtq1yo1a1gu8yxgbb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz5qtq1yo1a1gu8yxgbb.png" alt="Figure 4: Hierarchical federated learning architecture" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Domains such as healthcare and financial services, where raw data is bound to its origin by privacy constraints, regulatory requirements, and bandwidth limitations, are ideal HFL use cases. Data sovereignty mandates and &lt;a href="https://www.foreignaffairs.com/united-states/ai-divide" rel="noopener noreferrer"&gt;geopolitical tensions&lt;/a&gt; add another layer to this constraint, restricting where and how data flows at the infrastructure level.&lt;/p&gt;

&lt;p&gt;A study by BARC found that &lt;a href="https://barc.com/the-great-cloud-reversal/" rel="noopener noreferrer"&gt;19% of companies&lt;/a&gt; plan to increase their on-premises investments, driven by this need for data sovereignty. HFL allows a shared model to improve across distributed nodes without the underlying data ever crossing a jurisdictional boundary.&lt;/p&gt;

&lt;p&gt;A recent HFL training experiment in healthcare achieved &lt;a href="https://www.accscience.com/journal/AIH/articles/online_first/5141" rel="noopener noreferrer"&gt;94.23% accuracy&lt;/a&gt; on a Modified National Institute of Standards and Technology (MNIST) dataset, while keeping data on client devices. Only relevant aggregated information ever reaches the cloud to preserve privacy and curtail data leakage risks.&lt;/p&gt;

&lt;p&gt;In a healthcare deployment, wearable devices (lowest layer) train on their own readings and transmit only model updates to a hospital’s local edge server (intermediate layer). The edge server aggregates updates from multiple wearables and sends the result to a regional research institution (top layer) for final aggregation without exposing patient data.&lt;/p&gt;

&lt;p&gt;HFL is the most complex pattern to implement. Tooling support remains fragmented, and unlike other patterns discussed, it currently lacks native support within the Actian ecosystem. Teams should weigh this implementation overhead before committing to this architecture.&lt;/p&gt;

&lt;p&gt;The HFL architecture has three variants depending on which layer orchestrates data decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cloud-orchestrated hierarchical federated learning
&lt;/h3&gt;

&lt;p&gt;The central cloud server coordinates the training process, client-edge communications, synchronization schedules, and the overall topology, with no additional aggregation rounds from the edge servers.&lt;/p&gt;

&lt;p&gt;Cloud-orchestrated HFL fits financial institutions, where reliable connectivity can sustain the coordination loop. In a fraud detection deployment, multiple banking institutions might train models using transaction data and send updates to the cloud, which aggregates, validates, and redistributes the improved model to the banks.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Edge-orchestrated hierarchical federated learning
&lt;/h3&gt;

&lt;p&gt;Edge servers autonomously manage local client assignments, aggregating client updates to produce a locally improved model without cloud round-trips. The cloud steps in only at intervals for bulk model retraining. Environments like offshore wind farms, where unstable connectivity is the baseline, benefit most from this variant. Turbines send model updates to a local edge server, which handles aggregation and independent model improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Peer-to-peer aggregation
&lt;/h3&gt;

&lt;p&gt;This variant focuses on a gossip-like model with no central orchestrator. Clients exchange their model weights with other nodes, reducing gradient conflicts under heterogeneous data.&lt;/p&gt;

&lt;p&gt;Where the core HFL pattern reduces cloud ingress fees through aggregated updates, peer-to-peer aggregation keeps both training and aggregation within participating nodes. In distributed environments like smart cities, traffic sensors exchange anomaly-detection updates directly with neighboring devices until they converge on an improved model across the network organically.&lt;/p&gt;

&lt;p&gt;All three variants differ in their functional requirements, highlighted in the table below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cloud-orchestrated&lt;/th&gt;
&lt;th&gt;Edge-orchestrated&lt;/th&gt;
&lt;th&gt;Peer-to-peer aggregation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration model&lt;/td&gt;
&lt;td&gt;Cloud coordinates all aggregation and model distribution&lt;/td&gt;
&lt;td&gt;Edge server aggregates locally, syncs with cloud periodically&lt;/td&gt;
&lt;td&gt;No orchestrator; updates propagate between clients until convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy level&lt;/td&gt;
&lt;td&gt;Medium; the cloud controls model updates&lt;/td&gt;
&lt;td&gt;High; raw data remains on local edge servers&lt;/td&gt;
&lt;td&gt;High; no central point oversees aggregated updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bandwidth requirements&lt;/td&gt;
&lt;td&gt;High; all updates are sent to the cloud&lt;/td&gt;
&lt;td&gt;Medium; only aggregated updates reach cloud&lt;/td&gt;
&lt;td&gt;Low; updates only travel between neighboring peers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disconnection tolerance&lt;/td&gt;
&lt;td&gt;Low; cloud disconnection breaks coordination&lt;/td&gt;
&lt;td&gt;High; edge server operates independently during outages&lt;/td&gt;
&lt;td&gt;Medium; network partitions slow convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;HFL’s layered infrastructure supports large-scale model training by distributing computation and communication across multiple nodes in the hierarchy. The challenge with this multi-tier design lies in navigating communication overhead, stale global models, and node reconfigurations.&lt;/p&gt;

&lt;p&gt;In HFL, communication cost is directly proportional to the model update size. Gradient compression techniques such as random sparsification and stochastic rounding shrink update payloads by &lt;a href="https://www.scirp.org/journal/paperinformation?paperid=133610" rel="noopener noreferrer"&gt;up to 98%&lt;/a&gt; before transmission.&lt;/p&gt;
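&lt;p&gt;The two techniques can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming a keep probability of 0.1 and a quantization step of 0.01; both values and all function names are hypothetical, and the compression ratio you get depends on the settings you choose.&lt;/p&gt;

```python
import math
import random


def random_sparsify(update, keep_prob, rng):
    """Keep each coordinate with probability keep_prob and rescale survivors
    by 1 / keep_prob so the sparse update is unbiased in expectation."""
    return [
        value / keep_prob if keep_prob > rng.random() else 0.0
        for value in update
    ]


def stochastic_round(value, step, rng):
    """Round value to a multiple of step, rounding up with probability equal
    to the fractional remainder so the result is unbiased in expectation."""
    scaled = value / step
    lower = math.floor(scaled)
    fraction = scaled - lower
    rounded = lower + 1 if fraction > rng.random() else lower
    return rounded * step


def compress(update, keep_prob=0.1, step=0.01, seed=0):
    """Return the payload to transmit: (index, quantized value) pairs."""
    rng = random.Random(seed)
    sparse = random_sparsify(update, keep_prob, rng)
    # Only coordinates that survived sparsification are transmitted.
    return [
        (i, stochastic_round(v, step, rng))
        for i, v in enumerate(sparse)
        if v != 0.0
    ]


update = [0.5] * 1000  # stand-in for a flattened gradient
payload = compress(update)
print(len(payload), "of", len(update), "coordinates transmitted")
```

&lt;p&gt;Both steps trade per-update precision for payload size, so the receiver should average many compressed updates before the bias-free property shows its benefit.&lt;/p&gt;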

&lt;p&gt;The asynchronous update cycle of HFL, where the global model incorporates client updates as they arrive, also amplifies the likelihood of stale model parameters. Weighted aggregation limits the influence of stale updates, preventing slower devices from degrading the global model.&lt;/p&gt;
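&lt;p&gt;One way to picture weighted aggregation is a staleness discount applied before each client update is blended into the global model. The sketch below assumes a polynomial decay, which is only one possible choice; the &lt;code&gt;decay&lt;/code&gt; and &lt;code&gt;mix&lt;/code&gt; parameters and the function names are hypothetical.&lt;/p&gt;

```python
def staleness_weight(staleness, decay=0.5):
    """Down-weight an update computed `staleness` global rounds ago.

    Polynomial decay is an assumption here; other discount curves work too.
    """
    return (1.0 + staleness) ** (-decay)


def aggregate(global_weights, updates, current_round, mix=0.5):
    """Blend client updates into the global model, discounting stale ones.

    updates: list of (client_weights, round_the_client_started_from).
    """
    new_weights = list(global_weights)
    for client_weights, base_round in updates:
        # A fresh update (staleness 0) gets the full mixing rate.
        alpha = mix * staleness_weight(current_round - base_round)
        new_weights = [
            (1.0 - alpha) * g + alpha * c
            for g, c in zip(new_weights, client_weights)
        ]
    return new_weights


fresh = aggregate([0.0], [([1.0], 10)], current_round=10)
stale = aggregate([0.0], [([1.0], 1)], current_round=10)
print(fresh, stale)  # the stale update moves the global model less
```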

&lt;p&gt;Topology shifts add another challenge. Clients get reassigned to different edge servers, roles shift between client and aggregator nodes, and new devices join mid-training. Each reconfiguration stalls convergence and degrades accuracy if new edge servers lack prior training history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 4: Store-and-Forward Disconnected Inference
&lt;/h2&gt;

&lt;p&gt;In disconnected environments, connectivity outages can stretch for hours or days. Store-and-forward architecture accounts for this reality, sustaining large-scale data processing and storage during downtime, and forwarding summaries to the cloud once the system reconnects.&lt;/p&gt;

&lt;p&gt;For industrial automation environments, such as remote oil and gas operations and maritime vessels operating miles from cellular towers, this architecture solves the core problem of maintaining data continuity despite network disruption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference doesn’t wait for the cloud
&lt;/h3&gt;

&lt;p&gt;Store-and-forward deployment follows a hybrid approach. Training begins in the cloud, but execution shifts to the edge after model deployment. When connectivity drops, decision-making, control loops, and alarm triggers continue locally without interruption, and the system buffers timestamped results to a local edge database until synchronization resumes.&lt;/p&gt;

&lt;p&gt;Upon network restoration, the edge gateway offloads all buffered events to a central cloud infrastructure, providing the data required to push updated models and optimize AI pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c1nm6j4mhgz26rj5mt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c1nm6j4mhgz26rj5mt4.png" alt="Figure 5: Store-and-forward architecture" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Store-and-forward architecture creates a feedback loop that prevents data loss during disconnection. In manufacturing plants, SCADA systems continue collecting data from PLCs, Remote Terminal Units (RTUs), and edge gateways until connection resumes.&lt;/p&gt;
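&lt;p&gt;The buffering behavior described above can be sketched with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module standing in for the local edge database. The class name, table layout, and the &lt;code&gt;send&lt;/code&gt; callback are hypothetical; a real deployment would substitute its own edge database and uplink client.&lt;/p&gt;

```python
import sqlite3
import time


class StoreAndForwardBuffer:
    """Buffer inference results locally and forward them in timestamp order."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, ts REAL, payload TEXT)"
        )

    def store(self, payload, ts=None):
        """Persist one timestamped result while the uplink is down."""
        ts = time.time() if ts is None else ts
        self.db.execute(
            "INSERT INTO results (ts, payload) VALUES (?, ?)", (ts, payload)
        )
        self.db.commit()

    def forward(self, send):
        """Send buffered rows oldest first; delete a row only after send() succeeds."""
        rows = self.db.execute(
            "SELECT id, ts, payload FROM results ORDER BY ts, id"
        ).fetchall()
        sent = 0
        for row_id, ts, payload in rows:
            if not send(ts, payload):
                break  # Link dropped again: keep remaining rows for the next attempt.
            self.db.execute("DELETE FROM results WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent


buffer = StoreAndForwardBuffer()
buffer.store("pump-7 vibration anomaly", ts=2.0)
buffer.store("pump-7 normal", ts=1.0)

delivered = []

def fake_uplink(ts, payload):
    """Stand-in for a real cloud client; always succeeds."""
    delivered.append((ts, payload))
    return True

print(buffer.forward(fake_uplink), "results forwarded")
```

&lt;p&gt;Deleting a row only after a successful send means a mid-sync disconnection can cause a duplicate delivery but not a lost result, so the receiving side should deduplicate.&lt;/p&gt;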

&lt;h3&gt;
  
  
  When the data finally moves
&lt;/h3&gt;

&lt;p&gt;The “forward” part of this architecture relies on lightweight communication protocols like Message Queuing Telemetry Transport (MQTT), designed for unstable networks and bandwidth-limited environments.&lt;/p&gt;

&lt;p&gt;MQTT’s publish-subscribe model routes queued updates from edge gateways to the cloud through brokers like Mosquitto. Publishers (sensors) send messages to a topic (temperature), and subscribers (cloud servers) receive messages from their registered topics. Messages replay in the exact chronological order they were received.&lt;/p&gt;

&lt;p&gt;The Python code snippet below illustrates a starting-point implementation using the Paho MQTT library. It publishes with Quality of Service (QoS) 1, which provides at-least-once delivery to the broker. Paired with a subscriber that holds a persistent session, this lets Mosquitto queue messages while the subscriber is offline.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install paho-mqtt
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;paho.mqtt.publish&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;publish&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: publisher.py &amp;lt;topic&amp;gt; &amp;lt;message&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Production code will add retry logic, local queue persistence, and message deduplication
&lt;/span&gt;
&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



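&lt;p&gt;To receive queued messages after reconnection, the subscriber script below requests a persistent session by pairing a fixed client ID with &lt;code&gt;clean_session=False&lt;/code&gt;, while &lt;code&gt;loop_forever()&lt;/code&gt; keeps the network loop running and reconnects after a drop. The callback signatures follow the paho-mqtt 1.x API; paho-mqtt 2.x additionally expects a &lt;code&gt;callback_api_version&lt;/code&gt; argument when constructing the client.&lt;/p&gt;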

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;paho.mqtt.client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: subscriber.py &amp;lt;topic&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test-client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected with result code &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clean_session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_connect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_connect&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_message&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1883&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loop_forever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store-and-forward architecture can introduce data replication inconsistencies during gateway synchronization. The system requires an arbitration policy, such as last-write-wins, which applies changes based on each update’s timestamp. When timestamps are identical, data structures like Conflict-free Replicated Data Types (CRDTs) merge copies to achieve a consistent final state across all edge gateways.&lt;/p&gt;

&lt;p&gt;Delta sync further improves CRDTs’ results. Where full dataset replication triggers on every record change, delta sync resolves conflicts at the property level, addressing only the modified fields.&lt;/p&gt;
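&lt;p&gt;A minimal sketch of property-level conflict resolution follows. It assumes each field carries a (timestamp, node ID, value) entry and that ties on timestamp are broken by node ID, which is one simple way to get a deterministic merge; the record layout, gateway names, and function names are hypothetical.&lt;/p&gt;

```python
def merge_records(local, remote):
    """Merge two replicas of one record field by field (last-write-wins).

    Each replica maps a field name to (timestamp, node_id, value). Comparing
    (timestamp, node_id) breaks timestamp ties deterministically, so replicas
    reach the same state regardless of merge order.
    """
    merged = dict(local)
    for field, entry in remote.items():
        if field not in merged or entry[:2] > merged[field][:2]:
            merged[field] = entry
    return merged


def delta(previous, current):
    """Return only the fields that changed, which is all delta sync transmits."""
    return {f: e for f, e in current.items() if previous.get(f) != e}


gateway_a = {"temp": (100, "a", 71.5), "status": (100, "a", "ok")}
gateway_b = {"temp": (105, "b", 73.0), "status": (100, "b", "warn")}

merged = merge_records(gateway_a, gateway_b)
print(merged)
print(delta(gateway_a, merged))  # fields gateway A still needs to apply
```

&lt;p&gt;Merging in either direction yields the same record here, which is the convergence property the arbitration policy is meant to provide.&lt;/p&gt;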

&lt;h2&gt;
  
  
  Pattern 5: The Network (Distributed Edge-to-Edge Fabric)
&lt;/h2&gt;

&lt;p&gt;The network deployment pattern addresses the lack of fault tolerance and distributed processing prevalent in disconnected multi-site operations such as logistics networks and smart grids.&lt;/p&gt;

&lt;p&gt;Coordinating edge devices across multiple locations through a cloud system quickly breaks outside network coverage. This is why the network architecture follows an east-west communication pattern, enabling edge nodes to exchange data directly with peers without central coordination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mesh communication handles distributed intelligence
&lt;/h3&gt;

&lt;p&gt;The network deployment pattern adopts a non-hierarchical design, connecting multiple IoT devices through a mesh network to improve system uptime during outages. Each node dynamically communicates with its neighbors, forming a bidirectional network that relays data to remote environments via multi-hop paths.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8nqyrfz1bs8ocxvi5kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8nqyrfz1bs8ocxvi5kx.png" alt="Figure 6: Network architecture" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cloud only joins as a peer for optional sync, but core computing remains on the network, working without centralized control.&lt;/p&gt;

&lt;p&gt;Smart grids, where &lt;a href="https://www.ericsson.com/en/blog/2021/10/wireless-for-power-grids" rel="noopener noreferrer"&gt;teleprotection demands 10–20ms latency&lt;/a&gt;, are well suited to this architecture. A network of transmission substations continuously tracks electricity flow and consumption patterns in real time to detect imbalances before they escalate. That real-time visibility supports dynamic load redistribution and &lt;a href="https://www.sciencedirect.com/topics/engineering/autonomous-microgrid" rel="noopener noreferrer"&gt;autonomous microgrid management&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Military uncrewed aerial vehicles (UAVs) are another use case. When GPS fails in DDIL environments, &lt;a href="https://www.bluwireless.com/insight/gps-denied-drone-communications/" rel="noopener noreferrer"&gt;UAVs relay ISR data&lt;/a&gt; between each other through mesh networks. Adaptive interference routing ensures reliable data flow, while line-of-sight transmission reduces latency.&lt;/p&gt;

&lt;p&gt;This deployment pattern optimizes for network redundancy. Gossip protocol and distributed consensus algorithms like Raft eliminate single points of failure. When a node loses connection, the network remains operational, rerouting its data through other nodes.&lt;/p&gt;
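&lt;p&gt;The rerouting behavior can be illustrated with a breadth-first search over the mesh. This is a simplified sketch, assuming every node knows the full link map; the four-node topology and the &lt;code&gt;find_route&lt;/code&gt; helper are hypothetical, and real mesh protocols discover routes incrementally.&lt;/p&gt;

```python
from collections import deque


def find_route(links, source, target, failed=frozenset()):
    """Return the shortest multi-hop path from source to target,
    skipping failed nodes, or None when no path survives."""
    if source in failed or target in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in links.get(node, ()):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical four-node mesh with two independent paths from A to D.
mesh = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

print(find_route(mesh, "A", "D"))                # normal route
print(find_route(mesh, "A", "D", failed={"B"}))  # route after node B drops out
```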

&lt;p&gt;Gossip protocol enables live peer discovery through continuous, lightweight information exchanges. Each node always has a current view of its local network. Raft follows a leader-based approach where an elected leader node handles all writes, and log replication ensures follower nodes maintain a shared state. Edge databases replicate data across multiple nodes to improve consistency.&lt;/p&gt;
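&lt;p&gt;To get a feel for how quickly gossip spreads information, the simulation below counts push-gossip rounds until every node has seen an update. It is a toy model, assuming each informed node contacts a fixed number of random peers per round; the &lt;code&gt;fanout&lt;/code&gt; value, seed, and function name are hypothetical.&lt;/p&gt;

```python
import random


def gossip_rounds(node_count, fanout=2, seed=0):
    """Count push-gossip rounds until every node has seen an update that
    starts on node 0. Each informed node tells `fanout` random peers per round."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) != node_count:
        for node in list(informed):
            peers = [n for n in range(node_count) if n != node]
            informed.update(rng.sample(peers, min(fanout, len(peers))))
        rounds += 1
    return rounds


for size in (10, 100, 1000):
    print(size, "nodes:", gossip_rounds(size), "rounds")
```

&lt;p&gt;The number of informed nodes can at most triple per round with a fanout of two, so the round count grows far more slowly than the node count.&lt;/p&gt;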

&lt;p&gt;Treating Gossip and Raft as competing options overlooks what actually matters. The focus should be on understanding where each sits in the CAP theorem and the trade-offs they introduce to a distributed network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The consistency vs. availability trade-off
&lt;/h3&gt;

&lt;p&gt;When network partitions split the mesh, Raft ensures strong data consistency, while Gossip provides availability fallback and eventual consistency when paired with approaches like CRDTs.&lt;/p&gt;

&lt;p&gt;In edge computing, where connectivity is limited and nodes are numerous, partition tolerance is non-negotiable. Edge AI systems must choose whether to prioritize consistency or availability when implementing the network architecture.&lt;/p&gt;

&lt;p&gt;Prioritizing availability is often the better choice, as edge nodes continue to function independently after disconnection. Consistency-focused designs like Raft risk write suspensions and stale reads during network partitions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Raft&lt;/th&gt;
&lt;th&gt;Gossip&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Leader election and log replication&lt;/td&gt;
&lt;td&gt;Peer-to-peer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Moderate; each write waits for acknowledgment from a quorum of nodes&lt;/td&gt;
&lt;td&gt;Low per message, but full propagation takes multiple gossip rounds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistency guarantees&lt;/td&gt;
&lt;td&gt;Strong consistency&lt;/td&gt;
&lt;td&gt;Eventual consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partition tolerance&lt;/td&gt;
&lt;td&gt;Moderate; only a partition that retains a quorum keeps accepting writes&lt;/td&gt;
&lt;td&gt;High; heals partitions faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
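&lt;p&gt;The quorum rule behind Raft’s behavior during a partition fits in a few lines. The sketch below is a simplification that only checks the majority condition, not a Raft implementation; the function name and the five-node example are hypothetical.&lt;/p&gt;

```python
def can_commit(reachable_nodes, cluster_size):
    """A Raft-style leader commits a write only when a majority of the
    cluster (the leader included) can acknowledge it."""
    majority = cluster_size // 2 + 1
    return reachable_nodes >= majority


# A five-node mesh splits into partitions of three and two nodes.
print(can_commit(3, 5))  # majority side keeps accepting writes
print(can_commit(2, 5))  # minority side must reject writes
```

&lt;p&gt;A gossip-based design has no such gate: both partitions keep accepting updates and reconcile once the mesh heals, which is where eventual consistency comes from.&lt;/p&gt;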

&lt;p&gt;Speed and data delivery trade-offs are another critical constraint of the network architecture. Mesh networking adds latency with each hop as the node count increases. Whether your system needs data back in &amp;lt;50ms or can tolerate &amp;gt;100ms should shape your design decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Edge AI Deployment Pattern
&lt;/h2&gt;

&lt;p&gt;There’s no specific “right” edge AI deployment pattern for disconnected environments. A solid architecture implementation begins with a clear grasp of the specific constraints, goals, and characteristics of your target application. This means envisioning the full workload lifecycle, including connectivity profile, available compute resources, and latency requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Evaluate network stability
&lt;/h3&gt;

&lt;p&gt;Network stability is the primary driver of any edge AI deployment strategy. Determine how much resilience must be engineered into the edge nodes based on the expected duration of disconnection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If the system is always disconnected&lt;/strong&gt;: Use the drone or network architecture, as both are designed to operate completely offline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the interruption persists for only minutes or hours&lt;/strong&gt;: Use factory or HFL architecture to continue data aggregation and inference without interruption. The system remains functional during the outage because all required dependencies already exist within the operational perimeter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If disconnection lasts for days or weeks&lt;/strong&gt;: Use the store-and-forward architecture to buffer inference results and operational data locally until the next scheduled connectivity window opens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Assess latency requirements
&lt;/h3&gt;

&lt;p&gt;Define the maximum acceptable latency for your specific application by considering network hops, node availability, and geographical proximity of the edge nodes. The thresholds below reflect typical deployment patterns. Validate them against your specific hardware and network conditions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If the system requires &amp;lt;50ms latency&lt;/strong&gt;: Use the drone deployment pattern. Its single-node architecture keeps inference directly on sensors, cameras, or gateways, enabling near-real-time responses. Factory architecture also minimizes latency by running on edge servers within the same facility or on the factory floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the system requires &amp;lt;100ms latency&lt;/strong&gt;: Use the network or HFL architecture to distribute model improvement workloads across multiple nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If &amp;lt;500ms latency is acceptable&lt;/strong&gt;: Use store-and-forward architecture for non-critical IoT data that requires batch processing or long-term analytics. It batch-offloads data-intensive tasks to the cloud.&lt;/li&gt;
&lt;/ul&gt;
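
&lt;p&gt;The latency tiers above can be sketched as a small lookup. This is only an illustration: the function name, the pattern labels, and the choice to return nothing for bounds looser than 500ms are assumptions, and the thresholds should still be validated against your own hardware and network.&lt;/p&gt;

```python
from bisect import bisect_left

# Upper latency bounds (ms) for each tier, taken from the list above.
TIER_BOUNDS_MS = [50, 100, 500]
TIER_PATTERNS = [
    ("drone", "factory"),
    ("network", "hfl"),
    ("store-and-forward",),
]


def patterns_for_latency(required_bound_ms):
    """Return candidate patterns for a required latency bound in milliseconds."""
    tier = bisect_left(TIER_BOUNDS_MS, required_bound_ms)
    # Bounds looser than 500 ms fall outside the tiers listed above.
    if tier == len(TIER_PATTERNS):
        return ()
    return TIER_PATTERNS[tier]


print(patterns_for_latency(50))
print(patterns_for_latency(80))
print(patterns_for_latency(400))
```

&lt;p&gt;Keeping the bounds in one sorted list makes it easy to adjust them after you measure real latency in your environment.&lt;/p&gt;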

&lt;h3&gt;
  
  
  3. Evaluate resource constraints
&lt;/h3&gt;

&lt;p&gt;Edge AI applications differ in processing power, storage, and bandwidth consumption, which impacts inference speed, data aggregation, and real-time analytics. Evaluate each resource limit independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Power constraint&lt;/strong&gt;: For compute power &amp;lt;1 GFLOPS, common in microcontrollers used for sensor inference, the drone architecture is most suitable. It runs on constrained IoT devices using lightweight, inference-only models. At 10–100 GFLOPS, common in edge gateways, HFL and network architectures become more effective because they handle data aggregation well at this level. For edge GPU clusters that scale to &amp;gt;10 TFLOPS, factory and store-and-forward architectures support clustered inference pipelines, since they run on-premises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth constraint&lt;/strong&gt;: Use store-and-forward architecture or HFL to store and process raw, high-volume data at the edge, forwarding only summarized updates to the cloud if required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data storage constraint&lt;/strong&gt;: Use factory or store-and-forward architectures paired with &lt;a href="https://www.actian.com/blog/data-warehouse/embedded-databases-iot-use-cases/" rel="noopener noreferrer"&gt;embedded databases&lt;/a&gt; to store time-series data locally and scale vertically within the facility. Databases like Actian Zen are optimized for edge AI use cases and can also sync with the cloud once connectivity is restored.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Consider a hybrid approach
&lt;/h3&gt;

&lt;p&gt;Industrial systems often combine the strengths of multiple architectures into a coordinated system that delivers resilience and flexibility. Rio Tinto’s mining operations illustrate what hybrid deployment looks like at scale.&lt;/p&gt;

&lt;p&gt;At the Greater Nammuldi iron ore mine, more than &lt;a href="https://www.bbc.com/news/articles/cgej7gzg8l0o" rel="noopener noreferrer"&gt;50 autonomous trucks&lt;/a&gt; operate on predefined routes, using onboard sensors to detect obstacles, an example of the &lt;strong&gt;drone architecture&lt;/strong&gt;. Across 17 sites in Western Australia, these trucks transmit operational data to Rio Tinto’s Operations Centre in Perth, reflecting the &lt;strong&gt;network architecture&lt;/strong&gt;. Finally, an autonomous rail system transports mined ore, synchronizing with the Operations Centre upon reaching port facilities. This fits the &lt;strong&gt;store-and-forward architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rio Tinto demonstrates that deployment patterns are not mutually exclusive. If your use case requires multiple architectures, consider running them on the layer of the system where they’re best suited, rather than forcing a single architecture across the entire operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznn34a2gfwqdha3e5zmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznn34a2gfwqdha3e5zmg.png" alt="Figure 7: Decision framework for choosing an edge AI architecture" width="800" height="1688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following table maps specific deployment scenarios to their optimal disconnected edge AI deployment pattern to inform your decision.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment scenarios&lt;/th&gt;
&lt;th&gt;Recommended pattern&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous inspection drones over oil fields or offshore wind farms&lt;/td&gt;
&lt;td&gt;Drone (single-node self-contained)&lt;/td&gt;
&lt;td&gt;A self-contained inference runtime with embedded local storage eliminates distributed computation to meet hardware limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automotive assembly lines running defect detection models&lt;/td&gt;
&lt;td&gt;Factory (multi-node edge AI)&lt;/td&gt;
&lt;td&gt;Cloud dependency is too risky for uptime requirements, so edge clusters run within the facility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hospital networks where patient data cannot leave individual facilities under HIPAA&lt;/td&gt;
&lt;td&gt;Hierarchical federated learning&lt;/td&gt;
&lt;td&gt;Models train locally and share only weight updates with the cloud, so raw data remains on the local site in compliance with data sovereignty and privacy requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cargo vessels at sea syncing operational data at port&lt;/td&gt;
&lt;td&gt;Store-and-forward&lt;/td&gt;
&lt;td&gt;A local buffer ensures no inference result or operational event is lost across connectivity gaps that can last days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart city traffic management across distributed intersections with no central server dependency&lt;/td&gt;
&lt;td&gt;Network (distributed edge-to-edge fabric)&lt;/td&gt;
&lt;td&gt;Nodes communicate peer-to-peer via consensus, so node loss reduces capacity without disrupting overall network operation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Industries operating across remote, underground, maritime, and geographically dispersed terrain need edge-native architectures that capture real-time insights and keep critical assets running without cloud dependency.&lt;/p&gt;

&lt;p&gt;The deployment patterns discussed prioritize what matters most for disconnected environments: local inference, no centralization latency, lower communication costs, and system autonomy.&lt;/p&gt;

&lt;p&gt;Before committing to a pattern, validate three things in your own environment: how long your system can tolerate network outage before data loss becomes operationally significant, whether your edge hardware can sustain the compute demands of your chosen architecture without degrading inference quality, and whether your team has the tooling maturity to manage model lifecycle at the edge without cloud dependency. Map your constraints against the decision framework above.&lt;/p&gt;

&lt;p&gt;The right answer might not be a single pattern. Layer in hybrid approaches only when the resilience gains justify the operational complexity.&lt;/p&gt;

&lt;p&gt;Each pattern depends on a data infrastructure that can operate, store, and sync entirely at the edge. For teams that need to go beyond structured storage and perform semantic search on their local data without exporting vector embeddings to a cloud server, &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; is optimized for this use case. &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Start for free&lt;/a&gt; today.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Join the &lt;a href="https://discord.gg/432A2M63Py" rel="noopener noreferrer"&gt;Actian community on Discord&lt;/a&gt; to discuss edge AI architecture patterns with engineers deploying in disconnected environments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>iot</category>
    </item>
    <item>
      <title>How to Measure RAG System Performance</title>
      <dc:creator>Oluseye Jeremiah</dc:creator>
      <pubDate>Sat, 28 Mar 2026 10:17:27 +0000</pubDate>
      <link>https://dev.to/actiandev/how-to-measure-rag-system-performance-1i1h</link>
      <guid>https://dev.to/actiandev/how-to-measure-rag-system-performance-1i1h</guid>
      <description>&lt;p&gt;Your RAG demo passed every test. The dashboard showed green across the board, with answers that clearly cite source documents. A key metric called "Faithfulness" scored 0.89. Then you shipped to production. Within two weeks, 35% of users reported wrong answers. The metrics hadn't changed. The failures were real.&lt;/p&gt;

&lt;p&gt;What happened? Test queries looked formal ("What is the enterprise pricing structure?"), while production queries were casual ("How much does this thing cost?"). Faithfulness, which checks whether answers rely on retrieved documents, caught the hallucinations but missed tone problems, missing context, and the dozens of other ways RAG systems fail when real users show up.&lt;/p&gt;

&lt;p&gt;Most teams add more metrics, build bigger dashboards, and measure everything, but in the end, they predict nothing. &lt;a href="https://aimultiple.com/rag-evaluation-tools" rel="noopener noreferrer"&gt;Weights &amp;amp; Biases&lt;/a&gt; found that a simple zero-shot evaluation prompt outperformed complex reasoning frameworks, reaching 100% accuracy versus 82-90%. Adding sophistication made results worse, not better. The problem isn't quantity; it's choosing the right measurements.&lt;/p&gt;

&lt;p&gt;Engineers know evaluation is hard, and most aren't doing it well. &lt;a href="https://openai.com/index/openai-to-acquire-neptune/" rel="noopener noreferrer"&gt;Neptune.ai&lt;/a&gt; research found that many RAG product initiatives stall after the proof-of-concept stage because teams underestimate the complexity of evaluation. This article walks through selecting three to five metrics that actually predict failures: which metrics catch which problems, what each costs, and how to build monitoring that scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most teams measure retrieval and generation but miss end-to-end user success. Systems score 0.89 on Faithfulness while 35% of users report failures because metrics don't catch tone or context mismatches. Neptune.ai found that many RAG initiatives stall after the proof-of-concept stage because teams underestimate the evaluation complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple beats complex: Weights &amp;amp; Biases found zero-shot prompts hit 100% accuracy versus 82-90% for complex frameworks. Adding sophistication made results worse, not better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ground truth costs $50-200 per Q&amp;amp;A pair. Building 1,000 pairs requires $50,000-200,000. Reference-free metrics cost $0.01-0.04 per check and scale to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Production queries break test sets. Derive 50% of your test queries from production logs, refresh them quarterly, and weight edge cases (5% of traffic, 40% of complaints).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start with three metrics: Context Relevance + Faithfulness + Answer Relevance at $0.02-0.04 per query. Expand only when you hit concrete limits.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Generic RAG Evaluation Metrics Fail
&lt;/h2&gt;

&lt;p&gt;Most RAG dashboards look convincing. Precision stays high, Faithfulness remains above 0.85, and Answer Relevance seems stable. But while the metrics show no problems, production tells a different story.&lt;/p&gt;

&lt;p&gt;Users report incomplete answers, responses miss intent, and queries fail even though no hallucination occurs. Engineers re-run the evaluation and see the same strong numbers. The issue isn't a missing metric; it's a missing layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three-layer problem
&lt;/h3&gt;

&lt;p&gt;Every RAG system operates across three layers, but most evaluation pipelines cover only two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 (Retrieval)&lt;/strong&gt; measures whether the system retrieved the right documents using Precision, Recall, and Mean Reciprocal Rank. These metrics assess ranking quality and coverage: if Recall drops, the system fails to surface necessary context, and if Precision drops, irrelevant documents pollute results. Retrieval metrics matter, but they don't explain why users still complain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 (Generation)&lt;/strong&gt; measures whether the model used retrieved documents correctly. Faithfulness checks whether claims appear in the retrieved context, while Answer Relevance checks whether the response addresses the query. These metrics reduce hallucinations and detect context misuse, but they still miss many production failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 (End-to-end user success)&lt;/strong&gt; measures whether the answer actually helped the user. This layer covers tone, clarity, and whether the system actually completes the user's task. Automated metrics rarely capture this layer.&lt;/p&gt;

&lt;p&gt;A system might report a Faithfulness score of 0.89 and a Context Relevance score of 0.91, yet 30-35% of production queries still fail. The model grounds its answers, retrieval works as expected, and there are no clear hallucinations. The failure stems from a mismatch between how test queries and production queries are phrased.&lt;/p&gt;

&lt;p&gt;Most teams measure the retrieval and generation layers, but not the full end-to-end alignment. Understanding the three layers narrows the problem. The next question is which metrics you can actually monitor in production without ground truth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmz7u1053xx26swo8npe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmz7u1053xx26swo8npe.png" alt="Figure 1: The three layers of RAG evaluation: retrieval, generation, and end-to-end user success. Most teams measure only the first two layers." width="800" height="1333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference-Based vs. Reference-Free
&lt;/h2&gt;

&lt;p&gt;Once you recognize the three-layer structure, the next question is: "Do you have ground truth answers?" The answer determines which metrics you can use, how much evaluation will cost, and whether you can monitor continuously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference-based metrics&lt;/strong&gt; compare system output against known correct answers. Context Recall, Context Precision, and Answer Correctness require labeled datasets. Their strength is stability for regression testing; they let you benchmark precisely and spot problems as models change.&lt;/p&gt;

&lt;p&gt;However, creating high-quality ground truth typically costs $50-200 per Q&amp;amp;A pair for expert annotation and quality assurance, particularly for specialized domains. At this rate, a 1,000-query test set costs $50,000–200,000, so reference-based evaluation doesn't scale to continuous production monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference-free metrics&lt;/strong&gt; don't require labeled answers. Faithfulness, Answer Relevance, and Context Relevance estimate correctness by comparing outputs to retrieved context. Their main advantage is that they scale easily, making them practical for ongoing production monitoring.&lt;/p&gt;

&lt;p&gt;Most production systems need both types. Use reference-based metrics to set baselines, and reference-free metrics to monitor daily performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeoupg1dowi5lzgev7ti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeoupg1dowi5lzgev7ti.png" alt="Figure 2: Decision tree for selecting metrics based on ground truth availability, budget constraints, and monitoring requirements." width="800" height="914"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this foundation in place, let's look at the specific metrics you'll use, what they measure, when they might fail, and which problems they help catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Metrics Explained
&lt;/h2&gt;

&lt;p&gt;Most teams use whatever metrics their framework provides. The issue isn't that these metrics are wrong, but that they're often used without a clear understanding of what they measure or where they might fail. Retrieval determines which information the model receives. If retrieval fails, the generation step can't fix it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Precision
&lt;/h3&gt;

&lt;p&gt;Measures what fraction of the retrieved documents are relevant. If your retriever returns five documents and only two contain useful information, precision drops to 0.4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real failure example:&lt;/strong&gt; an "enterprise pricing" query returns a blog post first, while the actual pricing page is ranked fifth, so the user sees incorrect information upfront. This is why Precision should be used when evaluating ranking quality, as it directly impacts the accuracy of the answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Recall
&lt;/h3&gt;

&lt;p&gt;Measures what fraction of the relevant documents the system actually retrieved. It requires you to know in advance which documents the system should retrieve for each query. This means maintaining a labeled test set where you've manually tagged, "For this question, these three documents are the correct answers."&lt;/p&gt;

&lt;p&gt;This makes Recall valuable for regression testing: "Did our update break Retrieval?" It doesn't work for production monitoring; you can't manually label thousands of daily queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Relevance
&lt;/h3&gt;

&lt;p&gt;Relies on embedding similarity to measure how close retrieved documents are to the query in the vector space. This works well for drift detection: if average similarity drops over time, embeddings or indexing may be degrading. However, similarity doesn't guarantee usefulness. Treat Context Relevance as a monitoring signal, not a correctness guarantee.&lt;/p&gt;
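
&lt;p&gt;As a rough sketch of the idea, the snippet below averages cosine similarity between a query embedding and each retrieved document embedding. The function names and the toy two-dimensional vectors are assumptions for illustration; real embeddings come from your embedding model, and evaluation frameworks may compute this metric differently.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_relevance(query_vec, doc_vecs):
    """Mean cosine similarity between the query and each retrieved document."""
    if not doc_vecs:
        return 0.0
    return sum(cosine_similarity(query_vec, d) for d in doc_vecs) / len(doc_vecs)


# Toy example: one document identical to the query, one orthogonal to it.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.0, 1.0]]
print(context_relevance(query, docs))
```

&lt;p&gt;Tracking the daily average of this score is one way to spot the drift described above.&lt;/p&gt;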

&lt;h3&gt;
  
  
  Mean Reciprocal Rank (MRR)
&lt;/h3&gt;

&lt;p&gt;Measures how high the first relevant document appears. For a single query, if the first relevant result appears at position one, the reciprocal rank equals 1.0. At position three, it equals 0.33. MRR averages this value across all queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reciprocal rank = 1 / rank_of_first_relevant_result
MRR = mean of reciprocal rank across all queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://qdrant.tech/blog/rag-evaluation-guide/" rel="noopener noreferrer"&gt;Research&lt;/a&gt; suggests relevance in the top three positions predicts answer performance better than top-ten coverage.&lt;/p&gt;
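
&lt;p&gt;The label-based retrieval metrics described so far (Context Precision, Context Recall, and MRR) can be sketched in a few lines. This is a minimal set-based version that follows the definitions in this article; the function names and document IDs are made up for illustration, and evaluation frameworks may use rank-weighted variants.&lt;/p&gt;

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are labeled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of labeled-relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant document, or 0.0 if none appears."""
    for position, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["blog", "faq", "pricing", "careers", "sla"]
relevant = {"pricing", "sla"}
print(context_precision(retrieved, relevant))  # 2 of 5 retrieved are relevant
print(reciprocal_rank(retrieved, relevant))    # first relevant hit is at position 3
```

&lt;p&gt;All three functions need relevance labels per query, which is why they fit regression test sets better than unlabeled production traffic.&lt;/p&gt;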

&lt;h3&gt;
  
  
  Faithfulness
&lt;/h3&gt;

&lt;p&gt;Evaluates whether the claims in a response are supported by the retrieved context. Most approaches break the answer into individual statements and verify them against the source documents. These checks typically cost between $0.01 and $0.04 apiece.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real failure example:&lt;/strong&gt; the system claims "coverage includes international shipping," even though the documentation only mentions domestic. Faithfulness is one of the most reliable ways to detect hallucinations, but it doesn't measure usefulness. A response can be fully grounded in the source material and still fail to help the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Answer Relevance
&lt;/h3&gt;

&lt;p&gt;Measures whether a response actually addresses the user's question. Many implementations approach this indirectly by asking an LLM to infer the likely question from the answer, then comparing it to the original query.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;RAGAS&lt;/a&gt; (Retrieval Augmented Generation Assessment) paper notes that Answer Relevance often diverges from human scoring in conversational cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real failure example:&lt;/strong&gt; a user asks how to reset a password, but the system responds with an explanation of the account creation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Answer Correctness
&lt;/h3&gt;

&lt;p&gt;Compares the model's output to a gold reference answer. It provides strong regression guarantees, but requires curated ground truth, typically costing $50 to $200 per Q&amp;amp;A pair. Use it when precision matters more than scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  BLEU and ROUGE
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/evaluate-metric/bleu" rel="noopener noreferrer"&gt;BLEU&lt;/a&gt; (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) were designed for machine translation and summarization, respectively, and measure word overlap between generated text and reference answers. They work well for those tasks, but break down for RAG. Two answers can convey the same meaning with different wording and still score poorly, while a hallucinated answer that mirrors the reference phrasing may score highly. Treat these metrics as rough development signals only, not as a substitute for real evaluation in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metric comparison
&lt;/h3&gt;

&lt;p&gt;Cost estimates reflect approximate LLM API charges for automated evaluation calls. Metrics listed as "Free" use deterministic computation with no API dependency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Requires ground truth?&lt;/th&gt;
&lt;th&gt;Cost per eval&lt;/th&gt;
&lt;th&gt;Production-ready?&lt;/th&gt;
&lt;th&gt;Best use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;Document labels&lt;/td&gt;
&lt;td&gt;$0.001-0.01&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;High-volume monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Recall&lt;/td&gt;
&lt;td&gt;Document labels&lt;/td&gt;
&lt;td&gt;$0.01-0.02&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Regression testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Relevance&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;$0.001-0.01&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Continuous monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MRR&lt;/td&gt;
&lt;td&gt;Document labels&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;FAQ systems, search ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;$0.01-0.04&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Hallucination detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Relevance&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;$0.01-0.02&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Query-answer matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer Correctness&lt;/td&gt;
&lt;td&gt;Reference answers&lt;/td&gt;
&lt;td&gt;$50-200&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Benchmark testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLEU/ROUGE&lt;/td&gt;
&lt;td&gt;Reference answers&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Development proxy only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Table 1: Comparison of RAG evaluation metrics by cost, ground truth requirements, and production readiness.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that Context Precision, Context Recall, and MRR don't require gold-standard reference answers. However, they do rely on relevance labels for retrieved documents, which must be manually annotated. Only Context Relevance, Faithfulness, and Answer Relevance are truly reference-free.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-a-Judge
&lt;/h2&gt;

&lt;p&gt;At some point, most teams reach the same conclusion: "If automated metrics miss tone and alignment, why not let another LLM evaluate the output?"&lt;/p&gt;

&lt;p&gt;This approach, known as LLM-as-a-judge, has become popular for evaluating RAG systems. It offers flexibility, requires no ground truth, and can capture nuanced reasoning. In practice, this method comes with trade-offs.&lt;/p&gt;

&lt;p&gt;LLM-as-a-judge uses a large model like GPT-4 or Claude to evaluate another model's output. You provide criteria directly in the prompt: "Does the context support the answer?" "Does it address the user's question?" "Is the tone appropriate?"&lt;/p&gt;

&lt;p&gt;The model returns a score or classification. This works well for nuanced checks and avoids the cost of creating labeled datasets. How reliable it is depends completely on how you design the prompts and how the model behaves.&lt;/p&gt;
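
&lt;p&gt;A zero-shot judge can be as small as a single prompt template plus a parser for the verdict. The sketch below is illustrative only: the prompt wording, the PASS/FAIL convention, and the &lt;code&gt;call_llm&lt;/code&gt; parameter are assumptions, and &lt;code&gt;fake_llm&lt;/code&gt; is a stand-in so the example runs without any API.&lt;/p&gt;

```python
JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Does the context support the answer and does it address the question?
Reply with exactly one word: PASS or FAIL."""


def judge(question, context, answer, call_llm):
    """Return True if the judge model replies PASS.

    call_llm is a hypothetical callable that takes a prompt string and
    returns the model's text reply; wire it to whichever client you use.
    """
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    reply = call_llm(prompt)
    return reply.strip().upper().startswith("PASS")


def fake_llm(prompt):
    """Stand-in for a real model so the sketch runs offline."""
    answer_part = prompt.split("Answer:")[1]
    return "PASS" if "domestic" in answer_part else "FAIL"


print(judge("Where do you ship?", "We ship to domestic addresses.",
            "Shipping is domestic only.", fake_llm))
```

&lt;p&gt;Passing the model call in as a parameter keeps the judge logic testable and makes it easy to pin a specific model version.&lt;/p&gt;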

&lt;h3&gt;
  
  
  The surprising finding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://wandb.ai/site/articles/exploring-llm-as-a-judge/" rel="noopener noreferrer"&gt;Weights &amp;amp; Biases&lt;/a&gt; evaluated multiple LLM-based approaches. A simple zero-shot prompt achieved 100% accuracy. More complex frameworks using reasoning chains scored 82-90%.&lt;/p&gt;

&lt;p&gt;The simpler prompt outperformed the "smarter" ones. Complex reasoning chains introduced over-analysis. The judge inferred errors that didn't exist. It penalized acceptable variations and produced inconsistent results.&lt;/p&gt;

&lt;p&gt;Making evaluations more complex doesn't always improve them. Sometimes, it actually makes them worse.&lt;/p&gt;

&lt;p&gt;Known limitations include version dependency (GPT-4 and GPT-4o may produce different judgments), prompt sensitivity (small wording changes can shift scores by 10-15 points), and context length constraints (LLM-based evaluations struggle with long contexts).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost reality
&lt;/h3&gt;

&lt;p&gt;Assume &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-4o&lt;/a&gt; costs $0.015 per evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000-case evaluation: $15 per metric&lt;/li&gt;
&lt;li&gt;Five metrics: $75&lt;/li&gt;
&lt;li&gt;Ten tuning rounds: $750&lt;/li&gt;
&lt;li&gt;Monthly regression testing: $250/month, or $3,000 annually&lt;/li&gt;
&lt;/ul&gt;
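
&lt;p&gt;The first three figures follow directly from the assumed per-evaluation price, as the short calculation below shows. The variable names are illustrative. The monthly regression figure is left out because it depends on how many runs you schedule each month, which the estimate above does not specify.&lt;/p&gt;

```python
from decimal import Decimal

COST_PER_EVAL = Decimal("0.015")  # assumed per-evaluation price from the text
CASES = 1000
METRICS = 5
TUNING_ROUNDS = 10

per_metric = COST_PER_EVAL * CASES       # one metric over the full test set
per_run = per_metric * METRICS           # all five metrics, one pass
tuning_total = per_run * TUNING_ROUNDS   # ten tuning rounds

print(f"per metric: ${per_metric:.2f}")
print(f"per run: ${per_run:.2f}")
print(f"tuning total: ${tuning_total:.2f}")
```

&lt;p&gt;Swap in your own price, test-set size, and metric count to estimate a budget before committing to per-query evaluation.&lt;/p&gt;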

&lt;p&gt;For high-traffic systems, continuous evaluation can be expensive. LLM-as-a-judge doesn't remove the cost; it just moves it from labeling to inference.&lt;/p&gt;

&lt;p&gt;LLM-as-a-judge works best for development iteration, qualitative validation, sample-based production review (10-20% traffic), and early-stage systems without ground truth. Avoid relying on it for compliance documentation, high-volume per-query evaluation, or benchmark comparisons across model versions.&lt;/p&gt;

&lt;p&gt;Once you understand these basics, the real question becomes: Which metrics should you actually use? The answer depends on your specific use case and constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Strategy
&lt;/h2&gt;

&lt;p&gt;Which three to five metrics will predict failures in your system? There's no one-size-fits-all answer. Begin by identifying the type of failure you absolutely can't accept.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Q&amp;amp;A chatbots&lt;/strong&gt; facing hallucinations and intent mismatch risks, use Faithfulness (catches hallucinations), Answer Relevance (ensures query addressed), and Context Precision (reduces noise). Skip Context Recall since coverage is less important than accuracy. Add latency P95 and token cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For document search&lt;/strong&gt; where ranking quality matters most, use MRR (position of first relevant result), Context Precision (clean ranking), and Context Relevance (embedding quality). Skip generation metrics since this is about search, not generating answers. Add result diversity. &lt;a href="https://qdrant.tech/blog/rag-evaluation-guide/" rel="noopener noreferrer"&gt;Qdrant research&lt;/a&gt; shows that top-three ranking quality correlates more strongly with outcome than broader retrieval depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For long-form generation&lt;/strong&gt; facing drift in framing or emphasis, use Faithfulness (grounding check), Answer Correctness (if ground truth exists), and Context Coverage (percentage of retrieved context used in answer). Add coherence checks and regular human reviews since automated metrics can't guarantee the narrative makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For compliance/legal systems&lt;/strong&gt; where omission is the dominant risk, use ALL retrieval metrics (complete coverage required), Faithfulness (no deviation), and Answer Correctness (requires ground truth). Add human validation and an audit trail. Reference-based evaluation and logging are essential for operations.&lt;/p&gt;

&lt;p&gt;After identifying the failure mode, constraints become the second filter. Whether you have ground truth data changes everything.&lt;/p&gt;

&lt;p&gt;The amount of traffic also matters. If your system handles hundreds of queries a day, you can evaluate each one with LLM-as-a-judge, but if you have tens of thousands, you'll need to use sampling. Budget is another factor. LLM-as-a-judge seems cheap per evaluation, but costs add up quickly when you use it for many metrics and rounds.&lt;/p&gt;

&lt;p&gt;Most production RAG systems operate effectively with three core signals. Start with Context Relevance (cheap, continuous retrieval monitoring), Faithfulness (catches hallucinations), and Answer Relevance (ensures query addressed). Add operational metrics like Latency P95/P99 and token cost per query. Evaluation metric overhead should add no more than 10-20% to your base retrieval-plus-generation latency. Cost: $0.02-0.04 per evaluation.&lt;/p&gt;

&lt;p&gt;Expand only after these stabilize: Have ground truth? Add Context Recall and Answer Correctness. Need compliance? Add human validation. Ranking matters? Add MRR. Avoid the temptation to measure everything. Having too many metrics creates noise, which can obscure important changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlggsp821goz4gwe0wwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlggsp821goz4gwe0wwo.png" alt="Figure 3: Mapping use cases to recommended metrics based on failure modes, constraints, and operational requirements." width="800" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Monitoring
&lt;/h2&gt;

&lt;p&gt;Evaluation looks controlled in development. You curate test queries, control context, and see metrics behave predictably. Production removes those guarantees.&lt;/p&gt;

&lt;p&gt;Real users introduce typos, vague phrasing, and inconsistent terminology while query distribution shifts and edge cases surface. In development, most queries look like your test set, but in production, most may not.&lt;/p&gt;

&lt;p&gt;Three forces reshape performance: query distribution shifts (users ask shorter, more casual questions and expect the system to infer intent), data evolves (knowledge bases update, new documents enter the index, embedding distributions change), and user expectations rise (people are less forgiving of slow responses or the wrong tone than of small factual errors).&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous strategy
&lt;/h3&gt;

&lt;p&gt;Evaluating in production needs a layered approach to monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always On (Per-Query)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context Relevance (low-cost drift detection)&lt;/li&gt;
&lt;li&gt;Latency P95/P99 (infrastructure pressure)&lt;/li&gt;
&lt;li&gt;Token cost per query (prompt creep)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Batch/Sampling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithfulness (nightly batch on query subset)&lt;/li&gt;
&lt;li&gt;LLM-as-a-judge (10-20% traffic sample)&lt;/li&gt;
&lt;li&gt;Human review (50-100 queries weekly)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your evaluation process must adapt as traffic grows. If your system handles 500 queries a day, you can check them all. If it handles 50,000, that's not possible.&lt;/p&gt;
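
&lt;p&gt;One way to make that adjustment is deterministic sampling. The sketch below hashes a query id to decide whether a query gets the expensive checks, and derives the sample rate from a daily evaluation budget. The function names and the budget of 5,000 evaluations per day are assumptions for illustration.&lt;/p&gt;

```python
import hashlib


def should_evaluate(query_id, sample_rate):
    """Deterministically pick roughly `sample_rate` of queries for expensive checks.

    Hashing the id (instead of calling random()) means the same query always
    gets the same decision, which keeps reruns and audits reproducible.
    """
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Map the first 8 bytes onto [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return sample_rate > bucket


def pick_sample_rate(queries_per_day, daily_eval_budget):
    """Evaluate everything at low volume; otherwise cap at the daily budget."""
    if queries_per_day == 0:
        return 0.0
    return min(1.0, daily_eval_budget / queries_per_day)


rate_small = pick_sample_rate(500, 5_000)     # low traffic: check every query
rate_large = pick_sample_rate(50_000, 5_000)  # high traffic: sample a fraction
```

&lt;p&gt;At 500 queries a day the rate is 1.0, so every query is checked. At 50,000 it drops to 0.1, and the hash decides which queries fall inside that tenth.&lt;/p&gt;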

&lt;h3&gt;
  
  
  Setting alert thresholds
&lt;/h3&gt;

&lt;p&gt;Set your thresholds before any incidents happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context Relevance &amp;lt; 0.7: Retrieval drift likely&lt;/li&gt;
&lt;li&gt;Faithfulness &amp;lt; 0.8: Hallucination risk increased&lt;/li&gt;
&lt;li&gt;P95 latency &amp;gt; 2 seconds: Infrastructure constraints&lt;/li&gt;
&lt;li&gt;User feedback &amp;lt; 4.0/5.0: Tone or completeness issues
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;monitor_rag_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Production monitoring with threshold alerts&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# calculate_metrics expects: {'query': str, 'contexts': List[str], 'answer': str}
&lt;/span&gt;    &lt;span class="c1"&gt;# Returns: {'context_relevance': float, 'faithfulness': float, 'latency_p95': float, 'user_feedback': float}
&lt;/span&gt;    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;context_relevance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieval degrading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;faithfulness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hallucination risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_p95&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Infrastructure issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_feedback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UX problem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evaluation costs should grow more slowly than your traffic does. Sample 5-10% of queries for expensive metrics, cache embeddings, batch LLM evaluations overnight, and use smaller models for screening.&lt;/p&gt;
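
&lt;p&gt;To see why sampling matters, here is a back-of-the-envelope cost projection. It assumes the $0.02-0.04 figure applies to each judged metric (using the $0.03 midpoint), three judged metrics per query, and a 30-day month. All of these are illustrative assumptions rather than measured values.&lt;/p&gt;

```python
def monthly_eval_cost(queries_per_day, sample_rate, metrics_per_query,
                      cost_per_evaluation, days=30):
    """Estimated monthly spend on LLM-judged evaluations, in dollars."""
    evaluated = queries_per_day * sample_rate
    return evaluated * metrics_per_query * cost_per_evaluation * days


# 50,000 queries/day, 3 judged metrics, $0.03 per evaluation
full = monthly_eval_cost(50_000, 1.0, 3, 0.03)
sampled = monthly_eval_cost(50_000, 0.05, 3, 0.03)
```

&lt;p&gt;Under these assumptions, judging every query comes to about $135,000 a month, while a 5% sample comes to about $6,750.&lt;/p&gt;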

&lt;h2&gt;
  
  
  Framework Selection
&lt;/h2&gt;

&lt;p&gt;Most teams shouldn't build evaluation tooling from scratch. Frameworks exist because hand-rolled evaluation becomes brittle quickly. Choose based on lifecycle stage, not feature count.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAGAS
&lt;/h3&gt;

&lt;p&gt;RAGAS (Retrieval Augmented Generation Assessment) introduced a structured, reference-free approach to RAG evaluation. It formalized Faithfulness, Answer Relevance, and Context Relevance in a reusable format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research-backed methodology&lt;/li&gt;
&lt;li&gt;Native support for reference-free metrics&lt;/li&gt;
&lt;li&gt;Clean integration with &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/overview" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited explainability for metric failures&lt;/li&gt;
&lt;li&gt;Sensitive to LLM version differences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; 1-2 hours | &lt;strong&gt;Cost:&lt;/strong&gt; Free + LLM API | &lt;strong&gt;Best for:&lt;/strong&gt; Early-stage RAG validating retrieval and grounding quality&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare evaluation data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris is the capital of France&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;France is a country in Western Europe with Paris as its capital&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run evaluation
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Example output (scores vary by judge model): {'faithfulness': 0.95, 'answer_relevancy': 0.88}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RAGAS is a good choice if your main goal is structural correctness, rather than production monitoring. You can find full documentation on &lt;a href="https://github.com/vibrantlabsai/ragas" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepEval
&lt;/h3&gt;

&lt;p&gt;DeepEval approaches evaluation like test engineering. It supports CI/CD integration and automated regression testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broad metric library (50+ metrics)&lt;/li&gt;
&lt;li&gt;Better failure inspection&lt;/li&gt;
&lt;li&gt;Designed for automated pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher configuration overhead&lt;/li&gt;
&lt;li&gt;More complex onboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Setup takes about 2-3 hours. It's open source, with optional paid tiers. It's best for teams that want to include evaluation in their release workflows.&lt;/p&gt;
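
&lt;p&gt;To make the regression-testing idea concrete, here is a minimal sketch of a CI gate written in plain Python. It does not use DeepEval's API. The function name, the metric scores, and the 0.02 tolerance are all hypothetical.&lt;/p&gt;

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance` vs. baseline."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            # A metric that disappeared is treated as a failure, not a pass.
            regressions[name] = (old_score, None)
        elif old_score - new_score > tolerance:
            regressions[name] = (old_score, new_score)
    return regressions


# Hypothetical scores from the last release and the current build
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
current = {"faithfulness": 0.84, "answer_relevance": 0.89}

failed = find_regressions(baseline, current)
# In a CI job, a non-empty result would fail the build, e.g.:
#     sys.exit(1 if failed else 0)
```

&lt;p&gt;A framework like DeepEval wraps this pattern in test cases and metric classes, so the gate runs alongside the rest of your test suite.&lt;/p&gt;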

&lt;h3&gt;
  
  
  TruLens
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.trulens.org/getting_started/#installation" rel="noopener noreferrer"&gt;TruLens&lt;/a&gt; focuses on simplicity. It tracks groundedness, Context Relevance, and Answer Relevance without heavy configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quick to deploy (under 1 hour setup)&lt;/li&gt;
&lt;li&gt;Minimal configuration&lt;/li&gt;
&lt;li&gt;Clear mental model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller ecosystem&lt;/li&gt;
&lt;li&gt;Less extensible for advanced workflows&lt;/li&gt;
&lt;li&gt;Development pace has slowed since the Snowflake acquisition, and ecosystem growth has stalled&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Arize Phoenix
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arize.com/docs/phoenix" rel="noopener noreferrer"&gt;Phoenix&lt;/a&gt; emphasizes production observability over development-only evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry integration&lt;/li&gt;
&lt;li&gt;Trace-based debugging&lt;/li&gt;
&lt;li&gt;Real-time monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires infrastructure integration&lt;/li&gt;
&lt;li&gt;Heavier operational footprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for mature systems that need large-scale drift detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangSmith
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.langchain.com/langsmith/home" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; integrates tightly with LangChain environments. It combines tracing with evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native LangChain support&lt;/li&gt;
&lt;li&gt;Experiment tracking&lt;/li&gt;
&lt;li&gt;Production trace inspection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ecosystem dependency&lt;/li&gt;
&lt;li&gt;Less framework-agnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for teams using LangChain who are moving toward structured monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;th&gt;Limitations&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAGAS&lt;/td&gt;
&lt;td&gt;Pure RAG evaluation&lt;/td&gt;
&lt;td&gt;Reference-free, LangChain integration&lt;/td&gt;
&lt;td&gt;Limited explainability&lt;/td&gt;
&lt;td&gt;Free + LLM API&lt;/td&gt;
&lt;td&gt;1-2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepEval&lt;/td&gt;
&lt;td&gt;Engineering teams&lt;/td&gt;
&lt;td&gt;50+ metrics, CI/CD integration&lt;/td&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;Free + optional $49-299/mo&lt;/td&gt;
&lt;td&gt;2-3 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TruLens&lt;/td&gt;
&lt;td&gt;Getting started&lt;/td&gt;
&lt;td&gt;3 core metrics, simple&lt;/td&gt;
&lt;td&gt;Limited traction&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arize Phoenix&lt;/td&gt;
&lt;td&gt;Production debugging&lt;/td&gt;
&lt;td&gt;OpenTelemetry compatible&lt;/td&gt;
&lt;td&gt;Enterprise complexity&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;3-4 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;LangChain users&lt;/td&gt;
&lt;td&gt;Native integration&lt;/td&gt;
&lt;td&gt;Vendor lock-in&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;1-2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Table 2: Comparison of RAG evaluation frameworks by use case, features, and operational requirements.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose by phase
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;POC:&lt;/strong&gt; RAGAS or TruLens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration:&lt;/strong&gt; DeepEval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production monitoring:&lt;/strong&gt; Phoenix or similar observability tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise governance:&lt;/strong&gt; Commercial platforms with audit features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good framework integrates smoothly, gives stable results across LLM versions, keeps costs predictable, and makes failures easy to spot.&lt;/p&gt;

&lt;p&gt;Even with the right framework, teams often make the same mistakes. Spotting these patterns early can save you months of extra work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;Most RAG evaluation failures follow predictable patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-indexing on automated metrics
&lt;/h3&gt;

&lt;p&gt;This happens when automated scores look healthy but users complain. A system reports Faithfulness at 0.92, but user feedback indicates responses feel robotic or miss conversational nuance. Automated metrics measure grounding but don't measure tone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Allocate 10-20% of the evaluation budget to human review. Sample high-risk queries weekly. Use findings to adjust prompts or refine automated thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test-production mismatch
&lt;/h3&gt;

&lt;p&gt;This occurs when tests pass but 40% of production queries fail. Test datasets contain formal queries: "What is the enterprise pricing structure?" Production users ask: "How much does this cost?" The distribution mismatch creates a silent evaluation failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Derive 50% of your test set from production logs. Refresh quarterly. Query patterns evolve faster than curated datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ignoring edge cases
&lt;/h3&gt;

&lt;p&gt;Common queries work but rare queries fail 80% of the time. Edge cases represent 5% of traffic but generate 40% of complaints. Test sets skew toward frequent queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Ensure equal representation of query types in evaluation. Weight infrequent but high-impact scenarios appropriately.&lt;/p&gt;
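
&lt;p&gt;A simple way to approximate equal representation is to sample the same number of queries from each query type. The sketch below assumes queries are already labeled with a type. The function name, labels, and counts are hypothetical.&lt;/p&gt;

```python
import random
from collections import defaultdict


def balanced_eval_set(queries, per_type, seed=0):
    """Sample up to `per_type` queries from each query type.

    queries: list of (query_text, query_type) pairs.
    Rare types keep all their examples when they have fewer than `per_type`.
    """
    by_type = defaultdict(list)
    for text, query_type in queries:
        by_type[query_type].append(text)

    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    selected = []
    for query_type in sorted(by_type):
        pool = by_type[query_type]
        k = min(per_type, len(pool))
        selected.extend((text, query_type) for text in rng.sample(pool, k))
    return selected


# Hypothetical log: 95 common pricing questions, 5 rare contract questions
log = [(f"pricing question {i}", "pricing") for i in range(95)]
log += [(f"contract question {i}", "contract") for i in range(5)]
eval_set = balanced_eval_set(log, per_type=5)
```

&lt;p&gt;The rare type makes up 5% of the log but half of the resulting evaluation set, so failures on edge cases show up in the scores.&lt;/p&gt;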

&lt;h2&gt;
  
  
  Actian VectorAI DB Advantages
&lt;/h2&gt;

&lt;p&gt;Most RAG evaluation pipelines expose queries and documents to external APIs. Embeddings travel to OpenAI, faithfulness checks route through Claude, and each evaluation step introduces data movement. For teams with compliance requirements, this setup doesn't work.&lt;/p&gt;

&lt;p&gt;Actian VectorAI DB addresses this gap by allowing you to run all evaluation workloads on-premises. Queries remain local, documents never leave controlled infrastructure, and LLM-based evaluation executes using locally hosted models. This eliminates external API dependencies entirely.&lt;/p&gt;

&lt;p&gt;Teams working with HIPAA-regulated data, financial records, or proprietary research can evaluate RAG systems on real production data without creating audit risk. Cloud evaluation costs scale with query volume and token count. &lt;a href="https://www.actian.com/databases/vectorai-db/#waitlist" rel="noopener noreferrer"&gt;Actian&lt;/a&gt; uses flat licensing with no per-query charges, making costs predictable as evaluation scales.&lt;/p&gt;

&lt;p&gt;Development environments often use mocked dependencies and synthetic data. Actian allows testing with the same database engine production uses, ensuring retrieval latency, index behavior, and evaluation results accurately predict production performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;More metrics don't guarantee better results. Automated scoring and human review form a more reliable system than either alone. Production queries provide better test coverage than curated datasets. Monitor continuously, not episodically.&lt;/p&gt;

&lt;p&gt;The Weights &amp;amp; Biases benchmark confirmed that simple evaluation, done consistently, outperforms complex evaluation done occasionally. Build your strategy on that principle. The goal isn't choosing the trendiest framework or the most complex dashboard; it's building infrastructure that remains accurate, scalable, and cost-effective as query volume grows.&lt;/p&gt;

&lt;p&gt;For teams building production RAG systems, start with three core metrics. Expand when you hit concrete limits, not hypothetical ones.&lt;/p&gt;

&lt;p&gt;If you need on-premises evaluation without exposing sensitive data to external APIs, &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt; lets you run all evaluation workloads locally within your own infrastructure.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Why Real-Time Analytics Can’t Depend on Cloud in 2026</title>
      <dc:creator>Hitesh Jethva</dc:creator>
      <pubDate>Fri, 27 Feb 2026 11:44:40 +0000</pubDate>
      <link>https://dev.to/actiandev/why-real-time-analytics-cant-depend-on-cloud-in-2026-1paj</link>
      <guid>https://dev.to/actiandev/why-real-time-analytics-cant-depend-on-cloud-in-2026-1paj</guid>
      <description>&lt;p&gt;If your system needs to react in milliseconds, a half-second delay is no longer "almost real-time"; it is a failure. For example, in robotic welding systems, the controller has to adjust torque in under 10 milliseconds to avoid structural defects. For self-driving warehouse forklifts, &lt;a href="https://www.researchgate.net/publication/396335728_Automatic_Braking_System_A_Low-Cost_Prototype_for_Obstacle_Detection_and_Collision_Prevention" rel="noopener noreferrer"&gt;obstacle detection must trigger braking within 20 milliseconds to prevent crashes&lt;/a&gt;. In ICU monitoring, arrhythmia detection should send alerts immediately, not 400 milliseconds later.&lt;/p&gt;

&lt;p&gt;This is the reality many teams are discovering in 2026. Systems that look fine on paper stop behaving as expected when organizations try to run real-time analytics on cloud-based platforms. &lt;/p&gt;

&lt;p&gt;For years, industries have been told that cloud solutions are perfect for data management, transfer, and analysis. The thought process was simple: send data to the cloud and it will process and respond faster. But in practice, these assumptions are starting to fail. AI workloads are forcing companies and experts to rethink cloud-era assumptions.&lt;/p&gt;

&lt;p&gt;As real-time analytics shifts from basic reporting and dashboards to instant decision-making, speed becomes critical. Cloud analytics is good for reporting, but distance and network delays make it slow for time-critical actions. &lt;/p&gt;

&lt;p&gt;Edge computing changes this model. It processes data close to where it is generated, allowing systems to make immediate decisions at the source instead of waiting for a response from a remote cloud data center. &lt;a href="https://www.rtinsights.com/edge-computing-set-to-dominate-data-processing-by-2030" rel="noopener noreferrer"&gt;By 2030, latency-critical applications will increasingly shift to edge processing&lt;/a&gt;, while cloud remains dominant for batch analytics and reporting. &lt;/p&gt;

&lt;p&gt;In this post, we will cover why cloud-based real-time analytics fails and how engineers can make architecture decisions based on actual latency limits and physical constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real-Time Analytics Promise Versus the Physics Problem
&lt;/h2&gt;

&lt;p&gt;By early 2026, almost every analytics platform claims to support real-time analytics. &lt;a href="https://www.passguide.com/blog/comparative-evaluation-of-snowflakes-data-platform-capabilities-against-leading-cloud-analytics-and-enterprise-data-competitors/" rel="noopener noreferrer"&gt;Cloud data warehouses talk about real-time ingestion&lt;/a&gt;, and major streaming data platforms promise sub-second processing. Judging by the marketing, the problem has already been solved.&lt;/p&gt;

&lt;p&gt;In reality, it hasn’t.&lt;/p&gt;

&lt;p&gt;Several analytics platforms and cloud vendors still define anything under a second as “real time” because that is what cloud infrastructure can reliably deliver. That may be acceptable for business intelligence, but it falls short for systems that require a truly instant response, such as manufacturing control loops, safety-critical systems, and autonomous machines.&lt;/p&gt;

&lt;p&gt;These systems don't need insights "soon." Instead, they demand decisions "now." In these environments, latency is not a basic performance metric but a quality constraint. &lt;/p&gt;

&lt;p&gt;This is where physics enters. &lt;/p&gt;

&lt;p&gt;Physical distance creates a minimum latency that software cannot remove. &lt;a href="https://physics.nist.gov/cgi-bin/cuu/Value?c" rel="noopener noreferrer"&gt;Light moves at about 300,000 kilometers per second in a vacuum&lt;/a&gt;, but &lt;a href="https://www.sciencedirect.com/science/article/abs/pii/S0960077922012772" rel="noopener noreferrer"&gt;in fiber-optic cables, signals travel closer to 200,000 kilometers per second&lt;/a&gt; because of refraction and signal processing. Even a few thousand kilometers of round-trip travel can take tens of milliseconds. When you include routing, serialization, queuing, and processing delays, total latency in real-world situations often reaches 200 to 500 milliseconds.&lt;/p&gt;

&lt;p&gt;In a cloud workflow, that physical distance becomes part of the latency budget — meaning that before any computation begins, a significant portion of the response window has already been consumed.&lt;/p&gt;
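
&lt;p&gt;The propagation floor is easy to estimate from the fiber speed quoted above. The sketch below computes the minimum round-trip time for a given one-way distance. The function name and the distances are hypothetical examples, and the result covers time on the wire only.&lt;/p&gt;

```python
FIBER_SPEED_KM_PER_S = 200_000  # approximate signal speed in optical fiber


def round_trip_ms(one_way_km, speed_km_per_s=FIBER_SPEED_KM_PER_S):
    """Minimum round-trip propagation delay in milliseconds.

    Counts only time on the wire; routing, queuing, and processing add more.
    """
    return 2 * one_way_km / speed_km_per_s * 1000


# Hypothetical one-way distances from a site to a cloud region
nearby = round_trip_ms(100)
regional = round_trip_ms(1_000)
cross_country = round_trip_ms(4_000)
```

&lt;p&gt;A region 1,000 kilometers away costs about 10 milliseconds of round-trip propagation alone, which already matches the whole budget of a 10-millisecond control loop. At 4,000 kilometers the floor is about 40 milliseconds.&lt;/p&gt;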

&lt;p&gt;The promise of real-time analytics collides with physics the moment decisions happen faster than the cloud can respond. And in 2026, as AI-powered systems increasingly move from generating insights to automatically triggering operational decisions, more teams are running headfirst into that limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does “Real-Time” Mean?
&lt;/h2&gt;

&lt;p&gt;The term real-time analytics refers to analyzing data as it becomes available. But practitioners do not share a single definition. Marketing teams, business users, application developers, and control-system engineers all use the term. &lt;/p&gt;

&lt;p&gt;However, each of these groups works within a different latency expectation, and when those differences are not clearly defined, systems often get designed around timing assumptions that fail once deployed in real-world environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt4x2ird2ss5xfmxafx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt4x2ird2ss5xfmxafx7.png" alt="Image 1: What does real-time spectrum mean" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand why this matters, it is best to look at real-time analytics as a spectrum and not a single promise.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Spectrum of Real-Time Requirements
&lt;/h3&gt;

&lt;p&gt;Real-time is not a single fixed standard; different business and technical systems operate across distinct latency tiers, each with very different architectural needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Real-Time (Sub-Second)
&lt;/h3&gt;

&lt;p&gt;In this latency tier, real-time cloud analytics platforms are strongest. Dashboards, operational reporting, alerting systems, and executive monitoring systems that refresh every few hundred milliseconds fall under this category. For business intelligence and reporting, sub-second latency feels like real-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interactive Real-Time (Sub-100ms)
&lt;/h3&gt;

&lt;p&gt;In this latency tier, web applications, recommendation engines, and in-app feedback loops fit best. These often require responses under 100 milliseconds. Cloud architectures can meet this requirement, but congestion and jitter frequently make consistency difficult.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Real-Time (Sub-10–20ms)
&lt;/h3&gt;

&lt;p&gt;Manufacturing automation, safety systems, robotics, and autonomous machines require responses within 10 to 20 milliseconds. At that speed, delays are not merely inconvenient; they are costly. They can cause defective products, equipment damage, safety risks, or failed control responses. Cloud-based real-time analytics often fails to meet this bar because network transit alone can consume the entire latency budget.&lt;/p&gt;

&lt;p&gt;In these scenarios, timing is everything. A robotic arm that is 300 milliseconds late can miss a weld. A safety system that reacts too slowly may fail to prevent an accident. The question is not whether the cloud is fast. It is whether it is fast enough for the physical process it controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Traditional Batch Analytics Differs
&lt;/h3&gt;

&lt;p&gt;Batch analytics processes historical data on a schedule. Data is collected, stored, and analyzed hours or days later. It is ideal for forecasting, trend analysis, and long-term planning.&lt;/p&gt;

&lt;p&gt;Real-time analytics runs continuously. It ingests live data streams, evaluates events as they happen, and decides whether immediate action is required.&lt;/p&gt;

&lt;p&gt;The difference becomes decisive when action must occur inside a strict time window. If a defect must be rejected within 20 milliseconds, a response at 300 milliseconds is useless. Batch analytics still drives trends and planning, but when decisions must shape live systems, timing determines the outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real-Time Analytics Technology Stack
&lt;/h3&gt;

&lt;p&gt;Modern real-time stacks rely on streaming platforms for ingestion. Kafka, Flink, and Kinesis move events continuously, feeding databases built for high-throughput writes and fast reads, often with columnar storage and in-memory processing. &lt;/p&gt;

&lt;p&gt;On top sits event-driven architecture, where actions trigger the moment conditions are met. But this only works if latency targets are defined from the start. Without a clear definition of real time, teams build systems that look modern and sound fast, but fail at millisecond response. That means fraudulent transactions slip through, vehicles brake too late, defective products pass inspection, or safety systems react after the damage is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Cloud Architecture Create Unavoidable Latency?
&lt;/h2&gt;

&lt;p&gt;Cloud latency is governed by distance, transmission paths, and routing physics more than system settings. For dashboards and report refreshes, the cloud feels fast. But once you examine how cloud processing actually works, it becomes clear why some real-time use cases hit a hard limit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mpvdsz0d0qdzaquqojp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mpvdsz0d0qdzaquqojp.png" alt="Image 2: Cloud vs. edge data processing paths" width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cloud Processing Pathway
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tsj0pn6i9h0oaa65z2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tsj0pn6i9h0oaa65z2s.png" alt="Image 3: The cloud processing pathway" width="800" height="972"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even for experienced engineers, it’s important to recognize that cloud-based processing follows a multi-step loop, not a direct path. The diagram above illustrates the full round-trip workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data generation (on-premises layer)&lt;/strong&gt;: Raw data originates at the device level — sensors, cameras, PLCs, or industrial controllers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local aggregation&lt;/strong&gt;: A local gateway, industrial PC, or PLC filters, normalizes, and prepares the data before it leaves the facility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet transmission&lt;/strong&gt;: The data is encrypted and transmitted over the public or private internet to a cloud region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud ingestion &amp;amp; queuing&lt;/strong&gt;: Services such as Azure IoT Hub or AWS Kinesis receive the stream and buffer it for processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing &amp;amp; analytics&lt;/strong&gt;: The cloud platform runs analytics, inference, or rule engines to generate a decision or insight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local action&lt;/strong&gt;: The local system executes a physical response — stopping a machine, rejecting a product, triggering an alert, or adjusting an actuator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this workflow, every step adds latency. Even with optimized pipelines, data moves through dozens of network hops, routers, and firewalls before it reaches its destination. Geographic distance adds another layer. A factory on the U.S. East Coast will reach a nearby cloud region such as AWS us-east-1 faster than one sending data across the country to AWS us-west-2.&lt;/p&gt;

&lt;p&gt;But even the closest cloud data center is still physically distant. That distance alone is enough to introduce noticeable delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency Components You Can't Eliminate
&lt;/h3&gt;

&lt;p&gt;Some sources of latency are unavoidable, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Physical propagation delay&lt;/strong&gt;: Data travels through fiber at roughly two-thirds the speed of light in a vacuum. &lt;a href="https://blog.cloudflare.com/african-traffic-growth-and-predictions-for-the-future/" rel="noopener noreferrer"&gt;Crossing the U.S. coast-to-coast takes roughly 100 milliseconds&lt;/a&gt; round-trip before any processing happens at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router processing at each hop&lt;/strong&gt;: Each router introduces a processing delay, often microseconds, that adds up across 10–20 hops. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialization and deserialization&lt;/strong&gt;: Data must be packaged, encrypted, decrypted, and unpacked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: TLS handshakes and inspection add delay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queueing delays&lt;/strong&gt;: Cloud ingestion services buffer incoming data, especially under load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even under ideal conditions, cloud round-trip latency may drop below 150–200 milliseconds, but that still falls short of what manufacturing control loops require.&lt;/p&gt;
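
&lt;p&gt;The components above can be added up as a rough budget. The sketch below is illustrative only: the distance, hop count, and per-component delays are assumed placeholder values rather than measurements, and the function names are hypothetical. The one physical constant is that light in fiber covers roughly 200 km per millisecond.&lt;/p&gt;

```python
# Rough latency budget for one cloud round trip.
# Every input below is an assumed placeholder, not a measurement.

FIBER_KM_PER_MS = 200.0  # light in fiber covers roughly 200 km per millisecond


def propagation_rtt_ms(one_way_km):
    """Round-trip propagation delay over fiber for a given one-way distance."""
    return 2 * one_way_km / FIBER_KM_PER_MS


def round_trip_budget_ms(one_way_km, hops, per_hop_ms, serialization_ms, tls_ms, queue_ms):
    """Sum the latency components into a single round-trip estimate."""
    routing_ms = 2 * hops * per_hop_ms  # each hop is crossed in both directions
    return (
        propagation_rtt_ms(one_way_km)
        + routing_ms
        + serialization_ms
        + tls_ms
        + queue_ms
    )


floor_ms = propagation_rtt_ms(one_way_km=1000)
total_ms = round_trip_budget_ms(
    one_way_km=1000, hops=15, per_hop_ms=0.05,
    serialization_ms=2.0, tls_ms=5.0, queue_ms=20.0,
)
control_loop_budget_ms = 10.0

print(f"propagation floor: {floor_ms:.1f} ms")
print(f"estimated round trip: {total_ms:.1f} ms")
print(f"fits the control loop budget: {control_loop_budget_ms > total_ms}")
```

&lt;p&gt;With these assumed inputs, propagation over a 1,000 km path alone uses up a 10-millisecond control budget before routing, encryption, or queueing is counted. Real fiber routes are usually longer than the straight-line distance, so the floor in practice tends to be higher.&lt;/p&gt;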

&lt;h3&gt;
  
  
  When Cloud Optimization Isn't Enough
&lt;/h3&gt;

&lt;p&gt;Cloud providers offer optimizations, but the gap remains large. Content delivery networks and edge caching help with static content, but they fail in live data processing and real-time decision-making. &lt;/p&gt;

&lt;p&gt;Deploying workloads in nearby regions shrinks geographic distance without removing the network round trip. Dedicated connections such as AWS Direct Connect reduce jitter and packet loss, but they cannot overcome the baseline latency physics imposes.&lt;/p&gt;

&lt;p&gt;Some teams try multi-region architectures to get closer to users or devices, but this adds complexity without fixing the core issue. For systems that need responses in 10–20 milliseconds, shaving a few milliseconds off a 300-millisecond round trip doesn’t change the outcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manufacturing Floor Reality Check
&lt;/h2&gt;

&lt;p&gt;Cloud latency is abstract until you attach real numbers. Manufacturing shows clearly why cloud-based real-time analytics breaks down. Let’s do the math.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss8q26q2sty4i8870ts8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss8q26q2sty4i8870ts8.png" alt="Image 4: What happens during 500ms cloud latency at 400 units/min" width="800" height="1228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a production line that produces approximately 400 units per minute, or about 6.67 units per second. A unit passes the inspection point roughly every 150 milliseconds. The inspection system must respond within that window, or the decision arrives too late.&lt;/p&gt;

&lt;p&gt;Compare that with cloud-based AI or analytics, which usually takes 300–500 milliseconds for a full round trip. At 500 milliseconds, that works out to about 3.3 units (500 ms / 150 ms). By the time the cloud responds, 3–4 units have already passed the inspection point.&lt;/p&gt;
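
&lt;p&gt;The arithmetic is easy to reproduce for other line speeds. A minimal sketch follows, using the 400 units per minute and 300–500 millisecond figures from this section. The function names are made up for illustration.&lt;/p&gt;

```python
def unit_interval_ms(units_per_minute):
    """Time between consecutive units at the inspection point, in milliseconds."""
    return 60_000 / units_per_minute


def units_passed(latency_ms, units_per_minute):
    """How many units pass the inspection point while waiting for a response."""
    return latency_ms / unit_interval_ms(units_per_minute)


line_rate = 400  # units per minute, as in the example above

print(f"interval between units: {unit_interval_ms(line_rate):.0f} ms")
for latency_ms in (10, 300, 500):
    passed = units_passed(latency_ms, line_rate)
    print(f"{latency_ms} ms response: {passed:.2f} units pass before a decision")
```

&lt;p&gt;Any response time longer than the interval between units means at least one unit has moved on before the decision arrives.&lt;/p&gt;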

&lt;p&gt;Now, apply this timing gap to critical manufacturing tasks such as defect detection, weld monitoring, and PCB inspection, where decisions must happen before the next unit reaches the station.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality Control at Production Speed
&lt;/h3&gt;

&lt;p&gt;Automated quality checks are common in modern factories. Vision inspection systems scan for defective products, weld monitoring systems assess quality while the weld is still in progress, and printed circuit board (PCB) inspection lines analyze boards as they move through production.&lt;/p&gt;

&lt;p&gt;Each of these systems operates fast, but the decision window is minimal. &lt;/p&gt;

&lt;p&gt;With cloud-based processing, a delay of 500 milliseconds can allow defective parts to pass the inspection point. By the time the system flags a defect, stopping the part may no longer be possible.&lt;/p&gt;

&lt;p&gt;Edge processing changes the equation with responses in under 10 milliseconds, roughly 50x faster than a 500-millisecond cloud round trip. This responsiveness enables instant rejection of defective units, thereby improving quality control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety Monitoring That Actually Protects Workers
&lt;/h3&gt;

&lt;p&gt;Safety systems have even stricter requirements. Any delay puts workers at risk.&lt;/p&gt;

&lt;p&gt;If a worker steps into a hazardous zone without proper protective equipment and the sensors or cameras fail to trigger an immediate alert, the worker is exposed to serious injury or contamination.&lt;/p&gt;

&lt;p&gt;With a 500-millisecond cloud round trip, the alert arrives after the worker has already entered the contaminated area. Cloud-based analytics can reconstruct the incident timeline, but only edge-based analytics can prevent it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Maintenance Windows
&lt;/h3&gt;

&lt;p&gt;Most machines show warning signs before complete failure, such as vibration anomalies, temperature changes, and unusual acoustic patterns. Generally, there is a 1–2 second window between early detection and actual damage.&lt;/p&gt;

&lt;p&gt;Cloud analytics can detect issues, but the system often responds too late. Edge analytics processes events immediately and stops the operation before real damage occurs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Manufacturing
&lt;/h2&gt;

&lt;p&gt;Manufacturing is not the only industry constrained by cloud latency. Many industries now demand immediate, analytics-driven decisions and face the same constraints. Let’s examine a few others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare: When Seconds Determine Outcomes
&lt;/h3&gt;

&lt;p&gt;Healthcare systems rely on real-time analytics to monitor patients. ICU sensors track oxygen levels, heart rate, and other vital signs. This data only delivers value when the system detects anomalies early and flags a potential emergency.&lt;/p&gt;

&lt;p&gt;Delays in data transmission or cloud processing directly affect patient outcomes. Healthcare organizations must also meet strict regulatory requirements. HIPAA and data residency laws often require sensitive patient data to remain on-premises or within controlled environments, making edge processing a practical and compliant solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autonomous Systems: Physics Won't Wait for the Cloud
&lt;/h3&gt;

&lt;p&gt;Sales of self-driving vehicles continue to rise, driven by systems that detect obstacles, predict motion, and decide how to respond in milliseconds. Industrial robots, drones, and automated guided vehicles in warehouses use similar capabilities to generate real-time insights.&lt;/p&gt;

&lt;p&gt;A 100–200 millisecond delay in an autonomous system can mean a missed brake command or a collision. If connectivity drops, decision-making can stall entirely.&lt;/p&gt;

&lt;p&gt;Autonomous systems must process data locally and keep response times under 50 milliseconds. The cloud supports model training and optimization, but real-time control must remain at the edge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial Services: Fraud Detection at Transaction Speed
&lt;/h3&gt;

&lt;p&gt;For financial institutions, timing is critical. Delays influence customer behavior and lead to significant data and revenue loss. Whether the event is a credit card transaction, an account login, a payment, or any other financial action, the data must be evaluated in under 100 milliseconds.&lt;/p&gt;

&lt;p&gt;A delay in fraud detection can allow illegitimate transactions to complete or cause valid transactions to fail. In high-frequency trading, where decisions occur in microseconds, cloud latency is not viable.&lt;/p&gt;

&lt;p&gt;Many financial institutions use a hybrid model: fast risk scoring and decision logic run at the edge or on-premises, while deeper analysis and model training run in the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Connectivity Assumption Cloud Vendors Won't Discuss
&lt;/h2&gt;

&lt;p&gt;Most cloud-based real-time analytics platforms assume constant internet connectivity. Many data flow diagrams show seamless movement from device to cloud and back. In reality, when the network fails, business operations fail with it.&lt;/p&gt;

&lt;p&gt;When real-time analytics depends on a constant cloud connection, every connectivity gap can turn into a system failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Connectivity Isn't Reliable
&lt;/h3&gt;

&lt;p&gt;The internet is widely available, but reliability is not guaranteed. Many locations cannot depend on stable, low-latency connections.&lt;/p&gt;

&lt;p&gt;Consider retail stores where an outage takes point-of-sale systems offline during peak hours or a major sale, and transactions stop immediately.&lt;/p&gt;

&lt;p&gt;Industrial IoT deployments often operate in remote locations such as mines, oil fields, and factories, where latency spikes and packet loss are common. Even well-connected urban areas experience peak-time congestion that introduces unpredictable delays.&lt;/p&gt;

&lt;p&gt;In all such cases, cloud-based real-time analytics degrades, and operations and decisions are delayed.&lt;/p&gt;

&lt;p&gt;Moving processing to the edge lets these sites keep operating normally, even when the network is slow or unstable.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Connectivity Is Prohibited
&lt;/h3&gt;

&lt;p&gt;Some environments do not allow cloud connectivity due to security and compliance requirements.&lt;/p&gt;

&lt;p&gt;Manufacturing plants often use air-gapped networks to protect their intellectual property and prevent outside access. Financial institutions must comply with data sovereignty laws that govern where sensitive data is stored and processed. Healthcare organizations must meet HIPAA rules that set limits on how and where patient data is stored and shared.&lt;/p&gt;

&lt;p&gt;In all such cases, organizations need to process real-time analytics on-site or at the edge. Hybrid and private cloud models can handle less critical tasks, but operations that require low latency or follow strict rules must stay on-premises.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sync-When-Connected Pattern
&lt;/h3&gt;

&lt;p&gt;Many teams use a straightforward approach where they process data at the edge and sync it to the cloud when a connection becomes available.&lt;/p&gt;

&lt;p&gt;Edge systems make real-time decisions on-site, so operations keep running even during outages or network slowdowns. Once the connection is stable again, the system sends logs, summaries, and model updates to the cloud for deeper analysis and retraining.&lt;/p&gt;

&lt;p&gt;This method gives you quick local responses while also letting you use the cloud’s scale and advanced analytics.&lt;/p&gt;
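
&lt;p&gt;The pattern can be sketched in a few lines. The example below is a simplified illustration, not a production design. The &lt;code&gt;EdgeBuffer&lt;/code&gt; class, its threshold rule, and the &lt;code&gt;is_connected&lt;/code&gt; and &lt;code&gt;upload&lt;/code&gt; callables are all hypothetical stand-ins for whatever decision logic and transport a real system uses.&lt;/p&gt;

```python
from collections import deque


class EdgeBuffer:
    """Decide locally, queue records, and upload them when a link is available."""

    def __init__(self, max_items=10_000):
        # Bounded queue: when full, the oldest record is dropped first.
        self.outbox = deque(maxlen=max_items)

    def handle(self, reading, threshold):
        """Make the real-time decision on-site, with no network call."""
        decision = "reject" if reading > threshold else "accept"
        self.outbox.append({"reading": reading, "decision": decision})
        return decision

    def sync(self, is_connected, upload):
        """Send queued records while the connection holds; return how many were sent."""
        sent = 0
        while self.outbox and is_connected():
            upload(self.outbox[0])   # if this raises, the record stays queued
            self.outbox.popleft()    # remove only after a successful upload
            sent += 1
        return sent


buffer = EdgeBuffer()
print(buffer.handle(reading=0.9, threshold=0.5))
print(buffer.handle(reading=0.1, threshold=0.5))

received = []
print(buffer.sync(is_connected=lambda: False, upload=received.append))  # offline
print(buffer.sync(is_connected=lambda: True, upload=received.append))   # back online
```

&lt;p&gt;The decision path never touches the network, so an outage only delays the upload. A bounded queue is one possible policy for long outages. Systems that cannot lose records would persist the queue to disk instead.&lt;/p&gt;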

&lt;h2&gt;
  
  
  The Architecture That Actually Works (Cloud-Right, Not Cloud-First)
&lt;/h2&gt;

&lt;p&gt;Many organizations learned a hard lesson and now shift from cloud-first thinking to cloud-right architecture. They once pushed real-time workloads to the cloud because it sounded simple, but that approach ignores latency, connectivity, compliance, and cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vjipcwiwiddkbhb7lef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vjipcwiwiddkbhb7lef.png" alt="Image 5: Three-tier real-time analytics architecture" width="800" height="1373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A more practical approach is what &lt;a href="https://www.deloitte.com/au/en/Industries/government-public/blogs/getting-cloud-right-how-prepare-successful-transformation.html" rel="noopener noreferrer"&gt;Deloitte&lt;/a&gt; describes as cloud-right, where each workload is placed in the location that best fits its needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Tier: For Elasticity and Deep Analytics
&lt;/h3&gt;

&lt;p&gt;The cloud is still the best place for workloads that benefit from scale and flexibility. It excels at training machine learning models, for example, which often requires large GPU clusters that are expensive to run on-premises. Historical data analysis, reporting, and long-term storage are also a good fit, since data lakes and warehouses can scale on demand.&lt;/p&gt;

&lt;p&gt;The cloud works well for variable workloads, experimentation, and analytics that don't require immediate responses. If sub-second latency is acceptable, cloud-based processing is usually the most cost-effective option.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-Premises Tier: For Consistency and Compliance
&lt;/h3&gt;

&lt;p&gt;On-premises systems are best for predictable, high-volume inference workloads that run continuously and would be costly to execute in the cloud. They play a key role when data is regulated and cannot leave the premises due to compliance, security, or data sovereignty requirements.&lt;/p&gt;

&lt;p&gt;On-premises deployments offer consistent performance and tighter integration with existing enterprise systems. For use cases that need reliable sub-100-millisecond responses but don't require ultra-low latency, on-premises strikes a good balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Tier: For Immediacy and Offline Capability
&lt;/h3&gt;

&lt;p&gt;Edge processing is essential for control systems, safety applications, and autonomous operations that require latency of 10–20 milliseconds or less and offline capability. It keeps data-driven decisions possible when connectivity is slow, unreliable, or completely unavailable. It also analyzes data where it's generated, which reduces bandwidth costs, improves operational efficiency, delivers actionable insights, and avoids the cloud round trip entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do About It?
&lt;/h2&gt;

&lt;p&gt;Real-time analytics means different things in different contexts. You don't need to chase cloud or edge computing for its own sake. Instead, assess your requirements and constraints before choosing an architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Define Your Real-Time Requirements
&lt;/h3&gt;

&lt;p&gt;Start by writing down what real-time actually means for your use case. Be specific and use numbers. For business processes that only need dashboards, reports, or data visualizations, sub-second responses are usually fine, and cloud analytics fits best.&lt;/p&gt;

&lt;p&gt;Interactive applications, where users expect instant feedback, often need responses under 100 milliseconds. This is where cloud performance becomes marginal and needs careful testing.&lt;/p&gt;

&lt;p&gt;Manufacturing automation, robotics, and machine-driven decisions typically need responses under 20 milliseconds. Safety systems are even stricter, often under 10 milliseconds.&lt;/p&gt;

&lt;p&gt;Before evaluating real-time analytics tools or platforms, define your latency SLAs clearly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Assess Connectivity Constraints
&lt;/h3&gt;

&lt;p&gt;Ask whether your system must operate during outages, in remote locations, or under regulatory restrictions. Be honest with yourself.&lt;/p&gt;

&lt;p&gt;Retail locations, mobile systems, remote industrial sites, and field operations regularly deal with outages and unstable networks. If any of these describe your environment, the system has to keep working offline. These constraints often matter more than raw performance benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Match Requirements to Architecture
&lt;/h3&gt;

&lt;p&gt;If your real-time analytics requirements demand sub-100 milliseconds latency, offline capability, or on-premises deployment, cloud-only solutions are insufficient. You might require hybrid or edge-based architectures.&lt;/p&gt;

&lt;p&gt;You can even switch to platforms like &lt;a href="https://www.actian.com/databases/vectorai-db" rel="noopener noreferrer"&gt;Actian’s VectorAI DB&lt;/a&gt; (beta in January 2026), designed to support edge and on-premises deployments specifically for latency-critical workloads. &lt;/p&gt;
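
&lt;p&gt;The three steps can be condensed into a simple decision rule. The sketch below is a starting point only. It uses the latency thresholds mentioned in this article (about 20 milliseconds for edge, about 100 milliseconds for on-premises), and the function and parameter names are hypothetical. Real placement decisions also weigh cost, existing infrastructure, and team skills.&lt;/p&gt;

```python
EDGE_SLA_MS = 20       # control and safety loops, per the thresholds above
ON_PREM_SLA_MS = 100   # interactive workloads that need consistent latency


def recommend_tier(latency_sla_ms, must_work_offline=False, data_must_stay_on_site=False):
    """Suggest a deployment tier from a latency SLA and two constraints."""
    if must_work_offline or EDGE_SLA_MS >= latency_sla_ms:
        return "edge"
    if data_must_stay_on_site or ON_PREM_SLA_MS >= latency_sla_ms:
        return "on-premises"
    return "cloud"


workloads = {
    "executive dashboard": recommend_tier(1000),
    "interactive search": recommend_tier(80),
    "defect rejection": recommend_tier(15),
    "remote mine telemetry": recommend_tier(500, must_work_offline=True),
    "patient records analytics": recommend_tier(500, data_must_stay_on_site=True),
}

for name, tier in workloads.items():
    print(f"{name}: {tier}")
```

&lt;p&gt;Connectivity and compliance constraints are checked alongside the latency number because either one can rule out the cloud even when the latency target is relaxed.&lt;/p&gt;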

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Cloud vendors didn't break real-time analytics. Physics did. &lt;/p&gt;

&lt;p&gt;For cloud vendors, real-time often means dashboards that refresh quickly or data that appears within a second. For industrial engineers, real-time means systems that react in milliseconds. That gap in the definition of real-time analytics matters.&lt;/p&gt;

&lt;p&gt;No amount of optimization can change the physics involved. Data still has to travel across networks, through routers, and into distant data centers. A 500-millisecond cloud round trip might feel fast in software terms, but it is 50x too slow for manufacturing control systems that need responses in under 10 milliseconds.&lt;/p&gt;

&lt;p&gt;That's why many real-world applications simply cannot depend on the cloud for real-time processing. In 2026, the most successful real-time analytics systems will not be cloud-first. They will be designed around physical constraints, with edge and on-premises deployments handling the decisions that must be made immediately.&lt;/p&gt;

&lt;p&gt;If your real-time analytics requirements demand sub-100 milliseconds latency, offline operation, or strict data residency, cloud-only architectures start to break down. Solutions designed for edge and on-premises deployment, such as &lt;a href="https://www.actian.com/databases/vectorai-db" rel="noopener noreferrer"&gt;Actian’s VectorAI DB&lt;/a&gt; entering beta in January 2026, are built specifically for these constraints.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>analytics</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>What's Changing in Vector Databases in 2026</title>
      <dc:creator>Praise James</dc:creator>
      <pubDate>Tue, 17 Feb 2026 14:25:14 +0000</pubDate>
      <link>https://dev.to/actiandev/whats-changing-in-vector-databases-in-2026-3pbo</link>
      <guid>https://dev.to/actiandev/whats-changing-in-vector-databases-in-2026-3pbo</guid>
      <description>&lt;p&gt;The vector database market has shifted. Engineering conversations have matured from “use Pinecone” to “we can build this on PostgreSQL.” What the market is witnessing is a growing movement from cloud-native vector databases back to traditional infrastructure, where embedding vector search directly into a relational database has become standard practice.&lt;/p&gt;

&lt;p&gt;Every major cloud provider and traditional database, from AWS and Azure to MongoDB and PostgreSQL, now handles vector data. This consolidation raises two key questions: “Are standalone vector solutions still necessary?” and “Should teams continue with familiar multi-model systems like PostgreSQL?”&lt;/p&gt;

&lt;p&gt;Deployment limitations add another critical dimension. For many data-heavy industries like IoT, manufacturing, and retail, there are rarely practical ways to run these databases where data actually lives. This constraint exposes a gap in edge and on-premises deployment support. &lt;/p&gt;

&lt;p&gt;Additionally, AI agents are generating 10x &lt;a href="https://tomtunguz.com/2026-predictions/" rel="noopener noreferrer"&gt;more queries&lt;/a&gt; than human-driven applications, forcing a fundamental rethink of database throughput architecture. Despite the significance of these shifts, there is no thorough analysis of their implications for architectural decisions.&lt;/p&gt;

&lt;p&gt;We examine the core forces that have transformed the vector database market, argue why specialized solution usage is declining, assess where edge deployment support stands in 2026, and present an actionable database decision framework that accounts for data you can't migrate to the cloud. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Shifted in 2025
&lt;/h2&gt;

&lt;p&gt;Pre-2025, purpose-built vector databases were presented as the standard infrastructure, but by 2026, a different reality emerges. Vectors have moved from being a database category to a data type. &lt;/p&gt;

&lt;p&gt;Major traditional database providers, from PostgreSQL to Oracle and MongoDB, now add native vector support. MongoDB integrated &lt;a href="https://www.infoworld.com/article/2338676/mongodb-adds-vector-search-to-atlas-database-to-help-build-ai-apps.html" rel="noopener noreferrer"&gt;Atlas Vector Search&lt;/a&gt;, PostgreSQL added &lt;a href="https://venturebeat.com/data-infrastructure/timescale-expands-open-source-vector-database-capabilities-for-postgresql" rel="noopener noreferrer"&gt;pgvector and pgvectorscale&lt;/a&gt; extensions, and Oracle introduced &lt;a href="https://blogs.oracle.com/database/oracle-announces-general-availability-of-ai-vector-search-in-oracle-database-23ai" rel="noopener noreferrer"&gt;Oracle Database 23ai&lt;/a&gt;. Top cloud providers, like AWS, Google, and Azure, also joined this trend. &lt;/p&gt;

&lt;p&gt;Integrated vector support eliminates the need to introduce a separate database alongside your primary relational system to implement vector search for AI applications. While purpose-built vector databases still dominate vendor lists, the market has already moved on, and the PostgreSQL acquisitions make that clear. &lt;/p&gt;

&lt;p&gt;In 2025 alone, Snowflake and Databricks &lt;a href="https://www.theregister.com/2025/06/10/snowflake_and_databricks_bank_postgresql/" rel="noopener noreferrer"&gt;spent approximately $1.25B&lt;/a&gt; acquiring PostgreSQL-first companies. At the same time, &lt;a href="https://survey.stackoverflow.co/2025/technology#1-dev-id-es" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt; reported PostgreSQL as the most used (46.5%) database among developers in 2025. These numbers signal that relational databases are now fit for AI workloads. But &lt;a href="https://venturebeat.com/data/six-data-shifts-that-will-shape-enterprise-ai-in-2026" rel="noopener noreferrer"&gt;VentureBeat&lt;/a&gt; predicts that this shift will narrow down purpose-built platforms to specialized use cases.&lt;/p&gt;

&lt;p&gt;By integrating vector search directly into production systems, traditional databases are compressing the role of dedicated vector infrastructure to billion-scale workloads with sub-50ms latency requirements, consistent with VentureBeat’s analysis and confirmed by PostgreSQL acquisitions. &lt;/p&gt;

&lt;p&gt;To understand what this 2025 shift means for your architectural decisions in 2026, let’s first look at how we got here. &lt;/p&gt;

&lt;h2&gt;
  
  
  A Refresher on Vector Databases
&lt;/h2&gt;

&lt;p&gt;Vector databases store, index, and query high-dimensional vector embeddings that represent multimodal data as numerical arrays to capture their semantic and contextual relationships. As unstructured data accounts for 90% of the &lt;a href="https://www.box.com/resources/unstructured-data-paper" rel="noopener noreferrer"&gt;global information&lt;/a&gt; footprint, encoding meaning for machine learning models requires embedding storage, vector search, and context retrieval, which vector databases handle. This infrastructure underpins many AI applications, including retrieval-augmented generation (RAG), recommendation systems, and natural language processing (NLP).&lt;/p&gt;

&lt;h2&gt;
  
  
  How Similarity Search Actually Works
&lt;/h2&gt;

&lt;p&gt;The core retrieval technology for similarity search is approximate nearest neighbor (ANN) search. Most databases use ANN indexing algorithms such as hierarchical navigable small world (HNSW) graphs, inverted file (IVF) indexes, locality-sensitive hashing (LSH), or product quantization (PQ).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0bix972srilxaxedtao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0bix972srilxaxedtao.png" alt="Figure 1: How vector similarity search works" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a query vector arrives, the database follows a graph, hash, or quantization-based approach to find approximate nearest neighbor candidates within the vector space. The database then computes the distance between these vectors, typically using cosine similarity or Euclidean distance functions to rank the top-K results, as illustrated in the image above. These ranked results either improve the context that becomes the final output or serve as a candidate set for re-ranking to identify more true nearest neighbors.&lt;/p&gt;
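
&lt;p&gt;The ranking step can be shown in isolation. The sketch below scores every stored vector against the query with cosine similarity and returns the top-K identifiers. It is an exact brute-force search written in plain Python for clarity, so it does not use an ANN index. In a real database, an index such as HNSW or IVF would first narrow the candidate set. The function names and sample vectors are made up for illustration.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Rank candidate vectors by similarity to the query and keep the best k ids."""
    scored = [
        (cosine_similarity(query, vector), doc_id)
        for doc_id, vector in candidates.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Tiny 2-dimensional embeddings, chosen only to make the ranking easy to follow.
stored = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}

print(top_k(query=[1.0, 0.1], candidates=stored, k=2))
```

&lt;p&gt;Brute-force scoring touches every stored vector, which is why it only works for small collections. ANN indexes trade a small amount of recall for far fewer distance computations.&lt;/p&gt;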

&lt;h2&gt;
  
  
  Why Retrieval-Augmented Generation (RAG) Made Vector Databases Essential
&lt;/h2&gt;

&lt;p&gt;The persistent interest in vector databases is a direct response to large language models' hallucinations, lack of domain knowledge, and inability to incorporate up-to-date information into their responses, making them insufficient for accuracy-sensitive tasks. RAG methods augment LLM outputs, leveraging vector databases as external knowledge bases and vector search as the computational backbone for retrieving relevant context. &lt;/p&gt;

&lt;p&gt;Conventional RAG systems follow a four-stage pipeline: converting incoming queries into vector representations using an embedding model, executing a similarity search on stored vectors, integrating the retrieved relevant chunks and the query into an extended context that a language model processes, and finally transmitting the generated response back to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeodgu34g8wbv2zliq4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeodgu34g8wbv2zliq4a.png" alt="Figure 2: Typical cloud retrieval-augmented generation workflow" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Purpose-built vector databases simplified RAG implementation and efficient similarity search for early AI adopters. But three things changed between 2022 and 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Market Forces Reshaping Vector Databases in 2026
&lt;/h2&gt;

&lt;p&gt;If 2022–2025 was about adding vector-native databases to AI applications, 2026 is leaning towards moving back to extended relational databases, rethinking architectural designs, and addressing an overlooked edge deployment gap. These three distinct trends stand out the most. &lt;/p&gt;

&lt;h3&gt;
  
  
  Force 1: Database Consolidation (Multi-Model Platforms Win)
&lt;/h3&gt;

&lt;p&gt;In 2026, major traditional relational databases have integrated vector capabilities into their data layer, and their extensions are already showing success with AI workloads. PostgreSQL’s pgvectorscale, for instance, &lt;a href="https://www.tigerdata.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data" rel="noopener noreferrer"&gt;benchmarked&lt;/a&gt; 471 QPS, against Qdrant's 41 QPS at 99% recall on 50M vectors. This consolidation means developers can now build moderate-scale production AI applications on general-purpose databases. &lt;/p&gt;

&lt;p&gt;While purpose-built vector databases excel at vector search, infrastructure consolidation outweighs specialization when the workload doesn't demand it. Consider a product documentation knowledge base with 10M embedded documents that processes 500 QPS and requires hybrid search. Traditional databases handle this workload effectively while also managing log collection, full-text search, and query analytics.&lt;/p&gt;

&lt;p&gt;One relational database that stands out in 2026 is PostgreSQL. An optimized PostgreSQL database currently supports &lt;a href="https://openai.com/index/scaling-postgresql/" rel="noopener noreferrer"&gt;OpenAI's&lt;/a&gt; ChatGPT and API, and the reason is simple: PostgreSQL gives engineers the flexibility, stability, and cost control needed for GenAI development. There are fewer moving parts, the system combines transactional safety with analytical capability, and a familiar ecosystem anchors your stack. &lt;/p&gt;

&lt;p&gt;Meanwhile, there's also the hybrid search advantage of PostgreSQL + pgvector that enables production systems to model nuanced relationships between data to match real user queries. Engineers prioritize databases that support personalization and enforce business rules such as price thresholds, categories, permissions, and date ranges. PostgreSQL achieves this richer data retrieval by merging dense and sparse vector embeddings. The database and its vector data extensions obtain query results from vector search, keyword matching, and metadata filters. &lt;/p&gt;

&lt;p&gt;Below is a Python example that demonstrates vector similarity search with metadata filtering using PostgreSQL + pgvector. The code takes a pre-filtering approach, filtering rows first by price and category before measuring vector distance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pgvector.psycopg2&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_vector&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbname=mydb user=postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;register_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;min_price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;electronics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    -- &amp;lt;=&amp;gt; is pgvector's cosine distance operator, so 1 - distance is cosine similarity
    SELECT product_name, price, category, embedding &amp;lt;=&amp;gt; %s AS distance
    FROM products
    WHERE price &amp;gt;= %s AND category = %s
    ORDER BY embedding &amp;lt;=&amp;gt; %s
    LIMIT 5
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (similarity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pure vector search focuses only on similarity search operations. In contrast, hybrid search provides a better basis for reasoning about interconnected information across diverse data types by capturing both semantic matches and contextually appropriate responses.&lt;/p&gt;
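
&lt;p&gt;One common way to merge the ranked lists returned by vector search and keyword matching is reciprocal rank fusion (RRF). The sketch below is illustrative only: the article does not specify a fusion method, and the function name, the document IDs, and the constant k=60 are assumptions rather than part of PostgreSQL or pgvector.&lt;/p&gt;

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID earns 1 / (k + rank) for every list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on the ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result IDs from two retrievers over the same table.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

&lt;p&gt;Documents that rank well in both lists rise to the top, which is the behavior hybrid search relies on. Metadata filters would typically be applied in SQL before either list is produced.&lt;/p&gt;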

&lt;p&gt;Vector-native solutions still matter, but mainly for billion-scale use cases where performance, tuned indexes, and vector quantization are a priority. If you're building RAG applications or knowledge management systems with a stable load of 50-100M vectors, traditional databases provide a unified platform where vectors and application data can reside in the same place. &lt;/p&gt;

&lt;h3&gt;
  
  
  Force 2: AI Agents Breaking the Query Model
&lt;/h3&gt;

&lt;p&gt;AI agents are issuing &lt;a href="https://tomtunguz.com/2026-predictions/" rel="noopener noreferrer"&gt;10x more queries&lt;/a&gt; than humans in 2026. This means the vector database infrastructure designed for human query patterns won't work for agents.  Autonomous systems spin up an &lt;a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems" rel="noopener noreferrer"&gt;isolated PostgreSQL instance&lt;/a&gt; in &amp;lt;500ms, rely on heavy parallelism, and ingest large datasets continuously. Low-latency databases alone won’t serve this behavior. Throughput must also scale to match the surge in concurrency that agents will introduce in 2026.&lt;/p&gt;

&lt;p&gt;However, not all vector databases are agent-ready, and optimizing for throughput often compromises latency. In production systems, these trade-offs become more pronounced. &lt;/p&gt;

&lt;p&gt;Database providers must rethink their architectural designs to align with agentic workloads. Traditional caching strategies that focused solely on storing frequently accessed embeddings must evolve to include a semantic cache, which reuses previously generated query-answer pairs when a new query is semantically similar to a cached one. This setup can reduce latency and inference costs while maintaining high throughput during traffic peaks.&lt;/p&gt;
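
&lt;p&gt;To make the semantic cache idea concrete, here is a minimal in-memory sketch. It assumes query embeddings are already computed elsewhere; the class name, the linear scan, and the 0.95 similarity threshold are illustrative choices, not any vendor's implementation.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Tiny semantic cache: linear scan over cached (embedding, answer) pairs."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cutoff; tune per workload
        self.entries = []

    def get(self, embedding):
        """Return the answer of the most similar cached query, or None."""
        best_answer, best_score = None, self.threshold
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05, 0.0])  # near-duplicate query embedding
miss = cache.get([0.0, 1.0, 0.0])   # unrelated query embedding
```

&lt;p&gt;A production cache would likely store the embeddings in a vector index instead of scanning a list, and would add expiry so stale answers are not reused.&lt;/p&gt;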

&lt;p&gt;At the indexing layer, databases must be configurable, exposing vector index parameters so engineers can tune trade-offs between speed, recall, and memory usage. To prevent server overload, databases must also move from a static maximum connection limit to dynamic pool sizing that adjusts connection pools based on real-time demand. This reduces the risk of running out of available connections under load or accumulating many idle ones. &lt;/p&gt;
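
&lt;p&gt;As a rough illustration of dynamic pool sizing, the sketch below derives a target pool size from current demand and clamps it between a floor and a ceiling. The function name, the 20% headroom, and the bounds are hypothetical values chosen for the example, not settings from any particular connection pooler.&lt;/p&gt;

```python
def target_pool_size(active, waiting, min_size=5, max_size=100, headroom_pct=20):
    """Return a connection pool size for the current demand.

    active:  connections currently in use
    waiting: requests queued for a connection
    """
    demand = active + waiting
    # Add headroom, rounding up with integer arithmetic.
    desired = (demand * (100 + headroom_pct) + 99) // 100
    # Clamp so the pool neither starves callers nor overloads the server.
    return max(min_size, min(max_size, desired))


quiet = target_pool_size(active=2, waiting=0)    # falls back to the floor
busy = target_pool_size(active=40, waiting=10)   # demand plus 20% headroom
spike = target_pool_size(active=90, waiting=60)  # capped at the ceiling
```

&lt;p&gt;The floor avoids cold starts when traffic returns, and the ceiling protects the database from an agent-driven surge in concurrency.&lt;/p&gt;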

&lt;p&gt;In 2026, vector databases must rewire infrastructure design for an agentic era rather than waiting to be shaped by it.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Force 3: The Deployment Gap Nobody's Filling
&lt;/h3&gt;

&lt;p&gt;While cloud databases have scaled to handle billions of vectors, developers building privacy-first, latency-sensitive applications at the edge are still being ignored in 2026. &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.marketsandmarkets.com/Market-Reports/edge-computing-market-133384090.html" rel="noopener noreferrer"&gt;edge computing market&lt;/a&gt; was worth $168B in 2025, and &lt;a href="https://iot-analytics.com/number-connected-iot-devices/" rel="noopener noreferrer"&gt;IoT Analytics&lt;/a&gt; estimates the number of connected IoT devices will hit 39 billion by 2030. There's an active market, yet no one has filled the deployment gap. &lt;/p&gt;

&lt;p&gt;What the market is ignoring is that cloud-only databases are not equipped for offline scenarios with limited bandwidth and intermittent connectivity. Critical applications, such as those in healthcare, demand real-time responses (&amp;lt;10ms) and continuous system availability. Inability to operate during outages can cost between $700 and $450,000 per hour, depending on the industry. An edge setup can provide that always-on infrastructure while cutting transit costs. &lt;/p&gt;

&lt;p&gt;There are also the data security, compliance, and sovereignty requirements that regulated applications must meet by keeping data on-premises. Fulfilling these constraints means adapting infrastructure to support a secure, decentralized computing model that cloud systems cannot deliver. Edge deployment minimizes data movement and isolates sensitive workloads to reduce compliance scope. &lt;/p&gt;

&lt;p&gt;For air-gapped environments, localized decision-making is non-negotiable. Public cloud deployments rely on persistent connections, but applications operating within a controlled perimeter must avoid outbound connections. Adopting a private cloud approach is costly and resource-intensive, whereas edge infrastructure succeeds by processing data locally at the source.&lt;/p&gt;

&lt;p&gt;Yet in 2026, moving the edge beyond do-it-yourself setups is still in its early stages, despite a thriving market. Most hyperscalers currently treat edge computing as an extension of their existing cloud business. What the market needs is an edge-native solution that scales vertically to improve the network capacity, storage power, and processing ability of existing machines. But everyone still builds for the cloud. &lt;/p&gt;

&lt;p&gt;These three forces reveal a market that needs careful architectural reevaluation. One option is a hybrid approach that combines cloud and on-premises deployment for edge use cases. Another is returning to the Postgres environment we are already familiar with. &lt;/p&gt;

&lt;h2&gt;
  
  
  The PostgreSQL Renaissance (and What It Means)
&lt;/h2&gt;

&lt;p&gt;Hyperscalers have been doubling down on PostgreSQL, and more engineers are choosing the database for enterprise-grade AI applications. This resurgence in interest and usage signals a change in infrastructure requirements for GenAI development. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Hyperscalers Bet Big on PostgreSQL
&lt;/h3&gt;

&lt;p&gt;Every hyperscaler has integrated PostgreSQL technology into its database services. Google offers Cloud SQL for PostgreSQL and AlloyDB, AWS has Amazon Aurora and Amazon RDS for PostgreSQL, and Microsoft provides Azure Database for PostgreSQL. Top data warehouse providers are not left out of this PostgreSQL adoption either. &lt;/p&gt;

&lt;p&gt;In May 2025, Databricks acquired Neon for $1B. Snowflake followed the same trend in June 2025, acquiring Crunchy Data for an estimated $250M. In October 2025, Supabase also raised $100M in Series E funding. &lt;/p&gt;

&lt;p&gt;Hyperscalers recognize PostgreSQL's familiar, versatile, and extensible infrastructure, which already powers many enterprise databases, and leverage it to support engineers building agentic AI applications with PostgreSQL compatibility. With a 40-year market run, the open-source database has developed mature tooling that is flexible enough for both online transaction processing (OLTP) and AI application development. Plus, its dual JSON and vector support enables teams to build on the foundation they already know and scale from it.  &lt;/p&gt;

&lt;p&gt;At the same time, PostgreSQL’s pgvector and pgvectorscale extensions, with HNSW and StreamingDiskANN indexes, mean vector storage and similarity search happen directly within the database. &lt;/p&gt;

&lt;p&gt;Another factor fueling the PostgreSQL comeback is its ACID-compliant engine. Hyperscalers work with enterprise teams seeking data integrity and application stability for critical systems such as financial applications. PostgreSQL's transactional guarantees offer predictable and consistent behavior for production workloads. &lt;/p&gt;

&lt;p&gt;Despite hyperscalers’ convergence on PostgreSQL, AWS has presented a counter-trend to its PostgreSQL-based offerings with S3 Vectors. Instead of indexing vectors inside a database, embeddings live in object storage, with support for up to 2 billion vectors per index. &lt;a href="https://aws.amazon.com/blogs/aws/amazon-s3-vectors-now-generally-available-with-increased-scale-and-performance/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; positions this storage-first model as a 90% TCO reduction for AI workloads, accepting higher latency (&amp;gt;100ms) in exchange for cost efficiency. This deviation highlights PostgreSQL's scale limits. &lt;/p&gt;

&lt;p&gt;PostgreSQL is fast enough for many vector data workloads, but specialized architectures still win at scale. For instance, PostgreSQL’s multiversion concurrency control (MVCC) implementation is inefficient for write-heavy workloads, like real-time chat systems. During high write traffic, tables bloat and indexes require more maintenance, which in turn degrades application performance. &lt;/p&gt;

&lt;h3&gt;
  
  
  When PostgreSQL with pgvector Is Enough
&lt;/h3&gt;

&lt;p&gt;If your application already relies on PostgreSQL, introducing pgvector is a natural extension that avoids adopting new infrastructure or performing costly data migrations. Your vectors live next to your relational data, and you can query them in the same transaction using both similarity search and SQL JOINs. Because results can also be constrained by metadata, this hybrid search capability improves your application's retrieval layer and data management beyond what pure vector search offers. &lt;/p&gt;

&lt;p&gt;PostgreSQL + pgvector also performs well for moderate-scale vector operations such as enterprise knowledge bases or internal RAG applications, where you're handling &amp;lt;100M vectors, with sub-100ms latency requirements. &lt;/p&gt;

&lt;h3&gt;
  
  
  When You Still Need Purpose-built
&lt;/h3&gt;

&lt;p&gt;If vector search is your primary workload, purpose-built platforms offer indexing structures, high-precision similarity search, and low-latency execution paths tuned for billion-scale vectors and high-throughput applications like recommendation or search engines. Dedicated databases are also effective if your search requirements demand specific capabilities like an HNSW index with dynamic edge pruning or sub-vector product quantization.&lt;/p&gt;

&lt;p&gt;This table summarizes the key differentiators between purpose-built databases and PostgreSQL + pgvector extension.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Purpose-built&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Performance (QPS)&lt;/td&gt;
&lt;td&gt;&amp;gt;5k QPS&lt;/td&gt;
&lt;td&gt;500–1500 QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale (max vectors)&lt;/td&gt;
&lt;td&gt;Billions of vectors&lt;/td&gt;
&lt;td&gt;&amp;lt;100M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;&amp;lt;50 ms&lt;/td&gt;
&lt;td&gt;&amp;lt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost model&lt;/td&gt;
&lt;td&gt;Usage-based for cloud-native databases; infrastructure-driven for self-hosted&lt;/td&gt;
&lt;td&gt;Infrastructure-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;td&gt;Fully managed for cloud-based databases; self-hosted options require infrastructure ownership&lt;/td&gt;
&lt;td&gt;Requires proficiency in SQL and PostgreSQL-specific features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer experience&lt;/td&gt;
&lt;td&gt;Designed for speed and abstraction; provides APIs and SDKs&lt;/td&gt;
&lt;td&gt;Broad tooling support with many connectors and libraries for different development use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One key factor driving teams to rethink database choices in 2026 is cost. Cloud-based vector databases like Pinecone reveal something uncomfortable about cloud bills. &lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Economics Are Breaking (Usage-Based Pricing at Scale)
&lt;/h2&gt;

&lt;p&gt;Usage-based pricing seems cost-effective for modest workloads until a system succeeds. Consider a RAG application handling 10M queries per month. At first, the base storage and computational cost feel predictable. But as traffic grows to 150M queries per month, the cumulative costs of storage, database lookups, indexing recomputation, and egress fees reveal how volatile usage-based billing becomes at scale. &lt;/p&gt;

&lt;p&gt;For instance, with 100M (1024-dim) vectors, 150M queries, and 10M writes per month, your estimated Pinecone bill for the RAG application will total around $5,000-$6,000, accounting only for storage, query cost, and write cost. If you factor in egress fees of about $0.08 per GB, the bill escalates further when data transfer is involved.&lt;/p&gt;

&lt;p&gt;Teams using cloud-based vector databases have reported surprise bills up to $5,000 on Reddit. Market pricing trends also echo this cloud bill volatility. In 2025, cloud vendors introduced &lt;a href="https://www.saastr.com/the-great-price-surge-of-2025-a-comprehensive-breakdown-of-pricing-increases-and-the-issues-they-have-created-for-all-of-us/" rel="noopener noreferrer"&gt;price hikes&lt;/a&gt; estimated at 9-25%, and between 2010 and 2024, cloud database costs increased by 30%, with usage-based pricing becoming the dominant model. &lt;/p&gt;

&lt;p&gt;In cloud environments, &lt;a href="https://www.actian.com/blog/databases/the-hidden-cost-of-vector-database-pricing-models/" rel="noopener noreferrer"&gt;costs scale unpredictably&lt;/a&gt; with growing data volume and query frequency. Pay-as-you-go pricing is the accelerant here, amplifying unreliable cost forecasting. Meanwhile, cloud vendors’ incentives scale with your consumption. More queries, storage, and processing result in higher, unpredictable bills for teams, while vendor revenue grows. &lt;a href="https://www.deloitte.com/us/en/what-we-do/capabilities/cloud-transformation/articles/cloud-consumption-model.html" rel="noopener noreferrer"&gt;Deloitte&lt;/a&gt; reported that companies adopting usage-based models grow revenue 38% faster year-over-year. &lt;/p&gt;

&lt;p&gt;Consumption-driven billing promises automatic scaling with workload demand. But teams often lack visibility into exactly what drives the spend and receive bills that cover not only active queries but also idle replicas, redundant embedding recomputation, and cloud add-ons. Given the variability of the usage-based pricing model, it makes sense to reassess deployment strategy.&lt;/p&gt;

&lt;p&gt;For workloads with predictable traffic, teams can trade the flexibility of a usage-based model for the cost stability of reserved capacity. For instance, committing to a one-year reserved capacity plan can reduce the cost of handling 150M queries per month to $40,000-$42,000 annually, about 32% less than the usage-based pricing cost. &lt;/p&gt;
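
&lt;p&gt;The savings figure can be sanity-checked with simple arithmetic. The sketch below annualizes the $5,000-$6,000 monthly usage-based estimate quoted earlier and compares it with the midpoint of the reserved-capacity range. Only the dollar amounts come from the text; the helper name and the use of a midpoint are assumptions for illustration.&lt;/p&gt;

```python
def percent_saving(baseline, alternative):
    """Percentage by which `alternative` is cheaper than `baseline`."""
    return (baseline - alternative) / baseline * 100


# Figures quoted in the text (USD).
usage_monthly_low, usage_monthly_high = 5_000, 6_000
reserved_annual_low, reserved_annual_high = 40_000, 42_000

usage_annual_low = usage_monthly_low * 12    # 60,000
usage_annual_high = usage_monthly_high * 12  # 72,000

# Assumed midpoint of the reserved-capacity range.
reserved_mid = (reserved_annual_low + reserved_annual_high) / 2

saving_vs_low = percent_saving(usage_annual_low, reserved_mid)
saving_vs_high = percent_saving(usage_annual_high, reserved_mid)

print(f"{saving_vs_low:.1f}% to {saving_vs_high:.1f}%")
```

&lt;p&gt;Under these assumptions, the roughly 32% saving corresponds to the low end of the usage-based estimate; measured against the high end, the gap would be wider.&lt;/p&gt;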

&lt;p&gt;Migrating to on-premises infrastructure is another alternative for teams with existing DevOps maturity. There are upfront hardware and security investments, but an optimized on-premises deployment can significantly reduce cost. For instance, a self-hosted Milvus deployment handling 150M vectors might require three &lt;code&gt;m5.2xlarge&lt;/code&gt; instances plus distributed storage, totaling around $900-$1,000 per month. &lt;/p&gt;

&lt;p&gt;For latency-critical workloads, edge processing provides another path. Processing 5TB of data at the edge, for example, can save approximately $400-$600 in egress fees. But there's still a huge gap in edge deployment. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Edge Deployment Gap (Where the Market Isn't Looking)
&lt;/h2&gt;

&lt;p&gt;Market attention has focused on cloud vector databases, but they don’t tell the full story of what is happening in offline and air-gapped environments where security, ultra-low latency, decentralization, and compliance are non-negotiables. &lt;/p&gt;

&lt;p&gt;In 2026, &lt;a href="https://services.global.ntt/en-us/newsroom/new-report-finds-enterprises-are-accelerating-edge-adoption#:~:text=your%20business%20transformation-,2026%20Global%20AI%20Report:%20A%20Playbook%20for%20AI%20Leaders,San%20Jose%2C%20Calif" rel="noopener noreferrer"&gt;more enterprises&lt;/a&gt; are leaning towards edge deployment, indicating a rethink of how teams want to handle data processing. Regulated industries need infrastructure that runs where most data decisions are already made, on devices at the network’s edge. Edge deployment meets this demand by keeping computation closer to the source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2023-08-01-gartner-identifies-top-trends-shaping-future-of-data-science-and-machine-learning" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; projects that 55% of deep neural network data analysis will occur at the edge. Yet the edge AI ecosystem remains immature. Cloud is not dead, but there are mission-critical workloads today that cloud deployment cannot support efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases Cloud Vendors Can't Address
&lt;/h3&gt;

&lt;p&gt;While cloud vendors offer mature features for integrating vector search into enterprise workflows, there are still use cases they aren't equipped to handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Medical data and patient records often reside on-premises, governed by HIPAA, GDPR, and other privacy regulations. Hospitals need real-time health analysis happening on-premises, as migrating private data to the cloud expands their attack surface, requires a strong security posture, and increases compliance overhead. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous systems&lt;/strong&gt;: Autonomous vehicles need split-second local decision-making on camera and LiDAR data to maintain situational awareness, with or without external connectivity. Network round-trips to cloud servers limit the delivery of this time-sensitive data. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Military&lt;/strong&gt;: Military services manage sensitive assets through classified networks in an air-gapped and high-risk environment. They expect to push an update to an edge node and have it go live across the fleet in real time for tactical operations. Military services cannot tolerate the network latency and bandwidth constraints of the public cloud. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing&lt;/strong&gt;: Manufacturing sites’ networks carry real-time sensor streams, safety systems, and production telemetry that require immediate analysis for predictive maintenance and operational efficiency. Some manufacturing facilities operate in remote locations with no connectivity, so going “cloud-first” is impractical; they need solutions designed for interference-heavy factory floors.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retail&lt;/strong&gt;: Retail businesses need consistent local retrieval and immediate analysis of point-of-sale data, regardless of intermittent connectivity, as downtime costs approximately $700 per hour. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These use cases show where cloud vector databases still struggle to meet the latency and security requirements of on-device data. What features enable edge vector databases to satisfy these requirements, and why are comprehensive solutions still scarce? &lt;/p&gt;

&lt;h3&gt;
  
  
  What an Edge Vector Database Needs
&lt;/h3&gt;

&lt;p&gt;Edge vector databases run on edge servers, enabling AI applications to process data stored locally and receive responses in real time without waiting for back-and-forth communication with the cloud. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjscamxrlhi4pjo7ef3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjscamxrlhi4pjo7ef3z.png" alt="Figure 3: Cloud vs. edge vector database architecture" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike cloud environments, which assume steady connectivity and large compute power, edge solutions are engineered to manage unstable networks and process local data under resource constraints. With edge vector databases, data stays at its point of generation, ingestion and analysis happen in real time, and the system adapts to unpredictable conditions at the edge.&lt;/p&gt;

&lt;p&gt;There are three core design requirements an &lt;a href="https://www.actian.com/glossary/edge-databases/#:~:text=Reduced%20Latency:%20Traditional%20data%20storage,store%20frequently%20accessed%20data%20locally." rel="noopener noreferrer"&gt;edge database&lt;/a&gt; needs to deliver on this promise of speed and reliability: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight infrastructure&lt;/strong&gt;: Distributed operations require infrastructure that is lightweight and deployable by design for resource-constrained edge servers. Having a compact in-memory data structure also helps to minimize the database memory footprint. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability&lt;/strong&gt;: Edge databases must execute local data analytics without relying on connected servers. Even with intermittent connectivity and limited bandwidth, AI applications should remain functional and operate independently.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync-when-connected architecture&lt;/strong&gt;: Edge databases must automatically sync offline data, resolve conflicts, and reflect data changes when connectivity is restored. This mechanism helps to track performance metrics locally and maintain operational visibility.&lt;/li&gt;
&lt;/ul&gt;
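
&lt;p&gt;The sync-when-connected requirement above can be sketched with a last-write-wins merge, one of the simplest conflict-resolution strategies. This is an illustration under stated assumptions: the record keys, timestamps, and function name are hypothetical, and last-write-wins presumes device clocks are reasonably synchronized. Real systems may need vector clocks or CRDTs instead.&lt;/p&gt;

```python
def merge_last_write_wins(cloud, edge_changes):
    """Merge offline edge changes into cloud state; the newest write wins.

    Both arguments map a record key to a (timestamp, value) tuple.
    """
    merged = dict(cloud)
    for key, (ts, value) in edge_changes.items():
        current = merged.get(key)
        # Apply the edge write if the key is new or the edge copy is newer.
        if current is None or ts > current[0]:
            merged[key] = (ts, value)
    return merged


# Hypothetical state: the edge node queued three writes while offline.
cloud_state = {"sensor-1": (100, "ok"), "sensor-2": (120, "warn")}
offline_queue = {
    "sensor-1": (130, "fault"),  # newer than cloud: edge wins
    "sensor-2": (110, "ok"),     # older than cloud: cloud wins
    "sensor-3": (125, "ok"),     # unknown to cloud: added
}

synced = merge_last_write_wins(cloud_state, offline_queue)
```

&lt;p&gt;Queuing writes locally and replaying them on reconnect keeps the application functional offline, while the merge step decides which copy survives when both sides changed the same record.&lt;/p&gt;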

&lt;p&gt;Despite growing demand, the database market has few edge-native solutions because designing one that ticks the lightweight, offline-capable, and synchronization boxes is complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Nobody's Building This
&lt;/h3&gt;

&lt;p&gt;The edge deployment model remains an underdeveloped market with fragmented tooling for several reasons. &lt;/p&gt;

&lt;p&gt;One, edge infrastructure is complex, emphasizing fault tolerance and near-instant latency. Teams also need immediate visibility into device status, synchronization health, and data integrity across potentially thousands of endpoints. But edge devices, such as sensors and cameras, have limited compute and memory resources. &lt;/p&gt;

&lt;p&gt;Even enterprise-level control hosts often cap at 2-16GB of memory, significantly smaller than the memory centralized servers provide. Running inference on these devices will waste resources at their edge nodes and increase latency. Optimizing for real-time results becomes harder. &lt;/p&gt;

&lt;p&gt;However, that hardware baseline is improving. Advancements in edge computing, including the adoption of Ampere architecture, and the increasing prevalence of devices like the Jetson Nano, are expanding the amount of usable compute available at the edge. &lt;/p&gt;

&lt;p&gt;Another challenge is that edge computing is inherently distributed, with configurations varying across many hardware devices that operate independently. This hardware heterogeneity complicates data synchronization between diverse edge devices, especially as workloads shift across an unpredictable network. &lt;/p&gt;

&lt;p&gt;Nobody is building edge deployment models because of the operational complexity and specialization they require. Purpose-built databases like Qdrant add edge computing support, but still primarily operate under a centralized model. Edge-specific databases barely exist, with ObjectBox being a rare exception. The vendors who get it right must find a balance between strict latency requirements, hardware orchestration, consistent operational performance, and computational power.&lt;/p&gt;

&lt;p&gt;This table highlights where each available database deployment strategy thrives and where it falls short. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Deployment model&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native&lt;/td&gt;
&lt;td&gt;Ready-to-use solution, faster time-to-success, auto-scaling&lt;/td&gt;
&lt;td&gt;High TCO at scale, cyberattack vulnerability, and increased latency with each network hop&lt;/td&gt;
&lt;td&gt;Teams seeking managed infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises&lt;/td&gt;
&lt;td&gt;Development flexibility, full control and customization, data privacy&lt;/td&gt;
&lt;td&gt;High upfront fees, maintenance burden&lt;/td&gt;
&lt;td&gt;Organizations in regulated sectors with stringent data privacy requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge/offline&lt;/td&gt;
&lt;td&gt;Near-instant latency, local data processing&lt;/td&gt;
&lt;td&gt;Emerging market, lacks infrastructure software&lt;/td&gt;
&lt;td&gt;Engineers building latency-critical AI applications or seeking decentralized data processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Keeps control systems local while leveraging cloud analytics&lt;/td&gt;
&lt;td&gt;Management complexity, high latency&lt;/td&gt;
&lt;td&gt;Organizations seeking both cloud scalability and on-prem flexibility and security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engineers can explore a hybrid approach that combines cloud for elasticity, on-premises for flexibility, and edge for speed. &lt;/p&gt;

&lt;h2&gt;
  
  
  What To Do in 2026 (Decision Framework)
&lt;/h2&gt;

&lt;p&gt;The decision you make in 2026 can mean the difference between an AI application that thrives and one that struggles. Your architecture evaluation should prioritize your performance goals, scale, preferred cost model, existing stack, regulatory requirements, and data sovereignty needs. &lt;/p&gt;

&lt;h3&gt;
  
  
  If You're Starting Fresh
&lt;/h3&gt;

&lt;p&gt;Workload patterns should be your decision driver, not industry trends or scale panic. Is your AI application handling: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;10M vectors&lt;/strong&gt;: Start with PostgreSQL + pgvector, especially if your core data already lives in PostgreSQL. pgvector thrives with moderate data scale, and its hybrid search architecture improves retrieval quality for RAG applications. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10M-100M vectors&lt;/strong&gt;: Both purpose-built databases and PostgreSQL's pgvectorscale can serve your workload, but with trade-offs. PostgreSQL + pgvectorscale works effectively at this scale, but performance might degrade with dynamic workloads or concurrent queries. Purpose-built databases are better at auto-scaling with increased data volume and at maintaining consistent latency during traffic spikes. The trade-off is unpredictable cloud costs or operational overhead for self-hosted solutions. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100M+ vectors&lt;/strong&gt;: Use specialized vector databases like Pinecone, Qdrant, and Milvus. They are designed for billion-scale vector operations, especially for high-throughput vector search (&amp;gt; 1,000 QPS) and high concurrent writes. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, if your application must run offline, the options on the market are still limited.&lt;/p&gt;
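
&lt;p&gt;The thresholds above can be sketched as a small helper. This is an illustrative sketch rather than a definitive rule: the function name, its parameters, and the returned labels are assumptions made for this example, and the 10M and 100M cut-offs are the rough boundaries from the list, not hard limits.&lt;/p&gt;

```python
# Rough boundaries taken from the list above; treat them as guides, not hard limits.
TEN_MILLION = 10_000_000
HUNDRED_MILLION = 100_000_000


def recommend_starting_point(vector_count: int, must_run_offline: bool = False) -> str:
    """Map a projected vector count to a starting point (hypothetical helper)."""
    if 0 > vector_count:
        raise ValueError("vector_count must be non-negative")
    # Offline requirements override scale: the market for this case is still thin.
    if must_run_offline:
        return "edge/on-premises: evaluate the limited offline-capable options"
    if vector_count >= HUNDRED_MILLION:
        return "specialized vector database"
    if vector_count >= TEN_MILLION:
        return "PostgreSQL + pgvectorscale or purpose-built: benchmark both"
    return "PostgreSQL + pgvector"


if __name__ == "__main__":
    for count in (2_500_000, 40_000_000, 750_000_000):
        print(count, recommend_starting_point(count))
```

&lt;p&gt;Vector count is only the first filter. QPS, write concurrency, and where your core data already lives should still be checked before committing to one of these paths.&lt;/p&gt;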

&lt;h3&gt;
  
  
  If You're Already Using a Vector Database
&lt;/h3&gt;

&lt;p&gt;Architect for expansion, but analyze your present situation. You should: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate cost trajectory&lt;/strong&gt;: Track your actual monthly spend, considering factors like data volume, QPS requirements, storage, and computation. Using your projected growth, estimate what your bill will look like in 12 months. If the numbers demand a more predictable cost model, consider reserved capacity or on-premises deployment. But if usage-based pricing better aligns with your budget and scale, continue with it. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark query patterns&lt;/strong&gt;: Measure the dataset size your application processes monthly and its average query latency. If you're hitting agent-scale query volumes, consider optimization methods like semantic caching and quantization, or horizontal scaling techniques like sharding. Sharding partitions agent memory, embeddings, and tool state, which enables parallel writes. For fluctuating workloads, future-proofing your vector database means designing for elastic scaling, which cloud solutions can provide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider PostgreSQL migration if scale permits&lt;/strong&gt;: If your scale is moderate and growth is slow (for instance, 10M vectors, 200 QPS average, doubling every 6-12 months), migrating to PostgreSQL is a reasonable fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assess deployment model constraints&lt;/strong&gt;: Understand the strengths and limitations of your current runtime environment. Cloud vendors introduce non-linear costs and compliance overhead. On-premises setup presents high upfront expenses and limited elasticity. Edge deployment means limited resources and synchronization complexity. Being realistic about these constraints helps you validate that switching vector databases solves a real problem rather than creating new ones. &lt;/li&gt;
&lt;/ul&gt;
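
&lt;p&gt;The 12-month estimate in the cost-trajectory step can be sketched as compound growth. This sketch assumes the bill grows in proportion to usage at a constant monthly rate, which real pricing (tiers, minimums, reserved discounts) will not follow exactly. The function names and the sample figure of 1,000 per month are hypothetical.&lt;/p&gt;

```python
def monthly_rate_from_doubling(doubling_period_months: float) -> float:
    """Convert 'usage doubles every N months' into a constant monthly growth rate."""
    if 0 >= doubling_period_months:
        raise ValueError("doubling_period_months must be positive")
    return 2 ** (1 / doubling_period_months) - 1


def project_monthly_bill(current_bill: float, monthly_growth_rate: float, months: int = 12) -> float:
    """Estimate the monthly bill after `months` of compound growth.

    Assumption (not from any vendor's pricing): cost scales linearly with
    usage, and usage grows at the same rate every month.
    """
    if 0 > months:
        raise ValueError("months must be non-negative")
    return current_bill * (1 + monthly_growth_rate) ** months


if __name__ == "__main__":
    current_bill = 1_000.0  # hypothetical current monthly spend
    for doubling_period in (6, 12):
        rate = monthly_rate_from_doubling(doubling_period)
        projected = project_monthly_bill(current_bill, rate)
        print(f"doubling every {doubling_period} months: about {projected:,.0f} per month in a year")
```

&lt;p&gt;Under these assumptions, doubling every 12 months roughly doubles the bill in a year, while doubling every 6 months roughly quadruples it. Comparing that figure against a reserved-capacity or on-premises quote makes the trade-off in the list above concrete.&lt;/p&gt;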

&lt;h3&gt;
  
  
  If You Need Edge/On-premises
&lt;/h3&gt;

&lt;p&gt;Understand that while cloud vendors compete for hyperscale workloads, edge deployment remains largely unaddressed. As a result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate the few available options&lt;/strong&gt;: Native edge deployment solutions are scarce. Existing options include ObjectBox, an on-device NoSQL object database, and pgEdge, which extends standard PostgreSQL for distributed setups. There are also industry-specific custom edge solutions. Each comes with trade-offs in maturity, scalability, or ecosystem support.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider using PostgreSQL on-premises with pgvector&lt;/strong&gt;: If you already have operational capacity, deploying PostgreSQL on-premises gives you total control over your database environment. The trade-off is that performance tuning, monitoring, and security become your responsibility. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anticipate new market entrants&lt;/strong&gt;: The native edge deployment gap remains largely overlooked by major vendors, but emerging solutions, such as &lt;a href="https://www.actian.com/databases/vectorai-db/" rel="noopener noreferrer"&gt;Actian VectorAI DB&lt;/a&gt;, are addressing it with a database that accounts for the physical and network realities of offline scenarios. Specifically, Actian supports local data analytics in environments with unstable connectivity, such as store checkout hardware and factory-floor machinery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flowchart below captures this decision framework at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96kenw5s53ovqgw67d4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96kenw5s53ovqgw67d4n.png" alt="Figure 4: Choosing a vector database in 2026" width="800" height="1375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;This analysis has spotlighted fundamental shifts in a market that focused squarely on purpose-built vector databases before 2025. &lt;/p&gt;

&lt;p&gt;In 2026, vectors are a data type, and more teams are returning to the relational databases where their data already lives and leveraging their vector extensions. PostgreSQL is at the forefront of this renewed interest, providing the ACID compliance, operational expertise, and flexibility that GenAI applications need. For purpose-built solutions, this means they now matter only for high-throughput, recall-sensitive systems. &lt;/p&gt;

&lt;p&gt;Meanwhile, even for high-throughput vector databases, AI agents’ query pressure is forcing a rethink of architectural design to support parallel writes and concurrent requests at a new scale. On top of this, fragmentation defines edge and on-premises deployments, with few straightforward approaches for processing data closer to the point of production.&lt;/p&gt;

&lt;p&gt;Looking ahead, the next shift will come from vendors that move beyond 2024's cloud-first database promotions to cater to the growing demand for offline-capable architecture. If you need to run AI workloads on-premises or at the edge, the options in 2026 are still limited, but that gap is starting to close with databases like Actian VectorAI DB. &lt;a href="https://www.actian.com/databases/vectorai-db/#waitlist" rel="noopener noreferrer"&gt;Join the waitlist&lt;/a&gt; for early access. &lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>database</category>
      <category>vectoraidb</category>
    </item>
  </channel>
</rss>
