DEV Community: Offisong Emmanuel

Should You Use RAG or Fine-Tune Your LLM?

Offisong Emmanuel — Tue, 19 May 2026 19:43:02 +0000

The debate over retrieval augmented generation (RAG) vs. fine-tuning appears simple at first glance. RAG pulls in external data at inference time. Fine-tuning modifies model weights during training. In production systems, that distinction is insufficient.

According to the Menlo Ventures 2024 State of Generative AI in the Enterprise report, 51 percent of enterprise AI deployments use RAG in production. Only nine percent rely primarily on fine-tuning. Yet research such as the RAFT study from UC Berkeley shows that hybrid systems combining retrieval and fine-tuning outperform either approach alone across benchmarks.

If hybrid systems can produce better results, why does industry adoption so heavily favor RAG alone? In this article, we’ll compare RAG, fine-tuning, and a hybrid architecture to understand the trade-offs and where each approach excels.

TL;DR

RAG: Best for frequently changing knowledge and moderate traffic; easy to update without retraining.
Fine-tuning: Best for stable domains and high-volume or low-latency tasks; improves task-specific accuracy and formatting.
Hybrid/RAFT: Combines up-to-date retrieval with optimized model behavior for the highest accuracy.
Key trade-off: Choice depends on query volume, how often knowledge changes, and team expertise.

Why the Standard RAG vs. Fine-Tuning Comparison Fails

RAG is a method where the model dynamically pulls in external data at inference time. Each query retrieves relevant documents or knowledge chunks, which the system appends to the prompt, allowing the model to produce answers grounded in current information.
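
To make the retrieve-then-append flow concrete, here is a minimal sketch in Python. It assumes a toy bag-of-words similarity in place of a real embedding model and vector database, and the knowledge base entries, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Append the retrieved chunks to the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical knowledge base, for illustration only.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "The premium plan includes priority support.",
    "Passwords must be rotated every 90 days.",
]

print(build_prompt("How long do refunds take?", KNOWLEDGE_BASE, k=1))
```

In a production system, `tokenize` and `cosine` would give way to an embedding model and a vector index, but the shape of the pipeline stays the same.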

Fine-tuning is the process of modifying a model’s weights during training using labeled data. Instead of relying on external retrieval, the model internalizes patterns directly, producing consistent outputs without querying external sources.

While these definitions are technically correct, most standard comparisons miss the factors that actually drive decisions in production. In real-world systems, the choice between RAG and fine-tuning depends on variables like scale, query volume, and how often your data changes.

Missing variable 1: Context expansion at scale

In many production RAG systems, every request appends hundreds of tokens. That added context changes how the model distributes attention across the prompt.

Large retrieved contexts compete for attention with the prompt and instructions, which can dilute signal quality. Small retrieval errors or loosely relevant chunks can introduce formatting drift, or shift reasoning in subtle ways. The system’s output becomes tightly coupled to retrieval quality.

Fine-tuning works differently. Instead of injecting large volumes of text at inference time, it embeds patterns and constraints directly into the model during training. The distinction affects how the system behaves under real workloads.

Missing variable 2: Retraining frequency

The common advice says “use RAG if knowledge changes frequently” and “use fine-tuning if behavior is stable.” But how frequently is “frequently”?

If your knowledge base changes daily, retraining pipelines may introduce operational friction. Evaluation cycles, dataset versioning, and deployment validation all add delay.

Data preparation also matters. If your organization lacks structured, versioned, and clean datasets, the hidden cost of preparing training data can exceed compute costs.

The Cost Math of RAG vs. Fine-Tuning

Surface-level comparisons of RAG and fine-tuning often ignore the cost curves that determine long-term viability. In production systems, cost estimates are central to architectural decisions. To evaluate RAG vs. fine-tuning realistically, we need to examine three cost layers:

Token cost and context expansion
Retrieval infrastructure cost
Training infrastructure cost

The cost structure of RAG

RAG systems introduce a recurring operational cost because each query retrieves external information and injects it into the model’s prompt. That additional context is billed on every request.

Context expansion
Production RAG systems often append around 500 tokens of retrieved context to each query. The provider bills those tokens on every request.

Using pricing similar to GPT-5.2 at $1.75 per million input tokens, the incremental cost works out as follows:

Cost per query
500 tokens × $1.75/1,000,000 = $0.000875 per query

At a small scale, this cost appears negligible. However, because it applies to every query, the total overhead grows linearly with traffic.

At different traffic levels:

Monthly queries	Context cost
10 million	$8,750
50 million	$43,750
100 million	$87,500

This is context overhead alone. It does not include output tokens or base prompt tokens. At a sustained scale, what appears flexible and inexpensive becomes a significant recurring expense.
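
The arithmetic behind the table is simple enough to script, which makes it easy to rerun with your own token counts and pricing. The sketch below uses the 500-token and $1.75-per-million figures from this section as defaults; the function names are hypothetical.

```python
def context_cost_per_query(context_tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of the retrieved context attached to one query."""
    return context_tokens * price_per_million_usd / 1_000_000

def monthly_context_cost(
    monthly_queries: int,
    context_tokens: int = 500,
    price_per_million_usd: float = 1.75,
) -> float:
    """Context overhead only; excludes base prompt and output tokens."""
    return monthly_queries * context_cost_per_query(context_tokens, price_per_million_usd)

for queries in (10_000_000, 50_000_000, 100_000_000):
    print(f"{queries:>11,} queries -> ${monthly_context_cost(queries):,.0f}/month")
```

Because the relationship is linear, doubling either the retrieved context size or the traffic doubles the overhead.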

Vector database and retrieval cost
Token cost is only one component of RAG costs. RAG also relies on a vector database for semantic search. The system must store, index, and query embeddings efficiently.

Pinecone’s public pricing lists:

Storage at approximately $0.33 per gigabyte per month
Read units at approximately $16 per million
Write units at approximately $4 per million

For example, consider a system handling 50 million queries per month, where each query performs a single vector search (assuming a 1,024-dimension vector). That would result in 50 million read operations monthly. If the system also writes approximately six million records per month, the combined read and write activity would bring the total estimated monthly cost to around $1,532.

At 200 million queries per month, the total expense rises to roughly $9,000 per month.

Two RAG systems serving identical traffic can therefore have materially different cost structures depending on how the vector database is designed and optimized.
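
The pricing components above can be combined in a small calculator. This is a sketch rather than Pinecone's billing logic: it assumes one read unit per query and one write unit per record by default, both of which can vary in practice, and it takes stored gigabytes as an input because the text does not state them. All class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorDbPricing:
    """Approximate list prices quoted in this section (USD)."""
    storage_per_gb_month: float = 0.33
    read_per_million: float = 16.0
    write_per_million: float = 4.0

def monthly_vector_db_cost(
    queries: int,
    writes: int,
    stored_gb: float,
    read_units_per_query: float = 1.0,    # assumption: one unit per search
    write_units_per_record: float = 1.0,  # assumption: one unit per record
    pricing: VectorDbPricing = VectorDbPricing(),
) -> dict:
    """Break a monthly vector database bill into reads, writes, and storage."""
    reads = queries * read_units_per_query / 1_000_000 * pricing.read_per_million
    write_cost = writes * write_units_per_record / 1_000_000 * pricing.write_per_million
    storage = stored_gb * pricing.storage_per_gb_month
    return {
        "reads": reads,
        "writes": write_cost,
        "storage": storage,
        "total": reads + write_cost + storage,
    }

costs = monthly_vector_db_cost(queries=50_000_000, writes=6_000_000, stored_gb=0)
print({name: round(value, 2) for name, value in costs.items()})
```

With the default one-unit assumptions, 50 million queries and six million writes contribute $800 and $24 respectively. Storage and any per-query unit multiplier account for the rest of a real bill, which is one reason two systems with identical traffic can land on different totals.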

Infrastructure cost
RAG systems require storage and compute infrastructure to generate embeddings, store and index vectors, execute retrieval queries, and run inference. Each of these stages consumes compute resources, typically provisioned through cloud servers that must scale with traffic.

For real-time or high-throughput applications, additional capacity is required to maintain low latency and system reliability. Replication, autoscaling, monitoring, and failover mechanisms all add operational complexity. These infrastructure layers are essential for production-grade RAG, but they expand the total cost footprint beyond token usage alone.

The cost structure of fine-tuning

Fine-tuning introduces a different economic model from RAG systems. Instead of paying incremental costs on every request for external context, you invest upfront to modify the model’s internal behavior.

That upfront investment can be broken into four primary cost categories: data, training compute, experimentation, and operational maintenance.

Data preparation costs
High-quality labeled data is the foundation of effective fine-tuning. This includes collecting domain-specific examples, cleaning inconsistencies, formatting inputs and outputs correctly, and validating annotation quality.

In many organizations, data preparation consumes 20 to 40 percent of the total fine-tuning budget. Poorly curated data directly degrades model performance, leading to additional retraining cycles and wasted compute.

Training compute costs
OpenAI lists fine-tuning at roughly $25 per million training tokens for GPT-4.1. A run using 20 million tokens would cost about $500 in direct training fees, with larger datasets or multiple runs increasing this total.

For self-hosted training, costs depend on model size and hardware. High-performance GPUs such as A100 clusters can cost thousands of dollars per training epoch. Because fine-tuning is rarely a single-pass process, multiple epochs, evaluations, and retraining cycles are common, which further increases the overall cost.

Experimentation and validation costs
Fine-tuning is an iterative process that requires experimentation with hyperparameters, evaluation against baseline models, and testing across edge cases. These workflows require engineering time, infrastructure, and structured evaluation frameworks. Unlike prompt engineering, fine-tuning introduces a full ML lifecycle, adding ongoing operational overhead.

This creates a non-linear cost curve. Fine-tuning concentrates cost at the beginning, while marginal cost per request remains relatively stable as traffic grows.

Whether that trade-off is advantageous depends on three variables: query volume, knowledge stability, and retraining frequency. Without modeling those explicitly, cost comparisons between RAG and fine-tuning remain incomplete.
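
One way to model those variables explicitly is to amortize fine-tuning spend into a monthly figure and find the query volume at which RAG's context overhead matches it. The sketch below is deliberately simplified: it ignores retrieval infrastructure, engineering time, and any difference in inference pricing for fine-tuned models. The five training runs per retrain and the $50,000 data preparation figure are hypothetical inputs, not numbers from this article.

```python
def rag_context_cost_per_query(
    context_tokens: int = 500,
    input_price_per_million: float = 1.75,
) -> float:
    """Recurring cost (USD) of retrieved context on a single query."""
    return context_tokens * input_price_per_million / 1_000_000

def fine_tuning_cost_per_month(
    training_tokens: int,
    retrains_per_year: float,
    price_per_million_training_tokens: float = 25.0,
    runs_per_retrain: int = 1,
    data_prep_per_retrain: float = 0.0,
) -> float:
    """Amortize training and data preparation spend into a monthly figure."""
    training = (
        training_tokens / 1_000_000
        * price_per_million_training_tokens
        * runs_per_retrain
    )
    return (training + data_prep_per_retrain) * retrains_per_year / 12

def break_even_monthly_queries(fine_tuning_monthly: float, context_cost: float) -> float:
    """Monthly queries at which RAG context overhead equals fine-tuning spend."""
    return fine_tuning_monthly / context_cost

monthly = fine_tuning_cost_per_month(
    training_tokens=20_000_000,
    retrains_per_year=4,           # quarterly retraining (hypothetical)
    runs_per_retrain=5,            # hypothetical experimentation budget
    data_prep_per_retrain=50_000,  # hypothetical data preparation cost
)
threshold = break_even_monthly_queries(monthly, rag_context_cost_per_query())
print(f"Amortized fine-tuning cost: ${monthly:,.0f}/month")
print(f"Break-even volume: {threshold:,.0f} queries/month")
```

Under these inputs the amortized fine-tuning cost is $17,500 per month and the break-even volume is about 20 million queries per month. Retraining more often, or spending more on data preparation, pushes that threshold higher.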

When RAG Wins

Despite its scaling trade-offs, RAG remains the dominant production choice for a reason. In certain operating conditions, it is structurally more flexible, faster to iterate, and operationally safer than fine-tuning. RAG is suitable in the following scenarios:

1. When knowledge changes frequently

If your domain knowledge changes weekly or daily, fine-tuning becomes operationally expensive. Dataset updates, retraining, evaluation, and deployment introduce delays that can stretch from hours to weeks depending on governance requirements.

Teams frequently underestimate the operational overhead of keeping a fine-tuned model synchronized with a rapidly evolving knowledge base. In these environments, RAG shifts the problem from model retraining to data indexing.

2. When you have extensive unstructured data but limited labeled data

Many organizations possess terabytes of internal documents but lack high-quality supervised datasets. Building labeled training corpora requires annotation workflows, domain experts, and quality validation pipelines. In practice, this often becomes the most expensive part of fine-tuning projects.

RAG bypasses this constraint by allowing models to operate directly on existing document corpora without constructing large labeled datasets.

3. When governance and data residency requirements are strict

Once sensitive information is embedded in model weights, deletion and auditing become difficult. Removing a specific record from a fine-tuned model often requires retraining or maintaining complex dataset lineage.

RAG architectures avoid this issue by keeping sensitive information in external storage systems where standard governance controls already exist.

4. When query volume is moderate

As shown in the earlier cost analysis, context expansion overhead grows with query volume, reaching approximately $43,750 per month at 50 million queries. At moderate traffic, RAG’s per-request costs are typically lower than the amortized expenses of fine-tuning, including training and ongoing maintenance. This makes RAG an attractive choice for organizations that want high-quality outputs without front-loading infrastructure and compute investments.

Use cases

Large-scale examples illustrate RAG’s effectiveness at this volume. Notion’s Q&A assistant is effectively a large-scale RAG system over workspace data. The difficult engineering problem was not retrieval itself, but enforcing identity and access controls during retrieval. When a user queries the assistant, the system must ensure the model only retrieves documents that the user is permitted to see.
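
The access-control requirement can be sketched as a filter applied to retrieval candidates before they reach the prompt. This is an illustrative pattern, not Notion's implementation; the `Chunk` fields, group names, and scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    score: float               # similarity score from the vector search
    allowed_groups: frozenset  # groups permitted to read the source document

def authorized_top_k(candidates, user_groups, k=3):
    """Drop chunks the user cannot see, then keep the k best of what remains."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]

candidates = [
    Chunk("Q3 salary bands", 0.92, frozenset({"hr"})),
    Chunk("Engineering onboarding guide", 0.88, frozenset({"eng", "hr"})),
    Chunk("Incident runbook", 0.75, frozenset({"eng"})),
]

for chunk in authorized_top_k(candidates, user_groups=frozenset({"eng"}), k=2):
    print(chunk.text)
```

Filtering before truncating to `k` matters: if the top-k cut happened first, restricted documents could crowd out permitted ones and leave the model with too little context.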

LinkedIn leveraged RAG and knowledge graphs to preserve the structure of their support cases. This system retrieved relevant subgraphs rather than isolated text chunks, improving retrieval accuracy by 77.6% and reducing median issue resolution time by 28.6%.

For systems at this scale, RAG combines cost efficiency with flexibility, allowing teams to update knowledge sources rapidly without retraining models, while still delivering high-quality results.

When Fine-Tuning Wins

Fine-tuning becomes structurally advantageous under different conditions. These conditions typically involve scale, stability, and behavioral precision.

1. When query volume exceeds 100 million per month

At very high traffic levels (100M+ queries per month), RAG’s per-request context overhead becomes significant. Each query adds hundreds of retrieved tokens that the model processes, causing costs to scale linearly with traffic. Large context windows can also increase latency, reduce throughput, and complicate infrastructure reliability.

If domain knowledge is relatively stable, fine-tuning can become more efficient. By embedding knowledge directly into the model, organizations avoid repeated retrieval and token costs, leading to more predictable per-query expenses, better consistency, and simpler operations at scale.

2. When output structure is critical

Fine-tuned models often excel in tasks that require strict adherence to structure or formal constraints. For example, Cosine, an AI software engineering assistant that can autonomously resolve bugs and build features, achieved a state-of-the-art score of 43.8% on the SWE-bench Verified benchmark.

Similarly, Distyl secured the top position on the BIRD-SQL benchmark, widely regarded as the premier evaluation for text-to-SQL performance. Its fine-tuned GPT-4o model reached an execution accuracy of 71.83% on the leaderboard.

In applications where errors propagate downstream, into financial calculations, automated APIs, or compliance documents, behavioral consistency is mandatory. In these contexts, fine-tuning provides the reliability needed to minimize risk and maintain trust in automated outputs.

3. When latency requirements are strict

RAG adds multiple steps to the inference pipeline that increase response time. Each query must go through embedding generation, vector search, and context injection before reaching the model.

Fine-tuned models skip retrieval entirely. All necessary knowledge and reasoning patterns are internalized, allowing the model to generate outputs immediately. In applications where sub-100ms responses are required, such as live recommendation engines or high-frequency trading systems, removing the retrieval pipeline eliminates a major bottleneck.

4. When deep domain reasoning matters more than freshness

A domain-specific agriculture benchmark study found that fine-tuning improved model accuracy from 75% to 81%, while hybrid systems (fine-tuning + retrieval) reached 86%. Because the dataset focused on specialized agricultural knowledge and reasoning tasks, the improvement primarily reflects stronger domain reasoning, not simply better access to external information.

In domains such as legal analysis or medical decision support, reasoning patterns can be complex. Fine-tuning enables models to internalize domain expertise rather than rely solely on retrieved context.

The Hybrid Approach

While RAG and fine-tuning each have clear advantages, research shows that combining them effectively can produce superior results, but only when done correctly. The RAFT (Retrieval Augmented Fine-Tuning) approach, developed by UC Berkeley, Microsoft, and Meta Research, demonstrates how to do this in practice.

RAFT trains a model to operate in an “open-book” setting. It learns to process retrieved context, identify relevant passages, ignore distractors, and cite evidence accurately. Without this explicit training, simply layering RAG on top of a fine-tuned model often fails. For instance, a model fine-tuned on medical reasoning may retrieve irrelevant journal articles if it hasn’t learned to filter and prioritize context, resulting in hallucinations or incorrect recommendations.

RAFT addresses this with a structured 80/20 training split. 80% of training examples include oracle documents that the model should use, and 20% do not, forcing the model to learn when to trust retrieved data and when to rely on internalized knowledge. This operational detail is crucial for engineers evaluating whether their team can implement a hybrid approach successfully. It is not enough to just combine RAG and fine-tuning. The model must be trained to reason over the retrieved context.
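
A minimal sketch of how such a training set might be assembled follows. It is not the RAFT authors' code: the dictionary keys, the number of distractors, and the sampling scheme are assumptions made for illustration, and only the 80/20 oracle split comes from the description above.

```python
import random

def build_raft_examples(qa_items, corpus, oracle_fraction=0.8,
                        num_distractors=3, seed=0):
    """Build open-book training examples with and without the oracle document.

    qa_items: list of dicts with "question", "answer", and "oracle" keys.
    corpus:   list of documents to draw distractors from.
    """
    rng = random.Random(seed)
    examples = []
    for item in qa_items:
        # Distractors are any documents other than this question's oracle.
        pool = [doc for doc in corpus if doc != item["oracle"]]
        docs = rng.sample(pool, min(num_distractors, len(pool)))

        # Roughly oracle_fraction of examples keep the oracle in context;
        # the rest contain distractors only.
        has_oracle = rng.random() < oracle_fraction
        if has_oracle:
            docs.append(item["oracle"])
        rng.shuffle(docs)  # avoid a positional shortcut to the oracle

        examples.append({
            "question": item["question"],
            "documents": docs,
            "answer": item["answer"],
            "has_oracle": has_oracle,
        })
    return examples

# Synthetic placeholder data, for illustration only.
corpus = [f"document {i}" for i in range(100)]
qa_items = [
    {"question": f"question {i}", "answer": f"answer {i}", "oracle": corpus[i]}
    for i in range(100)
]
examples = build_raft_examples(qa_items, corpus)
share = sum(e["has_oracle"] for e in examples) / len(examples)
print(f"Examples containing the oracle document: {share:.0%}")
```

Because the split is sampled per example, the observed share hovers around 80% rather than matching it exactly. A fixed partition would work equally well if an exact ratio is preferred.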

A common and practical pattern is “fine-tune for format, RAG for knowledge.” Fine-tuning shapes the model’s internal behavior, enforcing domain-specific reasoning, output structure, and style. RAG provides dynamic access to external information that changes frequently or is too large to store in the model weights. In healthcare, for example, fine-tuning ensures the model understands medical terminology, follows proper diagnostic reasoning, and formats outputs according to clinical documentation standards. RAG supplements this by retrieving the latest research, newly published treatment guidelines, or patient-specific records, keeping recommendations current without retraining the entire model.

Similarly, Harvey AI fine-tuned on 10 billion case law tokens, but still leverages RAG to handle current cases and updates. This pattern is widely used in other domains too. Legal systems fine-tune for statutory reasoning and citation style, then layer RAG to retrieve the most current case law; finance models fine-tune for portfolio analysis rules, then layer RAG for market updates and regulatory changes. It’s a way to balance the stability of learned behavior with the adaptability of retrieval.

A Quantified Decision Framework for RAG vs. Fine-Tuning

The question is no longer “Which approach is better?” It is “Under what conditions does each approach make economic and operational sense?”

Instead of defaulting to architectural preference, evaluate three measurable variables:

Knowledge change frequency
Monthly query volume
Infrastructure capability and governance constraints

When those variables are quantified, the decision becomes far clearer.

Step 1: Measure knowledge volatility

Knowledge change frequency is often the fastest way to eliminate one option. If your domain knowledge changes weekly or daily, RAG is structurally favored. Updating an index is far simpler than retraining a fine-tuned model. The separation between model weights and external data enables real-time data retrieval without redeployment cycles.

If knowledge remains stable for months at a time, fine-tuning becomes economically viable. Retraining frequency drops, and training cost can be amortized over longer intervals. In these environments, embedding domain-specific knowledge directly into model parameters may reduce long-term inference overhead.

As a practical threshold:

Knowledge changes more often than monthly → prioritize RAG
Knowledge stable for multiple months → evaluate fine-tuning

Step 2: Calculate context expansion cost

The next variable is query volume. Large-scale RAG systems append hundreds of tokens to every query, and this context overhead scales linearly with traffic.

Quantitative triggers

Monthly queries	Guidance
<10M	RAG is cheaper
10–50M	Evaluate fine-tuning vs. RAG
50–100M	Fine-tuning or hybrid
>100M	Fine-tuning or hybrid

Step 3: Assess infrastructure maturity

Even if economics favor one approach, infrastructure capability may dictate feasibility.

RAG requires:

Strong data engineering
Reliable data pipelines
Efficient vector database architecture
Observability and monitoring

Fine-tuning requires:

High-quality labeled data
Machine learning expertise
Compute resource allocation
Evaluation discipline

When teams ignore their actual capabilities, architecture decisions collapse under scale. Many production failures blamed on “model quality” are really symptoms of immature infrastructure.

Decision matrix

The following matrix translates the analysis into practical guidance.

Scenario	Monthly queries	Knowledge update frequency	Recommendation	Rationale
Domain knowledge updates weekly, moderate traffic	10–50M	Weekly/Daily	RAG	Immediate indexing and low recurring cost
High-scale traffic, knowledge stable	50–100M+	<1 update/month	Fine-tuning	Avoids recurring context injection, reduces latency
Structured output or code generation required	Any	Any	Fine-tuning	Embeds domain-specific rules and formatting internally
Specialized reasoning + frequent updates	10–50M	Weekly/Daily	Hybrid	Combines internalized reasoning with dynamic knowledge
Multi-domain systems with diverse knowledge update cycles	10–100M	Mixed	Hybrid	Fine-tuning stabilizes core domains, RAG handles rapidly changing sources

With this matrix, it becomes easier to decide whether to use RAG, fine-tune your LLM, or adopt the hybrid approach.
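
The matrix can also be expressed as a small rule function. This is one possible encoding: the input categories, the function name, and especially the order in which rules are checked are assumptions, since the matrix does not say which row wins when several apply. The volume thresholds come from the quantitative triggers earlier in the framework.

```python
VALID_CADENCES = {"frequent", "stable", "mixed"}

def recommend(
    monthly_queries: int,
    update_cadence: str,                 # "frequent" (weekly/daily), "stable", or "mixed"
    strict_output: bool = False,         # structured output or code generation
    specialized_reasoning: bool = False,
) -> str:
    """Map workload traits to RAG, fine-tuning, hybrid, or a closer evaluation."""
    if update_cadence not in VALID_CADENCES:
        raise ValueError(f"update_cadence must be one of {sorted(VALID_CADENCES)}")

    # Assumed precedence: the structured-output row applies at any volume.
    if strict_output:
        return "fine-tuning"
    # Multi-domain systems with mixed update cycles.
    if update_cadence == "mixed":
        return "hybrid"
    # Knowledge that changes weekly or daily.
    if update_cadence == "frequent":
        return "hybrid" if specialized_reasoning else "RAG"
    # Stable knowledge: volume decides.
    if monthly_queries >= 50_000_000:
        return "fine-tuning"
    if monthly_queries < 10_000_000:
        return "RAG"
    return "evaluate both"

print(recommend(20_000_000, "frequent"))
print(recommend(80_000_000, "stable"))
print(recommend(20_000_000, "frequent", specialized_reasoning=True))
```

Treat the output as a starting point for discussion rather than a verdict. Infrastructure maturity and governance constraints from Step 3 are not encoded here and can override any of these rules.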

Final Thoughts

The debate between RAG and fine-tuning is often framed as a binary choice, but the more useful question is “If hybrid systems demonstrably outperform either approach alone, why does industry adoption still overwhelmingly favor RAG?”

Hybrid requires both ML and data engineering capabilities simultaneously, a combination few organizations have. RAG remains the practical default, offering agility and transparency with less upfront complexity.

The key takeaway is to choose the architecture that matches your knowledge volatility, query scale, and team capability. For teams exploring enterprise-scale retrieval systems, platforms like Actian VectorAI DB provide purpose-built vector database capabilities designed for performance and scalability.

Join the Discord community and learn how Actian fits into your AI strategy.

What 37signals’ Cloud Repatriation Taught Us About AI Infrastructure

Offisong Emmanuel — Tue, 19 May 2026 19:42:51 +0000

In 2023, 37signals announced that it had completely left the public cloud and followed up by publicly documenting its cloud repatriation process, providing one of the clearest real-world examples of on-premises economics at scale. By reversing its cloud migration and shifting workloads to private cloud infrastructure, the company drastically reduced its annual cloud infrastructure spend by almost $2 million.

The transparency of the numbers made the case compelling. In 2022, 37signals spent $3,201,564 on cloud services, which is about $266,797 per month. These detailed cost breakdowns, along with published hardware investment and payback timelines, provided a rare look into the financial mechanics of large-scale cloud repatriation.

For commodity SaaS workloads, the math was clear. But it raises an important question for the next generation of compute-heavy systems: does the same economic argument extend to AI infrastructure? This article examines that question.

TL;DR

37signals spent ~$3.2M/year on AWS in 2022.
After repatriating workloads to their own infrastructure, cloud spend dropped to ~$1.3M by 2024.
The company invested roughly $700K–$800K in servers and paid them off in under 18 months.
The entire infrastructure is still run by the same 10-person team. No additional operational overhead.
The key takeaway is that at a sustained scale, owning infrastructure can be dramatically cheaper than renting it.

The 37signals Playbook: What Hansson Actually Documented

In 2022, 37signals spent $3.2 million annually on AWS. After leaving the cloud in 2023, their annual costs had dropped to approximately $1.3 million by 2024, a reduction of almost $2 million per year.

The transition required a hardware investment of roughly $700,000–$800,000 in Dell servers. The company fully recouped the investment in under 18 months, achieving complete payback in the second half of 2023 as their AWS reserved instance contracts expired. From that point forward, the savings flowed directly to operating margin rather than offsetting capital expense.

For the next phase, moving storage off the cloud onto roughly 18PB of Pure Storage hardware, 37signals projected $1.5 million in hardware costs and roughly $200,000 per year in operating expenses. This shift replaces a recurring $1.3 million annual cloud storage bill with a one-time capital outlay plus a fraction of the ongoing operating cost. As a result, 37signals revised its five-year savings projection upward from $7 million to more than $10 million.

37signals cloud exit financials by year

To illustrate the financial impact of 37signals’ cloud exit over time, the table below breaks down annual cloud spending, on-premises hardware investments, and operating costs, highlighting the resulting net savings and key operational notes.

Year	Cloud spend	Hardware investment	Operating costs	Notes
2022 Baseline	~$3.2M	$0	Included in cloud spend	Full cloud dependency
2023 Migration	~$2M	~$700K–800K	Moderate	Hardware fully recouped in under 18 months
2024+ Post-repatriation	~$1.3M	~$1.5M (storage)	~$200K/year	~$1.9M annual savings
2025+	Minimal AWS dependency	~$1.5M (Pure Storage, 18PB)	~$200K/year	$10M+ projected 5-year savings

Notably, the migration did not require the team to expand operations. A 10-person infrastructure team handled the entire repatriation without adding new staff. Addressing a common concern about operational overhead, 37signals co-founder David Heinemeier Hansson noted:

“We've been out for just over a year now, and the team managing everything is still the same. There were no hidden dragons of additional workload associated with the exit that required us to balloon the team, as some spectators speculated when we announced it. All the answers in our Big Cloud Exit FAQ continue to hold.”

This directly challenges the common assumption that moving away from public cloud environments inevitably requires a significantly larger infrastructure team.

Execution followed a “criticality ladder” strategy in which the team migrated lower-risk services first and more critical ones later. The team moved the HEY email system in stages, starting with caching, then the database, and finally job services. To minimize risk, they colocated infrastructure approximately one millisecond from the AWS region to preserve rollback capability during the cloud repatriation process. After stabilizing the system, they replaced managed services with substantial recurring costs, including RDS and managed Elasticsearch, which together exceeded $500,000 annually.

What makes 37signals' case study consequential is the publicly documented cost efficiency. For organizations questioning long-term cloud adoption assumptions, particularly with regard to storage costs and managed services, the 37signals documentation provides a rare baseline for comparison.

Why AI Infrastructure Economics Are Even More Extreme

The lessons from 37signals’ cloud repatriation take on a sharper edge when applied to AI infrastructure. Higher GPU costs, predictable inference workloads, massive embedding storage, and stricter data regulations create financial and operational pressures that amplify the advantages of on-premises or hybrid cloud solutions that allow you to move workloads where they make the most sense. Below, we break down the key drivers.

AI infrastructure cost comparison

To evaluate the cost implications of different AI infrastructure approaches, the table below compares upfront setup costs, monthly operating expenses at varying workloads, and expected break-even timelines for cloud, on-premises, and hybrid configurations.

Setup	Setup cost	Monthly cost	Break-even
Cloud GPU rental (AWS/Azure)	$0	$2,900–3,500 (8h/day × $4–8/hour × 15 days)	N/A
Cloud inference APIs (Lambda Labs)	$0	$1,800–2,500 (8h/day × $3.67/hour × 15 days)	N/A
Self-hosted GPU (8×H100 server)	$200K–400K	$1,500–2,000 (power + maintenance)	<12 months
Hybrid (Cloud training + On-Prem)	$200K–400K	Training only, inference minimal	<12 months

Note: For cloud GPU rental, we estimate monthly cost assuming eight hours/day per GPU. The cost scales linearly with utilization; it is not directly per-query.

1. GPU cloud markups are high

AI workloads depend heavily on GPUs, and cloud providers charge far steeper premiums for GPU capacity than for typical CPU compute. On-demand AWS P5 instances with H100 GPUs cost roughly $4–8 per GPU-hour, while comparable Azure H100 instances are about $3.67 per hour. By contrast, spot markets and alternative providers such as Lambda Labs offer similar GPU capacity for $1–2 per hour, or $1.85–2.49 per hour with reserved commitments.

The result is a 4–8× markup for on-demand hyperscaler GPU capacity relative to the spot or specialized GPU cloud market. In other words, the premium cloud providers charge for high-end AI compute is significantly larger than typical CPU cloud markups. For organizations running sustained inference workloads, this pricing gap quickly becomes the dominant cost driver in AI infrastructure.

2. Predictable inference makes GPU ownership economical

High GPU pricing matters most for workloads that run at steady utilization, and that is where purchasing H100 GPUs outright can be cost-efficient. A single GPU costs roughly $25K–40K, while a complete 8×H100 server ranges from $200K–400K. Lenovo’s analysis shows that six or more hours of sustained daily usage reaches payback against AWS within the first year.

The reason this break-even arrives so quickly is that AI inference workloads are unusually predictable. Unlike SaaS traffic which fluctuates throughout the day, production AI systems such as recommendation engines tend to process steady volumes of requests.

Predictability changes the economics. When infrastructure runs at consistent utilization, owned hardware can be amortized efficiently across the workload. Paying cloud premiums for burst capacity that teams rarely use becomes unnecessary.

For organizations running inference continuously, the hardware investment is often recouped in under 12 months. From that point forward, the savings follow the same pattern documented by 37signals: fixed infrastructure replacing an ongoing rental bill.
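
The break-even logic can be sketched as hardware cost divided by monthly savings. The inputs below are illustrative midpoints of the ranges quoted in this section (a $300,000 server, $6 per GPU-hour, $2,000 per month in power and maintenance) and assume round-the-clock utilization. They are not measured figures, and the function name is hypothetical.

```python
def payback_months(
    hardware_cost: float,
    gpus: int,
    cloud_rate_per_gpu_hour: float,
    hours_per_day: float,
    owned_opex_monthly: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Months until owned hardware costs less than renting equivalent GPU time."""
    cloud_monthly = gpus * cloud_rate_per_gpu_hour * hours_per_day * days_per_month
    monthly_savings = cloud_monthly - owned_opex_monthly
    if monthly_savings <= 0:
        return float("inf")  # owning never pays back at this utilization
    return hardware_cost / monthly_savings

months = payback_months(
    hardware_cost=300_000,        # midpoint of the $200K-400K server range
    gpus=8,
    cloud_rate_per_gpu_hour=6.0,  # midpoint of the $4-8 on-demand range
    hours_per_day=24,             # continuous inference
    owned_opex_monthly=2_000,     # power and maintenance
)
print(f"Payback in about {months:.1f} months")
```

Under those assumptions payback arrives in a little over nine months. The result is sensitive to utilization and to the cloud rate being displaced, so it is worth rerunning with your own numbers.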

3. Embedding storage requirements are massive

Even if GPU compute were optimized, AI systems introduce another rapidly growing cost layer: embedding storage. Vector databases store high-dimensional embeddings used for search, retrieval, and recommendation. As datasets scale into millions or billions of records, storage requirements expand quickly.

For instance, 10 million float32 vectors at 1,536 dimensions require roughly 61GB (about 57GiB) of raw storage, and often 200–300GB once indexes and metadata are included. Managed vector databases like Pinecone charge about $0.33/GB/month, meaning 500GB could cost $165/month before any queries. Self-hosted solutions like PostgreSQL with pgvector dramatically reduce cloud spending while keeping sensitive data under direct control. Over time, these storage requirements compound infrastructure costs alongside GPU compute, further reinforcing the economic advantages of self-hosted or hybrid architectures.
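
The raw storage figure follows from simple arithmetic: vectors × dimensions × bytes per value. The sketch below assumes 4-byte float32 values and ignores index and metadata overhead; the function names are hypothetical.

```python
def raw_embedding_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the embeddings alone, before indexes and metadata."""
    return num_vectors * dimensions * bytes_per_value

def monthly_storage_cost(stored_gb: float, price_per_gb_month: float = 0.33) -> float:
    """Monthly storage charge at a flat per-gigabyte price (USD)."""
    return stored_gb * price_per_gb_month

raw = raw_embedding_bytes(10_000_000, 1536)
print(f"Raw embeddings: {raw / 1e9:.2f} GB ({raw / 2**30:.2f} GiB)")
print(f"500 GB stored: ${monthly_storage_cost(500):.2f}/month")
```

Whether the raw total reads as roughly 61 or roughly 57 depends on whether gigabytes are counted in decimal or binary units. Halving precision to 2-byte values, where the retrieval quality trade-off is acceptable, halves the raw footprint.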

4. Data sovereignty and compliance favor on-premises deployment

Data residency and regulatory compliance are growing priorities as the AI industry becomes more heavily regulated. Notably, the EU AI Act introduced strict rules for AI systems, including prohibitions on certain AI use cases that took effect in February 2025. On-premises deployment simplifies compliance because data stays on infrastructure the organization controls.

For financial organizations navigating complex regulatory environments, solutions like Actian’s Data Intelligence platform help enforce data governance and streamline compliance workflows.

The Cloud Infrastructure Case Studies 37signals Validated

The financial transparency of 37signals’ cloud exit was unusual, but the repatriation itself was not an isolated occurrence. It is part of a growing trend among organizations trying to regain cost control and optimize their cloud infrastructure. Several high-profile case studies illustrate the scale and economics of moving workloads back from public clouds to owned or hybrid infrastructure.

Dropbox

Dropbox pioneered enterprise cloud repatriation as early as 2015, completing the migration between 2016 and 2018. The company moved roughly 90% of customer data, reportedly over 500 petabytes, off AWS to three owned colocation facilities. The infrastructure investment totaled $53 million, yet Dropbox reported $74.6 million in operational savings over two years per its 2018 S‑1 filing. A small portion of workloads, primarily European customers and specialized services, remain in AWS. Internally, the initiative was known as “Magic Pocket,” and it exemplifies how a well-executed hybrid cloud approach can deliver substantial savings while aligning with long-term business objectives.

Ahrefs

Ahrefs, the SEO tools company, relied on a Singapore colocation setup with 850 servers. Their reported savings from avoiding public cloud were approximately $400 million over 2.5 years. Actual infrastructure cost: $39.5 million for 850 servers (~$1,500/server/month), versus an estimated $447.7 million if hosted entirely on AWS (~$17,557/server/month equivalent). As Ahrefs put it: “We wouldn’t be profitable, or even exist, if our products were 100% on AWS.” While critics argue that Ahrefs inflated AWS estimates, the directional savings were undeniable, illustrating that cloud repatriation challenges can be surmounted at scale with careful planning.
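The per-server figures above follow directly from the totals. The sketch below redoes that arithmetic, assuming 2.5 years means 30 billing months; the variable and function names are illustrative.

```python
MONTHS = 30    # 2.5 years, assumed to be 30 billing months
SERVERS = 850


def per_server_month(total_usd: float, servers: int = SERVERS, months: int = MONTHS) -> float:
    """Average monthly cost per server over the whole period."""
    return total_usd / (servers * months)


colo_total = 39.5e6      # reported colocation cost
aws_estimate = 447.7e6   # Ahrefs' estimate for an all-AWS deployment

print(f"colocation: ${per_server_month(colo_total):,.0f} per server per month")
print(f"AWS estimate: ${per_server_month(aws_estimate):,.0f} per server per month")
print(f"difference: ${(aws_estimate - colo_total) / 1e6:,.1f} million")
```

The difference works out to about $408 million, consistent with the "approximately $400 million" savings claim.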

GEICO

GEICO spent a decade migrating to multiple cloud providers, only for its costs to climb and exceed projections by 2.5×, reaching $300 million by 2022 across eight providers. In response, GEICO began moving workloads to a private cloud using OpenStack and Kubernetes, targeting over 50% repatriation by 2029. Early results show a 50% reduction in compute costs and a 60% reduction in per-gigabyte storage costs compared with public cloud services, demonstrating how a hybrid cloud architecture can deliver efficiency, compliance, and alignment with long-term business objectives.

Akamai

Akamai was on the path to spending over $100 million on third-party cloud services before migrating compute workloads to its own global edge network of 350,000+ servers. The migration delivered savings of roughly $100 million per year, a testament to the economics of repatriation when existing infrastructure and scale align.

What these cases share is the same economic pattern documented by 37signals. Predictable, high-volume workloads eventually become cheaper to run on owned infrastructure than on hyperscaler clouds.

These examples reflect a broader shift occurring across enterprise infrastructure strategies. Barclays’ surveys of chief information officers (CIOs) show cloud repatriation trending upward in recent years, with sentiment peaking in the second half of 2024, when 86% of CIOs reported planning some repatriation.

However, this statistic does not mean that companies are abandoning public cloud environments completely. According to IDC, only 8–9% of companies favor full repatriation with most preferring a hybrid approach that combines public and private clouds. Hybrid cloud infrastructure allows organizations to optimize workload placement by strategically allocating sensitive data and mission-critical applications on-premises while leveraging public cloud services for less critical workloads. As such, it has become increasingly important for teams exploring similar transitions to understand the nuances of hybrid deployments and their associated risks.

Cloud Repatriation Statistics

Cloud repatriation is accelerating at the same time as public cloud spending keeps climbing. IDC projects global public cloud spend will reach $1.6 trillion in 2028, doubling from their 2024 prediction. Yet as mentioned earlier, 86% of CIOs are planning some form of repatriation according to Barclays. Both trends can be true because this is not a cloud exodus so much as a rebalancing. Enterprises are leaning towards a hybrid cloud model.

AI is likely to accelerate that shift. AI workloads account for less than 10% of total cloud compute today but Gartner projects that this figure will approach 50% by 2029. Hyperscalers are responding with enormous capital investment. There is an estimated $600 billion in infrastructure spend in 2026, roughly three-quarters of it tied to AI. The assumption is clear: Enterprises will rent that GPU capacity. But the 37signals math suggests that once AI workloads move from experimentation to steady production, ownership economics begin to dominate.

Cost pressure is already driving behavior. Flexera reports that 27% of cloud resources are wasted or underutilized, and 21% of workloads have already been repatriated. The primary reason cited is cost exceeding projections, followed by performance concerns. With GPUs, the margin for inefficiency is thinner. There are fewer optimization levers, higher hourly rates, and faster budget burn.

Regulation adds another layer. The EU AI Act, DORA for financial services, China’s PIPL, and India’s DPDP are tightening data governance requirements. Mimecast reports that 87% of organizations now factor data sovereignty into vendor decisions. For AI systems, sovereignty extends beyond data location to model provenance, audit trails, and compliance documentation. On-premises deployment does not eliminate regulatory complexity, but it centralizes control, and for many enterprises that simplicity is becoming strategically attractive.

The Counter-Arguments and When Cloud Providers Win

Not all observers agree that cloud repatriation is the best path for every organization. Public cloud environments still deliver value in certain circumstances. But the arguments for them often do not hold up for AI workloads.

When cloud wins vs. when on-premises wins

Component	Cloud advantage	On-prem advantage
Workload predictability	Handles spiky or unpredictable workloads	Predictable workloads cheaper to self-host
Team expertise	Requires minimal in-house infrastructure skill	Strong IT teams can optimize and reduce vendor reliance
Scale and growth	Rapid scaling and global expansion	Predictable growth enables cost-efficient hardware
Regulatory requirements	Managed compliance, geo-redundancy	Direct control simplifies regulatory alignment
Cost and margins	Pay-as-you-go reduces upfront spend	Long-term savings from owned infrastructure
Service quality	Cloud SLAs ensure availability and performance	Dedicated resources guarantee predictable uptime

Cloud “wrong usage” argument

Jeremy Daly, a serverless advocate, argues that “37signals was using the cloud wrong.” By treating cloud environments as virtual colocation, running VMs and Kubernetes, they were paying cloud premiums without capturing the value of serverless, managed services, and instant scaling. As Daly notes, “In the cloud, we should be renting services, not servers.”

For SaaS workloads with highly variable or spiky traffic, this argument is compelling. Serverless infrastructure allows organizations to scale instantly and pay only for the compute they actually use.

However, AI inference workloads often behave very differently. Production inference systems, such as recommendation models, copilots, and document processing pipelines, tend to run at steady, sustained utilization rather than unpredictable bursts. In these cases, the economic advantage of elastic cloud scaling diminishes. The premium paid for burst capacity still exists, but the workload itself rarely needs that burst capacity.

Daly’s argument therefore holds for variable SaaS workloads, where elasticity is critical. For sustained AI inference workloads running at high utilization, paying a premium for burst capacity that is rarely used can make dedicated infrastructure or hybrid deployments more cost-efficient.

Full cost critique

Some critics also question the financial assumptions behind 37signals’ approach. They point out that hardware and software normally account for only about 20% of IT costs, with the remainder covering electricity, cooling, physical security, racking, Uninterruptible Power Supply (UPS), and opportunity costs. David Heinemeier Hansson’s analysis did not include all of these overheads because 37signals used colocation facilities rather than fully owned data centers. Even so, considering 37signals’ figures, it is reasonable to conclude that renting colocation space can still be far cheaper than relying on cloud services.

Competence vs. growth framework

Forrest Brazeal’s IT competence versus growth aspirations framework provides additional nuance. He places 37signals in the High Competence/Low Growth quadrant, ideal for self-hosting. “Not every company has the competence (high) or growth aspirations (low) of 37signals,” he observes. Startups with uncertain or spiky workloads benefit from cloud flexibility, but AI companies running production inference at scale often combine high operational competence with steady growth. Such profiles (steady growth & high competence) are well suited to repatriation.

Applying the Playbook to AI Infrastructure

If 37signals provided the economic blueprint, AI infrastructure makes the economics more concrete. The decision is no longer abstract. It becomes a structured assessment grounded in workload behavior, utilization, and regulatory exposure.

A practical four-question framework helps translate the 37signals logic into AI terms:

Is your inference workload predictable and sustained?

Unlike SaaS traffic spikes, most production AI systems such as recommendation engines, RAG pipelines, or fraud detection models process steady volumes with gradual growth.

Are projected GPU utilization rates above 60–70%?

At this threshold, owned hardware amortization typically undercuts public cloud GPU pricing within the first year.

Are you processing more than 10–50 million queries per month?

At this scale, per-token and per-query pricing from cloud APIs compound rapidly.

Do you face data sovereignty or strict compliance requirements?

For financial services, healthcare, or government workloads, regulatory mandates can tilt the decision toward controlled environments.

If the answer is “yes” to three or four of these, the repatriation economics tend to favor on-premises deployment for production inference.
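The four-question check can be sketched as a small scoring function. The class and function names are hypothetical, and the default thresholds take the lower bound of each range quoted above (60% utilization, 10 million queries per month), which is an assumption rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    predictable_sustained: bool   # question 1
    gpu_utilization: float        # projected fraction between 0.0 and 1.0 (question 2)
    monthly_queries: int          # question 3
    strict_compliance: bool       # question 4


def repatriation_score(p: WorkloadProfile,
                       utilization_threshold: float = 0.60,
                       query_threshold: int = 10_000_000) -> int:
    """Count how many of the four questions are answered 'yes'."""
    answers = [
        p.predictable_sustained,
        p.gpu_utilization >= utilization_threshold,
        p.monthly_queries >= query_threshold,
        p.strict_compliance,
    ]
    return sum(answers)


def recommend(p: WorkloadProfile) -> str:
    """Three or four 'yes' answers tilt toward owned infrastructure."""
    return "on-premises / hybrid" if repatriation_score(p) >= 3 else "public cloud"


steady_rag = WorkloadProfile(True, 0.75, 25_000_000, False)
print(repatriation_score(steady_rag), recommend(steady_rag))
```

A score is only a starting point for discussion; the thresholds should be tuned to your own hardware quotes and API pricing.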

Decision matrix

Workload stage	Recommended environment	Rationale
Model training	Public cloud	Compute-intensive; cloud GPUs handle burst workloads cost-effectively
Experimentation and prototyping	Public cloud	Flexible, fast provisioning for early-stage iteration
Production inference	On-premises / Hybrid	Steady workloads; owned hardware cheaper at 60–70%+ GPU utilization
Vector storage (embeddings)	On-premises	Reduces recurring managed-service costs and ensures data control

The hybrid AI pattern

In practice, most AI organizations adopt a hybrid model rather than an all-or-nothing shift. Training remains in the cloud. Inference moves closer to owned infrastructure.

Lenovo documented that training Llama 3.1 at hyperscale (39.3 million GPU hours) in the cloud would exceed $483 million. That type of elastic, short-term scale is exactly where public cloud excels. Inference is different. Once a model is trained, serving it for three to five years becomes steady, predictable work. That is where amortized hardware economics has the upper hand.

This split architecture also simplifies data migration risk. Instead of relocating entire AI pipelines at once, organizations can migrate production inference workloads gradually while leaving experimentation and early-stage training in cloud environments. A controlled, phased migration process reduces operational disruption while ensuring seamless integration between cloud-based training and on-premises serving layers.

Self-hosted inference economics

The economics of self-hosted inference depend heavily on utilization and token volume. According to enterprise deployment benchmarks, a 7B-parameter model running on an H100 GPU at roughly 70% utilization costs about $10,000 per year in spot nodes or hardware amortization. Power costs about $300 annually, bringing the total costs to about $10,300.

Public LLM APIs, by contrast, typically charge per million tokens, with enterprise pricing in 2025 ranging from $0.25–$15 per million input tokens and $1.25–$75 per million output tokens depending on model tier and provider.

At low usage levels, APIs remain the more economical option because infrastructure sits idle. However, the economics change as workloads scale. Industry analyses suggest that self-hosted deployment begins to break even at roughly two million tokens per day, after which the fixed cost of owned infrastructure is amortized across a large inference volume.

At high volumes, self-hosted inference can reduce costs by up to 78%. Artefact’s analysis found break-even around 8,000 conversations per day. Below that threshold, managed cloud APIs remain more economical. Above it, ownership compounds savings. The pattern mirrors 37signals: predictable workload plus high utilization equals rapid payback.
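A rough break-even estimate can be derived from the figures in this section. The sketch below assumes the roughly $10,300 annual self-hosted cost quoted above and a single blended API price per million tokens; it ignores staffing, redundancy, and the throughput ceiling of one GPU, so treat it as an illustration rather than a sizing tool.

```python
def breakeven_tokens_per_day(annual_fixed_usd: float, api_usd_per_million: float) -> float:
    """Daily token volume at which API spend equals the fixed self-hosted cost."""
    daily_fixed = annual_fixed_usd / 365
    return daily_fixed / api_usd_per_million * 1_000_000


ANNUAL_SELF_HOSTED = 10_000 + 300  # amortization plus power, from the text above

# Blended prices are assumptions picked from within the quoted API price ranges.
for price in (0.25, 1.25, 15.0, 75.0):
    tokens = breakeven_tokens_per_day(ANNUAL_SELF_HOSTED, price)
    print(f"${price:>5.2f} per million tokens: break-even near {tokens / 1e6:,.2f}M tokens/day")
```

At a blended $15 per million tokens, the break-even comes out just under two million tokens per day, in line with the industry estimate cited above. Cheaper API tiers push the break-even volume much higher.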

Vector databases

Instacart documented migrating from Elasticsearch plus FAISS to PostgreSQL with pgvector, achieving 80% cost savings and a 10× reduction in write amplification. Timescale’s pgvectorscale benchmarks show approximately 75% lower costs than managed vector services like Pinecone at comparable performance.

For RAG systems handling millions of queries monthly, self-hosted vector infrastructure produces savings that resemble the 37signals S3 case: large recurring storage bills replaced by amortized hardware and open-source tooling.

Data sovereignty as a structural driver

Grand View Research projects that the sovereign cloud market will reach USD 648.87 billion by 2033. Also, according to Gartner, around 60% of financial firms outside the United States are expected to adopt sovereign or on-premises deployments by 2028.

Frameworks such as the EU AI Act, China’s PIPL, and India’s DPDP mandate data localization and traceability. For organizations processing sensitive training datasets or proprietary inference logs, on-premises deployment inherently satisfies residency requirements because data never leaves jurisdictional boundaries.

The Bottom Line

37signals showed that teams can measure, model, and defend cloud repatriation decisions with hard numbers. With AI infrastructure, the economics can be even more pronounced. If cloud repatriation saved roughly $10 million for Basecamp, an equivalent AI company running production inference at comparable scale could save multiples of that amount, given the much higher cost of GPU compute and embedding infrastructure.

For organizations choosing to run AI workloads in controlled environments, platforms like Actian VectorAI DB provide a purpose-built vector database designed for high-volume vector search and AI inference workloads. It can be deployed on-premises or in the cloud, allowing organizations to place vector infrastructure where it best fits their operational and economic requirements.

Join the community and learn more about Actian.

How to Build a HIPAA Compliant AI Ecosystem Without the Cloud

Offisong Emmanuel — Tue, 19 May 2026 19:42:41 +0000

Healthcare cannot rely on cloud RAG because patient data leaves your network and your system logs, stores, and exposes it outside your control. You sign a Business Associate Agreement (BAA), connect your pipeline to a managed vector database, and assume compliance is complete. That assumption is wrong. The BAA covers the provider’s infrastructure. It does not cover what your application sends, logs, or exposes during retrieval and generation.

You remain responsible for every path your system sends Protected Health Information (PHI) through. A clinician query can leak sensitive data through logs. A system prompt can include patient context that your system stores outside your boundary. Weak access control allows retrieval results to expose records across departments. These risks exist in your application layer, not in the cloud provider’s scope.

American regulators now target this gap. In 2026, they flagged attack patterns like membership inference, where an adversary probes an AI system to confirm whether a patient’s data exists in the index. Cloud-hosted pipelines increase this risk because queries and embeddings move across external infrastructure. Audit requirements tighten further when logs live on third-party systems.

In this tutorial, you will build a clinical knowledge assistant that runs entirely on hospital infrastructure. It performs semantic search over clinical data, enforces role-based access at query time, and generates answers with clear citations. Every query stays inside your network, every access is logged locally, and no external API calls are required.

Why BAA Is Not Enough

A BAA protects the cloud provider’s infrastructure, not how your system handles PHI during queries, retrieval, and generation. You remain responsible for every place PHI appears, moves, or gets stored inside your pipeline. There are multiple failure modes that make your system non-compliant, even when you sign a BAA.

Shared responsibility gap

The BAA stops at the infrastructure boundary. Your system controls what enters a prompt, what gets logged, and what leaves your network. If a clinician query includes PHI and your application logs it to an external service, you are responsible. If your retrieval step returns records across departments without strict filters, you have created an internal data breach. These failures happen in your code, not in the cloud provider’s scope.

For example, a physician searches “Show me similar cases to John Doe with early stage lung cancer.” Your application logs the full query to a cloud logging service for debugging. That log now contains PHI outside your network. The cloud provider did not leak it. Your application sent it.

Audit log ownership

HIPAA requires a complete audit trail for every access to PHI. When your vector database runs on third-party infrastructure, your system stores query logs and retrieval traces outside your control. You cannot guarantee completeness, retention, or isolation. Your security team cannot verify access patterns without relying on another provider’s system. That breaks your ability to enforce and prove compliance.

For example, your compliance team asks for a report of every access to oncology patient records in the past 30 days. Your vector database provider stores query logs on their platform with limited retention. Some logs are missing and others lack user-level metadata. You cannot produce a complete audit trail.

Membership inference exposure

Attackers can probe your system with targeted queries to determine whether a specific patient’s data exists in your index. This attack class is now a regulatory concern. Cloud-hosted indexes increase this risk because they expose a remote interface for repeated probing. A locally hosted index removes that external interface and limits access to your internal network.

For example, an attacker sends repeated queries like “Patients diagnosed with HIV in 2024 treated with drug X” and slightly modifies filters each time. They observe changes in response confidence and content. Over time, they infer whether a specific individual’s record exists in your dataset.

These failures show that a BAA does not ensure compliance. An on-premises deployment removes the third-party surface entirely and gives you full control over data flow, access, and auditability.

Split view showing Cloud vs On-premises RAG architecture

What You Are Building

In this section, you will build a RAG system with three layers:

Ingestion layer

You ingest clinical notes and treatment protocols into a controlled vector index with enforced data hygiene. You de-identify data before any processing. HIPAA Safe Harbor requires removal of 18 specified types of identifiers, while Expert Determination relies on a qualified expert's statistical assessment of re-identification risk. You apply one of these before ingestion, not after. You then chunk documents into 512-token segments with a 50-token overlap, generate embeddings using a local model, and store them in VectorAI DB with metadata.

You define a strict schema for every record. Each chunk includes document type, department, date, and author role. This metadata is not optional. It enables access control at query time and prevents cross-department leakage. Do not store raw documents without structure.
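One way to make the schema non-optional is to validate it in code before anything is written to the index. The sketch below is a minimal illustration using a standard-library dataclass; the class name and the set of allowed document types are assumptions, not part of the VectorAI DB client.

```python
import datetime
from dataclasses import dataclass

# Assumed for illustration: the two document types used in this tutorial.
ALLOWED_DOCUMENT_TYPES = {"clinical_note", "treatment_protocol"}


@dataclass(frozen=True)
class ChunkMetadata:
    document_type: str
    department: str
    date: str          # ISO 8601 calendar date, e.g. "2025-03-15"
    author_role: str

    def __post_init__(self) -> None:
        # Every field is required and must be a non-empty string.
        for name in ("document_type", "department", "date", "author_role"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"metadata field '{name}' is required")
        if self.document_type not in ALLOWED_DOCUMENT_TYPES:
            raise ValueError(f"unknown document_type: {self.document_type!r}")
        # Raises ValueError if the date is not a valid ISO calendar date.
        datetime.date.fromisoformat(self.date)


meta = ChunkMetadata("clinical_note", "cardiology", "2025-03-15", "attending_physician")
print(meta)
```

Rejecting a record at construction time means a chunk with a missing department can never reach the index, where it would bypass the access filter.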

Query layer

You process clinician queries through a controlled retrieval pipeline. Every query passes through role-based access control before it reaches the index. A cardiology user can only retrieve cardiology data. A scheduling bot cannot access diagnosis notes. You enforce this with a MUST filter on department or patient cohort at the database level.

Run hybrid search. Vector similarity retrieves semantically relevant chunks. Metadata filters restrict the result set. Pass the filtered context into a local LLM. The model generates an answer from retrieved data only and includes citations. Do not allow the model to invent or pull from external knowledge.
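The order of operations matters: the access filter narrows the candidate set first, and similarity ranking only runs on what remains. The sketch below shows that idea with plain Python lists instead of the database; the role mapping, record layout, and toy vectors are all hypothetical.

```python
import math

# Hypothetical role-to-department mapping, for illustration only.
ROLE_DEPARTMENTS = {
    "cardiologist": {"cardiology"},
    "oncologist": {"oncology"},
    "scheduling_bot": set(),   # no access to clinical departments
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(role: str, query_vec: list[float], index: list[dict], top_k: int = 3) -> list[dict]:
    """Apply the department filter first, then rank the survivors by similarity."""
    allowed = ROLE_DEPARTMENTS.get(role, set())   # unknown roles get nothing
    candidates = [r for r in index if r["department"] in allowed]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


INDEX = [
    {"id": "card_note_001", "department": "cardiology", "vector": [0.9, 0.1, 0.0]},
    {"id": "psych_note_001", "department": "psychiatry", "vector": [0.8, 0.2, 0.1]},
    {"id": "onco_note_001", "department": "oncology", "vector": [0.1, 0.9, 0.2]},
]

print([r["id"] for r in search("cardiologist", [1.0, 0.0, 0.0], INDEX)])
```

Here the psychiatry note is semantically close to the query but never reaches the ranking step for a cardiology user. In the real pipeline the same restriction is pushed down to the database as a filter rather than applied in application code.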

Audit layer

Log every interaction locally with full traceability. Each query writes a record that includes timestamp, user ID, department, query text, and retrieved document references. This log lives on your infrastructure with defined retention and access policies. You do not rely on external logging systems.

You can reconstruct any access event from this log. You can answer who accessed what, when, and under which role. This satisfies audit requirements and gives your security team direct visibility into system behavior.
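A minimal sketch of such a log is an append-only JSON Lines file on local disk. The function names and field names below are illustrative assumptions; a production system would add file permissions, rotation, and tamper protection on top.

```python
import datetime
import json
from pathlib import Path


def write_audit_record(log_path: Path, user_id: str, department: str,
                       query_text: str, document_ids: list[str]) -> dict:
    """Append one access event to a local JSON Lines file and return it."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user_id": user_id,
        "department": department,
        "query": query_text,
        "retrieved_documents": document_ids,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_audit_records(log_path: Path) -> list[dict]:
    """Load every event back, one JSON object per line."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is a complete JSON object, the security team can filter the file by user, department, or time range with ordinary tools and without any external logging service.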

The entire system runs on commodity hardware inside the hospital network. The end-to-end architecture of the system is shown in the image:

Hospital RAG system architecture

Building a HIPAA Compliant RAG Workflow

In this section, you will build a fully local RAG system that ingests clinical data, enforces access control, answers queries, and logs every interaction.

Prerequisites

To follow along, you need the following on a machine inside your network:

Docker and Docker Compose.
Python 3.10 or higher.
pip or uv. This guide uses uv.

Step 1: Deploy a vector database

You deploy a local instance of Actian VectorAI DB with persistent storage for both vector data and audit logs.

Create a docker-compose.yaml file:

services:
  vectorai:
    image: williamimoh/actian-vectorai-db:latest
    platform: linux/amd64
    container_name: vectorai_db
    ports:
      - "50052:50051"
    volumes:
      # vector data persists across restarts
      - ./data:/app/data
      # audit log lives on host — not inside the container
      - ./audit_logs:/app/audit_logs
    environment:
      - VECTORAI_LOG_LEVEL=info
    restart: unless-stopped

Run the service:

docker-compose up -d

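The database listens on port 50051 inside the container, and Docker Compose maps it to port 50052 on the host, which is the address the scripts below connect to. Vector data persists in ./data. Audit logs write to ./audit_logs on the host, which keeps all access records inside your network boundary.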
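Before moving on, you may want to confirm that the mapped host port accepts connections. The sketch below is a plain TCP check using only the standard library; it assumes host port 50052 from the Compose file above and does not verify that the service behind the port is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50052 is the host side of the "50052:50051" mapping in docker-compose.yaml.
    print("VectorAI DB port reachable:", port_is_open("localhost", 50052))
```

If this prints False, check that the container is running with docker ps before debugging the client code.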

Note:

M3/M4 Apple Silicon machines might encounter a gRPC disconnection error without any container logs. In this case, disable Rosetta in Docker Desktop.
VectorAI DB is under active development.

Step 2: Build the ingestion pipeline

Install the client library and run the ingestion pipeline to convert clinical documents into embeddings and store them in your local vector database.

Use uv for dependency management and execution. It is fast, reproducible, and avoids global Python state.

Download the Actian VectorAI client package. This creates a file actian_vectorai-0.1.0b2-py3-none-any

Initialize your project:

uv init .
uv venv

After initialization, install the Actian VectorAI package:

uv pip install actian_vectorai-0.1.0b2-py3-none-any

Add the embedding model dependency:

uv add sentence-transformers

Create a file ingest.py with the following contents:

import re
import hashlib
from actian_vectorai import VectorAIClient, Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

# ── Config ────────────────────────────────────────────────────────────────────
VECTORAI_HOST  = "localhost:50052"
COLLECTION     = "clinical_docs"
EMBED_MODEL    = "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim
VECTOR_DIM     = 384
CHUNK_TOKENS   = 512
OVERLAP_TOKENS = 50

# ── Synthetic clinical notes (replace with real de-identified corpus) ─────────
RAW_NOTES = [
    {
        "document_id":   "card_note_001",
        "document_type": "clinical_note",
        "department":    "cardiology",
        "date":          "2025-03-15",
        "author_role":   "attending_physician",
        "text": """
            Patient: [NAME REDACTED], DOB: [DATE REDACTED], MRN: [MRN REDACTED]
            Chief Complaint: Chest pain radiating to left arm, onset 2 hours ago.
            Assessment: Acute ST-elevation myocardial infarction confirmed on ECG.
            History: Hypertension and type 2 diabetes. Started on aspirin 325 mg,
            clopidogrel 600 mg loading dose, and heparin infusion per ACS protocol.
            Plan: Emergency PCI. Beta-blocker therapy with metoprolol succinate
            25 mg daily post-procedure. ACE inhibitor ramipril 5 mg daily initiated
            24 hours post-PCI. Follow-up echocardiography in 6 weeks.
        """,
    },
    {
        "document_id":   "card_protocol_001",
        "document_type": "treatment_protocol",
        "department":    "cardiology",
        "date":          "2025-01-10",
        "author_role":   "department_head",
        "text": """
            Cardiology Protocol — Heart Failure with Reduced EF (HFrEF)
            First-line therapy:
            - ACE inhibitor: ramipril 2.5–10 mg daily (or ARB if ACE-intolerant).
            - Beta-blocker: bisoprolol 1.25–10 mg daily, carvedilol 3.125–25 mg BID,
              or metoprolol succinate 12.5–200 mg daily. Titrate every 2 weeks.
            - MRA: spironolactone 25–50 mg daily for NYHA class II–IV
              if eGFR > 30 and K+ < 5.0.
            Target: Symptomatic improvement. Reassess LVEF at 3–6 months.
            Device therapy (ICD/CRT) if LVEF ≤ 35% after 3 months optimal therapy.
        """,
    },
    {
        "document_id":   "psych_note_001",
        "document_type": "clinical_note",
        "department":    "psychiatry",
        "date":          "2025-03-18",
        "author_role":   "psychiatrist",
        "text": """
            Psychiatry intake note — [NAME REDACTED], [AGE REDACTED]-year-old.
            Presenting with major depressive episode, PHQ-9 score 18 (severe).
            No current suicidal ideation. Started sertraline 50 mg daily.
            Psychotherapy referral placed. Follow-up in 2 weeks.
            Safety plan documented. Family support confirmed present.
        """,
    },
    {
        "document_id":   "onco_note_001",
        "document_type": "clinical_note",
        "department":    "oncology",
        "date":          "2025-03-20",
        "author_role":   "oncologist",
        "text": """
            Oncology note — [NAME REDACTED].
            Diagnosis: Stage IIIA non-small cell lung cancer, adenocarcinoma.
            EGFR mutation positive (exon 19 deletion).
            Plan: Osimertinib 80 mg daily (first-line EGFR-targeted therapy).
            Baseline CT chest/abdomen/pelvis completed. Brain MRI negative.
            Next imaging review in 8 weeks. Antiemetics PRN, skin care for rash.
        """,
    },
]

# ── Step 1: De-identification ─────────────────────────────────────────────────
# For production use Presidio:
#   from presidio_analyzer import AnalyzerEngine
#   from presidio_anonymizer import AnonymizerEngine
#   analyzer, anonymizer = AnalyzerEngine(), AnonymizerEngine()
#   result = analyzer.analyze(text=raw, entities=[...], language="en")
#   clean  = anonymizer.anonymize(text=raw, analyzer_results=result).text
#
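# This demo applies lightweight regex to already-synthetic notes.
# The patterns are naive: the name pattern also redacts capitalized phrases
# such as "Heart Failure", and the ZIP pattern matches any 5-digit number.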

_HIPAA_PATTERNS = [
    (r"\b\d{3}-\d{2}-\d{4}\b",                   "[SSN]"),          # SSN
    (r"\bMRN[-:\s]*\d{4,10}\b",                   "[MRN]"),          # medical record #
    (r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",             "[DATE]"),         # dates
    (r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",             "[NAME]"),         # names (simple)
    (r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",         "[PHONE]"),        # phone
    (r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.\w+\b","[EMAIL]"),      # email
    (r"\b\d{5}(?:-\d{4})?\b",                     "[ZIP]"),          # zip
    (r"\b(?:https?://)\S+",                        "[URL]"),          # URLs
    (r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",  "[IP]"),           # IP addresses
]

def deidentify(text: str) -> str:
    """Redact a subset of HIPAA Safe Harbor identifiers with regex (demo only)."""
    for pattern, replacement in _HIPAA_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text.strip()

# ── Step 2: Chunking ──────────────────────────────────────────────────────────
def chunk(text: str, size: int = CHUNK_TOKENS, overlap: int = OVERLAP_TOKENS) -> list[str]:
    """Split text into overlapping token windows (whitespace tokenisation)."""
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start += size - overlap
    return chunks

# ── Step 3: Embedding ─────────────────────────────────────────────────────────
print(f"Loading embedding model: {EMBED_MODEL}")
_model = SentenceTransformer(EMBED_MODEL)

def embed(texts: list[str]) -> list[list[float]]:
    return _model.encode(texts, normalize_embeddings=True).tolist()

# ── Step 4: Ingest into VectorAI DB ──────────────────────────────────────────
def _chunk_id(doc_id: str, idx: int) -> int:
    """Stable integer ID from (document_id, chunk_index)."""
    h = hashlib.sha256(f"{doc_id}:{idx}".encode()).hexdigest()
    return int(h[:15], 16)

def ingest(notes: list[dict]) -> None:
    with VectorAIClient(VECTORAI_HOST) as client:
        # Health check
        info = client.health_check()
        print(f"VectorAI DB connected  version={info['version']}")

        # Create collection (skip if already exists)
        try:
            client.collections.create(
                name=COLLECTION,
                vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.Cosine),
            )
            print(f"Collection '{COLLECTION}' created  dim={VECTOR_DIM}")
        except Exception as e:
            if "exists" in str(e).lower():
                print(f"Collection '{COLLECTION}' already exists — skipping create")
            else:
                raise

        total_chunks = 0
        for note in notes:
            # De-identify FIRST — before chunking or embedding
            clean_text = deidentify(note["text"])

            # Chunk second
            chunks = chunk(clean_text)

            # Embed third
            vectors = embed(chunks)

            # Build PointStruct records with strict metadata schema
            # All four metadata fields are REQUIRED — no optional fields.
            points = [
                PointStruct(
                    id=_chunk_id(note["document_id"], i),
                    vector=vectors[i],
                    payload={
                        # ── strict schema ──────────────────────────────────────
                        "document_type": note["document_type"],   # required
                        "department":    note["department"],       # required — RBAC filter key
                        "date":          note["date"],             # required
                        "author_role":   note["author_role"],      # required
                        # ── retrieval helpers ──────────────────────────────────
                        "document_id":   note["document_id"],
                        "chunk_index":   i,
                        "text":          chunks[i],                # de-identified chunk text
                    },
                )
                for i in range(len(chunks))
            ]

            client.points.upsert(COLLECTION, points)
            total_chunks += len(chunks)
            print(f"  ✓ {note['document_id']}  dept={note['department']}  chunks={len(chunks)}")

        print(f"\nIngestion complete — {len(notes)} documents, {total_chunks} chunks total")

if __name__ == "__main__":
    ingest(RAW_NOTES)

This file performs the following actions:

De-identifies data: The system removes all HIPAA identifiers from raw text before processing. Names, dates, and other sensitive fields are replaced with placeholders to prevent PHI from entering the system pipeline.
Chunks the text: The system splits the cleaned text into 512-token segments with a 50-token overlap. This overlap preserves context across chunk boundaries, which improves retrieval accuracy.
Embeds the chunks: The system converts each chunk into a numerical vector using a local sentence-transformers model. This captures semantic meaning while keeping all processing within your network.
Stores with metadata: The system writes each chunk and its vector to VectorAI DB, along with necessary fields like document_type, department, date, and author_role. These fields support strict access control during queries.
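To make the chunking step concrete, here is a minimal standalone sketch of a sliding window with overlap. It is not the chunk() helper that ingest.py defines; the function name chunk_tokens is hypothetical, and whitespace-separated words stand in for model tokens, which is an assumption.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the text
    return windows


# Hypothetical input: 1,000 placeholder "tokens".
tokens = [f"w{i}" for i in range(1000)]
windows = chunk_tokens(tokens)
print([len(w) for w in windows])
```

The last 50 tokens of one window are the first 50 of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.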

Run the script with:

uv run ingest.py

You should see the following results after running the command:

ingest.py execution

The logs show that the ingestion pipeline writes chunks to VectorAI DB.

Step 3: Run your queries

Execute queries against your local RAG system and validate retrieval, access control, and audit logging.

Create a file query.py with the following contents:

import json
import datetime
import urllib.request
import urllib.error
from pathlib import Path
from actian_vectorai import VectorAIClient
from actian_vectorai import FilterBuilder, Field
from sentence_transformers import SentenceTransformer

# ── Config ─────────────────────────────────────────────────────────────────────
VECTORAI_HOST = "localhost:50052"
COLLECTION    = "clinical_docs"
EMBED_MODEL   = "sentence-transformers/all-MiniLM-L6-v2"
AUDIT_LOG     = Path("./audit_logs/queries.jsonl")   # volume-mounted path

# Ollama settings — set OLLAMA_ENABLED=True once `ollama serve` is running
OLLAMA_ENABLED = False          # flip to True when Ollama is ready
OLLAMA_URL     = "http://localhost:11434/api/generate"
OLLAMA_MODEL   = "mistral"      # or "llama3.2:3b" for lower hardware

# ── RBAC: role → allowed departments ──────────────────────────────────────────
# Access is enforced as a MUST filter at the database level.
# A scheduling_bot cannot reach clinical notes; cardiology cannot see psychiatry.
ROLE_PERMISSIONS = {
    "cardiology_clinician":  ["cardiology"],
    "oncology_clinician":    ["oncology"],
    "general_practitioner":  ["cardiology", "oncology", "general"],
    "admin":                 ["cardiology", "oncology", "psychiatry", "general"],
    "scheduling_bot":        ["scheduling"],   # no clinical note access
}

class AccessDeniedError(Exception):
    pass

def allowed_departments(role: str) -> list[str]:
    if role not in ROLE_PERMISSIONS:
        raise AccessDeniedError(f"Unknown role '{role}' — access denied by default.")
    return ROLE_PERMISSIONS[role]

# ── Embedding (reuse the same model as ingest.py) ─────────────────────────────
print(f"Loading embedding model: {EMBED_MODEL}")
_model = SentenceTransformer(EMBED_MODEL)

def embed(text: str) -> list[float]:
    return _model.encode([text], normalize_embeddings=True).tolist()[0]

# ── Step 5: Search with department MUST filter ─────────────────────────────────
def retrieve(query_vec: list[float], departments: list[str], top_k: int = 5) -> list[dict]:
    """
    Hybrid retrieval: vector similarity + metadata MUST filter.
    Results from departments outside the allowed list are impossible —
    the filter is applied at the database level, not in application code.
    """
    results = []
    with VectorAIClient(VECTORAI_HOST) as client:
        for dept in departments:
            hits = client.points.search(
                collection_name=COLLECTION,
                vector=query_vec,
                limit=top_k,
                # MUST filter — department equality enforced at DB level
                filter=FilterBuilder().must(Field("department").eq(dept)).build(),
            )
            for hit in hits:
                payload = getattr(hit, "payload", {}) or {}
                results.append({
                    "score":         round(getattr(hit, "score", 0.0), 4),
                    "document_id":   payload.get("document_id"),
                    "document_type": payload.get("document_type"),
                    "department":    payload.get("department"),
                    "date":          payload.get("date"),
                    "author_role":   payload.get("author_role"),
                    "chunk_index":   payload.get("chunk_index"),
                    "text":          payload.get("text", ""),
                })

    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]

# ── Step 6: LLM answer via Ollama ─────────────────────────────────────────────
_RAG_SYSTEM = (
    "You are a clinical decision support assistant. "
    "Answer ONLY using the context passages below. "
    "Do NOT use external knowledge or make assumptions. "
    "Cite each fact as [Doc N]. "
    "If the context is insufficient, say: 'I cannot answer from the available documents.'"
)

def build_context(chunks: list[dict]) -> str:
    return "\n\n".join(
        f"[Doc {i+1}] ({c['document_type']}, dept={c['department']}, "
        f"date={c['date']}, role={c['author_role']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )

def generate(query_text: str, chunks: list[dict]) -> str:
    if not OLLAMA_ENABLED:
        # Return raw retrieved context when LLM is disabled
        return "[LLM disabled — set OLLAMA_ENABLED=True]\n\n" + build_context(chunks)

    context = build_context(chunks)
    prompt = f"{_RAG_SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query_text}\n\nAnswer:"

    payload = json.dumps({
        "model":  OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 400},
    }).encode()

    try:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read())
            return data.get("response", "").strip()
    except urllib.error.URLError as e:
        return f"[Ollama unreachable: {e}]\n\nRetrieved context:\n{build_context(chunks)}"

def write_audit(record: dict) -> None:
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# ── Public query entry point ───────────────────────────────────────────────────
def query(user_id: str, role: str, query_text: str, top_k: int = 5) -> dict:
    """
    Execute a role-gated RAG query.

    Returns:
        {answer, retrieved_docs, access_denied, error}
    """
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    # RBAC check — before anything else
    try:
        departments = allowed_departments(role)
    except AccessDeniedError as e:
        write_audit({
            "timestamp": timestamp, "user_id": user_id, "role": role,
            "department": "DENIED", "query_text": query_text,
            "retrieved_docs": [], "answer_provided": False, "access_denied": True,
            "denial_reason": str(e),
        })
        return {"answer": f"Access denied: {e}", "retrieved_docs": [], "access_denied": True}

    # Embed → retrieve (with MUST filter) → generate
    q_vec  = embed(query_text)
    chunks = retrieve(q_vec, departments, top_k)
    answer = generate(query_text, chunks)

    doc_refs = [
        {"document_id": c["document_id"], "chunk_index": c["chunk_index"],
         "department": c["department"], "document_type": c["document_type"],
         "score": c["score"]}
        for c in chunks
    ]

    # Audit log — every query, regardless of outcome
    write_audit({
        "timestamp":       timestamp,
        "user_id":         user_id,
        "role":            role,
        "department":      ",".join(departments),
        "query_text":      query_text,
        "retrieved_docs":  doc_refs,
        "answer_provided": True,
        "access_denied":   False,
    })

    return {"answer": answer, "retrieved_docs": doc_refs, "access_denied": False}

# ── Demo runs ─────────────────────────────────────────────────────────────────
if __name__ == "__main__":

    separator = "─" * 60

    # ── Query 1: authorised cardiology query ────────────────────────────────
    print(f"\n{separator}")
    print("QUERY 1 — cardiology_clinician (authorised)")
    print(separator)
    r1 = query(
        user_id="dr_chen_007",
        role="cardiology_clinician",
        query_text="What beta-blocker is recommended for heart failure with reduced ejection fraction?",
    )
    print(f"\nAnswer:\n{r1['answer']}\n")
    print("Retrieved sources:")
    for d in r1["retrieved_docs"]:
        print(f"  score={d['score']}  [{d['document_type']}]  dept={d['department']}  "
              f"doc={d['document_id']}  chunk={d['chunk_index']}")

    # ── Query 2: scheduling bot tries to access clinical notes ───────────────
    print(f"\n{separator}")
    print("QUERY 2 — scheduling_bot (attempting clinical note access)")
    print(separator)
    r2 = query(
        user_id="bot_sched_01",
        role="scheduling_bot",
        query_text="What are the diagnosis notes for cardiology patients?",
    )
    if r2["access_denied"]:
        print(f"\n✗ Access denied (as expected): {r2['answer']}")
    else:
        print(f"\nAnswer:\n{r2['answer']}")
        print("Sources:", r2["retrieved_docs"])

    # ── Query 3: cardiology query that must NOT return psychiatry notes ───────
    print(f"\n{separator}")
    print("QUERY 3 — cardiology_clinician (RBAC must exclude psychiatry)")
    print(separator)
    r3 = query(
        user_id="dr_chen_007",
        role="cardiology_clinician",
        query_text="antidepressant dosing and patient management",
    )
    departments_returned = {d["department"] for d in r3["retrieved_docs"]}
    cross_leak = "psychiatry" in departments_returned
    print(f"\nDepartments in results: {departments_returned or 'none'}")
    print(f"Cross-department leak: {'✗ LEAK DETECTED' if cross_leak else '✓ none — RBAC working correctly'}")

    # ── Show audit log tail ───────────────────────────────────────────────────
    print(f"\n{separator}")
    print(f"AUDIT LOG  →  {AUDIT_LOG}")   # f-string so the path is interpolated
    print(separator)
    if AUDIT_LOG.exists():
        lines = AUDIT_LOG.read_text().strip().splitlines()
        for line in lines[-3:]:          # show last 3 entries
            entry = json.loads(line)
            print(json.dumps({
                "timestamp":     entry["timestamp"],
                "user_id":       entry["user_id"],
                "role":          entry["role"],
                "query_text":    entry["query_text"][:60] + "…",
                "docs_accessed": len(entry["retrieved_docs"]),
                "access_denied": entry["access_denied"],
            }, indent=2))
    else:
        print("No audit log found. Run a query first.")

The script performs three core operations in a single flow.

Enforces access control: The system checks the user’s role before retrieving any data. Each role maps to specific departments, and that mapping is enforced as a mandatory filter at the database level. Unknown roles are blocked and logged immediately.
Retrieves and generates answers: The system embeds the query and retrieves relevant document chunks using vector search, with strict department filters applied. The results are then passed to a local LLM. If the LLM is disabled, the retrieved context is returned directly.
Writes audit logs: The system logs every query locally, including the user ID, role, query text, accessed documents, and access status. This creates a complete audit trail for compliance and review.

Run the script with:

uv run query.py

You should see the following results after running the command:

query.py execution

The output covers three test cases: a valid clinician query, a scheduling bot query that is scoped to its own department and therefore cannot reach clinical notes, and a check for cross-department leakage. Together they let you confirm that RBAC and audit logging work correctly before you move to production.

Step 4: Configure the audit log

Store every query locally by using the volume mapping defined during deployment.

The Docker configuration mounts ./audit_logs from your host into the container. When you run queries, this creates a local folder named audit_logs with a file queries.jsonl.

The file contains the following entries:

{"timestamp": "2026-03-31T15:56:02.552088+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "What beta-blocker is recommended for heart failure with reduced ejection fraction?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:03.098346+00:00", "user_id": "bot_sched_01", "role": "scheduling_bot", "department": "scheduling", "query_text": "What are the diagnosis notes for cardiology patients?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:03.663767+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "antidepressant dosing and patient management", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.331657+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "What beta-blocker is recommended for heart failure with reduced ejection fraction?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.492188+00:00", "user_id": "bot_sched_01", "role": "scheduling_bot", "department": "scheduling", "query_text": "What are the diagnosis notes for cardiology patients?", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

{"timestamp": "2026-03-31T15:56:18.569824+00:00", "user_id": "dr_chen_007", "role": "cardiology_clinician", "department": "cardiology", "query_text": "antidepressant dosing and patient management", "retrieved_docs": [], "answer_provided": true, "access_denied": false}

Each line represents a single query event. The log captures who made the request, their role, the department scope, the query text, and whether access was allowed. This file lives entirely on your infrastructure and gives you a complete, verifiable audit trail for every interaction with PHI.
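Because each line is a standalone JSON object, the log is easy to review with a few lines of Python. The sketch below is one possible way to do that; the helper name summarize_audit is hypothetical, and the sample lines are trimmed versions of the entries above that keep only the fields the helper reads.

```python
import json
from collections import Counter


def summarize_audit(lines):
    """Count queries per (user_id, role) and tally denied requests."""
    per_user = Counter()
    denied = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        per_user[(entry["user_id"], entry["role"])] += 1
        if entry.get("access_denied"):
            denied += 1
    return per_user, denied


# Trimmed sample entries (assumption: real entries carry more fields).
sample = [
    '{"user_id": "dr_chen_007", "role": "cardiology_clinician", "access_denied": false}',
    '{"user_id": "bot_sched_01", "role": "scheduling_bot", "access_denied": false}',
    '{"user_id": "dr_chen_007", "role": "cardiology_clinician", "access_denied": false}',
]
per_user, denied = summarize_audit(sample)
for (user, role), count in per_user.items():
    print(f"{user} ({role}): {count} queries")
print(f"denied requests: {denied}")
```

To run it against the real log, pass an open file handle for audit_logs/queries.jsonl instead of the sample list.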

Wrapping Up

A BAA was never enough because it does not control how your application handles PHI. You solved that by keeping all data, queries, and logs inside your network.

You now have a RAG system that enforces role-based access, retrieves only authorized data, and logs every interaction locally, without external APIs or third-party exposure.

Apply this pattern to other regulated systems. Refer to the VectorAI DB documentation and GitHub repository for updates and implementation details.

Join the community and learn more about Actian.

When to choose on-premises vs. cloud for vector databases

Offisong Emmanuel — Tue, 19 May 2026 19:42:27 +0000

For most of the last decade, the on-premises vs. cloud debate felt settled. Cloud computing was cheaper, faster, and easier to adopt. Enterprises moved workloads from on-premises infrastructure to public cloud services, relying on major cloud providers to handle scalability, maintenance, and security.

In 2026, that assumption is breaking, and cracks are showing up in legal reviews, financial projections, and SLA negotiations. Enterprises face increasing pressure from data residency regulations, stricter enforcement, and scrutiny of cloud security models. Compliance constraints, data security requirements, cost predictability, and latency are forcing teams to reconsider on-premises solutions, private cloud computing, and hybrid cloud infrastructure.

At the same time, AI is moving closer to where data is generated. Manufacturing sites, retail stores, and healthcare environments increasingly require offline capability and sub-100ms latency. That shift helps explain why Oracle released AI Database 26ai for on-premises deployment and why Google is pushing Gemini onto Distributed Cloud for air-gapped environments. This shift signals that large-scale enterprise AI no longer fits neatly in cloud environments.

In this article, we’ll examine why on-premises infrastructure is resurging, what trade-offs you need to know, and how to make defensible deployment decisions.

What’s Driving The On-premises Resurgence

The renewed interest in on-premises infrastructure is not about going back to old systems. It is a response to clear changes in how AI systems are being built and used in 2025 and 2026. For many enterprises, cloud-only vector databases no longer fit their compliance, cost, and reliability needs.

Many factors drive the current on-premises resurgence. This article focuses on the key ones.

Large vendors now support on-premises AI

On-premises AI is no longer treated as an edge case by major vendors. Oracle’s release of AI Database 26ai and Google’s decision to run Gemini on Distributed Cloud show a clear shift in how enterprise AI is being packaged and delivered.

These products are built for large enterprises, not early-stage experiments or research projects. That distinction matters. Large vendors do not invest in complex on-premises AI platforms unless there is strong and growing customer demand. These announcements confirm that many enterprises want to run AI systems inside their own environments, close to their data, and under their full operational control. Why is this?

Regulatory pressure is now a real blocker

Teams used to plan for regulatory risk as a future possibility. Now it’s a day-to-day reality. GDPR enforcement reached record levels in 2025, with insufficient legal basis for data processing driving the largest penalties. That year alone, regulators issued nearly 2,700 fines totaling billions of euros.

From a data security perspective, GDPR enforcement has fundamentally changed how enterprises evaluate cloud services. While cloud service providers offer compliance tooling, legal teams are increasingly wary of relying on third-party providers for sensitive data storage and processing.

HIPAA adds another layer of complexity. For example, in Florida, physicians must maintain medical records for five years after the last patient contact, whereas hospitals must maintain them for seven years under state record-retention requirements. This makes repeated data movement risky and expensive. Financial services and government contractors face similar data sovereignty requirements that limit where data can be stored and processed. In these situations, cloud deployments add legal review, audit work, and ongoing risk. Keeping data on-premises is often the most straightforward way to meet these obligations.

Edge AI requires local and offline operation

AI workloads are increasingly deployed close to where data is created. Manufacturing facilities may operate in air-gapped environments or remote locations with limited connectivity. Retail systems must continue working during network outages. Healthcare applications often require very low latency for real-time decision support.

In these environments, relying on a remote cloud service introduces risk. Network delays and outages directly affect system reliability. On-premises and edge deployments allow vector search and inference to run locally, without depending on constant network access. For many use cases, this local execution is not an optimization but a requirement.

Together, these shifts explain why on-premises vector databases are gaining traction again. The change is driven by the practical realities of deploying production AI systems under real regulatory, cost, and reliability constraints.

The Compliance Calculus

For many enterprises, compliance is the deciding factor in the on-premises versus cloud debate. While cloud providers offer compliance certifications, the real challenge is not whether a platform can be compliant in theory, but whether it can withstand legal review, audits, and long-term operational scrutiny in practice. Once vector databases move into production and begin storing sensitive or regulated data, these questions become unavoidable.

GDPR and the limits of cross-border transfers

The Schrems II ruling changed how European data can be processed outside the EU. Privacy Shield was invalidated, leaving Standard Contractual Clauses as the primary legal mechanism for cross-border data transfers. In highly regulated industries such as financial services and healthcare, many legal teams consider SCCs insufficient due to enforcement uncertainty and ongoing legal challenges.

For vector databases, this matters because embeddings often contain derived personal data. Even if raw records are masked or tokenized, embeddings can still be considered personal data under GDPR. If data must remain within the EEA, or within a specific country, cloud deployments that rely on global infrastructure introduce legal risk. In these cases, on-premises or in-region deployment becomes a requirement rather than a preference.

HIPAA retention and the real cost of data movement

HIPAA does not explicitly require data to stay on-premises, but it does require long retention periods and strict access controls. When vector embeddings are built on top of this data, they inherit the same retention requirements. HIPAA data governance must be enforced whether the vector database runs on-premises or in the cloud.

The cost impact becomes clear when egress fees are included. Consider a system storing 100 TB of embeddings in a cloud environment. At a common egress rate of $0.09 per GB, moving the full dataset out of the cloud once a month over a seven-year retention period results in:

100 TB × $0.09 per GB × 84 months = over $750,000 in egress costs alone

This does not include compute, storage, or indexing costs. With this in mind, will cloud data warehouses really help you cut costs?
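The egress figure above can be reproduced with a short calculation. This sketch assumes decimal terabytes (1 TB = 1,000 GB) and one full transfer of the dataset per month, which is how the 84-month multiplier reads; the function name egress_cost is hypothetical.

```python
def egress_cost(data_tb: float, rate_per_gb: float, transfers: int) -> float:
    """Cost of moving `data_tb` terabytes out of the cloud `transfers` times."""
    gigabytes = data_tb * 1000  # assumption: decimal terabytes
    return gigabytes * rate_per_gb * transfers


# 100 TB, $0.09 per GB, one full transfer per month for 7 years (84 months).
total = egress_cost(data_tb=100, rate_per_gb=0.09, transfers=84)
print(f"${total:,.0f}")
```

If your workload moves only a fraction of the data each month, scale data_tb accordingly; the total falls in proportion.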

Financial services and data sovereignty rules

Financial institutions face additional constraints beyond GDPR. Regulations such as GLBA, APRA, and regional data sovereignty mandates often require strict control over where customer data is stored and processed. Regulators may demand clear evidence of geographic boundaries, access controls, and auditability.

Cloud services can meet some of these requirements, but they often introduce complex configurations, contractual dependencies, and ongoing compliance reviews. For many banks and insurers, on-premises deployment simplifies audits by keeping data within controlled infrastructure that regulators already understand.

Government and public sector constraints

Government contracts introduce some of the strictest infrastructure requirements. Standards such as FedRAMP often mandate US-only infrastructure, restricted access, and tightly controlled environments.

In these cases, public cloud services are frequently disallowed or require extensive approvals. On-premises deployment is often the only viable option for running vector databases in support of government workloads.

When compliance makes cloud untenable

If legal teams flag cross-border data transfers as unacceptable, cloud deployments quickly become impractical. Once data residency is mandatory, on-premises deployment is no longer a trade-off decision. It is a compliance requirement.

The Cost Breakdown Analysis

Cost is often the reason teams revisit the on-premises versus cloud decision. To make a defensible decision, teams need to understand where costs diverge and when self-hosting becomes economically rational.

Where self-hosting breaks even

Research from OpenMetal shows a consistent breakeven point for Pinecone vector databases at scale. Once workloads reach roughly 80 to 100 million queries per month, self-hosted deployments tend to be cheaper than managed cloud services. Below this range, cloud pricing is usually competitive. Above it, usage-based billing begins to dominate total cost.

This threshold matters because many enterprise RAG systems cross it quickly. Customer support, document search, fraud detection, and recommendation systems often serve tens or hundreds of millions of queries each month once deployed across business units or regions.

The hidden cost in cloud pricing

Cloud pricing is rarely just a per-query fee. Vector databases introduce several cost drivers that are easy to overlook during planning.

Egress fees are a major factor. Most cloud providers charge around $0.09 per GB for data leaving their network. Moving embeddings between regions, exporting data for analytics, or migrating to another system all incur these fees. Over time, they become a meaningful portion of total spend.

Vector search also does not scale linearly. As vector counts grow and dimensionality increases, query costs rise faster than expected. What looks affordable at 10 million vectors can become expensive at 500 million, even if query volume grows steadily.

On-premises costs are fixed and predictable

On-premises deployments have real costs, but they behave differently. Hardware is typically amortized over three to five years. Staffing requirements are stable once the system is running. Facilities and power costs are known in advance.

The key difference is predictability. Costs do not spike because of usage patterns or data movement. Once the system is sized correctly, monthly spend remains largely flat, even as query volume increases.
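The contrast between flat and usage-based spend can be expressed as a simple breakeven calculation. The sketch below is illustrative only: the function name is hypothetical, and the fixed monthly cost and per-million-query price are placeholder values, not vendor quotes or figures from the research cited above.

```python
def breakeven_queries_millions(fixed_monthly_cost: float, price_per_million: float) -> float:
    """Monthly query volume (in millions) where usage-based cost equals the fixed cost."""
    if price_per_million <= 0:
        raise ValueError("price_per_million must be positive")
    return fixed_monthly_cost / price_per_million


# Placeholder inputs; substitute your own amortized cost and cloud pricing.
fixed_monthly_cost = 4500.0   # hypothetical on-premises spend per month
price_per_million = 50.0      # hypothetical cloud price per million queries
point = breakeven_queries_millions(fixed_monthly_cost, price_per_million)
print(f"breakeven at {point:.0f} million queries per month")
```

Above the breakeven volume, the fixed deployment is cheaper under these assumptions; below it, usage-based pricing wins.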

A real-world example

Consider a production e-commerce application with the following scale:

500M vectors
200M queries every month
1024 vector dimensions
6M writes monthly

At this scale, a typical managed Pinecone vector database costs around $8,500 per month once compute, storage, and rebuild overhead are included.

Estimated monthly cost

Total Estimated Cost: $8,454 / month

1. Storage

Usage: 845 GB
Cost: $279

2. Query Costs

Configuration:
24 b1 nodes
4 shards × 6 replicas
Assumption: 1% filter selectivity
Estimated Cost: $8,074
Note: Actual query cost may vary. Benchmark your workload on DRN for more accurate estimates.

3. Write Costs

Write Volume: 30 million Write Units (WU)
Assumption: Each write request consumes ≥ 5 WU
Cost: $101

An equivalent on-premises deployment might cost approximately half of that after hardware amortization, assuming an 18-month payback period and one to two engineers supporting the system. After that payback period, costs drop further while capacity remains available.
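As a quick check on the breakdown above, the following sketch sums the three line items and shows how much of the total comes from queries. The on-premises figure simply halves the cloud total, following the "approximately half" estimate; it is not a measured cost.

```python
# Monthly line items from the estimate above, in US dollars.
monthly_costs = {"storage": 279, "queries": 8074, "writes": 101}

cloud_total = sum(monthly_costs.values())
query_share = monthly_costs["queries"] / cloud_total
on_prem_estimate = cloud_total / 2  # assumption: "approximately half"

print(f"cloud total: ${cloud_total:,} per month")
print(f"query share: {query_share:.1%}")
print(f"on-premises estimate: ${on_prem_estimate:,.0f} per month")
```

Query compute dominates the bill, which is why query volume is the variable to watch when estimating breakeven.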

A study by Enterprise Storage Forum shows the cost projection of on-premises and cloud workloads.

Cost alone does not decide every deployment, but once vector workloads reach scale, the economics become difficult to ignore. Understanding where your system sits on this curve is essential before locking in a long-term vector database strategy.

When Latency And Connectivity Matter

Latency and connectivity are often treated as secondary concerns in architecture decisions. For many AI workloads, they are decisive. Once vector databases support real-time systems, network round-trips and internet dependency can make cloud deployments impractical or unsafe.

Real-time response requirements

Some applications have strict response time limits. In healthcare, clinical decision support and diagnostic systems often require responses in under 50 milliseconds. This budget includes data retrieval, vector search, and model inference. Similarly, banks and financial institutions often require very low latency to keep the user experience responsive.

Public cloud deployments add unavoidable network latency. Even within the same region, round-trip latency typically adds 20 to 80 milliseconds before any compute work begins. For applications with tight latency targets, this overhead alone can exceed the total allowed response time. On-premises deployments remove that network hop, allowing systems to meet real-time requirements consistently.
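The latency argument is a budget calculation. The sketch below subtracts the network round trip from a 50-millisecond budget at both ends of the 20 to 80 millisecond range; the function name is hypothetical, and the model ignores jitter and retries.

```python
def remaining_budget_ms(total_budget_ms: float, network_rtt_ms: float) -> float:
    """Time left for retrieval, vector search, and inference after the network hop."""
    return total_budget_ms - network_rtt_ms


BUDGET_MS = 50  # response target from the healthcare example
results = {}
for rtt in (20, 80):  # low and high ends of the in-region round-trip range
    left = remaining_budget_ms(BUDGET_MS, rtt)
    results[rtt] = left
    status = "within budget" if left > 0 else "budget exceeded before any compute"
    print(f"rtt={rtt} ms -> {left} ms left ({status})")
```

Even in the best case, the network consumes a large share of the budget; in the worst case, the request is late before any work starts.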

Systems that must work offline

Many environments cannot rely on constant connectivity. Retail point-of-sale systems must continue operating during network outages. Manufacturing facilities are often located in remote areas with unstable connections. Military and maritime deployments may operate in fully disconnected or classified environments.

In these scenarios, a cloud dependency is a single point of failure. If the network goes down, the AI system stops working. On-premises and edge deployments allow vector search and inference to run locally, ensuring the system continues to function even when external connectivity is unavailable.

The cost of downtime

Downtime at cloud providers has been increasing. On November 18, 2025, a Cloudflare outage disrupted large portions of the internet, causing downtime across major platforms including X, Amazon Web Services, and Spotify. The impact of connectivity failures is not theoretical. In manufacturing, average downtime costs are estimated at $260,000 per hour. When AI systems support quality control, predictive maintenance, or process automation, any outage directly affects production.

A cloud-only architecture introduces risk that is hard to justify in these environments. Even short network disruptions can lead to significant financial loss. On-premises deployments reduce this risk by removing external dependencies from critical execution paths.
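To put a number on a short disruption, the sketch below scales the $260,000 hourly estimate to an outage of a given length. It assumes cost accrues linearly with time, which is a simplification, and the function name is hypothetical.

```python
def downtime_cost(minutes: float, cost_per_hour: float = 260_000) -> float:
    """Estimated loss for an outage, assuming cost accrues linearly over time."""
    return cost_per_hour * minutes / 60


quarter_hour = downtime_cost(15)
print(f"15-minute outage: ${quarter_hour:,.0f}")
```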

For workloads with strict latency targets or limited connectivity, the choice is often clear. Cloud-based vector databases may work during development, but they fail to meet operational requirements in production.

The Operational Complexity Question

The strongest argument for cloud vector databases is operational simplicity. Managed services remove the need to provision hardware, manage clusters, apply patches, or handle failures. For small teams or early-stage projects, this advantage is real and often decisive. Cloud deployments allow engineers to focus on application logic rather than infrastructure.

It is also important to recognize that modern on-premises deployments look very different from those of a decade ago. This is not the world of manual server provisioning and fragile scripts. Kubernetes, infrastructure-as-code, and automated deployment pipelines have reduced operational overhead significantly. Rolling upgrades, automated scaling, and monitoring are now standard practices in on-premises environments as well as in the cloud.

Many enterprises adopt hybrid approaches to balance speed and control. Development and experimentation happen in the cloud, where teams can move quickly and iterate. Production systems run on-premises, where costs are predictable and compliance is easier to enforce. This pattern allows teams to get the best of both models without committing fully to either.

Decision Framework: Eight Questions

The fastest way to make a defensible deployment decision is to walk through a small set of yes or no questions with engineering, legal, finance, and operations.

1. Does your data require geographic restrictions?

Regulations such as GDPR, HIPAA, and financial services rules may limit where data can be stored or processed.

If yes, on-premises should be strongly considered because it provides full control over data location. If no, cloud deployment remains viable.

2. Do you have predictable, high-volume query patterns?

Cloud vector database costs scale with usage. A simple check is monthly queries multiplied by the unit cost.

If usage exceeds roughly 80 to 100 million queries per month, on-premises is often cheaper. Below that range, cloud pricing is usually more economical.

3. Do you need offline capability?

Some systems must continue working without network access, such as in manufacturing, retail, or edge environments.

If yes, on-premises is required. If no, cloud remains an option.

4. Can you tolerate additional latency?

Cloud deployments add network latency, typically 20 to 80 milliseconds per round trip even within the same region.

If your application cannot tolerate this, on-premises deployment is necessary. If it can, cloud performance may be acceptable.

5. Do you have existing infrastructure teams?

Operational capacity matters.

If you already run on-premises systems, the added burden is limited. If not, cloud-managed services provide a clear operational advantage.

6. Is cost predictability important?

Usage-based billing introduces cost variability.

If predictable costs matter, on-premises provides stability. If flexibility matters more, cloud pricing may be a better fit.

7. Are you extending existing IT infrastructure?

Deployment context affects the decision.

If you are extending existing systems, on-premises leverages current investments. If you are building something new, cloud may be faster to deploy.

8. How large is your data footprint?

Data volume and access frequency influence long-term cost.

If you manage more than 10 TB with frequent access, on-premises becomes attractive. If your data is smaller, cloud is often sufficient.

When several answers point in the same direction, the decision becomes easy to explain and defend across engineering, legal, finance, and operations teams.
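
One way to make the walkthrough concrete is to record the answers and count which direction they point. The sketch below is a hypothetical helper, not a formal scoring model. The question keys are names invented for this example, and question 8 is treated as "yes" when the footprint exceeds 10 TB with frequent access.

```python
from collections import Counter

# True means a "yes" answer favors on-premises; False means "yes" favors cloud.
YES_FAVORS_ONPREM = {
    "geographic_restrictions": True,
    "predictable_high_volume": True,
    "offline_capability": True,
    "latency_tolerant": False,  # "yes, we can tolerate latency" favors cloud
    "existing_infra_team": True,
    "cost_predictability": True,
    "extending_existing_it": True,
    "large_data_footprint": True,
}


def tally(answers: dict[str, bool]) -> Counter:
    """Count how many answers point to each deployment model."""
    votes = Counter()
    for question, yes in answers.items():
        favors_onprem = YES_FAVORS_ONPREM[question] == yes
        votes["on-premises" if favors_onprem else "cloud"] += 1
    return votes


# Example answers for an imaginary team.
answers = {
    "geographic_restrictions": True,
    "predictable_high_volume": False,
    "offline_capability": False,
    "latency_tolerant": True,
    "existing_infra_team": True,
    "cost_predictability": True,
    "extending_existing_it": False,
    "large_data_footprint": False,
}
votes = tally(answers)
print(dict(votes))
```

A simple count hides hard constraints. Per questions 3 and 4, a need for offline capability or an inability to tolerate the added latency settles the decision regardless of the other answers.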

When Cloud Makes Sense

On-premises deployment is not always the right answer. In many situations, cloud-based vector databases remain the better choice. Being clear about these cases helps avoid over-engineering.

Unpredictable scaling: Startups and new products often face uncertain growth. Cloud platforms allow rapid scaling without long-term infrastructure commitments, which reduces risk when demand is unclear.
Small data volumes: When total data is under 10 TB and query volume stays below about 50 million queries per month, cloud pricing usually works well and is simpler than self-hosting.
Rapid experimentation: Proof-of-concepts, research projects, and early prototypes benefit from fast setup and easy teardown. Cloud services support quick iteration with minimal operational effort.
No compliance constraints: If data residency, sovereignty, and regulatory requirements are not an issue, cloud deployment avoids legal complexity and speeds up delivery.
Limited infrastructure expertise: Teams focused on application logic rather than operations can rely on managed services instead of maintaining databases, clusters, and hardware.

In these cases, cloud is the most effective and practical option.

Hybrid Deployment Strategies

Hybrid deployments act as the middle ground for enterprises that need both speed and control. Rather than treating cloud and on-premises as mutually exclusive, teams place each part of the system where it performs best.

Cloud for iteration, on-prem for scale

A common pattern is to develop and test in the cloud, where managed services and elastic infrastructure enable rapid iteration. Once models, indexes, or pipelines are stable, they are promoted into on-premises production environments to meet compliance, latency, and operational requirements. This preserves developer velocity without compromising production guarantees.

Data segregation by risk and regulation

Hybrid architectures also allow organizations to separate workloads by risk profile. Sensitive or regulated data stays on-premises, while analytics, training, or search over derived data runs in the cloud. The same logic applies regionally: EU data may remain on-premises or in sovereign environments, while US workloads run in public cloud regions, avoiding global systems being constrained by the strictest jurisdiction.

Cost and migration flexibility

Cost optimization is another driver. Frequently accessed vectors or low-latency services can be cheaper and more predictable on-premises, while cold storage and bursty workloads benefit from cloud pricing. Many teams start cloud-first, then selectively move components on-premises as scale or compliance pressures grow. Hybrid makes this a controlled evolution rather than a disruptive rewrite.

Industry research shows this is a stable operating model. Google Distributed Cloud and similar platforms explicitly frame hybrid as a long-term strategy, recognizing that modern systems are designed to span environments, not collapse them into one.

Actian’s Approach To On-premises Vector Databases

For teams that conclude that on-premises is the right deployment model, the next question is: which platform can actually meet these requirements? Actian’s approach is built specifically for this audience, without assuming the cloud is the default or the end state.

Actian delivers an enterprise-grade vector database that runs fully in your own data center or controlled environments. You retain full control over data placement, networking, and operations. There is no forced dependency on external cloud services, which simplifies audits and long-term system design.

Compliance requirements are treated as baseline constraints. By keeping data local and eliminating egress paths, Actian aligns with GDPR, HIPAA, FedRAMP, and similar regulatory frameworks. This reduces the need for compensating controls or complex legal workarounds.

Cost behavior is also predictable. Actian avoids usage-based pricing models that scale with queries or vector counts. This makes budgeting simpler and removes surprises as workloads grow.

Actian also supports edge deployments. Its architecture supports offline operation and local inference, making it suitable for manufacturing sites, retail locations, and other environments where connectivity is limited or unreliable. The system is designed to keep working even when the network does not.

Final Thoughts

Choosing between cloud and on-premises for vector databases is about understanding your priorities. Cloud works well for small workloads, rapid experimentation, and teams without deep infrastructure expertise. On-premises makes sense when compliance, latency, cost predictability, or scale are critical.

Many enterprises find a hybrid approach is the best balance, combining cloud flexibility with on-premises control. The key is making intentional decisions based on your data, workloads, and regulatory needs rather than following trends.

Organizations trust Actian data management and data intelligence solutions to streamline complex data environments and accelerate the delivery of AI-ready data. As the data and AI division of HCLSoftware, Actian helps enterprises manage and govern data at scale across on-premises, cloud, and hybrid environments. Learn more about Actian and how it fits into your on-premises AI strategy.

What if ML pipelines had a lock file?

Offisong Emmanuel — Wed, 11 Feb 2026 16:16:24 +0000

I spent two hours last month staring at identical Git commits trying to figure out why my model retrain had different results.

The code was the same. The hyperparameters were the same. I was even running on the same machine. But the validation metrics had shifted by 12%, and I couldn't explain why. I checked everything twice: my random seeds were fixed, my dependencies were pinned, my Docker image hadn't changed. Then I looked at the data.

Someone had added a column to an upstream table and backfilled it. Nothing broke. The pipeline kept running. Training succeeded. But the feature distribution had shifted, and the model had learned from data that no one realized was different.

That experience changed how I think about ML pipelines. We can lock dependencies. We can lock infrastructure. But the computation itself has no identity. Pipelines are still scripts that read mutable data, assume schemas that drift, and depend on execution details that change quietly.

In this article, we’ll walk through why that makes ML pipelines hard to reproduce, what a pipeline lock file actually needs to capture, and how treating computation as an artifact changes how we debug, audit, and build models.

Why ML pipelines are hard to reproduce

When an ML pipeline fails to reproduce, the code is rarely the problem. Most teams already version their training scripts, feature logic, and model code using Git. The issue is that the meaning of that code depends on far more than what lives in the repository.

Consider a fraud detection pipeline. The code reads transaction data, joins it with user profiles, applies feature transformations, and trains a model. The Python script and SQL queries are tracked in Git. The model architecture is documented. Everything looks reproducible.

After a while, fraud detection accuracy drops in production, and you are asked to recreate the training run for an audit. You can't. The code runs, but the model comes out different. Something changed, but what?

The problem is that ML pipelines don't just depend on code. They depend on data, schemas, and execution details that live outside the repository and change without anyone noticing.

Data
Pipelines usually read from tables that change over time. Most of these tables are stored in a data warehouse like Amazon Redshift or Google BigQuery. Rows are added or removed. Backfills happen. A column gets renamed or its meaning changes. Even when teams snapshot data, those snapshots are often implicit, not recorded as part of the pipeline run itself.

In this fraud pipeline, training data comes from a warehouse table like transactions. Between the original training run and the reproduction attempt, the data team backfilled several months of historical records to fix a reporting bug. The pipeline query didn’t change:

SELECT * FROM transactions WHERE date >= '2025-01-01'

But the rows returned did.

The original model was trained on one set of data (transaction amounts, merchant categories, and user behavior), while the reproduced run was trained on a different set. Even though both runs used the same code, neither recorded which specific data version was used.

From the outside, it looks like “the same pipeline.” In reality, two different datasets flowed through it.

The problem is even worse with derived tables. If the fraud model depends on a shared feature table maintained by another team, and that team fixes a bug in their aggregation logic and recomputes the table, our pipeline can keep running and silently consume the updated features. There is no error or warning, just different inputs flowing into the same code.
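
To see how the same query can quietly return different data, here is a minimal sketch that fingerprints the rows a query returns. It uses an in-memory SQLite table as a stand-in for the warehouse. The table layout, the `fingerprint` helper, and the added `ORDER BY id` (needed so rows hash in a stable order) are assumptions for illustration.

```python
import hashlib
import sqlite3

# Same filter as the pipeline query, plus ORDER BY for a deterministic row order.
QUERY = "SELECT * FROM transactions WHERE date >= '2025-01-01' ORDER BY id"


def fingerprint(conn: sqlite3.Connection, query: str) -> str:
    """Hash every row the query returns."""
    digest = hashlib.sha256()
    for row in conn.execute(query):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2025-01-05", 42.0), (2, "2025-02-10", 13.5)],
)
before = fingerprint(conn, QUERY)

# Simulate the backfill: a historical row appears, the query text stays the same.
conn.execute("INSERT INTO transactions VALUES (3, '2025-01-20', 99.9)")
after = fingerprint(conn, QUERY)

print("query unchanged, data changed:", before != after)
```

If each training run stored a fingerprint like this alongside the model, the audit would start from a recorded fact rather than a guess.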

Schemas
Schemas add another layer of fragility. Many pipelines assume schemas rather than enforce them.

During the fraud detection data backfill, the schema changed, too. A new column, merchant_risk_score, was added to the transactions table. It was nullable at first because historical data didn’t have values for it yet.

The feature pipeline didn’t break. It simply treated missing values as zero during normalization. That meant older transactions effectively had no merchant risk, while newer ones suddenly did. The feature still existed. The code still ran. But the meaning of the feature changed.

As a result, the model learned two different behaviors depending on when a transaction occurred. Recent data emphasized merchant risk. Older data didn’t. Overall metrics looked fine during training, but once deployed, the model began misclassifying edge cases in production.

When accuracy dropped, the team assumed normal data drift and retrained. The retrain succeeded, but the new model still didn’t match the original. The schema change had rewritten the semantics of the features, and nothing in the pipeline recorded that shift or made it visible.
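
Enforcing a schema instead of assuming one does not take much code. The sketch below compares the schema a pipeline expects with the schema it actually observes and reports any difference. The `Column` class and `schema_violations` function are hypothetical names, and the columns are simplified from the fraud example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool


def schema_violations(expected: list[Column], observed: list[Column]) -> list[str]:
    """Return a human-readable list of differences between two schemas."""
    exp = {c.name: c for c in expected}
    obs = {c.name: c for c in observed}
    problems = []
    for name in sorted(obs.keys() - exp.keys()):
        problems.append(f"unexpected column: {name}")
    for name in sorted(exp.keys() - obs.keys()):
        problems.append(f"missing column: {name}")
    for name in sorted(exp.keys() & obs.keys()):
        if exp[name] != obs[name]:
            problems.append(f"changed column: {name}")
    return problems


expected = [
    Column("amount", "float64", nullable=False),
    Column("merchant_category", "string", nullable=False),
]
# After the backfill, a nullable column has appeared upstream.
observed = expected + [Column("merchant_risk_score", "float64", nullable=True)]

violations = schema_violations(expected, observed)
print(violations)
```

A pipeline that stops when this list is non-empty fails at the boundary, instead of silently turning missing values into zeros.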

Dependencies and execution details
Dependencies and execution details add another layer of instability. A query planner may choose a different plan. A caching layer may reuse an old result. A user-defined function (UDF) can change behavior because one of its dependencies was updated. None of this shows up in Git, and very little of it is visible in logs.

Caches can also change your results. They speed things up, which is good, but they introduce hidden state that can differ between runs. For example, suppose your pipeline caches a feature table and someone then updates the upstream logic. Your cache is now stale, but nothing tells you that. You're training on a mix of old features and new data.
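
One common way to avoid stale caches is to derive the cache key from the logic and the input, so a change to either produces a new key. The following is a minimal sketch of that idea, not how any particular tool implements it. The function names and the string used to represent the feature logic are assumptions.

```python
import hashlib

cache: dict[str, list[float]] = {}


def cache_key(logic_source: str, input_fingerprint: str) -> str:
    """Key the cache on what is computed and on what it is computed from."""
    payload = f"{logic_source}\n--\n{input_fingerprint}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def cached_features(logic_source, input_fingerprint, compute):
    """Return (key, value), computing the value only on a cache miss."""
    key = cache_key(logic_source, input_fingerprint)
    if key not in cache:
        cache[key] = compute()
    return key, cache[key]


amounts = [120.0, 80.0]

# First run: the original feature logic.
key_v1, feats_v1 = cached_features(
    "amount / 100", "data-v1", lambda: [a / 100 for a in amounts]
)
# Someone updates the upstream logic: the key changes, so the old entry is not reused.
key_v2, feats_v2 = cached_features(
    "amount / 1000", "data-v1", lambda: [a / 1000 for a in amounts]
)

print(key_v1 != key_v2, feats_v1 != feats_v2)
```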

Even the runtime version matters. The original model artifact had been serialized with Python 3.9, but the reproduction ran under Python 3.11. The model loaded successfully, but downstream behavior wasn’t identical.
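
Recording the runtime next to the model artifact is cheap. Below is a small sketch that captures a few runtime details using only the standard library and compares a stored record against the current environment. The field names and helper functions are assumptions, and a real record would likely include library versions too.

```python
import json
import pickle
import platform


def runtime_record() -> dict:
    """Capture the runtime details that can affect serialized artifacts."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "pickle_protocol": pickle.HIGHEST_PROTOCOL,
    }


def runtime_mismatches(recorded: dict, current: dict) -> list[str]:
    """List the fields whose recorded value differs from the current one."""
    return [key for key in sorted(recorded) if recorded[key] != current.get(key)]


record = runtime_record()
print(json.dumps(record, indent=2))

# Pretend the artifact was produced under a different interpreter version.
stale = dict(record, python_version="0.0.0")
print(runtime_mismatches(stale, record))
```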

The result
The pipeline was reproducible in theory, but not in practice. The same code ran. A different computation happened.

There was no single artifact to inspect. No receipt that captured the data that was read, the schemas that were assumed, the UDF logic that executed, or the cache state that influenced the result. The team spent weeks reconstructing the run from logs, guesses, and tribal knowledge.

This is the gap lock files solved for software dependencies. And it’s the same gap ML pipelines still have today.

Why existing tools don’t fix this

At this point, most teams reach for familiar fixes.

They add more logging. They version datasets manually. They pin library versions. They introduce orchestrators, lineage tools, and experiment trackers. Each tool helps in isolation, but none of them answer the one question that matters during an incident or an audit:

What actually ran?

Logs tell you that a job executed, not which data it read. Git tells you what the code looked like, not how it resolved at runtime. Lineage graphs show connections, but not the concrete inputs, schemas, or cached state used in a specific run. Experiment tracking stores metrics and artifacts, but not the computation that produced them. So when something goes wrong, teams are left reconstructing history from fragments and guesswork.

The deeper issue is that ML pipelines don’t produce a durable artifact of the computation itself. The code is versioned, but the resolved execution is not. Data is mutable. Schemas drift. Execution details change. And none of that has a stable identity you can point to later.

Software engineering solved this problem years ago. We didn’t fix reproducibility by writing better README files or adding more logs. We fixed it by introducing lock files. Lock files are machine-readable artifacts that capture the fully resolved state of a system at execution time, representing the actual thing that ran rather than configuration.

The missing piece in ML is the same idea, applied to computation.

What an ML pipeline lock file actually is

An ML pipeline lock file is not a configuration file. It is not another place to declare what you want to run. It is a record of what actually ran.

In software, a lock file answers a simple question: What was installed? Not which dependencies were requested, but which ones were resolved, down to exact versions and hashes. An ML pipeline lock file needs to answer the same kind of question, but for computation. What computation is this?

That requires three things:

An explicit computation graph
Content identities
Roundtrippability

An explicit computation graph
The lock file must capture the computation as a concrete object. Not a Python script that does things, but the actual reads, transformations, joins, aggregations, UDFs, and caches that make up the pipeline.

For example, when you look at package-lock.json, you don't see installation scripts. You see the resolved dependency tree. Each package, each version. The lock file for an ML pipeline needs the same clarity.

Content identities
Every piece of the computation needs an identity based on its content. The inputs you read. The UDFs you execute. The dependencies you use. The cached artifacts you produce. Same inputs should mean the same identity and different inputs should mean different identities.

If two runs have the same content identities for their inputs, UDFs, and dependencies, they're running the same computation. If any of those identities differ, something changed. You don't have to guess. You can check the hashes.
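
As a rough illustration of content identities, the sketch below hashes a canonical JSON form of each component and then combines those hashes into one run identity. The structure of the `inputs`, `udfs`, and `dependencies` dictionaries is made up for this example and is not a real manifest format.

```python
import hashlib
import json


def content_id(obj) -> str:
    """Hash a canonical JSON encoding so equal content gives equal identities."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_identity(inputs: dict, udfs: dict, dependencies: dict) -> dict:
    """Identity per component, plus one identity for the whole run."""
    parts = {
        "inputs": content_id(inputs),
        "udfs": content_id(udfs),
        "dependencies": content_id(dependencies),
    }
    parts["run"] = content_id(parts)  # hashed before the "run" key is added
    return parts


inputs = {"transactions": "snapshot-2025-01"}
udfs = {"normalize": "def normalize(x): return x / 100"}

run_a = run_identity(inputs, udfs, {"scikit-learn": "1.4.0"})
run_b = run_identity(inputs, udfs, {"scikit-learn": "1.4.0"})
run_c = run_identity(inputs, udfs, {"scikit-learn": "1.5.0"})

print(run_a == run_b)  # same content, same identities
changed = [key for key in run_a if run_a[key] != run_c[key]]
print(changed)         # which identities moved between run_a and run_c
```

Comparing `run_a` with `run_c` shows that only the dependency identity and the overall run identity moved, which is the kind of answer you want during an incident.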

Roundtrippability
One of the core features of an ML lock file is roundtrippability. A real pipeline lock file must be runnable on its own. Given the lock file and its associated artifacts, you should be able to rerun the pipeline without relying on a particular machine, environment, or set of hidden caches.

If your lock files have these features, you can diff computations the way you diff lock files. You can verify that a rerun is actually running the same thing. You can cache based on content, not guesses. You can bisect regressions by comparing hashes instead of reading through logs.

Git vs. Manifests

A useful way to understand the value of manifests is to compare what traditional version control captures with what a build manifest records. Git excels at tracking how a pipeline is written, but it stops short of describing the fully resolved computation that actually executed. The manifest (expr.yaml) fills in that missing layer by freezing the execution-time reality of the pipeline.

	Code (git)	Manifest (expr.yaml)
Pipeline definition	✓	✓
Resolved inputs at execution time	✗	✓
Schema contracts	✗	✓
UDF and UDXF content hashes	✗	✓
Cached artifacts	✗	✓
What actually ran	✗	✓

Git is excellent at tracking the source code that defines a pipeline. The manifest goes further by recording the resolved state of that pipeline at execution time.

Create an ML lock file using Xorq

Once you understand what a pipeline lock file is and why it matters, the next step is seeing it in action. Xorq makes it straightforward to turn a declarative pipeline into a reproducible, versioned artifact with a lock file.

To get started, install Xorq using pip or uv:

pip install "xorq[examples]"

uv add "xorq[examples]"

Next, download the financial fraud dataset from Kaggle and place the CSV file in your working directory. This example uses a simplified fraud detection pipeline, but the structure mirrors what you would build in a real production system.

Create a file main.py with the following content:

import os

import xorq.api as xo
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from xorq.caching import ParquetCache
from xorq.config import options

# Store cached results under ./cache in the current working directory
options.cache.default_relative_path = f"{os.getcwd()}/cache"
con = xo.connect()
cache = ParquetCache.from_kwargs()

# 1. Load the dataset
data = xo.read_csv('synthetic_fraud_dataset.csv')

# 2. Train / test split (20% of rows held out for testing)
train, test = xo.train_test_splits(data, test_sizes=0.2)

# 3. Define the model: wrap a scikit-learn pipeline in a Xorq pipeline
sk_pipeline = Pipeline([
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        random_state=42
    ))
])
model = xo.Pipeline.from_instance(sk_pipeline)

# 4. Fit the model
fitted = model.fit(
    train,
    features=[
        'amount',
        'hour',
        'device_risk_score',
        'ip_risk_score'
    ],
    target='is_fraud'
)

# 5. Generate predictions (deferred execution) and mark them for caching
predictions = fitted.predict(test).cache(cache=cache)

# 6. Execute the computation
print(predictions.execute())

A few important things are happening here. The entire pipeline is defined declaratively, with each step clearly described: data ingestion, train–test splitting, model configuration, and a cached prediction stage. Nothing runs until execution is requested. When it does run, Xorq has enough information to capture the full computation as an explicit graph.

At this point, you have a working ML pipeline. In the next step, instead of just running it, we will build it. That build step is what produces the lock file: a manifest that records the resolved computation, the data it read, the schemas it assumed, the cached artifacts it created, and the exact logic that ran.

If your project directory is not already a Git repository, you need to initialize one before building an expression. Xorq records the git state as part of the build metadata, so a repository with at least one commit is required.

Run the following commands in your project folder:

git init
git add .
git commit -m "initial commit"

Once the repository is initialized, you can build the expression and generate the lock file by running:

xorq build main.py -e predictions

If you are using uv, the equivalent command is:

uv run xorq build main.py -e predictions

This build step is what turns your pipeline from a runnable script into a versioned artifact, complete with a manifest that records the resolved computation.

The output of the run is shown in the image below:

After the build completes, you should see two new directories: builds and cache. The cache directory holds cached intermediate results created during execution. The builds directory contains the build artifacts themselves. Inside builds, you will find a directory named with a content-derived hash, for example 78ff43314468. This directory is the lock file in practice. It is the concrete, portable representation of the pipeline run.

Within that directory, several files are generated automatically, including expr.yaml, metadata.json, and profiles.yaml. The most important of these is expr.yaml. This file is the receipt for what actually ran. It describes the computation graph, the resolved inputs, the schema contracts, the cached nodes, and the content hashes that give the pipeline its identity.

Taken together, the build directory is a versioned, cached, and portable artifact. Once it exists, workflows that were previously fragile or manual become straightforward: reproducible runs, diffable computation, bisectable regressions, portable artifacts, and, importantly, composition.

The expression file
At first glance, expr.yaml looks intimidating. It contains many components, but its purpose is simple. It describes the computation itself, explicitly and completely.

Below is an abridged example:

nodes:
    '@read_4d6c147c9486':
      op: Read
      method_name: read_parquet
      name: ibis_read_csv_nepinfk5dzbxja2bo4kycwisyq
      profile: 846181d9920579c7c1b10dd45b3ab9b2_0
      read_kwargs:
      - - path
        - builds/78ff43314468/database_tables/917eccee9a442913a8c1afca12cf69b0.parquet
      - - table_name
        - ibis_read_csv_nepinfk5dzbxja2bo4kycwisyq
      normalize_method: fvfvfvfvf
      schema_ref: schema_c4a0925bdfca
      snapshot_hash: 4d6c147c9486fe2f5140558ff6860b60

This first node answers a deceptively important question: What data was read? Not “which table name,” and not “which query,” but the exact data source. The Read node points to a concrete file, often materialized into the build directory itself. That means the pipeline is tied to the data that was actually used, not whatever that table happens to contain today.

The schema_ref is part of the plan. If the schema changes, this node no longer matches, and the computation’s identity changes with it.

Now look at how transformations are represented:

    '@filter_d5f72ffce15d':
      op: Filter
      parent:
        node_ref: '@read_4d6c147c9486'
      predicates:
      - op: LessEqual
        left:
          op: Multiply
          left:
            op: Cast
     predicted:
          op: ExprScalarUDF
          class_name: _predicted_18c1451165c

The code above describes the filter. The predicate itself is part of the graph, not hidden inside a function call or a SQL string. The filter is explicitly connected to its parent node, so there is no ambiguity about ordering or dependencies.

Every transformation builds on a previous node, forming a complete expression tree:
Read → Filter → Aggregate → Cache

Later in the file, you’ll see nodes like this:

'@cachednode_e7b5fd7cd0a9':
  op: CachedNode
  parent:
    node_ref: '@remotetable_9a92039564d4'
  cache:
    type: ParquetCache

Caching is also part of the computation. Because the cache appears in the graph, it is reproducible and portable. There are no hidden cache keys, no local assumptions, and no silent reuse of stale results. If the upstream logic changes, the cache node’s identity changes too.

Finally, notice the node names themselves:

@read_4d6c147c9486
@filter_d5f72ffce15d
@cachednode_e7b5fd7cd0a9

These identifiers are content-derived. They are hashes of the node’s inputs, logic, schema, and configuration. Change anything meaningful, and the identifier changes. That change propagates through the graph.
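
The propagation works like a Merkle tree: each node's identifier is a hash over its own definition plus the identifiers of its parents. The sketch below illustrates the idea only. It is not Xorq's actual hashing scheme, and the `node_id` and `build_graph` helpers, the predicate string, and the second snapshot hash are invented for the example.

```python
import hashlib
import json


def node_id(op: str, params: dict, parents: list[str]) -> str:
    """Derive an identifier from the node's content and its parents' identifiers."""
    payload = json.dumps(
        {"op": op, "params": params, "parents": parents}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"@{op.lower()}_{digest}"


def build_graph(snapshot_hash: str) -> dict[str, str]:
    """A three-node chain: Read -> Filter -> CachedNode."""
    read = node_id("Read", {"snapshot_hash": snapshot_hash}, [])
    filt = node_id("Filter", {"predicate": "amount <= 1000"}, [read])
    cached = node_id("CachedNode", {"cache": "ParquetCache"}, [filt])
    return {"read": read, "filter": filt, "cache": cached}


original = build_graph("4d6c147c9486fe2f5140558ff6860b60")
rebuilt = build_graph("4d6c147c9486fe2f5140558ff6860b60")
after_backfill = build_graph("0000000000000000000000000000beef")

print(original == rebuilt)  # same content, same identifiers
print([name for name in original if original[name] != after_backfill[name]])
```

Only the data snapshot changed in the last build, yet every downstream identifier changed with it, including the cache node.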

This is what makes expr.yaml a lock file. Instead of saying “run this Python script,” it records what computation resolved, what data it read, what schemas it assumed, and where caching occurred. The hash of the build becomes the identity of the computation itself.

Treating pipelines as building blocks

So far, we’ve looked at how Xorq turns a pipeline into a versioned artifact. The payoff comes because these artifacts are composable. When you build a pipeline with Xorq, the output isn’t just a model or a metric. It’s a versioned computation artifact with a stable hash, for example xyz123. That hash represents the fully resolved training run: data, schemas, feature logic, and execution details.

Because that artifact has an identity, it can be reused. An inference pipeline can explicitly reference the training artifact it depends on. Instead of “load the latest model,” it loads the model produced by build xyz123, along with the exact feature definitions and schema contracts that training used. If training changes, inference doesn’t silently drift. The composition produces a new hash.

This also simplifies deployment. You can roll back to a previous hash without guesswork.

Why is this different from experiment tracking?
Tools like MLflow track artifacts. DVC versions data. Both are useful but neither gives you composable, versioned computation graphs.

MLflow can tell you which model file was produced, but not the resolved computation that created it.
DVC can version datasets, but not how those datasets were transformed, joined, cached, and consumed end-to-end.

Xorq’s unit of composition is the computation itself. Training pipelines produce artifacts that inference pipelines can depend on directly, without re-encoding assumptions in glue code.

What do we gain from this?

The most immediate gain is reproducibility. With a pipeline lock file, rerunning a pipeline means rerunning the same computation, not just the same code. The inputs are fixed, the schemas are known, the logic is explicit, and cached artifacts are part of the record. “Works on my machine” stops being a concern because the computation has a concrete identity.

You can rerun a build with:

xorq run builds/<build-hash>

Another advantage is portability. This means you can take a build produced on a developer’s laptop and execute it in CI, inside a container, or on a different execution engine with confidence that it will behave the same way.

Also, when a model regresses, you can diff runs. Two builds produce two manifests. Instead of guessing what changed, you get a semantic diff: data sources, schema changes, UDF content, planner decisions, cached nodes. This turns multi-week investigations into focused comparisons.
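
To give a feel for what diffing two manifests means, here is a generic sketch that compares two nested dictionaries and reports what was added, removed, or changed. The manifests are simplified stand-ins rather than the real expr.yaml layout, `diff_manifests` is a hypothetical helper, and the second snapshot hash is invented.

```python
def diff_manifests(old: dict, new: dict, prefix: str = "") -> list[str]:
    """Recursively report keys that were added, removed, or changed."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        path = f"{prefix}{key}"
        if key not in old:
            changes.append(f"added   {path} = {new[key]!r}")
        elif key not in new:
            changes.append(f"removed {path} (was {old[key]!r})")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            changes.extend(diff_manifests(old[key], new[key], prefix=f"{path}."))
        elif old[key] != new[key]:
            changes.append(f"changed {path}: {old[key]!r} -> {new[key]!r}")
    return changes


build_old = {
    "read": {"snapshot_hash": "4d6c147c9486", "schema_ref": "schema_c4a0925bdfca"},
    "udf": {"content_hash": "aaa111"},
}
build_new = {
    "read": {"snapshot_hash": "9f3e21b07c55", "schema_ref": "schema_c4a0925bdfca"},
    "udf": {"content_hash": "aaa111"},
    "cache": {"type": "ParquetCache"},
}

changes = diff_manifests(build_old, build_new)
for line in changes:
    print(line)
```

Here the diff points straight at the data snapshot and the new cache node, while confirming that the schema and the UDF did not change.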

Schema drift becomes visible early. Because schemas are part of the contract, drift shows up at boundaries rather than leaking silently into downstream logic. Pipelines fail fast, in the right place, instead of producing subtly wrong models.

Finally, there is an organizational gain. When computation is explicit and versioned, teams move faster with less risk. Audits become tractable because training runs are reproducible.

Closing insights

Lock files changed how we think about software. They gave us a stable unit we could diff, ship, and trust. ML pipelines have needed the same thing for a long time, but until now, there has been nothing concrete to lock.

By giving computation an identity, pipeline manifests turn runs into artifacts. They capture what actually ran, not just what the code described. Once that exists, reproducibility, debugging, audits, and collaboration stop being fragile processes and start becoming mechanical.

Xorq provides a practical and robust foundation for building reproducible, auditable, and production-grade ML workflows. This makes it easy to generate an ML lock file that captures not just what was written, but what actually ran, including resolved inputs, content hashes, and cached artifacts.

For more information about Xorq, head over to their GitHub or their official documentation.