<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cooper D</title>
    <description>The latest articles on DEV Community by Cooper D (@gigapress).</description>
    <link>https://dev.to/gigapress</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1251677%2F2781de9d-e38e-4e18-9d22-c47520c09b5d.jpeg</url>
      <title>DEV Community: Cooper D</title>
      <link>https://dev.to/gigapress</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gigapress"/>
    <language>en</language>
    <item>
      <title>The Deterministic Problem with Probabilistic AI Analytics</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sun, 07 Dec 2025 05:32:56 +0000</pubDate>
      <link>https://dev.to/gigapress/the-deterministic-problem-with-probabilistic-ai-analytics-1n2</link>
      <guid>https://dev.to/gigapress/the-deterministic-problem-with-probabilistic-ai-analytics-1n2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; AI-powered analytics tools use probabilistic systems (LLMs, semantic search, RAG) to answer business questions that demand deterministic accuracy. Ask the same question twice with slightly different wording, and you might get different SQL queries and different answers. This isn't a minor bug - it's a fundamental architectural problem. The solution isn't better AI, it's rethinking how we match business questions to data, using exact matching for core concepts instead of fuzzy semantic search.&lt;/p&gt;




&lt;p&gt;Imagine asking your CFO: "How many claims did we deny in Q3?"&lt;/p&gt;

&lt;p&gt;They pull up a dashboard and say "approximately 1,247, give or take."&lt;/p&gt;

&lt;p&gt;You'd be concerned, right? When an auditor asks that question, "approximately" doesn't cut it. It's either 1,247 or it isn't.&lt;/p&gt;

&lt;p&gt;Now imagine that same CFO runs the question again tomorrow using slightly different words - "What's the count of rejected claims last quarter?" - and gets 1,189.&lt;/p&gt;

&lt;p&gt;You'd question the entire system.&lt;/p&gt;

&lt;p&gt;Yet this is exactly how most AI-powered analytics tools work today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Probabilistic Foundation
&lt;/h2&gt;

&lt;p&gt;Modern AI analytics stacks look something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question in natural language&lt;/li&gt;
&lt;li&gt;Semantic search finds relevant tables and columns from metadata&lt;/li&gt;
&lt;li&gt;RAG pulls additional context about those data elements
&lt;/li&gt;
&lt;li&gt;An LLM generates a SQL query based on the retrieved information&lt;/li&gt;
&lt;li&gt;Execute query, return results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step in this pipeline is probabilistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic search&lt;/strong&gt; uses embeddings to find "similar" content. The same question phrased differently produces different vector representations, which retrieve different metadata chunks.&lt;/p&gt;
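
&lt;p&gt;You can see this ranking instability with a toy example. The three-dimensional "embeddings" below are made up for illustration - real models use hundreds of dimensions - but the mechanics are the same: two phrasings of one question land at slightly different points in vector space, and the nearest column flips.&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two candidate columns (values are invented for the demo).
candidates = {
    "customer_tier":    [0.90, 0.10, 0.40],
    "customer_segment": [0.85, 0.30, 0.30],
}

# Two phrasings of the same question produce slightly different query vectors.
query_a = [0.88, 0.12, 0.38]   # "premium customers"
query_b = [0.80, 0.35, 0.25]   # "top-tier customers"

for name, q in [("phrasing A", query_a), ("phrasing B", query_b)]:
    ranked = sorted(candidates, key=lambda c: cosine(q, candidates[c]), reverse=True)
    print(name, "retrieves:", ranked[0])
```

&lt;p&gt;Phrasing A retrieves &lt;code&gt;customer_tier&lt;/code&gt; first; phrasing B retrieves &lt;code&gt;customer_segment&lt;/code&gt; first. Nothing is "wrong" - the ranking is just sensitive to wording.&lt;/p&gt;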

&lt;p&gt;&lt;strong&gt;RAG retrieval&lt;/strong&gt; ranks results by relevance scores. Small changes in the query can shuffle the ranking, changing which context makes it into the LLM's prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM generation&lt;/strong&gt; is fundamentally non-deterministic. Even with temperature set to 0, the same prompt can produce variations in SQL structure, table joins, or filter conditions.&lt;/p&gt;

&lt;p&gt;This works fine for creative tasks. Writing marketing copy, brainstorming ideas, drafting emails - probabilistic is perfect for these use cases.&lt;/p&gt;

&lt;p&gt;But business analytics isn't creative work. It's precision work.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Approximation Fails
&lt;/h2&gt;

&lt;p&gt;Let's trace what happens with a real question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; "What's the average order value for premium customers last quarter?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First attempt:&lt;/strong&gt; Semantic search finds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; table (0.89 relevance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_tier&lt;/code&gt; column (0.87 relevance)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_total&lt;/code&gt; column (0.85 relevance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-07-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: $247.83&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second attempt&lt;/strong&gt; (user rephrases): "What's the mean order amount for our top-tier customers in Q3?"&lt;/p&gt;

&lt;p&gt;Semantic search now finds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; table (0.88 relevance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_segment&lt;/code&gt; column (0.86 relevance) - different column!&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_value&lt;/code&gt; column (0.84 relevance) - different column!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_segment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gold'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-07-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: $231.56&lt;/p&gt;

&lt;p&gt;Same question. Different columns. Different answer.&lt;/p&gt;

&lt;p&gt;The user didn't change what they were asking. But the probabilistic system interpreted it differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;The issue isn't that the AI is "wrong." It's that there's no ground truth enforcement.&lt;/p&gt;

&lt;p&gt;In the first query, how does the system know that "premium customers" maps to &lt;code&gt;customer_tier = 'premium'&lt;/code&gt; and not &lt;code&gt;customer_segment = 'gold'&lt;/code&gt;? &lt;/p&gt;

&lt;p&gt;It doesn't. It guesses based on semantic similarity.&lt;/p&gt;

&lt;p&gt;Both mappings seem reasonable. Both are retrieved by the semantic search. The LLM makes a choice based on subtle factors in the retrieval ranking and its training data.&lt;/p&gt;

&lt;p&gt;Change the wording slightly, change the retrieval ranking, change the choice.&lt;/p&gt;

&lt;p&gt;This is the fundamental mismatch: &lt;strong&gt;using approximation engines to deliver deterministic outcomes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adb0ilcod5jgc60z9yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adb0ilcod5jgc60z9yi.png" alt="Probablistic vs Deterministic" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Humans Actually Work
&lt;/h2&gt;

&lt;p&gt;Watch an experienced analyst tackle a question, and you'll notice something important.&lt;/p&gt;

&lt;p&gt;Given: "What's the average order value and customer satisfaction rating for premium customers who made at least 3 purchases in high volume stores during peak hours last quarter? Break it down by product and region."&lt;/p&gt;

&lt;p&gt;An analyst doesn't treat every word equally. They immediately decompose it into structured components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics (what to measure):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average order value&lt;/li&gt;
&lt;li&gt;Customer satisfaction rating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Filters (what to include):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Premium customers&lt;/li&gt;
&lt;li&gt;At least 3 purchases
&lt;/li&gt;
&lt;li&gt;High volume stores&lt;/li&gt;
&lt;li&gt;Peak hours&lt;/li&gt;
&lt;li&gt;Last quarter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dimensions (how to group):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product&lt;/li&gt;
&lt;li&gt;Region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now here's the key: they do &lt;strong&gt;exact lookups&lt;/strong&gt;, not fuzzy searches.&lt;/p&gt;

&lt;p&gt;They open their metrics catalog and search for "average order value" - exact match. Either it exists or it doesn't. &lt;/p&gt;

&lt;p&gt;They check their customer segments for "premium customers" - exact match. Either it's defined or it isn't.&lt;/p&gt;

&lt;p&gt;They look up "high volume stores" in the business glossary - exact match.&lt;/p&gt;

&lt;p&gt;No approximation. No "well, this seems similar enough." Either the concept is defined in the system, or they need to go ask someone what it means.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture That Works
&lt;/h2&gt;

&lt;p&gt;The solution is to structure query generation around exact matching for business concepts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxve4r67gshpu1022vqtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxve4r67gshpu1022vqtq.png" alt="Architecture That Works" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Parse the Question into Components
&lt;/h3&gt;

&lt;p&gt;Use an LLM to break down the natural language question into structured parts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"average order value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer satisfaction rating"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"premium customers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"at least 3 purchases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"high volume stores"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="s2"&gt;"peak hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"last quarter"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step can be probabilistic - you're just extracting the business concepts the user is asking about.&lt;/p&gt;
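
&lt;p&gt;A minimal sketch of this parsing step. Here &lt;code&gt;call_llm&lt;/code&gt; is a placeholder that returns a canned response so the example runs; in a real system it would call your model provider, and the validation guards against malformed output before anything downstream trusts it:&lt;/p&gt;

```python
import json

PARSE_PROMPT = """Decompose the question into metrics, filters, and dimensions.
Respond with JSON only, using exactly those three keys.

Question: {question}"""

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a canned response for the demo.
    return json.dumps({
        "metrics": ["average order value", "customer satisfaction rating"],
        "filters": ["premium customers", "at least 3 purchases",
                    "high volume stores", "peak hours", "last quarter"],
        "dimensions": ["product", "region"],
    })

def parse_question(question: str) -> dict:
    raw = call_llm(PARSE_PROMPT.format(question=question))
    parsed = json.loads(raw)
    # Validate the shape before any downstream step relies on it.
    for key in ("metrics", "filters", "dimensions"):
        if not isinstance(parsed.get(key), list):
            raise ValueError(f"LLM response missing list field: {key}")
    return parsed
```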

&lt;h3&gt;
  
  
  Step 2: Exact Match Against Your Metadata
&lt;/h3&gt;

&lt;p&gt;Now search your metrics catalog for &lt;strong&gt;exact matches&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"average order value" → FOUND: &lt;code&gt;metrics.avg_order_value&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"customer satisfaction rating" → FOUND: &lt;code&gt;metrics.csat_score&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Search your business glossary for filter definitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"premium customers" → FOUND: &lt;code&gt;customer_tier = 'premium'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"high volume stores" → FOUND: &lt;code&gt;store_id IN (SELECT store_id FROM dim_stores WHERE annual_revenue &amp;gt; 5000000)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"peak hours" → FOUND: &lt;code&gt;EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19,20)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Search dimension tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"product" → FOUND: &lt;code&gt;dim_product.product_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"region" → FOUND: &lt;code&gt;dim_geography.region_name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any concept doesn't have an exact match, you stop and tell the user: "I don't have a definition for 'at least 3 purchases' in the system. Can you clarify, or should I use this interpretation?"&lt;/p&gt;
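
&lt;p&gt;A sketch of the exact-match step. The catalog contents mirror the examples above (in practice they live in a queryable metadata store); the key property is that a concept either resolves or is reported back to the user, never approximated:&lt;/p&gt;

```python
# Minimal business glossary keyed by exact concept name.
GLOSSARY = {
    "premium customers": "customer_tier = 'premium'",
    "high volume stores": ("store_id IN (SELECT store_id FROM dim_stores "
                           "WHERE annual_revenue > 5000000)"),
    "peak hours": "EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19,20)",
}

def resolve(concepts, catalog):
    # Exact lookup only: each concept either resolves or is flagged as unknown.
    found = {c: catalog[c] for c in concepts if c in catalog}
    missing = [c for c in concepts if c not in catalog]
    return found, missing

found, missing = resolve(
    ["premium customers", "at least 3 purchases", "high volume stores"], GLOSSARY)
if missing:
    # Stop and ask instead of guessing.
    print(f"No definition for: {', '.join(missing)}. Please clarify.")
```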

&lt;h3&gt;
  
  
  Step 3: Assemble the Query Deterministically
&lt;/h3&gt;

&lt;p&gt;Now you have concrete metadata for each component. Pull the complete definition for each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;metrics.avg_order_value&lt;/code&gt; includes: table, column, calculation logic, any required joins&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_tier = 'premium'&lt;/code&gt; includes: table, column, exact value, any prerequisites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use an LLM to assemble these concrete pieces into SQL. But now the LLM isn't guessing which tables or columns to use - it's combining predefined building blocks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_order_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;satisfaction_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_csat&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;  
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_geography&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'quarter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'3 months'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_stores&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;annual_revenue&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same question asked twice → same metadata retrieved → same query generated.&lt;/p&gt;

&lt;p&gt;Deterministic.&lt;/p&gt;
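
&lt;p&gt;The assembly contract can be sketched as a plain function: resolved building blocks in, byte-identical SQL out. Join resolution is omitted here - in practice each metric definition carries its required joins - but the determinism guarantee is the point:&lt;/p&gt;

```python
def build_query(metric_exprs, filter_exprs, dimension_cols, base_table):
    # Inputs are already exact-matched definitions, so the same resolved
    # components always yield byte-identical SQL.
    select_list = dimension_cols + metric_exprs
    sql = [f"SELECT {', '.join(select_list)}",
           f"FROM {base_table}"]
    if filter_exprs:
        sql.append("WHERE " + "\n  AND ".join(filter_exprs))
    if dimension_cols:
        sql.append("GROUP BY " + ", ".join(dimension_cols))
    return "\n".join(sql)

q1 = build_query(["AVG(order_total) AS avg_order_value"],
                 ["customer_tier = 'premium'"],
                 ["region_name"], "fact_orders")
q2 = build_query(["AVG(order_total) AS avg_order_value"],
                 ["customer_tier = 'premium'"],
                 ["region_name"], "fact_orders")
assert q1 == q2  # same components in, same SQL out
```

&lt;p&gt;If you keep an LLM in this step for flexibility, constrain it to arranging these predefined fragments - and verify the output contains exactly the resolved tables and columns, nothing else.&lt;/p&gt;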

&lt;h2&gt;
  
  
  What This Requires
&lt;/h2&gt;

&lt;p&gt;This architecture has prerequisites:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a metrics catalog.&lt;/strong&gt; Every KPI, every metric, every measure your business uses - defined precisely, stored in a queryable system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a business glossary.&lt;/strong&gt; Every business term, every filter condition, every segment - with exact definitions and the logic to implement them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need dimension mapping.&lt;/strong&gt; Clear relationships between business concepts and data model entities.&lt;/p&gt;

&lt;p&gt;In other words, you need the 80% solved (see the previous article on this). You need the business context documented and structured.&lt;/p&gt;

&lt;p&gt;But once you have that foundation, this architecture works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;Pure exact matching is rigid. What if a user asks about "VIP customers" when your system calls them "premium customers"?&lt;/p&gt;

&lt;p&gt;The answer is a hybrid approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try exact matching first&lt;/li&gt;
&lt;li&gt;If no exact match, do fuzzy matching and ask for confirmation&lt;/li&gt;
&lt;li&gt;Learn from confirmations to expand your synonym mappings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"I don't have a definition for 'VIP customers'. Did you mean 'premium customers' or 'enterprise customers'?"&lt;/p&gt;

&lt;p&gt;User confirms. System logs that "VIP customers" → "premium customers" for this user. Next time, exact match succeeds.&lt;/p&gt;

&lt;p&gt;Over time, the system learns the vocabulary variations while maintaining deterministic query generation.&lt;/p&gt;
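
&lt;p&gt;That loop can be sketched with Python's standard-library &lt;code&gt;difflib&lt;/code&gt; as the fuzzy fallback (the glossary entries are illustrative):&lt;/p&gt;

```python
import difflib

GLOSSARY = {
    "premium customers": "customer_tier = 'premium'",
    "enterprise customers": "customer_tier = 'enterprise'",
}
SYNONYMS = {}  # learned mappings: user phrase -> canonical glossary term

def resolve(term):
    term = term.lower().strip()
    if term in GLOSSARY:                       # 1. exact match
        return GLOSSARY[term], None
    if term in SYNONYMS:                       # learned synonym, still exact
        return GLOSSARY[SYNONYMS[term]], None
    # 2. fuzzy candidates -> ask the user, never silently pick one
    candidates = difflib.get_close_matches(term, GLOSSARY, n=2, cutoff=0.3)
    return None, candidates

def confirm(term, canonical):
    # 3. user confirmed the mapping; next time it resolves exactly
    SYNONYMS[term.lower().strip()] = canonical

definition, candidates = resolve("VIP customers")   # no match; candidates offered
confirm("VIP customers", "premium customers")       # user picks one
definition, _ = resolve("VIP customers")            # now deterministic
```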

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The difference between "approximately right" and "exactly right" is everything in business analytics.&lt;/p&gt;

&lt;p&gt;Investors expect precise numbers. Audits demand accurate counts. Regulatory reports have zero tolerance for approximation. Strategic decisions need reliable data.&lt;/p&gt;

&lt;p&gt;A system that gives you different answers to the same question isn't useful. It's dangerous. It erodes trust in data.&lt;/p&gt;

&lt;p&gt;The current approach - throw everything into RAG and hope the LLM figures it out - works for demos. It fails in production where precision matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Path Forward
&lt;/h2&gt;

&lt;p&gt;If you're building or buying AI analytics tools, ask:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you ensure deterministic query generation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "better prompts" or "fine-tuned models" or "improved embeddings," that's not enough. Those are incremental improvements to a fundamentally probabilistic system.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact matching on business concepts&lt;/li&gt;
&lt;li&gt;Structured metadata catalogs&lt;/li&gt;
&lt;li&gt;Clear decomposition of questions into components&lt;/li&gt;
&lt;li&gt;Explicit handling of ambiguity (asking users instead of guessing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What happens when I ask the same question twice?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you get different SQL queries, that's a red flag. The system should be stable and predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when a concept isn't defined?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system should admit it doesn't know rather than approximate. "I don't have a definition for X" is better than silently using the wrong definition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Probabilistic AI is powerful for creative tasks. But business analytics isn't creative - it's precision engineering.&lt;/p&gt;

&lt;p&gt;You can't build a deterministic system on a purely probabilistic foundation. Semantic search and RAG are tools, but they can't be the entire architecture.&lt;/p&gt;

&lt;p&gt;The solution is to use exact matching where precision matters (business concepts, metrics, filters) and use AI where flexibility helps (parsing questions, assembling queries, explaining results).&lt;/p&gt;

&lt;p&gt;Get the architecture right, and AI analytics becomes reliable. Get it wrong, and you build an impressive demo that nobody trusts in production.&lt;/p&gt;

&lt;p&gt;The choice is yours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business analytics demands deterministic accuracy, but most AI systems are fundamentally probabilistic&lt;/li&gt;
&lt;li&gt;Semantic search and RAG can return different results for the same question phrased differently&lt;/li&gt;
&lt;li&gt;Human analysts use exact matching for business concepts, not fuzzy approximation&lt;/li&gt;
&lt;li&gt;The solution: parse questions into components, exact-match against metadata, then assemble queries&lt;/li&gt;
&lt;li&gt;This requires well-documented business context (metrics catalogs, business glossaries)&lt;/li&gt;
&lt;li&gt;Hybrid approaches can handle vocabulary variation while maintaining deterministic core behavior&lt;/li&gt;
&lt;li&gt;The right architecture uses AI for flexibility while ensuring reproducible, accurate results&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sql</category>
      <category>deterministic</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Why AI Analytics Tools Are Solving the Wrong Problem</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sun, 07 Dec 2025 05:12:27 +0000</pubDate>
      <link>https://dev.to/gigapress/why-ai-analytics-tools-are-solving-the-wrong-problem-57da</link>
      <guid>https://dev.to/gigapress/why-ai-analytics-tools-are-solving-the-wrong-problem-57da</guid>
      <description>&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; The AI analytics industry is obsessed with building better query engines - using LLMs to turn natural language into SQL. But that's only 20% of the real challenge. The other 80%? Capturing and maintaining the massive amount of business context that exists in people's heads, undocumented meetings, and scattered wikis across five layers of your organization. Until we solve this unglamorous documentation problem, AI-powered analytics will remain impressive demos that struggle in production.&lt;/p&gt;




&lt;p&gt;Every "chat with your data" demo looks the same. Someone types "show me sales by region last month" into a sleek interface. An LLM generates a SQL query. Results appear. Everyone nods approvingly.&lt;/p&gt;

&lt;p&gt;Then you try to deploy it at your company.&lt;/p&gt;

&lt;p&gt;Suddenly, questions that should be simple become impossible. "What's our revenue from premium customers?" sounds straightforward until you realize three different teams define "premium" differently, and "revenue" means something else to finance than it does to operations.&lt;/p&gt;

&lt;p&gt;The demo worked because the demo had clean, simple data. Your reality is messier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Everyone Is Building
&lt;/h2&gt;

&lt;p&gt;Open any AI analytics product and you'll find roughly the same architecture under the hood.&lt;/p&gt;

&lt;p&gt;They connect to your database and pull the schema - tables, columns, data types, foreign keys. They use retrieval augmented generation (RAG) to find relevant metadata when you ask a question. An LLM takes that context and generates SQL. Execute the query, format the results, maybe generate some insights.&lt;/p&gt;

&lt;p&gt;For simple questions against well-designed databases, this works. "Total orders last week" or "top 10 customers by spend" - no problem.&lt;/p&gt;

&lt;p&gt;This is the 20% of analytics that everyone's racing to perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80% That Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The real work begins when you step outside the database schema and into the messy world of business meaning.&lt;/p&gt;

&lt;p&gt;Your challenges live in five distinct layers, each one adding complexity that no amount of clever prompting can solve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kw1d7yb3g0v43es5in.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kw1d7yb3g0v43es5in.png" alt="Hidden Work" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Business Definitions That Live Nowhere
&lt;/h3&gt;

&lt;p&gt;Your database has a &lt;code&gt;customers&lt;/code&gt; table with 2 million rows. Great. But which ones are "premium customers"? &lt;/p&gt;

&lt;p&gt;Is it customers who spend over $10K annually? Or is it the VIP tier from your loyalty program? Or maybe it's anyone with a dedicated account manager? Different teams give different answers.&lt;/p&gt;

&lt;p&gt;What about "high volume stores"? Is that top 10% by revenue? By transaction count? By square footage? The answer exists somewhere - maybe in a strategy deck from 18 months ago, maybe in someone's head who's been here for five years.&lt;/p&gt;

&lt;p&gt;"Peak hours" sounds objective until you learn that retail defines it as 10am-2pm and 5pm-8pm, but the warehouse team uses 7am-11am and 3pm-7pm.&lt;/p&gt;

&lt;p&gt;None of this lives in your database. It's business knowledge, and it needs to be documented in a structured way before any LLM can use it.&lt;/p&gt;
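
&lt;p&gt;What "documented in a structured way" might look like for the "peak hours" example - one term, explicitly scoped definitions, executable logic. The structure is illustrative, and the hour ranges are one reading of the definitions above:&lt;/p&gt;

```python
# A structured glossary entry: one term, explicitly scoped, with executable logic.
GLOSSARY_ENTRY = {
    "term": "peak hours",
    "definitions": [
        {"scope": "retail",
         "description": "10am-2pm and 5pm-8pm",
         "sql": "EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19)"},
        {"scope": "warehouse",
         "description": "7am-11am and 3pm-7pm",
         "sql": "EXTRACT(HOUR FROM event_time) IN (7,8,9,10,15,16,17,18)"},
    ],
}

def definition_for(entry, scope):
    # An ambiguous term is only usable once the caller names a scope.
    for d in entry["definitions"]:
        if d["scope"] == scope:
            return d["sql"]
    raise KeyError(f"No definition of {entry['term']!r} for scope {scope!r}")
```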

&lt;h3&gt;
  
  
  Layer 2: Metrics and Their Hidden Complexity
&lt;/h3&gt;

&lt;p&gt;Ask five people what "revenue" means and you might get five different answers.&lt;/p&gt;

&lt;p&gt;Does revenue include pending orders? What about returns? Is it before or after discounts? Do you count the shipping fee? What about tax? Is it when the order is placed, when it ships, or when payment clears?&lt;/p&gt;

&lt;p&gt;Every one of these questions has an answer somewhere in your organization. Probably in multiple places with multiple versions, some contradictory.&lt;/p&gt;

&lt;p&gt;Your analytics team might calculate "Monthly Recurring Revenue" one way. Finance calculates it differently for the board. The sales dashboard shows a third number because it excludes trials.&lt;/p&gt;

&lt;p&gt;Each metric needs a single source of truth - not just a definition in plain English, but the actual business logic. The conditions, the exclusions, the edge cases. All of it documented and maintained.&lt;/p&gt;
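
&lt;p&gt;A sketch of what a single-source-of-truth metric definition could look like. The field names are illustrative - real metric stores and semantic layers have richer schemas - but the idea is the same: the conditions and exclusions live in one machine-readable place, not in five people's heads:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One canonical, machine-readable definition per metric."""
    name: str
    description: str
    expression: str            # the actual calculation logic
    grain: str                 # what one row of the source data represents
    inclusions: tuple = ()
    exclusions: tuple = ()

mrr = MetricDefinition(
    name="monthly_recurring_revenue",
    description="Recurring subscription revenue, normalized to a month.",
    expression="SUM(plan_amount / billing_period_months)",
    grain="one row per active subscription",
    inclusions=("active paid subscriptions",),
    exclusions=("free trials", "one-time fees", "refunded periods"),
)
```

&lt;p&gt;When finance, sales, and analytics all read from this one definition, the "three different MRR numbers" problem disappears - or at least becomes an explicit, reviewable disagreement.&lt;/p&gt;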

&lt;h3&gt;
  
  
  Layer 3: Domain-Specific Business Rules
&lt;/h3&gt;

&lt;p&gt;Now add in the decisions each business unit makes to solve their specific problems.&lt;/p&gt;

&lt;p&gt;Marketing runs a campaign and excludes customers who purchased in the last 30 days. Operations has special handling for orders over $5,000. Customer service treats warranty claims differently than regular support tickets. Finance has revenue recognition rules that vary by product type.&lt;/p&gt;

&lt;p&gt;These rules get implemented to solve today's business problem. The people making these decisions aren't thinking about downstream analytics. They're not documenting for future AI systems. They're shipping features and closing deals.&lt;/p&gt;

&lt;p&gt;But every one of these decisions affects what the data means and how it should be interpreted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Technical Implementation Decisions
&lt;/h3&gt;

&lt;p&gt;The business requirements land with engineering. Now developers make their own choices.&lt;/p&gt;

&lt;p&gt;They build microservices, each owning its own data. They choose data structures that make sense for their use case. They optimize for performance, for their API contracts, for their deployment constraints.&lt;/p&gt;

&lt;p&gt;Is a customer ID a string or an integer? Are addresses stored as structured fields or freeform text? Is the timestamp in UTC or local time? Different services make different choices.&lt;/p&gt;
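&lt;p&gt;A tiny example of the reconciliation this forces on the data team later. Everything here is hypothetical (the ID width, the per-source offsets); the point is that each normalization encodes a decision some service once made:&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

def canonical_customer_id(raw) -> str:
    """One service stores integers, another zero-padded strings; pick one canonical form."""
    return f"{int(raw):010d}"

def to_utc(naive_local: datetime, utc_offset_hours: int) -> datetime:
    """Normalize a naive local timestamp (offset documented per source system) to aware UTC."""
    local = naive_local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)
```

&lt;p&gt;None of this is hard code to write. What is hard is knowing, per source, which convention was chosen, and that knowledge rarely lives anywhere queryable.&lt;/p&gt;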

&lt;p&gt;These aren't wrong choices - they're pragmatic engineering decisions. But data becomes a byproduct of operations, not a first-class concern.&lt;/p&gt;

&lt;p&gt;And again, most of these decisions aren't documented anywhere that a data system can access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: The Data Platform Transformation Layer
&lt;/h3&gt;

&lt;p&gt;Finally, the data team pulls everything together. They extract data from dozens of sources, cleanse it, standardize it, transform it.&lt;/p&gt;

&lt;p&gt;They create &lt;code&gt;dim_customer&lt;/code&gt; by joining six different customer tables. They build &lt;code&gt;fact_orders&lt;/code&gt; by combining order data with returns, refunds, and adjustments. They calculate derived metrics like &lt;code&gt;customer_lifetime_value&lt;/code&gt; using complex logic.&lt;/p&gt;

&lt;p&gt;Every table, every transformation, every derived field represents decisions. What business logic is embedded in this ETL job? Why was this data transformed this way? What assumptions were made? What edge cases are handled?&lt;/p&gt;

&lt;p&gt;Without documentation, this knowledge lives in the data engineer's head or buried in hundreds of lines of SQL code.&lt;/p&gt;
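&lt;p&gt;One lightweight countermeasure is to put the "why" next to the SQL itself. Here is a sketch using an in-memory SQLite database as a stand-in warehouse; the source tables and the join rule are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for a warehouse; crm/billing are illustrative source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_customers (customer_id TEXT, name TEXT);
    CREATE TABLE billing_accounts (customer_id TEXT, plan TEXT);
    INSERT INTO crm_customers VALUES ('c1', 'Acme'), ('c2', 'Globex');
    INSERT INTO billing_accounts VALUES ('c1', 'premium');

    -- dim_customer: one row per CRM customer. Billing is optional,
    -- so LEFT JOIN (business decision: trials have no billing account yet).
    CREATE TABLE dim_customer AS
    SELECT c.customer_id, c.name, COALESCE(b.plan, 'trial') AS plan
    FROM crm_customers c
    LEFT JOIN billing_accounts b ON b.customer_id = c.customer_id;
""")
rows = conn.execute(
    "SELECT customer_id, plan FROM dim_customer ORDER BY customer_id"
).fetchall()
```

&lt;p&gt;The comment is the documentation: six months later, the LEFT JOIN reads as a recorded business decision rather than an accident.&lt;/p&gt;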

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvccitw80iz2s6gsaih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvccitw80iz2s6gsaih.png" alt="5 Layers deep" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Documentation Debt Crisis
&lt;/h2&gt;

&lt;p&gt;Add it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business definitions for every domain term&lt;/li&gt;
&lt;li&gt;Precise logic for every metric and KPI&lt;/li&gt;
&lt;li&gt;Rules and exclusions from every business unit&lt;/li&gt;
&lt;li&gt;Technical decisions from every engineering team&lt;/li&gt;
&lt;li&gt;Transformation logic from every data pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the context an LLM needs to generate correct queries for real business questions.&lt;/p&gt;

&lt;p&gt;And almost none of it is documented in a way machines can understand.&lt;/p&gt;

&lt;p&gt;This documentation problem is the 80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is So Hard
&lt;/h2&gt;

&lt;p&gt;Documentation is manual work. Unglamorous, time-consuming, never-ending manual work.&lt;/p&gt;

&lt;p&gt;Business definitions change. A "premium customer" today might be redefined next quarter. Metrics evolve as the business grows. Rules get updated when regulations change. The data platform refactors tables and schemas.&lt;/p&gt;

&lt;p&gt;Static documentation becomes stale the moment it's written. You need a living system that evolves with the business.&lt;/p&gt;

&lt;p&gt;But who owns this? The business teams are focused on business problems. Engineering is shipping features. The data team is drowning in pipeline maintenance. Nobody has "document everything for future AI" in their job description.&lt;/p&gt;

&lt;p&gt;Even worse, this problem compounds. Every new business rule, every new data source, every new transformation adds to the documentation debt. The gap between what your AI needs to know and what's actually documented grows wider every sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Opportunity
&lt;/h2&gt;

&lt;p&gt;RAG retrieval algorithms, query optimization, result formatting - these are interesting technical problems. They make for great blog posts and conference talks.&lt;/p&gt;

&lt;p&gt;But they're the easy 20%.&lt;/p&gt;

&lt;p&gt;The companies that will win in AI-powered analytics aren't the ones with the smartest query generation. They're the ones who solve documentation.&lt;/p&gt;

&lt;p&gt;What if business context were captured automatically as a byproduct of work instead of as extra work? What if implementing a business rule automatically created the metadata an AI needs? What if refactoring a data pipeline updated the documentation simultaneously?&lt;/p&gt;

&lt;p&gt;What if documentation were a living system that breathes with your business instead of a static artifact that goes stale?&lt;/p&gt;

&lt;p&gt;That's the hard problem. That's the valuable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;You can't solve all five layers overnight. But you can start making documentation a first-class concern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For business teams:&lt;/strong&gt; Maintain a single source of truth for business definitions. Not in slides or wikis, but in a structured system that can be queried programmatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For product teams:&lt;/strong&gt; When you implement business logic, document it where the data team can find it. Make "how will this be analyzed?" part of the design conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering teams:&lt;/strong&gt; Treat data as a first-class output, not a byproduct. Document your implementation decisions. Make your data contracts explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For data teams:&lt;/strong&gt; Don't just transform data - document why. Capture the business logic embedded in your pipelines. Make your schemas self-describing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For everyone:&lt;/strong&gt; Stop treating documentation as a chore to avoid. It's the foundation that makes everything else possible.&lt;/p&gt;
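&lt;p&gt;To make the "queried programmatically" part concrete, here is a deliberately small sketch: a glossary whose entries (borrowed from examples earlier in this post, with an invented premium-customer threshold) can be surfaced into an LLM prompt. The matching is naive substring lookup, purely for illustration:&lt;/p&gt;

```python
# Hypothetical glossary entries; real ones would live in a governed store.
GLOSSARY = {
    "premium customer": "annual contract value above $50k AND active in the last 90 days",
    "peak hours": "retail: 10:00-14:00 and 17:00-20:00; warehouse: 07:00-11:00 and 15:00-19:00",
}

def context_for(question: str) -> str:
    """Return only the definitions a question mentions, ready to prepend to a prompt."""
    q = question.lower()
    hits = [f"{term} = {meaning}" for term, meaning in GLOSSARY.items() if term in q]
    return "\n".join(hits)
```

&lt;p&gt;Even this toy version changes the failure mode: the LLM either receives your organization's definition or visibly receives nothing, instead of silently guessing.&lt;/p&gt;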

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Better LLMs won't solve AI analytics. Better RAG won't solve it. Better query optimization won't solve it.&lt;/p&gt;

&lt;p&gt;The bottleneck isn't technical capability. It's business context.&lt;/p&gt;

&lt;p&gt;You can't assemble an answer from context you don't have. You can't generate the right query if you don't know what "premium customers" means in your organization.&lt;/p&gt;

&lt;p&gt;The 20% is getting commoditized fast. Every AI analytics vendor has roughly the same query generation capability.&lt;/p&gt;

&lt;p&gt;The real differentiator is the 80% - capturing, maintaining, and surfacing the business knowledge that turns database schemas into business intelligence.&lt;/p&gt;

&lt;p&gt;That's where the hard work is. That's where the value is.&lt;/p&gt;

&lt;p&gt;And that's the problem that's still unsolved.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI analytics tools focus on query generation (20%) while the real challenge is business context documentation (80%)&lt;/li&gt;
&lt;li&gt;Business knowledge exists across five layers: business definitions, metrics/KPIs, domain rules, technical implementation, and data transformations&lt;/li&gt;
&lt;li&gt;Each layer contains critical context that isn't captured in database schemas or code&lt;/li&gt;
&lt;li&gt;Static documentation fails because business context changes constantly&lt;/li&gt;
&lt;li&gt;The solution requires making documentation a byproduct of regular work, not extra work&lt;/li&gt;
&lt;li&gt;Whoever builds systems that capture and maintain this context automatically will win the AI analytics race&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>nlp</category>
      <category>sql</category>
    </item>
    <item>
      <title>Why Your Enterprise Data Platform Is No Longer Just for Analytics</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Tue, 18 Nov 2025 04:06:02 +0000</pubDate>
      <link>https://dev.to/gigapress/why-your-enterprise-data-platform-is-no-longer-just-for-analytics-1n1i</link>
      <guid>https://dev.to/gigapress/why-your-enterprise-data-platform-is-no-longer-just-for-analytics-1n1i</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The relationship between data and applications is undergoing a fundamental shift. For decades, we've moved data to applications. Now, we're moving applications to data. This isn't just an architectural preference—it's becoming a necessity as businesses demand richer context, faster insights, and real-time operations. Here's what's driving this change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context is king&lt;/strong&gt;: Connected data provides multidimensional insights that isolated data simply cannot match&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The old pattern is breaking&lt;/strong&gt;: Extracting data to specialized tools creates silos, brittleness, and duplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The line has blurred&lt;/strong&gt;: Enterprise Data Platforms are no longer just analytical systems—they're becoming operational platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three critical shifts&lt;/strong&gt;: Data latency must drop to seconds, query latency to sub-seconds, and availability must reach production-grade standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The solution space&lt;/strong&gt;: Event-driven architectures, operational databases like serverless Postgres, and treating your EDP as a P1 system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best of both worlds&lt;/strong&gt;: Leverage specialized analytics capabilities by bringing them to your data, not moving data to them—every data movement step adds failure points, costs, and complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Power of Connected Data: A Tale of Two Dashboards
&lt;/h2&gt;

&lt;p&gt;Consider how Uber thinks about a driver who just completed a ride. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without connected data&lt;/strong&gt;: "Driver #47291 completed an 18-minute ride. Rating: 5 stars."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With connected data&lt;/strong&gt;: "Driver #47291 completed an 18-minute ride during rush hour in San Francisco. Has a 4.92 rating over 2,847 trips, typically works evenings, now in a surge zone. The passenger is gold-status but gave 3 stars today (usually gives 5). Heavy rain—this driver's cancellation rate jumps 8% in rain."&lt;/p&gt;

&lt;p&gt;Same event, different universe. The first tells you what happened. The second tells you why, predicts what might happen next, and suggests what action to take. When you view information through multiple dimensions—user behavior, location patterns, time series, weather, operational metrics—you move from reporting to insight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Enterprise Data Platforms Became the Center of Gravity
&lt;/h2&gt;

&lt;p&gt;In large organizations, data from everywhere converges in a central Enterprise Data Platform: CRM systems, transaction data, product telemetry, marketing attribution, customer service interactions.&lt;/p&gt;

&lt;p&gt;This wasn't arbitrary. Connecting data is hard, and doing it repeatedly across different tools is wasteful. The EDP became the natural convergence point where data gets cleaned once, relationships between different sources get mapped, historical context accumulates, and governance gets enforced. &lt;/p&gt;

&lt;p&gt;When you need to understand customer lifetime value, you need purchase history, support interactions, usage patterns, and marketing touchpoints. These don't naturally live together—they get connected in the EDP. This made it perfect for contextual analytics: not just because data lives there, but because the relationships and ability to view information from multiple angles exist in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Playbook: Extract, Load, Specialize
&lt;/h2&gt;

&lt;p&gt;For years, the workflow was straightforward. When teams wanted to improve customer experience, run marketing campaigns, or optimize products, they'd: identify needed data from the EDP, procure a specialized tool (Qualtrics for customer experience, Segment for customer data, Hightouch for reverse ETL), build pipelines to extract and load data, then let the specialized tool work its magic.&lt;/p&gt;

&lt;p&gt;Marketing got Braze. Customer success got Gainsight. Product got Amplitude. Each loaded with curated enterprise data. &lt;/p&gt;

&lt;p&gt;This made sense—these platforms had years of domain expertise and optimized databases for specific use cases. But cracks started showing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Paradox: More is Better, Until It Isn't
&lt;/h2&gt;

&lt;p&gt;Every specialized tool works better with richer data. Your NPS scores don't just tell you satisfaction dropped—you want to know it dropped specifically among enterprise customers with multiple open support tickets who are coming up for renewal.&lt;/p&gt;

&lt;p&gt;In theory, you just send more data. In practice, this creates three problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, you're duplicating your entire dataset&lt;/strong&gt; across multiple tools. Your customer data lives in the EDP, in marketing, in customer success, in product analytics. Each copy needs syncing. Each represents another data quality surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, you're creating brittle pipelines.&lt;/strong&gt; Different data models, different APIs, different limitations. Each pipeline is a failure point needing maintenance as schemas evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, you're siloing insights.&lt;/strong&gt; Marketing sees one version of the customer, product sees another, support a third. The connected data you built in the EDP gets disconnected as it flows into specialized tools.&lt;/p&gt;

&lt;p&gt;This becomes an anti-pattern—working against what made the EDP valuable: keeping data connected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Inversion: Bring Applications to the Data
&lt;/h2&gt;

&lt;p&gt;If moving data to applications creates these problems, what if we inverted the pattern? Instead of extracting data from the EDP to specialized tools, build those capabilities on top of the EDP itself.&lt;/p&gt;

&lt;p&gt;When the most contextual, connected data already lives in your Enterprise Data Platform, why ship it elsewhere? Why not build your customer experience dashboards, your marketing segmentation engines, your operational applications directly on the EDP?&lt;/p&gt;

&lt;p&gt;This is where most people raise an objection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, Isn't That an Anti-Pattern?
&lt;/h2&gt;

&lt;p&gt;For decades, we've been taught that analytical systems and operational systems are fundamentally different. Analytics platforms—data warehouses, lakes, EDPs—handle complex queries over large datasets, optimized for throughput. Operational systems—transactional databases—handle fast queries on specific records, optimized for latency.&lt;/p&gt;

&lt;p&gt;You wouldn't run e-commerce checkout on a data warehouse. You wouldn't build real-time fraud detection on overnight batch jobs.&lt;/p&gt;

&lt;p&gt;But here's what changed: the line between analytical and operational has blurred dramatically over the past five years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Line Has Blurred
&lt;/h2&gt;

&lt;p&gt;Applications have become analytics-hungry. A decade ago, an operational application might look up a customer record. Today, that same application needs to compute lifetime value in real-time, analyze 90 days of behavior, compare against historical patterns, and aggregate data across multiple dimensions.&lt;/p&gt;

&lt;p&gt;Meanwhile, data freshness requirements have compressed. Marketing campaigns that used to refresh daily now need hourly or minute-level updates. Customer health scores calculated overnight now need to reflect recent interactions within minutes.&lt;/p&gt;

&lt;p&gt;And context requirements have exploded. It's no longer enough to know what a customer bought—you need what they viewed but didn't buy, what promotions they've seen, what support issues they've had, and what predictive models say about their churn likelihood.&lt;/p&gt;

&lt;p&gt;This creates a new reality: operational applications need the rich, connected context of the EDP, but with operational characteristics—low latency, high availability, and fresh data.&lt;/p&gt;

&lt;p&gt;EDPs can no longer be P2 or P3 systems that indirectly support the business. They're becoming P1 systems that power the business directly, in real time, at the point of customer interaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Critical Shifts
&lt;/h2&gt;

&lt;p&gt;For EDPs to power operational applications, three characteristics must change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Latency: From Hours to Seconds
&lt;/h3&gt;

&lt;p&gt;Traditional data pipelines moved data in batches—often daily, sometimes hourly, occasionally every 15 minutes if you were pushing it. This worked fine when insights were consumed the next morning in dashboard reviews.&lt;/p&gt;

&lt;p&gt;It doesn't work when you're trying to trigger a marketing campaign based on a customer's action taken 30 seconds ago. It doesn't work when flagging potentially fraudulent transactions while they're still pending. It doesn't work when customer service needs to see what happened during the call that just ended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Event-driven architecture, end to end. This isn't just about having a message queue somewhere. It means rethinking how data flows through your entire enterprise. When a customer completes a purchase, that event should propagate through your systems in seconds, not hours.&lt;/p&gt;

&lt;p&gt;This is the architecture that makes an enterprise truly data-driven—not from yesterday's data, but from what's happening right now. Technologies like Kafka, Debezium for change data capture, and streaming platforms become foundational, not optional.&lt;/p&gt;
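&lt;p&gt;The shape of event-driven propagation can be shown without any infrastructure. The bus below is an in-process stand-in for a platform like Kafka (the topic name and handlers are invented); the behavior to notice is that subscribers see the event at publish time, not at the next batch window:&lt;/p&gt;

```python
from collections import defaultdict

class EventBus:
    """In-process stand-in for a streaming platform such as Kafka."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fan out immediately: every downstream consumer reacts now,
        # not when tonight's batch job runs.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
campaign_queue, fraud_queue = [], []
bus.subscribe("orders.completed", campaign_queue.append)  # marketing trigger
bus.subscribe("orders.completed", fraud_queue.append)     # fraud check
bus.publish("orders.completed", {"order_id": 1, "total": 99.0})
```

&lt;p&gt;Real systems add durability, ordering, and replay on top of this shape; that is what Kafka and change-data-capture tools like Debezium provide.&lt;/p&gt;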

&lt;h3&gt;
  
  
  2. Query Latency: From Seconds to Sub-Seconds
&lt;/h3&gt;

&lt;p&gt;Users won't wait three seconds for a dashboard to load. They definitely won't wait 30 seconds for a page to render. Applications need to respond in hundreds of milliseconds, not seconds.&lt;/p&gt;

&lt;p&gt;But here's the fundamental issue: modern data warehouses and lakes are built on storage-compute separation. This isn't a bug—it's an intentional design choice that provides enormous benefits for analytical workloads. You can scale storage and compute independently. You can spin up compute when needed and shut it down when you don't.&lt;/p&gt;

&lt;p&gt;However, this separation introduces a first-principles problem: when you run a query, data needs to move from remote storage to compute nodes. Even with optimized formats like Parquet, even with clever caching—data still needs to travel. For analytical queries over large datasets, a few seconds is acceptable. For operational APIs, it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for operational workloads&lt;/strong&gt;: Operational applications don't make single queries. They chain hundreds of API calls. A single page load might trigger dozens of queries. Real-time business decisions—approve this transaction, show this offer, flag this behavior—can't wait for data to move from storage to compute. They need millisecond responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Relational databases, where compute and storage live together. This is where solutions like Neon and serverless Postgres come into play.&lt;/p&gt;

&lt;p&gt;The pattern: Keep your rich, historical, connected data in the EDP where it belongs—that's still the system of record. But sync the operational subset—the data that needs to power real-time applications—into a relational database optimized for low-latency queries.&lt;/p&gt;

&lt;p&gt;This operational database becomes your fast access layer, holding the most frequently accessed data: current customer states, recent transactions, active orders. Everything else—full history, rarely accessed dimensions, large analytical datasets—stays in the EDP and is linked when needed.&lt;/p&gt;

&lt;p&gt;Why relational databases? When compute and storage are together, query latency drops dramatically. No network hop to fetch data. Indexes live next to the data. Query planners optimize on actual data locality.&lt;/p&gt;

&lt;p&gt;Why serverless Postgres? It solves the operational challenges that traditionally made databases hard to scale—automatic scaling, no provisioning for peak load—while maintaining the low-latency benefits of the relational model.&lt;/p&gt;
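&lt;p&gt;The sync itself can be sketched in a few lines. Both databases below are in-memory SQLite stand-ins (for the EDP and for the Postgres operational layer, respectively), and the table shapes are invented; the pattern is "latest state per entity goes to the fast layer, full history stays put":&lt;/p&gt;

```python
import sqlite3

edp = sqlite3.connect(":memory:")  # stand-in for the warehouse/EDP (system of record)
ops = sqlite3.connect(":memory:")  # stand-in for the serverless Postgres layer

edp.executescript("""
    CREATE TABLE customer_history (customer_id TEXT, status TEXT, updated_at TEXT);
    INSERT INTO customer_history VALUES
      ('c1', 'active',     '2025-11-01'),
      ('c1', 'churn_risk', '2025-11-15'),
      ('c2', 'active',     '2025-10-02');
""")
ops.execute("CREATE TABLE customer_current (customer_id TEXT PRIMARY KEY, status TEXT)")

# Sync only the operational subset: the latest state per customer.
# The full history never leaves the EDP.
latest = edp.execute("""
    SELECT h.customer_id, h.status
    FROM customer_history h
    JOIN (SELECT customer_id, MAX(updated_at) AS mx
          FROM customer_history GROUP BY customer_id) m
      ON m.customer_id = h.customer_id AND m.mx = h.updated_at
""").fetchall()
ops.executemany("INSERT OR REPLACE INTO customer_current VALUES (?, ?)", latest)

current = dict(ops.execute("SELECT customer_id, status FROM customer_current").fetchall())
```

&lt;p&gt;In production this copy step is what change data capture or a streaming pipeline automates; the query shape stays the same.&lt;/p&gt;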

&lt;h3&gt;
  
  
  3. High Availability: From "It's Down" to "It's Always On"
&lt;/h3&gt;

&lt;p&gt;When your data platform is used for monthly reports and strategic planning, a few hours of downtime is annoying but not catastrophic. When your data platform powers customer-facing applications, every minute of downtime directly impacts revenue.&lt;/p&gt;

&lt;p&gt;This means treating your EDP—or at least the operational layer sitting on top of it—with the same availability standards you'd apply to any production application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Active-active configurations, multi-region deployments, automatic failover. At minimum, the operational database layer needs production-grade infrastructure.&lt;/p&gt;

&lt;p&gt;This shift is cultural as much as technical. It means your data team needs to adopt DevOps practices. It means SLAs matter. It means on-call rotations become part of data platform management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;None of these ideas are entirely new. People have been talking about operational analytics for years. So why is this pattern becoming critical now?&lt;/p&gt;

&lt;p&gt;Several trends have converged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost of computation has dropped dramatically.&lt;/strong&gt; What was prohibitively expensive five years ago—maintaining real-time data pipelines, running operational databases on large datasets—is now economically feasible. Serverless architectures have made it even more accessible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive pressure has increased.&lt;/strong&gt; Customers expect personalization, immediate responses, and consistency across channels. Companies that can deliver these experiences with richer context have a meaningful advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technology has matured.&lt;/strong&gt; Event streaming platforms are production-ready. Change data capture tools reliably sync databases. Serverless databases handle operational workloads without traditional overhead. The pieces needed to build this architecture actually work now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data teams have the skills.&lt;/strong&gt; A generation of data engineers who grew up building real-time pipelines and thinking about data as something that flows rather than sits has moved into leadership positions. The organizational knowledge exists to execute this pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what this looks like in practice:&lt;/p&gt;

&lt;p&gt;Your Enterprise Data Platform remains the system of record—the place where data is cleaned, connected, and stored historically. Data flows into it through event-driven pipelines that capture changes as they happen, not in overnight batches.&lt;/p&gt;

&lt;p&gt;On top of the EDP, an operational layer provides fast, consistent access to the subset of data needed for real-time applications. This might be a serverless Postgres instance that's automatically synced with your data platform, maintaining operational data with sub-second query latency.&lt;/p&gt;

&lt;p&gt;Applications—whether internal tools, customer-facing features, or analytical dashboards—query the operational layer directly. They get the rich context of the EDP with the performance characteristics of an operational database.&lt;/p&gt;

&lt;p&gt;The operational layer is treated as a P1 system: multi-region if needed, highly available, monitored like any production service.&lt;/p&gt;

&lt;p&gt;Data flows through this architecture in near real-time. An event happens in a source system, gets captured and streamed to the EDP, triggers processing and transformation, and updates the operational layer—all within seconds or minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables
&lt;/h2&gt;

&lt;p&gt;When you build applications on top of your connected data rather than extracting subsets to specialized tools, several things become possible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richer insights.&lt;/strong&gt; You're not limited to the subset of data you could feasibly extract and load. Your application has access to the full context of the EDP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster iteration.&lt;/strong&gt; Adding a new dimension to your analysis doesn't require building a new pipeline and waiting for data to load. It's already there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced duplication.&lt;/strong&gt; Data lives in fewer places. Updates happen in one location. Data quality issues are fixed once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better cross-functional work.&lt;/strong&gt; When everyone is building on the same data foundation, insights are easier to share. Marketing and product aren't looking at different versions of customer behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower operational overhead.&lt;/strong&gt; Fewer pipelines to maintain, fewer data synchronization issues to debug, fewer copies of data to govern and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-offs
&lt;/h2&gt;

&lt;p&gt;You're trading specialized tools' optimizations for platform flexibility. You need teams capable of building applications and organizational buy-in to treat data platforms as production infrastructure. But for many organizations, the benefits—flexibility, reduced duplication, faster iteration, contextual insights—justify the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  But What About Specialized Capabilities?
&lt;/h2&gt;

&lt;p&gt;Here's a legitimate question: what about all those cutting-edge features that specialized platforms offer? Qualtrics has StatsIQ and TextIQ—sophisticated analytics capabilities built over years. Segment has identity resolution algorithms refined across thousands of companies.&lt;/p&gt;

&lt;p&gt;If we're building on our EDP instead of using these tools, are we throwing away innovation? Are we asking data teams to rebuild complex models from scratch?&lt;/p&gt;

&lt;p&gt;Not necessarily. The key insight: you don't need to move data to leverage specialized capabilities. Bring those capabilities to where data lives, or let them operate on your data in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Emerging Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;First, bring capabilities to the EDP.&lt;/strong&gt; This is already happening. Many specialized analytics capabilities are becoming available as standalone services or libraries that operate directly on data platforms. Modern EDPs support user-defined functions, external ML service calls, and integration with specialized processing engines. You can invoke sentiment analysis APIs on text stored in your EDP. You can run statistical models using libraries that operate directly on your warehouse tables.&lt;/p&gt;
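&lt;p&gt;User-defined functions are the simplest version of "bring the capability to the data." A toy sketch with SQLite's &lt;code&gt;create_function&lt;/code&gt;: the word-counting "model" below is a placeholder for a real scoring service, and the table is invented:&lt;/p&gt;

```python
import sqlite3

def sentiment(text: str) -> int:
    """Toy placeholder for an external sentiment model, invoked on data in place."""
    positive = {"great", "love", "fast"}
    negative = {"slow", "broken", "refund"}
    words = [w.strip(",.!") for w in text.lower().split()]
    return sum(w in positive for w in words) - sum(w in negative for w in words)

conn = sqlite3.connect(":memory:")
conn.create_function("sentiment", 1, sentiment)  # register the UDF with the engine
conn.execute("CREATE TABLE feedback (comment TEXT)")
conn.executemany("INSERT INTO feedback VALUES (?)", [
    ("great product, fast shipping",),
    ("app is broken, want a refund",),
])
scores = [row[0] for row in conn.execute("SELECT sentiment(comment) FROM feedback")]
```

&lt;p&gt;The data never moves; the function comes to it. The warehouse-scale versions of this idea are external and remote functions.&lt;/p&gt;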

&lt;p&gt;&lt;strong&gt;Second, let specialized tools operate in place.&lt;/strong&gt; Instead of extracting data into Qualtrics, imagine Qualtrics connecting directly to your EDP and running its StatsIQ algorithms on your data where it sits. This "compute on data in place" trend is accelerating—it's the core idea behind data clean rooms, query federation, and interoperability standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Operating In Place Wins
&lt;/h3&gt;

&lt;p&gt;Every time you add a step to move data, you introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failure points&lt;/strong&gt;: Another pipeline that can break, another synchronization that can fall out of date&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs&lt;/strong&gt;: Data egress charges, storage duplication, compute for transformation and loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time to extract, transform, and load before insights are available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Another system to monitor, another set of credentials to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: More copies of sensitive data, more surfaces for security issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most successful solutions operate on existing data in place. Think about dbt's success—it transforms data where it sits. Or how BI tools evolved from requiring extracts to connecting directly to warehouses. The winning pattern is always "work with data in place."&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Requires
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From vendors&lt;/strong&gt;: APIs that operate on external data, federated query engines, embedded analytics libraries. Some will resist—their business models depend on data lock-in. But those that embrace this will win in a world where enterprises are consolidating their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From data platforms&lt;/strong&gt;: Rich API layers, fine-grained access control, performance for external queries, support for specialized compute through user-defined functions and external procedures. Features like Snowflake's external functions, Databricks' ML capabilities, and BigQuery's remote functions are steps in this direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Emerging Architecture
&lt;/h3&gt;

&lt;p&gt;Your Enterprise Data Platform holds your connected, contextual data. Your operational layer provides fast access for real-time applications. And specialized analytics capabilities—whether built in-house or licensed from vendors—operate on this data without requiring it to be moved.&lt;/p&gt;

&lt;p&gt;You get the rich context and operational efficiency of centralized, connected data. And you get the specialized capabilities of best-in-class tools. Without the brittleness, cost, and complexity of moving data between systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Forward
&lt;/h2&gt;

&lt;p&gt;The shift from "move data to applications" to "move applications to data" reflects how central data has become. The line between operational and analytical systems has blurred.&lt;/p&gt;

&lt;p&gt;Organizations adapting to this—event-driven architectures, operational databases near data platforms, treating EDPs as P1 systems—will act on richer context, respond faster, and deliver better experiences. Those maintaining old extraction patterns will fight complex pipelines and synchronization issues.&lt;/p&gt;

&lt;p&gt;The technology exists. The question is organizational readiness.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The future of enterprise data isn't choosing between analytical power and operational performance. It's architectures delivering both.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>True future of Databricks Lakebase</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Thu, 19 Jun 2025 03:17:10 +0000</pubDate>
      <link>https://dev.to/gigapress/true-future-of-databricks-lakebase-13cl</link>
      <guid>https://dev.to/gigapress/true-future-of-databricks-lakebase-13cl</guid>
      <description>&lt;p&gt;When Databricks announced Lakebase, most people dismissed it as just another product. Even Databricks markets it as "the backend for AI Agents and Data Apps."&lt;/p&gt;

&lt;p&gt;This messaging puzzles me. It's the same pattern they followed with Delta Lake in 2019, positioning it as "bringing reliability to Big Data." But Delta Lake was actually much simpler: a transactional layer on top of an immutable object store. That's it. Yet this simple concept solved a massive architectural challenge—enabling storage and compute separation for data warehouses at scale.&lt;/p&gt;

&lt;p&gt;Storage and compute separation became the foundation for everything that followed: data sharing, unified storage formats, multiple query engines. But separation creates a problem: latency. For analytical workloads, this latency is manageable. For transactional and operational analytics? It's a dealbreaker.&lt;/p&gt;

&lt;p&gt;Here's why this matters: the traditional divide between applications and analytics is disappearing. As businesses demand data-driven decisions in real-time operations, analytical platforms are being pulled into the critical path of business processes. Data platforms are no longer backend systems—they're front and center.&lt;br&gt;
This shift demands sub-second response times that storage and compute separation simply can't deliver. You need a fused engine. You need a database.&lt;/p&gt;

&lt;p&gt;That's what Lakebase really is.&lt;/p&gt;

&lt;p&gt;But consider the bigger picture. In most enterprises, data follows a predictable journey: applications generate data, it's ingested into a central data lake, curated, and used for analytics. Then those insights flow back to business applications for decision-making.&lt;br&gt;
What if you could collapse this entire cycle? What if applications and analytics ran on the same backend?&lt;/p&gt;

&lt;p&gt;Object stores and Delta Lake can't solve this. You need a true database for applications. But the heavy lifting of moving data between systems can be simplified—what Databricks calls "zero-ETL." It's not actually zero-ETL; someone still runs the ETL. Databricks just handles it for you.&lt;/p&gt;

&lt;p&gt;This positions Databricks as something entirely different. They're no longer just an analytics company. They're becoming the AWS of enterprise data—where your applications run on Databricks, your analytics run on Databricks, and low-code solutions make it accessible to business users.&lt;/p&gt;

&lt;p&gt;The data producers become the ones running end-to-end pipelines and analytics.&lt;/p&gt;

&lt;p&gt;But this approach creates its own set of challenges. Let's explore those next.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>lakebase</category>
      <category>olap</category>
      <category>oltp</category>
    </item>
    <item>
      <title>What are Object Stores - simplified</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sat, 13 Jan 2024 22:40:03 +0000</pubDate>
      <link>https://dev.to/gigapress/what-are-object-stores-simplified-1ge3</link>
      <guid>https://dev.to/gigapress/what-are-object-stores-simplified-1ge3</guid>
      <description>&lt;p&gt;[Originally published in medium]&lt;/p&gt;

&lt;p&gt;We all know that object stores are the backbone of the internet in the cloud era, and we also know they have certain behavioral characteristics: scalability, immutability, eventual consistency.&lt;/p&gt;

&lt;p&gt;But do we know why object stores behave this way? What are they, actually? Let’s find out.&lt;/p&gt;

&lt;p&gt;Since I started using AWS S3 ten years ago, I’ve been amazed by everything that can be done with it, and by all the incidents that showed how dependent the internet is on S3. The other thing that made me curious: there is no official architecture documentation for AWS S3. It’s kind of a “secret”, I guess.&lt;/p&gt;

&lt;p&gt;First, let’s find out what an Object Store is. Per Wikipedia, below is the definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Object storage (also known as object-based storage[1]) is a computer data storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To understand that definition better, we first need to understand what Block Storage is. I’m not going to pull the definition from Wikipedia for this one; let me explain it in simple English.&lt;/p&gt;

&lt;p&gt;What is Block Storage?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Block Storage is the fundamental storage mechanism of computer systems, where data is stored in “blocks”. The key here is that a central system/module keeps track of which data is stored in which specific block.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhqlrh2uphhs38qenb75.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhqlrh2uphhs38qenb75.jpeg" alt="Image description" width="168" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What that means is: when a file is stored, the filesystem breaks it into blocks and stores the data across multiple blocks. Only the filesystem knows which part of the file is stored in which block, so any change to the file has to go through the filesystem. All of this information about the blocks is the “metadata”, and it is centrally stored and managed by the filesystem.&lt;/p&gt;
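
&lt;p&gt;To make this concrete, here is a toy Python sketch (my own oversimplification, not how any real filesystem is implemented) of that idea: a central metadata table records which blocks hold each file’s bytes, and every read or write has to consult it:&lt;/p&gt;

```python
# Toy sketch of block storage (hypothetical, heavily simplified): a central
# metadata table maps each file to the blocks that hold its bytes.
BLOCK_SIZE = 4  # real filesystems use e.g. 4 KiB blocks

class ToyFilesystem:
    def __init__(self):
        self.blocks = {}    # block_id -> bytes stored in that block
        self.metadata = {}  # filename -> [block_id, ...] (the central metadata)
        self._next_id = 0

    def write(self, name: str, data: bytes) -> None:
        ids = []
        for i in range(0, len(data), BLOCK_SIZE):   # split the file into blocks
            self.blocks[self._next_id] = data[i:i + BLOCK_SIZE]
            ids.append(self._next_id)
            self._next_id += 1
        self.metadata[name] = ids  # only the filesystem knows this mapping

    def read(self, name: str) -> bytes:
        # reassemble the file by following the metadata
        return b"".join(self.blocks[b] for b in self.metadata[name])

fs = ToyFilesystem()
fs.write("report.txt", b"hello world!")
print(fs.metadata["report.txt"])  # [0, 1, 2]: which blocks hold the file
print(fs.read("report.txt"))      # b'hello world!'
```

&lt;p&gt;Notice that every operation goes through &lt;code&gt;self.metadata&lt;/code&gt;; that single table is both what makes the answers precise and where the bottleneck lives.&lt;/p&gt;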

&lt;p&gt;For the sake of simplicity, I’m not going to differentiate File storage from Block storage. They are related to each other, but both are fundamentally different from Object storage.&lt;/p&gt;

&lt;p&gt;To understand it a little better, let’s try an analogy.&lt;/p&gt;

&lt;p&gt;Note: some of the operations and mechanics of databases and object storage have been oversimplified for easier understanding.&lt;/p&gt;

&lt;p&gt;Block Storage Analogy: storage units based on item type, with one or more store keepers and no direct access to storage details.&lt;br&gt;
In this hypothetical analogy, imagine multiple storage units, like a book storage unit, a furniture storage unit, an electronics storage unit, etc., where people store their items. You bring your items in a box (labelled with your name and address) and hand it over to the store keeper at the appropriate storage unit.&lt;/p&gt;

&lt;p&gt;The store keepers only understand and handle the items their unit is designed for: the store keeper at the book storage unit can store books and nothing else. Same with the store keeper at the electronics storage unit, who can only store electronics.&lt;/p&gt;

&lt;p&gt;How and where exactly each of these items is stored inside the storage unit is a black box to you. That information is known only to the store keeper and is kept in a place accessible only by the store keeper. The store keeper needs to know what the contents of the box are. He/she records the details of each customer (from the box) and each book, along with where exactly inside the storage unit each of those books is stored. Let’s call this data the “&lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cppdkblc38kij4m937j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cppdkblc38kij4m937j.jpeg" alt="Credits — Angelascottauthor" width="168" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the equivalents of the hypothetical &lt;em&gt;&lt;strong&gt;box, book, and page&lt;/strong&gt;&lt;/em&gt; in real-world data storage are &lt;em&gt;&lt;strong&gt;file/data, record, and block&lt;/strong&gt;&lt;/em&gt; respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm479zerwcnmv83r0433.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm479zerwcnmv83r0433.png" alt="Equivalents of Real World entities to those in the Hypothetical Analogy" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an example of what &lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt; might look like. This is the data about where each customer’s books are stored; it’s nothing but the &lt;strong&gt;metadata&lt;/strong&gt; of the books. In the real world, it’s the &lt;strong&gt;metadata&lt;/strong&gt; of the file/data stored in the filesystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1y5r267655dw8im0zgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1y5r267655dw8im0zgx.png" alt="Image description" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios — add or retrieve books.&lt;/strong&gt;&lt;br&gt;
If you have to store one or more additional books later, you give the new books to the store keeper, who notes the details in the &lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt; and then stores them on the shelves.&lt;/p&gt;

&lt;p&gt;If you have to retrieve one or more books from storage, you give the details of the book(s) you need, and the store keeper retrieves them for you. When you bring them back, the store keeper follows the same process of noting all the metadata and storing them on one of the shelves.&lt;/p&gt;

&lt;p&gt;In the real world, this is how you interact with the filesystem when you want to change a saved file. You can retrieve one or more records, make changes at the record level, and save them back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — consistency&lt;/strong&gt;. As every storage/retrieval goes through the store keeper, you can look at the metadata (book-storage-details) at any point in time and tell exactly what books are stored, by whom, and where. If an entry exists in the metadata, the book is stored; if there is no entry, it is not. Simple! The answers to questions like &lt;strong&gt;&lt;em&gt;“is the book stored?”&lt;/em&gt;&lt;/strong&gt;, &lt;em&gt;&lt;strong&gt;“how many total books are stored?”&lt;/strong&gt;&lt;/em&gt;, and &lt;strong&gt;&lt;em&gt;“list all books stored”&lt;/em&gt;&lt;/strong&gt; are all precise and definite at a given point in time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — efficiency&lt;/strong&gt;. As the exchanges happen at the record level, only the records you need are retrieved or written. This is the most efficient way to operate on data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantage — scalability&lt;/strong&gt;. If 100 customers want to store or retrieve their books at the same time, they have to wait while the store keeper processes them one after the other. Maybe multiple store keepers sharing the same &lt;strong&gt;book-storage-details&lt;/strong&gt; can work faster and handle 10 customers at a time, but eventually there is a limit to the scalability, and the process slows down because every single storage/retrieval (down to every single book) has to be recorded in the metadata store (book-storage-details).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key here is that the store keeper &lt;em&gt;&lt;strong&gt;needs to know&lt;/strong&gt;&lt;/em&gt; &lt;em&gt;&lt;strong&gt;“what”&lt;/strong&gt;&lt;/em&gt; you store. That is where the bottleneck is. They need to know &lt;strong&gt;&lt;em&gt;“what”&lt;/em&gt;&lt;/strong&gt; is inside the box, every single time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw8kapx2is0m9bin4qxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw8kapx2is0m9bin4qxo.png" alt="Image of a screened bag by TSA. Credits - insider.com" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how most databases work in the real world. There is a central metadata store that keeps track of every record stored and its details (size, location, block details, etc.). Every exchange with the database is “transactional” and “consistent”. All exchanges have to go through the centrally stored metadata, and that’s where the potential bottleneck is. Databases are precise and consistent, but they do not scale linearly with load. Once the number of writes reaches a limit, the database does not scale.&lt;/p&gt;

&lt;p&gt;Now let’s look at how Object stores work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Object storage?&lt;/strong&gt;&lt;br&gt;
Simply put: Object storage is a storage mechanism designed to trade off the consistency of block storage in exchange for infinite scalability.&lt;/p&gt;

&lt;p&gt;What this means is that object storage is designed from the ground up to be infinitely scalable (theoretically).&lt;/p&gt;

&lt;p&gt;Now, why do we need infinite scalability? That is what the &lt;strong&gt;Volume&lt;/strong&gt; and &lt;strong&gt;Velocity&lt;/strong&gt; of &lt;strong&gt;Big Data&lt;/strong&gt; demand. The world generates data too fast, and in quantities too huge, for traditional databases to process. We could never keep up, or even catch up on the backlog, and store all of the data produced without potentially losing some of it.&lt;/p&gt;

&lt;p&gt;Let’s look at the same &lt;strong&gt;Book storage unit&lt;/strong&gt; analogy to understand how it would play out in case of Object storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage Analogy: box storage with one/more store keepers and direct access to storage details.&lt;/strong&gt;&lt;br&gt;
The key words here are “box storage”, “infinite store keepers” and “direct access”. Imagine the same hypothetical storage unit we discussed earlier, but this time we deal with &lt;strong&gt;boxes&lt;/strong&gt; and no longer with &lt;strong&gt;books&lt;/strong&gt;. You can store anything in the box, and it doesn’t matter. The store keepers do &lt;strong&gt;NOT&lt;/strong&gt; know or care what you have inside the box. This removes a lot of their work: knowing what’s inside and recording everything every time. So practically, the same number of store keepers can do more work, which is just moving boxes in and out of the storage unit.&lt;/p&gt;

&lt;p&gt;What is this going to change? Well, it changes the way you interact with the storage unit a lot. First, you can store anything inside the box, so only one type of storage unit is needed. Second, you can store/retrieve boxes much faster, no matter how many customers are trying to store/retrieve boxes at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uzac0xzp3k11mnxkdmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uzac0xzp3k11mnxkdmd.png" alt="Credits — Pinterest" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this new model, you bring your contents in a &lt;strong&gt;box&lt;/strong&gt; (labelled appropriately, as earlier) and hand the box over to the store keeper. The store keeper assigns a vacant aisle to you, names that aisle after you, and stores the box in that aisle. The store keeper also records information about the box (aisle details, name, weight, size, time of storage, etc.) in the &lt;strong&gt;“box-storage-details”&lt;/strong&gt;. But this time, the information is no longer centrally managed or stored; it is stored along with your box. One copy is attached to the box so that anyone can find the details by looking at it, one more copy is kept at the front desk, and you also get to see the information in &lt;strong&gt;“box-storage-details”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This helps you because, the next time you have to interact with your box (add/remove contents), you can hand the details to the store keeper and he/she can help you with your box.&lt;/p&gt;

&lt;p&gt;With this new model, you always interact with the store keeper in terms of a &lt;strong&gt;“box”&lt;/strong&gt; and no longer in terms of individual &lt;strong&gt;items&lt;/strong&gt;. The store keeper no longer has any details about the books/toys/electronics, or anything else, inside the box. This removes the requirement of keeping track of &lt;strong&gt;what is inside&lt;/strong&gt; the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios — add or retrieve items.&lt;/strong&gt;&lt;br&gt;
If you have to add items to the box or retrieve items from it, you have to retrieve the &lt;strong&gt;entire box&lt;/strong&gt;, make whatever changes you need (add or remove books/electronics/toys), and put the &lt;strong&gt;box&lt;/strong&gt; back in the aisle.&lt;/p&gt;

&lt;p&gt;Magically, the process of adding or retrieving content is &lt;strong&gt;simplified&lt;/strong&gt;. Any storekeeper can help you, because you carry all the details required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — scalability&lt;/strong&gt;. Now that a store keeper’s job is simply to store or retrieve a box, and the details about where to find your box are provided at the time of the interaction, the process becomes linearly scalable by simply adding more store keepers as the number of concurrent customers increases. There is no longer a central metadata store, or central &lt;strong&gt;“box-storage-details”&lt;/strong&gt;, to maintain.&lt;/p&gt;

&lt;p&gt;Periodically, maybe, all of the storekeepers consolidate their storage details to come up with total stats for the entire storage unit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantage — consistency&lt;/strong&gt;. How does this affect consistency? That is the key question. Now that the information about which boxes are stored, and where, is held by multiple store keepers in a distributed fashion, you cannot get a consistent answer to questions like &lt;strong&gt;&lt;em&gt;“is the box stored?”&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;“how many total boxes are stored?”&lt;/em&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;em&gt;“list all boxes weighing more than 5 lbs”&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every time you ask “is box #25 from customer 1001 stored?”, the storekeepers have to consolidate their records to arrive at an answer. And while they are consolidating, customer 1001 might add or retrieve box #25, and that information is not recorded yet. So the numbers are eventually consistent: if you give the storekeepers enough time to consolidate their records after box #25 from customer 1001 is stored, you will get a consistent answer.&lt;/p&gt;

&lt;p&gt;That is “Eventual Consistency”. And that is NOT a bug; it’s by design.&lt;/p&gt;
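
&lt;p&gt;A tiny Python sketch (hypothetical and vastly simplified, not any real store’s replication protocol) shows the same effect: a write lands on one replica first, and a read served by another replica is stale until the replicas consolidate:&lt;/p&gt;

```python
# Toy sketch of eventual consistency: a write is acknowledged by one
# replica and only reaches the others when they "consolidate", like the
# storekeepers comparing notes.
class EventuallyConsistentStore:
    def __init__(self, n_replicas: int = 3):
        self.replicas = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        self.replicas[0][key] = value           # write lands on one replica

    def get(self, key, replica: int = 1):
        return self.replicas[replica].get(key)  # may serve a stale view

    def consolidate(self):
        for r in self.replicas[1:]:             # background replication
            r.update(self.replicas[0])

store = EventuallyConsistentStore()
store.put("box#25", "customer 1001")
print(store.get("box#25"))   # None: the write has not propagated yet
store.consolidate()
print(store.get("box#25"))   # 'customer 1001': eventually consistent
```

&lt;p&gt;Give the replicas enough time to consolidate and every read agrees; ask too early and you get yesterday’s answer. That is the trade, by design.&lt;/p&gt;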

&lt;p&gt;&lt;strong&gt;Disadvantage — efficiency&lt;/strong&gt;. Now that all interactions are in terms of “boxes”, and no longer in terms of individual “items” or “contents”, even if you only have to add or remove one item, you have to retrieve the entire box. This is not very efficient, but with the processing speeds of the cloud, the virtually unlimited availability of compute power, and fast network transfers, the inefficiency makes only a small dent in the overall big picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — scalability&lt;/strong&gt;. This is the big one. With no need to record details about the contents of the boxes, the speed at which storekeepers can store/retrieve boxes increases dramatically. This opens the door to potentially infinite scalability: the answer to our question, “How can we keep up with the speed at which data is being generated?”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key here is that the store keeper &lt;strong&gt;does not know “what”&lt;/strong&gt; you store. That is how the bottleneck is eliminated. They only need to know &lt;strong&gt;“about”&lt;/strong&gt; the box, every single time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is how object stores work, and this is why object stores can scale infinitely and store anything: flat files, songs, pictures, movies, etc.&lt;/p&gt;

&lt;p&gt;Now that we’ve solved the need for scalability, how can this be implemented in the shared infrastructure of the cloud, where providers have to host data from customers all over the world? How can they make sure people do not overwrite content created by others?&lt;/p&gt;

&lt;p&gt;Did you know that the bucket name you pick for storing your data in AWS S3 has to be unique? By unique, I mean unique across the world: &lt;strong&gt;no one else in the world can have the same name for their bucket&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is interesting, right? The reason is the flat naming structure AWS follows to name objects.&lt;/p&gt;

&lt;p&gt;When you store a file named “vacation_pic1.jpg” under the folder structure &lt;strong&gt;&lt;em&gt;bucket1/folder1/folder2/&lt;/em&gt;&lt;/strong&gt;, the structure is designed to make navigating and understanding the stored data easier for end-users. But in reality, there are no folders at all in AWS S3.&lt;/p&gt;

&lt;p&gt;The actual storage implementation works by flattening the path into one single name and hashing it.&lt;/p&gt;

&lt;p&gt;So the object “vacation_pic1.jpg” is stored as &lt;strong&gt;&lt;em&gt;bucket1_folder1_folder2_vacation_pic1.jpg&lt;/em&gt;&lt;/strong&gt;, or the hash of that name. When the starting point of the object name is made unique, whatever logical folder structure you create after that doesn’t matter; in the end, object names are unique across the board.&lt;/p&gt;
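
&lt;p&gt;Here is a hypothetical Python sketch of the idea (the names &lt;code&gt;put&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;_flat_key&lt;/code&gt; are mine, not the S3 API, whose internals are undocumented): the “folders” exist only inside the key string, and the store itself is one flat map keyed by a hash of the full name:&lt;/p&gt;

```python
import hashlib

# Flat object namespace sketch: no directories anywhere, just one big map.
store = {}

def _flat_key(bucket: str, path: str) -> str:
    # bucket names are globally unique, so bucket + path is unique too;
    # hashing the flattened name spreads keys evenly across the store
    return hashlib.sha256(f"{bucket}/{path}".encode()).hexdigest()

def put(bucket: str, path: str, data: bytes) -> None:
    store[_flat_key(bucket, path)] = data

def get(bucket: str, path: str) -> bytes:
    return store[_flat_key(bucket, path)]

put("bucket1", "folder1/folder2/vacation_pic1.jpg", b"...jpeg bytes...")
print(get("bucket1", "folder1/folder2/vacation_pic1.jpg"))  # no folders involved
```

&lt;p&gt;The slashes in the path are just characters in a key; nothing in the store ever creates or walks a directory tree.&lt;/p&gt;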

&lt;p&gt;Now, can S3 be used to store anything? Yes. Then do we still need all the other databases, like Postgres, Redshift, SQL Server, Teradata, Neo4j, ArangoDB, MongoDB, etc.?&lt;/p&gt;

&lt;p&gt;Yes, we do need other databases as well. Let’s discuss the need for those in a followup article. For now, let’s discuss the use cases for S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the best use cases for Object Stores?&lt;/strong&gt;&lt;br&gt;
As you can see from the design, AWS S3 makes it possible to write huge datasets quickly. You can read huge datasets as well, but the data is immutable: once you create an object, you cannot edit its contents. You have to reproduce the entire object with whatever changes you want.&lt;/p&gt;
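
&lt;p&gt;In code, “editing” an immutable object is a read-modify-write of the whole thing. A minimal sketch, with a plain dict standing in for a real bucket:&lt;/p&gt;

```python
# Hypothetical read-modify-write sketch: because objects are immutable,
# changing one line means fetching the whole object and uploading a
# complete replacement.
objects = {"bucket1/notes.txt": b"line one\nline two\n"}

def edit_object(key: str, old: bytes, new: bytes) -> None:
    body = objects[key]            # 1. download the ENTIRE object
    body = body.replace(old, new)  # 2. modify it locally
    objects[key] = body            # 3. upload a full new version

edit_object("bucket1/notes.txt", b"line two", b"line 2")
print(objects["bucket1/notes.txt"])  # b'line one\nline 2\n'
```

&lt;p&gt;For a one-byte change in a multi-gigabyte object, the whole object makes the round trip. That is why frequently changing data is a poor fit.&lt;/p&gt;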

&lt;p&gt;That is the reason AWS S3 is not a good fit for frequently changing data. In other words, object stores are good for static data. If you produce the data once and read it many times, it’s the perfect fit. Write once, read a million times.&lt;/p&gt;

&lt;p&gt;How many times do you think Netflix produces a movie and edits it? Maybe it gets edited a couple of times initially, but after that, a movie is a movie. So you produce a movie once, store it on S3, and stream it from S3 millions or even billions of times. Same with website content: how often would you change the logo, pictures, or other static content of a website? Not very often. So the static content of almost all websites can be stored on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the future of Object stores?&lt;/strong&gt;&lt;br&gt;
The reliability, cost effectiveness, and infinite scalability of Object stores promise a lower total cost of ownership for storing data. But eventual consistency and immutability stop them just short of being the only storage we would ever need. Or do they?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;AWS S3 was eventually consistent at the time of writing this article (June 2020); it is strongly consistent as of December 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is where very interesting ideas came in from companies like Netflix, Google, Snowflake, Databricks, etc. These companies created virtual ACID layers on top of Object stores, making the eventual consistency and immutability of the underlying store virtually invisible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e3fjj48kcagk2wszdet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e3fjj48kcagk2wszdet.png" alt="Image description" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How far have these virtual ACID layers come in hiding the eventual consistency and immutability of Object stores, and how do they work? How did open-source solutions like Netflix’s s3mper try to solve eventual consistency using OLTP database systems? How does Google solve eventual consistency using Google Spanner?&lt;/p&gt;

&lt;p&gt;Let’s discuss these in detail in my next article.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
