<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cooper D</title>
    <description>The latest articles on DEV Community by Cooper D (@gigapress).</description>
    <link>https://dev.to/gigapress</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1251677%2F2781de9d-e38e-4e18-9d22-c47520c09b5d.jpeg</url>
      <title>DEV Community: Cooper D</title>
      <link>https://dev.to/gigapress</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gigapress"/>
    <language>en</language>
    <item>
      <title>The Deterministic Problem with Probabilistic AI Analytics</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sun, 07 Dec 2025 05:32:56 +0000</pubDate>
      <link>https://dev.to/gigapress/the-deterministic-problem-with-probabilistic-ai-analytics-1n2</link>
      <guid>https://dev.to/gigapress/the-deterministic-problem-with-probabilistic-ai-analytics-1n2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; AI-powered analytics tools use probabilistic systems (LLMs, semantic search, RAG) to answer business questions that demand deterministic accuracy. Ask the same question twice with slightly different wording, and you might get different SQL queries and different answers. This isn't a minor bug - it's a fundamental architectural problem. The solution isn't better AI, it's rethinking how we match business questions to data, using exact matching for core concepts instead of fuzzy semantic search.&lt;/p&gt;




&lt;p&gt;Imagine asking your CFO: "How many claims did we deny in Q3?"&lt;/p&gt;

&lt;p&gt;They pull up a dashboard and say "approximately 1,247, give or take."&lt;/p&gt;

&lt;p&gt;You'd be concerned, right? When an auditor asks that question, "approximately" doesn't cut it. It's either 1,247 or it isn't.&lt;/p&gt;

&lt;p&gt;Now imagine that same CFO runs the question again tomorrow using slightly different words - "What's the count of rejected claims last quarter?" - and gets 1,189.&lt;/p&gt;

&lt;p&gt;You'd question the entire system.&lt;/p&gt;

&lt;p&gt;Yet this is exactly how most AI-powered analytics tools work today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Probabilistic Foundation
&lt;/h2&gt;

&lt;p&gt;Modern AI analytics stacks look something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question in natural language&lt;/li&gt;
&lt;li&gt;Semantic search finds relevant tables and columns from metadata&lt;/li&gt;
&lt;li&gt;RAG pulls additional context about those data elements
&lt;/li&gt;
&lt;li&gt;An LLM generates a SQL query based on the retrieved information&lt;/li&gt;
&lt;li&gt;Execute query, return results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step in this pipeline is probabilistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic search&lt;/strong&gt; uses embeddings to find "similar" content. The same question phrased differently produces different vector representations, which retrieve different metadata chunks.&lt;/p&gt;
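
&lt;p&gt;You can see this ranking instability with a toy example. The three-dimensional "embeddings" below are made up for illustration - real models use hundreds of dimensions - but the mechanics are the same: two phrasings of one question land at slightly different points in vector space, and the nearest column flips.&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two candidate columns (values are invented for the demo).
candidates = {
    "customer_tier":    [0.90, 0.10, 0.40],
    "customer_segment": [0.85, 0.30, 0.30],
}

# Two phrasings of the same question produce slightly different query vectors.
query_a = [0.88, 0.12, 0.38]   # "premium customers"
query_b = [0.80, 0.35, 0.25]   # "top-tier customers"

for name, q in [("phrasing A", query_a), ("phrasing B", query_b)]:
    ranked = sorted(candidates, key=lambda c: cosine(q, candidates[c]), reverse=True)
    print(name, "retrieves:", ranked[0])
```

&lt;p&gt;Phrasing A retrieves &lt;code&gt;customer_tier&lt;/code&gt; first; phrasing B retrieves &lt;code&gt;customer_segment&lt;/code&gt; first. Nothing is "wrong" - the ranking is just sensitive to wording.&lt;/p&gt;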

&lt;p&gt;&lt;strong&gt;RAG retrieval&lt;/strong&gt; ranks results by relevance scores. Small changes in the query can shuffle the ranking, changing which context makes it into the LLM's prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM generation&lt;/strong&gt; is fundamentally non-deterministic. Even with temperature set to 0, the same prompt can produce variations in SQL structure, table joins, or filter conditions.&lt;/p&gt;

&lt;p&gt;This works fine for creative tasks. Writing marketing copy, brainstorming ideas, drafting emails - probabilistic is perfect for these use cases.&lt;/p&gt;

&lt;p&gt;But business analytics isn't creative work. It's precision work.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Approximation Fails
&lt;/h2&gt;

&lt;p&gt;Let's trace what happens with a real question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; "What's the average order value for premium customers last quarter?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First attempt:&lt;/strong&gt; Semantic search finds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; table (0.89 relevance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_tier&lt;/code&gt; column (0.87 relevance)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_total&lt;/code&gt; column (0.85 relevance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-07-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: $247.83&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second attempt&lt;/strong&gt; (user rephrases): "What's the mean order amount for our top-tier customers in Q3?"&lt;/p&gt;

&lt;p&gt;Semantic search now finds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; table (0.88 relevance)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_segment&lt;/code&gt; column (0.86 relevance) - different column!&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_value&lt;/code&gt; column (0.84 relevance) - different column!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_segment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gold'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-07-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: $231.56&lt;/p&gt;

&lt;p&gt;Same question. Different columns. Different answer.&lt;/p&gt;

&lt;p&gt;The user didn't change what they were asking. But the probabilistic system interpreted it differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;The issue isn't that the AI is "wrong." It's that there's no ground truth enforcement.&lt;/p&gt;

&lt;p&gt;In the first query, how does the system know that "premium customers" maps to &lt;code&gt;customer_tier = 'premium'&lt;/code&gt; and not &lt;code&gt;customer_segment = 'gold'&lt;/code&gt;? &lt;/p&gt;

&lt;p&gt;It doesn't. It guesses based on semantic similarity.&lt;/p&gt;

&lt;p&gt;Both mappings seem reasonable. Both are retrieved by the semantic search. The LLM makes a choice based on subtle factors in the retrieval ranking and its training data.&lt;/p&gt;

&lt;p&gt;Change the wording slightly, change the retrieval ranking, change the choice.&lt;/p&gt;

&lt;p&gt;This is the fundamental mismatch: &lt;strong&gt;using approximation engines to deliver deterministic outcomes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adb0ilcod5jgc60z9yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adb0ilcod5jgc60z9yi.png" alt="Probablistic vs Deterministic" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Humans Actually Work
&lt;/h2&gt;

&lt;p&gt;Watch an experienced analyst tackle a question, and you'll notice something important.&lt;/p&gt;

&lt;p&gt;Given: "What's the average order value and customer satisfaction rating for premium customers who made at least 3 purchases in high volume stores during peak hours last quarter? Break it down by product and region."&lt;/p&gt;

&lt;p&gt;An analyst doesn't treat every word equally. They immediately decompose it into structured components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics (what to measure):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average order value&lt;/li&gt;
&lt;li&gt;Customer satisfaction rating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Filters (what to include):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Premium customers&lt;/li&gt;
&lt;li&gt;At least 3 purchases
&lt;/li&gt;
&lt;li&gt;High volume stores&lt;/li&gt;
&lt;li&gt;Peak hours&lt;/li&gt;
&lt;li&gt;Last quarter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dimensions (how to group):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product&lt;/li&gt;
&lt;li&gt;Region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now here's the key: they do &lt;strong&gt;exact lookups&lt;/strong&gt;, not fuzzy searches.&lt;/p&gt;

&lt;p&gt;They open their metrics catalog and search for "average order value" - exact match. Either it exists or it doesn't. &lt;/p&gt;

&lt;p&gt;They check their customer segments for "premium customers" - exact match. Either it's defined or it isn't.&lt;/p&gt;

&lt;p&gt;They look up "high volume stores" in the business glossary - exact match.&lt;/p&gt;

&lt;p&gt;No approximation. No "well, this seems similar enough." Either the concept is defined in the system, or they need to go ask someone what it means.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture That Works
&lt;/h2&gt;

&lt;p&gt;The solution is to structure query generation around exact matching for business concepts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxve4r67gshpu1022vqtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxve4r67gshpu1022vqtq.png" alt="Architecture That Works" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Parse the Question into Components
&lt;/h3&gt;

&lt;p&gt;Use an LLM to break down the natural language question into structured parts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"average order value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer satisfaction rating"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"premium customers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"at least 3 purchases"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"high volume stores"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="s2"&gt;"peak hours"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"last quarter"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step can be probabilistic - you're just extracting the business concepts the user is asking about.&lt;/p&gt;
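
&lt;p&gt;A minimal sketch of this parsing step. Here &lt;code&gt;call_llm&lt;/code&gt; is a placeholder that returns a canned response so the example runs; in a real system it would call your model provider, and the validation guards against malformed output before anything downstream trusts it:&lt;/p&gt;

```python
import json

PARSE_PROMPT = """Decompose the question into metrics, filters, and dimensions.
Respond with JSON only, using exactly those three keys.

Question: {question}"""

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a canned response for the demo.
    return json.dumps({
        "metrics": ["average order value", "customer satisfaction rating"],
        "filters": ["premium customers", "at least 3 purchases",
                    "high volume stores", "peak hours", "last quarter"],
        "dimensions": ["product", "region"],
    })

def parse_question(question: str) -> dict:
    raw = call_llm(PARSE_PROMPT.format(question=question))
    parsed = json.loads(raw)
    # Validate the shape before any downstream step relies on it.
    for key in ("metrics", "filters", "dimensions"):
        if not isinstance(parsed.get(key), list):
            raise ValueError(f"LLM response missing list field: {key}")
    return parsed
```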

&lt;h3&gt;
  
  
  Step 2: Exact Match Against Your Metadata
&lt;/h3&gt;

&lt;p&gt;Now search your metrics catalog for &lt;strong&gt;exact matches&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"average order value" → FOUND: &lt;code&gt;metrics.avg_order_value&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"customer satisfaction rating" → FOUND: &lt;code&gt;metrics.csat_score&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Search your business glossary for filter definitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"premium customers" → FOUND: &lt;code&gt;customer_tier = 'premium'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"high volume stores" → FOUND: &lt;code&gt;store_id IN (SELECT store_id FROM dim_stores WHERE annual_revenue &amp;gt; 5000000)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"peak hours" → FOUND: &lt;code&gt;EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19,20)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Search dimension tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"product" → FOUND: &lt;code&gt;dim_product.product_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"region" → FOUND: &lt;code&gt;dim_geography.region_name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any concept doesn't have an exact match, you stop and tell the user: "I don't have a definition for 'at least 3 purchases' in the system. Can you clarify, or should I use this interpretation?"&lt;/p&gt;
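
&lt;p&gt;A sketch of the exact-match step. The catalog contents mirror the examples above (in practice they live in a queryable metadata store); the key property is that a concept either resolves or is reported back to the user, never approximated:&lt;/p&gt;

```python
# Minimal business glossary keyed by exact concept name.
GLOSSARY = {
    "premium customers": "customer_tier = 'premium'",
    "high volume stores": ("store_id IN (SELECT store_id FROM dim_stores "
                           "WHERE annual_revenue > 5000000)"),
    "peak hours": "EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19,20)",
}

def resolve(concepts, catalog):
    # Exact lookup only: each concept either resolves or is flagged as unknown.
    found = {c: catalog[c] for c in concepts if c in catalog}
    missing = [c for c in concepts if c not in catalog]
    return found, missing

found, missing = resolve(
    ["premium customers", "at least 3 purchases", "high volume stores"], GLOSSARY)
if missing:
    # Stop and ask instead of guessing.
    print(f"No definition for: {', '.join(missing)}. Please clarify.")
```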

&lt;h3&gt;
  
  
  Step 3: Assemble the Query Deterministically
&lt;/h3&gt;

&lt;p&gt;Now you have concrete metadata for each component. Pull the complete definition for each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;metrics.avg_order_value&lt;/code&gt; includes: table, column, calculation logic, any required joins&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_tier = 'premium'&lt;/code&gt; includes: table, column, exact value, any prerequisites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use an LLM to assemble these concrete pieces into SQL. But now the LLM isn't guessing which tables or columns to use - it's combining predefined building blocks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_order_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;satisfaction_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_csat&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;  
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_geography&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'quarter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'3 months'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_stores&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;annual_revenue&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same question asked twice → same metadata retrieved → same query generated.&lt;/p&gt;

&lt;p&gt;Deterministic.&lt;/p&gt;
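
&lt;p&gt;The assembly contract can be sketched as a plain function: resolved building blocks in, byte-identical SQL out. Join resolution is omitted here - in practice each metric definition carries its required joins - but the determinism guarantee is the point:&lt;/p&gt;

```python
def build_query(metric_exprs, filter_exprs, dimension_cols, base_table):
    # Inputs are already exact-matched definitions, so the same resolved
    # components always yield byte-identical SQL.
    select_list = dimension_cols + metric_exprs
    sql = [f"SELECT {', '.join(select_list)}",
           f"FROM {base_table}"]
    if filter_exprs:
        sql.append("WHERE " + "\n  AND ".join(filter_exprs))
    if dimension_cols:
        sql.append("GROUP BY " + ", ".join(dimension_cols))
    return "\n".join(sql)

q1 = build_query(["AVG(order_total) AS avg_order_value"],
                 ["customer_tier = 'premium'"],
                 ["region_name"], "fact_orders")
q2 = build_query(["AVG(order_total) AS avg_order_value"],
                 ["customer_tier = 'premium'"],
                 ["region_name"], "fact_orders")
assert q1 == q2  # same components in, same SQL out
```

&lt;p&gt;If you keep an LLM in this step for flexibility, constrain it to arranging these predefined fragments - and verify the output contains exactly the resolved tables and columns, nothing else.&lt;/p&gt;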

&lt;h2&gt;
  
  
  What This Requires
&lt;/h2&gt;

&lt;p&gt;This architecture has prerequisites:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a metrics catalog.&lt;/strong&gt; Every KPI, every metric, every measure your business uses - defined precisely, stored in a queryable system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a business glossary.&lt;/strong&gt; Every business term, every filter condition, every segment - with exact definitions and the logic to implement them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need dimension mapping.&lt;/strong&gt; Clear relationships between business concepts and data model entities.&lt;/p&gt;

&lt;p&gt;In other words, you need the 80% solved (see the previous article on this). You need the business context documented and structured.&lt;/p&gt;

&lt;p&gt;But once you have that foundation, this architecture works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;Pure exact matching is rigid. What if a user asks about "VIP customers" when your system calls them "premium customers"?&lt;/p&gt;

&lt;p&gt;The answer is a hybrid approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try exact matching first&lt;/li&gt;
&lt;li&gt;If no exact match, do fuzzy matching and ask for confirmation&lt;/li&gt;
&lt;li&gt;Learn from confirmations to expand your synonym mappings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"I don't have a definition for 'VIP customers'. Did you mean 'premium customers' or 'enterprise customers'?"&lt;/p&gt;

&lt;p&gt;User confirms. System logs that "VIP customers" → "premium customers" for this user. Next time, exact match succeeds.&lt;/p&gt;

&lt;p&gt;Over time, the system learns the vocabulary variations while maintaining deterministic query generation.&lt;/p&gt;
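
&lt;p&gt;That loop can be sketched with Python's standard-library &lt;code&gt;difflib&lt;/code&gt; as the fuzzy fallback (the glossary entries are illustrative):&lt;/p&gt;

```python
import difflib

GLOSSARY = {
    "premium customers": "customer_tier = 'premium'",
    "enterprise customers": "customer_tier = 'enterprise'",
}
SYNONYMS = {}  # learned mappings: user phrase -> canonical glossary term

def resolve(term):
    term = term.lower().strip()
    if term in GLOSSARY:                       # 1. exact match
        return GLOSSARY[term], None
    if term in SYNONYMS:                       # learned synonym, still exact
        return GLOSSARY[SYNONYMS[term]], None
    # 2. fuzzy candidates -> ask the user, never silently pick one
    candidates = difflib.get_close_matches(term, GLOSSARY, n=2, cutoff=0.3)
    return None, candidates

def confirm(term, canonical):
    # 3. user confirmed the mapping; next time it resolves exactly
    SYNONYMS[term.lower().strip()] = canonical

definition, candidates = resolve("VIP customers")   # no match; candidates offered
confirm("VIP customers", "premium customers")       # user picks one
definition, _ = resolve("VIP customers")            # now deterministic
```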

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The difference between "approximately right" and "exactly right" is everything in business analytics.&lt;/p&gt;

&lt;p&gt;Investors expect precise numbers. Audits demand accurate counts. Regulatory reports have zero tolerance for approximation. Strategic decisions need reliable data.&lt;/p&gt;

&lt;p&gt;A system that gives you different answers to the same question isn't useful. It's dangerous. It erodes trust in data.&lt;/p&gt;

&lt;p&gt;The current approach - throw everything into RAG and hope the LLM figures it out - works for demos. It fails in production where precision matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Path Forward
&lt;/h2&gt;

&lt;p&gt;If you're building or buying AI analytics tools, ask:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you ensure deterministic query generation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "better prompts" or "fine-tuned models" or "improved embeddings," that's not enough. Those are incremental improvements to a fundamentally probabilistic system.&lt;/p&gt;

&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact matching on business concepts&lt;/li&gt;
&lt;li&gt;Structured metadata catalogs&lt;/li&gt;
&lt;li&gt;Clear decomposition of questions into components&lt;/li&gt;
&lt;li&gt;Explicit handling of ambiguity (asking users instead of guessing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What happens when I ask the same question twice?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you get different SQL queries, that's a red flag. The system should be stable and predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when a concept isn't defined?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system should admit it doesn't know rather than approximate. "I don't have a definition for X" is better than silently using the wrong definition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Probabilistic AI is powerful for creative tasks. But business analytics isn't creative - it's precision engineering.&lt;/p&gt;

&lt;p&gt;You can't build a deterministic system on a purely probabilistic foundation. Semantic search and RAG are tools, but they can't be the entire architecture.&lt;/p&gt;

&lt;p&gt;The solution is to use exact matching where precision matters (business concepts, metrics, filters) and use AI where flexibility helps (parsing questions, assembling queries, explaining results).&lt;/p&gt;

&lt;p&gt;Get the architecture right, and AI analytics becomes reliable. Get it wrong, and you build an impressive demo that nobody trusts in production.&lt;/p&gt;

&lt;p&gt;The choice is yours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business analytics demands deterministic accuracy, but most AI systems are fundamentally probabilistic&lt;/li&gt;
&lt;li&gt;Semantic search and RAG can return different results for the same question phrased differently&lt;/li&gt;
&lt;li&gt;Human analysts use exact matching for business concepts, not fuzzy approximation&lt;/li&gt;
&lt;li&gt;The solution: parse questions into components, exact-match against metadata, then assemble queries&lt;/li&gt;
&lt;li&gt;This requires well-documented business context (metrics catalogs, business glossaries)&lt;/li&gt;
&lt;li&gt;Hybrid approaches can handle vocabulary variation while maintaining deterministic core behavior&lt;/li&gt;
&lt;li&gt;The right architecture uses AI for flexibility while ensuring reproducible, accurate results&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sql</category>
      <category>deterministic</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Why AI Analytics Tools Are Solving the Wrong Problem</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sun, 07 Dec 2025 05:12:27 +0000</pubDate>
      <link>https://dev.to/gigapress/why-ai-analytics-tools-are-solving-the-wrong-problem-57da</link>
      <guid>https://dev.to/gigapress/why-ai-analytics-tools-are-solving-the-wrong-problem-57da</guid>
      <description>&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; The AI analytics industry is obsessed with building better query engines - using LLMs to turn natural language into SQL. But that's only 20% of the real challenge. The other 80%? Capturing and maintaining the massive amount of business context that exists in people's heads, undocumented meetings, and scattered wikis across five layers of your organization. Until we solve this unglamorous documentation problem, AI-powered analytics will remain impressive demos that struggle in production.&lt;/p&gt;




&lt;p&gt;Every "chat with your data" demo looks the same. Someone types "show me sales by region last month" into a sleek interface. An LLM generates a SQL query. Results appear. Everyone nods approvingly.&lt;/p&gt;

&lt;p&gt;Then you try to deploy it at your company.&lt;/p&gt;

&lt;p&gt;Suddenly, questions that should be simple become impossible. "What's our revenue from premium customers?" sounds straightforward until you realize three different teams define "premium" differently, and "revenue" means something else to finance than it does to operations.&lt;/p&gt;

&lt;p&gt;The demo worked because the demo had clean, simple data. Your reality is messier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Everyone Is Building
&lt;/h2&gt;

&lt;p&gt;Open any AI analytics product and you'll find roughly the same architecture under the hood.&lt;/p&gt;

&lt;p&gt;They connect to your database and pull the schema - tables, columns, data types, foreign keys. They use retrieval augmented generation (RAG) to find relevant metadata when you ask a question. An LLM takes that context and generates SQL. Execute the query, format the results, maybe generate some insights.&lt;/p&gt;

&lt;p&gt;For simple questions against well-designed databases, this works. "Total orders last week" or "top 10 customers by spend" - no problem.&lt;/p&gt;

&lt;p&gt;This is the 20% of analytics that everyone's racing to perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80% That Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The real work begins when you step outside the database schema and into the messy world of business meaning.&lt;/p&gt;

&lt;p&gt;Your challenges live in five distinct layers, each one adding complexity that no amount of clever prompting can solve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kw1d7yb3g0v43es5in.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0kw1d7yb3g0v43es5in.png" alt="Hidden Work" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Business Definitions That Live Nowhere
&lt;/h3&gt;

&lt;p&gt;Your database has a &lt;code&gt;customers&lt;/code&gt; table with 2 million rows. Great. But which ones are "premium customers"? &lt;/p&gt;

&lt;p&gt;Is it customers who spend over $10K annually? Or is it the VIP tier from your loyalty program? Or maybe it's anyone with a dedicated account manager? Different teams give different answers.&lt;/p&gt;

&lt;p&gt;What about "high volume stores"? Is that top 10% by revenue? By transaction count? By square footage? The answer exists somewhere - maybe in a strategy deck from 18 months ago, maybe in someone's head who's been here for five years.&lt;/p&gt;

&lt;p&gt;"Peak hours" sounds objective until you learn that retail defines it as 10am-2pm and 5pm-8pm, but the warehouse team uses 7am-11am and 3pm-7pm.&lt;/p&gt;

&lt;p&gt;None of this lives in your database. It's business knowledge, and it needs to be documented in a structured way before any LLM can use it.&lt;/p&gt;
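
&lt;p&gt;What "documented in a structured way" might look like for the "peak hours" example - one term, explicitly scoped definitions, executable logic. The structure is illustrative, and the hour ranges are one reading of the definitions above:&lt;/p&gt;

```python
# A structured glossary entry: one term, explicitly scoped, with executable logic.
GLOSSARY_ENTRY = {
    "term": "peak hours",
    "definitions": [
        {"scope": "retail",
         "description": "10am-2pm and 5pm-8pm",
         "sql": "EXTRACT(HOUR FROM order_time) IN (10,11,12,13,17,18,19)"},
        {"scope": "warehouse",
         "description": "7am-11am and 3pm-7pm",
         "sql": "EXTRACT(HOUR FROM event_time) IN (7,8,9,10,15,16,17,18)"},
    ],
}

def definition_for(entry, scope):
    # An ambiguous term is only usable once the caller names a scope.
    for d in entry["definitions"]:
        if d["scope"] == scope:
            return d["sql"]
    raise KeyError(f"No definition of {entry['term']!r} for scope {scope!r}")
```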

&lt;h3&gt;
  
  
  Layer 2: Metrics and Their Hidden Complexity
&lt;/h3&gt;

&lt;p&gt;Ask five people what "revenue" means and you might get five different answers.&lt;/p&gt;

&lt;p&gt;Does revenue include pending orders? What about returns? Is it before or after discounts? Do you count the shipping fee? What about tax? Is it when the order is placed, when it ships, or when payment clears?&lt;/p&gt;

&lt;p&gt;Every one of these questions has an answer somewhere in your organization. Probably in multiple places with multiple versions, some contradictory.&lt;/p&gt;

&lt;p&gt;Your analytics team might calculate "Monthly Recurring Revenue" one way. Finance calculates it differently for the board. The sales dashboard shows a third number because it excludes trials.&lt;/p&gt;

&lt;p&gt;Each metric needs a single source of truth - not just a definition in plain English, but the actual business logic. The conditions, the exclusions, the edge cases. All of it documented and maintained.&lt;/p&gt;
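
&lt;p&gt;A sketch of what a single-source-of-truth metric definition could look like. The field names are illustrative - real metric stores and semantic layers have richer schemas - but the idea is the same: the conditions and exclusions live in one machine-readable place, not in five people's heads:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One canonical, machine-readable definition per metric."""
    name: str
    description: str
    expression: str            # the actual calculation logic
    grain: str                 # what one row of the source data represents
    inclusions: tuple = ()
    exclusions: tuple = ()

mrr = MetricDefinition(
    name="monthly_recurring_revenue",
    description="Recurring subscription revenue, normalized to a month.",
    expression="SUM(plan_amount / billing_period_months)",
    grain="one row per active subscription",
    inclusions=("active paid subscriptions",),
    exclusions=("free trials", "one-time fees", "refunded periods"),
)
```

&lt;p&gt;When finance, sales, and analytics all read from this one definition, the "three different MRR numbers" problem disappears - or at least becomes an explicit, reviewable disagreement.&lt;/p&gt;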

&lt;h3&gt;
  
  
  Layer 3: Domain-Specific Business Rules
&lt;/h3&gt;

&lt;p&gt;Now add in the decisions each business unit makes to solve their specific problems.&lt;/p&gt;

&lt;p&gt;Marketing runs a campaign and excludes customers who purchased in the last 30 days. Operations has special handling for orders over $5,000. Customer service treats warranty claims differently than regular support tickets. Finance has revenue recognition rules that vary by product type.&lt;/p&gt;

&lt;p&gt;These rules get implemented to solve today's business problem. The people making these decisions aren't thinking about downstream analytics. They're not documenting for future AI systems. They're shipping features and closing deals.&lt;/p&gt;

&lt;p&gt;But every one of these decisions affects what the data means and how it should be interpreted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Technical Implementation Decisions
&lt;/h3&gt;

&lt;p&gt;The business requirements land with engineering. Now developers make their own choices.&lt;/p&gt;

&lt;p&gt;They build microservices, each owning its own data. They choose data structures that make sense for their use case. They optimize for performance, for their API contracts, for their deployment constraints.&lt;/p&gt;

&lt;p&gt;Is a customer ID a string or an integer? Are addresses stored as structured fields or freeform text? Is the timestamp in UTC or local time? Different services make different choices.&lt;/p&gt;
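&lt;p&gt;A tiny example of the reconciliation this forces on the data team later. Everything here is hypothetical (the ID width, the per-source offsets); the point is that each normalization encodes a decision some service once made:&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

def canonical_customer_id(raw) -> str:
    """One service stores integers, another zero-padded strings; pick one canonical form."""
    return f"{int(raw):010d}"

def to_utc(naive_local: datetime, utc_offset_hours: int) -> datetime:
    """Normalize a naive local timestamp (offset documented per source system) to aware UTC."""
    local = naive_local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)
```

&lt;p&gt;None of this is hard code to write. What is hard is knowing, per source, which convention was chosen, and that knowledge rarely lives anywhere queryable.&lt;/p&gt;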

&lt;p&gt;These aren't wrong choices - they're pragmatic engineering decisions. But data becomes a byproduct of operations, not a first-class concern.&lt;/p&gt;

&lt;p&gt;And again, most of these decisions aren't documented anywhere that a data system can access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: The Data Platform Transformation Layer
&lt;/h3&gt;

&lt;p&gt;Finally, the data team pulls everything together. They extract data from dozens of sources, cleanse it, standardize it, transform it.&lt;/p&gt;

&lt;p&gt;They create &lt;code&gt;dim_customer&lt;/code&gt; by joining six different customer tables. They build &lt;code&gt;fact_orders&lt;/code&gt; by combining order data with returns, refunds, and adjustments. They calculate derived metrics like &lt;code&gt;customer_lifetime_value&lt;/code&gt; using complex logic.&lt;/p&gt;

&lt;p&gt;Every table, every transformation, every derived field represents decisions. What business logic is embedded in this ETL job? Why was this data transformed this way? What assumptions were made? What edge cases are handled?&lt;/p&gt;

&lt;p&gt;Without documentation, this knowledge lives in the data engineer's head or buried in hundreds of lines of SQL code.&lt;/p&gt;
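&lt;p&gt;One lightweight countermeasure is to put the "why" next to the SQL itself. Here is a sketch using an in-memory SQLite database as a stand-in warehouse; the source tables and the join rule are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for a warehouse; crm/billing are illustrative source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_customers (customer_id TEXT, name TEXT);
    CREATE TABLE billing_accounts (customer_id TEXT, plan TEXT);
    INSERT INTO crm_customers VALUES ('c1', 'Acme'), ('c2', 'Globex');
    INSERT INTO billing_accounts VALUES ('c1', 'premium');

    -- dim_customer: one row per CRM customer. Billing is optional,
    -- so LEFT JOIN (business decision: trials have no billing account yet).
    CREATE TABLE dim_customer AS
    SELECT c.customer_id, c.name, COALESCE(b.plan, 'trial') AS plan
    FROM crm_customers c
    LEFT JOIN billing_accounts b ON b.customer_id = c.customer_id;
""")
rows = conn.execute(
    "SELECT customer_id, plan FROM dim_customer ORDER BY customer_id"
).fetchall()
```

&lt;p&gt;The comment is the documentation: six months later, the LEFT JOIN reads as a recorded business decision rather than an accident.&lt;/p&gt;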

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvccitw80iz2s6gsaih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskvccitw80iz2s6gsaih.png" alt="5 Layers deep" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Documentation Debt Crisis
&lt;/h2&gt;

&lt;p&gt;Add it up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business definitions for every domain term&lt;/li&gt;
&lt;li&gt;Precise logic for every metric and KPI&lt;/li&gt;
&lt;li&gt;Rules and exclusions from every business unit&lt;/li&gt;
&lt;li&gt;Technical decisions from every engineering team&lt;/li&gt;
&lt;li&gt;Transformation logic from every data pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the context an LLM needs to generate correct queries for real business questions.&lt;/p&gt;

&lt;p&gt;And almost none of it is documented in a way machines can understand.&lt;/p&gt;

&lt;p&gt;This documentation problem is the 80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is So Hard
&lt;/h2&gt;

&lt;p&gt;Documentation is manual work. Unglamorous, time-consuming, never-ending manual work.&lt;/p&gt;

&lt;p&gt;Business definitions change. A "premium customer" today might be redefined next quarter. Metrics evolve as the business grows. Rules get updated when regulations change. The data platform refactors tables and schemas.&lt;/p&gt;

&lt;p&gt;Static documentation becomes stale the moment it's written. You need a living system that evolves with the business.&lt;/p&gt;

&lt;p&gt;But who owns this? The business teams are focused on business problems. Engineering is shipping features. The data team is drowning in pipeline maintenance. Nobody has "document everything for future AI" in their job description.&lt;/p&gt;

&lt;p&gt;Even worse, this problem compounds. Every new business rule, every new data source, every new transformation adds to the documentation debt. The gap between what your AI needs to know and what's actually documented grows wider every sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Opportunity
&lt;/h2&gt;

&lt;p&gt;RAG retrieval algorithms, query optimization, result formatting - these are interesting technical problems. They make for great blog posts and conference talks.&lt;/p&gt;

&lt;p&gt;But they're the easy 20%.&lt;/p&gt;

&lt;p&gt;The companies that will win in AI-powered analytics aren't the ones with the smartest query generation. They're the ones who solve documentation.&lt;/p&gt;

&lt;p&gt;What if business context were captured automatically as a byproduct of work instead of as extra work? What if implementing a business rule automatically created the metadata an AI needs? What if refactoring a data pipeline updated the documentation simultaneously?&lt;/p&gt;

&lt;p&gt;What if documentation were a living system that breathes with your business instead of a static artifact that goes stale?&lt;/p&gt;

&lt;p&gt;That's the hard problem. That's the valuable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;You can't solve all five layers overnight. But you can start making documentation a first-class concern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For business teams:&lt;/strong&gt; Maintain a single source of truth for business definitions. Not in slides or wikis, but in a structured system that can be queried programmatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For product teams:&lt;/strong&gt; When you implement business logic, document it where the data team can find it. Make "how will this be analyzed?" part of the design conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering teams:&lt;/strong&gt; Treat data as a first-class output, not a byproduct. Document your implementation decisions. Make your data contracts explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For data teams:&lt;/strong&gt; Don't just transform data - document why. Capture the business logic embedded in your pipelines. Make your schemas self-describing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For everyone:&lt;/strong&gt; Stop treating documentation as a chore to avoid. It's the foundation that makes everything else possible.&lt;/p&gt;
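&lt;p&gt;To make the "queried programmatically" part concrete, here is a deliberately small sketch: a glossary whose entries (borrowed from examples earlier in this post, with an invented premium-customer threshold) can be surfaced into an LLM prompt. The matching is naive substring lookup, purely for illustration:&lt;/p&gt;

```python
# Hypothetical glossary entries; real ones would live in a governed store.
GLOSSARY = {
    "premium customer": "annual contract value above $50k AND active in the last 90 days",
    "peak hours": "retail: 10:00-14:00 and 17:00-20:00; warehouse: 07:00-11:00 and 15:00-19:00",
}

def context_for(question: str) -> str:
    """Return only the definitions a question mentions, ready to prepend to a prompt."""
    q = question.lower()
    hits = [f"{term} = {meaning}" for term, meaning in GLOSSARY.items() if term in q]
    return "\n".join(hits)
```

&lt;p&gt;Even this toy version changes the failure mode: the LLM either receives your organization's definition or visibly receives nothing, instead of silently guessing.&lt;/p&gt;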

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Better LLMs won't solve AI analytics. Better RAG won't solve it. Better query optimization won't solve it.&lt;/p&gt;

&lt;p&gt;The bottleneck isn't technical capability. It's business context.&lt;/p&gt;

&lt;p&gt;You can't assemble an answer from context you don't have. You can't generate the right query if you don't know what "premium customers" means in your organization.&lt;/p&gt;

&lt;p&gt;The 20% is getting commoditized fast. Every AI analytics vendor has roughly the same query generation capability.&lt;/p&gt;

&lt;p&gt;The real differentiator is the 80% - capturing, maintaining, and surfacing the business knowledge that turns database schemas into business intelligence.&lt;/p&gt;

&lt;p&gt;That's where the hard work is. That's where the value is.&lt;/p&gt;

&lt;p&gt;And that's the problem that's still unsolved.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI analytics tools focus on query generation (20%) while the real challenge is business context documentation (80%)&lt;/li&gt;
&lt;li&gt;Business knowledge exists across five layers: business definitions, metrics/KPIs, domain rules, technical implementation, and data transformations&lt;/li&gt;
&lt;li&gt;Each layer contains critical context that isn't captured in database schemas or code&lt;/li&gt;
&lt;li&gt;Static documentation fails because business context changes constantly&lt;/li&gt;
&lt;li&gt;The solution requires making documentation a byproduct of regular work, not extra work&lt;/li&gt;
&lt;li&gt;Whoever builds systems that capture and maintain this context automatically will win the AI analytics race&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>nlp</category>
      <category>sql</category>
    </item>
    <item>
      <title>Why Your Enterprise Data Platform Is No Longer Just for Analytics</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Tue, 18 Nov 2025 04:06:02 +0000</pubDate>
      <link>https://dev.to/gigapress/why-your-enterprise-data-platform-is-no-longer-just-for-analytics-1n1i</link>
      <guid>https://dev.to/gigapress/why-your-enterprise-data-platform-is-no-longer-just-for-analytics-1n1i</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The relationship between data and applications is undergoing a fundamental shift. For decades, we've moved data to applications. Now, we're moving applications to data. This isn't just an architectural preference—it's becoming a necessity as businesses demand richer context, faster insights, and real-time operations. Here's what's driving this change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context is king&lt;/strong&gt;: Connected data provides multidimensional insights that isolated data simply cannot match&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The old pattern is breaking&lt;/strong&gt;: Extracting data to specialized tools creates silos, brittleness, and duplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The line has blurred&lt;/strong&gt;: Enterprise Data Platforms are no longer just analytical systems—they're becoming operational platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three critical shifts&lt;/strong&gt;: Data latency must drop to seconds, query latency to sub-seconds, and availability must reach production-grade standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The solution space&lt;/strong&gt;: Event-driven architectures, operational databases like serverless Postgres, and treating your EDP as a P1 system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best of both worlds&lt;/strong&gt;: Leverage specialized analytics capabilities by bringing them to your data, not moving data to them—every data movement step adds failure points, costs, and complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Power of Connected Data: A Tale of Two Dashboards
&lt;/h2&gt;

&lt;p&gt;Consider how Uber thinks about a driver who just completed a ride. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without connected data&lt;/strong&gt;: "Driver #47291 completed an 18-minute ride. Rating: 5 stars."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With connected data&lt;/strong&gt;: "Driver #47291 completed an 18-minute ride during rush hour in San Francisco. Has a 4.92 rating over 2,847 trips, typically works evenings, now in a surge zone. The passenger is gold-status but gave 3 stars today (usually gives 5). Heavy rain—this driver's cancellation rate jumps 8% in rain."&lt;/p&gt;

&lt;p&gt;Same event, different universe. The first tells you what happened. The second tells you why, predicts what might happen next, and suggests what action to take. When you view information through multiple dimensions—user behavior, location patterns, time series, weather, operational metrics—you move from reporting to insight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Enterprise Data Platforms Became the Center of Gravity
&lt;/h2&gt;

&lt;p&gt;In large organizations, data from everywhere converges in a central Enterprise Data Platform: CRM systems, transaction data, product telemetry, marketing attribution, customer service interactions.&lt;/p&gt;

&lt;p&gt;This wasn't arbitrary. Connecting data is hard, and doing it repeatedly across different tools is wasteful. The EDP became the natural convergence point where data gets cleaned once, relationships between different sources get mapped, historical context accumulates, and governance gets enforced. &lt;/p&gt;

&lt;p&gt;When you need to understand customer lifetime value, you need purchase history, support interactions, usage patterns, and marketing touchpoints. These don't naturally live together—they get connected in the EDP. This made it perfect for contextual analytics: not just because data lives there, but because the relationships and ability to view information from multiple angles exist in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Playbook: Extract, Load, Specialize
&lt;/h2&gt;

&lt;p&gt;For years, the workflow was straightforward. When teams wanted to improve customer experience, run marketing campaigns, or optimize products, they'd: identify needed data from the EDP, procure a specialized tool (Qualtrics for customer experience, Segment for customer data, Hightouch for reverse ETL), build pipelines to extract and load data, then let the specialized tool work its magic.&lt;/p&gt;

&lt;p&gt;Marketing got Braze. Customer success got Gainsight. Product got Amplitude. Each loaded with curated enterprise data. &lt;/p&gt;

&lt;p&gt;This made sense—these platforms had years of domain expertise and optimized databases for specific use cases. But cracks started showing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Paradox: More is Better, Until It Isn't
&lt;/h2&gt;

&lt;p&gt;Every specialized tool works better with richer data. Your NPS scores don't just tell you satisfaction dropped—you want to know it dropped specifically among enterprise customers with multiple open support tickets who are coming up for renewal.&lt;/p&gt;

&lt;p&gt;In theory, you just send more data. In practice, this creates three problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, you're duplicating your entire dataset&lt;/strong&gt; across multiple tools. Your customer data lives in the EDP, in marketing, in customer success, in product analytics. Each copy needs syncing. Each represents another data quality surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, you're creating brittle pipelines.&lt;/strong&gt; Different data models, different APIs, different limitations. Each pipeline is a failure point needing maintenance as schemas evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, you're siloing insights.&lt;/strong&gt; Marketing sees one version of the customer, product sees another, support a third. The connected data you built in the EDP gets disconnected as it flows into specialized tools.&lt;/p&gt;

&lt;p&gt;This becomes an anti-pattern—working against what made the EDP valuable: keeping data connected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Inversion: Bring Applications to the Data
&lt;/h2&gt;

&lt;p&gt;If moving data to applications creates these problems, what if we inverted the pattern? Instead of extracting data from the EDP to specialized tools, build those capabilities on top of the EDP itself.&lt;/p&gt;

&lt;p&gt;When the most contextual, connected data already lives in your Enterprise Data Platform, why ship it elsewhere? Why not build your customer experience dashboards, your marketing segmentation engines, your operational applications directly on the EDP?&lt;/p&gt;

&lt;p&gt;This is where most people raise an objection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, Isn't That an Anti-Pattern?
&lt;/h2&gt;

&lt;p&gt;For decades, we've been taught that analytical systems and operational systems are fundamentally different. Analytics platforms—data warehouses, lakes, EDPs—handle complex queries over large datasets, optimized for throughput. Operational systems—transactional databases—handle fast queries on specific records, optimized for latency.&lt;/p&gt;

&lt;p&gt;You wouldn't run e-commerce checkout on a data warehouse. You wouldn't build real-time fraud detection on overnight batch jobs.&lt;/p&gt;

&lt;p&gt;But here's what changed: the line between analytical and operational has blurred dramatically over the past five years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Line Has Blurred
&lt;/h2&gt;

&lt;p&gt;Applications have become analytics-hungry. A decade ago, an operational application might look up a customer record. Today, that same application needs to compute lifetime value in real-time, analyze 90 days of behavior, compare against historical patterns, and aggregate data across multiple dimensions.&lt;/p&gt;

&lt;p&gt;Meanwhile, data freshness requirements have compressed. Marketing campaigns that used to refresh daily now need hourly or minute-level updates. Customer health scores calculated overnight now need to reflect recent interactions within minutes.&lt;/p&gt;

&lt;p&gt;And context requirements have exploded. It's no longer enough to know what a customer bought—you need what they viewed but didn't buy, what promotions they've seen, what support issues they've had, and what predictive models say about their churn likelihood.&lt;/p&gt;

&lt;p&gt;This creates a new reality: operational applications need the rich, connected context of the EDP, but with operational characteristics—low latency, high availability, and fresh data.&lt;/p&gt;

&lt;p&gt;EDPs can no longer be P2 or P3 systems that indirectly support the business. They're becoming P1 systems that power the business directly, in real time, at the point of customer interaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Critical Shifts
&lt;/h2&gt;

&lt;p&gt;For EDPs to power operational applications, three characteristics must change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Latency: From Hours to Seconds
&lt;/h3&gt;

&lt;p&gt;Traditional data pipelines moved data in batches—often daily, sometimes hourly, occasionally every 15 minutes if you were pushing it. This worked fine when insights were consumed the next morning in dashboard reviews.&lt;/p&gt;

&lt;p&gt;It doesn't work when you're trying to trigger a marketing campaign based on a customer's action taken 30 seconds ago. It doesn't work when flagging potentially fraudulent transactions while they're still pending. It doesn't work when customer service needs to see what happened during the call that just ended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Event-driven architecture, end to end. This isn't just about having a message queue somewhere. It means rethinking how data flows through your entire enterprise. When a customer completes a purchase, that event should propagate through your systems in seconds, not hours.&lt;/p&gt;

&lt;p&gt;This is the architecture that makes an enterprise truly data-driven—not from yesterday's data, but from what's happening right now. Technologies like Kafka, Debezium for change data capture, and streaming platforms become foundational, not optional.&lt;/p&gt;
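&lt;p&gt;The shape of event-driven propagation can be shown without any infrastructure. The bus below is an in-process stand-in for a platform like Kafka (the topic name and handlers are invented); the behavior to notice is that subscribers see the event at publish time, not at the next batch window:&lt;/p&gt;

```python
from collections import defaultdict

class EventBus:
    """In-process stand-in for a streaming platform such as Kafka."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fan out immediately: every downstream consumer reacts now,
        # not when tonight's batch job runs.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
campaign_queue, fraud_queue = [], []
bus.subscribe("orders.completed", campaign_queue.append)  # marketing trigger
bus.subscribe("orders.completed", fraud_queue.append)     # fraud check
bus.publish("orders.completed", {"order_id": 1, "total": 99.0})
```

&lt;p&gt;Real systems add durability, ordering, and replay on top of this shape; that is what Kafka and change-data-capture tools like Debezium provide.&lt;/p&gt;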

&lt;h3&gt;
  
  
  2. Query Latency: From Seconds to Sub-Seconds
&lt;/h3&gt;

&lt;p&gt;Users won't wait three seconds for a dashboard to load. They definitely won't wait 30 seconds for a page to render. Applications need to respond in hundreds of milliseconds, not seconds.&lt;/p&gt;

&lt;p&gt;But here's the fundamental issue: modern data warehouses and lakes are built on storage-compute separation. This isn't a bug—it's an intentional design choice that provides enormous benefits for analytical workloads. You can scale storage and compute independently. You can spin up compute when needed and shut it down when you don't.&lt;/p&gt;

&lt;p&gt;However, this separation introduces a first-principles problem: when you run a query, data needs to move from remote storage to compute nodes. Even with optimized formats like Parquet, even with clever caching—data still needs to travel. For analytical queries over large datasets, a few seconds is acceptable. For operational APIs, it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for operational workloads&lt;/strong&gt;: Operational applications don't make single queries. They chain hundreds of API calls. A single page load might trigger dozens of queries. Real-time business decisions—approve this transaction, show this offer, flag this behavior—can't wait for data to move from storage to compute. They need millisecond responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Relational databases, where compute and storage live together. This is where solutions like Neon and serverless Postgres come into play.&lt;/p&gt;

&lt;p&gt;The pattern: Keep your rich, historical, connected data in the EDP where it belongs—that's still the system of record. But sync the operational subset—the data that needs to power real-time applications—into a relational database optimized for low-latency queries.&lt;/p&gt;

&lt;p&gt;This operational database becomes your fast access layer, holding the most frequently accessed data: current customer states, recent transactions, active orders. Everything else—full history, rarely accessed dimensions, large analytical datasets—stays in the EDP and is linked when needed.&lt;/p&gt;

&lt;p&gt;Why relational databases? When compute and storage are together, query latency drops dramatically. No network hop to fetch data. Indexes live next to the data. Query planners optimize on actual data locality.&lt;/p&gt;

&lt;p&gt;Why serverless Postgres? It solves the operational challenges that traditionally made databases hard to scale—automatic scaling, no provisioning for peak load—while maintaining the low-latency benefits of the relational model.&lt;/p&gt;
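&lt;p&gt;The sync itself can be sketched in a few lines. Both databases below are in-memory SQLite stand-ins (for the EDP and for the Postgres operational layer, respectively), and the table shapes are invented; the pattern is "latest state per entity goes to the fast layer, full history stays put":&lt;/p&gt;

```python
import sqlite3

edp = sqlite3.connect(":memory:")  # stand-in for the warehouse/EDP (system of record)
ops = sqlite3.connect(":memory:")  # stand-in for the serverless Postgres layer

edp.executescript("""
    CREATE TABLE customer_history (customer_id TEXT, status TEXT, updated_at TEXT);
    INSERT INTO customer_history VALUES
      ('c1', 'active',     '2025-11-01'),
      ('c1', 'churn_risk', '2025-11-15'),
      ('c2', 'active',     '2025-10-02');
""")
ops.execute("CREATE TABLE customer_current (customer_id TEXT PRIMARY KEY, status TEXT)")

# Sync only the operational subset: the latest state per customer.
# The full history never leaves the EDP.
latest = edp.execute("""
    SELECT h.customer_id, h.status
    FROM customer_history h
    JOIN (SELECT customer_id, MAX(updated_at) AS mx
          FROM customer_history GROUP BY customer_id) m
      ON m.customer_id = h.customer_id AND m.mx = h.updated_at
""").fetchall()
ops.executemany("INSERT OR REPLACE INTO customer_current VALUES (?, ?)", latest)

current = dict(ops.execute("SELECT customer_id, status FROM customer_current").fetchall())
```

&lt;p&gt;In production this copy step is what change data capture or a streaming pipeline automates; the query shape stays the same.&lt;/p&gt;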

&lt;h3&gt;
  
  
  3. High Availability: From "It's Down" to "It's Always On"
&lt;/h3&gt;

&lt;p&gt;When your data platform is used for monthly reports and strategic planning, a few hours of downtime is annoying but not catastrophic. When your data platform powers customer-facing applications, every minute of downtime directly impacts revenue.&lt;/p&gt;

&lt;p&gt;This means treating your EDP—or at least the operational layer sitting on top of it—with the same availability standards you'd apply to any production application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Active-active configurations, multi-region deployments, automatic failover. At minimum, the operational database layer needs production-grade infrastructure.&lt;/p&gt;

&lt;p&gt;This shift is cultural as much as technical. It means your data team needs to adopt DevOps practices. It means SLAs matter. It means on-call rotations become part of data platform management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;None of these ideas are entirely new. People have been talking about operational analytics for years. So why is this pattern becoming critical now?&lt;/p&gt;

&lt;p&gt;Several trends have converged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost of computation has dropped dramatically.&lt;/strong&gt; What was prohibitively expensive five years ago—maintaining real-time data pipelines, running operational databases on large datasets—is now economically feasible. Serverless architectures have made it even more accessible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive pressure has increased.&lt;/strong&gt; Customers expect personalization, immediate responses, and consistency across channels. Companies that can deliver these experiences with richer context have a meaningful advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technology has matured.&lt;/strong&gt; Event streaming platforms are production-ready. Change data capture tools reliably sync databases. Serverless databases handle operational workloads without traditional overhead. The pieces needed to build this architecture actually work now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data teams have the skills.&lt;/strong&gt; A generation of data engineers who grew up building real-time pipelines and thinking about data as something that flows rather than sits has moved into leadership positions. The organizational knowledge exists to execute this pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what this looks like in practice:&lt;/p&gt;

&lt;p&gt;Your Enterprise Data Platform remains the system of record—the place where data is cleaned, connected, and stored historically. Data flows into it through event-driven pipelines that capture changes as they happen, not in overnight batches.&lt;/p&gt;

&lt;p&gt;On top of the EDP, an operational layer provides fast, consistent access to the subset of data needed for real-time applications. This might be a serverless Postgres instance that's automatically synced with your data platform, maintaining operational data with sub-second query latency.&lt;/p&gt;

&lt;p&gt;Applications—whether internal tools, customer-facing features, or analytical dashboards—query the operational layer directly. They get the rich context of the EDP with the performance characteristics of an operational database.&lt;/p&gt;

&lt;p&gt;The operational layer is treated as a P1 system: multi-region if needed, highly available, monitored like any production service.&lt;/p&gt;

&lt;p&gt;Data flows through this architecture in near real-time. An event happens in a source system, gets captured and streamed to the EDP, triggers processing and transformation, and updates the operational layer—all within seconds or minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables
&lt;/h2&gt;

&lt;p&gt;When you build applications on top of your connected data rather than extracting subsets to specialized tools, several things become possible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richer insights.&lt;/strong&gt; You're not limited to the subset of data you could feasibly extract and load. Your application has access to the full context of the EDP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster iteration.&lt;/strong&gt; Adding a new dimension to your analysis doesn't require building a new pipeline and waiting for data to load. It's already there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced duplication.&lt;/strong&gt; Data lives in fewer places. Updates happen in one location. Data quality issues are fixed once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better cross-functional work.&lt;/strong&gt; When everyone is building on the same data foundation, insights are easier to share. Marketing and product aren't looking at different versions of customer behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower operational overhead.&lt;/strong&gt; Fewer pipelines to maintain, fewer data synchronization issues to debug, fewer copies of data to govern and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-offs
&lt;/h2&gt;

&lt;p&gt;You're trading specialized tools' optimizations for platform flexibility. You need teams capable of building applications and organizational buy-in to treat data platforms as production infrastructure. But for many organizations, the benefits—flexibility, reduced duplication, faster iteration, contextual insights—justify the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  But What About Specialized Capabilities?
&lt;/h2&gt;

&lt;p&gt;Here's a legitimate question: what about all those cutting-edge features that specialized platforms offer? Qualtrics has StatsIQ and TextIQ—sophisticated analytics capabilities built over years. Segment has identity resolution algorithms refined across thousands of companies.&lt;/p&gt;

&lt;p&gt;If we're building on our EDP instead of using these tools, are we throwing away innovation? Are we asking data teams to rebuild complex models from scratch?&lt;/p&gt;

&lt;p&gt;Not necessarily. The key insight: you don't need to move data to leverage specialized capabilities. Bring those capabilities to where data lives, or let them operate on your data in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Emerging Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;First, bring capabilities to the EDP.&lt;/strong&gt; This is already happening. Many specialized analytics capabilities are becoming available as standalone services or libraries that operate directly on data platforms. Modern EDPs support user-defined functions, external ML service calls, and integration with specialized processing engines. You can invoke sentiment analysis APIs on text stored in your EDP. You can run statistical models using libraries that operate directly on your warehouse tables.&lt;/p&gt;
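&lt;p&gt;User-defined functions are the simplest version of "bring the capability to the data." A toy sketch with SQLite's &lt;code&gt;create_function&lt;/code&gt;: the word-counting "model" below is a placeholder for a real scoring service, and the table is invented:&lt;/p&gt;

```python
import sqlite3

def sentiment(text: str) -> int:
    """Toy placeholder for an external sentiment model, invoked on data in place."""
    positive = {"great", "love", "fast"}
    negative = {"slow", "broken", "refund"}
    words = [w.strip(",.!") for w in text.lower().split()]
    return sum(w in positive for w in words) - sum(w in negative for w in words)

conn = sqlite3.connect(":memory:")
conn.create_function("sentiment", 1, sentiment)  # register the UDF with the engine
conn.execute("CREATE TABLE feedback (comment TEXT)")
conn.executemany("INSERT INTO feedback VALUES (?)", [
    ("great product, fast shipping",),
    ("app is broken, want a refund",),
])
scores = [row[0] for row in conn.execute("SELECT sentiment(comment) FROM feedback")]
```

&lt;p&gt;The data never moves; the function comes to it. The warehouse-scale versions of this idea are external and remote functions.&lt;/p&gt;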

&lt;p&gt;&lt;strong&gt;Second, let specialized tools operate in place.&lt;/strong&gt; Instead of extracting data into Qualtrics, imagine Qualtrics connecting directly to your EDP and running its StatsIQ algorithms on your data where it sits. This "compute on data in place" trend is accelerating—it's the core idea behind data clean rooms, query federation, and interoperability standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Operating In Place Wins
&lt;/h3&gt;

&lt;p&gt;Every time you add a step to move data, you introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failure points&lt;/strong&gt;: Another pipeline that can break, another synchronization that can fall out of date&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs&lt;/strong&gt;: Data egress charges, storage duplication, compute for transformation and loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time to extract, transform, and load before insights are available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Another system to monitor, another set of credentials to manage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: More copies of sensitive data, more surfaces for security issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most successful solutions operate on existing data in place. Think about dbt's success—it transforms data where it sits. Or how BI tools evolved from requiring extracts to connecting directly to warehouses. The winning pattern is always "work with data in place."&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Requires
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From vendors&lt;/strong&gt;: APIs that operate on external data, federated query engines, embedded analytics libraries. Some will resist—their business models depend on data lock-in. But those that embrace this will win in a world where enterprises are consolidating their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From data platforms&lt;/strong&gt;: Rich API layers, fine-grained access control, performance for external queries, support for specialized compute through user-defined functions and external procedures. Features like Snowflake's external functions, Databricks' ML capabilities, and BigQuery's remote functions are steps in this direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Emerging Architecture
&lt;/h3&gt;

&lt;p&gt;Your Enterprise Data Platform holds your connected, contextual data. Your operational layer provides fast access for real-time applications. And specialized analytics capabilities—whether built in-house or licensed from vendors—operate on this data without requiring it to be moved.&lt;/p&gt;

&lt;p&gt;You get the rich context and operational efficiency of centralized, connected data. And you get the specialized capabilities of best-in-class tools. Without the brittleness, cost, and complexity of moving data between systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Forward
&lt;/h2&gt;

&lt;p&gt;The shift from "move data to applications" to "move applications to data" reflects how central data has become. The line between operational and analytical systems has blurred.&lt;/p&gt;

&lt;p&gt;Organizations adapting to this—event-driven architectures, operational databases near data platforms, treating EDPs as P1 systems—will act on richer context, respond faster, and deliver better experiences. Those maintaining old extraction patterns will fight complex pipelines and synchronization issues.&lt;/p&gt;

&lt;p&gt;The technology exists. The question is organizational readiness.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The future of enterprise data isn't choosing between analytical power and operational performance. It's architectures delivering both.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>True future of Databricks Lakebase</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Thu, 19 Jun 2025 03:17:10 +0000</pubDate>
      <link>https://dev.to/gigapress/true-future-of-databricks-lakebase-13cl</link>
      <guid>https://dev.to/gigapress/true-future-of-databricks-lakebase-13cl</guid>
      <description>&lt;p&gt;When Databricks announced Lakebase, most people dismissed it as just another product. Even Databricks markets it as "the backend for AI Agents and Data Apps."&lt;/p&gt;

&lt;p&gt;This messaging puzzles me. It's the same pattern they followed with Delta Lake in 2019, positioning it as "bringing reliability to Big Data." But Delta Lake was actually much simpler: a transactional layer on top of an immutable object store. That's it. Yet this simple concept solved a massive architectural challenge—enabling storage and compute separation for data warehouses at scale.&lt;/p&gt;

&lt;p&gt;Storage and compute separation became the foundation for everything that followed: data sharing, unified storage formats, multiple query engines. But separation creates a problem: latency. For analytical workloads, this latency is manageable. For transactional and operational analytics? It's a dealbreaker.&lt;/p&gt;

&lt;p&gt;Here's why this matters: the traditional divide between applications and analytics is disappearing. As businesses demand data-driven decisions in real-time operations, analytical platforms are being pulled into the critical path of business processes. Data platforms are no longer backend systems—they're front and center.&lt;br&gt;
This shift demands sub-second response times that storage and compute separation simply can't deliver. You need a fused engine. You need a database.&lt;/p&gt;

&lt;p&gt;That's what Lakebase really is.&lt;/p&gt;

&lt;p&gt;But consider the bigger picture. In most enterprises, data follows a predictable journey: applications generate data, it's ingested into a central data lake, curated, and used for analytics. Then those insights flow back to business applications for decision-making.&lt;br&gt;
What if you could collapse this entire cycle? What if applications and analytics ran on the same backend?&lt;/p&gt;

&lt;p&gt;Object stores and Delta Lake can't solve this. You need a true database for applications. But the heavy lifting of moving data between systems can be simplified—what Databricks calls "zero-ETL." It's not actually zero-ETL; someone still runs the ETL. Databricks just handles it for you.&lt;/p&gt;

&lt;p&gt;This positions Databricks as something entirely different. They're no longer just an analytics company. They're becoming the AWS of enterprise data—where your applications run on Databricks, your analytics run on Databricks, and low-code solutions make it accessible to business users.&lt;/p&gt;

&lt;p&gt;The data producers become the ones running end-to-end pipelines and analytics.&lt;/p&gt;

&lt;p&gt;But this approach creates its own set of challenges. Let's explore those next.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>lakebase</category>
      <category>olap</category>
      <category>oltp</category>
    </item>
    <item>
      <title>What are Object Stores - simplified</title>
      <dc:creator>Cooper D</dc:creator>
      <pubDate>Sat, 13 Jan 2024 22:40:03 +0000</pubDate>
      <link>https://dev.to/gigapress/what-are-object-stores-simplified-1ge3</link>
      <guid>https://dev.to/gigapress/what-are-object-stores-simplified-1ge3</guid>
      <description>&lt;p&gt;[Originally published in medium]&lt;/p&gt;

&lt;p&gt;We all know that object stores are the backbone of the internet in the cloud era, and we also know they have certain behavioral characteristics: scalability, immutability, eventual consistency.&lt;/p&gt;

&lt;p&gt;But do we know why object stores behave this way? What are they, actually? Let’s find out.&lt;/p&gt;

&lt;p&gt;Since I started using AWS S3 ten years ago, I’ve been amazed by everything that can be done with it, and by all the incidents that showed how dependent the internet is on S3. The other thing that made me curious: there is no official architecture documentation for AWS S3. It’s kind of a “secret”, I guess.&lt;/p&gt;

&lt;p&gt;First, let’s find out what an Object Store is. Per Wikipedia, below is the definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Object storage (also known as object-based storage[1]) is a computer data storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To understand that definition better, we first need to understand what Block Storage is. I’m not going to pull the definition from Wikipedia for this one; let me explain it in simple English.&lt;/p&gt;

&lt;p&gt;What is Block Storage?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Block Storage is the fundamental storage mechanism of computer systems, where data is stored in “blocks”. The key here is that a central system/module keeps track of which data is stored in which specific block.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhqlrh2uphhs38qenb75.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhqlrh2uphhs38qenb75.jpeg" alt="Image description" width="168" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What that means is: when a file is stored, the filesystem breaks it into blocks and stores the data across multiple blocks. Only the filesystem knows which part of the file is stored in which block, so any change to the file has to go through the filesystem. All of this information about the blocks is the “metadata”, and it is centrally stored and managed by the filesystem.&lt;/p&gt;
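
&lt;p&gt;To make this concrete, here is a toy Python sketch (my own oversimplification, not how any real filesystem is implemented) of that idea: a central metadata table records which blocks hold each file’s bytes, and every read or write has to consult it:&lt;/p&gt;

```python
# Toy sketch of block storage (hypothetical, heavily simplified): a central
# metadata table maps each file to the blocks that hold its bytes.
BLOCK_SIZE = 4  # real filesystems use e.g. 4 KiB blocks

class ToyFilesystem:
    def __init__(self):
        self.blocks = {}    # block_id -> bytes stored in that block
        self.metadata = {}  # filename -> [block_id, ...] (the central metadata)
        self._next_id = 0

    def write(self, name: str, data: bytes) -> None:
        ids = []
        for i in range(0, len(data), BLOCK_SIZE):   # split the file into blocks
            self.blocks[self._next_id] = data[i:i + BLOCK_SIZE]
            ids.append(self._next_id)
            self._next_id += 1
        self.metadata[name] = ids  # only the filesystem knows this mapping

    def read(self, name: str) -> bytes:
        # reassemble the file by following the metadata
        return b"".join(self.blocks[b] for b in self.metadata[name])

fs = ToyFilesystem()
fs.write("report.txt", b"hello world!")
print(fs.metadata["report.txt"])  # [0, 1, 2]: which blocks hold the file
print(fs.read("report.txt"))      # b'hello world!'
```

&lt;p&gt;Notice that every operation goes through &lt;code&gt;self.metadata&lt;/code&gt;; that single table is both what makes the answers precise and where the bottleneck lives.&lt;/p&gt;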

&lt;p&gt;For the sake of simplicity, I’m not going to differentiate File storage from Block storage. They are related to each other, but both are fundamentally different from Object storage.&lt;/p&gt;

&lt;p&gt;To understand it a little better, let’s try an analogy.&lt;/p&gt;

&lt;p&gt;Note: some of the operations and mechanics of databases and object storage have been oversimplified for easier understanding.&lt;/p&gt;

&lt;p&gt;Block Storage Analogy: storage units based on item type, with one or more store keepers and no direct access to storage details.&lt;br&gt;
In this hypothetical analogy, imagine multiple storage units, like a book storage unit, a furniture storage unit, an electronics storage unit, etc., where people store their items. You bring your items in a box (labelled with your name and address) and hand it over to the store keeper at the appropriate storage unit.&lt;/p&gt;

&lt;p&gt;The store keepers only understand and handle the items their unit is designed for: the store keeper at the book storage unit can store books and nothing else. Same with the store keeper at the electronics storage unit, who can only store electronics.&lt;/p&gt;

&lt;p&gt;How and where exactly each of these items is stored inside the storage unit is a black box to you. That information is known only to the store keeper and is kept in a place accessible only by the store keeper. The store keeper needs to know what the contents of the box are. He/she records the details of each customer (from the box) and each book, along with where exactly inside the storage unit each of those books is stored. Let’s call this data the “&lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cppdkblc38kij4m937j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5cppdkblc38kij4m937j.jpeg" alt="Credits — Angelascottauthor" width="168" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the equivalents of the hypothetical &lt;em&gt;&lt;strong&gt;box, book, and page&lt;/strong&gt;&lt;/em&gt; in real-world data storage are &lt;em&gt;&lt;strong&gt;file/data, record, and block&lt;/strong&gt;&lt;/em&gt; respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm479zerwcnmv83r0433.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm479zerwcnmv83r0433.png" alt="Equivalents of Real World entities to those in the Hypothetical Analogy" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an example of what &lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt; might look like. This is the data about where each customer’s books are stored; it’s nothing but the &lt;strong&gt;metadata&lt;/strong&gt; of the books. In the real world, it’s the &lt;strong&gt;metadata&lt;/strong&gt; of the file/data stored in the filesystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1y5r267655dw8im0zgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1y5r267655dw8im0zgx.png" alt="Image description" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios — add or retrieve books.&lt;/strong&gt;&lt;br&gt;
If you have to store one or more additional books later, you give the new books to the store keeper, who notes the details in the &lt;strong&gt;&lt;em&gt;book-storage-details&lt;/em&gt;&lt;/strong&gt; and then stores them on the shelves.&lt;/p&gt;

&lt;p&gt;If you have to retrieve one or more books from storage, you give the details of the book(s) you need, and the store keeper retrieves them for you. When you bring them back, the store keeper follows the same process of noting all the metadata and storing them on one of the shelves.&lt;/p&gt;

&lt;p&gt;In the real world, this is how you interact with the filesystem when you want to change a saved file. You can retrieve one or more records, make changes at the record level, and save them back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — consistency&lt;/strong&gt;. As every storage/retrieval goes through the store keeper, you can look at the metadata (book-storage-details) at any point in time and tell exactly what books are stored, by whom, and where. If an entry exists in the metadata, the book is stored; if there is no entry, it is not. Simple! The answers to questions like &lt;strong&gt;&lt;em&gt;“is the book stored?”&lt;/em&gt;&lt;/strong&gt;, &lt;em&gt;&lt;strong&gt;“how many total books are stored?”&lt;/strong&gt;&lt;/em&gt;, and &lt;strong&gt;&lt;em&gt;“list all books stored”&lt;/em&gt;&lt;/strong&gt; are all precise and definite at a given point in time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — efficiency&lt;/strong&gt;. As the exchanges happen at the record level, only the records you need are retrieved or written. This is the most efficient way to operate on data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantage — scalability&lt;/strong&gt;. If 100 customers want to store or retrieve their books at the same time, they have to wait while the store keeper processes them one after the other. Maybe multiple store keepers sharing the same &lt;strong&gt;book-storage-details&lt;/strong&gt; can work faster and handle 10 customers at a time, but eventually there is a limit to the scalability, and the process slows down because every single storage/retrieval (down to every single book) has to be recorded in the metadata store (book-storage-details).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key here is that the store keeper &lt;em&gt;&lt;strong&gt;needs to know&lt;/strong&gt;&lt;/em&gt; &lt;em&gt;&lt;strong&gt;“what”&lt;/strong&gt;&lt;/em&gt; you store. That is where the bottleneck is. They need to know &lt;strong&gt;&lt;em&gt;“what”&lt;/em&gt;&lt;/strong&gt; is inside the box, every single time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw8kapx2is0m9bin4qxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw8kapx2is0m9bin4qxo.png" alt="Image of a screened bag by TSA. Credits - insider.com" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how most databases work in the real world. There is a central metadata store that keeps track of every record stored and its details (size, location, block details, etc.). Every exchange with the database is “transactional” and “consistent”. All exchanges have to go through the centrally stored metadata, and that’s where the potential bottleneck is. Databases are precise and consistent, but they do not scale linearly with load. Once the number of writes reaches a limit, the database does not scale.&lt;/p&gt;

&lt;p&gt;Now let’s look at how Object stores work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Object storage?&lt;/strong&gt;&lt;br&gt;
Simply put: Object storage is a storage mechanism designed to trade off the consistency of block storage in exchange for infinite scalability.&lt;/p&gt;

&lt;p&gt;What this means is that object storage is designed from the ground up to be infinitely scalable (theoretically).&lt;/p&gt;

&lt;p&gt;Now, why do we need infinite scalability? That is what the &lt;strong&gt;Volume&lt;/strong&gt; and &lt;strong&gt;Velocity&lt;/strong&gt; of &lt;strong&gt;Big Data&lt;/strong&gt; demand. The world generates data too fast, and in quantities too huge, for traditional databases to process. We could never keep up, or even catch up on the backlog, and store all of the data produced without potentially losing some of it.&lt;/p&gt;

&lt;p&gt;Let’s look at the same &lt;strong&gt;Book storage unit&lt;/strong&gt; analogy to understand how it would play out in case of Object storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage Analogy: box storage with one/more store keepers and direct access to storage details.&lt;/strong&gt;&lt;br&gt;
The key words here are “box storage”, “infinite store keepers” and “direct access”. Imagine the same hypothetical storage unit we discussed earlier, but this time we deal with &lt;strong&gt;boxes&lt;/strong&gt; and no longer with &lt;strong&gt;books&lt;/strong&gt;. You can store anything in the box, and it doesn’t matter. The store keepers do &lt;strong&gt;NOT&lt;/strong&gt; know or care what you have inside the box. This removes a lot of their work: knowing what’s inside and recording everything every time. So practically, the same number of store keepers can do more work, which is just moving boxes in and out of the storage unit.&lt;/p&gt;

&lt;p&gt;What is this going to change? Well, it changes the way you interact with the storage unit a lot. First, you can store anything inside the box, so only one type of storage unit is needed. Second, you can store/retrieve boxes much faster, no matter how many customers are trying to store/retrieve boxes at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uzac0xzp3k11mnxkdmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uzac0xzp3k11mnxkdmd.png" alt="Credits — Pinterest" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this new model, you bring your contents in a &lt;strong&gt;box&lt;/strong&gt; (labelled appropriately, as earlier) and hand the box over to the store keeper. The store keeper assigns a vacant aisle to you, names that aisle after you, and stores the box in that aisle. The store keeper also records information about the box (aisle details, name, weight, size, time of storage, etc.) in the &lt;strong&gt;“box-storage-details”&lt;/strong&gt;. But this time, the information is no longer centrally managed or stored; it is stored along with your box. One copy is attached to the box so that anyone can find the details by looking at it, one more copy is kept at the front desk, and you also get to see the information in &lt;strong&gt;“box-storage-details”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This helps you because, the next time you have to interact with your box (add/remove contents), you can hand the details to the store keeper and he/she can help you with your box.&lt;/p&gt;

&lt;p&gt;With this new model, you always interact with the store keeper in terms of a &lt;strong&gt;“box”&lt;/strong&gt; and no longer in terms of individual &lt;strong&gt;items&lt;/strong&gt;. The store keeper no longer has any details about the books/toys/electronics, or anything else, inside the box. This removes the requirement of keeping track of &lt;strong&gt;what is inside&lt;/strong&gt; the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios — add or retrieve items.&lt;/strong&gt;&lt;br&gt;
If you have to add items to the box or retrieve items from it, you have to retrieve the &lt;strong&gt;entire box&lt;/strong&gt;, make whatever changes you need (add or remove books/electronics/toys), and put the &lt;strong&gt;box&lt;/strong&gt; back in the aisle.&lt;/p&gt;

&lt;p&gt;Magically, the process of adding or retrieving content is &lt;strong&gt;simplified&lt;/strong&gt;. Any storekeeper can help you, because you carry all the details required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — scalability&lt;/strong&gt;. Now that a store keeper’s job is simply to store or retrieve a box, and the details about where to find your box are provided at the time of the interaction, the process becomes linearly scalable by simply adding more store keepers as the number of concurrent customers increases. There is no longer a central metadata store, or central &lt;strong&gt;“box-storage-details”&lt;/strong&gt;, to maintain.&lt;/p&gt;

&lt;p&gt;Periodically, maybe, all of the storekeepers consolidate their storage details to come up with total stats for the entire storage unit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantage — consistency&lt;/strong&gt;. How does this affect consistency? That is the key question. Now that the information about which boxes are stored, and where, is held by multiple store keepers in a distributed fashion, you cannot get a consistent answer to questions like &lt;strong&gt;&lt;em&gt;“is the box stored?”&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;“how many total boxes are stored?”&lt;/em&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;em&gt;“list all boxes weighing more than 5 lbs”&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every time you ask “is box #25 from customer 1001 stored?”, the storekeepers have to consolidate their records to arrive at an answer. And while they are consolidating, customer 1001 might add or retrieve box #25, and that information is not recorded yet. So the numbers are eventually consistent: if you give the storekeepers enough time to consolidate their records after box #25 from customer 1001 is stored, you will get a consistent answer.&lt;/p&gt;

&lt;p&gt;That is “Eventual Consistency”. And that is NOT a bug; it’s by design.&lt;/p&gt;
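
&lt;p&gt;A tiny Python sketch (hypothetical and vastly simplified, not any real store’s replication protocol) shows the same effect: a write lands on one replica first, and a read served by another replica is stale until the replicas consolidate:&lt;/p&gt;

```python
# Toy sketch of eventual consistency: a write is acknowledged by one
# replica and only reaches the others when they "consolidate", like the
# storekeepers comparing notes.
class EventuallyConsistentStore:
    def __init__(self, n_replicas: int = 3):
        self.replicas = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        self.replicas[0][key] = value           # write lands on one replica

    def get(self, key, replica: int = 1):
        return self.replicas[replica].get(key)  # may serve a stale view

    def consolidate(self):
        for r in self.replicas[1:]:             # background replication
            r.update(self.replicas[0])

store = EventuallyConsistentStore()
store.put("box#25", "customer 1001")
print(store.get("box#25"))   # None: the write has not propagated yet
store.consolidate()
print(store.get("box#25"))   # 'customer 1001': eventually consistent
```

&lt;p&gt;Give the replicas enough time to consolidate and every read agrees; ask too early and you get yesterday’s answer. That is the trade, by design.&lt;/p&gt;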

&lt;p&gt;&lt;strong&gt;Disadvantage — efficiency&lt;/strong&gt;. Now that all interactions are in terms of “boxes”, and no longer in terms of individual “items” or “contents”, even if you only have to add or remove one item, you have to retrieve the entire box. This is not very efficient, but with the processing speeds of the cloud, the virtually unlimited availability of compute power, and fast network transfers, the inefficiency makes only a small dent in the overall big picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage — scalability&lt;/strong&gt;. This is the big one. With no need to record details about the contents of the boxes, the speed at which storekeepers can store/retrieve boxes increases dramatically. This opens the door to potentially infinite scalability: the answer to our question, “How can we keep up with the speed at which data is being generated?”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The key here is that the store keeper &lt;strong&gt;does not know “what”&lt;/strong&gt; you store. That is how the bottleneck is eliminated. They only need to know &lt;strong&gt;“about”&lt;/strong&gt; the box, every single time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is how object stores work, and this is why object stores can scale infinitely and store anything: flat files, songs, pictures, movies, etc.&lt;/p&gt;

&lt;p&gt;Now that we’ve solved the need for scalability, how can this be implemented in the shared infrastructure of the cloud, where providers have to host data from customers all over the world? How can they make sure people do not overwrite content created by others?&lt;/p&gt;

&lt;p&gt;Did you know that the bucket name you pick for storing your data in AWS S3 has to be unique? By unique, I mean unique across the world: &lt;strong&gt;no one else in the world can have the same name for their bucket&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is interesting, right? The reason is the flat naming structure AWS follows to name objects.&lt;/p&gt;

&lt;p&gt;When you store a file named “vacation_pic1.jpg” under the folder structure &lt;strong&gt;&lt;em&gt;bucket1/folder1/folder2/&lt;/em&gt;&lt;/strong&gt;, the structure is designed to make navigating and understanding the stored data easier for end-users. But in reality, there are no folders at all in AWS S3.&lt;/p&gt;

&lt;p&gt;The actual storage implementation works by flattening the path into one single name and hashing it.&lt;/p&gt;

&lt;p&gt;So the object “vacation_pic1.jpg” is stored as &lt;strong&gt;&lt;em&gt;bucket1_folder1_folder2_vacation_pic1.jpg&lt;/em&gt;&lt;/strong&gt;, or the hash of that name. When the starting point of the object name is made unique, whatever logical folder structure you create after that doesn’t matter; in the end, object names are unique across the board.&lt;/p&gt;
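
&lt;p&gt;Here is a hypothetical Python sketch of the idea (the names &lt;code&gt;put&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;_flat_key&lt;/code&gt; are mine, not the S3 API, whose internals are undocumented): the “folders” exist only inside the key string, and the store itself is one flat map keyed by a hash of the full name:&lt;/p&gt;

```python
import hashlib

# Flat object namespace sketch: no directories anywhere, just one big map.
store = {}

def _flat_key(bucket: str, path: str) -> str:
    # bucket names are globally unique, so bucket + path is unique too;
    # hashing the flattened name spreads keys evenly across the store
    return hashlib.sha256(f"{bucket}/{path}".encode()).hexdigest()

def put(bucket: str, path: str, data: bytes) -> None:
    store[_flat_key(bucket, path)] = data

def get(bucket: str, path: str) -> bytes:
    return store[_flat_key(bucket, path)]

put("bucket1", "folder1/folder2/vacation_pic1.jpg", b"...jpeg bytes...")
print(get("bucket1", "folder1/folder2/vacation_pic1.jpg"))  # no folders involved
```

&lt;p&gt;The slashes in the path are just characters in a key; nothing in the store ever creates or walks a directory tree.&lt;/p&gt;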

&lt;p&gt;Now, can S3 be used to store anything? Yes. Then do we still need all the other databases, like Postgres, Redshift, SQL Server, Teradata, Neo4j, ArangoDB, MongoDB, etc.?&lt;/p&gt;

&lt;p&gt;Yes, we do need other databases as well. Let’s discuss the need for those in a followup article. For now, let’s discuss the use cases for S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the best use cases for Object Stores?&lt;/strong&gt;&lt;br&gt;
As you can see from the design, AWS S3 makes it possible to write huge datasets quickly. You can read huge datasets as well, but the data is immutable: once you create an object, you cannot edit its contents. You have to reproduce the entire object with whatever changes you want.&lt;/p&gt;
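
&lt;p&gt;In code, “editing” an immutable object is a read-modify-write of the whole thing. A minimal sketch, with a plain dict standing in for a real bucket:&lt;/p&gt;

```python
# Hypothetical read-modify-write sketch: because objects are immutable,
# changing one line means fetching the whole object and uploading a
# complete replacement.
objects = {"bucket1/notes.txt": b"line one\nline two\n"}

def edit_object(key: str, old: bytes, new: bytes) -> None:
    body = objects[key]            # 1. download the ENTIRE object
    body = body.replace(old, new)  # 2. modify it locally
    objects[key] = body            # 3. upload a full new version

edit_object("bucket1/notes.txt", b"line two", b"line 2")
print(objects["bucket1/notes.txt"])  # b'line one\nline 2\n'
```

&lt;p&gt;For a one-byte change in a multi-gigabyte object, the whole object makes the round trip. That is why frequently changing data is a poor fit.&lt;/p&gt;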

&lt;p&gt;That is the reason AWS S3 is not a good fit for frequently changing data. In other words, object stores are good for static data. If you produce the data once and read it many times, it’s the perfect fit. Write once, read a million times.&lt;/p&gt;

&lt;p&gt;How many times do you think Netflix produces a movie and edits it? Maybe it gets edited a couple of times initially, but after that, a movie is a movie. So you produce a movie once, store it on S3, and stream it from S3 millions or even billions of times. Same with website content: how often would you change the logo, pictures, or other static content of a website? Not very often. So the static content of almost all websites can be stored on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s the future of Object stores?&lt;/strong&gt;&lt;br&gt;
The reliability, cost effectiveness, and infinite scalability of Object stores promise a lower total cost of ownership for storing data. But eventual consistency and immutability stop them just short of being the only storage we would ever need. Or do they?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;AWS S3 was eventually consistent at the time of writing this article (June 2020); it is strongly consistent as of December 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is where very interesting ideas came in from companies like Netflix, Google, Snowflake, Databricks, etc. These companies created virtual ACID layers on top of Object stores, making the eventual consistency and immutability of the underlying store virtually invisible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e3fjj48kcagk2wszdet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e3fjj48kcagk2wszdet.png" alt="Image description" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How far have these virtual ACID layers come in hiding the eventual consistency and immutability of Object stores, and how do they work? How did open-source solutions like Netflix’s s3mper try to solve eventual consistency using OLTP database systems? How does Google solve eventual consistency using Google Spanner?&lt;/p&gt;

&lt;p&gt;Let’s discuss these in detail in my next article.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
