Luke

Posted on • Originally published at hypequery.com

The Analytics Language Layer: Why Real-Time Data Needs Typed APIs, Not Just Faster Databases

We’ve made our databases real-time. We haven’t made our analytics interfaces real-time-safe. The missing abstraction between your analytics DB and your consumers is a typed, programmable analytics language layer.

ClickHouse can ingest a billion rows per second and return aggregations across terabytes in milliseconds. The storage problem is solved. The execution problem is solved. But the interface problem (how consumers actually talk to the engine) remains stuck in the era of hand-crafted SQL strings, copy-pasted metric definitions, and dashboards that nobody trusts.

This gap didn’t matter much when the consumer was a human analyst writing a query in a notebook. It matters enormously now that the consumer is increasingly a service, a background job, an embedded dashboard, or an AI agent. The weakest link in the modern analytics stack isn’t the database. It’s the language we use to talk to it.

When Your User Is a Model, SQL Becomes a Liability

The push toward AI-driven analytics has exposed a fundamental fragility in text-to-SQL approaches.

Spider 2.0, released in late 2024 with enterprise-level complexity (3,000+ columns, multiple SQL dialects), showed even the best models solving only 17% of queries. The BIRD-Interact benchmark, which simulates real interactive analytics sessions, reports a best-case success rate of 16%. Uber built an internal text-to-SQL system and found only 50% overlap with ground truth on their own evaluation set.

The failure modes are what make this genuinely dangerous. An analysis of 50,000+ production LLM-generated queries found that most broken queries execute successfully and return data: they’re semantically wrong but syntactically valid. The model hallucinates columns that don’t exist, picks wrong join paths, applies incorrect aggregation logic, or silently drops required filters. You get a clean DataFrame back. The numbers just happen to be wrong.

The evidence for a different approach is already in. Snowflake’s internal tests show query accuracy jumping from 40% to 85% when LLMs are routed through a semantic layer instead of raw SQL. DataBrain reports accuracy going from roughly 55% to over 90% with semantic context. dbt Labs reported at Coalesce 2025 that their semantic layer achieved 83% accuracy on natural language analytics questions, with several categories at 100%. The pattern is clear: constrain what the model can express, and accuracy improves dramatically.

The fix isn’t better documentation or more careful code review. It’s making the interface itself safe by default. Type-safe query builders. Pre-declared metrics and datasets. Schema-aware tooling that catches errors at compile time, not at query time. The system should enforce correctness structurally, not rely on the discipline of the person or model writing the query.
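To make this concrete, here is a minimal sketch of what “safe by default” can look like in TypeScript. The `EventsSchema` shape and the `Query` builder API are invented for illustration; the point is that column references are constrained to the schema’s keys, so a misspelled column is a compile error rather than a wrong number in production.

```typescript
// Hypothetical sketch: a schema-first query builder where column references
// are checked by the compiler. EventsSchema and the Query API are invented.

type EventsSchema = {
  user_id: string;
  event_type: string;
  revenue: number;
  occurred_at: Date;
};

// Only keys of the schema are legal column references.
type Column<S> = keyof S & string;

class Query<S> {
  private cols: Column<S>[] = [];
  private predicates: string[] = [];

  select(...cols: Column<S>[]): this {
    this.cols.push(...cols);
    return this;
  }

  where(col: Column<S>, op: "=" | ">" | "<", value: S[Column<S>]): this {
    this.predicates.push(`${col} ${op} ${JSON.stringify(value)}`);
    return this;
  }

  toSQL(table: string): string {
    const where = this.predicates.length
      ? ` WHERE ${this.predicates.join(" AND ")}`
      : "";
    return `SELECT ${this.cols.join(", ")} FROM ${table}${where}`;
  }
}

const q = new Query<EventsSchema>()
  .select("user_id", "revenue")
  .where("event_type", "=", "purchase");

// q.select("revnue") would not compile: "revnue" is not a key of EventsSchema.
console.log(q.toSQL("events"));
```

If the `events` schema drops or renames a column, every call site referencing it fails the build, which is exactly the “breaks before your dashboards do” behaviour described below.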

Defining the Analytics Language Layer

What’s needed is an abstraction that doesn’t exist cleanly in the current tooling landscape. Call it an analytics language layer: a typed, programmable, stable API for a company’s metrics and queries that everything else plugs into.

It’s not an ORM — those map objects to rows and optimise for CRUD. It’s not a BI tool — those own the visualisation and assume human consumers. It’s not a raw query builder — those give you flexibility without constraints. The analytics language layer sits between the database and all its consumers, providing a contract that is:

Type-safe and schema-aware. Column references, filter expressions, and aggregation logic are validated at compile time. If the schema changes, your build breaks before your dashboards do.

Versioned and evolvable. Metrics and query definitions are first-class code constructs with version history, review workflows, and the ability to deprecate gracefully. You can evolve your analytics API the same way you’d evolve a public REST API.

Multi-protocol. The same definitions are consumable by backend services, React components, CLI tools, AI agent toolchains, traditional BI. The metric definition is written once; the consumption pattern varies.
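A rough sketch of those three properties together, again with invented names (`MetricDef`, `toSQL`, `toOpenAPISummary`): the metric is a typed, versioned code construct defined once, and each consumption pattern is just a different rendering of the same definition.

```typescript
// Hypothetical sketch: one metric definition, multiple consumers.
// The MetricDef shape and the renderer functions are invented.

interface MetricDef {
  name: string;
  version: number;
  description: string;
  sql: { expression: string; table: string };
  dimensions: string[];
  deprecated?: { since: number; replacement: string }; // graceful deprecation
}

const dailyRevenue: MetricDef = {
  name: "daily_revenue",
  version: 2,
  description: "Sum of purchase revenue per day",
  sql: { expression: "sum(revenue)", table: "events" },
  dimensions: ["occurred_at", "country"],
};

// Rendering for a SQL-speaking consumer; undeclared dimensions are rejected.
function toSQL(m: MetricDef, dimension: string): string {
  if (!m.dimensions.includes(dimension)) {
    throw new Error(`${dimension} is not a declared dimension of ${m.name}`);
  }
  return `SELECT ${dimension}, ${m.sql.expression} FROM ${m.sql.table} GROUP BY ${dimension}`;
}

// Rendering for an HTTP consumer: what a REST endpoint might advertise.
function toOpenAPISummary(m: MetricDef): string {
  return `GET /metrics/${m.name}/v${m.version}  -- ${m.description}`;
}

console.log(toSQL(dailyRevenue, "occurred_at"));
console.log(toOpenAPISummary(dailyRevenue));
```

Because the definition is ordinary code, it can be reviewed, tested, and versioned with the same workflows as the rest of the application.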

This is the layer that platform teams at scale end up building internally, whether they call it a semantic layer, a metrics catalog, a query translation engine, or something else entirely. The pattern is universal because the problem is universal.

Seven Predictions for the Next Three Years

The analytics language layer isn’t a speculative concept. The convergence is already visible. Here’s where this heads:

  1. Every serious ClickHouse deployment will have a dedicated analytics language layer sitting between the database and its consumers. The alternative is linear growth in platform engineering headcount as consumer count increases.

  2. AI agents will talk to semantic layers, not databases. The MCP server approach of exposing raw SQL to agents will mature into structured tool interfaces where agents invoke named, typed queries rather than generating SQL strings. The accuracy data demands it.

  3. BI tools will consume typed endpoints. Rather than authoring raw SQL or maintaining their own query logic, BI tools will connect to analytics language layers the same way frontend applications consume REST or GraphQL APIs. The Open Semantic Interchange initiative (dbt, Snowflake, Salesforce, ThoughtSpot) is an early signal of this convergence.

  4. Metric definitions will be code, not configuration. YAML-based metric definitions will give way to programmatic definitions in the same language as the application (TypeScript, Python), enabling IDE support, testing, and the same CI/CD workflows used for application code.

  5. Schema drift will become a build failure, not a production incident. Type-safe analytics layers will catch breaking schema changes at compile time. The “we renamed a column and three dashboards broke” class of incident will go the way of runtime type errors in typed languages — still possible, but structurally discouraged.

  6. The semantic layer market will consolidate around API-first architectures. Gartner’s concept of “composable analytics” — modular, API-first business components — will define the winning pattern. Tools that can’t serve their semantics via APIs will lose ground to those that can.

  7. Real-time and batch semantic layers will merge. The artificial divide between dbt (batch) and Cube/hypequery (real-time) will collapse as the analytics language layer becomes the single interface regardless of freshness. The layer’s job is to provide safe access; the underlying engine handles the latency profile.
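Prediction 2 is worth sketching, since it is where the accuracy data bites hardest. In this illustration (the `registry`, `NamedQuery` shape, and `invoke` function are all invented), an agent can only call queries that were declared ahead of time, with arguments validated against a spec, rather than generating free-form SQL.

```typescript
// Hypothetical sketch: agents invoke named, typed queries from a registry
// instead of generating SQL strings. All names and shapes are invented.

type ParamSpec = { name: string; type: "string" | "number" };

interface NamedQuery {
  description: string;
  params: ParamSpec[];
  run: (args: Record<string, unknown>) => string; // returns SQL for the demo
}

const registry: Record<string, NamedQuery> = {
  top_customers: {
    description: "Top N customers by revenue since a given date",
    params: [
      { name: "limit", type: "number" },
      { name: "since", type: "string" },
    ],
    run: (args) =>
      `SELECT user_id, sum(revenue) AS total FROM events ` +
      `WHERE occurred_at >= '${args.since}' ` +
      `GROUP BY user_id ORDER BY total DESC LIMIT ${args.limit}`,
  },
};

// The agent-facing surface: only registered tools, only validated arguments.
function invoke(tool: string, args: Record<string, unknown>): string {
  const q = registry[tool];
  if (!q) throw new Error(`unknown tool: ${tool}`);
  for (const p of q.params) {
    if (typeof args[p.name] !== p.type) {
      throw new Error(`bad argument ${p.name}: expected ${p.type}`);
    }
  }
  return q.run(args);
}

console.log(invoke("top_customers", { limit: 10, since: "2025-01-01" }));
```

A hallucinated tool name or a mistyped argument fails loudly at the boundary, instead of executing as a syntactically valid, semantically wrong query.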

The Challenge

If you’re running ClickHouse in production and your consumers are still writing raw SQL strings to query it, you’ve solved the hard problem (making the database fast) and left the easy problem unsolved (making it safe to talk to).

The analytics language layer is the missing piece. Not because the database needs help, but because every consumer that touches it does. The organisations that adopt this abstraction early will ship fewer data incidents, onboard new teams faster, and safely expose real-time analytics to more products and agents than their competitors.

We’ve spent a decade making databases real-time. It’s time to make the interfaces real-time-safe.

We’re solving this problem at hypequery; you can check out the project on GitHub.
