DEV Community

Apache Doris

Agent-Facing Analytics with High Concurrency: Doris vs ClickHouse vs Snowflake

Data warehouses have evolved drastically over the past 30 years—from BI-driven legacy systems to big data-powered modern platforms. Now, with the explosion of GenAI and LLM applications, we're entering a new era where data warehouses must seamlessly integrate with AI workflows, support real-time agent interactions, and deliver extreme performance at scale. Apache Doris 4.0 emerges as the game-changer, combining enterprise-grade analytics with AI-native capabilities to meet the demands of today's intelligent applications.

The Evolution of Data Warehouses: From Legacy to AI-Native

Let's trace the journey of data warehouses and understand how AI is reshaping their core requirements.

Legacy Data Warehouses (BI-Driven)

The first generation of data warehouses separated analytical data from transactional systems to handle large volumes of historical data (e.g., daily trading reports for stockbrokers). However, they quickly hit walls in the big data era:

  • Scalability: Expensive hardware upgrades with limited horizontal scaling

  • Cost Efficiency: On-premise deployments required specialized hardware and high maintenance costs

  • Advanced Analytics: Poor support for real-time insights, AI, and ML

  • Flexibility: Rigid architectures unable to adapt to new use cases or diverse data sources

Modern Data Warehouses (Big Data-Driven)

Post-2000, the mobile internet and e-commerce boom drove the need for more agile analytics. Modern data warehouses addressed legacy limitations with:

  • Stateless Compute/Storage: Lower overhead for scaling resources

  • Low-Latency: Sub-second response times for user queries

  • High-Concurrency: Effortlessly handles thousands of concurrent workloads

  • Hybrid Workloads: Supports ad-hoc queries, ETL, and batch processing

  • Federated Queries: Breaks data silos by unifying access to data lakes, transactional DBs, and more

Data Warehouses in the AI Era

ISG Research predicts: "Through 2027, almost all enterprises developing GenAI applications will invest in data platforms with vector search and retrieval-augmented generation (RAG) to complement foundation models with proprietary data."

LLMs thrive on high-quality data—for both training and inference. AI-driven applications require data warehouses to:

  • Balance Volume & Quality: High-quality data directly impacts model performance

  • Dual-Purpose Data: Support both model training and real-time inference

  • Dynamic Freshness: Handle continuous data read/write with near-zero latency

  • Agent-Friendly: Enable autonomous AI agents to interact without human intervention

  • AI-First Design: Natively support LLM functions, vector storage, and high-performance vector I/O

The Paradigm Shift: Agentic-Facing Analytics

Traditional BI and OLAP systems are built for passive, historical reporting—a handful of users running heavy queries with a high tolerance for latency. AI changes this with agentic-facing analytics:

  • Proactive, autonomous AI agents that reason, analyze in real-time, and trigger actions

  • Workloads shift to: "Massive users (agents), light/iterative queries, zero latency tolerance"

  • Requires millisecond response times for thousands of concurrent queries

Legacy OLAP systems can't keep up—their pre-aggregated data cubes, batch processing, and data silos create bottlenecks for agentic workflows. The solution? A semantics-and-response-centric architecture that prioritizes flexibility, real-time access, and unified data context.

Apache Doris Outperforms Competitors: Benchmark Results

Apache Doris (and its commercial distribution VeloDB) sets a new standard for performance across key analytics benchmarks. We compared it against Snowflake and ClickHouse Cloud on equivalent compute resources (128 cores for VeloDB and ClickHouse; an XL cluster for Snowflake), using Apache JMeter to measure QPS at concurrency levels of 10, 30, and 50.
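
The throughput methodology above can be sketched in a few lines of Python: spin up N concurrent workers, fire queries for a fixed window, and count completions per second (the same load shape JMeter produces). The `run_query` stub is a hypothetical stand-in for a real query against the warehouse.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    """Hypothetical placeholder for a real query round-trip to the warehouse."""
    time.sleep(0.005)  # simulate a 5 ms query

def measure_qps(parallelism, duration_s=1.0):
    """Run queries from `parallelism` workers for `duration_s` seconds,
    then report completed queries per second (JMeter-style throughput)."""
    completed = 0
    deadline = time.monotonic() + duration_s

    def worker():
        nonlocal completed
        while time.monotonic() < deadline:
            run_query()
            completed += 1

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for _ in range(parallelism):
            pool.submit(worker)  # pool join on exit waits for all workers
    return completed / duration_s

for p in (10, 30, 50):
    print(f"parallelism={p}: {measure_qps(p):.0f} QPS")
```

With a real client in `run_query`, this is enough to reproduce the parallelism sweep, though a production harness would also track latency percentiles, not just throughput.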

Benchmark Overview

  • SSB-FLAT (single wide-table queries, no joins): VeloDB outperforms Snowflake by 4.76–7.39x and ClickHouse by 4.76–6.92x

  • SSB (star schema, join-heavy analytics): VeloDB outperforms Snowflake by 5.17–6.37x; ClickHouse failed most join queries

  • TPC-H (complex ad-hoc decision support): VeloDB outperforms Snowflake by 1.71–3.10x; ClickHouse could not complete all queries (Q20, Q21, and Q22 failed)

Key Takeaways

  • Complex Joins: Doris excels at join-heavy workloads (SSB/TPC-H) thanks to its advanced optimizer and execution engine

  • High Concurrency: Maintains performance at scale (50 parallelisms) while competitors struggle with memory or parsing errors

  • Wide-Table Performance: Even in single-table scans (SSB-FLAT), outperforms purpose-built systems like ClickHouse

  • Cost-Efficiency: Delivers more throughput per compute unit than Snowflake’s elastic architecture

Deep Dive: Apache Doris Core Technologies

Apache Doris’s performance and AI readiness stem from its innovative architecture. Let’s explore the key features powering its success.

1. Data Pruning: "Don’t Process Unnecessary Data"

The most efficient way to process data is to avoid processing it entirely. Doris uses two types of pruning:

Static Filters (Pre-Execution)

  • Partition Pruning: FE uses metadata to skip irrelevant partitions (e.g., time-based partitions outside a date range)

  • Key Column Pruning: Data is sorted by key columns—binary search narrows down the row range to scan

  • Value Column Pruning: Column files store min/max metadata to skip files that can’t match predicates
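
The value-column case is the classic "zone map" trick: each column file carries min/max metadata, so files whose range cannot satisfy a predicate are skipped without ever being read. A minimal sketch, with a hypothetical file layout:

```python
# Each entry models the min/max metadata stored alongside a column file.
files = [
    {"name": "f1", "min": 0,   "max": 99},
    {"name": "f2", "min": 100, "max": 199},
    {"name": "f3", "min": 200, "max": 299},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi];
    everything else is skipped without touching its data pages."""
    return [f["name"] for f in files if f["max"] >= lo and f["min"] <= hi]

# Predicate: value BETWEEN 150 AND 250 -- f1 can be skipped outright.
print(prune(files, 150, 250))  # → ['f2', 'f3']
```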

Dynamic Filters (Post-Execution)

For joins, filters are generated after building hash tables on the build side. This prunes irrelevant data on the probe side before joining, reducing join overhead.

2. Advanced Pruning Optimizations

LIMIT Pruning

Pushes LIMIT clauses down to data scanning—stops processing once the required number of rows is retrieved.
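
The effect of pushing LIMIT into the scan is easy to see in miniature: the scan stops as soon as enough rows are produced, so the rest of the data is never touched. A sketch over a lazy iterator standing in for storage:

```python
def scan_with_limit(rows, limit):
    """LIMIT pushed down into the scan: stop reading once `limit`
    rows are produced instead of materializing the full result."""
    out = []
    for row in rows:          # `rows` is a lazy iterator over storage
        out.append(row)
        if len(out) == limit:
            break             # early termination: remaining data is never read
    return out

# Only 5 rows of the million are ever pulled from the iterator.
print(scan_with_limit(iter(range(1_000_000)), 5))  # → [0, 1, 2, 3, 4]
```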

TopK Pruning

Optimizes TopK queries (e.g., "top 10 highest-grossing products") with:

  • Local truncation in scanning threads

  • Global merge sort via a coordinator

  • Two-phase execution: first sort key columns to get row indices, then fetch required columns—avoids full data scans
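
The two-phase TopK flow above can be sketched with Python's heap utilities: each scan partition truncates locally to its best k row ids on the sort key, a coordinator merges the candidates, and only the winning rows fetch their remaining columns. The column data here is hypothetical:

```python
import heapq

# Hypothetical column store: sort key and a payload column, addressed by row id.
revenue = [30, 95, 10, 70, 88, 15, 60, 99]
name    = ["a", "b", "c", "d", "e", "f", "g", "h"]

def topk(k, partitions):
    # Phase 1a: each scan thread truncates locally to its best k row ids.
    local = [heapq.nlargest(k, part, key=lambda rid: revenue[rid])
             for part in partitions]
    # Phase 1b: the coordinator merges candidates to the global top k row ids.
    row_ids = heapq.nlargest(k, [rid for cand in local for rid in cand],
                             key=lambda rid: revenue[rid])
    # Phase 2: fetch the non-key columns only for the k winning rows.
    return [(name[rid], revenue[rid]) for rid in row_ids]

print(topk(2, partitions=[[0, 1, 2, 3], [4, 5, 6, 7]]))  # → [('h', 99), ('b', 95)]
```

Phase 2 is where the savings come from: the payload columns are fetched for k rows instead of every row that survived the scan.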

Join Pruning

Reduces probe-side data for hash joins:

  • Uses build-side hash table values to filter probe-side data

  • Minimizes data transfer and join computation (O(M+N) complexity vs. O(M*N) for Cartesian product)
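
The mechanics can be sketched as follows: once the hash table exists on the (small) build side, its key set becomes a runtime filter that drops non-matching probe rows before they ever reach the join operator. The tables are hypothetical; Doris's real runtime filters also come in bloom-filter and min/max flavors:

```python
build_side = [(1, "electronics"), (3, "books")]            # (category_id, name)
probe_side = [(1, 9.9), (2, 5.0), (3, 7.5), (4, 1.0),      # (category_id, price)
              (2, 3.3), (1, 8.8)]

def hash_join_with_runtime_filter(build, probe):
    table = {}
    for key, val in build:                  # build phase: O(M)
        table.setdefault(key, []).append(val)
    rt_filter = set(table)                  # runtime filter derived from build keys
    # Prune the probe side before joining -- rows 2 and 4 never reach the join.
    pruned = [(k, v) for k, v in probe if k in rt_filter]
    return [(k, bv, pv) for k, pv in pruned for bv in table[k]]  # probe: O(N)

print(hash_join_with_runtime_filter(build_side, probe_side))
# → [(1, 'electronics', 9.9), (3, 'books', 7.5), (1, 'electronics', 8.8)]
```

In a distributed plan the win is larger still: the filter is applied at the probe-side scan nodes, so pruned rows are never shuffled over the network.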

3. Pipeline Engine: Efficient Execution at Scale

Doris uses a coroutine-like pipeline engine to maximize CPU utilization:

  • Yields CPU during blocking operations (disk I/O, network I/O in joins/exchanges)

  • Eliminates thread switching overhead with task scheduling triggered by external events (e.g., RPC completion)

  • Independent parallelism per pipeline (not constrained by tablet count)

  • Even data distribution via local exchange optimization to minimize data skew

  • Shared states across pipeline tasks (reduces initialization overhead)
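
As an analogy only (Doris's pipeline engine is C++ with its own scheduler, not Python), the core idea maps cleanly onto coroutines: an operator that hits blocking I/O yields the worker thread instead of parking it, and a bounded queue between operators provides backpressure. A sketch of a two-operator scan → aggregate pipeline:

```python
import asyncio

async def scan(out_q):
    for batch in range(3):
        await asyncio.sleep(0.01)   # "disk I/O": the task yields the CPU here
        await out_q.put([batch * 10 + i for i in range(4)])
    await out_q.put(None)           # end-of-stream marker

async def aggregate(in_q):
    total = 0
    while (batch := await in_q.get()) is not None:  # waiting on data also yields
        total += sum(batch)
    return total

async def pipeline():
    q = asyncio.Queue(maxsize=1)    # bounded queue: backpressure between operators
    _, total = await asyncio.gather(scan(q), aggregate(q))
    return total

print(asyncio.run(pipeline()))  # → 138
```

While `scan` sleeps on I/O, the event loop is free to run other pipeline tasks on the same thread—no thread-switch cost, which is the property the Doris engine exploits at much larger scale.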

4. Vectorized Query Execution

Processes data in batches (vectors) instead of row-by-row, leveraging:

  • SIMD (Single Instruction, Multiple Data) CPU instructions

  • Loop unrolling to reduce branch mispredictions

  • Accelerated compression, computation, and data processing

  • Delivers 2–10x performance gains for analytical queries
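
The batch-versus-row contrast is visible even from Python with NumPy: a filtered sum expressed over a whole vector compiles down to tight SIMD-friendly loops over contiguous memory, while the row-at-a-time version pays a branch per value. A sketch, assuming a hypothetical `prices` column:

```python
import numpy as np

prices = np.random.default_rng(0).uniform(1, 100, 1_000_000)

def total_rowwise(col, threshold):
    """Row-at-a-time execution: one branch and one add per value."""
    total = 0.0
    for p in col:
        if p > threshold:
            total += p
    return total

def total_vectorized(col, threshold):
    """Batch (vector) execution: comparison, selection, and sum each run
    as a tight loop over the whole column at once."""
    return col[col > threshold].sum()

# Both produce the same answer; the vectorized form is dramatically faster.
assert np.isclose(total_rowwise(prices[:1000], 50.0),
                  total_vectorized(prices[:1000], 50.0))
```

Timing the two with `timeit` on the full array shows the gap; the exact multiplier depends on hardware, which is why the article quotes a 2–10x range rather than a single number.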

AI-Native Capabilities in Apache Doris 4.0

Apache Doris 4.0 is built for the AI era with native support for:

  • Vector Search: High-performance storage and retrieval of feature vectors for LLM inference

  • RAG Integration: Seamlessly connects with LLMs to augment generation with proprietary data

  • AI Functions: Built-in UDFs for ML/LLM workflows (e.g., embedding generation, text processing)

  • MCP Server: A built-in Model Context Protocol (MCP) server, so LLMs and agents can query Doris as a tool

  • Agent Compatibility: Designed for programmatic access by AI agents with low-latency responses
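
To ground the vector-search bullet, here is the plain math underneath RAG retrieval: store unit-normalized document embeddings, then rank them by cosine similarity (a dot product) against the query embedding. This is a brute-force sketch with random hypothetical embeddings—Doris 4.0 serves the same operation through native vector indexes rather than a full scan:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical corpus: 1,000 documents embedded into 128 dimensions.
docs = normalize(np.random.default_rng(42).normal(size=(1000, 128)))

def retrieve_top_k(query, k=3):
    """Cosine similarity reduces to a dot product on unit vectors."""
    scores = docs @ normalize(query)
    idx = np.argsort(scores)[::-1][:k]   # brute force; an index avoids this scan
    return idx, scores[idx]

idx, scores = retrieve_top_k(np.random.default_rng(7).normal(size=128))
print(idx, scores)  # row ids of the 3 nearest documents, best first
```

In a RAG pipeline, the rows behind `idx` would be fetched and injected into the LLM prompt as grounding context.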

Conclusion

The AI revolution demands data warehouses that are fast, flexible, and AI-native. Apache Doris 4.0 delivers on all fronts:

  • Outperforms competitors in complex joins, high concurrency, and wide-table analytics

  • Features like data pruning, pipeline engine, and vectorized execution enable millisecond response times

  • AI-native capabilities (vector search, RAG, agent support) integrate seamlessly with GenAI workflows

For teams building AI-driven applications, Apache Doris isn’t just a data warehouse—it’s the foundation for intelligent, real-time analytics that powers the next generation of products and decision-making.
