Apache Druid® - an enterprise architect's overview

#apachedruid

“Describe Apache Druid in less than 60 seconds,” said an old enterprise architect colleague of mine... challenge accepted!

Apache Druid is part of the modern data architecture. It uses a special data format designed for analytical workloads, using extreme parallelisation to get data in and get data out. A shared-nothing, microservices architecture helps you to build highly-available, extreme scale analytics features into your applications.

Modern data architecture

Druid is a real-time analytics database that can ingest data from file stores and data lakes (for batch ingestion), and event streams (for immediate query). It’s used globally to build analytical interface elements in applications “whenever someone is sat there, waiting for the results” via a SQL API or JDBC interface.

Storage system

Druid has its own data format. Source data is sharded at ingestion time into “segments” that enable planning for massively parallelized operations. The “Druid segment” is paired with Java code, crafted over many years by engineers from AirBnB to Yahoo, so that each one can be scanned and statistics extracted extremely quickly. Druid front-loads data distribution and replication, balancing it automatically to make it ready as soon as it’s needed for queries, and it uses metadata from each segment to intelligently prune query execution plans based on time periods and, optionally, clusters of dimension values.

Parallel execution

Dedicated leader processes for each of ingestion, query, and on-going data management each generate highly parallelised execution plans. The plans are executed by independent follower processes, and there are facilities to target those workloads at particular groups of followers to account for different query and ingestion throughput requirements.

Availability and elasticity

Druid has a shared-nothing, loosely-coupled microservices design, making use of cloud-era dependencies. The leading processes that each perform critical functions use proven leadership election techniques, while multiple follower processes add workload redundancy. A Druid cluster remains aware of the resources you make available to it and will distribute tasks to microservices automatically, allowing for on-the-fly adjustments to cluster size and shape. Together, this helps users create deployments with high availability that can adapt to the query concurrency and data scales that you have.

Learn more…

Take a look at the design section of the official Apache Druid documentation.
Watch videos from Imply’s developer relations team’s “adoption tips” playlist.
Meet members of the Druid community

DEV Community

Apache Druid® - an enterprise architect's overview

Modern data architecture

Storage system

Parallel execution

Availability and elasticity

Learn more…

Top comments (0)

Read next

Debezium - CDC

React Reconciliation

A Beginner's Guide to Front-End Development

HELLO FRIENDS 😇