
Peter Marshall

Apache Druid analytics applications - a 2020 metastudy

This post was first published in 2020 - :confession-face: it's due an update :)

By using Apache Druid to power your applications, you are part of the charge to deliver a new generation of decision support tooling. That makes you a disruptor, delivering user experiences that democratise and socialise decision-making, setting your users and your engineers free. Here are 10 inspiring characteristics of Apache Druid-powered applications!

  • They build knowledge networks
  • They have time front and centre
  • They enable episodic comparison
  • They are appealing
  • They make understanding data easier
  • They enable ad-hoc discovery
  • They hand control over statistics to the user
  • They align the heat of business with the heat of data
  • They drive product development
  • They are digitally secure

They build Knowledge Networks

Experiences powered by Apache Druid make the best, latest, statistics-ready data highly accessible and people want to share their thinking and their observations widely. This not only reduces the propagation of duplicate technologies in an enterprise, but increases consistent interpretations of data through socialisation, and improves data literacy around the organisation.

Druid’s highly extensible connectivity and scalability means it’s possible to place Druid as the engine for not just one BI tool or just one data science front end or one development framework or one portal, but for many across the organisation.

And by building apps that have effective integration with real-time communication platforms, whether that’s Microsoft Teams or Pager Duty, Druid-powered apps stop isolated thinking in its tracks and become a foundational part of Knowledge Networks that circulate evidence and findings at speed.

“Users have already unlocked new use cases for capacity analysis, analyzing the traffic matrix, and inter-domain traffic analysis (peering analysis). Users are also creating their own customized dashboards, and are freely sharing insights throughout different teams.”

NTT Communications

“Dream11 now has direct access to the raw, real-time data via over 900 Kafka topics. It streams in from web and mobile applications and is then ingested into Druid, where all the raw data is kept and made available for analytics and reporting. Before, the product team had to submit a request for analytics information that could take hours to days to produce. Now with Imply, business users who aren’t trained data or business analysts have direct self-service analytics capabilities.”

Dream 11

“We have diverse use cases. We didn’t want a different system for each use case because then you have to learn about each one, maintain each one. This is exactly why we chose Druid: its flexible architecture, its extensive query language, its ability to tune each component. It was a great fit for us.”

Nielsen

“We give people the ability to use the data … in a high-performance, easy-to-use way. Non-technical users … who are well-versed in business. We don’t want to limit this to a single domain, it’s not just marketing, it’s every area of the business that can use … our most valued data-sets.

“There’s no shortage of available options, a well-established market for decades. We need a place to enable self-service querying - can we equip a business user with a tool that allows them to get insight in minutes or hours, not in weeks or months? Can they share it? Can the person next to them share that data? Can users explore this insight-ready data, putting it in a place where people can take action? We’ve got lots of business partners we work with - can they use this platform?

“Druid is fast. It does things that are really common in the analytics space. It’s great at filtering. It’s scalable. And you can write some super-sophisticated operations.

“There are folks in almost every pocket of the business using this platform.”

Target

“We wanted to empower everyone at the company to be able to make appropriate decisions backed up by data in their day-to-day. Non data users would first have to think about exactly what data they need (like to analyse web traffic) then have to place a request with someone on the data staff who would figure out “where is this data?”, “what format are they going to present it back in?” (tabular, graphical) … and the requester would … hopefully then be able to make a decision. But … they would find they had more questions to ask, and then have to repeat the cycle.

“What has changed is that [non data users] don’t need to ask [data users] any more and instead ask the cube that we have created [on top of Druid].

“We create shareable links such that any cubes we’re interacting with can be shared with anyone at the company. For instance, [a non data user] looking at different cubes who finds something really interesting may apply certain filters, and is able to generate a link to share with other product managers or other members of her team, so they can all look at the exact same data and use it to empower their discussions.”

Twitch

They have time front-and-centre

Apache Druid-powered apps create a new data journey that starts with “time” - not just immediacy but interactivity. Anyone can enrich their understanding of statistics by using “time” both as an interactive element and a visual cue, and time-sensitive decisions can be made in informed ways.

Users hone everything they see using sliders, time charts, clocks, calendar selectors, and other time-based design elements. Whether they are elements for visual and spatial thinkers, like a map, bubble chart, or histogram, or elements for textual thinkers, like overall figures, top-10 lists, or matrices, they all relate back to time as a primary dimension.

While in BI tools navigating and drilling down through time is often an afterthought, in Druid-powered apps time is a core design element that often begins the conversation with the data at hand.
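To make the “time first” idea concrete, here is a minimal sketch - not taken from any of the teams quoted here - of the kind of hourly, time-bucketed query an app might send to Druid’s SQL API as the user drags a time slider. The broker address, the app_events datasource, and the country column are assumptions for illustration.

```python
import requests

# Assumed broker address; "app_events" and "country" are hypothetical names.
DRUID_SQL = "http://localhost:8082/druid/v2/sql"

# Bucket events by hour inside the window the user picked on the time slider,
# so the UI can draw the time chart that starts the conversation with the data.
sql = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
       COUNT(*) AS events
FROM app_events
WHERE __time >= TIMESTAMP '2020-06-01' AND __time < TIMESTAMP '2020-06-08'
  AND country = 'DE'
GROUP BY 1
ORDER BY 1
"""

# Druid's SQL endpoint returns one JSON object per row by default.
for row in requests.post(DRUID_SQL, json={"query": sql}).json():
    print(row["hour_bucket"], row["events"])
```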

“Feature stores are critical [to verification of fraudulent events in real-time]; we wouldn’t be able to query a data store directly and compute features to inform validation in short time periods, but a feature store (precomputed in batch systems) enables low-latency queryability. We use Druid as an aggregation platform to fuel TrafficGuard’s feature store.

“To achieve real-time reporting means thousands of queries at high-concurrency, and with billions of rows, thousands of dimensions, and hundreds of measures, compounded by fast growth, led to us choosing Druid [for analytics].”

TrafficGuard

“It’s super important to us that time is key to everything in Druid. Almost everything of interest, almost every question that a user would ask has some time element to it. What were sales last week? How many guests were in our stores in the last month? Time is super important.”

Target

They enable episodic comparison

Druid-powered apps use both real-time and historical data - a single flow of data that is fresh and frequently updated. Users aren’t forced to choose between careful reflection on historical data and the very latest data: they see both. They are discouraged equally from being overly reactive by focusing entirely on what just happened and from being overly sensitive by focusing entirely on the past.

Interfaces allow people to compare statistics from different time periods quickly and in a variety of ways, whether that’s the data that triggered an alert alongside a timeline of previous similar events, or an overlay of yesterday’s trend line against the same time last month.

Fresh data in an Apache Druid application activates people’s instincts to react properly to threats, risks, and sudden changes, while the instantly-available historical data puts it in context.
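As the Target quote below points out, Druid accepts an array of intervals in a single native query, which is one straightforward way to build a period-over-period overlay. The sketch below shows the idea under assumed names - the sales_events datasource, the sales metric, and the broker address are all placeholders:

```python
import requests

# Assumed broker address; "sales_events" and "sales" are placeholder names.
DRUID_NATIVE = "http://localhost:8082/druid/v2"

query = {
    "queryType": "timeseries",
    "dataSource": "sales_events",
    "granularity": "hour",
    # Two intervals in one query: today and the same day last month,
    # so the UI can overlay both trend lines for comparison.
    "intervals": ["2020-06-15/2020-06-16", "2020-05-15/2020-05-16"],
    "aggregations": [{"type": "longSum", "name": "sales", "fieldName": "sales"}],
}

for row in requests.post(DRUID_NATIVE, json=query).json():
    print(row["timestamp"], row["result"]["sales"])
```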

“The main tool we provide to developers is a web application where [developers] can see all their stats in real-time, and also historical aggregates for years back in time.”

Game Analytics

“Customers want to see a list of dimensions about pins over periods of time, comparing stats with months or weeks ago, such as impressions, engagement, links clicked, and saves. The more stats that people can get the more confident they are to use our platform.”

Pinterest

“You can pass in an array of intervals. In pretty much any business, period-over-period comparison is really powerful, and the fact you can pass in an array of intervals to Druid is really helpful.”

Target

“We give customers log-level insights … with 14+ reporting sections inside our product, and some of the subsections have 18+ components that include time-period comparisons … with results that come back in under 2 seconds.”

TrafficGuard

They are appealing

A UI built with Druid is not like a data analyst’s tool. Data is easy to understand and encourages engagement. An excellent Druid-powered UI reduces the chances of people’s patience, technical skill, knowledge of data sources, role or status being a barrier.

The schema that’s presented doesn’t use weird internal field names, but uses terms that are highly personalised. That means both dimensions and measures - gone are customer.demographics.surname and COUNT of CUSTOMER, and in come имя (“name”) and ਗਾਹਕਾਂ ਦੀ ਗਿਣਤੀ (“customer count”). Gone are multiple pull-downs hidden inside cryptic visual metaphors, and in come obvious, self-explanatory pinches, gestures, and visual cues.

“Data applications are going to be used by people who are not necessarily professional data people. While they might well be happy waiting 10, 15 seconds or even a minute or longer, someone whose job is not just to sit there and work with data all day is not going to wait for that.

“They are going to be less engaged with the tool, and when things go longer than that, they’re just not going to use it at all. Things being responsive helps people engage with data - and apps that interact at speed are what make that possible.”

Gian Merlino, Apache Druid PMC Chair

They make understanding data easier

Druid-enabled user interfaces put data into context, with multiple sources shown alongside one another using time (and other dimensions) as connecting keys, and with additional data and structure added to individual sources through enrichment. Now events from one function (say, sales) can be seen alongside data from other functions (say, logistics) in real-time with more colour and depth than has been possible before.

“Taking all available data into account enables TrafficGuard to provide the deepest understanding of traffic, and to provide reliable validation. We have a treasure-trove of data enrichments that can be of a user journey, IP enrichment, network intelligence, user agent enrichment, device enrichment, … and the multitude of campaign and partner source enrichments.”

TrafficGuard

“We make multiple visuals available to users whether that is tabular, whether it’s a heat map, whether it’s a bubble chart, whether it’s a histogram… You can look at the data in whichever presentation you prefer, and you have the ability to do that yourself. You can highlight and zoom into a specific area within a visual that you care about. The ability to do this self-service really empowers our users.”

Twitch

They enable ad-hoc discovery

Unlike many BI tools powered by ETL, Apache Druid applications drive deep dives into data by allowing people to filter and drill down interactively on aspects of the data they previously had no way to explore.

There are fewer dashboards, reports, and visuals aligned to specific pre-computed, filtered data sets, and instead multi-functional experiences that do more than one job, all using the same data. Often, users start from statistics and visualisations that give a complete view, and when they need to understand what’s going on in a specific area of their process, function, or service - a specific country, location, account, team, organisation, or product - they just select it from an intuitive pull-down list or tree, and the data instantly refreshes to match.
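A hedged sketch of what that pull-down-driven drill-down can look like behind the scenes: the app turns whatever the user selected into filters on a single Druid SQL query. The datasource and column names (orders, country, store_id, and so on) are invented for illustration.

```python
import requests

DRUID_SQL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

def drill_down(filters: dict) -> list:
    """Build one query from whatever the user picked in the UI's pull-downs.
    In real code the filterable column names would come from a fixed allow-list."""
    where = ["__time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY"]
    params = []
    for column, value in filters.items():
        where.append(f"{column} = ?")                       # one placeholder per filter
        params.append({"type": "VARCHAR", "value": value})  # bound in order of appearance
    sql = f"""
        SELECT product, COUNT(*) AS orders, SUM(revenue) AS revenue
        FROM orders
        WHERE {' AND '.join(where)}
        GROUP BY product
        ORDER BY revenue DESC
        LIMIT 10
    """
    return requests.post(DRUID_SQL, json={"query": sql, "parameters": params}).json()

# e.g. the user narrowed the complete view down to one country and one store:
print(drill_down({"country": "GB", "store_id": "1234"}))
```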

“Marketers can filter the reach, or potential audience, to find exactly the right people for their ads, and cost scales with reach. Hagen explains, “we filter on 50+ dimensions, so we can do something like ‘all male iPhone users over 40 that are less than one mile from a Starbucks’.”

LiquidM

“We provide complex filtering and complex visualisations based on different dimensions, and we allow the users to display KPIs split by country, by day … and we want those filters to be quick and responsive, and that’s one place where Druid comes into its own.”

Game Analytics

“We are allowing clients to define audiences which later they can target. An audience is a composition of devices with common attributes, so for example a client can define an audience of all females in a specific age [group] and with a specific interest, and … in real time we calculate the number of unique devices … inside that audience.”

Nielsen Identity

They hand control over statistics to the user

Engineers are free to exploit Apache Druid’s super-fast statistical calculations, offering up measures that in the past may have been too computationally expensive to provide interactively - if at all. That means not just MAX or AVG but complex analysis, such as approximate distinct counts on extremely high-cardinality data and the complex set calculations used in funnel analysis.

Druid can be asked to compute not just one but multiple measures in parallel, allowing engineers to intelligently execute fewer queries to improve app responsiveness, and then to push the boundaries of what we traditionally think data exploration is about: the citizen data analyst is born having fun with data, not throwing their tablet out of the window.

And because the measures are not fixed or pre-computed, engineers can build, and analysts can use, applications that offer maximum flexibility in the measures on offer.
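For instance, a single Druid SQL round trip can return several measures at once, including approximate distinct counts and quantiles. This is only a sketch: the ad_events datasource and its columns are made up, and the quantile function shown (APPROX_QUANTILE_DS) relies on the DataSketches extension being loaded on the cluster.

```python
import requests

DRUID_SQL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

# Several measures computed in one query; "ad_events", "user_id", "bid_price",
# and "campaign" are hypothetical names. APPROX_QUANTILE_DS needs the
# druid-datasketches extension.
sql = """
SELECT campaign,
       COUNT(*)                            AS impressions,
       APPROX_COUNT_DISTINCT(user_id)      AS unique_users,
       APPROX_QUANTILE_DS(bid_price, 0.95) AS p95_bid
FROM ad_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY campaign
ORDER BY impressions DESC
"""

for row in requests.post(DRUID_SQL, json={"query": sql}).json():
    print(row)
```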

“Whether understanding the time it will take to collect answers from their audience of interest, or breaking down results in real-time as respondents answer surveys, Druid delivers the high-performance time-series OLAP database we need to monitor, measure and explore data in real-time.”

Pollfish

“We need to be able to count the number of unique people who first got exposed to an ad and later surfed to a landing page. Later, some of those people … also add a product to a shopping cart, so now we need to count the unique persons who first watch the ad, surfed to a landing page, and then added a specific product to their shopping cart. And finally, of course, we need to count the number of unique people who actually purchased the product. For that, we chose the Theta sketch in Druid.”

Nielsen Identity

“In order to insert a new dimension or a new column we needed to compute out all the combinations of all the different keys [in our older key-value lookup technology], like country, devices, genders… It reached cardinality explosion and became impossible to compute all the necessary combinations as our use cases scaled. And so this is why we turned to using Druid.”

Pinterest

“We take advantage of estimation aggregators: count distinct is always a tough problem. Taking advantage of things like DataSketches - getting more and more into percentiles, quantiles - these are the types of things we can do with the native functionality of Druid.”

Target

They align the heat of business with the heat of the data

Druid builders leverage multiple data configurations to support the use cases they are serving: from red-hot-lava data sources of the utmost value - with critical dimensions, approximation, roll-up, and filtering applied at ingestion - to cooler data sources with more dimensions and finer granularity, maybe even raw data, where it’s accepted that less money will be spent on making queries run quickly. Druid enables a specific configuration for each, according to the cost-benefit.

In infrastructure, engineers are able to embed the true cost-benefit of analytics using tiering to put more compute and storage resources where it matters most, and by using reingestion tasks they can reduce the granularity, retention period, and breadth of data over time, ultimately reducing storage costs.
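As a rough illustration of those knobs, here is what the granularitySpec portion of two hypothetical ingestion specs might look like - a hot, fresh datasource versus a cooler, cheaper one. The values are illustrative, not a recommendation:

```python
# granularitySpec fragments for two hypothetical Druid ingestion specs.
# A hot datasource keeps fine-grained detail; a cooler one rolls up harder,
# and a re-ingestion or compaction task can rewrite older segments with the
# cooler spec later to cut storage costs.

hot_granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "HOUR",   # small segments for fresh, heavily queried data
    "queryGranularity": "MINUTE",   # keep minute-level detail
    "rollup": True,                 # pre-aggregate rows that share dimension values
}

cold_granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "MONTH",  # larger, cheaper segments for older data
    "queryGranularity": "HOUR",     # coarser detail is good enough here
    "rollup": True,
}
```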

“Most users are focused on what happens in real time and on the first day, so it makes sense to eliminate needless data as time passes. On day one, LiquidM ingests one million requests per second, and via a series of compaction tasks brings the deep storage required down to about three gigabytes for the year.”

LiquidM

“We lean more towards flexibility than query performance, making sure that the database is performant enough that our users can get what they need.

“We could restrict dimensionality … leaning more on educating our users and explaining the effect of having many dimensions or very high cardinality, and how they play off with query performance. We could limit cardinality, running a job in our pipeline to pick out just Top N before it’s ingested, roll up values into another category, or we could have explicit include lists (only allow these values in this dimension to pass through).

“Or we could just monitor, running reports on cardinality and roll-up, then go and work out what’s happened. We could change the query granularity to limit queries to hourly statistics … or go down to the minute which is good enough for real-time alerting but isn’t so costly as no roll-up whatsoever. We could restrict or reduce the query concurrency … to keep it fast for a few users (though we set a limit so that they will fail fast if it’s overloaded).”

Netflix

They drive product development

Druid interfaces drive competitive advantage. Take a look at content on the web from people like Twitter MoPub, SK Telecom, and Game Analytics - and see what they have built.

Product owners find that, when Apache Druid is put into their data pipeline, they have many more product development options than they did before.

More raw data can be stored and more metrics can be used in experiences. Not only this, but each part (ingestion, query, and storage) can be scaled independently.

And because Druid has no computed cube, schema changes can be made over time and query capabilities adjust on-the-fly.

Now Product Owners can use Apache Druid to build digital products that create fundamental shifts in their industry.

“Our schemas evolve over time. We add and remove dimensions and metrics as our needs change.”

Netflix

“Innowatts’ platform helps energy providers accurately forecast load, design effective rate plans, manage risk, increase customer value and prepare for a sustainable future. Innowatts had helped its customers see a 40% improvement in forecast accuracy within 3 months, enhance customer lifetime value by $3,000 per customer, and avoid $4 million in Opex costs.”

Innowatts

They are digitally secure

Apache Druid applications are secure - not just protecting data but providing highly-available and consistently reliable services.

Applications may use real-time information about a user’s role, for example, applying row- and column-based filtering that is transparent to the user, embedded inside the application, and surfaced in a way that ensures the user knows that their own information - and their customers’ information - is protected.
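One way such role-based filtering can sit in the application tier is sketched below; the roles, the region mapping, and the transactions datasource are hypothetical, and the user never sees the filter that is appended on their behalf.

```python
import requests

DRUID_SQL = "http://localhost:8082/druid/v2/sql"  # assumed broker address

# Hypothetical mapping from application roles to the rows they may see.
REGION_BY_ROLE = {
    "emea_analyst": "EMEA",
    "apac_analyst": "APAC",
}

def query_for_user(role: str) -> list:
    """Append a row-level filter based on the caller's role before querying Druid."""
    region = REGION_BY_ROLE.get(role)
    if region is None:
        raise PermissionError(f"role {role!r} has no data access")
    sql = """
        SELECT TIME_FLOOR(__time, 'P1D') AS day, SUM(amount) AS total
        FROM transactions
        WHERE region = ?
        GROUP BY 1
        ORDER BY 1
    """
    body = {"query": sql, "parameters": [{"type": "VARCHAR", "value": region}]}
    return requests.post(DRUID_SQL, json=body).json()
```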

And when it comes to availability, Druid deployments are highly-resilient in all aspects and allow for rolling upgrades, making it an extremely reliable back-end database for applications used by thousands of people in multiple regions of the world simultaneously.

“As an online database, it has a scale-out and fault-tolerant architecture which today is table stakes for any kind of database system. No downtime for software updates is really important: your online database is not going to be very online if you have to take it down when you want to do a software update!

“No downtime for data management is also super important: things like altering a table, updating every row in a table, running a compaction (aka vacuuming or optimising…) on a table - all this stuff you might want to do … happens in the background and you can continue querying the old data as the new version comes online. For all these kinds of operation there is no reason to take downtime.”

Gian Merlino, Apache Druid PMC Chair
