Igor Lukanin for Cube

Posted on Mar 16, 2022 • Originally published at cube.dev

Combining dbt Metrics with API, Caching, and Access Control

#dbt #sql #database

If you're using dbt to transform data in your data warehouse and you're excited about dbt Metrics feature, here's how you can add caching, access control, and a variety of APIs (such as REST, GraphQL, or SQL) on top.

dbt quickly became the major part of the modern data stack and the favorite tool of data practitioners. As Cube community grows, especially among data engineers, we more often see dbt being used upstack from Cube to define transformations and data model.

After a community-driven feature proposal for dbt Metrics and an insightful discussion that followed, our community requested an integration with dbt that would allow keeping metrics definitions in dbt as the single source of truth while leveraging the rest of Cube.

dbt Metrics meet Cube, the headless BI platform

At Cube, we are building headless business intelligence, and want to make data accessible and consistent across every application, ranging from embedded analytics to internal dashboarding and reporting tools. It’s an amazing opportunity for us to integrate with dbt Metrics, enhance it with Cube’s caching and access control, and deliver to every application via our SQL, REST, and GraphQL APIs.

Today, we are happy to announce an integration with dbt Metrics. Cube can now read metrics from dbt, merge them into Cube’s data model, provide caching and access control, and expose metrics via our APIs to downstream applications.

In the tutorial below we’ll use dbt to define metrics, Cube to apply caching (and get a ~8x performance boost), provide access control, and expose data to Apache Superset via Cube’s SQL API and to other tools and applications via our REST API or GraphQL API.

Building a metrics layer with dbt Metrics

Everything starts with a great dataset, so I’ve taken GitHub Activity Data, a snapshot of more than 2.8 million (~10 %) open-source repositories on GitHub, publicly available on Google Cloud Platform. As of March 2022, it contains information about 257 million unique commits, their authors, messages, files, etc.

Let’s say we’re interested in learning all about productivity of individuals and teams expressed in the number of commits and commit message lengths:

We would start with defining a dbt model like this in commits.sql:

SELECT
    TIMESTAMP_SECONDS(`author`.`time_sec`) AS timestamp,
    `author`.`email` AS author_id,
    `author`.`name` AS author_name,
    REGEXP_EXTRACT(`author`.`email`, r'@(.+\..+)$') AS author_domain,
    LENGTH(subject) AS subject_length,
    LENGTH(message) AS message_length
FROM `bigquery-public-data.github_repos.commits`

The next step would be to add the dbt metrics definitions into schema.yml:

version: 2

metrics:
  - name: commits_count
    label: Commits Count
    model: ref('commits')
    description: "A count metric for commits"

    type: count

    timestamp: timestamp
    time_grains: [day, week, month, year]

    dimensions:
      - author_domain
      - author_name

  - name: users_count
    label: Users Count
    model: ref('commits')
    description: "A count metric for users"

    type: count_distinct
        sql: author_id

    timestamp: timestamp
    time_grains: [day, week, month, year]

    dimensions:
      - author_domain

Defined metrics will be available for programmatic introspection via Metadata API and manual exploration in GraphiQL IDE:

Exposing dbt Metrics via an API

As a headless BI platform, Cube consists of four logical layers: metrics, acceleration, access control, and API. Our metrics layer is able to read metrics definitions from dbt via the Metadata API and translate them into equivalent Cube data model.

You can get started with Cube within minutes in Cube Cloud or set up Cube locally with Docker. You’ll need to provide the credentials for your data warehouse; there are plenty of data sources supported by Cube.

Next, connecting Cube to dbt is as trivial as adding this code snippet to your data model:

import dbt from '@cubejs-backend/dbt-schema-extension'
import { dbtJobId, dbtApiKey } from '../config'

asyncModule(async () => {
  await dbt.loadMetricCubesFromDbtCloud(dbtJobId, dbtApiKey)
})

In an instant, you’ll be able to explore the data in Cube Playground:

Cube will take care of SQL generation and querying the data in your data warehouse that was previously transformed with dbt:

SELECT
  `github_commit_stats_commits`.author_domain `author_domain`,
  `github_commit_stats_commits`.author_name `author_name`,
  COUNT(*) `commits_count`
FROM
  `cube-devrel-team`.`dbt_prod`.`commits` AS `github_commit_stats_commits`
WHERE
  `github_commit_stats_commits`.author_domain IN (?, ?)
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10000

This gives you the luxury of using a plenty of API flavors provided by Cube:

SQL API — to expose the metrics to popular BI tools, e.g., Apache Superset (check this tutorial to see it in action)
REST API — to expose the metrics to even more BI tools (e.g., Appsmith, Bubble, or Retool) or integrate Cube into your report-generation or machine learning pipeline
GraphQL API and integrations with popular front-end frameworks such as React, Angular, and Vue — for embedded analytics in custom-built applications (check one of many tutorials that we have)

Fetching data via the REST API is as simple as sending a JSON-encoded query and parsing the JSON-encoded result set. Cube provides convenient JavaScript client libraries for that purpose, but even a curl or wget call will work just great:

Bringing the same data into Apache Superset or another data visualization tool via Cube’s SQL API requires just a few clicks in the UI that will yield an equivalent SQL query and result set:

Take a closer look! Superset generates a fairly straightforward SQL query (shown on the screenshot above) and sends it to Cube. Actually, you can write your custom SQL queries, like the one below, where you can reference measures and dimensions from your data model. Curious what MEASURE(commitsCount) is? Well, we see a lot of value in valid SQL, that why you can list your dimensions in the GROUP BY statement and enclose your measures using the MEASURE aggregate function (if your BI tool of choice would generate something like COUNT instead of MEASURE, Cube will readily accept that, too):

SELECT authorDomain, MEASURE(commitsCount)
FROM db.`GithubCommitStatsCommits`
WHERE authorDomain IS NOT NULL
GROUP BY authorDomain

Now we’ve learned how dbt Metrics can be exposed to other tools via various APIs provided by Cube. Next, let’s explore how dbt users can leverage Cube’s acceleration and access control layers.

Accelerating dbt Metrics with pre-aggregations

Cube’s acceleration layer serves as a smart cache on top of the metrics layer.

Based on the configuration provided, Cube will pre-emptively fetch data from the data warehouse, pre-aggregate it into analytical rollups, and store them in Cube Store, a custom-built distributed data store, in a columnar format. In most cases, Cube Store enables Cube to fulfill API requests within 300 ms, and allows for concurrencies up to 100 QPS.

You can see on the Apache Superset screenshot above that an unaccelerated request took more than 2 seconds. Of course, if we run the same request once again, we’ll get the result mush faster because it will be cached by Superset. Let’s see what can be done so any request would yield a sub-second response at all times.

The Cube way to achieve this is to use pre-aggregations. We’ll extend the data model with a new cube (think, “a namespace for metrics”) containing the same metrics and a single pre-aggregation:

cube('GithubCommitStatsCommitsCached', {
  extends: GithubCommitStatsCommits,

  preAggregations: {
    main: {
      measures: [ commitsCount ],
      dimensions: [ authorDomain, authorName ],
      timeDimension: timestamp,
      granularity: 'day',
      partitionGranularity: 'year'
    }
  },
})

This caching configuration instructs Cube to use the main pre-aggregation for all queries that match it, i.e., contain a subset of listed measures and dimensions. It also says that day is the minimum time grain we’re interested in using. On top of that, it advises Cube to split the pre-aggregated data into partitions (one per year) and distribute these partitions across all Cube Store nodes so they all can be used in parallel to fulfil queries.

That’s basically it. Cube will seamlessly build the pre-aggregation, store it in Cube Store, and transparently dispatch queries that match it against Cube Store rather that the original data source, e.g., BigQuery. Here’s a side by side comparison of the query hitting BigQuery (1.8 s) ⚠️ and the same query fulfilled by a pre-aggregation in Cube Store (215 ms) ⚡

Also note that, depending on your usage patterns, not hitting your cloud data warehouse for every client-side query might not only improve the performance of your data application but allow for a substantial total cost reduction.

That’s how Cube’s acceleration layers works. What about access contol?

Securing dbt Metrics and implementing multitenancy

Cube’s access control layer lets you manage who is able to access data and which data it is. Subsequently, Cube has built-in mechanisms for row-level security and multitenancy.

You can mandate that every query need to be accompanied by a security context (think “securely signed set of credentials and meta data”) that identifies a user or an application running a query. Cube allows you to validate the security context and the query, possibly amending the contents of the query to enforce access control rules.

We’ll use the queryRewrite extension point to require certain meta data to be provided with every query and used in a mandatory filter. With this configuration, no query can be run “unfiltered”:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.domain) {
      throw new Error("Please specify the domain");
    }

    query.filters.push({
      member: 'GithubCommitStatsCommitsCached.authorDomain',
      operator: 'equals',
      values: [ securityContext.domain ],
    });

    return query;
  }
};

If anyone tries, they’ll get an error message. However, with a security context provided (see its JSON representation below), the query will yield the properly filtered results according to access control rules:

And that’s the shortest demo of how Cube’s access control layers works.

Next steps for dbt users

Please feel free to check out the full source code of dbt models and the Cube project shown above on GitHub. Also, don’t hesitate to give Cube a star on GitHub as well as to follow us on Twitter or LinkedIn 🤘

If you’re a dbt user and you haven’t tried dbt Metrics, please do. You’ll be able to abstract your metrics definitions from the presentation layer and keep them in the single source of truth. Note that, as of March 2022, dbt Metrics are still an experimental feature, so if things doesn’t work as expected, please provide your feedback and find the time to file issues in dbt’s GitHub repository. That is much appreciated.

Also, if you see the value in integrations with BI tools and custom-built applications as well as query acceleration and access control that are built into Cube, please follow the steps outlined in this tutorial and share your success or ask questions in Cube’s Slack community.