Eka Ponkratova

Getting started with a metrics store

Regardless of whether you are a startup, a medium-sized company, or a large organization, the chances are that your data and derived metrics are flying around in analytics applications and reporting, ML platforms, downstream operational systems, etc. As those metrics are often defined in different places on the client side, e.g. in the reporting tool itself, they can end up looking different across your applications.

Yes, you could (and should) go the data governance route and try to create a business glossary 'to document and store an organization’s business concepts and terminology, definitions, and the relationships between those terms' (DMBoK - Data Management Body of Knowledge). But the devil is in the details. How could you ensure that the same filters are used? How would you know whether data is joined the same way? That was the reason I looked into the metrics store, also called a semantic layer, headless BI, semantic store, etc.

The idea of a metrics layer is to standardize different definitions and define rules for how you aggregate and filter your data upstream, instead of applying the logic in each individual consumer.
(Diagram: a metrics store between the data warehouse and downstream tools. Source: Metrics store)

Some of the companies that operate in this space are Cube Dev, Transform (acquired by dbt Labs), and metriql. See more companies at https://www.moderndatastack.xyz/companies/metrics-store.

dbt's metrics layer was out of the question as, at the time of writing, it works with Snowflake only. The objective of this and a follow-up article is to implement a metrics store using some of the popular solutions on the market, with Redshift as the data warehouse layer and Google Data Studio / Power BI as the visualization layer. The first tool to be reviewed is Cube Dev.

Cube Dev, a software company founded in 2016, specializes in open source, analytics, databases, and developer tools. Cube has a free open-source version and a paid Cube Cloud service in which all of the infrastructure is managed for you as a SaaS offering. With the free tier, you can transfer up to 1 GB of data between Cube Cloud and API clients or databases.

Getting ready

1. For this exploration, I've been using Redshift - read more on setting it up in the Getting started guide. At first, the user I created had read-only access to the schema, but I was getting permission errors, so I changed the grants to:

-- create a group and a user that Cube will connect as
create group load;
create user analytics_user password 'YourPassword123' in group load;
-- allow the group to create objects in the dev database and read the schemas Cube introspects
grant create on database dev to group load;
grant select on all tables in schema information_schema to group load;
grant select on all tables in schema pg_catalog to group load;
grant select on all tables in schema public to group load;

2. Redshift provides a public ticket sales data set that you can use for practice. As per https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html, the data set contains two fact tables - sales and events - and five dimension tables - date, category, users, listing, and venues.

The objective of the final dashboard is to track individual sales performance. The key metrics chosen are 'total # of sales', calculated as the total number of tickets sold; 'revenue', i.e. the total price paid; and 'average revenue per user'. Hence, only two tables from the sample database are required to complete this objective - sales and users.
(Diagram: the sample database ERD. Source: AWS)

How to do it...

  1. Set up a Cube Cloud account, as it was the fastest way to deploy Cube.
  2. Once you sign up, you will be able to add your data warehouse/database credentials. Because my Redshift data warehouse was created for the exercise's sake and was meant to run only for the duration of the exercise, I exposed the instance to the internet.
  3. In the next window, you will have the option to configure a data schema or skip it. Lesson(s) learned: keep in mind that if your keys are integers, Cube will recognize them as 'measures', not 'dimensions', so you need to spend some time correcting the schema, specifying a primary key for each table, and defining joins (see the sketch below). I did go down the auto-generation path, but the final cubes that I used in my dashboard were the ones I defined myself, based on the schema generated during the initial attempts.
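
For illustration, here is a minimal, hypothetical sketch of that correction (not the generated file verbatim): an integer key that the auto-generated schema treats as a summable measure gets moved into the dimensions block and marked as the primary key.

cube(`Sales_analytics`, {
  sql: `SELECT * FROM public.Sales`,

  // the generated schema tends to place an integer id column here as a measure, e.g.
  // measures: { salesid: { sql: `salesid`, type: `sum` } } - summing an id makes no sense

  // ...so the id is declared as a dimension and marked as the primary key instead:
  dimensions: {
    salesid: {
      sql: `salesid`,
      type: `string`,
      primaryKey: true
    }
  }
});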

The part that might get you confused is the joins. Cube has three ways to define a relationship between tables:
-hasOne to depict a one-to-one relationship
-hasMany to depict a one-to-many relationship
-belongsTo to depict a many-to-one relationship

Why was it confusing for me? You use only one type of relationship to describe how two cubes relate to each other. For example, I have 'eventid' in the events table and 'eventid' in the sales table. Do I need to declare the join from the sales cube:

joins: {
  Events_analytics: {
    sql: `${CUBE}.eventid = ${Events_analytics}.eventid`,
    relationship: `belongsTo`
  }
}

or declare it from the events cube:

joins: {
  Sales_analytics: {
    sql: `${CUBE}.eventid = ${Sales_analytics}.eventid`,
    relationship: `hasMany`
  }
}
Based on the explanations in the tutorials, the answer is... it depends on your analysis objective.

4. Your cubes will be exposed as tables, where both your measures and dimensions are columns. If you have a complex schema, you should probably keep each cube in a separate file. For the exercise's sake, I defined both cubes, 'Sales_analytics' and 'Users_analytics', in the same file because at some point I wondered whether putting them together would help with joins in Google Data Studio / Power BI - I was wrong about that assumption.

The final cubes with some of my notes are:

cube(`Sales_analytics`, {
  sql: `SELECT * FROM public.Sales`,
  preAggregations: {
  },

//calculating the measure to be used for the calculation of 'total # of sales'
  measures: {
    qtysold: {
      sql: `qtysold`,
      type: `sum`,
      drillMembers: []
    },
//calculating the measure to be used for the calculation of 'revenue'
    pricepaid: {
      sql: `pricepaid`,
      type: `sum`,
      drillMembers: []
    },
//calculating the number of tickets sold for a particular user to practice the cases when some metrics are needed to be filtered on the metrics store layer
    qtysold_TFY09PKU: {
    sql: `qtysold`,
    type: `sum`,
    filters: [{ sql:`${Users_analytics}.username = 'TFY09PKU'` }],
  },
  },
  dimensions: {
    sellerid: {
      sql: `sellerid`,
      type: `string`
    },
    salesid: {
      sql: `salesid`,
      type: `string`,
      primaryKey: true,
      shown: true //to be visible when data is pulled to a BI application; otherwise, it will be blank
    },
    saletime: {
      sql: `saletime`,
      type: `time`
    }
  }
});

cube(`Users_analytics`, {
  sql: `SELECT * FROM Users`,
  preAggregations: {},
  joins: {
    Sales_analytics: {
      sql: `${CUBE}.userid = ${Sales_analytics}.sellerid`,
      relationship: `hasMany`
    }
  },
  measures: {
    count: {
      type: `count`,
      drillMembers: [city, username]
    }
  },
  dimensions: {
    userid: {
      sql: `userid`,
      type: `string`,
      primaryKey: true,
      shown: true //to be visible when data is pulled to a BI application; otherwise, it will be blank
    },
    city: {
      sql: `city`,
      type: `string`
    },
    state: {
      sql: `state`,
      type: `string`
    },
    username: {
      sql: `username`,
      type: `string`
    }
  }
});

5. Cubes seemed to work perfectly in the Playground, so it was time to move on to BI applications. Back in 2021, Cube introduced the SQL API to allow 'querying Cube via Postgres-compatible SQL' and to enable 'the use of BI applications, Python notebooks, reverse ETL tools, and other downstream tools on top of Cube.' You can find the list of BI applications that can be connected through the SQL API in the documentation; among them are Power BI, Tableau, Google Data Studio, Excel, Jupyter Notebook, etc.
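
Because the SQL API speaks the Postgres wire protocol, any Postgres client should be able to query the cubes as if they were tables. Below is a minimal sketch in Node.js using the pg package; the host, port, user, password, and database are placeholders to be taken from the deployment's SQL API settings, not values from my setup.

// Sketch: querying Cube's SQL API with a standard Postgres client.
const { Client } = require('pg');

async function main() {
  const client = new Client({
    host: 'your-deployment.cubecloud.dev', // placeholder
    port: 5432,                            // placeholder
    user: 'cube',                          // placeholder
    password: 'your-sql-api-password',     // placeholder
    database: 'db',                        // placeholder
    ssl: true
  });
  await client.connect();

  // cubes are exposed as tables; measures can be aggregated with MEASURE()
  const res = await client.query(
    `SELECT sellerid, MEASURE(qtysold) AS total_sales
     FROM Sales_analytics
     GROUP BY 1
     ORDER BY total_sales DESC
     LIMIT 10`
  );
  console.log(res.rows);
  await client.end();
}

main().catch(console.error);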

Attempt #1: Power BI. It went well until I tried to join my cubes to get the average revenue (from Sales_analytics) per username (from Users_analytics). Power BI kept throwing data model errors, saying that the cube has measures - doh, of course it does - or that the right table has limits, although no limits were set up.
Lesson(s) learned: I should explicitly 'show' primary keys in the data schema in Cube; otherwise, they will be hidden and therefore populated with blank values in the BI application. And as you know, joining on blank columns will result in a many-to-many relationship and a mess overall.

Attempt #2: Google Data Studio.
As expected, my cubes are displayed as tables.
(Screenshot: cubes displayed as tables in Google Data Studio)
Lesson(s) learned:
1. To address the join problem, I used 'Data blending' to "blend" data from two data sources into a single visualized report - or, to put it more simply, Google allows you to join several tables into one table. With my keys shown, I was able to join the Sales_analytics and Users_analytics cubes/tables to get the data set which I later used to build the visualization.
(Screenshot: data blending in Google Data Studio)

I am not sure the way I implemented it is the correct way to go, because it defeats the objective of having joins done on the metrics store layer in order to prevent data consumers from joining data incorrectly. Moreover, when I tried to get the data in a Jupyter notebook and join the cubes the way I would normally join tables, I got an error message suggesting to use '_cubeJoinField' instead. A potential solution could be to join the cubes into one wide table in the SQL query of your cube, instead of using the joins:{} block, as sketched below.
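
For illustration, a wide cube along those lines could look roughly like this (a sketch of the idea, which I did not fully test for this article): the join is pushed into the cube's SQL, so downstream tools never have to join the cubes themselves.

cube(`Sales_by_user`, {
  sql: `
    SELECT s.salesid, s.qtysold, s.pricepaid, s.saletime,
           u.userid, u.username, u.city, u.state
    FROM public.sales s
    JOIN public.users u ON u.userid = s.sellerid
  `,

  measures: {
    qtysold: { sql: `qtysold`, type: `sum` },
    pricepaid: { sql: `pricepaid`, type: `sum` }
  },

  dimensions: {
    salesid: { sql: `salesid`, type: `string`, primaryKey: true },
    username: { sql: `username`, type: `string` },
    city: { sql: `city`, type: `string` }
  }
});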

2. In my experience, Google Data Studio tends to lose its connection to the server from time to time, causing dashboards to break. The same happened with my Cube dashboard, but I believe the troubleshooting tutorials available on the Internet might help to resolve the issue.

How it works...
Cube, like other metrics stores, doesn't store data. Instead, once you have defined your metrics and aggregation rules in the data model, it parses the model, converts incoming requests from downstream systems into SQL queries, and executes those queries in your data warehouse/database. This means that, in a way, the metrics store pulls the data directly from the data source at query time.
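
For example, a request from a downstream client is expressed in terms of measures and dimensions rather than SQL. A rough sketch of such a request against Cube's REST API is shown below; the host and the API token are placeholders.

// Sketch: asking Cube for revenue per city. Cube translates this JSON query
// into SQL and runs it against Redshift; the client never writes SQL itself.
const query = {
  measures: ['Sales_analytics.pricepaid'],
  dimensions: ['Users_analytics.city']
};

fetch('https://your-deployment.cubecloud.dev/cubejs-api/v1/load?query=' +
      encodeURIComponent(JSON.stringify(query)), {
  headers: { Authorization: 'YOUR_API_TOKEN' } // placeholder token
})
  .then((res) => res.json())
  .then((json) => console.log(json.data));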

To decrease response time and reduce data processing cost, some metrics stores implement a pre-aggregated table feature, where the most frequently used metrics and dimensions are precomputed and stored either in the original database/data warehouse or in the metrics store's own storage. These pre-aggregated tables are refreshed periodically to catch up with changes in the data source. As per Cube's documentation: 'Upon an incoming request, Cube.js will first look for a relevant pre-aggregation... . Once the pre-aggregation is built, all the subsequent requests will go to the pre-aggregated layer instead of hitting the raw data.'
(Image source: Cube Dev)

To try it out, I ran the query to get the revenue per user without a pre-aggregation. Then, I added the block of code below to my data schema in Cube and 'built' the aggregation.

  preAggregations: {
    mainAggregate: {
      measures: [Sales_analytics.qtysold, Sales_analytics.pricepaid, Sales_analytics.qtysold_TFY09PKU],
      dimensions: [Sales_analytics.sellerid, Sales_analytics.salesid, Users_analytics.userid, Users_analytics.city, Users_analytics.state, Users_analytics.username]
    }
  }

Without the pre-aggregation, the query ran for 1.5 seconds, while with it, the query ran for 223 ms.


Summary
To summarize, you can employ different approaches to ensure that metrics are consistent across applications. For example, you can create summary tables/views in your database/data warehouse and use them in your downstream applications, or you can try to keep the queries used across different tools consistent. Another way to align metrics across applications is to use a metrics store. Working with Cube has been a good experience, as I got an overall idea of how a metrics store works in practice. The question I still have is about joins, as a left join vs. an inner join can produce completely different results. And I am off to review another metrics store tool...
