Cover image for Using MySQL as a Cache Layer for BigQuery

Using MySQL as a Cache Layer for BigQuery

keydunov profile image Artyom Keydunov Originally published at cube.dev ・5 min read

BigQuery is great at handling large datasets, but will never give you a sub-second response, even on small datasets. It leads to a wait time on dashboards and charts, especially dynamic, where users can select different date ranges or change filters. It is almost always okay for internal BIs, but not for customer-facing analytics. We tolerate a lot of things such as poor UI and performance in internal tools, but not in those we ship to customers.

But we still can leverage BigQuery’s cheap data storage and the power to process large datasets, while not giving up on the performance. As BigQuery acts as a single source of truth and stores all the raw data, MySQL can act as cache layer on top of it and store only small, aggregated tables and provides us with a desired sub-second response.

You can check out the demo here and the source code on Github. Make sure to play with date range and switchers—dynamic dashboards benefit the most from the pre-aggregations.


Cube.js leverages the pre-aggregations layer as a part of its two-level caching system. We recently released support for external pre-aggregations to target use cases, where users can combine multiple databases and get the best out of the two worlds. The schema below shows the typical setup for Cube.js with BigQuery and MySQL.

Alt Text

To use the external rollup feature, we need to configure Cube.js to connect to both BigQuery and MySQL, as well as specify which pre-aggregation we want to build externally. If you are new to Cube.js, I recommend checking this 101-style tutorial first and then come back here. We are going to use the public Hacker News dataset from BigQuery for our sample application.

Let’s first install Cube.js CLI and create a new application.

$ npm install -g cubejs-cli
$ cubejs create external-rollups -d bigquery

We set -d bigquery to make our main database be a BigQuery. Next, cd into the bigquery-mysql folder and configure .env with correct credentials.


Here we set credentials for both the main DB (BigQuery) and external DB for pre-aggregations (MySQL). You can learn more about obtaining BigQuery credentials at the Cube.js docs here. Also, in order to build pre-aggregations inside MySQL, Cube.js should have write access to the stb_pre_aggregations schema where pre-aggregation tables will be stored.

Now, let’s install the Cube.js MySQL driver.

$ npm install @cubejs-backend/mysql-driver --save

Once we have it, the last configuration step is to provide MySQL related options via the externalDbType and externalDriverFactory properties. Replace the content of the index.js file with the following.

const CubejsServer = require("@cubejs-backend/server");
const MySQLDriver = require('@cubejs-backend/mysql-driver');

const server = new CubejsServer({
  externalDbType: 'mysql',
  externalDriverFactory: () => new MySQLDriver({
    host: process.env.CUBEJS_EXT_DB_HOST,
    database: process.env.CUBEJS_EXT_DB_NAME,
    user: process.env.CUBEJS_EXT_DB_USER,
    password: process.env.CUBEJS_EXT_DB_PASS.toString()

server.listen().then(({ port }) => {
  console.log(`🚀 Cube.js server is listening on ${port}`);

That is all we need to let Cube.js connect to both BigQuery and MySQL. Now, we can create our first Cube.js data schema file. Cube.js uses the data schema to generate an SQL code, which will be executed in your database.

Create the schema/Stories.js file with the following content.

cube(`Stories`, {
  sql: `select * from \`fh-bigquery.hackernews.full_partitioned\` WHERE type = 'story'`,

  measures: {
    count: {
      type: `count`,

  dimensions: {
    category: {
      type: `string`,
      case: {
        when: [
          { sql: `STARTS_WITH(title, "Show HN")`, label: `Show HN` },
          { sql: `STARTS_WITH(title, "Ask HN")`, label: `Ask HN` }
        else: { label: `Other` }

    time: {
      sql: `timestamp`,
      type: `time`

Now start the Cube.js server by running node index.js and navigate to the development playground at http://localhost:4000.

You can select the Stories count measure and category dimension, alongside a time dimension to build a chart as shown below.

Alt Text

If we inspect a generated SQL by clicking a SQL button, we’ll see the following.

    WHEN STARTS_WITH(title, "Show HN") THEN 'Show HN'
    WHEN STARTS_WITH(title, "Ask HN") THEN 'Ask HN'
    ELSE 'Other'
  END `stories__category`,
  DATETIME_TRUNC(DATETIME(`stories`.timestamp, 'UTC'), MONTH) `stories__time_month`,
  count(*) `stories__count`
      type = 'story'
  ) AS `stories`
  2 ASC

This SQL shows us that this query runs against the raw data inside BigQuery. Now, let’s make it run against the pre-aggregated table inside MySQL. To do that, we are going to define a pre-aggregation. Usually, it is done inside the same cube, but for the sake of the tutorial, so we can compare the performance with and without pre-aggregation, let’s create a new cube. We can do it inside the same file. Add the following code to the schema/Stories.js file.

cube(`StoriesPreAgg`, {
  extends: Stories,
  preAggregations: {
    main: {
      type: `rollup`,
      measureReferences: [count],
      dimensionReferences: [category],
      granularity: `month`,
      timeDimensionReference: time,
      external: true

In the above code, we declare a pre-aggregation with a rollup type and specify which measures and dimensions to include in the aggregate table. Also note external: true; this line tells Cube.js to upload this pre-aggregation into MySQL.

Now, go to the development playground and select the same measures and dimensions as before: count, category, and time grouped by month, but this time select them from the Stories PreAgg cube. When requested the first time, Cube.js will generate an aggregate table and upload it to MySQL. All subsequent requests will go directly to the aggregate table inside MySQL. You can inspect the generated SQL and it should look like the following.

  `stories_pre_agg__category` `stories_pre_agg__category`,
  `stories_pre_agg__time_month` `stories_pre_agg__time_month`,
  sum(`stories_pre_agg__count`) `stories_pre_agg__count`
  2 ASC

As you can see, it now queries data from the stb_pre_aggregations.stories_pre_agg_main table inside MySQL. You can play around with filters to see the performance boost of the aggregated query compared to the raw one.

You can also check this demo dashboard with multiple charts and compare its performance with and without pre-aggregations. The source code of the example dashboard is available on Github.

Posted on by:

keydunov profile

Artyom Keydunov


Engineer / Co-Founder of Cube.js, an Open Source Analytics Framework https://cube.dev


markdown guide

This looks really interesting to me I have spent quite some time on a system that allows users to query data sets sharded across MySQL and BigQuery, and it looks like there is some overlap with what you are doing at Cube.js. I really want to start playing with Cube now!


Thank you! I'm happy to help if you have any questions. Feel free to ping me at Twitter or in Slack Cube.js Community - slack.cube.dev