Artyom Keydunov

Building an Open Source Web Analytics Platform. Part 1: Overview and Analytics Backend

This is the first part of a guide on building an open source web analytics platform with Cube.js. You can find the full guide here.

Also, below you can see a demo of the final application. The online demo is available here.

Building your own analytics engine, like the one behind Google Analytics, sounds like a very sophisticated engineering problem. And it truly is. Not so long ago, it would have required years of engineering time to ship such a piece of software. But as the data landscape changes, we now have a lot of tools that solve different parts of this problem extremely well: data collection, storage, aggregation, and the query engine. By breaking the problem into smaller pieces and solving them one by one with existing open-source tools, we will be able to build our own web analytics engine.

If you’re familiar with Google Analytics (GA), you probably already know that every web page tracked by GA contains a GA tracking code. It loads an async script that assigns a tracking cookie to a user if it isn’t set yet. It also sends an XHR for every user interaction, like a page load. These XHR requests are then processed, and raw event data is stored and scheduled for aggregation processing. Depending on the total amount of incoming requests, the data will also be sampled.

Even though this is a high-level overview of Google Analytics essentials, it’s enough to reproduce most of the functionality.
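To make this overview concrete, below is a minimal sketch of what such a tracking snippet boils down to. The cookie name (_ga_like_id) and the collector endpoint (https://collector.example.com/track) are made up for illustration; real trackers handle far more cases, such as sessions, consent, and retries.

// A toy illustration of the pattern described above (hypothetical cookie name and endpoint).
(function () {
  // Assign a tracking cookie to the visitor if it isn't set yet.
  var match = document.cookie.match(/(?:^|; )_ga_like_id=([^;]+)/);
  var visitorId = match ? match[1] : Math.random().toString(36).slice(2);
  if (!match) {
    document.cookie = '_ga_like_id=' + visitorId + '; path=/; max-age=' + 60 * 60 * 24 * 365;
  }

  // Send an XHR with the page view event to the collector.
  var xhr = new XMLHttpRequest();
  xhr.open('POST', 'https://collector.example.com/track');
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify({
    event: 'page_view',
    visitorId: visitorId,
    url: window.location.href,
    referrer: document.referrer,
    timestamp: Date.now()
  }));
})();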

Architecture overview

Below you can see the architecture of the application we are going to build. We'll use Snowplow for data collection, Athena as the main data warehouse, MySQL to store pre-aggregations, and Cube.js as the aggregation and querying engine. The frontend will be built with React, Material UI, and Recharts. Although the diagram below shows some AWS services, they can be partially or fully substituted by open-source alternatives: Kafka, MinIO, and PrestoDB instead of Kinesis, S3, and Athena, respectively.

We'll start with data collection and gradually build the whole application, including the frontend. If you have any questions while going through this guide, please feel free to join this Slack community and post your question there.

We're going to use Snowplow for data collection, S3 for storage, and Athena to query the data in S3.

Data Collection with Snowplow

Snowplow is an analytics platform to collect, enrich, and store data. We'll use the Snowplow JavaScript tracker on our website, which generates event data and sends it to the Snowplow Collector to be loaded into S3.

Before loading the data, we'll use the Enricher to turn IP addresses into coordinates. We'll use AWS Kinesis to manage the data streams for collection, enrichment, and, finally, loading into S3. The diagram below illustrates the whole process.

Let's start by setting up the tracker. Adding Snowplow's tracker to the website is the same as adding a Google Analytics or Mixpanel tracker. You need to add the asynchronous JavaScript code, which loads the tracker itself.

<!-- Snowplow starts plowing -->
<script type="text/javascript">
;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.10.2/sp.js","snowplow"));

window.snowplow('newTracker', 'cf', '{{MY-COLLECTOR-URI}}', { // Initialise a tracker
  appId: '{{MY-SITE-ID}}',
  cookieDomain: '{{MY-COOKIE-DOMAIN}}'
});

window.snowplow('trackPageView');
</script>
<!-- Snowplow stops plowing -->

The above snippet references a Snowplow Analytics hosted version of the Snowplow JavaScript tracker v2.10.2 (//d1fc8wv8zag5ca.cloudfront.net/2.10.2/sp.js). Snowplow Analytics no longer hosts the latest versions of the Snowplow JavaScript tracker. It is recommended to self-host sp.js by following the Self-hosting Snowplow.js guide.

For more details about setting up the tracker, please refer to the official Snowplow Javascript Tracker Setup guide.
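Besides page views, the same tracker can record custom interactions. As a rough example (the category, action, and label values below are arbitrary), the v2 JavaScript tracker exposes a trackStructEvent call; such events land in the se_category, se_action, and se_label columns of the events table we'll create later.

// Track a custom structured event; category, action, and label here are arbitrary example values.
window.snowplow('trackStructEvent', 'engagement', 'signup-button-click', 'header-cta');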

To collect the data from the tracker, we need to set up the Snowplow Collector. We'll use the Scala Stream Collector. Here is a detailed guide on how to install and configure it. This repository with Docker images for the Snowplow components is very helpful if you plan to deploy Snowplow with Docker.

Next, we need to install Snowplow Stream Enrich. As with the collector, I recommend following the official guide here and using these Docker images.

Finally, we need to install and configure the S3 Loader to consume records from AWS Kinesis and write them to S3. You can follow [this guide](https://github.com/snowplow/snowplow/wiki/snowplow-s3-loader-setup) to set it up.

Query S3 with Athena

Once we have data in S3 we can query it with AWS Athena or Presto. We’ll use Athena in our guide, but you can easily find a lot of materials online on how to set up an alternative configuration.

To query S3 data with Athena, we need to create a table for Snowplow events. Copy and paste the following DDL statement into the Athena console. Modify the LOCATION for the S3 bucket that stores your enriched Snowplow events.

CREATE EXTERNAL TABLE snowplow_events (
  app_id STRING,
  platform STRING,
  etl_tstamp TIMESTAMP,
  collector_tstamp TIMESTAMP,
  dvce_tstamp TIMESTAMP,
  event STRING,
  event_id STRING,
  txn_id INT,
  name_tracker STRING,
  v_tracker STRING,
  v_collector STRING,
  v_etl STRING,
  user_id STRING,
  user_ipaddress STRING,
  user_fingerprint STRING,
  domain_userid STRING,
  domain_sessionidx INT,
  network_userid STRING,
  geo_country STRING,
  geo_region STRING,
  geo_city STRING,
  geo_zipcode STRING,
  geo_latitude STRING,
  geo_longitude STRING,
  geo_region_name STRING,
  ip_isp STRING,
  ip_organization STRING,
  ip_domain STRING,
  ip_netspeed STRING,
  page_url STRING,
  page_title STRING,
  page_referrer STRING,
  page_urlscheme STRING,
  page_urlhost STRING,
  page_urlport INT,
  page_urlpath STRING,
  page_urlquery STRING,
  page_urlfragment STRING,
  refr_urlscheme STRING,
  refr_urlhost STRING,
  refr_urlport INT,
  refr_urlpath STRING,
  refr_urlquery STRING,
  refr_urlfragment STRING,
  refr_medium STRING,
  refr_source STRING,
  refr_term STRING,
  mkt_medium STRING,
  mkt_source STRING,
  mkt_term STRING,
  mkt_content STRING,
  mkt_campaign STRING,
  contexts STRING,
  se_category STRING,
  se_action STRING,
  se_label STRING,
  se_property STRING,
  se_value STRING,
  unstruct_event STRING,
  tr_orderid STRING,
  tr_affiliation STRING,
  tr_total STRING,
  tr_tax STRING,
  tr_shipping STRING,
  tr_city STRING,
  tr_state STRING,
  tr_country STRING,
  ti_orderid STRING,
  ti_sku STRING,
  ti_name STRING,
  ti_category STRING,
  ti_price STRING,
  ti_quantity INT,
  pp_xoffset_min INT,
  pp_xoffset_max INT,
  pp_yoffset_min INT,
  pp_yoffset_max INT,
  useragent STRING,
  br_name STRING,
  br_family STRING,
  br_version STRING,
  br_type STRING,
  br_renderengine STRING,
  br_lang STRING,
  br_features_pdf STRING,
  br_features_flash STRING,
  br_features_java STRING,
  br_features_director STRING,
  br_features_quicktime STRING,
  br_features_realplayer STRING,
  br_features_windowsmedia STRING,
  br_features_gears STRING,
  br_features_silverlight STRING,
  br_cookies STRING,
  br_colordepth STRING,
  br_viewwidth INT,
  br_viewheight INT,
  os_name STRING,
  os_family STRING,
  os_manufacturer STRING,
  os_timezone STRING,
  dvce_type STRING,
  dvce_ismobile STRING,
  dvce_screenwidth INT,
  dvce_screenheight INT,
  doc_charset STRING,
  doc_width INT,
  doc_height INT,
  tr_currency STRING,
  tr_total_base STRING,
  tr_tax_base STRING,
  tr_shipping_base STRING,
  ti_currency STRING,
  ti_price_base STRING,
  base_currency STRING,
  geo_timezone STRING,
  mkt_clickid STRING,
  mkt_network STRING,
  etl_tags STRING,
  dvce_sent_tstamp TIMESTAMP,
  refr_domain_userid STRING,
  refr_dvce_tstamp TIMESTAMP,
  derived_contexts STRING,
  domain_sessionid STRING,
  derived_tstamp TIMESTAMP
)
PARTITIONED BY(run STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket-name/path/to/enriched/good';

Now, we're ready to connect Cube.js to Athena and start building our application.

Analytics API with Cube.js

We'll build our analytics API on top of Athena with Cube.js. Cube.js is an open-source framework for building analytical web applications. It creates an analytics API on top of the database and handles things like SQL code organization, caching, security, authentication, and much more.

Let's install Cube.js CLI and use it to create our application. Run the following commands in your terminal:

$ npm install -g cubejs-cli
$ cubejs create react-dashboard -d athena

Once run, the create command will create a new project directory that contains the scaffolding for your new Cube.js project. Cube.js uses environment variables starting with CUBEJS_ for configuration. To configure the connection to Athena, we need to specify the AWS access and secret keys with the permissions necessary to run Athena queries, the target AWS region, and the S3 output location where query results are stored. These settings go into the .env file in the project directory.

CUBEJS_DB_TYPE=athena
CUBEJS_AWS_KEY=<YOUR ATHENA AWS KEY HERE>
CUBEJS_AWS_SECRET=<YOUR ATHENA SECRET KEY HERE>
CUBEJS_AWS_REGION=<AWS REGION STRING, e.g. us-east-1>
# You can find the Athena S3 Output location here: https://docs.aws.amazon.com/athena/latest/ug/querying.html
CUBEJS_AWS_S3_OUTPUT_LOCATION=<S3 OUTPUT LOCATION>

Next, let's create a sample data schema for our events. Cube.js uses the data schema to generate SQL code, which will be executed in the database. The data schema is not a replacement for SQL. It is designed to make SQL reusable and give it a structure while preserving all of its power. We can build complex data models with Cube.js data schema. You can learn more about Cube.js data schema here.

Create a schema/Events.js file with the following content.

cube(`Events`, {
  sql: `
    SELECT
      event_id,
      event,
      platform,
      derived_tstamp,
      domain_sessionidx,
      domain_sessionid,
      domain_userid,
      ROW_NUMBER() OVER (PARTITION BY domain_sessionid ORDER BY derived_tstamp) AS event_in_session_index
    FROM
       analytics.snowplow_events
  `,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    timestamp: {
      type: `time`,
      sql: `derived_tstamp`
    },

    id: {
      sql: `event_id`,
      type: `string`,
      primaryKey: true
    }
  }
})

Please note that we query the snowplow_events table from the analytics database. Your database and table names may be different.
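As the schema grows, we can add more specific measures on top of the same SQL. For example, here is a sketch of a page-views-only measure that could go inside the measures block of the Events cube; it assumes Snowplow records page views with event = 'page_view' and uses Cube.js measure-level filters.

// A possible additional measure for the Events cube: count only page view events.
pageViews: {
  type: `count`,
  filters: [
    { sql: `${CUBE}.event = 'page_view'` }
  ]
}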

Now, we can start the Cube.js server with npm run dev and open http://localhost:4000. In development mode, Cube.js runs the Playground. It is an application that helps you explore the data schema and send test queries.

Let's test our newly created data schema!

Cube.js accepts queries as JSON objects in a specific query format. The Playground lets you visually build and explore queries. For example, we can construct a test query to load all the events over time. You can also always inspect the underlying JSON query by clicking the JSON Query button.
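For reference, a JSON query for daily event counts would look roughly like this; the dateRange value is just an example and can be changed or omitted.

{
  "measures": ["Events.count"],
  "timeDimensions": [
    {
      "dimension": "Events.timestamp",
      "granularity": "day",
      "dateRange": "last 30 days"
    }
  ]
}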

You can explore other queries as well, try out different charting libraries used to visualize results, and explore the frontend JavaScript code. If you are just starting with Cube.js, I recommend checking out this tutorial as well.

In the next part, we'll start working on the frontend application and will
steadily build out our data schema.

Latest comments (1)

Marko Jak

Are you going to do the Part2 on this still?
I'm super interested in using Cube JS as a part of our application that would be installed at multiple customers so that we can machine learn off the website data.