DEV Community

Cover image for How To Build an Open-Source Analytics Stack using RudderStack and Apache SuperSet
Team RudderStack for RudderStack

Posted on • Edited on • Originally published at rudderstack.com

How To Build an Open-Source Analytics Stack using RudderStack and Apache SuperSet

It is 2020, and discussions around data-ownership and privacy have finally crossed the Atlantic and reached the American boardroom, thanks to CCPA coming into force.

Companies, both big and small are – or should be – taking stock of all their data-sharing activities, and the first thing usually discussed is the use of data analytics. Products like Google Analytics, Amplitude and MixPanel are near omnipresent, but businesses are finally wondering if there are alternatives. Beyond the data sharing & privacy issue, a couple of other drivers for that thought are:

  • Cost: The SaaS products can cost hundreds of thousands of dollars beyond their free tier (typically sub-10m events). Per-user unit economics in some domains (e.g. free to play mobile games) often don't justify these costs.
  • Flexibility: The SaaS analytics products are great, but often restrict the kind of queries you can run. Full SQL is not supported, which makes it hard to combine event data with other internal data. Warehouse dumps, even if supported, cost extra. Also, they are quite time-delayed and restricted in terms of functionality (for e.g. Google Analytics only gives a data dump into BigQuery).
  • Open-source and Vendor lock-in: Finally, there is an increasing drive to adopting open-source tech as much as possible to avoid vendor lock-in.

This blog describes the design and implementation of an open-source analytics stack in the context of a mobile casino game. We used RudderStack and  Apache SuperSet, with Amazon Redshift as the underlying data-warehouse. This game had all the pain points discussed above - they wanted to bring their data in-house, and run complex analytics queries which weren’t possible on Amplitude’s UI. At the same time, they wanted to continue to use Amplitude for some of their other reporting tasks, but only forward the relevant events to it, thereby limiting data sharing and saving on the event budget.

RudderStack's transformation module is designed to address this use-case - the events can be removed, sampled or even combined before they are sent to the required destinations. However, that is not the focus of this blog and is covered in detail in a separate blog.

Understanding the mobile game data metrics

Mobile game developers make extensive use of data to increase user engagement and ensure effective monetization. Most mobile games adopt a ‘freemium’ approach, wherein the game is made free to users with the expectation that users will pay for in-app purchases in order to use certain features. Analytics plays a very important role in identifying such features. At the same time, it helps in optimizing the UX (User Experience) so that they don’t lose interest in the game.

Different teams within a mobile game company work with different data metrics. For example:

  • Executive teams typically track overall metrics around the health of the business such as total users, revenue, etc.
  • Growth teams track campaign or group level metrics, such as the activity of users who have performed a particular event - for e.g. downloaded the app from a specific campaign or made an in-app purchase
  • Customer teams track individual user-level metrics such as identifying most active users, or users who had a significant drop in activity.

All these metrics demonstrate the diverse nature and complexity of the queries that any analytics stack has to support, beyond just the traditional reporting.

Next, we give a quick overview of the tools involved.

RudderStack

RudderStack (https://github.com/rudderlabs/rudder-server/) is an open-source alternative to platforms like Segment or mParticle for collecting and routing event data. 

Note: You can read this article on clickstream analytics to learn more about events, and how event data is processed.

Why use RudderStack?

RudderStack offers many features that make it a good choice for building an analytics stack. Some of them are:

  • RudderStack provides for a powerful transformation framework to process your event data on the go
  • It has SDKs for multiple platforms, including Android, iOS, React Native, JavaScript and Unity, as well as server-side SDKs for languages like Ruby, Python, etc.
  • RudderStack supports a plethora of hosting options for the backend, such as Docker, Kubernetes, Terraform, etc. It runs on all cloud environments including AWS, GCP and Azure, as well as on a private cloud or a bare-metal environment
  • RudderStack has integrations into multiple data warehouses like Amazon Redshift, S3, Google BigQuery, Snowflake and more, in addition to cloud destinations like Google Analytics, Amplitude, MixPanel, HubSpot, UserIQ and Salesforce (Over 20 as of now)
  • Finally, RudderStack is open-source 

How the mobile game uses RudderStack

In this specific use-case, the mobile game integrated RudderStack’s Unity SDK for generating events. Events were then sent to the RudderStack’s open-source data plane, from where they were dumped into Amazon Redshift.

Some of the events generated by the game were:

  • spin_result, corresponding to casino spins
  • daily_rewards_claim,
  • bonus_click
  • offer_displayed

Following is a typical event structure generated by application using RudderStack’s SDK. It includes attributes specific to the event as well as specific to the user - all of which needs to be accurately captured in order to make the information comprehensive.

"event_payload": {
      "batch": {
        "channel": "mobile",
        "sentAt": "2019-12-10T09:35:34.763Z",
        "messageId": "1575970534464-bfe73f1a-0186-4fd8-ac1d-109e109b0fb4",
        "event": "hyper_bonus_purchase",
        "anonymousId": "27612234ca5b2",
        "type": "track",
        "properties": {
          "ishighroller": "False",
          "bet_level": 1,
          "bet_multiplier": 2000,
          "game_name": "#####",
          "no_of_spin": 0,
          "cost": 1080000,
          "category": "hyper_bonus_purchase"
        }, 
"userId": "######",
        "userProperties": {
          "versionSessionCount": 11,
          "coin_balance": "######",
          "current_vip_points": "######",
          "isLowEndDevice": false,
          "user_id": "ce2adb8b-4bc6-4086-ae92-2a76e7d78bf0",
          "idfa": "######",
          "start_date": "2019-12-05",
          "lifetime_gem_balance": 3251,
          "current_module_name": "CasinoGameModule",
          "player_total_shields": 0,
          "gem_balance": 3251,
          "player_total_battles": 0,
          "abtest_tags": "",
          "abtest_values": "",
          "vip_days_left": 84,
          "vip_level": 0,
          "fb_profile": "0",
          "graphicsQuality": "HD",
          "game_name": "######",
          "level": 56,
          "internetReachability": "ReachableViaLocalAreaNetwork",
          "total_payments": 0
 },

In addition, the RudderStack SDK automatically captures contextual information regarding the type/Operating System/screen dimensions of the device, time of the day etc.

"context": {
          "timezone": "America/Los_Angeles",
          "network": {
            "cellular": false,
            "bluetooth": false,
            "carrier": "T-Mobile",
            "wifi": true
          },
          "device": {
            "name": "######",
            "id": "######",
            "manufacturer": "Google",
            "model": "Pixel 2"
          },
          "os": {
            "name": "Android",
            "version": "10"
          },
          "screen": {
            "density": 420,
            "height": 1080,
            "width": 1794
          },
          "traits": {
            "anonymousId": "######"
          },
          "userAgent": "######",
          "app": {
            "name": "game",
            "build": "102",
            "namespace": "######",
            "version": "1.2.0"
          },
          "locale": "en-US",
          "library": {
            "name": "rudder-android-library",
            "version": "1.0"
          }
}

The following section shows how Amazon Redshift loads the data received from RudderStack.

Loading Data into Amazon Redshift

RudderStack has native connectors to analytics databases like Redshift, one of the most popularly used data warehousing services in the cloud. All the events routed by RudderStack are stored in Redshift’s track table, with the entire JSON payload as a column. In addition, we also create separate tables for each event type (e.g. track event) with event properties as columns. The following images demonstrate two examples - one for a particular event type and the other is for all events:

Table for a particular event type

Table for event-agnostic information

The table for a specific event contains attributes specific to the event, whereas the second table contains information that is event-agnostic. The two tables can be linked by the id field.  

RudderStack does automatic schema management. It creates the table schema based on the event structure and keeps it updated if new fields are added to the event JSON or data types of existing fields are changed.

Redshift and other analytics databases store the table in columnar format. All the column values are stored together. This is very efficient for queries that only touch a small number of columns. For example, getting the total revenue requires fetching the revenue column data and summing it up. Adding a filter (SELECT) on a few columns is pretty efficient too. For example, to get the total revenue from a particular game, we need to fetch fields such as the game_name and revenue columns and sum the revenue values for matching games.

Once the data is loaded into Redshift, we can then perform analytics on it using a BI tool on top of Redshift.

Performing Visual Analytics in Redshift using Apache SuperSet


Apache SuperSet provides an intuitive, code-free interface for creating enterprise-ready dashboards and charts. It also provides a rich visual interface to write SQL  queries.

For analyzing the data, we will use Apache Superset to use the data from Redshift, and visualize the results of our queries. The following are some examples of this:

Identifying top users by event count

The query is as follows:

SELECT anonymous_id , COUNT (*) as count 
FROM torpedo_replay_source .tracks 
group by anonymous_id order by COUNT DESC 
limit 100

Execution Time : 0.98 seconds

Pie chart for the top 100 events 

The query is as follows:

SELECT event, COUNT (*) as count
FROM torpedo_replay_source .tracks
group by event
order by count desc limit 100

Execution Time :  09.01 seconds

Find people who have churned

As a more complex example, finding paying users whose activity count dropped by 90% week-over-week can be a bit more tricky as it requires:

  • Computing the user level week-over-week activity level by doing a GROUP-BY on user_id & week on the ‘tracks’ table
  • Doing a self JOIN of the above table to compute the activity drop from week X to week X+1
  • Filtering the above to those users which have > 90% activity drop and are paying users

The query is as below:


SELECT SUM(no_of_spin) as total_spins, anonymous_id, day 
 FROM 
  (
    SELECT anonymous_id, no_of_spin, DATE_TRUNC('day', sent_at) as day FROM torpedo_replay_source.spin_result 
     WHERE anonymous_id  IN 
        (
            SELECT DISTINCT(anonymous_id)
             FROM
             (
                SELECT T2.anonymous_id, (0-T2.no_of_spin)*100/(T2.no_of_spin+1) as inc_pct, 0 as spin_after, T2.no_of_spin as spin_before, T2.week+1 as wk_after, T2.week as wk_before
                    FROM
                        (
                            SELECT SUM(no_of_spin) as no_of_spin, anonymous_id, week FROM
                                (SELECT anonymous_id, no_of_spin, DATE_PART(w, timestamp) as week FROM torpedo_replay_source.spin_result)
                            GROUP BY anonymous_id, week
                        ) AS T1
                        
                        RIGHT JOIN
                        
                        (
                            SELECT SUM(no_of_spin) as no_of_spin, anonymous_id, week FROM
                                 (SELECT anonymous_id, no_of_spin, DATE_PART(w, timestamp) as week FROM torpedo_replay_source.spin_result)
                            GROUP BY anonymous_id, week
                        ) AS T2
                            
                        ON T1.anonymous_id = T2.anonymous_id AND T1.week=T2.week+1 WHERE T1.anonymous_id is NULL
                )
            WHERE wk_after = 52 AND 
            anonymous_id IN 
            (SELECT DISTINCT(anonymous_id) from torpedo_replay_source.revenue)
            ORDER BY anonymous_id 
            LIMIT 25
    )
  )
GROUP BY anonymous_id, day 

The corresponding graph is as shown below:

In the above example, we identified users who are churning week-over-week. That’s great, but can we engage with these users to get them to play the game again? Something that allows the game developers to send a push notification to these users, or showing them an ad on Facebook would be an invaluable feature to have.

To address such a problem, RudderStack will also support taking the output of a SQL query as a data source, and send it to cloud destinations like Facebook Ads, Mailchimp, etc. so that the users can be notified. This feature will be available in RudderStack soon.

Conclusion

We saw how RudderStack can be easily integrated with tools such as Redshift and SuperSet to build a seamless, efficient and enterprise-grade analytics stack.

While this blog focuses only on Redshift, the concept of building the analytics stack is equally applicable to any data warehouse which offers a SQL interface, such as Google BigQuery, Azure, etc. or their open-source equivalents like ClickHouse or Apache Druid.

To learn about how RudderStack can help you build your own analytics stack, schedule a demo.

Top comments (0)