
ChunTing Wu

Building a SNOWPLOW Playground

SNOWPLOW is an event tracking platform: it collects various user behavior events and makes them available for analysis. There are many well-known products in this space, such as Google Analytics and Mixpanel, but in the open source ecosystem SNOWPLOW has earned its place.

Although there is a paid version of SNOWPLOW, if I have to pay anyway I can't think of a reason to choose it over Mixpanel. For the open source version of SNOWPLOW, however, I can put forward plenty of advantages.

For example, the architecture is simple and easy to integrate with existing infrastructure, and there are many flexible components that can be customized. Most importantly, there are a lot of SDKs available for various scenarios, e.g., Web, Mobile apps and even backend systems.

This article will not cover SNOWPLOW's use cases, since they are described in detail in the official documentation. Instead, it introduces SNOWPLOW's architecture and provides a free playground.

Why emphasize free?

In fact, the official SNOWPLOW documentation does provide a Terraform-based quick-start environment, but it deploys to AWS or GCP and relies on a number of paid services. Someone just trying it out may only want to experience what SNOWPLOW can do without paying for it; at least I don't want to pay, so a free local test environment is necessary.

Architecture Overview

https://docs.snowplow.io/docs/understanding-your-pipeline/architecture-overview/

First of all, let's take a quick look at the SNOWPLOW architecture.

There are three core components in the whole infrastructure: the collector, the enricher, and the schema registry. These are the fundamentals of SNOWPLOW. Which data warehouse or analysis engine to use is a free choice and not part of the SNOWPLOW package.

  • Collector: Collects events from the SDKs. These events are raw data, so they are passed on to the enricher.
  • Schema Registry: The registry used in SNOWPLOW is the self-developed Iglu, a service that manages JSON schemas.
  • Enricher: After receiving the raw data, the enricher first validates it against the schema registered in the schema registry. If validation passes, the event is processed into a more analysis-friendly format and sent on to the next stage. Whether the data then goes to a data warehouse, Looker, or another analysis platform depends on your needs. A minimal example of the self-describing events being validated follows this list.
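
To make the validation step concrete, below is a minimal sketch of a self-describing event in TypeScript. The `iglu:` URI format is the real SNOWPLOW convention; the `com.acme` vendor and the `button_click` schema are hypothetical placeholders.

```typescript
// A self-describing event: the payload carries the Iglu URI of the
// JSON schema it claims to conform to, so the enricher can look the
// schema up in the registry and validate the data against it.
// "com.acme" and "button_click" are hypothetical placeholders.
interface SelfDescribingEvent {
  schema: string; // iglu:<vendor>/<name>/<format>/<version>
  data: Record<string, unknown>;
}

const event: SelfDescribingEvent = {
  schema: "iglu:com.acme/button_click/jsonschema/1-0-0",
  data: { buttonId: "checkout", pageUrl: "http://localhost/" },
};
```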

One of the more interesting parts of the process is the enrichment stage; the official documentation lists all the available plugins. Let's take a practical example: an incoming event only carries an IP field, but subsequent analysis depends on the geographic location, which the IP Lookup plugin can fill in.
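
As an illustration of what IP Lookup adds (the field names follow SNOWPLOW's canonical event model; the values are invented):

```typescript
// Before enrichment: the raw event only knows the client IP.
const rawEvent = {
  event: "page_view",
  user_ipaddress: "203.0.113.42", // documentation-range IP, made up
};

// After enrichment: the IP Lookup plugin resolves the IP to
// geographic fields from the canonical event model.
const enrichedEvent = {
  ...rawEvent,
  geo_country: "TW",
  geo_city: "Taipei",
  geo_latitude: 25.03,
  geo_longitude: 121.56,
};
```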

This is a simple but powerful architecture that includes both validation and enrichment. What's more, the events are flexible in form and can be customized to fit every need. In addition, these events can be consumed downstream in a variety of ways and flexibly integrated with the existing infrastructure.

Playground

https://github.com/wirelessr/snowplow-pipeline

This GitHub repository provides a docker-compose.yml (yes, we all love docker-compose). All the required components can be built locally, and the usage is written in the README, so it should not cause any problems.

Also, for testing purposes, in addition to viewing the local database, I've added a Kafka management console at localhost:9021, which makes it possible to visually inspect each event.
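
If you prefer the command line over the console, a small consumer works too. The sketch below uses the kafkajs library; the broker address localhost:9092 and the topic name enriched are assumptions about the playground's settings, so adjust them to match your docker-compose.yml.

```typescript
import { Kafka } from "kafkajs";

// Tail the enriched event stream from the playground's Kafka broker.
// The broker address and topic name below are assumptions -- check
// docker-compose.yml for the actual values.
const kafka = new Kafka({ clientId: "playground-tap", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "playground-tap" });

async function main() {
  await consumer.connect();
  await consumer.subscribe({ topic: "enriched", fromBeginning: true });
  await consumer.run({
    // Each enriched event arrives as one line in Snowplow's canonical format.
    eachMessage: async ({ message }) => {
      console.log(message.value?.toString());
    },
  });
}

main().catch(console.error);
```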

Each component's settings live in a config.hocon file under the corresponding component folder.

Let's describe the architecture of the test environment as follows.

(Mermaid diagram: mock web → collector → Kafka → enricher → Kafka → mock processor)

  1. There is a mock web page with the JavaScript SDK installed; after visiting http://localhost it periodically sends events to the collector (see the tracker sketch after this list).
  2. When the collector receives an event, it stores it in the database (atomic.events) and sends it to Kafka.
  3. When the enricher receives an event, it first confirms the schema with the Iglu server, then performs basic enrichment and sends the result to Kafka.
  4. Finally, the mock processor simulates how the enriched events should be handled.
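
For reference, initializing the JavaScript SDK on such a mock page looks roughly like the following. This is a minimal sketch using Snowplow's @snowplow/browser-tracker package; the collector endpoint localhost:8080 and the appId are assumptions, not values taken from the repository.

```typescript
import { newTracker, trackPageView } from "@snowplow/browser-tracker";

// Point the tracker at the local collector. The endpoint and appId
// are assumptions -- use the values from the playground's README.
newTracker("sp", "http://localhost:8080", {
  appId: "playground",
  platform: "web",
});

// Fire a page view right away, then one every 10 seconds to mimic
// the mock web's periodic events.
trackPageView();
setInterval(() => trackPageView(), 10_000);
```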

The whole process is straightforward. After setting up the environment, we can observe the complete data flow in Kafka's management console.

Conclusion

In fact, I have tried SNOWPLOW's cloud-based Enterprise version, and the biggest difference I felt was schema management. In the open-source version, if we want to customize the event fields, we have to integrate with Iglu's REST API and understand Iglu's design concepts. In the cloud version, however, there is an easy-to-use UI that makes managing schemas much easier.
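
To give a feel for that integration, here is a hedged sketch of registering a custom schema with a local Iglu server over HTTP. The /api/schemas path and the apikey header follow Iglu Server's REST API as I understand it; the URL, the API key, and the com.acme/button_click schema are hypothetical placeholders.

```typescript
// A self-describing JSON schema wrapping the button_click event.
// The meta-schema URI is Iglu's standard one; vendor/name/version
// here are hypothetical placeholders.
const schema = {
  $schema: "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  self: {
    vendor: "com.acme",
    name: "button_click",
    format: "jsonschema",
    version: "1-0-0",
  },
  type: "object",
  properties: { buttonId: { type: "string" } },
  required: ["buttonId"],
  additionalProperties: false,
};

async function registerSchema() {
  const res = await fetch("http://localhost:8081/api/schemas", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      apikey: "<your-iglu-api-key>", // hypothetical placeholder
    },
    body: JSON.stringify(schema),
  });
  console.log(res.status, await res.text());
}

registerSchema().catch(console.error);
```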

Nevertheless, if we just want to do basic event tracking, I believe the open-source version of SNOWPLOW is more than capable. Of course, deeper usage requires a more comprehensive understanding and integration effort, which is the price to pay for open source software.
