Thomas Brittain

Posted on • Edited on • Originally published at ladvien.com

Let's Build a Data Warehouse

The Reason

I am the lead data engineer at Bitfocus, the best SaaS provider for homeless management information systems (HMIS).

For many reasons, I've advocated with our executives to switch our analytics flow to use a data warehouse. Strangely, it worked. I'm now positioned to design the beast. This series will be my humble attempt to journal everything I learn along the way.

The Plan

I've decided it would be better to learn how to build an analytics warehouse on my own before committing our entire business to an untested technology stack. Of course, I needed a project similar enough to present the same challenges. I've decided to expose my Apple HealthKit data in a business intelligence platform like Looker.

*Data warehouse flow diagram for exposing Apple Health data*

The Stack

I've attempted similar health data projects before.

But I'm going to be a bit more ambitious this time. I'd like to get all of the data generated by my Apple Watch and iPhone out of HealthKit and send them to a SQL database inside AWS.

The ingestion of these data will be lightly structured. That is, no transformations will be done until the data are in the database. I'll use the Auto Health Export iOS app to send data to a web API endpoint.
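To make "lightly structured" concrete, here's a minimal sketch of the landing table I have in mind: the whole payload goes into a single jsonb column, untouched, with only a received timestamp alongside it. The `raw.health_export` schema and table names are placeholders I chose for this series, not anything dictated by the app or AWS.

```python
# Sketch of a "lightly structured" landing table in Postgres. The payload is
# stored verbatim as jsonb; all parsing happens later, in SQL. The schema and
# table names are placeholders.
import psycopg2  # pip install psycopg2-binary

CREATE_LANDING_TABLE = """
create schema if not exists raw;

create table if not exists raw.health_export (
    id          bigserial primary key,
    received_at timestamptz not null default now(),
    payload     jsonb not null
);
"""

def create_landing_table(dsn: str) -> None:
    """Create the raw landing table if it doesn't already exist."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(CREATE_LANDING_TABLE)

if __name__ == "__main__":
    create_landing_table("postgresql://user:password@localhost:5432/health")
```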

The web API endpoint will be created using the Serverless Framework. This should allow me to build a REST API on top of AWS Lambda functions, enabling the Auto Health Export app to synchronize data to the database periodically (as often as every 5 minutes, though the interval is ultimately dictated by Apple).

Once the data are stored in the database, I'll use dbt (data build tool) to transform them for exposure in the business intelligence tool. This idea of processing data inside the database is often referred to as "extract, load, transform," or "ELT," which is steadily displacing the classical "extract, transform, load," or "ETL."

After transformation, I'll use dbt to shove the data back into a database built for analytical query processing. Aka, a "data warehouse." The most popular analytics database technologies right now are Snowflake and Redshift. I'll use neither. I can't afford them, and I despise how pushy their salespeople are. Instead, I'll use Postgres. (Redshift began life as a Postgres fork anyway.)

Lastly, I'll stand up Lightdash as the business intelligence tool.

Tools

Auto Health Export

As mentioned, I've attempted to pull data from the Apple ecosystem several times in the past. One challenge has been writing an iOS app to pull HealthKit data from an Apple device and send it to a REST API. It's not hard; I'm just not an iOS developer, which makes it time-consuming. And I'm at the point in my tech career where I'm comfortable purchasing others' solutions to my problems.

Anyway, Auto Health Export isn't a perfect app, but it does have the ability to pull your recent HealthKit data and send it to a REST API endpoint.
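I haven't confirmed the exact JSON schema the app emits (that's for the next post), but for the sketches in this series I'm assuming a shape roughly like this: named metrics, each carrying an array of timestamped readings. Treat every field name here as an assumption, not the app's documented format.

```python
# Assumed payload shape for this series' sketches -- NOT Auto Health Export's
# documented schema. The field names are guesses I'll verify in the next post.
example_payload = {
    "data": {
        "metrics": [
            {
                "name": "heart_rate",
                "units": "count/min",
                "data": [
                    {"date": "2021-11-01 07:45:00 -0500", "qty": 62.0},
                    {"date": "2021-11-01 07:50:00 -0500", "qty": 64.5},
                ],
            }
        ]
    }
}
```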

Serverless

The team creating our SaaS product has talked about going "serverless" for a while. In an attempt to keep up, I decided to focus on creating Lambda and API Gateway services to receive the data from the Auto Health Export app.

While I was researching serverless architecture, I ran into the Serverless Framework. I quite like it. It lets you focus on the Lambda code itself, building out the needed infrastructure for you on deployment.

AWS Lambda

This is the heart of the ingestion. The Lambda function will receive the health data as JSON and store it in the Postgres database for transformation.
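Here's a rough sketch of what that function could look like, assuming the `raw.health_export` landing table from earlier and an API Gateway proxy integration (so the app's JSON arrives in `event["body"]`). The connection string comes from an environment variable I've named `DATABASE_URL`; error handling is minimal on purpose.

```python
# Sketch of the ingestion Lambda: take the request body as-is and land it in
# Postgres untransformed. Assumes the raw.health_export table sketched earlier
# and a DATABASE_URL environment variable; both are my own conventions.
import json
import os

import psycopg2

def handler(event, context):
    """Receive Auto Health Export's JSON and store it untransformed."""
    body = event.get("body") or "{}"
    json.loads(body)  # validate it parses; store the raw text as jsonb

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                "insert into raw.health_export (payload) values (%s::jsonb)",
                (body,),
            )
    finally:
        conn.close()

    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```

The Serverless Framework's job is then to wire this handler to an API Gateway route and handle the deployment, which I'll walk through in the next post.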

Postgres

I've selected Postgres after a good deal of research; I'll detail my reasons for the choice later in the series.

Regardless of tech choice, this database will hold untransformed data, or "raw data," as well as act as the processing engine for transforming the data before moving it into the analytics warehouse.

dbt


Data build tool, or dbt, is quickly becoming the de facto tool for transforming raw data into an analytics data warehouse. At its core, it uses SQL and Jinja to enable consistent and powerful transformation "models." These models rely on a scalable SQL database, known as a "processing engine," to transform the data before sending it on to its final destination, the analytics warehouse.
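To make the "models are just SQL" idea concrete: given the payload shape I assumed earlier, a staging model that flattens heart-rate readings out of the raw jsonb might boil down to the query below. In an actual dbt project this SELECT would live in its own .sql model file, with a Jinja `{{ source() }}` reference instead of the hard-coded table name; I've wrapped it in Python here only to keep the series' sketches in one language.

```python
# The kind of SELECT a dbt staging model might contain, given my assumed
# payload shape. In dbt this would live in models/staging/stg_heart_rate.sql,
# with {{ source('raw', 'health_export') }} replacing the table name.
import psycopg2

STG_HEART_RATE = """
select
    (reading ->> 'date')::timestamptz as measured_at,
    (reading ->> 'qty')::numeric      as bpm
from raw.health_export,
     jsonb_array_elements(payload -> 'data' -> 'metrics') as metric,
     jsonb_array_elements(metric -> 'data')               as reading
where metric ->> 'name' = 'heart_rate'
"""

def preview_staging_model(dsn: str) -> list:
    """Run the staging SELECT directly; dbt would materialize it as a view or table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(STG_HEART_RATE)
        return cur.fetchall()
```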

MariaDB with ColumnStore

I'd like to use MariaDB as the actual analytics data warehouse. It is an unconventional choice compared to Snowflake or Redshift. But I'm attempting to use open-source software (OSS) as much as possible, as our company is having a horrific experience with the downfall of Looker.

One of the better-kept secrets about MariaDB is its ColumnStore engine. It offers four major features which have convinced me to try it out:

  • It stores data by column, making it faster for analytics queries
  • The data store can be distributed among S3 buckets
  • Data are run-length encoded, greatly reducing storage size
  • ColumnStore engine tables can be joined with InnoDB tables

I'm also curious whether MariaDB could possibly be used as a processing engine, as I'd prefer to reduce the cognitive complexity of the stack by having only one SQL dialect.
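As a taste of what using ColumnStore looks like: the engine choice is just a storage-engine clause on an otherwise ordinary CREATE TABLE. The sketch below uses MariaDB's Python connector; the table, columns, and credentials are all placeholders of mine. There are no indexes or primary key in the DDL, since ColumnStore manages its own column-oriented storage.

```python
# Sketch: a ColumnStore-backed analytics table in MariaDB. The engine clause is
# the only real difference from an ordinary CREATE TABLE. All names and
# credentials below are placeholders.
import mariadb  # pip install mariadb

CREATE_ANALYTICS_TABLE = """
create table if not exists daily_heart_rate (
    day      date,
    avg_bpm  decimal(5, 2),
    readings int
) engine=ColumnStore
"""

def create_analytics_table() -> None:
    conn = mariadb.connect(
        user="analytics",
        password="secret",
        host="127.0.0.1",
        port=3306,
        database="warehouse",
    )
    try:
        conn.cursor().execute(CREATE_ANALYTICS_TABLE)
        conn.commit()
    finally:
        conn.close()
```

Dimension-style lookup tables could stay on InnoDB and still join against a table like this one, which is the cross-engine join feature from the list above.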

Lightdash

At this point, you've probably gathered I'm not happy with what Google has done to Looker. That stated, Looker is still my favorite business intelligence tool. Luckily, there is a budding Looker alternative that meets my OSS requirement: Lightdash. I've never used it, let alone deployed it into production. I guess we'll see how it goes.

Terraform


You may have noticed that a data warehouse requires a lot of infrastructure. I hate spinning up infrastructure through a UI: it isn't repeatable, it's not in version control, and I like writing code. Luckily, HashiCorp's got my back with Terraform, which allows defining and deploying infrastructure as code (IaC).


Terragrunt

Terragrunt is a wonderful addition to Terraform. It keeps Terraform configurations DRY, which makes maintaining an entire company's infrastructure far more manageable.


The Beginning

First up, let's take a look at the Auto Health Export app and our Serverless architecture. A couple of personal notes before jumping in.

I'm writing this series as a journal of what I learn. That stated, be aware I may switch solutions anywhere in our stack, at any time.

Also, I know a lot of stuff. But there's more stuff I don't know. If I make a mistake, please let me know in the comments. All feedback is appreciated, but respectful feedback is adored.
