Robertino

Originally published at auth0.com

Data Pipelines on Auth0’s New Private Cloud Platform

Data goes in; data goes out. You can’t explain that.


TL;DR: Auth0 has replatformed our private cloud offering onto Kubernetes. As part of this new architecture, we’ve had the opportunity to apply modern practices to the data infrastructure that powers near real-time features and provides the necessary knobs for data governance.

Hello friends! Before we get started, I recommend that you check out “A Technical Primer of Auth0’s New Private Cloud Platform”, “Architecture of the New Platform”, and “Security of the New Platform”. These wonderful posts cover foundational aspects of the platform that I will build upon here.

Introduction

“Data is the new oil,” quipped Clive Humby in 2006, and the industry has been chanting “Drill, baby, drill” ever since. Just like oil, data needs to be refined/processed/transformed to unlock its full value. As the saying goes, “Necessity is the mother of invention,” so the industry has birthed an entire specialization around Data (Science, Engineering, Platform, Warehouse) to extract that value. The analogy remains apt because, just like oil, data is now being respected as a potentially hazardous material. With GDPR, CCPA, and their future brethren, we’ve come to understand the importance of proper packaging and safekeeping of our data assets. No more ziplock bags of data goop in the trunk of my Dodge Neon. Neatly stacked steel crates with big padlocks under 24/7 surveillance that can be crypto-shredded at a moment's notice are the future of data management. With this New Platform initiative at Auth0, we’ve had the opportunity to apply some modern practices to our data infrastructure and take another step towards that dystopian data future.

Taking a page from Leo’s 10k foot view in “Architecture of the New Platform”, we’ll be touching on the highlighted sections of the platform in this post. We’ll start with some introductory context, discuss our guiding principles, dive into the tools we chose, and then exercise them with a few sample scenarios commonly seen at Auth0.

Put down that bag of data goop, and let’s get started!

What is a Data Pipeline?

“In a world where SaaS and microservices have taken over...” — Movie Voiceover Guy

Looks around

“Hey, he’s talking about us!”

Software is on a hockey stick growth curve of complexity. We’ve got an ever-increasing number of disparate systems hoarding data and an even higher demand on us to unlock value from it.

  1. “Show me the region where credential stuffing attacks are originating from” — A customer dashboard
  2. “Let me run custom code in an Auth0 Action immediately after a user is deleted” — A feature request
  3. “Show me failed login counts from the past 60 days for all tenants on the Performance tier that have been a paying customer for more than one year” — An upsell opportunity

Before we can begin to actually answer any of those questions, we need the connective tissue to ship data from Point A to Point B enriched with E and transformed by T. That’s the role of the Data Pipeline.

That’s not to say that Data Pipelines just carry simple domain/product events. Application logs, exception stack traces, Stripe purchase orders, Salesforce billing updates, etc., are all events that can be emitted and flow through a pipeline.

“The Data Pipeline is a series of tubes.” — Michael Scott

Eh, close enough.
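Before we move on, here’s what one segment of those tubes might look like in practice. This is a minimal TypeScript sketch of a single pipeline stage that takes an event from Point A, enriches it (E), transforms it (T), and hands it to Point B. The event shape, the region lookup, and the sink are hypothetical stand-ins for whatever your real enrichment sources and destinations are.

```typescript
// Hypothetical event entering the pipeline at Point A.
interface UserCreatedEvent {
  type: "UserCreated";
  tenantId: string;
  userId: string;
  occurredAt: string; // ISO-8601 timestamp
}

// The shape we ship to Point B after enrichment (E) and transformation (T).
interface UserCreatedRecord extends UserCreatedEvent {
  region: string; // E: joined in from tenant metadata
}

// A single pipeline stage: enrich, transform, then hand off downstream.
async function processUserCreated(
  event: UserCreatedEvent,
  lookupRegion: (tenantId: string) => Promise<string>, // hypothetical enrichment source
  sink: (record: UserCreatedRecord) => Promise<void>   // hypothetical downstream sink (Point B)
): Promise<void> {
  const record: UserCreatedRecord = {
    ...event,
    region: await lookupRegion(event.tenantId), // E: enrichment
    userId: event.userId.toLowerCase(),         // T: a trivial normalization transform
  };
  await sink(record);
}
```

In a real pipeline, that sink might be a Kafka topic, a queue, or a warehouse loader; the point is that each stage stays small and composable.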

Our Guiding Principles

Before jumping into implementation, it’s important to understand that our requirements are not necessarily your requirements. The decisions we’ve made might not be the correct decisions for you! When evaluating solutions and discussing tradeoffs, these are some of the key points that helped guide us.

Event Durability

Event durability is a supremely important requirement for us; missing and out-of-order events are a non-starter.

If we imagine emitting a simple event, e.g., “UserCreated,” there are a few approaches:

  1. Integration/domain event from application code
  2. Outbox Pattern
  3. Change Data Capture (CDC)

Discussing the tradeoffs between these is a blog post in itself, so I’ll simply say that they’re complementary and fit different needs, with the main tradeoff being durability vs. coupling.

Events emitted by application code can get lost when a dual write partially fails. The Outbox Pattern struggles if your database of choice doesn’t offer strong transactional guarantees, and it doubles your write load. Change Data Capture (CDC) is the most durable of the three, but it “ships” the database table schema in each event, promoting unnecessarily tight coupling, and it only sees the modified row.
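Of the three, the Outbox Pattern is the easiest to show in a few lines. The sketch below assumes a relational database with transactions and a hypothetical outbox table; the client calls are modeled on node-postgres, and the relay process that actually publishes outbox rows to the pipeline is left out.

```typescript
import { Pool } from "pg"; // assuming node-postgres; any transactional client works

const pool = new Pool();

// Outbox Pattern: write the business row and the event row in ONE transaction,
// so the event can never be lost while the write succeeds (or vice versa).
async function createUser(email: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    const { rows } = await client.query(
      "INSERT INTO users (email) VALUES ($1) RETURNING id",
      [email]
    );

    // Hypothetical outbox table: a separate relay polls it (or tails its
    // changes) and publishes rows to the pipeline, marking them as sent.
    await client.query(
      "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES ($1, $2, $3)",
      [rows[0].id, "UserCreated", JSON.stringify({ userId: rows[0].id })]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```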

Data Governance (O Data, Where art thou?)

Data Pipelines come in different shapes and sizes.

  • Are messages serialized as Avro, JSON, or Protobuf? XML? Crickets.
  • Are they encrypted or plain text?
  • What is the expected throughput? Sustained or burst?
  • Streaming or batch processing?
  • Do producers retry or drop events during downtime?
  • What enrichments/transforms do we need to apply?

Before discussing these technical details of a pipeline, we must first consider what is in the pipeline.

  1. Do messages within the pipeline contain Customer Data?
  2. Do messages within the pipeline need to leave the region where they originated?

If you answered “Yes” to one or more of the previous questions, calmly go outside and light your computer on fire. Joking aside, by asking and answering those questions, you can begin to understand the security/compliance posture you must take. This is likely to lead you down the road of even more questions.

  • What region is data being shipped to relative to where it’s generated?
  • Are we sending aggregated data or individual records?
  • How is this data encrypted in transit and at rest?
  • What Customer Data (down to each individual field) are we processing?
  • Are there treatments that are sufficient for each field to be compliant while still fulfilling the business need? Masking, partial redaction, hashing, encryption, etc.

Remember that data is a valuable resource but can also be a hazardous material and must be handled properly. If you discover cases of processing Customer Data and shipping it across geographic boundaries, find yourself a buddy in the compliance department to help sort it out.
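To make those field-level treatments a bit more concrete, here’s a small TypeScript sketch that applies a per-field policy (pass, hash, mask, or drop) to a record before it crosses a geographic boundary. The policy map and field names are invented for illustration; the real ones should come out of that conversation with your compliance buddy.

```typescript
import { createHash } from "crypto";

type Treatment = "pass" | "hash" | "mask" | "drop";

// Hypothetical per-field policy, agreed upon with compliance.
const policy: Record<string, Treatment> = {
  tenant_id: "pass",
  event_type: "pass",
  user_email: "hash",    // stable pseudonym for joins; no raw value leaves the region
  user_ip: "mask",       // keep coarse network info only
  session_token: "drop", // never ship secrets through the pipeline
};

function hashValue(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

function maskIp(value: string): string {
  // Keep the first two octets of an IPv4 address, redact the rest.
  const parts = value.split(".");
  return parts.length === 4 ? `${parts[0]}.${parts[1]}.x.x` : "redacted";
}

// Apply the policy before the record crosses a geographic boundary.
function applyPolicy(record: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [field, value] of Object.entries(record)) {
    switch (policy[field] ?? "drop") { // unknown fields default to drop
      case "pass": out[field] = value; break;
      case "hash": out[field] = hashValue(value); break;
      case "mask": out[field] = maskIp(value); break;
      case "drop": break;
    }
  }
  return out;
}
```

Defaulting unknown fields to drop is the important design choice here: a new field added upstream should have to earn its way across the boundary, not slide through by accident.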
