Data Sharing: Why Upstream Solutions Work Best

#privacy #sql #database #architecture

Last updated: Oct 11, 2022

In order to share production data, you need the means to generate production samples and get them to where they need to go. You also need to figure out how to protect the privacy of your customers where your data is sensitive (such as PII, or personally identifiable information).

Let's Start With The Data Privacy Part of This

There are many large and expensive data privacy solutions, most of them targeted at large enterprises. They generally take a defensive approach, helping you identify PII and sensitive information that has already proliferated throughout your organization (in other words, damage control). Some sit between your production databases and an end-user client app and provide dynamic data masking, which is great for use cases that involve accessing individual records, but not so great when you need to work directly with large datasets for development, machine learning modelling, analytics, etc.

As a developer, I'm interested in solutions that take proactive approaches to mitigating potential problems with a sane architecture and automated process (especially with something as important as data), rather than just minimizing the impact of a dirty mess with hacks, or constant fiddling and ongoing toil.

The patterns I've seen throughout my career inspired me to build a company called Redactics: kicking the can down the road on tackling these problems properly, processes built around VPN access to the production database, trying to minimize damage by restricting access to read replicas, relying on people adhering to policies, and building hacky homemade scripts and solutions for creating production data samples. I'm not a salesperson (I cringe at my own attempts at sales), but the point here is that I built Redactics out of my motivation to automate scalable solutions to these problems. Data sharing can be a little finicky and tedious (with long feedback loops) in and of itself, even without taking data privacy into account. And once you do take it into account, existing data privacy solutions are applied to the wrong place in your data architecture.

What's An "Upstream" Solution?

An upstream solution (for lack of a better term; I'm kind of making this up) is a solution that is applied as close as possible to the source of the data, in this case redacting sensitive information before it is exposed to other systems. Rather than conducting intensive sweeps looking for sensitive data across your organization, or getting into dynamic data masking, data virtualization proxies, etc., it is best to automate sending "safe" data to where it needs to go so that there simply is no direct human contact with anything but this safe data. In this way, there is no longer a need to access your production database directly.
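To make "redacting at the source" a bit more concrete, here is a minimal Python sketch of per-column redaction rules applied to a single record before it leaves the production boundary. This is only an illustration and not how Redactics itself works: the column names (`email`, `full_name`), the `RULES` mapping, and the `REDACTION_KEY` placeholder are all assumptions made up for the example.

```python
import hashlib
import hmac

# Assumption: a secret kept alongside the production database, never shared
# with downstream consumers. The value here is a placeholder for illustration.
REDACTION_KEY = b"replace-with-a-real-secret"


def redact_email(value: str) -> str:
    """Swap an email for a stable placeholder derived from a keyed hash.

    The same input always maps to the same output, so joins and uniqueness
    constraints still behave, but the original address is not exposed.
    """
    digest = hmac.new(
        REDACTION_KEY, value.lower().encode("utf-8"), hashlib.sha256
    ).hexdigest()[:12]
    return f"user_{digest}@example.com"


def redact_name(_value: str) -> str:
    """Replace a name with a constant; no information is preserved."""
    return "REDACTED"


# Hypothetical mapping of sensitive column names to redaction functions.
RULES = {"email": redact_email, "full_name": redact_name}


def redact_row(row: dict) -> dict:
    """Return a copy of the row with every sensitive column redacted."""
    return {
        col: RULES[col](val) if col in RULES and val is not None else val
        for col, val in row.items()
    }
```

Calling `redact_row({"id": 7, "email": "Jane@Corp.com", "full_name": "Jane Doe", "plan": "pro"})` leaves `id` and `plan` untouched and replaces the other two fields. A keyed hash is used here rather than a plain one on the assumption that guessable values like email addresses could otherwise be recovered by hashing candidate inputs.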

To do this, you need a tool that has direct access to your production database. Many companies would prefer not to trust some SaaS provider with direct access to their production databases (although this seems to be required by some existing offerings), so this needs to be something you can install directly into your own infrastructure, and something you can trust rather than some weird black box. Even if you trusted this SaaS provider with your life, do you want to pay the bandwidth costs and pay to use their infrastructure on top of the costs of your own infrastructure? If you can set this up, you can fire off your safe/cleansed data to wherever it needs to go. I'm basically proposing an on-prem ETL solution, but one with data privacy as its core focus.
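The "on-prem ETL with privacy as its core focus" idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of Redactics: in-memory SQLite databases stand in for both the production database and the downstream target, and the `users` table, its columns, and the `export_safe_users` function are hypothetical names invented for the example.

```python
import sqlite3


def export_safe_users(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Copy users from source to target, redacting PII columns in transit.

    Only the redacted rows are ever written to the target, so downstream
    consumers never need a connection to the source database.
    """
    target.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, email TEXT, full_name TEXT, plan TEXT)"
    )
    rows = source.execute("SELECT id, email, full_name, plan FROM users")
    # Extract and transform: keep non-sensitive columns, replace the rest.
    safe_rows = [
        (uid, f"user{uid}@example.com", "REDACTED", plan)
        for uid, _email, _name, plan in rows
    ]
    # Load: write only the safe rows to the target.
    target.executemany(
        "INSERT INTO users (id, email, full_name, plan) VALUES (?, ?, ?, ?)",
        safe_rows,
    )
    target.commit()
    return len(safe_rows)


def demo() -> list:
    """Build a fake 'production' database, run the export, return the target rows."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE users "
        "(id INTEGER PRIMARY KEY, email TEXT, full_name TEXT, plan TEXT)"
    )
    source.executemany(
        "INSERT INTO users (id, email, full_name, plan) VALUES (?, ?, ?, ?)",
        [
            (1, "jane@corp.com", "Jane Doe", "pro"),
            (2, "raj@corp.com", "Raj Patel", "free"),
        ],
    )
    target = sqlite3.connect(":memory:")
    export_safe_users(source, target)
    return target.execute(
        "SELECT id, email, full_name, plan FROM users ORDER BY id"
    ).fetchall()
```

The design choice worth noticing is that the job runs next to the source and is the only thing holding source credentials; everything downstream only ever sees the target. In a real setup this would presumably run on a schedule inside your own infrastructure rather than be invoked by hand.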

How Do I Carve Out Exceptions and/or Build a Stepping Stone to Get There?

There are often needs outside of your application for doing something with production data, but these should be automated with code as much as possible so that, at the end of the day, there are no longer scenarios where employees are sitting at the command line of your database able to issue whatever SQL queries they want. Why? We know that humans are the leading cause of data breaches (statistically, over 90% of the time this is so). Whether this is because they are bribed/coerced or make a mistake, it almost doesn't matter. They can also accidentally issue a harmful SQL query that either chews through all of your database resources or taints your data. You obviously don't want any of this, and whatever the scenario is, the point stands: we are fallible. This form of direct access is best reserved for emergency "break glass" scenarios.

This is admittedly a bit of a strawman argument since I think you would be hard-pressed to find a data privacy company that would disagree with this (they just have different approaches to these problems), but I would argue that the "an ounce of prevention is worth a pound of cure" approach is superior: that is, just make safe data your default. In other words: data privacy by design.

Okay, Your Awesome Points Are Compelling! Do I Have to Build This Myself?

This is where my bad salesing returns. Honestly, Redactics is free for developer usage, and I would genuinely recommend this approach even if I had zero involvement with this project.

I won't lay out all of Redactics' sales arguments and bells and whistles, but I would sincerely appreciate your checking it out and providing us with your feedback. We are a brand new company and it is my dream to get it off the ground, so please contact us via the form on our website!
