Choosing Your Data Warehouse
Sena Heydari Jun 17, 2017
Last year, I had the distinct pleasure of working daily with The Practical Dev's own @jessleenyc as she product managed the Data team (codenamed Quasar internally) at DoSomething.org (Jess references some of this in her excellent post here). This post follows the trials, tribulations, and mis-steps of a novice data engineer trying to decide on the right data warehouse for DoSomething.org as we re-built our data platform from the ground up.
I'll preface this journey by stating: I have a strong devOps background (little
d intentional) up until about a year ago. If you're a seasoned data engineer/hacker/awesome-person-who-gets-it, please leave comments and I'll happily update/incorporate them into this post or a follow-up one!
The legacy system that powered all of DoSomething.org's analytics was a single beefy MySQL server and a Jenkins server that ran a bunch of Python ETL scripts. This system was originally running on a single custom hacked together box in our office server closet. During our org move to AWS, I helped migrate the system to 2 separate instances, the MySQL server and Jenkins/ETL server. This was still far from ideal, as the system had no system configuration stored anywhere in code, and we were manually maintaining patching and upgrades. The setup had served the org very well for almost 5 years, but the original small cracks in the system were starting to turn into dangerous fissures. Some dashboards wouldn't load properly, we had a single 9 of uptime, and every server crash or bug became a 5-alarm-fire that had to be put out immediately or no one had access to any data. It was time to rebuild!
Phase One: Excitement and Research
The first phase was one almost every engineer is familiar with: the allure of the new shiny tech, the one piece of software to Solve All the Problems™, the "best" solution for your particular problem set without any regard for considerations other than purely technical ones and personal bias. The existing database schema was free-range, grass-fed, impeccably manicured for the data questions of yesteryear, but ready to be put out to pasture. But it was stable. We got to start entirely from scratch with no restrictions!! A truly once in a half-decade opportunity!
Armed with this freedom, I set about researching data pipelines and infrastructure setups from companies large, medium, and small. I watched AWS Re:Invent videos and Google Cloud Next videos by the dozens. We could ETL and present any data source we wanted with amazing tools that Just Worked™. The era of oppressive data limitations were over. We were going to build a data lake, nay, a DATA OCEAN! Anyone could ask any question and find their own answer immediately!
Phase Two: Prototyping
Out of this heady and limitless realm of possibilities emerged: everything on Google Cloud! Server instances with more network throughput, compute, and better pricing. BigQuery for zero server provisioned data warehouse, and get ready for Google Dataproc: we're going to Map/Reduce and Spark process the shit out of our data, no dark corners left unexposed. I even presented this insanity to one of the smartest group of developers I've ever worked with in an internal lunch and learn. It took less than a week of prototyping and testing to understand that while Google Cloud is better in some ways than AWS, we had no domain expertise in it, and I had no time to figure out a new infrastructure provider while figuring out how to code in Python (and by extension code at all), design new schema for our data warehouse, and learn a new subset of SQL all at the same time. Talk about dying with a whimper.
With a bit of newfound humbleness and thinking about scope, my amazing CTO and I had a long debrief, in which he gently pointed out that we spent a week of time learning some valuable lessons: what Google Cloud setup ramp-up looked like migrating from AWS, in general Google Cloud was technically superior but we weren't really taxing our AWS ETL server as it was, among others. So venturing back to our Cloud 9 of AWS, where all of our apps are hosted, I quickly dug into AWS RedShift. With 5 years of AWS experience, cluster spin-up and provisioning was pretty easy and straightforward. I migrated some of our data sets in using AWS Database Migration Service, an invaluable tool, and set our data scientist/analysts free on the test data. And then came the pain/pleasure of direct user feedback & honesty to make you see the light quickly. Our chief data officer had used PostgreSQL during his undergrad years, but those skills were 5 years out of date, and our data analyst only knew and MySQL. Our particular data schema didn't lend itself to easily use the lowest tier Redshift instances. This meant a 40x price increase for a cluster using the next instance size up. Efffffff...
Thankfully this experiment was better contained to a few days. I re-grouped with the CTO, now brutally aware that working on my own in a corner in an entirely new domain was a bad idea, no matter how many articles I read. So, I took a couple of days to think things through while working on other issues and let some ideas simmer. In the meantime, the CTO reached out to his technical network and gathered advice from experts in the field. In retrospect, I could have contributed much more by sending out queries via Reddit/StackOverflow, etc. In the end, we approached the problem from an angle painfully obvious in retrospect: can we use tools we're already knowledgeable about and that our 15 in-house engineers have experience with to meet our data needs? Can our data team use those same tools? The answer was an overwhelming YES!!
Phase Three: Implementation of MySQL RDS
We ended up going with 2 beefy MySQL RDS instances. We moved away from managing our own MySQL server and simplified using a cloud-native service. And it's been powerfully enabling for the past year. We're setup so a multi-AZ RDS MySQL 5.7 master instance is utilized by our redundant ETL servers for all the heavy data transformation. There is a read-only slave instance that has the
read_only = 0 setting enabled that Looker, which handles all of our data visualizations, connects to and serves as our dedicated analytics node.
Proof is in the pudding
Over the past year, we've had the following improvements (maintained by a novice data engineer with no other dedicated support):
- Our data warehouse has had 99.9% uptime with no effort on my part (also backups are automatically taken care of). The read-only slave had a replication failure at one point, and I had to press 2 buttons and wait an hour or two to have it fully back and re-attached to Looker. Over the 6 months previous to using RDS, I'd had to re-build our MySQL server instance 3 times due to various issues.
- I've been able to focus on implementing new data warehouse schema to replace our beloved legacy schema and have it tested consistently by our data team.
- With the help of many internal and external mentors, I've gone from disaster software monkey in Python to pseudo-respectable novice. All through it, my glorified ETL scripts (I don't really think they qualify as "code") have given our org more accurate and improved data.
There are some larger lessons that came out of this experience that I'd like to share as well:
- As an org, we don't really have big data. We have medium data at best and will for a while. Our MySQL platform, while not traditionally used as an OLAP data warehouse is serving us really well. Use the right tools you know for the problem at hand before exploring shiny new ones. Always with an eye to the future, of course.
- Small, iterative experiments are always better than large sweeping ones. I only built out a partial test environment in Google Cloud before deciding to try the next thing. Defining time and dollar limits to your experiments is invaluable.
- The problem you're trying to solve probably isn't unique, and that's a good thing. Read up on solutions, but more importantly, ask experts. Go to meetups, ask on Reddit, ask on Twitter, ask on Practical Dev! ;)