I recently read Designing Data-Intensive Applications by Martin Kleppmann and I really appreciated it. I especially liked that the book covers a wide range of topics, and I think it’s a very good introduction to the subject. I had prior exposure to most of the covered topics in university classes, but the book was a useful refresher and it still taught me several new things.
The book starts with data modeling and querying: it compares the different models and explains what kinds of tasks each one is suitable for. Going through that chapter will help you decide when it makes sense to use a relational database management system and when a NoSQL database is a better choice, just to name two of the most common paradigms. I also enjoyed that the book doesn’t stop there: it covers a good variety of approaches and puts them in their historical context.
Once you know how to model your data and your queries, it’s time to dive deeper into the implementation of a data storage system. The book covers the different implementation techniques and the commonly used data structures in enough detail that you’ll know what properties to expect from a system implemented in a certain way.
The central part of the book shows how to build a data-intensive system with certain properties. Here you’ll discover how spreading the load over multiple machines lets you scale. You’ll also learn, once again, that distributed systems are hard. This part focuses on replication, partitioning, and transactions (both on a single system and distributed), and it ends with a chapter about consistency and consensus.
Once you’ve gotten this far into the book, you’ll have a clearer idea of how storage systems work and you’ll hopefully be able to pick the one that fits the problem you want to solve. In the final part, the book discusses common patterns for implementing applications on top of a storage system. Depending on the nature of your data, you might want to do batch or stream processing. Either way, the book discusses the most common approaches.
I’d really recommend this book. I think it hits the sweet spot between giving a comprehensive overview and giving enough details and references to be actually useful. You won’t learn all the implementation details of a particular system, but you’ll learn enough about how a huge variety of systems work that you’ll be in a good position when you have to choose how to implement your data-intensive application.
This post was originally published on mariosangiorgio.github.io