Selecting data storage technology I would name one of the most important technical decision that has to be made. Wrongly selected data storage can prevent you from being able to meet software requirements. Database choice requires thoughtful design and research investment and should be done with maximum responsibility. Here I want to share my knowledge and thoughts about how to choose right database for your task.
Understand difficulty of the problem
While some applications need to store petabytes and process them in sub-milliseconds other applications might need simple storage for the data that fits in memory of your laptop and can wait seconds to be available. Not all applications are equally tough and technology choice is not equally important. Mission critical systems require thoughtful design: feature comparison, prototype implementation, performance testing but not every system needs these steps. For simple tasks you might use already adopted technologies. Invest time appropriate to the value of the problem.
Gather requirements
Everything in software engineering starts from that. Try to understand your data model and ways your application will interact with it. Try to estimate the size of your data. Get throughput and latency requirements. Understand which consistency guarantees required and if linearizability required.
Consider SQL by default
Today SQL vs NoSQL is not about SQL. Traditional SQL databases like PostgreSQL or MySQL already support unstructured document store in JSON or other formats while NoSQL databases evolved query syntax to reach capabilities close to SQL. The main difference remains in the fact that SQL databases are built by default for non-sharded setup with single write instance and few read replicas. This is the way how full ACID can be met without heavy 2PC-like protocols. Consider anything else then “30-years old legacy” SQL database only if SQL cannot meet requirements in terms of the size of data, throughput, latency or query capabilities. With SQL you are sure you can query almost everything and make data durable and consistent at any time. You are on the safe side if requirements will change or if you realize that you made wrong assumptions on your data model.
Anything better comes with the price
To do anything better then SQL database we have to sacrifice something. Usually, this includes consistency, durability, and ways how you can access and manipulate your data. Need to write more data faster? Don’t write immediately to disk or don’t update indexes immediately. Writes cannot be handled by single instance? Remove consistency between documents and spread them across the cluster. Want to have faster access by key? Change how data is stored and remove some query capabilities. These are rules of the game caused by real world hardware limitations. Every characteristic of NoSQL database that outperforms SQL comes with sacrificing some SQL database capabilities. There is no free lunch here.
Right question contains the answer
Assuming the problem is big enough or specific enough. By classifying the problem we find and answer already. Need to store key/value data? Look on key/value stores. Write intensive workload? Look at write optimized solutions. Want to store relationships? Look on graph databases. But only if your problem is big enough or specific enough.
CAP classification is not that helpful
While CAP theorem is absolutely valid classifying databases into CP, AP, CA is usually not. Most of NoSQL databases can be tuned to the level of consistency that is required. Focus rather on understanding database architecture and which problems and use-cases it typically solves. Find experiences of others.
Don’t trust marketing to full extend
Every database according to white-papers is best, fastest and the most reliable in the world. No one will write limitations on the main page or in whitepapers. They are usually hidden in documentation or even hidden that much that revealing them becomes surprised to many people. Jepsens by Kyle Kingsbury is a good example of exploring database limitations in hands-on tests. Measure don’t trust. Benchmarks are tricky business. You can find a benchmark where database X outperforms database Y and another one with database Y outperforming X. Only your setup and data can give the confident answer.
Understand limitations of free editions
I cannot name any single popular database behind which there is no company making money. Even those technologies born and open sourced at companies like Facebook and Google have already consultancies and companies offering commercial editions. More tricky are the databases born because of NoSQL trend. They are developed by companies making money on these technologies and this is fair. That’s why free edition might be very constrained forcing you to pay for commercial one. Understand limitations of selected edition. Frequently they include security, replication, monitoring, bug fixes and are vital to many systems. Do this early in your project and align with your budget.
Understand limitations of hosted options
There are many databases provided as a service such as Heroku Add-ons. You don’t have to deal with infrastructure, replication setup, and backups. Everything is there to start implementing an application. Usually hosted databases are more expensive than just bare metal machines where you can setup any database with a license. But the price difference is not the only cost. Database maintenance as a service prevents you from being able to use some administrative capabilities directly. This means you might not be able to create multiple users with different permissions, install extensions or even change durability/consistency level. What you will get for sure is database communication protocol compliance but not all the features. Evaluate these limitations carefully.
Is it right time to play
Experience matters. Knowledge inside the team matters. Old and known is better than new and unknown. Heterogeneity brings complexity. Distributed systems are tough. Building databases is tough. Everything old and tested is better than new and fancy. Trying new technologies is always a risk. Trying multiple new technologies at the same time is a big risk. Not every task and circumstances allow taking this risk. Understand if you can allow yourself trying new technology which should be better then already used and adopted one. But not every problem is a nail you can fix with a hammer. There is no “best” database. There might be the most suitable one for your problem.
Top comments (8)
Hi Grygoriy, looks like a great article, but I'm confused as to your intended audience. Based on the title I assumed (incorrectly I believe in hindsight) that it was for those new to databases and would therefore struggle to decide between them. Until "Assume SQL..." I was very excited by the article, but about the midpoint I was lost in industry specific acronyms and no links to any articles that would allow me to clarify your meaning.
I'm not critical of your article, because from what I've read it sounds pretty great. I'm just pointing out that as someone new to this area it was too much for me and that the title seemed to indicate a more junior level of reading. Maybe this is an opportunity to write a multipart that would be lighter and includes my demographic.
Just putting it out there.
Thanks for your feedback! Will think how to make it more friendly to the novice audience
That would be awesome. I'd even be willing to assist you because ultimately you know it; that's clear, but I could edit and ask when I don't understand making it easy faster for you to produce. Let's be honest helping you would be helping me, because I really need a version of your article that is at my level. Seriously I do. :))
Would be happy to collaborate :) docs.google.com/document/d/1cDvKzo... shared the content as Google Doc, leave your questions as comments and I will try to make it simpler!
Awesome! Will start tonight. Thank you.
Published the update, many thanks for your feedback! Hope, now the article easier to understand for less-experiences in the subject people.
Super great guide, thanks a lot.
I find the discussion either doesn't take this stance and gets exotic by default, or dismisses the concept of ever choosing a non-SQL-based DB.
Thanks for the feedback! By default doesn't mean always ;)