Selecting data storage technology I would name one of the most important technical decision that has to be made. Wrongly selected data storage can prevent you from being able to meet software requirements. Database choice requires thoughtful design and research investment and should be done with maximum responsibility. Here I want to share my knowledge and thoughts about how to choose right database for your task.
While some applications need to store petabytes and process them in sub-milliseconds other applications might need simple storage for the data that fits in memory of your laptop and can wait seconds to be available. Not all applications are equally tough and technology choice is not equally important. Mission critical systems require thoughtful design: feature comparison, prototype implementation, performance testing but not every system needs these steps. For simple tasks you might use already adopted technologies. Invest time appropriate to the value of the problem.
Everything in software engineering starts from that. Try to understand your data model and ways your application will interact with it. Try to estimate the size of your data. Get throughput and latency requirements. Understand which consistency guarantees required and if linearizability required.
Today SQL vs NoSQL is not about SQL. Traditional SQL databases like PostgreSQL or MySQL already support unstructured document store in JSON or other formats while NoSQL databases evolved query syntax to reach capabilities close to SQL. The main difference remains in the fact that SQL databases are built by default for non-sharded setup with single write instance and few read replicas. This is the way how full ACID can be met without heavy 2PC-like protocols. Consider anything else then “30-years old legacy” SQL database only if SQL cannot meet requirements in terms of the size of data, throughput, latency or query capabilities. With SQL you are sure you can query almost everything and make data durable and consistent at any time. You are on the safe side if requirements will change or if you realize that you made wrong assumptions on your data model.
To do anything better then SQL database we have to sacrifice something. Usually, this includes consistency, durability, and ways how you can access and manipulate your data. Need to write more data faster? Don’t write immediately to disk or don’t update indexes immediately. Writes cannot be handled by single instance? Remove consistency between documents and spread them across the cluster. Want to have faster access by key? Change how data is stored and remove some query capabilities. These are rules of the game caused by real world hardware limitations. Every characteristic of NoSQL database that outperforms SQL comes with sacrificing some SQL database capabilities. There is no free lunch here.
Assuming the problem is big enough or specific enough. By classifying the problem we find and answer already. Need to store key/value data? Look on key/value stores. Write intensive workload? Look at write optimized solutions. Want to store relationships? Look on graph databases. But only if your problem is big enough or specific enough.
While CAP theorem is absolutely valid classifying databases into CP, AP, CA is usually not. Most of NoSQL databases can be tuned to the level of consistency that is required. Focus rather on understanding database architecture and which problems and use-cases it typically solves. Find experiences of others.
Every database according to white-papers is best, fastest and the most reliable in the world. No one will write limitations on the main page or in whitepapers. They are usually hidden in documentation or even hidden that much that revealing them becomes surprised to many people. Jepsens by Kyle Kingsbury is a good example of exploring database limitations in hands-on tests. Measure don’t trust. Benchmarks are tricky business. You can find a benchmark where database X outperforms database Y and another one with database Y outperforming X. Only your setup and data can give the confident answer.
I cannot name any single popular database behind which there is no company making money. Even those technologies born and open sourced at companies like Facebook and Google have already consultancies and companies offering commercial editions. More tricky are the databases born because of NoSQL trend. They are developed by companies making money on these technologies and this is fair. That’s why free edition might be very constrained forcing you to pay for commercial one. Understand limitations of selected edition. Frequently they include security, replication, monitoring, bug fixes and are vital to many systems. Do this early in your project and align with your budget.
There are many databases provided as a service such as Heroku Add-ons. You don’t have to deal with infrastructure, replication setup, and backups. Everything is there to start implementing an application. Usually hosted databases are more expensive than just bare metal machines where you can setup any database with a license. But the price difference is not the only cost. Database maintenance as a service prevents you from being able to use some administrative capabilities directly. This means you might not be able to create multiple users with different permissions, install extensions or even change durability/consistency level. What you will get for sure is database communication protocol compliance but not all the features. Evaluate these limitations carefully.
Experience matters. Knowledge inside the team matters. Old and known is better than new and unknown. Heterogeneity brings complexity. Distributed systems are tough. Building databases is tough. Everything old and tested is better than new and fancy. Trying new technologies is always a risk. Trying multiple new technologies at the same time is a big risk. Not every task and circumstances allow taking this risk. Understand if you can allow yourself trying new technology which should be better then already used and adopted one. But not every problem is a nail you can fix with a hammer. There is no “best” database. There might be the most suitable one for your problem.