Relational Databases took the world by storm around 1990. Ever since, they've become a must-have in the toolbox of professional developers - nearly independent of the language you might be using, you will need SQL at some point. However, since around 2000, the NoSQL movement has regained some attention as well. In this article, I will provide some of my personal thoughts on databases, focusing on relational databases and the SQL language. As I've been working with both relational and non-relational systems, I hope that this article can provide some insights and interesting perspectives.
This is a strongly opinion-based article based on my personal experience, not a universal truth. Please read it as such.
The fact that Relational Database Management Systems (RDBMSs) have such a high market share has several reasons.
First and foremost, an RDBMS, regardless of its implementation, is always based on Relational Algebra which formally defines its operations and data structures. This is very convenient: all the formal foundations have already been taken care of, the system is known to be sound and complete, "all" database vendors have to do is go ahead and implement them. This is a big difference to NoSQL databases. For example, it is not obvious a priori whether or not you can formulate a query for every possible information demand in a Graph Query language such as Gremlin or Cipher. The authors of those languages had to think those theoretical questions through before even writing the first line of code.
A second reason why RDMBSs are so successful nowadays is the enormous amounts of money which were put into their development. There was a lot of hype surrounding RDBMSs in research, therefore investors were willing to put a lot of money into it. This process spiraled upwards to the point where "database" was used as a synonym to "RDBMS". In a way, RDBMSs became popular not necessarily by means of features or well-foundedness, but at least partially also due to a massive hype cycle which provided the financial resources to make all the research and optimizations happen.
Another big advantage of RDBMSs is that they can operate in a highly optimized fashion, as SQL is declarative in nature. Operators can be shifted around in the operator tree, indices can be applied where appropriate, there's just a lot of things happening behind the scene before your query even hits the actual B-Trees. Again, there has been a tremendous amount of research efforts going into this topic, and people are still coming up with new ideas on how to make queries even faster. Try using the
EXPLAIN statement in one of your more complex queries to see what I mean.
In most SQL databases out there, the schema has to be fixed up-front. And, aside from special
LOB (Large OBject) types such as
TEXT, they pretty much all have a fixed length in bits. Even for a
VARCHAR you have to specify a maximum length. Therefore, for any given row in your database, the system knows in advance how much memory the row will need - before it is actually loaded into RAM. This is a huge deal and helps immensely when implementing database systems in languages with manual memory management, such as C.
To many people, in particular in the more business-oriented offices, SQL is much easier to approach than a general-purpose programming language. SQL, as the primus inter pares of Domain Specific Languages, has been designed to be close to natural english. From a marketing perspective, this is a great boon: "Query your data without needing programmers! It's even standardized so you can use any database you want with it!" And I think it's safe to say that it worked, people fell for it.
RDBMSs excel at providing your data exactly in the format you need. They can slice and dice the database content in just about any way you can possibly imagine. This is an impressive and often overlooked capability.
Relational databases excel at specifying constraints. Whether it is about value constraints (minimum & maximum of a numeric value, null handling...), foreign key constraints or access rights, RDBMSs have you covered in all directions. This ensures that your data remains in a correct and consistent state.
As the title implies, I firmly believe that there are a number of issues with RDBMSs out there.
Yes, you read that right - this caption appeared as an advantage. But SQL being a human-readable language also implies a lot of problems:
It is inherently non-systematic. The number of keywords needed is enormous compared to general-purpose programming languages. In my opinion, this makes it harder to learn, rather than easier, as advertised.
It suffers from poor composability. Unifying two arbitrary queries can be challenging. SQL engines often impose constraints which affect the whole query, rather than individual parts. This becomes particularly evident with
It is a nightmare to parse. Read any mailing list on the development of relational databases you want, you will see the people responsible for the parser complain about it, and rightfully so.
In total, the caption of this section could also be written as "SQL is not a language for machines". Let's not forget that the vast majority of queries out there are not written manually - they are being generated. Whether they originate from Object-Relational Mapping systems, string templates or other mechanisms, most queries are not written by a human sitting in front of a text field. SQL as a language therefore focuses on the wrong audience, causing many issues in the process - the generators have a hard time producing it, and the SQL engines have a hard time parsing it. So, who benefits from this situation? Did we, as software development community, take a wrong turn somewhere?
... and everyone knows it. Even calling it a "Standard" is an insult to real standards. First of all, using only the elements found in the standard, it's already challenging to write a simple
SELECT-FROM-WHERE query. Often times, you need vendor-specific extensions to get your work done (the SQL standard often speaks of "language opportunities"). For instance, you might think that there is a standardized way to do a paginated
OFFSET query in SQL. Well, too bad, you're out of luck. While pretty much any RDBMS out there can do it, the SQL syntax is different for every one of them. Data types also often don't do the same thing across different vendors. And don't even get me started on stored procedures. There is a reason why companies look for experience with certain RDBMS products in their job announcements, rather than looking for experience with RDBMSs in general. If you see such an advert, you can be sure that this company is already neck deep in the vendor lock-in. In my experience, the idea of writing an application which can make effective use of more than one type of RDBMS as its backend is an illusion.
When dealing with even slightly more sophisticated applications, it is often the case that you load some data from the database, make some modifications to it, and then save it back. RDBMSs have a really bad habit here: they force you as the developer to distinguish between
UPDATE, all the time. Tracking this object state correctly throughout an entire application, only to decide at the final moment which of the two keywords to use, is a real pain. In the end, you put in a lot of effort only to make the SQL constraint checker happy. The
UPSERT keyword has not yet made it into the big RDBMSs. There are ways to do it, in the form of
INSERT ON CONFLICT (PostGreSQL) or
ON DUPLICATE KEY UPDATE (MySQL), but overall the situation is still pretty bad. To add insult to injury, Object-Relational Mappers cannot rely on those features, as they have to talk to multiple different RDBMS backends. In comparison, do you know the primary way of inserting data into Neo4j? It's the
MERGE keyword, which is just another way of saying
UPSERT. Neo4j is not without issues, but they at least got that right from the get-go. There are certainly use cases where it actually makes a difference if you create a new object or update an existing one, but they're a small minority in my experience. In 99% of all cases, I just have an object I want to have saved to query it later.
I don't mean any offense against O/R mapping frameworks or their authors. In fact, I'm thankful they exist, because without them, the pain would be even bigger than it already is. Yes, RDBMSs are powerful and highly optimized, and thousands of hours went into their development to ensure they are correct and fast. We would be mad not to make use of them, right? Well, too bad the interface we have to use to talk to them (SQL) is horrible. I intend to write another article about this topic where I'll go into greater detail.
RDBMSs are here to stay. Even if all modern applications were developed with NoSQL databases, there would still be enough legacy applications to keep all of us busy for centuries to come. However, change is happening. JSON and XML support is creeping into RDBMS engines, giving us at least some more flexibility. But what about SQL as a language? I personally think that Datalog would be a very nice, clean and effective replacement which is easily generated and easily parsed. I thought of it as a toy when I first came into contact with it in university, but the more time I spend with databases, the more I think that this could very well be worth a shot. Unfortunately, I've never seen any actual support for it in the big database systems.
The relational model has many use cases and advantages, and has come a long way. RDMBSs are stable, mature and have a proven track record. However, SQL as a language is becoming a limiting factor. I think using an RDBMS as the "default" choice for a database is a good habit, but I would advise everyone to at least try a NoSQL solution as well if the opportunity presents itself.