One of my colleagues, @margo_hdb, recently posted a great article Database Architectures & Use Cases - Explained here on dev.to. In response a user asked for a deeper dive on NewSQL databases, so I thought I would put one together.
The term NewSQL was coined in 2011 by 451 Group analyst Matthew Aslett.
Wikipedia defines the term below:
“ NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.
Many enterprise systems that handle high-profile data (e.g., financial and order processing systems) are too large for conventional relational databases, but have transactional and consistency requirements that are not practical for NoSQL systems. The only options previously available for these organizations were to either purchase more powerful computers or to develop custom middleware that distributes requests over conventional DBMS. Both approaches feature high infrastructure costs and/or development costs. NewSQL systems attempt to reconcile the conflicts.”
The problem in my mind is that the term is so new, and not widely adopted, that what is defined as a NewSQL database can be really confusing. DB-Engines, doesn’t even list NewSQL databases as a category. I understand why, because I think until the term is fully cemented, listing it would create more confusion than clarity.
Here is one list I found of “NewSQL” databases
- Amazon Aurora
- Apache Trafodion
- CLustrix DB
- Google Spanner
I obviously, and selfishly like that it includes HarperDB, and I think a lot of the players mentioned deserve to be on that list, but Amazon Aurora is by no means a NewSQL database. It is basically just an RDBMS like Postgres and MySQL.
Why do I think that Aurora should not be on the list? Because to me, what makes a “NewSQL” database a “NewSQL” database is the underlying storage engine. I don’t think you would find Postgres or MongoDB, two of the most popular databases in the world, on a list of NewSQL databases. That’s because while they are both capable of supporting SQL (columns and rows) and NoSQL (documents/json), their underlying storage engines are not designed to fulfill the definition of “NewSQL” above.
Wikipedia defines a “Storage Engine” as follows:
“A database engine (or storage engine) is the underlying software component that a database management system (DBMS) uses to create, read, update and delete (CRUD) data from a database. Most database management systems include their own application programming interface (API) that allows the user to interact with their underlying engine without going through the user interface of the DBMS.”
If you remove the element of time from the innovation lifecycle of a product and compress it from beginning to end into one moment, most products at a high-level are building very similar things. They may build them in a different order and in different ways, but they are trying to accomodate the same market feedback. For example if you look at the first automobiles on the road they were wildly different, but over time began to look pretty similar with different high-level features that distinguished them.
As a result almost all databases will eventually accommodate SQL, NoSQL, GraphQL, etc… These are really just interfaces for how you interact with the storage engine of a product. Kind of like all cars eventually having seat-belts, four wheels, gas pedals.
Most people look at the following categories when evaluating a database:
- What interfaces does it have? (REST, JDBC, ODBC, CLI, Web-Sockets, etc…)
- Is it ACID compliant?
- Does it support SQL and/or NoSQL?
- What is the cost?
- How is the performance?
- What are the security features?
- Can it scale?
- Is it easy to use?
- How popular is it?
- How long will it take to migrate?
- How long will it take for my dev team to get up to speed?
- What does the ecosystem look like?
These are all really valid points, but what’s not on that list is the underlying storage engine. Obviously some very sophisticated data experts will look at that, but the vast majority of people do not. What most folks do not realize is that the underlying storage engine impacts nearly all of those questions.
Additionally, once a database has picked an underlying storage engine it is a nearly Herculean effort to migrate to another. MongoDB for example, realized the need to provide SQL, and instead of changing their underlying storage engine which could never accommodate SQL in any performant manner, appended to their storage engine which created a host of issues which I have outlined in this article: Multimodel Databases- A Mistake.
For an example with HarperDB, it took us almost a year to migrate from one storage engine (file system) to another (LMDB) and we were only able to do that because we are not nearly as widely used as MongoDB. In contrast, for us to add in Geospatial search capability into HarperDB took only a few weeks.
So to use my analogy above, automobile manufactures can start from scratch when building a new car. When you buy a new car, the legacy technology in your previous car has no impact on the new car. They may want to keep a consistent user experience across the two cars, but each new car can be redesigned from the ground up entirely without the old car causing issues.
The same is not true of databases. If you upgrade from AcmeDB 1.0 with a document store to AcmeDB 2.0 with a columnar store your data is going to look VASTLY different. You could have data corruption, loss of features, new cost, down-time, migration time, security issues, etc. Whereas if you buy car 1 with a gas engine and switch to car 2 with an electric engine, it has no impact on car 2.
That is why IMHO, in order to categorize something as a NewSQL database, the underlying storage engine has to have been specifically designed for or able to accommodate the definition of NewSQL in terms of scale, flexibility, and analytical capability.
Databases are typically designed for specific use cases. Timescale for example was purpose built to solve time-series analysis needs that are challenging in other DBMS systems, i.e. analyzing data over time. It’s really good at it. That said, no one would use Timescale to power a simple web page when a document store like MongoDB could do it much easier, or even Postgres which Timescale sits on top of would be a better choice. If however you try to do time-series analysis in MongoDB, good luck to you, and please make sure you have a budget the size of a small country’s GDP.
So when we look at what defines a “NewSQL” database we see a few key things:
- “…a class of relational database management systems that seek to provide the scalability of NoSQL systems… “
- “….while maintaining the ACID guarantees…”
- “Many enterprise systems … are too large for conventional relational databases, but … are not practical for NoSQL systems.”
- “The only options previously available for these organizations were to either purchase more powerful computers”
- “…or to develop custom middleware…”
- “Both approaches feature high infrastructure costs and/or development costs. NewSQL systems attempt to reconcile the conflicts.”
Thus a NewSQL system must accommodate the following:
- Scale with the level and flexibility of a NoSQL system
- Be ACID compliant
- Be cost effective and efficient
- Not solve the challenge through massive computing needs
- Not require middleware either internalized or externalized
So let’s dig deeper. The reason I droned on so much about the storage engine is because they are all impacted by them significantly.
Document Stores, what are more commonly called NoSQL databases, are fantastic at write scale. They are also highly flexible because they are not enforcing schema, and as a result simply store a new version of your record for every transaction. The scale comes from the fact that unlike a relational database, they don’t need to update other tables on write, check data consistency, guarantee a transaction, etc… Basically each write is independent, has nothing to do with other rows of data, and just creates a new document each time. Typically one attribute, the hash, is indexed by default, and if you want to add additional indexes then they actually create copies of these records. You can then search by those other attributes; however, you are searching a copy of that record, not the original. This means you have data duplication, eventual consistency, which drives up your compute costs, slows down writes, and creates a lot of overhead.
This means though that they can’t do performant joins, they are not ACID compliant, and it’s challenging to do highly scalable analytics. Thus, they cannot be NewSQL databases. Again some document stores try to achieve this, but read my other blog to understand why that’s a bad idea.
RDBMS systems on the other hand, struggle with high-scale data ingestion, but are fantastic at read-scale. They struggle with high-scale data ingestion because they are capable of maintaining relationships of data, they have a lot of background processes to guarantee transactions and consistency, and each write might affect numerous rows of data causing row-locking etc… This causes a ton of overhead on write, whereas Document Stores have the overhead on read. It can also lead to crashes on high-scale writes. For example, if I am trying to write to a table that has a billion rows of data, and it has foreign key to a table with a billion rows of data, the RDBMS has to inspect both tables prior to committing the write. As a one-off this is not a big deal, but if I am doing this with a 1,000 threads writing a 1,000 rows per second it begins to become problematic. RDBMS tight enforce of schema, which is much needed at times for certain use cases, is what also makes them inflexible.
People have tried to solve this by making RDBMS systems in-memory, but this just pushes the problem down the road by requiring really expensive computing needs, and when you run out of memory leading to massive crashes, data inconsistency, downtime etc… Our team previously managed a multi-terabyte in-memory database for streaming analytics. When we had enough memory for all our data it was AWESOME; however, the moment we began to run close to our memory limits it was a nightmare. It was like fighting off a zombie horde. We were always racing to prevent a crash, because the moment we crashed, it was nearly impossible to catch back up to the stream. Restarting the DB and loading everything into memory took forever, and we were processing hundreds of thousands of rows of data a second. So minutes were a long time. Additionally, making an RDBMS an in-memory RDBMS doesn’t make it any more flexible or change how the storage engine works.
In my mind then, to be a “NewSQL” database, the storage engine needs to be optimized to minimize the overhead of read AND write as much as possible. It needs to do this in a flexible manner that allows for dynamic data, without driving up costs.
Without going into too much of a sales pitch, that is exactly how we designed HarperDB. We optimized it for read, by making every attribute fully indexed dynamically upon write. We did this without duplicating data. This allowed us to achieve high read performance without the pitfalls of multimodal or document stores. We optimized it for write by ensuring each data element was atomic. This allows for us to ensure high scale writes without bottlenecks because updating one attribute is a totally different transaction than another. If you want to learn more you can read this somewhat dated blog about our storage engine. It is dated, because instead of the file system we now use LMDB. Pingcap has a similar, more complex, model.
Our solution and Pincap’s are not the ONLY way to solve the “NewSQL” challenge, but in my mind in order to be “NewSQL” the approach needs to be new to accommodate these needs. That said, this is just my view of the NewSQL database landscape and I would love to hear your thoughts! Am I wrong? What am I missing?