MatrixOne, a hyperconverged and one-size-fits-most database

MatrixOne is an open-source database project: https://github.com/matrixorigin/matrixone

Over-abundance of databases

I was recently asked by a CIO why there are so many database products and how to choose the right one. Indeed, the question is very common for many enterprises. If we look at the well-known DB-Engines ranking, which lists databases by popularity, 388 database systems appear on the list! Each one claims to be better than the others in some aspect. This over-abundance has become real trouble for many. Databases are supposed to simplify data processing, yet selecting a database is itself overwhelming because of too much data. What an irony!

Why do so many databases exist?

A rather long historical evolution led to the current situation.

The world used to get by on just a small collection of databases. Back in the 1980s, you had only a few choices, such as IBM DB2, Oracle, Ingres, or SQL Server. A relational OLTP database would do just fine for most applications. DBAs didn't have to learn and maintain many different databases, and application developers didn't have to worry about data processing capabilities and could focus on implementing their applications. It was a golden one-size-fits-all era.

In the 1990s, the data warehousing trend started to emerge. Enterprises wanted to gather data from their applications and find insights in it, the so-called business intelligence. Databases were now needed for analytics rather than just transaction handling. OLAP workloads ingest large amounts of data at once and run heavy, complex queries, for which OLTP databases were ill-suited. From then on, enterprises started to deal with at least two database systems.
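To make the contrast concrete, here is a minimal sketch of the two workload shapes. The `orders` table and its columns are hypothetical, used purely for illustration:

```python
# Hypothetical "orders" table, for illustration only.

# OLTP shape: small, frequent, touches a handful of rows, low latency.
oltp = "UPDATE orders SET status = 'shipped' WHERE order_id = 10042"

# OLAP shape: infrequent but heavy, scans and aggregates millions of rows.
olap = """
    SELECT region, SUM(amount) AS revenue
    FROM orders
    WHERE order_date >= '1995-01-01'
    GROUP BY region
    ORDER BY revenue DESC
"""
```

An OLTP engine optimizes for many concurrent statements of the first kind; an OLAP engine optimizes for a few statements of the second kind, and each tends to perform poorly at the other's job.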

From the 2000s, Internet applications and big data concepts started to boom; their strong need for ultra-dense storage, scalability, and high availability led to a big divergence of databases: the NoSQL family and the Hadoop family. Both implemented highly scalable architectures for processing large amounts of data, but they renounced important abilities of traditional databases such as atomicity, consistency, key constraints, and SQL support. A lot of data processing logic was stripped out of the database, and application developers had to handle it themselves again.

Besides these, many specialized databases evolved as well, including but not limited to: streaming engines for real-time applications, in-memory stores for faster caching, time-series databases for IoT and industrial applications, graph databases for handling semantics and relationships, and search engines dedicated to full-text search.

Eventually, we ended up with a long list of databases with different purposes.

| Category | Characteristics | Typical Selection |
| --- | --- | --- |
| OLTP | 1. Supports transactional workloads<br>2. Light-weight, frequent CRUD operations<br>3. Low latency, high concurrency | Oracle, DB2, SQL Server, PostgreSQL, MySQL |
| OLAP | 1. Supports analytical workloads<br>2. Intense, infrequent, complex queries over large amounts of data<br>3. Mostly insert and query operations | Teradata, Greenplum, ClickHouse, Druid |
| NoSQL | 1. Handles unstructured data<br>2. Easier to scale<br>3. No data schema required | Cassandra, MongoDB, HBase |
| Big data / Data lake | 1. Huge amounts of storage<br>2. Schema on read<br>3. High scalability | Hadoop, Hive, Impala, Spark |
| Specialized | Supports a specialized data type or workload | 1. Streaming: Kafka, Flink<br>2. In-memory: Redis, Memcached<br>3. Time series: InfluxDB, Prometheus<br>4. Graph: Neo4j, JanusGraph<br>5. Search: Elasticsearch, Splunk |

Typical data architectures

Depending on the organization's size and business type, a modern enterprise will usually deploy a mix of these databases to construct its data platform. Here is a typical example for a modern web application.

Typical data architecture

It consists of four major parts:
● Data Source: structured/semi-structured data from application databases, usually relational SQL or NoSQL, plus logs and other unstructured data (e.g., images).
● Data Integration: a big pool for merging all the data; usually Hadoop or data warehouse systems are used for this.
● Data Consumption: reporting, monitoring, and business intelligence; depending on usage, OLAP databases or search engines might be used.
● Data Transformation and Transportation: various ETL tools and streaming engines move data from one form/database to another (a minimal sketch follows below).
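As a concrete (and much simplified) illustration of the transformation and transportation step, here is a minimal nightly ETL sketch. The hosts, credentials, table names, and the choice of the `pymysql` driver are all assumptions for illustration, not a reference implementation:

```python
# Minimal ETL sketch: copy yesterday's orders from an OLTP source into an
# analytics store, aggregating on the way. All connection details and table
# names below are illustrative placeholders.
import pymysql

source = pymysql.connect(host="oltp-db", user="app", password="...", database="shop")
target = pymysql.connect(host="olap-db", user="etl", password="...", database="dwh")

with source.cursor() as src, target.cursor() as dst:
    # Extract + transform: aggregate the raw rows in the source query.
    src.execute("""
        SELECT region, DATE(order_date) AS day, SUM(amount) AS revenue
        FROM orders
        WHERE order_date >= CURDATE() - INTERVAL 1 DAY
        GROUP BY region, DATE(order_date)
    """)
    # Load: write the aggregates into the warehouse table.
    dst.executemany(
        "INSERT INTO daily_revenue (region, day, revenue) VALUES (%s, %s, %s)",
        src.fetchall(),
    )
    target.commit()

source.close()
target.close()
```

Every such job is one more moving part: it must be scheduled, monitored, and reconciled against its source, which is exactly the maintenance burden discussed below.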

Advantages and drawbacks of using multiple databases

There are certainly many advantages to using multiple databases:
● Firstly, with so many choices, you can build whatever you want. A common practice is to take the best pieces of different tools and construct a highly customized system.
● Secondly, it adapts to your needs quickly. When a new application requires specific functions from the database, you can quickly add a new one to your architecture.
● Thirdly, as many organizations are moving towards microservices, your monolithic application is broken into pieces, and every piece can choose the appropriate database.

However, this approach is not universal; it is not for everyone. Now let's look at some of the drawbacks of this architecture:
● Firstly, your team needs to master a huge stack. Any new member faces a very steep learning curve, and the complexity of the architecture slows down their ability to contribute anything meaningful to the codebase.
● Secondly, it's very hard to maintain. Each time you add a brick to the data building, not only do you have an extra component to monitor, but you also have several extra links to other components to worry about. Copying data from one place to another is an error-prone job, and verifying the consistency of data from upstream to downstream is a big pain.
● Thirdly, redundant data increases cost and raises security and privacy concerns. As multiple databases hold several copies of the data, you are paying multiple times for one set of data. And with data privacy regulations such as GDPR in place, redundant data makes compliance harder.

There is a tradeoff to consider between capabilities, cost, and human resources. Different databases are like LEGO bricks still in pieces: some prefer, and are able, to build masterpieces from the ground up, while others have neither the time nor the talent to work with individual pieces, so they just buy a finished set.

Bricks vs Building

For big internet companies that live and breathe IT, it's not too hard to recruit a data engineering team and build their own data platform. But for small and medium-sized or traditional businesses, time and talent are both limited, so why not just buy the finished LEGO set?

What’s a hyperconverged, one-size-fits-most database?

There has been quite some debate about the one-size-fits-all architecture for databases. The traditional relational database was a vivid and elegant one-size-fits-all example, but it fell behind the times. Michael Stonebraker published an article in 2005, “One Size Fits All”: An Idea Whose Time Has Come and Gone, explaining that the one-size-fits-all architecture failed to meet the diversity of applications, which seemed to put an end to the debate. Since then, database options have flourished, and more and more applications have been built upon different databases. But rather than making the world of data management easier, we have created a jungle of systems with the opposite effect: it makes a company's life harder and more costly.

Therefore, we believe that companies investing a lot in consolidating these systems need a better solution, especially small to medium-sized businesses. So now, in 2022, do we have a better choice?

In fact, a lot of effort has gone into merging different database architectures.
In 2011, the 451 Group introduced the term “NewSQL” to describe systems providing the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.
In 2014, Gartner published a report defining the term “HTAP” (hybrid transactional/analytical processing) as an emerging application architecture that "breaks the wall" between transaction processing and analytics.
In 2020, Databricks introduced a new data management architecture called the “Lakehouse”, which combines the best elements of data lakes and data warehouses.

We have watched these three new architectures working towards hyperconvergence. And that's where we want to introduce MatrixOne, a hyperconverged database that sits at the intersection of OLTP, OLAP, and the data lake.
MatrixOne is a redesigned database. It's designed to be planet-scale and extremely performant, and to support heterogeneous workloads while being native to different infrastructures.

Hyperconverged architecture

There are three main reasons why MatrixOne fits this hyperconverged intersection (a minimal connection sketch follows the list):
● It supports transactional, analytical, and streaming workloads.
● It implements a strongly consistent distributed framework supporting high scalability.
● It stores all types of data and supports schema-on-read.
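Because MatrixOne speaks the MySQL wire protocol, a standard MySQL client or driver can talk to it. Here is a minimal sketch; the host, port, and credentials are placeholders (the quick-start docs use port 6001 by default, but check the defaults for your version), and SQL coverage in the 0.3.0 release is still evolving:

```python
# Minimal sketch: MatrixOne is MySQL-protocol compatible, so a standard
# MySQL driver such as pymysql works. Host, port, and credentials below
# are placeholders; see the MatrixOne docs for your deployment's defaults.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=6001, user="dump", password="111")

with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS demo")
    cur.execute("USE demo")
    cur.execute("CREATE TABLE IF NOT EXISTS t (id INT, amount INT)")
    # An OLTP-style write...
    cur.execute("INSERT INTO t VALUES (1, 10), (2, 20)")
    conn.commit()
    # ...and an OLAP-style aggregation against the same engine.
    cur.execute("SELECT COUNT(*), SUM(amount) FROM t")
    print(cur.fetchone())

conn.close()
```

The point of the sketch is that both workload shapes run against a single engine, with no ETL pipeline in between.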

MatrixOne leverages and integrates the advantages of existing database systems. With these features embedded in one database, the majority of data processing jobs can be done within a single system. Even if it doesn't handle every workload, for most general use cases it is a one-size-fits-most solution, which remarkably reduces your time spent on selection, development, and maintenance, as well as the overall cost of a decent data engineering team.

For now, MatrixOne has just released its 0.3.0 version. MatrixOne is an open-source project, and even though we still have a long way to go towards the final objective, we trust the power of the community and its developers. So welcome to check out our code on GitHub (https://github.com/matrixorigin/matrixone), and if you feel encouraged by our ideology, your contribution is most welcome!
