
Hitchhiker's guide to database types

by pythonmeister

I said it before: I love databases. For me, they fall into the same category of beautiful and complex systems as operating system kernels - because they have to solve the same sort of problems. A life without databases is possible but meaningless ;-)
However, there are so many different incarnations of databases available, implementing so many different concepts, that it can be tough to pick the right concept for the job at hand - and with it, the right database software.

In general, a database has one job: organize a collection of data so that it can be stored on and retrieved from a computer system. That breaks down into three tasks:

  • organization of a data collection (definition of virtual and physical structures)
  • storage (creation, insertion, updates, and deletion)
  • retrieval (making the data accessible by an external application)

Each of these tasks possibly has a large number of limiting factors or constraints, and these, in turn, have an impact on the others. So database systems can become very, very complex pieces of software. If you ever write your own, you will be confronted with problems you never thought existed ;-)

The no database model

Imagine you can't use a dedicated database system for your need to "organize a collection of data which is stored and retrieved from a computer system." What would you do? I think it is fair to say that you would try to avoid writing extra software for that - every modern operating system does a brilliant job of this in the form of a filesystem; billions of people trust filesystems with their most critical data every day.
So why not use it?
In my opinion, properly designed filesystem layouts (that is: directory structures and file naming conventions) are an often overlooked way to solve the database problem. Filesystems are (depending on the underlying physical storage and its properties) capable of storing exabytes of data in zillions of files, at high transaction rates and with massive throughput. But saving data as a file is not the problem here. The problems emerge when you try to (re-)access data, or more specifically, a particular data set. How do you find the data in an efficient and thus fast way? If the desired data is the file itself, there are many ways to handle that. But what if it sits somewhere inside the file, between hundreds or thousands of uninteresting records? You end up creating and maintaining an index. One problem solved. But how do you handle simultaneous reads? And writes? And how do you manage the situation where data is not only hierarchical but has many interdependent connections and links? Now things are getting complicated.

My message: as long as the semantics of a filesystem and its layout fit your application's use-case, the higher level of abstraction that a database management system offers might not pay off. Put the "/etc" directory of a Linux system into a database? That might create more problems than it solves (to make my point and give you an idea).
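
To make this concrete, here is a minimal sketch of the "filesystem as database" idea in Python - the path layout and field names are made up for illustration:

import json
from pathlib import Path

BASE = Path("data")  # hypothetical data root

def save(collection, record_id, record):
    # the directory layout is the primary index: data/<collection>/<id>.json
    path = BASE / collection / f"{record_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record))

def load(collection, record_id):
    # a key lookup is a single path resolution - no query engine involved
    return json.loads((BASE / collection / f"{record_id}.json").read_text())

save("users", 42, {"name": "Alice"})
print(load("users", 42))

This works beautifully - right up to the point where the questions from above arrive: secondary indices, concurrent writers, and interdependent records.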

The relational database model

Relational theory has been around for quite some time and is almost omnipresent. It is grounded in set theory: E.F. Codd developed the relational algebra, which in turn led to the invention of the now ubiquitous standard query language, SQL. At its core, relational theory is all about entities and relations. If you have the time, I highly suggest you read his paper, and if you have even more time: read C.J. Date's An Introduction to Database Systems, an in-depth and very intense coverage of all things relational. With this model, it is possible to define relations between entities (even recursive ones!), which enables you to create a precise model of real-world relationships with high flexibility.
You can model the real world in all its glory: is-a, one-to-one, one-to-many, many-to-many. If you need constrained flexibility, this is for you. By this, I mean that you can model almost any possible relationship and ensure that it adheres to given rules, such as unique, not-null, or foreign-key constraints. But these constraints are more or less only structural. What you should also get is adherence to the 12 rules for relational database systems (also given to us mere mortals by Codd) as well as conformance to the ACID principles:

  • Atomicity: every transaction is performed entirely or not at all
  • Consistency: the database state is consistent after a transaction has completed
  • Isolation: concurrent transactions are isolated from the effects of each other (this one is hard to get right!)
  • Durability: as soon as a transaction has finished, the data persists on non-volatile storage
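
To make these guarantees tangible, here is a minimal sketch using Python's built-in sqlite3 module - the table and values are made up. A constraint violation inside the transaction rolls back both updates, not just the failing one:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
con.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
con.commit()  # make the setup durable before the demo transaction

try:
    with con:  # one transaction: both updates succeed, or neither does
        con.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        con.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired; the whole transaction was rolled back

print(con.execute("SELECT * FROM accounts").fetchall())  # [(1, 100), (2, 0)]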

You might guess that some of these properties are hard to achieve on a single server, and even harder or impossible to have in a distributed system. In such systems you need concepts like clustering, sharding, two-phase commits, etc. But sooner or later you enter the arena of the CAP theorem and its many relatives - which led to the creation of a special breed of NoSQL databases (more on this later).
To summarize: with a relational database engine you get a centralized state machine that transitions into a new state with every transaction, while the structural and referential integrity of your data is enforced at every single point in time. If you need true scalability - not only vertical (more resources on a single host) but also horizontal (more hosts) - this can become complex and very costly (it once was a unique selling proposition for enterprise-class database engines like Oracle, IBM DB2, MS SQL Server or Sybase Adaptive Server).

The non-relational database models

Yes, the plural is intentional. NoSQL is nothing but a tag you can put on any database which does not conform to the relational database model. So memcached is a NoSQL database engine, as are many others, and some of them have been around for a long time. The idea is not new; there were NoSQL databases long before there were SQL databases, and even after the relational database came into existence there were use-cases that required very different approaches. The name NoSQL might suggest that only SQL is missing, but interestingly SQL is such common knowledge that many NoSQL databases at least offer a subset of it to make the move not too hard at first (IMHO they would more appropriately be named "non-relational" databases). So this part of the name is sometimes very misleading. Wikipedia lists a large number of different types under the umbrella of NoSQL. What they all share is that they do not implement the relational model and thus can omit some of its defining properties or introduce weaker versions thereof (see, for example, eventual consistency).
Let's discuss some of the use-cases that require very different approaches:

Key-value stores

Key-value stores, or KV-stores for short, behave much like a Python dictionary or a map in C#, C++ or Java that is accessible over a network protocol. One popular example is memcached. All keys and values are kept in memory, which makes it very fast. The protocol is straightforward. If you want to retrieve data, you need to know its key. It's that simple. While such a simple pattern is easy to replicate in a relational database, the overhead of the ACID machinery is potentially huge. In the end, it is easier and faster to write a highly specialized database than to tune a general-purpose relational database to reach the same performance figures - and fail miserably at it.
A typical architecture pattern is to cache rendered web content (in Python pseudocode):

def cachedRenderRequest(request):
    page = memcached.get(calculateHash(request))
    # on a cache miss, render_page renders the page and writes it to memcached
    return page if page is not None else render_page(request)

If you are unfamiliar with Python: get() on a dictionary (a map in C#, C++ or Java) returns either the value for the given key or a default. Here, memcached.get() returns None on a cache miss, in which case render_page must make sure the rendered page gets written to the memcached instance so that subsequent requests hit the cache.
This can reduce the number of database queries to the backend by orders of magnitude and lower response times and latency as well (the fastest code is always the code you never execute!).
A common anti-pattern is to replicate relational tables and their structure by using (compound-)prefixes as keys.
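To give you an idea of that anti-pattern (key names are made up for illustration), keys such as

customer:1042:name -> Alice
customer:1042:city -> Berlin
order:7:customer -> 1042

mimic rows and foreign keys, but the store can neither join nor enforce referential integrity, so you end up re-implementing half a relational engine in client code.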

Document stores

If you intend to store data which is highly diverse in terms of structure, then document stores are for you. A relational table definition is rigid: if you want to store additional or fewer fields, you need to talk about schema migration strategies (an interesting field of its own), and sooner or later you end up with ugly designs: hundreds of columns, nested tables, self-joins with automated join generation - you name it. In a document store, collections group documents, but every document in a collection can look different. These databases typically add full-text indexing and document addressing schemes. Remember IBM Notes? Its underlying technology, the Notes Storage Facility, is at its heart a NoSQL document database extended into a full-fledged PIM application.
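
As a minimal sketch of the idea - assuming a locally running MongoDB instance and the pymongo driver; database, collection, and field names are made up - note that the two documents below live in the same collection despite having different fields:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
contacts = client["demo"]["contacts"]  # no schema definition required

# two documents with different structures in one collection
contacts.insert_one({"name": "Alice", "email": "alice@example.com"})
contacts.insert_one({"name": "Bob", "phones": ["+49 ...", "+1 ..."], "company": "ACME"})

print(contacts.find_one({"name": "Bob"}))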

Time series databases

The often heard story goes like this: you create a table such as this one:

CREATE TABLE IOT_DATA (
    TS TIMESTAMP WITH TIME ZONE,
    DEVICEID VARCHAR(255),
    KEY VARCHAR(255),
    VALUE VARCHAR(255))

And you begin to onboard devices. Each device sends measurements once a minute - say humidity, temperature, wind speed, air pressure, and carbon dioxide concentration. Five data points every minute makes 1,440 x 5 = 7,200 rows per device per day, or more than 2.6 million rows per device per year. Now imagine that you are successful and sell 1,000, 10,000 or even more than 100,000 devices. As your devices mature in the market they gain new measurement capabilities, which create even more data points. And perhaps some customers need a measurement frequency of once a second. Or think of an application in high-frequency trading with thousands of assets on a sub-second timescale.

Time series data is guaranteed to blow up your relational database; you can watch it melt down. You will probably survive for a short time by partitioning the tables and using hierarchical storage systems, but then it doesn't scale any more on one host, and you begin building clusters. Some time later you see the problems arise again and start to re-design your application to implement some form of sharding. My advice: when you think of time series data, you should evaluate a specialized database built for that purpose, like TimescaleDB - which, ironically, is a PostgreSQL extension.

But what exactly is the problem with having time series data in a relational database? The amount of data. To make queries fast you will need indices, probably one for each of the table's columns. That way, not only the storage and thus RAM requirements rise, but also the necessary IO capacity (every insert or delete operation requires corresponding index updates, for example). And since a relational database is designed to uphold the ACID properties, scaling such data is hard, especially when several servers are required.
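
If you have to stay relational for a while, batching inserts at least reduces the per-row overhead. A minimal sketch against the table above - assuming the psycopg2 driver and a reachable PostgreSQL instance; the connection string and measurements are made up:

import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime, timezone

conn = psycopg2.connect("dbname=iot")  # hypothetical connection string

# collect a batch of measurements and write them in a single round trip
now = datetime.now(timezone.utc)
rows = [(now, "dev-001", k, str(v))
        for k, v in {"temperature": 21.5, "humidity": 48.0}.items()]

with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    execute_values(cur,
                   "INSERT INTO iot_data (ts, deviceid, key, value) VALUES %s",
                   rows)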

Scale-out databases

Some databases are designed to be scalable in every respect; this requires leaving the ACID principles behind and doing things differently. Examples are Apache Cassandra and ScyllaDB. Their design builds on many ideas that have been around since the Dynamo paper, and I can only encourage you to read that paper and dig into the details.
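
One of the central Dynamo ideas is consistent hashing: nodes and keys are placed on a hash ring, and each key belongs to the first node clockwise from its position, so adding or removing a node relocates only a small fraction of the keys. A minimal sketch (node names are made up; real systems add virtual nodes and replication):

import bisect, hashlib

def ring_position(s):
    # map an arbitrary string to a position on the hash ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [p for p, _ in ring]

def node_for(key):
    # the first node clockwise from the key's position owns the key
    i = bisect.bisect(positions, ring_position(key)) % len(ring)
    return ring[i][1]

print(node_for("user:42"))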

Graph databases

Graph databases enable you to store and query graph information - structures where nodes are connected by edges. This type of structure is so different from the relational model that it can be implemented much more efficiently by a specialized engine like Neo4j. With these databases, you can answer questions like "who is connected to me?" or "what is the shortest path from Alice to Bob?".
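
To make the "shortest path from Alice to Bob" question concrete, here is a minimal breadth-first search over an adjacency dictionary (names and edges are made up); a graph engine like Neo4j answers such queries natively over persisted, indexed graphs:

from collections import deque

graph = {
    "Alice": ["Carol", "Dave"],
    "Carol": ["Bob"],
    "Dave": ["Carol"],
    "Bob": [],
}

def shortest_path(start, goal):
    # classic BFS: the first path that reaches the goal is a shortest one
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

print(shortest_path("Alice", "Bob"))  # ['Alice', 'Carol', 'Bob']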

Some people try to solve every different problem with the same tool (= database), but I think it is essential to understand the use case and pick the best-suited tool for the job. If all you have is a hammer, everything else must be a nail, right?!

Happy hacking ;-)

P.S.: this list of database types is not even close to complete; it is based on things I came across in the past, at least in terms of architecture patterns.

Picture by Axel Hindemith, Public Domain


Discussion

Great article... especially the part about time-series DBs.

I would like to point out to every single developer reading this that the fallacy of abstracting the problem away and being able to switch to any DB provider in the future is horrible!

Once you go relational, the whole app logic is relational. The same goes for non-relational; so, as you said, it's super duper important to pick the right tool before starting.

Yes, learnt the hard way.
In my team we spend about 30-40% of our time talking, reasoning, and discussing options and their pros and cons.
The code then comes almost for free.

Really great piece of content... this reminds me why I love backend development. Once I was working on a DB project with a friend, just as a hobby, and we ran into a lot of problems, though not all of those you describe in your article. We were trying to store data in a kind of JSON form, and we had even designed a language to communicate with the server. Anyway, it worked fine... The experience was cool.

github.com/na2axl/jsondb-js

 

Thanks for sharing!
Databases are never easy - one of the reasons I sometimes react a bit harshly when people say "Oh, we're using MySQL because it's easy.".

That could mean several things:

  • complex problem not yet encountered
  • problem not understood
  • usage pattern fits database engine by coincidence
  • it seems easy because they know the toolchain very well

I think thinking hard before installing something pays off in the long term ;-)

 

I love your "no database model" exploration as the first explanation.
At harperdb.io, this is precisely where we started. As a co-founder, I was very vocal about the Linux file system already being a database; we only need to organize the data in a manner suited for retrieval as well as concurrent read and write access. We built an excellent data model based on file paths and database attribute values:
hdb/schema/dog/name/billie/133.hdb
hdb/schema/dog/age/7/133.hdb
All the xxx.hdb files with the unique GUID lived in the attribute/value/xxx.hdb directory.
The data model is still valuable, but we ran into some file system limitations regarding large data sets, concerning inodes and indirect block pointers.
I crossed my fingers that we would find a new file system tech that would let us get past those limitations, but we did not have success - from bcachefs, to Ceph, to ext4 with a lot of optional parameters to increase inodes and allow inline data (putting small data values on the inode itself).
We have finally settled on using LMDB as a back end for our data model storage algorithm.
HarperDB allows one to do SQL and NoSQL calls; building it from scratch has allowed us to make it very modular, and adding new interpreters is on the roadmap.
Keep up the excellent insights and explorations.
Cheers.

 

Great post, thank you for making database types so easy to understand and providing extra resources.