Philipp Gysel

Posted on Aug 27, 2020 • Edited on Sep 1, 2020

Introduction to MongoDB and Document Databases

#mongodb #nosql #database #beginners

This tutorial teaches you the basics of NoSQL Document Databases. To keep things practical, we’ll also look at a concrete implementation, namely MongoDB.

First, we’ll start with the data model in document databases and compare it to the SQL data model. Second, we’ll dive into the characteristics of MongoDB, such as queries, data replication, sharding, and consistency. We’ll wrap things up with some real-world use cases well suited for MongoDB.

Data Model

The fundamental difference between SQL and NoSQL is how the data gets stored. Relational databases are like Excel tables with rows and columns. This way of storing data means each row has the exact same fields. In contrast, NoSQL allows for more flexibility in what you store. It’s easier to persist arrays, fields of unknown length, or add new data fields. Let’s look at an example.

UML data model:

Above you see the data model for a simple online store: we store customers and their orders. A given customer can have multiple orders, and each order is associated with a delivery address and all items purchased.

SQL tables:

In SQL, a convenient way to store this data is to use 4 tables (see graph above). We would connect the orders to the customer and the items through foreign keys. To fetch a given order, you’d perform a join over the Order, Address and Item tables.

Let's switch from the relational storage model to the non-relational one. In a document store, we store all data in so-called documents (in JSON, XML or BSON format).

NoSQL document:

// in orders
{
 "_id": 99,
 "orderDate": "2020-08-22",
 "customer": {
  "customerId": 1,
  "firstName": "Philipp",
  "lastName": "Gysel"
 },
 "address": {
  "street": "Main ln",
  "city": "Bern"
 },
 "items": [
  {
   "itemId": 88,
   "price": 9.90,
   "description": "NoSQL Distilled"
  }
 ]
}

The above code snippet uses 1 collection to store all data (as opposed to 4 tables in SQL). Each document in the collection orders contains the delivery address and all items. Moreover, the customer data is also present in each order. Note that we leverage a powerful feature here: different orders will have varying lengths, since the items element is an array of arbitrary size.
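To make the "everything in one document" idea a bit more tangible, here is a minimal sketch in plain JavaScript (Node.js, no database involved). It treats the order document from above as an ordinary object and reads everything from it without any join. The helper name `orderTotal` is made up for this illustration and is not part of MongoDB.

```javascript
// The order document from above, as a plain JavaScript object.
const order = {
  _id: 99,
  orderDate: "2020-08-22",
  customer: { customerId: 1, firstName: "Philipp", lastName: "Gysel" },
  address: { street: "Main ln", city: "Bern" },
  items: [{ itemId: 88, price: 9.9, description: "NoSQL Distilled" }],
};

// Hypothetical helper: sums the prices of all embedded items,
// however many entries the items array happens to contain.
function orderTotal(doc) {
  return doc.items.reduce((sum, item) => sum + item.price, 0);
}

// Customer, address and items all come from the same document.
const summary =
  `${order.customer.firstName} ${order.customer.lastName}, ` +
  `${order.address.city}: ${order.items.length} item(s), ` +
  `total ${orderTotal(order).toFixed(2)}`;

console.log(summary); // Philipp Gysel, Bern: 1 item(s), total 9.90
```

An order with ten items would go through the same code unchanged, which is the flexibility the array gives us.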

How to Store and Manipulate Data

In SQL:

No Arrays: In a classic normalized schema, a column doesn’t hold an array. Thus, every purchased item needs its own table row, linked to its order.
Normalization: Data is normalized. There’s no data duplication. In case the customer wants to change their name in the online store, we just need to update the customer name in one place.

In NoSQL document store:

Arrays are supported: We can store all items of a given order together.
Data duplication: Each order has the customer data. Now if a customer changes name, we have to perform this change in each order of this customer, unfortunately.
Aggregation: In document databases and NoSQL in general, we use aggregates, which contain a mix of data. We typically draw the aggregate boundaries according to how the application accesses the data. In our case, an orders document contains more than just the order itself, which makes sense since the application has to read all of this data anyway.
Not optimized for transactions: Newer versions of NoSQL databases do offer support for transactions, but these databases are not primarily designed for distributed ACID behavior. For most use cases though, this is not a problem, especially when all logically connected data lives in the same aggregate.
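The data duplication point is easy to underestimate, so here is a small sketch of what a name change means when the customer is copied into every order. It runs in plain JavaScript against an in-memory array that merely stands in for the orders collection; the helper `renameCustomer` and the new name "Meier" are assumptions for illustration only.

```javascript
// In-memory stand-in for the orders collection (no database involved).
const storedOrders = [
  { _id: 99, customer: { customerId: 1, firstName: "Philipp", lastName: "Gysel" } },
  { _id: 100, customer: { customerId: 1, firstName: "Philipp", lastName: "Gysel" } },
  { _id: 101, customer: { customerId: 2, firstName: "John", lastName: "Smith" } },
];

// Hypothetical helper: rewrites the duplicated last name in every order
// of the given customer and returns how many orders were touched.
function renameCustomer(allOrders, customerId, newLastName) {
  let modified = 0;
  for (const o of allOrders) {
    if (o.customer.customerId === customerId) {
      o.customer.lastName = newLastName;
      modified += 1;
    }
  }
  return modified;
}

const modifiedCount = renameCustomer(storedOrders, 1, "Meier");
console.log(modifiedCount); // 2
```

With a normalized SQL schema, the same change would touch exactly one row in the customer table. Against a real MongoDB collection, you would probably express this loop as a single `updateMany` call with a filter on `customer.customerId`.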

MongoDB

In order to make things more tangible, let’s look at a concrete implementation! In the next few sections we cover the basics of MongoDB.

MongoDB Queries

MongoDB is a document store: conceptually, it persists a key-value map in which each value is a JSON-like document (stored as BSON) whose schema may vary. As a case in point, a collection can contain the following two documents:

{
 "_id": 1,
 "firstName": "Philipp",
 "lastName": "Gysel"
}
{
 "_id": 2,
 "firstName": "John",
 "lastName": "Smith",
 "age": 22
}

Using MongoDB Shell, we could query for a customer with first name “Philipp” as follows:

> db.customers.find({firstName:"Philipp"})

... which would return us the first customer in JSON format.
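If you are wondering what such a filter document means, the following toy matcher mimics it in plain JavaScript. It is only a sketch under a strong assumption: it handles top-level equality and nothing else, whereas MongoDB's real query language also offers operators, nested fields and more. The function name `findByExample` is made up.

```javascript
// In-memory stand-in for the customers collection.
const customers = [
  { _id: 1, firstName: "Philipp", lastName: "Gysel" },
  { _id: 2, firstName: "John", lastName: "Smith", age: 22 },
];

// Toy matcher (assumption: top-level equality only, no query operators).
// A document matches when every field in the filter has the same value.
function findByExample(docs, filter) {
  return docs.filter((doc) =>
    Object.entries(filter).every(([field, value]) => doc[field] === value)
  );
}

const found = findByExample(customers, { firstName: "Philipp" });
console.log(JSON.stringify(found));
// [{"_id":1,"firstName":"Philipp","lastName":"Gysel"}]
```

Note that filtering on `age` works even though only one of the two documents has that field; documents without it simply don't match.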

MongoDB offers a variety of query features for CRUD operations, projection, sorting etc. For more on queries, watch out for my next post in this series.

MongoDB Naming Convention

With MongoDB, there are no rows and columns. Instead, we deal with documents and JSON elements:

Oracle	MongoDB
table	collection
row	document
rowid	_id

The primary key of a MongoDB document is always called _id and naturally you can do queries by primary key.

MongoDB - No Schema

MongoDB has no predefined schema. If you look again at the documents above, you can see that “John” has an age element, which is missing from “Philipp”. Moreover, you can even change the schema of an existing document:

> db.customers.updateOne({_id:1},{$set:{city:"Bern"}})

... this will add a city field to the existing document.
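The effect of the update above can be imitated in plain JavaScript. This is just a toy stand-in under stated assumptions: the helper `applySet` is made up, it only handles top-level fields, and it returns a copy, whereas the real `$set` changes the stored document.

```javascript
const philipp = { _id: 1, firstName: "Philipp", lastName: "Gysel" };

// Toy stand-in for $set (assumption: top-level fields only, no dotted paths).
// Returns a copy with the given fields added or overwritten.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

const updated = applySet(philipp, { city: "Bern" });

console.log(Object.keys(philipp)); // [ '_id', 'firstName', 'lastName' ]
console.log(Object.keys(updated)); // [ '_id', 'firstName', 'lastName', 'city' ]
```

No table definition had to change for the new field to appear, and other documents in the collection are unaffected.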

MongoDB Replica Sets

For read scalability, MongoDB supports replica sets. A replica set contains one master and multiple slaves (MongoDB’s own documentation calls them primary and secondaries); each node contains all data. MongoDB allows you to add new nodes to a running database when more data traffic needs to be supported. By default, all requests from the application go to the master node.

In the image above, the application performs a write operation to the database, which is sent to the master and then passed on to all slave nodes.

Read scalability is achieved through the fact that each slave contains all data. All you need to do is specify that reads from slaves are ok:

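import com.mongodb.Mongo;
// Legacy Java driver API (deprecated); newer drivers express this as a read preference.
Mongo mongo = new Mongo("localhost:27017"); // 27017 is MongoDB's default port
mongo.slaveOk(); // allow reads from slaves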

Given 3 slaves, reads can be spread over 4 nodes, so you get up to roughly 4 times the read throughput!

MongoDB Sharding

Read scalability: check✔️. What about write scalability❓ Replica sets won’t help us here, since all writes need to go through the master. That’s where sharding comes in: with sharding, we split our data into partitions, and each partition is stored in a different shard. Conceptually, a simple rule, like applying modulo to the primary key of a document, assigns each document to a shard.

Now the writes get distributed over different nodes. This can be especially helpful for write-heavy applications like log capturing (e.g. New Relic). Also, MongoDB makes sharding easy from an application perspective and performs all the complicated work automatically in the background, like balancing shards or figuring out which document lives on which shard.
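The modulo rule mentioned above fits in a few lines of plain JavaScript. Treat it as an illustration of the idea only: the function `shardFor` and the choice of 3 shards are assumptions, and MongoDB itself partitions data by ranges or hashes of a configurable shard key rather than by this exact rule.

```javascript
// Hypothetical routing rule: shard index = numeric _id modulo the shard count.
// Illustration only; this is not how MongoDB picks a shard internally.
function shardFor(id, shardCount) {
  return id % shardCount;
}

// Distribute ten documents with ids 1..10 over three shards.
const shardCount = 3;
const shards = Array.from({ length: shardCount }, () => []);
for (let id = 1; id <= 10; id++) {
  shards[shardFor(id, shardCount)].push(id);
}

console.log(shards[0]); // [ 3, 6, 9 ]
console.log(shards[1]); // [ 1, 4, 7, 10 ]
console.log(shards[2]); // [ 2, 5, 8 ]
```

Each shard ends up with roughly a third of the documents, so each one also receives roughly a third of the writes.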

MongoDB Consistency

As soon as a MongoDB cluster uses replication and reads from slaves are allowed, special care needs to be taken for consistency. Hypothetically, an application can persist a document, then query for it, and the query will come back empty. The reason lies in the topology of the cluster: in a master-slave configuration, a write doesn’t reach each slave immediately, and it might take a fraction of a second before all slaves receive the update. While a single-node SQL database offers immediate consistency, reads from MongoDB slaves are only eventually consistent: all nodes are guaranteed to have all updates in the end.
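Here is a toy model of that situation in plain JavaScript, with one master and one slave represented by two maps. It is a deliberately simplified sketch: the names `write` and `replicate` are made up, and replication happens when we call it explicitly, whereas a real cluster ships updates continuously in the background.

```javascript
// Toy model of one master and one slave, each holding documents by _id.
const master = new Map();
const slave = new Map();
const replicationQueue = []; // writes the slave has not applied yet

// A write lands on the master right away and is queued for the slave.
function write(doc) {
  master.set(doc._id, doc);
  replicationQueue.push(doc);
}

// Applies all pending writes to the slave (stands in for replication lag ending).
function replicate() {
  while (replicationQueue.length > 0) {
    const doc = replicationQueue.shift();
    slave.set(doc._id, doc);
  }
}

write({ _id: 1, firstName: "Philipp" });
const staleRead = slave.get(1);   // undefined: the slave has not caught up yet
const masterRead = master.get(1); // the master already has the document
replicate();
const laterRead = slave.get(1);   // now the slave returns the document too

console.log(staleRead, masterRead, laterRead);
```

The read in the middle is the surprising one: the application just wrote the document, yet the slave knows nothing about it.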

For better control, MongoDB lets you specify a writeConcern for each insert and update:

db.companies.insertOne(
   {_id: 1, firstName: "Philipp", lastName: "Gysel"},
   {writeConcern: {w: "majority", wtimeout: 5000}}
)

Here, we mandate that an insert is propagated to the majority of nodes and the DB call is blocked until this happens or the timeout of 5 seconds is reached. Beware though: consistency comes at a price (latency), so think carefully about your specific use cases and how consistent your data needs to be.
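What "majority" means in numbers can be sketched with a little arithmetic in plain JavaScript. The simplifying assumption here is that every node in the replica set counts equally; the function names `majority` and `isAcknowledged` are made up for this example.

```javascript
// Simplifying assumption: every node counts as a voting, data-bearing member.
function majority(nodeCount) {
  return Math.floor(nodeCount / 2) + 1;
}

// Hypothetical check: has a write been confirmed by enough nodes?
function isAcknowledged(ackCount, nodeCount) {
  return ackCount >= majority(nodeCount);
}

console.log(majority(3)); // 2
console.log(majority(4)); // 3
console.log(majority(5)); // 3
console.log(isAcknowledged(1, 3)); // false: only the master has the write
console.log(isAcknowledged(2, 3)); // true: master plus one slave
```

So with 1 master and 2 slaves, the insert returns once one slave has confirmed it, and it may still be missing on the other slave for a short moment.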

MongoDB Transactions

Since MongoDB 4.0, multi-document transactions are supported. Thanks to this feature, documents living in different collections can be updated in an atomic fashion. Beware though, as the official MongoDB documentation states:

In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transactions should not be a replacement for effective schema design.

As a rule of thumb, you should try to pack all interconnected data into one document, if possible, so that you don’t need to worry about transactions.

MongoDB: Suitable Business Cases

MongoDB isn’t a hammer to be used for every nail. Make sure you analyze the business and non-functional requirements of your application, and then search for a matching database solution. While MongoDB has many advantages (scalability, a simple query language, and publicly available source code), there are also disadvantages, like slower transactions compared to SQL databases.

So here are some use cases well suited for MongoDB:

Logging applications: Apps which track huge logs are well suited for MongoDB. Immediate consistency is not necessary, as a new log entry doesn’t have to be available right away. However, performance is really important, especially when many applications with verbose logs are involved. Big write volumes can be nicely handled through sharding.
Blogs / social media: For a Twitter-like app, each entry has a different format. Some contain text, some images, some videos, but everything is optional. This maps well to documents which can contain arbitrary elements. Also, if the app wants to support new types of entries in the future, no problem: no database changes are required, you can simply add a new field to new documents. What's more, immediate consistency isn’t important; it’s fine to see a particular entry with a delay of 1 second.

Conclusion

I hope this helped you learn more about MongoDB. Needless to say, the best way to learn a technology is by actually using it😉.

Add a ❤️ if you liked the MongoDB tutorial. Leave a comment if you have any questions / feedback.

For more on MongoDB, check out my next post in this series, which covers MongoDB queries from MongoDB Shell and Java!

Oldest comments (2)

Dĵ ΝιΓΞΗΛψΚ • Sep 1 '20

MongoDB doesn’t support transactions. Consequently, you can’t update multiple documents in a guaranteed atomic fashion.

dude! are you serious?
mongodb does have multi-document transactions 🤑

Philipp Gysel • Sep 1 '20

hey dude, actually no, there are no transactions! ... just kidding;) thanks for the hint, I updated the article accordingly!