Le Vuong

Posted on Sep 30, 2024 • Edited on Dec 19, 2024

MongoDB schema design - Key considerations

#mongodb #schemadesign #bestpractices

Introduction

The document discusses sophisticated schema designs, including denormalization (duplicating data) and two-way referencing.
Cardinality of relationship: “one-to-few,” “one-to-many,” or “one-to-squillions”

Basics: Modeling one-to-few

The main advantage is that you don’t have to perform a separate query to get the embedded details; the main disadvantage is that you have no way of accessing the embedded details as stand-alone entities.

> db.person.findOne()
{
 name: 'Kate Monster',
 ssn: '123-456-7890',
 addresses : [
    { street: '123 Sesame St', city: 'Anytown', cc: 'USA' },
    { street: '123 Avenue Q', city: 'New York', cc: 'USA' }
 ]
}

Basics: One-to-many

Reference from parent (product) to child (parts)

> db.parts.findOne()
{
   _id : ObjectID('AAAA'),
   partno : '123-aff-456',
   name : '#4 grommet',
   qty: 94,
   cost: 0.94,
   price: 3.99
}

> db.products.findOne()
{
 name : 'left-handed smoke shifter',
 manufacturer : 'Acme Corp',
 catalog_number: 1234,
 parts : [     // array of references to Part documents
     ObjectID('AAAA'),    // reference to the #4 grommet above
     ObjectID('F17C'),    // reference to a different Part
     ObjectID('D2AA'),
     // etc
 ]
}

// Fetch the Product document identified by this catalog number
> product = db.products.findOne({catalog_number: 1234});
// Fetch all the Parts that are linked to this Product
> product_parts = db.parts.find({_id: { $in : product.parts } } ).toArray() ;

This style of referencing has a complementary set of advantages and disadvantages to embedding. Each Part is a stand-alone document, so it’s easy to search them and update them independently. One trade off for using this schema is having to perform a second query to get details about the Parts for a Product. (But hold that thought until we get to denormalization.)
As an added bonus, this schema lets you have individual Parts used by multiple Products, so your One-to-N schema just became an N-to-N schema without any need for a join table!

Basics: One-to-squillions

16 MB document size (common Mongodb document size limit)
Reference from child to parent

> db.hosts.findOne()
{
    _id : ObjectID('AAAB'),
    name : 'goofy.example.com',
    ipaddr : '127.66.66.66'
}


>db.logmsg.findOne()
{
    time : ISODate("2014-03-28T09:42:41.382Z"),
    message : 'cpu is on fire!',
    host: ObjectID('AAAB')       // Reference to the Host document
}

// App level join

// find the parent ‘host’ document
> host = db.hosts.findOne({ipaddr : '127.66.66.66'});  // assumes unique index
// find the most recent 5000 log message documents linked to that host
> last_5k_msg = db.logmsg.find({host: host._id}).sort({time : -1}).limit(5000).toArray()

Recap

You need to consider two factors:

Will the entities on the “N” side of the One-to-N ever need to stand alone?
What is the cardinality of the relationship: Is it one-to-few; one-to-many; or one-to-squillions?

Based on these factors, you can pick one of the three basic One-to-N schema designs:

Embed the N side if the cardinality is one-to-few and there is no need to access the embedded object outside the context of the parent object.
Use an array of references to the N-side objects if the cardinality is one-to-many or if the N-side objects should stand alone for any reasons.
Use a reference to the One-side in the N-side objects if the cardinality is one-to-squillions.

Intermediate: Two-way referencing

db.person.findOne()
{
    _id: ObjectID("AAF1"),
    name: "Kate Monster",
    tasks [     // array of references to Task documents
        ObjectID("ADF9"),
        ObjectID("AE02"),
        ObjectID("AE73")
        // etc
    ]
}

db.tasks.findOne()
{
    _id: ObjectID("ADF9"),
    description: "Write lesson plan",
    due_date:  ISODate("2014-04-01"),
    owner: ObjectID("AAF1")     // Reference to Person document
}

This design has all of the advantages and disadvantages of the “One-to-Many” schema, but with some additions. Putting in the extra "owner" reference into the Task document means that its quick and easy to find the task’s owner, but it also means that if you need to reassign the task to another person, you need to perform two updates instead of just one.
Using this schema design over a normalized database model means that it is no longer possible to reassign a Task to a new Person with a single atomic update. This is acceptable in some use-cases.

Intermediate: Database denormalization with one-to-many relationships

Database denormalization from many to one

Add name to product.parts[id, name]
Cost of a more expensive update … when you update the Part name you must also update every place it occurs in the "products" collection
Denormalization only makes sense when there’s an high ratio of reads to updates.
… you lose the ability to perform atomic and isolated updates on that field

> db.products.findOne()
{
    name : 'left-handed smoke shifter',
    manufacturer : 'Acme Corp',
    catalog_number: 1234,
    parts : [
        { id : ObjectID('AAAA'), name : '#4 grommet' },         // Part name is denormalized
        { id: ObjectID('F17C'), name : 'fan blade assembly' },
        { id: ObjectID('D2AA'), name : 'power switch' },
        // etc
    ]
}

Database denormalization from one to many

You can also denormalize fields from the “one” side into the “many” side
This is likely to be a more expensive update, since you’re updating multiple Parts instead of a single Product

> db.parts.findOne()
{
   _id : ObjectID('AAAA'),
   partno : '123-aff-456',
   name : '#4 grommet',
   product_name : 'left-handed smoke shifter',   // Denormalized from the ‘Product’ document
   product_catalog_number: 1234,                     // Ditto
   qty: 94,
   cost: 0.94,
   price: 3.99
}

Intermediate: Database denormalization with one-to-squillions relationships

You can also denormalize the one-to-squillions relationship. This works in one of two ways: you can either put information about the “one” side (from the "hosts" document) into the “squillions” side (the log entries), or you can put summary information from the “squillions” side into the “one” side.

> db.logmsg.findOne()
{
   time : ISODate("2014-03-28T09:42:41.382Z"),
   message : 'cpu is on fire!',
   ipaddr : '127.66.66.66',
   host: ObjectID('AAAB')
}


> last_5k_msg = db.logmsg.find({ipaddr : '127.66.66.66'}).sort({time :
-1}).limit(5000).toArray()

If there’s only a limited amount of information you want to store at the “one” side, you can denormalize it all into the “squillions” side and get rid of the “one” collection altogether

> db.logmsg.findOne()
{
   time : ISODate("2014-03-28T09:42:41.382Z"),
   message : 'cpu is on fire!',
   ipaddr : '127.66.66.66',
   hostname : 'goofy.example.com',
}

On the other hand, you can also denormalize into the “one” side. Let's say you want to keep the last 1,000 messages from a host in the "Hosts" document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1,000 messages

//  Get log message from monitoring system
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;
// Get current timestamp
now = new Date()
// Find the _id for the host I’m updating
host_doc = db.hosts.findOne({ipaddr : log_ip },{_id:1});  // Don’t return the whole document
host_id = host_doc._id;
// Insert the log message, the parent reference, and the denormalized data into the ‘many’ side
db.logmsg.save({time : now, message : log_message_here, ipaddr : log_ip, host : host_id ) });
// Push the denormalized log message onto the ‘one’ side
db.hosts.update( {_id: host_id },
 {$push : {logmsgs : { $each:  [ { time : now, message : log_message_here } ],
     $sort:  { time : 1 },  // Only keep the latest ones
     $slice: -1000 }        // Only keep the latest 1000
 }} );

Whoa! Looks at all these database denormalization choices!

Database denormalization rules of thumb: Your guide through the rainbow

Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:

One: Favor embedding unless there is a compelling reason not to.
Two: Needing to access an object on its own is a compelling reason not to embed it.
Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Four: Don’t be afraid of application-level joins: If you index correctly and use the projection specifier, then application-level joins are barely more expensive than server-side joins in a relational database.
Five: Consider the read-to-write ratio with denormalization. A field that will mostly be read and only seldom updated is a good candidate for denormalization. If you denormalize a field that is updated frequently then the extra work of finding and updating all the instances of redundant data is likely to overwhelm the savings that you get from denormalization.
Six: As always with MongoDB, how you model your data depends entirely on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your guide to the rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

What is the cardinality of the relationship? Is it “one-to-few,” “one-to-many,” or “one-to-squillions”?
Do you need to access the object on the “N” side separately, or only in the context of the parent object?
What is the ratio of updates-to-reads for a particular field?

Your main choices for structuring the data are:

For “one-to-few,” you can use an array of embedded documents.
For “one-to-many,” or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
For “one-to-squillions,” you should use a “parent-reference” in the document storing the “N” side.
Once you’ve decided on the overall structure of the data in your database design, then you can, if you choose, denormalize data across multiple documents

Reference

Mongo doc

DEV Community