MongoDB gives you flexibility that relational databases don't. No rigid tables, no mandatory schemas, no upfront column definitions. You just throw documents into a collection and go. That freedom is exactly what makes schema design in MongoDB so important and so easy to get wrong.
The problem is that "schemaless" doesn't mean "no design needed." Without a good schema strategy, you end up with slow queries, bloated documents and data that's hard to work with as your application grows. These six patterns solve the most common problems developers hit when designing MongoDB schemas.
1. Embedding vs referencing
This is the first decision you'll make for every relationship in your data model. Should related data live inside the same document or in a separate collection with a reference? The answer depends on how you read and write the data.
Embedding means nesting related data directly within a document. If you have a blog post with comments, embedding puts the comments array inside the post document. One read gets everything. No joins needed.
```javascript
// Embedded comments inside a blog post
{
  _id: ObjectId("..."),
  title: "MongoDB schema tips",
  author: "Jane",
  comments: [
    { user: "Bob", text: "Great article!", date: ISODate("2026-02-01") },
    { user: "Alice", text: "Very helpful", date: ISODate("2026-02-02") }
  ]
}
```
Referencing stores related data in a separate collection and links them with an ObjectId. You fetch the post first, then the comments in a second query (or use $lookup for a server-side join).
```javascript
// Post document
{ _id: ObjectId("post1"), title: "MongoDB schema tips", author: "Jane" }

// Separate comment documents
{ _id: ObjectId("c1"), postId: ObjectId("post1"), user: "Bob", text: "Great article!" }
{ _id: ObjectId("c2"), postId: ObjectId("post1"), user: "Alice", text: "Very helpful" }
```
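The server-side join mentioned above can be expressed as an aggregation pipeline. A minimal sketch, assuming the collections are named `posts` and `comments` and that comment documents carry the `postId` field shown above:

```javascript
// Aggregation pipeline that attaches each post's comments as an array.
// Collection and field names ("comments", "postId") follow the example
// documents above; adjust them to your own schema.
const postsWithComments = [
  {
    $lookup: {
      from: "comments",        // the referenced collection
      localField: "_id",       // field on the post
      foreignField: "postId",  // field on each comment
      as: "comments"           // name of the joined output array
    }
  }
];

// Run it in the shell or a driver:
// db.posts.aggregate(postsWithComments)
```

Keep in mind that `$lookup` still does the join work on every read, which is why the embed-vs-reference decision below matters.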
When to embed vs reference:
| Factor | Embed | Reference |
|---|---|---|
| Read pattern | Data is always read together | Data is read independently |
| Array growth | Bounded (won't grow indefinitely) | Unbounded (could grow to thousands) |
| Document size | Stays well under 16 MB limit | Would approach size limits |
| Update frequency | Nested data rarely changes | Nested data changes frequently |
| Data reuse | Used only in this context | Shared across multiple documents |
Embedding works well for one-to-few relationships where the nested data is tightly coupled to the parent. Think user profiles with addresses, products with a small list of variants or orders with line items. Referencing is better when the related data grows without bound, gets accessed independently or is shared across multiple parent documents.
2. The subset pattern
Documents in MongoDB have a 16 MB size limit, but you'll hit performance problems long before that. Loading a 2 MB document when you only need a few fields from it wastes network bandwidth and memory. The subset pattern solves this by keeping the most-accessed data in the main document and moving the rest to a secondary collection.
A common example is an e-commerce product page. The product listing shows the name, price, main image and the three most recent reviews. But the product might have 500 reviews total. Loading all 500 reviews every time someone views the product page is wasteful.
```javascript
// Main product document (fast reads for product listings)
{
  _id: ObjectId("prod1"),
  name: "Wireless Headphones",
  price: 79.99,
  image: "headphones-main.jpg",
  recentReviews: [
    { user: "Alex", rating: 5, text: "Sound quality is excellent", date: ISODate("2026-02-05") },
    { user: "Sam", rating: 4, text: "Comfortable for long use", date: ISODate("2026-02-03") },
    { user: "Jordan", rating: 5, text: "Best in this price range", date: ISODate("2026-01-28") }
  ],
  reviewCount: 487,
  averageRating: 4.3
}

// Full reviews in a separate collection (loaded only on "See all reviews")
{
  _id: ObjectId("rev1"),
  productId: ObjectId("prod1"),
  user: "Alex",
  rating: 5,
  text: "Sound quality is excellent",
  date: ISODate("2026-02-05")
}
```
The trade-off is data duplication. The three recent reviews exist in both the product document and the reviews collection, and you need to keep them in sync when reviews are added. But the read performance gain is significant when the vast majority of your traffic only needs the subset.
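The sync step can be done in a single atomic update: push the new review into the embedded array and let MongoDB sort and trim it in the same operation. A sketch, assuming the `recentReviews` and `reviewCount` field names from the product document above:

```javascript
// Build the update spec that embeds a new review, keeps the array
// sorted newest-first, trims it back to three entries, and bumps the
// denormalized counter - all in one atomic updateOne.
function buildReviewUpdate(review) {
  return {
    $push: {
      recentReviews: {
        $each: [review],      // the new review to embed
        $sort: { date: -1 },  // newest first
        $slice: 3             // keep only the three most recent
      }
    },
    $inc: { reviewCount: 1 }
  };
}

// Usage (shell or driver), alongside the insert into the reviews collection:
// db.products.updateOne({ _id: productId }, buildReviewUpdate(newReview))
```

Recomputing `averageRating` is a separate concern; the computed pattern later in the article covers it.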
This pattern applies anywhere you have a one-to-many relationship where most reads only need a small portion of the "many" side. User activity feeds, article comments and notification lists all benefit from it.
3. The bucket pattern
Time-series and event data can generate enormous numbers of documents. If your IoT sensors send readings every second, that's 86,400 documents per sensor per day. Storing each reading as an individual document creates index bloat and makes range queries slower than they need to be.
The bucket pattern groups multiple data points into a single document based on a time range. Instead of one document per reading, you store one document per hour (or per minute, depending on your granularity).
```javascript
// Without bucket pattern: one document per reading
{ sensorId: "temp-01", value: 22.5, timestamp: ISODate("2026-02-08T10:00:00Z") }
{ sensorId: "temp-01", value: 22.6, timestamp: ISODate("2026-02-08T10:00:01Z") }
{ sensorId: "temp-01", value: 22.4, timestamp: ISODate("2026-02-08T10:00:02Z") }
// ... 86,397 more documents for this sensor today

// With bucket pattern: one document per hour
{
  sensorId: "temp-01",
  startDate: ISODate("2026-02-08T10:00:00Z"),
  endDate: ISODate("2026-02-08T10:59:59Z"),
  count: 3600,
  readings: [
    { value: 22.5, timestamp: ISODate("2026-02-08T10:00:00Z") },
    { value: 22.6, timestamp: ISODate("2026-02-08T10:00:01Z") },
    { value: 22.4, timestamp: ISODate("2026-02-08T10:00:02Z") }
    // ... 3597 more readings
  ],
  summary: {
    avg: 22.5,
    min: 21.8,
    max: 23.1
  }
}
```
Benefits of the bucket pattern:
- Fewer documents means smaller indexes and faster queries
- Pre-computed summaries (avg, min, max) avoid full scans for common aggregations
- Range queries only touch a handful of bucket documents instead of thousands of individual ones
- Deleting old data is simpler since you drop entire bucket documents
The bucket size depends on your access pattern. If most queries ask for hourly summaries, use hourly buckets. If users typically look at daily dashboards, daily buckets work better. The key is to match bucket granularity to how the data gets consumed.
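Writing a reading into the right bucket means deriving a bucket key from the timestamp and folding the value into the running summary. A runnable sketch of that logic in plain JavaScript (the `count` and `summary` field names mirror the bucket document above):

```javascript
// Truncate a timestamp to the start of its hourly bucket.
function bucketStart(ts) {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0); // zero out minutes, seconds, millis
  return d;
}

// Fold one reading into a bucket's running summary without
// re-scanning the readings array.
function addReading(bucket, value) {
  const count = bucket.count + 1;
  return {
    ...bucket,
    count,
    summary: {
      // incremental mean: old mean weighted by old count, plus new value
      avg: (bucket.summary.avg * bucket.count + value) / count,
      min: Math.min(bucket.summary.min, value),
      max: Math.max(bucket.summary.max, value)
    }
  };
}
```

In MongoDB itself this maps onto a single upsert keyed by sensor and bucket start: `$push` the reading, `$inc` the count, and `$min`/`$max` on the summary bounds (the running average needs either a stored sum field or a read-modify-write).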
Note that MongoDB 5.0 introduced native time series collections, which handle some of this bucketing automatically. But the bucket pattern is still useful for custom aggregations and when you need pre-computed summaries stored alongside the raw data.
4. The polymorphic pattern
Not every document in a collection needs to look the same. The polymorphic pattern handles entities that share some common fields but differ in their details. Instead of creating separate collections for each variation, you store them all in one collection with a type field.
A content management system is a good example. You might have articles, videos, podcasts and image galleries. They all have a title, author, publish date and tags. But an article has a body field, a video has a duration and URL, and a podcast has an audio file and episode number.
```javascript
// Article
{
  _id: ObjectId("..."),
  type: "article",
  title: "Getting started with MongoDB",
  author: "Jane",
  publishDate: ISODate("2026-02-01"),
  tags: ["mongodb", "tutorial"],
  body: "MongoDB is a document database...",
  wordCount: 1500
}

// Video
{
  _id: ObjectId("..."),
  type: "video",
  title: "MongoDB schema design workshop",
  author: "Jane",
  publishDate: ISODate("2026-02-05"),
  tags: ["mongodb", "schema"],
  videoUrl: "https://example.com/videos/mongo-schema",
  duration: 2400,
  resolution: "1080p"
}

// Podcast
{
  _id: ObjectId("..."),
  type: "podcast",
  title: "Database trends in 2026",
  author: "Bob",
  publishDate: ISODate("2026-02-07"),
  tags: ["databases", "trends"],
  audioUrl: "https://example.com/podcasts/db-trends",
  episodeNumber: 42,
  duration: 1800
}
```
The advantage is that queries across all content types are simple. Want all content by Jane sorted by date? One query on one collection. Want only videos? Add a filter on the type field. The shared fields make indexing straightforward, and you can create partial indexes for type-specific fields.
```javascript
// Index for type-specific queries
db.content.createIndex({ type: 1, publishDate: -1 })

// Partial index only for videos
db.content.createIndex(
  { duration: 1 },
  { partialFilterExpression: { type: "video" } }
)
```
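"Schemaless" doesn't have to mean unvalidated, either: you can enforce the shared fields at the database level with MongoDB's schema validation while leaving the type-specific fields open. A sketch, assuming the `content` collection and field names from the examples above:

```javascript
// Validator enforcing the fields every content type shares.
// Type-specific fields (body, videoUrl, audioUrl, ...) are
// deliberately left unconstrained.
const contentValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["type", "title", "author", "publishDate"],
    properties: {
      type: { enum: ["article", "video", "podcast"] },
      title: { bsonType: "string" },
      publishDate: { bsonType: "date" }
    }
  }
};

// Apply it when creating the collection:
// db.createCollection("content", { validator: contentValidator })
```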
This pattern works when the entities share enough common fields to justify a single collection and when you frequently query across types. If different types are always queried separately and share almost nothing, separate collections might be cleaner.
5. The extended reference pattern
When you reference data in another collection, sometimes you need a few fields from that referenced document on almost every read. The extended reference pattern copies those frequently-needed fields into the referencing document to avoid a second lookup.
Consider an order system. Every order references a customer. When you display the order list, you need the customer name and email. Without the extended reference, every order list page requires a $lookup or a second query to the customers collection.
```javascript
// Instead of just storing customerId
{
  _id: ObjectId("order1"),
  customerId: ObjectId("cust1"),
  items: [
    { product: "Widget", quantity: 3, price: 9.99 }
  ],
  total: 29.97,
  orderDate: ISODate("2026-02-08")
}

// Store frequently-needed customer fields directly in the order
{
  _id: ObjectId("order1"),
  customer: {
    _id: ObjectId("cust1"),
    name: "Alice Johnson",
    email: "alice@example.com"
  },
  items: [
    { product: "Widget", quantity: 3, price: 9.99 }
  ],
  total: 29.97,
  orderDate: ISODate("2026-02-08")
}
```
The trade-off is data staleness. If Alice changes her email, the orders still show the old one until you update them. For many use cases this is acceptable. An order should probably reflect the customer information at the time it was placed anyway.
When to use the extended reference pattern:
- The referenced fields are read frequently but updated rarely
- Join operations ($lookup) are causing performance issues
- The copied fields are small relative to the document size
- Slight staleness in the copied data is acceptable
This pattern is different from full embedding. You're not copying the entire customer document into every order. You're selectively copying only the fields that the most common queries need. The full customer record still lives in its own collection for detailed views and updates.
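When the source data does change and you want existing orders to follow, propagation is a single multi-document update. A sketch, assuming the `customer` subdocument layout shown above:

```javascript
// Pick only the fields worth denormalizing onto each order.
function customerRef(customer) {
  return { _id: customer._id, name: customer.name, email: customer.email };
}

// Filter + update for propagating an email change to existing orders.
// Whether you run this at all is a product decision: historical orders
// often should keep the email as it was at purchase time.
function emailPropagation(customerId, newEmail) {
  return {
    filter: { "customer._id": customerId },
    update: { $set: { "customer.email": newEmail } }
  };
}

// Usage (shell or driver):
// const { filter, update } = emailPropagation(custId, "new@example.com");
// db.orders.updateMany(filter, update)
```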
6. The computed pattern
Some values are expensive to calculate on the fly. If you're counting the number of views on a video, computing the average rating from thousands of reviews or aggregating daily sales totals, doing that calculation on every read is wasteful.
The computed pattern pre-calculates these values and stores them in the document. You update them when the underlying data changes, not when someone reads the result.
```javascript
// Product with pre-computed statistics
{
  _id: ObjectId("prod1"),
  name: "Wireless Headphones",
  price: 79.99,
  stats: {
    totalReviews: 487,
    averageRating: 4.3,
    ratingDistribution: {
      "5": 203,
      "4": 156,
      "3": 78,
      "2": 34,
      "1": 16
    },
    totalSold: 2341,
    lastPurchaseDate: ISODate("2026-02-08T14:30:00Z")
  }
}
```
When a new review comes in, you update the stats using atomic operations:
```javascript
db.products.updateOne(
  { _id: ObjectId("prod1") },
  {
    $inc: {
      "stats.totalReviews": 1,
      "stats.ratingDistribution.4": 1
    },
    $set: {
      "stats.averageRating": 4.28
    }
  }
)
```
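The new average in that `$set` has to come from somewhere. Because the document already stores the rating distribution, you can derive it exactly from the counters instead of re-aggregating all reviews. A runnable sketch in plain JavaScript:

```javascript
// Compute the average rating from a distribution of star counts,
// e.g. { "5": 203, "4": 156, ... }. Keeping the distribution in the
// document means the average can be recomputed exactly after every
// $inc, with no scan of the reviews collection.
function averageFromDistribution(dist) {
  let sum = 0;
  let count = 0;
  for (const [stars, n] of Object.entries(dist)) {
    sum += Number(stars) * n;
    count += n;
  }
  return count === 0 ? 0 : sum / count;
}
```

In the update above, you would read the current distribution, apply the increment, and `$set` the result of this function as the new `stats.averageRating`.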
| Approach | Read cost | Write cost | Accuracy |
|---|---|---|---|
| Calculate on read | High (aggregation every time) | None | Always current |
| Computed pattern | Low (pre-stored value) | Low (incremental update) | Eventually consistent |
| Background job | Low (pre-stored value) | Batch update on schedule | Delayed |
The computed pattern is the right choice when reads vastly outnumber writes and the computation is non-trivial. Product ratings, follower counts, dashboard metrics and leaderboards are all good candidates.
For background computation jobs, you need reliable scheduling. If the computation updates stall because a cron job dies silently, your users see stale data indefinitely. Monitoring and alerting on these jobs matters.
Combining patterns in practice
Real applications rarely use a single pattern in isolation. A product catalog might use the subset pattern for reviews, the computed pattern for aggregate statistics, embedding for product variants and the extended reference pattern for category information. The patterns compose well.
The key principle behind all of them is the same: design your schema around your queries, not around your entities. In relational databases, you normalize first and optimize later. In MongoDB, you start by listing your most frequent queries and design the schema to serve those queries efficiently.
Here are a few practical guidelines for combining patterns:
- Start simple. Embed first. Only introduce references and patterns when you hit a specific problem like document size, update complexity or query performance.
- Know your read-to-write ratio. High-read workloads benefit from denormalization (embedding, computed, extended reference). High-write workloads favor normalization (referencing) to avoid updating data in multiple places.
- Monitor document growth. If a document's embedded array keeps growing, apply the subset or bucket pattern before it becomes a problem.
As your MongoDB deployment grows, having reliable MongoDB backup becomes critical. Schema changes and data migrations can go wrong, and recovering from a bad migration without a backup means data loss. Databasus is an industry standard for MongoDB backup tools, offering automated scheduled backups with compression, encryption and multiple storage destinations for both solo developers and enterprise teams.
Choosing the right pattern
There's no single correct schema for any application. The right choice depends on your query patterns, data volume, update frequency and consistency requirements. These six patterns cover the scenarios that come up most often in practice.
Start with the simplest design that works. Add complexity only when you have evidence that the simple approach isn't performing. Profile your queries, watch your document sizes and pay attention to how your data grows over time. The best schema is the one that makes your most common operations fast and your least common operations possible.
