Finny Collins

MongoDB aggregation pipeline — 8 stages you need to master

The aggregation pipeline is one of the most powerful features in MongoDB. It lets you transform, filter and analyze documents step by step — each stage takes the output of the previous one and passes the result forward. Think of it like a Unix pipe for your data.

If you've been relying on find() with simple queries, there's a good chance you're doing too much work in application code. The aggregation pipeline can handle most of that for you, and it does it closer to the data, which usually means faster.

This article walks through 8 stages that cover the vast majority of real-world use cases.

How the pipeline works

Before jumping into stages, it helps to understand the basic mechanics. An aggregation pipeline is an array of stage objects. MongoDB processes documents through each stage sequentially. The output of one stage becomes the input for the next.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } }
])

Each stage narrows, reshapes or enriches the data. The order matters — putting $match early reduces the number of documents later stages have to process.
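
To make the mechanics concrete, here is a plain JavaScript sketch of the pipeline above — this is an illustration of the data flow, not how MongoDB actually executes it. Each "stage" is a function from an array of documents to a new array, and running the pipeline is just folding the input through the stages in order.

```javascript
// Illustration only: each stage maps an array of documents to a new array.
const docs = [
  { status: "completed", customerId: "a", amount: 40 },
  { status: "pending", customerId: "a", amount: 10 },
  { status: "completed", customerId: "b", amount: 25 },
];

const match = (pred) => (input) => input.filter(pred);

const group = (keyFn, valFn) => (input) => {
  const totals = new Map();
  for (const d of input) {
    const k = keyFn(d);
    totals.set(k, (totals.get(k) ?? 0) + valFn(d));
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

const sortDesc = (field) => (input) =>
  [...input].sort((a, b) => b[field] - a[field]);

const pipeline = [
  match((d) => d.status === "completed"),
  group((d) => d.customerId, (d) => d.amount),
  sortDesc("total"),
];

// The output of each stage becomes the input of the next.
const result = pipeline.reduce((current, stage) => stage(current), docs);
console.log(result);
// [ { _id: 'a', total: 40 }, { _id: 'b', total: 25 } ]
```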

1. $match — filter documents early

$match filters documents, much like a find() query. It accepts standard query operators — $gt, $in, $regex and everything else you'd use in a regular query.

The most important thing about $match is placement. Always put it as early as possible. When $match is the first stage, MongoDB can use indexes. Push it further down the pipeline and you lose that optimization.

db.orders.aggregate([
  { $match: {
    createdAt: { $gte: ISODate("2025-01-01") },
    status: { $in: ["completed", "shipped"] }
  }}
])

This is not just a best practice — on large collections, the difference between an indexed $match at stage one and an unindexed filter at stage three can be orders of magnitude in execution time.

2. $project — reshape your documents

$project controls which fields appear in the output. You can include fields, exclude them, rename them or compute new ones.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $project: {
    _id: 0,
    orderId: "$_id",
    customer: "$customerId",
    totalCents: { $multiply: ["$amount", 100] },
    year: { $year: "$createdAt" }
  }}
])

A few things to keep in mind. Setting _id: 0 suppresses the default _id field. You can use expressions like $year, $concat and $multiply to derive new values. And you can rename fields by mapping a new name to an existing field path.

$project is also useful for trimming payload size. If your documents have 30 fields but the client needs 4, project early and save bandwidth.
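
If it helps to see the reshaping spelled out, here is what the $project stage above computes for a single document, written as plain JavaScript (an illustration of the semantics, not MongoDB internals):

```javascript
// Illustration only: what the $project example produces for one document.
const order = {
  _id: "o1",
  customerId: "c9",
  amount: 12.5,
  createdAt: new Date("2025-03-15"),
  status: "completed",
};

const projected = {
  orderId: order._id,                     // rename: _id -> orderId
  customer: order.customerId,             // rename: customerId -> customer
  totalCents: order.amount * 100,         // computed via $multiply
  year: order.createdAt.getUTCFullYear(), // computed via $year
  // _id: 0 means _id is simply not carried into the output
};

console.log(projected);
// { orderId: 'o1', customer: 'c9', totalCents: 1250, year: 2025 }
```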

3. $group — aggregate values

$group is where the real analytical power lives. It groups documents by a key and applies accumulator expressions to each group.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$customerId",
    orderCount: { $sum: 1 },
    totalSpent: { $sum: "$amount" },
    avgOrder: { $avg: "$amount" },
    lastOrder: { $max: "$createdAt" }
  }}
])

The _id field defines the grouping key. It can be a single field, a computed expression or an object for compound grouping.

| Accumulator | What it does | Example |
|---|---|---|
| $sum | Adds values or counts documents | { $sum: "$amount" } |
| $avg | Calculates the average | { $avg: "$rating" } |
| $min / $max | Finds the minimum or maximum | { $max: "$createdAt" } |
| $push | Collects values into an array | { $push: "$product" } |
| $addToSet | Collects unique values into an array | { $addToSet: "$category" } |
| $first / $last | Takes the first or last value in each group | { $first: "$name" } |

One gotcha: $group does not preserve document order within groups unless you $sort before it. If you need $first or $last to be meaningful, sort first.
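
A plain JavaScript sketch makes the gotcha visible — $first simply keeps whichever document reaches the group first, so the result depends entirely on input order (illustration only, not MongoDB internals):

```javascript
// Illustration only: $first keeps whatever arrives first per group key.
const docs = [
  { customerId: "a", product: "mug", createdAt: 2 },
  { customerId: "a", product: "shirt", createdAt: 1 },
];

const firstProductPerCustomer = (input) => {
  const first = new Map();
  for (const d of input) {
    if (!first.has(d.customerId)) first.set(d.customerId, d.product);
  }
  return first;
};

// Without sorting, "first" just means input order:
console.log(firstProductPerCustomer(docs).get("a")); // 'mug'

// Sort by createdAt first so "first" means "earliest":
const byDate = [...docs].sort((x, y) => x.createdAt - y.createdAt);
console.log(firstProductPerCustomer(byDate).get("a")); // 'shirt'
```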

4. $sort — order the results

$sort orders documents by one or more fields. Use 1 for ascending and -1 for descending.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$customerId",
    totalSpent: { $sum: "$amount" }
  }},
  { $sort: { totalSpent: -1 } }
])

When $sort is the first stage (or immediately follows a $match), MongoDB can use an index. Later in the pipeline, it becomes an in-memory sort, which has a 100 MB memory limit by default. For large result sets, you either need to set allowDiskUse: true or restructure the pipeline so the sort can use an index.

db.orders.aggregate([
  { $sort: { totalSpent: -1 } }
], { allowDiskUse: true })

You can sort by multiple fields — MongoDB applies them in order, so { status: 1, createdAt: -1 } sorts by status ascending first, then by date descending within each status group.
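
The same tie-breaking logic can be sketched with a plain JavaScript comparator (illustration only): compare on the first key, and only fall back to the second key when the first compares equal.

```javascript
// Illustration only: multi-field sort = first key, then next key on ties,
// like { status: 1, createdAt: -1 }.
const docs = [
  { status: "pending", createdAt: 3 },
  { status: "completed", createdAt: 1 },
  { status: "completed", createdAt: 2 },
];

docs.sort(
  (a, b) =>
    a.status.localeCompare(b.status) || // status ascending
    b.createdAt - a.createdAt           // createdAt descending within ties
);

console.log(docs.map((d) => [d.status, d.createdAt]));
// [ [ 'completed', 2 ], [ 'completed', 1 ], [ 'pending', 3 ] ]
```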

5. $lookup — join collections

$lookup performs a left outer join with another collection. This is the closest thing MongoDB has to SQL joins.

db.orders.aggregate([
  { $lookup: {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customerDetails"
  }}
])

The result adds an array field (customerDetails in this case) to each document. If no match is found, you get an empty array. If you expect a single match, you'll typically follow with an $unwind to flatten it.

db.orders.aggregate([
  { $lookup: {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customer"
  }},
  { $unwind: "$customer" }
])

For more complex join conditions, there's a pipeline form of $lookup that lets you run a sub-pipeline inside the join.

db.orders.aggregate([
  { $lookup: {
    from: "products",
    let: { productIds: "$items.productId" },
    pipeline: [
      { $match: { $expr: { $in: ["$_id", "$$productIds"] } } },
      { $project: { name: 1, price: 1 } }
    ],
    as: "productDetails"
  }}
])

This form is more flexible but watch the performance — sub-pipelines run for each input document.
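
To pin down the left-outer-join semantics from the first example, here is a plain JavaScript sketch (illustration only): every order is kept, and the joined field is an array that is simply empty when nothing matches.

```javascript
// Illustration only: $lookup as a left outer join.
const orders = [
  { _id: 1, customerId: "a" },
  { _id: 2, customerId: "zzz" }, // no matching customer
];
const customers = [{ _id: "a", name: "Ada" }];

const joined = orders.map((o) => ({
  ...o,
  // All matching customers, as an array field on the order.
  customerDetails: customers.filter((c) => c._id === o.customerId),
}));

console.log(joined[0].customerDetails); // [ { _id: 'a', name: 'Ada' } ]
console.log(joined[1].customerDetails); // []  (no match -> empty array)
```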

6. $unwind — flatten arrays

$unwind deconstructs an array field, outputting one document per array element. It's commonly used after $lookup or when you need to aggregate across array items.

db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
    _id: "$items.productId",
    totalQuantity: { $sum: "$items.quantity" },
    totalRevenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } }
  }},
  { $sort: { totalRevenue: -1 } }
])

By default, $unwind removes documents where the array is missing or empty. If you want to preserve them, use the expanded form.

{ $unwind: {
  path: "$items",
  preserveNullAndEmptyArrays: true
}}

Be careful with $unwind on large arrays — an order with 100 line items becomes 100 documents. That multiplication can blow up memory usage if you're not filtering or limiting beforehand.
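
Both behaviors — one document per element, and the empty-array handling — can be sketched in plain JavaScript (illustration only, not MongoDB internals):

```javascript
// Illustration only: $unwind emits one document per array element.
const orders = [
  { _id: 1, items: ["a", "b"] },
  { _id: 2, items: [] }, // dropped unless preserveNullAndEmptyArrays
];

const unwind = (docs, preserve = false) =>
  docs.flatMap((d) =>
    Array.isArray(d.items) && d.items.length
      ? d.items.map((item) => ({ _id: d._id, items: item }))
      : preserve
        ? [{ _id: d._id }] // kept, without the array field
        : []               // default: document disappears
  );

console.log(unwind(orders));
// [ { _id: 1, items: 'a' }, { _id: 1, items: 'b' } ]  -- order 2 is gone
console.log(unwind(orders, true).length); // 3 (order 2 preserved)
```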

7. $addFields — enrich without losing data

$addFields adds new fields to documents without removing existing ones. It's like $project, but non-destructive.

db.orders.aggregate([
  { $addFields: {
    itemCount: { $size: "$items" },
    isHighValue: { $gte: ["$amount", 1000] },
    dayOfWeek: { $dayOfWeek: "$createdAt" }
  }}
])

This is particularly useful in the middle of a pipeline when you need a computed field for a later stage but don't want to manually re-include every other field with $project.

You can also overwrite existing fields.

db.orders.aggregate([
  { $addFields: {
    amount: { $round: ["$amount", 2] }
  }}
])

$set (introduced in MongoDB 4.2) is an alias for $addFields; the two are functionally identical. Use whichever reads better in your context.

8. $facet — run multiple pipelines at once

$facet lets you run several sub-pipelines in parallel on the same set of input documents. Each sub-pipeline produces its own output field. This is perfect for dashboards where you need aggregated data and paginated results from the same query.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $facet: {
    summary: [
      { $group: {
        _id: null,
        totalOrders: { $sum: 1 },
        totalRevenue: { $sum: "$amount" },
        avgOrderValue: { $avg: "$amount" }
      }}
    ],
    topCustomers: [
      { $group: { _id: "$customerId", spent: { $sum: "$amount" } } },
      { $sort: { spent: -1 } },
      { $limit: 5 }
    ],
    recentOrders: [
      { $sort: { createdAt: -1 } },
      { $limit: 10 },
      { $project: { customerId: 1, amount: 1, createdAt: 1 } }
    ]
  }}
])

Each facet is independent. They share the same input but don't affect each other. The output is a single document with one field per facet.

One limitation — you can't use $out, $merge or a nested $facet inside a $facet. And because all sub-pipelines share the same input, make sure your initial $match is doing enough filtering.
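
The output shape is worth internalizing: one document, one array field per facet. Here is a plain JavaScript sketch (illustration only) of the summary and topCustomers facets from the example above:

```javascript
// Illustration only: each facet independently reduces the same input,
// and the result is one document with one array field per facet.
const input = [
  { customerId: "a", amount: 40 },
  { customerId: "b", amount: 25 },
  { customerId: "a", amount: 10 },
];

const spentByCustomer = new Map();
for (const d of input) {
  spentByCustomer.set(
    d.customerId,
    (spentByCustomer.get(d.customerId) ?? 0) + d.amount
  );
}

const result = {
  summary: [
    {
      _id: null,
      totalOrders: input.length,
      totalRevenue: input.reduce((s, d) => s + d.amount, 0),
    },
  ],
  topCustomers: [...spentByCustomer]
    .map(([_id, spent]) => ({ _id, spent }))
    .sort((a, b) => b.spent - a.spent)
    .slice(0, 5),
};

console.log(result.summary[0].totalRevenue); // 75
console.log(result.topCustomers[0]);         // { _id: 'a', spent: 50 }
```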

Performance tips

Getting a pipeline to return correct results is step one. Getting it to run fast is step two. Here are the things that matter most.

| Tip | Why it matters |
|---|---|
| Put $match first | Enables index usage and reduces documents flowing through later stages |
| Create compound indexes for $match + $sort | MongoDB can satisfy both in a single index scan |
| Use $project early to drop unused fields | Less data per document means less memory and faster processing |
| Set allowDiskUse: true for large sorts | Prevents failures when in-memory sort exceeds the 100 MB limit |
| Avoid $unwind on large arrays without filtering first | Each array element creates a new document — this multiplies quickly |
| Use explain() to inspect the pipeline plan | Shows whether indexes are used and where bottlenecks are |

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $sort: { createdAt: -1 } }
]).explain("executionStats")

The explain() output tells you if MongoDB used an index scan or a collection scan, how many documents were examined and how long each stage took.

Protecting your data with backups

Aggregation pipelines are read-only unless they end in $out or $merge — the stages covered here don't modify your data. But once you start building complex analytical workflows on top of MongoDB, the data itself becomes more valuable. A corrupted collection or an accidental drop() can wipe out months of carefully structured documents.

MongoDB backup is something worth setting up before you need it. One option is Databasus, a self-hosted tool for MongoDB backups. It supports scheduled logical backups with compression, multiple storage destinations like S3 and Google Drive, and retention policies — all through a UI that takes a few minutes to deploy with Docker.

Wrapping up

These 8 stages handle the vast majority of what you'll need from MongoDB's aggregation framework. $match and $project for filtering and shaping, $group for aggregation, $sort for ordering, $lookup and $unwind for joins and array handling, $addFields for enrichment and $facet for multi-output queries.

The key is stage ordering. Filter early, project what you need, aggregate, then sort. Most performance problems in pipelines come from doing these steps in the wrong order or skipping the filtering step entirely.

Start with simple pipelines and build up. The aggregation framework is deep — there are dozens of stages and hundreds of expressions beyond what's covered here — but these 8 will carry you through most real-world scenarios.
