Finny Collins

MongoDB aggregation pipeline — 8 stages you need to master

The aggregation pipeline is one of the most powerful features in MongoDB. It lets you transform, filter and analyze documents step by step — each stage takes the output of the previous one and passes the result forward. Think of it like a Unix pipe for your data.

If you've been relying on find() with simple queries, there's a good chance you're doing too much work in application code. The aggregation pipeline can handle most of that for you, and it does it closer to the data, which usually means faster.

This article walks through 8 stages that cover the vast majority of real-world use cases.

How the pipeline works

Before jumping into stages, it helps to understand the basic mechanics. An aggregation pipeline is an array of stage objects. MongoDB processes documents through each stage sequentially. The output of one stage becomes the input for the next.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } }
])

Each stage narrows, reshapes or enriches the data. The order matters — putting $match early reduces the number of documents later stages have to process.
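
To make the mechanics concrete, here is a plain JavaScript sketch of the pipeline above — this is an illustration of the data flow, not how MongoDB actually executes it. Each "stage" is a function from an array of documents to a new array, and running the pipeline is just folding the input through the stages in order.

```javascript
// Illustration only: each stage maps an array of documents to a new array.
const docs = [
  { status: "completed", customerId: "a", amount: 40 },
  { status: "pending", customerId: "a", amount: 10 },
  { status: "completed", customerId: "b", amount: 25 },
];

const match = (pred) => (input) => input.filter(pred);

const group = (keyFn, valFn) => (input) => {
  const totals = new Map();
  for (const d of input) {
    const k = keyFn(d);
    totals.set(k, (totals.get(k) ?? 0) + valFn(d));
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
};

const sortDesc = (field) => (input) =>
  [...input].sort((a, b) => b[field] - a[field]);

const pipeline = [
  match((d) => d.status === "completed"),
  group((d) => d.customerId, (d) => d.amount),
  sortDesc("total"),
];

// The output of each stage becomes the input of the next.
const result = pipeline.reduce((current, stage) => stage(current), docs);
console.log(result);
// [ { _id: 'a', total: 40 }, { _id: 'b', total: 25 } ]
```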

1. $match — filter documents early

$match filters documents, much like a find() query. It accepts standard query operators — $gt, $in, $regex and everything else you'd use in a regular query.

The most important thing about $match is placement. Always put it as early as possible. When $match is the first stage, MongoDB can use indexes. Push it further down the pipeline and you lose that optimization.

db.orders.aggregate([
  { $match: {
    createdAt: { $gte: ISODate("2025-01-01") },
    status: { $in: ["completed", "shipped"] }
  }}
])

This is not just a best practice — on large collections, the difference between an indexed $match at stage one and an unindexed filter at stage three can be orders of magnitude in execution time.

2. $project — reshape your documents

$project controls which fields appear in the output. You can include fields, exclude them, rename them or compute new ones.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $project: {
    _id: 0,
    orderId: "$_id",
    customer: "$customerId",
    totalCents: { $multiply: ["$amount", 100] },
    year: { $year: "$createdAt" }
  }}
])

A few things to keep in mind. Setting _id: 0 suppresses the default _id field. You can use expressions like $year, $concat and $multiply to derive new values. And you can rename fields by mapping a new name to an existing field path.

$project is also useful for trimming payload size. If your documents have 30 fields but the client needs 4, project early and save bandwidth.
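
If it helps to see the reshaping spelled out, here is what the $project stage above computes for a single document, written as plain JavaScript (an illustration of the semantics, not MongoDB internals):

```javascript
// Illustration only: what the $project example produces for one document.
const order = {
  _id: "o1",
  customerId: "c9",
  amount: 12.5,
  createdAt: new Date("2025-03-15"),
  status: "completed",
};

const projected = {
  orderId: order._id,                     // rename: _id -> orderId
  customer: order.customerId,             // rename: customerId -> customer
  totalCents: order.amount * 100,         // computed via $multiply
  year: order.createdAt.getUTCFullYear(), // computed via $year
  // _id: 0 means _id is simply not carried into the output
};

console.log(projected);
// { orderId: 'o1', customer: 'c9', totalCents: 1250, year: 2025 }
```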

3. $group — aggregate values

$group is where the real analytical power lives. It groups documents by a key and applies accumulator expressions to each group.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$customerId",
    orderCount: { $sum: 1 },
    totalSpent: { $sum: "$amount" },
    avgOrder: { $avg: "$amount" },
    lastOrder: { $max: "$createdAt" }
  }}
])

The _id field defines the grouping key. It can be a single field, a computed expression or an object for compound grouping.

| Accumulator | What it does | Example |
|---|---|---|
| $sum | Adds values or counts documents | { $sum: "$amount" } |
| $avg | Calculates the average | { $avg: "$rating" } |
| $min / $max | Finds the minimum or maximum | { $max: "$createdAt" } |
| $push | Collects values into an array | { $push: "$product" } |
| $addToSet | Collects unique values into an array | { $addToSet: "$category" } |
| $first / $last | Takes the first or last value in each group | { $first: "$name" } |

One gotcha: $group does not preserve document order within groups unless you $sort before it. If you need $first or $last to be meaningful, sort first.
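
A plain JavaScript sketch makes the gotcha visible — $first simply keeps whichever document reaches the group first, so the result depends entirely on input order (illustration only, not MongoDB internals):

```javascript
// Illustration only: $first keeps whatever arrives first per group key.
const docs = [
  { customerId: "a", product: "mug", createdAt: 2 },
  { customerId: "a", product: "shirt", createdAt: 1 },
];

const firstProductPerCustomer = (input) => {
  const first = new Map();
  for (const d of input) {
    if (!first.has(d.customerId)) first.set(d.customerId, d.product);
  }
  return first;
};

// Without sorting, "first" just means input order:
console.log(firstProductPerCustomer(docs).get("a")); // 'mug'

// Sort by createdAt first so "first" means "earliest":
const byDate = [...docs].sort((x, y) => x.createdAt - y.createdAt);
console.log(firstProductPerCustomer(byDate).get("a")); // 'shirt'
```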

4. $sort — order the results

$sort orders documents by one or more fields. Use 1 for ascending and -1 for descending.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
    _id: "$customerId",
    totalSpent: { $sum: "$amount" }
  }},
  { $sort: { totalSpent: -1 } }
])

When $sort is the first stage (or immediately follows a $match), MongoDB can use an index. Later in the pipeline, it becomes an in-memory sort, which has a 100 MB memory limit by default. For large result sets, you either need to set allowDiskUse: true or restructure the pipeline so the sort can use an index.

db.orders.aggregate([
  { $sort: { totalSpent: -1 } }
], { allowDiskUse: true })

You can sort by multiple fields — MongoDB applies them in order, so { status: 1, createdAt: -1 } sorts by status ascending first, then by date descending within each status group.
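
The same tie-breaking logic can be sketched with a plain JavaScript comparator (illustration only): compare on the first key, and only fall back to the second key when the first compares equal.

```javascript
// Illustration only: multi-field sort = first key, then next key on ties,
// like { status: 1, createdAt: -1 }.
const docs = [
  { status: "pending", createdAt: 3 },
  { status: "completed", createdAt: 1 },
  { status: "completed", createdAt: 2 },
];

docs.sort(
  (a, b) =>
    a.status.localeCompare(b.status) || // status ascending
    b.createdAt - a.createdAt           // createdAt descending within ties
);

console.log(docs.map((d) => [d.status, d.createdAt]));
// [ [ 'completed', 2 ], [ 'completed', 1 ], [ 'pending', 3 ] ]
```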

5. $lookup — join collections

$lookup performs a left outer join with another collection. This is the closest thing MongoDB has to SQL joins.

db.orders.aggregate([
  { $lookup: {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customerDetails"
  }}
])

The result adds an array field (customerDetails in this case) to each document. If no match is found, you get an empty array. If you expect a single match, you'll typically follow with an $unwind to flatten it.

db.orders.aggregate([
  { $lookup: {
    from: "customers",
    localField: "customerId",
    foreignField: "_id",
    as: "customer"
  }},
  { $unwind: "$customer" }
])

For more complex join conditions, there's a pipeline form of $lookup that lets you run a sub-pipeline inside the join.

db.orders.aggregate([
  { $lookup: {
    from: "products",
    let: { productIds: "$items.productId" },
    pipeline: [
      { $match: { $expr: { $in: ["$_id", "$$productIds"] } } },
      { $project: { name: 1, price: 1 } }
    ],
    as: "productDetails"
  }}
])

This form is more flexible but watch the performance — sub-pipelines run for each input document.
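
To pin down the left-outer-join semantics from the first example, here is a plain JavaScript sketch (illustration only): every order is kept, and the joined field is an array that is simply empty when nothing matches.

```javascript
// Illustration only: $lookup as a left outer join.
const orders = [
  { _id: 1, customerId: "a" },
  { _id: 2, customerId: "zzz" }, // no matching customer
];
const customers = [{ _id: "a", name: "Ada" }];

const joined = orders.map((o) => ({
  ...o,
  // All matching customers, as an array field on the order.
  customerDetails: customers.filter((c) => c._id === o.customerId),
}));

console.log(joined[0].customerDetails); // [ { _id: 'a', name: 'Ada' } ]
console.log(joined[1].customerDetails); // []  (no match -> empty array)
```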

6. $unwind — flatten arrays

$unwind deconstructs an array field, outputting one document per array element. It's commonly used after $lookup or when you need to aggregate across array items.

db.orders.aggregate([
  { $unwind: "$items" },
  { $group: {
    _id: "$items.productId",
    totalQuantity: { $sum: "$items.quantity" },
    totalRevenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } }
  }},
  { $sort: { totalRevenue: -1 } }
])

By default, $unwind removes documents where the array is missing or empty. If you want to preserve them, use the expanded form.

{ $unwind: {
  path: "$items",
  preserveNullAndEmptyArrays: true
}}

Be careful with $unwind on large arrays — an order with 100 line items becomes 100 documents. That multiplication can blow up memory usage if you're not filtering or limiting beforehand.
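
Both behaviors — one document per element, and the empty-array handling — can be sketched in plain JavaScript (illustration only, not MongoDB internals):

```javascript
// Illustration only: $unwind emits one document per array element.
const orders = [
  { _id: 1, items: ["a", "b"] },
  { _id: 2, items: [] }, // dropped unless preserveNullAndEmptyArrays
];

const unwind = (docs, preserve = false) =>
  docs.flatMap((d) =>
    Array.isArray(d.items) && d.items.length
      ? d.items.map((item) => ({ _id: d._id, items: item }))
      : preserve
        ? [{ _id: d._id }] // kept, without the array field
        : []               // default: document disappears
  );

console.log(unwind(orders));
// [ { _id: 1, items: 'a' }, { _id: 1, items: 'b' } ]  -- order 2 is gone
console.log(unwind(orders, true).length); // 3 (order 2 preserved)
```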

7. $addFields — enrich without losing data

$addFields adds new fields to documents without removing existing ones. It's like $project, but non-destructive.

db.orders.aggregate([
  { $addFields: {
    itemCount: { $size: "$items" },
    isHighValue: { $gte: ["$amount", 1000] },
    dayOfWeek: { $dayOfWeek: "$createdAt" }
  }}
])

This is particularly useful in the middle of a pipeline when you need a computed field for a later stage but don't want to manually re-include every other field with $project.

You can also overwrite existing fields.

db.orders.aggregate([
  { $addFields: {
    amount: { $round: ["$amount", 2] }
  }}
])

$set (introduced in MongoDB 4.2) is an alias for $addFields; the two are functionally identical. Use whichever reads better in your context.

8. $facet — run multiple pipelines at once

$facet lets you run several sub-pipelines in parallel on the same set of input documents. Each sub-pipeline produces its own output field. This is perfect for dashboards where you need aggregated data and paginated results from the same query.

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $facet: {
    summary: [
      { $group: {
        _id: null,
        totalOrders: { $sum: 1 },
        totalRevenue: { $sum: "$amount" },
        avgOrderValue: { $avg: "$amount" }
      }}
    ],
    topCustomers: [
      { $group: { _id: "$customerId", spent: { $sum: "$amount" } } },
      { $sort: { spent: -1 } },
      { $limit: 5 }
    ],
    recentOrders: [
      { $sort: { createdAt: -1 } },
      { $limit: 10 },
      { $project: { customerId: 1, amount: 1, createdAt: 1 } }
    ]
  }}
])

Each facet is independent. They share the same input but don't affect each other. The output is a single document with one field per facet.

One limitation — you can't use $out, $merge or a nested $facet inside a $facet. And because all sub-pipelines share the same input, make sure your initial $match is doing enough filtering.
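
The output shape is worth internalizing: one document, one array field per facet. Here is a plain JavaScript sketch (illustration only) of the summary and topCustomers facets from the example above:

```javascript
// Illustration only: each facet independently reduces the same input,
// and the result is one document with one array field per facet.
const input = [
  { customerId: "a", amount: 40 },
  { customerId: "b", amount: 25 },
  { customerId: "a", amount: 10 },
];

const spentByCustomer = new Map();
for (const d of input) {
  spentByCustomer.set(
    d.customerId,
    (spentByCustomer.get(d.customerId) ?? 0) + d.amount
  );
}

const result = {
  summary: [
    {
      _id: null,
      totalOrders: input.length,
      totalRevenue: input.reduce((s, d) => s + d.amount, 0),
    },
  ],
  topCustomers: [...spentByCustomer]
    .map(([_id, spent]) => ({ _id, spent }))
    .sort((a, b) => b.spent - a.spent)
    .slice(0, 5),
};

console.log(result.summary[0].totalRevenue); // 75
console.log(result.topCustomers[0]);         // { _id: 'a', spent: 50 }
```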

Performance tips

Getting a pipeline to return correct results is step one. Getting it to run fast is step two. Here are the things that matter most.

| Tip | Why it matters |
|---|---|
| Put $match first | Enables index usage and reduces documents flowing through later stages |
| Create compound indexes for $match + $sort | MongoDB can satisfy both in a single index scan |
| Use $project early to drop unused fields | Less data per document means less memory and faster processing |
| Set allowDiskUse: true for large sorts | Prevents failures when in-memory sort exceeds the 100 MB limit |
| Avoid $unwind on large arrays without filtering first | Each array element creates a new document — this multiplies quickly |
| Use explain() to inspect the pipeline plan | Shows whether indexes are used and where bottlenecks are |

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $sort: { createdAt: -1 } }
]).explain("executionStats")

The explain() output tells you if MongoDB used an index scan or a collection scan, how many documents were examined and how long each stage took.

Protecting your data with backups

Aggregation pipelines are read-only unless they end in $out or $merge — the stages covered here don't modify your data. But once you start building complex analytical workflows on top of MongoDB, the data itself becomes more valuable. A corrupted collection or an accidental drop() can wipe out months of carefully structured documents.

MongoDB backup is something worth setting up before you need it. One option is Databasus, a self-hosted tool for MongoDB backups. It supports scheduled logical backups with compression, multiple storage destinations like S3 and Google Drive, and retention policies — all through a UI that takes a few minutes to deploy with Docker.

Wrapping up

These 8 stages handle the vast majority of what you'll need from MongoDB's aggregation framework. $match and $project for filtering and shaping, $group for aggregation, $sort for ordering, $lookup and $unwind for joins and array handling, $addFields for enrichment and $facet for multi-output queries.

The key is stage ordering. Filter early, project what you need, aggregate, then sort. Most performance problems in pipelines come from doing these steps in the wrong order or skipping the filtering step entirely.

Start with simple pipelines and build up. The aggregation framework is deep — there are dozens of stages and hundreds of expressions beyond what's covered here — but these 8 will carry you through most real-world scenarios.
