DEV Community

Franck Pachot

Many-to-One: Stronger Relationship Design with MongoDB

In a relational database, a one-to-many relationship is typically implemented with two tables. The "one" side is the parent and has a primary key that is indexed to guarantee uniqueness. The "many" side is the child and references the parent’s primary key with a foreign key. An index is often added to the foreign key as well, so that operations on the parent can efficiently locate its child rows.

While this design seems straightforward and ideal from a database administrator’s point of view—where data integrity is the primary concern—it can surprise developers. This normalized data model does not account for access patterns or cardinalities: the same structure is used whether the "many" side contains millions of rows and keeps growing or only a few items.

MongoDB takes a different approach: its data model is optimized for a specific application, based on known access patterns and cardinalities. In a document model, a one-to-many relationship can be implemented either as multiple documents linked by references or as an embedded array or subdocument within a single document. In both cases, you can choose whether to embed or reference from the parent (the "one" side) or from the child (the "many" side).

An example

I use an HR dataset of employees, which I load into an "employees" collection of two million documents:

cd /var/tmp
## Download the HR dataset from Kaggle (https://www.kaggle.com/datasets/rohitgrewal/hr-data-mnc),
## then unzip it and import it into the "employees" collection
curl -L -o hr-data-mnc.zip https://www.kaggle.com/api/v1/datasets/download/rohitgrewal/hr-data-mnc &&
unzip -o hr-data-mnc.zip &&
mongoimport -d mdb -c employees --type=csv --headerline --drop 'HR_Data_MNC_Data Science Lovers.csv'

Once imported, I connect with mongosh and update the performance ratings to add random decimal digits, which makes individual values easier to distinguish when comparing results:

use mdb;

db.employees.updateMany( {}, [
  {
    $set: {
      Performance_Rating: {
        $add: [ { $toDouble: "$Performance_Rating" }, { $rand: {} } ]
      }
    }
  }
]);

This collection contains employees associated with a department name. I’ll add additional details about each department, such as a description, and then explore alternative models for this one-to-many relationship, where one department has many employees and each employee belongs to one department.

For each model, I'll look at the performance of the following query:

Let's identify the top 10 most outstanding active employees in the IT department and list their names along with their performance ratings.

Embed in the "many" side

The many-to-one relationship here is employee-to-department, as each employee has a department name:

  {  
    _id: ObjectId('693f1d61e235ef0960ae2b53'),  
    'Unnamed: 0': 22,  
    Employee_ID: 'EMP0000023',  
    Full_Name: 'James Valdez',  
    Department: 'R&D',  
    Job_Title: 'Research Scientist',  
    Hire_Date: '2017-10-30',  
    Location: 'East Scott, Mauritius',  
    Performance_Rating: 5.882713091486423,  
    Experience_Years: 7,  
    Status: 'Retired',  
    Work_Mode: 'On-site',  
    Salary_INR: 789283  
  }  

As we generally embed more than one field, I add a description and structure it as a sub-object:

db.employees.aggregate([  
  {  
    $addFields: {  
      Department: {  
        $switch: {  
          branches: [  
            { case: { $eq: ["$Department", "Finance"] }, then: { Name: "Finance", Description: "Manages the company’s budgets, expenses, and financial planning to ensure fiscal health." } },  
            { case: { $eq: ["$Department", "HR"] }, then: { Name: "HR", Description: "Handles recruitment, employee relations, and organizational development initiatives." } },  
            { case: { $eq: ["$Department", "IT"] }, then: { Name: "IT", Description: "Maintains technology infrastructure, software systems, and cybersecurity protections." } },  
            { case: { $eq: ["$Department", "Marketing"] }, then: { Name: "Marketing", Description: "Promotes the company’s products and services through strategic campaigns and market research." } },  
            { case: { $eq: ["$Department", "Operations"] }, then: { Name: "Operations", Description: "Oversees daily business activities, logistics, and process optimization for efficiency." } },  
            { case: { $eq: ["$Department", "R&D"] }, then: { Name: "R&D", Description: "Researches and develops innovative products and services to support future growth." } },  
            { case: { $eq: ["$Department", "Sales"] }, then: { Name: "Sales", Description: "Builds customer relationships and drives revenue through product and service sales." } }  
          ],  
          default: { Name: "$Department", Description: "No description available" }  
        }  
      }  
    }  
  },  
  {  
    $merge: {  
      into: "employees",   // same collection  
      whenMatched: "merge", // update existing docs  
      whenNotMatched: "fail"  
    }  
  }  
])  

The result for the same employee is:

  {
    _id: ObjectId('693f1d61e235ef0960ae2b53'),
    'Unnamed: 0': 22,
    Employee_ID: 'EMP0000023',
    Full_Name: 'James Valdez',
    Department: {
      Name: 'R&D',
      Description: 'Researches and develops innovative products and services to support future growth.'
    },
    Job_Title: 'Research Scientist',
    Hire_Date: '2017-10-30',
    Location: 'East Scott, Mauritius',
    Performance_Rating: 5.882713091486423,
    Experience_Years: 7,
    Status: 'Retired',
    Work_Mode: 'On-site',
    Salary_INR: 789283
  }

I want to retrieve the top 10 best-performing active employees from the IT department and display their names and performance ratings:

db.employees.find( 
 { "Status": "Active", "Department.Name": "IT" },
 { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10)

// result:

[
  { Full_Name: 'Stuart Lopez', Performance_Rating: 5.999973276392604 },
  { Full_Name: 'Mr. Ethan Morton', Performance_Rating: 5.9999561502903065 },
  { Full_Name: 'Lee White', Performance_Rating: 5.999935393136708 },
  { Full_Name: 'Amber Coleman', Performance_Rating: 5.999919949194189 },
  { Full_Name: 'Eugene Brown', Performance_Rating: 5.999917240114123 },
  { Full_Name: 'Nicole Edwards', Performance_Rating: 5.999914413630196 },
  { Full_Name: 'Erika Stewart', Performance_Rating: 5.999902351452448 },
  { Full_Name: 'Jenna King', Performance_Rating: 5.999896490219257 },
  { Full_Name: 'Douglas Hill', Performance_Rating: 5.999886177014563 },
  { Full_Name: 'Richard Gonzalez', Performance_Rating:  5.999879794558417 }
]

Since there is no index to support this query, it scans all two million documents, which takes 1.3 seconds:

x=db.employees.find( 
 { "Status": "Active", "Department.Name": "IT" },
 { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats");

printjson({  
  nReturned:           x.executionStats.nReturned,  
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined:   x.executionStats.totalKeysExamined,
  totalDocsExamined:   x.executionStats.totalDocsExamined,
});

// Execution statistics:

{
  nReturned: 10,
  executionTimeMillis: 1367,
  totalKeysExamined: 0,
  totalDocsExamined: 2000000
}

One benefit of embedding on the "many" side is that you can use all fields to create a compound index. For instance, I can build an index that supports my filter, sort, and projection needs:

db.employees.createIndex({
 "Status":1,               // for equality predicate on employee
 "Department.Name":1,      // for equality predicate on department
 "Performance_Rating": 1,  // for sort and limit (pagination)
 "Full_Name": 1,           // for projection (covering index)
})

The query now instantly retrieves the top 10 documents from the index:

x=db.employees.find( 
 { "Status": "Active", "Department.Name": "IT" },
 { _id: 0, "Full_Name": 1, "Performance_Rating": 1 }
).sort({ "Performance_Rating": -1 }).limit(10).explain("executionStats")

printjson({  
  nReturned:           x.executionStats.nReturned,  
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined:   x.executionStats.totalKeysExamined,
  totalDocsExamined:   x.executionStats.totalDocsExamined,
});

// Execution statistics:

{
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 0
}

Embedding data on the "many" side helps create optimized indexes and improves response times but involves duplicating data from the "one" side into the "many" side. In our example, each employee records the department name and description. This leads to two main effects:

  • Increased storage usage, which can be reduced through compression. You might also opt not to embed all fields—such as storing only the department name, which is often queried with the employee, and keeping the description in a separate "departments" collection.
  • Any update to the department information must be reflected across all associated employee records. This is generally manageable for infrequent changes, like name updates. Often, a department name change coincides with broader reorganizations, requiring employee record updates anyway.
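The second effect can be made concrete with a minimal in-memory sketch of the fan-out such an update implies. In a real deployment this would be a single db.employees.updateMany filtered on "Department.Name"; the documents and helper below are purely illustrative:

```javascript
// Illustrative in-memory stand-in for the "employees" collection, with the
// same embedded Department sub-object shape as in the article.
const employees = [
  { Employee_ID: "EMP1", Department: { Name: "IT", Description: "old text" } },
  { Employee_ID: "EMP2", Department: { Name: "IT", Description: "old text" } },
  { Employee_ID: "EMP3", Department: { Name: "HR", Description: "hr text" } },
];

// Changing a department's description must touch every employee embedding it:
// the write cost grows with the number of children on the "many" side.
function updateDepartmentDescription(docs, name, description) {
  let modifiedCount = 0;
  for (const doc of docs) {
    if (doc.Department.Name === name) {
      doc.Department.Description = description;
      modifiedCount++;
    }
  }
  return modifiedCount; // analogous to updateMany's modifiedCount
}

console.log(updateDepartmentDescription(employees, "IT", "new text")); // 2
```

With two million employee documents, the real updateMany would rewrite one document per IT employee, which is why such denormalization suits infrequently changing fields.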

Reference and lookup from the "many" side

To minimize duplication, I create a separate "departments" collection using the unique department names and descriptions I embedded, ensuring each department's information is stored only once:

db.employees.aggregate([  
  {  
    $group: {  
      _id: "$Department.Name",  
      Name: { $first: "$Department.Name" },  
      Description: { $first: "$Department.Description" }  
    }  
  },  
  {  
    $project: { _id: 0, Name: 1, Description: 1 }
  },  
  {  
    $merge: {  
      into: "departments",  
      whenMatched: "keepExisting",  
      whenNotMatched: "insert"  
    }  
  }  
]);  

You might be surprised by the speed of this aggregation pipeline. Instead of scanning all documents, MongoDB efficiently retrieved the unique departments by searching for distinct values in the index (a loose index scan).

Then, I can substitute the "Department" sub-object with a reference to the "_id" from the "departments" collection:

db.employees.aggregate([  
  {  
    $lookup: {  
      from: "departments",  
      localField: "Department.Name",  
      foreignField: "Name",  
      as: "deptInfo"  
    }  
  },  
  {  
    $addFields: {  
      Department: { $arrayElemAt: ["$deptInfo._id", 0] }
    }  
  },  
  { $project: { deptInfo: 0 } },  
  {  
    $merge: {  
      into: "employees",  
      whenMatched: "merge",  
      whenNotMatched: "fail"  
    }  
  }  
]);  

Here is the shape of an employee document with a single field for the department:

  {
    _id: ObjectId('693f1d61e235ef0960ae2b53'),
    'Unnamed: 0': 22,
    Employee_ID: 'EMP0000023',
    Full_Name: 'James Valdez',
    Department: ObjectId('693f2e38c2dd5ab4fbfd73b8'),
    Job_Title: 'Research Scientist',
    Hire_Date: '2017-10-30',
    Location: 'East Scott, Mauritius',
    Performance_Rating: 5.882713091486423,
    Experience_Years: 7,
    Status: 'Retired',
    Work_Mode: 'On-site',
    Salary_INR: 789283
  }

To find the top 10 highest-performing active employees in the IT department and display their names and ratings, I join the "departments" collection using $lookup. Since $lookup incurs a cost for each input document, it's more efficient to filter beforehand. Therefore, I first apply $match to filter employees by status, then perform the $lookup through the reference to "departments", and finally filter on the department name fetched from the foreign collection:

x=db.employees.aggregate([  
  { $match: { Status: "Active" } },  
  {  
    $lookup: {  
      from: "departments",  
      localField: "Department", 
      foreignField: "_id", 
      as: "deptInfo"  
    }  
  },  
  { $unwind: "$deptInfo" },  
  { $match: { "deptInfo.Name": "IT" } },  
  {  
    $project: {  
      _id: 0,  
      Full_Name: 1,  
      Performance_Rating: 1  
    }  
  },  
  { $sort: { Performance_Rating: -1 } },  
  { $limit: 10 }  
]).explain("executionStats")
print(x.stages[1])

// Execution plan for the lookup stage:

{
  '$lookup': {
    from: 'departments',
    as: 'deptInfo',
    localField: 'Department',
    foreignField: '_id',
    let: {},
    pipeline: [
      { '$match': { Name: { '$eq': 'IT' } } }
    ],
    unwinding: { preserveNullAndEmptyArrays: false }
  },
  totalDocsExamined: Long('1401558'),
  totalKeysExamined: Long('1401558'),
  collectionScans: Long('0'),
  indexesUsed: [ '_id_' ],
  nReturned: Long('421333'),
  executionTimeMillisEstimate: Long('94596')
}

I printed the statistics for the $lookup stage, which read 1,401,558 documents, one per active employee. Each access was quick, thanks to the index on "_id", but executing it 1.4 million times took over a minute. It returned only 421,333 documents because the query planner pushed the department name filter down from the subsequent $unwind and $match stages into the $lookup pipeline. The main issue remains reading the same few department documents over a million times.
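A quick back-of-the-envelope check on those statistics shows that each indexed point read is cheap on its own; the cost comes purely from repetition (the numbers are taken from the execution plan above):

```javascript
// Per-document cost of the $lookup stage, from the execution statistics:
// ~1.4M indexed point reads on "departments" in ~94.6 seconds.
const docsExamined = 1401558;   // totalDocsExamined
const totalMillis  = 94596;     // executionTimeMillisEstimate
const microsPerLookup = (totalMillis / docsExamined) * 1000;
console.log(microsPerLookup.toFixed(1)); // ~67.5 microseconds per point read
```

At tens of microseconds per read, no single lookup is slow; the design is, because it multiplies that cost by the cardinality of the "many" side.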

Two queries instead of lookup

If you have a many-to-one relationship and numerous documents on the many side, it's better to join from the application rather than using an aggregation pipeline lookup.

I retrieve the "_id" of the department I am interested in:


var itDeptId = db.departments.findOne({ Name: "IT" })._id;


If the department names are unlikely to change, this can be run once and stored in the application's cache.
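A minimal sketch of such a cache, assuming a generic accessor: "fetchDeptId" below is a placeholder for the real read (e.g. db.departments.findOne({ Name: name })._id), not a driver API, and is kept synchronous for brevity:

```javascript
// Cache department _ids by name so the "one"-side query runs only once per
// name; later requests are served from memory.
function makeDeptIdCache(fetchDeptId) {
  const cache = new Map();
  return function deptId(name) {
    if (!cache.has(name)) {
      cache.set(name, fetchDeptId(name)); // first access hits the database
    }
    return cache.get(name); // subsequent accesses are memory lookups
  };
}

// Hypothetical usage:
// const deptId = makeDeptIdCache((name) => db.departments.findOne({ Name: name })._id);
// deptId("IT"); // queries once, then serves the cached ObjectId
```

With only seven departments, the whole lookup table can live in application memory for the lifetime of the process.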

Then I can query employees filtered by the referenced "_id", so I create an index to support it:

db.employees.createIndex(  
  { Status: 1, Department: 1, Performance_Rating: -1, Full_Name: 1 }  
)  

Since I'm querying just a single collection, I don't require an aggregation pipeline:

x=db.employees.find(  
  {   
    Status: "Active",   
    Department: itDeptId   
  },  
  {   
    _id: 0,   
    Full_Name: 1,  
    Performance_Rating: 1  
  }  
).sort({ Performance_Rating: -1 }).limit(10).explain("executionStats");  

printjson({  
  nReturned:           x.executionStats.nReturned,  
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined:   x.executionStats.totalKeysExamined,
  totalDocsExamined:   x.executionStats.totalDocsExamined,
});

// Execution statistics:

{
  nReturned: 10,
  executionTimeMillis: 0,
  totalKeysExamined: 10,
  totalDocsExamined: 0
}

Finally, when the application accesses the lookup table first, I observe the same performance with a many-to-one reference as with embedding the single item into the many.

Reference from the "many" side, but lookup from the "one" side

Instead of two application queries, I can perform the same operation in a single aggregation pipeline that starts from the departments and joins the employees via a lookup. I use a lookup pipeline to add the filter on active employees:

x=db.departments.aggregate([  
  {  
    $match: { Name: "IT" } 
  },  
  {  
    $lookup: {  
      from: "employees",  
      let: { deptId: "$_id" },
      pipeline: [  
        {  
          $match: {  
            $expr: {  
              $and: [  
                { $eq: ["$Department", "$$deptId"] },  
                { $eq: ["$Status", "Active"] }
              ]  
            }  
          }  
        },  
        {  
          $sort: { Performance_Rating: -1 }  
        },  
        {  
          $limit: 10  
        },  
        {  
          $project: {  
            _id: 0,  
            Full_Name: 1,  
            Performance_Rating: 1  
          }  
        }  
      ],  
      as: "employees"  
    }  
  },  
  {  
    $project: {  
      _id: 0,  
      Name: 1,  
      employees: 1  
    }  
  }  
]).explain("executionStats")

print(x.stages[1])

// Execution plan for the lookup stage:

{
  '$lookup': {
    from: 'employees',
    as: 'employees',
    let: { deptId: '$_id' },
    pipeline: [
      {
        '$match': { '$expr': { '$and': [ [Object], [Object] ] } }
      },
      { '$sort': { Performance_Rating: -1 } },
      { '$limit': 10 },
      { '$project': { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
    ]
  },
  totalDocsExamined: Long('10'),
  totalKeysExamined: Long('10'),
  collectionScans: Long('0'),
  indexesUsed: [ 'Status_1_Department_1_Performance_Rating_-1_Full_Name_1' ],
  nReturned: Long('1'),
  executionTimeMillisEstimate: Long('1')
}

The result arrives in one millisecond, reading only what is necessary, because the lookup pipeline can use the index on the equality predicates and sort field. The execution plan shows that all operations on employees have been pushed down into the lookup pipeline: $match, $sort, $limit, and $project. When dealing with a one-to-many relationship implemented with references, where the "many" side holds thousands of documents, it's better to start from the "one" side.

Reference from the "one" side

To explore all possibilities, I can also store an array of employee references in each department document:

db.employees.aggregate([  
  {  
    $group: {  
      _id: "$Department",  
      employees: { $push: "$_id" }  
    }  
  },  
  {  
    $merge: {  
      into: "departments",  
      on: "_id",
      whenMatched: [  
        { $set: { employees: "$$new.employees" } }
      ],  
      whenNotMatched: "discard"  
    }  
  }  
]);  


Here is what the documents look like:

{  
  "_id": ObjectId("693f2e38c2dd5ab4fbfd73b8"),  
  "Name": "IT",  
  "Description": "Maintains technology infrastructure, software systems, and cybersecurity protections.",  
  "employees": [  
    ObjectId("693f1d61e235ef0960ae2b52"),  
    ObjectId("693f1d61e235ef0960ae2b99"),  
    ObjectId("693f1d61e235ef0960ae2b23"),  
    ... (601,042 in total)
  ]  
}  

For completeness, I examine this option as well. Generally, an array with hundreds of thousands of items that keeps growing is not advisable. Here, however, even the largest department (about 600,000 references, roughly 12 MB) still fits within the 16 MB document limit:

db.departments.aggregate([  
  { $project: {  _id: 0, Name: 1, docSize: { $bsonSize: "$$ROOT" }, EmployeesCount: { $size: "$employees" }}  
  }  
]);

// result:

[
  { Name: 'Finance', docSize: 3886514, EmployeesCount: 199873 },
  { Name: 'HR', docSize: 3071423, EmployeesCount: 159119 },
  { Name: 'IT', docSize: 11909884, EmployeesCount: 601042 },
  { Name: 'Marketing', docSize: 4690681, EmployeesCount: 240081 },
  { Name: 'Operations', docSize: 5890954, EmployeesCount: 300095 },
  { Name: 'R&D', docSize: 1884463, EmployeesCount: 99759 },
  { Name: 'Sales', docSize: 7889665, EmployeesCount: 400031 }
]

I create the indexes needed for this lookup: one on employees to filter by status and sort by rating, and one on departments to start from the department name:


db.employees.createIndex({ Status: 1, Performance_Rating: -1 });

db.departments.createIndex({ Name: 1 });


The query begins with departments and performs a lookup using an array of employee identifiers:

x=db.departments.aggregate([  
  { $match: { Name: "IT" } },  
  {  
    $lookup: {  
      from: "employees",  
      let: { empIds: "$employees" },  
      pipeline: [  
        {  
          $match: {  
            $expr: {  
              $and: [  
                { $in: ["$_id", "$$empIds"] },  
                { $eq: ["$Status", "Active"] }  
              ]  
            }  
          }  
        },  
        { $sort: { Performance_Rating: -1 } },  
        { $limit: 10 },  
        { $project: { _id: 0, Full_Name: 1, Performance_Rating: 1 } }  
      ],  
      as: "employees"  
    }  
  },  
  {  
    $project: { _id: 0, Name: 1, employees: 1 }  
  }  
]).explain("executionStats")

printjson(x.stages[1])

// Execution plan for the lookup stage:

{
  '$lookup': {
    from: 'employees',
    as: 'employees',
    let: { empIds: '$employees' },
    pipeline: [
      {
        '$match': { '$expr': { '$and': [ 
                                  { '$in': [ '$_id', '$$empIds' ]  },
                                  { '$eq': [ '$Status', 'Active' ] } 
              ] } }
      },
      { '$sort': { Performance_Rating: -1 } },
      { '$limit': 10 },
      { '$project': { _id: 0, Full_Name: 1, Performance_Rating: 1 } }
    ]
  },
  totalDocsExamined: Long('32'),
  totalKeysExamined: Long('32'),
  collectionScans: Long('0'),
  indexesUsed: [ 'Status_1_Performance_Rating_-1' ],
  nReturned: Long('1'),
  executionTimeMillisEstimate: Long('460')
}

Performing a lookup from an array with hundreds of thousands of entries can be slow, but it improves significantly when all filters and pagination are pushed down to the lookup pipeline. In this case, it took half a second to read the "IT" department document with its list of employee references and then examine only 32 index entries. Since the index is ordered by rating, the scan stopped as soon as the top 10 active employees of the department were found.

Embed in the "one" side

Embedding full employee documents into departments is unlikely to be effective in this case. One reason is the size constraint: the array of ObjectIds alone already approaches the document size limit, so adding more data per employee isn't practical. Another reason is efficiency: with embedded employees, accessing a department means reading the entire document, even if only one employee's information is needed.

This method makes sense only if departments are accessed without needing individual employee details, allowing a covering index to skip reading the document, or if all employees are always required. Multi-key indexes can't prevent a document from being fetched.

Understanding use cases and expected cardinalities is crucial for data modeling. I'm not interested in all employees within the departments, only the top performers. I could keep an array of just the top ten active employees, or store employees with a performance score above 5.9, ensuring there aren't too many but enough to identify the top ten. I build such an array to embed in departments:

db.departments.aggregate([  
  {  
    $lookup: {  
      from: "employees",  
      let: { deptId: "$_id" },  
      pipeline: [  
        {  
          $match: {  
            $expr: {  
              $and: [  
                { $eq: ["$Department", "$$deptId"] },  
                { $gt: ["$Performance_Rating", 5.9] }  
              ]  
            }  
          }  
        },  
        { $sort: { Performance_Rating: -1 } },  
        {  
          $project: {  
            _id: 1,  
            Full_Name: 1,  
            Performance_Rating: 1,  
            Status: 1  
          }  
        }  
      ],  
      as: "Top_Employees"  
    }  
  },  
  {  
    $project: {  
      _id: 1,  
      Name: 1,  
      Description: 1,  
      Top_Employees: 1  
    }  
  },  
  {  
    $merge: {  
      into: "departments",  
      whenMatched: "replace",  
      whenNotMatched: "insert"  
    }  
  }  
]);  

To query it, I don't depend on the array's order. Instead, I filter the array on the employee status, sort it by rating, and slice the top 10:

x=db.departments.aggregate([  
  { $match: { Name: "IT" } },  
  {  
    $project: {  
      _id: 0,  
      Department: { Name: "$Name" },  
      Employee: {  
        $slice: [  
          {  
            $sortArray: {  
              input: {  
                $filter: {  
                  input: "$Top_Employees",  
                  as: "emp",  
                  cond: { $eq: ["$$emp.Status", "Active"] }
                }  
              },  
              sortBy: { Performance_Rating: -1 }  
            }  
          },  
          10  
        ]  
      }  
    }  
  }  
]).explain("executionStats");  

printjson({  
  nReturned:           x.executionStats.nReturned,  
  executionTimeMillis: x.executionStats.executionTimeMillis,
  totalKeysExamined:   x.executionStats.totalKeysExamined,
  totalDocsExamined:   x.executionStats.totalDocsExamined,
});

// Execution statistics:

{
  nReturned: 1,
  executionTimeMillis: 47,
  totalKeysExamined: 1,
  totalDocsExamined: 1
}

This query is fast. Of course, it is useful only if the array is maintained: any change in an employee's rating must be reflected accordingly, pushing the employee into the array when the rating exceeds 5.9 and removing them when it falls below. This offers a good balance between embedding all data and referencing everything.
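The maintenance rule can be sketched as a small helper that the application (or a change-stream listener) would run on each rating update. The threshold and field names follow the example above; the in-memory array stands in for the department document's "Top_Employees":

```javascript
// Keep a department's Top_Employees array in sync with rating changes:
// entries above the threshold are kept, others are removed.
const THRESHOLD = 5.9;

function applyRatingChange(topEmployees, employee) {
  // Drop any stale entry for this employee.
  const kept = topEmployees.filter((e) => e._id !== employee._id);
  // Re-add only if the new rating clears the threshold.
  if (employee.Performance_Rating > THRESHOLD) {
    kept.push({
      _id: employee._id,
      Full_Name: employee.Full_Name,
      Performance_Rating: employee.Performance_Rating,
      Status: employee.Status,
    });
  }
  return kept;
}
```

In MongoDB this maps naturally to a $pull followed by a conditional $push on the department document, ideally applied in the same transaction as the rating update.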

Conclusion

By testing multiple schema designs against a 2‑million‑document HR dataset, we saw how MongoDB’s flexibility in modeling one‑to‑many relationships directly affects query speed, index usage, and document size.

What we learned:

  1. Review execution plans together with runtime statistics to accurately understand the true cost of each query.
  2. Design indexes to match the query pattern, including the lookup predicates: start with equality filters, then add sort or range fields, and finally include projected columns, aiming for a covering index when possible.
  3. Embed small, frequently co-accessed data—especially the fields used in selective filters. For large or volatile data, use references instead to prevent bloat or duplication, and fetch it per document when needed.
  4. Avoid using $lookup on huge sets. Instead, consider
    • Starting the aggregation from the "one" side and pushing filters and limits into the $lookup pipeline.
    • Using multiple single-collection queries, applying filters on the "one" collection to retrieve the referenced identifiers. Then, use those identifiers when querying the "many" collection, where the referencing field is indexed.
  5. Be cautious with large arrays and deep embedding: they can still query fast with well‑designed indexes, but they increase document size and update cost.
  6. Aim for covering indexes where possible, especially when embedding the "one" side into the "many" side, to enable index‑only scans, or at least filtering before fetching.

Final Thought:

In MongoDB, there is no single “best” way to model one-to-many relationships. The right choice depends on your read/write patterns, data volume, and how often your data changes. When your schema matches your queries and is backed by good indexes, both embedding and referencing can be fast. Check the execution plan to validate it. Performance is proactive with MongoDB, with a data model optimized for the application.

Relational databases may look simpler because they use a single normalized model for one-to-many relationships, but that changes once you use an ORM or code complex queries. You still need to tune the fetch strategy—choosing what to load with joins and what to fetch with separate queries. With SQL databases, you may also duplicate and denormalize data so that all filters and sorts can use a single compound index. Performance is often reactive with SQL databases, creating indexes when encountering slow queries.

Finally, SQL lets you start modeling before you fully understand how the application will use the data, at the cost of tuning the model later when performance issues arise. MongoDB, by contrast, encourages stronger schema design for relationships once your access patterns and cardinality are well understood.
