Mohamed Idris

Posted on May 21

Learning MongoDB As If You Built It Yourself

#mongodb #database #learning

If you have ever tried to fit a messy real world thing into rigid SQL tables, you remember the feeling. A user has zero, one, or many addresses. Each address has a label, sometimes a unit number, sometimes notes. So you make an addresses table. Now every read needs a join. Then a customer feature ships that needs "favorite delivery instructions per address per day of the week", and your diagram grows three more boxes.

The shape of your data changed faster than your schema could keep up. Every change meant a migration, a deploy, and a small panic.

That is the gap MongoDB fills.

What is MongoDB, really

Think of MongoDB as a giant filing cabinet of folders. Each folder is a complete dossier on one thing. A user folder contains the user's name, their preferences, all their addresses, even their three latest orders if you want them right there. You do not have to open six folders and staple them together to learn about the user. Everything that belongs together lives together.

The folders are JSON like documents. The cabinet drawers are called collections. The cabinet itself is the database. There are no tables, no rigid columns, and no joins required for the common case. You ask for the folder, you get the folder.

The price for that flexibility is responsibility: the database is not going to enforce your schema for you unless you ask it to, and if you store data sloppily, you will read it back sloppily.

That is the whole vibe.

Let's pretend we are building one

We want a database that stores data the way our app already thinks about it (as nested objects), scales out across many machines, and lets us evolve schemas without crying. We will call it MongoDB.

For the running example, we are building a tiny recipe app. Recipes have ingredients, steps, tags, comments. Perfect for showing off documents.

Decision 1: The unit of data is a document, not a row

A document is a JSON object. (Internally, it is BSON, a binary form of JSON that adds types like Date, ObjectId, Decimal128, and binary blobs.) It can have arrays. It can have nested objects. It can be thirty fields deep. It is not a row.

{
  _id:    ObjectId("65fa1c..."),
  title:  "Mochi Cookies",
  author: "Mia",
  tags:   ["dessert", "japanese", "easy"],
  ingredients: [
    { name: "rice flour", amount: 200, unit: "g" },
    { name: "sugar",      amount: 80,  unit: "g" },
    { name: "milk",       amount: 250, unit: "ml" }
  ],
  steps: [
    "Mix dry ingredients.",
    "Add milk slowly while whisking.",
    "Cook on medium heat until smooth."
  ],
  prepMinutes: 20,
  createdAt:   ISODate("2026-04-01T10:00:00Z")
}

A few things worth pausing on:

Every document has an _id. If you do not provide one, the driver creates an ObjectId for you. It is roughly sortable by creation time (one second resolution) and designed to be unique across the cluster.
Schemas are flexible. Two recipes in the same collection do not need identical fields.
Arrays are first class. No "join table" needed for tags, steps, or ingredients on a recipe.
There is no enforced schema by default. This is a feature and a footgun. We will add validation in a moment.
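
The time ordering comes from the ObjectId layout: the first 4 of its 12 bytes are a Unix timestamp in seconds. A minimal sketch in plain JavaScript, assuming the id arrives as a 24 character hex string. objectIdToDate is a hypothetical helper and the sample id is made up for illustration:

```javascript
// Hypothetical helper: read the creation time embedded in an ObjectId.
// The first 8 hex characters (4 bytes) are big-endian seconds since the Unix epoch.
function objectIdToDate(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new TypeError("expected a 24 character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Made-up id, for illustration only.
const sampleId = "65fa1c000000000000000000";
console.log(objectIdToDate(sampleId).toISOString());
```

With the official drivers and mongosh you would normally call getTimestamp() on the ObjectId rather than parse it by hand.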

Decision 2: Four verbs, just like SQL

The mental model is familiar. We just talk to documents instead of rows.

Insert

db.recipes.insertOne({
  title: "Mochi Cookies",
  author: "Mia",
  tags: ["dessert", "japanese"],
  prepMinutes: 20,
  createdAt: new Date(),
});

db.recipes.insertMany([
  { title: "Miso Soup",  prepMinutes: 10 },
  { title: "Onigiri",    prepMinutes: 15 },
]);

Find (the workhorse)

db.recipes.find({ tags: "dessert" });
db.recipes.findOne({ _id: ObjectId("65fa1c...") });
db.recipes.find({}, { title: 1, prepMinutes: 1 }); // projection: only those fields (plus _id)
db.recipes.find({ prepMinutes: { $lte: 15 } })
          .sort({ prepMinutes: 1 })
          .limit(10);

find returns a cursor. You stream from it. findOne returns one document or null.

Update

db.recipes.updateOne(
  { _id: ObjectId("65fa1c...") },
  { $set: { prepMinutes: 18 } }
);

db.recipes.updateMany(
  { tags: "japanese" },
  { $addToSet: { tags: "asian" } }
);

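Notice the dollar prefixed operators. They describe what to do to the document; they are not the literal new document. If you write { $set: { x: 1 } }, you set field x and leave everything else alone. If you forget the operator and write just { x: 1 }, current versions of updateOne and updateMany reject the update, but replaceOne and the legacy update() helper will happily replace the entire document with { x: 1 }. Yes, that bug stings the first time.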
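
One cheap defence is to check an update document before it gets anywhere near the database. The sketch below is plain JavaScript with no driver involved. assertOperatorUpdate is a hypothetical helper, not part of any MongoDB API, and it only handles object style updates (not aggregation pipeline updates, which are arrays):

```javascript
// Hypothetical guard: accept an update document only if every top-level
// key is an update operator such as $set or $inc.
function assertOperatorUpdate(update) {
  const keys = Object.keys(update);
  if (keys.length === 0) {
    throw new Error("update document is empty");
  }
  const plain = keys.filter((key) => !key.startsWith("$"));
  if (plain.length > 0) {
    throw new Error(`missing update operator for: ${plain.join(", ")}`);
  }
  return update;
}

assertOperatorUpdate({ $set: { prepMinutes: 18 } }); // passes

try {
  assertOperatorUpdate({ prepMinutes: 18 }); // the classic mistake
} catch (err) {
  console.log(err.message); // missing update operator for: prepMinutes
}
```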

Delete

db.recipes.deleteOne({ _id: ObjectId("65fa1c...") });
db.recipes.deleteMany({ tags: "spam" });

Decision 3: Query operators, the dollar sign menu

Inside a query, MongoDB has a small language of operators. The greatest hits:

db.recipes.find({
  prepMinutes: { $gt: 10, $lte: 30 },
  tags:        { $in: ["dessert", "snack"] },
  author:      { $ne: "Anonymous" },
  createdAt:   { $gte: ISODate("2026-01-01") },
  notes:       { $exists: true },
});

The comparison operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin.

Logical operators (when AND of fields is not enough):

db.recipes.find({
  $or: [
    { prepMinutes: { $lt: 5 } },
    { tags: "easy" }
  ]
});

Querying inside arrays is shockingly useful:

// any ingredient with name "sugar"
db.recipes.find({ "ingredients.name": "sugar" });

// at least one ingredient with name "sugar" AND amount > 50, on the same element
db.recipes.find({
  ingredients: { $elemMatch: { name: "sugar", amount: { $gt: 50 } } }
});

// a recipe with all of these tags
db.recipes.find({ tags: { $all: ["easy", "dessert"] } });
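
The difference between dotted conditions and $elemMatch is easy to lose. Here is a rough emulation in plain JavaScript (no database involved, invented data, made up function names) of how the two forms treat an array of ingredients:

```javascript
const recipe = {
  title: "Mochi Cookies",
  ingredients: [
    { name: "rice flour", amount: 200 },
    { name: "sugar", amount: 30 },
  ],
};

// Like { "ingredients.name": "sugar", "ingredients.amount": { $gt: 50 } }:
// each condition may be satisfied by a different array element.
function matchesAcrossElements(doc) {
  return (
    doc.ingredients.some((i) => i.name === "sugar") &&
    doc.ingredients.some((i) => i.amount > 50)
  );
}

// Like { ingredients: { $elemMatch: { name: "sugar", amount: { $gt: 50 } } } }:
// one single element has to satisfy both conditions.
function matchesSameElement(doc) {
  return doc.ingredients.some((i) => i.name === "sugar" && i.amount > 50);
}

console.log(matchesAcrossElements(recipe)); // true (sugar exists, rice flour is over 50)
console.log(matchesSameElement(recipe));    // false (the sugar element is only 30)
```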

Update operators are a parallel menu:

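// Separate calls on purpose: MongoDB rejects a single update that
// touches the same field (tags here) with more than one operator.
db.recipes.updateOne({ _id: id }, { $set: { prepMinutes: 25 }, $inc: { views: 1 } });
db.recipes.updateOne({ _id: id }, { $push:     { tags: "popular" } });
db.recipes.updateOne({ _id: id }, { $addToSet: { tags: "popular" } }); // push but only if not already there
db.recipes.updateOne({ _id: id }, { $pull:     { tags: "draft" } });
db.recipes.updateOne({ _id: id }, {
  $unset:  { temporaryNote: "" },
  $rename: { oldField: "newField" }
});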

If you remember three operators, remember $set, $inc, and $addToSet. You will reach for them every day.

Decision 4: Schema design, the part that actually matters

Here is the truth that catches every newcomer: in MongoDB, schema design is 90% of your job, even though there is no enforced schema. The flexibility is real, but the wrong shape is still going to hurt you.

The single most important decision is embed or reference.

Embed when data is owned and read together

A recipe owns its ingredients. They are not shared with other recipes. They are read every time you read the recipe. So embed them right inside the recipe document.

{
  title: "Mochi Cookies",
  ingredients: [
    { name: "rice flour", amount: 200, unit: "g" },
    { name: "sugar",      amount: 80,  unit: "g" }
  ]
}

One read, all the data. No joins. This is MongoDB's superpower.

Reference when data is shared or grows without bound

If a recipe is part of a cookbook, and a cookbook can have hundreds of recipes, you do not embed all the recipes inside the cookbook. You store the recipe ids:

// cookbooks (ids shortened for readability; real ObjectIds are 24 hex characters)
{ _id: ObjectId("c1"), title: "Tiny Asian Bites", recipeIds: [ObjectId("r1"), ObjectId("r2")] }

// recipes
{ _id: ObjectId("r1"), title: "Mochi Cookies", ... }

Then you fetch the recipes separately (or with $lookup, see the aggregation pipeline soon).

The rule of thumb

One to few: embed. (A user with 3 addresses.)
One to many but bounded and small: embed. (A blog post with 50 comments? Probably embed. With 5,000? Reference.)
One to many, unbounded or shared: reference.
Many to many: reference, with arrays of ids on one side or both.

There is also a maximum document size: 16 MB. If a document is approaching that limit, your model is wrong.

A few more design patterns that show up in production:

Bucket pattern: instead of one document per sensor reading per second, store one document per minute (or hour) holding an array of readings. Way fewer documents, better cache use, much faster aggregations.
Computed pattern: precompute aggregates and store them on the parent (commentCount, totalSpentCents). Update them on writes. The read becomes free.
Extended reference: when you reference, copy the few fields you almost always need (a recipe stores author: { _id, name, avatarUrl }). Saves a join. Trade off: you must update the copy when the source changes.
Schema versioning: store a schemaVersion: 2 field on every document. Migrating becomes "any document still on v1 gets transformed lazily on read".
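
To make the bucket pattern concrete, here is a small sketch in plain JavaScript that folds raw readings into one bucket per sensor per minute, and keeps a count on the bucket in the spirit of the computed pattern. The field names (sensorId, at, value, minute, count) and the sample data are assumptions for illustration, not a prescribed schema:

```javascript
// Group raw sensor readings into one bucket document per sensor per minute.
function bucketByMinute(readings) {
  const buckets = new Map();
  for (const { sensorId, at, value } of readings) {
    // Truncate the timestamp to the start of its minute.
    const minute = new Date(Math.floor(at.getTime() / 60000) * 60000);
    const key = `${sensorId}|${minute.toISOString()}`;
    if (!buckets.has(key)) {
      buckets.set(key, { sensorId, minute, count: 0, readings: [] });
    }
    const bucket = buckets.get(key);
    bucket.readings.push({ at, value });
    bucket.count += 1; // computed pattern: keep the count on the parent
  }
  return [...buckets.values()];
}

const raw = [
  { sensorId: "s1", at: new Date("2026-04-01T10:00:05Z"), value: 21.5 },
  { sensorId: "s1", at: new Date("2026-04-01T10:00:35Z"), value: 21.7 },
  { sensorId: "s1", at: new Date("2026-04-01T10:01:05Z"), value: 21.6 },
];
console.log(bucketByMinute(raw).length); // 2 buckets instead of 3 documents
```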

These patterns are why senior MongoDB feels different from senior SQL. You design for the read, not the write.

Decision 5: Validate at the door anyway

Schemaless feels great until a typo in your code writes { tile: "Mochi" } instead of { title: "Mochi" } for three weeks before anyone notices. So MongoDB lets you bolt on JSON Schema validation at the collection level:

db.createCollection("recipes", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["title", "createdAt"],
      properties: {
        title:       { bsonType: "string", minLength: 1 },
        prepMinutes: { bsonType: "int", minimum: 0 },
        tags:        { bsonType: "array", items: { bsonType: "string" } },
        createdAt:   { bsonType: "date" }
      }
    }
  },
  validationLevel:  "strict",
  validationAction: "error"
});

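Now writes that violate the schema are rejected. You keep the flexibility, you remove the typos. (If the collection already exists, attach the same validator with the collMod command instead of createCollection.)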

In a real Node.js app, you usually go one level higher: define the schema once with Zod (or use Mongoose if you want a fuller ODM), and let it both validate at the API boundary and produce your TypeScript types.
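
To show the boundary check without pulling in a library, here is a hand rolled stand in that mirrors the collection validator. It is not Zod and not a recommendation to write validators by hand. validateRecipe is a hypothetical function, and in a real app a Zod schema or Mongoose model would take its place:

```javascript
// Hypothetical boundary check mirroring the $jsonSchema rules:
// title and createdAt are required, prepMinutes and tags are optional.
function validateRecipe(input) {
  const errors = [];
  if (typeof input.title !== "string" || input.title.length < 1) {
    errors.push("title must be a non-empty string");
  }
  if (!(input.createdAt instanceof Date) || Number.isNaN(input.createdAt.getTime())) {
    errors.push("createdAt must be a valid Date");
  }
  if (input.prepMinutes !== undefined &&
      (!Number.isInteger(input.prepMinutes) || input.prepMinutes < 0)) {
    errors.push("prepMinutes must be a non-negative integer");
  }
  if (input.tags !== undefined &&
      !(Array.isArray(input.tags) && input.tags.every((t) => typeof t === "string"))) {
    errors.push("tags must be an array of strings");
  }
  return errors;
}

// The typo from earlier: "tile" instead of "title".
console.log(validateRecipe({ tile: "Mochi", createdAt: new Date() }));
// [ 'title must be a non-empty string' ]
```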

Decision 6: Indexes work just like SQL, but read the rules

Without a suitable index, a query is a full collection scan. Same as SQL. (The _id field is the exception: it always has an index.)

db.recipes.createIndex({ title: 1 });               // ascending
db.recipes.createIndex({ author: 1, createdAt: -1 }); // compound
db.users.createIndex({ email: 1 }, { unique: true });   // unique, on a collection that actually has emails

Special index types you will actually use:

Multikey (automatic): if a field is an array, the index covers each element. db.recipes.createIndex({ tags: 1 }) lets you find recipes by any tag.
Text: full text search. db.recipes.createIndex({ title: "text", description: "text" }). Then find({ $text: { $search: "mochi" } }).
Geospatial (2dsphere): for "find places near me" queries on GeoJSON points.
TTL: documents auto delete after a deadline (a background task sweeps about once a minute, so expect a short delay). Great for sessions, magic links, signed urls. db.sessions.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 }).
Partial: index only the matching subset. { partialFilterExpression: { active: true } }.
Wildcard: index unknown fields. Use sparingly.

To see what a query is doing, ask:

db.recipes.find({ tags: "dessert" }).explain("executionStats");

Look at winningPlan, then compare totalDocsExamined with nReturned. If the first is far larger than the second, you are probably scanning when you could be using an index.
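
That comparison is easy to automate. The sketch below takes an explain("executionStats") result as a plain object and reports how many documents were examined per document returned. The executionStats.totalDocsExamined and executionStats.nReturned fields are part of the explain output; the helper name, the sample numbers, and the idea of using the ratio as a warning sign are illustrative assumptions:

```javascript
// Hypothetical helper: documents examined per document returned.
// A value near 1 suggests a selective index; a large value suggests a scan.
function scanRatio(explainResult) {
  const { totalDocsExamined, nReturned } = explainResult.executionStats;
  if (nReturned === 0) {
    return totalDocsExamined === 0 ? 1 : Infinity;
  }
  return totalDocsExamined / nReturned;
}

// Made-up numbers shaped like a real explain result.
const sample = { executionStats: { nReturned: 12, totalDocsExamined: 48000 } };
console.log(scanRatio(sample)); // 4000
```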

The compound index ordering rule is the same as SQL: the order of fields matters. { author: 1, createdAt: -1 } helps queries that filter by author (alone) or by author plus createdAt. It does not help a query that only filters by createdAt, because an index can only be used through a prefix of its keys. A related guideline for choosing that order is what MongoDB land calls the ESR rule: index keys should generally appear in the order Equality, Sort, Range.
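
The field ordering behaviour can be sketched as a toy check. This is a deliberate simplification of what the query planner does (it ignores sorts, ranges, and everything else the planner weighs), and usablePrefixLength is a made up name:

```javascript
// Toy model: an index can serve a filter only through its leading fields.
// Returns how many leading index fields the filter covers
// (0 means the index does not help this filter).
function usablePrefixLength(indexFields, filterFields) {
  const wanted = new Set(filterFields);
  let length = 0;
  for (const field of indexFields) {
    if (!wanted.has(field)) break;
    length += 1;
  }
  return length;
}

const index = ["author", "createdAt"]; // stands for { author: 1, createdAt: -1 }

console.log(usablePrefixLength(index, ["author"]));              // 1
console.log(usablePrefixLength(index, ["author", "createdAt"])); // 2
console.log(usablePrefixLength(index, ["createdAt"]));           // 0
```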

Decision 7: The aggregation pipeline, the real power tool

find is fine for "give me documents that match". The moment you need to group, transform, join, project, or compute, you graduate to the aggregation pipeline.

Think of it as Unix pipes. Each stage takes documents in and emits documents out.

db.orders.aggregate([
  { $match: { placedAt: { $gte: ISODate("2026-04-01") } } },
  { $group: {
      _id: "$customerId",
      orders: { $sum: 1 },
      revenueCents: { $sum: "$totalCents" }
  }},
  { $sort:  { revenueCents: -1 } },
  { $limit: 10 }
]);
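
If the stages feel abstract, it can help to see roughly the same computation written with plain array methods. This is only an emulation for intuition: the sample orders are invented, and the real pipeline runs inside the database rather than in your app.

```javascript
const orders = [
  { customerId: "c1", totalCents: 1200, placedAt: new Date("2026-04-02") },
  { customerId: "c2", totalCents: 800,  placedAt: new Date("2026-04-03") },
  { customerId: "c1", totalCents: 500,  placedAt: new Date("2026-04-10") },
  { customerId: "c3", totalCents: 9900, placedAt: new Date("2026-03-15") },
];

// $match: keep orders placed on or after 2026-04-01
const matched = orders.filter((o) => o.placedAt >= new Date("2026-04-01"));

// $group: one output document per customerId, with two accumulators
const groups = new Map();
for (const o of matched) {
  const g = groups.get(o.customerId) ?? { _id: o.customerId, orders: 0, revenueCents: 0 };
  g.orders += 1;                  // { $sum: 1 }
  g.revenueCents += o.totalCents; // { $sum: "$totalCents" }
  groups.set(o.customerId, g);
}

// $sort by revenueCents descending, then $limit 10
const top = [...groups.values()]
  .sort((a, b) => b.revenueCents - a.revenueCents)
  .slice(0, 10);

console.log(top);
```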

The stages you will use most:

$match: filter, like WHERE. Put it first when you can to use indexes.
$project: shape the output, pick fields, compute new ones.
$group: collapse documents by a key, with accumulators ($sum, $avg, $min, $max, $push, $addToSet, $first, $last).
$sort, $limit, $skip: classic.
$unwind: turn an array field into one document per element. Magic for "explode tags".
$lookup: yes, MongoDB does joins. Same idea as SQL LEFT JOIN, with a slightly different API.
$facet: run multiple sub pipelines on the same input, return them side by side. Great for dashboards.
$set / $addFields: add computed fields without losing the existing ones.
$merge / $out: write the result back into a collection.

A $lookup example:

db.recipes.aggregate([
  { $match: { tags: "dessert" } },
  { $lookup: {
      from:         "users",
      localField:   "authorId",
      foreignField: "_id",
      as:           "author"
  }},
  { $unwind: "$author" },                  // flatten to a single object
  { $project: { title: 1, "author.name": 1, prepMinutes: 1 } }
]);

That is a join, an unwind, and a projection in one query. Once the pipeline clicks, you stop reaching for application code to do post processing. You let the database do it.
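
One subtlety in that pipeline: $lookup behaves like a left outer join, but a plain $unwind then drops any recipe whose author array came back empty, so the combination acts like an inner join. A plain JavaScript emulation with invented data:

```javascript
const users = [{ _id: "u1", name: "Mia" }];
const recipes = [
  { title: "Mochi Cookies", authorId: "u1", prepMinutes: 20 },
  { title: "Mystery Stew",  authorId: "u9", prepMinutes: 45 }, // no such user
];

// $lookup: attach an array of matching users to every recipe
const joined = recipes.map((r) => ({
  ...r,
  author: users.filter((u) => u._id === r.authorId),
}));

// $unwind: one output document per array element, so empty arrays vanish
const unwound = joined.flatMap((r) =>
  r.author.map((a) => ({ ...r, author: a }))
);

console.log(joined.length);  // 2
console.log(unwound.length); // 1
```

To keep recipes without a matching author, $unwind also accepts the document form with preserveNullAndEmptyArrays: true.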

Senior tip: aggregations can use indexes, but mostly for $match and $sort stages at the start of the pipeline. Filter early, then transform.

Decision 8: Transactions, when you really need them

For most workloads, the trick is to design your documents so a single write changes everything that needs to change. One document update is atomic by itself. That is enough for 90% of cases.

When you really do need to write across multiple documents or collections atomically, MongoDB has multi document transactions:

const session = client.startSession();
try {
  await session.withTransaction(async () => {
    await accounts.updateOne({ _id: a }, { $inc: { balance: -1000 } }, { session });
    await accounts.updateOne({ _id: b }, { $inc: { balance:  1000 } }, { session });
    await transfers.insertOne({ from: a, to: b, amountCents: 1000 }, { session });
  });
} finally {
  await session.endSession();
}

ACID is real here, just like in SQL. The catches:

Transactions need a replica set or a sharded cluster, not a standalone server (Atlas always gives you a replica set, and most production MongoDBs run as one).
They are slower than single document writes. Reach for them only when you need them.
Long transactions hurt throughput. Keep them short.

The right design instinct in MongoDB is "make this one document update", and the second instinct is "okay then, transactions". Not the other way around.

Decision 9: Scaling, replication, and sharding

Three things you will hear about, in order of importance for most apps:

Replica set: a primary plus secondaries that follow along. Reads can go to secondaries with the right read preference. Writes go to the primary. If the primary dies, a secondary is elected. This gives you durability and high availability and is the default in Atlas.
Read concern and write concern: you choose how many nodes must acknowledge a write (w: "majority" is the safe choice, and the default since MongoDB 5.0), and how recent your reads must be (readConcern: "majority" returns only data that a majority of nodes have acknowledged). These two knobs are the trade off between durability and latency.
Sharding: when one machine cannot hold the data, you split the collection across many. You pick a shard key that determines which documents live where. Pick wisely: a hashed _id or other random shard key spreads writes evenly, but lookups by user are then scattered across all shards, while a ranged key on a default ObjectId _id sends every new insert to the same shard because the values only grow. There is no perfect shard key, only the one that fits your access patterns.
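
How evenly writes spread depends a lot on whether the key is hashed. The toy simulation below compares ranged and hashed placement for keys that only ever increase, like a counter or a timestamp. Everything in it is an assumption for illustration: four shards, fixed chunk boundaries, and md5 standing in for the server's own hash function.

```javascript
const crypto = require("node:crypto");

const SHARDS = 4;

// Ranged sharding (toy model): chunk boundaries split the key space, and
// any key above the last boundary lands on the last shard.
const boundaries = [1000, 2000, 3000];
function rangedShard(key) {
  const index = boundaries.findIndex((b) => key < b);
  return index === -1 ? SHARDS - 1 : index;
}

// Hashed sharding (toy model): md5 stands in for the server's hash function.
function hashedShard(key) {
  const digest = crypto.createHash("md5").update(String(key)).digest();
  return digest.readUInt32BE(0) % SHARDS;
}

// Count how many keys each shard would receive.
function distribution(keys, pickShard) {
  const counts = new Array(SHARDS).fill(0);
  for (const key of keys) counts[pickShard(key)] += 1;
  return counts;
}

// 1,000 new inserts with an ever increasing key.
const newKeys = Array.from({ length: 1000 }, (_, i) => 5000 + i);

console.log(distribution(newKeys, rangedShard)); // [ 0, 0, 0, 1000 ]
console.log(distribution(newKeys, hashedShard)); // spread across all four shards
```

The hashed version gives up cheap range scans on the key in exchange for that even spread, which is the trade off to weigh against your access patterns.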

Most apps under "huge" scale never need sharding. Replica sets are enough.

Decision 10: The modern stack

In 2026, the typical MongoDB setup looks like:

Atlas (the official managed service) for production. Free tier for prototypes. Handles replica sets, backups, point in time restore, monitoring, and Atlas Search.
Atlas Search for full text search backed by Lucene. It is wildly more capable than the built in $text operator.
Atlas Vector Search for semantic search and RAG (retrieval augmented generation) for AI apps. Vectors live next to your documents. One database, both kinds of search.
Official drivers for every language. In Node.js, the modern choice is the official mongodb driver, often paired with Zod for validation, or the Mongoose ODM if you want a more opinionated layer with hooks and middleware.
Compass (the GUI) for poking at data and prototyping aggregations.
mongosh (the new shell) for command line work.

If you are starting a new app, the lean default is: official driver + Zod schemas + Atlas. Reach for Mongoose only if you genuinely want its conveniences.

Decision 11: Pitfalls you only learn the hard way

A short list of senior level traps:

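Forgetting $set in updates. With replaceOne or the legacy update() helper, a plain document replaces the whole record; updateOne and updateMany reject it instead. Always include the operator.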
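Storing a stringly typed _id when you could have used ObjectId. ObjectIds are roughly sortable by creation time and more compact: 12 bytes instead of the 24 characters of their hex string form. (The _id field is indexed whichever type you pick.)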
Querying with the wrong type silently returns nothing. { price: "10" } will not match { price: 10 }. Types matter.
Letting documents grow unbounded. A recipe with 50,000 comments is a problem. Bucket or reference them.
Treating Mongo like SQL. If your queries are full of $lookup joins, you may be designing as if you are still in SQL. Reconsider whether the data should be embedded.
Writing find() and forgetting the projection. You ship the entire 8KB document over the wire when you only needed the title.
Skipping indexes "for now". "For now" becomes "in production" faster than you think.
Ignoring the _id in $group. _id: null groups everything together. _id: "$field" groups by that field. This is the most confusing line in the aggregation pipeline for newcomers.
Not using .lean() (Mongoose) or projection (driver) when you do not need full hydrated objects. Slower otherwise.
Treating capped collections like queues. Use a real queue.

A peek under the hood

What really happens when you run a query:

The driver opens a connection (pooled) to the cluster and figures out the topology (which node is primary, which are secondaries).
Your query goes to the appropriate node based on read preference.
The query planner picks an execution plan, ideally using an index.
The storage engine (WiredTiger, by default) reads pages from the cache or from disk. Documents are stored as compressed BSON.
Results stream back through the driver. Cursors fetch in batches (the first batch defaults to 101 documents, and later batches are capped by size), so a find() over a million docs does not blow up memory.
Writes go to the primary, get applied in memory and recorded in the on disk journal, then reach the secondaries through the oplog. Your write concern decides how many acknowledgements the primary waits for before it reports success.
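
The batching step is the one that keeps memory flat, and it is simple to model. Below is a toy cursor written as a generator: it pulls documents from a pretend server one batch at a time and hands them out individually. The batch size of 100, the fakeServer function, and the document shape are all invented for the example; real drivers do this for you and are usually consumed with async iteration.

```javascript
// Toy cursor: pulls documents from a source in batches and yields them one
// at a time, so only one batch is held in memory at once.
function* batchedCursor(fetchBatch, batchSize) {
  let offset = 0;
  while (true) {
    const batch = fetchBatch(offset, batchSize); // one "round trip"
    if (batch.length === 0) return;
    yield* batch;
    offset += batch.length;
  }
}

// Invented data source standing in for the server.
const TOTAL = 1050;
let roundTrips = 0;
function fakeServer(offset, limit) {
  roundTrips += 1;
  const size = Math.max(0, Math.min(limit, TOTAL - offset));
  return Array.from({ length: size }, (_, i) => ({ _id: offset + i }));
}

let seen = 0;
for (const doc of batchedCursor(fakeServer, 100)) {
  seen += 1; // process doc here; earlier batches can already be garbage collected
}
console.log(seen, roundTrips); // 1050 12
```

The twelfth round trip is the empty one that tells the cursor there is nothing left.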

The mental model is: WiredTiger is the SQLite or InnoDB underneath, BSON is the row format, and the replica set is the durability story. Once you know that, debugging slow queries and odd behavior gets much easier.

Tiny tips that will save you later

Design for the read. What does your most common query look like? Shape documents to make it one read, no joins.
Pick ObjectId for _id unless you have a reason not to.
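Use w: "majority" in production. It has been the default in most deployments since MongoDB 5.0, so mainly check that your connection string or driver options do not lower it.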
Use Atlas Search for anything past trivial text search.
Add indexes before they bite, not after. Watch slow query logs.
Use the aggregation pipeline. It is usually faster than fetching documents to your app and looping, because less data crosses the wire.
Keep documents under a megabyte where you can. 16 MB is the hard limit, but cache and replication love smaller documents.
Validate at the boundary with Zod (or the collection validator). Schemaless does not mean schema free.
Use .explain("executionStats") like you would EXPLAIN ANALYZE. It is the same skill in a different costume.
Back it up. Test the restore. Atlas does this for you. If you self host, you are on the hook.

Wrapping up

So that is the whole story. We were tired of forcing nested, evolving data into rigid tables. We built a database that stores documents, the same shape your app already passes around. We grouped them into collections, indexed them, and gave them a powerful aggregation pipeline so we did not have to do post processing in app code.

We accepted the trade off: schemaless is freedom, and freedom needs discipline. We designed documents around the reads we cared about, embedded what was owned, referenced what was shared, and added validation to keep typos from rotting our data. We learned that ACID is not just a SQL thing: single document writes are atomic, and full transactions are there when we need them.

Once that map is in your head, MongoDB stops feeling like SQL with weird syntax and starts feeling like the data model your app was always trying to be.

Happy modeling, and save a cookie for me.

DEV Community