Franck Pachot

Anti-Join in MongoDB

SQL databases implement anti-joins—finding rows with no match—through NOT EXISTS. Although anti-join is not a distinct relational algebra operator, it can be derived from set difference and semi-join. A naive implementation that scans both sets to find the complement would not scale, so databases use short-circuit evaluation: the inner scan stops as soon as a match is found or ruled out, because only existence matters. That is why a subquery without any JOIN keyword can still appear as an Anti Join in the execution plan.
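As a mental model, the existence test behind an anti-join can be sketched in plain JavaScript (illustrative in-memory data, not database code): `Array.prototype.some` short-circuits at the first match, exactly like the inner scan described above.

```javascript
// Hypothetical in-memory data, mirroring the users/transactions example below
const users = [
  { id: 1, name: "alice" },
  { id: 2, name: "bob" },
  { id: 3, name: "carol" },
];
const transactions = [
  { user_id: 1, type: "paid" },
  { user_id: 2, type: "free" },
];

// Anti-join: keep users with NO matching paid transaction.
// `some` stops scanning at the first match, because only existence matters.
const noPaid = users.filter(
  (u) => !transactions.some((t) => t.user_id === u.id && t.type === "paid")
);

console.log(noPaid.map((u) => u.name)); // [ 'bob', 'carol' ]
```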

Here is an example in PostgreSQL, where we find users who haven't made any paid transactions:


postgres=# EXPLAIN ANALYZE
SELECT u.*
FROM users u
WHERE NOT EXISTS (
    SELECT 1
    FROM transactions t
    WHERE t.user_id = u.id
      AND t.type = 'paid'
);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join (actual time=17.087..17.087 rows=0.00 loops=1)
   Buffers: shared hit=30095
   ->  Seq Scan on users u (actual time=0.007..0.628 rows=10000.00 loops=1)
         Buffers: shared hit=94
   ->  Index Only Scan using idx_transactions_user_type on transactions t (actual time=0.001..0.001 rows=1.00 loops=10000)
         Index Cond: ((user_id = u.id) AND (type = 'paid'::text))
         Heap Fetches: 0
         Index Searches: 10000
         Buffers: shared hit=30001
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.200 ms
 Execution Time: 17.109 ms
(13 rows)

How does this translate to MongoDB, where there is no NOT EXISTS syntax? It can be very similar: one index lookup per user (Index Searches: 10000), each returning at most one entry (rows=1.00 loops=10000).

In MongoDB, there are two equivalents to SQL joins:

  • The left outer join materialized at write time, where you embed the inner set as an array nested into the outer document. This is best when the cardinality is known and bounded, the entities are queried together as a single aggregate, and they share the same consistency boundaries within an atomic transaction. When you query the document, it is already joined, and you can $unwind the array to get the equivalent of a left outer join.
  • The $lookup operator, which finds matching documents through an aggregation pipeline and adds them as an array that can be used as if it were embedded, or unwound with $unwind.
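Both shapes converge on the same rows once the array is unwound. Here is a plain-JavaScript sketch of what $unwind does (illustrative data; by default $unwind drops documents whose array is empty, like an inner join):

```javascript
// Plain-JS model of $unwind on an embedded array (illustrative data).
const docs = [
  { _id: 1, items: ["a", "b"] },
  { _id: 2, items: [] }, // dropped by default, like an inner join
];

// $unwind emits one output document per array element.
const unwound = docs.flatMap((d) =>
  d.items.map((item) => ({ _id: d._id, item }))
);

console.log(unwound);
// [ { _id: 1, item: 'a' }, { _id: 1, item: 'b' } ]
```

The real $unwind stage keeps the empty-array documents only when given { preserveNullAndEmptyArrays: true }, which is what makes it behave like the left outer join described above.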

With reference (join on read)

With $lookup, an anti-join is possible and performant by giving the lookup pipeline a $limit and a $project stage, so that it checks for existence only.

Here is an example with a collection of ten thousand users and a collection of transactions:

db.users.drop();
db.transactions.drop();

// Insert 10000 users
const users = [];
for (let i = 0; i < 10000; i++) {
    users.push({
        _id: ObjectId(),
        name: `user_${i}`,
        email: `user_${i}@example.com`,
        created_at: new Date()
    });
}
db.users.insertMany(users);

// Insert between 0 and 100 transactions per user
// ~80% are "paid", ~20% are "free"
const transactions = [];
users.forEach(user => {
    const numTransactions = Math.floor(Math.random() * 101); // 0 to 100
    for (let j = 0; j < numTransactions; j++) {
        transactions.push({
            user_id: user._id,
            type: Math.random() < 0.8 ? "paid" : "free",
            amount: Math.round(Math.random() * 10000) / 100,
            date: new Date(
                Date.now() - Math.floor(Math.random() * 365 * 24 * 60 * 60 * 1000)
            )
        });
    }
});

// Insert in batches (insertMany has a limit)
const BATCH_SIZE = 10000;
for (let i = 0; i < transactions.length; i += BATCH_SIZE) {
    db.transactions.insertMany(transactions.slice(i, i + BATCH_SIZE));
}

print(`Users created: ${db.users.countDocuments()}`);
print(`Users with transactions: ${db.transactions.distinct("user_id").length}`);
print(`Users with paid transactions: ${db.transactions.distinct("user_id", { type: "paid" }).length}`);
print(`Transactions created: ${db.transactions.countDocuments()}`);
print(`Paid transactions: ${db.transactions.countDocuments({ type: "paid" })}`);
print(`Free transactions: ${db.transactions.countDocuments({ type: "free" })}`);


This generated random data for 10,000 users, of whom 105 have no transactions at all, and 24 have only free transactions.

test> print(`Users created: ${db.users.countDocuments()}`);
Users created: 10000

test> print(`Users with transactions: ${db.transactions.distinct("user_id").length}`);
Users with transactions: 9895

test> print(`Users with paid transactions: ${db.transactions.distinct("user_id", { type: "paid" }).length}`);
Users with paid transactions: 9871

test> print(`Transactions created: ${db.transactions.countDocuments()}`);
Transactions created: 498812

test> print(`Paid transactions: ${db.transactions.countDocuments({ type: "paid" })}`);
Paid transactions: 399612

test> print(`Free transactions: ${db.transactions.countDocuments({ type: "free" })}`);
Free transactions: 99200

In this sample dataset, 129 users have no paid transactions (either no transactions at all, or only free ones).

As it will have to find transactions per user and type, I create the following index:


db.transactions.createIndex({ user_id: 1, type: 1 });


Here is the aggregation pipeline that returns the users with no paid transactions:


db.users.aggregate([
    {
        $lookup: {
            from: "transactions",
            localField: "_id",
            foreignField: "user_id",
            pipeline: [
                { $match: { type: "paid" } },
                { $limit: 1 },                  // ✅ Stop as soon as we find ONE
                { $project: { _id: 1 } }        // ✅ Return minimal data
            ],
            as: "paid_transactions"
        }
    },
    {
        $match: {
            paid_transactions: { $eq: [] }      // Keep only users with NO match
        }
    },
    {
        $project: {
            paid_transactions: 0                // Clean up the temporary field
        }
    }
]).explain("executionStats").stages


Thanks to the { '$limit': 1 }, the $lookup stage examined at most one document per user with a paid transaction, i.e. 9,871 documents in total:

{
    '$lookup': {
      from: 'transactions',
      as: 'paid_transactions',
      localField: '_id',
      foreignField: 'user_id',
      let: {},
      pipeline: [
        { '$match': { type: 'paid' } },
        { '$limit': 1 },
        { '$project': { _id: 1 } }
      ]
    },
    totalDocsExamined: Long('9871'),
    totalKeysExamined: Long('9871'),
    collectionScans: Long('0'),
    indexesUsed: [ 'user_id_1_type_1' ],
    nReturned: Long('10000'),
    executionTimeMillisEstimate: Long('1096')
  },

It returned one document for each of the 10,000 users, with an empty array when there was no match, and the $match stage then kept only the 129 documents whose array is empty:

  {
    '$match': { paid_transactions: { '$eq': [] } },
    nReturned: Long('129'),
    executionTimeMillisEstimate: Long('1065')
  },

At 1,065 ms, this is notably slower than the PostgreSQL anti-join (17 ms). The explanation is that $lookup executes a separate sub-pipeline query per outer document, which carries a high per-document overhead compared to a native join operator. In MongoDB, when a join starts from thousands of documents, you should consider embedding.
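Another way to avoid the per-document lookups is a set difference in two queries: first fetch the distinct user_ids that do have a paid transaction, then query users with $nin. This assumes the distinct id set fits comfortably in memory and in a single reply. An in-memory sketch of the idea:

```javascript
// Sketch of the set-difference alternative (illustrative data).
// The mongosh equivalent would be two queries:
//   const paid = db.transactions.distinct("user_id", { type: "paid" });
//   db.users.find({ _id: { $nin: paid } });
// (assumes the distinct id set fits in memory)
const users = [{ _id: 1 }, { _id: 2 }, { _id: 3 }];
const transactions = [
  { user_id: 1, type: "paid" },
  { user_id: 2, type: "free" },
];

// One pass to build the set of users that DO have a paid transaction...
const paidIds = new Set(
  transactions.filter((t) => t.type === "paid").map((t) => t.user_id)
);

// ...then a single filter, instead of one lookup query per user.
const noPaid = users.filter((u) => !paidIds.has(u._id));
console.log(noPaid.map((u) => u._id)); // [ 2, 3 ]
```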

With embedding (join on write)

If you prefer to store the transactions embedded within each user document, the query can be simpler and faster.

I update users to embed the transactions:

db.users.aggregate([  
    {  
        $lookup: {  
            from: "transactions",  
            localField: "_id",  
            foreignField: "user_id",  
            as: "transactions"  
        }  
    },  
    {  
        $merge: {  
            into: "users",  
            whenMatched: "merge"  
        }  
    }  
]);  

The query to find users with no paid transactions is simply:

db.users.find(
    { "transactions.type": { $ne: "paid" } },
    { "transactions": 0 }
).explain("executionStats");


This scans the users to return the 129 ones without paid transactions:

    inputStage: {
      stage: 'COLLSCAN',
      filter: { 'transactions.type': { '$not': { '$eq': 'paid' } } },
      nReturned: 129,
      executionTimeMillisEstimate: 20,
      works: 10001,
      advanced: 129,
      needTime: 9871,
      needYield: 0,
      saveState: 1,
      restoreState: 1,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 10000
    }
  }

This is not efficient because it has to read every large document, with all its embedded transactions, only to discard most of them.

A multikey index on "transactions.type" does not help much either. Because it indexes every element in the array, a user with both "paid" and "free" transactions has index entries under both values. The $ne: "paid" scan therefore retrieves every user that has at least one non-paid entry, which is nearly all of them, and the FETCH stage must then read each full document to verify that none of its transactions are paid:

      inputStage: {
        stage: 'IXSCAN',
        nReturned: 9587,
        executionTimeMillisEstimate: 10,
        works: 9589,
        advanced: 9587,
        needTime: 1,
        needYield: 0,
        saveState: 2,
        restoreState: 2,
        isEOF: 1,
        keyPattern: { 'transactions.type': 1 },
        indexName: 'transactions.type_1',
        isMultiKey: true,
        multiKeyPaths: { 'transactions.type': [ 'transactions' ] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          'transactions.type': [ '[MinKey, "paid")', '("paid", MaxKey]' ]
        },
        keysExamined: 9588,
        seeks: 2,
        dupsTested: 9587,
        dupsDropped: 0
      }

The bounds exclude the "paid" key itself, but the scan still fetches documents that have paid transactions alongside free ones, and the filter must discard them afterwards:

      stage: 'FETCH',
      filter: { 'transactions.type': { '$not': { '$eq': 'paid' } } },
      nReturned: 129,
      executionTimeMillisEstimate: 36,
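A plain-JavaScript sketch makes the multikey behavior concrete (illustrative only: the index holds one entry per array element):

```javascript
// Why a multikey index cannot answer "no element equals 'paid'".
const user = { _id: 1, transactions: [{ type: "paid" }, { type: "free" }] };

// The multikey index stores one entry PER ARRAY ELEMENT (flattened here):
const indexEntries = user.transactions.map((t) => ({ key: t.type, id: user._id }));
console.log(indexEntries);
// [ { key: 'paid', id: 1 }, { key: 'free', id: 1 } ]

// A scan over the bounds [MinKey, "paid") and ("paid", MaxKey] still finds
// this user through its 'free' entry, even though it DOES have a paid one,
// so the FETCH stage must re-check the whole document.
const matchedByIndex = indexEntries.some((e) => e.key !== "paid");
const trulyNoPaid = user.transactions.every((t) => t.type !== "paid");
console.log(matchedByIndex, trulyNoPaid); // true false
```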

In general, indexes do not help with anti-joins because they index existing values, not the absence of values. In MongoDB with embedded documents, you can work around this by adding a computed flag, such as "has_paid": false, that is maintained when transactions are inserted or removed, and by indexing that flag. Since the flag and the transactions live in the same document, this adds negligible write overhead while turning the anti-join into a simple indexed equality match.
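An in-memory sketch of the computed-flag pattern (illustrative; in mongosh the equivalent would be a single updateOne combining a $push of the transaction with a $set of the flag, so both change in the same atomic write):

```javascript
// Computed-flag pattern, simulated in memory (illustrative data).
const user = { _id: 1, transactions: [], has_paid: false };

// Maintain the flag whenever a transaction is added to the document.
// Flag and array live in the same document, so one atomic write covers both.
function addTransaction(u, tx) {
  u.transactions.push(tx);
  if (tx.type === "paid") u.has_paid = true;
}

addTransaction(user, { type: "free", amount: 5 });
console.log(user.has_paid); // false -- still matches the anti-join query
addTransaction(user, { type: "paid", amount: 10 });
console.log(user.has_paid); // true
```

With the flag in place and indexed, the anti-join becomes db.users.find({ has_paid: false }), a plain indexed equality.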
