Franck Pachot

Anti-Join in MongoDB

SQL databases implement anti-joins—finding rows with no match—through NOT EXISTS. Although anti-join is not a distinct relational algebra operator, it can be derived from set difference and semi-join. A naive implementation that scans both sets to find the complement would not scale, so databases use short-circuit evaluation: the inner scan stops as soon as a match is found or ruled out, because only existence matters. That is why a subquery without any JOIN keyword can still appear as an Anti Join in the execution plan.
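As a mental model, the existence test behind an anti-join can be sketched in plain JavaScript (illustrative in-memory data, not database code): `Array.prototype.some` short-circuits at the first match, exactly like the inner scan described above.

```javascript
// Hypothetical in-memory data, mirroring the users/transactions example below
const users = [
  { id: 1, name: "alice" },
  { id: 2, name: "bob" },
  { id: 3, name: "carol" },
];
const transactions = [
  { user_id: 1, type: "paid" },
  { user_id: 2, type: "free" },
];

// Anti-join: keep users with NO matching paid transaction.
// `some` stops scanning at the first match, because only existence matters.
const noPaid = users.filter(
  (u) => !transactions.some((t) => t.user_id === u.id && t.type === "paid")
);

console.log(noPaid.map((u) => u.name)); // [ 'bob', 'carol' ]
```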

Here is an example in PostgreSQL, where we find users who haven't made any paid transactions:


postgres=# EXPLAIN ANALYZE
SELECT u.*
FROM users u
WHERE NOT EXISTS (
    SELECT 1
    FROM transactions t
    WHERE t.user_id = u.id
      AND t.type = 'paid'
);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join (actual time=17.087..17.087 rows=0.00 loops=1)
   Buffers: shared hit=30095
   ->  Seq Scan on users u (actual time=0.007..0.628 rows=10000.00 loops=1)
         Buffers: shared hit=94
   ->  Index Only Scan using idx_transactions_user_type on transactions t (actual time=0.001..0.001 rows=1.00 loops=10000)
         Index Cond: ((user_id = u.id) AND (type = 'paid'::text))
         Heap Fetches: 0
         Index Searches: 10000
         Buffers: shared hit=30001
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.200 ms
 Execution Time: 17.109 ms
(13 rows)

How does this translate to MongoDB, where there is no NOT EXISTS syntax? It can be very similar: one index lookup per user (Index Searches: 10000), each returning at most one entry (rows=1.00 loops=10000).

In MongoDB, there are two equivalents to SQL joins:

  • The left outer join materialized at write time, where you embed the inner set as an array nested into the outer document. This is best when the cardinality is known and bounded, the entities are queried together as a single aggregate, and they share the same consistency boundaries within an atomic transaction. When you query the document, it is already joined, and you can $unwind the array to get the equivalent of a left outer join.
  • The $lookup operator, which finds matching documents through an aggregation pipeline and adds them as an array that can be used as if it were embedded, or unwound with $unwind.
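Both shapes converge on the same rows once the array is unwound. Here is a plain-JavaScript sketch of what $unwind does (illustrative data; by default $unwind drops documents whose array is empty, like an inner join):

```javascript
// Plain-JS model of $unwind on an embedded array (illustrative data).
const docs = [
  { _id: 1, items: ["a", "b"] },
  { _id: 2, items: [] }, // dropped by default, like an inner join
];

// $unwind emits one output document per array element.
const unwound = docs.flatMap((d) =>
  d.items.map((item) => ({ _id: d._id, item }))
);

console.log(unwound);
// [ { _id: 1, item: 'a' }, { _id: 1, item: 'b' } ]
```

The real $unwind stage keeps the empty-array documents only when given { preserveNullAndEmptyArrays: true }, which is what makes it behave like the left outer join described above.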

With reference (join on read)

With $lookup, an anti-join is possible and performant by giving the lookup pipeline a $limit and a $project stage, so that it checks for existence only.

Here is an example with a collection of ten thousand users and a collection of transactions:

db.users.drop();
db.transactions.drop();

// Insert 10000 users
const users = [];
for (let i = 0; i < 10000; i++) {
    users.push({
        _id: ObjectId(),
        name: `user_${i}`,
        email: `user_${i}@example.com`,
        created_at: new Date()
    });
}
db.users.insertMany(users);

// Insert between 0 and 100 transactions per user
// ~80% are "paid", ~20% are "free"
const transactions = [];
users.forEach(user => {
    const numTransactions = Math.floor(Math.random() * 101); // 0 to 100
    for (let j = 0; j < numTransactions; j++) {
        transactions.push({
            user_id: user._id,
            type: Math.random() < 0.8 ? "paid" : "free",
            amount: Math.round(Math.random() * 10000) / 100,
            date: new Date(
                Date.now() - Math.floor(Math.random() * 365 * 24 * 60 * 60 * 1000)
            )
        });
    }
});

// Insert in batches (insertMany has a limit)
const BATCH_SIZE = 10000;
for (let i = 0; i < transactions.length; i += BATCH_SIZE) {
    db.transactions.insertMany(transactions.slice(i, i + BATCH_SIZE));
}

print(`Users created: ${db.users.countDocuments()}`);
print(`Users with transactions: ${db.transactions.distinct("user_id").length}`);
print(`Users with paid transactions: ${db.transactions.distinct("user_id", { type: "paid" }).length}`);
print(`Transactions created: ${db.transactions.countDocuments()}`);
print(`Paid transactions: ${db.transactions.countDocuments({ type: "paid" })}`);
print(`Free transactions: ${db.transactions.countDocuments({ type: "free" })}`);


This generated random data for 10,000 users, of whom 105 have no transactions at all, and 24 have only free transactions.

test> print(`Users created: ${db.users.countDocuments()}`);
Users created: 10000

test> print(`Users with transactions: ${db.transactions.distinct("user_id").length}`);
Users with transactions: 9895

test> print(`Users with paid transactions: ${db.transactions.distinct("user_id", { type: "paid" }).length}`);
Users with paid transactions: 9871

test> print(`Transactions created: ${db.transactions.countDocuments()}`);
Transactions created: 498812

test> print(`Paid transactions: ${db.transactions.countDocuments({ type: "paid" })}`);
Paid transactions: 399612

test> print(`Free transactions: ${db.transactions.countDocuments({ type: "free" })}`);
Free transactions: 99200

In this sample dataset, 129 users have no paid transactions (either no transactions at all, or only free ones).

As it will have to find transactions per user and type, I create the following index:


db.transactions.createIndex({ user_id: 1, type: 1 });


Here is the aggregation pipeline that returns the users with no paid transactions:


db.users.aggregate([
    {
        $lookup: {
            from: "transactions",
            localField: "_id",
            foreignField: "user_id",
            pipeline: [
                { $match: { type: "paid" } },
                { $limit: 1 },                  // ✅ Stop as soon as we find ONE
                { $project: { _id: 1 } }        // ✅ Return minimal data
            ],
            as: "paid_transactions"
        }
    },
    {
        $match: {
            paid_transactions: { $eq: [] }      // Keep only users with NO match
        }
    },
    {
        $project: {
            paid_transactions: 0                // Clean up the temporary field
        }
    }
]).explain("executionStats").stages


Thanks to the { '$limit': 1 }, the $lookup stage examined at most one document per user with a paid transaction, i.e. 9,871 documents in total:

{
    '$lookup': {
      from: 'transactions',
      as: 'paid_transactions',
      localField: '_id',
      foreignField: 'user_id',
      let: {},
      pipeline: [
        { '$match': { type: 'paid' } },
        { '$limit': 1 },
        { '$project': { _id: 1 } }
      ]
    },
    totalDocsExamined: Long('9871'),
    totalKeysExamined: Long('9871'),
    collectionScans: Long('0'),
    indexesUsed: [ 'user_id_1_type_1' ],
    nReturned: Long('10000'),
    executionTimeMillisEstimate: Long('1096')
  },

It returned one document for each of the 10,000 users, with an empty array when there was no match, and the $match stage then kept only the 129 documents whose array is empty:

  {
    '$match': { paid_transactions: { '$eq': [] } },
    nReturned: Long('129'),
    executionTimeMillisEstimate: Long('1065')
  },

At 1,065 ms, this is notably slower than the PostgreSQL anti-join (17 ms). The explanation is that $lookup executes a separate sub-pipeline query per outer document, which carries a high per-document overhead compared to a native join operator. In MongoDB, when a join starts from thousands of documents, you should consider embedding.
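Another way to avoid the per-document lookups is a set difference in two queries: first fetch the distinct user_ids that do have a paid transaction, then query users with $nin. This assumes the distinct id set fits comfortably in memory and in a single reply. An in-memory sketch of the idea:

```javascript
// Sketch of the set-difference alternative (illustrative data).
// The mongosh equivalent would be two queries:
//   const paid = db.transactions.distinct("user_id", { type: "paid" });
//   db.users.find({ _id: { $nin: paid } });
// (assumes the distinct id set fits in memory)
const users = [{ _id: 1 }, { _id: 2 }, { _id: 3 }];
const transactions = [
  { user_id: 1, type: "paid" },
  { user_id: 2, type: "free" },
];

// One pass to build the set of users that DO have a paid transaction...
const paidIds = new Set(
  transactions.filter((t) => t.type === "paid").map((t) => t.user_id)
);

// ...then a single filter, instead of one lookup query per user.
const noPaid = users.filter((u) => !paidIds.has(u._id));
console.log(noPaid.map((u) => u._id)); // [ 2, 3 ]
```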

With embedding (join on write)

If you prefer to store the transactions embedded within each user document, the query can be simpler and faster.

I update users to embed the transactions:

db.users.aggregate([  
    {  
        $lookup: {  
            from: "transactions",  
            localField: "_id",  
            foreignField: "user_id",  
            as: "transactions"  
        }  
    },  
    {  
        $merge: {  
            into: "users",  
            whenMatched: "merge"  
        }  
    }  
]);  

The query to find users with no paid transactions is simply:

db.users.find(
    { "transactions.type": { $ne: "paid" } },
    { "transactions": 0 }
).explain("executionStats");


This scans the users to return the 129 ones without paid transactions:

    inputStage: {
      stage: 'COLLSCAN',
      filter: { 'transactions.type': { '$not': { '$eq': 'paid' } } },
      nReturned: 129,
      executionTimeMillisEstimate: 20,
      works: 10001,
      advanced: 129,
      needTime: 9871,
      needYield: 0,
      saveState: 1,
      restoreState: 1,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 10000
    }
  }

This is not efficient because it has to read every large document, with all its embedded transactions, only to discard most of them.

A multikey index on "transactions.type" does not help much either. Because it indexes every element in the array, a user with both "paid" and "free" transactions has index entries under both values. The $ne: "paid" scan therefore retrieves every user that has at least one non-paid entry, which is nearly all of them, and the FETCH stage must then read each full document to verify that none of its transactions are paid:

      inputStage: {
        stage: 'IXSCAN',
        nReturned: 9587,
        executionTimeMillisEstimate: 10,
        works: 9589,
        advanced: 9587,
        needTime: 1,
        needYield: 0,
        saveState: 2,
        restoreState: 2,
        isEOF: 1,
        keyPattern: { 'transactions.type': 1 },
        indexName: 'transactions.type_1',
        isMultiKey: true,
        multiKeyPaths: { 'transactions.type': [ 'transactions' ] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          'transactions.type': [ '[MinKey, "paid")', '("paid", MaxKey]' ]
        },
        keysExamined: 9588,
        seeks: 2,
        dupsTested: 9587,
        dupsDropped: 0
      }

The bounds exclude the "paid" key itself, but the scan still fetches documents that have paid transactions alongside free ones, and the filter must discard them afterwards:

      stage: 'FETCH',
      filter: { 'transactions.type': { '$not': { '$eq': 'paid' } } },
      nReturned: 129,
      executionTimeMillisEstimate: 36,
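A plain-JavaScript sketch makes the multikey behavior concrete (illustrative only: the index holds one entry per array element):

```javascript
// Why a multikey index cannot answer "no element equals 'paid'".
const user = { _id: 1, transactions: [{ type: "paid" }, { type: "free" }] };

// The multikey index stores one entry PER ARRAY ELEMENT (flattened here):
const indexEntries = user.transactions.map((t) => ({ key: t.type, id: user._id }));
console.log(indexEntries);
// [ { key: 'paid', id: 1 }, { key: 'free', id: 1 } ]

// A scan over the bounds [MinKey, "paid") and ("paid", MaxKey] still finds
// this user through its 'free' entry, even though it DOES have a paid one,
// so the FETCH stage must re-check the whole document.
const matchedByIndex = indexEntries.some((e) => e.key !== "paid");
const trulyNoPaid = user.transactions.every((t) => t.type !== "paid");
console.log(matchedByIndex, trulyNoPaid); // true false
```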

In general, indexes do not help with anti-joins because they index existing values, not the absence of values. In MongoDB with embedded documents, you can work around this by adding a computed flag, such as "has_paid": false, that is maintained when transactions are inserted or removed, and by indexing that flag. Since the flag and the transactions live in the same document, this adds negligible write overhead while turning the anti-join into a simple indexed equality match.
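An in-memory sketch of the computed-flag pattern (illustrative; in mongosh the equivalent would be a single updateOne combining a $push of the transaction with a $set of the flag, so both change in the same atomic write):

```javascript
// Computed-flag pattern, simulated in memory (illustrative data).
const user = { _id: 1, transactions: [], has_paid: false };

// Maintain the flag whenever a transaction is added to the document.
// Flag and array live in the same document, so one atomic write covers both.
function addTransaction(u, tx) {
  u.transactions.push(tx);
  if (tx.type === "paid") u.has_paid = true;
}

addTransaction(user, { type: "free", amount: 5 });
console.log(user.has_paid); // false -- still matches the anti-join query
addTransaction(user, { type: "paid", amount: 10 });
console.log(user.has_paid); // true
```

With the flag in place and indexed, the anti-join becomes db.users.find({ has_paid: false }), a plain indexed equality.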
