DEV Community

Cover image for MongoDB Data model, queries and challenges
ravi
ravi

Posted on • Edited on

MongoDB Data model, queries and challenges

This article discusses about the challenges that arise when using NoSQL or specifically MongoDB for application development. This post is not intended to present negative aspects of MongoDB but rather point out importance of considering challenges and scenarios that would come up when working with schema less database. This is the first post of the series of articles related to the similar topics and I will continue updating the posts as I add up future contents. Source code for testing this piece is available in the github

For now, I am planning to include topics mentioned below in this series.

  1. Advantages and problems when working with denormalized data in MongoDB
  2. Complexities in aggregation queries when using Lookup.
  3. Indexing efficiency in Lookup queries

MongoDB

In short - "MongoDB documents are very flexible in a way that they don't need to have same set of fields. It allows developers to focus on application logic rather than worrying about how to split the data across different rigid tables."

Use Case

In this post, we will talk about complex Aggregation query stages and expressions when using Lookup queries. Lookup queries when used with array or embedded documents, things are not straight forward. To demonstrate this, we have used two collections Users and Banks.

Users data Model

MongoDB has a rule. Link

The rule of thumb when you use MongoDB is data that is accessed together should be stored together

Complying with the rule, we have kept users info, property purchase data and loan data in one collection. Yet, in some scenarios, we have to reference information from another collection. The obvious reason is to avoid duplicate data across multiple documents. In our example, we have used bank_id to reference bank information from Banks collection.

Users data

user_data = [{
    "_id": ObjectId("65d6f663c1ac9dcd7587f82b"),
    "user_id": "cbfc5b52-a43d-4dbb-b20f-bfcb24bfc19f",
    "first_name": "Angela",
    "last_name": "Wench"
    "email": "Angela@example.org",
    "purchase_data": [{
       "purchase_date": "2021-10-29T23:13:58.177+00:00",
       "loan_detail": {
          "bank_id": "90a56685-4dc6-4529-93dd-db2cae97f95f",
          "loan_amount": 100000,
          "loan_term": 10,
          "interest_rate": 3.5
    }, {
       "purchase_date": "2018-11-13T9:13:11.177+00:00",
       "loan_detail": {
          "bank_id": "b8ba45cf-c5e5-4c78-a7ad-7945702b29ee",
          "loan_amount": 450000,
          "loan_term": 7,
          "interest_rate": 3.5
    }]
},
....{}]
Enter fullscreen mode Exit fullscreen mode

Bank data

Banks collections stores general banking information.

banks = [{
            "_id": ObjectId("65d2e38c445bea758eb5d16b"),
            "name": "JPMorgan Chase",
            "uuid": "90a56685-4dc6-4529-93dd-db2cae97f95f",
            "email": "support@jpmorgan.com",
        },
        {
            "_id": ObjectId("65d2e38c445bea758eb5d16c"),
            "name": "Bank of America",
            "uuid": "b8ba45cf-c5e5-4c78-a7ad-7945702b29ee",
            "email": "support@bankofamerica.com",
        },
        {
            "_id": ObjectId("65d6f663c1ac9dcd7587f82c"),
            "name": "CityGroup",
            "uuid": "1d1dfa36-5111-4b6f-af79-b482b3d885b6",
            "email": "support@citygroup.com",
        },
        .....{}]
Enter fullscreen mode Exit fullscreen mode

Objective

Our objective is to get the final result comprising of users, property purchase and bank information as given below. We will see how different stages in the query adds complexities.

Expected output

[
  {
    '_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
    'purchase_data': [
      {
        'purchase_date': '2021-10-29T23:13:58.177+00:00',
        'loan_detail': {
          'loan_amount': 784409,
          'interest_rate': 3.5,
          'bank_name': 'JPMorgan Chase'
        }
      },
      {
        'purchase_date': '2018-11-13T9:13:11.177+00:00',
        'loan_detail': {
          'loan_amount': 114488,
          'interest_rate': 3.5,
          'bank_name': 'Bank of America'
        }
      },
    'buyer': 'Angela Welch',
    'buyer_phone': '001-853-651-7290',
    'buyer_email': 'Angela@example.org'
]
Enter fullscreen mode Exit fullscreen mode

Aggregation query steps

To get the result, we need to use the aggregate query. The stages in the query are

  1. "$match" stage to filter information for one user.
  2. "$unwind" purchase_data to flatten the purchase information. If you don't unwind it, it will be impossible to map the bank info with correct purchase record. You can try it out.
  3. "$lookup" stage to join two collections
  4. "unwind" stage to flatten bank_info after lookup
  5. "$project" stage to extract the required fields. Here, we have extracted bank name as per our need. All looks good right ?. The aggregation query is given below.

Aggregation query

    pipeline = [
    {
        "$match": {
            "user_id": user_id
        }
    },
    {
        "$unwind": "$purchase_data"
    },
    {
        "$lookup": {
            "from": "banks",
            "localField": "purchase_data.loan_detail.bank_id",
            "foreignField": "uuid",
            "as": "bank_info"
        }
    },
    {
        "$unwind": "$bank_info"
    },
    {
        "$project":{
            "_id":-1,
            "buyer": {"$concat": ["$first_name", " ", "$last_name"]},
            "buyer_phone": "$phone",
            "buyer_email": "$email",
            "purchase_data": {
                "purchase_date": "$purchase_data.purchase_date",
                "loan_detail": {
                    "loan_amount": "$purchase_data.loan_detail.loan_amount",
                    "interest_rate": "$purchase_data.loan_detail.interest_rate",
                    "bank_name": "$bank_info.name"
                }
            },

        }
    }
]
Enter fullscreen mode Exit fullscreen mode

Obtained output

The result of the above query is given below. Unfortunately, we can see two variations compared to the expected output

  1. There are multiple purchase records for the same user. Purchase records of one user should be in single array.
  2. User's information is repeated
[
  {
    '_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
    'purchase_data': {
      'purchase_date': '',
      'loan_detail': {
        'loan_amount': 784409,
        'interest_rate': 3.5,
        'bank_name': 'Nic Asia Bank'
      }
    },
    'buyer': 'Angela Welch',
    'buyer_phone': '001-853-651-7290',
    'buyer_email': 'Angela@example.org'
  },
  {
    '_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
    'purchase_data': {
      'purchase_date': '',
      'loan_detail': {
        'loan_amount': 114488,
        'interest_rate': 3.5,
        'bank_name': 'Nepal Bank Limited'
      }
    },
    'buyer': 'Angela Welch',
    'buyer_phone': '001-853-651-7290',
    'buyer_email': 'Angela@example.org'
  }
]
Enter fullscreen mode Exit fullscreen mode

In order to keep all purchase records for specific user into one array, we need to introduce the grouping mechanism.

  1. Add "$group" stage and use "$push" operator to keep similar purchase records into one array.
    {
        "$group":{
            "_id": "$_id",
            "purchase_data": {
                "$push": "$purchase_data"
            }
    }
Enter fullscreen mode Exit fullscreen mode

However, group action adds another problem. We cannot retrieve other fields that are not included in the group stage like buyer name, phone and email. We can retrieve them but there is no right approach for it. Let's explore two approaches

Approach 1

Use the "$first" operator for all the fields you want to include. This looks like a simple approach.

    {
        "$group":{
            "_id": "$_id",
            "purchase_data": {
                "$push": "$purchase_data"
            },
            "buyer": {"$first": "$buyer"},
            "buyer_phone": {"$first": "$buyer_phone"},
            "buyer_email": {"$first": "$buyer_email"}
    }
Enter fullscreen mode Exit fullscreen mode

Approach 2

This approach looks fairly complex but is commonly used one.

  1. Retrieve the first occurrence of the input data using "$$ROOT".
  2. Remove purchase_data from doc field as it is already present in our result
  3. Merge the User's information with rest of the purchase results.
    {
        "$group":{
            "_id": "$_id",
            "purchase_data": {
                "$push": "$purchase_data"
            },
            "doc": { "$first": "$$ROOT" }
        }
    },
    {
        "$unset": ["doc.purchase_data"]
    },
      { "$replaceRoot": { 
          "newRoot":{
              "$mergeObjects": [
                    {
                        "_id": "$_id", 
                        "purchase_data": "$purchase_data"
                    }, "$doc" 
                ]
            }
        }
    }
Enter fullscreen mode Exit fullscreen mode

The final query using the second approach would look like below.

    pipeline = [
    {
        "$match": {
            "user_id": user_id
        }
    },
    {
        "$unwind": "$purchase_data"
    },
    {
        "$lookup": {
            "from": "banks",
            "localField": "purchase_data.loan_detail.bank_id",
            "foreignField": "uuid",
            "as": "bank_info"
        }
    },
    {
        "$unwind": "$bank_info"
    },
    {
        "$project": {
            "_id": 1,
            "buyer": {"$concat": ["$first_name", " ", "$last_name"]},
            "buyer_phone": "$phone",
            "buyer_email": "$email",
            "purchase_data": {
                "purchase_date": "$purchase_data.purchase_date",
                "loan_detail": {
                    "loan_amount": "$purchase_data.loan_detail.loan_amount",
                    "interest_rate": "$purchase_data.loan_detail.interest_rate",
                    "bank_name": "$bank_info.name"
                }
            },
        }
    },
    {
        "$group":{
            "_id": "$_id",
            "purchase_data": {
                "$push": "$purchase_data"
            },
            "doc": { "$first": "$$ROOT" }
        }
    },
    {
        "$unset": ["doc.purchase_data"]
    },
      { "$replaceRoot": { 
          "newRoot":{
              "$mergeObjects": [
                    {
                        "_id": "$_id", 
                        "purchase_data": "$purchase_data"
                    }, "$doc" 
                ]
            }
        }
    }
    ]
Enter fullscreen mode Exit fullscreen mode

My view on the above query

  1. I find unwind stage unnatural and difficult to visualize with the data specially when there are multiple of them. We can see there are two unwind stage in above query. If your data has multiple array types or there are multiple lookup stages then unwind will occur multiple times which will make query unnecessarily long and lose query's main sense.
  2. MongoDB documentation suggests keeping records of same entity into array and embed it. But such structure also comes with overheads. The above query required multiple unwind stages, then group it into one field, followed by complex replaceRoot, mergeObjects to get the results in the format we want. There are so many operations formatting the data.

Key Takeaways

  1. We saw how using Lookup queries with array elements will increase the query complexities to get the result in the format we want.
  2. MongoDb in their documentation presents Lookup queries can do many things like joining multiple collections, join correlated and uncorrelated sub-queries with "pipeline" . But it does not talk about the complexities, query readability, performance issues and other limitations.
  3. My suggestion would be to use Lookup if you data structure is simple enough and does not require extensive intermediate manipulation. If your data structure has nested structure then avoid Lookup queries even if you have to duplicate the data.

Top comments (0)