This article discusses about the challenges that arise when using NoSQL or specifically MongoDB for application development. This post is not intended to present negative aspects of MongoDB but rather point out importance of considering challenges and scenarios that would come up when working with schema less database. This is the first post of the series of articles related to the similar topics and I will continue updating the posts as I add up future contents. Source code for testing this piece is available in the github
For now, I am planning to include topics mentioned below in this series.
- Advantages and problems when working with denormalized data in MongoDB
- Complexities in aggregation queries when using Lookup.
- Indexing efficiency in Lookup queries
MongoDB
In short - "MongoDB documents are very flexible in a way that they don't need to have same set of fields. It allows developers to focus on application logic rather than worrying about how to split the data across different rigid tables."
Use Case
In this post, we will talk about complex Aggregation query stages and expressions when using Lookup queries. Lookup queries when used with array or embedded documents, things are not straight forward. To demonstrate this, we have used two collections Users and Banks.
Users data Model
MongoDB has a rule. Link
The rule of thumb when you use MongoDB is data that is accessed together should be stored together
Complying with the rule, we have kept users info, property purchase data and loan data in one collection. Yet, in some scenarios, we have to reference information from another collection. The obvious reason is to avoid duplicate data across multiple documents. In our example, we have used bank_id to reference bank information from Banks collection.
Users data
user_data = [{
"_id": ObjectId("65d6f663c1ac9dcd7587f82b"),
"user_id": "cbfc5b52-a43d-4dbb-b20f-bfcb24bfc19f",
"first_name": "Angela",
"last_name": "Wench"
"email": "Angela@example.org",
"purchase_data": [{
"purchase_date": "2021-10-29T23:13:58.177+00:00",
"loan_detail": {
"bank_id": "90a56685-4dc6-4529-93dd-db2cae97f95f",
"loan_amount": 100000,
"loan_term": 10,
"interest_rate": 3.5
}, {
"purchase_date": "2018-11-13T9:13:11.177+00:00",
"loan_detail": {
"bank_id": "b8ba45cf-c5e5-4c78-a7ad-7945702b29ee",
"loan_amount": 450000,
"loan_term": 7,
"interest_rate": 3.5
}]
},
....{}]
Bank data
Banks collections stores general banking information.
banks = [{
"_id": ObjectId("65d2e38c445bea758eb5d16b"),
"name": "JPMorgan Chase",
"uuid": "90a56685-4dc6-4529-93dd-db2cae97f95f",
"email": "support@jpmorgan.com",
},
{
"_id": ObjectId("65d2e38c445bea758eb5d16c"),
"name": "Bank of America",
"uuid": "b8ba45cf-c5e5-4c78-a7ad-7945702b29ee",
"email": "support@bankofamerica.com",
},
{
"_id": ObjectId("65d6f663c1ac9dcd7587f82c"),
"name": "CityGroup",
"uuid": "1d1dfa36-5111-4b6f-af79-b482b3d885b6",
"email": "support@citygroup.com",
},
.....{}]
Objective
Our objective is to get the final result comprising of users, property purchase and bank information as given below. We will see how different stages in the query adds complexities.
Expected output
[
{
'_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
'purchase_data': [
{
'purchase_date': '2021-10-29T23:13:58.177+00:00',
'loan_detail': {
'loan_amount': 784409,
'interest_rate': 3.5,
'bank_name': 'JPMorgan Chase'
}
},
{
'purchase_date': '2018-11-13T9:13:11.177+00:00',
'loan_detail': {
'loan_amount': 114488,
'interest_rate': 3.5,
'bank_name': 'Bank of America'
}
},
'buyer': 'Angela Welch',
'buyer_phone': '001-853-651-7290',
'buyer_email': 'Angela@example.org'
]
Aggregation query steps
To get the result, we need to use the aggregate query. The stages in the query are
- "$match" stage to filter information for one user.
- "$unwind" purchase_data to flatten the purchase information. If you don't unwind it, it will be impossible to map the bank info with correct purchase record. You can try it out.
- "$lookup" stage to join two collections
- "unwind" stage to flatten bank_info after lookup
- "$project" stage to extract the required fields. Here, we have extracted bank name as per our need. All looks good right ?. The aggregation query is given below.
Aggregation query
pipeline = [
{
"$match": {
"user_id": user_id
}
},
{
"$unwind": "$purchase_data"
},
{
"$lookup": {
"from": "banks",
"localField": "purchase_data.loan_detail.bank_id",
"foreignField": "uuid",
"as": "bank_info"
}
},
{
"$unwind": "$bank_info"
},
{
"$project":{
"_id":-1,
"buyer": {"$concat": ["$first_name", " ", "$last_name"]},
"buyer_phone": "$phone",
"buyer_email": "$email",
"purchase_data": {
"purchase_date": "$purchase_data.purchase_date",
"loan_detail": {
"loan_amount": "$purchase_data.loan_detail.loan_amount",
"interest_rate": "$purchase_data.loan_detail.interest_rate",
"bank_name": "$bank_info.name"
}
},
}
}
]
Obtained output
The result of the above query is given below. Unfortunately, we can see two variations compared to the expected output
- There are multiple purchase records for the same user. Purchase records of one user should be in single array.
- User's information is repeated
[
{
'_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
'purchase_data': {
'purchase_date': '',
'loan_detail': {
'loan_amount': 784409,
'interest_rate': 3.5,
'bank_name': 'Nic Asia Bank'
}
},
'buyer': 'Angela Welch',
'buyer_phone': '001-853-651-7290',
'buyer_email': 'Angela@example.org'
},
{
'_id': ObjectId('65d6f663c1ac9dcd7587f82b'),
'purchase_data': {
'purchase_date': '',
'loan_detail': {
'loan_amount': 114488,
'interest_rate': 3.5,
'bank_name': 'Nepal Bank Limited'
}
},
'buyer': 'Angela Welch',
'buyer_phone': '001-853-651-7290',
'buyer_email': 'Angela@example.org'
}
]
In order to keep all purchase records for specific user into one array, we need to introduce the grouping mechanism.
- Add "$group" stage and use "$push" operator to keep similar purchase records into one array.
{
"$group":{
"_id": "$_id",
"purchase_data": {
"$push": "$purchase_data"
}
}
However, group action adds another problem. We cannot retrieve other fields that are not included in the group stage like buyer name, phone and email. We can retrieve them but there is no right approach for it. Let's explore two approaches
Approach 1
Use the "$first" operator for all the fields you want to include. This looks like a simple approach.
{
"$group":{
"_id": "$_id",
"purchase_data": {
"$push": "$purchase_data"
},
"buyer": {"$first": "$buyer"},
"buyer_phone": {"$first": "$buyer_phone"},
"buyer_email": {"$first": "$buyer_email"}
}
Approach 2
This approach looks fairly complex but is commonly used one.
- Retrieve the first occurrence of the input data using "$$ROOT".
- Remove purchase_data from doc field as it is already present in our result
- Merge the User's information with rest of the purchase results.
{
"$group":{
"_id": "$_id",
"purchase_data": {
"$push": "$purchase_data"
},
"doc": { "$first": "$$ROOT" }
}
},
{
"$unset": ["doc.purchase_data"]
},
{ "$replaceRoot": {
"newRoot":{
"$mergeObjects": [
{
"_id": "$_id",
"purchase_data": "$purchase_data"
}, "$doc"
]
}
}
}
The final query using the second approach would look like below.
pipeline = [
{
"$match": {
"user_id": user_id
}
},
{
"$unwind": "$purchase_data"
},
{
"$lookup": {
"from": "banks",
"localField": "purchase_data.loan_detail.bank_id",
"foreignField": "uuid",
"as": "bank_info"
}
},
{
"$unwind": "$bank_info"
},
{
"$project": {
"_id": 1,
"buyer": {"$concat": ["$first_name", " ", "$last_name"]},
"buyer_phone": "$phone",
"buyer_email": "$email",
"purchase_data": {
"purchase_date": "$purchase_data.purchase_date",
"loan_detail": {
"loan_amount": "$purchase_data.loan_detail.loan_amount",
"interest_rate": "$purchase_data.loan_detail.interest_rate",
"bank_name": "$bank_info.name"
}
},
}
},
{
"$group":{
"_id": "$_id",
"purchase_data": {
"$push": "$purchase_data"
},
"doc": { "$first": "$$ROOT" }
}
},
{
"$unset": ["doc.purchase_data"]
},
{ "$replaceRoot": {
"newRoot":{
"$mergeObjects": [
{
"_id": "$_id",
"purchase_data": "$purchase_data"
}, "$doc"
]
}
}
}
]
My view on the above query
- I find unwind stage unnatural and difficult to visualize with the data specially when there are multiple of them. We can see there are two unwind stage in above query. If your data has multiple array types or there are multiple lookup stages then unwind will occur multiple times which will make query unnecessarily long and lose query's main sense.
- MongoDB documentation suggests keeping records of same entity into array and embed it. But such structure also comes with overheads. The above query required multiple unwind stages, then group it into one field, followed by complex replaceRoot, mergeObjects to get the results in the format we want. There are so many operations formatting the data.
Key Takeaways
- We saw how using Lookup queries with array elements will increase the query complexities to get the result in the format we want.
- MongoDb in their documentation presents Lookup queries can do many things like joining multiple collections, join correlated and uncorrelated sub-queries with "pipeline" . But it does not talk about the complexities, query readability, performance issues and other limitations.
- My suggestion would be to use Lookup if you data structure is simple enough and does not require extensive intermediate manipulation. If your data structure has nested structure then avoid Lookup queries even if you have to duplicate the data.
Top comments (0)