Rudi Visser

Posted on • Updated on • Originally published at vissers.page

Efficiently Reading Multiple Document Types from Cosmos DB (.NET SDK) - From Review

This is a cross-post from our new blog. Any future development-related articles will also be cross-posted here for the dev community 💪

Containers, Partition Keys, Point Reads, Queries, Request Units. That's the life of Cosmos DB and it's pretty great until you need to know something.

If you do end up wanting to know something specific about Cosmos DB then like every other developer you're probably going to Google it. Then you'll find yourself on Stack Overflow reading misguided questions and receiving questionable answers. Then you'll regret not being able to easily navigate Microsoft's mostly-excellent, but sometimes-lacking, documentation.

Here I'll take you through a few key things to know about Cosmos DB and depending on the Google Overlords doing their job, hopefully answer a few questions that people are searching for.

It's pretty long so here's a TLDR: When you're using GetItemQueryIterator to read and parse documents in a single Container with multiple types, don't use dynamic or JsonConvert.DeserializeObject(). It's a JObject and you can simply use ToObject&lt;T&gt;()!


A little backstory first as to what prompted this article. At [insert our next venture name] we use Cosmos for parts of our architecture that heavily benefit from NoSQL and efficient operations, including but not limited to auditing, configuration, highly available internal services, event streams and basic key-value storage requirements.

Late last night I was reviewing a critical path in one of our services that uses Cosmos as a backing store and noticed that it had actually undergone several fundamental changes to how it worked in the last couple of weeks. This code was storing and retrieving multiple different types of documents in a single Container (which is a perfectly valid, even encouraged, practice).

The code was responsible for reading different document types from a Container in Cosmos and correctly getting them to their actual types.

Here's what the different versions were:

  1. Only get a single item at a time, with the type defined up front. Simple, effective, fast, strongly typed. This used ReadItemAsync<T> from the Cosmos SDK.
  2. Get all of the documents and return them to the API as object to be serialized back to JSON. This uses a Query Iterator with a query like SELECT * FROM c, with a partition key. So do all of the next ones.
  3. Use dynamic and Newtonsoft's JsonConvert to deserialize the result of .ToString()
  4. Use dynamic and System.Text.Json's JsonSerializer to deserialize the result of .ToString()
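
Version 1, for reference, looks roughly like this. A minimal sketch only: container, id and partitionKey are assumed to exist in the surrounding code, and MyDocument is a hypothetical POCO for the stored document.

```csharp
// Version 1: a strongly-typed point read. Hypothetical names throughout:
// "container" is a Microsoft.Azure.Cosmos.Container, "id" the document id,
// "partitionKey" a PartitionKey value, and MyDocument a POCO for the document.
ItemResponse<MyDocument> response = await container.ReadItemAsync<MyDocument>(id, partitionKey);
MyDocument doc = response.Resource; // already deserialized to the target type
```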

They're all ~acceptable~ ways of doing this yet I wasn't happy with any of them.

I want to start by saying that I'm a huge fan of using dynamic in only two scenarios. The first is to wind up colleagues and exclaim "DYNAMIC ALL THE WAY" just for fun. The second is when you truly don't have a clue what's going on, something could literally be anything and you need to be flexible; generally when communicating with third parties or untyped languages, where we don't really care too much.

When I saw dynamic in actual production code that we're running and in a path that will be called many times per second I was a little confused and dug deeper.

Let's understand where this came from. Here are two notable Stack Overflow questions doing what we're doing:

  • The first includes a sample from Mark Brown, someone I consider to be the CosmosDB God. From what I've seen he doesn't generally answer with code, focusing more on theory and best practices, but this time he included some code, as well as stating that using dynamic in this scenario is a "typical pattern" to use.
  • The second has answers from Iahsrah and Nick Chapsas (a pretty cool YouTube developer) which both also reference dynamic as the way to go.

There are more of these that all basically say the same thing; these are just the first two I found whilst writing.

These answers both do a curious thing: take the dynamic object, check the type property on the document, and then JsonConvert it to the type we want. Fair enough.

But is it? Let's take a look.

Here I've created a benchmark to test out different methods. There are two documents to test with: one is 1.8KB and the other 2.1KB. Negligible. This is also tested on both .NET 6 and 7, as we use both in production depending on the environment. I ran these on an M1 Pro with 16GB RAM and a tonne of other stuff open, so YMMV.

Note also that Newtonsoft.Json is the default (de)serializer for the Cosmos SDK v3. v4, which seems like it's never coming (it hasn't been touched for years), does use System.Text.Json, but for now we're stuck.

Here's the benchmark code.

Type1 t1Out;
Type2 t2Out;
foreach (var obj in DynamicObjects)
{
    if (obj.type == "type_1")
    {
        t1Out = JsonConvert.DeserializeObject<Type1>(obj.ToString());
    }
    else if (obj.type == "type_2")
    {
        t2Out = JsonConvert.DeserializeObject<Type2>(obj.ToString());
    }
}
| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Dynamic_ToString_Deserialize | .NET 6.0 | .NET 6.0 | 33.63 us | 0.087 us | 0.077 us | 21.3623 | - | 43.7 KB |
| Dynamic_ToString_Deserialize | .NET 7.0 | .NET 7.0 | 27.36 us | 0.381 us | 0.356 us | 7.1106 | 0.2441 | 43.7 KB |

Alright. 43.7KB memory, around 30 microseconds.

Let's see how we improved it a little (this is essentially the latest version I had reviewed) by using System.Text.Json to deserialize it.

New code:

Type1 t1Out;
Type2 t2Out;
foreach (var obj in DynamicObjects)
{
    if (obj.type == "type_1")
    {
        t1Out = JsonSerializer.Deserialize<Type1>(obj.ToString());
    }
    else if (obj.type == "type_2")
    {
        t2Out = JsonSerializer.Deserialize<Type2>(obj.ToString());
    }
}

and the results:

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Dynamic_ToString_SystemTextJson_Deserialize | .NET 6.0 | .NET 6.0 | 14.96 us | 0.038 us | 0.033 us | 13.4583 | - | 27.48 KB |
| Dynamic_ToString_SystemTextJson_Deserialize | .NET 7.0 | .NET 7.0 | 13.53 us | 0.130 us | 0.109 us | 4.4708 | 0.1373 | 27.48 KB |

Better in every way! It's faster and allocates 42% less memory. Winning! We've beaten the Stack Overflow answers already.

So that is the code that was left and committed, and would have made it out into production if I wasn't a pedant. But I am.

We know that Cosmos DB itself, as well as the SDK, doesn't actually care what types we're storing up there as long as they meet some criteria (having a partition key is the only real criterion; an ID will be generated, and the partition key could even be that ID!), so the object itself is not actually dynamic, it's something.

> obj.GetType()
JObject

Yup. Something. In V2 of the SDK, I believe this was Document as it's referenced in some other Stack Overflow answers though I haven't personally seen it used so can't comment.

Let's work with that instead of using dynamic, which is supposedly the root of all evil in the performance world. Instead of DynamicObjects (which is what it says on the tin), Objects is the same thing but as a JObject. Both were created using JObject.Parse (which is what Cosmos does internally); we're just typed now.

This gives us a little more type safety. I say a little more because we're still going to just assume obj["type"] is there and it's a string. Which is pretty safe.

Type1 t1Out;
Type2 t2Out;
foreach (var obj in Objects)
{
    var type = (string)obj["type"];
    if (type == "type_1")
    {
        t1Out = JsonSerializer.Deserialize<Type1>(obj.ToString());
        // And the equivalent JsonConvert in another benchmark
    }
    else if (type == "type_2")
    {
        t2Out = JsonSerializer.Deserialize<Type2>(obj.ToString());
    }
}

(and the JsonConvert equivalent for good comparative measure).

| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| JObject_Cast_ToString_Deserialize | .NET 6.0 | .NET 6.0 | 34.14 us | 0.288 us | 0.240 us | 21.3623 | - | 43.63 KB |
| JObject_Cast_ToString_SystemTextJson_Deserialize | .NET 6.0 | .NET 6.0 | 15.33 us | 0.304 us | 0.338 us | 13.4277 | - | 27.41 KB |
| JObject_Cast_ToString_Deserialize | .NET 7.0 | .NET 7.0 | 27.39 us | 0.424 us | 0.397 us | 7.1106 | 0.2136 | 43.63 KB |
| JObject_Cast_ToString_SystemTextJson_Deserialize | .NET 7.0 | .NET 7.0 | 13.02 us | 0.254 us | 0.250 us | 4.4708 | 0.1221 | 27.41 KB |

Excellent, we've shaved off a few bytes of memory allocation. Done! We have type-safety(ish, kinda, sorta) from using JObject and we shaved off some time.

I'm joking, obviously. Let's look at what a JObject actually is.

Represents a JSON object.

Helpful.

Anyway, to save a rant: it's basically a parsed object representation of the document that was stored.

Does it actually hold the original document JSON though which ToString() is just returning? Of course not, why would it? Let's see where our allocations are actually coming from.

obj.ToString();
| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| JObject_ToString | .NET 6.0 | .NET 6.0 | 9.795 us | 0.0190 us | 0.0159 us | 13.3209 | - | 27.23 KB |
| JObject_ToString | .NET 7.0 | .NET 7.0 | 8.681 us | 0.1725 us | 0.2634 us | 4.4403 | 0.0916 | 27.23 KB |

When we're calling JObject.ToString() we are reserializing the object back to JSON. In effect, this is what the answers to the Stack Overflow questions are doing:

  1. Let Cosmos load the document blob from the database itself
  2. Let Newtonsoft.Json Deserialize it from JSON to a JObject
  3. Using .ToString(), serialize it back from a JObject to JSON
  4. Deserialize it (using Newtonsoft or in our case, System.Text.Json) back to our type

We got the same JSON twice in this case, and parsed it twice too resulting in all the allocations.

We also learned that System.Text.Json is actually super awesome, adding virtually no extra allocations above the .ToString() itself. Of course Newtonsoft.Json basically doubles it, but we knew it was nowhere near competitive with System.Text.Json for simple document (de)serialization anyway.

With that said, to use System.Text.Json we still have to call this .ToString() method. There has to be a better way!

Well there is, since a JObject is also a JToken, it means we can access ToObject<T>. We'll live in the Newtonsoft.Json world still, but we're already there anyway.

So here we go:

Type1 t1Out;
Type2 t2Out;
foreach (var obj in Objects)
{
    var type = obj["type"].ToString();
    // or type = (string)obj["type"]
    // or type = obj["type"].Value<string>()
    // it doesn't matter
    if (type == "type_1")
    {
        t1Out = obj.ToObject<Type1>();
    }
    else if (type == "type_2")
    {
        t2Out = obj.ToObject<Type2>();
    }
}
| Method | Job | Runtime | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| JObject_Cast_ToObject | .NET 6.0 | .NET 6.0 | 16.36 us | 0.046 us | 0.043 us | 1.6174 | 3.32 KB |
| JObject_ToString_ToObject | .NET 6.0 | .NET 6.0 | 16.22 us | 0.082 us | 0.072 us | 1.6174 | 3.32 KB |
| JObject_Value_ToObject | .NET 6.0 | .NET 6.0 | 16.42 us | 0.051 us | 0.048 us | 1.6174 | 3.32 KB |
| Dynamic_ToObject | .NET 6.0 | .NET 6.0 | 16.80 us | 0.019 us | 0.017 us | 1.6479 | 3.39 KB |
| JObject_Cast_ToObject | .NET 7.0 | .NET 7.0 | 12.52 us | 0.030 us | 0.027 us | 0.5341 | 3.32 KB |
| JObject_ToString_ToObject | .NET 7.0 | .NET 7.0 | 12.03 us | 0.018 us | 0.015 us | 0.5341 | 3.32 KB |
| JObject_Value_ToObject | .NET 7.0 | .NET 7.0 | 11.86 us | 0.016 us | 0.015 us | 0.5341 | 3.32 KB |
| Dynamic_ToObject | .NET 7.0 | .NET 7.0 | 12.33 us | 0.034 us | 0.031 us | 0.5493 | 3.39 KB |

Consistent results, no unnecessary (de)serializing and it's just as fast as (well, slightly faster than, of course) using System.Text.Json to parse it back. The times are all so close to one another that it doesn't matter much, but the memory allocations are significant enough to make a difference when we're calling this code over and over again in quick succession.

For good measure I also threw the dynamic benchmark back in, just calling ToObject<T>(), with again negligible differences. But why use it when we know what it is?

So there we have it. Rather than calling .ToString() (serialize) and then deserializing the result, we let much less of the work be done and simply map it to our target type.

Also dynamic is not actually that evil in this case (though it's still more evil than being explicit).

There's actually an even better way to do this (that we ended up implementing), but I'll save that for a quick follow up in a couple of days once I've had chance to do real benchmarks.


A small bonus for any of you that are still here and eagle-eyed enough to realise that version 1 did a Point Read in Cosmos per type rather than using the query to get back all of the documents. By using Point Reads we would have the strong type from the start and arguably slightly simpler code.

Cosmos DB bills on Request Units (or RUs). Roughly, point reading 1 x 1KB document is 1 RU; 1 x 100KB document is 100 RU. Queries activate the Query Engine on the Cosmos side and have a base cost of around 2.8 RU.

Let's say we had 4 document types that we wanted to read, each at roughly 2KB; we can assume that Point Reads would cost us 2 RU x 4 = 8 RU. Point Reads are super quick, generally returning in around 8ms, which is astonishing. With that we're using 8 RU and spending around 32ms loading these 4 documents in from Cosmos, inclusive of parsing.

However, we're only loading 4 documents, and an important thing to know is that, strangely, query-based RUs don't scale with document size. This seems bad for Azure's bottom line, but great for us!

When running SELECT * FROM c and returning those 4 documents, the query costs 2.93 RU and takes 22ms. We're slashing costs in half, and time by a third.
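
You don't have to take these numbers on faith, either: every SDK response reports the charge it incurred via its RequestCharge property. A minimal sketch of measuring both approaches, with container, id, partitionKey and the hypothetical MyDocument type assumed as before:

```csharp
// Point read: the charge is reported on the ItemResponse.
// ("container", "id", "partitionKey" and MyDocument are assumed names.)
ItemResponse<MyDocument> read = await container.ReadItemAsync<MyDocument>(id, partitionKey);
Console.WriteLine($"Point read cost: {read.RequestCharge} RU");

// Query: each page (FeedResponse) carries its own charge, so sum them up.
var iterator = container.GetItemQueryIterator<JObject>(
    new QueryDefinition("SELECT * FROM c"),
    requestOptions: new QueryRequestOptions { PartitionKey = partitionKey });

double totalCharge = 0;
while (iterator.HasMoreResults)
{
    FeedResponse<JObject> page = await iterator.ReadNextAsync();
    totalCharge += page.RequestCharge;
}
Console.WriteLine($"Query cost: {totalCharge} RU");
```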

If you have a similar setup to this, give it a try if you're not already querying. Just make sure you specify the Partition Key for the query, as you don't want any cross-partition queries going on.

For your reference it'll look a little something like this:

var query = new QueryDefinition("SELECT * FROM c");
var iterator = container.GetItemQueryIterator<JObject>(query, requestOptions: new QueryRequestOptions
{
    PartitionKey = partitionKey
});
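
From there it's just a matter of draining the iterator and dispatching on the type discriminator with ToObject<T>, as benchmarked earlier. A sketch, assuming the iterator from the snippet above and the Type1/Type2 types from the benchmark code:

```csharp
// Drain the iterator page by page; "iterator", Type1 and Type2 come from
// the earlier snippets. No ToString() round-trip, no re-parse.
var type1Docs = new List<Type1>();
var type2Docs = new List<Type2>();

while (iterator.HasMoreResults)
{
    foreach (JObject obj in await iterator.ReadNextAsync())
    {
        switch ((string)obj["type"])
        {
            case "type_1":
                type1Docs.Add(obj.ToObject<Type1>());
                break;
            case "type_2":
                type2Docs.Add(obj.ToObject<Type2>());
                break;
        }
    }
}
```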

Happy Cosmosing and stay tuned for the even better way! 🪐
