
Super Efficiently Reading Multiple Document Types from Cosmos DB (.NET SDK) - Part 2

About a week ago, off the back of a code review, I wrote an article about how to efficiently read multiple document types from Cosmos DB. It focused on using GetItemQueryIterator, which is probably the most typical use case.

With that said, the end result wasn't quite the one we ended up using.

I spent a little time looking into it further, specifically how we could get more raw results from Cosmos. Basically, the way to do that is by using GetItemQueryStreamIterator instead of GetItemQueryIterator. This skips the built-in serialisation through Newtonsoft.Json and lets us do whatever we want directly with the underlying ResponseMessage (and, in turn, the Stream).

Rather than just looking at how we can use that, though, I've upped the stakes a little on the benchmarks. First off, I've included a live benchmark, which disregards the time taken as it's reliant on my network connection. I've also increased the number of documents to 41, which reflects our production workload a bit more whilst still being pretty small in the grand scheme of Cosmos.

Let's first look at two items: the baseline from the Stack Overflow answers (that is, calling ToString() and then Deserialize<T>()), and the end result we finished with in the last article, which just used .ToObject<T>() directly on the JObject. For simplicity I've only run this on .NET 7, as the differences between frameworks mean nothing to the end result.

| Method | Mean | Error | StdDev | Allocated |
|--------|-----:|------:|-------:|----------:|
| Baseline_DeserializeFromToString | 893.7 us | 17.71 us | 34.55 us | 479.07 KB |
| Final_JObjectToObject | 366.7 us | 7.34 us | 12.26 us | 55.77 KB |

These differences are as I explained in the last article, and they obviously show when there are more documents involved: skipping the extra steps of going back and forth through JSON strings brings excellent benefits. Frankly, for the simplicity of the code, I wouldn't really recommend more than this for most scenarios.
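For reference, here's a minimal sketch of the two conversions being compared, with FirstType standing in for one of the document types from the first article (I'm assuming Newtonsoft's JsonConvert for the string round-trip here):

using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class ConversionSketch
{
    // Baseline from the Stack Overflow answers: serialise the JObject back out
    // to a string, then deserialise that string into the target type.
    public static FirstType ViaString(JObject obj) =>
        JsonConvert.DeserializeObject<FirstType>(obj.ToString());

    // The approach we ended up with in part 1: map the JObject directly,
    // no string round-trip.
    public static FirstType Direct(JObject obj) =>
        obj.ToObject<FirstType>();
}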

As I said above, to make this more than just a benchmarking post, let's take a look at the real world now and include the overhead from the Cosmos DB SDK. These are marked as _Live in the table below. There's also a Cosmos_Baseline thrown in that literally fetches the results but does no conversion.

| Method | Mean | Error | StdDev | Allocated |
|--------|-----:|------:|-------:|----------:|
| Baseline_DeserializeFromToString | 893.7 us | 17.71 us | 34.55 us | 479.07 KB |
| Baseline_Live_DeserializeFromToString | 16,451.4 us | 550.98 us | 1,517.56 us | 1198.91 KB |
| Cosmos_Baseline * | 20.24 ms | 3.399 ms | 9.418 ms | 742.98 KB |
| Final_JObjectToObject | 366.7 us | 7.34 us | 12.26 us | 55.77 KB |
| Final_Live_JObjectToObject | 16,204.7 us | 422.05 us | 1,224.45 us | 775.51 KB |

Obviously with these we can't really pay attention to the time taken, as it includes a real connection and streaming the data back from Cosmos DB. However, we do note that against a baseline of 743KB the final version only produces an additional 33KB of usage, which I believe is the size of the 41 documents we have in the database. Sweet!

As a recap, these tests use container.GetItemQueryIterator<JObject>; here's the code for Final_Live_JObjectToObject:

FirstType t1Out;
SecondType t2Out;
using (var feedIterator = container.GetItemQueryIterator<JObject>(queryDefinition, null, new QueryRequestOptions { PartitionKey = PartitionKey }))
{
    while (feedIterator.HasMoreResults)
    {
        var responses = await feedIterator.ReadNextAsync();
        foreach(var obj in responses)
        {
            var type = (string)obj["type"];
            if (type == FirstType.Name)
            {
                t1Out = obj.ToObject<FirstType>();
            }
            else if (type == SecondType.Name)
            {
                t2Out = obj.ToObject<SecondType>();
            }
        }
    }
}

Ridiculously simple and gets the job done, but we should be able to do better.

Enter GetItemQueryStreamIterator. This method differs from the non-stream version in 2 ways:

  1. It skips the deserialization of the individual documents and gives us the raw result stream from the request
  2. It can only be used on single-partition queries (which, please, make sure you're doing as often as possible!)

Let's re-run a quick baseline for the Cosmos SDK to compare:

| Method | Mean | Error | StdDev | Allocated |
|--------|-----:|------:|-------:|----------:|
| Cosmos_Baseline | 20.24 ms | 3.399 ms | 9.418 ms | 742.98 KB |
| Cosmos_Baseline_Stream | 15.42 ms | 0.364 ms | 1.028 ms | 233.63 KB |

Given that there's no deserialization happening with Newtonsoft.Json, the memory allocated is much lower. I'm not going to spend any time investigating where that 233KB came from; let's just take it as gospel and move on to how we can now deal with it.
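For context, the stream baseline is essentially just draining the pages without ever touching the content. I'm assuming the benchmark body looks roughly like this (same container, queryDefinition and PartitionKey as the other snippets):

using (var feedIterator = container.GetItemQueryStreamIterator(queryDefinition, null, new QueryRequestOptions { PartitionKey = PartitionKey }))
{
    while (feedIterator.HasMoreResults)
    {
        using (ResponseMessage response = await feedIterator.ReadNextAsync())
        {
            // No deserialisation at all - the raw page is fetched and disposed.
        }
    }
}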

BUT WAIT! I can hear people screaming: why not just use System.Text.Json directly with Cosmos?

Unsurprisingly enough, that's essentially the answer.

There's a very simple way to do this: within the Cosmos SDK repo on GitHub there's a CosmosSystemTextJsonSerializer.cs. Drag that into your project and wire it up however you like (you always need to provide the options):

new CosmosClient(
    ...,
    new CosmosClientOptions
    {
        Serializer = new CosmosSystemTextJsonSerializer(new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        })
    });
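For reference, that serializer is tiny: it's a CosmosSerializer that routes everything through System.Text.Json. Here's a condensed sketch of the idea (the actual file in the repo is a little more thorough, so prefer that one):

using System.IO;
using System.Text.Json;
using Microsoft.Azure.Cosmos;

public class CosmosSystemTextJsonSerializer : CosmosSerializer
{
    private readonly JsonSerializerOptions _options;

    public CosmosSystemTextJsonSerializer(JsonSerializerOptions options) => _options = options;

    public override T FromStream<T>(Stream stream)
    {
        using (stream)
        {
            // Pass raw streams straight through (used by the stream APIs).
            if (typeof(Stream).IsAssignableFrom(typeof(T)))
            {
                return (T)(object)stream;
            }

            return JsonSerializer.Deserialize<T>(stream, _options);
        }
    }

    public override Stream ToStream<T>(T input)
    {
        var payload = new MemoryStream();
        JsonSerializer.Serialize(payload, input, _options);
        payload.Position = 0;
        return payload;
    }
}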

Done. Now we've basically cut the memory usage in half; let's take a look:

| Method | Mean | Error | StdDev | Allocated |
|--------|-----:|------:|-------:|----------:|
| Cosmos_Baseline | 20.24 ms | 3.399 ms | 9.418 ms | 742.98 KB |
| Cosmos_Baseline_Stream | 15.42 ms | 0.364 ms | 1.028 ms | 233.63 KB |
| Cosmos_Baseline_STM | 15.82 ms | 0.374 ms | 1.074 ms | 421.88 KB |
| Live_JsonNode_STM | 15.51 ms | 0.343 ms | 0.957 ms | 690.28 KB |

To do this, stop using JObject and switch to JsonNode (basically the System.Text.Json equivalent); ToObject<T> becomes Deserialize<T>. The code now looks like this:

using (var feedIterator = container.GetItemQueryIterator<JsonNode>(queryDefinition, null, new QueryRequestOptions { PartitionKey = PartitionKey }))
{
    while (feedIterator.HasMoreResults)
    {
        var responses = await feedIterator.ReadNextAsync();
        foreach(var obj in responses)
        {
            var type = (string)obj["type"];
            if (type == FirstType.Name)
            {
                t1Out = obj.Deserialize<FirstType>();
            }
            else if (type == SecondType.Name)
            {
                t2Out = obj.Deserialize<SecondType>();
            }
        }
    }
}

Usually I'd say we're done, but let's squeeze out every little KB of memory that we can (or, more accurately, that I can be bothered to).

I'm not going to bang on about what's different as it should be fairly obvious by now. Using the stream, we can simply let System.Text.Json parse it directly and avoid a little more memory being allocated on each iteration.

That looks like this:

using (var feedIterator = container.GetItemQueryStreamIterator(queryDefinition, null, new QueryRequestOptions { PartitionKey = PartitionKey }))
{
    while (feedIterator.HasMoreResults)
    {
        using (ResponseMessage response = await feedIterator.ReadNextAsync())
        {
            var doc = JsonNode.Parse(response.Content);
            foreach (var obj in doc["Documents"].AsArray())
            {
                var type = (string)obj["type"];
                if (type == "core")
                {
                    t1Out = obj.Deserialize<FirstType>();
                }
                else if (type == "pricing_rule")
                {
                    t2Out = obj.Deserialize<SecondType>();
                }
            }
        }
    }
}

Note the switch to GetItemQueryStreamIterator; we now call JsonNode.Parse directly on the response stream. What we're actually getting back is a Cosmos result set, with the results themselves inside the Documents property.
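For reference, the raw page body is the standard Cosmos query response envelope; trimmed right down, it looks something like this (values purely illustrative):

{
  "_rid": "someRid",
  "Documents": [
    { "id": "1", "type": "core" },
    { "id": "2", "type": "pricing_rule" }
  ],
  "_count": 41
}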

Here's the end result with another 60KB gone:

| Method | Mean | Error | StdDev | Allocated |
|--------|-----:|------:|-------:|----------:|
| Live_JsonNode_STM | 15.51 ms | 0.343 ms | 0.957 ms | 690.28 KB |
| Live_JsonNode_STM_Parsed | 16.40 ms | 0.421 ms | 1.196 ms | 631.01 KB |

That's it: possibly the most efficient way to read multiple document types from a single Cosmos container, achieved by changing 2 lines from the first article. For most implementations you shouldn't really have to bother with this at all, but if you're calling it over and over and over and over and over again, your servers may thank you for a little more diligence. The differences become even more apparent as more results come back.

Before I go, I wanted to mention that we did try an intermediate object to map this, so we could just do .Documents:

public class DiscriminatorHolder
{
    [JsonPropertyName("Documents")]
    public List<JsonObject> Documents { get; set; }
}

That added another 6KB, so it wasn't included, but it may work for you. It's still better than letting the SDK do it itself!
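For completeness, usage with that holder would have looked roughly like this (a sketch, assuming the same feedIterator as above and the camelCase JsonSerializerOptions passed to the client, here called options):

using (ResponseMessage response = await feedIterator.ReadNextAsync())
{
    // Deserialize the whole page envelope instead of parsing it as a JsonNode.
    var page = JsonSerializer.Deserialize<DiscriminatorHolder>(response.Content, options);
    foreach (var obj in page.Documents)
    {
        // ...same type switch over obj["type"] as before
    }
}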

In any case, we're down from >1MB (from relying too much on Stack Overflow) to just ~600KB per call. That's a win.

Happy Cosmosing! 🪐
