Discussion on: Understanding Partitioning in Azure Cosmos DB

William Bailey

Hi Will, nice article. I am interested in getting your take on a problem we have.

While choosing a partition key with sufficient cardinality ensures the documents can be distributed evenly across all available partitions, it only ensures the same for RU consumption if you can assume that all documents are equally likely to be accessed. In our case, we need to keep all the data online for a long time (years), but the probability of someone requesting any given document decreases over time. What we need is not just to have ALL the documents evenly distributed, but to make sure all of the MOST RECENT documents are evenly distributed. It is not enough that we have roughly the same number of documents in every partition. If the documents in many of the partitions are older and infrequently accessed, while most of the most recent documents tend to congregate in one or a few partitions, we start seeing throttling.

The only way we have to counter this today is to increase the RUs on the collection. Since the RUs are divided evenly across all the partitions (including the ones not seeing much traffic), we end up wasting resources (and money). To get an extra 100 RUs on the partition that is throttling, we have to increase the collection RUs by 100 * the number of partitions, further over-provisioning the partitions that are not seeing much activity. For large collections with many partitions, this is a lot of money.
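To make the cost implication concrete, here is a back-of-envelope sketch in Python with hypothetical numbers (the partition count and RU figures are made up), based on the fact that provisioned throughput is split evenly across physical partitions:

```python
# Back-of-envelope illustration with hypothetical numbers: since provisioned
# RU/s on a collection is split evenly across its physical partitions,
# relieving one hot partition means scaling the whole collection.

physical_partitions = 20          # hypothetical partition count
current_collection_rus = 20_000   # hypothetical provisioned RU/s
extra_rus_needed = 100            # extra RU/s needed on the one hot partition

per_partition_rus = current_collection_rus / physical_partitions

# To add 100 RU/s to one partition, every partition must get 100 more RU/s.
new_collection_rus = current_collection_rus + extra_rus_needed * physical_partitions

# RU/s added to partitions that are not throttling at all.
wasted_rus = extra_rus_needed * (physical_partitions - 1)

print(f"Each partition currently gets {per_partition_rus:.0f} RU/s")
print(f"Collection must go from {current_collection_rus} to {new_collection_rus} RU/s")
print(f"Of the {extra_rus_needed * physical_partitions} added RU/s, "
      f"{wasted_rus} land on partitions that are not throttling")
```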

We could imagine some solutions that might have worked had Cosmos DB used range partitioning on the partition key. We could leverage the time aspect inherent in the partition key itself (e.g. the one-up nature of an order id) to help balance the distribution. But with hash partitioning, we have been unable to come up with any strategy that gives us confidence these hot partitions won't suddenly appear, because any time information inherent in the partition key is effectively erased by the hash operation (hash values of consecutive keys are not necessarily consecutive).
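To illustrate that last point, here is a small Python sketch showing how hashing scatters consecutive key values. Cosmos DB's internal partition-key hash is not exposed, so MD5 and the bucket count stand in purely for illustration:

```python
import hashlib

# Illustrative only: MD5 stands in for Cosmos DB's internal partition-key
# hash (which is not exposed) to show the general effect of hashing on order.
def bucket_for(order_id: str, buckets: int = 10) -> int:
    digest = hashlib.md5(order_id.encode()).hexdigest()
    return int(digest, 16) % buckets

# Consecutive one-up order ids land in unrelated buckets, so any time
# ordering inherent in the key is erased by the hash.
for order_id in ("order-1001", "order-1002", "order-1003", "order-1004"):
    print(order_id, "-> bucket", bucket_for(order_id))
```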

This problem is not purely theoretical. We have observed this multiple times. The data looks pretty evenly distributed (volume-wise) across all partitions, but one or a few partitions are being throttled while many others sit almost idle.

Have you encountered cases like this? Were you able to solve them? Any thoughts on solutions? I am inclined to believe there is no solution ... but I am not ready to give up yet.

Thanks.

Bill Bailey