Cloud: IO limits gone full circle

#postgres #performance #cloud #tuning

In the old days we used rotating disks, which had mechanical arms moving over the surface to read data. That meant there was a certain latency before data could be obtained, and a limited amount of bandwidth. The solution for getting more bandwidth was to use more disks (RAID). Still, overall usage was essentially bound by IOPS because of the mechanical arms (actuators). (There were other storage media before that, but those are outside the scope of this article.)

Then came solid state disks (SSD). Because these have no mechanical, rotating parts, latency is drastically reduced. In fact, this was such an improvement that the existing access protocols were found to be limiting SSDs, and new protocols were needed to take advantage of the parallelism and bandwidth that SSDs made possible (such as multipath IO and NVMe). Of course, new storage technology comes with its own problems, but that is beyond the scope of this article.

Fast forward further in time and we enter the cloud era. Now we can rent a (virtual) machine, elastically scale up and down, and let all the properties we obsessed about in the past, such as the number of disks, disk failure rates and bandwidth, be the problem of the cloud vendor. We can just use the infrastructure...

Or can we? If you look carefully at the specifications of the virtual machines of all major cloud providers, you will notice that a cloud machine shape has obvious limits, such as the number of vCPUs and the amount of memory, but also has IO limits, both at the level of the virtual machine and at the level of each disk.

The disk limits are less obvious, which gives me the impression that they are presented in a way that makes them easy to miss.

But that is not what I wanted to discuss: if you work out the IO limits for a cloud machine shape together with one or more disk devices, you will notice that the IO limits of the smaller machine shapes in particular are quite low.
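Working out the combined limit can be sketched as follows. This is a minimal sketch, not any vendor's actual model: it assumes the disks are striped so that their limits add up, and that the virtual machine limit caps the total. The `IoLimit` type, the `effective_limit` function and all numbers are hypothetical and only serve as illustration; real values have to come from the vendor's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoLimit:
    """An IO cap: operations per second and bandwidth in MiB/s."""
    iops: int
    mibps: int


def effective_limit(vm: IoLimit, disks: list[IoLimit]) -> IoLimit:
    """Return the limit a workload can actually reach.

    Assumption: disk limits add up (striping), and the VM limit
    caps whatever the disks could deliver together.
    """
    return IoLimit(
        iops=min(vm.iops, sum(d.iops for d in disks)),
        mibps=min(vm.mibps, sum(d.mibps for d in disks)),
    )


# Hypothetical small machine shape with two hypothetical disks.
vm_cap = IoLimit(iops=3000, mibps=100)
disk_caps = [IoLimit(iops=2000, mibps=125), IoLimit(iops=2000, mibps=125)]
print(effective_limit(vm_cap, disk_caps))
```

In this made-up example the two disks together could deliver more than the machine shape allows, so adding disks beyond this point would not help; the virtual machine limit is the one that counts.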

In fact, the IO limits of such machines leave me with the impression that we are essentially back at the disk limits of the era of rotating disks.

But it's not all nostalgia; there is a practical side to this. Disk IO sensitive applications that have to run on these machines must be tuned for limited IOPS again, and cannot assume close to unlimited IOPS and bandwidth. One such tuning is to use large IOs to reach the bandwidth limit, because many small IOs in parallel will run into the IOPS limit first.
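The arithmetic behind that tuning advice can be sketched like this. It is a rough illustration under stated assumptions: throughput is taken to be the lower of the bandwidth limit and the IOPS limit multiplied by the IO size, and every IO counts as one operation regardless of size (some vendors count large IOs as several operations, which this sketch ignores). The function name and the limits of 3000 IOPS and 125 MiB/s are hypothetical.

```python
def achievable_mibps(iops_limit: int, mibps_limit: float, io_size_kib: int) -> float:
    """Throughput in MiB/s reachable with a given IO size.

    Assumption: each IO counts as exactly one operation against
    the IOPS limit, whatever its size.
    """
    iops_bound_mibps = iops_limit * io_size_kib / 1024
    return min(mibps_limit, iops_bound_mibps)


# Hypothetical limits: 3000 IOPS and 125 MiB/s.
for size_kib in (8, 64, 1024):
    mibps = achievable_mibps(3000, 125, size_kib)
    print(f"{size_kib:>5} KiB IOs -> {mibps:.1f} MiB/s")
```

With these made-up limits, 8 KiB IOs (the PostgreSQL default block size) stop at the IOPS limit long before the bandwidth limit is in sight, while 64 KiB and larger IOs are bound by bandwidth instead.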

Top comments (3)

MetaDave 🇪🇺 • Nov 4 '21

Yes, we recently hit a limit on RDS IO burst with AWS RDS PostgreSQL, and then another limit on EBS IO burst.

It was very confusing, but unlike in the bad old days when every single database was weirdly IO bound and always would be, it only took a 15 minute outage and a bit of money to scale to a new instance size and type.

Reminiscing about the old magnetic storage RAID days, you could change the IO elevator (scheduler) on a RAID array and achieve close to the theoretical maximum throughput for a database with large IOs (e.g. data warehouses). Good times.

Frits Hoogland • Nov 4 '21 • Edited

Thank you for your reply Dave.

I can't tell if it was an IO burst on 'your' side, or a problem with burst limits. I think bursting is the 21st century snake oil, but that is another discussion.

I agree that cloud means improvements have been made in many areas, which were impossible in the old days. I love 'infrastructure as code', and vividly remember the days when adding infrastructural components meant something had to be ordered and physically arrive at the data centre, and then had to be assembled, installed and configured.

The goal of the article is to warn that, despite the many advantages that cloud gave us, some things have come back and moved in the opposite direction.

I have seen first-hand a cloud vendor proclaiming they provided unlimited capacity. That doesn't exist.

Michael Christofides • Nov 30 '21

Great observation! This feels like another good reason to always include BUFFERS when looking at execution plans, a topic that has come up again recently on the mailing lists:

postgresql.org/message-id/flat/CAN...