Kiran

Posted on Dec 22, 2024

How to optimize latency and throughput

#graviton #latency #throughtput #amazonec2

Most of us are familiar with deploying our applications to the cloud using EC2. However, if you're running a large instance and not achieving the expected performance, then this blog is for you.

Let's compare two different instances, the c5.large and the c8g.large, for a better understanding.

Comparing them, we can see that the vCPU count and memory are the same for both instances, yet the c8g.large is slightly cheaper than the c5.large. The c8g also belongs to the Graviton family, for which AWS claims up to 40% better performance at up to 20% lower cost than comparable x86-based instances.

In addition, AWS states that Graviton-based Amazon EC2 instances use up to 60% less energy than comparable EC2 instances for the same performance, which reduces your carbon footprint.

So if the c8g.large is better on all of these metrics, can we always use it?
The answer is no. Let's see why.
Let's create both instances and compare some other metrics.

To evaluate performance, let's use the openssl speed command.

For the c8g.large instance I get:
Doing sha256 for 3s on 256 size blocks: 10424489 sha256's in 2.99s
which means the system calculated over 10 million SHA-256 hashes (on 256-byte blocks) in about 3 seconds.

For the c5.large instance I get:
Doing sha256 for 3s on 256 size blocks: 3442633 sha256's in 3.00s

which means the system calculated over 3 million SHA-256 hashes (on 256-byte blocks) in 3 seconds.
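To make the two results easier to compare, here is a minimal Python sketch that turns the two output lines quoted above into hashes per second and a ratio. The regular expression and the helper name `hashes_per_second` are my own and only assume the line format shown in this post.

```python
import re

# Lines copied from the openssl speed output quoted in this post.
RESULTS = {
    "c8g.large": "Doing sha256 for 3s on 256 size blocks: 10424489 sha256's in 2.99s",
    "c5.large": "Doing sha256 for 3s on 256 size blocks: 3442633 sha256's in 3.00s",
}

# Captures the block size, the operation count, and the elapsed seconds.
LINE = re.compile(r"on (\d+) size blocks: (\d+) \S+ in ([\d.]+)s")


def hashes_per_second(line):
    """Return operations per second for one result line (hypothetical helper)."""
    match = LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    _block_size, count, seconds = match.groups()
    return int(count) / float(seconds)


rates = {name: hashes_per_second(line) for name, line in RESULTS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate / 1e6:.2f} million hashes/s")
print(f"ratio: {rates['c8g.large'] / rates['c5.large']:.2f}x")
```

On these two lines the c8g.large comes out at roughly three times the c5.large rate for this one test.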

3 million looks low next to 10 million, right? Let's dig deeper and see what is happening here.

First, let's look at the CPU info to understand each instance's capabilities.
cat /proc/cpuinfo

For c5.large we get:

And for c8g.large we get:

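From this we can see that the c5.large exposes its two vCPUs as two hardware threads sharing a single physical core (simultaneous multithreading, SMT), whereas on the c8g.large each vCPU is its own physical core.
Here is the thing: Graviton instances run one thread per core, which means there is no SMT on them. They can still run multi-threaded software, but two threads never share the execution resources of one core.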
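If you want to check this from a script instead of reading /proc/cpuinfo by eye, here is a small Python sketch. It assumes a Linux system that exposes `thread_siblings_list` under `/sys/devices/system/cpu`. The function names are my own, and the two sample lists are hypothetical values for a 2-vCPU instance, not output captured from the instances in this post.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-1" or "0,2" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def threads_per_core(sibling_lists):
    """Return the largest number of hardware threads sharing one core."""
    return max(len(parse_cpu_list(text)) for text in sibling_lists)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU found under sysfs (Linux only)."""
    paths = sorted(Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"))
    return [path.read_text() for path in paths]


# Hypothetical sysfs contents, for illustration only.
smt_example = ["0-1\n", "0-1\n"]   # both vCPUs share one physical core
no_smt_example = ["0\n", "1\n"]    # each vCPU is its own physical core

print(threads_per_core(smt_example))     # 2
print(threads_per_core(no_smt_example))  # 1
```

On a real instance you would call `threads_per_core(read_sibling_lists())`. A result of 2 means SMT is in use, and 1 means one thread per core.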

Now let's look at a pictorial representation for more clarity.

lstopo imagename.png

Running the above command,
for c5.large we get:

for c8g.large we get:

From this we can clearly see that Graviton instances run only 1 thread per core, and that the c8g.large has double the L1 and L2 cache sizes, which benefits workloads that depend heavily on caching.
Also, the PCI devices (such as the NVMe block device and the network interface) have access to high-speed bandwidth over the PCIe bus, as each of them uses 16 PCIe lanes.
More PCIe lanes = higher data transfer bandwidth.

Plotting latency against CPU load for both cases, we get

From this we can conclude
With SMT: Better latency at lower loads, but a sharp increase occurs when CPU load exceeds 60%.
Without SMT: Higher latency overall but more predictable and stable across the full CPU load range.

What is breaking latency?
Breaking latency is the point at which the machine can no longer serve more throughput while maintaining acceptable response times to the load generator, and each further increase in throughput induces an exponential rise in latency. An example of that exponential rise is below.
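Independently of the figure, the definition can be sketched in a few lines of Python: given load-test samples of (requests per second, latency), find the highest throughput that still meets your latency target. The function name, the sample numbers, and the 25 ms target are all hypothetical and are not measurements from this post.

```python
def find_breaking_point(samples, max_latency_ms):
    """Return the highest throughput whose latency is still acceptable.

    samples: list of (requests_per_second, latency_ms) pairs, in any order.
    Returns None if even the lowest load misses the latency target.
    """
    best = None
    for rps, latency in sorted(samples):
        if latency > max_latency_ms:
            break  # beyond this load, latency is no longer acceptable
        best = rps
    return best


# Hypothetical load-test results, for illustration only.
samples = [(100, 10.0), (200, 11.0), (300, 13.0),
           (400, 20.0), (500, 90.0), (600, 400.0)]

print(find_breaking_point(samples, max_latency_ms=25.0))  # 400
```

In this made-up data, latency creeps up slowly until 400 requests per second and then jumps, so 400 is the last load level that meets the target.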

Sources:
https://github.com/aws/aws-graviton-getting-started/blob/main/perfrunbook/defining_your_benchmark.md
https://pages.awscloud.com/rs/112-TZM-766/images/2023_OTT-OD-0501-NGI_Slide-Deck.pdf

Conclusion
In this comparison between the c5.large and c8g.large EC2 instances, we observe several key differences that influence performance. Although both instances offer the same vCPU and memory configurations, the c8g.large, which is part of the AWS Graviton family, provides a better price-to-performance ratio. AWS claims that Graviton processors deliver up to 40% better performance at up to 20% lower cost, while also reducing energy consumption by up to 60%. This makes the c8g.large an attractive option in terms of cost-efficiency and sustainability.
However, despite these advantages, the c8g.large comes with a trade-off. Graviton-based instances like the c8g.large run one thread per physical core, meaning they do not use simultaneous multithreading (SMT). This is in contrast to the c5.large, whose two vCPUs are two SMT threads sharing one physical core. The two designs respond differently as CPU load rises.
In the OpenSSL speed test, the c8g.large showed roughly three times the SHA-256 throughput of the c5.large (about 10 million hashes vs. about 3 million in 3 seconds). That is a single benchmark, though. Looking at latency against CPU load, the SMT case showed better latency at lower loads but a sharp increase once CPU load exceeded 60%, while the non-SMT case showed higher latency overall but stayed more predictable and stable across the full load range.
The PCIe lanes on both instances also play a role in data transfer bandwidth, with the c8g.large potentially having an edge due to its design, but this advantage is most beneficial in specific workloads, such as high-speed data transfer or storage-bound tasks.
In conclusion, while the c8g.large may be the more cost-effective option for many workloads, it may not always be the better choice. The choice between these two instances depends on the nature of your workload: if your application runs at low CPU load and needs the lowest possible latency, the c5.large may still be the better choice, while the c8g.large is ideal where predictable latency under load, cost savings, and energy efficiency are prioritized.
