Dimitrios Kechagias


Google Cloud C3D Review: Record-breaking performance with EPYC Genoa

Google Cloud announced today the general availability of the new C3D VM type, powered by AMD's latest 4th-generation EPYC processors, codenamed Genoa.

I've been testing the C3D for the last 3-4 months on our cloud at SpareRoom, so I thought I'd share my findings.

I was testing mainly because I was hoping we could benefit from the new VMs the same way we benefited (in both performance and cost) a couple of years ago, when we switched to 3rd-generation EPYC (Milan).

I won't spoil it any further than the title, but here are some handy links to navigate the post:

EPYC Genoa & Google C3D
Performance per Core
Max size instance performance
AVX-512
Performance / Price
Performance / Price - 1 Year reserve
Conclusion
Addendum: Test methodology and full results

EPYC Genoa & Google C3D

The 4th-Gen EPYC CPUs have been out for a year now, with great benchmark results for chips with up to 96 cores / 192 threads (a 128-core model was also introduced recently). They even added AVX-512 support, which, if your software depends on it, removes one of the last reasons to be "limited" to Intel Xeons.

I was glad to see that Google's C3D implements Genoa at near its top spec. The "AMD EPYC 9B14" that shows up as the model number seems to be quite close to the EPYC 9654 we've seen in benchmarks: it offers 90 cores (180 vCPUs with Symmetric Multi-Threading - SMT) and a 3.7GHz maximum boost clock. The all-core boost speed seems to be 3.45GHz, and that's what you are limited to if you don't allocate a full processor (<180 vCPUs). What's more, instances can be dual-processor, offering an unprecedented (for Google Cloud) 180 cores / 360 threads!
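As an aside, you can verify the reported model and thread count from inside a VM with standard Linux tools. This is a generic check, not part of the benchmarking itself:

```shell
# Print the CPU model string the hypervisor exposes
# (shows "AMD EPYC 9B14" on a C3D instance)
grep -m1 "model name" /proc/cpuinfo

# Count the vCPUs (hardware threads) visible to the VM
nproc
```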

Here is an overview of the Google 2nd-Gen or later instance types with Intel and AMD that are available (prices for US Central 1 - including sustained use discounts):

| Type | CPU | vCPUs | RAM / vCPU | Monthly $ / 2vCPU+8GB |
|------|-----|-------|------------|-----------------------|
| c3d | AMD EPYC Genoa | 4-360 | 2-8GB | 66.79 |
| c2d | AMD EPYC Milan | 2-112 | 2-8GB | 66.28 |
| n2d | AMD EPYC Milan* | 2-224 | 0.5-8GB | 49.34 |
| t2d | AMD EPYC Milan | 1-60 | 4GB | 61.68 |
| c3 | Intel Sapphire Rapids | 4-176 | 2-8GB | 76.22 |
| n2 | Intel Ice Lake* | 2-128 | 1-8GB | 56.72 |
| c2 | Intel Cascade Lake | 4-60 | 4GB | 60.97 |

\* For N2D/N2 you have to specify Milan/Ice Lake, otherwise you will get Rome/Cascade Lake respectively.

Impressively, the C3D costs about the same as the C2D and less than the C3 (prices vary per region; US-Central is shown in the table). I expect this is the result of good power efficiency, which is a major cost driver for a cloud provider.

As you can see, the C3D provides more cores than any other instance type, but first let's see how fast each one of those cores is.

Performance per Core

In order to have an apples-to-apples comparison, I will use 1 core / 2 hyper-threads from each instance. I'll have to use Google's "visible cores" limiter for instance types that have a minimum of 4 vCPUs. I'll skip the "special" T2D for now, as with no SMT (core = thread) the comparison becomes more complex.
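For reference, here is a sketch of how such a limited-core instance might be created with gcloud. The flag and machine-type names are from memory and should be double-checked against Google's current docs; treat this as an assumption, not a verified command:

```shell
# Hypothetical: create a small C3D instance exposing only 1 physical core
# (2 vCPUs) via the visible-cores limiter. Verify flag names against the
# gcloud docs before use.
gcloud compute instances create c3d-bench-1core \
    --zone=us-central1-a \
    --machine-type=c3d-standard-4 \
    --visible-core-count=1 \
    --image-family=debian-12 \
    --image-project=debian-cloud
```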

First, I will be using the latest version of the DKbench CPU benchmark suite, which I developed for my previous cloud performance comparisons to measure performance of the type of generic compute workloads we run on our servers. All the detailed benchmark results and methodology are available as an addendum at the end of this post.

Note that the CPU type is marked next to the instances on the bottom axis of the graphs, according to the following key (which lists the CPU generations from newest at the top to oldest at the bottom):

(SR) = Intel Sapphire Rapids   (G) = AMD Genoa
(I)  = Intel Ice Lake          (M) = AMD Milan
(C)  = Intel Cascade Lake

DKbench 1 core / 2 vcpu performance

The composite score over the suite's 19 benchmarks gives the c3d-s180/2 (full boost) instance a 20% average single-thread performance advantage over the C2D and C3. The delta over the C3 is even greater (25%) for multi-threaded loads, as the AMD CPUs consistently gain more from Hyper-Threading (SMT) than the Xeons.
We can actually calculate the exact advantage you get from SMT on each type:

| Type | c3d | c2d | n2d | c3 | n2 | c2 |
|------|-----|-----|-----|----|----|----|
| HT/single x | 1.22 | 1.22 | 1.22 | 1.15 | 1.06 | 1.10 |

This means that, with SMT enabled, a single core on both Milan and Genoa performs like 1.22 cores if you feed it two threads. For the Intel types it varies on this benchmark between 1.06x and 1.15x.
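For clarity, the multiplier is simply the two-thread score divided by the single-thread score for the same core. A quick sketch with illustrative placeholder numbers (not the raw scores from my runs):

```shell
# SMT multiplier = (score with both SMT threads busy) / (single-thread score).
# 1000 and 1220 are made-up placeholder scores for one Genoa core.
awk 'BEGIN { printf "%.2f\n", 1220 / 1000 }'   # prints 1.22
```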

Back to the results: if you don't allocate a full Genoa processor, you will be limited to the lower all-core boost (this is also true for the T2D and N2D types), giving you a smaller advantage over the C2D/C3, which have a fixed boost for all VM sizes. You can still expect across-the-board performance advantages though, and this becomes quite important when you also factor in the price.

Let's try another couple of popular benchmarks:

7zip benchmark

7zip was already a weak spot for the Xeons, so the C3D opens up the performance difference over the C3 to as much as 50-60% for decompression.

Linux kernel compilation

For a timed Linux kernel compilation, the C3 had the advantage over the C2D, but the Genoa instances are comfortably the fastest. Note that this benchmark is also affected by I/O, though I used the same Google storage solution for all types.

The title of the post mentions "record-breaking" performance; we can see why by moving on to the full-size instances.

Max size instance performance

This is a list of all the compute instances you can get with over 50 full cores, priced with the minimum RAM (highcpu versions):

| Type | CPU | Cores/Threads | GB RAM | $/Month |
|------|-----|---------------|--------|---------|
| c3d-hcpu-360 | AMD EPYC Genoa | 180/360 | 708 | 9815.33 |
| c3d-hcpu-180 | AMD EPYC Genoa | 90/180 | 354 | 4907.66 |
| c2d-hcpu-112 | AMD EPYC Milan | 56/112 | 224 | 3064.45 |
| n2d-hcpu-224 | AMD EPYC Milan | 112/224 | 224 | 4079.89 |
| t2d-std-60 | AMD EPYC Milan | 60/60 | 240 | 1850.37 |
| c3-hcpu-176 | Intel Sapphire Rapids | 88/176 | 352 | 5536.46 |
| h3-std-88 | Intel Sapphire Rapids | 88/88 | 352 | 3594.23 |
| n2-std-128 | Intel Ice Lake | 64/128 | 512 | 3629.88 |

You will note that I tested both the single-processor (180 vCPU) and dual-processor (360 vCPU) variants. The reason is that the single-processor variant is closer in size to the competing offerings, and I was hoping it would not make the comparison a complete rout.

Apart from the T2D, the H3 is also included: Google's new non-SMT Sapphire Rapids instance for HPC workloads, which comes only in an 88-core/88-thread size with 352GB RAM.

Starting as before with DKbench:

DKbench max size instance

Even a single Genoa processor (180 vCPU) easily outperforms a dual-processor Sapphire Rapids (176 vCPU) by over 20%. The full Genoa instance with 2 processors, 360 threads is a massive 2.6x faster on average than the C3 on these benchmarks. It is 2.2x faster than the largest previous type, the 224-thread Milan N2D, and over 3.3x faster than the largest C2D.

In fact, for many benchmarks I tried that could use hundreds of threads and did not rely on disk/GPU, I could beat the at-the-time world record on OpenBenchmarking.org.
For example, the 7zip benchmark used previously:

7zip max size instance

As of this writing, the world records for the 7zip compression/decompression benchmark are 1206954 / 831483 set by 2x EPYC 9654 & 2x EPYC 9634. The max C3D shatters them, being 20% & 50% faster respectively.
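Working backwards from those percentages, the implied scores of the max C3D come out to roughly 1.45M and 1.25M MIPS. Simple arithmetic on the record figures above:

```shell
# Implied max-C3D 7zip scores (MIPS), derived from the previous records
# and the 20% / 50% deltas mentioned above
awk 'BEGIN {
  printf "compression:   ~%d MIPS\n", 1206954 * 1.2
  printf "decompression: ~%d MIPS\n", 831483 * 1.5
}'
```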

The Linux Kernel Compilation does depend on disk speed, so my instances using a network-attached SSD could not set a world record, but the C3D again shows its power:

Linux Kernel Compilation max cores

AVX-512

One significant category of software where the previous EPYC generations lagged behind the Xeons was anything that took extensive advantage of Intel's AVX-512 extensions. Genoa adds these extensions (a complete set too, matching Sapphire Rapids, which itself expanded on Ice Lake's). Even though there was some initial concern about the implementation (256-bit "double pumped" units), it has proven to be performant. Although we don't use AVX-512 software at SpareRoom, I thought I'd prove the point with an AVX-512-heavy OpenSSL 4096-bit RSA signing benchmark. First on a single core, for an apples-to-apples comparison:

OpenSSL 1 Core benchmark

The C3D is 18-27% faster than the C3, depending on your max boost on Genoa. That's quite impressive, given that AVX-512 was a staple of Xeon performance. It is also interesting to note that Google's Ice Lake performs just a bit better here than Sapphire Rapids.
We can run the same on the max-core instances:

OpenSSL Max Cores

The 96729 signs/s was a world record too when I got it, but I see it has since been broken by another EPYC beast: a dual-EPYC 9754 with 256 cores / 512 threads!
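If you want to run a similar RSA signing test standalone, outside the phoronix suite used here, OpenSSL's built-in speed tool exercises the same kind of workload. This invocation is my assumption of a roughly equivalent run, not the exact benchmark command:

```shell
# Single-core RSA-4096 sign/verify throughput
# (-seconds shortens each sub-test; drop it for longer, steadier runs)
openssl speed -seconds 1 rsa4096

# Same test across all visible threads, e.g. 360 on the max C3D
openssl speed -seconds 1 -multi "$(nproc)" rsa4096
```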

Performance / Price

Performance/price is even more important than raw performance for many (I would guess a good majority of) cloud customers. You can expect to pay extra for a faster instance, and the "bleeding edge" of tech often asks for a price premium that can make adoption harder. From the first table in this post, we saw that this is not the case here: the C3D is offered at almost the same price as the previous-gen C2D, which was already cheaper than the C3 introduced earlier this year.

Let's plot the DKbench score per month-$:

DKbench performance/price

The C3D doesn't just have clearly the fastest cores; they are also the best value, with the exception of the N2D, which comes out ahead thanks to its 20% sustained-use discount.
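The metric behind that graph is simply the benchmark score divided by the monthly price. A toy example using the 2 vCPU + 8GB prices from the first table (the scores are illustrative placeholders, not my measured results):

```shell
# Points per monthly dollar, using the US-Central 2 vCPU + 8GB prices;
# the 1200 / 1000 scores are illustrative placeholders.
awk 'BEGIN {
  printf "c3d: %.2f points/$\n", 1200 / 66.79   # prints "c3d: 17.97 points/$"
  printf "c2d: %.2f points/$\n", 1000 / 66.28   # prints "c2d: 15.09 points/$"
}'
```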

At this point I should mention that, as we saw in my previous cloud performance comparison, there is also the T2D (Tau) type, which offers non-SMT Milan instances at the same price per vCPU as the C2D. That is tremendous value for parallel processing tasks, as long as you are happy with lower single-core performance (same as the N2D), no RAM choice, and a maximum of 60 vCPUs per instance. Until there is a Genoa Tau type, the T2D will keep an advantage for some deployments.

To demonstrate this we can show the performance/price graph for the max size instances:

DKbench Value Full

The T2D leads for value, despite having more RAM per core, and the N2D follows thanks to the sustained-use discounts. The C3D is a better value than all the other compute instance types, but the new H3 manages to almost tie it. This is unexpected, as the SMT version of Sapphire Rapids (C3) is not a good value, but welcome nonetheless.

Performance / Price - 1 Year reserve

To conclude the performance/price comparison I thought I'd add a reserved price comparison, as, at least for us, on-demand pricing is less relevant.

| Type | CPU | 1Yr reserve $ / 2vCPU+8GB |
|------|-----|---------------------------|
| c3d | AMD EPYC Genoa | 501 |
| c2d | AMD EPYC Milan | 501 |
| n2d | AMD EPYC Milan | 466 |
| c3 | Intel Sapphire Rapids | 576 |
| n2 | Intel Ice Lake | 535 |
| c2 | Intel Cascade Lake | 576 |

DKbench Value 1 Yr

The main difference is that the N2 and N2D lose their sustained-use discount advantage, with the latter no longer being as good a value compared to the C3D. Genoa easily seems like the best value when you reserve your instances, although you may still want to take advantage of the T2D pricing for multi-threaded workloads.

Conclusion

In a word, those 180-core C3D instances are beasts. I was disappointed earlier this year when the C3 was introduced, as it did not offer any real gains over the C2D, but the C3D is indeed a good leap in performance. A simple switch of VM type and you have instant, measurable performance gains for the same price (if you were on C2D) or even cheaper (for C3 users).

I expect that the new type will be adopted quickly by the cloud customers who follow the developments and care about the performance and cost of their VM fleet.

Addendum: Test methodology and results

You can find the full benchmark results listed here.

All instances were set up with a 10GB "SSD persistent disk" using Google's Debian 12 (bookworm) image. The first run of the benchmarks was recorded (each over multiple iterations), but instances in both us-central1 and us-east4 were subsequently benchmarked to make sure there were no performance discrepancies.

Some system packages were installed to support the DKbench and phoronix test suites:

sudo apt-get update
sudo apt-get install -y wget build-essential cpanminus libxml-simple-perl php-cli php-xml php-zip

Benchmark::DKbench

The DKbench suite recorded 5 iterations over 19 benchmarks. As those who followed my previous cloud performance comparisons may know, it is a suite based on Perl and C/XS, meant to evaluate the performance of the generic CPU workloads that a typical SpareRoom job or web server runs. It is very scalable, which makes it good for evaluating massive VMs.

To set up the benchmark with a standardized environment:

cpanm -n BioPerl Benchmark::DKbench
setup_dkbench -f

To run (5 iterations):

dkbench -i 5

OpenBenchmarking.org (phoronix test suite)

To set up the phoronix test suite:

wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
tar xvfz phoronix-test-suite-10.8.4.tar.gz
cd phoronix-test-suite
./install-sh

To run the benchmarks I used:

phoronix-test-suite benchmark compress-7zip
phoronix-test-suite benchmark build-linux-kernel
phoronix-test-suite benchmark openssl
