Dataoorts | GPU Cloud

Unveiling GPU Cloud Economics: The Concealed Truth

The Rise of Pureplay GPU Clouds: A Deep Dive into the Economics

Over the past year, we’ve witnessed an explosion in the number of pureplay GPU cloud providers. More than a dozen companies have presented us with proposals for launching GPU cloud services, and there are likely many more we haven't even encountered yet. While the wave of new deals has slowed, it’s worth taking a closer look at the underlying economics driving this trend.

Why the Surge in GPU Clouds?
One of the key reasons for the influx of GPU cloud providers is that GPU clouds are significantly easier to manage than general-purpose clouds from a software standpoint. Unlike traditional clouds, these providers don’t need to worry about complex services like advanced database management, block storage, strict multi-tenant security guarantees, or extensive APIs for third-party services. In many cases, even virtualization is not a major concern.

This makes the barrier to entry for GPU cloud businesses much lower. The core focus is providing high-performance GPU infrastructure for AI and ML tasks, without the complexity of managing various other cloud services.

The AWS Example: Software Isn’t Always the Key
AWS illustrates how little cloud-specific software matters in AI. Although AWS promotes its SageMaker platform as the go-to solution for creating, training, and deploying models in the cloud, it’s a case of "do as I say, not as I do." For its own top-tier model, Titan, AWS uses NVIDIA’s NeMo framework instead of SageMaker. Notably, Titan still underperforms several open-source models. This suggests that “value-add” cloud software is often less critical than access to top-tier hardware like NVIDIA GPUs.

Simpler Infrastructure Requirements for GPU Clouds
While general-purpose clouds require flexibility across compute, storage, RAM, and networking, the demands of a GPU cloud are far simpler. GPU workloads are relatively homogeneous, and servers are typically committed for long periods. In today’s landscape, the NVIDIA H100 GPU is the gold standard for most modern use cases, such as LLM training and high-volume inference.

For end users, the primary decision revolves around how many GPUs are needed for the task at hand. While networking performance is important, the costs of overspending on networking are minor compared to the price of GPUs themselves.

Data Locality and Egress Costs Are Minor Concerns
For most users, the locality of data during training or inference is not a critical factor because egress costs are relatively low. The data can be easily transferred and transformed without significant expenses. Furthermore, purchasing high-performance storage from providers like Pure, Weka, or Vast is a minor cost relative to the overall cost of building AI infrastructure.

Why Choose Dataoorts for Your GPU Cloud Needs?
With the rise of numerous GPU cloud providers, Dataoorts stands out as a reliable and affordable option for businesses looking to harness the power of NVIDIA GPUs. Our platform is designed specifically for scientific computing and AI tasks, offering a seamless, cost-effective solution. Launch your GPU instances quickly and easily through Dataoorts Cloud, and benefit from scalable, high-performance infrastructure without the complexity of managing traditional cloud services.

By choosing Dataoorts, you gain access to industry-leading GPUs at a fraction of the cost, without sacrificing performance. Visit Dataoorts today to learn more and take your AI projects to the next level.

Comparing CPU and GPU Colocation: Total Cost of Ownership (TCO)

The rapid rise of new GPU cloud providers can be attributed to the straightforward total cost of ownership (TCO) equation for GPU servers in colocation (colo) environments. Unlike CPU servers, whose TCO is shaped by a wide range of factors, GPU server TCO is dominated by capital costs, largely due to NVIDIA’s high margins. The main barrier to entry for new GPU cloud providers is therefore capital, not infrastructure, which makes it easier for many to enter the market.

For CPU servers, the monthly hosting costs (around $221) and capital costs (around $304) are relatively similar in scale. In contrast, for GPU servers, hosting costs (about $1,875 per month) are vastly overshadowed by capital costs (about $7,036 per month). This capital-heavy equation explains why so many third-party GPU clouds are emerging.
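
The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the four monthly figures quoted in the text, assuming they are per-server costs; the function and variable names are illustrative.

```python
# Monthly cost figures quoted in the text (USD), assumed to be per server.
MONTHLY_COSTS = {
    "cpu": {"hosting": 221.0, "capital": 304.0},
    "gpu": {"hosting": 1875.0, "capital": 7036.0},
}


def capital_share(hosting: float, capital: float) -> float:
    """Return the fraction of monthly TCO attributable to capital cost."""
    return capital / (hosting + capital)


for kind, cost in MONTHLY_COSTS.items():
    total = cost["hosting"] + cost["capital"]
    share = capital_share(cost["hosting"], cost["capital"])
    print(f"{kind}: total ${total:,.0f}/month, capital share {share:.0%}")
# cpu: total $525/month, capital share 58%
# gpu: total $8,911/month, capital share 79%
```

On these numbers, capital is a little over half of CPU server TCO but nearly four fifths of GPU server TCO.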

Hyperscale cloud providers like Google, Amazon, and Microsoft excel at optimizing hosting costs by designing and operating data centers with extremely efficient Power Usage Effectiveness (PUE), as close to 1.0 as possible. PUE is total facility power divided by the power delivered to IT equipment, so a value near 1.0 means very little power is spent on cooling and power delivery. Colocation facilities typically have a higher PUE, around 1.4 or more, meaning that for every watt delivered to servers, roughly another 0.4 W goes to cooling and power delivery (about 29% of total facility power). Even the newest GPU cloud facilities tend to have PUEs of around 1.25, still well above what hyperscalers achieve.
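
Because PUE is defined as total facility power divided by IT equipment power, the overhead can be expressed either per watt of IT load or as a share of total facility power. The sketch below computes both for the PUE values mentioned in the text; the function names are illustrative.

```python
def overhead_per_it_watt(pue: float) -> float:
    """Extra watts spent on cooling and power delivery per watt of IT load."""
    return pue - 1.0


def overhead_share_of_total(pue: float) -> float:
    """Fraction of total facility power that does not reach IT equipment."""
    return 1.0 - 1.0 / pue


for label, pue in [("typical colo", 1.4), ("newer GPU facility", 1.25)]:
    print(
        f"{label}: PUE {pue} -> "
        f"{overhead_per_it_watt(pue):.2f} W overhead per IT watt, "
        f"{overhead_share_of_total(pue):.1%} of total facility power"
    )
```

A PUE of 1.4 therefore corresponds to 40% extra power on top of the IT load, which is just under 29% of everything the facility draws.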

For CPU servers, this makes hosting costs a significant part of TCO, whereas for GPU servers hosting costs matter less because capital costs dominate. For instance, a less efficient datacenter operator can still purchase an NVIDIA HGX H100 server with debt at 13% interest and achieve an all-in cost of around $1.525 per GPU per hour. Some operators can optimize further, but the primary cost driver remains capital expense. As a result, even the best GPU cloud deals hover around $2 per hour for an H100, while some desperate customers end up paying more than $3 per hour.
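
The hourly figure can be roughly reproduced from the monthly GPU server costs quoted earlier ($1,875 hosting and $7,036 capital). This is a sketch under two assumptions that the text does not state: the server holds 8 GPUs, and a month has 730 hours (8,760 / 12) at full utilization. The names are illustrative.

```python
HOSTING_PER_MONTH = 1875.0   # USD, from the text
CAPITAL_PER_MONTH = 7036.0   # USD, from the text
GPUS_PER_SERVER = 8          # assumption: 8-GPU HGX H100 configuration
HOURS_PER_MONTH = 730.0      # assumption: 8,760 hours per year / 12


def cost_per_gpu_hour(hosting: float, capital: float,
                      gpus: int, hours: float) -> float:
    """All-in monthly server cost spread over every GPU-hour in the month."""
    return (hosting + capital) / (gpus * hours)


def gross_margin(price: float, cost: float) -> float:
    """Fraction of the hourly price left after the all-in cost."""
    return (price - cost) / price


cost = cost_per_gpu_hour(HOSTING_PER_MONTH, CAPITAL_PER_MONTH,
                         GPUS_PER_SERVER, HOURS_PER_MONTH)
print(f"all-in cost: ${cost:.3f} per GPU-hour")
for price in (2.0, 3.0):  # hourly prices mentioned in the text
    print(f"at ${price:.2f}/hour: gross margin {gross_margin(price, cost):.0%}")
```

Under these assumptions the result lands within a fraction of a cent of the $1.525 figure, and any idle time raises the effective cost per billed hour.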

Launch H100 VMs at Dataoorts with dynamic pricing ranging from $0.56 to $2.10 per GPU per hour.

This simplified model provides a basic understanding, though many variables can drastically alter the TCO. Some companies, like CoreWeave, have even tried to promote eight-year lifecycles for servers, but such claims don't hold up under scrutiny. The real-world numbers, especially in colocation environments, tend to differ significantly from these simplified assumptions. Let's now dive into a more realistic and detailed model to explain these economics further.
