Carl Peterson

Posted on • Originally published at thundercompute.com

How Thunder Compute attaches GPUs over TCP

Thunder Compute uses network-attached GPUs instead of physically attached GPUs. Behind the scenes, Thunder Compute tricks CPU-only instances into thinking that they have GPUs attached, when those GPUs are actually reached over TCP. From your perspective, the resulting instances behave as if they have GPUs, without a GPU being physically connected.

As a result, all instances on Thunder Compute are on-demand CPU-only instances, exactly like those you would find on AWS, GCP, or Azure. These instances do not have GPUs. It follows that the CPU-only instances you interact with on Thunder Compute have all of the functionality of the CPU-only instances you would find on AWS or Google Cloud. In fact, many of them are hosted on AWS or Google Cloud.

Here is a rough diagram of how we manage these connections between CPU-only instances and GPUs behind the scenes:

[Diagram: connections between CPU-only instances and network-attached GPUs]
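To make the general idea concrete, here is a toy sketch of forwarding a computation over TCP. It is not Thunder Compute's implementation: the JSON message format, the vector_add operation, and the names ToyGpuHandler and remote_vector_add are all hypothetical, and the "remote GPU" is just a Python loop running on localhost. It only shows how a caller can issue what looks like a local call while the work happens on the other end of a socket.

```python
import json
import socket
import socketserver
import threading


class ToyGpuHandler(socketserver.StreamRequestHandler):
    """Hypothetical stand-in for the remote machine that owns the GPU."""

    def handle(self):
        # One newline-terminated JSON request per connection.
        request = json.loads(self.rfile.readline())
        if request["op"] == "vector_add":
            result = [a + b for a, b in zip(request["a"], request["b"])]
            reply = {"ok": True, "result": result}
        else:
            reply = {"ok": False, "error": "unsupported op"}
        self.wfile.write((json.dumps(reply) + "\n").encode())


def remote_vector_add(address, a, b):
    """Client stub: looks like a local call, but the work happens remotely."""
    message = json.dumps({"op": "vector_add", "a": a, "b": b}) + "\n"
    with socket.create_connection(address) as conn:
        conn.sendall(message.encode())
        with conn.makefile("r") as reader:
            reply = json.loads(reader.readline())
    if not reply["ok"]:
        raise RuntimeError(reply["error"])
    return reply["result"]


def demo():
    # Port 0 asks the OS for any free port on localhost.
    with socketserver.TCPServer(("127.0.0.1", 0), ToyGpuHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            return remote_vector_add(server.server_address, [1, 2, 3], [10, 20, 30])
        finally:
            server.shutdown()


if __name__ == "__main__":
    print(demo())
```

The caller of remote_vector_add never touches the hardware that does the work, which is the property the diagram describes. A real system would have to forward far more than one operation and would care a great deal about batching and latency.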

A simple example demonstrates the distinction between our virtual GPU-over-TCP technology and a physical PCIe connection:

Running nvidia-smi on Thunder Compute behaves exactly as it would with a physical GPU, returning the attached GPU.

[Screenshot: nvidia-smi output listing the attached GPU]

Meanwhile, running lspci shows no connected GPUs.

[Screenshot: lspci output showing no connected GPUs]

To hammer home the point that there is no GPU, here is the full list of PCIe-connected devices on this Thunder Compute instance.

[Screenshot: full list of PCIe-connected devices from lspci]
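If you would rather script this comparison than read the two outputs by eye, the following is a minimal sketch. It assumes nvidia-smi and lspci are on the PATH and treats a missing or failing command as "no GPU reported". The helper names (pci_lists_nvidia_gpu, run_if_available, compare_views) are made up for this example, and the text matching against lspci output is a simple heuristic, not an exhaustive check.

```python
import shutil
import subprocess


def pci_lists_nvidia_gpu(lspci_output):
    """Return True if any lspci line looks like an NVIDIA display device."""
    gpu_classes = ("vga compatible controller", "3d controller", "display controller")
    for line in lspci_output.lower().splitlines():
        if "nvidia" in line and any(cls in line for cls in gpu_classes):
            return True
    return False


def run_if_available(command):
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.stdout if completed.returncode == 0 else None


def compare_views():
    """Compare what the driver tooling reports with what the PCI bus reports."""
    smi = run_if_available(["nvidia-smi", "-L"])  # -L lists GPUs, one per line
    pci = run_if_available(["lspci"])
    return {
        "driver_reports_gpu": bool(smi and "GPU " in smi),
        "pci_bus_reports_gpu": bool(pci and pci_lists_nvidia_gpu(pci)),
    }


if __name__ == "__main__":
    print(compare_views())
```

On a machine with a physically attached NVIDIA card, you would expect both values to be true. Based on the screenshots above, a Thunder Compute instance should report a GPU through nvidia-smi while the PCI bus reports none.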

We hope this has convinced you that there is no GPU physically connected to the machine. Pretty cool, right? You can pip install tnr and run tnr start to try this same demo yourself.

Now that you understand the distinction between a Thunder Compute instance and a GPU instance on EC2, it is worth explaining the limitations of this virtualized approach.

  1. Performance: TCP is slower than PCIe. While this may seem problematic, Thunder Compute is optimized to minimize the resulting performance impact. The real-world slowdown is often not noticeable and has minimal impact on common data science tasks.
  2. Limited Compatibility: Eventually, our GPUs-over-TCP will have the full functionality of physically attached cards, but today, Thunder Compute lacks official support for some GPU libraries. If Thunder Compute does not work for your particular use case, please reach out. We can usually add support for a new library within a few days.
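To get a feel for the performance point, a back-of-the-envelope model helps: moving data costs roughly a fixed latency plus the payload size divided by link bandwidth, and the cost matters less the longer the GPU computes on that data. The sketch below uses illustrative link figures only (a 10 Gbit/s network link at 1.25 GB/s, and roughly 32 GB/s as the theoretical rate of a PCIe 4.0 x16 slot) along with assumed latencies and an assumed workload. None of these are Thunder Compute measurements, and the function names are hypothetical.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Simple model: fixed latency plus payload size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def transfer_share(num_bytes, compute_s, bandwidth_bytes_per_s, latency_s=0.0):
    """Fraction of total time spent moving data rather than computing."""
    move = transfer_seconds(num_bytes, bandwidth_bytes_per_s, latency_s)
    return move / (move + compute_s)


# Illustrative figures only, not measurements of any real system.
NETWORK_BYTES_PER_S = 1.25e9  # 10 Gbit/s link
PCIE_BYTES_PER_S = 32e9       # approximate PCIe 4.0 x16 theoretical rate

if __name__ == "__main__":
    payload = 256 * 1024 * 1024  # assumed 256 MiB batch
    compute = 5.0                # assumed 5 s of GPU work per batch
    for name, bandwidth, latency in [
        ("network", NETWORK_BYTES_PER_S, 500e-6),  # assumed 0.5 ms latency
        ("pcie", PCIE_BYTES_PER_S, 5e-6),          # assumed 5 us latency
    ]:
        share = transfer_share(payload, compute, bandwidth, latency)
        print(f"{name}: {share:.1%} of time spent on transfer")
```

Under this model, workloads that do a lot of computation per byte transferred lose little to the slower link, while workloads that constantly shuttle small pieces of data back and forth are the ones most exposed to the gap.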

The impact of these drawbacks will vary depending on your specific workload, and we continue to improve both over time. So far, our testing has shown data science workflows to be the most performant and stable. Thunder Compute is open to the public, so the easiest way to test compatibility with your workflow is to try it yourself at thundercompute.com.
