Juan Felipe Lujan

Stable Diffusion Vs. The most powerful GPU. NVIDIA A100.

Earlier this week, I published a short on my YouTube channel explaining how to run Stable Diffusion locally on an Apple silicon laptop or workstation, allowing anyone with one of those machines to generate as many images as they want for absolutely FREE. It’s really quite amazing.

Today I’ve decided to take things to a whole new level. I will run Stable Diffusion on the most powerful GPU available to the public as of September 2022: the Nvidia A100 with 80 GB of HBM2e memory, a behemoth of a GPU based on the Ampere architecture and TSMC's 7nm manufacturing process. Yup, that’s the same Ampere architecture powering the RTX 3000 series, except that the A100 is a datacenter-grade GPU, or, as Nvidia themselves call it, an enterprise-ready Tensor Core GPU.

Unfortunately, I don’t have an A100 at home. Google’s got plenty for us to experiment with, but using this type of GPU requires contacting Google Cloud’s customer support and submitting a quota increase request.
In my request, I explained that I wanted to run Stable Diffusion and therefore needed one A100 80GB in the us-central1 region. (The A100 80GB is in public preview.)
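
If you want to check where you stand before filing the request, the gcloud CLI can print a region's quotas. A minimal sketch, assuming the A100 80GB quota metric contains "A100" in its name (the exact metric name may differ):

    # Print the region's quotas and pull out the A100 entries;
    # the limit and current usage appear on the neighboring lines.
    gcloud compute regions describe us-central1 \
      --format="yaml(quotas)" | grep -B1 -A1 "A100"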

I already sent that request, Google approved it, and I am now ready to create a Linux VM with the most powerful GPU attached.

Virtual machine configuration with an A100 GPU

So in Google Compute Engine, I clicked “New Instance”, went to the GPU section, and selected the Nvidia A100 80GB; this automatically configures an a2-ultragpu-1g machine type for me, which packs 12 vCPUs and 170GB of RAM.
As for the boot drive, I will select one of the images under Deep Learning on Linux.

Boot drive configuration screen in Compute Engine
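
For reference, roughly the same VM can be created from the gcloud CLI. Treat this as a sketch: the instance name, zone, disk size, and image family are assumptions, so swap in whatever the console offers you.

    # The a2-ultragpu-1g machine type already bundles one A100 80GB GPU,
    # so no separate accelerator flag is needed.
    gcloud compute instances create sd-a100 \
      --zone=us-central1-a \
      --machine-type=a2-ultragpu-1g \
      --image-family=pytorch-latest-gpu \
      --image-project=deeplearning-platform-release \
      --boot-disk-size=200GB \
      --maintenance-policy=TERMINATE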

Google Cloud will warn you if the selected boot drive image does not include CUDA drivers preinstalled. If you try to run Stable Diffusion UI in a VM without CUDA, it will fall back to CPU, making the image generation process atrociously slow.

Warning banner
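
A quick way to confirm CUDA is actually usable from inside the VM, assuming the chosen Deep Learning image ships PyTorch:

    # Should print "CUDA available: True" on a correctly provisioned VM.
    python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"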

With everything set, I will SSH into the monstrous VM and finish the installation of Nvidia’s drivers.
I could have partitioned the A100 into multiple GPU instances at this point, but that will be a test for another day.
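
On the Deep Learning images, the first SSH login asks whether to install the Nvidia driver; once that finishes, a quick check confirms the card is visible. The MIG command is only noted for completeness, since partitioning is out of scope here:

    # Confirm the driver sees the A100 (model, driver version, 80GB of memory).
    nvidia-smi

    # Partitioning into multiple GPU instances would start with enabling MIG mode:
    # sudo nvidia-smi -mig 1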

I chose the Stable Diffusion UI repo as it comes with a bash script that automatically downloads and installs all the necessary dependencies. It also starts a web server ready to take prompts and spit out images.

Thanks to all the contributors of Stable Diffusion UI.

Without further ado, I will download the compressed package, unzip it, and run the startup script as instructed in the repo's README file. After a few minutes, the web UI will start on port 9000.
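
The commands below are only a sketch; the archive URL and script name are placeholders, so follow the README's current instructions rather than copying them verbatim.

    # Download the release archive, unpack it, and run the launcher script.
    wget https://github.com/cmdr2/stable-diffusion-ui/releases/latest/download/stable-diffusion-ui-linux.zip
    unzip stable-diffusion-ui-linux.zip
    cd stable-diffusion-ui
    ./start.sh    # installs dependencies, then serves the UI on port 9000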

To see the UI, grab the ephemeral IP address of your VM from Google Compute Engine, paste it into the URL bar of your browser, and append :9000.
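
Or, from the CLI (using the hypothetical instance name from the earlier sketch):

    # Print the ephemeral external IP of the VM, then browse to http://<that-ip>:9000
    gcloud compute instances describe sd-a100 --zone=us-central1-a \
      --format="get(networkInterfaces[0].accessConfigs[0].natIP)"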

Stable Diffusion UI
The first time you send a prompt, the server will download the model files it still needs, so wait a couple more minutes before retrying.
And that’s pretty much it in terms of configuration to make Stable Diffusion run on the Nvidia A100.

Even though I never managed to use all the available VRAM, one thing was clear about the A100: it runs faster than a VM using a T4 GPU, and much, MUUCH faster than my little Apple M1 MacBook Air.

If you want to replicate this experiment, there are three things that you have to keep in mind:

  • You might have to create a firewall rule in your Google Cloud project to allow access to the Stable Diffusion UI on port 9000 (see the sketch after this list).
  • This deployment will cost you $1.15 per hour, or $836.72 per month after the sustained use discount.

Cost breakdown of the a2-ultragpu-1g VM

  • VMs with GPUs attached are not eligible for the $300 free trial credit.
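
For the first point, a minimal firewall rule sketch (the rule name is arbitrary, and you may want to restrict --source-ranges to your own IP instead of the whole internet):

    # Open TCP port 9000 so the Stable Diffusion UI is reachable from outside.
    gcloud compute firewall-rules create allow-sd-ui \
      --direction=INGRESS \
      --allow=tcp:9000 \
      --source-ranges=0.0.0.0/0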

Now a brief comparison.

╔════════════════════════╦═════════╦═══════════╦═══════════╗
║         Device         ║ 512x512 ║ 1024x1024 ║ 1024x2048 ║
╠════════════════════════╬═════════╬═══════════╬═══════════╣
║ MacBook Air M1 15GB    ║ 138s    ║ DNF       ║ DNF       ║
║ Nvidia T4 16GB GDDR6   ║ 36.96s  ║ 281.01s   ║ DNF       ║
║ Nvidia A100 80GB HBM2e ║ 14.05s  ║ 27.03s    ║ 92.78s    ║
╚════════════════════════╩═════════╩═══════════╩═══════════╝

When generating a 2048x2048 image, the nvidia-smi command provides a deeper look at what's happening inside the A100.

Screenshot of nvidia-smi showing the GPU working at 100% of its capacity
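
To watch it live while a batch runs, nvidia-smi can poll the relevant counters once per second:

    # Report GPU utilization, memory, and power draw every second.
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,power.draw \
      --format=csv -l 1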

Not even when asked to generate 20 2048x2048 images at once did the A100 give up…

nvidia-smi screenshot of A100 pulling 417W and 72125 MiB of VRAM used

The prompt and settings used throughout these tests:

prompt: young female battle robot, award winning, portrait bust, symmetry, faded lsd colors, galaxy background, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, cinematic composition, beautiful lighting, hyper detailed, Melancholic, Horrifying, 3D Sculpt, Blender Model, Global Illumination, Glass Caustics, 3D Render
    seed: 1259654
    num_inference_steps: 50
    guidance_scale: 7.5
    w: 2048
    h: 2048
    precision: autocast
    save_to_disk_path: None
    turbo: True
    use_cpu: False
    use_full_precision: True
    use_face_correction: GFPGANv1.3
    use_upscale: RealESRGAN_x4plus
    show_only_filtered_image: True 
    device cuda
Using precision: full
Global seed set to 1259654
Sampling:   0%|                                                                           | 0/1 [00:00<?, ?it/s]
seeds used =  [1259654, 1259655, 1259656, 1259657, 1259658, 1259659, 1259660, 1259661, 1259662, 1259663, 1259664, 1259665, 1259666, 1259667, 1259668, 1259669, 1259670, 1259671, 1259672, 1259673]
Data shape for PLMS sampling is [20, 4, 256, 256]
Running PLMS Sampling with 50 timesteps
PLMS Sampler:   6%|███▌            | 3/50 [10:10<2:26:11,186.63s/it]


This time it was me who DNF’ed out at 2:30 am after a long night of nerding around in GCP.

Hope you found this journey interesting.

Top comments (4)

Ayush Somani

Thanks for the post.

Which region did you choose when deploying your GPU?

Matthew Burley

I did not realize Stable Diffusion required so many resources to generate images in a sub-10s timeframe for just one user.

How the heck is “Salad” providing their services so cheaply?

juan9999

Awesome work! Any way to extrapolate where an M1 Max with, say, 32GB of RAM would stand in this ranking?

Christian Kallias • Edited

I have an M1 Max Studio and, for comparison, it's about 8x slower than a PC with an RTX 2070S... now I kinda wish I had gone with the Ultra... but it would still fall way short vs. a 4-year-old mid-range Nvidia GPU.