Weihao and I have been working on programmatic benchmarks for DeepCell on Google Batch.
We tried Vertex AI custom training jobs first, but ran into an issue with service accounts. The training job appeared to run as the expected(?) service account, but in an unexpected project, and we never tracked down how to give that project's user access to BigQuery. We also figured we might want to run the container a little closer to the metal (not a VM, though).
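(For context, the knob we were fiddling with looks roughly like the sketch below, using the Vertex AI Python SDK. The project, bucket, image, and service-account names are placeholders, not our real setup.)

```python
# Rough sketch of pinning the service account on a Vertex AI custom job.
# Every name here (project, bucket, image, service account) is made up.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomJob(
    display_name="deepcell-benchmark",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/deepcell/benchmark:latest",
        },
    }],
)

# The job should run as this service account; in our case it seemed to,
# just not in the project we expected.
job.run(service_account="benchmark-runner@my-project.iam.gserviceaccount.com")
```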
Enter Google Batch… I've used Batch-like products before, but never with a GPU. Initial work often looks like a lot of red failures 🥲
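Before the war stories, here's roughly the shape of a GPU job through the Batch Python client, for anyone who wants to follow along. The project, image, machine type, and GPU type below are placeholders rather than our exact config.

```python
# Minimal sketch of a GPU job on Google Batch via the Python client.
# Project, image, and accelerator choices are placeholders for illustration.
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()

# Container to run.
runnable = batch_v1.Runnable()
runnable.container = batch_v1.Runnable.Container()
runnable.container.image_uri = "us-docker.pkg.dev/my-project/deepcell/benchmark:latest"

task = batch_v1.TaskSpec()
task.runnables = [runnable]

group = batch_v1.TaskGroup()
group.task_spec = task
group.task_count = 1

# Ask for a GPU-capable machine and have Batch install the NVIDIA drivers.
accelerator = batch_v1.AllocationPolicy.Accelerator()
accelerator.type_ = "nvidia-tesla-t4"
accelerator.count = 1

instance_policy = batch_v1.AllocationPolicy.InstancePolicy()
instance_policy.machine_type = "n1-standard-8"
instance_policy.accelerators = [accelerator]

instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
instances.policy = instance_policy
instances.install_gpu_drivers = True

allocation = batch_v1.AllocationPolicy()
allocation.instances = [instances]

job = batch_v1.Job()
job.task_groups = [group]
job.allocation_policy = allocation
job.logs_policy = batch_v1.LogsPolicy()
job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

client.create_job(
    batch_v1.CreateJobRequest(
        parent="projects/my-project/locations/us-central1",
        job_id="deepcell-gpu-benchmark",
        job=job,
    )
)
```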
First impressions:
1: BigQuery rate limit
I forgot that BigQuery has a fairly low rate limit on table updates (5 ops per 10 seconds), so a batch of 10 jobs finishing too close together would overwhelm the table update. Quick fix with retry logic.
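The fix is just backoff around the update. A minimal sketch, assuming a hypothetical `benchmarks.runs` status table, a plain DML UPDATE (our real query differs), and that the rate limit surfaces as a 403 on the query job; tweak the exception handling to whatever you actually see.

```python
# Sketch: retry a BigQuery status-table update with exponential backoff.
# Table name and query are hypothetical placeholders.
import random
import time

from google.api_core import exceptions
from google.cloud import bigquery

client = bigquery.Client()

def update_with_retry(job_id: str, status: str, max_attempts: int = 5) -> None:
    query = """
        UPDATE `benchmarks.runs`
        SET status = @status
        WHERE job_id = @job_id
    """
    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("status", "STRING", status),
            bigquery.ScalarQueryParameter("job_id", "STRING", job_id),
        ]
    )
    for attempt in range(max_attempts):
        try:
            client.query(query, job_config=config).result()
            return
        except exceptions.Forbidden as exc:
            # Only retry rate-limit errors; re-raise real permission errors.
            message = str(exc)
            if "rateLimitExceeded" not in message and "rate limit" not in message.lower():
                raise
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter so a batch of jobs doesn't
            # stampede the table again in lockstep.
            time.sleep(2 ** attempt + random.random())
```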
2: GPU scarcity
We've had bad luck getting GPUs; the zone reports exhausted resource pools on the regular.
We ran into a surprising quota issue as well: we kept running out of persistent disk SSD quota, even though we weren't using any persistent disks…
The quota page showed the usage climbing and falling (again, we never observed any disks in the GCE console), and you could (kinda) see it trying different zones within region us-central1.
We tried increasing the quota to 1 TB (from 500 GB). No luck so far: no resources…!
The quota usage goes up in 30 GB increments, one per zone resource exhaustion error. I'm guessing it's a Batch implementation detail: it spins up the disks in anticipation of having a VM ready.
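In case it helps anyone poking at the same behavior: the allocation policy does let you constrain which zones Batch shops around in, though we haven't verified whether that calms the phantom disk quota usage. Zone names below are just examples.

```python
# Sketch: constrain a Batch job to specific zones via the allocation policy.
# Zone names are examples; we have not confirmed this affects the phantom
# persistent-disk quota usage described above.
from google.cloud import batch_v1

location_policy = batch_v1.AllocationPolicy.LocationPolicy()
location_policy.allowed_locations = ["zones/us-central1-a", "zones/us-central1-f"]

allocation = batch_v1.AllocationPolicy()
allocation.location = location_policy
# ...attach `allocation` to the Job as in the earlier sketch.
```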
Fortunately…! There is no billing charge for these disks. It's nice that billing only kicks in when something actually runs, although it's odd that the quota gets used up anyway.
I've heard several reports of people using GPUs on Batch, but it's clear that the incantations are arcane indeed. If you know how to reliably get GPUs, or have worked through these errors, please let me know!