Cloud Run scaling is simple until it isn't: the settings that actually matter
Cloud Run scaling looks wonderfully hands-off right up until a real workload lands on it.
Then the questions start:
- why did the first request feel slow?
- why did the service spin up so many instances?
- why is the database suddenly unhappy?
The good news is that Cloud Run scaling is not difficult once you focus on the few settings that actually shape behaviour: minimum instances, maximum instances, and concurrency.
If you understand those three, you can avoid most of the beginner mistakes without turning a simple service into a tuning project.
Start with the mental model
Cloud Run scales based on concurrent requests against the instances it already has available.
When the number of in-flight requests per instance approaches the configured concurrency limit, Cloud Run starts more instances. When traffic drops, idle instances are stopped after a cooldown period. If minimum instances is set to 0, the service eventually scales to zero.
That is the whole model in plain English:
- more concurrent requests than current capacity means more instances
- less traffic means fewer instances
- no traffic long enough means zero instances if you allow scale-to-zero
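The model above can be turned into rough capacity math: the number of instances Cloud Run needs is approximately the in-flight request count divided by the per-instance concurrency limit, rounded up. The numbers here are purely illustrative:

```shell
# Rough capacity math with made-up numbers:
# instances needed ≈ ceil(in-flight requests / per-instance concurrency)
IN_FLIGHT=200     # concurrent requests arriving at the service
CONCURRENCY=80    # Cloud Run's default per-instance concurrency
echo $(( (IN_FLIGHT + CONCURRENCY - 1) / CONCURRENCY ))   # prints 3
```

This is only an approximation of the autoscaler's behaviour, but it is a useful sanity check when reasoning about how many instances a traffic spike might create.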
The default behaviour is usually fine. The problems come from not matching the defaults to the service you are actually running.
Cold starts are real, but they are not always a problem
A cold start happens when Cloud Run needs to start a new container and there is no warm instance ready to take the request.
The source guide puts the typical added latency at around 200 ms to 2 seconds, depending on image size and startup time.
That sounds bad until you ask the right question: who notices?
For internal automation, webhook receivers, and background triggers, an occasional cold start is often acceptable. For user-facing APIs and web services, it can be very noticeable.
That is why the first real scaling decision is not "how do I eliminate cold starts everywhere?" but "does this service need a warm instance all the time?"
When to use minimum instances
If the service is user-facing, setting --min-instances=1 is often the cleanest fix.
That keeps one instance warm and ready, which makes response times more consistent after quiet periods. The source guide also notes that keeping one warm instance is usually affordable for most services, typically only a few dollars per month at standard memory allocations.
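Applying that to an existing service is a one-line update. The service name and region below are placeholders for your own deployment:

```shell
# Keep one instance warm so a user-facing service responds consistently
# after quiet periods. "my-service" and the region are placeholders.
gcloud run services update my-service \
  --region=us-central1 \
  --min-instances=1
```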
If the service is not user-facing, scale-to-zero is usually the better trade-off:
- zero idle cost
- simpler defaults
- no warm capacity you are paying for unnecessarily
There is also a middle ground people forget about: if you need stronger rollout resilience, two or more minimum instances can make sense so one instance is not carrying everything during a deployment transition.
Why maximum instances matters more than people think
Beginners often spend time worrying about cold starts and ignore the setting that protects everything behind the service.
--max-instances is not just a scaling knob. It is a safety limit.
If Cloud Run is free to create lots of instances under load, every one of those instances may try to talk to the same database, queue, or downstream API. That is where trouble starts.
The source guide makes this point clearly: set the maximum based on downstream capacity, especially database connection limits, not just your hoped-for traffic peak.
If you hit the maximum and all instances are full, new requests are queued or can return HTTP 429.
That is not ideal, but it is still often better than letting the service overwhelm a dependency it cannot safely scale with.
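One way to pick the cap, sketched with hypothetical figures: divide the downstream connection budget by the connection pool size each instance opens.

```shell
# Hypothetical sizing: a database that allows 100 connections, where each
# Cloud Run instance opens a pool of 10 connections, can safely support
# at most 10 instances.
MAX_DB_CONNECTIONS=100
POOL_PER_INSTANCE=10
MAX_INSTANCES=$(( MAX_DB_CONNECTIONS / POOL_PER_INSTANCE ))
echo "$MAX_INSTANCES"   # prints 10

# Then cap the service accordingly, e.g.:
#   gcloud run services update my-service --region=us-central1 \
#     --max-instances="$MAX_INSTANCES"
```

Leave headroom below the hard connection limit if other clients share the same database.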
Most services should not set concurrency to 1
This is probably the easiest Cloud Run mistake to make.
People see concurrency and think, "one request per instance sounds safer". Sometimes it is. Often it is just more expensive and less efficient.
Cloud Run defaults to a concurrency of 80. That means one instance can handle up to eighty simultaneous requests.
Lowering concurrency can make sense for CPU-heavy workloads where each request needs a lot of processor time. But for many I/O-bound services, reducing concurrency to 1 just creates more instances, more cold starts, and more pressure on downstream systems.
If you do not have a clear reason to lower it, the default is usually the right place to stay.
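If measurement does show a CPU-bound service, lowering concurrency is a single flag. The value 8 here is purely illustrative; test with your own workload before committing to a number:

```shell
# Lower per-instance concurrency for a CPU-heavy service.
# The value 8 is illustrative, not a recommendation.
gcloud run services update my-service \
  --region=us-central1 \
  --concurrency=8
```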
A practical starting point
For a normal user-facing API, this is a sensible first pass:
gcloud run deploy my-service \
--image=IMAGE \
--region=us-central1 \
--min-instances=1 \
--max-instances=100 \
--concurrency=80
That is not the right configuration for every service, but it is a good example of reasonable defaults:
- one warm instance for consistent latency
- a maximum instance cap so the service does not grow without bounds
- default concurrency unless you have measured evidence to change it
For an internal endpoint or background trigger, I would be much more willing to leave minimum instances at 0.
The simplest ways to reduce cold start pain
If you do care about cold start time, there are four levers in the source guide worth paying attention to:
- use a smaller base image
- minimise startup logic
- keep a minimum instance warm
- use --cpu-boost to speed up startup
That last one is useful because the service gets extra CPU during startup, which helps it become ready more quickly.
gcloud run services update my-service \
--region=us-central1 \
--cpu-boost
The main point is to fix startup properly before you try to compensate for a slow application with lots of always-warm capacity.
Do not pay for always-allocated CPU unless you need it
Cloud Run has two CPU allocation modes:
- CPU during requests only
- CPU always allocated
The default request-only mode is what most HTTP services want: CPU is billed while requests are being handled, and idle instances kept warm by a minimum-instances setting incur only a reduced memory cost.
Always-allocated CPU is for cases where the container needs CPU even between requests. If you do not have that kind of workload, it is an easy way to spend more than necessary.
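On the command line, the mode is controlled by the CPU throttling flags. The service name and region below are placeholders:

```shell
# Default mode: CPU is allocated only while requests are being processed.
gcloud run services update my-service \
  --region=us-central1 \
  --cpu-throttling

# Always-allocated CPU, for containers that do work between requests
# (background processing, in-process schedulers, and similar).
gcloud run services update my-service \
  --region=us-central1 \
  --no-cpu-throttling
```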
That is one reason scaling and cost are tied together more closely than people expect.
The real rule: tune for workload, not for ideology
The strongest advice in the original guide is also the least glamorous: choose scaling settings based on who is calling the service and what is behind it.
- user-facing service: keep one warm instance
- internal trigger endpoint: scale to zero
- fragile downstream database: cap the maximum instance count
- CPU-bound workload: test lower concurrency carefully
- normal web service: do not rush to override the default concurrency
That is the kind of tuning that actually helps.
If you want the fuller walkthrough, read the original Cloud Run scaling behaviour guide.
If you also want to see how scaling choices affect spend, the Cloud Run Cost Calculator is the easiest way to model the difference between scaling to zero and keeping one or more warm instances.