Maciej Strzelczyk for Google Cloud

Posted on • Originally published at Medium

AI deployment: to host or not to host?

So you’ve built your AI application prototype. You used your own local GPU to run the AI model, or just the free AI Studio tier, to power your clever program. The app is ready, the world is ready, and it’s time to deploy your production instance!

For traditional, non-AI-powered apps and services, the choice of deployment platform comes down to personal preference: what you are familiar with, how much control over fine details you want, and so on. Cost is usually not the most important factor; for a new service that is just starting to gain a userbase, the first usage bills won’t be high anyway.

The situation is different for services that make use of AI. Here, you need to make two separate decisions. The first is how to deploy your application, which is the same as for a vanilla non-AI app. The second is how you are going to provision the AI capabilities. This second decision will most likely be responsible for a big chunk of your bill, so it shouldn’t be made without proper consideration. In this article, I will try to help you make the right decision for your use case.

Serverless vs hosted inference service

There are two ways of provisioning AI for a production-grade application:

  • Serverless - where you pay for the tokens your application sends and receives. This is sometimes called Model as a Service (MaaS). In Google Cloud, this approach is available in Vertex AI and Google AI Studio (Gemini API).
  • Hosted - where you pay for the time you use the infrastructure running an LLM. In Google Cloud, this model is available through multiple services: Compute Engine (through certain machine types), Vertex AI, GKE, or Cloud Run.

Depending on your situation, you may not get to choose between the two, because only one may be possible. For example, if you have to use one of the Gemini models, there’s no way to host it yourself, and the MaaS (pay-per-token) approach is the only one available. Similarly, if you have to use a custom model that is not available as a service, you have to go down the hosted path.

In cases where you do have a choice between the two paths, you need to understand how each will affect your budget.

Serverless (pay per token)

Paying only for the tokens your application uses is a fair and easy-to-understand setup. It works exactly like any other paid service on Google Cloud: you pay for what you use.

Pros:

  • It scales to zero when you don’t use the AI
  • You don’t have to worry about scaling
  • Configuration and maintenance are extremely simple

Cons:

  • Less predictable for your budget
  • You may hit service quotas, either when your application experiences a traffic spike or when you reach a total monthly usage limit
  • If your application is compromised, your bill might skyrocket
  • Once your application gets popular, the bill will grow with your active userbase
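To get a feel for how a pay-per-token bill grows with usage, a back-of-the-envelope estimate helps. The prices and traffic numbers below are made-up placeholders, not real Vertex AI or Gemini API rates; check the official pricing pages for current numbers.

```python
# Rough monthly cost estimate for a pay-per-token (MaaS) setup.
# All prices here are illustrative placeholders, NOT real Gemini API rates.

def monthly_token_cost(
    requests_per_day: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    price_per_1m_input: float,   # assumed USD per 1M input tokens
    price_per_1m_output: float,  # assumed USD per 1M output tokens
    days: int = 30,
) -> float:
    input_tokens = requests_per_day * input_tokens_per_request * days
    output_tokens = requests_per_day * output_tokens_per_request * days
    return (input_tokens / 1e6) * price_per_1m_input + \
           (output_tokens / 1e6) * price_per_1m_output

# 5,000 requests/day, ~1,000 tokens in and ~500 tokens out per request
cost = monthly_token_cost(5_000, 1_000, 500, 0.10, 0.40)
print(f"Estimated monthly bill: ${cost:.2f}")  # → Estimated monthly bill: $45.00
```

Notice that the bill scales linearly with requests per day: double the userbase and the bill doubles too, which is exactly the "grows with your active userbase" effect listed above.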

Hosted (pay per second)

Hosting an LLM on infrastructure that you pay for is extremely predictable cost-wise. As long as you know how long you are going to hold on to that GPU- or TPU-accelerated instance, you know exactly how much you are going to pay.

Pros:

  • Extremely predictable cost
  • Many ways to lower your bill: committed use discounts (CUDs), Spot VMs, choosing a cheaper zone, or picking the right instance and/or accelerator type
  • No quota on how many tokens your application consumes
  • Full control over hardware and software inference configuration

Cons:

  • High upfront cost
  • Doesn’t scale as smoothly as serverless
  • Configuration and maintenance are more complicated

A couple of considerations

To help you out a bit further, here are some questions you should ask yourself before deciding on one of the deployment options.

How much traffic do I expect?

With low traffic, the choice is almost obvious: serverless is cheaper and easier. However, as your usage grows, the number of tokens consumed will add up to a considerable amount. In that case, a self-hosted solution might save you from unexpected bills at the end of the month.
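One way to ground this decision is a break-even estimate: find the monthly token volume at which the pay-per-token bill matches the cost of a GPU instance running around the clock. Both prices below are illustrative assumptions, not actual Google Cloud rates.

```python
# Break-even point between pay-per-token and a self-hosted GPU instance.
# Both prices are illustrative assumptions, not actual Google Cloud rates.

HOSTED_HOURLY_RATE = 2.50    # assumed $/hour for a GPU-accelerated VM
PRICE_PER_1M_TOKENS = 0.30   # assumed blended $/1M tokens (input + output)

# Cost of keeping the instance up 24/7 for a 30-day month
hosted_monthly = HOSTED_HOURLY_RATE * 24 * 30

# Monthly token volume at which serverless costs the same as hosting
break_even_tokens = hosted_monthly / PRICE_PER_1M_TOKENS * 1_000_000

print(f"Hosted cost/month: ${hosted_monthly:.2f}")
print(f"Break-even: {break_even_tokens / 1e9:.1f}B tokens/month")
```

Below the break-even volume, serverless wins on cost; above it, the flat instance price starts paying for itself. If your instance doesn’t need to run 24/7, the break-even point drops accordingly.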

Am I legally bound to keep user data in a certain region?

In some cases, like with medical or financial data, local regulations or your own contracts might require that user data never leaves a certain location or is never sent to a service you don’t control. In such a situation, no matter the cost effectiveness, self-hosting an AI model may be the only possible solution.

Am I likely to hit the hourly/monthly quota?

All API services have usage quotas, and AI services are no exception. If you expect your application may hit those quotas, that’s a strong hint you should consider self-hosting your model.

Mixed approach

It is also worth noting that you don’t have to limit your architecture to a single AI model with a single deployment option. Imagine your application offers multiple AI-powered features: some might be simple enough for a small model to handle, while others require the full power of Gemini. It is perfectly fine to have, for example, Gemma 3 running on a VM to handle the easier tasks while you delegate the harder or bigger tasks to the Gemini API.
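A routing layer for such a mixed setup can be sketched in a few lines. The two backends below are stubs standing in for a self-hosted Gemma 3 endpoint and the Gemini API, and the "long prompts are hard" rule is a toy heuristic invented for illustration; real code would call the actual APIs and use its own classification logic.

```python
# Route easy prompts to a cheap self-hosted model and hard ones to a MaaS
# endpoint. Both backends are stubs; real code would call the actual APIs.

from typing import Callable

def local_gemma(prompt: str) -> str:
    # Stand-in for a request to a self-hosted Gemma 3 server
    return f"[gemma] {prompt}"

def gemini_api(prompt: str) -> str:
    # Stand-in for a call to the Gemini API
    return f"[gemini] {prompt}"

def is_hard(prompt: str) -> bool:
    # Toy heuristic: treat long prompts as "hard". Replace with your own logic.
    return len(prompt.split()) > 50

def route(prompt: str) -> str:
    backend: Callable[[str], str] = gemini_api if is_hard(prompt) else local_gemma
    return backend(prompt)

print(route("Summarize this sentence."))  # → [gemma] Summarize this sentence.
```

The routing decision is where the savings live: every request the heuristic keeps on the self-hosted model is a request that doesn’t add tokens to the MaaS bill.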

This is not an irrevocable decision

Even after careful consideration, the decision might still not be a simple one, especially if you’re starting with a new idea and simply don’t know how popular it will get. Luckily, with good application architecture, it is not that difficult to prepare for changing the AI API endpoint. It’s reasonable to start with a serverless solution, where you will often benefit from the fact that no traffic means zero cost. Once your application takes off and the Vertex AI or AI Studio bill reaches levels comparable to running a self-hosted model, you should reevaluate your situation and perhaps switch to the more predictable approach.
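One way to keep that switch cheap is to hide the inference endpoint behind a single interface and pick the implementation with configuration. This is a sketch under assumed names; neither class calls a real API, and the endpoint address is made up.

```python
# Keep the inference backend behind one interface so switching from
# serverless to self-hosted is a config change, not a rewrite.
# Class names and the endpoint address are assumptions for illustration.

from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class ServerlessBackend(InferenceBackend):
    def generate(self, prompt: str) -> str:
        # Real code would call the Gemini API here
        return f"serverless:{prompt}"

class SelfHostedBackend(InferenceBackend):
    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # e.g. a VM running an LLM inference server
    def generate(self, prompt: str) -> str:
        # Real code would POST the prompt to self.endpoint here
        return f"hosted:{prompt}"

def make_backend(mode: str) -> InferenceBackend:
    if mode == "hosted":
        return SelfHostedBackend("http://10.0.0.5:8000")  # made-up address
    return ServerlessBackend()

# Flip this string to "hosted" later; the rest of the app never changes.
backend = make_backend("serverless")
print(backend.generate("Hello"))  # → serverless:Hello
```

Because application code only ever sees `InferenceBackend.generate`, migrating from pay-per-token to a self-hosted model is a one-line configuration change rather than a refactor.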

Keep up!

The AI ecosystem is changing at a rapid pace, and it’s important to stay up to date with the latest news. Follow the official Google Cloud blog, Google Developers blog, and Google Cloud Tech YouTube channel so you don’t miss any updates!

P.S. Did you know that Google Cloud now offers a Developer Knowledge API and MCP server that can give your AI agents access to always up-to-date knowledge straight from the official Google Cloud, Firebase and Android documentation?!
