Brian Spann

Why Azure Container Apps for AI Workloads

Part 1 of "Running LLMs & Agents on Azure Container Apps"


I spend a lot of time helping teams at Microsoft figure out where to run their AI workloads. The conversation usually starts the same way: "We want to use LLMs, but we don't want to send our data to OpenAI, and we don't want to manage Kubernetes." That's a completely reasonable position. It's exactly the gap Azure Container Apps fills.

In this series, I'll walk you through deploying Ollama on ACA, building C# agents with Semantic Kernel, wiring up multi-agent architectures with Dapr, and hardening the whole thing for production. But first, let's talk about why ACA is the right platform for this kind of work, and when it isn't.


The Problem with Running Your Own LLMs

The moment you decide to self-host a model, you've signed up for a set of infrastructure decisions that most application developers aren't used to making. Where does the model live? How do you serve it? What happens when nobody's using it at 2 AM? Are you still paying for a GPU?

In my experience, teams end up in one of four places:

Running on a laptop works great for hacking on a Saturday afternoon, but it's a dead end for anything beyond that. You can't share it with a team, you can't scale it, and you can't keep it running when you close the lid.

A VM with a GPU solves the sharing problem but creates a new one: you're paying 24/7 whether the model is handling requests or sitting idle. I've seen teams burn through hundreds of dollars a month on GPU VMs that were doing real work less than 10% of the time.

Kubernetes (AKS) gives you everything: autoscaling, GPU scheduling, health checks, the works. But now you need someone who knows how to operate a Kubernetes cluster. For a team building AI features, not a platform team, that's a big ask. Projects stall for weeks while developers learn about node pools, taints, and GPU device plugins.

Azure Container Apps sits in the gap between "just give me a VM" and "I guess I need Kubernetes." You deploy a Docker image, ACA handles scaling, and you don't touch kubectl. It's built on Kubernetes under the hood, but that's an implementation detail you never have to think about.


What Azure Container Apps Actually Gives You

If you haven't worked with ACA before, the short version is: it's serverless containers. You give it a Docker image and tell it what port to listen on. ACA provisions the infrastructure, handles TLS, and scales your containers based on demand. That includes scaling to zero when there's no traffic, which means no cost when nobody's using your model.

What makes it interesting for AI workloads specifically is the combination of a few features that came together over the last year or so. Workload profiles now include GPU-enabled options, so you can run inference on actual GPU hardware without managing nodes. Dapr integration is built in, which matters when you start running multiple agents that need to talk to each other (we'll dig into this in Part 4). And KEDA-based autoscaling means you can scale on custom metrics beyond HTTP concurrency, like queue depth or even custom telemetry from your model.

Think of it as the serverless experience of Azure Functions, but without being locked into the Functions programming model. You bring any container, and ACA runs it.


How ACA Compares to the Alternatives

Let me break this down the way I explain it to teams I work with.

Azure OpenAI Service

Azure OpenAI is the easiest path to production. Setup takes minutes, you get access to GPT-4 and the latest models, and Microsoft handles all the infrastructure. Your data stays within your Azure tenant, which satisfies most compliance requirements.

Where it gets expensive is token volume. Azure OpenAI charges per token, and that math gets uncomfortable fast. A chatbot processing a million tokens a day at GPT-4 prices will run you around $600/month. That's fine for a prototype or a low-volume internal tool, but high-traffic production apps feel it.

You also give up control. You get fine-tuning, but you don't get to run arbitrary open-source models, and you can't customize the serving infrastructure. If you need to run Llama 3 or Mistral or a fine-tuned domain model, Azure OpenAI isn't the answer.

Azure Kubernetes Service (AKS)

AKS is the power tool. You get full control over scheduling, GPU node pools, custom operators like KubeRay, and the entire CNCF ecosystem. If you're running large-scale inference with a dedicated ML ops team, AKS is probably the right choice.

But "full control" comes with "full responsibility." You're managing node pools, configuring GPU drivers, writing Helm charts, and debugging pod scheduling issues. One team I worked with spent more time operating their cluster than building their actual AI application. If you already have Kubernetes expertise on the team, great. Most teams building AI features don't, and for them it's a distraction.

Azure Container Apps

ACA gives you most of what AKS offers for inference workloads (containerized deployments, autoscaling, GPU support, health probes) without the operational overhead. Setup takes minutes instead of hours. You don't need to know what a DaemonSet is.

The catch is flexibility. ACA has fewer knobs than raw Kubernetes. GPU workload profiles are still relatively new, and you're limited to the instance types ACA supports. You can't install custom operators or run training workloads. But for inference, which is what most application teams actually need, it covers the use case well.


When ACA Is the Right Call

I've found ACA works best in a few specific scenarios, and I want to be honest about where it doesn't.

The strongest use case is development and iteration. When you're building an agent and experimenting with different models, the last thing you want is to burn through API credits every time you test a prompt. Deploy Ollama to ACA, point your code at it, and iterate as much as you want. Scale to zero means you're only paying when you're actually working.
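To make "point your code at it" concrete, here's a minimal Python sketch of building a request against Ollama's `/api/generate` endpoint. The hostname is a placeholder; you'd swap in the FQDN that ACA assigns to your container app (and in later parts, authentication on top of it).

```python
import json
import urllib.request

# Placeholder: replace with the FQDN ACA assigns to your Ollama container app.
OLLAMA_URL = "https://my-ollama.example.azurecontainerapps.io"

def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("llama3", "Summarize this ticket in one sentence: ...")
# urllib.request.urlopen(req) would send the call; omitted here because the
# endpoint above is a placeholder, not a live deployment.
```

Because Ollama's API is plain HTTP, iterating is just a matter of changing the `model` field; no redeploy, no API credits burned.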

It also makes sense for cost-sensitive production. If you've done the math and your token volume is high enough that self-hosting is cheaper than API calls (I'll show you exactly where that crossover is in a minute), ACA lets you capture those savings without the operational burden of Kubernetes.

Data sovereignty comes up a lot in the government and financial services teams I work with. Some workloads simply can't send data to a third-party API, even one hosted in Azure. Self-hosting on ACA means your data never leaves your subscription, your VNet, or your region. And increasingly, I'm seeing teams run hybrid architectures where a cheap local model handles classification, summarization, and simple tasks while complex reasoning gets routed to Azure OpenAI. ACA makes it easy to run the local piece alongside the rest of your application.
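The hybrid routing idea can be sketched in a few lines. The task names and backend labels here are illustrative, not a prescribed taxonomy; the point is that the decision is a cheap, deterministic lookup in front of two endpoints.

```python
# Illustrative sketch of hybrid routing: cheap, well-bounded tasks go to a
# self-hosted model on ACA; open-ended reasoning goes to Azure OpenAI.
LOCAL_TASKS = {"classify", "summarize", "extract"}

def route(task: str) -> str:
    """Return which backend should handle a given task type."""
    return "local-ollama" if task in LOCAL_TASKS else "azure-openai"

print(route("classify"))   # local-ollama
print(route("planning"))   # azure-openai
```

In a real system the router might look at prompt length or a confidence score rather than a static task label, but the shape is the same: a small gate that keeps the bulk of your token volume on the flat-cost local model.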

Where ACA is not the right call: training workloads, multi-GPU inference (70B+ parameter models that need model parallelism across GPUs), or situations where you need fine-grained control over GPU scheduling. For those, you want AKS or Azure ML.

The Cost Crossover: Self-Hosted vs. API

This is the question everyone asks, so let me lay it out with real numbers.

At low token volumes, say 100K tokens per day, the math is roughly a wash. Azure OpenAI GPT-4 costs about $60/month at that volume. A self-hosted Llama 3 instance on ACA with CPU-only compute costs about the same, and GPT-4 is a better model, so the API wins on quality.

The crossover happens around 200-300K tokens per day. Above that, self-hosting costs stay relatively flat (you're paying for compute time, not tokens), while API costs scale linearly with usage. At 1M tokens/day, Azure OpenAI runs about $600/month. The same workload self-hosted on ACA? Still around $60/month, maybe $120 if you're on a GPU profile.

That's a 5-10x difference, and it only gets wider at higher volumes.

The caveat (and I always flag this) is that you're comparing different models. Llama 3 70B is good, but it's not GPT-4. For many tasks (classification, extraction, summarization, structured output), the quality gap is negligible. For complex multi-step reasoning, GPT-4 still has an edge. The hybrid approach I mentioned earlier lets you get the best of both.

Note: These cost estimates are based on Azure consumption pricing as of early 2026. Your actual costs will vary based on model size, workload profile, region, and usage patterns. Always check the Azure pricing calculator for current rates.

What We're Building in This Series

Over the next four posts, we'll go from zero to a production-ready, multi-agent AI system running entirely on Azure Container Apps.

We'll start in Part 2 by getting Ollama deployed and serving models, with persistent storage so you're not re-downloading 5GB on every cold start, and proper security so you don't accidentally expose an unauthenticated GPU endpoint to the internet. From there, Part 3 connects Semantic Kernel to your Ollama instance and builds a C# agent with function calling, the kind that can actually do things, not just chat. Part 4 is where it starts to feel like a real system: multiple specialized agents communicating through Dapr, with Dynamic Sessions for safe code execution. Finally, Part 5 hardens everything for production: health probes that account for slow model loading, autoscaling that makes sense for LLM workloads, monitoring, and cost controls.

I'll include working code for everything, and I'll call out the gotchas I've hit so you don't have to discover them yourself.


Next up: Deploying Ollama to Azure Container Apps, with persistent model storage and proper security.
