Running Ollama on Azure Container Apps

Brian Spann

Part 2 of "Running LLMs & Agents on Azure Container Apps"


In Part 1, I made the case for why Azure Container Apps hits the sweet spot for self-hosted LLM inference. Now let's actually build it.

By the end of this post, you'll have Ollama running in Azure, serving Llama 3, with persistent model storage and a secure endpoint. The basic deployment takes about 20 minutes. The production hardening we'll add (persistent volumes, auth, GPU) takes it from a demo to something you'd actually run for a team.

A Quick Word on Ollama

If you haven't used Ollama before, the pitch is simple: it's the easiest way to run open-source LLMs. On your local machine, it's one command, ollama run llama3, and you've got a model running with an API endpoint.

The reason Ollama works so well for what we're building is the OpenAI-compatible API at /v1/chat/completions. Any code written against the OpenAI SDK, including Semantic Kernel (which we'll use in Part 3), works with Ollama without modification. Swap the endpoint URL and you're done. That portability is why I chose Ollama for this series over vLLM or text-generation-inference.


Step 1: Create the Environment

First, set up a resource group and an ACA environment. The environment is the shared boundary for your container apps: networking, Dapr configuration, and logging all live at this level.

az group create --name rg-ollama-demo --location eastus

az containerapp env create \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --location eastus

I'm using East US here because it has good availability for GPU workload profiles. If you're just doing CPU-only for development, any region works.

Step 2: Deploy Ollama

az containerapp create \
  --name ollama \
  --resource-group rg-ollama-demo \
  --environment ollama-env \
  --image ollama/ollama:latest \
  --target-port 11434 \
  --ingress internal \
  --cpu 4 \
  --memory 8Gi \
  --min-replicas 0 \
  --max-replicas 1

Two settings here that I want to call attention to.

--ingress internal means this endpoint is only accessible to other containers in the same ACA environment. I've seen people deploy Ollama with --ingress external in tutorials, and that's a real problem. An unauthenticated Ollama instance on the public internet means anyone who finds the URL can run arbitrary models on your hardware. You're handing out free GPU time. Start with internal ingress, and if you need external access later, add authentication first (I'll show you how below).

--min-replicas 0 enables scale-to-zero. When nobody's sending requests, ACA shuts down the container entirely and you stop paying. The first request after idle triggers a cold start: the container needs to spin up and (if models aren't persisted) re-download the model weights. We'll fix the cold start problem with persistent storage in a minute, but even with it, expect 15-30 seconds on the first request. That's fine for development. For production, you might want --min-replicas 1 to keep one instance warm.
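To a client, that cold start just looks like a very slow (or initially refused) first request, so it's worth absorbing it with a retry. A minimal Python sketch; the helper name, the exception choices, and the backoff schedule are my own, not part of any SDK:

```python
import time

def call_with_retry(send_request, attempts=4, base_delay=2.0):
    """Retry a request with exponential backoff so a scale-to-zero
    cold start (15-30 s) surfaces as a slow first call, not an error."""
    for attempt in range(attempts):
        try:
            return send_request()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            # 2 s, 4 s, 8 s ... roughly covers a typical ACA cold start
            time.sleep(base_delay * (2 ** attempt))
```

Wrap your actual HTTP call in a zero-argument callable, e.g. `call_with_retry(lambda: do_post(url, payload))`, where `do_post` is whatever urllib- or requests-based helper you already have.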

Step 3: Pull a Model

With internal ingress, you can't hit the endpoint directly from your local machine. You need to either exec into the container or temporarily switch to external ingress to pull your first model.

# Get the internal FQDN
OLLAMA_URL=$(az containerapp show \
  --name ollama \
  --resource-group rg-ollama-demo \
  --query "properties.configuration.ingress.fqdn" -o tsv)

# From another container in the same environment, or temporarily with external ingress:
curl -X POST "https://$OLLAMA_URL/api/pull" \
  -d '{"name": "llama3:8b"}'

Practical tip: If you're just getting started, temporarily flip to --ingress external, pull your model, then flip back to internal. The exposure lasts only as long as the pull, and it's much simpler than setting up a jump box. For production, use the pre-baked image approach I cover later in this post, which avoids runtime downloads entirely.

Step 4: Test It

curl "https://$OLLAMA_URL/api/generate" \
  -d '{"model": "llama3:8b", "prompt": "Hello!", "stream": false}'

You should get back a JSON response with the model's reply. If you do, you've got a self-hosted LLM running in Azure.

The OpenAI-compatible endpoint is what we'll actually use in code:

curl "https://$OLLAMA_URL/v1/chat/completions" \
  -d '{"model": "llama3:8b", "messages": [{"role": "user", "content": "Hello"}]}'

This is the endpoint that Semantic Kernel, LangChain, and anything else built against the OpenAI API will talk to. We'll wire it up in Part 3.
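That portability is easy to see in code. A standard-library Python sketch: `chat` sends the same OpenAI-format request to whichever base URL you give it (a hosted OpenAI-compatible endpoint would additionally need an Authorization header), and `reply_text` unwraps the standard response shape:

```python
import json
from urllib import request

def chat(base_url, model, messages):
    """POST an OpenAI-style chat completion to any compatible backend.
    Swapping Ollama for another provider is just a different base_url."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def reply_text(completion):
    """Extract the assistant's reply from an OpenAI-format response."""
    return completion["choices"][0]["message"]["content"]
```

Against this deployment you'd call `chat(f"https://{OLLAMA_URL}", "llama3:8b", [{"role": "user", "content": "Hello"}])`.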


Persistent Model Storage

Here's a gotcha that bites everyone the first time: when your container scales to zero and back up, it loses everything in ephemeral storage. That includes your downloaded models. Llama 3 8B is about 4.7 GB. Re-downloading it on every cold start means your first request takes minutes instead of seconds, and you're paying for egress bandwidth every time.

The fix is to mount an Azure Files share so models survive container restarts.

# Create a storage account
az storage account create \
  --name stollamademo \
  --resource-group rg-ollama-demo \
  --location eastus \
  --sku Standard_LRS

# Create a file share
az storage share create \
  --name ollama-models \
  --account-name stollamademo

# Get the storage account key
STORAGE_KEY=$(az storage account keys list \
  --account-name stollamademo \
  --resource-group rg-ollama-demo \
  --query "[0].value" -o tsv)

# Register the storage with your ACA environment
az containerapp env storage set \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --storage-name ollama-storage \
  --azure-file-account-name stollamademo \
  --azure-file-account-key $STORAGE_KEY \
  --azure-file-share-name ollama-models \
  --access-mode ReadWrite

Now you need to mount that storage into the container. ACA requires a YAML file for volume mounts because there's no pure CLI flag for this. Create volume-mount.yaml:

properties:
  template:
    volumes:
      - name: ollama-models
        storageName: ollama-storage
        storageType: AzureFile
    containers:
      - image: ollama/ollama:latest
        name: ollama
        resources:
          cpu: 4
          memory: 8Gi
        env:
          - name: OLLAMA_MODELS
            value: /models
        volumeMounts:
          - volumeName: ollama-models
            mountPath: /models

Apply it:

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --yaml volume-mount.yaml

The OLLAMA_MODELS environment variable tells Ollama where to store and look for model files. With this in place, the first cold start after pulling a model still takes a few seconds (the container itself needs to start), but the model weights are already there on the mounted share. Every subsequent start is fast.

Adding GPU Support

Everything we've done so far uses CPU-only compute. For development and testing with 7-8B parameter models, CPU is fine. Llama 3 8B generates tokens at a usable speed on 4 cores with 8 GB of RAM. Not fast, but fast enough to test your agent logic without waiting.

When you need production-level latency or you're working with larger models (70B+), you'll want a GPU. ACA supports this through workload profiles:

az containerapp env workload-profile add \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --workload-profile-name gpu \
  --workload-profile-type NC24-A100 \
  --min-nodes 0 \
  --max-nodes 1

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --workload-profile-name gpu

A word of caution on cost: A100 GPUs run about $2/hour on ACA. If you leave --min-nodes 1 (always on), that's roughly $1,440/month. With --min-nodes 0, you only pay when there's active inference traffic, but you take a cold start hit when the GPU node needs to spin up. For most development work, stick with CPU. Add GPU when you've validated your agent logic and need to optimize for latency.
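The arithmetic behind that warning, with the two-busy-hours-per-day figure as an illustrative assumption for a bursty dev workload:

```python
# Back-of-the-envelope GPU cost, using the ~$2/hour figure above
HOURLY_A100 = 2.00

always_on = HOURLY_A100 * 24 * 30        # --min-nodes 1: billed around the clock
busy_hours_per_day = 2                   # assumed bursty dev workload
scale_to_zero = HOURLY_A100 * busy_hours_per_day * 30

print(f"always-on:     ${always_on:,.0f}/month")    # $1,440/month
print(f"scale-to-zero: ${scale_to_zero:,.0f}/month")  # $120/month
```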

Securing External Access

At some point you'll need external access. Maybe it's a frontend app, a mobile client, or a teammate who wants to test from their machine. Here are three approaches, in order of complexity.

Option 1: ACA Built-in Authentication

ACA has a built-in auth feature that can gate access behind Azure AD, Google, or other identity providers:

az containerapp auth update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --enabled true \
  --unauthenticated-client-action RedirectToLoginPage

This works well for interactive users (browser-based access), but it's clunky for programmatic API calls.

Option 2: API Key via Reverse Proxy

For programmatic access, deploy a lightweight proxy container in front of Ollama that validates a custom X-API-Key header before forwarding requests. This is what I typically set up for team development environments. Everyone gets an API key, and you can rotate or revoke keys without touching the Ollama deployment.
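The gateway itself can be tiny. Here's an illustrative standard-library Python sketch, not a production proxy (the env var names API_KEYS and OLLAMA_UPSTREAM, the port, and the upstream hostname are all my own choices; a real setup might use nginx, Envoy, or YARP, and would also handle streaming and non-POST routes):

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Hypothetical wiring: comma-separated keys and Ollama's internal FQDN
# arrive via environment variables set on the gateway container.
VALID_KEYS = set(filter(None, os.environ.get("API_KEYS", "").split(",")))
UPSTREAM = os.environ.get("OLLAMA_UPSTREAM", "http://ollama")

def key_is_valid(key, valid_keys):
    """A request passes only with a non-empty, registered X-API-Key."""
    return bool(key) and key in valid_keys

class AuthProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        if not key_is_valid(self.headers.get("X-API-Key"), VALID_KEYS):
            self.send_response(401)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        upstream = request.Request(
            UPSTREAM + self.path,
            data=self.rfile.read(length),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(upstream) as resp:  # forward to Ollama
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To run: HTTPServer(("", 8080), AuthProxy).serve_forever()
```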

If the gateway runs as a sidecar in the ollama app, switch that app's ingress to external and retarget it at the proxy's port so raw Ollama is never exposed directly:

# Expose the app through the gateway sidecar, not Ollama itself
az containerapp ingress update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --type external \
  --target-port 8080

If the gateway runs as a separate container app instead, give that app external ingress and leave Ollama's ingress internal.

Option 3: VNet Integration

For enterprise scenarios where you need network-level isolation, keep ingress internal and access Ollama through VNet peering, a VPN gateway, or ExpressRoute. This is the option I recommend for production workloads handling sensitive data.

az containerapp env create \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --infrastructure-subnet-resource-id /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}

You're putting your entire ACA environment inside your corporate network. External access goes through whatever VPN or gateway you already have.

Pre-Baking Models into the Image

For production deployments, I recommend avoiding runtime model downloads entirely. Build a custom Docker image that includes the model weights:

FROM ollama/ollama:latest

# Pre-download the model during build
RUN ollama serve & sleep 5 && ollama pull llama3:8b && pkill ollama

Build and push to your Azure Container Registry:

docker build -t myregistry.azurecr.io/ollama-llama3:latest .
docker push myregistry.azurecr.io/ollama-llama3:latest

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --image myregistry.azurecr.io/ollama-llama3:latest

The downside is image size: you're looking at 5 GB+ for even a small model. The upside is deterministic deployments: every release ships exactly the model version you tested against, and cold starts don't depend on network speed to a model registry. One caveat if you combine this with the persistent volume from earlier: Ollama only reads the directory OLLAMA_MODELS points to, so a runtime override that redirects it to the mounted share will hide weights baked into the image at the default path. Keep the build and the runtime agreed on where models live, and the share still works as a cache for any additional models you pull at runtime. Get that right and this is the fastest and most reliable startup configuration.


Practical Cost Tips

A few things I've learned from running this setup across different projects.

Scale-to-zero is your biggest lever. If your workload is bursty (heavy during business hours, quiet at night), the difference between always-on and scale-to-zero can be 3-4x on your monthly bill. The cold start penalty is real, but for many use cases it's worth it.

I've seen teams default to GPU instances "just in case" and spend 10x more than they needed to. Llama 3 8B runs fine on 4 cores and 8 GB of RAM. Start with CPU, measure your token generation speed, and only upgrade if it's actually too slow for your use case.

Don't overlook smaller models either. Phi-3 Mini and Qwen 2.5 3B handle classification, extraction, and structured output at a fraction of the compute cost. Not everything needs a 70B model.

And persistent storage is cheap insurance. An Azure Files share costs pennies per GB per month. Re-downloading models on every cold start costs more in egress bandwidth and startup latency than the storage ever will.


Next Up

In Part 3, we'll build a C# agent with Semantic Kernel that talks to this Ollama endpoint, with swappable backends so you can use self-hosted models for development and Azure OpenAI for production without changing your code.


Questions about the deployment? Hit me in the comments. I've probably hit the same wall you're about to hit.
