DEV Community

Alex


How We Set Up One Private Container Registry for 6 AKS Clusters Across 3 Regions and What Broke Along the Way

When our team started expanding from a single AKS cluster in East US to clusters across West Europe and Southeast Asia, the first thing we assumed was that container image management would be the easy part. It wasn't.

This post walks through how we architected a single private container registry accessible by all six of our AKS clusters across three Azure regions. I'll cover what worked, what silently failed for weeks before we noticed, and the decisions I'd make differently today.

The Setup We Started With

In the beginning we had one AKS cluster and one Azure Container Registry sitting in the same region. The CI/CD pipeline would build an image, push it to ACR, and AKS would pull it. Simple.

Then we added a second cluster in West Europe for latency reasons, and a third in Southeast Asia for compliance reasons. Suddenly we had a problem that sounds simple but has a lot of moving parts: how does every cluster pull images securely, without us managing a pile of credentials and without paying a fortune in cross-region egress fees?

What We Evaluated

We looked at three approaches before landing on our final architecture.

Option 1: One central ACR, all clusters pull from it

The simplest option and the one we tried first. Every cluster pointed to the same ACR endpoint regardless of where it was running. This worked fine in testing. In production it caused two problems. First, pull latency during cluster upgrades was noticeable, since every node was pulling large images over a long network path. Second, when a brief connectivity blip hit East US one night, clusters in the other regions couldn't pull images and pod restarts started failing. A registry in one region had become a single point of failure for the entire platform.

Option 2: Separate ACR per region

We considered running an independent ACR in each region and pushing images to all three from the pipeline. This solved the latency and availability problems but created a worse one. Now our pipeline had to push to three registries on every build. Image promotion between environments became a mess. And keeping digests consistent across registries turned out to be harder than expected, since separate push operations under load would sometimes produce slightly different manifest metadata depending on timing.

Option 3: ACR Geo-Replication

This is what we landed on. ACR's Premium SKU supports geo-replication: you maintain a single registry logically, and Azure automatically replicates your images to replicas in whichever regions you choose. You push once and every regional replica gets the image. Clusters in each region pull from their nearest replica automatically, with no changes to the image references in your manifests.

The Architecture We Run Today

Here is the high-level picture.

Our CI/CD pipeline in Azure DevOps builds the image and pushes to a single ACR endpoint in East US. ACR handles replication to West Europe and Southeast Asia replicas in the background. Each AKS cluster is configured to authenticate using Managed Identity so there are no image pull secrets to rotate or manage. Azure handles the auth handshake between AKS and ACR natively.

The command to wire up a cluster to ACR is straightforward.

az aks update \
  --name my-cluster \
  --resource-group my-rg \
  --attach-acr my-registry

That one command grants the cluster's managed identity the AcrPull role on the registry. Do this for each cluster and you're done with credentials.
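With six clusters this step is worth scripting. A minimal sketch, assuming hypothetical "cluster:resource-group" pairs and an already-authenticated `az` session:

```shell
# Attach each cluster's managed identity to the shared registry (AcrPull role).
# Cluster and resource-group names here are illustrative, not our real ones.
attach_clusters_to_acr() {
  local registry="$1"; shift
  # Remaining arguments are "cluster:resource-group" pairs.
  for entry in "$@"; do
    az aks update \
      --name "${entry%%:*}" \
      --resource-group "${entry##*:}" \
      --attach-acr "$registry"
  done
}
```

You'd call it once per environment, e.g. `attach_clusters_to_acr my-registry "prod-eastus-1:rg-eastus" "prod-weu-1:rg-westeurope"`.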

For geo-replication you add replicas through the portal or CLI.

az acr replication create \
  --registry my-registry \
  --location westeurope

az acr replication create \
  --registry my-registry \
  --location southeastasia

After this, any image you push to the registry replicates to both locations within a few minutes, depending on image size.

What Broke and When

I want to be honest about the parts that didn't go smoothly because these are the things no architecture diagram ever shows you.

Replication lag during fast rollouts

Our first incident with this setup was subtle. We pushed a new image and triggered a rollout within seconds of the push completing. The East US cluster picked up the new image fine, since it was pulling from the local replica. But the West Europe cluster tried to pull before replication had completed and got an image-not-found error. Pods went into ImagePullBackOff and we spent 20 minutes confused about why the same rollout was failing in one region and succeeding in another.

The fix was to add a replication wait check in our pipeline before triggering rollouts across all regions.

az acr replication show \
  --name westeurope \
  --registry my-registry \
  --query "status.displayStatus" \
  --output tsv

We poll this until it returns "Synced" before allowing the pipeline to proceed to the rollout stage. Simple, but not something you'd think to add until you've been burned.
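As a sketch, the gate looks roughly like this; the replica and registry names are illustrative, and the 10-second poll interval and 5-minute timeout are choices we made, not requirements:

```shell
# Poll a replica's replication status until it reports "Synced" or we time out.
# Assumes an authenticated az session; names are illustrative.
wait_for_replication() {
  local replica="$1" registry="$2" timeout="${3:-300}" elapsed=0 status
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(az acr replication show \
      --name "$replica" \
      --registry "$registry" \
      --query "status.displayStatus" \
      --output tsv)
    if [ "$status" = "Synced" ]; then
      echo "replica $replica synced"
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  echo "timed out waiting for $replica to sync" >&2
  return 1
}
```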

Digest pinning across replicas

We use digest pinning in production so that what gets deployed is exactly what got scanned and approved. The assumption was that the same image pushed to ACR would have the same digest everywhere. That assumption held in our case but it is worth explicitly verifying in your setup because some replication tools and registry configurations can rewrite manifest metadata in ways that change the digest. If your digest changes between regions your GitOps tooling will treat them as different images and you'll have a very confusing debugging session ahead of you.
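Verifying it is cheap. A hedged sketch of the kind of check you can run in a pipeline, comparing the digest the registry reports for a tag against the digest pinned in your manifests; it assumes the `az acr repository show --image` output includes a top-level "digest" field, and all names are illustrative:

```shell
# Compare the registry-reported digest for an image tag against an expected
# (pinned) digest. Assumes an authenticated az session; names are illustrative.
check_digest() {
  local registry="$1" image="$2" expected="$3" actual
  actual=$(az acr repository show \
    --name "$registry" \
    --image "$image" \
    --query "digest" \
    --output tsv)
  if [ "$actual" = "$expected" ]; then
    echo "digest match: $actual"
  else
    echo "digest mismatch: expected $expected, got $actual" >&2
    return 1
  fi
}
```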

Node pool scaling pulling a lot at once

When the cluster autoscaler adds multiple nodes simultaneously, every new node pulls the same large images at once. During a traffic spike we scaled out 12 nodes in Southeast Asia within two minutes, and each node started pulling a 2.4GB image independently. The registry replica handled it, but we saw throttling warnings in our ACR metrics that we hadn't seen before. We addressed this by pre-pulling our heaviest images with DaemonSets and by tuning the kubelet's parallel image pull settings.
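The pre-pull idea is roughly this: a DaemonSet whose init container pulls the heavy image and exits immediately, leaving the image warm in each node's cache. A sketch that emits the manifest (the names and image are illustrative, not our real services); in a pipeline you'd pipe the output into `kubectl apply -f -`:

```shell
# Emit a pre-pull DaemonSet manifest for a given name and image.
# The init container forces the pull; the pause container keeps the pod alive.
prepull_manifest() {
  local name="$1" image="$2"
  cat <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-${name}
spec:
  selector:
    matchLabels:
      app: prepull-${name}
  template:
    metadata:
      labels:
        app: prepull-${name}
    spec:
      initContainers:
      - name: pull
        image: ${image}
        command: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
EOF
}
```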

Egress costs were not zero

Even with geo-replication you pay for replication traffic between regions. For small teams with small images this is negligible. For us, with images averaging around 800MB across a dozen services, the monthly replication cost was noticeable. It didn't break the budget, but it was something we hadn't accounted for in our initial cost estimates. Factor this in before you go to your engineering manager with a cost projection.

What I'd Do Differently

If I were starting this from scratch today, here's what I'd change.

First, I'd instrument the registry from day one. ACR exposes metrics for pull latency, error rates, and replication lag. We didn't set up alerts on these until after our first incident. Add them before you need them.

Second, I'd enforce digest pinning in production from the start, using something like Kyverno. We retrofitted this policy later and it was painful to roll out across six clusters without disrupting running workloads.
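For reference, a minimal sketch of the kind of Kyverno policy involved. The policy name is illustrative, and the wildcard pattern (requiring an @sha256: digest in every container image reference) is the general shape rather than our exact rule; audit it against Kyverno's documentation before enforcing it on live clusters.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digests
spec:
  validationFailureAction: Enforce
  rules:
  - name: containers-use-digests
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Container images must be pinned by digest."
      pattern:
        spec:
          containers:
          - image: "*@sha256:*"
```

Rolling this out in audit mode first (validationFailureAction: Audit) is what we'd do differently, so existing workloads surface as violations before anything gets blocked.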

Third, I'd test regional failover before an actual outage forces the issue. Simulate a replica going dark and confirm your clusters handle it gracefully. We did this six months in and found an edge case where one cluster's stale DNS cache caused it to keep trying the failed replica instead of failing over. Finding that in a drill at 2pm is much better than finding it during an actual incident at 2am.

The Takeaway

ACR geo-replication with Managed Identity auth is a genuinely good solution for multi-region AKS setups. The operational overhead is low compared to running your own Harbor instances or managing push-to-multiple-registries pipelines. But like everything in distributed systems, the edge cases live in the timing and the assumptions.

Watch your replication lag. Pin your digests. Alert on your registry metrics. And test your failover before production tests it for you.
