
Juan Torchia

Posted on • Originally published at juanchi.dev

Docker Pull Fails in Spain Because of Cloudflare and a Soccer Match — Nobody Talks About the Real Pattern

I spent three hours convinced the problem was mine.

It was a Tuesday afternoon. I had a call with my client in Madrid in two hours, and the deploy pipeline was broken. docker pull timing out. Registry not responding. Me checking my network config, my DNS, my credentials — as if I'd accidentally touched something. That classic feeling of "this worked yesterday, what did I break?"

Nothing. I didn't break anything. A soccer match in Spain caused ISPs to block Cloudflare IP ranges to comply with an anti-piracy court order, and Docker Hub — which runs on Cloudflare infrastructure — got caught in the crossfire as collateral damage.

I'm writing about it because I made the mistake of assuming the pipe is neutral. That CDNs are like water — they flow the same for everyone, always. And that silent assumption is baked into every architecture decision I've made over the last few years.

Cloudflare DNS Blocks and Infrastructure: What Actually Happened

There's a legal context here: Spain has a mechanism that lets rights holders ask ISPs to block IPs associated with piracy sites during live sporting events. The targets are illegal streams of LaLiga matches, Champions League, that kind of thing.

The problem is that there's no surgical way to execute that block in 2025. Cloudflare uses shared IP ranges: thousands of services live behind the same IP. When an ISP blocks 104.21.x.x to cut off a pirated stream, it's potentially blocking every other service sharing that range.

In this case, Docker Hub became unreachable from several Spanish ISPs during the match. Not for minutes — for hours. And the kicker: from outside Spain, the service responded perfectly. From Argentina, I had zero issues pulling from Docker Hub. From Madrid, my client couldn't docker pull anything.

# What my client saw in Madrid
$ docker pull node:20-alpine
Error response from daemon: Get "https://registry-1.docker.io/v2/": 
net/http: request canceled while waiting for connection 
(Client.Timeout exceeded while awaiting headers)

# What I saw in Buenos Aires, at the exact same time
$ docker pull node:20-alpine
20-alpine: Pulling from library/node
# ... everything fine, obviously

Classic collateral damage from shared infrastructure. And the part that bothers me most isn't the block itself — it's that we burned 45 minutes figuring out the problem wasn't ours.

The Assumption Nobody Writes in the Documentation

Every architecture has implicit assumptions. We write them in ADRs when we're being disciplined, but most of them live in the head of whoever made the call two years ago.

One of those assumptions — one I had without ever articulating it — is this:

Base infrastructure services — container registries, CDNs, DNS resolution — are neutral carriers. They have no relevant geography. They have no politics. They're always available in the same way for everyone.

This incident proves that assumption is false in at least three dimensions:

1. Geography matters, even for infrastructure.

Docker Hub isn't a service with a differential SLA by country. But its actual availability depends on how local ISPs resolve legal conflicts that have nothing to do with you. A soccer match in Spain can break your deploy pipeline if your client is in Madrid. That doesn't appear on any status page.

2. CDNs aren't neutral — they're risk aggregators.

When Cloudflare has a problem — and it has had them — it doesn't affect one service. It affects thousands simultaneously. The concentration of infrastructure in a handful of providers creates single points of failure that don't exist in the architecture diagram of any of those individual services.

I touched on this tangentially in the post about TigerFS and the obsession with putting everything inside PostgreSQL — there's a consolidation pattern that reduces operational complexity but amplifies the blast radius when something fails.

3. Status pages lie by omission.

When Docker Hub is blocked for users in Spain, the status page shows green. Technically correct — the service is up. But for a percentage of users, it's effectively down. Aggregate availability metrics hide the real experience of geographic subsets.

// The implicit assumption in almost every retry loop I've ever written
import { exec } from 'child_process';
import { promisify } from 'util';

const execCommand = promisify(exec);
const sleep = promisify(setTimeout); // resolves after the given milliseconds

async function pullDockerImage(image: string): Promise<void> {
  const maxRetries = 3;

  for (let i = 0; i < maxRetries; i++) {
    try {
      await execCommand(`docker pull ${image}`);
      return; // Success — move on
    } catch (error) {
      // Silent assumption: if it fails, it's transient
      // Never considered: what if it's geographic?
      // What if retrying 3 times changes absolutely nothing?
      if (i === maxRetries - 1) throw error;
      await sleep(2000 * (i + 1));
    }
  }
}

// The version I should have written
async function pullDockerImage(
  image: string,
  options: {
    fallbackRegistry?: string;  // registry.company.com/mirror
    timeout?: number;
  } = {}
): Promise<void> {
  const registries = [
    'registry-1.docker.io',            // Docker Hub (default)
    options.fallbackRegistry,          // Internal mirror
    'mirror.gcr.io',                   // Google's cache of Docker Hub
  ].filter((r): r is string => Boolean(r));

  for (const registry of registries) {
    try {
      // Official images need the `library/` namespace outside Docker Hub,
      // e.g. mirror.gcr.io/library/node:20-alpine
      const qualified = image.includes('/') ? image : `library/${image}`;
      const imageWithRegistry = registry !== 'registry-1.docker.io'
        ? `${registry}/${qualified}`
        : image;

      await execCommand(`docker pull ${imageWithRegistry}`);
      return;
    } catch (error) {
      // Log which registry failed, not just that something failed
      console.warn(`Registry ${registry} unavailable:`, (error as Error).message);
      continue;
    }
  }
  }

  throw new Error(`Failed to pull ${image} from any registry`);
}

The difference isn't technically complex. It's conceptual. It requires accepting that the registry might be unavailable in a way that retrying won't fix.

The Pattern I Keep Seeing This Week

This isn't the first time I've written about dependencies we assume are stable and aren't.

When I reviewed those PRs with hardcoded API keys, the underlying problem was the same: we assume Anthropic's API, OpenAI's API, or service X's API will be available in the same way for everyone running the code. The hardcoded key is the symptom. The assumption of universal availability is the disease.

When I read the debate about contributing to the Linux kernel with AI, what resonated most wasn't the AI — it was that the kernel has decades of decisions made assuming the network is best-effort, not guaranteed. Low-level protocols have fallbacks because their authors lived in a world where nothing was reliable. We live in a world where everything seems reliable, and that makes us worse architects.

And when I analyzed how AI agent benchmarks are broken, the pattern was: single point of failure disguised as an elegant solution. An agent that depends on a single model, a single API endpoint, a single container registry — is fragile in ways the benchmark doesn't measure.

The Docker incident in Spain is the same pattern wearing a different costume.

How We Mitigated It (and What I Still Haven't Fixed)

The first thing I did after the incident was talk to my Madrid client about mirrors. Docker natively supports registry mirrors in /etc/docker/daemon.json (strict JSON; the daemon doesn't accept comments):

{
  "registry-mirrors": [
    "https://mirror.madrid-company.com",
    "https://mirror.gcr.io"
  ],
  "max-concurrent-downloads": 3,
  "max-concurrent-uploads": 5
}

With this, if Docker Hub doesn't respond, the daemon tries the mirrors automatically. An internal mirror requires infrastructure — you can spin up Harbor or a simple Docker registry — but it's the real solution for teams deploying frequently.
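For the "simple Docker registry" route, here's a minimal sketch of a pull-through cache using the official registry:2 image and its native proxy support. The filename, port, and storage path are illustrative, not anything from our actual setup:

```shell
# Minimal pull-through cache config for Docker's official registry:2 image.
# proxy.remoteurl turns the registry into a read-through mirror of Docker Hub.
cat > registry-cache.yml <<'EOF'
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io
EOF

# Run it (requires a Docker daemon; commented out here):
# docker run -d --name registry-cache --restart=always -p 5000:5000 \
#   -v "$PWD/registry-cache.yml:/etc/docker/registry/config.yml" registry:2
```

Clients then point at it through the registry-mirrors key in daemon.json, and the cache transparently fetches and stores anything it doesn't already have.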

The second thing was auditing our CI/CD pipeline on Railway to find which other steps have external dependencies assumed to be stable:

# railway.toml — what we had
[build]
dockerfilePath = "./Dockerfile"

# What we want to add
[build]
dockerfilePath = "./Dockerfile"
# Variables Railway resolves at build time
[build.env]
DOCKER_BUILDKIT = "1"
# If we use base images, pull them from the internal mirror
BASE_REGISTRY = "mirror.madrid-company.com"

And in the Dockerfile:

# Before: direct dependency on Docker Hub
FROM node:20-alpine

# After: parameterizable, with documented fallback
ARG BASE_REGISTRY=""
ARG NODE_VERSION="20-alpine"

# If BASE_REGISTRY is defined, use it; otherwise, Docker Hub
FROM ${BASE_REGISTRY:+${BASE_REGISTRY}/}node:${NODE_VERSION}

# Rest of the Dockerfile unchanged
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]

What I still haven't fixed: a health check system that distinguishes between "the service is down" and "the service is unreachable from this geography." Those are different problems with different solutions, and right now I treat them the same.
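The classification half of that system isn't hard; the hard part is running probes from the right places. A sketch of the logic, assuming you already collect reachability results from a few regions (the region names and probe shape are made up for illustration):

```typescript
// Sketch: classify a failure from reachability probes in several regions.
// The probe data would come from synthetic monitors you run yourself or
// from an external multi-region checking service.
type ProbeResult = { region: string; ok: boolean };

type Verdict = 'healthy' | 'global-outage' | 'regional-block';

function classifyOutage(probes: ProbeResult[]): Verdict {
  const up = probes.filter((p) => p.ok);
  if (up.length === probes.length) return 'healthy';
  if (up.length === 0) return 'global-outage';
  // Some regions reach the service and others don't: treat it as a
  // geographic problem, not a provider outage. Retrying locally won't help.
  return 'regional-block';
}

// Example: Docker Hub during the match — down from Spain, fine elsewhere.
const probes: ProbeResult[] = [
  { region: 'madrid', ok: false },
  { region: 'buenos-aires', ok: true },
  { region: 'frankfurt', ok: true },
];
console.log(classifyOutage(probes)); // "regional-block"
```

The point of the three-way verdict is that each outcome has a different remediation: "global-outage" means wait or fail over registries, "regional-block" means route the affected region to a mirror.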

The Mistakes I Keep Seeing (That I've Made Myself)

Confusing "always worked" with "always will work."

Docker Hub has been reliable for years. Cloudflare has been reliable for years. That history isn't a guarantee of future availability — especially when the failure cause can be completely outside the provider's control (like a court order triggered by a soccer match).

Designing the happy path and calling it architecture.

If your architecture diagram doesn't have red arrows showing what happens when each external dependency fails, it's not a complete architecture diagram. It's a diagram of how you want it to work.

Assuming the status page is reality.

Status pages report aggregated global availability. Your user in Madrid, in Iran, in China, can have a completely different experience while the status page shows green. You need synthetic monitoring from the actual geographies of your real users.

Not having a mirror for base images.

If you deploy more than once a week with Docker, an internal registry mirror isn't gold plating — it's basic resilience engineering. The cost to set it up is a few hours. The cost of not having it comes due when you can least afford it.

FAQ: Cloudflare, DNS, Blocks, and Infrastructure Resilience

Why does a Cloudflare block affect services that have nothing to do with piracy?

Cloudflare uses IPs shared across thousands of clients. When an ISP blocks a Cloudflare IP to cut access to a specific site, it's blocking every service sharing that IP. Docker Hub, monitoring services, third-party APIs — everything can get caught in the blast. That's the cost of the shared infrastructure model.

Does Docker Hub have any geographic redundancy mechanism that prevents this?

Docker Hub has multiple points of presence and uses Cloudflare as a CDN, but that doesn't solve the IP block problem — if anything, it centralizes it. Docker Hub's geographic redundancy doesn't help you if ISPs in your region are blocking the CDN IP ranges Docker Hub depends on. The solution is on the client side: internal mirrors or alternative registries.

What's a registry mirror and how do I set one up?

A registry mirror is a local proxy/cache for Docker images. When you docker pull node:20, the daemon checks the mirror first. If it has the image cached, it serves it locally. If not, it pulls from Docker Hub and caches it for next time. You can spin one up with Harbor (enterprise, more features) or with Docker's official registry image (registry:2), which is simpler. Config goes in /etc/docker/daemon.json under the registry-mirrors key.

Is this specific to Spain or can it happen elsewhere?

It can happen in any country where ISPs execute block orders based on IP rather than domain. Spain is a documented case because of LaLiga's anti-piracy orders, but the same pattern exists in the UK (copyright blocks), across much of the Middle East (political blocks), and potentially any jurisdiction where blocking mechanisms aren't surgical enough. If you have globally distributed clients or users, this is a real risk.

Why doesn't Cloudflare fix this on their end?

Cloudflare can do some things — like rotating IPs or using more granular ranges — but the structural problem is that the CDN business model is built on sharing infrastructure to reduce costs. There's no perfect technical solution while blocking mechanisms are IP-based instead of SNI- or content-based. Cloudflare has incentives to solve this, but the ISPs executing court orders have zero incentive to invest in more surgical solutions.

How do I detect whether an infrastructure failure is geographic before wasting hours debugging?

Three quick steps: 1) Check downdetector.es or its equivalent for the affected service, filtering by region. 2) Use host-tracker.com or similar to check the service from multiple geographies simultaneously — if it responds from the US but not from Spain, it's geographic. 3) Ask someone on a different network or country to confirm. Those 5 minutes of diagnosis would have saved me 45 minutes of hunting for a problem in my own config.
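The "is it reachable from here" half of step 2 fits in a few lines of shell. This is a rough sketch: the 5-second timeout is an arbitrary cutoff, and checking from other geographies still needs an external vantage point:

```shell
# Probe a registry's /v2/ endpoint. Any HTTP code (200, 401, ...) means the
# endpoint is reachable from this network; "timeout" means the path is broken.
check_registry() {
  local code
  code=$(curl -s -m 5 -o /dev/null -w '%{http_code}' "https://$1/v2/") || code="timeout"
  echo "$1 -> $code"
}

check_registry registry-1.docker.io
check_registry mirror.gcr.io
```

Run the same probe through a VPN exit or a remote box in another country and compare: matching failures point at the provider, diverging results point at geography.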

The Conclusion I'm Not Going to Soften

There's something uncomfortable about this incident that goes beyond the technical fix.

We build systems that assume base infrastructure is stable, neutral, and universal. That was never completely true, but the current level of centralization — Cloudflare handling an enormous slice of web traffic, Docker Hub being the de facto default registry, AWS/GCP/Azure concentrating most of the world's compute — makes that assumption more dangerous than it's ever been.

It's not that Cloudflare is bad. It's not that Docker Hub is irresponsible. It's that the model of "everything on the same CDN, everything in the same registry, everything with the same compute provider" creates interdependencies that none of the individual actors fully control or document.

I felt this with the brain-computer interface for the dancer with ALS — a medical system that depended on stable network latency. And I see it in every discussion about Surelock and deadlocks in Rust — real resilience requires thinking about failure cases from the design stage, not bolting them on as an afterthought.

The question I was left with after the incident with my Madrid client isn't "how do I prevent Docker Hub from failing?" It's: "how many other silent assumptions of universal availability are sitting in my architecture, waiting for a soccer match to break them?"

I don't have the complete answer. But at least now I know I need to look for it.
