eldara

Posted on Apr 12

Docker Swarm Auto-healing: A Guide to Troubleshooting 'Pending' States

#devops #docker #sre #tutorial

It’s 3:00 AM. Your monitoring alert goes off. You check your cluster, and instead of a healthy green "Running" status, you see the dreaded, silent word: Pending.

In Docker Swarm, "Pending" is the scheduler’s way of saying, "I want to run this container, but I literally can’t find a home for it." Unlike an "Error" or a "Crash," a Pending state doesn't always show up in the standard logs. It’s a scheduling ghost.

In 2026, with the rise of resource-heavy AI sidecars and high-density edge clusters, "Pending" states are usually tied to one of three things. Let's look at how to find the bottleneck and fix it.

1. The "Resource Exhaustion" Trap

The most common reason for a Pending state is that you’ve promised more than your hardware can deliver. If you define a service with heavy reservations:

deploy:
  resources:
    reservations:
      cpus: '2'
      memory: 4G

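...and none of your nodes has that much unreserved capacity left (say, only 1GB of RAM not already claimed by other reservations), Swarm will keep the task Pending until enough capacity frees up or a bigger node joins. Note that the scheduler compares reservations against each node's total minus what other tasks have already reserved, not against live memory usage.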
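To make the arithmetic concrete, here is a minimal sketch of that fit check. The `Node` class, the node names, and the numbers are all hypothetical, and this is a simplified model of the idea rather than Swarm's actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a node: total capacity and what is already reserved."""
    name: str
    cpus_total: float
    mem_total_gb: float
    cpus_reserved: float = 0.0
    mem_reserved_gb: float = 0.0


def fits(node: Node, cpus: float, mem_gb: float) -> bool:
    """True if the reservation fits into the node's unreserved capacity."""
    free_cpus = node.cpus_total - node.cpus_reserved
    free_mem = node.mem_total_gb - node.mem_reserved_gb
    return free_cpus >= cpus and free_mem >= mem_gb


def eligible_nodes(nodes: list[Node], cpus: float, mem_gb: float) -> list[str]:
    """Names of nodes that could host a task with this reservation."""
    return [n.name for n in nodes if fits(n, cpus, mem_gb)]


# Example cluster (made-up numbers): neither node has 2 CPUs AND 4 GB unreserved.
nodes = [
    Node("node-01", cpus_total=4, mem_total_gb=8, cpus_reserved=3, mem_reserved_gb=7),
    Node("node-02", cpus_total=4, mem_total_gb=4, cpus_reserved=0, mem_reserved_gb=1),
]

# An empty list here is the situation that leaves a task Pending.
print(eligible_nodes(nodes, cpus=2, mem_gb=4))
```

If the list comes back empty, either lower the reservation or add capacity; no amount of waiting helps unless other reservations go away.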

How to Diagnose

The Native Way:

docker service ps --no-trunc <service_name>

Look for the ERROR column. For a resource problem it will usually read something like: no suitable node (insufficient resources on X nodes).
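If you want to script this check, a rough sketch follows. It assumes you captured the output of docker service ps --no-trunc --format '{{json .}}' <service_name>, which should emit one JSON object per line with keys such as Name, CurrentState, and Error; verify the key names against your Docker version. The sample data below is made up.

```python
import json


def pending_tasks(ps_output: str) -> list[tuple[str, str]]:
    """Return (task name, error text) for every task whose state starts with 'Pending'."""
    stuck = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        task = json.loads(line)
        if task.get("CurrentState", "").startswith("Pending"):
            stuck.append((task.get("Name", ""), task.get("Error", "")))
    return stuck


# Made-up sample of what the JSON-lines output might look like.
sample = "\n".join([
    json.dumps({"Name": "web.1", "DesiredState": "Running",
                "CurrentState": "Running 2 hours ago", "Error": ""}),
    json.dumps({"Name": "ai-sidecar.1", "DesiredState": "Running",
                "CurrentState": "Pending 5 minutes ago",
                "Error": "no suitable node (insufficient resources on 3 nodes)"}),
])

for name, error in pending_tasks(sample):
    print(f"{name}: {error}")
```

In practice you would feed it the real command output, for example via subprocess, instead of the sample string.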

The SwarmCLI Fix:
Instead of checking every service one by one, SwarmCLI highlights the problems immediately.

This gives you an overview of your cluster’s services. You’ll instantly see which service has issues and needs attention.

2. The "VIP Section" (Placement Constraints)

We use constraints to make sure our Database only runs on nodes with SSDs, or our AI models only run on nodes with GPUs. But if you get too specific, you create an impossible puzzle for the scheduler.

The Scenario

You have a constraint node.labels.storage == ssd, but you forgot to actually label your new Raspberry Pi 5 nodes. Result? Pending, typically with an error like no suitable node (scheduling constraints not satisfied on X nodes).

Quick Check

Run docker node inspect <node-id> and compare the labels under Spec.Labels with the constraints in your docker-compose.yml. Keys and values must match exactly, including case.
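The same comparison can be sketched in a few lines. This example assumes you have the parsed JSON array from docker node inspect for your nodes, with labels under Spec.Labels and the hostname under Description.Hostname. It only handles simple node.labels.* constraints with == or !=, and the node data is hypothetical.

```python
def parse_constraint(expr: str) -> tuple[str, str, str]:
    """Split 'node.labels.storage == ssd' into (key, operator, value)."""
    for op in ("==", "!="):
        if op in expr:
            key, value = (part.strip() for part in expr.split(op, 1))
            return key, op, value
    raise ValueError(f"unsupported constraint: {expr!r}")


def node_matches(node: dict, expr: str) -> bool:
    """Check one node (parsed inspect JSON) against one label constraint."""
    key, op, value = parse_constraint(expr)
    prefix = "node.labels."
    if not key.startswith(prefix):
        raise ValueError("this sketch only handles node.labels.* constraints")
    labels = node.get("Spec", {}).get("Labels") or {}
    actual = labels.get(key[len(prefix):])  # None when the label is missing
    return actual == value if op == "==" else actual != value


def unmatched_nodes(nodes: list[dict], expr: str) -> list[str]:
    """Hostnames of nodes the constraint rules out."""
    return [n.get("Description", {}).get("Hostname", "?")
            for n in nodes if not node_matches(n, expr)]


# Hypothetical inspect output: the new Pi was never labeled.
nodes = [
    {"Description": {"Hostname": "node-01"}, "Spec": {"Labels": {"storage": "ssd"}}},
    {"Description": {"Hostname": "rpi5-01"}, "Spec": {"Labels": {}}},
]

print(unmatched_nodes(nodes, "node.labels.storage == ssd"))
```

If every node shows up in that list, the constraint cannot be satisfied anywhere and the task has nowhere to go.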

3. The "Missing Map" (Volume & Network Mismatches)

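In 2026, we’re seeing more "Multi-Arch" clusters (mixing x86 and ARM). If a service’s image has no build for a node’s architecture, the scheduler filters that node out, and the task can sit in "Pending" with an error like no suitable node (unsupported platform on X nodes). Volumes cause a related mismatch: if a service depends on data in a named volume that only exists on node-01, but you told Swarm it can run anywhere, the task may land on a node that lacks that data. With the default local driver, Swarm typically creates a new, empty volume there instead of waiting, so pin the service to the right node with a placement constraint. With a volume plugin that is installed on only some nodes, the task can stay unscheduled on the rest.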
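A quick way to see how mixed your cluster really is: group nodes by platform. This sketch assumes the parsed JSON from docker node inspect, where the OS and architecture should appear under Description.Platform; the hostnames and values below are invented for illustration.

```python
from collections import defaultdict


def nodes_by_platform(nodes: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group hostnames by (OS, Architecture) taken from parsed node inspect JSON."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for node in nodes:
        desc = node.get("Description", {})
        plat = desc.get("Platform", {})
        key = (plat.get("OS", "unknown"), plat.get("Architecture", "unknown"))
        groups[key].append(desc.get("Hostname", node.get("ID", "?")))
    return dict(groups)


# Hypothetical three-node cluster mixing x86 and ARM.
nodes = [
    {"Description": {"Hostname": "node-01",
                     "Platform": {"OS": "linux", "Architecture": "x86_64"}}},
    {"Description": {"Hostname": "rpi5-01",
                     "Platform": {"OS": "linux", "Architecture": "aarch64"}}},
    {"Description": {"Hostname": "rpi5-02",
                     "Platform": {"OS": "linux", "Architecture": "aarch64"}}},
]

for platform, hosts in nodes_by_platform(nodes).items():
    print(platform, hosts)
```

Compare the groups against the platforms your image was actually built for; a single-architecture image can only ever land on one of them.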

[!NOTE]
Common Network Hang: If you’re using an overlay network that hasn't synchronized across your managers correctly, the task might stay Pending while it waits for a network ID that doesn't exist on the target worker.

How to Debug Like a Pro in 2026

When a service stays Pending for more than 30 seconds, follow this checklist:

Check the "Desired State": Is Swarm actually trying to scale up?
Inspect the Task: Use docker inspect <task_id> and look at the Status block, especially its Message and Err fields. This is where the "real" reason is hidden.
Check Node Health: Is the node marked as Ready (status) and Active (availability)? A node set to Drain or Pause will not accept new tasks.
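The checklist above can be sketched as two small helpers. They assume the JSON shapes printed by docker inspect <task_id> (an array whose first element has DesiredState and a Status block) and docker node inspect (Status.State and Spec.Availability in lowercase). The sample payloads are hypothetical, so check the field names against your own output.

```python
import json


def task_status(inspect_output: str) -> dict:
    """Pull the fields that explain a stuck task out of `docker inspect <task_id>` JSON."""
    task = json.loads(inspect_output)[0]  # inspect returns a JSON array
    status = task.get("Status", {})
    return {
        "desired": task.get("DesiredState"),
        "state": status.get("State"),
        "message": status.get("Message"),
        "err": status.get("Err"),
    }


def schedulable(node: dict) -> bool:
    """True if a node (parsed `docker node inspect` JSON) can receive new tasks."""
    ready = node.get("Status", {}).get("State") == "ready"
    active = node.get("Spec", {}).get("Availability") == "active"
    return ready and active


# Hypothetical task payload for a task stuck on a resource reservation.
sample_task = json.dumps([{
    "DesiredState": "running",
    "Status": {
        "State": "pending",
        "Message": "pending task scheduling",
        "Err": "no suitable node (insufficient resources on 1 node)",
    },
}])

# Hypothetical nodes: one healthy, one drained for maintenance.
healthy = {"Status": {"State": "ready"}, "Spec": {"Availability": "active"}}
drained = {"Status": {"State": "ready"}, "Spec": {"Availability": "drain"}}

print(task_status(sample_task))
print(schedulable(healthy), schedulable(drained))
```

A desired state of running combined with a pending status and a non-empty err value tells you the scheduler is trying and exactly which filter is blocking it.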

The SwarmCLI "Doctor" Command

If you’re tired of digging through nested JSON, we’ve simplified the process:

swarmcli analyze <service_name>

This command runs a diagnostic sweep of your cluster, comparing your service requirements against your current node health. It will tell you exactly why your task is stuck—whether it’s a missing label, a full disk, or a memory reservation conflict.

Summary: Don't Let "Pending" Be Permanent

Docker Swarm’s auto-healing is brilliant, but it’s not magic. It needs a valid "landing strip" to place your containers. By keeping an eye on your resource reservations, placement constraints, and node availability, and using SwarmCLI to visualize your cluster capacity, you can catch most scheduling issues before they even trigger an alert.

2026 Docker Swarm Mastery Series

Mar 9: [The Foundation] The Definitive Docker Swarm Guide for 2026.
Mar 12: [Expert Analysis] Docker Swarm vs Kubernetes in 2026 — The Case for Staying Simple.
Mar 16: [Edge Frontier] Setting up the ultimate 3-node Swarm on Raspberry Pi 5.
Mar 19: [Security Specialist] Secure by Design: Managing Docker Swarm Secrets the SwarmCLI Way.
Mar 23: [Ops Mastery] Docker Swarm Auto-healing: A Guide to Troubleshooting 'Pending' States.

Why SwarmCLI?

By 2026, we noticed a gap. Docker Swarm was rock solid, but the management tooling felt stuck in 2017. SwarmCLI bridges that gap with:

Real-time Health: Stop guessing which node is throttled.
Atomic Secret Sync: One-command .env to Raft encryption.
Edge-Optimized: Built in Go for minimal overhead on ARM/RPi5 devices.

DEV Community