Darian Vance

Posted on Jan 18 • Edited on Jan 20 • Originally published at wp.me

Solved: Auto-Restart Crashing Docker Containers with a Simple Python Watchdog

#devops #programming #tutorial #cloud

🚀 Executive Summary

TL;DR: Docker’s built-in restart policies often fall short in handling sophisticated container failures or unhealthy states. This article provides a simple Python watchdog script that actively monitors Docker containers, leveraging the docker-py SDK and Docker health checks, to programmatically restart them when they crash or become unhealthy.

🎯 Key Takeaways

Docker’s native on-failure or always restart policies are reactive and lack the sophisticated logic needed for application-level failures or specific operational contexts.
A custom Python watchdog script, utilizing the docker-py SDK, offers granular control by actively monitoring container status (e.g., ‘running’, ‘exited’) and integrating with Docker’s HEALTHCHECK status (‘healthy’, ‘unhealthy’).
Containerizing the watchdog itself and mounting /var/run/docker.sock is essential for its operation, granting it the necessary permissions to interact with the Docker daemon, but requires careful security consideration.
The watchdog’s configuration, such as MONITORED_CONTAINER_NAMES and CHECK_INTERVAL_SECONDS, can be easily managed via environment variables, enhancing its flexibility and deployability.

As seasoned SysAdmins, Developers, and DevOps Engineers, we’ve all faced the frustrating scenario: a critical Docker container unexpectedly crashes, bringing down a service, and requires manual intervention to get it back online. While Docker offers restart policies like on-failure or always, these are often reactive and lack the sophisticated logic needed for specific operational contexts, such as monitoring for a truly “unhealthy” state or reacting to application-level failures beyond simple exits.

Manual restarts are not only tedious and prone to human error but also inefficient in a world striving for automation. Relying solely on Docker’s built-in restart policies can sometimes be too blunt an instrument, leading to endless restart loops for persistently broken containers or failing to detect subtle issues that don’t result in an immediate container exit.

Our Simple, Effective Solution

This tutorial presents a practical and lightweight solution: a custom Python watchdog script. This script will actively monitor your Docker containers, identify those that have crashed, exited unexpectedly, or are reported as unhealthy (if Docker health checks are configured), and then programmatically restart them. Unlike complex orchestration tools that might be overkill for this specific problem, our Python watchdog offers:

- Simplicity: Easy to understand, customize, and deploy.
- Flexibility: Tailor the monitoring logic to your exact needs.
- Cost-Effectiveness: Leverage existing infrastructure without introducing expensive SaaS solutions or heavy frameworks.

Let’s dive in and build a robust, custom container supervisor.

Prerequisites

Before we begin, ensure you have the following in your environment:

- Python 3.6+: Installed and accessible on your system.
- pip: Python’s package installer, typically bundled with Python installations.
- Docker: The Docker daemon must be running, and the Docker CLI should be installed and functional. The watchdog script will interact with the Docker daemon via its socket.
- Basic Python and Docker knowledge: Familiarity with writing Python scripts and common Docker commands will be helpful.

Step-by-Step Guide

Step 1: Setting Up Your Environment and the Docker SDK

Our Python watchdog will interact with the Docker daemon using the official Docker SDK for Python, often still called docker-py after its original name and published on PyPI as the docker package. First, we need to install it.

Open your terminal and run the following command:

pip install docker

Next, let’s create a small Python script to verify that the SDK can connect to your Docker daemon. Create a file named check_docker.py:

import docker

try:
    client = docker.from_env()
    print("Successfully connected to Docker daemon.")
    print(f"Docker server version: {client.version().get('Version')}")
    print("\nListing existing containers:")
    for container in client.containers.list():
        print(f" - {container.name} ({container.status})")
except Exception as e:
    print(f"Error connecting to Docker daemon: {e}")
    print("Please ensure Docker is running and you have appropriate permissions (e.g., user in 'docker' group, or using 'sudo').")

Run this script:

python check_docker.py

You should see output indicating a successful connection and a list of your running containers. If you encounter permission errors, ensure your user is part of the docker group (you might need to log out and back in after adding yourself) or run the script with sudo (though adding your user to the group is generally preferred for development).
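If the connection fails, it can help to rule out socket problems before touching the SDK. The following is a minimal sketch using only the standard library; the function name describe_socket_access and its return strings are made up for illustration, and it assumes the daemon listens on the default Unix socket path.

```python
import os
import stat


def describe_socket_access(path="/var/run/docker.sock"):
    """Return a short diagnosis of whether the current user can use the socket.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    if not os.path.exists(path):
        return "missing"            # Docker not running, or a non-default socket path
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "not-a-socket"       # Something else lives at that path
    if not os.access(path, os.R_OK | os.W_OK):
        return "permission-denied"  # Usually fixed by joining the docker group
    return "ok"


if __name__ == "__main__":
    print(describe_socket_access())
```

A result of permission-denied points at group membership, while missing suggests the daemon is stopped or DOCKER_HOST refers to a different endpoint.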

Step 2: Crafting Your Python Watchdog Script

Now, let’s write the core watchdog script. This script will continuously check the status of specific containers and restart them if they’re not in a running state.

Create a file named docker_watchdog.py:

import docker
import time
import os

# Configuration
MONITORED_CONTAINER_NAMES = os.getenv('MONITORED_CONTAINER_NAMES', 'my-flaky-app,another-service').split(',')
CHECK_INTERVAL_SECONDS = int(os.getenv('CHECK_INTERVAL_SECONDS', '15'))  # Check every 15 seconds

def main():
    print("Docker Watchdog started.")
    print(f"Monitoring containers: {MONITORED_CONTAINER_NAMES}")
    print(f"Check interval: {CHECK_INTERVAL_SECONDS} seconds")

    client = docker.from_env()

    while True:
        try:
            for container_name in MONITORED_CONTAINER_NAMES:
                container_name = container_name.strip()
                if not container_name:
                    continue

                try:
                    # Get the container by name. If not found, it raises a NotFound exception.
                    container = client.containers.get(container_name)
                except docker.errors.NotFound:
                    print(f"Container '{container_name}' not found. Skipping.")
                    continue

                status = container.status
                print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Monitoring '{container.name}' (ID: {container.short_id}). Current status: {status}")

                if status != 'running':
                    print(f"  Container '{container.name}' is not 'running' (status: {status}). Attempting to restart...")
                    try:
                        container.restart()
                        print(f"  Successfully restarted '{container.name}'.")
                    except docker.errors.APIError as e:
                        print(f"  Error restarting '{container.name}': {e}")
                else:
                    print(f"  Container '{container.name}' is running as expected.")
        except docker.errors.APIError as e:
            print(f"Docker API error: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == '__main__':
    main()

Logic Explanation:

- Configuration: We define MONITORED_CONTAINER_NAMES as a comma-separated string, allowing us to specify which containers the watchdog should oversee. This is loaded from environment variables for easy configuration without modifying the script.
- Docker Client: docker.from_env() initializes the Docker client, automatically configuring itself based on environment variables (like DOCKER_HOST) or by trying the default Unix socket.
- Infinite Loop: The while True: loop ensures the watchdog runs continuously.
- Container Identification: For each configured container name, client.containers.get(container_name) fetches the container object. We include error handling for NotFound if a container doesn’t exist.
- Status Check: container.status retrieves the current state (e.g., ‘running’, ‘exited’, ‘restarting’, ‘paused’).
- Restart Logic: If the status is anything other than 'running', the script calls container.restart(). This method attempts to gracefully stop and then start the container.
- Error Handling: Specific docker.errors.APIError and general Exception handling are included for robustness.
- Sleep Interval: time.sleep(CHECK_INTERVAL_SECONDS) pauses the script, preventing it from hammering the Docker daemon and consuming excessive resources.
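The configuration step described above can also be isolated into a small function, which makes bad values fail early with a clear message. This is only a sketch: load_config is a hypothetical helper, and defaulting to an empty container list (rather than the sample names used in the script) is an assumption made here for illustration.

```python
import os


def load_config(environ=os.environ):
    """Parse watchdog settings from a mapping of environment variables.

    Hypothetical helper; returns (container_names, interval_seconds).
    """
    raw_names = environ.get("MONITORED_CONTAINER_NAMES", "")
    # Drop surrounding whitespace and empty entries such as "a,,b" or a trailing comma.
    names = [name.strip() for name in raw_names.split(",") if name.strip()]

    raw_interval = environ.get("CHECK_INTERVAL_SECONDS", "15")
    try:
        interval = int(raw_interval)
    except ValueError:
        raise ValueError(
            f"CHECK_INTERVAL_SECONDS must be an integer, got {raw_interval!r}"
        ) from None
    if interval <= 0:
        raise ValueError("CHECK_INTERVAL_SECONDS must be greater than zero")

    return names, interval
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary instead of real environment variables.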

Step 3: Integrating Docker Health Checks (Recommended)

While checking for a running status is good, it doesn’t always reflect the application’s true health. A container can be ‘running’ but have its application crashed internally. Docker’s built-in HEALTHCHECK instruction in a Dockerfile allows you to define a command to check the container’s health. The Docker daemon then reports this status.

Let’s enhance our watchdog to also consider a container’s health status. First, ensure your monitored containers have a HEALTHCHECK defined in their Dockerfile (example below):

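# Example Dockerfile with HEALTHCHECK
# Plain alpine is used here because the alpine/git image sets git as its entrypoint,
# which would turn the CMD below into "git sleep infinity".
FROM alpine
CMD ["sleep", "infinity"]

# Define a health check that checks every 5 seconds, times out after 2 seconds, retries 3 times.
HEALTHCHECK --interval=5s --timeout=2s --retries=3 CMD echo "Still alive" || exit 1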

Now, modify docker_watchdog.py to include health status checks. We’ll treat an 'unhealthy' status as a reason to restart.

# ... (existing imports and configuration) ...

def main():
    print("Docker Watchdog started.")
    print(f"Monitoring containers: {MONITORED_CONTAINER_NAMES}")
    print(f"Check interval: {CHECK_INTERVAL_SECONDS} seconds")

    client = docker.from_env()

    while True:
        try:
            for container_name in MONITORED_CONTAINER_NAMES:
                container_name = container_name.strip()
                if not container_name:
                    continue

                try:
                    container = client.containers.get(container_name)
                except docker.errors.NotFound:
                    print(f"Container '{container_name}' not found. Skipping.")
                    continue

                status = container.status
                health_status = None

                # Check if health status is available
                if 'Health' in container.attrs.get('State', {}):
                    health_status = container.attrs['State']['Health']['Status']

                print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] Monitoring '{container.name}' (ID: {container.short_id}). Current status: {status}, Health: {health_status if health_status else 'N/A'}")

                # Restart if the container is not running OR if it has a healthcheck and is unhealthy.
                # 'starting' is left alone so a container is not restarted before its first check completes.
                if status != 'running' or health_status == 'unhealthy':
                    reason = f"status: {status}"
                    if health_status == 'unhealthy':
                        reason += f", health: {health_status}"

                    print(f"  Container '{container.name}' is in an undesirable state ({reason}). Attempting to restart...")
                    try:
                        container.restart()
                        print(f"  Successfully restarted '{container.name}'.")
                    except docker.errors.APIError as e:
                        print(f"  Error restarting '{container.name}': {e}")
                else:
                    print(f"  Container '{container.name}' is running and healthy as expected.")
        except docker.errors.APIError as e:
            print(f"Docker API error: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == '__main__':
    main()

Logic Explanation of Health Check Integration:

- Accessing Health Status: The health status is found within the container’s attributes (container.attrs) under State['Health']['Status']. For a container with a health check this will be ‘starting’, ‘healthy’, or ‘unhealthy’. When no health check is defined, the Health entry is normally absent, so the script leaves health_status as None and prints N/A.
- Combined Condition: The restart condition now checks if status != 'running' or health_status == 'unhealthy'. This means the container will be restarted if it’s not running, or if it is running but Docker reports it as ‘unhealthy’. A container whose health is still ‘starting’ is left alone; restarting on anything other than ‘healthy’ would restart containers that have not yet completed their first check and could cause a restart loop.
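The decision logic can be pulled out of the loop into pure functions, which makes it testable without a Docker daemon. The sketch below is one possible shape; get_health_status and restart_reason are hypothetical names, and restarting only on an explicit 'unhealthy' (while leaving 'starting' alone) is a design assumption of this example.

```python
def get_health_status(attrs):
    """Extract the health status from a container's attrs dict, or None.

    Hypothetical helper: attrs is expected to look like container.attrs,
    where State.Health may be missing when no HEALTHCHECK is defined.
    """
    health = (attrs.get("State") or {}).get("Health")
    return health.get("Status") if health else None


def restart_reason(status, health_status):
    """Return a human-readable reason to restart, or None to leave the container alone."""
    if status != "running":
        return f"status: {status}"
    if health_status == "unhealthy":
        return f"status: {status}, health: {health_status}"
    # 'healthy', 'starting', or no health check at all: nothing to do.
    return None
```

Inside the watchdog loop this could be used as reason = restart_reason(container.status, get_health_status(container.attrs)), followed by a restart only when reason is not None.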

Step 4: Containerizing the Watchdog for Reliable Operation

To ensure our watchdog itself is resilient and always running, we should containerize it. This allows it to run alongside your other Docker containers and benefit from Docker’s management features, including its own restart policies.

Create a Dockerfile in the same directory as docker_watchdog.py:

# Use a lightweight Python base image
FROM python:3.9-slim-buster

# Set working directory
WORKDIR /app

# Copy the Python script and requirements file
# If you had a requirements.txt, you'd copy and install it here:
# COPY requirements.txt .
# RUN pip install --no-cache-dir -r requirements.txt
COPY docker_watchdog.py .

# Install the Docker SDK
RUN pip install --no-cache-dir docker

# Command to run the watchdog script
# Use exec form to ensure signals are handled correctly.
# The -u flag disables output buffering so print() output shows up promptly in docker logs.
CMD ["python", "-u", "docker_watchdog.py"]

Dockerfile Explanation:

- Base Image: We use a slim Python image for a smaller container footprint.
- WORKDIR: Sets the working directory inside the container.
- Copy Script: Copies our Python watchdog script into the container.
- Install Dependencies: Installs the docker-py library.
- CMD: Specifies the command to execute when the container starts.

Now, build your Docker image:

docker build -t techresolve/docker-watchdog .

Finally, run the watchdog container. This is a critical step, as the watchdog needs access to the Docker daemon’s socket to perform its tasks. We achieve this by mounting /var/run/docker.sock into the watchdog container.

docker run -d \
  --name docker-watchdog \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e MONITORED_CONTAINER_NAMES="my-flaky-app,another-service" \
  -e CHECK_INTERVAL_SECONDS="10" \
  techresolve/docker-watchdog

Run Command Explanation:

- -d: Runs the container in detached mode (in the background).
- --name docker-watchdog: Assigns a friendly name to the watchdog container.
- --restart unless-stopped: This is for the watchdog itself. It ensures the watchdog container will automatically restart if Docker restarts or if the watchdog container itself exits, unless it was explicitly stopped.
- -v /var/run/docker.sock:/var/run/docker.sock: This is the most crucial part. It mounts the Docker daemon’s Unix socket from the host into the container. This grants the watchdog container the ability to communicate with and control the Docker daemon. Be aware: This grants full control over your Docker daemon to the container. Only run trusted images with this setup.
- -e MONITORED_CONTAINER_NAMES="..." and -e CHECK_INTERVAL_SECONDS="...": Pass the configuration values as environment variables, as implemented in our Python script. Adjust the container names to match those you actually want to monitor.
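For setups that launch containers from Python rather than the CLI, the same flags map onto keyword arguments of the SDK's client.containers.run(). The following is a rough sketch under that assumption; watchdog_run_kwargs and start_watchdog are hypothetical helper names, and the image tag and container names are simply the ones used in this tutorial.

```python
def watchdog_run_kwargs(container_names, interval_seconds):
    """Build keyword arguments for client.containers.run() mirroring the CLI flags.

    Hypothetical helper; the values match the docker run command in this tutorial.
    """
    return {
        "image": "techresolve/docker-watchdog",
        "name": "docker-watchdog",                      # --name
        "detach": True,                                 # -d
        "restart_policy": {"Name": "unless-stopped"},   # --restart unless-stopped
        "volumes": {                                    # -v host:container
            "/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"},
        },
        "environment": {                                # -e KEY=value
            "MONITORED_CONTAINER_NAMES": ",".join(container_names),
            "CHECK_INTERVAL_SECONDS": str(interval_seconds),
        },
    }


def start_watchdog(container_names, interval_seconds=10):
    """Start the watchdog container; requires the docker package and a running daemon."""
    import docker  # Imported lazily so the kwargs builder works without the SDK installed

    client = docker.from_env()
    return client.containers.run(**watchdog_run_kwargs(container_names, interval_seconds))
```

Calling start_watchdog(["my-flaky-app", "another-service"]) would then be roughly equivalent to the docker run command above. The same security caveat applies, since the socket mount is unchanged.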

Step 5: Testing Your Watchdog

To test if your watchdog is working, let’s create a “flaky” container that periodically exits. We’ll call it my-flaky-app and ensure its name is in our MONITORED_CONTAINER_NAMES list.

First, stop the watchdog container to avoid immediate restarts while setting up the test app:

docker stop docker-watchdog

Now, run a test container that will exit after a few seconds:

docker run -d --name my-flaky-app \
  --restart no \
  alpine sh -c "echo 'Hello, I will exit soon!'; sleep 10; exit 1"

The --restart no flag ensures Docker’s built-in policies won’t interfere. This container will run for 10 seconds and then exit with an error code.

Now, restart your watchdog container:

docker start docker-watchdog

Observe the logs of your watchdog container:

docker logs -f docker-watchdog

You should see output similar to this:

...
[TIMESTAMP] Monitoring 'my-flaky-app' (ID: ...). Current status: running, Health: N/A
...
[TIMESTAMP] Monitoring 'my-flaky-app' (ID: ...). Current status: exited, Health: N/A
  Container 'my-flaky-app' is in an undesirable state (status: exited). Attempting to restart...
  Successfully restarted 'my-flaky-app'.
...

You can also check the status of my-flaky-app with docker ps. You should see it periodically going from exited to Up X seconds as the watchdog restarts it.

Common Pitfalls and Troubleshooting

- Docker Socket Permissions: The most common issue is the watchdog not being able to access /var/run/docker.sock. When running the script directly on the host, make sure the user running it can read and write the socket; the socket is typically owned by root with the docker group, so adding your user to that group is usually enough, and there is no need to loosen the socket’s permissions. When running the watchdog inside a container, mounting -v /var/run/docker.sock:/var/run/docker.sock is critical, and the Python process inside the container needs the correct permissions (often root, or a user with the same GID as the Docker group on the host).
- Aggressive Restart Loops: If your application crashes immediately after every restart, the watchdog will restart it on every check, once per CHECK_INTERVAL_SECONDS, indefinitely. This wastes CPU and floods the logs. For production scenarios, implement an exponential backoff or a maximum restart count within your script, or let Docker’s built-in on-failure policy (if applied to the application container) manage backoff before the watchdog intervenes.
- Incorrect Container Names: Double-check the MONITORED_CONTAINER_NAMES environment variable. Typos or incorrect container names will result in the watchdog reporting “Container ‘…’ not found. Skipping.”
- Resource Consumption: While lightweight, running many watchdogs or very frequent checks can add overhead. Adjust CHECK_INTERVAL_SECONDS based on the criticality of the service and the acceptable downtime.
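The restart-loop pitfall above can be softened with a small amount of bookkeeping per container. Here is a minimal sketch of exponential backoff with a restart cap; the RestartBackoff class, its method names, and the default delays are all assumptions for illustration, not part of the original script.

```python
import time


class RestartBackoff:
    """Track restarts per container and enforce a growing delay between them.

    Hypothetical helper: delays double after each restart (base, 2x, 4x, ...)
    up to max_delay, and restarts stop entirely after max_restarts.
    """

    def __init__(self, base_delay=5.0, max_delay=300.0, max_restarts=5,
                 clock=time.monotonic):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_restarts = max_restarts
        self.clock = clock
        self._state = {}  # container name -> (restart_count, next_allowed_time)

    def may_restart(self, name):
        """Return True if a restart is currently allowed for this container."""
        count, next_allowed = self._state.get(name, (0, 0.0))
        if count >= self.max_restarts:
            return False  # Give up until reset() is called
        return self.clock() >= next_allowed

    def record_restart(self, name):
        """Note that a restart just happened and schedule the next allowed time."""
        count, _ = self._state.get(name, (0, 0.0))
        delay = min(self.base_delay * (2 ** count), self.max_delay)
        self._state[name] = (count + 1, self.clock() + delay)

    def reset(self, name):
        """Forget the history, e.g. once the container has been healthy for a while."""
        self._state.pop(name, None)
```

In the watchdog loop, the call to container.restart() would be guarded by may_restart(container.name) and followed by record_restart(container.name). When to call reset() is a policy decision left open here.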

Conclusion and Next Steps

You’ve successfully built and deployed a custom Python watchdog to automatically restart crashing or unhealthy Docker containers. This simple yet powerful solution provides fine-grained control over your container recovery strategy, moving beyond Docker’s basic restart policies.

While this script is functional, consider these enhancements for a production-ready system:

- Notifications: Integrate with Slack, email, PagerDuty, or other alerting systems to notify you when a container is restarted.
- Configuration File: Externalize container names and other settings into a YAML or JSON file for easier management.
- Retry Logic with Backoff: Implement a more sophisticated retry mechanism, potentially with exponential backoff, to avoid thrashing constantly failing containers.
- Logging: Enhance logging with structured logs (e.g., JSON) and direct them to a centralized logging system.
- Graceful Shutdown: Add signal handling to the Python script to allow for graceful shutdowns.
- Monitoring Multiple Docker Instances: For more complex setups, consider using the Docker SDK to connect to remote Docker daemons.
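As an illustration of the graceful-shutdown item, the loop can wait on a threading.Event instead of calling time.sleep(), so that SIGTERM (which docker stop sends) ends the loop promptly. This is a sketch only; run_loop, request_stop, and install_handlers are hypothetical names, and check_once stands in for one pass over the monitored containers.

```python
import signal
import threading

stop_event = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the main loop to finish after the current pass."""
    stop_event.set()


def install_handlers():
    """Register handlers for SIGTERM (docker stop) and SIGINT (Ctrl+C)."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run_loop(check_once, interval_seconds, stop=stop_event):
    """Call check_once() repeatedly until stop is set; return the number of passes."""
    passes = 0
    while not stop.is_set():
        check_once()
        passes += 1
        # Unlike time.sleep(), wait() returns as soon as the event is set.
        stop.wait(interval_seconds)
    return passes
```

A main() built on this would call install_handlers() once and then run_loop(check_all_containers, CHECK_INTERVAL_SECONDS), where check_all_containers is whatever function performs a single monitoring pass.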

This watchdog is a testament to how simple scripting can solve real-world DevOps challenges, offering flexibility and control that off-the-shelf solutions sometimes lack. Happy container supervising!

👉 Read the original article on TechResolve.blog
