divyanshu_Kumar

I Learned Ansible From Scratch for My Open-Source Project — Here's the Full Breakdown

This is Part 2 of my pre-implementation journey for a feature I am contributing to the open-source Debezium Platform project. In Part 1, I learned SSH from scratch: I generated key pairs, got to grips with ~/.ssh/config, fixed permission errors, and built two Docker containers as fake SSH servers on my MacBook.

If you haven't read Part 1, here's the short version: I have two Docker containers (db-server-1 and db-server-2) that accept SSH connections via key-based auth using a config alias. My ~/.ssh/config file looks like this:

Host db-server-1
    HostName 127.0.0.1
    User deploy
    Port 2201
    IdentityFile ~/.ssh/ddd41_practice

Host db-server-2
    HostName 127.0.0.1
    User deploy
    Port 2202
    IdentityFile ~/.ssh/ddd41_practice

Now the question is: what do I actually do with those SSH connections?

The answer is Ansible.



Why Ansible — And Why Not Just Java + SSH?

This was the first question I had to answer before studying Ansible at all.

Think about it: I already have SSH access to my remote servers. So why can't I just open those connections from Java and run shell commands? Well, SSH gives you a tunnel — it doesn't give you an automation engine. If I need to provision multiple remote servers from my main machine, SSH alone becomes incredibly tedious. I'd have to manually handle every step:

  • Install Docker
  • Start the Docker daemon
  • Add the SSH user to the docker group
  • Deploy the Host Agent as a systemd service
  • Pre-pull the Debezium Server Docker image

My first instinct as a Java developer was: "Can't I just use JSch or Apache MINA-SSHD and do all of this in Java?"

Let me show you why that instinct was wrong:

| Approach | What you'd have to write |
| --- | --- |
| Pure Java + SSH library | OS detection logic, apt vs yum/dnf branching, error handling, retry logic, idempotency checks, all repeated for every single step |
| Ansible + YAML playbook | ~150 lines of YAML. All of the above is handled by built-in Ansible modules. |

Ansible handles OS detection (ansible_os_family), idempotency (modules check current state before acting), error reporting, retries, and parallel execution — for free. The Debezium design document puts it plainly: Java just runs ProcessBuilder to call ansible-playbook. Ansible does the heavy lifting.
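To make the first row of that comparison concrete, here is a minimal sketch of the kind of branching you would end up writing by hand. The function name pick_install_cmd and the list of distro IDs are my own illustration, not something from the Debezium design document, and the mapping is deliberately incomplete.

```shell
#!/bin/sh
# Hypothetical helper: map an /etc/os-release ID to a package install command.
# The ID list is illustrative only; older RHEL/CentOS releases use yum instead.
pick_install_cmd() {
  case "$1" in
    debian|ubuntu)      echo "apt-get install -y" ;;
    rhel|centos|fedora) echo "dnf install -y" ;;
    *)                  echo "unsupported distro: $1" >&2; return 1 ;;
  esac
}

# Every provisioning step would need a lookup like this before it can act.
echo "$(pick_install_cmd ubuntu) docker.io"
echo "$(pick_install_cmd fedora) docker"
```

And this covers only package installation. Error handling, retries, and "is it already installed?" checks would each need the same treatment.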



Setting Up Ansible

Installing Ansible on macOS is a one-liner:

brew install ansible

Verify the installation:

ansible --version

You should see ansible [core 2.17.x] or similar. Then install the Docker community collection. Its community.docker.docker_image module can pull images idempotently, although the playbook in this post still uses a plain docker pull:

ansible-galaxy collection install community.docker

The Mental Model — Four Things You Need to Understand

Before I ran a single Ansible command, I forced myself to understand four core concepts. Without these, you're just copying commands without knowing why they work.

1. Control Node

Your Mac (or whatever machine you're running Ansible from). This is where Ansible is installed and where you execute ansible-playbook. Here's the important part: Ansible does not need to be installed on the remote machines — only on your control node.

2. Managed Node

The remote host where Ansible will make changes. It only needs Python 3 and SSH access. Both of our Docker containers qualify.

3. Inventory

A list of hosts for Ansible to target. This can be a file (like inventory.ini) or an inline string passed on the command line.

The design uses inline ad-hoc inventory — a comma-separated string passed directly via the -i flag:

ansible-playbook host-setup.yml -i "db-server-1,"

See that trailing comma after the hostname? That is not a typo — it is required. Without it, Ansible interprets the string as a filename and tries to open a file called db-server-1 on your disk. The comma tells the parser: "This is a comma-separated list of hosts that happens to contain exactly one item."

💡 Why not use an inventory file? Because hosts in DDD-41 are dynamic — they come from ~/.ssh/config, not a static file. The Java HostProvisioningService provisions one host at a time and builds the inventory string programmatically.
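
The trailing-comma rule is easy to get wrong when the inventory string is built in code. Here is a minimal shell sketch of the idea. build_inventory is a name I made up for illustration, and the real HostProvisioningService does this in Java, so treat it as an analogy rather than the project's code.

```shell
#!/bin/sh
# Hypothetical helper: join SSH aliases into an inline inventory string.
# Every alias is followed by a comma, so even a single host ends with the
# trailing comma that marks the value as a host list, not a file path.
build_inventory() {
  inventory=""
  for alias_name in "$@"; do
    inventory="${inventory}${alias_name},"
  done
  printf '%s\n' "$inventory"
}

build_inventory db-server-1               # prints: db-server-1,
build_inventory db-server-1 db-server-2   # prints: db-server-1,db-server-2,
```

Always appending the comma keeps the one-host and many-host cases on the same code path. As far as I can tell, the extra comma after the last of several hosts is harmless because empty entries in the list are ignored.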

And here's the elegant part: when Ansible sees db-server-1 in the inventory, it doesn't need you to explicitly pass an IP address, port, or private key path. It calls your system's native OpenSSH client under the hood, which automatically reads ~/.ssh/config, resolves the alias, and establishes the connection. Zero extra configuration.

4. Playbook

A YAML file describing what to do. It contains plays, each play contains tasks, and each task calls a module. Think of it as a nested structure:

Playbook
└── Play (target: all hosts in inventory)
    ├── Task 1: Bootstrap Python
    ├── Task 2: Install Docker
    ├── Task 3: Start Docker service
    ├── Task 4: Add user to docker group
    ├── Task 5: Deploy Host Agent as systemd service
    └── Task 6: Pre-pull Debezium Server image

How Ansible Reads ~/.ssh/config Automatically

This is the part that genuinely surprised me. I assumed I'd have to pass IP addresses, ports, and key paths directly into the playbook or build some mapping layer in Java.

Nope.

When Ansible connects to db-server-1, it delegates the connection to your system's OpenSSH client. OpenSSH automatically reads ~/.ssh/config. That means Ansible inherits your entire SSH configuration for free:

  • ✅ What IP address to connect to
  • ✅ What port to use
  • ✅ What username to log in with
  • ✅ Which private key to authenticate with

The sysadmin maintains ~/.ssh/config — everything else flows from it automatically.
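
You can check what OpenSSH will resolve for an alias without connecting at all: ssh -G prints the effective configuration and exits. The sketch below wraps that in a hypothetical helper, show_ssh_target, and points it at a throwaway config file so it does not depend on your real ~/.ssh/config.

```shell
#!/bin/sh
# show_ssh_target ALIAS [CONFIG]: print the host, port, user and key files
# OpenSSH resolves for ALIAS. `ssh -G` prints the effective configuration
# and exits without opening a connection.
show_ssh_target() {
  ssh -F "${2:-$HOME/.ssh/config}" -G "$1" |
    grep -E '^(hostname|port|user|identityfile) '
}

# Demo against a temporary config that mirrors the db-server-1 entry.
demo_config=$(mktemp)
cat > "$demo_config" <<'EOF'
Host db-server-1
    HostName 127.0.0.1
    User deploy
    Port 2201
EOF

if command -v ssh >/dev/null 2>&1; then
  show_ssh_target db-server-1 "$demo_config"
fi
```

With no second argument the helper reads ~/.ssh/config, which is the same file the SSH client consults when Ansible connects.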


The ansible.cfg File — Avoiding Repetitive Flags

Before running any commands, I created a project-level config file so I wouldn't have to repeat flags every time:

mkdir -p ~/ddd41-lab/ansible
cat > ~/ddd41-lab/ansible/ansible.cfg << 'EOF'
[defaults]
host_key_checking = False
gathering = explicit
timeout = 30

[ssh_connection]
ssh_args = -F ~/.ssh/config -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
pipelining = True
EOF

Here's what these settings do:

  • host_key_checking = False: Disables the "Are you sure you want to continue connecting?" prompt. This is fine for a lab environment. Never do this in production.
  • gathering = explicit: Tells Ansible not to auto-gather system facts at playbook start. We do it manually after bootstrapping Python, which matters on fresh hosts that might not have Python installed yet.
  • ssh_args: -F ~/.ssh/config explicitly tells SSH to use our config file. The two -o options (StrictHostKeyChecking=no and UserKnownHostsFile=/dev/null) skip host key verification and stop SSH from recording host keys, which again is acceptable only in a lab.
  • pipelining = True: Reduces the number of SSH operations needed per task, so execution is faster.

Ansible reads ansible.cfg from the current working directory when you run a command. So as long as I cd ~/ddd41-lab/ansible before running anything, this config is automatically active.
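
The current directory is only one of the places Ansible looks. As far as I understand the documented search order, the ANSIBLE_CONFIG environment variable wins, then ./ansible.cfg, then ~/.ansible.cfg, then /etc/ansible/ansible.cfg. The sketch below mimics that order so you can predict which file applies. which_ansible_cfg is a hypothetical helper, and the authoritative answer is the "config file =" line printed by ansible --version.

```shell
#!/bin/sh
# Hypothetical helper that mimics Ansible's config file search order.
# It only predicts the result; it does not ask Ansible itself.
which_ansible_cfg() {
  for candidate in "${ANSIBLE_CONFIG:-}" ./ansible.cfg \
                   "$HOME/.ansible.cfg" /etc/ansible/ansible.cfg; do
    if [ -n "$candidate" ] && [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no config file found, built-in defaults apply"
}

which_ansible_cfg
```

One caveat I am aware of: Ansible refuses to load ansible.cfg from a world-writable directory, so keep the lab folder's permissions tight.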


First Test: Ad-Hoc Ping

An ad-hoc command runs a single Ansible module without a playbook — perfect for quick connectivity checks.

cd ~/ddd41-lab/ansible
ansible db-server-1 -i "db-server-1," -m ping

⚠️ Common misconception: The Ansible ping module does NOT send ICMP packets (unlike the ping command in your terminal). It connects via SSH, runs a tiny Python script on the remote host, and verifies that Python is available and working. A successful ping means: SSH works AND Python is installed.

To test both servers at once:

ansible all -i "db-server-1,db-server-2" -m ping

Ansible ping output showing SUCCESS for both servers

Both servers are alive and responding. Time to write a real playbook.


My First Real Playbook — A Connectivity Check

cat > ~/ddd41-lab/ansible/01-ping.yml << 'EOF'
---
- name: Verify connectivity to all lab hosts
  hosts: all
  gather_facts: false

  tasks:
    - name: Ping the host
      ansible.builtin.ping:

    - name: Print hostname
      ansible.builtin.command: hostname
      register: result

    - name: Show hostname output
      ansible.builtin.debug:
        msg: "Remote hostname is: {{ result.stdout }}"
EOF

Run it:

ansible-playbook 01-ping.yml -i "db-server-1,db-server-2"

Playbook output showing successful ping and hostname for both servers

Understanding Task Status Colors

Ansible uses colour-coded output so you can read results at a glance:

| Status | Color | What It Means |
| --- | --- | --- |
| ok | 🟢 Green | Task ran successfully, nothing needed to change |
| changed | 🟡 Yellow | Task ran and made a modification |
| skipped | 🔵 Cyan | Task was skipped (condition not met) |
| failed | 🔴 Red | Task failed |
| unreachable | 🔴 Red | Could not connect to the host at all |

YAML Basics — Because Playbooks Are All YAML

Before writing the full playbook, I needed to make sure my YAML fundamentals were solid. One wrong indentation and the entire playbook breaks. Here are the rules:

  • Spaces only, never tabs. Ansible recommends 2 spaces for indentation.
  • Lists start with - (dash + space).
  • Dictionaries use : (colon + space).
  • Comments start with #.
  • Strings with special characters need quotes.

🫠 If you have tabs in a YAML file, Ansible throws the most cryptic error messages you've ever seen. I learned this the hard way when I copy-pasted a snippet from a web page that had invisible tab characters. Spent 20 minutes debugging a syntax error that was literally invisible.
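
A quick way to catch this before Ansible does is to search for literal tab characters. This is a small sketch: find_tabs is just an illustrative name, and the demo file is created on the fly.

```shell
#!/bin/sh
# find_tabs FILE: print each line of FILE that contains a literal tab,
# prefixed with its line number. Exit status is non-zero when none is found.
find_tabs() {
  grep -n "$(printf '\t')" "$1"
}

# Demo: a two-line YAML file whose second line is indented with a tab.
demo_file=$(mktemp)
printf 'tasks:\n\t- name: Ping the host\n' > "$demo_file"
find_tabs "$demo_file"
```

For a fuller check, ansible-playbook --syntax-check 01-ping.yml parses the playbook without running any task.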



Building the DDD-41 Provisioning Playbook

Now we get to the core of the project. Our playbook needs to execute 6 specific provisioning steps to transform a clean server into a fully managed Debezium host.

Fair warning: if you copy the standard textbook templates for these steps, your pipeline will crash. Below are the real-world errors I hit and how I actually fixed them.


Step 1: Bootstrap Python (The "Permission Denied" Trap)

Ansible is agentless, but almost all of its modules need Python on the remote host to execute. If a server is completely fresh, Python might not be installed yet. The ansible.builtin.raw module solves this chicken-and-egg problem: it sends the shell command straight over SSH, bypassing the Python requirement entirely.

The textbook version:

# ❌ WILL FAIL: Permission denied
- name: Bootstrap — install Python3 on Debian/Ubuntu
  ansible.builtin.raw: |
    apt-get update -qq && apt-get install -y python3 python3-pip
  changed_when: false

What actually happened: The playbook instantly crashed with a wall of red:

"E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)"

The fix: When Ansible logs into the server, it uses the standard user from your SSH config — in our case, the deploy user. A regular user doesn't have permission to install system packages. We need to tell Ansible to escalate privileges using sudo:

# ✅ CORRECT: Explicitly escalates privileges via sudo
- name: Bootstrap — install Python3 on Debian/Ubuntu
  ansible.builtin.raw: |
    apt-get update -qq && apt-get install -y python3 python3-pip
  changed_when: false
  become: true

That single line — become: true — maps directly to running the command with sudo. It works seamlessly here because our Docker container's base setup configures the deploy user with passwordless sudo access in the /etc/sudoers file (which we set up in Part 1).


Step 2: Gather Facts (The Deprecation Warning)

Once Python is bootstrapped, we can safely collect system information. Ansible handles this through the setup module:

- name: Gather facts
  ansible.builtin.setup:

This populates an internal inventory of the server — CPU architecture, RAM, Linux distribution, OS family, and more. We need this data for the next step where we conditionally install Docker based on the OS.

The gotcha: Older guides and tutorials reference top-level variables like when: ansible_os_family == "Debian". Modern versions of Ansible flag this with a bright yellow Deprecation Warning. To future-proof your playbooks, access facts through the formal dictionary:

# ❌ Old way (triggers deprecation warning)
when: ansible_os_family == "Debian"

# ✅ Modern way
when: ansible_facts['os_family'] == "Debian"

Step 3: Install Docker (The Package Name Mismatch)

The playbook needs to conditionally install Docker based on whether the target is running Debian/Ubuntu or RHEL/CentOS.

The textbook version:

# ❌ WILL FAIL: Package not found
- name: Install Docker (Debian/Ubuntu)
  ansible.builtin.apt:
    name:
      - docker.io
      - docker-compose-plugin
    state: present
    update_cache: yes
  become: true
  when: ansible_facts['os_family'] == "Debian"

What actually happened:

"msg": "No package matching 'docker-compose-plugin' is available"

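The fix: The textbook assumed our servers had Docker's official apt repository pre-configured. Our lab containers use vanilla Ubuntu repositories, where the package is simply called docker-compose, not docker-compose-plugin. The two are not identical (docker-compose-plugin is the newer Compose v2 plugin from Docker's repository, while Ubuntu's docker-compose is the older standalone tool), but for this lab the difference doesn't matter: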

# ✅ CORRECT: Uses package names from default Ubuntu repositories
- name: Install Docker (Debian/Ubuntu)
  ansible.builtin.apt:
    name:
      - docker.io
      - docker-compose
    state: present
    update_cache: yes
  become: true
  when: ansible_facts['os_family'] == "Debian"

📝 Lesson learned: Always verify package names against the actual repositories available on your target host. apt-cache search docker is your best friend.


Step 4: Start Docker Daemon (The Container Limitation)

Once Docker is installed, we tell the host to start the daemon and enable it on boot:

# ⚠️ EXPECTED TO FAIL IN LAB: No systemd inside Docker containers
- name: Start and enable Docker service
  ansible.builtin.service:
    name: docker
    state: started
    enabled: yes
  become: true
  ignore_errors: yes

What actually happened:

"System has not been booted with systemd as init system (PID 1)."

Why this is expected: Our "servers" are lightweight Docker containers, not real VMs. They don't run systemd as PID 1 — they lack a full init system. Since our goal here is to validate the Ansible playbook logic and Java ProcessBuilder orchestration (not to actually run Docker-in-Docker on a laptop), we add ignore_errors: yes. Ansible logs the failure, shrugs, and moves on to the next task.

On a real bare-metal server or cloud VM, this task would succeed without issues.


Step 5: Add User to Docker Group (The Undefined Variable)

To let our deploy user run Docker commands without sudo, we add it to the docker group:

# ⚠️ WILL FAIL: Variable not defined
- name: Add deploy user to docker group
  ansible.builtin.user:
    name: "{{ ansible_user }}"
    groups: docker
    append: yes
  become: true

🛡️ The append: yes flag is critical. Without it, Ansible's user module replaces all existing group memberships. With append: yes, it adds docker to the user's existing groups without removing anything. This is idempotency in action — if the user is already in the docker group, Ansible does nothing.

What actually happened:

"msg": "Error while resolving value for 'name': 'ansible_user' is undefined"

The fix: Standard Ansible setups define ansible_user in static inventory files (like hosts.ini). Since we use inline ad-hoc inventory (-i "db-server-1,"), there's no file where this variable is declared. Ansible can't resolve it.

The solution is to explicitly define it in a vars block at the top of the playbook:

  vars:
    ansible_user: "deploy"

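This ensures the template variable {{ ansible_user }} resolves correctly everywhere in the playbook. Keep in mind that ansible_user is also the variable Ansible uses for the SSH login name, so the value must match the User entry in ~/.ssh/config (deploy in our case).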


Step 6: Deploy the Host Agent as a Systemd Service

The Java orchestrator needs a persistent remote process to communicate with. Since we haven't compiled the real agent yet, the playbook deploys a mock agent — a lightweight shell script running a sleep loop — to validate that our directory structures, file permissions, and service configurations are structurally correct.

    - name: Create agent directory
      ansible.builtin.file:
        path: /opt/debezium-agent
        state: directory
        owner: "{{ ansible_user }}"
        mode: '0755'
      become: true

    - name: Create mock agent script (for lab testing)
      ansible.builtin.copy:
        dest: /opt/debezium-agent/agent.sh
        content: |
          #!/bin/bash
          echo "Debezium Host Agent starting on port {{ agent_port }}..."
          echo "Token: {{ agent_token | default('NO_TOKEN_PROVIDED') }}"
          while true; do sleep 3600; done
        owner: "{{ ansible_user }}"
        mode: '0755'
      become: true

    - name: Create systemd service for Host Agent
      ansible.builtin.copy:
        dest: /etc/systemd/system/debezium-agent.service
        content: |
          [Unit]
          Description=Debezium Host Agent
          After=network.target docker.service
          Requires=docker.service

          [Service]
          Type=simple
          User={{ ansible_user }}
          ExecStart=/opt/debezium-agent/agent.sh
          Restart=always
          RestartSec=5
          Environment="AGENT_PORT={{ agent_port }}"
          Environment="AGENT_TOKEN={{ agent_token | default('test-token') }}"

          [Install]
          WantedBy=multi-user.target
        mode: '0644'
      become: true

    - name: Reload systemd and start agent (expected to fail in lab)
      ansible.builtin.systemd:
        name: debezium-agent
        state: started
        enabled: yes
        daemon_reload: yes
      become: true
      ignore_errors: yes

The {{ agent_port }} and {{ agent_token }} placeholders use Jinja2 templating — Ansible's template engine. These values get injected at runtime through the -e (extra vars) flag when we run the playbook. And just like Step 4, the final systemd reload uses ignore_errors: yes because our Docker containers don't have a real init system.
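
Extra vars do not have to be typed inline. ansible-playbook also accepts -e @file, which loads variables from a JSON or YAML file and keeps the token out of the shell history and the process list. A minimal sketch, assuming a hypothetical helper write_extra_vars and a token that needs no JSON escaping:

```shell
#!/bin/sh
# write_extra_vars FILE PORT TOKEN: write the runtime values as JSON.
# Assumes TOKEN contains no quotes or backslashes (no JSON escaping is done).
write_extra_vars() {
  printf '{"agent_port": %s, "agent_token": "%s"}\n' "$2" "$3" > "$1"
}

vars_file=$(mktemp)   # mktemp creates the file readable by its owner only
write_extra_vars "$vars_file" 8090 test-bearer-token-abc123
cat "$vars_file"

# The playbook would then be started like this (not executed here):
#   ansible-playbook host-setup.yml -i "db-server-1," -e "@$vars_file"
```

For the lab token it makes no practical difference, but it is the form I would reach for once a real bearer token is involved.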


The Complete Playbook

Here's the full host-setup.yml with all six steps assembled:

---
################################################################################
# DDD-41 Host Provisioning Playbook
# Provisions a bare-metal or VM host to run Debezium Server containers.
# Usage: ansible-playbook host-setup.yml -i "<ssh-alias>,"
################################################################################

- name: Bootstrap and provision Debezium host
  hosts: all
  gather_facts: false

  vars:
    agent_port: 8090
    agent_version: "1.0.0"
    debezium_image: "quay.io/debezium/server:latest"
    ansible_user: "deploy"

  tasks:
    ############################################################
    # Step 1: Bootstrap Python using raw module
    ############################################################
    - name: Bootstrap — install Python3 on Debian/Ubuntu
      ansible.builtin.raw: |
        apt-get update -qq && apt-get install -y python3 python3-pip
      changed_when: false
      become: true

    - name: Gather facts
      ansible.builtin.setup:

    ############################################################
    # Step 2: Install Docker
    ############################################################
    - name: Install Docker (Debian/Ubuntu)
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose
        state: present
        update_cache: yes
      become: true
      when: ansible_facts['os_family'] == "Debian"

    - name: Install Docker (RHEL/CentOS)
      ansible.builtin.yum:
        name: docker
        state: present
      become: true
      when: ansible_facts['os_family'] == "RedHat"

    ############################################################
    # Step 3: Start and enable Docker daemon
    ############################################################
    - name: Start and enable Docker service
      ansible.builtin.service:
        name: docker
        state: started
        enabled: yes
      become: true
      ignore_errors: yes

    ############################################################
    # Step 4: Add SSH user to docker group
    ############################################################
    - name: Add deploy user to docker group
      ansible.builtin.user:
        name: "{{ ansible_user }}"
        groups: docker
        append: yes
      become: true

    ############################################################
    # Step 5: Pre-pull Debezium Server image
    ############################################################
    - name: Pre-pull Debezium Server Docker image
      ansible.builtin.shell: |
        docker pull {{ debezium_image }}
      become: true
      register: pull_result
      changed_when: "'Pull complete' in pull_result.stdout or 'Downloaded' in pull_result.stdout"
      ignore_errors: yes

    ############################################################
    # Step 6: Deploy the Host Agent as a systemd service
    ############################################################
    - name: Create agent directory
      ansible.builtin.file:
        path: /opt/debezium-agent
        state: directory
        owner: "{{ ansible_user }}"
        mode: '0755'
      become: true

    - name: Create mock agent script (for lab testing)
      ansible.builtin.copy:
        dest: /opt/debezium-agent/agent.sh
        content: |
          #!/bin/bash
          echo "Debezium Host Agent starting on port {{ agent_port }}..."
          echo "Token: {{ agent_token | default('NO_TOKEN_PROVIDED') }}"
          while true; do sleep 3600; done
        owner: "{{ ansible_user }}"
        mode: '0755'
      become: true

    - name: Create systemd service for Host Agent
      ansible.builtin.copy:
        dest: /etc/systemd/system/debezium-agent.service
        content: |
          [Unit]
          Description=Debezium Host Agent
          After=network.target docker.service
          Requires=docker.service

          [Service]
          Type=simple
          User={{ ansible_user }}
          ExecStart=/opt/debezium-agent/agent.sh
          Restart=always
          RestartSec=5
          Environment="AGENT_PORT={{ agent_port }}"
          Environment="AGENT_TOKEN={{ agent_token | default('test-token') }}"

          [Install]
          WantedBy=multi-user.target
        mode: '0644'
      become: true

    - name: Reload systemd and start agent (expected to fail in lab)
      ansible.builtin.systemd:
        name: debezium-agent
        state: started
        enabled: yes
        daemon_reload: yes
      become: true
      ignore_errors: yes

    - name: Report provisioning complete
      ansible.builtin.debug:
        msg: |
          ✅ Host {{ inventory_hostname }} provisioned successfully!
          Docker: installed and running
          Agent: deployed on port {{ agent_port }}

Running the Full Playbook

cd ~/ddd41-lab/ansible

ansible-playbook host-setup.yml \
  -i "db-server-1," \
  -e "agent_port=8090 agent_token=test-bearer-token-abc123"

And then... you hold your breath and watch the terminal scroll.



Reading the Output — What Those Errors Actually Mean

When the playbook finishes, you'll see some red text. Don't panic. Let me walk you through each piece.

1. The Docker Pull Error

failed to connect to the docker API... no such file or directory

Why this happened: Remember the "Start and enable Docker service" task (Step 3 in the playbook's comments), which failed because our Docker containers can't run systemd? Since the Docker engine isn't running, the pre-pull task (Step 5 in the playbook) physically cannot work.

Why it's fine: Look right below the error message. You'll see:

...ignoring

Our ignore_errors: yes safety net caught the crash and allowed the playbook to continue. Exactly as designed.

2. The Systemd Error

System has not been booted with systemd as init system (PID 1).

Why this happened: Same root cause. Docker containers don't have systemd running as PID 1. The agent service can't be started.

Why it's fine: Same ignore_errors: yes — Ansible logs it, prints ...ignoring, and moves on.

3. The Green Checkmark

"msg": "✅ Host db-server-1 provisioned successfully!\nDocker: installed and running\nAgent: deployed on port 8090\n"

Ansible reached the very end of the playbook. It created all directories, wrote the bash scripts, injected our agent_port: 8090 variable dynamically via Jinja2 templating, and printed the final success message.

4. The Play Recap — Your Report Card

PLAY RECAP ****************************************************
db-server-1  : ok=11   changed=4   unreachable=0   failed=0   skipped=1   rescued=0   ignored=3

Let me decode this:

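| Metric | Value | What It Means |
| --- | --- | --- |
| failed | 0 | No task failed outright, so the run counts as a success. |
| ignored | 3 | Ansible caught and bypassed 3 expected lab limitations (Docker daemon, image pull, systemd reload). |
| changed | 4 | Four tasks modified the host. Three of them created the agent directory, the mock script, and the service file. The Python bootstrap cannot be the fourth, because its changed_when: false means it never reports changed; the remaining change has to come from another module task, such as the Docker package install or the group membership update. |
| skipped | 1 | The RHEL/CentOS Docker install was skipped (because our containers run Ubuntu). |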
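
If a script rather than a human has to read that line, the counters are easy to pull out. A small sketch, using the recap line from my run as sample input; recap_value is a hypothetical helper name.

```shell
#!/bin/sh
# recap_value KEY LINE: print the number that follows KEY= in a recap line.
recap_value() {
  printf '%s\n' "$2" | tr -s ' ' '\n' | sed -n "s/^$1=//p"
}

recap='db-server-1  : ok=11   changed=4   unreachable=0   failed=0   skipped=1   rescued=0   ignored=3'
echo "failed:  $(recap_value failed "$recap")"
echo "ignored: $(recap_value ignored "$recap")"
```

Splitting on spaces first keeps the match anchored to whole keys, so asking for "failed" can never pick up part of another counter.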

The bottom line: the playbook logic holds up. If you ran this exact playbook against a real bare-metal Ubuntu server with systemd, the three ignored failures should disappear (ignored=0), the red text would vanish, and it would deploy a real Debezium host end-to-end.



Verifying Everything Worked

Trust, but verify:

# Docker installed?
ssh db-server-1 "docker --version"

# Agent directory created?
ssh db-server-1 "ls -la /opt/debezium-agent/"

# Systemd service file exists?
ssh db-server-1 "cat /etc/systemd/system/debezium-agent.service"

Each of these should return exactly what our playbook configured. The directory is there, the mock script is executable, and the service file has our templated agent_port and agent_token values baked in.


The Most Important Concept: Idempotency

This is the golden rule of configuration management and the core reason why the Debezium design document specifies Ansible for the host provisioning engine.

Idempotent means: running an operation multiple times produces the exact same result without unintended side effects. If Docker is already installed, the playbook shouldn't reinstall it. If a directory already exists with the correct permissions, Ansible should leave it untouched.

Why This Matters for DDD-41

According to Section 3 of the design document, the platform architecture features a dynamic file watcher that monitors ~/.ssh/config. If a sysadmin modifies a host's IP address or changes an SSH alias, the watcher automatically re-triggers the provisioning pipeline.

Because of this, our playbook must be completely safe to re-run against an active, healthy host without disrupting running services.

Modules vs. Shell: The Structural Difference

To achieve idempotency, you must favor native Ansible modules over raw shell commands. Here's the exact contrast:

# ❌ NOT IDEMPOTENT: Runs every time, always reports "changed"
- name: Add user to docker group
  ansible.builtin.shell: usermod -aG docker deploy

# ✅ IDEMPOTENT: Checks current state first, only acts if needed
- name: Add user to docker group
  ansible.builtin.user:
    name: "deploy"
    groups: docker
    append: yes
  become: true

The shell version blindly runs usermod every time — even if the user is already in the group. It always reports changed, making it impossible to tell if your playbook actually did anything meaningful.

The module version inspects the system state first. If deploy is already in the docker group, it does nothing and reports ok. This is the difference between "automation" and "reliable automation."

The rule is simple: always prefer built-in Ansible modules over shell or command tasks. Modules are designed to be idempotent out of the box. Shell scripts are blind to existing state unless you manually wrap them with conditional checks.
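
To see what "manually wrap them with conditional checks" means, here is the check-then-act pattern written by hand in shell. It is a sketch with a made-up helper, ensure_dir, that reports ok or changed the way an Ansible module would. The directory used is a temporary one, not the real agent path.

```shell
#!/bin/sh
# ensure_dir PATH: create PATH only when it is missing, and report
# "changed" or "ok" in the style of Ansible's task statuses.
ensure_dir() {
  if [ -d "$1" ]; then
    echo "ok: $1"
  else
    mkdir -p "$1" && echo "changed: $1"
  fi
}

agent_dir="$(mktemp -d)/debezium-agent"
ensure_dir "$agent_dir"   # first run creates the directory
ensure_dir "$agent_dir"   # second run finds it and leaves it alone
```

The same guard for the group example would be a membership check with id -nG deploy before calling usermod, which is exactly the bookkeeping the user module already does for you.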


What Happens on the Second Run?

To verify idempotency, I ran the exact same playbook a second time against the same containers:

PLAY RECAP ****************************************************
db-server-1  : ok=11   changed=0   unreachable=0   failed=0   skipped=1   rescued=0   ignored=3

See that? changed=0.

Here's what happened:

  1. Structural tasks shifted from yellow to green. Creating the agent directory, writing the mock script, and generating the service file all reported ok instead of changed. Ansible checked the containers, verified the files matched the playbook specification exactly, and skipped rewriting them.

  2. Lab workarounds still triggered. The Docker daemon, image pull, and systemd reload tasks still hit their container limitations and fell back to ignore_errors: yes. That's expected and correct.

  3. Zero unnecessary changes. The playbook confirmed the environment, touched nothing that was already correct, and proved that our provisioning pipeline is completely safe for continuous re-execution.

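That changed=0 on the second run is strong evidence that the module-based tasks are idempotent. It means the pipeline can safely loop through the same host over and over without causing drift or disruption. One caveat: tasks that set changed_when: false (the Python bootstrap) or that fail and get ignored never report changed, so the recap says nothing about them.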

🧠 Think of it this way: A good playbook is like a good if statement — it only does work when the condition demands it.


What I Took Away From All This

By the end of this phase, I went from viewing Ansible as just another DevOps buzzword to understanding how to design and debug a resilient infrastructure pipeline. Wrestling with configuration errors on my M1 Mac forced me to appreciate the nuance required to build production-ready automation.

Here are the three architectural insights that clicked for me:

1. The Power of Delegating to OpenSSH

The thing that surprised me most was how cleanly Ansible handles networking. I initially assumed I'd have to pass explicit IP addresses, ports, and private key paths into the playbook or build a complex mapping layer in Java.

Instead, Ansible completely offloads connection management to the system's native OpenSSH client. Because OpenSSH automatically reads ~/.ssh/config, Ansible inherits that entire configuration for free. This made the DDD-41 host discovery architecture click: a sysadmin maintains one standard SSH config file, the platform watches it for changes, and the automation engine handles everything else.

2. Idempotency Is Verified, Not Assumed

Running this playbook six or seven times taught me what idempotency actually means in practice. On the first run, my terminal was flooded with yellow changed statuses as directories were created and packages were installed. On the second run, every structural task shifted to green ok.

Seeing changed=0 in the play recap is the ultimate proof of a well-behaved playbook. It proves that by favoring native modules over blind shell commands, the engine inspects remote state before touching a single file. This guarantee is critical for the file-watcher architecture — if a config change re-triggers provisioning, the pipeline passes through an active, healthy host without breaking anything.

3. A Clean Separation of Concerns

The Java control plane and the Ansible automation layer are cleanly decoupled:

  • Java's job: Manage the high-level lifecycle state machine (PENDING → PROVISIONING → READY/FAILED), handle asynchronous execution so the application never freezes during provisioning, and feed context variables into the deployment.
  • Ansible's job: Manage the concrete reality of the remote host — validate package states, create directories, write service configurations.

They don't need to know anything about each other's internals. Java fires off a ProcessBuilder, passes the runtime flags (-e), and waits for an exit code. A 0 means success; anything else means failure. Clean, simple, decoupled.
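
The exit-code contract is small enough to sketch in shell. The real control plane does this from Java with ProcessBuilder. The wrapper below, report_status, is a hypothetical stand-in, and true and sh -c 'exit 2' stand in for a successful and a failed ansible-playbook run so the snippet works without Ansible installed.

```shell
#!/bin/sh
# report_status COMMAND [ARGS...]: run the command and map its exit code
# to the lifecycle states used in this post (READY or FAILED).
report_status() {
  if "$@"; then
    echo "READY"
  else
    echo "FAILED"
    return 1
  fi
}

report_status true                    # stands in for a successful run
report_status sh -c 'exit 2' || true  # stands in for a failed run

# Real usage would look like this (not executed here):
#   report_status ansible-playbook host-setup.yml -i "db-server-1,"
```

Note that failures caught by ignore_errors do not count as failed, so my lab run, with its three ignored tasks, would still map to READY under this contract.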


With SSH key authentication (Part 1) and Ansible provisioning playbooks (Part 2) fully tested and running against my local container lab, the automation foundation is rock solid.


Thank you for reading! If this helped you understand Ansible better (or saved you from the same Permission denied errors I hit), drop a comment or share your feedback. I'd love to hear how you'd approach this differently.

Happy automating! 🚀
