DEV Community

jeff

Retrying groups of tightly coupled tasks in Ansible

In some cases, the best way to handle failure in a distributed system is simply to try again. The same idea applies when configuring a fleet of machines with Ansible. Suppose, for example's sake, that a task fails on the first attempt for 10% of the targeted hosts because of some race condition, but succeeds for those same hosts on the second attempt. We could try to fix the race condition itself, but sometimes we lack the time or the control over the system to do so. In that case, retrying the task may be the simplest and most efficient solution.

I ran into this problem at work while writing an Ansible configuration to provision a rack of servers. When a single task may fail on the first try, Ansible lets you repeat it like so:

- name: Some task that might fail
  failing_task:
    some: setting
  register: outcome
  retries: 3
  delay: 10
  until: outcome.result == 'success'

This is great! Ansible will now retry the task up to three times, waiting 10 seconds between attempts. After each attempt the until condition is evaluated; if it is not true and retries remain, the task runs again.

Now, let's say I have a group of tasks that needs to be repeated on failure, not just one as in the example above. Grouping tasks is possible in Ansible with a block, so maybe this will work:
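As a more concrete illustration of the same pattern, here is a minimal sketch using the built-in uri module. The URL and the registered variable name `health` are placeholders chosen for this example, not part of the original setup.

```yaml
- name: Wait for a (hypothetical) health endpoint to respond
  ansible.builtin.uri:
    url: http://localhost:8080/health   # placeholder URL, assumed for illustration
    status_code: 200
  register: health
  retries: 3
  delay: 10
  # Stop retrying as soon as the endpoint answers with HTTP 200.
  until: health.status == 200
```

The structure is the same as above: register the result, then let until decide whether another attempt is needed.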

---
- name: Group of tasks that are tightly coupled
  block:
  - name: Setup for the next task that needs to run after each failed attempt
    setting_up:
        some: prerequisite action

  - name: Some task that might fail
    failing_task:
        some: setting
    register: outcome
  retries: 3
  delay: 10
  until: outcome.result == 'success'

Unfortunately, this will not work: Ansible does not currently support retries on a block. If you find yourself needing to repeat a group of tasks, this will work:

# filename: coupled_task_group.yml
- name: Group of tasks that are tightly coupled
  block:
    # Starts at 0 on the first pass, then counts each retry.
    - name: Increment the retry count
      set_fact:
        retry_count: "{{ 0 if retry_count is undefined else retry_count | int + 1 }}"

    - name: Setup for the next task that needs to run after each failed attempt
      setting_up:
        some: prerequisite action

    - name: Some task that might fail
      failing_task:
        some: setting

  rescue:
    # Base case: stop recursing once the retry limit is reached.
    - fail:
        msg: Maximum retries of grouped tasks reached
      when: retry_count | int == 5

    - debug:
        msg: "Task Group failed, let's give it another shot"

    # Recursive case: include this same file to run the block again.
    - include_tasks: coupled_task_group.yml

This looks strange at first. You might ask why the block's rescue section includes the very task file in which it is declared. What's happening is that we're using recursion to repeat the group of tasks until one of two things happens: the tasks succeed, either on the first try or on a retry, or retry_count reaches five, at which point the when condition on the fail task becomes true and throws us out of the loop. As with any recursive function, a base case is vital; without one, the function may call itself forever. The same holds here: if we forgot to increment retry_count on each pass, Ansible would keep retrying until stopped by the user.
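To kick off the recursion, the task file has to be included once from somewhere else. The following is a minimal sketch of how that might look; the playbook filename, play name, and host pattern are assumptions made for illustration.

```yaml
# filename: playbook.yml (hypothetical name)
- name: Provision hosts
  hosts: all
  tasks:
    # First call into the recursive task file; any retries happen inside it.
    - name: Run the tightly coupled tasks, retrying them as a group
      ansible.builtin.include_tasks: coupled_task_group.yml
```

Note that include_tasks is a dynamic include, evaluated at run time, which is what allows the file to include itself from its rescue section. A static import_tasks is resolved when the playbook is parsed, so it is presumably not a drop-in substitute for this trick.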

In the future, this approach shouldn't be needed as a PR to add this functionality to Ansible is nearing completion. To see if it's been merged, check here.

Top comments (2)

codejedi365 • Edited

This was great @nodeselector! It totally solved my issue with apt, which magically couldn't find packages on the first try after an apt update but could on the second try. THANK YOU!

Edit: I think your post is missing an 's' on the include_task directive; it should be "include_tasks". Ansible threw an error on include_task.

I also added dynamic max_retries and retry_delay variables to extend the concept above, in case anyone needs that kind of flexibility as I did.

# filename: coupled_task_group.yml
- name: Group of tasks that are tightly coupled
  vars:
    max_retries: "{{ 5 if max_retries is undefined else max_retries | int }}"
    retry_delay: "{{ 10 if retry_delay is undefined else retry_delay | int }}"
  block:
  - name: Increment the retry count
    set_fact:
      retry_count: "{{ 0 if retry_count is undefined else retry_count | int + 1 }}"

  - name: Setup for the next task that needs to run after each failed attempt
    setting_up:
        some: prerequisite action

  - name: Some task that might fail
    failing_task:
        some: setting

  rescue:
    - fail:
        msg: Maximum retries of grouped tasks reached
      when: retry_count | int == max_retries | int

    - debug:
        msg: "Task Group failed, let's give it another shot"

    - name: Sleep between retries
      wait_for:
        timeout: "{{ retry_delay }}" # seconds
      delegate_to: localhost
      become: false

    - include_tasks: coupled_task_group.yml
Zl7 • Edited

What if you want to use that coupled_task_group.yml several times?

How can you reset the retry_count between uses? For instance:

# main.yml
- name: "Check service 1"
  include_tasks: coupled_task_group.yml
  vars:
    service_name: 'service1'
    max_retries: 2
    retry_delay: 5

- name: "Check service 2"
  include_tasks: coupled_task_group.yml
  vars:
    service_name: 'service2'
    max_retries: 2
    retry_delay: 5

In my case, coupled_task_group.yml is similar to what is shown in this article, including @codejedi365's additions, but 'some task' implements a service check (for instance). It does not really matter what it implements; that detail is just there to give some sense of my use case.

This implementation seems to work fine for the first include, but the second seems to pick up the retry_count value where the first one left it.

Finally, if I set retry_count: 0 in the vars when calling, it doesn't seem to work either, as retry_count gets stuck at 0 indefinitely. Maybe I'm missing something obvious, or this trick is just not meant to be used several times with the same set of variables...

Edit/Update:
I think I understood my mistake in resetting retry_count. Since it's a fact, it needs to be updated as a fact (with set_fact). Resetting it like this after both success and failure works:

# filename: coupled_task_group.yml

...SomeStuff...

block:
- name: Increment the retry count
  set_fact:
    retry_count: "{{ 0 if retry_count is undefined or retry_count == 'reset' else retry_count | int + 1 }}"

...SomeStuff...

- name: Reset retry count after success
  set_fact:
    retry_count: reset

...SomeStuff...

rescue:
- name: Reset retry count if max retries reached (exit loop)
  set_fact:
    retry_count: reset
  failed_when: retry_count == 'reset'
  when: retry_count | int >= max_retries | int

...SomeStuff...