<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jeff</title>
    <description>The latest articles on DEV Community by jeff (@nodeselector).</description>
    <link>https://dev.to/nodeselector</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F222865%2F01391bf6-e839-46ff-a71a-7bbb5091377f.jpeg</url>
      <title>DEV Community: jeff</title>
      <link>https://dev.to/nodeselector</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nodeselector"/>
    <language>en</language>
    <item>
      <title>Retrying groups of tightly coupled tasks in Ansible</title>
      <dc:creator>jeff</dc:creator>
      <pubDate>Sat, 18 Apr 2020 08:01:46 +0000</pubDate>
      <link>https://dev.to/nodeselector/retrying-groups-of-tightly-coupled-tasks-in-ansible-579d</link>
      <guid>https://dev.to/nodeselector/retrying-groups-of-tightly-coupled-tasks-in-ansible-579d</guid>
      <description>&lt;p&gt;In some cases the best way to handle failure in distributed systems is to just try again. The same idea can apply when configuring a fleet of machines using Ansible. Let's say, for example's sake, that we have a task that fails on the first attempt for 10% of the targeted hosts because of some race condition, but succeeds for those hosts on the second attempt. We could try to solve this race condition in many ways, but sometimes we don't have the time or the required control of the system to do so. In that case, retrying the task may be the simplest, most efficient way to solve the problem.&lt;/p&gt;

&lt;p&gt;I ran into this problem at work while writing a configuration to provision a rack of servers using Ansible. Typically, if a task may fail on the first try, Ansible lets you repeat it like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Some task that might fail&lt;/span&gt;
  &lt;span class="na"&gt;failing_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;some&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;setting&lt;/span&gt;
  &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;outcome&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;outcome.result == 'success'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is great! Now Ansible will repeat the task up to three times with a 10-second delay between attempts. At the end of each attempt the &lt;code&gt;until&lt;/code&gt; condition is evaluated; if it does not evaluate to true, the task is repeated, as long as there are retries left.&lt;/p&gt;
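
&lt;p&gt;For a concrete instance of the same pattern, here is a minimal sketch using the built-in &lt;code&gt;command&lt;/code&gt; module. The script path is a hypothetical placeholder, and the &lt;code&gt;until&lt;/code&gt; condition assumes that a zero exit code means success:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical command path; replace it with whatever may fail in your setup
- name: Run a flaky command until it exits cleanly
  command: /usr/local/bin/flaky_command
  register: outcome
  retries: 3
  delay: 10
  # The command module reports the exit code as rc
  until: outcome.rc == 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;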

&lt;p&gt;Now, let's say that I have a group of tasks that need to be repeated on failure, and not just one like in the example above. Well, grouping tasks is possible in Ansible, so maybe this will work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Group of tasks that are tightly coupled&lt;/span&gt;
  &lt;span class="na"&gt;block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup for the next task that needs to run after each failed attempt&lt;/span&gt;
    &lt;span class="na"&gt;setting_up&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;some&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prerequisite action&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Some task that might fail&lt;/span&gt;
    &lt;span class="na"&gt;failing_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;some&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;setting&lt;/span&gt;
    &lt;span class="na"&gt;register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;outcome&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;until&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;outcome.result == 'success'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This, unfortunately, will not work, as &lt;a href="https://github.com/ansible/ansible/issues/46203"&gt;Ansible does not currently support using retries on a &lt;code&gt;block&lt;/code&gt;&lt;/a&gt;. If you find yourself in a situation where you need to repeat a group of tasks, this will work instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# filename: coupled_task_group.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Group of tasks that are tightly coupled&lt;/span&gt;
  &lt;span class="na"&gt;block&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Increment the retry count&lt;/span&gt;
    &lt;span class="na"&gt;set_fact&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;retry_count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;retry_count&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;undefined&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;else&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;retry_count&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup for the next task that needs to run after each failed attempt&lt;/span&gt;
    &lt;span class="na"&gt;setting_up&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;some&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prerequisite action&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Some task that might fail&lt;/span&gt;
    &lt;span class="na"&gt;failing_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;some&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;setting&lt;/span&gt;
  &lt;span class="na"&gt;rescue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Maximum retries of grouped tasks reached&lt;/span&gt;
      &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;retry_count | int == &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;msg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Group&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failed,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;let's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;give&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;another&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shot"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;include_tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coupled_task_group.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks really strange at first. You might ask why the &lt;code&gt;block&lt;/code&gt;'s &lt;code&gt;rescue&lt;/code&gt; section includes the very task file in which the tasks are declared. What's happening is that we're using recursion to repeat the group of tasks until one of two conditions is reached: either the tasks succeed (on the first try or on a retry), or the group has been retried five times, at which point the &lt;code&gt;when&lt;/code&gt; condition on the &lt;code&gt;fail&lt;/code&gt; task is met and throws us out of the loop. When writing any kind of recursive function, ensuring a base case is vital; otherwise the function may call itself indefinitely. The same holds true here: if we forgot to increment the &lt;code&gt;retry_count&lt;/code&gt; variable on each pass, Ansible would keep retrying until stopped by the user.&lt;/p&gt;
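
&lt;p&gt;To kick off the first pass, the task file has to be included from somewhere. A minimal sketch, assuming a hypothetical playbook named &lt;code&gt;site.yml&lt;/code&gt; that sits next to &lt;code&gt;coupled_task_group.yml&lt;/code&gt; and targets all hosts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# filename: site.yml (hypothetical name and host pattern)
- hosts: all
  tasks:
    # First pass through the group; later passes come from its own rescue section
    - name: Run the tightly coupled group, retrying it on failure
      include_tasks: coupled_task_group.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each failed pass re-enters the file through &lt;code&gt;rescue&lt;/code&gt;, so the play only moves past this task once the group succeeds or the &lt;code&gt;fail&lt;/code&gt; task fires.&lt;/p&gt;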

&lt;p&gt;In the future, this approach shouldn't be needed as a PR to add this functionality to Ansible is nearing completion. To see if it's been merged, check &lt;a href="https://github.com/ansible/ansible/pull/62151"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
