Sagar

How to debug AWS Batch Job abrupt failures?

Background

My team owns an ML platform for company X, built on top of AWS Batch. Most of the time the platform works just fine, but at our scale, running thousands of instances at a time, things can occasionally go wrong, for example:

  • A job gets terminated abruptly because the EC2 instance it was running on was taken out of service
  • The EC2 instance the job was running on gets terminated abruptly

In both cases, since the EC2 instance is no longer in service, it is impossible for us to debug the root cause and fix the error.

Most of the time the problem is a transient network issue or a new AWS limit being hit.

The only way for us to quickly identify the problem is to look at the logs, which in this case are the ECS logs or Docker logs. But as I mentioned, since the EC2 instance is out of service, there is no way for us to retrieve them after the fact. If the instance were still running we could easily SSH onto the host, or use SSM, and follow these steps to collect the logs. And as of this writing there is no way to have ECS pull these logs automatically.
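If you do catch the instance while it is still alive, a minimal sketch of manual collection might look like this (assuming the SSM agent is running and the instance profile allows Session Manager; the instance ID is a placeholder):

    # Open a shell on the instance without needing SSH keys
    aws ssm start-session --target i-0123456789abcdef0

    # Then, on the instance, grab the ECS and Docker-related logs
    sudo tar czf /tmp/ecs-logs.tar.gz /var/log/ecs /var/log/messages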

So, to bridge the gap and make our lives easier, we tried a couple of approaches to pull the logs off the instance automatically.

What did not work for us

There are solutions available online that suggest setting up an EC2 Auto Scaling group lifecycle hook that triggers a Lambda function when an instance is about to be terminated. The Lambda function then uses something like Session Manager to copy all the relevant log files to S3.
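For reference, creating just the termination hook might look something like the sketch below; the Lambda function and the EventBridge rule that invokes it on the lifecycle event are separate pieces you would still have to build (the names here are placeholders):

    # Pause instance termination so a Lambda has time to copy the logs
    aws autoscaling put-lifecycle-hook \
      --lifecycle-hook-name copy-ecs-logs \
      --auto-scaling-group-name my-batch-compute-env-asg \
      --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
      --heartbeat-timeout 300 \
      --default-result CONTINUE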

As you can imagine, this requires making changes across a couple of different AWS services, which can be tricky to set up and hard to maintain. And for some unknown reason the logs were always missing exactly when we needed them the most.

So we gave up on it and went with the next approach, which we think is quite a bit easier.

Solution

Instead, we came up with another solution that was very easy to implement and maintain: leveraging the CloudWatch agent to continuously stream the logs into CloudWatch Logs.

This solution might be a bit more expensive, as it copies all the logs even when the instance is healthy. That was fine for us, since every instance a Batch job runs on is eventually terminated anyway, and we kept the CloudWatch Logs retention low (for example 7 days) to reduce storage costs.
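Setting the retention is a one-time step per log group. For example, using one of the log group names from the config later in this post:

    # Keep only 7 days of logs to limit storage cost
    aws logs put-retention-policy \
      --log-group-name /aws-batch/ecs-agent.log \
      --retention-in-days 7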

All we need to do is run a few commands that tell the CloudWatch agent to keep streaming the relevant log files to the cloud.

There are two ways (possibly more) that we identified to achieve this:

  • Creating a Custom AMI that has this startup script
  • Launch Template / Launch Configuration with user data

In both approaches, a script runs during EC2 instance bootstrap and performs the following steps:

  1. Install the CloudWatch agent
  2. Start the agent, pointing it at the local file paths where the logs are located

In this post I'll just provide the high-level commands you need to run to achieve this. Based on your preference, you can make these commands run during EC2 instance start-up using either of the two options mentioned above (a combined user data sketch follows the steps below).

  1. First we need to install the CloudWatch agent on the EC2 instance. Note that I am using the us-west-2 region; the architecture and OS of your instance could be different, so see the instructions here for other platforms.

    rpm -Uvh https://amazoncloudwatch-agent-us-west-2.s3.us-west-2.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
    
  2. Once the CloudWatch agent is installed, we need to create an agent configuration file that points to the locations of the log files on the instance and also specifies where, and under what names, the logs will be stored in CloudWatch.
    Notice that in this example we are telling the agent to stream 4 log files:

    • ECS Agent logs
    • ECS Init logs
    • ECS Audit logs
    • Messages logs (contains Docker logs)

    You can add entries to or remove entries from the collect_list as per your requirements. Also, depending on the platform, the log file locations might be different.

    cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'
    {
      "agent":{
         "metrics_collection_interval":10,
         "omit_hostname":true
      },
      "logs":{
         "logs_collected":{
            "files":{
               "collect_list":[
                  {
                     "file_path":"/var/log/ecs/ecs-agent.log*",
                     "log_group_name":"/aws-batch/ecs-agent.log",
                     "log_stream_name":"{instance_id}",
                     "timezone":"Local"
                  },
                  {
                     "file_path":"/var/log/ecs/ecs-init.log",
                     "log_group_name":"/aws-batch/ecs-init.log",
                     "log_stream_name":"{instance_id}",
                     "timezone":"Local"
                  },
                  {
                     "file_path":"/var/log/ecs/audit.log*",
                     "log_group_name":"/aws-batch/audit.log",
                     "log_stream_name":"{instance_id}",
                     "timezone":"Local"
                  },
                  {
                     "file_path":"/var/log/messages",
                     "log_group_name":"/aws-batch/messages",
                     "log_stream_name":"{instance_id}",
                     "timezone":"Local"
                  }
               ]
            }
         }
      }
    }
    EOF
    
  3. Now all we need to do is tell the CloudWatch agent to fetch the config we just created and start streaming the logs:

    /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s
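If you go the launch template route, the three steps above can be combined into the instance user data. AWS Batch expects launch template user data in MIME multi-part format, so a minimal sketch might look something like this (the config is trimmed to a single log file here; use the full collect_list from step 2):

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==BOUNDARY=="

    --==BOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    # Install the agent (same rpm as step 1)
    rpm -Uvh https://amazoncloudwatch-agent-us-west-2.s3.us-west-2.amazonaws.com/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm

    # Write the agent config (trimmed; see step 2 for the full collect_list)
    cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'
    {
      "logs": {
        "logs_collected": {
          "files": {
            "collect_list": [
              {
                "file_path": "/var/log/ecs/ecs-agent.log*",
                "log_group_name": "/aws-batch/ecs-agent.log",
                "log_stream_name": "{instance_id}",
                "timezone": "Local"
              }
            ]
          }
        }
      }
    }
    EOF

    # Fetch the config and start streaming (same command as step 3)
    /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s

    --==BOUNDARY==--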
    
    

That's it. Now whenever your EC2 instance starts, the commands above will make sure the log files you care about are streamed to CloudWatch, so they are still available after the instance itself is gone.
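To verify things are working, you can check the agent status on a running instance and tail a log group from your workstation (the tail command requires AWS CLI v2):

    # On the instance: confirm the agent is running
    /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status -m ec2

    # From anywhere: follow a log group in near real time
    aws logs tail /aws-batch/ecs-agent.log --follow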
