
Debugging ECR Private Pull Through Cache

I completely missed that AWS added pull through cache rules for private ECR (Elastic Container Registry) repositories earlier this year. If your organization uses publicly available container images, this goes a long way towards helping you avoid throttling issues, accelerate deployments, and have some confidence that the container versions you depend on will remain available.
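
If you haven't set one of these rules up yet, it's a single call. Here's a minimal sketch of creating a Docker Hub rule under the docker-hub/ prefix used throughout this post; the account ID, region, and secret name are placeholders, and the Secrets Manager secret holding your Docker Hub credentials needs to use the ecr-pullthroughcache/ name prefix.

# Sketch: create a pull through cache rule for Docker Hub (placeholder account and secret)
aws ecr create-pull-through-cache-rule \
  --region us-east-1 \
  --ecr-repository-prefix docker-hub \
  --upstream-registry-url registry-1.docker.io \
  --credential-arn arn:aws:secretsmanager:us-east-1:012345678910:secret:ecr-pullthroughcache/docker-hub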

One of my coworkers was having trouble with the pull through cache: his application running in ECS wasn't pulling the most recent container image for the latest tag like he expected. Even worse, it couldn't pull images at all if he used a different tag. I don't know how he got the initial version of the image into ECR, but I wouldn't have expected his issues to start after things seemed to be working.

🕐 TL;DR

Make sure your ECS task execution role allows the ecr:BatchImportUpstreamImage and ecr:CreateRepository actions on repositories matching your pull through cache prefix. Once you have it working, refine the policy to the least access necessary.

When he tried to use new tags in his task definition, he got errors in the deployment logs like the one below:

service Sample-service was unable to place a task. Reason: CannotPullContainerError: pull image manifest has been retried 7 time(s): failed to resolve ref 012345678910.dkr.ecr.us-east-1.amazonaws.com/docker-hub/sample/sample:4.13.5: 012345678910.dkr.ecr.us-east-1.amazonaws.com/docker-hub/sample/sample:4.13.5: not found.

The image didn't exist in the ECR repository, so it makes sense that the service wasn't able to find it, but the whole point of a pull through cache is that the image gets pulled from the upstream registry automatically. I decided I needed to test whether the pull through cache could work at all.
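
Before pulling anything, it's worth confirming the cache rule itself exists. This wasn't part of the original troubleshooting, just a sanity check I'd suggest:

# List the registry's pull through cache rules; the docker-hub/ prefix should appear here
aws ecr describe-pull-through-cache-rules --region us-east-1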

I set up my local environment with the right AWS credentials and then used the View Push Commands button for the repository to get the command I needed to set up access to the ECR repository. That let me try pulling the image locally through the cache.

# Set up ECR repository access
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 012345678910.dkr.ecr.us-east-1.amazonaws.com
# Pull the container image
docker pull 012345678910.dkr.ecr.us-east-1.amazonaws.com/docker-hub/sample/sample:4.13.5

That worked, and the new images for that tag showed up in ECR. So the cache was working, but something was preventing it from working within ECS. That sounded like a permissions issue.
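
If you'd rather double-check from the CLI than the console, something like this (repository name taken from the error message above) lists what the cache has populated:

# Show the tags now cached in the private repository
aws ecr describe-images \
  --repository-name docker-hub/sample/sample \
  --region us-east-1 \
  --query 'imageDetails[].imageTags'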

[Image: Sherlock Holmes, looking suspicious]

Since the container images are run as ECS tasks, I pulled up the ECS task's IAM execution role. It had the permissions necessary to pull the container images, but did it have the permissions for the pull through cache? It did not!
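
You can do the same check from the CLI by listing the role's managed and inline policies; the role name here is a placeholder, not the actual role from this incident.

# Placeholder role name; use your own task execution role
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole
aws iam list-role-policies --role-name ecsTaskExecutionRole
# Then inspect any inline policy with:
#   aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name <policy-name>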

There are two permissions potentially needed to pull images through the cache (a combined policy sketch follows the list):

  • ecr:BatchImportUpstreamImage to actually pull down the image and update your private ECR repository
  • ecr:CreateRepository to create a new repository if you specify a brand new image
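
If you do need both, for example when pulling an image whose repository hasn't been created in your private registry yet, a combined statement scoped to the cache prefix might look like the sketch below (same placeholder account ID and region as the rest of this post):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ecr:BatchImportUpstreamImage",
      "ecr:CreateRepository"
    ],
    "Resource": "arn:aws:ecr:us-east-1:012345678910:repository/docker-hub/*"
  }]
}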

Since my coworker wasn't planning to pull down entirely new container images, he only needed the first permission. I updated the IAM role with the policy below and had things working!

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": "ecr:BatchImportUpstreamImage",
    "Resource": "arn:aws:ecr:us-east-1:012345678910:repository/docker-hub/*"
  }]
}
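
If you're attaching it from the CLI rather than the console, an inline policy can be added with a single call; the role name, policy name, and file name below are placeholders.

# Placeholder names; attach the policy JSON above as an inline policy
aws iam put-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-name AllowPullThroughCacheImport \
  --policy-document file://pull-through-cache-policy.json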

[Image: Sherlock Holmes, smiling]

That's it! We had things working. We just needed to give the ECS task execution role the additional permissions for the repositories matching his pull through cache prefix. He is pulling multiple container images, so the wildcard isn't too excessive. If it were a single image, the policy could be scoped to that single repository, as sketched below.
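
As a sketch, scoping the earlier policy down to the single repository from the error message would only change the Resource (same placeholder account ID and region):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ecr:BatchImportUpstreamImage",
    "Resource": "arn:aws:ecr:us-east-1:012345678910:repository/docker-hub/sample/sample"
  }]
}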

We're still not sure why the once-per-24-hours update alluded to in the AWS documentation isn't working, but that's a mystery for another day.
