DEV Community

Jayesh Shinde
How We Fixed Intermittent ECS Image-Not-Found Errors in AWS CDK

At one point, our ECS deployments started failing in a way that felt random.

Sometimes a deployment would work perfectly. Sometimes the service would try to roll forward and fail because the container image it expected was no longer available. Nothing was wrong with the application code. The problem was in the deployment asset flow.

We were using AWS CDK to deploy container-based workloads, and like many teams, we were relying on CDK’s default bootstrap ECR repository for Docker image assets. That was convenient at first, but it became a problem once repository retention rules were tightened for cost control.

In environments with frequent deployments, older intermediate images were being cleaned up faster than our deployment flow could safely tolerate. The result was intermittent ECS deploy failures caused by missing images.

The Root Cause

AWS CDK Docker assets are published during the asset publishing phase, which happens before CloudFormation starts deploying stacks.

That means two things:

  • CDK is not just defining infrastructure; it is also managing where deployable image assets are stored.
  • If the default asset repository has aggressive cleanup policies, your deployments can become fragile.

This is especially painful in non-production environments where deployment frequency is high and image churn is constant.

The Strategy We Took

We wanted a solution that was simple, low-risk, and did not require redesigning the whole build pipeline.

So instead of pushing ECS image assets to the shared default CDK ECR repository, we moved to a dedicated ECR repository per environment/application area.

At a high level, the fix looked like this:

  • create a dedicated ECR repository ahead of time
  • configure the CDK synthesizer to publish image assets there
  • keep lifecycle control on that repository
  • deploy the ECR stack first, then the app stacks

This gave us isolation from the shared bootstrap repository while keeping the rest of the CDK deployment model mostly unchanged.

Sample: Dedicated ECR Stack

Here is a simplified example of creating a dedicated ECR repository with a lifecycle policy:

```typescript
import { Stack, RemovalPolicy } from 'aws-cdk-lib';
import { Repository } from 'aws-cdk-lib/aws-ecr';
import type { Construct } from 'constructs';

type AppEnvProps = {
  stage: string;
};

export class ContainerAssetRepoStack extends Stack {
  constructor(scope: Construct, id: string, props: AppEnvProps) {
    super(scope, id);

    new Repository(this, 'AppContainerRepo', {
      repositoryName: `myapp-assets-${props.stage.toLowerCase()}`,
      removalPolicy: RemovalPolicy.RETAIN,
      lifecycleRules: [
        {
          rulePriority: 1,
          description: 'Keep only the latest 100 images',
          maxImageCount: 100,
        },
      ],
    });
  }
}
```

A few important details here:

  • RETAIN protects the repository if the stack is deleted later.
  • lifecycle rules still clean up old images over time.
  • the repository name is normalized to lowercase, since ECR requires lowercase repository names.
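
That last point is easy to centralize in a small helper. This is a hypothetical `assetRepoName` function for illustration, not part of the CDK API:

```typescript
// Hypothetical helper: ECR repository names must be lowercase,
// so normalize once here instead of at every call site.
function assetRepoName(appName: string, stage: string): string {
  return `${appName}-assets-${stage}`.toLowerCase();
}

console.log(assetRepoName('myapp', 'Dev')); // myapp-assets-dev
```

Using the same helper in both the repository stack and the synthesizer config also removes the risk of the two names drifting apart.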

Sample: Point CDK to the Dedicated Repo

Once the repository exists, the application stack can tell CDK to publish image assets there using DefaultStackSynthesizer.

```typescript
import { App, DefaultStackSynthesizer, Stack } from 'aws-cdk-lib';
import type { Construct } from 'constructs';

type ServiceStackProps = {
  stage: string;
};

class ServiceStack extends Stack {
  constructor(scope: Construct, id: string, props: ServiceStackProps) {
    super(scope, id, {
      synthesizer: new DefaultStackSynthesizer({
        imageAssetsRepositoryName: `myapp-assets-${props.stage.toLowerCase()}`,
      }),
    });

    // ECS service, task definition, container asset usage, etc.
  }
}

const app = new App();

new ServiceStack(app, 'ServiceStackDev', {
  stage: 'dev',
});
```

This keeps the existing CDK asset publishing model, but moves the destination away from the shared default bootstrap repository.

One Important Gotcha

A stack dependency is not enough if the same deployment run tries to create the ECR repository and publish assets into it.

Why?

Because asset publishing happens before CloudFormation stack deployment.

So if the repository does not already exist, the asset publish step can fail before your “repo stack” is even deployed.

The safest pattern is:

  1. deploy the ECR repository stack first
  2. run the normal application deployment after that

That sequencing matters.
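
In practice, that ordering is just two separate deploy invocations. This sketch uses the stack names from the examples in this post; substitute your own:

```shell
# Phase 1: make sure the dedicated asset repository exists.
cdk deploy ContainerAssetRepoStack

# Phase 2: deploy the application stacks; the asset publishing step
# now has a repository to push into.
cdk deploy ServiceStackDev
```

Running both in one `cdk deploy --all` invocation is exactly the trap described above: publishing would be attempted before the repository stack deploys.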

Another Important Gotcha: IAM Permissions

Changing the repository target is not enough by itself.

The identity or role that CDK uses to publish Docker assets must also have permission to push to the new ECR repository.

That usually means allowing actions such as:

  • ecr:PutImage
  • ecr:InitiateLayerUpload
  • ecr:UploadLayerPart
  • ecr:CompleteLayerUpload
  • ecr:BatchCheckLayerAvailability
  • ecr:BatchGetImage
  • ecr:GetDownloadUrlForLayer
  • ecr:GetAuthorizationToken

If you forget this part, the deployment simply moves from “image missing” problems to “access denied” problems.

Why This Worked Well for Us

We liked this approach because it was a practical middle ground.

It did not require:

  • rebuilding our CI/CD image strategy from scratch
  • changing every ECS service definition
  • introducing a more complex app-owned image publishing flow immediately

But it did give us:

  • predictable image retention
  • environment-specific isolation
  • fewer surprises during ECS deployments
  • better control over cost and cleanup behavior

When to Use This Pattern

This approach makes sense if:

  • you already use CDK-managed Docker/container assets
  • the default bootstrap ECR repository is shared across too many deployments
  • retention rules on that shared repository are causing instability
  • you want a fast, low-disruption improvement

If you want a more explicit long-term model, the next step is usually:

  • build image in CI
  • push image to a named ECR repository yourself
  • reference the image directly in ECS by repo and tag

That gives maximum control, but it also requires more changes.

Final Thought

CDK defaults are great for getting started, but they are not always ideal once platform constraints like retention, cost control, and deployment frequency start to matter.

In our case, moving Docker assets to dedicated ECR repositories was a small change with a big operational impact. It made deployments more predictable without forcing a major rework of the pipeline.
