Introduction: Why I Built This (And Why You Should Care)
A few months ago, I was responsible for a rapidly growing microservices architecture running on AWS, and I wasn't sleeping well because of security concerns. We had five different services running in separate containers, each with its own dependencies, and I was constantly worried that one of them was shipping a vulnerable image into production.
The real issue wasn't just the scanning of images; it was about how to automate and consistently maintain security across the entire deployment pipeline.
Manual security checks do not scale, and we all know they are often skipped when a deadline approaches.
I built an automated security pipeline to scan for vulnerabilities in container images using AWS Fargate and GitHub Actions. It eliminated all the manual scanning of images, as well as the constant fear of whether the last base image had critical CVEs and whether we were deploying insecure code to production.
In this article, I walk through how I set this up, with example configurations and the commands I ran, plus the lessons I learned from running it in production for several months before sharing it with the AWS Community Builders program.
What We Will Discuss:
How to set up ECR with automated vulnerability scanning.
How to create GitHub Actions workflows that will gate deployments based on security findings.
How to apply least-privilege IAM roles and network isolation on Fargate.
How to continuously monitor and track compliance.
A step-by-step example to use as a blueprint for your own projects.
Let's go!
Architecture Overview: The Big Picture
Before getting into the details, here's an overview of the architecture that supports the automated security pipeline I built:

With this configuration, you will have:
- Automated security scanning of every container image before it reaches production
- Network isolation via private subnets and security groups
- Least-privilege access via separate execution and task IAM roles
- A complete audit trail via CloudWatch Logs
- Runtime protection via the Fargate isolation model
Now let's get into how each component works.
Container Image Security: The Essentials
Never assume a container image is trustworthy. Even official images can contain vulnerabilities or pull in dependencies that create security problems.
Here's how I locked down our container images:
1. Automated Image Scanning in ECR (Scan on Push)
For each ECR repository, I enabled scan on push. Whenever I push an image to ECR, AWS automatically scans it against the CVE database.
Enabling scan-on-push for a repository:
aws ecr put-image-scanning-configuration \
--repository-name production-api \
--image-scanning-configuration scanOnPush=true \
--region us-east-1
Configuring enhanced scanning with Inspector:
aws ecr put-registry-scanning-configuration \
--scan-type ENHANCED \
--rules '[{"repositoryFilters":[{"filter":"*","filterType":"WILDCARD"}],"scanFrequency":"CONTINUOUS_SCAN"}]' \
--region us-east-1
Enhanced scanning gives you:
- Operating system vulnerabilities
- Vulnerabilities in programming-language packages (Python, Node.js, Java, etc.)
- Continuous monitoring of your images, not just at push time
You can verify the registry-level configuration with the command below.
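To confirm the registry-level scanning configuration took effect, you can describe it (a quick sanity check, not part of the original setup steps):
# Show the current registry scanning configuration
aws ecr get-registry-scanning-configuration \
--region us-east-1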
2. IAM Roles - Separation is Key
I learned this the hard way: never reuse the same IAM role for everything. Here's how I separate the two roles ECS uses:
Task Execution Role (what ECS needs to launch your container):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
Task Role (what your application needs):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query"
],
"Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/users-table"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::my-app-assets/*"
}
]
}
The execution role is identical across services; the task role is defined by each service's specific needs.
3. Base Image Selection & Multi-Stage Builds
I no longer build on large OS images; I use minimal base images instead. Here's how the options compared in my tests (a sketch for reproducing this comparison locally follows the table):

| Base image | Size | CVEs found |
|---|---|---|
| node:18 (full) | 1.1 GB | 247 |
| node:18-slim | 243 MB | 89 |
| node:18-alpine | 178 MB | 12 |
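If you want to reproduce this kind of comparison yourself, here's a rough sketch using Trivy (assuming Trivy and jq are installed locally; counts will drift as new CVEs are published, and Trivy is not the same scanner ECR uses):
# Count vulnerabilities per severity for each candidate base image
for img in node:18 node:18-slim node:18-alpine; do
echo "== $img =="
trivy image --quiet --format json "$img" \
| jq '[.Results[]?.Vulnerabilities[]?] | group_by(.Severity) | map({key: .[0].Severity, value: length}) | from_entries'
done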
My Dockerfile Strategy Uses Multi-Stage Builds:
# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
RUN npm run build
# Production stage - minimal runtime
FROM node:18-alpine
WORKDIR /app
# Create non-root user
RUN addgroup -g 1001 appuser && \
adduser -D -u 1001 -G appuser appuser
# Copy only production artifacts
COPY --from=builder --chown=appuser:appuser /app/dist ./dist
COPY --from=builder --chown=appuser:appuser /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appuser /app/package.json ./
# Switch to non-root user
USER appuser
# Run the application
EXPOSE 3000
CMD ["node", "dist/index.js"]
Security Practices in the Dockerfile:
- Utilizes an Alpine image as a base
- Implements a multi-stage build; production image has no build tools
- Runs the image as a non-root user (following the principle of least privilege)
- Only production dependencies included in the image
- Dependencies are pinned to specific versions (a sketch of pinning the base image by digest follows this list).
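For base image pinning specifically, one approach (a sketch, not from the original pipeline) is to resolve the digest the tag currently points to and reference it in the Dockerfile as node:18-alpine@sha256:&lt;digest&gt;:
# Resolve the digest behind the node:18-alpine tag
docker pull node:18-alpine
docker inspect --format '{{index .RepoDigests 0}}' node:18-alpine
# Then in the Dockerfile (digest value is illustrative):
# FROM node:18-alpine@sha256:<digest-from-the-command-above>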
4. Vulnerability Thresholds
I set a simple policy: no image with Critical or High vulnerabilities reaches production. No exceptions.
Here's how I check scan results programmatically:
# Get the latest scan findings
aws ecr describe-image-scan-findings \
--repository-name production-api \
--image-id imageTag=v1.2.3 \
--region us-east-1 \
--query 'imageScanFindings.findingSeverityCounts'
Output:
{
"CRITICAL": 0,
"HIGH": 0,
"MEDIUM": 3,
"LOW": 12,
"INFORMATIONAL": 5
}
In the GitHub Actions pipeline (next section), the build fails automatically whenever CRITICAL or HIGH vulnerabilities are detected.
Automating Security in GitHub Actions CI/CD
This is where things get interesting: every push to the main branch triggers an automated workflow that builds the image, scans it for vulnerabilities, and deploys only if everything passes.
GitHub Actions Full Workflow
Here is the workflow file (.github/workflows/deploy.yml):
name: Build, Scan, and Deploy to Fargate
on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
env:
AWS_REGION: us-east-1
ECR_REPOSITORY: production-api
ECS_CLUSTER: production-cluster
ECS_SERVICE: api-service
CONTAINER_NAME: api-container
jobs:
build-and-scan:
name: Build and Security Scan
runs-on: ubuntu-latest
outputs:
image: ${{ steps.build-image.outputs.image }}
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v1
- name: Build Docker image
id: build-image
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Push image to Amazon ECR
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
- name: Wait for ECR scan to complete
env:
IMAGE_TAG: ${{ github.sha }}
run: |
echo "Waiting for image scan to complete..."
sleep 30
for i in {1..20}; do
SCAN_STATUS=$(aws ecr describe-image-scan-findings \
--repository-name $ECR_REPOSITORY \
--image-id imageTag=$IMAGE_TAG \
--region $AWS_REGION \
--query 'imageScanStatus.status' \
--output text 2>/dev/null || echo "PENDING")
echo "Scan status: $SCAN_STATUS"
if [ "$SCAN_STATUS" = "COMPLETE" ]; then
echo "Scan completed successfully"
break
elif [ "$SCAN_STATUS" = "FAILED" ]; then
echo "Scan failed!"
exit 1
fi
sleep 15
done
- name: Check for vulnerabilities
env:
IMAGE_TAG: ${{ github.sha }}
run: |
echo "Checking for vulnerabilities..."
FINDINGS=$(aws ecr describe-image-scan-findings \
--repository-name $ECR_REPOSITORY \
--image-id imageTag=$IMAGE_TAG \
--region $AWS_REGION)
echo "$FINDINGS" | jq '.imageScanFindings.findingSeverityCounts'
CRITICAL=$(echo "$FINDINGS" | jq -r '.imageScanFindings.findingSeverityCounts.CRITICAL // 0')
HIGH=$(echo "$FINDINGS" | jq -r '.imageScanFindings.findingSeverityCounts.HIGH // 0')
MEDIUM=$(echo "$FINDINGS" | jq -r '.imageScanFindings.findingSeverityCounts.MEDIUM // 0')
echo "Critical vulnerabilities: $CRITICAL"
echo "High vulnerabilities: $HIGH"
echo "Medium vulnerabilities: $MEDIUM"
if [ "$CRITICAL" -gt 0 ] || [ "$HIGH" -gt 0 ]; then
echo "❌ CRITICAL or HIGH vulnerabilities found! Deployment blocked."
echo "Please fix vulnerabilities before deploying."
exit 1
fi
if [ "$MEDIUM" -gt 5 ]; then
echo "⚠️ Warning: More than 5 MEDIUM vulnerabilities found."
echo "Consider addressing these vulnerabilities."
fi
echo "✅ Security scan passed!"
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ steps.build-image.outputs.image }}
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: 'trivy-results.sarif'
deploy:
name: Deploy to Fargate
needs: build-and-scan
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Download task definition
run: |
aws ecs describe-task-definition \
--task-definition api-service-task \
--query taskDefinition > task-definition.json
- name: Fill in the new image ID in the task definition
id: task-def
uses: aws-actions/amazon-ecs-render-task-definition@v1
with:
task-definition: task-definition.json
container-name: ${{ env.CONTAINER_NAME }}
image: ${{ needs.build-and-scan.outputs.image }}
- name: Deploy to Amazon ECS
uses: aws-actions/amazon-ecs-deploy-task-definition@v1
with:
task-definition: ${{ steps.task-def.outputs.task-definition }}
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
wait-for-service-stability: true
- name: Notify deployment success
if: success()
run: |
echo "🚀 Deployment successful!"
echo "Image: ${{ needs.build-and-scan.outputs.image }}"
echo "Service: ${{ env.ECS_SERVICE }}"
echo "Cluster: ${{ env.ECS_CLUSTER }}"
Here's what this workflow does, step by step:
- Build:
- Retrieve code from the repository
- Create a Docker image
- Tag the image with the SHA for historical records
- Scan:
- Push the image to Amazon ECR (which triggers an automatic scan)
- Wait for the scan to complete (with a timeout)
- Retrieve the scan results and evaluate the findings
- Block the deployment if Critical or High vulnerabilities are found
- Run a second scan with Trivy
- Upload the Trivy results to the GitHub Security tab
- Deploy (only if the scans pass):
- Pull the current task definition
- Replace the image reference with the new image
- Deploy the task to Fargate
- Wait for the service to stabilize
The secret sauce is the deployment gate, which blocks the release when vulnerabilities are found:
if [ "$CRITICAL" -gt 0 ] || [ "$HIGH" -gt 0 ]; then
echo "❌ CRITICAL or HIGH vulnerabilities found! Deployment blocked."
exit 1
fi
If the gate finds Critical or High vulnerabilities, the workflow fails and that artifact cannot be deployed, period. This check has saved us from shipping known CVEs on several occasions.
Required GitHub Secrets
Add the following secrets to your repository:
AWS_ACCESS_KEY_ID (ex: AKIAIOSFODNN7EXAMPLE)
AWS_SECRET_ACCESS_KEY (ex: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY)
Pro tip: use a dedicated IAM user with only the permissions the pipeline needs:
- ecr:* on your repositories
- ecs:DescribeTaskDefinition, ecs:RegisterTaskDefinition, ecs:UpdateService
- iam:PassRole on the task execution and task roles
A sketch of such a policy follows.
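Here's a minimal sketch of that policy, reusing the example account ID and role names from this article; the user name github-actions-deployer is illustrative, and the ARNs should be scoped to your own resources:
# Inline policy for the CI deploy user (names and ARNs are examples)
cat > github-actions-policy.json <<'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EcrRepoAccess",
"Effect": "Allow",
"Action": ["ecr:*"],
"Resource": "arn:aws:ecr:us-east-1:123456789012:repository/production-api"
},
{
"Sid": "EcrAuth",
"Effect": "Allow",
"Action": ["ecr:GetAuthorizationToken"],
"Resource": "*"
},
{
"Sid": "EcsDeploy",
"Effect": "Allow",
"Action": [
"ecs:DescribeTaskDefinition",
"ecs:RegisterTaskDefinition",
"ecs:UpdateService",
"ecs:DescribeServices"
],
"Resource": "*"
},
{
"Sid": "PassRolesToEcs",
"Effect": "Allow",
"Action": ["iam:PassRole"],
"Resource": [
"arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"arn:aws:iam::123456789012:role/apiServiceTaskRole"
]
}
]
}
EOF
# Attach it to the dedicated CI user
aws iam put-user-policy \
--user-name github-actions-deployer \
--policy-name GitHubActionsDeployPolicy \
--policy-document file://github-actions-policy.json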
Runtime Security for Tasks on Fargate
Fargate has strong security features built in, including task-level isolation and no hosts for you to manage, but you still need to configure it properly to get the full benefit.
1. Network Segmentation
I deploy my Fargate services into private subnets with no direct internet access. For outbound connectivity (pulling images from ECR, calling AWS APIs), traffic goes through NAT gateways.
Here's the VPC configuration in Terraform:
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "production-vpc"
}
}
resource "aws_subnet" "private_a" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
tags = {
Name = "production-private-subnet-a"
}
}
resource "aws_subnet" "private_b" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.2.0/24"
availability_zone = "us-east-1b"
tags = {
Name = "production-private-subnet-b"
}
}
resource "aws_security_group" "ecs_tasks" {
name = "ecs-tasks-sg"
description = "Security group for ECS tasks"
vpc_id = aws_vpc.main.id
# Allow outbound to internet (via NAT Gateway)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
# Allow inbound from ALB only
ingress {
from_port = 3000
to_port = 3000
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "Allow traffic from ALB"
}
tags = {
Name = "ecs-tasks-security-group"
}
}
2. Hardened Task Definitions
Here's a sample Fargate task definition (Terraform) that follows secure defaults:
resource "aws_ecs_task_definition" "api_service" {
family = "api-service-task"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "512"
memory = "1024"
execution_role_arn = aws_iam_role.ecs_execution_role.arn
task_role_arn = aws_iam_role.api_task_role.arn
container_definitions = jsonencode([
{
name = "api-container"
image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/production-api:latest"
essential = true
portMappings = [
{
containerPort = 3000
protocol = "tcp"
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/production-cluster/api-service"
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "ecs"
}
}
environment = [
{
name = "NODE_ENV"
value = "production"
},
{
name = "PORT"
value = "3000"
}
]
secrets = [
{
name = "DATABASE_URL"
valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url-AbCdEf"
},
{
name = "API_KEY"
valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api-key-XyZ123"
}
]
# Security configurations
readonlyRootFilesystem = true
linuxParameters = {
capabilities = {
drop = ["ALL"]
add = ["NET_BIND_SERVICE"]
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
tags = {
Environment = "production"
Service = "api"
}
}
Key security features in this task definition:
- readonlyRootFilesystem prevents the container from writing to its own filesystem, so it can't be modified at runtime.
- linuxParameters.capabilities drops all Linux capabilities by default and adds back only what's required.
- secrets pulls sensitive values from AWS Secrets Manager instead of plain environment variables.
- The execution role and the task role are separate.
- A health check improves the overall reliability of the service.
3. Encryption Everywhere
Encryption in transit:
- The ALB terminates HTTPS using ACM certificates.
- All internal service-to-service communication uses HTTPS.
Encryption at rest:
- ECR images are encrypted with AWS KMS.
- CloudWatch Logs are encrypted.
- EFS volumes (if used) are encrypted.
Example: creating an encrypted log group:
aws logs create-log-group \
--log-group-name /ecs/production-cluster/api-service \
--kms-key-id arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012 \
--region us-east-1
4. Least-Privilege IAM Policies
Here is the actual task role I used:
resource "aws_iam_role" "api_task_role" {
name = "apiServiceTaskRole"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy" "api_task_policy" {
name = "apiServiceTaskPolicy"
role = aws_iam_role.api_task_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query",
"dynamodb:UpdateItem"
]
Resource = [
"arn:aws:dynamodb:us-east-1:123456789012:table/users-table",
"arn:aws:dynamodb:us-east-1:123456789012:table/users-table/index/*"
]
},
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "arn:aws:s3:::my-app-assets/*"
},
{
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue"
]
Resource = [
"arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url-*",
"arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api-key-*"
]
}
]
})
}
This policy is explicit about which DynamoDB tables the service can touch (only the ones it actually needs), which S3 actions are allowed (GetObject and PutObject only, no deletes), and which secrets it can read.
Compliance & Monitoring
Security isn't a one-time effort; it requires continuous monitoring and a maintained audit trail.
1. CloudWatch Logging
All container logs are collected in CloudWatch. Here's how I configure structured logging.
Creating the log group:
aws logs create-log-group \
--log-group-name /ecs/production-cluster/api-service \
--retention-in-days 30 \
--region us-east-1
Application logging (Node.js example):
const winston = require('winston');
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
defaultMeta: {
service: 'api-service',
environment: 'production'
},
transports: [
new winston.transports.Console()
]
});
// Usage
logger.info('User login successful', {
userId: 'user-123',
ipAddress: '203.0.113.42'
});
logger.error('Database connection failed', {
error: err.message,
stack: err.stack
});
2. CloudWatch Alarms
I set up alarms for critical metrics:
# High CPU usage alarm
aws cloudwatch put-metric-alarm \
--alarm-name api-service-high-cpu \
--alarm-description "Alert when API service CPU > 80%" \
--metric-name CPUUtilization \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=ServiceName,Value=api-service Name=ClusterName,Value=production-cluster \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
# Failed task alarm
aws cloudwatch put-metric-alarm \
--alarm-name api-service-task-failures \
--alarm-description "Alert on task failures" \
--metric-name TaskFailures \
--namespace ECS/ContainerInsights \
--statistic Sum \
--period 60 \
--threshold 1 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--dimensions Name=ServiceName,Value=api-service Name=ClusterName,Value=production-cluster \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
3. Amazon Inspector Integration
Inspector provides continuous runtime vulnerability scanning. Here's how I enabled it:
# Enable Inspector for ECR and EC2
aws inspector2 enable \
--resource-types ECR EC2 \
--region us-east-1
# Check Inspector findings
aws inspector2 list-findings \
--filter-criteria '{
"ecrImageRepositoryName": [{"comparison": "EQUALS", "value": "production-api"}],
"severity": [{"comparison": "EQUALS", "value": "CRITICAL"}]
}' \
--region us-east-1
Inspector gives you:
- Continuous scanning of all running container images
- Detection of CVEs in both OS packages and application dependencies
- Risk scores to prioritize the most urgent findings
- Integration with AWS Security Hub
4. Log Query Analysis
I use CloudWatch Logs Insights to monitor security-relevant events. This query finds failed authentication attempts:
fields @timestamp, userId, ipAddress, @message
| filter @message like /authentication failed/
| sort @timestamp desc
| limit 100
The query for the error rate by service:
fields @timestamp, service, @message
| filter level = "error"
| stats count() by service
| sort count() desc
5. Compliance Reporting
For compliance requirements (SOC 2, ISO 27001), I export logs to S3 for long-term retention:
aws logs create-export-task \
--log-group-name /ecs/production-cluster/api-service \
--from $(date -d '30 days ago' +%s)000 \
--to $(date +%s)000 \
--destination s3-compliance-logs-bucket \
--destination-prefix ecs-logs/api-service/
Step-by-Step Example: The Complete Process
Let me walk through deploying a brand-new microservice with all of the security features applied.
Step 1: Create the Microservice
Project Layout:
notification-service/
├── src/
│ ├── index.js
│ ├── handlers/
│ └── utils/
├── Dockerfile
├── package.json
└── .github/
└── workflows/
└── deploy.yml
Dockerfile (notification-service/Dockerfile), similar to the one shown earlier:
# Build stage
FROM node:18-alpine AS builder
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci --only=production
# Copy application code
COPY . .
# Build if needed (TypeScript, etc.)
# RUN npm run build
# Production stage
FROM node:18-alpine
WORKDIR /app
# Install security updates
RUN apk update && \
apk upgrade && \
apk add --no-cache dumb-init && \
rm -rf /var/cache/apk/*
# Create non-root user
RUN addgroup -g 1001 -S appgroup && \
adduser -u 1001 -S appuser -G appgroup && \
chown -R appuser:appgroup /app
# Copy from builder
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/src ./src
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
# Switch to non-root user
USER appuser
# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"
# Expose port
EXPOSE 3000
# Run application
CMD ["node", "src/index.js"]
Application code (src/index.js):
const express = require('express');
const AWS = require('aws-sdk');
const winston = require('winston');
const app = express();
const PORT = process.env.PORT || 3000;
// Configure logger
const logger = winston.createLogger({
level: 'info',
format: winston.format.json(),
defaultMeta: { service: 'notification-service' },
transports: [new winston.transports.Console()]
});
// Configure AWS SDK
const sns = new AWS.SNS({ region: 'us-east-1' });
const secretsManager = new AWS.SecretsManager({ region: 'us-east-1' });
app.use(express.json());
// Health check endpoint
app.get('/health', (req, res) => {
res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString() });
});
// Send notification endpoint
app.post('/notify', async (req, res) => {
try {
const { message, topic } = req.body;
logger.info('Sending notification', { topic, messageLength: message.length });
const params = {
Message: message,
TopicArn: `arn:aws:sns:us-east-1:123456789012:${topic}`
};
await sns.publish(params).promise();
logger.info('Notification sent successfully', { topic });
res.status(200).json({ success: true, message: 'Notification sent' });
} catch (error) {
logger.error('Failed to send notification', {
error: error.message,
stack: error.stack
});
res.status(500).json({ success: false, error: 'Failed to send notification' });
}
});
// Graceful shutdown
process.on('SIGTERM', () => {
logger.info('SIGTERM received, shutting down gracefully');
process.exit(0);
});
app.listen(PORT, () => {
logger.info(`Notification service listening on port ${PORT}`);
});
Step 2: Create ECR Repository
# Create the repository
aws ecr create-repository \
--repository-name notification-service \
--image-scanning-configuration scanOnPush=true \
--encryption-configuration encryptionType=KMS,kmsKey=arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012 \
--region us-east-1
# Set lifecycle policy to keep only last 10 images
aws ecr put-lifecycle-policy \
--repository-name notification-service \
--lifecycle-policy-text '{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 images",
"selection": {
"tagStatus": "any",
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
}
]
}' \
--region us-east-1
Expected output:
{
"repository": {
"repositoryArn": "arn:aws:ecr:us-east-1:123456789012:repository/notification-service",
"registryId": "123456789012",
"repositoryName": "notification-service",
"repositoryUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service",
"createdAt": "2024-12-15T10:30:00.000000+00:00",
"imageScanningConfiguration": {
"scanOnPush": true
},
"encryptionConfiguration": {
"encryptionType": "KMS",
"kmsKey": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
}
}
}
Step 3: Build and Push Image Manually (First Time)
# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Build the image
docker build -t notification-service:v1.0.0 .
# Tag for ECR
docker tag notification-service:v1.0.0 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service:v1.0.0
docker tag notification-service:v1.0.0 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service:latest
# Push to ECR
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service:v1.0.0
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service:latest
# Wait for scan to complete
echo "Waiting for image scan..."
sleep 30
# Check scan results
aws ecr describe-image-scan-findings \
--repository-name notification-service \
--image-id imageTag=v1.0.0 \
--region us-east-1 \
--query 'imageScanFindings.findingSeverityCounts'
Output of scan:
{
"MEDIUM": 2,
"LOW": 8,
"INFORMATIONAL": 3
}
No critical or high vulnerabilities, so it's safe to deploy!
Step 4: Create IAM Roles
Task execution role (same for all services):
# Create trust policy
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create role
aws iam create-role \
--role-name ecsTaskExecutionRole \
--assume-role-policy-document file://trust-policy.json
# Attach AWS managed policy
aws iam attach-role-policy \
--role-name ecsTaskExecutionRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
Task role (specific to notification service):
# Create task role
aws iam create-role \
--role-name notificationServiceTaskRole \
--assume-role-policy-document file://trust-policy.json
# Create custom policy
cat > notification-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sns:Publish"
],
"Resource": [
"arn:aws:sns:us-east-1:123456789012:app-notifications",
"arn:aws:sns:us-east-1:123456789012:alert-notifications"
]
},
{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue"
],
"Resource": [
"arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/notification-config-*"
]
}
]
}
EOF
# Attach policy
aws iam put-role-policy \
--role-name notificationServiceTaskRole \
--policy-name NotificationServicePolicy \
--policy-document file://notification-policy.json
Step 5: Create Task Definition
# Create task definition JSON
cat > task-definition.json <<EOF
{
"family": "notification-service-task",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/notificationServiceTaskRole",
"containerDefinitions": [
{
"name": "notification-container",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/notification-service:latest",
"essential": true,
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/production-cluster/notification-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"environment": [
{
"name": "NODE_ENV",
"value": "production"
},
{
"name": "PORT",
"value": "3000"
}
],
"secrets": [
{
"name": "NOTIFICATION_CONFIG",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/notification-config-AbCdEf"
}
],
"readonlyRootFilesystem": false,
"linuxParameters": {
"capabilities": {
"drop": ["ALL"],
"add": ["NET_BIND_SERVICE"]
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
EOF
# Register task definition
aws ecs register-task-definition \
--cli-input-json file://task-definition.json \
--region us-east-1
Step 6: Create ECS Service
# Create CloudWatch log group first
aws logs create-log-group \
--log-group-name /ecs/production-cluster/notification-service \
--region us-east-1
# Create the ECS service
aws ecs create-service \
--cluster production-cluster \
--service-name notification-service \
--task-definition notification-service-task \
--desired-count 2 \
--launch-type FARGATE \
--platform-version LATEST \
--network-configuration "awsvpcConfiguration={
subnets=[subnet-0a1b2c3d,subnet-0e1f2g3h],
securityGroups=[sg-0a1b2c3d],
assignPublicIp=DISABLED
}" \
--load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/notification-tg/1234567890abcdef,containerName=notification-container,containerPort=3000" \
--health-check-grace-period-seconds 60 \
--deployment-configuration "maximumPercent=200,minimumHealthyPercent=100,deploymentCircuitBreaker={enable=true,rollback=true}" \
--enable-execute-command \
--region us-east-1
Important deployment configuration options:
- desired-count 2 keeps two tasks running at all times for high availability.
- assignPublicIp=DISABLED keeps the tasks in the VPC's private subnets.
- deploymentCircuitBreaker automatically rolls back failed deployments.
- enable-execute-command allows ECS Exec for debugging (the task role needs the appropriate permissions); a usage sketch follows this list.
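For reference, here's a sketch of using ECS Exec to open a shell in a running task; it assumes the Session Manager plugin is installed locally and that the task role allows the ssmmessages actions ECS Exec requires:
# Grab the ARN of one running task for the service
TASK_ARN=$(aws ecs list-tasks \
--cluster production-cluster \
--service-name notification-service \
--query 'taskArns[0]' --output text --region us-east-1)
# Open an interactive shell inside the container
aws ecs execute-command \
--cluster production-cluster \
--task "$TASK_ARN" \
--container notification-container \
--interactive \
--command "/bin/sh" \
--region us-east-1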
Step 7: Create a .github/workflows/deploy.yml file for the notification-service project (a simplified version of the workflow shown earlier):
name: Deploy Notification Service
on:
push:
branches: [ main ]
paths:
- 'notification-service/**'
env:
AWS_REGION: us-east-1
ECR_REPOSITORY: notification-service
ECS_CLUSTER: production-cluster
ECS_SERVICE: notification-service
CONTAINER_NAME: notification-container
jobs:
deploy:
name: Build, Scan, and Deploy
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v1
- name: Build, scan, and push image
id: build-image
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
cd notification-service
docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
# Wait for scan
sleep 30
# Check vulnerabilities
CRITICAL=$(aws ecr describe-image-scan-findings \
--repository-name $ECR_REPOSITORY \
--image-id imageTag=$IMAGE_TAG \
--query 'imageScanFindings.findingSeverityCounts.CRITICAL' \
--output text 2>/dev/null || echo "0")
if [ "$CRITICAL" != "0" ] && [ "$CRITICAL" != "None" ]; then
echo "❌ CRITICAL vulnerabilities found!"
exit 1
fi
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Deploy to ECS
uses: aws-actions/amazon-ecs-deploy-task-definition@v1
with:
task-definition: notification-service/task-definition.json
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
wait-for-service-stability: true
Step 8: Verify Deployment
# Check service status
aws ecs describe-services \
--cluster production-cluster \
--services notification-service \
--region us-east-1 \
--query 'services[0].[serviceName,status,runningCount,desiredCount]' \
--output table
# Check task health
aws ecs list-tasks \
--cluster production-cluster \
--service-name notification-service \
--region us-east-1
# Get task details
aws ecs describe-tasks \
--cluster production-cluster \
--tasks arn:aws:ecs:us-east-1:123456789012:task/production-cluster/1234567890abcdef \
--region us-east-1 \
--query 'tasks[0].[taskArn,lastStatus,healthStatus,containers[0].healthStatus]'
# Check logs
aws logs tail /ecs/production-cluster/notification-service \
--follow \
--region us-east-1
Expected healthy output:
| Service Name | Status | Desired Tasks | Running Tasks |
|---|---|---|---|
| notification-service | ACTIVE | 2 | 2 |
Step 9: Test the Service
# Get ALB DNS name
ALB_DNS=$(aws elbv2 describe-load-balancers \
--names production-alb \
--query 'LoadBalancers[0].DNSName' \
--output text \
--region us-east-1)
# Health check
curl https://$ALB_DNS/health
# Send test notification
curl -X POST https://$ALB_DNS/notify \
-H "Content-Type: application/json" \
-d '{
"message": "Test notification from secure pipeline",
"topic": "app-notifications"
}'
Expected response:
{
"success": true,
"message": "Notification sent"
}
Lessons Learned & Best Practices
After running this setup in production for several months, here are my major takeaways:
Security Lessons
1. Automate Everything or It Doesn't Get Done
- Manual security checks get skipped when teams are under pressure.
- Automated gates in CI/CD are non-negotiable.
- Make the secure path the easiest path; otherwise the insecure path wins.
2. Some Vulnerabilities Are Worse Than Others
- CRITICAL/HIGH: block deployment right away
- MEDIUM: accept and track with ticket
- LOW/INFORMATIONAL: trend monitoring.
- Context is key: a SQL injection CVE in a package your code never calls warrants a very different response than one in your web framework.
3. Defense-in-Depth Pays Off
- Multiple layers caught different issues: ECR scanning caught OS vulnerabilities, Trivy caught application-dependency vulnerabilities, and Inspector caught runtime issues.
- Network isolation limited lateral movement from a single compromised service.
- Least-privilege IAM limited the blast radius when a credential leaked.
4. Secrets Management Is Important
- Never keep secrets in environment variables (you can view them in logs, console, etc.)
- AWS Secrets Manager integration with ECS is straightforward; rotating secrets regularly (e.g., every 90 days) is best practice. A sketch of enabling automatic rotation follows.
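As a sketch (not part of the original pipeline), turning on automatic rotation looks roughly like this; the secret name and rotation Lambda ARN are illustrative, and it assumes you already have a rotation function (AWS provides templates for common databases):
# Enable 90-day automatic rotation for a secret
aws secretsmanager rotate-secret \
--secret-id prod/database-url \
--rotation-lambda-arn arn:aws:lambda:us-east-1:123456789012:function:prod-db-rotation \
--rotation-rules AutomaticallyAfterDays=90 \
--region us-east-1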
Operational Lessons
5. Read-Only Root Filesystems Can Break Applications
- Many applications rely upon writing to /tmp.
- The fix is to mount a dedicated writable volume at /tmp while keeping the root filesystem read-only:
"mountPoints": [
{
"sourceVolume": "tmp",
"containerPath": "/tmp",
"readOnly": false
}
],
"volumes": [
{
"name": "tmp",
"host": {}
}
]
6. The Importance of Health Checks
- Implement proper health checks (a response code of '200' isn't a true indicator of health).
- Have the health endpoint verify database connectivity and downstream service health, not just process liveness.
7. Considerations for Logging
- Structured logging (JSON) makes logs much easier to search and analyze.
- Trace IDs let you correlate a request across services.
- Keep application logs separate from access logs.
- Set retention policies (we keep application logs in CloudWatch for 30 days and in S3 for 7 years).
8. Cost-Effective Practices
- Basic ECR scan-on-push is free; enhanced scanning costs about $0.09 per image per month, which is well worth it.
- CloudWatch Logs Insights queries are billed per GB of data scanned, so save or cache frequently used query patterns.
- Fargate Spot can save up to 70% for workloads that aren't time-critical (a sketch follows).
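As a rough sketch, a non-critical service can run on Fargate Spot through a capacity provider strategy instead of --launch-type FARGATE; this assumes the FARGATE_SPOT capacity provider is enabled on the cluster, and the service, subnet, and security group names here are illustrative:
# Create a service that runs entirely on Fargate Spot capacity
aws ecs create-service \
--cluster production-cluster \
--service-name batch-worker \
--task-definition batch-worker-task \
--desired-count 2 \
--capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=1 \
--network-configuration "awsvpcConfiguration={subnets=[subnet-0a1b2c3d],securityGroups=[sg-0a1b2c3d],assignPublicIp=DISABLED}" \
--region us-east-1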
Lessons Learned in CI/CD
9. Deployment Speed versus Security
The pipeline takes 8-12 minutes end to end (build: ~3 min, scan: ~2 min, deploy: ~5 min). That's acceptable for production, and it's not worth sacrificing security for speed. For faster iteration, use dev environments with relaxed policies.
10. Rollback Planning
When a deployment fails, the ECS circuit breaker rolls it back automatically. Always retain previous task definition revisions, use blue/green deployments for critical services, and test your rollback plan in advance (a manual rollback sketch follows).
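For reference, a manual rollback is just pointing the service back at a known-good task definition revision; the revision number here is illustrative:
# Roll the service back to a previous task definition revision
aws ecs update-service \
--cluster production-cluster \
--service api-service \
--task-definition api-service-task:42 \
--force-new-deployment \
--region us-east-1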
Lessons Learned on Architecture
11. Isolated Services:
Each microservice has its own ECR repository.
Each service has its unique IAM Task Role.
Services communicate using ALB/API Gateway only.
This limits the potential effects of a compromised service on other services.
12. Monitoring is Mandatory:
- Set alarms before incidents occur, not after.
- Alert on security events such as unusual traffic and repeated failed authentication attempts.
- Weekly review meetings for security findings.
- AWS Security Hub provides a central place for findings (a sketch of enabling it follows).
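Enabling Security Hub is a one-liner; this is a sketch, and the --enable-default-standards flag also turns on the default security standards, which you may or may not want:
# Enable Security Hub in the region (centralizes Inspector and other findings)
aws securityhub enable-security-hub \
--enable-default-standards \
--region us-east-1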
Lessons Learned from the Team and Process
13. To Avoid Shortcuts:
- Use Documentation: Document why a security control exists.
- Provide run books for common situations.
- Clearly define the exception process (and make it rare).
14. Security Champions Facilitate Communication:
- A security champion on each team reviews code changes and stays current with known threats.
- They provide a bridge between the development team and security.
15. Improve Your Security Continuously:
- Conduct quarterly security architecture review meetings.
- Conduct a Post Mortem of Security Incidents (no blame).
- Track metrics such as time to patch a vulnerability and false positive rate.
My Ten Best Security Practices
Each practice, and how I applied it:
- Scan every image on push to catch issues early: ECR scan-on-push plus the GitHub Actions gate in your existing pipeline.
- Block high-severity findings automatically: let the CI/CD gate fail the build instead of relying on manual review.
- Limit what a compromised service can do: a separate least-privilege task role per service.
- Reduce the attack surface: private subnets, no public IP addresses, and NAT gateways for outbound access only.
- Run containers as a non-root user to prevent privilege escalation: UID 1001 in the Dockerfiles.
- Use immutable, reproducible infrastructure: Fargate plus task definitions kept in Git.
- Encrypt all data at rest and in transit: KMS for ECR repositories, logs, and secrets; HTTPS everywhere.
- Emit structured JSON logs with trace IDs so you can investigate thoroughly when a security incident occurs.
- Monitor continuously: CloudWatch alarms and Amazon Inspector for fast detection and reporting.
- Stay ahead of the threat landscape: automated dependency updates (such as Dependabot) and a monthly schedule for refreshing base images.
Mistakes I Made, So You Can Avoid Them
Gave every service the same broad IAM role - if one service were compromised, it would have access to everything. Fixed by creating a dedicated role per service.
Did not set log retention - the CloudWatch Logs bill was eye-watering. I now set a 30-day retention period on all log groups (command sketched after this list).
Ignored image layer caching - my initial Dockerfiles took 15 minutes to build. Reordering the COPY commands (dependencies before source) fixed the cache hits.
Deployed to production from my laptop "just this one time" - it quickly became a pattern. All production deployments now go through GitHub Actions.
Ignored MEDIUM vulnerabilities - they accumulated to over 200. We now track and remediate them quarterly.
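For the log retention fix, here's a sketch of applying a 30-day policy to every log group under the /ecs/ prefix:
# Apply a 30-day retention policy to all ECS log groups
for lg in $(aws logs describe-log-groups \
--log-group-name-prefix /ecs/ \
--query 'logGroups[].logGroupName' --output text --region us-east-1); do
aws logs put-retention-policy \
--log-group-name "$lg" \
--retention-in-days 30 \
--region us-east-1
done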
Conclusion: Security as a Competitive Advantage
When I started this project, I expected security to slow us down. I was wrong.
Automated container security has actually made our deployments faster:
We were able to eliminate the "security review" bottlenecks.
Our Development Team was able to find problems in CI before they made their way to production.
The overall number of incidents that required remediation decreased and therefore we spent less time on "fire-fighting".
Compliance audits have become trivial because we have access to all of the necessary logs, scans, and configurations in Codified form.
By The Numbers:
- Zero production security incidents caused by container vulnerabilities in the past 6 months.
- Average deployment time, including all security checks: 8 minutes.
- 100% of images scanned for vulnerabilities before deployment.
- About 15 minutes saved per deployment by removing manual review.
- Roughly 2 hours a week saved by eliminating recurring security review meetings.
Advice I would give my former self:
When you start a project like this, you must begin with security in mind.
It is far more difficult to retrofit security than it is to build security in from the beginning.
AWS has fantastic tools (ECR Scanning, Fargate Isolation, IAM) that do all of the "heavy lifting" for you; you just have to properly configure and integrate them.
Resources That Helped Me
AWS ECS Security Best Practices
ECR Image Scanning Documentation
Fargate Security Guide
GitHub Actions for AWS
Container Security by Liz Rice
Final Thoughts
Security doesn't have to be complicated or slow. Automated properly, it becomes nearly invisible to developers while still protecting your production infrastructure.
With AWS Fargate (a serverless container runtime where AWS manages and isolates the underlying hosts) and GitHub Actions, you can build security into your deployment pipeline without managing that infrastructure yourself.
Hopefully, after following this guide, you can also create an automated container security pipeline. You can use these examples of code to get started and customize them as necessary.
If you have any questions, comments or war stories to share regarding this guide, I would love to hear from you! You can post a comment here or connect with me on social media. Together we can work towards providing a more secure AWS ecosystem! 🔐
This solution has been running in production for over 6 months, handling more than 50,000 requests per day, without a single security incident caused by a container vulnerability. If this guide was helpful, please share it with your team and the wider AWS community.