The Performance Crisis
Your Jenkins pipeline is a bottleneck:
- 45 minutes to complete a single pipeline run
- 2 deployments per day maximum (developers waiting hours for feedback)
- High failure rate due to flaky tests
- Resource contention on build servers (queues backing up)
The business impact? Slower feature delivery, frustrated developers, and a release process that can't keep up with the team.
Requirements:
- Reduce pipeline time to under 15 minutes (70% reduction!)
- Enable multiple deployments per day
- Improve reliability (reduce flaky test failures)
- Maintain security and quality gates
In this article, I'll walk through optimizing a CI/CD pipeline on AWS, covering parallelization, test optimization, caching strategies, infrastructure improvements, and monitoring.
Current State Analysis
Identifying Bottlenecks
Typical 45-Minute Pipeline Breakdown:
Source Checkout: 2 min
Dependency Install: 8 min ← Bottleneck
Unit Tests: 12 min ← Can parallelize
Integration Tests: 15 min ← Can parallelize
E2E Tests: 20 min ← Major bottleneck
Security Scan: 5 min ← Can parallelize
Build Artifact: 3 min
Deploy: 2 min
─────────────────────────────
Total: 45 min
Key Issues:
- Sequential execution (everything runs one after another)
- No caching (dependencies downloaded every time)
- Slow test execution (not optimized)
- Resource constraints (single build server)
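Before optimizing, measure rather than guess. For a Jenkins pipeline the per-stage timings come from the build's stage view; once the pipeline runs on CodePipeline (as in the rest of this article), the same breakdown can be pulled from the API. A minimal sketch with boto3 — the pipeline name is a placeholder:
# measure where time goes, per action, across recent executions
import boto3

codepipeline = boto3.client('codepipeline')
PIPELINE = 'payment-app-pipeline'  # placeholder name of an existing pipeline

executions = codepipeline.list_pipeline_executions(
    pipelineName=PIPELINE, maxResults=5
)['pipelineExecutionSummaries']

for execution in executions:
    details = codepipeline.list_action_executions(
        pipelineName=PIPELINE,
        filter={'pipelineExecutionId': execution['pipelineExecutionId']}
    )['actionExecutionDetails']
    print(f"Execution {execution['pipelineExecutionId']}:")
    for action in details:
        duration = (action['lastUpdateTime'] - action['startTime']).total_seconds()
        print(f"  {action['stageName']}/{action['actionName']}: {duration / 60:.1f} min")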
Solution Architecture
Optimized Pipeline Design
┌─────────────────────────────────────────────┐
│                Source Stage                 │
│             (CodeCommit/GitHub)             │
└──────────────────────┬──────────────────────┘
                       │
          ┌────────────┴────────────┐
          │                         │
┌─────────▼─────────┐     ┌─────────▼─────────┐
│    Build Stage    │     │   Security Scan   │  ← Parallel
│    (Parallel)     │     │    (Parallel)     │
└─────────┬─────────┘     └─────────┬─────────┘
          │                         │
          └────────────┬────────────┘
                       │
          ┌────────────┴────────────┐
          │                         │
┌─────────▼─────────┐     ┌─────────▼─────────┐
│    Unit Tests     │     │ Integration Tests │  ← Parallel
│    (Parallel)     │     │    (Parallel)     │
└─────────┬─────────┘     └─────────┬─────────┘
          │                         │
          └────────────┬────────────┘
                       │
          ┌────────────┴────────────┐
          │                         │
┌─────────▼─────────┐     ┌─────────▼─────────┐
│     E2E Tests     │     │  Build Artifact   │  ← Parallel
│    (Optimized)    │     │     (Cached)      │
└─────────┬─────────┘     └─────────┬─────────┘
          │                         │
          └────────────┬────────────┘
                       │
                ┌──────▼──────┐
                │   Deploy    │
                └─────────────┘
Target: Under 15 Minutes
Phase 1: Parallelization Strategy
AWS CodePipeline with Parallel Actions
Pipeline Definition:
{
"pipeline": {
"name": "payment-app-optimized-pipeline",
"stages": [
{
"name": "Source",
"actions": [{
"name": "SourceAction",
"actionTypeId": {
"category": "Source",
"owner": "AWS",
"provider": "CodeCommit",
"version": "1"
},
"outputArtifacts": [{"name": "SourceOutput"}]
}]
},
{
"name": "BuildAndTest",
"actions": [
{
"name": "Build",
"actionTypeId": {
"category": "Build",
"owner": "AWS",
"provider": "CodeBuild",
"version": "1"
},
"inputArtifacts": [{"name": "SourceOutput"}],
"outputArtifacts": [{"name": "BuildOutput"}],
"configuration": {
"ProjectName": "payment-app-build"
}
},
{
"name": "SecurityScan",
"runOrder": 1,
"actionTypeId": {
"category": "Build",
"owner": "AWS",
"provider": "CodeBuild",
"version": "1"
},
"inputArtifacts": [{"name": "SourceOutput"}],
"outputArtifacts": [{"name": "SecurityScanOutput"}],
"configuration": {
"ProjectName": "payment-app-security-scan"
}
}
]
},
{
"name": "Test",
"actions": [
{
"name": "UnitTests",
"runOrder": 1,
"actionTypeId": {
"category": "Test",
"owner": "AWS",
"provider": "CodeBuild",
"version": "1"
},
"inputArtifacts": [{"name": "BuildOutput"}],
"outputArtifacts": [{"name": "UnitTestOutput"}],
"configuration": {
"ProjectName": "payment-app-unit-tests"
}
},
{
"name": "IntegrationTests",
"runOrder": 1,
"actionTypeId": {
"category": "Test",
"owner": "AWS",
"provider": "CodeBuild",
"version": "1"
},
"inputArtifacts": [{"name": "BuildOutput"}],
"outputArtifacts": [{"name": "IntegrationTestOutput"}],
"configuration": {
"ProjectName": "payment-app-integration-tests"
}
}
]
},
{
"name": "E2ETest",
"actions": [{
"name": "E2ETests",
"actionTypeId": {
"category": "Test",
"owner": "AWS",
"provider": "CodeBuild",
"version": "1"
},
"inputArtifacts": [{"name": "BuildOutput"}],
"configuration": {
"ProjectName": "payment-app-e2e-tests"
}
}]
},
{
"name": "Deploy",
"actions": [{
"name": "DeployToStaging",
"actionTypeId": {
"category": "Deploy",
"owner": "AWS",
"provider": "ECS",
"version": "1"
},
"inputArtifacts": [{"name": "BuildOutput"}],
"configuration": {
"ClusterName": "payment-staging",
"ServiceName": "payment-service"
}
}]
}
]
}
}
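Two notes on this definition: actions in the same stage run in parallel when they share the same runOrder (it defaults to 1), and the JSON above omits the roleArn and artifactStore fields that CodePipeline requires. A minimal sketch for applying it with boto3 — the role ARN and artifact bucket are placeholders:
import json
import boto3

codepipeline = boto3.client('codepipeline')

with open('pipeline.json') as f:          # the definition shown above
    definition = json.load(f)['pipeline']

# Required fields omitted above for brevity -- replace with your own values
definition['roleArn'] = 'arn:aws:iam::123456789012:role/CodePipelineServiceRole'
definition['artifactStore'] = {
    'type': 'S3',
    'location': 'payment-app-pipeline-artifacts'
}

# update_pipeline writes a new revision of an existing pipeline
# (use create_pipeline for a brand-new one)
codepipeline.update_pipeline(pipeline=definition)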
Parallel Test Execution
Split Tests by Category:
# test-runner.py
import subprocess
import sys
def run_tests_in_parallel():
"""Run tests in parallel based on category"""
test_categories = {
'unit': 'tests/unit',
'integration': 'tests/integration',
'e2e': 'tests/e2e'
}
processes = []
for category, test_path in test_categories.items():
# Run each category in parallel
        cmd = [
            'pytest',
            test_path,
            f'--junitxml=test-results-{category}.xml',
            '--maxfail=5',  # Fail fast within the category
            '-n', 'auto'    # Parallel workers within the category (requires pytest-xdist)
        ]
process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE
)
processes.append((category, process))
# Wait for all to complete
results = {}
for category, process in processes:
stdout, stderr = process.communicate()
results[category] = {
'returncode': process.returncode,
'stdout': stdout.decode(),
'stderr': stderr.decode()
}
    # Check results and surface output from any failed category
    failed = [cat for cat, res in results.items() if res['returncode'] != 0]
    if failed:
        for cat in failed:
            print(results[cat]['stdout'])
            print(results[cat]['stderr'], file=sys.stderr)
        print(f"❌ Tests failed in: {', '.join(failed)}")
        sys.exit(1)
    print("✅ All tests passed")
    return 0
if __name__ == '__main__':
sys.exit(run_tests_in_parallel())
CodeBuild Buildspec for Parallel Tests:
# buildspec-tests.yml
version: 0.2
phases:
install:
runtime-versions:
python: 3.9
nodejs: 18
commands:
- echo Installing dependencies...
- pip install -r requirements.txt
- npm ci
pre_build:
commands:
- echo Setting up test environment...
- |
# Start test dependencies in background
docker-compose up -d postgres redis
sleep 10 # Wait for services to be ready
build:
commands:
- echo Running tests in parallel...
      - |
        # Run integration tests in the background
        # (alternatively, test-runner.py above fans out every category at once)
        pytest tests/integration --junitxml=integration-results.xml -n auto &
        INTEGRATION_PID=$!

        # Run unit tests in parallel for faster feedback
        pytest tests/unit --junitxml=unit-results.xml -n auto &
        UNIT_PID=$!

        # Wait for both and capture their exit codes
        wait $UNIT_PID
        UNIT_EXIT=$?
        wait $INTEGRATION_PID
        INTEGRATION_EXIT=$?

        if [ $UNIT_EXIT -ne 0 ] || [ $INTEGRATION_EXIT -ne 0 ]; then
          echo "Tests failed"
          exit 1
        fi
post_build:
commands:
- echo Uploading test results...
- |
aws s3 cp unit-results.xml s3://test-results/payment-app/unit-${CODEBUILD_BUILD_ID}.xml
aws s3 cp integration-results.xml s3://test-results/payment-app/integration-${CODEBUILD_BUILD_ID}.xml
artifacts:
files:
- '**/*'
reports:
unit-tests:
files:
- 'unit-results.xml'
integration-tests:
files:
- 'integration-results.xml'
Phase 2: Test Optimization
Test Categorization Strategy
Fast Tests (Run First):
# tests/unit/test_fast.py
import pytest
@pytest.mark.fast
def test_calculate_total():
"""Fast unit test - runs in < 100ms"""
assert calculate_total([1, 2, 3]) == 6
@pytest.mark.fast
def test_validate_email():
"""Fast validation test"""
assert validate_email("test@example.com") == True
Slow Tests (Run After Fast Pass):
# tests/integration/test_slow.py
import pytest
@pytest.mark.slow
@pytest.mark.integration
def test_database_transaction():
"""Slow integration test - requires database"""
# Test implementation
pass
@pytest.mark.slow
@pytest.mark.e2e
def test_full_payment_flow():
"""Slow E2E test - requires full stack"""
# Test implementation
pass
Pytest Configuration:
# pytest.ini
[pytest]
markers =
fast: Fast tests (< 100ms)
slow: Slow tests (> 1s)
unit: Unit tests
integration: Integration tests
e2e: End-to-end tests
# Run fast tests first
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
# Parallel execution; slow tests are skipped by default and run in a later phase
addopts =
    -n auto
    -m "not slow"
    --maxfail=5
    --tb=short
Test Execution Strategy:
#!/bin/bash
# run-tests-optimized.sh
echo "Phase 1: Running fast tests..."
pytest -m "fast" --maxfail=5 -n auto
if [ $? -ne 0 ]; then
echo "❌ Fast tests failed, stopping pipeline"
exit 1
fi
echo "Phase 2: Running slow tests..."
pytest -m "slow" --maxfail=3 -n 2 # Fewer parallel for slow tests
if [ $? -ne 0 ]; then
echo "❌ Slow tests failed"
exit 1
fi
echo "✅ All tests passed"
Test Flakiness Reduction
Retry Strategy for Flaky Tests:
# conftest.py
import pytest

# The flaky marker below is provided by the pytest-rerunfailures plugin
@pytest.fixture(autouse=True)
def setup_test_environment():
    """Setup test environment"""
    yield
    # Cleanup

# Retry known flaky tests a few times before failing the build
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_external_api_call():
    """Test that sometimes fails due to network issues"""
    # Test implementation
    pass
Test Isolation:
# tests/conftest.py
import pytest
import asyncio
@pytest.fixture(scope="function")
def isolated_database():
"""Create isolated database for each test"""
# Create temporary database
db = create_test_database()
yield db
# Cleanup
db.drop()
@pytest.fixture(scope="function")
def clean_cache():
"""Clear cache before each test"""
cache.clear()
yield
cache.clear()
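The fixtures above reference a create_test_database helper (and a cache object) that aren't defined in this article. One possible backing for the database fixture, using a throwaway SQLite file, might look like this:
# tests/helpers/db.py -- hypothetical helper backing the isolated_database fixture
import os
import sqlite3
import tempfile

class TemporaryDatabase:
    """Throwaway SQLite database, one per test"""
    def __init__(self):
        fd, self.path = tempfile.mkstemp(suffix='.db')
        os.close(fd)
        self.conn = sqlite3.connect(self.path)

    def drop(self):
        """Close the connection and delete the database file"""
        self.conn.close()
        os.remove(self.path)

def create_test_database():
    return TemporaryDatabase()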
Test Data Management:
# tests/fixtures.py
import pytest
from faker import Faker
fake = Faker()
@pytest.fixture
def sample_customer():
"""Generate sample customer data"""
return {
'id': fake.uuid4(),
'name': fake.name(),
'email': fake.email(),
'created_at': fake.date_time()
}
@pytest.fixture
def sample_transaction(sample_customer):
"""Generate sample transaction"""
return {
'id': fake.uuid4(),
'customer_id': sample_customer['id'],
'amount': fake.pydecimal(left_digits=3, right_digits=2, positive=True),
'status': 'pending'
}
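For illustration, a test consuming these fixtures (the assertions are hypothetical) simply declares them as arguments; pytest resolves sample_customer once and reuses it inside sample_transaction:
def test_transaction_links_to_customer(sample_customer, sample_transaction):
    """sample_transaction is generated against sample_customer"""
    assert sample_transaction['customer_id'] == sample_customer['id']
    assert sample_transaction['status'] == 'pending'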
E2E Test Optimization
Selective E2E Testing:
# Only run E2E tests for critical paths
E2E_TEST_PATHS=(
  "tests/e2e/test_payment_flow.py::test_successful_payment"
  "tests/e2e/test_payment_flow.py::test_payment_failure"
  "tests/e2e/test_user_registration.py::test_new_user_signup"
)

# Run only this subset of E2E tests
pytest "${E2E_TEST_PATHS[@]}" --maxfail=1
E2E Test Parallelization with Test Containers:
# docker-compose.test.yml
version: '3.8'
services:
test-runner-1:
build: .
command: pytest tests/e2e/test_payment_flow.py
environment:
- TEST_ENV=staging-1
test-runner-2:
build: .
command: pytest tests/e2e/test_user_flow.py
environment:
- TEST_ENV=staging-2
test-runner-3:
build: .
command: pytest tests/e2e/test_admin_flow.py
environment:
- TEST_ENV=staging-3
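One way to drive these runners from a single build step is a small wrapper that launches each service and fails the build if any runner fails. A sketch, assuming Docker Compose v2 and the file above:
# run_e2e_parallel.py -- hypothetical wrapper around docker-compose.test.yml
import subprocess
import sys

SERVICES = ['test-runner-1', 'test-runner-2', 'test-runner-3']

# `docker compose run` returns the container's exit code, so we fan out
# one process per runner and wait on all of them
procs = {
    svc: subprocess.Popen(
        ['docker', 'compose', '-f', 'docker-compose.test.yml', 'run', '--rm', svc]
    )
    for svc in SERVICES
}

failed = [svc for svc, proc in procs.items() if proc.wait() != 0]
if failed:
    print(f"E2E runners failed: {', '.join(failed)}")
    sys.exit(1)
print("All E2E runners passed")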
Phase 3: Caching and Dependency Management
CodeBuild Caching
Enable Local Caching:
{
"cache": {
"type": "LOCAL",
"modes": [
"LOCAL_DOCKER_LAYER_CACHE",
"LOCAL_SOURCE_CACHE",
"LOCAL_CUSTOM_CACHE"
]
}
}
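Applying that cache configuration to an existing project is a one-call change; a sketch with boto3 (note that local caches are best-effort — they only help when a build lands on a host that has built the project before):
import boto3

codebuild = boto3.client('codebuild')

codebuild.update_project(
    name='payment-app-build',  # project name used elsewhere in this article
    cache={
        'type': 'LOCAL',
        'modes': [
            'LOCAL_DOCKER_LAYER_CACHE',  # reuse Docker layers between builds
            'LOCAL_SOURCE_CACHE',        # reuse Git metadata
            'LOCAL_CUSTOM_CACHE'         # honor the buildspec cache.paths section
        ]
    }
)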
Dependency Caching:
# buildspec-with-cache.yml
version: 0.2
cache:
paths:
- 'node_modules/**/*'
- '.venv/**/*'
- '.m2/**/*'
phases:
install:
commands:
- echo Restoring cache...
      - |
        # Python dependencies (reuse the cached virtualenv when present)
        if [ ! -d ".venv" ]; then
          python -m venv .venv
          source .venv/bin/activate
          pip install -r requirements.txt
        else
          echo "Using cached Python virtual environment"
          source .venv/bin/activate
        fi
- |
# Node.js dependencies
if [ -d "node_modules" ]; then
echo "Using cached node_modules"
else
npm ci --prefer-offline --no-audit
fi
- |
# Maven dependencies (Java)
if [ -d ".m2" ]; then
echo "Using cached Maven repository"
else
mvn dependency:go-offline
fi
build:
commands:
- echo Building application...
- npm run build
- mvn package -DskipTests
Docker Layer Caching
Optimize Dockerfile for Caching:
# Stage 1: Production dependencies (cached if package files don't change)
FROM node:18-slim AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && \
    npm cache clean --force

# Stage 2: Build (needs devDependencies, cached if source doesn't change)
FROM node:18-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Runtime (minimal)
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
CodeBuild Docker Caching:
# buildspec-docker-cache.yml
version: 0.2
phases:
pre_build:
commands:
- echo Logging in to Amazon ECR...
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
build:
commands:
- echo Building Docker image with cache...
- |
# Pull previous image for layer caching
docker pull $IMAGE_REPO_NAME:latest || true
- |
# Build with cache
docker build \
--cache-from $IMAGE_REPO_NAME:latest \
-t $IMAGE_REPO_NAME:$IMAGE_TAG \
-t $IMAGE_REPO_NAME:latest \
.
post_build:
commands:
- echo Pushing image...
- docker push $IMAGE_REPO_NAME:$IMAGE_TAG
- docker push $IMAGE_REPO_NAME:latest
S3 Caching for Large Artifacts
import boto3
import hashlib
import subprocess

from botocore.exceptions import ClientError

s3 = boto3.client('s3')
CACHE_BUCKET = 'payment-app-build-cache'
def get_cache_key(file_path):
"""Generate cache key from file hash"""
with open(file_path, 'rb') as f:
file_hash = hashlib.md5(f.read()).hexdigest()
return f"cache/{file_path}/{file_hash}"
def restore_from_cache(file_path, cache_key):
"""Restore file from S3 cache"""
try:
s3.download_file(CACHE_BUCKET, cache_key, file_path)
print(f"✅ Restored {file_path} from cache")
return True
    except ClientError:
        print(f"Cache miss for {file_path}")
        return False
def save_to_cache(file_path, cache_key):
"""Save file to S3 cache"""
try:
s3.upload_file(file_path, CACHE_BUCKET, cache_key)
print(f"✅ Saved {file_path} to cache")
except Exception as e:
print(f"Failed to save cache: {e}")
# Usage in a build script (shell steps run via subprocess)
cache_key = get_cache_key('package.json')
if restore_from_cache('node_modules.tar.gz', cache_key):
    subprocess.run(['tar', '-xzf', 'node_modules.tar.gz'], check=True)
else:
    subprocess.run(['npm', 'ci'], check=True)
    subprocess.run(['tar', '-czf', 'node_modules.tar.gz', 'node_modules'], check=True)
    save_to_cache('node_modules.tar.gz', cache_key)
Phase 4: Infrastructure Improvements
CodeBuild Compute Optimization
Use Larger Instances for Faster Builds:
# Update the CodeBuild project to use a larger instance
# (the compute type is set as part of the --environment structure)
aws codebuild update-project \
  --name payment-app-build \
  --environment "type=LINUX_CONTAINER,image=aws/codebuild/standard:5.0,computeType=BUILD_GENERAL1_LARGE"  # 8 vCPU, 15 GB RAM

# For very large builds
aws codebuild update-project \
  --name payment-app-build \
  --environment "type=LINUX_CONTAINER,image=aws/codebuild/standard:5.0,computeType=BUILD_GENERAL1_2XLARGE"  # 72 vCPU, 145 GB RAM
Auto-Scaling Build Fleet:
import boto3

codebuild = boto3.client('codebuild')

def create_fleet_for_parallel_builds():
    """Create a build fleet (reserved capacity) for parallel execution"""
    # Create a fleet with a baseline of always-on hosts; a scalingConfiguration
    # can also be supplied to scale the fleet on utilization
    codebuild.create_fleet(
        name='payment-app-fleet',
        baseCapacity=2,
        environmentType='LINUX_CONTAINER',
        computeType='BUILD_GENERAL1_LARGE',
        fleetServiceRole='arn:aws:iam::account:role/CodeBuildFleetRole'
    )

    # Point the project's environment at the fleet
    codebuild.update_project(
        name='payment-app-build',
        environment={
            'type': 'LINUX_CONTAINER',
            'image': 'aws/codebuild/standard:5.0',
            'computeType': 'BUILD_GENERAL1_LARGE',
            'fleet': {
                'fleetArn': 'arn:aws:codebuild:region:account:fleet/payment-app-fleet'
            }
        }
    )
ECS Build Agents (Alternative to CodeBuild)
ECS Task for Builds:
{
  "family": "build-agent",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "8192",
  "containerDefinitions": [
    {
      "name": "build-container",
      "image": "aws/codebuild/standard:5.0",
      "essential": true,
      "environment": [
        {"name": "AWS_REGION", "value": "us-east-1"}
      ]
    }
  ]
}
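To actually execute a build, something has to launch this task on the cluster. A minimal sketch with boto3 run_task — the cluster name, subnets, security group, and build command are placeholders:
import boto3

ecs = boto3.client('ecs')

# Launch a one-off build task on Fargate from the task definition above
ecs.run_task(
    cluster='build-cluster',                 # placeholder cluster name
    taskDefinition='build-agent',
    launchType='FARGATE',
    count=1,
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-123', 'subnet-456'],   # placeholders
            'securityGroups': ['sg-12345678'],         # placeholder
            'assignPublicIp': 'DISABLED'
        }
    },
    overrides={
        'containerOverrides': [{
            'name': 'build-container',
            'command': ['./run-build.sh']    # illustrative build entrypoint
        }]
    }
)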
Auto-Scale Build Agents:
import boto3
ecs = boto3.client('ecs')
application_autoscaling = boto3.client('application-autoscaling')
def setup_build_agent_autoscaling():
"""Setup auto-scaling for ECS build agents"""
# Register scalable target
application_autoscaling.register_scalable_target(
ServiceNamespace='ecs',
ResourceId='service/build-cluster/build-agent-service',
ScalableDimension='ecs:service:DesiredCount',
MinCapacity=1,
MaxCapacity=10
)
# Create scaling policy based on queue depth
application_autoscaling.put_scaling_policy(
ServiceNamespace='ecs',
ResourceId='service/build-cluster/build-agent-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyName='scale-on-queue-depth',
PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'CustomizedMetricSpecification': {
                'MetricName': 'ApproximateNumberOfMessagesVisible',
                'Namespace': 'AWS/SQS',
                'Dimensions': [
                    # hypothetical queue that feeds jobs to the build agents
                    {'Name': 'QueueName', 'Value': 'build-job-queue'}
                ],
                'Statistic': 'Average',
                'Unit': 'Count'
            },
            'TargetValue': 5.0,  # Scale out when the queue averages 5+ messages
            'ScaleInCooldown': 300,
            'ScaleOutCooldown': 60
        }
)
Network Optimization
VPC Endpoints for Faster S3 Access:
# Create VPC endpoint for S3 (faster than internet)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-12345678
# Create VPC endpoint for ECR
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.ecr.dkr \
--vpc-endpoint-type Interface \
--subnet-ids subnet-123 subnet-456 \
--security-group-ids sg-12345678
Phase 5: Monitoring and Metrics
Pipeline Metrics Dashboard
import boto3
from datetime import datetime, timedelta
codebuild = boto3.client('codebuild')
cloudwatch = boto3.client('cloudwatch')
def get_pipeline_metrics():
"""Collect pipeline performance metrics"""
# Get recent builds
builds = codebuild.batch_get_builds(
ids=get_recent_build_ids()
)
metrics = {
'total_builds': len(builds['builds']),
'successful_builds': len([b for b in builds['builds'] if b['buildStatus'] == 'SUCCEEDED']),
'failed_builds': len([b for b in builds['builds'] if b['buildStatus'] == 'FAILED']),
'avg_duration': calculate_avg_duration(builds['builds']),
'p95_duration': calculate_p95_duration(builds['builds']),
'p99_duration': calculate_p99_duration(builds['builds'])
}
return metrics
def calculate_avg_duration(builds):
"""Calculate average build duration"""
durations = []
for build in builds:
if build.get('endTime') and build.get('startTime'):
duration = (build['endTime'] - build['startTime']).total_seconds()
durations.append(duration)
return sum(durations) / len(durations) if durations else 0
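# The functions above reference get_recent_build_ids and the p95/p99 helpers
# without defining them; here is one possible implementation (nearest-rank
# percentiles over the latest page of build IDs).
import math  # would normally sit with the other imports at the top

def get_recent_build_ids(project_name='payment-app-build', limit=50):
    """Return the most recent build IDs for the project"""
    ids = codebuild.list_builds_for_project(
        projectName=project_name, sortOrder='DESCENDING'
    )['ids']
    return ids[:limit]

def _durations_seconds(builds):
    """Durations of completed builds, in seconds"""
    return [
        (b['endTime'] - b['startTime']).total_seconds()
        for b in builds if b.get('endTime') and b.get('startTime')
    ]

def _percentile(values, pct):
    """Nearest-rank percentile"""
    if not values:
        return 0
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(pct / 100 * len(ordered)) - 1))
    return ordered[index]

def calculate_p95_duration(builds):
    return _percentile(_durations_seconds(builds), 95)

def calculate_p99_duration(builds):
    return _percentile(_durations_seconds(builds), 99)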
def publish_metrics_to_cloudwatch(metrics):
"""Publish metrics to CloudWatch"""
cloudwatch.put_metric_data(
Namespace='PaymentApp/CI-CD',
MetricData=[
{
'MetricName': 'PipelineDuration',
'Value': metrics['avg_duration'],
'Unit': 'Seconds',
'Timestamp': datetime.utcnow()
},
{
'MetricName': 'PipelineSuccessRate',
'Value': (metrics['successful_builds'] / metrics['total_builds']) * 100,
'Unit': 'Percent',
'Timestamp': datetime.utcnow()
}
]
)
CloudWatch Dashboard
import json

def create_pipeline_dashboard():
"""Create CloudWatch dashboard for pipeline metrics"""
dashboard = {
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["PaymentApp/CI-CD", "PipelineDuration", {"stat": "Average"}],
[".", "PipelineDuration", {"stat": "p95"}],
[".", "PipelineDuration", {"stat": "p99"}]
],
"period": 300,
"stat": "Average",
"region": "us-east-1",
"title": "Pipeline Duration"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["PaymentApp/CI-CD", "PipelineSuccessRate", {"stat": "Average"}],
["AWS/CodeBuild", "Builds", {"stat": "Sum", "dimensions": [{"Name": "ProjectName", "Value": "payment-app-build"}]}]
],
"period": 300,
"stat": "Average",
"region": "us-east-1",
"title": "Pipeline Success Rate"
}
},
{
"type": "log",
"properties": {
"query": "SOURCE '/aws/codebuild/payment-app-build' | fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 100",
"region": "us-east-1",
"title": "Build Errors"
}
}
]
}
cloudwatch.put_dashboard(
DashboardName='CI-CD-Pipeline',
DashboardBody=json.dumps(dashboard)
)
Build Time Breakdown Analysis
def analyze_build_time_breakdown(build_id):
"""Analyze time spent in each phase"""
# Get build logs
logs = codebuild.batch_get_builds(ids=[build_id])
build = logs['builds'][0]
# Parse CloudWatch Logs for timing
log_group = f"/aws/codebuild/{build['projectName']}"
# Extract phase timings from logs
phases = {
'install': 0,
'pre_build': 0,
'build': 0,
'post_build': 0
}
# Query CloudWatch Logs Insights
logs_client = boto3.client('logs')
response = logs_client.start_query(
logGroupName=log_group,
startTime=int((build['startTime'] - timedelta(minutes=1)).timestamp()),
endTime=int((build['endTime'] + timedelta(minutes=1)).timestamp()),
queryString="""
fields @timestamp, @message
| filter @message like /PHASE/
| stats count() by @message
"""
)
# Process results to calculate phase durations
# (Implementation depends on log format)
return phases
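One gap worth noting: start_query is asynchronous, so the results have to be polled before any phase durations can be computed. A small helper (a sketch, with no timeout handling):
import time

def wait_for_query_results(logs_client, query_id, poll_seconds=2):
    """Poll CloudWatch Logs Insights until the query finishes"""
    while True:
        result = logs_client.get_query_results(queryId=query_id)
        if result['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return result
        time.sleep(poll_seconds)
The returned rows can then be parsed into the phases dictionary above, depending on your log format.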
Alerts for Pipeline Issues
def create_pipeline_alerts():
"""Create CloudWatch alarms for pipeline issues"""
alarms = [
{
'AlarmName': 'pipeline-duration-too-long',
'MetricName': 'PipelineDuration',
'Namespace': 'PaymentApp/CI-CD',
'Statistic': 'Average',
'Period': 300,
'EvaluationPeriods': 2,
'Threshold': 900, # 15 minutes
'ComparisonOperator': 'GreaterThanThreshold'
},
{
'AlarmName': 'pipeline-failure-rate-high',
'MetricName': 'PipelineSuccessRate',
'Namespace': 'PaymentApp/CI-CD',
'Statistic': 'Average',
'Period': 3600,
'EvaluationPeriods': 1,
'Threshold': 90, # Less than 90% success rate
'ComparisonOperator': 'LessThanThreshold'
}
]
for alarm in alarms:
cloudwatch.put_metric_alarm(**alarm)
print(f"Created alarm: {alarm['AlarmName']}")
Optimized Pipeline Timeline
Before Optimization
Source: 2 min
Dependencies: 8 min
Build: 5 min
Unit Tests: 12 min
Integration: 15 min
E2E: 20 min
Security: 5 min
Deploy: 2 min
─────────────────────
Total: 45 min
After Optimization
Source: 2 min
Build + Security (parallel): 5 min ← Was 13 min
Unit + Integration (parallel): 8 min ← Was 27 min
E2E (optimized): 6 min ← Was 20 min
Deploy: 2 min
─────────────────────────────────────
Total: 13 min ← Was 45 min (71% reduction!)
Implementation Checklist
Week 1: Setup and Configuration
- Create optimized CodePipeline
- Configure CodeBuild projects with caching
- Set up parallel test execution
- Enable CloudWatch monitoring
Week 2: Test Optimization
- Categorize tests (fast/slow)
- Implement test retry strategy
- Optimize E2E tests
- Reduce flaky tests
Week 3: Caching Implementation
- Enable CodeBuild local caching
- Implement Docker layer caching
- Set up S3 caching for large artifacts
- Verify cache hit rates
Week 4: Infrastructure Improvements
- Upgrade to larger CodeBuild instances
- Set up VPC endpoints
- Configure auto-scaling build agents
- Optimize network configuration
Week 5: Monitoring and Validation
- Create CloudWatch dashboards
- Set up alerts
- Validate pipeline performance
- Document improvements
Success Metrics
Target Metrics
target_metrics = {
'pipeline_duration_minutes': 15, # Target: < 15 minutes
'success_rate_percent': 95, # Target: > 95%
'cache_hit_rate_percent': 80, # Target: > 80%
'parallel_execution_percent': 60, # Target: 60% of pipeline parallel
'deployments_per_day': 5, # Target: 5+ deployments per day
'developer_feedback_time_minutes': 15 # Target: < 15 minutes
}
Measuring Success
def measure_pipeline_improvement():
"""Compare before and after metrics"""
before = {
'avg_duration': 45, # minutes
'success_rate': 85, # percent
'deployments_per_day': 2
}
after = {
'avg_duration': 13, # minutes
'success_rate': 96, # percent
'deployments_per_day': 5
}
improvements = {
'duration_reduction': ((before['avg_duration'] - after['avg_duration']) / before['avg_duration']) * 100,
'success_rate_improvement': after['success_rate'] - before['success_rate'],
'deployment_frequency_increase': ((after['deployments_per_day'] - before['deployments_per_day']) / before['deployments_per_day']) * 100
}
print(f"Duration reduction: {improvements['duration_reduction']:.1f}%")
print(f"Success rate improvement: {improvements['success_rate_improvement']:.1f}%")
print(f"Deployment frequency increase: {improvements['deployment_frequency_increase']:.1f}%")
return improvements
Best Practices Summary
Do's
- Parallelize everything possible (tests, builds, scans)
- Cache aggressively (dependencies, Docker layers, artifacts)
- Run fast tests first (fail fast on critical issues)
- Optimize test execution (parallel, selective, isolated)
- Use larger build instances for faster execution
- Monitor pipeline metrics continuously
- Reduce flaky tests (retry, isolate, fix root causes)
- Optimize Docker builds (layer caching, multi-stage)
Don'ts
- Don't run tests sequentially when they can run in parallel
- Don't skip caching (huge time savings)
- Don't run all E2E tests on every commit
- Don't ignore flaky tests (fix or remove them)
- Don't use small build instances for large builds
- Don't deploy without metrics (measure everything)
- Don't ignore build time breakdown (identify bottlenecks)
Conclusion
Optimizing a CI/CD pipeline from 45 minutes to under 15 minutes requires a systematic approach:
- Parallelization reduces total time by running tasks concurrently
- Test optimization focuses on fast feedback and reliability
- Caching eliminates redundant work (dependencies, Docker layers)
- Infrastructure improvements provide more compute power
- Monitoring ensures continuous optimization
The result? A fast, reliable pipeline that enables multiple deployments per day, improves developer experience, and maintains security and quality gates.