
Henry for AWS Community Builders


Implementing a Self-Healing Serverless CI/CD Pipeline with AWS Developer Tools

In modern DevOps workflows, even mature CI/CD pipelines often fail for reasons unrelated to the code: transient network glitches, dependency timeouts, flaky tests, or resource limits. Each such failure halts deployment and forces engineers to restart or troubleshoot the pipeline manually, which slows release velocity.

This project introduces a Self-Healing Serverless CI/CD Pipeline built entirely on AWS Developer Tools. The solution detects pipeline or build failures as they happen and automatically attempts to recover from them without human intervention, reducing downtime and improving reliability as a step toward autonomous DevOps.

Core AWS Services Used
AWS CodePipeline: Orchestrates the CI/CD process (Source → Build → Deploy).
AWS CodeBuild: Builds and tests application code using buildspec.yml.
Amazon EventBridge: Detects CodePipeline or CodeBuild failure events as they occur.
AWS Lambda (“Pipeline Doctor”): Analyzes the failure and executes automated recovery actions.
Amazon DynamoDB: Stores retry counts and incident data to avoid infinite loops.
Amazon SNS: Sends notifications to developers if automatic healing fails.
Amazon CloudWatch: Captures logs and metrics for observability.
How It Works 
CodePipeline triggers automatically when a commit is pushed to the source repository (CodeCommit or GitHub).
CodeBuild runs build and test commands.
If a build or pipeline fails, EventBridge immediately emits a failure event.
Lambda (“Pipeline Doctor”) receives the event, classifies the failure, and executes a healing playbook, such as retrying the failed build, increasing the build timeout, or retrying the failed pipeline stage.
The Doctor records the incident in DynamoDB, increments the retry count, and stops retrying once the limit is reached, which prevents endless loops.
If the retry succeeds, CodePipeline resumes automatically and continues to deploy.
If the retry limit is reached, SNS sends a notification to engineers for manual review.
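The flow above can be sketched as a small decision function. This is only an illustration under assumptions: the function name `choosePlaybook` and the playbook labels are hypothetical and are not taken from the repository's Doctor code.

```javascript
// Hypothetical decision helper for the "Pipeline Doctor".
// Playbook labels (RETRY_BUILD, etc.) are illustrative names, not from the repo.
function choosePlaybook(event, retriesSoFar, maxRetries) {
  // Stop healing once the retry budget is spent; a human takes over.
  if (retriesSoFar >= maxRetries) {
    return { action: "NOTIFY", reason: "retry limit reached" };
  }
  if (event.source === "aws.codebuild") {
    const status = event.detail["build-status"];
    if (status === "TIMED_OUT") {
      return { action: "INCREASE_TIMEOUT_AND_RETRY_BUILD", reason: "build timed out" };
    }
    return { action: "RETRY_BUILD", reason: "build failed" };
  }
  if (event.source === "aws.codepipeline") {
    return { action: "RETRY_STAGE", reason: "pipeline action failed" };
  }
  // Unknown events are escalated rather than retried blindly.
  return { action: "NOTIFY", reason: "unrecognized event source" };
}
```

Keeping the decision separate from the AWS calls makes it easy to unit-test the playbook selection without touching a real pipeline.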

Step 1:
Create the Deploy-target Lambda

Go to the AWS Console, search for Lambda, then choose Create function and use the parameters below.

Name: SelfHealingDemoTarget

Runtime: Node.js 20.x

Then click Create.

Replace the default code with the contents of lambda/demo/handler.js from the repository at https://github.com/Greynalytics/Self-healing-cicd.git:

exports.handler = async (event) => {
  // Log the incoming event so the invocation is visible in CloudWatch Logs.
  console.log("Deploy stage invoked with:", JSON.stringify(event));
  return { statusCode: 200, body: "Hello from Self-Healing Pipeline target!" };
};

Then click Deploy. This is just a harmless function that your pipeline “deploys” to by invoking it during the Deploy stage.
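One thing to keep in mind: a function invoked by a CodePipeline Lambda action is expected to report its result back with PutJobSuccessResult (or PutJobFailureResult), otherwise the action waits until it times out. A minimal sketch of a variant that does this is shown below. It assumes the AWS SDK v3 package `@aws-sdk/client-codepipeline` is available to the function and that the execution role allows `codepipeline:PutJobSuccessResult`; the helper names `makeHandler` and `reportViaSdk` are hypothetical.

```javascript
// Sketch: deploy target that also reports the job result back to CodePipeline.
// makeHandler/reportViaSdk are illustrative names, not part of the repository.
function makeHandler(reportSuccess) {
  return async (event) => {
    console.log("Deploy stage invoked with:", JSON.stringify(event));
    // "CodePipeline.job" is present only when CodePipeline invokes the function.
    const job = event["CodePipeline.job"];
    if (job) {
      await reportSuccess(job.id);
    }
    return { statusCode: 200, body: "Hello from Self-Healing Pipeline target!" };
  };
}

// Assumes @aws-sdk/client-codepipeline can be loaded in the Lambda environment.
async function reportViaSdk(jobId) {
  const { CodePipelineClient, PutJobSuccessResultCommand } =
    require("@aws-sdk/client-codepipeline");
  await new CodePipelineClient({}).send(new PutJobSuccessResultCommand({ jobId }));
}

// Export the handler when running as a CommonJS module (the Lambda default).
if (typeof exports !== "undefined") {
  exports.handler = makeHandler(reportViaSdk);
}
```

Injecting the reporter keeps the handler testable locally with a fake, while the Lambda runtime uses the real SDK call.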
Step 2:
Create the CodeBuild project

Go to the Console, search for CodeBuild, then click Create project and use the details below.

Name: self-healing-build

Source: https://github.com/Greynalytics/Self-healing-cicd.git

Repo: self-healing-src

Branch: main

Environment: Managed image → Amazon Linux 2 → Standard 7.0

Service role: create new (default)

Buildspec: Use a buildspec file (it’s already in the repo), then click Save.

My repo includes a buildspec.yml that runs test.js, which fails roughly 40% of the time to simulate a flaky test, and then zips the app/ folder into build.zip as the build artifact.
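For illustration, a flaky test of this kind could look like the sketch below. It is a hypothetical stand-in, not the repository's actual test.js, and it assumes the simulated failure rate is about 40%. The random source is injectable so the behaviour can be checked deterministically.

```javascript
// Hypothetical stand-in for a flaky test script; not the repo's real test.js.
// Returns true (pass) or false (fail); fails when random() < failureRate.
function flakyTest(random = Math.random, failureRate = 0.4) {
  return random() >= failureRate;
}

// Translates the test result into a process exit code:
// 0 lets the build continue, 1 makes the CodeBuild phase fail.
function runTests(random = Math.random) {
  if (!flakyTest(random)) {
    console.error("Simulated flaky failure");
    return 1;
  }
  console.log("Tests passed");
  return 0;
}
```

Saved as a script, it would end with `process.exitCode = runTests();` so that a simulated failure fails the build command.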

Step 3:
Create the CodePipeline

In the Console, search for CodePipeline and choose to build a custom pipeline.

Then click Create pipeline and use the parameters below.

Name: SelfHealingPipeline

Source: GitHub

Build: CodeBuild (select self-healing-build, the project created in Step 2)

Deploy stage: choose Add stage → + Add action group

Action provider: Lambda

Function name: SelfHealingDemoTarget (the function created in Step 1)

User parameters: { "msg": "deployed" } (optional)

Click Save, then Create pipeline, and let it run once. It may succeed or fail at random because of the flaky test, which is perfect for our demo.

Step 4:
Create the DynamoDB incidents table

Go to the Console, search for DynamoDB, then choose Create table with the parameters below:

Table name: SelfHealingIncidents

Partition key: incidentId (String)

Billing: On-demand (default)

Then click Create.
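To show how this table can guard against infinite retry loops, here is a sketch of a conditional "increment the retry count" update. It only builds the request parameters in the plain-value style accepted by `UpdateCommand` from `@aws-sdk/lib-dynamodb`. The attribute names `retries` and `lastSeen` and the helper name are assumptions for illustration; the repository's Doctor may store incidents differently.

```javascript
// Hypothetical helper: builds the input for a conditional retry-count increment.
// Attribute names "retries" and "lastSeen" are assumed, not taken from the repo.
function buildRetryUpdate(tableName, incidentId, maxRetries, now = new Date()) {
  return {
    TableName: tableName,
    Key: { incidentId }, // partition key of the incidents table
    // ADD creates the counter on first use and increments it afterwards.
    UpdateExpression: "SET #lastSeen = :now ADD #retries :one",
    // The update is rejected (ConditionalCheckFailedException) once the
    // limit is reached, which is what stops an endless retry loop.
    ConditionExpression: "attribute_not_exists(#retries) OR #retries < :max",
    ExpressionAttributeNames: { "#retries": "retries", "#lastSeen": "lastSeen" },
    ExpressionAttributeValues: {
      ":one": 1,
      ":max": maxRetries,
      ":now": now.toISOString(),
    },
    ReturnValues: "UPDATED_NEW",
  };
}
```

Doing the check and the increment in one conditional write avoids a race between two Doctor invocations handling the same incident at the same time.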

Step 5:
Create an SNS topic for alerts

In the Console, search for SNS, open Topics, and create a topic with the parameters below.

Type: Standard

Name: SelfHealingAlerts

Click Create topic, then Add subscription → Email → enter your email → confirm via the email you receive.

Note: copy the Topic ARN from the topic's details page; you’ll need it for the Doctor’s environment variables.

Step 6:
Create the “Pipeline Doctor” Lambda

Go to the Console → Lambda → Create function

Name: PipelineDoctor

Runtime: Node.js 20.x

Click Create.

Open the new function, go to the Code tab, and replace the handler with the contents of lambda/doctor/index.js from the repository.

Environment variables (Configuration → Environment variables):

TABLE = SelfHealingIncidents

TOPIC_ARN = your SNS topic ARN

MAX_RETRIES = 2
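A Doctor function might read these variables along the lines of the sketch below. The helper name `loadConfig` and the validation rules are assumptions for illustration; only the variable names TABLE, TOPIC_ARN, and MAX_RETRIES come from the settings above.

```javascript
// Hypothetical config loader for the Doctor; validates the three variables above.
function loadConfig(env = process.env) {
  if (!env.TABLE || !env.TOPIC_ARN) {
    throw new Error("TABLE and TOPIC_ARN must be set");
  }
  // Environment variables are always strings, so parse the retry limit.
  const maxRetries = Number.parseInt(env.MAX_RETRIES ?? "2", 10);
  if (!Number.isInteger(maxRetries) || maxRetries < 0) {
    throw new Error("MAX_RETRIES must be a non-negative integer");
  }
  return { table: env.TABLE, topicArn: env.TOPIC_ARN, maxRetries };
}
```

Failing fast on a missing or malformed value surfaces a misconfiguration in the first invocation's logs instead of in the middle of a healing attempt.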

Permissions (Execution role):
 Open the function’s Configuration → Permissions → Execution role:

Attach AmazonDynamoDBFullAccess (demo speed; tighten later).

Attach AmazonSNSFullAccess (demo speed; tighten later).

Add inline policy named Doctor-CodeBuild-Access:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "codebuild:BatchGetBuilds",
        "codebuild:StartBuild",
        "codebuild:UpdateProject"
      ],
      "Resource": "*"
    }
  ]
}

Add inline policy named Doctor-CodePipeline-Access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "codepipeline:RetryStageExecution",
        "codepipeline:GetPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}

(For production, restrict "Resource": "*" to your specific project/pipeline ARNs.)

Click Deploy on the Lambda code.

Step 7:
Wire EventBridge rules to the Doctor

In the Console, search for Amazon EventBridge, then click Rules and Create rule.

Rule A: CodeBuild failures/timeouts

Name: OnCodeBuildFail

Event pattern → Pre-defined pattern by service:

Service: CodeBuild

Event type: CodeBuild Build State Change

Add pattern filter:

detail.build-status IN FAILED, TIMED_OUT

Target: Lambda function → PipelineDoctor

Then click Create rule.

The resulting event pattern for CodeBuild failures and timeouts:
{
  "source": ["aws.codebuild"],
  "detail-type": ["CodeBuild Build State Change"],
  "detail": { "build-status": ["FAILED","TIMED_OUT"] }
}
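Once an event matching this pattern reaches the Doctor, it has to be turned into a retry request. The sketch below maps a CodeBuild event to the input of the CodeBuild StartBuild API (`projectName`, plus `timeoutInMinutesOverride` when the build timed out). The helper name, the 15-minute bump, and the 120-minute ceiling are assumptions chosen for the demo, not values from the repository.

```javascript
// Assumed ceiling for the demo so repeated bumps cannot grow without bound.
const MAX_TIMEOUT_MINUTES = 120;

// Hypothetical mapper: CodeBuild state-change event -> StartBuild input.
function toStartBuildInput(event, currentTimeoutMinutes = 60, bumpMinutes = 15) {
  const { detail } = event;
  // "project-name" is the CodeBuild project that produced the event.
  const input = { projectName: detail["project-name"] };
  if (detail["build-status"] === "TIMED_OUT") {
    // Only timed-out builds get a longer timeout; plain failures retry as-is.
    input.timeoutInMinutesOverride = Math.min(
      currentTimeoutMinutes + bumpMinutes,
      MAX_TIMEOUT_MINUTES
    );
  }
  return input;
}
```

The returned object can be passed to `StartBuildCommand` from `@aws-sdk/client-codebuild`, which the inline policy's `codebuild:StartBuild` permission covers.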

Rule B: CodePipeline action failures

Name: OnPipelineActionFail

Event pattern → Pre-defined pattern by service:

Service: CodePipeline

Event type: CodePipeline Action Execution State Change

Add pattern filter:

detail.state = FAILED

Target: Lambda function → PipelineDoctor

Then click Create rule.


The resulting event pattern for CodePipeline action failures:
{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Action Execution State Change"],
  "detail": { "state": ["FAILED"] }
}
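For pipeline action failures, the Doctor's retry corresponds to the CodePipeline RetryStageExecution API. The sketch below maps the event's `detail` fields (`pipeline`, `stage`, `execution-id`) onto that API's input. The helper name is hypothetical, and choosing `FAILED_ACTIONS` as the retry mode is an assumption about what the Doctor would want.

```javascript
// Hypothetical mapper: CodePipeline action event -> RetryStageExecution input.
function toRetryStageInput(event) {
  const { detail } = event;
  return {
    pipelineName: detail.pipeline,
    stageName: detail.stage,
    pipelineExecutionId: detail["execution-id"],
    // Re-run only the actions that failed, not the whole stage.
    retryMode: "FAILED_ACTIONS",
  };
}
```

The result can be passed to `RetryStageExecutionCommand` from `@aws-sdk/client-codepipeline`, matching the `codepipeline:RetryStageExecution` permission granted earlier.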

Step 8:
Test the self-healing flow

Push a new commit to main in pipeline-src/.

Watch CodePipeline run.

If Build fails (flaky test.js), EventBridge triggers Pipeline Doctor.

Doctor checks retries in DynamoDB, then retries the build (and may bump timeout).

If the retry passes, the pipeline turns green: it has self-healed.

If the retries are exhausted, you get an SNS email, and the incident is marked UNHEALED.

You can open CloudWatch Logs for PipelineDoctor to see which playbook ran.
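The alert sent when retries run out could be assembled as in the sketch below, which builds the input for the SNS Publish API (`TopicArn`, `Subject`, `Message`). The helper name and the shape of the `incident` object are assumptions for illustration; only the UNHEALED status comes from the flow described above.

```javascript
// Hypothetical helper: builds the SNS Publish input for an unhealed incident.
// The incident fields (incidentId, target, retries) are assumed for this sketch.
function buildAlert(topicArn, incident) {
  return {
    TopicArn: topicArn,
    // SNS subjects are limited to 100 characters, so truncate defensively.
    Subject: `Self-healing failed: ${incident.target}`.slice(0, 100),
    // Pretty-printed JSON keeps the email readable for the on-call engineer.
    Message: JSON.stringify({ ...incident, status: "UNHEALED" }, null, 2),
  };
}
```

The result can be passed to `PublishCommand` from `@aws-sdk/client-sns` with the topic ARN from the TOPIC_ARN environment variable.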
