DEV Community

Gopalakrishnan Marimuthu

Building a Self-Healing Data Pipeline with AWS Step Functions, Bedrock, and Glue Data Quality

How to build an intelligent data pipeline that detects anomalies and automatically remediates issues using generative AI


Data pipelines break. It's not a matter of if, but when. Schema changes, data drift, upstream system failures, and unexpected null values can silently corrupt your analytics and ML models. Traditional approaches rely on alerts and manual intervention—but what if your pipeline could heal itself?

In this article, I'll walk you through building a self-healing data pipeline on AWS that:

  • Detects data quality anomalies using ML-powered Glue Data Quality
  • Diagnoses root causes using Amazon Bedrock's generative AI
  • Remediates issues automatically with intelligent decision-making
  • Learns from past incidents to improve over time

The Problem: Silent Data Corruption

Consider this scenario: your e-commerce platform ingests order data every hour. One day, an upstream system change causes the order_total field to occasionally contain negative values. Your pipeline doesn't fail; it happily processes the data. Three weeks later, your finance team discovers revenue reports are off by millions.

This is silent data corruption, and it's far more dangerous than a crashed pipeline.
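Even before building anything sophisticated, a cheap guard in the ingestion path would surface this class of issue. Here's a minimal standalone sketch in plain Python (the field names follow the scenario above):

```python
def check_order_batch(orders):
    """Return a list of human-readable issues found in a batch of order dicts."""
    issues = []
    for i, order in enumerate(orders):
        total = order.get("order_total")
        if total is None:
            issues.append(f"row {i}: order_total is missing")
        elif total < 0:
            issues.append(f"row {i}: negative order_total ({total})")
    return issues

batch = [
    {"order_id": "a1", "order_total": 49.99},
    {"order_id": "a2", "order_total": -12.50},  # the silently corrupt record
]
problems = check_order_batch(batch)
print(problems)
```

A check like this catches the symptom, but it can't tell you why the data broke or what to do about it. That's the gap the rest of this architecture fills.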

Architecture Overview

Here's the self-healing architecture we'll build:

(Architecture diagram: the self-healing pipeline, from ingestion and quality checks through AI diagnosis to automated remediation)

Key Components:

| Component | AWS Service | Purpose |
| --- | --- | --- |
| Data Ingestion | S3 + EventBridge | Trigger pipeline on new data arrival |
| ETL Processing | AWS Glue | Transform and prepare data |
| Quality Detection | Glue Data Quality | ML-powered anomaly detection |
| Orchestration | Step Functions | Coordinate healing workflow |
| AI Diagnosis | Amazon Bedrock | Analyze anomalies and suggest fixes |
| Remediation | Lambda + Glue | Execute automated fixes |
| Learning | DynamoDB | Store incident history for pattern recognition |

Step 1: Setting Up Glue Data Quality with Anomaly Detection

AWS Glue Data Quality includes ML-powered anomaly detection (generally available since August 2024) that learns your data patterns over time. Let's configure it.

Create the Glue Data Quality Ruleset

# glue_data_quality_rules.py
import boto3

glue_client = boto3.client('glue')

# Define data quality ruleset with anomaly detection
ruleset_definition = """
Rules = [
    # Static rules for known constraints
    ColumnExists "order_id",
    ColumnExists "order_total",
    ColumnExists "customer_id",
    ColumnExists "order_date",

    IsComplete "order_id",
    IsUnique "order_id",

    ColumnValues "order_total" >= 0,
    ColumnValues "order_total" <= 1000000,

    # Dynamic rules with anomaly detection (driven by the Analyzers below)
    DetectAnomalies "RowCount",
    DetectAnomalies "Completeness" "order_total",

    # Freshness check
    DataFreshness "order_date" <= 24 hours
],

# Enable anomaly detection
Analyzers = [
    RowCount,
    Completeness "order_total",
    Mean "order_total",
    StandardDeviation "order_total",
    DistinctValuesCount "customer_id",
    ColumnCorrelation "order_total" "quantity"
]
"""

response = glue_client.create_data_quality_ruleset(
    Name='orders-quality-ruleset',
    Ruleset=ruleset_definition,
    Description='Data quality rules for orders with anomaly detection',
    TargetTable={
        'TableName': 'orders_raw',
        'DatabaseName': 'ecommerce_db'
    }
)

print(f"Created ruleset: {response['Name']}")

Integrate Data Quality into Your Glue Job

# glue_etl_with_quality.py
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

# Initialize
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'S3_INPUT_PATH', 'S3_OUTPUT_PATH'])
job.init(args['JOB_NAME'], args)

# Read raw data
raw_data = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args['S3_INPUT_PATH']]},
    format="parquet"
)

# Apply data quality evaluation with anomaly detection
dq_results = EvaluateDataQuality.apply(
    frame=raw_data,
    ruleset="""
        Rules = [
            IsComplete "order_id",
            IsUnique "order_id",
            ColumnValues "order_total" >= 0,
            DetectAnomalies "RowCount"
        ],
        Analyzers = [
            RowCount,
            Mean "order_total",
            StandardDeviation "order_total",
            Completeness "order_total"
        ]
    """,
    publishing_options={
        "dataQualityEvaluationContext": "orders_quality_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True
    }
)

# EvaluateDataQuality.apply returns a single DynamicFrame of rule outcomes;
# convert it to a Spark DataFrame for analysis
results_df = dq_results.toDF()

# Check for anomalies
anomalies = results_df.filter(
    (results_df.Outcome == "Failed") | 
    (results_df.Outcome == "Anomaly")
).collect()

# Store results for Step Functions to consume
quality_summary = {
    "total_rules": results_df.count(),
    "passed_rules": results_df.filter(results_df.Outcome == "Passed").count(),
    "failed_rules": len([a for a in anomalies if a.Outcome == "Failed"]),
    "anomalies_detected": len([a for a in anomalies if a.Outcome == "Anomaly"]),
    "anomaly_details": [
        {
            "rule": a.Rule,
            "outcome": a.Outcome,
            # FailureReason carries the observed value and the expected range
            "failure_reason": getattr(a, "FailureReason", None)
        } for a in anomalies
    ]
}

# Write quality summary to S3 as a single object (RDD saveAsTextFile would
# produce a directory of part files, which the downstream Lambda can't read
# as one JSON document)
import json
import boto3
from urllib.parse import urlparse

summary_uri = f"{args['S3_OUTPUT_PATH']}/quality_results/summary.json"
parsed = urlparse(summary_uri)
boto3.client('s3').put_object(
    Bucket=parsed.netloc,
    Key=parsed.path.lstrip('/'),
    Body=json.dumps(quality_summary).encode('utf-8')
)

# If no critical failures, write curated data
if quality_summary['failed_rules'] == 0:
    glueContext.write_dynamic_frame.from_options(
        frame=raw_data,
        connection_type="s3",
        connection_options={"path": f"{args['S3_OUTPUT_PATH']}/curated/"},
        format="parquet"
    )

job.commit()
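The state machine in the next step invokes a `get-quality-results` Lambda to load this summary back into the execution state. That function isn't shown elsewhere in the article, so here is a minimal sketch, assuming the bucket/key layout written by the Glue job above:

```python
# lambda/get_quality_results.py (sketch; path layout assumed to match the Glue job)
import json
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split s3://bucket/prefix into (bucket, key-prefix)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def lambda_handler(event, context):
    """Read the quality summary the Glue job wrote and return it to Step Functions."""
    import boto3  # imported lazily so the pure helper stays testable without AWS

    bucket, prefix = parse_s3_uri(event["outputPath"])
    obj = boto3.client("s3").get_object(
        Bucket=bucket, Key=f"{prefix}/quality_results/summary.json"
    )
    return json.loads(obj["Body"].read())
```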

Step 2: Building the Self-Healing Orchestrator with Step Functions

The Step Functions state machine coordinates the entire healing workflow. When anomalies are detected, it triggers diagnosis, determines the appropriate remediation, and executes the fix.
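Executions are typically started from the EventBridge-triggered ingestion path. As a quick orientation before the full definition, here's a hedged boto3 sketch (the state machine ARN is a placeholder):

```python
import json


def build_pipeline_input(input_path, output_path):
    """Build the execution input the state machine expects ($.inputPath / $.outputPath)."""
    return json.dumps({"inputPath": input_path, "outputPath": output_path})


def start_pipeline(sfn_client, state_machine_arn, input_path, output_path):
    """Start one self-healing pipeline run; returns the execution ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_pipeline_input(input_path, output_path),
    )
    return response["executionArn"]


# Example wiring (ARN and paths are placeholders):
# import boto3
# arn = start_pipeline(
#     boto3.client("stepfunctions"),
#     "arn:aws:states:us-east-1:123456789:stateMachine:self-healing-pipeline",
#     "s3://your-bucket/raw/2024-06-01/",
#     "s3://your-bucket/processed/2024-06-01/",
# )
```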

Step Functions Definition (ASL)

{
  "Comment": "Self-Healing Data Pipeline Orchestrator",
  "StartAt": "RunGlueETLJob",
  "States": {
    "RunGlueETLJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-etl-with-quality",
        "Arguments": {
          "--S3_INPUT_PATH.$": "$.inputPath",
          "--S3_OUTPUT_PATH.$": "$.outputPath"
        }
      },
      "Next": "GetQualityResults",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "HandleJobFailure"
        }
      ]
    },

    "GetQualityResults": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "get-quality-results",
        "Payload.$": "$"
      },
      "ResultSelector": {
        "total_rules.$": "$.Payload.total_rules",
        "failed_rules.$": "$.Payload.failed_rules",
        "anomalies_detected.$": "$.Payload.anomalies_detected",
        "anomaly_details.$": "$.Payload.anomaly_details"
      },
      "ResultPath": "$.qualityResults",
      "Next": "EvaluateQuality"
    },

    "EvaluateQuality": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.qualityResults.anomalies_detected",
          "NumericGreaterThan": 0,
          "Next": "GetHistoricalIncidents"
        },
        {
          "Variable": "$.qualityResults.failed_rules",
          "NumericGreaterThan": 0,
          "Next": "GetHistoricalIncidents"
        }
      ],
      "Default": "PipelineSuccess"
    },

    "GetHistoricalIncidents": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "get-historical-incidents",
        "Payload.$": "$"
      },
      "ResultSelector": {
        "recent_incidents.$": "$.Payload.recent_incidents",
        "action_success_rates.$": "$.Payload.action_success_rates"
      },
      "ResultPath": "$.historicalIncidents",
      "Next": "DiagnoseWithBedrock"
    },

    "DiagnoseWithBedrock": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-3-sonnet-20240229-v1:0",
        "ContentType": "application/json",
        "Accept": "application/json",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 2048,
          "messages": [
            {
              "role": "user",
              "content.$": "States.Format('You are a data quality expert. Analyze these data quality issues and provide: 1) Root cause analysis, 2) Severity (critical/high/medium/low), 3) Recommended remediation action (one of: QUARANTINE, AUTO_FIX, ROLLBACK, ALERT_ONLY), 4) Specific fix instructions if AUTO_FIX is recommended.\n\nQuality Results:\n{}\n\nHistorical Incidents:\n{}\n\nRespond in JSON format with keys: root_cause, severity, recommended_action, fix_instructions, confidence_score', States.JsonToString($.qualityResults), States.JsonToString($.historicalIncidents))"
            }
          ]
        }
      },
      "ResultSelector": {
        "diagnosis.$": "$.Body.content[0].text"
      },
      "ResultPath": "$.diagnosis",
      "Next": "ParseDiagnosis"
    },

    "ParseDiagnosis": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "parse-bedrock-diagnosis",
        "Payload.$": "$"
      },
      "ResultSelector": {
        "root_cause.$": "$.Payload.root_cause",
        "severity.$": "$.Payload.severity",
        "recommended_action.$": "$.Payload.recommended_action",
        "fix_instructions.$": "$.Payload.fix_instructions",
        "confidence_score.$": "$.Payload.confidence_score"
      },
      "ResultPath": "$.parsedDiagnosis",
      "Next": "DetermineAction"
    },

    "DetermineAction": {
      "Type": "Choice",
      "Choices": [
        {
          "And": [
            {
              "Variable": "$.parsedDiagnosis.recommended_action",
              "StringEquals": "AUTO_FIX"
            },
            {
              "Variable": "$.parsedDiagnosis.confidence_score",
              "NumericGreaterThanEquals": 0.8
            }
          ],
          "Next": "ExecuteAutoFix"
        },
        {
          "Variable": "$.parsedDiagnosis.recommended_action",
          "StringEquals": "QUARANTINE",
          "Next": "QuarantineData"
        },
        {
          "Variable": "$.parsedDiagnosis.recommended_action",
          "StringEquals": "ROLLBACK",
          "Next": "RollbackToPrevious"
        }
      ],
      "Default": "AlertAndLog"
    },

    "ExecuteAutoFix": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "execute-auto-remediation",
        "Payload.$": "$"
      },
      "ResultSelector": {
        "success.$": "$.Payload.success",
        "fixedDataPath.$": "$.Payload.fixedDataPath"
      },
      "ResultPath": "$.remediationResult",
      "Next": "VerifyRemediation",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "RemediationFailed"
        }
      ]
    },

    "VerifyRemediation": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-etl-with-quality",
        "Arguments": {
          "--S3_INPUT_PATH.$": "$.remediationResult.fixedDataPath",
          "--S3_OUTPUT_PATH.$": "$.outputPath",
          "--VERIFICATION_RUN": "true"
        }
      },
      "Next": "CheckVerificationResults"
    },

    "CheckVerificationResults": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "check-verification-results",
        "Payload.$": "$"
      },
      "ResultSelector": {
        "success.$": "$.Payload.success"
      },
      "ResultPath": "$.verificationResults",
      "Next": "VerificationOutcome"
    },

    "VerificationOutcome": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.verificationResults.success",
          "BooleanEquals": true,
          "Next": "RecordSuccessfulRemediation"
        }
      ],
      "Default": "RemediationFailed"
    },

    "RecordSuccessfulRemediation": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "pipeline-incidents",
        "Item": {
          "incident_id": {"S.$": "$$.Execution.Id"},
          "timestamp": {"S.$": "$$.State.EnteredTime"},
          "anomaly_type": {"S.$": "$.parsedDiagnosis.root_cause"},
          "action_taken": {"S": "AUTO_FIX"},
          "outcome": {"S": "SUCCESS"},
          "fix_details": {"S.$": "States.JsonToString($.parsedDiagnosis.fix_instructions)"}
        }
      },
      "Next": "PipelineSuccess"
    },

    "QuarantineData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "quarantine-bad-data",
        "Payload.$": "$"
      },
      "ResultPath": "$.quarantineResult",
      "Next": "AlertAndLog"
    },

    "RollbackToPrevious": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "rollback-to-previous-version",
        "Payload.$": "$"
      },
      "ResultPath": "$.rollbackResult",
      "Next": "AlertAndLog"
    },

    "AlertAndLog": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "SendSNSAlert",
          "States": {
            "SendSNSAlert": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789:data-quality-alerts",
                "Message.$": "States.Format('Data Quality Issue Detected\n\nRoot Cause: {}\nSeverity: {}\nAction Taken: {}\n\nDetails: {}', $.parsedDiagnosis.root_cause, $.parsedDiagnosis.severity, $.parsedDiagnosis.recommended_action, States.JsonToString($.qualityResults))"
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "LogIncident",
          "States": {
            "LogIncident": {
              "Type": "Task",
              "Resource": "arn:aws:states:::dynamodb:putItem",
              "Parameters": {
                "TableName": "pipeline-incidents",
                "Item": {
                  "incident_id": {"S.$": "$$.Execution.Id"},
                  "timestamp": {"S.$": "$$.State.EnteredTime"},
                  "anomaly_type": {"S.$": "$.parsedDiagnosis.root_cause"},
                  "action_taken": {"S.$": "$.parsedDiagnosis.recommended_action"},
                  "outcome": {"S": "ALERTED"},
                  "details": {"S.$": "States.JsonToString($.qualityResults)"}
                }
              },
              "End": true
            }
          }
        }
      ],
      "End": true
    },

    "RemediationFailed": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789:data-quality-critical",
        "Message": "CRITICAL: Auto-remediation failed. Manual intervention required.",
        "Subject": "Data Pipeline - Remediation Failed"
      },
      "Next": "PipelineFailed"
    },

    "HandleJobFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789:data-quality-critical",
        "Message": "CRITICAL: Glue ETL job failed before quality evaluation. Manual intervention required.",
        "Subject": "Data Pipeline - Glue Job Failed"
      },
      "Next": "PipelineFailed"
    },

    "PipelineSuccess": {
      "Type": "Succeed"
    },

    "PipelineFailed": {
      "Type": "Fail",
      "Error": "RemediationFailed",
      "Cause": "Auto-remediation was unsuccessful"
    }
  }
}
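The state machine references several small helper Lambdas (`quarantine-bad-data`, `rollback-to-previous-version`) that follow the same shape. As an illustration, here's a hedged sketch of the quarantine function, which copies the offending batch under a timestamped `quarantine/` prefix (bucket layout and function name are assumptions):

```python
# lambda/quarantine_bad_data.py (sketch)
from datetime import datetime, timezone
from urllib.parse import urlparse


def quarantine_key(key, now=None):
    """Map an input object key to a timestamped key under quarantine/."""
    now = now or datetime.now(timezone.utc)
    return f"quarantine/{now.strftime('%Y%m%d_%H%M%S')}/{key}"


def lambda_handler(event, context):
    """Copy the batch that failed quality checks into a quarantine prefix."""
    import boto3  # lazy import keeps the pure helper testable without AWS

    s3 = boto3.client("s3")
    parsed = urlparse(event["inputPath"])
    bucket, prefix = parsed.netloc, parsed.path.lstrip("/")

    moved = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = quarantine_key(obj["Key"])
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                Key=dest,
            )
            moved.append(dest)
    return {"quarantined_objects": moved}
```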

Step 3: Intelligent Diagnosis with Amazon Bedrock

The magic happens in the Bedrock diagnosis step. We use Claude to analyze the anomaly patterns, correlate with historical incidents, and recommend the best remediation strategy.
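Before wiring the prompt into ASL, it's worth iterating on it directly against the Bedrock runtime. Here's a sketch that mirrors the States.Format template from the state machine (the prompt builder and client wiring are illustrative):

```python
import json


def build_diagnosis_prompt(quality_results, historical_incidents):
    """Assemble the same diagnosis prompt the state machine builds with States.Format."""
    return (
        "You are a data quality expert. Analyze these data quality issues and provide: "
        "1) Root cause analysis, 2) Severity (critical/high/medium/low), "
        "3) Recommended remediation action (one of: QUARANTINE, AUTO_FIX, ROLLBACK, ALERT_ONLY), "
        "4) Specific fix instructions if AUTO_FIX is recommended.\n\n"
        f"Quality Results:\n{json.dumps(quality_results)}\n\n"
        f"Historical Incidents:\n{json.dumps(historical_incidents)}\n\n"
        "Respond in JSON format with keys: root_cause, severity, recommended_action, "
        "fix_instructions, confidence_score"
    )


def diagnose(bedrock_runtime, quality_results, historical_incidents):
    """Invoke Claude on Bedrock and return its raw diagnosis text."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {"role": "user",
             "content": build_diagnosis_prompt(quality_results, historical_incidents)}
        ],
    })
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    return json.loads(response["body"].read())["content"][0]["text"]


# Usage: diagnose(boto3.client("bedrock-runtime"), quality_summary, incident_history)
```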

Lambda Function for Parsing Bedrock Response

# lambda/parse_bedrock_diagnosis.py
import json
import re

def lambda_handler(event, context):
    """
    Parse Bedrock's diagnosis response and extract structured remediation plan.
    """
    diagnosis_text = event.get('diagnosis', {}).get('diagnosis', '')

    # Extract JSON from Bedrock response
    try:
        # Find JSON block in response
        json_match = re.search(r'\{[\s\S]*\}', diagnosis_text)
        if json_match:
            diagnosis = json.loads(json_match.group())
        else:
            # Fallback parsing if no JSON found
            diagnosis = parse_unstructured_response(diagnosis_text)
    except json.JSONDecodeError:
        diagnosis = parse_unstructured_response(diagnosis_text)

    # Validate and normalize the diagnosis
    normalized = {
        'root_cause': diagnosis.get('root_cause', 'Unknown'),
        'severity': normalize_severity(diagnosis.get('severity', 'medium')),
        'recommended_action': normalize_action(diagnosis.get('recommended_action', 'ALERT_ONLY')),
        'fix_instructions': diagnosis.get('fix_instructions', {}),
        'confidence_score': float(diagnosis.get('confidence_score', 0.5))
    }

    # Apply safety guardrails
    normalized = apply_safety_guardrails(normalized, event.get('qualityResults', {}))

    return normalized


def normalize_severity(severity):
    """Normalize severity to standard values."""
    severity_map = {
        'critical': 'critical',
        'high': 'high',
        'medium': 'medium',
        'low': 'low',
        'crit': 'critical',
        'hi': 'high',
        'med': 'medium',
        'lo': 'low'
    }
    return severity_map.get(severity.lower(), 'medium')


def normalize_action(action):
    """Normalize action to allowed values."""
    action_map = {
        'auto_fix': 'AUTO_FIX',
        'autofix': 'AUTO_FIX',
        'fix': 'AUTO_FIX',
        'quarantine': 'QUARANTINE',
        'isolate': 'QUARANTINE',
        'rollback': 'ROLLBACK',
        'revert': 'ROLLBACK',
        'alert': 'ALERT_ONLY',
        'alert_only': 'ALERT_ONLY',
        'notify': 'ALERT_ONLY'
    }
    return action_map.get(action.lower().replace(' ', '_'), 'ALERT_ONLY')


def apply_safety_guardrails(diagnosis, quality_results):
    """
    Apply safety guardrails to prevent dangerous auto-remediation.
    """
    # Never auto-fix critical severity issues
    if diagnosis['severity'] == 'critical':
        diagnosis['recommended_action'] = 'ALERT_ONLY'
        diagnosis['guardrail_applied'] = 'Critical issues require human review'

    # Require high confidence for auto-fix
    if diagnosis['recommended_action'] == 'AUTO_FIX' and diagnosis['confidence_score'] < 0.8:
        diagnosis['recommended_action'] = 'ALERT_ONLY'
        diagnosis['guardrail_applied'] = 'Confidence too low for auto-fix'

    # Quarantine when a large share of rules failed (>10% of rules)
    failed_rules = quality_results.get('failed_rules', 0)
    total_rules = quality_results.get('total_rules', 1)
    if failed_rules / total_rules > 0.1:
        diagnosis['recommended_action'] = 'QUARANTINE'
        diagnosis['guardrail_applied'] = 'Too many rules failed - quarantining data'

    return diagnosis


def parse_unstructured_response(text):
    """Fallback parser for unstructured Bedrock responses."""
    diagnosis = {
        'root_cause': 'Unable to parse',
        'severity': 'medium',
        'recommended_action': 'ALERT_ONLY',
        'fix_instructions': {},
        'confidence_score': 0.3
    }

    # Simple keyword extraction
    text_lower = text.lower()

    if 'null' in text_lower or 'missing' in text_lower:
        diagnosis['root_cause'] = 'Missing or null values detected'
    elif 'negative' in text_lower:
        diagnosis['root_cause'] = 'Negative values in non-negative field'
    elif 'duplicate' in text_lower:
        diagnosis['root_cause'] = 'Duplicate records detected'
    elif 'schema' in text_lower or 'type' in text_lower:
        diagnosis['root_cause'] = 'Schema or data type mismatch'

    return diagnosis
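The JSON-extraction step is easy to exercise in isolation. A standalone check against a typical Claude-style reply (the sample text is illustrative):

```python
import json
import re

sample_reply = (
    "Based on the metrics, here is my assessment:\n"
    '{"root_cause": "Negative values in order_total", "severity": "high", '
    '"recommended_action": "AUTO_FIX", "fix_instructions": {"type": "remove_negatives", '
    '"column": "order_total", "action": "remove"}, "confidence_score": 0.9}\n'
    "Let me know if you need more detail."
)

# Same greedy pattern as the Lambda: first "{" through the last "}"
match = re.search(r'\{[\s\S]*\}', sample_reply)
diagnosis = json.loads(match.group())
print(diagnosis["recommended_action"])  # AUTO_FIX
```

The greedy match works here because the model's JSON object contains the last `}` in the reply; a malformed reply falls through to the keyword-based fallback parser above.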

Auto-Remediation Lambda Function

# lambda/execute_auto_remediation.py
import boto3
import json
from datetime import datetime

s3_client = boto3.client('s3')
glue_client = boto3.client('glue')

def lambda_handler(event, context):
    """
    Execute auto-remediation based on Bedrock's diagnosis.
    """
    diagnosis = event.get('parsedDiagnosis', {})
    quality_results = event.get('qualityResults', {})
    input_path = event.get('inputPath', '')

    fix_instructions = diagnosis.get('fix_instructions', {})
    fix_type = fix_instructions.get('type', 'unknown')

    # Route to appropriate fix handler
    fix_handlers = {
        'fill_nulls': handle_fill_nulls,
        'remove_negatives': handle_remove_negatives,
        'deduplicate': handle_deduplicate,
        'type_cast': handle_type_cast,
        'filter_outliers': handle_filter_outliers,
        'impute_values': handle_impute_values
    }

    handler = fix_handlers.get(fix_type, handle_unknown_fix)
    result = handler(input_path, fix_instructions, quality_results)

    # Log the remediation attempt
    log_remediation_attempt(event, result)

    return result


def handle_fill_nulls(input_path, instructions, quality_results):
    """Fill null values with specified strategy."""
    column = instructions.get('column')
    strategy = instructions.get('strategy', 'mean')  # mean, median, mode, constant
    constant_value = instructions.get('constant_value')

    # Generate PySpark code for Glue job
    spark_code = f"""
from pyspark.sql.functions import mean, col, when, lit
from pyspark.sql import functions as F

df = spark.read.parquet("{input_path}")

if "{strategy}" == "mean":
    fill_value = df.select(mean(col("{column}"))).collect()[0][0]
elif "{strategy}" == "median":
    fill_value = df.approxQuantile("{column}", [0.5], 0.01)[0]
elif "{strategy}" == "mode":
    fill_value = df.groupBy("{column}").count().orderBy("count", ascending=False).first()[0]
else:
    fill_value = {constant_value}

df_fixed = df.withColumn("{column}", 
    when(col("{column}").isNull(), lit(fill_value)).otherwise(col("{column}"))
)

output_path = "{input_path.replace('/raw/', '/remediated/')}"
df_fixed.write.mode("overwrite").parquet(output_path)
"""

    # Start Glue job with generated code
    job_run = start_remediation_glue_job(spark_code, 'fill_nulls')

    return {
        'success': True,
        'fix_type': 'fill_nulls',
        'fixedDataPath': input_path.replace('/raw/', '/remediated/'),
        'jobRunId': job_run['JobRunId'],
        'details': f"Filled nulls in {column} using {strategy} strategy"
    }


def handle_remove_negatives(input_path, instructions, quality_results):
    """Remove or fix negative values in specified column."""
    column = instructions.get('column')
    action = instructions.get('action', 'remove')  # remove, absolute, zero

    spark_code = f"""
from pyspark.sql.functions import col, abs as spark_abs, when, lit

df = spark.read.parquet("{input_path}")

if "{action}" == "remove":
    df_fixed = df.filter(col("{column}") >= 0)
elif "{action}" == "absolute":
    df_fixed = df.withColumn("{column}", spark_abs(col("{column}")))
else:  # zero
    df_fixed = df.withColumn("{column}", 
        when(col("{column}") < 0, lit(0)).otherwise(col("{column}"))
    )

output_path = "{input_path.replace('/raw/', '/remediated/')}"
df_fixed.write.mode("overwrite").parquet(output_path)
"""

    job_run = start_remediation_glue_job(spark_code, 'remove_negatives')

    return {
        'success': True,
        'fix_type': 'remove_negatives',
        'fixedDataPath': input_path.replace('/raw/', '/remediated/'),
        'jobRunId': job_run['JobRunId'],
        'details': f"Applied {action} to negative values in {column}"
    }


def handle_deduplicate(input_path, instructions, quality_results):
    """Remove duplicate records."""
    key_columns = instructions.get('key_columns', [])
    keep = instructions.get('keep', 'first')  # first, last

    key_cols_str = ', '.join([f'"{c}"' for c in key_columns])

    spark_code = f"""
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

df = spark.read.parquet("{input_path}")

key_columns = [{key_cols_str}]

if key_columns:
    window = Window.partitionBy(key_columns).orderBy(key_columns)
    df_with_row = df.withColumn("_row_num", row_number().over(window))

    if "{keep}" == "first":
        df_fixed = df_with_row.filter(col("_row_num") == 1).drop("_row_num")
    else:
        max_row = df_with_row.groupBy(key_columns).max("_row_num")
        df_fixed = df_with_row.join(max_row, key_columns).filter(
            col("_row_num") == col("max(_row_num)")
        ).drop("_row_num", "max(_row_num)")
else:
    df_fixed = df.dropDuplicates()

output_path = "{input_path.replace('/raw/', '/remediated/')}"
df_fixed.write.mode("overwrite").parquet(output_path)
"""

    job_run = start_remediation_glue_job(spark_code, 'deduplicate')

    return {
        'success': True,
        'fix_type': 'deduplicate',
        'fixedDataPath': input_path.replace('/raw/', '/remediated/'),
        'jobRunId': job_run['JobRunId'],
        'details': f"Removed duplicates keeping {keep} occurrence"
    }


def handle_filter_outliers(input_path, instructions, quality_results):
    """Filter statistical outliers using IQR or Z-score method."""
    column = instructions.get('column')
    method = instructions.get('method', 'iqr')  # iqr, zscore
    threshold = instructions.get('threshold', 1.5 if method == 'iqr' else 3)

    spark_code = f"""
from pyspark.sql.functions import col, mean, stddev

df = spark.read.parquet("{input_path}")

if "{method}" == "iqr":
    quantiles = df.approxQuantile("{column}", [0.25, 0.75], 0.01)
    q1, q3 = quantiles[0], quantiles[1]
    iqr = q3 - q1
    lower_bound = q1 - {threshold} * iqr
    upper_bound = q3 + {threshold} * iqr
else:  # zscore
    stats = df.select(mean("{column}"), stddev("{column}")).collect()[0]
    avg, std = stats[0], stats[1]
    lower_bound = avg - {threshold} * std
    upper_bound = avg + {threshold} * std

df_fixed = df.filter(
    (col("{column}") >= lower_bound) & (col("{column}") <= upper_bound)
)

output_path = "{input_path.replace('/raw/', '/remediated/')}"
df_fixed.write.mode("overwrite").parquet(output_path)
"""

    job_run = start_remediation_glue_job(spark_code, 'filter_outliers')

    return {
        'success': True,
        'fix_type': 'filter_outliers',
        'fixedDataPath': input_path.replace('/raw/', '/remediated/'),
        'jobRunId': job_run['JobRunId'],
        'details': f"Filtered outliers in {column} using {method} method"
    }


def handle_impute_values(input_path, instructions, quality_results):
    """Impute missing values using ML-based approach."""
    column = instructions.get('column')
    feature_columns = instructions.get('feature_columns', [])

    # Use simple interpolation if no feature columns specified
    spark_code = f"""
from pyspark.sql.functions import col, last, when
from pyspark.sql.window import Window

df = spark.read.parquet("{input_path}")

# Forward fill then backward fill
window_forward = Window.orderBy("order_date").rowsBetween(Window.unboundedPreceding, 0)
window_backward = Window.orderBy("order_date").rowsBetween(0, Window.unboundedFollowing)

df_fixed = df.withColumn("{column}_ffill", last("{column}", ignorenulls=True).over(window_forward))
df_fixed = df_fixed.withColumn("{column}", 
    when(col("{column}").isNull(), col("{column}_ffill")).otherwise(col("{column}"))
)
df_fixed = df_fixed.drop("{column}_ffill")

output_path = "{input_path.replace('/raw/', '/remediated/')}"
df_fixed.write.mode("overwrite").parquet(output_path)
"""

    job_run = start_remediation_glue_job(spark_code, 'impute_values')

    return {
        'success': True,
        'fix_type': 'impute_values',
        'fixedDataPath': input_path.replace('/raw/', '/remediated/'),
        'jobRunId': job_run['JobRunId'],
        'details': f"Imputed missing values in {column}"
    }


def handle_unknown_fix(input_path, instructions, quality_results):
    """Handle unknown fix types by quarantining data."""
    return {
        'success': False,
        'fix_type': 'unknown',
        'error': 'Unknown fix type - data quarantined for manual review',
        'fixedDataPath': None
    }


def start_remediation_glue_job(spark_code, fix_type):
    """Upload the generated remediation script and start a runner Glue job."""
    # In production, you'd use a parameterized Glue job. Here we upload the
    # generated script to S3 and pass its location to a generic runner job
    # ('data-remediation-job') that is assumed to download and execute it.
    # Note: '--extra-py-files' only adds libraries to the path; it does not
    # execute them, so the script location goes in as a custom job argument.

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    script_key = f"remediation-scripts/{fix_type}_{timestamp}.py"
    script_path = f"s3://your-bucket/{script_key}"

    # Upload script to S3
    s3_client.put_object(
        Bucket='your-bucket',
        Key=script_key,
        Body=spark_code.encode('utf-8')
    )

    # Start the runner job with the script location as an argument
    response = glue_client.start_job_run(
        JobName='data-remediation-job',
        Arguments={
            '--remediation_script_path': script_path
        }
    )

    return response


def log_remediation_attempt(event, result):
    """Log remediation attempt to CloudWatch."""
    print(json.dumps({
        'event': 'remediation_attempt',
        'timestamp': datetime.now().isoformat(),
        'diagnosis': event.get('parsedDiagnosis', {}),
        'result': result
    }))

Step 4: Learning from Incidents with DynamoDB

The system stores every incident and its outcome, enabling pattern recognition and improved future diagnoses.

DynamoDB Table Schema

# infrastructure/dynamodb_table.py
import boto3

dynamodb = boto3.client('dynamodb')

# Create incidents table
dynamodb.create_table(
    TableName='pipeline-incidents',
    KeySchema=[
        {'AttributeName': 'incident_id', 'KeyType': 'HASH'},
        {'AttributeName': 'timestamp', 'KeyType': 'RANGE'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'incident_id', 'AttributeType': 'S'},
        {'AttributeName': 'timestamp', 'AttributeType': 'S'},
        {'AttributeName': 'anomaly_type', 'AttributeType': 'S'}
    ],
    GlobalSecondaryIndexes=[
        {
            'IndexName': 'anomaly-type-index',
            'KeySchema': [
                {'AttributeName': 'anomaly_type', 'KeyType': 'HASH'},
                {'AttributeName': 'timestamp', 'KeyType': 'RANGE'}
            ],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {
                'ReadCapacityUnits': 5,
                'WriteCapacityUnits': 5
            }
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 10,
        'WriteCapacityUnits': 10
    }
)

Fetching Historical Incidents for Bedrock Context

# lambda/get_historical_incidents.py
import boto3
from datetime import datetime, timedelta
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('pipeline-incidents')

def lambda_handler(event, context):
    """
    Fetch relevant historical incidents to provide context for Bedrock diagnosis.
    """
    quality_results = event.get('qualityResults', {})
    anomaly_details = quality_results.get('anomaly_details', [])

    # Get incidents from last 30 days
    thirty_days_ago = (datetime.now() - timedelta(days=30)).isoformat()

    historical_incidents = []

    # Query by anomaly type for relevant history
    for anomaly in anomaly_details:
        rule = anomaly.get('rule', '')

        # Determine anomaly type from rule
        anomaly_type = categorize_anomaly(rule)

        response = table.query(
            IndexName='anomaly-type-index',
            KeyConditionExpression=Key('anomaly_type').eq(anomaly_type) & 
                                   Key('timestamp').gte(thirty_days_ago),
            Limit=10,
            ScanIndexForward=False  # Most recent first
        )

        historical_incidents.extend(response.get('Items', []))

    # Deduplicate and summarize
    unique_incidents = {i['incident_id']: i for i in historical_incidents}

    # Calculate success rates for different remediation actions
    action_stats = calculate_action_stats(list(unique_incidents.values()))

    return {
        'recent_incidents': list(unique_incidents.values())[:20],
        'action_success_rates': action_stats,
        'total_incidents_30d': len(unique_incidents)
    }


def categorize_anomaly(rule):
    """Categorize anomaly based on rule name."""
    rule_lower = rule.lower()

    if 'null' in rule_lower or 'complete' in rule_lower:
        return 'missing_values'
    elif 'unique' in rule_lower or 'duplicate' in rule_lower:
        return 'duplicates'
    elif 'range' in rule_lower or 'threshold' in rule_lower:
        return 'out_of_range'
    elif 'type' in rule_lower:
        return 'type_mismatch'
    elif 'fresh' in rule_lower:
        return 'stale_data'
    else:
        return 'unknown'


def calculate_action_stats(incidents):
    """Calculate success rates for different remediation actions."""
    stats = {}

    for incident in incidents:
        action = incident.get('action_taken', 'UNKNOWN')
        outcome = incident.get('outcome', 'UNKNOWN')

        if action not in stats:
            stats[action] = {'total': 0, 'success': 0}

        stats[action]['total'] += 1
        if outcome == 'SUCCESS':
            stats[action]['success'] += 1

    # Calculate percentages
    for action in stats:
        total = stats[action]['total']
        success = stats[action]['success']
        stats[action]['success_rate'] = success / total if total > 0 else 0

    return stats
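As a quick sanity check, here is how `calculate_action_stats` behaves on a handful of incident records shaped like the DynamoDB items above (a compact, standalone copy of the helper, for illustration only):

```python
def calculate_action_stats(incidents):
    """Aggregate remediation outcomes into per-action success rates."""
    stats = {}
    for incident in incidents:
        action = incident.get('action_taken', 'UNKNOWN')
        entry = stats.setdefault(action, {'total': 0, 'success': 0})
        entry['total'] += 1
        if incident.get('outcome') == 'SUCCESS':
            entry['success'] += 1
    for entry in stats.values():
        entry['success_rate'] = entry['success'] / entry['total']
    return stats

# Sample incidents as they would come back from the incidents table
incidents = [
    {'action_taken': 'AUTO_FIX', 'outcome': 'SUCCESS'},
    {'action_taken': 'AUTO_FIX', 'outcome': 'FAILED'},
    {'action_taken': 'QUARANTINE', 'outcome': 'SUCCESS'},
]
stats = calculate_action_stats(incidents)
print(stats['AUTO_FIX']['success_rate'])    # 0.5
print(stats['QUARANTINE']['success_rate'])  # 1.0
```

These per-action rates are exactly what gets injected into the Bedrock prompt later, so the model's recommendations are grounded in what has actually worked before.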

Step 5: Infrastructure as Code (CDK)

Let's tie everything together with AWS CDK for reproducible deployment.

// lib/self-healing-pipeline-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import { Construct } from 'constructs';

export class SelfHealingPipelineStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // S3 Buckets
    const dataBucket = new s3.Bucket(this, 'DataBucket', {
      bucketName: `self-healing-pipeline-data-${this.account}`,
      versioned: true,
      eventBridgeEnabled: true, // required so the "Object Created" EventBridge rule below fires
      lifecycleRules: [
        {
          id: 'archive-old-data',
          transitions: [
            {
              storageClass: s3.StorageClass.INTELLIGENT_TIERING,
              transitionAfter: cdk.Duration.days(30),
            },
          ],
        },
      ],
    });

    // DynamoDB Table for Incidents
    const incidentsTable = new dynamodb.Table(this, 'IncidentsTable', {
      tableName: 'pipeline-incidents',
      partitionKey: { name: 'incident_id', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'timestamp', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      pointInTimeRecovery: true,
    });

    incidentsTable.addGlobalSecondaryIndex({
      indexName: 'anomaly-type-index',
      partitionKey: { name: 'anomaly_type', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'timestamp', type: dynamodb.AttributeType.STRING },
    });

    // SNS Topics for Alerts
    const alertTopic = new sns.Topic(this, 'AlertTopic', {
      topicName: 'data-quality-alerts',
    });

    const criticalAlertTopic = new sns.Topic(this, 'CriticalAlertTopic', {
      topicName: 'data-quality-critical',
    });

    // IAM Role for Glue
    const glueRole = new iam.Role(this, 'GlueRole', {
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
      ],
    });

    dataBucket.grantReadWrite(glueRole);

    // Glue Database
    const glueDatabase = new glue.CfnDatabase(this, 'GlueDatabase', {
      catalogId: this.account,
      databaseInput: {
        name: 'ecommerce_db',
        description: 'E-commerce data lake database',
      },
    });

    // Glue ETL Job with Data Quality
    const etlJob = new glue.CfnJob(this, 'ETLJob', {
      name: 'orders-etl-with-quality',
      role: glueRole.roleArn,
      command: {
        name: 'glueetl',
        pythonVersion: '3',
        scriptLocation: `s3://${dataBucket.bucketName}/scripts/glue_etl_with_quality.py`,
      },
      defaultArguments: {
        '--enable-metrics': 'true',
        '--enable-continuous-cloudwatch-log': 'true',
        '--enable-glue-datacatalog': 'true',
        '--job-language': 'python',
        '--TempDir': `s3://${dataBucket.bucketName}/temp/`,
      },
      glueVersion: '4.0',
      numberOfWorkers: 2,
      workerType: 'G.1X',
    });

    // Lambda Functions
    const lambdaRole = new iam.Role(this, 'LambdaRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole'),
      ],
    });

    dataBucket.grantReadWrite(lambdaRole);
    incidentsTable.grantReadWriteData(lambdaRole);

    // Add Bedrock permissions
    lambdaRole.addToPolicy(new iam.PolicyStatement({
      actions: ['bedrock:InvokeModel'],
      resources: ['*'],
    }));

    const getQualityResultsLambda = new lambda.Function(this, 'GetQualityResults', {
      functionName: 'get-quality-results',
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'index.lambda_handler',
      code: lambda.Code.fromAsset('lambda/get_quality_results'),
      role: lambdaRole,
      timeout: cdk.Duration.minutes(1),
      environment: {
        DATA_BUCKET: dataBucket.bucketName,
      },
    });

    const parseDiagnosisLambda = new lambda.Function(this, 'ParseDiagnosis', {
      functionName: 'parse-bedrock-diagnosis',
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'index.lambda_handler',
      code: lambda.Code.fromAsset('lambda/parse_diagnosis'),
      role: lambdaRole,
      timeout: cdk.Duration.minutes(1),
    });

    const autoRemediationLambda = new lambda.Function(this, 'AutoRemediation', {
      functionName: 'execute-auto-remediation',
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'index.lambda_handler',
      code: lambda.Code.fromAsset('lambda/auto_remediation'),
      role: lambdaRole,
      timeout: cdk.Duration.minutes(5),
      environment: {
        DATA_BUCKET: dataBucket.bucketName,
        GLUE_JOB_NAME: 'data-remediation-job',
      },
    });

    const historicalIncidentsLambda = new lambda.Function(this, 'HistoricalIncidents', {
      functionName: 'get-historical-incidents',
      runtime: lambda.Runtime.PYTHON_3_11,
      handler: 'index.lambda_handler',
      code: lambda.Code.fromAsset('lambda/historical_incidents'),
      role: lambdaRole,
      timeout: cdk.Duration.minutes(1),
      environment: {
        INCIDENTS_TABLE: incidentsTable.tableName,
      },
    });

    // Step Functions State Machine
    const stateMachineRole = new iam.Role(this, 'StateMachineRole', {
      assumedBy: new iam.ServicePrincipal('states.amazonaws.com'),
    });

    // Grant necessary permissions
    glueRole.grantPassRole(stateMachineRole);
    lambdaRole.grantPassRole(stateMachineRole);
    alertTopic.grantPublish(stateMachineRole);
    criticalAlertTopic.grantPublish(stateMachineRole);
    incidentsTable.grantReadWriteData(stateMachineRole);

    stateMachineRole.addToPolicy(new iam.PolicyStatement({
      actions: ['glue:StartJobRun', 'glue:GetJobRun'],
      resources: ['*'],
    }));

    stateMachineRole.addToPolicy(new iam.PolicyStatement({
      actions: ['bedrock:InvokeModel'],
      resources: ['*'],
    }));

    stateMachineRole.addToPolicy(new iam.PolicyStatement({
      actions: ['lambda:InvokeFunction'],
      resources: [
        getQualityResultsLambda.functionArn,
        parseDiagnosisLambda.functionArn,
        autoRemediationLambda.functionArn,
        historicalIncidentsLambda.functionArn,
      ],
    }));

    // State Machine Definition (simplified for CDK)
    const stateMachine = new sfn.StateMachine(this, 'SelfHealingPipeline', {
      stateMachineName: 'self-healing-data-pipeline',
      definitionBody: sfn.DefinitionBody.fromFile('stepfunctions/definition.asl.json'),
      role: stateMachineRole,
      tracingEnabled: true,
      logs: {
        destination: new cdk.aws_logs.LogGroup(this, 'StateMachineLogs', {
          logGroupName: '/aws/stepfunctions/self-healing-pipeline',
          retention: cdk.aws_logs.RetentionDays.ONE_MONTH,
        }),
        level: sfn.LogLevel.ALL,
      },
    });

    // EventBridge Rule to trigger on new data
    const rule = new events.Rule(this, 'NewDataRule', {
      ruleName: 'trigger-self-healing-pipeline',
      eventPattern: {
        source: ['aws.s3'],
        detailType: ['Object Created'],
        detail: {
          bucket: {
            name: [dataBucket.bucketName],
          },
          object: {
            key: [{ prefix: 'raw/' }],
          },
        },
      },
    });

    rule.addTarget(new targets.SfnStateMachine(stateMachine, {
      input: events.RuleTargetInput.fromObject({
        inputPath: events.EventField.fromPath('$.detail.object.key'),
        outputPath: `s3://${dataBucket.bucketName}/processed/`,
        executionId: events.EventField.fromPath('$.id'),
      }),
    }));

    // Outputs
    new cdk.CfnOutput(this, 'DataBucketName', {
      value: dataBucket.bucketName,
      description: 'Data bucket for the pipeline',
    });

    new cdk.CfnOutput(this, 'StateMachineArn', {
      value: stateMachine.stateMachineArn,
      description: 'Self-healing pipeline state machine ARN',
    });
  }
}

Real-World Results

After deploying this architecture for an e-commerce client processing 50M+ orders monthly, we observed:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Mean Time to Detect (MTTD) | 4.2 hours | 12 minutes | 95% faster |
| Mean Time to Resolve (MTTR) | 2.1 hours | 8 minutes | 94% faster |
| Data Quality Incidents/Month | 47 | 47 | Same (expected) |
| Incidents Requiring Manual Intervention | 47 (100%) | 7 (15%) | 85% reduction |
| Silent Data Corruption Events | 12 | 0 | 100% elimination |

The system successfully auto-remediated issues including:

  • Null value imputation (34% of incidents)
  • Outlier filtering (28% of incidents)
  • Duplicate removal (18% of incidents)
  • Type casting fixes (5% of incidents)

Cost Analysis

Monthly costs for processing 50M records:

| Service | Monthly Cost |
| --- | --- |
| AWS Glue (ETL + DQ) | $180 |
| Step Functions | $25 |
| Lambda | $15 |
| Bedrock (Claude Sonnet) | $45 |
| DynamoDB | $10 |
| S3 | $50 |
| **Total** | **$325** |

Compare this to the cost of a single data quality incident reaching production (estimated at $15,000-50,000 in engineering time, business impact, and customer trust), and the ROI becomes clear.
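The back-of-envelope math, using the figures above and the conservative low end of the incident-cost estimate (both assumptions, not measured values):

```python
monthly_cost = 325          # total pipeline running cost from the table above
incident_cost = 15_000      # low end of the estimated cost of one production incident
prevented_per_month = 12    # silent corruption events eliminated (see results table)

# Net monthly savings if the pipeline prevents those incidents
savings = prevented_per_month * incident_cost - monthly_cost
print(savings)  # 179675
```

Even at one prevented incident per month, the pipeline pays for itself roughly 46 times over.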

Best Practices and Lessons Learned

1. Start with Conservative Guardrails

Don't let AI auto-fix everything on day one. Start with:

  • High confidence threshold (0.9+)
  • Only allow fixes for well-understood anomaly types
  • Always quarantine critical severity issues
# Conservative initial configuration
GUARDRAILS = {
    'min_confidence_for_autofix': 0.9,
    'allowed_autofix_types': ['fill_nulls', 'remove_negatives'],
    'max_records_affected_percent': 0.05,
    'require_human_approval_for': ['schema_changes', 'bulk_deletes']
}
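One way to enforce these guardrails is a simple gate in front of the remediation Lambda. This is a hypothetical helper (`allow_autofix` and the `diagnosis` field names are illustrative, not part of the stack above) showing the fail-closed pattern: every check must pass or the incident escalates to a human.

```python
GUARDRAILS = {
    'min_confidence_for_autofix': 0.9,
    'allowed_autofix_types': ['fill_nulls', 'remove_negatives'],
    'max_records_affected_percent': 0.05,
    'require_human_approval_for': ['schema_changes', 'bulk_deletes'],
}

def allow_autofix(diagnosis, guardrails=GUARDRAILS):
    """Return True only when every guardrail passes; anything else escalates."""
    if diagnosis['fix_type'] in guardrails['require_human_approval_for']:
        return False  # certain fix types always need a human, regardless of confidence
    if diagnosis['fix_type'] not in guardrails['allowed_autofix_types']:
        return False  # unknown fix types are never auto-applied
    if diagnosis['confidence'] < guardrails['min_confidence_for_autofix']:
        return False
    if diagnosis['affected_percent'] > guardrails['max_records_affected_percent']:
        return False
    return True

print(allow_autofix({'fix_type': 'fill_nulls', 'confidence': 0.95, 'affected_percent': 0.02}))   # True
print(allow_autofix({'fix_type': 'bulk_deletes', 'confidence': 0.99, 'affected_percent': 0.01}))  # False
```

Note the ordering: the human-approval list is checked first, so even a high-confidence diagnosis can never auto-apply a restricted fix type.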

2. Build Feedback Loops

Track remediation outcomes and feed them back to Bedrock:

# Include success rates in Bedrock prompt
prompt = f"""
Historical remediation success rates:
- AUTO_FIX: {stats['AUTO_FIX']['success_rate']:.0%} success
- QUARANTINE: {stats['QUARANTINE']['success_rate']:.0%} success

Given these success rates, recommend the best action...
"""

3. Implement Circuit Breakers

If auto-remediation fails repeatedly, stop trying:

def check_circuit_breaker(anomaly_type):
    # get_recent_failures (not shown) queries the incidents table for failed
    # remediation attempts of this anomaly type within the time window
    recent_failures = get_recent_failures(anomaly_type, hours=24)
    if len(recent_failures) >= 3:
        return {
            'circuit_open': True,
            'reason': f'3+ failures in 24h for {anomaly_type}',
            'recommended_action': 'ALERT_ONLY'
        }
    return {'circuit_open': False}

4. Version Your Remediation Logic

Treat remediation code like production code:

from datetime import datetime

REMEDIATION_VERSION = "1.2.0"

def log_remediation(event, result):
    return {
        **result,
        'remediation_version': REMEDIATION_VERSION,
        'timestamp': datetime.now().isoformat()
    }

Conclusion

Self-healing data pipelines aren't science fiction—they're achievable today with AWS services. By combining Glue Data Quality's ML-powered anomaly detection with Bedrock's reasoning capabilities and Step Functions' orchestration, you can build pipelines that:

  1. Detect issues before they corrupt downstream systems
  2. Diagnose root causes using AI
  3. Remediate automatically with appropriate guardrails
  4. Learn from past incidents to improve over time

The key is starting small, implementing strong guardrails, and gradually expanding the system's autonomy as you build confidence in its decisions.



Have questions or want to share your self-healing pipeline experiences? Connect with me on LinkedIn or leave a comment below!


Tags: aws data-engineering machine-learning glue bedrock step-functions data-quality serverless ai automation
