<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cloudev</title>
    <description>The latest articles on DEV Community by Cloudev (@copubah).</description>
    <link>https://dev.to/copubah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2950687%2Fd93a0c8d-a947-4a77-89cb-76a4a5a08573.jpeg</url>
      <title>DEV Community: Cloudev</title>
      <link>https://dev.to/copubah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/copubah"/>
    <language>en</language>
    <item>
      <title>A Self-Healing AWS ECS Monitoring System with Slack Alerts Using Terraform</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Fri, 20 Mar 2026 14:58:23 +0000</pubDate>
      <link>https://dev.to/copubah/a-self-healing-aws-ecs-monitoring-system-with-slack-alerts-using-terraform-2i0h</link>
      <guid>https://dev.to/copubah/a-self-healing-aws-ecs-monitoring-system-with-slack-alerts-using-terraform-2i0h</guid>
      <description>&lt;p&gt;Modern cloud applications need more than monitoring they need self-healing infrastructure. Waiting for humans to react to failures increases downtime and risks user impact. In this guide, I’ll show you how to build a system that automatically detects ECS service failures, notifies your team on Slack, and restores the service all using Terraform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Project Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In containerized environments, services can fail due to application crashes, resource exhaustion, or deployment issues. Traditional monitoring tools detect failures, but manual intervention is slow.&lt;/p&gt;

&lt;p&gt;A self-healing system solves this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detecting failures automatically&lt;/li&gt;
&lt;li&gt;Restarting services without human intervention&lt;/li&gt;
&lt;li&gt;Sending alerts to teams in real-time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how the system works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ECS service health degrades (task crashes, reduced running count)&lt;/li&gt;
&lt;li&gt;CloudWatch monitors ECS metrics and triggers an alarm when RunningTaskCount &amp;lt; desired count&lt;/li&gt;
&lt;li&gt;EventBridge captures the alarm state change&lt;/li&gt;
&lt;li&gt;Lambda executes: it sends a Slack alert and restarts the ECS service&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates a closed-loop, event-driven system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Services Used&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Amazon ECS (Fargate) – Hosts containerized apps&lt;/li&gt;
&lt;li&gt;CloudWatch – Monitors service health&lt;/li&gt;
&lt;li&gt;EventBridge – Captures CloudWatch alarms and triggers Lambda&lt;/li&gt;
&lt;li&gt;Lambda – Executes remediation logic and sends Slack notifications&lt;/li&gt;
&lt;li&gt;Slack Webhook – Sends alerts to your team&lt;/li&gt;
&lt;/ol&gt;
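&lt;p&gt;The detection side of this pipeline can be sketched in Terraform. This is an illustrative sketch only: the resource names, the Container Insights metric namespace, the threshold wiring, and the referenced variables and Lambda are assumptions, not the repo's exact code.&lt;/p&gt;

```hcl
# Sketch of the detection side: alarm on low running-task count, then an
# EventBridge rule that forwards the ALARM transition to the Lambda.
# Names, variables, and aws_lambda_function.remediate are assumptions.
resource "aws_cloudwatch_metric_alarm" "ecs_tasks" {
  alarm_name          = "ecs-running-tasks-low"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = var.desired_count

  dimensions = {
    ClusterName = var.cluster_name
    ServiceName = var.service_name
  }
}

resource "aws_cloudwatch_event_rule" "alarm_change" {
  name = "ecs-alarm-state-change"
  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    detail = {
      alarmName = [aws_cloudwatch_metric_alarm.ecs_tasks.alarm_name]
      state     = { value = ["ALARM"] }
    }
  })
}

resource "aws_cloudwatch_event_target" "to_lambda" {
  rule = aws_cloudwatch_event_rule.alarm_change.name
  arn  = aws_lambda_function.remediate.arn
}
```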

&lt;p&gt;&lt;strong&gt;Terraform Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built the infrastructure using Terraform for repeatable, version-controlled deployment. Key points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modular structure (ecs, lambda, cloudwatch, eventbridge, iam, ssm)&lt;/li&gt;
&lt;li&gt;Slack webhook stored securely in SSM Parameter Store&lt;/li&gt;
&lt;li&gt;Lambda reads the webhook at runtime and sends formatted alerts&lt;/li&gt;
&lt;/ol&gt;
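&lt;p&gt;The remediation Lambda itself stays small. Here is a sketch in Python; the environment variable names (SLACK_WEBHOOK_PARAM, CLUSTER_NAME, SERVICE_NAME) are assumptions that the Terraform module would set, not the repo's actual code.&lt;/p&gt;

```python
import json
import os
import urllib.request


def build_alert(service):
    # Slack incoming-webhook payloads are just {"text": "..."}
    return {"text": f"ECS service '{service}' is degraded; forcing a new deployment."}


def handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime

    ssm = boto3.client("ssm")
    ecs = boto3.client("ecs")

    # Read the Slack webhook URL from SSM Parameter Store at runtime
    webhook = ssm.get_parameter(
        Name=os.environ["SLACK_WEBHOOK_PARAM"], WithDecryption=True
    )["Parameter"]["Value"]

    # Notify the team first...
    payload = json.dumps(build_alert(os.environ["SERVICE_NAME"])).encode()
    req = urllib.request.Request(
        webhook, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

    # ...then self-heal by forcing ECS to replace the service's tasks
    ecs.update_service(
        cluster=os.environ["CLUSTER_NAME"],
        service=os.environ["SERVICE_NAME"],
        forceNewDeployment=True,
    )
```

&lt;p&gt;Forcing a new deployment is the least invasive remediation: ECS drains the unhealthy tasks and schedules fresh ones without changing the task definition.&lt;/p&gt;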

&lt;p&gt;This project shows how to turn ECS monitoring into a self-healing system. By combining AWS services and Slack integration, you can detect failures, alert your team, and restore services automatically, reducing downtime and improving reliability.&lt;br&gt;
GitHub repo: &lt;a href="https://github.com/Copubah/AWS-ecs-monitoring-and-auto-remediation" rel="noopener noreferrer"&gt;https://github.com/Copubah/AWS-ecs-monitoring-and-auto-remediation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>ecs</category>
      <category>aws</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Automating AWS Cost Monitoring with Terraform, Lambda, and Slack</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Wed, 18 Mar 2026 12:42:50 +0000</pubDate>
      <link>https://dev.to/copubah/automating-aws-cost-monitoring-with-terraform-lambda-and-slack-3h2o</link>
      <guid>https://dev.to/copubah/automating-aws-cost-monitoring-with-terraform-lambda-and-slack-3h2o</guid>
      <description>&lt;p&gt;Managing cloud costs can quickly become challenging, especially when resources scale dynamically. Instead of manually checking the AWS console, I built an automated system that sends daily cost summaries directly to Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS bills can grow unexpectedly without real-time visibility. Logging into the console daily is inefficient and easy to forget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built a serverless cost monitoring system using Terraform and AWS services that automatically sends cost updates to Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system follows a simple event-driven design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EventBridge triggers a Lambda function daily&lt;/li&gt;
&lt;li&gt;Lambda queries AWS Cost Explorer API&lt;/li&gt;
&lt;li&gt;The cost data is formatted into a readable summary&lt;/li&gt;
&lt;li&gt;A Slack webhook sends the message to a channel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools Used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform (latest version)&lt;/li&gt;
&lt;li&gt;AWS Lambda (Python)&lt;/li&gt;
&lt;li&gt;EventBridge&lt;/li&gt;
&lt;li&gt;IAM (least privilege)&lt;/li&gt;
&lt;li&gt;AWS Cost Explorer API&lt;/li&gt;
&lt;li&gt;Slack Incoming Webhooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every day, the Lambda function runs and retrieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total AWS spend for the day&lt;/li&gt;
&lt;li&gt;Breakdown of costs by service&lt;/li&gt;
&lt;li&gt;Optional alerts if spending exceeds thresholds&lt;/li&gt;
&lt;/ul&gt;
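&lt;p&gt;The Lambda's core logic can be sketched in Python with boto3. This is a sketch under assumptions: the SLACK_WEBHOOK_URL environment variable name and the message layout are mine, modeled on the example output below, not the repo's exact code.&lt;/p&gt;

```python
import datetime as dt
import json
import os
import urllib.request


def format_summary(date, total, by_service):
    # Build the Slack message body, listing the top three services by spend
    lines = [f"AWS Cost Summary - {date}", f"Total Spend Today: ${total:.2f}", "Top Services:"]
    for name, cost in sorted(by_service.items(), key=lambda kv: kv[1], reverse=True)[:3]:
        lines.append(f"- {name}: ${cost:.2f}")
    return "\n".join(lines)


def handler(event, context):
    import boto3  # the Cost Explorer API endpoint lives in us-east-1

    ce = boto3.client("ce", region_name="us-east-1")
    today = dt.date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": str(today - dt.timedelta(days=1)), "End": str(today)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    groups = resp["ResultsByTime"][0]["Groups"]
    by_service = {g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"]) for g in groups}
    text = format_summary(str(today), sum(by_service.values()), by_service)

    # SLACK_WEBHOOK_URL is an assumed environment variable name
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```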

&lt;p&gt;&lt;strong&gt;It then sends a message like this to Slack&lt;/strong&gt;:&lt;br&gt;
AWS Cost Summary – 18 Mar 2026&lt;br&gt;
Total Spend Today: $12.34&lt;br&gt;
Top Services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2: $6.50&lt;/li&gt;
&lt;li&gt;S3: $3.20&lt;/li&gt;
&lt;li&gt;Lambda: $2.64&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated cost visibility&lt;/li&gt;
&lt;li&gt;Early detection of unexpected spikes&lt;/li&gt;
&lt;li&gt;Low cost to run (within free tier for most users)&lt;/li&gt;
&lt;li&gt;Fully serverless and scalable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub Repository&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/Copubah/aws-cost-reporter" rel="noopener noreferrer"&gt;https://github.com/Copubah/aws-cost-reporter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>terraform</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Deploying a Flask Application to Kubernetes Using Minikube</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sat, 07 Mar 2026 15:54:50 +0000</pubDate>
      <link>https://dev.to/copubah/deploying-a-flask-application-to-kubernetes-using-minikube-25k7</link>
      <guid>https://dev.to/copubah/deploying-a-flask-application-to-kubernetes-using-minikube-25k7</guid>
      <description>&lt;p&gt;I have recently been spending time tinkering with Kubernetes and exploring how container orchestration works in practice. To get hands on experience, I set up a local cluster using Minikube and started experimenting with deploying simple applications.&lt;br&gt;
&lt;strong&gt;Project Overview&lt;/strong&gt;&lt;br&gt;
The goal of this project is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a simple Flask web application&lt;/li&gt;
&lt;li&gt;Package it in a container using Docker&lt;/li&gt;
&lt;li&gt;Deploy it to a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Expose the application so it can be accessed from a browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup runs locally using Kubernetes through Minikube.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deployment follows a simple Kubernetes architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User → Kubernetes Service → Deployment → Pods → Flask Container&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment manages the application pods&lt;/li&gt;
&lt;li&gt;Pods run the containerized Flask application&lt;/li&gt;
&lt;li&gt;Service exposes the application to the outside network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository contains the following files:&lt;/p&gt;

&lt;p&gt;kubernetes-flask-app&lt;br&gt;
│&lt;br&gt;
├── app.py&lt;br&gt;
├── requirements.txt&lt;br&gt;
├── Dockerfile&lt;br&gt;
├── deployment.yaml&lt;br&gt;
├── service.yaml&lt;br&gt;
└── README.md&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;app.py&lt;/strong&gt; – A simple Flask application that returns a message when accessed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;requirements.txt&lt;/strong&gt; – Contains the Python dependencies required to run the application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt; – Defines how the Flask application is packaged into a container image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;deployment.yaml&lt;/strong&gt; – Defines the Kubernetes deployment and manages the application pods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;service.yaml&lt;/strong&gt; – Creates a Kubernetes service to expose the application.&lt;/li&gt;
&lt;/ul&gt;
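&lt;p&gt;The two manifests can be sketched as follows. This is a hedged sketch: the replica count, labels, and port are my assumptions, though the image name and service name match the commands used later in the post.&lt;/p&gt;

```yaml
# deployment.yaml (sketch) - replicas, labels, and port are assumptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
    spec:
      containers:
        - name: flask-app
          image: flask-k8s-app     # built below inside Minikube's Docker daemon
          imagePullPolicy: Never   # use the locally built image, never pull
          ports:
            - containerPort: 5000
---
# service.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: flask-service             # matches "minikube service flask-service" below
spec:
  type: NodePort
  selector:
    app: flask-app
  ports:
    - port: 5000
      targetPort: 5000
```

&lt;p&gt;Setting imagePullPolicy to Never matters with Minikube: the image only exists in the cluster's local Docker daemon, so a pull attempt would fail.&lt;/p&gt;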

&lt;p&gt;&lt;strong&gt;Running the Project&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Start Minikube&lt;/strong&gt;&lt;br&gt;
Start your local Kubernetes cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minikube start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verify the cluster is running.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl get nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Build the Docker Image&lt;/strong&gt;&lt;br&gt;
Configure your environment to use the Docker daemon inside Minikube.&lt;/p&gt;

&lt;p&gt;eval $(minikube docker-env)&lt;/p&gt;

&lt;p&gt;Build the container image.&lt;/p&gt;

&lt;p&gt;docker build -t flask-k8s-app .&lt;br&gt;
&lt;strong&gt;3. Deploy the Application&lt;/strong&gt;&lt;br&gt;
Create the deployment in Kubernetes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl apply -f deployment.yaml&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check the running pods.&lt;/p&gt;

&lt;p&gt;kubectl get pods&lt;br&gt;
&lt;strong&gt;4. Expose the Application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a Kubernetes service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl apply -f service.yaml&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To access the application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minikube service flask-service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your browser will open the application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Project Demonstrates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project helps demonstrate core DevOps and cloud concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerization using Docker&lt;/li&gt;
&lt;li&gt;Container orchestration using Kubernetes&lt;/li&gt;
&lt;li&gt;Local Kubernetes development using Minikube&lt;/li&gt;
&lt;li&gt;Deployments and services in Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can find the full project source code on GitHub:&lt;br&gt;
&lt;a href="https://github.com/Copubah/kubernetes-flask-app" rel="noopener noreferrer"&gt;https://github.com/Copubah/kubernetes-flask-app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>python</category>
      <category>flask</category>
      <category>docker</category>
    </item>
    <item>
      <title>Chaos by Design: Production Maintenance Drills on Kubernetes</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Thu, 26 Feb 2026 17:37:13 +0000</pubDate>
      <link>https://dev.to/copubah/chaos-by-designproduction-maintenance-drills-on-kubernetes-3df</link>
      <guid>https://dev.to/copubah/chaos-by-designproduction-maintenance-drills-on-kubernetes-3df</guid>
      <description>&lt;p&gt;There's an old SRE adage: "Hope is not a strategy." Yet most engineering teams only discover how their systems fail under pressure when that pressure is real, unplanned, and 2 AM on a Saturday. Production outages are expensive teachers.&lt;/p&gt;

&lt;p&gt;The alternative is to make failure boring — to rehearse it so often that when it actually happens, your team moves through the recovery playbook on autopilot. That's the idea behind prod-maintenance-drills: a self-hosted Kubernetes environment where you deliberately break things to learn how to fix them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Drills Matter&lt;/strong&gt;&lt;br&gt;
Chaos engineering, popularized by Netflix's Chaos Monkey, is the discipline of intentionally introducing failures into a system to build confidence in its ability to withstand turbulent, unexpected conditions. But you don't need a Netflix-scale infrastructure to benefit from it.&lt;/p&gt;

&lt;p&gt;Even on a local Kubernetes cluster with a handful of pods, running structured drills teaches you things you can't learn from diagrams or documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How fast does your deployment actually recover after a pod crash?&lt;/li&gt;
&lt;li&gt;Does your application handle a database restart gracefully, or does it need a manual restart too?&lt;/li&gt;
&lt;li&gt;At what CPU threshold does your HPA kick in — and does it kick in fast enough?&lt;/li&gt;
&lt;li&gt;When disk fills up, do your alerts fire before the application starts failing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running drills answers these questions with evidence, not assumptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The FastAPI application exposes two endpoints&lt;/strong&gt;: / for a health check and /db to verify database connectivity. These simple endpoints become the canary in the coal mine — you watch them during drills to confirm the system has recovered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The monitoring stack runs kube-prometheus-stack&lt;/strong&gt; via Helm, giving you Prometheus scraping, Grafana dashboards, and Alertmanager rules all preconfigured. You get real-time visibility into pod restarts, CPU usage, and database status without having to wire anything up manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Five Drills&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;app_crash.sh&lt;/strong&gt;&lt;br&gt;
Deletes a running pod to test Kubernetes self-healing and deployment recovery time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;db_failure.sh&lt;/strong&gt;&lt;br&gt;
Kills the PostgreSQL pod to validate application reconnection behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;high_load.sh&lt;/strong&gt;&lt;br&gt;
Schedules a CPU stress job to trigger horizontal pod autoscaling at a 60% threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;backup.sh&lt;/strong&gt;&lt;br&gt;
Creates a timestamped PostgreSQL dump to practice backup and restore procedures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;disk_fill.sh&lt;/strong&gt;&lt;br&gt;
Simulates disk exhaustion on a node and verifies that monitoring alerts fire correctly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setting It Up&lt;/strong&gt;&lt;br&gt;
Prerequisites&lt;br&gt;
You'll need: kind or minikube, kubectl, docker, and helm. That's it: no cloud account required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Spin up the cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;# Create a local kind cluster&lt;br&gt;
kind create cluster --name simple-k8s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install the monitoring stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;helm repo add prometheus-community \&lt;br&gt;
&lt;a href="https://prometheus-community.github.io/helm-charts" rel="noopener noreferrer"&gt;https://prometheus-community.github.io/helm-charts&lt;/a&gt;&lt;br&gt;
helm repo update&lt;/p&gt;

&lt;p&gt;helm install monitoring \&lt;br&gt;
  prometheus-community/kube-prometheus-stack \&lt;br&gt;
  --namespace monitoring \&lt;br&gt;
  --create-namespace&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;br&gt;
For kind clusters, you'll also need to install metrics-server and add --kubelet-insecure-tls to its deployment args, otherwise HPA won't be able to read resource metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Build and load the application image&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;# Build the FastAPI image&lt;br&gt;
cd app &amp;amp;&amp;amp; docker build -t prod-app:latest . &amp;amp;&amp;amp; cd ..&lt;/p&gt;

&lt;p&gt;# Load it into kind (not needed for cloud clusters)&lt;br&gt;
kind load docker-image prod-app:latest --name simple-k8s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Deploy everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;kubectl apply -f k8s/&lt;/p&gt;

&lt;p&gt;# Verify everything came up&lt;br&gt;
kubectl get pods -n prod&lt;br&gt;
kubectl get pods -n monitoring&lt;br&gt;
kubectl get hpa -n prod&lt;/p&gt;

&lt;p&gt;Once the pods are running, grab the node IP and hit your endpoints:&lt;/p&gt;

&lt;p&gt;NODE_IP=$(kubectl get nodes -o jsonpath=\&lt;br&gt;
  '{.items[0].status.addresses[?(@.type=="InternalIP")].address}')&lt;/p&gt;

&lt;p&gt;curl http://$NODE_IP:30007/&lt;br&gt;
# → {"status":"running"}&lt;/p&gt;

&lt;p&gt;curl http://$NODE_IP:30007/db&lt;br&gt;
# → {"db":"connected"}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running Your First Drill&lt;/strong&gt;&lt;br&gt;
Let's walk through the pod crash drill end to end, because it's the cleanest example of what these drills teach you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open two terminal windows. In the first, start watching your pods&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;watch kubectl get pods -n prod&lt;/p&gt;

&lt;p&gt;In the second, run the drill:&lt;/p&gt;

&lt;p&gt;./app_crash.sh&lt;/p&gt;

&lt;p&gt;You'll see one pod disappear from the watch window. Within seconds (typically under 30), Kubernetes will have scheduled a replacement. That's the Deployment controller doing its job.&lt;/p&gt;

&lt;p&gt;Now do it again. And again. Notice the restart count climb in Prometheus. Notice the brief dip in the up{namespace="prod"} metric. This is what your monitoring dashboards look like during an incident. Seeing it in a drill is far less stressful than seeing it at 2 AM for the first time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus Queries for Drills&lt;/strong&gt;&lt;br&gt;
Monitor these metrics live while running drills:&lt;/p&gt;

&lt;p&gt;sum(kube_pod_container_status_restarts_total{namespace="prod"}) by (pod)&lt;br&gt;
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[1m])) by (pod)&lt;br&gt;
up{namespace="prod"}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The HPA Drill — Watching Your System Scale&lt;/strong&gt;&lt;br&gt;
The high_load.sh drill is especially satisfying because you get to watch autoscaling happen in real time. The HPA is configured with a minimum of 2 replicas, a maximum of 5, and a target CPU utilization of 60%.&lt;/p&gt;

&lt;p&gt;# In one terminal: watch the HPA&lt;br&gt;
watch kubectl get hpa -n prod&lt;/p&gt;

&lt;p&gt;# In another: trigger the load&lt;br&gt;
./high_load.sh&lt;/p&gt;

&lt;p&gt;You'll see the TARGETS column climb past 60%, and within a minute or two the REPLICAS column will tick up from 2. The load job eventually completes, and the HPA scales back down after the cooldown period.&lt;/p&gt;

&lt;p&gt;This drill builds intuition for how long scaling takes end-to-end: from metric collection, to HPA decision, to pod scheduling, to readiness. That latency matters when you're sizing your HPA thresholds for real production traffic spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Database Failure Drill — Resilience by Default?&lt;/strong&gt;&lt;br&gt;
This one is the most revealing. When PostgreSQL restarts, does your FastAPI application reconnect automatically, or does it need a restart too?&lt;/p&gt;

&lt;p&gt;./db_failure.sh&lt;/p&gt;

&lt;p&gt;While the PostgreSQL pod is down, hitting /db should return a connectivity error. When it comes back, your application should reconnect on the next request — if your database connection pool is configured to retry.&lt;/p&gt;

&lt;p&gt;If it doesn't reconnect automatically, you've just discovered a resilience gap before it cost you. Fix it: use a connection pool with reconnect logic, add retry wrappers around database calls, or configure proper liveness/readiness probes that cycle the app pod when the DB is unreachable.&lt;/p&gt;
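&lt;p&gt;A retry wrapper like the one suggested above can be very small. A minimal sketch in Python; the function name and backoff parameters are mine, not the repo's:&lt;/p&gt;

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the attempt budget is spent
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

&lt;p&gt;Wrapping each query call in such a helper keeps a transient PostgreSQL restart from surfacing as a user-facing error, at the cost of a little added latency during recovery.&lt;/p&gt;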

&lt;p&gt;&lt;strong&gt;Connection Resilience Checklist&lt;/strong&gt;&lt;br&gt;
After the DB drill, verify: (1) App eventually reconnects without manual intervention. (2) Readiness probe correctly marks the pod as not-ready while DB is down. (3) Prometheus alert fires within your SLO window. (4) The alert resolves automatically after recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrating With Your Workflow&lt;/strong&gt;&lt;br&gt;
Drills are most valuable when they're scheduled, not spontaneous. A few patterns that work well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly game days&lt;/strong&gt;: Block 30 minutes every week for one drill, rotating through the five scenarios. Document observations and improvements in a shared runbook.&lt;br&gt;
&lt;strong&gt;Pre-release validation&lt;/strong&gt;: Run the full suite before any major deployment. If your new release doesn't survive a pod crash drill, it's not ready.&lt;br&gt;
&lt;strong&gt;Onboarding tool&lt;/strong&gt;: New engineers run the drills in their first week. There's no better way to learn a system than to watch it fail and recover.&lt;br&gt;
&lt;strong&gt;CI gate&lt;/strong&gt;: In staging, run app_crash.sh as part of your pipeline and fail the build if recovery takes longer than your SLO allows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Next&lt;/strong&gt;&lt;br&gt;
The current drill set covers the most common failure modes. Here are some directions to extend it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network partition drill&lt;/strong&gt;: Use a network policy to block traffic between the app and the database for a set duration, simulating a network split.&lt;br&gt;
&lt;strong&gt;Memory pressure&lt;/strong&gt;: A complement to the CPU drill: fill pod memory to trigger OOM kills and test restart behavior.&lt;br&gt;
&lt;strong&gt;Rolling update with failure injection&lt;/strong&gt;: Trigger a deployment rollout while simultaneously running the crash drill to validate zero-downtime deploys.&lt;br&gt;
&lt;strong&gt;Restore drill&lt;/strong&gt;: Pair backup.sh with a corresponding restore.sh that brings the database back from a backup and validates data integrity.&lt;br&gt;
&lt;strong&gt;Multi-node scenarios&lt;/strong&gt;: With a multi-node kind cluster, add a node drain drill to practice pod eviction and rescheduling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Copubah/prod-maintenance-drills" rel="noopener noreferrer"&gt;https://github.com/Copubah/prod-maintenance-drills&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>grafana</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>Building a Simple Cloud Security Automation Tool in Rust</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sun, 25 Jan 2026 19:44:00 +0000</pubDate>
      <link>https://dev.to/copubah/building-a-simple-cloud-security-automation-tool-in-rust-316b</link>
      <guid>https://dev.to/copubah/building-a-simple-cloud-security-automation-tool-in-rust-316b</guid>
      <description>&lt;p&gt;Cloud security is no longer just about dashboards and manual reviews. Modern security teams rely heavily on automation to detect and respond to misconfigurations in real time.&lt;/p&gt;

&lt;p&gt;In this article, I will show how I built a simple Cloud Security Posture Management (CSPM) tool using Rust and the AWS SDK. The goal is to demonstrate how Rust can be used for real-world cloud security automation, not just systems programming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Rust for Cloud Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most cloud security automation is written in Python or Go. Rust is less common, but it has some serious advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory safety by default&lt;/li&gt;
&lt;li&gt;High performance for log processing and scanning&lt;/li&gt;
&lt;li&gt;Single static binaries for agents and tools&lt;/li&gt;
&lt;li&gt;Strong type system for building reliable security systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust is especially useful when building security tooling that needs to be fast, stable, and safe to run in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Overview: CloudGuard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The project is a simple Rust CLI tool called CloudGuard.&lt;/p&gt;

&lt;p&gt;It performs two basic but very realistic security checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect public S3 buckets&lt;/li&gt;
&lt;li&gt;Detect EC2 security groups open to the world on sensitive ports&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is essentially a mini CSPM tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Tool Does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudGuard scans an AWS account and prints a security report showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any S3 buckets with public access&lt;/li&gt;
&lt;li&gt;Any security groups with 0.0.0.0/0 on:
&lt;ul&gt;
&lt;li&gt;Port 22 (SSH)&lt;/li&gt;
&lt;li&gt;Port 3389 (RDP)&lt;/li&gt;
&lt;li&gt;Port 3306 (MySQL)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are some of the most common real-world cloud misconfigurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The architecture is very simple:&lt;/p&gt;

&lt;p&gt;Rust CLI&lt;br&gt;
→ AWS SDK for Rust&lt;br&gt;
→ AWS APIs (S3, EC2)&lt;/p&gt;

&lt;p&gt;There is no agent and no infrastructure required. It runs using normal AWS credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The project is split into small modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;main.rs: entry point&lt;/li&gt;
&lt;li&gt;s3_scan.rs: S3 public access checks&lt;/li&gt;
&lt;li&gt;sg_scan.rs: security group checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the code clean and easy to extend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Up the Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cargo new cloud-guard&lt;/li&gt;
&lt;li&gt;cd cloud-guard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Add dependencies to Cargo.toml&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;[dependencies]&lt;br&gt;
aws-config = "1"&lt;br&gt;
aws-sdk-s3 = "1"&lt;br&gt;
aws-sdk-ec2 = "1"&lt;br&gt;
tokio = { version = "1", features = ["full"] }&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure AWS credentials&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;aws configure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Scanning for Public S3 Buckets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tool lists all buckets and checks their ACLs for public access.&lt;/p&gt;

&lt;p&gt;The logic is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call ListBuckets&lt;/li&gt;
&lt;li&gt;For each bucket, call GetBucketAcl&lt;/li&gt;
&lt;li&gt;If the grantee contains AllUsers, the bucket is public&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mirrors how real CSPM tools work internally.&lt;/p&gt;
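&lt;p&gt;CloudGuard itself is written in Rust, but the check is SDK-agnostic. Here is the same logic sketched in Python with boto3 purely for illustration; it is not the repo's code.&lt;/p&gt;

```python
# Python/boto3 sketch of the S3 check described above (CloudGuard is Rust)
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"


def is_public(grants):
    # A bucket ACL is public when any grant targets the AllUsers group
    return any(g.get("Grantee", {}).get("URI") == ALL_USERS for g in grants)


def scan_buckets():
    import boto3

    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        acl = s3.get_bucket_acl(Bucket=bucket["Name"])
        if is_public(acl["Grants"]):
            print(f"Public bucket found: {bucket['Name']}")
```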

&lt;p&gt;&lt;strong&gt;Example: Scanning Open Security Groups&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The security group scan works like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call DescribeSecurityGroups&lt;/li&gt;
&lt;li&gt;Loop through inbound rules&lt;/li&gt;
&lt;li&gt;If CIDR is 0.0.0.0/0 and port is sensitive, flag it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the same logic used in enterprise security tools.&lt;/p&gt;
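&lt;p&gt;Again sketched in Python with boto3 for illustration (the repo's implementation is Rust); the port table matches the list above.&lt;/p&gt;

```python
# Python/boto3 sketch of the security group check (CloudGuard is Rust)
SENSITIVE_PORTS = {22: "SSH", 3389: "RDP", 3306: "MySQL"}


def open_sensitive_rules(permissions):
    """Return (from_port, to_port) ranges open to 0.0.0.0/0 on sensitive ports."""
    flagged = []
    for perm in permissions:
        open_to_world = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        lo, hi = perm.get("FromPort"), perm.get("ToPort")
        if open_to_world and lo is not None and any(p in range(lo, hi + 1) for p in SENSITIVE_PORTS):
            flagged.append((lo, hi))
    return flagged


def scan_security_groups():
    import boto3

    ec2 = boto3.client("ec2")
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for lo, hi in open_sensitive_rules(sg["IpPermissions"]):
            print(f"Open SG: {sg['GroupName']} on ports {lo}-{hi}")
```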

&lt;p&gt;&lt;strong&gt;Running the Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run it locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cargo run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You will get output like&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;=== S3 Public Bucket Scan ===&lt;br&gt;
Public bucket found: test-assets-bucket&lt;/p&gt;

&lt;p&gt;=== Security Group Scan ===&lt;br&gt;
Open SG: web-sg on ports 22-22&lt;/p&gt;

&lt;p&gt;That is already a working cloud security scanner.&lt;br&gt;
GitHub repo: &lt;a href="https://github.com/Copubah/aws-cloudguard" rel="noopener noreferrer"&gt;https://github.com/Copubah/aws-cloudguard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>Kubernetes Essentials</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sat, 03 Jan 2026 12:46:48 +0000</pubDate>
      <link>https://dev.to/copubah/kubernetes-essentials-4j74</link>
      <guid>https://dev.to/copubah/kubernetes-essentials-4j74</guid>
      <description>&lt;p&gt;Kubernetes is the go-to platform for managing containerized applications at scale. Here’s a concise guide to the basics every developer or SysOps engineer should know&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Definition: Smallest deployable unit in Kubernetes; can run one or more containers.&lt;/p&gt;

&lt;p&gt;Commands:&lt;/p&gt;

&lt;p&gt;kubectl get pods                      # List all pods&lt;br&gt;
kubectl describe pod &amp;lt;pod-name&amp;gt;   # Detailed info&lt;br&gt;
kubectl logs &amp;lt;pod-name&amp;gt;           # View container logs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Ensure your application runs with the desired number of replicas. Handles updates and rollbacks automatically.&lt;/p&gt;

&lt;p&gt;Commands:&lt;/p&gt;

&lt;p&gt;kubectl create deployment &amp;lt;name&amp;gt; --image=&amp;lt;image&amp;gt;&lt;br&gt;
kubectl get deployments&lt;br&gt;
kubectl scale deployment &amp;lt;name&amp;gt; --replicas=N&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Expose pods to internal or external traffic.&lt;br&gt;
Types:&lt;/p&gt;

&lt;p&gt;ClusterIP – internal only (default)&lt;/p&gt;

&lt;p&gt;NodePort – accessible via node IP&lt;/p&gt;

&lt;p&gt;LoadBalancer – external access via cloud LB&lt;/p&gt;

&lt;p&gt;Commands:&lt;/p&gt;

&lt;p&gt;kubectl expose deployment &amp;lt;name&amp;gt; --type=NodePort --port=80&lt;br&gt;
kubectl get svc&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Common Commands&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;kubectl get all – List all resources in the cluster&lt;/p&gt;

&lt;p&gt;kubectl delete pod &amp;lt;pod-name&amp;gt; – Remove a pod&lt;/p&gt;

&lt;p&gt;kubectl apply -f &amp;lt;file&amp;gt; – Apply configuration files&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; – Check events, errors, or misconfigurations&lt;/p&gt;

&lt;p&gt;kubectl logs &amp;lt;pod-name&amp;gt; – Inspect application logs&lt;/p&gt;

&lt;p&gt;kubectl get nodes – Check node health and availability&lt;/p&gt;

&lt;p&gt;Tip: Always start by inspecting pods and their logs when troubleshooting, then check deployments and services. Kubernetes is powerful, but clear visibility into resources makes management easier.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>cloudnative</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploying a Highly Available AWS Architecture with Terraform</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Thu, 01 Jan 2026 12:40:49 +0000</pubDate>
      <link>https://dev.to/copubah/deploying-a-highly-available-aws-architecture-with-terraform-39pb</link>
      <guid>https://dev.to/copubah/deploying-a-highly-available-aws-architecture-with-terraform-39pb</guid>
      <description>&lt;p&gt;High availability is one of those concepts everyone mentions, but far fewer people actually implement correctly. In many demo projects, availability stops at launching an EC2 instance and exposing it to the internet. That works until something breaks, and in production something always breaks.&lt;/p&gt;

&lt;p&gt;In this post, I walk through a Terraform project where the primary goal is resilience. The architecture is designed to keep serving traffic even when individual instances or an entire Availability Zone fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full source code is available here&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://github.com/Copubah/Terraform-AWS-Multi-AZ-Highly-Available-Architecture" rel="noopener noreferrer"&gt;https://github.com/Copubah/Terraform-AWS-Multi-AZ-Highly-Available-Architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I Built This Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted a project that answers practical questions instead of just showing that something is running.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens if an EC2 instance crashes?&lt;/li&gt;
&lt;li&gt;What happens if an Availability Zone becomes unavailable?&lt;/li&gt;
&lt;li&gt;How fast can the environment be rebuilt from scratch?&lt;/li&gt;
&lt;li&gt;How cleanly is the infrastructure defined and reused?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project focuses on those questions using Terraform as the single source of truth for infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Level Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At a high level, the architecture follows a common and proven AWS pattern.&lt;br&gt;
User traffic enters through an Application Load Balancer deployed in public subnets. The load balancer distributes traffic to application instances running in private subnets across multiple Availability Zones. An Auto Scaling Group ensures that capacity is always maintained. Supporting networking components like NAT Gateways and route tables ensure instances can communicate outbound without being publicly exposed.&lt;/p&gt;

&lt;p&gt;Every major component is spread across at least two Availability Zones to eliminate single points of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core AWS Components Used&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC&lt;/strong&gt;&lt;br&gt;
A custom VPC provides full control over networking. DNS support and hostnames are enabled to support internal service discovery and load balancing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subnets&lt;/strong&gt;&lt;br&gt;
Public subnets host the Application Load Balancer and NAT Gateways. Private subnets host the EC2 instances. Subnets are evenly distributed across Availability Zones to ensure redundancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application Load Balancer&lt;/strong&gt;&lt;br&gt;
The ALB acts as the entry point to the system. It performs health checks on backend instances and only routes traffic to healthy targets. If an instance fails health checks, it is automatically removed from rotation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto Scaling Group&lt;/strong&gt;&lt;br&gt;
The Auto Scaling Group maintains a minimum number of EC2 instances across multiple Availability Zones. If an instance terminates or becomes unhealthy, Auto Scaling replaces it automatically. This is one of the key pieces that enables self healing behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Groups&lt;/strong&gt;&lt;br&gt;
Security groups are tightly scoped. The load balancer allows inbound HTTP traffic from the internet. Application instances only allow inbound traffic from the load balancer. This reduces the attack surface and follows least privilege principles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform and Modularity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the main goals of this project was clean Terraform structure.&lt;/p&gt;

&lt;p&gt;Instead of placing everything in a single main.tf file, the infrastructure is broken into reusable modules. Each module is responsible for a single concern such as networking, load balancing, or compute. This mirrors how Terraform is used in real teams and makes the code easier to reason about.&lt;/p&gt;

&lt;p&gt;Each module contains its own variables and outputs, which keeps dependencies explicit and avoids hidden coupling. The root module simply wires everything together.&lt;/p&gt;

&lt;p&gt;This modular approach also makes future expansion straightforward. Adding a database layer or extending to multi region deployments would not require restructuring the existing code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Scenarios and How the Architecture Responds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instance failure&lt;/strong&gt;&lt;br&gt;
If an EC2 instance crashes or is terminated, the load balancer stops sending traffic to it. The Auto Scaling Group detects the capacity drop and launches a replacement instance automatically.&lt;/p&gt;
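
&lt;p&gt;This replacement loop is easy to verify with a small chaos-test script. The sketch below is illustrative and not part of the repo: it assumes boto3 credentials and an Auto Scaling Group name (app-asg is a placeholder), terminates one instance, and polls until the group is back at its desired capacity. pick_victim is kept pure so it can be tested without AWS.&lt;/p&gt;

```python
import random
import time

def pick_victim(instances):
    """Return one InService instance id to terminate, or None if none exist."""
    healthy = [i["InstanceId"] for i in instances if i["LifecycleState"] == "InService"]
    return random.choice(healthy) if healthy else None

def chaos_test(asg_name="app-asg"):
    import boto3  # imported here so the module loads without AWS deps
    asg = boto3.client("autoscaling")
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
    victim = pick_victim(group["Instances"])
    if victim is None:
        return
    # Keep desired capacity unchanged so the ASG is forced to launch a replacement
    asg.terminate_instance_in_auto_scaling_group(
        InstanceId=victim, ShouldDecrementDesiredCapacity=False)
    while True:
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        in_service = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]
        if len(in_service) >= group["DesiredCapacity"]:
            break
        time.sleep(15)
```

&lt;p&gt;Run something like this against a non-production environment first; ShouldDecrementDesiredCapacity=False is what forces the group to launch a replacement.&lt;/p&gt;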

&lt;p&gt;&lt;strong&gt;Availability Zone failure&lt;/strong&gt;&lt;br&gt;
If an entire Availability Zone becomes unavailable, the load balancer routes traffic only to healthy instances in the remaining zones. Auto Scaling launches new instances in available zones to maintain capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic spikes&lt;/strong&gt;&lt;br&gt;
Auto Scaling policies can be added to scale out based on load. The architecture already supports horizontal scaling without any redesign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure rebuild&lt;/strong&gt;&lt;br&gt;
Because everything is defined in Terraform, the entire environment can be destroyed and recreated consistently. This is critical for disaster recovery and reproducibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Makes a Strong Portfolio Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project focuses on reliability rather than visual complexity. It demonstrates understanding of core AWS concepts such as Availability Zones, load balancing, self healing infrastructure, and infrastructure as code.&lt;/p&gt;

&lt;p&gt;It also shows discipline in Terraform usage through modular design, clear variable definitions, and reproducibility. These are the qualities teams look for when reviewing real world infrastructure code.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>automation</category>
      <category>hcl</category>
    </item>
    <item>
      <title>Hands-On Journey Experimenting with Kubernetes: FastAPI + React Deployment</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Tue, 23 Dec 2025 17:03:12 +0000</pubDate>
      <link>https://dev.to/copubah/hands-on-journey-experimenting-with-kubernetes-fastapi-react-deployment-1p7m</link>
      <guid>https://dev.to/copubah/hands-on-journey-experimenting-with-kubernetes-fastapi-react-deployment-1p7m</guid>
      <description>&lt;p&gt;Over the past few weeks, I’ve been diving deep into Kubernetes, exploring how to deploy and manage containerized applications. To really understand the mechanics, I decided to create a small but complete full-stack project: a FastAPI backend with a React frontend running on a local Kubernetes cluster using Kind (Kubernetes in Docker).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I Started Experimenting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes is powerful but complex, and reading documentation can only take you so far. I wanted to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See how Deployments and Services interact in real-time&lt;/li&gt;
&lt;li&gt;Understand how frontend and backend communicate inside a cluster&lt;/li&gt;
&lt;li&gt;Experiment with replicas, scaling, and networking without affecting production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project became my sandbox for testing Kubernetes concepts and workflows hands-on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How I Set Up the Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I structured the project into three main areas:&lt;/p&gt;

&lt;p&gt;1. Backend (FastAPI)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs on Python 3.11, serving API requests on port 80&lt;/li&gt;
&lt;li&gt;Exposed internally with a ClusterIP service&lt;/li&gt;
&lt;li&gt;I’ve experimented with creating endpoints and testing them using pytest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Frontend (React + Nginx)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React app is built statically and served by Nginx&lt;/li&gt;
&lt;li&gt;Exposed externally with a LoadBalancer service&lt;/li&gt;
&lt;li&gt;Configured SPA routing and CORS headers to communicate seamlessly with the backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. Kubernetes Manifests&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate YAML files for backend and frontend&lt;/li&gt;
&lt;li&gt;Each deployment has 2 replicas&lt;/li&gt;
&lt;li&gt;Services use Kubernetes DNS for internal pod communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My Hands-On Experiments&lt;/strong&gt;&lt;br&gt;
Here’s what I’ve been tinkering with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a Kind Cluster: Spinning up a lightweight Kubernetes environment locally&lt;/li&gt;
&lt;li&gt;Loading Docker Images: Pre-loading backend and frontend images into the cluster with kind load docker-image&lt;/li&gt;
&lt;li&gt;Deploying and Updating: Iteratively modifying endpoints, rebuilding images, and redeploying&lt;/li&gt;
&lt;li&gt;Service Discovery: Using Kubernetes DNS for frontend-backend communication instead of hardcoded IPs&lt;/li&gt;
&lt;li&gt;Debugging: Inspecting logs, port-forwarding services, and fixing issues like ImagePullBackOff and misconfigured CORS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Through this, I’ve gained a deeper understanding of&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicas and pod distribution&lt;/li&gt;
&lt;li&gt;ClusterIP vs LoadBalancer services&lt;/li&gt;
&lt;li&gt;Internal pod networking&lt;/li&gt;
&lt;li&gt;How environment variables manage API connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing &amp;amp; Local Development&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tested with pytest and FastAPI TestClient&lt;/li&gt;
&lt;li&gt;GET / returns a 200 status code with JSON&lt;/li&gt;
&lt;li&gt;Tests run locally with make test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Points to backend API via REACT_APP_API_URL&lt;/li&gt;
&lt;li&gt;Can be tested locally before deploying to Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Experimenting with Kubernetes is all about deploying, breaking, fixing, and iterating. Creating a small project like this is the fastest way to understand how everything fits together.&lt;br&gt;
Check out the repo: &lt;a href="https://github.com/Copubah/simple-k8s-project" rel="noopener noreferrer"&gt;https://github.com/Copubah/simple-k8s-project&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>fastapi</category>
      <category>react</category>
      <category>docker</category>
    </item>
    <item>
      <title>CI/CD for Dummies</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sat, 20 Dec 2025 16:50:17 +0000</pubDate>
      <link>https://dev.to/copubah/cicd-for-dummies-29n0</link>
      <guid>https://dev.to/copubah/cicd-for-dummies-29n0</guid>
<description>&lt;p&gt;Continuous Integration and Continuous Deployment (or Delivery), better known as CI/CD, might sound like a fancy buzzword. But at its core, it’s just automation that helps make sure your code is tested, validated, and safely shipped every time you make changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is CI/CD?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CI/CD is a workflow that automatically takes your code from development all the way through testing and ready-to-deploy stages without manual steps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Continuous Integration (CI) means your code gets automatically tested and merged early and often. Every push triggers automated builds and tests so bugs don’t sneak in. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous Delivery (CD) means your code is always in a deployable state. Once tests pass, deployments can be triggered with a click. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous Deployment goes one step further — if all checks pass, your code goes straight to production without human intervention. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In traditional development, teams write code, then hand it over to QA and operations with lots of manual steps. CI/CD flips that script by automating the whole lifecycle so you can ship faster and with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why CI/CD Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s why engineers love CI/CD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster feedback: tests run automatically on every commit, so you catch problems early.&lt;/li&gt;
&lt;li&gt;Reliable deployments: automation means fewer mistakes compared to manual deploys.&lt;/li&gt;
&lt;li&gt;Better collaboration: developers integrate changes more frequently with less conflict.&lt;/li&gt;
&lt;li&gt;Confidence in releases: because tests run every time, you know the code is solid before deploying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Simple CI/CD Example with GitHub Actions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make CI/CD tangible, I built a basic Node.js project that demonstrates a live pipeline using GitHub Actions. The idea is simple: every time you push code, GitHub tests it automatically.&lt;/p&gt;

&lt;p&gt;Here’s what happens under the hood:&lt;/p&gt;

&lt;p&gt;1. You push your code to GitHub.&lt;br&gt;
2. GitHub Actions sees the change and triggers the pipeline.&lt;br&gt;
3. It sets up a fresh environment, installs dependencies, and runs your tests.&lt;br&gt;
4. You get feedback right inside GitHub on whether everything passed or failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your repo structure looks like this&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;CI-CD-for-Dummies/&lt;br&gt;
├── app.js&lt;br&gt;&lt;br&gt;
├── test.js&lt;br&gt;&lt;br&gt;
├── package.json&lt;br&gt;&lt;br&gt;
├── .github/&lt;br&gt;
│   └── workflows/&lt;br&gt;
│       └── ci.yml&lt;br&gt;&lt;br&gt;
└── README.md&lt;/p&gt;

&lt;p&gt;This is all you need to set up a basic pipeline; no extra servers or tools required.&lt;/p&gt;
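
&lt;p&gt;For reference, a minimal ci.yml for a pipeline like this could look as follows (a sketch; the Node version and action versions are assumptions, not the repo’s actual workflow):&lt;/p&gt;

```yaml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: npm test
```

&lt;p&gt;Every push or pull request spins up a fresh Ubuntu runner, installs dependencies, and runs the test suite.&lt;/p&gt;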

&lt;p&gt;&lt;strong&gt;Try it Yourself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to experience CI/CD first-hand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clone the project&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/Copubah/CI-CD-for-Dummies.git" rel="noopener noreferrer"&gt;https://github.com/Copubah/CI-CD-for-Dummies.git&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cd CI-CD-for-Dummies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm install&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run tests locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commit and push changes back to GitHub and watch your CI/CD pipeline run on the Actions tab.&lt;br&gt;
Watching tests run automatically on every push is a little like magic when you’re just getting started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping Up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CI/CD takes away repetitive manual work so you can focus on writing code. Whether you’re just learning DevOps or want reliable hands-off deployments, automating your builds and tests with CI/CD is one of the best practices you can adopt as a developer.&lt;/p&gt;

&lt;p&gt;If you’re curious how this scales to larger projects or more advanced workflows, there are tons of tools and techniques out there — but mastering the basics will give you a huge head start.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>automation</category>
      <category>testing</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Building a Modular Serverless ETL Pipeline on AWS with Terraform &amp; Lambda</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sun, 07 Dec 2025 08:20:28 +0000</pubDate>
      <link>https://dev.to/copubah/building-a-modular-serverless-etl-pipeline-on-aws-with-terraform-lambda-5hmf</link>
      <guid>https://dev.to/copubah/building-a-modular-serverless-etl-pipeline-on-aws-with-terraform-lambda-5hmf</guid>
<description>&lt;p&gt;Many applications, even small ones, receive data as raw CSV files (customer exports, logs, partner data dumps). Without automation to clean, validate, and store that data in a standard format, teams end up with messy data, duplicated effort, inconsistent formats, and manual steps each time new data arrives.&lt;/p&gt;

&lt;p&gt;This pipeline provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated processing of raw CSV uploads
&lt;/li&gt;
&lt;li&gt;Basic data hygiene (cleaning / validation)
&lt;/li&gt;
&lt;li&gt;Ready-to-use outputs for analytics or downstream systems
&lt;/li&gt;
&lt;li&gt;Modular, reproducible, and extendable infrastructure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining Terraform + AWS Lambda + Amazon S3, the solution is serverless, scalable, and easy to redeploy: you don’t manage servers (AWS handles compute, storage, and scaling), and you get repeatable infrastructure deployment. This pattern is ideal for small to medium data ingestion workflows, proofs‑of‑concept, and even production‑ready ETL for modest data volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture &amp;amp; Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the high-level architecture of the pipeline:&lt;/p&gt;

&lt;p&gt;Raw CSV file  -&amp;gt;  S3 raw-bucket  -&amp;gt;  S3 event trigger  -&amp;gt;  Lambda function (Python)&lt;br&gt;&lt;br&gt;
                                             │&lt;br&gt;&lt;br&gt;
                                             ▼&lt;br&gt;&lt;br&gt;
                                     Data cleaning / transformation&lt;br&gt;&lt;br&gt;
                                             │&lt;br&gt;&lt;br&gt;
                                             ▼&lt;br&gt;&lt;br&gt;
                                   Save cleaned CSV to S3 clean-bucket&lt;br&gt;&lt;br&gt;
                                             │&lt;br&gt;&lt;br&gt;
                          (optional: push cleaned data to DynamoDB / RDS)  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. A user (or another system) uploads a CSV file into the “raw” S3 bucket.&lt;/p&gt;

&lt;p&gt;2. S3 triggers the Lambda function automatically on object creation.&lt;/p&gt;

&lt;p&gt;3. The Lambda reads the CSV, parses rows, and applies validation and transformation logic (e.g. remove invalid rows, normalize text, enforce schema).&lt;/p&gt;

&lt;p&gt;4. Cleaned data is written back to a “clean” S3 bucket, and optionally also sent to a database (like DynamoDB) or another data store.&lt;/p&gt;

&lt;p&gt;5. Because everything is managed via Terraform, you can version your infrastructure, redeploy consistently across environments (dev / staging / prod), and manage permissions cleanly.&lt;/p&gt;
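
&lt;p&gt;The validation and transformation logic in step 3 can be sketched as a pure function (the required fields and normalization rules below are illustrative assumptions, not the project’s actual schema):&lt;/p&gt;

```python
import csv
import io

REQUIRED_FIELDS = ("id", "email")  # assumed schema

def clean_rows(rows):
    """Drop rows missing required fields and normalize text values."""
    cleaned = []
    for row in rows:
        if not all(row.get(field, "").strip() for field in REQUIRED_FIELDS):
            continue  # enforce schema: skip invalid rows
        cleaned.append({key: value.strip().lower() for key, value in row.items()})
    return cleaned

def transform_csv(raw_text):
    """Parse raw CSV text, clean it, and re-serialize it."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    if not rows:
        return raw_text
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(clean_rows(rows))
    return out.getvalue()
```

&lt;p&gt;The Lambda handler then only needs to read the object from the raw bucket, call a function like transform_csv, and write the result to the clean bucket.&lt;/p&gt;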

&lt;p&gt;&lt;strong&gt;Example Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Customer data ingestion: Partners or internal teams export user data; this pipeline cleans, standardizes, and readies it for analytics or import.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daily sales / transaction reports: Automate processing of daily uploads into a clean format ready for dashboards or billing systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Log / event data processing: Convert raw logs or CSV exports into normalized data for analytics or storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pre‑processing for analytics or machine learning: Clean and standardize raw data before loading into a data warehouse or data lake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Archival and compliance workflows: Maintain clean, versioned, and validated data sets for audits or record‑keeping.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learning Outcomes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure as Code with Terraform
&lt;/li&gt;
&lt;li&gt;Event-driven serverless architecture with Lambda
&lt;/li&gt;
&lt;li&gt;Secure IAM policies and resource permissions
&lt;/li&gt;
&lt;li&gt;Modular, reusable Terraform modules
&lt;/li&gt;
&lt;li&gt;Clean, maintainable ETL logic in Python
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Possible Enhancements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Schema validation and error logging
&lt;/li&gt;
&lt;li&gt;Deduplication logic using DynamoDB or file hashes
&lt;/li&gt;
&lt;li&gt;Multiple destinations (S3, DynamoDB, RDS)
&lt;/li&gt;
&lt;li&gt;Monitoring and CloudWatch metrics
&lt;/li&gt;
&lt;li&gt;Multi-format support (CSV, JSON, Parquet)
&lt;/li&gt;
&lt;li&gt;CI/CD integration
&lt;/li&gt;
&lt;li&gt;Multi-environment deployment (dev, staging, prod)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to build a real-world, production-inspired ETL pipeline on AWS. It’s a small but powerful example of combining serverless computing, IaC, and automation. Having recently started experimenting with these tools, I found this project an excellent way to learn best practices while building something tangible for a portfolio.&lt;/p&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/Copubah/aws-etl-pipeline-terraform" rel="noopener noreferrer"&gt;https://github.com/Copubah/aws-etl-pipeline-terraform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>lambda</category>
      <category>terraform</category>
      <category>aws</category>
    </item>
    <item>
      <title>How to Cut AWS Costs and Maintain Reliability Without a FinOps Team</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sun, 16 Nov 2025 14:50:50 +0000</pubDate>
      <link>https://dev.to/copubah/how-to-cut-aws-costs-and-maintain-reliability-without-a-finops-team-54cl</link>
      <guid>https://dev.to/copubah/how-to-cut-aws-costs-and-maintain-reliability-without-a-finops-team-54cl</guid>
      <description>&lt;p&gt;Managing AWS costs can be overwhelming, especially for startups and development teams. Running resources 24/7, oversized instances, and lack of monitoring often lead to surprise bills. But what if you could optimize costs automatically while keeping your infrastructure reliable?&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk you through a practical approach to solving seven common AWS cost problems using automation and best practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Runaway AWS Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Dev/test resources run continuously, and bills spiral out of control.&lt;br&gt;
The Solution: Automatically stop non‑production resources outside business hours, scale down idle services, and implement lifecycle policies for S3 data.&lt;br&gt;
Impact: 30–50% cost reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Manual Cost Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Tracking and stopping resources manually is error‑prone and time‑consuming.&lt;br&gt;
The Solution: Use Lambda functions triggered by schedules and AWS Budget alerts.&lt;br&gt;
Impact: Fully automated cost management with zero manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Lack of Cost Visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Teams only notice overspending when the bill arrives.&lt;br&gt;
The Solution: AWS Budgets with thresholds (50%, 80%, 100%, 120%) send proactive alerts.&lt;br&gt;
Impact: Early warnings prevent budget overruns and surprises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Reliability vs Cost Trade-off&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Cutting costs often sacrifices uptime.&lt;br&gt;
The Solution: Deploy multi‑AZ architectures with auto‑scaling, health checks, and comprehensive monitoring.&lt;br&gt;
Impact: Save money without compromising 99.9% uptime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Resource Waste&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Idle instances, oversized servers, and old data in expensive storage tiers.&lt;br&gt;
The Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled shutdowns of non‑production resources&lt;/li&gt;
&lt;li&gt;Right‑sized instances with auto‑scaling&lt;/li&gt;
&lt;li&gt;S3 lifecycle policies (IA after 30 days, Glacier after 90 days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Impact: Eliminates waste across compute, storage, and databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Reactive Incident Response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Teams only learn of issues after users complain.&lt;br&gt;
The Solution: CloudWatch alarms monitor CPU, memory, latency, errors, and system health.&lt;br&gt;
Impact: Proactive alerts and automated recovery keep downtime minimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Complex Infrastructure Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Problem: Building cost optimization and monitoring from scratch takes weeks.&lt;br&gt;
The Solution: Use production‑ready Terraform modules to deploy the complete infrastructure in 15 minutes.&lt;br&gt;
Impact: Best practices implemented instantly with minimal setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real‑World Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before Automation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev environment running 24/7: $500/month&lt;/li&gt;
&lt;li&gt;Oversized instances: $300/month&lt;/li&gt;
&lt;li&gt;Manual monitoring and cost tracking&lt;/li&gt;
&lt;li&gt;Total: $800/month + hours of manual work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Automation (via the platform)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto‑stop dev: $250/month&lt;/li&gt;
&lt;li&gt;Right‑sized with auto‑scaling: $180/month&lt;/li&gt;
&lt;li&gt;Automated monitoring &amp;amp; alerts&lt;/li&gt;
&lt;li&gt;Total: $430/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Savings: $370/month while eliminating manual work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who Benefits?&lt;/strong&gt;&lt;br&gt;
1. Startups: Manage costs while scaling quickly&lt;br&gt;
2. Dev Teams: Focus on building, not shutting down resources&lt;br&gt;
3. Finance Teams: Predictable spend with proactive alerts&lt;br&gt;
4. DevOps Teams: More time on innovation, less on management&lt;br&gt;
5. CTOs: Balance speed with cost control&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform Example: Deploy Auto-Stop Lambda&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;resource "aws_lambda_function" "stop_dev_instances" {&lt;br&gt;
  filename         = "lambda_function_payload.zip"&lt;br&gt;
  function_name    = "stop_dev_instances"&lt;br&gt;
  handler          = "lambda_function.lambda_handler"&lt;br&gt;
  runtime          = "python3.11"&lt;br&gt;
  role             = aws_iam_role.lambda_exec.arn&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
resource "aws_cloudwatch_event_rule" "schedule_rule" {&lt;br&gt;
  name                = "stop-dev-schedule"&lt;br&gt;
  schedule_expression = "cron(0 19 ? * MON-FRI *)"&lt;br&gt;
}&lt;br&gt;
&lt;br&gt;
resource "aws_cloudwatch_event_target" "lambda_target" {&lt;br&gt;
  rule      = aws_cloudwatch_event_rule.schedule_rule.name&lt;br&gt;
  target_id = "stopDevLambda"&lt;br&gt;
  arn       = aws_lambda_function.stop_dev_instances.arn&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
This snippet schedules stopping dev instances every weekday at 7 PM (the cron expression runs in UTC).&lt;/p&gt;
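
&lt;p&gt;The Lambda code behind lambda_function_payload.zip isn’t shown above; a minimal handler could look like this (the Environment=dev tag filter is an assumption, and instance_ids is kept pure so it can be unit tested without AWS):&lt;/p&gt;

```python
def instance_ids(reservations):
    """Flatten describe_instances reservations into a flat list of instance ids."""
    ids = []
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ids.append(instance["InstanceId"])
    return ids

def lambda_handler(event, context):
    import boto3  # imported here so the module loads without AWS deps
    ec2 = boto3.client("ec2")
    # Only running instances tagged Environment=dev are eligible for shutdown
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = instance_ids(response["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

&lt;p&gt;The Lambda’s IAM role would also need ec2:DescribeInstances and ec2:StopInstances permissions.&lt;/p&gt;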

&lt;p&gt;&lt;strong&gt;CloudWatch Alarm Example: ECS CPU Utilization&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;resource "aws_cloudwatch_metric_alarm" "ecs_high_cpu" {&lt;br&gt;
  alarm_name          = "ecs_high_cpu"&lt;br&gt;
  comparison_operator = "GreaterThanThreshold"&lt;br&gt;
  evaluation_periods  = 2&lt;br&gt;
  metric_name         = "CPUUtilization"&lt;br&gt;
  namespace           = "AWS/ECS"&lt;br&gt;
  period              = 300&lt;br&gt;
  statistic           = "Average"&lt;br&gt;
  threshold           = 80&lt;br&gt;
  alarm_actions       = [aws_sns_topic.ops_team.arn]&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
This alarm notifies the operations team if ECS CPU usage exceeds 80% for 10 minutes.&lt;/p&gt;

&lt;p&gt;Deploying this platform gives enterprise-level cost management and reliability without a dedicated FinOps team.&lt;br&gt;
GitHub Repository: &lt;a href="https://github.com/Copubah/aws-cost-optimization-platform" rel="noopener noreferrer"&gt;https://github.com/Copubah/aws-cost-optimization-platform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>sre</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>Building a Serverless Image Processing Pipeline on AWS with Terraform</title>
      <dc:creator>Cloudev</dc:creator>
      <pubDate>Sat, 08 Nov 2025 15:43:02 +0000</pubDate>
      <link>https://dev.to/copubah/building-a-serverless-image-processing-pipeline-on-aws-with-terraform-1fec</link>
      <guid>https://dev.to/copubah/building-a-serverless-image-processing-pipeline-on-aws-with-terraform-1fec</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I recently built a production-ready pipeline that allows users to upload images, have them processed (resized or optimized), and then stored in a separate location, all with minimal manual intervention. The project is available on GitHub at &lt;a href="https://github.com/Copubah/aws-image-processing-pipeline" rel="noopener noreferrer"&gt;aws-image-processing-pipeline&lt;/a&gt;. The goal was to leverage serverless architecture and Infrastructure as Code so that the solution is decoupled, scalable, and maintainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Many applications need to offload image processing or other heavy tasks so that the front end remains responsive and system components can scale independently. By introducing a messaging queue and event-driven processing, we separate upload, processing and storage. This enables high throughput, error isolation and simpler operational overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The workflow is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user uploads an image to an S3 bucket for uploads&lt;/li&gt;
&lt;li&gt;A message is sent to an SQS queue&lt;/li&gt;
&lt;li&gt;A Lambda function polls the queue, downloads the image, processes it, and uploads the result to the processed bucket&lt;/li&gt;
&lt;li&gt;Messages that fail after retries go to a Dead-Letter Queue for inspection&lt;/li&gt;
&lt;li&gt;Monitoring is provided via CloudWatch logs and metrics&lt;/li&gt;
&lt;li&gt;Terraform defines all infrastructure for versioning and reuse&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tech Stack and Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS S3 for storing raw and processed images&lt;/li&gt;
&lt;li&gt;AWS SQS for decoupled messaging&lt;/li&gt;
&lt;li&gt;AWS Lambda with Python 3.11 for processing logic&lt;/li&gt;
&lt;li&gt;Terraform for defining and deploying resources&lt;/li&gt;
&lt;li&gt;Bash scripts for deployment and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;The project uses a modular Terraform structure with separate modules for S3, SQS and Lambda. Each module is reusable and focused on a single responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform Examples
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Terraform configuration (excerpt)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "uploads" {
  bucket = "uploads-bucket-example"
  acl    = "private"
}

resource "aws_s3_bucket" "processed" {
  bucket = "processed-bucket-example"
  acl    = "private"
}

resource "aws_sqs_queue" "image_queue" {
  name = "image-processing-queue"
}

resource "aws_lambda_function" "image_processor" {
  function_name = "image_processor"
  handler       = "image_processor.lambda_handler"
  runtime       = "python3.11"
  role          = aws_iam_role.lambda_role.arn
  filename      = "lambda/image_processor.zip"

  environment {
    variables = {
      UPLOADS_BUCKET   = aws_s3_bucket.uploads.bucket
      PROCESSED_BUCKET = aws_s3_bucket.processed.bucket
    }
  }
}

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.image_queue.arn
  function_name    = aws_lambda_function.image_processor.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;lambda/image_processor.py&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the function is triggered through the SQS event source mapping, each record body carries the original S3 event notification and is decoded before the object is read:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import tempfile

import boto3
from PIL import Image

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        # Each SQS record wraps an S3 event notification in its body
        s3_event = json.loads(record['body'])
        for s3_record in s3_event['Records']:
            bucket = s3_record['s3']['bucket']['name']
            key = s3_record['s3']['object']['key']

            filename = os.path.basename(key)
            download_path = os.path.join(tempfile.gettempdir(), filename)
            upload_path = os.path.join(tempfile.gettempdir(), "processed-" + filename)

            s3.download_file(bucket, key, download_path)

            with Image.open(download_path) as image:
                image = image.resize((800, 600))
                image.save(upload_path)

            s3.upload_file(upload_path, os.environ['PROCESSED_BUCKET'], key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Deployment Walkthrough
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Configure AWS credentials locally with the AWS CLI&lt;/li&gt;
&lt;li&gt;Clone the repository from GitHub&lt;/li&gt;
&lt;li&gt;Copy terraform.tfvars.example to terraform.tfvars and update the bucket names&lt;/li&gt;
&lt;li&gt;Zip the Lambda code&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd lambda
zip -r image_processor.zip image_processor.py
mv image_processor.zip ../
cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From the project root run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init
terraform plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then upload an image to the upload bucket to test the pipeline, monitor Lambda execution in CloudWatch logs, and inspect the processed bucket and the Dead-Letter Queue for failures. Clean up with terraform destroy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Configuring the SQS visibility timeout and retry logic required careful planning&lt;/li&gt;
&lt;li&gt;IAM role policies had to be restrictive yet functional&lt;/li&gt;
&lt;li&gt;Handling large images without Lambda timeouts required optimization&lt;/li&gt;
&lt;li&gt;Storage costs were controlled with S3 lifecycle policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This pipeline provides a solid foundation for serverless asynchronous workloads. Possible extensions include notifications when processing completes, multiple image transformations, or CDN integration. Building this project deepened my skills in AWS, Terraform, and event-driven architectures.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>cloud</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
