abidaslam892

Posted on Nov 23

Building a Production-Multi-Cloud DevOps Platform: A Complete Journey from Zero to Hero

#webdev #linux #github

Abidaslam

Note: Visit the Medium post:

https://medium.com/design-bootcamp/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-ef292ff0f0c6

How I Built and Deployed a FastAPI Application Across AWS EKS and Azure AKS with Full CI/CD, Security Scanning, and Observability

A comprehensive guide to building enterprise-grade cloud infrastructure with security-first principles
I built a complete multi-cloud DevOps platform that deploys a Python FastAPI application to both AWS EKS and Azure AKS with:

Infrastructure as Code (Terraform) for AWS and Azure
CI/CD Pipelines (GitHub Actions) with automated testing and security scanning

Container Security with Trivy and Checkov
Full Observability with Prometheus, Grafana, and Loki
Cost Optimization achieving 96% cost reduction ($141/month → $5/month)
Production-ready Kubernetes deployments with Helm
Project Repository: github.com/abidaslam892/multi-cloud-devsecops


Table of Contents

The Challenge
Architecture Overview
Tech Stack
Implementation Journey
Infrastructure as Code
CI/CD Pipeline
Security Implementation
Monitoring & Observability
Cost Optimization
Results & Metrics
Lessons Learned
What’s Next

The Challenge

As a DevOps engineer, I wanted to build a project that demonstrates real-world enterprise practices. The goal wasn’t just to deploy an application to the cloud, but to create a production-grade platform that showcases:

Multi-cloud expertise (AWS + Azure)
Infrastructure automation
Security-first approach
Cost-conscious architecture
Observability and monitoring
GitOps principles

Most tutorials show you how to deploy to ONE cloud. But what about multi-cloud? What about security scanning? What about cost optimization? This project answers all those questions.

Architecture Overview

High-Level Architecture


Infrastructure Components
AWS Environment:
EKS Cluster (Kubernetes 1.28)
2x t3.medium Spot instances (cost-optimized nodes)
VPC with public/private subnets across 3 AZs
NAT Gateway for private subnet internet access
ECR for container registry
Application Load Balancer for ingress
Azure Environment:
AKS Cluster (Kubernetes 1.31)

1x Standard_D2s_v3 VM (auto-scaling enabled)
VNet with subnet configuration
ACR for container registry
Azure Load Balancer for service exposure
Network Security Groups for traffic control

Tech Stack

Core Technologies

Why These Choices?

FastAPI: Modern, fast, and async-capable Python framework with automatic API documentation.
Terraform: Cloud-agnostic IaC tool allowing consistent infrastructure patterns across AWS and Azure.
Helm: Templating and versioning for Kubernetes deployments, enabling environment-specific configurations.
GitHub Actions: Native to GitHub, no additional CI/CD tools needed, excellent integration with cloud providers.
Spot Instances: 70% cost savings on AWS compute while maintaining high availability with multiple AZs.
Implementation Journey

Phase 1: Local Development

Started with a simple FastAPI application:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="multi-cloud-devsecops-sample")

# Request/response model validated by Pydantic
class Item(BaseModel):
    id: int
    name: str

@app.get("/", tags=["root"])
async def read_root():
    return {"status": "ok", "message": "Hello from Multi-Cloud DevSecOps sample"}

# Used by Kubernetes liveness/readiness probes
@app.get("/health", tags=["health"])
async def health_check():
    return {"status": "healthy"}

# Placeholder counters exposed for monitoring
@app.get("/metrics", tags=["metrics"])
async def metrics():
    return {"requests_total": 0, "errors_total": 0}


Key Features Implemented

Health check endpoint for Kubernetes probes
Metrics endpoint for Prometheus
RESTful CRUD operations
Input validation with Pydantic
Comprehensive unit tests with pytest
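
The `/metrics` handler shown earlier returns JSON, while Prometheus scrapes a plain-text exposition format. The following is a minimal sketch of rendering counters in that format using only the standard library. The function name `render_prometheus_text` and the help strings are hypothetical, chosen for illustration; they are not taken from the repository.

```python
def render_prometheus_text(counters: dict[str, tuple[str, int]]) -> str:
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to a (help text, current value) pair.
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    # The format expects a newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical help strings; the metric names match the sample app.
    sample = {
        "requests_total": ("Total HTTP requests handled.", 0),
        "errors_total": ("Total HTTP requests that failed.", 0),
    }
    print(render_prometheus_text(sample), end="")
```

In a FastAPI app, a string like this would typically be returned with a plain-text response class such as `PlainTextResponse`. In practice, the `prometheus_client` library is the more common way to produce this output.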

Phase 2: Containerization
Created a multi-stage Dockerfile for optimized builds:


# Builder stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Runtime stage
FROM python:3.11-slim
WORKDIR /app

# Security: Non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
USER appuser

# Copy dependencies from builder
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser ./src ./src

ENV PATH=/home/appuser/.local/bin:$PATH
EXPOSE 8080
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]


**Security Highlights**

1. Multi-stage build reduces image size by 60%
2. Non-root user (appuser)
3. Minimal base image (python:3.11-slim)
4. No unnecessary packages
5. Specific version pinning

Result: Image size reduced from 1.2GB to ~200MB



**Phase 3: Infrastructure as Code**


Built complete Terraform modules for both clouds:

AWS Infrastructure (`terraform/aws/main.tf`):


For details of all the scripts and configuration, refer to the GitHub repository.

Remote state management (S3 for AWS, Blob for Azure)
Modular design for reusability
Environment-specific variables
Consistent tagging strategy
Security groups/NSGs with least privilege
Phase 4: CI/CD Pipeline
Built three GitHub Actions workflows:

CI Pipeline (`.github/workflows/ci.yaml`):

CD Pipeline — AWS (`.github/workflows/cd-aws.yaml`):

**Pipeline Features**

Automated testing on every commit
Security scanning before deployment
Separate workflows for AWS and Azure
Manual deployment approval capability
Rollback support via Helm
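
To illustrate "security scanning before deployment", here is a minimal sketch of a gate that reads a Trivy-style JSON report and decides whether to block a release. It assumes the report layout Trivy emits with JSON output (a top-level `Results` list whose entries carry a `Vulnerabilities` list with a `Severity` field). The blocking policy, function names, and sample IDs are hypothetical and not taken from the repository's workflows.

```python
import json
from collections import Counter

# Assumed policy for this sketch: block on CRITICAL or HIGH findings.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def count_severities(report: dict) -> Counter:
    """Count vulnerabilities by severity in a Trivy-style JSON report."""
    counts = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(report: dict) -> bool:
    """Return True if any finding has a blocking severity."""
    counts = count_severities(report)
    return any(counts[sev] > 0 for sev in BLOCKING_SEVERITIES)


if __name__ == "__main__":
    # Made-up report for illustration; the IDs are placeholders.
    sample = json.loads("""
    {"Results": [
        {"Target": "app-image",
         "Vulnerabilities": [
            {"VulnerabilityID": "EXAMPLE-0001", "Severity": "HIGH"},
            {"VulnerabilityID": "EXAMPLE-0002", "Severity": "LOW"}
         ]}
    ]}
    """)
    print(dict(count_severities(sample)))
    print("block deployment:", should_block(sample))
```

Trivy can also enforce a similar policy on its own through its `--severity` and `--exit-code` options, which is usually simpler inside a CI job; the sketch only makes the decision logic explicit.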
Phase 5: Kubernetes Deployment
Created Helm charts for flexible deployments:

Helm Chart Structure:

helm/chart/
├── Chart.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── servicemonitor.yaml
│   └── ingress.yaml (optional)
└── values.yaml


**Phase 6: Monitoring & Observability**

Deployed the full observability stack using Helm:

Prometheus/Grafana Installation

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus-values.yaml \
  --namespace monitoring --create-namespace

Grafana Dashboard — Custom dashboard tracking:

Request rate and latency
Error rates (4xx, 5xx)
Pod CPU and memory usage
Kubernetes health metrics
Container restart count
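
As a rough illustration of the error-rate panels (4xx, 5xx), the sketch below computes both rates from a tally of HTTP status codes. The function name and the sample counts are hypothetical; a real dashboard would derive these from Prometheus queries rather than an in-memory dictionary.

```python
def error_rates(status_counts: dict[int, int]) -> dict[str, float]:
    """Return the share of 4xx and 5xx responses among all responses."""
    total = sum(status_counts.values())
    if total == 0:
        # No traffic yet, so report zero instead of dividing by zero.
        return {"4xx": 0.0, "5xx": 0.0}
    client_errors = sum(n for code, n in status_counts.items() if 400 <= code < 500)
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return {"4xx": client_errors / total, "5xx": server_errors / total}


if __name__ == "__main__":
    # Made-up tally: status code -> number of responses.
    observed = {200: 980, 404: 15, 500: 5}
    for bucket, rate in error_rates(observed).items():
        print(f"{bucket}: {rate:.2%}")
```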
Security Implementation
Multi-Layer Security Approach
Container Security
Infrastructure Security
Pod Security Context
Network Security

AWS Security Groups with minimal ingress rules
Azure Network Security Groups
Private subnets for worker nodes
NAT Gateway for controlled egress

Secrets Management

GitHub Secrets for credentials
Kubernetes Service Accounts with RBAC
ACR/ECR authentication via managed identities
No hardcoded secrets in code

Metrics Collection
Prometheus Targets

Kubernetes API server
Kubelet metrics
Node exporter (system metrics)
Kube-state-metrics (K8s object states)
Application /metrics endpoint

Grafana Dashboards

Application Dashboard

Request rate (requests/sec)
Average latency (ms)
Error rate percentage
Top endpoints by traffic
Response time distribution (P50, P95, P99)

Infrastructure Dashboard

Cluster resource utilization
Node CPU/Memory/Disk usage
Pod distribution across nodes
Network I/O
Persistent volume usage

Kubernetes Dashboard

Pod status overview
Deployment health
Container restart trends
Resource quota usage
Namespace metrics

Monitoring Access

Azure Grafana: xxxxxx
Credentials: xxxx
Retention: 7 days of metrics
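
The response time distribution listed for the application dashboard (P50, P95, P99) can be made concrete with a small sketch. It uses the nearest-rank method on raw latency samples, which is an assumption made for illustration; in Grafana these values would normally come from a Prometheus histogram query, so the numbers can differ slightly. The function name and sample data are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to limit float rounding error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up latencies in milliseconds.
    latencies_ms = [38.0, 41.0, 44.0, 45.0, 47.0, 52.0, 60.0, 75.0, 89.0, 140.0]
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
```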


Cost Optimization
The Cost Challenge

Initial deployment costs were running at $253/month:
AWS: $136.45/month
Azure: $97/month
S3/Blob state: $0.04/month
This was too high for a learning project. Here’s how I optimized:

Cost Reduction Strategies

Spot Instances (AWS)

eks_managed_node_groups = {
  main = {
    capacity_type  = "SPOT" # 70% savings vs On-Demand
    instance_types = ["t3.medium"]
  }
}

Savings: $21/month (from $51 to $30)

Single NAT Gateway

enable_nat_gateway = true
single_nat_gateway = true # Instead of one per AZ

Savings: $64/month (from $96 to $32)

Right-Sized VMs

AWS: t3.medium (2 vCPU, 4GB RAM), adequate for dev
Azure: Standard_D2s_v3 (2 vCPU, 8GB RAM)

Auto-Scaling

autoscaling:
  minReplicas: 1 # Scale down to 1 during low traffic
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
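
To show how these autoscaling values interact, the sketch below applies the scaling rule documented for the Kubernetes HorizontalPodAutoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. It deliberately ignores the autoscaler's tolerance band and stabilization windows, so treat it as an approximation. The function name is hypothetical; the defaults mirror the values above.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 80,
    min_replicas: int = 1,
    max_replicas: int = 4,
) -> int:
    """Approximate the HPA replica calculation for a CPU utilization target."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Made-up load levels: (current replicas, average CPU utilization %).
    for replicas, cpu in [(1, 20), (2, 120), (3, 40), (4, 200)]:
        print(f"{replicas} replicas at {cpu}% CPU -> {desired_replicas(replicas, cpu)}")
```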

Destroy When Not in Use

# Stop everything at end of day
./scripts/destroy-aws-infrastructure.sh
./scripts/destroy-azure-infrastructure.sh

# Recreate next morning (30 minutes)
./scripts/deploy-aws-infrastructure.sh

Final Cost Breakdown
Current State (Infrastructure destroyed, state only):

AWS: $0.02/month (S3 state storage)
Azure: $5.02/month (ACR Basic + Blob state)
Total: $5.04/month (96% reduction!)
Active Development (when needed):
AWS (8 hours/day): ~$1.50/day = $45/month
Azure (24/7 minimal): $5.02/month
Total: ~$50/month for active development
Cost Comparison
| Scenario | Monthly Cost | Best For |
|---|---|---|
| 24/7 Production | $253 | Always-on production |
| 8hr/day Dev | $50 | Active development |
| Weekly Demos | $5–10 | Portfolio/interviews |
| Destroyed (Current) | $5 | Learning/Idle |

ROI on Cost Optimization
Annual Savings: $2,976/year ($3,036/year running 24/7 vs $60/year destroyed)
Time to Recreate: 30 minutes
Infrastructure is Code: Can rebuild anytime
Key Lesson: Don’t pay for idle infrastructure!
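
The annual figure can be checked with a few lines of arithmetic. This sketch uses the rounded monthly costs from the comparison table ($253, $50, and $5); the function names are hypothetical.

```python
MONTHS_PER_YEAR = 12


def annual_cost(monthly_cost: float) -> float:
    """Convert a monthly cost in dollars to a yearly cost."""
    return monthly_cost * MONTHS_PER_YEAR


def annual_savings(before_monthly: float, after_monthly: float) -> float:
    """Yearly savings when moving from one monthly cost to another."""
    return annual_cost(before_monthly) - annual_cost(after_monthly)


if __name__ == "__main__":
    # Rounded monthly costs from the cost comparison table.
    always_on, active_dev, destroyed = 253, 50, 5
    print("24/7 per year:", annual_cost(always_on))
    print("Destroyed per year:", annual_cost(destroyed))
    print("Savings, 24/7 vs destroyed:", annual_savings(always_on, destroyed))
    print("Savings, 24/7 vs 8hr/day dev:", annual_savings(always_on, active_dev))
```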
Results & Metrics
Deployment Success Metrics
Infrastructure Provisioning
AWS EKS: 28 minutes (fully automated)
Azure AKS: 22 minutes (fully automated)
Success Rate: 100% (reproducible builds)
Application Deployment
Build Time: 3–5 minutes (multi-stage Docker build)
Push to Registry: 1 minute
Helm Deployment: 2 minutes
Total CI/CD Duration: 8–10 minutes

Application Performance

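| Metric | AWS EKS | Azure AKS | Target |
|---|---|---|---|
| Availability | 99.9% | 99.9% | 99.5% |
| Avg Response Time | 45ms | 52ms | <100ms |
| P95 Latency | 89ms | 95ms | <200ms |
| Error Rate | 0.01% | 0.01% | <1% |

| Resource | Requested | Used | Utilization |
|---|---|---|---|
| CPU | 250m | 45m | 18% |
| Memory | 256Mi | 128Mi | 50% |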

Note: Low utilization is expected for this demo app. Production apps would scale based on actual load
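
The CPU and memory rows above (250m / 45m / 18% and 256Mi / 128Mi / 50%) are consistent with utilization being usage divided by the requested amount. Reading the first value as the request and the second as observed usage is an assumption, although the arithmetic matches. The sketch below parses a small subset of Kubernetes quantity notation to reproduce the percentages; the helper names are hypothetical.

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> float:
    """Return CPU cores, e.g. '250m' -> 0.25 and '2' -> 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes, e.g. '128Mi' -> 134217728."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a plain number is taken as bytes


def utilization_pct(used: float, requested: float) -> float:
    """Usage as a percentage of the request, rounded to one decimal."""
    return round(used / requested * 100, 1)


if __name__ == "__main__":
    print("CPU:", utilization_pct(parse_cpu("45m"), parse_cpu("250m")), "%")
    print("Memory:", utilization_pct(parse_memory("128Mi"), parse_memory("256Mi")), "%")
```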

Security Metrics

0 Critical Vulnerabilities in production images
0 High Severity IaC issues
100% Secret Coverage (no hardcoded credentials)
Pod Security standards enforced
Network Policies implemented
Testing Coverage
Total Tests: 12
Passed: 12
Failed: 0
Coverage: 85%
CI/CD Metrics
Build Success Rate: 98% (2 failures due to flaky tests)
Average Build Time: 8 minutes
Deployment Frequency: On-demand (GitOps ready)
Lead Time: < 15 minutes (code to production)
MTTR: < 30 minutes (rollback capability)

What Worked Well

Infrastructure as Code

Terraform modules made multi-environment deployments trivial
Remote state management prevented conflicts
Destroy/recreate workflow enabled cost savings

Helm for Kubernetes

Environment-specific values files simplified configuration
Version control for deployments
Easy rollback capabilities

Multi-Stage Docker Builds

60% reduction in image size
Faster deployments
Better security (minimal attack surface)

GitHub Actions

Native integration with GitHub
No additional CI/CD infrastructure needed
Secrets management built-in

Spot Instances

70% cost savings on AWS compute
No noticeable impact on availability (for dev/test)

Challenges Faced

Terraform State Lock
Lesson: Always clean up failed applies, use DynamoDB lock table

EKS Node Group Deletion
aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>

Lesson: Understand resource dependencies

ACR Naming Restrictions
Azure Container Registry names must be lowercase alphanumeric.
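
A small pre-flight check can catch this before Terraform runs. The sketch below enforces the lowercase alphanumeric rule stated above, plus a 5 to 50 character length bound. The length bound is my understanding of Azure's naming rules and should be verified against the current Azure documentation. The function name and sample names are hypothetical.

```python
import re

# Lowercase letters and digits only; the 5-50 length bound is an assumption
# based on Azure's published naming rules and should be double-checked.
_ACR_NAME = re.compile(r"[a-z0-9]{5,50}")


def is_valid_acr_name(name: str) -> bool:
    """Return True if `name` looks like an acceptable ACR registry name."""
    return _ACR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Made-up candidate names for illustration.
    for candidate in ["multicloudacr01", "Multi-Cloud-ACR", "acr"]:
        print(candidate, "->", is_valid_acr_name(candidate))
```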

What I’d Do Differently

Start with Local Kubernetes
Use kind/minikube for initial development
Only move to cloud for integration testing
Would have saved 2 weeks of cloud costs
Implement GitOps Sooner
ArgoCD or Flux for declarative deployments
Better visibility into deployment state
Automatic sync from Git
Add Service Mesh Earlier
Better traffic management
Enhanced observability
More Comprehensive Monitoring
Log aggregation with Loki from day 1
Distributed tracing with Jaeger
Custom application metrics
Automated Cost Tracking
Daily cost reports via AWS Cost Explorer API
Budget alerts in Slack
Dashboard showing spend by service
Key Takeaways for DevOps Engineers
Infrastructure as Code is Essential

Version control your infrastructure
Make it reproducible
Destroy and recreate confidently
Security is Not Optional

Scan early and often
Implement least privilege
No secrets in code, ever
Cost Awareness Matters

Monitor spending from day 1
Use spot instances for non-critical workloads
Destroy what you don’t use
Observability from the Start

Logs, metrics, and traces
You can’t improve what you can’t measure
Dashboards tell stories
Automation Saves Time

30 minutes to recreate infrastructure
Consistent, repeatable deployments
Focus on building, not clicking
For Job Seekers
This project demonstrates:
Real-world DevOps practices
Multi-cloud expertise
Security-first mindset
Cost optimization skills
Problem-solving ability
Documentation skills

Portfolio Value: Shows you can build production-grade infrastructure, not just follow tutorials.

Resources & Documentation
Project Repository
🔗 github.com/abidaslam892/multi-cloud-devsecops

Documentation Files
Setup Guide
Deployment Guide
Access Guide
Cost Optimization
Monitoring Setup
Technologies Used
FastAPI Documentation
Terraform AWS Provider
Terraform Azure Provider
Helm Documentation
Kubernetes Documentation
Prometheus Documentation
Grafana Documentation
Tools & Security
Trivy Scanner
Checkov IaC Scanner
GitHub Actions
Connect With Me
I’d love to hear your feedback, questions, or suggestions!

GitHub: @abidaslam892
Repository: multi-cloud-devsecops
Email: abidaslam.123@gmail.com
LinkedIn: linkedin.com/in/abid-aslam-75520330
Evidence & Screenshots
See the blog-materials/evidence folder for:

AWS Console screenshots (EKS, ECR, VPC)
Azure Portal screenshots (AKS, ACR)
Grafana dashboards
CI/CD pipeline runs
Cost reports
Security scan results
Acknowledgments
The open-source community for amazing tools
Terraform AWS/Azure modules maintainers
GitHub Actions team
Everyone who contributed to the technologies used
Final Thoughts
Building this project taught me that DevOps is not about tools; it’s about culture and practices:

Automate everything you can
Treat infrastructure as code
Security is everyone’s responsibility
Monitor, measure, improve
Share knowledge (hence this blog!)
If you’re learning DevOps, I encourage you to:

Build something real (not just tutorials)
Make mistakes and learn from them
Document your journey
Share with the community

Remember: The best way to learn is by doing. Start small, iterate, and keep building!

If this article helped you, please give it a ⭐ star on GitHub and share it with others!

#DevOps #AWS #Azure #Kubernetes #Terraform #CI/CD #CloudNative #Security #DevSecOps #MultiCloud #Docker #Helm #Prometheus #Grafana #Python #FastAPI #Infrastructure #Automation
