Building a Production-Grade Multi-Cloud DevOps Platform: A Complete Journey from Zero to Hero
Abidaslam
How I Built and Deployed a FastAPI Application Across AWS EKS and Azure AKS with Full CI/CD, Security Scanning, and Observability
A comprehensive guide to building enterprise-grade cloud infrastructure with security-first principles
I built a complete multi-cloud DevOps platform that deploys a Python FastAPI application to both AWS EKS and Azure AKS with:
Infrastructure as Code (Terraform) for AWS and Azure
CI/CD Pipelines (GitHub Actions) with automated testing and security scanning
Container Security with Trivy and Checkov
Full Observability with Prometheus, Grafana, and Loki
Cost Optimization achieving 96% cost reduction ($141/month → $5/month)
Production-ready Kubernetes deployments with Helm
Project Repository: github.com/abidaslam892/multi-cloud-devsecops
Table of Contents
The Challenge
Architecture Overview
Tech Stack
Implementation Journey
Infrastructure as Code
CI/CD Pipeline
Security Implementation
Monitoring & Observability
Cost Optimization
Results & Metrics
Lessons Learned
What’s Next
The Challenge
As a DevOps engineer, I wanted to build a project that demonstrates real-world enterprise practices. The goal wasn’t just to deploy an application to the cloud, but to create a production-grade platform that showcases:
- Multi-cloud expertise (AWS + Azure)
- Infrastructure automation
- Security-first approach
- Cost-conscious architecture
- Observability and monitoring
- GitOps principles
Most tutorials show you how to deploy to ONE cloud. But what about multi-cloud? What about security scanning? What about cost optimization? This project answers all those questions.
Architecture Overview
High-Level Architecture
Infrastructure Components
AWS Environment
EKS Cluster (Kubernetes 1.28)
- 2x t3.medium SPOT instances (cost-optimized nodes)
- VPC with public/private subnets across 3 AZs
- NAT Gateway for private subnet internet access
- ECR for container registry
- Application Load Balancer for ingress
Azure Environment
AKS Cluster (Kubernetes 1.31)
- 1x Standard_D2s_v3 VM (auto-scaling enabled)
- VNet with subnet configuration
- ACR for container registry
- Azure Load Balancer for service exposure
- Network Security Groups for traffic control
Tech Stack
Core Technologies
Why These Choices?
FastAPI: Modern, fast, and async-capable Python framework with automatic API documentation.
Terraform: Cloud-agnostic IaC tool allowing consistent infrastructure patterns across AWS and Azure.
Helm: Templating and versioning for Kubernetes deployments, enabling environment-specific configurations.
GitHub Actions: Native to GitHub, no additional CI/CD tools needed, excellent integration with cloud providers.
Spot Instances: 70% cost savings on AWS compute while maintaining high availability with multiple AZs.
Implementation Journey
Phase 1: Local Development
Started with a simple FastAPI application:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="multi-cloud-devsecops-sample")

class Item(BaseModel):
    id: int
    name: str

@app.get("/", tags=["root"])
async def read_root():
    return {"status": "ok", "message": "Hello from Multi-Cloud DevSecOps sample"}

@app.get("/health", tags=["health"])
async def health_check():
    return {"status": "healthy"}

@app.get("/metrics", tags=["metrics"])
async def metrics():
    return {"requests_total": 0, "errors_total": 0}
Key Features Implemented
Health check endpoint for Kubernetes probes
Metrics endpoint for Prometheus
RESTful CRUD operations
Input validation with Pydantic
Comprehensive unit tests with pytest
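The full test suite lives in the repository; here is a minimal sketch of what such endpoint tests look like using FastAPI's TestClient (the file name and exact assertions are illustrative, not the project's actual tests):

# tests/test_main.py (illustrative): exercises the endpoints shown above
from fastapi.testclient import TestClient

from src.main import app

client = TestClient(app)

def test_read_root():
    # Root endpoint should report status "ok"
    response = client.get("/")
    assert response.status_code == 200
    assert response.json()["status"] == "ok"

def test_health_check():
    # Health endpoint backs the Kubernetes liveness/readiness probes
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}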
Phase 2: Containerization
Created a multi-stage Dockerfile for optimized builds:
# Builder stage
FROM python:3.11-slim as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Runtime stage
FROM python:3.11-slim
WORKDIR /app

# Security: non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
USER appuser

# Copy dependencies from builder
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser ./src ./src
ENV PATH=/home/appuser/.local/bin:$PATH

EXPOSE 8080
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
Security Highlights
1. Multi-stage build reduces image size by 60%
2. Non-root user (UID 1000)
3. Minimal base image (python:3.11-slim)
4. No unnecessary packages
5. Specific version pinning
Result: Image size reduced from 1.2GB to ~200MB
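The same scanners used later in the pipeline (Trivy and Checkov) can also be run locally before pushing. A sketch, with the image tag and directory layout assumed:

# Build and scan the image locally; fail on HIGH/CRITICAL findings
docker build -t multi-cloud-devsecops-sample:local .
trivy image --severity HIGH,CRITICAL --exit-code 1 multi-cloud-devsecops-sample:local

# Scan the Terraform code for misconfigurations
checkov -d terraform/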
Phase 3: Infrastructure as Code
Built complete Terraform modules for both clouds:
AWS Infrastructure (`terraform/aws/main.tf`):
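The full main.tf is in the repository. As a rough sketch of its general shape, assuming the community terraform-aws-modules for VPC and EKS (the bucket name, region, CIDRs, and sizes below are illustrative, not the project's exact values):

# Remote state in S3 (a DynamoDB table can be added for state locking)
terraform {
  backend "s3" {
    bucket = "multi-cloud-devsecops-tfstate"   # assumed bucket name
    key    = "aws/terraform.tfstate"
    region = "us-east-1"
  }
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name               = "multi-cloud-devsecops"
  cidr               = "10.0.0.0/16"
  azs                = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets    = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true   # one NAT Gateway instead of one per AZ
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = "multi-cloud-devsecops"
  cluster_version = "1.28"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    main = {
      capacity_type  = "SPOT"          # cost-optimized worker nodes
      instance_types = ["t3.medium"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }
  }
}

The Azure side follows the same pattern, with an azurerm remote-state backend in Blob storage and an AKS cluster resource in place of EKS.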
Details of all the scripts and configuration are in the GitHub repository. Key IaC practices:
- Remote state management (S3 for AWS, Blob for Azure)
- Modular design for reusability
- Environment-specific variables
- Consistent tagging strategy
- Security groups/NSGs with least privilege
Phase 4: CI/CD Pipeline
Built three GitHub Actions workflows:
CI Pipeline (`.github/workflows/ci.yaml`):
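The actual workflow file is in the repository. A condensed sketch of the shape such a CI workflow can take, with testing plus Trivy and Checkov scanning (step names, action versions, and scanner options here are assumptions):

name: CI
on: [push, pull_request]

jobs:
  test-and-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies and run tests
        run: |
          pip install -r requirements.txt pytest
          pytest
      - name: Scan IaC with Checkov
        run: |
          pip install checkov
          checkov -d terraform/
      - name: Build container image
        run: docker build -t sample-app:${{ github.sha }} .
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: sample-app:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: "1"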
CD Pipeline for AWS (`.github/workflows/cd-aws.yaml`): see the repository for the full deployment workflow.
Pipeline Features
Automated testing on every commit
Security scanning before deployment
Separate workflows for AWS and Azure
Manual deployment approval capability
Rollback support via Helm
Phase 5: Kubernetes Deployment
Created Helm charts for flexible deployments:
Helm Chart Structure:
helm/chart/
├── Chart.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── servicemonitor.yaml
│   └── ingress.yaml (optional)
└── values.yaml
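A sketch of the kind of values.yaml that drives these templates. The keys and defaults below are illustrative assumptions; the resource requests, probe path, port, and autoscaling numbers are taken from elsewhere in this post:

replicaCount: 2

image:
  repository: <registry>/multi-cloud-devsecops-sample   # ECR or ACR, set per cloud
  tag: latest

service:
  type: LoadBalancer
  port: 8080

resources:
  requests:
    cpu: 250m
    memory: 256Mi

livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /health
    port: 8080

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80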
Phase 6: Monitoring & Observability
Deployed the full observability stack using Helm:
Prometheus/Grafana Installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus-values.yaml \
  --namespace monitoring --create-namespace
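With kube-prometheus-stack in place, the chart's servicemonitor.yaml template tells Prometheus to scrape the application's /metrics endpoint. A sketch of the kind of object it could render (names, labels, and namespaces are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: multi-cloud-devsecops-sample
  namespace: monitoring
  labels:
    release: prometheus        # must match the kube-prometheus-stack release selector
spec:
  namespaceSelector:
    matchNames: ["default"]    # namespace where the app Service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: multi-cloud-devsecops-sample
  endpoints:
    - port: http               # named port on the Service
      path: /metrics
      interval: 30s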
Grafana Dashboard — Custom dashboard tracking:
- Request rate and latency
- Error rates (4xx, 5xx)
- Pod CPU and memory usage
- Kubernetes health metrics
- Container restart count
Security Implementation
Multi-Layer Security Approach
Container Security
Infrastructure Security
Pod Security Context (see the sketch after this list)
Network Security
AWS Security Groups with minimal ingress rules
Azure Network Security Groups
Private subnets for worker nodes
NAT Gateway for controlled egress
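To make the Pod Security Context layer concrete (it complements the non-root user baked into the Dockerfile), here is a minimal sketch of the pod-template fragment the Helm deployment template can render; the exact fields and values are assumptions, not the project's verified configuration:

spec:
  securityContext:              # pod-level: run as the non-root user from the image
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: app
      image: <registry>/multi-cloud-devsecops-sample:latest
      securityContext:          # container-level: drop privileges, lock the filesystem
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]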
Secrets Management
- GitHub Secrets for credentials
- Kubernetes Service Accounts with RBAC
- ACR/ECR authentication via managed identities
- No hardcoded secrets in code
Metrics Collection
Prometheus Targets:
- Kubernetes API server
- Kubelet metrics
- Node exporter (system metrics)
- Kube-state-metrics (K8s object states)
- Application /metrics endpoint
Grafana Dashboards
Application Dashboard
- Request rate (requests/sec)
- Average latency (ms)
- Error rate percentage
- Top endpoints by traffic
- Response time distribution (P50, P95, P99)
Infrastructure Dashboard
- Cluster resource utilization
- Node CPU/Memory/Disk usage
- Pod distribution across nodes
- Network I/O
- Persistent volume usage
Kubernetes Dashboard
- Pod status overview
- Deployment health
- Container restart trends
- Resource quota usage
- Namespace metrics
Monitoring Access:
- Azure Grafana: xxxxxx
- Credentials: xxxx
- Retention: 7 days of metrics
Cost Optimization
The Cost Challenge
Initial deployment costs were running at $253/month:
AWS: $136.45/month
Azure: $97/month
S3/Blob state: $0.04/month
This was too high for a learning project. Here’s how I optimized:
Cost Reduction Strategies
- Spot Instances (AWS)
eks_managed_node_groups = {
  main = {
    capacity_type  = "SPOT"  # 70% savings vs On-Demand
    instance_types = ["t3.medium"]
  }
}
Savings: $21/month (from $51 to $30)
- Single NAT Gateway
enable_nat_gateway = true
single_nat_gateway = true  # Instead of one per AZ
Savings: $64/month (from $96 to $32)
- Right-Sized VMs
AWS: t3.medium (2 vCPU, 4GB RAM), adequate for dev
Azure: Standard_D2s_v3 (2 vCPU, 8GB RAM)
- Auto-Scaling
autoscaling:
  minReplicas: 1  # Scale down to 1 during low traffic
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
- Destroy When Not in Use
Stop everything at the end of the day:
./scripts/destroy-aws-infrastructure.sh
./scripts/destroy-azure-infrastructure.sh
Recreate the next morning (30 minutes):
./scripts/deploy-aws-infrastructure.sh
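The destroy/recreate scripts live in the repository's scripts/ folder. With this Terraform layout they essentially wrap a Helm uninstall plus terraform destroy/apply, roughly like the sketch below (the repository's actual scripts may differ):

#!/usr/bin/env bash
# Rough sketch of a destroy script for the AWS side
set -euo pipefail

# Remove workloads first so load balancers created by Services are cleaned up
helm uninstall prometheus -n monitoring || true
helm uninstall sample-app -n default || true

# Tear down the infrastructure; remote state in S3 survives, so this is reversible
cd terraform/aws
terraform destroy -auto-approve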
Final Cost Breakdown
Current State (Infrastructure destroyed, state only):
AWS: $0.02/month (S3 state storage)
Azure: $5.02/month (ACR Basic + Blob state)
Total: $5.04/month (96% reduction!)
Active Development (when needed):
AWS (8 hours/day): ~$1.50/day = $45/month
Azure (24/7 minimal): $5.02/month
Total: ~$50/month for active development
Cost Comparison
| Scenario | Monthly Cost | Best For |
| --- | --- | --- |
| 24/7 Production | $253 | Always-on production |
| 8hr/day Dev | $50 | Active development |
| Weekly Demos | $5–10 | Portfolio/interviews |
| Destroyed (Current) | $5 | Learning/Idle |
ROI on Cost Optimization
Annual Savings: $2,976/year (24/7) vs $60/year (destroyed)
Time to Recreate: 30 minutes
Infrastructure is Code: Can rebuild anytime
Key Lesson: Don’t pay for idle infrastructure!
Results & Metrics
Deployment Success Metrics
Infrastructure Provisioning
AWS EKS: 28 minutes (fully automated)
Azure AKS: 22 minutes (fully automated)
Success Rate: 100% (reproducible builds)
Application Deployment
Build Time: 3–5 minutes (multi-stage Docker build)
Push to Registry: 1 minute
Helm Deployment: 2 minutes
Total CI/CD Duration: 8–10 minutes
Application Performance
| Metric | AWS EKS | Azure AKS | Target |
| --- | --- | --- | --- |
| Availability | 99.9% | 99.9% | 99.5% |
| Avg Response Time | 45ms | 52ms | <100ms |
| P95 Latency | 89ms | 95ms | <200ms |
| Throughput | 1000 req/s | 950 req/s | 500 req/s |
| Error Rate | 0.01% | 0.01% | <1% |
Resource Utilization:
| Resource | Requested | Used (Avg) | Efficiency |
| --- | --- | --- | --- |
| CPU | 250m | 45m | 18% |
| Memory | 256Mi | 128Mi | 50% |
Note: Low utilization is expected for this demo app. Production apps would scale based on actual load.
Security Metrics
0 Critical Vulnerabilities in production images
0 High Severity IaC issues
100% Secret Coverage (no hardcoded credentials)
Pod Security standards enforced
Network Policies implemented
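A sketch of the kind of NetworkPolicy behind that last item, restricting ingress to the application pods; the namespaces and labels are illustrative assumptions:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-ingress-only
  namespace: default
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: multi-cloud-devsecops-sample
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed ingress namespace
      ports:
        - protocol: TCP
          port: 8080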
Testing Coverage
Total Tests: 12
Passed: 12
Failed: 0
Coverage: 85%
CI/CD Metrics
Build Success Rate: 98% (2 failures due to flaky tests)
Average Build Time: 8 minutes
Deployment Frequency: On-demand (GitOps ready)
Lead Time: < 15 minutes (code to production)
MTTR: < 30 minutes (rollback capability)
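The MTTR figure leans on Helm's release history; a sketch of the rollback path (the release name and namespace are assumptions):

# Inspect release history, then roll back to a known-good revision
helm history sample-app -n default
helm rollback sample-app 1 -n default
helm status sample-app -n default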
Lessons Learned
What Worked Well
Infrastructure as Code
- Terraform modules made multi-environment deployments trivial
- Remote state management prevented conflicts
- The destroy/recreate workflow enabled cost savings
Helm for Kubernetes
- Environment-specific values files simplified configuration
- Version control for deployments
- Easy rollback capabilities
Multi-Stage Docker Builds
- 60% reduction in image size
- Faster deployments
- Better security (minimal attack surface)
GitHub Actions
- Native integration with GitHub
- No additional CI/CD infrastructure needed
- Secrets management built-in
Spot Instances
- 70% cost savings on AWS compute
- No noticeable impact on availability (for dev/test)
Challenges Faced
Terraform State Lock
Lesson: Always clean up failed applies and use a DynamoDB lock table.
EKS Node Group Deletion
aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>
Lesson: Understand resource dependencies; node groups must be deleted before the cluster itself.
ACR Naming Restrictions
Azure Container Registry names must be lowercase alphanumeric.
What I’d Do Differently
1. Start with Local Kubernetes
- Use kind/minikube for initial development
- Only move to cloud for integration testing
- Would have saved 2 weeks of cloud costs
2. Implement GitOps Sooner
- ArgoCD or Flux for declarative deployments
- Better visibility into deployment state
- Automatic sync from Git
3. Add a Service Mesh Earlier
- Better traffic management
- Enhanced observability
4. More Comprehensive Monitoring
- Log aggregation with Loki from day 1
- Distributed tracing with Jaeger
- Custom application metrics
5. Automated Cost Tracking
- Daily cost reports via AWS Cost Explorer API
- Budget alerts in Slack
- Dashboard showing spend by service
Key Takeaways for DevOps Engineers
Infrastructure as Code is Essential
Version control your infrastructure
Make it reproducible
Destroy and recreate confidently
Security is Not Optional
Scan early and often
Implement least privilege
No secrets in code, ever
Cost Awareness Matters
Monitor spending from day 1
Use spot instances for non-critical workloads
Destroy what you don’t use
Observability from the Start
Logs, metrics, and traces
You can’t improve what you can’t measure
Dashboards tell stories
Automation Saves Time
- 30 minutes to recreate infrastructure
- Consistent, repeatable deployments
- Focus on building, not clicking
For Job Seekers
This project demonstrates:
- Real-world DevOps practices
- Multi-cloud expertise
- Security-first mindset
- Cost optimization skills
- Problem-solving ability
- Documentation skills
Portfolio Value: Shows you can build production-grade infrastructure, not just follow tutorials.
Resources & Documentation
Project Repository
🔗 github.com/abidaslam892/multi-cloud-devsecops
Documentation Files
Setup Guide
Deployment Guide
Access Guide
Cost Optimization
Monitoring Setup
Technologies Used
FastAPI Documentation
Terraform AWS Provider
Terraform Azure Provider
Helm Documentation
Kubernetes Documentation
Prometheus Documentation
Grafana Documentation
Tools & Security
Trivy Scanner
Checkov IaC Scanner
GitHub Actions
Connect With Me
I’d love to hear your feedback, questions, or suggestions!
GitHub: @abidaslam892
Repository: multi-cloud-devsecops
Email: abidaslam.123@gmail.com
LinkedIn: linkedin.com/in/abid-aslam-75520330
Evidence & Screenshots
See the blog-materials/evidence folder for:
AWS Console screenshots (EKS, ECR, VPC)
Azure Portal screenshots (AKS, ACR)
Grafana dashboards
CI/CD pipeline runs
Cost reports
Security scan results
Acknowledgments
The open-source community for amazing tools
Terraform AWS/Azure modules maintainers
GitHub Actions team
Everyone who contributed to the technologies used
Final Thoughts
Building this project taught me that DevOps is not about tools; it’s about culture and practices:
Automate everything you can
Treat infrastructure as code
Security is everyone’s responsibility
Monitor, measure, improve
Share knowledge (hence this blog!)
If you’re learning DevOps, I encourage you to:
Build something real (not just tutorials)
Make mistakes and learn from them
Document your journey
Share with the community
Remember: The best way to learn is by doing. Start small, iterate, and keep building!
If this article helped you, please give it a ⭐ star on GitHub and share it with others!
DevOps #AWS #Azure #Kubernetes #Terraform #CI/CD #CloudNative #Security #DevSecOps #MultiCloud #Docker #Helm #Prometheus #Grafana #Python #FastAPI #Infrastructure #Automation