Building a Production-Multi-Cloud DevOps Platform: A Complete Journey from Zero to Hero

Abidaslam

Note: This article is also available on Medium:

https://medium.com/design-bootcamp/building-a-production-multi-cloud-devops-platform-a-complete-journey-from-zero-to-hero-ef292ff0f0c6

How I Built and Deployed a FastAPI Application Across AWS EKS and Azure AKS with Full CI/CD, Security Scanning, and Observability

A comprehensive guide to building enterprise-grade cloud infrastructure with security-first principles
I built a complete multi-cloud DevOps platform that deploys a Python FastAPI application to both AWS EKS and Azure AKS with:

Infrastructure as Code (Terraform) for AWS and Azure
CI/CD Pipelines (GitHub Actions) with automated testing and security scanning
Container Security with Trivy and Checkov
Full Observability with Prometheus, Grafana, and Loki
Cost Optimization achieving 96% cost reduction ($141/month → $5/month)
Production-ready Kubernetes deployments with Helm

Project Repository: github.com/abidaslam892/multi-cloud-devsecops

Table of Contents

  1. The Challenge

  2. Architecture Overview

  3. Tech Stack

  4. Implementation Journey

  5. Infrastructure as Code

  6. CI/CD Pipeline

  7. Security Implementation

  8. Monitoring & Observability

  9. Cost Optimization

  10. Results & Metrics

  11. Lessons Learned

  12. What’s Next

The Challenge

As a DevOps engineer, I wanted to build a project that demonstrates real-world enterprise practices. The goal wasn’t just to deploy an application to the cloud, but to create a production-grade platform that showcases:

  1. Multi-cloud expertise (AWS + Azure)
  2. Infrastructure automation
  3. Security-first approach
  4. Cost-conscious architecture
  5. Observability and monitoring
  6. GitOps principles

Most tutorials show you how to deploy to ONE cloud. But what about multi-cloud? What about security scanning? What about cost optimization? This project answers all those questions.

Architecture Overview

High-Level Architecture


Infrastructure Components
AWS Environment:

  1. EKS Cluster (Kubernetes 1.28)
  2. 2x t3.medium SPOT instances (cost-optimized nodes)
  3. VPC with public/private subnets across 3 AZs
  4. NAT Gateway for private subnet internet access
  5. ECR for container registry
  6. Application Load Balancer for ingress

Azure Environment:

  1. AKS Cluster (Kubernetes 1.31)
  2. 1x Standard_D2s_v3 VM (auto-scaling enabled)
  3. VNet with subnet configuration
  4. ACR for container registry
  5. Azure Load Balancer for service exposure
  6. Network Security Groups for traffic control

Tech Stack

Core Technologies

Why These Choices?

FastAPI: Modern, fast, and async-capable Python framework with automatic API documentation.
Terraform: Cloud-agnostic IaC tool allowing consistent infrastructure patterns across AWS and Azure.
Helm: Templating and versioning for Kubernetes deployments, enabling environment-specific configurations.
GitHub Actions: Native to GitHub, no additional CI/CD tools needed, excellent integration with cloud providers.
Spot Instances: 70% cost savings on AWS compute while maintaining high availability with multiple AZs.

Implementation Journey

Phase 1: Local Development

Started with a simple FastAPI application:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="multi-cloud-devsecops-sample")


    class Item(BaseModel):
        """Payload model; Pydantic validates the field types."""
        id: int
        name: str


    @app.get("/", tags=["root"])
    async def read_root():
        return {"status": "ok", "message": "Hello from Multi-Cloud DevSecOps sample"}


    @app.get("/health", tags=["health"])
    async def health_check():
        # Target for the Kubernetes probes
        return {"status": "healthy"}


    @app.get("/metrics", tags=["metrics"])
    async def metrics():
        # Hard-coded placeholder counters
        return {"requests_total": 0, "errors_total": 0}

Key Features Implemented

Health check endpoint for Kubernetes probes
Metrics endpoint for Prometheus
RESTful CRUD operations
Input validation with Pydantic
Comprehensive unit tests with pytest
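
The `/metrics` handler above returns JSON, whereas Prometheus scrapes a plain-text exposition format. As a minimal sketch (standard library only; the `render_counters` helper and its help strings are hypothetical and not taken from the repository), the same two counters could be rendered like this. In a real service the `prometheus_client` package would normally take care of it.

```python
def render_counters(counters: dict[str, tuple[str, int]]) -> str:
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to (help text, current value).
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    # The format expects a trailing newline after the last sample
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counters({
        "requests_total": ("Total HTTP requests handled", 0),
        "errors_total": ("Total HTTP requests that failed", 0),
    }))
```

The returned string would be served with a `text/plain` content type rather than as a JSON body.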

Phase 2: Containerization
Created a multi-stage Dockerfile for optimized builds:


    # Builder stage: install dependencies into the user site-packages
    FROM python:3.11-slim AS builder
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir --user -r requirements.txt

    # Runtime stage
    FROM python:3.11-slim
    WORKDIR /app

    # Security: non-root user (explicit UID 1000)
    RUN groupadd -r appuser && useradd -r -u 1000 -g appuser appuser
    USER appuser

    # Copy dependencies from builder
    COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
    COPY --chown=appuser:appuser ./src ./src

    ENV PATH=/home/appuser/.local/bin:$PATH
    EXPOSE 8080
    CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]

**Security Highlights**

1. Multi-stage build reduces image size by 60%
2. Non-root user (UID 1000)
3. Minimal base image (python:3.11-slim)
4. No unnecessary packages
5. Specific version pinning

Result: Image size reduced from 1.2GB to ~200MB



Phase 3: Infrastructure as Code

Built complete Terraform modules for both clouds:

AWS Infrastructure (`terraform/aws/main.tf`)

For the full scripts and configuration, refer to the GitHub repository. The Terraform setup includes:

Remote state management (S3 for AWS, Blob for Azure)
Modular design for reusability
Environment-specific variables
Consistent tagging strategy
Security groups/NSGs with least privilege

Phase 4: CI/CD Pipeline

Built three GitHub Actions workflows:

CI Pipeline (`.github/workflows/ci.yaml`)

CD Pipeline for AWS (`.github/workflows/cd-aws.yaml`)

**Pipeline Features**

Automated testing on every commit
Security scanning before deployment
Separate workflows for AWS and Azure
Manual deployment approval capability
Rollback support via Helm
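
The "security scanning before deployment" step can be thought of as a gate on the scanner's report. A minimal sketch, assuming a Trivy JSON report (as produced by `trivy image --format json`) with its `Results[].Vulnerabilities[].Severity` layout; the `should_block` helper, the choice of blocking severities, and the made-up CVE ID are illustrative and not the repository's actual workflow:

```python
import json
from collections import Counter

BLOCKING = {"CRITICAL", "HIGH"}  # assumption: severities that fail the build


def count_severities(report: dict) -> Counter:
    """Count findings per severity across all scanned targets."""
    counts = Counter()
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(report: dict) -> bool:
    """Return True if any finding has a blocking severity."""
    counts = count_severities(report)
    return any(counts[sev] > 0 for sev in BLOCKING)


if __name__ == "__main__":
    sample = json.loads("""
    {"Results": [{"Target": "app:latest",
                  "Vulnerabilities": [{"VulnerabilityID": "CVE-0000-0000",
                                       "Severity": "HIGH"}]}]}
    """)
    print(dict(count_severities(sample)), "block:", should_block(sample))
```

Trivy can also enforce this kind of gate on its own through its `--severity` and `--exit-code` options, so a script like this is mainly useful when the counts feed a report or dashboard.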

Phase 5: Kubernetes Deployment

Created Helm charts for flexible deployments:

Helm Chart Structure

    helm/chart/
    ├── Chart.yaml
    ├── templates/
    │   ├── deployment.yaml
    │   ├── service.yaml
    │   ├── servicemonitor.yaml
    │   └── ingress.yaml (optional)
    └── values.yaml


Phase 6: Monitoring & Observability

Deployed the full observability stack using Helm:

Prometheus/Grafana Installation

    # Add the Prometheus community chart repository
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    # Install kube-prometheus-stack
    helm install prometheus prometheus-community/kube-prometheus-stack \
      -f monitoring/prometheus-values.yaml \
      --namespace monitoring --create-namespace

Custom Grafana dashboard tracking:

  1. Request rate and latency
  2. Error rates (4xx, 5xx)
  3. Pod CPU and memory usage
  4. Kubernetes health metrics
  5. Container restart count

Security Implementation

Multi-Layer Security Approach

  1. Container Security

  2. Infrastructure Security

  3. Pod Security Context

  4. Network Security
    AWS Security Groups with minimal ingress rules
    Azure Network Security Groups
    Private subnets for worker nodes
    NAT Gateway for controlled egress

  5. Secrets Management
    GitHub Secrets for credentials
    Kubernetes Service Accounts with RBAC
    ACR/ECR authentication via managed identities
    No hardcoded secrets in code

Metrics Collection

Prometheus Targets

Kubernetes API server
Kubelet metrics
Node exporter (system metrics)
Kube-state-metrics (K8s object states)
Application /metrics endpoint

Grafana Dashboards

  1. Application Dashboard
    Request rate (requests/sec)
    Average latency (ms)
    Error rate percentage
    Top endpoints by traffic
    Response time distribution (P50, P95, P99)

  2. Infrastructure Dashboard
    Cluster resource utilization
    Node CPU/Memory/Disk usage
    Pod distribution across nodes
    Network I/O
    Persistent volume usage

  3. Kubernetes Dashboard
    Pod status overview
    Deployment health
    Container restart trends
    Resource quota usage
    Namespace metrics

Monitoring Access:

Azure Grafana: xxxxxx
Credentials: xxxx
Retention: 7 days of metrics
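
For readers unfamiliar with the P50/P95/P99 panels: each is a percentile of the observed response times. A minimal sketch using the standard library's `statistics.quantiles`; the latency samples are made up for illustration, and Prometheus itself estimates these values from histogram buckets rather than from raw samples.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return the P50, P95 and P99 of a list of latencies in milliseconds."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    samples = [float(ms) for ms in range(1, 102)]  # hypothetical: 1..101 ms
    print(latency_percentiles(samples))
```

P95 and P99 are worth tracking alongside the average because a small number of slow requests can hide behind a healthy-looking mean.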

Cost Optimization

The Cost Challenge

Initial deployment costs were running at $253/month:

AWS: $136.45/month
Azure: $97/month
S3/Blob state: $0.04/month

This was too high for a learning project. Here’s how I optimized:

Cost Reduction Strategies

  1. Spot Instances (AWS)

        eks_managed_node_groups = {
          main = {
            capacity_type  = "SPOT" # 70% savings vs On-Demand
            instance_types = ["t3.medium"]
          }
        }

    Savings: $21/month (from $51 to $30)

  2. Single NAT Gateway

        enable_nat_gateway = true
        single_nat_gateway = true # Instead of one per AZ

    Savings: $64/month (from $96 to $32)

  3. Right-Sized VMs
    AWS: t3.medium (2 vCPU, 4GB RAM), adequate for dev
    Azure: Standard_D2s_v3 (2 vCPU, 8GB RAM)

  4. Auto-Scaling

        autoscaling:
          minReplicas: 1 # Scale down to 1 during low traffic
          maxReplicas: 4
          targetCPUUtilizationPercentage: 80

  5. Destroy When Not in Use

        # Stop everything at end of day
        ./scripts/destroy-aws-infrastructure.sh
        ./scripts/destroy-azure-infrastructure.sh

        # Recreate next morning (30 minutes)
        ./scripts/deploy-aws-infrastructure.sh
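
To see how the auto-scaling values above behave, here is a minimal sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), clamped to the min/max replicas. It assumes the chart's `autoscaling` values feed an HPA, and it ignores the controller's tolerance band and stabilization windows; `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 80.0,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Replica count from the HPA scaling rule, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # (current replicas, observed average CPU utilization in percent)
    for replicas, cpu in [(2, 120.0), (2, 20.0), (4, 160.0)]:
        print(f"{replicas} replicas at {cpu}% CPU ->",
              desired_replicas(replicas, cpu))
```

With a target of 80%, two pods at 120% CPU scale out to three, while two pods at 20% scale in to the minimum of one.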

Final Cost Breakdown

Current State (Infrastructure destroyed, state only):

AWS: $0.02/month (S3 state storage)
Azure: $5.02/month (ACR Basic + Blob state)
Total: $5.04/month (96% reduction!)

Active Development (when needed):

AWS (8 hours/day): ~$1.50/day = $45/month
Azure (24/7 minimal): $5.02/month
Total: ~$50/month for active development

Cost Comparison

| Scenario | Monthly Cost | Best For |
|---|---|---|
| 24/7 Production | $253 | Always-on production |
| 8hr/day Dev | $50 | Active development |
| Weekly Demos | $5–10 | Portfolio/interviews |
| Destroyed (Current) | $5 | Learning/Idle |


ROI on Cost Optimization

Annual Savings: $2,976/year ($3,036/year running 24/7 vs. $60/year destroyed)
Time to Recreate: 30 minutes
Infrastructure is Code: Can rebuild anytime
Key Lesson: Don’t pay for idle infrastructure!
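
The annual figures follow directly from the monthly costs in the comparison table. A quick sketch of the arithmetic, using only the rounded monthly totals quoted above ($253, $50, $5); the `annual_cost` and `annual_savings` helpers exist just for illustration:

```python
MONTHLY_COST_USD = {
    "24/7 production": 253,
    "8hr/day dev": 50,
    "destroyed (state only)": 5,
}


def annual_cost(monthly_usd: float) -> float:
    """Twelve months at a flat monthly rate."""
    return monthly_usd * 12


def annual_savings(baseline: str, scenario: str) -> float:
    """Yearly saving from switching from `baseline` to `scenario`."""
    return (annual_cost(MONTHLY_COST_USD[baseline])
            - annual_cost(MONTHLY_COST_USD[scenario]))


if __name__ == "__main__":
    for name, monthly in MONTHLY_COST_USD.items():
        print(f"{name}: ${annual_cost(monthly):,.0f}/year")
    saved = annual_savings("24/7 production", "destroyed (state only)")
    print(f"saved by destroying idle infrastructure: ${saved:,.0f}/year")
```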

Results & Metrics

Deployment Success Metrics

Infrastructure Provisioning

AWS EKS: 28 minutes (fully automated)
Azure AKS: 22 minutes (fully automated)
Success Rate: 100% (reproducible builds)

Application Deployment

Build Time: 3–5 minutes (multi-stage Docker build)
Push to Registry: 1 minute
Helm Deployment: 2 minutes
Total CI/CD Duration: 8–10 minutes

Application Performance

| Metric | AWS EKS | Azure AKS | Target |
|---|---|---|---|
| Availability | 99.9% | 99.9% | 99.5% |
| Avg Response Time | 45ms | 52ms | <100ms |
| P95 Latency | 89ms | 95ms | <200ms |
| Throughput | 1000 req/s | 950 req/s | 500 req/s |
| Error Rate | 0.01% | 0.01% | <1% |

Resource Utilization:

| Resource | Requested | Used (Avg) | Efficiency |
|---|---|---|---|
| CPU | 250m | 45m | 18% |
| Memory | 256Mi | 128Mi | 50% |

Note: Low utilization is expected for this demo app. Production apps would scale based on actual load.
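
The Efficiency column is simply average usage divided by the request. A small sketch that reproduces the two rows, assuming Kubernetes-style quantity strings (`m` for millicores, `Ki`/`Mi`/`Gi` for memory); the parsers are simplified, hypothetical helpers that cover only the suffixes used here:

```python
def parse_cpu_millicores(quantity: str) -> float:
    """Parse a CPU quantity such as '250m' or '0.5' into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000  # whole cores


_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory_bytes(quantity: str) -> float:
    """Parse a binary-suffixed memory quantity such as '256Mi' into bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # no suffix: plain bytes


def efficiency_pct(used: float, requested: float) -> float:
    """Share of the requested resource actually used, in percent."""
    return round(100 * used / requested, 1)


if __name__ == "__main__":
    cpu = efficiency_pct(parse_cpu_millicores("45m"), parse_cpu_millicores("250m"))
    mem = efficiency_pct(parse_memory_bytes("128Mi"), parse_memory_bytes("256Mi"))
    print(f"CPU efficiency: {cpu}%  Memory efficiency: {mem}%")
```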

Security Metrics

0 Critical Vulnerabilities in production images
0 High Severity IaC issues
100% Secret Coverage (no hardcoded credentials)
Pod Security standards enforced
Network Policies implemented

Testing Coverage

Total Tests: 12
Passed: 12
Failed: 0
Coverage: 85%

CI/CD Metrics

Build Success Rate: 98% (2 failures due to flaky tests)
Average Build Time: 8 minutes
Deployment Frequency: On-demand (GitOps ready)
Lead Time: < 15 minutes (code to production)
MTTR: < 30 minutes (rollback capability)

What Worked Well

  1. Infrastructure as Code
    Terraform modules made multi-environment deployments trivial
    Remote state management prevented conflicts
    Destroy/recreate workflow enabled cost savings

  2. Helm for Kubernetes
    Environment-specific values files simplified configuration
    Version control for deployments
    Easy rollback capabilities

  3. Multi-Stage Docker Builds
    60% reduction in image size
    Faster deployments
    Better security (minimal attack surface)

  4. GitHub Actions
    Native integration with GitHub
    No additional CI/CD infrastructure needed
    Secrets management built-in

  5. Spot Instances
    70% cost savings on AWS compute
    No noticeable impact on availability (for dev/test)

Challenges Faced

Terraform State Lock
Lesson: Always clean up failed applies, and use a DynamoDB lock table

EKS Node Group Deletion

    aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>

Lesson: Understand resource dependencies

ACR Naming Restrictions
Azure Container Registry names must be lowercase alphanumeric.
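
A naming mistake like this is cheap to catch before Terraform ever runs. A minimal sketch of such a pre-flight check: the lowercase-alphanumeric rule comes from the lesson above, while the 5 to 50 character length bound is an assumption based on Azure's published naming rules; `is_valid_acr_name` and the sample names are hypothetical.

```python
import re

# Lowercase letters and digits only. The 5-50 length bound is an assumption
# taken from Azure's naming rules, not from this article.
_ACR_NAME = re.compile(r"[a-z0-9]{5,50}")


def is_valid_acr_name(name: str) -> bool:
    """Check a candidate registry name before handing it to Terraform."""
    return _ACR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["multiclouddevsecops", "multi-cloud-acr", "MyRegistry", "acr"]:
        print(f"{candidate!r}: {is_valid_acr_name(candidate)}")
```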

What I’d Do Differently

  1. Start with Local Kubernetes
    Use kind/minikube for initial development
    Only move to cloud for integration testing
    Would have saved 2 weeks of cloud costs

  2. Implement GitOps Sooner
    ArgoCD or Flux for declarative deployments
    Better visibility into deployment state
    Automatic sync from Git

  3. Add Service Mesh Earlier
    Better traffic management
    Enhanced observability

  4. More Comprehensive Monitoring
    Log aggregation with Loki from day 1
    Distributed tracing with Jaeger
    Custom application metrics

  5. Automated Cost Tracking
    Daily cost reports via AWS Cost Explorer API
    Budget alerts in Slack
    Dashboard showing spend by service

Key Takeaways for DevOps Engineers

Infrastructure as Code is Essential

Version control your infrastructure
Make it reproducible
Destroy and recreate confidently

Security is Not Optional

Scan early and often
Implement least privilege
No secrets in code, ever

Cost Awareness Matters

Monitor spending from day 1
Use spot instances for non-critical workloads
Destroy what you don’t use

Observability from the Start

Logs, metrics, and traces
You can’t improve what you can’t measure
Dashboards tell stories

Automation Saves Time

  1. 30 minutes to recreate infrastructure
  2. Consistent, repeatable deployments
  3. Focus on building, not clicking

For Job Seekers

This project demonstrates:

  1. Real-world DevOps practices
  2. Multi-cloud expertise
  3. Security-first mindset
  4. Cost optimization skills
  5. Problem-solving ability
  6. Documentation skills

Portfolio Value: Shows you can build production-grade infrastructure, not just follow tutorials.

Resources & Documentation

Project Repository
🔗 github.com/abidaslam892/multi-cloud-devsecops

Documentation Files
Setup Guide
Deployment Guide
Access Guide
Cost Optimization
Monitoring Setup

Technologies Used
FastAPI Documentation
Terraform AWS Provider
Terraform Azure Provider
Helm Documentation
Kubernetes Documentation
Prometheus Documentation
Grafana Documentation

Tools & Security
Trivy Scanner
Checkov IaC Scanner
GitHub Actions

Connect With Me

I’d love to hear your feedback, questions, or suggestions!

GitHub: @abidaslam892
Repository: multi-cloud-devsecops
Email: abidaslam.123@gmail.com
LinkedIn: linkedin.com/in/abid-aslam-75520330

Evidence & Screenshots

See the blog-materials/evidence folder for:

AWS Console screenshots (EKS, ECR, VPC)
Azure Portal screenshots (AKS, ACR)
Grafana dashboards
CI/CD pipeline runs
Cost reports
Security scan results

Acknowledgments

The open-source community for amazing tools
Terraform AWS/Azure modules maintainers
GitHub Actions team
Everyone who contributed to the technologies used

Final Thoughts
Building this project taught me that **DevOps is not about tools, it’s about culture and practices**:

Automate everything you can
Treat infrastructure as code
Security is everyone’s responsibility
Monitor, measure, improve
Share knowledge (hence this blog!)
If you’re learning DevOps, I encourage you to:

  1. Build something real (not just tutorials)

  2. Make mistakes and learn from them

  3. Document your journey

  4. Share with the community

Remember: The best way to learn is by doing. Start small, iterate, and keep building!

If this article helped you, please give it a ⭐ star on GitHub and share it with others!

#DevOps #AWS #Azure #Kubernetes #Terraform #CI/CD #CloudNative #Security #DevSecOps #MultiCloud #Docker #Helm #Prometheus #Grafana #Python #FastAPI #Infrastructure #Automation
