
Rohith Mani varma

How I built a scalable, cost-efficient cloud gaming architecture

Hey, this is my first post on dev.to, and I'm excited to share what I've built in the past few months.

Building a Cloud Gaming Platform on AWS and Azure

What I Built

I built a cloud gaming platform that lets users play PC games by streaming from GPU-enabled VMs. The application layer runs on AWS (ECS Fargate, Aurora, ElastiCache), while the actual gaming VMs run on Azure using Spot pricing to keep costs down.

The setup isn't perfect - it's single-AZ to save money, uses Spot VMs which can get evicted, and has some rough edges. But it works in production and has been stable enough for real users.

This writeup covers the architecture, why I made certain choices (mostly cost-related), and some issues I ran into.

System Overview

Two microservices on ECS Fargate:

  • Main API handles user stuff - Telegram bot, payments via Razorpay, session tracking
  • Azure-API manages VM lifecycle - starting/stopping VMs, disk management, quota tracking

Shared infrastructure:

  • Aurora PostgreSQL Serverless v2 (one cluster, two databases for isolation)
  • ElastiCache Serverless Redis (namespace separation between services)
  • Both services share the same VPC, in separate subnets

External dependencies:

  • Azure Central India for NC4as_T4_v3 GPU VMs (NVIDIA T4)
  • Cloudflare for DNS and SSL
  • New Relic for monitoring

The tech stack is Python 3.12-3.13 and FastAPI, deployed via GitHub Actions. Pretty standard stuff.

(Image: overall architecture overview diagram)

Architecture Decisions

Why AWS + Azure?

Honest answer: AWS credits and free tier covered the hosting costs. For GPU compute, Azure had better Spot VM availability for the T4 instance type in the Central India region than AWS's g4dn instances did. That's basically it.

I considered running everything on one cloud, but the pricing worked out cheaper this way. Azure's Spot pricing for NC4as_T4_v3 is decent, and the game streaming bandwidth goes directly from Azure VMs to users - doesn't route through AWS infrastructure. So the bulk data transfer costs (video/audio streams) are on Azure's side. AWS only handles small management API calls between services.

Service Separation

The Main API and Azure-API are split because they have different concerns:

Main API deals with users - webhook calls from Telegram, payment callbacks from Razorpay, sub-minute billing calculations. It's mostly I/O bound and needs to respond quickly.

Azure-API talks to Azure's management APIs to create/delete VMs, manage disks, check quotas. These operations are slow (VM creation takes 2-3 minutes) and resource-intensive, so keeping them separate makes sense.

Could have made it one service, but this way each can scale independently. In practice though, I'm running one task of each because the load isn't that high.

Database Setup

One Aurora Serverless v2 cluster with two databases (kiro_db and azure_api_db), each with separate PostgreSQL users. This gives me database-level isolation without paying for two separate clusters.

Redis is even simpler - same instance, different namespaces using key prefixes. Each service adds mainapi: or azureapi: as a namespace prefix to all its Redis keys. Works fine and costs half as much as separate instances.
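
To make the prefixing hard to forget, a thin wrapper around the shared client works well. A minimal sketch (the class name, host, and port are placeholders, not the actual code):

```python
import redis

class NamespacedRedis:
    """Prefixes every key with the owning service's namespace."""

    def __init__(self, client: redis.Redis, prefix: str):
        self._client = client
        self._prefix = prefix

    def _key(self, key: str) -> str:
        return f"{self._prefix}{key}"

    def set(self, key: str, value, **kwargs):
        return self._client.set(self._key(key), value, **kwargs)

    def get(self, key: str):
        return self._client.get(self._key(key))

# Both services point at the same ElastiCache endpoint, just with different prefixes
client = redis.Redis(host="redis.internal.example", port=6379, ssl=True)  # placeholder endpoint
main_api_cache = NamespacedRedis(client, "mainapi:")
azure_api_cache = NamespacedRedis(client, "azureapi:")
```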

Scaling is 0.5 to 2.0 ACU for Aurora. Most of the time it sits at 0.5, spikes to 1.0 during peak hours. Haven't needed the full 2.0 yet. Monthly cost is around 50-80 dollars for Aurora and 20-40 for Redis.
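
For reference, that ACU range is just the cluster's Serverless v2 scaling configuration. A boto3 sketch of how it can be set (the cluster identifier is a placeholder):

```python
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Pin the Aurora Serverless v2 capacity range to 0.5-2.0 ACU
rds.modify_db_cluster(
    DBClusterIdentifier="prod-aurora-cluster",  # placeholder identifier
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,
        "MaxCapacity": 2.0,
    },
    ApplyImmediately=True,
)
```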

Networking Setup

VPC is vpc-0aff34ca93f30f2b0, CIDR 10.0.0.0/16, deployed in ap-south-1 region (Mumbai). Single-AZ deployment in ap-south-1a to avoid cross-AZ data transfer charges - saves about 40% on networking costs.

(Image: AWS VPC multi-tier network layout)

Three subnets:

  • Public subnet (10.0.1.0/24): External ALB, NAT Gateway
  • Private subnet 1 (10.0.3.0/24): Main API ECS tasks
  • Private subnet 2 (10.0.4.0/24): Azure-API ECS tasks, Internal ALB

Route tables are straightforward:

Public Subnet:
| Destination | Target | Description |
|-------------|--------|-------------|
| 10.0.0.0/16 | local | VPC internal traffic |
| 0.0.0.0/0 | igw-xxxxxx | Direct internet access |

Private Subnets:
| Destination | Target | Description |
|-------------|--------|-------------|
| 10.0.0.0/16 | local | VPC internal traffic |
| 0.0.0.0/0 | nat-05657e4e60da89b64 | Outbound via NAT Gateway |

NAT Gateway is necessary because ECS tasks need to reach Azure APIs, New Relic, and other external services. No public IPs on the tasks themselves - better security that way.

Load Balancers

External ALB (wasd-main-api-alb-fixed) sits in the public subnet, handles incoming traffic from Cloudflare. Target group points to Main API tasks on port 3000.

Internal ALB (prod-azure-api-internal) is in private subnet 2, scheme is internal-only. This is how Main API talks to Azure-API without going through the internet. Simpler than setting up service mesh or direct task-to-task communication.
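
From the Main API side, the call to Azure-API is just plain HTTP against the internal ALB's DNS name, which is only reachable from inside the VPC. A minimal sketch (the hostname and route are illustrative, not the real ones):

```python
import httpx

# Internal ALB DNS name - resolves to private IPs, reachable only inside the VPC
AZURE_API_BASE = "http://internal-prod-azure-api-internal-xxxx.ap-south-1.elb.amazonaws.com"

async def start_gaming_vm(session_id: str) -> dict:
    async with httpx.AsyncClient(base_url=AZURE_API_BASE, timeout=30.0) as client:
        resp = await client.post("/vms/start", json={"session_id": session_id})
        resp.raise_for_status()
        return resp.json()
```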

Health checks run every 30 seconds, 5 second timeout. Two consecutive successes to mark healthy, three failures to mark unhealthy. Standard settings, nothing fancy.
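
For completeness, those thresholds map directly onto the target group settings. A boto3 sketch, assuming a placeholder target group ARN:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="ap-south-1")

# 30s interval, 5s timeout, 2 successes -> healthy, 3 failures -> unhealthy
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:ap-south-1:123456789012:targetgroup/main-api/abc123",  # placeholder
    HealthCheckPath="/health",
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3,
)
```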

Security Groups

One security group (sg-0f366b1e5f78783c6) for everything. Could have split it up more, but honestly the rules are simple enough:

Inbound:

  • Port 80 from anywhere (ALB needs to accept public traffic)
  • Port 3000 from ALB security group (Main API)
  • Port 8000 from Internal ALB (Azure-API)

Outbound:

  • All traffic to anywhere (need this for New Relic, Azure SDK, Razorpay, Telegram webhooks)
  • Also covers port 5432 for Aurora and 6379 for Redis within VPC

Not the most restrictive setup, but outbound is harder to lock down when you have multiple external dependencies.

Deployment Pipeline

GitHub Actions handles deployment. Push to main branch triggers the workflow:

  1. Build Docker image (multi-stage build, takes 2-3 minutes)
  2. Push to ECR with two tags - latest and $GITHUB_SHA
  3. Update ECS task definition
  4. Update ECS service with force-new-deployment
  5. Wait for service to stabilize

Total time from commit to production: 5-8 minutes typically.

ECS deployment config:

{
  "maximumPercent": 200,
  "minimumHealthyPercent": 100,
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}

This means it starts a new task before killing the old one, so there's always something serving traffic. Circuit breaker catches failed deployments and rolls back automatically - saved me a few times when I pushed broken code.

OIDC role for GitHub Actions has minimal permissions - just ECR push and ECS update. Can't delete or read anything, which is how it should be.

Secrets Management

I only put two secrets in AWS Secrets Manager: DATABASE_URL and AZURE_CLIENT_SECRET. These two are critical - if someone gets database access or can spin up Azure VMs, it's a big problem.

Other stuff (API keys for Telegram, Razorpay, New Relic) goes in as ECS environment variables. Yes, technically less secure, but the cost-benefit doesn't make sense for every single secret. Secrets Manager charges per secret per month, and for a side project that adds up.

ECS injects the secrets at runtime, so they never appear in task definition JSON or anywhere in git. That's the important part.
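
The split looks roughly like this in the container definition - Secrets Manager ARNs under secrets, everything else under environment. A sketch with placeholder names and ARNs:

```python
# Relevant slice of the ECS container definition (all values are placeholders)
container_definition = {
    "name": "main-api",
    "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/main-api:latest",
    # Resolved from Secrets Manager by ECS at container start - never stored in git
    "secrets": [
        {
            "name": "DATABASE_URL",
            "valueFrom": "arn:aws:secretsmanager:ap-south-1:123456789012:secret:prod/database-url",
        },
    ],
    # Lower-risk config goes in as plain environment variables
    "environment": [
        {"name": "RAZORPAY_KEY_ID", "value": "..."},
    ],
}
```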

Monitoring Setup

CloudWatch gets basic metrics - CPU, memory, ALB request counts, error rates. Set up alarms for high CPU (>80% for 5 minutes) and high error rate (>5% for 2 minutes). They notify a Slack channel.
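
The CPU alarm, for example, is roughly this shape (boto3 sketch; the alarm name, dimensions, and the SNS topic that feeds Slack are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

# Fire when the Main API service averages >80% CPU over 5 minutes
cloudwatch.put_metric_alarm(
    AlarmName="main-api-high-cpu",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod-cluster"},  # placeholder
        {"Name": "ServiceName", "Value": "main-api"},      # placeholder
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-south-1:123456789012:alerts-to-slack"],  # placeholder topic
)
```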

New Relic APM tracks application-level stuff - slow database queries, external API latency, error traces. The distributed tracing view helps when debugging issues that span both services.

I don't have a proper on-call setup - it's just me, and if something breaks at 2 AM, it waits until morning unless it's critical. Small-scale operations.

Problems I Hit

Race Conditions on Session Creation

Users would click "Start Session" multiple times quickly, causing duplicate API calls to Azure. This created multiple VMs for the same session, which was both wasteful and confusing.

Fixed it with Redis locks:

# redis_client is the shared redis-py client; nx=True makes this an atomic "set if not exists"
lock_key = f"user_session_creation_lock:{user_id}"
lock_acquired = redis_client.set(lock_key, "locked", ex=120, nx=True)

# Another in-flight request already holds the lock for this user
if not lock_acquired:
    return "Session already being created, please wait"

The 120-second timeout is there because VM creation can take that long. Works fine now - no more duplicate VMs.

Webhook Endpoints

Telegram and Razorpay need stable HTTPS endpoints for webhooks. ECS tasks have dynamic IPs and restart during deployments, so you can't give them the task IP directly.

Solution was Cloudflare + ALB:

  • Cloudflare provides the SSL certificate and stable domain (tgprod.wasdcloud.online)
  • CNAME points to ALB DNS
  • ALB routes to whatever tasks are healthy at the moment

Works across deployments, and I don't have to manage SSL certificates myself. Cloudflare's free tier handles it.

Single-AZ Risk

Deploying everything in one availability zone saves money but means AZ-level failures would take the service down. Haven't had an AZ outage yet, so hard to say if this was the right call.

Aurora is multi-AZ by default (AWS manages this), so at least the database would survive. ECS tasks would need to be restarted in another AZ, which would take 5-10 minutes plus however long AWS takes to resolve the AZ issue.

For the scale I'm at, the cost savings (40% less on data transfer) seemed worth it. If this was a proper business with SLAs, I'd probably go multi-AZ.

Spot VM Evictions

Azure Spot VMs get evicted when capacity is needed elsewhere. I have a dual-queue system - Spot queue with priority, On-Demand queue as fallback. When Spot quota is exhausted or VMs get evicted, requests fall back to On-Demand.
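
Conceptually the fallback is simple: try Spot first, and if Azure rejects the allocation (quota or capacity), retry the same request On-Demand. A rough sketch - the create_vm helper is a made-up placeholder and the error handling is simplified:

```python
from azure.core.exceptions import HttpResponseError

def create_vm(session_id: str, priority: str, eviction_policy: str | None = None) -> dict:
    # Placeholder for the actual Azure SDK call that builds the NC4as_T4_v3 VM
    raise NotImplementedError

def provision_gaming_vm(session_id: str) -> dict:
    try:
        # Preferred path: Spot-priced NC4as_T4_v3 (roughly a third of the On-Demand price)
        return create_vm(session_id, priority="Spot", eviction_policy="Deallocate")
    except HttpResponseError as exc:
        # Spot quota exhausted or no capacity in the region -> fall back to On-Demand
        if exc.status_code in (409, 429) or "SkuNotAvailable" in str(exc):
            return create_vm(session_id, priority="Regular")
        raise
```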

On-Demand is about 3x more expensive, so I try to use Spot whenever possible. In practice, evictions happen maybe 2-3 times a month for the T4 instances in Central India region, which is acceptable.

User experience isn't great when a Spot VM gets evicted mid-game, but that's the trade-off for cheaper compute. Haven't found a way around this yet.

What I'd Change

If I were rebuilding this from scratch:

  1. Multi-AZ from the start - The cost savings aren't worth the operational anxiety. Aurora is already multi-AZ; extending ECS to run in multiple zones isn't that expensive.

  2. Better disk management - Right now each user gets a persistent disk that's reused across sessions. Works fine but disk cleanup is manual. Should have automated this from day one.

  3. Proper structured logging - I'm using basic logging now. Should have set up structured JSON logging and centralized it in CloudWatch Logs Insights. Would make debugging much easier.

  4. Smarter health check strategy - The /health endpoint checks database and Redis connectivity, which sounds good but has a side effect: it keeps Aurora and ElastiCache from scaling down to zero. Serverless scaling is supposed to reduce costs during idle periods, but constant health checks (every 30 seconds) prevent that. Should have designed a lightweight health check that doesn't hit the database (see the sketch after this list), or increased the interval significantly.

  5. Better cost tracking - I have rough estimates of what things cost, but not detailed per-resource tracking. Cost Explorer helps, but tagging resources properly from the beginning would have been smarter.
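
On point 4, the fix is to separate a cheap liveness probe (what the ALB polls) from a readiness check that actually touches dependencies. A FastAPI sketch - the routes and helpers are illustrative, not the real endpoints:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # Liveness only: the ALB hits this every 30s, so it must not wake Aurora or Redis
    return {"status": "ok"}

@app.get("/ready")
async def ready() -> dict:
    # Readiness: pings real dependencies; called rarely (deploys, manual checks)
    return {"database": await check_database(), "redis": await check_redis()}

async def check_database() -> bool:
    # Placeholder: in the real service this would run a cheap SELECT 1 against Aurora
    return True

async def check_redis() -> bool:
    # Placeholder: in the real service this would PING ElastiCache
    return True
```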

Cost Breakdown

Rough monthly costs (all in USD):

  • ECS Fargate: 20-30 (two tasks, 256/512 CPU units, 1 GB memory each)
  • Aurora Serverless v2: 50-80 (0.5-2.0 ACU, usually sits at 0.5)
  • ElastiCache Serverless: 20-40
  • NAT Gateway: 30-40 (based on data processed)
  • ALB: 20-25
  • S3, CloudWatch, misc: 10-15

Total AWS: 150-230/month

Azure GPU VMs are pay-per-use (billed to users), so not included here. Spot pricing is around $0.10-0.15/hour for NC4as_T4_v3, but users pay for this directly.

New Relic is on their free tier (100GB/month data, enough for this scale).

What's Next

Considering adding S3 backup for user game saves. Right now everything is on Azure disks, which works but has no disaster recovery. Syncing to S3 periodically would give me a backup and potentially let users restore data if their disk gets corrupted.

Other than that, the system is pretty stable. No major planned changes unless something breaks or load increases significantly.

Final Thoughts

This isn't a perfect setup - single-AZ, Spot VMs, minimal monitoring, manual operations. But it works, costs are manageable, and deployment is automated. For a side project that actually needs to run in production, that's good enough.

The hybrid cloud approach was mostly driven by cost, not architectural purity. AWS for the application layer made sense given free tier + credits. Azure for GPU compute made sense given Spot pricing and regional availability.

If you're building something similar, focus on getting the basics right first - proper deployment pipeline, health checks, monitoring. The fancy stuff can come later. And always test your deployment rollback before you need it.


Stack:
AWS ECS, Azure GPU VMs, FastAPI, PostgreSQL, Redis, Docker, GitHub Actions
