DEV Community: Pritam Roy

The AI Platform Wars 2026: Stop Asking Which AI Is Best - Ask This Instead

Pritam Roy — Mon, 23 Mar 2026 12:48:03 +0000

🔗 This is a curated excerpt. The full deep-dive - with detailed comparison tables, architecture-level insights, performance breakdowns, and final verdicts - is on my blog:
👉 pritamroy.com - The AI Platform Wars 2026

The Question Everyone Is Asking (And Getting Wrong)

"Which AI is the best right now?"

I see this on Twitter, Reddit, LinkedIn, every tech forum - every single day.

And almost every answer misses the point entirely.

Because ChatGPT, Claude, Gemini, Copilot, Grok, Perplexity, and DeepSeek are not on the same battlefield.

They are solving different problems.

For different users.

With different architectural bets.

Comparing them without that context is like asking:

"Is a scalpel better than a hammer?"

It depends on what you are trying to build - or cut open.

⚔️ We Are Not in an AI Tool Race

We are in a Platform Ecosystem War.

Each of these companies is not just building a smarter chatbot.

They are building:

🔗 Integrations - into your IDE, your email, your browser, your cloud
🏗️ Developer workflows - where you code, how you ship
🗄️ Data pipelines - who owns your enterprise context
🌐 Ecosystems - that lock you in, gently but completely

The AI response quality? That's table stakes now.

The real war is about who controls your workflow.

🧠 The 7 Platforms - What They Actually Are

⚡ ChatGPT - The Swiss Army Knife

Balanced across coding, writing, analysis, and structured tasks
Largest ecosystem of plugins and integrations
GPT-4o brings strong multimodal capability
Best for: General-purpose AI work, teams that need one tool for everything

🧩 Claude - The Deep Thinker

Best-in-class long-context reasoning (up to 200K tokens)
Exceptional for nuanced, complex, multi-step problems
Strong focus on safety, alignment, and reducing hallucinations
Best for: Deep research, long-form writing, architecture reviews, anything requiring sustained reasoning

🌐 Gemini - The Google Insider

Native integration with Google Workspace (Docs, Sheets, Gmail, Drive)
Strong multimodal capabilities - image, audio, video
Rapidly closing the gap in reasoning benchmarks
Best for: Teams living in Google ecosystem, multimodal workflows

💻 Microsoft Copilot - The Developer's Co-Pilot

Embedded directly into VS Code, GitHub, Azure, M365
Context-aware - understands your repo, your codebase, your tickets
Best for: Enterprise developers, Microsoft-stack teams, productivity workflows

⚡ Grok - The Real-Time Reactor

Direct access to X (Twitter) real-time data
Fast, opinionated, less filtered responses
Still maturing in depth and accuracy
Best for: Current events, social data analysis, quick real-time lookups

🔍 Perplexity - The Research Engine

Hybrid of search engine + AI reasoning
Citation-backed answers - you can verify every claim
Strong factual accuracy for technical and news queries
Best for: Research, fact-checking, staying current without hallucination risk

🚀 DeepSeek - The Emerging Disruptor

Exceptional coding performance, often competing with GPT-4 class models
Significantly lower cost
Open-weight models available for self-hosting
Best for: Coding-heavy teams, cost-conscious deployments, developers who want control

📊 The Cheat Sheet (Quick Reference)

Platform	Core Strength	Biggest Weakness	Best Use Case
ChatGPT	Balanced, versatile	Jack of all trades	General-purpose, coding, structured tasks
Claude	Long reasoning, nuance	Slower, fewer integrations	Deep thinking, long-form content
Gemini	Google integration, multimodal	Still catching up on reasoning	Google workspace, image/video work
Copilot	IDE + enterprise integration	Narrow outside Microsoft stack	Dev productivity, code reviews
Grok	Real-time data	Depth and accuracy	Live news, social trends
Perplexity	Factual accuracy, citations	Not for creative tasks	Research, fact-checking
DeepSeek	Coding, cost efficiency	Smaller ecosystem	Budget-conscious coding teams

🧩 The Real Framework: Map Tool to Task

Stop asking "which is best?"

Start asking "which is best for THIS?"

Task	Top Picks
🧑‍💻 Writing code	Copilot → DeepSeek → ChatGPT
📝 Long-form writing	Claude → ChatGPT
🔍 Research & fact-checking	Perplexity → Claude
📊 Data analysis	ChatGPT → Gemini
⚡ Real-time information	Grok → Perplexity
🏢 Enterprise Microsoft stack	Copilot
🌐 Google Workspace	Gemini
🏗️ Architecture reviews	Claude
💰 Budget-conscious deployment	DeepSeek

🔥 The Insight Most Engineers Miss

Here is what is actually happening beneath the surface:

AI Quality Gap → Closing Fast
Ecosystem Control Gap → Widening Fast

A year ago, GPT-4 had a clear quality lead.

Today? DeepSeek matches it on coding. Claude matches it on reasoning. Gemini is catching up on multimodal.

The model quality differentiation is shrinking.

But the ecosystem lock-in is growing.

Copilot understands your GitHub repo.

Gemini reads your Google Drive.

Claude remembers your enterprise documents.

The question is no longer just "which AI gives better answers?"

It is: "Which AI is embedded so deep in your workflow that switching becomes painful?"

That is the real Platform War.

💬 What's Your Daily Driver?

I am genuinely curious:

Which AI platform do you actually use most in your engineering or creative workflow?

And more importantly - why that one?

Real usage patterns beat benchmark comparisons every time.

Drop it in the comments. 👇

👉 Want the Full Deep-Dive?

This post covered the high-level landscape.

The full article on my blog includes:

📊 Detailed comparison tables across 12+ dimensions

🏗️ Architecture-level analysis of each platform

⚡ Real-world performance breakdown by engineering task

🔮 Where each platform is heading in 2026

🎯 My personal verdict after daily usage of all 7

👉 Read the full analysis on pritamroy.com

Written by Pritam Roy - Senior AWS Cloud & DevOps Engineer. I write about cloud architecture, real-world infrastructure, and the tools engineers actually use.

If this was useful, drop a ❤️ - it helps other engineers find it.

How JioHotstar Engineered 82.1 Crore Concurrent Streams - A DevOps Deep Dive into the T20 World Cup 2026 Final

Pritam Roy — Tue, 10 Mar 2026 10:31:03 +0000

Originally published on pritamroy.com

Setting the Stage: What Actually Happened on March 8, 2026

Before we talk infrastructure, let's appreciate the scale of the event that stress-tested it.

India defeated New Zealand by 96 runs in the ICC Men's T20 World Cup 2026 Final at the Narendra Modi Stadium in Ahmedabad, posting a mammoth 255/5 - the highest total ever in a T20 World Cup final. India became the first team in history to retain their T20 World Cup title, and the first to win three T20 World Cup titles overall. The stadium held 86,000 roaring fans. Hundreds of millions watched on screens across every corner of India and the world.

And JioHotstar? It didn't just survive. It rewrote history.

The Numbers That Made Engineers Sweat (And Then Celebrate)

The concurrent viewership peaked at 82.1 crore simultaneous streams during the post-match presentation ceremony. Let that number sink in - 821 million streams at a single moment, from a single platform, from a single country.

Here's how the demand curve looked throughout the match:

Moment	Concurrent Viewers
Ricky Martin's opening performance	2.1 crore
At the toss	4.2 crore
End of India's innings (255/5)	43.9 crore
Innings break	44.3 crore
New Zealand start chasing 255	49.9 crore
End of the 1st over of the chase	50.3 crore
The moment the last wicket fell	74.5 crore
Post-match presentation ceremony	82.1 crore

This is a textbook demand curve for any DevOps engineer to study - a slow warm-up, a steep mid-event ramp, and a vertical spike at the moment of maximum drama. Every system design decision JioHotstar made had to account for exactly this shape.
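To make the shape of that curve concrete, here is a small sketch that computes the increase between consecutive checkpoints in the table. The figures are taken straight from the table; the labels are shortened and the function name is only illustrative. The checkpoints are not evenly spaced in time, so these are absolute jumps, not rates.

```python
# Concurrency checkpoints from the table above, in crore (1 crore = 10 million).
CHECKPOINTS = [
    ("Opening performance", 2.1),
    ("Toss", 4.2),
    ("End of India's innings", 43.9),
    ("Innings break", 44.3),
    ("Start of chase", 49.9),
    ("End of 1st over of chase", 50.3),
    ("Last wicket", 74.5),
    ("Post-match presentation", 82.1),
]


def ramps(checkpoints):
    """Return (from_label, to_label, increase_in_crore) for each consecutive pair."""
    return [
        (a_label, b_label, round(b - a, 1))
        for (a_label, a), (b_label, b) in zip(checkpoints, checkpoints[1:])
    ]


# The single largest jump between two checkpoints.
steepest = max(ramps(CHECKPOINTS), key=lambda r: r[2])

for start, end, delta in ramps(CHECKPOINTS):
    print(f"{start} -> {end}: +{delta} crore")
```

The two largest jumps are toss to end of India's innings and first over of the chase to the last wicket, which is where autoscaling has to do most of its work.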

For context: the 2024 T20 WC Final peaked at just 5.3 crore on Disney+ Hotstar. In two years, they scaled peak concurrency by more than 15x. That is not an accident - that is an engineering masterclass.

They also came into the Final having broken a world record just days earlier. During the India vs England semi-final on March 5, JioHotstar recorded 65.2 million (6.52 crore) peak concurrent viewers - the highest concurrency ever achieved for a live event across any digital platform in the world. The Final obliterated even that number.

Part 1: The Foundation - Understanding JioHotstar's Architecture Origins

To understand the engineering decisions, you need to understand the entity first.

By late 2024, Reliance Industries (through Viacom18) and The Walt Disney Company had completed an $8.5 billion joint venture called JioStar, combining Viacom18's media assets with Disney's Star India and Hotstar operations in India.

This merger gave the DevOps teams something rare: two battle-hardened streaming backends to draw lessons from. Disney+ Hotstar had years of cricket-at-scale experience, having served the 2023 ODI World Cup and multiple IPL seasons. JioCinema had cracked the 4K pipeline and aggressive CDN work. The 2026 World Cup was the first true test of whether the combined architecture could handle something neither had ever attempted alone.

Part 2: Pre-Tournament Planning - How You Prepare for an 82-Crore Spike

In DevOps, you never wait for production to find your limits. JioHotstar's SRE teams began capacity planning months before the first ball was bowled.

Traffic Forecasting Using Historical Data

The SRE teams forecast traffic using a predictive model trained on data from previous major streaming events: the 2024 T20 WC Final (5.3 crore), the 2023 ODI WC Final (5.9 crore), Asia Cup peaks, and IPL finals. Engineers built regression models accounting for factors like: is India playing? What stage of the tournament is it? What time of day? What are the network conditions across India's diverse geography?

The key insight: the Final was always going to be the largest event, and the models needed to be revised upward after each knockout match. After the semi-final already set a world record at 65.2 million concurrent, the capacity plan for the Final had to be re-evaluated entirely.
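The "revise upward after each knockout match" rule can be sketched as a small function. This is not JioHotstar's model, which the article does not describe in detail. It is a minimal illustration in which `stage_growth` and `headroom` are hypothetical tuning parameters and the numbers are in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    forecast_peak: float  # forecast peak concurrency (arbitrary units)
    headroom: float       # safety multiplier, e.g. 1.5 = 50% buffer (assumed)

    def provisioned(self) -> float:
        """Capacity to provision: forecast plus safety buffer."""
        return self.forecast_peak * self.headroom


def revise_upward(plan: CapacityPlan, observed_peak: float,
                  stage_growth: float) -> CapacityPlan:
    """Raise the forecast if the last match implies a bigger next match.

    stage_growth is the assumed ratio between the next stage's peak and the
    peak just observed. The forecast is never revised downward.
    """
    new_forecast = max(plan.forecast_peak, observed_peak * stage_growth)
    return CapacityPlan(new_forecast, plan.headroom)


# Example with made-up numbers: the previous match overshot expectations.
plan = CapacityPlan(forecast_peak=100.0, headroom=1.5)
plan = revise_upward(plan, observed_peak=90.0, stage_growth=2.0)
print(plan.forecast_peak, plan.provisioned())
```

Taking the `max` is the design choice here: a quiet knockout match never shrinks the plan for the Final.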

Load Testing at Scale - Project HULK

JioHotstar created an in-house project called "Project HULK" specifically to stress-test their platform before major events. The load generation infrastructure used c5.9xlarge machines distributed across 8 different AWS regions to simultaneously hit the CDN, load balancers, and application layers.

The reason for distributing across 8 regions is subtle but important: cloud providers share underlying physical infrastructure. A massive synthetic load originating from a single region could inadvertently impact other customers co-located on the same hardware. By spreading synthetic load across regions, you simulate a real-world distributed user base while being a responsible cloud tenant.

Pre-warming: The Underrated Hero

Every time a major cricket match was about to begin under the old architecture, the operations team had to manually pre-warm hundreds of load balancers. In the new architecture, this process was fully automated.

But the discipline of pre-warming remained: before the Ricky Martin opening performance even started, JioHotstar's edge nodes, CDN caches, and application clusters were already scaled up and warm. Pre-warming CDN caches with the stream's initial HLS segments, spinning up Kubernetes node pools ahead of anticipated demand, pre-populating authentication session caches - all of this is part of the playbook.

You don't wait for traffic to arrive. You meet it at the door.

Part 3: The Kubernetes Architecture - DataCenter Abstraction

This is the most significant architectural evolution in JioHotstar's history, and the one with the most lessons for any platform engineering team.

The Old World

Previously, Hotstar managed its workloads on two large, self-managed Kubernetes clusters built using KOPS (Kubernetes Operations), running 800+ microservices across them. Every microservice had its own AWS Application Load Balancer (ALB) using NodePort services.

The request flow looked like this:

Client → CDN → ALB → NodePort → kube-proxy → Pod

The problems were multiple. Hundreds of ALBs needed to be manually pre-warmed before every major match - an error-prone, time-consuming process. The old Cluster Autoscaler was too slow to release or consolidate nodes efficiently during off-peak periods. And scaling beyond 400 nodes simultaneously caused API server throttling - a hard ceiling on their peak capacity.

The New Model: DataCenter Abstraction

The new model introduced a concept called DataCenter Abstraction. A "data center" in this model doesn't refer to a physical building - it's a logical grouping of multiple Kubernetes clusters within a specific region. Together, these clusters behave like a single large compute unit, with each application team given a single logical namespace.

What this means in practice for the World Cup Final:

JioHotstar could treat its AWS infrastructure across Mumbai, Hyderabad, and Delhi as a single logical pool
A central Envoy proxy replaced hundreds of individual ALBs, unifying traffic routing, authentication, and rate-limiting in one place
Services moved from NodePort to ClusterIP + ALB Ingress, eliminating hard port limits
Developers deploy one YAML manifest per service; the platform handles failover and routing behind the scenes

They also migrated from self-managed KOPS clusters to Amazon EKS, offloading Kubernetes control plane management to AWS. Combined with Karpenter, nodes now provision in seconds rather than minutes - critical when viewership climbs from 44 crore at the innings break to 74 crore by the time the last wicket falls.

# Karpenter NodePool - simplified example
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: live-streaming-pool
spec:
  template:
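    spec:
      # v1beta1 requires a nodeClassRef; "default" is a placeholder EC2NodeClass name
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          # c6g is Graviton (arm64); mixing it with x86 types requires multi-arch images
          values: ["c6i.8xlarge", "c6g.8xlarge", "c5.9xlarge"]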
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand", "spot"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
      kubelet:
        maxPods: 110
  limits:
    cpu: "8000"
    memory: "16Ti"
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

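The capacity-type requirement allows both on-demand and spot, so Karpenter can launch either: it favors cheaper Spot capacity when a pod permits it and falls back to On-Demand otherwise. The NodePool itself does not know which workloads are critical, so session services that must stay on On-Demand need a nodeSelector or node affinity on karpenter.sh/capacity-type. The consolidationPolicy: WhenUnderutilized setting lets Karpenter remove or replace underutilized nodes once load drops, for example during the innings break, trimming cost while the event is still running.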
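One way to keep session-critical pods on On-Demand nodes is to pin them with a nodeSelector on Karpenter's capacity-type label. The sketch below builds that selector in plain Python dictionaries. The pod spec, container name, and image are hypothetical, and a real setup would put the same field in the Deployment manifest.

```python
import copy
import json

# Well-known node label that Karpenter sets on the nodes it launches.
CAPACITY_LABEL = "karpenter.sh/capacity-type"


def pin_capacity_type(pod_spec: dict, capacity_type: str) -> dict:
    """Return a copy of pod_spec restricted to 'on-demand' or 'spot' nodes."""
    if capacity_type not in ("on-demand", "spot"):
        raise ValueError(f"unknown capacity type: {capacity_type}")
    pinned = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    pinned.setdefault("nodeSelector", {})[CAPACITY_LABEL] = capacity_type
    return pinned


# Hypothetical pod spec for a session service.
session_pod = {
    "containers": [{"name": "session-service", "image": "example/session:1.0"}]
}
print(json.dumps(pin_capacity_type(session_pod, "on-demand"), indent=2))
```

Stateless workers can be left unpinned, or pinned to "spot", so that only the services that cannot tolerate interruption pay On-Demand prices.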

IP Address Management - A Lesson from 2023

A critical incident during the 2023 World Cup involved running out of IP addresses. The VPC CNI plugin's WARM_IP_TARGET and MINIMUM_IP_TARGET settings were over-allocating IPs per node. For 2026, engineers used larger CIDR blocks (/18 instead of /20) and fine-tuned these settings, allowing clusters to scale beyond 400 nodes without hitting IP exhaustion.
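The difference between a /20 and a /18 is easy to underestimate. The sketch below counts usable addresses with Python's ipaddress module, assuming the CIDR blocks are subnets (AWS reserves 5 addresses in each subnet). The 10.0.0.0 prefixes and the 30 IPs-per-node figure are illustrative assumptions, not values from JioHotstar's setup.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet


def usable_ips(cidr: str) -> int:
    """Addresses available to ENIs and pods in a subnet of this size."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_nodes(cidr: str, ips_per_node: int) -> int:
    """Upper bound on nodes if each node holds ips_per_node addresses."""
    return usable_ips(cidr) // ips_per_node


for cidr in ("10.0.0.0/20", "10.0.0.0/18"):
    print(cidr, usable_ips(cidr), max_nodes(cidr, ips_per_node=30))
```

Under these assumptions a /20 tops out at 136 nodes while a /18 allows 545, which is why both the larger block and a lower warm-IP allocation per node matter.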

Part 4: Infrastructure Scaling - Eliminating Every Bottleneck

Kubernetes architecture is only part of the picture. The network infrastructure underneath also needed surgery.

NAT Gateway Scaling

Monitoring with VPC Flow Logs revealed a frightening discovery during a pre-tournament load test: a single Kubernetes cluster was consuming 50% of its NAT Gateway throughput at just 10% of expected peak load. At full Final traffic, this would have been a catastrophic bottleneck.

The fix: scale out from one NAT Gateway per Availability Zone to one NAT Gateway per subnet. This distributed the external traffic load evenly and eliminated the pressure point entirely.
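The arithmetic behind that alarm is worth spelling out. Assuming NAT throughput grows linearly with load (a simplification), 50% utilisation at 10% of peak projects to five times the gateway's capacity at full load. The `target_util` ceiling of 70% is a hypothetical planning value.

```python
import math


def projected_utilisation(util_at_test: float, load_fraction: float) -> float:
    """Utilisation of the current NAT capacity at 100% load, assuming linear growth."""
    return util_at_test / load_fraction


def gateways_needed(util_at_test: float, load_fraction: float,
                    current_gateways: int = 1, target_util: float = 0.7) -> int:
    """Gateways required to stay under target_util at full load."""
    projected = projected_utilisation(util_at_test, load_fraction)
    return math.ceil(projected * current_gateways / target_util)


print(projected_utilisation(0.5, 0.1))  # multiple of one gateway's capacity
print(gateways_needed(0.5, 0.1))        # gateways to stay under 70%
```

With these inputs one gateway would need roughly five times its capacity, so the sketch asks for eight gateways to stay under the 70% ceiling.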

Worker Node Network Optimization

Load tests showed that internal API Gateway pods were consuming 8–9 Gbps of network bandwidth on individual nodes, causing severe contention with other services.

Two fixes were implemented in parallel:

Deploy high-throughput nodes with a minimum capacity of 10 Gbps for API Gateway workloads
Use Kubernetes topology spread constraints to spread API Gateway pods evenly across nodes, so that no single node carries several of them

# Topology spread constraint for API Gateway pods
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-gateway

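This constraint keeps API Gateway pods evenly spread: with maxSkew: 1, no node may hold more than one pod above the least-loaded eligible node, so as long as there are at least as many eligible nodes as replicas, each node runs a single API Gateway pod. A hard one-per-node guarantee would additionally need required pod anti-affinity. The result: throughput stabilized at 2–3 Gbps per node even at peak, rather than saturating at 8–9 Gbps on a few overloaded nodes.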
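A simplified model of the scheduler's skew check shows what maxSkew: 1 does and does not guarantee. It ignores taints, node affinity, and minDomains, and the node names are made up.

```python
def placement_allowed(pods_per_node: dict[str, int], candidate: str,
                      max_skew: int = 1) -> bool:
    """Simplified DoNotSchedule check for a hostname topology key.

    Skew is the candidate node's matching pod count after placement minus
    the lowest matching pod count on any eligible node.
    """
    global_min = min(pods_per_node.values())
    skew = pods_per_node[candidate] + 1 - global_min
    return skew <= max_skew


# While an empty node exists, doubling up on a busy node is rejected.
print(placement_allowed({"n1": 1, "n2": 1, "n3": 0}, "n1"))  # False
print(placement_allowed({"n1": 1, "n2": 1, "n3": 0}, "n3"))  # True

# Once every node already has one pod, a second pod on a node is allowed.
print(placement_allowed({"n1": 1, "n2": 1, "n3": 1}, "n1"))  # True
```

The last case is the one to watch: if replicas outnumber eligible nodes, the constraint permits a second gateway pod on a node, so node capacity has to scale with the replica count.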

Part 5: The Video Pipeline - From Camera to 82 Crore Phones in Under 5 Seconds

Most people think of streaming as "just sending video." For a live match at this scale, it is an extraordinarily intricate real-time data pipeline with multiple stages, each completing in sub-second timeframes.

Stage 1 - Ingestion: Getting the Feed from the Ground

At the Narendra Modi Stadium, production crews captured the match using multiple HD and 4K cameras. The raw feed travels via dedicated broadcast fiber links using the SRT (Secure Reliable Transport) protocol. SRT runs over UDP and recovers lost packets through selective retransmission, and is cited here as coping with roughly 20% packet loss, a level the older RTMP protocol handles poorly - critical given India's network variability.

Stage 2 - Transcoding: One Feed, 100 Million Devices

Raw feeds hit AWS Elemental MediaLive on p4d.24xlarge GPU instances, transcoding multiple adaptive renditions in under 2 seconds. A single 4K broadcast feed is simultaneously converted into:

Profile	Target Audience
360p	2G/3G users in rural India
480p	Moderate connections
720p	Standard HD
1080p	Good broadband
4K HDR	Premium fiber/5G subscribers

The 2026 World Cup featured true 4K HDR streaming - not upscaled 1080p - at genuinely high bitrates. Every rendition generated in real-time, in parallel, with sub-2-second latency.

Stage 3 - Packaging: HLS, DASH, and DRM

AWS MediaPackage segments outputs into HLS/DASH chunks at over 100,000 chunks per second, applies DRM encryption through Widevine and PlayReady, and dynamically adds captions and regional subtitles. MediaPackage does just-in-time packaging - eliminating the need to pre-generate format-specific segments for every device type.

Stage 4 - Storage and Delivery

Amazon S3 Intelligent-Tiering stores HLS/DASH chunks with multi-AZ replication. CloudFront delivers them via 300+ edge locations worldwide. Live stream segments are accessed billions of times in their first few seconds and then almost never again - an access pattern that suits S3 Intelligent-Tiering, which automatically moves objects that have not been accessed for 30 consecutive days to a lower-cost tier.

Part 6: The CDN Layer - The True Workhorse of 82 Crore Streams

If the video pipeline is the heart, the CDN is the circulatory system. No single origin server can serve 82 crore simultaneous streams.

Multi-CDN Strategy

JioHotstar employs a multi-CDN strategy with an in-house CDN load optimizer that dynamically chooses between Akamai, CloudFront, and others, always routing viewers through the least congested path. If one CDN faces an issue, another picks up the slack - completely transparent to the viewer.
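The core decision such an optimizer makes can be sketched in a few lines. This is not JioHotstar's in-house optimizer. It is a minimal "healthy and least utilised" selector, and the status fields, the third CDN name, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdnStatus:
    name: str
    healthy: bool
    utilisation: float  # share of contracted capacity in use, 0.0 to 1.0


def pick_cdn(statuses: list[CdnStatus]) -> str:
    """Return the name of the healthy CDN with the lowest utilisation."""
    candidates = [s for s in statuses if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy CDN available")
    return min(candidates, key=lambda s: s.utilisation).name


# Made-up snapshot: the least-loaded CDN is unhealthy, so it is skipped.
snapshot = [
    CdnStatus("akamai", healthy=True, utilisation=0.72),
    CdnStatus("cloudfront", healthy=True, utilisation=0.55),
    CdnStatus("other-cdn", healthy=False, utilisation=0.10),
]
print(pick_cdn(snapshot))
```

A production optimizer would also weigh per-region latency, cost, and the viewer's ISP, but the failover behaviour falls out of the same filter-then-rank structure.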

Traffic Segregation

Traffic Type	Routing Strategy
Cacheable (scorecards, stats, highlights)	Dedicated CDN domain, aggressive cache TTLs
Non-cacheable (sessions, personalization)	Separate routing path, correctness-first
Non-video (images, metadata)	Cost-efficient CDN providers

This segregation preserves high-performance CDN capacity specifically for video segment delivery.

The Jio Network Advantage: A Moat No Competitor Can Copy

JioHotstar is part of a company that also owns the physical network delivering the stream. Jio's 5G network works with Jio's own Mobile Edge Computing (MEC) servers, placing compute resources physically inside the telecom network - at the base station layer - rather than in a distant cloud data center.

For 500 million+ Jio subscribers, the World Cup Final was served from their own carrier's edge - a fundamentally different and faster delivery path than what any competitor can offer.

Part 7: Microservices at Scale - 800+ Services Serving One Match

The microservices architecture means video playback, authentication, personalization, live chat, multilingual commentary routing, payment processing, and analytics are all independent services. This isolation is critical: if the live emoji reaction feature crashes during Bumrah's 4th wicket, it should crash without affecting the video stream.

Feature Flags: The Safety Net

Feature flags allow gradual rollout and instant kill-switches without any deployment. In a worst-case scenario - say, a memory leak in the live chat microservice - engineers flip a single flag to disable chat for all users, immediately reducing load without any restart or deployment.
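A kill switch of this kind can be sketched as follows. The flag store is an in-memory dictionary standing in for whatever shared configuration service a real platform would use, and the flag name "live_chat" and handler are hypothetical.

```python
class FeatureFlags:
    """Minimal in-memory flag store; unknown flags default to disabled."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def kill(self, name: str) -> None:
        """Disable a feature immediately, with no redeploy."""
        self._flags[name] = False


def handle_chat_request(flags: FeatureFlags, message: str) -> dict:
    """Serve chat only while its flag is on; otherwise shed the load cheaply."""
    if not flags.is_enabled("live_chat"):
        return {"status": "disabled"}
    return {"status": "ok", "echo": message}


flags = FeatureFlags({"live_chat": True})
print(handle_chat_request(flags, "What a catch!"))
flags.kill("live_chat")  # operator flips the switch mid-match
print(handle_chat_request(flags, "What a catch!"))
```

The flag is checked on every request, so flipping it takes effect without restarting the service.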

The Kafka and Flink Real-Time Pipeline

Every viewer generates continuous telemetry events. At 82 crore concurrent users, this is billions of messages per second.

Apache Kafka - distributed, fault-tolerant message queue absorbing event bursts
Apache Flink - real-time processing for dashboards, anomaly detection, and adaptive algorithms

Part 8: Observability - The SRE War Room During the Final

The monitoring stack ran three layers simultaneously:

Tool	Purpose
AWS CloudWatch	Infrastructure metrics (EC2 CPU, RDS connections, NAT throughput)
Prometheus	Application-level and custom business metrics
Grafana	Real-time visualization - latency, throughput, rebuffer trends

The single most important metric: rebuffer rate - the percentage of viewers experiencing playback interruption.

# Prometheus alert expression (PromQL) for rebuffer rate
sum(rate(media_rebuffer_events[5m])) / sum(rate(media_play_time[5m])) > 0.004

At 82 crore viewers, 0.4% means roughly 32.8 lakh (about 3.28 million) people buffering simultaneously - an unacceptable outcome. Every metric had an automated alert. Every alert had a documented runbook. Every runbook had been practiced.
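As a quick check on what the 0.004 threshold means in human terms, the sketch below treats it as a share of concurrent viewers, the same reading the article uses. The function name is only illustrative.

```python
CRORE = 10_000_000  # 1 crore = 10 million
LAKH = 100_000      # 1 lakh = 100 thousand


def viewers_affected(concurrent_viewers: int, rebuffer_ratio: float) -> int:
    """Viewers buffering at once if rebuffer_ratio of the audience is affected."""
    return round(concurrent_viewers * rebuffer_ratio)


affected = viewers_affected(82 * CRORE, 0.004)
print(f"{affected:,} viewers = {affected / LAKH:.1f} lakh")
```

Even a threshold that looks tiny as a percentage corresponds to millions of interrupted streams at this scale, which is why the alert fires so early.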

Chaos Engineering: Breaking Things Before Match Day

Before major events, JioHotstar's teams ran chaos drills at 2 AM:

Deliberately killing an entire Availability Zone
Simulating a CDN provider outage
Injecting latency into the authentication service
Validating automated failover and recovery

Good SRE teams don't wait for production failures - they engineer them deliberately.

Part 9: Caching Strategy - Keeping 82 Crore Sessions Alive

Serving 82 crore sessions without hitting the database on every request calls for an aggressive multi-layer caching hierarchy:

Layer 1 - CDN Edge Cache
The video segment is cached at the CDN. If it is served from a CloudFront edge PoP, JioHotstar's origin never sees that request at all. This is the most important cache hit in the entire system.

Layer 2 - Application-Level Redis Cache
User session tokens and subscription entitlements are cached in Redis clusters. A subscription is verified once at playback start and cached for the match duration, so subsequent requests bypass the database entirely.

Layer 3 - Database Read Replicas
Multiple read replicas spread across AZs serve preferences and recommendation data. Write traffic goes only to the primary.

A well-designed caching layer means 82 crore viewers might generate fewer database queries than 5 lakh viewers on a poorly designed system.
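The Layer 2 pattern, verify once and then serve from cache, is a cache-aside lookup with a TTL. In the sketch below an in-memory dictionary stands in for Redis, and the loader function, user ID, and four-hour TTL are hypothetical.

```python
import time
from typing import Callable


class TtlCache:
    """Cache-aside helper: load on miss, serve from memory until the TTL expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, bool]] = {}  # key -> (expiry, value)

    def get_or_load(self, key: str, loader: Callable[[str], bool]) -> bool:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # cache hit: the database is not touched
        value = loader(key)          # cache miss: one database lookup
        self._store[key] = (now + self._ttl, value)
        return value


db_calls: list[str] = []


def check_subscription_in_db(user_id: str) -> bool:
    """Stand-in for the real entitlement query."""
    db_calls.append(user_id)
    return True


cache = TtlCache(ttl_seconds=4 * 3600)  # roughly a match's duration (assumed)
for _ in range(1000):                   # 1000 playback requests from one viewer
    cache.get_or_load("user-42", check_subscription_in_db)
print(len(db_calls))                    # database lookups actually made
```

A thousand requests from one viewer cost a single database lookup, which is the effect the paragraph above describes at platform scale.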

Part 10: Adaptive Bitrate and AI Optimization - Client Intelligence at Scale

The ABR player constantly measures download speed, buffer health, and network latency - running entirely on the client side. For 82 crore simultaneous viewers, even a 1ms server-side computation per quality decision would be catastrophic - that's 820,000 seconds of compute per decision cycle.

JioHotstar's AI-powered bitrate optimization achieves:

25% average bitrate reduction without compromising perceived quality
12% more watch time due to reduced buffering
Proactive network condition prediction before rebuffering begins
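To show why the quality decision can live entirely on the client, here is a minimal throughput-and-buffer heuristic. It is not JioHotstar's AI-based optimizer. The rendition names come from the transcoding table, while the bitrates, safety factor, and buffer threshold are illustrative assumptions.

```python
# Hypothetical ladder: (rendition, required kbps). Bitrates are illustrative.
LADDER = [
    ("360p", 600),
    ("480p", 1200),
    ("720p", 2500),
    ("1080p", 5000),
    ("4K HDR", 16000),
]


def choose_rendition(throughput_kbps: float, buffer_seconds: float,
                     safety: float = 0.8, low_buffer: float = 5.0) -> str:
    """Pick the highest rendition that fits the measured throughput.

    Only a safety fraction of the throughput is budgeted, and the budget is
    halved when the playback buffer is nearly empty.
    """
    budget = throughput_kbps * safety
    if buffer_seconds < low_buffer:
        budget *= 0.5  # be conservative to avoid a stall
    chosen = LADDER[0][0]  # the lowest rung is always available
    for name, kbps in LADDER:
        if kbps <= budget:
            chosen = name
    return chosen


print(choose_rendition(4000, buffer_seconds=20))  # healthy buffer
print(choose_rendition(4000, buffer_seconds=2))   # nearly empty buffer
```

Each player runs this on its own measurements, so the server does no per-viewer computation however many people are watching.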

Part 11: Cost Architecture - 15x Scale Without 15x the Bill

Metric	Value
Cost per 1M viewers	~$0.87–$0.92
Budget variance	~22% under budget
Spot instance discount	Up to 90% vs On-Demand

Spot Instances were used for all stateless, fault-tolerant workloads: transcoding workers, telemetry processors, recommendation engines. Session-critical services ran on On-Demand or Reserved capacity.

Karpenter's bin-packing and consolidation continuously released underutilized nodes between matches, reducing running costs to near-zero between sessions.

Part 12: Multi-Language, Multi-Format - Serving Every Indian

India is not one market. It is 22 official languages, hundreds of dialects, and a spectrum from 2G feature phones in rural UP to 5G flagship devices in Bangalore.

Commentary was available in Hindi, English, Tamil, Telugu, Kannada, Malayalam, Bengali, Marathi, and more - each a separate audio track dynamically stitched into the HLS manifest at request time based on viewer preference.
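In HLS, alternate audio tracks are declared with EXT-X-MEDIA tags in the multivariant playlist. The sketch below generates those tags with the viewer's preferred language marked as the default. The URI layout and group ID are hypothetical, and it illustrates the manifest format, not JioHotstar's stitching service.

```python
def audio_media_lines(languages: list[tuple[str, str]], preferred: str,
                      group_id: str = "audio") -> list[str]:
    """Build one EXT-X-MEDIA tag per commentary language.

    languages is a list of (language_code, display_name) pairs. The track
    matching `preferred` is marked DEFAULT=YES.
    """
    lines = []
    for code, name in languages:
        is_default = "YES" if code == preferred else "NO"
        lines.append(
            f'#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="{group_id}",NAME="{name}",'
            f'LANGUAGE="{code}",DEFAULT={is_default},AUTOSELECT=YES,'
            f'URI="audio/{code}/index.m3u8"'  # hypothetical path layout
        )
    return lines


commentary = [("hi", "Hindi"), ("en", "English"), ("ta", "Tamil"), ("te", "Telugu")]
for line in audio_media_lines(commentary, preferred="ta"):
    print(line)
```

Because only the DEFAULT attribute changes per viewer, the audio segments themselves stay shared and cacheable across the whole audience.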

JioHotstar simultaneously ran four distinct product experiences from the same underlying stream:

Standard player
Hype Mode (vertical video with real-time stat overlays)
Multi-cam view
Highlights scrubber

The platform also deployed low-latency CMAF (Common Media Application Format) streaming at massive scale, achieving an end-to-end delay of only a few seconds - crucial when millions of viewers are watching simultaneously with stadium audio bleeding through their windows.

Part 13: Graceful Degradation - Planning for What You Don't Plan For

In the event of unexpected traffic spikes beyond provisioned capacity, instead of showing a blank screen or error, the system pre-caches and serves static still images (scoreboard, static broadcast frame) as a temporary placeholder while the video pipeline catches up.

The engineering philosophy is clear:

Protect the stream above everything else.

Key Takeaways for DevOps and SRE Engineers

1. Automate pre-warming and scale playbooks.
At 82 crore scale, there is no time for human intervention in the scaling loop.

2. Data-driven capacity planning beats gut feel every time.
Use past events to forecast. Validate with load tests. Revise upward after each knockout match.

3. Layered optimization covers every tier.
CDN edge → Kubernetes node pool → NAT gateway → database read replica. A bottleneck at any tier collapses the stack.

4. Managed services let teams focus on workloads, not infrastructure.
Moving from KOPS to EKS freed the platform team to focus on the microservices that actually differentiate their product.

5. Infrastructure as Code is non-negotiable at 800+ microservices.
Every load balancer, CDN config, autoscaling policy, and node pool declared in code, version-controlled in Git, deployed through CI/CD.

6. Observability is not optional.
CloudWatch + Prometheus + Grafana + documented runbooks + practiced responses. This is what separates platforms that survive scale from platforms that become post-mortems.

7. Plan for graceful failure, not just successful scale.
Feature flags as kill switches, static fallback images, circuit breakers - the difference between "lower quality for 30 seconds" and "error page for 82 crore people."

The Final Score

86,000 fans sang Vande Mataram inside the Narendra Modi Stadium as India lifted their third T20 World Cup. And 82.1 crore people watched it happen - simultaneously, on a single platform, without a single major outage, without viral complaints of buffering, and without the platform going down at the moment of the winning wicket.

India won on the field. JioHotstar won in the server room. Both victories were built the same way: with preparation, with execution under pressure, and with a team that had practiced for exactly this moment.

The next time you're tempted to skip the chaos drill or leave the pre-warming script manual, remember: someone at JioHotstar ran that drill at 2 AM so that 82 crore people could watch Bumrah take his 4th wicket on the smoothest stream of their lives.

Let's Discuss 💬

Have you worked on large-scale streaming infrastructure, CDN optimization, or SRE for real-time systems? What architectural choices did your team make differently - especially around multi-CDN routing, Kubernetes autoscaling, or observability at high concurrency?

Drop a comment below - I'd love to hear your experience. 👇

Pritam Roy: From Network Engineer to AWS DevOps & Cloud Engineer

Pritam Roy — Mon, 02 Mar 2026 14:33:41 +0000

When I started my career in IT, I didn’t begin in the cloud or with automation tools. I began by understanding how systems behave when they fail.

I’m Pritam Roy, and over the past 9+ years in IT, my journey has taken me from working in network operations to designing and managing large-scale AWS cloud infrastructure and DevOps pipelines for production environments.

Where It Started

I began my career in a Network Operations Center, monitoring enterprise MPLS and ILL networks. Working in a 24×7 operations environment taught me something that no certification can teach:

Reliability matters more than theory.

Seeing outages, latency spikes, and configuration issues in real time helped me understand how critical stability, monitoring, and structured processes are in infrastructure engineering.

Transitioning to Cloud and DevOps

During the pandemic, I used that period to deeply invest in learning AWS, Linux systems, infrastructure automation, and deployment pipelines.

Instead of focusing only on courses, I practiced by building real environments:

Automated deployments on AWS
CI/CD pipelines for application delivery
Secure networking setups
Infrastructure provisioning using Terraform

That shift changed the trajectory of my career.

Working in Production Cloud Environments

Today, I work as a Senior AWS Cloud and DevOps Engineer managing real production infrastructure.

My work includes:

Managing hundreds of EC2 instances across environments
Designing secure AWS VPC architectures
Building CI/CD pipelines for Java, Node.js, React, Angular, and mobile apps
Running containerized workloads with Docker and Kubernetes
Implementing monitoring with Prometheus, Grafana, and CloudWatch
Strengthening infrastructure security with IAM, GuardDuty, and WAF

Working in fintech environments especially taught me how critical reliability, observability, and automation are.

Infrastructure in such environments cannot afford downtime or weak security models.

My Approach to DevOps

For me, DevOps isn’t about tools. It’s about how systems are designed.

I focus on:

Automation — anything repeatable should be automated
Security — infrastructure must be secure by design
Scalability — systems should grow without breaking
Observability — if you can’t see it, you can’t fix it

If a system requires constant manual intervention, it isn’t engineered yet.

What I Focus On Today

Currently, my core areas include:

AWS infrastructure design and optimization
Kubernetes and container orchestration
Infrastructure as Code
Secure cloud networking
CI/CD automation
Monitoring and reliability engineering

I enjoy working on systems that are built to last, not just built to run.

Let’s Connect

If you’re interested in cloud engineering, DevOps practices, or infrastructure automation, feel free to connect.

Website: https://www.pritamroy.com
LinkedIn: https://www.linkedin.com/in/pritam-roy-2a55b684/
GitHub: https://github.com/pritamrai99

Thanks for reading my journey. The cloud ecosystem keeps evolving, and I’m excited to keep building, learning, and improving the systems I work on.