Shashank Chakraborty

Posted on Jun 13

From a Simple Web App to a Production-Style Platform: My DevOps Learning Journey

#devops #kubernetes #learning #systemdesign

GitHub: https://github.com/Shashank0701-byte/System-Craft

When I started building SystemCraft, my goal wasn't to learn Kubernetes, GitOps, monitoring, or cloud-native architecture.

I just wanted to build a system design interview platform.
Fast forward a few months, and that simple web application evolved into something much bigger:

CI/CD Pipelines
Dockerized Deployments
Kubernetes
Helm Charts
ArgoCD GitOps
Prometheus Monitoring
Grafana Dashboards
AlertManager
Auto Scaling
Security Scanning

This article is the story of how that happened and what I learned along the way.

The Original Idea

SystemCraft was designed to solve a problem I noticed while preparing for system design interviews.

Most preparation resources are passive:
Reading blogs
Watching videos
Looking at architecture diagrams
But real system design interviews are interactive.

You need to make decisions, justify trade-offs, adapt to changing requirements, and explain your reasoning.
I wanted to create a platform where engineers could:
Design architectures visually
Receive AI-powered feedback
Simulate real interview scenarios
Learn through iteration

The first version was straightforward:

Next.js
↓
MongoDB
↓
Gemini API

It worked. But then I started asking a different question:
How would I run this in production?

The Docker Phase

My first step was containerization.
I created a Dockerfile and containerized the entire application.
At first, I thought Docker was the hard part.
I quickly learned it wasn't.
Building containers is easy.
Operating containers reliably is the real challenge.

Questions started appearing:
How do I deploy updates?
How do I manage multiple replicas?
How do I scale?
How do I monitor failures?

Docker solved packaging.
It didn't solve operations.

Building a Real CI Pipeline

The next step was automation.
I didn't want deployments to depend on manual commands.
I created a GitHub Actions pipeline that would automatically:

Lint & Typecheck
↓
Playwright E2E Tests
↓
Docker Build
↓
Trivy Security Scan
↓
Kubernetes Validation
↓
Deployment

One lesson became obvious:

Automation isn't about speed.
It's about consistency.
The pipeline catches mistakes long before they reach production.
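
The consistency comes from the fail-fast ordering: a later stage only runs if every earlier stage passed. Here is a minimal Python sketch of that idea. It is not the actual GitHub Actions workflow; the stage names and the faked results are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns one log entry per stage that actually ran.
    """
    log: List[str] = []
    for name, check in stages:
        if check():
            log.append(f"{name}: ok")
        else:
            log.append(f"{name}: FAILED")
            break  # later stages, including deployment, never run
    return log


# Hypothetical stand-ins for the real jobs; each lambda fakes a job result.
stages: List[Stage] = [
    ("lint-and-typecheck", lambda: True),
    ("playwright-e2e", lambda: True),
    ("docker-build", lambda: True),
    ("trivy-scan", lambda: False),  # pretend the scan found a blocker
    ("kubernetes-validation", lambda: True),
    ("deployment", lambda: True),
]

for line in run_pipeline(stages):
    print(line)
```

In this run the log ends at the failed scan, so the deployment stage is never reached.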

Security Wasn't Optional
One of the most valuable additions was Trivy.
Initially I wasn't thinking much about container security.
Then I started scanning images and realized how many vulnerabilities can exist inside dependencies you didn't even know you had.

Every build now goes through:
Docker Build
↓
Trivy Scan
↓
Deployment

This simple addition completely changed how I think about shipping software.
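
To show what a scan gate boils down to, here is a rough Python sketch that counts findings by severity in a Trivy-style JSON report and decides whether to block the build. The field names (`Results`, `Vulnerabilities`, `Severity`) follow my understanding of Trivy's JSON output, and the sample report and its IDs are made-up placeholders, so treat this as an illustration rather than the pipeline's real step. In practice Trivy can also fail a build on its own through its severity and exit-code options.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_severity(report: dict) -> Counter:
    """Count vulnerabilities per severity across all scanned targets."""
    counts: Counter = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(counts: Counter, blocking: Iterable[str] = ("HIGH", "CRITICAL")) -> bool:
    """Block the deployment if any finding has a blocking severity."""
    return any(counts[severity] > 0 for severity in blocking)


# Made-up sample report; the target names and IDs are placeholders.
sample_report = json.loads("""
{
  "Results": [
    {"Target": "app (node packages)",
     "Vulnerabilities": [
       {"VulnerabilityID": "EXAMPLE-0001", "PkgName": "left-pad", "Severity": "LOW"},
       {"VulnerabilityID": "EXAMPLE-0002", "PkgName": "some-lib", "Severity": "CRITICAL"}
     ]},
    {"Target": "base image", "Vulnerabilities": null}
  ]
}
""")

counts = count_by_severity(sample_report)
print(dict(counts))
print("block deployment:", should_block(counts))
```

The blocking severities are a policy choice; a stricter or looser tuple can be passed in without changing the counting logic.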

Enter Kubernetes

Eventually a single container stopped being enough.
I wanted:
Multiple replicas
Self-healing workloads
Rolling updates
Horizontal scaling
Kubernetes provided all of that.
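
Rolling updates are easier to reason about with a little arithmetic. The Python sketch below mirrors how a Deployment's percentage-based `maxSurge` is rounded up and `maxUnavailable` is rounded down, using the documented 25% defaults. The replica counts are hypothetical and not taken from SystemCraft's manifests.

```python
from typing import Tuple


def rolling_update_bounds(
    replicas: int,
    max_surge_pct: int = 25,
    max_unavailable_pct: int = 25,
) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout.

    Percentages resolve against the desired replica count:
    surge rounds up, unavailable rounds down.
    """
    surge = -(-replicas * max_surge_pct // 100)  # integer ceiling
    unavailable = replicas * max_unavailable_pct // 100  # integer floor
    return replicas - unavailable, replicas + surge


for replicas in (1, 3, 4, 10):
    min_available, max_total = rolling_update_bounds(replicas)
    print(f"replicas={replicas}: at least {min_available} available, at most {max_total} total")
```

With 4 replicas the rollout keeps at least 3 pods available and never runs more than 5, which is why traffic keeps flowing while the new version rolls in.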

But Kubernetes introduced new challenges:
YAML management
Service discovery
Resource limits
Health checks
Configuration management

The complexity increased significantly.
At the same time, I started understanding why Kubernetes became the industry standard.

Helm Changed Everything

Managing raw Kubernetes manifests quickly became painful.
I introduced Helm charts to template deployments and environments.
Instead of maintaining multiple copies of manifests, I could parameterize everything:
Image versions
Replica counts
Resource limits
Environment variables

Deployment became much more manageable.
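
The core idea is that one set of defaults gets layered with small per-environment overrides. Helm itself renders Go templates from values files, so the Python sketch below only imitates the layering step with a recursive dictionary merge. Every key and value here (image repository, tags, limits) is a hypothetical example, not the chart's real contents.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively layer override values on top of base values.

    Nested dicts are merged key by key; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical chart defaults.
defaults = {
    "image": {"repository": "ghcr.io/example/systemcraft", "tag": "latest"},
    "replicaCount": 1,
    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
    "env": {"NODE_ENV": "development"},
}

# Hypothetical production overrides: only what differs from the defaults.
production = {
    "image": {"tag": "1.4.2"},
    "replicaCount": 3,
    "env": {"NODE_ENV": "production"},
}

prod_values = merge_values(defaults, production)
print(prod_values["image"])
print(prod_values["replicaCount"], prod_values["env"]["NODE_ENV"])
```

The production file stays three lines long because everything it does not mention, such as the resource limits, is inherited from the defaults.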

Discovering GitOps with ArgoCD
This was probably the biggest mindset shift.
Originally deployment looked like:
GitHub Actions
↓
kubectl apply

After learning GitOps:

Git Commit
↓
Git Repository
↓
ArgoCD
↓
Kubernetes Cluster

The cluster state became fully declarative.
Git became the source of truth.
Rollback became dramatically easier.
Auditing changes became trivial.
I finally understood why so many engineering teams are adopting GitOps workflows.
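
Conceptually, a GitOps controller keeps comparing the desired state in Git with the live state in the cluster and acts on the difference. The Python sketch below shows that comparison in its simplest form. It is a toy model, not how ArgoCD is implemented, and the resource names and specs are invented for the example.

```python
from typing import Dict, List


def plan_sync(desired: Dict[str, dict], live: Dict[str, dict]) -> List[str]:
    """List the actions needed to make the live state match the desired state."""
    actions: List[str] = []
    for name in sorted(desired.keys() | live.keys()):
        if name not in live:
            actions.append(f"create {name}")
        elif name not in desired:
            actions.append(f"delete {name}")
        elif desired[name] != live[name]:
            actions.append(f"update {name}")
    return actions


# Desired state as committed to Git (hypothetical resources).
desired = {
    "deployment/web": {"image": "web:1.4.2", "replicas": 3},
    "service/web": {"port": 80},
}

# Live state currently in the cluster (hypothetical).
live = {
    "deployment/web": {"image": "web:1.4.1", "replicas": 3},
    "service/web": {"port": 80},
    "configmap/old": {},
}

print(plan_sync(desired, live))

# A rollback is the same operation with an older commit as the desired state.
previous_commit = {
    "deployment/web": {"image": "web:1.4.1", "replicas": 3},
    "service/web": {"port": 80},
}
print(plan_sync(previous_commit, desired))
```

This is why rollback gets easier: reverting a commit changes the desired state, and the same diff logic brings the cluster back. Real tools usually treat deletions more carefully than this sketch, often requiring pruning to be enabled explicitly.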

Monitoring: The Missing Piece

For a long time I only cared whether the application worked.

Then I realized:
If something breaks in production, how would I know?
That question led me to Prometheus and Grafana.
I instrumented the application and started tracking:
API latency
Request volume
Error rates
Resource utilization
Application health

Suddenly I could see what the system was actually doing.
Monitoring transformed troubleshooting from guessing into observing.
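
To see why percentiles and error rates matter more than averages, here is a small Python sketch that computes them from raw request samples. Prometheus derives quantiles from histogram buckets rather than raw samples, so this only illustrates the arithmetic behind the dashboards. The latency and status values are invented.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes: List[int]) -> float:
    """Fraction of requests that ended in a server error (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("no samples")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


# Invented samples: nine fast requests and one very slow, failing one.
latencies_ms = [120, 95, 110, 130, 105, 98, 2500, 115, 102, 125]
statuses = [200, 200, 200, 200, 200, 200, 500, 200, 200, 200]

print("mean:", sum(latencies_ms) / len(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("error rate:", error_rate(statuses))
```

The mean of these samples is 350 ms, which describes none of the requests. The median (110 ms) and P95 (2500 ms) together show that most requests are fast and a few are very slow.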

Adding Alerting
Monitoring is useful.
Alerting is essential.

I integrated AlertManager so that operational issues could be detected automatically.

This forced me to think about:
Error thresholds
SLOs
Availability targets
Incident response

These were topics I had previously associated only with large companies.
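
An availability target becomes much more concrete once it is converted into an error budget. The Python sketch below does that conversion. The 99.9% target, the 30-day window, and the 10 minutes of downtime are example numbers, not SystemCraft's actual SLO.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


# Example: a 99.9% availability target over 30 days.
print(round(error_budget_minutes(99.9), 1))  # 43.2 minutes
# After a hypothetical 10-minute incident:
print(round(budget_remaining(99.9, downtime_minutes=10), 2))  # 0.77
```

An alert threshold can then be phrased in terms of how fast the budget is being spent rather than as an arbitrary error count.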

Testing Scalability
Eventually I wanted to understand how the platform behaved under load.
I simulated 500 concurrent users.

The results were revealing.

Single Container

| Metric | Value |
| --- | --- |
| Requests | 23,381 |
| Throughput | ~155 req/s |
| P95 Latency | 3.33s |

The Node.js process became saturated.
Performance degraded rapidly.

Kubernetes with HPA

| Metric | Value |
| --- | --- |
| Requests | 61,026 |
| Throughput | ~351 req/s |
| P95 Latency | 861ms |

Distributing traffic across multiple pods cut P95 latency from 3.33s to 861ms (roughly 74% lower), while throughput more than doubled, from ~155 to ~351 req/s.
This was the first time I could actually see the benefits of horizontal scaling in practice.
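
The scaling decision itself follows a simple documented rule: the Horizontal Pod Autoscaler multiplies the current replica count by the ratio of the observed metric to its target and rounds up. The Python sketch below applies that rule with a tolerance band and min/max clamping. The CPU figures and replica limits are hypothetical and are not the values used in this load test, and the real controller also accounts for stabilization windows and pod readiness.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical scenarios with a 50% average CPU utilization target.
print(desired_replicas(2, current_metric=90, target_metric=50))   # scale up
print(desired_replicas(4, current_metric=52, target_metric=50))   # within tolerance
print(desired_replicas(4, current_metric=20, target_metric=50))   # scale down
print(desired_replicas(5, current_metric=200, target_metric=50))  # capped at max
```

Under sustained load the first scenario is what happens repeatedly: each evaluation adds pods until the per-pod metric falls back near the target.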

Current Architecture


Today the deployment flow looks like this:
Developer
↓
GitHub
↓
GitHub Actions
↓
Docker Build
↓
Trivy Scan
↓
GHCR
↓
ArgoCD
↓
Kubernetes
↓
Prometheus
↓
Grafana
↓
AlertManager

What started as a simple web application became a complete cloud-native platform.

What I Learned
A few lessons stood out throughout this journey.

Containers are easy. Operations are hard.
Docker solves packaging. Production systems require much more.

Monitoring should come early.
You can't improve what you can't measure.

GitOps is powerful.
Having Git as the source of truth simplifies operations tremendously.

Security should be automated.
If security checks depend on humans, they eventually get skipped.

Scalability is best learned through experimentation.
Nothing teaches scaling better than watching a system fail under load and then improving it.

What's Next
The next phase of my learning journey involves:

AWS
Terraform
Infrastructure as Code
Distributed Load Testing
Platform Engineering

I'm currently building an open-source load testing tool called Loadster, inspired by the challenges I encountered while testing SystemCraft.

The goal is to create a Kubernetes-native load testing platform with built-in support for Prometheus, Grafana, and GitOps workflows.

Final Thoughts
SystemCraft started as a project to learn system design.
It unexpectedly became my gateway into DevOps, cloud-native infrastructure, observability, automation, and platform engineering.
Looking back, the most valuable lesson wasn't learning a specific tool.
It was understanding how all these tools work together to build reliable systems.
And honestly, that's where the real engineering begins.

Check out the live site: https://system-craft-kohl.vercel.app/

If you enjoyed the article, drop a like, and consider checking out the GitHub repo and contributing to help make it even better.

GitHub: https://github.com/Shashank0701-byte/System-Craft