From Backend Engineer to Building Production Infrastructure in AWS

Last year in September, I was laid off. By November, I'd landed a role at a startup with a responsibility I wasn't fully prepared for: build a reliable AWS infrastructure from scratch.
I'd only ever worked as a fullstack engineer. I knew the basics — VPCs, load balancers, containers — but not at the level where I'd feel confident deploying anything to production, let alone architecting an entire infrastructure with a real budget on the line.
But I took the job anyway. This is the story of what I did, what I got wrong, and what I learned along the way.

The Starting Point: Rewriting Before Deploying

Before I could deploy anything, there was a bigger problem. The existing backend was legacy JavaScript — no types, no structure, no clear separation of concerns. Deploying it as-is would just be putting broken code on expensive servers.

So I rewrote it in TypeScript.

I went with a 3-layer architecture: routes, controllers, and module functions. The goal was maintainability — whoever touches this codebase next should be able to find things without a treasure map. Most of the refactoring happened through Cursor, which made the migration significantly faster than doing it by hand.
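The layering itself is simple to sketch. Here's a minimal, illustrative version of the split (names are hypothetical, not the actual codebase, and there's no framework involved — just the shape of the three layers):

```typescript
// Layer 3: module function — pure business logic, no HTTP concerns
type Post = { id: number; title: string };
const posts: Post[] = [{ id: 1, title: "Hello" }];

function listPosts(): Post[] {
  return posts;
}

// Layer 2: controller — adapts module output into an HTTP-shaped response
function listPostsController(): { status: number; body: Post[] } {
  return { status: 200, body: listPosts() };
}

// Layer 1: route — maps a method + path to a controller
const routes: Record<string, () => { status: number; body: Post[] }> = {
  "GET /posts": listPostsController,
};

const res = routes["GET /posts"]();
console.log(res.status, res.body.length); // 200 1
```

The point is that each layer only knows about the one below it: routes never touch business logic, and module functions never see a request object.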

With the codebase stable, it was time to ship.

First Deployment: The ECS Experiment

The first service that needed to go live was a CMS — a monorepo running a backend with SQLite. My initial plan was to standardise on ECS for everything. It felt like the "right" way to do containers on AWS.

I containerised the backend, learned how ECR works, pushed the image, mounted an EBS volume for the SQLite database, and put an ELB in front of it. It took two full weekends. But it worked.

Then I got the message from management: the cost of ECS is too high.

For a simple CMS serving internal traffic, they were right. I was over-engineering it.

The Pivot: Right-Sizing the Infrastructure

I tore everything down and went back to research — blogs, documentation, conversations with LLMs to pressure-test ideas. The answer was simple once I stopped thinking in terms of "what's the most modern approach" and started thinking about "what does this workload actually need."

A CMS doesn't need container orchestration. It needs a box that runs.

Here's what I landed on:

  • Backend: EC2 t3.small — cheap, sufficient, easy to manage
  • Frontend: AWS Amplify — fast deploys, built-in CDN, no server to maintain
  • Database: SQLite on EBS with weekly backups to S3
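The weekly backup is the kind of thing a single cron entry can handle. A sketch, assuming the AWS CLI and the `sqlite3` binary are on the instance — paths and the bucket name are hypothetical:

```cron
# /etc/cron.d/cms-backup — hypothetical paths and bucket name
# Every Sunday at 03:00: take a safe snapshot of the live SQLite file,
# then push it to S3. (% must be escaped as \% in cron.)
0 3 * * 0 root sqlite3 /data/cms.db ".backup /tmp/cms-$(date +\%F).db" && aws s3 cp /tmp/cms-$(date +\%F).db s3://my-backup-bucket/cms/
```

Using SQLite's `.backup` command instead of plain `cp` matters: it takes a consistent snapshot even if the app is writing to the database at the time.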

Total cost: a fraction of the ECS setup. Same reliability for the workload. That was my first real win — not building the most impressive architecture, but building the right one.

The Docker Build Problem

With the infrastructure sorted, a new bottleneck appeared: CI/CD.

Every push triggered a Docker build, and each build was taking around 5 minutes. We were running a monorepo with pnpm, and the issue was layer caching — or rather, the lack of it. Every build was reinstalling every dependency from scratch.

The fix was multi-stage builds. By separating the dependency installation layer from the application code layer, Docker could cache the expensive pnpm install step and only rebuild what actually changed.
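The shape of that Dockerfile looks roughly like this — a sketch assuming a standard pnpm setup, with illustrative paths and versions:

```dockerfile
# Stage 1: dependencies — this layer is cached unless the lockfile changes
FROM node:20-alpine AS deps
RUN corepack enable
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile

# Stage 2: build — reruns only when source code changes
FROM node:20-alpine AS build
RUN corepack enable
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN pnpm build

# Stage 3: slim runtime image — ships only what's needed to run
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```

The key detail is copying only `package.json` and `pnpm-lock.yaml` before the install step: as long as dependencies haven't changed, Docker reuses the cached `pnpm install` layer and skips straight to the build.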

Build times dropped from 5 minutes to 20 seconds. Across 4 repositories. That's the kind of improvement that compounds — every developer, every push, every day.

Nightly API Testing Without a Dedicated Server

The next requirement was daily test runs — a nightly job that would hit all our APIs, run the test suite with Jest and Supertest, and generate a report for the team.

My first instinct was to spin up another EC2 instance. But that meant paying for a machine that sits idle 23 hours a day, plus managing its uptime, patching, and monitoring.

Then I discovered that GitHub Actions provides temporary VMs for exactly this kind of job. No infrastructure to manage. No idle costs. The workflow spins up, runs 100+ API tests in about 5 minutes, generates an HTML report, uploads it to S3, and notifies the admins.

A scheduled cron job in a YAML file replaced an entire server.
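The whole setup fits in one workflow file. A sketch — the repo layout, report path, secrets, and bucket name here are assumptions:

```yaml
name: nightly-api-tests
on:
  schedule:
    - cron: "0 2 * * *"   # every night at 02:00 UTC

jobs:
  api-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx jest --ci   # Jest + Supertest suite hitting the live APIs
      - name: Upload report to S3
        if: always()         # upload the report even when tests fail
        run: aws s3 cp ./report.html s3://my-reports-bucket/nightly/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

The VM exists only for the few minutes the job runs, and you pay nothing for the other 23-plus hours of the day.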

What I Actually Learned

Looking back at these four months, the technical skills were the easy part. Docker, ECS, EC2, Amplify, GitHub Actions — you can learn any of these in a weekend. The harder lessons were about decision-making:

Cost awareness changes how you architect. When you're spending someone else's money, every decision carries weight. "Best practice" doesn't mean anything if it's 10x the cost for a workload that doesn't need it.

The first solution is rarely the right one. ECS was a fine technology. It was the wrong choice for the problem. Being willing to tear something down after two weekends of work is a skill in itself.

Boring infrastructure is good infrastructure. An EC2 instance with a cron job isn't exciting. It's also not going to wake you up at 3am.

Optimise the inner loop. The Docker build improvement saved 4+ minutes per push across 4 repos. Over a team and a quarter, that's days of developer time recovered.


I'm not a DevOps engineer. I'm not a backend engineer either. Honestly, it's better to let go of that mindset entirely. Going forward, doing just one thing won't be enough. You're an engineer — you solve problems. The label doesn't matter; the willingness to figure it out does.

Months in and still going, I've learned that the gap between "knowing the theory" and "running it in production" is mostly filled with wrong first attempts and the willingness to start over.

If you're in a similar position — thrust into infrastructure work without a roadmap — my advice is simple: deploy something, get it wrong, and iterate. The cloud makes it cheap to experiment, and every teardown teaches you more than any tutorial.
