Beyond the Happy Path: Lessons in Resilience and Distributed State

#backend #architecture #python #devjournal

Reflecting on two major technical challenges from my backend engineering internship, focusing on fault tolerance, infrastructure, and distributed architectures.

Introduction

As I wrap up my HNG internship, I’ve been reflecting on the gap between code that "works on my machine" and code that survives in production. Here is a look at two tasks from Stage 9—one solo, one team-based—that completely changed how I approach backend engineering and infrastructure.

The Individual Task: Background Job Scheduler

What it was
For my individual Stage 9 task, I built a distributed background job scheduler: a FastAPI service backed by PostgreSQL, with a vanilla HTML/CSS/JS frontend. It manages async tasks (like a mock email sending queue) using a MinHeap priority queue, Directed Acyclic Graph (DAG) dependency resolution, a Dead-Letter Queue (DLQ), and a real-time Server-Sent Events (SSE) dashboard.
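To make the priority queue idea concrete, here is a minimal sketch. It is not the scheduler's actual code: it leans on Python's built-in heapq module rather than a hand-written MinHeap, and the names QueuedJob and JobQueue, plus the "lower number runs first" convention, are assumptions for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class QueuedJob:
    priority: int                       # assumed convention: lower runs sooner
    sequence: int                       # tie-breaker: insertion order
    job_id: str = field(compare=False)  # payload, ignored when comparing


class JobQueue:
    """Min-heap of jobs ordered by (priority, insertion order)."""

    def __init__(self) -> None:
        self._heap: List[QueuedJob] = []
        self._counter = itertools.count()

    def push(self, job_id: str, priority: int) -> None:
        entry = QueuedJob(priority, next(self._counter), job_id)
        heapq.heappush(self._heap, entry)

    def pop(self) -> str:
        # Removes and returns the job with the smallest (priority, sequence).
        return heapq.heappop(self._heap).job_id

    def __len__(self) -> int:
        return len(self._heap)
```

The sequence counter keeps jobs with equal priority in first-in, first-out order, which a bare heap does not guarantee on its own.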

The problem it was solving
Heavy asynchronous tasks, like email generation or batch processing, must not block the main API thread. The system needed to queue, prioritize, retry, and track every job independently of the standard request-response cycle.

How I approached it
I built the core logic from the ground up: a MinHeap and an alternative Timing Wheel algorithm for scheduling, a worker engine with a 3-attempt backoff sequence (1s, 5s, 25s with jitter), a DAG dependency checker, and a starvation daemon so that waiting tasks don't sit in the queue indefinitely. Once the CRUD API and SSE streaming were hooked up, I containerized the entire application with Docker and wrote my deploy scripts. I thought I was done.
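The backoff sequence is easier to picture in code. This is a rough sketch under stated assumptions, not the worker engine itself: it treats the three delays as waits before three retries that follow the first attempt, assumes jitter of up to 10% added on top of each base delay, and uses a plain list as a stand-in for the DLQ. The function names are hypothetical.

```python
import random
import time
from typing import Callable, List, Optional

# Base waits before each retry, taken from the post: 1s, 5s, 25s.
BASE_DELAYS = [1.0, 5.0, 25.0]


def backoff_delay(retry_index: int, jitter: float = 0.1) -> float:
    """Base delay plus up to `jitter` (assumed 10%) of random extra wait."""
    base = BASE_DELAYS[retry_index]
    return base + random.uniform(0.0, base * jitter)


def run_with_retries(
    job: Callable[[], str],
    dead_letters: List[str],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `job`; on failure retry after each delay, then dead-letter it."""
    last_error: Optional[Exception] = None
    for attempt in range(len(BASE_DELAYS) + 1):
        if attempt > 0:
            sleep(backoff_delay(attempt - 1))
        try:
            return job()
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
    # Every attempt failed: park the error in the stand-in dead-letter list.
    dead_letters.append(repr(last_error))
    return None
```

The jitter matters because without it, jobs that failed together would all retry at the same instant and hit the struggling dependency in lockstep. Passing sleep as a parameter keeps the function testable without real waiting.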

What actually broke and how I fixed it
The application code took hours. The deployment took a full day of non-stop debugging across multiple cloud providers.

Oracle Cloud was out of capacity on every free tier shape, and GCP demanded upfront payment. I finally got a t3.micro running on AWS, but that’s when the real DevOps nightmare began:

The SSL Chicken-and-Egg: My Nginx config referenced SSL certificates that didn't exist yet, which prevented Nginx from starting. But Certbot couldn't run to fetch the certs because Nginx wouldn't start. I had to strip the SSL config, run Certbot in pure HTTP mode, and then let Certbot rewrite the config.
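Nginx Misconfiguration: I initially copied a full nginx.conf (complete with an events block) into sites-available, which Nginx rejected. Site files get included inside the http block of the main config, where top-level directives like events aren't allowed, so I had to rewrite it as a standard server block.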
dpkg Locks & The OOM Killer: The AWS t3.micro's 1GB of RAM meant the Linux Out-Of-Memory (OOM) killer kept terminating apt during installations. Worse, an unattended background system update corrupted my docker-compose-v2 plugin and held the dpkg lock hostage. I had to force-kill the processes, clear the lock files, and rebuild the package database manually.
Docker Compose Overrides: I spent way too long debugging why jobs weren't failing during my demo, only to realize the environment block in my Compose file was taking precedence over my local .env, pinning values like EMAIL_FAILURE_RATE=0.0.
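One cheap guard against that last problem is to have the app report the configuration it actually received when it starts. The sketch below is an illustration, not code from the project: EMAIL_FAILURE_RATE comes from the post, while the function names, the 0.0 default, and the 0 to 1 range check are assumptions.

```python
from typing import Mapping, Tuple


def read_failure_rate(
    env: Mapping[str, str], default: float = 0.0
) -> Tuple[float, str]:
    """Return (rate, source) so the origin of the value shows up in logs."""
    raw = env.get("EMAIL_FAILURE_RATE")
    if raw is None:
        return default, "default"
    rate = float(raw)
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"EMAIL_FAILURE_RATE must be in [0, 1], got {raw!r}")
    return rate, "environment"


def startup_banner(env: Mapping[str, str]) -> str:
    """One line to log at boot, showing the value the process really sees."""
    rate, source = read_failure_rate(env)
    return f"EMAIL_FAILURE_RATE={rate} (from {source})"
```

Calling startup_banner(os.environ) once at boot and logging the result makes a mismatch obvious: if the container prints 0.0 while the local .env says something else, the value is being set somewhere other than where you are editing it.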

What I took away from it
Platform-as-a-Service tools like Railway abstract away so much that you never actually learn how networking, reverse proxies, SSL certificates, DNS propagation, or Linux package management work under the hood. Stripping those abstractions away is incredibly painful, but you come out understanding the full stack from the browser all the way to the database connection string.

The code was the easy part. Getting it to serve HTTPS on a real domain through a reverse proxy on a $0 cloud server with a broken package manager was where I learned everything that actually matters.

The Team Task: MeetMind (An Interview Assistant)

What it was
For the team project, we collaborated on MeetMind, an AI-powered interview assistant. It can conduct live interviews independently or assist a human interviewer in real time. After the interview, it generates a comprehensive summary, a candidate scorecard, and general performance insights. It also lets the interviewer query the AI about specific moments in the conversation.

The problem it was solving
MeetMind relied heavily on external LLMs to function. I was responsible for building the AI Service Engine and the Notification System. The core challenge was that external AI APIs are inherently flaky—they rate limit, time out, or throw 500 errors.

How we approached it
To prevent a third-party outage from crashing an active interview, I implemented a multi-tiered fallback and retry engine for all external API calls.

What broke and how I fixed it
During testing, a single API outage would cascade and fail the entire interview generation process. I fixed this by building a resilient 3-step routing protocol. Every request went to Google Gemini first, and a retryable error triggered a retry against the same provider. If Gemini still failed after its retries, the engine fell back to OpenRouter. If OpenRouter failed too, it routed to Groq as a final safety net.
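The routing logic can be sketched in a few lines. This is an illustration under assumptions, not MeetMind's engine: providers are plain callables standing in for the real Gemini, OpenRouter, and Groq clients, RetryableError and AllProvidersFailed are hypothetical exception types, and two retries per provider is a made-up number.

```python
from typing import Callable, List, Sequence, Tuple


class RetryableError(Exception):
    """Hypothetical marker for rate limits, timeouts, and 5xx responses."""


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has exhausted its attempts."""


Provider = Tuple[str, Callable[[str], str]]


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    retries_per_provider: int = 2,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(1 + retries_per_provider):
            try:
                return name, call(prompt)
            except RetryableError as exc:
                # Record the failure, then retry or move to the next provider.
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Only RetryableError is caught here, so anything else (a malformed request, for example) surfaces immediately instead of being replayed against three providers. A production version would also wait between retries rather than firing them back to back.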

What I took away from it
The more third-party integrations you have, the more points of failure you introduce into your system. You have to treat external dependencies as hostile and engineer guardrails around them.

Final Thoughts

This internship was a trial by fire in the best way possible. Moving from building simple APIs to engineering robust systems has completely changed my trajectory as a backend developer. If there’s one overarching takeaway from the last few months, it’s this: The "happy path" is an illusion. Engineer for the edge cases.