Afeh

Posted on Jun 13

Engineering Resilience: Two Lessons from Building Under Pressure

#api #career #performance #softwareengineering

A reflection on performance optimization at scale and on building reliability mechanisms: the two tasks that defined my internship.

Every engineering internship has its share of "aha" moments: the late-night debugging session where a breakthrough finally clicks, or the PR that takes seven commits to get right. As I wrap up my time as an intern with HNG, I want to write about two tasks that stuck with me, not because they were the hardest, but because they taught me something real about building systems that have to work.

One was an individual task: optimizing a demographic intelligence API to handle millions of records with sub-second query times. The other was a team effort: building reliability mechanisms into an AI-powered interview platform so that when things break (and they will), the system degrades gracefully instead of falling apart.

Let's walk through both.

Part I: Working on Insighta (Individual Task — Stage 4B)

What it was

Insighta IQ is a demographic intelligence API: think "find me Nigerian females aged 20 to 45." I nicknamed it Stereo API ("Stereo" being short for stereotyping, of course). Users query a PostgreSQL database of millions of demographic profiles through a FastAPI backend, via both CLI and web clients.

Stage 4 asked us to make it perform under serious assumptions:

Data growth: from millions toward tens of millions of profiles
Traffic: hundreds to low thousands of queries per minute, multiple teams using it daily
Performance targets: P50 latency under 500ms, P95 under 2 seconds

The task had three parts: query optimization, query normalization, and large-scale CSV ingestion. The CSV ingestion piece was the one that kept me up at night.

The problem

Users needed to upload CSV files with up to 500,000 rows of profile data. The constraints that came with that weren't trivial:

You cannot insert rows one by one (500k individual inserts = death by a thousand round trips)
You cannot load the entire file into memory (500k rows of profile data can easily exceed available RAM)
Uploads must not block ongoing query traffic (the system can't go dark during ingestion)
The system must handle concurrent uploads
A single bad row must never fail the entire upload

On top of that, the database is hosted remotely, so every query, including every INSERT, incurs network latency. And we were already under read pressure from the query workload.

How I approached it

I broke the problem into layers:

Streaming, not loading: Read the file in 256KB chunks, never hold the whole thing in memory
Batch processing, not row-by-row: Validate rows, accumulate 10,000 valid ones, then bulk-insert
Dedicated thread pool: Run the CPU+DB-bound CSV work off the event loop so the API stays responsive
Background task for the upload: Return a success response right away so the user knows the upload is being processed in the background, instead of holding the request open until ingestion finishes
Partial failure tolerance: Each batch commits independently; failures don't roll back valid work
Comprehensive skip reporting: Return a structured summary of exactly what went wrong and why
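To make the thread pool layer concrete, here is a minimal sketch using only the standard library. It is not the project's code: INGEST_POOL, ingest_blocking, and start_ingest are hypothetical names, and the sleep stands in for the real parse-and-insert work.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool reserved for ingestion, kept small so uploads
# cannot take over every worker thread in the process.
INGEST_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="csv-ingest")


def ingest_blocking(row_count: int) -> int:
    """Stand-in for the blocking CSV parsing and database work."""
    time.sleep(0.05)
    return row_count


async def start_ingest(row_count: int) -> int:
    """Run the blocking work on the dedicated pool, off the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(INGEST_POOL, ingest_blocking, row_count)
```

While start_ingest is awaiting, the event loop is free to serve other requests, which is the whole point of moving the work off it.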

Here's what the ingestion pipeline looked like:

File upload → Chunked read (256KB) → Decode (UTF-8 → Latin-1 fallback)
  → CSV DictReader (streaming) → Validate row → Batch (10k rows)
    → Deduplicate in-batch → Single DB query for existing names
      → Bulk INSERT (ON CONFLICT DO NOTHING) → Report summary
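The last two stages of that pipeline (bulk insert and summary) can be sketched as follows. This is an illustration under stated assumptions, not the real implementation: it uses the standard library's sqlite3 as a stand-in for PostgreSQL, a hypothetical two-column profiles table, and made-up names (IngestSummary, flush_batch). The ON CONFLICT ... DO NOTHING clause needs SQLite 3.24 or newer.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Optional

INSERT_SQL = (
    "INSERT INTO profiles (name, age) VALUES (?, ?) "
    "ON CONFLICT(name) DO NOTHING"
)


@dataclass
class IngestSummary:
    inserted: int = 0
    skipped: dict[str, int] = field(default_factory=dict)

    def skip(self, reason: str, count: int = 1) -> None:
        if count > 0:
            self.skipped[reason] = self.skipped.get(reason, 0) + count


def flush_batch(
    conn: sqlite3.Connection,
    batch: list[tuple[str, Optional[int]]],
    summary: IngestSummary,
) -> None:
    """Insert one batch in its own transaction."""
    try:
        # `with conn` commits on success and rolls back on error,
        # so a failure here only undoes the current batch.
        with conn:
            cur = conn.executemany(INSERT_SQL, batch)
    except sqlite3.Error:
        summary.skip("batch_failed", len(batch))
        return
    summary.inserted += cur.rowcount
    # Rows ignored by ON CONFLICT are not counted in rowcount.
    summary.skip("duplicate_name", len(batch) - cur.rowcount)
```

Because rows are validated before they reach a batch, a database error at this stage is treated as a batch-level failure and recorded in the summary rather than raised.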

What broke and how I fixed it

The first approach was naive. I read the entire CSV into memory, parsed every row, and then tried to insert everything at once. For a 100KB test file, it worked beautifully. For the 500,000-row file? Memory exploded.

Fix: I switched to streaming with csv.DictReader over a StringIO buffer. But I still needed to validate rows against each other (duplicate names within the same upload) and against the database. That meant holding state across batches of 10,000 rows.
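A minimal sketch of the streaming idea, under a few assumptions: it feeds csv.DictReader from a generator of decoded lines rather than a StringIO buffer, it leaves out the Latin-1 fallback, and iter_lines and iter_batches are hypothetical names.

```python
import codecs
import csv
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 256 * 1024  # 256KB per read
BATCH_SIZE = 10_000      # rows per batch


def iter_lines(
    stream: BinaryIO, chunk_size: int = CHUNK_SIZE, encoding: str = "utf-8"
) -> Iterator[str]:
    """Yield decoded lines while holding roughly one chunk in memory."""
    # The incremental decoder copes with multi-byte characters that
    # happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        lines = pending.split("\n")
        pending = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line + "\n"
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending


def iter_batches(
    stream: BinaryIO, batch_size: int = BATCH_SIZE, chunk_size: int = CHUNK_SIZE
) -> Iterator[list[dict[str, str]]]:
    """Group parsed rows into lists of at most batch_size rows."""
    batch: list[dict[str, str]] = []
    for row in csv.DictReader(iter_lines(stream, chunk_size)):
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Only the current chunk and the current batch are ever in memory, regardless of file size.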

The second problem was deduplication within a single upload. Without it, two rows with the same name in the same upload would both pass validation, only to conflict during insert. The fix was global_seen: Set[str], a set of all names already inserted in this upload session, passed between batch flushes.
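Roughly, the shared set works like this. The sketch assumes names are compared case-insensitively after trimming, and it records a name as soon as the row is accepted into a batch; dedupe_batch is a hypothetical helper, not the project's function.

```python
from typing import Iterable


def dedupe_batch(
    rows: Iterable[dict], global_seen: set[str]
) -> tuple[list[dict], int]:
    """Drop rows whose name was already seen in this upload.

    Returns the surviving rows and the number of rows skipped.
    """
    unique: list[dict] = []
    skipped = 0
    for row in rows:
        key = row["name"].strip().lower()  # assumed normalization
        if key in global_seen:
            skipped += 1
            continue
        global_seen.add(key)
        unique.append(row)
    return unique, skipped
```

Passing the same set to every call is what lets a duplicate in batch 7 be caught against a name from batch 1.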

The third problem was edge cases in CSV data I never anticipated:

Rows with extra commas producing wrong column counts
UTF-16 encoded files that failed UTF-8 decoding
Age values like "twenty-five" instead of "25"
Country codes in varying cases ("ng", "NG", "Ng")

Each edge case got its own validator function. The _parse_row function became a gauntlet of pure functions — no database calls, just deterministic validation. If a row failed any check, it was counted and skipped, but processing continued.
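As an illustration of that style, here is a small sketch of pure validators. The column names, skip reasons, age range, and helper names are all assumptions made for the example, not the actual _parse_row.

```python
from typing import Optional

EXPECTED_COLUMNS = ("name", "gender", "age", "country_id")  # assumed columns


def parse_age(raw: str) -> Optional[int]:
    """Accept plain ASCII digits only, so 'twenty-five' is rejected."""
    raw = raw.strip()
    if not (raw.isascii() and raw.isdigit()):
        return None
    age = int(raw)
    return age if 0 <= age <= 120 else None


def parse_country(raw: str) -> Optional[str]:
    """Normalize 'ng', 'Ng', 'NG' to 'NG'."""
    code = raw.strip().upper()
    if len(code) == 2 and code.isascii() and code.isalpha():
        return code
    return None


def parse_row(row: dict) -> tuple[Optional[dict], Optional[str]]:
    """Return (clean_row, None) on success or (None, skip_reason)."""
    # csv.DictReader stores surplus cells under the None key and fills
    # missing cells with None, so both reveal a wrong column count.
    if None in row or any(row.get(col) is None for col in EXPECTED_COLUMNS):
        return None, "wrong_column_count"
    name = row["name"].strip()
    if not name:
        return None, "missing_name"
    age = parse_age(row["age"])
    if age is None:
        return None, "invalid_age"
    country = parse_country(row["country_id"])
    if country is None:
        return None, "invalid_country"
    clean = {
        "name": name,
        "gender": row["gender"].strip().lower(),
        "age": age,
        "country_id": country,
    }
    return clean, None
```

Returning a reason string instead of raising keeps the loop moving and gives the skip report something to count.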

What I took away

Batch everything. Validate early. Never trust user input.

The batch size of 10,000 wasn't arbitrary. Below 5,000, the overhead of DB round trips dominated. Above 20,000, memory pressure started climbing without significant throughput gains. 10,000 was the sweet spot.

The pattern that emerged (stream, validate, batch, insert, report) is something I now see everywhere: ETL pipelines, message queues, log processors. It's a universal pattern for handling lots of data with limited resources.

The caching and query optimization work (indexes on gender, age, country_id, composite indexes, connection pooling, result caching with 5-minute TTL) brought the query side in line with our P50/P95 targets. But the CSV ingestion piece — that's what I remember most vividly, because it forced me to think about resource constraints, failure modes, and graceful degradation all at once.
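For the result cache, the idea can be sketched in a few lines. This is a simplified in-process version under assumptions of my own: filters arrive as a dict, normalization means lowercasing and dropping unset values, and cache_key and TTLCache are hypothetical names.

```python
import json
import time
from typing import Any, Callable, Optional

TTL_SECONDS = 300  # 5 minutes


def cache_key(filters: dict[str, Any]) -> str:
    """Map equivalent queries to the same key."""
    normalized = {
        key.lower(): (value.strip().lower() if isinstance(value, str) else value)
        for key, value in filters.items()
        if value is not None
    }
    return json.dumps(normalized, sort_keys=True)


class TTLCache:
    def __init__(
        self,
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired, drop it lazily
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)
```

Normalizing before hashing is what makes the cache useful: two clients asking the same question with different casing or parameter order hit the same entry.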

Why I picked this

The CSV ingestion task wasn't just about writing code. It was about engineering for constraints: memory, concurrency, consistency, partial failure. Every decision (batch size, decoding fallback strategy, ON CONFLICT DO NOTHING vs pre-query) was a trade-off, and I had to justify each one. That process of articulating why a decision is right, not just that it works, is what I think separates engineering from coding.

Part II: Building Reliability Into an AI Interview Platform (Team Task)

What it was

MeetMind is an AI-powered interview platform. Candidates join live audio interviews with an AI interviewer, and the system records transcripts, generates assessments, sends emails, and provides a chat interface for recruiters to query interview data. I really had a lot of fun building this with my teammates (though I nearly lost my mind a couple of times).

The catch? It depends on several third-party APIs:

Gemini (Google) for generating interview questions, assessments, and extracting resume information
Resend for transactional email delivery (welcome emails, password resets, interview invites)
LiveKit for real-time audio/video sessions

Any of these could fail at any moment from network blips, rate limits, service outages, or transient errors. Before this task, a failed API call meant a 500 error and a frustrated user.

My PR to address that issue added two reliability mechanisms:

Retry with exponential backoff — a centralized retry_async utility that wraps any async function with automatic retries, logging, and backoff
Transcript fallback — when individual transcript turns are missing in the database, fall back to the session's stored transcript JSON

The problem

Two concrete failure scenarios:

Scenario A: A candidate finishes an interview, and the system fires off a background task to generate an assessment summary via Gemini. Halfway through, Gemini returns a 429 (rate limit). The assessment fails. The summary is stuck in "generating" status. The recruiter sees a blank report.

Scenario B: Real-time interview transcript turns are stored one by one as the AI interviewer and candidate speak. If the persistence pipeline drops a few turns — or fails entirely — the transcript is incomplete. The recruiter can't review the interview. The AI can't generate an assessment.

Both scenarios had the same root cause: no mechanism for transient failure recovery.

How I approached it

For the retry mechanism, I wanted something that:

Works with any async function (generic, reusable)
Uses exponential backoff (don't hammer the failing service)
Logs warnings on intermediate failures, errors on exhaustion
Accepts configurable retry count, delay, backoff factor, and exception types
Integrates with existing code without massive refactoring

The signature was clean:

async def retry_async(
    func: Callable[..., T],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:

For the transcript fallback, the pattern was: try the primary data source first, and if unavailable, reconstruct from the session's transcript_json field. The fallback had to be transparent to callers — get_chat_history, get_transcript, and get_transcript_export all work the same way regardless of which data source backs them.
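In outline, the fallback looks something like the sketch below. The shape of transcript_json (a JSON list of objects with speaker and text keys) and the load_turns helper are assumptions for illustration; the real schema may differ.

```python
import json
from typing import Optional


def load_turns(
    db_turns: list[dict], transcript_json: Optional[str]
) -> tuple[list[dict], str]:
    """Return (turns, source), preferring per-turn rows from the database."""
    if db_turns:
        return db_turns, "turns_table"
    if not transcript_json:
        return [], "empty"
    try:
        stored = json.loads(transcript_json)
    except json.JSONDecodeError:
        return [], "empty"
    if not isinstance(stored, list):
        return [], "empty"
    # Rebuild turns in the same shape callers get from the primary path.
    turns = [
        {"speaker": item.get("speaker", "unknown"), "text": item.get("text", "")}
        for item in stored
        if isinstance(item, dict)
    ]
    return turns, "session_json"
```

Callers only ever look at the list of turns; the source label exists for logging.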

What broke and how I fixed it

The first bug was an off-by-one in the backoff calculation. I had delay *= backoff_factor happening before the sleep, so the first retry waited 2x the initial delay instead of the initial delay itself. It felt trivial, but it meant every retry waited twice as long as intended: with a 1s initial delay, that is 2s, 4s, 8s instead of 1s, 2s, 4s.

Fix: Moved the delay *= backoff_factor to after the sleep.

The second issue was exception type granularity. Initially, the retry caught all Exception subclasses. But some exceptions shouldn't be retried — like ValueError from bad user input, or KeyError from a missing dictionary key. A retry won't fix a programming error.

Fix: Made the exceptions tuple a parameter so callers can specify which exceptions are retryable. For Gemini calls, we retry (Exception,) since the API can fail for many transient reasons. For email delivery, we retry the Resend-specific exception.
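Here is a minimal sketch of how those two fixes could fit behind that signature. It is an illustration, not the project's code: it assumes max_retries counts total attempts, the log wording is made up, and func is typed as returning an awaitable, which is slightly stricter than the signature shown earlier.

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def retry_async(
    func: Callable[..., Awaitable[T]],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:
    delay = initial_delay
    for attempt in range(1, max_retries + 1):
        try:
            return await func(*args, **kwargs)
        except exceptions as exc:
            # Anything not listed in `exceptions` propagates immediately.
            if attempt == max_retries:
                logger.error("%s failed after %d attempts: %s", task_name, attempt, exc)
                raise
            logger.warning(
                "Attempt %d/%d failed for %s: %s. Retrying in %.2fs...",
                attempt, max_retries, task_name, exc, delay,
            )
            await asyncio.sleep(delay)
            delay *= backoff_factor  # grow the delay only after sleeping
    raise RuntimeError("max_retries must be at least 1")
```

With the defaults, the waits between attempts are 2s and then 4s. A ValueError raised by func when exceptions=(ConnectionError,) is not retried at all.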

The third issue was the transcript fallback format mismatch. The session's transcript_json stored timestamps as Unix seconds, but the transcript endpoints expected them formatted as "HH:MM:SS" elapsed time. The fallback function needed to reproduce the same relative timestamp calculation that the primary path used.

Fix: Created _format_elapsed_timestamp as a shared utility and used it in both the primary and fallback paths. The fallback also had to generate deterministic UUIDs for each fallback turn (using uuid.uuid5 with a namespace) since there were no database IDs available.
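A sketch of what those two helpers might look like. The function signatures, the namespace value, and the session_id:index naming scheme are assumptions for the example; only uuid.uuid5 and the "HH:MM:SS" format come from the description above.

```python
import uuid

# Hypothetical namespace. Any fixed UUID works as long as it never changes,
# because the same (namespace, name) pair always yields the same uuid5.
FALLBACK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "meetmind/transcript-fallback")


def format_elapsed_timestamp(turn_ts: float, session_start_ts: float) -> str:
    """Convert a Unix timestamp into HH:MM:SS since the session started."""
    elapsed = max(0, int(turn_ts - session_start_ts))
    hours, remainder = divmod(elapsed, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fallback_turn_id(session_id: str, index: int) -> uuid.UUID:
    """Stable ID for the index-th turn rebuilt from the session JSON."""
    return uuid.uuid5(FALLBACK_NAMESPACE, f"{session_id}:{index}")
```

Deterministic IDs matter because the fallback may run on every request: a client that saw a turn ID once should see the same ID the next time.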

What I took away

Graceful degradation > perfect failure. The goal isn't to never fail — it's to fail in a way that doesn't cascade into user-facing errors. The retry mechanism handles transient failures silently. The fallback mechanism ensures that even if the real-time pipeline drops data, users can still review interview transcripts. The user never needs to know something went wrong.

I also learned that reliability is visible in the logs. After the retry mechanism was deployed, we stopped seeing "Assessment generation failed" errors in Sentry. Instead, we saw "Attempt 1/3 failed for Generate assessment... Retrying in 2.00s..." — which is an entirely different class of log. It means the system handled the failure rather than succumbing to it.

Why I picked this

This task taught me that reliability isn't a feature you add at the end; it's a design philosophy that affects how you structure every external dependency call. The retry_async utility now wraps every Gemini generation, every email send, every document embedding. It's invisible infrastructure, but it's the difference between a system that occasionally returns 500s and one that absorbs transient failures and moves on. Building fault-tolerant systems matters.

And the transcript fallback taught me something subtler: data can come from multiple sources, and that's okay. The real-time turns are the ideal data source. But the session JSON is always there as a safety net. Engineering for multiple data paths — with clear fallback semantics — makes the system more robust without adding complexity to the API surface.

Bringing It Together

Looking back, both tasks taught me the same lesson from different angles:

Design for failure, optimize for reality.

The CSV ingestion system assumes every row could be bad and handles it gracefully without stopping. The retry mechanism assumes every API call could fail and recovers transparently. The transcript fallback assumes the primary data path could be incomplete and provides an alternative.

Systems that work under pressure aren't the ones that never break. They're the ones that break gracefully, recover quickly, and leave useful evidence behind when they can't.

That's what I'll carry forward from this internship.
