From Cache Keys to Concurrency

Ali Agboola — Sat, 13 Jun 2026 11:39:21 +0000

The final stage of HNG is a writing task: pick two pieces of work from the internship, one solo and one team, and write about whichever ones stuck. Mine are Stage 4, an optimization pass on an API I'd already built, and Stage 6, the queue infrastructure behind a team product called Flowbrand. Neither is here because it went well. Each one cost me an assumption I didn't know I was carrying, which is a better reason to write than a clean scoreboard.

Stage 4: the difference between working and holding up

By Stage 4 I had a NestJS API sitting on a database seeded with demographic profiles. Filtering, sorting, pagination, a natural-language search endpoint. All of it worked. Stage 4 then asked the question every stage before it had politely avoided: what happens when people actually use this?

Three deliverables: faster queries, consistent cache keys, and a CSV ingestion endpoint that could take files of up to 500,000 rows without flattening the server.

The slowness came first. A query like gender=male&country_id=NG&age_group=adult against a million-row Postgres table over a remote connection took about 1.6 seconds. All three columns had single-column indexes, so I assumed the database was covered. It wasn't. Postgres can't combine three separate indexes into a single lookup; it runs a bitmap scan on each one and intersects the results in memory, and at a million rows the intersection is where the time goes.

The caching problem was quieter. Redis was in place, but gender=male&country_id=NG and country_id=NG&gender=male serialized to different cache keys. Two requests asking the same question, two trips to the database.

The CSV problem was plain arithmetic. Inserting 500,000 rows one at a time is 500,000 network round trips; at 5ms each, that's about forty minutes. Buffer the whole file in memory instead and a 50MB upload becomes a 50MB heap spike, multiplied by every concurrent upload.

What I built

Composite indexes went in first because they required no application changes:

@@index([gender, country_id])
@@index([gender, age_group, country_id])

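Order matters here. gender leads because every query in the benchmark below filters on it, and Postgres uses a multicolumn B-tree most efficiently when the query constrains the index's leading columns. With that in place, a composite index lets Postgres walk one pre-sorted structure instead of doing three scans and stitching them together afterwards.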

Then the cache keys, before touching any Redis logic. A cache that sometimes misses is a nuisance; a cache that returns results for the wrong query is a hazard, and inconsistent keys are exactly how you get there. So every filter object is normalized before anything else happens: strings lowercased, keys sorted alphabetically, numerics coerced to consistent precision, undefined fields stripped, then serialized.

function buildCacheKey(filters: ProfileQueryFilters): string {
  const normalized = normalizeFilters(filters);
  const sorted = Object.keys(normalized)
    .sort()
    .reduce((acc, key) => ({ ...acc, [key]: normalized[key] }), {});
  return `profiles:${JSON.stringify(sorted)}`;
}
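normalizeFilters is not shown above. A minimal sketch of the rules just described could look like the following. The filter shape, the two-decimal rounding, and the helper body are all assumptions made for illustration, not the project's actual code.

```typescript
// Hypothetical filter shape; the real ProfileQueryFilters may differ.
type FilterValue = string | number | boolean | undefined;
type ProfileQueryFilters = { [field: string]: FilterValue };
type NormalizedFilters = { [field: string]: string | number | boolean };

// Assumed meaning of "consistent precision" for numeric filters.
const DECIMALS = 2;

function normalizeFilters(filters: ProfileQueryFilters): NormalizedFilters {
  const out: NormalizedFilters = {};
  // Inserting keys in sorted order keeps the JSON output stable.
  for (const key of Object.keys(filters).sort()) {
    const value = filters[key];
    if (value === undefined) continue; // strip unset fields
    if (typeof value === "string") {
      out[key] = value.toLowerCase(); // "NG" and "ng" become one key
    } else if (typeof value === "number") {
      out[key] = Number(value.toFixed(DECIMALS)); // 30 and 30.001 collapse
    } else {
      out[key] = value;
    }
  }
  return out;
}

const keyA = JSON.stringify(normalizeFilters({ gender: "Male", country_id: "NG" }));
const keyB = JSON.stringify(
  normalizeFilters({ country_id: "ng", gender: "male", age_group: undefined }),
);
console.log(keyA === keyB); // true
```

Both orderings serialize to the same string, which is the property the cache key depends on.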

The cache itself is plain cache-aside. Check Redis, return on a hit, query and store on a miss with a five-minute TTL, invalidate on writes. Five minutes because analysts repeat queries within a session, and slightly stale analytics is a trade nobody notices.
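The read path of cache-aside fits in one small function. This is a sketch rather than the service's code: KeyValueCache, cacheAside, and MemoryCache are hypothetical names, and the in-memory class only stands in for Redis so the example runs on its own.

```typescript
// Minimal cache interface; a Redis client would sit behind the same
// two methods (assumption).
interface KeyValueCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const FIVE_MINUTES = 5 * 60;

async function cacheAside<T>(
  cache: KeyValueCache,
  key: string,
  load: () => Promise<T>,
  ttlSeconds: number = FIVE_MINUTES,
): Promise<T> {
  const hit = await cache.get(key);
  if (hit !== null) return JSON.parse(hit) as T; // hit: skip the database
  const fresh = await load(); // miss: run the real query
  await cache.set(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}

// In-memory stand-in used only to exercise the function.
class MemoryCache implements KeyValueCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null;
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}
```

Invalidation on writes is left out of the sketch; it would delete the affected keys wherever profiles are created or changed.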

For the CSV endpoint, the file never sits in the Node heap. Multer's diskStorage writes the upload to /tmp, the service opens a readline interface over a read stream, and valid rows collect in a 1,000-row buffer that flushes through createMany with skipDuplicates: true. Each chunk commits on its own, and there is deliberately no transaction around the import. If the process dies at row 400,000, those rows stay committed, and because duplicates are skipped, re-running the same file is safe. Rolling back 400,000 good inserts because row 400,001 crashed helps nobody.
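The buffer-and-flush part of that pipeline can be sketched without the file handling. ChunkBuffer is a hypothetical name, and the flush callback stands in for the createMany call; nothing here is taken from the actual service.

```typescript
// Collects rows and hands them to `flush` in fixed-size batches.
// Each batch is written independently, so there is nothing to roll back.
class ChunkBuffer<Row> {
  private rows: Row[] = [];
  private readonly size: number;
  private readonly flush: (batch: Row[]) => Promise<void>;

  constructor(size: number, flush: (batch: Row[]) => Promise<void>) {
    this.size = size;
    this.flush = flush;
  }

  async add(row: Row): Promise<void> {
    this.rows.push(row);
    if (this.rows.length >= this.size) await this.drain();
  }

  // Call once after the last line so the final partial batch is written.
  async end(): Promise<void> {
    if (this.rows.length > 0) await this.drain();
  }

  private async drain(): Promise<void> {
    const batch = this.rows;
    this.rows = []; // start a fresh buffer before awaiting the write
    await this.flush(batch);
  }
}
```

With readline, the caller would await buffer.add(...) for each valid line inside a for await loop over the interface, then call buffer.end() once the stream closes. Memory stays bounded by the chunk size rather than the file size.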

What broke

The bug that taught me the most was in my own normalization. I had put it inside buildCacheKey, which meant the cache saw normalized filters while the database received the raw ones. Same request, two different shapes. It mostly worked anyway, because Postgres happened to be case-insensitive on that column, which is the worst kind of bug: one that passes by accident. The fix was boring and correct. Normalization became the first thing findAllProfiles() does, before the cache check and before the query, so both sides see the same object.

The other adjustment was chunk size. I started flushing every 100 rows, which on a 500k file means 5,000 round trips. Too many. Moving to 1,000-row chunks cut it to 500 and kept memory per chunk reasonable.
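The arithmetic behind those numbers is short enough to write down. This sketch reuses the 5ms round-trip figure from earlier and counts network latency only; it ignores the fact that a larger batch takes longer to insert.

```typescript
const TOTAL_ROWS = 500000;
const ROUND_TRIP_MS = 5; // per-request latency assumed earlier in the post

function roundTrips(rows: number, chunkSize: number): number {
  return Math.ceil(rows / chunkSize);
}

function minutesOnTheWire(rows: number, chunkSize: number): number {
  return (roundTrips(rows, chunkSize) * ROUND_TRIP_MS) / 60000;
}

for (const chunkSize of [1, 100, 1000]) {
  console.log(
    chunkSize,
    roundTrips(TOTAL_ROWS, chunkSize),
    minutesOnTheWire(TOTAL_ROWS, chunkSize).toFixed(2),
  );
}
```

Row-at-a-time comes out at 500,000 trips and roughly 41.7 minutes, 100-row chunks at 5,000 trips and about 25 seconds, and 1,000-row chunks at 500 trips and about 2.5 seconds.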

The before and after:

Query	Before	After (cache miss)	After (cache hit)
`gender=male`	820ms	310ms	12ms
`gender=male&country_id=NG`	1,240ms	390ms	14ms
`gender=male&country_id=NG&age_group=adult`	1,680ms	420ms	11ms

What stayed with me, in order of how often I've repeated it to myself since: EXPLAIN ANALYZE beats assumptions. I was sure my single-column indexes were earning their keep until the plan showed three bitmap scans and an in-memory intersection over 70,000 rows. Cache key normalization is a correctness requirement, not a performance tweak, and before this stage I had never once thought about it. And partial-failure behaviour in a bulk operation is a decision to make deliberately at design time, not something to discover during an incident.

I picked this stage because it was the first one that asked "does it hold up?" rather than "does it work?". Every earlier stage had a happy path. This one had a number: 1.6 seconds before, 11 milliseconds after. Watching that number move was the first time the work felt like engineering instead of assembly.

Stage 6: Flowbrand, queues, and the merge I'd take back

Stage 6 was the team stage. We built the backend for Flowbrand, a platform that takes an uploaded document and generates a branded marketing funnel from it. My slice was the queue infrastructure, the document upload pipeline, and the worker that actually produces the funnels. And then, late in the stage, a security audit that sent me somewhere I hadn't expected to go.

The product is asynchronous whether you like it or not. A user uploads a PDF. An LLM call extracts structured content from it. A second LLM call turns that content into a funnel schema. The assembled result lands in Postgres. Thirty seconds on a good day, with a failure possible at every step, and none of it belongs inside an HTTP request. Without a queue, Flowbrand is a timeout with a logo.

The boilerplate we inherited had no queue at all, and features were already being built against synchronous service calls the real flow would never survive. So before anything else, the team needed something to plug into.

Five PRs and an audit

PR #12 laid the foundation: Bull wired into NestJS with Redis behind it, a QueueService any module could inject to enqueue work, a base processor pattern, and a QueueModule for the rest of the team to import. The decision I'd defend hardest from that PR is typing jobs as a discriminated union — one payload shape per job type — so a processor always knows exactly what it's holding.
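A discriminated union for jobs might look like the sketch below. The job names and payload fields are invented for illustration; only the pattern, one payload shape per job type, comes from the PR description.

```typescript
// Hypothetical job names and payload fields, for illustration only.
type ExtractDocumentJob = {
  type: "extract-document";
  documentId: string;
  storageKey: string;
};

type GenerateFunnelJob = {
  type: "generate-funnel";
  documentId: string;
  brandName: string;
};

type QueueJob = ExtractDocumentJob | GenerateFunnelJob;

function describeJob(job: QueueJob): string {
  switch (job.type) {
    case "extract-document":
      // Narrowed: storageKey exists here, brandName does not.
      return `extract ${job.documentId} from ${job.storageKey}`;
    case "generate-funnel":
      return `generate funnel for ${job.documentId} (${job.brandName})`;
    default: {
      // Adding a job type without a matching case fails to compile here.
      const unreachable: never = job;
      throw new Error(`unknown job: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The type field does the work: once the switch has matched it, the compiler knows which payload fields are available, so a processor cannot reach for a field that belongs to a different job.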

PR #30 came from watching the foundation misbehave under real conditions. Redis briefly unreachable during startup? Module threw. Job died with an unhandled exception? Vanished, silently. I added connection retries with graceful degradation, structured error logging at the processor level, and an event listener that records failures to the database where someone can actually see them. The question underneath all of it: who owns this failure, and what are they supposed to do with it?

PR #35 was the funnel worker: two processors, 889 lines, fifteen files. ExtractionProcessor reads the document from S3, calls the LLM, stores the structured content. FunnelGenerationProcessor takes that content, calls the LLM again with a funnel-specific prompt, and writes the assembled result inside a single QueryRunner transaction. Keep the 889 in mind. It comes back.

PR #62 was the upload pipeline — and where the real edge-case work happened. Flowbrand accepts PDFs, Word files, PowerPoints. Each one needs to be validated, streamed to MinIO, queued for text extraction, and tracked via a polling endpoint so the frontend knows what's happening. The uploaded_documents table, a status enum (UPLOADING, PARSING, COMPLETED, FAILED), and a failure_reason column all went in first. Then the storage service, then the extraction processor.

The happy path worked within a day. Two things then broke it.

The first was the upload itself. The initial version used NestJS's FileInterceptor with Multer's default storage, which buffers the whole file in memory before you touch it. Fine at 2MB. A 25MB PowerPoint stalled the server. The fix was switching the Multer config to diskStorage and piping createReadStream(file.path) directly into the MinIO upload call, so the file never fully lives in memory.

The second was a retry idempotency problem I hadn't thought through. Bull retries failed jobs automatically. If ExtractionProcessor crashed after writing 40% progress to the database but before marking the job complete, the retry would pick up the same job and try to reset progress to 0 — which made no sense, and in some cases hit a unique constraint violation. The fix was an orphan guard at the top of the processor: read the current DB record first. If status is COMPLETED, skip. If status is FAILED, re-run from the top. If percent_complete > 0 and status is PARSING, we're resuming — don't reset. A Bull job is not a function call. It can run more than once, on a different process, with the same inputs. If your processor has side effects — and writing to MinIO and updating a DB row definitely are — you need to be able to answer: what happens if this runs twice?
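The guard reduces to a pure decision on the current database record, which makes it easy to test in isolation. The status values below come from the enum described earlier; the record shape and the function name are assumptions.

```typescript
type DocumentStatus = "UPLOADING" | "PARSING" | "COMPLETED" | "FAILED";

// Hypothetical slice of the uploaded_documents row the guard looks at.
interface DocumentRecord {
  status: DocumentStatus;
  percent_complete: number;
}

type PickupDecision = "skip" | "restart" | "resume" | "start";

function decideOnPickup(record: DocumentRecord): PickupDecision {
  if (record.status === "COMPLETED") return "skip"; // work already done
  if (record.status === "FAILED") return "restart"; // re-run from the top
  if (record.status === "PARSING" && record.percent_complete > 0) {
    return "resume"; // keep existing progress, do not reset to 0
  }
  return "start"; // first attempt at this document
}
```

The processor would call this before any side effect and branch on the result, so a second run of the same job never overwrites what the first one already recorded.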

PR #102 wrapped every controller response in a consistent { status: "success", data: ... } envelope. Not glamorous. The frontend was blocked on it.

PR #166 came from a code audit, and it was the most uncomfortable work I did in Stage 6. Three concurrency bugs in the authentication service, all security-relevant, all in code that had passing tests.

The first: two simultaneous first-time logins from the same user both read auth_metadata = null, both tried to create the row, and one crashed on a unique constraint. Fix: catch Postgres error code 23505 and re-read the row the concurrent request already created. The second requester gets a valid result without knowing a race happened.
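In outline, the create-then-recover pattern looks like the sketch below. find and create are hypothetical repository calls, and the check assumes the thrown error exposes the Postgres code directly on the error object, as node-postgres does; an ORM may nest it on a wrapped driver error instead.

```typescript
const UNIQUE_VIOLATION = "23505"; // Postgres SQLSTATE for unique_violation

function isUniqueViolation(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === UNIQUE_VIOLATION
  );
}

async function getOrCreate<T>(
  find: () => Promise<T | null>,
  create: () => Promise<T>,
): Promise<T> {
  const existing = await find();
  if (existing !== null) return existing;
  try {
    return await create();
  } catch (err) {
    if (!isUniqueViolation(err)) throw err; // anything else is a real error
    // A concurrent request won the insert; its row is the answer.
    const winner = await find();
    if (winner === null) throw err;
    return winner;
  }
}
```

Only the unique-violation case is treated as a lost race. Every other error still propagates to the caller.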

The second was worse. The account lockout counter used a JavaScript read-modify-write pattern: read failed_attempts, add 1, write it back. Under concurrent wrong-password attempts, two requests could read the same value, both add 1, and both write the same number, so five simultaneous requests might raise the counter by only 1. An attacker sending guesses in parallel could get many more attempts than the limit allowed before the lock triggered. The fix was deleting the read-modify-write entirely and replacing it with a single repository.increment() call, which issues one SQL statement: UPDATE auth_metadata SET failed_attempts = failed_attempts + 1 WHERE id = $1. The database handles the concurrency; there is no window between read and write for another request to interfere.
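The lost update is easy to reproduce without a database. The sketch below is a simulation only: FakeRow is an in-memory stand-in for one auth_metadata row, and each await models the gap in which another request can run. It is also the shape of test that exposes this class of bug, because it fires the requests together instead of one at a time.

```typescript
class FakeRow {
  private failedAttempts = 0;

  // Yields to other pending requests, like a network round trip would.
  private gap(): Promise<void> {
    return Promise.resolve();
  }

  async read(): Promise<number> {
    await this.gap();
    return this.failedAttempts;
  }

  async write(value: number): Promise<void> {
    await this.gap();
    this.failedAttempts = value;
  }

  // Models UPDATE ... SET failed_attempts = failed_attempts + 1:
  // the read and the write happen in one step with nothing in between.
  async increment(): Promise<void> {
    await this.gap();
    this.failedAttempts += 1;
  }

  current(): number {
    return this.failedAttempts;
  }
}

async function readModifyWrite(row: FakeRow): Promise<void> {
  const seen = await row.read();
  await row.write(seen + 1);
}

async function simulate(attempts: number): Promise<{ racy: number; atomic: number }> {
  const racyRow = new FakeRow();
  const atomicRow = new FakeRow();
  const racy: Promise<void>[] = [];
  const atomic: Promise<void>[] = [];
  for (let i = 0; i < attempts; i++) racy.push(readModifyWrite(racyRow));
  await Promise.all(racy);
  for (let i = 0; i < attempts; i++) atomic.push(atomicRow.increment());
  await Promise.all(atomic);
  return { racy: racyRow.current(), atomic: atomicRow.current() };
}
```

Calling simulate(5) leaves the atomic counter at 5, while the racy counter falls short because every request reads the value before any of them has written.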

The third: the distributed lock on OTP verification had a 10-second TTL, chosen arbitrarily. A slow DB response under load could let the lock expire before the handler finished, allowing two concurrent OTP submissions to both succeed for the same token. Fix: raise the TTL to 30 seconds.

The increment() fix introduced a subtle problem I missed on first pass. The method had been returning the updated metadata object, which the caller used to decide whether to trigger lockout. increment() doesn't give you the row back. My first version re-fetched it — two DB round trips per failed login. I looked at what the caller actually needed: just the new count. So the method now returns the old value plus one, calculated locally, and only hits the database again when it needs to write locked_until. One write per failed attempt, no superfluous read.

These three bugs existed in code with passing tests. The tests ran one request at a time. That's the lesson from PR #166: concurrency bugs are invisible in unit tests unless you specifically design the test to expose them. And the fix for bug 2 is less about adding code than removing it — moving the operation to where it belongs, which is the database, not JavaScript.

What I keep

The queue left me with a clean model for failure ownership, because the two processors handle it opposite ways and both are right. ExtractionProcessor swallows exceptions and records them to the database: an extraction failure is specific to that document, and retrying the same broken content against the same call doesn't change anything. FunnelGenerationProcessor rethrows: its failures are usually transient, rate limits and timeouts, and Bull's retry machinery exists for exactly that case.
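The two policies differ only in whether the error leaves the handler. In the sketch below, runWithRetries is a stand-in for the queue's retry loop and is not Bull's API; swallowing is a hypothetical wrapper showing the extraction-style behaviour, with an array in place of the database.

```typescript
// Stand-in for the queue's retry loop; this is not Bull's API.
async function runWithRetries(
  handler: () => Promise<void>,
  maxAttempts: number,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler();
      return attempt; // handler resolved: the queue sees success
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of retries
    }
  }
  throw new Error("maxAttempts must be at least 1");
}

// Extraction style: record the failure, then resolve so nothing retries.
function swallowing(
  work: () => Promise<void>,
  failures: string[],
): () => Promise<void> {
  return async () => {
    try {
      await work();
    } catch (err) {
      failures.push(err instanceof Error ? err.message : String(err));
    }
  };
}
```

A handler wrapped in swallowing runs once and leaves a record. A handler passed in unwrapped is the rethrowing style, and it gets another attempt each time it throws.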

And then PR #35. 889 lines across fifteen files is at least three concerns in one diff, and I knew that when I opened it. The deadline was real. I merged. My lowest score in Stage 6 was Collaboration + Docs — 6 out of 10 — and it was earned. Reviewers can only review what they can read. The question I now ask before opening anything: what's the smallest piece of this that someone could review on its own?

I picked Stage 6 because it's where the internship stopped being coursework. In the solo stages, reviewability is theoretical; the reviewer is a grading script. In Stage 6, a real teammate is blocked until your work makes sense to them. PR #35 put the tradeoff in front of me plainly. I chose to ship the whole feature. The score put a number on what that cost. I don't plan to pay it twice.

Ali · HNG XIV Backend Track
Portfolio: sage-ali.vercel.app · GitHub: github.com/sage-ali