Wilfrid Okorie

Posted on Jun 13

From Resilience Infrastructure to Event-Driven Architectures - What HNGi14 Taught Me About Real Systems

#architecture #backend #softwareengineering #systemdesign

Most backend systems look simple, until they don't. A single server and a database, with requests coming in and responses going out: it looks clean, predictable and easy to reason about. Then, all of a sudden, the business grows, traffic increases, external services start failing, processes crash mid-job, and latency climbs. What looked like a solid foundation reveals itself as a system that only works on the happy path.
During HNGi14, I spent months building systems that had to work outside the happy path: intelligent retries, automatic failure recovery and incident handling, asynchronous job processing, latency reduction, and system hardening in general. This article covers two of those systems: a retry engine built on exponential backoff and jitter, and a Redis pubsub metrics pipeline that demonstrates an event-driven architecture.

Stage 8 Task - Retry Engine

Problem it Solved

Imagine your server goes down for 5 seconds. Say your system uses Stripe for payments, and Stripe is supposed to deliver a webhook to your server on every successful payment. A payment happens during those 5 seconds and your server never receives the webhook, so it doesn't know the customer actually paid for the service, and the customer doesn't get served. In this context, I will refer to Stripe as the client, and to your server as the server. It is not the client's fault that the webhook wasn't delivered; it is the server's fault. If the client tried that request again now that your server is back up, it would probably succeed. This is a typical case where a retry engine is needed.

What is a Retry Engine?

A Retry Engine is a small HTTP service that acts as a reliable proxy for outbound HTTP requests. Instead of making a request directly and hoping it succeeds, you hand the request to the engine, which persists it as a job and returns immediately with an ID, while a background worker handles the actual call. If the call succeeds, that is the happy path: the engine stores the result, and you can fetch it whenever you want. If it fails with a retryable error, the worker backs off and tries again. If it fails permanently, the worker stops and marks the job dead.
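The job lifecycle described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: the names `Job`, `enqueue` and `settle`, the `maxAttempts` cap, and the idea that exhausting attempts also marks a job dead are all assumptions made for the example.

```typescript
type Outcome = "success" | "retryable" | "permanent";
type JobStatus = "pending" | "succeeded" | "dead";

interface Job {
  id: string;
  url: string;
  status: JobStatus;
  attempts: number;
  maxAttempts: number; // assumed cap, not stated in the article
  nextRetryAt: number; // epoch milliseconds
}

let counter = 0;

// Persist the job (here: just build the record) and hand back an ID immediately.
function enqueue(url: string, now: number, maxAttempts = 5): Job {
  counter += 1;
  return { id: `job-${counter}`, url, status: "pending", attempts: 0, maxAttempts, nextRetryAt: now };
}

// Called by the worker after each attempt to decide what happens next.
function settle(job: Job, outcome: Outcome, now: number, delayMs: number): Job {
  const attempts = job.attempts + 1;
  if (outcome === "success") {
    return { ...job, attempts, status: "succeeded" };
  }
  if (outcome === "permanent" || attempts >= job.maxAttempts) {
    return { ...job, attempts, status: "dead" };
  }
  // Retryable failure: stay pending and wait until nextRetryAt.
  return { ...job, attempts, status: "pending", nextRetryAt: now + delayMs };
}
```

In a real service the job record would live in a database rather than in memory, so that it survives a process crash.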

When a server experiences transient failures, the clients that have made requests to it see errors, and the typical solution is to retry those requests, provided the errors are retryable, i.e. Transient Errors. Transient Errors are those that would probably not recur if the request were tried again. However, if the requests are retried immediately, they would probably crash your just-recovering server. That is where Exponential Backoff comes in. The clients give the server some time to recover, and the longer the error persists, the longer the backoff period before the next request.

However, the issue with this is that if the server crashed because a burst of traffic arrived all at once, those retries, though exponentially backed off, still happen at the same time as each other. That would probably produce the same spike in traffic and crash the server again. That is where Jitter comes in. Jitter adds randomness to the retries and spreads them out. If they were all supposed to happen 10 seconds after the previous try, jitter spreads them to 8.5s, 8.7s, 8.8s, 9.1s, 9.5s, 10.2s, 10.4s, etc. It relies on randomness to vary the next_retry_at of each request.
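As a rough illustration of the math, here is one way backoff and jitter could be computed. The base delay of 1 second, the 60 second cap and the ±20% jitter ratio are assumptions chosen for the example, and `backoffMs` and `withJitter` are hypothetical names. The random source is passed in so the behaviour can be checked deterministically.

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Proportional jitter: shift the delay by a random amount within ±ratio.
// With ratio = 0.2, a 10s delay lands somewhere between 8s and 12s.
function withJitter(delayMs: number, random: () => number = Math.random, ratio = 0.2): number {
  const offset = (random() * 2 - 1) * ratio; // in [-ratio, +ratio)
  return Math.round(delayMs * (1 + offset));
}

// The timestamp a job would be retried at, given how many attempts have failed.
function nextRetryAt(now: number, attempt: number, random: () => number = Math.random): number {
  return now + withJitter(backoffMs(attempt), random);
}
```

Other jitter strategies exist, for example picking a random delay anywhere between zero and the backoff value. The proportional form is used here only because it matches the "around 10 seconds" example above.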

How I Approached The Task

I was genuinely confused when I received the task, so the first thing I did was give it to Claude to break down. My prompt asked for a breakdown and for the key concepts I needed to study to understand and implement the task. After this, I went to YouTube and searched through different videos for explanations.
I spent more time doing research than writing the code for that task, because the code was genuinely very easy to write, probably because of how much research I had done.
After understanding the concepts, I wrote a summary of what I had learned and the questions I still had, and headed back to Claude. It was only after I understood the concept to its core that I started writing the code, and it took me just an afternoon to complete the task.

What Broke and How I Fixed it

During implementation, the main things that broke had to do with my use of the sqlite3 library, and how SQLite differs from Postgres.
Although it uses SQL, it is file-based, unlike PostgreSQL, which runs as a separate server.
I also had some issues with ES modules versus CommonJS during the implementation, because I did not use any framework to bootstrap the project - I wrote it from scratch.

What I Took Away From It

Concepts:

How workers are actually just setInterval or recursive setTimeout calls backed by a persistence layer, such as a DB, or Redis in the case of BullMQ
How exponential backoff and jitter work mathematically, not just conceptually
The thundering herd problem and why jitter solves it
What transient errors are, why the distinction between client errors and server errors matters for retry logic, and which cases are exceptions
How SQLite works, since I was using it for the first time
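To make the client-error versus server-error distinction concrete, here is a small sketch of how a response could be classified. The exact rule set is an assumption for illustration: 5xx responses and missing responses (network errors, timeouts) are treated as transient, 4xx responses as permanent, with 408 and 429 as the exceptions among client errors. `classify` is a hypothetical name.

```typescript
type Outcome = "success" | "retryable" | "permanent";

// status is null when no HTTP response arrived at all (network error, timeout).
function classify(status: number | null): Outcome {
  if (status === null) return "retryable";
  if (status >= 200 && status < 300) return "success";
  // Exceptions among client errors: Request Timeout and Too Many Requests
  // describe a temporary condition, so trying again later can succeed.
  if (status === 408 || status === 429) return "retryable";
  // Server errors are assumed to be transient in this sketch.
  if (status >= 500) return "retryable";
  // Everything else (other 4xx, unexpected 3xx) will not be fixed by retrying.
  return "permanent";
}
```

A stricter engine might treat some 5xx codes, such as 501, as permanent as well; that choice depends on the downstream service.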

Patterns:

The claim-before-process pattern: previously learned in Web3 and applied here to prevent duplicate work in polling workers
Using a future timestamp as a self-healing lock, instead of always reaching for boolean flags
How recursive setTimeout calls can be a safe alternative to setInterval for async work loops
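The three patterns above fit together roughly like this. The sketch uses an in-memory array as a stand-in for the jobs table, and `Row`, `claimNext`, `startWorker` and the 30 second lease are hypothetical. In a real database the claim would need to be a single atomic update so that two workers cannot claim the same row.

```typescript
interface Row {
  id: number;
  status: "pending" | "done";
  lockedUntil: number; // epoch ms; a row is claimable once this is in the past
}

// Claim before processing: take a pending row whose lock has expired and
// push its lock into the future. If the worker crashes mid-job, the lock
// simply expires and the row becomes claimable again (self-healing).
function claimNext(rows: Row[], now: number, leaseMs: number): Row | undefined {
  const row = rows.find((r) => r.status === "pending" && r.lockedUntil <= now);
  if (row) row.lockedUntil = now + leaseMs;
  return row;
}

// Recursive setTimeout: the next tick is scheduled only after the current
// one has finished, so slow async work never overlaps with itself the way
// it can with setInterval. Returns a function that stops the loop.
function startWorker(
  rows: Row[],
  handle: (row: Row) => Promise<void>,
  intervalMs: number
): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const tick = async (): Promise<void> => {
    const row = claimNext(rows, Date.now(), 30000);
    if (row) {
      try {
        await handle(row);
        row.status = "done";
      } catch {
        // Leave the row pending; its lock expires on its own for a later retry.
      }
    }
    if (!stopped) timer = setTimeout(() => { void tick(); }, intervalMs);
  };
  timer = setTimeout(() => { void tick(); }, 0);
  return () => {
    stopped = true;
    if (timer) clearTimeout(timer);
  };
}
```

With a boolean flag, a crash between "set locked = true" and "set locked = false" leaves the row stuck forever; with a future timestamp, the lock releases itself.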

Debugging Techniques:

Using console.time() to measure how long operations take. I used it while debugging, but it is no longer present in the code.

Why I Picked It

Among the many individual tasks done throughout the internship, this was the most exciting, because the retry engine, exponential backoff and jitter were all completely foreign concepts to me before the task. After the task, I felt I understood them very well, and well enough to take on the next task, which was basically a heavily scaled-up version of it - a full production-grade Job Scheduler.

Project Details:

Team Task - Real-Time Inverter Metrics Streaming

In the HNG Internship, I was the backend track lead and general technical leader of the EnergyIQ team. We built an application that turns solar inverter data, fetched from each inverter brand's API, into actionable intelligence for the user: financial optimization, costs and savings, and device maintenance.

Initially, the planned flow had the frontend polling the backend for updates from these APIs. This would mean a browser requests data, the backend requests it from the external service, gets it, and returns it to the user, while also storing it in the DB.

The Problem

The problem lies in the flow described above. For a handful of users, it works very well. However, as the user base of the app grows, it starts to crack under pressure. EnergyIQ lets multiple users view the dashboard for a particular inverter, in a feature known as Team Access. This means that if 10 people are viewing the dashboard of a particular inverter, their 10 different browser instances periodically make requests for the same data. It also means that if no client asks, the backend doesn't get the inverter data, and cannot use it for the other things it should be doing. Also, these brand APIs have rate limits and recommended request intervals, so it becomes chaos when multiple clients ask for data at different times. One simple solution is for the clients to read from the DB instead, but that way they don't get data the moment it is available, and in critical cases this latency can matter a lot.

What It Was

As a result, I created:

a single poller service for the brand APIs
a Redis pubsub service for publishing data
an SSE endpoint for connecting clients to this stream by subscribing

The poller service is a background service that polls all the brand APIs on a schedule that depends on the rate limit of each brand's API. This way, it consolidates the traffic into a few orderly requests, using Promise.all to make these requests concurrently. If any of these requests fail, the errors are handled and logged.
If the data is fetched successfully, it uses the pubsub service to publish the data to a specific channel, which is determined by the ID of the inverter it has just polled.
The SSE endpoint enables the client to connect once to the server and listen for updates, by subscribing to the same channel pattern used by the poller/pubsub service.
As a result, data arrives in real time, eliminating polling from the frontend completely, and replacing chaos with order.
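The flow from poller to channel to SSE client can be sketched as below. To keep the example self-contained, the `Bus` class is an in-memory stand-in for Redis pubsub and is not the Redis client API. The channel naming in `channelFor`, and the names `pollAll` and `toSseFrame`, are assumptions made for illustration.

```typescript
type Handler = (channel: string, message: string) => void;

// In-memory stand-in for Redis pubsub, for illustration only.
class Bus {
  private handlers = new Map<string, Set<Handler>>();

  // Returns an unsubscribe function, e.g. to call when an SSE client disconnects.
  subscribe(channel: string, handler: Handler): () => void {
    const set = this.handlers.get(channel) ?? new Set<Handler>();
    set.add(handler);
    this.handlers.set(channel, set);
    return () => { set.delete(handler); };
  }

  // Returns how many subscribers received the message.
  publish(channel: string, message: string): number {
    const set = this.handlers.get(channel);
    if (!set) return 0;
    set.forEach((h) => h(channel, message));
    return set.size;
  }
}

// Hypothetical channel naming: one channel per inverter ID.
const channelFor = (inverterId: string): string => `inverter:${inverterId}:metrics`;

// Poll every inverter concurrently. Each request handles its own error,
// so one failing brand API does not reject the whole Promise.all batch.
async function pollAll(
  ids: string[],
  fetchMetrics: (id: string) => Promise<unknown>,
  bus: Bus
): Promise<number> {
  const results = await Promise.all(
    ids.map(async (id) => {
      try {
        const data = await fetchMetrics(id);
        bus.publish(channelFor(id), JSON.stringify(data));
        return true;
      } catch (err) {
        console.error(`poll failed for inverter ${id}`, err);
        return false;
      }
    })
  );
  return results.filter(Boolean).length; // number of successful polls
}

// An SSE message is a "data:" line followed by a blank line.
function toSseFrame(payload: string): string {
  return `data: ${payload}\n\n`;
}
```

An SSE handler would subscribe to `channelFor(id)` when a client connects, write `toSseFrame(message)` to the response for every message, and unsubscribe when the connection closes. The poller never needs to know who is listening.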

How I Approached It:

As the head of the team, I chose the task for myself, since I had been learning about different API and endpoint types and shapes. I considered several solutions, of which the main ones were WebSockets, long-polling, plain server-sent events using the EventSource API, and Redis pubsub + SSE, which is the one I chose.

The real challenge was keeping it separate from other services, i.e. making sure it didn't create circular dependencies by calling a function in the inverters service directly to deliver the data. One of the important design decisions in this implementation was to avoid having the endpoint controller call the poller service directly.

Before writing a single line of code, I did my research to find out why Redis is as fast as it is, how it is able to do a number of different things (it is also used as a cache), how SSE endpoints work, and how they differ from WebSockets.

After this, I wrote an implementation plan and followed it.

What Broke and How I Fixed It:

Here are some things that broke:

Provision for a sandbox
Initially, EnergyIQ had no inverter to test this feature on. I had to write a mock-inverter-server that periodically emits plausible data. The server runs its own internal physics model: it is a worker that, on every tick, emits data derived from that model, with awareness of daytime and night time. I wrote it, deployed it, and created a new inverter brand for use in development, which I called the SANDBOX INVERTER. It also turned out to be central to EnergyIQ, as it was used to support the implementation of basically every other feature.
The device type assumption
I hardcoded deviceType: 'min' in the Growatt adapter. When I tested against a real inverter, it returned empty arrays for min, inv, sph, and max. The fix came from discovering that the device was type: 2 (storage), switching to deviceType: 'storage', and later making the device type dynamic: it is now stored at onboarding from the device list response rather than hardcoded.
Content-Type for the Growatt v4 endpoint
The initial adapter was sending JSON. The Growatt v4 queryLastData endpoint requires application/x-www-form-urlencoded. The fix was switching to URLSearchParams and setting the correct Content-Type header.
Field name mismatches from the TRD
The TRD documented bmsSOC - the actual API returned bdc1Soc and bmsSoc as separate fields depending on the device type. The storage response used completely different field names (capacity for SoC instead of bmsSOC, vBat instead of bdc1Vbat). The fix was mapping the storage-specific fields correctly.
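Two of the fixes above, the form-encoded request body and the per-device-type field mapping, might look something like this. It is a sketch under assumptions: `buildFormRequest` and `normalizeBattery` are hypothetical helpers, the request parameter names in the usage note are placeholders rather than Growatt's documented fields, and only the response field names mentioned above (capacity, vBat, bmsSoc, bdc1Soc, bdc1Vbat) come from the article.

```typescript
// Build a form-encoded body instead of JSON.
function buildFormRequest(params: Record<string, string>): {
  headers: Record<string, string>;
  body: string;
} {
  return {
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams(params).toString(),
  };
}

interface BatteryReading {
  soc: number | null; // state of charge
  voltage: number | null;
}

// Convert a raw API value to a number, or null when it is missing or invalid.
function toNumber(value: unknown): number | null {
  if (value === null || value === undefined || value === "") return null;
  const n = Number(value);
  return Number.isNaN(n) ? null : n;
}

// Map device-specific field names onto one internal shape.
function normalizeBattery(deviceType: string, raw: Record<string, unknown>): BatteryReading {
  if (deviceType === "storage") {
    // Storage responses use capacity for SoC and vBat for battery voltage.
    return { soc: toNumber(raw["capacity"]), voltage: toNumber(raw["vBat"]) };
  }
  // Other device types expose bmsSoc or bdc1Soc, and bdc1Vbat.
  return {
    soc: toNumber(raw["bmsSoc"] ?? raw["bdc1Soc"]),
    voltage: toNumber(raw["bdc1Vbat"]),
  };
}
```

For example, `buildFormRequest({ deviceSn: "ABC 123", deviceType: "storage" })` produces a body that can be passed straight to `fetch` along with the returned headers. Keeping the mapping in one function means the rest of the app only ever sees `soc` and `voltage`, whatever the brand API calls them.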

What I Took Away From It:

From this task:

I learned how Redis works, and how it keeps data in RAM for speed.
It was my first implementation ever of a lasting connection between client and server, so I got to understand practically how server-sent events work.
I learned alternatives that can be used in other cases. I had heard about Kafka for streaming. While it was not a direct solution to this, it was worth learning how and where it is used.

Why I Picked This:

I picked this because it is still my proudest contribution to the team, not because it is a very big thing, but because it improved EnergyIQ a lot. Many other services now use pattern pubsub (listening for updates on patterned channels) to perform their functions, without needing to poll the DB. It became a backbone for the entire app, since the app is centered around getting and using solar inverter data from the brand APIs.

It is something worth being proud of.

CONCLUSION

In conclusion, I wrote my first Express server during HNG, having come with the intention of learning how systems work and understanding system design. System design is a journey, not a destination, but I can proudly say I am moving fast and effectively on that journey, thanks to the well-thought-out tasks given in this HNG cohort. I am proud of myself.

DEV Community