Sumayyah

Posted on Jun 13

From Job Scheduling to Real Time Systems: Two HNG Internship Tasks That Stayed With Me

#hng #internship #softwaredevelopment #distributedsystems

If there is one thing the HNG internship taught me, it is that software engineering has a way of making simple ideas feel complicated.

A feature can sound straightforward during planning. Everyone nods their heads. The requirements seem clear. You already have an idea of how you want to build it.

Then you start.

A few hours later, you are reading documentation you did not know existed. You are tracing logs through different services. You are questioning decisions that felt obvious when you first made them. And somehow, the problem that looked small has become much bigger than expected.

Somewhere along the way, I stopped focusing on finishing tasks and started paying attention to what they were teaching me. Some taught me new technologies. Others taught me patience. A few of them left a much deeper mark. Not because they went perfectly, but because they completely shifted how I think about building backend systems.

When I look back on the internship, those are the experiences I remember most clearly, because they forced me to grow.

Dillame Scheduler

When I got to Stage 9 of the internship, our task was to build a background job scheduler. I thought I knew what I was getting into. A queue, a worker, a retry system. It sounded manageable.

It was not.

Very early, four things changed the whole problem for me. A min heap, a timing wheel, lazy deletion, and priority updates. Put together, they formed the IndexedPriorityQueue, and that became the hardest part of the entire system.

The scheduler itself was meant to be production ready. FastAPI handled the API. MongoDB stored jobs. Redis handled pub/sub. A worker process ran separately and executed jobs in the background.

The system supported DAG dependencies, recurring schedules, retry backoff, dead letter handling, starvation prevention, duplicate protection, and a live dashboard. But everything depended on one thing working correctly. The scheduler had to decide which job runs next and when.

That is where things got real.

A simple scheduler can just fetch jobs and sort them. That works until scale enters the picture.

Jobs had multiple factors. Priority, scheduled time, creation time. All of them mattered. And when high priority jobs keep arriving, low priority jobs can easily get stuck forever.

So I had two problems. Ordering and starvation.

Ordering meant getting correct execution without constantly sorting everything. Starvation meant making sure low priority jobs do not get ignored for too long.

That is where I started building the IndexedPriorityQueue.

How I approached it

I used a min heap as the base structure, then extended it with an index map to support updates.

Python's heapq module does not support priority updates, so I used lazy deletion. Instead of modifying entries in place, I pushed new ones and marked the old ones as stale. When a stale entry was popped, it was simply skipped.

Each entry followed this structure:

(effective_priority, scheduled_at, created_at)

This gave stable ordering across priority, schedule time, and creation time.
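
To make the ordering concrete, here is a minimal sketch of how such a tuple could be built. The aging rule (one priority level gained per 60 seconds of waiting), the function names, and the convention that a lower number runs sooner are assumptions for illustration, not the scheduler's actual formula.

```python
from datetime import datetime, timedelta, timezone

AGING_INTERVAL_SECONDS = 60  # assumed: one priority level gained per minute waited


def effective_priority(base_priority: int, created_at: datetime, now: datetime) -> int:
    """Lower value runs sooner. Waiting jobs slowly move toward 0."""
    waited = max(0.0, (now - created_at).total_seconds())
    boost = int(waited // AGING_INTERVAL_SECONDS)
    return max(0, base_priority - boost)


def sort_key(base_priority: int, scheduled_at: datetime, created_at: datetime, now: datetime):
    # Tuples compare left to right: priority first, then schedule time,
    # then creation time as the final tie-breaker.
    return (effective_priority(base_priority, created_at, now), scheduled_at, created_at)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
old_low = sort_key(5, now, now - timedelta(minutes=10), now)  # low priority, waited 10 minutes
new_high = sort_key(1, now, now, now)                         # high priority, just arrived
```

With these assumed numbers, the job that has waited ten minutes sorts ahead of the fresh high priority job, which is the starvation protection in miniature.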

The index map tracked job versions. Every update increased the version, pushed a new heap entry, and invalidated the old one. This kept updates at O(log n) without rebuilding anything.
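
The version-based lazy deletion described above can be sketched roughly like this. It is a simplified illustration, not the project's code: the class body, the method names, and the entry layout of (sort_key, version, job_id) are assumptions. A single global counter is used for versions so that a job that is removed and later re-added can never match one of its old entries.

```python
import heapq
import itertools


class IndexedPriorityQueue:
    """Min heap with lazy deletion. Layout and method names are illustrative."""

    def __init__(self):
        self._heap = []                    # entries: (sort_key, version, job_id)
        self._versions = {}                # job_id -> version of its only live entry
        self._counter = itertools.count(1)

    def push(self, job_id, sort_key):
        # A new version makes every older entry for this job stale.
        version = next(self._counter)
        self._versions[job_id] = version
        heapq.heappush(self._heap, (sort_key, version, job_id))

    def update(self, job_id, sort_key):
        # An update is just a fresh push; the old entry is never touched.
        self.push(job_id, sort_key)

    def remove(self, job_id):
        # Forget the job; its heap entries are skipped when they surface.
        self._versions.pop(job_id, None)

    def pop(self):
        while self._heap:
            sort_key, version, job_id = heapq.heappop(self._heap)
            if self._versions.get(job_id) == version:
                del self._versions[job_id]
                return job_id, sort_key
            # Stale entry: drop it and keep looking.
        return None

    def __len__(self):
        return len(self._versions)


q = IndexedPriorityQueue()
q.push("report", (5, 100, 1))
q.push("email", (3, 100, 2))
q.update("report", (1, 100, 1))  # e.g. a starvation boost
```

Popping from this queue yields "report" first with its boosted key, then "email", and the original "report" entry is silently discarded as stale.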

The timing wheel handled future jobs. Instead of placing everything in the heap, I used a 3600 slot circular structure. Jobs scheduled within an hour went into the wheel first, then got promoted into the heap when due.

slot = int(scheduled_at.timestamp()) % 3600

This kept the heap focused only on ready work.
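
A minimal sketch of the wheel, assuming a ticker calls pop_due once per second and hands the returned jobs to the heap. The class and method names are hypothetical. Each slot stores the full due timestamp so that a job sharing a slot but due on a later lap is left alone.

```python
from datetime import datetime, timedelta, timezone

WHEEL_SLOTS = 3600  # one slot per second, covering one hour


class TimingWheel:
    def __init__(self):
        # Fixed circular array of buckets; each bucket holds (due_ts, job_id).
        self._slots = [[] for _ in range(WHEEL_SLOTS)]

    def add(self, job_id, scheduled_at: datetime) -> int:
        due_ts = int(scheduled_at.timestamp())  # whole seconds
        slot = due_ts % WHEEL_SLOTS
        self._slots[slot].append((due_ts, job_id))
        return slot

    def pop_due(self, now: datetime):
        """Remove and return the jobs in the current slot whose time has come."""
        now_ts = int(now.timestamp())
        slot = now_ts % WHEEL_SLOTS
        due = [job_id for ts, job_id in self._slots[slot] if ts <= now_ts]
        self._slots[slot] = [(ts, j) for ts, j in self._slots[slot] if ts > now_ts]
        return due


base = datetime(2024, 1, 1, tzinfo=timezone.utc)
wheel = TimingWheel()
wheel.add("send-email", base + timedelta(seconds=5))
```

Insertion is a single append into one bucket, which is where the constant-time insert comes from. Because this sketch only looks at the current slot, a missed tick would leave jobs behind, so a real ticker would also need to sweep any slots it skipped.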

What broke and how I fixed it

Lazy deletion caused heap buildup over time. Too many stale entries slowed pops. I fixed it by rebuilding the heap only when stale entries crossed a safe threshold.
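
A rough sketch of threshold-based compaction. The 50 percent ratio, the function name, and the entry layout of (sort_key, version, job_id) are assumptions; the sketch also assumes every live job has exactly one live entry in the heap, so the stale count is simply the difference between the two sizes.

```python
import heapq

STALE_RATIO_THRESHOLD = 0.5  # assumed: compact once half the heap is stale


def compact_if_needed(heap: list, live_versions: dict) -> bool:
    """Rebuild the heap in place when too many entries are stale.

    heap holds (sort_key, version, job_id) tuples.
    live_versions maps job_id -> version of its single live entry.
    """
    if not heap:
        return False
    stale = len(heap) - len(live_versions)
    if stale / len(heap) < STALE_RATIO_THRESHOLD:
        return False
    # Keep only entries whose version is still the live one, then re-heapify.
    heap[:] = [entry for entry in heap if live_versions.get(entry[2]) == entry[1]]
    heapq.heapify(heap)  # O(n) rebuild
    return True
```

Running this check occasionally, rather than on every pop, keeps the common path cheap while bounding how much dead weight the heap can carry.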

Timing drift caused jobs to land in wrong slots. I fixed it by normalizing timestamps before slot calculation and using consistent floor based time everywhere.

Starvation updates sometimes affected MongoDB but not the in memory heap. I fixed it by checking the index before applying heap updates so both stayed in sync.

SSE events failed once due to Nginx buffering. The system was working, but the browser was not receiving updates. Disabling proxy buffering fixed it immediately.
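
Buffering can be turned off in the Nginx configuration itself with the proxy_buffering directive, or per response with the X-Accel-Buffering header that Nginx honors. The post does not say which route was taken here, so the following is only a sketch of the header-based option, with a hypothetical helper name. The returned dictionary would be passed as the headers of the streaming response.

```python
def sse_response_headers() -> dict:
    """Headers for an SSE response served from behind Nginx."""
    return {
        "Content-Type": "text/event-stream",  # required media type for SSE
        "Cache-Control": "no-cache",          # intermediaries must not cache the stream
        "X-Accel-Buffering": "no",            # ask Nginx not to buffer this response
    }


headers = sse_response_headers()
```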

What I took away from it

I learned that data structures stop being abstract the moment state and time enter the picture. A heap is simple until you need updates, persistence, and coordination across workers.

The combination of a timing wheel and heap was the key insight. The wheel handled coarse grained scheduling with O(1) insertion, while the heap handled fine grained execution ordering with O(log n) updates. Separating concerns reduced complexity in both layers.

Most importantly, I learned that debugging distributed systems is about tracing behavior across boundaries, not just looking at one place.

Why I picked it

I picked this task because it exposed everything I did not fully understand at the start.

Everything else in the system felt familiar enough. APIs, databases, queues, dashboards. The scheduler forced deeper thinking.

It was not just about making something work. It was about making it correct under failure, load, and time.

The system is live and running, and the full implementation is available here:
Source Code

Clinsights API - SSE Notification Delivery system & EventBus + connection registry layer

Stage 6 of the internship moved the clinical API from a simple request-response backend into a distributed, event-driven system. The system was split into three major parts: a fault-tolerant processing pipeline, a real-time notification system using SSE, and a WebSocket-based chat layer.

I worked specifically on the SSE Notification Delivery system and the EventBus + connection registry layer, which handled how backend events are delivered to users in real time.

The SSE Notification Delivery system was responsible for pushing updates from the backend pipeline to the client without polling. Instead of the frontend repeatedly calling endpoints like /cases/{id}/interpretations/latest, the backend maintains a persistent connection and streams updates as soon as they happen.

At the center of this design was an EventBus, which acts as a shared communication layer between the pipeline and real-time consumers. The pipeline emits events like interpretation_ready, and the SSE layer subscribes to those events and forwards them to connected clients.

On top of this, a connection registry tracks active user sessions in memory, allowing the system to broadcast events across multiple devices or tabs.
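
As an illustration of the registry idea, here is a minimal in-memory sketch. The class name, the method names, and the use of one asyncio.Queue per open connection are assumptions rather than the project's actual code.

```python
import asyncio
from collections import defaultdict


class ConnectionRegistry:
    """Tracks every open stream per user so one event can reach all of them."""

    def __init__(self):
        self._connections = defaultdict(set)  # user_id -> set of per-connection queues

    def connect(self, user_id: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._connections[user_id].add(queue)
        return queue

    def disconnect(self, user_id: str, queue: asyncio.Queue) -> None:
        self._connections[user_id].discard(queue)
        if not self._connections[user_id]:
            del self._connections[user_id]  # do not keep empty entries around

    def broadcast(self, user_id: str, event: dict) -> int:
        """Queue the event on every open connection; return how many got it."""
        queues = self._connections.get(user_id, ())
        for queue in queues:
            queue.put_nowait(event)
        return len(queues)


registry = ConnectionRegistry()
phone = registry.connect("user-42")
laptop = registry.connect("user-42")
delivered = registry.broadcast("user-42", {"type": "interpretation_ready"})
```

Because the registry lives in process memory, it only knows about connections held by that one server process, which is why a shared bus is still needed to reach users connected elsewhere.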

The problem it was solving

The original system had no real-time delivery mechanism. Once a case was processed, the only way for the client to know was to poll the API repeatedly. This created unnecessary load and delayed feedback.

More importantly, there was no structured way for backend events to flow into the client. The pipeline updated database state, but nothing in the system was responsible for translating those updates into real-time user-facing events.

The core problems were:

Heavy reliance on polling and repeated API calls
No real-time feedback when processing completed
No unified event propagation system between backend and frontend
No recovery mechanism for missed updates

How I solved it

The first step was introducing an EventBus abstraction as the central communication layer. Instead of tightly coupling the pipeline to delivery systems, the pipeline simply publishes events.

The EventBus handles:

Publishing events through Redis pub/sub
Subscribing users to event streams
Serializing event payloads consistently
Acting as the single source of truth for all real-time events

The SSE layer then consumes these events and streams them to clients through a persistent HTTP connection (/notifications/stream).

The connection flow looks like this:

Client opens SSE connection
Server subscribes user to EventBus channel
Pipeline publishes event to EventBus
SSE stream receives event and pushes it to client
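
The four steps above can be sketched with an in-process stand-in for the bus. This deliberately swaps Redis pub/sub for plain asyncio queues so the example runs on its own; the class name, the channel name "user:42", and the payload shape are all hypothetical.

```python
import asyncio
import json
from collections import defaultdict


class InMemoryEventBus:
    """Stand-in for a Redis-backed bus: same publish/subscribe shape, one process only."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> list of subscriber queues

    def subscribe(self, channel: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    def publish(self, channel: str, event_type: str, payload: dict) -> None:
        # Serialize once so every consumer sees the same shape.
        message = json.dumps({"type": event_type, "payload": payload})
        for queue in self._subscribers.get(channel, []):
            queue.put_nowait(message)


async def demo() -> dict:
    bus = InMemoryEventBus()
    stream = bus.subscribe("user:42")  # client connects, server subscribes the user
    bus.publish("user:42", "interpretation_ready", {"case_id": "abc"})  # pipeline publishes
    return json.loads(await stream.get())  # stream side receives the event


received = asyncio.run(demo())
```

The point of the shape is that the publisher never learns who is listening, which is what keeps the pipeline decoupled from delivery.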

To support real-world conditions, I also added:

Keepalive pings every 30 seconds to prevent proxy timeouts
Last-Event-ID support for missed event replay
In-memory connection registry for multi-device broadcasting
Strict separation between event production and delivery layers

One of the first issues I ran into was missed events during temporary client disconnections. The SSE stream worked perfectly when the connection was alive, but if a user disconnected even briefly, any events sent during that window were lost permanently. This made the system unreliable in practice.

I fixed this by persisting notifications in the database and introducing Last-Event-ID support. On reconnection, the server replays all events that were created after the last received event ID. This turned SSE from a purely live stream into a recoverable event system.
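
A minimal sketch of the replay lookup, assuming notification IDs are integers that increase over time and that the stored rows are available as a list. The Notification type and the function name are hypothetical, and a real implementation would run this filter as a database query. The Last-Event-ID value arrives as a string header, so it is parsed before comparing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Notification:
    id: int        # assumed: monotonically increasing database ID
    user_id: str
    body: str


def events_to_replay(stored, user_id: str, last_event_id: Optional[str]):
    """Return this user's notifications created after last_event_id, oldest first."""
    if last_event_id is None:
        return []  # first connection: nothing to replay in this sketch
    cursor = int(last_event_id)
    missed = [n for n in stored if n.user_id == user_id and n.id > cursor]
    return sorted(missed, key=lambda n: n.id)


stored = [
    Notification(1, "u1", "case 10 ready"),
    Notification(2, "u2", "case 11 ready"),
    Notification(3, "u1", "case 12 ready"),
    Notification(4, "u1", "case 13 ready"),
]
missed = events_to_replay(stored, "u1", "1")
```

Here a client for user u1 that last saw event 1 gets events 3 and 4 back, in order, and never sees another user's notification.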

Another issue I faced was SSE connections being silently closed by proxies and load balancers due to inactivity. In production-like environments, long-lived HTTP connections are often terminated if no data is transmitted for a while. This caused unexpected disconnections even when the backend was healthy.

I solved this by introducing periodic keepalive comments every 30 seconds. These small heartbeat messages ensure the connection is always considered active by intermediaries.
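
In the SSE wire format, a line that starts with a colon is a comment and is ignored by the browser, which makes it a convenient heartbeat. The sketch below shows one way to interleave such comments with real events. The function names and the per-connection queue are assumptions, and the async generator would be handed to whatever streaming response the framework provides.

```python
import asyncio
import json

KEEPALIVE_FRAME = ": keepalive\n\n"  # SSE comment line, invisible to client code


def format_sse(event_id: str, event_type: str, payload: dict) -> str:
    """Build one SSE frame; a blank line terminates the event."""
    data = json.dumps(payload)  # compact JSON stays on a single data line
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"


async def sse_stream(queue: asyncio.Queue, keepalive_seconds: float = 30):
    """Yield queued frames, or a keepalive comment when the queue stays quiet."""
    while True:
        try:
            frame = await asyncio.wait_for(queue.get(), timeout=keepalive_seconds)
        except asyncio.TimeoutError:
            frame = KEEPALIVE_FRAME
        yield frame


frame = format_sse("17", "interpretation_ready", {"case_id": "abc"})
```

Sending the id field with every real event is also what lets the browser report the last ID it saw when it reconnects.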

I also faced duplicate event delivery during reconnections where replayed events overlapped with live events still in transit. This resulted in duplicate notifications on the frontend. The fix was to assign stable event identifiers tied to database records, allowing the client to safely deduplicate events.
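
The deduplication logic is small enough to show in a few lines. The real client here is browser code, so this Python version only illustrates the idea; the class name is hypothetical, and a long-lived client would want to bound the size of the seen set.

```python
class EventDeduplicator:
    """Accepts each event ID once; repeats from replay overlap are rejected."""

    def __init__(self):
        self._seen = set()

    def accept(self, event_id: str) -> bool:
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


dedup = EventDeduplicator()
incoming = ["7", "8", "8", "9", "7"]  # live and replayed events overlapping
shown = [event_id for event_id in incoming if dedup.accept(event_id)]
```

This only works because the IDs are stable: the same database record always produces the same event ID, whether it arrives live or through replay.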

Finally, there was a design-level issue where different parts of the system were emitting events inconsistently. The pipeline, SSE layer, and WebSocket system initially each emitted events in their own way, which made behavior unpredictable. I resolved this by enforcing the EventBus as the only valid event source. Everything else subscribes to it; nothing bypasses it.

What I took away from it

This system made it clear that real-time infrastructure is not just about sending data quickly, but about guaranteeing delivery under failure conditions.

A few key lessons stood out:

Event-driven systems only work when there is a single, consistent event source
SSE is simple but requires persistence and replay to be production-safe
Connection state cannot be trusted; it must always be treated as temporary
Real-time delivery is less about pushing events and more about recovering them correctly

The most important realization was that reliability in distributed systems comes from designing for failure first, not adding it later.

Why I picked it

I worked on this part because it sat at the intersection of backend processing and user experience. It was not just infrastructure for the sake of infrastructure. It directly controlled how users experienced system responsiveness.

It also involved solving coordination problems between multiple moving parts: the pipeline, the EventBus, and the real-time delivery layer. Getting all of them to behave consistently required thinking beyond individual services and focusing on system-wide behavior.

That made it one of the most technically interesting parts of the project to work on.

Final thoughts

What stayed with me from both projects was not the features, but the systems behind them.

One taught me how to control execution under uncertainty, where timing, queues, and concurrency decide correctness. The other taught me how to move information reliably in real time, across connections that can appear and disappear at any moment.

In both cases, the real challenge was not making things work. It was making them keep working when everything around them gets noisy, fast, or unpredictable.

That is the part I will remember most.

Thank you HNG!