DEV Community

Albert Jokelin

Turbocharged C: How to engineer a Sub-50ms Program!⚡🚀

Imagine working on several programs that together must route messages across 5 devices in less than 50ms!

That's what I was working on at my workplace for the last couple of months.

A bit of context here: we work on backend data-transfer technology that routes data across hundreds of kilometres of optical fibre.

Sometimes those links fail, and we must route the data through an alternative path. Because we are dealing with terabytes of data every second, the switchover has to happen within milliseconds.

This is how the flow works:
1) The framing device raises an alarm when it stops receiving data.
2) The switching device processes this alarm and sends relevant information about the connection to the switching device of the alternate path.
3) The alternate switching device starts processing data.

All of these steps have to be completed within 50 ms. Our first solution worked, but not within the intended timeframe, and that's when we started debugging.

Since our switching devices communicate using Ethernet packets, we had to determine whether the number of packets transmitted made any difference. Surprisingly, it did. We had expected each message to be broadcast across the system rather than sent to a single device, but that wasn't the case.

Debugging the system

In my opinion, the best way to debug any system (especially an embedded one) is to break it into the smallest possible units and thoroughly examine each one.

We started by adding timers to the operations essential to the switching mechanism in all the devices. Since we were using C, we used the built-in timer.
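As a rough sketch of this kind of instrumentation, assuming a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available (the post does not say which timer API was actually used), a step can be wrapped like this. The names `now_ns`, `time_step_ns`, and `handle_alarm` are hypothetical placeholders.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime on strict compilers */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds; not affected by wall-clock changes. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  /* avoid an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical stand-in for one step of the switching path. */
static void handle_alarm(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000u; i++)
        sink += i;
}

/* Run one step and return how long it took, in nanoseconds. */
static uint64_t time_step_ns(void (*step)(void))
{
    uint64_t start = now_ns();
    step();
    return now_ns() - start;
}
```

Calling `time_step_ns(handle_alarm)` around each stage and logging the result gives a per-stage breakdown that can be compared against the 50 ms budget.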

This method highlighted two issues: the processor had too many operations competing for it, and there was an avoidable delay between the switch receiving the data and processing it.

How did we resolve them? Well, the first was fairly straightforward. Since the switching mechanism ran in its own threads, our first approach was to keep it out of the general scheduler and dedicate a single core to it. However, our processor had only two cores, which meant all the other processes crowded onto the remaining core, and less urgent but still important tasks weren't completed on time.
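The core-dedication idea could be sketched as follows, assuming Linux with glibc (the post does not name the operating system). `pin_current_thread_to_core` is a hypothetical helper; `sched_setaffinity` and the `CPU_*` macros are the real Linux interfaces.

```c
#define _GNU_SOURCE  /* exposes CPU_SET and sched_setaffinity on glibc */
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Restrict the calling thread to a single core.
 * Returns 0 on success, otherwise the errno value from the kernel. */
static int pin_current_thread_to_core(int core)
{
    assert(core >= 0 && core < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* On Linux, pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set) == 0 ? 0 : errno;
}
```

Pinning only restricts where this thread may run. Keeping other work off that core needs separate configuration, which is where a two-core processor runs out of room.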

The next best solution was to increase the thread priority, which worked reasonably well.
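One way to raise a thread's priority on Linux is to move it to a real-time scheduling policy. The sketch below assumes Linux and `SCHED_FIFO`; the post does not say which policy or priority value was used, and `make_current_thread_realtime` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Switch the calling thread to SCHED_FIFO at the given priority.
 * Returns 0 on success, EINVAL for an out-of-range priority,
 * otherwise the errno value (typically EPERM without privileges). */
static int make_current_thread_realtime(int priority)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);
    assert(lo != -1 && hi != -1);

    if (priority < lo || priority > hi)
        return EINVAL;

    struct sched_param param = { .sched_priority = priority };

    /* On Linux, pid 0 means "the calling thread". */
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0 ? 0 : errno;
}
```

A real-time thread that never blocks can starve everything else on its core, so the time-critical path should stay short and sleep while waiting for work.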

The second issue was harder to solve because we had to dive into the switch manufacturer's code and figure out how it sent Ethernet packets.

After a week of going through the code, we found out that messages received by the switch were stored in a queue (an elegant solution; in hindsight, we should have realized this sooner).

Now, there are two ways to retrieve messages from the queue: either the processor polls the queue, or the queue pushes messages to the processor. By default, our processor was set to polling, which explained the delay between receiving and processing the messages. We switched to an interrupt/push-based mechanism and voila! It worked.
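The difference between the two modes can be illustrated with a small conceptual sketch. It is not the manufacturer's API (which the post does not show); every name here is hypothetical. In poll mode a message sits in the queue until the consumer next checks, while in push mode a registered handler runs as soon as the message arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16  /* hypothetical capacity */

typedef void (*msg_handler)(int msg);

struct msg_queue {
    int buf[QUEUE_CAP];
    size_t head;         /* index of the oldest stored message */
    size_t count;        /* number of stored messages */
    msg_handler on_msg;  /* NULL: poll mode; non-NULL: push mode */
};

/* Producer side: deliver immediately in push mode, else store. */
static bool queue_put(struct msg_queue *q, int msg)
{
    assert(q != NULL);
    if (q->on_msg) {
        q->on_msg(msg);  /* handled right away, no waiting for a poll */
        return true;
    }
    if (q->count == QUEUE_CAP)
        return false;    /* queue full */
    q->buf[(q->head + q->count) % QUEUE_CAP] = msg;
    q->count++;
    return true;
}

/* Consumer side in poll mode: returns false when nothing is waiting. */
static bool queue_poll(struct msg_queue *q, int *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0)
        return false;
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}

/* Example handler that records the last message it saw. */
static int last_handled = -1;
static void record_msg(int msg) { last_handled = msg; }
```

With polling, the worst-case latency is roughly one polling interval on top of the processing time. With push delivery that interval disappears, which matches the delay described above.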

Although the optimizations above helped, what really made the program faster was using highly optimized algorithms and data structures such as hash maps. Their effect was noteworthy, with times reaching as low as 8 ms!
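To show why a hash map helps on a path like this, here is a minimal sketch of a fixed-size, open-addressing table that maps a connection ID to an alternate port in roughly constant time. The key and value types, the table size, and all names are assumptions for illustration; the post does not describe the actual data being looked up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u  /* power of two, so we can mask instead of using % */
_Static_assert((TABLE_SIZE & (TABLE_SIZE - 1u)) == 0, "size must be 2^n");

struct route_entry { uint32_t conn_id; uint16_t alt_port; bool used; };
struct route_table { struct route_entry slots[TABLE_SIZE]; size_t count; };

/* Simple integer mixer to spread nearby IDs across the table. */
static uint32_t hash_u32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Insert or update; returns false only when the table is full. */
static bool route_put(struct route_table *t, uint32_t conn_id, uint16_t alt_port)
{
    assert(t != NULL);
    uint32_t i = hash_u32(conn_id) & (TABLE_SIZE - 1u);
    for (uint32_t probes = 0; probes < TABLE_SIZE; probes++) {
        struct route_entry *e = &t->slots[i];
        if (!e->used) {
            e->used = true;
            e->conn_id = conn_id;
            e->alt_port = alt_port;
            t->count++;
            return true;
        }
        if (e->conn_id == conn_id) {
            e->alt_port = alt_port;  /* existing key: update in place */
            return true;
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);  /* linear probing */
    }
    return false;
}

/* Look up a connection; returns false if it is not in the table. */
static bool route_get(const struct route_table *t, uint32_t conn_id,
                      uint16_t *alt_port)
{
    assert(t != NULL && alt_port != NULL);
    uint32_t i = hash_u32(conn_id) & (TABLE_SIZE - 1u);
    for (uint32_t probes = 0; probes < TABLE_SIZE; probes++) {
        const struct route_entry *e = &t->slots[i];
        if (!e->used)
            return false;  /* hit an empty slot: key was never inserted */
        if (e->conn_id == conn_id) {
            *alt_port = e->alt_port;
            return true;
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);
    }
    return false;
}
```

A statically sized table like this avoids heap allocation on the hot path. The sketch has no delete operation, which would need tombstones or re-insertion to keep probing correct.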
