Deconstructing our microservices, Part 1: The app crashed in production and what we did next

#distributedsystems #architecture #microservices #dotnet

This API had been running fine in production for two years. It was taking the expected time and resources, and then, boom: one day, one of our pods crashed.

Self-healing mechanisms were already in place to handle the situation; the system automatically redeployed the pod instance as configured.

But the question remained: why?

The API was tested and had been working flawlessly with real production data for the last two years. Many customers were using it, and resource allocation had seemed adequate until now.

So we checked Application Insights on Azure for CPU, memory, and other resource usage. The pod had crashed due to high memory usage: a customer had imported a very large file.

Now, let’s look at what this API does. We were working on a web application that handles various types of financial calculations. This particular API reads a file and imports its data into the system. A file can be a single CSV file or a zip file that contains CSV files.

The import process reads data from the file, runs various validations on it (including checking that the file is valid and verifying mathematical constraints), and performs calculations on the data before saving. It also formats the data into the required system formats and creates hierarchies that have foreign key references in the database. Some hierarchies go as deep as 12 levels. The data is stored across 30+ tables in the SQL Server database, and the real-time status of the process is sent to the client using SignalR.

We had 1 GB of memory per pod. The large file had 1.5 million rows. After we got hold of similar data, we tested it on staging: even on an otherwise idle instance, the import took 3–4 days, which was far too long.

Now we knew the reason was high memory usage. But what was the cause: resources, code, the database, or its queries?

Processing a large file was taking up to 1 GB of memory, the pod's entire allocation. If any other process was using a good chunk of memory, there was even less left for the import. So this one process was taking the whole pod down with it.
We were downloading the files into memory and reading data from them, so memory usage climbed steeply with large files.
We were saving hierarchical data one row at a time, so that we had each parent ID available to build the hierarchy. This was not a big issue with typical files, but it was with this specific file.

Based on this analysis, we decided to make a few changes.

Extracting the Import Microservice

We moved the import process out of the calculation service, so that if it goes sideways, the whole calculation pod does not go down with it. To manage the import pods, we used KEDA (Kubernetes Event-driven Autoscaling). The calculation service published an event to an Azure Service Bus topic for each import request, and KEDA created a new pod for it, scaling out up to a maximum of n pods (5 in our case). After the process completed, the pod scaled in. If there were more than n requests, the extra requests waited in the message queue. KEDA also gave us the advantage of parallel processing.
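The capping behaviour described above can be modelled in a few lines. This is a toy Python sketch, not the KEDA configuration or the .NET service itself: a thread pool stands in for the import pods, its internal queue stands in for the Service Bus backlog, and `run_imports`, `MAX_PODS`, and `work_seconds` are hypothetical names introduced only for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PODS = 5  # the article's n; everything else here is illustrative


def run_imports(request_ids, max_pods=MAX_PODS, work_seconds=0.01):
    """Run every request with at most max_pods imports in flight at once.

    Returns the processed request ids and the peak concurrency observed.
    """
    lock = threading.Lock()
    running = 0
    peak = 0
    done = []

    def handle(request_id):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(work_seconds)  # stand-in for the real import work
        with lock:
            running -= 1
            done.append(request_id)

    # Requests beyond max_pods wait in the executor's queue, much like
    # messages waiting in the message queue until a pod is free.
    with ThreadPoolExecutor(max_workers=max_pods) as pool:
        list(pool.map(handle, request_ids))
    return sorted(done), peak
```

Calling `run_imports(range(12))` processes all twelve requests while the observed peak never exceeds five, which is the property the pod cap is meant to guarantee.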

Blob Stream

Instead of reading the data from a file or memory stream, we read it directly from Azure Blob Storage using a blob stream. This reduced the amount of raw file content held in memory and improved performance. The pros of reading from a blob stream outweighed the cons of keeping a connection open to Blob Storage: the connection remained open only for a very short duration, not for the whole process time, and we did not keep the file in memory. Previously, we deleted the file at the end of the process. Now we were keeping the file data in data objects in code.
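To make the difference concrete, here is a minimal Python sketch of reading CSV rows from a stream in fixed-size batches instead of loading the whole file first. It is an illustration under assumptions, not the original .NET code: any binary file-like object stands in for the blob stream, and `iter_batches` and `batch_size` are hypothetical names.

```python
import csv
import io
from itertools import islice


def iter_batches(binary_stream, batch_size=1000):
    """Yield lists of parsed CSV rows, at most batch_size rows at a time.

    Only the current batch is held in memory; the raw file content is
    never read in full.
    """
    # Decode bytes to text lazily as the csv reader asks for more lines.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    reader = csv.DictReader(text)  # first line is treated as the header
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch
```

A caller would loop over `iter_batches(stream)` and validate or save each batch before asking for the next one, so peak memory depends on the batch size rather than on the file size.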

Code Optimization

We were saving the rows one by one because, when the code was first written, we had limited exposure to the variety of data we were going to see in the files. After two years of this process running, we had better business metadata and understood the scope better. We had three main hierarchy patterns to handle. In two of them, more than 99% of the data was in the last node of the hierarchy. In the remaining one, 99.5% of users never created a hierarchy; they were using only the top-most node. With so few rows needing a newly created parent, we were now more comfortable using SqlBulkCopy to bulk save the data. We used a batch size so as not to overwhelm the database.
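The idea of saving in batches rather than row by row can be sketched as follows. This Python example uses the standard library's `sqlite3` purely as a stand-in for SQL Server and `SqlBulkCopy` (which exposes a `BatchSize` property for the same purpose); the `import_row` table, its columns, and the function names are assumptions made for illustration.

```python
import sqlite3
from itertools import islice


def create_schema(conn):
    # Hypothetical single table; the real system spreads data over 30+ tables.
    conn.execute("CREATE TABLE import_row (node_id INTEGER, amount REAL)")


def bulk_insert(conn, rows, batch_size=5000):
    """Insert rows in batches of batch_size and return the number saved."""
    iterator = iter(rows)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One round of inserts per batch instead of one per row.
        conn.executemany(
            "INSERT INTO import_row (node_id, amount) VALUES (?, ?)", batch
        )
        conn.commit()  # commit per batch so no single write grows too large
        total += len(batch)
    return total
```

The batch size is a trade-off: larger batches mean fewer round trips, while smaller ones keep each write short so that other database work is not starved.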

Let me point out one more thing that we were already doing. We were not running the whole import in a single transaction, because a transaction held open over such a long process can lead to deadlocks in the database. Instead, we maintained a flag to mark the status of the process, and the data is shown in the UI only when the status is success. The user received real-time updates on the process via SignalR. If the process failed for any reason, we deleted the dirty data.

As you may guess, we also made other changes to accommodate our KEDA pattern. The import logic was idempotent, so the service could safely process the same message twice without corrupting the SQL Server data.

Post Optimization

Now our process was optimized: what took 3–4 days was now done in 3–4 hours. But a new issue arose with it. Because the process was now so much faster, the client was flooded with SignalR messages and the UI froze; it was receiving more data than it could process and display. We had been sending the status of every single row to the user, so we switched to sending an update after every few thousand rows, moving from real-time to near real-time updates.
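The throttling change amounts to reporting progress every N rows instead of on every row. A minimal Python sketch, assuming a hypothetical `send_update` callback in place of the real SignalR call and an illustrative `every` interval:

```python
def import_rows(rows, send_update, every=5000):
    """Process rows, calling send_update(count) once per `every` rows.

    A final update is sent at the end so the client sees the true total.
    """
    processed = 0
    last_sent = 0
    for _ in rows:
        processed += 1  # the real per-row import work would happen here
        if processed % every == 0:
            send_update(processed)
            last_sent = processed
    if last_sent != processed:
        send_update(processed)  # flush the remainder
    return processed
```

With 1.5 million rows and an interval of 5,000, the client would receive 300 messages instead of 1.5 million, at the cost of the progress display lagging by at most one interval.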

Production crashes are painful, but they are also the best architectural forcing functions. Looking back, our 2-year-old API didn’t fail because it was poorly written; it failed because it hit a scaling wall. By re-architecting the solution, we learned a few invaluable lessons:

Isolate the Blast Radius: Never let a heavy background process share a fate with your core user-facing services. Use event-driven scaling (like KEDA) to let heavy lifters crash in peace without taking down the ship.

Stream, Don’t Swallow: When dealing with unpredictable file sizes, avoid memory buffering. Keep your Large Object Heap (LOH) clean by streaming data directly from cloud storage.

Manage Backpressure: True optimization doesn’t stop at the backend. If you make your data processing 10x faster, ensure your frontend delivery mechanisms (like SignalR) are throttled to handle the flood.

Have you ever had a seemingly stable API suddenly blow up after years in production? How did your team handle the scaling bottleneck? Let’s talk about it in the comments below!