Situation
You are tasked with implementing a feature that allows users to download a massive CSV file, weighing in at 10 GB. The data must be sourced from a database, undergoes necessary transformations, and is served to the user as a downloadable file that should start immediately upon selecting the download option.
It's important to note that the data is subject to change at regular intervals and users may not always require a download, so it's not feasible to pre-push the data to S3. Instead, this process needs to be available on-demand to ensure timely and efficient delivery of the necessary data to users.
Solution - TL;DR
In order to handle this sitution effectively, one can opt in for the Streaming data from API since it provides us the lot of technical advantages yet so simple. This is one of the best solution for this usecase. It provides a great deal of user experience, since the download in this approch starts instantaneously.
Common problems with naive solution
While it may seem like a simple solution, writing an API to handle this scenario could prove problematic, particularly when dealing with large datasets. In such cases, the API may encounter special cases that result in failure, such as a 504 Gateway Timeout if the time limit is exceeded.
Additionally, if we plan to handle huge datasets with this approach, we may need to resort to vertical scaling, which could be excessive and ultimately unsustainable.
Finding the solution
Let's start finding optimized solution for this problem, compare them objectively. So when considering for best architectures, we should consider the usual factors, i.e scalability, availability, fault tolerant, consistency and security.
There are various factors that affect our solutions they are as follows,
- Compute resources (CPU and Memory)
- Database Disk IO
- Network IO
- Data size
- Technical difficulties
After a good amount of research, I was able to list down some of the best solution approaches, they are as follows,
- Asynchronous approach with queues and consumers
- Event based approach with websockets
- Streaming data from API
1. Asynchronous approach with queues and consumers
In this method, whenever a user clicks on the download, immediately that information is pushed to a queue. Then those requests are fulfilled by one of the consumer/worker process. The worker basically fetches the request from the queue, starts proceeding with the data fetching and data transformation.
This process can take x seconds which is directly proportional to the size of the data. After transformation, it will push it to S3 and then fetch the presigned URL so that it can be given to the client. Meanwhile the client would be polling backend server with an interval of y seconds for this presigned URL.
Here's a high level architecture view of this solution,
Pros: When implemented properly, this solution would provide great scalability with n workers, high availability, and failure handling with acknowledgement in queues.
Cons: The user experience would not be good. The infrastructure to setup these components might cost hefty.
2. Event based approach with websockets
In this approach, instead of using conventional REST APIs, we're gonna use Websockets to make use of event driven design to eliminate the polling and the compute processing is also done at the same backend server. This can also be said to be as asynchronous, based on the design as well.
The websocket is opened once the user requests for the download, the server starts to work on the data fetching, data transformation logics, computes the resultant file and pushes it to S3. After fetching the presigned URL, it is being sent the client for download.
Here's an overview of the design,
Pros: Easy to setup, cost effective.
Cons: Infrastructure scaling for compute resource is not flexible, since processing happens in backend server.
3. Streaming data from API
This approach is more straight forward, it can be achieved with the conventional API with a small change. Instead of doing the heavy lifting in a single shot and returning the response, we will be chunking and stream the response.
As a user perspective, immediately after the trigger, the file starts to download. The download speed is directly proportional to the above mentioned parameters. There should be a tradeoff in terms of performance for user experience. Here's the overview of the streaming API design.
Some technicalities needs to be crafted very carefully based on the infrastructure resources, such as how much records can be processed in a single chunk. A good metric is to chunk the data based on the data size.
100 MB data can be processed in a single chunk
A simple implementation of streaming API with Flask - python is as follows, flask docs
@app.route('/large.csv')
def generate_large_csv():
def generate():
for row in iter_all_rows():
yield f"{','.join(row)}\n"
return app.response_class(generate(), mimetype='text/csv')
Pros: Easy to setup, cost effective, best user experience etc.
Cons: Infrastructure scaling for compute resource is not flexible, since processing happens in backend server.
Conclusion
All the three approches offers a powerful, best in class solution, if implemented properly, but opting 1of these 3 should be based on your usecase/needs, you can use anyone of these solution to handle the large files. The third approach of streaming data was matching best for my usecase, so I went with the same. Ofcourse there can be issues at the implementation level, which are ignored in this article for the sake of simplicity.
Share your perspective in the comments.
Top comments (0)