When it comes to data processing, Pandas is one of the handiest and most capable libraries in Python. It has functions that can efficiently read and write structured data from and to a wide variety of sources (CSV files, databases, flat files, etc.).
But when the source data grows too large to fit in memory, reading it all at once can raise a memory error. Pandas has a nice way of dealing with this: reading the data in chunks. It returns a TextFileReader object that can be iterated over, so the data can be read and processed chunk by chunk.
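Chunked reading can be sketched like this. An in-memory CSV stands in for a large file on disk (the column names and chunk size here are made up for illustration):

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a large file on disk.
csv_data = io.StringIO("a,b,c\n" + "\n".join(f"{i},{i * 2},{i * 3}" for i in range(10)))

# Passing chunksize makes read_csv return an iterable TextFileReader
# instead of loading everything into a single DataFrame.
total_rows = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_rows += len(chunk)  # process each chunk here

print(total_rows)  # 10
```

Each `chunk` is an ordinary DataFrame with at most `chunksize` rows, so only one chunk needs to be in memory at a time.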
Let's take an example of loading data from CSV to MySQL. In this example we will load a CSV file with 500,000 rows and 3 columns into a MySQL table.
We will use the pandas module to read the data from the CSV file and the sqlalchemy module to insert it into MySQL.
Processing data in chunks is still sequential, which means each chunk is processed and loaded one after the other.
- Create a sqlalchemy engine using the database URI
- Build an insert query for each row in the CSV file
- Use the sqlalchemy engine to load the data into the MySQL table
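The steps above can be sketched as follows. To keep the sketch runnable without a database server, it uses an in-memory SQLite URI; for MySQL you would swap in something like `mysql+pymysql://user:password@host/dbname` (the URI, table name, and column names here are assumptions, not from the original):

```python
import io

import pandas as pd
from sqlalchemy import create_engine, text

# Step 1: create a sqlalchemy engine from the database URI.
# SQLite in-memory is used here so the sketch runs anywhere;
# a MySQL URI would look like "mysql+pymysql://user:pass@host/db".
engine = create_engine("sqlite://")

# A small in-memory CSV stands in for the real 500,000-row file.
csv_data = io.StringIO("a,b,c\n" + "\n".join(f"{i},{i},{i}" for i in range(100)))

# Steps 2-3: read the CSV in chunks and insert each chunk through the engine.
for chunk in pd.read_csv(csv_data, chunksize=25):
    chunk.to_sql("target_table", engine, if_exists="append", index=False)

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM target_table")).scalar()
print(count)  # 100
```

`to_sql` generates the insert statements for us; each iteration of the loop blocks until its chunk is fully written, which is exactly the sequential behavior described above.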
So how do we speed things up?
Multiprocessing is one way.
Python's multiprocessing module is one of the easiest ways to spin up multiple processes for parallel computing.
We will be using apply_async from the multiprocessing module.
The apply_async function is a variant of apply() that applies the provided function to the data but does not wait for the call to complete. This way the code does not have to wait for one chunk of data to finish processing before starting on the next.
- Create a pool of workers; each worker handles the load of one data chunk
- For each data chunk, submit the insert function with its parameters via apply_async
- The wait method blocks until the insert function finishes or the timeout is reached
With 4 cores and a CSV file with 50,000 rows and 3 columns, the multiprocessing version loaded the data in 45 seconds, while the linear version was almost 10x slower.