Table Of Contents
This article provides a recipe for building fast look-up tables for huge CSV data sets in Python.
Look-up tables are meant to be read-only data structures used to search for a presence of a given key and to retrieve multiple values associated with existing keys.
Recipe is meant to be a rough approach in contrast to a precise algorithm.
Huge is relative and means significantly exceeding the amount of available RAM of a single machine, but comfortably fitting its storage.
Fast is subjective for a given use case and means quick enough to be used in stream processing or row-by-row batch processing.
The goal is achieved by calculating numeric representations of target keys, sorting and indexing the data set, doing binary searches against indices, and retrieving actual data using offsets.
The article itself is meant to be a memo of a generic approach which I would like to share with my colleagues and my future self.
Hence, it intentionally does not provide many details. However, questions in the comments are welcome.
A hidden goal is to try dev.to publishing platform as an alternative to Medium, as the latter imposes certain difficulties for publishing technical articles.
The recipe describes the essence of a subtask of a production daily batch job. The job was processing datasets consisting of
parquet files having millions of rows and around 200 columns each. Among columns there were
SKUs and image
URIs. Both strings. All rows were grouped by SKUs, hence image URIs were stored as arrays. Arrays had up to 60 items with log-normal distribution having the center of mass between 10 and 20 items. Additionally, each dataset had multiple versions and a single SKU could have differences in URIs between versions.
The goal of the subtask was to validate and fix all URIs associated with each given SKU present in multiple versions of datasets.
URIs themselves were pointing to an internal storage, where file names were grouped by SKUs. Hence, in order to validate and fix URIs in data, a listing of actual files needed to be obtained and stored. This was done by traversing the storage using several workers that produced CSVs. Each CSV was a flat list of file names associated with SKUs. When catenated into a single final file, listings were several gigabytes in size.
So, in order to complete the goal, for each SKU from each dataset version, each image from an array of SKU's URIs had to be checked against a flat listing of actual files. Essentially, an inner join needed to be performed between each version of a dataset and a file listing, both having up to tens of millions of items.
Conceptually, in an ideal world this could be achieved by loading a dataset and a file listing as 2
explode()-ing URIs array, normalizing them, joining 2 dataframes, and then grouping the resulting dataframe by SKUs in order to get back arrays.
The issue is that doing so required up to 20 GiB of RAM. And the job was running in a Kubernetes cluster with limited resources, having only a couple of gigabytes of RAM available. Going beyond that made the job just die silently. Basically, 20 GiB of RAM consumption was not the limit, as in the future both datasets and file listings could grow unpredictably.
Hence, another approach was needed and the priorities were RAM consumption and implementation speed, as the task was urgent.
The recipe below describes one such possible approach, which proved to be quick to implement, comfortable to debug, smooth to run, and possible to control. It allowed reducing the overall RAM consumption of the whole job from 20 GiB to 2 GiB, which is ×10.
Create a numeric hash of a data key (e.g., SKU), make it the 1st column in the CSV dataset file. The hash itself does not need to be secure, the only criteria are speed and consistency. Python's built-in
hash()will not work, as it depends on
randommodule, hence it has no consistency. Consider using
pyfasthashas Python's implementation.
Use external merge sort to sort the CSV by the previously created numeric key. This will group all data by the target key. Consider using
csvsortas Python's implementation.
Create an index of the data set. The index is a rectangular data structure with 3 fields:
- Numeric key.
- Offset of the first row with that key in the sorted dataset file.
- Number of rows associated with that key.
numpy.arrayfor that and store it as a file. At the moment,
numpyallows saving only the whole array and there's no way to append data chunks to a previously saved file. Hence, just in the case the whole index cannot fit RAM at once, use
npy-append-arrayto build the index by chunks.
Create a small wrapper for data look-ups.
It should know about the locations of the CSV file and the index file, but it should encapsulate all internals from clients, providing a simple API: given an original key (e.g., SKU), return all values associated with it. The wrapper should take care of opening and closing of both files, consider implementing the context manager for it.
mmap_mode='r'to load the index, this will ensure the index will fit RAM by using making the file memory-mapped. See
numpy.memmapfor a detailed description of modes.
In order to search a given original key, convert it to a numeric key the same way this was done at previous steps.
numpy.searchsortedto perform a binary search in the index. If a key is found, seek the CSV file at the
offsetand read the
countnumber of rows. Return them as a result in task-specific format.
The implementation of
pyfasthashuses a binary-compiled module under the hood. Currently, it does not compile on
arm64chips, because compilation flags are hardcoded. This can be an issue for users of MacBooks with M1 chips. If this is the case, confider using
py-fnvhash-cor falling back to pure Python implementation during development via
py-fnvhash. Or try another library or another algorithm, which will be simple enough and consistent.
csvsortis a bit stale and has several issues. Its main branch contains several nice improvements, but they were not released for a long time. Also, one may wish to tune it up a bit, for example, to change the type for numeric values from
int. Fortunately, it is a simple enough single-module library, which can be imported into a project and modified. However, that might not be an option for everyone.
numpy.searchsorted()function accepts only 1-D arrays, but the proposed index is 2-D, so slicing needs to be done.
Also, the function always returns a position of the key inside the given array. The result is a number in range
[0; len], both sides included. There's no special number like
-1to signal the item was not found. So, the result must be validated:
- Check if the position is less than the length of the index.
- Read the index at the position and get a (
count) tuple. If the
keyreally equals what was passed to
searchsorted(), then it is the actual result. If it differs, then nothing was found.
Top comments (0)