Oleksandr

Posted on Feb 7, 2022

Fast Multivalue Look-ups For Huge Data Sets

#bigdata #python #numpy #csv

Synopsis
Rationale
Context
The Recipe
Caveats

Synopsis

This article provides a recipe for building fast look-up tables for huge CSV data sets in Python.

Look-up tables are meant to be read-only data structures used to search for a presence of a given key and to retrieve multiple values associated with existing keys.

Recipe is meant to be a rough approach in contrast to a precise algorithm.

Huge is relative and means significantly exceeding the amount of available RAM of a single machine, but comfortably fitting its storage.

Fast is subjective for a given use case and means quick enough to be used in stream processing or row-by-row batch processing.

The goal is achieved by calculating numeric representations of target keys, sorting and indexing the data set, doing binary searches against indices, and retrieving actual data using offsets.

Rationale

The article itself is meant to be a memo of a generic approach which I would like to share with my colleagues and my future self.

Hence, it intentionally does not provide many details. However, questions in the comments are welcome.

A hidden goal is to try dev.to publishing platform as an alternative to Medium, as the latter imposes certain difficulties for publishing technical articles.

Context

The recipe describes the essence of a subtask of a production daily batch job. The job was processing datasets consisting of parquet files having millions of rows and around 200 columns each. Among columns there were SKUs and image URIs. Both strings. All rows were grouped by SKUs, hence image URIs were stored as arrays. Arrays had up to 60 items with log-normal distribution having the center of mass between 10 and 20 items. Additionally, each dataset had multiple versions and a single SKU could have differences in URIs between versions.

The goal of the subtask was to validate and fix all URIs associated with each given SKU present in multiple versions of datasets.

URIs themselves were pointing to an internal storage, where file names were grouped by SKUs. Hence, in order to validate and fix URIs in data, a listing of actual files needed to be obtained and stored. This was done by traversing the storage using several workers that produced CSVs. Each CSV was a flat list of file names associated with SKUs. When catenated into a single final file, listings were several gigabytes in size.

So, in order to complete the goal, for each SKU from each dataset version, each image from an array of SKU's URIs had to be checked against a flat listing of actual files. Essentially, an inner join needed to be performed between each version of a dataset and a file listing, both having up to tens of millions of items.

Conceptually, in an ideal world this could be achieved by loading a dataset and a file listing as 2 pandas dataframes, explode()-ing URIs array, normalizing them, joining 2 dataframes, and then grouping the resulting dataframe by SKUs in order to get back arrays.

The issue is that doing so required up to 20 GiB of RAM. And the job was running in a Kubernetes cluster with limited resources, having only a couple of gigabytes of RAM available. Going beyond that made the job just die silently. Basically, 20 GiB of RAM consumption was not the limit, as in the future both datasets and file listings could grow unpredictably.

Hence, another approach was needed and the priorities were RAM consumption and implementation speed, as the task was urgent.

The recipe below describes one such possible approach, which proved to be quick to implement, comfortable to debug, smooth to run, and possible to control. It allowed reducing the overall RAM consumption of the whole job from 20 GiB to 2 GiB, which is ×10.

The Recipe

Create a numeric hash of a data key (e.g., SKU), make it the 1st column in the CSV dataset file. The hash itself does not need to be secure, the only criteria are speed and consistency. Python's built-in hash() will not work, as it depends on random module, hence it has no consistency. Consider using FNV-1 and pyfasthash as Python's implementation.
Use external merge sort to sort the CSV by the previously created numeric key. This will group all data by the target key. Consider using csvsort as Python's implementation.
Create an index of the data set. The index is a rectangular data structure with 3 fields:
1. Numeric key.
2. Offset of the first row with that key in the sorted dataset file.
3. Number of rows associated with that key.
Use numpy.array for that and store it as a file. At the moment,numpy allows saving only the whole array and there's no way to append data chunks to a previously saved file. Hence, just in the case the whole index cannot fit RAM at once, use npy-append-array to build the index by chunks.
Create a small wrapper for data look-ups.

It should know about the locations of the CSV file and the index file, but it should encapsulate all internals from clients, providing a simple API: given an original key (e.g., SKU), return all values associated with it. The wrapper should take care of opening and closing of both files, consider implementing the context manager for it.

Use numpy.load with mmap_mode='r' to load the index, this will ensure the index will fit RAM by using making the file memory-mapped. See numpy.memmap for a detailed description of modes.

In order to search a given original key, convert it to a numeric key the same way this was done at previous steps.

Then use numpy.searchsorted to perform a binary search in the index. If a key is found, seek the CSV file at the offset and read the count number of rows. Return them as a result in task-specific format.

Caveats

The implementation of pyfasthash uses a binary-compiled module under the hood. Currently, it does not compile on arm64 chips, because compilation flags are hardcoded. This can be an issue for users of MacBooks with M1 chips. If this is the case, confider using py-fnvhash-c or falling back to pure Python implementation during development via py-fnvhash. Or try another library or another algorithm, which will be simple enough and consistent.
csvsort is a bit stale and has several issues. Its main branch contains several nice improvements, but they were not released for a long time. Also, one may wish to tune it up a bit, for example, to change the type for numeric values from float to int. Fortunately, it is a simple enough single-module library, which can be imported into a project and modified. However, that might not be an option for everyone.
numpy.searchsorted() function accepts only 1-D arrays, but the proposed index is 2-D, so slicing needs to be done.

Also, the function always returns a position of the key inside the given array. The result is a number in range [0; len], both sides included. There's no special number like -1 to signal the item was not found. So, the result must be validated:
1. Check if the position is less than the length of the index.
2. Read the index at the position and get a (key, offset, count) tuple. If the key really equals what was passed to searchsorted(), then it is the actual result. If it differs, then nothing was found.

DEV Community

Fast Multivalue Look-ups For Huge Data Sets

Table Of Contents

Synopsis

Rationale

Context

The Recipe

Caveats

Top comments (0)

Read next

From ETL and ELT to Reverse ETL

Path.mkdir(parents, exist_ok) in Python

Django + Celery In-depth tutorial for beginners

How to pass an Array of Structs in Bigquery's parameterized queries