DEV Community

Santi Frias

IronHack BCN Challenge1, Code Part

A small program to process customer IDs: at least 30 million IDs in 75 seconds on repl.it.
It includes a detailed stress test, with optional memory profiling.

For fun, here is a problem that they insist on solving in Python: a list of at least 28 million IDs to process periodically. I start from a dummy list of 14 million IDs for tests, so that it should fit comfortably in a cloud instance with 1 GB of memory and 1 vCPU, with plenty of storage (not SSD). Since the list of IDs is renewed periodically and has to be accessible to other applications, and I don't want to pay the space and processing cost of JSON, I run some tests on the simulated list of 14 million IDs with facilities that allow data objects to be dumped and recovered (pickle, marshal, HDF5 with a B-tree).
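To get a feel for the space cost of each facility, a rough comparison like the following can help. It is only a sketch: the `ids` list is a hypothetical, much smaller stand-in for the 14 million dummy IDs, and HDF5 is left out because it needs a third-party package.

```python
import json
import marshal
import pickle

# Hypothetical dummy IDs, far fewer than the 14 million used in the post.
ids = list(range(10_000_000, 10_100_000))

# Serialize the same list with each stdlib facility.
dumps = {
    "json": json.dumps(ids).encode("utf-8"),
    "pickle": pickle.dumps(ids, protocol=pickle.HIGHEST_PROTOCOL),
    "marshal": marshal.dumps(ids),
}

for name, blob in dumps.items():
    print(f"{name:8s} {len(blob) / len(ids):5.1f} bytes per id")

# Load everything back to check that each format round-trips.
restored = {
    "json": json.loads(dumps["json"]),
    "pickle": pickle.loads(dumps["pickle"]),
    "marshal": marshal.loads(dumps["marshal"]),
}
```

The exact figures depend on the Python version and on how large the ID values are, so treat the printed numbers as indicative only.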


The space required by the data structures is clear. The sizes are obtained with getsizeof() and match what is expected in Python, including overhead. The list of 14M IDs occupies 107 MiB, which is 8 bytes per element, just like the 18M-ID stress test, in which the dump file occupies about 137 MiB (144006144 bytes). What is less clear is why, when the data objects are recovered, memory use is 5 times greater, whichever facility you choose. The evolution of memory is shown below.

[Figure: evolution of memory usage]
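One possible contributor to the gap, offered here as an assumption rather than a confirmed diagnosis: `sys.getsizeof` on a list counts only the list header and one pointer per element, not the integer objects the pointers refer to. The sketch below, on a hypothetical small list, compares the two ways of counting.

```python
import sys

# Hypothetical dummy IDs; the real list in the post has 14 million entries.
ids = list(range(10_000_000, 10_100_000))

# Shallow size: the list header plus one pointer per element.
shallow = sys.getsizeof(ids)

# Deep size: the list itself plus every int object it points to.
deep = shallow + sum(sys.getsizeof(i) for i in ids)

print(f"list only:      {shallow / len(ids):.1f} bytes per id")
print(f"list plus ints: {deep / len(ids):.1f} bytes per id")
print(f"ratio:          {deep / shallow:.1f}x")
```

If the ratio printed on your interpreter is in the same range as the growth seen when loading the dump, the per-object overhead of Python integers is a likely suspect, although loaders can also allocate temporary buffers of their own.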

In the end I wrote my own compatible reader/writer for Python arrays on disk, stored as a binary file. It appears to process 30 million IDs in 75 seconds, which is more than enough for the challenge, using a single thread.
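The reader/writer itself is not shown in this post. As a minimal sketch of the idea, assuming the IDs fit in signed 64-bit integers and are stored back to back in native byte order with no header, the stdlib `array` module is enough. The names `write_ids` and `read_ids` are hypothetical.

```python
import os
import tempfile
from array import array

TYPECODE = "q"  # signed 64-bit integers (an assumption about the ID range)
ITEM_SIZE = array(TYPECODE).itemsize


def write_ids(path, ids):
    """Write an iterable of integer IDs as raw fixed-width values, no header."""
    with open(path, "wb") as f:
        array(TYPECODE, ids).tofile(f)


def read_ids(path):
    """Read the whole file back into a compact array of integers."""
    count = os.path.getsize(path) // ITEM_SIZE
    ids = array(TYPECODE)
    with open(path, "rb") as f:
        ids.fromfile(f, count)
    return ids


# Usage: round-trip a small hypothetical list through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "ids.bin")
write_ids(demo_path, range(1000))
loaded = read_ids(demo_path)
print(len(loaded), os.path.getsize(demo_path))
```

The file then costs a fixed number of bytes per ID, and the loaded `array` keeps the values packed instead of creating one Python object per ID. Any other application reading the file has to agree on the integer width and byte order.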

These are the things that can happen when you use a tool with automatic memory management, with a great diversity of packages, and in which the per-object overhead clearly sacrifices efficiency. There is a dilemma between changing tools and performing the low-level operations yourself in some function, with a known and stable cost in time and memory. In this case we are sacrificing compatibility of the read/write object data files in exchange for efficiency in time and storage space, without using compression.

That's all.
