Data Ingestion - Build Your Own "Map Reduce"?

Why map reduce

Let's say you work at Facebook; you have lots of data, and you probably need lots of map-reduce tasks.
You will use mrjob/PySpark/Spark/Hadoop. You get the point - you need one framework to rule them all.
You need a system: where will temp files be stored, how it talks to the cloud APIs, data security, erasure, multi-tenancy, etc.
You need standards - standards among the developers themselves, between developers and DevOps, etc.

Let's say, on the other hand, you're a solopreneur or a small startup, with a team of 3-4 developers at most.
You need things to work, and work fast.
You don't have tens of thousands of map-reduce jobs - probably just 1 or 2.
You won't be using Hadoop, that's for sure. You might be using one of the following:

Different approaches

LINQ

Not really map-reduce per se; more like "SQL without a SQL engine".
However, this adds the complexity of .NET to your environment;
e.g. you have to read the release notes and understand whether it runs on all your different OSes (production, staging, developer machines).
You also need to learn C#: loading from files, handling different encodings, saving, iterators, etc.
If you're not proficient with C#, this could be a one-time investment that isn't worth it.

Mrjob

Pros: a Python-native library; easy to debug (using the inline runner), can run locally (e.g. multi-process on your own machine),
or run on Hadoop, Dataproc (apparently "Dataproc is a managed Spark and Hadoop service..."), etc.
However, there are lots of moving parts and different configuration options.
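
For reference, a minimal mrjob job is roughly shaped like this (the classic word count; the file and input names here are just placeholders, not part of this article's actual setup):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum all counts emitted for the same word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

You can run it locally with the inline runner: python mr_word_count.py -r inline input.txt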

Custom-made map-reduce

Let's go to the UCI Machine Learning Repository website (2015 is calling..),
choose some dataset, and test.

Some notes:
We don't need SHA-256, and not even Base64; nothing bad will happen if keys are not distributed perfectly evenly.
We could take MMH3 (MurmurHash3); googling "python murmurhash" gives 2 interesting results, and since both wrap the same C++ code, let's take the one with the most stars.
Other options would be to simply do (% NUM_SHARDS) on an integer key, or even a right shift (however, that requires the shard count to be a power of 2).
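
A minimal sketch of such a shard assignment, assuming the mmh3 package and an arbitrary NUM_SHARDS (not the article's actual code):

import mmh3

NUM_SHARDS = 16  # assumed value; any small positive integer works

def shard_for(key: str) -> int:
    # MurmurHash3 is fast and well-distributed, but not cryptographic;
    # Python's % always returns a non-negative result for a positive divisor,
    # so a signed hash value is fine here.
    return mmh3.hash(key) % NUM_SHARDS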

mini setup script:
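
The original embed isn't reproduced here; a rough sketch of what such a setup could look like (package names inferred from the post, the dataset URL is left as a placeholder):

# setup.sh (sketch)
python3 -m venv venv
. venv/bin/activate
pip install mmh3 tqdm
# fetch a dataset from the UCI Machine Learning Repository
# wget <dataset URL>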

and 2 Python test scripts:
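
The original gists aren't embedded here either; the following is a rough reconstruction of their likely shape based on the description above (file name, key choice, and shard count are assumptions):

# test.py (sketch): single process, hash every line and bucket it into a shard
import mmh3
from tqdm import tqdm

NUM_SHARDS = 16  # assumed

def shard_for(line: str) -> int:
    return mmh3.hash(line) % NUM_SHARDS

counts = [0] * NUM_SHARDS
with open("dataset.csv") as f:  # placeholder file name
    for line in tqdm(f):
        counts[shard_for(line)] += 1

# test_imap.py (sketch): same thing, but hashing is offloaded to a process pool
import mmh3
from tqdm import tqdm
from multiprocessing import Pool

NUM_SHARDS = 16  # assumed

def shard_for(line: str) -> int:
    return mmh3.hash(line) % NUM_SHARDS

if __name__ == "__main__":
    counts = [0] * NUM_SHARDS
    with open("dataset.csv") as f, Pool() as pool:
        # every line is pickled, sent to a worker, hashed, and the result is
        # sent back - this per-item IPC is what the numbers below measure
        for shard in tqdm(pool.imap(shard_for, f)):
            counts[shard] += 1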

Results:
imap runs much slower;
we can look at the it/s counter from tqdm to see that:

# test.py sample after 4 seconds:
2801493it [00:04, 566075.99it/s]

# test_imap.py sample after 4 seconds:
73439it [00:04, 18754.44it/s]

We can see that the non-imap version is ~30x faster!

Q&A

Q: Why setup.sh and not a requirements.txt file?
A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same library versions (e.g. for security, etc.).

Q: Why MMH3 and not SHA-256?
A: This is not a security product, so we don't need a cryptographic hash; we just need a nice distribution of keys, and we want it to be fast.

Q: Why is imap slower than a single process?
A: Probably because the imap version has lots of overhead due to IPC;
the gain from offloading the (allegedly) "heavy lifting" hash calculation to an external process is erased by the IPC cost.

Q: Why?
A: Using a process pool might be worth it if the task were more CPU-bound; here the task is more IO-bound, and the MMH hash isn't heavy enough to justify offloading it.

Q: Are we sure about it?
A: We could use py-spy and see:
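
For example, something along these lines (py-spy is installed separately; the script name is just the test script sketched above):

pip install py-spy
# sample the running script and write a flame graph to profile.svg
py-spy record -o profile.svg -- python test_imap.py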
(py-spy profile screenshot)

Q: Conclusion?
A: It depends.
Also - it depends on the size of the file.
Also - it depends on the post-processing done on each shard.
Conclusion - test mrjob as well; it might have better IPC.
