Let's say you work at Facebook; you have lots of data and probably need lots of map-reduce tasks.
You'll use mrjob/PySpark/Spark/Hadoop. You get the point - you need one framework to rule them all.
You need a system: where will temp files be stored, APIs with the cloud, data security, erasure, multi-tenancy, etc.
You need standards - standards between developers themselves, between developers and devops, etc.;
Let's say, on the other hand, you're a solopreneur or a small startup - a team of 3-4 developers max.
You need things to work, and work fast.
You don't have 10,000s of map-reduce jobs - probably just 1 or 2.
You won't be using Hadoop, that's for sure. You might be using:
Not really map-reduce per se; more like "SQL without an SQL engine".
However, this adds the complexities of .NET to your environment;
e.g. reading release notes to understand whether you can run it on your different OSes (production, staging, developers' machines).
You'd also need to learn C#: loading from files, different encodings, saving, iterators, etc.
If you're not proficient with C#, this could be a one-time investment that won't be worth it.
Pros: a Python-native lib; easy to debug (inline) and to run locally, e.g. multi-process on the local machine,
or to use Hadoop, Dataproc (it seems that "Dataproc is a managed Spark and Hadoop service...") etc.
However, there are lots of moving parts and different configuration options.
Let's go to the UCI Machine Learning Repository (2015 is on the phone..), choose some dataset, and test.
We don't need SHA-256, and not even Base64; nothing bad will happen if the keys don't distribute perfectly evenly.
We could take MMH3; googling "python murmurhash" gives 2 interesting results, and since both wrap the same C++ code, let's take the one with the most stars.
Other options would be to simply do (% NUM_SHARDS), or even a right shift (though that requires the shard count to be a power of 2).
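A quick sketch of both sharding options (zlib.crc32 stands in here for mmh3 so the snippet needs no pip install; any fast non-cryptographic hash works the same way):

```python
import zlib  # stdlib stand-in for mmh3

NUM_SHARDS = 8  # a power of two, so the shift trick below is valid

def shard_mod(key):
    # modulo works for any shard count
    return zlib.crc32(key.encode()) % NUM_SHARDS

def shard_shift(key):
    # keep the top 3 bits of the 32-bit hash;
    # valid only because NUM_SHARDS == 2**3
    return zlib.crc32(key.encode()) >> (32 - 3)

for key in ("a", "b", "user-42"):
    assert 0 <= shard_mod(key) < NUM_SHARDS
    assert 0 <= shard_shift(key) < NUM_SHARDS
```

With mmh3 you'd replace `zlib.crc32(key.encode())` with `mmh3.hash(key, signed=False)` and keep the rest as-is.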
mini setup script:
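The setup script isn't reproduced here; a hypothetical sketch of what it presumably does, assuming mmh3 and tqdm are the only libraries the tests need:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of setup.sh - the real script isn't shown here.
python3 -m venv venv
source venv/bin/activate
pip install mmh3 tqdm
```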
and 2 python test scripts:
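The two test scripts aren't reproduced here either; a self-contained sketch of the comparison they presumably make (names are illustrative, and zlib.crc32 stands in for mmh3 so the sketch runs anywhere):

```python
import multiprocessing as mp
import zlib

NUM_SHARDS = 16

def shard_of(line):
    # non-cryptographic hash -> shard id (the post uses mmh3 instead)
    return zlib.crc32(line.encode()) % NUM_SHARDS

def single_process(lines):
    # test.py: hash every line in the main process
    return [shard_of(line) for line in lines]

def with_imap(lines):
    # test_imap.py: offload hashing to a worker pool, one item per message
    with mp.Pool(4) as pool:
        return list(pool.imap(shard_of, lines))

if __name__ == "__main__":
    lines = [f"row-{i}" for i in range(2_000)]
    # both versions must agree on the shard assignment
    assert single_process(lines) == with_imap(lines)
```

In the real scripts you'd wrap the iteration with tqdm and read lines from the downloaded dataset file rather than an in-memory list.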
imap runs much slower; we can see it in the it/s reported by tqdm:
# test.py sample after 4 seconds:
2801493it [00:04, 566075.99it/s]
# test_imap.py sample after 4 seconds:
73439it [00:04, 18754.44it/s]
We can see the non-imap version is ~30x faster!
Q: Why setup.sh and not requirements.txt file?
A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same lib versions (e.g. for security).
Q: Why MMH3 and not SHA-256?
A: This is not a security product, so we don't need a cryptographic hash; we just need a nice distribution of keys, and we want it to be fast.
Q: Why is imap slower than a single process?
A: Probably because the imap version has lots of overhead from IPC;
whatever we gain by offloading the (alleged) "heavy lifting" of the hash calculation to an external process is erased by the IPC.
A: Using a process pool might be worth it if the task were more CPU-bound; here the task is more IO-bound, and the overhead of the MMH hash doesn't justify it.
Q: Are we sure about it?
A: We could use py-spy and see.
A: It depends - also on the size of the file, and on the post-processing of each shard.
Conclusion - test mrjob as well; it might have better IPC.