DEV Community

Discussion on: Are you still using Python 2?

 
Rob Hoelz

So I did a little bit of profiling on a subset of my wikidump data - a quick glance showed that bz2 decompression was way worse on Python 3. When I tried on decompressed files, the Python 3 cost was only 20% rather than 100% - so that's progress! I'll probably take a deeper look tomorrow.
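For anyone curious, a minimal self-contained sketch of that kind of profiling looks roughly like this (a synthetic in-memory payload stands in for the actual wikidump, which is multi-gigabyte):

```python
import bz2
import cProfile
import io
import pstats

# Synthetic stand-in for the dump: 100k lines, bz2-compressed in memory.
payload = bz2.compress(b"some line of wiki xml\n" * 100_000)

def read_all_lines(data):
    # Mimics the dump-reading loop: iterate a BZ2File line by line.
    with bz2.BZ2File(io.BytesIO(data)) as f:
        return sum(1 for _ in f)

profiler = cProfile.Profile()
profiler.enable()
n_lines = read_all_lines(payload)
profiler.disable()

# Show the top 5 hotspots by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Running this under both interpreters and comparing where the cumulative time goes is one way to narrow down whether the regression is in the bz2 layer or the line-splitting layer.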

Thread Thread
 
Rob Hoelz

I should clarify that I'm on Python 2.7.15 and 3.6.5 - I wonder if the newly released 3.7 would help?

Thread Thread
 
rhymes

> When I tried on decompressed files, the Python 3 cost was only 20% rather than 100% - so that's progress! I'll probably take a deeper look tomorrow.

Yeah, and if you can isolate the issue with a gist that I can take a look at I'm happy to do so!

> I wonder if the newly released 3.7 would help?

I see nothing related to bz2/bzip2 on the Python 3.7 "What's New" page: docs.python.org/3.7/whatsnew/3.7.html

Are you using Linux, macOS, or Windows?

Thread Thread
 
Rob Hoelz

Linux

Thread Thread
 
Rob Hoelz

Also, if you want to try this out, I'm using a dump file from the Russian Wikipedia (such as dumps.wikimedia.org/ruwiki/2018062...), and just extracting the list of documents via WikiXMLDumpFile(filename).getWikiDocuments() illustrates the difference in timing. You'll need to patch the code to behave with Python 3, though!

Thread Thread
 
rhymes

Can you maybe just commit your branch for Python 3? You'll save me some work ;)

Thread Thread
 
Rob Hoelz

Sure - I can fork it and submit my changes there after work!

Thread Thread
 
rhymes

Thank you!

Thread Thread
 
Rob Hoelz

Thanks for offering to take a look; here's my fork with the Python 3 change: github.com/hoelzro/WikiCorpusExtra...

...and here's a gist using it: gist.github.com/hoelzro/80561443fe...

After digging in a little bit, I noticed that the bz2.BZ2File class is substantially slower in Python 3; using bz2.decompress or bz2.BZ2Decompressor is a lot faster! The former, of course, requires enough memory to hold both the uncompressed and compressed contents, and the latter is more cumbersome to use - if you implement the line-splitting logic on top of it yourself, you might reintroduce the same overhead that bz2.BZ2File has. I'm curious why BZ2File is slower in Python 3 - maybe I'll have a chance to dig in further tomorrow!
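To make the trade-off concrete, here's a rough self-contained sketch of the three approaches (a small synthetic payload stands in for the dump; the manual line splitting in the last one is exactly the part where BZ2File-style overhead could sneak back in):

```python
import bz2
import io

data = bz2.compress(b"line\n" * 50_000)

# 1. bz2.BZ2File: convenient streaming line iteration - the slow path here.
with bz2.BZ2File(io.BytesIO(data)) as f:
    n_file = sum(1 for _ in f)

# 2. bz2.decompress: fast, but holds compressed AND uncompressed data in memory at once.
n_bulk = bz2.decompress(data).count(b"\n")

# 3. bz2.BZ2Decompressor: streaming and fast, but line splitting is on you.
decomp = bz2.BZ2Decompressor()
stream = io.BytesIO(data)
buf = b""
n_stream = 0
while chunk := stream.read(64 * 1024):
    buf += decomp.decompress(chunk)
    lines = buf.split(b"\n")
    buf = lines.pop()  # keep the trailing partial line for the next chunk
    n_stream += len(lines)

assert n_file == n_bulk == n_stream == 50_000
```

All three yield the same line count; the interesting part is wrapping each in timeit under 2.7 and 3.6 to see how the gap shifts.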