Discussion on: Beyond CSV files: using Apache Parquet columnar files with Dask to reduce storage and increase performance. Try it now!

Jorge PM • Edited

I believe Dask doesn't support reading from zip. See here: github.com/dask/dask/issues/2554#i....

Looking around, it does seem able to read a single gzipped file, but it doesn't look straightforward. And if you need to decompress the file first, any storage gains are nullified.
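
For reference, a minimal sketch of what that looks like, assuming a hypothetical `data.csv.gz`: Dask can open a single gzipped CSV, but gzip isn't splittable, so the whole file lands in one partition and the read can't be parallelized.

```python
import dask.dataframe as dd

# Sketch only: "data.csv.gz" is a hypothetical file name.
# gzip is not splittable, so Dask requires blocksize=None and
# the entire file ends up in a single partition.
df = dd.read_csv("data.csv.gz", compression="gzip", blocksize=None)
print(df.npartitions)  # 1 -> no parallel read, so little performance gain
```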

I would be very interested to try it if you know a way, and to compare size and performance. Normally storage is cheap, at least a lot cheaper than other resources, so performance is the priority in most cases and compression is a nice-to-have (it depends on the use case, of course).
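
If someone wants to try that comparison, here is a rough sketch under stated assumptions: "data.csv" is a hypothetical input, the output directory names are made up, and writing Parquet assumes pyarrow (or fastparquet) is installed.

```python
import glob
import os
import time

import dask.dataframe as dd

# Write the same data as gzipped CSV and as Parquet,
# then compare bytes on disk and full-read times.
df = dd.read_csv("data.csv")
df.to_csv("csv_gz/part-*.csv.gz", compression="gzip")
df.to_parquet("parquet_out")

def total_bytes(pattern):
    # Sum the on-disk size of every file matching the glob pattern.
    return sum(os.path.getsize(p) for p in glob.glob(pattern))

print("gzip csv bytes:", total_bytes("csv_gz/*"))
print("parquet bytes: ", total_bytes("parquet_out/*"))

# Time a full read of each format; blocksize=None because gzip
# is not splittable.
t0 = time.perf_counter()
dd.read_csv("csv_gz/part-*.csv.gz", compression="gzip", blocksize=None).compute()
print("gzip csv read:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
dd.read_parquet("parquet_out").compute()
print("parquet read: ", time.perf_counter() - t0, "s")
```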

Paddy3118 • Edited

Sadly I don't use Dask, but in the past I have used zcat on the Linux command line to stream a CSV to stdin, so a script could process it without needing the whole dataset uncompressed in memory or on disk.
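
To make that pattern concrete, a minimal sketch, assuming a hypothetical `data.csv.gz` and a Python script reading stdin with the csv module:

```python
# Usage (file name is hypothetical):
#   zcat data.csv.gz | python process.py
import csv
import sys

reader = csv.reader(sys.stdin)
header = next(reader)  # consume the header row

rows = 0
for row in reader:     # rows stream one at a time; nothing is buffered
    rows += 1          # stand-in for real per-row processing
print(f"processed {rows} rows", file=sys.stderr)
```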

Jorge PM

Cool, I can totally see a use case for that, streaming into something like Apache Kafka. I'll prototype a couple of things and see if it can become another little article. Thanks!
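
For anyone curious before that article exists, a hedged sketch of the zcat-to-Kafka idea, assuming the kafka-python package, a broker at localhost:9092, and a made-up topic name:

```python
# Hypothetical sketch: run as `zcat data.csv.gz | python produce.py`.
import sys

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for line in sys.stdin:
    # Each CSV line becomes one message; encode to bytes for the producer.
    producer.send("csv-rows", line.rstrip("\n").encode("utf-8"))
producer.flush()  # make sure buffered messages are delivered before exit
```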