Hi Idris, great post! I tried to do something similar: I selected all the data from an S3 file and recreated the same file locally in the exact same format. But after building the file, I noticed that the local file had fewer records than the original. As I increased the chunk size of the scan range, the difference between the S3 file and the local file shrank. Do you have any idea why this might happen? Note: the file format is CSV with no compression, and I tested with scan-range chunk sizes of 5000 and 20000 bytes. Once again, thank you for the post.
Hi,
Glad you liked the post and that it helped with your use case.
With this streaming approach, you have to keep retrieving chunks of the file from S3 until you reach the total file size. I would recommend cloning this repo and comparing it with your local code to see if you missed something 😉
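The retrieval loop described above can be sketched roughly like this. This is a minimal simulation, not the post's actual code: an in-memory bytes object stands in for the S3 object, and `fetch_range`, `S3_OBJECT`, and `CHUNK_SIZE` are illustrative names. In real code, `fetch_range` would be a ranged `GetObject` or S3 Select scan-range call, and the total size would come from `head_object()['ContentLength']`.

```python
# Minimal sketch of chunked retrieval: keep fetching until the running
# offset reaches the total file size. An in-memory bytes object stands in
# for the S3 object here.
S3_OBJECT = b"col_a,col_b\n" + b"".join(b"%d,row\n" % i for i in range(100))
CHUNK_SIZE = 64  # illustrative; the comment above tested 5000 and 20000

def fetch_range(start, end):
    """Stand-in for a ranged S3 read; end is exclusive in this sketch."""
    return S3_OBJECT[start:end]

total_size = len(S3_OBJECT)  # in real code: head_object()['ContentLength']
parts, start = [], 0
while start < total_size:                 # loop until the whole file is read
    end = min(start + CHUNK_SIZE, total_size)
    parts.append(fetch_range(start, end))
    start = end                           # next chunk starts at previous end

assert b"".join(parts) == S3_OBJECT       # every byte retrieved exactly once
```

The key invariant is that each new `start` equals the previous (exclusive) `end`, so the chunks tile the file with no gaps and no overlaps.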
Optionally, I'd also recommend checking out the sequel to this post, which covers parallel processing 😁
Thank you! I found what I was missing: I had set start_byte = end_byte + 1, losing one row per chunk. Your next article was exactly what I was looking for for the next step of my program.
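For anyone hitting the same symptom, the off-by-one above can be reproduced in isolation. This is a hypothetical illustration, not code from the post: assuming end offsets are exclusive (as in the retrieval loop the post describes), the next chunk must start at the previous end; starting at `end + 1` silently skips one byte per chunk boundary, which is enough to corrupt or drop the row spanning that byte.

```python
def chunk_ranges(total_size, chunk_size, skip_byte=False):
    """Yield (start, end) pairs covering [0, total_size); end is exclusive.
    With skip_byte=True, reproduce the bug: start at previous end + 1."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size)
        yield start, end
        start = end + 1 if skip_byte else end

data = b"id,name\n1,alpha\n2,beta\n3,gamma\n"

correct = b"".join(data[s:e] for s, e in chunk_ranges(len(data), 10))
buggy = b"".join(data[s:e] for s, e in chunk_ranges(len(data), 10, skip_byte=True))

assert correct == data          # contiguous ranges recover every byte
assert len(buggy) < len(data)   # one byte lost per chunk boundary
```

Note the convention matters: with an inclusive end (as in HTTP `Range: bytes=start-end` headers), `start = end + 1` is the correct increment; with an exclusive end, it is the bug shown here.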
Parallelize Processing a Large AWS S3 File
Idris Rampurawala ・ Jun 25 ・ 6 min read