
Discussion on: Efficiently Streaming a Large AWS S3 File via S3 Select

Vinícius

Hi Idris, great post! I tried something similar: I selected all the data from an S3 file and recreated the same file locally in the exact same format. But after building the file, I noticed the local copy had fewer records than the original. As I increased the chunk size of the scan range, the difference between the S3 file and the local file diminished. Do you have any idea why this might happen? Note: the file format is CSV with no compression, and I tested with 5000- and 20000-byte chunk ranges. Once again, thank you for the post.

Idris Rampurawala

Hi,
Glad that you liked the post and that it helped with your use case.
With this streaming process, you have to keep retrieving file chunks from S3 until you reach the total file size. I would recommend cloning this repo and comparing it with your local code to identify anything you might have missed 😉
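For reference, here is a minimal sketch of that retrieval loop using boto3's `select_object_content` with a `ScanRange`. The bucket name, key, and chunk size below are placeholders, and the CSV serialization settings are assumptions; adjust them to match your file.

```python
import boto3

# Hypothetical bucket/key used for illustration only
BUCKET = 'my-bucket'
KEY = 'data/large_file.csv'
CHUNK_BYTES = 1_048_576  # 1 MB scan range per request

s3 = boto3.client('s3')


def stream_csv_records(bucket: str, key: str, chunk_bytes: int = CHUNK_BYTES):
    """Yield decoded CSV record payloads by scanning the object in byte ranges."""
    file_size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    start_byte = 0
    end_byte = min(chunk_bytes, file_size)

    while start_byte < file_size:
        response = s3.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType='SQL',
            Expression="SELECT * FROM s3object",
            InputSerialization={'CSV': {'FileHeaderInfo': 'USE'},
                                'CompressionType': 'NONE'},
            OutputSerialization={'CSV': {}},
            ScanRange={'Start': start_byte, 'End': end_byte},
        )
        # The response payload is an event stream; records arrive in 'Records' events
        for event in response['Payload']:
            if 'Records' in event:
                yield event['Records']['Payload'].decode('utf-8')

        # Advance contiguously: the next range starts where this one ended
        start_byte = end_byte
        end_byte = min(end_byte + chunk_bytes, file_size)


if __name__ == '__main__':
    for chunk in stream_csv_records(BUCKET, KEY):
        print(chunk, end='')
```

S3 Select processes any record that starts inside the scan range even if it extends past the end, which is why contiguous ranges reassemble the full file.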

Optionally, I would also recommend checking out the sequel to this post for parallel processing 😁

Vinícius

Thank you! I found out what I was missing: I had set start_byte = end_byte + 1, which was losing one row per chunk. Your next article was exactly what I was looking for for the next step of my program.
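For anyone hitting the same issue, the difference comes down to how the next scan range is advanced (variable names follow the comment above):

```python
# Buggy: skips the byte at end_byte, so any record starting exactly there is dropped
start_byte = end_byte + 1

# Correct: scan ranges should be contiguous; the next range starts where the last one ended
start_byte = end_byte
```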