Being fairly new to AWS, I’m still discovering the capabilities this platform has to offer. Say we have data in an AWS S3 (Simple Storage Service) bucket that we want to query or pass to a Lambda to digest or curate: what’s the best way to execute this? We tried three solutions, each with different pros and cons in how effectively and efficiently it passed data to our Lambda.
- S3 Notification Event trigger to Lambda
- Glue Crawler and Athena
- listObjectsV2 and getObject
This was the solution we initially started with: simply add an `s3:ObjectCreated:*` event trigger to the S3 bucket and have it pass each newly created object to our Lambda. This works if data is being put into S3, but it lacks the ability to query for the data or to “reprocess” the data through the Lambda. The solution worked well at lower volumes of data going into S3 (fewer than 100 records per minute), but we started to see accuracy degrade once we reached higher volumes (over 100 records per minute).
The next solution we tested was running a Glue Crawler on top of the S3 bucket and querying the data using Athena. Our Glue Crawler would generate a table based on the contents of the JSON that we put into S3, and we could then use Athena to query specific pieces of data from that JSON. I did like how you were able to use ‘athena-express’ within the Lambdas and query the data directly from there. I found Athena to be a pretty powerful tool and I wanted to love this solution, but we ran into too many issues to continue with this approach.
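For reference, querying from a Lambda with ‘athena-express’ looks roughly like this. The results bucket, database, table, and column names below are hypothetical placeholders, and the `query` call reflects the library’s documented shape rather than our exact code:

```javascript
// Build a simple Presto-SQL SELECT for Athena.
function buildQuery(table, columns, limit) {
  return `SELECT ${columns.join(', ')} FROM ${table} LIMIT ${limit}`;
}

async function queryAthena() {
  const AWS = require('aws-sdk');            // loaded lazily inside the function
  const AthenaExpress = require('athena-express');

  const athenaExpress = new AthenaExpress({
    aws: AWS,
    s3: 's3://my-athena-results-bucket',     // hypothetical results location
    db: 'curated_db',                        // hypothetical Glue database
  });

  const sql = buildQuery('events', ['id', 'payload'], 100);
  const { Items } = await athenaExpress.query(sql);
  return Items; // array of row objects
}
```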
Our first issue stems from Glue Crawlers generating tables that assume single-level JSON. When you have multi-level JSON files, the crawler puts nested objects into STRUCT-type columns, and you cannot access that data effectively after you’ve queried for it. To alleviate this problem, we simply flattened the JSON files so that our nested objects sat on a single level, e.g. object.test.one became object_test_one.
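The flattening step can be sketched as a small recursive helper (one possible implementation of the renaming described above, not our exact code):

```javascript
// Flatten nested JSON so the Glue Crawler sees single-level columns
// instead of STRUCT types, e.g. { object: { test: { one: 1 } } }
// becomes { object_test_one: 1 }.
function flatten(obj, prefix = '') {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}_${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(out, flatten(value, name)); // recurse into nested objects
    } else {
      out[name] = value; // primitives and arrays are kept as-is
    }
  }
  return out;
}
```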
Glue Crawlers work great when there is a low amount of data in S3 to crawl. As more and more data lands in S3, the crawler takes longer and longer to finish its job. With the volume of data our bucket was receiving, we noticed runs going over 30 minutes and, at times, over an hour. That isn’t feasible if we want to query our data on a scheduled Cron job, as the run durations become inconsistent.
Our last issue came into play when we realized new key-value pairs might appear in our JSON. That meant the Glue Crawler should run before each Athena query executes. The previous issue makes this unreliable: we can’t count on the Glue Crawler run finishing before our Athena query. And if the crawler hasn’t run first, we risk missing new key-value pairs added to the JSON.
The last solution we tried, and the one we currently use, is the listObjectsV2/getObject approach. We use `s3.listObjectsV2` to grab the file names within the whole bucket or a specific folder, then loop through the returned array and call `s3.getObject` to grab each file’s contents.
The only issue we encountered was that listObjectsV2 returns a maximum of 1,000 keys per call. This required us to loop the calls until a response returned fewer than 1,000 records, ensuring that no items remained for the current run. Other than that, we found the accuracy of this solution was great. It allows us to accurately account for all JSON objects per Cron job run, and it gives us the ability to reprocess or re-query data if needed.
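The pagination loop can be sketched as follows. This uses `ContinuationToken`, which the SDK returns alongside the 1,000-key pages; the bucket name and prefix are hypothetical:

```javascript
// Generic paginator: "list" is any function that takes { ContinuationToken }
// and resolves to { Contents, NextContinuationToken }. We loop until no
// continuation token is returned, i.e. the last page has been fetched.
async function listAllKeys(list) {
  const keys = [];
  let token;
  do {
    const page = await list({ ContinuationToken: token });
    keys.push(...page.Contents.map((item) => item.Key));
    token = page.NextContinuationToken;
  } while (token);
  return keys;
}

async function fetchAllObjects() {
  const AWS = require('aws-sdk');
  const s3 = new AWS.S3();
  const params = { Bucket: 'my-bucket', Prefix: 'incoming/' }; // hypothetical

  const keys = await listAllKeys((extra) =>
    s3.listObjectsV2({ ...params, ...extra }).promise()
  );

  // Grab each file's contents in turn.
  const bodies = [];
  for (const Key of keys) {
    const obj = await s3.getObject({ Bucket: params.Bucket, Key }).promise();
    bodies.push(JSON.parse(obj.Body.toString('utf-8')));
  }
  return bodies;
}
```

Separating the paginator from the S3 call keeps the loop easy to test with a stubbed `list` function, and the same pattern works for any paginated AWS API.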