Katz Ueno
How to filter S3 bucket to download millions of log files

I've started to have more occasions to download log files that are stored in an S3 bucket.

Whether they are Nginx or Apache logs aggregated from an Auto Scaling group, or CloudFront access logs.

The number of those files in the S3 bucket is enormous. It's practically impossible to browse through them in the Management Console.

It's also impossible to download all of the logs locally.

Here is how I usually do it.

Try to narrow down the date and time

Make sure that you minimize the date and time range of the log that you want to acquire.

I have a good habit of logging what I've done, with timestamps.

For example, when you deploy new code, create a Slack thread and post a comment at each step of the deployment. You don't need to be too precise; Slack adds timestamps automatically.

You could also use Google Sheets or Notion and their checkbox feature, which records when you check each checkbox.

We use Backlog, a Japanese project management SaaS, which has a checklist feature we use when deploying code to production or upgrading a CMS.

Find a block of logs chronologically through S3 Management Console

Log in to the AWS Management Console and go to the S3 page.

Browse the S3 bucket.

Try to find the proper prefix.

For example, CloudFront log file names look like the following:

[CloudFront ID].YYYY-MM-DD-##.[some ID].gz

For CloudFront logs, you should be able to get your CloudFront distribution ID, and you can guess the date. The ## part is the hour (in UTC) when the log file was delivered.

However, one entire day of access can still produce a large number of log files.

You still want to find out which portion of the ## (hour) files you want to get.

So you filter the S3 bucket by [CloudFront ID].YYYY-MM-DD-, and from the result, sort by the Last modified column.

Then find out which ## files you want to get.

Let's say you now know you want all of the E0EEEEEEEEEEE0-10-05-05* log files.
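
If you prefer the CLI over the Management Console, you can get a similar view with aws s3 ls. A quick sketch, where the bucket and folder names are placeholders matching the examples below:

# List every object whose key starts with the date prefix,
# along with its last-modified time and size
aws s3 ls "s3://S3-Bucket-Name/Folder-if-any/E0EEEEEEEEEEE0-10-05-" --human-readable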

Get an IAM Access Key and Secret to access the S3 bucket logs

Issue an IAM access key, or log in to an EC2 instance which has an IAM role with access to the S3 bucket.
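
Before running any copy commands, it's worth confirming the CLI actually sees valid credentials. A minimal check, assuming the AWS CLI is installed (skip aws configure if you're on an EC2 instance with an IAM role attached):

# Store the access key ID and secret in the default profile (interactive prompts)
aws configure

# Confirm which identity the CLI is using
aws sts get-caller-identity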

Run aws s3 cp

Run the aws s3 cp command to fetch the log files to your local machine or to a remote EC2 instance.

For aws s3 commands, you can only filter by path. You cannot use a wildcard (*) in the s3:// URL; you need to use the --exclude and --include options.

First, you exclude all files, then you specify which pattern of object names you want to include.

aws s3 cp s3://S3-Bucket-Name/Folder-if-any/ ./ --recursive --exclude "*" --include "[CloudFront ID].YYYY-MM-DD-05*" 

Using the earlier example:

aws s3 cp s3://S3-Bucket-Name/Folder-if-any/ ./ --recursive --exclude "*" --include "E0EEEEEEEEEEE0-10-05-05*" 
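
If you're not sure the pattern matches exactly what you expect, aws s3 cp also supports a --dryrun flag, which prints the operations it would perform without downloading anything:

# Preview which objects the include pattern matches, without copying them
aws s3 cp s3://S3-Bucket-Name/Folder-if-any/ ./ --recursive --exclude "*" --include "E0EEEEEEEEEEE0-10-05-05*" --dryrun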

Then, you should be able to download all of your desired log files.
