When I first set out to use Athena for querying CloudFront logs, I thought it would be a breeze. Just set up the Glue database, create a table, and start querying, right? Wrong. The problem hit me when I realized the log files had a flat structure—no partitions, no hierarchy. Every query ended up scanning massive amounts of data, even when I just needed a small slice of information.
To make things worse, manually adding partitions for each batch of logs felt like an endless chore. It was clear that this setup wasn’t sustainable for our growing traffic. But then came AWS's announcement: Apache Parquet support for CloudFront logs, along with Hive-compatible folder structures. That’s when it clicked—if I combined this with Athena Partition Projection, it would be a total breakthrough.
CloudFront Logs: Then and Now
Previously, CloudFront logs were delivered as gzipped plain text files. That format was simple, but it wasn't optimized for querying large datasets, and the files landed in S3 in a flat structure with no partitioning.
Old log example (simplified):
2024-11-25T15:00:00Z,192.168.1.1,GET,www.example.com,/index.html,200,120,Chrome
Flat file naming, with no folder hierarchy:
s3://cloudfront--logs/E123ABC456DEF-2024-11-25-12-00-00-abcdef0123456789.gz
After the update, CloudFront logs can be delivered in Apache Parquet format. Parquet is a columnar storage format: the same log data is compressed and stored by column rather than by row, which significantly reduces storage space and lets Athena read only the columns a query actually needs.
Hive-Style Partitioning
Hive-style refers to a folder structure where data is organized into directories named after partition keys and their values, like key=value/.
CloudFront supports Hive-style partitioning when delivering logs to S3. This means your logs are stored in a folder structure like this:
s3://cloudfront--logs/year=2024/month=11/day=25/hour=15/
Even better, you can customize the partitioning fields to match your needs. For example, partition by `year`, `month`, `day`, or even by `DistributionId`:
Example: `{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/`
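For instance, reusing the made-up distribution ID from the earlier example, that template on its own drops each batch of logs under a prefix like:
s3://cloudfront--logs/E123ABC456DEF/2024/11/25/15/
Turn on Hive-compatible prefixes and the same fields are written as key=value folders instead, as shown above.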
This flexibility makes querying faster and perfectly tailored to your use case.
Why does Partition Projection matter?
Imagine this: you’ve got a huge collection of photos on your phone. You’re trying to find a picture from your last vacation, but the photos aren’t organized. You’d have to scroll endlessly, right? Now, what if you could group them by year, month, and day? Finding that one photo becomes way easier.
That’s kind of what partitioning does for your data in Athena. It organizes your data into neat folders (or partitions), so Athena only looks where it needs to when running a query.
But here’s the problem: if your data is growing super fast (like CloudFront logs), you’d need to constantly tell Athena about all the new folders. Sounds like a pain, right? That’s where Partition Projection saves the day! Think of it as an automatic system that figures out how your data is organized without you lifting a finger.
How does "Partition Projection" work?
When you create a table in Athena, instead of manually adding partitions every time new data arrives, you just describe how your data is structured. For example, if your logs are stored by `year`, `month`, `day`, and `hour`, you let Athena know upfront. Then, whenever you query, Athena predicts where to look based on that pattern.
Let’s say you’ve got CloudFront logs stored in S3 like this:
s3://cloudfront--logs/year=2024/month=11/day=25/hour=12/
s3://cloudfront--logs/year=2024/month=11/day=26/hour=14/
Here, each folder (like `year=2024`) is a partition.
Without Partition Projection, you’d have to tell Athena about each folder manually, like this:
ALTER TABLE cloudfront_logs ADD PARTITION (year='2024', month='11', day='25', hour='12') LOCATION 's3://cloudfront--logs/year=2024/month=11/day=25/hour=12/';
Doing that for every log? No thanks!
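The usual workaround is to have Athena rediscover the Hive-style folders after every delivery, but that command lists your S3 prefixes each time and only gets slower as partitions pile up:
MSCK REPAIR TABLE cloudfront_logs;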
With Partition Projection, you define the pattern once when you create the table:
CREATE EXTERNAL TABLE cloudfront_logs (
  `timestamp` STRING,  -- backticked because timestamp is a reserved word in Athena DDL
  url STRING,
  status_code INT
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
STORED AS PARQUET
LOCATION 's3://cloudfront--logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23'
);
Athena now knows your logs follow this structure. When you query, it’ll automatically figure out where to look—no manual updates required.
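Here's a quick sketch of what that looks like in practice, using the simplified columns from the table above. A query like this only reads objects under the year=2024/month=11/day=25/hour=15/ prefix instead of scanning the whole bucket:
SELECT url, status_code, COUNT(*) AS requests
FROM cloudfront_logs
WHERE year = '2024'
  AND month = '11'
  AND day = '25'
  AND hour = '15'
GROUP BY url, status_code
ORDER BY requests DESC
LIMIT 20;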
Let's talk about the setup:
To set up Hive-style partitions in CloudFront:
When enabling logging, choose Amazon S3 as the destination, turn on Hive-compatible prefixes for your log files, and choose Parquet as the output format. Then provide a suffix path for partitioning your data.
Example: `{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/`
To effortlessly set up an Athena database and table with Partition Projection enabled, check out this GitHub repo.
Wrapping It Up
CloudFront logs just got a lot easier to work with. Whether you're using the new Apache Parquet format with Hive-compatible folders or combining it with Athena Partition Projection, you can now query your logs faster, cheaper, and with way less hassle. It’s been a game-changer for me, and I hope it will be for you too.
But that’s not all. You can also deliver CloudFront logs to CloudWatch Logs in JSON or plain text format for real-time monitoring, or even use Kinesis Data Firehose to process logs on the fly. AWS has made it super flexible to work with CloudFront logs, so you can choose the setup that works best for you.
Happy logging! 😊