Using DuckDB to query Cloud Provider audit logs when you don't have a SIEM available.
⚠️ Just want the code? Check out my gist here
More than once, I have been in a situation where I needed to query CloudTrail logs but was working in a customer environment where they weren’t aggregated to a search interface. Another similar situation is when CloudTrail data events are disabled for cost reasons but need to be temporarily turned on for troubleshooting/audit purposes. While the CloudTrail console offers some (very) limited lookups (for management events only), and Athena is an option, what about DuckDB?
DuckDB can both read files directly from S3 and parse JSON into queryable tables. This blog is my documentation of working through that process! It assumes you already have DuckDB installed; if not, start here.
Setup
Start a DuckDB session. DuckDB can operate either fully in memory or spill to disk to process datasets larger than your available memory. For CloudTrail in a single account over a day, in-memory would be fine, but we can open a persistent database so our tables don't disappear when we exit:
duckdb cloudtrail-analysis
Next, load the AWS extension:
INSTALL AWS;
LOAD AWS;
This extension makes it easier to pick up AWS credentials from your CLI profiles than the default workflow. We can load whatever credentials we have configured in our environment or AWS CLI profile with:
CALL load_aws_credentials();
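If you use a named CLI profile for the account you're investigating, load_aws_credentials also accepts a profile name (customer-audit below is just a placeholder):
-- Load credentials from a specific named CLI profile instead of the default chain
CALL load_aws_credentials('customer-audit');
(Depending on your DuckDB version, you may also need INSTALL httpfs; LOAD httpfs; before the S3 reads below; newer versions autoload it.)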
Before we go down the SQL rabbit hole, let's consider the structure of CloudTrail as it gets exported to S3:
# cloudtrail_file.json
{
  "Records": [
    {
      "eventVersion": "1.09",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "EXAMPLE6E4XEGITWATV6R",
        "arn": "arn:aws:iam::123456789012:user/Mary_Major",
        "accountId": "123456789012"
      },
      ...
    },
    ...
  ]
}
The takeaway here: the file is a single JSON object with a top-level Records key, an array containing the CloudTrail entries we are after. We will need to explode the records out of that array into a table to make them useful.
By default, DuckDB's read_json function will attempt to determine the schema of JSON files and adapt the column data types accordingly. CloudTrail entries have a few common top-level fields but tend to be very dynamic when it comes to the event-specific fields (e.g. requestParameters). We can use the maximum_depth parameter on our read_json call to limit how deep that schema detection goes and keep the entries as plain JSON.
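If you want to sanity-check what maximum_depth does before pulling anything from S3, you can DESCRIBE a read_json query against a single file (using the example cloudtrail_file.json from above); it should report a single Records column typed as a list of JSON values:
-- Inspect the inferred schema without creating a table
DESCRIBE SELECT * FROM read_json('cloudtrail_file.json', maximum_depth=2);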
To avoid redownloading the files from S3 over and over again, we can use the CREATE TABLE … AS statement (aka CTAS in the SQL world) to create a table from our read_json query:
CREATE TABLE ct_raw AS SELECT * FROM read_json('s3://org-cloudtrail-111122223333/AWSLogs/o-123456043/111122223333/CloudTrail/us-east-1/2024/05/19/*.gz', maximum_depth=2);
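As an aside, if you've already pulled the logs down locally (for example with aws s3 sync), read_json accepts glob patterns on local paths too, and handles the .gz compression automatically; the path below is just an example:
-- Same idea, but against files already synced to disk (example path)
CREATE TABLE ct_raw AS
SELECT * FROM read_json('./cloudtrail-logs/**/*.json.gz', maximum_depth=2);
The result is the same either way.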
This gets us a table with a single column: Records with a data type of an array of JSON objects. Next, we can explode the list using unnest to access the individual events:
CREATE TABLE ct AS SELECT unnest(Records) AS Event FROM ct_raw;
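At this point a quick sanity check doesn't hurt: how many events did we pull in, and what does one look like?
-- Rough row count and a peek at a single event
SELECT count(*) FROM ct;
SELECT Event FROM ct LIMIT 1;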
The JSON data type allows us to access nested values using dot notation, which looks like this: event.userIdentity.arn. While this offers some limited querying, the JSON data type isn't ideal once we want to use those columns in a WHERE clause. To finish, we can extract the keys we care about into separate columns using json_extract_string:
CREATE TABLE cloudtrail_events AS SELECT
    json_extract_string(event, '$.eventVersion') AS eventVersion,
    json_extract_string(event, '$.userIdentity.type') AS userType,
    json_extract_string(event, '$.userIdentity.principalId') AS principalId,
    json_extract_string(event, '$.userIdentity.arn') AS userArn,
    json_extract_string(event, '$.userIdentity.accountId') AS accountId,
    json_extract_string(event, '$.userIdentity.accessKeyId') AS accessKeyId,
    json_extract_string(event, '$.userIdentity.userName') AS userName,
    CAST(json_extract_string(event, '$.eventTime') AS TIMESTAMP) AS eventTime,
    json_extract_string(event, '$.eventSource') AS eventSource,
    json_extract_string(event, '$.eventName') AS eventName,
    json_extract_string(event, '$.awsRegion') AS awsRegion,
    json_extract_string(event, '$.sourceIPAddress') AS sourceIPAddress,
    json_extract_string(event, '$.userAgent') AS userAgent,
    json_extract_string(event, '$.errorCode') AS errorCode,
    json_extract_string(event, '$.errorMessage') AS errorMessage,
    json_extract(event, '$.requestParameters') AS requestParameters,
    json_extract(event, '$.responseElements') AS responseElements,
    json_extract(event, '$.resources') AS resources
FROM ct;
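Before diving into specific questions, DuckDB's SUMMARIZE command is a handy way to get a feel for the new table (distinct counts, min/max, and null percentages per column):
SUMMARIZE cloudtrail_events;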
Query time!
Some sample queries:
All actions taken by a particular IAM Principal:
select eventName, eventTime, userAgent from cloudtrail_events where userArn = 'REPLACE_ME';
All the unique error codes:
select distinct errorCode from cloudtrail_events;
Get all events in the past 24 hours:
select * from cloudtrail_events where eventtime >= (now() - INTERVAL '1 day');
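One more that I find useful when troubleshooting: which calls are failing, and for whom?
-- Most common failing calls, grouped by principal and error code
select userArn, eventSource, eventName, errorCode, count(*) as occurrences
from cloudtrail_events
where errorCode is not null
group by all
order by occurrences desc
limit 20;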
Happy querying!