This one was not very well documented and I struggled a bit figuring it out. Adding this to the collective internet intelligence in the hope that it will help someone out one day... or just my future self (hello, future self!).
Adding Flow Logs to the AWS Transit Gateway was honestly easy enough: I pointed them at S3 and they started showing up. Done, right? Nope. Querying the logs was a different story, and it took me more than a day to get it all working the way I expected.
Here is what I landed on:
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Super Awesome Transit Gateway"
  amazon_side_asn                 = var.amazon_side_asn
  dns_support                     = var.dns_support
  vpn_ecmp_support                = var.vpn_ecmp_support
  default_route_table_association = var.default_route_table_association
  default_route_table_propagation = var.default_route_table_propagation
  auto_accept_shared_attachments  = var.auto_accept_shared_attachments
}

resource "aws_s3_bucket" "flow_logs" {
  bucket = "tgw-flow-logs-super-awesome-tgw"
}

resource "aws_flow_log" "main" {
  log_destination          = aws_s3_bucket.flow_logs.arn
  log_destination_type     = "s3"
  traffic_type             = "ALL"
  transit_gateway_id       = aws_ec2_transit_gateway.main.id
  max_aggregation_interval = 60

  destination_options {
    file_format                = "parquet"
    hive_compatible_partitions = true
    per_hour_partition         = true
  }
}
You can see that I landed on Parquet as the output format, and I also ended up turning on Hive-compatible partitions because they make queries in Athena quite a bit more efficient.
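To give a concrete sense of why those partitions help: the Athena table defined further down is partitioned by year, month, day, and hour, so a query that filters on those columns only reads the matching S3 prefixes instead of scanning the whole bucket. A rough sketch (the date values here are just placeholders):
-- Only the S3 prefixes for this one day get scanned, not the entire bucket
SELECT SUM(bytes) AS total_bytes
FROM tgwflowlogs
WHERE year = '2024'
  AND month = '05'
  AND day = '01';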
Now for the important part, which was correctly setting up Athena and the queries for it. Below is a continuation of the same Terraform module, adding in the important bits for Athena.
I created a bucket and then an Athena workgroup to store the queries and their results.
resource "aws_s3_bucket" "athena_output" {
  bucket = "athena-${aws_ec2_transit_gateway.main.id}"
}

resource "aws_athena_workgroup" "tgw_workgroup" {
  name = "TransitGatewayFlowLogs"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_output.bucket}/output/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
}
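Quick note: the named queries below reference an aws_athena_database.flow_logs resource, so the module also needs an Athena database defined somewhere. A minimal sketch of that resource could look something like this (the database name is just a placeholder, use whatever fits your naming scheme):
resource "aws_athena_database" "flow_logs" {
  # Athena database names are limited to lowercase letters, numbers, and underscores
  name   = "tgw_flow_logs"
  bucket = aws_s3_bucket.athena_output.bucket
}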
Now we get to the juicy parts. Once again, this is still the same module, now adding in the Athena queries. I have created three different queries:
- Create the Flow Logs table in Athena.
- Update the metadata in the Athena table so the correct Hive partitions for the Parquet files are registered.
- Query the Flow Logs for a specific VPC ID to see what it is talking to. You may want to tailor this one a bit to fit your needs, so your mileage may vary.
resource "aws_athena_named_query" "create_flow_table" {
  name        = "Create Flow Logs Table"
  workgroup   = aws_athena_workgroup.tgw_workgroup.name
  description = "Create the flow logs table in Athena"
  database    = aws_athena_database.flow_logs.name
  query       = <<EOT
CREATE EXTERNAL TABLE IF NOT EXISTS tgwflowlogs (
  version int,
  resource_type string,
  account_id string,
  tgw_id string,
  tgw_attachment_id string,
  tgw_src_vpc_account_id string,
  tgw_dst_vpc_account_id string,
  tgw_src_vpc_id string,
  tgw_dst_vpc_id string,
  tgw_src_subnet_id string,
  tgw_dst_subnet_id string,
  tgw_src_eni string,
  tgw_dst_eni string,
  tgw_src_az_id string,
  tgw_dst_az_id string,
  tgw_pair_attachment_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  log_status string,
  type string,
  packets_lost_no_route bigint,
  packets_lost_blackhole bigint,
  packets_lost_mtu_exceeded bigint,
  packets_lost_ttl_expired bigint,
  tcp_flags int,
  region string,
  flow_direction string,
  pkt_src_aws_service string,
  pkt_dst_aws_service string
)
PARTITIONED BY (
  `aws-account-id` string,
  `aws-service` string,
  `aws-region` string,
  `year` string,
  `month` string,
  `day` string,
  `hour` string
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://${aws_s3_bucket.flow_logs.id}/AWSLogs/'
TBLPROPERTIES (
  'EXTERNAL'='true',
  'skip.header.line.count'='1'
)
EOT
}
resource "aws_athena_named_query" "update_metadata" {
  name        = "Update Metadata in Catalog"
  workgroup   = aws_athena_workgroup.tgw_workgroup.name
  description = "Update the metadata in the catalog after you add Hive compatible partitions"
  database    = aws_athena_database.flow_logs.name
  query       = "MSCK REPAIR TABLE tgwflowlogs"
}
resource "aws_athena_named_query" "vpc_query" {
  name        = "VPC ID Query"
  workgroup   = aws_athena_workgroup.tgw_workgroup.name
  description = "Select the VPC ID, source IP address, destination IP address, and total bytes transferred for each flow log record."
  database    = aws_athena_database.flow_logs.name
  query       = <<EOT
SELECT tgw_src_vpc_id AS "VPC ID",
       srcaddr AS "Source Address",
       dstaddr AS "Destination Address",
       dstport AS "Destination Port",
       SUM(bytes) AS "Total Bytes"
FROM "${aws_athena_database.flow_logs.name}".tgwflowlogs
WHERE tgw_src_vpc_id = 'replace-vpc-id-here'
GROUP BY tgw_src_vpc_id,
         srcaddr,
         dstaddr,
         dstport
LIMIT 100
EOT
}
Once you deploy this, it will set up your TGW and attach the flow logs. Those logs will be written to the S3 bucket in Parquet format.
You should be able to go into Athena, and in the Query Editor you will see the Workgroup dropdown menu, where you can switch to the workgroup that was created via the Terraform above. It is named TransitGatewayFlowLogs (assuming you did not change it).
Once you select that workgroup, you will then be able to go to the Saved Queries tab and see the three queries that were also created from the Terraform.
You should then select the Create Flow Logs Table query and run it. That will create the Athena table for you.
Once you do that, go back to Saved Queries, select Update Metadata in Catalog, and run it. That will make sure the table's partition metadata is loaded for the Parquet Hive partitions. Keep in mind that MSCK REPAIR TABLE only picks up the partitions that exist when it runs, so you will want to re-run it as new data lands.
After you do that, you can load up the last saved query, VPC ID Query. One important note about that one: I have put a replace-vpc-id-here placeholder in it, which you will want to swap out for a real VPC ID.
Once you do that, you should be able to query the Transit Gateway Flow logs.
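Since that last query is really a starting point, here is one sketch of how you might tailor it: group by both the source and destination VPC IDs (both columns exist in the table above) and add a partition filter so Athena only scans a single day of data. The VPC ID and the dates are placeholders you would swap in yourself.
SELECT tgw_src_vpc_id AS "Source VPC",
       tgw_dst_vpc_id AS "Destination VPC",
       SUM(bytes)     AS "Total Bytes"
FROM tgwflowlogs
WHERE tgw_src_vpc_id = 'replace-vpc-id-here'
  AND year = '2024'
  AND month = '05'
  AND day = '01'
GROUP BY tgw_src_vpc_id,
         tgw_dst_vpc_id
ORDER BY SUM(bytes) DESC
LIMIT 100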