

Building an Apache Iceberg Log Analytics Platform with S3 Tables and Amazon Data Firehose

Introduction

Amazon S3 Tables is an AWS-managed object storage service that supports the Apache Iceberg specification and automatically performs table optimization tasks such as compaction in the background. In this article, I explore how to architect a log analytics platform for applications deployed on AWS using S3 Tables and Amazon Data Firehose.


System Architecture

The system architecture for this implementation is shown in the diagram below. Assuming a containerized application deployed on ECS, I place a FireLens sidecar container next to the application to act as the log router. FireLens receives the application logs and forwards them to Amazon Data Firehose, which then writes them to S3 Tables in Iceberg format. Finally, the stored Iceberg tables are queried and analyzed with Amazon Athena.

S3 Tables application log platform architecture diagram
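
To make the routing concrete, here is a minimal sketch of the ECS task definition wiring for FireLens, assuming a Fluent Bit sidecar and the standard kinesis_firehose output plugin. The family, image, stream name, and role reference are placeholders of mine, not values taken from the repository linked below.

# Sketch only: FireLens (Fluent Bit) sidecar routing application logs to
# Amazon Data Firehose. All names and ARNs below are placeholders.
resource "aws_ecs_task_definition" "app" {
  family                   = "s3tables-log-demo"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.task_execution.arn  # assumed to be defined elsewhere

  container_definitions = jsonencode([
    {
      name  = "app"
      image = "your-application-image"
      logConfiguration = {
        logDriver = "awsfirelens"
        options = {
          Name            = "kinesis_firehose"          # Fluent Bit Firehose output plugin
          region          = "ap-northeast-1"
          delivery_stream = "s3tables-log-demo-stream"  # placeholder stream name
        }
      }
    },
    {
      name      = "log_router"
      essential = true
      image     = "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable"
      firelensConfiguration = {
        type    = "fluentbit"
        options = { "enable-ecs-log-metadata" = "true" }  # adds ecs_cluster, ecs_task_arn, ecs_task_definition
      }
    }
  ])
}

With enable-ecs-log-metadata turned on, Fluent Bit also attaches the ECS metadata fields (ecs_cluster, ecs_task_arn, ecs_task_definition) that appear as columns in the Athena query later in this article.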

Infrastructure Provisioning

The infrastructure is provisioned using Terraform. I have included a link to the Terraform code repository below for those interested in the implementation details.

https://github.com/manaty226/aws-s3tables-firehose-athena-log-analytics
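
For reference, the core S3 Tables resources can be expressed with the AWS provider's s3tables resources roughly as follows. This is a sketch based on my reading of recent provider versions, with a placeholder bucket name; the repository above contains the actual definitions.

# Sketch: table bucket, namespace, and table for the log platform.
# The bucket name is a placeholder; "logs" and "some_api_logs" match the
# names used in the Lake Formation grant below.
resource "aws_s3tables_table_bucket" "logs" {
  name = "s3tables-log-demo"
}

resource "aws_s3tables_namespace" "logs" {
  namespace        = "logs"
  table_bucket_arn = aws_s3tables_table_bucket.logs.arn
}

resource "aws_s3tables_table" "api_logs" {
  name             = "some_api_logs"
  namespace        = aws_s3tables_namespace.logs.namespace
  table_bucket_arn = aws_s3tables_table_bucket.logs.arn
  format           = "ICEBERG"
}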

In this configuration, writing data from Amazon Data Firehose to S3 Tables requires Lake Formation permission settings. However, as of December 2025, the Terraform AWS provider's Lake Formation resources do not support granting permissions on S3 Tables. Therefore, after creating the IAM role for Amazon Data Firehose with Terraform, you need to run the following AWS CLI command to grant the Lake Formation permissions. Without this step, S3 Tables is not visible from the Firehose stream and resource creation fails.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier="arn:aws:iam::${ACCOUNT_ID}:role/s3tables-log-demo-firehose-role" \
  --resource "{\"Table\":{\"CatalogId\":\"${ACCOUNT_ID}:s3tablescatalog/${TABLE_BUCKET_NAME}\",\"DatabaseName\":\"logs\",\"Name\":\"some_api_logs\"}}" \
  --permissions "ALL" \
  --region ap-northeast-1

For more details on the Lake Formation permission configuration issue in Terraform, please refer to the following GitHub issue.
https://github.com/hashicorp/terraform-provider-aws/issues/40724
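
For completeness, the Firehose delivery stream that targets the S3 Tables catalog looks roughly like the following in Terraform. Treat this as a sketch: the iceberg_configuration block and its attribute names reflect my understanding of the provider's Iceberg destination support, and the role, variables, and backup bucket references are placeholders.

# Sketch: Amazon Data Firehose stream with an Iceberg (S3 Tables) destination.
# catalog_arn mirrors the CatalogId used in the grant-permissions command above.
resource "aws_kinesis_firehose_delivery_stream" "logs" {
  name        = "s3tables-log-demo-stream"
  destination = "iceberg"

  iceberg_configuration {
    role_arn    = aws_iam_role.firehose.arn  # the role granted Lake Formation permissions above
    catalog_arn = "arn:aws:glue:ap-northeast-1:${var.account_id}:catalog/s3tablescatalog/${var.table_bucket_name}"

    destination_table_configuration {
      database_name = "logs"
      table_name    = "some_api_logs"
    }

    s3_configuration {
      role_arn   = aws_iam_role.firehose.arn
      bucket_arn = aws_s3_bucket.firehose_backup.arn  # error/backup output, placeholder
    }
  }
}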

Log Sample Structure

For this implementation, the application container outputs logs in the following JSON format.

{
  "timestamp": "2025-12-20T10:10:21Z",
  "level": "INFO",
  "message": "Sample log message",
  "request_id": "93f7a013-37d1-4793-a73f-b660d78a1f16"
}

When these logs are sent to Amazon Data Firehose via FireLens, metadata is appended, resulting in the following structure.

{
    "container_name": "app",
    "source": "stdout",
    "log": "{\"timestamp\":\"2025-12-20T10:10:21Z\",\"level\":\"INFO\",\"message\":\"Sample log message\",\"request_id\":\"93f7a013-37d1-4793-a73f-b660d78a1f16\"}",
    "container_id": "xxxxxxxxxxxxxxxxx"
}

When data is sent from Amazon Data Firehose to S3 Tables, fields that do not exist in the S3 Tables schema are silently dropped, so you need to define an Iceberg table schema that accommodates this format in advance. Unlike CloudWatch Logs, where the log structure is parsed at query time after ingestion, S3 Tables forces you to standardize the log fields up front. As application developers, however, we often want to add or change log fields for specific features or contexts. Given this, a reasonable compromise might be to keep the full log body in the log field while defining only the standardized fields as S3 Tables columns, and to configure FireLens to format the logs accordingly. This is still an area I am experimenting with, so if you have any best practices to share, I would appreciate your input.
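
To make the schema explicit, the table could be defined with Athena-style Iceberg DDL along these lines. This is illustrative only; in this setup the table is provisioned with the rest of the infrastructure, and the column names simply mirror the FireLens output above and the query shown later.

-- Illustrative schema: one column per FireLens metadata field plus a single
-- log column holding the raw JSON string emitted by the application.
CREATE TABLE some_api_logs (
  container_id        string,
  container_name      string,
  ecs_cluster         string,
  ecs_task_arn        string,
  ecs_task_definition string,
  source              string,
  log                 string
)
TBLPROPERTIES ('table_type' = 'ICEBERG');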

If the log data shows up in the S3 Tables preview, the setup is working. Note that when the JSON fields emitted by FireLens do not match the S3 Tables column schema, Amazon Data Firehose does not report an error; the data simply never appears in S3 Tables, which makes debugging particularly challenging.
S3 Tables data preview

Querying Logs Stored in S3 Tables

Finally, I query the logs stored in S3 Tables from Athena.

As mentioned earlier, the log body is stored as a JSON string in the log field, so we need to parse it with Athena's JSON functions to extract the individual fields. Below is an example query. I considered creating a view to streamline analysis for the development team, but S3 Tables is currently recognized as a cross-account Glue Data Catalog, and the CREATE VIEW command fails with an error. For now, a practical approach is to save queries like the one below as named queries and share them within the team. If anyone knows how to create views in this context, I would be grateful for your guidance.

SELECT
    container_id,
    container_name,
    ecs_cluster,
    ecs_task_arn,
    ecs_task_definition,
    source,
    json_extract_scalar(log, '$.timestamp') AS timestamp,
    json_extract_scalar(log, '$.level') AS level,
    json_extract_scalar(log, '$.message') AS message,
    json_extract_scalar(log, '$.request_id') AS request_id
FROM some_api_logs;
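
As a convenience, a named query can also be registered from the CLI instead of the console. A sketch, assuming the query above is saved in a local file named parse_api_logs.sql and that the default primary workgroup is used:

aws athena create-named-query \
  --name "some-api-logs-parsed" \
  --description "Parse the FireLens log field into columns" \
  --database "logs" \
  --query-string "$(cat parse_api_logs.sql)" \
  --work-group primary \
  --region ap-northeast-1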

Conclusion

In this article, I explored building an application log analytics platform using S3 Tables. Some areas still feel immature, such as Terraform not yet supporting Lake Formation permission grants for S3 Tables, the need to grant ALL permissions to Amazon Data Firehose, and the inability to create views. Even so, the potential for significantly lower log ingestion and storage costs compared to CloudWatch Logs is promising. Additionally, the optimizations Iceberg enables, such as compaction and index optimization based on data characteristics, make this a technology I am looking forward to seeing evolve.
