Part I of this article will detail how to streamline your data analytics with AWS Athena using Terraform Infrastructure as Code.
Requirements
- AWS account
- AWS CLI
- Terraform
Walkthrough
Configure AWS credentials with aws configure.
At the root of our project, create a file named provider.tf, in which we will specify the AWS provider. Later on, we will also read our table columns dynamically from a single variable.
#provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.59.0"
    }
  }
}
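The terraform block only pins the provider version. We also need a provider block so Terraform knows which region to create the resources in; a minimal sketch, assuming us-east-1 (replace with your own region):
provider "aws" {
  region = "us-east-1" # assumption: adjust to the region you use
}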
Next, let's define an S3 bucket in a file named s3.tf, which we will use as our data source.
#s3.tf
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-tf-log-bucket-source-athena"
}
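Optionally, Terraform can also manage the data file itself instead of the manual aws s3 cp step shown later; a sketch using aws_s3_object, assuming a data.json sits next to the configuration:
resource "aws_s3_object" "data" {
  bucket = aws_s3_bucket.log_bucket.bucket
  key    = "mydata/data.json"
  source = "${path.module}/data.json"
  etag   = filemd5("${path.module}/data.json") # re-upload when the file changes
}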
Now we will define a file named athena.tf. In this file we are going to create an Athena workgroup, a Glue database, a table, and a named query.
For our aws_glue_catalog_table we expect the table to deserialize JSON objects.
Note: it is important that each JSON record in our files is stored on a single line, with no pretty-printed whitespace, otherwise it won't parse correctly. An alternative is LazySimpleSerDe, which deserializes records directly from the byte stream.
#athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
  name = "myservice"
}

resource "aws_glue_catalog_table" "athena_table" {
  name          = "mytable"
  database_name = aws_glue_catalog_database.myservice_athena_db.name
  description   = "Table containing the results stored in S3 as source"
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "s3-stream"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "ignore.malformed.json" = "TRUE"
        "dots.in.keys"          = "FALSE"
        "case.insensitive"      = "TRUE"
        "mapping"               = "TRUE"
      }
    }

    dynamic "columns" {
      for_each = var.columns
      iterator = column

      content {
        name = column.value.name
        type = column.value.type
      }
    }
  }
}
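For clarity, the dynamic "columns" block above generates one columns block per entry of var.columns (declared further below). For the first entry in terraform.tfvars it expands to this static equivalent:
columns {
  name = "product"
  type = "string"
}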
resource "aws_athena_workgroup" "test" {
name = "example123123s"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"
}
}
}
resource "aws_athena_named_query" "athena_query" {
name = "test_query"
workgroup = aws_athena_workgroup.test.id
database = aws_glue_catalog_database.myservice_athena_db.name
query = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
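Optionally, we can expose the identifiers we will need after terraform apply in an outputs.tf; a small sketch (the file name and output names here are just conventions):
#outputs.tf
output "athena_workgroup" {
  value = aws_athena_workgroup.test.name
}
output "named_query_id" {
  value = aws_athena_named_query.athena_query.id
}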
Our JSON example should look like this (but collapsed onto a single line when uploaded):
{
  "product": "Live JSON generator",
  "version": 3.1,
  "releaseDate": "2014-06-25T00:00:00.000Z",
  "demo": true,
  "person": {
    "id": 12345,
    "name": "John Doe",
    "phones": {
      "home": "800-123-4567",
      "mobile": "877-123-1234"
    },
    "email": [
      "jd@example.com",
      "jd@example.org"
    ],
    "dateOfBirth": "1980-01-02T00:00:00.000Z",
    "registered": true,
    "emergencyContacts": [
      {
        "name": "Jane Doe",
        "phone": "888-555-1212",
        "relationship": "spouse"
      },
      {
        "name": "Justin Doe",
        "phone": "877-123-1212",
        "relationship": "parent"
      }
    ]
  }
}
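Collapsed onto one line for upload (which is what the JSON SerDe expects), the same record looks like this; a tool such as jq -c can produce it:
{"product":"Live JSON generator","version":3.1,"releaseDate":"2014-06-25T00:00:00.000Z","demo":true,"person":{"id":12345,"name":"John Doe","phones":{"home":"800-123-4567","mobile":"877-123-1234"},"email":["jd@example.com","jd@example.org"],"dateOfBirth":"1980-01-02T00:00:00.000Z","registered":true,"emergencyContacts":[{"name":"Jane Doe","phone":"888-555-1212","relationship":"spouse"},{"name":"Justin Doe","phone":"877-123-1212","relationship":"parent"}]}}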
In our variables.tf we declare the variable that holds all the columns.
#variables.tf
variable "columns" {
  type = list(object({
    name = string,
    type = string,
  }))
  default     = []
  description = "The columns in the table, each entry holding the column's name and type"
}
#terraform.tfvars
columns = [
  {
    name = "product"
    type = "string"
  },
  {
    # "version" is a decimal (3.1) in the sample data, so double is the
    # appropriate Hive type here rather than int.
    name = "version"
    type = "double"
  },
  {
    name = "releaseDate"
    type = "string"
  },
  {
    name = "demo"
    type = "boolean"
  },
  {
    name = "person"
    type = "string"
  }
]
After finishing our files, let's use Terraform to create our resources.
terraform init
terraform plan
terraform apply
Upload our data.json file through S3 commands or directly in the console.
aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
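To confirm the object landed under the prefix the table points at, we can list it (same bucket placeholder as above):
aws s3 ls s3://<<your_bucket_name>>/mydata/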
Now, in the Athena service in our AWS account, we can see our database, table, and saved query; running the query should return the person's name and the product as source.
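We can also execute the same query from the CLI instead of the console; a minimal sketch using the names defined above (the execution id in the second command is the one returned by the first):
aws athena start-query-execution \
  --query-string "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable" \
  --work-group example123123s \
  --query-execution-context Database=myservice

aws athena get-query-results --query-execution-id <<query_execution_id>>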