DEV Community

Cover image for Streamline your data analytics with AWS Athena queries and Terraform
Bolivar Llerena
Bolivar Llerena

Posted on

Streamline your data analytics with AWS Athena queries and Terraform

Part I of this article will detail how to streamline your data analytics with AWS Athena using Terraform Infrastructure as Code.

Requirements

  • AWS account
  • Terraform

Walkthrough

Configure AWS credentials with aws configure.

At the root of our project, create a file named provider.tf which we will specify the aws provider. Also we are going to read dynamically our columns and declare a variable for each column.

#provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.59.0"
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Next, let's define an S3 bucket in a file named s3.tf which we will use as a source of data.

#s3.tf
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-tf-log-bucket-source-athena"
}
Enter fullscreen mode Exit fullscreen mode

Now we will define a file named athena.tf, in this file we are going to create an athena workgroup,db, table, and our query.

For our aws_glue_catalog_table we are expecting the table to serealize a json object.

Note: Is important to note that our json files should have no spaces, otherwise it won't parse well other alternative could be the use of LazySimpleSerDe which could convert from byte stream

#athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
  name = "myservice"
}

resource "aws_glue_catalog_table" "athena_table" {
 name          = "mytable"
 database_name = aws_glue_catalog_database.myservice_athena_db.name
 description   = "Table containing the results stored in S3 as source"


 table_type = "EXTERNAL_TABLE"


 storage_descriptor {
   location      = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
   input_format  = "org.apache.hadoop.mapred.TextInputFormat"
   output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"


   ser_de_info {
     name                  = "s3-stream"
     serialization_library = "org.openx.data.jsonserde.JsonSerDe"


     parameters = {
       "ignore.malformed.json" = "TRUE"
       "dots.in.keys"          = "FALSE"
       "case.insensitive"      = "TRUE"
       "mapping"               = "TRUE"
     }
   }


   dynamic "columns" {
     for_each = var.columns


     iterator = column
     content {
       name = column.value.name
       type = column.value.type
     }
   }
 }
}

resource "aws_athena_workgroup" "test" {
  name = "example123123s"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"

    }
  }
}

resource "aws_athena_named_query" "athena_query" {
  name      = "test_query"
  workgroup = aws_athena_workgroup.test.id
  database  = aws_glue_catalog_database.myservice_athena_db.name
  query     = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
Enter fullscreen mode Exit fullscreen mode

Our json example should look like this (but without spaces)

{
  "product": "Live JSON generator",
  "version": 3.1,
  "releaseDate": "2014-06-25T00:00:00.000Z",
  "demo": true,
  "person": {
    "id": 12345,
    "name": "John Doe",
    "phones": {
      "home": "800-123-4567",
      "mobile": "877-123-1234"
    },
    "email": [
      "jd@example.com",
      "jd@example.org"
    ],
    "dateOfBirth": "1980-01-02T00:00:00.000Z",
    "registered": true,
    "emergencyContacts": [
      {
        "name": "Jane Doe",
        "phone": "888-555-1212",
        "relationship": "spouse"
      },
      {
        "name": "Justin Doe",
        "phone": "877-123-1212",
        "relationship": "parent"
      }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

In our variables.tf we will dynamically call all the columns

#variables.tf
variable "columns" {
  type = list(object({
    name = string,
    type = string,
  }))
  default     = []
  description = "The columns in the table, where the key is the name of the column and the value the type"
}
Enter fullscreen mode Exit fullscreen mode
#terraform.tfvars
columns = [
    {
      name = "product"
      type = "string"
    },

     {
      name = "version"
      type = "int"
    },

     {
      name = "releaseDate"
      type = "string"
    },

     { 
      name = "demo" 
      type = "boolean" 
    } ,
     {
      name = "person"
      type = "string"
    }
]
Enter fullscreen mode Exit fullscreen mode

After finishing our files let's use terraform to create our resources.

terraform init
Enter fullscreen mode Exit fullscreen mode

Plan init

terraform plan
Enter fullscreen mode Exit fullscreen mode

Plan creation

terraform apply
Enter fullscreen mode Exit fullscreen mode

Plan apply
Update our data.json file through s3 commands or directly in the interface

aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
Enter fullscreen mode Exit fullscreen mode

Now in Athena service in our AWS account we can see our db, table and saved query, running our query should return a value like this.

Athena UI
And save the results in our bucket.

Bucket

Top comments (0)