Augusto Valdivia for AWS Community Builders

Posted on Jun 1, 2021 • Originally published at augustovaldivia.ca

AWS Data Lake with Terraform - Part 1 of 6

#aws #terraform #awsdatalake #bigdata

Big data has been growing as a topic for a while now, and it is obvious that data is powerful. Data is indeed the new oil, and businesses everywhere are investing in data research. There are many terms nowadays that describe data and how it is organized. A data lake is one of them. So, what is it?

In simple words, a Data Lake is a centralized repository that collects, stores and organizes huge collections of data, including structured and semi-structured data. It also allows multiple organizational units (OU) to explore and investigate the current state of their business in minutes. It gives users the ability to do ad-hoc analysis across diverse processing engines, such as serverless, in-memory, query and batch processing.

The challenge

In this series of blog posts I will explain how I translated the core MVP services for a large e-commerce company into Infrastructure-as-Code (IaC) using Terraform scripts to allow for fast and repeatable deployments, efficient testing and shorter recovery time in case of an unplanned event. Version one of this Data Lake architecture uses the following services:

EC2 for elastic compute
Kinesis to collect, process and analyze data streams in real time (or almost real time)
S3 for the data landing and data consumption zones

Each of these services is a huge topic in its own right, so throughout this article I will highlight how they work and how I integrated them.

Diagram version 1: Data lake

Diagram final version: Data lake

What method will we be using to deploy this infrastructure?

We will be deploying this infrastructure as code (IaC) using Terraform.

resource "aws_instance" "logs" {
  count                       = var.ec2_count
  ami                         = "ami-0742b4e673072066f"
  instance_type               = "t2.micro"
  subnet_id                   = aws_subnet.dlogssub.id
  associate_public_ip_address = true
  vpc_security_group_ids      = [aws_security_group.web_sg.id]
  depends_on                  = [aws_internet_gateway.bigdataigw]
  key_name                    = aws_key_pair.logskey.key_name
  iam_instance_profile        = aws_iam_instance_profile.ec2_profile.name

  # Bootstrap script: update packages and install the Kinesis agent on first boot.
  user_data = <<-EOF
          #!/bin/bash -xe
          yum -y update
          yum install -y aws-kinesis-agent
          EOF

  tags = {
    "Name" = "ec2-app-02"
  }
}
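
The resource above reads `var.ec2_count`, whose declaration is not shown here. A minimal sketch of how it might be declared, together with a splat expression that collects the public IP of every instance, could look like the following. The default value, the validation rule and the output name `logs_public_ips` are assumptions for illustration, not taken from the project repo.

```hcl
# Assumed declaration for the count used by aws_instance.logs.
variable "ec2_count" {
  description = "Number of log-producing EC2 instances to create"
  type        = number
  default     = 1 # assumed default

  validation {
    condition     = var.ec2_count >= 1
    error_message = "The ec2_count value must be at least 1."
  }
}

# Splat expression: returns one public IP per instance created by count.
output "logs_public_ips" {
  description = "Public IP addresses of all log instances"
  value       = aws_instance.logs[*].public_ip
}
```

Because the resource uses `count`, `aws_instance.logs` is a list, and `[*]` is a compact way to read the same attribute from every element.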

Terraform's new default tags feature

provider "aws" {
  region = "us-east-1"

  # Tags applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Environment = "DataLake-test"
      Project     = "DataLake-infrastructure"
    }
  }
}
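
Tags set in `default_tags` are applied to every taggable resource the provider manages, and a resource's own `tags` are merged on top of them. A small sketch of that behavior, assuming AWS provider v3.38.0 or later and a hypothetical bucket that is not part of the original project:

```hcl
# Hypothetical bucket; only the resource-specific tag is declared here.
resource "aws_s3_bucket" "tag_demo" {
  bucket = "example-datalake-tag-demo" # assumed name, must be globally unique

  tags = {
    Name = "tag-demo"
  }
}

# tags_all holds the merged result: the provider's default tags plus Name.
output "tag_demo_tags_all" {
  value = aws_s3_bucket.tag_demo.tags_all
}
```

If a resource declares a tag with the same key as a default tag, the resource-level value takes precedence.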

Amazon Elastic Compute Cloud (EC2)

EC2 is the backbone of this infrastructure, as it holds the large e-commerce data logs while the business analysis takes place. It also provides resizable compute capacity for this environment. You can spin up a new server optimized for your workload in minutes and rapidly scale it up or down as your computing requirements change.

Amazon Kinesis

Kinesis plays a dual role in this infrastructure. First, the Kinesis Firehose stream captures data from the server logs being generated on our Amazon EC2 instance and delivers it to the data lake landing zone in your Amazon S3 bucket. Second, the Amazon Kinesis agent, an application running on the instance, publishes that data (“direct put”) into the Kinesis Firehose stream.
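
The Kinesis agent reads its settings from `/etc/aws-kinesis/agent.json`, where each flow maps a log file pattern to a delivery stream. Below is a hedged sketch of how that file could be rendered from Terraform with `jsonencode`. The log path and the stream name are placeholders, not values from the project.

```hcl
locals {
  # Assumed values for illustration only.
  agent_log_pattern     = "/var/log/ecommerce/*.log"
  agent_delivery_stream = "example-logs-stream"

  # Rendered contents of /etc/aws-kinesis/agent.json.
  kinesis_agent_config = jsonencode({
    flows = [
      {
        filePattern    = local.agent_log_pattern
        deliveryStream = local.agent_delivery_stream
      }
    ]
  })
}

# Exposed as an output so the rendered JSON can be inspected after a plan.
output "kinesis_agent_config" {
  value = local.kinesis_agent_config
}
```

One option would be to write `local.kinesis_agent_config` to the instance from `user_data` and then start the agent with `service aws-kinesis-agent start`.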

Kinesis agent sample:

2021-06-01 02:13:11.683+000 (Agent.MetricsEmitter RUNNING) com.amazon.kinesis.streaming.agent.Agent [INFO] Agent: Progress: 500000 records parsed (42036691 bytes), and 500000 records sent successfully to destinations. Uptime: 330039ms

A powerful feature of Kinesis Firehose is the ability to configure how your data is stored in S3, based on buffer size and buffer interval. Firehose delivers a file as soon as either threshold is reached. For this project I chose a buffer size of 5 megabytes, meaning that incoming data from the Firehose stream is written to S3 in files of up to roughly five megabytes. For the buffer interval I chose 60 seconds, the lowest value available at the time of writing. A tip to remember: Kinesis Firehose is “almost real-time” and cannot deliver faster than its buffer interval.
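
The buffer settings described above could be expressed roughly as follows. This is a sketch, not the project's actual code: the resource names, bucket name and role are hypothetical, and the argument names `buffer_size` and `buffer_interval` assume AWS provider v3 or v4 (in v5 and later they are named `buffering_size` and `buffering_interval`).

```hcl
# Hypothetical landing-zone bucket; the name is an assumption.
resource "aws_s3_bucket" "landing_zone" {
  bucket = "example-datalake-landing-zone"
}

# Hypothetical role that Firehose assumes to write into the bucket.
# The S3 permissions policy for this role is omitted for brevity.
resource "aws_iam_role" "firehose_role" {
  name = "example-firehose-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "firehose.amazonaws.com" }
    }]
  })
}

resource "aws_kinesis_firehose_delivery_stream" "logs_stream" {
  name        = "example-logs-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn        = aws_iam_role.firehose_role.arn
    bucket_arn      = aws_s3_bucket.landing_zone.arn
    buffer_size     = 5  # megabytes
    buffer_interval = 60 # seconds
  }
}
```

With no Kinesis data stream configured as a source, the delivery stream accepts direct puts, which is how the Kinesis agent publishes to it.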

Amazon Simple Storage Service (S3)

S3 is a popular choice for data lake storage because of its cost-effective, secure data storage, its 11 nines (99.999999999%) of durability and its virtually unlimited scalability. It makes sense to store your vast data logs in S3, don’t you think?
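
As a rough sketch of the two S3 zones mentioned earlier, the buckets could be created from a single resource with `for_each`. The bucket names are hypothetical, and the public access block is an assumed addition for a private data lake rather than something described in this post.

```hcl
locals {
  # Zone names follow the post; the naming scheme is an assumption.
  datalake_zones = toset(["landing", "consumption"])
}

# One bucket per zone.
resource "aws_s3_bucket" "zone" {
  for_each = local.datalake_zones
  bucket   = "example-datalake-${each.key}-zone"
}

# Block all public access on every zone bucket.
resource "aws_s3_bucket_public_access_block" "zone" {
  for_each = aws_s3_bucket.zone

  bucket                  = each.value.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Using `for_each` keeps both zones configured identically, and adding a zone later only means adding a name to the set.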

The goal for individuals or businesses using this data lake solution would be to build and integrate Amazon S3 with Amazon Kinesis, Amazon Athena, Amazon Redshift Spectrum and AWS Glue so that data scientists or engineers can query and process large amounts of data.

S3 data stream logs sample:

It is important to note that this infrastructure is not fully developed yet. I will be adding other services such as AWS Glue, Amazon Athena, Amazon Redshift, Amazon CloudWatch and Amazon QuickSight 😊 Please stay tuned.

Functions, arguments and expressions of Terraform that were used in the above project:

providers
variables
modules
resources
types and values
splat or [*] - One of my favorites
default-tags-in-the-terraform-aws-provider - New feature

Find the Terraform repo and directions for this project here

I would like to give a big shout-out to my mentor Derek Morgan. Thank you for all of your support over these months and for the amazing course "More Than Certified in Terraform", the best course out there. Link to the course here. If you want to connect with him and ask questions about his course, contact him via LinkedIn Derek Morgan or you can join the Discord channel here.
