Tamunopriye Dagogo-George

Terraforming Your Data Infrastructure on AWS: A Hands-on Guide for Data Engineers

In the constantly evolving world of Data Engineering, if you haven't come across Terraform by now, you might have been living under a rock :)

Terraform was originally embraced by DevOps Engineers for deploying cloud infrastructure. However, as Data Engineers recognized the need to provision similar resources, it has gained widespread adoption in the field.

What is Terraform?
Terraform is an open-source tool developed by HashiCorp for defining and provisioning infrastructure declaratively using the HashiCorp Configuration Language (HCL).

Terraform is used for Infrastructure as code. What does this mean?

  • Infrastructure refers to services like virtual servers, storage buckets, Lambda functions, RDS databases, IAM policies and so on, commonly offered by cloud providers such as AWS, Azure, GCP and Oracle. For this tutorial we will be focusing on AWS.
  • Infrastructure as code refers to the use of code (in this case HCL) to create and manage these resources, as opposed to configuring them manually.

Why do Data Engineers use Terraform?

  1. IaC: Terraform allows data engineers to define their infrastructure, including servers, networks, and storage, in code. This enables them to version control their infrastructure configurations, track changes over time, and easily reproduce environments.

  2. Cloud Agnostic: Terraform supports multiple cloud providers (e.g., AWS, Azure, Google Cloud) and even on-premises environments. Data engineers can use a consistent toolset regardless of their cloud provider.

  3. Collaboration: Terraform configurations can be shared and collaboratively developed by data engineering teams. Changes can be reviewed, and infrastructure can be modified with the approval of team members.

  4. Change Management: Data engineers can implement changes to infrastructure in a controlled manner. Terraform's plan and apply workflow allows them to preview changes before applying them, reducing the risk of errors.

  5. State Management: Terraform maintains a state file that keeps track of the current state of the infrastructure. This state file helps Terraform understand which resources are already provisioned and how they are configured, preventing unnecessary modifications.

  6. Integration: Terraform can be integrated into continuous integration/continuous deployment (CI/CD) pipelines, allowing data engineers to automate the testing and deployment of infrastructure changes.

  7. Cost Control: By defining infrastructure in code, data engineers can implement cost-saving strategies, such as automatically shutting down non-essential resources when not in use or optimizing resource sizes based on actual usage.

Terraform Lifecycle
(Image: the Terraform lifecycle)

Terraform's lifecycle consists of four stages.

  • terraform init initializes the Terraform working directory and downloads the cloud provider plugins
  • terraform plan creates and displays an execution plan of the resources to be configured
  • terraform apply executes the plan and creates the resources
  • terraform destroy deletes the resources created in a given Terraform environment
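
Put together, a typical working session looks something like this (a minimal sketch; what actually gets created depends on your .tf files):

terraform init      # download provider plugins and initialize the working directory
terraform plan      # preview what will be created, changed, or destroyed
terraform apply     # apply the plan (you will be asked to confirm)
terraform destroy   # tear everything down again when you no longer need it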

In this tutorial you will learn how to provision and configure an EC2 instance and an S3 bucket using Terraform. Make sure you have a working AWS account before you proceed.
You can find the full code for this tutorial in the accompanying GitHub repository.

Disclaimer:
This is a lengthy tutorial, particularly if you are new to Terraform, so I recommend taking it step by step, pausing when needed, and not letting yourself get overwhelmed.

Now that you've got the tea, let's dive in and get hands-on!

1. Install Terraform
To use Terraform locally you need to have it installed.

On Mac or Linux, Terraform can be installed with a version manager such as tfswitch (the Terraform Switcher), which lets you switch between Terraform versions much like pyenv does for Python.
Otherwise, you can use a package manager. Homebrew for Mac:

brew install terraform

or apt for Debian/Ubuntu Linux (the commands below add HashiCorp's package repository first):

wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

Chocolatey for Windows:

choco install terraform

You can go through the official documentation for any step you need further clarification on.

After installation, run terraform -version in your terminal to confirm the installation. This should return the installed Terraform version.

(Image: terraform -version output)

2. AWS Authentication - Generate new access keys
To authorize Terraform to manage resources in your AWS account, you need to complete the following steps:
a. Create a new IAM user with administrator access to obtain an AWS access key and secret key.
In the AWS console, search for the IAM service and then navigate to Users -> Create user

(Image: creating a new IAM user)

The next step is to set permissions: click Attach policies directly, select the AdministratorAccess policy, then click Next and create the user.

(Image: setting permissions for the new user)
Now that you have created a user, you can generate access keys for it by navigating back to Users -> Security credentials -> Create access key.

Choose the Command Line Interface (CLI) option, write a description, and finally create the access key. You can download the keys as a CSV for future reference or simply take note of them.

b. Configure the AWS CLI with the new credentials for AWS authentication.
To install the AWS CLI on Mac, Linux or Windows, kindly follow the official installation guide.

After a successful installation, run aws --version to confirm, and then aws configure. This will prompt you for the Access Key ID and Secret Access Key copied in step (a).
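
The prompts look roughly like this (the values shown are placeholders; use your own keys and preferred region):

$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: eu-north-1
Default output format [None]: json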

Alternatively, you can also authenticate AWS using environment variables:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

Once this is done, you are all set to use Terraform with your AWS account!

3. Terraform Core Concepts

File Structure
Terraform treats each directory of configuration as a module and loads every configuration file that ends in the .tf extension.

A typical file structure for a new module is:

├─ README.md
├─ main.tf
├─ variables.tf
├─ outputs.tf

  • main.tf will contain the main set of configurations for your module.

  • variables.tf will contain the variable definitions for your module. When your module is used by others, the variables are configured as arguments in the module block (see the sketch right after this list).

  • outputs.tf will contain the output definitions for your module. Module outputs are often used to pass information about the parts of your infrastructure defined by the module to other parts of your configuration.
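
For instance, if a module declared a bucket_name variable in its variables.tf and a bucket_arn output in its outputs.tf, a caller could use it like this (a hypothetical sketch; the module path and names are made up for illustration):

module "data_lake" {
  source      = "./modules/data_lake"  # path to the module's directory
  bucket_name = "my-raw-zone-bucket"   # sets the module's bucket_name variable
}

# the module's outputs are then available as module.<NAME>.<OUTPUT>,
# e.g. module.data_lake.bucket_arn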

Executing a Terraform configuration creates some other files you should be aware of; make sure you don't distribute them as part of your module:

  • terraform.tfstate and terraform.tfstate.backup: These files contain your Terraform state and are how Terraform keeps track of the relationship between your configuration and the infrastructure provisioned by it.

  • .terraform: This directory contains the modules and plugins used to provision your infrastructure. These files are specific to the working directory in which Terraform was initialized.

  • .terraform.lock.hcl: This file maintains specific provider versions, ensuring consistency in Terraform configurations and preventing unintended updates. It enhances control and stability, critical for production environments.

Terraform Syntax
These are the most essential building blocks of Terraform.

Providers: Terraform relies on plugins called providers to interact with cloud platforms and other APIs. To configure any infrastructure, you first declare the provider so that Terraform can install and use it.

Terraform supports a long list of providers, such as AWS, Azure, GCP, etc.; visit the provider documentation to see how other providers are used.

The AWS provider is declared in a provider block as follows:

  provider "aws" {
  region = "us-north-1"
}
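
You can also pin which provider (and version range) Terraform should download by adding a terraform block alongside the provider block. A minimal sketch, assuming you want the official hashicorp/aws provider from the Terraform Registry:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"  # provider to download from the Terraform Registry
      version = "~> 5.0"         # accept any 5.x release
    }
  }
}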

Resources: Terraform uses resource blocks to define components of your infrastructure, such as virtual networks, compute instances, etc.
Most Terraform providers expose a number of different resource types, as documented in the provider's resource reference.

AWS Resource Type    | AWS Infrastructure
aws_instance         | EC2 Instance
aws_security_group   | Security Group

To create a resource:

resource "resource_type" "custom_resource_name" {
    sample_attribute = ""
}

Resource blocks declare a resource type and name. Together, the type and name form a unique resource identifier (ID) in the format resource_type.resource_name.
To display information about a resource as output, you reference this resource ID.

Resource types always start with the provider name followed by an underscore, for example aws_instance.

Variables: The general idea of variables in programming also applies in Terraform: variables allow you to parameterize your configurations and make them more flexible and reusable.

Terraform variables can be set in several ways (options 2-4 are sketched right after this list):

  • In the variables.tf file
  • Individually, with the -var command line option
  • In variable definition (.tfvars) files, either specified on the command line or loaded automatically
  • As environment variables (prefixed with TF_VAR_)
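
For example, a variable named instance_name (declared properly later in this tutorial) could be set like this:

# terraform.tfvars (loaded automatically)
instance_name = "test_instance"

# equivalently, on the command line:
#   terraform plan -var="instance_name=test_instance"
# or as an environment variable:
#   export TF_VAR_instance_name=test_instance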

We will be using the first option, declaring a variable block with the syntax:

variable "var_name" {
  description = "description of var"
  type        = string
  default     = "value"
}

Note that Terraform variables can be of types such as string, number, bool, list, map, object and tuple.

The value can then be accessed from within expressions as var.<NAME>, where <NAME> matches the label given in the declaration block. For example,
in variables.tf we create a new variable called user_information:

variable "user_information" {
  type = object({
    name    = string
    address = string
  })
  sensitive = true
}

In main.tf we access the variable using var.user_information

resource "some_resource" "a" {
  name    = var.user_information.name
  address = var.user_information.address
}

Outputs: Output values make information about your infrastructure available on the command line, and can expose information for other Terraform configurations to use. Output values are similar to return values in programming languages.

Each output value exported by a module must be declared using an output block:

output "instance_ip_addr" {
  value = resource_type.custom_resource_name.attribute
}

The label immediately after the output keyword is the output name, which must be a valid identifier.
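
Once terraform apply has run, declared outputs can be read back at any time with the terraform output command, for example:

terraform output                    # list all outputs from the current state
terraform output instance_ip_addr   # print a single output value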

Now that we understand the syntax, let's provision our first AWS resources.

4. Creating an EC2 instance
For our first example we want to create an EC2 instance that we can access over the web. This means we need to know the public IP at which the server is hosted, so we will also be creating outputs in the outputs.tf file.

To create an EC2 instance we need to declare a resource block that uses the resource type aws_instance.

From the documentation we can see the argument reference for aws_instance, which lists the different arguments needed to configure an instance.
For this example we will define the following:
a. user_data: a startup script that performs some configuration on the EC2 instance during creation, such as setting the hostname or installing a software package.
b. ami: the unique identifier of a specific Amazon Machine Image to launch the instance from.
c. security_group: a virtual firewall that controls inbound and outbound network traffic to the instance.
d. instance_type: the hardware profile of the virtual machine (CPU, memory, storage capacity, and so on).
e. key_name: the SSH key pair required for secure access to the instance via SSH.

Let's go ahead and create some of these as variables in variables.tf:

variable "instance_name" {
    description = "Name of the EC2 instance"
    type        = string
    default     = "test_instance"
}
variable "instance_type" {
    description = "EC2 instance type"
    type        = string
    default     = "t3.micro"
}
variable "instance_ami" {
    description = "id of AMI"
    type        = string
    default     = "ami-065681da47fb4e433"
}
variable "key_name" {
    description = "key name"
    type        = string
    default     = "ssh-keypair"
}

You can find a suitable instance_type and ami by walking through the manual EC2 setup flow in the AWS console.

In the AWS console search for EC2, click on Instances, and then Launch instances.
On this page you can view the AMI ID and instance type of your choice. Since we are on the free tier, t3.micro is good enough. Note that AMI IDs are region-specific, so copy the ID shown for your own region.

(Image: choosing the AMI and instance type in the EC2 console)

Next we will use these variables in main.tf by declaring the provider and resource blocks.

provider "aws" {
    region = "eu-north-1"
}

resource "aws_security_group" "web_traffic" {
    name = "allow_tls"

    ingress {
        from_port        = 443
        to_port          = 443
        protocol         = "tcp"
        cidr_blocks      = ["0.0.0.0/0"]
    }
    ingress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
    }
    egress {
        from_port        = 443
        to_port          = 443
        protocol         = "tcp"
        cidr_blocks      = ["0.0.0.0/0"]
    }
    egress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
    }
}

resource "aws_key_pair" "ssh_key_pair" {
  key_name   = var.key_name
  public_key = file(".ssh/rsa.pub")
}

resource "aws_instance" "my_ec2" {
    ami = var.instance_ami
    instance_type = var.instance_type
    user_data = file("startup.sh")
    key_name = aws_key_pair.ssh_key_pair.key_name
    security_groups = [aws_security_group.web_traffic.name]

    tags = {
        Name = var.instance_name
    }
}

Here we specified three resources: aws_instance, aws_key_pair, and aws_security_group.
Attributes of aws_security_group and aws_key_pair were passed as arguments into aws_instance as security_groups and key_name. Note how we used the format resource_type.resource_name.attribute to access them.

Creating a key pair with aws_key_pair allows you to securely connect to your EC2 instance using SSH, while aws_security_group lets you control inbound and outbound traffic with rules based on IP addresses, ports and protocols. In this case we allowed traffic from any IP on ports 80 and 443.
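
Note that the rules above only open ports 80 and 443. If you also want to SSH into the instance later (as we do at the end of this section), you will need an additional ingress rule for port 22. A minimal sketch you can add inside the same aws_security_group block:

    ingress {
        from_port   = 22
        to_port     = 22
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]  # for real workloads, restrict this to your own IP
    }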

Keep in mind that to use aws_key_pair, we must first generate an SSH key, which we then provide to the AWS instance. To generate an SSH key with a length of 4096 bits, run the command below and follow the prompts. I created a .ssh folder and an rsa filename to store my key pair.

ssh-keygen -t rsa -b 4096

The ssh-keygen command will generate two files: a private key file and a public key file. The private key file should be kept secure and never shared with anyone. The public key file can be shared with Amazon EC2 instances to allow SSH access.
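
If you want the files to land exactly where main.tf expects them (.ssh/rsa and .ssh/rsa.pub inside the project folder), you can pass the output path explicitly with -f instead of typing it at the prompt:

ssh-keygen -t rsa -b 4096 -f .ssh/rsa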

Thirdly, we declare an output block in outputs.tf so that we can view the public IP of the hosted server:

output "instance_id" {
    value = aws_instance.my_ec2.public_ip
}

The attribute reference section of the documentation shows all the attributes that can be returned after provisioning a resource.

Lastly, create a startup.sh with the following content:

#!/bin/bash
sudo yum update -y
sudo yum install -y httpd
sudo systemctl start httpd
sudo systemctl enable httpd
echo "<h1>Hello from Terraform</h1>" | sudo tee /var/www/html/index.html]

This is the bash script that will be executed through user_data when the instance boots for the first time.

Now that all requirements are in place:

  • Initialize the project - terraform init

  • Next, to preview the changes that Terraform will make to your
    infrastructure, run terraform plan.
    This will show you all the configurations to be deployed.

  • Finally, to apply the configuration, run terraform apply (see the note right after this list).
    This will output the public IP of the instance.
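
When you run terraform apply, Terraform prints the plan one more time and waits for you to type yes before touching anything; in automated pipelines the confirmation is often skipped with the -auto-approve flag (use it with care):

terraform apply                 # interactive: review the plan, then type "yes"
terraform apply -auto-approve   # non-interactive: applies immediately without the prompt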

(Image: terraform apply output showing the instance's public IP)

Typing the IP address into your web browser with the format: http://<IP_address> should yield the following page:

(Image: the web page served by the EC2 instance)

and on your AWS console you will see the server up and running

(Image: the running instance in the EC2 console)

Congratulations! You have provisioned your first EC2 instance using Terraform, and you can SSH into it using the private half of the generated key pair, for example: ssh -i .ssh/rsa ec2-user@<public_ip>.

5. Creating an S3 bucket
Now that we are familiar with the methods and syntax of Terraform, this should be easy to grasp.

To create a simple s3 bucket we will be using the aws_s3_bucket resource.

From the documentation linked above we can browse the argument reference to see what arguments we need to pass to this resource.
Let's go ahead and define the bucket_name in variables.tf. Make sure to use a globally unique name.

variable "bucket_name" {
    description = "Name of s3 bucke"
    type        = string
    default     = "my-tf-test-bucket-priye"
}

Now we will add the aws_s3_bucket resource to main.tf

resource "aws_s3_bucket" "my_s3_bucket" {
    bucket = var.bucket_name

    tags = {
        Name        = "My bucket"
        Environment = "Dev"
    }
}
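
Just like with the EC2 instance, you can optionally expose details of the bucket in outputs.tf, for example its ARN (one of the attributes listed in the resource's attribute reference):

output "bucket_arn" {
    value = aws_s3_bucket.my_s3_bucket.arn
}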

Now run terraform plan and terraform apply again. This will add one additional resource:

(Image: terraform apply output for the S3 bucket)

You can also confirm the s3 bucket on your AWS console.

(Image: the new bucket in the S3 console)

6. Destroy all resources
To remove all provisioned resources and avoid unnecessary costs, execute the following command; Terraform will show you everything it is about to delete and ask for confirmation:

terraform destroy

Congratulations! We have come to the end of this tutorial and you have taken your first steps into the world of Terraform!🎉

In this tutorial, you've learned the fundamentals of Terraform, from setting up your environment to creating your first resources. But remember, Terraform is a vast and flexible tool with many advanced features waiting for you to explore.
As you embark on your data engineering journey with Terraform, delve into advanced topics such as:

Modules: Learn how to encapsulate and reuse your infrastructure code for better organization and maintainability.

Dynamic Blocks: Discover how to create dynamic and flexible configurations for resources using dynamic blocks.

For Each: Explore how to loop through lists and maps to create multiple similar resources with ease.

Provider Configuration: Extend your skills by configuring different providers for hybrid and multi-cloud setups.

State Management: Master the art of state files, remote backends, and locking mechanisms for safe and efficient collaboration.

If you enjoyed learning a thing or two from this tutorial, please consider sharing it. Additionally, if you'd be interested in a follow-up tutorial covering these advanced concepts, don't hesitate to express your interest in the comments below. Your feedback is highly appreciated!

Happy Terraforming! 🌍💻
