Introduction
In the real world, infrastructure is rarely created in isolation. Teams often need to deploy new resources into existing environments, whether that's a shared VPC managed by a network team, a subnet created by another project, or legacy infrastructure that predates your Terraform adoption. This is where Terraform data sources become invaluable.
In this comprehensive guide, we'll explore how to use Terraform data sources to reference and interact with existing AWS infrastructure without taking ownership of it. We'll walk through a practical example of deploying an EC2 instance into a pre-existing VPC and subnet.
>> What Are Terraform Data Sources?
Data sources in Terraform are a powerful feature that allows you to fetch information about existing infrastructure without managing it. Think of them as read-only queries that let your Terraform configuration discover and use resources that already exist in your cloud environment.
>> Key Characteristics of Data Sources:
- Read-Only: Data sources only query information; they never create, update, or delete resources
- Dynamic Discovery: They find resources based on filters, tags, or identifiers
- Cross-Team Collaboration: Perfect for working with infrastructure managed by other teams
- Separation of Concerns: Allows you to reference shared infrastructure without taking ownership
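To make the read-only distinction concrete, here is a minimal sketch contrasting the two block types. The resource block creates and manages a VPC, while the data block only looks one up; the names and tag value here are hypothetical, not part of the scenario that follows.

# A resource block CREATES and manages infrastructure
resource "aws_vpc" "owned" {
  cidr_block = "10.10.0.0/16"
}

# A data block only READS information about infrastructure that already exists
data "aws_vpc" "lookup_only" {
  tags = {
    Name = "some-existing-vpc" # hypothetical tag value, for illustration only
  }
}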
-> The Real-World Scenario
Imagine this common enterprise scenario:
Your organization has a shared networking infrastructure: a VPC and subnets created by the network operations team. These resources are tagged and managed separately. Your team needs to deploy a new application server (EC2 instance) into this existing network, but you don't want to manage the VPC or subnet in your Terraform configuration.
This is the perfect use case for data sources!
Our setup involves:
1. Pre-existing Infrastructure (managed elsewhere):
- VPC with tag: Name = shared-network-vpc
- Subnet with tag: Name = shared-primary-subnet
2. Our Managed Resources:
- EC2 instance deployed into the existing subnet
- Uses the latest Amazon Linux 2 AMI
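The snippets that follow assume a standard AWS provider configuration along these lines; the region and version constraint shown here are placeholders rather than values from the original setup:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint, for illustration
    }
  }
}

# The region is an assumption; use the region where the shared VPC actually lives
provider "aws" {
  region = "us-east-1"
}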
Breaking Down the Terraform Code
Let's examine the three data sources used in this configuration:
1. Finding the Existing VPC
data "aws_vpc" "shared" {
filter {
name = "tag:Name"
values = ["shared-network-vpc"]
}
}
This data source searches for a VPC with a specific tag. The filter block allows us to query based on tags, which is a best practice for identifying shared infrastructure. Once found, we can reference this VPC using data.aws_vpc.shared.id.
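Beyond the ID, the aws_vpc data source exports other attributes, such as cidr_block, that you can reuse elsewhere in your configuration. A small sketch that surfaces them as outputs, purely for illustration (the output names are arbitrary):

output "shared_vpc_id" {
  value = data.aws_vpc.shared.id
}

output "shared_vpc_cidr" {
  value = data.aws_vpc.shared.cidr_block
}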
2. Locating the Subnet
data "aws_subnet" "shared" {
filter {
name = "tag:Name"
values = ["shared-primary-subnet"]
}
vpc_id = data.aws_vpc.shared.id
}
This data source finds a subnet by its tag, but notice the vpc_id parameter-it ensures we're finding the subnet within the correct VPC. This demonstrates chaining data sources, where one data source uses the output of another.
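As a side note, the aws_subnet data source also accepts a tags argument, so the same lookup could be written without an explicit filter block. This variant is equivalent in intent and shown only for comparison:

data "aws_subnet" "shared_alt" {
  vpc_id = data.aws_vpc.shared.id

  tags = {
    Name = "shared-primary-subnet"
  }
}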
3. Getting the Latest AMI
data "aws_ami" "amazon_linux_2" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
This data source queries for the most recent Amazon Linux 2 AMI. Using data sources for AMIs is a best practice because it ensures you're always using the latest patched version without hardcoding AMI IDs (which vary by region).
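Because the resolved AMI can change over time, it is worth surfacing which image a given plan actually picked. A minimal sketch using outputs (the output names are arbitrary):

output "resolved_ami_id" {
  value = data.aws_ami.amazon_linux_2.id
}

output "resolved_ami_name" {
  value = data.aws_ami.amazon_linux_2.name
}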
4. Creating the EC2 Instance
resource "aws_instance" "main" {
ami = data.aws_ami.amazon_linux_2.id
instance_type = "t2.micro"
subnet_id = data.aws_subnet.shared.id
private_ip = "10.0.1.50"
tags = {
Name = "terraform-aws-day-13-instance"
}
}
The EC2 instance resource uses the data sources we defined:
- ami: References the latest Amazon Linux 2 AMI ID
- subnet_id: Places the instance in the existing subnet
- private_ip: Pins a static private IP, which must fall within the shared subnet's CIDR range
This is the magic of data sources: we're deploying into existing infrastructure seamlessly!
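To confirm everything landed where expected, you can expose a few attributes after apply. This is an optional sketch; the output names are arbitrary:

output "instance_id" {
  value = aws_instance.main.id
}

output "instance_private_ip" {
  value = aws_instance.main.private_ip
}

# The subnet's CIDR, so you can sanity-check the hardcoded private_ip against it
output "shared_subnet_cidr" {
  value = data.aws_subnet.shared.cidr_block
}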
The Power of This Approach
1. Separation of Concerns
The network team manages the VPC and subnets, while your team manages the compute resources. Each team works independently without stepping on each other's toes.
2. No State Conflicts
Because you're not managing the VPC and subnet in your Terraform state, there's no risk of accidentally modifying or destroying them.
3. Clean Terraform Plans
When you run terraform plan, you'll see only the resources you're managing; in this case, just the EC2 instance. The VPC and subnet won't appear in the plan at all.
4. Simplified Cleanup
Running terraform destroy only removes the EC2 instance. The shared infrastructure remains intact, as it should.
Common Use Cases for Data Sources
Beyond our VPC/subnet example, data sources are invaluable for:
- Security Groups: Reference existing security groups for your instances
- IAM Roles: Use pre-created IAM roles without managing them
- S3 Buckets: Reference existing buckets for application configuration
- Route53 Zones: Deploy records into existing hosted zones (see the sketch after this list)
- KMS Keys: Use organization-wide encryption keys
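As one illustration of the Route53 case, a record could be deployed into an existing hosted zone like this; the zone name and record name are hypothetical placeholders:

# Look up a hosted zone managed elsewhere (domain is a placeholder)
data "aws_route53_zone" "shared" {
  name = "example.com."
}

# Create a record in that zone, pointing at the instance from earlier
resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.shared.zone_id
  name    = "app.example.com"
  type    = "A"
  ttl     = 300
  records = [aws_instance.main.private_ip]
}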
Troubleshooting Tips
Problem: Data source can't find the resource.
Solution: Verify the resource exists and tags match exactly (tags are case-sensitive).
Problem: Multiple resources match the filter.
Solution: Make your filters more specific or use unique identifiers (see the sketch after these tips).
Problem: Resource in wrong region.
Solution: Ensure your provider region matches where the resource exists.
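For the "multiple resources match" case, one way to narrow the query is to stack additional filter blocks. The Environment tag below is a hypothetical example of a disambiguating tag your organization might use:

data "aws_subnet" "shared" {
  filter {
    name   = "tag:Name"
    values = ["shared-primary-subnet"]
  }

  # Hypothetical second filter to disambiguate between environments
  filter {
    name   = "tag:Environment"
    values = ["production"]
  }

  vpc_id = data.aws_vpc.shared.id
}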
Conclusion
Terraform data sources are essential for real-world infrastructure management. They enable you to build on top of existing infrastructure, collaborate across teams, and maintain a clean separation of responsibilities, all without the complexity of shared state files or cross-configuration dependencies.
Remember: data sources don't manage infrastructure; they discover it. And that distinction is what makes them so powerful for building composable, maintainable Terraform configurations.
>> Connect With Me
If you enjoyed this post or want to follow my #30DaysOfAWSTerraformChallenge journey, feel free to connect with me here:
- LinkedIn: Amit Kushwaha
- GitHub: Amit Kushwaha
- Hashnode: Amit Kushwaha
- Twitter/X: Amit Kushwaha


