Emmanuel E. Ebenezer

Stop Creating Everything: The Art of Terraform Data Sources

It's Day 13 of the AWS Challenge, and today I learned something that reshaped my perspective on Infrastructure as Code: you don't have to manage everything.

Up until now, everything I've practiced in Terraform has been about creating resources. VPCs? Create them. Subnets? Create them. Security groups? Create them all! But in industry, especially at companies with multiple teams, you can't just create everything from scratch. The networking team already built the VPC. The security team manages the security groups. Your job is to deploy your app into existing infrastructure.

That's where data sources come in, and they're absolutely game-changing.

Hey, I'm learning Terraform in 30 days. Or at least I'm trying to. You can join the challenge too.

Resources vs. Data Sources

Here's the fundamental difference that took me way too long to understand:

Resource Block = "I own this. I manage its entire lifecycle."

resource "aws_vpc" "my_vpc" {
  cidr_block = "10.0.0.0/16"
  # Terraform creates, updates, and destroys this
}

Data Block = "This already exists. I just need to reference it."

data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["shared-network-vpc"]
  }
  # Terraform just reads this, never touches it
}

That's it. That's the whole concept. But the implications are huge.

Why This Matters: The Multi-Team Reality

Let me paint a realistic scenario:

Your company has:

  • A networking team that manages all VPCs and subnets
  • A security team that maintains security groups and IAM policies
  • An infrastructure team (that's you!) that deploys applications

Without data sources:
You'd need one of:

  1. Access to everyone's Terraform state files (good luck with that)
  2. Manually copy-pasted IDs everywhere
  3. Duplicate resources (a guaranteed recipe for chaos)

With data sources:
You just query what you need:

# Find the VPC the cloud networking team created
data "aws_vpc" "company_vpc" {
  filter {
    name   = "tag:ManagedBy"
    values = ["cloud-networking-team"]
  }
}

# Find the security group the security team manages
data "aws_security_group" "approved_sg" {
  filter {
    name   = "tag:ManagedBy"
    values = ["security-team"]
  }
}

# Deploy your app using their infrastructure
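# (Assumes the aws_ami.latest_amazon_linux and aws_subnet.app_subnet
# data sources are declared elsewhere in this configuration.)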
resource "aws_instance" "my_app" {
  ami           = data.aws_ami.latest_amazon_linux.id
  subnet_id     = data.aws_subnet.app_subnet.id
  vpc_security_group_ids = [data.aws_security_group.approved_sg.id]

  # Your app, their infrastructure. Perfect harmony.
}

Beautiful. Clean. No stepping on anyone's toes.

My Hands-On Demo: The Three Essential Data Sources

I worked through a scenario that simulates practical infrastructure sharing. Here's what I built:

Setup: Simulating Existing Infrastructure

First, I created "existing" infrastructure (pretending another team already did this):

# This simulates what the networking team already deployed
resource "aws_vpc" "shared" {
  cidr_block = "10.0.0.0/16"
  tags = {
    Name = "shared-network-vpc"  # ← This tag is key!
  }
}

resource "aws_subnet" "shared" {
  vpc_id     = aws_vpc.shared.id
  cidr_block = "10.0.1.0/24"
  tags = {
    Name = "shared-primary-subnet"  # ← This too!
  }
}

After deploying this, I pretended to forget it existed, lol (as you do in large organizations).

Data Source #1: Finding the VPC

Now, from a completely separate Terraform configuration, I need to deploy an EC2 instance into that VPC. Here's how I find it:

data "aws_vpc" "shared" {
  filter {
    name   = "tag:Name"
    values = ["shared-network-vpc"]
  }
}

What this does:

  • Queries AWS for a VPC with that specific tag
  • Returns the VPC ID, CIDR block, and all other attributes
  • Updates every time I run Terraform (always fresh data)

Pro tip: I tested this in terraform console before using it:

terraform console
> data.aws_vpc.shared.id
"vpc-0a1b2c3d4e5f6"
> data.aws_vpc.shared.cidr_block
"10.0.0.0/16"

Seeing real AWS data come back in real time? Muah... chef's kiss.

Data Source #2: Finding the Subnet (Chained!)

Now I need the subnet. But wait—there might be multiple subnets. I can narrow it down by both tag AND VPC:

data "aws_subnet" "shared" {
  filter {
    name   = "tag:Name"
    values = ["shared-primary-subnet"]
  }
  vpc_id = data.aws_vpc.shared.id  # ← Using the first data source!
}

What I learned:

  • You can chain data sources (use one to refine another)
  • The vpc_id parameter narrows the search
  • This prevents grabbing the wrong subnet if multiple have similar names
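
Side note: if you actually want several subnets at once (say, every subnet in the VPC), there's a plural data source for that. A quick sketch, not part of my demo:

# Grab ALL the subnet IDs in the shared VPC
data "aws_subnets" "all_shared" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }
}

# data.aws_subnets.all_shared.ids is a list of subnet IDs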

Data Source #3: Latest AMI (The Dynamic One)

Instead of hardcoding an AMI ID (which goes stale), I can dynamically fetch the latest approved AMI:

data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]  # Only official Amazon AMIs

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

Why this is brilliant:

  • most_recent = true always grabs the newest matching AMI
  • Wildcards (*) allow flexible pattern matching
  • Multiple filters ensure you get exactly what you want
  • Your instances automatically use the latest AMI on next deploy

No more "Oops, I'm using an AMI from 2019."

Putting It All Together: The Final Resource

Now I use all three data sources to deploy my EC2 instance:

resource "aws_instance" "main" {
  ami           = data.aws_ami.amazon_linux_2.id           # ← Data source
  instance_type = "t2.micro"
  subnet_id     = data.aws_subnet.shared.id                # ← Data source
  private_ip    = "10.0.1.50"

  tags = {
    Name = "day13-instance"
  }
}

When I ran terraform plan:

Plan: 1 to add, 0 to change, 0 to destroy.

Only 1 resource being created! The VPC and subnet aren't in the plan because Terraform isn't managing them. It's just referencing them.

This is the magic. One configuration, seamless integration with existing infrastructure.

The Power Move: Terraform Console Testing

Before you deploy anything, you can test your data sources in the console:

terraform console

# Test VPC data source
> data.aws_vpc.shared.id
"vpc-0a1b2c3d4e5f6"

# Test subnet data source
> data.aws_subnet.shared.cidr_block
"10.0.1.0/24"

# Test AMI data source
> data.aws_ami.amazon_linux_2.name
"amzn2-ami-hvm-2.0.20231218.0-x86_64-gp2"

You can see actual AWS data, verify your filters work, and confirm everything before you deploy.
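
If you'd rather have those values printed on every apply, an output works too (a tiny sketch):

output "shared_vpc_id" {
  value = data.aws_vpc.shared.id
}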

Common Data Sources You'll Actually Use

After doing this demo, I researched what data sources are most useful in real projects. Here's my cheat sheet:

Network Resources

# VPC lookup
data "aws_vpc" "main" { ... }

# Subnet lookup
data "aws_subnet" "main" { ... }

# Security Group lookup
data "aws_security_group" "main" { ... }

# All availability zones in current region
data "aws_availability_zones" "available" {
  state = "available"
}
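
That last one pairs nicely with count. Here's a sketch of spreading subnets across every available AZ (the aws_subnet.per_az resource and the CIDR math are mine for illustration, assuming you own the subnets):

# One /24 subnet per available AZ, carved out of the VPC's CIDR
resource "aws_subnet" "per_az" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = data.aws_vpc.main.id
  availability_zone = data.aws_availability_zones.available.names[count.index]
  cidr_block        = cidrsubnet(data.aws_vpc.main.cidr_block, 8, count.index)
}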

Compute Resources

# Latest AMI (super common)
data "aws_ami" "latest" { ... }

# Existing EC2 instance
data "aws_instance" "main" { ... }

Identity & Metadata

# Current AWS account ID
data "aws_caller_identity" "current" {}

# Current AWS region
data "aws_region" "current" {}

# Example usage:
# "My account is ${data.aws_caller_identity.current.account_id}"
# "I'm deploying to ${data.aws_region.current.name}"
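
Those two are handy for names that have to be unique per account and region. A sketch (the bucket itself is hypothetical, not from my demo):

# Globally unique bucket name built from account ID + region
resource "aws_s3_bucket" "artifacts" {
  bucket = "artifacts-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
}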

Storage

# S3 Bucket lookup
data "aws_s3_bucket" "main" { ... }

# RDS Instance lookup
data "aws_db_instance" "main" { ... }
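
For example, pointing at a bucket another team owns (the bucket name here is hypothetical):

data "aws_s3_bucket" "assets" {
  bucket = "company-shared-assets"
}

# Then use data.aws_s3_bucket.assets.arn in your IAM policies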

Industry Practice Patterns I Discovered

Pattern 1: Tag-Based Discovery

Best practice: Use consistent, unique tags

data "aws_vpc" "main" {
  filter {
    name   = "tag:Environment"
    values = ["production"]
  }
  filter {
    name   = "tag:ManagedBy"
    values = ["networking-team"]
  }
}

Why multiple filters? Precision. One filter might match multiple resources. Two or three? Much more specific.

Pattern 2: ID-Based Lookup (When You Have It)

If you know the exact ID, use it:

data "aws_vpc" "main" {
  id = "vpc-12345678"
}

Pros: Fast, precise, no ambiguity
Cons: Less flexible, harder to maintain if IDs change

Pattern 3: Fallback Values

What if the data source doesn't find anything? Careful: a data source whose filter matches nothing fails the plan outright, so try() on its own won't catch that. The pattern that works is to make the lookup optional, then fall back with try():
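
A sketch, assuming a hypothetical var.lookup_shared_vpc flag:

# Skip the lookup entirely in environments without the shared VPC
data "aws_vpc" "shared" {
  count = var.lookup_shared_vpc ? 1 : 0
  filter {
    name   = "tag:Name"
    values = ["shared-network-vpc"]
  }
}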

locals {
  # shared[0] only exists when the lookup ran; try() falls back otherwise
  vpc_id = try(data.aws_vpc.shared[0].id, var.default_vpc_id)
}

This gives you a safety net.

The Bigger Picture: Why This Matters

After today, I finally understand how Terraform works in an ideal engineering department:

Team A (Networking):

  • Manages VPCs, subnets, route tables
  • Tags everything consistently
  • Provides documentation of available resources

Team B (Security):

  • Manages security groups, IAM policies
  • Creates "approved" security group templates
  • Tags them for easy discovery

Team C (Applications):

  • Uses data sources to find Team A's VPCs
  • Uses data sources to find Team B's security groups
  • Deploys apps without touching core infrastructure

Everyone stays in their lane. No conflicts.

This is collaborative Infrastructure as Code. Not just "my infrastructure," but "our infrastructure, properly managed."

Key Takeaways

  1. Data sources read, resources manage. Know the difference, use them appropriately.

  2. Tag everything meaningfully. Your future self (and your teammates) will thank you.

  3. Test in terraform console. Verify your data sources work before deploying.

  4. Use multiple filters for precision. One tag might match multiple resources; two or three gets specific.

  5. Data sources query every run. They always reflect current AWS state, unlike hardcoded IDs.

  6. Chain data sources. Use one to refine another for precise targeting.

  7. Always fetch latest AMIs dynamically. Never hardcode AMI IDs; they go stale.

What's Next?

Tomorrow (Day 14), I'll be building a mini project to test the waters with my Terraform knowledge.

The pieces are coming together. Functions make your code intelligent. Data sources connect you to existing infrastructure.

This is getting really good.

See ya! (And seriously, go use terraform console to explore your AWS account. It's like having X-ray vision for your infrastructure.)
