It's Day 13 of the AWS Challenge, and today I learned something that reshaped my perspective on Infrastructure as Code: you don't have to manage everything.
Up until now, every Terraform exercise I've done has been about creating resources. VPCs? Create them. Subnets? Create them. Security groups? Create them all! But in industry practice, especially at companies with multiple teams, you can't just create everything from scratch. The networking team already built the VPC. The security team manages the security groups. Your job is to deploy your app into existing infrastructure.
That's where data sources come in, and they're absolutely game-changing.
Hey, I'm learning Terraform in 30 days. Or at least I'm trying to. You can join the challenge too.
Resources vs. Data Sources
Here's the fundamental difference that took me way too long to understand:
Resource Block = "I own this. I manage its entire lifecycle."
resource "aws_vpc" "my_vpc" {
cidr_block = "10.0.0.0/16"
# Terraform creates, updates, and destroys this
}
Data Block = "This already exists. I just need to reference it."
data "aws_vpc" "existing_vpc" {
filter {
name = "tag:Name"
values = ["shared-network-vpc"]
}
# Terraform just reads this, never touches it
}
That's it. That's the whole concept. But the implications are huge.
Why This Matters: The Multi-Team Reality
Let me paint a realistic scenario:
Your company has:
- A networking team that manages all VPCs and subnets
- A security team that maintains security groups and IAM policies
- An infrastructure team (that's you!) that deploys applications
Without data sources:
You'd either need:
- Access to everyone's Terraform state files (good luck with that)
- To manually copy-paste IDs everywhere
- To create duplicate resources (a guaranteed recipe for chaos)
With data sources:
You just query what you need:
# Find the VPC the cloud networking team created
data "aws_vpc" "company_vpc" {
  filter {
    name   = "tag:ManagedBy"
    values = ["cloud-networking-team"]
  }
}

# Find the security group the security team manages
data "aws_security_group" "approved_sg" {
  filter {
    name   = "tag:ManagedBy"
    values = ["security-team"]
  }
}

# Deploy your app using their infrastructure
# (the aws_ami and aws_subnet data sources are defined the same way)
resource "aws_instance" "my_app" {
  ami                    = data.aws_ami.latest_amazon_linux.id
  subnet_id              = data.aws_subnet.app_subnet.id
  vpc_security_group_ids = [data.aws_security_group.approved_sg.id]
  # Your app, their infrastructure. Perfect harmony.
}
Beautiful. Clean. No stepping on anyone's toes.
My Hands-On Demo: The Three Essential Data Sources
I worked through a scenario that simulates practical infrastructure sharing. Here's what I built:
Setup: Simulating Existing Infrastructure
First, I created "existing" infrastructure (pretending another team already did this):
# This simulates what the networking team already deployed
resource "aws_vpc" "shared" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "shared-network-vpc" # ← This tag is key!
  }
}

resource "aws_subnet" "shared" {
  vpc_id     = aws_vpc.shared.id
  cidr_block = "10.0.1.0/24"

  tags = {
    Name = "shared-primary-subnet" # ← This too!
  }
}
After deploying this, I pretended to forget it existed, lol (as you do in large organizations).
Data Source #1: Finding the VPC
Now, from a completely separate Terraform configuration, I need to deploy an EC2 instance into that VPC. Here's how I find it:
data "aws_vpc" "shared" {
filter {
name = "tag:Name"
values = ["shared-network-vpc"]
}
}
What this does:
- Queries AWS for a VPC with that specific tag
- Returns the VPC ID, CIDR block, and all other attributes
- Updates every time I run Terraform (always fresh data)
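Side note: if you're just sandboxing and don't have tagged infrastructure yet, the same data source can grab your account's default VPC with no filters at all (a minimal sketch):
data "aws_vpc" "default" {
  # Looks up the account's default VPC in the current region
  default = true
}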
Pro tip: I tested this in terraform console before using it:
terraform console
> data.aws_vpc.shared.id
"vpc-0a1b2c3d4e5f6"
> data.aws_vpc.shared.cidr_block
"10.0.0.0/16"
Seeing real AWS data returned in real time? muah... to my fingers
Data Source #2: Finding the Subnet (Chained!)
Now I need the subnet. But wait—there might be multiple subnets. I can narrow it down by both tag AND VPC:
data "aws_subnet" "shared" {
filter {
name = "tag:Name"
values = ["shared-primary-subnet"]
}
vpc_id = data.aws_vpc.shared.id # ← Using the first data source!
}
What I learned:
- You can chain data sources (use one to refine another)
- The vpc_id parameter narrows the search
- This prevents grabbing the wrong subnet if multiple have similar names
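A related trick from the provider docs: if you need every subnet in the VPC rather than a single one, there's a plural aws_subnets data source (a quick sketch, reusing the VPC lookup from above):
data "aws_subnets" "shared" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }
}

# data.aws_subnets.shared.ids is a list of all matching subnet IDs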
Data Source #3: Latest AMI (The Dynamic One)
Instead of hardcoding an AMI ID (which goes stale), I can dynamically fetch the latest approved AMI:
data "aws_ami" "amazon_linux_2" {
most_recent = true
owners = ["amazon"] # Only official Amazon AMIs
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
Why this is brilliant:
- most_recent = true always grabs the newest matching AMI
- Wildcards (*) allow flexible pattern matching
- Multiple filters ensure you get exactly what you want
- Your instances automatically use the latest AMI on next deploy
No more "Oops, I'm using an AMI from 2019."
Putting It All Together: The Final Resource
Now I use all three data sources to deploy my EC2 instance:
resource "aws_instance" "main" {
ami = data.aws_ami.amazon_linux_2.id # ← Data source
instance_type = "t2.micro"
subnet_id = data.aws_subnet.shared.id # ← Data source
private_ip = "10.0.1.50"
tags = {
Name = "day13-instance"
}
}
When I ran terraform plan:
Plan: 1 to add, 0 to change, 0 to destroy.
Only 1 resource being created! The VPC and subnet aren't in the plan because Terraform isn't managing them. It's just referencing them.
This is the magic. One configuration, seamless integration with existing infrastructure.
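You can verify this after applying, too: terraform state list shows data sources with a data. prefix, clearly separated from the single resource you actually own. The output should look roughly like this:
$ terraform state list
data.aws_ami.amazon_linux_2
data.aws_subnet.shared
data.aws_vpc.shared
aws_instance.main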
The Power Move: Terraform Console Testing
Before you deploy anything, you can test your data sources in the console:
terraform console
# Test VPC data source
> data.aws_vpc.shared.id
"vpc-0a1b2c3d4e5f6"
# Test subnet data source
> data.aws_subnet.shared.cidr_block
"10.0.1.0/24"
# Test AMI data source
> data.aws_ami.amazon_linux_2.name
"amzn2-ami-hvm-2.0.20231218.0-x86_64-gp2"
You can see actual AWS data, verify your filters work, and confirm everything before you deploy.
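Bonus: terraform console also reads expressions from stdin, so you can script these checks without opening an interactive session:
echo 'data.aws_vpc.shared.cidr_block' | terraform console
"10.0.0.0/16"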
Common Data Sources You'll Actually Use
After doing this demo, I researched what data sources are most useful in real projects. Here's my cheat sheet:
Network Resources
# VPC lookup
data "aws_vpc" "main" { ... }
# Subnet lookup
data "aws_subnet" "main" { ... }
# Security Group lookup
data "aws_security_group" "main" { ... }
# All availability zones in current region
data "aws_availability_zones" "available" {
state = "available"
}
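That last one is handier than it looks. A common pattern (sketched here against the shared VPC from the demo) is spreading subnets across every available AZ without hardcoding zone names:
# One subnet per availability zone in the region
resource "aws_subnet" "per_az" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = data.aws_vpc.shared.id
  availability_zone = data.aws_availability_zones.available.names[count.index]
  cidr_block        = cidrsubnet(data.aws_vpc.shared.cidr_block, 8, count.index)
}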
Compute Resources
# Latest AMI (super common)
data "aws_ami" "latest" { ... }
# Existing EC2 instance
data "aws_instance" "main" { ... }
Identity & Metadata
# Current AWS account ID
data "aws_caller_identity" "current" {}
# Current AWS region
data "aws_region" "current" {}
# Example usage:
# "My account is ${data.aws_caller_identity.current.account_id}"
# "I'm deploying to ${data.aws_region.current.name}"
Storage
# S3 Bucket lookup
data "aws_s3_bucket" "main" { ... }
# RDS Instance lookup
data "aws_db_instance" "main" { ... }
Industry Practice Patterns I Discovered
Pattern 1: Tag-Based Discovery
Best practice: Use consistent, unique tags
data "aws_vpc" "main" {
filter {
name = "tag:Environment"
values = ["production"]
}
filter {
name = "tag:ManagedBy"
values = ["networking-team"]
}
}
Why multiple filters? Precision. One filter might match multiple resources. Two or three? Much more specific.
Pattern 2: ID-Based Lookup (When You Have It)
If you know the exact ID, use it:
data "aws_vpc" "main" {
id = "vpc-12345678"
}
Pros: Fast, precise, no ambiguity
Cons: Less flexible, and it breaks if the resource is ever recreated under a new ID
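A middle ground is passing the ID in as a variable so it at least lives in one place (sketch with a made-up variable name):
variable "vpc_id" {
  type        = string
  description = "VPC ID handed over by the networking team"
}

data "aws_vpc" "main" {
  id = var.vpc_id
}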
Pattern 3: Fallback Values
What if the data source doesn't find anything? You'll often see try() suggested as a safety net:
locals {
  vpc_id = try(data.aws_vpc.shared.id, var.default_vpc_id)
}
One caveat: try() only catches expression errors, like a missing attribute. A data source whose filters match nothing fails the whole plan before try() ever runs, so for a genuinely optional lookup you have to guard the data source itself, as shown below.
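Here's the usual count-guard pattern, sketched with hypothetical variables shared_vpc_exists and default_vpc_id:
data "aws_vpc" "shared" {
  # Only attempt the lookup when we expect the VPC to exist
  count = var.shared_vpc_exists ? 1 : 0

  filter {
    name   = "tag:Name"
    values = ["shared-network-vpc"]
  }
}

locals {
  # Fall back to a default when the lookup was skipped
  vpc_id = var.shared_vpc_exists ? data.aws_vpc.shared[0].id : var.default_vpc_id
}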
The Bigger Picture: Why This Matters
After today, I finally understand how Terraform works in an ideal engineering department:
Team A (Networking):
- Manages VPCs, subnets, route tables
- Tags everything consistently
- Provides documentation of available resources
Team B (Security):
- Manages security groups, IAM policies
- Creates "approved" security group templates
- Tags them for easy discovery
Team C (Applications):
- Uses data sources to find Team A's VPCs
- Uses data sources to find Team B's security groups
- Deploys apps without touching core infrastructure
Everyone stays in their lane. No conflicts.
This is collaborative Infrastructure as Code. Not just "my infrastructure," but "our infrastructure, properly managed."
Key Takeaways
Data sources read, resources manage. Know the difference, use them appropriately.
Tag everything meaningfully. Your future self (and your teammates) will thank you.
Test in terraform console. Verify your data sources work before deploying.
Use multiple filters for precision. One tag might match multiple resources; two or three gets specific.
Data sources query every run. They always reflect current AWS state, unlike hardcoded IDs.
Chain data sources. Use one to refine another for precise targeting.
Always fetch latest AMIs dynamically. Never hardcode AMI IDs; they go stale.
What's Next?
Tomorrow (Day 14), I'll be building a mini project to test the waters with my Terraform knowledge.
The pieces are coming together. Functions make your code intelligent. Data sources connect you to existing infrastructure.
This is getting really good.
See yaa! (And seriously, go use terraform console to explore your AWS account. It's like having X-ray vision for your infrastructure.)