Darian Vance

Originally published at wp.me

Solved: DriftHound: an open-source tool to detect & notify infrastructure drift (early stage, Looking for feedback!)

🚀 Executive Summary

TL;DR: Infrastructure drift, characterized by unauthorized changes outside Infrastructure as Code, silently undermines cloud and on-premises environments, leading to inconsistencies and security risks. DriftHound is an emerging open-source tool designed to detect and notify these discrepancies, offering a flexible, cost-effective solution for maintaining desired infrastructure states.

🎯 Key Takeaways

  • Infrastructure drift manifests as unauthorized or unintended changes outside IaC, causing symptoms like inconsistent environments, failed deployments, and critical security vulnerabilities.
  • Solutions for drift detection range from resource-intensive manual audits and custom scripting to commercial and managed tools (e.g., Terraform Cloud, AWS Config) with built-in capabilities.
  • DriftHound is an early-stage, dedicated open-source tool aiming to provide vendor-agnostic, cost-effective drift detection through configurable rules, scheduled scans, and flexible notification channels for environments like AWS and Kubernetes.

Infrastructure drift can silently undermine your meticulously crafted cloud and on-premises environments. Learn to detect, prevent, and manage configuration discrepancies with robust tools, including emerging open-source solutions like DriftHound, to maintain stability and security.

Understanding and Battling Infrastructure Drift

In the dynamic world of modern IT infrastructure, maintaining a consistent and desired state across all your environments is a constant challenge. Infrastructure drift, the silent assailant, refers to the unauthorized or unintended changes that occur outside of your defined Infrastructure as Code (IaC) or configuration management processes. It can manifest in subtle ways but leads to significant headaches for DevOps teams. Let’s delve into the symptoms, explore various solutions, and highlight how specialized tools can help.

Symptoms of Infrastructure Drift

Detecting infrastructure drift often begins with experiencing its negative consequences. Here are common symptoms that indicate your environments might be diverging from their desired state:

  • Inconsistent Environments: Development, staging, and production environments behave differently despite being deployed from the same IaC. This leads to “works on my machine” syndrome and unexpected bugs.
  • Failed Deployments: New deployments fail or encounter errors because the underlying infrastructure has changed, so the IaC no longer applies cleanly to what actually exists.
  • Security Vulnerabilities: Ad-hoc changes, such as opening a port for a quick fix or misconfiguring a security group, can introduce critical security gaps that bypass review processes.
  • Performance Degradation: Unintended configuration changes to resources like database settings, cache policies, or network rules can silently degrade application performance.
  • Debugging Nightmares: Troubleshooting issues becomes a protracted and frustrating process when the actual state of resources doesn’t match the expected state defined in your IaC.
  • Compliance Violations: In regulated industries, drift can lead to non-compliance with organizational policies or external regulations, resulting in audits and penalties.

Ignoring these symptoms can erode trust in your automation, increase operational overhead, and introduce significant risks.

Solution 1: Manual Audits and Custom Scripting

For organizations in the early stages of their IaC journey or dealing with legacy systems, manual audits combined with custom scripting are often the first line of defense against drift. While resource-intensive and prone to human error, they offer a basic level of detection.

Approach

This approach involves:

  • Regular Manual Checks: Periodically reviewing console configurations for critical services.
  • Cloud Provider CLIs: Using command-line interfaces (CLIs) like AWS CLI, Azure CLI, or gcloud CLI to fetch current resource configurations and compare them against documented baselines or IaC files.
  • API Calls & SDKs: Writing scripts that interact directly with cloud provider APIs to gather configuration data.
  • Configuration File Diffs: Using tools like diff to compare current configuration files on servers with source-controlled versions.

Examples

AWS EC2 Security Group Check

Let’s say you define a security group in your Terraform configuration, and you want to confirm that no unauthorized ingress rules have been added manually:

# Get the expected state from your Terraform configuration
# (e.g., in security_groups.tf)
# resource "aws_security_group" "web_sg" {
#   name        = "web-server-sg"
#   description = "Allow HTTP and SSH"
#   ingress {
#     from_port   = 80
#     to_port     = 80
#     protocol    = "tcp"
#     cidr_blocks = ["0.0.0.0/0"]
#   }
#   ingress {
#     from_port   = 22
#     to_port     = 22
#     protocol    = "tcp"
#     cidr_blocks = ["X.X.X.X/32"] # Your bastion IP
#   }
#   egress {
#     from_port   = 0
#     to_port     = 0
#     protocol    = "-1"
#     cidr_blocks = ["0.0.0.0/0"]
#   }
# }

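# Now, fetch the actual state using AWS CLI
# (--group-names only matches groups in the default VPC, so filter by name instead)
aws ec2 describe-security-groups \
    --filters Name=group-name,Values=web-server-sg \
    --query 'SecurityGroups[0].IpPermissions' \
    --output json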

You would then manually compare the JSON output with your Terraform definition.
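
The manual comparison can be scripted. The following is a minimal sketch, assuming you transcribe the expected rules from the Terraform block by hand and feed it the JSON printed by the CLI call above. The function names and the 203.0.113.10/32 bastion address are placeholders introduced for illustration, and only IPv4 CIDR rules are considered.

```python
import json

# Expected ingress rules, transcribed by hand from the Terraform block above.
# 203.0.113.10/32 is a placeholder for the bastion address (hypothetical).
EXPECTED_INGRESS = {
    ("tcp", 80, 80, "0.0.0.0/0"),
    ("tcp", 22, 22, "203.0.113.10/32"),
}


def flatten_ingress(ip_permissions):
    """Flatten IpPermissions into (protocol, from_port, to_port, cidr) tuples."""
    rules = set()
    for perm in ip_permissions:
        for ip_range in perm.get("IpRanges", []):
            rules.add((
                perm["IpProtocol"],
                perm.get("FromPort", -1),  # ports may be absent for protocol "-1"
                perm.get("ToPort", -1),
                ip_range["CidrIp"],
            ))
    return rules


def ingress_drift(expected, ip_permissions):
    """Return rules found but not expected, and rules expected but not found."""
    actual = flatten_ingress(ip_permissions)
    return {
        "unexpected": sorted(actual - expected),
        "missing": sorted(expected - actual),
    }


# Sample CLI output with one rule (RDP open to the world) added by hand.
sample_output = json.loads("""
[
  {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
   "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
  {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
   "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},
  {"IpProtocol": "tcp", "FromPort": 3389, "ToPort": 3389,
   "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
]
""")

drift = ingress_drift(EXPECTED_INGRESS, sample_output)
print(drift)
```

Comparing sets rather than raw JSON keeps the check independent of the order in which AWS returns the rules.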

Kubernetes Deployment Replica Count Check

To ensure a Kubernetes deployment hasn’t been scaled manually outside of GitOps, you could check the current replicas:

# Get the expected state from your Kubernetes YAML
# (e.g., in deployment.yaml)
# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: my-app
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: my-app
#   template:
#     metadata:
#       labels:
#         app: my-app
#     spec:
#       containers:
#       - name: my-app
#         image: my-app:v1.0.0

# Now, fetch the actual state using kubectl
kubectl get deployment my-app -o=jsonpath='{.spec.replicas}'

If the output is ‘5’ but your YAML specifies ‘3’, you have drift.
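
The same check could be wrapped in a small script. This is a rough sketch, assuming kubectl is on the PATH and already pointed at the right cluster; live_replicas and replica_drift are names made up for this example, and the expected count is passed in by hand rather than parsed from the YAML.

```python
import subprocess


def live_replicas(deployment, namespace="default"):
    """Ask kubectl for the deployment's current spec.replicas."""
    result = subprocess.run(
        ["kubectl", "get", "deployment", deployment, "-n", namespace,
         "-o", "jsonpath={.spec.replicas}"],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())


def replica_drift(expected, actual):
    """Return a message describing the drift, or None when the counts match."""
    if expected == actual:
        return None
    return f"replicas drifted: expected {expected}, found {actual}"


# With a cluster available you might call:
#   replica_drift(3, live_replicas("my-app"))
print(replica_drift(3, 5))
```

If a HorizontalPodAutoscaler manages the deployment, spec.replicas changes legitimately, so a check like this would need to skip or special-case those workloads.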

Limitations

  • Scale Challenges: Becomes impractical with a large number of resources or complex configurations.
  • Human Error: Manual comparisons are prone to oversight.
  • No Real-Time Detection: Audits are snapshots; drift can occur, and even be reverted, between checks.
  • Maintenance Overhead: Scripts require continuous updates as infrastructure evolves.

Solution 2: Commercial Configuration Management & IaC Tools with Drift Detection

As organizations mature, they adopt robust IaC and configuration management tools that often include built-in drift detection or prevention capabilities. These tools aim to ensure the actual state converges with the desired state defined in code.

Approach

These tools tackle drift in several ways:

  • State Comparison: Many IaC tools maintain a state file (e.g., Terraform state) and can compare it against the live infrastructure.
  • Desired State Enforcement: Configuration management tools (e.g., Puppet, Ansible) re-apply a desired state on each run, reverting unauthorized changes to the resources they manage.
  • Policy-as-Code: Cloud providers offer services (e.g., AWS Config, Azure Policy) that continuously monitor resource configurations against defined policies.
  • Deployment-time Checks: Tools can simulate changes (e.g., terraform plan) to highlight deviations before applying new configurations.

Examples

Terraform Cloud/Enterprise Drift Detection

Terraform Cloud and Terraform Enterprise offer automated drift detection. A local terraform plan surfaces drift only when someone runs it, whereas the hosted offerings can scan your infrastructure proactively.

# Example: Using 'terraform plan' locally to identify potential drift.
# Terraform refreshes its state from the live infrastructure, then compares
# the result with your configuration.
terraform plan -out=tfplan.out

# To report only changes made outside Terraform, without proposing
# configuration changes:
terraform plan -refresh-only

# Illustrative, abridged output; exact wording varies by Terraform version.
# ...
# aws_instance.web_server: deleted outside Terraform, so the plan proposes
# to create it again.
#
# aws_security_group.app_sg:
#   ~ ingress {
#         cidr_blocks = [
#             "0.0.0.0/0",
#           - "1.2.3.4/32",  # drift: present in the cloud, absent from the code
#         ]
#         ...
#     }
# ...

Terraform Cloud’s drift detection features would periodically run these checks and notify you without a manual trigger.
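
For teams without the hosted feature, a scheduled job could approximate this by reading the plan in machine-readable form. The sketch below assumes the plan was saved with terraform plan -out=tfplan.out and rendered with terraform show -json tfplan.out; the changed_addresses helper and the sample data are invented for illustration.

```python
def changed_addresses(plan, key="resource_changes"):
    """Return {address: actions} for entries whose actions are not a pure no-op."""
    changed = {}
    for entry in plan.get(key, []):
        actions = entry.get("change", {}).get("actions", [])
        if actions and actions != ["no-op"]:
            changed[entry["address"]] = actions
    return changed


# Trimmed-down stand-in for the JSON printed by 'terraform show -json tfplan.out'.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web_server",
         "change": {"actions": ["no-op"]}},
        {"address": "aws_security_group.app_sg",
         "change": {"actions": ["update"]}},
    ]
}

print(changed_addresses(sample_plan))
```

If your Terraform version also emits a resource_drift list in that JSON, passing key="resource_drift" should list only the objects that changed outside Terraform. For a simpler yes/no signal, terraform plan -detailed-exitcode exits with 2 when the plan is non-empty.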

AWS Config Rules

AWS Config allows you to define rules that continuously monitor your AWS resources for compliance with desired configurations.

# Example: AWS Config managed rule to ensure S3 buckets do not allow public read access
# Rule Name: s3-bucket-public-read-prohibited
# Trigger Type: Configuration changes and periodic
# Scope of resources: AWS S3 Bucket
# Parameters: N/A
# Action: Detects non-compliant buckets, provides remediation details.

# To deploy such a rule using AWS CLI (simplified).
# put-config-rule takes the whole rule definition as one JSON document:
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-read-prohibited",
  "Description": "Checks if your Amazon S3 buckets allow public read access.",
  "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
  "Source": {"Owner": "AWS", "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED"},
  "MaximumExecutionFrequency": "TwentyFour_Hours"
}'

# View non-compliant resources:
aws configservice get-compliance-details-by-config-rule \
    --config-rule-name s3-bucket-public-read-prohibited \
    --compliance-types NON_COMPLIANT

This automatically flags resources that deviate from the rule.
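
To feed those results into your own alerting, the JSON returned by get-compliance-details-by-config-rule can be reduced to a list of offending resources. This is a small sketch with a hand-written sample response; noncompliant_resources is a hypothetical helper name, and the bucket names are made up.

```python
def noncompliant_resources(response):
    """Return (resource_type, resource_id) pairs marked NON_COMPLIANT."""
    found = []
    for result in response.get("EvaluationResults", []):
        if result.get("ComplianceType") != "NON_COMPLIANT":
            continue
        qualifier = result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
        found.append((qualifier["ResourceType"], qualifier["ResourceId"]))
    return found


# Hand-written sample shaped like the CLI's JSON output (bucket names are made up).
sample_response = {
    "EvaluationResults": [
        {"EvaluationResultIdentifier": {"EvaluationResultQualifier": {
            "ConfigRuleName": "s3-bucket-public-read-prohibited",
            "ResourceType": "AWS::S3::Bucket",
            "ResourceId": "my-public-s3-bucket"}},
         "ComplianceType": "NON_COMPLIANT"},
        {"EvaluationResultIdentifier": {"EvaluationResultQualifier": {
            "ConfigRuleName": "s3-bucket-public-read-prohibited",
            "ResourceType": "AWS::S3::Bucket",
            "ResourceId": "my-private-s3-bucket"}},
         "ComplianceType": "COMPLIANT"},
    ]
}

print(noncompliant_resources(sample_response))
```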

Comparison: Manual Scripting vs. Commercial/Managed Tools

| Feature | Manual Audits / Custom Scripting | Commercial IaC / CM Tools (e.g., Terraform Cloud, AWS Config) |
| --- | --- | --- |
| Automation Level | Low (requires manual execution or cron jobs) | High (automated scans, continuous monitoring) |
| Scale & Complexity | Poor for large, complex infrastructures | Excellent, designed for enterprise scale |
| Detection Frequency | As frequent as you run scripts; can miss transient drift | Near real-time or configurable intervals |
| Notification | Manual review of script output; custom integration needed | Built-in alerts, dashboards, integration with ITSM/messaging |
| Remediation | Manual or custom scripts to revert changes | Automated (CM tools), guided (IaC tools), or policy-driven |
| Cost | Low direct cost (time & labor are primary costs) | Subscription fees, potentially higher cloud resource costs |
| Learning Curve | Depends on scripting proficiency; can be steep for complex logic | Initial setup; often user-friendly dashboards & DSLs |

Solution 3: Dedicated Open-Source Drift Detection Tools (Introducing DriftHound)

While commercial tools offer powerful solutions, there’s a growing need for flexible, vendor-agnostic, and cost-effective open-source alternatives. This is where dedicated open-source drift detection tools come into play, offering a specialized approach to identifying and reporting infrastructure discrepancies. DriftHound, as highlighted in the Reddit thread, represents an exciting development in this space.

Approach

Dedicated open-source tools typically focus on:

  • Scanning & Discovery: Connecting to various cloud providers, Kubernetes clusters, or even on-premises APIs to discover the current state of resources.
  • State Comparison: Comparing the discovered actual state against a defined desired state (e.g., from Git repositories, IaC files, or previous scans).
  • Granular Reporting: Providing detailed reports on what has changed, identifying specific attributes that have drifted.
  • Flexible Notifications: Integrating with common notification channels (Slack, PagerDuty, email) to alert teams about detected drift.
  • Extensibility: Often designed with plugin architectures to support new providers or custom comparison logic.
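
The state comparison and granular reporting steps above boil down to a recursive diff over two nested structures. The following is a generic sketch of that idea, not DriftHound's actual implementation; diff_state and the MISSING marker are names chosen for this example.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def diff_state(desired, actual, path=""):
    """Yield (field_path, desired_value, actual_value) for every difference."""
    if isinstance(desired, dict) and isinstance(actual, dict):
        for key in sorted(set(desired) | set(actual)):
            child = f"{path}.{key}" if path else str(key)
            yield from diff_state(
                desired.get(key, MISSING), actual.get(key, MISSING), child
            )
    elif desired != actual:
        yield (path, desired, actual)


desired_state = {"spec": {"replicas": 3, "paused": False}}
actual_state = {
    "spec": {"replicas": 5, "paused": False},
    "metadata": {"labels": {"hotfix": "true"}},
}

for field, want, got in diff_state(desired_state, actual_state):
    print(f"{field}: expected {want!r}, found {got!r}")
```

Reporting a dotted path per drifted attribute, rather than a whole-resource mismatch, is what makes the resulting alert actionable.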

Introducing DriftHound (Early Stage Example)

DriftHound, as described in its early stages, aims to be one such tool. While specific commands and configurations are still evolving, we can infer its intended functionality based on the problem it seeks to solve: detecting and notifying infrastructure drift. A hypothetical workflow for such a tool might look like this:

Hypothetical DriftHound Configuration (Illustrative)

Imagine configuring DriftHound to monitor your AWS environment for deviations from your Terraform plans, and send notifications to Slack.

# drifthound.yaml (Hypothetical configuration)
---
version: "1"
providers:
  aws:
    region: us-east-1
    # Assuming DriftHound can infer desired state from a local Terraform project
    terraform_state_path: "./infrastructure/aws/prod/"
    # Or, perhaps connect to a Terraform Cloud workspace
    # terraform_cloud_workspace: "my-org/prod-infra"
  kubernetes:
    kubeconfig_path: "~/.kube/config"
    # Assuming it compares against Kustomize or Helm definitions
    git_repo_url: "https://github.com/my-org/kubernetes-configs.git"
    git_branch: "main"
    path: "prod-cluster"

notifications:
  slack:
    webhook_url_secret: "SLACK_WEBHOOK_URL" # Environment variable
    channel: "#devops-alerts"
  email:
    smtp_server: "smtp.example.com"
    from: "drifthound@example.com"
    to: ["devops@example.com"]

schedule:
  # Run a scan every 6 hours
  interval_minutes: 360

rules:
  - name: "S3 Bucket Public Access"
    provider: "aws"
    resource_type: "AWS::S3::Bucket"
    field: "PublicAccessBlockConfiguration"
    # Ensure PublicAccessBlock is always enabled
    expected_value:
      BlockPublicAcls: true
      IgnorePublicAcls: true
      BlockPublicPolicy: true
      RestrictPublicBuckets: true
  - name: "Kubernetes Deployment Replicas"
    provider: "kubernetes"
    resource_type: "Deployment"
    field: "spec.replicas"
    # Ensure replicas match desired state in Git
    # (This implies a comparison against a source-of-truth, e.g., git repo)
    compare_to_git: true
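
To make the rules section concrete, here is one way a rule with a field and an expected_value could be evaluated against a discovered resource. Everything in this sketch is an assumption layered on the hypothetical config above: get_field, evaluate_rule, and the shape of the resource dictionary are invented for illustration and do not describe DriftHound's real behavior.

```python
def get_field(resource, dotted_path):
    """Walk a nested dict along a dotted path; return None if a step is absent."""
    current = resource
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def evaluate_rule(rule, resource):
    """Return a finding dict when the field deviates from expected_value, else None."""
    found = get_field(resource, rule["field"])
    if found == rule["expected_value"]:
        return None
    return {
        "rule": rule["name"],
        "field": rule["field"],
        "expected": rule["expected_value"],
        "found": found,
    }


# The first rule from the hypothetical drifthound.yaml, as a Python dict.
s3_rule = {
    "name": "S3 Bucket Public Access",
    "field": "PublicAccessBlockConfiguration",
    "expected_value": {
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
}

# A made-up discovered bucket where one setting was switched off by hand.
bucket = {"PublicAccessBlockConfiguration": {
    "BlockPublicAcls": False, "IgnorePublicAcls": True,
    "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
}}

finding = evaluate_rule(s3_rule, bucket)
print(finding)
```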

Hypothetical DriftHound Command Line Usage

After configuration, you might run DriftHound manually or schedule it in a CI/CD pipeline or as a cron job:

# Initialize/setup (e.g., install plugins for providers)
drifthound init

# Run a one-off scan
drifthound scan --config drifthound.yaml

# Output to console, and send notifications if drift is detected
# Example CLI output for detected drift:
# [INFO] Starting drift scan...
# [INFO] Provider 'aws': Scanning 120 resources.
# [WARNING] Drift detected for AWS::S3::Bucket 'my-public-s3-bucket'
#           Rule: S3 Bucket Public Access
#           Field: PublicAccessBlockConfiguration.BlockPublicAcls
#           Expected: true, Found: false
# [WARNING] Drift detected for Kubernetes::Deployment 'webapp-prod'
#           Rule: Kubernetes Deployment Replicas
#           Field: spec.replicas
#           Expected: 3, Found: 5
# [INFO] Notifying Slack channel #devops-alerts...
# [INFO] Scan complete. 2 drifts found.
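
The notification step in that output could be as simple as a POST to a Slack incoming webhook. Below is a minimal sketch that assumes the webhook URL lives in the SLACK_WEBHOOK_URL environment variable named in the config; the finding fields and both function names are hypothetical. Incoming webhooks post to the channel configured for them, so no channel is set here.

```python
import json
import os
import urllib.request


def build_slack_payload(findings):
    """Build the JSON body for a Slack incoming webhook from a list of findings."""
    lines = [f"Drift scan found {len(findings)} issue(s):"]
    for finding in findings:
        lines.append(
            f"- {finding['resource']}: {finding['field']} "
            f"expected {finding['expected']}, found {finding['found']}"
        )
    return {"text": "\n".join(lines)}


def notify_slack(findings, webhook_url=None):
    """POST the findings to Slack; the URL defaults to $SLACK_WEBHOOK_URL."""
    url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    body = json.dumps(build_slack_payload(findings)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


sample_findings = [
    {"resource": "webapp-prod", "field": "spec.replicas", "expected": 3, "found": 5},
]
payload = build_slack_payload(sample_findings)
print(payload["text"])
```

Keeping payload construction separate from the network call makes the message format easy to test without sending anything.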

Benefits of Dedicated Open-Source Tools

  • Cost-Effective: No licensing fees, making them accessible for all budget levels.
  • Flexibility & Customization: Open source allows teams to modify, extend, and adapt the tool to their specific needs and environments.
  • Vendor Agnostic: Often designed to work across multiple cloud providers and infrastructure types without vendor lock-in.
  • Community Driven: Benefits from community contributions, rapid iteration, and shared knowledge for bug fixes and feature development.
  • Auditability: The transparent codebase allows for security reviews and understanding of how drift detection is performed.

Considerations for Early-Stage Tools like DriftHound

When adopting an early-stage open-source tool, consider:

  • Maturity: Features might be nascent, and stability could be lower than established commercial tools.
  • Community Support: Rely on the community for troubleshooting and feature requests.
  • Documentation: May be less comprehensive in early stages.
  • Contribution: Early adoption often means an opportunity to shape the tool’s future by providing feedback and contributing code.

The emergence of tools like DriftHound signifies a growing recognition that robust, accessible drift detection is crucial for maintaining healthy, secure, and compliant infrastructure. By actively participating in their development and providing feedback, the DevOps community can help shape powerful new solutions for this persistent challenge.



👉 Read the original article on TechResolve.blog
