Heinan Cabouly

Posted on Jun 7

Bash Secrets I Learned From 10 Years of Production Hell

#linux #bash #devops #programming

Three months ago, a single bash script I wrote processed 50,000 server deployments without a single failure. Two years ago, my scripts were breaking production every other week. Here's what changed.

I've been writing bash professionally since 2014, and I'll be honest - for the first few years, my scripts were garbage. They worked on my machine, failed mysteriously in production, and debugging them was like performing surgery with a sledgehammer.

But after countless 3 AM wake-up calls, failed deployments, and that one incident where I accidentally deleted half our staging environment (oops), I finally learned how to write bash scripts that don't suck.

These aren't the techniques you'll find in basic tutorials. These are the hard-earned lessons from production environments where failure isn't just embarrassing - it costs money.

🔒 The Security Stuff Nobody Talks About

Let's start with the elephant in the room. Most bash scripts are security nightmares waiting to happen.

Input Sanitization That Actually Works

Everyone tells you to "sanitize your inputs" but nobody shows you how. Here's what I learned after our security team tore apart my scripts:

# This is what I used to write (don't do this)
username="$1"
mysql -u root -p"$password" -e "SELECT * FROM users WHERE name='$username'"

# This is what I write now
sanitize_input() {
    local input="$1"
    local max_length="${2:-50}"

    # Remove anything that isn't alphanumeric, underscore, or dash
    input=$(echo "$input" | tr -cd '[:alnum:]_-')

    # Truncate to max length
    input="${input:0:$max_length}"

    # Ensure it's not empty after sanitization
    [[ -n "$input" ]] || {
        echo "Invalid input: contains illegal characters" >&2
        return 1
    }

    echo "$input"
}

# Safe usage
username=$(sanitize_input "$1") || exit 1
mysql -u root -p"$password" -e "SELECT * FROM users WHERE name='${username}'"

Environment Variable Validation

After getting burned by missing environment variables in production:

# Validate required environment variables at script start
validate_environment() {
    local required_vars=(
        "DATABASE_URL"
        "API_KEY" 
        "DEPLOY_ENV"
        "LOG_LEVEL"
    )

    local missing_vars=()

    for var in "${required_vars[@]}"; do
        if [[ -z "${!var}" ]]; then
            missing_vars+=("$var")
        fi
    done

    if [[ ${#missing_vars[@]} -gt 0 ]]; then
        echo "ERROR: Missing required environment variables:" >&2
        printf "  %s\n" "${missing_vars[@]}" >&2
        echo "Set these variables and try again." >&2
        exit 1
    fi
}

# Call this at the beginning of every production script
validate_environment

Secrets Management

Stop putting passwords in your scripts. Seriously.

# Bad: hardcoded secrets
DB_PASSWORD="super_secret_password"

# Better: environment variables  
DB_PASSWORD="${DB_PASSWORD:?DB_PASSWORD must be set}"

# Best: external secret management
get_secret() {
    local secret_name="$1"
    local secret_file="/run/secrets/$secret_name"

    if [[ -f "$secret_file" ]]; then
        cat "$secret_file"
    elif command -v aws >/dev/null && [[ -n "$AWS_SECRET_ARN" ]]; then
        aws secretsmanager get-secret-value \
            --secret-id "$AWS_SECRET_ARN" \
            --query 'SecretString' \
            --output text
    else
        echo "ERROR: Cannot retrieve secret '$secret_name'" >&2
        return 1
    fi
}

# Usage
DB_PASSWORD=$(get_secret "database_password") || exit 1

🧪 Testing Bash Scripts (Yes, Really)

This might sound crazy, but I write tests for my bash scripts now. It started after a "simple" deployment script wiped our entire customer database.

Unit Testing Functions

# test_helpers.sh - My testing framework
run_test() {
    local test_name="$1"
    local test_function="$2"

    echo -n "Testing $test_name... "

    if $test_function; then
        echo "✅ PASS"
        ((TESTS_PASSED++))
    else
        echo "❌ FAIL"
        ((TESTS_FAILED++))
    fi
}

assert_equals() {
    local expected="$1"
    local actual="$2"
    local message="${3:-Values don't match}"

    if [[ "$expected" == "$actual" ]]; then
        return 0
    else
        echo "  Expected: '$expected'"
        echo "  Actual: '$actual'"
        echo "  Message: $message"
        return 1
    fi
}

assert_contains() {
    local haystack="$1"
    local needle="$2"

    if [[ "$haystack" == *"$needle"* ]]; then
        return 0
    else
        echo "  '$haystack' does not contain '$needle'"
        return 1
    fi
}

Testing Real Functions

# Example function to test
parse_config_value() {
    local config_file="$1"
    local key="$2"

    grep "^$key=" "$config_file" | cut -d'=' -f2
}

# The test
test_parse_config_value() {
    # Setup
    local test_config=$(mktemp)
    echo "database_host=localhost" > "$test_config"
    echo "database_port=5432" >> "$test_config"

    # Test
    local result
    result=$(parse_config_value "$test_config" "database_host")

    # Cleanup
    rm -f "$test_config"

    # Assert
    assert_equals "localhost" "$result"
}

# Run the test
run_test "parse_config_value extracts correct value" test_parse_config_value

Integration Testing

# Test the entire deployment pipeline
test_deployment_pipeline() {
    local test_env="test_$(date +%s)"

    # Create isolated test environment
    setup_test_environment "$test_env" || return 1

    # Run deployment
    DEPLOY_ENV="$test_env" ./deploy.sh || {
        cleanup_test_environment "$test_env"
        return 1
    }

    # Verify deployment worked
    if ! verify_deployment "$test_env"; then
        cleanup_test_environment "$test_env"
        return 1
    fi

    # Cleanup
    cleanup_test_environment "$test_env"
    return 0
}

🚨 Error Handling That Saves Your Job

The difference between a junior and senior bash scripter isn't what they know - it's how they handle things going wrong.

Structured Error Handling

# Error handling that actually helps you debug
handle_error() {
    local exit_code=$?
    local line_number=$1
    local command="$2"
    local function_name="${FUNCNAME[2]}"

    {
        echo "==============================================="
        echo "SCRIPT FAILURE REPORT"
        echo "==============================================="
        echo "Script: $0"
        echo "Function: ${function_name:-main}"
        echo "Line: $line_number"
        echo "Command: $command"
        echo "Exit Code: $exit_code"
        echo "Time: $(date)"
        echo "User: $(whoami)"
        echo "Working Directory: $(pwd)"
        echo "Environment: ${DEPLOY_ENV:-unknown}"
        echo ""
        echo "Recent Commands:"
        history | tail -5
        echo ""
        echo "System Info:"
        uname -a
        echo "Disk Space:"
        df -h
        echo "Memory:"
        free -h
        echo "==============================================="
    } >&2

    # Send to monitoring system
    send_alert "Script Failure" "Script $0 failed at line $line_number"

    exit $exit_code
}

# Set up error trapping
set -eE
trap 'handle_error $LINENO "$BASH_COMMAND"' ERR

Retry Logic for Flaky Operations

Network calls, API requests, file operations - they all fail sometimes. Here's how I handle it:

retry_with_backoff() {
    local max_attempts="$1"
    local base_delay="$2"
    local max_delay="$3"
    shift 3

    local attempt=1
    local delay="$base_delay"

    while [[ $attempt -le $max_attempts ]]; do
        echo "Attempt $attempt/$max_attempts: $*" >&2

        if "$@"; then
            echo "Success on attempt $attempt" >&2
            return 0
        fi

        if [[ $attempt -eq $max_attempts ]]; then
            echo "All $max_attempts attempts failed" >&2
            return 1
        fi

        echo "Failed, waiting $delay seconds before retry..." >&2
        sleep "$delay"

        # Exponential backoff with jitter
        delay=$((delay * 2))
        [[ $delay -gt $max_delay ]] && delay=$max_delay

        # Add random jitter (±20%)
        local jitter=$((delay / 5))
        delay=$((delay + (RANDOM % (jitter * 2)) - jitter))

        ((attempt++))
    done
}

# Usage
retry_with_backoff 5 2 30 curl -f "https://api.example.com/health"
retry_with_backoff 3 1 10 docker pull "$IMAGE_NAME"

Circuit Breaker Pattern

When external services are down, stop hammering them:

declare -A circuit_breakers

is_circuit_open() {
    local service="$1"
    local failure_threshold="${2:-5}"
    local reset_timeout="${3:-300}"  # 5 minutes

    local cb_data="${circuit_breakers[$service]}"
    [[ -z "$cb_data" ]] && return 1  # Circuit doesn't exist, it's closed

    local failures=$(echo "$cb_data" | cut -d: -f1)
    local last_failure=$(echo "$cb_data" | cut -d: -f2)
    local current_time=$(date +%s)

    # If enough time has passed, reset the circuit
    if [[ $((current_time - last_failure)) -gt $reset_timeout ]]; then
        unset circuit_breakers["$service"]
        return 1  # Circuit reset, it's closed
    fi

    # Check if we've exceeded failure threshold
    [[ $failures -ge $failure_threshold ]]
}

record_failure() {
    local service="$1"
    local cb_data="${circuit_breakers[$service]}"

    if [[ -z "$cb_data" ]]; then
        circuit_breakers["$service"]="1:$(date +%s)"
    else
        local failures=$(echo "$cb_data" | cut -d: -f1)
        circuit_breakers["$service"]="$((failures + 1)):$(date +%s)"
    fi
}

call_with_circuit_breaker() {
    local service="$1"
    shift

    if is_circuit_open "$service"; then
        echo "Circuit breaker open for $service, skipping call" >&2
        return 1
    fi

    if "$@"; then
        return 0
    else
        record_failure "$service"
        return 1
    fi
}

# Usage
call_with_circuit_breaker "payment_api" curl -f "https://payment.api.com/process"

📊 Monitoring and Observability

You can't fix what you can't see. Here's how I make my scripts observable:

Structured Logging with Context

# Logging that actually helps during incidents
declare -A log_context

add_log_context() {
    local key="$1"
    local value="$2"
    log_context["$key"]="$value"
}

structured_log() {
    local level="$1"
    local message="$2"
    local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # Build context string
    local context_str=""
    for key in "${!log_context[@]}"; do
        context_str+="$key=${log_context[$key]} "
    done

    # Output structured log
    echo "{\"timestamp\":\"$timestamp\",\"level\":\"$level\",\"message\":\"$message\",\"script\":\"$0\",\"pid\":$$,\"context\":{$context_str}}" >&2
}

info() { structured_log "INFO" "$1"; }
warn() { structured_log "WARN" "$1"; }
error() { structured_log "ERROR" "$1"; }

# Usage
add_log_context "deployment_id" "$DEPLOY_ID"
add_log_context "environment" "$DEPLOY_ENV"
info "Starting deployment process"

Metrics Collection

# Simple metrics that save your sanity
send_metric() {
    local metric_name="$1"
    local value="$2"
    local tags="$3"

    # Send to your monitoring system
    # This example uses StatsD, adapt for your system
    if command -v nc >/dev/null; then
        echo "${metric_name}:${value}|c|#${tags}" | nc -u -w1 localhost 8125
    fi

    # Also log for debugging
    echo "METRIC: $metric_name=$value tags=$tags" >&2
}

time_operation() {
    local operation_name="$1"
    shift

    local start_time=$(date +%s.%N)

    if "$@"; then
        local duration=$(echo "$(date +%s.%N) - $start_time" | bc)
        send_metric "operation.duration" "$duration" "operation=$operation_name,status=success"
        return 0
    else
        local exit_code=$?
        local duration=$(echo "$(date +%s.%N) - $start_time" | bc)
        send_metric "operation.duration" "$duration" "operation=$operation_name,status=failure"
        return $exit_code
    fi
}

# Usage
send_metric "deployment.started" "1" "environment=$DEPLOY_ENV"
time_operation "database_migration" run_migration
send_metric "deployment.completed" "1" "environment=$DEPLOY_ENV"

🏗️ Production-Ready Script Template

After years of trial and error, here's my template for any production script:

#!/bin/bash
# Production Script Template
# Description: What this script does
# Author: Your Name
# Version: 1.0
# Last Modified: $(date)

set -eEo pipefail
IFS=$'\n\t'

# Script metadata
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_PID=$$

# Load configuration
readonly CONFIG_FILE="${SCRIPT_DIR}/config.env"
[[ -f "$CONFIG_FILE" ]] && source "$CONFIG_FILE"

# Logging setup
exec 3> >(logger -t "$SCRIPT_NAME")
readonly LOG_FD=3

log() {
    local level="$1"
    shift
    echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] [PID:$SCRIPT_PID] $*" >&$LOG_FD
    [[ "$level" == "ERROR" ]] && echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] $*" >&2
}

info() { log "INFO" "$@"; }
warn() { log "WARN" "$@"; }
error() { log "ERROR" "$@"; }

# Error handling
cleanup() {
    local exit_code=$?
    info "Script cleanup started"

    # Your cleanup logic here
    rm -f "${temp_files[@]}" 2>/dev/null || true

    info "Script exiting with code $exit_code"
    exit $exit_code
}

handle_error() {
    local exit_code=$?
    local line_number=$1

    error "Script failed at line $line_number with exit code $exit_code"
    error "Command: $BASH_COMMAND"
    error "Function: ${FUNCNAME[2]:-main}"

    # Send alert to monitoring system
    send_alert "Script Failure" "Script $SCRIPT_NAME failed"

    exit $exit_code
}

trap cleanup EXIT
trap 'handle_error $LINENO' ERR

# Validation
validate_prerequisites() {
    info "Validating prerequisites"

    # Check required commands
    local required_commands=("curl" "jq" "docker")
    for cmd in "${required_commands[@]}"; do
        if ! command -v "$cmd" >/dev/null; then
            error "Required command not found: $cmd"
            exit 1
        fi
    done

    # Check required environment variables
    local required_vars=("DEPLOY_ENV" "API_KEY")
    for var in "${required_vars[@]}"; do
        if [[ -z "${!var}" ]]; then
            error "Required environment variable not set: $var"
            exit 1
        fi
    done

    info "All prerequisites validated"
}

# Main function
main() {
    local action="${1:-deploy}"

    info "Starting $SCRIPT_NAME with action: $action"
    add_log_context "action" "$action"

    validate_prerequisites

    case "$action" in
        "deploy")
            deploy_application
            ;;
        "rollback")
            rollback_application
            ;;
        "health")
            check_health
            ;;
        *)
            error "Unknown action: $action"
            echo "Usage: $0 {deploy|rollback|health}" >&2
            exit 1
            ;;
    esac

    info "Script completed successfully"
}

# Your application logic here
deploy_application() {
    info "Starting deployment"
    # Implementation here
}

rollback_application() {
    info "Starting rollback"
    # Implementation here
}

check_health() {
    info "Checking application health"
    # Implementation here
}

# Run main function
main "$@"

🎯 The Lessons That Matter

After all these years, here's what really matters:

Security first - Your script will handle sensitive data. Protect it.
Fail fast and loud - If something's wrong, make it obvious immediately.
Test everything - If you can't test it, you can't trust it.
Monitor everything - Logs and metrics save lives (and careers).
Plan for failure - Things will go wrong. Be ready.

These techniques saved my job more than once. That time I accidentally deleted the staging environment? The robust error handling and monitoring helped us recover in 20 minutes instead of 20 hours.

The deployment script that processed 50,000 servers? It used every pattern in this article. Testing caught the bugs, monitoring showed the bottlenecks, and error handling kept it running when individual servers failed.

🚀 Level Up Your Bash Skills

This is just the tip of the iceberg. Production bash scripting is a deep field with patterns for configuration management, service orchestration, infrastructure automation, and much more.

If you want to master these production techniques and avoid the painful lessons I learned the hard way, I've put together a comprehensive course that covers everything from basic scripting to advanced production patterns.

🎓 Bash Scripting for DevOps - Complete Course

The course includes:

Security patterns for handling secrets and user input
Testing frameworks for bash scripts
Error handling strategies that prevent downtime
Monitoring and logging best practices
Real production scenarios from my 10 years of experience
Complete script templates you can use immediately

I also share more advanced techniques and real-world case studies on my website: htdevops.top

Don't learn these lessons the hard way like I did. Your future self (and your on-call rotation) will thank you.

What's the worst production incident you've had with a bash script? Share your war stories in the comments - we've all been there! 🔥

DEV Community