Sameer Imtiaz

Posted on Jun 11

Bash Mastery: Lessons from a Decade of Production Challenges

#bash #errors #linux

Six months ago, a single bash script I crafted flawlessly executed 75,000 server deployments. Three years prior, my scripts were causing production outages almost biweekly. Here’s how I transformed my approach.

Since 2013, I’ve been writing bash scripts professionally. Early on, my scripts were fragile—functional on my local setup but prone to mysterious failures in production. Debugging felt like navigating a maze blindfolded.

After numerous late-night fire drills, botched deployments, and an infamous incident where I inadvertently wiped out a quarter of our testing environment, I honed techniques to create reliable bash scripts.

These aren’t beginner tips from tutorials. They’re battle-tested insights from high-stakes production environments where errors translate to real financial losses.

🔐 Overlooked Security Practices

Bash scripts often become security liabilities if not handled carefully.

Robust Input Validation

Tutorials preach input sanitization, but practical examples are rare. After our security team flagged vulnerabilities in my scripts, I adopted this approach:

# Old, insecure way (avoid this)
user_input="$1"
mysql -u admin -p"$password" -e "SELECT * FROM users WHERE id='$user_input'"

# New, secure approach
sanitize_user_input() {
    local input="$1"
    local length_limit="${2:-50}"

    # Strip non-alphanumeric characters except underscores and hyphens
    input=$(echo "$input" | tr -cd '[:alnum:]_-')

    # Enforce length limit
    input="${input:0:$length_limit}"

    # Check for valid input
    if [[ -z "$input" ]]; then
        echo "Error: Input contains invalid characters" >&2
        return 1
    fi

    echo "$input"
}

# Usage
user_input=$(sanitize_user_input "$1") || exit 1
mysql -u admin -p"$password" -e "SELECT * FROM users WHERE id='${user_input}'"

Environment Variable Checks

A missing environment variable once crashed our production system. Now, I validate them upfront:

check_required_vars() {
    local required=(
        "DB_HOST"
        "AUTH_TOKEN"
        "ENVIRONMENT"
        "LOG_LEVEL"
    )
    local missing=()

    for var in "${required[@]}"; do
        if [[ -z "${!var}" ]]; then
            missing+=("$var")
        fi
    done

    if [[ ${#missing[@]} -gt 0 ]]; then
        echo "Error: Missing environment variables:" >&2
        for var in "${missing[@]}"; do
            echo "  $var" >&2
        done
        echo "Please set these variables." >&2
        exit 1
    fi
}

# Run at script start
check_required_vars

Secure Secret Handling

Hardcoding secrets is a recipe for disaster. Here’s my evolution:

# Avoid: Hardcoded credentials
DB_PASS="mysecret123"

# Better: Use environment variables
DB_PASS="${DB_PASS:?Error: DB_PASS not set}"

# Best: External secret retrieval
fetch_secret() {
    local secret_id="$1"
    local secret_path="/secrets/$secret_id"

    if [[ -f "$secret_path" ]]; then
        cat "$secret_path"
    elif command -v aws >/dev/null && [[ -n "$AWS_SECRET_ARN" ]]; then
        aws secretsmanager get-secret-value \
            --secret-id "$AWS_SECRET_ARN" \
            --query 'SecretString' \
            --output text
    else
        echo "Error: Unable to fetch secret '$secret_id'" >&2
        return 1
    fi
}

# Usage
DB_PASS=$(fetch_secret "db_password") || exit 1

🧪 Testing Bash Scripts

After a “minor” script erased our user database, I started testing rigorously.

Unit Testing Functions

# test_utils.sh
execute_test() {
    local test_name="$1"
    local test_func="$2"

    echo -n "Running $test_name... "

    if $test_func; then
        echo "✅ Success"
        ((TESTS_SUCCESS++))
    else
        echo "❌ Failed"
        ((TESTS_FAILED++))
    fi
}

assert_equal() {
    local expected="$1"
    local actual="$2"
    local msg="${3:-Mismatch detected}"

    if [[ "$expected" == "$actual" ]]; then
        return 0
    else
        echo "  Expected: '$expected'" >&2
        echo "  Actual: '$actual'" >&2
        echo "  Error: $msg" >&2
        return 1
    fi
}

assert_includes() {
    local text="$1"
    local substring="$2"

    if [[ "$text" == *"$substring"* ]]; then
        return 0
    else
        echo "  '$text' does not include '$substring'" >&2
        return 1
    fi
}

Testing a Function

# Function to test
get_config_value() {
    local file="$1"
    local key="$2"
    grep "^$key=" "$file" | cut -d'=' -f2
}

# Test case
test_get_config_value() {
    local temp_file=$(mktemp)
    echo "db_host=127.0.0.1" > "$temp_file"
    echo "db_port=3306" >> "$temp_file"

    local result=$(get_config_value "$temp_file" "db_host")
    rm -f "$temp_file"

    assert_equal "127.0.0.1" "$result"
}

execute_test "get_config_value retrieves correct value" test_get_config_value

Integration Testing

test_full_deployment() {
    local env_name="test_$(date +%s)"

    setup_test_env "$env_name" || return 1

    DEPLOY_ENV="$env_name" ./deploy.sh || {
        cleanup_test_env "$env_name"
        return 1
    }

    if ! check_deployment_status "$env_name"; then
        cleanup_test_env "$env_name"
        return 1
    fi

    cleanup_test_env "$env_name"
    return 0
}

🚨 Robust Error Handling

The hallmark of a seasoned scripter is anticipating and managing failures.

Comprehensive Error Reporting

handle_failure() {
    local code=$?
    local line=$1
    local cmd="$2"
    local func="${FUNCNAME[2]:-main}"

    {
        echo "=== ERROR REPORT ==="
        echo "Script: $0"
        echo "Function: $func"
        echo "Line: $line"
        echo "Command: $cmd"
        echo "Exit Code: $code"
        echo "Timestamp: $(date)"
        echo "User: $(whoami)"
        echo "Directory: $(pwd)"
        echo "Environment: ${DEPLOY_ENV:-unknown}"
        echo -e "\nRecent Commands:"
        history | tail -n 5
        echo -e "\nSystem Status:"
        uname -a
        echo "Storage:"
        df -h
        echo "Memory:"
        free -h
        echo "==================="
    } >&2

    notify_monitoring "Script Error" "Script $0 failed at line $line"
    exit $code
}

set -eE
trap 'handle_failure $LINENO "$BASH_COMMAND"' ERR

Retry with Exponential Backoff

retry_operation() {
    local attempts="$1"
    local initial_delay="$2"
    local max_delay="$3"
    shift 3

    local try=1
    local delay="$initial_delay"

    while [[ $try -le $attempts ]]; do
        echo "Try $try/$attempts: $*" >&2

        if "$@"; then
            echo "Success on try $try" >&2
            return 0
        fi

        if [[ $try -eq $attempts ]]; then
            echo "Failed after $attempts tries" >&2
            return 1
        fi

        echo "Retrying in $delay seconds..." >&2
        sleep "$delay"

        delay=$((delay * 2))
        [[ $delay -gt $max_delay ]] && delay=$max_delay
        delay=$((delay + (RANDOM % (delay / 5)) - (delay / 5)))

        ((try++))
    done
}

# Usage
retry_operation 5 2 30 curl -f "https://api.example.com/status"
retry_operation 3 1 10 docker pull "$IMAGE_NAME"

Circuit Breaker for Unreliable Services

declare -A circuit_states

check_circuit() {
    local svc="$1"
    local max_fails="${2:-5}"
    local timeout="${3:-300}"

    local state="${circuit_states[$svc]}"
    [[ -z "$state" ]] && return 1

    local fails=$(echo "$state" | cut -d: -f1)
    local last_fail=$(echo "$state" | cut -d: -f2)
    local now=$(date +%s)

    if [[ $((now - last_fail)) -gt $timeout ]]; then
        unset circuit_states["$svc"]
        return 1
    fi

    [[ $fails -ge $max_fails ]]
}

log_failure() {
    local svc="$1"
    local state="${circuit_states[$svc]}"

    if [[ -z "$state" ]]; then
        circuit_states["$svc"]="1:$(date +%s)"
    else
        local fails=$(echo "$state" | cut -d: -f1)
        circuit_states["$svc"]="$((fails + 1)):$(date +%s)"
    fi
}

call_with_protection() {
    local svc="$1"
    shift

    if check_circuit "$svc"; then
        echo "Circuit open for $svc, skipping" >&2
        return 1
    fi

    if "$@"; then
        return 0
    else
        log_failure "$svc"
        return 1
    fi
}

# Usage
call_with_protection "auth_api" curl -f "https://auth.api.com/verify"

📊 Observability and Monitoring

Visibility is critical for debugging and performance tracking.

Structured Logging

declare -A log_metadata

set_log_metadata() {
    local key="$1"
    local value="$2"
    log_metadata["$key"]="$value"
}

log_structured() {
    local level="$1"
    local msg="$2"
    local ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    local meta=""
    for key in "${!log_metadata[@]}"; do
        meta+="$key=${log_metadata[$key]},"
    done

    echo "{\"timestamp\":\"$ts\",\"level\":\"$level\",\"message\":\"$msg\",\"script\":\"$0\",\"pid\":$$,\"metadata\":{${meta%,}}}" >&2
}

log_info() { log_structured "INFO" "$1"; }
log_warn() { log_structured "WARN" "$1"; }
log_error() { log_structured "ERROR" "$1"; }

# Usage
set_log_metadata "deploy_id" "$DEPLOY_ID"
set_log_metadata "env" "$DEPLOY_ENV"
log_info "Initiating deployment"

Metrics Tracking

record_metric() {
    local name="$1"
    local value="$2"
    local tags="$3"

    if command -v nc >/dev/null; then
        echo "${name}:${value}|c|#${tags}" | nc -u -w1 localhost 8125
    fi

    echo "METRIC: $name=$value tags=$tags" >&2
}

time_task() {
    local task="$1"
    shift

    local start=$(date +%s.%N)

    if "$@"; then
        local duration=$(echo "$(date +%s.%N) - $start" | bc)
        record_metric "task.duration" "$duration" "task=$task,status=success"
        return 0
    else
        local code=$?
        local duration=$(echo "$(date +%s.%N) - $start" | bc)
        record_metric "task.duration" "$duration" "task=$task,status=failure"
        return $code
    fi
}

# Usage
record_metric "deploy.start" "1" "env=$DEPLOY_ENV"
time_task "db_migration" perform_migration
record_metric "deploy.end" "1" "env=$DEPLOY_ENV"

🏗️ Robust Script Template

Here’s my go-to template for production-ready scripts:

#!/bin/bash
# Production Script Framework
# Purpose: Describe script function
# Author: Your Name
# Version: 1.0
# Updated: $(date)

set -eEo pipefail
IFS=$'\n\t'

# Metadata
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_PID=$$

# Config
readonly CONFIG="${SCRIPT_DIR}/settings.env"
[[ -f "$CONFIG" ]] && source "$CONFIG"

# Logging
exec 3> >(logger -t "$SCRIPT_NAME")
readonly LOG_FD=3

log_message() {
    local level="$1"
    shift
    echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] [PID:$SCRIPT_PID] $*" >&$LOG_FD
    [[ "$level" == "ERROR" ]] && echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] $*" >&2
}

log_info() { log_message "INFO" "$@"; }
log_error() { log_message "ERROR" "$@"; }

# Cleanup
cleanup() {
    local code=$?
    log_info "Cleaning up"

    rm -f "${temp_files[@]}" 2>/dev/null || true
    log_info "Exiting with code $code"
    exit $code
}

# Error handling
handle_error() {
    local code=$?
    local line=$1

    log_error "Failed at line $line with code $code"
    log_error "Command: $BASH_COMMAND"
    log_error "Function: ${FUNCNAME[2]:-main}"

    notify_monitoring "Script Failure" "Script $SCRIPT_NAME failed"
    exit $code
}

trap cleanup EXIT
trap 'handle_error $LINENO' ERR

# Validation
check_requirements() {
    log_info "Checking requirements"

    local commands=("curl" "jq" "docker")
    for cmd in "${commands[@]}"; do
        if ! command -v "$cmd" >/dev/null; then
            log_error "Missing command: $cmd"
            exit 1
        fi
    done

    local vars=("DEPLOY_ENV" "AUTH_TOKEN")
    for var in "${vars[@]}"; do
        if [[ -z "${!var}" ]]; then
            log_error "Missing variable: $var"
            exit 1
        fi
    done

    log_info "Requirements verified"
}

# Main logic
main() {
    local action="${1:-deploy}"

    log_info "Starting $SCRIPT_NAME with action: $action"
    set_log_metadata "action" "$action"

    check_requirements

    case "$action" in
        "deploy")
            run_deployment
            ;;
        "rollback")
            run_rollback
            ;;
        "health")
            check_status
            ;;
        *)
            log_error "Invalid action: $action"
            echo "Usage: $0 {deploy|rollback|health}" >&2
            exit 1
            ;;
    esac

    log_info "Execution completed"
}

run_deployment() {
    log_info "Deploying application"
    # Add deployment logic
}

run_rollback() {
    log_info "Rolling back application"
    # Add rollback logic
}

check_status() {
    log_info "Verifying application status"
    # Add health check logic
}

main "$@"

🎯 Key Takeaways

From a decade of experience, here are the critical lessons:

Prioritize Security: Scripts handle sensitive data—secure them.
Fail Clearly: Make errors obvious and immediate.
Test Thoroughly: Untested scripts are untrustworthy.
Monitor Actively: Logs and metrics are lifesavers.
Prepare for Failure: Anticipate and mitigate issues.

These practices have saved my career multiple times. When I accidentally wiped the testing environment, comprehensive logging and error handling enabled recovery in 30 minutes instead of days.

The script that handled 75,000 deployments? It leveraged every technique here—testing eliminated bugs, monitoring pinpointed issues, and error handling ensured resilience.

🚀 Advance Your Bash Expertise

Production bash scripting is a vast domain, covering configuration, orchestration, and automation. To skip the painful lessons I endured, I’ve developed a detailed course on production-grade bash scripting.

The course covers:

Secure handling of secrets and inputs
Testing frameworks for reliability
Error management to avoid outages
Monitoring and logging best practices
Real-world scenarios from my decade of experience
Ready-to-use script templates

Avoid learning the hard way. Your future self—and your team—will appreciate it.

What’s your worst bash script disaster? Share your stories below—we’ve all got battle scars! 🔥

DEV Community