Three months ago, a single bash script I wrote processed 50,000 server deployments without a single failure. Two years ago, my scripts were breaking production every other week. Here's what changed.
I've been writing bash professionally since 2014, and I'll be honest - for the first few years, my scripts were garbage. They worked on my machine, failed mysteriously in production, and debugging them was like performing surgery with a sledgehammer.
But after countless 3 AM wake-up calls, failed deployments, and that one incident where I accidentally deleted half our staging environment (oops), I finally learned how to write bash scripts that don't suck.
These aren't the techniques you'll find in basic tutorials. These are the hard-earned lessons from production environments where failure isn't just embarrassing - it costs money.
## 🔒 The Security Stuff Nobody Talks About
Let's start with the elephant in the room. Most bash scripts are security nightmares waiting to happen.
### Input Sanitization That Actually Works
Everyone tells you to "sanitize your inputs" but nobody shows you how. Here's what I learned after our security team tore apart my scripts:
```bash
# This is what I used to write (don't do this)
username="$1"
mysql -u root -p"$password" -e "SELECT * FROM users WHERE name='$username'"

# This is what I write now
sanitize_input() {
    local input="$1"
    local max_length="${2:-50}"

    # Remove anything that isn't alphanumeric, underscore, or dash
    input=$(echo "$input" | tr -cd '[:alnum:]_-')

    # Truncate to max length
    input="${input:0:$max_length}"

    # Ensure it's not empty after sanitization
    [[ -n "$input" ]] || {
        echo "Invalid input: contains illegal characters" >&2
        return 1
    }

    echo "$input"
}

# Safe usage
username=$(sanitize_input "$1") || exit 1
mysql -u root -p"$password" -e "SELECT * FROM users WHERE name='${username}'"
```
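Stripping is one approach; sometimes it's safer to reject bad input outright instead of silently rewriting it. Here's a sketch using bash's own pattern matching (the `validate_input` name and the 1-50 length bound are my choices, not a fixed rule):

```bash
# Reject invalid input instead of silently rewriting it
validate_input() {
    local input="$1"
    # Allow only 1-50 alphanumerics, underscores, or dashes
    [[ "$input" =~ ^[[:alnum:]_-]{1,50}$ ]] || {
        echo "Invalid input: '$input'" >&2
        return 1
    }
    printf '%s\n' "$input"
}

validate_input "deploy_user-01"                  # prints the value back
validate_input 'bob; rm -rf /' || echo "rejected"
```

Rejecting keeps the failure visible: a caller passing `bob; rm -rf /` gets an error instead of a quietly mangled username.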
### Environment Variable Validation
After getting burned by missing environment variables in production:
```bash
# Validate required environment variables at script start
validate_environment() {
    local required_vars=(
        "DATABASE_URL"
        "API_KEY"
        "DEPLOY_ENV"
        "LOG_LEVEL"
    )
    local missing_vars=()

    for var in "${required_vars[@]}"; do
        if [[ -z "${!var}" ]]; then
            missing_vars+=("$var")
        fi
    done

    if [[ ${#missing_vars[@]} -gt 0 ]]; then
        echo "ERROR: Missing required environment variables:" >&2
        printf '    %s\n' "${missing_vars[@]}" >&2
        echo "Set these variables and try again." >&2
        exit 1
    fi
}

# Call this at the beginning of every production script
validate_environment
```
### Secrets Management
Stop putting passwords in your scripts. Seriously.
```bash
# Bad: hardcoded secrets
DB_PASSWORD="super_secret_password"

# Better: environment variables
DB_PASSWORD="${DB_PASSWORD:?DB_PASSWORD must be set}"

# Best: external secret management
get_secret() {
    local secret_name="$1"
    local secret_file="/run/secrets/$secret_name"

    if [[ -f "$secret_file" ]]; then
        cat "$secret_file"
    elif command -v aws >/dev/null && [[ -n "$AWS_SECRET_ARN" ]]; then
        aws secretsmanager get-secret-value \
            --secret-id "$AWS_SECRET_ARN" \
            --query 'SecretString' \
            --output text
    else
        echo "ERROR: Cannot retrieve secret '$secret_name'" >&2
        return 1
    fi
}

# Usage
DB_PASSWORD=$(get_secret "database_password") || exit 1
```
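The `${VAR:?message}` expansion is worth internalizing: it turns a missing variable into an immediate, loud failure. A self-contained sketch (the `s3cret` value is obviously fake, and the subshell is only there so the demo can watch its own abort):

```bash
# ${VAR:?message} makes a non-interactive shell exit when VAR is unset or empty
DB_PASSWORD="s3cret"
echo "password is ${#DB_PASSWORD} chars long"   # log lengths, never values

# Provoke the failure in a subshell so we can observe it
( unset DB_PASSWORD; : "${DB_PASSWORD:?DB_PASSWORD must be set}" ) \
    || echo "aborted as expected"
```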
## 🧪 Testing Bash Scripts (Yes, Really)
This might sound crazy, but I write tests for my bash scripts now. It started after a "simple" deployment script wiped our entire customer database.
### Unit Testing Functions
```bash
# test_helpers.sh - My testing framework
TESTS_PASSED=0
TESTS_FAILED=0

run_test() {
    local test_name="$1"
    local test_function="$2"

    echo -n "Testing $test_name... "
    if $test_function; then
        echo "✅ PASS"
        ((TESTS_PASSED++))
    else
        echo "❌ FAIL"
        ((TESTS_FAILED++))
    fi
}

assert_equals() {
    local expected="$1"
    local actual="$2"
    local message="${3:-Values don't match}"

    if [[ "$expected" == "$actual" ]]; then
        return 0
    else
        echo "  Expected: '$expected'"
        echo "  Actual:   '$actual'"
        echo "  Message:  $message"
        return 1
    fi
}

assert_contains() {
    local haystack="$1"
    local needle="$2"

    if [[ "$haystack" == *"$needle"* ]]; then
        return 0
    else
        echo "  '$haystack' does not contain '$needle'"
        return 1
    fi
}
```
### Testing Real Functions
```bash
# Example function to test
parse_config_value() {
    local config_file="$1"
    local key="$2"
    # -f2- keeps everything after the first '=' (values may contain '=')
    grep "^$key=" "$config_file" | cut -d'=' -f2-
}

# The test
test_parse_config_value() {
    # Setup
    local test_config
    test_config=$(mktemp)
    echo "database_host=localhost" > "$test_config"
    echo "database_port=5432" >> "$test_config"

    # Test
    local result
    result=$(parse_config_value "$test_config" "database_host")

    # Cleanup
    rm -f "$test_config"

    # Assert
    assert_equals "localhost" "$result"
}

# Run the test
run_test "parse_config_value extracts correct value" test_parse_config_value
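To see the idea end to end, here's a self-contained mini-run: the assert and runner are compressed copies of the helpers above, and the summary line is my addition, not part of the original framework.

```bash
# Self-contained mini test run: compressed helpers plus a summary line
TESTS_PASSED=0
TESTS_FAILED=0

assert_equals() {
    [[ "$1" == "$2" ]] || { echo "  expected '$1', got '$2'"; return 1; }
}

run_test() {
    echo -n "Testing $1... "
    if "$2"; then
        echo "PASS"; TESTS_PASSED=$((TESTS_PASSED + 1))
    else
        echo "FAIL"; TESTS_FAILED=$((TESTS_FAILED + 1))
    fi
}

test_arithmetic() { assert_equals "4" "$((2 + 2))"; }
test_uppercase()  { assert_equals "ABC" "$(tr a-z A-Z <<< 'abc')"; }

run_test "arithmetic works" test_arithmetic
run_test "tr uppercases"    test_uppercase
echo "Summary: $TESTS_PASSED passed, $TESTS_FAILED failed"
```

Run it with `bash` and you get one line per test plus a pass/fail tally at the end.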
### Integration Testing
```bash
# Test the entire deployment pipeline
test_deployment_pipeline() {
    local test_env="test_$(date +%s)"

    # Create isolated test environment
    setup_test_environment "$test_env" || return 1

    # Run deployment
    DEPLOY_ENV="$test_env" ./deploy.sh || {
        cleanup_test_environment "$test_env"
        return 1
    }

    # Verify deployment worked
    if ! verify_deployment "$test_env"; then
        cleanup_test_environment "$test_env"
        return 1
    fi

    # Cleanup
    cleanup_test_environment "$test_env"
    return 0
}
```
## 🚨 Error Handling That Saves Your Job
The difference between a junior and senior bash scripter isn't what they know - it's how they handle things going wrong.
### Structured Error Handling
```bash
# Error handling that actually helps you debug
handle_error() {
    local exit_code=$?
    local line_number=$1
    local command="$2"
    # FUNCNAME[0] is handle_error itself; [1] is the function that failed
    local function_name="${FUNCNAME[1]}"

    {
        echo "==============================================="
        echo "SCRIPT FAILURE REPORT"
        echo "==============================================="
        echo "Script: $0"
        echo "Function: ${function_name:-main}"
        echo "Line: $line_number"
        echo "Command: $command"
        echo "Exit Code: $exit_code"
        echo "Time: $(date)"
        echo "User: $(whoami)"
        echo "Working Directory: $(pwd)"
        echo "Environment: ${DEPLOY_ENV:-unknown}"
        echo ""
        echo "System Info:"
        uname -a
        echo "Disk Space:"
        df -h
        echo "Memory:"
        free -h
        echo "==============================================="
    } >&2

    # Send to monitoring system
    send_alert "Script Failure" "Script $0 failed at line $line_number"

    exit $exit_code
}

# Set up error trapping (-E makes the ERR trap fire inside functions too)
set -eE
trap 'handle_error $LINENO "$BASH_COMMAND"' ERR
```
### Retry Logic for Flaky Operations
Network calls, API requests, file operations - they all fail sometimes. Here's how I handle it:
```bash
retry_with_backoff() {
    local max_attempts="$1"
    local base_delay="$2"
    local max_delay="$3"
    shift 3

    local attempt=1
    local delay="$base_delay"

    while [[ $attempt -le $max_attempts ]]; do
        echo "Attempt $attempt/$max_attempts: $*" >&2

        if "$@"; then
            echo "Success on attempt $attempt" >&2
            return 0
        fi

        if [[ $attempt -eq $max_attempts ]]; then
            echo "All $max_attempts attempts failed" >&2
            return 1
        fi

        echo "Failed, waiting $delay seconds before retry..." >&2
        sleep "$delay"

        # Exponential backoff with jitter
        delay=$((delay * 2))
        [[ $delay -gt $max_delay ]] && delay=$max_delay

        # Add random jitter (±20%); skip when jitter is 0 to avoid
        # a division-by-zero in the modulo for small delays
        local jitter=$((delay / 5))
        if [[ $jitter -gt 0 ]]; then
            delay=$((delay + (RANDOM % (jitter * 2)) - jitter))
        fi

        ((attempt++))
    done
}

# Usage
retry_with_backoff 5 2 30 curl -f "https://api.example.com/health"
retry_with_backoff 3 1 10 docker pull "$IMAGE_NAME"
```
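To convince yourself the loop behaves, here's a self-contained demo with a command that fails twice and then succeeds. `flaky` and the temp-file counter are demo scaffolding, and the `retry` function is the same loop shape as `retry_with_backoff` minus the backoff arithmetic, so the demo runs instantly:

```bash
# A command that fails twice, then succeeds - state kept in a temp file
attempts_file=$(mktemp)
echo 0 > "$attempts_file"

flaky() {
    local n=$(( $(cat "$attempts_file") + 1 ))
    echo "$n" > "$attempts_file"
    [[ $n -ge 3 ]]   # fail on attempts 1 and 2, succeed on attempt 3
}

# Same loop shape as retry_with_backoff, minus delays and jitter
retry() {
    local max="$1"; shift
    local i
    for ((i = 1; i <= max; i++)); do
        if "$@"; then
            echo "succeeded on attempt $i"
            return 0
        fi
    done
    echo "gave up after $max attempts" >&2
    return 1
}

retry 5 flaky    # prints: succeeded on attempt 3
rm -f "$attempts_file"
```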
### Circuit Breaker Pattern
When external services are down, stop hammering them:
```bash
declare -A circuit_breakers

is_circuit_open() {
    local service="$1"
    local failure_threshold="${2:-5}"
    local reset_timeout="${3:-300}"  # 5 minutes

    local cb_data="${circuit_breakers[$service]:-}"
    [[ -z "$cb_data" ]] && return 1  # No recorded failures, circuit is closed

    local failures=$(echo "$cb_data" | cut -d: -f1)
    local last_failure=$(echo "$cb_data" | cut -d: -f2)
    local current_time=$(date +%s)

    # If enough time has passed, reset the circuit
    if [[ $((current_time - last_failure)) -gt $reset_timeout ]]; then
        unset 'circuit_breakers[$service]'
        return 1  # Circuit reset, it's closed
    fi

    # Open if we've reached the failure threshold
    [[ $failures -ge $failure_threshold ]]
}

record_failure() {
    local service="$1"
    local cb_data="${circuit_breakers[$service]:-}"

    if [[ -z "$cb_data" ]]; then
        circuit_breakers["$service"]="1:$(date +%s)"
    else
        local failures=$(echo "$cb_data" | cut -d: -f1)
        circuit_breakers["$service"]="$((failures + 1)):$(date +%s)"
    fi
}

call_with_circuit_breaker() {
    local service="$1"
    shift

    if is_circuit_open "$service"; then
        echo "Circuit breaker open for $service, skipping call" >&2
        return 1
    fi

    if "$@"; then
        # A success closes the circuit again
        unset 'circuit_breakers[$service]'
        return 0
    else
        record_failure "$service"
        return 1
    fi
}

# Usage
call_with_circuit_breaker "payment_api" curl -f "https://payment.api.com/process"
```
## 📊 Monitoring and Observability
You can't fix what you can't see. Here's how I make my scripts observable:
### Structured Logging with Context
```bash
# Logging that actually helps during incidents
declare -A log_context

add_log_context() {
    local key="$1"
    local value="$2"
    log_context["$key"]="$value"
}

structured_log() {
    local level="$1"
    local message="$2"
    local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # Build the context as JSON key/value pairs
    local context_str=""
    for key in "${!log_context[@]}"; do
        context_str+="\"$key\":\"${log_context[$key]}\","
    done
    context_str="${context_str%,}"  # Drop the trailing comma

    # Output structured log
    echo "{\"timestamp\":\"$timestamp\",\"level\":\"$level\",\"message\":\"$message\",\"script\":\"$0\",\"pid\":$$,\"context\":{$context_str}}" >&2
}

info()  { structured_log "INFO" "$1"; }
warn()  { structured_log "WARN" "$1"; }
error() { structured_log "ERROR" "$1"; }

# Usage
add_log_context "deployment_id" "$DEPLOY_ID"
add_log_context "environment" "$DEPLOY_ENV"
info "Starting deployment process"
```
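For a sense of what actually lands in the log stream, here's a stripped-down, self-contained sketch of a single event - context handling omitted, `mini_log` is my name for the cut-down version, and the field names match the ones used above:

```bash
# One JSON object per event, one line per object - easy to grep, easy to ship
mini_log() {
    printf '{"timestamp":"%s","level":"%s","message":"%s","pid":%d}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$$"
}

mini_log "INFO" "Starting deployment process"
```

One event per line means `grep '"level":"ERROR"'` during an incident just works.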
### Metrics Collection
```bash
# Simple metrics that save your sanity
send_metric() {
    local metric_name="$1"
    local value="$2"
    local tags="$3"

    # Send to your monitoring system
    # This example uses StatsD, adapt for your system
    if command -v nc >/dev/null; then
        echo "${metric_name}:${value}|c|#${tags}" | nc -u -w1 localhost 8125
    fi

    # Also log for debugging
    echo "METRIC: $metric_name=$value tags=$tags" >&2
}

time_operation() {
    local operation_name="$1"
    shift

    local start_time=$(date +%s.%N)
    if "$@"; then
        local duration=$(echo "$(date +%s.%N) - $start_time" | bc)
        send_metric "operation.duration" "$duration" "operation=$operation_name,status=success"
        return 0
    else
        local exit_code=$?
        local duration=$(echo "$(date +%s.%N) - $start_time" | bc)
        send_metric "operation.duration" "$duration" "operation=$operation_name,status=failure"
        return $exit_code
    fi
}

# Usage
send_metric "deployment.started" "1" "environment=$DEPLOY_ENV"
time_operation "database_migration" run_migration
send_metric "deployment.completed" "1" "environment=$DEPLOY_ENV"
```
## 🏗️ Production-Ready Script Template
After years of trial and error, here's my template for any production script:
```bash
#!/bin/bash
# Production Script Template
# Description: What this script does
# Author: Your Name
# Version: 1.0
# Last Modified: YYYY-MM-DD

set -eEo pipefail
IFS=$'\n\t'

# Script metadata
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_PID=$$

# Load configuration
readonly CONFIG_FILE="${SCRIPT_DIR}/config.env"
[[ -f "$CONFIG_FILE" ]] && source "$CONFIG_FILE"

# Track temp files so cleanup can remove them
temp_files=()

# Logging setup
exec 3> >(logger -t "$SCRIPT_NAME")
readonly LOG_FD=3

log() {
    local level="$1"
    shift
    echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] [PID:$SCRIPT_PID] $*" >&$LOG_FD
    # Use an explicit if: a bare '[[ ... ]] && echo' would return 1 for
    # non-ERROR levels and trip the ERR trap under set -eE
    if [[ "$level" == "ERROR" ]]; then
        echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] $*" >&2
    fi
    return 0
}

info()  { log "INFO" "$@"; }
warn()  { log "WARN" "$@"; }
error() { log "ERROR" "$@"; }

# Error handling
cleanup() {
    local exit_code=$?
    info "Script cleanup started"
    # Your cleanup logic here
    rm -f "${temp_files[@]}" 2>/dev/null || true
    info "Script exiting with code $exit_code"
    exit $exit_code
}

handle_error() {
    local exit_code=$?
    local line_number=$1
    error "Script failed at line $line_number with exit code $exit_code"
    error "Command: $BASH_COMMAND"
    error "Function: ${FUNCNAME[1]:-main}"
    # Send alert to monitoring system (implement send_alert for your stack)
    send_alert "Script Failure" "Script $SCRIPT_NAME failed"
    exit $exit_code
}

trap cleanup EXIT
trap 'handle_error $LINENO' ERR

# Validation
validate_prerequisites() {
    info "Validating prerequisites"

    # Check required commands
    local required_commands=("curl" "jq" "docker")
    for cmd in "${required_commands[@]}"; do
        if ! command -v "$cmd" >/dev/null; then
            error "Required command not found: $cmd"
            exit 1
        fi
    done

    # Check required environment variables
    local required_vars=("DEPLOY_ENV" "API_KEY")
    for var in "${required_vars[@]}"; do
        if [[ -z "${!var}" ]]; then
            error "Required environment variable not set: $var"
            exit 1
        fi
    done

    info "All prerequisites validated"
}

# Main function
main() {
    local action="${1:-deploy}"

    info "Starting $SCRIPT_NAME with action: $action"
    validate_prerequisites

    case "$action" in
        "deploy")
            deploy_application
            ;;
        "rollback")
            rollback_application
            ;;
        "health")
            check_health
            ;;
        *)
            error "Unknown action: $action"
            echo "Usage: $0 {deploy|rollback|health}" >&2
            exit 1
            ;;
    esac

    info "Script completed successfully"
}

# Your application logic here
deploy_application() {
    info "Starting deployment"
    # Implementation here
}

rollback_application() {
    info "Starting rollback"
    # Implementation here
}

check_health() {
    info "Checking application health"
    # Implementation here
}

# Run main function
main "$@"
```
## 🎯 The Lessons That Matter
After all these years, here's what really matters:
- Security first - Your script will handle sensitive data. Protect it.
- Fail fast and loud - If something's wrong, make it obvious immediately.
- Test everything - If you can't test it, you can't trust it.
- Monitor everything - Logs and metrics save lives (and careers).
- Plan for failure - Things will go wrong. Be ready.
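Lesson two in its smallest possible form - a sketch, run in a child bash so the demo can report the abort it deliberately provokes:

```bash
# "Fail fast and loud": strict mode plus a loud ERR trap
bash <<'EOF' || echo "script aborted with exit code $?"
set -euo pipefail
trap 'echo "FAILED running: $BASH_COMMAND" >&2' ERR
echo "step 1 ok"
false            # any failing command now stops the script cold
echo "never reached"
EOF
```

The failing command prints a loud message to stderr and the script dies immediately; "never reached" never runs.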
These techniques saved my job more than once. That time I accidentally deleted the staging environment? The robust error handling and monitoring helped us recover in 20 minutes instead of 20 hours.
The deployment script that processed 50,000 servers? It used every pattern in this article. Testing caught the bugs, monitoring showed the bottlenecks, and error handling kept it running when individual servers failed.
## 🚀 Level Up Your Bash Skills
This is just the tip of the iceberg. Production bash scripting is a deep field with patterns for configuration management, service orchestration, infrastructure automation, and much more.
If you want to master these production techniques and avoid the painful lessons I learned the hard way, I've put together a comprehensive course that covers everything from basic scripting to advanced production patterns.
📚 Bash Scripting for DevOps - Complete Course
The course includes:
- Security patterns for handling secrets and user input
- Testing frameworks for bash scripts
- Error handling strategies that prevent downtime
- Monitoring and logging best practices
- Real production scenarios from my 10 years of experience
- Complete script templates you can use immediately
I also share more advanced techniques and real-world case studies on my website: htdevops.top
Don't learn these lessons the hard way like I did. Your future self (and your on-call rotation) will thank you.
What's the worst production incident you've had with a bash script? Share your war stories in the comments - we've all been there! 🔥