Six months ago, a single bash script I wrote executed 75,000 server deployments without a hitch. Three years prior, my scripts were causing production outages almost every other week. Here’s how I transformed my approach.
Since 2013, I’ve been writing bash scripts professionally. Early on, my scripts were fragile—functional on my local setup but prone to mysterious failures in production. Debugging felt like navigating a maze blindfolded.
After numerous late-night fire drills, botched deployments, and an infamous incident where I inadvertently wiped out a quarter of our testing environment, I honed techniques to create reliable bash scripts.
These aren’t beginner tips from tutorials. They’re battle-tested insights from high-stakes production environments where errors translate to real financial losses.
🔐 Overlooked Security Practices
Bash scripts often become security liabilities if not handled carefully.
Robust Input Validation
Tutorials preach input sanitization, but practical examples are rare. After our security team flagged vulnerabilities in my scripts, I adopted this approach:
```bash
# Old, insecure way (avoid this)
user_input="$1"
mysql -u admin -p"$password" -e "SELECT * FROM users WHERE id='$user_input'"

# New, secure approach
sanitize_user_input() {
    local input="$1"
    local length_limit="${2:-50}"

    # Strip everything except alphanumerics, underscores, and hyphens
    input=$(printf '%s' "$input" | tr -cd '[:alnum:]_-')

    # Enforce length limit
    input="${input:0:$length_limit}"

    # Reject input that is empty after sanitization
    if [[ -z "$input" ]]; then
        echo "Error: input is empty after sanitization" >&2
        return 1
    fi

    printf '%s\n' "$input"
}

# Usage
user_input=$(sanitize_user_input "$1") || exit 1
mysql -u admin -p"$password" -e "SELECT * FROM users WHERE id='${user_input}'"
```
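Stripping characters is a reasonable default, but when you know the exact format (say, a numeric user id), rejecting anything that doesn't match is safer than silently mangling it. A minimal sketch, assuming ids are positive integers (`validate_numeric_id` is a hypothetical helper, not part of the script above):

```shell
#!/bin/bash
# Sketch: validate rather than sanitize. Reject anything that is not
# exactly a positive integer, instead of silently stripping characters.
validate_numeric_id() {
    local input="$1"
    if [[ "$input" =~ ^[0-9]+$ ]]; then
        printf '%s\n' "$input"
    else
        echo "Error: expected a numeric id, got '$input'" >&2
        return 1
    fi
}

validate_numeric_id "12345"
if ! validate_numeric_id "5; DROP TABLE users" >/dev/null 2>&1; then
    echo "rejected injection attempt"
fi
```

The caller decides what to do on rejection; nothing dangerous ever reaches the query string.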
Environment Variable Checks
A missing environment variable once crashed our production system. Now, I validate them upfront:
```bash
check_required_vars() {
    local required=(
        "DB_HOST"
        "AUTH_TOKEN"
        "ENVIRONMENT"
        "LOG_LEVEL"
    )
    local missing=()

    for var in "${required[@]}"; do
        # ${!var:-} is indirect expansion; :- keeps it safe under set -u
        if [[ -z "${!var:-}" ]]; then
            missing+=("$var")
        fi
    done

    if [[ ${#missing[@]} -gt 0 ]]; then
        echo "Error: Missing environment variables:" >&2
        for var in "${missing[@]}"; do
            echo "  $var" >&2
        done
        echo "Please set these variables." >&2
        exit 1
    fi
}

# Run at script start
check_required_vars
```
Secure Secret Handling
Hardcoding secrets is a recipe for disaster. Here’s my evolution:
```bash
# Avoid: Hardcoded credentials
DB_PASS="mysecret123"

# Better: Use environment variables
DB_PASS="${DB_PASS:?Error: DB_PASS not set}"

# Best: External secret retrieval
fetch_secret() {
    local secret_id="$1"
    local secret_path="/secrets/$secret_id"

    if [[ -f "$secret_path" ]]; then
        cat "$secret_path"
    elif command -v aws >/dev/null && [[ -n "${AWS_SECRET_ARN:-}" ]]; then
        # Fallback: AWS Secrets Manager, with the ARN supplied via env
        aws secretsmanager get-secret-value \
            --secret-id "$AWS_SECRET_ARN" \
            --query 'SecretString' \
            --output text
    else
        echo "Error: Unable to fetch secret '$secret_id'" >&2
        return 1
    fi
}

# Usage
DB_PASS=$(fetch_secret "db_password") || exit 1
```
🧪 Testing Bash Scripts
After a “minor” script erased our user database, I started testing rigorously.
Unit Testing Functions
```bash
# test_utils.sh
TESTS_SUCCESS=0
TESTS_FAILED=0

execute_test() {
    local test_name="$1"
    local test_func="$2"

    echo -n "Running $test_name... "
    if $test_func; then
        echo "✅ Success"
        # Plain assignment instead of ((var++)), which returns nonzero
        # when the variable is 0 and would abort under set -e
        TESTS_SUCCESS=$((TESTS_SUCCESS + 1))
    else
        echo "❌ Failed"
        TESTS_FAILED=$((TESTS_FAILED + 1))
    fi
}

assert_equal() {
    local expected="$1"
    local actual="$2"
    local msg="${3:-Mismatch detected}"

    if [[ "$expected" == "$actual" ]]; then
        return 0
    else
        echo "  Expected: '$expected'" >&2
        echo "  Actual:   '$actual'" >&2
        echo "  Error: $msg" >&2
        return 1
    fi
}

assert_includes() {
    local text="$1"
    local substring="$2"

    if [[ "$text" == *"$substring"* ]]; then
        return 0
    else
        echo "  '$text' does not include '$substring'" >&2
        return 1
    fi
}
```
Testing a Function
```bash
# Function to test
get_config_value() {
    local file="$1"
    local key="$2"
    # -f2- keeps values that themselves contain '='
    grep "^$key=" "$file" | cut -d'=' -f2-
}

# Test case
test_get_config_value() {
    local temp_file
    temp_file=$(mktemp)
    echo "db_host=127.0.0.1" > "$temp_file"
    echo "db_port=3306" >> "$temp_file"

    local result
    result=$(get_config_value "$temp_file" "db_host")
    rm -f "$temp_file"

    assert_equal "127.0.0.1" "$result"
}

execute_test "get_config_value retrieves correct value" test_get_config_value
```
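A runner needs to report totals and a meaningful exit code once all the `execute_test` calls have run. Here is a hypothetical footer sketch for that, using the same `TESTS_SUCCESS`/`TESTS_FAILED` counters (the stand-in counts at the bottom are just for demonstration):

```shell
#!/bin/bash
# Hypothetical runner footer: print totals after all execute_test calls
# and return nonzero when anything failed, so CI can gate on the result.
TESTS_SUCCESS="${TESTS_SUCCESS:-0}"
TESTS_FAILED="${TESTS_FAILED:-0}"

print_test_summary() {
    echo "----------------------------------------"
    echo "Passed: $TESTS_SUCCESS  Failed: $TESTS_FAILED"
    # Succeed only when nothing failed
    [[ "$TESTS_FAILED" -eq 0 ]]
}

# Demo with stand-in counts (real values come from execute_test)
TESTS_SUCCESS=2
TESTS_FAILED=1
print_test_summary || echo "suite failed"
```

In a real suite, the last line would be `print_test_summary; exit $?` so the script's exit status reflects the run.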
Integration Testing
```bash
test_full_deployment() {
    local env_name="test_$(date +%s)"

    setup_test_env "$env_name" || return 1

    DEPLOY_ENV="$env_name" ./deploy.sh || {
        cleanup_test_env "$env_name"
        return 1
    }

    if ! check_deployment_status "$env_name"; then
        cleanup_test_env "$env_name"
        return 1
    fi

    cleanup_test_env "$env_name"
    return 0
}
```
🚨 Robust Error Handling
The hallmark of a seasoned scripter is anticipating and managing failures.
Comprehensive Error Reporting
```bash
handle_failure() {
    local code=$?
    local line=$1
    local cmd="$2"
    # FUNCNAME[1] is the function that was executing when the trap fired
    local func="${FUNCNAME[1]:-main}"

    {
        echo "=== ERROR REPORT ==="
        echo "Script: $0"
        echo "Function: $func"
        echo "Line: $line"
        echo "Command: $cmd"
        echo "Exit Code: $code"
        echo "Timestamp: $(date)"
        echo "User: $(whoami)"
        echo "Directory: $(pwd)"
        echo "Environment: ${DEPLOY_ENV:-unknown}"
        echo -e "\nRecent Commands:"
        history | tail -n 5   # only populated in interactive shells
        echo -e "\nSystem Status:"
        uname -a
        echo "Storage:"
        df -h
        echo "Memory:"
        free -h
        echo "==================="
    } >&2

    # notify_monitoring is an alerting hook, assumed to be defined elsewhere
    notify_monitoring "Script Error" "Script $0 failed at line $line"
    exit "$code"
}

set -eE
trap 'handle_failure $LINENO "$BASH_COMMAND"' ERR
```
Retry with Exponential Backoff
```bash
retry_operation() {
    local attempts="$1"
    local initial_delay="$2"
    local max_delay="$3"
    shift 3

    local try=1
    local delay="$initial_delay"

    while [[ $try -le $attempts ]]; do
        echo "Try $try/$attempts: $*" >&2
        if "$@"; then
            echo "Success on try $try" >&2
            return 0
        fi

        if [[ $try -eq $attempts ]]; then
            echo "Failed after $attempts tries" >&2
            return 1
        fi

        echo "Retrying in $delay seconds..." >&2
        sleep "$delay"

        # Exponential backoff, capped at max_delay
        delay=$((delay * 2))
        [[ $delay -gt $max_delay ]] && delay=$max_delay

        # Subtract up to 20% jitter; guard against modulo-by-zero when delay < 5
        local jitter=$((delay / 5))
        if [[ $jitter -gt 0 ]]; then
            delay=$((delay - jitter + RANDOM % jitter))
        fi

        ((try++))
    done
}

# Usage
retry_operation 5 2 30 curl -f "https://api.example.com/status"
retry_operation 3 1 10 docker pull "$IMAGE_NAME"
```
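To see the retry loop work without hitting a real service, you can retry a function that fails a fixed number of times first. This toy sketch uses a trimmed copy of `retry_operation` (no jitter, delays shortened so it runs instantly); `flaky` and `CALLS` are made-up stand-ins for an unreliable external call:

```shell
#!/bin/bash
# Toy demo of the backoff loop. "flaky" stands in for an unreliable
# external call: it fails twice, then succeeds on the third attempt.
retry_operation() {
    local attempts="$1" initial_delay="$2" max_delay="$3"
    shift 3
    local try=1 delay="$initial_delay"
    while [[ $try -le $attempts ]]; do
        if "$@"; then
            echo "Success on try $try" >&2
            return 0
        fi
        if [[ $try -eq $attempts ]]; then
            echo "Failed after $attempts tries" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        [[ $delay -gt $max_delay ]] && delay=$max_delay
        ((try++))
    done
}

CALLS=0
flaky() {
    CALLS=$((CALLS + 1))
    [[ $CALLS -ge 3 ]]   # calls 1 and 2 fail, call 3 succeeds
}

retry_operation 5 0 1 flaky && echo "flaky eventually succeeded"
```

Note that `"$@"` executes a single simple command, so pipelines or redirections need to be wrapped in a function (as here) or in `bash -c '...'` before being handed to the retry helper.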
Circuit Breaker for Unreliable Services
```bash
declare -A circuit_states

check_circuit() {
    local svc="$1"
    local max_fails="${2:-5}"
    local timeout="${3:-300}"

    local state="${circuit_states[$svc]:-}"
    [[ -z "$state" ]] && return 1

    # State is stored as "fail_count:last_failure_epoch"
    local fails="${state%%:*}"
    local last_fail="${state#*:}"
    local now
    now=$(date +%s)

    # Reset the circuit once the cooldown window has passed
    if [[ $((now - last_fail)) -gt $timeout ]]; then
        unset 'circuit_states[$svc]'
        return 1
    fi

    [[ $fails -ge $max_fails ]]
}

log_failure() {
    local svc="$1"
    local state="${circuit_states[$svc]:-}"

    if [[ -z "$state" ]]; then
        circuit_states[$svc]="1:$(date +%s)"
    else
        local fails="${state%%:*}"
        circuit_states[$svc]="$((fails + 1)):$(date +%s)"
    fi
}

call_with_protection() {
    local svc="$1"
    shift

    if check_circuit "$svc"; then
        echo "Circuit open for $svc, skipping" >&2
        return 1
    fi

    if "$@"; then
        return 0
    else
        log_failure "$svc"
        return 1
    fi
}

# Usage
call_with_protection "auth_api" curl -f "https://auth.api.com/verify"
```
📊 Observability and Monitoring
Visibility is critical for debugging and performance tracking.
Structured Logging
```bash
declare -A log_metadata

set_log_metadata() {
    local key="$1"
    local value="$2"
    log_metadata["$key"]="$value"
}

log_structured() {
    local level="$1"
    local msg="$2"
    local ts
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # Build metadata as proper "key":"value" JSON pairs
    local meta=""
    for key in "${!log_metadata[@]}"; do
        meta+="\"$key\":\"${log_metadata[$key]}\","
    done

    # Note: messages and values are not JSON-escaped; keep them free of quotes
    echo "{\"timestamp\":\"$ts\",\"level\":\"$level\",\"message\":\"$msg\",\"script\":\"$0\",\"pid\":$$,\"metadata\":{${meta%,}}}" >&2
}

log_info() { log_structured "INFO" "$1"; }
log_warn() { log_structured "WARN" "$1"; }
log_error() { log_structured "ERROR" "$1"; }

# Usage
set_log_metadata "deploy_id" "$DEPLOY_ID"
set_log_metadata "env" "$DEPLOY_ENV"
log_info "Initiating deployment"
```
Metrics Tracking
```bash
record_metric() {
    local name="$1"
    local value="$2"
    local tags="$3"

    # Emit a StatsD/DogStatsD-style counter over UDP if nc is available
    if command -v nc >/dev/null; then
        echo "${name}:${value}|c|#${tags}" | nc -u -w1 localhost 8125
    fi
    echo "METRIC: $name=$value tags=$tags" >&2
}

time_task() {
    local task="$1"
    shift
    local start
    start=$(date +%s.%N)

    if "$@"; then
        local duration
        duration=$(echo "$(date +%s.%N) - $start" | bc)
        record_metric "task.duration" "$duration" "task=$task,status=success"
        return 0
    else
        local code=$?
        local duration
        duration=$(echo "$(date +%s.%N) - $start" | bc)
        record_metric "task.duration" "$duration" "task=$task,status=failure"
        return "$code"
    fi
}

# Usage
record_metric "deploy.start" "1" "env=$DEPLOY_ENV"
time_task "db_migration" perform_migration
record_metric "deploy.end" "1" "env=$DEPLOY_ENV"
```
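`time_task` leans on `bc` for sub-second math. On minimal hosts without `bc`, bash's built-in `SECONDS` counter is a dependency-free fallback; the resolution is whole seconds, which I assume is acceptable for deploy-sized tasks. A sketch (`time_task_seconds` is a hypothetical variant, not part of the code above):

```shell
#!/bin/bash
# bc-free timing via bash's built-in SECONDS counter.
# Whole-second resolution only, so use it for tasks measured in seconds.
time_task_seconds() {
    local task="$1"
    shift
    local start=$SECONDS
    local status=0
    "$@" || status=$?

    local duration=$((SECONDS - start))
    local outcome="success"
    if [[ $status -ne 0 ]]; then
        outcome="failure"
    fi

    echo "METRIC: task.duration=${duration}s tags=task=$task,status=$outcome" >&2
    return "$status"
}

time_task_seconds "quick_check" true
```

The `"$@" || status=$?` pattern keeps the task's exit code even under `set -e`, so the metric is recorded before the failure propagates.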
🏗️ Robust Script Template
Here’s my go-to template for production-ready scripts:
```bash
#!/bin/bash
# Production Script Framework
# Purpose: Describe script function
# Author: Your Name
# Version: 1.0
# Updated: YYYY-MM-DD

set -eEo pipefail
IFS=$'\n\t'

# Metadata
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_PID=$$

# Config
readonly CONFIG="${SCRIPT_DIR}/settings.env"
[[ -f "$CONFIG" ]] && source "$CONFIG"

# Temp files registered here are removed on exit
declare -a temp_files=()

# Logging
exec 3> >(logger -t "$SCRIPT_NAME")
readonly LOG_FD=3

log_message() {
    local level="$1"
    shift
    echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] [PID:$SCRIPT_PID] $*" >&$LOG_FD
    if [[ "$level" == "ERROR" ]]; then
        echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] [$level] $*" >&2
    fi
}
log_info() { log_message "INFO" "$@"; }
log_error() { log_message "ERROR" "$@"; }

# Cleanup
cleanup() {
    local code=$?
    log_info "Cleaning up"
    rm -f "${temp_files[@]}" 2>/dev/null || true
    log_info "Exiting with code $code"
    exit "$code"
}

# Error handling
handle_error() {
    local code=$?
    local line=$1
    local cmd="$2"
    log_error "Failed at line $line with code $code"
    log_error "Command: $cmd"
    log_error "Function: ${FUNCNAME[1]:-main}"
    # notify_monitoring is an alerting hook, assumed to be defined elsewhere
    notify_monitoring "Script Failure" "Script $SCRIPT_NAME failed"
    exit "$code"
}

trap cleanup EXIT
trap 'handle_error $LINENO "$BASH_COMMAND"' ERR

# Validation
check_requirements() {
    log_info "Checking requirements"

    local commands=("curl" "jq" "docker")
    for cmd in "${commands[@]}"; do
        if ! command -v "$cmd" >/dev/null; then
            log_error "Missing command: $cmd"
            exit 1
        fi
    done

    local vars=("DEPLOY_ENV" "AUTH_TOKEN")
    for var in "${vars[@]}"; do
        if [[ -z "${!var:-}" ]]; then
            log_error "Missing variable: $var"
            exit 1
        fi
    done

    log_info "Requirements verified"
}

# Main logic
main() {
    local action="${1:-deploy}"
    log_info "Starting $SCRIPT_NAME with action: $action"
    # set_log_metadata comes from the structured-logging helpers above
    set_log_metadata "action" "$action"

    check_requirements

    case "$action" in
        "deploy")
            run_deployment
            ;;
        "rollback")
            run_rollback
            ;;
        "health")
            check_status
            ;;
        *)
            log_error "Invalid action: $action"
            echo "Usage: $0 {deploy|rollback|health}" >&2
            exit 1
            ;;
    esac

    log_info "Execution completed"
}

run_deployment() {
    log_info "Deploying application"
    # Add deployment logic
}

run_rollback() {
    log_info "Rolling back application"
    # Add rollback logic
}

check_status() {
    log_info "Verifying application status"
    # Add health check logic
}

main "$@"
```
🎯 Key Takeaways
From a decade of experience, here are the critical lessons:
- Prioritize Security: Scripts handle sensitive data—secure them.
- Fail Clearly: Make errors obvious and immediate.
- Test Thoroughly: Untested scripts are untrustworthy.
- Monitor Actively: Logs and metrics are lifesavers.
- Prepare for Failure: Anticipate and mitigate issues.
These practices have saved my career multiple times. When I accidentally wiped the testing environment, comprehensive logging and error handling enabled recovery in 30 minutes instead of days.
The script that handled 75,000 deployments? It leveraged every technique here—testing eliminated bugs, monitoring pinpointed issues, and error handling ensured resilience.
🚀 Advance Your Bash Expertise
Production bash scripting is a vast domain, covering configuration, orchestration, and automation. To skip the painful lessons I endured, I’ve developed a detailed course on production-grade bash scripting.
The course covers:
- Secure handling of secrets and inputs
- Testing frameworks for reliability
- Error management to avoid outages
- Monitoring and logging best practices
- Real-world scenarios from my decade of experience
- Ready-to-use script templates
Avoid learning the hard way. Your future self—and your team—will appreciate it.
What’s your worst bash script disaster? Share your stories below—we’ve all got battle scars! 🔥