5 Automation Mistakes That Cost Me Weeks (And How to Avoid Them)
I've been running autonomous AI workers for months now. Bash scripts, systemd timers, cron jobs — the whole stack. Along the way, I made every mistake in the book. Some cost me hours. Others cost me weeks.
Here are the 5 most expensive ones, with real numbers.
Mistake #1: Automating Before Understanding the Process
The situation: I wrote a worker for a freelance platform, deployed it, and let it run. 923 executions later, it had produced exactly 0 successful applications.
The problem: I automated the mechanics (click here, fill that, submit) without understanding the logic (what makes a good application, what the platform expects, when to stop).
The fix: Run the process manually 10 times first. Document every step. Identify the decision points. Then automate.
Lesson: Automation amplifies speed, not understanding. If you don't understand the process, you'll just fail faster.
Mistake #2: Ignoring Rate Limits and Timeouts
The situation: My worker hit an API 1,200 times in 30 minutes. Got banned. Lost access for 48 hours.
The code that caused it:
# BAD: No rate limiting
for job in $(cat jobs.txt); do
    curl -s "https://api.platform.com/jobs/$job" >> results.json
done
The fix:
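# GOOD: Fixed delay between requests, plus a pause on HTTP 429
while IFS= read -r job; do
    # Save the body to a temp file and capture only the HTTP status code.
    # (Grepping results.json for "429" would also match job data and old responses.)
    status=$(curl -s -o response.tmp -w '%{http_code}' "https://api.platform.com/jobs/$job")
    if [ "$status" = "429" ]; then
        echo "Rate limited. Backing off 60s..." >&2
        sleep 60  # This job is skipped, not retried
    elif [ "$status" = "200" ]; then
        cat response.tmp >> results.json
    fi
    sleep 5  # Respect the platform
done < jobs.txt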
Lesson: Every API has limits. Read the docs. Add sleep. Handle 429 responses. Your worker should be a good citizen, not a DDoS attack.
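The fixed 60-second pause above can be generalized into a retry with exponential backoff. The following is a minimal sketch, not the code my workers actually run: `retry_with_backoff` and `fetch_job` are hypothetical helper names, and the output file name `job-<id>.json` is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command, doubling the delay after each failure.
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts" >&2
            return 1
        fi
        echo "Attempt $attempt failed. Retrying in ${delay}s..." >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Hypothetical fetch: succeeds only on HTTP 200, so a 429 counts as a failure
# and triggers a retry. The output file name is an assumption.
fetch_job() {
    local job="$1"
    local status
    status=$(curl -s -o "job-$job.json" -w '%{http_code}' "https://api.platform.com/jobs/$job")
    [ "$status" = "200" ]
}
```

Calling `retry_with_backoff 5 5 fetch_job "$job"` would wait 5, 10, 20, then 40 seconds between attempts before giving up. If the platform documents a `Retry-After` header, honoring that value is usually a better choice than a guessed delay.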
Mistake #3: No Logging Until It's Too Late
The situation: A worker ran for 2 weeks, failed silently every time, and I had no idea. No logs. No alerts. Just 0 results and a mystery.
The fix: Every worker needs structured logging from day one:
# koi-lib.sh: shared logging function
LOG_FILE="${LOG_FILE:-/tmp/koi-worker.log}"  # Assumed default path; override per worker

koi_log() {
    local level="$1"
    local message="$2"
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
    # Also log errors to stderr for systemd journal
    if [ "$level" = "ERROR" ]; then
        echo "[$timestamp] [$level] $message" >&2
    fi
}
# Usage:
koi_log "INFO" "Starting worker iteration"
koi_log "ERROR" "API returned 429 — backing off"
koi_log "SUCCESS" "Applied to job: $job_id"
Lesson: If you can't see it, you can't fix it. Log everything from the start. Review logs weekly.
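A weekly review is easier with a quick summary of how many entries each level produced. Here is a small sketch that assumes the `[timestamp] [LEVEL] message` line format written by `koi_log`; the function name `summarize_log` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log entries per level.
# Assumes lines look like: [2024-01-01 10:00:00] [INFO] some message
# With whitespace splitting, field 1 is the date, field 2 the time,
# and field 3 the bracketed level.
summarize_log() {
    local log_file="$1"
    awk '{ counts[$3]++ } END { for (level in counts) print level, counts[level] }' "$log_file" | sort
}
```

Running `summarize_log "$LOG_FILE"` prints one line per level with its count. A week with zero `[SUCCESS]` entries, or with no entries at all, is exactly the kind of silent failure this section is about.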
Mistake #4: One Mega-Script Instead of Small Workers
The situation: I had one 800-line bash script that did everything — searched jobs, wrote proposals, submitted applications, sent notifications, updated the database. When one part broke, everything broke.
The fix: Separate concerns into small, independent scripts:
workers/
├── koi-search.sh # Find opportunities (50 lines)
├── koi-propose.sh # Generate proposals (80 lines)
├── koi-submit.sh # Submit applications (40 lines)
├── koi-notify.sh # Send notifications (30 lines)
└── koi-lib.sh # Shared functions (100 lines)
Each script does one thing. Each can fail independently. Each can be tested, debugged, and restarted separately.
Lesson: Small scripts, small blast radius. When a 50-line script fails, you find the bug in minutes. When an 800-line script fails, you find it in hours.
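One way to see "each can fail independently" in practice is a thin runner that starts every worker as its own process and reports failures without aborting the rest. This is only a sketch under assumptions: the `run_worker` and `run_all` names and the `WORKER_DIR` default are hypothetical, and in a real setup each script could just as well have its own cron entry or systemd timer.

```shell
#!/usr/bin/env bash
# Hypothetical runner: each worker runs in its own process, so a crash in one
# does not take the others down.
WORKER_DIR="${WORKER_DIR:-./workers}"  # Assumed location of the worker scripts

run_worker() {
    local name="$1"
    if bash "$WORKER_DIR/$name"; then
        echo "OK: $name"
    else
        # In the else branch, $? still holds the worker's exit code
        echo "FAILED: $name (exit code $?)" >&2
        return 1
    fi
}

run_all() {
    local failed=0
    local name
    for name in "$@"; do
        run_worker "$name" || failed=$((failed + 1))
    done
    # Succeed only if every worker succeeded
    [ "$failed" -eq 0 ]
}
```

For example, `run_all koi-search.sh koi-notify.sh` runs both scripts and returns non-zero if either failed. Stages that depend on each other's output would still need an explicit check before the next stage starts.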
Mistake #5: No Kill Switch
The situation: A worker got stuck in a loop, making the same API call 4,000 times. I only noticed when I got an "unusual activity" email from the platform.
The fix: Every worker needs a kill switch:
# At the start of every worker
KILL_SWITCH="/tmp/worker-$(basename "$0" .sh).kill"

check_kill_switch() {
    if [ -f "$KILL_SWITCH" ]; then
        koi_log "WARN" "Kill switch detected. Exiting."
        exit 0
    fi
}

# Check every iteration
while true; do
    check_kill_switch
    # ... do work ...
done

# To kill from anywhere (here, a worker whose script is openwork.sh):
# touch /tmp/worker-openwork.kill
Also add execution limits:
# Max 100 iterations per run
MAX_ITERATIONS=100
iteration=0

while [ "$iteration" -lt "$MAX_ITERATIONS" ]; do
    iteration=$((iteration + 1))
    # ... do work ...
done

koi_log "INFO" "Reached max iterations ($MAX_ITERATIONS). Stopping."
Lesson: Autonomous doesn't mean uncontrollable. Always have a way to stop a worker instantly. Always limit how much damage a runaway script can do.
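The two safeguards work best in the same loop, so a worker stops on whichever trips first. A minimal sketch of that combination follows; the `run_guarded_loop` and `do_work` names and the default kill-switch path are hypothetical placeholders, and it prints with `echo` where a real worker would call `koi_log`.

```shell
#!/usr/bin/env bash
# Sketch: one loop guarded by both a kill-switch file and an iteration cap.
KILL_SWITCH="${KILL_SWITCH:-/tmp/worker-example.kill}"  # Hypothetical path
MAX_ITERATIONS="${MAX_ITERATIONS:-100}"

# Placeholder for the real task; receives the iteration number
do_work() {
    :
}

run_guarded_loop() {
    local iteration=0
    while [ "$iteration" -lt "$MAX_ITERATIONS" ]; do
        # Check the kill switch before starting each unit of work
        if [ -f "$KILL_SWITCH" ]; then
            echo "Kill switch detected after $iteration iterations. Exiting."
            return 0
        fi
        iteration=$((iteration + 1))
        do_work "$iteration"
    done
    echo "Reached max iterations ($MAX_ITERATIONS). Stopping."
}
```

Note that the kill-switch file stays in place after the worker exits, so the next scheduled run stops immediately too. That is usually what you want during an incident, but remember to delete the file once the problem is fixed.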
The Results After Fixing Everything
| Metric | Before | After |
|---|---|---|
| Silent failures | 90% | ~5% |
| API bans | 3 in first month | 0 |
| Debug time per issue | 2-4 hours | 15-30 min |
| Worker uptime | 60% | 95% |
| Successful actions | 0 | 3 (first month) |
The numbers are still small. But the system is reliable. And reliability compounds.
TL;DR
- Understand before automating — Manual first, then script
- Respect rate limits — Sleep, backoff, handle 429s
- Log everything — From day one, not after the first failure
- Small scripts — One job per file, small blast radius
- Kill switches — Always be able to stop instantly
Building autonomous systems is a marathon, not a sprint. The goal isn't to automate everything — it's to automate the right things reliably.
If you're working on similar projects, I share everything openly. The code, the mistakes, the numbers. Feel free to connect.