Every project I start has the same 20 minutes of setup.
Write the HTTP client with retry logic. Set up the file watcher. Write the CSV parser. Build the rate limiter. Wire up the scheduler. I know exactly how to do all of it — I've done it fifty times. But I still write it from scratch every time, because I never had a clean canonical version I trusted enough to copy.
This year I stopped doing that. I went back through my last 2 years of projects and pulled out every script that (a) I'd rebuilt more than twice and (b) had survived production without modifications. I got 25 scripts. I cleaned them up, documented them properly, and packaged them.
Here's what's in there and why I made the choices I made.
What "production-ready" means in practice
Tutorial scripts have three failure modes:
- No error handling — they work until they don't, with no useful error output
- Hardcoded config — you have to edit the file to change any parameter
- No CLI — you can only run them from another Python script, not from a shell or scheduler
Every script in the cookbook fixes all three. They use argparse for CLI, handle errors with useful exit codes and messages, and are completely self-contained — no shared state, no internal imports required.
The goal: you should be able to drop any script into a new project, read it in 5 minutes, and run it. Then adapt it if you need to.
The pattern I found across all 25 scripts
When I went back through 2 years of projects to pull the scripts worth keeping, I expected to filter by "does this work." That turned out to be the wrong filter. Plenty of scripts worked. What I actually filtered on was: has this been reused without modification?
The scripts that survived that filter shared four properties. None of them are surprising in isolation. What surprised me was how consistently tutorials and quick solutions violate all four.
1. They fail loudly with useful exit codes and error messages, not silently.
A script that swallows an exception and exits 0 is worse than one that crashes — at least the crash tells you something broke. The production-worthy version catches exceptions at the boundary, logs a message that includes the error type and the value that caused it, and exits with a non-zero code. A cron job that returns exit 1 triggers an alert. One that returns exit 0 silently drops your data with no indication anything went wrong.
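A minimal sketch of that boundary pattern (the names and the work being done are illustrative, not code from the cookbook):

```python
import logging

logger = logging.getLogger(__name__)

def count_lines(path):
    # Core logic: let exceptions propagate up to the boundary
    with open(path) as f:
        return sum(1 for _ in f)

def main(argv):
    try:
        total = count_lines(argv[1])
        logger.info("Processed %d lines", total)
        return 0
    except Exception as e:
        # Log the error type and the offending value, then fail loudly
        logger.error("%s: %s", type(e).__name__, e)
        return 1
```

Wire it up with `sys.exit(main(sys.argv))` under a `__main__` guard and a failed run becomes exit code 1 — something cron and CI can actually see.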
2. Every configurable value is a CLI arg with a documented default, never hardcoded.
Hardcoded values look like shortcuts when you're writing the script and look like landmines six months later when someone runs it in a slightly different context. File paths, timeouts, retry counts, output directories — all of these appear as argparse arguments with sensible defaults. If you need to run the same script against a different endpoint or with a longer timeout, you pass a flag; you don't edit the file.
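In practice that's a few lines of argparse. A sketch with hypothetical flags (not the cookbook's actual argument set) — note `%(default)s`, which makes every default show up in `--help`:

```python
import argparse

def build_parser():
    # Every tunable is a flag with a documented default — nothing hardcoded
    p = argparse.ArgumentParser(description="Example script: all config via flags")
    p.add_argument("--url", default="https://api.example.com",
                   help="endpoint to call (default: %(default)s)")
    p.add_argument("--timeout", type=float, default=30.0,
                   help="request timeout in seconds (default: %(default)s)")
    p.add_argument("--retries", type=int, default=3,
                   help="retry attempts (default: %(default)s)")
    p.add_argument("--out-dir", default="./out",
                   help="output directory (default: %(default)s)")
    return p
```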
3. They're idempotent — running them twice produces the same result.
This matters most for scripts run by schedulers or CI pipelines, where the same script might run twice due to a retry or a race condition. A script that creates a file should check if it already exists. A script that inserts records should handle duplicates. A script that syncs a directory should be a no-op if the directories are already in sync. Idempotence is not free, but the cost of writing it once is much lower than the cost of debugging a corruption caused by a double-run at 3 AM.
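What idempotence looks like for the "creates a file" case — a generic sketch of the idea, not code from the cookbook:

```python
import os

def ensure_file(path, content):
    # Idempotent write: a second run with the same content is a no-op
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False  # already in the desired state — do nothing
    with open(path, "w") as f:
        f.write(content)
    return True  # state was changed
```

The return value tells the caller whether anything actually happened, which is useful for logging "ran, no changes" versus "ran, wrote output".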
4. They do one thing and expose a clean interface.
The scripts that didn't survive tended to grow. Start as a file downloader, acquire some transformation logic, pick up a notification step, accumulate a few flags that change the behavior in incompatible ways. A script doing four things is four things you can't independently test, reuse, or replace. The ones worth keeping are narrow. The composition of narrow scripts produces complex behavior; the script itself stays simple.
Tutorial code violates all four of these constantly — not because the authors are wrong, but because tutorials optimize for clarity at the moment of reading, not for reliability at the moment of production failure. AI-generated scripts have the same problem: they optimize for plausible-looking output, not for what happens when the rate limiter fires at 2 AM and your CI has no one watching it. Every script in this cookbook was debugged in production before it ended up here. None of them are generated.
4 scripts worth showing in detail
1. HTTP client with retry logic and rate limiting
The problem: most requests-based code handles the happy path and little else. It doesn't handle rate limits (429), transient failures (503), or connection timeouts gracefully. You find out when something breaks at 2 AM.
What the script does:
- Configurable backoff (exponential with jitter, adjustable base/max/multiplier)
- Handles 429 with Retry-After header parsing — waits the correct amount, not a fixed sleep
- Distinguishes retryable errors (5xx, timeout, connection error) from permanent ones (4xx except 429)
- Session reuse for performance
- Structured logging of every retry attempt with reason
CLI: python http_client.py --url https://api.example.com/endpoint --retries 5 --timeout 30
```python
# Core retry logic — simplified excerpt
import logging
import random
import time

import requests

logger = logging.getLogger(__name__)

def request_with_retry(url, method="GET", max_retries=3, backoff_base=1.0, **kwargs):
    session = requests.Session()
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            response = session.request(method, url, **kwargs)
            if response.status_code == 429:
                # Honor Retry-After when the server sends it; otherwise back off exponentially
                retry_after = int(response.headers.get("Retry-After", backoff_base * (2 ** attempt)))
                logger.warning(f"Rate limited. Waiting {retry_after}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(retry_after)
                continue
            if response.status_code >= 500:
                wait = backoff_base * (2 ** attempt) + random.uniform(0, 1)
                logger.warning(f"Server error {response.status_code}. Retrying in {wait:.1f}s")
                time.sleep(wait)
                continue
            response.raise_for_status()
            return response
        except (requests.Timeout, requests.ConnectionError) as e:
            last_error = e
            if attempt == max_retries:
                break  # out of attempts — don't sleep before raising
            wait = backoff_base * (2 ** attempt) + random.uniform(0, 1)
            logger.warning(f"Connection error: {e}. Retrying in {wait:.1f}s")
            time.sleep(wait)
    raise last_error or requests.RequestException(f"Max retries exceeded for {url}")
```
2. File watcher with debouncing
The problem: watchdog is great but every implementation I've seen either (a) fires duplicate events on a single save or (b) doesn't handle the "file replaced" pattern that editors like vim use.
What the script does:
- Debounces file events with configurable delay (default 500ms) — one callback per logical change, not per filesystem event
- Handles create/modify/delete/move events separately with distinct callbacks
- Ignores temp files and swap files by default (configurable patterns)
- Recursive or non-recursive watching
- Graceful shutdown on SIGINT/SIGTERM
```python
# Debouncing logic — simplified excerpt
import re
import threading

from watchdog.events import FileSystemEventHandler

class DebouncedHandler(FileSystemEventHandler):
    def __init__(self, callback, debounce_ms=500, ignore_patterns=None):
        self.callback = callback
        self.debounce_delay = debounce_ms / 1000
        # re.match anchors at the start of the path, so patterns need a leading .*
        self.ignore_patterns = ignore_patterns or [r'.*\.swp$', r'.*\.tmp$', r'.*~$']
        self._timers = {}

    def _should_ignore(self, path):
        return any(re.match(p, path) for p in self.ignore_patterns)

    def on_modified(self, event):
        if event.is_directory or self._should_ignore(event.src_path):
            return
        self._schedule_callback(event.src_path, event)

    def _schedule_callback(self, path, event):
        # Restart the clock: only the last event in a burst survives the delay
        if path in self._timers:
            self._timers[path].cancel()
        timer = threading.Timer(self.debounce_delay, self._fire, args=[path, event])
        self._timers[path] = timer
        timer.start()

    def _fire(self, path, event):
        self._timers.pop(path, None)  # drop the finished timer so the dict doesn't grow
        self.callback(event)
```
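The debounce idea itself has nothing to do with watchdog — it's just "restart a timer on every call; only the last call in a burst fires." Stripped down to a generic stdlib sketch (illustrative, not from the cookbook):

```python
import threading

class Debouncer:
    """Collapse a burst of calls into one, firing after a quiet period."""

    def __init__(self, delay_s, fn):
        self.delay_s = delay_s
        self.fn = fn
        self._timer = None
        self._lock = threading.Lock()

    def call(self, *args):
        with self._lock:
            # Cancel the pending timer, if any, and restart the clock
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay_s, self.fn, args=args)
            self._timer.start()
```

Five rapid calls produce one callback, with the arguments of the last call — exactly the behavior you want when an editor fires four filesystem events for one save.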
3. Paginated API collector
The problem: every API paginates differently. Some use page/per_page, some use offset/limit, some use cursor tokens, some use next_url in the response. Writing a collector for each one is tedious and error-prone.
What the script does:
- Supports four pagination strategies: page-based, offset-based, cursor-based, and next-URL-based
- Configurable via JSON schema describing the API's pagination parameters
- Yields records as they arrive (generator) — works on APIs returning millions of records
- Rate limiting between requests
- Configurable stop conditions (max records, max pages, or custom predicate)
CLI: python api_collector.py --config api_config.json --output records.jsonl
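The generator shape is the important part. A sketch of the cursor-based strategy with the page-fetching function injected (illustrative — the real script drives this from the JSON config instead):

```python
def collect_cursor_paginated(fetch_page, max_records=None):
    """Yield records from a cursor-paginated API one at a time.

    fetch_page(cursor) must return (records, next_cursor);
    next_cursor is None when there are no more pages.
    """
    cursor = None
    count = 0
    while True:
        records, cursor = fetch_page(cursor)
        for rec in records:
            yield rec
            count += 1
            if max_records is not None and count >= max_records:
                return  # stop condition hit mid-page
        if cursor is None:
            return  # no more pages
```

Because it yields as it goes, memory stays flat no matter how many records the API returns.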
4. Structured CSV/JSON normalizer
The problem: real-world data is inconsistent. Column names vary across exports. Types need coercion. Nested JSON needs flattening. Nulls are represented as empty strings, "NULL", "N/A", "none", or actual nulls depending on the source.
What the script does:
- Configurable field mapping (rename columns via config file)
- Type coercion with null handling (configurable null representations)
- JSON column flattening with configurable depth limit
- Duplicate detection and deduplication
- Validation with configurable error handling (skip / raise / log)
- Output to CSV, JSONL, or SQLite
CLI: python normalizer.py --input data.csv --config schema.json --output clean.jsonl --on-error log
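The null-handling piece is the part I reach for most often. A minimal sketch of the idea (the token list and function names are illustrative, not the script's actual API):

```python
NULL_TOKENS = {"", "null", "n/a", "none", "na"}  # configurable in the real script

def coerce(value, to_type=float, null_tokens=NULL_TOKENS):
    # Normalize the many spellings of "no value" before coercing the type
    if value is None or str(value).strip().lower() in null_tokens:
        return None
    return to_type(value)
```

Comparing case-insensitively after stripping whitespace catches `"N/A "`, `"NULL"`, and `"None"` with one token set instead of an ever-growing if-chain.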
Full list of 25 scripts
HTTP & APIs (6)
- http_client.py — retry logic, rate limiting, session reuse
- api_collector.py — paginated API collection (4 pagination strategies)
- webhook_sender.py — send webhooks with HMAC signing, retry, and delivery tracking
- webhook_receiver.py — receive and validate webhooks (Flask-based, signature verification)
- oauth_client.py — OAuth 2.0 client with token refresh
- graphql_client.py — GraphQL query client with variable substitution
File Operations (5)
- file_watcher.py — file system monitoring with debouncing
- batch_processor.py — process files in a directory with parallelism control
- file_sync.py — sync two directories (local or S3), with diff-only mode
- archive_manager.py — compress/decompress with progress, integrity checking
- log_rotator.py — rotate and compress log files with configurable retention
Data (5)
- normalizer.py — CSV/JSON normalization with type coercion and mapping
- deduplicator.py — deduplicate records by configurable key fields
- csv_merger.py — merge multiple CSV files with conflict resolution
- json_flattener.py — flatten nested JSON to tabular structure
- schema_validator.py — validate records against JSON Schema with detailed error output
Scheduling & Jobs (4)
- cron_wrapper.py — wrap any command as a cron job with logging, alerting, and lock files
- task_queue.py — simple file-based task queue (no Redis required)
- retry_runner.py — run a command with retry logic and configurable backoff
- job_scheduler.py — schedule jobs with APScheduler, persistent state
System & Cloud (5)
- db_backup.py — backup PostgreSQL/MySQL/SQLite with rotation and S3 upload
- s3_sync.py — sync local directory to S3 with change detection
- process_monitor.py — monitor a process, restart if dead, alert on anomalies
- ssl_checker.py — check SSL certificate expiry across a list of domains
- docker_cleanup.py — prune stopped containers, dangling images, unused volumes
Scripts I almost cut (and why they survived)
The hardest part of building the cookbook was the cutting phase. I had more than 40 scripts that met the "rebuilt twice, survived production" bar. Getting down to 25 meant making the case for every one that stayed.
A few were hard to cut because their value only shows up in edge cases — the cases you only hit once, at the worst possible time.
cron_wrapper.py was almost cut because every team already has cron. The counterargument came from a near-miss. A data processing job had been running nightly for three months without issues. One night the job took longer than expected and the next scheduled run started before the first one finished. Both instances wrote to the same output file, the second one truncated it mid-write, and the first one finished writing to a file that was now shorter than it started. The corruption was silent — the file was valid CSV, just missing the last third of the records. We found it four days later when a downstream report came up short. cron_wrapper.py uses a lock file to prevent concurrent execution and alerts on lock contention. The lock behavior isn't useful 99% of the time. It's essential the one time it would have mattered.
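One way to get that lock-file behavior from the standard library — a sketch of the idea, not cron_wrapper.py's actual code, and POSIX-only since it relies on flock:

```python
import fcntl
import sys

def run_exclusive(lock_path, fn):
    """Run fn only if no other instance holds the lock file."""
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking exclusive lock; raises if another instance holds it
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("Another instance is running; skipping.", file=sys.stderr)
            return 1  # non-zero so the scheduler can alert on lock contention
        return fn() or 0
```

flock locks are released automatically when the process exits, so a crashed run can't leave a stale lock behind — a meaningful advantage over "check if a pidfile exists."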
process_monitor.py was almost cut because modern infrastructure uses systemd for process management. The counterargument: systemd manages services. process_monitor.py manages scripts that aren't services — a nightly batch job, a file processor that runs on demand, a data sync that needs to run continuously but doesn't warrant a full service definition. For these, you want restart-on-failure and anomaly alerting (CPU spike, memory leak, unusual runtime) without the overhead of writing and maintaining a systemd unit file. The script fills that gap and has been used on three different projects specifically for batch jobs that grew beyond what a bare cron job could handle.
retry_runner.py was almost cut because it seemed redundant with http_client.py. The distinction: http_client.py handles retry logic for HTTP requests specifically. retry_runner.py handles retry logic for arbitrary shell commands — a database migration that occasionally fails on the first attempt due to connection pool exhaustion, a file upload to a flaky external service, a test suite that reliably passes on the second run when given a few seconds after the first. It wraps any command, not just HTTP calls, and that generality turned out to matter.
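The core of that generality is small. A sketch of retrying an arbitrary command with exponential backoff (illustrative — the real script adds logging and jitter):

```python
import subprocess
import time

def run_with_retry(cmd, max_attempts=3, backoff_base=1.0):
    """Run a command, retrying on non-zero exit; return the final exit code."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return 0
        if attempt < max_attempts:
            # Exponential backoff between attempts: base, 2*base, 4*base, ...
            time.sleep(backoff_base * (2 ** (attempt - 1)))
    return result.returncode
```

Because the unit of retry is "any exit code," the same loop covers a flaky migration, an upload, or a test suite without knowing anything about what the command does.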
How to adapt these to your codebase
These are starting points, not drop-in libraries. The distinction matters because "drop-in library" implies a black box — you configure it, call it, and don't need to understand what's inside. That's the wrong model for scripts you're going to run in production on your own infrastructure. You should understand what they do before you trust them.
Every script in the cookbook is under 200 lines, with clear section comments that separate the argument parsing, the core logic, and the error handling. The structure is intentional: it makes the scripts easy to read in one sitting, and easy to find the specific part you want to change.
The most common adaptation pattern: copy the script, read it once to understand the flow, find the 5-10 lines that implement the specific behavior you need, strip everything else, extend from there. You're not supposed to use the full script as-is in every case. You're supposed to extract the part that solves your problem and build on it.
A concrete example: the retry logic in http_client.py — the exponential backoff with jitter, the 429 handling with Retry-After header parsing, the distinction between retryable and permanent errors — has been extracted and adapted for at least three different internal tools this year. One was a Slack API client that needed to handle rate limits differently from the default. One was a webhook sender for an API that returned 503 during deployments and needed more aggressive backoff. One was an internal data pipeline that called a rate-limited external service with custom authentication. The core retry logic was the same in all three; the adaptation was in the error classification and the authentication layer. Starting from the script saved 30-60 minutes of writing and debugging the retry edge cases from scratch each time.
Why one-time price
Most tools that help you automate things are SaaS. Monthly subscription, usage limits, API keys you don't control. For a set of scripts you run on your own infrastructure, that model doesn't make sense.
The cookbook is $39 one-time at https://kazdispatch.gumroad.com/l/hydbq. You get the scripts, no subscription, no usage tracking.
30-day refund if it doesn't deliver value. No questions asked.
What's the automation you keep rewriting? Leave it in the comments — file operations, API calls, scheduling, data pipeline boilerplate — and I'll either point you to the script that covers it or admit it's not in there. If three people ask for the same thing, I'll write it. (#discuss)
(Part 2 covers how to chain these scripts together with prompt_chainer.py — building multi-step automation pipelines that use AI to handle the decisions.)
If you want to know when Part 2 drops — and get the free conftest.py template I use on every Python project — sign up here. No spam, just the occasional Python automation post.