TL;DR: Migrating a long-running cron-based scraper to an actor architecture does not require a rewrite. It requires six structural changes applied in order. Each one is independently shippable. Each one moves the scraper closer to a state where infrastructure is no longer your problem. We migrated our interview transcription pipeline this way over four iterations. Here's the order I'd run it in again.
I've watched too many migrations fail because someone said "let's rewrite this as an actor" and treated it as a from-scratch project. They burn a sprint, miss edge cases the original handled, ship something that works in dev and breaks under real input, and end up reverting.
The pipeline doesn't need a rewrite. It needs surgery.
The six-step migration order
Each step is a PR. Each step ships independently. The cron job keeps running until step 6.
Step 1 — Extract input from the script body
Find the hardcoded list of URLs / config values / paths in your script. Move them to a JSON config file. The script reads its values from the config, so the same script can now be run against any input.
# Before:
URLS = ["https://...", "https://..."]
OUTPUT_PATH = "/var/data/output.csv"
# After:
import json, sys
config = json.load(sys.stdin)
urls = config["urls"]
output_path = config["output_path"]
The cron now does cat config.json | python script.py instead of python script.py. Behaviour identical. Surface area changed.
Why first: every later step depends on the script taking one structured input with a known shape. Doing this first means everything that follows operates on that same shape.
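If you want the input shape enforced rather than implied, a small loader can validate the config before any work starts. This is a minimal sketch, assuming the two keys from the example above; the ScraperConfig class and load_config function are hypothetical names, not part of the original script.

```python
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical typed view of the JSON config from step 1."""
    urls: list[str]
    output_path: str


def load_config(stream) -> ScraperConfig:
    """Parse JSON from a file-like object and fail early on a bad shape."""
    raw = json.load(stream)
    urls = raw["urls"]                # a missing key raises KeyError
    output_path = raw["output_path"]
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")
    if not isinstance(output_path, str):
        raise ValueError("'output_path' must be a string")
    return ScraperConfig(urls=urls, output_path=output_path)


# In the script this would be load_config(sys.stdin); StringIO stands in here.
example = io.StringIO('{"urls": ["https://example.com/a"], "output_path": "out.csv"}')
config = load_config(example)
print(config.urls, config.output_path)
```

Failing at load time keeps a malformed config from surfacing as a confusing error halfway through a run.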
Step 2 — Replace ad-hoc output with a structured dataset
Instead of writing rows directly to a CSV, push them to a function that wraps the persistence layer:
def push_record(record):
    # Today, this writes to a CSV.
    write_csv_row(output_path, record)
    # Tomorrow, this writes to Apify Dataset, S3, BigQuery, whatever.
Same data shape, abstracted writer. The cron still produces a CSV. Step 6 swaps the writer.
Why second: schema changes are easier when there's one place that knows about the shape.
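One way to fill in write_csv_row and keep the writer swappable is sketched below. The make_push_record helper and the field names in the usage lines are assumptions for illustration; the point is only that callers see push_record(record) and nothing else.

```python
import csv
import os
import tempfile


def write_csv_row(path, record):
    """Append one dict as a CSV row, writing the header if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(record.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(record)


def make_push_record(writer, destination):
    """Bind a writer function to a destination; callers only see push_record(record)."""
    def push_record(record):
        writer(destination, record)
    return push_record


# Usage with a throwaway file; the field names are made up.
output_path = os.path.join(tempfile.mkdtemp(), "output.csv")
push_record = make_push_record(write_csv_row, output_path)
push_record({"url": "https://example.com/a", "duration_s": 312})
push_record({"url": "https://example.com/b", "duration_s": 95})
```

Swapping persistence later means passing a different writer to make_push_record; the scraping code that calls push_record does not change.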
Step 3 — Replace bare try/except with structured failures
Audit every try/except. If it swallows the exception, replace with explicit logging and a failure record:
# Before:
for url in urls:
    try:
        record = process(url)
    except Exception:
        pass
# After:
for url in urls:
    try:
        record = process(url)
    except Exception as e:
        # Record the failure as data, then move on to the next URL.
        push_failure({"url": url, "error": str(e), "type": type(e).__name__})
        continue
Now failures are first-class data. Same rows of work; the bad ones go to a different file (or dataset) instead of vanishing.
Why third: this is the step where you stop losing data silently. Every later step assumes failures are visible.
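For a sense of what push_failure might look like, here is a minimal sketch that appends each failure as a JSON line to a separate file. The file location, the stand-in process function, and the example URLs are all assumptions made for the demonstration.

```python
import json
import os
import tempfile

# Hypothetical location; in the real script this would sit next to the output.
failures_path = os.path.join(tempfile.mkdtemp(), "failures.jsonl")


def push_failure(failure):
    """Append one failure record as a JSON line."""
    with open(failures_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(failure) + "\n")


def process(url):
    """Stand-in for the real work; fails on one URL to show the failure path."""
    if "bad" in url:
        raise ValueError("could not parse page")
    return {"url": url, "status": "ok"}


records = []
for url in ["https://example.com/good", "https://example.com/bad"]:
    try:
        records.append(process(url))
    except Exception as e:
        push_failure({"url": url, "error": str(e), "type": type(e).__name__})
        continue

# Read the failures back to confirm nothing vanished.
with open(failures_path, encoding="utf-8") as f:
    failures = [json.loads(line) for line in f]
print(len(records), failures)
```

Because failures carry the URL and the exception type, a later run can retry only the rows that failed.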
Step 4 — Replace print() with structured logging
# Before:
print(f"Processing {url}")
# After:
import logging
log = logging.getLogger("transcribe")
log.info("processing", extra={"url": url, "stage": "transcribe"})
Use a logging library that supports structured fields. (Python: structlog, loguru, or logging with a JSON formatter.)
Why fourth: logs are what step 6 will be reading. They need shape.
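With the standard library alone, fields passed via extra only show up in the output if the formatter emits them. The sketch below is one way to do that with a small JSON formatter; the JsonFormatter class is a hypothetical helper, and the in-memory stream stands in for stderr so the result can be inspected.

```python
import io
import json
import logging

# Attributes every LogRecord carries by default; anything else came in via `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, including `extra` fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update({k: v for k, v in vars(record).items() if k not in _STANDARD_ATTRS})
        return json.dumps(payload, default=str)


stream = io.StringIO()  # stands in for stderr so the example can inspect its output
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("transcribe")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

log.info("processing", extra={"url": "https://example.com/a", "stage": "transcribe"})
line = json.loads(stream.getvalue())
print(line)
```

One JSON object per line is a shape most log collectors can parse without custom rules, which is what makes the fields searchable later.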
Step 5 — Containerise
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "script.py"]
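The cron now runs cat config.json | docker run -i -v /var/data:/var/data my-scraper. The -i flag keeps stdin open so the script still receives its config, and the volume mount lets the CSV land on the host as before. Same input/output. Containerised.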
Why fifth: a container image is portable. Step 6 needs it; nothing earlier did.
Step 6 — Swap the runtime
Now you point the container at an actor runtime: Apify, Kubernetes CronJob, Cloud Run, whatever. The container is the same. The cron entry is gone. Scheduling, retries, logging, and persistence are now provided by the runtime.
This is the step that takes a day. Steps 1–5 might take 2–3 days each. The point of doing them first is that step 6, the one most teams treat as the whole migration, is small and reversible by the time you reach it.
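Part of what keeps this step small is that the writer behind push_record can be chosen at startup instead of edited in code. The following is a rough sketch of that idea, not any runtime's API: the SCRAPER_WRITER variable name, the select_writer helper, and both writers are hypothetical.

```python
import json
import sys


def write_jsonl_stdout(record):
    """Emit one record per line on stdout, where a runtime can collect it."""
    sys.stdout.write(json.dumps(record) + "\n")


def make_list_writer(sink):
    """Writer that appends to an in-memory list; handy for tests."""
    def write(record):
        sink.append(record)
    return write


def select_writer(env, writers, default="stdout"):
    """Pick a writer by name from an environment-style mapping."""
    name = env.get("SCRAPER_WRITER", default)  # hypothetical variable name
    try:
        return writers[name]
    except KeyError:
        raise ValueError(f"unknown writer {name!r}; expected one of {sorted(writers)}") from None


collected = []
writers = {"stdout": write_jsonl_stdout, "memory": make_list_writer(collected)}

# In the container you would pass os.environ; a plain dict stands in here.
push_record = select_writer({"SCRAPER_WRITER": "memory"}, writers)
push_record({"url": "https://example.com/a"})
print(collected)
```

Reverting is then a configuration change: set the variable back and the same image writes where it did before.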
Why this order
Each step is independently valuable even if you stop. After step 1 you have a parameterised script — useful for ad-hoc runs. After step 2 you can change persistence. After step 3 you stop losing data. After step 4 you can debug remotely. After step 5 you can deploy anywhere. After step 6 you have an actor.
If a stakeholder asks why the migration is taking so long, you can point at the improvements already running in production after any step. There is no "we're 60% done with the rewrite, it's not running yet" phase.
Result
The interview transcription actor went through this migration over four months, one step at a time, while running in production the entire time. Pre-migration: ad-hoc cron, 18% silent-failure rate, mean time to detect issues ~24 hours. Post-migration: actor with retries and structured logging, 1.2% failure rate (and the failures are visible), mean time to detect <30 minutes.
Total team-hours: roughly 60. Spread across four iterations. Compare to the rewrites I've seen go sideways: typically 80–120 hours and a stalled cutover.
When this is wrong
Two cases where a rewrite genuinely beats the migration playbook:
- The original script is very small (under 100 lines, single function). At that scale, the migration steps cost as much as a rewrite, and the rewrite gives you a cleaner result.
- The original is in a language your team doesn't maintain (a Perl script you inherited, a Bash pipeline). At some point the cost of step 1 alone exceeds the rewrite cost.
Otherwise: surgery, not rewrite. We packaged this six-step migration as a checklist that we apply to every legacy scraper we inherit at the start of an engagement. It has the same shape we used to rebuild the interview transcription actor.
Where in the six steps is your current scraper? Drop the answer — I'll point at the next change that buys you the most.
Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
