SIÁN Agency

Posted on Jun 22 • Originally published at apify.com

Migration Playbook: Cron Script to Actor. Six Steps, No Rewrites.

#architecture #automation #tutorial #webscraping

TL;DR: Migrating a long-running cron-based scraper to an actor architecture does not require a rewrite. It requires six structural changes applied in order. Each one is independently shippable, and each one moves the scraper closer to a state where infrastructure is no longer your problem. We migrated our interview transcription pipeline this way over four iterations. Here's the order I'd follow if I ran it again.

I've watched too many migrations fail because someone said "let's rewrite this as an actor" and treated it as a from-scratch project. They burn a sprint, miss edge cases the original handled, ship something that works in dev and breaks under real input, and end up reverting.

The pipeline doesn't need a rewrite. It needs surgery.

The six-step migration order

Each step is a PR. Each step ships independently. The cron job keeps running until step 6.

Step 1 — Extract input from the script body

Find the hardcoded list of URLs / config values / paths in your script. Move them to a JSON config file. The script reads from the config; the config is parameterised.

# Before:
URLS = ["https://...", "https://..."]
OUTPUT_PATH = "/var/data/output.csv"

# After:
import json, sys
config = json.load(sys.stdin)
urls = config["urls"]
output_path = config["output_path"]

The cron entry now runs `cat config.json | python script.py` instead of `python script.py`. Behaviour is identical; only the script's interface has changed.

Why first: every later step depends on having a typed input. Doing this first means everything that follows operates on the same shape.
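To make "typed input" concrete, here is a minimal sketch that parses the same JSON into a small dataclass and fails fast on a bad shape. The `ScraperConfig` and `load_config` names are hypothetical, and the two fields simply mirror the example config above; the in-memory stream stands in for `sys.stdin`.

```python
import io
import json
from dataclasses import dataclass
from typing import TextIO


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical typed view of config.json; fields mirror the example above."""
    urls: list[str]
    output_path: str


def load_config(stream: TextIO) -> ScraperConfig:
    """Parse JSON from a stream (sys.stdin in production) and reject bad shapes."""
    raw = json.load(stream)
    urls = raw["urls"]
    output_path = raw["output_path"]
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")
    if not isinstance(output_path, str):
        raise ValueError("'output_path' must be a string")
    return ScraperConfig(urls=urls, output_path=output_path)


# Stand-in for `cat config.json | python script.py`.
sample = io.StringIO('{"urls": ["https://example.com/a"], "output_path": "/tmp/out.csv"}')
config = load_config(sample)
```

In the real script you would call `load_config(sys.stdin)`; everything downstream then reads `config.urls` and `config.output_path` rather than raw dictionary keys.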

Step 2 — Replace ad-hoc output with a structured dataset

Instead of writing rows directly to a CSV, push them to a function that wraps the persistence layer:

def push_record(record):
    # Today, this writes to a CSV.
    write_csv_row(output_path, record)

# Tomorrow, this writes to Apify Dataset, S3, BigQuery, whatever.

Same data shape, abstracted writer. The cron still produces a CSV. Step 6 swaps the writer.

Why second: schema changes are easier when there's one place that knows about the shape.
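One way the wrapper could look in runnable form, assuming flat string records and a CSV backend. `make_csv_writer`, the field names, and the temporary path are illustrative choices, not part of the original pipeline.

```python
import csv
import os
import tempfile
from typing import Callable

Record = dict[str, str]
Writer = Callable[[Record], None]


def make_csv_writer(path: str, fieldnames: list[str]) -> Writer:
    """Return a writer that appends one row per record, with a header on a new file."""
    def write(record: Record) -> None:
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="", encoding="utf-8") as f:
            w = csv.DictWriter(f, fieldnames=fieldnames)
            if is_new:
                w.writeheader()
            w.writerow(record)
    return write


# The rest of the script only ever calls push_record; changing the
# backend later means rebinding this one name.
demo_path = os.path.join(tempfile.mkdtemp(), "output.csv")
push_record: Writer = make_csv_writer(demo_path, ["url", "title"])

push_record({"url": "https://example.com/a", "title": "A"})
push_record({"url": "https://example.com/b", "title": "B"})
```

Because callers depend only on the `Writer` signature, a different backend can be introduced without touching the scraping logic.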

Step 3 — Replace bare `try/except` with structured failures

Audit every try/except. If a handler swallows the exception, replace it with explicit logging and a failure record:

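# Before:
for url in urls:
    try:
        record = process(url)
    except Exception:
        pass  # the failure vanishes; `record` may be stale or undefined

# After:
for url in urls:
    try:
        record = process(url)
    except Exception as e:
        push_failure({"url": url, "error": str(e), "type": type(e).__name__})
        continue  # skip to the next URL; the failure is already recorded
    push_record(record)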

Now failures are first-class data. Same rows of work; the bad ones go to a different file (or dataset) instead of vanishing.

Why third: this is the step where you stop losing data silently. Every later step assumes failures are visible.
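A self-contained sketch of the failure path, assuming failures are appended to a JSON Lines file. The `process` function here is a hypothetical stand-in that fails on purpose, and the file location is arbitrary.

```python
import json
import os
import tempfile


def process(url: str) -> dict:
    """Hypothetical stand-in for the real per-URL work."""
    if "bad" in url:
        raise ValueError(f"cannot parse {url}")
    return {"url": url, "status": "ok"}


failures_path = os.path.join(tempfile.mkdtemp(), "failures.jsonl")


def push_failure(failure: dict) -> None:
    """Append one failure as a JSON line so bad rows are kept, not dropped."""
    with open(failures_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(failure) + "\n")


records = []
for url in ["https://example.com/good", "https://example.com/bad"]:
    try:
        records.append(process(url))
    except Exception as e:
        push_failure({"url": url, "error": str(e), "type": type(e).__name__})
        continue
```

After a run, the failure file can be counted, grouped by `type`, or fed back in as the URL list for a retry pass.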

Step 4 — Replace `print()` with structured logging

# Before:
print(f"Processing {url}")

# After:
import logging
log = logging.getLogger("transcribe")
log.info("processing", extra={"url": url, "stage": "transcribe"})

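Use a logging setup that supports structured fields (Python: structlog, loguru, or logging with a JSON formatter). With the standard logging module, values passed via `extra` are attached to the log record but only appear in the output if the formatter emits them.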

Why fourth: logs are what step 6 will be reading. They need shape.
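For the standard-library route, a minimal JSON formatter might look like the sketch below. The `JsonFormatter` class and its `EXTRA_FIELDS` list are assumptions for illustration; the in-memory stream stands in for stdout so the output can be inspected.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including selected `extra` fields."""

    # Assumed field names; extend to match whatever you pass via `extra`.
    EXTRA_FIELDS = ("url", "stage")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for sys.stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("transcribe")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # avoid duplicate lines from the root logger

log.info("processing", extra={"url": "https://example.com/a", "stage": "transcribe"})
```

Each log line is then a single JSON object, which most runtimes and log aggregators can filter by field without regex parsing.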

Step 5 — Containerise

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "script.py"]

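The cron now runs `docker run -i my-scraper < config.json`. The `-i` flag keeps stdin open so the script can still read its config. If the CSV must land on the host, the output path also needs to sit on a mounted volume (for example `-v /var/data:/var/data`). Same input, same output, now containerised.

Why fifth: a container is portable. Step 6 needs it; nothing earlier did.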

Step 6 — Swap the runtime

Now you point the container at an actor runtime — Apify, Kubernetes CronJob, Cloud Run, whatever. The container is the same. The cron entry is gone. Scheduling, retries, logging, persistence are now provided by the runtime.

This is the step that takes a day. Steps 1–5 might take 2–3 days each. The point of doing them first is that step 6, the one most teams treat as the whole migration, is small and reversible by the time you reach it.
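One way to keep the swap reversible is to choose the Step 2 writer at startup. The sketch below assumes a hypothetical `SCRAPER_BACKEND` environment variable, and both writers are placeholders: the real "runtime" writer would call whatever dataset API your chosen runtime provides.

```python
import json
import os
from typing import Callable

Record = dict
Writer = Callable[[Record], None]


def csv_like_writer(record: Record) -> None:
    """Placeholder for the Step 2 CSV writer."""
    print("csv:", json.dumps(record))


def runtime_writer(record: Record) -> None:
    """Placeholder for the runtime's dataset API (for example, an SDK call)."""
    print("dataset:", json.dumps(record))


WRITERS: dict[str, Writer] = {"csv": csv_like_writer, "runtime": runtime_writer}


def select_writer(env: dict[str, str]) -> Writer:
    """Pick the writer from a hypothetical SCRAPER_BACKEND variable, defaulting to CSV."""
    backend = env.get("SCRAPER_BACKEND", "csv")
    if backend not in WRITERS:
        raise ValueError(f"unknown backend: {backend}")
    return WRITERS[backend]


push_record = select_writer(dict(os.environ))
```

With this shape, rolling back is a configuration change rather than a code change: unset the variable and the container writes CSV again.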

Why this order

Each step is independently valuable even if you stop. After step 1 you have a parameterised script — useful for ad-hoc runs. After step 2 you can change persistence. After step 3 you stop losing data. After step 4 you can debug remotely. After step 5 you can deploy anywhere. After step 6 you have an actor.

If a stakeholder asks why the migration is taking so long, you can point at the improvements already running in production after any step. There is no "we're 60% done with the rewrite, it's not running yet" phase.

Result

The interview transcription actor went through this migration over four months, one step at a time, while running in production the entire time. Pre-migration: ad-hoc cron, 18% silent-failure rate, mean time to detect issues ~24 hours. Post-migration: actor with retries and structured logging, 1.2% failure rate (and the failures are visible), mean time to detect <30 minutes.

Total team-hours: roughly 60. Spread across four iterations. Compare to the rewrites I've seen go sideways: typically 80–120 hours and a stalled cutover.

When this is wrong

Two cases where a rewrite genuinely beats the migration playbook:

The original script is very small (under 100 lines, single function). At that scale, the migration steps cost as much as a rewrite, and the rewrite gives you a cleaner result.
The original is in a language your team doesn't maintain (a Perl script you inherited, a Bash pipeline). At some point the cost of step 1 alone exceeds the rewrite cost.

Otherwise: surgery, not rewrite. We packaged this six-step migration as a checklist that we apply to every legacy scraper we inherit at the start of an engagement. It is the same shape we used to rebuild the interview transcription actor.

Where in the six steps is your current scraper? Drop the answer — I'll point at the next change that buys you the most.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
