Prithwish Nath

Posted on • Originally published at Medium

How Failing at Fantasy Baseball Made Me Fix My Cron Jobs with Temporal

So I made a bad trade in my fantasy baseball league. Dropped Kaz Okamoto because — according to my data — he’d been cold for two weeks. In reality, he’d been on a tear for the last nine days. 😅 This was a bad decision made because of bad data — my stats cron job had hit a rate limit, exited with no errors, and my FastAPI backend kept serving a stale JSON snapshot.

Well, I’d been meaning to fix that setup anyway. This time I did — and instead of patching the script, I tried out Temporal and…it worked embarrassingly well. Retries, backoff, execution history — things I’d normally bolt on manually were just… there. And if the network layer itself was flaky — rate limits, geo blocks — I could just add a proxy as a hardening layer.

This actually prompted me to go look at some of our production ingest jobs at work, and I realized they were the same pattern, just with more surface area. I ended up swapping one of them out, tentatively, then another. Same failure modes, just more scale.

This is my (admittedly very casual) write-up of what I learned. I hope it’s useful!

💡 I use Temporal’s Python SDK here, but they have one for TypeScript too — if that’s your thing.

How Cron Jobs Can Burn You

Here’s the brittle script I was using:

# One-shot MLB.com player fetch + write
import json
import os
from pathlib import Path

import requests

from mlb_player_stats import extract_stats_datatable  # parser shown later in this post

def main() -> None:
    player_url = os.getenv(
        "PLAYER_URL",
        "https://www.mlb.com/player/kazuma-okamoto-672960",
    )
    out_dir = Path(os.getenv("OUTPUT_DIR", "./data/runs"))
    out = out_dir / "latest.json"

    try:
        r = requests.get(player_url, timeout=60)
        if r.status_code != 200:
            print(f"WARN: HTTP {r.status_code}, leaving {out} unchanged")
            return
        stats = extract_stats_datatable(r.text)
        out_dir.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(stats), encoding="utf-8")
        print(f"Wrote stats to {out}")
    except Exception as e:
        print(f"WARN: ingest failed ({e!r}), leaving {out} unchanged")

if __name__ == "__main__":
    main()

That script lived behind a super simple crontab line — once a night, fixed schedule, stdout/stderr logs:

# crontab -l (excerpt)
0 2 * * * cd /home/me/fantasy-stats && .venv/bin/python scripts/fetch_player_stats.py >> /var/log/mlb_fetch.log 2>&1

The script is ~20 lines of Python plus one line of schedule. How many potential failures can you spot here?

The first is the thing I thought was prudent: if the fetch looks wrong, don’t overwrite the snapshot. So any 429, timeout, or 200 with a layout that no longer contains the marker extract_stats_datatable expects becomes a printed WARN, a no-op, and main() returns — exit code 0. No raise_for_status(); no sys.exit(1). Cron is “happy”; the one-line warning vanishes in a log I wasn’t tailing; latest.json never updates. Nine “successful” runs later…I made a bad decision because I had bad data. (Flip it to raise_for_status() and you get the opposite smell: a non-zero exit, still no retry, still stale data until someone fixes the feed — pick your poison 😬)

The other two are subtler — and I didn’t personally run into them — but upon review, they were just as likely to have burnt me.

  • Fixed output path. Every run writes to latest.json. If two runs overlap — which happens the moment a run is slow and the next cron tick fires — it’s a race condition. One overwrites the other mid-write. You might read a corrupted file, or never know which run's data you actually have.
  • Non-atomic write. out.write_text() is not atomic. If the process dies mid-write — OOM, signal, anything — you get a partial JSON file. The next reader gets a parse error and now has to figure out whether the file is corrupted or just empty. This is exactly the kind of bug that shows up at 2am on a production system.

The real problem isn’t any ONE of these — it’s really that cron gives you exactly one bit of feedback: exit zero or exit non-zero — and as this script shows, exit zero can lie. It can’t give you a retry policy, overlap protection, or artifact history. No way to answer “what did this job actually do at 3am last Tuesday?”

And yes, you can try to patch around that. Add a retry loop and exponential backoff, ship logs somewhere. But now your retry state lives in the process memory that disappears on crash, your backoff is hand-rolled, and your observability is still pretty much log spelunking. At that point you’re not using cron anymore. You’re rebuilding a tiny, worse workflow engine around cron.
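
For concreteness, here's roughly where that road leads: a hypothetical hand-rolled retry wrapper (attempt counts and delays invented for illustration). Every piece of retry state lives in process memory.

import sys
import time

import requests

def fetch_with_retries(url: str, max_attempts: int = 5) -> str:
    delay = 3.0
    for attempt in range(1, max_attempts + 1):
        try:
            r = requests.get(url, timeout=60)
            r.raise_for_status()
            return r.text
        except Exception as e:
            print(f"WARN: attempt {attempt}/{max_attempts} failed: {e!r}")
            if attempt == max_attempts:
                sys.exit(1)  # cron sees a non-zero exit... and does nothing with it
            time.sleep(delay)  # the attempt counter and backoff live here, in memory:
            delay *= 2.0       # a crash or OOM wipes them, and nothing resumes the job
    raise RuntimeError("unreachable")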

What is Temporal?

Temporal is a “durable execution platform”. What that really means in practice is that you’ll write ordinary functions — a Workflow that orchestrates things, and Activities that do the actual work — and Temporal will:

  • Make the execution survive process crashes,
  • Retry failed steps with backoff,
  • Prevent overlapping runs, and
  • Record the full history of every execution.

💡 Durable execution is the simple idea that your code should keep running to completion even if the machine running it doesn’t. The mental model is that your workflow is a function call that cannot be interrupted, even if the worker reboots halfway through. State lives in Temporal’s history, not in the worker’s memory.

The architecture for our project: a Schedule kicks off the Workflow, the Workflow calls an Activity, and a Worker process executes both. By default you also get the Temporal Web UI on localhost:8233.

To grok Temporal properly, understand that the Workflow owns when things happen, while the Activity owns what happens — i.e. the page fetch, the stats extraction, the file write. Workflows must stay deterministic; all side effects belong in Activities. Why does that matter? We’ll come back to that in a second.

Getting player data from MLB.com

Before any of the Temporal machinery, you need reliable data ingestion. For us, that means loading an MLB.com player page and extracting the stats blob embedded in the initial HTML.

MLB.com currently renders a player page with a JavaScript object that starts with stats: {"statsDatatable"...}. That's convenient: no browser, no Playwright, no screenshot automation needed.

It’s critical to understand that a successful fetch alone does not tell you the stats blob is still there, or that the row you care about was parsed correctly. That’s why the extraction below raises instead of printing a WARN.

mlb_player_stats.py

"""  
MLB.com player page: HTTP fetch + embedded JSON extraction.  
"""  
from __future__ import annotations  
import os  
import re  
from json import JSONDecoder  
from typing import Any, Dict, List  
from urllib.parse import quote  
import requests  

def _strip_tags(s: Any) -> Any:  
    if not isinstance(s, str):  
        return s  
    s = re.sub(r"<[^>]+>", "", s)
    return s.strip()  

def _sanitize_row(row: Dict[str, Any]) -> Dict[str, Any]:  
    out: Dict[str, Any] = {}  
    for k, v in row.items():  
        out[k] = _strip_tags(v)  
    return out  

def extract_stats_datatable(html: str) -> Dict[str, Any]:  
    needle = 'stats: {"statsDatatable"'  
    i = html.find(needle)  
    if i == -1:  
        raise ValueError(  
            "Could not find stats JSON marker (page layout may have changed)."  
        )  
    start = i + len("stats: ")  
    obj, _ = JSONDecoder().raw_decode(html[start:])  
    return obj  

def pick_current_season_row(rows: List[Dict[str, Any]]) -> Dict[str, Any] | None:  
    for row in rows:  
        h = row.get("header", "")  
        if isinstance(h, str) and "Regular Season" in h and "Career" not in h:  
            return row  
    return rows[0] if rows else None  

def build_requests_proxies() -> Dict[str, str] | None:  
    # Bright Data super proxy; returns None if credentials unset  
    explicit = os.getenv("BRIGHT_DATA_PROXY_URL", "").strip()  
    if explicit:  
        return {"http": explicit, "https": explicit}  
    host = os.getenv("BRIGHT_DATA_PROXY_HOST", "brd.superproxy.io").strip()  
    port = os.getenv("BRIGHT_DATA_PROXY_PORT", "33335").strip()  
    username = os.getenv("BRIGHT_DATA_PROXY_USERNAME", "").strip()  
    password = os.getenv("BRIGHT_DATA_PROXY_PASSWORD", "").strip()  
    if not username or not password:  
        return None  
    user_enc = quote(username, safe="")  
    pass_enc = quote(password, safe="")  
    proxy_url = f"http://{user_enc}:{pass_enc}@{host}:{port}"  
    return {"http": proxy_url, "https": proxy_url}  

def fetch_player_page(player_url: str, *, timeout: int = 60) -> requests.Response:  
    proxies = build_requests_proxies()  
    return requests.get(  
        player_url,  
        timeout=timeout,  
        proxies=proxies  
    )  

def build_stats_payload(player_url: str, *, timeout: int = 60) -> Dict[str, Any]:  
    # Fetch page, parse embedded hitting summary rows (sanitized)  
    r = fetch_player_page(player_url, timeout=timeout)  
    r.raise_for_status()  
    blob = extract_stats_datatable(r.text)  
    hitting_large = blob["statsDatatable"]["hitting"]["large"]  
    block = hitting_large[0] if isinstance(hitting_large, list) else hitting_large  
    filtered = block["filteredRows"]  
    current = pick_current_season_row(filtered)  
    career = next(  
        (row for row in filtered if row.get("header") == "Career Regular Season"),  
        None,  
    )  
    via_proxy = build_requests_proxies() is not None  
    return {  
        "source_url": player_url,  
        "http_status": r.status_code,  
        "via_bright_data_proxy": via_proxy,  
        "current_regular_season": _sanitize_row(current) if current else None,  
        "career_regular_season_row": _sanitize_row(career) if career else None,  
        "all_summary_rows": [_sanitize_row(row) for row in filtered],  
    }

Keeping the fetch inside an activity means the execution model stays unchanged while the network path can evolve independently. The same fetch_player_page call can run directly or be routed through a proxy layer without touching the workflow logic.
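
A quick way to sanity-check that path outside Temporal entirely: a sketch, assuming the module above is importable as temporal_cron.mlb_player_stats.

# No Temporal involved: with no BRIGHT_DATA_* env vars set, build_requests_proxies()
# returns None and the request goes direct; exporting BRIGHT_DATA_PROXY_URL reroutes
# the exact same call through the proxy, with no code changes.
from temporal_cron.mlb_player_stats import build_stats_payload

payload = build_stats_payload("https://www.mlb.com/player/kazuma-okamoto-672960")
print(payload["via_bright_data_proxy"])
print(payload["current_regular_season"])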

Note that Temporal can only give you execution reliability: retries, timeouts, and visibility. It cannot make a failing network succeed. If every attempt returns a 429 from the same IP, Temporal will reliably retry until the policy is exhausted, and every attempt will still fail. That’s where a proxy layer earns its keep; you won’t know how valuable it is until you really need it.

Data Extraction with Temporal.io

The Temporal Workflow

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# StatsJob is a small dataclass (player_url, output_dir) passed in by the starter;
# activity types come in through the workflow sandbox's import pass-through.
with workflow.unsafe.imports_passed_through():
    from temporal_cron.activities import CollectStatsInput, CollectStatsResult, collect_stats

@workflow.defn
class StatsCollectionWorkflow:
    @workflow.run
    async def run(self, job: StatsJob) -> CollectStatsResult:
        info = workflow.info()
        return await workflow.execute_activity(
            collect_stats,
            CollectStatsInput(
                player_url=job.player_url,
                workflow_id=info.workflow_id,
                run_id=info.run_id,
                output_dir=job.output_dir,
            ),
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=3),
                backoff_coefficient=2.0,
                maximum_interval=timedelta(minutes=2),
                maximum_attempts=8,
            ),
        )

That RetryPolicy block is the part most of us have written manually at some point — a while loop, a try/except, a time.sleep, a counter, hopefully a max attempts check. Here it's declared once, lives outside the business logic, and survives worker crashes. If the worker process dies on attempt 3 of 8, the next worker that comes up picks up at attempt 4. The state is in Temporal, not in memory.
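
This replay mechanic is also why workflows must stay deterministic (the thing I said we’d come back to). On replay, Temporal re-runs your workflow code against the recorded event history, so anything the workflow computes directly has to come out the same every time. A sketch of the anti-pattern, illustrative rather than code from this project:

from temporalio import workflow

@workflow.defn
class NonDeterministicWorkflow:
    @workflow.run
    async def run(self) -> None:
        # DON'T: clocks, randomness, or I/O directly in workflow code.
        # On replay these produce values that differ from the recorded
        # history, so the run can diverge or trip a nondeterminism error.
        #
        #   cutoff = datetime.now()        # use workflow.now() instead
        #   html = requests.get(url).text  # belongs in an Activity
        ...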

The start_to_close_timeout is the hard limit on how long a single activity attempt can run. Without it, a stalled HTTP request holds a worker slot indefinitely. Decidedly not what we want.

The Temporal Activity

Here’s activities.py:

"""Activities: MLB.com fetch + stats extraction + artifact write (all side effects here)."""  
from __future__ import annotations  
import json  
import os  
from dataclasses import dataclass  
from pathlib import Path  
from typing import Any, Dict, Union  
from temporalio import activity  
from temporal_cron.mlb_player_stats import build_stats_payload  

@dataclass  
class CollectStatsInput:  
    player_url: str  
    workflow_id: str  
    run_id: str  
    output_dir: str  

@dataclass  
class CollectStatsResult:  
    artifact_path: str  
    home_runs: Union[str, int]  
    player_url: str  

def _atomic_write_json(path: Path, data: Dict[str, Any]) -> None:  
    path.parent.mkdir(parents=True, exist_ok=True)  
    tmp = path.with_suffix(path.suffix + ".tmp")  
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")  
    tmp.replace(path)  

@activity.defn  
def collect_stats(input: CollectStatsInput) -> CollectStatsResult:  
    # One HTTP try per activity attempt; workflow RetryPolicy owns backoff/attempts.  
    data = build_stats_payload(input.player_url, timeout=60)  
    current = data.get("current_regular_season") or {}  
    hr = current.get("homeRuns", 0)  
    base = Path(input.output_dir or os.getenv("OUTPUT_DIR", "./data/runs"))  
    safe_wid = input.workflow_id.replace(os.sep, "_").replace(":", "_")  
    safe_rid = input.run_id.replace(os.sep, "_").replace(":", "_")  
    out_path = base / f"{safe_wid}__{safe_rid}.json"  
    payload = {  
        "workflow_id": input.workflow_id,  
        "run_id": input.run_id,  
        "player_url": input.player_url,  
        "data": data,  
    }  
    _atomic_write_json(out_path, payload)  
    return CollectStatsResult(  
        artifact_path=str(out_path.resolve()),  
        home_runs=hr if hr is not None else 0,  
        player_url=input.player_url,  
    )

The output path uses the run ID, not a fixed filename. Every execution gets its own artifact — stats-manual-abc123__run456.json. No races, no overwrites, and you have a full history of every run. You can diff two runs. You can see exactly what data you had on any given night. This alone would have saved me.

The file write is atomic: _atomic_write_json (in the excerpt above) writes to a .tmp file first, then replace() on the same filesystem. A reader either sees the old file or the new file — never a partial write. The brittle script calls write_text() directly; if the process died mid-write, you got corrupted JSON and a confusing parse error at the worst possible time.

💡 Note that collect_stats is a sync function, not async. That's intentional — sync activities run in a thread pool (which is why the Worker below passes an activity_executor), so blocking I/O doesn't block the event loop. The Temporal SDK supports both; sync is the right call when your activity is mostly waiting on a network request.

The Temporal Worker

import concurrent.futures

async def _main() -> None:
    client = await Client.connect(host, namespace=namespace)
    worker = Worker(
        client,
        task_queue=task_queue,
        workflows=[StatsCollectionWorkflow],
        activities=[collect_stats],
        # collect_stats is sync, so the SDK requires a thread pool to run it in
        activity_executor=concurrent.futures.ThreadPoolExecutor(max_workers=8),
    )
    await worker.run()

The Worker is the only process that ever touches MLB.com. The workflow and scheduler are pure orchestration — they tell Temporal what to do, they don’t do any work themselves. You can scale workers horizontally without touching the scheduling layer. You’ll want that when you take this pattern to production.

The worker view shows the stats-pipeline worker polling alongside Temporal's own system worker.

The Temporal Schedule

Putting this on a schedule is one function call:

from temporalio.client import (
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleOverlapPolicy,
    SchedulePolicy,
    ScheduleSpec,
)

schedule = Schedule(
    action=ScheduleActionStartWorkflow(
        StatsCollectionWorkflow.run,
        job,
        id=workflow_id,
        task_queue=task_queue,
    ),
    spec=ScheduleSpec(cron_expressions=[cron]),
    policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.SKIP),
)

SKIP is the thing cron simply cannot do.

Say MLB.com is slow today, or your fetches are getting rate-limited harder than usual, and a run takes 12 minutes while your schedule fires every 15. Eventually the slow run bleeds into the next tick. With cron, you now have two instances running simultaneously, both writing to latest.json, racing each other. With SKIP, the new scheduled run sees the previous one is still active and does nothing. When the previous run finishes, the schedule resumes normally at the next tick. That's an entire class of bug you stop thinking about.
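
SKIP is also the default policy. If it's the wrong trade-off for your job, the other members of temporalio's ScheduleOverlapPolicy are worth knowing. A quick reference, with behavior paraphrased from the docs (double-check against your SDK version):

from temporalio.client import ScheduleOverlapPolicy

# SKIP            - drop the new run if one is still going (the default)
# BUFFER_ONE      - queue at most one missed run; start it when the current run ends
# BUFFER_ALL      - queue every missed run and drain them in order
# CANCEL_OTHER    - cancel the running workflow, then start the new one
# TERMINATE_OTHER - terminate the running workflow immediately, then start the new one
# ALLOW_ALL       - cron behavior: let runs overlap freely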

The schedule script also handles create-or-update correctly — describe() first, catch NOT_FOUND, then either create or update:

from temporalio.client import ScheduleUpdate
from temporalio.service import RPCError, RPCStatusCode

handle = client.get_schedule_handle(schedule_id)
try:
    await handle.describe()
except RPCError as err:
    if err.status != RPCStatusCode.NOT_FOUND:
        raise
    await client.create_schedule(schedule_id, schedule)
else:
    await handle.update(lambda _input: ScheduleUpdate(schedule=schedule))

Run it once to create the schedule. Run it again to change the cron expression or job parameters. Same command either way.

What you actually see in the UI

Pop open http://localhost:8233 while a workflow is running. Every step is there — which activity ran, how many attempts it took, what the retry intervals were, what came back. If it failed on attempt 2 and succeeded on attempt 5, you can see that. You can see the exact input that went in and the exact output that came out. You can see how long each attempt took.

The workflow list gives you the first-level answer cron never gives cleanly: what ran, when, and whether it completed.

The timeline connects the workflow input, activity execution, and output artifact in one place. The event history is the audit trail: scheduled tasks, activity start/completion, workflow task transitions, and final result.

The activity details will also show the successful third attempt and preserve the previous failure: 429 Too Many Requests.

Temporal activity event showing attempt #3 with previous HTTP 429 failure details

Compare that to the cron version — just a process exit code and whatever you print() to stdout. If the job failed at 3am and you weren't tailing logs, that information is gone.

This observability is honestly why teams end up on Temporal even for jobs that aren’t that complicated. It removes entire categories of debugging work. No log spelunking to figure out whether a retry happened. No guessing how many attempts ran. No reconstructing a timeline from scattered stdout. You won’t have to infer from fragments; all of this is something you can just look up.
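
You don’t even need the UI for the basic questions; the same visibility data is queryable from the Python client. A minimal sketch, assuming the workflow type name from earlier:

from temporalio.client import Client

async def what_ran_last_night(client: Client) -> None:
    # Visibility query: every execution of our workflow type, with status and timing.
    async for wf in client.list_workflows('WorkflowType = "StatsCollectionWorkflow"'):
        print(wf.id, wf.run_id, wf.status, wf.start_time)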

Running it yourself

Commands below assume uv is available with a venv set up.

# Start the local Temporal server  
temporal server start-dev  
# In another terminal, start the worker  
uv run temporal-cron-worker  
# Trigger a single run  
uv run temporal-cron-start  
# Or put it on a schedule (defaults to every 15 minutes, set SCHEDULE_CRON in .env to change)  
uv run temporal-cron-schedule

Open http://localhost:8233 and you'll see the workflow execution. Artifacts land under data/runs/, one file per run, named by workflow and run ID.
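
If you’re wiring this up from scratch rather than cloning, the temporal-cron-start entry point is roughly the following. This is a sketch: StatsJob’s fields, the workflow ID, and the task queue name are assumed from the earlier snippets.

import asyncio

from temporalio.client import Client

from temporal_cron.workflows import StatsCollectionWorkflow, StatsJob

async def main() -> None:
    client = await Client.connect("localhost:7233")
    result = await client.execute_workflow(
        StatsCollectionWorkflow.run,
        StatsJob(
            player_url="https://www.mlb.com/player/kazuma-okamoto-672960",
            output_dir="./data/runs",
        ),
        id="stats-manual-demo",       # workflow ID: becomes part of the artifact filename
        task_queue="stats-pipeline",  # must match the worker's task queue
    )
    print(result.artifact_path)

if __name__ == "__main__":
    asyncio.run(main())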

A Note on Production

This just runs against a local Temporal dev server with a single worker. Taking it to production means picking Temporal Cloud or running your own cluster, adding proper secrets management, structured logging, metrics, and deployment automation.

Cron is all you need for simple, isolated tasks. It’s probably not enough once you introduce retries, external dependencies, or jobs that can overlap or run longer than their schedule.

When you get into that zone, Temporal gives you a durable execution layer around ingest: retries, timeouts, overlap control, and a history you can audit. And then when that layer needs to scale up or if you start hitting IP-based (or geo-based, even) friction, Bright Data’s proxies give you the hardened network path you’ll need. It’s been a pretty natural pairing, in my experience. You can route the same requests.get() call through a proxy and let Temporal keep owning retries, timeouts, and audit history.

Regardless, this pattern right here is the cheat code for Temporal: Workflows own orchestration, and Activities own side effects.
