Development loops for API integrations are usually painful.
We’ve all been there: You are building a data pipeline to ingest data from a third-party API (Salesforce, Stripe, or an internal microservice). You write your Python script, hit run, and wait.
It works. You change a column name in your transformation logic. You hit run again. You wait again.
Suddenly, you hit a rate limit. Or the API throws a 503 error. Or, worse, you are on a train or an airplane with spotty WiFi, and you can’t run your code at all because it depends on a live internet connection.
In the world of SQL, we solved this with local databases (DuckDB) and seeds. But in the world of Python API ingestion, we are often still writing fragile requests.get() loops that break the moment the internet does.
I built FastFlowTransform (FFT) to solve this. It’s a hybrid SQL+Python framework that treats HTTP responses like immutable artifacts, allowing you to build, test, and debug API pipelines completely offline.
Here is how to build a pipeline that is "Airplane Mode" ready, using a real API example.
The Problem: The "Request Loop" Antipattern
Typically, a Python extraction script looks something like this:
import requests
import pandas as pd

data = []
page = 1

while True:
    # Fragile: depends on a live connection for every test run
    resp = requests.get(f"https://jsonplaceholder.typicode.com/todos?_page={page}&_limit=20")
    json_data = resp.json()
    if not json_data:  # Stop if empty
        break
    # ... complex logic to clean and parse JSON ...
    data.extend(json_data)
    page += 1

df = pd.DataFrame(data)
# ... save to DB ...
The issues with this approach:
- Coupled Extraction & Logic: If you mess up the parsing logic, you have to re-fetch everything to test the fix.
- No State: If the script crashes on page 99 of 100, you restart from page 1.
- Online Only: You cannot run this without a live connection.
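Before reaching for a framework, the manual workaround is to split the script in two: dump the raw JSON to disk once (online), then iterate on the transformation against those files (offline). A minimal sketch of that pattern in plain Python (the raw_todos/ directory is just an illustrative choice):

import json
import pathlib

import pandas as pd
import requests

RAW_DIR = pathlib.Path("raw_todos")
RAW_DIR.mkdir(exist_ok=True)

def extract() -> None:
    # Step 1 (online, run once): persist every page verbatim.
    page = 1
    while True:
        resp = requests.get(
            f"https://jsonplaceholder.typicode.com/todos?_page={page}&_limit=20"
        )
        if not resp.json():
            break
        (RAW_DIR / f"page_{page}.json").write_text(resp.text)
        page += 1

def transform() -> pd.DataFrame:
    # Step 2 (offline, repeat as often as you like): read from disk.
    rows = []
    for path in sorted(RAW_DIR.glob("page_*.json"), key=lambda p: int(p.stem.split("_")[1])):
        rows.extend(json.loads(path.read_text()))
    return pd.DataFrame(rows)

This works, but now every pipeline hand-rolls its own cache layout, pagination state, and offline switch. That is exactly the boilerplate FFT absorbs.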
The Solution: FFT's Cached HTTP Module
FastFlowTransform introduces a dedicated module fastflowtransform.api.http. It separates the fetch (IO) from the transformation (Compute) using a file-backed cache.
Let's build a model that pulls "To-Do" items from JSONPlaceholder.
1. The Setup
First, we initialize a project. We'll use DuckDB for local development so we don't need a cloud warehouse.
pip install fastflowtransform
fft init my_api_project --engine duckdb
2. The Model with Pagination
In FFT, Python models are first-class citizens. We need to define how to talk to the API.
Since JSONPlaceholder uses query parameters (_page and _limit), we write a simple paginator function that detects when to stop (when the list is empty, plus a two-page cap to keep this demo small) and how to get the next page.
Create models/todos_ingest.ff.py:
import pandas as pd

from fastflowtransform import model
from fastflowtransform.api.http import get_df


# 1. Define the paginator.
# This function runs after every request to determine what to do next.
def offset_paginator(url, params, response_json):
    # If the API returns an empty list, we are done.
    if not response_json:
        return None

    # Demo safeguard: stop after two pages so the example stays small.
    current_page = params.get("_page", 1)
    if current_page >= 2:
        return None

    # Otherwise, increment the page number.
    next_params = dict(params or {})
    next_params["_page"] = current_page + 1
    return {
        "next_request": {
            "params": next_params
        }
    }
@model(name="todos_ingest")
def fetch_todos() -> pd.DataFrame:
    # 2. get_df handles the HTTP calls, caching, and conversion.
    df = get_df(
        url="https://jsonplaceholder.typicode.com/todos",
        params={"_page": 1, "_limit": 10},  # Start at page 1
        paginator=offset_paginator,
        # record_path is None because the root of the JSON is the list itself.
        record_path=None,
    )

    # 3. Apply transformation logic.
    # If we change THIS logic later, FFT won't re-fetch the API!
    # Example: mark high-priority items locally.
    df["priority"] = df["title"].apply(lambda x: "HIGH" if "delectus" in x else "NORMAL")
    return df
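Because the paginator is just a function of (url, params, response_json), you can sanity-check it in isolation before ever running the pipeline. The URL below is a throwaway placeholder; this particular paginator never inspects it:

# Empty response -> stop.
assert offset_paginator("https://example.invalid", {"_page": 1}, []) is None

# Non-empty response on page 1 -> advance to page 2, keeping the other params.
step = offset_paginator(
    "https://example.invalid", {"_page": 1, "_limit": 10}, [{"id": 1}]
)
assert step == {"next_request": {"params": {"_page": 2, "_limit": 10}}}

# Page 2 hits the demo cap -> stop.
assert offset_paginator(
    "https://example.invalid", {"_page": 2, "_limit": 10}, [{"id": 21}]
) is None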
3. The First Run (Online)
When we run this for the first time, FFT hits the API.
fft run . --select todos_ingest
What happens under the hood:
- FFT calculates a fingerprint for the model.
- It executes the requests: _page=1, then _page=2, and so on, until the API returns [] or it reaches the configured limits.
- Crucially: it saves the raw JSON to a local cache directory (.fastflowtransform/http_cache).
- It transforms the data and materializes a table in DuckDB.
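The run_results.json telemetry shown later reports cache keys as SHA-256 digests, so a useful mental model is a content-addressed lookup keyed by the request. Here is a minimal illustrative sketch of a file-backed HTTP cache in that style (this is how such caches typically work, not FFT's actual internals):

import hashlib
import json
import pathlib

import requests

CACHE_DIR = pathlib.Path(".http_cache")  # Illustrative path, not FFT's real one.
CACHE_DIR.mkdir(exist_ok=True)

def cached_get_json(url: str, params: dict, offline: bool = False):
    # Key the cache on the full request: method + URL + sorted params.
    key = hashlib.sha256(
        json.dumps(["GET", url, sorted(params.items())]).encode()
    ).hexdigest()
    path = CACHE_DIR / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text())  # Cache hit: no network needed.
    if offline:
        raise RuntimeError(f"Offline, and no cached response for {url}")

    resp = requests.get(url, params=params)  # Cache miss: go to the network.
    resp.raise_for_status()
    path.write_text(resp.text)  # Persist the raw body for next time.
    return resp.json()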
4. The Second Run (Offline / "Airplane Mode")
Now, imagine you are on a plane. You realize you made a mistake: you want to filter out any tasks that are already completed.
You update the code:
# ... inside the model function ...

# New logic: filter out tasks that are already completed
df = df[~df["completed"]]

# New logic: uppercase titles
df["title"] = df["title"].str.upper()
return df
You don't have internet. But you don't need it. Run with the --offline flag:
fft run . --select todos_ingest --offline
The Result:
- FFT sees the --offline flag.
- It checks the cache. It finds the JSON from the previous run.
- It skips the network request entirely.
- It feeds the cached JSON into your new logic.
- The run succeeds in milliseconds.
5. Telemetry and Observability
How do you know if you hit the cache? FFT generates a run_results.json artifact after every run. It provides deep visibility into your API consumption:
"http": {
"bytes": 2273,
"cache_hits": 2,
"content_hashes": [
"sha256:110aa4d5dac630aa245ff3c3c53d7ea9bc4212df93f04d96f900ba9cb93f4622",
"sha256:27ea31b0b9bb05c4feba2951d2f0a5f9dde340f0d19cc45722386e8951b794b5"
],
"keys": [
"7a8d720efd2b8afb319534d0d1f08b7937f666a14fea0952c3cbbe0c2442b6d9",
"24850fd6c24df9ecd041d643331023d48d39b6c6bbf64080c76f86c95613a584"
],
"node": "todos_ingest",
"requests": 2,
"used_offline": true
}
This gives you confidence that your CI/CD pipeline is deterministic. You can even commit your cache to git (for small reference datasets) to ensure your tests never flake due to external API downtime.
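That same artifact can become a CI gate. A small sketch, assuming run_results.json contains the per-node http block shown above somewhere in its structure (the exact top-level layout may differ, and I'm assuming requests counts logical requests while cache_hits counts those served from cache):

import json
import sys

with open("run_results.json") as f:
    results = json.load(f)

def iter_http_blocks(obj):
    # Walk the artifact and yield every "http" telemetry block, wherever it lives.
    if isinstance(obj, dict):
        if isinstance(obj.get("http"), dict):
            yield obj["http"]
        for value in obj.values():
            yield from iter_http_blocks(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_http_blocks(item)

for http in iter_http_blocks(results):
    # An offline run should be served entirely from cache.
    if http.get("used_offline") and http.get("requests", 0) > http.get("cache_hits", 0):
        print(f"Unexpected live requests for node {http.get('node')}", file=sys.stderr)
        sys.exit(1)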
Why this matters
Data engineering is moving toward software engineering best practices.
- Reproducibility: Your pipeline should produce the same result today as it did yesterday, regardless of the state of an external API.
- Speed: You shouldn't pay a latency penalty every time you test a logic change.
- Cost: If you are hitting a paid API, caching saves you money during development.
FastFlowTransform brings the developer experience of "Localhost" to the messy world of Data Engineering.
Try it out:
[FastFlowTransform GitHub]
pip install fastflowtransform