DEV Community: Khalid Abdelaty

Khalid Abdelaty — Fri, 27 Mar 2026 16:04:27 +0000

From Web Scraping Scripts to Web Data APIs: A Practical Python Guide

#ai #python #api #tutorial

13 min read

Khalid Abdelaty — Fri, 27 Mar 2026 15:58:51 +0000

Learn how to build reliable web data pipelines in Python using API-based extraction instead of scraping scripts. This guide walks through single-page scrapes, batch processing, structured JSON output, and handling errors in production.

If you have ever written a Python scraping script that worked perfectly on Tuesday and broke by Thursday, you already understand the core problem. Websites change their HTML structure, deploy anti-bot protections, and render content with JavaScript that a simple requests.get() call never sees. Maintaining these scripts becomes a job in itself.

Web Data APIs take a different approach. Instead of managing headless browsers, proxy pools, and CSS selectors yourself, you send an HTTP request with a URL and get back clean, structured content. The infrastructure that handles JavaScript rendering, IP rotation, and retry logic lives on someone else's servers.

This guide uses Olostep as the API provider to walk through practical Python examples. By the end, you will have tested, working code that runs against a real API.

All code examples in this article are available in the companion GitHub repository: olostep-python-guide.

Web data API request flow. Image by Author.

What Is a Web Data API?

The term "Web Data API" describes a category of services that handle the scraping infrastructure for you and return structured page content through a standard HTTP interface. Before I get into how these APIs work, let me start with what they replace and why.

The problem with traditional scraping

Traditional web scraping means writing code that downloads a page's HTML, parses it with a library like BeautifulSoup or Scrapy, and extracts the data you need by targeting specific CSS selectors or XPath expressions. This works fine for small, stable projects. It starts to fall apart when any of the following happens:

The target site redesigns its layout or changes CSS class names, and your selectors stop matching overnight.
The site deploys anti-bot systems like Cloudflare Turnstile or DataDome, or triggers CAPTCHA challenges after a handful of requests.
Content loads dynamically through JavaScript, so the HTML returned by a basic HTTP request is an empty shell.
You need to scrape hundreds or thousands of pages, which means managing proxy rotation, concurrency, and rate limiting on your own.

Each of these problems has a solution, but the solutions stack up. You end up maintaining a headless browser, a proxy pool, a retry queue, and a CAPTCHA solver alongside the actual data extraction logic. At that point, the scraping infrastructure often takes more engineering effort than the analysis you built it for.

What web data APIs solve

A web data API handles all of that infrastructure behind a single HTTP endpoint. You send a POST request with the URL you want to scrape and the format you need (Markdown, HTML, JSON), and the API returns the extracted content. JavaScript rendering, residential IP rotation, and anti-bot handling happen on the provider's side.

The real difference is who owns the maintenance. With traditional scraping, that is you: the browser, the proxies, the retry logic. With a web data API, the provider handles that stack, and you pay per request. Your code deals with the data, not with how to get it.

Who uses web data APIs?

AI companies building RAG pipelines need clean text from thousands of documentation pages. So do product teams running research agents and data engineers who need extraction that does not break every time a target site updates its frontend.

Getting Started with Olostep

With that background covered, let me walk through a practical setup using Olostep.

Creating an account and getting your API key

Head to olostep.com and create an account. The free tier gives you 500 successful requests per month, which is enough to follow along with this guide.

Once you are logged in, find your API key in the dashboard. You will need it for every request.

Dashboard API key retrieval section. Image by Author.

Store it as an environment variable rather than hardcoding it in your scripts:

export OLOSTEP_API_KEY="your_api_key_here"

On the Python side, you only need the requests library:

pip install requests

Your first API scrape

Here is a minimal example that scrapes a single URL and returns the content as Markdown and HTML. I will break down each part after the code.

import os
import requests

api_key = os.environ.get("OLOSTEP_API_KEY")
endpoint = "https://api.olostep.com/v1/scrapes"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

payload = {
    "url_to_scrape": "https://books.toscrape.com/",
    "formats": ["markdown", "html"]
}

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()

data = response.json()
print(data["result"]["markdown_content"][:500])

Note that formats takes an array, so you can request multiple output formats in one call. The response.raise_for_status() call ensures your code fails loudly on HTTP errors rather than silently processing a bad response.
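One gap worth closing before going further: os.environ.get() returns None when the variable is unset, so the request would go out with the header "Bearer None" and fail with an authentication error. Here is a minimal sketch of a guard. The helper names load_api_key and build_headers are my own for illustration, not part of any Olostep SDK.

```python
import os


def load_api_key(var_name: str = "OLOSTEP_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    api_key = os.environ.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script."
        )
    return api_key


def build_headers(api_key: str) -> dict:
    """Build the same two headers used in the example above."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Calling build_headers(load_api_key()) at the top of a script surfaces a missing key immediately instead of at the first HTTP call.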

Understanding the API response

The response from /v1/scrapes is a JSON object. Two parts are worth paying attention to. At the top level you get metadata, including a retrieve_id you can use later to fetch the same content without re-scraping:

retrieve_id = data["retrieve_id"]  # Top-level field

The actual extracted content lives inside the result object:

markdown = data["result"]["markdown_content"]
html = data["result"]["html_content"]
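If you only requested one format, the other key may be missing, and direct indexing would raise a KeyError. The sketch below reads the same three fields defensively. The function name extract_content is hypothetical, and I am assuming only the response keys shown above (retrieve_id, result, markdown_content, html_content).

```python
def extract_content(data: dict) -> dict:
    """Pull the fields used in this guide out of a parsed /v1/scrapes response."""
    result = data.get("result") or {}
    return {
        "retrieve_id": data.get("retrieve_id"),
        # Fall back to an empty string when a format was not requested.
        "markdown": result.get("markdown_content") or "",
        "html": result.get("html_content") or "",
    }
```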

Core Endpoints and When to Use Them

Olostep provides several endpoints, but three cover most of the extraction jobs you will run. Each one suits a different scenario, so picking the right one matters.

Choosing the right API endpoint. Image by Author.

`/v1/scrapes`: single-page extraction

This is the endpoint I used in the previous section. It works synchronously: you send a URL, and you get back the content in the same HTTP response. Use it when you need to extract data from a single page or a small number of pages (under 50) where you can run requests in parallel from your own code.

The key optional parameters are:

wait_before_scraping (integer, milliseconds): Tells the API to pause before extracting, giving JavaScript time to render. This replaces any selector-based waiting you might expect from browser automation tools.
country: Routes the request through a residential IP in a specific country. Useful when the page serves different content based on location. Supported values include US, GB, JP, IN, and others.
actions: An array of browser interactions (click, scroll, fill input, wait) that the API executes before extraction. This covers scenarios where content is behind a button click or requires scrolling to load.

Each scrape costs 1 credit on the base plan.
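To show how these optional parameters fit into a request body, here is a small payload builder. It is a sketch under a few assumptions: the helper name build_scrape_payload is mine, and I leave out actions because this guide does not show the shape of an action object.

```python
from typing import Optional


def build_scrape_payload(
    url: str,
    formats: Optional[list] = None,
    wait_ms: Optional[int] = None,
    country: Optional[str] = None,
) -> dict:
    """Assemble a /v1/scrapes payload, adding optional keys only when set."""
    payload = {"url_to_scrape": url, "formats": formats or ["markdown"]}
    if wait_ms is not None:
        if wait_ms < 0:
            raise ValueError("wait_ms must be zero or positive")
        payload["wait_before_scraping"] = wait_ms  # milliseconds
    if country is not None:
        payload["country"] = country  # e.g. "US" or "GB"
    return payload
```

For example, build_scrape_payload("https://books.toscrape.com/", wait_ms=2000, country="GB") produces a body you can pass straight to requests.post(..., json=payload).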

`/v1/batches`: bulk URL processing

When you have a list of URLs, say 100 to 10,000, submitting them one by one through /v1/scrapes is inefficient. The /v1/batches endpoint accepts a single POST request containing all your URLs and processes them in parallel on Olostep's infrastructure.

batch_payload = {
    "items": [
        {"custom_id": "page_1", "url": "https://example.com/page-1"},
        {"custom_id": "page_2", "url": "https://example.com/page-2"},
        {"custom_id": "page_3", "url": "https://example.com/page-3"}
    ]
}

response = requests.post(
    "https://api.olostep.com/v1/batches",
    json=batch_payload,
    headers=headers
)
response.raise_for_status()
batch_id = response.json()["id"]

Each item needs a custom_id (a unique string you define) and a url. The batch processes in parallel and typically completes in 5 to 8 minutes regardless of size. You check the status by polling, which I will cover in the pipeline section.

One thing to know: new accounts may start with a lower batch limit than the 10,000 maximum. Contact Olostep's support team to confirm your account's current limit and request an increase if needed.
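If your list is longer than your account's limit, one option is to split it into several batches and submit them one after another. A minimal sketch follows; chunk_items is a hypothetical helper, and the limit you pass in is whatever Olostep confirms for your account.

```python
from typing import Iterator


def chunk_items(items: list, max_batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of items, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]
```

Each yielded chunk can be wrapped as {"items": chunk} and posted to /v1/batches as shown above.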

`/v1/answers`: natural language web queries

This endpoint works differently from the other two. Instead of providing a URL, you provide a natural language question. The API searches the web, cross-references sources, and returns an answer with citations.

answer_payload = {
    "task": "What is the current market cap of NVIDIA?",
    "json": {
        "company": "",
        "market_cap": "",
        "currency": "",
        "source": ""
    }
}

response = requests.post(
    "https://api.olostep.com/v1/answers",
    json=answer_payload,
    headers=headers
)

The json parameter lets you define the structure of the response. Pass an object with empty string values as a schema template, and the API returns data that matches those fields. If the API cannot find a reliable answer for a field, it returns "NOT_FOUND" instead of guessing, which is useful for data validation pipelines.
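For a validation step, you can check which template fields came back as "NOT_FOUND" before trusting the record. The sketch below works on the structured answer once you have it as a Python dict. Where that dict sits inside the /v1/answers response is not covered here, so pulling it out is left to you, and missing_fields is a name I made up.

```python
NOT_FOUND = "NOT_FOUND"


def missing_fields(answer: dict) -> list:
    """Return the keys whose value is NOT_FOUND, empty, or None."""
    return [
        key
        for key, value in answer.items()
        if value is None or value == "" or value == NOT_FOUND
    ]
```

A record with an empty result list can go on to the next stage; anything else is a candidate for a retry or manual review.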

This endpoint costs 20 credits per request, so it is best suited for research automation and enrichment tasks rather than high-volume extraction.

Building a Web Data Pipeline in Python

With the individual endpoints covered, let me put them together into a working pipeline.

End-to-end batch data pipeline flow. Image by Author.

Defining your target URLs

Start with a list of URLs you want to process. In a real project, these might come from a database, a sitemap, or a prior discovery step. For this example, I will use a static list:

target_urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]

If you need to discover URLs programmatically, Olostep also offers a /v1/searches endpoint for web search and a /v1/maps endpoint for listing all URLs on a domain. Both can feed into a batch job, though they are outside the scope of this guide.

Submitting a batch job

Build the items array from your URL list and submit it. I am using a simple hash of the URL as the custom_id to keep each item unique:

import hashlib

items = [
    {
        "custom_id": hashlib.sha256(url.encode()).hexdigest()[:12],
        "url": url
    }
    for url in target_urls
]

batch_payload = {"items": items}
response = requests.post(
    "https://api.olostep.com/v1/batches",
    json=batch_payload,
    headers=headers
)
response.raise_for_status()
batch_id = response.json()["id"]
print(f"Batch submitted: {batch_id}")
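Two details of the hashing approach are easy to miss: a URL that appears twice in the list produces the same custom_id twice, and you will want a way to map IDs back to URLs later. The sketch below handles both. build_batch_items is a hypothetical helper, and the collision check is purely defensive, since two different URLs sharing a 12-character prefix is very unlikely.

```python
import hashlib


def build_batch_items(urls: list) -> tuple:
    """Return (items, id_to_url), dropping duplicate URLs along the way."""
    id_to_url = {}
    for url in urls:
        custom_id = hashlib.sha256(url.encode()).hexdigest()[:12]
        existing = id_to_url.get(custom_id)
        if existing is not None and existing != url:
            # Different URLs, same truncated hash: refuse rather than overwrite.
            raise ValueError(f"custom_id collision: {existing} vs {url}")
        id_to_url[custom_id] = url
    items = [{"custom_id": cid, "url": url} for cid, url in id_to_url.items()]
    return items, id_to_url
```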

Polling for completion

Batch jobs are asynchronous, so you need to poll the status endpoint until the job finishes. One thing to watch: add a timeout so your script does not hang if something goes wrong:

import time

def wait_for_batch(batch_id: str, timeout: int = 600) -> None:
    """Poll batch status with a timeout guard."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status_response = requests.get(
            f"https://api.olostep.com/v1/batches/{batch_id}",
            headers=headers
        )
        status_response.raise_for_status()
        status = status_response.json()["status"]

        if status == "completed":
            print("Batch completed.")
            return
        elif status != "in_progress":
            raise RuntimeError(f"Unexpected batch status: {status}")

        time.sleep(30)

    raise TimeoutError(f"Batch {batch_id} did not complete within {timeout}s")

wait_for_batch(batch_id)

I set the default timeout to 600 seconds since that gives enough headroom over the typical completion time I mentioned earlier.
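If you want to unit-test the polling logic without calling the API, one option is to pass the status lookup in as a function. The variant below is a sketch of that idea, not a replacement for the function above. poll_until_complete and its parameters are my own names, and it uses time.monotonic for the deadline so that system clock adjustments do not affect it.

```python
import time
from typing import Callable


def poll_until_complete(
    get_status: Callable[[], str],
    timeout: float = 600,
    interval: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call get_status() until it returns "completed"; return the poll count."""
    deadline = clock() + timeout
    polls = 0
    while clock() < deadline:
        polls += 1
        status = get_status()
        if status == "completed":
            return polls
        if status != "in_progress":
            raise RuntimeError(f"Unexpected batch status: {status}")
        sleep(interval)
    raise TimeoutError(f"Batch did not complete within {timeout}s")
```

In the pipeline, get_status would be a small function that performs the GET request on /v1/batches/{batch_id} and returns the status field.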

Retrieving and structuring the results

Once the batch is complete, fetch the results. Each item includes a retrieve_id that you pass to the /v1/retrieve endpoint to get the actual content:

import pandas as pd

# Get batch items
items_response = requests.get(
    f"https://api.olostep.com/v1/batches/{batch_id}/items",
    headers=headers
)
items_response.raise_for_status()
batch_items = items_response.json()["items"]

# Retrieve content for each item
results = []
for item in batch_items:
    if not item.get("retrieve_id"):
        print(f"Skipping item with no retrieve_id: {item['custom_id']}")
        continue

    retrieve_response = requests.get(
        "https://api.olostep.com/v1/retrieve",
        params={
            "retrieve_id": item["retrieve_id"],
            "formats": ["markdown"]
        },
        headers=headers
    )
    retrieve_response.raise_for_status()
    content = retrieve_response.json()

    results.append({
        "custom_id": item["custom_id"],
        "url": item["url"],
        "markdown": content.get("markdown_content", "")
    })

# Load into DataFrame and export
df = pd.DataFrame(results)
df.to_csv("scraped_results.csv", index=False)
print(f"Saved {len(df)} results to scraped_results.csv")

Items without a retrieve_id have failed. Olostep does not charge for failed requests, but you still want to log them for review.
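A simple way to keep track of failures is to split the item list in two and turn the failed half back into a payload you can resubmit. This is a minimal sketch assuming only the item keys used above (custom_id, url, retrieve_id). The function names are hypothetical.

```python
def split_batch_items(batch_items: list) -> tuple:
    """Return (succeeded, failed) based on whether retrieve_id is present."""
    succeeded = [item for item in batch_items if item.get("retrieve_id")]
    failed = [item for item in batch_items if not item.get("retrieve_id")]
    return succeeded, failed


def to_retry_items(failed: list) -> list:
    """Reduce failed items to the custom_id/url pairs a new batch expects."""
    return [{"custom_id": item["custom_id"], "url": item["url"]} for item in failed]
```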

Note: For larger batches, the items endpoint returns paginated results. Loop through pages using the cursor and limit query parameters until no cursor is returned. See the Olostep batch documentation for details.
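The pagination loop can be sketched as follows. I am assuming that each page is a JSON object with an items list and a cursor value that is absent or null on the last page. Check the batch documentation for the exact field names before relying on this. fetch_page is a function you supply that performs the GET request for a given cursor.

```python
from typing import Callable, Optional


def fetch_all_items(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> list:
    """Collect items across pages until the response carries no cursor."""
    items = []
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page.get("items", []))
        cursor = page.get("cursor")  # assumed key name
        if cursor is None:
            return items
    # Guard against a cursor that never runs out.
    raise RuntimeError(f"Stopped after {max_pages} pages without reaching the end")
```

In practice, fetch_page would wrap the items request shown earlier and pass cursor and limit through the params argument.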

Scheduling and automation

For recurring pipelines, you can schedule this script with standard tools. On Unix systems, cron is the simplest option. For more control, Apache Airflow or Prefect let you define the pipeline as a workflow (a DAG, in Airflow's terms) with dependency management and monitoring. Olostep also has a /v1/schedules endpoint for recurring scrapes, so you may not need a separate scheduler at all.

Working with Structured JSON Output

Extracting raw Markdown or HTML is useful, but many pipelines need structured data: product names, prices, ratings, dates. Olostep's llm_extract parameter lets you define a JSON schema and get back structured output instead of raw text.

Defining a custom extraction schema

To use structured extraction, add an llm_extract object to your /v1/scrapes request with a schema key. You also need to include "json" in the formats array:

payload = {
    "url_to_scrape": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "formats": ["json"],
    "llm_extract": {
        "schema": {
            "title": {"type": "string"},
            "price": {"type": "string"},
            "availability": {"type": "string"},
            "rating": {"type": "string"}
        }
    }
}

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()
data = response.json()

The schema is a JSON Schema-style object in which each key names a field you want extracted and its value describes the expected type, with no CSS selectors or XPath required.

Note: As of this writing, schema-based llm_extract requests may return a 500 extraction_error due to a known bug on Olostep's end. The Olostep team has confirmed the fix will be deployed within 24–36 hours and officially recommends omitting the schema key in the meantime: "llm_extract": {}. With this workaround, the API auto-extracts the fields it determines are most relevant. The companion GitHub repository uses this approach so the code runs without errors right now. Once the fix is live, you can add the schema key back to control exactly which fields are returned.

One cost consideration: llm_extract uses 20 credits per request compared to 1 credit for a standard scrape. If you are extracting the same type of data from the same site structure repeatedly, Olostep's pre-built parsers (available in their Parsers Store) cost only 1 to 5 credits and auto-update when the site changes its layout.
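To make the trade-off concrete, here is a back-of-the-envelope estimator using the per-request figures quoted in this guide (1 credit per scrape, 20 per llm_extract request, 1 to 5 per parser request). Treat the constants as a snapshot that may change with pricing. The function names are mine.

```python
CREDITS_PER_SCRAPE = 1        # figure quoted in this guide
CREDITS_PER_LLM_EXTRACT = 20  # figure quoted in this guide
PARSER_CREDIT_RANGE = (1, 5)  # figure quoted in this guide


def estimate_credits(scrapes: int = 0, llm_extracts: int = 0) -> int:
    """Estimate total credits for a mix of plain and llm_extract requests."""
    return scrapes * CREDITS_PER_SCRAPE + llm_extracts * CREDITS_PER_LLM_EXTRACT


def parser_credit_range(pages: int) -> tuple:
    """Return the (lowest, highest) credit cost for parser-based extraction."""
    low, high = PARSER_CREDIT_RANGE
    return pages * low, pages * high
```

For 1,000 product pages, that works out to 20,000 credits with llm_extract versus somewhere between 1,000 and 5,000 with a pre-built parser.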

Use case: scraping structured e-commerce data

Here is a practical example that extracts product details and handles the response correctly:

import json
import pandas as pd

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()
data = response.json()

# json_content is a JSON-encoded string, not a dict
raw_json = data["result"]["json_content"]
parsed = json.loads(raw_json)

print(parsed)
# {"title": "A Light in the Attic", "price": "£51.77", ...}

Structured JSON extraction from e-commerce. Image by Author.

The json.loads() step is not optional. If you pass raw_json directly to the DataFrame constructor, pandas raises a ValueError because it is a string, not a dictionary. If you wrap it in a list first, you silently get a single column holding the raw text instead of one column per field. This trips up many developers the first time they work with this endpoint.
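If you would rather not remember this at every call site, a small wrapper can accept either form and fail with a clear message otherwise. This is a sketch; parse_json_content is a hypothetical helper, and accepting a dict is only a convenience in case the value has already been decoded.

```python
import json


def parse_json_content(raw) -> dict:
    """Return json_content as a dict, decoding it first if it is a string."""
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        raise TypeError(f"Expected str or dict, got {type(raw).__name__}")
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("json_content did not decode to a JSON object")
    return parsed
```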

Loading structured output into pandas

Once parsed, the data loads directly into a DataFrame. The simple form works when all fields are flat:

df = pd.DataFrame([parsed])
print(df)

If llm_extract returns nested fields, switch to pd.json_normalize(), which flattens the structure into columns automatically:

df = pd.json_normalize([parsed])
df.to_csv("products.csv", index=False)
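To see what that flattening does to the column names, here is a plain-Python illustration of the same idea: nested dictionaries become dotted keys such as price.amount. This is not how pandas implements it, only a sketch of the naming scheme for nested dicts. flatten_record is my own helper, and the nested sample record in the usage note is made up.

```python
def flatten_record(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level with dotted key names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse so that {"price": {"amount": 1}} becomes "price.amount".
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat
```

A record like {"title": "A", "price": {"amount": "51.77", "currency": "GBP"}} ends up with the keys title, price.amount, and price.currency.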

When to Still Use Traditional Scraping

Web data APIs are not the right tool for every situation. I want to be honest about the cases where writing your own scraping script makes more sense.

If you are doing a one-time extraction from a simple, static HTML page, a five-line BeautifulSoup script costs nothing and runs in seconds. And if you are working at very low volume against sites with no anti-bot protection, the overhead of setting up API authentication is harder to justify.

Here is a quick comparison to help you decide:

Scenario	Traditional scraping	Web data API
One-time, static page	Quick and free	Overkill
Recurring production pipeline	Fragile; maintenance-heavy	Fits well
JavaScript-heavy pages	Requires headless browser setup	Handled automatically
Anti-bot protected sites	Requires proxy management	Handled automatically
Over 100 URLs at once	Complex concurrency logic	Single batch request
Proof of concept	Good starting point	Free tier covers it

The decision usually comes down to whether you are building something that runs once or something that runs repeatedly. For one-off jobs, traditional scraping is simpler. For anything you need to maintain, the API approach is easier to keep running.

Best Practices and Common Pitfalls

A few things make the difference between a script that works locally and one that holds up in production. These come from patterns I have seen across scraping and API-based projects.

Respect robots.txt and terms of service. Olostep's terms put the responsibility for compliance on you. Before scraping any site, check its robots.txt and terms of service. The file is not legally binding everywhere, but ignoring it can lead to IP blocks. GDPR and CCPA requirements apply regardless of how you collect the data.
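Python's standard library can do the robots.txt check for you. The sketch below parses rules from a string so it can be tested offline. The wrapper name is_allowed is mine, and in a real pipeline you would load the file with RobotFileParser.set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Running this over your target_urls before submitting a batch lets you drop disallowed pages up front.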

Store retrieve_id values for content reuse. As I mentioned earlier, every response includes a retrieve_id. If you need the same content in a different format later, call /v1/retrieve with that ID instead of re-scraping. As of this writing, retrieved content is stored for 7 days. Contact Olostep support to verify the current retention policy before relying on long-term availability.
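One lightweight way to do this is a JSON file keyed by URL that also records when each ID was saved, so stale entries can be ignored. This is a sketch under two assumptions: the 7-day figure quoted above still holds, and a local file is good enough for your setup. The function names and the file layout are my own.

```python
import json
import time
from pathlib import Path

RETENTION_SECONDS = 7 * 24 * 60 * 60  # 7 days, the figure quoted in this guide


def save_retrieve_id(store_path, url, retrieve_id, now=None):
    """Record a retrieve_id for url together with the time it was saved."""
    path = Path(store_path)
    store = json.loads(path.read_text()) if path.exists() else {}
    saved_at = now if now is not None else time.time()
    store[url] = {"retrieve_id": retrieve_id, "saved_at": saved_at}
    path.write_text(json.dumps(store, indent=2))


def load_retrieve_id(store_path, url, now=None):
    """Return the stored retrieve_id, or None if missing or past retention."""
    path = Path(store_path)
    if not path.exists():
        return None
    entry = json.loads(path.read_text()).get(url)
    if entry is None:
        return None
    age = (now if now is not None else time.time()) - entry["saved_at"]
    return entry["retrieve_id"] if age < RETENTION_SECONDS else None
```

When load_retrieve_id returns None, fall back to a fresh scrape and save the new ID.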

Use country targeting when content varies by location. As I covered earlier, the country parameter routes the request through a residential IP in the region you specify. This matters when the page serves different prices, languages, or product availability depending on geography.

Add retry logic for production pipelines. Network errors and rate limits are inevitable at scale. Instead of wrapping every call in a manual try/except loop, use the tenacity library to add exponential backoff with a decorator:

import os

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException)
)
def scrape_url(url: str) -> dict:
    """Scrape a URL with automatic retry on failure."""
    api_key = os.environ.get("OLOSTEP_API_KEY")
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"url_to_scrape": url, "formats": ["markdown"]}
    response = requests.post(
        "https://api.olostep.com/v1/scrapes",
        json=payload,
        headers=headers,
        # Assumed value; without a timeout a stalled connection never raises,
        # so the retry would never fire.
        timeout=60
    )
    response.raise_for_status()
    return response.json()

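This makes up to 3 attempts in total, waiting between attempts with an exponential backoff clamped to the range of 2 to 30 seconds. It retries on any RequestException subclass, which covers network failures and, because raise_for_status() raises HTTPError, every 4xx and 5xx response. That includes 429 and 5xx errors, but also errors such as 401 that a retry will not fix.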
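The decorator above retries on every RequestException, and raise_for_status() turns any 4xx or 5xx response into one. If you would rather retry only on rate limits and server errors, a predicate like the one below is one way to decide. It is a sketch: the function names are mine, and it relies only on the response attribute that requests attaches to its exceptions, which is None for connection-level failures.

```python
def is_retryable_status(status_code: int) -> bool:
    """Retry on rate limiting (429) and server errors (5xx) only."""
    return status_code == 429 or 500 <= status_code < 600


def is_retryable_error(exc: Exception) -> bool:
    """Decide whether a failed request is worth another attempt."""
    response = getattr(exc, "response", None)
    status = getattr(response, "status_code", None)
    if status is None:
        # No HTTP response at all: treat it as a network-level failure.
        return True
    return is_retryable_status(status)
```

tenacity's retry_if_exception accepts a predicate of this shape, so it can be used in place of retry_if_exception_type in the decorator.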

Start with single-page scrapes, then scale to batches. Test your extraction logic on a few URLs through /v1/scrapes before submitting a batch of thousands. This saves credits and helps you catch schema issues early. The Olostep documentation recommends using /v1/scrapes in parallel for fewer than 50 URLs and switching to /v1/batches for anything larger.
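For the under-50 case, running /v1/scrapes calls in parallel takes only a few lines with the standard library. The sketch below takes the scraping function as an argument, so it works with scrape_url from the retry example or any stand-in. scrape_in_parallel is a hypothetical helper, and max_workers=5 is an arbitrary default to tune against your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def scrape_in_parallel(
    urls: list,
    scrape_one: Callable[[str], dict],
    max_workers: int = 5,
) -> tuple:
    """Run scrape_one over urls concurrently; return (results, errors) by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going so that one bad URL does not sink the whole run.
                errors[url] = exc
    return results, errors
```

Collecting errors separately keeps the failures available for logging, in the same spirit as the failed batch items earlier.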

Conclusion

Traditional scraping still works for small, one-off tasks, and I do not think it is going away. But for production pipelines, recurring extraction jobs, and anything that needs to handle JavaScript rendering or anti-bot protections at scale, the API approach cuts down on a lot of the infrastructure you would otherwise need to manage.

In this guide, I covered single-page extraction with /v1/scrapes, batch processing with /v1/batches, question-based web queries with /v1/answers, structured JSON extraction with llm_extract, and production patterns for polling, retries, and error handling.

If you want to go further, Olostep's /v1/crawls endpoint handles full-site crawling with depth and page limits, and works well for ingesting entire documentation sites into RAG knowledge bases.

From Web Scraping Scripts to Web Data APIs: A Practical Python Guide

Khalid Abdelaty — Fri, 27 Mar 2026 15:55:25 +0000

This guide uses Olostep as the API provider to walk through practical Python examples. By the end, you will have tested, working code that runs against a real API.

All code examples in this article are available in the companion GitHub repository: olostep-python-guide.

Web data API request flow. Image by Author.

What Is a Web Data API?

The problem with traditional scraping

The target site redesigns its layout or changes CSS class names, and your selectors stop matching overnight.
The site deploys anti-bot systems like Cloudflare Turnstile or DataDome, or triggers CAPTCHA challenges after a handful of requests.
Content loads dynamically through JavaScript, so the HTML returned by a basic HTTP request is an empty shell.
You need to scrape hundreds or thousands of pages, which means managing proxy rotation, concurrency, and rate limiting on your own.

What web data APIs solve

Who uses web data APIs?

Getting Started with Olostep

With that background covered, let me walk through a practical setup using Olostep.

Creating an account and getting your API key

Head to olostep.com and create an account. The free tier gives you 500 successful requests per month, which is enough to follow along with this guide.

Once you are logged in, find your API key in the dashboard. You will need it for every request.

Dashboard API key retrieval section. Image by Author.

Store it as an environment variable rather than hardcoding it in your scripts:

export OLOSTEP_API_KEY="your_api_key_here"

On the Python side, you only need the requests library:

pip install requests

Your first API scrape

Here is a minimal example that scrapes a single URL and returns the content as Markdown and HTML. I will break down each part after the code.

import os
import requests

api_key = os.environ.get("OLOSTEP_API_KEY")
endpoint = "https://api.olostep.com/v1/scrapes"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

payload = {
    "url_to_scrape": "https://books.toscrape.com/",
    "formats": ["markdown", "html"]
}

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()

data = response.json()
print(data["result"]["markdown_content"][:500])

Understanding the API response

retrieve_id = data["retrieve_id"]  # Top-level field

The actual extracted content lives inside the result object:

markdown = data["result"]["markdown_content"]
html = data["result"]["html_content"]

Core Endpoints and When to Use Them

Olostep provides several endpoints, but three cover most of the extraction jobs you will run. Each one suits a different scenario, so picking the right one matters.

Choosing the right API endpoint. Image by Author.

`/v1/scrapes`: single-page extraction

The key optional parameters are:

wait_before_scraping (integer, milliseconds): Tells the API to pause before extracting, giving JavaScript time to render. This replaces any selector-based waiting you might expect from browser automation tools.
country: Routes the request through a residential IP in a specific country. Useful when the page serves different content based on location. Supported values include US, GB, JP, IN, and others.
actions: An array of browser interactions (click, scroll, fill input, wait) that the API executes before extraction. This covers scenarios where content is behind a button click or requires scrolling to load.

Each scrape costs 1 credit on the base plan.

`/v1/batches`: bulk URL processing

batch_payload = {
    "items": [
        {"custom_id": "page_1", "url": "https://example.com/page-1"},
        {"custom_id": "page_2", "url": "https://example.com/page-2"},
        {"custom_id": "page_3", "url": "https://example.com/page-3"}
    ]
}

response = requests.post(
    "https://api.olostep.com/v1/batches",
    json=batch_payload,
    headers=headers
)
response.raise_for_status()
batch_id = response.json()["id"]

One thing to know: new accounts may start with a lower batch limit than the 10,000 maximum. Contact Olostep's support team to confirm your account's current limit and request an increase if needed.

`/v1/answers`: natural language web queries

answer_payload = {
    "task": "What is the current market cap of NVIDIA?",
    "json": {
        "company": "",
        "market_cap": "",
        "currency": "",
        "source": ""
    }
}

response = requests.post(
    "https://api.olostep.com/v1/answers",
    json=answer_payload,
    headers=headers
)

This endpoint costs 20 credits per request, so it is best suited for research automation and enrichment tasks rather than high-volume extraction.

Building a Web Data Pipeline in Python

With the individual endpoints covered, let me put them together into a working pipeline.

End-to-end batch data pipeline flow. Image by Author.

Defining your target URLs

Start with a list of URLs you want to process. In a real project, these might come from a database, a sitemap, or a prior discovery step. For this example, I will use a static list:

target_urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]

Submitting a batch job

Build the items array from your URL list and submit it. I am using a simple hash of the URL as the custom_id to keep each item unique:

import hashlib

items = [
    {
        "custom_id": hashlib.sha256(url.encode()).hexdigest()[:12],
        "url": url
    }
    for url in target_urls
]

batch_payload = {"items": items}
response = requests.post(
    "https://api.olostep.com/v1/batches",
    json=batch_payload,
    headers=headers
)
response.raise_for_status()
batch_id = response.json()["id"]
print(f"Batch submitted: {batch_id}")

Polling for completion

Batch jobs are asynchronous, so you need to poll the status endpoint until the job finishes. One thing to watch: add a timeout so your script does not hang if something goes wrong:

import time

def wait_for_batch(batch_id: str, timeout: int = 600) -> None:
    """Poll batch status with a timeout guard."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status_response = requests.get(
            f"https://api.olostep.com/v1/batches/{batch_id}",
            headers=headers
        )
        status_response.raise_for_status()
        status = status_response.json()["status"]

        if status == "completed":
            print("Batch completed.")
            return
        elif status != "in_progress":
            raise RuntimeError(f"Unexpected batch status: {status}")

        time.sleep(30)

    raise TimeoutError(f"Batch {batch_id} did not complete within {timeout}s")

wait_for_batch(batch_id)

I set the default timeout to 600 seconds since that gives enough headroom over the typical completion time I mentioned earlier.

Retrieving and structuring the results

Once the batch is complete, fetch the results. Each item includes a retrieve_id that you pass to the /v1/retrieve endpoint to get the actual content:

import json
import pandas as pd

# Get batch items
items_response = requests.get(
    f"https://api.olostep.com/v1/batches/{batch_id}/items",
    headers=headers
)
items_response.raise_for_status()
batch_items = items_response.json()["items"]

# Retrieve content for each item
results = []
for item in batch_items:
    if not item.get("retrieve_id"):
        print(f"Skipping item with no retrieve_id: {item['custom_id']}")
        continue

    retrieve_response = requests.get(
        "https://api.olostep.com/v1/retrieve",
        params={
            "retrieve_id": item["retrieve_id"],
            "formats": ["markdown"]
        },
        headers=headers
    )
    retrieve_response.raise_for_status()
    content = retrieve_response.json()

    results.append({
        "custom_id": item["custom_id"],
        "url": item["url"],
        "markdown": content.get("markdown_content", "")
    })

# Load into DataFrame and export
df = pd.DataFrame(results)
df.to_csv("scraped_results.csv", index=False)
print(f"Saved {len(df)} results to scraped_results.csv")

Items without a retrieve_id have failed. Olostep does not charge for failed requests, but you still want to log them for review.

Note: For larger batches, the items endpoint returns paginated results. Loop through pages using the cursor and limit query parameters until no cursor is returned. See the Olostep batch documentation for details.

Scheduling and automation

Working with Structured JSON Output

Defining a custom extraction schema

To use structured extraction, add an llm_extract object to your /v1/scrapes request with a schema key. You also need to include "json" in the formats array:

payload = {
    "url_to_scrape": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "formats": ["json"],
    "llm_extract": {
        "schema": {
            "title": {"type": "string"},
            "price": {"type": "string"},
            "availability": {"type": "string"},
            "rating": {"type": "string"}
        }
    }
}

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()
data = response.json()

The schema is a JSON Schema object where each key defines a field you want extracted, with no CSS selectors or XPath required.

Note: As of this writing, schema-based llm_extract requests may return a 500 extraction_error due to a known bug on Olostep's end. The Olostep team has confirmed the fix will be deployed within 24–36 hours and officially recommends omitting the schema key in the meantime: "llm_extract": {}. With this workaround, the API auto-extracts the fields it determines are most relevant. The companion GitHub repository uses this approach so the code runs without errors right now. Once the fix is live, you can add the schema key back to control exactly which fields are returned.

Use case: scraping structured e-commerce data

Here is a practical example that extracts product details and handles the response correctly:

import json
import pandas as pd

response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()
data = response.json()

# json_content is a JSON-encoded string, not a dict, so decode it before use
raw_json = data["result"]["json_content"]
parsed = json.loads(raw_json)

print(parsed)
# {"title": "A Light in the Attic", "price": "£51.77", ...}

Structured JSON extraction from e-commerce. Image by Author.
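
Since `json_content` arrives as a string, a missing or malformed value fails in ways that are hard to read. The helper below is a defensive sketch, not part of the Olostep API: `parse_json_content` is a hypothetical name, and accepting an already-decoded dict is an assumption added for robustness.

```python
import json


def parse_json_content(data: dict) -> dict:
    """Return the extracted fields from a scrape response as a dict."""
    raw = data.get("result", {}).get("json_content")
    if raw is None:
        raise ValueError("response has no result.json_content")
    if isinstance(raw, dict):
        # Assumption: tolerate a value that is already decoded
        return raw
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"json_content is not valid JSON: {exc}") from exc


# Hypothetical response shaped like the example above
sample_response = {
    "result": {"json_content": '{"title": "A Light in the Attic", "rating": "Three"}'}
}
parsed_sample = parse_json_content(sample_response)
print(parsed_sample["title"])
```

Raising a single `ValueError` for both cases keeps the calling code to one `except` clause per page.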

Loading structured output into pandas

Once parsed, the data loads directly into a DataFrame. The simple form works when all fields are flat:

df = pd.DataFrame([parsed])
print(df)

If llm_extract returns nested fields, switch to pd.json_normalize(), which flattens the structure into columns automatically:

df = pd.json_normalize([parsed])
df.to_csv("products.csv", index=False)
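
To see what flattening does to a nested record, here is a standard-library sketch of the same idea. It is not pandas code: `flatten` is a hypothetical helper that mimics how `pd.json_normalize()` turns nested dictionaries into dot-separated column names, and the nested record is invented for illustration.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level with dotted key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dicts; lists and scalars are kept as values
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested extraction result
nested = {
    "title": "A Light in the Attic",
    "price": {"amount": "51.77", "currency": "GBP"},
}
flat_record = flatten(nested)
print(sorted(flat_record))
# ['price.amount', 'price.currency', 'title']
```

Each dotted key corresponds to one column in the normalized DataFrame.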

When to Still Use Traditional Scraping

Web data APIs are not the right tool for every situation. I want to be honest about the cases where writing your own scraping script makes more sense.

Here is a quick comparison to help you decide:

Scenario	Traditional scraping	Web data API
One-time, static page	Quick and free	Overkill
Recurring production pipeline	Fragile; maintenance-heavy	Fits well
JavaScript-heavy pages	Requires headless browser setup	Handled automatically
Anti-bot protected sites	Requires proxy management	Handled automatically
Over 100 URLs at once	Complex concurrency logic	Single batch request
Proof of concept	Good starting point	Free tier covers it

Best Practices and Common Pitfalls

A few things make the difference between a script that works locally and one that holds up in production. These come from patterns I have seen across scraping and API-based projects.

import os

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException)
)
def scrape_url(url: str) -> dict:
    """Scrape a URL with automatic retry on failure."""
    # Read the key from the environment so it never lands in source control
    api_key = os.environ.get("OLOSTEP_API_KEY")
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"url_to_scrape": url, "formats": ["markdown"]}
    response = requests.post(
        "https://api.olostep.com/v1/scrapes",
        json=payload,
        headers=headers,
        timeout=60  # without a timeout, a stalled connection can hang indefinitely
    )
    # Raises HTTPError (a RequestException subclass) on 4xx/5xx responses
    response.raise_for_status()
    return response.json()

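With stop_after_attempt(3), the function runs at most three times in total, so there are at most two waits between attempts. wait_exponential grows the delay exponentially and keeps it between 2 and 30 seconds. Because raise_for_status() raises HTTPError, a subclass of RequestException, the retry covers network failures as well as any 4xx or 5xx response, including HTTP 429 and 5xx errors.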
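
Because raise_for_status() raises for every 4xx and 5xx status, a rule keyed on RequestException also retries errors that will not go away on their own, such as 401 or 404. The sketch below shows one way to retry only transient statuses, written with the standard library rather than tenacity. The names `HTTPStatusError`, `call_with_retry`, and `RETRYABLE_STATUSES` are hypothetical, and the choice of retryable codes is an assumption.

```python
import time

# Assumption: statuses worth retrying; adjust for the API you call
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retry(func, attempts=3, base_delay=2.0, max_delay=30.0, sleep=time.sleep):
    """Call func, retrying transient HTTP errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except HTTPStatusError as exc:
            # Give up on permanent errors and on the final attempt
            if exc.status_code not in RETRYABLE_STATUSES or attempt == attempts:
                raise
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


# Demo: fail twice with 503, then succeed; record delays instead of sleeping
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPStatusError(503)
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
# ok [2.0, 4.0]
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.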

Conclusion

If you want to go further, Olostep's /v1/crawls endpoint handles full-site crawling with depth and page limits, and works well for ingesting entire documentation sites into RAG knowledge bases.
