DEV Community

agenthustler
agenthustler

Posted on • Edited on

Scraping API Documentation: Building Auto-Generated SDK Wrappers

API documentation follows predictable patterns — endpoints, parameters, response schemas. What if you could scrape docs from any API and auto-generate a Python SDK? That's exactly what we'll build.

The Problem

Every SaaS product has an API. Few have good SDKs. Developers waste hours reading docs and writing boilerplate HTTP calls. Automating this saves massive time.

Architecture

  1. Scrape API documentation pages
  2. Parse endpoint definitions (method, path, params, response)
  3. Generate typed Python SDK classes
  4. Output ready-to-use wrapper code

Scraping Documentation Pages

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
Enter fullscreen mode Exit fullscreen mode

ScraperAPI renders JavaScript — critical for modern docs platforms like ReadMe, GitBook, and Swagger UI that load content dynamically.

Extracting Endpoint Definitions

Most docs follow a pattern: HTTP method + path + parameter table + response example.

def extract_endpoints(soup):
    endpoints = []
    for section in soup.find_all(["div", "section"], class_=re.compile(
        r"(endpoint|api-method|operation)", re.I
    )):
        method_el = section.find(class_=re.compile(r"(method|verb|http-method)"))
        path_el = section.find(class_=re.compile(r"(path|url|endpoint-url)"))
        if not method_el or not path_el:
            continue
        endpoint = {
            "method": method_el.get_text(strip=True).upper(),
            "path": path_el.get_text(strip=True),
            "params": extract_params(section),
            "description": extract_description(section),
        }
        endpoints.append(endpoint)
    return endpoints

def extract_params(section):
    params = []
    table = section.find("table")
    if table:
        rows = table.find_all("tr")[1:]
        for row in rows:
            cols = row.find_all("td")
            if len(cols) >= 3:
                params.append({
                    "name": cols[0].get_text(strip=True),
                    "type": cols[1].get_text(strip=True),
                    "required": "required" in cols[2].get_text(strip=True).lower(),
                    "description": cols[2].get_text(strip=True),
                })
    return params
Enter fullscreen mode Exit fullscreen mode

Generating the SDK

def generate_sdk(api_name, base_url, endpoints):
    class_name = api_name.replace(" ", "").replace("-", "") + "Client"
    lines = [
        f'"""Auto-generated SDK for {api_name}"""',
        "import requests",
        "from typing import Optional, Any",
        "",
        f"class {class_name}:",
        f'    def __init__(self, api_key: str, base_url: str = "{base_url}"):',
        "        self.api_key = api_key",
        "        self.base_url = base_url.rstrip('/')",
        '        self.session = requests.Session()',
        '        self.session.headers["Authorization"] = f"Bearer {api_key}"',
    ]
    for ep in endpoints:
        method_name = path_to_method_name(ep["path"], ep["method"])
        func_params = build_func_params(ep["params"])
        lines.extend(generate_method(ep, method_name, func_params))
    return "\n".join(lines)

def path_to_method_name(path, method):
    clean = re.sub(r"[{}]", "", path)
    parts = [p for p in clean.strip("/").split("/") if p]
    prefix = {"GET": "get", "POST": "create", "PUT": "update",
               "PATCH": "update", "DELETE": "delete"}.get(method, method.lower())
    return f"{prefix}_{'_'.join(parts[-2:])}"
Enter fullscreen mode Exit fullscreen mode

Handling OpenAPI/Swagger Specs

Many APIs publish OpenAPI specs — parse those directly for better accuracy:

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
Enter fullscreen mode Exit fullscreen mode

Putting It Together

def build_sdk_from_docs(docs_url, api_name, base_url):
    soup = scrape_docs(docs_url)
    endpoints = extract_endpoints(soup)
    print(f"Found {len(endpoints)} endpoints")
    sdk_code = generate_sdk(api_name, base_url, endpoints)
    output_file = f"{api_name.lower().replace(' ', '_')}_client.py"
    with open(output_file, "w") as f:
        f.write(sdk_code)
    return sdk_code
Enter fullscreen mode Exit fullscreen mode

For scraping docs from dozens of SaaS providers, use ThorData for proxy rotation and ScrapeOps for monitoring.

Production Enhancements

  • Add response type hints from example responses
  • Generate async versions using httpx
  • Add retry logic and rate limiting
  • Publish to PyPI automatically with version bumps

Auto-generating SDKs from documentation turns hours of boilerplate into seconds. Combine web scraping with code generation, and you've got a tool that writes Python clients for any API.

Happy scraping!

Top comments (0)