How I built a tool that turns any website into a REST API automatically
Most websites don't have a public API. If you want their data, you either scrape it manually with CSS selectors — which breaks every time the site updates — or you pay for a cloud scraping service.
I wanted a third option: point a CLI at any URL and get a fully working REST API back, automatically. No selectors. No config. No code.
Here's how WebSnap works under the hood.
The problem with traditional scraping
The standard approach looks like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
titles = soup.select('article.product_pod h3 a')  # brittle: tied to exact classes
prices = soup.select('p.price_color')
This works — until the site changes its CSS classes. Then everything breaks and you rewrite the selectors.
WebSnap takes a different approach: instead of targeting specific selectors, it analyzes the structure of the DOM and finds patterns automatically.
Step 1: Render the page properly
The first problem is that modern websites render content with JavaScript. A plain requests.get() gives you an empty shell. WebSnap uses Playwright to launch a real Chromium browser:
async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)
    page = await browser.new_page()
    await page.goto(url, wait_until="networkidle")
    await scroll_to_bottom(page)  # triggers lazy loading
    html = await page.content()
The scroll_to_bottom step is important — many sites only load content when you scroll down, so we simulate that before capturing the DOM.
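The helper itself isn't shown above, but a minimal sketch might look like the following. The loop bound and pause are arbitrary illustrative choices; the idea is to keep scrolling until the page height stops growing, i.e. no more lazy-loaded content is arriving.

```python
import asyncio

async def scroll_to_bottom(page, max_rounds: int = 10, pause_s: float = 0.5):
    """Scroll until the page height stops growing (or max_rounds is hit)."""
    last_height = 0
    for _ in range(max_rounds):
        height = await page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # no new lazy-loaded content appeared
        last_height = height
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await asyncio.sleep(pause_s)  # give lazy loaders time to fire
```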
Step 2: Detect data collections via structural signatures
This is the core idea. Every DOM node has a structural signature: its tag name plus the tag names of its direct children.
def node_signature(tag):
    children = [c.name for c in tag.children if isinstance(c, Tag)]
    return (tag.name, tuple(children))
For a product card, the signature might look like:
('article', ('div', 'h3', 'p', 'p'))
Now we scan every parent node and count how many of its children share the same signature:
sig_counter = Counter(node_signature(c) for c in children)
dominant_sig, count = sig_counter.most_common(1)[0]
if count >= 5:  # found a data collection
    ...
If 5 or more siblings share a signature, that's almost certainly a list of data — products, articles, search results, whatever. No ML needed. Pure structure.
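To make the idea concrete, here is a stand-alone demonstration of signature counting. `Node` is a hypothetical stand-in for a BeautifulSoup `Tag` (just a name and children) so the sketch runs without fetching a real page; the counting logic mirrors the snippets above.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a bs4 Tag: a name plus child nodes.
    name: str
    children: list = field(default_factory=list)

def node_signature(tag: Node):
    return (tag.name, tuple(c.name for c in tag.children))

def find_collection(parent: Node, min_siblings: int = 5):
    """Return the dominant child signature if enough siblings share it."""
    sig_counter = Counter(node_signature(c) for c in parent.children)
    if not sig_counter:
        return None
    dominant_sig, count = sig_counter.most_common(1)[0]
    return dominant_sig if count >= min_siblings else None

# Six identical product cards plus one stray sidebar element:
card = lambda: Node("article", [Node("div"), Node("h3"), Node("p"), Node("p")])
listing = Node("section", [card() for _ in range(6)] + [Node("aside")])

print(find_collection(listing))
# → ('article', ('div', 'h3', 'p', 'p'))
```

The stray `aside` is ignored automatically because only the dominant signature is considered.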
Step 3: Infer field types
Once we have a collection, we extract fields from the first element as a template and infer their types:
# By field name
if 'price' in name: return 'float'
if 'image' in name: return 'image_url'
if 'link' in name: return 'url'
# By sample values
if re.search(r'[\$\€\£][\d,.]+', sample): return 'float'
if re.match(r'^https?://', sample): return 'url'
if re.match(r'^\d+$', sample): return 'integer'
This gets you clean typed fields like price_gbp: float instead of text_3: string.
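The two heuristics can be combined into one runnable function. The name and regex rules mirror the snippets above; the function name, precedence order, and `string` fallback are illustrative assumptions.

```python
import re

def infer_type(name: str, sample: str) -> str:
    """Guess a field type from its derived name, then from a sample value."""
    name = name.lower()
    # By field name
    if 'price' in name:
        return 'float'
    if 'image' in name:
        return 'image_url'
    if 'link' in name:
        return 'url'
    # By sample value
    if re.search(r'[$€£][\d,.]+', sample):
        return 'float'
    if re.match(r'^https?://', sample):
        return 'url'
    if re.match(r'^\d+$', sample):
        return 'integer'
    return 'string'

print(infer_type("product_price", "£51.77"))  # → float
print(infer_type("text_3", "42"))             # → integer
```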
Step 4: Generate the FastAPI server
With the schema inferred, WebSnap generates a complete Python file:
class Item(BaseModel):
    title: Optional[str] = None
    product_price: Optional[float] = None
    image_url: Optional[str] = None
    instock: Optional[str] = None

@app.get("/api/items")
def get_items(limit: int = 20, offset: int = 0):
    return DATA[offset : offset + limit]
With Swagger docs at /docs and CORS enabled out of the box.
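One way to sketch this generation step: render the inferred schema into FastAPI source with a plain string template. The template contents and type mapping below are illustrative assumptions, not WebSnap's actual template.

```python
# Hypothetical sketch of the generator: fill a string template
# with the inferred field names and types, then write server.py.
TYPE_MAP = {"float": "float", "integer": "int"}  # everything else -> str

TEMPLATE = '''\
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
DATA: list[dict] = {data!r}

class Item(BaseModel):
{fields}

@app.get("/api/items")
def get_items(limit: int = 20, offset: int = 0):
    return DATA[offset : offset + limit]
'''

def render_server(schema: dict[str, str], data: list[dict]) -> str:
    fields = "\n".join(
        f"    {name}: Optional[{TYPE_MAP.get(ftype, 'str')}] = None"
        for name, ftype in schema.items()
    )
    return TEMPLATE.format(fields=fields, data=data)

source = render_server({"title": "string", "product_price": "float"}, [])
print(source)
```

Emitting a flat source file rather than serving dynamically means the result is a self-contained artifact you can read, edit, and deploy anywhere.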
The result
$ websnap generate https://books.toscrape.com
✓ API generated with 20 records
✓ Endpoint: /api/items
✓ Server saved to server.py
$ python3 server.py
$ curl http://localhost:3000/api/items
[
  {
    "title": "A Light in the Attic",
    "product_price": 51.77,
    "image_url": "media/cache/2c/da/...",
    "instock": "In stock"
  },
  ...
]
From URL to working REST API in under 10 seconds.
What's next
The current version works well for simple listing pages. The roadmap includes:
- Multi-page crawling — follow pagination automatically
- LLM-powered naming — use a local model to give fields smarter names
- OpenAPI YAML export — proper spec file alongside the server
- TypeScript SDK generation — autogenerated client
The project is open source and early stage — feedback and contributions very welcome.
GitHub: https://github.com/uunaign/websnap
Built with Python, Playwright, BeautifulSoup4, FastAPI, and Pydantic.