How I built a tool that turns any website into a REST API automatically
Most websites don't have a public API. If you want their data, you either scrape it manually with CSS selectors — which breaks every time the site updates — or you pay for a cloud scraping service.
I wanted a third option: point a CLI at any URL and get a fully working REST API back, automatically. No selectors. No config. No code.
Here's how WebSnap works under the hood.
The problem with traditional scraping
The standard approach looks like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
titles = soup.select('article.product_pod h3 a')  # brittle: tied to exact classes
prices = soup.select('p.price_color')
This works — until the site changes its CSS classes. Then everything breaks and you rewrite the selectors.
WebSnap takes a different approach: instead of targeting specific selectors, it analyzes the structure of the DOM and finds patterns automatically.
Step 1: Render the page properly
The first problem is that modern websites render content with JavaScript. A plain requests.get() gives you an empty shell. WebSnap uses Playwright to launch a real Chromium browser:
async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)
    page = await browser.new_page()
    await page.goto(url, wait_until="networkidle")
    await scroll_to_bottom(page)  # triggers lazy loading
    html = await page.content()
The scroll_to_bottom step is important — many sites only load content when you scroll down, so we simulate that before capturing the DOM.
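The helper itself isn't shown above, but a minimal sketch might look like the following. The loop bound and pause are arbitrary illustrative choices; the idea is to keep scrolling until the page height stops growing, i.e. no more lazy-loaded content is arriving.

```python
import asyncio

async def scroll_to_bottom(page, max_rounds: int = 10, pause_s: float = 0.5):
    """Scroll until the page height stops growing (or max_rounds is hit)."""
    last_height = 0
    for _ in range(max_rounds):
        height = await page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # no new lazy-loaded content appeared
        last_height = height
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await asyncio.sleep(pause_s)  # give lazy loaders time to fire
```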
Step 2: Detect data collections via structural signatures
This is the core idea. Every DOM node has a structural signature: its tag name plus the tag names of its direct children.
def node_signature(tag):
    children = [c.name for c in tag.children if isinstance(c, Tag)]
    return (tag.name, tuple(children))
For a product card, the signature might look like:
('article', ('div', 'h3', 'p', 'p'))
Now we scan every parent node and count how many of its children share the same signature:
sig_counter = Counter(node_signature(c) for c in children)
dominant_sig, count = sig_counter.most_common(1)[0]
if count >= 5:  # found a data collection
    ...
If 5 or more siblings share a signature, that's almost certainly a list of data — products, articles, search results, whatever. No ML needed. Pure structure.
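To make the idea concrete, here is a stand-alone demonstration of signature counting. `Node` is a hypothetical stand-in for a BeautifulSoup `Tag` (just a name and children) so the sketch runs without fetching a real page; the counting logic mirrors the snippets above.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a bs4 Tag: a name plus child nodes.
    name: str
    children: list = field(default_factory=list)

def node_signature(tag: Node):
    return (tag.name, tuple(c.name for c in tag.children))

def find_collection(parent: Node, min_siblings: int = 5):
    """Return the dominant child signature if enough siblings share it."""
    sig_counter = Counter(node_signature(c) for c in parent.children)
    if not sig_counter:
        return None
    dominant_sig, count = sig_counter.most_common(1)[0]
    return dominant_sig if count >= min_siblings else None

# Six identical product cards plus one stray sidebar element:
card = lambda: Node("article", [Node("div"), Node("h3"), Node("p"), Node("p")])
listing = Node("section", [card() for _ in range(6)] + [Node("aside")])

print(find_collection(listing))
# → ('article', ('div', 'h3', 'p', 'p'))
```

The stray `aside` is ignored automatically because only the dominant signature is considered.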
Step 3: Infer field types
Once we have a collection, we extract fields from the first element as a template and infer their types:
# By field name
if 'price' in name: return 'float'
if 'image' in name: return 'image_url'
if 'link' in name: return 'url'
# By sample values
if re.search(r'[\$\€\£][\d,.]+', sample): return 'float'
if re.match(r'^https?://', sample): return 'url'
if re.match(r'^\d+$', sample): return 'integer'
This gets you clean typed fields like price_gbp: float instead of text_3: string.
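The two heuristics can be combined into one runnable function. The name and regex rules mirror the snippets above; the function name, precedence order, and `string` fallback are illustrative assumptions.

```python
import re

def infer_type(name: str, sample: str) -> str:
    """Guess a field type from its derived name, then from a sample value."""
    name = name.lower()
    # By field name
    if 'price' in name:
        return 'float'
    if 'image' in name:
        return 'image_url'
    if 'link' in name:
        return 'url'
    # By sample value
    if re.search(r'[$€£][\d,.]+', sample):
        return 'float'
    if re.match(r'^https?://', sample):
        return 'url'
    if re.match(r'^\d+$', sample):
        return 'integer'
    return 'string'

print(infer_type("product_price", "£51.77"))  # → float
print(infer_type("text_3", "42"))             # → integer
```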
Step 4: Generate the FastAPI server
With the schema inferred, WebSnap generates a complete Python file:
class Item(BaseModel):
    title: Optional[str] = None
    product_price: Optional[float] = None
    image_url: Optional[str] = None
    instock: Optional[str] = None

@app.get("/api/items")
def get_items(limit: int = 20, offset: int = 0):
    return DATA[offset : offset + limit]
With Swagger docs at /docs and CORS enabled out of the box.
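One way to sketch this generation step: render the inferred schema into FastAPI source with a plain string template. The template contents and type mapping below are illustrative assumptions, not WebSnap's actual template.

```python
# Hypothetical sketch of the generator: fill a string template
# with the inferred field names and types, then write server.py.
TYPE_MAP = {"float": "float", "integer": "int"}  # everything else -> str

TEMPLATE = '''\
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
DATA: list[dict] = {data!r}

class Item(BaseModel):
{fields}

@app.get("/api/items")
def get_items(limit: int = 20, offset: int = 0):
    return DATA[offset : offset + limit]
'''

def render_server(schema: dict[str, str], data: list[dict]) -> str:
    fields = "\n".join(
        f"    {name}: Optional[{TYPE_MAP.get(ftype, 'str')}] = None"
        for name, ftype in schema.items()
    )
    return TEMPLATE.format(fields=fields, data=data)

source = render_server({"title": "string", "product_price": "float"}, [])
print(source)
```

Emitting a flat source file rather than serving dynamically means the result is a self-contained artifact you can read, edit, and deploy anywhere.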
The result
$ websnap generate https://books.toscrape.com
✓ API generated with 20 records
✓ Endpoint: /api/items
✓ Server saved to server.py
$ python3 server.py
$ curl http://localhost:3000/api/items
[
  {
    "title": "A Light in the Attic",
    "product_price": 51.77,
    "image_url": "media/cache/2c/da/...",
    "instock": "In stock"
  },
  ...
]
From URL to working REST API in under 10 seconds.
What's next
The current version works well for simple listing pages. The roadmap includes:
- Multi-page crawling — follow pagination automatically
- LLM-powered naming — use a local model to give fields smarter names
- OpenAPI YAML export — proper spec file alongside the server
- TypeScript SDK generation — autogenerated client
The project is open source and early stage — feedback and contributions very welcome.
GitHub: https://github.com/uunaign/websnap
Built with Python, Playwright, BeautifulSoup4, FastAPI, and Pydantic.