Web Scraping API for Python: Extract Data Without Beautiful Soup or Selenium
Every Python developer has written a web scraper. Beautiful Soup + Requests for static pages. Selenium + headless Chrome for JavaScript-rendered content. Both approaches break in the same ways: network timeouts, JavaScript rendering failures, hand-rolled pagination logic, and rate-limit walls.
PageBolt's /extract endpoint returns clean, structured data from any URL in one API call.
The Python Web Scraping Problem
Beautiful Soup + Requests:
import requests
from bs4 import BeautifulSoup
import time

urls = ['https://example.com/product/{}'.format(i) for i in range(1, 100)]
products = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # You now manually parse every variation of HTML structure
    title = soup.find('h1', class_='product-title')
    price = soup.find('span', class_='price')
    # Rate limiting, retries, error handling...
    # Every selector miss returns None, so guard each field by hand
    products.append({
        'title': title.text if title else None,
        'price': price.text if price else None
    })
    time.sleep(1)  # Don't get blocked
Problems: Brittle CSS selectors, no JavaScript rendering, manual rate limiting, maintenance burden.
Selenium + headless Chrome:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)

products = []
for url in urls:
    driver.get(url)
    time.sleep(3)  # Wait for JS to render (too short misses data, too long wastes time)
    title = driver.find_element(By.CSS_SELECTOR, 'h1.product-title').text
    price = driver.find_element(By.CSS_SELECTOR, '.price').text
    products.append({'title': title, 'price': price})

driver.quit()
Problems: 300MB+ Chrome per instance, memory leaks, 5–10 second startup per page, infrastructure costs, fragile waits.
The API Solution
PageBolt /extract returns Markdown-formatted, structured data:
import os
import requests

api_key = os.environ['PAGEBOLT_KEY']

response = requests.post(
    'https://api.pagebolt.dev/v1/extract',
    headers={'Authorization': f'Bearer {api_key}'},
    json={'url': 'https://example.com/product/123'}
)
data = response.json()

print(data['markdown'])  # Clean Markdown
print(data['title'])     # Page title
Returns:
{
  "url": "https://example.com/product/123",
  "title": "Product Title",
  "markdown": "# Product Title\n\nPrice: $49.99\n\nDescription: ...",
  "wordCount": 342,
  "excerpt": "Short description of the product..."
}
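Rate limits still exist at the API layer, just enforced for you. Here is a minimal retry wrapper as a sketch; it assumes PageBolt signals throttling with a standard HTTP 429 status, which is an assumption on my part, not something documented above:

import os
import time
import requests

API_URL = 'https://api.pagebolt.dev/v1/extract'

def extract(url, api_key, max_retries=3):
    """POST to /extract, retrying with exponential backoff.

    Assumes throttling surfaces as HTTP 429; adjust if PageBolt
    signals it differently.
    """
    headers = {'Authorization': f'Bearer {api_key}'}
    for attempt in range(max_retries):
        res = requests.post(API_URL, headers=headers, json={'url': url}, timeout=30)
        if res.status_code == 429:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
            continue
        res.raise_for_status()
        return res.json()
    raise RuntimeError(f'Rate limited after {max_retries} attempts: {url}')

data = extract('https://example.com/product/123', os.environ['PAGEBOLT_KEY'])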
Real-World Examples
1. Bulk Product Monitoring
import os
import json
import requests
from datetime import datetime

competitors = {
    'amazon': 'https://amazon.com/s?k=widget',
    'ebay': 'https://ebay.com/sch/i.html?_nkw=widget',
    'aliexpress': 'https://aliexpress.com/wholesale?SearchText=widget'
}

api_key = os.environ['PAGEBOLT_KEY']
headers = {'Authorization': f'Bearer {api_key}'}

results = {}
for source, url in competitors.items():
    res = requests.post(
        'https://api.pagebolt.dev/v1/extract',
        headers=headers,
        json={'url': url}
    )
    extraction = res.json()
    results[source] = {
        'extracted_at': datetime.utcnow().isoformat(),
        'markdown': extraction['markdown'],
        'title': extraction['title']
    }

# Parse with Claude (see the sketch below), store in DB
with open('competitor_data.json', 'w') as f:
    json.dump(results, f)
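The extracted Markdown is still unstructured text; turning it into fields is a good fit for an LLM. A minimal sketch of the "parse with Claude" step, assuming the anthropic Python SDK with an ANTHROPIC_API_KEY environment variable; the prompt and model choice here are illustrative, not prescribed by PageBolt:

import json
import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def parse_listing(markdown):
    """Ask Claude to pull product name/price pairs out of extracted Markdown."""
    message = client.messages.create(
        model='claude-sonnet-4-20250514',  # illustrative model choice
        max_tokens=1024,
        messages=[{
            'role': 'user',
            'content': (
                'Extract every product as JSON: '
                '[{"name": ..., "price": ...}]. Return only JSON.\n\n'
                + markdown
            )
        }]
    )
    return json.loads(message.content[0].text)

with open('competitor_data.json') as f:
    results = json.load(f)

parsed = {source: parse_listing(entry['markdown']) for source, entry in results.items()}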
2. News Aggregator Pipeline
import asyncio
import httpx

async def extract_article(url, api_key):
    async with httpx.AsyncClient() as client:
        res = await client.post(
            'https://api.pagebolt.dev/v1/extract',
            headers={'Authorization': f'Bearer {api_key}'},
            json={'url': url}
        )
        return res.json()

async def aggregate(urls, api_key):
    tasks = [extract_article(url, api_key) for url in urls]
    return await asyncio.gather(*tasks)

# Process 50 articles concurrently
articles = asyncio.run(aggregate(article_urls, api_key))
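Firing 50 requests at once can blow through a per-minute quota. A sketch of the same pipeline with a concurrency cap via asyncio.Semaphore and one shared connection pool; the cap of 10 is an arbitrary assumption, tune it to your plan's limits:

import asyncio
import httpx

async def extract_article(client, semaphore, url, api_key):
    async with semaphore:  # at most max_concurrency requests in flight
        res = await client.post(
            'https://api.pagebolt.dev/v1/extract',
            headers={'Authorization': f'Bearer {api_key}'},
            json={'url': url}
        )
        res.raise_for_status()
        return res.json()

async def aggregate(urls, api_key, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with httpx.AsyncClient(timeout=30) as client:  # one shared connection pool
        tasks = [extract_article(client, semaphore, url, api_key) for url in urls]
        return await asyncio.gather(*tasks)

articles = asyncio.run(aggregate(article_urls, api_key))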
3. CI/CD Data Validation
import os
import sys
import requests

api_key = os.environ['PAGEBOLT_KEY']
product_id = os.environ['PRODUCT_ID']  # however your CI exposes the product under test

# Verify the staging site renders product data correctly
res = requests.post(
    'https://api.pagebolt.dev/v1/extract',
    headers={'Authorization': f'Bearer {api_key}'},
    json={'url': f'https://staging.yourapp.com/product/{product_id}'}
)
data = res.json()

# Fail the deploy if critical data is missing
required_fields = ['$99.99', 'Product Name', 'In Stock']
missing = [f for f in required_fields if f not in data['markdown']]
if missing:
    print(f"❌ Staging validation failed. Missing: {missing}")
    sys.exit(1)
print("✅ Staging data validation passed")
Why Not Self-Hosted Scraping?
Beautiful Soup/Requests:
- Manual HTML parsing per site
- No JavaScript rendering
- You manage retries, rate limits, proxies
- Breaks on layout changes
Selenium/Puppeteer:
- 300MB Chrome per instance
- 5–10 second startup per page
- Infrastructure costs ($100+/month at scale)
- Memory leaks in production
PageBolt /extract:
- <1KB per request
- 0.5–2 seconds per page
- $29/month for 5,000 requests
- Automatic updates (no maintenance)
Pricing
- Free: 100 requests/month
- Starter: $29/month → 5,000/month
- Growth: $79/month → 25,000/month
- Scale: $199/month → 100,000/month
Next Steps
- Get API key: pagebolt.dev/pricing
- Run first extraction:
curl -X POST https://api.pagebolt.dev/v1/extract -H "Authorization: Bearer YOUR_KEY" -H "Content-Type: application/json" -d '{"url":"https://example.com"}'
Try free — 100 requests/month, no credit card required.