The moment I learned to find API endpoints changed everything. I was struggling to scrape a product listing site with Selenium. It took 5 minutes to render one page.
Then I opened the Network tab and found the API. Same data, but as clean JSON. I switched to scraping the API directly.
Results:
- Before: 5 minutes per page, messy HTML parsing
- After: 2 seconds per page, clean JSON data
Finding APIs is the secret weapon of professional scrapers. Let me show you how.
Why APIs Are Better Than Scraping HTML
Scraping HTML:
- Slow (download + parse)
- Brittle (breaks when design changes)
- Messy (nested tags, inconsistent structure)
- Often needs JavaScript rendering (even slower)
Scraping an API:
- Fast (just download JSON)
- Stable (APIs change less than websites)
- Clean (structured JSON data)
- No rendering needed
Speed comparison:
- HTML scraping: 10-20 pages/second
- API scraping: 100-500 pages/second
That's 10-50x faster!
How to Find API Endpoints
Step 1: Open Developer Tools
Chrome/Edge:
- Press F12 or Ctrl+Shift+I
- Click "Network" tab
Firefox:
- Press F12
- Click "Network" tab
Step 2: Filter by XHR/Fetch
Click "XHR" or "Fetch" button in the Network tab. This shows only API requests.
Step 3: Refresh the Page
Press Ctrl+R to reload. Watch requests appear in the Network tab.
Step 4: Look for JSON Responses
Click on requests one by one. Look for:
- URLs containing /api/
- Responses with JSON data
- Requests that contain your target data
Step 5: Inspect the Request
Click on an interesting request → check:
- URL (Request URL at top)
- Method (GET, POST, etc.)
- Headers (Authorization, cookies, etc.)
- Payload (if POST request)
- Response (the JSON data)
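Before writing a spider, it pays to confirm the endpoint works outside the browser. A quick check in the Scrapy shell (the URL here is just a placeholder; use the one you found in the Network tab):
# in a terminal (works even outside a Scrapy project)
scrapy shell "https://api.example.com/v1/products?page=1&limit=20&sort=popular"

# then, at the shell prompt
import json
data = json.loads(response.text)
data.keys()   # confirm the structure matches what you saw in DevTools
If the shell gets a 403 or an auth error, copy the headers from the browser request and retry with them.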
Real Example: Product Listing
Let's say you're scraping products from a store.
What You See in Network Tab
Request URL: https://api.example.com/v1/products?page=1&limit=20&sort=popular
Method: GET
Status: 200
Response:
{
  "products": [
    {
      "id": 123,
      "name": "Widget Pro",
      "price": 29.99,
      "stock": 50
    },
    {
      "id": 124,
      "name": "Gadget Plus",
      "price": 49.99,
      "stock": 30
    }
  ],
  "total": 1523,
  "page": 1,
  "pages": 77
}
Perfect! You found the API.
Your Scrapy Spider
import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        url = 'https://api.example.com/v1/products?page=1&limit=20&sort=popular'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        # Extract products
        for product in data['products']:
            yield {
                'id': product['id'],
                'name': product['name'],
                'price': product['price'],
                'stock': product['stock']
            }

        # Pagination
        current_page = data['page']
        total_pages = data['pages']
        if current_page < total_pages:
            next_page = current_page + 1
            next_url = f'https://api.example.com/v1/products?page={next_page}&limit=20&sort=popular'
            yield scrapy.Request(next_url, callback=self.parse)
Done! Clean, fast, reliable.
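Side note: on Scrapy 2.2 or newer you can skip json.loads entirely and use the built-in JSON helper:
# Scrapy 2.2+ shortcut for JSON responses
data = response.json()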
Finding Hidden APIs (Advanced)
Some APIs aren't obvious. Here's how to find them.
Technique 1: Search for "api" in Network Tab
Type "api" in the filter box. Shows only URLs containing "api".
Technique 2: Look for GraphQL
Many modern sites use GraphQL. Look for:
- URL: https://example.com/graphql
- Method: POST
- Payload contains "query"
Example GraphQL request:
{
  "query": "{ products(limit: 20) { id name price } }"
}
Technique 3: Check WebSocket Connections
Some sites use WebSockets for real-time updates.
In Network tab:
- Filter by "WS" (WebSocket)
- Click on connection
- View messages
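Scrapy itself doesn't speak WebSocket, so if the data only arrives over a socket you'll need a separate client. A minimal sketch using the third-party websockets package (the URL is a placeholder; use the one shown under the WS filter):
import asyncio
import websockets  # pip install websockets

async def read_messages(url):
    # Open the socket and print the first few messages the server pushes
    async with websockets.connect(url) as ws:
        for _ in range(5):
            message = await ws.recv()
            print(message)

asyncio.run(read_messages('wss://example.com/live-updates'))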
Technique 4: Look at Script Tags
Sometimes API URLs are embedded in JavaScript:
import re

def parse(self, response):
    # Look for API URLs in script tags
    scripts = response.css('script::text').getall()
    for script in scripts:
        if 'api.example.com' in script:
            # Extract API URLs from the JavaScript
            urls = re.findall(r'https://api\.example\.com/[^"\']+', script)
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_api)
Handling API Authentication
Many APIs require authentication.
Type 1: API Key in URL
https://api.example.com/products?api_key=abc123def456
How to find it:
- Check request URL in Network tab
- Look for api_key, key, or token parameters
Your spider:
def start_requests(self):
    api_key = 'abc123def456'
    url = f'https://api.example.com/products?api_key={api_key}'
    yield scrapy.Request(url)
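Hard-coding keys is fine for a quick test, but if the spider ends up in version control it's safer to read them from the environment. A small variation (assumes you've exported a hypothetical API_KEY variable in your shell):
import os

def start_requests(self):
    # Hypothetical env var; set it with: export API_KEY=abc123def456
    api_key = os.environ.get('API_KEY', '')
    url = f'https://api.example.com/products?api_key={api_key}'
    yield scrapy.Request(url)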
Type 2: Bearer Token in Headers
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
How to find it:
- Network tab → Click request
- Headers tab → Look for "Authorization"
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN_HERE'
    }
    yield scrapy.Request(url, headers=headers)
Type 3: Session Cookies
Some APIs use cookies for auth.
How to find them:
- Network tab → Click request
- Headers tab → Look for "Cookie"
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    cookies = {
        'session_id': 'abc123',
        'user_token': 'xyz789'
    }
    yield scrapy.Request(url, cookies=cookies)
Type 4: Custom Headers
X-Api-Key: abc123
X-Client-Id: def456
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'X-Api-Key': 'abc123',
        'X-Client-Id': 'def456'
    }
    yield scrapy.Request(url, headers=headers)
Handling POST Requests
Some APIs use POST instead of GET.
Finding POST Data
Network tab:
- Click POST request
- "Payload" tab
- See the data sent
Example:
{
  "filters": {
    "category": "electronics",
    "price_max": 1000
  },
  "page": 1,
  "limit": 20
}
Your Spider
import scrapy
import json

class PostSpider(scrapy.Spider):
    name = 'post'

    def start_requests(self):
        url = 'https://api.example.com/search'
        payload = {
            'filters': {
                'category': 'electronics',
                'price_max': 1000
            },
            'page': 1,
            'limit': 20
        }
        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)
        for item in data['results']:
            yield item
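Recent Scrapy versions also ship scrapy.http.JsonRequest, which serializes the payload and sets the Content-Type header for you. The same request, a bit more compact (same assumed endpoint and payload as above):
from scrapy.http import JsonRequest

def start_requests(self):
    payload = {
        'filters': {'category': 'electronics', 'price_max': 1000},
        'page': 1,
        'limit': 20
    }
    # JsonRequest JSON-encodes `data` and sets Content-Type: application/json
    yield JsonRequest(
        'https://api.example.com/search',
        data=payload,
        method='POST',
        callback=self.parse
    )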
Handling Pagination in APIs
APIs have different pagination styles.
Style 1: Page Numbers
/products?page=1
/products?page=2
/products?page=3
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next page
    current_page = int(response.url.split('page=')[1])
    if data['has_next']:
        next_page = current_page + 1
        next_url = f'https://api.example.com/products?page={next_page}'
        yield scrapy.Request(next_url, callback=self.parse)
Style 2: Offset/Limit
/products?offset=0&limit=20
/products?offset=20&limit=20
/products?offset=40&limit=20
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next offset
    total = data['total']
    offset = int(response.url.split('offset=')[1].split('&')[0])
    limit = 20
    if offset + limit < total:
        next_offset = offset + limit
        next_url = f'https://api.example.com/products?offset={next_offset}&limit={limit}'
        yield scrapy.Request(next_url, callback=self.parse)
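Splitting the URL string works for this exact format, but it breaks if the parameter order ever changes. A more robust sketch that reads the query string with the standard library (same assumed endpoint and response fields as above):
import json
from urllib.parse import urlparse, parse_qs

def parse(self, response):
    data = json.loads(response.text)
    for item in data['items']:
        yield item

    # Read offset/limit from the current URL instead of string-splitting
    query = parse_qs(urlparse(response.url).query)
    offset = int(query.get('offset', ['0'])[0])
    limit = int(query.get('limit', ['20'])[0])

    if offset + limit < data['total']:
        next_url = f'https://api.example.com/products?offset={offset + limit}&limit={limit}'
        yield scrapy.Request(next_url, callback=self.parse)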
Style 3: Cursor-Based
/products?cursor=abc123
/products?cursor=def456
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next cursor (use .get so the last page doesn't raise a KeyError)
    if data.get('next_cursor'):
        next_url = f"https://api.example.com/products?cursor={data['next_cursor']}"
        yield scrapy.Request(next_url, callback=self.parse)
GraphQL APIs
GraphQL is a modern API query language.
Finding GraphQL Endpoints
Look for:
- URL: /graphql
- Method: POST
- Content-Type: application/json
- Body contains "query"
Example GraphQL Query
{
  "query": "query { products(limit: 20) { id name price description } }"
}
Scrapy Spider for GraphQL
import scrapy
import json

class GraphQLSpider(scrapy.Spider):
    name = 'graphql'

    def start_requests(self):
        url = 'https://example.com/graphql'
        query = '''
        query {
            products(limit: 20, offset: 0) {
                id
                name
                price
                description
            }
        }
        '''
        payload = {'query': query}
        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)
        for product in data['data']['products']:
            yield product
GraphQL Pagination
def start_requests(self):
    for offset in range(0, 1000, 20):  # 0, 20, 40, ...
        query = f'''
        query {{
            products(limit: 20, offset: {offset}) {{
                id
                name
                price
            }}
        }}
        '''
        payload = {'query': query}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )
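Interpolating values straight into the query string works, but GraphQL also has a variables mechanism that avoids quoting and escaping problems. A drop-in variant of the start_requests above (same assumed endpoint and fields; the Int types are an assumption about the schema):
def start_requests(self):
    query = '''
    query Products($limit: Int!, $offset: Int!) {
        products(limit: $limit, offset: $offset) {
            id
            name
            price
        }
    }
    '''
    for offset in range(0, 1000, 20):
        # Pass the values as GraphQL variables instead of f-string interpolation
        payload = {'query': query, 'variables': {'limit': 20, 'offset': offset}}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )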
Rate Limiting with APIs
APIs often have rate limits.
Detecting Rate Limits
Signs:
- 429 status code (Too Many Requests)
- Error message about rate limiting
- Header: X-RateLimit-Remaining: 0
Handling Rate Limits
# settings.py
# Slow down
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4
# Auto throttle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
Respecting Rate Limit Headers
def parse(self, response):
    # Check rate limit headers
    remaining = response.headers.get('X-RateLimit-Remaining')
    if remaining and int(remaining) < 10:
        self.logger.warning('Approaching rate limit, slowing down')
        # Slow down or pause

    # Continue parsing
    data = json.loads(response.text)
    for item in data:
        yield item
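If you still receive 429s, you can also let Scrapy's built-in retry middleware re-schedule those responses instead of handling them in the spider; combined with AutoThrottle, the retries get spaced out automatically:
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]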
Reverse Engineering API Parameters
Sometimes API URLs have cryptic parameters.
Common Parameters to Try
# Pagination
?page=1
?offset=0&limit=20
?cursor=abc
# Sorting
?sort=price
?sort=price_asc
?order_by=name
# Filtering
?category=electronics
?price_min=10&price_max=100
?in_stock=true
# Search
?q=laptop
?search=laptop
?query=laptop
# Format
?format=json
?output=json
Testing Parameters
def start_requests(self):
    base_url = 'https://api.example.com/products'

    # Try different parameters
    for page in range(1, 11):
        url = f'{base_url}?page={page}&limit=50&sort=price'
        yield scrapy.Request(url, callback=self.parse)
When APIs Don't Exist
If you can't find an API:
Option 1: Use Scrapy-Playwright to render JavaScript (a minimal setup sketch follows these options)
Option 2: Look harder
- Sometimes APIs are there but hidden
- Check mobile app traffic (apps often use APIs)
- Look at older versions of the site
Option 3: Scrape HTML
- Last resort
- Slower but works
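If you go the Scrapy-Playwright route, the setup is roughly this, a minimal sketch based on the scrapy-playwright README (check its docs for the current settings):
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# in your spider: ask for a rendered page
def start_requests(self):
    yield scrapy.Request(
        'https://example.com/products',
        meta={'playwright': True},
        callback=self.parse
    )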
Complete Real-World Example
Let's scrape a product API:
import scrapy
import json
from urllib.parse import urlencode

class ProductApiSpider(scrapy.Spider):
    name = 'product_api'

    # API base URL (found in Network tab)
    api_base = 'https://api.example.com/v2/products'

    # Headers (copied from Network tab)
    headers = {
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
        'User-Agent': 'Mozilla/5.0...',
        'Accept': 'application/json'
    }

    def start_requests(self):
        # Start with page 1
        params = {
            'page': 1,
            'limit': 50,
            'category': 'electronics',
            'sort': 'popularity'
        }
        url = f'{self.api_base}?{urlencode(params)}'
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Parse JSON response
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            self.logger.error(f'Invalid JSON from {response.url}')
            return

        # Extract products
        for product in data.get('products', []):
            yield {
                'id': product.get('id'),
                'name': product.get('name'),
                'price': product.get('price'),
                'currency': product.get('currency'),
                'stock': product.get('in_stock'),
                'rating': product.get('rating'),
                'reviews': product.get('review_count'),
                'url': product.get('product_url')
            }

        # Pagination
        current_page = data.get('current_page', 1)
        total_pages = data.get('total_pages', 1)
        if current_page < total_pages:
            next_page = current_page + 1
            params = {
                'page': next_page,
                'limit': 50,
                'category': 'electronics',
                'sort': 'popularity'
            }
            next_url = f'{self.api_base}?{urlencode(params)}'
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
        else:
            self.logger.info(f'Finished scraping {total_pages} pages')
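Assuming the spider sits in a Scrapy project, running it and exporting the items is one command:
scrapy crawl product_api -o products.json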
Quick Checklist
Finding APIs:
- [ ] Open DevTools (F12)
- [ ] Click Network tab
- [ ] Filter by XHR/Fetch
- [ ] Refresh page
- [ ] Click on requests with JSON responses
- [ ] Note URL, method, headers, payload
Testing APIs:
- [ ] Copy request URL
- [ ] Test in Scrapy shell
- [ ] Check authentication requirements
- [ ] Test pagination
- [ ] Test different parameters
Building Spider:
- [ ] Start with one page
- [ ] Parse JSON response
- [ ] Add pagination
- [ ] Add authentication if needed
- [ ] Respect rate limits
Summary
Why find APIs:
- 10-50x faster than HTML scraping
- Clean JSON data
- More stable/reliable
- No JavaScript rendering needed
How to find them:
- Network tab → XHR/Fetch filter
- Look for JSON responses
- Note URL, headers, payload
Common patterns:
- GET with URL parameters
- POST with JSON body
- Authentication via headers or cookies
- Pagination via page/offset/cursor
Best practices:
- Test API in Scrapy shell first
- Copy exact headers from browser
- Respect rate limits
- Handle errors gracefully
Remember:
- Always try to find API first
- APIs > Playwright > Selenium > HTML scraping
- 10 minutes finding API saves hours of scraping
Start by opening Network tab on any site you want to scrape. You'll be surprised how many use APIs!
Happy scraping! 🕷️