I was building a Scrapy spider when I hit a weird situation. The website had an API endpoint that returned JSON data, but Scrapy kept trying to parse it as HTML.
I spent hours trying to make Scrapy work with the API. Then I realized I could just use Python's requests library inside my spider. Problem solved in 5 minutes.
Sometimes Scrapy isn't the right tool for every single request. Let me show you when and how to use requests inside Scrapy.
What is Python Requests?
requests is a simple Python library for making HTTP requests.
Think of it like this:
- Scrapy = A complete factory with assembly line, workers, quality control
- requests = A simple tool you hold in your hand
Sometimes you need the whole factory. Sometimes you just need the simple tool.
Why Use Requests Inside Scrapy?
Reason 1: API Calls
Some websites have APIs that return pure JSON (no HTML).
Problem with Scrapy:
def parse(self, response):
    data = response.json()  # Works, but awkward
Easier with requests:
import requests

def parse(self, response):
    api_data = requests.get('https://api.example.com/data').json()
    # Clean JSON, easy to use
Reason 2: Authentication APIs
Login endpoints often need specific formatting.
Problem:
Scrapy's FormRequest can be complex for APIs.
Solution:
import requests

# Simple API login
response = requests.post(
    'https://example.com/api/login',
    json={'username': 'user', 'password': 'pass'}
)
token = response.json()['token']
Much simpler!
Reason 3: External Data Sources
You need data from a different website while scraping.
Example:
- Scraping products from Website A
- Need to check prices on Website B's API
- Want to do both in one spider
def parse(self, response):
    product_name = response.css('.product-name::text').get()

    # Quick API check on different site
    price_api = f'https://pricecheck.com/api?product={product_name}'
    external_price = requests.get(price_api).json()['price']

    yield {
        'name': product_name,
        'external_price': external_price
    }
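One detail worth knowing: if the product name can contain spaces or symbols, it's safer to let requests build and encode the query string via params rather than formatting the URL by hand. A small variant of the call above (same placeholder API):

# Let requests URL-encode the query string (spaces, symbols, etc.)
external_price = requests.get(
    'https://pricecheck.com/api',
    params={'product': product_name},
    timeout=5
).json()['price']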
Reason 4: File Downloads
Downloading files (PDFs, images) is simpler with requests.
import requests

# Download PDF
pdf_url = 'https://example.com/report.pdf'
pdf = requests.get(pdf_url)

with open('report.pdf', 'wb') as f:
    f.write(pdf.content)
Easier than using Scrapy's file pipeline for one-off downloads.
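For large files, a streamed download avoids loading the whole response into memory at once. A minimal sketch (the URL is a placeholder):

import requests

# Stream a large PDF to disk in chunks instead of holding it all in memory
pdf_url = 'https://example.com/big-report.pdf'

with requests.get(pdf_url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open('big-report.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)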
When to Use What?
Use Scrapy When:
- Scraping multiple pages
- Following pagination
- Need to respect robots.txt
- Want automatic retries
- Need rate limiting
- Crawling whole websites
Use requests When:
- Single API call
- Quick external check
- Downloading single file
- Simple authentication
- Testing endpoints
- One-off requests
Use Both When:
- Scrapy for main scraping
- requests for API calls
- Best of both worlds!
Basic Example: requests Inside Scrapy
Simple API Call
import scrapy
import requests

class ApiSpider(scrapy.Spider):
    name = 'api'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            product_id = product.css('::attr(data-id)').get()

            # Use requests to get API data
            api_url = f'https://api.example.com/products/{product_id}'
            api_response = requests.get(api_url)

            if api_response.status_code == 200:
                api_data = api_response.json()

                yield {
                    'name': product.css('h2::text').get(),
                    'price': api_data['price'],
                    'stock': api_data['stock']
                }
What this does:
- Scrapy scrapes the main page
- Gets product IDs from HTML
- Uses requests to fetch API data for each product
- Combines both data sources
Real Example: Product Scraper with API
Let's build a real spider that uses both Scrapy and requests.
The Scenario
Website structure:
- Product listing page (HTML)
- Individual product pages (HTML)
- Price API (JSON)
We want:
- Product names from HTML
- Live prices from API
The Spider
import scrapy
import requests

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    # API settings
    api_base = 'https://api.example.com'
    api_key = 'your-api-key-here'

    def parse(self, response):
        """Parse product listing page"""
        for product in response.css('.product'):
            # Get basic info from HTML
            name = product.css('.name::text').get()
            sku = product.css('::attr(data-sku)').get()

            # Get live price from API
            price = self.get_price_from_api(sku)

            yield {
                'name': name,
                'sku': sku,
                'price': price
            }

        # Follow next page (Scrapy handles this)
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_price_from_api(self, sku):
        """Helper method using requests"""
        try:
            url = f'{self.api_base}/prices/{sku}'
            headers = {'Authorization': f'Bearer {self.api_key}'}
            response = requests.get(url, headers=headers, timeout=5)

            if response.status_code == 200:
                return response.json()['price']
            else:
                self.logger.warning(f'API failed for SKU {sku}')
                return None
        except Exception as e:
            self.logger.error(f'API error: {e}')
            return None
What's happening:
- Scrapy scrapes HTML for product names
- requests fetches live prices from API
- Combines both into final item
- Scrapy handles pagination
- requests handles API calls
Perfect combination!
Authentication Example
Logging In with API
import scrapy
import requests

class AuthSpider(scrapy.Spider):
    name = 'auth'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Login on spider start
        self.token = self.login()

    def login(self):
        """Use requests to login and get token"""
        login_url = 'https://example.com/api/login'
        credentials = {
            'username': 'myuser',
            'password': 'mypass'
        }

        response = requests.post(login_url, json=credentials)

        if response.status_code == 200:
            token = response.json()['access_token']
            self.logger.info('Login successful!')
            return token
        else:
            self.logger.error('Login failed!')
            return None

    def start_requests(self):
        if not self.token:
            self.logger.error('No token, cannot scrape')
            return

        # Use token in Scrapy requests
        headers = {'Authorization': f'Bearer {self.token}'}
        yield scrapy.Request(
            'https://example.com/protected-data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        # Scrape protected content
        yield {'data': response.css('.data::text').get()}
Why this works:
- requests handles complex login
- Gets authentication token
- Scrapy uses token for scraping
- Clean separation of concerns
Downloading Files Example
Download PDFs with requests
import scrapy
import requests
import os

class PdfSpider(scrapy.Spider):
    name = 'pdfs'
    start_urls = ['https://example.com/reports']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create download folder
        os.makedirs('downloads', exist_ok=True)

    def parse(self, response):
        for report in response.css('.report'):
            title = report.css('.title::text').get()
            # urljoin handles relative links
            pdf_url = response.urljoin(report.css('a::attr(href)').get())

            # Download PDF with requests
            downloaded = self.download_pdf(pdf_url, title)

            yield {
                'title': title,
                'url': pdf_url,
                'downloaded': downloaded
            }

    def download_pdf(self, url, filename):
        """Download PDF using requests; returns True on success"""
        try:
            self.logger.info(f'Downloading {filename}...')
            response = requests.get(url, timeout=30)

            if response.status_code == 200:
                # Clean filename
                safe_filename = filename.replace('/', '_')[:50]
                filepath = f'downloads/{safe_filename}.pdf'

                with open(filepath, 'wb') as f:
                    f.write(response.content)

                self.logger.info(f'Saved to {filepath}')
                return True

            self.logger.warning(f'Failed to download {filename}')
            return False
        except Exception as e:
            self.logger.error(f'Download error: {e}')
            return False
Checking External Data
Price Comparison Spider
import scrapy
import requests

class PriceCompareSpider(scrapy.Spider):
    name = 'compare'
    start_urls = ['https://shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('.name::text').get()
            our_price = product.css('.price::text').get()

            # Check competitor price via API
            competitor_price = self.check_competitor_price(name)

            yield {
                'name': name,
                'our_price': our_price,
                'competitor_price': competitor_price,
                'cheaper': (
                    float(our_price) < float(competitor_price)
                    if our_price and competitor_price else None
                )
            }

    def check_competitor_price(self, product_name):
        """Check price on competitor's API"""
        try:
            api_url = 'https://competitor.com/api/search'
            params = {'q': product_name}
            response = requests.get(api_url, params=params, timeout=5)

            if response.status_code == 200:
                results = response.json()
                if results:
                    return results[0]['price']
            return None
        except requests.RequestException:
            return None
Common Patterns
Pattern 1: API + HTML Combo
def parse(self, response):
    # HTML data
    title = response.css('h1::text').get()

    # API data
    api_url = 'https://api.example.com/data'
    api_data = requests.get(api_url).json()

    # Combine
    yield {
        'title': title,
        'api_details': api_data
    }
Pattern 2: Pre-spider API Check
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Check API before scraping
    status = requests.get('https://api.example.com/status').json()
    if not status['available']:
        raise Exception('API not available')
Pattern 3: Post-scrape API Update
def closed(self, reason):
    # Notify API that scraping is done
    requests.post(
        'https://api.example.com/scrape-complete',
        json={'spider': self.name, 'reason': reason}
    )
Important Tips
Tip 1: Add Timeout
Always add a timeout to requests calls:
# BAD: Can hang forever
response = requests.get(url)
# GOOD: Times out after 5 seconds
response = requests.get(url, timeout=5)
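requests also accepts a (connect, read) tuple if you want separate limits for establishing the connection and for reading the response:

# 3 seconds to connect, 10 seconds to read the response
response = requests.get(url, timeout=(3, 10))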
Tip 2: Handle Errors
try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        data = response.json()
    else:
        self.logger.warning(f'API returned {response.status_code}')
except requests.Timeout:
    self.logger.error('API timeout')
except Exception as e:
    self.logger.error(f'API error: {e}')
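An equivalent, slightly shorter pattern is to let requests raise on HTTP error codes and catch everything in one place (requests.HTTPError and requests.Timeout are both subclasses of requests.RequestException):

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    data = response.json()
except requests.RequestException as e:
    self.logger.error(f'API error: {e}')
    data = None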
Tip 3: Don't Block Scrapy
Keep requests calls fast:
# BAD: Slow API calls block Scrapy
for i in range(100):
    requests.get(slow_api)  # Each takes 10 seconds!

# GOOD: Only use requests when necessary
if need_api_data:
    requests.get(api_url, timeout=2)
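Another easy win is caching API responses on the spider so repeated lookups for the same key only block once. A minimal sketch, with a hypothetical price API:

import scrapy
import requests

class CachedApiSpider(scrapy.Spider):
    name = 'cached_api'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._price_cache = {}  # sku -> price, so repeat SKUs never re-hit the API

    def get_price(self, sku):
        if sku not in self._price_cache:
            resp = requests.get(f'https://api.example.com/prices/{sku}', timeout=2)
            self._price_cache[sku] = resp.json().get('price') if resp.ok else None
        return self._price_cache[sku]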
Tip 4: Use Session for Multiple Calls
If you're making many requests calls, reuse a single Session:
import scrapy
import requests

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create session (reuses connections)
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'MySpider/1.0'
        })

    def parse(self, response):
        # Faster repeated calls
        data1 = self.session.get('https://api.example.com/1').json()
        data2 = self.session.get('https://api.example.com/2').json()
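If you go this route, it's also worth closing the session when the spider finishes, for example in the spider's closed() hook (a small addition to the same class):

    def closed(self, reason):
        # Release the pooled connections held by the requests session
        self.session.close()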
When NOT to Use requests
Don't Use requests For:
1. Main scraping
# BAD: Using requests for pagination
for page in range(100):
    html = requests.get(f'https://example.com/page/{page}').text
    # Parse with BeautifulSoup

# GOOD: Let Scrapy handle it
def parse(self, response):
    # Scrapy handles retries, delays, etc.
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
2. When you need Scrapy features
# BAD: Lose Scrapy benefits
requests.get(url) # No retries, no delays, no robots.txt
# GOOD: Use Scrapy
yield scrapy.Request(url) # Has retries, delays, robots.txt
3. Asynchronous scraping
# BAD: Blocks Scrapy's async
response = requests.get(url) # Blocks!
# GOOD: Scrapy is already async
yield scrapy.Request(url) # Non-blocking
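That said, if you really do need many requests-style calls without stalling the crawl, one option is to run the blocking call in Twisted's thread pool and await it from an async callback. This is only a sketch, assuming Scrapy 2.6+ (for maybe_deferred_to_future) and a hypothetical price API:

import requests
import scrapy
from scrapy.utils.defer import maybe_deferred_to_future
from twisted.internet.threads import deferToThread

class ThreadedApiSpider(scrapy.Spider):
    name = 'threaded_api'
    start_urls = ['https://example.com/products']

    async def parse(self, response):
        sku = response.css('::attr(data-sku)').get()
        # Run the blocking requests call in Twisted's thread pool so the
        # reactor keeps downloading other pages while we wait
        price = await maybe_deferred_to_future(
            deferToThread(self.fetch_price, sku)
        )
        yield {'sku': sku, 'price': price}

    def fetch_price(self, sku):
        # Plain blocking call; fine here because it runs off the reactor thread
        resp = requests.get(f'https://api.example.com/prices/{sku}', timeout=5)
        return resp.json().get('price') if resp.ok else None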
Complete Real Example
Here's a complete spider using both Scrapy and requests:
import scrapy
import requests

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://shop.example.com/products']

    # API settings
    api_url = 'https://api.example.com'
    api_key = 'your-key'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Test API connection
        if not self.test_api():
            raise Exception('API not available')

    def test_api(self):
        """Test API with requests"""
        try:
            response = requests.get(
                f'{self.api_url}/health',
                timeout=5
            )
            return response.status_code == 200
        except requests.RequestException:
            return False

    def parse(self, response):
        """Parse product listing"""
        for product in response.css('.product'):
            # Get HTML data
            name = product.css('.name::text').get()
            url = product.css('a::attr(href)').get()
            sku = product.css('::attr(data-sku)').get()

            # Get API data
            inventory = self.get_inventory(sku)
            reviews = self.get_reviews(sku)

            yield {
                'name': name,
                'url': response.urljoin(url),
                'sku': sku,
                'in_stock': inventory['in_stock'],
                'quantity': inventory['quantity'],
                'avg_rating': reviews['avg_rating'],
                'review_count': reviews['count']
            }

        # Scrapy handles pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_inventory(self, sku):
        """Get inventory from API using requests"""
        try:
            url = f'{self.api_url}/inventory/{sku}'
            headers = {'X-API-Key': self.api_key}
            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'in_stock': data['available'],
                    'quantity': data['qty']
                }
            else:
                self.logger.warning(f'Inventory API failed for {sku}')
                return {'in_stock': None, 'quantity': None}
        except Exception as e:
            self.logger.error(f'Inventory error: {e}')
            return {'in_stock': None, 'quantity': None}

    def get_reviews(self, sku):
        """Get reviews from API using requests"""
        try:
            url = f'{self.api_url}/reviews/{sku}'
            headers = {'X-API-Key': self.api_key}
            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'avg_rating': data['average'],
                    'count': data['total']
                }
            else:
                return {'avg_rating': None, 'count': 0}
        except Exception as e:
            self.logger.error(f'Reviews error: {e}')
            return {'avg_rating': None, 'count': 0}

    def closed(self, reason):
        """Send completion notification via API"""
        try:
            url = f'{self.api_url}/scrape-complete'
            data = {
                'spider': self.name,
                'reason': reason,
                'stats': dict(self.crawler.stats.get_stats())
            }
            requests.post(url, json=data, timeout=5)
            self.logger.info('Sent completion notification')
        except Exception:
            self.logger.warning('Could not send notification')
This spider:
- Tests API connection on start (requests)
- Scrapes product listings (Scrapy)
- Gets inventory data (requests + API)
- Gets review data (requests + API)
- Handles pagination (Scrapy)
- Sends completion notification (requests)
Perfect combination of both tools!
Common Mistakes
Mistake 1: Using requests for Everything
# BAD: Why use Scrapy at all?
def parse(self, response):
    html = requests.get('https://example.com').text
    # Just use the requests library alone!

# GOOD: Scrapy for scraping, requests for APIs
def parse(self, response):
    # Scrapy handles the page
    name = response.css('.name::text').get()
    # requests for quick API call
    price = requests.get(api_url).json()['price']
Mistake 2: No Error Handling
# BAD: Will crash on error
api_data = requests.get(url).json()

# GOOD: Handle errors
try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        api_data = response.json()
    else:
        api_data = None
except requests.RequestException:
    api_data = None
Mistake 3: Blocking Scrapy
# BAD: 100 slow requests block everything
for i in range(100):
    requests.get(slow_api)  # Takes 5 seconds each!

# GOOD: Keep it minimal
if really_needed:
    requests.get(api, timeout=2)
Quick Decision Guide
Use Scrapy when:
✓ Scraping multiple pages
✓ Following links
✓ Need retries
✓ Need rate limiting
✓ Respecting robots.txt
Use requests when:
✓ Single API call
✓ Authentication
✓ File download
✓ Quick external check
✓ Testing connection
Use both when:
✓ Scraping HTML + API data
✓ Need different tools for different jobs
✓ Want best of both worlds
Summary
What is requests?
Simple Python library for HTTP requests.
Why use it with Scrapy?
- API calls (JSON data)
- Authentication
- File downloads
- External data checks
When to use it:
- API endpoints
- One-off requests
- Simple authentication
- Quick external checks
When NOT to use it:
- Main scraping (use Scrapy)
- Pagination (use Scrapy)
- When you need retries (use Scrapy)
Best practice:
# Scrapy for main scraping
def parse(self, response):
    html_data = response.css('.data::text').get()

    # requests for API calls
    api_data = requests.get(api_url, timeout=5).json()

    # Combine both
    yield {'html': html_data, 'api': api_data}
Remember:
- Always add timeout
- Always handle errors
- Keep requests calls minimal
- Don't block Scrapy's async
- Use the right tool for each job
The best approach is combining both: Scrapy for scraping, requests for API calls!
Happy scraping! 🕷️