I was building a Scrapy spider when I hit a weird situation. The website had an API endpoint that returned JSON data, but Scrapy kept trying to parse it as HTML.
I spent hours trying to make Scrapy work with the API. Then I realized I could just use Python's requests library inside my spider. Problem solved in 5 minutes.
Sometimes Scrapy isn't the right tool for every single request. Let me show you when and how to use requests inside Scrapy.
What is Python Requests?
requests is a simple Python library for making HTTP requests.
Think of it like this:
- Scrapy = A complete factory with assembly line, workers, quality control
- requests = A simple tool you hold in your hand
Sometimes you need the whole factory. Sometimes you just need the simple tool.
Why Use Requests Inside Scrapy?
Reason 1: API Calls
Some websites have APIs that return pure JSON (no HTML).
Problem with Scrapy:
def parse(self, response):
    data = response.json()  # Works, but awkward
Easier with requests:
import requests

def parse(self, response):
    api_data = requests.get('https://api.example.com/data').json()
    # Clean JSON, easy to use
Reason 2: Authentication APIs
Login endpoints often need specific formatting.
Problem:
Scrapy's FormRequest can be complex for APIs.
Solution:
import requests

# Simple API login
response = requests.post(
    'https://example.com/api/login',
    json={'username': 'user', 'password': 'pass'}
)
token = response.json()['token']
Much simpler!
Reason 3: External Data Sources
You need data from a different website while scraping.
Example:
- Scraping products from Website A
- Need to check prices on Website B's API
- Want to do both in one spider
def parse(self, response):
    product_name = response.css('.product-name::text').get()

    # Quick API check on different site
    price_api = f'https://pricecheck.com/api?product={product_name}'
    external_price = requests.get(price_api).json()['price']

    yield {
        'name': product_name,
        'external_price': external_price
    }
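One detail worth knowing: if the product name can contain spaces or symbols, it's safer to let requests build and encode the query string via params rather than formatting the URL by hand. A small variant of the call above (same placeholder API):

# Let requests URL-encode the query string (spaces, symbols, etc.)
external_price = requests.get(
    'https://pricecheck.com/api',
    params={'product': product_name},
    timeout=5
).json()['price']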
Reason 4: File Downloads
Downloading files (PDFs, images) is simpler with requests.
import requests

# Download PDF
pdf_url = 'https://example.com/report.pdf'
pdf = requests.get(pdf_url)

with open('report.pdf', 'wb') as f:
    f.write(pdf.content)
Easier than using Scrapy's file pipeline for one-off downloads.
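For large files, a streamed download avoids loading the whole response into memory at once. A minimal sketch (the URL is a placeholder):

import requests

# Stream a large PDF to disk in chunks instead of holding it all in memory
pdf_url = 'https://example.com/big-report.pdf'

with requests.get(pdf_url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open('big-report.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)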
When to Use What?
Use Scrapy When:
- Scraping multiple pages
- Following pagination
- Need to respect robots.txt
- Want automatic retries
- Need rate limiting
- Crawling whole websites
Use requests When:
- Single API call
- Quick external check
- Downloading single file
- Simple authentication
- Testing endpoints
- One-off requests
Use Both When:
- Scrapy for main scraping
- requests for API calls
- Best of both worlds!
Basic Example: requests Inside Scrapy
Simple API Call
import scrapy
import requests

class ApiSpider(scrapy.Spider):
    name = 'api'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            product_id = product.css('::attr(data-id)').get()

            # Use requests to get API data
            api_url = f'https://api.example.com/products/{product_id}'
            api_response = requests.get(api_url)

            if api_response.status_code == 200:
                api_data = api_response.json()

                yield {
                    'name': product.css('h2::text').get(),
                    'price': api_data['price'],
                    'stock': api_data['stock']
                }
What this does:
- Scrapy scrapes the main page
- Gets product IDs from HTML
- Uses requests to fetch API data for each product
- Combines both data sources
Real Example: Product Scraper with API
Let's build a real spider that uses both Scrapy and requests.
The Scenario
Website structure:
- Product listing page (HTML)
- Individual product pages (HTML)
- Price API (JSON)
We want:
- Product names from HTML
- Live prices from API
The Spider
import scrapy
import requests

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    # API settings
    api_base = 'https://api.example.com'
    api_key = 'your-api-key-here'

    def parse(self, response):
        """Parse product listing page"""
        for product in response.css('.product'):
            # Get basic info from HTML
            name = product.css('.name::text').get()
            sku = product.css('::attr(data-sku)').get()

            # Get live price from API
            price = self.get_price_from_api(sku)

            yield {
                'name': name,
                'sku': sku,
                'price': price
            }

        # Follow next page (Scrapy handles this)
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_price_from_api(self, sku):
        """Helper method using requests"""
        try:
            url = f'{self.api_base}/prices/{sku}'
            headers = {'Authorization': f'Bearer {self.api_key}'}
            response = requests.get(url, headers=headers, timeout=5)

            if response.status_code == 200:
                return response.json()['price']
            else:
                self.logger.warning(f'API failed for SKU {sku}')
                return None
        except Exception as e:
            self.logger.error(f'API error: {e}')
            return None
What's happening:
- Scrapy scrapes HTML for product names
- requests fetches live prices from API
- Combines both into final item
- Scrapy handles pagination
- requests handles API calls
Perfect combination!
Authentication Example
Logging In with API
import scrapy
import requests

class AuthSpider(scrapy.Spider):
    name = 'auth'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Login on spider start
        self.token = self.login()

    def login(self):
        """Use requests to login and get token"""
        login_url = 'https://example.com/api/login'
        credentials = {
            'username': 'myuser',
            'password': 'mypass'
        }

        response = requests.post(login_url, json=credentials)

        if response.status_code == 200:
            token = response.json()['access_token']
            self.logger.info('Login successful!')
            return token
        else:
            self.logger.error('Login failed!')
            return None

    def start_requests(self):
        if not self.token:
            self.logger.error('No token, cannot scrape')
            return

        # Use token in Scrapy requests
        headers = {'Authorization': f'Bearer {self.token}'}
        yield scrapy.Request(
            'https://example.com/protected-data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        # Scrape protected content
        yield {'data': response.css('.data::text').get()}
Why this works:
- requests handles complex login
- Gets authentication token
- Scrapy uses token for scraping
- Clean separation of concerns
Downloading Files Example
Download PDFs with requests
import scrapy
import requests
import os

class PdfSpider(scrapy.Spider):
    name = 'pdfs'
    start_urls = ['https://example.com/reports']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create download folder
        os.makedirs('downloads', exist_ok=True)

    def parse(self, response):
        for report in response.css('.report'):
            title = report.css('.title::text').get()
            # urljoin handles relative links
            pdf_url = response.urljoin(report.css('a::attr(href)').get())

            # Download PDF with requests
            downloaded = self.download_pdf(pdf_url, title)

            yield {
                'title': title,
                'url': pdf_url,
                'downloaded': downloaded
            }

    def download_pdf(self, url, filename):
        """Download PDF using requests; returns True on success"""
        try:
            self.logger.info(f'Downloading {filename}...')
            response = requests.get(url, timeout=30)

            if response.status_code == 200:
                # Clean filename
                safe_filename = filename.replace('/', '_')[:50]
                filepath = f'downloads/{safe_filename}.pdf'

                with open(filepath, 'wb') as f:
                    f.write(response.content)

                self.logger.info(f'Saved to {filepath}')
                return True

            self.logger.warning(f'Failed to download {filename}')
            return False
        except Exception as e:
            self.logger.error(f'Download error: {e}')
            return False
Checking External Data
Price Comparison Spider
import scrapy
import requests

class PriceCompareSpider(scrapy.Spider):
    name = 'compare'
    start_urls = ['https://shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('.name::text').get()
            our_price = product.css('.price::text').get()

            # Check competitor price via API
            competitor_price = self.check_competitor_price(name)

            yield {
                'name': name,
                'our_price': our_price,
                'competitor_price': competitor_price,
                'cheaper': (
                    float(our_price) < float(competitor_price)
                    if our_price and competitor_price else None
                )
            }

    def check_competitor_price(self, product_name):
        """Check price on competitor's API"""
        try:
            api_url = 'https://competitor.com/api/search'
            params = {'q': product_name}
            response = requests.get(api_url, params=params, timeout=5)

            if response.status_code == 200:
                results = response.json()
                if results:
                    return results[0]['price']
            return None
        except requests.RequestException:
            return None
Common Patterns
Pattern 1: API + HTML Combo
def parse(self, response):
    # HTML data
    title = response.css('h1::text').get()

    # API data
    api_url = 'https://api.example.com/data'
    api_data = requests.get(api_url).json()

    # Combine
    yield {
        'title': title,
        'api_details': api_data
    }
Pattern 2: Pre-spider API Check
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Check API before scraping
    status = requests.get('https://api.example.com/status').json()
    if not status['available']:
        raise Exception('API not available')
Pattern 3: Post-scrape API Update
def closed(self, reason):
    # Notify API that scraping is done
    requests.post(
        'https://api.example.com/scrape-complete',
        json={'spider': self.name, 'reason': reason}
    )
Important Tips
Tip 1: Add Timeout
Always add a timeout to requests calls:
# BAD: Can hang forever
response = requests.get(url)
# GOOD: Times out after 5 seconds
response = requests.get(url, timeout=5)
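requests also accepts a (connect, read) tuple if you want separate limits for establishing the connection and for reading the response:

# 3 seconds to connect, 10 seconds to read the response
response = requests.get(url, timeout=(3, 10))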
Tip 2: Handle Errors
try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        data = response.json()
    else:
        self.logger.warning(f'API returned {response.status_code}')
except requests.Timeout:
    self.logger.error('API timeout')
except Exception as e:
    self.logger.error(f'API error: {e}')
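An equivalent, slightly shorter pattern is to let requests raise on HTTP error codes and catch everything in one place (requests.HTTPError and requests.Timeout are both subclasses of requests.RequestException):

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    data = response.json()
except requests.RequestException as e:
    self.logger.error(f'API error: {e}')
    data = None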
Tip 3: Don't Block Scrapy
Keep requests calls fast:
# BAD: Slow API calls block Scrapy
for i in range(100):
    requests.get(slow_api)  # Each takes 10 seconds!

# GOOD: Only use requests when necessary
if need_api_data:
    requests.get(api_url, timeout=2)
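Another easy win is caching API responses on the spider so repeated lookups for the same key only block once. A minimal sketch, with a hypothetical price API:

import scrapy
import requests

class CachedApiSpider(scrapy.Spider):
    name = 'cached_api'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._price_cache = {}  # sku -> price, so repeat SKUs never re-hit the API

    def get_price(self, sku):
        if sku not in self._price_cache:
            resp = requests.get(f'https://api.example.com/prices/{sku}', timeout=2)
            self._price_cache[sku] = resp.json().get('price') if resp.ok else None
        return self._price_cache[sku]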
Tip 4: Use Session for Multiple Calls
If you're making many requests calls, reuse a single Session:
import scrapy
import requests

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create session (reuses connections)
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'MySpider/1.0'
        })

    def parse(self, response):
        # Faster repeated calls
        data1 = self.session.get('https://api.example.com/1').json()
        data2 = self.session.get('https://api.example.com/2').json()
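If you go this route, it's also worth closing the session when the spider finishes, for example in the spider's closed() hook (a small addition to the same class):

    def closed(self, reason):
        # Release the pooled connections held by the requests session
        self.session.close()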
When NOT to Use requests
Don't Use requests For:
1. Main scraping
# BAD: Using requests for pagination
for page in range(100):
    html = requests.get(f'https://example.com/page/{page}').text
    # Parse with BeautifulSoup

# GOOD: Let Scrapy handle it
def parse(self, response):
    # Scrapy handles retries, delays, etc.
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
2. When you need Scrapy features
# BAD: Lose Scrapy benefits
requests.get(url) # No retries, no delays, no robots.txt
# GOOD: Use Scrapy
yield scrapy.Request(url) # Has retries, delays, robots.txt
3. Asynchronous scraping
# BAD: Blocks Scrapy's async
response = requests.get(url) # Blocks!
# GOOD: Scrapy is already async
yield scrapy.Request(url) # Non-blocking
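That said, if you really do need many requests-style calls without stalling the crawl, one option is to run the blocking call in Twisted's thread pool and await it from an async callback. This is only a sketch, assuming Scrapy 2.6+ (for maybe_deferred_to_future) and a hypothetical price API:

import requests
import scrapy
from scrapy.utils.defer import maybe_deferred_to_future
from twisted.internet.threads import deferToThread

class ThreadedApiSpider(scrapy.Spider):
    name = 'threaded_api'
    start_urls = ['https://example.com/products']

    async def parse(self, response):
        sku = response.css('::attr(data-sku)').get()
        # Run the blocking requests call in Twisted's thread pool so the
        # reactor keeps downloading other pages while we wait
        price = await maybe_deferred_to_future(
            deferToThread(self.fetch_price, sku)
        )
        yield {'sku': sku, 'price': price}

    def fetch_price(self, sku):
        # Plain blocking call; fine here because it runs off the reactor thread
        resp = requests.get(f'https://api.example.com/prices/{sku}', timeout=5)
        return resp.json().get('price') if resp.ok else None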
Complete Real Example
Here's a complete spider using both Scrapy and requests:
import scrapy
import requests

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://shop.example.com/products']

    # API settings
    api_url = 'https://api.example.com'
    api_key = 'your-key'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Test API connection
        if not self.test_api():
            raise Exception('API not available')

    def test_api(self):
        """Test API with requests"""
        try:
            response = requests.get(
                f'{self.api_url}/health',
                timeout=5
            )
            return response.status_code == 200
        except requests.RequestException:
            return False

    def parse(self, response):
        """Parse product listing"""
        for product in response.css('.product'):
            # Get HTML data
            name = product.css('.name::text').get()
            url = product.css('a::attr(href)').get()
            sku = product.css('::attr(data-sku)').get()

            # Get API data
            inventory = self.get_inventory(sku)
            reviews = self.get_reviews(sku)

            yield {
                'name': name,
                'url': response.urljoin(url),
                'sku': sku,
                'in_stock': inventory['in_stock'],
                'quantity': inventory['quantity'],
                'avg_rating': reviews['avg_rating'],
                'review_count': reviews['count']
            }

        # Scrapy handles pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_inventory(self, sku):
        """Get inventory from API using requests"""
        try:
            url = f'{self.api_url}/inventory/{sku}'
            headers = {'X-API-Key': self.api_key}
            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'in_stock': data['available'],
                    'quantity': data['qty']
                }
            else:
                self.logger.warning(f'Inventory API failed for {sku}')
                return {'in_stock': None, 'quantity': None}
        except Exception as e:
            self.logger.error(f'Inventory error: {e}')
            return {'in_stock': None, 'quantity': None}

    def get_reviews(self, sku):
        """Get reviews from API using requests"""
        try:
            url = f'{self.api_url}/reviews/{sku}'
            headers = {'X-API-Key': self.api_key}
            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'avg_rating': data['average'],
                    'count': data['total']
                }
            else:
                return {'avg_rating': None, 'count': 0}
        except Exception as e:
            self.logger.error(f'Reviews error: {e}')
            return {'avg_rating': None, 'count': 0}

    def closed(self, reason):
        """Send completion notification via API"""
        try:
            url = f'{self.api_url}/scrape-complete'
            data = {
                'spider': self.name,
                'reason': reason,
                'stats': dict(self.crawler.stats.get_stats())
            }
            requests.post(url, json=data, timeout=5)
            self.logger.info('Sent completion notification')
        except Exception:
            self.logger.warning('Could not send notification')
This spider:
- Tests API connection on start (requests)
- Scrapes product listings (Scrapy)
- Gets inventory data (requests + API)
- Gets review data (requests + API)
- Handles pagination (Scrapy)
- Sends completion notification (requests)
Perfect combination of both tools!
Common Mistakes
Mistake 1: Using requests for Everything
# BAD: Why use Scrapy at all?
def parse(self, response):
    html = requests.get('https://example.com').text
    # Just use the requests library alone!

# GOOD: Scrapy for scraping, requests for APIs
def parse(self, response):
    # Scrapy handles the page
    name = response.css('.name::text').get()
    # requests for quick API call
    price = requests.get(api_url).json()['price']
Mistake 2: No Error Handling
# BAD: Will crash on error
api_data = requests.get(url).json()

# GOOD: Handle errors
try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        api_data = response.json()
    else:
        api_data = None
except requests.RequestException:
    api_data = None
Mistake 3: Blocking Scrapy
# BAD: 100 slow requests block everything
for i in range(100):
    requests.get(slow_api)  # Takes 5 seconds each!

# GOOD: Keep it minimal
if really_needed:
    requests.get(api, timeout=2)
Quick Decision Guide
Use Scrapy when:
✓ Scraping multiple pages
✓ Following links
✓ Need retries
✓ Need rate limiting
✓ Respecting robots.txt
Use requests when:
✓ Single API call
✓ Authentication
✓ File download
✓ Quick external check
✓ Testing connection
Use both when:
✓ Scraping HTML + API data
✓ Need different tools for different jobs
✓ Want best of both worlds
Summary
What is requests?
Simple Python library for HTTP requests.
Why use it with Scrapy?
- API calls (JSON data)
- Authentication
- File downloads
- External data checks
When to use it:
- API endpoints
- One-off requests
- Simple authentication
- Quick external checks
When NOT to use it:
- Main scraping (use Scrapy)
- Pagination (use Scrapy)
- When you need retries (use Scrapy)
Best practice:
# Scrapy for main scraping
def parse(self, response):
    html_data = response.css('.data::text').get()

    # requests for API calls
    api_data = requests.get(api_url, timeout=5).json()

    # Combine both
    yield {'html': html_data, 'api': api_data}
Remember:
- Always add timeout
- Always handle errors
- Keep requests calls minimal
- Don't block Scrapy's async
- Use the right tool for each job
The best approach is combining both: Scrapy for scraping, requests for API calls!
Happy scraping! 🕷️