Muhammad Ikramullah Khan

Using Python Requests Inside Scrapy: The Beginner's Guide

I was building a Scrapy spider when I hit a weird situation. The website had an API endpoint that returned JSON data, but Scrapy kept trying to parse it as HTML.

I spent hours trying to make Scrapy work with the API. Then I realized I could just use Python's requests library inside my spider. Problem solved in 5 minutes.

Sometimes Scrapy isn't the right tool for every single request. Let me show you when and how to use requests inside Scrapy.


What is Python Requests?

requests is a simple Python library for making HTTP requests.

Think of it like this:

  • Scrapy = A complete factory with assembly line, workers, quality control
  • requests = A simple tool you hold in your hand

Sometimes you need the whole factory. Sometimes you just need the simple tool.
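
To make the "simple tool" concrete, here is a minimal sketch of requests on its own (the URL is just a placeholder):

import requests

# One call fetches the URL; .json() decodes a JSON response body
response = requests.get('https://api.example.com/data', timeout=5)
data = response.json()
print(data)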


Why Use Requests Inside Scrapy?

Reason 1: API Calls

Some websites have APIs that return pure JSON (no HTML).

Problem with Scrapy:

def parse(self, response):
    data = response.json()  # Works here, but a second API URL means another Request and another callback

Easier with requests:

import requests

def parse(self, response):
    api_data = requests.get('https://api.example.com/data').json()
    # Clean JSON, easy to use

Reason 2: Authentication APIs

Login endpoints often need specific formatting.

Problem:
Scrapy's FormRequest sends form-encoded data by default, so JSON login APIs take extra setup.

Solution:

import requests

# Simple API login
response = requests.post(
    'https://example.com/api/login',
    json={'username': 'user', 'password': 'pass'}
)
token = response.json()['token']

Much simpler!

Reason 3: External Data Sources

You need data from a different website while scraping.

Example:

  • Scraping products from Website A
  • Need to check prices on Website B's API
  • Want to do both in one spider

import requests

def parse(self, response):
    product_name = response.css('.product-name::text').get()

    # Quick API check on different site
    price_api = f'https://pricecheck.com/api?product={product_name}'
    external_price = requests.get(price_api).json()['price']

    yield {
        'name': product_name,
        'external_price': external_price
    }

Reason 4: File Downloads

Downloading files (PDFs, images) is simpler with requests.

import requests

# Download PDF
pdf_url = 'https://example.com/report.pdf'
pdf = requests.get(pdf_url)

with open('report.pdf', 'wb') as f:
    f.write(pdf.content)

Easier than using Scrapy's file pipeline for one-off downloads.
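
If the file is large, requests can also stream the download in chunks instead of loading the whole body into memory. A rough sketch (the URL is a placeholder):

import requests

# Stream the file in chunks instead of holding it all in memory
pdf_url = 'https://example.com/big-report.pdf'

with requests.get(pdf_url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open('big-report.pdf', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)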


When to Use What?

Use Scrapy When:

  • Scraping multiple pages
  • Following pagination
  • Need robots.txt respect
  • Want automatic retries
  • Need rate limiting
  • Crawling whole websites

Use requests When:

  • Single API call
  • Quick external check
  • Downloading single file
  • Simple authentication
  • Testing endpoints
  • One-off requests

Use Both When:

  • Scrapy for main scraping
  • requests for API calls
  • Best of both worlds!

Basic Example: requests Inside Scrapy

Simple API Call

import scrapy
import requests

class ApiSpider(scrapy.Spider):
    name = 'api'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            product_id = product.css('::attr(data-id)').get()

            # Use requests to get API data
            api_url = f'https://api.example.com/products/{product_id}'
            api_response = requests.get(api_url)

            if api_response.status_code == 200:
                api_data = api_response.json()

                yield {
                    'name': product.css('h2::text').get(),
                    'price': api_data['price'],
                    'stock': api_data['stock']
                }

What this does:

  1. Scrapy scrapes the main page
  2. Gets product IDs from HTML
  3. Uses requests to fetch API data for each product
  4. Combines both data sources

Real Example: Product Scraper with API

Let's build a real spider that uses both Scrapy and requests.

The Scenario

Website structure:

  • Product listing page (HTML)
  • Individual product pages (HTML)
  • Price API (JSON)

We want:

  • Product names from HTML
  • Live prices from API

The Spider

import scrapy
import requests

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    # API settings
    api_base = 'https://api.example.com'
    api_key = 'your-api-key-here'

    def parse(self, response):
        """Parse product listing page"""
        for product in response.css('.product'):
            # Get basic info from HTML
            name = product.css('.name::text').get()
            sku = product.css('::attr(data-sku)').get()

            # Get live price from API
            price = self.get_price_from_api(sku)

            yield {
                'name': name,
                'sku': sku,
                'price': price
            }

        # Follow next page (Scrapy handles this)
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_price_from_api(self, sku):
        """Helper method using requests"""
        try:
            url = f'{self.api_base}/prices/{sku}'
            headers = {'Authorization': f'Bearer {self.api_key}'}

            response = requests.get(url, headers=headers, timeout=5)

            if response.status_code == 200:
                return response.json()['price']
            else:
                self.logger.warning(f'API failed for SKU {sku}')
                return None

        except Exception as e:
            self.logger.error(f'API error: {e}')
            return None

What's happening:

  • Scrapy scrapes HTML for product names
  • requests fetches live prices from API
  • Combines both into final item
  • Scrapy handles pagination
  • requests handles API calls

Perfect combination!


Authentication Example

Logging In with API

import scrapy
import requests

class AuthSpider(scrapy.Spider):
    name = 'auth'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Login on spider start
        self.token = self.login()

    def login(self):
        """Use requests to login and get token"""
        login_url = 'https://example.com/api/login'

        credentials = {
            'username': 'myuser',
            'password': 'mypass'
        }

        response = requests.post(login_url, json=credentials)

        if response.status_code == 200:
            token = response.json()['access_token']
            self.logger.info('Login successful!')
            return token
        else:
            self.logger.error('Login failed!')
            return None

    def start_requests(self):
        if not self.token:
            self.logger.error('No token, cannot scrape')
            return

        # Use token in Scrapy requests
        headers = {'Authorization': f'Bearer {self.token}'}

        yield scrapy.Request(
            'https://example.com/protected-data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        # Scrape protected content
        yield {'data': response.css('.data::text').get()}

Why this works:

  • requests handles complex login
  • Gets authentication token
  • Scrapy uses token for scraping
  • Clean separation of concerns

Downloading Files Example

Download PDFs with requests

import scrapy
import requests
import os

class PdfSpider(scrapy.Spider):
    name = 'pdfs'
    start_urls = ['https://example.com/reports']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create download folder
        os.makedirs('downloads', exist_ok=True)

    def parse(self, response):
        for report in response.css('.report'):
            title = report.css('.title::text').get()
            pdf_url = report.css('a::attr(href)').get()

            # Download PDF with requests
            self.download_pdf(pdf_url, title)

            yield {
                'title': title,
                'url': pdf_url,
                'downloaded': True
            }

    def download_pdf(self, url, filename):
        """Download PDF using requests"""
        try:
            self.logger.info(f'Downloading {filename}...')

            response = requests.get(url, timeout=30)

            if response.status_code == 200:
                # Clean filename
                safe_filename = filename.replace('/', '_')[:50]
                filepath = f'downloads/{safe_filename}.pdf'

                with open(filepath, 'wb') as f:
                    f.write(response.content)

                self.logger.info(f'Saved to {filepath}')
            else:
                self.logger.warning(f'Failed to download {filename}')

        except Exception as e:
            self.logger.error(f'Download error: {e}')

Checking External Data

Price Comparison Spider

import scrapy
import requests

class PriceCompareSpider(scrapy.Spider):
    name = 'compare'
    start_urls = ['https://shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('.name::text').get()
            our_price = product.css('.price::text').get()

            # Check competitor price via API
            competitor_price = self.check_competitor_price(name)

            yield {
                'name': name,
                'our_price': our_price,
                'competitor_price': competitor_price,
                'cheaper': float(our_price) < float(competitor_price) if competitor_price else None
            }

    def check_competitor_price(self, product_name):
        """Check price on competitor's API"""
        try:
            api_url = 'https://competitor.com/api/search'
            params = {'q': product_name}

            response = requests.get(api_url, params=params, timeout=5)

            if response.status_code == 200:
                results = response.json()
                if results:
                    return results[0]['price']

            return None

        except Exception:
            return None

Common Patterns

Pattern 1: API + HTML Combo

def parse(self, response):
    # HTML data
    title = response.css('h1::text').get()

    # API data
    api_url = 'https://api.example.com/data'
    api_data = requests.get(api_url).json()

    # Combine
    yield {
        'title': title,
        'api_details': api_data
    }

Pattern 2: Pre-spider API Check

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Check API before scraping
    status = requests.get('https://api.example.com/status', timeout=5).json()

    if not status['available']:
        raise Exception('API not available')

Pattern 3: Post-scrape API Update

def closed(self, reason):
    # Notify API that scraping is done
    requests.post(
        'https://api.example.com/scrape-complete',
        json={'spider': self.name, 'reason': reason}
    )

Important Tips

Tip 1: Add Timeout

Always add a timeout to requests calls:

# BAD: Can hang forever
response = requests.get(url)

# GOOD: Times out after 5 seconds
response = requests.get(url, timeout=5)
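
If you want separate limits for connecting and for reading the response, requests also accepts a (connect, read) tuple:

# 3 seconds to establish the connection, 10 seconds to receive data
response = requests.get(url, timeout=(3, 10))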

Tip 2: Handle Errors

try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        data = response.json()
    else:
        self.logger.warning(f'API returned {response.status_code}')
except requests.Timeout:
    self.logger.error('API timeout')
except Exception as e:
    self.logger.error(f'API error: {e}')

Tip 3: Don't Block Scrapy

Keep requests calls fast:

# BAD: Slow API calls block Scrapy
for i in range(100):
    requests.get(slow_api)  # Each takes 10 seconds!

# GOOD: Only use requests when necessary
if need_api_data:
    requests.get(api_url, timeout=2)
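
If you genuinely need several API calls per page, one way to keep them from blocking Scrapy is to run the requests call in a worker thread from a coroutine callback. This is only a rough sketch: it assumes Scrapy 2.x with the asyncio reactor enabled (TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor') and Python 3.9+, and the URLs are placeholders.

import asyncio

import requests
import scrapy

class ApiHeavySpider(scrapy.Spider):
    name = 'api_heavy'
    start_urls = ['https://example.com/products']

    async def parse(self, response):
        api_url = 'https://api.example.com/data'

        # The blocking requests call runs in a worker thread, so the
        # event loop keeps processing other Scrapy requests meanwhile
        api_response = await asyncio.to_thread(requests.get, api_url, timeout=5)

        return [{'api_data': api_response.json()}]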

Tip 4: Use Session for Multiple Calls

If you're making many requests calls, reuse a Session:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Create session (reuses connections and cookies)
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'MySpider/1.0'
        })

    def parse(self, response):
        # Faster repeated calls through the shared session
        data1 = self.session.get('https://api.example.com/1', timeout=5).json()
        data2 = self.session.get('https://api.example.com/2', timeout=5).json()
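
One small addition worth making to the spider above: close the session when the spider finishes so its pooled connections are released. A sketch, assuming you add a closed() method to the same class:

    def closed(self, reason):
        # Release the session's pooled connections when the spider stops
        self.session.close()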

When NOT to Use requests

Don't Use requests For:

1. Main scraping

# BAD: Using requests for pagination
for page in range(100):
    html = requests.get(f'https://example.com/page/{page}').text
    # Parse with BeautifulSoup

# GOOD: Let Scrapy handle it
def parse(self, response):
    # Scrapy handles retries, delays, etc
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

2. When you need Scrapy features

# BAD: Lose Scrapy benefits
requests.get(url)  # No retries, no delays, no robots.txt

# GOOD: Use Scrapy
yield scrapy.Request(url)  # Has retries, delays, robots.txt

3. Asynchronous scraping

# BAD: Blocks Scrapy's async
response = requests.get(url)  # Blocks!

# GOOD: Scrapy is already async
yield scrapy.Request(url)  # Non-blocking

Complete Real Example

Here's a complete spider using both Scrapy and requests:

import scrapy
import requests
import json

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://shop.example.com/products']

    # API settings
    api_url = 'https://api.example.com'
    api_key = 'your-key'

    def __init__(self):
        super().__init__()
        # Test API connection
        if not self.test_api():
            raise Exception('API not available')

    def test_api(self):
        """Test API with requests"""
        try:
            response = requests.get(
                f'{self.api_url}/health',
                timeout=5
            )
            return response.status_code == 200
        except Exception:
            return False

    def parse(self, response):
        """Parse product listing"""
        for product in response.css('.product'):
            # Get HTML data
            name = product.css('.name::text').get()
            url = product.css('a::attr(href)').get()
            sku = product.css('::attr(data-sku)').get()

            # Get API data
            inventory = self.get_inventory(sku)
            reviews = self.get_reviews(sku)

            yield {
                'name': name,
                'url': response.urljoin(url),
                'sku': sku,
                'in_stock': inventory['in_stock'],
                'quantity': inventory['quantity'],
                'avg_rating': reviews['avg_rating'],
                'review_count': reviews['count']
            }

        # Scrapy handles pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def get_inventory(self, sku):
        """Get inventory from API using requests"""
        try:
            url = f'{self.api_url}/inventory/{sku}'
            headers = {'X-API-Key': self.api_key}

            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'in_stock': data['available'],
                    'quantity': data['qty']
                }
            else:
                self.logger.warning(f'Inventory API failed for {sku}')
                return {'in_stock': None, 'quantity': None}

        except Exception as e:
            self.logger.error(f'Inventory error: {e}')
            return {'in_stock': None, 'quantity': None}

    def get_reviews(self, sku):
        """Get reviews from API using requests"""
        try:
            url = f'{self.api_url}/reviews/{sku}'
            headers = {'X-API-Key': self.api_key}

            response = requests.get(url, headers=headers, timeout=3)

            if response.status_code == 200:
                data = response.json()
                return {
                    'avg_rating': data['average'],
                    'count': data['total']
                }
            else:
                return {'avg_rating': None, 'count': 0}

        except Exception as e:
            self.logger.error(f'Reviews error: {e}')
            return {'avg_rating': None, 'count': 0}

    def closed(self, reason):
        """Send completion notification via API"""
        try:
            url = f'{self.api_url}/scrape-complete'
            data = {
                'spider': self.name,
                'reason': reason,
                'stats': dict(self.crawler.stats.get_stats())
            }

            requests.post(url, json=data, timeout=5)
            self.logger.info('Sent completion notification')
        except Exception:
            self.logger.warning('Could not send notification')

This spider:

  • Tests API connection on start (requests)
  • Scrapes product listings (Scrapy)
  • Gets inventory data (requests + API)
  • Gets review data (requests + API)
  • Handles pagination (Scrapy)
  • Sends completion notification (requests)

Perfect combination of both tools!


Common Mistakes

Mistake 1: Using requests for Everything

# BAD: Why use Scrapy at all?
def parse(self, response):
    html = requests.get('https://example.com').text
    # Just use requests library alone!

# GOOD: Scrapy for scraping, requests for APIs
def parse(self, response):
    # Scrapy handles the page
    name = response.css('.name::text').get()

    # requests for quick API call
    price = requests.get(api_url).json()['price']

Mistake 2: No Error Handling

# BAD: Will crash on error
api_data = requests.get(url).json()

# GOOD: Handle errors
try:
    response = requests.get(url, timeout=5)
    if response.status_code == 200:
        api_data = response.json()
    else:
        api_data = None
except Exception:
    api_data = None

Mistake 3: Blocking Scrapy

# BAD: 100 slow requests blocks everything
for i in range(100):
    requests.get(slow_api)  # Takes 5 seconds each!

# GOOD: Keep it minimal
if really_needed:
    requests.get(api, timeout=2)

Quick Decision Guide

Use Scrapy when:

✓ Scraping multiple pages
✓ Following links
✓ Need retries
✓ Need rate limiting
✓ Respecting robots.txt

Use requests when:

✓ Single API call
✓ Authentication
✓ File download
✓ Quick external check
✓ Testing connection

Use both when:

✓ Scraping HTML + API data
✓ Need different tools for different jobs
✓ Want best of both worlds

Summary

What is requests?
Simple Python library for HTTP requests.

Why use it with Scrapy?

  • API calls (JSON data)
  • Authentication
  • File downloads
  • External data checks

When to use it:

  • API endpoints
  • One-off requests
  • Simple authentication
  • Quick external checks

When NOT to use it:

  • Main scraping (use Scrapy)
  • Pagination (use Scrapy)
  • When you need retries (use Scrapy)

Best practice:

# Scrapy for main scraping
def parse(self, response):
    html_data = response.css('.data::text').get()

    # requests for API calls
    api_data = requests.get(api_url, timeout=5).json()

    # Combine both
    yield {'html': html_data, 'api': api_data}

Remember:

  • Always add timeout
  • Always handle errors
  • Keep requests calls minimal
  • Don't block Scrapy's async
  • Use right tool for each job

The best approach is combining both: Scrapy for scraping, requests for API calls!

Happy scraping! 🕷️
