Muhammad Ikramullah Khan

Finding API Endpoints: Scrape the Data Source, Not the Website

The moment I learned to find API endpoints changed everything. I was struggling to scrape a product listing site with Selenium. It took 5 minutes to render one page.

Then I opened the Network tab and found the API. Same data, but as clean JSON. I switched to scraping the API directly.

Results:

  • Before: 5 minutes per page, messy HTML parsing
  • After: 2 seconds per page, clean JSON data

Finding APIs is the secret weapon of professional scrapers. Let me show you how.


Why APIs Are Better Than Scraping HTML

Scraping HTML:

  • Slow (download + parse)
  • Brittle (breaks when design changes)
  • Messy (nested tags, inconsistent structure)
  • Often needs JavaScript rendering (even slower)

Scraping an API:

  • Fast (just download JSON)
  • Stable (APIs change less often than page layouts)
  • Clean (structured JSON data)
  • No rendering needed

Speed comparison:

  • HTML scraping: 10-20 pages/second
  • API scraping: 100-500 pages/second

That's 10-50x faster!


How to Find API Endpoints

Step 1: Open Developer Tools

Chrome/Edge:

  • Press F12 or Ctrl+Shift+I
  • Click "Network" tab

Firefox:

  • Press F12
  • Click "Network" tab

Step 2: Filter by XHR/Fetch

Click "XHR" or "Fetch" button in the Network tab. This shows only API requests.

Step 3: Refresh the Page

Press Ctrl+R to reload, and watch requests appear in the Network tab. Tick "Preserve log" if navigation keeps clearing the list.

Step 4: Look for JSON Responses

Click on requests one by one. Look for:

  • URLs containing /api/
  • Responses with JSON data
  • Requests with your target data

Step 5: Inspect the Request

Click an interesting request and check:

  1. URL (Request URL at top)
  2. Method (GET, POST, etc.)
  3. Headers (Authorization, cookies, etc.)
  4. Payload (if POST request)
  5. Response (the JSON data)
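
Before writing a spider, sanity-check the endpoint in Scrapy shell. A quick sketch, using the placeholder URL from the example below:

scrapy shell 'https://api.example.com/v1/products?page=1&limit=20&sort=popular'
>>> import json
>>> data = json.loads(response.text)
>>> data.keys()  # confirm you see the fields you spotted in DevTools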

Real Example: Product Listing

Let's say you're scraping products from a store.

What You See in Network Tab

Request URL: https://api.example.com/v1/products?page=1&limit=20&sort=popular
Method: GET
Status: 200

Response:
{
  "products": [
    {
      "id": 123,
      "name": "Widget Pro",
      "price": 29.99,
      "stock": 50
    },
    {
      "id": 124,
      "name": "Gadget Plus",
      "price": 49.99,
      "stock": 30
    }
  ],
  "total": 1523,
  "page": 1,
  "pages": 77
}

Perfect! You found the API.

Your Scrapy Spider

import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        url = 'https://api.example.com/v1/products?page=1&limit=20&sort=popular'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        # Extract products
        for product in data['products']:
            yield {
                'id': product['id'],
                'name': product['name'],
                'price': product['price'],
                'stock': product['stock']
            }

        # Pagination
        current_page = data['page']
        total_pages = data['pages']

        if current_page < total_pages:
            next_page = current_page + 1
            next_url = f'https://api.example.com/v1/products?page={next_page}&limit=20&sort=popular'
            yield scrapy.Request(next_url, callback=self.parse)

Done! Clean, fast, reliable.
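
If you save that spider as a standalone file (say products_spider.py; the name is up to you), you can run it without a full Scrapy project:

scrapy runspider products_spider.py -o products.json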


Finding Hidden APIs (Advanced)

Some APIs aren't obvious. Here's how to find them.

Technique 1: Search for "api" in Network Tab

Type "api" in the filter box. Shows only URLs containing "api".

Technique 2: Look for GraphQL

Many modern sites use GraphQL. Look for:

  • URL: https://example.com/graphql
  • Method: POST
  • Payload contains "query"

Example GraphQL request:

{
  "query": "{ products(limit: 20) { id name price } }"
}

Technique 3: Check WebSocket Connections

Some sites use WebSockets for real-time updates.

In Network tab:

  • Filter by "WS" (WebSocket)
  • Click on connection
  • View messages
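
Scrapy doesn't speak WebSockets natively, so a quick look at a live feed usually means a small standalone script. A minimal sketch using the websockets library (pip install websockets); the wss:// URL is hypothetical, so copy the real one from the WS filter:

import asyncio
import json
import websockets

async def read_feed():
    # Hypothetical endpoint: use the connection URL from the Network tab
    async with websockets.connect('wss://example.com/live') as ws:
        async for message in ws:
            data = json.loads(message)  # many feeds send JSON frames
            print(data)

asyncio.run(read_feed())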

Technique 4: Look at Script Tags

Sometimes API URLs are embedded in JavaScript:

import re

def parse(self, response):
    # Look for API URLs embedded in inline script tags
    scripts = response.css('script::text').getall()

    for script in scripts:
        if 'api.example.com' in script:
            # Pull full API URLs out of the JavaScript source
            urls = re.findall(r'https://api\.example\.com/[^"\']+', script)
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_api)

Handling API Authentication

Many APIs require authentication.

Type 1: API Key in URL

https://api.example.com/products?api_key=abc123def456

How to find it:

  • Check request URL in Network tab
  • Look for api_key, key, token parameters

Your spider:

def start_requests(self):
    api_key = 'abc123def456'
    url = f'https://api.example.com/products?api_key={api_key}'
    yield scrapy.Request(url)

Type 2: Bearer Token in Headers

Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

How to find it:

  • Network tab → Click request
  • Headers tab → Look for "Authorization"

Your spider:

def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN_HERE'
    }
    yield scrapy.Request(url, headers=headers)

Type 3: Session Cookies

Some APIs use cookies for auth.

How to find them:

  • Network tab → Click request
  • Headers tab → Look for "Cookie"

Your spider:

def start_requests(self):
    url = 'https://api.example.com/products'
    cookies = {
        'session_id': 'abc123',
        'user_token': 'xyz789'
    }
    yield scrapy.Request(url, cookies=cookies)

Type 4: Custom Headers

X-Api-Key: abc123
X-Client-Id: def456

Your spider:

def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'X-Api-Key': 'abc123',
        'X-Client-Id': 'def456'
    }
    yield scrapy.Request(url, headers=headers)

Handling POST Requests

Some APIs use POST instead of GET.

Finding POST Data

Network tab:

  • Click POST request
  • "Payload" tab
  • See the data sent

Example:

{
  "filters": {
    "category": "electronics",
    "price_max": 1000
  },
  "page": 1,
  "limit": 20
}

Your Spider

import scrapy
import json

class PostSpider(scrapy.Spider):
    name = 'post'

    def start_requests(self):
        url = 'https://api.example.com/search'

        payload = {
            'filters': {
                'category': 'electronics',
                'price_max': 1000
            },
            'page': 1,
            'limit': 20
        }

        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)
        for item in data['results']:
            yield item

Handling Pagination in APIs

APIs have different pagination styles.

Style 1: Page Numbers

/products?page=1
/products?page=2
/products?page=3

Spider:

from urllib.parse import urlparse, parse_qs

def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next page: read the current page number from the request URL
    query = parse_qs(urlparse(response.url).query)
    current_page = int(query['page'][0])

    if data['has_next']:
        next_page = current_page + 1
        next_url = f'https://api.example.com/products?page={next_page}'
        yield scrapy.Request(next_url, callback=self.parse)

Style 2: Offset/Limit

/products?offset=0&limit=20
/products?offset=20&limit=20
/products?offset=40&limit=20

Spider:

from urllib.parse import urlparse, parse_qs

def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next offset: read the current offset from the request URL
    total = data['total']
    query = parse_qs(urlparse(response.url).query)
    offset = int(query['offset'][0])
    limit = 20

    if offset + limit < total:
        next_offset = offset + limit
        next_url = f'https://api.example.com/products?offset={next_offset}&limit={limit}'
        yield scrapy.Request(next_url, callback=self.parse)

Style 3: Cursor-Based

/products?cursor=abc123
/products?cursor=def456

Spider:

def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next cursor (a missing or empty cursor means the last page)
    if data.get('next_cursor'):
        next_url = f"https://api.example.com/products?cursor={data['next_cursor']}"
        yield scrapy.Request(next_url, callback=self.parse)

GraphQL APIs

GraphQL is a query language for APIs; everything goes through a single endpoint.

Finding GraphQL Endpoints

Look for:

  • URL: /graphql
  • Method: POST
  • Content-Type: application/json
  • Body contains "query"

Example GraphQL Query

{
  "query": "query { products(limit: 20) { id name price description } }"
}

Scrapy Spider for GraphQL

import scrapy
import json

class GraphQLSpider(scrapy.Spider):
    name = 'graphql'

    def start_requests(self):
        url = 'https://example.com/graphql'

        query = '''
        query {
          products(limit: 20, offset: 0) {
            id
            name
            price
            description
          }
        }
        '''

        payload = {'query': query}

        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)

        for product in data['data']['products']:
            yield product

GraphQL Pagination

def start_requests(self):
    for offset in range(0, 1000, 20):  # 0, 20, 40, ... (assumes at most ~1000 items; adjust to fit)
        query = f'''
        query {{
          products(limit: 20, offset: {offset}) {{
            id
            name
            price
          }}
        }}
        '''

        payload = {'query': query}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )
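
A slightly cleaner variant of the same idea passes values through GraphQL's standard "variables" field instead of interpolating them into the query string. The field names and types here are assumptions about the server's schema:

def start_requests(self):
    # One parameterized query; the server fills in $limit and $offset
    query = '''
    query Products($limit: Int!, $offset: Int!) {
      products(limit: $limit, offset: $offset) {
        id
        name
        price
      }
    }
    '''
    for offset in range(0, 1000, 20):
        payload = {'query': query, 'variables': {'limit': 20, 'offset': offset}}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )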

Rate Limiting with APIs

APIs often have rate limits.

Detecting Rate Limits

Signs:

  • 429 status code (Too Many Requests)
  • Error message about rate limiting
  • Header: X-RateLimit-Remaining: 0

Handling Rate Limits

# settings.py

# Slow down
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4

# Auto throttle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
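
You can also let Scrapy's built-in RetryMiddleware retry 429 responses automatically. A minimal settings sketch:

# settings.py

# Retry 429s alongside the usual transient server errors
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 5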

Respecting Rate Limit Headers

def parse(self, response):
    # Check rate limit headers (Scrapy returns header values as bytes)
    remaining = response.headers.get('X-RateLimit-Remaining')
    if remaining and int(remaining) < 10:
        self.logger.warning('Approaching rate limit, slowing down')
        # e.g. stop yielding new requests for a while, or rely on AutoThrottle

    # Continue parsing (adjust the key to your API's response shape)
    data = json.loads(response.text)
    for item in data.get('items', []):
        yield item

Reverse Engineering API Parameters

Sometimes API URLs have cryptic parameters.

Common Parameters to Try

# Pagination
?page=1
?offset=0&limit=20
?cursor=abc

# Sorting
?sort=price
?sort=price_asc
?order_by=name

# Filtering
?category=electronics
?price_min=10&price_max=100
?in_stock=true

# Search
?q=laptop
?search=laptop
?query=laptop

# Format
?format=json
?output=json

Testing Parameters

def start_requests(self):
    base_url = 'https://api.example.com/products'

    # Try different parameters
    for page in range(1, 11):
        url = f'{base_url}?page={page}&limit=50&sort=price'
        yield scrapy.Request(url, callback=self.parse)

When APIs Don't Exist

If you can't find an API:

Option 1: Use Scrapy-Playwright (render JavaScript)
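
A minimal scrapy-playwright sketch (pip install scrapy-playwright), with the download-handler wiring from its README shown in comments; the URL is a placeholder:

# In settings.py:
# DOWNLOAD_HANDLERS = {
#     'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
#     'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
# }
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

def start_requests(self):
    yield scrapy.Request(
        'https://example.com/products',  # placeholder JS-heavy page
        meta={'playwright': True},       # render in a real browser first
        callback=self.parse,
    )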

Option 2: Look harder

  • Sometimes APIs are there but hidden
  • Check mobile app traffic (apps often use APIs; see the sketch after this list)
  • Look at older versions of the site
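
One common way to inspect mobile app traffic is an intercepting proxy such as mitmproxy. A sketch of the setup (the port is arbitrary):

mitmproxy --listen-port 8080
# Point the phone's Wi-Fi proxy at this machine's IP on port 8080,
# install the mitmproxy CA certificate on the device, then use the app
# and watch its API requests appear.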

Option 3: Scrape HTML

  • Last resort
  • Slower but works

Complete Real-World Example

Let's scrape a product API:

import scrapy
import json
from urllib.parse import urlencode

class ProductApiSpider(scrapy.Spider):
    name = 'product_api'

    # API base URL (found in Network tab)
    api_base = 'https://api.example.com/v2/products'

    # Headers (copied from Network tab)
    headers = {
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
        'User-Agent': 'Mozilla/5.0...',
        'Accept': 'application/json'
    }

    def start_requests(self):
        # Start with page 1
        params = {
            'page': 1,
            'limit': 50,
            'category': 'electronics',
            'sort': 'popularity'
        }

        url = f'{self.api_base}?{urlencode(params)}'
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Parse JSON response
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            self.logger.error(f'Invalid JSON from {response.url}')
            return

        # Extract products
        for product in data.get('products', []):
            yield {
                'id': product.get('id'),
                'name': product.get('name'),
                'price': product.get('price'),
                'currency': product.get('currency'),
                'stock': product.get('in_stock'),
                'rating': product.get('rating'),
                'reviews': product.get('review_count'),
                'url': product.get('product_url')
            }

        # Pagination
        current_page = data.get('current_page', 1)
        total_pages = data.get('total_pages', 1)

        if current_page < total_pages:
            next_page = current_page + 1

            params = {
                'page': next_page,
                'limit': 50,
                'category': 'electronics',
                'sort': 'popularity'
            }

            next_url = f'{self.api_base}?{urlencode(params)}'
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
        else:
            self.logger.info(f'Finished scraping {total_pages} pages')

Quick Checklist

Finding APIs:

  • [ ] Open DevTools (F12)
  • [ ] Click Network tab
  • [ ] Filter by XHR/Fetch
  • [ ] Refresh page
  • [ ] Click on requests with JSON responses
  • [ ] Note URL, method, headers, payload

Testing APIs:

  • [ ] Copy request URL
  • [ ] Test in Scrapy shell
  • [ ] Check authentication requirements
  • [ ] Test pagination
  • [ ] Test different parameters

Building Spider:

  • [ ] Start with one page
  • [ ] Parse JSON response
  • [ ] Add pagination
  • [ ] Add authentication if needed
  • [ ] Respect rate limits

Summary

Why find APIs:

  • 10-50x faster than HTML scraping
  • Clean JSON data
  • More stable/reliable
  • No JavaScript rendering needed

How to find them:

  • Network tab → XHR/Fetch filter
  • Look for JSON responses
  • Note URL, headers, payload

Common patterns:

  • GET with URL parameters
  • POST with JSON body
  • Authentication via headers or cookies
  • Pagination via page/offset/cursor

Best practices:

  • Test API in Scrapy shell first
  • Copy exact headers from browser
  • Respect rate limits
  • Handle errors gracefully

Remember:

  • Always try to find the API first
  • APIs > Playwright > Selenium > HTML scraping
  • 10 minutes finding an API saves hours of scraping

Start by opening the Network tab on any site you want to scrape. You'll be surprised how many sites load their data through APIs!

Happy scraping! 🕷️
