The moment I learned to find API endpoints changed everything. I was struggling to scrape a product listing site with Selenium. It took 5 minutes to render one page.
Then I opened the Network tab and found the API. Same data, but as clean JSON. I switched to scraping the API directly.
Results:
- Before: 5 minutes per page, messy HTML parsing
- After: 2 seconds per page, clean JSON data
Finding APIs is the secret weapon of professional scrapers. Let me show you how.
Why APIs Are Better Than Scraping HTML
Scraping HTML:
- Slow (download + parse)
- Brittle (breaks when design changes)
- Messy (nested tags, inconsistent structure)
- Often needs JavaScript rendering (even slower)
Scraping an API:
- Fast (just download JSON)
- Stable (APIs change less than websites)
- Clean (structured JSON data)
- No rendering needed
Speed comparison:
- HTML scraping: 10-20 pages/second
- API scraping: 100-500 pages/second
That's 10-50x faster!
How to Find API Endpoints
Step 1: Open Developer Tools
Chrome/Edge:
- Press F12 or Ctrl+Shift+I
- Click "Network" tab
Firefox:
- Press F12
- Click "Network" tab
Step 2: Filter by XHR/Fetch
Click "XHR" or "Fetch" button in the Network tab. This shows only API requests.
Step 3: Refresh the Page
Press Ctrl+R to reload. Watch requests appear in the Network tab.
Step 4: Look for JSON Responses
Click on requests one by one. Look for:
- URLs containing /api/
- Responses with JSON data
- Requests that contain your target data
Step 5: Inspect the Request
Click on an interesting request → check:
- URL (Request URL at top)
- Method (GET, POST, etc.)
- Headers (Authorization, cookies, etc.)
- Payload (if POST request)
- Response (the JSON data)
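Before writing a spider, it pays to confirm the endpoint works outside the browser. A quick check in the Scrapy shell (the URL here is just a placeholder; use the one you found in the Network tab):
# in a terminal (works even outside a Scrapy project)
scrapy shell "https://api.example.com/v1/products?page=1&limit=20&sort=popular"

# then, at the shell prompt
import json
data = json.loads(response.text)
data.keys()   # confirm the structure matches what you saw in DevTools
If the shell gets a 403 or an auth error, copy the headers from the browser request and retry with them.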
Real Example: Product Listing
Let's say you're scraping products from a store.
What You See in Network Tab
Request URL: https://api.example.com/v1/products?page=1&limit=20&sort=popular
Method: GET
Status: 200
Response:
{
  "products": [
    {
      "id": 123,
      "name": "Widget Pro",
      "price": 29.99,
      "stock": 50
    },
    {
      "id": 124,
      "name": "Gadget Plus",
      "price": 49.99,
      "stock": 30
    }
  ],
  "total": 1523,
  "page": 1,
  "pages": 77
}
Perfect! You found the API.
Your Scrapy Spider
import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        url = 'https://api.example.com/v1/products?page=1&limit=20&sort=popular'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        # Extract products
        for product in data['products']:
            yield {
                'id': product['id'],
                'name': product['name'],
                'price': product['price'],
                'stock': product['stock']
            }

        # Pagination
        current_page = data['page']
        total_pages = data['pages']
        if current_page < total_pages:
            next_page = current_page + 1
            next_url = f'https://api.example.com/v1/products?page={next_page}&limit=20&sort=popular'
            yield scrapy.Request(next_url, callback=self.parse)
Done! Clean, fast, reliable.
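Side note: on Scrapy 2.2 or newer you can skip json.loads entirely and use the built-in JSON helper:
# Scrapy 2.2+ shortcut for JSON responses
data = response.json()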
Finding Hidden APIs (Advanced)
Some APIs aren't obvious. Here's how to find them.
Technique 1: Search for "api" in Network Tab
Type "api" in the filter box. Shows only URLs containing "api".
Technique 2: Look for GraphQL
Many modern sites use GraphQL. Look for:
- URL: https://example.com/graphql
- Method: POST
- Payload contains "query"
Example GraphQL request:
{
  "query": "{ products(limit: 20) { id name price } }"
}
Technique 3: Check WebSocket Connections
Some sites use WebSockets for real-time updates.
In Network tab:
- Filter by "WS" (WebSocket)
- Click on connection
- View messages
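Scrapy itself doesn't speak WebSocket, so if the data only arrives over a socket you'll need a separate client. A minimal sketch using the third-party websockets package (the URL is a placeholder; use the one shown under the WS filter):
import asyncio
import websockets  # pip install websockets

async def read_messages(url):
    # Open the socket and print the first few messages the server pushes
    async with websockets.connect(url) as ws:
        for _ in range(5):
            message = await ws.recv()
            print(message)

asyncio.run(read_messages('wss://example.com/live-updates'))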
Technique 4: Look at Script Tags
Sometimes API URLs are embedded in JavaScript:
import re

def parse(self, response):
    # Look for API URLs in script tags
    scripts = response.css('script::text').getall()
    for script in scripts:
        if 'api.example.com' in script:
            # Extract API URLs from the JavaScript
            urls = re.findall(r'https://api\.example\.com/[^"\']+', script)
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_api)
Handling API Authentication
Many APIs require authentication.
Type 1: API Key in URL
https://api.example.com/products?api_key=abc123def456
How to find it:
- Check request URL in Network tab
- Look for api_key, key, or token parameters
Your spider:
def start_requests(self):
    api_key = 'abc123def456'
    url = f'https://api.example.com/products?api_key={api_key}'
    yield scrapy.Request(url)
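Hard-coding keys is fine for a quick test, but if the spider ends up in version control it's safer to read them from the environment. A small variation (assumes you've exported a hypothetical API_KEY variable in your shell):
import os

def start_requests(self):
    # Hypothetical env var; set it with: export API_KEY=abc123def456
    api_key = os.environ.get('API_KEY', '')
    url = f'https://api.example.com/products?api_key={api_key}'
    yield scrapy.Request(url)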
Type 2: Bearer Token in Headers
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
How to find it:
- Network tab → Click request
- Headers tab → Look for "Authorization"
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'Authorization': 'Bearer YOUR_TOKEN_HERE'
    }
    yield scrapy.Request(url, headers=headers)
Type 3: Session Cookies
Some APIs use cookies for auth.
How to find them:
- Network tab → Click request
- Headers tab → Look for "Cookie"
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    cookies = {
        'session_id': 'abc123',
        'user_token': 'xyz789'
    }
    yield scrapy.Request(url, cookies=cookies)
Type 4: Custom Headers
X-Api-Key: abc123
X-Client-Id: def456
Your spider:
def start_requests(self):
    url = 'https://api.example.com/products'
    headers = {
        'X-Api-Key': 'abc123',
        'X-Client-Id': 'def456'
    }
    yield scrapy.Request(url, headers=headers)
Handling POST Requests
Some APIs use POST instead of GET.
Finding POST Data
Network tab:
- Click POST request
- "Payload" tab
- See the data sent
Example:
{
  "filters": {
    "category": "electronics",
    "price_max": 1000
  },
  "page": 1,
  "limit": 20
}
Your Spider
import scrapy
import json

class PostSpider(scrapy.Spider):
    name = 'post'

    def start_requests(self):
        url = 'https://api.example.com/search'
        payload = {
            'filters': {
                'category': 'electronics',
                'price_max': 1000
            },
            'page': 1,
            'limit': 20
        }
        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)
        for item in data['results']:
            yield item
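Recent Scrapy versions also ship scrapy.http.JsonRequest, which serializes the payload and sets the Content-Type header for you. The same request, a bit more compact (same assumed endpoint and payload as above):
from scrapy.http import JsonRequest

def start_requests(self):
    payload = {
        'filters': {'category': 'electronics', 'price_max': 1000},
        'page': 1,
        'limit': 20
    }
    # JsonRequest JSON-encodes `data` and sets Content-Type: application/json
    yield JsonRequest(
        'https://api.example.com/search',
        data=payload,
        method='POST',
        callback=self.parse
    )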
Handling Pagination in APIs
APIs have different pagination styles.
Style 1: Page Numbers
/products?page=1
/products?page=2
/products?page=3
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next page
    current_page = int(response.url.split('page=')[1])
    if data['has_next']:
        next_page = current_page + 1
        next_url = f'https://api.example.com/products?page={next_page}'
        yield scrapy.Request(next_url, callback=self.parse)
Style 2: Offset/Limit
/products?offset=0&limit=20
/products?offset=20&limit=20
/products?offset=40&limit=20
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next offset
    total = data['total']
    offset = int(response.url.split('offset=')[1].split('&')[0])
    limit = 20
    if offset + limit < total:
        next_offset = offset + limit
        next_url = f'https://api.example.com/products?offset={next_offset}&limit={limit}'
        yield scrapy.Request(next_url, callback=self.parse)
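Splitting the URL string works for this exact format, but it breaks if the parameter order ever changes. A more robust sketch that reads the query string with the standard library (same assumed endpoint and response fields as above):
import json
from urllib.parse import urlparse, parse_qs

def parse(self, response):
    data = json.loads(response.text)
    for item in data['items']:
        yield item

    # Read offset/limit from the current URL instead of string-splitting
    query = parse_qs(urlparse(response.url).query)
    offset = int(query.get('offset', ['0'])[0])
    limit = int(query.get('limit', ['20'])[0])

    if offset + limit < data['total']:
        next_url = f'https://api.example.com/products?offset={offset + limit}&limit={limit}'
        yield scrapy.Request(next_url, callback=self.parse)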
Style 3: Cursor-Based
/products?cursor=abc123
/products?cursor=def456
Spider:
def parse(self, response):
    data = json.loads(response.text)

    for item in data['items']:
        yield item

    # Next cursor (use .get so the last page doesn't raise a KeyError)
    if data.get('next_cursor'):
        next_url = f"https://api.example.com/products?cursor={data['next_cursor']}"
        yield scrapy.Request(next_url, callback=self.parse)
GraphQL APIs
GraphQL is a modern API query language.
Finding GraphQL Endpoints
Look for:
- URL: /graphql
- Method: POST
- Content-Type: application/json
- Body contains "query"
Example GraphQL Query
{
  "query": "query { products(limit: 20) { id name price description } }"
}
Scrapy Spider for GraphQL
import scrapy
import json

class GraphQLSpider(scrapy.Spider):
    name = 'graphql'

    def start_requests(self):
        url = 'https://example.com/graphql'
        query = '''
        query {
            products(limit: 20, offset: 0) {
                id
                name
                price
                description
            }
        }
        '''
        payload = {'query': query}
        yield scrapy.Request(
            url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )

    def parse(self, response):
        data = json.loads(response.text)
        for product in data['data']['products']:
            yield product
GraphQL Pagination
def start_requests(self):
    for offset in range(0, 1000, 20):  # 0, 20, 40, ...
        query = f'''
        query {{
            products(limit: 20, offset: {offset}) {{
                id
                name
                price
            }}
        }}
        '''
        payload = {'query': query}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )
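Interpolating values straight into the query string works, but GraphQL also has a variables mechanism that avoids quoting and escaping problems. A drop-in variant of the start_requests above (same assumed endpoint and fields; the Int types are an assumption about the schema):
def start_requests(self):
    query = '''
    query Products($limit: Int!, $offset: Int!) {
        products(limit: $limit, offset: $offset) {
            id
            name
            price
        }
    }
    '''
    for offset in range(0, 1000, 20):
        # Pass the values as GraphQL variables instead of f-string interpolation
        payload = {'query': query, 'variables': {'limit': 20, 'offset': offset}}
        yield scrapy.Request(
            'https://example.com/graphql',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse
        )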
Rate Limiting with APIs
APIs often have rate limits.
Detecting Rate Limits
Signs:
- 429 status code (Too Many Requests)
- Error message about rate limiting
- Header: X-RateLimit-Remaining: 0
Handling Rate Limits
# settings.py
# Slow down
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS = 4
# Auto throttle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
Respecting Rate Limit Headers
def parse(self, response):
    # Check rate limit headers
    remaining = response.headers.get('X-RateLimit-Remaining')
    if remaining and int(remaining) < 10:
        self.logger.warning('Approaching rate limit, slowing down')
        # Slow down or pause

    # Continue parsing
    data = json.loads(response.text)
    for item in data:
        yield item
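If you still receive 429s, you can also let Scrapy's built-in retry middleware re-schedule those responses instead of handling them in the spider; combined with AutoThrottle, the retries get spaced out automatically:
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]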
Reverse Engineering API Parameters
Sometimes API URLs have cryptic parameters.
Common Parameters to Try
# Pagination
?page=1
?offset=0&limit=20
?cursor=abc
# Sorting
?sort=price
?sort=price_asc
?order_by=name
# Filtering
?category=electronics
?price_min=10&price_max=100
?in_stock=true
# Search
?q=laptop
?search=laptop
?query=laptop
# Format
?format=json
?output=json
Testing Parameters
def start_requests(self):
    base_url = 'https://api.example.com/products'

    # Try different parameters
    for page in range(1, 11):
        url = f'{base_url}?page={page}&limit=50&sort=price'
        yield scrapy.Request(url, callback=self.parse)
When APIs Don't Exist
If you can't find an API:
Option 1: Use Scrapy-Playwright to render JavaScript (a minimal setup sketch follows these options)
Option 2: Look harder
- Sometimes APIs are there but hidden
- Check mobile app traffic (apps often use APIs)
- Look at older versions of the site
Option 3: Scrape HTML
- Last resort
- Slower but works
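If you go the Scrapy-Playwright route, the setup is roughly this, a minimal sketch based on the scrapy-playwright README (check its docs for the current settings):
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# in your spider: ask for a rendered page
def start_requests(self):
    yield scrapy.Request(
        'https://example.com/products',
        meta={'playwright': True},
        callback=self.parse
    )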
Complete Real-World Example
Let's scrape a product API:
import scrapy
import json
from urllib.parse import urlencode

class ProductApiSpider(scrapy.Spider):
    name = 'product_api'

    # API base URL (found in Network tab)
    api_base = 'https://api.example.com/v2/products'

    # Headers (copied from Network tab)
    headers = {
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
        'User-Agent': 'Mozilla/5.0...',
        'Accept': 'application/json'
    }

    def start_requests(self):
        # Start with page 1
        params = {
            'page': 1,
            'limit': 50,
            'category': 'electronics',
            'sort': 'popularity'
        }
        url = f'{self.api_base}?{urlencode(params)}'
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Parse JSON response
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            self.logger.error(f'Invalid JSON from {response.url}')
            return

        # Extract products
        for product in data.get('products', []):
            yield {
                'id': product.get('id'),
                'name': product.get('name'),
                'price': product.get('price'),
                'currency': product.get('currency'),
                'stock': product.get('in_stock'),
                'rating': product.get('rating'),
                'reviews': product.get('review_count'),
                'url': product.get('product_url')
            }

        # Pagination
        current_page = data.get('current_page', 1)
        total_pages = data.get('total_pages', 1)
        if current_page < total_pages:
            next_page = current_page + 1
            params = {
                'page': next_page,
                'limit': 50,
                'category': 'electronics',
                'sort': 'popularity'
            }
            next_url = f'{self.api_base}?{urlencode(params)}'
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
        else:
            self.logger.info(f'Finished scraping {total_pages} pages')
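Assuming the spider sits in a Scrapy project, running it and exporting the items is one command:
scrapy crawl product_api -o products.json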
Quick Checklist
Finding APIs:
- [ ] Open DevTools (F12)
- [ ] Click Network tab
- [ ] Filter by XHR/Fetch
- [ ] Refresh page
- [ ] Click on requests with JSON responses
- [ ] Note URL, method, headers, payload
Testing APIs:
- [ ] Copy request URL
- [ ] Test in Scrapy shell
- [ ] Check authentication requirements
- [ ] Test pagination
- [ ] Test different parameters
Building Spider:
- [ ] Start with one page
- [ ] Parse JSON response
- [ ] Add pagination
- [ ] Add authentication if needed
- [ ] Respect rate limits
Summary
Why find APIs:
- 10-50x faster than HTML scraping
- Clean JSON data
- More stable/reliable
- No JavaScript rendering needed
How to find them:
- Network tab → XHR/Fetch filter
- Look for JSON responses
- Note URL, headers, payload
Common patterns:
- GET with URL parameters
- POST with JSON body
- Authentication via headers or cookies
- Pagination via page/offset/cursor
Best practices:
- Test API in Scrapy shell first
- Copy exact headers from browser
- Respect rate limits
- Handle errors gracefully
Remember:
- Always try to find API first
- APIs > Playwright > Selenium > HTML scraping
- 10 minutes finding API saves hours of scraping
Start by opening Network tab on any site you want to scrape. You'll be surprised how many use APIs!
Happy scraping! 🕷️