So you've mastered scraping a single page. But what about scraping an entire blog or news site with dozens, or even hundreds, of pages? The moment you need to click "Next," the complexity skyrockets.
This is where most web scraping projects get messy. You start writing custom logic to find and follow pagination links, creating a fragile system that breaks the moment a website's layout changes.
What if you could bypass that entire headache? In this guide, we'll build a robust script that scrapes every article from a blog and saves it to a CSV, all by leveraging an AI-powered feature that handles the hard parts for you.
We're going to use the AutoExtract capabilities of the Zyte API. It returns JSON with exactly the information we need, with no messing around.
You'll need an API key to get started. Head over here to sign up, and you'll get generous free credits to try this and the rest of our Web Scraping API.
Getting Your Script Ready
First, the essentials. We'll use the requests library to communicate with the Zyte API, os to securely load our API key, and csv to save our structured data.
Remember, the golden rule of credentials is never hardcode your API key. Storing it as an environment variable is the professional standard for keeping your keys safe and your code portable.
import os
import requests
import csv
APIKEY = os.getenv("ZYTE_API_KEY")
if APIKEY is None:
    raise Exception("No API key found. Please set the ZYTE_API_KEY environment variable.")
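If you'd rather keep the key in a local .env file during development, the optional python-dotenv package can load it into the environment before the check above runs. A minimal sketch, assuming you've installed python-dotenv (pip install python-dotenv) and created a .env file with a ZYTE_API_KEY entry; these two lines would sit just before the os.getenv call:

# Optional: load a local .env file before the APIKEY lookup above.
# Assumes `pip install python-dotenv` and a .env file containing a ZYTE_API_KEY=... line.
from dotenv import load_dotenv

load_dotenv()  # by default this does not overwrite variables already set in the shell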
With our environment secure, we can focus on the scraping logic.
Using articleNavigation
Here’s where we replace lines and lines of tedious code with a single parameter. We'll create a function that makes one smart request to the Zyte API. Instead of just asking for raw HTML, we set articleNavigation to True.
This single instruction tells the API's machine learning model to perform a series of complex tasks automatically:
- Render the page in a real browser to handle any JavaScript-loaded content.
- Identify the main list of articles on the page.
- Extract key details for each article (URL, headline, date, etc.) into a clean structure.
- Locate the "Next Page" link to enable seamless pagination.
def request_list(url):
    """
    Sends a request to the Zyte API to extract article navigation data.
    """
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(APIKEY, ""),
        json={
            "url": url,
            "articleNavigation": True,
            # This is crucial for sites that load content with JavaScript
            "articleNavigationOptions": {"extractFrom": "browserHtml"},
        },
    )
    return api_response
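Before wiring up the full crawl, it's worth calling the function once and peeking at what comes back. A quick sketch, assuming the request succeeds and each item carries the url and headline fields described above:

response = request_list("https://zyte.com/learn")
data = response.json()["articleNavigation"]
print(f"Found {len(data['items'])} articles on this page")
for item in data["items"][:3]:
    # url and headline are the fields described above; .get() guards against missing ones
    print(item.get("headline"), "->", item.get("url"))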
Why This Crushes Manual Parsing
Let's be clear about what this one parameter replaces. Without it, you'd be stuck doing this the hard way:
The Manual Approach:
- Fetch the page's HTML using a library like requests.
- Realise the content is loaded by JavaScript. Now you need to bring in a heavy tool like Selenium or Playwright to control a browser instance.
- Open your browser's developer tools and painstakingly inspect the HTML to find the right CSS selectors or XPath for the article list (e.g., soup.find_all('div', class_='blog-post-item')).
- Write more selectors to extract the headline, URL, and date from within each list item.
- Hunt down the selector for the "Next Page" button (e.g., soup.find('a', {'aria-label': 'Next'})).
- Write logic to handle cases where the button might be disabled or absent on the last page.
- Repeat this entire process for every website you want to scrape.
This manual process is not only time-consuming but incredibly brittle. The moment a developer changes a class name from blog-post-item to post-preview, your scraper breaks. You become a full-time maintenance engineer, constantly fixing broken selectors.
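To make the contrast concrete, here is a rough sketch of that selector-driven approach with requests and BeautifulSoup. The class name and the "Next" link selector are the illustrative examples from the list above; every site needs its own variants, and each one is a future breakage point:

import requests
from bs4 import BeautifulSoup  # assumes `pip install beautifulsoup4`

def scrape_page_manually(url):
    # Illustrative only: selectors like 'blog-post-item' are site-specific guesses
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    articles = []
    for post in soup.find_all("div", class_="blog-post-item"):
        link = post.find("a")
        articles.append({
            "headline": link.get_text(strip=True) if link else None,
            "url": link["href"] if link else None,
        })
    # Pagination: hope the 'Next' link keeps this exact aria-label forever
    next_link = soup.find("a", {"aria-label": "Next"})
    next_url = next_link["href"] if next_link else None
    return articles, next_url

And this still assumes the HTML arrives fully rendered; on a JavaScript-heavy site you'd have to swap requests for a headless browser before any of it works.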
The articleNavigation feature, powered by AI, understands page structure contextually. It's not looking for a specific class name; it's looking for what looks like a list of articles and a pagination link, making it vastly more resilient to minor website updates.
The Loop: Crawling from Page to Page
With our smart request function ready, we just need a loop to keep it going. A while loop is the perfect tool for the job.
We give it a starting URL and let it run. In each iteration, it calls our function, adds the extracted articles to a master list, and then looks for the nextPage URL in the API response. That URL becomes the target for the next iteration.
The try...except block is an elegant and robust way to stop the process. When the API determines there are no more pages, the nextPage key will be missing from its response. This causes a KeyError, which we catch to cleanly exit the loop. No more complex logic to check for disabled or missing buttons!
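For reference, the slice of the response this loop reads has roughly the following shape (values are illustrative and shortened; on the last page the nextPage key is simply absent):

# Approximate shape of resp.json() -- illustrative values only
{
    "articleNavigation": {
        "items": [
            {"url": "https://zyte.com/learn/some-article/", "headline": "Some article title"},
            # ...one entry per article found on the page
        ],
        "nextPage": {"url": "https://zyte.com/learn/page/2/"},  # absent on the last page
    }
}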
def main():
    articles = []
    nextPage = "https://zyte.com/learn"  # Our starting point
    while True:
        print(f"Scraping page: {nextPage}")
        resp = request_list(nextPage)
        # Add the found articles to our list
        for item in resp.json()["articleNavigation"]["items"]:
            articles.append(item)
        # Try to find the next page; if not found, we're done!
        try:
            nextPage = resp.json()["articleNavigation"]["nextPage"]["url"]
        except KeyError:
            print("Last page reached. Breaking loop.")
            break
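While you're testing, it can be handy to cap how many pages the loop visits and pause briefly between requests, so a trial run doesn't walk the entire site. A small optional variant, built around a hypothetical crawl_with_limit helper:

import time  # used for the optional pause below

def crawl_with_limit(start_url, max_pages=5):
    """Hypothetical variant of main() that stops after max_pages pages."""
    articles = []
    nextPage = start_url
    for _ in range(max_pages):
        resp = request_list(nextPage)
        data = resp.json()["articleNavigation"]
        articles.extend(data.get("items", []))
        time.sleep(1)  # optional: pause briefly between requests while testing
        if "nextPage" not in data:
            print("Last page reached.")
            break
        nextPage = data["nextPage"]["url"]
    return articles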
Saving Your Data to CSV
After the loop completes, we have a clean list of dictionaries, with each dictionary representing an article. The final step is saving this valuable data. Python's built-in csv library is perfect for this.
The DictWriter is especially useful because it automatically uses the dictionary keys (like headline and url from the API response) as the column headers in your CSV file. This ensures your output is always well-structured and ready for analysis.
def save_to_csv(articles):
    """
    Saves a list of article dictionaries to a CSV file.
    """
    keys = articles[0].keys()  # Get headers from the first article
    with open('articles.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(articles)
    print(f"\nSuccessfully saved {len(articles)} articles to articles.csv!")
And that's it. You've built a powerful, resilient, and scalable scraper that handles one of the most tedious tasks in web scraping automatically. You've saved hours of development time and future-proofed your code against trivial website changes.
Complete Code
Here is the full, commented script. Grab it, set your API key, and start pulling data the smart way.
import os
import requests
import csv

# Load API key from environment variables for security
APIKEY = os.getenv("ZYTE_API_KEY")
if APIKEY is None:
    raise Exception("No API key found. Please set the ZYTE_API_KEY environment variable.")


def request_list(url):
    """
    Sends a request to the Zyte API to extract article navigation data.
    This one function replaces manual parsing and pagination logic.
    """
    print(f"Requesting data for: {url}")
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(APIKEY, ""),
        json={
            "url": url,
            "articleNavigation": True,
            # Ensure JS-rendered content is seen by the AI extractor
            "articleNavigationOptions": {"extractFrom": "browserHtml"},
        },
    )
    api_response.raise_for_status()  # Raise an exception for bad status codes
    return api_response


def save_to_csv(articles):
    """
    Saves a list of article dictionaries to a CSV file.
    """
    if not articles:
        print("No articles to save.")
        return
    # Use the keys from the first article as the CSV headers
    keys = articles[0].keys()
    with open('articles.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(articles)
    print(f"\nSuccessfully saved {len(articles)} articles to articles.csv!")


def main():
    """
    Main function to orchestrate the scraping and saving process.
    """
    articles = []
    # The first page of the blog we want to scrape
    nextPage = "https://zyte.com/learn"
    while True:
        resp = request_list(nextPage)
        json_response = resp.json()
        # Add the articles found on the current page to our master list
        found_items = json_response.get("articleNavigation", {}).get("items", [])
        if found_items:
            articles.extend(found_items)
        # Check for the next page URL. If it doesn't exist, break the loop.
        # This is far more reliable than checking for a disabled button selector.
        try:
            nextPage = json_response["articleNavigation"]["nextPage"]["url"]
        except (KeyError, TypeError):
            print("Last page reached. Scraping complete.")
            break
    # Save all the collected articles to a CSV file
    save_to_csv(articles)


if __name__ == "__main__":
    main()