So you've mastered scraping a single page. But what about scraping an entire blog or news site with dozens, or even hundreds, of pages? The moment you need to click "Next," the complexity skyrockets.
This is where most web scraping projects get messy. You start writing custom logic to find and follow pagination links, creating a fragile system that breaks the moment a website's layout changes.
What if you could bypass that entire headache? In this guide, we'll build a robust script that scrapes every article from a blog and saves it to a CSV, all by leveraging an AI-powered feature that handles the hard parts for you.
We're going to use the AutoExtract capabilities of the Zyte API. It returns JSON with exactly the information we need, with no messing around.
You'll need an API key to get started. Head over here to sign up, and you'll get generous free credits to try this and the rest of our Web Scraping API.
Getting Your Script Ready
First, the essentials. We'll use the requests library to communicate with the Zyte API, os to securely load our API key, and csv to save our structured data.
Remember, the golden rule of credentials is never hardcode your API key. Storing it as an environment variable is the professional standard for keeping your keys safe and your code portable.
import os
import requests
import csv
APIKEY = os.getenv("ZYTE_API_KEY")
if APIKEY is None:
    raise Exception("No API key found. Please set the ZYTE_API_KEY environment variable.")
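If you'd rather keep the key in a local .env file during development, the optional python-dotenv package can load it into the environment before the check above runs. A minimal sketch, assuming you've installed python-dotenv (pip install python-dotenv) and created a .env file with a ZYTE_API_KEY entry; these two lines would sit just before the os.getenv call:

# Optional: load a local .env file before the APIKEY lookup above.
# Assumes `pip install python-dotenv` and a .env file containing a ZYTE_API_KEY=... line.
from dotenv import load_dotenv

load_dotenv()  # by default this does not overwrite variables already set in the shell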
With our environment secure, we can focus on the scraping logic.
Using articleNavigation
Here’s where we replace lines and lines of tedious code with a single parameter. We'll create a function that makes one smart request to the Zyte API. Instead of just asking for raw HTML, we set articleNavigation to True.
This single instruction tells the API's machine learning model to perform a series of complex tasks automatically:
- Render the page in a real browser to handle any JavaScript-loaded content.
- Identify the main list of articles on the page.
- Extract key details for each article (URL, headline, date, etc.) into a clean structure.
- Locate the "Next Page" link to enable seamless pagination.
def request_list(url):
    """
    Sends a request to the Zyte API to extract article navigation data.
    """
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(APIKEY, ""),
        json={
            "url": url,
            "articleNavigation": True,
            # This is crucial for sites that load content with JavaScript
            "articleNavigationOptions": {"extractFrom": "browserHtml"},
        },
    )
    return api_response
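Before wiring up the full crawl, it's worth calling the function once and peeking at what comes back. A quick sketch, assuming the request succeeds and each item carries the url and headline fields described above:

response = request_list("https://zyte.com/learn")
data = response.json()["articleNavigation"]
print(f"Found {len(data['items'])} articles on this page")
for item in data["items"][:3]:
    # url and headline are the fields described above; .get() guards against missing ones
    print(item.get("headline"), "->", item.get("url"))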
Why This Crushes Manual Parsing
Let's be clear about what this one parameter replaces. Without it, you'd be stuck doing this the hard way:
The Manual Approach:
- Fetch the page's HTML using a library like requests.
- Realise the content is loaded by JavaScript. Now you need to bring in a heavy tool like Selenium or Playwright to control a browser instance.
- Open your browser's developer tools and painstakingly inspect the HTML to find the right CSS selectors or XPath for the article list (e.g., soup.find_all('div', class_='blog-post-item')).
- Write more selectors to extract the headline, URL, and date from within each list item.
- Hunt down the selector for the "Next Page" button (e.g., soup.find('a', {'aria-label': 'Next'})).
- Write logic to handle cases where the button might be disabled or absent on the last page.
- Repeat this entire process for every website you want to scrape.
This manual process is not only time-consuming but incredibly brittle. The moment a developer changes a class name from blog-post-item to post-preview, your scraper breaks. You become a full-time maintenance engineer, constantly fixing broken selectors.
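To make the contrast concrete, here is a rough sketch of that selector-driven approach with requests and BeautifulSoup. The class name and the "Next" link selector are the illustrative examples from the list above; every site needs its own variants, and each one is a future breakage point:

import requests
from bs4 import BeautifulSoup  # assumes `pip install beautifulsoup4`

def scrape_page_manually(url):
    # Illustrative only: selectors like 'blog-post-item' are site-specific guesses
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    articles = []
    for post in soup.find_all("div", class_="blog-post-item"):
        link = post.find("a")
        articles.append({
            "headline": link.get_text(strip=True) if link else None,
            "url": link["href"] if link else None,
        })
    # Pagination: hope the 'Next' link keeps this exact aria-label forever
    next_link = soup.find("a", {"aria-label": "Next"})
    next_url = next_link["href"] if next_link else None
    return articles, next_url

And this still assumes the HTML arrives fully rendered; on a JavaScript-heavy site you'd have to swap requests for a headless browser before any of it works.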
The articleNavigation feature, powered by AI, understands page structure contextually. It's not looking for a specific class name; it's looking for what looks like a list of articles and a pagination link, making it vastly more resilient to minor website updates.
The Loop: Crawling from Page to Page
With our smart request function ready, we just need a loop to keep it going. A while loop is the perfect tool for the job.
We give it a starting URL and let it run. In each iteration, it calls our function, adds the extracted articles to a master list, and then looks for the nextPage URL in the API response. That URL becomes the target for the next iteration.
The try...except block is an elegant and robust way to stop the process. When the API determines there are no more pages, the nextPage key will be missing from its response. This causes a KeyError, which we catch to cleanly exit the loop. No more complex logic to check for disabled or missing buttons!
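For reference, the slice of the response this loop reads has roughly the following shape (values are illustrative and shortened; on the last page the nextPage key is simply absent):

# Approximate shape of resp.json() -- illustrative values only
{
    "articleNavigation": {
        "items": [
            {"url": "https://zyte.com/learn/some-article/", "headline": "Some article title"},
            # ...one entry per article found on the page
        ],
        "nextPage": {"url": "https://zyte.com/learn/page/2/"},  # absent on the last page
    }
}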
def main():
    articles = []
    nextPage = "https://zyte.com/learn"  # Our starting point
    while True:
        print(f"Scraping page: {nextPage}")
        resp = request_list(nextPage)
        # Add the found articles to our list
        for item in resp.json()["articleNavigation"]["items"]:
            articles.append(item)
        # Try to find the next page; if not found, we're done!
        try:
            nextPage = resp.json()["articleNavigation"]["nextPage"]["url"]
        except KeyError:
            print("Last page reached. Breaking loop.")
            break
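While you're testing, it can be handy to cap how many pages the loop visits and pause briefly between requests, so a trial run doesn't walk the entire site. A small optional variant, built around a hypothetical crawl_with_limit helper:

import time  # used for the optional pause below

def crawl_with_limit(start_url, max_pages=5):
    """Hypothetical variant of main() that stops after max_pages pages."""
    articles = []
    nextPage = start_url
    for _ in range(max_pages):
        resp = request_list(nextPage)
        data = resp.json()["articleNavigation"]
        articles.extend(data.get("items", []))
        time.sleep(1)  # optional: pause briefly between requests while testing
        if "nextPage" not in data:
            print("Last page reached.")
            break
        nextPage = data["nextPage"]["url"]
    return articles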
Saving Your Data to CSV
After the loop completes, we have a clean list of dictionaries, with each dictionary representing an article. The final step is saving this valuable data. Python's built-in csv library is perfect for this.
The DictWriter is especially useful because it automatically uses the dictionary keys (like headline and url from the API response) as the column headers in your CSV file. This ensures your output is always well-structured and ready for analysis.
def save_to_csv(articles):
    """
    Saves a list of article dictionaries to a CSV file.
    """
    keys = articles[0].keys()  # Get headers from the first article
    with open('articles.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(articles)
    print(f"\nSuccessfully saved {len(articles)} articles to articles.csv!")
And that's it. You've built a powerful, resilient, and scalable scraper that handles one of the most tedious tasks in web scraping automatically. You've saved hours of development time and future-proofed your code against trivial website changes.
Complete Code
Here is the full, commented script. Grab it, set your API key, and start pulling data the smart way.
import os
import requests
import csv

# Load API key from environment variables for security
APIKEY = os.getenv("ZYTE_API_KEY")
if APIKEY is None:
    raise Exception("No API key found. Please set the ZYTE_API_KEY environment variable.")


def request_list(url):
    """
    Sends a request to the Zyte API to extract article navigation data.
    This one function replaces manual parsing and pagination logic.
    """
    print(f"Requesting data for: {url}")
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(APIKEY, ""),
        json={
            "url": url,
            "articleNavigation": True,
            # Ensure JS-rendered content is seen by the AI extractor
            "articleNavigationOptions": {"extractFrom": "browserHtml"},
        },
    )
    api_response.raise_for_status()  # Raise an exception for bad status codes
    return api_response


def save_to_csv(articles):
    """
    Saves a list of article dictionaries to a CSV file.
    """
    if not articles:
        print("No articles to save.")
        return
    # Use the keys from the first article as the CSV headers
    keys = articles[0].keys()
    with open('articles.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(articles)
    print(f"\nSuccessfully saved {len(articles)} articles to articles.csv!")


def main():
    """
    Main function to orchestrate the scraping and saving process.
    """
    articles = []
    # The first page of the blog we want to scrape
    nextPage = "https://zyte.com/learn"
    while True:
        resp = request_list(nextPage)
        json_response = resp.json()
        # Add the articles found on the current page to our master list
        found_items = json_response.get("articleNavigation", {}).get("items", [])
        if found_items:
            articles.extend(found_items)
        # Check for the next page URL. If it doesn't exist, break the loop.
        # This is far more reliable than checking for a disabled button selector.
        try:
            nextPage = json_response["articleNavigation"]["nextPage"]["url"]
        except (KeyError, TypeError):
            print("Last page reached. Scraping complete.")
            break
    # Save all the collected articles to a CSV file
    save_to_csv(articles)


if __name__ == "__main__":
    main()