Oxylabs

Posted on Oct 28

Perplexity Web Scraper

#webdev #programming #ai #datascience

Introduction

Web scraping has come a long way from simple HTML parsing. Today’s websites are dynamic, JavaScript-heavy, and often protected by anti-bot mechanisms, making traditional scraping with tools like requests and BeautifulSoup unreliable. This growing complexity has pushed developers to look for smarter ways to extract data efficiently.
That’s where Perplexity AI steps in. Instead of writing dozens of brittle parsing rules, developers can now use Perplexity web scraping to interpret raw HTML or text through natural language prompts and get structured data back.
This in-depth guide explores how Perplexity AI can fit into a web scraping workflow and how it compares with traditional web scraping methods. It’ll demo AI-driven data extraction using a simple Python script. Moreover, we’ll also discuss when it makes sense to scale with solutions like Oxylabs Web Scraper API for more complex, protected, or large-scale use cases.
Let’s dive in!

What is Perplexity AI, and why is it relevant to scraping

Perplexity AI is an LLM-powered research and reasoning engine built to answer complex questions, summarize content, and interpret information with accuracy and context. Unlike a typical search engine, it combines natural language understanding with real-time web data access, allowing it to process long pieces of text, interpret meaning, and produce concise, structured summaries that are easy to work with.

For developers, Perplexity offers more than just conversational capabilities – it can function as a more powerful post-scraping parser. Instead of manually reviewing HTML or dealing with tangled DOM structures, you can feed Perplexity the raw text from a webpage and instruct it to extract only the relevant elements, such as product names, pricing details, or contact information.
This selector-free approach makes AI web scraping a more flexible tool for turning raw, unstructured web data into clean, structured outputs that can be used directly in databases or analytics pipelines.

Instead of reading HTML as a fixed structure full of tags, Perplexity looks at it the way humans do – as language. This makes it easier for developers to simplify their scraping process, especially when dealing with websites that use JavaScript or have layouts that change often.

In short, Perplexity AI isn’t a scraper by itself. It works as an intelligent layer that helps you understand the data you’ve already scraped. It can turn messy, unorganized HTML into clean, structured information that’s ready to store, analyze, or use in other applications.

Now that we understand how Perplexity interprets web content, let’s see how it can provide an edge over traditional web scraping methods.

Traditional vs. AI web scraping in Python

Traditionally, developers use a combo of an HTTP request library and a data parsing library. HTTP request libraries, such as requests, are used to fetch the raw HTML from the target page. Libraries like BeautifulSoup and frameworks like Scrapy enable efficient extraction and parsing of raw HTML content by providing structured access and navigation through the Document Object Model (DOM).

To illustrate this traditional, selector-dependent process, here is a breakdown of the steps required to extract data using libraries like requests and BeautifulSoup.

This approach works well for static pages, but it often breaks when elements change, when pages use JavaScript, or when layouts differ slightly across sections. We can look at a code example of traditional web scraping.

# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup




# Step 1: Fetch the webpage
url = "https://sandbox.oxylabs.io/products"
response = requests.get(url)


# Step 2: Parse the HTML using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")


# Step 3: Extract product titles using CSS selectors
titles = [item.text for item in soup.select(".title")]
print(titles)

An AI-assisted workflow, on the other hand, adds a reasoning layer to the process. You still use requests (or a headless browser) to get the raw HTML, but instead of parsing it manually, you feed the content to Perplexity AI and describe what data you need in plain language.

The model interprets the text and returns structured results – no brittle CSS selectors or XPath traversal required. While the traditional approach works for static HTML, it struggles with JavaScript-heavy or frequently changing layouts.

In the following illustration, a simple natural-language prompt to Preplexity AI replaces brittle DOM navigation, turning raw HTML into structured data.

The following code simulates how Preplexity AI-enabled Web scraping in Python works.

Note: This is just a sample skeleton code to outline the steps involved in AI web scraping; the actual implementation with actual API integration will be covered in the next section.

# Use the same HTML content fetched earlier
html_content = response.text


# Step 1: Create a natural language prompt for Perplexity
prompt = f"""
You are a structured data extractor. From this HTML, extract all products listed on the page.
For each product, return JSON with:
- name
- category
- price (if available)
HTML:
\"\"\"{html_content[:2000]}\"\"\"
"""


# Step 2: Define a simulated Perplexity call
def call_perplexity(prompt_text: str) -> str:
    # Simulated AI response for a game store
    simulated_json_response = '''
    {
        "products": [
            {"name": "Speed Racer Game", "category": "Racing", "price": "$29.99"},
            {"name": "Puzzle Quest", "category": "Puzzle", "price": "$19.99"},
            {"name": "Adventure Island", "category": "Adventure", "price": "$24.99"}
        ]
    }
    '''
    return simulated_json_response


# Step 3: Send prompt and parse structured result
import json
result = call_perplexity(prompt)
data = json.loads(result)

for product in data["products"]:
    print(product["name"], "-", product["category"], "-", product["price"])

Fetch the HTML content of the page using requests and store it in html_content.
Create a natural language prompt asking Perplexity to extract all product details (name, category, price) from the HTML.
Define a function call_perplexity() that simulates sending the prompt to Perplexity and returns a structured JSON response.
Use json.loads() to convert the JSON string into a Python dictionary.
Loop through the products in the dictionary and print their name, category, and price.
This approach allows the extraction of structured data without manually navigating the HTML or writing fragile selectors.

As we can rely on Perplexity AI to return the required results in our preferred format, we no longer need to provide any CSS selectors for the fields of interest in the HTML content.

That is what makes all the difference; even if the field selectors change with webpage updates, the Preplexity web scraping workflow can intelligently adapt to them. Most interestingly, we don’t need to make any changes to our original query.

In other words, this makes AI web scraping in Python far more flexible. You can skip DOM traversal entirely, handle variations in layout gracefully, and focus more on what to extract rather than how.
We’ve seen how AI-assisted scraping changes the workflow conceptually. The table comprehensively outlines the key differences between traditional and AI web scraping workflows:

Here is an illustrative depiction of the key differences:

Now, let’s put it all together in a practical step-by-step demo, using the real Perplexity API in Python.

Using Perplexity AI for web scraping – step-by-step tutorial

Assume our target for the Perplexity web scraping is the Oxylabs Scraping Sandbox website – a demo site that lists various products with names, prices, and other details.
Here is what a listing of video game category products looks like on this website:

Normally, scraping such a page requires carefully targeting HTML elements, handling different CSS classes, and managing page structures that might change over time.

But with Perplexity AI, things get much simpler. Instead of manually parsing the HTML, you can feed it the raw page content and simply ask it to extract structured data, such as all product names, categories, and prices. The AI then returns a neatly formatted JSON response, saving you from the hassle of traditional parsing logic.

Step 1 – Install and set up dependencies

Ensure you have Python 3.8+ installed on your system along with the following libraries:

pip install perplexityai requests beautifulsoup4

The requests package allows for fetching raw HTML content from the target page.
The BeautifulSoup library helps in parsing and cleaning the contents
Perplexity AI, as the name suggests, allows for communication with the Preplexity API

Step 2 – Add your Perplexity API key

To authenticate requests, you must have a Perplexity API key.
Haven’t got one yet? Follow these steps to create one:

Sign in to your Perplexity account.
Go to the Developer or API section in the dashboard.
Click Create New API Key.
Give your key a name or label for reference (e.g., “Web Scraper Project”).
Copy the generated key and keep it secure – this is what your code will use to authenticate requests.
For production projects, consider rotating or storing the key securely rather than hardcoding it in scripts.

Note: You can also follow this quick start guide to create and learn the basics of the Perplexity AI API.
Once you have the API key, you can pass it in your code to initialize the Perplexity client and start making requests. For this example, we’ll include it directly in the code for simplicity (⚠️ not recommended for production):

API_KEY = "pplx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

💡 In real projects, always store your API key in environment variables for security.

Step 3 – Crawl the target webpage

Let’s fetch the page content we want to extract data from – in this case, a sample product listing page.

url = "https://sandbox.oxylabs.io/products"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
html_content = resp.text

This sends an HTTP GET request and retrieves the HTML content. Once we have the page content, the next step is to clean it up for better model readability.

Step 4 – Clean and prepare the text


[soup = BeautifulSoup(html_content, "html.parser")
for script in soup(["script", "style"]):
    script.decompose()
clean_text = soup.get_text(separator="\n", strip=True)

# Trim text to a safe length for the model input (adjust if needed)
INPUT_SNIPPET = clean_text[:4000]

Step 5 – Initialize the Perplexity client

Now we set up the Perplexity SDK using your API key.


client = Perplexity(api_key=API_KEY)

Step 6 – Define the extraction prompt and schema

From the list of video games, we need to scrape their titles, categories, and prices. The following screenshot shows the placement of elements we need to extract.

We’ll instruct the model, in plain English, to extract structured product data and to strictly return JSON responses.

We also define a JSON schema to make sure the model always generates a predictable structure:

messages = [
    {
        "role": "system",
        "content": "You are a structured data extractor. Return only valid JSON that matches the provided schema."
    },
    {
        "role": "user",
        "content": (
            "Extract all product entries from the following page text. "
            "For each product return 'name', 'category', and 'price'. "
            "If a field is not present, use an empty string. "
            "Return only JSON that matches the schema."
            f"\n\nPage text:\n\n{INPUT_SNIPPET}"
        )
    }
]


response_format = {
    "type": "json_schema",
    "json_schema": {
        "schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "category": {"type": "string"},
                            "price": {"type": "string"}
                        },
                        "required": ["name", "category", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }
}

role: "system"
This sets the behavior or context for the AI.
In your example, it tells the model: “You are a structured data extractor. Return only valid JSON that matches the provided schema.”
Think of it as giving the AI its instructions or personality before it sees any user input.

-role: "user"

This is the actual request from you – what you want the AI to do.
Here, it contains the prompt with the HTML/text and specifies the data you want extracted (name, category, price).
It’s basically saying: “Here’s the page content, please extract the data in the format I asked for.”

Why both are needed:

The system role ensures the AI knows the rules (e.g., return JSON, follow a schema).
The user role provides the task-specific input (page text, instructions).

Using both together helps the AI produce structured and predictable output, especially for tasks like web scraping, where formatting matters. With the prompt and schema ready, it’s time to send the request to Perplexity.

Step 7 – Send the request to Perplexity

At this point, you issue the request to Perplexity’s chat completions endpoint. You’ll specify:

Which model to use (e.g., "sonar" or "sonar-pro")
The messages containing your prompt
The response_format to enforce JSON schema
A limit for max_tokens
Here’s the call-in code:

completion = client.chat.completions.create(
    messages=messages,
    model="sonar",     # or "sonar-pro" if your plan supports it
    response_format=response_format,
    max_tokens=1500
)

Before we move on to parsing the results, let’s understand which Perplexity model fits best for this task and why.

Which model and why?

We used sonar or sonar-pro because Perplexity built them for accurate data extraction and web content understanding.
These models stay closer to the source text, minimizing the hallucinations.
sonar-pro provides better reasoning and accuracy but may cost more or need a higher-tier plan.

As with most AI APIs, model choice also affects cost – so it’s important to understand how Perplexity pricing works.

About pricing

Perplexity charges are typically based on tokens consumed (input + output tokens).
Models like Sonar-Pro typically incur higher costs per token compared to the base Sonar model, due to their enhanced accuracy and increased compute requirements.
Because structured extraction often involves long inputs (HTML snippets) and lengthy outputs (detailed JSON), costs can add up.
To minimize cost, you can:

1.Trim input (INPUT_SNIPPET) to only relevant parts
2.Limit max_tokens to what’s genuinely needed
3.Use the lighter sonar model when high precision is not critical
4.Profile and monitor token usage over sample runs

Step 8 – Parse and handle the structured JSON response

We’ll parse the AI’s JSON output safely, whether it’s returned as a dict or as a raw JSON string.

raw_content = completion.choices[0].message.content


# The SDK may return a dict or a JSON string. Handle both cases:
if isinstance(raw_content, str):
    try:
        parsed = json.loads(raw_content)
    except json.JSONDecodeError:
        # Attempt to find a JSON substring
        import re
        m = re.search(r"(\{[\s\S]*\})", raw_content)
        if m:
            parsed = json.loads(m.group(1))
        else:
            print("Failed to parse JSON from model response.")
            print("Raw response:", raw_content)
            sys.exit(1)
else:
    # Already a dict-like structure
    parsed = raw_content


products_data = parsed.get("products", [])

This part of the code safely extracts and parses the model’s response:

raw_content gets the text returned by the model.
It checks if the response is a string – if so, it tries to convert it into JSON usingjson.loads().
If that fails, it uses a regex to find and extract a JSON-like part from the text.
If parsing still fails, it prints an error and stops the program.
If the response is already a dictionary, it skips parsing.
Finally, it extracts the products list from the parsed JSON.

Step 9 – Display and export the results

Finally, we’ll print the extracted product data and save it as a CSV file for later use.

for p in products_data:
    print(p.get("name", ""), "-", p.get("category", ""), "-", p.get("price", ""))


# Save to CSV (if products found)
if products_data:
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "category", "price"])
        writer.writeheader()
        writer.writerows(products_data)
    print(f"Saved {len(products_data)} products to products.csv")
else:
    print("No products found in the model output.")

Complete Code Example

Here’s the full working script combining everything above:

from bs4 import BeautifulSoup
from perplexity import Perplexity
import requests
import csv
import json
import sys


# ---------------------------
# WARNING: API key in-file for demo only.
# Rotate/secure it for production use.
# ---------------------------
API_KEY = "pplx-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"


# Step 1: Crawl the page
url = "https://sandbox.oxylabs.io/products"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
html_content = resp.text


# Step 2: Clean the content
soup = BeautifulSoup(html_content, "html.parser")
for script in soup(["script", "style"]):
    script.decompose()
clean_text = soup.get_text(separator="\n", strip=True)


# Trim text to a safe length for the model input (adjust if needed)
INPUT_SNIPPET = clean_text[:4000]


# Step 3: Initialize Perplexity client (using provided API_KEY)
client = Perplexity(api_key=API_KEY)


# Step 4: Prepare messages and json_schema response_format
messages = [
    {
        "role": "system",
        "content": "You are a structured data extractor. Return only valid JSON that matches the provided schema."
    },
    {
        "role": "user",
        "content": (
            "Extract all product entries from the following page text. "
            "For each product return 'name', 'category', and 'price'. "
            "If a field is not present, use an empty string. "
            "Return only JSON that matches the schema."
            f"\n\nPage text:\n\n{INPUT_SNIPPET}"
        )
    }
]


response_format = {
    "type": "json_schema",
    "json_schema": {
        "schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "category": {"type": "string"},
                            "price": {"type": "string"}
                        },
                        "required": ["name", "category", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }
}


# Step 5: Call the chat completions API with messages & response_format
try:
    completion = client.chat.completions.create(
        messages=messages,
        model="sonar",            # or "sonar-pro" if your plan supports it
        response_format=response_format,
        max_tokens=1500
    )
except Exception as e:
    print("API request failed:", str(e))
    sys.exit(1)


# Step 6: Extract the structured content safely
raw_content = completion.choices[0].message.content


# The SDK may return a dict or a JSON string. Handle both cases:
if isinstance(raw_content, str):
    try:
        parsed = json.loads(raw_content)
    except json.JSONDecodeError:
        # Attempt to find a JSON substring
        import re
        m = re.search(r"(\{[\s\S]*\})", raw_content)
        if m:
            parsed = json.loads(m.group(1))
        else:
            print("Failed to parse JSON from model response.")
            print("Raw response:", raw_content)
            sys.exit(1)
else:
    # Already a dict-like structure
    parsed = raw_content


products_data = parsed.get("products", [])


# Step 7: Print results
for p in products_data:
    print(p.get("name", ""), "-", p.get("category", ""), "-", p.get("price", ""))


# Step 8: Save to CSV (if products found)
if products_data:
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "category", "price"])
        writer.writeheader()
        writer.writerows(products_data)
    print(f"Saved {len(products_data)} products to products.csv")
else:
    print("No products found in the model output.")

Here is what the above code outputs:

The Legend of Zelda: Ocarina of Time - Action Adventure, Fantasy - 91,99 €
Super Mario Galaxy - Action, Platformer, 3D - 91,99 €
Super Mario Galaxy 2 - Action, Platformer, 3D - 91,99 €
Metroid Prime - Action, Shooter, First-Person, Sci-Fi - 89,99 €
Super Mario Odyssey - Action, Platformer, 3D - 89,99 €
Halo: Combat Evolved - Action, Shooter, First-Person, Sci-Fi -
Saved 6 products to products.csv

Once you’re comfortable with the workflow, you can experiment with different prompt styles to fine-tune how the model structures its output.

Example prompt variants

Some of the example prompts that can be used for Preplexity are:
-Tabular style:
“Return a list of dictionaries. Each dictionary must contain keys title, price_usd, and stock_status. Use ISO formatting for prices (e.g., 19.99).”

-CSV line format:
“Output a CSV with header row: title, price, sku, availability, and then one line per product.”

-Limited scope prompt (for long pages):
“Parse only the section containing

… from the message content.”

Best practices for using AI in web scraping

When designing AI-assisted scraping prompts, a few best practices can make your results far more reliable and consistent. Since AI models like Perplexity interpret your instructions in natural language, clarity and structure directly impact the accuracy of the extracted data.
Here’s what to keep in mind:

Write clear, specific prompts. Tell the model exactly what to extract and how to format it.
Avoid broad queries. Instead of asking for “all product details,” define fields like name, price, and rating.

-Handle inconsistent outputs. Sometimes AI responses may vary in format or structure. Always include fallback logic to handle missing or malformed data.

-Validate and log everything. Keep a record of both raw and parsed responses. This helps debug issues and ensures reliability over time.

Following these steps helps maintain accuracy and ensures your AI doesn’t drift into producing inconsistent or incomplete outputs.
Here’s a quick example of a clean, focused prompt:

prompt = """
Extract product data from the following HTML.
Return JSON with fields: name, price, and rating.
HTML: <div>...</div>
""”

This kind of structured, limited prompt keeps results cleaner and easier to parse later.

When to use Perplexity vs web scraping APIs

Perplexity is particularly effective for interpreting and structuring content when the page is already accessible and doesn’t block crawlers. It’s ideal for extracting text summaries, pricing data, or FAQ-style information.

However, AI-assisted scraping isn’t suitable for everything.

It can struggle with CAPTCHA or anti-bot systems.
It’s not meant for massive crawls or high-volume data extraction.
Some sites may have legal or ethical restrictions on scraping.

For those cases, a dedicated tool like Oxylabs Web Scraper API is a better fit. It’s built for large-scale, reliable scraping – capable of handling JavaScript-heavy pages, CAPTCHA challenges, and dynamic site structures without manual setup.

Oxylabs also provides residential and datacenter IPs, geolocation targeting, and custom headers, which help simulate real user behavior and access localized content. These features make it ideal for projects where consistency and volume matter.

For developers working with tough anti-bot systems or massive crawl requirements, Oxylabs can take care of the data collection layer, while Perplexity focuses on turning that raw HTML into clean, structured insights. Used together, they create a powerful hybrid workflow – automation at scale with AI-driven understanding.

Real-world examples & community use cases

Developers on Oxylabs blog and DEV.to often share practical examples of AI-assisted scraping. Many combine tools like ChatGPT or Perplexity with scraping APIs to extract and structure e-commerce or review data.

For instance, Oxylabs shows cases where AI helps summarize large product catalogs, categorize listings more efficiently, and extract data from unstructured files. DEV.to contributors also highlight how AI can clean messy HTML and extract structured information from dynamic pages.

Both communities note similar challenges: AI parsing can be inconsistent when prompts aren’t well-defined or when page structure changes frequently.

The shared conclusion: combining traditional scraping (for reliability and scale) with AI interpretation (for structure and insights) delivers the most effective and adaptable results – especially when dealing with complex or unstructured web data.

Conclusion

AI tools like Perplexity don’t replace web scraping, they enhance it. Developers should experiment with prompts, refine instructions, and use fallback logic for missing or inconsistent fields.

Always validate outputs to make sure the data is accurate. When paired with a robust scraping tool that handles dynamic content and anti-bot measures, this approach creates pipelines that are faster, scalable, and easier to maintain.

In short, treat AI as a smart layer on top of scraping, not a replacement. Combining traditional scraping for stability with AI for structure delivers cleaner datasets and more efficient workflows.

DEV Community

Perplexity Web Scraper

Using Perplexity AI for web scraping – step-by-step tutorial

Top comments (0)