A five-star rating tells you a product is popular, but it doesn't explain why. For marketing teams and product managers, the real value is buried in the text of the reviews. This is where you find the Voice of Customer (VoC)—the specific language, pain points, and "aha!" moments that define the customer experience.
If a customer writes that a moisturizer "doesn't pill under makeup," you've found a high-converting headline for an ad campaign. However, manually reading thousands of reviews across Ulta's massive catalog is impossible at scale.
This guide shows how to automate the extraction of Ulta product reviews using Python and Playwright. We will then process that data into a format ready for AI-driven analysis, turning raw HTML into actionable marketing insights.
Why Raw Review Data Matters
Star ratings are a lagging indicator of quality, but review text is a leading indicator of market fit. Extracting raw review data allows you to:
- Identify Friction Points: Discover if a specific product batch has a faulty pump or if a new scent is polarizing.
- Validate Features: See which ingredients users actually mention. If you’re marketing "Hyaluronic Acid" but users are raving about the "dewy finish," you know which angle to lead with.
- Find Ad Copy Inspiration: Using a customer's exact phrasing (e.g., "A literal lifesaver for dry patches") increases relatability and trust.
To get this data at scale, we need a scraper capable of navigating Ulta’s anti-bot protections and dynamic content.
Prerequisites
To follow along, you will need:
- Python 3.8+
- A ScrapeOps API Key to handle proxy rotation and anti-bot bypass (sign up for free here).
- Basic familiarity with the terminal and Pandas.
Step 1: Setting Up the Ulta Scraper
We will use the production-ready scrapers from the Ulta.com-Scrapers repository. While there are several ways to build a scraper, the Playwright version is best for reviews because Ulta loads its review sections dynamically.
First, clone the repository and install the dependencies:
```bash
# Clone the repository
git clone https://github.com/scraper-bank/Ulta.com-Scrapers.git
cd Ulta.com-Scrapers/python/playwright

# Install dependencies
pip install playwright playwright-stealth
playwright install chromium
```
Open the scraper file at product_data/scraper/ulta_scraper_product_data_v1.py and add your ScrapeOps API key:
```python
API_KEY = "YOUR_SCRAPEOPS_API_KEY"
```
Step 2: Extracting Review Data
The Playwright scraper in this repository extracts a comprehensive ScrapedData object. Rather than just grabbing the price and title, this script targets JSON-LD (Linked Data) embedded in the page. This is more reliable because the data is already structured by Ulta’s own backend.
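To make the target concrete, here is a minimal, hypothetical JSON-LD `Product` payload modeled on schema.org conventions — the real Ulta markup contains many more fields, but the `review` array follows this shape:

```python
import json

# Illustrative JSON-LD snippet (simplified; field names follow schema.org,
# not Ulta's exact payload)
sample_ld_json = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Moisturizing Lotion",
  "brand": {"@type": "Brand", "name": "Example Brand"},
  "review": [
    {
      "@type": "Review",
      "author": {"@type": "Person", "name": "Jane D."},
      "reviewBody": "Doesn't pill under makeup. A literal lifesaver for dry patches.",
      "reviewRating": {"@type": "Rating", "ratingValue": "5"}
    }
  ]
}
"""

data = json.loads(sample_ld_json)
# The same check the scraper performs before collecting reviews
if isinstance(data, dict) and data.get("@type") == "Product":
    reviews = data.get("review", [])
    print(reviews[0]["reviewBody"])
```

Because this structure is rendered by Ulta's own backend for SEO, it changes far less often than the page's CSS classes.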
Here is the ScrapedData dataclass used in the script:
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ScrapedData:
    name: str = ""
    brand: str = ""
    price: float = 0.0
    productId: str = ""
    # This is our primary target for VoC analysis
    reviews: List[Dict[str, Any]] = field(default_factory=list)
    url: str = ""
```
The extract_data function looks for script[type='application/ld+json']. Here is the logic the scraper uses to populate the review list:
```python
# Inside extract_data(page: Page)
scripts = await page.locator("script[type='application/ld+json']").all_text_contents()
for script_text in scripts:
    try:
        data = json.loads(script_text)
        if isinstance(data, dict) and data.get("@type") == "Product":
            # The 'review' key in JSON-LD contains the author, body, and rating
            res.reviews = data.get("review", [])
    except json.JSONDecodeError:
        # Skip script tags that aren't valid JSON
        continue
```
Step 3: Running the Scraper
To target a specific product, modify the execution block at the bottom of the script.
```python
if __name__ == "__main__":
    target_urls = [
        "https://www.ulta.com/p/moisturizing-lotion-pimprod2007632"
    ]
    asyncio.run(run_scraper(target_urls))
```
Run the script from your terminal:
```bash
python product_data/scraper/ulta_scraper_product_data_v1.py
```
The scraper outputs a JSONL (JSON Lines) file. This format is ideal for web scraping because it saves each product as a single line, which prevents file corruption if the script is interrupted.
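To see why JSONL is resilient, here is a minimal sketch (the file path and records are illustrative) of writing one product per line and reading it back — if the script dies mid-run, every line that was fully written remains valid JSON:

```python
import json
import os
import tempfile

records = [
    {"name": "Moisturizing Lotion", "reviews": [{"reviewBody": "Love it"}]},
    {"name": "Vitamin C Serum", "reviews": []},
]

path = os.path.join(tempfile.gettempdir(), "ulta_demo.jsonl")

# Write: one complete JSON object per line, flushed as we go
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: each line parses independently, so a truncated final line
# could simply be skipped without losing earlier records
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))
```

A single monolithic JSON array, by contrast, is unreadable until the closing bracket is written.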
Step 4: Filtering and Preparing Data
Raw scraping output is often nested, meaning a single product entry contains a list of dozens of reviews. To analyze this, we use Pandas to flatten the data.
```python
import pandas as pd
import json

def flatten_reviews(jsonl_file):
    flattened_data = []
    with open(jsonl_file, 'r') as f:
        for line in f:
            product = json.loads(line)
            product_name = product.get('name')
            for review in product.get('reviews', []):
                flattened_data.append({
                    'product': product_name,
                    'author': review.get('author', {}).get('name'),
                    'text': review.get('reviewBody'),
                    'rating': review.get('reviewRating', {}).get('ratingValue')
                })
    df = pd.DataFrame(flattened_data)
    # Remove empty or extremely short reviews
    df = df[df['text'].str.len() > 40]
    return df

df = flatten_reviews('ulta_output.jsonl')
print(df.head())
```
Step 5: Summarizing Insights with AI
With a clean DataFrame of customer feedback, you can use an LLM like GPT-4 or Claude to perform the Voice of Customer analysis. Instead of reading 500 reviews, you feed the most descriptive entries into a prompt.
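One simple heuristic for "most descriptive" (an assumption on my part, not something the repository defines) is review length — rank by character count and keep the top N before building the prompt:

```python
import pandas as pd

# Toy DataFrame standing in for the flattened scraper output
df = pd.DataFrame({
    "text": [
        "Nice.",
        "This lotion absorbs fast and never pills under my SPF.",
        "Smells great, but the pump broke after two weeks of daily use.",
    ],
    "rating": [4, 5, 3],
})

# Rank reviews by length and keep the longest ones for the LLM prompt
top = (df.assign(length=df["text"].str.len())
         .nlargest(2, "length"))
reviews_list = top["text"].tolist()
print(reviews_list)
```

More sophisticated selection (e.g., sampling across star ratings) works too; the point is to spend your token budget on reviews that carry signal.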
VoC Prompting
You can automate this by sending the text column to an API.
```python
import openai

def get_voc_report(reviews_list):
    reviews_text = "\n---\n".join(reviews_list[:50])  # Use first 50 reviews
    prompt = f"""
    Analyze the following customer reviews for a beauty product.
    1. Identify the top 3 specific benefits users mention.
    2. List 3 recurring complaints or pain points.
    3. Extract 5 'Power Phrases'—exact quotes that would make great ad headlines.

    Reviews:
    {reviews_text}
    """
    # Call the OpenAI API (reads OPENAI_API_KEY from the environment)
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Example Results
An AI-generated report from your scraped data might reveal:
- Benefit: "Doesn't pill under SPF" (Mentioned 12 times).
- Power Phrase: "It's like a drink of water for my face."
- Pain Point: "The pump mechanism gets stuck halfway through the bottle."
To Wrap Up
By combining automated scraping with AI analysis, you can treat Ulta as a massive, ongoing focus group. This workflow moves from raw HTML to structured insights in five steps:
- Scrape: Use Playwright to extract JSON-LD data.
- Export: Save in JSONL for data integrity.
- Clean: Use Pandas to flatten and filter the results.
- Analyze: Feed the text into an LLM.
- Execute: Update ad copy and product roadmaps based on real customer language.
For the full suite of Ulta scraping tools, visit the Ulta.com-Scrapers GitHub repository. To scale your operations, use a ScrapeOps API key to manage proxy rotation and avoid blocks.