Erika S. Adkins

Feed Rescue: Converting Raw Ulta Scrapes into Google Merchant Center XML

You’ve bypassed the anti-bot shields, rotated your proxies, and extracted thousands of product records from Ulta.com. Your reward is a massive JSONL file sitting on your hard drive. While this is a victory for data extraction, it’s a dead end for a marketing team.

Ad platforms like Google Merchant Center (GMC) don't accept JSONL. They require a highly structured, strictly validated XML format (RSS 2.0 or Atom). If your data doesn't perfectly match their schema—from currency codes to specific availability enums—your products won't show up in Google Shopping.

This "Feed Rescue" phase involves taking the raw output from the Ulta.com-Scrapers repository and building a Python transformation pipeline to generate a production-ready Google Shopping feed.

Phase 1: Analyzing the Source Data

Before writing any XML, we need to look at the raw material. The scrapers in the Ulta.com-Scrapers repository, specifically the Selenium and Playwright versions, use a ScrapedData dataclass that outputs a consistent JSONL schema.

Here is the typical structure of a raw Ulta scrape:

{
  "productId": "2583561",
  "name": "CeraVe - Hydrating Facial Cleanser",
  "brand": "CeraVe",
  "price": 17.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Hydrating Facial Cleanser gently removes dirt...",
  "images": [{"url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000", "altText": "Product Image"}],
  "url": "https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719"
}

The scrapers in the repo emit JSONL (line-delimited JSON), which lets us stream the data record by record. If you are converting 50,000 products, you don't need to load a 500MB JSON array into memory; you can process one product at a time.
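
As a quick sketch, a streaming reader can be as small as this (iter_jsonl is a hypothetical helper for illustration, not part of the repository):

import json

def iter_jsonl(path):
    """Yield one parsed product dict per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Only one product is held in memory at a time
for product in iter_jsonl('ulta_data.jsonl'):
    print(product['productId'], product['name'])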

Phase 2: The Google Merchant Center Specification

Google's requirements are rigid. While the scraper provides most of the data, the formatting is often "web-ready" rather than "ad-ready." GMC requires an RSS 2.0 feed using the g: namespace.

| Ulta Scraper Field | GMC XML Tag | Requirement |
| --- | --- | --- |
| productId | g:id | Unique identifier |
| name | g:title | Max 150 characters |
| description | g:description | Clean text, no broken HTML |
| url | g:link | Absolute URL |
| images[0]['url'] | g:image_link | High-res image URL |
| price + currency | g:price | Format: 17.99 USD |
| availability | g:availability | in_stock, out_of_stock, or preorder |
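
For orientation, a single <item> in the finished feed should look roughly like this, using the sample record from Phase 1 (note that the & in the image URL must be escaped as &amp; in XML):

<item>
  <g:id>2583561</g:id>
  <g:title>CeraVe - Hydrating Facial Cleanser</g:title>
  <g:link>https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719</g:link>
  <g:image_link>https://media.ulta.com/i/ulta/2583561?w=2000&amp;h=2000</g:image_link>
  <g:price>17.99 USD</g:price>
  <g:availability>in_stock</g:availability>
</item>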

Phase 3: Field Mapping & Transformation Logic

We need specific transformation logic to handle the nuances of Ulta's data.

1. Price Normalization

The scraper provides a float (17.99) and a string (USD). Google requires a combined string. This helper function ensures the decimal precision is always correct:

def format_gmc_price(amount, currency):
    # GMC expects "<amount> <ISO 4217 code>", e.g. "17.99 USD"
    return f"{float(amount):.2f} {currency}"
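
For example:

>>> format_gmc_price(17.99, "USD")
'17.99 USD'
>>> format_gmc_price("24.5", "USD")
'24.50 USD'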

2. Availability Mapping

The Ulta scrapers detect stock status accurately, but we must ensure the values match Google's allowed enums. "in_stock" passes through as-is, but we should always include a fallback for unexpected values.
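
A minimal sketch of that fallback (the lowercase normalization is an assumption about the scraper's raw values):

VALID_AVAILABILITY = {"in_stock", "out_of_stock", "preorder"}

def map_availability(raw):
    # Normalize, then fall back to out_of_stock for anything unexpected,
    # so the feed never fails GMC's enum validation.
    status = (raw or "").strip().lower()
    return status if status in VALID_AVAILABILITY else "out_of_stock"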

3. Image Handling

Google wants one primary image and the rest as "additional" images. Since the Ulta scraper returns a list of dictionaries, we use the first item for the main link and the remaining items for the gallery.
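
In isolation, that split looks like this (split_images is a hypothetical helper; the converter in Phase 4 inlines the same logic):

def split_images(images, max_additional=10):
    """Split the scraper's image list into a main URL and up to 10 extras."""
    if not images:
        return None, []
    main = images[0].get('url')
    additional = [img.get('url') for img in images[1:1 + max_additional]]
    return main, additional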

Phase 4: Building the Converter Script

This pipeline uses Python’s xml.etree.ElementTree because it is lightweight, ships with the standard library, and supports XML namespaces. The JSONL input is read line-by-line to keep parsing memory low.

import json
import xml.etree.ElementTree as ET
from xml.dom import minidom

def create_gmc_feed(input_jsonl, output_xml):
    # Define Namespaces
    g_ns = "http://base.google.com/ns/1.0"
    ET.register_namespace('g', g_ns)

    # Create Root
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Ulta Product Feed"
    ET.SubElement(channel, "link").text = "https://www.ulta.com"
    ET.SubElement(channel, "description").text = "Daily product updates from Ulta"

    with open(input_jsonl, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            product = json.loads(line)

            # Create Item
            item = ET.SubElement(channel, "item")

            # Basic Mapping
            ET.SubElement(item, f"{{{g_ns}}}id").text = str(product.get('productId'))
            ET.SubElement(item, f"{{{g_ns}}}title").text = product.get('name', '')[:150]
            ET.SubElement(item, f"{{{g_ns}}}description").text = product.get('description', '')
            ET.SubElement(item, f"{{{g_ns}}}link").text = product.get('url')
            ET.SubElement(item, f"{{{g_ns}}}brand").text = product.get('brand')
            ET.SubElement(item, f"{{{g_ns}}}condition").text = "new"

            # Formatted Price
            price = product.get('price')
            if price is not None:
                try:
                    price_str = f"{float(price):.2f} {product.get('currency', 'USD')}"
                    ET.SubElement(item, f"{{{g_ns}}}price").text = price_str
                except (TypeError, ValueError):
                    pass  # non-numeric price (e.g. "Price Varies"); omit the tag

            # Availability
            status = (product.get('availability') or '').strip().lower()
            if status not in ("in_stock", "out_of_stock", "preorder"):
                status = "out_of_stock"  # fallback for unexpected values
            ET.SubElement(item, f"{{{g_ns}}}availability").text = status

            # Images
            images = product.get('images', [])
            if images:
                ET.SubElement(item, f"{{{g_ns}}}image_link").text = images[0].get('url')
                # Add up to 10 additional images
                for img in images[1:11]:
                    ET.SubElement(item, f"{{{g_ns}}}additional_image_link").text = img.get('url')

    # Save with pretty printing
    xml_str = ET.tostring(rss, encoding='utf-8')
    pretty_xml = minidom.parseString(xml_str).toprettyxml(indent="  ")

    with open(output_xml, "w", encoding="utf-8") as f:
        f.write(pretty_xml)

if __name__ == "__main__":
    create_gmc_feed('ulta_data.jsonl', 'google_feed.xml')

Why this approach works:

  1. Memory Efficiency: Reading the JSONL with for line in f parses only one product at a time. Note that the XML tree itself still grows in memory; for feeds in the hundreds of thousands of items, consider writing items incrementally (see the sketch below).
  2. Namespace Handling: Using f"{{{g_ns}}}tag_name" correctly implements the g: prefix Google requires for its specific attributes.
  3. Data Truncation: Automatically truncating titles to 150 characters avoids GMC validation errors.
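
For very large feeds, a minimal incremental-writing sketch looks like this; stream_gmc_feed is a hypothetical alternative that writes escaped strings directly instead of building a tree (only two g: tags are shown, the rest follow the same pattern):

import json
from xml.sax.saxutils import escape

def stream_gmc_feed(input_jsonl, output_xml):
    """Write feed items one at a time so memory use stays flat."""
    with open(input_jsonl, 'r', encoding='utf-8') as src, \
         open(output_xml, 'w', encoding='utf-8') as dst:
        dst.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        dst.write('<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0"><channel>\n')
        dst.write('<title>Ulta Product Feed</title>\n')
        for line in src:
            line = line.strip()
            if not line:
                continue
            p = json.loads(line)
            dst.write('<item>')
            dst.write(f"<g:id>{escape(str(p.get('productId', '')))}</g:id>")
            dst.write(f"<g:title>{escape((p.get('name') or '')[:150])}</g:title>")
            # ...remaining g: tags follow the same escape-and-write pattern...
            dst.write('</item>\n')
        dst.write('</channel></rss>\n')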

Phase 5: Handling Edge Cases

Web data is messy. Here are three common issues encountered when processing Ulta scrapes:

  • HTML in Descriptions: Ulta's descriptions sometimes contain raw HTML tags or entities such as &nbsp;. While the scraper cleans most of this, it is safer to wrap the description in a CDATA section or to decode entities and strip remaining tags before inserting them into the XML (see the cleaning sketch after this list).
  • Absolute URLs: Ensure your scraper uses the make_absolute_url logic from the repository. Google rejects relative URLs like /p/product-name.
  • Zero or Missing Prices: Occasionally, a product might show "Price Varies" or "Out of Stock" without a numerical value. The :.2f formatting will fail if price is None. Always default to 0.00 or skip the item if the price is missing.
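
A minimal description-cleaning sketch, assuming a regex strip plus entity decoding is good enough for these descriptions (a full HTML parser would be more robust):

import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_description(raw):
    """Decode entities (e.g. &nbsp;) and strip leftover HTML tags."""
    if not raw:
        return ''
    text = html.unescape(raw)      # &nbsp; -> non-breaking space, &amp; -> &
    text = TAG_RE.sub(' ', text)   # drop any remaining tags
    return ' '.join(text.split())  # collapse runs of whitespace

# Usage
clean_description('Gently removes<br/>dirt&nbsp;and oil')  # 'Gently removes dirt and oil'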

To Wrap Up

Converting raw scraper output into a functional marketing asset is where data extraction turns into business value. Bridging the gap between JSONL and GMC XML lets you automate inventory updates directly from your scraping pipeline.

Key Takeaways:

  • Stream your data: Use JSONL and line-by-line processing to handle large datasets.
  • Respect the Schema: Google is strict about formatting. Always include the currency code in the price and map availability to their three specific enums.
  • Automate the Pipeline: Trigger this script immediately after your scraper finishes to create a hands-off data-to-ads pipeline.

For more information on the initial extraction, check out the ScrapeOps Residential Proxy Aggregator and the full range of implementations in the Ulta.com-Scrapers repository.
