You’ve bypassed the anti-bot shields, rotated your proxies, and extracted thousands of product records from Ulta.com. Your reward is a massive JSONL file sitting on your hard drive. While this is a victory for data extraction, it’s a dead end for a marketing team.
Ad platforms like Google Merchant Center (GMC) don't accept JSONL. GMC takes product feeds as delimited text or as strictly validated XML (RSS 2.0 or Atom), and this guide targets the XML route. If your data doesn't match the schema exactly, from currency codes to specific availability enums, your products won't show up in Google Shopping.
This "Feed Rescue" phase involves taking the raw output from the Ulta.com-Scrapers repository and building a Python transformation pipeline to generate a production-ready Google Shopping feed.
Phase 1: Analyzing the Source Data
Before writing any XML, we need to look at the raw material. The scrapers in the Ulta.com-Scrapers repository, specifically the Selenium and Playwright versions, use a ScrapedData dataclass that outputs a consistent JSONL schema.
Here is the typical structure of a raw Ulta scrape:
{
  "productId": "2583561",
  "name": "CeraVe - Hydrating Facial Cleanser",
  "brand": "CeraVe",
  "price": 17.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Hydrating Facial Cleanser gently removes dirt...",
  "images": [
    {"url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000", "altText": "Product Image"}
  ],
  "url": "https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719"
}
The scrapers in the repo use JSONL (Line Delimited JSON), which allows us to stream the data line-by-line. If you are converting 50,000 products, you don't need to load a 500MB JSON array into memory. You can process one product at a time.
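A minimal sketch of that line-by-line streaming, assuming one JSON object per line. The function name `iter_products` is hypothetical and not part of the repository; skipping blank or malformed lines instead of aborting is also an assumption about the behavior you want.

```python
import json
from typing import Iterator


def iter_products(path: str) -> Iterator[dict]:
    """Yield one product dict per JSONL line, holding only one line in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: a malformed line is reported and skipped.
                print(f"Skipping malformed line {line_number}")
                continue
            if isinstance(record, dict):
                yield record
```

Because this is a generator, a caller can write `for product in iter_products("ulta_data.jsonl"):` and the file is consumed lazily.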
Phase 2: The Google Merchant Center Specification
Google's requirements are rigid. While the scraper provides most of the data, the formatting is often "web-ready" rather than "ad-ready." GMC requires an RSS 2.0 feed using the g: namespace.
| Ulta Scraper Field | GMC XML Tag | Requirement |
|---|---|---|
| productId | g:id | Unique identifier |
| name | g:title | Max 150 characters |
| description | g:description | Clean text, no broken HTML |
| url | g:link | Absolute URL |
| images[0]['url'] | g:image_link | High-res image URL |
| price + currency | g:price | Format: 17.99 USD |
| availability | g:availability | in_stock, out_of_stock, preorder, or backorder |
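To make the target shape concrete, the following sketch builds a single `<item>` from the sample record shown in Phase 1 and serializes it. The helper name `sample_item_xml` is introduced here for illustration only, and it assumes every field in the record is present.

```python
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"
ET.register_namespace("g", G_NS)  # serialize the namespace with the g: prefix


def sample_item_xml(product: dict) -> str:
    """Map one scraped record to a namespaced <item> and return it as text."""
    item = ET.Element("item")
    fields = {
        "id": str(product["productId"]),
        "title": product["name"][:150],
        "link": product["url"],
        "image_link": product["images"][0]["url"],
        "price": f"{product['price']:.2f} {product['currency']}",
        "availability": product["availability"],
    }
    for tag, value in fields.items():
        ET.SubElement(item, f"{{{G_NS}}}{tag}").text = value
    return ET.tostring(item, encoding="unicode")


product = {
    "productId": "2583561",
    "name": "CeraVe - Hydrating Facial Cleanser",
    "price": 17.99,
    "currency": "USD",
    "availability": "in_stock",
    "images": [{"url": "https://media.ulta.com/i/ulta/2583561?w=2000&h=2000"}],
    "url": "https://www.ulta.com/p/hydrating-facial-cleanser-pimprod2001719",
}
print(sample_item_xml(product))
```

ElementTree escapes the `&` in the image URL as `&amp;` when serializing, so the URL does not need to be escaped by hand.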
Phase 3: Field Mapping & Transformation Logic
We need specific transformation logic to handle the nuances of Ulta's data.
1. Price Normalization
The scraper provides a float (17.99) and a string (USD). Google requires a combined string. This helper function ensures the decimal precision is always correct:
def format_gmc_price(amount, currency):
    return f"{amount:.2f} {currency}"
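Binary floats can round unexpectedly at the half-cent boundary, so one hedged alternative is to format through `decimal.Decimal`. The function below is a sketch rather than the repository's code: `format_gmc_price_decimal` is a name introduced here, and half-up rounding is an assumption about the pricing policy you want.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_gmc_price_decimal(amount, currency: str = "USD") -> str:
    """Format a price as '<amount> <CURRENCY>' with exactly two decimals."""
    # Converting via str() keeps the digits as the scraper wrote them,
    # instead of the float's binary approximation.
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"{value} {currency.upper()}"


# f"{2.675:.2f}" yields "2.67" because 2.675 is stored slightly below the
# half-cent; the Decimal route rounds the written digits half-up instead.
print(format_gmc_price_decimal(17.99, "USD"))
print(format_gmc_price_decimal(2.675, "usd"))
```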
2. Availability Mapping
The Ulta scrapers detect stock status accurately, but the values they emit must match Google's allowed enums exactly. A value of "in_stock" passes through unchanged, but we should always include a fallback for unexpected values.
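One way to implement that fallback is a lookup that normalizes whatever the scraper emits. The aliases below are illustrative guesses at values a scraper might produce (including schema.org-style URLs), not strings confirmed from the repository, and defaulting to `out_of_stock` is a deliberately conservative assumption.

```python
GMC_AVAILABILITY = {"in_stock", "out_of_stock", "preorder", "backorder"}

# Hypothetical aliases; extend this with values seen in your own scrapes.
AVAILABILITY_ALIASES = {
    "instock": "in_stock",
    "in stock": "in_stock",
    "available": "in_stock",
    "outofstock": "out_of_stock",
    "out of stock": "out_of_stock",
    "sold out": "out_of_stock",
    "pre-order": "preorder",
    "pre_order": "preorder",
    "back-order": "backorder",
}


def normalize_availability(raw, default: str = "out_of_stock") -> str:
    """Return a GMC availability value, falling back to `default`."""
    if not isinstance(raw, str):
        return default
    # Keep only the last path segment so "https://schema.org/InStock"
    # is treated the same as "InStock".
    key = raw.strip().lower().rsplit("/", 1)[-1]
    if key in GMC_AVAILABILITY:
        return key
    return AVAILABILITY_ALIASES.get(key, default)
```

Falling back to `out_of_stock` means an unrecognized value never advertises a product that might not be purchasable.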
3. Image Handling
Google wants one primary image and the rest as "additional" images. Since the Ulta scraper returns a list of dictionaries, we use the first item for the main link and the remaining items for the gallery.
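A small sketch of that split, assuming each list entry is a dictionary with a `url` key as in the sample record. The helper name `split_images` is hypothetical, and dropping duplicate or empty URLs is an added precaution rather than something the scraper is known to need.

```python
def split_images(images, max_additional: int = 10):
    """Return (primary_url, additional_urls) from the scraper's image list."""
    urls = []
    for image in images or []:
        url = image.get("url") if isinstance(image, dict) else None
        if url and url not in urls:
            urls.append(url)  # drop empties and duplicates, keep order
    if not urls:
        return None, []
    # The first URL is the primary image; the rest are capped for the gallery.
    return urls[0], urls[1:1 + max_additional]
```

The primary URL maps to `g:image_link`, and each entry of the second value maps to its own `g:additional_image_link` element.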
Phase 4: Building the Converter Script
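This pipeline uses Python's xml.etree.ElementTree because it is lightweight, ships with the standard library, and handles namespaces cleanly. The JSONL file is read line by line, so the raw input is never loaded all at once. The XML tree itself is still built in memory before it is written out.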
import json
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Values Google accepts for g:availability
ALLOWED_AVAILABILITY = ('in_stock', 'out_of_stock', 'preorder', 'backorder')


def create_gmc_feed(input_jsonl, output_xml):
    # Register the Google namespace so tags serialize with the g: prefix
    g_ns = "http://base.google.com/ns/1.0"
    ET.register_namespace('g', g_ns)

    # Create the RSS root and the channel metadata
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Ulta Product Feed"
    ET.SubElement(channel, "link").text = "https://www.ulta.com"
    ET.SubElement(channel, "description").text = "Daily product updates from Ulta"

    with open(input_jsonl, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of crashing json.loads
            product = json.loads(line)
            if product.get('productId') is None:
                continue  # g:id is required, so drop records without one

            # Create Item
            item = ET.SubElement(channel, "item")

            # Basic Mapping ("or ''" guards against explicit null values)
            ET.SubElement(item, f"{{{g_ns}}}id").text = str(product['productId'])
            ET.SubElement(item, f"{{{g_ns}}}title").text = (product.get('name') or '')[:150]
            ET.SubElement(item, f"{{{g_ns}}}description").text = product.get('description') or ''
            ET.SubElement(item, f"{{{g_ns}}}link").text = product.get('url')
            ET.SubElement(item, f"{{{g_ns}}}brand").text = product.get('brand')
            ET.SubElement(item, f"{{{g_ns}}}condition").text = "new"

            # Formatted Price
            price = product.get('price')
            if price is not None:
                price_str = f"{float(price):.2f} {product.get('currency', 'USD')}"
                ET.SubElement(item, f"{{{g_ns}}}price").text = price_str

            # Availability, with a conservative fallback for unexpected values
            status = product.get('availability')
            if status not in ALLOWED_AVAILABILITY:
                status = 'out_of_stock'
            ET.SubElement(item, f"{{{g_ns}}}availability").text = status

            # Images
            images = product.get('images') or []
            if images:
                ET.SubElement(item, f"{{{g_ns}}}image_link").text = images[0].get('url')
                # Add up to 10 additional images
                for img in images[1:11]:
                    ET.SubElement(item, f"{{{g_ns}}}additional_image_link").text = img.get('url')

    # Save with pretty printing
    xml_str = ET.tostring(rss, encoding='utf-8')
    pretty_xml = minidom.parseString(xml_str).toprettyxml(indent=" ")
    with open(output_xml, "w", encoding="utf-8") as f:
        f.write(pretty_xml)


if __name__ == "__main__":
    create_gmc_feed('ulta_data.jsonl', 'google_feed.xml')
Why this approach works:
- Memory Efficiency: Iterating with `for line in f` reads one record at a time, so the raw JSONL is never loaded into memory all at once. The XML tree itself still grows with the product count, which is worth watching for very large catalogs.
- Namespace Handling: Using `f"{{{g_ns}}}tag_name"` correctly implements the `g:` prefix Google requires for its specific attributes.
- Data Truncation: Automatically truncating titles to 150 characters avoids GMC validation errors.
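If the in-memory tree becomes the bottleneck, one hedged alternative is to write each `<item>` to disk as soon as it is built. This is a sketch under stated assumptions, not the repository's approach: `build_item` and `stream_gmc_feed` are names introduced here, only three fields are mapped to keep it short, and the output is not pretty-printed.

```python
import json
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"
ET.register_namespace("g", G_NS)


def build_item(product: dict) -> ET.Element:
    """Build one <item>; only a few fields are mapped in this sketch."""
    item = ET.Element("item")
    ET.SubElement(item, f"{{{G_NS}}}id").text = str(product.get("productId", ""))
    ET.SubElement(item, f"{{{G_NS}}}title").text = (product.get("name") or "")[:150]
    ET.SubElement(item, f"{{{G_NS}}}link").text = product.get("url") or ""
    return item


def stream_gmc_feed(input_jsonl: str, output_xml: str) -> int:
    """Write the feed item by item and return the number of items written."""
    count = 0
    with open(input_jsonl, "r", encoding="utf-8") as src, \
         open(output_xml, "w", encoding="utf-8") as out:
        # The envelope is written by hand so no full tree is ever held.
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(f'<rss version="2.0" xmlns:g="{G_NS}">\n<channel>\n')
        out.write("<title>Ulta Product Feed</title>\n")
        out.write("<link>https://www.ulta.com</link>\n")
        out.write("<description>Daily product updates from Ulta</description>\n")
        for line in src:
            if not line.strip():
                continue
            item = build_item(json.loads(line))
            # Each serialized item repeats the xmlns:g declaration; that is
            # redundant but still well-formed XML.
            out.write(ET.tostring(item, encoding="unicode") + "\n")
            count += 1
        out.write("</channel>\n</rss>\n")
    return count
```

The trade-off is a slightly larger file and a hand-written envelope in exchange for memory use that stays flat regardless of catalog size.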
Phase 5: Handling Edge Cases
Web data is messy. Here are three common issues encountered when processing Ulta scrapes:
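- HTML in Descriptions: Ulta's descriptions sometimes contain raw HTML tags or character entities. While the scraper cleans most of this, it is safer to strip any remaining tags with a regex before inserting the text into the XML. ElementTree escapes special characters on its own, so the feed stays well-formed either way, but leftover markup would show up literally in the listing. Wrapping the description in a `CDATA` section is another option, though ElementTree has no built-in support for writing one.
- Absolute URLs: Ensure your scraper uses the `make_absolute_url` logic from the repository. Google rejects relative URLs like `/p/product-name`.
- Zero or Missing Prices: Occasionally, a product might show "Price Varies" or "Out of Stock" without a numerical value. The `:.2f` formatting will fail if `price` is `None`, and `float()` will fail on non-numeric text. Skip the item when the price is missing; defaulting to `0.00` keeps the script running but publishes an incorrect price.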
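The description and price issues above can be handled with two small helpers. This is a sketch under stated assumptions: `clean_description` and `parse_price` are hypothetical names, the regex is a simple tag stripper rather than a full HTML parser, and the 5,000-character cap reflects Google's documented limit for descriptions.

```python
import html
import math
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_description(raw, max_length: int = 5000) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    if not raw:
        return ""
    text = TAG_RE.sub(" ", raw)        # replace each tag with a space
    text = html.unescape(text)         # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text)   # \s also matches non-breaking spaces
    return text.strip()[:max_length]


def parse_price(raw):
    """Return a positive float, or None when the value is unusable."""
    if raw is None or isinstance(raw, bool):
        return None
    try:
        price = float(str(raw).replace("$", "").replace(",", "").strip())
    except ValueError:
        return None  # e.g. "Price Varies"
    if not math.isfinite(price) or price <= 0:
        return None
    return price
```

A caller can then skip the record whenever `parse_price(product.get("price"))` returns `None`, instead of emitting an item without a usable price.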
To Wrap Up
Converting raw scraper output into a valid product feed is what turns it into business value. Bridging the gap between JSONL and GMC XML allows you to automate inventory updates directly from your scraping pipeline.
Key Takeaways:
- Stream your data: Use JSONL and line-by-line processing to handle large datasets.
- Respect the Schema: Google is strict about formatting. Always include the currency code in the price and map availability to Google's supported enum values.
- Automate the Pipeline: Trigger this script immediately after your scraper finishes to create a hands-off data-to-ads pipeline.
For more information on the initial extraction, check out the ScrapeOps Residential Proxy Aggregator and the full range of implementations in the Ulta.com-Scrapers repository.