In e-commerce arbitrage and market research, raw data is only the starting point. Simply having a list of products isn't enough to drive sales or marketing campaigns. If you are scraping Costco to build a price comparison engine or an automated inventory system, the raw JSONL data your scrapers produce is incompatible with platforms like Google Shopping.
Google Shopping requires a specific, strictly formatted XML feed. If a single price is missing a currency code or an availability status uses the wrong terminology, the Merchant Center will reject your entire feed.
This guide walks through the process of taking raw Costco product data and transforming it into a production-ready Google Shopping XML feed using Python.
## Prerequisites
To follow along, you will need:
- Python 3.8+ installed.
- Basic familiarity with JSONL and XML structures.
- A ScrapeOps API key for data extraction (available from the ScrapeOps site).
## Step 1: The Data Source (Scraping Costco)
Before generating a feed, we need data. We’ll use the Costco.com-Scrapers repository, which provides reliable implementations for extracting Costco product information.
First, clone the repository:
```bash
git clone https://github.com/scraper-bank/Costco.com-Scrapers.git
cd Costco.com-Scrapers/python/selenium/product_data/scraper
```
The Selenium implementation is effective for Costco because it handles the site's dynamic content. When you run `costco_scraper_product_data_v1.py`, it produces a JSONL (JSON Lines) file. This format is ideal for scraping because it writes one product per line, preventing data loss if the script is interrupted.
The raw output for a single product looks like this:
```json
{
  "productId": "1234567",
  "name": "Kirkland Signature Coffee, 2 lb",
  "price": 19.99,
  "currency": "USD",
  "availability": "in_stock",
  "images": [{"url": "https://bfasset.costco-static.com/.../1234567-847__1.jpg"}]
}
```
## Step 2: Google Merchant Center Feed Requirements
Google Shopping feeds are RSS 2.0 files with a custom namespace (`g:`). To be accepted, your XML must map Costco's data to Google's specific attribute names:
| Costco Field | Google XML Tag | Requirement |
|---|---|---|
| `productId` | `g:id` | Unique alphanumeric ID |
| `name` | `g:title` | Max 150 characters |
| `description` | `g:description` | Clean text (no HTML) |
| `price` | `g:price` | Must include currency (e.g., "19.99 USD") |
| `availability` | `g:availability` | Must be: `in stock`, `out of stock`, or `preorder` |
| `images[0]['url']` | `g:image_link` | Absolute URL to the main image |
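Mapped through these attributes, a single product ends up as an `<item>` block like the following (the values here are illustrative, not actual feed output):

```xml
<item>
  <g:id>1234567</g:id>
  <g:title>Kirkland Signature Coffee, 2 lb</g:title>
  <g:description>Whole bean, medium roast.</g:description>
  <link>https://www.costco.com/...</link>
  <g:price>19.99 USD</g:price>
  <g:availability>in stock</g:availability>
  <g:image_link>https://bfasset.costco-static.com/.../1234567-847__1.jpg</g:image_link>
</item>
```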
## Step 3: Setting Up the Transformation Script
Python's built-in xml.etree.ElementTree is a lightweight way to generate XML and handle namespaces. We'll start by defining the skeleton of the RSS feed.
```python
import json
import xml.etree.ElementTree as ET

def create_feed_skeleton():
    # Define the namespace for Google attributes
    ET.register_namespace('g', "http://base.google.com/ns/1.0")

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Basic channel metadata
    ET.SubElement(channel, "title").text = "Costco Product Feed"
    ET.SubElement(channel, "link").text = "https://www.costco.com"
    ET.SubElement(channel, "description").text = "Daily automated product feed from Costco"

    return rss, channel
```
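A quick, self-contained check of how `register_namespace` behaves is worth running once: ElementTree declares `xmlns:g` on the root element automatically at serialization time, so any tag created with the full `{http://base.google.com/ns/1.0}` URI comes out as `g:…`:

```python
import xml.etree.ElementTree as ET

ET.register_namespace('g', "http://base.google.com/ns/1.0")
rss = ET.Element("rss", version="2.0")
item = ET.SubElement(rss, "item")

# Tags are created with the full namespace URI; the serializer
# shortens them to the registered "g" prefix on output.
ET.SubElement(item, "{http://base.google.com/ns/1.0}id").text = "1234567"

xml_text = ET.tostring(rss, encoding="unicode")
# xml_text now carries xmlns:g="..." on <rss> and a <g:id> tag
```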
## Step 4: Data Normalization and Mapping
The logic inside the scraper repository handles the extraction, but Google's requirements are strict. We need helper functions to ensure the data matches Google's expected formats.
For example, the Costco scraper might return `in_stock`, but Google requires `in stock`.
```python
def format_availability(raw_status):
    """Maps scraper status to Google's valid values."""
    mapping = {
        "in_stock": "in stock",
        "out_of_stock": "out of stock",
        "preorder": "preorder"
    }
    return mapping.get(raw_status, "out of stock")

def format_price(amount, currency="USD"):
    """Google requires the currency code inside the price string."""
    if amount is None:
        return "0.00 USD"
    return f"{float(amount):.2f} {currency}"
```
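The same normalization idea applies to the other fields in the mapping table: `g:title` is capped at 150 characters and `g:description` must be plain text. Helpers along these lines (the names are hypothetical, not part of the scraper repository) keep those fields compliant:

```python
import html
import re

def format_title(raw_title, max_len=150):
    """Trim titles to Google's 150-character limit, breaking on a word boundary."""
    title = (raw_title or "").strip()
    if len(title) <= max_len:
        return title
    # Cut at the limit, then drop the trailing partial word if one exists
    return title[:max_len].rsplit(" ", 1)[0]

def clean_description(raw_html):
    """Strip HTML tags and entities, since g:description must be plain text."""
    text = re.sub(r"<[^>]+>", " ", raw_html or "")
    return html.unescape(re.sub(r"\s+", " ", text)).strip()
```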
## Step 5: Generating the XML File
Now we combine these elements. We will stream the JSONL file line-by-line. This is a recommended approach for data engineering because it ensures that even with 50,000 products, the script won't exhaust your system's memory.
```python
def generate_google_feed(input_jsonl, output_xml):
    rss, channel = create_feed_skeleton()
    g_ns = "{http://base.google.com/ns/1.0}"

    with open(input_jsonl, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines left by interrupted scrapes
            product = json.loads(line)

            # Create the item element
            item = ET.SubElement(channel, "item")

            # Map standard fields
            ET.SubElement(item, f"{g_ns}id").text = str(product.get('productId'))
            ET.SubElement(item, f"{g_ns}title").text = product.get('name')
            ET.SubElement(item, f"{g_ns}description").text = product.get('description', 'No description available')
            ET.SubElement(item, "link").text = product.get('url')

            # Process formatted fields
            ET.SubElement(item, f"{g_ns}price").text = format_price(
                product.get('price'),
                product.get('currency', 'USD')
            )
            ET.SubElement(item, f"{g_ns}availability").text = format_availability(
                product.get('availability')
            )

            # Handle images
            images = product.get('images', [])
            if images:
                ET.SubElement(item, f"{g_ns}image_link").text = images[0].get('url')

    # Write the XML file
    tree = ET.ElementTree(rss)
    tree.write(output_xml, encoding="utf-8", xml_declaration=True)
    print(f"Successfully generated {output_xml}")

if __name__ == "__main__":
    generate_google_feed('costco_data.jsonl', 'google_shopping_feed.xml')
```
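Before uploading, a quick sanity check that the output parses as XML and contains the expected number of items can catch truncated or partially written files. A minimal sketch (the function name is an assumption, not part of the article's script):

```python
import xml.etree.ElementTree as ET

def count_feed_items(feed_path):
    """Parse the generated feed and count <item> elements under <channel>."""
    tree = ET.parse(feed_path)  # raises ParseError if the XML is malformed
    return len(tree.getroot().findall("./channel/item"))
```

If `ET.parse` raises a `ParseError`, the feed was truncated or mis-encoded and should not be uploaded.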
## Step 6: Validation and Automation
Once you have your `google_shopping_feed.xml`, validate it using the Feed Debugger in Google Merchant Center. Watch out for these common issues:
- Missing brand: Google often requires a `g:brand` tag. You can pull this from the `brand` field in the Costco scraper output.
- GTIN/MPN: For many categories, Google requires a Global Trade Item Number (GTIN) or a Manufacturer Part Number (MPN).
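Within the item-building loop from Step 5, these optional attributes can be attached conditionally. A sketch of the idea; `brand` comes from the scraper output, while the `gtin` and `mpn` keys are assumptions about what your data source provides:

```python
import xml.etree.ElementTree as ET

def add_optional_fields(item, product, g_ns="{http://base.google.com/ns/1.0}"):
    """Attach g:brand / g:gtin / g:mpn only when the scraper provides them."""
    for source_key, google_tag in (("brand", "brand"), ("gtin", "gtin"), ("mpn", "mpn")):
        value = product.get(source_key)
        if value:
            ET.SubElement(item, f"{g_ns}{google_tag}").text = str(value)
```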
To keep the feed current, you can automate the process with a shell script and a cron job:
```bash
#!/bin/bash
# 1. Run the scraper
python3 costco_scraper_product_data_v1.py

# 2. Run the converter
python3 convert_to_xml.py

# 3. Optional: Upload to S3
# aws s3 cp google_shopping_feed.xml s3://my-bucket/feeds/
```
Running this daily ensures your Google Shopping ads reflect current Costco pricing and stock levels.
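For the cron scheduling itself, an entry along these lines runs the pipeline every morning (the script path and log location are placeholders for your own setup):

```cron
# m h dom mon dow  command
0 3 * * * /opt/feeds/update_feed.sh >> /var/log/feed_update.log 2>&1
```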
## To Wrap Up
Converting raw scraped data into a platform-ready format is a vital step in any data pipeline. By using open-source scrapers and a custom Python transformation script, you can turn a basic JSONL file into a structured marketing asset.
Key Takeaways:
- Use JSONL during scraping to maintain data integrity.
- Normalize Price and Availability to meet Google's specific requirements.
- Use streaming (line-by-line) processing to handle large datasets.
For more information, check out our guide on Handling Anti-Bot Measures to keep your scrapers running smoothly as sites update their security.