In e-commerce arbitrage and market research, raw data is only the starting point. Simply having a list of products isn't enough to drive sales or marketing campaigns. If you are scraping Costco to build a price comparison engine or an automated inventory system, the raw JSONL data your scrapers produce is incompatible with platforms like Google Shopping.
Google Shopping requires a specific, strictly formatted XML feed. If a single price is missing a currency code or an availability status uses the wrong terminology, the Merchant Center will reject your entire feed.
This guide walks through the process of taking raw Costco product data and transforming it into a production-ready Google Shopping XML feed using Python.
## Prerequisites
To follow along, you will need:
- Python 3.8+ installed.
- Basic familiarity with JSONL and XML structures.
- A ScrapeOps API key for data extraction (available from the ScrapeOps site).
## Step 1: The Data Source (Scraping Costco)
Before generating a feed, we need data. We’ll use the Costco.com-Scrapers repository, which provides reliable implementations for extracting Costco product information.
First, clone the repository:
```bash
git clone https://github.com/scraper-bank/Costco.com-Scrapers.git
cd Costco.com-Scrapers/python/selenium/product_data/scraper
```
The Selenium implementation is effective for Costco because it handles the site's dynamic content. When you run `costco_scraper_product_data_v1.py`, it produces a JSONL (JSON Lines) file. This format is ideal for scraping because it writes one product per line, preventing data loss if the script is interrupted.
The raw output for a single product looks like this:
```json
{
  "productId": "1234567",
  "name": "Kirkland Signature Coffee, 2 lb",
  "price": 19.99,
  "currency": "USD",
  "availability": "in_stock",
  "images": [{"url": "https://bfasset.costco-static.com/.../1234567-847__1.jpg"}]
}
```
## Step 2: Google Merchant Center Feed Requirements
Google Shopping feeds are RSS 2.0 files with a custom namespace (`g:`). To be accepted, your XML must map Costco's data to Google's specific attribute names:
| Costco Field | Google XML Tag | Requirement |
|---|---|---|
| `productId` | `g:id` | Unique alphanumeric ID |
| `name` | `g:title` | Max 150 characters |
| `description` | `g:description` | Clean text (no HTML) |
| `price` | `g:price` | Must include currency (e.g., "19.99 USD") |
| `availability` | `g:availability` | Must be: `in stock`, `out of stock`, or `preorder` |
| `images[0]['url']` | `g:image_link` | Absolute URL to the main image |
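Mapped through these attributes, a single product ends up as an `<item>` block like the following (the values here are illustrative, not actual feed output):

```xml
<item>
  <g:id>1234567</g:id>
  <g:title>Kirkland Signature Coffee, 2 lb</g:title>
  <g:description>Whole bean, medium roast.</g:description>
  <link>https://www.costco.com/...</link>
  <g:price>19.99 USD</g:price>
  <g:availability>in stock</g:availability>
  <g:image_link>https://bfasset.costco-static.com/.../1234567-847__1.jpg</g:image_link>
</item>
```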
## Step 3: Setting Up the Transformation Script
Python's built-in xml.etree.ElementTree is a lightweight way to generate XML and handle namespaces. We'll start by defining the skeleton of the RSS feed.
```python
import json
import xml.etree.ElementTree as ET

def create_feed_skeleton():
    # Define the namespace for Google attributes
    ET.register_namespace('g', "http://base.google.com/ns/1.0")

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Basic channel metadata
    ET.SubElement(channel, "title").text = "Costco Product Feed"
    ET.SubElement(channel, "link").text = "https://www.costco.com"
    ET.SubElement(channel, "description").text = "Daily automated product feed from Costco"

    return rss, channel
```
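A quick, self-contained check of how `register_namespace` behaves is worth running once: ElementTree declares `xmlns:g` on the root element automatically at serialization time, so any tag created with the full `{http://base.google.com/ns/1.0}` URI comes out as `g:…`:

```python
import xml.etree.ElementTree as ET

ET.register_namespace('g', "http://base.google.com/ns/1.0")
rss = ET.Element("rss", version="2.0")
item = ET.SubElement(rss, "item")

# Tags are created with the full namespace URI; the serializer
# shortens them to the registered "g" prefix on output.
ET.SubElement(item, "{http://base.google.com/ns/1.0}id").text = "1234567"

xml_text = ET.tostring(rss, encoding="unicode")
# xml_text now carries xmlns:g="..." on <rss> and a <g:id> tag
```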
## Step 4: Data Normalization and Mapping
The logic inside the scraper repository handles the extraction, but Google's requirements are strict. We need helper functions to ensure the data matches Google's expected formats.
For example, the Costco scraper might return `in_stock`, but Google requires `in stock`.
```python
def format_availability(raw_status):
    """Maps scraper status to Google's valid values."""
    mapping = {
        "in_stock": "in stock",
        "out_of_stock": "out of stock",
        "preorder": "preorder"
    }
    return mapping.get(raw_status, "out of stock")

def format_price(amount, currency="USD"):
    """Google requires the currency code inside the price string."""
    if amount is None:
        return "0.00 USD"
    return f"{float(amount):.2f} {currency}"
```
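The same normalization idea applies to the other fields in the mapping table: `g:title` is capped at 150 characters and `g:description` must be plain text. Helpers along these lines (the names are hypothetical, not part of the scraper repository) keep those fields compliant:

```python
import html
import re

def format_title(raw_title, max_len=150):
    """Trim titles to Google's 150-character limit, breaking on a word boundary."""
    title = (raw_title or "").strip()
    if len(title) <= max_len:
        return title
    # Cut at the limit, then drop the trailing partial word if one exists
    return title[:max_len].rsplit(" ", 1)[0]

def clean_description(raw_html):
    """Strip HTML tags and entities, since g:description must be plain text."""
    text = re.sub(r"<[^>]+>", " ", raw_html or "")
    return html.unescape(re.sub(r"\s+", " ", text)).strip()
```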
## Step 5: Generating the XML File
Now we combine these elements. We will stream the JSONL file line-by-line. This is a recommended approach for data engineering because it ensures that even with 50,000 products, the script won't exhaust your system's memory.
```python
def generate_google_feed(input_jsonl, output_xml):
    rss, channel = create_feed_skeleton()
    g_ns = "{http://base.google.com/ns/1.0}"

    with open(input_jsonl, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines left by interrupted scrapes
            product = json.loads(line)

            # Create the item element
            item = ET.SubElement(channel, "item")

            # Map standard fields
            ET.SubElement(item, f"{g_ns}id").text = str(product.get('productId'))
            ET.SubElement(item, f"{g_ns}title").text = product.get('name')
            ET.SubElement(item, f"{g_ns}description").text = product.get('description', 'No description available')
            ET.SubElement(item, "link").text = product.get('url')

            # Process formatted fields
            ET.SubElement(item, f"{g_ns}price").text = format_price(
                product.get('price'),
                product.get('currency', 'USD')
            )
            ET.SubElement(item, f"{g_ns}availability").text = format_availability(
                product.get('availability')
            )

            # Handle images
            images = product.get('images', [])
            if images:
                ET.SubElement(item, f"{g_ns}image_link").text = images[0].get('url')

    # Write the XML file
    tree = ET.ElementTree(rss)
    tree.write(output_xml, encoding="utf-8", xml_declaration=True)
    print(f"Successfully generated {output_xml}")

if __name__ == "__main__":
    generate_google_feed('costco_data.jsonl', 'google_shopping_feed.xml')
```
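Before uploading, a quick sanity check that the output parses as XML and contains the expected number of items can catch truncated or partially written files. A minimal sketch (the function name is an assumption, not part of the article's script):

```python
import xml.etree.ElementTree as ET

def count_feed_items(feed_path):
    """Parse the generated feed and count <item> elements under <channel>."""
    tree = ET.parse(feed_path)  # raises ParseError if the XML is malformed
    return len(tree.getroot().findall("./channel/item"))
```

If `ET.parse` raises a `ParseError`, the feed was truncated or mis-encoded and should not be uploaded.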
## Step 6: Validation and Automation
Once you have your `google_shopping_feed.xml`, validate it using the Feed Debugger in Google Merchant Center. Watch out for these common issues:
- Missing brand: Google often requires a `g:brand` tag. You can pull this from the `brand` field in the Costco scraper output.
- GTIN/MPN: For many categories, Google requires a Global Trade Item Number (GTIN) or a Manufacturer Part Number (MPN).
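Within the item-building loop from Step 5, these optional attributes can be attached conditionally. A sketch of the idea; `brand` comes from the scraper output, while the `gtin` and `mpn` keys are assumptions about what your data source provides:

```python
import xml.etree.ElementTree as ET

def add_optional_fields(item, product, g_ns="{http://base.google.com/ns/1.0}"):
    """Attach g:brand / g:gtin / g:mpn only when the scraper provides them."""
    for source_key, google_tag in (("brand", "brand"), ("gtin", "gtin"), ("mpn", "mpn")):
        value = product.get(source_key)
        if value:
            ET.SubElement(item, f"{g_ns}{google_tag}").text = str(value)
```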
To keep the feed current, you can automate the process with a shell script and a cron job:
```bash
#!/bin/bash
# 1. Run the scraper
python3 costco_scraper_product_data_v1.py

# 2. Run the converter
python3 convert_to_xml.py

# 3. Optional: Upload to S3
# aws s3 cp google_shopping_feed.xml s3://my-bucket/feeds/
```
Running this daily ensures your Google Shopping ads reflect current Costco pricing and stock levels.
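For the cron scheduling itself, an entry along these lines runs the pipeline every morning (the script path and log location are placeholders for your own setup):

```cron
# m h dom mon dow  command
0 3 * * * /opt/feeds/update_feed.sh >> /var/log/feed_update.log 2>&1
```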
## To Wrap Up
Converting raw scraped data into a platform-ready format is a vital step in any data pipeline. By using open-source scrapers and a custom Python transformation script, you can turn a basic JSONL file into a structured marketing asset.
Key Takeaways:
- Use JSONL during scraping to maintain data integrity.
- Normalize Price and Availability to meet Google's specific requirements.
- Use streaming (line-by-line) processing to handle large datasets.
For more information, check out our guide on Handling Anti-Bot Measures to keep your scrapers running smoothly as sites update their security.