Robert N. Gutierrez
From Scrape to Feed: Building a Google Merchant Center CSV from Zappos Data

Scraping product data is often only the first half of a larger business objective. Whether you are building a price comparison engine, a dropshipping store, or a market research tool, raw data is rarely useful in its initial state. If you want to list products on Google Shopping, you face a significant hurdle: Google Merchant Center (GMC) has incredibly strict formatting requirements.

Raw scraped data from Zappos typically arrives as a nested JSON or JSONL file. Google expects a clean CSV or XML feed with specific headers like availability, image_link, and price formatted exactly to their specifications.

This guide bridges that gap. We’ll use a production-ready Zappos scraper to extract data and then build a Python transformation pipeline using pandas to convert that raw data into a valid, GMC-compliant product feed.

Prerequisites & Setup

Ensure you have Python installed on your machine. We will use the playwright library for extraction and pandas for data transformation.

First, clone the Zappos.com-Scrapers repository, which contains the logic needed to handle Zappos' complex page structure.

git clone https://github.com/scraper-bank/Zappos.com-Scrapers.git
cd Zappos.com-Scrapers
pip install pandas playwright playwright-stealth
playwright install chromium

You will also need a ScrapeOps API Key. Zappos employs advanced anti-bot measures that can block standard requests. ScrapeOps provides proxy rotation and request optimization to ensure your scrapers reach the page. You can sign up for a free API key here.

Step 1: Extracting Raw Data from Zappos

We will use the Playwright implementation found in the repository. This script navigates to a product page, handles dynamic content loading, and extracts structured data into a JSONL (JSON Lines) format.

Open python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py, insert your API key into the API_KEY variable, and run the scraper:

cd python/playwright/product_data/scraper
python zappos_scraper_product_data_v1.py

The Raw Output

The scraper produces a zappos_com_product_page_scraper_data_[timestamp].jsonl file. Each line looks something like this:

{
  "name": "Nike Air Zoom Pegasus 40",
  "productId": "9848135",
  "price": 129.95,
  "availability": "in_stock",
  "images": [{"url": "https://m.media-amazon.com/images/I/71...jpg", "alt_text": ""}],
  "url": "https://www.zappos.com/p/nike-air-zoom..."
}

While this is structured, it isn't ready for Google. The price is a float without a currency code, the images are trapped in a list, and the availability format doesn't match Google's requirements.

Step 2: Gap Analysis (JSONL vs. GMC Requirements)

To build a successful feed, we must map our scraped fields to the Google Product Data Specification. Here is how the raw data stacks up against Google's requirements:

| Field | Scraped Data Example | Google Requirement | Transformation |
| --- | --- | --- | --- |
| ID | `9848135` | Alphanumeric ID | Rename to `id` |
| Title | `Nike Air Zoom...` | Max 150 chars | Rename to `title` |
| Price | `129.95` | `129.95 USD` | Format as string + currency |
| Availability | `in_stock` | `in stock` | Replace underscore with space |
| Image Link | List of objects | Single URL string | Extract first URL in list |
| Condition | Missing | `new`, `refurbished`, `used` | Add default value `new` |
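Before writing the full pipeline, it can help to sanity-check a raw record against these requirements. A minimal sketch, assuming the field names from the JSONL example above (the `validate_record` helper itself is hypothetical, not part of the scraper repo):

```python
REQUIRED_FIELDS = ["productId", "name", "price", "availability", "images", "url"]

def validate_record(record: dict) -> list:
    """Return a list of problems that the transformation step must fix."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in (None, "", []):
            problems.append(f"missing field: {field}")
    # GMC wants a string price with a currency code, e.g. "129.95 USD"
    if isinstance(record.get("price"), (int, float)):
        problems.append("price is numeric; GMC wants e.g. '129.95 USD'")
    # GMC availability values use spaces, not underscores
    if "_" in str(record.get("availability", "")):
        problems.append("availability uses underscores; GMC wants 'in stock'")
    return problems

record = {
    "name": "Nike Air Zoom Pegasus 40",
    "productId": "9848135",
    "price": 129.95,
    "availability": "in_stock",
    "images": [{"url": "https://m.media-amazon.com/images/I/71...jpg", "alt_text": ""}],
    "url": "https://www.zappos.com/p/nike-air-zoom...",
}
print(validate_record(record))
```

Running this against the sample record flags the price and availability formats, which is exactly what Step 3 will correct.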

Step 3: Building the Transformation Script

Create a new Python script named build_feed.py in your project root. We’ll use pandas to load the JSONL file and clean the data.

Loading and Mapping

First, load the data and rename the basic columns to match Google’s headers.

import pandas as pd
import json

# 1. Load the JSONL data
# (point this at the timestamped file the scraper produced in Step 1)
data = []
with open('zappos_data.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

# 2. Basic Column Mapping
df = df.rename(columns={
    'productId': 'id',
    'name': 'title',
    'url': 'link'
})

Formatting Challenges

Next, we need to handle price and image_link. Google is very particular about these formats.

# 3. Format Price (Google requires: [Price] [Currency])
df['price'] = df['price'].apply(lambda x: f"{float(x):.2f} USD")

# 4. Extract Image Link
# Zappos provides a list of objects; Google wants a single primary URL string.
def get_main_image(image_list):
    if isinstance(image_list, list) and len(image_list) > 0:
        return image_list[0].get('url', '')
    return ''

df['image_link'] = df['images'].apply(get_main_image)

# 5. Map Availability
# Google expects 'in stock' or 'out of stock' without underscores
availability_map = {
    "in_stock": "in stock",
    "out_of_stock": "out of stock"
}
df['availability'] = df['availability'].map(availability_map).fillna('out of stock')

# 6. Add Static Required Fields
df['condition'] = 'new'

Step 4: Exporting the Final Feed

Finally, select only the columns Google Merchant Center expects and export the file. While Google accepts Tab-Separated Values (TSV) and XML, CSV is the easiest format for smaller feeds.

# Select only the required columns
final_columns = [
    'id', 'title', 'description', 'link', 
    'image_link', 'availability', 'price', 'condition', 'brand'
]

# Backfill any columns the scrape didn't provide (e.g. description, brand)
# so the selection below doesn't raise a KeyError
for col in final_columns:
    if col not in df.columns:
        df[col] = ''

# Ensure we only export what's needed
final_df = df[final_columns]

# Export to CSV
final_df.to_csv('google_merchant_feed.csv', index=False)

print(f"Successfully converted {len(final_df)} products to google_merchant_feed.csv")

When you open google_merchant_feed.csv in Excel or Google Sheets, you will see a formatted table ready for upload.
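Before uploading, it's worth reading the feed back and checking the formats Google is strictest about. A sketch, assuming the column set exported above (the `validate_feed` helper and the accepted availability values list are illustrative):

```python
import pandas as pd

def validate_feed(feed: pd.DataFrame) -> None:
    """Raise AssertionError if the feed violates GMC's strictest formats."""
    # Every price must look like "129.95 USD"
    assert feed['price'].str.fullmatch(r'\d+\.\d{2} USD').all(), "bad price format"
    # Availability must use spaces, not underscores
    accepted = ['in stock', 'out of stock', 'preorder', 'backorder']
    assert feed['availability'].isin(accepted).all(), "bad availability value"
    # Duplicate ids cause item-level disapprovals
    assert feed['id'].is_unique, "duplicate product ids"

# In practice you'd run:
# validate_feed(pd.read_csv('google_merchant_feed.csv', dtype={'id': str}))

# Quick demonstration on an in-memory sample:
sample = pd.DataFrame({
    'id': ['9848135'],
    'price': ['129.95 USD'],
    'availability': ['in stock'],
})
validate_feed(sample)
print("Feed passed basic validation")
```

Catching these issues locally is far faster than waiting for Merchant Center's diagnostics page to flag them after upload.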

Practical Tips and Gotchas

When running this pipeline at scale, keep these factors in mind:

  • Character Limits: Google truncates titles after 150 characters. Zappos titles are usually short, but if you scrape descriptions, make sure they stay under Google's 5,000-character limit.
  • Image Quality: The ScrapeOps Zappos scraper is configured to find high-resolution images. In the extract_data function within the repo, a regex replaces thumbnail dimensions (e.g., _AC_SR1840,1472_) to ensure Google receives high-quality assets.
  • GTINs and MPNs: For "Buy on Google" or highly competitive categories, Google requires GTINs (UPCs). These aren't always visible on the Zappos storefront, but they can sometimes be found in the __INITIAL_STATE__ JSON blob extracted by the scraper.
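For the character limits in particular, a defensive truncation step is cheap insurance. A minimal sketch, assuming the limits cited above (150 characters for titles, 5,000 for descriptions); the `truncate` helper is illustrative:

```python
def truncate(text: str, limit: int) -> str:
    """Cut text at a word boundary at or under the GMC character limit."""
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(' ', 1)[0]

# Apply to the DataFrame before export, e.g.:
# df['title'] = df['title'].apply(lambda t: truncate(t, 150))
# df['description'] = df['description'].apply(lambda d: truncate(d, 5000))

long_title = "Nike Air Zoom Pegasus 40 " * 10  # artificially long example
print(len(truncate(long_title, 150)))
```

Truncating at a word boundary keeps titles readable in the Shopping UI rather than cutting mid-word.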

To Wrap Up

Converting raw scraped data into a platform-specific feed is a vital skill for e-commerce data engineers. By combining the extraction power of the ScrapeOps Zappos scraper with the data manipulation strengths of pandas, you can automate the entire path from a web page to a live product ad.

Key Takeaways:

  • Raw data is just a starting point: Always perform a gap analysis between your scraped output and your target platform's requirements.
  • Cleaning is essential: Use Python to handle currency formatting, image URL extraction, and specific status mapping.
  • Don't reinvent the wheel: Use the production-ready implementations in the Zappos.com-Scrapers repository to save development time.
