Robert N. Gutierrez
From Scrape to Feed: Building a Google Merchant Center CSV from Zappos Data

Scraping product data is often only the first half of a larger business objective. Whether you are building a price comparison engine, a dropshipping store, or a market research tool, raw data is rarely useful in its initial state. If you want to list products on Google Shopping, you face a significant hurdle: Google Merchant Center (GMC) has incredibly strict formatting requirements.

Raw scraped data from Zappos typically arrives as a nested JSON or JSONL file. Google expects a clean CSV or XML feed with specific headers like availability, image_link, and price formatted exactly to their specifications.

This guide bridges that gap. We’ll use a production-ready Zappos scraper to extract data and then build a Python transformation pipeline using pandas to convert that raw data into a valid, GMC-compliant product feed.

Prerequisites & Setup

Ensure you have Python installed on your machine. We will use the playwright library for extraction and pandas for data transformation.

First, clone the Zappos.com-Scrapers repository, which contains the logic needed to handle Zappos' complex page structure.

git clone https://github.com/scraper-bank/Zappos.com-Scrapers.git
cd Zappos.com-Scrapers
pip install pandas playwright playwright-stealth
playwright install chromium

You will also need a ScrapeOps API Key. Zappos employs advanced anti-bot measures that can block standard requests. ScrapeOps provides proxy rotation and request optimization to ensure your scrapers reach the page. You can sign up for a free API key here.

Step 1: Extracting Raw Data from Zappos

We will use the Playwright implementation found in the repository. This script navigates to a product page, handles dynamic content loading, and extracts structured data into a JSONL (JSON Lines) format.

Open python/playwright/product_data/scraper/zappos_scraper_product_data_v1.py, insert your API key into the API_KEY variable, and run the scraper:

cd python/playwright/product_data/scraper
python zappos_scraper_product_data_v1.py

The Raw Output

The scraper produces a zappos_com_product_page_scraper_data_[timestamp].jsonl file. Each line looks something like this:

{
  "name": "Nike Air Zoom Pegasus 40",
  "productId": "9848135",
  "price": 129.95,
  "availability": "in_stock",
  "images": [{"url": "https://m.media-amazon.com/images/I/71...jpg", "alt_text": ""}],
  "url": "https://www.zappos.com/p/nike-air-zoom..."
}

While this is structured, it isn't ready for Google. The price is a float without a currency code, the images are trapped in a list, and the availability format doesn't match Google's requirements.

Step 2: Gap Analysis (JSONL vs. GMC Requirements)

To build a successful feed, we must map our scraped fields to the Google Product Data Specification. Here is how the raw data stacks up against Google's requirements:

| Field | Scraped Data Example | Google Requirement | Transformation |
| --- | --- | --- | --- |
| ID | `9848135` | Alphanumeric ID | Rename to `id` |
| Title | `Nike Air Zoom...` | Max 150 chars | Rename to `title` |
| Price | `129.95` | `129.95 USD` | Format as string + currency |
| Availability | `in_stock` | `in stock` | Replace underscore with space |
| Image Link | List of objects | Single URL string | Extract first URL in list |
| Condition | Missing | `new`, `refurbished`, `used` | Add default value `new` |
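Before writing the full pipeline, it can help to sanity-check a raw record against these requirements. A minimal sketch, assuming the field names from the JSONL example above (the `validate_record` helper itself is hypothetical, not part of the scraper repo):

```python
REQUIRED_FIELDS = ["productId", "name", "price", "availability", "images", "url"]

def validate_record(record: dict) -> list:
    """Return a list of problems that the transformation step must fix."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in (None, "", []):
            problems.append(f"missing field: {field}")
    # GMC wants a string price with a currency code, e.g. "129.95 USD"
    if isinstance(record.get("price"), (int, float)):
        problems.append("price is numeric; GMC wants e.g. '129.95 USD'")
    # GMC availability values use spaces, not underscores
    if "_" in str(record.get("availability", "")):
        problems.append("availability uses underscores; GMC wants 'in stock'")
    return problems

record = {
    "name": "Nike Air Zoom Pegasus 40",
    "productId": "9848135",
    "price": 129.95,
    "availability": "in_stock",
    "images": [{"url": "https://m.media-amazon.com/images/I/71...jpg", "alt_text": ""}],
    "url": "https://www.zappos.com/p/nike-air-zoom...",
}
print(validate_record(record))
```

Running this against the sample record flags the price and availability formats, which is exactly what Step 3 will correct.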

Step 3: Building the Transformation Script

Create a new Python script named build_feed.py in your project root. We’ll use pandas to load the JSONL file and clean the data.

Loading and Mapping

First, load the data and rename the basic columns to match Google’s headers.

import pandas as pd
import json

# 1. Load the JSONL data
# (point this at the timestamped file the scraper produced in Step 1)
data = []
with open('zappos_data.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

# 2. Basic Column Mapping
df = df.rename(columns={
    'productId': 'id',
    'name': 'title',
    'url': 'link'
})

Formatting Challenges

Next, we need to handle price and image_link. Google is very particular about these formats.

# 3. Format Price (Google requires: [Price] [Currency])
df['price'] = df['price'].apply(lambda x: f"{float(x):.2f} USD")

# 4. Extract Image Link
# Zappos provides a list of objects; Google wants a single primary URL string.
def get_main_image(image_list):
    if isinstance(image_list, list) and len(image_list) > 0:
        return image_list[0].get('url', '')
    return ''

df['image_link'] = df['images'].apply(get_main_image)

# 5. Map Availability
# Google expects 'in stock' or 'out of stock' without underscores
availability_map = {
    "in_stock": "in stock",
    "out_of_stock": "out of stock"
}
df['availability'] = df['availability'].map(availability_map).fillna('out of stock')

# 6. Add Static Required Fields
df['condition'] = 'new'

Step 4: Exporting the Final Feed

Finally, select only the columns Google Merchant Center expects and export the file. While Google accepts Tab-Separated Values (TSV) and XML, CSV is the easiest format for smaller feeds.

# Select only the required columns
final_columns = [
    'id', 'title', 'description', 'link', 
    'image_link', 'availability', 'price', 'condition', 'brand'
]

# Backfill any columns the scrape didn't provide (e.g. description, brand)
# so the selection below doesn't raise a KeyError
for col in final_columns:
    if col not in df.columns:
        df[col] = ''

# Ensure we only export what's needed
final_df = df[final_columns]

# Export to CSV
final_df.to_csv('google_merchant_feed.csv', index=False)

print(f"Successfully converted {len(final_df)} products to google_merchant_feed.csv")

When you open google_merchant_feed.csv in Excel or Google Sheets, you will see a formatted table ready for upload.
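Before uploading, it's worth reading the feed back and checking the formats Google is strictest about. A sketch, assuming the column set exported above (the `validate_feed` helper and the accepted availability values list are illustrative):

```python
import pandas as pd

def validate_feed(feed: pd.DataFrame) -> None:
    """Raise AssertionError if the feed violates GMC's strictest formats."""
    # Every price must look like "129.95 USD"
    assert feed['price'].str.fullmatch(r'\d+\.\d{2} USD').all(), "bad price format"
    # Availability must use spaces, not underscores
    accepted = ['in stock', 'out of stock', 'preorder', 'backorder']
    assert feed['availability'].isin(accepted).all(), "bad availability value"
    # Duplicate ids cause item-level disapprovals
    assert feed['id'].is_unique, "duplicate product ids"

# In practice you'd run:
# validate_feed(pd.read_csv('google_merchant_feed.csv', dtype={'id': str}))

# Quick demonstration on an in-memory sample:
sample = pd.DataFrame({
    'id': ['9848135'],
    'price': ['129.95 USD'],
    'availability': ['in stock'],
})
validate_feed(sample)
print("Feed passed basic validation")
```

Catching these issues locally is far faster than waiting for Merchant Center's diagnostics page to flag them after upload.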

Practical Tips and Gotchas

When running this pipeline at scale, keep these factors in mind:

  • Character Limits: Google truncates titles after 150 characters. Zappos titles are usually short, but if you scrape descriptions, make sure they stay under Google's 5,000-character limit.
  • Image Quality: The ScrapeOps Zappos scraper is configured to find high-resolution images. In the extract_data function within the repo, a regex replaces thumbnail dimensions (e.g., _AC_SR1840,1472_) to ensure Google receives high-quality assets.
  • GTINs and MPNs: For "Buy on Google" or highly competitive categories, Google requires GTINs (UPCs). These aren't always visible on the Zappos storefront, but they can sometimes be found in the __INITIAL_STATE__ JSON blob extracted by the scraper.
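For the character limits in particular, a defensive truncation step is cheap insurance. A minimal sketch, assuming the limits cited above (150 characters for titles, 5,000 for descriptions); the `truncate` helper is illustrative:

```python
def truncate(text: str, limit: int) -> str:
    """Cut text at a word boundary at or under the GMC character limit."""
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(' ', 1)[0]

# Apply to the DataFrame before export, e.g.:
# df['title'] = df['title'].apply(lambda t: truncate(t, 150))
# df['description'] = df['description'].apply(lambda d: truncate(d, 5000))

long_title = "Nike Air Zoom Pegasus 40 " * 10  # artificially long example
print(len(truncate(long_title, 150)))
```

Truncating at a word boundary keeps titles readable in the Shopping UI rather than cutting mid-word.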

To Wrap Up

Converting raw scraped data into a platform-specific feed is a vital skill for e-commerce data engineers. By combining the extraction power of the ScrapeOps Zappos scraper with the data manipulation strengths of pandas, you can automate the entire path from a web page to a live product ad.

Key Takeaways:

  • Raw data is just a starting point: Always perform a gap analysis between your scraped output and your target platform's requirements.
  • Cleaning is essential: Use Python to handle currency formatting, image URL extraction, and specific status mapping.
  • Don't reinvent the wheel: Use the production-ready implementations in the Zappos.com-Scrapers repository to save development time.
