Aliaksandr Tsviatkou

Posted on Mar 26

Data Migration Strategies for SAP Commerce Projects: Planning, Execution, and Recovery

#sap #database #java #sapcommerce

Every SAP Commerce project involves data migration — whether you're migrating from a legacy platform, consolidating multiple systems, upgrading from an older Hybris version, or integrating with external data sources. The difference between a smooth go-live and a catastrophic one often comes down to how well the data migration was planned and tested.

This article covers the migration strategies, tooling, common pitfalls, and testing approaches that make the difference between a migration that works in theory and one that works in production.

Planning the Migration

Step 1: Data Inventory

Before writing any migration code, catalog what data needs to migrate:

Data Inventory Worksheet:
┌──────────────────┬──────────┬───────────┬────────────┬──────────────┐
│ Data Entity       │ Source   │ Volume    │ Priority   │ Dependencies │
├──────────────────┼──────────┼───────────┼────────────┼──────────────┤
│ Products          │ PIM      │ 50,000    │ Critical   │ Categories   │
│ Categories        │ PIM      │ 2,000     │ Critical   │ None         │
│ Prices            │ ERP      │ 200,000   │ Critical   │ Products     │
│ Stock Levels      │ ERP      │ 50,000    │ Critical   │ Products,    │
│                   │          │           │            │ Warehouses   │
│ Customers         │ CRM      │ 500,000   │ High       │ None         │
│ Addresses         │ CRM      │ 800,000   │ High       │ Customers    │
│ Order History     │ OMS      │ 2,000,000 │ Medium     │ Customers,   │
│                   │          │           │            │ Products     │
│ CMS Pages         │ CMS      │ 500       │ High       │ Media        │
│ Media/Images      │ CDN/PIM  │ 150,000   │ High       │ None         │
│ Promotions        │ Legacy   │ 200       │ Medium     │ Products,    │
│                   │          │           │            │ Categories   │
│ Classifications   │ PIM      │ 10,000    │ High       │ Products     │
└──────────────────┴──────────┴───────────┴────────────┴──────────────┘

Step 2: Dependency Graph

Data must be loaded in dependency order. Loading products before categories fails because products reference categories.

Load Order (dependency-based):
1. Reference Data     → Countries, Currencies, Languages, Units
2. Warehouses         → Warehouse definitions, PointOfService
3. Categories         → Category hierarchy (parent before child)
4. Classification     → ClassificationSystem, ClassAttributeAssignment
5. Products           → Base products, then variants
6. Prices             → Price rows linked to products
7. Stock              → Stock levels linked to products + warehouses
8. Media              → Product images, category images
9. Customers          → Customer accounts
10. Addresses         → Customer addresses
11. Orders            → Historical orders (optional)
12. CMS Content       → Pages, components, media
13. Promotions        → Promotion rules

Step 3: Mapping Specification

Document how source fields map to SAP Commerce fields:

Product Mapping:
┌──────────────────┬──────────────────────┬──────────────────┐
│ Source Field      │ SAP Commerce Field   │ Transformation   │
├──────────────────┼──────────────────────┼──────────────────┤
│ sku               │ Product.code         │ Direct           │
│ title             │ Product.name[en]     │ Direct           │
│ titel             │ Product.name[de]     │ Direct           │
│ description_html  │ Product.description  │ HTML sanitize    │
│ price_usd         │ PriceRow.price       │ Decimal (2dp)    │
│ weight_lbs        │ Product.weight       │ Convert to kg    │
│ category_path     │ CategoryProductRel   │ Split on '>'     │
│ main_image_url    │ Media.URL            │ Download + store │
│ brand_name        │ Product.manufacturer │ Lookup by name   │
│ active            │ Product.approvalStatus│ true→approved   │
│ created_at        │ Product.creationtime │ ISO date parse   │
└──────────────────┴──────────────────────┴──────────────────┘

Automated Migration Pipeline

For large volumes, build an automated pipeline rather than running ImpEx manually.

Architecture

┌─────────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────────┐
│ Source       │    │ Extract &    │    │ ImpEx         │    │ SAP Commerce │
│ Systems     │───>│ Transform    │───>│ Generator     │───>│ Import       │
│ (DB, API,   │    │ (Python/Java)│    │               │    │ (HAC/API)    │
│  CSV, etc)  │    │              │    │               │    │              │
└─────────────┘    └──────────────┘    └───────────────┘    └──────────────┘
                          │                    │                     │
                          ▼                    ▼                     ▼
                    Validation           Generated            Import Logs
                    Reports              .impex files         & Error Reports

Python ETL Script Example

import csv
import re
from decimal import Decimal

class ProductMigrator:
    """Transforms source product data to ImpEx format."""

    IMPEX_HEADER = """
$catalogVersion = catalogVersion(catalog(id[default='myProductCatalog']),version[default='Staged'])
$supercategories = supercategories(code, $catalogVersion)
$approved = approvalStatus(code)[default='approved']

INSERT_UPDATE Product;code[unique=true];name[lang=en];description[lang=en];$catalogVersion;$supercategories;$approved;unit(code)[default='pieces'];ean;manufacturerName
"""

    def __init__(self, source_file, output_file, batch_size=5000):
        self.source_file = source_file
        self.output_file = output_file
        self.batch_size = batch_size
        self.errors = []
        self.processed = 0
        self.skipped = 0

    def transform_product(self, row):
        """Transform a single source row to ImpEx fields."""
        code = self.sanitize_code(row['sku'])
        if not code:
            self.errors.append(f"Invalid SKU: {row.get('sku', 'EMPTY')}")
            return None

        name = self.sanitize_text(row.get('title', ''))
        if not name:
            self.errors.append(f"Missing name for SKU: {code}")
            return None

        description = self.sanitize_html(row.get('description_html', ''))
        category = self.map_category(row.get('category_path', ''))
        ean = row.get('ean', '')
        manufacturer = self.sanitize_text(row.get('brand_name', ''))

        return f";{code};{name};{description};;{category};;;{ean};{manufacturer}"

    def sanitize_code(self, code):
        """Ensure product code is valid."""
        if not code:
            return None
        # Remove special characters, keep alphanumeric and hyphens
        return re.sub(r'[^a-zA-Z0-9\-_]', '', str(code).strip())

    def sanitize_text(self, text):
        """Clean text for ImpEx (escape semicolons)."""
        if not text:
            return ''
        text = str(text).strip()
        # ImpEx uses semicolons as delimiters — escape them
        text = text.replace(';', '\;')
        # Remove newlines
        text = text.replace('
', ' ').replace('
', '')
        return text

    def sanitize_html(self, html):
        """Clean HTML content for product descriptions."""
        if not html:
            return ''
        # Remove script tags
        html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL)
        # Escape semicolons
        html = html.replace(';', '\;')
        return html.strip()

    def map_category(self, category_path):
        """Map source category path to SAP Commerce category code."""
        # Source: "Electronics > Cameras > DSLR"
        # Target: "dslr" (leaf category code)
        if not category_path:
            return ''
        parts = [p.strip().lower().replace(' ', '-') for p in category_path.split('>')]
        return parts[-1] if parts else ''

    def run(self):
        """Execute the migration."""
        with open(self.source_file, 'r', encoding='utf-8') as infile, \
             open(self.output_file, 'w', encoding='utf-8') as outfile:

            reader = csv.DictReader(infile)
            outfile.write(self.IMPEX_HEADER)

            for row in reader:
                line = self.transform_product(row)
                if line:
                    outfile.write(line + '
')
                    self.processed += 1
                else:
                    self.skipped += 1

                # Write batch separator for large imports
                if self.processed % self.batch_size == 0:
                    outfile.write(f"
# --- Batch {self.processed // self.batch_size} ---
")

        self.write_report()

    def write_report(self):
        """Generate migration report."""
        report = f"""
Migration Report
================
Source: {self.source_file}
Output: {self.output_file}
Processed: {self.processed}
Skipped: {self.skipped}
Errors: {len(self.errors)}

Error Details:
"""
        for error in self.errors[:100]:  # First 100 errors
            report += f"  - {error}
"

        with open(self.output_file.replace('.impex', '_report.txt'), 'w') as f:
            f.write(report)

        print(report)

# Usage
migrator = ProductMigrator('source_products.csv', 'products_import.impex')
migrator.run()

Data Validation

Pre-Import Validation

Validate data before importing to catch issues early:

class DataValidator:
    """Validate migration data before import."""

    def __init__(self):
        self.errors = []
        self.warnings = []

    def validate_products(self, products):
        seen_codes = set()

        for i, product in enumerate(products):
            row = i + 1

            # Required fields
            if not product.get('code'):
                self.errors.append(f"Row {row}: Missing product code")
                continue

            # Duplicates
            if product['code'] in seen_codes:
                self.errors.append(f"Row {row}: Duplicate code '{product['code']}'")
            seen_codes.add(product['code'])

            # Code format
            if not re.match(r'^[A-Za-z0-9\-_]+$', product['code']):
                self.errors.append(
                    f"Row {row}: Invalid code format '{product['code']}' "
                    f"(only alphanumeric, hyphens, underscores)")

            # Name length
            name = product.get('name', '')
            if len(name) > 255:
                self.warnings.append(
                    f"Row {row}: Name exceeds 255 chars, will be truncated")

            # Price validation
            price = product.get('price')
            if price:
                try:
                    p = Decimal(str(price))
                    if p < 0:
                        self.errors.append(f"Row {row}: Negative price {price}")
                    if p > 999999.99:
                        self.warnings.append(f"Row {row}: Very high price {price}")
                except:
                    self.errors.append(f"Row {row}: Invalid price format '{price}'")

            # Category reference
            if not product.get('category'):
                self.warnings.append(f"Row {row}: No category for '{product['code']}'")

        return len(self.errors) == 0

Post-Import Verification

After import, verify data integrity:

# Verification queries via FlexibleSearch

# Check product count
# Expected: 50,000
SELECT COUNT(*) FROM {Product} WHERE {catalogVersion} = (
  SELECT {pk} FROM {CatalogVersion} WHERE {version} = 'Staged'
  AND {catalog} = (SELECT {pk} FROM {Catalog} WHERE {id} = 'myProductCatalog')
)

# Find products without prices
SELECT {p.code} FROM {Product AS p}
WHERE {p.catalogVersion} = ?cv
AND NOT EXISTS (
  SELECT 1 FROM {PriceRow AS pr} WHERE {pr.product} = {p.pk}
)

# Find products without categories
SELECT {p.code} FROM {Product AS p}
WHERE {p.catalogVersion} = ?cv
AND NOT EXISTS (
  SELECT 1 FROM {CategoryProductRelation AS r} WHERE {r.target} = {p.pk}
)

# Find products without images
SELECT {p.code} FROM {Product AS p}
WHERE {p.catalogVersion} = ?cv
AND {p.picture} IS NULL
AND {p.thumbnail} IS NULL

// Programmatic verification
@Component
public class MigrationVerifier {

    @Resource
    private FlexibleSearchService flexibleSearchService;

    public MigrationReport verify(CatalogVersionModel catalogVersion) {
        MigrationReport report = new MigrationReport();

        // Count totals
        report.setProductCount(countItems("Product", catalogVersion));
        report.setCategoryCount(countItems("Category", catalogVersion));
        report.setMediaCount(countItems("Media", catalogVersion));

        // Find orphans
        report.setProductsWithoutPrices(findProductsWithoutPrices(catalogVersion));
        report.setProductsWithoutCategories(findProductsWithoutCategories(catalogVersion));
        report.setProductsWithoutImages(findProductsWithoutImages(catalogVersion));

        // Validate critical fields
        report.setProductsWithEmptyNames(findProductsWithEmptyField("name", catalogVersion));

        return report;
    }

    private long countItems(String typeCode, CatalogVersionModel cv) {
        String query = "SELECT COUNT(*) FROM {" + typeCode + "} WHERE {catalogVersion} = ?cv";
        FlexibleSearchQuery fsq = new FlexibleSearchQuery(query);
        fsq.addQueryParameter("cv", cv);
        fsq.setResultClassList(Collections.singletonList(Long.class));
        return flexibleSearchService.<Long>search(fsq).getResult().get(0);
    }
}

Rollback Strategy

Every migration needs a rollback plan.

Database-Level Rollback

# Pre-migration: Full database backup
mysqldump -h db-host -u admin -p commerce_db > pre_migration_backup.sql

# Or for PostgreSQL
pg_dump -h db-host -U admin commerce_db > pre_migration_backup.sql

# Rollback if migration fails
mysql -h db-host -u admin -p commerce_db < pre_migration_backup.sql

ImpEx-Based Rollback

For targeted rollback, generate reverse ImpEx:

# Remove incorrectly imported products
REMOVE Product;code[unique=true];$catalogVersion
;BAD-PROD-001;
;BAD-PROD-002;
;BAD-PROD-003;

# Or remove by query
"#% impex.initScript(""

import de.hybris.platform.servicelayer.search.FlexibleSearchQuery;

query = new FlexibleSearchQuery(""SELECT {pk} FROM {Product} WHERE {creationtime} > '2024-01-15 10:00:00'"");
results = flexibleSearchService.search(query).getResult();
for (item : results) {
    modelService.remove(item);
}

"");"

Best Practices

Never migrate directly to Online catalog — always import into Staged and synchronize to Online after verification.
Idempotent imports — use INSERT_UPDATE so you can re-run imports safely if they fail partway through.
Validate before importing — catch data issues in the transformation layer, not during import.
Keep source data snapshots — archive the exact source data used for each migration run. You'll need it for debugging.
Document every transformation — when you convert weight_lbs to kilograms or map active=true to approvalStatus=approved, document it. Six months later, someone will ask why a value looks wrong.
Plan for delta migrations — there's always a gap between the initial load and go-live. Plan how changes made in the source system during this window will be captured.
Test with production-scale data — migration that works with 1,000 products may fail with 500,000 due to memory, timeouts, or performance.
Have a rollback plan — and test it. A backup you've never restored is not a backup.

DEV Community