Aliaksandr Tsviatkou

Posted on Mar 26

Troubleshooting SAP Commerce in Production: A Practitioner's Guide

#sap #devops #java #sapcommerce

Production issues in SAP Commerce don't announce themselves politely. They arrive as vague alerts, customer complaints, or a sudden spike in error rates at the worst possible time. The difference between a 15-minute resolution and a 4-hour outage comes down to how quickly you can identify the root cause, and that requires knowing where to look and what tools to use.

This article is a field guide for diagnosing and resolving the most common production issues in SAP Commerce Cloud: memory problems, slow queries, thread deadlocks, cache issues, CronJob failures, and deployment errors.

Problem 1: OutOfMemoryError

Symptoms

Application pods restarting repeatedly
java.lang.OutOfMemoryError: Java heap space in logs
Increasing response times before the crash

Immediate Response

# 1. Check memory metrics in the Cloud Portal
#    Navigate to: Environments → [env] → Monitoring

# 2. If you have access, trigger a heap dump before restart
jmap -dump:format=b,file=/tmp/heapdump.hprof <pid>

# 3. Check recent deployments or configuration changes
#    Was anything deployed in the last 24 hours?

Common Causes

Large ImpEx imports consuming memory:

# Check for running imports in HAC → ImpEx → Import
# Large imports without batching can consume gigabytes

# Fix: Split large imports into batches
# Before: Single 2GB import file
# After: Multiple files, 50,000 lines each
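
Splitting an ImpEx file is not just cutting it every N lines: each chunk needs the header repeated, or the data rows lose their column mapping. The following is a minimal sketch of that idea, not a production tool. The class name `ImpexSplitter` is hypothetical, and it assumes a single header block at the top of the file (macros followed by one INSERT/UPDATE/INSERT_UPDATE/REMOVE line); files with several header blocks need a more careful parser.

```java
import java.util.ArrayList;
import java.util.List;

public class ImpexSplitter {

    /**
     * Splits ImpEx lines into chunks of at most maxDataLines data rows,
     * repeating the leading header block at the top of every chunk.
     * Assumption: everything up to and including the first header line
     * (INSERT, UPDATE, INSERT_UPDATE, REMOVE) is the header block.
     */
    public static List<List<String>> split(List<String> lines, int maxDataLines) {
        if (maxDataLines <= 0) {
            throw new IllegalArgumentException("maxDataLines must be positive");
        }
        int headerEnd = 0;
        for (int i = 0; i < lines.size(); i++) {
            if (isHeader(lines.get(i))) {
                headerEnd = i + 1;
                break;
            }
        }
        List<String> header = lines.subList(0, headerEnd);
        List<List<String>> chunks = new ArrayList<>();
        for (int start = headerEnd; start < lines.size(); start += maxDataLines) {
            int end = Math.min(start + maxDataLines, lines.size());
            List<String> chunk = new ArrayList<>(header);
            chunk.addAll(lines.subList(start, end));
            chunks.add(chunk);
        }
        return chunks;
    }

    // INSERT_UPDATE is covered by the INSERT prefix check.
    private static boolean isHeader(String line) {
        String upper = line.trim().toUpperCase();
        return upper.startsWith("INSERT") || upper.startsWith("UPDATE") || upper.startsWith("REMOVE");
    }
}
```

With `maxDataLines` set to 50,000, each resulting chunk can be written to its own file and imported separately.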

Catalog synchronization on large catalogs:

# Catalog sync loads products into memory
# For catalogs with 500k+ products, increase memory or optimize sync

# Limit the number of sync workers and reduce the batch size
catalog.sync.workers=4
synchronization.itemcopycreator.batchSize=100

Unbounded FlexibleSearch results:

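// PROBLEM: Loading all products into memory
String query = "SELECT {pk} FROM {Product}";
List<ProductModel> allProducts = flexibleSearchService.<ProductModel>search(query).getResult();
// With 500,000 products, this consumes massive heap

// FIX: Always paginate, and order the results so pages are stable
final int pageSize = 100;
int start = 0;
List<ProductModel> page;
do {
    FlexibleSearchQuery fsq = new FlexibleSearchQuery(query + " ORDER BY {pk}");
    fsq.setStart(start);
    fsq.setCount(pageSize);
    page = flexibleSearchService.<ProductModel>search(fsq).getResult();
    // ... process this page before fetching the next one ...
    start += pageSize;
} while (page.size() == pageSize);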

Session data accumulation:

# Sessions storing too much data (cart calculations, comparison lists)
# Check session sizes in Dynatrace or via JMX

# Reduce the session timeout (value in seconds; this setting is not limited to anonymous users)
default.session.timeout=600

# Store large session data externally (Redis) rather than in-memory

Analyzing Heap Dumps

# Download heap dump from pod
kubectl cp commerce-pod:/tmp/heapdump.hprof ./heapdump.hprof

# Open with Eclipse MAT (Memory Analyzer Tool)
# Key reports:
# 1. Leak Suspects Report — shows objects retaining the most memory
# 2. Dominator Tree — shows largest objects by retained size
# 3. Top Consumers — aggregate view by class

# Common findings:
# - Large HashMap instances (session data)
# - List<ProductModel> with millions of entries (unbounded queries)
# - byte[] arrays (media processing without streaming)

Problem 3: Thread Deadlocks and Pool Exhaustion

Symptoms

Requests hanging indefinitely
Thread pool metrics showing 100% utilization
RejectedExecutionException in logs

Taking Thread Dumps

# Via jstack (if you have pod access)
jstack <pid> > thread_dump_$(date +%s).txt

# Take 3 dumps, 10 seconds apart, to see thread progression
for i in 1 2 3; do
  jstack <pid> > thread_dump_${i}.txt
  sleep 10
done

# Via HAC → Monitoring → Thread Dump
# Downloads a formatted thread dump
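
Comparing three dumps by eye is slow. A quick per-state count per dump often shows the trend (for example, BLOCKED threads growing from dump to dump). The sketch below assumes jstack-format dumps, where each thread has a `java.lang.Thread.State:` line; the class name `ThreadStateCounter` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadStateCounter {

    // Matches lines such as "java.lang.Thread.State: BLOCKED (on object monitor)"
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts how many threads in the dump text are in each state. */
    public static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = STATE.matcher(dump);
        while (matcher.find()) {
            counts.merge(matcher.group(1), 1, Integer::sum);
        }
        return counts;
    }

    // Usage: java ThreadStateCounter.java thread_dump_1.txt thread_dump_2.txt thread_dump_3.txt
    public static void main(String[] args) throws IOException {
        for (String file : args) {
            String dump = Files.readString(Path.of(file));
            System.out.println(file + " -> " + countStates(dump));
        }
    }
}
```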

Reading Thread Dumps

# Look for BLOCKED threads
"http-nio-9002-exec-42" #142 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
    at de.hybris.platform.persistence.GenericBMPBean.ejbLoad(GenericBMPBean.java:456)
    - waiting to lock <0x00000007f8e45678> (a java.lang.Object)

# jstack does not name the lock owner on the blocked thread.
# Search the same dump for "- locked <0x00000007f8e45678>" to find it;
# in this example the owner is "http-nio-9002-exec-17".
# Check what exec-17 is doing: it is likely stuck in a long database operation

# Look for deadlocks (two threads each waiting for the other's lock)
"Found one Java-level deadlock:"
  Thread 1 waiting for lock held by Thread 2
  Thread 2 waiting for lock held by Thread 1
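
When jstack is not available, the same deadlock check can be run from inside the JVM with the standard `ThreadMXBean`. This is a minimal sketch using only JDK APIs; the class name `DeadlockProbe` is hypothetical, and how you run it in your environment (for example from a scripting console) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockProbe {

    /** Returns one description per deadlocked thread, or an empty array if there is no deadlock. */
    public static String[] describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        if (ids == null) {
            return new String[0];
        }
        ThreadInfo[] infos = threads.getThreadInfo(ids);
        String[] descriptions = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            ThreadInfo info = infos[i];
            // An entry is null if the thread ended between the two calls
            descriptions[i] = (info == null)
                    ? "(thread no longer alive)"
                    : info.getThreadName() + " waits for " + info.getLockName()
                            + " held by " + info.getLockOwnerName();
        }
        return descriptions;
    }

    public static void main(String[] args) {
        String[] deadlocks = describeDeadlocks();
        if (deadlocks.length == 0) {
            System.out.println("No Java-level deadlock detected");
        }
        for (String line : deadlocks) {
            System.out.println(line);
        }
    }
}
```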

Common Thread Issues

Database connection pool exhaustion:

# Symptom: Threads waiting for database connections
# Check: All DB connections are in use, new requests queue up

# Increase pool size (check the current effective value first; defaults vary by version and environment)
db.pool.maxActive=50
db.pool.maxIdle=20
db.pool.minIdle=5

# Validate connections on borrow so stale or broken ones are discarded
# (this does not recover leaked connections; those need a code fix)
db.pool.testOnBorrow=true
db.pool.validationQuery=SELECT 1

# Set max wait time to fail fast rather than queue indefinitely
db.pool.maxWait=5000
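
To see why a maximum wait matters, it helps to model the pool as a fixed number of permits. The sketch below is a toy model built on `java.util.concurrent.Semaphore`, not how the Commerce connection pool is implemented; `BoundedPool`, `borrow`, and `giveBack` are hypothetical names. A caller that cannot get a permit within the wait time gets a fast failure instead of hanging.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedPool {

    private final Semaphore permits;
    private final long maxWaitMillis;

    /** maxActive plays the role of the pool size, maxWaitMillis the role of the max wait. */
    public BoundedPool(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Tries to take a slot; returns false if none frees up within maxWaitMillis. */
    public boolean borrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a slot; forgetting this call is the toy equivalent of a connection leak. */
    public void giveBack() {
        permits.release();
    }
}
```

With a pool of one slot and a 50 ms wait, a second `borrow()` returns `false` after about 50 ms rather than blocking the request thread indefinitely.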

Long-running transactions blocking other threads:

// PROBLEM: Transaction held open during external API call
@Transactional
public void processOrder(OrderModel order) {
    orderService.updateStatus(order, OrderStatus.PROCESSING);

    // This external call takes 30 seconds during peak load
    // The transaction stays open, holding database locks
    paymentGateway.charge(order.getPaymentInfo());

    orderService.updateStatus(order, OrderStatus.PAID);
}

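// FIX: Keep the external call outside any transaction.
// @Transactional on a private method, or on a method called from within the
// same class, is ignored by Spring's proxy-based AOP, so the boundaries are
// set explicitly with an injected TransactionTemplate instead.
public void processOrder(OrderModel order) {
    updateOrderStatus(order, OrderStatus.PROCESSING);

    // No transaction is open here, so no database locks are held during the call
    PaymentResult result = paymentGateway.charge(order.getPaymentInfo());

    if (result.isSuccess()) {
        updateOrderStatus(order, OrderStatus.PAID);
    }
    // On failure the order stays in PROCESSING; handle that case explicitly
}

private void updateOrderStatus(OrderModel order, OrderStatus status) {
    transactionTemplate.execute(tx -> {
        orderService.updateStatus(order, status);
        return null;
    });
}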

Problem 5: Deployment Failures on CCv2

Symptoms

Build succeeds but deployment fails
Pod crashes immediately after startup
Health checks failing

Debugging Deployment Failures

CCv2 Cloud Portal → Builds:

1. Check build logs for compilation errors
2. Check deployment logs for startup errors
3. Check pod logs for runtime exceptions

Common failure patterns:
┌──────────────────────────┬──────────────────────────────────────┐
│ Error                    │ Likely cause or fix                  │
├──────────────────────────┼──────────────────────────────────────┤
│ Build timeout            │ Too many extensions, slow network    │
│ OOM during build         │ Increase build memory in manifest    │
│ Type system update fail  │ Incompatible items.xml changes       │
│ Bean definition error    │ Missing Spring dependency            │
│ Health check timeout     │ Slow startup, increase timeout       │
│ License validation fail  │ License expired or misconfigured     │
└──────────────────────────┴──────────────────────────────────────┘

Manifest Configuration Issues

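// manifest.json: common misconfigurations
// (the // annotations are for illustration only; JSON does not allow
// comments, so remove them from a real manifest)

{
  "commerceSuiteVersion": "2211.25",
  "extensions": [
    // PROBLEM: Extension listed but not in repository
    "myextension"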
    // PROBLEM: Dependency extension missing
    // myextension requires basecommerce but it's not listed
  ],
  "aspects": [
    {
      "name": "backoffice",
      "properties": [
        {
          "key": "db.pool.maxActive",
          "value": "50"
          // PROBLEM: Value too high for pod memory allocation
          // 50 connections × ~5MB each = 250MB just for DB connections
        }
      ]
    }
  ]
}

Rolling Back a Failed Deployment

CCv2 Cloud Portal → Deployments:

1. Identify the last working build number
2. Create new deployment using the previous build
3. Deploy to the affected environment

Important: If the failed deployment included database changes
(new types, removed attributes), the rollback build may need
to handle the modified schema. Test rollback scenarios before
production deployments.

Problem 7: Cluster Synchronization Issues

Symptoms

Changes made on one node not visible on others
Inconsistent behavior between requests (load balancer routing to different nodes)
Stale cache data despite modifications

Diagnosis

# Check cluster node status in HAC → Platform → Cluster

# Verify cluster communication
# All nodes should show as "ALIVE" with recent heartbeat

# If nodes show as "STALE" or "DEAD":
# 1. Check network connectivity between pods
# 2. Verify JGroups configuration
# 3. Check if cluster broadcast is enabled

cluster.broadcast.methods=jgroups
cluster.broadcast.method.jgroups.channel.name=hybris-broadcast

Cache Invalidation Verification

// Test: modify an item and check if other nodes see the change
// On Node 1:
ProductModel product = productService.getProductForCode("TEST-001");
product.setName("Updated Name");
modelService.save(product);

// On Node 2 (via different request):
ProductModel product = productService.getProductForCode("TEST-001");
// If name is still old → cache invalidation is not working
// If name is updated → cluster sync is working

Best Practices

Monitor proactively — set alerts for memory usage >80%, cache hit rates <70%, error rates above baseline, and CronJob failures.
Keep thread dumps and heap dumps accessible — configure the JVM to dump on OOM (-XX:+HeapDumpOnOutOfMemoryError) and know how to take thread dumps quickly.
Maintain runbooks — document every production issue and its resolution. The next incident at 3 AM shouldn't require the same diagnosis from scratch.
Test at production scale — issues that appear with 500,000 products and 10,000 concurrent users don't show up in dev environments.
Deploy during low-traffic windows — CCv2 supports rolling deployments, but having fewer users during deployment reduces the blast radius of issues.
Know the rollback process — practice rolling back deployments before you need to do it under pressure.
Separate background processing from customer-facing traffic — CronJobs, catalog sync, and indexing should run on dedicated nodes, not on nodes serving API requests.
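
The 80% memory threshold from the first item above can also be checked from code running inside the JVM, which is useful when the monitoring dashboard is not at hand. This is a minimal sketch using the standard `MemoryMXBean`; the class name `HeapCheck` is hypothetical, and the check is not a substitute for proper alerting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapCheck {

    /** Returns used heap as a fraction of the maximum, or -1 if no maximum is defined. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() returns -1 when the maximum is undefined
        }
        return (double) usedBytes / (double) maxBytes;
    }

    /** Reads the current heap usage of this JVM. */
    public static double currentHeapFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return usedFraction(heap.getUsed(), heap.getMax());
    }

    public static void main(String[] args) {
        double fraction = currentHeapFraction();
        System.out.printf("Heap used: %.1f%%%n", fraction * 100);
        if (fraction > 0.80) {
            System.out.println("WARNING: heap usage above 80%");
        }
    }
}
```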
