DEV Community

Aliaksandr Tsviatkou

Troubleshooting SAP Commerce in Production: A Practitioner's Guide

Production issues in SAP Commerce don't announce themselves politely. They arrive as vague alerts, customer complaints, or a sudden spike in error rates at the worst possible time. The difference between a 15-minute resolution and a 4-hour outage comes down to how quickly you can identify the root cause, and that requires knowing where to look and what tools to use.

This article is a field guide for diagnosing and resolving the most common production issues in SAP Commerce Cloud: memory problems, slow queries, thread deadlocks, cache issues, CronJob failures, and deployment errors.

Problem 1: OutOfMemoryError

Symptoms

  • Application pods restarting repeatedly
  • java.lang.OutOfMemoryError: Java heap space in logs
  • Increasing response times before the crash

Immediate Response

# 1. Check memory usage in the Cloud Portal
#    Navigate to: Environments → [env] → Monitoring

# 2. If you have access, trigger a heap dump before restart
jmap -dump:format=b,file=/tmp/heapdump.hprof <pid>

# 3. Check recent deployments or configuration changes
#    Was anything deployed in the last 24 hours?
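
When shell access for jmap is not available, a HotSpot JVM can also write the dump from inside the process. The following is a minimal sketch rather than anything shipped with SAP Commerce: HeapDumper and dump are hypothetical names, and it assumes a HotSpot-based JDK that exposes com.sun.management.HotSpotDiagnosticMXBean. How you invoke it (for example from a scripting console) depends on your setup.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;

// Hypothetical helper: writes a heap dump of the current JVM to a file.
final class HeapDumper {

    // target should end in .hprof; liveOnly=true dumps only reachable objects.
    static File dump(File target, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // dumpHeap will not overwrite an existing file, so clear it first.
            Files.deleteIfExists(target.toPath());
            bean.dumpHeap(target.getAbsolutePath(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

A call such as HeapDumper.dump(new File("/tmp/heapdump.hprof"), true) produces a file that opens in the same analysis tools as a jmap dump. Dumping only live objects keeps the file smaller, but it typically forces a full GC first, so expect a pause on a large heap.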

Common Causes

Large ImpEx imports consuming memory:

# Check for running imports in HAC → ImpEx → Import
# Large imports without batching can consume gigabytes

# Fix: Split large imports into batches
# Before: Single 2GB import file
# After: Multiple files, 50,000 lines each
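
The batching step can be sketched in plain Java. This is an illustration under simplifying assumptions: ImpExSplitter is a hypothetical helper, the file contains a single header block (macros plus one header line such as INSERT_UPDATE), and every data row starts with a semicolon. Files with several header blocks need a smarter splitter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one ImpEx script into smaller scripts.
final class ImpExSplitter {

    // Returns chunks of at most maxRows data rows, each prefixed with the preamble.
    static List<List<String>> split(List<String> lines, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        List<String> preamble = new ArrayList<>(); // macros and header line
        List<String> rows = new ArrayList<>();     // data rows (assumed to start with ';')
        for (String line : lines) {
            if (line.startsWith(";")) {
                rows.add(line);
            } else if (!line.trim().isEmpty()) {
                preamble.add(line);
            }
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += maxRows) {
            int to = Math.min(from + maxRows, rows.size());
            List<String> chunk = new ArrayList<>(preamble);
            chunk.addAll(rows.subList(from, to));
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

Calling split(lines, 50000) yields one list of lines per output file. Repeating the preamble in every chunk keeps each file importable on its own.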

Catalog synchronization on large catalogs:

# Catalog sync loads products into memory
# For catalogs with 500k+ products, increase memory or optimize sync

# Limit parallel sync workers and reduce the sync batch size
catalog.sync.workers=4
synchronization.itemcopycreator.batchSize=100

Unbounded FlexibleSearch results:

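// PROBLEM: Loading all products into memory
String query = "SELECT {pk} FROM {Product}";
List<ProductModel> allProducts = flexibleSearchService.<ProductModel>search(query).getResult();
// With 500,000 products, this consumes massive heap

// FIX: Always paginate (ORDER BY keeps page boundaries stable between calls)
FlexibleSearchQuery fsq = new FlexibleSearchQuery("SELECT {pk} FROM {Product} ORDER BY {pk}");
fsq.setStart(0);    // Offset of the first row in this page
fsq.setCount(100);  // Page size; advance start by 100 until a page comes back short
List<ProductModel> page = flexibleSearchService.<ProductModel>search(fsq).getResult();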
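
The paging idea can be expressed as a small loop that keeps requesting pages until one comes back short. The sketch below is deliberately independent of the platform API: PagedReader and forEachPage are hypothetical names, and fetchPage stands in for whatever runs the query with a given start and count (for example a FlexibleSearchQuery configured through setStart and setCount).

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Hypothetical helper: walks a result set page by page.
final class PagedReader {

    // Calls fetchPage(start, pageSize) until a page is shorter than pageSize.
    // Each non-empty page is passed to handler. Returns the number of items seen.
    static <T> int forEachPage(BiFunction<Integer, Integer, List<T>> fetchPage,
                               int pageSize,
                               Consumer<List<T>> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int start = 0;
        while (true) {
            List<T> page = fetchPage.apply(start, pageSize);
            if (!page.isEmpty()) {
                handler.accept(page);
            }
            start += page.size();
            if (page.size() < pageSize) {
                return start; // short or empty page: nothing left to read
            }
        }
    }
}
```

Only one page is referenced at a time, so heap usage is bounded by the page size rather than by the size of the catalog, provided the handler does not keep references to earlier pages.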

Session data accumulation:

# Sessions storing too much data (cart calculations, comparison lists)
# Check session sizes in Dynatrace or via JMX

# Reduce the default session timeout (in seconds) so idle, mostly anonymous, sessions expire sooner
default.session.timeout=600

# Store large session data externally (Redis) rather than in-memory
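
If no monitoring tool reports session sizes, a rough estimate can be made by serializing each session attribute and ranking the results. This is a sketch under stated assumptions: SessionSizeEstimator is a hypothetical helper, the attributes are Serializable, and serialized size is only a proxy for heap footprint, not an exact measurement.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks session attributes by serialized size.
final class SessionSizeEstimator {

    // Number of bytes produced by Java serialization for the given value.
    static long serializedSize(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Attribute name to size in bytes, largest first.
    static Map<String, Long> largestFirst(Map<String, ? extends Serializable> attributes) {
        List<Map.Entry<String, Long>> sized = new ArrayList<>();
        for (Map.Entry<String, ? extends Serializable> e : attributes.entrySet()) {
            Long size = Long.valueOf(serializedSize(e.getValue()));
            sized.add(new AbstractMap.SimpleEntry<String, Long>(e.getKey(), size));
        }
        sized.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Map<String, Long> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sized) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }
}
```

The first few entries of the ranked map are usually enough to tell whether one attribute (a cached cart calculation, a comparison list) dominates the session.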

Analyzing Heap Dumps

# Download heap dump from pod
kubectl cp commerce-pod:/tmp/heapdump.hprof ./heapdump.hprof

# Open with Eclipse MAT (Memory Analyzer Tool)
# Key reports:
# 1. Leak Suspects Report — shows objects retaining the most memory
# 2. Dominator Tree — shows largest objects by retained size
# 3. Top Consumers — aggregate view by class

# Common findings:
# - Large HashMap instances (session data)
# - List<ProductModel> with millions of entries (unbounded queries)
# - byte[] arrays (media processing without streaming)

Problem 3: Thread Deadlocks and Pool Exhaustion

Symptoms

  • Requests hanging indefinitely
  • Thread pool metrics showing 100% utilization
  • RejectedExecutionException in logs

Taking Thread Dumps

# Via jstack (if you have pod access)
jstack <pid> > thread_dump_$(date +%s).txt

# Take 3 dumps, 10 seconds apart, to see thread progression
for i in 1 2 3; do
  jstack <pid> > thread_dump_${i}.txt
  sleep 10
done

# Via HAC → Monitoring → Thread Dump
# Downloads a formatted thread dump
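
Comparing several dumps by hand is tedious, so a small parser can list the threads that stay in the same state across all of them. The sketch below assumes the usual jstack layout, where a quoted thread name is followed by a java.lang.Thread.State line. ThreadDumpDiff, states, and stuckIn are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds threads that never leave a given state.
final class ThreadDumpDiff {
    private static final Pattern NAME = Pattern.compile("^\"([^\"]+)\"");
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    // Maps thread name to state (RUNNABLE, BLOCKED, ...) for one dump.
    static Map<String, String> states(String dump) {
        Map<String, String> result = new LinkedHashMap<>();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher name = NAME.matcher(line);
            if (name.find()) {
                current = name.group(1);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (current != null && state.find()) {
                result.put(current, state.group(1));
                current = null;
            }
        }
        return result;
    }

    // Names of threads that are in the given state in every dump.
    static Set<String> stuckIn(List<String> dumps, String state) {
        Set<String> stuck = null;
        for (String dump : dumps) {
            Set<String> inState = new TreeSet<>();
            for (Map.Entry<String, String> e : states(dump).entrySet()) {
                if (e.getValue().equals(state)) {
                    inState.add(e.getKey());
                }
            }
            if (stuck == null) {
                stuck = inState;
            } else {
                stuck.retainAll(inState);
            }
        }
        return stuck == null ? new TreeSet<String>() : stuck;
    }
}
```

Threads returned by stuckIn(dumps, "BLOCKED") for all three dumps are the ones worth reading in full. A thread that is blocked in only one dump is usually just momentary contention.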

Reading Thread Dumps

# Look for BLOCKED threads
"http-nio-9002-exec-42" #142 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
    at de.hybris.platform.persistence.GenericBMPBean.ejbLoad(GenericBMPBean.java:456)
    - waiting to lock <0x00000007f8e45678> (a java.lang.Object)
    - locked by "http-nio-9002-exec-17" #117

# This thread is blocked waiting for a lock held by thread exec-17
# Check what exec-17 is doing; it is likely stuck in a long database operation

# Look for deadlocks (two threads each waiting for the other's lock)
"Found one Java-level deadlock:"
  Thread 1 waiting for lock held by Thread 2
  Thread 2 waiting for lock held by Thread 1
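
The JVM can also report monitor deadlocks programmatically through the standard ThreadMXBean, which is handy for a health check or a periodic log line. A minimal sketch, assuming the JVM supports synchronizer monitoring (HotSpot does). DeadlockProbe and describeDeadlocks are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads that are part of a deadlock cycle.
final class DeadlockProbe {

    // One line per deadlocked thread; an empty list means no deadlock was found.
    static List<String> describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when there is no deadlock
        List<String> lines = new ArrayList<>();
        if (ids == null) {
            return lines;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have terminated in the meantime
                lines.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }
}
```

This only detects true lock cycles. Threads that are merely slow, or waiting on a database row lock, will not appear here and still need a thread dump to diagnose.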

Common Thread Issues

Database connection pool exhaustion:

# Symptom: Threads waiting for database connections
# Check: All DB connections are in use, new requests queue up

# Increase pool size (default is often 10-20)
db.pool.maxActive=50
db.pool.maxIdle=20
db.pool.minIdle=5

# Validate connections on borrow so stale or broken connections are discarded
db.pool.testOnBorrow=true
db.pool.validationQuery=SELECT 1

# Set max wait time (milliseconds) to fail fast rather than queue indefinitely
db.pool.maxWait=5000
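
To make the effect of a maximum wait concrete, here is a toy pool built on a Semaphore. It is not how the platform's connection pool is implemented. BoundedPool, tryBorrow, and giveBack are hypothetical names used only to illustrate the fail-fast behavior that a setting like db.pool.maxWait aims for.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of a bounded pool with a maximum wait time.
final class BoundedPool {
    private final Semaphore permits;
    private final long maxWaitMillis;

    BoundedPool(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive, true); // fair: longest waiter first
        this.maxWaitMillis = maxWaitMillis;
    }

    // Returns true if a slot was obtained within maxWaitMillis, false otherwise.
    boolean tryBorrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    // Must be called exactly once for every successful tryBorrow.
    void giveBack() {
        permits.release();
    }
}
```

With a pool of size 2, a third tryBorrow returns false after the wait expires instead of hanging. A caller that fails fast can return an error quickly and free its request thread, which is what prevents one slow dependency from exhausting the whole thread pool.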

Long-running transactions blocking other threads:

// PROBLEM: Transaction held open during external API call
@Transactional
public void processOrder(OrderModel order) {
    orderService.updateStatus(order, OrderStatus.PROCESSING);

    // This external call takes 30 seconds during peak load
    // The transaction stays open, holding database locks
    paymentGateway.charge(order.getPaymentInfo());

    orderService.updateStatus(order, OrderStatus.PAID);
}

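// FIX: Separate the external call from the transaction
public void processOrder(OrderModel order) {
    // Each status update commits in its own short transaction
    orderStatusUpdater.updateOrderStatus(order, OrderStatus.PROCESSING);

    // No transaction is open while waiting on the payment gateway
    PaymentResult result = paymentGateway.charge(order.getPaymentInfo());

    if (result.isSuccess()) {
        orderStatusUpdater.updateOrderStatus(order, OrderStatus.PAID);
    }
}

// In a separate Spring bean (here a hypothetical OrderStatusUpdater).
// @Transactional is applied through a proxy, so it has no effect on private
// methods or on calls made from within the same class.
@Transactional
public void updateOrderStatus(OrderModel order, OrderStatus status) {
    orderService.updateStatus(order, status);
}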

Problem 5: Deployment Failures on CCv2

Symptoms

  • Build succeeds but deployment fails
  • Pod crashes immediately after startup
  • Health checks failing

Debugging Deployment Failures

CCv2 Cloud Portal → Builds:

1. Check build logs for compilation errors
2. Check deployment logs for startup errors
3. Check pod logs for runtime exceptions

Common failure patterns:
┌──────────────────────────┬──────────────────────────────────────┐
│ Error                    │ Cause                                │
├──────────────────────────┼──────────────────────────────────────┤
│ Build timeout            │ Too many extensions, slow network    │
│ OOM during build         │ Increase build memory in manifest    │
│ Type system update fail  │ Incompatible items.xml changes       │
│ Bean definition error    │ Missing Spring dependency            │
│ Health check timeout     │ Slow startup, increase timeout       │
│ License validation fail  │ License expired or misconfigured     │
└──────────────────────────┴──────────────────────────────────────┘

Manifest Configuration Issues

// manifest.json: common misconfigurations
// (the comments are annotations for this article; JSON itself does not allow comments)

{
  "commerceSuiteVersion": "2211.25",
  "extensions": [
    // PROBLEM: Extension listed but not in repository
    "myextension",
    // PROBLEM: Dependency extension missing
    // myextension requires basecommerce but it's not listed
  ],
  "aspects": [
    {
      "name": "backoffice",
      "properties": [
        {
          "key": "db.pool.maxActive",
          "value": "50"
          // PROBLEM: Value too high for pod memory allocation
          // 50 connections × ~5MB each = 250MB just for DB connections
        }
      ]
    }
  ]
}

Rolling Back a Failed Deployment

CCv2 Cloud Portal → Deployments:

1. Identify the last working build number
2. Create new deployment using the previous build
3. Deploy to the affected environment

Important: If the failed deployment included database changes
(new types, removed attributes), the rollback build may need
to handle the modified schema. Test rollback scenarios before
production deployments.

Problem 7: Cluster Synchronization Issues

Symptoms

  • Changes made on one node not visible on others
  • Inconsistent behavior between requests (load balancer routing to different nodes)
  • Stale cache data despite modifications

Diagnosis

# Check cluster node status in HAC → Platform → Cluster

# Verify cluster communication
# All nodes should show as "ALIVE" with recent heartbeat

# If nodes show as "STALE" or "DEAD":
# 1. Check network connectivity between pods
# 2. Verify JGroups configuration
# 3. Check if cluster broadcast is enabled

cluster.broadcast.methods=jgroups
cluster.broadcast.method.jgroups.channel.name=hybris-broadcast

Cache Invalidation Verification

// Test: modify an item and check if other nodes see the change
// On Node 1:
ProductModel product = productService.getProductForCode("TEST-001");
product.setName("Updated Name");
modelService.save(product);

// On Node 2 (via a different request):
ProductModel productOnNode2 = productService.getProductForCode("TEST-001");
String nameOnNode2 = productOnNode2.getName();
// If nameOnNode2 is still the old name → cache invalidation is not working
// If nameOnNode2 is "Updated Name" → cluster sync is working

Best Practices

  1. Monitor proactively — set alerts for memory usage >80%, cache hit rates <70%, error rates above baseline, and CronJob failures.

  2. Keep thread dumps and heap dumps accessible — configure the JVM to dump on OOM (-XX:+HeapDumpOnOutOfMemoryError) and know how to take thread dumps quickly.

  3. Maintain runbooks — document every production issue and its resolution. The next incident at 3 AM shouldn't require the same diagnosis from scratch.

  4. Test at production scale — issues that appear with 500,000 products and 10,000 concurrent users don't show up in dev environments.

  5. Deploy during low-traffic windows — CCv2 supports rolling deployments, but having fewer users during deployment reduces the blast radius of issues.

  6. Know the rollback process — practice rolling back deployments before you need to do it under pressure.

  7. Separate background processing from customer-facing traffic — CronJobs, catalog sync, and indexing should run on dedicated nodes, not on nodes serving API requests.
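
For the memory threshold in the first practice, the JVM's own MemoryMXBean gives a quick in-process reading of heap usage. The sketch below is one possible building block for a custom health check, not a replacement for proper monitoring. HeapWatch, usedFraction, and isAbove are hypothetical names, and the 0.8 threshold simply mirrors the 80% figure above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports how full the heap currently is.
final class HeapWatch {

    // Used heap divided by maximum heap, in the range 0.0 to 1.0.
    static double usedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        if (max <= 0) {
            max = heap.getCommitted(); // fall back to currently committed memory
        }
        return (double) heap.getUsed() / max;
    }

    // True when heap usage exceeds the given fraction, e.g. 0.8 for 80%.
    static boolean isAbove(double threshold) {
        return usedFraction() > threshold;
    }
}
```

A single reading above the threshold is not alarming on its own, because usage rises until the next garbage collection. Alert on readings that stay high across several consecutive checks.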
