Production issues in SAP Commerce don't announce themselves politely. They arrive as vague alerts, customer complaints, or a sudden spike in error rates at the worst possible time. The difference between a 15-minute resolution and a 4-hour outage comes down to how quickly you can identify the root cause, and that requires knowing where to look and what tools to use.
This article is a field guide for diagnosing and resolving the most common production issues in SAP Commerce Cloud: memory problems, slow queries, thread deadlocks, cache issues, CronJob failures, and deployment errors.
Problem 1: OutOfMemoryError
Symptoms
- Application pods restarting repeatedly
- java.lang.OutOfMemoryError: Java heap space in logs
- Increasing response times before the crash
Immediate Response
# 1. Check memory metrics in the Cloud Portal
# Navigate to: Environments → [env] → Monitoring
# 2. If you have access, trigger a heap dump before restart
jmap -dump:format=b,file=/tmp/heapdump.hprof <pid>
# 3. Check recent deployments or configuration changes
# Was anything deployed in the last 24 hours?
Common Causes
Large ImpEx imports consuming memory:
# Check for running imports in HAC → ImpEx → Import
# Large imports without batching can consume gigabytes
# Fix: Split large imports into batches
# Before: Single 2GB import file
# After: Multiple files, 50,000 lines each
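The splitting step can be sketched in plain Java. This is a minimal illustration (the `ImpexSplitter` class and batch size are made up for the example): each batch repeats the header line so the resulting files are independently importable.

```java
import java.util.ArrayList;
import java.util.List;

public class ImpexSplitter {

    // Splits an ImpEx file (one header line followed by data lines) into
    // batches of at most batchSize data lines. The header is repeated in
    // every batch so each file can be imported on its own.
    public static List<List<String>> split(List<String> lines, int batchSize) {
        String header = lines.get(0);
        List<List<String>> batches = new ArrayList<>();
        for (int i = 1; i < lines.size(); i += batchSize) {
            List<String> batch = new ArrayList<>();
            batch.add(header);
            batch.addAll(lines.subList(i, Math.min(i + batchSize, lines.size())));
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("INSERT_UPDATE Product;code[unique=true];name");
        for (int i = 0; i < 5; i++) {
            lines.add(";P" + i + ";Product " + i);
        }
        // 5 data lines in batches of 2 -> 3 files
        System.out.println(split(lines, 2).size() + " batches");
    }
}
```

In practice you would stream the source file line by line rather than hold it in memory, but the batching logic is the same.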
Catalog synchronization on large catalogs:
# Catalog sync loads products into memory
# For catalogs with 500k+ products, increase memory or optimize sync
# Reduce sync batch size
catalog.sync.workers=4
synchronization.itemcopycreator.batchSize=100
Unbounded FlexibleSearch results:
// PROBLEM: Loading all products into memory
String query = "SELECT {pk} FROM {Product}";
List<ProductModel> allProducts = flexibleSearchService.search(query).getResult();
// With 500,000 products, this consumes massive heap
// FIX: Always paginate
FlexibleSearchQuery fsq = new FlexibleSearchQuery(query);
fsq.setStart(0);
fsq.setCount(100); // Process in pages
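The full pagination loop looks like this, sketched in plain Java with the hybris-specific call replaced by a generic fetch function (in real code, `fetchPage` would build a FlexibleSearchQuery with setStart(offset) and setCount(pageSize) and run it):

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

public class PagedProcessor {

    // Processes a large result set page by page instead of loading it all
    // at once. fetchPage stands in for a FlexibleSearch query executed
    // with setStart(offset) / setCount(pageSize).
    public static <T> long processAll(BiFunction<Integer, Integer, List<T>> fetchPage,
                                      int pageSize,
                                      Consumer<T> processor) {
        long processed = 0;
        int offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break;
            }
            for (T item : page) {
                processor.accept(item);
                processed++;
            }
            if (page.size() < pageSize) {
                break; // short page means we reached the end
            }
            offset += pageSize;
        }
        return processed;
    }

    public static void main(String[] args) {
        java.util.List<Integer> data = new java.util.ArrayList<>();
        for (int i = 0; i < 250; i++) data.add(i);
        long n = processAll(
            (off, cnt) -> data.subList(Math.min(off, data.size()),
                                       Math.min(off + cnt, data.size())),
            100, x -> { });
        System.out.println("processed " + n);
    }
}
```

The heap never holds more than one page of models at a time; combine this with periodic `modelService.detachAll()` if the processed models are not needed afterwards.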
Session data accumulation:
# Sessions storing too much data (cart calculations, comparison lists)
# Check session sizes in Dynatrace or via JMX
# Reduce session timeout for anonymous users
default.session.timeout=600
# Store large session data externally (Redis) rather than in-memory
Analyzing Heap Dumps
# Download heap dump from pod
kubectl cp commerce-pod:/tmp/heapdump.hprof ./heapdump.hprof
# Open with Eclipse MAT (Memory Analyzer Tool)
# Key reports:
# 1. Leak Suspects Report — shows objects retaining the most memory
# 2. Dominator Tree — shows largest objects by retained size
# 3. Top Consumers — aggregate view by class
# Common findings:
# - Large HashMap instances (session data)
# - List<ProductModel> with millions of entries (unbounded queries)
# - byte[] arrays (media processing without streaming)
Problem 3: Thread Deadlocks and Pool Exhaustion
Symptoms
- Requests hanging indefinitely
- Thread pool metrics showing 100% utilization
- RejectedExecutionException in logs
Taking Thread Dumps
# Via jstack (if you have pod access)
jstack <pid> > thread_dump_$(date +%s).txt
# Take 3 dumps, 10 seconds apart, to see thread progression
for i in 1 2 3; do
jstack <pid> > thread_dump_${i}.txt
sleep 10
done
# Via HAC → Monitoring → Thread Dump
# Downloads a formatted thread dump
Reading Thread Dumps
# Look for BLOCKED threads
"http-nio-9002-exec-42" #142 daemon prio=5
java.lang.Thread.State: BLOCKED (on object monitor)
at de.hybris.platform.persistence.GenericBMPBean.ejbLoad(GenericBMPBean.java:456)
- waiting to lock <0x00000007f8e45678> (a java.lang.Object)
- locked by "http-nio-9002-exec-17" #117
# This thread is blocked waiting for a lock held by thread exec-17
# Check what exec-17 is doing — likely stuck in a long database operation
# Look for deadlocks (two threads each waiting for the other's lock)
"Found one Java-level deadlock:"
Thread 1 waiting for lock held by Thread 2
Thread 2 waiting for lock held by Thread 1
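With three dumps taken seconds apart, the quickest signal is how the distribution of thread states changes between them. A small helper like the following (an illustrative sketch, not part of any SAP tooling) counts states in a jstack-style dump; a BLOCKED count that stays high across all three dumps points at lock contention rather than a momentary spike:

```java
import java.util.HashMap;
import java.util.Map;

public class ThreadDumpStats {

    // Counts occurrences of each thread state in a jstack-style dump by
    // scanning for "java.lang.Thread.State:" lines.
    public static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : dump.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("java.lang.Thread.State:")) {
                // State name is the first token after the prefix,
                // e.g. "BLOCKED (on object monitor)" -> "BLOCKED"
                String state = trimmed
                        .substring("java.lang.Thread.State:".length())
                        .trim().split(" ")[0];
                counts.merge(state, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String dump = "\"exec-1\" #1\n"
                + "   java.lang.Thread.State: BLOCKED (on object monitor)\n"
                + "\"exec-2\" #2\n"
                + "   java.lang.Thread.State: RUNNABLE\n";
        System.out.println(countStates(dump));
    }
}
```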
Common Thread Issues
Database connection pool exhaustion:
# Symptom: Threads waiting for database connections
# Check: All DB connections are in use, new requests queue up
# Increase pool size (default is often 10-20)
db.pool.maxActive=50
db.pool.maxIdle=20
db.pool.minIdle=5
# Add connection validation to reclaim leaked connections
db.pool.testOnBorrow=true
db.pool.validationQuery=SELECT 1
# Set max wait time to fail fast rather than queue indefinitely
db.pool.maxWait=5000
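The fail-fast behavior of maxWait can be illustrated with a plain-Java sketch, using a Semaphore as a stand-in for the connection pool (the `FailFastPool` class is invented for this example; real pools like the one behind db.pool.* implement the same idea internally):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class FailFastPool {

    private final Semaphore permits;
    private final long maxWaitMillis;

    public FailFastPool(int size, long maxWaitMillis) {
        this.permits = new Semaphore(size);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Mirrors db.pool.maxWait: wait up to maxWaitMillis for a free
    // connection, then fail instead of queueing the request forever.
    public boolean tryBorrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public void release() {
        permits.release();
    }

    public static void main(String[] args) {
        FailFastPool pool = new FailFastPool(1, 50);
        System.out.println("first borrow:  " + pool.tryBorrow());
        System.out.println("second borrow: " + pool.tryBorrow());
    }
}
```

Failing after 5 seconds surfaces the exhaustion as an error you can alert on, instead of letting every request thread pile up behind the pool.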
Long-running transactions blocking other threads:
// PROBLEM: Transaction held open during external API call
@Transactional
public void processOrder(OrderModel order) {
orderService.updateStatus(order, OrderStatus.PROCESSING);
// This external call takes 30 seconds during peak load
// The transaction stays open, holding database locks
paymentGateway.charge(order.getPaymentInfo());
orderService.updateStatus(order, OrderStatus.PAID);
}
// FIX: Separate the external call from the transaction so each
// status update runs in its own short transaction
public void processOrder(OrderModel order) {
    updateOrderStatus(order, OrderStatus.PROCESSING);
    PaymentResult result = paymentGateway.charge(order.getPaymentInfo());
    if (result.isSuccess()) {
        updateOrderStatus(order, OrderStatus.PAID);
    }
}
// Note: Spring's proxy-based @Transactional is ignored on private methods
// and on self-invocation. Put this on a public method of a separate bean,
// or use a TransactionTemplate, so the annotation actually takes effect.
@Transactional
public void updateOrderStatus(OrderModel order, OrderStatus status) {
    orderService.updateStatus(order, status);
}
Problem 5: Deployment Failures on CCv2
Symptoms
- Build succeeds but deployment fails
- Pod crashes immediately after startup
- Health checks failing
Debugging Deployment Failures
CCv2 Cloud Portal → Builds:
1. Check build logs for compilation errors
2. Check deployment logs for startup errors
3. Check pod logs for runtime exceptions
Common failure patterns:
┌─────────────────────────┬──────────────────────────────────────┐
│ Error                   │ Cause / Remedy                       │
├─────────────────────────┼──────────────────────────────────────┤
│ Build timeout           │ Too many extensions, slow network    │
│ OOM during build        │ Increase build memory in manifest    │
│ Type system update fail │ Incompatible items.xml changes       │
│ Bean definition error   │ Missing Spring dependency            │
│ Health check timeout    │ Slow startup, increase timeout       │
│ License validation fail │ License expired or misconfigured     │
└─────────────────────────┴──────────────────────────────────────┘
Manifest Configuration Issues
// manifest.json — common misconfigurations
{
"commerceSuiteVersion": "2211.25",
"extensions": [
// PROBLEM: Extension listed but not in repository
"myextension",
// PROBLEM: Dependency extension missing
// myextension requires basecommerce but it's not listed
],
"aspects": [
{
"name": "backoffice",
"properties": [
{
"key": "db.pool.maxActive",
"value": "50"
// PROBLEM: Value too high for pod memory allocation
// 50 connections × ~5MB each = 250MB just for DB connections
}
]
}
]
}
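For contrast, a corrected version of the manifest above might look like this (the extension names and pool value are illustrative; the point is that every required dependency is listed explicitly and pool sizing fits the pod's memory allocation):

```json
{
  "commerceSuiteVersion": "2211.25",
  "extensions": [
    "basecommerce",
    "myextension"
  ],
  "aspects": [
    {
      "name": "backoffice",
      "properties": [
        { "key": "db.pool.maxActive", "value": "20" }
      ]
    }
  ]
}
```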
Rolling Back a Failed Deployment
CCv2 Cloud Portal → Deployments:
1. Identify the last working build number
2. Create new deployment using the previous build
3. Deploy to the affected environment
Important: If the failed deployment included database changes
(new types, removed attributes), the rollback build may need
to handle the modified schema. Test rollback scenarios before
production deployments.
Problem 7: Cluster Synchronization Issues
Symptoms
- Changes made on one node not visible on others
- Inconsistent behavior between requests (load balancer routing to different nodes)
- Stale cache data despite modifications
Diagnosis
# Check cluster node status in HAC → Platform → Cluster
# Verify cluster communication
# All nodes should show as "ALIVE" with recent heartbeat
# If nodes show as "STALE" or "DEAD":
# 1. Check network connectivity between pods
# 2. Verify JGroups configuration
# 3. Check if cluster broadcast is enabled
cluster.broadcast.methods=jgroups
cluster.broadcast.method.jgroups.channel.name=hybris-broadcast
Cache Invalidation Verification
// Test: modify an item and check if other nodes see the change
// On Node 1:
ProductModel product = productService.getProductForCode("TEST-001");
product.setName("Updated Name");
modelService.save(product);
// On Node 2 (via different request):
ProductModel product = productService.getProductForCode("TEST-001");
// If name is still old → cache invalidation is not working
// If name is updated → cluster sync is working
Best Practices
Monitor proactively — set alerts for memory usage >80%, cache hit rates <70%, error rates above baseline, and CronJob failures.
Keep thread dumps and heap dumps accessible — configure the JVM to dump on OOM (-XX:+HeapDumpOnOutOfMemoryError) and know how to take thread dumps quickly.
Maintain runbooks — document every production issue and its resolution. The next incident at 3 AM shouldn't require the same diagnosis from scratch.
Test at production scale — issues that appear with 500,000 products and 10,000 concurrent users don't show up in dev environments.
Deploy during low-traffic windows — CCv2 supports rolling deployments, but having fewer users during deployment reduces the blast radius of issues.
Know the rollback process — practice rolling back deployments before you need to do it under pressure.
Separate background processing from customer-facing traffic — CronJobs, catalog sync, and indexing should run on dedicated nodes, not on nodes serving API requests.
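As a sketch of that last point: in SAP Commerce, background work can be pinned to a node group via the cluster.node.groups property and the CronJob's nodeGroup attribute. The group and job names below are placeholders; on CCv2 the node-group assignment is typically handled per aspect in manifest.json rather than in local.properties.

```
# local.properties on the background-processing instances
cluster.node.groups=backgroundProcessing

# ImpEx: pin a CronJob to that node group so it never runs on
# customer-facing nodes ("mySyncCronJob" is a placeholder)
INSERT_UPDATE CronJob;code[unique=true];nodeGroup
;mySyncCronJob;backgroundProcessing
```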