Michael

Posted on Jun 12 • Originally published at gbase.cn

GBase 8c Lock Conflicts and Deadlock Troubleshooting: A Practical Guide

#gbase #database #数据库 #troubleshooting

Lock conflicts and deadlocks are frequent culprits behind business disruptions in GBase 8c distributed clusters. Unlike single-node databases, GBase 8c introduces both local locks and cross-node global locks, making diagnosis more challenging. This guide provides a field-tested workflow covering lock mechanisms, common causes, diagnosis steps, and practical solutions.

1. Core Lock Mechanisms in GBase 8c

1.1 Lock Scope

Local locks: Confined to a single CN or DN node (e.g., concurrent transactions on one DN).
Global locks: Span multiple nodes, typical in distributed transactions and DDL operations.

1.2 Common Lock Modes and Conflicts

Mode	Purpose	Conflict Scenarios
ShareLock (S)	Shared read lock	Conflicts with X locks
ExclusiveLock (X)	Exclusive write lock	Conflicts with all other lock modes
AccessExclusiveLock (AEL)	DDL operations (e.g., TRUNCATE)	Blocks all reads and writes
Intent Locks (IS/IX)	Table-level intent, coordinates with row-level locks	Ensures table/row lock compatibility

1.3 Deadlock Types

Local deadlock: Occurs within a single node, automatically detected and resolved by GBase 8c.
Global deadlock: Cross-node circular wait; may require manual intervention if auto-resolution fails.

2. Five Common Causes of Lock Issues

Poor transaction design: Long-running transactions, inconsistent multi-table operation order.
Ineffective SQL: Missing WHERE clause causing full-table lock, index failure expanding lock scope, DDL during peak hours.
Stuck two-phase transactions: Distributed transactions stuck in prepared state holding locks.
Improper cluster parameters: e.g., lock_wait_timeout too short/long, insufficient thread pool.
Node communication anomalies: Network delays or partitions triggering global deadlocks.

3. Diagnostic Workflow (4 Steps)

3.1 Step 1: Locate Blocked Sessions

-- Find all blocked lock requests
SELECT
    pg_locks.pid,
    pg_class.relname AS table_name,
    pg_locks.mode,
    pg_locks.granted,
    pg_stat_activity.query,
    pg_stat_activity.state
FROM pg_locks
LEFT JOIN pg_class ON pg_locks.relation = pg_class.oid
LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE pg_locks.granted = false;

Also check for stranded two-phase transactions:

SELECT gxid, status, start_time
FROM pg_distributed_transactions
WHERE status = 'prepared';

3.2 Step 2: Analyze the Lock Wait Chain

WITH lock_chain AS (
    SELECT
        pg_locks.pid AS wait_pid,
        pg_locks.relation,
        pg_locks.mode AS wait_mode,
        pg_stat_activity.query AS wait_query,
        (SELECT pid FROM pg_locks l2 WHERE l2.relation = pg_locks.relation AND l2.granted = true AND l2.mode = 'ExclusiveLock') AS hold_pid
    FROM pg_locks
    LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
    WHERE pg_locks.granted = false
)
SELECT
    wait_pid AS blocked_pid,
    pg_class.relname AS table_name,
    wait_mode,
    wait_query,
    hold_pid AS holding_pid,
    (SELECT query FROM pg_stat_activity WHERE pid = hold_pid) AS holding_query
FROM lock_chain
LEFT JOIN pg_class ON lock_chain.relation = pg_class.oid;

3.3 Step 3: Distinguish Local vs. Global Deadlock

Local: Check node logs grep -i "deadlock" /opt/gbase/data/log/postgresql-*.log.
Global: Check all CN/DN logs, and query pg_distributed_transactions for prepared transactions.

3.4 Step 4: Verify Root Cause

Use EXPLAIN ANALYZE to check index usage.
Identify transactions running longer than 5 minutes.
Test node connectivity with ping and telnet.

4. Solutions

4.1 Optimise Transaction Design

Break long transactions; avoid external API calls inside transactions.
Enforce consistent object order across all transactions (e.g., always update user before order).

4.2 Refine SQL and Indexes

-- Add index to prevent full-table scan
CREATE INDEX idx_user_phone ON user(phone);
-- Schedule DDL off-peak, check lock status beforehand

4.3 Clean Up Stuck Prepared Transactions

SELECT pg_terminate_backend(gxid)
FROM pg_distributed_transactions
WHERE status = 'prepared' AND start_time < now() - INTERVAL '1 hour';

4.4 Tune Cluster Parameters

ALTER SYSTEM SET lock_wait_timeout = '10s';
ALTER SYSTEM SET thread_pool_attr = '512, 1, (cpubind:0-3)';
ALTER SYSTEM SET thread_pool_stream_attr = '512, 0.2, 1, (cpubind:0-3)';
-- Apply with SELECT pg_reload_conf(); or restart if necessary.

4.5 Ensure Stable Node Communication

Enable heartbeat checks and adjust inter‑node timeout settings.

5. Real‑World Case Study

Symptom: During peak hours, many queries failed with lock wait timeout, and multiple backend threads hung in waiting state.

Diagnosis: Global lock query revealed two UPDATE transactions in a circular wait — PID 1234 held an X lock on user and waited for order, while PID 5678 held an X lock on order and waited for user. Two prepared transactions also existed.

Resolution:

Manually terminated the stuck prepared transactions.
Standardised transaction order across all applications (user → order).
Adjusted lock_wait_timeout to 10s and optimised thread pool settings.
Repaired network latency and enabled node heartbeat detection.

Result: Lock blocks cleared, API response times returned to normal, and no new deadlocks occurred.

6. Common Pitfalls and Best Practices

Never blindly kill sessions; first identify the lock holder.
Regularly clean up long‑running prepared transactions.
Global deadlocks often need manual intervention.
Execute DDL only during maintenance windows.
Periodically verify index effectiveness and query plans.
Tune lock and thread pool parameters based on actual cluster load.

Effective lock troubleshooting in a gbase database demands a systematic approach — pinpoint the root cause, apply targeted fixes, and establish continuous monitoring to prevent recurrence. Use this guide to keep your GBASE clusters stable and responsive.

DEV Community