DEV Community

Michael
Michael

Posted on • Originally published at gbase.cn

GBase 8c Lock Conflicts and Deadlock Troubleshooting: A Practical Guide

Lock conflicts and deadlocks are frequent culprits behind business disruptions in GBase 8c distributed clusters. Unlike single-node databases, GBase 8c introduces both local locks and cross-node global locks, making diagnosis more challenging. This guide provides a field-tested workflow covering lock mechanisms, common causes, diagnosis steps, and practical solutions.

1. Core Lock Mechanisms in GBase 8c

1.1 Lock Scope

  • Local locks: Confined to a single CN or DN node (e.g., concurrent transactions on one DN).
  • Global locks: Span multiple nodes, typical in distributed transactions and DDL operations.

1.2 Common Lock Modes and Conflicts

Mode Purpose Conflict Scenarios
ShareLock (S) Shared read lock Conflicts with X locks
ExclusiveLock (X) Exclusive write lock Conflicts with all other lock modes
AccessExclusiveLock (AEL) DDL operations (e.g., TRUNCATE) Blocks all reads and writes
Intent Locks (IS/IX) Table-level intent, coordinates with row-level locks Ensures table/row lock compatibility

1.3 Deadlock Types

  • Local deadlock: Occurs within a single node, automatically detected and resolved by GBase 8c.
  • Global deadlock: Cross-node circular wait; may require manual intervention if auto-resolution fails.

2. Five Common Causes of Lock Issues

  1. Poor transaction design: Long-running transactions, inconsistent multi-table operation order.
  2. Ineffective SQL: Missing WHERE clause causing full-table lock, index failure expanding lock scope, DDL during peak hours.
  3. Stuck two-phase transactions: Distributed transactions stuck in prepared state holding locks.
  4. Improper cluster parameters: e.g., lock_wait_timeout too short/long, insufficient thread pool.
  5. Node communication anomalies: Network delays or partitions triggering global deadlocks.

3. Diagnostic Workflow (4 Steps)

3.1 Step 1: Locate Blocked Sessions

-- Find all blocked lock requests
SELECT
    pg_locks.pid,
    pg_class.relname AS table_name,
    pg_locks.mode,
    pg_locks.granted,
    pg_stat_activity.query,
    pg_stat_activity.state
FROM pg_locks
LEFT JOIN pg_class ON pg_locks.relation = pg_class.oid
LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE pg_locks.granted = false;
Enter fullscreen mode Exit fullscreen mode

Also check for stranded two-phase transactions:

SELECT gxid, status, start_time
FROM pg_distributed_transactions
WHERE status = 'prepared';
Enter fullscreen mode Exit fullscreen mode

3.2 Step 2: Analyze the Lock Wait Chain

WITH lock_chain AS (
    SELECT
        pg_locks.pid AS wait_pid,
        pg_locks.relation,
        pg_locks.mode AS wait_mode,
        pg_stat_activity.query AS wait_query,
        (SELECT pid FROM pg_locks l2 WHERE l2.relation = pg_locks.relation AND l2.granted = true AND l2.mode = 'ExclusiveLock') AS hold_pid
    FROM pg_locks
    LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
    WHERE pg_locks.granted = false
)
SELECT
    wait_pid AS blocked_pid,
    pg_class.relname AS table_name,
    wait_mode,
    wait_query,
    hold_pid AS holding_pid,
    (SELECT query FROM pg_stat_activity WHERE pid = hold_pid) AS holding_query
FROM lock_chain
LEFT JOIN pg_class ON lock_chain.relation = pg_class.oid;
Enter fullscreen mode Exit fullscreen mode

3.3 Step 3: Distinguish Local vs. Global Deadlock

  • Local: Check node logs grep -i "deadlock" /opt/gbase/data/log/postgresql-*.log.
  • Global: Check all CN/DN logs, and query pg_distributed_transactions for prepared transactions.

3.4 Step 4: Verify Root Cause

  • Use EXPLAIN ANALYZE to check index usage.
  • Identify transactions running longer than 5 minutes.
  • Test node connectivity with ping and telnet.

4. Solutions

4.1 Optimise Transaction Design

  • Break long transactions; avoid external API calls inside transactions.
  • Enforce consistent object order across all transactions (e.g., always update user before order).

4.2 Refine SQL and Indexes

-- Add index to prevent full-table scan
CREATE INDEX idx_user_phone ON user(phone);
-- Schedule DDL off-peak, check lock status beforehand
Enter fullscreen mode Exit fullscreen mode

4.3 Clean Up Stuck Prepared Transactions

SELECT pg_terminate_backend(gxid)
FROM pg_distributed_transactions
WHERE status = 'prepared' AND start_time < now() - INTERVAL '1 hour';
Enter fullscreen mode Exit fullscreen mode

4.4 Tune Cluster Parameters

ALTER SYSTEM SET lock_wait_timeout = '10s';
ALTER SYSTEM SET thread_pool_attr = '512, 1, (cpubind:0-3)';
ALTER SYSTEM SET thread_pool_stream_attr = '512, 0.2, 1, (cpubind:0-3)';
-- Apply with SELECT pg_reload_conf(); or restart if necessary.
Enter fullscreen mode Exit fullscreen mode

4.5 Ensure Stable Node Communication

Enable heartbeat checks and adjust inter‑node timeout settings.

5. Real‑World Case Study

Symptom: During peak hours, many queries failed with lock wait timeout, and multiple backend threads hung in waiting state.

Diagnosis: Global lock query revealed two UPDATE transactions in a circular wait — PID 1234 held an X lock on user and waited for order, while PID 5678 held an X lock on order and waited for user. Two prepared transactions also existed.

Resolution:

  1. Manually terminated the stuck prepared transactions.
  2. Standardised transaction order across all applications (user → order).
  3. Adjusted lock_wait_timeout to 10s and optimised thread pool settings.
  4. Repaired network latency and enabled node heartbeat detection.

Result: Lock blocks cleared, API response times returned to normal, and no new deadlocks occurred.

6. Common Pitfalls and Best Practices

  • Never blindly kill sessions; first identify the lock holder.
  • Regularly clean up long‑running prepared transactions.
  • Global deadlocks often need manual intervention.
  • Execute DDL only during maintenance windows.
  • Periodically verify index effectiveness and query plans.
  • Tune lock and thread pool parameters based on actual cluster load.

Effective lock troubleshooting in a gbase database demands a systematic approach — pinpoint the root cause, apply targeted fixes, and establish continuous monitoring to prevent recurrence. Use this guide to keep your GBASE clusters stable and responsive.

Top comments (0)