GBASE Database

Posted on Jul 2, 2024

GBase 8c Join Query Performance Optimization: A Practical Analysis

#database

Join queries are one of the primary methods in relational databases, including methods like hash join, merge join, or nested loop join. This article explores how to optimize join query performance in GBase 8c database through practical examples.

1. Creating Tables and Importing Data

Create tables departments and employees:

-- Create departments table
CREATE TABLE departments (
   dept_id INT PRIMARY KEY,
   dept_name VARCHAR(100)
);

-- Insert department data
INSERT INTO departments (dept_id, dept_name) VALUES
(1, 'HR'),
(2, 'Engineering'),
(3, 'Marketing');

-- Create employees table
CREATE TABLE employees (
   emp_id INT PRIMARY KEY,
   emp_name VARCHAR(100),
   dept_id INT,
   salary DECIMAL(10, 2),
   FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);

-- Insert employee data
INSERT INTO employees (emp_id, emp_name, dept_id, salary) VALUES
(1, 'Alice', 1, 50000.00),
(2, 'Bob', 2, 60000.00),
(3, 'Carol', 3, 55000.00),
(4, 'David', 1, 48000.00),
(5, 'Eve', 2, 52000.00);

2. Performing Join Queries and Optimizing Performance

Original Query

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) SELECT e.emp_name, d.dept_name 
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

The execution plan may resemble the following:

In this execution plan: with only 5 rows in the table, the database's choice of a hash join is evidently inappropriate. Generally, for joins involving fewer than 1000 rows, a nested loop join (nestloop) significantly outperforms a hash join. This is because a hash join requires hashing both the smaller and larger tables on the join fields before connecting the results of each hash bucket and then aggregating the final results, somewhat akin to the divide-and-conquer approach of a quicksort algorithm.

Optimized Query Using Hints

-- To force the execution plan to use a nestloop
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) SELECT  /*+ nestloop (e d) */ e.emp_name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

The execution plan may show results similar to the following:

In the optimized execution plan, the nestloop hint is used to force the nested loop join. This significantly reduces the SQL execution time from 0.419ms to 0.170ms.

3. Analysis and Optimization

In the original execution plan, the optimizer might incorrectly choose a hash join, resulting in poorer performance. By using a hint /*+ nestloop (e d) */, we force the use of a nested loop join, which is more suitable for scenarios with fewer rows (e.g., less than 1000).

Scenarios for Choosing Join Types

(1) Hash Join

Suitable for: When one table in the join is significantly smaller than the other, leveraging hash algorithms for fast matching (e.g., JOIN ON table1.key = table2.key).

Advantages: Efficient in appropriate scenarios, especially when memory and hash function selection are optimal.

(2) Merge Join

Suitable for: When both input tables are sorted according to the join condition.

Advantages: Efficient for sorted inputs, particularly in large datasets.

(3) Nested Loop Join

Suitable for: When one table is significantly smaller than the other and no suitable indexes exist for hash or merge joins.

Advantages: Provides a reliable join method for smaller tables or when join conditions are not conducive to hash or sort algorithms.

DEV Community