Michael

Posted on Jun 19 • Originally published at gbase.cn

GBase 8a Data Import & Export Guide: gload, LOAD DATA, and SELECT INTO OUTFILE

#gbase #database #数据库 #performance

Importing and exporting data are among the most frequent operations in a gbase database MPP data warehouse. This guide covers tool selection, core parameter configuration, character set handling, error troubleshooting, and production‑tuning experience.

1. Import Method Selection

Method	Command	Throughput	Best For
gload	`gload -f load.cfg`	Highest	Production bulk loads, parallel processing, checkpoint resume
LOAD DATA INFILE	SQL statement	Medium	Single‑file loads, simple syntax, development/testing
INSERT INTO ... VALUES	SQL statement	Low	Small data writes, not for bulk

For large imports (>1 GB), gload is strongly recommended — its parallel processing far exceeds LOAD DATA.

2. gload in Detail

Configuration File

gload is driven by a .cfg file. A full example:

# load_orders.cfg
host     = 10.168.10.26
port     = 5258
user     = gbase
password = your_password
database = sales_db
table    = orders

# Data files (wildcards supported to load multiple files at once)
infile   = /data/orders/orders_2024_*.csv

# File format
fields terminated by ','       # field delimiter
enclosed by '"'                 # string quoting
lines terminated by '\n'        # line terminator
ignore 1 lines                  # skip header line

# Column mapping (file columns mapped to table columns in order)
(order_id, customer_id, dept_id, amount, status, order_date, create_time)

# Error handling
errors = 1000                   # max bad rows allowed (exceeding aborts the load)

Execute the load:

gload -f load_orders.cfg

Common File Format Settings

CSV: fields terminated by ',', enclosed by '"'
TSV: fields terminated by '\t'
Pipe‑delimited: fields terminated by '|'
Skip header & remap columns: ignore 1 lines + column list (order_id, amount, order_date, status)

Character Set Configuration

Mismatched character sets are the most common cause of garbled data after import. Specify the file encoding explicitly in the cfg file:

character_set = utf8    # encoding of the data file

Server‑side parameters (in gbase.cnf) should match. If the file is GBK‑encoded, set character_set = gbk in the cfg and ensure the table uses DEFAULT CHARSET=utf8 — the server will auto‑convert.

Collecting Error Rows

After enabling error collection, query the error log using the task_id printed during the load:

SELECT task_id, error_row_no, error_msg,
       SUBSTR(raw_data, 1, 200) AS raw_line
FROM gclusterdb.load_error_log
WHERE task_id = '20240601_143022_000001'
LIMIT 50;

gload Performance Tuning Parameters

Parameter	Scope	Description	Recommended
gcluster_loader_max_data_processors	gcluster	Concurrent processing threads	physical CPU cores / 2
gcluster_loader_min_chunk_size	gcluster	Chunk size per gnode (bytes)	67108864 (64 MB)
gbase_loader_parallel_degree	gnode	Parallel write threads per gnode	4–8
gbase_loader_buffer_count	gnode	Number of write buffers	4
gbase_loader_read_timeout	gnode	Data read timeout (seconds)	300

3. LOAD DATA INFILE in Detail

Basic Syntax

LOAD DATA INFILE '/data/orders/orders.csv'
INTO TABLE orders
CHARACTER SET utf8
FIELDS TERMINATED BY ','
       ENCLOSED BY '"'
       ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(order_id, customer_id, dept_id, amount, status, order_date, @create_time)
SET create_time = STR_TO_DATE(@create_time, '%Y-%m-%d %H:%i:%s');

Key points: the INFILE path is on the gcluster node; use @var to capture column values and transform them in SET.

LOCAL Keyword

LOAD DATA LOCAL INFILE '/local/path/data.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

LOCAL reads the file from the client — convenient for small test files, but not recommended for production.

4. Exporting with SELECT INTO OUTFILE

Basic Usage

SELECT order_id, customer_id, amount, order_date
INTO OUTFILE '/data/export/orders_2024.csv'
CHARACTER SET utf8
FIELDS TERMINATED BY ','
       ENCLOSED BY '"'
       ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM orders
WHERE order_date >= '2024-01-01';

Notes: the export path is local to the gcluster node; the target file must not already exist; use gbase_export_directory to restrict allowed write directories.

Export a Specific Partition

SELECT *
INTO OUTFILE '/data/export/orders_2024q1.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM orders PARTITION (p2024q1);

Concurrent Export Script for Large Tables

#!/bin/bash
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    gccli -u gbase -p password -e "
        SELECT * INTO OUTFILE '/data/export/orders_2024${month}.csv'
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'
        FROM orders
        WHERE order_date BETWEEN '2024-${month}-01' AND LAST_DAY('2024-${month}-01')
    " &
done
wait
echo "All exports done"

5. Monitoring Export Progress

-- Running exports
SELECT task_id, table_name, status, start_time,
       exported_rows,
       TIMESTAMPDIFF(SECOND, start_time, NOW()) AS elapsed_sec
FROM gclusterdb.export_task
WHERE status = 'RUNNING';

-- Historical exports
SELECT task_id, table_name, status, exported_rows,
       start_time, end_time
FROM gclusterdb.export_task
ORDER BY start_time DESC
LIMIT 20;

6. Common Issues and Solutions

Fewer rows imported than expected, no error: likely illegal characters silently skipped. Enable gbase_loader_logs_collect = ON and check load_error_log.
Slow import with low I/O: concurrency is too low. Increase gcluster_loader_max_data_processors and gbase_loader_parallel_degree.
OUTFILE "Can't create/write to file": check directory existence, file pre‑existence, and gbase_export_directory restrictions.
Date format errors: use @var + SET with STR_TO_DATE:

LOAD DATA INFILE '/data/orders.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
(@order_id, @customer_id, @amount, @order_date_str)
SET
    order_id    = @order_id,
    customer_id = @customer_id,
    amount      = @amount,
    order_date  = STR_TO_DATE(@order_date_str, '%Y%m%d');

7. Best Practices

Scenario	Recommendation
Daily incremental loads (>1 GB)	gload with config file, concurrent multi‑file batches
Dev/test small table loads	LOAD DATA LOCAL INFILE
Periodic full exports for backup	SELECT INTO OUTFILE with partition‑based concurrent export
Cross‑database migration	gload combined with SELECT INTO OUTFILE pipeline
Data quality inspection	Enable `gbase_loader_logs_collect` first to see error distribution

gload is the production workhorse. It parallelises processing at both the gcluster and gnode layers, fully leveraging the multi‑node concurrent write capability of the MPP cluster — throughput is typically 3–5× that of LOAD DATA.

DEV Community