Importing and exporting data are among the most frequent operations in a gbase database MPP data warehouse. This guide covers tool selection, core parameter configuration, character set handling, error troubleshooting, and production‑tuning experience.
1. Import Method Selection
| Method | Command | Throughput | Best For |
|---|---|---|---|
| gload | gload -f load.cfg |
Highest | Production bulk loads, parallel processing, checkpoint resume |
| LOAD DATA INFILE | SQL statement | Medium | Single‑file loads, simple syntax, development/testing |
| INSERT INTO ... VALUES | SQL statement | Low | Small data writes, not for bulk |
For large imports (>1 GB), gload is strongly recommended — its parallel processing far exceeds LOAD DATA.
2. gload in Detail
Configuration File
gload is driven by a .cfg file. A full example:
# load_orders.cfg
host = 10.168.10.26
port = 5258
user = gbase
password = your_password
database = sales_db
table = orders
# Data files (wildcards supported to load multiple files at once)
infile = /data/orders/orders_2024_*.csv
# File format
fields terminated by ',' # field delimiter
enclosed by '"' # string quoting
lines terminated by '\n' # line terminator
ignore 1 lines # skip header line
# Column mapping (file columns mapped to table columns in order)
(order_id, customer_id, dept_id, amount, status, order_date, create_time)
# Error handling
errors = 1000 # max bad rows allowed (exceeding aborts the load)
Execute the load:
gload -f load_orders.cfg
Common File Format Settings
-
CSV:
fields terminated by ',',enclosed by '"' -
TSV:
fields terminated by '\t' -
Pipe‑delimited:
fields terminated by '|' -
Skip header & remap columns:
ignore 1 lines+ column list(order_id, amount, order_date, status)
Character Set Configuration
Mismatched character sets are the most common cause of garbled data after import. Specify the file encoding explicitly in the cfg file:
character_set = utf8 # encoding of the data file
Server‑side parameters (in gbase.cnf) should match. If the file is GBK‑encoded, set character_set = gbk in the cfg and ensure the table uses DEFAULT CHARSET=utf8 — the server will auto‑convert.
Collecting Error Rows
After enabling error collection, query the error log using the task_id printed during the load:
SELECT task_id, error_row_no, error_msg,
SUBSTR(raw_data, 1, 200) AS raw_line
FROM gclusterdb.load_error_log
WHERE task_id = '20240601_143022_000001'
LIMIT 50;
gload Performance Tuning Parameters
| Parameter | Scope | Description | Recommended |
|---|---|---|---|
| gcluster_loader_max_data_processors | gcluster | Concurrent processing threads | physical CPU cores / 2 |
| gcluster_loader_min_chunk_size | gcluster | Chunk size per gnode (bytes) | 67108864 (64 MB) |
| gbase_loader_parallel_degree | gnode | Parallel write threads per gnode | 4–8 |
| gbase_loader_buffer_count | gnode | Number of write buffers | 4 |
| gbase_loader_read_timeout | gnode | Data read timeout (seconds) | 300 |
3. LOAD DATA INFILE in Detail
Basic Syntax
LOAD DATA INFILE '/data/orders/orders.csv'
INTO TABLE orders
CHARACTER SET utf8
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(order_id, customer_id, dept_id, amount, status, order_date, @create_time)
SET create_time = STR_TO_DATE(@create_time, '%Y-%m-%d %H:%i:%s');
Key points: the INFILE path is on the gcluster node; use @var to capture column values and transform them in SET.
LOCAL Keyword
LOAD DATA LOCAL INFILE '/local/path/data.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
LOCAL reads the file from the client — convenient for small test files, but not recommended for production.
4. Exporting with SELECT INTO OUTFILE
Basic Usage
SELECT order_id, customer_id, amount, order_date
INTO OUTFILE '/data/export/orders_2024.csv'
CHARACTER SET utf8
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM orders
WHERE order_date >= '2024-01-01';
Notes: the export path is local to the gcluster node; the target file must not already exist; use gbase_export_directory to restrict allowed write directories.
Export a Specific Partition
SELECT *
INTO OUTFILE '/data/export/orders_2024q1.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM orders PARTITION (p2024q1);
Concurrent Export Script for Large Tables
#!/bin/bash
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
gccli -u gbase -p password -e "
SELECT * INTO OUTFILE '/data/export/orders_2024${month}.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM orders
WHERE order_date BETWEEN '2024-${month}-01' AND LAST_DAY('2024-${month}-01')
" &
done
wait
echo "All exports done"
5. Monitoring Export Progress
-- Running exports
SELECT task_id, table_name, status, start_time,
exported_rows,
TIMESTAMPDIFF(SECOND, start_time, NOW()) AS elapsed_sec
FROM gclusterdb.export_task
WHERE status = 'RUNNING';
-- Historical exports
SELECT task_id, table_name, status, exported_rows,
start_time, end_time
FROM gclusterdb.export_task
ORDER BY start_time DESC
LIMIT 20;
6. Common Issues and Solutions
-
Fewer rows imported than expected, no error: likely illegal characters silently skipped. Enable
gbase_loader_logs_collect = ONand checkload_error_log. -
Slow import with low I/O: concurrency is too low. Increase
gcluster_loader_max_data_processorsandgbase_loader_parallel_degree. -
OUTFILE "Can't create/write to file": check directory existence, file pre‑existence, and
gbase_export_directoryrestrictions. -
Date format errors: use
@var + SETwithSTR_TO_DATE:
LOAD DATA INFILE '/data/orders.csv'
INTO TABLE orders
FIELDS TERMINATED BY ','
(@order_id, @customer_id, @amount, @order_date_str)
SET
order_id = @order_id,
customer_id = @customer_id,
amount = @amount,
order_date = STR_TO_DATE(@order_date_str, '%Y%m%d');
7. Best Practices
| Scenario | Recommendation |
|---|---|
| Daily incremental loads (>1 GB) | gload with config file, concurrent multi‑file batches |
| Dev/test small table loads | LOAD DATA LOCAL INFILE |
| Periodic full exports for backup | SELECT INTO OUTFILE with partition‑based concurrent export |
| Cross‑database migration | gload combined with SELECT INTO OUTFILE pipeline |
| Data quality inspection | Enable gbase_loader_logs_collect first to see error distribution |
gload is the production workhorse. It parallelises processing at both the gcluster and gnode layers, fully leveraging the multi‑node concurrent write capability of the MPP cluster — throughput is typically 3–5× that of LOAD DATA.
Top comments (0)