Character set issues are among the most frequent pitfalls in a GBase database. Mojibake (garbled text) almost always stems from a mismatch somewhere in the character set conversion chain. This guide covers the supported character sets, the end‑to‑end conversion pipeline, configuration best practices, and how to diagnose the most common garbled‑text scenarios.
Supported Character Sets
| Character Set | Description | Max Bytes/Char | Best For |
|---|---|---|---|
| utf8 | Standard UTF‑8 (1–3 bytes) | 3 | General purpose, the default |
| utf8mb4 | Full UTF‑8 (1–4 bytes) | 4 | Emoji or rare CJK characters |
| gbk | Extended GB national standard | 2 | Legacy systems with GBK sources |
| gb18030 | National standard, GBK superset | 4 | Government scenarios needing rare characters |
Key constraints:
- GBase 8a's utf8 handles only 3‑byte characters (like MySQL's utf8mb3). Use utf8mb4 if you need Emoji.
- The instance‑level character set is set at installation and cannot be changed for existing tables via ALTER TABLE. A rebuild is required.
- Collation is case‑sensitive by default, unlike MySQL's utf8_general_ci. Plan accordingly during migration.
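The 3‑byte versus 4‑byte distinction can be checked before data ever reaches the database. The following is a minimal Python sketch, not GBase code: it uses Python's own codecs as a stand‑in for the server's character sets (an assumption; edge cases may differ), and the helper names `byte_width` and `needs_utf8mb4` are made up for illustration.

```python
from typing import Optional


def byte_width(ch: str, encoding: str) -> Optional[int]:
    """Return how many bytes `ch` occupies in `encoding`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane,
    i.e. would take 4 bytes in UTF-8 and not fit a 3-byte utf8 column."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    for label, ch in [("ASCII", "A"), ("CJK", "中"), ("emoji", "😀")]:
        widths = {enc: byte_width(ch, enc) for enc in ("utf-8", "gbk", "gb18030")}
        print(label, widths, "needs utf8mb4:", needs_utf8mb4(ch))
```

A check like `needs_utf8mb4` could run over incoming text to decide whether a utf8 column is sufficient or a utf8mb4 one is needed.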
The Conversion Chain
Client → character_set_client → gcluster → character_set_connection → Storage (table/db charset) → character_set_results → Client
Mojibake happens only where the character set on one side of a conversion doesn't match the actual byte encoding. The golden rule: keep client = connection = results = table charset.
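To see why a single mismatched link is enough, the sketch below imitates one hop of the chain in Python: text is encoded with the charset the sender really uses and decoded with the charset the receiver was told to expect. It is only an illustration using Python codecs, not GBase internals, and `one_hop` and `lossy_convert` are hypothetical helper names.

```python
def one_hop(text: str, actual: str, declared: str) -> str:
    """Encode with the sender's real charset, decode with the declared one."""
    raw = text.encode(actual)
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(declared, errors="replace")


def lossy_convert(text: str, target: str) -> bytes:
    """Convert to a charset that may lack some characters;
    each character it cannot represent becomes b'?'."""
    return text.encode(target, errors="replace")


if __name__ == "__main__":
    word = "数据库"
    print(one_hop(word, "utf-8", "utf-8"))  # matched: text survives
    print(one_hop(word, "utf-8", "gbk"))    # mismatched: mojibake
    print(one_hop(word, "gbk", "utf-8"))    # mismatched: replacement characters
    print(lossy_convert("价格😀", "gbk"))    # the emoji has no GBK mapping
```

The two failure shapes differ: a mismatch yields wrong or replacement characters, while a conversion into a charset that lacks the character yields question marks.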
Character Set Parameters
SHOW VARIABLES LIKE '%character%';
| Parameter | Purpose | Scope |
|---|---|---|
| character_set_server | Instance charset (set at install) | Config file, restart needed |
| character_set_client | Encoding of SQL sent by client | SET or config file |
| character_set_connection | Server‑side intermediate charset | SET or config file |
| character_set_results | Encoding of results returned | SET or config file |
| character_set_sort | Collation for ordering | SET or config file |
Configuration Best Practices
At installation time (before gbase.cnf is locked):
[client]
default-character-set = utf8
[gbased]
default_character_set = utf8
At session level, align client, connection, and results with a single command, naming whichever charset the client actually uses (gbk in this example):
SET NAMES gbk;
In JDBC:
String url = "jdbc:gbase://host:5258/db?characterEncoding=utf8&useUnicode=true";
Diagnosing and Fixing Mojibake
Scenario 1: gccli displays question marks or boxes
Check the session charset with SHOW VARIABLES, verify the terminal locale (echo $LANG), and confirm the table charset. Fix with SET NAMES utf8; or SET NAMES gbk; to match the terminal.
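Matching SET NAMES to the terminal can be scripted. The sketch below parses a `LANG`‑style value and suggests a statement; the mapping table and the function name `set_names_for_lang` are assumptions made for illustration and cover only the charsets discussed in this guide.

```python
import os
from typing import Optional

# Assumed mapping from locale codesets to the charset names used in this guide.
CODESET_TO_CHARSET = {
    "utf-8": "utf8",
    "utf8": "utf8",
    "gbk": "gbk",
    "gb18030": "gb18030",
}


def set_names_for_lang(lang: str) -> Optional[str]:
    """Suggest a SET NAMES statement for a value such as 'zh_CN.UTF-8'."""
    if "." not in lang:
        return None  # e.g. "C" or "POSIX": no codeset to go on
    # Take the part after the dot and drop any "@modifier" suffix.
    codeset = lang.split(".", 1)[1].split("@", 1)[0].lower()
    charset = CODESET_TO_CHARSET.get(codeset)
    return f"SET NAMES {charset};" if charset else None


if __name__ == "__main__":
    print(set_names_for_lang(os.environ.get("LANG", "")))
```

A UTF‑8 terminal is mapped to utf8 here; utf8mb4 would be the choice if 4‑byte characters are involved.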
Scenario 2: LOAD DATA produces mojibake
GBase 8a's LOAD DATA does not transcode. Convert the file at the OS level first:
iconv -f GBK -t UTF-8 data_gbk.csv > data_utf8.csv
Then load the converted file, declaring the charset it now uses:
LOAD DATA INFILE '/data/import/data_utf8.csv'
INTO TABLE t_product
CHARACTER SET utf8
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
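Because the declared charset has to match the file's real bytes, it can help to test the file before loading. The sketch below tries strict decoding against a short candidate list and transcodes in memory, much like the iconv step; `guess_encoding` and `transcode` are hypothetical helpers, and the guess is only a heuristic (pure ASCII data decodes under every candidate).

```python
from typing import Optional, Sequence


def guess_encoding(data: bytes, candidates: Sequence[str] = ("utf-8", "gbk")) -> Optional[str]:
    """Return the first candidate that decodes `data` without error, else None.

    UTF-8 is tried first because it is the stricter format: GBK text rarely
    happens to be valid UTF-8, while UTF-8 text often decodes as GBK without error.
    """
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Strictly re-encode bytes from `src` to `dst`; raises on any invalid input."""
    return data.decode(src).encode(dst)


if __name__ == "__main__":
    gbk_csv = "1,产品\n".encode("gbk")
    print(guess_encoding(gbk_csv))
    print(transcode(gbk_csv, "gbk", "utf-8"))
```

Strict decoding is deliberate: a file that fails here would otherwise load silently as mojibake.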
Scenario 3: JDBC writes garbled text
Ensure the JDBC URL includes characterEncoding=utf8. Otherwise the JVM default encoding may be used.
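Scenario 4: SELECT INTO OUTFILE output appears garbled in Excel
Excel on a Chinese‑locale Windows machine opens a CSV without a byte order mark as GBK, so a UTF‑8 export appears garbled. Export as GBK with Windows‑style line endings: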
SET character_set_results = gbk;
SELECT ... INTO OUTFILE '/data/export/result.csv'
CHARACTER SET gbk
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
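An alternative for Excel is to keep the export in UTF‑8 and prepend a byte order mark, which Excel uses to recognise the encoding. A minimal post‑processing sketch in Python follows; `add_utf8_bom` is a hypothetical helper, and it assumes the exported bytes are already UTF‑8.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM (EF BB BF) unless it is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


if __name__ == "__main__":
    exported = "id,name\r\n1,产品\r\n".encode("utf-8")
    print(add_utf8_bom(exported)[:3])
```

The helper is idempotent, so running it twice over the same file does not stack a second BOM.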
Scenario 5: Mixing character sets in a single table
Some versions support column‑level charset:
CREATE TABLE t_mixed (
id INT,
name_cn VARCHAR(100) CHARACTER SET gbk,
name_intl VARCHAR(100) CHARACTER SET utf8mb4
) DISTRIBUTED BY HASH(id);
Changing the Character Set of an Existing Table
Because ALTER TABLE ... CONVERT TO CHARACTER SET is not supported, the only safe method is to rebuild:
-- 1. Export with the target charset
SET character_set_results = utf8;
SELECT * INTO OUTFILE '/tmp/t_product_utf8.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
FROM t_product_gbk;
-- 2. Create new table with the target charset
CREATE TABLE t_product_new (...) ENGINE=EXPRESS DEFAULT CHARSET=utf8;
-- 3. Load the exported data
LOAD DATA INFILE '/tmp/t_product_utf8.csv'
INTO TABLE t_product_new CHARACTER SET utf8
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- 4. Swap tables after validation
ALTER TABLE t_product_gbk RENAME TO t_product_gbk_bak;
ALTER TABLE t_product_new RENAME TO t_product_gbk;
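What the validation in step 4 involves is left open here. One inexpensive check, sketched below in Python, is to confirm that the exported file decodes strictly in the target charset and that its line count matches the source table's row count. The function name `check_export` and the idea of passing in a row count obtained separately (for example from SELECT COUNT(*)) are assumptions, and the line count is only meaningful if no field contains an embedded newline.

```python
def check_export(data: bytes, expected_rows: int, encoding: str = "utf-8") -> list:
    """Return a list of problems found in an exported file's bytes (empty if none)."""
    problems = []
    try:
        text = data.decode(encoding)  # strict: fails on the first invalid byte
    except UnicodeDecodeError as exc:
        return [f"not valid {encoding} at byte {exc.start}"]
    rows = text.count("\n")  # the export above terminates each line with '\n'
    if rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {rows}")
    return problems


if __name__ == "__main__":
    sample = "1,产品\n2,数据\n".encode("utf-8")
    print(check_export(sample, expected_rows=2))
```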
Chinese Pinyin Sorting
By default, GBase 8a sorts by binary value. For pinyin ordering:
SET character_set_sort = gbk;
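A GBK‑based sort can yield pinyin order because the common hanzi in GBK (the GB2312 level‑1 block) are laid out by pronunciation, so comparing GBK bytes approximates pinyin order for them. The Python sketch below illustrates that property with four surnames; it is not a description of how GBase implements `character_set_sort`, and rarer characters outside that block will not follow pinyin order.

```python
names = ["张", "李", "王", "陈"]  # Zhang, Li, Wang, Chen

# Default string comparison uses Unicode code points, which ignore pronunciation.
by_code_point = sorted(names)

# Comparing the GBK-encoded bytes follows the pinyin layout of common hanzi.
by_gbk_bytes = sorted(names, key=lambda s: s.encode("gbk"))

print(by_code_point)  # ['张', '李', '王', '陈']
print(by_gbk_bytes)   # ['陈', '李', '王', '张']  (Chen, Li, Wang, Zhang)
```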
Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| Results show ??? | character_set_results mismatch | SET NAMES to client locale |
| Square blocks in results | Encoding at write time mismatched table charset | Re‑import with consistent encoding |
| LOAD DATA produces mojibake | File encoding differs from declared | iconv the file first |
| JDBC writes garbled | Missing characterEncoding in URL | Add characterEncoding=utf8 |
| CSV garbled in Excel | UTF‑8 file opened as GBK | Export as GBK or add BOM |
| Chinese text not sorted by pinyin | Binary sort default | SET character_set_sort=gbk |
| Emoji insert fails | utf8 doesn't support 4‑byte chars | Switch to utf8mb4 |
Mastering character sets in a GBase database eliminates one of the most persistent and confusing classes of bugs. The rule is simple: know the encoding of every link in the chain, and never let a conversion happen implicitly.