DEV Community

Michael

Posted on • Originally published at gbase.cn

GBase 8a Character Set Guide: Root Causes, Configuration, and Fixes for Mojibake

Character set issues are among the most frequent pitfalls in a GBase database. Mojibake (garbled text) almost always stems from a mismatch somewhere in the character set conversion chain. This guide covers the supported character sets, the end‑to‑end conversion pipeline, configuration best practices, and how to diagnose the most common garbled‑text scenarios.

Supported Character Sets

Character Set | Description                     | Max Bytes/Char | Best For
utf8          | Standard UTF‑8 (1–3 bytes)      | 3              | General purpose, the default
utf8mb4       | Full UTF‑8 (1–4 bytes)          | 4              | Emoji or rare CJK characters
gbk           | Extended GB national standard   | 2              | Legacy systems with GBK sources
gb18030       | National standard, GBK superset | 4              | Government scenarios needing rare characters

Key constraints:

  • GBase 8a's utf8 handles only characters that encode to at most 3 bytes (like MySQL's utf8mb3). Use utf8mb4 if you need emoji.
  • The instance‑level character set is fixed at installation, and the character set of an existing table cannot be changed with ALTER TABLE. The table must be rebuilt.
  • Collation is case‑sensitive by default, unlike MySQL's utf8_general_ci. Plan accordingly during migration.
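Before choosing between utf8 and utf8mb4, it can help to check whether the data actually contains 4‑byte characters. The following is a minimal Python sketch, assuming GBase 8a's utf8 is limited to 3 bytes per character as described above; the function names are hypothetical and not part of any GBase tool.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character encodes to 4 bytes in UTF-8."""
    # Code points above U+FFFF (emoji, rare CJK extensions) take 4 bytes.
    return any(ord(ch) > 0xFFFF for ch in text)


def max_bytes_per_char(text: str, charset: str) -> int:
    """Largest encoded size of any single character in the given charset."""
    return max((len(ch.encode(charset)) for ch in text), default=0)


if __name__ == "__main__":
    print(needs_utf8mb4("数据库"))              # False
    print(needs_utf8mb4("ok 😀"))               # True
    print(max_bytes_per_char("数据库", "gbk"))    # 2
    print(max_bytes_per_char("数据库", "utf-8"))  # 3
```

Running a check like this over a sample of the source data gives a quick indication of whether the 3‑byte utf8 is enough.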

The Conversion Chain

Client → character_set_client → gcluster → character_set_connection → Storage (table/db charset) → character_set_results → Client

Mojibake happens only where the character set on one side of a conversion doesn't match the actual byte encoding. The golden rule: keep client = connection = results = table charset.
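The effect of a mismatched link can be reproduced outside the database. The Python sketch below is only an illustration of the principle (it does not model GBase internals, and the helper name is hypothetical): bytes written in one encoding are read back under a different assumption, and a character missing from the target charset degrades to a question mark.

```python
def transcode_wrongly(text: str, actual: str, assumed: str) -> str:
    """Encode text as `actual`, then decode the bytes as if they were `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")


original = "数据库"

# Every link agrees on the encoding: the text survives.
intact = transcode_wrongly(original, actual="utf-8", assumed="utf-8")

# One link believes the UTF-8 bytes are GBK: the text is garbled.
garbled = transcode_wrongly(original, actual="utf-8", assumed="gbk")

# A character that the target charset cannot represent becomes "?".
lossy = "表😀".encode("gbk", errors="replace").decode("gbk")

if __name__ == "__main__":
    print(intact == original)   # True
    print(garbled == original)  # False
    print(lossy)                # 表?
```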

Character Set Parameters

SHOW VARIABLES LIKE '%character%';

Parameter                | Purpose                           | Scope
character_set_server     | Instance charset (set at install) | Config file, restart needed
character_set_client     | Encoding of SQL sent by client    | SET or config file
character_set_connection | Server‑side intermediate charset  | SET or config file
character_set_results    | Encoding of results returned      | SET or config file
character_set_sort       | Collation for ordering            | SET or config file
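Once the variable values are in hand, checking them against the table charset is mechanical. Here is a minimal Python sketch, assuming the SHOW VARIABLES output has already been collected into a dictionary; the function name and the sample values are made up for illustration.

```python
SESSION_VARS = (
    "character_set_client",
    "character_set_connection",
    "character_set_results",
)


def find_mismatches(variables, table_charset):
    """Return the session variables whose value differs from the table charset."""
    expected = table_charset.lower()
    return [
        name
        for name in SESSION_VARS
        if variables.get(name, "").lower() != expected
    ]


# Example values as they might be copied from SHOW VARIABLES output (made up).
session = {
    "character_set_client": "gbk",
    "character_set_connection": "utf8",
    "character_set_results": "utf8",
}

if __name__ == "__main__":
    print(find_mismatches(session, "utf8"))  # ['character_set_client']
```

An empty list means the three session variables agree with the table charset.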

Configuration Best Practices

At installation time (before gbase.cnf is locked):

[client]
default-character-set = utf8

[gbased]
default_character_set = utf8

At session level, align client, connection, and results with a single command:

SET NAMES gbk;

In JDBC:

String url = "jdbc:gbase://host:5258/db?characterEncoding=utf8&useUnicode=true";

Diagnosing and Fixing Mojibake

Scenario 1: gccli displays question marks or boxes

Check the session charset with SHOW VARIABLES, verify the terminal locale (echo $LANG), and confirm the table charset. Fix with SET NAMES utf8; or SET NAMES gbk; to match the terminal.
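Matching the session to the terminal can be scripted. The sketch below maps a $LANG value to a SET NAMES statement; it is a hypothetical Python helper, and the mapping covers only the charsets listed in this guide (GB2312 is mapped to gbk on the assumption that GBK, as a superset, is the appropriate choice).

```python
def suggest_set_names(lang: str) -> str:
    """Turn a locale string such as 'zh_CN.UTF-8' into a SET NAMES statement."""
    # The codeset is the part after the dot, before any '@modifier'.
    codeset = lang.partition(".")[2].split("@")[0]
    normalized = codeset.replace("-", "").replace("_", "").lower()
    mapping = {
        "utf8": "utf8",
        "gbk": "gbk",
        "gb2312": "gbk",   # assumption: GBK is a superset of GB2312
        "gb18030": "gb18030",
    }
    charset = mapping.get(normalized)
    if charset is None:
        raise ValueError(f"no charset suggestion for locale {lang!r}")
    return f"SET NAMES {charset};"


if __name__ == "__main__":
    print(suggest_set_names("zh_CN.UTF-8"))  # SET NAMES utf8;
    print(suggest_set_names("zh_CN.GBK"))    # SET NAMES gbk;
```

A locale without a codeset, such as C or POSIX, raises ValueError rather than guessing.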

Scenario 2: LOAD DATA produces mojibake

GBase 8a's LOAD DATA does not transcode. Convert the file at the OS level first:

iconv -f GBK -t UTF-8 data_gbk.csv > data_utf8.csv

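Then load the file, declaring the charset that matches its actual encoding. The statement below declares gbk for a GBK‑encoded file; for the converted file above, declare utf8 instead: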

LOAD DATA INFILE '/data/import/data.csv'
INTO TABLE t_product
CHARACTER SET gbk
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
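When the encoding of an incoming file is unknown, a strict decode attempt is a cheap first test before running iconv or declaring a charset. The following is a Python sketch with hypothetical function names; it is a heuristic only, since pure ASCII is valid in both encodings and some GBK byte sequences can also happen to be valid UTF‑8.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'utf-8' or 'gb18030' by attempting strict decodes, in that order."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # gb18030 is a superset of GBK; this raises UnicodeDecodeError
    # if the bytes are not valid in that encoding either.
    data.decode("gb18030")
    return "gb18030"


def to_utf8(data: bytes) -> bytes:
    """Re-encode the data as UTF-8 based on the guessed source encoding."""
    return data.decode(guess_encoding(data)).encode("utf-8")


if __name__ == "__main__":
    sample = "产品,价格\n".encode("gbk")
    print(guess_encoding(sample))           # gb18030
    print(to_utf8(sample).decode("utf-8"))  # 产品,价格
```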

Scenario 3: JDBC writes are garbled

Ensure the JDBC URL includes characterEncoding=utf8. Otherwise the JVM default encoding may be used.
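A pre‑flight check can catch a connection string that omits the parameter. The sketch below is written in Python purely for illustration, and the helper name is hypothetical; it inspects only the URL text and does not talk to the driver.

```python
from typing import Optional
from urllib.parse import parse_qs


def jdbc_encoding(url: str) -> Optional[str]:
    """Return the characterEncoding value from a JDBC URL, or None if absent."""
    # Everything after the first '?' is treated as the parameter list.
    query = url.partition("?")[2]
    values = parse_qs(query).get("characterEncoding")
    return values[0] if values else None


if __name__ == "__main__":
    good = "jdbc:gbase://host:5258/db?characterEncoding=utf8&useUnicode=true"
    bad = "jdbc:gbase://host:5258/db?useUnicode=true"
    print(jdbc_encoding(good))  # utf8
    print(jdbc_encoding(bad))   # None
```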

Scenario 4: SELECT INTO OUTFILE output is garbled when opened in Excel

Export as GBK with Windows‑style line endings:

SET character_set_results = gbk;
SELECT ... INTO OUTFILE '/data/export/result.csv'
CHARACTER SET gbk
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
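An alternative to a GBK export is to keep the file in UTF‑8 and prepend a byte order mark, which Excel generally uses to recognize UTF‑8. Here is a minimal Python sketch of that post‑processing step, assuming the export was written as UTF‑8; the function name is hypothetical.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 byte order mark unless it is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


if __name__ == "__main__":
    csv_bytes = "编号,名称\r\n1,产品\r\n".encode("utf-8")
    with_bom = add_utf8_bom(csv_bytes)
    print(with_bom[:3])                      # b'\xef\xbb\xbf'
    print(add_utf8_bom(with_bom) == with_bom)  # True
```

The function is idempotent, so running it twice on the same export does not stack BOMs.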

Scenario 5: Mixing character sets in a single table

Some versions support column‑level charset:

CREATE TABLE t_mixed (
    id INT,
    name_cn   VARCHAR(100) CHARACTER SET gbk,
    name_intl VARCHAR(100) CHARACTER SET utf8mb4
) DISTRIBUTED BY HASH(id);

Changing the Character Set of an Existing Table

Because ALTER TABLE ... CONVERT TO CHARACTER SET is not supported, the only safe method is to rebuild:

-- 1. Export with the target charset
SET character_set_results = utf8;
SELECT * INTO OUTFILE '/tmp/t_product_utf8.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
FROM t_product_gbk;

-- 2. Create new table with the target charset
CREATE TABLE t_product_new (...) ENGINE=EXPRESS DEFAULT CHARSET=utf8;

-- 3. Load the exported data
LOAD DATA INFILE '/tmp/t_product_utf8.csv'
INTO TABLE t_product_new CHARACTER SET utf8
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- 4. Swap tables after validation
ALTER TABLE t_product_gbk RENAME TO t_product_gbk_bak;
ALTER TABLE t_product_new RENAME TO t_product_gbk;

Chinese Pinyin Sorting

By default, GBase 8a sorts by binary value. For pinyin ordering:

SET character_set_sort = gbk;
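A GBK‑based sort can produce pinyin order because GBK (like GB2312) lays out the most common Hanzi in pinyin order, so comparing GBK bytes approximates a pinyin sort for those characters. The Python sketch below illustrates that property only. Whether character_set_sort = gbk compares by GBK byte value internally is an assumption here, and the helper name is hypothetical.

```python
def sort_by_gbk_bytes(words):
    """Sort strings by their GBK-encoded byte sequences."""
    return sorted(words, key=lambda w: w.encode("gbk"))


surnames = ["张", "李", "王", "陈"]  # zhang, li, wang, chen

# Default string comparison orders by Unicode code point.
by_code_point = sorted(surnames)

# GBK byte order follows pinyin for these common characters.
by_gbk = sort_by_gbk_bytes(surnames)

if __name__ == "__main__":
    print(by_code_point)  # ['张', '李', '王', '陈']
    print(by_gbk)         # ['陈', '李', '王', '张']
```

Rare characters outside the pinyin‑ordered range of GBK will not follow this pattern, so results for unusual names should be checked.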

Quick Reference

Symptom                              | Likely Cause                                           | Fix
Results show ???                     | character_set_results mismatch                         | SET NAMES to match the client locale
Square blocks in results             | Encoding at write time did not match the table charset | Re‑import with consistent encoding
LOAD DATA produces mojibake          | File encoding differs from the declared charset        | Convert the file with iconv first
JDBC writes are garbled              | Missing characterEncoding in URL                       | Add characterEncoding=utf8
CSV garbled in Excel                 | UTF‑8 file opened as GBK                               | Export as GBK or add a BOM
Chinese text does not sort by pinyin | Binary sort by default                                 | SET character_set_sort = gbk
Emoji insert fails                   | utf8 doesn't support 4‑byte chars                      | Switch to utf8mb4

Mastering character sets in a GBase database eliminates one of the most persistent and confusing classes of bugs. The rule is simple: know the encoding of every link in the chain, and never let a conversion happen implicitly.
