How Character Sets, Collations, and String Comparisons Cause Subtle Data Drift in GBase 8a

#gbase #database #数据库

Many "the results look wrong" problems in a gbase database don't come from incorrect imports or bad SQL logic. They come from mismatched character sets, collations, case‑handling, and string‑comparison semantics. The real danger: data loads fine, SQL runs without errors, pages render normally — but the results quietly diverge from what the business expects.

Common Symptoms

Two tables with seemingly identical business keys, yet the join match rate is surprisingly low.
A GROUP BY user_code returns more groups than expected — values that look the same are treated as different.
WHERE shop_name = 'Beijing_01' works in test but returns nothing in production.
The same customer number appears in both uppercase and lowercase, so downstream deduplication still leaves duplicates.
Text fields imported from different systems look identical to the eye but never compare equal.
Front‑end report filters return inconsistent counts, ultimately traced back to trailing spaces or invisible characters.

The common thread: the problem isn't whether strings can be stored, but how they are compared, grouped, and joined.

How to Diagnose

I typically break the problem into three layers when on‑site.

Layer 1: Are the raw values actually the same?

Check whether strings that look identical differ at the byte, space, case, or hidden‑character level.

SELECT cust_code, LENGTH(cust_code) AS len1, HEX(cust_code) AS hex1
FROM dwd_trade_order
WHERE cust_code LIKE 'AB12%';

If two values that look the same return different LENGTH() or HEX() results, the direction is clear.

Layer 2: What are the actual comparison rules?

Values can be identical, yet comparison behavior differs. Pull the table definitions first:

SHOW CREATE TABLE dim_customer;
SHOW CREATE TABLE ods_customer_src;

Focus on: mixed CHAR vs VARCHAR usage, direct comparisons between fields of different character sets, and explicit or implicit conversions inside expressions.

Layer 3: Did business logic amplify the dirt?

Many issues aren't born in the raw layer — they're amplified during aggregation and modeling. Occasional case inconsistencies at the source become persistent problems when those columns become join keys, grouping columns, or dedup keys in downstream layers.

A Realistic Example

A retail system syncs store info from a membership system (uppercase codes) and a trading system (lowercase codes, some with trailing spaces). A direct join:

SELECT a.store_code, a.store_name, b.trade_dt, b.sale_amt
FROM ods_store_member a
JOIN ods_store_trade b ON a.store_code = b.store_code;

The match rate is suspiciously low. First, validate the differences:

SELECT store_code, LENGTH(store_code) AS code_len, HEX(store_code) AS code_hex
FROM ods_store_trade WHERE store_code LIKE '%BJ001%';

Then run three quick comparisons:

Compare 1: Raw vs. normalized value

SELECT store_code AS raw_code,
       TRIM(store_code) AS trim_code,
       UPPER(TRIM(store_code)) AS norm_code,
       LENGTH(store_code) AS raw_len,
       LENGTH(TRIM(store_code)) AS trim_len
FROM ods_store_trade WHERE store_code LIKE '%001%' LIMIT 20;

Compare 2: Join match rate before and after normalization

-- Original join
SELECT COUNT(*) FROM ods_store_member a
JOIN ods_store_trade b ON a.store_code = b.store_code;

-- Normalized join
SELECT COUNT(*) FROM ods_store_member a
JOIN ods_store_trade b
  ON UPPER(TRIM(a.store_code)) = UPPER(TRIM(b.store_code));

Compare 3: Grouping results before and after normalization

-- Original GROUP BY
SELECT store_code, COUNT(*) FROM ods_store_trade
GROUP BY store_code ORDER BY cnt DESC LIMIT 20;

-- Normalized GROUP BY
SELECT UPPER(TRIM(store_code)) AS norm_code, COUNT(*) AS cnt
FROM ods_store_trade GROUP BY UPPER(TRIM(store_code))
ORDER BY cnt DESC LIMIT 20;

A Practical Governance Strategy

1. Identify columns that require "strong normalization"

Prioritize: business keys, dimension codes, common filter values, columns used in JOIN/GROUP BY/DISTINCT, and columns used as downstream interface keys.

2. Keep the original value, but generate a normalized version at the detail layer

Don't overwrite source fields. Preserve both:

CREATE TABLE dwd_store_trade AS
SELECT store_code AS store_code_raw,
       UPPER(TRIM(store_code)) AS store_code_norm,
       trade_dt, sale_amt
FROM ods_store_trade;

All downstream joins and groupings should use the normalized column, keeping the aggregation layer clean and consistent.

3. Turn anomaly detection into routine checks

-- Trailing/missing spaces
SELECT COUNT(*) FROM ods_store_trade WHERE store_code <> TRIM(store_code);

-- Mixed case
SELECT SUM(CASE WHEN store_code = UPPER(store_code) THEN 1 ELSE 0 END) AS upper_cnt,
       SUM(CASE WHEN store_code = LOWER(store_code) THEN 1 ELSE 0 END) AS lower_cnt
FROM ods_store_trade;

-- Codes that merge after normalization
SELECT UPPER(TRIM(store_code)) AS norm_code,
       COUNT(DISTINCT store_code) AS raw_variant_cnt
FROM ods_store_trade GROUP BY UPPER(TRIM(store_code))
HAVING COUNT(DISTINCT store_code) > 1;

Common Misconceptions

"No garbled text means the character set is fine": No garbled text only means the data displays — it says nothing about comparison correctness.
"VARCHAR is naturally safe": Variable length doesn't guarantee clean content. Spaces, case, and invisible characters still cause deviations.
"A text business key is no big deal": Once a column serves as a join key, it can't be treated like a descriptive text field.
"I'll just TRIM it in the query and call it done": Ad‑hoc normalization is a band‑aid, not a fix. Without source governance, every downstream query ends up reinventing different cleaning rules.

Shell‑Based Validation Script

BIZ_DT=$(date +%F)
gccli -h ${DBHOST} -u ${DBUSER} ${DBNAME} <<SQL >> string_check_${BIZ_DT}.log 2>&1
SELECT NOW();
SELECT COUNT(*) FROM ods_store_trade WHERE store_code <> TRIM(store_code);
SELECT UPPER(TRIM(store_code)) AS norm_code, COUNT(DISTINCT store_code) AS raw_variant_cnt
FROM ods_store_trade GROUP BY UPPER(TRIM(store_code))
HAVING COUNT(DISTINCT store_code) > 1 LIMIT 50;
SELECT NOW();
SQL

Recommended Sequence

Stabilize results first, then fix the source. If reports are already affected, use normalized columns at the aggregation layer to lock down correctness before pushing governance upstream. For key text columns, always keep both the original value and a normalized column — don't leave every downstream query to invent its own cleaning rules. And save the diagnostic SQL snippets (LENGTH/HEX checks, before/after join counts, before/after grouping counts) as permanent tools — they'll save you hours the next time a similar issue surfaces.

String‑related faults in a gbase database are uncomfortable not because they're complex, but because they look like "nothing is wrong." Data loads, SQL runs, pages render — the results just drift. The approach above — verify the values themselves, then the comparison rules, then check whether joins or aggregations are amplifying the differences — is granular, but it usually closes the case faster than tweaking SQL on a hunch.