How NULL Values Skew Comparisons, Aggregations, and Deduplication in GBase 8a

#gbase #database #数据库 #sql

NULL in a gbase database doesn't behave like an ordinary value, nor is it simply a blank. It affects comparisons, deduplication, aggregations, and conditional logic in ways that often contradict intuition. Many "data is fine but results are off" issues trace back to mishandled NULL semantics.

First, Separate Three Easily Confused Concepts

Value	Meaning	Most Common Misjudgment
NULL	Unknown, missing, not assigned	Treated as an empty string
Empty string ''	Zero‑length string	Treated as NULL
Default value	Filled by business rules or DDL	Treated as a real business value

When these three coexist, you get the classic "some rows match, some don't" confusion.

Common Symptoms

WHERE col = '' returns nothing, but rows clearly exist.
COUNT(col) is far smaller than COUNT(*) — business thinks data was lost.
A mysterious "empty group" appears after GROUP BY, and nobody can say whether it's NULL or an empty string.
After a LEFT JOIN, dimension columns become NULL, and subsequent filters or aggregations silently shift the result.
CASE WHEN col = NULL THEN ... looks logical but produces wrong outcomes.

How to Diagnose

Step 1: Quantify the distribution of NULLs, empty strings, and valid values

Don't guess — query the actual distribution:

SELECT
    COUNT(*) AS total_cnt,
    SUM(CASE WHEN cust_level IS NULL THEN 1 ELSE 0 END) AS null_cnt,
    SUM(CASE WHEN cust_level = '' THEN 1 ELSE 0 END) AS empty_cnt,
    SUM(CASE WHEN cust_level IS NOT NULL AND cust_level <> '' THEN 1 ELSE 0 END) AS valid_cnt
FROM dwd_customer;

Often, NULLs and empty strings come from completely different sources.

Step 2: Verify the aggregation function matches the business intent

SELECT COUNT(*) AS total_rows, COUNT(cust_level) AS non_null_rows FROM dwd_customer;

If the requirement is "total row count," don't casually write COUNT(column). The core question is: are you counting records, or non‑null values?

Step 3: Check whether conditions silently exclude NULLs

-- This does not include NULL rows
SELECT * FROM dwd_customer WHERE cust_level <> 'VIP';

-- To include unknowns:
SELECT * FROM dwd_customer WHERE cust_level <> 'VIP' OR cust_level IS NULL;

A Realistic Example

A customer table has channel_code values that are a mix of real values, NULLs, and empty strings. A business query counts users by channel:

SELECT channel_code, COUNT(*) AS cust_cnt
FROM dwd_customer
GROUP BY channel_code;

An "empty channel" row appears, sparking debate. A more robust approach is to standardize the semantics before aggregation:

SELECT
    CASE
        WHEN channel_code IS NULL THEN 'NULL_VALUE'
        WHEN channel_code = '' THEN 'EMPTY_STRING'
        ELSE channel_code
    END AS channel_tag,
    COUNT(*) AS cust_cnt
FROM dwd_customer
GROUP BY channel_tag;

This shifts the conversation from "the result feels wrong" to "which specific class of values is affecting the output."

SQL Areas Most Affected by NULL

SQL Type	Common Deviation	What to Check First
Filtering	Missing NULL rows	Are `IS NULL` / `IS NOT NULL` explicitly written?
Aggregation	`COUNT(col)` too low	Is a non‑null count mistaken for a total row count?
Conditional	`CASE WHEN col = NULL` not working	Is the comparison syntax correct?
Grouping	NULL and empty string conflated	Is the grouping semantics standardized first?

Common Pitfalls

Treating NULL as an empty string — especially common in text columns. They look the same but have different semantics.
No agreed business semantics, so developers write their own defaults — one report treats NULL as "Unknown", another simply excludes it. Without alignment, results diverge.
Ignoring NULLs introduced by LEFT JOIN — right‑side columns become NULL, and downstream filters or groupings unwittingly change the data scope.
Fixing NULLs only at query time, while the raw layer remains messy — if a column persistently contains both NULLs and empty strings, upstream governance is already broken.

Recommendations

Make semantics explicit in SQL: Use CASE WHEN to map NULLs and empty strings to business labels instead of relying on mental assumptions.
Schedule regular NULL‑distribution checks for critical columns — treat it as routine inspection, not a last‑minute fire drill.
Standardize key columns at the detail layer — don't let every downstream query invent its own NULL‑handling logic.

NULL rarely throws an error in a gbase database. It just quietly changes the answer. Separating NULLs, empty strings, and default values before you compare, aggregate, or group is usually the fastest way to close a result‑accuracy debate.