DEV Community

Michael
Michael

Posted on • Originally published at gbase.cn

How NULL Values Skew Comparisons, Aggregations, and Deduplication in GBase 8a

NULL in a gbase database doesn't behave like an ordinary value, nor is it simply a blank. It affects comparisons, deduplication, aggregations, and conditional logic in ways that often contradict intuition. Many "data is fine but results are off" issues trace back to mishandled NULL semantics.

First, Separate Three Easily Confused Concepts

Value Meaning Most Common Misjudgment
NULL Unknown, missing, not assigned Treated as an empty string
Empty string '' Zero‑length string Treated as NULL
Default value Filled by business rules or DDL Treated as a real business value

When these three coexist, you get the classic "some rows match, some don't" confusion.

Common Symptoms

  • WHERE col = '' returns nothing, but rows clearly exist.
  • COUNT(col) is far smaller than COUNT(*) — business thinks data was lost.
  • A mysterious "empty group" appears after GROUP BY, and nobody can say whether it's NULL or an empty string.
  • After a LEFT JOIN, dimension columns become NULL, and subsequent filters or aggregations silently shift the result.
  • CASE WHEN col = NULL THEN ... looks logical but produces wrong outcomes.

How to Diagnose

Step 1: Quantify the distribution of NULLs, empty strings, and valid values

Don't guess — query the actual distribution:

SELECT
    COUNT(*) AS total_cnt,
    SUM(CASE WHEN cust_level IS NULL THEN 1 ELSE 0 END) AS null_cnt,
    SUM(CASE WHEN cust_level = '' THEN 1 ELSE 0 END) AS empty_cnt,
    SUM(CASE WHEN cust_level IS NOT NULL AND cust_level <> '' THEN 1 ELSE 0 END) AS valid_cnt
FROM dwd_customer;
Enter fullscreen mode Exit fullscreen mode

Often, NULLs and empty strings come from completely different sources.

Step 2: Verify the aggregation function matches the business intent

SELECT COUNT(*) AS total_rows, COUNT(cust_level) AS non_null_rows FROM dwd_customer;
Enter fullscreen mode Exit fullscreen mode

If the requirement is "total row count," don't casually write COUNT(column). The core question is: are you counting records, or non‑null values?

Step 3: Check whether conditions silently exclude NULLs

-- This does not include NULL rows
SELECT * FROM dwd_customer WHERE cust_level <> 'VIP';

-- To include unknowns:
SELECT * FROM dwd_customer WHERE cust_level <> 'VIP' OR cust_level IS NULL;
Enter fullscreen mode Exit fullscreen mode

A Realistic Example

A customer table has channel_code values that are a mix of real values, NULLs, and empty strings. A business query counts users by channel:

SELECT channel_code, COUNT(*) AS cust_cnt
FROM dwd_customer
GROUP BY channel_code;
Enter fullscreen mode Exit fullscreen mode

An "empty channel" row appears, sparking debate. A more robust approach is to standardize the semantics before aggregation:

SELECT
    CASE
        WHEN channel_code IS NULL THEN 'NULL_VALUE'
        WHEN channel_code = '' THEN 'EMPTY_STRING'
        ELSE channel_code
    END AS channel_tag,
    COUNT(*) AS cust_cnt
FROM dwd_customer
GROUP BY channel_tag;
Enter fullscreen mode Exit fullscreen mode

This shifts the conversation from "the result feels wrong" to "which specific class of values is affecting the output."

SQL Areas Most Affected by NULL

SQL Type Common Deviation What to Check First
Filtering Missing NULL rows Are IS NULL / IS NOT NULL explicitly written?
Aggregation COUNT(col) too low Is a non‑null count mistaken for a total row count?
Conditional CASE WHEN col = NULL not working Is the comparison syntax correct?
Grouping NULL and empty string conflated Is the grouping semantics standardized first?

Common Pitfalls

  1. Treating NULL as an empty string — especially common in text columns. They look the same but have different semantics.
  2. No agreed business semantics, so developers write their own defaults — one report treats NULL as "Unknown", another simply excludes it. Without alignment, results diverge.
  3. Ignoring NULLs introduced by LEFT JOIN — right‑side columns become NULL, and downstream filters or groupings unwittingly change the data scope.
  4. Fixing NULLs only at query time, while the raw layer remains messy — if a column persistently contains both NULLs and empty strings, upstream governance is already broken.

Recommendations

  1. Make semantics explicit in SQL: Use CASE WHEN to map NULLs and empty strings to business labels instead of relying on mental assumptions.
  2. Schedule regular NULL‑distribution checks for critical columns — treat it as routine inspection, not a last‑minute fire drill.
  3. Standardize key columns at the detail layer — don't let every downstream query invent its own NULL‑handling logic.

NULL rarely throws an error in a gbase database. It just quietly changes the answer. Separating NULLs, empty strings, and default values before you compare, aggregate, or group is usually the fastest way to close a result‑accuracy debate.

Top comments (0)