(5)When Your Data Warehouse Breaks Down, It’s Probably a Naming Problem

#datawarehouse #datascience #bigdata #database

As a data warehouse grows, the first thing that tends to get out of control is not the data itself—but naming. Naming conventions may seem like a minor detail, but they directly determine whether data is easy to find, understand, and maintain. As the fifth article in the Data Lakehouse Design and Practice series, this article starts from real-world usage and summarizes core methods for table and field naming. By combining layered prefixes, unified terminology (word roots), and cycle encoding, table names become self-explanatory. Together with metric naming and governance processes, this helps build a clear and collaborative data system.

Goals and Methods of Naming Conventions: Make Table Names Self-Explanatory and Teams Work Automatically

In a data warehouse system, naming conventions are not just about form—they are foundational infrastructure that directly impacts collaboration efficiency and data quality. A good naming system has one core goal: make the table name itself carry enough information so that people can understand what the table is, where it comes from, and how to use it—without needing extra documentation. Ideally, a table name should be “readable at a glance” and include key information such as data layer, owning team, business domain, subject domain, core object meaning, and update cycle or data scope. When these elements are systematically encoded into table names, data discovery, metric interpretation, troubleshooting, and team handovers all become significantly more efficient, reducing communication costs.

A naming system is essentially a “word root system” that standardizes business language. For example, the same business object must use the same term consistently across tables (e.g., avoid mixing “rack” and “shelf”). Similarly, metric naming should follow unified rules—for instance, all ratio-type metrics should use the _rate suffix, avoiding ambiguity from mixing terms like ratio, percent, or rt.

Layer prefixes must be strictly standardized. They allow users to immediately identify the data layer and purpose of a table: ods_ for source-aligned data, dwd_ for detailed standardized data, dws_ for aggregated data, ads_ for application-facing outputs, and dim_ for shared dimensions. These prefixes are not just naming conventions—they directly reflect the data architecture.

Another often overlooked but critical aspect is encoding update cycles or data scope into table names. For example, _1d represents the last day, _td means up to today, and _7d means the last seven days. This prevents confusion between tables with the same name but different time semantics, reducing the risk of metric misuse.

At the asset management level, table types must be clearly distinguished. Production tables are long-term assets, intermediate tables serve only processing workflows and should have retention policies, and temporary tables are for one-time validation and must not enter production pipelines. Prefixes like mid_ and tmp_ help prevent data asset pollution at the source.

Finally, naming conventions must be integrated with governance processes. Any new table or field must include complete metadata such as owner, field definitions, metric definitions, update frequency, dependencies, and lifecycle. Tables without such metadata may be usable in the short term but will almost certainly become technical debt in the long run. In practice, it is best to standardize templates first—ensuring key fields like layer, domain, and cycle are strictly consistent—while allowing limited flexibility in non-critical parts.

Table Naming Conventions: Templates, Cycle Encoding, and Examples

In practice, table naming should follow a structured template to ensure completeness and consistency. A general template can be defined as {layer}_{dept}_{biz_domain}_{subject}_{object}_{cycle_or_range}, where each component has a clear role: layer indicates data level, dept indicates ownership, biz_domain defines the business domain, subject represents analytical abstraction, object defines the entity or behavior, and cycle_or_range specifies the time scope.

Cycle and range encoding is especially important. Common patterns include _1d (last day), _td (to date), _7d or _30d (last N days). Additional markers can distinguish data types or update modes, such as d for daily snapshots, w for weekly data, i for incremental tables, f for full tables, and l for slowly changing tables. These conventions allow users to quickly understand temporal semantics.

For example, in the aggregation layer, dws_asale_trd_byr_subpay_1d represents buyer-level, staged payment transactions aggregated over the last day, while dws_asale_trd_itm_slr_hh represents hourly aggregation at the seller-item level. Although long, such names are highly informative and readable.

Dimension tables follow a separate convention, using the dim_ prefix and a {scope}_{object} structure, such as dim_pub_area (public area dimension) or dim_asale_item (item dimension), emphasizing cross-domain reuse.

Intermediate tables should be tightly bound to their target tables, typically named as mid_{target_table}_{suffix}, such as mid_dws_xxx_01. Temporary tables must use the tmp_ prefix and are strictly limited to development or validation, never entering production dependencies. For manually maintained data, tables in the DWD layer can explicitly include manual, such as dwd_trade_manual_client_info_l.

Field and Metric Naming Conventions: Rules, Structure, and Examples

At the field level, naming must be strictly standardized. All field names should use lowercase with underscores—camelCase is not allowed. Readability should take priority over brevity, and consistent naming must be maintained for the same semantic meaning.

Partition fields should be unified globally—for example, dt for date, hh for hour, and mi for minute—with fixed formats. This improves development efficiency and avoids confusion across tables.

Field suffixes should clearly indicate meaning: _cnt for counts, _amt or _price for monetary values (choose one consistently), and boolean fields should use the is_ prefix and never be nullable. These conventions allow users to infer data types and meanings at a glance.

NULL handling must also follow consistent rules. Typically, dimension fields use -1 for unknown values, while metric fields use 0 to indicate no occurrence. This prevents NULL propagation in aggregations and improves data stability.

Metric naming should be structured as a combination of business qualifier, time qualifier, aggregation method, and base metric. For example, trade_amt represents transaction amount, install_poi_cnt represents installation point count, and pay_succ_rate represents payment success rate. Aggregation methods should use fixed terms like sum, avg, max, and min, avoiding inconsistent alternatives like “total.”

A full example from fields to metrics: in the detail layer, an incremental order table might be named dwd_trade_order_i, containing fields such as order ID, user ID, payment amount, order status, and partition keys. In the aggregation layer, dws_trade_user_pay_1d summarizes user-level payments over the last day, including metrics like payment success count, total payment amount, and success rate. Finally, in the application layer, a table like ads_fin_kpi_board_d provides business-facing dashboards with KPIs such as GMV, refund amount, net revenue, and number of paying users.

By standardizing naming across tables, fields, and metrics, a data warehouse can achieve clear semantics, consistent structure, and efficient collaboration. While such conventions may introduce some overhead initially, they are essential for scalability and team coordination in the long term.