Michael

Posted on Jun 5 • Originally published at gbase.cn

GBase 8a: When to Materialise Intermediate Results and How to Manage Temporary Tables

#gbase #database #数据库 #performance

In a gbase database, many hard-to-debug, unstable queries aren't caused by insufficient compute or a flawed model. The real culprit is often an unclear strategy for intermediate results: should they be materialised? As a temporary table or a staging table? And once created, how are they managed? This article offers a practical framework, drawn from real-world experience, for making those decisions.

Why Intermediate Results Deserve Their Own Strategy

Analytical pipelines naturally have stages: filter target orders → identify valid users → join dimensions → aggregate. Stuffing every step into a single SQL statement keeps the object count low, but makes troubleshooting and validation painful. A single statement isn't inherently better. What often matters more is whether the pipeline is stable, problems are easy to locate, and results can be verified quickly.

Common Symptoms in the Field

A long SQL statement with deeply nested subqueries performs erratically as data volumes grow.
The same expensive filtering logic is re‑computed in every run, even when the source data barely changes.
Debugging requires modifying the entire statement; there's no way to test a single step in isolation.
Temporary tables are created casually, never cleaned up, and eventually become unmaintained "semi‑permanent" objects.

When to Materialise: A Decision Table

Question	Lean Toward Not Materialising	Lean Toward Materialising
Will the intermediate set be reused?	Only once	Multiple times
Does the business frequently review this step?	Rarely	Often
Is the source of an error easy to pinpoint?	Yes	No
Does the step involve complex filtering/dedup/attribution?	No	Yes
Is it suitable to keep the result by batch?	No	Yes

In particular, materialise intermediate sets that are reused, represent complex business definitions that are often debated, or whose correctness is hard to trace if buried inside a large query.

Temporary Table vs. Staging Table vs. Final Table

Type	Role	Best For	Watch Points
Temporary table	Session‑scoped, short‑lived	Debugging, ad‑hoc analysis, quick step breakdown	Cleanup after session ends
Staging table	Job‑scoped, batch‑based	Stable batch processing, complex definitions split into steps	Naming convention, rerun policy, retention period
Final table	Long‑lived, service‑layer	Downstream queries, report consumption	Stable definitions, permissions, change control

Don't mix temporary tables with staging tables. For ad‑hoc debugging, temporary tables are fine. Once a step becomes part of a scheduled job, use a staging table with clear naming, cleanup, and rerun rules.

A Realistic Example

Counting new‑customer spend by store and category during a promotion. The original approach stuffs new‑customer identification, activity‑period filtering, multi‑table joins, and aggregation into one statement. When the result is off, it's impossible to tell whether the new‑customer definition was wrong or the join/aggregation skewed.

Step‑by‑Step Instead

Step 1: Materialise the new‑customer set

CREATE TEMPORARY TABLE tmp_new_user_202603 AS
SELECT user_id
FROM fact_order
WHERE pay_time >= '2026-03-01'
  AND pay_time < '2026-04-01'
GROUP BY user_id
HAVING MIN(pay_time) >= '2026-03-20';

Step 2: Materialise the paid order subset for the activity period

CREATE TEMPORARY TABLE tmp_paid_order_20260320 AS
SELECT order_id, user_id, store_id, product_id, pay_amt
FROM fact_order
WHERE pay_status = 'PAID'
  AND pay_time >= '2026-03-20'
  AND pay_time < '2026-03-27';

Step 3: Join and aggregate

SELECT d.store_id, p.category_id,
       SUM(o.pay_amt) AS pay_amt,
       COUNT(DISTINCT o.user_id) AS new_user_cnt
FROM tmp_paid_order_20260320 o
JOIN tmp_new_user_202603 nu ON o.user_id = nu.user_id
JOIN dim_store d ON o.store_id = d.store_id
JOIN dim_product p ON o.product_id = p.product_id
GROUP BY d.store_id, p.category_id;

Each step can now be validated independently, making it far easier to locate where a deviation occurred.

Managing Materialised Objects

Only materialise high‑value steps, not every long SQL.
Keep ad‑hoc temps and job staging tables separate.
Always have a cleanup strategy — staging tables should be kept per batch with a defined retention.
Name clearly: include domain, role, and batch, e.g., stg_trade_valid_order_20260327.

A Shell Wrapper for a Staging Step

BIZ_DT=2026-03-27
gccli -h ${DBHOST} -u ${DBUSER} ${DBNAME} <<SQL
DROP TABLE IF EXISTS stg_trade_valid_order_${BIZ_DT};
CREATE TABLE stg_trade_valid_order_${BIZ_DT} AS
SELECT order_id, user_id, store_id, product_id, pay_amt
FROM fact_order
WHERE dt = '${BIZ_DT}' AND pay_status = 'PAID';
SELECT COUNT(*) AS row_cnt FROM stg_trade_valid_order_${BIZ_DT};
SQL

This script enforces batch‑aware object creation, pre‑run cleanup, and a basic row‑count check.

Recommended Decision Sequence

Identify the step whose correctness matters most.
Determine whether the result will be reused.
Make sure the materialised result is understandable to both developers and business reviewers.
Design the cleanup and rerun policy upfront.

The value of temporary and staging tables in GBASE's GBase 8a isn't about rescuing a broken SQL statement. It's about managing complex computational pipelines so that results are verifiable, problems are traceable, and re‑computation is under control.

DEV Community