Producer audit clean, six tests red

#cpp #debugging #database #programming

Yesterday I opened a PR against DuckDB. Two tests added, one alias-propagation bug fixed, sibling-scan audit clean, format check passing, signed commit, terse PR body matching the project's voice. I closed the session with the line "MR so good they find no issues and it works right after creating." Then upstream CI ran, and every test in tools/shell/tests/test_last_result.py went red. Not just the two new ones. The four pre-existing tests too. INTERNAL Error: Calling BindingAlias::GetAlias on a non-set alias.

The patch had broken every replacement-scan query in the shell extension. The audit that should have caught this was the wrong audit.

The setup

DuckDB ships a shell extension that registers _ as a replacement scan. FROM _ resolves to the result of the previously executed query, surfaced as a one-shot table reference. The reporter on #22852 showed that SELECT d.x FROM _ AS d failed with Referenced table d1 not found. The user-supplied alias d never reached the binder's scope-resolution layer; the previous-result table came out under an internal name.

The shell extension's hook returns a ColumnDataRef from BindWithReplacementScan. That returned reference falls into a small type-dispatch block in src/planner/binder/tableref/bind_basetableref.cpp:

// alias propagation: pre-dispatch
if (!ref.alias.empty()) {
    replacement_function->alias = ref.alias;
} else if (replacement_function->alias.empty()) {
    replacement_function->alias = ref.table_name;
}

if (replacement_function->type == TableReferenceType::TABLE_FUNCTION) {
    // handle table function
} else if (replacement_function->type == TableReferenceType::SUBQUERY) {
    // handle subquery
} else {
    // wrap anything else in a SubqueryRef
    auto select_node = make_uniq<SelectNode>();
    select_node->select_list.push_back(make_uniq<StarExpression>());
    select_node->from_table = std::move(replacement_function);
    auto select_stmt = make_uniq<SelectStatement>();
    select_stmt->node = std::move(select_node);
    auto subquery = make_uniq<SubqueryRef>(std::move(select_stmt));
    replacement_function = std::move(subquery);
}

The ColumnDataRef from the shell extension is the only replacement scan in the tree that falls into the else branch. Parquet, JSON, and CSV all return TableFunctionRef and hit the first branch. So the else branch wraps the ColumnDataRef in a fresh SubqueryRef. The pre-dispatch alias-set ran on the inner ref. The wrap moved the inner ref into the subquery and produced a new outer ref. The outer ref inherited nothing. The alias d sat on the inner ref where nothing read it.

The move

The patch I wrote moved the alias-propagation block from before the dispatch to after it, so the outer wrap would carry the alias:

// proposed: alias-propagation moved past the dispatch
if (replacement_function->type == TableReferenceType::TABLE_FUNCTION) { ... }
else if (replacement_function->type == TableReferenceType::SUBQUERY) { ... }
else { wrap }

if (!ref.alias.empty()) {
    replacement_function->alias = ref.alias;
} else if (replacement_function->alias.empty()) {
    replacement_function->alias = ref.table_name;
}

This is the kind of shape that reads obviously correct on the page. The intent is to apply the alias to whatever replacement_function points at after the dispatch settles. The pre-dispatch position seemed redundant; the dispatch had already finished, so the alias-set could safely move down.

The audit I ran

I did do an audit. The shape was: walk every caller that produces a replacement_function. That's the producer-side surface. replacement_scans.emplace_back is the registration call, and the registered scans in upstream are Parquet, JSON, CSV, plus the shell extension's ColumnDataRef. Three of those return TableFunctionRef and hit the first branch; my reorder doesn't change their path. The fourth is the one the report is about. The reorder is the fix.

Sibling scan: clean. Producer surface: covered. I pushed.

The reader I had never looked at

What broke wasn't on the producer surface. The pre-dispatch alias-set wasn't just upstream chrome, it was load-bearing for the binder's scope resolution on the inner ref. After the dispatch wraps the inner ColumnDataRef in an outer SubqueryRef, control flows into Bind(*replacement_function). The binder walks into the subquery, finds the inner ref, and tries to resolve scope against it. Scope resolution calls BindingAlias::GetAlias on the inner ref. That call assumes the alias is set. If it isn't, the binder raises an InternalException and the whole query fails.

The pre-dispatch alias-set was the guarantee that GetAlias would not be called on a non-set alias. By moving the block past the dispatch, I had moved the set onto the outer wrap and left the inner ref unset. The outer wrap got the alias. The inner ref did not. Every replacement-scan query that fell into the else branch now hit the internal error.

That's six tests in test_last_result.py alone. Two of them were the new tests I added for the alias case. The other four were the existing tests for the un-aliased FROM _, which used to work because the pre-dispatch block also handled the no-alias case by falling back to ref.table_name. The move broke them too.

The corrective shape

The fix is the shape I should have written the first time. Keep the pre-dispatch alias-set intact, because it's load-bearing for the inner ref. Then add a carry-through to the wrap inside the else branch only:

} else {
    // carry the alias to the wrapping SubqueryRef so qualified references
    // like `SELECT d.x FROM _ AS d` can resolve against the outer ref
    auto inner_alias = replacement_function->alias;
    auto select_node = make_uniq<SelectNode>();
    select_node->select_list.push_back(make_uniq<StarExpression>());
    select_node->from_table = std::move(replacement_function);
    auto select_stmt = make_uniq<SelectStatement>();
    select_stmt->node = std::move(select_node);
    auto subquery = make_uniq<SubqueryRef>(std::move(select_stmt));
    subquery->alias = std::move(inner_alias);
    subquery->column_name_alias = ref.column_name_alias;
    replacement_function = std::move(subquery);
}

One local variable, saved before the std::move empties the inner ref. One assignment, after the wrap is constructed, copying the alias to the outer ref. Net diff against upstream: plus five lines in one branch. The inner ref keeps its alias. The outer ref carries it. The binder's GetAlias call on the inner ref sees what it expects. The user-supplied alias on the outer ref resolves d.x the way the report wanted.

Additive, not subtractive. The pre-dispatch invariant is preserved; the new behavior is added in the one branch that needs it. The smaller correct patch.

What the audit should have been

The audit I ran answered the question "who calls the producer." The audit I needed to run answers the question "who reads the producer's output downstream." Those are different questions with different answers. Producer audits cover the upstream surface. Reader audits cover the downstream surface. A reorder of state-setting code affects whichever audit corresponds to the field being moved, and the field being moved is one the readers care about.

The discriminator is positional. Code that ran between the old position of the set and the new position is the at-risk surface. In this case the Bind(*replacement_function) call that follows the dispatch reads alias on the inner ref. The reorder eliminated the pre-dispatch set that Bind was relying on. The reader was always there. I had just never grepped for it.

The shape that prevents this in future: before moving a state-setting block past any dispatch or wrap, walk the code path from the old position to the new position. Every method called in between is a candidate reader. Every getter on the moved field is a candidate site. For load-bearing state, anything a downstream binder, resolver, or middleware uses, the audit has to be exhaustive, not just clean on the producer side.

The reorder shape is high-risk by structure. The additive shape, preserve the original ordering and add the new behavior in one branch, is low-risk by structure. Both are sometimes correct. When both are possible, the additive shape is usually the right call unless the reorder is unambiguously cleaner and the reader audit is complete.

Takeaway

The producer-side sibling scan is a valid audit. It is not the audit that catches reorder bugs. Reorder bugs are caught by walking the readers of the moved field, in the slice of code between the old position and the new one. If a reader sits in that slice and reads the field, the reorder breaks the reader. Producer audit and reader audit are not interchangeable.

The corrective commit went up an hour after the CI red. PR body updated. Self-comment posted. Waiting on the maintainer to re-approve the fork CI so the second attempt can run green. The credibility move on a broken first attempt is the same-day corrective with a diagnosis. The credibility move that would have been better is the patch that didn't break six tests the first time.

The PR is duckdb/duckdb#22852. The lesson lives at audit-readers-when-reordering-state.