Skip to content

fix: inherit field names from left projection in set expressions when missing#20819

Open
gruuya wants to merge 4 commits intoapache:mainfrom
splitgraph:fix-set-expr-double-literal-unaliased
Open

fix: inherit field names from left projection in set expressions when missing#20819
gruuya wants to merge 4 commits intoapache:mainfrom
splitgraph:fix-set-expr-double-literal-unaliased

Conversation

@gruuya
Copy link
Contributor

@gruuya gruuya commented Mar 9, 2026

Which issue does this PR close?

Rationale for this change

DataFusion requires all projected expressions to have unique names during planning, so it doesn't support select 0, 0 for instance.

However this shouldn't be an issue when this is just a sub-SELECT in a larger query which does abide by this rule. For example a set expression (UNION, EXCEPT, INTERSECT) query should only require the first SELECT to provide a unique schema, and that should be sufficient.

Furthermore, this requirement is even more redundant, since all field name/aliases other than those in the first SELECT are discarded anyway.

What changes are included in this PR?

For set expression queries, alias all unnamed right-hand side projection expressions based on the left-hand side column names.

This is only done in the case for literals, other expression types should have default unique names.

Are these changes tested?

Yes, there are SQL->plan tests as well as SLT tests added.

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Mar 9, 2026
@gruuya gruuya force-pushed the fix-set-expr-double-literal-unaliased branch 10 times, most recently from 2a15b9f to 4d06d45 Compare March 12, 2026 09:02
Copy link
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach doesn't handle queries like:

SELECT a, b FROM t1 UNION ALL SELECT count(*), count(*) FROM t2;

(Error during planning: Projections require unique expression names but the expression "count(Int64(1)) AS count(*)" at position 0 and "count(Int64(1)) AS count(*)" at position 1 have the same name. Consider aliasing ("AS") one of them.)

Or other non-column-name expressions in the target list of non-first queries.

I wonder if it would be cleaner to arrange not to check for duplicate column names in set operation queries that aren't the left-most query? Since the column names for such queries are discarded anyway, it seems a bit laborious to first rewrite them to ensure they are unique, and then check that they are indeed unique, before discarding them anyway.

col_idx += 1;
}
SelectItem::Wildcard(_) | SelectItem::QualifiedWildcard(_, _) => {
// Wildcards expand to multiple columns - skip position tracking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping wildcards seems like it won't produce the right behavior. Consider:

CREATE TABLE t_left (c1 INT, c2 INT, c3 INT) AS VALUES (1, 2, 3);
CREATE TABLE t_right (c1 INT, c2 INT) AS VALUES (10, 20);
SELECT c1, c2, c3 FROM t_left UNION ALL SELECT *, 0 FROM t_right;

yields

Schema error: Schema contains qualified field name t_right.c1 and unqualified field name c1 which would be ambiguous

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I've added handling for this case now (single unqualified wildcard) as well.

Note that we don't handle more than 1 wildcard, or any unqualified wildcard for now.

@gruuya
Copy link
Contributor Author

gruuya commented Mar 16, 2026

This approach doesn't handle queries like:

SELECT a, b FROM t1 UNION ALL SELECT count(*), count(*) FROM t2;

(Error during planning: Projections require unique expression names but the expression "count(Int64(1)) AS count()" at position 0 and "count(Int64(1)) AS count()" at position 1 have the same name. Consider aliasing ("AS") one of them.)

Or other non-column-name expressions in the target list of non-first queries.

Fair point, this occurs because of the is_literal_value condition for aliasing.

I wonder if it would be cleaner to arrange not to check for duplicate column names in set operation queries that aren't the left-most query? Since the column names for such queries are discarded anyway, it seems a bit laborious to first rewrite them to ensure they are unique, and then check that they are indeed unique, before discarding them anyway.

Agreed, i actually thought about this first, but it seemed less clean and more invasive in the end, so i ended up with this approach. Let me rethink about it.

gruuya added 3 commits March 19, 2026 10:23
… missing

Otherwise they might end up having the same column names, which is prohibited.
This is only the case for literals, other expression types should have default unique names.
@gruuya gruuya force-pushed the fix-set-expr-double-literal-unaliased branch from 4d06d45 to 9887464 Compare March 19, 2026 09:23
@gruuya
Copy link
Contributor Author

gruuya commented Mar 19, 2026

This approach doesn't handle queries like:

SELECT a, b FROM t1 UNION ALL SELECT count(), count() FROM t2;

Added a fix (and tests) for this now.

I wonder if it would be cleaner to arrange not to check for duplicate column names in set operation queries that aren't the left-most query? Since the column names for such queries are discarded anyway, it seems a bit laborious to first rewrite them to ensure they are unique, and then check that they are indeed unique, before discarding them anyway.

Indeed, whilst in theory this sounds like the best approach, in practice it would require a much more substantial change. We'd need to

  1. extend PlannerContext with a flag to denote that we're planning a set expression
  2. set that flag in in set_expr_to_plan, right after we plan the left side
  3. thread the flag down into SqlToRel::project
  4. then we'd either need to
    a. introduce a breaking change to LogicalPlanBuilder::project in order to pass that flag, or
    b. introduce a new method to LogicalPlanBuilder, something like project_without_validation where we'd skip validate_unique_names (but not normalize & columnize aspect of project_with_validation)
  5. unset the flag at the outermost set expression (i.e. where we've set it), so that further sql->plan calls perform go through validate_unique_names

All in all it's more verbose and complex compared to this approach, but if you'd like i could open a PR for that so that we can compare.

EDIT: in fact even the above wouldn't be sufficient, as there's additional validation in Projection::try_new right after validate_unique_names, so in effect some sort of query re-writing seems inevitable to fix this.

@gruuya
Copy link
Contributor Author

gruuya commented Mar 19, 2026

EDIT: in fact even the above wouldn't be sufficient, as there's additional validation in Projection::try_new right after validate_unique_names, so in effect some sort of query re-writing seems inevitable to fix this.

So I've got an alternative fix here #21052. We still perform the re-write there, so we don't actually skip the check(s), but the benefit there is that the (qualified) wildcards are expanded and we can treat those as well.

@gruuya gruuya force-pushed the fix-set-expr-double-literal-unaliased branch from e30ae21 to 67e3df7 Compare March 19, 2026 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unexpected set expression query planning error

2 participants