fix: inherit field names from left projection in set expressions when missing#20819
fix: inherit field names from left projection in set expressions when missing#20819gruuya wants to merge 4 commits intoapache:mainfrom
Conversation
2a15b9f to
4d06d45
Compare
neilconway
left a comment
There was a problem hiding this comment.
This approach doesn't handle queries like:
SELECT a, b FROM t1 UNION ALL SELECT count(*), count(*) FROM t2;(Error during planning: Projections require unique expression names but the expression "count(Int64(1)) AS count(*)" at position 0 and "count(Int64(1)) AS count(*)" at position 1 have the same name. Consider aliasing ("AS") one of them.)
Or other non-column-name expressions in the target list of non-first queries.
I wonder if it would be cleaner to arrange not to check for duplicate column names in set operation queries that aren't the left-most query? Since the column names for such queries are discarded anyway, it seems a bit laborious to first rewrite them to ensure they are unique, and then check that they are indeed unique, before discarding them anyway.
datafusion/sql/src/set_expr.rs
Outdated
| col_idx += 1; | ||
| } | ||
| SelectItem::Wildcard(_) | SelectItem::QualifiedWildcard(_, _) => { | ||
| // Wildcards expand to multiple columns - skip position tracking |
There was a problem hiding this comment.
Skipping wildcards seems like it won't produce the right behavior. Consider:
CREATE TABLE t_left (c1 INT, c2 INT, c3 INT) AS VALUES (1, 2, 3);
CREATE TABLE t_right (c1 INT, c2 INT) AS VALUES (10, 20);
SELECT c1, c2, c3 FROM t_left UNION ALL SELECT *, 0 FROM t_right;yields
Schema error: Schema contains qualified field name t_right.c1 and unqualified field name c1 which would be ambiguous
There was a problem hiding this comment.
Good catch, I've added handling for this case now (single unqualified wildcard) as well.
Note that we don't handle more than 1 wildcard, or any unqualified wildcard for now.
Fair point, this occurs because of the
Agreed, i actually thought about this first, but it seemed less clean and more invasive in the end, so i ended up with this approach. Let me rethink about it. |
… missing Otherwise they might end up having the same column names, which is prohibited. This is only the case for literals, other expression types should have default unique names.
4d06d45 to
9887464
Compare
Added a fix (and tests) for this now.
Indeed, whilst in theory this sounds like the best approach, in practice it would require a much more substantial change. We'd need to
All in all it's more verbose and complex compared to this approach, but if you'd like i could open a PR for that so that we can compare. EDIT: in fact even the above wouldn't be sufficient, as there's additional validation in |
So I've got an alternative fix here #21052. We still perform the re-write there, so we don't actually skip the check(s), but the benefit there is that the (qualified) wildcards are expanded and we can treat those as well. |
e30ae21 to
67e3df7
Compare
Which issue does this PR close?
Rationale for this change
DataFusion requires all projected expressions to have unique names during planning, so it doesn't support
select 0, 0for instance.However this shouldn't be an issue when this is just a sub-SELECT in a larger query which does abide by this rule. For example a set expression (UNION, EXCEPT, INTERSECT) query should only require the first SELECT to provide a unique schema, and that should be sufficient.
Furthermore, this requirement is even more redundant, since all field name/aliases other than those in the first SELECT are discarded anyway.
What changes are included in this PR?
For set expression queries, alias all unnamed right-hand side projection expressions based on the left-hand side column names.
This is only done in the case for literals, other expression types should have default unique names.
Are these changes tested?
Yes, there are SQL->plan tests as well as SLT tests added.
Are there any user-facing changes?