Skip to content

Conversation

@AndreaBozzo
Copy link

@AndreaBozzo AndreaBozzo commented Jan 20, 2026

Which issue does this PR close?

Closes #16054

Rationale for this change

Binary expressions like (1+2)*3 were displayed incorrectly as Int64(1) + Int64(2) * Int64(3) in Display and SqlDisplay output, without parentheses to preserve operator precedence. This is misleading because it does not reflect the actual expression tree structure.

What changes are included in this PR?

  • Modified Display for BinaryExpr to always wrap nested binary expressions in parentheses (following DuckDB's approach). This simplifies the previous precedence-based logic and avoids ambiguity.
  • Added the same parenthesization to SqlDisplay for SQL generation output.
  • Added a comment to SchemaDisplay clarifying that it intentionally does not add parentheses, because schema names must remain stable for field lookups in optimizers like common_subexpr_eliminate.
  • Updated sqllogictests and optimizer integration test snapshots to reflect the new output format.

What this PR does NOT change

  • SchemaDisplay is intentionally unchanged. Schema names must remain stable because they are used as field lookup keys in optimizers (e.g. common_subexpr_eliminate). Changing them would break field resolution. This was initially included and then reverted after discovering it caused optimizer failures.
  • CLI column headers are not affected. When you run e.g. select (1+2)*3, the column header uses SchemaDisplay, so it will still show Int64(1) + Int64(2) * Int64(3) without parentheses. Fixing the CLI column headers would require changes to SchemaDisplay, which is out of scope for the reasons above.

Known limitation: excessive parentheses for associative operators

Because our expression tree represents a OR b OR c as nested BinaryExpr nodes ((a OR b) OR c), always-parenthesizing produces (a OR b) OR c instead of the flat a OR b OR c. DuckDB avoids this by using a flat ConjunctionExpression with a list of children, so they get (a OR b OR c). This is a structural limitation of how DataFusion represents expressions, not something that can be fixed in the Display layer alone.

Pending: datafusion-testing submodule update

The "Run sqllogictests with the sqlite test suite" CI check fails because 4 files in the datafusion-testing submodule contain error message regex patterns that need updating to match the new parenthesized output. The affected files are:

  • data/sqlite/index/random/10/slt_good_0.slt (5 errors)
  • data/sqlite/index/random/10/slt_good_13.slt (10 errors)
  • data/sqlite/index/random/10/slt_good_4.slt (5 errors)
  • data/sqlite/index/random/10/slt_good_6.slt (5 errors)

A separate PR on datafusion-testing will be needed to update these regex patterns, and then this PR's submodule pointer will need to be updated to point to the new commit.

How are these changes tested?

  • New unit test test_binary_expr_display_with_parentheses covering:
    • Arithmetic expressions: (1+2)*3
    • Right-nested expressions: 1+(2*3)
    • Logical expressions: (a OR b) AND c
  • Updated ~44 sqllogictest files and optimizer integration test snapshots
  • All CI checks pass except the sqlite test suite (see "Pending" section above)

Are these changes safe?

Yes. The Display change only affects human-readable plan output (e.g. EXPLAIN plans, debug logging). SchemaDisplay (used for field lookups and schema stability) is intentionally unchanged. SqlDisplay gains parentheses which makes generated SQL unambiguous.


@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate sql SQL Planner optimizer Optimizer rules substrait Changes to the substrait crate labels Jan 20, 2026
Copy link
Author

@AndreaBozzo AndreaBozzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- deleted, was a progress meter for myself --

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 21, 2026
@AndreaBozzo
Copy link
Author

AndreaBozzo commented Jan 23, 2026

Updated after rebase on main.

Note: There are also 3 files in the datafusion-testing submodule that need updating for the sqlite test suite:

  • data/sqlite/index/random/10/slt_good_13.slt
  • data/sqlite/index/random/10/slt_good_4.slt
  • data/sqlite/index/random/10/slt_good_6.slt

These contain error message patterns that now include the additional parentheses (e.g., (- cor0.col4) + Int64(26)((- cor0.col4) + Int64(26))).

I'll create a separate PR on datafusion-testing once this PR gets reviewed, but its safe to say that the longer it takes, the harder is gonna get, very open on getting mantainers guidence now

Always wrap nested binary expressions in parentheses when formatting
to avoid ambiguity. For example, `(1+2)*3` now displays as
`(Int64(1) + Int64(2)) * Int64(3)` instead of the misleading
`Int64(1) + Int64(2) * Int64(3)`.

This follows DuckDB's approach of always adding parentheses around
nested binary expressions.

Fixes apache#16054
Updates all test snapshots affected by the binary expression Display
parentheses change.

The `test_pretty_roundtrip` test had a flawed assertion that checked
structural equality after round-tripping through the pretty unparser.
This was incorrect because the pretty unparser intentionally removes
"unnecessary" parentheses, which can change expression associativity:
- Input: `(id + 5) * (age * 8)` (right-grouped multiplication)
- Pretty output: `(id + 5) * age * 8`
- Re-parsed: `((id + 5) * age) * 8` (left-grouped due to associativity)

These are semantically equivalent for associative operations but
structurally different, so the round-trip assertion was removed.
Updated expected outputs for binary expression parentheses changes.
@r#"
Projection: data.a, data.f
Filter: data.f = Utf8("a") OR data.f = Utf8("b") OR data.f = Utf8("c") OR data2.mark
Filter: (((data.f = Utf8("a")) OR (data.f = Utf8("b"))) OR (data.f = Utf8("c"))) OR data2.mark
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to omit these parentheses in cases like this? Since it's all OR

Copy link
Author

@AndreaBozzo AndreaBozzo Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to explain what i found but my current understanding might be limited.

Seems like DuckDB also always adds parentheses, see conjunction_expression.hpp#L42-L48

they represent a OR b OR c as a flat list of children (ConjunctionExpression), they get (a OR b OR c) while our nested BinaryExpr tree produces ((a OR b) OR c).

Its the reason why i think we should discuss this further as its kinda delicate, and more people should take a look

Thank you for your time mr.Vo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense; I suppose it's more a limitation on how we represent OR rather than on the display code in this PR 👍

@Jefffrey
Copy link
Contributor

I'm a little confused by the intent of this PR; the body seems to imply it fixes cases where parentheses aren't being added to preserve precedence, but I'm fairly sure we already have this code? Especially judging by the discussion on #16054

Moreover the example in the PR body is incorrect with the behaviour of this PR:

After:

> select (1+2)*3;
+------------------------------------+
| (Int64(1) + Int64(2)) * Int64(3)   |
+------------------------------------+

I tried out this PR with cli and the parentheses aren't added because this is a schema name. And there is a test added in this PR contradicting this statement.

Could you help clarify what exactly this PR is achieving that we don't have on main?

@AndreaBozzo
Copy link
Author

AndreaBozzo commented Jan 27, 2026

This originally contained changes to both 'Display' and 'SchemaDisplay', but then i had to revert schema display as It seems It Needed to be stabile for optimizer field lookup.

Unfortunately this way It doesn't solve the original Issue pointed in #16054, the parentheses won't appear in CLI, which Is also part of why i asked for direct help.

While It should still provide an improvement for SQL generation and debug.. its not shaping the way i initially thought/wanted.

@Jefffrey
Copy link
Contributor

Would you be able to update the title & PR body to more accurately represent what changes are being made here? I'm still not exactly clear what the intended changes are which makes it harder to review

@AndreaBozzo AndreaBozzo changed the title fix: add parentheses to nested binary expression Display fix: add parentheses to nested binary expressions in Display and SqlDisplay Jan 30, 2026
@AndreaBozzo
Copy link
Author

AndreaBozzo commented Jan 30, 2026

Hi @Jefffrey, thanks for the feedback. I've updated the PR title and body to reflect the current state of the changes after the SchemaDisplay revert, and also added everything i could find about.

Given the scope creep of ~44 sqllogictest updates, 4 datafusion-testing submodule files needing a separate PR, and the known limitation with excessive parentheses on associative operators, i'm wondering whether this change is worth the churn on its own, or if it would be better to revisit this as part of a broader effort to handle parentheses consistently across all display modes.

Happy to hear your thoughts on whether to proceed or hold off, and if its the second case, i apologize for wasting someone else time, the intent behind was genuine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expr formatting missing parentheses

2 participants