Open
Labels
enhancement: New feature or request
Description
Is your feature request related to a problem or challenge?
#15265 points out that we do not really use NDVs (`distinct_count`) anywhere in the code. However, using them will become more practical after #19957 is merged.
An optimization that uses distinct_count can be shown here: #20731
Where NDV is currently used:
- `estimate_inner_join_cardinality` uses the same approach as Spark's Catalyst optimizer (code)
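
For readers unfamiliar with that approach, here is a minimal sketch of the textbook NDV-based estimate for an equi-join, `|L| * |R| / max(ndv_L, ndv_R)`. The function name and signature are hypothetical and are not DataFusion's actual API; the real `estimate_inner_join_cardinality` works on column statistics and handles more cases.

```rust
/// Hypothetical helper (not DataFusion's real signature): estimates the output
/// row count of an inner equi-join on a single key column.
///
/// Assumes uniformly distributed key values and that the side with fewer
/// distinct values is contained in the other side.
fn estimate_inner_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_ndv: u64,
    right_ndv: u64,
) -> Option<u64> {
    let max_ndv = left_ndv.max(right_ndv);
    if max_ndv == 0 {
        // No usable distinct-count information, so no estimate.
        return None;
    }
    // Widen to u128 so the row-count product cannot overflow.
    let product = (left_rows as u128) * (right_rows as u128);
    let estimate = product / (max_ndv as u128);
    // Clamp back into u64 range before narrowing.
    Some(estimate.min(u64::MAX as u128) as u64)
}
```

For example, joining 1000 rows (100 distinct keys) with 200 rows (50 distinct keys) gives 1000 * 200 / 100 = 2000 estimated rows.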
Describe the solution you'd like
I looked into Trino/Spark and added a list of optimizations that can be made with this statistic:
- feat: Improve `partition_statistics()` for `AggregateExec` using `distinct_count` #20731
- feat: Use NDV for equality filter selectivity calculation #20789
- Semi/anti join selectivity calculations
- HJ vs. SMJ
- Multi-join column selectivity with decay
- Choose hash key with high NDV for better spread (maybe reject low NDV columns to avoid skew) - very good for distributed DataFusion
- Possibly check at runtime whether partial aggregation is useful enough for reduction -> Trino uses the formula `NDV x 2 > input_rows`
- Top k output cardinality estimates
- `count(distinct col)` can use `distinct_count`
- Propagate `distinct_count` through UNION
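
To make a few of these items concrete, here is a rough sketch of three of the heuristics. All function names are hypothetical and none of this is existing DataFusion or Trino code. The equality selectivity assumes uniformly distributed values, the partial-aggregation check reads the `NDV x 2 > input_rows` formula as "partial aggregation is unlikely to reduce much when this holds" (an assumption about its direction), and the UNION bounds are just the safe range when the overlap between inputs is unknown.

```rust
/// Selectivity of `col = literal`, assuming values are uniformly distributed:
/// each distinct value matches roughly 1 / NDV of the rows.
fn equality_selectivity(ndv: u64) -> Option<f64> {
    if ndv == 0 {
        None // no distinct-count statistic to work with
    } else {
        Some(1.0 / ndv as f64)
    }
}

/// Assumed reading of the `NDV x 2 > input_rows` formula: when it holds, there
/// are fewer than two rows per group on average, so a partial aggregation
/// step would barely shrink its input.
fn partial_aggregation_looks_unhelpful(ndv: u64, input_rows: u64) -> bool {
    ndv.saturating_mul(2) > input_rows
}

/// Bounds on the distinct count of a column after a UNION of several inputs.
/// Returns `(lower, upper)`: the largest input NDV if the value sets fully
/// overlap, and the sum of input NDVs if they are disjoint.
fn union_ndv_bounds(input_ndvs: &[u64]) -> (u64, u64) {
    let lower = input_ndvs.iter().copied().max().unwrap_or(0);
    let upper = input_ndvs
        .iter()
        .fold(0u64, |acc, &n| acc.saturating_add(n));
    (lower, upper)
}
```

With 1000 input rows, an NDV of 600 trips the partial-aggregation check (1200 > 1000), while an NDV of 100 does not.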
Describe alternatives you've considered
No response
Additional context
No response