EPIC: Making use of NDVs (number of distinct values) in DataFusion

### Is your feature request related to a problem or challenge?

#15265 points out that we do not really use NDVs or `distinct_count` anywhere in the code. However, using them will become more practical after #19957 is merged.

An example of an optimization that uses `distinct_count` can be seen here: https://github.com/apache/datafusion/pull/20731

Where NDV is currently used:
 - `estimate_inner_join_cardinality` uses the same approach as Spark's Catalyst optimizer ([code](https://github.com/apache/datafusion/blob/4dbb4498fc92539e96993692f0263e74f557e20a/datafusion/physical-plan/src/joins/utils.rs#L590))
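
For readers unfamiliar with this style of estimate, here is a minimal sketch of the textbook NDV-based formula for an inner equi-join, `left_rows * right_rows / max(ndv_left, ndv_right)`. The `ColumnStats` struct and the `estimate_inner_join_rows` function are hypothetical names made up for illustration; they are not DataFusion's `Statistics` types, and the linked function handles more cases than this sketch does.

```rust
/// Hypothetical, simplified per-column statistics (an assumption for this
/// sketch; not DataFusion's actual statistics types).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    /// Number of rows in the input.
    num_rows: u64,
    /// Number of distinct values in the join column, if known.
    distinct_count: Option<u64>,
}

/// Estimate the output rows of an inner equi-join on one column as
/// `left_rows * right_rows / max(ndv_left, ndv_right)`.
///
/// Returns `None` when either NDV is unknown, when the larger NDV is zero,
/// or when the result does not fit in a `u64`.
fn estimate_inner_join_rows(left: ColumnStats, right: ColumnStats) -> Option<u64> {
    let max_ndv = left.distinct_count?.max(right.distinct_count?);
    if max_ndv == 0 {
        return None;
    }
    // Widen to u128 so the intermediate product cannot overflow.
    let product = u128::from(left.num_rows) * u128::from(right.num_rows);
    u64::try_from(product / u128::from(max_ndv)).ok()
}
```

The intuition is that each row on the side with fewer distinct values matches, on average, `rows / max_ndv` rows on the other side, assuming values are uniformly distributed and the smaller key domain is contained in the larger one.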

### Describe the solution you'd like

I looked into Trino and Spark and compiled a list of optimizations that could make use of this statistic:
 - https://github.com/apache/datafusion/pull/20731
 - https://github.com/apache/datafusion/pull/20789
 - Semi/anti join selectivity calculations
 - Choosing between hash join (HJ) and sort-merge join (SMJ)
 - Multi-join column selectivity with decay
 - Choosing a hash key with high NDV for better spread (and possibly rejecting low-NDV columns to avoid skew), which is especially useful for distributed DataFusion
 - Possibly checking at runtime whether partial aggregation reduces the input enough to be worthwhile; Trino uses the formula `NDV x 2 > input_rows`
 - Top-k output cardinality estimates
 - `count(distinct col)` can use `distinct_count`
 - Propagating `distinct_count` through UNION
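
To make two of the items above more concrete, here is a rough sketch. It reads the `NDV x 2 > input_rows` condition as a sign that partial aggregation would shrink its input by less than half and is therefore unlikely to pay off; that reading is my assumption and is not checked against Trino's source. For UNION it only computes the bounds that must hold for any two inputs, rather than proposing a specific estimator. Both function names are hypothetical and do not exist in DataFusion.

```rust
/// Returns true when `ndv * 2 > input_rows`, i.e. when grouping is expected
/// to produce more than half as many rows as it consumes.
/// Assumption: this is read as "partial aggregation is unlikely to help".
fn partial_aggregation_unlikely_to_help(group_key_ndv: u64, input_rows: u64) -> bool {
    // saturating_mul avoids overflow for very large NDV values.
    group_key_ndv.saturating_mul(2) > input_rows
}

/// Lower and upper bounds on the NDV of a column after a union of two inputs.
/// The lower bound is reached when one input's values are a subset of the
/// other's; the upper bound is reached when the two value sets are disjoint.
fn union_ndv_bounds(ndv_a: u64, ndv_b: u64) -> (u64, u64) {
    (ndv_a.max(ndv_b), ndv_a.saturating_add(ndv_b))
}
```

For example, with 1,000 input rows and 600 distinct group keys the check returns true, since 1,200 > 1,000, while with 100 distinct keys it returns false. An actual propagation rule for UNION would need to pick a point inside the bounds, for instance the upper bound as a conservative estimate.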
 

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
