Description
Is your feature request related to a problem or challenge?
The paper "Morsel-Driven Parallelism"[1] by Leis et al. introduces a cache-efficient aggregation algorithm.
It is also implemented by DuckDB[2].
We could implement it and see if it improves performance.
[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2] https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation
Describe the solution you'd like
The steps are as follows
- Start with a single hash table (the current approach) without pre-repartitioning.
- Once the table exceeds a threshold (i.e. roughly the CPU cache size), (radix-)repartition the map into a number of thread-local hash maps (e.g. we can start off using the number of `target_partitions` and use `hash % target_partitions`, but in the future this could be made more adaptive).
- Once the maps are repartitioned / finalized, the output batches can be sent directly to the target partitions (no `RepartitionExec` needed anymore!).
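The steps above can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not DataFusion's actual operator code: the type names (`PartialAggregate`), the `SPILL_THRESHOLD` constant, and the use of `DefaultHasher` are all stand-ins, and the aggregate is hard-coded to a simple sum.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const TARGET_PARTITIONS: usize = 4;
// Stand-in for "roughly CPU cache size"; a real implementation would
// measure memory, not entry count.
const SPILL_THRESHOLD: usize = 8;

fn hash_key(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

struct PartialAggregate {
    // Phase 1: a single map, as in the current approach.
    single: HashMap<String, i64>,
    // Phase 2: partitioned maps, populated once the threshold is exceeded.
    partitioned: Option<Vec<HashMap<String, i64>>>,
}

impl PartialAggregate {
    fn new() -> Self {
        Self { single: HashMap::new(), partitioned: None }
    }

    fn update(&mut self, key: &str, value: i64) {
        match &mut self.partitioned {
            Some(parts) => {
                // After repartitioning, route by hash % target_partitions.
                let p = (hash_key(key) % TARGET_PARTITIONS as u64) as usize;
                *parts[p].entry(key.to_string()).or_insert(0) += value;
            }
            None => {
                *self.single.entry(key.to_string()).or_insert(0) += value;
                if self.single.len() > SPILL_THRESHOLD {
                    self.repartition();
                }
            }
        }
    }

    // Radix-repartition the single map into TARGET_PARTITIONS maps.
    // Because partitioning is by hash, each finalized map can be sent
    // straight to its target partition, with no RepartitionExec step.
    fn repartition(&mut self) {
        let mut parts = vec![HashMap::new(); TARGET_PARTITIONS];
        for (k, v) in self.single.drain() {
            let p = (hash_key(&k) % TARGET_PARTITIONS as u64) as usize;
            parts[p].insert(k, v);
        }
        self.partitioned = Some(parts);
    }
}
```

The key property is that the hash is computed once per key and reused both for grouping and for routing, which is what lets the separate repartition pass be dropped.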
The main benefit, I think, is that by grouping locally we process the rows in smaller hash maps partition-by-partition, so while doing so they are more likely to fit in the CPU cache.
It also avoids the double hashing currently done in partial aggregation / hash repartitioning, as the latter can be removed.
Care must be taken to keep the code efficient, i.e. avoid materializing the batches upfront and instead accumulate based on row indices.
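To illustrate what "accumulate based on indices" could mean, here is a hypothetical sketch (plain slices rather than Arrow arrays, and the function name `accumulate_by_indices` is made up): rather than copying rows into per-partition batches, we collect row indices per partition and accumulate directly from the original columns.

```rust
use std::collections::HashMap;

// Hypothetical sketch: `keys` is assumed to already hold the group hash
// values for each row, and the aggregate is a simple sum of `values`.
fn accumulate_by_indices(
    keys: &[u64],
    values: &[i64],
    target_partitions: usize,
) -> Vec<HashMap<u64, i64>> {
    // First pass: bucket row indices by hash % target_partitions.
    // No row data is copied here, only indices.
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); target_partitions];
    for (row, &k) in keys.iter().enumerate() {
        indices[(k % target_partitions as u64) as usize].push(row);
    }
    // Second pass: accumulate partition-by-partition, so each (smaller)
    // map stays cache-resident while it is being updated.
    let mut maps: Vec<HashMap<u64, i64>> = vec![HashMap::new(); target_partitions];
    for (p, rows) in indices.iter().enumerate() {
        for &row in rows {
            *maps[p].entry(keys[row]).or_insert(0) += values[row];
        }
    }
    maps
}
```

The point of the two-pass structure is that the second pass touches one small map at a time instead of one large map in random order.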
I am not sure why the final aggregation doesn't take the same approach; this could be tested as well once it is implemented for the partial aggregation.
Describe alternatives you've considered
No response
Additional context
No response