Description
Is your feature request related to a problem or challenge?
The paper "Morsel-Driven Parallelism"[1] by Leis et al. introduces a cache-efficient aggregation algorithm.
It is also implemented by DuckDB[2].
We could implement it and see if it improves performance.
[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2] https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation
Describe the solution you'd like
The steps are as follows
- Start with a single hash table (the current approach) without pre-repartitioning.
- Once the table exceeds a threshold (i.e. roughly the CPU cache size), (radix-)repartition the map into a number of thread-local hash maps (e.g. we can start off using the number of `target_partitions` and use `hash % target_partitions`, but in the future this could be made more adaptive).
- Once the maps are repartitioned / finalized, the output batches can be sent directly to the target partitions (no `RepartitionExec` needed anymore!).
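The steps above can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not DataFusion's actual operator code: the type names (`PartialAggregate`), the `SPILL_THRESHOLD` constant, and the use of `DefaultHasher` are all stand-ins, and the aggregate is hard-coded to a simple sum.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const TARGET_PARTITIONS: usize = 4;
// Stand-in for "roughly CPU cache size"; a real implementation would
// measure memory, not entry count.
const SPILL_THRESHOLD: usize = 8;

fn hash_key(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

struct PartialAggregate {
    // Phase 1: a single map, as in the current approach.
    single: HashMap<String, i64>,
    // Phase 2: partitioned maps, populated once the threshold is exceeded.
    partitioned: Option<Vec<HashMap<String, i64>>>,
}

impl PartialAggregate {
    fn new() -> Self {
        Self { single: HashMap::new(), partitioned: None }
    }

    fn update(&mut self, key: &str, value: i64) {
        match &mut self.partitioned {
            Some(parts) => {
                // After repartitioning, route by hash % target_partitions.
                let p = (hash_key(key) % TARGET_PARTITIONS as u64) as usize;
                *parts[p].entry(key.to_string()).or_insert(0) += value;
            }
            None => {
                *self.single.entry(key.to_string()).or_insert(0) += value;
                if self.single.len() > SPILL_THRESHOLD {
                    self.repartition();
                }
            }
        }
    }

    // Radix-repartition the single map into TARGET_PARTITIONS maps.
    // Because partitioning is by hash, each finalized map can be sent
    // straight to its target partition, with no RepartitionExec step.
    fn repartition(&mut self) {
        let mut parts = vec![HashMap::new(); TARGET_PARTITIONS];
        for (k, v) in self.single.drain() {
            let p = (hash_key(&k) % TARGET_PARTITIONS as u64) as usize;
            parts[p].insert(k, v);
        }
        self.partitioned = Some(parts);
    }
}
```

The key property is that the hash is computed once per key and reused both for grouping and for routing, which is what lets the separate repartition pass be dropped.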
The main benefit, I think, is that by grouping locally we process the rows in smaller hash maps partition-by-partition, so while doing so they are more likely to fit in the CPU cache.
It also avoids the double hashing currently done in partial aggregation / hash repartitioning, as the latter can be removed.
Care must be taken to keep the code efficient, i.e. avoid materializing the batches upfront and instead accumulate based on row indices.
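To illustrate what "accumulate based on indices" could mean, here is a hypothetical sketch (plain slices rather than Arrow arrays, and the function name `accumulate_by_indices` is made up): rather than copying rows into per-partition batches, we collect row indices per partition and accumulate directly from the original columns.

```rust
use std::collections::HashMap;

// Hypothetical sketch: `keys` is assumed to already hold the group hash
// values for each row, and the aggregate is a simple sum of `values`.
fn accumulate_by_indices(
    keys: &[u64],
    values: &[i64],
    target_partitions: usize,
) -> Vec<HashMap<u64, i64>> {
    // First pass: bucket row indices by hash % target_partitions.
    // No row data is copied here, only indices.
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); target_partitions];
    for (row, &k) in keys.iter().enumerate() {
        indices[(k % target_partitions as u64) as usize].push(row);
    }
    // Second pass: accumulate partition-by-partition, so each (smaller)
    // map stays cache-resident while it is being updated.
    let mut maps: Vec<HashMap<u64, i64>> = vec![HashMap::new(); target_partitions];
    for (p, rows) in indices.iter().enumerate() {
        for &row in rows {
            *maps[p].entry(keys[row]).or_insert(0) += values[row];
        }
    }
    maps
}
```

The point of the two-pass structure is that the second pass touches one small map at a time instead of one large map in random order.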
I am not sure why the final aggregation doesn't take the same approach; this could be tested as well once it is implemented for the partial aggregation.
Describe alternatives you've considered
No response
Additional context
No response