Skip to content

Implement cache-efficient partial aggregation by Leis et al #20773

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge?

The paper "Morsel-Driven Parallelism"[1] by Leis et al introduces a cache efficient aggregation algorithm.
It is also implemented by DuckDB[2]

We can implement it and see if it improves performance.

[1] https://db.in.tum.de/~leis/papers/morsels.pdf
[2] https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation

Describe the solution you'd like

The steps are as follows

  1. Start with a single hashtable (current approach) without pre-repartitioning
  2. Once the table exceeds a threshold (i.e. roughly CPU cache size), (radix) repartition the maps into a number of thread-local hashmaps (e.g. we can start of using the number of target_partitions and use hash % target_partitions instead but in future might make it more adaptive)
  3. Once they are repartitioned / finalized the output batches can be sent directly to the target partitions (no RepartitionExec needed! anymore)
Image

The benefit of it I think is mostly that by local grouping, we process the rows (smaller) hashmaps partition-by-partition, so while doing it they more likely fit in CPU cache.

It also avoids the double hashing in partial aggregation / hash repartition as the latter can be removed.

Care must be taken to make the code efficient, i.e. avoid materializing the batches upfront / accumulate based on indices.

I am not sure why the final aggregation doesn't take the same approach - this could be tested out as well when it is implemented for the partial aggregation.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Labels

enhancementNew feature or requestperformanceMake DataFusion faster

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions