Add MIN_BY and MAX_BY aggregations for groupby and reduction #20947

Copilot · 2025-12-23T21:46:03Z

Implements native MIN_BY and MAX_BY aggregations to support Apache Spark's min_by/max_by operations. These aggregations return the value from one column at the row where another column has its minimum/maximum value, avoiding the slower struct-based min/max comparison path that forces sort-based groupby.

Changes

Aggregation Infrastructure

Added MIN_BY and MAX_BY to aggregation::Kind enum
Created min_by_aggregation and max_by_aggregation classes with visitor pattern support
Added factory functions make_min_by_aggregation() and make_max_by_aggregation()
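
The pieces listed above (a Kind enumerator, a class per aggregation with visitor support, and a factory function) fit together roughly as in the following sketch. It is a simplified, hypothetical stand-in written for illustration, not the cudf source: the `sketch` namespace, `name_visitor`, and `kind_name` are invented here, and the factories are left untemplated, whereas the PR's factories take a base aggregation type as a template argument.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified stand-ins for illustration; not the actual cudf classes.
namespace sketch {

enum class Kind { MIN, MAX, MIN_BY, MAX_BY };

class min_by_aggregation;
class max_by_aggregation;

// Visitor interface with one overload per aggregation class.
struct visitor {
  virtual ~visitor() = default;
  virtual void visit(min_by_aggregation const&) = 0;
  virtual void visit(max_by_aggregation const&) = 0;
};

// Base class: stores the kind and forwards to the visitor via double dispatch.
struct aggregation {
  explicit aggregation(Kind k) : kind{k} {}
  virtual ~aggregation() = default;
  virtual void accept(visitor& v) const = 0;
  Kind kind;
};

class min_by_aggregation final : public aggregation {
 public:
  min_by_aggregation() : aggregation{Kind::MIN_BY} {}
  void accept(visitor& v) const override { v.visit(*this); }
};

class max_by_aggregation final : public aggregation {
 public:
  max_by_aggregation() : aggregation{Kind::MAX_BY} {}
  void accept(visitor& v) const override { v.visit(*this); }
};

// Factory functions named after the ones in the PR description (untemplated here).
inline std::unique_ptr<aggregation> make_min_by_aggregation()
{
  return std::make_unique<min_by_aggregation>();
}

inline std::unique_ptr<aggregation> make_max_by_aggregation()
{
  return std::make_unique<max_by_aggregation>();
}

// Example visitor (invented for this sketch): records which overload was dispatched.
struct name_visitor final : visitor {
  std::string name;
  void visit(min_by_aggregation const&) override { name = "MIN_BY"; }
  void visit(max_by_aggregation const&) override { name = "MAX_BY"; }
};

inline std::string kind_name(aggregation const& agg)
{
  name_visitor v;
  agg.accept(v);
  return v.name;
}

}  // namespace sketch
```

Each aggregation class overrides `accept` so that a visitor can select behavior by concrete type without switching on the enum.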

Implementation

Groupby: Implements sort-based operations using argmin/argmax + gather pattern
Reduction: Uses argmin/argmax to find index, then extracts value at that index
Both accept struct columns with 2 children: ordering column (index 0) and value column (index 1)
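
The reduction described above (argmin/argmax to find an index, then extraction of the value at that index) can be illustrated on the host with standard C++. This is only a sketch of the technique: the PR operates on device columns, and the `sketch::min_by` and `sketch::max_by` helpers below are hypothetical. Null handling is omitted, and ties resolve to the first matching row because that is what `std::min_element` and `std::max_element` return; the PR description does not state its tie-breaking rule.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

namespace sketch {

// Returns the entry of `values` at the row where `ordering` is smallest.
template <typename K, typename V>
V min_by(std::vector<K> const& ordering, std::vector<V> const& values)
{
  if (ordering.empty() || ordering.size() != values.size()) {
    throw std::invalid_argument("columns must be non-empty and of equal length");
  }
  // Step 1: argmin over the ordering column.
  auto const idx =
    std::distance(ordering.begin(), std::min_element(ordering.begin(), ordering.end()));
  // Step 2: extract the value at that index.
  return values[static_cast<std::size_t>(idx)];
}

// Returns the entry of `values` at the row where `ordering` is largest.
template <typename K, typename V>
V max_by(std::vector<K> const& ordering, std::vector<V> const& values)
{
  if (ordering.empty() || ordering.size() != values.size()) {
    throw std::invalid_argument("columns must be non-empty and of equal length");
  }
  // Step 1: argmax over the ordering column.
  auto const idx =
    std::distance(ordering.begin(), std::max_element(ordering.begin(), ordering.end()));
  // Step 2: extract the value at that index.
  return values[static_cast<std::size_t>(idx)];
}

}  // namespace sketch
```

With ordering {9, 8, 7, 6, 5} and values {10, 20, 30, 40, 50}, `min_by` yields 50 and `max_by` yields 10.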

Tests

Added min_by_tests.cpp and max_by_tests.cpp covering basic functionality and null handling

Example Usage

#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf_test/column_wrapper.hpp>

// Ordering column: values to find min/max
cudf::test::fixed_width_column_wrapper<int> ordering{9, 8, 7, 6, 5};

// Value column: values to return
cudf::test::fixed_width_column_wrapper<int> values{10, 20, 30, 40, 50};

// Create struct column: ordering is child 0, value is child 1
cudf::test::structs_column_wrapper input{{ordering, values}};

// Assumed here: the output type is the value column's type (int -> INT32)
auto const output_type = cudf::data_type{cudf::type_id::INT32};

// MIN_BY returns 50 (value where ordering is minimum: 5)
auto min_agg = cudf::make_min_by_aggregation<cudf::reduce_aggregation>();
auto min_result = cudf::reduce(input, *min_agg, output_type);

// MAX_BY returns 10 (value where ordering is maximum: 9)
auto max_agg = cudf::make_max_by_aggregation<cudf::reduce_aggregation>();
auto max_result = cudf::reduce(input, *max_agg, output_type);

Files Modified

14 files changed: 730 insertions(+), 0 deletions(-)
Core: aggregation.hpp, aggregation.cpp, aggregation/aggregation.hpp
Groupby: group_min_by.cu, group_max_by.cu, aggregate.cpp, group_reductions.hpp
Reduction: min_by.cu, max_by.cu, reductions.cpp, reduction_functions.hpp
Build: CMakeLists.txt
Tests: min_by_tests.cpp, max_by_tests.cpp

Original prompt

This section details the original issue you should resolve

<issue_title>[FEA] Support min_by and max_by aggregations in reduction and groupby</issue_title>
<issue_description>Is your feature request related to a problem? Please describe.
Apache Spark has min_by and max_by groupby aggregations that return the entry of a value column at the row where an ordering column reaches its min/max. To implement such aggregations, we currently combine the inputs into a structs column (ordering before value) and find the min/max struct element. However, finding the min/max of struct elements is not supported in hash-based groupby aggregation, so it runs in the sort-based aggregation code path, which can be much slower, especially for queries with many aggregations.

Describe the solution you'd like
When we tried to implement the native min_by and max_by aggregations in cudf (#16163), it was suggested that a more efficient approach is:

1. Find arg_min/arg_max of the ordering column, then
2. Gather the value column using the arg_min/arg_max indices

However, after investing significant effort in the Spark-Rapids plugin, we still couldn't implement this approach: the plugin's architecture cannot connect the input of the aggregation to its output for the gathering step. As such, we still need cudf to support native min_by/max_by aggregations.

Describe alternatives you've considered
The implementation was proposed in #16163, so we just need to create a new PR from it, address the review comments, and do some more cleanup before review.
</issue_description>
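
The two-step approach the issue describes (arg_min of the ordering column per group, then a gather of the value column) can be sketched on the host as follows. This is a hypothetical illustration in standard C++, not the cudf or Spark-Rapids implementation: `sketch::groupby_min_by` is an invented name, the three vectors stand in for parallel key, ordering, and value columns of equal length, and nulls are not modeled.

```cpp
#include <cstddef>
#include <map>
#include <vector>

namespace sketch {

// Hypothetical host-side sketch; `keys`, `ordering`, and `values` are parallel columns.
inline std::map<int, int> groupby_min_by(std::vector<int> const& keys,
                                         std::vector<int> const& ordering,
                                         std::vector<int> const& values)
{
  // Step 1: per-group arg_min of the ordering column (row index of the smallest entry).
  std::map<int, std::size_t> arg_min;
  for (std::size_t row = 0; row < keys.size(); ++row) {
    auto const res = arg_min.emplace(keys[row], row);  // first row seen for this key
    if (!res.second && ordering[row] < ordering[res.first->second]) {
      res.first->second = row;  // strictly smaller ordering value: update the index
    }
  }

  // Step 2: gather the value column using the arg_min indices.
  std::map<int, int> result;
  for (auto const& kv : arg_min) {
    result.emplace(kv.first, values[kv.second]);
  }
  return result;
}

}  // namespace sketch
```

The gather in step 2 needs both the aggregation output (the indices) and the original value column, which is the connection the issue says the plugin architecture cannot provide.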

Comments on the Issue (you are @copilot in this section)

Fixes [FEA] Support min_by and max_by aggregations in reduction and groupby #20946


copy-pr-bot · 2025-12-23T21:46:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Added MIN_BY and MAX_BY to aggregation::Kind enum
- Created min_by_aggregation and max_by_aggregation classes
- Added visitor pattern methods for new aggregations
- Added factory functions make_min_by_aggregation and make_max_by_aggregation
- Updated aggregation_dispatcher with MIN_BY and MAX_BY cases
- Added target_type_impl specializations

Co-authored-by: ttnghia <[email protected]>

- Added group_min_by.cu and group_max_by.cu for sort-based groupby
- Added min_by.cu and max_by.cu for reductions
- Updated group_reductions.hpp with declarations
- Updated aggregate.cpp with template specializations
- Updated reduction_functions.hpp with declarations
- Updated reductions.cpp with dispatch logic

Co-authored-by: ttnghia <[email protected]>

Initial plan

bffdc14

Copilot AI assigned Copilot and ttnghia Dec 23, 2025

Copilot started work on behalf of ttnghia December 23, 2025 21:46

Copilot AI and others added 5 commits December 23, 2025 21:52

Fix includes and add new files to CMakeLists.txt

9ea4941

- Fixed includes in min_by.cu and max_by.cu to use proper headers
- Fixed includes in group_min_by.cu and group_max_by.cu
- Added new source files to CMakeLists.txt for compilation

Co-authored-by: ttnghia <[email protected]>

Add basic tests for min_by and max_by groupby aggregations

91e1587

- Created min_by_tests.cpp for groupby MIN_BY aggregation tests
- Created max_by_tests.cpp for groupby MAX_BY aggregation tests
- Tests cover basic functionality and null handling

Co-authored-by: ttnghia <[email protected]>

Fix trailing whitespace in min_by and max_by files

9aa16d7

- Removed trailing whitespace from test files
- Removed trailing whitespace from reduction files

Co-authored-by: ttnghia <[email protected]>

Copilot AI changed the title ~~[WIP] Implement min_by and max_by aggregations for groupby~~ Add MIN_BY and MAX_BY aggregations for groupby and reduction Dec 23, 2025

Copilot AI requested a review from ttnghia December 23, 2025 22:04

Copilot finished work on behalf of ttnghia December 23, 2025 22:04
