Copilot AI commented Dec 23, 2025

Implements native MIN_BY and MAX_BY aggregations to support Apache Spark's min_by/max_by operations. These aggregations return the value from one column at the row where another column has its minimum/maximum value, avoiding the slower struct-based min/max comparison path that forces sort-based groupby.

Changes

Aggregation Infrastructure

  • Added MIN_BY and MAX_BY to aggregation::Kind enum
  • Created min_by_aggregation and max_by_aggregation classes with visitor pattern support
  • Added factory functions make_min_by_aggregation() and make_max_by_aggregation()

Implementation

  • Groupby: Implements sort-based operations using argmin/argmax + gather pattern
  • Reduction: Uses argmin/argmax to find index, then extracts value at that index
  • Both accept struct columns with 2 children: ordering column (index 0) and value column (index 1)
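The groupby bullet above names an "argmin/argmax + gather" pattern. As a rough host-side illustration only, the sketch below shows that idea in plain standard C++. The helpers `group_argmin` and `group_min_by` are hypothetical names invented for this sketch and are not cudf APIs. The sketch also ignores nulls and assumes that ties keep the first row seen, which may differ from what the PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (not a cudf API): for each key, record the row index
// at which `ordering` is smallest within that key's group (a per-group argmin).
// Assumption: on ties, the first row encountered wins.
std::map<int, std::size_t> group_argmin(std::vector<int> const& keys,
                                        std::vector<int> const& ordering)
{
  assert(keys.size() == ordering.size());
  std::map<int, std::size_t> best_row;
  for (std::size_t row = 0; row < keys.size(); ++row) {
    // Insert the first row of a new group, or get the current best row.
    auto [it, inserted] = best_row.try_emplace(keys[row], row);
    if (!inserted && ordering[row] < ordering[it->second]) { it->second = row; }
  }
  return best_row;
}

// Hypothetical helper (not a cudf API): gather `values` at the per-group
// argmin rows, giving one MIN_BY result per key.
std::map<int, int> group_min_by(std::vector<int> const& keys,
                                std::vector<int> const& ordering,
                                std::vector<int> const& values)
{
  assert(keys.size() == values.size());
  std::map<int, int> result;
  for (auto const& [key, row] : group_argmin(keys, ordering)) {
    result.emplace(key, values[row]);
  }
  return result;
}
```

With keys `{1, 1, 2, 2, 2}`, ordering `{9, 8, 7, 6, 5}` and values `{10, 20, 30, 40, 50}`, this sketch yields 20 for key 1 and 50 for key 2. A MAX_BY variant would only flip the comparison in `group_argmin`.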

Tests

  • Added min_by_tests.cpp and max_by_tests.cpp covering basic functionality and null handling

Example Usage

// Ordering column: MIN_BY/MAX_BY locate the row where this column is smallest/largest
cudf::test::fixed_width_column_wrapper<int> ordering{9, 8, 7, 6, 5};

// Value column: the result is read from this column at the located row
cudf::test::fixed_width_column_wrapper<int> values{10, 20, 30, 40, 50};

// Create struct column: ordering is child 0, values is child 1
cudf::test::structs_column_wrapper input{{ordering, values}};

// Output type of the reduction (assumed here to match the INT32 value column)
auto const output_type = cudf::data_type{cudf::type_id::INT32};

// MIN_BY returns 50 (value where ordering is minimum: 5)
auto min_agg = cudf::make_min_by_aggregation<cudf::reduce_aggregation>();
auto min_result = cudf::reduce(input, *min_agg, output_type);

// MAX_BY returns 10 (value where ordering is maximum: 9)
auto max_agg = cudf::make_max_by_aggregation<cudf::reduce_aggregation>();
auto max_result = cudf::reduce(input, *max_agg, output_type);
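The expected results in the example (50 for MIN_BY, 10 for MAX_BY) can be checked against a small host-side reference. The sketch below uses only the C++ standard library. The functions `min_by` and `max_by` are hypothetical reference helpers, not the cudf implementation. They assume non-empty inputs of equal length with no nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical reference helper (not part of cudf): return the entry of
// `values` at the row where `ordering` is smallest.
// Assumes non-empty, equally sized inputs and no nulls.
int min_by(std::vector<int> const& ordering, std::vector<int> const& values)
{
  assert(!ordering.empty() && ordering.size() == values.size());
  // Step 1: argmin of the ordering column.
  auto const idx =
    std::distance(ordering.begin(), std::min_element(ordering.begin(), ordering.end()));
  // Step 2: extract the value at that index.
  return values[static_cast<std::size_t>(idx)];
}

// Hypothetical reference helper (not part of cudf): same as min_by, but
// using the row where `ordering` is largest.
int max_by(std::vector<int> const& ordering, std::vector<int> const& values)
{
  assert(!ordering.empty() && ordering.size() == values.size());
  auto const idx =
    std::distance(ordering.begin(), std::max_element(ordering.begin(), ordering.end()));
  return values[static_cast<std::size_t>(idx)];
}
```

For ordering `{9, 8, 7, 6, 5}` and values `{10, 20, 30, 40, 50}`, `min_by` returns 50 and `max_by` returns 10, matching the comments in the example.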

Files Modified

  • 14 files changed: 730 insertions(+), 0 deletions(-)
  • Core: aggregation.hpp, aggregation.cpp, aggregation/aggregation.hpp
  • Groupby: group_min_by.cu, group_max_by.cu, aggregate.cpp, group_reductions.hpp
  • Reduction: min_by.cu, max_by.cu, reductions.cpp, reduction_functions.hpp
  • Build: CMakeLists.txt
  • Tests: min_by_tests.cpp, max_by_tests.cpp
Original prompt

This section details the original issue you should resolve

<issue_title>[FEA] Support min_by and max_by aggregations in reduction and groupby</issue_title>
<issue_description>Is your feature request related to a problem? Please describe.
Apache Spark has min_by and max_by groupby aggregations that return the entry of a value column at the row where an ordering column reaches its min/max. To implement these aggregations, we currently combine the inputs into a structs column (ordering before value) and find the min/max struct element. However, finding the min/max of struct elements is not supported in hash-based groupby aggregation, so it runs in the sort-based aggregation code path, which can be much slower, especially for queries with many aggregations.

Describe the solution you'd like
When we tried to implement native min_by and max_by aggregations in cudf (#16163), it was suggested that a more efficient approach is:

  • Find arg_min/arg_max of the ordering column, then
  • Gather the value column using the arg_min/arg_max indices

However, after investing significant effort in the Spark-Rapids plugin, we still could not implement this approach, because the plugin's architecture cannot connect the input of the aggregation to its output for the gathering step. As such, we still need native min_by/max_by aggregation support in cudf.

Describe alternatives you've considered
The implementation was proposed in #16163, so we just need to create a new PR from it, address the review comments, and do some more cleanup before review.
</issue_description>

Comments on the Issue (you are @copilot in this section)



copy-pr-bot bot commented Dec 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot AI and others added 5 commits December 23, 2025 21:52
- Added MIN_BY and MAX_BY to aggregation::Kind enum
- Created min_by_aggregation and max_by_aggregation classes
- Added visitor pattern methods for new aggregations
- Added factory functions make_min_by_aggregation and make_max_by_aggregation
- Updated aggregation_dispatcher with MIN_BY and MAX_BY cases
- Added target_type_impl specializations

Co-authored-by: ttnghia <[email protected]>

- Added group_min_by.cu and group_max_by.cu for sort-based groupby
- Added min_by.cu and max_by.cu for reductions
- Updated group_reductions.hpp with declarations
- Updated aggregate.cpp with template specializations
- Updated reduction_functions.hpp with declarations
- Updated reductions.cpp with dispatch logic

Co-authored-by: ttnghia <[email protected]>

- Fixed includes in min_by.cu and max_by.cu to use proper headers
- Fixed includes in group_min_by.cu and group_max_by.cu
- Added new source files to CMakeLists.txt for compilation

Co-authored-by: ttnghia <[email protected]>

- Created min_by_tests.cpp for groupby MIN_BY aggregation tests
- Created max_by_tests.cpp for groupby MAX_BY aggregation tests
- Tests cover basic functionality and null handling

Co-authored-by: ttnghia <[email protected]>

- Removed trailing whitespace from test files
- Removed trailing whitespace from reduction files

Co-authored-by: ttnghia <[email protected]>

Copilot AI changed the title from "[WIP] Implement min_by and max_by aggregations for groupby" to "Add MIN_BY and MAX_BY aggregations for groupby and reduction" Dec 23, 2025
Copilot AI requested a review from ttnghia December 23, 2025 22:04
Development

Successfully merging this pull request may close these issues.

[FEA] Support min_by and max_by aggregations in reduction and groupby