perf: optimize shuffle array element iteration with slice-based append #3222

andygrove wants to merge 4 commits into apache:main
Conversation
1f7ae01 to 4743c23
Use bulk-append methods for primitive types in SparkUnsafeArray:

- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <[email protected]>
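A sketch of the nullable path described in the commit message above. This is illustrative only, not the actual SparkUnsafeArray code: the raw element pointer and `null_bitset` slice stand in for reads from the UnsafeRow-format buffer, and a set bit in the bitset marks a null element.

```rust
use arrow::array::{Int64Array, Int64Builder};

// Illustrative nullable path: iterate via a raw element pointer and test each
// element against a packed 64-bit null bitset (set bit = null element).
unsafe fn append_nullable_i64(
    elements: *const i64,
    null_bitset: &[u64],
    len: usize,
) -> Int64Array {
    let mut builder = Int64Builder::with_capacity(len);
    for i in 0..len {
        if null_bitset[i >> 6] & (1u64 << (i & 63)) != 0 {
            builder.append_null();
        } else {
            // SAFETY: caller guarantees `elements` points to `len` readable i64 values.
            builder.append_value(unsafe { *elements.add(i) });
        }
    }
    builder.finish()
}
```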
4743c23 to e32dd52
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The #[inline] attribute on functions with loops iterating over thousands of elements provides no benefit: the function call overhead is negligible compared to the loop body execution, and inlining large functions causes instruction cache pressure. Keep #[inline] only on small helper functions:

- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
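A minimal, hypothetical illustration of that policy (these are not the actual Comet functions): keep `#[inline]` on the tiny bit-test helper and omit it from the function that contains the element loop.

```rust
/// Tiny, hot accessor: #[inline] lets the bit test fold into callers.
#[inline]
fn is_null_bit_set(null_bitset: &[u64], idx: usize) -> bool {
    null_bitset[idx >> 6] & (1u64 << (idx & 63)) != 0
}

/// Loops over thousands of elements: no #[inline] -- the call overhead is
/// negligible next to the loop body, and forced inlining only grows code size.
fn count_nulls(null_bitset: &[u64], len: usize) -> usize {
    (0..len).filter(|&i| is_null_bit_set(null_bitset, i)).count()
}
```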
- Remove unused ArrayBuilder import
- Use div_ceil() instead of manual implementation

Co-Authored-By: Claude Opus 4.5 <[email protected]>
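For the `div_ceil()` item above, a hypothetical helper showing the shape of the change (the name and call site are illustrative, not the actual Comet code):

```rust
// Number of 64-bit words needed to hold a null bitset for `num_elements`
// entries, using the standard library's rounding-up division.
fn null_bitset_words(num_elements: usize) -> usize {
    num_elements.div_ceil(64) // previously written as (num_elements + 63) / 64
}
```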
@sqlbenchmark run tpch

@sqlbenchmark run tpch
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #3222 +/- ##
============================================
+ Coverage 56.12% 60.02% +3.89%
- Complexity 976 1429 +453
============================================
Files 119 170 +51
Lines 11743 15746 +4003
Branches 2251 2602 +351
============================================
+ Hits 6591 9451 +2860
- Misses 4012 4976 +964
- Partials 1140 1319 +179

☔ View full report in Codecov by Sentry.
@sqlbenchmark run tpch
Comet TPC-H Benchmark Results

Total Time: 306.97 seconds

Automated benchmark run by dfbench

Comet TPC-H Benchmark Results

Total Time: 306.72 seconds

Automated benchmark run by dfbench

Comet TPC-H Benchmark Results

Total Time: 305.78 seconds

Automated benchmark run by dfbench
Note that I did not expect any improvements in TPC-H since it does not use arrays.

Replaced with #3289 so I can work on all optimizations in a single PR for now.
Summary
Optimizes primitive type array iteration in native shuffle by using bulk operations instead of per-element iteration.
Key optimizations:
- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp
Benchmark Results
Benchmark for converting SparkUnsafeArray (10K elements) to Arrow array:

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

*Baseline estimated from similar types
Why such dramatic improvement for non-nullable?
The original code appended elements one by one using index-based access:
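(A minimal sketch of that pattern, not the exact Comet code: the `values` and `is_null` slices are illustrative stand-ins for reads from the SparkUnsafeArray buffer, with arrow-rs's `Int32Builder` used for concreteness.)

```rust
use arrow::array::{Int32Array, Int32Builder};

// Per-element baseline: each of the 10K elements pays for a null check and an
// individual append call.
fn convert_per_element(values: &[i32], is_null: &[bool]) -> Int32Array {
    let mut builder = Int32Builder::with_capacity(values.len());
    for i in 0..values.len() {
        if is_null[i] {
            builder.append_null();
        } else {
            builder.append_value(values[i]);
        }
    }
    builder.finish()
}
```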
The optimized code uses slice-based bulk append:
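(Again a sketch rather than the exact Comet implementation, using the same illustrative names.)

```rust
use arrow::array::{Int32Array, Int32Builder};

// Non-nullable fast path: append the whole element slice at once so the copy
// becomes a single bulk operation instead of 10K individual appends.
fn convert_bulk(values: &[i32]) -> Int32Array {
    let mut builder = Int32Builder::with_capacity(values.len());
    builder.append_slice(values);
    builder.finish()
}
```

Handing the builder the whole slice turns the copy into a memcpy-style operation and skips per-element null handling entirely, which is where the roughly 5-9x speedups in the no_nulls benchmarks come from.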
Test plan
- Benchmark added (native/core/benches/array_conversion.rs)

🤖 Generated with Claude Code