perf: optimize shuffle array element iteration with slice-based append #3222

andygrove wants to merge 4 commits into apache:main
Conversation
1f7ae01 to 4743c23
Use bulk-append methods for primitive types in SparkUnsafeArray:

- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <[email protected]>
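A sketch of the nullable path described in the commit message above. This is illustrative only, not the actual SparkUnsafeArray code: the raw element pointer and `null_bitset` slice stand in for reads from the UnsafeRow-format buffer, and a set bit in the bitset marks a null element.

```rust
use arrow::array::{Int64Array, Int64Builder};

// Illustrative nullable path: iterate via a raw element pointer and test each
// element against a packed 64-bit null bitset (set bit = null element).
unsafe fn append_nullable_i64(
    elements: *const i64,
    null_bitset: &[u64],
    len: usize,
) -> Int64Array {
    let mut builder = Int64Builder::with_capacity(len);
    for i in 0..len {
        if null_bitset[i >> 6] & (1u64 << (i & 63)) != 0 {
            builder.append_null();
        } else {
            // SAFETY: caller guarantees `elements` points to `len` readable i64 values.
            builder.append_value(unsafe { *elements.add(i) });
        }
    }
    builder.finish()
}
```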
4743c23 to e32dd52
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The #[inline] attribute on functions with loops iterating over thousands of elements provides no benefit: the function call overhead is negligible compared to the loop body execution, and inlining large functions causes instruction cache pressure. Keep #[inline] only on small helper functions:

- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
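A minimal, hypothetical illustration of that policy (these are not the actual Comet functions): keep `#[inline]` on the tiny bit-test helper and omit it from the function that contains the element loop.

```rust
/// Tiny, hot accessor: #[inline] lets the bit test fold into callers.
#[inline]
fn is_null_bit_set(null_bitset: &[u64], idx: usize) -> bool {
    null_bitset[idx >> 6] & (1u64 << (idx & 63)) != 0
}

/// Loops over thousands of elements: no #[inline] -- the call overhead is
/// negligible next to the loop body, and forced inlining only grows code size.
fn count_nulls(null_bitset: &[u64], len: usize) -> usize {
    (0..len).filter(|&i| is_null_bit_set(null_bitset, i)).count()
}
```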
- Remove unused ArrayBuilder import
- Use div_ceil() instead of manual implementation

Co-Authored-By: Claude Opus 4.5 <[email protected]>
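For the `div_ceil()` item above, a hypothetical helper showing the shape of the change (the name and call site are illustrative, not the actual Comet code):

```rust
// Number of 64-bit words needed to hold a null bitset for `num_elements`
// entries, using the standard library's rounding-up division.
fn null_bitset_words(num_elements: usize) -> usize {
    num_elements.div_ceil(64) // previously written as (num_elements + 63) / 64
}
```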
@sqlbenchmark run tpch

@sqlbenchmark run tpch
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #3222 +/- ##
============================================
+ Coverage 56.12% 60.02% +3.89%
- Complexity 976 1429 +453
============================================
Files 119 170 +51
Lines 11743 15746 +4003
Branches 2251 2602 +351
============================================
+ Hits 6591 9451 +2860
- Misses 4012 4976 +964
- Partials 1140 1319 +179

☔ View full report in Codecov by Sentry.
@sqlbenchmark run tpch
Comet TPC-H Benchmark Results

Total Time: 306.97 seconds

Automated benchmark run by dfbench

Comet TPC-H Benchmark Results

Total Time: 306.72 seconds

Automated benchmark run by dfbench

Comet TPC-H Benchmark Results

Total Time: 305.78 seconds

Automated benchmark run by dfbench
Note that I did not expect any improvements in TPC-H since it does not use arrays.

Replaced with #3289 so I can work on all optimizations in a single PR for now.
Summary
Optimizes primitive type array iteration in native shuffle by using bulk operations instead of per-element iteration.
Key optimizations:
- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp
Benchmark Results
Benchmark for converting SparkUnsafeArray (10K elements) to Arrow array:

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

*Baseline estimated from similar types
Why such dramatic improvement for non-nullable?
The original code appended elements one by one using index-based access:
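(A minimal sketch of that pattern, not the exact Comet code: the `values` and `is_null` slices are illustrative stand-ins for reads from the SparkUnsafeArray buffer, with arrow-rs's `Int32Builder` used for concreteness.)

```rust
use arrow::array::{Int32Array, Int32Builder};

// Per-element baseline: each of the 10K elements pays for a null check and an
// individual append call.
fn convert_per_element(values: &[i32], is_null: &[bool]) -> Int32Array {
    let mut builder = Int32Builder::with_capacity(values.len());
    for i in 0..values.len() {
        if is_null[i] {
            builder.append_null();
        } else {
            builder.append_value(values[i]);
        }
    }
    builder.finish()
}
```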
The optimized code uses slice-based bulk append:
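(Again a sketch rather than the exact Comet implementation, using the same illustrative names.)

```rust
use arrow::array::{Int32Array, Int32Builder};

// Non-nullable fast path: append the whole element slice at once so the copy
// becomes a single bulk operation instead of 10K individual appends.
fn convert_bulk(values: &[i32]) -> Int32Array {
    let mut builder = Int32Builder::with_capacity(values.len());
    builder.append_slice(values);
    builder.finish()
}
```

Handing the builder the whole slice turns the copy into a memcpy-style operation and skips per-element null handling entirely, which is where the roughly 5-9x speedups in the no_nulls benchmarks come from.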
Test plan
- Benchmark added (native/core/benches/array_conversion.rs)

🤖 Generated with Claude Code