Implement optimized movemasks for NEON #1236
Open
+87
−16
While the scalar post-processing required to obtain one bit per lane makes this more expensive than directly supporting variable-sized bit groups (as done in Zstandard[^1]), the result is still an improvement over the current lane-by-lane algorithm.
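The trade-off above can be sketched as follows. This is an illustrative example, not the PR's actual code: the function name `movemask_u8x16` and the exact packing steps are assumptions. On NEON, `vshrn_n_u16` narrows the mask to one nibble per lane (the "variable-sized bit group" Zstandard consumes directly), after which a scalar step compresses the 16 nibbles down to one bit per lane; a portable fallback shows the equivalent lane-by-lane result.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
#  include <arm_neon.h>
#endif

// Illustrative sketch (not the PR's actual API): collapse a 16-lane byte
// mask (each lane 0x00 or 0xFF) into one bit per lane, SSE2-movemask style.
inline std::uint16_t movemask_u8x16(const std::uint8_t lanes[16]) {
#if defined(__ARM_NEON)
    // vshrn_n_u16(..., 4) keeps the top nibble of each even lane and the
    // bottom nibble of each odd lane, yielding a 64-bit value with one
    // 4-bit group per lane (assumes little-endian lane order).
    uint8x16_t v = vld1q_u8(lanes);
    std::uint64_t nibbles = vget_lane_u64(
        vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(v), 4)), 0);
    // Scalar post-processing: keep one bit per nibble, then pack the 16
    // surviving bits -- the extra cost discussed above.
    nibbles &= 0x1111111111111111ULL;
    std::uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= static_cast<std::uint16_t>((nibbles >> (4 * i)) & 1u) << i;
    return mask;
#else
    // Portable lane-by-lane fallback, equivalent to _mm_movemask_epi8.
    std::uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= static_cast<std::uint16_t>(lanes[i] >> 7) << i;
    return mask;
#endif
}
```

A consumer such as a find-first-match loop would then run `__builtin_ctz` on the returned mask, which is why the one-bit-per-lane form is worth the extra scalar work.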
To reduce duplication, `XSIMD_LITTLE_ENDIAN` is moved from `math/xsimd_rem_pio2.hpp` to `config/xsimd_config.hpp`, and will now be available outside the defining header.
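A compile-time endianness macro in this spirit could be defined as below. This is a sketch, not xsimd's actual definition: the `DEMO_LITTLE_ENDIAN` name and the fallback chain are assumptions, and a runtime probe is included only to cross-check the compile-time answer.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of compile-time little-endian detection, similar in spirit to
// XSIMD_LITTLE_ENDIAN (not the library's exact definition).
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
#  define DEMO_LITTLE_ENDIAN (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#elif defined(_WIN32)
#  define DEMO_LITTLE_ENDIAN 1  // all supported Windows targets are little-endian
#else
#  define DEMO_LITTLE_ENDIAN 0  // conservative fallback
#endif

// Runtime cross-check: inspect the first byte of a 32-bit 1.
inline bool is_little_endian_runtime() {
    const std::uint32_t x = 1;
    unsigned char b;
    std::memcpy(&b, &x, 1);
    return b == 1;
}
```

Hoisting such a macro into `config/xsimd_config.hpp` lets any header branch on byte order without pulling in the math internals that previously defined it.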
Footnotes

[^1]: See "[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM" (facebook/zstd#3139), namely `ZSTD_row_matchMaskGroupWidth`.