Skip to content

perf: Optimize approx_distinct for inline Utf8View#21064

Merged
Dandandan merged 7 commits intoapache:mainfrom
neilconway:neilc/optimize-approx-distinct-inline-strings
Mar 20, 2026
Merged

perf: Optimize approx_distinct for inline Utf8View#21064
Dandandan merged 7 commits intoapache:mainfrom
neilconway:neilc/optimize-approx-distinct-inline-strings

Conversation

@neilconway
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

For short strings that are stored inline in a Utf8View, we can hash the string's value directly, without materializing a &str, and then add the hash value to HyperLogLog directly. This improves performance by ~40%.

What changes are included in this PR?

  • Add benchmark for approx_distinct on short strings
  • Add add_hashed API to HyperLogLog
  • Rename SEED to HLL_HASH_STATE and make it pub(crate)
  • Optimize approx_distinct on short strings as described above.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the functions Changes to functions implementation label Mar 19, 2026
@neilconway
Copy link
Contributor Author

FYI @Dandandan

@neilconway
Copy link
Contributor Author

neilconway commented Mar 19, 2026

This version only uses the short-string optimization if all of the strings in the StringViewArray are short. I tried a variant where we apply the optimization on a per-string basis:

  fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
      let array: &StringViewArray = downcast_value!(values[0], StringViewArray);

      // When strings are stored inline in the StringView (≤ 12 bytes),
      // hash the raw u128 view directly instead of materializing a &str.
      if array.data_buffers().is_empty() {
          // All strings are inline — skip per-element length check
          for (i, &view) in array.views().iter().enumerate() {
              if !array.is_null(i) {
                  self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
              }
          }
      } else {
          for (i, &view) in array.views().iter().enumerate() {
              if array.is_null(i) {
                  continue;
              }
              if (view as u32) <= 12 {
                  self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
              } else {
                  // SAFETY: i is within bounds, checked non-null above
                  let s = unsafe { array.value_unchecked(i) };
                  self.hll.add(s);
              }
          }
      }

      Ok(())
  }

But this version regresses long strings by ~5%, which was stable over multiple runs:

  ┌──────────────────────┬──────────┬───────────┬───────────────┐
  │      Benchmark       │ Baseline │ Optimized │    Change     │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view short 80%   │ 12.16 µs │ 6.98 µs   │ -42.5%        │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view short 99%   │ 12.10 µs │ 6.96 µs   │ -42.5%        │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view long 80%    │ 20.19 µs │ 21.11 µs  │ +5.3%         │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view long 99%    │ 20.17 µs │ 21.05 µs  │ +4.3%         │
  ├──────────────────────┼──────────┼───────────┼───────────────┤

Whereas the version in the PR avoids the regression on long strings, at the price of only applying the optimization if all strings in the StringViewArray are short:

  ┌────────────────────┬──────────┬───────────┬───────────┐
  │     Benchmark      │ Baseline │ Optimized │  Change   │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view short 80% │ 12.16 µs │ 7.14 µs   │ -41.3%    │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view short 99% │ 12.10 µs │ 7.11 µs   │ -41.3%    │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view long 80%  │ 20.19 µs │ 20.09 µs  │ no change │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view long 99%  │ 20.17 µs │ 19.88 µs  │ no change │
  └────────────────────┴──────────┴───────────┴───────────┘

Lmk if you have a preference, or if you can see a way to get the best of both worlds here.

@adriangb
Copy link
Contributor

Could we do two passes: one for short strings and one for long? Order of iterations shouldn’t matter.

///
/// Note that when we later move on to have serialized HLL register binaries
/// shared across cluster, this SEED will have to be consistent across all
/// shared across cluster, this HLL_HASH_STATE will have to be consistent across all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// hash the raw u128 view directly instead of materializing a &str.
if array.data_buffers().is_empty() {
for (i, &view) in array.views().iter().enumerate() {
if !array.is_null(i) {
Copy link
Contributor

@Dandandan Dandandan Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, perhaps we can reuse/generalize hash_string_view_array_inner as well (passing a quality hash function instead)?
It has some more optimization for specializing on non-nulls, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this, but I didn't see an obvious way to reuse/generalize this code?

I played around with accessing data_buffers directly (similar to how hash_string_view_array_inner does it) and computing the hash there for out-of-line strings (I pushed a commit for this), but it didn't seem like a huge win (so I pushed a revert for it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it could be rewritten in terms of https://doc.rust-lang.org/std/hash/trait.BuildHasher.html 🤔

@Dandandan
Copy link
Contributor

Nice! Perhaps we can use the hash_string_view_array_inner instead for even faster hashing (and reuse that code).

@Dandandan Dandandan added this pull request to the merge queue Mar 20, 2026
Merged via the queue into apache:main with commit 1cb4de4 Mar 20, 2026
30 checks passed
@Dandandan
Copy link
Contributor

Thank you @neilconway 🚀🚀🚀

@neilconway
Copy link
Contributor Author

Thank you for the review and the idea, @Dandandan !

haohuaijin pushed a commit to haohuaijin/arrow-datafusion that referenced this pull request Mar 21, 2026
## Which issue does this PR close?

- Closes apache#21039 .

## Rationale for this change

For short strings that are stored inline in a `Utf8View`, we can hash
the string's value directly, without materializing a `&str`, and then
add the hash value to HyperLogLog directly. This improves performance by
~40%.

## What changes are included in this PR?

* Add benchmark for `approx_distinct` on short strings
* Add `add_hashed` API to `HyperLogLog`
* Rename `SEED` to `HLL_HASH_STATE` and make it `pub(crate)`
* Optimize `approx_distinct` on short strings as described above.

## Are these changes tested?

Yes.

## Are there any user-facing changes?

No.
@neilconway neilconway deleted the neilc/optimize-approx-distinct-inline-strings branch March 21, 2026 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize approx_distinct on Utf8View

3 participants