perf: Optimize `approx_distinct` for inline Utf8View by neilconway · Pull Request #21064 · apache/datafusion

neilconway · 2026-03-19T18:47:30Z

Which issue does this PR close?

Closes #21039 (Optimize approx_distinct on Utf8View).

Rationale for this change

For short strings that are stored inline in a Utf8View, we can hash the string's value directly, without materializing a &str, and then add the hash value to HyperLogLog directly. This improves performance by ~40%.
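A rough sketch of why this works, not the PR's code: a string view is 16 bytes, and for strings of at most 12 bytes the length (low 32 bits) and the zero-padded bytes live in the view itself. Equal short strings therefore have equal views, so the `u128` can be hashed in place of the `&str`. `inline_view`, `hash_state`, and `hash_inline` are hypothetical helpers, and std's `DefaultHasher` is only a stand-in for whatever hasher `HLL_HASH_STATE` actually wraps.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Hypothetical helper: pack a string of at most 12 bytes into a 16-byte view.
/// Bytes 0..4 hold the length (little-endian); bytes 4..16 hold the data,
/// padded with zeros.
fn inline_view(s: &str) -> u128 {
    let data = s.as_bytes();
    assert!(data.len() <= 12, "only short strings are stored inline");
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    raw[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(raw)
}

/// Stand-in for `HLL_HASH_STATE` (assumption): a deterministic `BuildHasher`.
fn hash_state() -> BuildHasherDefault<DefaultHasher> {
    BuildHasherDefault::default()
}

/// Hash the raw view directly; no `&str` is materialized.
fn hash_inline(view: u128) -> u64 {
    hash_state().hash_one(view)
}
```

With these helpers, `hash_inline(inline_view("abc"))` yields the same value every time it is called on an equal string, which is all HyperLogLog needs from the hash.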

What changes are included in this PR?

Add benchmark for approx_distinct on short strings
Add add_hashed API to HyperLogLog
Rename SEED to HLL_HASH_STATE and make it pub(crate)
Optimize approx_distinct on short strings as described above.
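To illustrate the `add_hashed` item above, here is a minimal sketch of how an `add` / `add_hashed` split might look. `ToyHll`, the register count, and the use of std's `DefaultHasher` are all assumptions for illustration; the real `HyperLogLog` in `hyperloglog.rs` may differ, and the cardinality estimate is omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault, Hash};

const P: u32 = 14; // index bits, i.e. 2^14 registers (an assumption)
const NUM_REGISTERS: usize = 1 << P;

/// Toy register array; only the update path is shown.
struct ToyHll {
    registers: Vec<u8>,
    state: BuildHasherDefault<DefaultHasher>, // stand-in for HLL_HASH_STATE
}

impl ToyHll {
    fn new() -> Self {
        Self {
            registers: vec![0; NUM_REGISTERS],
            state: BuildHasherDefault::default(),
        }
    }

    /// Hash the value, then delegate to `add_hashed`.
    fn add<T: Hash + ?Sized>(&mut self, value: &T) {
        let hash = self.state.hash_one(value);
        self.add_hashed(hash);
    }

    /// Update the registers from a hash the caller has already computed.
    fn add_hashed(&mut self, hash: u64) {
        // The low P bits select the register.
        let index = (hash & (NUM_REGISTERS as u64 - 1)) as usize;
        // Rank: 1-based position of the lowest set bit in the remaining bits.
        // The sentinel bit bounds the rank when all remaining bits are zero.
        let rest = (hash >> P) | (1u64 << (64 - P));
        let rank = (rest.trailing_zeros() + 1) as u8;
        self.registers[index] = self.registers[index].max(rank);
    }
}
```

Exposing `add_hashed` lets a caller that already holds a suitable 64-bit hash, such as the hash of an inline view, skip the generic `add` path.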

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway · 2026-03-19T18:54:39Z

FYI @Dandandan

neilconway · 2026-03-19T19:02:42Z

This version only uses the short-string optimization if all of the strings in the StringViewArray are short. I tried a variant where we apply the optimization on a per-string basis:

  fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
      let array: &StringViewArray = downcast_value!(values[0], StringViewArray);

      // When strings are stored inline in the StringView (≤ 12 bytes),
      // hash the raw u128 view directly instead of materializing a &str.
      if array.data_buffers().is_empty() {
          // All strings are inline — skip per-element length check
          for (i, &view) in array.views().iter().enumerate() {
              if !array.is_null(i) {
                  self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
              }
          }
      } else {
          for (i, &view) in array.views().iter().enumerate() {
              if array.is_null(i) {
                  continue;
              }
              if (view as u32) <= 12 {
                  self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
              } else {
                  // SAFETY: i is within bounds, checked non-null above
                  let s = unsafe { array.value_unchecked(i) };
                  self.hll.add(s);
              }
          }
      }

      Ok(())
  }

But this version regresses long strings by ~5%, a result that was stable across multiple runs:

  ┌──────────────────────┬──────────┬───────────┬───────────────┐
  │      Benchmark       │ Baseline │ Optimized │    Change     │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view short 80%   │ 12.16 µs │ 6.98 µs   │ -42.5%        │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view short 99%   │ 12.10 µs │ 6.96 µs   │ -42.5%        │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view long 80%    │ 20.19 µs │ 21.11 µs  │ +5.3%         │
  ├──────────────────────┼──────────┼───────────┼───────────────┤
  │ utf8view long 99%    │ 20.17 µs │ 21.05 µs  │ +4.3%         │
  └──────────────────────┴──────────┴───────────┴───────────────┘

Whereas the version in the PR avoids the regression on long strings, at the price of only applying the optimization if all strings in the StringViewArray are short:

  ┌────────────────────┬──────────┬───────────┬───────────┐
  │     Benchmark      │ Baseline │ Optimized │  Change   │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view short 80% │ 12.16 µs │ 7.14 µs   │ -41.3%    │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view short 99% │ 12.10 µs │ 7.11 µs   │ -41.3%    │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view long 80%  │ 20.19 µs │ 20.09 µs  │ no change │
  ├────────────────────┼──────────┼───────────┼───────────┤
  │ utf8view long 99%  │ 20.17 µs │ 19.88 µs  │ no change │
  └────────────────────┴──────────┴───────────┴───────────┘

Lmk if you have a preference, or if you can see a way to get the best of both worlds here.

adriangb · 2026-03-20T04:06:00Z

Could we do two passes: one for short strings and one for long? Order of iterations shouldn’t matter.
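A sketch of what the two-pass idea could look like, under simplifying assumptions: views are modeled as plain `u128` values with the length in the low 32 bits, `long_value` is a hypothetical accessor standing in for fetching an out-of-line string, `add_hashed` stands in for `HyperLogLog::add_hashed`, and null handling is left out. std's `DefaultHasher` again replaces the real hash state.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault};

const MAX_INLINE_LEN: u32 = 12;

/// Two passes over the same views: inline strings first, then longer ones.
fn two_pass<'a>(
    views: &[u128],
    long_value: impl Fn(usize) -> &'a str,
    mut add_hashed: impl FnMut(u64),
) {
    let state = BuildHasherDefault::<DefaultHasher>::default();

    // Pass 1: short strings are fully contained in the view, so hash the
    // view itself.
    for &view in views {
        if (view as u32) <= MAX_INLINE_LEN {
            add_hashed(state.hash_one(view));
        }
    }

    // Pass 2: longer strings need their bytes fetched from a data buffer.
    for (i, &view) in views.iter().enumerate() {
        if (view as u32) > MAX_INLINE_LEN {
            add_hashed(state.hash_one(long_value(i)));
        }
    }
}
```

Because a HyperLogLog register update takes a maximum, feeding the hashes in a different order leaves the registers unchanged. Whether splitting the loop actually avoids the ~5% regression on long strings is not measured here.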

Dandandan · 2026-03-20T05:00:30Z

datafusion/functions-aggregate/src/hyperloglog.rs

 ///
 /// Note that when we later move on to have serialized HLL register binaries
-/// shared across cluster, this SEED will have to be consistent across all
+/// shared across cluster, this HLL_HASH_STATE will have to be consistent across all


Dandandan · 2026-03-20T05:09:18Z

datafusion/functions-aggregate/src/approx_distinct.rs

+        // hash the raw u128 view directly instead of materializing a &str.
+        if array.data_buffers().is_empty() {
+            for (i, &view) in array.views().iter().enumerate() {
+                if !array.is_null(i) {


I am wondering, perhaps we can reuse/generalize hash_string_view_array_inner as well (passing a quality hash function instead)?
It has some more optimization for specializing on non-nulls, etc.

I took a look at this, but I didn't see an obvious way to reuse or generalize this code.

I played around with accessing data_buffers directly (similar to how hash_string_view_array_inner does it) and computing the hash there for out-of-line strings (I pushed a commit for this), but it didn't seem like a huge win (so I pushed a revert for it).

I guess it could be rewritten in terms of https://doc.rust-lang.org/std/hash/trait.BuildHasher.html 🤔
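One way to read the `BuildHasher` suggestion, as a minimal sketch only: make the hashing loop generic over the hash state so the caller chooses the hash function. `hash_inline_views` is a hypothetical name, it is not `hash_string_view_array_inner`, and it covers only inline views modeled as `u128` values.

```rust
use std::hash::BuildHasher;

/// Hypothetical generalization: hash every inline view with a caller-supplied
/// `BuildHasher` and hand each hash to `sink` (for example, an HLL's
/// `add_hashed`).
fn hash_inline_views<S: BuildHasher>(
    views: &[u128],
    state: &S,
    mut sink: impl FnMut(u64),
) {
    for &view in views {
        sink(state.hash_one(view));
    }
}
```

Any type implementing `BuildHasher` can be passed as `state`, so the same loop could serve both a fast hash and a higher-quality one.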

Dandandan · 2026-03-20T05:10:43Z

Nice! Perhaps we can use the hash_string_view_array_inner instead for even faster hashing (and reuse that code).

Dandandan · 2026-03-20T19:46:05Z

Thank you @neilconway 🚀🚀🚀

neilconway · 2026-03-20T23:46:37Z

Thank you for the review and the idea, @Dandandan !

neilconway added 3 commits March 19, 2026 14:34

Add benchmark for approx_distinct with short strings

ff4d50f

Implement add_hashed for HyperLogLog

c1d30f9

Optimize approx_distinct to hash short strings directly

cccd362

github-actions bot added the functions Changes to functions implementation label Mar 19, 2026

Dandandan reviewed Mar 20, 2026

Dandandan approved these changes Mar 20, 2026

neilconway added 4 commits March 20, 2026 10:48

Per-string short-string opt, access data buffer directly

ef1f173

Tweak comment

893fc06

Revert "Tweak comment"

dcce169

This reverts commit 893fc06.

Revert "Per-string short-string opt, access data buffer directly"

cc44b1e

This reverts commit ef1f173.

Dandandan approved these changes Mar 20, 2026

Dandandan added this pull request to the merge queue Mar 20, 2026

Merged via the queue into apache:main with commit 1cb4de4 Mar 20, 2026
30 checks passed

neilconway deleted the neilc/optimize-approx-distinct-inline-strings branch March 21, 2026 15:08

Dandandan merged 7 commits into apache:main from neilconway:neilc/optimize-approx-distinct-inline-strings
