perf: Optimize approx_distinct for inline Utf8View#21064
perf: Optimize approx_distinct for inline Utf8View#21064Dandandan merged 7 commits intoapache:mainfrom
approx_distinct for inline Utf8View#21064Conversation
|
FYI @Dandandan |
|
This version only uses the short-string optimization if all of the strings in the StringViewArray are short. I tried a variant where we apply the optimization on a per-string basis: fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
let array: &StringViewArray = downcast_value!(values[0], StringViewArray);
// When strings are stored inline in the StringView (≤ 12 bytes),
// hash the raw u128 view directly instead of materializing a &str.
if array.data_buffers().is_empty() {
// All strings are inline — skip per-element length check
for (i, &view) in array.views().iter().enumerate() {
if !array.is_null(i) {
self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
}
}
} else {
for (i, &view) in array.views().iter().enumerate() {
if array.is_null(i) {
continue;
}
if (view as u32) <= 12 {
self.hll.add_hashed(HLL_HASH_STATE.hash_one(view));
} else {
// SAFETY: i is within bounds, checked non-null above
let s = unsafe { array.value_unchecked(i) };
self.hll.add(s);
}
}
}
Ok(())
}But this version regresses long strings by ~5%, which was stable over multiple runs: Whereas the version in the PR avoids the regression on long strings, at the price of only applying the optimization if all strings in the StringViewArray are short: Lmk if you have a preference, or if you can see a way to get the best of both worlds here. |
|
Could we do two passes: one for short strings and one for long? Order of iterations shouldn’t matter. |
| /// | ||
| /// Note that when we later move on to have serialized HLL register binaries | ||
| /// shared across cluster, this SEED will have to be consistent across all | ||
| /// shared across cluster, this HLL_HASH_STATE will have to be consistent across all |
| // hash the raw u128 view directly instead of materializing a &str. | ||
| if array.data_buffers().is_empty() { | ||
| for (i, &view) in array.views().iter().enumerate() { | ||
| if !array.is_null(i) { |
There was a problem hiding this comment.
I am wondering, perhaps we can reuse/generalize hash_string_view_array_inner as well (passing a quality hash function instead)?
It has some more optimization for specializing on non-nulls, etc.
There was a problem hiding this comment.
I took a look at this, but I didn't see an obvious way to reuse/generalize this code?
I played around with accessing data_buffers directly (similar to how hash_string_view_array_inner does it) and computing the hash there for out-of-line strings (I pushed a commit for this), but it didn't seem like a huge win (so I pushed a revert for it).
There was a problem hiding this comment.
I guess it could be rewritten in terms of https://doc.rust-lang.org/std/hash/trait.BuildHasher.html 🤔
|
Nice! Perhaps we can use the |
|
Thank you @neilconway 🚀🚀🚀 |
|
Thank you for the review and the idea, @Dandandan ! |
## Which issue does this PR close? - Closes apache#21039 . ## Rationale for this change For short strings that are stored inline in a `Utf8View`, we can hash the string's value directly, without materializing a `&str`, and then add the hash value to HyperLogLog directly. This improves performance by ~40%. ## What changes are included in this PR? * Add benchmark for `approx_distinct` on short strings * Add `add_hashed` API to `HyperLogLog` * Rename `SEED` to `HLL_HASH_STATE` and make it `pub(crate)` * Optimize `approx_distinct` on short strings as described above. ## Are these changes tested? Yes. ## Are there any user-facing changes? No.
Which issue does this PR close?
approx_distinctonUtf8View#21039 .Rationale for this change
For short strings that are stored inline in a
Utf8View, we can hash the string's value directly, without materializing a&str, and then add the hash value to HyperLogLog directly. This improves performance by ~40%.What changes are included in this PR?
approx_distincton short stringsadd_hashedAPI toHyperLogLogSEEDtoHLL_HASH_STATEand make itpub(crate)approx_distincton short strings as described above.Are these changes tested?
Yes.
Are there any user-facing changes?
No.