Conversation
Adds a struct which holds multiple parallel Keccak states, which can be used to take advantage of various parallel SIMD implementations of Keccak. The struct is const generic around the number of parallel states `P`.

Also adds a `KeccakP1600Permute` trait and impls it for `ParKeccakP1600` where `P` is one of 1, 2, 4, or 8. These match the sizes supported by the SIMD backends presently available in XKCP, where 8-way parallelism is supported for AVX-512.

The "parallel" implementation currently uses a simple loop combined with our existing single-state Keccak impls, but the idea is that we can plug in various target-specific SIMD implementations in the future.

Additionally, changes the single-state `KeccakP1600` to be a newtype for `ParKeccakP1600<1>`. The advantage of a newtype over a type alias is that the `AsRef` and `From` impls don't need to deal with arrays.
# ParKeccakP1600 struct
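A minimal sketch of the shape described above. The type, trait, and parameter names follow the PR text, but the bodies are assumed; in particular the permutation below is a stand-in placeholder, not the real Keccak-p[1600].

```rust
// Placeholder standing in for the existing single-state permutation;
// the real implementation applies 24 rounds of Keccak-p[1600].
fn keccak_p1600_placeholder(state: &mut [u64; 25]) {
    for (i, lane) in state.iter_mut().enumerate() {
        *lane = lane.rotate_left(1) ^ (i as u64 + 1);
    }
}

/// `P` parallel Keccak-p[1600] states (1600 bits = 25 x u64 lanes each).
#[derive(Clone, Copy)]
pub struct ParKeccakP1600<const P: usize> {
    states: [[u64; 25]; P],
}

pub trait KeccakP1600Permute {
    fn permute(&mut self);
}

// Fallback "parallel" implementation: a plain loop over the single-state
// permutation. A target-specific SIMD backend would replace this for
// P = 2, 4, or 8. (The PR impls the trait only for those sizes; a blanket
// impl is used here for brevity.)
impl<const P: usize> KeccakP1600Permute for ParKeccakP1600<P> {
    fn permute(&mut self) {
        for state in &mut self.states {
            keccak_p1600_placeholder(state);
        }
    }
}

// Single-state type as a newtype rather than a type alias, so the
// `AsRef`/`From` impls don't need to deal with arrays.
pub struct KeccakP1600(ParKeccakP1600<1>);
```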
---
I don't think this is a good way to expose support for parallel processing. How do you intend for this API to be used in practice? I think we need something similar to the […]
---
Several Keccak-based constructions are designed to leverage this sort of parallel computation. The main one we implement that could benefit is […]. There are other Keccak-based constructions, which I'm not terribly familiar with and which we don't currently implement, that can also benefit and may be of interest to others, like Kravatte and Xoofff. You can look in XKCP for how the parallel implementations are structured: […]
They're all designed to support at least 8-way parallelism. This kind of parallelism seems to naturally fall out of the SIMD implementations too: e.g. the existing ARMv8 ASM we already have has always been capable of 2-way parallelism simply by filling the full vector register width; it just wasn't exposed (and should be by #112). If I were to port the SSSE3/AVX2 backends, they're capable of it as well. So I think we should at least expose it, because it's generally useful for many Keccak-based constructions.
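A portable illustration (not the actual backend) of why 2-way parallelism "falls out" of filling a 128-bit vector register: pack lane i of two independent states side by side, and every lane operation becomes a 2-way operation for free. Here `[u64; 2]` stands in for a vector type like `uint64x2_t`.

```rust
/// One vector "lane pair": lane i of state 0 and lane i of state 1,
/// side by side, as a 128-bit register would hold them.
type Lane2 = [u64; 2];

// Each scalar Keccak operation maps to a 2-way version element-wise.
fn xor2(a: Lane2, b: Lane2) -> Lane2 {
    [a[0] ^ b[0], a[1] ^ b[1]]
}

fn rol2(a: Lane2, n: u32) -> Lane2 {
    [a[0].rotate_left(n), a[1].rotate_left(n)]
}

/// Interleave two 25-lane Keccak states into one vectorized state.
fn interleave(s0: &[u64; 25], s1: &[u64; 25]) -> [Lane2; 25] {
    core::array::from_fn(|i| [s0[i], s1[i]])
}
```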
---
Yes, I know about […].

I agree that we should expose it. My objection was to the currently implemented API.
---
If you'd be fine with adding 2-way parallelism today, we can start there, as that is easily exposed by our existing ARMv8 implementation and an SSSE3 implementation on x86(_64).
---
Give me until the end of the week. I will try to draft an alternative API to better demonstrate my point.
---
Cool. I'm not too hung up on the API, just on having some way to expose it. I'll continue working on the SIMD backends, and hopefully we can support 2-way parallelism on both ARMv8 and x86(_64) using intrinsics-based implementations with CPU feature autodetection that are on by default and avoid the need for any […]
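A rough sketch of what on-by-default dispatch with runtime CPU feature autodetection could look like. The backend function names are hypothetical; only the detection macros (`std::arch::is_aarch64_feature_detected!`, `std::is_x86_feature_detected!`) are real std APIs.

```rust
// Per-target feature probe, compiled only for the matching architecture.
#[cfg(target_arch = "aarch64")]
fn have_simd_backend() -> bool {
    std::arch::is_aarch64_feature_detected!("sha3")
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn have_simd_backend() -> bool {
    std::is_x86_feature_detected!("ssse3")
}

#[cfg(not(any(
    target_arch = "aarch64",
    target_arch = "x86",
    target_arch = "x86_64"
)))]
fn have_simd_backend() -> bool {
    false
}

// Placeholder for the pure-software permutation.
fn permute_portable(state: &mut [u64; 25]) {
    state[0] ^= 1;
}

pub fn permute_dispatch(state: &mut [u64; 25]) {
    if have_simd_backend() {
        // Hypothetical: call the intrinsics backend here, e.g.
        // `permute_simd(state)`. This sketch falls through to the
        // portable path either way.
    }
    permute_portable(state);
}
```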
Rewrites the inline assembly implementation using an equivalent (but not identical) intrinsics implementation. Also exposes support for computing two Keccak states in parallel, which a previous comment in the ASM implementation noted was possible but wasn't actually exposed; it's now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).

This is a little tricky due to high register pressure: this implementation uses every vector register. I started by rewriting the round loop to iterate over the round constants, then breaking the body apart into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state. Theta was translated by hand, but manually mapping registers to slots in the state array for the remaining steps was too tedious. So I wrote a small program that operates over a representation of the original assembly, does all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.

Godbolt links to the original `asm!` versus this translation:

- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK

It uses nearly the same number of instructions, but there are differences between the two versions, i.e. it isn't an identical recreation of the original assembly. I'm not sure that would be possible (or preferable), but it should be functionally equivalent.

Since we're no longer using `asm!`, `cfg(armv8_asm)` has been removed and this is now enabled by default on `aarch64` targets.

Closes #95

# Benchmarks (`sha3` crate on M1 Max)

## Pure software implementation

```
test sha3_224_10    ... bench: 17.97 ns/iter (+/- 0.32) = 588 MB/s
test sha3_224_100   ... bench: 164.15 ns/iter (+/- 5.14) = 609 MB/s
test sha3_224_1000  ... bench: 1,646.07 ns/iter (+/- 139.45) = 607 MB/s
test sha3_224_10000 ... bench: 16,585.52 ns/iter (+/- 1,168.57) = 602 MB/s
test sha3_256_10    ... bench: 19.12 ns/iter (+/- 0.77) = 526 MB/s
test sha3_256_1000  ... bench: 1,694.21 ns/iter (+/- 41.20) = 590 MB/s
test sha3_256_10000 ... bench: 16,807.40 ns/iter (+/- 556.17) = 594 MB/s
test sha3_265_100   ... bench: 173.41 ns/iter (+/- 4.98) = 578 MB/s
test sha3_384_10    ... bench: 24.32 ns/iter (+/- 1.16) = 416 MB/s
test sha3_384_100   ... bench: 225.00 ns/iter (+/- 5.50) = 444 MB/s
test sha3_384_1000  ... bench: 2,224.49 ns/iter (+/- 47.86) = 449 MB/s
test sha3_384_10000 ... bench: 22,181.02 ns/iter (+/- 971.37) = 450 MB/s
test sha3_512_10    ... bench: 33.78 ns/iter (+/- 0.32) = 303 MB/s
test sha3_512_100   ... bench: 320.54 ns/iter (+/- 10.77) = 312 MB/s
test sha3_512_1000  ... bench: 3,174.62 ns/iter (+/- 80.98) = 315 MB/s
test sha3_512_10000 ... bench: 31,629.97 ns/iter (+/- 871.85) = 316 MB/s
test shake128_10    ... bench: 15.97 ns/iter (+/- 0.44) = 666 MB/s
test shake128_100   ... bench: 142.19 ns/iter (+/- 6.58) = 704 MB/s
test shake128_1000  ... bench: 1,390.27 ns/iter (+/- 56.14) = 719 MB/s
test shake128_10000 ... bench: 13,813.13 ns/iter (+/- 677.65) = 723 MB/s
test shake256_10    ... bench: 19.06 ns/iter (+/- 0.44) = 526 MB/s
test shake256_100   ... bench: 173.50 ns/iter (+/- 4.26) = 578 MB/s
test shake256_1000  ... bench: 1,695.05 ns/iter (+/- 87.19) = 589 MB/s
test shake256_10000 ... bench: 16,882.98 ns/iter (+/- 683.56) = 592 MB/s
```

## This new intrinsics implementation

```
test sha3_224_10    ... bench: 13.07 ns/iter (+/- 0.55) = 769 MB/s
test sha3_224_100   ... bench: 111.29 ns/iter (+/- 6.62) = 900 MB/s
test sha3_224_1000  ... bench: 1,113.87 ns/iter (+/- 29.88) = 898 MB/s
test sha3_224_10000 ... bench: 11,095.95 ns/iter (+/- 302.99) = 901 MB/s
test sha3_256_10    ... bench: 13.53 ns/iter (+/- 0.51) = 769 MB/s
test sha3_256_1000  ... bench: 1,173.40 ns/iter (+/- 33.72) = 852 MB/s
test sha3_256_10000 ... bench: 12,305.99 ns/iter (+/- 623.31) = 812 MB/s
test sha3_265_100   ... bench: 118.16 ns/iter (+/- 2.85) = 847 MB/s
test sha3_384_10    ... bench: 17.27 ns/iter (+/- 0.78) = 588 MB/s
test sha3_384_100   ... bench: 153.80 ns/iter (+/- 5.42) = 653 MB/s
test sha3_384_1000  ... bench: 1,529.35 ns/iter (+/- 18.99) = 654 MB/s
test sha3_384_10000 ... bench: 15,239.19 ns/iter (+/- 189.19) = 656 MB/s
test sha3_512_10    ... bench: 23.43 ns/iter (+/- 0.95) = 434 MB/s
test sha3_512_100   ... bench: 218.97 ns/iter (+/- 4.01) = 458 MB/s
test sha3_512_1000  ... bench: 2,193.58 ns/iter (+/- 37.98) = 455 MB/s
test sha3_512_10000 ... bench: 21,968.75 ns/iter (+/- 385.75) = 455 MB/s
test shake128_10    ... bench: 11.47 ns/iter (+/- 0.32) = 909 MB/s
test shake128_100   ... bench: 95.51 ns/iter (+/- 1.32) = 1052 MB/s
test shake128_1000  ... bench: 960.08 ns/iter (+/- 34.57) = 1041 MB/s
test shake128_10000 ... bench: 9,564.39 ns/iter (+/- 255.34) = 1045 MB/s
test shake256_10    ... bench: 13.61 ns/iter (+/- 0.53) = 769 MB/s
test shake256_100   ... bench: 116.77 ns/iter (+/- 1.94) = 862 MB/s
test shake256_1000  ... bench: 1,163.09 ns/iter (+/- 27.17) = 859 MB/s
test shake256_10000 ... bench: 11,750.47 ns/iter (+/- 250.38) = 851 MB/s
```

## Original assembly

```
test sha3_224_10    ... bench: 12.54 ns/iter (+/- 0.43) = 833 MB/s
test sha3_224_100   ... bench: 109.49 ns/iter (+/- 2.54) = 917 MB/s
test sha3_224_1000  ... bench: 1,095.79 ns/iter (+/- 32.04) = 913 MB/s
test sha3_224_10000 ... bench: 10,953.02 ns/iter (+/- 157.49) = 912 MB/s
test sha3_256_10    ... bench: 13.05 ns/iter (+/- 0.25) = 769 MB/s
test sha3_256_1000  ... bench: 1,161.46 ns/iter (+/- 28.09) = 861 MB/s
test sha3_256_10000 ... bench: 11,609.98 ns/iter (+/- 148.88) = 861 MB/s
test sha3_265_100   ... bench: 118.17 ns/iter (+/- 7.42) = 847 MB/s
test sha3_384_10    ... bench: 17.07 ns/iter (+/- 2.80) = 588 MB/s
test sha3_384_100   ... bench: 151.93 ns/iter (+/- 4.39) = 662 MB/s
test sha3_384_1000  ... bench: 1,506.50 ns/iter (+/- 40.71) = 664 MB/s
test sha3_384_10000 ... bench: 15,119.04 ns/iter (+/- 495.59) = 661 MB/s
test sha3_512_10    ... bench: 22.93 ns/iter (+/- 0.53) = 454 MB/s
test sha3_512_100   ... bench: 216.77 ns/iter (+/- 7.42) = 462 MB/s
test sha3_512_1000  ... bench: 2,165.67 ns/iter (+/- 49.04) = 461 MB/s
test sha3_512_10000 ... bench: 21,666.71 ns/iter (+/- 651.02) = 461 MB/s
test shake128_10    ... bench: 11.30 ns/iter (+/- 0.14) = 909 MB/s
test shake128_100   ... bench: 94.75 ns/iter (+/- 3.86) = 1063 MB/s
test shake128_1000  ... bench: 961.72 ns/iter (+/- 81.88) = 1040 MB/s
test shake128_10000 ... bench: 9,573.39 ns/iter (+/- 311.05) = 1044 MB/s
test shake256_10    ... bench: 13.17 ns/iter (+/- 0.54) = 769 MB/s
test shake256_100   ... bench: 117.39 ns/iter (+/- 3.22) = 854 MB/s
test shake256_1000  ... bench: 1,174.65 ns/iter (+/- 45.62) = 851 MB/s
test shake256_10000 ... bench: 11,659.19 ns/iter (+/- 330.23) = 857 MB/s
```

The performance seems pretty close to the original assembly, maybe just slightly slower.
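For reference, the translation pass described above maps NEON registers onto lanes of a 25-lane state array. This is a portable illustration of one step of the permutation (theta) over a plain `[u64; 25]` state; it is not the PR's intrinsics code, where each `u64` lane is replaced by a `uint64x2_t` lane pair performing the same operations on two states at once.

```rust
/// Theta step of Keccak-p[1600] over a flat 5x5 lane array, indexed as
/// `x + 5 * y`. Each lane is XORed with a mixing value derived from the
/// column parities of its neighbors.
fn theta(a: &mut [u64; 25]) {
    // Parity of each of the five columns.
    let mut c = [0u64; 5];
    for x in 0..5 {
        c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
    }
    // Mix each column with the parities of the adjacent columns.
    for x in 0..5 {
        let d = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        for y in 0..5 {
            a[x + 5 * y] ^= d;
        }
    }
}
```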
---
See #113.