Conversation
Adds a struct which holds multiple parallel Keccak states, which can be used to take advantage of various parallel SIMD implementations of Keccak. The struct is const generic around the number of parallel states `P`.

Also adds a `KeccakP1600Permute` trait and impls it for `ParKeccakP1600` where `P` is one of 1, 2, 4, or 8. These match the sizes supported by the SIMD backends presently available in XKCP, where 8-way parallelism is supported for AVX-512.

The "parallel" implementation currently uses a simple loop combined with our existing single-state Keccak impls, but the idea is that we can plug in various target-specific SIMD implementations in the future.

Additionally, changes the single-state `KeccakP1600` to be a newtype for `ParKeccakP1600<1>`. The advantage of a newtype over a type alias is that the `AsRef` and `From` impls don't need to deal with arrays.
# ParKeccakP1600 struct
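A minimal sketch of the shape described above. The type, trait, and parameter names follow the PR text, but the bodies are assumed; in particular the permutation below is a stand-in placeholder, not the real Keccak-p[1600].

```rust
// Placeholder standing in for the existing single-state permutation;
// the real implementation applies 24 rounds of Keccak-p[1600].
fn keccak_p1600_placeholder(state: &mut [u64; 25]) {
    for (i, lane) in state.iter_mut().enumerate() {
        *lane = lane.rotate_left(1) ^ (i as u64 + 1);
    }
}

/// `P` parallel Keccak-p[1600] states (1600 bits = 25 x u64 lanes each).
#[derive(Clone, Copy)]
pub struct ParKeccakP1600<const P: usize> {
    states: [[u64; 25]; P],
}

pub trait KeccakP1600Permute {
    fn permute(&mut self);
}

// Fallback "parallel" implementation: a plain loop over the single-state
// permutation. A target-specific SIMD backend would replace this for
// P = 2, 4, or 8. (The PR impls the trait only for those sizes; a blanket
// impl is used here for brevity.)
impl<const P: usize> KeccakP1600Permute for ParKeccakP1600<P> {
    fn permute(&mut self) {
        for state in &mut self.states {
            keccak_p1600_placeholder(state);
        }
    }
}

// Single-state type as a newtype rather than a type alias, so the
// `AsRef`/`From` impls don't need to deal with arrays.
pub struct KeccakP1600(ParKeccakP1600<1>);
```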
---
I don't think this is a good way to expose support for parallel processing. How do you intend for this API to be used in practice? I think we need something similar to the […]
---
Several Keccak-based constructions are designed to leverage this sort of parallel computation. The main one we implement that could benefit is […]. There are other Keccak-based constructions, which I'm not terribly familiar with and which we don't currently implement, that can also benefit and may be of interest to others, like Kravatte and Xoofff. You can look in XKCP for how the parallel implementations are structured: […]
They're all designed to support at least 8-way parallelism. This kind of parallelism seems to naturally fall out of the SIMD implementations too: e.g. the existing ARMv8 ASM we already have has always been capable of 2-way parallelism simply by filling the full vector register width; it just wasn't exposed (and should be by #112). If I were to port the SSSE3/AVX2 backends, they're capable of it as well. So I think we should at least expose it, because it's generally useful for many Keccak-based constructions.
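A portable illustration (not the actual backend) of why 2-way parallelism "falls out" of filling a 128-bit vector register: pack lane i of two independent states side by side, and every lane operation becomes a 2-way operation for free. Here `[u64; 2]` stands in for a vector type like `uint64x2_t`.

```rust
/// One vector "lane pair": lane i of state 0 and lane i of state 1,
/// side by side, as a 128-bit register would hold them.
type Lane2 = [u64; 2];

// Each scalar Keccak operation maps to a 2-way version element-wise.
fn xor2(a: Lane2, b: Lane2) -> Lane2 {
    [a[0] ^ b[0], a[1] ^ b[1]]
}

fn rol2(a: Lane2, n: u32) -> Lane2 {
    [a[0].rotate_left(n), a[1].rotate_left(n)]
}

/// Interleave two 25-lane Keccak states into one vectorized state.
fn interleave(s0: &[u64; 25], s1: &[u64; 25]) -> [Lane2; 25] {
    core::array::from_fn(|i| [s0[i], s1[i]])
}
```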
---
Yes, I know about […].

I agree that we should expose it. My objection was to the currently implemented API.
---
If you'd be fine with adding 2-way parallelism today, we can start there, as that is easily exposed by our existing ARMv8 implementation and an SSSE3 implementation on x86(_64).
---
Give me until the end of the week. I will try to draft an alternative API to better demonstrate my point.
---
Cool. I'm not too hung up on the API, just on having some way to expose it. I'll continue working on the SIMD backends, and hopefully we can support 2-way parallelism on both ARMv8 and x86(_64) using intrinsics-based implementations with CPU feature autodetection that are on by default and avoid the need for any […]
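A rough sketch of what on-by-default dispatch with runtime CPU feature autodetection could look like. The backend function names are hypothetical; only the detection macros (`std::arch::is_aarch64_feature_detected!`, `std::is_x86_feature_detected!`) are real std APIs.

```rust
// Per-target feature probe, compiled only for the matching architecture.
#[cfg(target_arch = "aarch64")]
fn have_simd_backend() -> bool {
    std::arch::is_aarch64_feature_detected!("sha3")
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn have_simd_backend() -> bool {
    std::is_x86_feature_detected!("ssse3")
}

#[cfg(not(any(
    target_arch = "aarch64",
    target_arch = "x86",
    target_arch = "x86_64"
)))]
fn have_simd_backend() -> bool {
    false
}

// Placeholder for the pure-software permutation.
fn permute_portable(state: &mut [u64; 25]) {
    state[0] ^= 1;
}

pub fn permute_dispatch(state: &mut [u64; 25]) {
    if have_simd_backend() {
        // Hypothetical: call the intrinsics backend here, e.g.
        // `permute_simd(state)`. This sketch falls through to the
        // portable path either way.
    }
    permute_portable(state);
}
```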
Rewrites the inline assembly implementation using an equivalent (but not identical) intrinsics implementation. Also exposes support for computing two Keccak states in parallel, which a previous comment in the ASM implementation noted was possible but wasn't actually exposed; it's now available as `p1600_armv8_sha3_times2` (though not yet in the public API, see #110).

This is a little tricky due to high register pressure: this implementation uses every vector register. I started by rewriting the round loop to iterate over the round constants, then breaking the body apart into theta and everything else (rho/pi/chi/iota), mapping the NEON registers onto a `[uint64x2_t; 25]` state. Theta was translated by hand, but manually mapping registers to slots in the state array for the remaining steps was too tedious. So I wrote a small program that operates over a representation of the original assembly, does all the bookkeeping for which registers map to which slots in the state array, and outputs the equivalent intrinsics code.

Godbolt links to the original `asm!` versus this translation:

- original: https://godbolt.org/z/G8Mf5vboE
- translated: https://godbolt.org/z/sszzbdexK

It uses nearly the same number of instructions, but there are differences between the two versions, i.e. it isn't an identical recreation of the original assembly. I'm not sure that would be possible (or preferable), but it should be functionally equivalent.

Since we're no longer using `asm!`, `cfg(armv8_asm)` has been removed and this is now enabled by default on `aarch64` targets.

Closes #95

# Benchmarks (`sha3` crate on M1 Max)

## Pure software implementation

```
test sha3_224_10    ... bench: 17.97 ns/iter (+/- 0.32) = 588 MB/s
test sha3_224_100   ... bench: 164.15 ns/iter (+/- 5.14) = 609 MB/s
test sha3_224_1000  ... bench: 1,646.07 ns/iter (+/- 139.45) = 607 MB/s
test sha3_224_10000 ... bench: 16,585.52 ns/iter (+/- 1,168.57) = 602 MB/s
test sha3_256_10    ... bench: 19.12 ns/iter (+/- 0.77) = 526 MB/s
test sha3_256_1000  ... bench: 1,694.21 ns/iter (+/- 41.20) = 590 MB/s
test sha3_256_10000 ... bench: 16,807.40 ns/iter (+/- 556.17) = 594 MB/s
test sha3_265_100   ... bench: 173.41 ns/iter (+/- 4.98) = 578 MB/s
test sha3_384_10    ... bench: 24.32 ns/iter (+/- 1.16) = 416 MB/s
test sha3_384_100   ... bench: 225.00 ns/iter (+/- 5.50) = 444 MB/s
test sha3_384_1000  ... bench: 2,224.49 ns/iter (+/- 47.86) = 449 MB/s
test sha3_384_10000 ... bench: 22,181.02 ns/iter (+/- 971.37) = 450 MB/s
test sha3_512_10    ... bench: 33.78 ns/iter (+/- 0.32) = 303 MB/s
test sha3_512_100   ... bench: 320.54 ns/iter (+/- 10.77) = 312 MB/s
test sha3_512_1000  ... bench: 3,174.62 ns/iter (+/- 80.98) = 315 MB/s
test sha3_512_10000 ... bench: 31,629.97 ns/iter (+/- 871.85) = 316 MB/s
test shake128_10    ... bench: 15.97 ns/iter (+/- 0.44) = 666 MB/s
test shake128_100   ... bench: 142.19 ns/iter (+/- 6.58) = 704 MB/s
test shake128_1000  ... bench: 1,390.27 ns/iter (+/- 56.14) = 719 MB/s
test shake128_10000 ... bench: 13,813.13 ns/iter (+/- 677.65) = 723 MB/s
test shake256_10    ... bench: 19.06 ns/iter (+/- 0.44) = 526 MB/s
test shake256_100   ... bench: 173.50 ns/iter (+/- 4.26) = 578 MB/s
test shake256_1000  ... bench: 1,695.05 ns/iter (+/- 87.19) = 589 MB/s
test shake256_10000 ... bench: 16,882.98 ns/iter (+/- 683.56) = 592 MB/s
```

## This new intrinsics implementation

```
test sha3_224_10    ... bench: 13.07 ns/iter (+/- 0.55) = 769 MB/s
test sha3_224_100   ... bench: 111.29 ns/iter (+/- 6.62) = 900 MB/s
test sha3_224_1000  ... bench: 1,113.87 ns/iter (+/- 29.88) = 898 MB/s
test sha3_224_10000 ... bench: 11,095.95 ns/iter (+/- 302.99) = 901 MB/s
test sha3_256_10    ... bench: 13.53 ns/iter (+/- 0.51) = 769 MB/s
test sha3_256_1000  ... bench: 1,173.40 ns/iter (+/- 33.72) = 852 MB/s
test sha3_256_10000 ... bench: 12,305.99 ns/iter (+/- 623.31) = 812 MB/s
test sha3_265_100   ... bench: 118.16 ns/iter (+/- 2.85) = 847 MB/s
test sha3_384_10    ... bench: 17.27 ns/iter (+/- 0.78) = 588 MB/s
test sha3_384_100   ... bench: 153.80 ns/iter (+/- 5.42) = 653 MB/s
test sha3_384_1000  ... bench: 1,529.35 ns/iter (+/- 18.99) = 654 MB/s
test sha3_384_10000 ... bench: 15,239.19 ns/iter (+/- 189.19) = 656 MB/s
test sha3_512_10    ... bench: 23.43 ns/iter (+/- 0.95) = 434 MB/s
test sha3_512_100   ... bench: 218.97 ns/iter (+/- 4.01) = 458 MB/s
test sha3_512_1000  ... bench: 2,193.58 ns/iter (+/- 37.98) = 455 MB/s
test sha3_512_10000 ... bench: 21,968.75 ns/iter (+/- 385.75) = 455 MB/s
test shake128_10    ... bench: 11.47 ns/iter (+/- 0.32) = 909 MB/s
test shake128_100   ... bench: 95.51 ns/iter (+/- 1.32) = 1052 MB/s
test shake128_1000  ... bench: 960.08 ns/iter (+/- 34.57) = 1041 MB/s
test shake128_10000 ... bench: 9,564.39 ns/iter (+/- 255.34) = 1045 MB/s
test shake256_10    ... bench: 13.61 ns/iter (+/- 0.53) = 769 MB/s
test shake256_100   ... bench: 116.77 ns/iter (+/- 1.94) = 862 MB/s
test shake256_1000  ... bench: 1,163.09 ns/iter (+/- 27.17) = 859 MB/s
test shake256_10000 ... bench: 11,750.47 ns/iter (+/- 250.38) = 851 MB/s
```

## Original assembly

```
test sha3_224_10    ... bench: 12.54 ns/iter (+/- 0.43) = 833 MB/s
test sha3_224_100   ... bench: 109.49 ns/iter (+/- 2.54) = 917 MB/s
test sha3_224_1000  ... bench: 1,095.79 ns/iter (+/- 32.04) = 913 MB/s
test sha3_224_10000 ... bench: 10,953.02 ns/iter (+/- 157.49) = 912 MB/s
test sha3_256_10    ... bench: 13.05 ns/iter (+/- 0.25) = 769 MB/s
test sha3_256_1000  ... bench: 1,161.46 ns/iter (+/- 28.09) = 861 MB/s
test sha3_256_10000 ... bench: 11,609.98 ns/iter (+/- 148.88) = 861 MB/s
test sha3_265_100   ... bench: 118.17 ns/iter (+/- 7.42) = 847 MB/s
test sha3_384_10    ... bench: 17.07 ns/iter (+/- 2.80) = 588 MB/s
test sha3_384_100   ... bench: 151.93 ns/iter (+/- 4.39) = 662 MB/s
test sha3_384_1000  ... bench: 1,506.50 ns/iter (+/- 40.71) = 664 MB/s
test sha3_384_10000 ... bench: 15,119.04 ns/iter (+/- 495.59) = 661 MB/s
test sha3_512_10    ... bench: 22.93 ns/iter (+/- 0.53) = 454 MB/s
test sha3_512_100   ... bench: 216.77 ns/iter (+/- 7.42) = 462 MB/s
test sha3_512_1000  ... bench: 2,165.67 ns/iter (+/- 49.04) = 461 MB/s
test sha3_512_10000 ... bench: 21,666.71 ns/iter (+/- 651.02) = 461 MB/s
test shake128_10    ... bench: 11.30 ns/iter (+/- 0.14) = 909 MB/s
test shake128_100   ... bench: 94.75 ns/iter (+/- 3.86) = 1063 MB/s
test shake128_1000  ... bench: 961.72 ns/iter (+/- 81.88) = 1040 MB/s
test shake128_10000 ... bench: 9,573.39 ns/iter (+/- 311.05) = 1044 MB/s
test shake256_10    ... bench: 13.17 ns/iter (+/- 0.54) = 769 MB/s
test shake256_100   ... bench: 117.39 ns/iter (+/- 3.22) = 854 MB/s
test shake256_1000  ... bench: 1,174.65 ns/iter (+/- 45.62) = 851 MB/s
test shake256_10000 ... bench: 11,659.19 ns/iter (+/- 330.23) = 857 MB/s
```

The performance seems pretty close to the original assembly, maybe just slightly slower.
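For reference, the translation pass described above maps NEON registers onto lanes of a 25-lane state array. This is a portable illustration of one step of the permutation (theta) over a plain `[u64; 25]` state; it is not the PR's intrinsics code, where each `u64` lane is replaced by a `uint64x2_t` lane pair performing the same operations on two states at once.

```rust
/// Theta step of Keccak-p[1600] over a flat 5x5 lane array, indexed as
/// `x + 5 * y`. Each lane is XORed with a mixing value derived from the
/// column parities of its neighbors.
fn theta(a: &mut [u64; 25]) {
    // Parity of each of the five columns.
    let mut c = [0u64; 5];
    for x in 0..5 {
        c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];
    }
    // Mix each column with the parities of the adjacent columns.
    for x in 0..5 {
        let d = c[(x + 4) % 5] ^ c[(x + 1) % 5].rotate_left(1);
        for y in 0..5 {
            a[x + 5 * y] ^= d;
        }
    }
}
```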
---
See #113.