ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations #17977
Description:
This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial prior work. It aligns the backend's behavior with the CPU implementation in llama.cpp and improves the numerical accuracy of mixed-precision matmul operations.
Background:
In the current Hexagon NPU pipeline, quantization is performed on the fly during matrix multiplication: FP32 activations are quantized and then multiplied against already-quantized weights. As a result, the quantization group size directly affects the numerical behavior of these mixed-precision matmul operations, which makes alignment with the CPU Q8_0 scheme particularly important for correctness.
Previously, the Hexagon backend supported only quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp, which uses a group size of 32, leading to accuracy differences.
What's new:
The Hexagon backend gains a true Q8_0 quantization path that uses the same group size as the CPU Q8_0 scheme, alongside the existing quantize_block_fp32_q8x4 path.
Why this matters:
Because activations are quantized on the fly inside the matmul, a group size that differs from the CPU's changes the numerical results of every mixed-precision multiply; matching the CPU Q8_0 scheme removes this source of accuracy divergence between backends.
Summary:
This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.