ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations #17977
Description:
This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial prior work. It aligns the backend's behavior with the CPU implementation in llama.cpp and improves the numerical accuracy of mixed-precision matmul operations.
Background:
In the current Hexagon NPU pipeline, quantization is performed on the fly during matrix multiplication: FP32 activations are quantized and then multiplied against already-quantized weights. As a result, the quantization group size directly affects the numerical behavior of these mixed-precision matmul operations, which makes alignment with the CPU Q8_0 scheme particularly important for correctness.
Previously, the Hexagon backend supported only quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp, which uses a group size of 32, leading to accuracy differences.
What's new:
The Hexagon backend gains a true Q8_0 quantization path that uses the same group size as the CPU Q8_0 scheme, alongside the existing quantize_block_fp32_q8x4 path.
Why this matters:
Because activations are quantized on the fly inside the matmul, a group size that differs from the CPU's changes the numerical results of every mixed-precision multiply; matching the CPU Q8_0 scheme removes this source of accuracy divergence between backends.
Summary:
This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.