@ngdxzy ngdxzy commented Dec 12, 2025

Description:

This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial prior work, aligning its behavior with the CPU implementation in llama.cpp and improving the numerical accuracy of mixed-precision matmul operations.

Background:

In the current Hexagon NPU pipeline, quantization is performed on the fly during matrix multiplication: FP32 activations are quantized and then multiplied with already-quantized weights. As a result, the quantization group size directly affects the numerical behavior of these mixed-precision matmul operations, making alignment with the CPU Q8_0 scheme particularly important for correctness.

Previously, the Hexagon backend only supported quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp (group size 32), leading to accuracy differences.
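
For reference, the CPU Q8_0 format stores each block of 32 FP32 values as one per-block scale plus 32 int8 values, with d = amax / 127 and q_i = round(x_i / d). Below is a minimal scalar sketch of that per-block quantization; the struct and function names are illustrative rather than the actual ggml symbols, and ggml stores the scale as fp16 rather than fp32.

#include <math.h>
#include <stdint.h>

#define Q8_0_GROUP_SIZE 32  /* standard Q8_0 block size */

/* Illustrative block layout; the real ggml block_q8_0 keeps the scale as fp16. */
typedef struct {
    float  d;                      /* per-block scale */
    int8_t qs[Q8_0_GROUP_SIZE];    /* quantized values */
} q8_0_block_sketch;

/* Quantize one block of 32 fp32 values: d = amax / 127, q_i = round(x_i / d). */
static void quantize_q8_0_block_sketch(const float * x, q8_0_block_sketch * out) {
    float amax = 0.0f;
    for (int i = 0; i < Q8_0_GROUP_SIZE; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    out->d = d;
    for (int i = 0; i < Q8_0_GROUP_SIZE; i++) {
        out->qs[i] = (int8_t) roundf(x[i] * id);
    }
}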

What's new:

  1. Implemented true Q8_0 quantization kernels with smaller group sizes:
  • quantize_block_fp32_q8x1 with group size 32
  • quantize_block_fp32_q8x2 with group size 64
  2. Retained the original quantize_block_fp32_q8x4 implementation (group size 128) for compatibility and performance comparisons.
  3. Introduced a function-pointer-based dispatch mechanism to select the Q8 quantization kernel at runtime (a minimal sketch follows this list).
  • Enables dynamic switching between q8x1 / q8x2 / q8x4 without code duplication.
  • Facilitates future debugging, validation, and accuracy/performance trade-off studies.
  • Allows easier experimentation with different group sizes while keeping call sites unchanged.
  4. Aligned scale computation and quantization behavior with the CPU Q8_0 implementation in llama.cpp.
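
A minimal sketch of how such a function-pointer dispatch could be wired up; the typedef, signatures, and selector below are illustrative only (the actual Hexagon kernels take additional scratchpad and stride arguments, as the call site quoted further down shows).

#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature: quantize n fp32 values from src into dst. */
typedef void (*quantize_q8_fn)(const float * src, void * dst, size_t n);

/* The three kernels added/retained by this PR (declarations only, illustrative signatures). */
void quantize_block_fp32_q8x1(const float * src, void * dst, size_t n); /* group size 32  */
void quantize_block_fp32_q8x2(const float * src, void * dst, size_t n); /* group size 64  */
void quantize_block_fp32_q8x4(const float * src, void * dst, size_t n); /* group size 128 */

/* Select the kernel once (e.g. at context init); call sites stay unchanged. */
static quantize_q8_fn select_quantize_kernel(int group_size) {
    switch (group_size) {
        case 32:  return quantize_block_fp32_q8x1;
        case 64:  return quantize_block_fp32_q8x2;
        default:  return quantize_block_fp32_q8x4; /* 128, original path */
    }
}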

Why this matters:

  • Aligns Hexagon NPU Q8_0 quantization with the CPU implementation in llama.cpp
  • Improves quantization accuracy by using smaller group sizes
  • Reduces numerical discrepancies between CPU and NPU backends
  • Preserves the original q8x4 path for performance-oriented use cases
  • Validated on the K projection of layer 0 of the Qwen3-0.6B model, showing an over 35% reduction in relative L2 error with no observable performance regression.

Summary:

  • quantize_block_fp32_q8x1 → group size 32
  • quantize_block_fp32_q8x2 → group size 64
  • quantize_block_fp32_q8x4 → group size 128

This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.

@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations Dec 12, 2025
@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations Dec 12, 2025
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 13, 2025
quantize_fp32_q8x4x2(&octx->src1, octx->src1_spad.data, &octx->src0_spad, n, i, octx->src1_nrows_per_thread);
// quantize_block_fp32_q8x4: group size 128; tested on Qwen3-0.6B, K proj, layer 0, 256 tokens: relative L2 error 1.7%
// quantize_block_fp32_q8x2: group size 64 ; relative L2 error 1.3%
// quantize_block_fp32_q8x1: group size 32 ; relative L2 error 1.1%

Sorry to pop up with a maybe off-topic question:
Is there a convenient way to test op-level L2 divergence in the current codebase? I'm aware of llama-perplexity, but that is a model-level metric rather than something you can use for a single op.


Relative L2 error, rather than the absolute value, I guess.
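
For reference, a minimal sketch of that metric, i.e. ||y - y_ref||2 / ||y_ref||2 computed over a single op's output against a reference such as the CPU backend result, assuming both outputs are available as flat fp32 buffers (the helper name below is illustrative):

#include <math.h>
#include <stddef.h>

/* Relative L2 error between a backend output y and a reference output y_ref. */
static double relative_l2_error(const float * y, const float * y_ref, size_t n) {
    double diff2 = 0.0, ref2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double) y[i] - (double) y_ref[i];
        diff2 += d * d;
        ref2  += (double) y_ref[i] * (double) y_ref[i];
    }
    return ref2 > 0.0 ? sqrt(diff2 / ref2) : sqrt(diff2);
}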
