Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
The model loads and generates valid text, but inference is roughly 5x slower than other MoE models with similar or even higher active parameter counts.
CPU optimizations were noted as planned in the initial support discussion (#15940). Opening this issue to track optimization progress.
Environment
- Model: Qwen3-Next-80B-A3B-Instruct (Q4_0)
- CPU: AMD EPYC 9454P (48c/96t, Zen 4, AVX-512)
- RAM: 12-channel DDR5-4800
- llama.cpp: build 7315 (4d37262)
- Backend: CPU-only
- OS: Debian 12 (Linux)
Benchmark
| Model | Size | Params | Backend | Threads | PP (512) [t/s] | TG (128) [t/s] |
|---|---|---|---|---|---|---|
| qwen3moe-30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 24 | 236.31 | 63.10 |
| gpt-oss-120B Q4_0 | 60.87 GiB | 116.83 B | CPU | 24 | 108.56 | 35.75 |
| qwen3next-80B.A3B Q4_0 | 41.98 GiB | 79.67 B | CPU | 24 | 73.74 | 11.76 |
Qwen3-Next with ~3B active params is 5.4x slower in text generation than Qwen3-MoE (also ~3B active), and 3x slower than GPT-OSS 120B, which has ~5B active params and is about 1.5x the model size.
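For reproducibility, the numbers above are consistent with a llama-bench invocation like the sketch below (the GGUF path is illustrative; the `-t`, `-p`, and `-n` values match the Threads, PP, and TG columns):

```sh
# Hypothetical invocation matching the table settings:
# 24 threads, 512-token prompt processing, 128-token generation
llama-bench -m models/Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf -t 24 -p 512 -n 128
```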
Motivation
Given its similar active parameter count (~3B), Qwen3-Next should perform much closer to Qwen3-MoE (~50-60 t/s on this hardware).
I can run any additional benchmarks if needed.
Possible Implementation
No response