Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
The model loads and generates valid text, but inference is roughly 5x slower than other MoE models with similar or even higher active parameter counts.
CPU optimizations were noted as planned in the initial support discussion (#15940). Opening this issue to track optimization progress.
Environment
- Model: Qwen3-Next-80B-A3B-Instruct (Q4_0)
- CPU: AMD EPYC 9454P (48c/96t, Zen 4, AVX-512)
- RAM: 12-channel DDR5-4800
- llama.cpp: build 7315 (4d37262)
- Backend: CPU-only
- OS: Debian 12 (Linux)
Benchmark
| Model | Size | Params | Backend | Threads | PP (512) [t/s] | TG (128) [t/s] |
|---|---|---|---|---|---|---|
| qwen3moe-30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 24 | 236.31 | 63.10 |
| gpt-oss-120B Q4_0 | 60.87 GiB | 116.83 B | CPU | 24 | 108.56 | 35.75 |
| qwen3next-80B.A3B Q4_0 | 41.98 GiB | 79.67 B | CPU | 24 | 73.74 | 11.76 |
Qwen3-Next with ~3B active params is 5.4x slower in text generation than Qwen3-MoE (also ~3B active), and 3x slower than GPT-OSS 120B, which has ~5B active params and is about 1.5x the model size.
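For reproducibility, the numbers above are consistent with a llama-bench invocation like the sketch below (the GGUF path is illustrative; the `-t`, `-p`, and `-n` values match the Threads, PP, and TG columns):

```sh
# Hypothetical invocation matching the table settings:
# 24 threads, 512-token prompt processing, 128-token generation
llama-bench -m models/Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf -t 24 -p 512 -n 128
```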
Motivation
Given its similar active parameter count (~3B), Qwen3-Next should perform much closer to Qwen3-MoE (~50-60 t/s on this hardware).
I can run any additional benchmarks if needed.
Possible Implementation
No response