Feature Request: Qwen3-Next: CPU performance optimization #17936

@jdvpro

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The model loads and generates valid text, but inference is roughly 5x slower than other MoE models with a similar or even higher active parameter count.

CPU optimizations were noted as planned in the initial support discussion (#15940). Opening this issue to track optimization progress.

Environment

  • Model: Qwen3-Next-80B-A3B-Instruct (Q4_0)
  • CPU: AMD EPYC 9454P (48c/96t, Zen 4, AVX-512)
  • RAM: 12-channel DDR5-4800
  • llama.cpp: build 7315 (4d37262)
  • Backend: CPU-only
  • OS: Debian 12 (Linux)

Benchmark

| Model | Size | Params | Backend | Threads | PP (512) [t/s] | TG (128) [t/s] |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3moe-30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 24 | 236.31 | 63.10 |
| gpt-oss-120B Q4_0 | 60.87 GiB | 116.83 B | CPU | 24 | 108.56 | 35.75 |
| qwen3next-80B.A3B Q4_0 | 41.98 GiB | 79.67 B | CPU | 24 | 73.74 | 11.76 |

Qwen3-Next with ~3B active params is 5.4x slower (TG) than Qwen3-MoE (also ~3B active), and 3x slower than GPT-OSS 120B, which has ~5B active params and is ~1.5x the model size.
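
For reference, the table is in llama-bench's output format; a run along the following lines should reproduce it (the GGUF path below is a placeholder, and the exact flags used for the runs above are not recorded here):

```sh
# Sketch of a reproduction command; the model path is a placeholder.
# -t sets thread count, -p the prompt-processing length, -n the generation length.
./llama-bench -m qwen3next-80B-A3B-Q4_0.gguf -t 24 -p 512 -n 128
```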

Motivation

Given the similar active parameter count (~3B), Qwen3-Next should perform much closer to Qwen3-MoE's ~50-60 t/s on this hardware.
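
As a rough sanity check (my own back-of-envelope, assuming token generation is purely memory-bandwidth bound, each token reads the ~3B active weights exactly once, and Q4_0 costs ~4.5 bits/weight including block scales):

$$
\frac{12 \times 8\ \text{bytes} \times 4800\ \text{MT/s}}{3\text{B params} \times 0.5625\ \text{bytes/param}} \approx \frac{460.8\ \text{GB/s}}{1.69\ \text{GB/token}} \approx 273\ \text{t/s (theoretical ceiling)}
$$

Qwen3-MoE's 63 t/s is a plausible fraction of that ceiling; Qwen3-Next's ~12 t/s suggests a compute bottleneck somewhere in the implementation rather than weight bandwidth.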

I can run any additional benchmarks if needed.

Possible Implementation

No response
