Optimization: Qwen3 next autoregressive pass #17996
Conversation
before: ggml_cuda_init: found 3 CUDA devices:
after: ggml_cuda_init: found 3 CUDA devices:
Nah, this should be a general optimization. This means there are other bottlenecks in play for the ROCm implementation besides the slow delta-net. Can you run inference with
That looks like a 10% bump, right?
@pwilkin Hopefully this log is what you need :)
CISC left a comment:
There's an excessive amount of conts and asserts here, most of which I'm sure are unnecessary, but I think qwen3next needs a general cleanup of these anyway, so will leave that to you at a later stage.
@IIIIIllllIIIIIlllll can you do a bench for
@pwilkin In case you're wondering, I think the
@pwilkin
Adding some multi-GPU ROCm data with several experts offloaded to CPU:

Setup
- CPU: Ryzen 9 3950x
- Model: Qwen3-Next-80B-A3B-Thinking-Q4_K_S
- Command: /llama.cpp/build/build/bin/llama-server --host 127.0.0.1 --jinja --min-p 0 --mlock --mmap -ncmoe 20 --port 44163 --repeat-penalty 1.05 --temp 0.5 --top-k 0.20 --top-p 0.95 --warmup --alias Qwen3-Next-80B-A3B-Thinking-Q4_K_S --ctx-size 75000 --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --model /models/Qwen3-Next-80B-A3B-Thinking-Q4_K_S.gguf --n-gpu-layers 999 --threads 8 --tensor-split 67,33 --log-verbose

Results
- ggml-org/main branch: 17.3 tokens/second
- pwilkin:lean_mean_token_machine branch: 22.5 tokens/second

Increase of >5 tokens/second, or ~30% increase in token-gen speed.
Some 4x V100 32GB results w/ q8_0 gguf
master:
lean_mean_token_machine:
Before: 38.39 t/s
I was feeling a bit bored and naively asked gemini-cli to make the changes CISC suggested. It seems like it's consistently faster and the output seems coherent (only did very brief testing). I do remember it breaking when it changed the sum_row conts, though I don't know if any of the rest are needed. cont/assert reduction: gain of 1.19 t/s over this commit (+2.67%), for a total gain of 7.4 t/s (+19.3%) over master. Patch file if you're interested: qwen3.patch
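To make the cont/assert discussion concrete, here is a minimal, self-contained sketch of what a redundant ggml_cont is (the tensor names and sizes are made up for illustration and are not taken from the patch): ggml_cont forces a contiguous copy of its input, which only does useful work when the input is a non-contiguous view, e.g. after a transpose or permute.

#include "ggml.h"

int main() {
    // small CPU-side context, enough for the toy tensors below
    ggml_init_params params = { /*mem_size =*/ 16u * 1024 * 1024, /*mem_buffer =*/ nullptr, /*no_alloc =*/ false };
    ggml_context * ctx = ggml_init(params);

    // a freshly created tensor is laid out contiguously
    ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 4);
    GGML_ASSERT(ggml_is_contiguous(x));

    // transpose only permutes strides, so the result is a non-contiguous view;
    // a cont here does real work and may be required by the op that consumes it
    ggml_tensor * xt = ggml_transpose(ctx, x);
    GGML_ASSERT(!ggml_is_contiguous(xt));
    ggml_tensor * xt_cont = ggml_cont(ctx, xt);
    GGML_ASSERT(ggml_is_contiguous(xt_cont));

    // calling cont on an already contiguous tensor only adds an extra copy node
    // to the graph; ops like this are what the reduction above removes
    ggml_tensor * y = ggml_cont(ctx, x); // redundant
    (void) y;

    ggml_free(ctx);
    return 0;
}

Whether a particular cont in the qwen3next graph can really be dropped depends on whether the consuming op accepts non-contiguous inputs on every backend, which is presumably why the sum_row conts broke when they were removed.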
Nice little PP boost.
Worth a few percent on my system: The number of CONT ops for
Please ignore my previous reply. The tests in my previous reply were run in a PuTTY terminal, and I don't know why the results were so bad. It's really strange: changing -DGGML_HIP_ROCWMMA_FATTN to OFF significantly improved pp speed... Perhaps the AI MAX+ 395 has reached its performance limit (this is questionable).
this PR, -DGGML_HIP_ROCWMMA_FATTN=OFF:
this PR, -DGGML_HIP_ROCWMMA_FATTN=ON:
master, -DGGML_HIP_ROCWMMA_FATTN=OFF:
// Choose between build_delta_net_chunking, build_delta_net_recurrent, and build_delta_net_autoregressive based on n_seq_tokens
ggml_tensor * attn_out;
if (n_seq_tokens == 1) {
    attn_out = build_delta_net_autoregressive(q_conv, k_conv, v_conv, gate, beta, state, il);
} else if (n_seq_tokens > CHUNK_SIZE) {
    attn_out = build_delta_net_chunking(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
} else {
    attn_out = build_delta_net_recurrent(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
}
This is highly not recommended. Instead of adding more branches, we have to figure out how to make the graph static. Start with simplifying the existing graphs by removing redundant ops.
But in this case we can't make the graph static since the special branch here is one where the decay mask computation doesn't happen (because n_seq_tokens == 1, so it all collapses to trivial transformations, therefore they can be optimized out).
I can probably remove the recurrent part now since I'm not sure there's a realistic case for it, it'll be either chunking or autoregressive.
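For illustration only, if the recurrent path were dropped as suggested, the dispatch quoted earlier in this thread could collapse to a two-way branch. This is a fragment sketched against the signatures shown above, not a tested change:

ggml_tensor * attn_out;
if (n_seq_tokens == 1) {
    // single-token decode: the decay mask degenerates to trivial transformations,
    // so the cheap autoregressive path applies
    attn_out = build_delta_net_autoregressive(q_conv, k_conv, v_conv, gate, beta, state, il);
} else {
    // multi-token batches (prompt processing): chunked delta-net
    attn_out = build_delta_net_chunking(q_conv, k_conv, v_conv, gate, beta, state, causal_mask, identity, il);
}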
Is there any reason why it could have gotten slower for me? I'm compiling it with
Got an interesting finding on Win11 + RTX 5090: compiling with Vulkan support and forcing the vulkan0 device gives pp512 gains of 60%+ and tg128 gains of 100%+.
vulkan0:
build: c00ff92 (7389)
cuda0:
build: c00ff92 (7389)
This change adds a dedicated autoregressive version of delta-net which short-circuits all the recurrent computations for n_seq_tokens == 1. The end result is roughly a 40% bump in token generation speed.