Support Flashinfer-cutedsl nvfp4 grouped masked gemm #6924
Open
mpgemm wants to merge 11 commits into PaddlePaddle:develop from
Thanks for your contribution!
mpgemm (Author) commented Mar 18, 2026
mpgemm left a comment
Support flashinfer-cutedsl nvfp4 FusedMoE. Prefill still has problems.
zhoutianzi666 previously approved these changes Mar 19, 2026
lizexu123 reviewed Mar 19, 2026
    raise ValueError(f"Unsupported cute dtype {input.dtype}")


def flashinfer_cutedsl_moe_masked(
Collaborator
You wrote this function, but why is it not used in apply?
Motivation
Support NVFP4 FusedMoE computation with flashinfer-cutedsl.
Modifications
1. Add /moe/flashinfer_cutedsl_moe.py, which brings in the flashinfer-cutedsl nvfp4 grouped GEMM to support FFN computation.
2. Add ModelOptNvFp4FusedMoECuteDSL to /quantization/nvfp4.py to support Nvfp4FusedMoE computation.
3. Add the test files tests/layer/test_cutedsl_moe.py and tests/layers/test_nvfp4_fusedmoe.py.
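To make the term "grouped masked GEMM" concrete, here is a minimal pure-Python reference sketch. It is not the flashinfer-cutedsl kernel and does not model NVFP4 quantization. It assumes an "NT" layout (activations of shape m x k, weights of shape n x k, output = A times B transposed) and a per-expert count masked_m of valid rows. The function name grouped_masked_gemm_ref and the choice to leave padded rows at zero are assumptions made for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def grouped_masked_gemm_ref(
    a: List[Matrix], b: List[Matrix], masked_m: List[int]
) -> List[Matrix]:
    """Reference for out[g] = a[g] @ b[g]^T, limited to the first masked_m[g] rows.

    a[g] is the (padded) activation matrix of expert g, shape (max_m, k).
    b[g] is the weight matrix of expert g, shape (n, k).
    Rows at index >= masked_m[g] are padding and are skipped; this sketch
    leaves them at zero (an assumption, a real kernel may leave them unwritten).
    """
    out = []
    for a_g, b_g, m_valid in zip(a, b, masked_m):
        n = len(b_g)
        out_g = [[0.0] * n for _ in a_g]
        for i in range(m_valid):  # only the valid rows of this expert
            for j in range(n):
                out_g[i][j] = sum(x * w for x, w in zip(a_g[i], b_g[j]))
        out.append(out_g)
    return out


# Two experts, max_m = 2, k = 2, n = 2; expert 1 has only one valid row.
a = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [9.0, 9.0]]]
b = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 3.0], [4.0, 5.0]]]
masked_m = [2, 1]
out = grouped_masked_gemm_ref(a, b, masked_m)
```

The mask lets every expert share one padded buffer of fixed shape while the work done per expert follows the number of tokens actually routed to it.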
Usage or Command
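Flashinfer and nvidia-cutlass-dsl have problems when used with Paddle. Before they can be imported, the installed packages under python3.10/site-packages/flashinfer and nvidia-dsl need to be modified.
Three issues were identified: one in nvidia-dsl and two in flashinfer.
1: In nvidia_cutlass_dsl/python_packages/cutlass/torch.py, change torch.device to "torch.device" (the quoted string form).
2: In flashinfer/utils.py, change the get_compute_capability function to: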
@functools.cache
def get_compute_capability(device: torch.device) -> Tuple[int, int]:
    # Patched: return immediately, skipping the device.type check below.
    return torch.cuda.get_device_capability(device)
    # Original body, unreachable after the early return above.
    if device.type != "cuda":
        raise ValueError("device must be a cuda device")
    return torch.cuda.get_device_capability(device.index)
Note: if you run into device-related errors, replacing A.place with A.device resolves most of them.
3: In flashinfer/cute_dsl/blockscaled_gemm.py:
First add import cuda.bindings.driver as cuda.
Then replace cutlass_torch.current_stream() with cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream).
Flashinfer-cutedsl nvfp4 grouped masked gemm operator test:
python tests/layer/test_cutedsl_moe.py
Decode error test:
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py TestFusedMoE.test_decode_correctness 2>&1
Prefill test:
NVFP4_TEST_MODE=prefill NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py
Decode test:
NVFP4_TEST_MODE=decode NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py
Accuracy Tests
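No accuracy numbers are given in this PR. As background for reading the decode error test, the sketch below shows where NVFP4 error comes from and one way it could be measured. It assumes the usual NVFP4 layout of 4-bit E2M1 values with one scale per block of 16 elements. The scale is kept in full precision here, whereas real NVFP4 stores it in FP8, and the hardware rounding mode may differ from the nearest-value rule used. The names fake_quant_nvfp4 and relative_l2_error are hypothetical, and this is not the metric used in tests/layers/test_nvfp4_fusedmoe.py.

```python
import math
from typing import List

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements that share one scale factor


def fake_quant_nvfp4(x: List[float]) -> List[float]:
    """Quantize to the E2M1 grid per block and dequantize again (simulation only)."""
    out = []
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 6.0, the largest E2M1 value.
        scale = amax / 6.0 if amax > 0 else 1.0
        for v in block:
            target = abs(v) / scale
            mag = min(E2M1_VALUES, key=lambda q: abs(target - q))
            out.append(math.copysign(mag * scale, v))
    return out


def relative_l2_error(ref: List[float], test: List[float]) -> float:
    """Return ||ref - test||_2 / ||ref||_2, or 0.0 when ref is all zeros."""
    num = sum((r - t) ** 2 for r, t in zip(ref, test))
    den = sum(r * r for r in ref)
    return math.sqrt(num / den) if den else 0.0


exact = [6.0, 3.0, -1.5, 0.0]   # every value already lies on the E2M1 grid
lossy = [6.0, 2.4]              # 2.4 is not on the grid
err_exact = relative_l2_error(exact, fake_quant_nvfp4(exact))
err_lossy = relative_l2_error(lossy, fake_quant_nvfp4(lossy))
```

A threshold on such a relative error is one plausible pass criterion for a correctness test. The appropriate tolerance depends on the model and is not specified here.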
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.