
Support Flashinfer-cutedsl nvfp4 grouped masked gemm#6924

Open
mpgemm wants to merge 11 commits into PaddlePaddle:develop from mpgemm:develop

Conversation


@mpgemm mpgemm commented Mar 18, 2026

Motivation

Add support for flashinfer-cutedsl NVFP4 FusedMoE computation.

Modifications

1. Added /moe/flashinfer_cutedsl_moe.py, which wraps the flashinfer-cutedsl nvfp4 grouped GEMM to support the FFN computation.

2. Added ModelOptNvFp4FusedMoECuteDSL in /quantization/nvfp4.py to support NVFP4 FusedMoE computation.

3. Added test files tests/layer/test_cutedsl_moe.py and tests/layers/test_nvfp4_fusedmoe.py.
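For reviewers unfamiliar with the operator, here is a minimal NumPy sketch of what a grouped *masked* GEMM computes in an MoE FFN: each expert has its own weight matrix, and only the first `masked_m[e]` token rows of expert `e`'s input tile are valid. The function name, parameter names, and shapes below are illustrative assumptions, not the kernel's actual API.

```python
import numpy as np

def grouped_masked_gemm_ref(x, w, masked_m):
    """Reference semantics of a grouped masked GEMM.

    x: [E, M, K] per-expert input tiles (row-padded to M tokens)
    w: [E, K, N] per-expert weight matrices
    masked_m: [E] number of valid token rows for each expert
    Returns out: [E, M, N]; rows beyond masked_m[e] are left as zero.
    """
    E, M, K = x.shape
    N = w.shape[2]
    out = np.zeros((E, M, N), dtype=x.dtype)
    for e in range(E):
        m = masked_m[e]
        # Only the valid rows of expert e participate in the GEMM.
        out[e, :m] = x[e, :m] @ w[e]
    return out
```

The real kernel fuses this loop on-device and additionally handles NVFP4 block scaling; the sketch only pins down the masking semantics.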

Usage or Command

There are compatibility issues between Paddle, Flashinfer, and nvidia-cutlass-dsl; to import them you currently need to patch python3.10/site-packages/flashinfer and nvidia-dsl.
Three issues in total: one in nvidia-dsl and two in flashinfer.

1. In nvidia_cutlass_dsl/python_packages/cutlass/torch.py, change the torch.device annotation to the string "torch.device".

2. In flashinfer/utils.py, replace get_compute_capability with:

```python
@functools.cache
def get_compute_capability(device: torch.device) -> Tuple[int, int]:
    return torch.cuda.get_device_capability(device)
```

This drops the original body, which raised ValueError("device must be a cuda device") for non-cuda devices and returned torch.cuda.get_device_capability(device.index).
Note: if you run into device-related problems, replacing A.place with A.device resolves most of them.

3. In flashinfer/cute_dsl/blockscaled_gemm.py, first add import cuda.bindings.driver as cuda, then replace cutlass_torch.current_stream() with cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream).
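The A.place vs A.device note in step 2 can be captured in a small compatibility shim instead of editing each call site; `tensor_device` below is a hypothetical helper name, not part of either library.

```python
def tensor_device(t):
    """Return the torch-style device of t if it has one, otherwise
    fall back to Paddle's .place attribute (or None if neither exists)."""
    dev = getattr(t, "device", None)
    return dev if dev is not None else getattr(t, "place", None)
```

A helper like this lets the same code path accept both Paddle tensors (which expose .place) and torch-style tensors (which expose .device).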


Flashinfer-cutedsl nvfp4 grouped masked GEMM operator test: python tests/layer/test_cutedsl_moe.py

Decode accuracy test: python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py TestFusedMoE.test_decode_correctness 2>&1

Prefill test: NVFP4_TEST_MODE=prefill NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py

Decode test: NVFP4_TEST_MODE=decode NVFP4_TEST_ITERS=2 python -m paddle.distributed.launch --gpus 0,1,2,3,4,5,6,7 tests/layers/test_nvfp4_fusedmoe.py

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please state the reason in this PR if there are no unit tests.
  • Provide accuracy results.
  • If the current PR is submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


paddle-bot bot commented Mar 18, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Mar 18, 2026
Author

@mpgemm mpgemm left a comment


Supports flashinfer-cutedsl nvfp4 fusedmoe; prefill still has issues.

zhoutianzi666 previously approved these changes Mar 19, 2026
```python
    raise ValueError(f"Unsupported cute dtype {input.dtype}")


def flashinfer_cutedsl_moe_masked(
```
Collaborator


You wrote this function, but why isn't it used inside apply?


Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants