【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support #6941
Open
cloudforge1 wants to merge 10 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
cloudforge1 force-pushed from 141b8e5 to 520b220
…for T4/V100 - part 2

Add compile guards for 12 ops missing from PR PaddlePaddle#6488:

SM80+ (`ENABLE_SM80_EXT_OPS`, 7 ops):
- prefill_permute_to_masked_gemm (moe/)
- depermute_prefill_combine (moe/)
- radix_topk_ragged_transform (sparse_indexer/)
- dsk_attn_write_cache (append_attn/)
- indexer_k_quant_and_cache (append_attn/)
- cp_gather_indexer_k_quant_cache (append_attn/)
- per_token_group_fp8_quant (sparse_indexer/)

SM75+ (`ENABLE_SCALED_MM_C2X`, 5 ops):
- cutlass_scaled_mm (w8a8/)
- cutlass_scaled_mm_azp (w8a8/)
- static_scaled_fp8_quant (quantization/)
- dynamic_scaled_fp8_quant (quantization/)
- dynamic_per_token_scaled_fp8_quant (quantization/)

Also defines `-DENABLE_SM80_EXT_OPS=1` in setup_ops.py at cc>=80, which is required by both this PR and PR PaddlePaddle#6488.
cloudforge1 force-pushed from 520b220 to 8f74ea3
Contributor
Author
I'm aware of PR #6488, which targets the same task. This PR takes a lighter approach (+47 lines vs. +73) with a smaller guard surface. Happy to defer to whichever implementation the maintainers prefer; this PR is conflict-free against the current `develop`.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## develop #6941 +/- ##
==========================================
Coverage ? 73.46%
==========================================
Files ? 399
Lines ? 55620
Branches ? 8766
==========================================
Hits ? 40861
Misses ? 11852
Partials ? 2907
Motivation
Task 45 requires FastDeploy's `custom_ops` to compile on T4 (SM75) and V100 (SM70) GPUs. Currently, `cpp_extensions.cc` registers all 115 ops unconditionally, causing link errors when SM80+-only CUDA kernels (MoE, MLA, speculative decoding, append attention) are absent from the build.

This PR adds conditional compilation guards to `cpp_extensions.cc` and corresponding macro definitions in `setup_ops.py`, gating SM80+ op bindings behind `ENABLE_SM80_EXT_OPS`, SM75+ ops behind `ENABLE_SM75_EXT_OPS`/`ENABLE_SCALED_MM_C2X`, and SM70's `gelu_tanh` behind `DISABLE_GELU_TANH_OP`.

Modifications
`cpp_extensions.cc` (+28 lines): 14 guard blocks wrapping 70 of the 115 ops, under four macros:

- `ENABLE_SM80_EXT_OPS`
- `ENABLE_SM75_EXT_OPS`
- `ENABLE_SCALED_MM_C2X`
- `DISABLE_GELU_TANH_OP`

The remaining 45 ops (`per_token_quant`, `get_padding_offset`, `fused_rotary_position_encoding`, `noaux_tc`, etc.) compile on all SM tiers and remain unguarded.
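The guard semantics can be simulated in plain Python. This is a hypothetical sketch: the macro names come from this PR, but the op groupings below are abbreviated for illustration.

```python
# Hypothetical simulation of the #ifdef guard logic in cpp_extensions.cc.
# Only the macro names are taken from the PR; op lists here are abbreviated.

def macros_for_cc(cc: int) -> set[str]:
    """Mimic setup_ops.py: which guard macros get defined at a compute capability."""
    macros = set()
    if cc >= 75:
        macros.add("ENABLE_SM75_EXT_OPS")
        macros.add("ENABLE_SCALED_MM_C2X")
    if cc >= 80:
        macros.add("ENABLE_SM80_EXT_OPS")
    if cc == 70:
        macros.add("DISABLE_GELU_TANH_OP")
    return macros

def registered_ops(macros: set[str]) -> set[str]:
    """Mimic the guard blocks: an op is bound only if its guard macro allows it."""
    ops = {"per_token_quant", "get_padding_offset"}   # unguarded, all SM tiers
    if "ENABLE_SM80_EXT_OPS" in macros:
        ops.add("dsk_attn_write_cache")               # one of the SM80+ group
    if "ENABLE_SCALED_MM_C2X" in macros:
        ops.add("cutlass_scaled_mm")                  # one of the SM75+ group
    if "DISABLE_GELU_TANH_OP" not in macros:
        ops.add("gelu_tanh")                          # excluded only on SM70
    return ops
```

On SM70 only the unguarded ops register; on SM80+ every group is present, matching the "graceful absence" behavior described below.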
`setup_ops.py` (+19 lines, -1 line):

- `ENABLE_SM75_EXT_OPS` added to both `cc_compile_args` and `nvcc_compile_args` at cc >= 75; also adds the `moe_deepgemm_permute.cu` and `moe_deepgemm_depermute.cu` sources (these kernels have no BF16 dependency)
- `ENABLE_SM80_EXT_OPS` added to both `cc_compile_args` and `nvcc_compile_args` at cc >= 80
- `DISABLE_GELU_TANH_OP` added to both compile args when SM70 is among the target architectures; also removes `gelu_tanh.cu` from the sources to avoid compiling Tanh instructions that require SM75+
- `sm_versions` computed once and reused (avoids a redundant `get_sm_version()` call)
- sources deduplicated with `dict.fromkeys()` before `setup()` to prevent duplicate translation units from overlapping `find_end_files()` calls

Usage or Command
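A minimal sketch of the `setup_ops.py` logic described above. The macro names and the `dict.fromkeys()` deduplication are from the PR; the function names and surrounding structure here are hypothetical.

```python
# Hypothetical sketch of the setup_ops.py changes; not the actual build script.

def build_guard_macros(sm_versions: list[int]) -> list[str]:
    """Decide which -D guard flags to add, given the target SM versions.
    sm_versions is computed once and reused, as in the PR."""
    args = []
    if any(cc >= 75 for cc in sm_versions):
        args.append("-DENABLE_SM75_EXT_OPS=1")
    if any(cc >= 80 for cc in sm_versions):
        args.append("-DENABLE_SM80_EXT_OPS=1")
    if 70 in sm_versions:
        args.append("-DDISABLE_GELU_TANH_OP=1")
    return args

def dedupe_sources(sources: list[str]) -> list[str]:
    """Overlapping find_end_files() globs can list the same .cu file twice;
    dict.fromkeys() drops duplicates while preserving order."""
    return list(dict.fromkeys(sources))
```

Order-preserving deduplication matters here: reordering translation units could change the build, so a plain `set()` would not be a safe substitute.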
Verification
Preprocessor simulation confirms correct op registration per SM tier:
Verification script (run from repo root)
Accuracy Tests
Compile-time guards only — no runtime behavior change. On SM89+ GPUs, all 115 ops are compiled and registered exactly as before. On SM70/SM75/SM80, only ops whose kernels can compile at that SM tier are registered (graceful absence, not a crash).
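Guard balance of this kind can be checked mechanically. Below is a hypothetical helper, not part of this PR, that counts opening `#if*` directives against `#endif` in a source string:

```python
import re

def guard_balance(source: str) -> tuple[int, int]:
    """Count C preprocessor conditional opens (#if/#ifdef/#ifndef)
    and closes (#endif) in a source string."""
    opens = len(re.findall(r"^\s*#\s*if(?:n?def)?\b", source, flags=re.MULTILINE))
    closes = len(re.findall(r"^\s*#\s*endif\b", source, flags=re.MULTILINE))
    return opens, closes

# Tiny sample mimicking the guarded binding style in cpp_extensions.cc:
sample = """\
#ifdef ENABLE_SM80_EXT_OPS
  // SM80+-only bindings go here
#endif
#ifndef DISABLE_GELU_TANH_OP
  // gelu_tanh binding goes here
#endif
"""
```

Equal counts are necessary but not sufficient for correct nesting; an actual compile of the file remains the authoritative check.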
Guard balance verified: 18 `#if*` directives match 18 `#endif` directives.

Checklist
- `ENABLE_SM75_EXT_OPS` in both compile args + deepgemm sources at cc>=75
- `ENABLE_SM80_EXT_OPS` in both `cc_compile_args` and `nvcc_compile_args` at cc>=80
- `DISABLE_GELU_TANH_OP` in both compile args + source exclusion at SM70