【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support #6941

Open
cloudforge1 wants to merge 10 commits into PaddlePaddle:develop from cloudforge1:task/045-t4-v100-compile-guards-part2

Conversation

Contributor

@cloudforge1 cloudforge1 commented Mar 19, 2026

Motivation

Task 45 requires FastDeploy's custom_ops to compile on T4 (SM75) and V100 (SM70) GPUs. Currently, cpp_extensions.cc registers all 115 ops unconditionally, which causes link errors when CUDA kernels that require SM80+ (MoE, MLA, speculative decoding, append attention) are absent from the build.

This PR adds conditional compilation guards to cpp_extensions.cc and corresponding macro definitions in setup_ops.py, gating SM80+ op bindings behind ENABLE_SM80_EXT_OPS, SM75+ ops behind ENABLE_SM75_EXT_OPS / ENABLE_SCALED_MM_C2X, and SM70's gelu_tanh behind DISABLE_GELU_TANH_OP.
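The gating scheme above can be summarized as a mapping from compute capability to the guard macros that get defined. The sketch below is illustrative only: macro names come from this PR's description and the TIERS table in the verification script, and the helper function itself is hypothetical, not code from the PR.

```python
# Illustrative helper: which of this PR's guard macros are defined for a
# given compute capability. Macro names are taken from the PR description;
# the function itself is a sketch, not code from setup_ops.py.

def pr_macros_for(cc: int, sm70_in_targets: bool = False) -> set:
    """Return the guard macros active at compute capability cc."""
    macros = set()
    if cc >= 75:
        # SM75+ gains the deepgemm permute ops and the scaled-mm C2X path
        macros |= {"ENABLE_SM75_EXT_OPS", "ENABLE_SCALED_MM_C2X"}
    if cc >= 80:
        # SM80+ unlocks MoE, MLA, speculative decoding, append attention, ...
        macros.add("ENABLE_SM80_EXT_OPS")
    if sm70_in_targets:
        # gelu_tanh is excluded only when SM70 is among the build targets
        macros.add("DISABLE_GELU_TANH_OP")
    return macros
```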

Modifications

cpp_extensions.cc (+28 lines)

14 guard blocks wrapping 70 of 115 ops:

Guard                  Blocks  Ops  Examples
ENABLE_SM80_EXT_OPS    11      62   MoE (fused_moe, moe_expert_ffn, moe_topk_select, …), MLA (multi_head_latent_attention, decode/prefill_mla_write_cache), speculative decoding (speculate_verify, speculate_update, …), append_attention, gqa_rope_write_cache, group_swiglu_with_masked, MoeWna16MarlinGemmApi
ENABLE_SM75_EXT_OPS    1       2    moe_deepgemm_permute, moe_deepgemm_depermute
ENABLE_SCALED_MM_C2X   1       5    cutlass_scaled_mm, cutlass_scaled_mm_azp, static/dynamic_scaled_fp8_quant
DISABLE_GELU_TANH_OP   1       1    gelu_tanh

The remaining 45 ops (per_token_quant, get_padding_offset, fused_rotary_position_encoding, noaux_tc, etc.) compile on all SM tiers and remain unguarded.

setup_ops.py (+19 lines, -1 line)

  1. ENABLE_SM75_EXT_OPS added to both cc_compile_args and nvcc_compile_args at cc >= 75 — also adds moe_deepgemm_permute.cu and moe_deepgemm_depermute.cu sources (these kernels have no BF16 dependency)
  2. ENABLE_SM80_EXT_OPS added to both cc_compile_args and nvcc_compile_args at cc >= 80
  3. DISABLE_GELU_TANH_OP added to both compile args when SM70 is among the target architectures; gelu_tanh.cu is also removed from the sources, since its tanh instructions require SM75 and do not compile for SM70
  4. sm_versions computed once and reused (avoids redundant get_sm_version() call)
  5. Source deduplication via dict.fromkeys() before setup() to prevent duplicate translation units from overlapping find_end_files() calls
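The setup_ops.py changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the identifiers `cc_compile_args`, `nvcc_compile_args`, and `find_end_files()` come from the PR description, while the helper function, its signature, and the source paths are hypothetical.

```python
# Sketch of the setup_ops.py gating logic described above. The function,
# its signature, and the .cu paths are illustrative, not the actual code.

def apply_sm_tier_guards(cc_compile_args, nvcc_compile_args, sources, sm_versions):
    """Append tier macros and adjust sources for the target SM versions."""
    max_cc = max(sm_versions)

    if max_cc >= 75:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DENABLE_SM75_EXT_OPS")
        # the deepgemm permute kernels have no BF16 dependency,
        # so they compile from SM75 up (paths are illustrative)
        sources += ["gpu_ops/moe/moe_deepgemm_permute.cu",
                    "gpu_ops/moe/moe_deepgemm_depermute.cu"]

    if max_cc >= 80:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DENABLE_SM80_EXT_OPS")

    if 70 in sm_versions:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DDISABLE_GELU_TANH_OP")
        # gelu_tanh.cu uses tanh instructions unavailable before SM75
        sources = [s for s in sources if not s.endswith("gelu_tanh.cu")]

    # deduplicate while preserving order, so overlapping find_end_files()
    # globs cannot yield duplicate translation units
    sources = list(dict.fromkeys(sources))
    return cc_compile_args, nvcc_compile_args, sources
```

The `dict.fromkeys()` trick relies on Python dicts preserving insertion order, so the first occurrence of each source file wins and the build order stays stable.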

Usage or Command

# Build for V100 (SM70) — gelu_tanh excluded, SM80 ops gated out
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for T4 (SM75) — SM80 ops gated out, gelu_tanh + deepgemm available
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for A100+ (SM80+) — all ops compiled, no guards active
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

Verification

Preprocessor simulation confirms correct op registration per SM tier:

Tier             Registered   Excluded
--------------------------------------
SM70 (V100)              38         77
SM75 (T4)                45         70
SM80 (A100)             109          6
SM89 (L4)               115          0
SM90 (H100)             115          0

#if*=18  #endif=18  ✓ balanced

T4 gains over V100 (7): cutlass_scaled_mm, cutlass_scaled_mm_azp,
  dynamic_per_token_scaled_fp8_quant, dynamic_scaled_fp8_quant,
  gelu_tanh, moe_deepgemm_permute, static_scaled_fp8_quant
Verification script (run from repo root)
"""verify_guards.py — Preprocessor simulation for cpp_extensions.cc compile guards.
Usage: python verify_guards.py [path/to/cpp_extensions.cc]
"""
import re, sys

path = sys.argv[1] if len(sys.argv) > 1 else "custom_ops/gpu_ops/cpp_extensions.cc"
lines = open(path).read().split("\n")

TIERS = {
    "SM70 (V100)": {"ENABLE_SM80_EXT_OPS": 0, "ENABLE_SM75_EXT_OPS": 0, "ENABLE_SCALED_MM_C2X": 0,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 0, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 1},
    "SM75 (T4)":   {"ENABLE_SM80_EXT_OPS": 0, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 0, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 0},
    "SM80 (A100)": {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 0},
    "SM89 (L4)":   {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 1, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 1, "DISABLE_GELU_TANH_OP": 0},
    "SM90 (H100)": {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 1, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 1, "DISABLE_GELU_TANH_OP": 0},
}

def simulate(macros):
    """Collect ops registered via m.def under the given macro set.
    Only #ifdef/#ifndef/#endif are simulated; #else/#elif are not handled."""
    active, stack, ops = True, [], []
    for line in lines:
        s = line.strip()
        if s.startswith("#ifdef "):
            stack.append(active); active = active and bool(macros.get(s.split()[1], 0))
        elif s.startswith("#ifndef "):
            stack.append(active); active = active and not bool(macros.get(s.split()[1], 0))
        elif s == "#endif" and stack:
            active = stack.pop()
        elif active:
            m = re.search(r'm\.def\("([^"]+)"', line)
            if m: ops.append(m.group(1))
    return ops

results = {t: simulate(m) for t, m in TIERS.items()}
full = results["SM90 (H100)"]

ifcount = sum(1 for l in lines if l.strip().startswith(('#ifdef','#ifndef')))
endif_count = sum(1 for l in lines if l.strip()=='#endif')

print(f"{'Tier':<16} {'Registered':>10} {'Excluded':>10}")
print("-" * 38)
for t, ops in results.items():
    print(f"{t:<16} {len(ops):>10} {len(full)-len(ops):>10}")

print(f"\n#if*={ifcount}  #endif={endif_count}  {'✓ balanced' if ifcount==endif_count else '✗ MISMATCH'}")

t4, v100 = set(results["SM75 (T4)"]), set(results["SM70 (V100)"])
extra = sorted(t4 - v100)
if extra: print(f"\nT4 gains over V100 ({len(extra)}): {', '.join(extra)}")

Accuracy Tests

Compile-time guards only — no runtime behavior change. On SM89+ GPUs, all 115 ops are compiled and registered exactly as before. On SM70/SM75/SM80, only ops whose kernels can compile at that SM tier are registered (graceful absence, not a crash).

Guard balance verified: 18 #if* directives = 18 #endif directives.

Checklist

  • Pre-commit checks pass (black, isort, flake8, ruff, clang-format)
  • Guards are balanced (#if/#endif pairs match)
  • ENABLE_SM75_EXT_OPS in both compile args + deepgemm sources at cc>=75
  • ENABLE_SM80_EXT_OPS in both cc_compile_args and nvcc_compile_args at cc>=80
  • DISABLE_GELU_TANH_OP in both compile args + source exclusion at SM70
  • Source deduplication prevents duplicate TU errors
  • No runtime behavior change on SM89+ GPUs
  • Preprocessor simulation verifies correct op counts per tier


paddle-bot bot commented Mar 19, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Mar 19, 2026
@cloudforge1 cloudforge1 force-pushed the task/045-t4-v100-compile-guards-part2 branch from 141b8e5 to 520b220 Compare March 19, 2026 20:10
@cloudforge1 cloudforge1 changed the title 【Hackathon 10th Spring No.45】[Build] Complete SM-tier compile guards for T4/V100 -part2 【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support Mar 19, 2026
…for T4/V100 -part2

Add compile guards for 12 ops missing from PR PaddlePaddle#6488:

SM80+ (ENABLE_SM80_EXT_OPS, 7 ops):
- prefill_permute_to_masked_gemm (moe/)
- depermute_prefill_combine (moe/)
- radix_topk_ragged_transform (sparse_indexer/)
- dsk_attn_write_cache (append_attn/)
- indexer_k_quant_and_cache (append_attn/)
- cp_gather_indexer_k_quant_cache (append_attn/)
- per_token_group_fp8_quant (sparse_indexer/)

SM75+ (ENABLE_SCALED_MM_C2X, 5 ops):
- cutlass_scaled_mm (w8a8/)
- cutlass_scaled_mm_azp (w8a8/)
- static_scaled_fp8_quant (quantization/)
- dynamic_scaled_fp8_quant (quantization/)
- dynamic_per_token_scaled_fp8_quant (quantization/)

Also defines -DENABLE_SM80_EXT_OPS=1 in setup_ops.py at cc>=80,
which is required by both this PR and PR PaddlePaddle#6488.
@cloudforge1 cloudforge1 force-pushed the task/045-t4-v100-compile-guards-part2 branch from 520b220 to 8f74ea3 Compare March 19, 2026 20:25
@cloudforge1
Contributor Author

Aware of PR #6488 which targets the same task. This PR takes a lighter approach (+47 lines vs +73) with a smaller guard surface. Happy to defer to whichever implementation the maintainers prefer — this PR is conflict-free against current develop.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@f4a79d4). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6941   +/-   ##
==========================================
  Coverage           ?   73.46%           
==========================================
  Files              ?      399           
  Lines              ?    55620           
  Branches           ?     8766           
==========================================
  Hits               ?    40861           
  Misses             ?    11852           
  Partials           ?     2907           
Flag Coverage Δ
GPU 73.46% <ø> (?)

Collaborator

luotao1 commented Mar 20, 2026

@mitu626
