【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support #6941

Open
cloudforge1 wants to merge 10 commits into PaddlePaddle:develop from cloudforge1:task/045-t4-v100-compile-guards-part2

Conversation

Contributor

@cloudforge1 cloudforge1 commented Mar 19, 2026

Motivation

Task 45 requires FastDeploy's custom_ops to compile on T4 (SM75) and V100 (SM70) GPUs. Currently, cpp_extensions.cc registers all 115 ops unconditionally, which causes link errors when CUDA kernels that require SM80+ (MoE, MLA, speculative decoding, append attention) are absent from the build.

This PR adds conditional compilation guards to cpp_extensions.cc and corresponding macro definitions in setup_ops.py, gating SM80+ op bindings behind ENABLE_SM80_EXT_OPS, SM75+ ops behind ENABLE_SM75_EXT_OPS / ENABLE_SCALED_MM_C2X, and SM70's gelu_tanh behind DISABLE_GELU_TANH_OP.
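The gating scheme above can be summarized as a mapping from compute capability to the guard macros that get defined. The sketch below is illustrative only: macro names come from this PR's description and the TIERS table in the verification script, and the helper function itself is hypothetical, not code from the PR.

```python
# Illustrative helper: which of this PR's guard macros are defined for a
# given compute capability. Macro names are taken from the PR description;
# the function itself is a sketch, not code from setup_ops.py.

def pr_macros_for(cc: int, sm70_in_targets: bool = False) -> set:
    """Return the guard macros active at compute capability cc."""
    macros = set()
    if cc >= 75:
        # SM75+ gains the deepgemm permute ops and the scaled-mm C2X path
        macros |= {"ENABLE_SM75_EXT_OPS", "ENABLE_SCALED_MM_C2X"}
    if cc >= 80:
        # SM80+ unlocks MoE, MLA, speculative decoding, append attention, ...
        macros.add("ENABLE_SM80_EXT_OPS")
    if sm70_in_targets:
        # gelu_tanh is excluded only when SM70 is among the build targets
        macros.add("DISABLE_GELU_TANH_OP")
    return macros
```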

Modifications

cpp_extensions.cc (+28 lines)

14 guard blocks wrapping 70 of 115 ops:

Guard                  Blocks  Ops  Examples
ENABLE_SM80_EXT_OPS    11      62   MoE (fused_moe, moe_expert_ffn, moe_topk_select, …), MLA (multi_head_latent_attention, decode/prefill_mla_write_cache), speculative decoding (speculate_verify, speculate_update, …), append_attention, gqa_rope_write_cache, group_swiglu_with_masked, MoeWna16MarlinGemmApi
ENABLE_SM75_EXT_OPS    1       2    moe_deepgemm_permute, moe_deepgemm_depermute
ENABLE_SCALED_MM_C2X   1       5    cutlass_scaled_mm, cutlass_scaled_mm_azp, static/dynamic_scaled_fp8_quant
DISABLE_GELU_TANH_OP   1       1    gelu_tanh

The remaining 45 ops (per_token_quant, get_padding_offset, fused_rotary_position_encoding, noaux_tc, etc.) compile on all SM tiers and remain unguarded.

setup_ops.py (+19 lines, -1 line)

  1. ENABLE_SM75_EXT_OPS added to both cc_compile_args and nvcc_compile_args at cc >= 75 — also adds moe_deepgemm_permute.cu and moe_deepgemm_depermute.cu sources (these kernels have no BF16 dependency)
  2. ENABLE_SM80_EXT_OPS added to both cc_compile_args and nvcc_compile_args at cc >= 80
  3. DISABLE_GELU_TANH_OP added to both compile args when SM70 is among the target architectures; gelu_tanh.cu is also removed from the sources, since its tanh instructions require SM75 and do not compile for SM70
  4. sm_versions computed once and reused (avoids redundant get_sm_version() call)
  5. Source deduplication via dict.fromkeys() before setup() to prevent duplicate translation units from overlapping find_end_files() calls
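The setup_ops.py changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the identifiers `cc_compile_args`, `nvcc_compile_args`, and `find_end_files()` come from the PR description, while the helper function, its signature, and the source paths are hypothetical.

```python
# Sketch of the setup_ops.py gating logic described above. The function,
# its signature, and the .cu paths are illustrative, not the actual code.

def apply_sm_tier_guards(cc_compile_args, nvcc_compile_args, sources, sm_versions):
    """Append tier macros and adjust sources for the target SM versions."""
    max_cc = max(sm_versions)

    if max_cc >= 75:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DENABLE_SM75_EXT_OPS")
        # the deepgemm permute kernels have no BF16 dependency,
        # so they compile from SM75 up (paths are illustrative)
        sources += ["gpu_ops/moe/moe_deepgemm_permute.cu",
                    "gpu_ops/moe/moe_deepgemm_depermute.cu"]

    if max_cc >= 80:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DENABLE_SM80_EXT_OPS")

    if 70 in sm_versions:
        for args in (cc_compile_args, nvcc_compile_args):
            args.append("-DDISABLE_GELU_TANH_OP")
        # gelu_tanh.cu uses tanh instructions unavailable before SM75
        sources = [s for s in sources if not s.endswith("gelu_tanh.cu")]

    # deduplicate while preserving order, so overlapping find_end_files()
    # globs cannot yield duplicate translation units
    sources = list(dict.fromkeys(sources))
    return cc_compile_args, nvcc_compile_args, sources
```

The `dict.fromkeys()` trick relies on Python dicts preserving insertion order, so the first occurrence of each source file wins and the build order stays stable.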

Usage or Command

# Build for V100 (SM70) — gelu_tanh excluded, SM80 ops gated out
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for T4 (SM75) — SM80 ops gated out, gelu_tanh + deepgemm available
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for A100+ (SM80+) — all ops compiled, no guards active
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

Verification

Preprocessor simulation confirms correct op registration per SM tier:

Tier             Registered   Excluded
--------------------------------------
SM70 (V100)              38         77
SM75 (T4)                45         70
SM80 (A100)             109          6
SM89 (L4)               115          0
SM90 (H100)             115          0

#if*=18  #endif=18  ✓ balanced

T4 gains over V100 (7): cutlass_scaled_mm, cutlass_scaled_mm_azp,
  dynamic_per_token_scaled_fp8_quant, dynamic_scaled_fp8_quant,
  gelu_tanh, moe_deepgemm_permute, static_scaled_fp8_quant
Verification script (run from repo root)
"""verify_guards.py — Preprocessor simulation for cpp_extensions.cc compile guards.
Usage: python verify_guards.py [path/to/cpp_extensions.cc]
"""
import re, sys

path = sys.argv[1] if len(sys.argv) > 1 else "custom_ops/gpu_ops/cpp_extensions.cc"
lines = open(path).read().split("\n")

TIERS = {
    "SM70 (V100)": {"ENABLE_SM80_EXT_OPS": 0, "ENABLE_SM75_EXT_OPS": 0, "ENABLE_SCALED_MM_C2X": 0,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 0, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 1},
    "SM75 (T4)":   {"ENABLE_SM80_EXT_OPS": 0, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 0, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 0},
    "SM80 (A100)": {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 0, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 0, "DISABLE_GELU_TANH_OP": 0},
    "SM89 (L4)":   {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 1, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 1, "DISABLE_GELU_TANH_OP": 0},
    "SM90 (H100)": {"ENABLE_SM80_EXT_OPS": 1, "ENABLE_SM75_EXT_OPS": 1, "ENABLE_SCALED_MM_C2X": 1,
                     "ENABLE_FP8": 1, "ENABLE_FLASH_MASK_ATTENTION": 1, "ENABLE_MACHETE": 1, "DISABLE_GELU_TANH_OP": 0},
}

def simulate(macros):
    """Collect ops registered via m.def under the given macro set.
    Only #ifdef/#ifndef/#endif are simulated; #else/#elif are not handled."""
    active, stack, ops = True, [], []
    for line in lines:
        s = line.strip()
        if s.startswith("#ifdef "):
            stack.append(active); active = active and bool(macros.get(s.split()[1], 0))
        elif s.startswith("#ifndef "):
            stack.append(active); active = active and not bool(macros.get(s.split()[1], 0))
        elif s == "#endif" and stack:
            active = stack.pop()
        elif active:
            m = re.search(r'm\.def\("([^"]+)"', line)
            if m: ops.append(m.group(1))
    return ops

results = {t: simulate(m) for t, m in TIERS.items()}
full = results["SM90 (H100)"]

ifcount = sum(1 for l in lines if l.strip().startswith(('#ifdef','#ifndef')))
endif_count = sum(1 for l in lines if l.strip()=='#endif')

print(f"{'Tier':<16} {'Registered':>10} {'Excluded':>10}")
print("-" * 38)
for t, ops in results.items():
    print(f"{t:<16} {len(ops):>10} {len(full)-len(ops):>10}")

print(f"\n#if*={ifcount}  #endif={endif_count}  {'✓ balanced' if ifcount==endif_count else '✗ MISMATCH'}")

t4, v100 = set(results["SM75 (T4)"]), set(results["SM70 (V100)"])
extra = sorted(t4 - v100)
if extra: print(f"\nT4 gains over V100 ({len(extra)}): {', '.join(extra)}")

Accuracy Tests

Compile-time guards only — no runtime behavior change. On SM89+ GPUs, all 115 ops are compiled and registered exactly as before. On SM70/SM75/SM80, only ops whose kernels can compile at that SM tier are registered (graceful absence, not a crash).

Guard balance verified: 18 #if* directives = 18 #endif directives.

Checklist

  • Pre-commit checks pass (black, isort, flake8, ruff, clang-format)
  • Guards are balanced (#if/#endif pairs match)
  • ENABLE_SM75_EXT_OPS in both compile args + deepgemm sources at cc>=75
  • ENABLE_SM80_EXT_OPS in both cc_compile_args and nvcc_compile_args at cc>=80
  • DISABLE_GELU_TANH_OP in both compile args + source exclusion at SM70
  • Source deduplication prevents duplicate TU errors
  • No runtime behavior change on SM89+ GPUs
  • Preprocessor simulation verifies correct op counts per tier


paddle-bot bot commented Mar 19, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Mar 19, 2026
@cloudforge1 cloudforge1 force-pushed the task/045-t4-v100-compile-guards-part2 branch from 141b8e5 to 520b220 Compare March 19, 2026 20:10
@cloudforge1 cloudforge1 changed the title 【Hackathon 10th Spring No.45】[Build] Complete SM-tier compile guards for T4/V100 -part2 【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support Mar 19, 2026
…for T4/V100 -part2

Add compile guards for 12 ops missing from PR PaddlePaddle#6488:

SM80+ (ENABLE_SM80_EXT_OPS, 7 ops):
- prefill_permute_to_masked_gemm (moe/)
- depermute_prefill_combine (moe/)
- radix_topk_ragged_transform (sparse_indexer/)
- dsk_attn_write_cache (append_attn/)
- indexer_k_quant_and_cache (append_attn/)
- cp_gather_indexer_k_quant_cache (append_attn/)
- per_token_group_fp8_quant (sparse_indexer/)

SM75+ (ENABLE_SCALED_MM_C2X, 5 ops):
- cutlass_scaled_mm (w8a8/)
- cutlass_scaled_mm_azp (w8a8/)
- static_scaled_fp8_quant (quantization/)
- dynamic_scaled_fp8_quant (quantization/)
- dynamic_per_token_scaled_fp8_quant (quantization/)

Also defines -DENABLE_SM80_EXT_OPS=1 in setup_ops.py at cc>=80,
which is required by both this PR and PR PaddlePaddle#6488.
@cloudforge1 cloudforge1 force-pushed the task/045-t4-v100-compile-guards-part2 branch from 520b220 to 8f74ea3 Compare March 19, 2026 20:25
@cloudforge1
Contributor Author

Aware of PR #6488 which targets the same task. This PR takes a lighter approach (+47 lines vs +73) with a smaller guard surface. Happy to defer to whichever implementation the maintainers prefer — this PR is conflict-free against current develop.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@f4a79d4). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6941   +/-   ##
==========================================
  Coverage           ?   73.46%           
==========================================
  Files              ?      399           
  Lines              ?    55620           
  Branches           ?     8766           
==========================================
  Hits               ?    40861           
  Misses             ?    11852           
  Partials           ?     2907           
Flag Coverage Δ
GPU 73.46% <ø> (?)

Collaborator

luotao1 commented Mar 20, 2026

@mitu626
