[TRTLLM-10927][perf] Use NCCL LSA Barrier to implement synchronization for NVLinkOneSided AlltoAll kernels. #11366
bobboli wants to merge 5 commits into NVIDIA:main
Conversation
Force-pushed from 456e107 to 183a17e (Compare)
📝 Walkthrough

Replaced flag-based synchronization with NCCL LSA barrier-based synchronization in MOE All-to-All kernels. Removed `flag_val` and `completion_flags` parameters from kernel signatures. Updated metadata structures to eliminate flag-related offset indices. Modified Python bindings and test verification to reflect the new synchronization approach.
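As a host-side analogy only (not the actual on-GPU NCCL LSA device API this PR uses), the shift the walkthrough describes can be sketched with plain C++ threads: each "rank" publishes its shard, then one collective barrier wait replaces polling a per-peer completion flag. `Barrier` and `runRanks` are illustrative names, not code from the PR.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (host-side stand-in for the NCCL LSA barrier).
class Barrier
{
public:
    explicit Barrier(int n)
        : count_(n)
    {
    }

    void arriveAndWait()
    {
        std::unique_lock<std::mutex> lk(m_);
        int gen = generation_;
        if (++waiting_ == count_)
        {
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        }
        else
        {
            cv_.wait(lk, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
    int waiting_{0};
    int generation_{0};
};

// Each "rank" publishes its shard, then a single collective wait guarantees all
// shards are visible -- no per-peer completion-flag polling needed.
int runRanks(int nRanks)
{
    std::atomic<int> written{0};
    std::atomic<int> minSeen{nRanks};
    Barrier barrier(nRanks);
    std::vector<std::thread> threads;
    for (int r = 0; r < nRanks; ++r)
    {
        threads.emplace_back(
            [&]
            {
                written.fetch_add(1);    // "write my shard"
                barrier.arriveAndWait(); // all writes ordered before any read
                int seen = written.load();
                int cur = minSeen.load();
                while (seen < cur && !minSeen.compare_exchange_weak(cur, seen)) {}
            });
    }
    for (auto& t : threads)
    {
        t.join();
    }
    return minSeen.load(); // == nRanks iff every rank observed every write
}
```

The barrier collapses N per-peer flag checks into one wait, which is the same structural simplification the kernel-side change makes.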
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents

In `cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`:
- Around lines 138-151: the global communicator `g_moeDevComm` may be reused with mismatched `epSize`/`epRank`. Modify `moeA2AInitializeOp` to validate that an existing `g_moeDevComm` was created with the same `epSize`/`epRank` before reusing it (call sites: the `if (g_moeDevComm == nullptr)` block and the early-reuse path). If the parameters differ, either log an error and abort, or recreate the communicator by destroying the old `g_moeDevComm` and calling `create_moe_nccl_dev_comm(static_cast<int>(epSize), static_cast<int>(epRank), /*num_lsa_barriers=*/2)`. Ensure the check compares the stored communicator's configuration (adding fields or accessors to the `g_moeDevComm` metadata if needed) and include a clear diagnostic message giving the expected vs. actual `epSize`/`epRank`.
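A minimal host-only sketch of the validation the comment asks for. `MoeDevComm`, `sComm`, and `getOrCreateComm` are placeholders for illustration, not the real `g_moeDevComm`/NCCL structures:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Placeholder for the communicator metadata; epSize/epRank model the fields
// (or accessors) the review suggests storing alongside g_moeDevComm.
struct MoeDevComm
{
    int epSize;
    int epRank;
};

static std::unique_ptr<MoeDevComm> sComm; // stands in for g_moeDevComm

MoeDevComm* getOrCreateComm(int epSize, int epRank)
{
    if (sComm)
    {
        // Reuse only when the cached communicator matches the requested config.
        if (sComm->epSize != epSize || sComm->epRank != epRank)
        {
            throw std::runtime_error("communicator config mismatch: expected epSize="
                + std::to_string(sComm->epSize) + " epRank=" + std::to_string(sComm->epRank)
                + ", got epSize=" + std::to_string(epSize) + " epRank=" + std::to_string(epRank));
        }
        return sComm.get();
    }
    // Real code would call create_moe_nccl_dev_comm(epSize, epRank, /*num_lsa_barriers=*/2) here.
    sComm = std::make_unique<MoeDevComm>(MoeDevComm{epSize, epRank});
    return sComm.get();
}
```

The alternative the comment mentions (destroy-and-recreate on mismatch) would replace the `throw` with teardown plus a fresh creation call.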
🧹 Nitpick comments (7)
`cpp/tensorrt_llm/thop/moeAlltoAllMeta.h` (1)

17-17: Header guard style: `#pragma once` vs. `TRTLLM_`-prefixed include guard.

The coding guidelines specify using a preprocessor guard prefixed with `TRTLLM_` followed by the filename in uppercase for header files. This file uses `#pragma once` instead. Since this was pre-existing and not part of this PR's changes, it's a minor note.

As per coding guidelines, "Use a preprocessor guard with prefix `TRTLLM_` followed by the filename in uppercase (e.g., `TRTLLM_FOO_BAR_HELLO_H`)."

`cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h` (1)

23-26: Forward declarations for NCCL types look good.

The forward declarations avoid pulling the full NCCL headers into this widely-included header. The `ncclComm_t` typedef matches the NCCL convention. Note that `ncclComm_t` is only needed for the `destroy_moe_nccl_dev_comm` signature (which takes `ncclDevComm*`, not `ncclComm_t`), so technically line 25 is unused in this header.

```bash
#!/bin/bash
# Check if ncclComm_t is actually used in the header file
rg -n 'ncclComm_t' cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
```

`cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu` (3)
1219-1226: Static variable naming: should use the `s` prefix per coding guidelines.

`g_moeNcclComm` is declared `static`, so per the naming convention it should use the `s` prefix (e.g., `sMoeNcclComm`). The `g` prefix is reserved for non-static, non-anonymous-namespace globals.

As per coding guidelines, "Use camelCase prefixed by lowercase 's' for non-magic-number global variables that are static or in anonymous namespaces (e.g., `sMutableStaticGlobal`)."
1259-1276: Missing CUDA error checking in `destroy_moe_nccl_dev_comm`.

Lines 1265 and 1274 use `cudaMemcpy` and `cudaFree` without error checking, while `create_moe_nccl_dev_comm` consistently uses `TLLM_CUDA_CHECK`. Silently ignoring errors during cleanup can mask bugs.

Proposed fix: add error checking

```diff
 if (dev_comm != nullptr)
 {
     // Copy back to host for ncclDevCommDestroy
     ncclDevComm hostDevComm;
-    cudaMemcpy(&hostDevComm, dev_comm, sizeof(ncclDevComm), cudaMemcpyDeviceToHost);
+    TLLM_CUDA_CHECK(cudaMemcpy(&hostDevComm, dev_comm, sizeof(ncclDevComm), cudaMemcpyDeviceToHost));
     if (g_moeNcclComm != nullptr)
     {
         ncclDevCommDestroy(g_moeNcclComm, &hostDevComm);
         ncclCommDestroy(g_moeNcclComm);
         g_moeNcclComm = nullptr;
     }
-    cudaFree(dev_comm);
+    TLLM_CUDA_CHECK(cudaFree(dev_comm));
 }
```
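The check-macro pattern behind `TLLM_CUDA_CHECK` can be sketched generically. `FakeCudaError_t` and `fakeCudaFree` below are stand-ins, not the CUDA runtime, and the failure rule is contrived purely so the demo can exercise the error path (the real `cudaFree(nullptr)` actually succeeds):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins for the CUDA runtime error type and lookup.
enum FakeCudaError_t
{
    fakeSuccess = 0,
    fakeErrorInvalidValue = 1
};

char const* fakeErrorString(FakeCudaError_t e)
{
    return e == fakeSuccess ? "no error" : "invalid value";
}

// Check-and-throw wrapper: evaluate the call once, surface file/line on
// failure instead of silently dropping the error code.
#define CHECK_CUDA(call)                                                                           \
    do                                                                                             \
    {                                                                                              \
        FakeCudaError_t err_ = (call);                                                             \
        if (err_ != fakeSuccess)                                                                   \
        {                                                                                          \
            throw std::runtime_error(std::string("CUDA error at ") + __FILE__ + ":"                \
                + std::to_string(__LINE__) + ": " + fakeErrorString(err_));                        \
        }                                                                                          \
    } while (0)

// Contrived: fail on nullptr so the demo has an error path to hit.
FakeCudaError_t fakeCudaFree(void* p)
{
    return p == nullptr ? fakeErrorInvalidValue : fakeSuccess;
}
```

The `do { } while (0)` wrapper keeps the macro usable as a single statement, e.g. inside an unbraced `if`.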
1227-1257: `create_moe_nccl_dev_comm` overwrites `g_moeNcclComm` without checking for an existing comm.

If `create_moe_nccl_dev_comm` is called more than once (e.g., due to a missing guard at the call site), the existing NCCL communicator in `g_moeNcclComm` would be leaked. The caller in `moeAlltoAllOp.cpp` guards with `if (g_moeDevComm == nullptr)`, but adding a defensive check here would make the function safer.

Defensive guard

```diff
 ncclDevComm* create_moe_nccl_dev_comm(int ep_size, int ep_rank, int num_lsa_barriers)
 {
+    TLLM_CHECK_WITH_INFO(g_moeNcclComm == nullptr,
+        "NCCL communicator already exists; destroy it before creating a new one");
     // 1. Create NCCL communicator via MPI bootstrap (collective call)
```

`cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp` (2)
28-29: Redundant forward declaration.

`ncclDevComm` is already forward-declared in `moeAlltoAllKernels.h` (line 26), which is included at line 18. This forward declaration is harmless but redundant.
107-108: Static variable naming and resource management. Two notes:

Naming: `g_moeDevComm` is `static`, so per coding guidelines it should use the `s` prefix: `sMoeDevComm`.

Resource leak: `g_moeDevComm` is allocated via `create_moe_nccl_dev_comm`, but `destroy_moe_nccl_dev_comm` is never called. While the OS will reclaim resources at process exit, this leaks the NCCL communicator and device memory for the lifetime of the process. Consider registering an `atexit` handler or adding explicit teardown.

As per coding guidelines, "Use camelCase prefixed by lowercase 's' for non-magic-number global variables that are static or in anonymous namespaces."

```bash
#!/bin/bash
# Check if destroy_moe_nccl_dev_comm is ever called anywhere in the codebase
rg -n 'destroy_moe_nccl_dev_comm' -g '*.{cpp,cu,h,hpp,cuh,py}'
```
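One possible shape for the suggested `atexit` teardown, sketched with placeholders (`DevComm`, `getDevComm`, `destroyDevComm` stand in for the real `create_moe_nccl_dev_comm`/`destroy_moe_nccl_dev_comm` pair):

```cpp
#include <cassert>
#include <cstdlib>

// Placeholder for the device communicator resource.
struct DevComm
{
    int id = 0;
};

static DevComm* sDevComm = nullptr; // stands in for g_moeDevComm

// Idempotent teardown: safe to call both explicitly and from the atexit hook.
void destroyDevComm()
{
    if (sDevComm != nullptr)
    {
        delete sDevComm; // real code: destroy_moe_nccl_dev_comm(...)
        sDevComm = nullptr;
    }
}

DevComm* getDevComm()
{
    if (sDevComm == nullptr)
    {
        sDevComm = new DevComm();    // real code: create_moe_nccl_dev_comm(...)
        std::atexit(destroyDevComm); // ensure teardown runs at process exit
    }
    return sDevComm;
}
```

Making the teardown idempotent matters because `atexit` handlers still run after an explicit teardown; the nullptr guard makes the second call a no-op.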
…AlltoAll kernels. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
…en necessary. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Force-pushed from c93bcbc to b83ec19 (Compare)
/bot run --disable-fail-fast
PR_Github #36869 [ run ] triggered by Bot. Commit:

PR_Github #36869 [ run ] completed with state
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
/bot run --disable-fail-fast --reuse-test

PR_Github #37095 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail-fast on build/test/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity-check stages. Note: does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.