Update profiling docs: rocprof v1, v3 and roctx #862
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,10 +2,11 @@ | |
|
|
||
| ## rocprof | ||
|
|
||
| [rocprofv2](https://github.com/ROCm/rocprofiler?tab=readme-ov-file#rocprofiler-v2) | ||
| allows profiling both HSA & HIP API calls (rocprof being deprecated). | ||
| [rocprof](https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler) | ||
| allows profiling HSA & HIP API calls, kernel launches, and more... | ||
| Multiple major versions are available: `rocprof`, `rocprofv2` and `rocprofv3`. | ||
|
Member
One could add upfront here a comment stating that rocprofv3 is now to be used as others are now deprecated (unless one is on a system where it may not work). |
||
|
|
||
| Let's profile simple copying kernel saved in `profile.jl` file: | ||
| Let's profile a simple copying kernel saved in a `profile.jl` file: | ||
| ```julia | ||
| using AMDGPU | ||
|
|
||
|
|
@@ -38,13 +39,52 @@ main(2^24) | |
|
|
||
| ### Profiling problematic code | ||
|
|
||
| As mentioned above, there are different `rocprof` versions; try them to see which one works best. | ||
| On older ROCm versions, the newer `rocprofv2` and `rocprofv3` may not work well. | ||
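To see which of the versions mentioned above is installed at all, a small shell helper can probe `PATH`. This is only a sketch: `first_available` is a hypothetical helper name, not part of ROCm or AMDGPU.jl, and finding a tool on `PATH` does not guarantee that it reports anything useful on a given ROCm version.

```bash
# Print the first tool from the argument list that is found on PATH.
# Returns a non-zero status if none of them is installed.
# NOTE: `first_available` is an illustrative helper, not a ROCm command.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Try the newest version first, then fall back to the older ones.
if profiler=$(first_available rocprofv3 rocprofv2 rocprof); then
    echo "Found $profiler"
else
    echo "No rocprof version found on PATH" >&2
fi
```

The order of the arguments encodes the preference, so it can be reversed on systems where the older tools are known to work better.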
|
|
||
| !!! note | ||
| Although AMDGPU.jl uses the HIP API, only `--hsa-trace` seems to capture CPU API calls | ||
| (those of the lower-level HSA API), while `--hip-trace` has no effect. | ||
|
|
||
| This applies to at least ROCm 6.2.4 and was tested with AMDGPU 2.1.3. | ||
| If HIP API calls can be captured with other versions, please amend this documentation. | ||
|
|
||
| #### rocprof | ||
| ```bash | ||
| rocprof --hsa-trace --roctx-trace julia ./profile.jl | ||
| ``` | ||
|
|
||
| This enables HSA and ROC-TX (see below) tracing. | ||
| Memory copies and kernel launches are reported as well. | ||
|
|
||
| This will produce an `output.json` file which can be visualized | ||
| using [Perfetto UI](https://ui.perfetto.dev/). | ||
|
|
||
| #### rocprofv2 | ||
| ```bash | ||
| rocprofv2 --plugin perfetto --hsa-trace --roctx-trace --kernel-trace -o prof julia ./profile.jl | ||
| ``` | ||
|
|
||
| In principle this should enable various types of tracing, | ||
| but note that on ROCm 6.2.4 only kernel launches seem to be reported. | ||
|
|
||
| This will produce a `prof_output.pftrace` file which can be visualized | ||
| using [Perfetto UI](https://ui.perfetto.dev/). | ||
|
|
||
| #### rocprofv3 | ||
| ```bash | ||
| ENABLE_JITPROFILING=1 rocprofv2 --plugin perfetto --hip-trace --hsa-trace --kernel-trace -o prof julia ./profile.jl | ||
| rocprofv3 --output-format pftrace --hsa-trace --marker-trace --kernel-trace --memory-copy-trace -- julia ./profile.jl | ||
| ``` | ||
|
|
||
| This will produce `prof_output.pftrace` file which can be visualized | ||
| This will produce a number of `prof_output.pftrace` files which can be visualized | ||
| using [Perfetto UI](https://ui.perfetto.dev/). | ||
|
|
||
| `rocprofv3` is now recommended by AMD; however, on ROCm 6.2.4 nothing seems to be reported. | ||
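Since several trace files may be written, it can help to list them before opening them in Perfetto. The sketch below assumes the traces end up somewhere under the directory the profiler was started from; the exact output location is not stated here and may differ between versions. `list_traces` is a hypothetical helper name.

```bash
# List all Perfetto trace files below a directory (default: current directory).
# NOTE: `list_traces` is an illustrative helper; the output location of the
# profiler is an assumption and may vary between ROCm versions.
list_traces() {
    find "${1:-.}" -type f -name '*.pftrace' | sort
}

list_traces .
```

An empty listing after a profiling run suggests that no trace was written at all.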
|
|
||
| #### Visualization of the results | ||
| Here is an example of visualizing the `profile.jl` script above in Perfetto. | ||
| Use `W`/`S` to zoom in/out and `A`/`D` to move left/right in the timeline. | ||
|
|
||
|  | ||
|
|
||
| Here we can clearly see that host synchronization after each kernel dispatch | ||
|
|
@@ -69,6 +109,30 @@ wall duration is lower. | |
|
|
||
|  | ||
|
|
||
| ### Marking regions | ||
| When launching lots of kernels, it can be difficult to understand | ||
| the trace in terms of high-level program behavior. | ||
| In that case, the ROC-TX API can be used to mark regions that will be visible in the traces. | ||
|
|
||
| Here is an example of calling the API directly: | ||
| ```julia | ||
| function rangePush(message) | ||
| @ccall "libroctx64".roctxRangePushA(message::Ptr{Cchar})::Cint | ||
| end | ||
|
|
||
| function rangePop() | ||
| @ccall "libroctx64".roctxRangePop()::Cint | ||
| end | ||
|
|
||
| rangePush("Section name") | ||
| # Launch some kernels, call some functions, etc... | ||
| rangePop() | ||
| ``` | ||
|
|
||
| While [ROCTX.jl](https://github.com/JuliaGPU/ROCTX.jl) aims to offer a Julia wrapper around it, | ||
| it does not seem to be working yet. PRs welcome! | ||
| (Note: the `ccall`s above do _not_ require ROCTX.jl to be loaded!) | ||
|
|
||
|
Member
Thanks for this addition! |
||
| ## Debugging | ||
|
|
||
| Use `HIP_LAUNCH_BLOCKING=1` to synchronize immediately after launching GPU kernels. | ||
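One way to apply the variable to a single run, without exporting it for the whole shell session, is to prefix the command with it. The wrapper below is a minimal sketch; `run_blocking` is a hypothetical name introduced only for illustration.

```bash
# Run a command with HIP_LAUNCH_BLOCKING=1 set for that command only.
# NOTE: `run_blocking` is an illustrative helper, not part of AMDGPU.jl or ROCm.
run_blocking() {
    HIP_LAUNCH_BLOCKING=1 "$@"
}

# Example (requires Julia and the script from the profiling section):
#   run_blocking julia ./profile.jl
```

Writing `HIP_LAUNCH_BLOCKING=1 julia ./profile.jl` directly has the same effect; the wrapper only saves typing when the setting is used repeatedly.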
|
|
||
Links to the doc