diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt b/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt
index 2d72e1d283..ef330e9dbc 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required (VERSION 3.4)
+cmake_minimum_required (VERSION 3.5)
set (CMAKE_CXX_COMPILER "icpx")
project (matrix_mul LANGUAGES CXX)
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md b/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md
index afab79ab6a..9c8398b822 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/README.md
@@ -6,14 +6,14 @@ The sample is a simple program that multiplies together two large matrices and v
| Area | Description
|:--- |:---
-| What you will learn | A method to determine the root cause problems from passing bad buffers through the SYCL runtime.
+| What you will learn | A method to determine the root cause of problems from passing bad data through the SYCL runtime.
| Time to complete | 50 minutes
>**Note**: For comprehensive instructions on the Intel® Distribution for GDB* and writing SYCL code, see the *[Intel® oneAPI Programming Guide](https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/current/overview.html)*. (Use search or the table of contents to find relevant information quickly.)
## Purpose
-The two samples in this tutorial show examples of how to debug issues arising from passing bad buffers through the SYCL runtime, one when using SYCL buffers and one when using a direct reference to device memory.
+The two samples in this tutorial show examples of how to debug issues arising from passing bad data through the SYCL runtime, one when using SYCL buffers and one when using a direct reference to device memory.
In one case, we will know that there is a problem due to a crash. In the other case, we will get bad results.
@@ -32,8 +32,8 @@ The sample includes different versions of a simple matrix multiplication program
|:--- |:---
| OS | Ubuntu* 24.04 LTS
| Hardware | GEN9 or newer
-| Software | Intel® oneAPI DPC++/C++ Compiler 2025.1 <br> Intel® Distribution for GDB* 2025.1 <br> Unified Tracing and Profiling Tool 2.1.2, which is available from the [following GitHub repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
-| Intel GPU Driver | Intel® General-Purpose GPU Rolling Release driver 2507.12 or later from https://dgpu-docs.intel.com/releases/releases.html
+| Software | Intel® oneAPI DPC++/C++ Compiler 2025.3 <br> Intel® Distribution for GDB* 2025.3 <br> Unified Tracing and Profiling Tool 2.3.0, which is available from the [following GitHub repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
+| Intel GPU Driver | Intel® General-Purpose GPU Long-Term Support driver 2523.31 or later from https://dgpu-docs.intel.com/releases/releases.html
## Key Implementation Details
@@ -137,15 +137,73 @@ Documentation on using the debugger in a variety of situations can be found at *
### Getting the Tracing and Profiling Tool
-At a step in this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
+In this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
-To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install ). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
+To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
-### Guided Instructions for Zero Buffer
+### Guided Instructions for Zero Buffer using Address Sanitizer
+A recent addition to the oneAPI compiler is the ability to use the "Address Sanitizer" (which you may have seen when using [GCC](https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html) or [CLANG](https://clang.llvm.org/docs/AddressSanitizer.html)) to catch invalid pointer addresses at runtime on the GPU rather than the host. This will require a special build of the application.
-In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `lambda` function. **This will cause the application to crash.**
+1. Compile a version of the program with device-side Address Sanitizer (assuming that you are in the `build` directory)
+ ```
+ icpx -fsycl -O0 -g -Xarch_device -fsanitize=address -std=gnu++17 -Rno-debug-disables-optimization -o a1_matrix_mul_zero_buff_asan ../src/a1_matrix_mul_zero_buff.cpp
+ ```
+ > Note: If you leave the `-Xarch_device` off, this command will look for illegal addresses on the host rather than the device.
+
+2. Now run the program on the GPU:
+ ```
+ ./a1_matrix_mul_zero_buff_asan
+ ==== DeviceSanitizer: ASAN
+ Device: Intel(R) Data Center GPU Max 1550
+ Problem size: c(150,600) = a(150,300) * b(300,600)
+
+ ====ERROR: DeviceSanitizer: null-pointer-access on Unknown Memory (0x460)
+ WRITE of size 4 at kernel (auto&) const::'lambda'(auto)> LID(80, 8, 0) GID(280, 128, 0)
+ #0 auto auto main::'lambda0'(auto&)::operator()(auto&) const::'lambda'(auto)::operator()>(auto) const Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/build/../src/a1_matrix_mul_zero_buff.cpp:93
+ Aborted (core dumped)
+ ```
+
+3. Look at the reported source location
+
+ If you pull up an editor and go to line 93, you will see the following:
+ ```
+ 85 // Submit command group to queue to initialize matrix a
+ 86 q.submit([&](auto &h) {
+ 87 // Get write only access to the buffer on a device.
+ 88 accessor a(a_buf, h, write_only);
+ 89
+ 90 // Execute kernel.
+ 91 h.parallel_for(range(M, N), [=](auto index) {
+ 92 // Each element of matrix a is 1.
+ 93 a[index] = 1.0f; // --- Error here!
+ 94 });
+ 95 });
+ ```
+
+4. Understand what is happening
+
+   Looking at the error, we see that we were trying to write to local index `LID(80, 8, 0)`, or global index `GID(280, 128, 0)`, of array `a`. According to the problem size printed when the program ran (`Problem size: c(150,600) = a(150,300) * b(300,600)`), both of these indexes appear to be in range, but are they?
+
+ Take a look at the lines where the arrays are allocated:
+
+ ```
+   61   buffer<float, 2> a_buf(range(0, 0));
+   62   buffer<float, 2> b_buf(range(N, P));
+   63   buffer c_buf(reinterpret_cast<float *>(c_back), range(M, P));
+ ```
+
+ Well, that's a problem. `a` was accidentally allocated with zero size. Fix this like you see in `a2_matrix_mul.cpp` and things will work just fine.
+
+ Notice how the Address Sanitizer correctly caught a bad write on the device before it caused problems.
+
+### Guided Instructions for Zero Buffer using gdb-oneapi and the OpenCL CPU device
+
+In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `lambda` function. **This will cause the application to crash.** We saw in the previous section how we can catch this with the device-side Address Sanitizer. But what if the bad array allocation occurred somewhere else deep in the program? How would we track the problem back to its source? Let's try one technique to locate the source of the error.
1. Run the program without the debugger.
+
+   > ***Warning: this may cause the card to vanish. Check with `sycl-ls` after running; if the GPU no longer shows up, you will need to reboot the machine before continuing.***
+
```
./a1_matrix_mul_zero_buff
```
@@ -155,13 +213,13 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
Problem size: c(150,600) = a(150,300) * b(300,600)
Segmentation fault from GPU at 0x0, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 1 (Write), banned: 0, aborting.
Segmentation fault from GPU at 0x0, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 3 (PML4), access: 1 (Write), banned: 0, aborting.
- Abort was called at 274 line in file:
+ Abort was called at 288 line in file:
./shared/source/os_interface/linux/drm_neo.cpp
Aborted (core dumped)
```
- These error messages tells us that we wrote to an address on a memory page that we did not allocate on the GPU (generating an unexpected page fault)
+ These error messages tell us that we wrote to an address on a memory page (`0x0`) that we did not allocate on the GPU (generating an unexpected page fault).
- On an Intel(R) Graphics GPU, the crash will look something like this:
+ On an Intel(R) Graphics GPU, the crash will look something like this, or the program may hang (exit with `control-C`):
```
Device: Intel(R) Graphics [0xe20b]
Problem size: c(150,600) = a(150,300) * b(300,600)
@@ -172,32 +230,11 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
2. Start the debugger to watch the application failure and find out where it failed. Since the message indicates that the failure was on the GPU, we need to enable GPU debugging by doing [some setup on your system](#setting-up-to-debug-on-the-gpu).
- However, in this case, let's see if we can catch the stack dump by running on the CPU where it is easier to pull data from the failing kernel.
+ However, in this case, let's see if we can catch the stack dump by running on the CPU where it is easier to gather data from the failing kernel.
```
ONEAPI_DEVICE_SELECTOR=opencl:cpu gdb-oneapi ./a1_matrix_mul_zero_buff
```
- > **Note:** this will only work if the `sycl-ls` command shows OpenCL
- > devices for the graphics card, such as like this:
-
- ```
- $ sycl-ls
- [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
- [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
- [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
- [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
- [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
- ```
- If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
-
- ```
- export DRIVERLOC=/usr/lib/x86_64-linux-gnu
- export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
- export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
- ```
3. You should get the prompt `(gdb)`.
@@ -244,9 +281,9 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
at a1_matrix_mul_zero_buff.cpp:123
```
- Note that the stack originates from `a1_matrix_mul_zero_buff.cpp:123` (if you look in the source you will see that this is the end of the block containing all the offload code), but we crash in `a1_matrix_mul_zero_buff.cpp:93`. This is a side-effect of the fact that SYCL does not wait for the submitted kernel to complete (unless you tell it to), so the main thread has made it all the way past all the offload statements and is waiting for them to complete.
+ Note that the stack originates from `a1_matrix_mul_zero_buff.cpp:123` (if you look in the source you will see that this is the end of the block containing all the offload code), but we crash in `a1_matrix_mul_zero_buff.cpp:93`. This is a side-effect of the fact that SYCL does not wait for the submitted kernel to complete (unless you tell it to), so the main thread has made it all the way past all the offload statements and is waiting for them to complete when it hears about the segmentation fault.
- If the crash happens on a thread other than the first thread, you'll have a shorter that that starts from a "clone":
+ If the crash happens on a thread other than the first thread, you'll have a shorter stack that starts from a "clone":
```
#0 0x00007ffff7886fa2 in main::{lambda(auto:1&)#2}::operator()(sycl::_V1::handler&) const::{lambda(auto:1)#1}::operator() >(sycl::_V1::item<2, true>) const (this=0x7fffd5fff450,
@@ -260,7 +297,7 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
#21 0x00007ffff72a1e2e in start_thread (arg=) at ./nptl/pthread_create.c:447
#22 0x00007ffff7333a4c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
- This stack might look a little odd due to the fact we are seeing one thread out of many launched to execute the kernels. We ran the debugger on the host OpenCL CPU driver, where the parallel processing is implemented using Intel(R) oneAPI Threading Building Blocks.
+   This stack might look a little odd because we are seeing one thread out of many launched to execute the kernels. Since we are running on the host OpenCL CPU driver, parallel processing is implemented using Intel(R) oneAPI Threading Building Blocks, which uses `clone` to spawn additional threads.
6. Examine the code at the crash
```
@@ -289,8 +326,6 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
```
>**Note:** `/r` disables the *pretty printer* for the SYCL `buffer` class. You can see all available pretty printers using `info pretty-printer` at the `gdb` prompt.
- You might notice that this buffer has a size 0 by 0 elements (the `AccessRange` and `MemRange` are for a `common_array` of size 0 by 0 elements). Since it has zero size, this buffer is the problem.
-
```
$2 = { >> = {
static AS = sycl::_V1::access::address_space::global_space, static IsHostBuf = false, static IsHostTask = false,
@@ -305,6 +340,8 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
MemRange = {> = {common_array = {0, 0}}, static dimensions = }}, {
MData = 0x0}}
```
+
+ You might notice that this buffer has a size 0 by 0 elements (the `AccessRange` and `MemRange` are for a `common_array` of size 0 by 0 elements). Since it has zero size, this buffer is the problem.
8. Now look at the `index` variable, which represents the iteration space that we will traverse to set all elements of the array `a` to an initial value of `1.0`. You will see something like this:
```
@@ -316,9 +353,9 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
0}}, static dimensions = }, MOffset = {> = {common_array = {0,
0}}, static dimensions = }}}
```
- Clearly there is a mismatch here! 'a' has no space reserved for it, yet we will be iterating over 150 by 300 elements (and updating element 120 by 144 in this thread), which is clearly an error.
+ Clearly there is a mismatch here! 'a' has no space reserved for it, yet we will be iterating over 150 by 300 elements (and updating element 131 by 0 in this thread), which is clearly an error.
-9. To further root-cause the error, we will need to restart the program and look at the values of `a_buf` and `b_buf`, which are not in scope in any of our stack frames. We'll set some breakpoints where they are used:
+9. To further root-cause the error, we will need to restart the program and look at the values of the buffers behind the accessors (`a_buf` and `b_buf`), which are not in scope in any of our stack frames. We'll set some breakpoints at the `parallel_for` statements where they are initialized.
```
(gdb) b 79
@@ -335,7 +372,7 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
Problem size: c(150,600) = a(150,300) * b(300,600)
Thread 1 "a1_matrix_mul_z" hit Breakpoint 1.1, main::{lambda(auto:1&)#1}::operator()(sycl::_V1::handler&) const (this=0x7fffffffb550, h=sycl::handler& = {...})
- at /nfs/site/home/cwcongdo/oneAPI-samples-mine/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/a1_matrix_mul_zero_buff.cpp:79
+ at Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/a1_matrix_mul_zero_buff.cpp:79
79 h.parallel_for(range(N, P), [=](auto index) {
(gdb)
```
@@ -355,7 +392,7 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
Continuing.
Thread 1 "a1_matrix_mul_z" hit Breakpoint 2.1, main::{lambda(auto:1&)#2}::operator()(sycl::_V1::handler&) const (this=0x7fffffffb550, h=sycl::handler& = {...})
- at /nfs/site/home/cwcongdo/oneAPI-samples-mine/Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/a1_matrix_mul_zero_buff.cpp:91
+ at Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/src/a1_matrix_mul_zero_buff.cpp:91
91 h.parallel_for(range(M, N), [=](auto index) {
(gdb) p a
$7 = sycl::accessor write range {0, 0, 1}
@@ -368,8 +405,8 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
Looking at the source again, you'll see that this originated when the buffers were created around line 61 and 62:
```
-      buffer<float, 2> a_buf(range(0, 0));
-      buffer<float, 2> b_buf(range(N, P));
+   61   buffer<float, 2> a_buf(range(0, 0));
+   62   buffer<float, 2> b_buf(range(N, P));
```
In real code the values to the ranges may be passed into the function from outside, so you will need to inspect those as well as the code where they are calculated. For example, you would need to find the values of `M`, `N`, and `P` to make sure that the resulting buffer sizes are non-zero in these buffer definitions:
@@ -378,15 +415,85 @@ In `a1_matrix_mul_zero_buff`, a zero-element buffer is passed to a SYCL submit `
    buffer<float, 2> b_buf(range(N, P));
```
-### Guided Instructions for Null Device Pointer
+### Guided Instructions for Null Device Pointer using Address Sanitizer
+Let us use the Address Sanitizer again to catch invalid pointer addresses at runtime, this time in code that makes use of explicit device memory allocations rather than using SYCL buffers. This will require a special build of the application.
+
+1. Compile a version of the program with device-side Address Sanitizer (assuming that you are in the `build` directory)
+ ```
+ icpx -fsycl -O0 -g -Xarch_device -fsanitize=address -std=gnu++17 -Rno-debug-disables-optimization -o b1_matrix_mul_null_usm_asan ../src/b1_matrix_mul_null_usm.cpp
+ ```
+
+2. Now run the program on the GPU:
+ ```
+ ./b1_matrix_mul_null_usm_asan
+ Initializing
+ ==== DeviceSanitizer: ASAN
+ Computing
+ Device: Intel(R) Data Center GPU Max 1550
+ Device compute units: 512
+ Device max work item size: 1024, 1024, 1024
+ Device max work group size: 1024
+ Problem size: c(150,600) = a(150,300) * b(300,600)
+
+ ====ERROR: DeviceSanitizer: null-pointer-access on Unknown Memory (0x15630)
+ READ of size 4 at kernel (auto&) const::'lambda'(auto)> LID(68, 3, 0) GID(468, 73, 0)
+ #0 auto auto main::'lambda0'(auto&)::operator()(auto&) const::'lambda'(auto)::operator()>(auto) const Tools/ApplicationDebugger/guided_matrix_mult_BadBuffers/build/../src/b1_matrix_mul_null_usm.cpp:122
+ Aborted (core dumped)
+ ```
+
+3. Look at the reported source location
+
+ If you pull up an editor and go to line 122, you will see the following:
+ ```
+ // Submit command group to queue to multiply matrices: c = a * b
+ 107 q.submit([&](auto &h) {
+ 108 // Read from a and b, write to c
+ 109 int width_a = N;
+ 110
+ 111 // Execute kernel.
+ 112 h.parallel_for(range(M, P), [=](auto index) {
+ 113 // Get global position in Y direction.
+ 114 int row = index[0]; // m
+ 115 int col = index[1]; // p
+ 116 float sum = 0.0f;
+ 117
+ 118 // Compute the result of one element of c
+ 119 for (int i = 0; i < width_a; i++) {
+ 120 auto a_index = row * width_a + i;
+ 121 auto b_index = i * P + col;
+ 122 sum += dev_a[a_index] * dev_b[b_index]; // ----- Problem here
+ 123 }
+ 124
+ 125 auto idx = row * P + col;
+ 126 dev_c[idx] = sum;
+ 127 });
+ 128 });
+ ```
+
+4. Putting together what we know
+
+   Looking at the error, we see that we were trying to read local index `LID(68, 3, 0)`, or global index `GID(468, 73, 0)`, of either array `a` or array `b`. According to the problem size printed when the program ran (`Problem size: c(150,600) = a(150,300) * b(300,600)`), both of these indexes appear to be in range, but are they?
+
+ Take a look at the lines where the arrays are allocated:
+
+ ```
+   79   float * dev_a = sycl::malloc_device<float>(M*N, q);
+   80   float * dev_b = sycl::malloc_device<float>(N*P, q);
+   81   float * dev_c = sycl::malloc_device<float>(M*P, q);
+ ```
+
+ Those look OK, but the Address Sanitizer is telling us that someplace after that allocation something went wrong. Unless you are lucky and spot the problem by manual analysis (tricky in a large code base), it's time to pull out the debugger and see what is going on with the pointers `dev_a` and `dev_b`.
-In `b1_matrix_mul_null_usm.cpp` a bad (in this case, null) pointer that is supposed to represent unallocated memory on the device is inadvertently used in a kernel. This example uses unified shared memory rather than SYCL buffers like the previous example.
+### Guided Instructions for Null Device Pointer using gdb-oneapi and the OpenCL CPU device
+In `b1_matrix_mul_null_usm.cpp` a bad (in this case, null) pointer that is supposed to represent allocated memory on the device is inadvertently passed as an argument to a kernel. This example uses explicitly allocated device memory rather than SYCL buffers like the previous example.
+
+#### Checking the Behavior using Multiple Backends
1. Run the program on the GPU using Level Zero.
```
ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./b1_matrix_mul_null_usm
```
- This run produces troublesome output.
+   This run produces troublesome output: a crash due to accessing illegal (non-allocated) memory.
```
Device max work group size: 1024
Problem size: c(150,600) = a(150,300) * b(300,600)
@@ -405,40 +512,39 @@ In `b1_matrix_mul_null_usm.cpp` a bad (in this case, null) pointer that is suppo
The results should be the same as the Level Zero output.
> **Note:** this will only work if the `sycl-ls` command shows OpenCL
- > devices for the graphics card, such as like this:
+   > devices for the graphics card, like this:
```
- $ sycl-ls
- [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
- [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
- [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
- [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
- [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ $ sycl-ls
+ [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
+ [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
+ [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
+ [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
```
- If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
+ > If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
```
- export DRIVERLOC=/usr/lib/x86_64-linux-gnu
- export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
- export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
+ export DRIVERLOC=/usr/lib/x86_64-linux-gnu
+ export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
+ export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
```
3. Check the output we get by bypassing the GPU entirely and using the OpenCL driver for CPU.
```
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./b1_matrix_mul_null_usm
- ```
- ```
+ :
Problem size: c(150,600) = a(150,300) * b(300,600)
Segmentation fault (core dumped)
```
-#### Attempting to Understand What Is Happening
+#### Debugging the Problem
-Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests as something that goes terribly wrong. Or one of the backbends might have a bug. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address 0x0, while this may be legally allocated memory on the GPU.
+Why did we try with multiple backends? If one had shown correct or incorrect results, and one had crashed, we might be facing a race condition that only occasionally manifests as something that goes terribly wrong. Or one of the backends might have a bug while the others do not. But here all three crash, so it's likely the program is doing something illegal to memory. The host CPU is a particularly good place to test for illegal memory accesses, because the CPU never allows pointers with an address within a few kilobytes of address `0x0`, while this may be legally allocated memory on the GPU.
Another reason to try different backends is that debugging support may differ between different GPU drivers and/or different GPU models. Debugging the program using the OpenCL™ CPU driver gets around these issues.
@@ -472,7 +578,7 @@ Let's see what caused the problem by running in the debugger:
```
(gdb) backtrace
```
- You should see output similar to the following.
+ You should see output similar to the following, showing the stack for one oneTBB thread:
```
#0 0x00007ffff4a75f60 in main::{lambda(auto:1&)#2}::operator()(sycl::_V1::handler&) const::{lambda(auto:1)#1}::operator() >(sycl::_V1::item<2, true>) const (this=0x7fff3d5ff438,
index=sycl::item range ..., offset ... = ...) at b1_matrix_mul_null_usm.cpp:122
@@ -486,6 +592,8 @@ Let's see what caused the problem by running in the debugger:
(gdb)
```
+   Or you will see a longer stack dump for the main thread, which also shows a problem at `b1_matrix_mul_null_usm.cpp:122`.
+
5. We got lucky in that the frame where we crashed is in our code. Let's examine the code in a little more detail:
```
(gdb) list
@@ -498,12 +606,11 @@ Let's see what caused the problem by running in the debugger:
119 for (int i = 0; i < width_a; i++) {
120 auto a_index = row * width_a + i;
121 auto b_index = i * P + col;
- 122 sum += dev_a[a_index] * dev_b[b_index];
+ 122 sum += dev_a[a_index] * dev_b[b_index]; // --- Problems here
123 }
124
125 auto idx = row * P + col;
126 dev_c[idx] = sum;
-
```
6. Let's check the pointers in use at line 122.
@@ -525,17 +632,19 @@ Let's see what caused the problem by running in the debugger:
```
This located the problem: one of the array pointers is null (zero) while the other has a (hopefully) valid value. If both pointers were valid, we'd want to check the index values being used to access the arrays, and the memory values at those locations (and at the start of the array).
+   At this point you know what the problem is. Now you need to track down the source of the bad address, which lies somewhere between the allocation of the device memory and the use of the pointer in the kernel.
+
#### Understanding What Is Happening
-Early on, operating system designers realized that it was common for developers to write bugs in which they accidentally try to de-reference null pointers (access them like an array). To make it easier to find these errors the operating system designers implemented logic that made it illegal for any program to access the first two memory pages (so from address 0x0 to around 0x2000). They didn't go further to explicitly validate the address of any memory accessed by a program (because it is too expensive), but this range of illegal addresses was a cheap check that caught a huge number of bugs.
+Early on, operating system designers realized that it was common for developers to write bugs in which they accidentally try to de-reference null pointers. To make it easier to find these errors the operating system designers implemented logic that made it illegal for any program to access the first two memory pages (so from address `0x0` to around `0x2000`).
-GPUs typically don't have a lot of memory, so they can't afford to set aside a large range of illegal addresses. So both Level Zero and OpenCL passed on the pointer to device memory (in our case intentionally zero) assuming it was valid. From there we had three possible outcomes:
+GPUs typically don't have a lot of memory, so they can't afford to set aside a range of illegal addresses. So both Level Zero and OpenCL passed on the pointer to device memory (in our case intentionally zero) assuming it was valid. From there we had three possible outcomes:
1. If the pointer was correct and pointing to allocated memory, the program would have completed correctly.
2. If the pointer was incorrect but was pointing to memory allocated for something else, the kernel would have accessed random memory values in calculating the sum on line 122, and returned incorrect results as a consequence.
3. If the pointer was incorrect and pointing to memory that was never allocated, the kernel will crash on either the GPU or CPU.
-What would we have seen if `dev_a` had contained just a random pointer value? If you were lucky, the address returned when you print `dev_a` would look non-null and very different from the one returned when you printed `dev_b`, but that's not a certainty. Or it would have pointed to memory not owned by your process and caused a crash.
+What would we have seen if `dev_a` had contained just a random pointer value? If you were lucky, the address returned when you print `dev_a` would look non-null and very different from the one returned when you printed `dev_b`, but that's not a certainty. Or it would have pointed to memory not owned by your process and caused a crash. Either way, your program would not have produced correct results, if it ran at all.
#### Other Debug Techniques
@@ -552,29 +661,29 @@ You need to build `unitrace` before you can use it. See the instructions at [Uni
While reviewing the output, you might see something like the following excerpt near the bottom of the output.
```
:
- >>>> [1487508411066857] zeKernelCreate: hModule = 44797904 desc = 140735939087520 {ZE_STRUCTURE_TYPE_KERNEL_DESC(0x1d) 0 0 "_ZTSZZ4mainENKUlRT_E0_clIN4sycl3_V17handlerEEEDaS0_EUlS_E_"} phKernel = 140735939087512 (hKernel = 14798282318754847232)
- <<<< [1487508411092885] zeKernelCreate [21566 ns] hKernel = 44799912 -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411096140] zeDeviceGetSubDevices: hDevice = 39507224 pCount = 140735939087500 (Count = 0) phSubdevices = 0
- <<<< [1487508411098094] zeDeviceGetSubDevices [395 ns] Count = 0 -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411099724] zeDeviceGetSubDevices: hDevice = 39507224 pCount = 140735939087500 (Count = 0) phSubdevices = 0
- <<<< [1487508411100759] zeDeviceGetSubDevices [22 ns] Count = 0 -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411104804] zeKernelSetIndirectAccess: hKernel = 44799912 flags = 7
- <<<< [1487508411106715] zeKernelSetIndirectAccess [431 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411113472] zeKernelGetProperties: hKernel = 44799912 pKernelProperties = 45076936
- <<<< [1487508411122357] zeKernelGetProperties [7693 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411125804] zeKernelSetArgumentValue: hKernel = 44799912 argIndex = 0 argSize = 4 pArgValue = 44677968
- <<<< [1487508411127704] zeKernelSetArgumentValue [265 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411130788] zeKernelSetArgumentValue: hKernel = 44799912 argIndex = 1 argSize = 8 pArgValue = 0
- <<<< [1487508411132260] zeKernelSetArgumentValue [528 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411133457] zeKernelSetArgumentValue: hKernel = 44799912 argIndex = 2 argSize = 8 pArgValue = 140735939087912
- <<<< [1487508411135573] zeKernelSetArgumentValue [1131 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [1487508411136604] zeKernelSetArgumentValue: hKernel = 44799912 argIndex = 3 argSize = 8 pArgValue = 140735939087912
- <<<< [1487508411137932] zeKernelSetArgumentValue [459 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101562718] zeKernelCreate: hModule = 0x3621f28 desc = 0x7ffed5a4cc80 {ZE_STRUCTURE_TYPE_KERNEL_DESC(0x1d) 0 0 "_ZTSZZ4mainENKUlRT_E0_clIN4sycl3_V17handlerEEEDaS0_EUlS_E_"} phKernel = 0x7ffed5a4cc78 (hKernel = 0x35a1110)
+ <<<< [500091101674826] zeKernelCreate [96812 ns] hKernel = 0x337e668 -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101689327] zeDeviceGetSubDevices: hDevice = 0x325eee8 pCount = 0x7ffed5a4cc74 (Count = 0x0) phSubdevices = 0x0
+ <<<< [500091101697229] zeDeviceGetSubDevices [630 ns] Count = 0x0 -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101705199] zeDeviceGetSubDevices: hDevice = 0x325eee8 pCount = 0x7ffed5a4cc74 (Count = 0x0) phSubdevices = 0x0
+ <<<< [500091101711447] zeDeviceGetSubDevices [208 ns] Count = 0x0 -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101738711] zeKernelSetIndirectAccess: hKernel = 0x337e668 flags = 0x7
+ <<<< [500091101745875] zeKernelSetIndirectAccess [724 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101763140] zeKernelGetProperties: hKernel = 0x337e668 pKernelProperties = 0x36333c8
+ <<<< [500091101771939] zeKernelGetProperties [1608 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101796337] zeKernelSetArgumentValue: hKernel = 0x337e668 argIndex = 0x0 argSize = 0x4 pArgValue = 0x36046a8 ArgValue = 0x12c
+ <<<< [500091101806110] zeKernelSetArgumentValue [1133 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101814314] zeKernelSetArgumentValue: hKernel = 0x337e668 argIndex = 0x1 argSize = 0x8 pArgValue = 0x0 (NULL)
+ <<<< [500091101820959] zeKernelSetArgumentValue [770 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101827079] zeKernelSetArgumentValue: hKernel = 0x337e668 argIndex = 0x2 argSize = 0x8 pArgValue = 0x7ffed5a4ce00 ArgValue = 0xff00fffffff00000
+ <<<< [500091101836736] zeKernelSetArgumentValue [2819 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [500091101842752] zeKernelSetArgumentValue: hKernel = 0x337e668 argIndex = 0x3 argSize = 0x8 pArgValue = 0x7ffed5a4ce00 ArgValue = 0xff00ffffffea0000
+ <<<< [500091101849732] zeKernelSetArgumentValue [1181 ns] -> ZE_RESULT_SUCCESS(0x0)
:
```
- Notice how all the kernel arguments have a non-zero value except one (`argIndex = 1`) when they are set up. We have nothing to help us map the kernel arguments created by the SYCL runtime and passed to Level Zero to the arguments in the user program (the value may be OK). However, if you see a kernel argument that looks out of place (note that the values at all other argument indexes are very similar), you might want to be suspicious.
+ Notice how all the kernel arguments have a non-zero value except one (`argIndex = 0x1`) when they are set up. We have nothing to help us map the kernel arguments created by the SYCL runtime and passed to Level Zero to the arguments in the user program (the value may be OK). However, if you see a kernel argument that looks out of place (note that the values at all other argument indexes are very similar), you might want to be suspicious.
- Other than this clue, there are no other hints that the bad output from the program is due to a bad input argument rather than a race condition or other algorithmic error. These sorts of issues can be partically tricky to diagnose when the device pointers are initialized in a different part of the program, or in a 3rd-party library.
+ Other than this clue, there are no other hints that the bad output from the program is due to a bad input argument rather than a race condition or other algorithmic error. These sorts of issues can be particularly tricky to diagnose when the device pointers are initialized in a different part of the program, or in a 3rd-party library.
## License
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt b/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt
index 510d479316..87dd6fdf07 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required (VERSION 3.4)
+cmake_minimum_required (VERSION 3.5)
set (CMAKE_CXX_COMPILER "icpx")
project (matrix_mul LANGUAGES CXX)
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md b/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md
index d7308606c2..dd8b72569f 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/README.md
@@ -13,7 +13,7 @@ The sample code is a simple program that multiplies together two large matrices
## Purpose
-The two samples in this tutorial show examples of situations where the SYCL runtime provides an assert when it detects incorrect use of the SYCL API that is not caught at build time. Unfortunately, these runtime error checks are not comprehensive, so not getting an assert does not indicate correct code structure or practices.
+The two samples in this tutorial show situations where the SYCL runtime provides an assert when it detects incorrect use of the SYCL API that is not caught at build time. Unfortunately, these runtime error checks are not comprehensive, so not getting an assert does not indicate correct code structure or practices.
Currently, SYCL asserts only tell you that an error was detected, but not where it resides in your code. To determine the location, you must run the program in the Intel® Distribution for GDB* with debug symbols enabled. Turning off optimization can also help.
@@ -31,11 +31,12 @@ The sample includes three different versions of some simple matrix multiplicatio
## Prerequisites
-| Optimized for | Description
-|:--- |:---
+| Optimized for | Description
+|:--- |:---
| OS | Ubuntu* 24.04 LTS
| Hardware | GEN9 or newer
-| Software | Intel® oneAPI DPC++/C++ Compiler 2025.1
Intel® Distribution for GDB* 2025.1
+| Software | Intel® oneAPI DPC++/C++ Compiler 2025.3
Intel® Distribution for GDB* 2025.3
+| Intel GPU Driver | Intel® General-Purpose GPU Long-Term Support driver 2523.31 or later from https://dgpu-docs.intel.com/releases/releases.html
## Key Implementation Details
@@ -54,7 +55,7 @@ When working with the command-line interface (CLI), you should configure the one
## Build and Run the `Guided Matrix Multiplication Exception` Programs
> **Note**: If you have not already done so, set up your CLI
-> environment by sourcing the `setvars` script in the root of your oneAPI installation.
+environment by sourcing the `setvars` script in the root of your oneAPI installation.
>
> Linux*:
> - For system wide installations: `. /opt/intel/oneapi/setvars.sh`
@@ -147,7 +148,7 @@ In `1_matrix_mul_null_pointer` a null pointer is passed to a SYCL `memcpy` state
Device max work item size: 1024, 1024, 1024
Device max work group size: 1024
Problem size: c(150,600) = a(150,300) * b(300,600)
- Exception caught at File: 1_matrix_mul_null_pointer.cpp | Function: main | Line: 95 | Column: 5
+ Exception caught at File: 1_matrix_mul_null_pointer.cpp | Function: main | Line: 95 | Column: 7
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): NULL pointer argument in memory copy operation.
Aborted (core dumped)
@@ -160,13 +161,15 @@ As an exercise, let's find this a debugger (any host debugger will work; however
gdb-oneapi ./1_matrix_mul_null_pointer
(gdb) run
```
- When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`
+ > When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`. You may need to do this more than once.
+
+ > Why can we ignore these messages and keep on debugging? Because we don't need to monitor the code running on the device in the debugger; the asserts come from the host when the kernel is submitted. Running `gdb-oneapi` with `ZET_ENABLE_PROGRAM_DEBUGGING=1` is only necessary if you want to debug the kernels running on the GPU.
2. Notice the application failure. The error is the same message seen when we ran it outside the debugger.
```
- Exception caught at File: 1_matrix_mul_null_pointer.cpp | Function: main | Line: 95 | Column: 5
+ Exception caught at File: 1_matrix_mul_null_pointer.cpp | Function: main | Line: 95 | Column: 7
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): NULL pointer argument in memory copy operation.
+ what(): NULL pointer argument in memory copy operation.
Thread 1.1 "1_matrix_mul_nu" received signal SIGABRT, Aborted.
```
@@ -215,7 +218,9 @@ In the second version, the code attempts to execute more than one offload statem
gdb-oneapi ./2_matrix_mul_multi_offload
(gdb) run
```
- When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`
+ > When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`. You may need to do this more than once.
+
+ > Why can we ignore these messages and keep on debugging? Because we don't need to monitor the code running on the device in the debugger; the asserts come from the host when the kernel is submitted. Running `gdb-oneapi` with `ZET_ENABLE_PROGRAM_DEBUGGING=1` is only necessary if you want to debug the kernels running on the GPU.
2. The error is the same message seen when we ran it outside the debugger.
```
@@ -224,14 +229,14 @@ In the second version, the code attempts to execute more than one offload statem
Thread 1.1 "2_matrix_mul_mu" received signal SIGABRT, Aborted.
```
- The exception talks about a “command group” and that only a single command group is allowed within a `submit`. A command group is something like a `parallel_for` or a SYCL `memcpy` statement – it’s a language construct or function call that makes something happen on the device. Only one action is allowed per `submit` construct.
+ The exception talks about a “command group” and that only a single command group is allowed within a `submit`. A command group is something like a `parallel_for` or a SYCL `memcpy` statement – it’s a language construct or function call that makes something happen on the device. Only one such action is allowed per `submit` construct.
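   The rule can be sketched without the real SYCL API. The mock below just counts actions per `submit` the way the runtime assert does; `MockQueue` and `MockHandler` are stand-ins, not SYCL types:

   ```cpp
   #include <cassert>
   #include <functional>

   // Stand-in for a SYCL handler: records each action the command-group
   // function requests (memcpy, parallel_for, ...).
   struct MockHandler {
       int actions = 0;
       void memcpy()       { ++actions; }
       void parallel_for() { ++actions; }
   };

   // Stand-in for a SYCL queue: returns false (where the real runtime
   // throws) when a command-group function enqueues more than one action.
   struct MockQueue {
       bool submit(const std::function<void(MockHandler&)>& cgf) {
           MockHandler h;
           cgf(h);
           return h.actions <= 1;
       }
   };

   int main() {
       MockQueue q;
       // Buggy pattern from 2_matrix_mul_multi_offload: two actions, one submit.
       bool bad = q.submit([](MockHandler& h) { h.memcpy(); h.parallel_for(); });
       // Fixed pattern: one action per submit.
       bool ok1 = q.submit([](MockHandler& h) { h.memcpy(); });
       bool ok2 = q.submit([](MockHandler& h) { h.parallel_for(); });
       assert(!bad && ok1 && ok2);
       return 0;
   }
   ```

   The fix in real code follows the same shape: split the offending `q.submit` into one `submit` per action.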
3. Run a `backtrace` to get summary showing the rough location that triggered the assert.
```
(gdb) backtrace
```
-4. Notice in the results (which should look something like the following) that the exception (frame 8) was triggered around line 98 (frame 19):
+4. Notice in the results (which should look something like the following) that the exception (frame 8) was triggered around line 98 (frame 17):
```
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
@@ -242,37 +247,34 @@ In the second version, the code attempts to execute more than one offload statem
#6 0x00007ffff78bb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007ffff78a5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007ffff78bb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
- #9 0x00007ffff7f076a0 in sycl::_V1::handler::memcpy(void*, void const*, unsigned long) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #10 0x0000000000404ba2 in main::{lambda(auto:1&)#1}::operator()(sycl::_V1::handler&) const (
- this=0x7fffffffb2d8, h=sycl::handler& = {...})
- at /nfs/site/home/cwcongdo/oneAPI-samples-true/Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/2_matrix_mul_multi_offload.cpp:100
- #11 0x0000000000404b3d in std::__invoke_impl(std::__invoke_other, main::{lambda(auto:1&)#1}&, sycl::_V1::handler&) (__f=..., __args=sycl::handler& = {...})
- at /usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/invoke.h:61
- #12 0x0000000000404add in std::__invoke_r(main::{lambda(auto:1&)#1}&, sycl::_V1::handler&) (__fn=..., __args=sycl::handler& = {...})
- at /usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/invoke.h:111
- #13 0x00000000004049f5 in std::_Function_handler::_M_invoke(std::_Any_data const&, sycl::_V1::handler&) (__functor=..., __args=sycl::handler& = {...})
- at /usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_function.h:290
- #14 0x00007ffff7e83121 in sycl::_V1::detail::queue_impl::submit_impl(std::function const&, std::shared_ptr const&, std::shared_ptr const&, std::shared_ptr const&, bool, sycl::_V1::detail::code_location const&, bool, sycl::_V1::detail::SubmissionInfo const&) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #15 0x00007ffff7e895c8 in sycl::_V1::detail::queue_impl::submit_with_event(std::function const&, std::shared_ptr const&, sycl::_V1::detail::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #16 0x00007ffff7f33afa in sycl::_V1::queue::submit_with_event_impl(std::function, sycl::_V1::detail::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #17 0x00000000004048b3 in sycl::_V1::queue::submit_with_event(main::{lambda(auto:1&)#1}, sycl::_V1::queue*, sycl::_V1::detail::code_location const&) (this=0x7fffffffb860, CGF=..., SecondaryQueuePtr=0x0,
- CodeLoc=...) at /opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/queue.hpp:2826
- #18 0x00000000004042cd in sycl::_V1::queue::submit(main::{lambda(auto:1&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffb860, CGF=..., CodeLoc=...)
- at /opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/queue.hpp:365
- #19 0x0000000000403edc in main ()
- at 2_matrix_mul_multi_offload.cpp:98
+ #9 0x00007ffff7f11ed0 in sycl::_V1::handler::memcpy(void*, void const*, unsigned long) ()
+ from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ #10 0x0000000000404812 in main::{lambda(auto:1&)#1}::operator()(sycl::_V1::handler&) const (
+ this=0x7fffffffb098, h=sycl::handler& = {...})
+ at Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/2_matrix_mul_multi_offload.cpp:100
+ #11 0x00000000004047ad in sycl::_V1::detail::type_erased_cgfo_ty::invoker::call(void const*, sycl::_V1::handler&) (object=0x7fffffffb098, cgh=sycl::handler& = {...})
+ at /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/handler.hpp:190
+ #12 0x00007ffff7e8a4a4 in sycl::_V1::detail::queue_impl::submit_impl(sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::queue_impl*, bool, sycl::_V1::detail::code_location const&, bool, sycl::_V1::detail::v1::SubmissionInfo const&) () from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ #13 0x00007ffff7e90022 in sycl::_V1::detail::queue_impl::submit_with_event(sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::v1::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) ()
+ from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ #14 0x00007ffff7f58844 in sycl::_V1::queue::submit_with_event_impl(sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::v1::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) const ()
+ from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ #15 0x0000000000407392 in sycl::_V1::queue::submit_with_event > >(sycl::_V1::ext::oneapi::experimental::properties >, sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::code_location const&) const (this=0x7fffffffb570, Props=..., CGF=..., CodeLoc=...)
+ at /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:3762
+ #16 0x00000000004042a1 in sycl::_V1::queue::submit(main::{lambda(auto:1&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffb570, CGF=..., CodeLoc=...)
+ at /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:429
+ #17 0x0000000000403eac in main ()
+ at Tools/ApplicationDebugger/guided_matrix_mult_Exceptions/src/2_matrix_mul_multi_offload.cpp:98
```
5. Examine the last frame (it may be different from the output above) using the following command:
```
- (gdb) frame 19
+ (gdb) frame 17
```
You may need to issue this command twice before you see output similar to the following example:
```
- #19 0x0000000000403e7c in main ()
+ #17 0x0000000000403e7c in main ()
at 2_matrix_mul_multi_offload.cpp:98
98 q.submit([&](auto &h) {
```
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt b/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt
index 1d4ec9e7d4..cbdfc82bda 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required (VERSION 3.4)
+cmake_minimum_required (VERSION 3.5)
set (CMAKE_CXX_COMPILER "icpx")
project (matrix_mul LANGUAGES CXX)
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md b/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md
index 7977f5e293..7bca9f4b79 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/README.md
@@ -19,7 +19,7 @@ verifies the results.
The sample in this tutorial shows how to debug incorrect use of variables that
are owned by different queues that have different contexts.
-This type of error can be hard to detect and determine the root cause in a
+This type of error can be hard to detect and root cause in a
large body of code where queues and memory are passed between functions. The
lack of tools that tell you what is wrong combined with the fact that the
default Level Zero driver does not notice there is a problem (only the OpenCL™
@@ -37,12 +37,12 @@ program.
## Prerequisites
-| Optimized for | Description
-|:--- |:---
+| Optimized for | Description
+|:--- |:---
| OS | Ubuntu* 24.04 LTS
| Hardware | GEN9 or newer
-| Software | Intel® oneAPI DPC++/C++ Compiler 2025.1
Intel® Distribution for GDB* 2025.1
Unified Tracing and Profiling Tool 2.1.2, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
-| Intel GPU Driver | Intel® General-Purpose GPU Rolling Release driver 2507.12 or later from https://dgpu-docs.intel.com/releases/releases.html
+| Software | Intel® oneAPI DPC++/C++ Compiler 2025.3
Intel® Distribution for GDB* 2025.3
Unified Tracing and Profiling Tool 2.3.0, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
+| Intel GPU Driver | Intel® General-Purpose GPU Long-Term Support driver 2523.31 or later from https://dgpu-docs.intel.com/releases/releases.html
## Key Implementation Details
@@ -163,20 +163,16 @@ Documentation on using the debugger in a variety of situations can be found at *
### Getting the Tracing and Profiling Tool
-At a step in this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
+In this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
-To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install ). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
+To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
### Check the Programs
1. Notice that both versions of the application run to completion and report
correct results.
- SYCL applications use the Level Zero runtime by default with an Intel GPU.
- If you use OpenCL™ software to run `1_matrix_mul_invalid_contexts`, the
- program with a bug in it will crash before it can report results.
-
-2. Check the results on a **GPU** with OpenCL.
+2. SYCL applications use the Level Zero runtime by default with an Intel GPU. What happens if you use OpenCL™ software to run `1_matrix_mul_invalid_contexts` on the GPU?
```
ONEAPI_DEVICE_SELECTOR=opencl:gpu ./1_matrix_mul_invalid_contexts
@@ -193,39 +189,126 @@ To complete the steps in the following section, you must download the [Unified T
Device max work group size: 1024
Problem size: c(150,600) = a(150,300) * b(300,600)
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): Enqueue process failed.
+ what(): Enqueue process failed.
+ opencl backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)
```
> **Note:** this will only work if the `sycl-ls` command shows OpenCL
- > devices for the graphics card, such as like this:
+ devices for the graphics card, like this:
```
- $ sycl-ls
- [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
- [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
- [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
- [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
- [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ $ sycl-ls
+ [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
+ [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
+ [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
+ [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
```
- If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
+ > If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
```
- export DRIVERLOC=/usr/lib/x86_64-linux-gnu
- export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
- export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
+ export DRIVERLOC=/usr/lib/x86_64-linux-gnu
+ export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
+ export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
```
+
3. Check the results on the **CPU** using OpenCL.
```
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./1_matrix_mul_invalid_contexts
```
- Interestingly, this runs just fine. In the next section we will try to explain the inconsistency.
+ Interestingly, this runs just fine. In the next sections we will try to explain the inconsistency.
+
+### Guided Instructions for Zero Buffer using Address Sanitizer
+A recent addition to the oneAPI compiler is the ability to use the Address Sanitizer (which you may have seen when using [GCC](https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html) or [Clang](https://clang.llvm.org/docs/AddressSanitizer.html)) to catch invalid pointer addresses at runtime on the GPU rather than the host. This requires a special build of the application.
+
+1. Compile a version of the program with the device-side Address Sanitizer (assuming that you are in the `build` directory):
+ ```
+ icpx -fsycl -O0 -g -Xarch_device -fsanitize=address -std=gnu++17 -Rno-debug-disables-optimization -o 1_matrix_mul_invalid_contexts_asan ../src/1_matrix_mul_invalid_contexts.cpp
+ ```
+ > Note: If you leave the `-Xarch_device` off, this command will look for illegal addresses on the host rather than the device.
+
+2. Now run the program on the GPU:
+ ```
+ ./1_matrix_mul_invalid_contexts_asan
+ Initializing
+ ==== DeviceSanitizer: ASAN
+ Computing
+ Device: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
+ Device compute units: 144
+ Device max work item size: 8192, 8192, 8192
+ Device max work group size: 8192
+ Problem size: c(150,600) = a(150,300) * b(300,600)
+
+ ====ERROR: DeviceSanitizer: invalid-argument on kernel (auto&) const::'lambda'(auto)>
+ #0 in sycl::_V1::event sycl::_V1::queue::submit_with_event>>(sycl::_V1::ext::oneapi::experimental::properties>, sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::code_location const&) const /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:3762:12
+ #1 in std::enable_if, sycl::_V1::event>::type sycl::_V1::queue::submit(auto, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:429:12
+ #2 in main Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/build/../src/1_matrix_mul_invalid_contexts.cpp:106:7
+ #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x7fddd2a5bd8f)
+ #4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x7fddd2a5be3f)
+ #5 in _start (./1_matrix_mul_invalid_contexts_asan+0x403624)
+
+ The 5th argument 0x7fddcf88a800 is allocated in other context
+ 0x7fddcf88a800 is located inside of Device USM region [0x7fddcf88a800, 0x7fddcf8e2640)
+ allocated here:
+ #0 in float* sycl::_V1::malloc_device(unsigned long, sycl::_V1::device const&, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/usm.hpp:174:27
+ #1 in float* sycl::_V1::malloc_device(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/usm.hpp:182:10
+ #2 in main Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/build/../src/1_matrix_mul_invalid_contexts.cpp:82:21
+ #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x7fddd2a5bd8f)
+ #4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x7fddd2a5be3f)
+ #5 in _start (./1_matrix_mul_invalid_contexts_asan+0x403624)
+
+ Aborted (core dumped)
+ ```
+
+ Remember that SYCL applications use the Level Zero runtime by default with an Intel GPU, and that when we ran this version of the program without Address Sanitizer everything looked fine. So, is this a problem with Address Sanitizer or the program?
+
+3. Look at the reported source location
+
+ The Address Sanitizer threw an error at line 106, where we enter the main work kernel.
+
+ ```
+ 99 q.submit([&](auto &h) {
+ 100 h.memcpy(dev_c, &c_back[0], M*P * sizeof(float));
+ 101 });
+ 102
+ 103 q.wait();
+ 104
+ 105 // Submit command group to queue to multiply matrices: c = a * b
+ 106 q.submit([&](auto &h) { // Something reported wrong here
+ 107 // Read from a and b, write to c
+ 108 int width_a = N;
+ 109
+ 110 // Execute kernel.
+ ```
+
+ It also complained that "the 5th argument" was allocated in another context; that argument was set up at line 82:
+
+ ```
+ 77 float * dev_a = sycl::malloc_device(M*N, q);
+ 78 float * dev_b = sycl::malloc_device(N*P, q);
+ 79 device selected_device = device(default_selector_v);
+ 80 context devicecontext(selected_device);
+ 81 queue q2(devicecontext, selected_device);
+ 82 float * dev_c = sycl::malloc_device(M*P, q2); // Complaining about this
+ 83
+ 84 cout << "Problem size: c(" << M << "," << P << ") = a(" << M << "," << N
+ 85 << ") * b(" << N << "," << P << ")\n";
+ ```
+
+ Unfortunately, figuring out which "5th argument" of which kernel is involved requires internal knowledge of the Level Zero API calls and of how SYCL collects the references used in the `q.submit` lambda at line 106. We can get some of this by running `unitrace` against `./1_matrix_mul_invalid_contexts_asan`, but the result is not satisfying.
+
+ Also, it may not be useful to figure out which argument was involved, because this may not be where things first go bad. The SYCL syntax used in this program means that `q.submit` statements will immediately return control to the calling program even if the submitted kernel is still running. So the problem might have occurred before line 106, and we are only just now learning about it.
+
+ However, we don't need to go that deep into the internals. We are told that the invalid element encountered at line 106 was allocated to a different context in line 82. Here we see that `dev_c` is the variable, and it was allocated a size of `M*P` using queue `q2`.
+
+ You've probably spotted the problem in this trivial example, but let's use the debugger to see if we can gather additional information to do a proper diagnosis.
### Use the Debugger to Find the Issue
@@ -276,7 +359,7 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
4. Prompt for a call stack to inspect the results.
```
- (gdb) where
+ (gdb) backtrace
```
The output can be extensive and might look similar to the following:
@@ -324,7 +407,7 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
```
#18 0x0000000000403f7b in main ()
- at /nfs/site/home/cwcongdo/oneAPI-samples-true/Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/1_matrix_mul_invalid_contexts.cpp:99
+ at Tools/ApplicationDebugger/guided_matrix_mult_InvalidContexts/src/1_matrix_mul_invalid_contexts.cpp:99
99 q.submit([&](auto &h) {
(gdb)
```
@@ -350,38 +433,36 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
103 q.wait();
```
- As you can see, there is something wrong in line 99. Unfortunately, the
+   Something is wrong in line 99. Unfortunately, the
` Enqueue process failed` message we saw when it crashed does not really tell us anything other than our attempt to submit the `memcpy` to the device failed
- Fortunately, in this case the two variables, `dev_c` and `c_back`, are
- allocated only a few lines above line 99. In real code this might have
- happened in another source file or library, so hunting down this issue is
- going to be much harder.
+ Fortunately, in this case the two variables, `dev_c` and `c_back`, are allocated only a few lines above line 99. In real code this might have happened in another source file or library, so hunting down this issue is going to be much harder.
- Look at the source, and note that `dev_c` is defined as a pointer to device memory allocated on queue `q2`:
+ Look at the source, and note that `dev_c` is defined as a pointer to device memory allocated on queue `q2` (we noticed this with Address Sanitizer):
```
- float * dev_c = sycl::malloc_device(M*P, q2);
+ 82 float * dev_c = sycl::malloc_device(M*P, q2);
```
and `c_back` is defined as local memory
```
- float(*c_back)[P] = new float[M][P];
+ 49 float(*c_back)[P] = new float[M][P];
```
8. Look at line 99, and notice the discrepancy.
```
- q.submit([&](auto &h) {
+ 99 q.submit([&](auto &h) {
```
- Variable `dev_c` was allocated on queue `q2` while the submit statement is
- being done on queue `q`.
+   Variable `dev_c` was allocated on queue `q2` while the submit statement is being done on queue `q`. These queues were created using different device contexts (the default one and `devicecontext`).
+
+   So unlike what Address Sanitizer suggested, the problem was first noticed at line 99, not 106. The debugger stopped immediately, while Address Sanitizer took a moment longer to spot the problem and report the exception.
### Identify the Problem without Code Inspection
-You must have already built the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool). Once you have built the utility, you can start it before your program (similar to using GBD).
+You need to build the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool) before completing this section. Once you have built the utility, you can start it before your program (similar to using GDB).
One of the things that the Unified Tracing and Profiling utility can help us see
is every low-level API call made to OpenCL™ or Level Zero. We will use it to attempt to match the source to the events.
@@ -407,7 +488,7 @@ is every low-level API call made to OpenCL™ or Level Zero. We will use it to a
`src_ptr` into device memory `dst_ptr = 0xff00ffffffeb0000` (NOTE: in some versions of `unitrace` these addresses may be returned in decimal rather than hexadecimal). Working back
up the trace, you can see we allocated the destination device memory with the address
`0xff00ffffffeb0000` using context `0x49dbff0` (line 16). However,
- the command queue (`0x49d7130`) being used in the `clEnqueueMemcpyINTEL`
+ the command queue (`0x49d7130`) being used in the `clEnqueueMemcpyINTEL` call
was created using the context `0x488d190` (line 4), which is
different from the context used to allocate the destination device memory
(`0x49dbff0` - line 16 again). The generic error we get is the OpenCL
@@ -422,7 +503,7 @@ is every low-level API call made to OpenCL™ or Level Zero. We will use it to a
For comparison, an example of legal memory copy where the device context
(`0x488d190`) used for the command queue (`0x49d7130`) is the same as that
- uses for the memory allocation is shown as well (lines 4, 7, 19).
+ used for the memory allocation is shown as well (lines 4, 7, 19).
2. Let's also look at the output from Level Zero, and see if we could have
detected the issue there:
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt b/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt
index 27cbc5ac7f..3410793a23 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required (VERSION 3.4)
+cmake_minimum_required (VERSION 3.5)
set (CMAKE_CXX_COMPILER "icpx")
project (matrix_mul LANGUAGES CXX)
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md b/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md
index e9b49d530f..2bce82e253 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_RaceCondition/README.md
@@ -31,8 +31,8 @@ The sample includes different versions of a simple matrix multiplication program
|:--- |:---
| OS | Ubuntu* 24.04 LTS
| Hardware | GEN9 or newer
-| Software | Intel® oneAPI DPC++/C++ Compiler 2025.1
Intel® Distribution for GDB* 2025.1
Unified Tracing and Profiling Tool 2.1.2, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
-| Intel GPU Driver | Intel® General-Purpose GPU Rolling Release driver 2507.12 or later from https://dgpu-docs.intel.com/releases/releases.html
+| Software | Intel® oneAPI DPC++/C++ Compiler 2025.3
Intel® Distribution for GDB* 2025.3
Unified Tracing and Profiling Tool 2.3.0, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
+| Intel GPU Driver | Intel® General-Purpose GPU Long-Term Support driver 2523.31 or later from https://dgpu-docs.intel.com/releases/releases.html
## Key Implementation Details
@@ -133,9 +133,9 @@ Documentation on using the debugger in a variety of situations can be found at *
### Getting the Tracing and Profiling Tool
-At a step in this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
+In this tutorial, the instructions require a utility that is not installed with the Intel® oneAPI Base Toolkit (Base Kit).
-To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install ). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
+To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
### Examine the Original Code
@@ -171,7 +171,7 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
intelgt: inferior 3 (gdbserver-ze) has been removed.
(gdb)
```
- As we saw outside the debugger, it ran to completion. But note that the inferior (the code running on the GPU), exited with an error code (0377). Let's see if we can trap that error.
+   As we saw outside the debugger, it ran to completion. But note that the inferior (the code running on the GPU) exited with an error code (`0377`). Let's see if we can trap that error.
4. Run again, telling the debugger to stop if the application throws an exception
```
@@ -208,11 +208,7 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
#3 0x00007ffff7e2b73c in sycl::_V1::detail::MemoryManager::copy(sycl::_V1::detail::SYCLMemObjI*, void*, std::shared_ptr, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, void*, std::shared_ptr, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, std::vector >, ur_event_handle_t_*&, std::shared_ptr const&) ()
from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #4 0x00007ffff7eb31e9 in ur_result_t sycl::_V1::detail::callMemOpHelper, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, void*, std::shared_ptr, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, std::vector >, ur_event_handle_t_*&, std::shared_ptr const&), sycl::_V1::detail::SYCLMemObjI*, void*, std::shared_ptr&, unsigned int&, sycl::_V1::range<3>&, sycl::_V1::range<3>&, sycl::_V1::id<3>&, unsigned int&, void*&, std::shared_ptr&, unsigned int&, sycl::_V1::range<3>&, sycl::_V1::range<3>&, sycl::_V1::id<3>&, unsigned int&, std::vector >, ur_event_handle_t_*&, std::shared_ptr&>(void (&)(sycl::_V1::detail::SYCLMemObjI*, void*, std::shared_ptr, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, void*, std::shared_ptr, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, std::vector >, ur_event_handle_t_*&, std::shared_ptr const&), sycl::_V1::detail::SYCLMemObjI*&&, void*&&, std::shared_ptr&, unsigned int&, sycl::_V1::range<3>&, sycl::_V1::range<3>&, sycl::_V1::id<3>&, unsigned int&, void*&, std::shared_ptr&, unsigned int&, sycl::_V1::range<3>&, sycl::_V1::range<3>&, sycl::_V1::id<3>&, unsigned int&, std::vector >&&, ur_event_handle_t_*&, std::shared_ptr&) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #5 0x00007ffff7eb2c4e in sycl::_V1::detail::MemCpyCommandHost::enqueueImp() () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #6 0x00007ffff7ea90fb in sycl::_V1::detail::Command::enqueue(sycl::_V1::detail::EnqueueResultT&, sycl::_V1::detail::BlockingT, std::vector >&) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #7 0x00007ffff7ecec2e in sycl::_V1::detail::Scheduler::GraphProcessor::enqueueCommand(sycl::_V1::detail::Command*, std::shared_lock&, sycl::_V1::detail::EnqueueResultT&, std::vector >&, sycl::_V1::detail::Command*, sycl::_V1::detail::BlockingT) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
+ :
#8 0x00007ffff7eca5ba in sycl::_V1::detail::Scheduler::addCopyBack(sycl::_V1::detail::AccessorImplHost*) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
#9 0x00007ffff7edd2e6 in sycl::_V1::detail::SYCLMemObjT::updateHostMemory(void*) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
#10 0x00007ffff7eebdd3 in std::_Function_handler const&), sycl::_V1::detail::SYCLMemObjT::handleHostData(void*, unsigned long)::{lambda(std::function const&)#1}>::_M_invoke(std::_Any_data const&, std::function const&) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
@@ -261,6 +257,8 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
133 }
```
+ So we crashed in the last line of the program while we were cleaning up. But if you look closely at the stack above, you will see that we were also in the middle of a copy from the device to the host (frame 2).
+
8. Exit the debugger.
9. Run the program using the [Unified Tracing and Profiling Tool](#getting-the-tracing-and-profiling-tool) tool.
@@ -289,56 +287,59 @@ In case we need view code running on the GPU, we need to enable GPU debugging.
### Interpret the Results
-The first clue here is that the program throws an exception after it has completed checking the results and finding them bad. That behavior is worrying.
+The first clue here is that the program throws an exception *after* it has completed checking the results and finding them bad. That behavior is worrying.
Next, looking at the crash in the debugger, there are a couple of odd things that stand out. Look at stack `frame 9`. This frame shows us attempting to update the host memory from the device, while `frame 20` shows we are already at the end of the program and have started cleaning up the SYCL buffers (`frame 19`). The only variable containing data returned from the device is `c_back`. But the developer has already deleted `c_back` in line 126, so the *data the buffer is being copied into (`c_back`) no longer exists*.
We see something like this in the `unitrace` output above. The kernel is executed, the results are immediately checked, we create and wait on some events, and then the last thing we try to do before crashing is to copy some memory from the device memory (`srcptr = 18374967954634571776`) to a host pointer (`dstptr = 35936816`) that previously was used to initialize this same device memory (around line 101). Since `c_buf` is the only accessor that is defined as writeable in the `q.submit` at line 97, it again is a likely suspect.
-But what if the developer didn't delete `c_back`, and let program termination clean it up? Try it! Unfortunately, in that case your program complains about bad results, but it exits cleanly (shutdown will wait for the GPU to copy memory back to the host buffer before it kills the buffer).
+But what if the developer didn't delete `c_back`, and let program termination clean it up? Try it!
+
+Unfortunately, if you do this your program complains about bad results, but it exits cleanly (shutdown will wait for the GPU to copy memory back to the host buffer before it kills the buffer).
Is the behavior different if you run it on OpenCL or Level Zero? The default is to use the Level Zero runtime, but we can explicitly force the use of either Level Zero or OpenCL, which can be helpful when troubleshooting.
```
ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./1_matrix_mul_race_condition
+
ONEAPI_DEVICE_SELECTOR=opencl:gpu ./1_matrix_mul_race_condition
```
Unfortunately not; pretty much the same thing happens - they both produce incorrect results on exiting.
-> **Note:** the command with OpenCL will only work if the `sycl-ls` command shows OpenCL
-> devices for the graphics card, such as like this:
+> **Note:** the command with OpenCL will only work if the `sycl-ls` command
+  shows OpenCL devices for the graphics card, like this:
```
- $ sycl-ls
- [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
- [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
- [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
- [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
- [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
- [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ $ sycl-ls
+ [opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
+ [opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO [24.22.29735.27]
+ [opencl:cpu][opencl:3] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz OpenCL 3.0 (Build 0) [2023.16.7.0.21_160000]
+ [opencl:fpga][opencl:4] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.7.0.21_160000]
+ [level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
+ [level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.29735]
```
- If you are missing `[opencl:gpu]` devices you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
+   > If you are missing `[opencl:gpu]` devices, you may have to add the necessary libraries to your device path by setting the appropriate path in `DRIVERLOC` and then running the following four commands (for Ubuntu - adapt for other OSes):
```
- export DRIVERLOC=/usr/lib/x86_64-linux-gnu
- export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
- export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
+ export DRIVERLOC=/usr/lib/x86_64-linux-gnu
+ export OCL_ICD_FILENAMES=$OCL_ICD_FILENAMES:$DRIVERLOC/intel-opencl/libigdrcl.so
+ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DRIVERLOC
+ export PATH=$PATH:/opt/intel/oneapi:$DRIVERLOC
```
-Similarly, we specify targeting the CPU, which sometimes can avoid problems in your code that are specific to offloading to the GPU.
+Similarly, we can force the program to run on the CPU, which can sometimes avoid problems in your code that are specific to offloading to the GPU.
```
ONEAPI_DEVICE_SELECTOR=*:cpu ./1_matrix_mul_race_condition
```
This also has problems, but if you run this in the debugger you will see lots of threads all running in the third `q.submit` kernel, but no thread running `main`. This is because these threads have been abandoned when `main` deleted `c_back` and exited!
-So in conclusion, it looks like the third kernel is still executing and/or its results are still being copied back to the host as the program is terminating. Which explains the incorrect results (they aren't available on the host yet) and the crash (results from the card are being copied to memory that has been deallocated). All these point to some sort of synchronization issue or race condition between the host and device.
+So in conclusion, it looks like the third kernel is still executing and/or its results are still being copied back to the host as the program is terminating, which explains the incorrect results (they aren't available on the host yet) and the crash (the card is trying to copy the results into host memory that has been deallocated). All of this points to some sort of synchronization issue or race condition between the host and device.
### Understand the Problem
-Because we are using SYCL buffers, even though the `q.submit` statements that populate `a_buf` and `b_buf` execute asynchronously, the third `q.submit` statement does not execute until those first two submits are complete because the SYCL runtime realizes that the third `q.submit` depends on the `a_buf` and `b_buf` buffers, which are being used in the first two kernels. Once the first two kernels complete, the third `q.submit` kernel starts executing because both its inputs are ready. The SYCL runtime then immediately returns control to the host and we proceed to the code which verifies the result - ***while the third `q.submit` keeps running***.
+Because we are using SYCL buffers, even though the `q.submit` statements that populate `a_buf` and `b_buf` execute asynchronously, the third `q.submit` statement does not execute until those first two submits are complete because the SYCL runtime realizes that the third `q.submit` depends on the `a_buf` and `b_buf` buffers, which are being used in the first two kernels. Once the first two kernels complete, the third `q.submit` kernel starts executing because both its inputs are ready. The SYCL runtime then immediately returns control to the host program and we proceed to the code which verifies the result - ***while the third `q.submit` keeps running***.
There are three errors in this code:
@@ -346,7 +347,7 @@ There are three errors in this code:
2. We should be using a host accessor pointing to SYCL buffer `c_buf` to access its contents, which would also indicate that we need to wait for the third `q.submit kernel` to complete **and** for the *data to be copied back to the host* before accessing the data in `c_back`.
-3. For buffers initialized with a pointer to host memory (like `c_buf`), the developer "makes a contract with the SYCL runtime" to not reference the host pointer again until the SYCL buffer is destroyed. Thus, deleting the host memory before the SYCL buffer is destroyed is illegal (the call to `delete[] c_back;`). The buffer cannot detect that the memory was deallocated.
+3. For buffers initialized with a pointer to host memory (like `c_buf`), the developer "makes a contract with the SYCL runtime" to not reference the host pointer again until the SYCL buffer is destroyed. Thus, deleting the host memory before the SYCL buffer is destroyed is illegal (the call to `delete[] c_back;` is illegal because `c_buf` has not yet been deleted). The buffer cannot detect that the memory was deallocated.
### Fix the Code
@@ -396,7 +397,7 @@ int i, j, k;
:
```
The result should look like `3_matrix_mul.cpp`. Reiterating, with these changes :
-1. We created a host accessor to pull the values of out `c_buf` on the host, forcing the data to be transferred from the device to the host before the first access (one of the race conditions in this code).
+1. We created a host accessor to pull the values out of `c_buf` on the host, forcing the data to be transferred from the device to the host before the first access on the host (one of the race conditions in this code).
2. We waited for the third `q.submit` kernel to complete before asking for the values in `c_buf`, fixing the other race condition.
3. We are no longer deleting `c_back` before the SYCL buffer that makes use of it (`c_buf`) is destroyed on program exit.
4. We changed `VerifyResult` to pass down the host accessor, with which we are able to read the contents of the accessor the same way we would access the original `c_back` array (which we "made a contract" not to look at while a SYCL buffer was making use of it).
@@ -449,7 +450,7 @@ Note that `2_matrix_mul.cpp` still has a bug. It is an example of problem (2) a
This points out a potential trap in the training documentation you may have read while learning SYCL. You can easily get the impression that if you use the SYCL buffer-accessor mechanism, synchronization will be taken care of for you. The use of the `{ }` block may be mentioned in passing with little explanation. Even though the documentation may say "the { } block ensures all SYCL work has concluded," this is not stressed.
-This is the trap of the SYCL buffer-accessor mechanism - you may assume that the automatic synchronization mechanism is smarter than it really is. In `1_matrix_mul_race_condition.cpp`, the SYCL runtime does not realize that we cannot call `VerifyResult` with the `c_back` array until the third `q.submit` kernel completes and the data are copied back to the host - it assumes you know what you are doing.
+This is the trap of the SYCL buffer-accessor mechanism - you may assume that the automatic synchronization mechanism is smarter than it really is. In `1_matrix_mul_race_condition.cpp`, the SYCL runtime does not realize that we cannot access the `c_back` array in `VerifyResult` until the third `q.submit` kernel completes and the data are copied back to the host - it assumes you know what you are doing.
>**Note**: You will find more on the proper use of buffers and accessors in the *Buffer Accessor Mode* section of the *[oneAPI GPU Optimization Guide Developer Guide](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/current/buffer-accessor-modes.html)*.
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt b/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt
index aa1d9e0811..bc3ed05a3b 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/CMakeLists.txt
@@ -1,4 +1,4 @@
-cmake_minimum_required (VERSION 3.4)
+cmake_minimum_required (VERSION 3.5)
set (CMAKE_CXX_COMPILER "icpx")
project (matrix_mul LANGUAGES CXX)
diff --git a/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md b/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md
index a146de4c8d..79e5dc4c03 100644
--- a/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md
+++ b/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/README.md
@@ -14,7 +14,7 @@ The sample is a simple program that multiplies together two large matrices and v
## Purpose
-The sample in this tutorial shows how to debug crashes that occur when the user tries to reserve more memory for a work-group than there is space in work-group local memory (also called [Shared Local Memory (SLM)](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/current/shared-local-memory.html).
+The sample in this tutorial shows how to debug crashes that occur when the user tries to reserve more memory for a work-group than there is space in work-group local memory (also called [Shared Local Memory (SLM)](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/current/shared-local-memory.html)).
Using this type of memory when working with GPUs is an important optimization, but you must be careful due to its limited size. Shared local memory is often also shared with/traded for Vector Engine memory registers.
@@ -27,12 +27,12 @@ The sample includes different versions of a simple matrix multiplication program
## Prerequisites
-| Optimized for | Description
-|:--- |:---
+| Optimized for | Description
+|:--- |:---
| OS | Ubuntu* 24.04 LTS
| Hardware | GEN9 or newer
-| Software | Intel® oneAPI DPC++/C++ Compiler 2025.1
Intel® Distribution for GDB* 2025.1
Unified Tracing and Profiling Tool 2.1.2, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
-| Intel GPU Driver | Intel® General-Purpose GPU Rolling Release driver 2507.12 or later from https://dgpu-docs.intel.com/releases/releases.html
+| Software | Intel® oneAPI DPC++/C++ Compiler 2025.3
Intel® Distribution for GDB* 2025.3
Unified Tracing and Profiling Tool 2.3.0, which is available from the [following Github repository](https://github.com/intel/pti-gpu/tree/master/tools/unitrace).
+| Intel GPU Driver | Intel® General-Purpose GPU Long-Term Support driver 2523.31 or later from https://dgpu-docs.intel.com/releases/releases.html
## Key Implementation Details
@@ -43,7 +43,7 @@ The basic SYCL* standards implemented in the code include the use of the followi
- SYCL* kernels (including parallel_for function and explicit memory copies)
- SYCL* queues
-The type of error shown in this sample can be hard to detect and root cause in a large body of code where large amounts of data are passed due to the lack of tools that tell you what is actually going wrong, and because the resulting error message ("`PI_ERROR_OUT_OF_RESOURCES`") isn't informative.
+The type of error shown in this sample can be hard to detect and root cause in a large body of code where large amounts of data are passed due to the lack of tools that tell you what is actually going wrong, and because the resulting error message ("`UR_RESULT_ERROR_OUT_OF_RESOURCES`") isn't informative.
This can be particularly painful. For example, you might experience this error in the cases where code that runs on a device with a large amount of shared local memory fails on a device with less shared local memory, or where one data set out of many results in an allocation that exceeds SLM limits on a given machine.
@@ -134,15 +134,15 @@ Documentation on using the debugger in a variety of situations can be found at *
### Getting the Tracing and Profiling Tool
-At a step in this tutorial, the instructions require a utility that was not installed with the Intel® oneAPI Base Toolkit (Base Kit).
+In this tutorial, the instructions require a utility that is not installed with the Intel® oneAPI Base Toolkit (Base Kit).
-To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install ). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
+To complete the steps in the following section, you must download the [Unified Tracing and Profiling Tool](https://github.com/intel/pti-gpu/tree/master/tools/unitrace) code from GitHub and build the utility. The build instructions are included in the README in the GitHub repository. This build will go much more smoothly if you first install the latest drivers from [the Intel GPU driver download site](https://dgpu-docs.intel.com/driver/overview.html), especially the development packages (only available in the Data Center GPU driver install). Once you have built the utility, you invoke it on the command line in front of your program (similar to using GDB).
### Check the Program
In `1_matrix_mul_SLM_size`, the local_accessor class is used to reserve an illegal amount of device-local memory. If you attempt to run the code, the application will crash.
-#### Observe the Failure
+### Observe the Failure
1. Run the program outside the debugger.
```
@@ -160,11 +160,11 @@ In `1_matrix_mul_SLM_size`, the local_accessor class is used to reserve an illeg
Device max work group size: 1024
Problem size: c(150,600) = a(150,300) * b(300,600)
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
+ what(): level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)
```
-#### Locate the General Location of the Problem
+### Locate the General Location of the Problem
1. Start the debugger to learn more about the error.
```
@@ -175,14 +175,14 @@ In `1_matrix_mul_SLM_size`, the local_accessor class is used to reserve an illeg
```
(gdb) run
```
- When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`
+ When you get the error message `Debugging of GPU offloaded code is not enabled`, ignore it and answer `n` to the question `Quit anyway? (y or n)`. You may need to do this more than once. If the failure occurs later in the program, you may also need to re-run with `ZET_ENABLE_PROGRAM_DEBUGGING=1` set in the environment.
The application will fail and display the same message that we saw when we ran it outside of the debugger.
```
:
Problem size: c(150,600) = a(150,300) * b(300,600)
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
+ what(): level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
Thread 1.1 "1_matrix_mul_SL" received signal SIGABRT, Aborted.
(gdb)
@@ -205,29 +205,23 @@ In `1_matrix_mul_SLM_size`, the local_accessor class is used to reserve an illeg
#6 0x00007ffff78bb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007ffff78a5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007ffff78bb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
- #9 0x00007ffff7dcc4c9 in sycl::_V1::detail::enqueue_kernel_launch::handleOutOfResources(sycl::_V1::detail::device_impl const&, ur_kernel_handle_t_*, sycl::_V1::detail::NDRDescT const&) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #10 0x00007ffff7dd6214 in sycl::_V1::detail::enqueue_kernel_launch::handleErrorOrWarning(ur_result_t, sycl::_V1::detail::device_impl const&, ur_kernel_handle_t_*, sycl::_V1::detail::NDRDescT const&) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #11 0x00007ffff7eb9b71 in sycl::_V1::detail::enqueueImpKernel(std::shared_ptr const&, sycl::_V1::detail::NDRDescT&, std::vector >&, std::shared_ptr const&, std::shared_ptr const&, std::__cxx11::basic_string, std::allocator > const&, std::vector >&, std::shared_ptr const&, std::function const&, ur_kernel_cache_config_t, bool, bool, unsigned long, sycl::_V1::detail::RTDeviceBinaryImage const*) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #12 0x00007ffff7f02e52 in sycl::_V1::handler::finalize()::$_0::operator()() const () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #13 0x00007ffff7effe3a in sycl::_V1::handler::finalize() () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #14 0x00007ffff7e84277 in void sycl::_V1::detail::queue_impl::finalizeHandler(sycl::_V1::handler&, sycl::_V1::event&) ()
- from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #15 0x00007ffff7e832b7 in sycl::_V1::detail::queue_impl::submit_impl(std::function const&, std::shared_ptr const&, std::shared_ptr const&, std::shared_ptr const&, bool, sycl::_V1::detail::code_location const&, bool, sycl::_V1::detail::SubmissionInfo const&)
- () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #16 0x00007ffff7e895c8 in sycl::_V1::detail::queue_impl::submit_with_event(std::function const&, std::shared_ptr const&, sycl::_V1::detail::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #17 0x00007ffff7f33afa in sycl::_V1::queue::submit_with_event_impl(std::function, sycl::_V1::detail::SubmissionInfo const&, sycl::_V1::detail::code_location const&, bool) () from /opt/intel/oneapi/compiler/2025.1/lib/libsycl.so.8
- #18 0x0000000000404f54 in sycl::_V1::queue::submit_with_event(main::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::queue*, sycl::_V1::detail::code_location const&) (this=0x7fffffffb8f0, CGF=..., SecondaryQueuePtr=0x0, CodeLoc=...) at /opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/queue.hpp:2826
- #19 0x000000000040440c in sycl::_V1::queue::submit(main::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::detail::code_location const&) (
- this=0x7fffffffb8f0, CGF=..., CodeLoc=...) at /opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/queue.hpp:365
- #20 0x0000000000403fe3 in main () at /nfs/site/home/cwcongdo/oneAPI-samples-true/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp:104
+ #9 0x00007ffff7dd78a5 in sycl::_V1::detail::enqueue_kernel_launch::handleOutOfResources(sycl::_V1::detail::device_impl const&, ur_kernel_handle_t_*, sycl::_V1::detail::NDRDescT const&) ()
+ from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ #10 0x00007ffff7de164a in sycl::_V1::detail::enqueue_kernel_launch::handleErrorOrWarning(ur_result_t, sycl::_V1::detail::device_impl const&, ur_kernel_handle_t_*, sycl::_V1::detail::NDRDescT const&) ()
+ from /opt/intel/oneapi/compiler/2025.3/lib/libsycl.so.8
+ :
+ #19 0x0000000000407e62 in sycl::_V1::queue::submit_with_event > >(sycl::_V1::ext::oneapi::experimental::properties >, sycl::_V1::detail::type_erased_cgfo_ty const&, sycl::_V1::detail::code_location const&) const (this=0x7fffffffb580, Props=..., CGF=..., CodeLoc=...)
+ at /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:3762
+ #20 0x00000000004043e9 in sycl::_V1::queue::submit(main::{lambda(sycl::_V1::handler&)#1}, sycl::_V1::detail::code_location const&) (this=0x7fffffffb580, CGF=..., CodeLoc=...)
+ at /opt/intel/oneapi/compiler/2025.3/bin/compiler/../../include/sycl/queue.hpp:429
+ #21 0x0000000000403fa3 in main ()
+ at Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp:104
```
4. Look at the final frame. (Your frame number might differ, and you might have to repeat this command to get the frame to change).
```
- (gdb) frame 20
- #20 0x0000000000403fe3 in main () at /nfs/site/home/cwcongdo/oneAPI-samples-true/Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp:104
+ (gdb) frame 21
+ #21 0x0000000000403fa3 in main () at Tools/ApplicationDebugger/guided_matrix_mult_SLMSize/src/1_matrix_mul_SLM_size.cpp:104
104 q.submit([&](handler &h){
(gdb)
```
@@ -254,7 +248,7 @@ In `1_matrix_mul_SLM_size`, the local_accessor class is used to reserve an illeg
Now exit the debugger.
-#### Root-Cause the Issue
+### Root-Cause the Issue
You can see that there is something wrong in the `submit` call at line `104`, but we need more information to understand what is happening. For that, we need to capture the lower-level API calls using the `unitrace` tool.
@@ -270,16 +264,16 @@ Among other things, the Tracing and Profiling utility can print every low-level
3. Let the output continue until the error occurs and the program stops.
```
:
- >>>> [776257970958971] zeKernelSetGroupSize: hKernel = 54646808 groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1
- <<<< [776257970963072] zeKernelSetGroupSize [1237 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [776257970967552] zeCommandListCreateImmediate: hContext = 53065840 hDevice = 48614248 altdesc = 140735243323376 {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC(0xe) 0 0 0 0 2 0} phCommandList = 140735243323360 (hCommandList = 0)
- <<<< [776257971129090] zeCommandListCreateImmediate [157138 ns] hCommandList = 54788792 -> ZE_RESULT_SUCCESS(0x0)
- >>>> [776257971135996] zeEventHostReset: hEvent = 49803640
- <<<< [776257971139385] zeEventHostReset [1296 ns] -> ZE_RESULT_SUCCESS(0x0)
- >>>> [776257972254927] zeCommandListAppendLaunchKernel: hCommandList = 54788792 hKernel = 54646808 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 140735243324504 {16385, 1, 1} hSignalEvent = 49803640 numWaitEvents = 0 phWaitEvents = 0
- <<<< [776257972338436] zeCommandListAppendLaunchKernel [56440 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
+ >>>> [806881439215102] zeKernelSetGroupSize: hKernel = 0x2a54be8 groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1
+ <<<< [806881439220367] zeKernelSetGroupSize [1631 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [806881439225163] zeCommandListCreateImmediate: hContext = 0x29cd578 hDevice = 0x2936ee8 altdesc = 0x7ffcce631850 {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC(0xe) 0 0 0 0 2 0} phCommandList = 0x7ffcce631838 (hCommandList = 0x0)
+ <<<< [806881439391714] zeCommandListCreateImmediate [160772 ns] hCommandList = 0x2d3b218 -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [806881439399881] zeEventHostReset: hEvent = 0x2cd8a88
+ <<<< [806881439403570] zeEventHostReset [1577 ns] -> ZE_RESULT_SUCCESS(0x0)
+ >>>> [806881439411094] zeCommandListAppendLaunchKernel: hCommandList = 0x2d3b218 hKernel = 0x2a54be8 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 0x7ffcce631d78 {16385, 1, 1} hSignalEvent = 0x2cd8a88 numWaitEvents = 0x0 phWaitEvents = 0x0
+ <<<< [806881439467224] zeCommandListAppendLaunchKernel [47029 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
+ what(): level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)
```
@@ -288,16 +282,16 @@ Among other things, the Tracing and Profiling utility can print every low-level
A note about the output above: you will see that it has two lines that read:
```
- >>>> [776257970958971] zeKernelSetGroupSize: hKernel = 54646808 groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1
+ >>>> [806881439215102] zeKernelSetGroupSize: hKernel = 0x2a54be8 groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1
:
- >>>> [776257972254927] zeCommandListAppendLaunchKernel: hCommandList = 54788792 hKernel = 54646808 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 140735243324504 {16385, 1, 1} hSignalEvent = 49803640 numWaitEvents = 0 phWaitEvents = 0
+ >>>> [806881439411094] zeCommandListAppendLaunchKernel: hCommandList = 0x2d3b218 hKernel = 0x2a54be8 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 0x7ffcce631d78 {16385, 1, 1} hSignalEvent = 0x2cd8a88 numWaitEvents = 0x0 phWaitEvents = 0x0
```
- We used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the workgroup size (`groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1`), and the second shows how many total workgroups will be needed to process the global iteration range (`{16385, 1, 1}`).
+ We used the form of `parallel_for` that takes the `nd_range`, which specifies the global iteration range (163850) and the local work-group size (10) like so: `nd_range<1>{{163850}, {10}}`. The first line above shows the workgroup size (`groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1`), and the second shows how many total workgroups will be needed to process the global iteration range (`{16385, 1, 1}`).
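 As a sanity check, the work-group count reported in the trace follows directly from these two numbers. A quick sketch using the values from this sample:

 ```cpp
 #include <cstdio>

 int main() {
     // Values from nd_range<1>{{163850}, {10}} in this sample.
     const long global_range = 163850;  // total iterations
     const long group_size   = 10;      // groupSizeX = 0xa in the trace
     // Work-groups needed to cover the global range (rounded up).
     const long num_groups = (global_range + group_size - 1) / group_size;
     printf("%ld work-groups of size %ld\n", num_groups, group_size);
     return 0;
 }
 ```

 This prints `16385 work-groups of size 10`, matching the `{16385, 1, 1}` launch arguments in the trace.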
-#### Determine Device Limits
+### Determine Device Limits
-If you have access to a version of the graphics drivers built with debug functionality, you can get even more information about this error by setting two NEO variables to the following values:
+If you have access to a version of the graphics drivers built with debug functionality, you can get even more information about this error by setting two "NEO" variables to the following values:
```
export NEOReadDebugKeys=1
@@ -324,7 +318,7 @@ Aborted (core dumped)
The new message of interest is `Size of SLM (656384) larger than available (131072)`. This tells you that the size of the Shared Local Memory (SLM) on the device, 131072 bytes (128 KB), is smaller than the requested size of 656384 bytes (641 KB).
-If the `parallel_for` were operating over a multi-dimensional range (for example, if `acc` were two or three-dimensional), you need to multiply the dimensions together to determine the number of floating point numbers we are trying to store in SLM. In our case, the calculation is easy: the first argument to the `nd_range` in the `parallel_for` is single-dimensional, so it's just 163850. Thus the problem is that the size of work-group local memory we tried to allocate, (163850 floats or 4*163850=655,400 bytes rounded up to the nearest 64-byte cache line), doesn't fit in the SLM on this device.
+If the `parallel_for` were operating over a multi-dimensional range (for example, if `acc` were two- or three-dimensional), you would need to multiply the dimensions together to determine the number of floating-point numbers we are trying to store in SLM. In our case, the calculation is easy: the first argument to the `nd_range` in the `parallel_for` is single-dimensional, so it's just 163850. Thus the problem is that the size of work-group local memory we tried to allocate (163850 floats, or 4*163850=655,400 bytes, rounded up to the nearest 64-byte cache line) doesn't fit in the SLM on this device.
You should know that different devices will have different amounts of memory set aside as SLM. In SYCL, you can query this number by passing `info::device::local_mem_size` to the `get_info` member of the `device` class.
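In code, that query looks like the following sketch. It needs the oneAPI compiler (`icpx -fsycl`) and a SYCL device to run, so treat it as illustrative:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q;  // default device selection
    // Maximum bytes of work-group local (SLM) memory on this device.
    const auto slm_bytes =
        q.get_device().get_info<sycl::info::device::local_mem_size>();
    std::cout << "local_mem_size: " << slm_bytes << " bytes\n";
}
```

On the device used in this walkthrough, this would report the 131072 bytes seen in the driver message.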
@@ -332,27 +326,27 @@ Finally, running under `unitrace -c` you see:
```
:
->>>> [776708418226802] zeKernelSetGroupSize: hKernel = 57133096 groupSizeX = 10 groupSizeY = 1 groupSizeZ = 1
-<<<< [776708418230893] zeKernelSetGroupSize [1154 ns] -> ZE_RESULT_SUCCESS(0x0)
->>>> [776708418235549] zeCommandListCreateImmediate: hContext = 55553168 hDevice = 51101560 altdesc = 140722633379296 {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC(0xe) 0 0 0 0 2 0} phCommandList = 140722633379280 (hCommandList = 0)
+>>>> [807184674868230] zeKernelSetGroupSize: hKernel = 0x257bbe8 groupSizeX = 0xa groupSizeY = 0x1 groupSizeZ = 0x1
+<<<< [807184674879934] zeKernelSetGroupSize [1560 ns] -> ZE_RESULT_SUCCESS(0x0)
+>>>> [807184674884930] zeCommandListCreateImmediate: hContext = 0x24f4578 hDevice = 0x245e2f8 altdesc = 0x7ffeb187b880 {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC(0xe) 0 0 0 0 2 0} phCommandList = 0x7ffeb187b868 (hCommandList = 0x0)
Flush Task for Immediate command list : Enabled
-Using PCI barrier ptr: 0x141f77d49000
-<<<< [776708418401199] zeCommandListCreateImmediate [160724 ns] hCommandList = 57275080 -> ZE_RESULT_SUCCESS(0x0)
->>>> [776708418408270] zeEventHostReset: hEvent = 52290952
-<<<< [776708418411693] zeEventHostReset [997 ns] -> ZE_RESULT_SUCCESS(0x0)
->>>> [776708418417397] zeCommandListAppendLaunchKernel: hCommandList = 57275080 hKernel = 57133096 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 140722633380424 {16385, 1, 1} hSignalEvent = 52290952 numWaitEvents = 0 phWaitEvents = 0
+Using PCI barrier ptr: 0xbbbf8206000
+<<<< [807184675059549] zeCommandListCreateImmediate [169347 ns] hCommandList = 0x2862218 -> ZE_RESULT_SUCCESS(0x0)
+>>>> [807184675067615] zeEventHostReset: hEvent = 0x27ffa88
+<<<< [807184675071317] zeEventHostReset [1401 ns] -> ZE_RESULT_SUCCESS(0x0)
+>>>> [807184675077771] zeCommandListAppendLaunchKernel: hCommandList = 0x2862218 hKernel = 0x257bbe8 (_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_) pLaunchFuncArgs = 0x7ffeb187bda8 {16385, 1, 1} hSignalEvent = 0x27ffa88 numWaitEvents = 0x0 phWaitEvents = 0x0
Size of SLM (656384) larger than available (131072)
-<<<< [776708418485438] zeCommandListAppendLaunchKernel [60634 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
+<<<< [807184675135249] zeCommandListAppendLaunchKernel [48600 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
terminate called after throwing an instance of 'sycl::_V1::exception'
- what(): UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
+ what(): level_zero backend failed with error: 40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)
Aborted (core dumped)
```
-This is useful because it shows you the kernel being called that caused the error (`_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_` which `c++filt` resolves to `typeinfo name for main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::nd_item<1>)#1} `) in addition to the amount of memory requested vs. the available size of SLM.
+This is useful because it shows both the kernel that caused the error (`_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_7nd_itemILi1EEEE_`, which `c++filt` resolves to `typeinfo name for main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::nd_item<1>)#1}`) and the amount of memory requested vs. the SLM available. So now we know for certain that the problem really was in the lambda function the debugger located, and we also know the reason for the error.
-#### Resolving the Problem
+### Resolving the Problem
The synthetic code in this example has nothing to do with the matrix multiply, so you can simply delete it to resolve the problem.