Dear all,
thank you for these beautiful examples, really helpful!
My apologies if this is not the right place to ask; feel free to close this issue.
I am not opening this issue because I have a problem,
but rather because I am trying to get OpenACC interoperability working with the cuda_fp16 half-precision intrinsics.
I have looked at both openacc_c_main and openacc_cuda_device for inspiration.
My changes are here, for openacc_cuda_device: master...georgebisbas:wip_fp16
I am working on a V100 and I am using:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
and
pgcc --version
pgcc (aka nvc) 20.7-0 LLVM 64-bit target on x86-64 Linux -tp skylake
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
The code compiles:
$ make openacc_cuda_device
nvc++ -fast -acc -Minfo=all -gpu= cc75 -c openacc_cuda_device.cpp
"openacc_cuda_device.cpp", line 19: warning: variable "tmp" was declared but
never referenced
float *x, *y, tmp;
^
main:
34, Generating copyout(y[:n]) [if not already present]
Generating create(x[:n]) [if not already present]
37, Loop is parallelizable
Generating Tesla code
37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
37, Complex loop carried dependence of x-> prevents parallelization
Loop carried dependence of y-> prevents parallelization
Loop not fused: complex flow graph
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 8 times
45, Generating Tesla code
45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
45, Loop not vectorized/parallelized: contains call
nvc++ -o openacc_cuda_device -fast -acc -Minfo=all -gpu= cc75 saxpy_cuda_device.o openacc_cuda_device.o -Mcuda
but it seems to crash when calling foo:
$ ./openacc_cuda_device
c = 0.160000
So far I have been able to compile and run mixed-precision code with ease (https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/mixed-precision),
as well as plain OpenACC code and the OpenACC+CUDA examples in this repository (openacc-interoperability).
Any insight would be extremely helpful.
Regards,
--George