Interoperability with half-precision

Dear all,

Thank you for these beautiful examples; they are really helpful!
My apologies if this is not the right place to ask; feel free to close this issue.
I am not opening this issue because I have a problem,
but rather because I am trying to combine OpenACC interoperability with the `cuda_fp16` half-precision intrinsics.
I have looked at both `openacc_c_main` and `openacc_cuda_device` for inspiration.

My changes to `openacc_cuda_device` are here: https://github.com/jefflarkin/openacc-interoperability/compare/master...georgebisbas:wip_fp16

I am working on a V100 and I am using:
```
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

```
and

```
 pgcc --version

pgcc (aka nvc) 20.7-0 LLVM 64-bit target on x86-64 Linux -tp skylake 
PGI Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
```

The code compiles:
```
$ make openacc_cuda_device
nvc++ -fast -acc -Minfo=all -gpu= cc75 -c openacc_cuda_device.cpp
"openacc_cuda_device.cpp", line 19: warning: variable "tmp" was declared but
          never referenced
    float *x, *y, tmp;
                  ^

main:
     34, Generating copyout(y[:n]) [if not already present]
         Generating create(x[:n]) [if not already present]
     37, Loop is parallelizable
         Generating Tesla code
         37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     37, Complex loop carried dependence of x-> prevents parallelization
         Loop carried dependence of y-> prevents parallelization
         Loop not fused: complex flow graph
         Loop not vectorized: data dependency
         Generated vector simd code for the loop
         Loop unrolled 8 times
     45, Generating Tesla code
         45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     45, Loop not vectorized/parallelized: contains call
nvc++ -o openacc_cuda_device -fast -acc -Minfo=all -gpu= cc75 saxpy_cuda_device.o openacc_cuda_device.o -Mcuda 

```
but it seems to crash when calling `foo`:

```
$ ./openacc_cuda_device 
c = 0.160000
```

So far I have been able to compile and run mixed-precision code with ease (https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/mixed-precision),
as well as plain OpenACC code and the OpenACC+CUDA examples in this repository (openacc-interoperability).

Any insight would be extremely helpful.
Regards,
--George
