NVIDIA CUDA and Arm Forge DDT

Arm Forge DDT can be used to debug GPU-parallel codes.

Running the test

The test can be run from the command line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./arm_ddt_cuda.py

A successful ReFrame output will look like the following:

[----------] started processing sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps (Tool validation)
[ RUN      ] sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] finished processing sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps (Tool validation)

[----------] waiting for spawned checks to finish
[       OK ] (1/1) sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] all spawned checks have finished

[  PASSED  ] Ran 1 test case(s) from 1 check(s) (0 failure(s))
==============================================================================
PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps
- daint:gpu
   - PrgEnv-gnu
      * num_tasks: 1
      * elapsed: 113 s
------------------------------------------------------------------------------

The class of the test shows how to set up and run the code with the tool.

Bug reporting

DDT automatically sets a breakpoint at the entry of each CUDA kernel.

Figure: Arm Forge DDT break on CUDA kernel launch

In this example, the first CUDA kernel to be launched is the density kernel:

Figure: Arm Forge DDT density kernel (block 0, thread 0)

The Thread Selector lets you select a GPU thread and/or thread block.

Figure: Arm Forge DDT density kernel (last block, last thread)
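The two views above show the first and the last GPU thread of the launch. As a rough sketch of how a (block, thread) selection maps to a global index, the snippet below assumes a one-dimensional launch with a hypothetical block size of 256 threads and 27000 elements (the size that appears in the tracepoint expression clist[27000-1] of the txt report). The actual launch configuration of the density kernel is not given here, so the function names and the block size are illustrative only.

```python
import math


def launch_shape(n_items, threads_per_block=256):
    """Return (blocks, threads_per_block) for a 1D launch covering n_items.

    threads_per_block=256 is a hypothetical value, not taken from the test.
    """
    blocks = math.ceil(n_items / threads_per_block)
    return blocks, threads_per_block


def global_index(block, thread, threads_per_block):
    """Global 1D index, i.e. blockIdx.x * blockDim.x + threadIdx.x."""
    return block * threads_per_block + thread


n = 27000  # assumed number of elements, as in clist[27000-1]
blocks, tpb = launch_shape(n)

first = global_index(0, 0, tpb)              # block 0, thread 0
last = global_index(blocks - 1, tpb - 1, tpb)  # last block, last thread

print(f"blocks={blocks} first={first} last={last} in_range={last < n}")
```

Under these assumptions the last thread of the last block has a global index beyond the number of elements, so a kernel launched this way would need a bounds check before touching its data.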

Arm DDT also includes a GPU Devices display that gives information about the GPU device:

Figure: Arm Forge DDT GPU devices info

Table: GPU device info

cuda      thread   warp   sm      P100
--------------------------------------
threads   1        32     2’048   114’688
warps     x        1      64      3’584
sms       x        x      1       56
P100      x        x      x       1

It can be read as follows: one NVIDIA Pascal P100 GPU supports up to 32 threads per warp, 2’048 threads per SM, 114’688 threads per device, 64 warps per SM, 3’584 warps per device and 56 SMs per device.
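The figures in the table follow from three base values (32 threads per warp, 64 warps per SM, 56 SMs per device). The short sketch below only reproduces that arithmetic; the constant names are chosen for illustration and are not part of any DDT or CUDA API.

```python
# Base values for one NVIDIA Pascal P100, as listed in the table.
THREADS_PER_WARP = 32
WARPS_PER_SM = 64
SMS_PER_DEVICE = 56

# Derived values.
threads_per_sm = THREADS_PER_WARP * WARPS_PER_SM      # threads resident on one SM
warps_per_device = WARPS_PER_SM * SMS_PER_DEVICE      # warps resident on the device
threads_per_device = threads_per_sm * SMS_PER_DEVICE  # threads resident on the device

print(threads_per_sm, warps_per_device, threads_per_device)
```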

As usual, it is possible to inspect variables on the CPU and on the GPU:

Figure: Arm Forge DDT variables (CPU)

Figure: Arm Forge DDT variables (GPU)

Note

GPU execution under the control of a debugger is not as fast as running without a debugger.

Running DDT with a tracepoint makes it possible to specify which variables to record at runtime in batch mode. This is done in the set_launcher method. In the html report, an overview of the debugging data will typically look like this:

Figure: Arm Forge DDT html report (tracepoints)
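As an illustration of what a batch-mode tracepoint run could look like, the sketch below assembles a DDT command line in Python. It is not the code of the set_launcher method: the helper name, the executable name and the report file name are hypothetical, and the --offline, --output and --trace-at options are assumed to behave as described in the Arm Forge user guide for your installed version. The location cudaDensity.cu:26 and the variable clist are the ones that appear in the tracepoint report.

```python
import shlex


def ddt_offline_command(executable, location, variables, report="report.html"):
    """Build a DDT batch-mode command that records variables at a tracepoint.

    Hypothetical helper: the option names are assumptions to be checked
    against the Arm Forge documentation of the installed version.
    """
    # A tracepoint is given as LOCATION followed by the variables to record.
    trace = ",".join([location, *variables])
    return [
        "ddt",
        "--offline",              # run without the GUI
        f"--output={report}",     # report format follows the file extension
        f"--trace-at={trace}",
        executable,
    ]


cmd = ddt_offline_command(
    "./sqpatch.exe",              # hypothetical executable name
    "cudaDensity.cu:26",
    ["clist"],
    report="report.txt",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each option a separate argument, which makes it easier to hand over to a job launcher without quoting problems.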

The txt report contains the same information:

Tracepoints

#   Time               Tracepoint              Processes                                      Values
           sphexa::sph::cuda::kernels::density
           <double>(int, double, double, int,
           sphexa::BBox<double> const*, int
1 0:17.610 const*, int const*, int const*,     0         clist[27000-1]@3: {[0] = 26999, [1] = 0, [2] = 0} clist: Sparkline
           double const*, double const*,                 0x2aaafab3ca00
           double const*, double const*,
           double const*, double*)
           (cudaDensity.cu:26)
           sphexa::sph::cuda::kernels::density