NVIDIA CUDA GDB

CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on GPUs.

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./cuda_gdb.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: 3f0c45d4)
Launched on host: daint101

[----------] started processing sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps (Tool validation)
[ RUN      ] sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] finished processing sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps (Tool validation)

[----------] waiting for spawned checks to finish
[       OK ] (1/1) sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] all spawned checks have finished

Looking into the check class shows how to set up and run the code with the tool.

Bug reporting

Running cuda-gdb in batch mode is possible with an input file that specifies the commands to execute at runtime:

break main
run -s 0 -n 15
break 75
info br
continue
p domain.clist
# $1 = std::vector of length 1000, capacity 1000 = {0, 1, 2, ...
ptype domain.clist
# type = std::vector<int>
# log each "info cuda" command to its own file:
set logging file info_devices.log
set logging on
info cuda devices
set logging off
set logging file info_kernels.log
set logging on
info cuda kernels
set logging off
set logging file info_threads.log
set logging on
info cuda threads
show logging
set logging off
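Assuming the commands above are saved in a file (the name cuda_gdb.input below is illustrative, as is the executable name), the debugger can then be driven non-interactively with the standard gdb flags --batch and -x:

```shell
# run cuda-gdb in batch mode, reading the debugger commands from
# cuda_gdb.input (file and executable names are placeholders):
cuda-gdb --batch -x cuda_gdb.input ./sqpatch.exe
```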

cuda-gdb supports user-defined functions (via the define command):

# set trace-commands off
define mygps_cmd
set trace-commands off
printf "gridDim=(%d,%d,%d) blockDim=(%d,%d,%d) blockIdx=(%d,%d,%d) threadIdx=(%d,%d,%d) warpSize=%d thid=%d\n", gridDim.x, gridDim.y, gridDim.z, blockDim.x, blockDim.y, blockDim.z, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, warpSize, blockDim.x * blockIdx.x + threadIdx.x
end

You can also extend GDB using the Python programming language. An example of GDB’s Python API usage is:

import re

# extract the number of SMs (the value following the sm_NN column) from
# the "info cuda devices" output and store it in a convenience variable:
txt = gdb.execute('info cuda devices', to_string=True)
regex = r'\s+sm_\d+\s+(\d+)\s+'
res = re.findall(regex, txt)
gdb.execute('set $sm_max = %s' % res[0])
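To see what the regular expression captures, it can be tried outside of gdb on a line in the style of the "info cuda devices" output (the sample line below is illustrative, not taken from a real session; the exact column layout may differ):

```python
import re

# a line in the style of the "info cuda devices" output (illustrative):
txt = ('*   0   02:00.0   Tesla P100   GP100-A1   sm_60   56   64   32   '
       '256   0x00000000')

# same pattern as above: capture the number following the sm_NN column,
# i.e. the SM count of the device
regex = r'\s+sm_\d+\s+(\d+)\s+'
res = re.findall(regex, txt)
print(res[0])  # -> 56
```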

An overview of the debugging data will typically look like this:

PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps
- daint:gpu
   - PrgEnv-gnu
      * num_tasks: 1
      * info_kernel_nblocks: 106
      * info_kernel_nthperblock: 256
      * info_kernel_np: 27000
      * info_threads_np: 27008
      * SMs: 56
      * WarpsPerSM: 64
      * LanesPerWarp: 32
      * max_threads_per_sm: 2048
      * max_threads_per_device: 114688
      * best_cubesize_per_device: 49
      * cubesize: 30
      * vec_len: 27000
      * threadid_of_last_sm: 14335
      * last_threadid: 27007
------------------------------------------------------------------------------
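The figures in the report are related by simple arithmetic. A sketch of one plausible derivation from the values above (the exact formulas used by the check may differ):

```python
import math

# values taken from the report above:
np_ = 27000               # vec_len = cubesize**3 = 30**3 particles
threads_per_block = 256   # info_kernel_nthperblock
sms = 56                  # SMs
max_threads_per_sm = 2048
warp_size = 32            # LanesPerWarp

# number of thread blocks needed to cover all particles
nblocks = math.ceil(np_ / threads_per_block)               # 106

# thread count rounded up to warp granularity
threads_rounded = math.ceil(np_ / warp_size) * warp_size   # 27008

# hardware thread limit of the whole device
max_threads_per_device = sms * max_threads_per_sm          # 114688

# cube size whose particle count best fills the device
best_cubesize = round(max_threads_per_device ** (1 / 3))   # 49
```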

It gives information about the limits of the GPU device:

cuda      thread   warp   sm      P100
threads   1        32     2’048   114’688
warps     x        1      64      3’584
sms       x        x      1       56
P100      x        x      x       1

It can be read as follows: one P100 GPU supports up to 32 threads per warp, 2’048 threads per SM, 114’688 threads per device, 64 warps per SM, 3’584 warps per device, 56 SMs per device, and so on.
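The per-device limits in the table are simply products of the per-warp and per-SM figures, as a quick check confirms:

```python
# per-unit limits of a P100, as listed in the table above:
lanes_per_warp = 32
warps_per_sm = 64
sms_per_device = 56

threads_per_sm = lanes_per_warp * warps_per_sm        # 2048
warps_per_device = warps_per_sm * sms_per_device      # 3584
threads_per_device = threads_per_sm * sms_per_device  # 114688
```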