Debugging tools

CSCS provides several debugging tools:

Arm Forge DDT

Arm Forge DDT is a licensed tool that can be used for debugging serial, multi-threaded (OpenMP), multi-process (MPI) and accelerator-based (CUDA, OpenACC) programs running on research and production systems, including the Cray Piz Daint system. It can be executed either through a graphical user interface (ddt --connect mode, or just ddt) or from the command line (ddt --offline mode).

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./arm_ddt.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: e0f8d969)
Launched on host: daint101

[---] waiting for spawned checks to finish
[ OK ] (1/1) sphexa_ddt_sqpatch_024mpi_001omp_35n_2steps on daint:gpu using PrgEnv-gnu
[---] all spawned checks have finished

[  PASSED  ] Ran 1 test case(s) from 1 check(s) (0 failure(s))

Looking into the class shows how to set up and run the code with the tool. In this case, the code is deliberately written so that all MPI ranks other than 0, 1 and 2 call MPI::COMM_WORLD.Abort, making the execution crash.
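
As a rough sketch only (the class name, binary name and ddt options below are assumptions rather than the actual arm_ddt.py code), a ReFrame check can wrap the job launcher so that the code runs under ddt in offline mode:

import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.launchers import LauncherWrapper


class sphexa_ddt_sketch(rfm.RegressionTest):
    # minimal sketch: run the SPH-EXA mini-app under ddt --offline
    def __init__(self):
        self.valid_systems = ['daint:gpu']
        self.valid_prog_environs = ['PrgEnv-gnu']
        self.executable = './sqpatch.exe'            # assumed binary name
        self.executable_opts = ['-n', '35', '-s', '2']
        # assumed sanity check: the aborting ranks leave a trace in the job output
        self.sanity_patterns = sn.assert_found(r'MPI_Abort', self.stdout)

    @rfm.run_before('run')
    def set_launcher(self):
        # turn "srun ./sqpatch.exe ..." into
        # "ddt --offline --output=rpt.html srun ./sqpatch.exe ..."
        self.job.launcher = LauncherWrapper(self.job.launcher, 'ddt',
                                            ['--offline', '--output=rpt.html'])

In --offline mode ddt writes its findings (tracepoints, abort location, callstacks) to the report file instead of opening the GUI.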

Bug reporting

An overview of the debugging data will typically look like this:

Tracepoints

#   Time      Tracepoint                          Processes   Values
1   0:07.258  main(int, char**) (sqpatch.cpp:75)  0-23        domain.clist: std::vector of length 0, capacity 1786  domain.clist[0]: Sparkline from 0 to 19286
2   0:07.970  main(int, char**) (sqpatch.cpp:75)  0-23        domain.clist: std::vector of length 0, capacity 1786  domain.clist[0]: Sparkline from 0 to 26171
3   0:08.873  main(int, char**) (sqpatch.cpp:75)  0-23        domain.clist: std::vector of length 0, capacity 1786  domain.clist[0]: Sparkline from 0 to 19097

The same data can be viewed with a web browser:

ddt --offline screenshot

ARM Forge DDT html report (created with --offline --output=rpt.html)

In the same way, using the DDT GUI gives the same result and more insight into the crash of the code:

ddt 01

ARM Forge DDT (All mpi ranks (except 0, 1 and 2) aborted)

ddt 02

ARM Forge DDT (callstack)

Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors user applications, and should an application take a system trap, performs analysis on the dying application. All of the stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file atpMergedBT.dot.

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./cray_atp.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: e0f8d969)
Launched on host: daint101

[----] waiting for spawned checks to finish
[ OK ] (1/1) sphexa_atp_sqpatch_024mpi_001omp_50n_1steps on daint:gpu using PrgEnv-gnu
[----] all spawned checks have finished

[  PASSED  ] Ran 1 test case(s) from 1 check(s) (0 failure(s))

Looking into the class shows how to set up and run the code with the tool. In this case, the code is deliberately written so that all MPI ranks other than 0, 1 and 2 call MPI::COMM_WORLD.Abort, making the execution crash.
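
As a rough sketch only (the class and binary names below are assumptions rather than the actual cray_atp.py code), the essential part is that ATP only acts when ATP_ENABLED=1 is set in the job environment:

import reframe as rfm
import reframe.utility.sanity as sn


class sphexa_atp_sketch(rfm.RegressionTest):
    # minimal sketch: let the crashing run be analysed by ATP
    def __init__(self):
        self.valid_systems = ['daint:gpu']
        self.valid_prog_environs = ['PrgEnv-gnu']
        self.executable = './sqpatch.exe'          # assumed binary name
        self.executable_opts = ['-n', '50', '-s', '1']
        self.variables = {'ATP_ENABLED': '1'}      # ATP stays inactive without this
        # the run is expected to crash; the check passes if ATP wrote its
        # merged backtrace tree (the message appears in the job error stream)
        self.sanity_patterns = sn.assert_found(r'atpMergedBT\.dot', self.stderr)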

Bug reporting

An overview of the debugging data will typically look like this:

MPI VERSION    : CRAY MPICH version 7.7.10 (ANL base 3.2)
...
Rank 1633 [Tue May  5 19:30:24 2020] [c9-2c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 1633
Rank 1721 [Tue May  5 19:30:24 2020] [c9-2c0s3n1] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 1721
...
Rank 757 [Tue May  5 19:30:24 2020] [c7-1c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 757
Application 22398835 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1743 starting:
  _start@start.S:120
  __libc_start_main@0x2aaaac3ddf89
  main@sqpatch.cpp:85
  MPI::Comm::Abort(int) const@mpicxx.h:1236
  PMPI_Abort@0x2aaaab1f15e5
  MPID_Abort@0x2aaaab2e4267
  __GI_abort@0x2aaaac3f4740
  __GI_raise@0x2aaaac3f3160
ATP Stack walkback for Rank 1743 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 1743, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

srun: error: nid04079: tasks 1344-1355: Killed
srun: Terminating job step 22398835.0
srun: error: nid03274: tasks 672-683: Killed
srun: error: nid04080: tasks 1356-1367: Killed
...
srun: error: nid03236: tasks 216-227: Killed
srun: error: nid05581: tasks 1716-1727: Killed
srun: error: nid05583: task 1743: Aborted (core dumped)
srun: Force Terminated job step 22398835.0

Several files are created:

atpMergedBT.dot
atpMergedBT_line.dot
core.atp.22398835.0.5324
core.atp.22398835.1743.23855
These files contain useful information about the crash:
  • atpMergedBT.dot: File containing the merged backtrace tree at a simple, function-level granularity. This file gives the simplest and most-collapsed view of the application state.

  • atpMergedBT_line.dot: File containing the merged backtrace tree at a more-complex, source-code line level of granularity. This file shows a denser, busier view of the application state and supports modest source browsing.

  • core.atp.apid.rank: These are the heuristically chosen core files named after the application ID and rank of the process from which they came.

The core file contains an image of the process’s memory at the time of termination. This image can be opened in a debugger, in this case with gdb:

            f'`pkg-config --modversion libAtpSigHandler` >> {version_rpt}',
            f'echo ATP_HOME=$ATP_HOME >> {version_rpt}',
            f'pkg-config --variable=exec_prefix libAtpSigHandler &>{which_rpt}'
        ]
        self.postbuild_cmds += [
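
The snippet above shows only the version-reporting part of the check. As a hypothetical illustration (the binary name is an assumption; the core file name is the one listed above), the rank-0 core file could be opened non-interactively right after the run:

# hypothetical post-run step inside the check: print the backtrace stored in
# the rank-0 core file with gdb in batch mode
self.postrun_cmds = [
    'gdb -batch -ex bt ./sqpatch.exe core.atp.22398835.0.5324',
]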

A typical report for rank 0 (or 1) will look like this:

Program terminated with signal SIGQUIT, Quit.
#0  0x00002aaaab2539bc in MPIDI_Cray_shared_mem_coll_tree_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#0  0x00002aaaab2539bc in MPIDI_Cray_shared_mem_coll_tree_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#1  0x00002aaaab2653f7 in MPIDI_Cray_shared_mem_coll_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#2  0x00002aaaab265fdd in MPIR_CRAY_Allreduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#3  0x00002aaaab1756b4 in MPIR_Allreduce_impl () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#4  0x00002aaaab176055 in PMPI_Allreduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#5  0x00000000004097e3 in ?? ()

and for other ranks:

Program terminated with signal SIGABRT, Aborted.
#0  0x00002aaaac3f7520 in raise () from /lib64/libc.so.6
#0  0x00002aaaac3f7520 in raise () from /lib64/libc.so.6
#1  0x00002aaaac3f8b01 in abort () from /lib64/libc.so.6
#2  0x00002aaaab2e4638 in MPID_Abort () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#3  0x00002aaaab1f19a6 in PMPI_Abort () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#4  0x0000000000405664 in ?? ()
#5  0x0000000000857bb8 in ?? ()

The atpMergedBT.dot files can be viewed with stat-view, a component of the STAT package (module load stat). The merged stack backtrace tree provides a concise, yet comprehensive, view of what the application was doing at the time of the crash.

stat-view screenshot

ATP/STAT (launched with stat-view atpMergedBT_line.dot, 1920 mpi ranks)

GNU GDB

GDB is the fundamental building block upon which many other debuggers are built. GDB lets you see what is going on inside another program while it executes, or what another program was doing at the moment it crashed (core files).

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./gdb.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: e0f8d969)
Launched on host: daint101

[-----] started processing sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps (Tool validation)
[ RUN ] sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps on dom:gpu using PrgEnv-cray
[ RUN ] sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps on dom:gpu using PrgEnv-gnu
[ RUN ] sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps on dom:gpu using PrgEnv-intel
[ RUN ] sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps on dom:gpu using PrgEnv-pgi
[-----] finished processing sphexa_gdb_sqpatch_001mpi_001omp_15n_0steps (Tool validation)

[  PASSED  ] Ran 4 test case(s) from 1 check(s) (0 failure(s))

Looking into the class shows how to set up and run the code with the tool. In this example, the code is serial.

Bug reporting

Running gdb in non-interactive (batch) mode is possible with an input file that specifies the commands to execute at runtime:

break 75
run -s 0 -n 15
# pretty returns this:
# $1 = std::vector of length 3375, capacity 3375 = {0, etc...
# except for PGI:
#   Dwarf Error: wrong version in compilation unit header
#   (gdb) p domain.clist[0]
#   No symbol "operator[]" in current context.
print domain.clist
print domain.clist[1]
# pvector returns this:
# ---------------------------------------------------------------------
# elem[2]: $3 = 2
# elem[3]: $4 = 3
# elem[4]: $5 = 4
# Vector size = 3375
# Vector capacity = 3375
# Element type = std::_Vector_base<int, std::allocator<int> >::pointer
# ---------------------------------------------------------------------
pvector domain.clist 2 4
continue
quit
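
Such a command file is then passed to gdb at launch time. A minimal sketch (assuming the command file above is saved as gdb.input and the binary is ./sqpatch.exe) is to let the check run gdb instead of the code itself:

# hypothetical sketch inside the check: run the serial code under gdb in
# batch mode; the program arguments are supplied by the "run" line above
self.executable = 'gdb'
self.executable_opts = ['-batch', '-x', 'gdb.input', './sqpatch.exe']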

An overview of the debugging data will typically look like this:

Breakpoint 1, main (argc=5, argv=0x7fffffff5f28) at sqpatch.cpp:75
75	        taskList.update(domain.clist);
$1 = std::vector of length 3375, capacity 3375 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199...}
$2 = 1
elem[2]: $3 = 2
elem[3]: $4 = 3
elem[4]: $5 = 4
Vector size = 3375
Vector capacity = 3375
Element type = std::_Vector_base<int, std::allocator<int> >::pointer

# Total execution time of 0 iterations of SqPatch: 0.305378s
[Inferior 1 (process 19366) exited normally]

NVIDIA CUDA GDB

CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on GPUs.

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./cuda_gdb.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: 3f0c45d4)
Launched on host: daint101

[----------] started processing sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps (Tool validation)
[ RUN      ] sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] finished processing sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps (Tool validation)

[----------] waiting for spawned checks to finish
[       OK ] (1/1) sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] all spawned checks have finished

Looking into the class shows how to set up and run the code with the tool.

Bug reporting

Running cuda-gdb in batch mode is possible with an input file that specifies the commands to execute at runtime:

break main
run -s 0 -n 15
break 75
info br
continue
p domain.clist
# $1 = std::vector of length 1000, capacity 1000 = {0, 1, 2, ...
ptype domain.clist
# type = std::vector<int>
print "info cuda devices"
set logging on info_devices.log
print "info cuda kernels"
set logging on info_kernels.log
show logging
print "info cuda threads"
set logging on info_threads.log
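
The set logging commands write the output of the info cuda queries to small log files. As an illustration only (the regular expression and the GridDim format are assumptions), the check can then pull numbers such as the number of thread blocks out of these logs with ReFrame's sanity functions:

import reframe.utility.sanity as sn

# hypothetical sketch: extract the number of thread blocks from the
# "info cuda kernels" log, assuming a GridDim field such as (106,1,1)
self.perf_patterns = {
    'info_kernel_nblocks':
        sn.extractsingle(r'\((?P<grid>\d+),1,1\)', 'info_kernels.log',
                         'grid', int),
}
self.reference = {
    '*': {'info_kernel_nblocks': (0, None, None, '')},
}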

cuda-gdb supports user-defined functions (via the define command):

# set trace-commands off
define mygps_cmd
set trace-commands off
printf "gridDim=(%d,%d,%d) blockDim=(%d,%d,%d) blockIdx=(%d,%d,%d) threadIdx=(%d,%d,%d) warpSize=%d thid=%d\n", gridDim.x, gridDim.y, gridDim.z, blockDim.x, blockDim.y, blockDim.z, blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z, warpSize, blockDim.x * blockIdx.x + threadIdx.x

You can also extend GDB using the Python programming language. An example of GDB’s Python API usage is:

import re
# run "info cuda devices" inside cuda-gdb and capture its output as a string
txt=gdb.execute('info cuda devices', to_string=True)
# the column after the sm_xx architecture field is the number of SMs of the device
regex = r'\s+sm_\d+\s+(\d+)\s+'
res = re.findall(regex, txt)
# store the number of SMs in a gdb convenience variable for later use
gdb.execute('set $sm_max = %s' % res[0])

An overview of the debugging data will typically look like this:

PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_cudagdb_sqpatch_001mpi_001omp_30n_0steps
- daint:gpu
   - PrgEnv-gnu
      * num_tasks: 1
      * info_kernel_nblocks: 106
      * info_kernel_nthperblock: 256
      * info_kernel_np: 27000
      * info_threads_np: 27008
      * SMs: 56
      * WarpsPerSM: 64
      * LanesPerWarp: 32
      * max_threads_per_sm: 2048
      * max_threads_per_device: 114688
      * best_cubesize_per_device: 49
      * cubesize: 30
      * vec_len: 27000
      * threadid_of_last_sm: 14335
      * last_threadid: 27007
------------------------------------------------------------------------------

It gives information about the limits of the gpu device:

cuda      thread   warp   sm       P100
threads   1        32     2'048    114'688
warps     x        1      64       3'584
sms       x        x      1        56
P100      x        x      x        1

It can be read as: one P100 GPU supports up to 32 threads per warp, 2'048 threads per SM, 114'688 threads per device, 64 warps per SM, 3'584 warps per device, 56 SMs per device, and so on.
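
These numbers are consistent with one another and with the performance report shown above; a few lines of Python reproduce them (how the check actually derives best_cubesize_per_device is an assumption here):

# P100 limits taken from the table above
threads_per_warp = 32
warps_per_sm = 64
sms_per_device = 56

threads_per_sm = threads_per_warp * warps_per_sm       # 2048
threads_per_device = threads_per_sm * sms_per_device   # 114688
warps_per_device = warps_per_sm * sms_per_device       # 3584

# the cube root of 114688 is about 48.6; rounding it reproduces the
# best_cubesize_per_device of 49 reported above (assumed derivation)
best_cubesize_per_device = round(threads_per_device ** (1 / 3))   # 49

print(threads_per_sm, threads_per_device, warps_per_device, best_cubesize_per_device)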

NVIDIA CUDA and ARM Forge DDT

Arm Forge DDT can be used for debugging GPU parallel codes.

Running the test

The test can be run from the command-line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./arm_ddt_cuda.py

A successful ReFrame output will look like the following:

[----------] started processing sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps (Tool validation)
[ RUN      ] sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] finished processing sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps (Tool validation)

[----------] waiting for spawned checks to finish
[       OK ] (1/1) sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps on daint:gpu using PrgEnv-gnu
[----------] all spawned checks have finished

[  PASSED  ] Ran 1 test case(s) from 1 check(s) (0 failure(s))
==============================================================================
PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_cudaddt_sqpatch_001mpi_001omp_30n_0steps
- daint:gpu
   - PrgEnv-gnu
      * num_tasks: 1
      * elapsed: 113 s
------------------------------------------------------------------------------

Looking into the Class shows how to setup and run the code with the tool.

Bug reporting

DDT will automatically set a breakpoint at the entry of CUDA kernels.

kernel launch

Arm Forge DDT break on cuda kernel launch

In this example, the first cuda kernel to be launched is the density kernel:

b0_th0

Arm Forge DDT density kernel (block 0, thread 0)

The Thread Selector allows selecting a GPU thread and/or thread block.

bn_thn

Arm Forge DDT density kernel (last block, last thread)

Arm DDT also includes a GPU Devices display that gives information about the gpu device:

info_devices

Arm Forge DDT gpu devices info

gpu device info

cuda      thread   warp   sm       P100
threads   1        32     2'048    114'688
warps     x        1      64       3'584
sms       x        x      1        56
P100      x        x      x        1

It can be read as: one NVIDIA Pascal P100 GPU supports up to 32 threads per warp, 2'048 threads per SM, 114'688 threads per device, 64 warps per SM, 3'584 warps per device, 56 SMs per device, and so on.

As usual, it is possible to inspect variables on the cpu and on the gpu:

info_cpu

Arm Forge DDT variables (cpu)

info_gpu

Arm Forge DDT variables (gpu)

Note

GPU execution under the control of a debugger is not as fast as running without a debugger.

Running ddt with a tracepoint allows specifying the variables to record at runtime in batch mode. In this test, this is done in the set_launcher method.
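
A minimal sketch of such a set_launcher method (the --trace-at option and its argument are assumptions based on the tracepoint location shown in the report below; LauncherWrapper is imported as in the earlier DDT sketch):

    @rfm.run_before('run')
    def set_launcher(self):
        # run under ddt in offline mode and record the clist variable each time
        # the density kernel tracepoint at cudaDensity.cu:26 is reached
        self.job.launcher = LauncherWrapper(
            self.job.launcher, 'ddt',
            ['--offline', '--output=rpt.html',
             '--trace-at=cudaDensity.cu:26,clist'])

An overview of the debugging data will typically look like this in the html report: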

html report

Arm Forge DDT html report (tracepoints)

and similarly in the txt report:

Tracepoints

#   Time      Tracepoint                                                                                                                                                                                                                                      Processes  Values
1   0:17.610  sphexa::sph::cuda::kernels::density<double>(int, double, double, int, sphexa::BBox<double> const*, int const*, int const*, int const*, double const*, double const*, double const*, double const*, double const*, double*) (cudaDensity.cu:26)  0          clist[27000-1]@3: {[0] = 26999, [1] = 0, [2] = 0}  clist: Sparkline 0x2aaafab3ca00
           sphexa::sph::cuda::kernels::density