Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors user applications and, should an application take a system trap, analyzes the dying application. The stack backtraces of all application processes are gathered into a merged stack backtrace tree and written to disk as the file atpMergedBT.dot.

Running the test

The test can be run from the command line:

module load reframe
cd hpctools.git/reframechecks/debug/

~/reframe.git/reframe.py \
-C ~/reframe.git/config/cscs.py \
--system daint:gpu \
--prefix=$SCRATCH -r \
-p PrgEnv-gnu \
--keep-stage-files \
-c ./cray_atp.py

A successful ReFrame output will look like the following:

Reframe version: 3.0-dev6 (rev: e0f8d969)
Launched on host: daint101

[----] waiting for spawned checks to finish
[ OK ] (1/1) sphexa_atp_sqpatch_024mpi_001omp_50n_1steps on daint:gpu using PrgEnv-gnu
[----] all spawned checks have finished

[  PASSED  ] Ran 1 test case(s) from 1 check(s) (0 failure(s))

Looking into the class shows how to set up and run the code with the tool. In this case, the code is deliberately written so that all MPI ranks other than 0, 1 and 2 call MPI::COMM_WORLD.Abort, which makes the execution crash.

Bug reporting

An overview of the debugging data will typically look like this:

MPI VERSION    : CRAY MPICH version 7.7.10 (ANL base 3.2)
...
Rank 1633 [Tue May  5 19:30:24 2020] [c9-2c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 1633
Rank 1721 [Tue May  5 19:30:24 2020] [c9-2c0s3n1] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 1721
...
Rank 757 [Tue May  5 19:30:24 2020] [c7-1c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 757
Application 22398835 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1743 starting:
  _start@start.S:120
  __libc_start_main@0x2aaaac3ddf89
  main@sqpatch.cpp:85
  MPI::Comm::Abort(int) const@mpicxx.h:1236
  PMPI_Abort@0x2aaaab1f15e5
  MPID_Abort@0x2aaaab2e4267
  __GI_abort@0x2aaaac3f4740
  __GI_raise@0x2aaaac3f3160
ATP Stack walkback for Rank 1743 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 1743, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

srun: error: nid04079: tasks 1344-1355: Killed
srun: Terminating job step 22398835.0
srun: error: nid03274: tasks 672-683: Killed
srun: error: nid04080: tasks 1356-1367: Killed
...
srun: error: nid03236: tasks 216-227: Killed
srun: error: nid05581: tasks 1716-1727: Killed
srun: error: nid05583: task 1743: Aborted (core dumped)
srun: Force Terminated job step 22398835.0
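
The log above can also be checked programmatically, for example to confirm that ATP actually ran and to see which ranks aborted. The following is a minimal sketch using only the Python standard library; the function name summarize_atp_output and the regular expressions are illustrative assumptions modelled on the sample output above, not part of ATP or of the test itself, and other ATP or MPICH versions may word these messages differently.

```python
import re

# Patterns modelled on the sample log in this document (assumption: the
# message wording is stable across ATP / Cray MPICH versions).
ABORT_RE = re.compile(
    r"^Rank (\d+) \[.*?\] \[.*?\] application called MPI_Abort\(")
WALKBACK_RE = re.compile(r"^ATP Stack walkback for Rank (\d+) starting:")
SIGNAL_RE = re.compile(r"^Process died with signal (\d+): '(.+)'")


def summarize_atp_output(text):
    """Return the aborting ranks, the walkback rank and the fatal signal."""
    summary = {"aborted_ranks": [], "walkback_rank": None, "signal": None}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = ABORT_RE.match(line)
        if match:
            summary["aborted_ranks"].append(int(match.group(1)))
            continue
        match = WALKBACK_RE.match(line)
        if match:
            summary["walkback_rank"] = int(match.group(1))
            continue
        match = SIGNAL_RE.match(line)
        if match:
            # Keep both the signal number and its textual name.
            summary["signal"] = (int(match.group(1)), match.group(2))
    return summary


# A few lines copied from the sample output above.
SAMPLE_LOG = """\
Rank 1633 [Tue May  5 19:30:24 2020] [c9-2c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 1633
Rank 757 [Tue May  5 19:30:24 2020] [c7-1c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 7) - process 757
Application 22398835 is crashing. ATP analysis proceeding...
ATP Stack walkback for Rank 1743 starting:
  main@sqpatch.cpp:85
ATP Stack walkback for Rank 1743 done
Process died with signal 6: 'Aborted'
"""

if __name__ == "__main__":
    print(summarize_atp_output(SAMPLE_LOG))
```

A missing walkback rank (None) would suggest that ATP did not intercept the crash, which is worth checking before looking for the .dot files.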

Several files are created:

atpMergedBT.dot
atpMergedBT_line.dot
core.atp.22398835.0.5324
core.atp.22398835.1743.23855

These files contain useful information about the crash:

  • atpMergedBT.dot: File containing the merged backtrace tree at a simple, function-level granularity. This file gives the simplest and most-collapsed view of the application state.

  • atpMergedBT_line.dot: File containing the merged backtrace tree at a more-complex, source-code line level of granularity. This file shows a denser, busier view of the application state and supports modest source browsing.

  • core.atp.apid.rank: These are the heuristically chosen core files named after the application ID and rank of the process from which they came.
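
Because the core files are named after the application ID and the rank, the ranks that were dumped can be recovered from the file names alone. The sketch below illustrates this with the standard library; parse_core_name and AtpCore are hypothetical helper names, and the trailing numeric field seen in the listing above (for example 5324) is kept as an opaque suffix because its meaning is not described here.

```python
import re
from typing import NamedTuple, Optional


class AtpCore(NamedTuple):
    apid: int    # application ID
    rank: int    # MPI rank the core file came from
    suffix: str  # trailing field, left uninterpreted (assumption)


# core.atp.<apid>.<rank>, optionally followed by one more dotted field.
CORE_RE = re.compile(r"^core\.atp\.(\d+)\.(\d+)(?:\.(.+))?$")


def parse_core_name(name: str) -> Optional[AtpCore]:
    """Split an ATP core file name into its fields, or return None."""
    match = CORE_RE.match(name)
    if match is None:
        return None
    return AtpCore(int(match.group(1)), int(match.group(2)),
                   match.group(3) or "")


if __name__ == "__main__":
    # File names taken from the listing above.
    files = [
        "atpMergedBT.dot",
        "atpMergedBT_line.dot",
        "core.atp.22398835.0.5324",
        "core.atp.22398835.1743.23855",
    ]
    cores = [c for c in map(parse_core_name, files) if c is not None]
    print(sorted(c.rank for c in cores))
```

For the four files listed above, the dumped ranks are 0 and 1743, which agrees with the "Forcing core dumps of ranks 1743, 0" message in the job output.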

The core file contains an image of the process's memory at the time of termination. This image can be opened in a debugger, in this case with gdb:

            f'`pkg-config --modversion libAtpSigHandler` >> {version_rpt}',
            f'echo ATP_HOME=$ATP_HOME >> {version_rpt}',
            f'pkg-config --variable=exec_prefix libAtpSigHandler &>{which_rpt}'
        ]
        self.postbuild_cmds += [

A typical report for rank 0 (or 1) will look like this:

Program terminated with signal SIGQUIT, Quit.
#0  0x00002aaaab2539bc in MPIDI_Cray_shared_mem_coll_tree_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#0  0x00002aaaab2539bc in MPIDI_Cray_shared_mem_coll_tree_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#1  0x00002aaaab2653f7 in MPIDI_Cray_shared_mem_coll_reduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#2  0x00002aaaab265fdd in MPIR_CRAY_Allreduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#3  0x00002aaaab1756b4 in MPIR_Allreduce_impl () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#4  0x00002aaaab176055 in PMPI_Allreduce () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#5  0x00000000004097e3 in ?? ()

and for other ranks:

Program terminated with signal SIGABRT, Aborted.
#0  0x00002aaaac3f7520 in raise () from /lib64/libc.so.6
#0  0x00002aaaac3f7520 in raise () from /lib64/libc.so.6
#1  0x00002aaaac3f8b01 in abort () from /lib64/libc.so.6
#2  0x00002aaaab2e4638 in MPID_Abort () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#3  0x00002aaaab1f19a6 in PMPI_Abort () from /opt/cray/pe/lib64/libmpich_gnu_82.so.3
#4  0x0000000000405664 in ?? ()
#5  0x0000000000857bb8 in ?? ()
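
Backtraces like the two above can be obtained non-interactively by running gdb in batch mode on the executable and a core file. The sketch below only shows one possible way to assemble such a command; the helper names and the executable path ./sqpatch.exe are hypothetical, and the exact commands used by the test itself may differ.

```python
import shlex
import subprocess


def gdb_backtrace_cmd(executable, core_file):
    """Build a gdb command line that prints a backtrace and exits.

    --batch makes gdb exit after processing its commands, and
    -ex bt runs the backtrace command once the core file is loaded.
    """
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def run_backtrace(executable, core_file):
    """Run gdb and return its standard output (requires gdb in PATH)."""
    result = subprocess.run(gdb_backtrace_cmd(executable, core_file),
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical executable name; the core file name is from the listing above.
    cmd = gdb_backtrace_cmd("./sqpatch.exe", "core.atp.22398835.1743.23855")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

When gdb loads a core file it reports the terminating signal and the current frame before executing bt, which is a likely reason why frame #0 appears twice in the reports above.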

The atpMergedBT.dot files can be viewed with stat-view, a component of the STAT package (module load stat). The merged stack backtrace tree provides a concise, yet comprehensive, view of what the application was doing at the time of the crash.

Figure: stat-view screenshot of ATP/STAT (launched with stat-view atpMergedBT_line.dot, 1920 MPI ranks)