Reference Guide

Regression tests

internal_timers_mpi.py

Sanity checks

reframechecks.common.sphexa.sanity.elapsed_time_from_date(self)

Reports the elapsed time in seconds using the Linux date command:

starttime=1579725956
stoptime=1579725961
reports: _Elapsed: 5 s
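
A minimal sketch of the extraction behind this check, assuming the
starttime=/stoptime= lines above end up in the job standard output (the
regexes and the use of self.stdout are illustrative):

import reframe.utility.sanity as sn

def elapsed_time_from_date(self):
    # Extract the two epoch timestamps written with `date +%s`
    start = sn.extractsingle(r'^starttime=(?P<sec>\d+)', self.stdout,
                             'sec', int)
    stop = sn.extractsingle(r'^stoptime=(?P<sec>\d+)', self.stdout,
                            'sec', int)
    # Deferred arithmetic: evaluated later, at the performance stage
    return stop - start
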
reframechecks.common.sphexa.sanity.pctg_FindNeighbors(obj)

reports: * %FindNeighbors: 9.8 %

reframechecks.common.sphexa.sanity.pctg_IAD(obj)

reports: * %IAD: 17.36 %

reframechecks.common.sphexa.sanity.pctg_MomentumEnergyIAD(obj)

reports: * %MomentumEnergyIAD: 30.15 %

reframechecks.common.sphexa.sanity.pctg_Timestep(obj)

reports: * %Timestep: 16.6 %

reframechecks.common.sphexa.sanity.pctg_mpi_synchronizeHalos(obj)

reports: * %mpi_synchronizeHalos: 12.62 %

reframechecks.common.sphexa.sanity.seconds_elaps(self)

Reports elapsed time in seconds using the internal timer from the code

=== Total time for iteration(0) 3.61153s
reports: * Elapsed: 3.6115 s
reframechecks.common.sphexa.sanity.seconds_timers(self, region)

Reports timings (in seconds) using the internal timer from the code

# domain::sync: 0.118225s
# updateTasks: 0.00561256s
# FindNeighbors: 0.266282s
# Density: 0.120372s
# EquationOfState: 0.00255166s
# mpi::synchronizeHalos: 0.116917s
# IAD: 0.185804s
# mpi::synchronizeHalos: 0.0850162s
# MomentumEnergyIAD: 0.423282s
# Timestep: 0.0405346s
# UpdateQuantities: 0.0140938s
# EnergyConservation: 0.0224118s
# UpdateSmoothingLength: 0.00413466s
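
Each timer line can be harvested with a deferred regex; the sketch below
assumes the '# region: <seconds>s' layout shown above (regex and rounding
are illustrative):

import reframe.utility.sanity as sn

def seconds_timers(self, region):
    # region is e.g. 'FindNeighbors'; 'mpi::synchronizeHalos' appears
    # twice per iteration, hence all matches are summed
    regex = r'^# %s:\s+(?P<sec>\S+)s' % region
    return sn.round(sn.sum(sn.extractall(regex, self.stdout, 'sec',
                                         float)), 4)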

internal_timers_mpi_containers.py

class reframechecks.notool.internal_timers_mpi_containers.SphExa_Container_Base_Check(*args: Any, **kwargs: Any)[source]

Bases: reframe.

Two parameters can be set for the simulation:

Parameters
  • mpi_task – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpi_task, but cubesize could also be added to the list of parameters (see the sketch below),

  • step – number of simulation steps.
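
The dictionary mentioned above could look like the following sketch (the
cube sizes are illustrative, not the values hardcoded in the check):

def executable_opts_for(mpi_task: int, step: int) -> list:
    """Map the MPI task count to the cube side (-n) of the square patch
    test, keeping the work per task roughly constant."""
    cubesize = {24: 100, 48: 125, 96: 157}   # hypothetical mapping
    return ['-n', str(cubesize[mpi_task]), '-s', str(step)]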

Dependencies are:
  • compute: inputs (mpi_task, step) —srun—> *job.out

  • postprocess logs: inputs (*job.out) —x—> termgraph.in

  • plot data: inputs (termgraph.in) —termgraph.py—> termgraph.rpt

class reframechecks.notool.internal_timers_mpi_containers.MPI_Collect_Logs_Test(*args: Any, **kwargs: Any)

Bases: reframe.

collect_logs()
extract_data()

class reframechecks.notool.internal_timers_mpi_containers.MPI_Compute_Sarus_Test(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the executable with Sarus.

class reframechecks.notool.internal_timers_mpi_containers.MPI_Compute_Singularity_Test(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the executable with Singularity (and natively as well, for comparison).

class reframechecks.notool.internal_timers_mpi_containers.MPI_Plot_Test(*args: Any, **kwargs: Any)

Bases: reframe.

Intel

intel_inspector.py

class reframechecks.intel.intel_inspector.SphExaIntelInspectorCheck(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the test code with Intel Inspector (mpi only): https://software.intel.com/en-us/inspector

Available analysis types are: inspxe-cl -h collect

mi1   Detect Leaks
mi2   Detect Memory Problems
mi3   Locate Memory Problems
ti1   Detect Deadlocks
ti2   Detect Deadlocks and Data Races
ti3   Locate Deadlocks and Data Races
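
The check launches the instrumented run by prefixing the application with
the tool. A minimal ReFrame-style sketch, assuming inspxe-cl's -collect and
-r flags (the result directory, application arguments and sanity pattern
are illustrative):

import reframe as rfm
import reframe.utility.sanity as sn

class InspectorSketch(rfm.RegressionTest):
    def __init__(self):
        self.valid_systems = ['*']
        self.valid_prog_environs = ['*']
        # 'mi1' selects "Detect Leaks" (see the analysis types above)
        self.executable = 'inspxe-cl'
        self.executable_opts = ['-collect=mi1', '-r', 'rpt', '--',
                                './sqpatch.exe', '-n', '100', '-s', '0']
        self.sanity_patterns = sn.assert_found(r'problem\(s\) found',
                                               self.stdout)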

Two parameters can be set for the simulation:

Parameters
  • mpitask – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpitask, but cubesize could also be added to the list of parameters,

  • steps – number of simulation steps.

Typical performance reporting:

PERFORMANCE REPORT
--------------------------------------------------
sphexa_inspector_sqpatch_024mpi_001omp_100n_0steps
- dom:gpu
   - PrgEnv-gnu
      * num_tasks: 24
      * Elapsed: 8.899 s
      ...
      * Memory not deallocated: 1

intel_vtune.py

class reframechecks.intel.intel_vtune.SphExaVtuneCheck(*args: Any, **kwargs: Any)

Bases: sphexa.sanity_vtune.

This class runs the test code with Intel(R) VTune(TM) (mpi only): https://software.intel.com/en-us/vtune

Two parameters can be set for the simulation:

Parameters
  • mpitask – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpitask, but cubesize could also be added to the list of parameters,

  • steps – number of simulation steps.

Sanity checks

class reframechecks.common.sphexa.sanity_vtune.VtuneBaseTest(*args: Any, **kwargs: Any)[source]

Bases: reframe.

set_basic_perf_patterns()

A set of basic perf_patterns shared between the tests

set_vtune_perf_patterns_rpt()

More perf_patterns for the tool

Typical performance reporting:

      * vtune_elapsed_min: 6.695 s
      * vtune_elapsed_max: 6.695 s
      * vtune_elapsed_cput: 5.0858 s
      * vtune_elapsed_cput_efft: 4.8549 s
      * vtune_elapsed_cput_spint: 0.2309 s
      * vtune_elapsed_cput_spint_mpit: 0.2187 s
      * %vtune_effective_physical_core_utilization: 85.3 %
      * %vtune_effective_logical_core_utilization: 84.6 %
      * vtune_cput_cn0: 122.06 s
      * %vtune_cput_cn0_efft: 95.5 %
      * %vtune_cput_cn0_spint: 4.5 %

intel_advisor.py

class reframechecks.intel.intel_advisor.SphExaIntelAdvisorCheck(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the test code with Intel Advisor (mpi only): https://software.intel.com/en-us/advisor

Available analysis types are: advixe-cl -h collect

survey       - Discover efficient vectorization and/or threading
dependencies - Identify and explore loop-carried dependencies for loops
map          - Identify and explore complex memory accesses
roofline     - Run the Survey analysis + Trip Counts & FLOP analysis
suitability  - Check predicted parallel performance
tripcounts   - Identify the number of loop iterations.

Two parameters can be set for the simulation:

Parameters
  • mpitask – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpitask, but cubesize could also be added to the list of parameters,

  • steps – number of simulation steps.

Typical performance reporting:

PERFORMANCE REPORT
--------------------------------------------------
sphexa_inspector_sqpatch_024mpi_001omp_100n_0steps
- dom:gpu
   - PrgEnv-gnu
      * num_tasks: 24
      * Elapsed: 3.6147 s
      ...
      * advisor_elapsed: 2.13 s
      * advisor_loop1_line: 94 (momentumAndEnergyIAD.hpp)

Sanity checks

reframechecks.common.sphexa.sanity_intel.advisor_elapsed(obj)

Reports the elapsed time (sum of Self Time in seconds) measured by the tool

> summary.rpt
ID / Function Call Sites and Loops / Total Time / Self Time /  Type
71 [loop in sphexa::sph::computeMomentumAndEnergyIADImpl<double,
  ... sphexa::ParticlesData<double>> at momentumAndEnergyIAD.hpp:94]
  ... 1.092s      0.736s              Scalar  momentumAndEnergyIAD.hpp:94
34 [loop in MPIDI_Cray_shared_mem_coll_bcast]
  ... 0.596s      0.472s              Scalar  libmpich_gnu_82.so.3
etc.
returns: * advisor_elapsed: 2.13 s
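
A sketch of the summation behind this check, assuming the summary.rpt
layout above (the regex is illustrative):

import reframe.utility.sanity as sn

def advisor_elapsed(obj):
    # Sum the Self Time column, i.e. the second 'N.NNNs' field of each
    # '... 1.092s      0.736s   Scalar ...' line
    regex = r'\s(?P<total>\S+)s\s+(?P<self>\S+)s\s+Scalar'
    return sn.round(sn.sum(sn.extractall(regex, 'summary.rpt', 'self',
                                         float)), 2)
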
reframechecks.common.sphexa.sanity_intel.advisor_loop1_filename(obj)

Reports the name of the source file (filename) of the most time-consuming loop

> summary.rpt
ID / Function Call Sites and Loops / Total Time / Self Time /  Type
71 [loop in sphexa::sph::computeMomentumAndEnergyIADImpl<double,
  ... sphexa::ParticlesData<double>> at momentumAndEnergyIAD.hpp:94]
  ... 1.092s      0.736s              Scalar  momentumAndEnergyIAD.hpp:94
34 [loop in MPIDI_Cray_shared_mem_coll_bcast]
  ... 0.596s      0.472s              Scalar  libmpich_gnu_82.so.3
etc.
returns: * advisor_loop1_line: 94 (momentumAndEnergyIAD.hpp)
reframechecks.common.sphexa.sanity_intel.advisor_loop1_line(obj)

Reports the line (fline) of the most time-consuming loop

> summary.rpt
ID / Function Call Sites and Loops / Total Time / Self Time /  Type
71 [loop in sphexa::sph::computeMomentumAndEnergyIADImpl<double,
  ... sphexa::ParticlesData<double>> at momentumAndEnergyIAD.hpp:94]
  ... 1.092s      0.736s              Scalar  momentumAndEnergyIAD.hpp:94
34 [loop in MPIDI_Cray_shared_mem_coll_bcast]
  ... 0.596s      0.472s              Scalar  libmpich_gnu_82.so.3
etc.
returns: * advisor_loop1_line: 94 (momentumAndEnergyIAD.hpp)
reframechecks.common.sphexa.sanity_intel.advisor_version(obj)

Checks tool’s version:

> advixe-cl --version
Intel(R) Advisor 2020 (build 604394) Command Line Tool
returns: True or False
reframechecks.common.sphexa.sanity_intel.inspector_not_deallocated(obj)

Reports number of Memory not deallocated problem(s)

> summary.rpt
2 new problem(s) found
1 Memory leak problem(s) detected
1 Memory not deallocated problem(s) detected
returns: * Memory not deallocated: 1
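
A minimal sketch of this extraction, assuming the summary.rpt lines above:

import reframe.utility.sanity as sn

def inspector_not_deallocated(obj):
    # matches: '1 Memory not deallocated problem(s) detected'
    regex = r'^(?P<n>\d+) Memory not deallocated problem\(s\) detected'
    return sn.extractsingle(regex, 'summary.rpt', 'n', int)
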
reframechecks.common.sphexa.sanity_intel.inspector_version(obj)

Checks tool’s version:

> inspxe-cl --version
Intel(R) Inspector 2020 (build 603904) Command Line tool
returns: True or False
reframechecks.common.sphexa.sanity_intel.vtune_logical_core_utilization(self)

Reports the minimum Logical Core Utilization (%) measured by the tool

Effective Logical Core Utilization: 96.0% (23.028 out of 24)
Effective Logical Core Utilization: 95.9% (23.007 out of 24)
Effective Logical Core Utilization: 95.5% (22.911 out of 24)
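
One utilization line is reported per compute node; a sketch that keeps the
worst (minimum) value, assuming a hypothetical self.rpt attribute holding
the report path:

import reframe.utility.sanity as sn

def vtune_logical_core_utilization(self):
    regex = r'Effective Logical Core Utilization: (?P<pct>\S+)%'
    return sn.min(sn.extractall(regex, self.rpt, 'pct', float))
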
reframechecks.common.sphexa.sanity_intel.vtune_momentumAndEnergyIAD(self)

Reports the time spent in the computeMomentumAndEnergyIAD region (one value per compute node):

sphexa::sph::computeMomentumAndEnergyIADImpl<…> sqpatch.exe 40.919s
sphexa::sph::computeMomentumAndEnergyIADImpl<…> sqpatch.exe 38.994s
sphexa::sph::computeMomentumAndEnergyIADImpl<…> sqpatch.exe 40.245s
sphexa::sph::computeMomentumAndEnergyIADImpl<…> sqpatch.exe 39.487s

reframechecks.common.sphexa.sanity_intel.vtune_perf_patterns(obj)

Dictionary of default perf_patterns for the tool

reframechecks.common.sphexa.sanity_intel.vtune_physical_core_utilization(self)

Reports the minimum Physical Core Utilization (%) measured by the tool

Effective Physical Core Utilization: 96.3% (11.554 out of 12)
Effective Physical Core Utilization: 96.1% (11.534 out of 12)
Effective Physical Core Utilization: 95.9% (11.512 out of 12)
reframechecks.common.sphexa.sanity_intel.vtune_time(self)

VTune creates one report per compute node. For example, a 48 MPI task job (= 2 compute nodes when running with 24 tasks per node) will create 2 directories:
  • rpt.nid00001/rpt.nid00001.vtune
  • rpt.nid00002/rpt.nid00002.vtune

Typical output (for each compute node) is:

Elapsed Time:     14.866s
    CPU Time:     319.177s            /24 = 13.3
        Effective Time:   308.218s    /24 = 12.8
            Idle: 0s
            Poor: 19.725s
            Ok:   119.570s
            Ideal:        168.922s
            Over: 0s
        Spin Time:        10.959s             /24 =  0.4
            MPI Busy Wait Time:   10.795s
            Other:        0.164s
        Overhead Time:    0s
Total Thread Count:       25
Paused Time:      0s
reframechecks.common.sphexa.sanity_intel.vtune_tool_reference(obj)

Dictionary of default reference for the tool

reframechecks.common.sphexa.sanity_intel.vtune_version(obj)

Checks tool’s version:

> vtune --version
Intel(R) VTune(TM) Profiler 2020 (build 605129) Command Line Tool
returns: True or False

Score-P

scorep_sampling_profiling.py

scorep_sampling_tracing.py

Sanity checks

reframechecks.common.sphexa.sanity_scorep.ipc_rk0(obj)

Reports the IPC (instructions per cycle) for rank 0

reframechecks.common.sphexa.sanity_scorep.program_begin_count(obj)

Reports the number of PROGRAM_BEGIN in the otf2 trace file

reframechecks.common.sphexa.sanity_scorep.program_end_count(obj)

Reports the number of PROGRAM_END in the otf2 trace file

reframechecks.common.sphexa.sanity_scorep.ru_maxrss_rk0(obj)

Reports the maximum resident set size

reframechecks.common.sphexa.sanity_scorep.scorep_assert_version(obj)

Checks tool’s version:

> scorep --version
Score-P 6.0
returns: True or False
reframechecks.common.sphexa.sanity_scorep.scorep_com_pct(obj)

Reports COM % measured by the tool

type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
COM      4,680 1,019,424     891  303.17    12.0         297.39  COM
                                            ****
reframechecks.common.sphexa.sanity_scorep.scorep_elapsed(obj)

Typical performance report from the tool (profile.cubex)

   type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
   ALL  1,019,921 2,249,107 934,957  325.00   100.0   144.50  ALL
                                     ******
   USR    724,140 1,125,393 667,740  226.14    69.6   200.94  USR
   MPI    428,794    59,185 215,094   74.72    23.0  1262.56  MPI
   COM     43,920 1,061,276  48,832   21.96     6.8    20.69  COM
MEMORY      9,143     3,229   3,267    2.16     0.7   669.59  MEMORY
SCOREP         94        24      24    0.01     0.0   492.90  SCOREP
   USR    317,100   283,366 283,366   94.43    29.1         333.24 ...
_ZN6sphexa3sph31computeMomentumAndEnergyIADImplIdNS_13ParticlesData ...
IdEEEEvRKNS_4TaskERT0_
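
A sketch of extracting the total elapsed time from the ALL row above
(obj.rpt_score is a hypothetical attribute holding the scorep-score
report path):

import reframe.utility.sanity as sn

def scorep_elapsed(obj):
    # 'ALL  1,019,921 2,249,107 934,957  325.00   100.0   144.50  ALL'
    regex = r'^\s*ALL(\s+\S+){3}\s+(?P<sec>\S+)\s+100\.0'
    return sn.extractsingle(regex, obj.rpt_score, 'sec', float)
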
reframechecks.common.sphexa.sanity_scorep.scorep_exclusivepct_energy(obj)

Reports % of elapsed time (exclusive) for MomentumAndEnergy function (small scale job)

> sqpatch_048mpi_001omp_125n_10steps_1000000cycles/rpt.exclusive
0.0193958 (0.0009252%) sqpatch.exe
1.39647 (0.06661%)       + main
...
714.135 (34.063%)   |   + ...
         *******
  _ZN6sphexa3sph31computeMomentumAndEnergyIADImplIdNS_13 ...
  ParticlesDataIdEEEEvRKNS_4TaskERT0_
0.205453 (0.0098%)  |   +
  _ZN6sphexa3sph15computeTimestepIdNS0_21TimestepPress2ndOrderIdNS_13 ...
  ParticlesDataIdEEEES4_EEvRKSt6vectorINS_4TaskESaIS7_EERT1_
201.685 (9.62%)     |   |   + MPI_Allreduce


 type max_buf[B]    visits    hits time[s] time[%] time/visit[us]  region
 OMP  1,925,120    81,920       0   63.84     2.5         779.29
  !$omp parallel @momentumAndEnergyIAD.hpp:87 ***
 OMP    920,500    81,920  48,000  125.41     5.0        1530.93
  !$omp for @momentumAndEnergyIAD.hpp:87      ***
 OMP    675,860    81,920       1   30.95     1.2         377.85
  !$omp implicit barrier @momentumAndEnergyIAD.hpp:93
                                              ***
reframechecks.common.sphexa.sanity_scorep.scorep_inclusivepct_energy(obj)

Reports % of elapsed time (inclusive) for MomentumAndEnergy function (small scale job)

> sqpatch_048mpi_001omp_125n_10steps_1000000cycles/rpt.exclusive
0.0193958 (0.0009252%) sqpatch.exe
1.39647 (0.06661%)       + main
...
714.135 (34.063%)   |   + ...
         *******
  _ZN6sphexa3sph31computeMomentumAndEnergyIADImplIdNS_13 ...
  ParticlesDataIdEEEEvRKNS_4TaskERT0_
0.205453 (0.0098%)  |   +
  _ZN6sphexa3sph15computeTimestepIdNS0_21TimestepPress2ndOrderIdNS_13 ...
  ParticlesDataIdEEEES4_EEvRKSt6vectorINS_4TaskESaIS7_EERT1_
201.685 (9.62%)     |   |   + MPI_Allreduce
reframechecks.common.sphexa.sanity_scorep.scorep_info_cuda_support(obj)

Checks tool’s configuration (Cuda support)

> scorep-info config-summary
CUDA support:  yes
reframechecks.common.sphexa.sanity_scorep.scorep_info_papi_support(obj)

Checks tool’s configuration (papi support)

> scorep-info config-summary
PAPI support: yes
reframechecks.common.sphexa.sanity_scorep.scorep_info_perf_support(obj)

Checks tool’s configuration (perf support)

> scorep-info config-summary
metric perf support: yes
reframechecks.common.sphexa.sanity_scorep.scorep_info_unwinding_support(obj)

Checks tool’s configuration (libunwind support)

> scorep-info config-summary
Unwinding support: yes
reframechecks.common.sphexa.sanity_scorep.scorep_mpi_pct(obj)

Reports MPI % measured by the tool

type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
MPI    428,794    59,185 215,094   74.72    23.0  1262.56  MPI
                                            ****
reframechecks.common.sphexa.sanity_scorep.scorep_omp_pct(obj)

Reports OMP % measured by the tool

type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
OMP 40,739,286 3,017,524 111,304 2203.92    85.4         730.37  OMP
                                            ****
reframechecks.common.sphexa.sanity_scorep.scorep_top1_name(obj)

Reports the demangled name of the top-1 function, for instance:

> c++filt ...
_ZN6sphexa3sph31computeMomentumAndEnergyIADImplIdNS_13 ...
ParticlesDataIdEEEEvRKNS_4TaskERT0_

void sphexa::sph::computeMomentumAndEnergyIADImpl  ...
      <double, sphexa::ParticlesData<double> > ...
      (sphexa::Task const&, sphexa::ParticlesData<double>&)
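
The demangling itself can be delegated to c++filt; a small sketch
(assumes binutils' c++filt is in PATH):

import subprocess

def demangle(mangled: str) -> str:
    """Return the demangled form of a C++ symbol via c++filt."""
    out = subprocess.run(['c++filt', mangled], capture_output=True,
                         text=True, check=True)
    return out.stdout.strip()
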
reframechecks.common.sphexa.sanity_scorep.scorep_top1_tracebuffersize(obj)

Reports max_buf[B] for top1 function

   type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
   ...
   USR    317,100   283,366 283,366   94.43    29.1         333.24 ...
_ZN6sphexa3sph31computeMomentumAndEnergyIADImplIdNS_13ParticlesData ...

   USR    430,500    81,902  81,902   38.00     1.5         463.99  ...
gomp_team_barrier_wait_end
reframechecks.common.sphexa.sanity_scorep.scorep_top1_tracebuffersize_name(obj)

Reports the function name of the top-1 (max_buf[B]) function

reframechecks.common.sphexa.sanity_scorep.scorep_usr_pct(obj)

Reports USR % measured by the tool

type max_buf[B]   visits    hits time[s] time[%] time/visit[us] region
USR    724,140 1,125,393 667,740  226.14    69.6   200.94  USR
                                            ****
reframechecks.common.sphexa.sanity_scorep.scorep_version(obj)

Checks tool’s version:

> scorep --version
Score-P 7.0
returns: version string

Scalasca

scalasca_sampling_profiling.py

class reframechecks.scalasca.scalasca_sampling_profiling.SphExaScalascaProfilingCheck(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the test code with Scalasca (mpi only):

Three parameters can be set for the simulation:

Parameters
  • mpi_task – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpi_task, but cubesize could also be added to the list of parameters,

  • steps – number of simulation steps.

  • cycles – sampling sources generate interrupts that trigger a sample. SCOREP_SAMPLING_EVENTS sets the sampling source, see $EBROOTSCOREMINP/share/doc/scorep/html/sampling.html. Very large values will produce an unreliable performance report; very small values will incur a large runtime overhead.

Typical performance reporting:

PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_scalascaS+P_sqpatch_024mpi_001omp_100n_4steps_5000000cycles
- dom:gpu
   - PrgEnv-gnu
      * num_tasks: 24
      * Elapsed: 20.4549 s
      * _Elapsed: 38 s
      * domain_distribute: 0.4089 s
      * mpi_synchronizeHalos: 2.4644 s
      * BuildTree: 0 s
      * FindNeighbors: 1.8787 s
      * Density: 1.8009 s
      * EquationOfState: 0.0174 s
      * IAD: 3.726 s
      * MomentumEnergyIAD: 6.1141 s
      * Timestep: 3.5887 s
      * UpdateQuantities: 0.0424 s
      * EnergyConservation: 0.0177 s
      * SmoothingLength: 0.017 s
      * %MomentumEnergyIAD: 29.89 %
      * %Timestep: 17.54 %
      * %mpi_synchronizeHalos: 12.05 %
      * %FindNeighbors: 9.18 %
      * %IAD: 18.22 %
      * scorep_elapsed: 21.4262 s
      * %scorep_USR: 71.0 %
      * %scorep_MPI: 23.3 %
      * scorep_top1: 30.1 % (void sphexa::sph::computeMomentumAndEnergyIADImpl)
      * %scorep_Energy_exclusive: 30.112 %
      * %scorep_Energy_inclusive: 30.112 %
set_runflags()

scalasca_sampling_tracing.py

class reframechecks.scalasca.scalasca_sampling_tracing.SphExaScalascaTracingCheck(*args: Any, **kwargs: Any)

Bases: reframe.

This class runs the test code with Scalasca (mpi only):

Three parameters can be set for the simulation:

Parameters
  • mpitask – number of MPI tasks; the size of the cube in the 3D square patch test is set with a dictionary keyed on mpitask, but cubesize could also be added to the list of parameters,

  • steps – number of simulation steps.

  • cycles – sampling sources generate interrupts that trigger a sample. SCOREP_SAMPLING_EVENTS sets the sampling source, see $EBROOTSCOREMINP/share/doc/scorep/html/sampling.html. Very large values will produce an unreliable performance report; very small values will incur a large runtime overhead.

Typical performance reporting:

PERFORMANCE REPORT
------------------------------------------------------------------------------
sphexa_scalascaS+T_sqpatch_024mpi_001omp_100n_4steps_5000000cycles
- dom:gpu
   - PrgEnv-gnu
      * num_tasks: 24
      * Elapsed: 20.5242 s
      * _Elapsed: 28 s
      * domain_distribute: 0.4712 s
      * mpi_synchronizeHalos: 2.4623 s
      * BuildTree: 0 s
      * FindNeighbors: 1.8752 s
      * Density: 1.8066 s
      * EquationOfState: 0.0174 s
      * IAD: 3.7259 s
      * MomentumEnergyIAD: 6.1355 s
      * Timestep: 3.572 s
      * UpdateQuantities: 0.0273 s
      * EnergyConservation: 0.0079 s
      * SmoothingLength: 0.017 s
      * %MomentumEnergyIAD: 29.89 %
      * %Timestep: 17.4 %
      * %mpi_synchronizeHalos: 12.0 %
      * %FindNeighbors: 9.14 %
      * %IAD: 18.15 %
      * mpi_latesender: 2090 count
      * mpi_latesender_wo: 19 count
      * mpi_latereceiver: 336 count
      * mpi_wait_nxn: 1977 count
      * max_ipc_rk0: 1.294516 ins/cyc
      * max_rumaxrss_rk0: 127932 kilobytes
set_runflags()

Sanity checks

reframechecks.common.sphexa.sanity_scalasca.rpt_tracestats_mpi(obj)

Reports MPI statistics (mpi_latesender, mpi_latesender_wo, mpi_latereceiver, mpi_wait_nxn, mpi_nxn_completion) by reading the stat_rpt (trace.stat) file reported by the tool. Columns are (for each PatternName): Count Mean Median Minimum Maximum Sum Variance Quartil25 and Quartil75. Count (=second column) is used here.

Typical performance reporting:

> sphexa_scalascaS+T_sqpatch_024mpi_001omp_100n_4steps_5000000cycles/scorep_sqpatch_24_trace/trace.stat

PatternName               Count      Mean    Median      Minimum      Maximum      Sum     Variance    Quartil25    Quartil75
mpi_latesender                  2087 0.0231947 0.0024630 0.0000001623 0.2150408312 48.4074203231 0.0029201162 0.0007851545 0.0067652356
mpi_latesender_wo                 15 0.0073685 0.0057757 0.0011093750 0.0301200084 0.1105282126 0.0000531833 0.0025275522 0.0104651418
mpi_latereceiver                 327 0.0047614 0.0000339 0.0000362101 0.0139404002 1.5569782562 0.0000071413 0.0000338690 0.0000338690
mpi_wait_nxn                    1978 0.0324812 0.0002649 0.0000000015 0.7569314451 64.2478221177 0.0164967346 0.0001135482 0.0004163433
mpi_nxn_completion              1978 0.0000040 0.0001135 0.0000000008 0.0000607473 0.0078960137 0.0000000001 0.0000378494 0.0001892469
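
A sketch of reading the Count column out of trace.stat (pattern names and
layout as above):

import re

def trace_stat_counts(path: str) -> dict:
    """Map each mpi_* PatternName to its Count (second column)."""
    counts = {}
    with open(path) as fp:
        for line in fp:
            m = re.match(r'(mpi_\w+)\s+(\d+)', line)
            if m:
                counts[m.group(1)] = int(m.group(2))
    return counts
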
reframechecks.common.sphexa.sanity_scalasca.rpt_tracestats_omp(obj)

Reports OpenMP statistics by reading the trace.stat file:
  • omp_ibarrier_wait: OMP Wait at Implicit Barrier (sec) in Cube GUI
  • omp_lock_contention_critical: OMP Critical Contention (sec) in Cube GUI

Each column (Count Mean Median Minimum Maximum Sum Variance Quartil25 and Quartil75) is read; only Sum is reported here.

reframechecks.common.sphexa.sanity_scalasca.scalasca_mpi_pct(obj)

MPI % reported by Scalasca (scorep.score, notice no hits column)

type max_buf[B]  visits time[s] time[%] time/visit[us]  region
ALL  6,529,686 193,188   28.13   100.0         145.63  ALL
OMP  6,525,184 141,056   27.33    97.1         193.74  OMP
MPI      4,502      73    0.02     0.1         268.42  MPI
                                 *****
reframechecks.common.sphexa.sanity_scalasca.scalasca_omp_pct(obj)

OpenMP % reported by Scalasca (scorep.score, notice no hits column)

type max_buf[B]  visits time[s] time[%] time/visit[us]  region
ALL  6,529,686 193,188   28.13   100.0         145.63  ALL
OMP  6,525,184 141,056   27.33    97.1         193.74  OMP
                                 *****
MPI      4,502      73    0.02     0.1         268.42  MPI

Extrae

extrae.py

Sanity checks

reframechecks.common.sphexa.sanity_extrae.create_sh(obj)

Creates a wrapper script to insert the Extrae libraries (with LD_PRELOAD) into the executable at runtime
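
A minimal sketch of such a wrapper generator; the library name
(libmpitrace.so, Extrae's MPI C tracing library) and the configuration
file name are illustrative and depend on the installation:

import os
import stat

def create_sh(exe: str, wrapper: str = './extrae.sh') -> str:
    """Write a script that LD_PRELOADs Extrae and execs the application."""
    lines = ['#!/bin/bash',
             'export EXTRAE_CONFIG_FILE=./extrae.xml',
             'export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitrace.so',
             f'exec {exe} "$@"']
    with open(wrapper, 'w') as fp:
        fp.write('\n'.join(lines) + '\n')
    os.chmod(wrapper, os.stat(wrapper).st_mode | stat.S_IEXEC)
    return wrapper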

reframechecks.common.sphexa.sanity_extrae.extrae_version(obj)

Checks the tool’s version. As there is no --version flag available, the version is read from extrae_version.h and compared to our reference:

> cat $EBROOTEXTRAE/include/extrae_version.h
#define EXTRAE_MAJOR 3
#define EXTRAE_MINOR 7
#define EXTRAE_MICRO 1
returns: True or False
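
A sketch of the header parsing (path and reference version are
illustrative):

import re

def extrae_version(header: str) -> tuple:
    """Return (major, minor, micro) read from extrae_version.h."""
    txt = open(header).read()
    ver = dict(re.findall(r'#define EXTRAE_(MAJOR|MINOR|MICRO)\s+(\d+)',
                          txt))
    return tuple(int(ver[k]) for k in ('MAJOR', 'MINOR', 'MICRO'))

# extrae_version('.../include/extrae_version.h') == (3, 7, 1)
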
reframechecks.common.sphexa.sanity_extrae.rpt_mpistats(obj)

Reports statistics (histogram of MPI communications) from the comms.dat file

#_of_comms %_of_bytes_sent # histogram bin
466        0.00            # 10 B
3543       0.25            # 100 B
11554     11.69            # 1 KB
29425     88.05            # 10 KB
0          0.00            # 100 KB
0          0.00            # 1 MB
0          0.00            # 10 MB
0          0.00            # >10 MB
reframechecks.common.sphexa.sanity_extrae.tool_reference_scoped_d(obj)

Sets the tool’s perf_reference values, shared between the tests.

mpiP

mpip.py

Sanity checks

class reframechecks.common.sphexa.sanity_mpip.MpipBaseTest(*args: Any, **kwargs: Any)[source]

Bases: reframe.

mpip_sanity_patterns()

Checks tool’s version:

> cat ./sqpatch.exe.6.31820.1.mpiP
@ mpiP
@ Command : sqpatch.exe -n 62 -s 1
@ Version : 3.4.2  <-- 57fc864
set_basic_perf_patterns()

A set of basic perf_patterns shared between the tests

set_mpip_perf_patterns()

More perf_patterns for the tool

-----------------------------------
@--- MPI Time (seconds) -----------
-----------------------------------
Task    AppTime    MPITime     MPI%
   0        8.6      0.121     1.40 <-- min
   1        8.6      0.157     1.82
   2        8.6       5.92    68.84 <-- max
   *       25.8        6.2    24.02 <---

=> NonMPI= AppTime - MPITime

Typical performance reporting:

* mpip_avg_app_time: 8.6 s  (= 25.8/3mpi)
* mpip_avg_mpi_time: 2.07 s (=  6.2/3mpi)
* %mpip_avg_mpi_time: 24.02 %
* %mpip_avg_non_mpi_time: 75.98 %
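
The arithmetic behind these numbers, using the '*' (aggregate) row of the
table above:

num_tasks = 3
app_time, mpi_time = 25.8, 6.2       # AppTime / MPITime of the '*' row
mpip_avg_app_time = app_time / num_tasks   # 8.6 s
mpip_avg_mpi_time = mpi_time / num_tasks   # ~2.07 s
pct_mpi = 100 * mpi_time / app_time        # ~24.0 % (24.02 in the report)
pct_non_mpi = 100 - pct_mpi                # ~75.98 %
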
reframechecks.common.sphexa.sanity_mpip.mpip_perf_patterns(obj, reg)

More perf_patterns for the tool

-----------------------------------
@--- MPI Time (seconds) -----------
-----------------------------------
Task    AppTime    MPITime     MPI%
   0        8.6      0.121     1.40 <-- min
   1        8.6      0.157     1.82
   2        8.6       5.92    68.84 <-- max
   *       25.8        6.2    24.02 <---

=> NonMPI= AppTime - MPITime

Typical performance reporting:

* mpip_avg_app_time: 8.6 s  (= 25.8/3mpi)
* mpip_avg_mpi_time: 2.07 s (=  6.2/3mpi)
* %mpip_avg_mpi_time: 24.02 %
* %max/%min
* %mpip_avg_non_mpi_time: 75.98 %

Perftools

patrun.py

Sanity checks

class reframechecks.common.sphexa.sanity_perftools.PerftoolsBaseTest(*args: Any, **kwargs: Any)[source]

Bases: reframe.

patrun_energy_power()

This table shows the program energy and power usage (from Cray PM), in total and per compute node.

Table 8:  Program energy and power usage (from Cray PM)

   Node |     Node |   Process | Node Id
 Energy |    Power |      Time |  PE=HIDE
    (J) |      (W) |           |

  7,891 |  692.806 | 11.389914 | Total    <---
|-- --------------------------------------
|  2,076 |  182.356 | 11.384319 | nid.7
|  1,977 |  173.548 | 11.391657 | nid.4
|  1,934 |  169.765 | 11.392220 | nid.6
|  1,904 |  167.143 | 11.391461 | nid.5
|========================================
Typical output:
  • patrun_avg_power: 692.806 W
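
A sketch of extracting the Total row from Table 8 above:

import re

def patrun_energy_power(rpt: str) -> tuple:
    """Return (energy_J, power_W) from the 'Total' row of Table 8."""
    m = re.search(r'([\d,]+)\s+\|\s+(\S+)\s+\|\s+\S+\s+\|\s+Total', rpt)
    return int(m.group(1).replace(',', '')), float(m.group(2))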

patrun_hotspot1_mpi()

Reports the top MPI hotspot (% of samples) from the profile by function:

Table 1:  Profile by Function

  Samp% |    Samp |  Imb. |  Imb. | Group
        |         |  Samp | Samp% |  Function
        |         |       |       |   PE=HIDE

 100.0% | 1,126.4 |    -- |    -- | Total
...
||=================================================
|   9.9% |   111.4 |    -- |    -- | MPI
||-------------------------------------------------
||   5.2% |    58.2 | 993.8 | 95.5% | MPI_Allreduce <--
||   3.6% |    40.9 | 399.1 | 91.7% | MPI_Recv

patrun_hwpc()

This table shows HW performance counter data for the whole program, averaged across ranks or threads, as applicable.

Table 4:  Program HW Performance Counter Data
  ...
  Thread Time                                          11.352817 secs
  UNHALTED_REFERENCE_CYCLES                        28,659,167,096
  CPU_CLK_THREAD_UNHALTED:THREAD_P                 34,170,540,119
  DTLB_LOAD_MISSES:WALK_DURATION                       61,307,848
  INST_RETIRED:ANY_P                               22,152,242,298
  RESOURCE_STALLS:ANY                              19,793,119,676
  OFFCORE_RESPONSE_0:ANY_REQUEST:L3_MISS_LOCAL         20,949,344
  CPU CLK Boost                                              1.19 X
  Resource stall cycles / Cycles  -->                       57.9%
  Memory traffic GBytes           -->       0.118G/sec       1.34 GB
  Local Memory traffic GBytes               0.118G/sec       1.34 GB
  Memory Traffic / Nominal Peak                              0.2%
  DTLB Miss Ovhd                       61,307,848 cycles  0.2% cycles
  Retired Inst per Clock          -->                        0.65
==============================================================================
Typical output:
  • patrun_memory_traffic: 1.34 GB

  • patrun_ipc: 0.65

  • %patrun_stallcycles: 57.9 %

patrun_imbalance()

Load imbalance from the CSV report

Table 1:  Load Balance with MPI Message Stats

patrun_memory_bw()

This table shows memory traffic to local and remote memory for numa nodes, taking for each numa node the maximum value across nodes.

Table 9:  Memory Bandwidth by Numanode

  Memory |   Local |    Thread |  Memory |  Memory | Numanode
 Traffic |  Memory |      Time | Traffic | Traffic |  Node Id
  GBytes | Traffic |           |  GBytes |       / |   PE=HIDE
         |  GBytes |           |   / Sec | Nominal |
         |         |           |         |    Peak |
|--------------------------------------------------------------
|   33.64 |   33.64 | 11.360701 |    2.96 |    4.3% | numanode.0
||-------------------------------------------------------------
||   33.64 |   33.64 | 11.359413 |    2.96 |    4.3% | nid.4
||   33.59 |   33.59 | 11.359451 |    2.96 |    4.3% | nid.6
||   33.24 |   33.24 | 11.360701 |    2.93 |    4.3% | nid.5
||   28.24 |   28.24 | 11.355006 |    2.49 |    3.6% | nid.7
|==============================================================

2 sockets:
Table 10:  Memory Bandwidth by Numanode

  Memory |   Local |  Remote | Thread |  Memory |  Memory | Numanode
 Traffic |  Memory |  Memory |   Time | Traffic | Traffic |  Node Id
  GBytes | Traffic | Traffic |        |  GBytes |       / |   PE=HIDE
         |  GBytes |  GBytes |        |   / Sec | Nominal |
         |         |         |        |         |    Peak |
|-------------------------------------------------------------------
|   11.21 |   10.99 |    0.22 | 3.886926 |  2.88 | 3.8% | numanode.0
||------------------------------------------------------------------
||   11.21 |   10.99 |    0.22 | 3.886926 | 2.88 |3.8% | nid.407
||   10.47 |   10.27 |    0.20 | 3.886450 | 2.69 |3.5% | nid.416
||==================================================================
|   11.29 |   11.06 |    0.23 | 3.889932 |  2.90 | 3.8% | numanode.1
||------------------------------------------------------------------
||   11.29 |   11.06 |    0.23 | 3.889932 | 2.90 |3.8% | nid.407
||   10.09 |    9.88 |    0.20 | 3.885858 | 2.60 |3.4% | nid.416
|===================================================================
Typical output:
  • patrun_memory_traffic_global: 33.64 GB

  • patrun_memory_traffic_local: 33.64 GB

  • %patrun_memory_traffic_peak: 4.3 %

patrun_num_of_compute_nodes()

Extract the number of compute nodes to compute averages

> ls 96mpi/sqpatch.exe+8709-4s/xf-files/:
  000004.xf
  000005.xf
  000006.xf
  000007.xf
Typical output:
  • patrun_cn: 4
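
A sketch of the directory scan (the xf-files path is illustrative):

import glob
import os

def patrun_cn(xf_dir: str) -> int:
    """One .xf file is written per compute node; count them."""
    return len(glob.glob(os.path.join(xf_dir, '*.xf')))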

patrun_samples()

Elapsed time (in samples) reported by the tool:

Table 1:  Profile by Function

  Samp% |  Samp |  Imb. |  Imb. | Group
        |       |  Samp | Samp% |  Function
        |       |       |       |   PE=HIDE

 100.0% | 382.8 |    -- |    -- | Total
 TODO:
  Experiment:  samp_cs_time
  Sampling interval:  10000 microsecs

patrun_version()

Checks tool’s version:

> pat_run -V
CrayPat/X:  Version 20.08.0 Revision 28ef35c9f
patrun_walltime_and_memory()

This table shows the total wall clock time for the ranks with the maximum, mean, and minimum times, as well as the average across all ranks.

Table 10:  Wall Clock Time, Memory High Water Mark

   Process |   Process | PE=[mmm]
      Time |     HiMem |
           | (MiBytes) |

 11.389914 |      76.3 | Total    <-- avgt
|--------------------------------
| 11.398188 |      57.7 | pe.24   <-- maxt
| 11.389955 |      98.9 | pe.34
| 11.365630 |      54.0 | pe.93   <-- mint
|================================
Typical output:
  • patrun_wallt_max: 11.3982 s

  • patrun_wallt_avg: 11.3899 s

  • patrun_wallt_min: 11.3656 s

  • patrun_mem_max: 57.7 MiBytes

  • patrun_mem_min: 54.0 MiBytes

perftools_lite_memory()

# 20.10.0 / AMD
High Memory: 85,743.7 MiBytes    669.9 MiBytes per PE
# More -> pat_report -O himem exe+141047-1002s/index.ap2 > rpt.mem

set_tool_perf_patterns()

More perf_patterns for the tool

Typical performance reporting:

      * patrun_wallt_max: 18.7552 s
      * patrun_wallt_avg: 18.7445 s
      * patrun_wallt_min: 18.7213 s
      * patrun_mem_max: 60.1 MiBytes
      * patrun_mem_min: 53.8 MiBytes
      * patrun_memory_traffic_global: 53.95 GB
      * patrun_memory_traffic_local: 53.95 GB
      * %patrun_memory_traffic_peak: 4.2 %
      * patrun_memory_traffic: 2.15 GB
      * patrun_ipc: 0.64
      * %patrun_stallcycles: 58.0 %
      * %patrun_user: 84.7 % (slow: 1677.0 smp [pe14] / mean:1570.2 median:1630.0 / fast:26.0 [pe95])
      * %patrun_mpi: 11.1 % (slow: 1793.0 smp [pe94] / mean:205.9 median:146.0 / fast:91.0 [pe56])
      * %patrun_etc: 4.2 % (slow: 97.0 smp [pe63] / mean:78.3 median:78.5 / fast:38.0 [pe93])
      * %patrun_total: 100.0 % (slow: 1862.0 smp [pe92] / mean:1854.4 median:1854.0 / fast:1835.0 [pe5])
      * %patrun_user_slowest: 90.5 % (pe.14)
      * %patrun_mpi_slowest: 5.6 % (pe.14)
      * %patrun_etc_slowest: 3.9 % (pe.14)
      * %patrun_user_fastest: 1.4 % (pe.95)
      * %patrun_mpi_fastest: 96.3 % (pe.95)
      * %patrun_etc_fastest: 2.3 % (pe.95)
      * %patrun_avg_usr_reported: 84.5 %
      * %patrun_avg_mpi_reported: 11.1 %
      * %patrun_avg_etc_reported: 4.4 %
      * %patrun_hotspot1: 34.7 % (sphexa::sph::computeMomentumAndEnergyIADImpl<>)
      * %patrun_mpi_h1: 6.6 % (MPI_Allreduce)
      * %patrun_mpi_h1_imb: 94.1 % (MPI_Allreduce)
      * patrun_avg_energy: 3274.0 J
      * patrun_avg_power: 174.665 W