5. Performance analysis

This section covers the use of tools on ARCHER to analyse the performance of your code, both in terms of the serial performance (percentage of peak, memory usage, etc.) and in terms of the parallel performance (load balance, time in MPI routines, etc.).

A number of tools installed on ARCHER can help you understand the performance characteristics of your code, and some will even offer suggestions on how to improve performance.

5.1 Performance Analysis Tools

These tools include

  • Cray Performance Analysis Tool (CrayPat)
  • Allinea MAP
  • TAU
  • Valgrind

5.2 Cray Performance Analysis Tool (CrayPat)

CrayPat consists of two command line tools which are used to profile your code:

  • pat_build adds instrumentation to your code so that when you run the instrumented binary the profiling information is stored on disk;
  • pat_report reads the profiling information on disk and compiles it into human-readable reports.

The analysis produced by pat_report can be visualised using the Cray Apprentice2 tool app2.

CrayPat can perform two types of performance analysis: sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on these statistics. A tracing experiment explicitly monitors the code performance within named routines. Typically, the overhead associated with a tracing experiment is higher than that associated with a sampling experiment but provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.

Detailed documentation on CrayPat is available from Cray.

5.2.1 Instrumenting a code with pat_build

Often the best way to analyse the performance is to run a sampling experiment and then use the results from this to perform a focused tracing experiment. CrayPat contains the functionality to automate this process, known as Automatic Profiling Analysis (APA).

The example below illustrates how to do this.

5.2.1.1 Example: Profiling the CASTEP Code

Here is a step-by-step example of instrumenting and profiling the performance of a code (CASTEP) using CrayPat.

The first step is to compile your code with the profiling tools attached. First load the Cray perftools module with

module load perftools

Next, compile the CASTEP code in the standard way on the /work file-system (which is accessible on the compute nodes). Once you have the CASTEP executable you need to instrument it for a sampling experiment. The following command will produce the castep+samp instrumented executable

pat_build -o castep+samp castep

The default instrumentation is a sampling experiment with Automatic Profiling Analysis (equivalent to the option -O apa).
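The same instrumentation can also be requested explicitly; the following is equivalent to the default shown above:

```shell
# Explicitly request a sampling experiment with Automatic Profiling Analysis
pat_build -O apa -o castep+samp castep
```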

Run your instrumented program as you usually would in the job submission system at your site. For example

#!/bin/bash --login
#PBS -N castep_profile
#PBS -l select=32
#PBS -l walltime=1:00:00
# Replace [budget code] below with your project code
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Run the sampling experiment
aprun -n 768 castep+samp input

When the job completes successfully the directory you submitted the job from will either contain a .xf file (if you used a small number of cores) with a name like castep+samp+25370-14s.xf or a directory (for large core counts) with a name like castep+samp+25370-14s. The actual name is dependent on the name of your instrumented executable and the process ID. In this case we used 32 nodes so we have a directory.

The next step is to produce a basic performance report and also the Automatic Profiling Analysis input for pat_build to create a tracing experiment. This is done (usually interactively) by using the pat_report command (you must have the perftools module loaded to use the pat_report command). In our example we would type

pat_report -o castep_samp_1024.pat castep+samp+25370-14s 

This will produce a text report (in this case, castep_samp_1024.pat) listing the various routines in the program and how much time was spent in each one during the run (see below for a discussion on the contents of this file). It also produces a castep+samp+25370-14s.ap2 file and a castep+samp+25370-14s.apa file.

  • The .ap2 file can be used with the Cray Apprentice2 visualisation tool to get an alternative view of the code performance.
  • The .apa file can be used as the input to pat_build to produce a tracing experiment focused on the routines that take most time (as determined in the previous sampling experiment). We illustrate this below.

An .apa file may not be produced if the program is very short, or mainly calls library functions, with little processing in the program itself. There will be a message at the end of the text report like

    No .apa file with pat_build option suggestions was generated
    because no samples appear to have been taken in USER functions.

For information on interpreting the results in the sampling experiment text report, see the section below.

To produce a focused tracing experiment based on the results from the sampling experiment above we would use the pat_build command with the .apa file produced above. (Note, there is no need to change directory to the directory containing the original binary to run this command.)

pat_build -O castep+samp+25370-14s.apa -o castep+apa

This will generate a new instrumented version of the CASTEP executable (castep+apa) in the current working directory. You can then submit a job (as outlined above) making sure that you reference this new executable. This will then run a tracing experiment based on the options in the .apa file.

Once again, you will find that your job generates a .xf file (or a directory) which you can analyse using the pat_report command. Please see the section below for information on analysing tracing experiments.
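The complete sampling-then-tracing workflow described above can be sketched as follows (the executable name and the experiment directory name are illustrative; yours will differ):

```shell
module load perftools

# 1. Build the sampling experiment (default, equivalent to -O apa)
pat_build -o castep+samp castep
#    ... run castep+samp via the batch system ...

# 2. Produce the text report plus the .apa suggestions file
pat_report -o castep_samp.pat castep+samp+25370-14s

# 3. Build the focused tracing experiment from the suggestions
pat_build -O castep+samp+25370-14s.apa -o castep+apa
#    ... run castep+apa via the batch system, then pat_report again ...
```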

5.2.2 Analysing profile results using pat_report

The pat_report command can produce many different reports from the profile data. You can select a predefined report with the -O flag to pat_report. Some of the most generally useful predefined report types are

  • ca+src - Show the callers (bottom-up view) leading to the routines that have a high use in the report and include source code line numbers for the calls and time-consuming statements.
  • load_balance - Show load-balance statistics for the high-use routines in the program. Parallel processes with minimum, maximum and median times for routines will be displayed. Only available with tracing experiments.
  • mpi_callers - Show MPI message statistics. Only available with tracing experiments.

Multiple options can be specified to the -O flag if they are separated by commas. For example

pat_report -O ca+src,load_balance -o castep_furtherinfo_1024.pat castep+samp+25370-14s

You can also define your own custom reports using the -b and -d flags to pat_report. Details on how to do this can be found in the pat_report documentation and man pages. The output from the predefined report types (described above) show the settings of the -b and -d flags that were used to generate the report. You can use these examples as the basis for your own custom reports.
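As a hedged illustration, a custom report reproducing the columns of the sampling table shown below might be requested as follows; the exact -b and -d keywords should be checked against the settings printed at the top of a predefined report:

```shell
# Group by group/function, hide the per-PE breakdown, and show sample
# counts plus imbalance columns (keywords here are illustrative)
pat_report -b group,function,pe=HIDE \
           -d samp%,samp,imb_samp,imb_samp% \
           -o castep_custom.pat castep+samp+25370-14s
```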

Below, we show examples of the output from sampling and tracing experiments for the pure MPI version of CASTEP. A report from your own code will differ in the routines and times shown. Also, if your code uses OpenMP threads, SHMEM, or CAF/UPC then you will also see different output.

5.2.2.1 Example: results from a sampling experiment on CASTEP

In a sampling experiment, the main summary table produced by pat_report will look something like (of course, your code will contain different functions and subroutines)

Table 1:  Profile by Function

   Samp%  |  Samp  |  Imb.  |  Imb.  |Group 
          |        |  Samp  | Samp%  | Function 
          |        |        |        |  PE=HIDE 
       
   100.0% | 6954.6 |     -- |     -- |Total
|------------------------------------------------------------------------
|   46.0% | 3199.8 |     -- |     -- |MPI
||-----------------------------------------------------------------------
||  12.9% |  897.0 |  302.0 |  25.6% |mpi_bcast
||  12.9% |  896.4 |   78.6 |   8.2% |mpi_alltoallv
||   8.7% |  604.2 |  139.8 |  19.1% |MPI_ALLREDUCE
||   5.4% |  378.6 | 1221.4 |  77.5% |MPI_SCATTER
||   4.9% |  341.6 |  320.4 |  49.2% |MPI_BARRIER
||   1.1% |   75.8 |  136.2 |  65.2% |mpi_gather
||=======================================================================
|   38.8% | 2697.9 |     -- |     -- |USER
||-----------------------------------------------------------------------
||   9.4% |  652.1 |  171.9 |  21.2% |comms_transpose_exchange.3720
||   6.0% |  415.2 |   41.8 |   9.3% |zgemm_kernel_n
||   3.4% |  237.6 |   29.4 |  11.2% |zgemm_kernel_l
||   3.1% |  215.6 |   24.4 |  10.3% |zgemm_otcopy
||   1.7% |  119.8 |   20.2 |  14.6% |zgemm_oncopy
||   1.3% |   92.3 |   11.7 |  11.4% |zdotu_k
||=======================================================================
|   15.2% | 1057.0 |     -- |     -- |ETC
||-----------------------------------------------------------------------
||   2.0% |  140.5 |   32.5 |  19.1% |__ion_MOD_ion_q_recip_interpolation
||   2.0% |  139.8 |  744.2 |  85.5% |__remainder_piby2
||   1.1% |   78.3 |   32.7 |  29.9% |__ion_MOD_ion_apply_and_add_ylm
||   1.1% |   73.5 |   35.5 |  33.1% |__trace_MOD_trace_entry
|========================================================================

You can see that CrayPat splits the results from a sampling experiment into three sections

  • MPI - samples accumulated in message passing routines
  • USER - samples accumulated in user routines. These are usually the functions and subroutines that are part of your program.
  • ETC - samples accumulated in library routines.

The first two columns give the percentage and number of samples (the mean computed across all parallel tasks) and indicate the functions in which the sampling experiment spent the most time. Columns 3 and 4 give an indication of the difference between the minimum and maximum number of samples found across the parallel tasks; this, in turn, indicates where load-imbalance occurs in the program. Note that load-imbalance may appear in routines where insignificant amounts of time are spent, in which case it may not actually be significant (for example, the large imbalance seen in mpi_gather in the example above).

In the example above, the largest amounts of time are spent in the MPI routines MPI_BCAST, MPI_ALLTOALLV and MPI_ALLREDUCE; in the program routine comms_transpose_exchange; and in the BLAS routine zgemm.

To narrow down where these time consuming calls are in the program we can use the ca+src report option to pat_report to get a view of the split of samples between the different calls to the time consuming routines in the program.

pat_report -O ca+src -o castep_calls_1024.pat castep+samp+25370-14s 

An extract from the report produced by the above command looks like

Table 1:  Profile by Function and Callers, with Line Numbers

    Samp%  |  Samp  |Group 
           |        | Function 
           |        |  Caller 
           |        |   PE=HIDE 
     
    100.0% | 6954.6 |Total
|--------------------------------------------------------------------------------------
--snip--
|    38.8% | 2697.9 |USER
||-------------------------------------------------------------------------------------
||    9.4% |  652.1 |comms_transpose_exchange.3720
|||------------------------------------------------------------------------------------
3||   3.7% |  254.4 |__comms_MOD_comms_transpose_n:comms.mpi.F90:line.17965
4||        |        | __comms_MOD_comms_transpose:comms.mpi.F90:line.17826
5||        |        |  __fft_MOD_fft_3d:fft.fftw3.F90:line.482
6||   3.1% |  215.0 |   __basis_MOD_basis_real_recip_reduced_one:basis.F90:line.4104
7||        |        |    _s_wave_MOD_wave_real_to_recip_wv_bks:wave.F90:line.24945
8||        |        |     __pot_MOD_pot_nongamma_apply_wvfn_slice:pot.F90:line.2640
9||        |        |      __pot_MOD_pot_apply_wvfn_slice:pot.F90:line.2550
10|        |        |       __electronic_MOD_electronic_apply_h_recip_wv:electronic.F90:line.4364
11|        |        |        __electronic_MOD_electronic_apply_h_wv:electronic.F90:line.4057
12|   3.0% |  211.4 |         __electronic_MOD_electronic_minimisation:electronic.F90:line.1957
13|   3.0% |  205.6 |          __geometry_MOD_geom_get_forces:geometry.F90:line.6299
14|   2.8% |  195.2 |           __geometry_MOD_geom_bfgs:geometry.F90:line.2199
15|        |        |            __geometry_MOD_geometry_optimise:geometry.F90:line.1051
16|        |        |             MAIN__:castep.f90:line.1013
17|        |        |              main:castep.f90:line.623
--snip--

This indicates that (in the USER routines) the part of the comms_transpose_exchange routine in which the program is spending most time is around line 3720 (in many cases the time-consuming line will be the start of a loop). Furthermore, we can see that approximately one third of the samples in the routine (3.7%) are originating from the call to comms_transpose_exchange in the stack shown. Here, it looks like they are part of the geometry optimisation section of CASTEP. The remaining two-thirds of the time will be indicated further down in the report.

In the original report we also get a table on the wall clock time of the program, reporting the minimum, maximum and median values across the full set of parallel tasks. The table also lists the memory high water mark of each of the parallel tasks:

Table 3:  Wall Clock Time, Memory High Water Mark

    Process  |  Process  |PE=[mmm] 
       Time  |    HiMem  |
             | (MBytes)  |
         
  124.409675 |   100.761 |Total
|---------------------------------
| 126.922418 |   101.637 |pe.39
| 124.329474 |   102.336 |pe.2
| 122.241495 |   101.355 |pe.4
|=================================

If you see a very large difference in the minimum and maximum wall clock time for different parallel tasks it indicates that there is a large amount of load-imbalance in your code and it should be investigated more thoroughly.

5.2.2.2 Example: results from a tracing experiment on CASTEP

To be added.

5.2.3 Using hardware counters

CrayPat can also provide performance counter data during tracing experiments. The PAT_RT_PERFCTR environment variable can be set in your job script to specify hardware, network, and power management events to be monitored. Please refer to the CrayPat manpage and the CrayPat documentation for information about using the performance counters.
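For example, a job script for a tracing experiment might request a small set of PAPI events before launching the instrumented binary. The event names below are illustrative; consult the CrayPat documentation for the counters available on ARCHER:

```shell
# Request L1 data-cache misses and total instructions (illustrative events)
export PAT_RT_PERFCTR=PAPI_L1_DCM,PAPI_TOT_INS
aprun -n 768 castep+apa input
```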

5.2.3.1 Example: Performance counter results from CASTEP

To be added.

5.3 Cray Reveal

Reveal is a performance analysis tool from Cray that guides you in improving the performance of a Cray-compiled application code. Specifically, Reveal analyses the loops within your code and identifies which ones are less than fully optimised and why. In addition, Reveal can gather performance statistics that rank the loops according to execution time, showing you the best places in the source code to apply optimisations. An overview of Reveal can be found in the Cray document Using Cray Performance Measurement and Analysis Tools. The ARCHER CSE team have also produced a Reveal screencast that walks through a straightforward example of running Reveal on an MPI application code, making the code run faster, and adding parallelism by introducing OpenMP directives. Further details on the OpenMP aspects of the Reveal tool can be found in this Cray webinar.

5.4 Allinea MAP

Allinea MAP can be used to profile MPI code only; it does not support detailed profiling of OpenMP, SHMEM, CAF, or UPC. Profiling using Allinea MAP consists of four steps:

  1. Create a MAP profiling version of your MPI library.
  2. Re-compile your code, linking to the MAP MPI profiling library.
  3. Run your code to collect profiling information.
  4. Visualise the profiling information using the MAP Remote Client.

5.4.1 Create a MAP profiling MPI library

You should first load the allinea module to give you access to all the MAP executables

module load allinea

The command you use to create the profiling version of your MPI library differs depending on whether you are linking statically or dynamically. On ARCHER the default is to link statically, so you should use the command for static libraries unless you are definitely linking dynamically.

For statically linked applications the command is make-profiler-libraries and you need to supply a directory for the library files. For example

mjf@eslogin006:~/test> module load allinea
mjf@eslogin006:~/test> mkdir map-libraries
mjf@eslogin006:~/test> make-profiler-libraries --lib-type=static ./map-libraries 
Created the libraries in ./map-libraries:
   libmap-sampler.a
   libmap-sampler-pmpi.a

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -Wl,@./map-libraries/allinea-profiler.ld ... EXISTING_MPI_LIBRARIES
   If your link line specifies EXISTING_MPI_LIBRARIES (e.g. -lmpi), then
   these must appear *after* the Allinea sampler and MPI wrapper libraries in
   the link line.  There's a comprehensive description of the link ordering
   requirements in the 'Preparing a Program for Profiling' section of either
   userguide-forge.pdf or userguide-reports.pdf, located in
   /home/y07/y07/cse/allinea/forge/5.0.1-phase2/doc/.

mjf@eslogin006:~/test> ls -l ./map-libraries
total 6708
-rw-r--r-- 1 mjf z01     748 Feb 22 11:43 allinea-profiler.ld
-rw-r--r-- 1 mjf z01 1727928 Feb 22 11:43 libmap-sampler-pmpi.a
-rw-r--r-- 1 mjf z01 4785546 Feb 22 11:43 libmap-sampler.a

If you are linking dynamically then you still use the "make-profiler-libraries" command, but don't specify static building. For example

mjf@eslogin006:/work/z01/z01/mjf/test> module load allinea
mjf@eslogin006:/work/z01/z01/mjf/test> mkdir map-libraries
mjf@eslogin006:/work/z01/z01/mjf/test> make-profiler-libraries ./map-libraries
Created the libraries in ./map-libraries:
   libmap-sampler.so       (and .so.1, .so.1.0, .so.1.0.0)
   libmap-sampler-pmpi.so  (and .so.1, .so.1.0, .so.1.0.0)

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -dynamic -L./map-libraries -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr

Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your
program.

Before running your program (interactively or from a queue), set
LD_LIBRARY_PATH:
   export LD_LIBRARY_PATH=./map-libraries:$LD_LIBRARY_PATH
   mpirun  ...
or add -Wl,-rpath=./map-libraries when linking your program.

5.4.2 Re-compile the code, linking to the MAP MPI libraries

Recompile your code with the -g flag (-G2 can be used with native Cray Fortran) to ensure debugging information is available for the profiler. Then link using the set of options specified in the output from the make-profiler-libraries command that you used (note, the -Wl,@... option should be added at the end of the command for linking the executable, not the beginning).
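For a statically linked build, the compile and link lines might look like the following sketch (the compiler wrapper and source names are illustrative):

```shell
# Compile with debugging information for the profiler
ftn -g -O3 -c myapp.f90

# Link with the MAP profiler options at the END of the link line
ftn -o myapp.exe myapp.o -Wl,@./map-libraries/allinea-profiler.ld
```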

5.4.3 Run the code to collect profiling information

You can now run a job as normal to collect the profile information for your code. This involves writing a job submission script to run your code through MAP. MAP will produce a file containing all the profiling information which can then be examined using the Remote Client (as described below).

An example job submission script for running a profile across 24 MPI processes would be

#!/bin/bash --login
# 
#PBS -N map_test
#PBS -l select=2
#PBS -l walltime=1:0:0
#PBS -A t01

cd $PBS_O_WORKDIR

module load allinea

map -profile aprun -n 24 myapp.exe

This will produce a profile file with a name like

myapp_24p_2014-04-15-07.map

An example job submission script for running a profile across 24 MPI processes with 12 processes per node would be

#!/bin/bash --login
# 
#PBS -N map_test
#PBS -l select=2
#PBS -l walltime=1:0:0
#PBS -A t01

cd $PBS_O_WORKDIR

module load allinea

map -profile aprun -n 24 -N 12 -S 6 myapp.exe

You can specify a specific name for this output file using the -output option to map in the job script.
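For example, to choose the output file name explicitly (the file name here is illustrative):

```shell
map -profile -output myapp_profile.map aprun -n 24 myapp.exe
```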

If you are profiling a dynamically linked program, then you will need to set LD_LIBRARY_PATH in the job script.
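For a dynamically linked program, the job script would also export the profiler library path before the map command (the directory name matches the earlier make-profiler-libraries example):

```shell
export LD_LIBRARY_PATH=./map-libraries:$LD_LIBRARY_PATH
map -profile aprun -n 24 myapp.exe
```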

5.4.4 Visualise the profiling information

In order to visualise the data you should download and install the Allinea Remote Client on your desktop using these instructions.

Once remote access has been successfully set up you should be returned to the startup screen. This time select "Load Profile Data File". You will be given a dialogue box where you can select your profile file from the ARCHER file-system. Once you have selected your file MAP will assemble your results and you can visualise your profile data. The MAP interface should look something like

Allinea MAP profile screen

For more information on using the profiling interface, see the Allinea MAP documentation.

5.5 Tuning and Analysis Utilities (TAU)

TAU is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java and Python. The current version of TAU installed on ARCHER is v2.25.1.1, and the following makefiles are supported:

  • Makefile.tau-cray-mpi-pdt
  • Makefile.tau-cray-mpi-pdt-scorep
  • Makefile.tau-cray-mpi-python-pdt
  • Makefile.tau-cray-ompt-mpi-pdt-openmp
  • Makefile.tau-cray-papi-mpi-pdt
  • Makefile.tau-cray-papi-ompt-mpi-pdt-openmp

Equivalent makefiles also exist for the GNU and Intel programming environments.

Choose one of the makefiles and set the TAU_MAKEFILE environment variable to point at it (export TAU_MAKEFILE=$TAU_LIB/<makefile>, where <makefile> is one of the makefiles listed above) before compiling your code. Otherwise, use tau_exec for uninstrumented code.
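As a sketch, instrumenting an MPI Fortran code with TAU might look like the following. The makefile choice and the tau_f90.sh compiler wrapper follow standard TAU usage and should be checked against the ARCHER installation:

```shell
# Select an instrumentation configuration from the list above
export TAU_MAKEFILE=$TAU_LIB/Makefile.tau-cray-mpi-pdt

# Compile with the TAU wrapper instead of ftn
tau_f90.sh -o myapp.exe myapp.f90

# Run as usual; profile.* files are written to the working directory
aprun -n 24 ./myapp.exe
pprof            # text summary of the collected profiles
```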

Note, before using tau_exec, make sure you have built your application binary with the -dynamic link option. This is necessary because tau_exec preloads a dynamic shared object, libTAU.so, in the context of the executing application.

Full documentation can be found at http://www.cs.uoregon.edu/research/tau/docs.php. In addition, TAU workshop exercises can be found at /home/y07/shared/tau/workshop/ - please start with the readme file located within the workshop folder.

5.6 Valgrind

Unlike the previously mentioned profiling tools, which are specialised for analysing massively parallel programs, Valgrind is a suite of more generic performance analysis tools. It nonetheless offers useful functionality, for example for serial optimisation work or for deciding how best to parallelise your serial program. Valgrind is designed to be non-intrusive and to work directly with existing executables: you don't need to recompile, relink, or otherwise modify your program. For a list of Valgrind tools see http://valgrind.org/info/tools.html.

5.6.1 Memory profiling with Massif

Valgrind includes a tool called Massif that can be used to give insight into the memory usage of your program. It takes regular snapshots and outputs this data into a single file, which can be visualised to show the total amount of memory used as a function of time. This shows when peaks and bottlenecks occur and allows you to identify which data structures in your code are responsible for the largest memory usage of your program.

Documentation explaining how to use Massif is available in the official Massif manual. In short, you should run your executable as follows:

  valgrind --tool=massif my_executable

The memory profiling data will be output into a file called massif.out.pid, where pid is the runtime process ID of your program. A custom filename can be chosen using the --massif-out-file option, as follows:

  valgrind --tool=massif --massif-out-file=optional_filename.out my_executable

The output file contains raw profiling statistics. To view a summary including a graphical plot of memory usage over time, use the ms_print command as follows:

  ms_print massif.out.12345

or, to save to a file:

  ms_print massif.out.12345 > massif.analysis.12345

This will show total memory usage over time as well as a breakdown of the top data structures contributing to memory usage at each snapshot where there has been a significant allocation or deallocation of memory.

5.7 General hints for interpreting profiling results

There are several general tips for using the results from the performance analysis

  • Look inside the routines where most of the time is spent to see if they can be optimised in any way. The ct+src report option can be useful to determine which regions of a function are using the most time.
  • Look for load-imbalance in the code. This is indicated by a large difference in computing time between different parallel tasks.
  • High values of time spent in MPI usually indicate something wrong in the code: load-imbalance, a bad communication pattern, or simply that the code does not scale to the number of cores used.
  • High values of cache misses (seen via hardware counters) could mean a bad memory access in an array or loop. Examine the arrays and loops in the problematic function to see if they can be reorganised to access the data in a more cache-friendly way.

The Tuning section in this guide has a lot of information on how to optimise both serial and parallel code. Combining profiling information with the optimisation techniques is the best way to improve the performance of your code.

5.7.1 Spotting load-imbalance

By far the largest obstacle to scaling applications to large numbers of parallel tasks is the presence of load-imbalance in the parallel code. Spotting the symptoms of load-imbalance in performance analysis is key to optimising the scaling performance of your code. Typical symptoms include

For MPI codes
large amounts of time spent in MPI_Barrier or MPI collectives (which include an implicit barrier).
For PGAS codes
large amounts of time spent in synchronisation functions.

When running a CrayPat tracing experiment with -g mpi the MPI time will be split into computation time (MPI group) and synchronisation time (MPI_SYNC group). A high synchronisation time can indicate problems with load imbalance in an MPI code.
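Such a tracing experiment with MPI tracing enabled can be built as follows:

```shell
# Trace MPI functions explicitly; reported time is then split into the
# MPI group (communication) and the MPI_SYNC group (waiting at barriers
# and collectives)
pat_build -g mpi -o castep+mpi castep
```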

The location of the accumulation of load-imbalance time often gives an indication of where in your program the load-imbalance is occurring.