5. Performance analysis
- 5.1 Performance Analysis Tools
- 5.2 Cray Performance Analysis Tool (CrayPat)
- 5.3 Cray Reveal
- 5.4 Allinea MAP
- 5.5 Tuning and Analysis Utilities (TAU)
- 5.6 Valgrind
- 5.7 General hints for interpreting profiling results
This section covers the use of tools on ARCHER to analyse the performance of your code, both in terms of the serial performance (percentage of peak, memory usage, etc.) and in terms of the parallel performance (load balance, time in MPI routines, etc.).
There are a number of tools installed on ARCHER available to help you understand the performance characteristics of your code and some will even offer suggestions on how to improve performance.
5.1 Performance Analysis Tools
These tools include:
- Cray Performance Analysis Tool (CrayPat)
- Allinea MAP
5.2 Cray Performance Analysis Tool (CrayPat)
CrayPat consists of two command line tools which are used to profile your code:
- pat_build adds instrumentation to your code so that when you run the instrumented binary the profiling information is stored on disk;
- pat_report reads the profiling information on disk and compiles it into human-readable reports.
The analysis produced by pat_report can be visualised using the Cray Apprentice2 tool.
CrayPat can perform two types of performance analysis: sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on these statistics. A tracing experiment explicitly monitors the code performance within named routines. Typically, the overhead associated with a tracing experiment is higher than that associated with a sampling experiment but provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.
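The distinction between the two experiment types can be illustrated with a toy model (plain Python, nothing to do with CrayPat itself): tracing records exact per-routine times, while sampling only counts how often a periodic probe lands in each routine, so its percentages are statistical estimates that improve the longer the run.

```python
# Toy illustration (not CrayPat itself): how a sampling profiler
# approximates the exact per-routine times a tracing profiler records.
from collections import Counter

# Synthetic "run": each entry is (routine, time spent in ms)
timeline = [("init", 5), ("solve", 70), ("mpi_allreduce", 20), ("io", 5)]

# Tracing: record exact time per routine (higher overhead in practice)
trace = {routine: t for routine, t in timeline}

# Sampling: probe the program every `interval` ms and count where it is
def sample(timeline, interval=1):
    counts = Counter()
    clock = 0
    for routine, duration in timeline:
        end = clock + duration
        while clock < end:
            counts[routine] += 1   # one sample lands in this routine
            clock += interval
    return counts

samples = sample(timeline)
total = sum(samples.values())
# Sample percentages approximate the exact time fractions from tracing
print({r: round(100 * c / total, 1) for r, c in samples.items()})
# -> {'init': 5.0, 'solve': 70.0, 'mpi_allreduce': 20.0, 'io': 5.0}
```

A shorter run (fewer samples) would make these estimates noisier, which is why a representative run length matters.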
Detailed documentation on CrayPat is available from Cray; always consult the latest version of the Cray performance tools documentation.
5.2.1 Instrumenting a code with pat_build
Often the best way to analyse the performance is to run a sampling experiment and then use the results from this to perform a focused tracing experiment. CrayPat contains the functionality for automating this process, known as Automatic Profiling Analysis (APA).
The example below illustrates how to do this.
5.2.1.1 Example: Profiling the CASTEP Code
Here is a step-by-step example of instrumenting and profiling the performance of a code (CASTEP) using CrayPat.
The first step is to compile your code with the profiling tools
attached. First load the Cray
perftools module with
module load perftools
Next, compile the CASTEP code in the standard way on the
/work file-system (which is accessible on the compute
nodes). Once you have the CASTEP executable you need to instrument
it for a sampling experiment. The following command will produce the
castep+samp instrumented executable:
pat_build -o castep+samp castep
The default instrumentation is a sampling experiment with Automatic
Profiling Analysis (equivalent to the -O apa option).
Run your instrumented program as you usually would in the job submission system at your site. For example
#!/bin/bash --login
#PBS -N castep_profile
#PBS -l select=32
#PBS -l walltime=1:00:00
# Replace [budget code] below with your project code
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Run the sampling experiment
aprun -n 768 castep+samp input
When the job completes successfully, the directory you submitted the
job from will contain either a .xf file (if you used a small number
of cores) with a name like castep+samp+25370-14s.xf, or a directory
(for larger core counts) with a similar name. The actual name depends
on the name of your instrumented executable and the process ID. In
this case we used 32 nodes so we have a directory.
The next step is to produce a basic performance report and also the
Automatic Profiling Analysis input for pat_build to create
a tracing experiment. This is done (usually interactively) by using
the pat_report command (you must have the perftools module loaded to
use the command). In our example we would type
pat_report -o castep_samp_1024.pat castep+samp+25370-14s
This will produce a text report (in this case
castep_samp_1024.pat) listing the various routines in
the program and how much time was spent in each one during the run
(see below for a discussion on the contents of this file). It also
produces a castep+samp+25370-14s.ap2 file and a corresponding .apa file.
The .ap2 file can be used with the Cray Apprentice2 visualisation tool to get an alternative view of the code performance.
The .apa file can be used as the input to pat_build to produce a tracing experiment focused on the routines that take most time (as determined in the previous sampling experiment). We illustrate this below.
The .apa file may not be produced if the program is very
short, or mainly calls library functions with little processing in
the program itself. In that case there will be a message at the end of the text report:
No .apa file with pat_build option suggestions was generated because no samples appear to have been taken in USER functions.
For information on interpreting the results in the sampling experiment text report, see the section below.
To produce a focused tracing experiment based on the results from
the sampling experiment above we would use the pat_build
command with the .apa file produced above. (Note, there
is no need to change directory to the directory containing the
original binary to run this command.)
pat_build -O castep+samp+25370-14s.apa -o castep+apa
This will generate a new instrumented version of the CASTEP
executable (castep+apa) in the current working directory.
You can then submit a job (as outlined above) making sure that you
reference this new executable. This will then run a tracing
experiment based on the options in the .apa file.
Once again, you will find that your job generates a .xf
file (or a directory) which you can analyse using the
pat_report command. Please see the section below for
information on analysing tracing experiments.
5.2.2 Analysing profile results using pat_report
The pat_report command is able to produce many
different profile reports from the profile data. You can select a
predefined report with the -O flag to
pat_report. A selection of the most generally
useful predefined report types is:
- ca+src - Show the callers (bottom-up view) leading to the routines that have a high use in the report and include source code line numbers for the calls and time-consuming statements.
- load_balance - Show load-balance statistics for the high-use routines in the program. Parallel processes with minimum, maximum and median times for routines will be displayed. Only available with tracing experiments.
- mpi_callers - Show MPI message statistics. Only available with tracing experiments.
Multiple options can be specified to the
-O flag if they
are separated by commas. For example
pat_report -O ca+src,load_balance -o castep_furtherinfo_1024.pat castep+samp+25370-14s
You can also define your own custom reports using the
-b and -d flags to pat_report. Details on how to do
this can be found in the pat_report documentation and man
pages. The output from the predefined report types (described
above) shows the settings of these flags
that were used to generate the report. You can use these examples
as the basis for your own custom reports.
Below, we show examples of the output from sampling and tracing experiments for the pure MPI version of CASTEP. A report from your own code will differ in the routines and times shown. Also, if your code uses OpenMP threads, SHMEM, or CAF/UPC then you will also see different output.
5.2.2.1 Example: results from a sampling experiment on CASTEP
In a sampling experiment, the main summary table produced by
pat_report will look something like this (of course, your
code will contain different functions and subroutines):
Table 1:  Profile by Function

 Samp% |   Samp |   Imb. |   Imb. |Group
       |        |   Samp |  Samp% | Function
       |        |        |        |  PE=HIDE

100.0% | 6954.6 |     -- |     -- |Total
|------------------------------------------------------------------------
| 46.0% | 3199.8 |     -- |     -- |MPI
||-----------------------------------------------------------------------
|| 12.9% |  897.0 |  302.0 |  25.6% |mpi_bcast
|| 12.9% |  896.4 |   78.6 |   8.2% |mpi_alltoallv
||  8.7% |  604.2 |  139.8 |  19.1% |MPI_ALLREDUCE
||  5.4% |  378.6 | 1221.4 |  77.5% |MPI_SCATTER
||  4.9% |  341.6 |  320.4 |  49.2% |MPI_BARRIER
||  1.1% |   75.8 |  136.2 |  65.2% |mpi_gather
||=======================================================================
| 38.8% | 2697.9 |     -- |     -- |USER
||-----------------------------------------------------------------------
||  9.4% |  652.1 |  171.9 |  21.2% |comms_transpose_exchange.3720
||  6.0% |  415.2 |   41.8 |   9.3% |zgemm_kernel_n
||  3.4% |  237.6 |   29.4 |  11.2% |zgemm_kernel_l
||  3.1% |  215.6 |   24.4 |  10.3% |zgemm_otcopy
||  1.7% |  119.8 |   20.2 |  14.6% |zgemm_oncopy
||  1.3% |   92.3 |   11.7 |  11.4% |zdotu_k
||=======================================================================
| 15.2% | 1057.0 |     -- |     -- |ETC
||-----------------------------------------------------------------------
||  2.0% |  140.5 |   32.5 |  19.1% |__ion_MOD_ion_q_recip_interpolation
||  2.0% |  139.8 |  744.2 |  85.5% |__remainder_piby2
||  1.1% |   78.3 |   32.7 |  29.9% |__ion_MOD_ion_apply_and_add_ylm
||  1.1% |   73.5 |   35.5 |  33.1% |__trace_MOD_trace_entry
|========================================================================
You can see that CrayPat splits the results from a sampling experiment into three sections
- MPI - samples accumulated in message passing routines
- USER - samples accumulated in user routines. These are usually the functions and subroutines that are part of your program.
- ETC - samples accumulated in library routines.
The first two columns indicate the percentage and number of samples
(mean of samples computed across all parallel tasks) and indicate
the functions in the program where the sampling experiment spent the
most time. Columns 3 and 4 give an indication of the differences
between the minimum number of samples and maximum number of samples
found across different parallel tasks. This, in turn, gives an
indication of where the load-imbalance in the program is found. It
may be that load-imbalance is found in routines where insignificant
amounts of time are spent; in this case, the load-imbalance may not
actually be significant (for example, the large imbalance time seen
for mpi_gather in the example above).
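The Imb. Samp% column can be reproduced by hand. A commonly quoted form of the imbalance metric is (max - mean) / max * N / (N - 1), where N is the number of PEs; the sketch below (an illustration, not CrayPat's actual implementation) shows how it behaves at the two extremes.

```python
# Hedged sketch of a load-imbalance metric in the style CrayPat reports:
# Imb% = (max - mean) / max * N / (N - 1) * 100 for N parallel tasks.
# Near 0% means good balance; near 100% means one task does all the work.
def imbalance_pct(samples_per_pe):
    n = len(samples_per_pe)
    peak = max(samples_per_pe)
    mean = sum(samples_per_pe) / n
    return 100.0 * (peak - mean) / peak * n / (n - 1)

# Perfectly balanced: every PE accumulated the same number of samples
print(round(imbalance_pct([100, 100, 100, 100]), 1))   # 0.0
# One PE does all the work: the metric reaches 100%
print(round(imbalance_pct([400, 0, 0, 0]), 1))         # 100.0
```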
In the example above, the largest amounts of time seem to be spent
in the MPI routines mpi_bcast, mpi_alltoallv and
MPI_ALLREDUCE; in the program routine
comms_transpose_exchange; and in the zgemm linear algebra routines.
To narrow down where these time-consuming calls are in the program
we can use the ca+src report option to pat_report
to get a view of the split of samples between the different calls to
the time-consuming routines in the program.
pat_report -O ca+src -o castep_calls_1024.pat castep+samp+25370-14s
An extract from the report produced by the above command looks like
Table 1:  Profile by Function and Callers, with Line Numbers

 Samp% |   Samp |Group
       |        | Function
       |        |  Caller
       |        |   PE=HIDE

100.0% | 6954.6 |Total
|--------------------------------------------------------------------------------------
--snip--
| 38.8% | 2697.9 |USER
||-------------------------------------------------------------------------------------
||  9.4% |  652.1 |comms_transpose_exchange.3720
|||------------------------------------------------------------------------------------
3||  3.7% |  254.4 |__comms_MOD_comms_transpose_n:comms.mpi.F90:line.17965
4||       |        | __comms_MOD_comms_transpose:comms.mpi.F90:line.17826
5||       |        |  __fft_MOD_fft_3d:fft.fftw3.F90:line.482
6||  3.1% |  215.0 |   __basis_MOD_basis_real_recip_reduced_one:basis.F90:line.4104
7||       |        |    _s_wave_MOD_wave_real_to_recip_wv_bks:wave.F90:line.24945
8||       |        |     __pot_MOD_pot_nongamma_apply_wvfn_slice:pot.F90:line.2640
9||       |        |      __pot_MOD_pot_apply_wvfn_slice:pot.F90:line.2550
10|       |        |       __electronic_MOD_electronic_apply_h_recip_wv:electronic.F90:line.4364
11|       |        |        __electronic_MOD_electronic_apply_h_wv:electronic.F90:line.4057
12|  3.0% |  211.4 |         __electronic_MOD_electronic_minimisation:electronic.F90:line.1957
13|  3.0% |  205.6 |          __geometry_MOD_geom_get_forces:geometry.F90:line.6299
14|  2.8% |  195.2 |           __geometry_MOD_geom_bfgs:geometry.F90:line.2199
15|       |        |            __geometry_MOD_geometry_optimise:geometry.F90:line.1051
16|       |        |             MAIN__:castep.f90:line.1013
17|       |        |              main:castep.f90:line.623
--snip--
This indicates that (in the USER routines) the part of the
comms_transpose_exchange routine in which the program is
spending most time is around line 3720 (in many cases the
time-consuming line will be the start of a loop). Furthermore, we
can see that roughly a third of the samples in the routine
(3.7% out of 9.4%) originate from the call path to
comms_transpose_exchange shown in the stack. Here, they are
part of the geometry optimisation section of CASTEP. The remaining
samples are attributed to call paths shown further down in the
report.
In the original report we also get a table reporting on the wall clock time of the program. The table reports the minimum, maximum and median values of wall clock time from the full set of parallel tasks. The table also lists the maximum memory usage of each of the parallel tasks:
Table 3:  Wall Clock Time, Memory High Water Mark

   Process |  Process |PE=[mmm]
      Time |    HiMem |
           | (MBytes) |

124.409675 |  100.761 |Total
|---------------------------------
| 126.922418 | 101.637 |pe.39
| 124.329474 | 102.336 |pe.2
| 122.241495 | 101.355 |pe.4
|=================================
If you see a very large difference in the minimum and maximum wall clock time for different parallel tasks it indicates that there is a large amount of load-imbalance in your code and it should be investigated more thoroughly.
5.2.2.2 Example: results from a tracing experiment on CASTEP
To be added.
5.2.3 Using hardware counters
CrayPat can also provide performance counter data during tracing
experiments. The PAT_RT_PERFCTR environment variable
can be set in your job script to specify hardware, network, and
power management events to be monitored. Please refer to the
CrayPat manpage and
the CrayPat documentation for information about using the performance counters.
5.2.3.1 Example: Performance counter results from CASTEP
To be added.
5.3 Cray Reveal
Reveal is a performance analysis tool supported by Cray that guides you in improving the performance of a Cray-compiled application code. Specifically, Reveal will analyse the loops within your code and identify which ones are less than fully optimised and why. In addition, Reveal can gather performance statistics that rank the loops according to execution time, thereby showing you the best places in the source code to apply optimisations. An overview of Reveal can be found in the Cray document entitled, Using Cray Performance Measurement and Analysis Tools. In addition, the ARCHER CSE team have produced a Reveal screencast (see below) that gives a straightforward example of how a researcher might run Reveal with their MPI application code, and, along the way, show how to make the code run faster, and also, how to add parallelism by introducing OpenMP directives. Further details on the OpenMP-aspects of the Reveal tool can be found in this Cray webinar.
5.4 Allinea MAP
Allinea MAP can be used to profile MPI code only - it does not support detailed profiling of OpenMP, SHMEM, CAF, or UPC. Profiling using Allinea MAP consists of four steps:
- Create a MAP profiling version of your MPI library.
- Re-compile your code, linking to the MAP MPI profiling library.
- Run your code to collect profiling information.
- Visualise the profiling information using the MAP Remote Client.
5.4.1 Create a MAP profiling MPI library
You should first load the allinea module to give you access to all the MAP executables:
module load allinea
The command you use to create the profiling version of your MPI library differs depending whether you are linking statically or dynamically. On ARCHER the default is to link statically so you should use the command for static libraries unless you are definitely linking dynamically.
For statically linked applications the command is
make-profiler-libraries and you need to supply a
directory for the library files. For example
mjf@eslogin006:~/test> module load allinea
mjf@eslogin006:~/test> mkdir map-libraries
mjf@eslogin006:~/test> cd ..
mjf@eslogin006:~/test> make-profiler-libraries --lib-type=static ./map-libraries

Created the libraries in ./map-libraries:
   libmap-sampler.a
   libmap-sampler-pmpi.a

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -Wl,@./map-libraries/allinea-profiler.ld ... EXISTING_MPI_LIBRARIES
      If your link line specifies EXISTING_MPI_LIBRARIES (e.g. -lmpi), then
      these must appear *after* the Allinea sampler and MPI wrapper libraries in
      the link line. There's a comprehensive description of the link ordering
      requirements in the 'Preparing a Program for Profiling' section of either
      userguide-forge.pdf or userguide-reports.pdf, located in
      /home/y07/y07/cse/allinea/forge/5.0.1-phase2/doc/.

mjf@eslogin006:~/test> ls -l ./map-libraries
total 6708
-rw-r--r-- 1 mjf z01     748 Feb 22 11:43 allinea-profiler.ld
-rw-r--r-- 1 mjf z01 1727928 Feb 22 11:43 libmap-sampler-pmpi.a
-rw-r--r-- 1 mjf z01 4785546 Feb 22 11:43 libmap-sampler.a
If you are linking dynamically then you still use the "make-profiler-libraries" command, but don't specify static building. For example
mjf@eslogin006:/work/z01/z01/mjf/test> module load allinea
mjf@eslogin006:/work/z01/z01/mjf/test> mkdir map-libraries
mjf@eslogin006:/work/z01/z01/mjf/test> make-profiler-libraries ./map-libraries

Created the libraries in ./map-libraries:
   libmap-sampler.so   (and .so.1, .so.1.0, .so.1.0.0)
   libmap-sampler-pmpi.so   (and .so.1, .so.1.0, .so.1.0.0)

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -dynamic -L./map-libraries -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr

Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your
program.

Before running your program (interactively or from a queue), set
LD_LIBRARY_PATH:
   export LD_LIBRARY_PATH=./map-libraries:$LD_LIBRARY_PATH
   mpirun ...
or add -Wl,-rpath=./map-libraries when linking your program.
5.4.2 Re-compile the code, linking to the MAP MPI libraries
Recompile your code with the -g flag (-G2
can be used with Cray Fortran) to ensure debugging information is
available for the profiler. Then link using the set of options
specified in the output from the
make-profiler-libraries command that you
used (note, the -Wl,@... statement should be added at the end of the link command, not the beginning).
5.4.3 Run the code to collect profiling information
You can now run a job as normal to collect the profile information for your code. This involves writing a job submission script to run your code through MAP. MAP will produce a file containing all the profiling information which can then be examined using the Remote Client (as described below).
An example job submission script for running a profile across 48 MPI processes would be
#!/bin/bash --login
#
#PBS -N map_test
#PBS -l select=2
#PBS -l walltime=1:0:0
#PBS -A t01

cd $PBS_O_WORKDIR

module load allinea
map -profile aprun -n 48 myapp.exe
This will produce a profile file (with a .map extension) named after your executable.
An example job submission script for running a profile across 24 MPI processes with 12 processes per node would be
#!/bin/bash --login
#
#PBS -N map_test
#PBS -l select=2
#PBS -l walltime=1:0:0
#PBS -A t01

cd $PBS_O_WORKDIR

module load allinea
map -profile aprun -n 24 -N 12 -S 6 myapp.exe
You can specify a name for this output file using the
-output option to map in the job script.
If you are profiling a dynamically linked program, then you will
need to set
LD_LIBRARY_PATH in the job script.
5.4.4 Visualise the profiling information
In order to visualise the data you should download and install the Allinea Remote Client on your desktop using these instructions.
Once remote access has been successfully set up you should be returned to the startup screen. This time select "Load Profile Data File". You will be given a dialogue box where you can select your profile file from the ARCHER file-system. Once you have selected your file MAP will assemble the results and you can visualise your profile data.
For more information on using the profiling interface, see the Allinea MAP documentation.
5.5 Tuning and Analysis Utilities (TAU)
TAU is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, and Python. For the version of TAU currently installed on ARCHER, the following makefiles are supported.
- Makefile.tau-cray-papi-ompt-mpi-pdt-openmp

Equivalent makefiles also exist for the GNU and Intel programming environments.
Choose one of the makefiles and set "export TAU_MAKEFILE=$TAU_LIB/" (followed by the chosen makefile name) before compiling your code. Otherwise, use tau_exec for uninstrumented code.
Note, before using tau_exec, make sure you have built your application binary with the -dynamic link option. This is necessary because tau_exec preloads a dynamic shared object, libTAU.so, in the context of the executing application.
Full documentation can be found at http://www.cs.uoregon.edu/research/tau/docs.php. In addition, TAU workshop exercises can be found at /home/y07/shared/tau/workshop/ - please start with the readme file located within the workshop folder.
5.6 Valgrind

Unlike the previously mentioned profiling tools, which are specialised for analysing massively parallel programs, Valgrind is a suite of more generic analysis tools that nonetheless offer useful performance analysis functionality - for example, for serial optimisation work, or to help decide how best to parallelise your serial program. Valgrind is designed to be non-intrusive and to work directly with existing executables: you don't need to recompile, relink, or otherwise modify your program. For a list of Valgrind tools see http://valgrind.org/info/tools.html.
5.6.1 Memory profiling with Massif
Valgrind includes a tool called Massif that can be used to give insight into the memory usage of your program. It takes regular snapshots and outputs this data into a single file, which can be visualised to show the total amount of memory used as a function of time. This shows when peaks and bottlenecks occur and allows you to identify which data structures in your code are responsible for the largest memory usage of your program.
Documentation explaining how to use Massif is available at official Massif manual. In short, you should run your executable as follows:
valgrind --tool=massif my_executable
The memory profiling data will be output into a file called
massif.out.<pid>, where <pid> is the runtime process ID of your program. A custom filename can be chosen using the
--massif-out-file option, as follows:
valgrind --tool=massif --massif-out-file=optional_filename.out my_executable
The output file contains raw profiling statistics. To view a summary, including a graphical plot of memory usage over time, use the
ms_print command as follows:

ms_print massif.out.12345

or, to save the analysis to a file:

ms_print massif.out.12345 > massif.analysis.12345
This will show total memory usage over time as well as a breakdown of the top data structures contributing to memory usage at each snapshot where there has been a significant allocation or deallocation of memory.
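For scripted post-processing (for example, comparing peak memory across runs), the raw massif.out file is straightforward to parse: it is a sequence of key=value lines grouped into snapshots. The sketch below is an illustrative parser, not an official Valgrind tool; it reads only the summary fields (time=, mem_heap_B=, etc.) and ignores the heap trees.

```python
# Hedged sketch: pull (time, heap bytes) pairs out of massif.out text
# and report the peak snapshot. Field names follow the massif output
# format (snapshot=, time=, mem_heap_B=, mem_heap_extra_B=, mem_stacks_B=).
def parse_massif(text):
    snapshots, current = [], {}
    for line in text.splitlines():
        if "=" not in line or line.startswith("#"):
            continue                      # skip headers and separators
        key, _, value = line.partition("=")
        if key == "snapshot":
            current = {}                  # start a new snapshot record
            snapshots.append(current)
        elif key in ("time", "mem_heap_B", "mem_heap_extra_B", "mem_stacks_B"):
            current[key] = int(value)
    return snapshots

sample = """\
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=1000
mem_heap_B=2048
mem_heap_extra_B=64
mem_stacks_B=0
heap_tree=empty
"""

snaps = parse_massif(sample)
peak = max(snaps, key=lambda s: s["mem_heap_B"])
print(peak["time"], peak["mem_heap_B"])   # 1000 2048
```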
5.7 General hints for interpreting profiling results
There are several general tips for using the results from the performance analysis:
- Look inside the routines where most of the time is spent to see if they can be optimised in any way. The ct+src report option can be useful to determine which regions of a function are using the most time.
- Look for load-imbalance in the code. This is indicated by a large difference in computing time between different parallel tasks.
- A high proportion of time spent in MPI usually means something is wrong in the code: load-imbalance, a bad communication pattern, or simply that the code does not scale to the number of cores used.
- High values of cache misses (seen via hardware counters) could mean a bad memory access in an array or loop. Examine the arrays and loops in the problematic function to see if they can be reorganised to access the data in a more cache-friendly way.
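As an illustration of the access-pattern point in the last tip (a generic sketch, not ARCHER- or CASTEP-specific): the two loops below compute the same sum, but for a row-contiguous (C-style) layout the first touches memory sequentially while the second strides across it. In Fortran, where columns are contiguous, the opposite loop ordering is the cache-friendly one.

```python
# Illustration: loop order determines whether memory is walked
# sequentially (cache-friendly) or with a large stride (cache-hostile).
rows, cols = 4, 3
matrix = [[r * cols + c for c in range(cols)] for r in range(rows)]

# Cache-friendly for row-contiguous storage: innermost loop walks a row
row_major = sum(matrix[r][c] for r in range(rows) for c in range(cols))

# Same result, but strided access for row-contiguous storage
col_major = sum(matrix[r][c] for c in range(cols) for r in range(rows))

assert row_major == col_major   # reordering loops never changes the answer
print(row_major)                # 66
```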
The Tuning section in this guide has a lot of information on how to optimise both serial and parallel code. Combining profiling information with the optimisation techniques is the best way to improve the performance of your code.
5.7.1 Spotting load-imbalance
By far the largest obstacle to scaling applications to large numbers of parallel tasks is the presence of load-imbalance in the parallel code. Spotting the symptoms of load-imbalance in performance analysis is key to optimising the scaling performance of your code. Typical symptoms include
- For MPI codes: large amounts of time spent in MPI_Barrier or MPI collectives (which include an implicit barrier).
- For PGAS codes: large amounts of time spent in synchronisation functions.
When running a CrayPat tracing experiment that traces MPI functions,
the MPI time will be split into computation time (MPI group) and
synchronisation time (MPI_SYNC group). A high synchronisation time
can indicate problems with load imbalance in an MPI code.
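The connection between load-imbalance and synchronisation time can be seen with a small sketch (hypothetical timings, not CrayPat output): at a barrier every rank waits for the slowest one, so it is the fastest ranks that accumulate the most MPI_SYNC time.

```python
# Hedged sketch: why load imbalance shows up as MPI_SYNC time. At a
# barrier each rank waits for the slowest, so a rank's wait time is
# the gap between its own compute time and the maximum.
compute_time = {0: 9.0, 1: 10.0, 2: 6.0, 3: 10.0}  # seconds per rank

slowest = max(compute_time.values())
sync_time = {rank: slowest - t for rank, t in compute_time.items()}

print(sync_time)   # rank 2, the fastest, waits the longest: 4.0 s
# -> {0: 1.0, 1: 0.0, 2: 4.0, 3: 0.0}
```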
The location of the accumulation of load-imbalance time often gives an indication of where in your program the load-imbalance is occurring.