6. Tuning

This section discusses best practice for improving the performance of your code on ARCHER. We begin with a discussion of how to optimise the serial (single-core) compute performance and then discuss how to improve parallel performance.

Please note that these are general guidelines and some/all of the recommendations may not have an impact on your code. We always advise that you analyse the performance of your code using the profiling tools detailed in the Performance analysis section to identify bottlenecks and parallel performance issues (such as load imbalance).

6.1 Optimisation summary

In summary, the steps to getting the best performance from your code are:

  1. Select the right (parallel) algorithm for your problem. If you do not do this then no amount of optimisation will give you the best performance.
  2. Use the compiler optimisation flags (and use pointers sparingly in your code).
  3. Use the optimised numerical libraries supplied by Cray rather than coding yourself.
  4. Eliminate any load-imbalance in your code (CrayPAT can help identify load-balance issues). If you have load-imbalance then your code will never scale up to large core counts.

6.2 Serial (single-core) optimisation

6.2.1 Compiler optimisation flags

One of the easiest optimisations to perform is to use the correct compiler flags. This optimisation technique is extremely simple as it does not require you to modify your source code - although alterations to your source code may allow compiler flags to have more beneficial effects. It is often worth taking the time to try a number of optimisation flag combinations to see what effect they have on performance of your code. In addition, many of the compilers will provide information on what optimisations they are performing and, more usefully, what optimisations they are not performing and why. The flags needed to enable this information are indicated below.

Typical optimisations that can be performed by the compiler include:

Loop optimisation
such as vectorisation and unrolling.
Inlining
replacing a call to a function with the actual function source code.
Local logical block optimisations
such as instruction scheduling and algebraic identity removal.
Global optimisations
such as constant propagation and dead store elimination (still within a single source code file).
Inter-procedural analyses
try to optimise across subroutine/function boundary calls (can span multiple source code files).

The compiler-specific documentation and man pages contain more information about which optimisations particular flags will enable/disable.

When using the more aggressive optimisation options it is important to be aware that the resulting output might be affected, for example by a loss of precision. Some of the optimisation options allow the order of execution to be changed and alter how arithmetic computations are performed. When using aggressive optimisations it is important to test your code to ensure that it still produces the correct result.

Many compiler suites allow pragmas or flags to be placed in the source code to give more information on whether or not (or even how) particular sections of code should be optimised. These can be useful, particularly for restricting optimisation in sections of code where the order of execution is critical.

Cray Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The default is -O2 but most codes should benefit from increasing this to -O3.

To enable information on optimisations use the -hlist=a flag.

GNU Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive).

The option -fdump-tree-all will generate information about attempted loop vectorisations.

Intel Compiler Suite

The Intel compiler also supports the -O1, -O2 and -O3 flags for optimisation.

The option -opt-report3 will generate information about attempted loop vectorisations.

6.2.2 Using Libraries

Another easy way to boost the serial performance for your code is to use the optimised numerical libraries provided on ARCHER. More information on the libraries available on the system can be found in the Available Numerical Libraries section.

6.2.3 Writing Optimal Serial Code

The speed of computation is determined by the efficiency of your algorithm (essentially the number of operations required to complete the calculation) and how well the compiled executable can exploit the Xeon architecture.

When actually writing your code, the largest single effect you can have on performance is in selecting the appropriate algorithm for the problems you are studying. The algorithm you choose depends on many things, but may include such considerations as:

Precision
Do you need to use double precision floating point numbers? If not, single or mixed-precision algorithms can run up to twice as fast as the double precision versions.
Problem size
What are the scaling properties of your algorithm? Would a different approach allow you to treat larger problems more efficiently?
Complexity
Although a particular algorithm may theoretically have the best scaling properties, is it so complex that this benefit is lost during coding?

Often algorithm selection is non-trivial and a good proportion of code benchmarking and profiling is needed to elucidate the best choice.

Once you have selected the best algorithms for your code you should endeavour to write your code in a way that allows the compiler to exploit the Xeon processor architecture in the most efficient way.

The first rule is that if your code segment can be replaced by an optimised library call then you should do this (see Available Numerical Libraries). If your code segment does not have an equivalent in one of the standard optimised numerical libraries then you should try to use code constructs that expose instruction-level parallelism or vectorisation (AVX instructions) to the compiler, rather than hand-coding simple optimisations that the compiler can perform itself. For floating-point intensive kernels the following general advice applies:

  • Avoid the use of pointers - these limit the optimisation that the compiler can perform.
  • Avoid using function calls, branching statements and goto statements wherever possible.
  • Only loops of stride 1 are amenable to vectorisation.
  • For nested loops, the innermost loop should be the longest and have a stride of 1.

6.2.4 Cache Optimisation

Main memory access on systems such as ARCHER is usually around two orders of magnitude slower than performing a single floating-point operation. One solution used in the Xeon architecture to mitigate this is a hierarchy of smaller, faster memory spaces on the processor known as caches. This works because there is often a high chance of a particular memory address being needed again within a short interval, or of an address from the same vicinity of memory being needed at the same time. This suggests that we can improve the performance of our code by writing it so that we access data in memory in a way that allows the cache hierarchy to be used as efficiently as possible.

Cache optimisation can be a very complex subject but we will try to provide a few general principles that can be applied to your codes that should help improve cache efficiency. The CrayPAT tool introduced in the Performance Analysis section can be used to monitor the cache efficiency of your code through the use of hardware counters.

Effectively, in programming for cache efficiency we are seeking to provide additional locality in our code. Here, locality refers to both spatial locality (using data located in blocks of consecutive memory addresses) and temporal locality (using the same address multiple times in a short period of time).

  • Spatial locality can be improved by looping over data (in the innermost loop of nested loops) using a stride of 1 (or, in Fortran, by using array syntax).
  • Temporal locality can be improved by using short loops that do not contain function calls or branching statements.

There are two other ways in which the cache technology can have a detrimental effect on code performance.

Part of the way in which caches achieve high performance is by mapping each memory address onto a fixed number of cache lines; this is known as n-way set associativity. This property of caches can seriously affect the performance of codes where two array variables involved in an operation map to the same cache line, so that the line must be refilled twice for each instance of the operation. One way to minimise this effect is to avoid using powers of 2 for your array dimensions (as cache and cache-line sizes are always powers of 2) or, if you see this happening in your code, to pad the arrays with enough unused elements to stop it happening.

The other major effect on users' codes comes in the form of so-called TLB misses. The TLB in question is the translation lookaside buffer and is the mechanism that the cache/memory hierarchy uses to convert application addresses to physical memory addresses. If a mapping is not contained in the TLB then main memory must be accessed for further information, resulting in a large performance penalty. TLB misses most often occur in codes when they loop through an array using a large stride.

6.3 Parallel optimisation

Some of the most important advice from the serial optimisation section also applies for parallel optimisation, namely:

  • Choose the correct algorithm for your problem.
  • Use vendor-provided libraries wherever possible.

When programming in parallel you will also need to select the parallel programming model to use. As the Cray XC system is an MPP machine with distributed memory you have the following options:

  • Pure MPI - using just the MPI communications library.
  • Pure SHMEM - using just the SHMEM, single-sided communications library.
  • Pure PGAS - using one of the Partitioned Global Address Space (PGAS) implementations, such as Coarray Fortran (CAF) or Unified Parallel C (UPC).
  • Hybrid approach - using a combination of parallel programming models (most often MPI+OpenMP but MPI+CAF and MPI+SHMEM are also used).

The Aries interconnect includes hardware support for single-sided communications. This means that SHMEM and PGAS approaches can run very efficiently and, if your algorithm is amenable to such an approach, are worth considering as an alternative to the more traditional pure MPI approach. A caveat here is that if your code makes heavy use of collective communications (for example, all-to-all or allreduce type operations) then you will find that the optimised MPI versions of these routines almost always outperform the equivalents coded using SHMEM or PGAS.

In addition, because Cray XC machines are constructed from quite powerful SMP building blocks (i.e. individual nodes with up to 24 cores), a hybrid programming approach using OpenMP for parallelism within a node and MPI for communications between nodes will generally produce code with better scaling properties than a pure MPI approach.

6.3.1 Load-imbalance

None of the parallel optimisation advice here will allow your code to scale to larger numbers of cores if your code has a large amount of load-imbalance.

Load-imbalance in parallel algorithms arises where different parallel tasks (or threads) have significantly different amounts of computational work to perform. This, in turn, leads to some tasks (or threads) sitting idle at synchronisation points while waiting for other tasks to complete their block of work. Obviously, this can lead to a large amount of inefficiency in the program and can seriously inhibit good scaling behaviour.

Before optimising the parallel performance of your code it is always worth profiling (see the Profiling section) to try and identify the level of load-imbalance in your code, CrayPAT provides excellent tools for this. If you find a large amount of load-imbalance then you should eliminate this as much as possible before proceeding. Note that load-imbalance may only become apparent once you start using the code on higher and higher numbers of cores.

Eliminating load-imbalance can involve changing the algorithm you are using and/or changing the parallel decomposition of your problem. Generally, this issue is very code specific.

6.3.2 MPI Optimisation

The majority of parallel, scientific software still uses the MPI library as the main way to implement parallelism, so much effort has been put in by Cray software engineers to optimise the MPI performance on Cray XC systems. You should take advantage of this by using high-level MPI routines for parallel operations wherever possible. For example, you should almost always use MPI collective calls rather than writing your own versions using lower-level MPI sends and receives.

When writing MPI (or hybrid MPI+X) code you should:

  • overlap communication and computation by using non-blocking operations wherever possible;
  • pre-post receives before the matching send operation is called, to save memory copies and MPI buffer management overheads;
  • send few large messages rather than many small messages to minimise latency costs;
  • use collective communication routines as little as possible (but do not build your own collectives out of point-to-point communication);
  • avoid the use of MPI_Sendrecv as this is an extremely slow operation unless the two MPI tasks involved are perfectly synchronised.
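The first two guidelines can be sketched as follows. This is a hedged outline rather than ARCHER-specific code: `do_interior_work()` is a hypothetical placeholder for computation that touches neither communication buffer.

```c
#include <mpi.h>

void do_interior_work(void);   /* hypothetical: work independent of the buffers */

/* Sketch: pre-post the receive, start a non-blocking send, overlap
 * local computation with the transfers, then wait for completion. */
void exchange(double *recvbuf, double *sendbuf, int n,
              int partner, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Pre-posting the receive lets MPI deliver the incoming message
     * directly into recvbuf rather than via an internal buffer. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &reqs[1]);

    do_interior_work();   /* computation overlapped with communication */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```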

Some useful MPI environment variables that can be used to tune the performance of your application are:

MPICH_ENV_DISPLAY
set to display the current environment settings when an MPI program is executed.
MPICH_FAST_MEMCPY
use an optimised memory copy function in all MPI routines.
MPICH_MAX_SHORT_MSG_SIZE
tune the use of the eager messaging protocol which tries to minimise the use of the MPI system buffer. Increasing/decreasing this value may improve performance.
MPICH_COLL_OPT_ON
can give better performance for MPI_Allreduce and MPI_Barrier for large numbers of cores.
MPICH_UNEX_BUFFER_SIZE
increases the buffer size for messages that are received before the matching receive has been posted. Increasing this may improve performance if you have a large number of such messages. It is better to alter the code to pre-post receives if possible, though.
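In a job submission script these variables are set in the usual way. The values below are illustrative starting points, not recommendations; test each change against your own application.

```shell
# Job-script fragment (illustrative values only):
export MPICH_ENV_DISPLAY=1            # report MPI settings at startup
export MPICH_FAST_MEMCPY=1            # optimised memory copy in MPI routines
export MPICH_MAX_SHORT_MSG_SIZE=8192  # eager protocol cutoff, in bytes
```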

Use "man intro_mpi" on the machine to show a full list of available options.

6.3.3 Mapping tasks/threads onto cores

The way in which your parallel tasks/threads are mapped onto the cores of the Cray XC compute nodes can have a large effect on performance. Some options you may want to consider are:

  • When underpopulating a compute node with parallel tasks it can often be beneficial to ensure that the tasks are evenly spread across NUMA regions using the -S option to aprun (see below). This has the potential to optimise the memory bandwidth available to each core and to free up the additional cores for use by the multithreaded version of Cray's LibSci library; to do this, set the OMP_NUM_THREADS environment variable to however many spare cores are available to each parallel task and use the "-d $OMP_NUM_THREADS" option to aprun (see below).

The aprun command which launches parallel jobs onto ARCHER compute nodes has a range of options for specifying how parallel tasks and threads are mapped onto the actual cores on a node. Some of the most important options are:

-n parallel_tasks
Total number of parallel tasks (not including threads). Default is 1.
-N parallel_tasks_per_node
Number of parallel tasks (not including threads) per node. Default is the number of cores in a node.
-d threads_per_parallel_task
Number of threads per parallel task. For OpenMP codes this will usually be equal to $OMP_NUM_THREADS. Default is 1. This option can also be used to specify a stride between parallel tasks when not using threads.
-S parallel_tasks_per_numa
Number of parallel tasks to assign to each NUMA region on the node. There are 2 NUMA regions per ARCHER compute node. Default is 12.

Some examples should help to illustrate the various options.

Example 1:

Pure MPI job using 768 MPI tasks (-n option, 32 nodes) with 24 tasks per node (-N option):

  aprun -n 768 -N 24 my_app.x

This is analogous to the behaviour of mpiexec on Linux clusters.

Example 2:

Hybrid MPI/OpenMP job using 384 MPI tasks (-n option) with 4 OpenMP threads per MPI task (-d option), 64 nodes in total. There will be 6 MPI tasks per node (-N option) and the 4 OpenMP threads are placed such that the threads associated with each MPI task are assigned to the same NUMA region (3 MPI tasks per NUMA region, -S option):

  aprun -n 384 -N 6 -d 4 -S 3 my_app.x

Example 3:

Pure MPI job using 768 MPI tasks (-n option) with 12 tasks per node (half-populated, -N option) with 6 tasks per NUMA region (-S option):

  aprun -n 768 -N 12 -S 6 my_app.x

Further information on job placement can be found in the Cray documentation, or by typing:

  man aprun

when logged on to ARCHER.

6.4 Advanced OpenMP usage

On ARCHER systems, when using the GNU compiler suite, the location of the thread that initialises the data can determine the location of the data. This means that if you allocate your data in the serial portion of the code then the location of the data will be on the NUMA region associated with thread 0. This behaviour can have implications for performance in the parallel regions of the code if a thread from a different NUMA region then tries to access that data. If you are using the Cray or Intel compiler suites then there is no guarantee of where shared data will be located if your OpenMP code spans multiple NUMA regions. We always recommend that OpenMP code does not span multiple NUMA regions on ARCHER. See below for recommended task/thread configurations.

You can overcome this limitation, when using the GNU compiler suite, by initialising your data in parallel (within a parallel region) or, for any compiler suite, by not using OpenMP parallel regions that span multiple NUMA regions on a node.

Due to these issues you should almost always use the "-ss" flag to aprun, which enforces strict memory containment per NUMA region, to ensure best performance.

In general, it has been found difficult to gain any parallel performance when using OpenMP parallel regions that span multiple NUMA regions on an ARCHER compute node. For this reason, you will generally find it best to use one of the following task/thread layouts if your code contains OpenMP.

MPI Tasks per NUMA Region  Threads per MPI task  aprun syntax
1                          12                    aprun -n ... -ss -N 2 -S 1 -d 12 ...
2                          6                     aprun -n ... -ss -N 4 -S 2 -d 6 ...
3                          4                     aprun -n ... -ss -N 6 -S 3 -d 4 ...
6                          2                     aprun -n ... -ss -N 12 -S 6 -d 2 ...

6.4.1 Environment variables

The following are the most important OpenMP environment variables:

OMP_NUM_THREADS=number_of_threads
Sets the maximum number of OpenMP threads available to each parallel task.
OMP_NESTED=true
Enable nested OpenMP parallel regions.
OMP_SCHEDULE=policy
Determines how iterations of loops are scheduled.
OMP_STACKSIZE=size
Specifies the size of the stack for threads created.
OMP_WAIT_POLICY=policy
Controls the desired behavior of waiting threads.
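For example, a job script might set the following (the values shown are illustrative only; appropriate settings depend on your code):

```shell
# Example settings for a 6-thread OpenMP run (illustrative values):
export OMP_NUM_THREADS=6
export OMP_SCHEDULE="dynamic,4"   # hand out chunks of 4 iterations on demand
export OMP_STACKSIZE=64M          # stack size for each created thread
export OMP_WAIT_POLICY=ACTIVE     # spin rather than sleep at barriers
```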

A more complete list of OpenMP environment variables can be found at:

6.4.2 Intel OpenMP Affinity and Helper Thread

Note: This affects OpenMP codes, pure or hybrid, built with the Intel compiler. It does not apply to any codes built with the Cray or GNU compilers.

When running an Intel-compiled OpenMP application, the Intel OpenMP runtime uses its own method of binding threads to processors, which conflicts with the default affinity on ARCHER and usually results in a suboptimal thread/core mapping. The runtime also creates an extra "helper" thread to carry out management tasks, so your job will run with one more thread than specified by the OMP_NUM_THREADS environment variable. These factors can cause significant performance issues if not taken into account when running a job.

In all cases, it is recommended that you experiment with affinity settings and aprun arguments, making sure to test each configuration and confirm the core mappings are as expected. A Cray example program xthi is available freely here and provides a quick way of checking which processes and threads are mapped to each node and core.

Affinity

Intel OpenMP thread affinity can be controlled via the KMP_AFFINITY environment variable. With this unset, the default thread/core mapping is assumed, resulting in:

user@tdsmom:> export OMP_NUM_THREADS=24
user@tdsmom:> aprun -n 1 -d 24 xthi | sort -n -k 4 -k 6
Application 505911 resources: utime ~0s, stime ~0s, Rss ~3952, inblocks ~6056, outblocks ~15332
Hello from rank 0, thread 0, on nid00013. (core affinity = 0)
Hello from rank 0, thread 1, on nid00013. (core affinity = 0)
Hello from rank 0, thread 2, on nid00013. (core affinity = 0)
Hello from rank 0, thread 3, on nid00013. (core affinity = 0)
Hello from rank 0, thread 4, on nid00013. (core affinity = 0)
Hello from rank 0, thread 5, on nid00013. (core affinity = 0)
Hello from rank 0, thread 6, on nid00013. (core affinity = 0)
Hello from rank 0, thread 7, on nid00013. (core affinity = 0)
Hello from rank 0, thread 8, on nid00013. (core affinity = 0)
Hello from rank 0, thread 9, on nid00013. (core affinity = 0)
Hello from rank 0, thread 10, on nid00013. (core affinity = 0)
Hello from rank 0, thread 11, on nid00013. (core affinity = 0)
Hello from rank 0, thread 12, on nid00013. (core affinity = 0)
Hello from rank 0, thread 13, on nid00013. (core affinity = 0)
Hello from rank 0, thread 14, on nid00013. (core affinity = 0)
Hello from rank 0, thread 15, on nid00013. (core affinity = 0)
Hello from rank 0, thread 16, on nid00013. (core affinity = 0)
Hello from rank 0, thread 17, on nid00013. (core affinity = 0)
Hello from rank 0, thread 18, on nid00013. (core affinity = 0)
Hello from rank 0, thread 19, on nid00013. (core affinity = 0)
Hello from rank 0, thread 20, on nid00013. (core affinity = 0)
Hello from rank 0, thread 21, on nid00013. (core affinity = 0)
Hello from rank 0, thread 22, on nid00013. (core affinity = 0)
Hello from rank 0, thread 23, on nid00013. (core affinity = 0)

Every thread has been assigned to core 0, as illustrated by this interactive single-node OpenMP job. This is disastrous for parallel performance, as each thread must timeshare a single core while the remaining 23 sit idle.

It is therefore recommended to set KMP_AFFINITY=disabled in all Intel OpenMP jobs to bypass the Intel runtime:

user@tdsmom:> export KMP_AFFINITY=disabled
user@tdsmom:> export OMP_NUM_THREADS=24
user@tdsmom:> aprun -n 1 -d 24 xthi | sort -n -k 4 -k 6
Application 506079 resources: utime ~2s, stime ~0s, Rss ~3952, inblocks ~6056, outblocks ~15332
Hello from rank 0, thread 0, on nid00013. (core affinity = 0)
Hello from rank 0, thread 1, on nid00013. (core affinity = 2)
Hello from rank 0, thread 2, on nid00013. (core affinity = 3)
Hello from rank 0, thread 3, on nid00013. (core affinity = 4)
Hello from rank 0, thread 4, on nid00013. (core affinity = 5)
Hello from rank 0, thread 5, on nid00013. (core affinity = 6)
Hello from rank 0, thread 6, on nid00013. (core affinity = 7)
Hello from rank 0, thread 7, on nid00013. (core affinity = 8)
Hello from rank 0, thread 8, on nid00013. (core affinity = 9)
Hello from rank 0, thread 9, on nid00013. (core affinity = 10)
Hello from rank 0, thread 10, on nid00013. (core affinity = 11)
Hello from rank 0, thread 11, on nid00013. (core affinity = 12)
Hello from rank 0, thread 12, on nid00013. (core affinity = 13)
Hello from rank 0, thread 13, on nid00013. (core affinity = 14)
Hello from rank 0, thread 14, on nid00013. (core affinity = 15)
Hello from rank 0, thread 15, on nid00013. (core affinity = 16)
Hello from rank 0, thread 16, on nid00013. (core affinity = 17)
Hello from rank 0, thread 17, on nid00013. (core affinity = 18)
Hello from rank 0, thread 18, on nid00013. (core affinity = 19)
Hello from rank 0, thread 19, on nid00013. (core affinity = 20)
Hello from rank 0, thread 20, on nid00013. (core affinity = 21)
Hello from rank 0, thread 21, on nid00013. (core affinity = 22)
Hello from rank 0, thread 22, on nid00013. (core affinity = 23)
Hello from rank 0, thread 23, on nid00013. (core affinity = 0)

While this gives better parallel performance, note that here both thread 0 and thread 23 have been assigned to core 0. This is due to the aforementioned Intel helper thread taking residence on core 1, locking it out from the rest of the job. See the following section for advice on how to mitigate this.

Helper Thread

In conjunction with setting KMP_AFFINITY=disabled as described above, it is recommended that you experiment with the -cc option for aprun to set your own affinity and mitigate the effect of the Intel helper thread. Since the work done by the helper will be negligible in comparison to that of a worker thread in your application, it is usually sufficient to use either -cc none or -cc numa_node to unbind your threads within a node or numa region respectively. This allows the operating system to migrate threads between cores as required, ensuring the helper thread does not lock out an entire core for the duration of the job. See the aprun manual page for further details.

Alternatively, you may find you receive better performance by taking advantage of Intel Hyper-Threading (see documentation here) to lock the helper to a single logical core, while statically setting the affinity for each of your OpenMP threads to avoid migration. A job script template for setting this up is given below. The key is the calculation of a custom affinity line to be supplied to -cc, based on the number of nodes and OMP_NUM_THREADS specified in the script.

#!/bin/bash --login

# Hybrid MPI+OpenMP on full nodes for Intel

# compile on the frontend with
# module swap PrgEnv-cray PrgEnv-intel
# cc -openmp -o [executable] [C program]

#PBS -N [job name]
#PBS -l select=[number of nodes]
#PBS -l walltime=[wall time]
#PBS -A [account code]

module swap PrgEnv-cray PrgEnv-intel

# Ignore what Intel wants
export KMP_AFFINITY=disabled

# NODE_COUNT is set by the job (this is Archer-specific)

# Archer has 24 cores
export NUM_CORES=24

export OMP_NUM_THREADS=[one of 1 2 3 4 6 12 24]

# work out the placement of MPI ranks and OpenMP threads
export CORES_PER_NODE=$((NUM_CORES/OMP_NUM_THREADS))
export CC_LIST=''
for (( i=0 ; i<NUM_CORES ; i+=OMP_NUM_THREADS )); do
    CC_LIST="${CC_LIST}:$i,$((i+NUM_CORES))"
    for (( j=i+1 ; j<i+OMP_NUM_THREADS ; j+=1 )); do
        CC_LIST="${CC_LIST},$j"
    done
done
CC_LIST=${CC_LIST:1}

aprun -j 2 -n $((NODE_COUNT*CORES_PER_NODE)) -N $CORES_PER_NODE -d $((2*OMP_NUM_THREADS)) -cc $CC_LIST [executable]

Example output from specifying select=1 and OMP_NUM_THREADS=24 follows:

aprun -j 2 -n 1 -N 1 -d 48 -cc 0,24,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 ./xthi
Application 502325 resources: utime ~0s, stime ~0s, Rss ~11452, inblocks ~6058, outblocks ~15332
Hello from rank 0, thread 0, on nid00013. (core affinity = 0)
Hello from rank 0, thread 1, on nid00013. (core affinity = 1)
Hello from rank 0, thread 2, on nid00013. (core affinity = 2)
Hello from rank 0, thread 3, on nid00013. (core affinity = 3)
Hello from rank 0, thread 4, on nid00013. (core affinity = 4)
Hello from rank 0, thread 5, on nid00013. (core affinity = 5)
Hello from rank 0, thread 6, on nid00013. (core affinity = 6)
Hello from rank 0, thread 7, on nid00013. (core affinity = 7)
Hello from rank 0, thread 8, on nid00013. (core affinity = 8)
Hello from rank 0, thread 9, on nid00013. (core affinity = 9)
Hello from rank 0, thread 10, on nid00013. (core affinity = 10)
Hello from rank 0, thread 11, on nid00013. (core affinity = 11)
Hello from rank 0, thread 12, on nid00013. (core affinity = 12)
Hello from rank 0, thread 13, on nid00013. (core affinity = 13)
Hello from rank 0, thread 14, on nid00013. (core affinity = 14)
Hello from rank 0, thread 15, on nid00013. (core affinity = 15)
Hello from rank 0, thread 16, on nid00013. (core affinity = 16)
Hello from rank 0, thread 17, on nid00013. (core affinity = 17)
Hello from rank 0, thread 18, on nid00013. (core affinity = 18)
Hello from rank 0, thread 19, on nid00013. (core affinity = 19)
Hello from rank 0, thread 20, on nid00013. (core affinity = 20)
Hello from rank 0, thread 21, on nid00013. (core affinity = 21)
Hello from rank 0, thread 22, on nid00013. (core affinity = 22)
Hello from rank 0, thread 23, on nid00013. (core affinity = 23)

Note no threads share a (physical) core.

6.4.3 Compiler optimisations affecting numerical accuracy

Users are generally well aware of the effects that rounding errors can have on numerical calculations, e.g. when calculating a total, different orders of summation can give very slightly different answers. The classic way this manifests itself in parallel is that the result of computing local subtotals on each process, then reducing these across processes to obtain the final total, gives slightly different answers on different numbers of processes.

However, compiler optimisations can make the effects of numerical rounding apparent even when the code appears not to be susceptible.

The following code appeared in a simple OpenMP example to count how many of a set of complex numbers are inside or outside the Mandelbrot set:

    for (i=istart; i<istop; i++)
    {
      for (j=0; j<NPOINTS; j++)
      {
        creal = -2.0+2.5*(double)(i)/(double)(NPOINTS);
        cimag = 1.125*(double)(j)/(double)(NPOINTS);
        ...

Here we are looping across the complex plane and generating a regular grid of sample points c = (creal, cimag). The loop was parallelised by ensuring each thread had different values for "istart" and "istop".

As written, every loop iteration is independent of all the others; the computation of each value of "c" depends only on the values of "i" and "j", so you would expect that exactly the same values of "c" would be generated regardless of how the "i" loop is split up across OpenMP threads.

Using the Cray compiler, however, the code curiously reported slightly different numbers of points inside the Mandelbrot set depending on the number of threads.

The root cause was that the grid of points being generated was very slightly different, and subsequent rounding effects can alter whether a point is inside or outside the set if it lies very close to the boundary. The reason was that the Cray compiler was optimising the computation of "c" to eliminate multiplications and divisions. It was noticing that the value of "c" on iteration "i" differs by a constant amount from the value on iteration "i-1". It was therefore rewriting this along the lines of

// Initialise
       creal = -2.0+2.5*(double)(istart)/(double)(NPOINTS);
       deltacreal = 2.5/(double)(NPOINTS);
       ...
// Increment
       creal = creal + deltacreal;

Although differences in the values of "c" will be very small, the chaotic nature of the Mandelbrot set calculation means that this can change whether or not the point lies within the set.

If this kind of optimisation causes you problems, it can be turned off in the Cray compiler using "-h fp1". Of course, it may have an adverse impact on performance - from the manual page:

"You should never use the -h fp1 option except when your code pushes the limits of IEEE accuracy or requires strong IEEE standard conformance."

6.5 Memory optimisation

Although the dynamic memory allocation procedures in modern programming languages offer a great deal of convenience, the allocation and deallocation functions are time-consuming operations. For this reason they should be avoided in subroutines/functions that are frequently called.

The aprun option -m size[h|hs] specifies the per-PE required Resident Set Size (RSS) memory in megabytes (K, M and G suffixes, case insensitive, are also supported). If you do not include the -m option, the default amount of memory available to each task equals the minimum value of (compute node memory size) / (number of cores) calculated for each compute node.

6.5.1 Memory affinity

Please see the discussion of memory affinity in the OpenMP section.

6.5.2 Memory allocation (malloc) tuning

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. Use the aprun option -ss to specify strict memory containment per NUMA node.

Linux also provides some environment variables to control how malloc behaves. For example, MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc will return the memory to the OS. Returning memory to the OS is costly, and the default setting of 128 KBytes is much too low for a node with 32 GBytes of memory running a single application. Setting it higher might improve performance for some applications.

6.5.3 Using huge pages

Huge pages are virtual memory pages that are larger than the default 4KB page size. They can improve the memory performance for codes that have common access patterns across large datasets. Huge pages can sometimes provide better performance by reducing the number of TLB misses and by enforcing larger sequential physical memory inside each page.

The ARCHER system is set up to have huge pages available by default. The modules craype-hugepages2m and craype-hugepages8m can be used to set the necessary link options and environment variables to enable the usage of 2MB or 8MB huge pages respectively. The default huge page size is 2 Mbytes. You will also need to load the appropriate craype-hugepages module at runtime (in your job submission script) for huge pages to work.

If you know the memory requirements of your application in advance, you should set the -m option to aprun when you launch your job to preallocate the appropriate number of huge pages. This improves performance by reducing operating system overhead. The syntax is:

-m<size>h     request size Mbytes of huge pages per PE (advisory)
-m<size>hs    request size Mbytes of huge pages per PE (required)

6.6 Intel Hyper Threading

ARCHER compute nodes use two 12-core Intel E5-2697 v2 (Ivy Bridge) series processors, each equipped with Intel Hyper-Threading Technology (HTT). This is designed to improve the execution of applications through better parallelism from simultaneous multithreading (SMT) techniques.

HTT is presented to the user in the form of one additional "logical" core being addressed by the operating system for each physical core on the machine. On ARCHER, each node therefore reports a total of 48 available processors (can be confirmed by checking /proc/cpuinfo), with cores 0-23 representing traditional physical cores and 24-47 the HTT logical cores. Each physical core is paired with a logical one with the [physical,logical] sequence being [0,24], [1,25], [2,26], and so on.

While Hyper-Threading doubles the number of available parallel units per node at no additional resource cost, performance effects, positive or negative, are highly dependent on the application. Experimentation is key to determining if HTT would be suitable for your code.

6.6.1 Common Usage

HTT needs to be explicitly requested when running a job on ARCHER. This is performed through supplying the argument "-j 2" to aprun. In addition, the number of requested processors, i.e. the value of the "-n" parameter, should be set to include the number of logical cores available.

For example, on a single node, to run a program which makes full use of all processing units, the following command would be appropriate:

aprun -n 48 -j 2 ./myMPIProgram

This would place one MPI rank on each compute core - both physical and logical.

Similarly, this command would fully populate the 96 combined cores from two nodes:

aprun -n 96 -j 2 ./myMPIProgram

6.6.2 Unpacked Nodes

It is possible to use only some of the available HTT cores, if desired. However, the default allocation scheme on ARCHER is to assign processes in a round-robin fashion, equally splitting them between physical and logical cores. For example, the command:

aprun -n 32 -j 2 ./myMPIProgram

would result in the first myMPIProgram process being placed on core 0, the second rank being placed on core 24 (its logical pair), the third on core 1, the fourth on core 25, and so on. Here, this is sub-optimal as the 32 ranks would be utilising just 16 of the 24 available physical cores.

To avoid this, CPU affinity options can be used to bind processes to specific cores. Below, the option "-cc 0-31" is added to the previous example to ensure the 32 ranks are allocated to the first 32 cores in sequence:

aprun -n 32 -j 2 -cc 0-31 ./myMPIProgram

leading to all 24 physical cores being used (0-23) along with 8 HTT ones (24-31).

6.6.3 Example Hyper Threading Job Script

#!/bin/bash --login
#
#PBS -N sharpen
#PBS -l select=1
#PBS -l walltime=00:01:00
#PBS -A [Project Code]

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Note that only 1 node (24 physical cores) is selected in
# the PBS directives but we are able to start 48 ranks by
# using the additional 24 logical cores enabled by the
# "-j 2" Hyper-threading flag.
aprun -n 48 -j 2 ./myMPIProgram