Knights Landing (KNL) Testing & Development Platform

The ARCHER Knights Landing testing & development platform is a small system consisting of 12 nodes each fitted with an Intel Knights Landing (KNL) processor.

The KNL system offers ARCHER users an opportunity to trial this manycore processor with their own applications, and enables software development aimed at optimising application performance for this processor type. The system is provided by Cray and hence offers a similar user experience to the main ARCHER machine.

This guide describes the KNL system and provides information on how to use the facility. Further details and guidance will continue to be added.

If you have any questions about the system, please contact the ARCHER Helpdesk.

1.1 Requesting access

Existing ARCHER users can request access to the KNL system for their existing login accounts from within the SAFE web interface as described here.

New users can complete the ARCHER KNL Driving Test. On passing the test you will be sent instructions for requesting an account. We strongly recommend reviewing the training materials before taking the test.

Click here to start the KNL Driving Test.

1.2 Training and resources

Information about KNL training and online resources is available by following this link. These include webinars and other training materials that discuss the processor hardware, the ARCHER KNL system setup, and how to use and program for the KNL. We encourage you to review these once you have a basic familiarity with the ARCHER KNL system, in order to find out more about the KNL processor and how your applications can best make use of it.

In particular you may be interested in the slides and practical exercises, including example source code, from a recent ARCHER training course aimed at exploring efficient use of the KNL, which are located here.

1.3 Hardware

The KNL processor constitutes the second generation of Intel's Xeon Phi Many Integrated Core (MIC) architecture. In contrast to its first-generation Xeon Phi predecessor, the Knights Corner co-processor, which functions as an accelerator device fitted in addition to a primary host processor, the KNL is entirely self-hosted. This means that, like a traditional multicore processor such as those on the main ARCHER machine, it directly runs both your application and the operating system on each node. It can run applications parallelised using MPI, OpenMP or a combination of the two, and applications can span multiple KNL nodes using MPI.

The ARCHER KNL system is composed of 12 compute nodes, each with a 64-core KNL processor (model 7210) running at 1.30GHz and with each physical core capable of running 4 concurrent hyperthreads. Each compute node has 96GB of system memory (DDR), and in addition to this each KNL processor comes with 16GB of on-chip memory (MCDRAM) which can be used in different modes as described in Section 1.8.1. Compute nodes are connected by a high-performance Cray Aries interconnect similar to that used in the main ARCHER machine.

In addition to the KNL compute nodes, the system also has one login node that is used to connect to the system via ssh, to compile code, and to submit jobs. This node has a single 16-core 2.60GHz Intel Sandy Bridge E5-2670 processor.

1.4 Connecting

To connect to the KNL system you will first need to log in to ARCHER as usual. Whilst logged in to ARCHER you can connect through to the KNL system as follows:

username@eslogin002> ssh knl-login

1.5 Filesystems

The /home filesystem used on ARCHER is cross-mounted to the KNL system. This means that your usual ARCHER home directory is directly accessible from the KNL system.

Like the main machine the KNL system also has a high-performance Lustre filesystem mounted as /work from which you should run your jobs. However this is not cross-mounted and is therefore entirely independent of the main machine. You should run jobs from inside your personal directory in this filesystem, located at

/work/knl-users/$USER

If you need to transfer any data between /work on the main machine and /work on the KNL system we suggest you do this either by copying the data to your home directory first as an intermediate step, or by using scp as follows:

username@eslogin002> scp data.tar.gz knl-login:/work/knl-users/$USER/

to copy from ARCHER to the KNL system, or

username@eslogin002> scp knl-login:/work/knl-users/$USER/data.tar.gz .

to copy from the KNL system to ARCHER.

Note that in both cases the command has to be given whilst logged in to ARCHER, not the KNL system.

1.6 Environment

The KNL system runs a version of the Cray Linux Environment (CLE) operating system, which is familiar from the main machine. Your environment is controlled through the loading and unloading of environment modules as on the rest of ARCHER, with a default set loaded upon logging in to the system. It is important however to be aware that as the KNL system is a standalone installation the precise modules, software and libraries available as well as the default set are not necessarily the same as those on the main machine. At time of launch the default programming environment on the KNL system is more recent than that on ARCHER.

1.7 Compiling code

The usual Cray compiler wrappers ftn, cc and CC should be used for compilation. GNU, Cray and Intel compilers are all available on the KNL system and can be selected by loading the PrgEnv-gnu, PrgEnv-cray (loaded by default) or PrgEnv-intel module. This also ensures that the right versions of libraries and system software are made available to link through the compiler wrappers.

The craype-mic-knl module is loaded by default. This is a CPU targeting module that is responsible for causing the wrappers to pass relevant KNL-specific flags to the underlying compiler, including optimisation flags such as those telling the compiler to produce AVX512 vector instructions. Please note that binaries compiled with this module loaded may not run on the login node. This may cause builds that run binaries as part of the build process to fail. If this happens you should consult this document by Cray on cross-compiling using CMake and GNU Autotools.

Although the GNU compilers support compiling for KNL in principle, in practice this is not yet fully mature so you may have to compile your code with an Intel compiler or otherwise experiment with GNU compiler flags to find a combination that works. Version 17.0.0.098 of the Intel compiler is available on the system.

1.7.1 Libraries

Libraries available on the system include:

  • Intel MKL (made available when PrgEnv-intel is loaded)
    • Note: when using Intel's MKL link line advisor you should select "None" for the option "usage model of Intel Xeon Phi Coprocessor" (the other options all refer to the Xeon Phi KNC).
  • Cray LibSci
  • FFTW
  • HDF5
  • NetCDF
  • PETSc
  • Trilinos

These are available by loading the relevant module.

1.8 Running jobs

The PBS scheduler is installed on the KNL system and should be used to submit your job script for execution on the KNL nodes or to request an interactive job. The number of desired nodes should be requested from the scheduler by using the -l select option. For example to request two nodes for a job the line

#PBS -l select=2

should be included in your job script, or the option should be passed as a flag to the qsub command as follows:

qsub -l select=2

As on ARCHER you should specify the walltime for your job:

#PBS -l walltime=0:30:0

and the budget code:

#PBS -A k01-$USER

1.8.1 KNL modes

In principle the system can be configured to provide access to the KNL nodes booted in various combinations of possible clustering & memory modes. As there is a significant time cost associated with node reboot the initial configuration of the KNL system has all nodes set to so-called quadrant clustering mode and offers two choices for the memory mode:

  • 10 nodes are configured as quad_100
  • 2 nodes are configured as quad_0

In the quad_100 configuration all 16GB of on-chip MCDRAM memory is automatically used to cache accesses to system DDR memory (the 100 is a percentage and in general can be 0, 25, 50 or 100). This is also known as cache mode. In the quad_0 configuration the MCDRAM does not function as a cache but instead becomes available to use explicitly by your application in addition to system memory. This is also known as flat mode, and it requires you to manage how your application uses the available memory, either by using the numactl utility or through explicit memory allocations within your code. Information about this can be found in the training materials linked in section 1.2.
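As a rough sketch of the numactl route in flat mode: on a flat-mode node the MCDRAM is normally exposed as a separate NUMA node with no cores attached. The snippet below assumes that this is NUMA node 1, which is an assumption and should be confirmed with numactl --hardware on a quad_0 node. The function name flat_mode_cmd is hypothetical, and the snippet only prints the launch line so that it can be inspected before use.

```shell
#!/bin/bash
# Assumption: in quad_0 (flat) mode the 16GB MCDRAM appears as NUMA node 1.
# Check on a quad_0 compute node with: aprun -n 1 numactl --hardware
MCDRAM_NODE=1

# Print an aprun line that wraps the application in numactl.
#   $1 = numactl memory policy:
#        "membind"   allocate only from MCDRAM (allocations fail once it is full)
#        "preferred" try MCDRAM first, fall back to DDR once it is full
#   $2 = number of MPI processes
flat_mode_cmd() {
    local policy="$1" nprocs="$2"
    case "$policy" in
        membind|preferred) ;;
        *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac
    echo "aprun -n $nprocs numactl --$policy=$MCDRAM_NODE ./my_application"
}

flat_mode_cmd membind 64
flat_mode_cmd preferred 64
```

The preferred policy is usually the safer starting point for applications whose working set may exceed 16GB, since membind makes allocations fail rather than spill over into DDR.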

The way to select a specific configuration is to specify an Application Operating Environment (AOE) to PBS as in the following example:

qsub -l select=1:aoe=quad_100

or, in a job script:

#PBS -l select=1:aoe=quad_100

Jobs should request one of the two available AOEs and will be allocated the corresponding node(s) when the job runs. If you do not specify an aoe resource request in your job submission the scheduler will default to quad_100.
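The same resource request can be combined with the standard PBS -I flag to ask for an interactive session on nodes in a particular mode. The following is a minimal sketch that only prints the qsub line. The function name interactive_qsub_cmd and the fallback budget code are placeholders, and the 30-minute walltime is simply an illustrative value.

```shell
#!/bin/bash
# Print a qsub command line for an interactive job in a given memory mode.
#   $1 = number of nodes
#   $2 = AOE name; defaults to quad_100, matching the scheduler default
#   $3 = budget code (the fallback below is only a placeholder)
interactive_qsub_cmd() {
    local nodes="$1" aoe="${2:-quad_100}" budget="${3:-k01-${USER:-username}}"
    case "$aoe" in
        quad_100|quad_0) ;;   # the two AOEs currently offered
        *) echo "AOE not offered on this system: $aoe" >&2; return 1 ;;
    esac
    echo "qsub -I -l select=${nodes}:aoe=${aoe} -l walltime=0:30:0 -A ${budget}"
}

interactive_qsub_cmd 1 quad_0
```

Rejecting unknown AOE names before submission avoids queueing a job that requests a configuration the system does not currently provide.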

We may offer a different configuration in future and would welcome user feedback.

1.8.2 Job limits

Currently the following scheduling limits are in effect:

  • Maximum number of running jobs per user = 1
  • Maximum job length = 12 hours
  • Maximum number of queued jobs per user = 2
  • Maximum job size = 8 nodes
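A request can be checked against the size and length limits before it is submitted. The sketch below hard-codes the two limits from the list above and works in whole hours for simplicity. The function name check_request is hypothetical, and the per-user limits on running and queued jobs are not checked.

```shell
#!/bin/bash
# Limits copied from the list above; update them if the system policy changes.
MAX_NODES=8
MAX_HOURS=12

# Return 0 if a request of $1 nodes for $2 whole hours fits the limits,
# otherwise print the reason and return 1.
check_request() {
    local nodes="$1" hours="$2"
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt "$MAX_NODES" ]; then
        echo "node count $nodes is outside 1-$MAX_NODES"
        return 1
    fi
    if [ "$hours" -lt 1 ] || [ "$hours" -gt "$MAX_HOURS" ]; then
        echo "walltime of $hours hours is outside 1-$MAX_HOURS"
        return 1
    fi
    echo "ok: select=$nodes walltime=$hours:00:00"
}

check_request 4 2
check_request 12 2 || true   # all 12 nodes at once exceeds the job size limit
```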

1.8.3 Application launch

The usual Cray parallel job launcher aprun should be used to run your application.

Pure MPI

The following syntax will launch an application parallelised only with MPI to run on a single node with 64 MPI processes, with one process per physical core:

aprun -n 64 ./my_application

Similarly to span two nodes with 128 MPI processes, still with one process per core:

aprun -n 128 ./my_application

Pure OpenMP

An application parallelised only with OpenMP could be run on a single node with one OpenMP thread per physical core for a total of 64 OpenMP threads as follows:

export OMP_NUM_THREADS=64
aprun -n 1 -d $OMP_NUM_THREADS -cc depth ./my_application

We recommend using the -cc depth option. This allows the threads of each process to wander over the available set of cores/hyperthreads as determined by the -d and -j options. The default option (-cc cpu) binds each thread in turn, which can cause problems with the Intel compiler OpenMP runtime if you are not careful (this option is fine with binaries created with the Cray or GNU compilers). With Intel OpenMP, either turn off the binding (set KMP_AFFINITY=disabled) or use OMP_PLACES to specify the binding.

We will be providing more advice on this topic in due course.

Hyperthreads

So far we have only used one hyperthread per physical core, as that is the default behaviour of aprun. We could launch our pure OpenMP program with 256 OpenMP threads on the 64 physical cores of a single node by enabling all four hyperthreads per physical core using the -j 4 option as follows:

export OMP_NUM_THREADS=256
aprun -n 1 -d $OMP_NUM_THREADS -j 4 -cc depth ./my_application

This corresponds to one OpenMP thread per hyperthread and hence, with 4 hyperthreads per physical core, to four OpenMP threads per core.

In principle we could also run a pure MPI application with hyperthreads, e.g. 128 MPI ranks on a single node, as follows:

aprun -n 128 -j 2 ./my_application

This corresponds to one MPI rank per hyperthread and hence, with 2 hyperthreads per physical core, to two MPI ranks per core.

MPI + OpenMP

To run a hybrid MPI+OpenMP application on all cores of a single node without enabling hyperthreads one could do:

export OMP_NUM_THREADS=16
aprun -n 4 -d $OMP_NUM_THREADS -cc depth ./my_application

This would launch the application with 4 MPI processes and with 16 OpenMP threads per MPI process, i.e. with one OpenMP thread per physical core.

We could enable all hyperthreads whilst keeping the same number of MPI processes with:

export OMP_NUM_THREADS=64
aprun -n 4 -d $OMP_NUM_THREADS -j 4 -cc depth ./my_application

or one could instead use the additional hyperthreads to increase the number of MPI processes per node whilst keeping the same number of OpenMP threads per process:

export OMP_NUM_THREADS=16
aprun -n 16 -d $OMP_NUM_THREADS -j 4 -cc depth ./my_application

It can sometimes be difficult to be sure that you have used the correct set of aprun options to achieve the placement you desire. A useful check is to run a test program that reports the binding before you run your real application. One such test program is xthi; you can obtain the source for this from the CLE Application Placement Guide (Cray document S-2496). Make sure to use the same compiler to build xthi, and the same aprun options to run it, as you will use for your real application.
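Alongside a placement checker, a quick arithmetic check can catch oversubscription before anything is launched: the number of processes per node multiplied by the threads per process must not exceed 64 cores times the number of hyperthreads enabled with -j. The sketch below is one way to do this. The function name aprun_line is hypothetical, and it passes -N (processes per node) and -j explicitly even where the examples above rely on the defaults.

```shell
#!/bin/bash
CORES_PER_NODE=64   # physical cores per KNL node on this system

# Print the launch lines for a hybrid job, or fail if a node is oversubscribed.
#   $1 = nodes, $2 = MPI processes per node,
#   $3 = OpenMP threads per process, $4 = hyperthreads per core (1-4)
aprun_line() {
    local nodes="$1" ppn="$2" threads="$3" ht="$4"
    if [ "$ht" -lt 1 ] || [ "$ht" -gt 4 ]; then
        echo "hyperthreads per core must be 1-4" >&2
        return 1
    fi
    local cpus_per_node=$(( CORES_PER_NODE * ht ))
    if [ $(( ppn * threads )) -gt "$cpus_per_node" ]; then
        echo "$ppn x $threads threads exceeds $cpus_per_node cpus per node" >&2
        return 1
    fi
    echo "export OMP_NUM_THREADS=$threads"
    echo "aprun -n $(( nodes * ppn )) -N $ppn -d $threads -j $ht -cc depth ./my_application"
}

aprun_line 1 4 16 1   # 4 processes x 16 threads, one thread per physical core
aprun_line 4 4 64 4   # 4 nodes, all four hyperthreads per core in use
```

This only checks the counts. It says nothing about where threads actually end up, so it complements rather than replaces a run of a placement checker.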

1.8.4 Example job script

Extending the second-to-last aprun example from the previous section (4 MPI processes per node, each with 64 OpenMP threads, using all four hyperthreads per core) to 4 nodes gives the following example job script:

#!/bin/bash
#PBS -N example
#PBS -l select=4:aoe=quad_100
#PBS -l walltime=0:30:00
#PBS -A k01-$USER

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=64
aprun -n 16 -d $OMP_NUM_THREADS -j 4 -cc depth ./my_application
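
If you run the same layout on different node counts, one option is to generate the job script from a small template. The following is a sketch only: the function name make_job_script and the fallback budget code are placeholders, and the per-node layout (4 processes of 64 threads, 4 hyperthreads per core) is fixed to match the example above. It writes the budget code into the script literally rather than leaving a shell variable inside a #PBS line.

```shell
#!/bin/bash
# Write a job script to stdout.
#   $1 = nodes, $2 = AOE, $3 = budget code (the fallback is only a placeholder)
make_job_script() {
    local nodes="$1" aoe="$2" budget="${3:-k01-${USER:-username}}"
    local ppn=4 threads=64 ht=4      # per-node layout from the example above
    # Unescaped variables are filled in now; \$ variables are left for the job.
    cat <<EOF
#!/bin/bash
#PBS -N example
#PBS -l select=${nodes}:aoe=${aoe}
#PBS -l walltime=0:30:00
#PBS -A ${budget}

cd \$PBS_O_WORKDIR

export OMP_NUM_THREADS=${threads}
aprun -n $(( nodes * ppn )) -d \$OMP_NUM_THREADS -j ${ht} -cc depth ./my_application
EOF
}

make_job_script 4 quad_100
```

Redirecting the output to a file (for example make_job_script 2 quad_0 > job.pbs) gives a script that can then be passed to qsub.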


1.8.5 Tuning for KNL

This section will be expanded in future to contain notes on KNL tuning. We start with information on application binding.

Binding

When an application is launched by aprun onto the compute nodes, the individual processes and threads are mapped (bound) to cores/hyperthreads as specified by the aprun arguments used. The default binding is to map each MPI process and its threads to each core in turn. Although we loosely talk about binding, what actually happens is that the OS has a concept of an affinity mask, which defines the set of 'cpus' (in Linux terminology) that any given thread can use. That set can contain one or more cpus.
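The affinity mask can also be read directly from Linux, which may help to make the idea concrete. The following minimal sketch prints the cpus that the current process is allowed to use by reading the Cpus_allowed_list field of the kernel's per-process status file. The function name allowed_cpus is hypothetical, and this is a generic Linux mechanism rather than anything specific to the KNL system.

```shell
#!/bin/bash
# Print the list of Linux 'cpus' this process may run on, e.g. "0-7".
# awk reads its own status file, but a child process inherits the
# affinity mask of the shell that started it, so the value is the same.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status
}

echo "allowed cpus: $(allowed_cpus)"
```

Run on the login node, this reports the login node's mask. To see the mask that aprun applies, the script would need to be launched through aprun with the same options as the real application.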

It can be very useful to check this binding to be sure where processes and threads are running. One way to do this, as mentioned in the ARCHER tuning guide, is to use the Cray xthi program that can be obtained here.

The Cray Centre of Excellence for ARCHER has made a new affinity checker available called acheck which has a much more user-friendly output than xthi and provides extra information.

Here is an example showing use of acheck:

# This will not be required soon
export MODULEPATH=${MODULEPATH}:/home/y07/y07/knlmods/modulefiles-knl
#
export OMP_NUM_THREADS=8
cd /work
aprun -n 4 -N 2 -d $OMP_NUM_THREADS -cc depth acheck-cray -v
MPI and OpenMP Affinity Checker v1.21

Built with compiler: Cray C++ : Version 8.5.4 (u85060c85049)

There are 4 MPI Processes on 2 hosts

     Host Ranks...
 nid00044 0 1
 nid00052 2 3

Each MPI process has 8 OpenMP threads
OMP_NUM_THREADS was set to 8

                    ---binding--
     host rank thr  pinning mask
 nid00044    0   0   8 cpus 0-7
                 1   8 cpus 0-7
                 2   8 cpus 0-7
                 3   8 cpus 0-7
                 4   8 cpus 0-7
                 5   8 cpus 0-7
                 6   8 cpus 0-7
                 7   8 cpus 0-7
             1   0   8 cpus 8-15
                 1   8 cpus 8-15
                 2   8 cpus 8-15
                 3   8 cpus 8-15
                 4   8 cpus 8-15
                 5   8 cpus 8-15
                 6   8 cpus 8-15
                 7   8 cpus 8-15
 nid00052    2   0   8 cpus 0-7
                 1   8 cpus 0-7
                 2   8 cpus 0-7
                 3   8 cpus 0-7
                 4   8 cpus 0-7
                 5   8 cpus 0-7
                 6   8 cpus 0-7
                 7   8 cpus 0-7
             3   0   8 cpus 8-15
                 1   8 cpus 8-15
                 2   8 cpus 8-15
                 3   8 cpus 8-15
                 4   8 cpus 8-15
                 5   8 cpus 8-15
                 6   8 cpus 8-15
                 7   8 cpus 8-15
    

Note that in this case we launch 4 MPI processes, 2 per node and with 8 OpenMP threads each. Because we used the -cc depth option the threads of each MPI process are bound to 8 cpus in turn.

There are three precompiled versions of acheck:

  • acheck-cray
  • acheck-gnu
  • acheck-intel

Use the version that corresponds to the compiler that was used to build the real application binary that you will run with the same aprun arguments.

The following example shows a launch of 4 MPI processes, each with 4 OpenMP threads (OMP_NUM_THREADS set to 4), using 4 hyperthreads per core:

# job setup as before...
      
aprun -n 4 -d $OMP_NUM_THREADS -cc depth -j4 acheck-cray -v
MPI and OpenMP Affinity Checker v1.21

Built with compiler: Cray C++ : Version 8.5.4 (u85060c85049)

There are 4 MPI Processes on 1 hosts

     Host Ranks...
 nid00051 0 1 2 3

Each MPI process has 4 OpenMP threads
OMP_NUM_THREADS was set to 4

                    ---binding-----------
     host rank thr  pinning mask
 nid00051    0   0   4 cpus 0 64 128 192
                 1   4 cpus 0 64 128 192
                 2   4 cpus 0 64 128 192
                 3   4 cpus 0 64 128 192
             1   0   4 cpus 1 65 129 193
                 1   4 cpus 1 65 129 193
                 2   4 cpus 1 65 129 193
                 3   4 cpus 1 65 129 193
             2   0   4 cpus 2 66 130 194
                 1   4 cpus 2 66 130 194
                 2   4 cpus 2 66 130 194
                 3   4 cpus 2 66 130 194
             3   0   4 cpus 3 67 131 195
                 1   4 cpus 3 67 131 195
                 2   4 cpus 3 67 131 195
                 3   4 cpus 3 67 131 195
    

It is particularly important to check the binding when using Intel OpenMP. Our advice for Intel 17 is to keep the defaults (do not disable KMP_AFFINITY) and to use -cc depth. Note that the advice would be different for Intel 16.