Getting the code
----------------

Obtain the code from the CP2K website. For the benchmarking exercise we will
use the latest released version - CP2K 4.1:

  wget https://sourceforge.net/projects/cp2k/files/cp2k-4.1.tar.bz2
  bunzip2 cp2k-4.1.tar.bz2
  tar -xvf cp2k-4.1.tar

The cp2k-4.1 directory will then contain source code, test files, makefile
etc. Detailed installation instructions are in cp2k-4.1/INSTALL. Further
guidance is below.

Required libraries
------------------

* An FFT library. An FFTW3 library interface and an internal FFT library
  (FFTSG) are available. Just set the correct define e.g. __FFTW3, and add
  the library to LIBS in the arch file (see below)

* libint. Please download the 1.1.4 version of libint from
  http://sourceforge.net/p/libint/home/. Please note the version 2.0 library
  is not compatible with CP2K. The library should be added to the LIBS
  variable in the arch file (see below), and the define __LIBINT used to
  enable use of the library in the code.

* libxsmm. This is a library for fast small block GEMMs, which is
  significantly faster (up to 10x in some cases) than most BLAS libraries.
  Using libxsmm is recommended, but such operations make up a relatively
  small proportion of the runtime for this benchmark. Instructions and
  library source can be found at https://github.com/hfp/libxsmm/. To enable
  the library define __LIBXSMM and add the libxsmm libraries to LIBS.

* libgrid. This is a library for auto-tuned Gaussian/grid operations in
  CP2K. It can give a small speedup (~5-10%) in certain parts of the code,
  which make up a relatively small proportion of the runtime for this
  benchmark. Build instructions are in cp2k-4.1/tools/autotune_grid/README

* ELPA. Optimised eigensolvers are available via the ELPA library. Obtain
  the library from http://elpa.rzg.mpg.de/software and define __ELPA,
  __ELPA2, or __ELPA3, depending on the library version. See
  cp2k-4.1/INSTALL for details.
  To use ELPA, the input files should be modified by adding
  "PREFERRED_DIAG_LIBRARY ELPA" between the "&GLOBAL" and "&END GLOBAL"
  lines.

* MPI, BLAS, LAPACK, BLACS, ScaLAPACK are all required

* Other libraries described in cp2k-4.1/INSTALL (e.g. libxc, PLUMED...) have
  no effect on this benchmark.

Building the code
-----------------

* To build CP2K, create or modify an arch file in the directory cp2k/arch -
  there are many examples for various architectures which may be of use. By
  convention, *.popt is for an MPI-only build, *.psmp is for mixed-mode
  MPI/OpenMP. Below is an example for ARCHER using mixed-mode:

    CC       = cc
    CPP      =
    FC       = ftn -fopenmp -ffree-form
    LD       = ftn -fopenmp
    AR       = ar -r
    DFLAGS   = -D__FFTW3 -D__LIBINT \
               -D__parallel -D__SCALAPACK -D__HAS_NO_SHARED_GLIBC \
               -D__STATM_RESIDENT -D__LIBXSMM -D__HAS_LIBGRID \
               -D__MPI_VERSION=3 -D__ELPA2 -D__MAX_CONTR=4
    DATA_DIR = /work/y07/y07/cp2k/4.1.17463/data
    LIB_LOC  = /usr/local/packages/cp2k/4.1.17463/libs
    CPPFLAGS = -traditional -C $(DFLAGS) -P
    CFLAGS   = $(DFLAGS)
    FCFLAGS  = $(DFLAGS) -I$(LIB_LOC)/libxsmm/include \
               -I$(LIB_LOC)/elpa/include/elpa_openmp-2015.05.001/modules/ \
               -O3 -ffast-math -funroll-loops -fno-tree-vectorize \
               -fno-omit-frame-pointer -g -march=core-avx-i \
               -Waliasing -Wampersand -Wc-binding-type -Wconversion \
               -Wintrinsic-shadow -Wintrinsics-std -Wline-truncation \
               -Wno-tabs -Wrealloc-lhs-all -Wtarget-lifetime -Wunderflow \
               -Wunused-but-set-variable -Wunused-variable -std=f2003
    LDFLAGS  = $(FCFLAGS)
    LIBS     = -L$(LIB_LOC)/libint/lib -lderiv -lint -lstdc++ \
               -L$(LIB_LOC)/libgrid -lgrid \
               -L$(LIB_LOC)/libxsmm/lib -lxsmmf -lxsmm -lxsmmext \
               -L$(LIB_LOC)/elpa/lib -lelpa_openmp \
               -lfftw3 -lfftw3_threads -lz -ldl

  Full descriptions of all the defines and other available flags are in the
  INSTALL file in the cp2k directory. Other example arch files can be found
  at http://dashboard.cp2k.org

* Once an arch file has been created, build the code using the Makefile in
  cp2k/makefiles:

    make -j ARCH= VERSION=
  The values given for ARCH and VERSION correspond to the name of the arch
  file. Parallel build is available for multi-core machines. A single-core
  build may take around an hour depending on optimisation options.

Running the benchmark
---------------------

The benchmark input files are provided with the CP2K distribution, and can
be found in cp2k-4.1/tests/QS/benchmark_HFX/LiH

To run the code (either in MPI-only, or mixed-mode), copy the input files
into a directory, and run the code there (using a batch job):

  cp2k.psmp input_bulk_B88_3.inp

This input file generates an initial wavefunction file which will be used in
the actual benchmark run. It takes approximately 5 minutes with 256 MPI
processes to complete this setup calculation (it does not matter exactly how
many processes are used to run this preparation step). The job will generate
an output file "LiH_bulk_3-RESTART.wfn". Please rename this file to
"B88.wfn", and it can then be used as input for all the benchmark runs.

To run the main benchmark itself use:

  cp2k.psmp input_bulk_HFX_3.inp

Please note that this job uses relatively large amounts of memory to obtain
good performance, by caching Electron Repulsion Integrals (ERIs) for later
re-use. It may be necessary to underpopulate nodes, or use OpenMP to harness
larger numbers of cores while allowing more memory per process. The provided
input file limits the amount of memory per process for the ERIs to 14000 MiB
(via the MAX_MEMORY keyword in the FORCE_EVAL/DFT/XC/HF/MEMORY section of
the input). CP2K will use only this much memory, and recompute the remaining
values on-the-fly. Look in the output for a line like:

  HFX_MEM_INFO| Number of sph. ERI's calculated on the fly: 0

If the value is 0, then CP2K has enough memory to complete the calculation
at maximum efficiency. MAX_MEMORY should be set as high as possible
depending on the amount of memory available per process (leaving a few
hundred MB for other CP2K data structures, plus the OS, MPI etc.)
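The check on the HFX_MEM_INFO line can be automated. The following is a minimal Python sketch, not part of CP2K; the function names are hypothetical, and it assumes the output contains the line in the form quoted above (any amount of whitespace before the number is accepted).

```python
import re

# Matches the line quoted above; the count is captured as group 1.
ERI_LINE = re.compile(
    r"HFX_MEM_INFO\|\s*Number of sph\. ERI's calculated on the fly:\s*(\d+)"
)


def eri_on_the_fly_counts(output_text):
    """Return every on-the-fly ERI count found in the CP2K output text."""
    return [int(match.group(1)) for match in ERI_LINE.finditer(output_text)]


def all_eris_cached(output_text):
    """True only if the line was found and every reported count is zero."""
    counts = eri_on_the_fly_counts(output_text)
    return bool(counts) and all(count == 0 for count in counts)
```

To use it, read the standard output file of the job into a string and pass it to all_eris_cached. A False result means either that the line was not found, or that some ERIs were recomputed, in which case MAX_MEMORY may be too low for the chosen process count.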
As an indication, the following maximum memory usage was measured on ARCHER:

  MPI procs   Memory per process (GiB)
  64          16.7 *
  128         15.5 *
  256         13.7 *
  512          7.2 *
  1024         5.0
  2048         3.0
  4096         2.1
  8192         1.7

Verifying correct results
-------------------------

To verify that the code is giving correct results please check the following
value in the standard output from the job:

* ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.): -870.934788598823616

This value should be correct to within rounding errors at double precision
(relative error of approx 1E-14).

Measuring Performance
---------------------

CP2K reports wallclock timings - inclusive (TOTAL TIME) and exclusive (SELF
TIME) for all significant routines, at the foot of the output of a run. The
first line - grep "CP2K " - can be used to measure the time taken by the
code (excluding any job setup/teardown time due for example to the batch
system). If that overhead is of interest, a walltime reported by the batch
system or UNIX 'time' might be more appropriate.

The code also reports how much time is spent in various communication
routines under "MESSAGE PASSING PERFORMANCE", which should give some
indication of which MPI routines should be optimised.

As an indication, on ARCHER, this benchmark was found to scale well on up to
49152 cores (taking around 50s to complete). Reference timings are available
on the CP2K website: http://www.cp2k.org/performance#lih-hfx

Tuning and optimising the code
------------------------------

To improve the performance of CP2K on a particular platform the following
suggestions may be useful. Many require setting flags in the input file.
Full documentation of the file format is at: http://manual.cp2k.org/trunk/

* General compiler optimisation - aggressive optimisation levels, e.g. -O3
  with the GNU compilers, are usually OK, but the code is known to be
  problematic for the PGI compiler (at least at version 9, the last time the
  code was seriously tested with PGI), and for older versions of Intel
  Fortran at higher optimisation levels.

* FFT library.
  FFT is only a very small component of this test, but using an optimised
  FFT library instead of the internal FFTSG is recommended. For FFTW3 it is
  possible to choose a different plan type (MEASURE, PATIENT, EXHAUSTIVE)
  from the default ESTIMATE by setting the FFTW_PLAN_TYPE variable in the
  GLOBAL section of the input file, which may give better performance.

* Process placement. CP2K employs 2D and 3D process decompositions (set
  GLOBAL/PRINT_LEVEL MEDIUM to see these for a particular run). On some
  machines, using a process-to-network mapping that improves 2D neighbour
  locality may improve performance.
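
Several of the suggestions above, and the ELPA note under Required libraries, involve setting keywords in the GLOBAL section of the input file. The following Python sketch shows one way this could be scripted when preparing many benchmark inputs. The helper set_global_keywords is hypothetical and not part of CP2K; it assumes the input has a single "&GLOBAL" ... "&END GLOBAL" section with one keyword per line.

```python
def set_global_keywords(input_text, keywords):
    """Return input_text with each 'KEYWORD VALUE' pair set inside &GLOBAL.

    Keywords already present at the top level of &GLOBAL are replaced in
    place; the remaining ones are added just before '&END GLOBAL'.
    """
    pending = {key.upper(): value for key, value in keywords.items()}
    out = []
    depth = 0  # 0 = outside &GLOBAL, 1 = directly inside, >1 = in a subsection
    for line in input_text.splitlines():
        token = line.strip().upper()
        if depth == 0:
            if token.split()[:1] == ["&GLOBAL"]:
                depth = 1
            out.append(line)
            continue
        if token.startswith("&END"):
            depth -= 1
            if depth == 0:
                # Closing &GLOBAL: add keywords that were not already present.
                out.extend(f"  {key} {value}" for key, value in pending.items())
                pending.clear()
        elif token.startswith("&"):
            depth += 1  # a nested subsection inside &GLOBAL
        elif depth == 1 and token and token.split()[0] in pending:
            key = token.split()[0]
            out.append(f"  {key} {pending.pop(key)}")
            continue
        out.append(line)
    if pending:
        raise ValueError("no complete &GLOBAL section found")
    return "\n".join(out) + "\n"
```

For example, passing {"PREFERRED_DIAG_LIBRARY": "ELPA", "FFTW_PLAN_TYPE": "MEASURE", "PRINT_LEVEL": "MEDIUM"} together with the text of input_bulk_HFX_3.inp would return a modified copy of the input, leaving the original file untouched. Working on text rather than files keeps the helper easy to check before any benchmark input is overwritten.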