8. I/O on ARCHER

This section provides information on getting the best performance out of the parallel /work file systems on ARCHER when writing data, particularly using parallel I/O patterns.

Bear in mind that /work is a shared filesystem so performance may vary and in particular there is no magic recipe for obtaining the best I/O performance. Some experimentation is typically required.

Common I/O Patterns

Single file, single writer (Serial I/O)

A common approach is to funnel all the I/O through a single master process. Although this has the advantage of producing a single file, the fact that only a single client is doing all the I/O means that it gains little benefit from the parallel file system.

File-per-process (FPP)

One of the first parallel strategies people use for I/O is for each parallel process to write to its own file. This is a simple scheme to implement and understand but has the disadvantage that, at the end of the calculation, the data is spread across many different files and may therefore be difficult to use for further analysis without a data reconstruction stage.

Single file, multiple writers without collective operations

There are a number of ways to achieve this. For example, many processes can open the same file but access different parts by skipping some initial offset; parallel I/O libraries such as MPI-IO, HDF5 and NetCDF also enable this.

Shared-file I/O has the advantage that all the data is organised correctly in a single file making analysis or restart more straightforward.

The problem is that, with many clients all accessing the same file, there can be a lot of contention for file system resources.

Single Shared File with collective writes (SSF)

The problem with having many clients performing I/O at the same time is that, to prevent them clashing with each other, the I/O library may have to take a conservative approach. For example, a file may be locked while each client is accessing it which means that I/O is effectively serialised and performance may be poor.

However, if I/O is done collectively where the library knows that all clients are doing I/O at the same time, then reads and writes can be explicitly coordinated to avoid clashes. It is only through collective I/O that the full bandwidth of the file system can be realised while accessing a single file.

Schemes benchmarked

As seen above, there are a number of different types of I/O that could be considered; we have limited ourselves to providing information for write performance in the following cases:

File Per Process (FPP)
One binary file written per parallel process. In this scenario we have investigated performance for standard unformatted Fortran writes.
Single Shared File (SSF) with collective writes
One binary file written to collectively by all parallel processes. In this scenario we have investigated performance using MPI-IO, parallel NetCDF, and parallel HDF5.

We have only considered write performance in the first instance as this is usually the critical factor for performance in parallel modelling or simulation applications.

Note: We have not yet included the NetCDF or HDF5 data below but will add this shortly.

Lustre parallel file systems

The ARCHER /work file systems use the Lustre parallel file system technology. The Lustre filesystem provides POSIX semantics (changes on one node are immediately visible on other nodes) and can support very high data rates for appropriate I/O patterns.

Lustre basic concepts

In order to understand how Lustre behaves, it is useful to understand a few fundamental features of how it is constructed.

  • Each /work filesystem comprises around 50 separate storage units called Object Storage Targets (OSTs); each OST can write data at around 500 MB/s. You can think of an OST as being a disk, although in practice it may comprise multiple disks, e.g. in a RAID array.
  • An individual file can be stored across multiple OSTs; this is called striping. The default is to stripe across 4 OSTs, although this can be changed by the user.
  • Every ARCHER node is a separate filesystem client; good performance is achieved when multiple clients simultaneously access the file system.
  • There is a single MetaData Server (MDS) which stores global information such as the directory structure and which OSTs a file is stored on. Operations such as opening and closing a file can require dedicated access to the MDS and it can become a serial bottleneck in some circumstances.
  • Parallel file systems are typically optimised for high bandwidth: they work best with a small number of large, contiguous IO requests rather than a large number of small ones.

Users have control of a number of settings on Lustre file systems including the striping settings. Although these parameters can be set on a per-file basis they are usually set on directory where your output files will be written so that all output files inherit the settings.

Stripe settings for a directory (or file) can be set using the lfs setstripe command.

You can query the stripe settings for a directory (or file) using the lfs getstripe command.

Setting the striping parameters correcty is a key part of getting the best write performance for your application on Lustre. There are two parameters that you need to be concerned with: stripe count and stripe size. Generally, the stripe count has a larger impact on performance than the stripe size.

To get best performance from Lustre it is best to match your I/O transfer size to the stripe size or multiples of the stripe size. It is also helpful if the I/O is aligned to 1 MiB boundaries.

Stripe Count

The stripe count sets the number of OSTs (Object Storage Targets) that Lustre stripes the file across. In theory, the larger the number of stripes, the more parallel write performance is available. However, large stripe counts for small files can be detrimental to performance as there is an overhead to using more stripes.

Stripe count is set using the -c option to lfs setstripe . For example, to set a stripe count of 1 for directory res_dir you would use:

auser@eslogin006:~> lfs setstripe -c 1 res_dir/

The special value '-1' to the stripe count tells Lustre to use maximum striping.

We have investigated stripe counts of 1, 4 (the default), and -1 (maximum striping).

FPP Note: Using multiple stripes with large numbers of files (for example in a FPP scheme with large core counts) can cause problems for the file system causing your jobs to fail and also may impact other users in a large way. We strongly recommend that you use a stripe count of 1 when using a FPP scheme.

SSF Note: In contrast to the FPP scheme, to get best performance using a SSF scheme you should use maximal striping ( -c -1 ). In this case, this does not cause issues for the file system as there are only a small number of files open.

Stripe Size

The size of each stripe generally has less of an impact on performance than the stripe count but can become important as the size of the file being written increases.

Stripe size is set using the -s option to lfs setstripe . For example, to set a stripe size of 4 MiB for directory res_dir along with maximum striping you would use:

auser@eslogin006:~> lfs setstripe -s 4m -c -1 res_dir/

Note that the -s option understands the following suffixes: 'k': KiB, 'm': MiB, 'g': GiB.

We have investigated stripe sizes of 1 MiB (the default), 4 MiB, and 8 MiB.

ARCHER /work default stripe settings and OST counts

Each of the three Lustre /work file systems on ARCHER has the same default stripe settings:

  • A default stripe count of 4
  • A default stripe size of 1 MiB (1048576 bytes)

These settings have been chosen to provide a good compromise for the wide variety of I/O patterns that are seen on the system but are unlikely to be optimal for any one particular scenario.

The total available number of OSTs available on each file system are as follows:

  • fs2: 48
  • fs3: 48
  • fs4: 56

How can I find out which file system I am on?

All projects have their /work directory on one of the three Lustre file systems. You can find out which one your project uses on ARCHER by moving the your /work directory and using the readlink command:

auser@eslogin006:~> cd /work/t01/t01/auser
auser@eslogin006:/work/t01/t01/auser> readlink -f .
/fs3/t01/t01/auser

So project 't01' uses the 'fs3' Lustre file system.

Do I have a problem with my I/O?

The first thing to do is to time your I/O and quantify it in terms of bytes per second (usually MiB/s or GiB/s). This does not have to be particularly accurate - something correct to within a factor of two should highlight whether or not there is room for improvement.

  • If your aggregate I/O bandwidth is much less than 500 MiB/s then there could be room for improvement; if it is more than several GiB/s then you are already doing rather well.

What can I measure and tune?

I/O performance can be quite difficult to understand, but there are some simple experiments you can do to try to uncover the sources of any problems.

First, look at how your I/O rate changes as you scale your code to increasing numbers of processes.

  • If the I/O rate stays the same then you may be dominated by serial IO.
  • If the I/O rate drops significantly then you may be seeing contention between clients. This could be because you have too many files, or too many clients independently accessing the same file. See the summary of performance advice below for how to avoid this.

If you are using a SSF scheme, you can also look at how I/O scales with the number of OSTs across which each file is striped (the stripe count makes little difference if you are using an FPP scheme). It would be best to run with several hundred processes so that you have sufficient Lustre clients to benefit from striping (remember that each ARCHER node is a client).

It is probably easiest to set striping per directory, which affects all files subsequently created in it, and delete all existing files before re-running any tests. For example, to set a directory to use 8 stripes:

  user@archer> lfs setstripe -c 8 <directory>

Run tests with, for example, 1 stripe, 4 stripes (the default), 8 stripes and full striping (stripe count of -1).

See the results below for examples of what the performance variation looks like with different stripe cound for the SSF scheme.

Summary of performance advice

Our tests show that the maximum bandwidth you can expect to obtain from the ARCHER Lustre /work file systems in practice is 15-17 GiB/s, irrespective of whether you use a FPP or SSF approach (the theoretical maximum performance is 30 GiB/s for fs2 and fs3 and 35 GiB/s for fs4).

We provide detailed performance numbers for the different schemes below and provide a high-level summary here.

  • For serial I/O with the default settings you should be able to saturate the bandwidth of a single OST, i.e. achieve around 500 MiB/s. You may be able to increase this by increasing the stripe size above its default of 1 MiB.
  • Do not stripe small files or large collections of small files (for example source code). Use a directory with a stripe count of 1.
  • For performance, we would recommend that users use a FPP scheme in most cases. It gives good, reliable performance for a wide range of process counts. (This may, however, add complexity in analysing or reusing the data.)
  • When using FPP you should explictly set a stripe count of 1 (-c 1) as it provides excellent performance, minimises chances of job crashes, and has minimum impact on other users.
  • When using SSF you should set maximum stripe count (-c -1) as this is required for performance.
  • If you are using MPI-IO in a SSF scheme you must ensure that you are using collective I/O to get good performance. I/O libraries such as MPI-IO, NetCDF and HDF5 can all implement collective IO but it is not necessarily the default - in general you need to check the documentation. For example, in MPI-IO the routines ending _all are always collective; you can specify the behaviour in NetCDF via the call nc_var_par_access.
  • The stripe size for SSF depends on the amount of data being written. The larger the data being written, the larger the stripe size should be used. Please see the more detailed information provided below.

Plot comparing maximum achieved write bandwiths on ARCHER fs3 for FPP and SSF parallel I/O schemes:

FPP vs SSF max. write bandwith plot

File Per Process (FPP)

The data below show the maximum measured write bandwidths (in MiB/s) for the FPP scheme using unformatted Fortran writes with different numbers of parallel processes.

These writes corresponded to:

  • 1024 MiB written per process
  • Fully-populated ARCHER nodes (24 processes per node)

A large amount of data needs to be written by each process to make sure that the aggressive caching and buffering of I/O by the Fortran I/O library is avoided.

In summary:

  • Using FPP we can quickly get high levels of performance for the ARCHER Lustre file systems (just 2 nodes, 48 processes, can give a write performance in excess of 12 GiB/s)

Plot of write bandwidth distributions for ARCHER fs3 using the FPP scheme:

FPP write bandwith plot
Number of Processes Stripe Count Stripe Size (MiB) Max. Write Bandwidth (MiB)
24 1 1 5929
48 1 1 12261
96 1 1 13262
192 1 1 12532
384 1 1 12048
768 1 1 12741
1536 1 1 11930
3072 1 1 13426
6144 1 1 14425

Single Shared File (SSF)

MPI-IO

The data below show the maximum measured write bandwidths (in MiB/s) for the SSF scheme using MPI-IO collective writes with different numbers of parallel processes. These writes corresponded to:

  • 16 MiB written per process
  • Fully-populated ARCHER nodes (24 processes per node)

Less data is required per process in the SSF scheme compared to the FPP scheme as MPI-IO does not employ the agressive buffering seen in the Fortran I/O library.

In summary:

  • Below 96 processes the defaut stripe count of 4 gives the best performance
  • Above 96 processes maximal striping (-c -1) should be used to get best performance
  • MPI-IO collective operations are required to see good performance
  • Single striping (-c 1) was not explored in detail as the performance was so low that it made running large tests problematic.

A more detailed study of the performance of SSF schemes can be found in the ARCHER White Paper: Performance of Parallel IO on ARCHER

Plot of write bandwidth distributions for ARCHER fs3 using the SSF scheme:

SSF write bandwith plot
Number of Processes Stripe Count Stripe Size (MiB) Max. Write Bandwidth (MiB)
24 -1 1 662
48 -1 1 1311
96 -1 1 2270
192 -1 1 3791
384 -1 1 5396
768 -1 1 5775
1536 -1 1 5945
3072 -1 1 11321
6144 -1 1 11530
12288 -1 1 6368
Number of Processes Stripe Count Stripe Size (MiB) Max. Write Bandwidth (MiB)
24 -1 8 691
48 -1 8 1487
96 -1 8 3085
192 -1 8 5587
384 -1 8 7663
768 -1 8 9108
1536 -1 8 8599
3072 -1 8 13785
6144 -1 8 15946
12288 -1 8 8205
Number of Processes Stripe Count Stripe Size (MiB) Max. Write Bandwidth (MiB)
24 4 1 862
48 4 1 1313
96 4 1 1292
192 4 1 1585
384 4 1 881
768 4 1 924
1536 4 1 822
3072 4 1 1288
6144 4 1 946
Number of Processes Stripe Count Stripe Size (MiB) Max. Write Bandwidth (MiB)
24 4 8 915
48 4 8 1565
96 4 8 1789
192 4 8 1753
384 4 8 1251
768 4 8 1553
1536 4 8 1355
3072 4 8 1650
6144 4 8 1939

I/O Profiling

You can profile your application to get an insight into the I/O behaviour. CrayPAT can profile I/O tracing for specific I/O APIs by specifying particular flags (for example pat build -g io ... ) A default perftools-lite run will also include some I/O profiling.

Simple I/O data can be obtained from the iobuf module, it can also be used to provide additional buffering.

If you are using MPI-IO (or NetCDF or HDF5) then you can set MPICH_MPIIO_STATS to obtained I/O statistics (see man mpi).