5. Running Jobs on ARCHER
- 5.1 Using PBS Pro
- 5.2 Output from PBS jobs
- 5.3 bolt: Job submission script creation tool
- 5.4 Running Parallel Jobs
- 5.4.1 PBS Submission Options
- 5.4.2 Parallel job launcher aprun
- 5.4.3 Task affinity for "unpacked" jobs
- 5.4.4 Example: job submission script for MPI parallel job
- 5.4.5 Example: job submission script for MPI parallel job on large memory nodes
- 5.4.6 Example: job submission script for OpenMP parallel job
- 5.4.7 Example: job submission script for MPI+OpenMP (mixed mode) parallel job
- 5.4.8 Interactive Jobs
- 5.5 Array Jobs
- 5.6 Sharing Nodes with OpenMP/Threaded Jobs
- 5.7 Python Task Farm
- 5.8 Job Submission System Layouts and Limits
- 5.9 checkScript: Script validation tool
- 5.10 Setting a time limit for aprun
- 5.11 Low Priority Access
- 5.12 Long Queue Access
- 5.13 Short (Debug) Queue Access
- 5.14 Reservations
- 5.15 Postprocessing/Serial Jobs
- 5.16 OOM (Out of Memory) Error Messages
The ARCHER facility uses PBS (Portable Batch System) to schedule jobs. Writing a submission script is typically the most convenient way to submit your job to the job submission system. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
You can also use the bolt job submission script creation tool to generate correct job submission scripts quickly and easily (see below).
Once you have written your job submission script you can validate it with the checkScript command (see below).
If you have any questions on how to run jobs on ARCHER do not hesitate to contact the ARCHER Helpdesk.
5.1 Using PBS Pro
You typically interact with PBS by (1) specifying PBS directives in job submission scripts (see examples below) and (2) issuing PBS commands from the esLogin nodes.
The job submission system on ARCHER generally works differently from one you may have used on other HPC facilities. ARCHER job submission scripts do not run directly on the compute nodes. Instead they run on Job Launcher Nodes. Job Launcher Nodes (also called MOM Nodes) are ARCHER Service Nodes that have permission to issue the aprun command. The aprun command launches jobs on the compute nodes. This contrasts with most HPC job submission systems, where the job submission script runs directly on the first compute node selected for the job. Therefore, running jobs on ARCHER requires care: avoid placing any memory or CPU intensive commands in job submission scripts as these could cause problems for other users who are sharing the Job Launcher Nodes. CPU and memory intensive commands should be run as serial jobs on the pre- and post-processing nodes (see below for details on how to do this).
There are three key commands used to interact with the PBS on the command line:
Check the PBS man page for more advanced commands:
The qsub command
The qsub command submits a job to PBS:

qsub job_script.pbs

This will submit your job script "job_script.pbs" to the job queues. See the sections below for details on how to write job scripts.
Note: To ensure the minimum wait time for your job, you should specify a walltime as short as possible for your job (i.e. if your job is going to run for 3 hours, do not specify 12 hours). On average, the longer the walltime you specify, the longer you will queue for.
The qstat command
Use the qstat command to view the job queue. For example:

qstat -q

will list all available queues on the ARCHER facility.
You can view just your jobs by using:
qstat -u $USER
The " -a " option to qstat provides the output in a more useful format.
To see more information about a queued job, use:
qstat -f $JOBID
This option may be useful when your job fails to enter a running state. The output contains a PBS comment field which may explain why the job failed to run.
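The comment field is easiest to spot if you filter the (long) qstat -f output. A minimal sketch, using made-up job details rather than real ARCHER output:

```shell
# Hypothetical saved output from "qstat -f $JOBID"; the job ID and
# comment text below are illustrative, not real ARCHER output.
sample_output='Job Id: 123456.sdb
    Job_Name = my_job
    job_state = Q
    comment = Not Running: too few free resources'

# On ARCHER this would be: qstat -f $JOBID | grep comment
echo "$sample_output" | grep comment
```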
If the batch system has calculated an estimated start time for a job, it is possible to view this by adding the -T flag as follows:
qstat -T $JOBID
The qdel command
Use this command to delete a job from ARCHER's job queue. For example:

qdel $JOBID

will remove the job with ID $JOBID from the queue.
5.1.1 Using checkQueue
The checkQueue tool has been written to allow users to see more detailed information about their queued jobs. It shows the PBS comment line for all queued, running and recently completed jobs. It may be particularly helpful in diagnosing why a job has been queued for some time, suggesting possible actions to take.
Examples of the sort of output the tool can give include:
userz@eslogin008:/work/x01/x01/user> checkQueue
==========checkQueue===========
Listing job status comments for all jobs for user userz
123456.sdb jobx: Not Running: Host set host=archer_2901 too few free resources
123459.sdb joby: Not Running: PBS Error: ARCHER: User userz is not in x01-team
ERROR: userz is not a member of budget x01-team
: Please either ask the PI to add user userz to x01-team
: in which case the job will then run automatically
: or delete this job from the queue and resubmit
: using a budget for which userz is a member
==========
2 jobs found for user userz
========END checkQueue========
userz@eslogin008:/work/x01/x01/user> checkQueue
==========checkQueue===========
Listing job status comments for all jobs for user userz
123461.sdb jobx: Not Running: PBS Error: ARCHER: budget x01-grp has no time left
ERROR: Budget x01-grp has no time left. User: userz
: Please allocate additional time resource to budget x01-grp
: or delete job 123461.sdb from the queue.
==========
123465.sdb joby: No comment
2 jobs found for user userz
========END checkQueue========
5.1.2 Jobs held by PBS Checks
If a budget is used up before a queued job runs then the job will be automatically put into a Held state.
You will see the status 'H' if you use qstat to check your jobs, or a 'Budget xxx has insufficient resource' message if you run checkQueue.
You should request additional resource for the budget from the PI or else delete these jobs from the queue.
If additional resource is granted then you will need to remove the hold state from the job using the qrls command:

qrls -h u <job ID>
Jobs which are left in a held state for over two weeks may be automatically killed.
5.2 Output from PBS jobs
PBS produces standard output and standard error for each batch job; these can be found in the files <jobname>.o<Job ID> and <jobname>.e<Job ID> respectively. These files appear in the job's working directory once your job has completed or its maximum allocated run time (i.e. wall time, see later sections) has run out.
You can specify paths for the output and error files when you submit your job script with the qsub options -o and -e respectively. Using the "-j oe" option will merge the error stream into the output stream, so both appear in the same file, i.e. <jobname>.o<Job ID>
5.3 bolt: Job submission script creation tool
The bolt job submission script creation tool has been written by EPCC to simplify the process of writing job submission scripts for modern multicore architectures. Based on the options you supply, bolt will generate a job submission script that uses ARCHER as efficiently as possible.
bolt can generate job submission scripts for both parallel and serial jobs. Note that MPI, OpenMP and hybrid MPI/OpenMP jobs are supported. Low-priority jobs are not supported.
If there are problems or errors in your job parameter specifications then bolt will print warnings or errors. However, bolt cannot detect all problems so you may wish to run the checkScript tool on job submission scripts prior to running them.
5.3.1 Basic Usage
The basic syntax for using bolt is:
bolt -n [parallel tasks] -N [parallel tasks per node] -d [number of threads per task] \
     -t [wallclock time (h:m:s)] -o [script name] -j [job name] -A [project code]
For example, to generate a job script to run an executable called 'my_prog.x' for 3 hours using 3072 parallel tasks and 12 tasks per compute node, you would use:
bolt -n 3072 -N 12 -t 3:0:0 -o my_job.bolt -j my_job -A z01-budget my_prog.x arg1 arg2
This generates the job script 'my_job.bolt' with the correct options to run 'my_prog.x' with command line arguments 'arg1' and 'arg2'. The project code against which the job will be charged is specified with the '-A' option. As usual, the job script is submitted as follows:

qsub my_job.bolt
Note: if you do not specify the script name with the '-o' option then your script will be a file called 'a.bolt'
Note: if you do not specify the number of parallel tasks then bolt will generate a serial job submission script.
Note: if you do not specify a project code, bolt will use your default project code (set by your login account).
Note: if you do not specify a job name, bolt will use either bolt_ser_job (for serial jobs) or bolt_par_job (for parallel jobs).
Note: To ensure the minimum wait time for your job, you should specify a walltime as short as possible for your job (i.e. if your job is going to run for 3 hours, do not specify 12 hours). On average, the longer the walltime you specify, the longer you will queue for.
5.3.2 Further help
You can access further help on using bolt on the ARCHER machine with the '-h' option:

bolt -h
A selection of other options are:
- Write and submit the job script rather than just writing the job script.
- Force the job to be parallel even if it only uses a single parallel task.
5.4 Running Parallel Jobs
This section describes how to write job submission scripts specifically for different kinds of parallel jobs on ARCHER (serial jobs are described in later sections).
All parallel job submission scripts require (as a minimum) you to specify three things:
- The number of compute nodes (each compute node has 24 cores) you require via the "-l select=[nodes]" option
- The maximum length of time (i.e. walltime) you want the job to run for via the "-l walltime=[hh:mm:ss]" option. To ensure the minimum wait time for your job, you should specify a walltime as short as possible for your job (i.e. if your job is going to run for 3 hours, do not specify 12 hours). On average, the longer the walltime you specify, the longer you will queue for.
- The project code that you want to charge the job to via the "-A [project code]" option
In addition to these mandatory specifications, there are many other options you can provide to PBS.
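A minimal job script header containing only these three mandatory specifications might look as follows (the values shown are placeholders, not recommendations):

```shell
#!/bin/bash --login
# Minimal sketch showing only the three mandatory PBS specifications.
#PBS -l select=4               # 4 compute nodes (4 x 24 = 96 cores)
#PBS -l walltime=01:00:00      # terminate the job after one hour
#PBS -A [budget code]          # project to charge (replace before submitting)
```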
5.4.1 PBS Submission Options
This section provides more information on various options used when submitting jobs to PBS on ARCHER. We also list a number of options that should not be used on the system.
When specified in a job submission script, all PBS options start with a "#PBS"-string. All options can also be specified directly on the command line.
| Option | Description |
| --- | --- |
| -N My_job | Name for your job. In the examples below the name will be "My_job", but you can replace "My_job" with any name you want. The name will be used in various places. In particular it will be used in the queue listing and to generate the name of your output and/or error file(s). Note there is a limit on the size of the name. |
| -l select=[nodes] | Total number of compute nodes required for your job. In the simplest case (when using a single physical core per MPI process) you can get this number by dividing your total number of MPI processes by 24 (the number of physical cores per compute node). |
| -l select=[nodes]:bigmem=true | Specify that your job should use the large memory nodes. There are 374 large memory nodes, each with 128GB RAM (standard memory nodes have 64GB RAM). The large memory nodes are subject to the same queue configurations as the standard memory nodes. |
| -l walltime=[hh:mm:ss] | Specify the maximum wall clock time required for your job. The line "#PBS -l walltime=00:20:00" requests twenty minutes. If your job exceeds the requested wall time, it will be terminated by PBS. It is advisable to ask for a slightly longer period than you expect your job to take; however, to achieve a better turnaround time for short jobs, and to get hanging jobs terminated before they consume excessive amounts of time, it is typically best to keep this extra time reasonably small. Job accounting is done after your job has finished, and your project is charged based on the time that your job actually used; the requested wall time is not used for accounting purposes. |
| -A [project or budget code] | Specify the budget your job is going to be charged to. Please contact the principal investigator (PI) or a project manager (PM) of your project for details on the budget you should be using. You need to replace the string "[budget code]" with the string appropriate for your project. |
| -V | Specify that all environment variables, shell functions, aliases etc. that are active in the terminal session where you issue the qsub command are exported to your job. This means, for example, that the values of environment variables that you have set before submitting your job will be recognised inside your job script. However, if you have a .bashrc file in your home directory that sets any of the same environment variables, aliases or functions that you are attempting to pass to your job using the -V option, the values in .bashrc will take precedence in your job's environment. |
The following PBS options are not supported on ARCHER. Specifying them may lead to your job being unable to run or even the loss of kAU resource without doing any useful work.
| Option | Description |
| --- | --- |
| -l place=[placement scheme] | Specifying a placement scheme will result in your job being unable to run. Do not use this option in your job submission scripts. |
5.4.2 Parallel job launcher aprun
The job launcher for parallel jobs on ARCHER is aprun. The /home filesystem is not accessible from the compute nodes, so you must be in a directory on the /work parallel filesystem in order to use aprun, i.e. your jobs must run from a directory on /work.
A sample MPI job launch line using aprun looks like:
aprun -n 1536 my_mpi_executable.x arg1 arg2
This will start the parallel executable "my_mpi_executable.x" with arguments "arg1" and "arg2". The job will be started using 1536 MPI processes; by default, 24 processes are placed on each compute node, using all of the physical cores available.
The bolt job submission script creation tool will create job submission scripts with the correct settings for the aprun flags for parallel jobs on ARCHER.
The most important aprun flags are:
- -n [total parallel processes]
- Specifies the total number of distributed memory parallel processes (not including shared-memory threads). For jobs that use all physical cores this will usually be a multiple of 24. The default on ARCHER is 1.
- -N [parallel processes per node]
- Specifies the number of distributed memory parallel processes per node. There is a choice of 1-24 for physical cores on ARCHER compute nodes (1-48 if you are using HyperThreading: " -j 2 ", see below). As you are charged per node on ARCHER the most economic choice is always to run with "fully-packed" nodes on all physical cores if possible, i.e. -N 24 . Running "unpacked" or "underpopulated" (i.e. not using all the physical cores on a node) is useful if you need large amounts of memory per parallel process or you are using more than one shared-memory thread per parallel process. The default on ARCHER is 24 (i.e. use all physical cores).
- -d [threads per parallel process]
- Specifies the number of cores for each parallel process to use for shared-memory threading. (This is in addition to the OMP_NUM_THREADS environment variable if you are using OpenMP for your shared memory programming.) The default on ARCHER is 1.
- -S [parallel processes per NUMA region]
- Specifies the number of distributed memory parallel processes to place on each NUMA region (each ARCHER compute node is composed of 2 NUMA regions with 12 physical cores each). For mixed-mode jobs (for example, using MPI and OpenMP) it is typically desirable to use this flag to place parallel processes on separate NUMA regions so that shared-memory threads in the same team access the same local memory. The default on ARCHER is the smaller of 12 (physical cores per NUMA region) and the total parallel processes specified by the -n option.
- -j [hyperthreads]
- Specifies the number of Intel HyperThreads to use for each physical core. Valid values for this are 0, 1 or 2. 0 indicates that all available HyperThreads should be used and hence is equivalent to 2 on ARCHER. The default is 1 and this should give the best performance for most codes on ARCHER.
Please use man aprun and aprun -h to query further options.
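As a sketch of how the -n and -N flags relate to the node count you request from PBS, the arithmetic can be checked in the shell (the numbers here are illustrative, not a recommendation):

```shell
#!/bin/bash
# Rough sketch (not an official ARCHER tool): derive the "-l select="
# node count from the aprun -n and -N values, rounding up so that a
# partially filled last node is still requested.
NPROC=3072      # total parallel processes (aprun -n)
PER_NODE=24     # processes per node (aprun -N)

NODES=$(( (NPROC + PER_NODE - 1) / PER_NODE ))
echo "#PBS -l select=${NODES}"
echo "aprun -n ${NPROC} -N ${PER_NODE} ./my_prog.x"
```

For fully packed nodes (NPROC a multiple of 24) the rounding has no effect; for unpacked jobs it ensures the last, partially used node is still requested.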
5.4.3 Task affinity for "unpacked" jobs
If you are running unpacked or hybrid distributed-/shared-memory jobs then the placement of parallel processes and threads onto cores within a node can have a large effect on the performance of your code.
The placement of processes and threads is controlled by aprun options. Examples of these options are provided in the sections below.
The ARCHER Best Practice Guide has a detailed discussion of the issues surrounding process/thread placement and how to control these using the aprun command.
Important: you have to change into a subdirectory of /work (your workspace) before calling aprun, i.e. within your job script. If your submission directory (where you issued the qsub command) is part of /work, you can use the environment variable $PBS_O_WORKDIR, as shown in the example scripts below, to change into the required directory.
5.4.4 Example: job submission script for MPI parallel job
A simple MPI job submission script to submit a job using 64 compute nodes (maximum of 1536 physical cores) for 20 minutes would look like:
#!/bin/bash --login

# PBS job options (name, compute nodes (each node has 24 cores), job time)
# PBS -N is the job name (e.g. Example_MPI_Job)
#PBS -N Example_MPI_Job
# PBS -l select is the number of nodes requested (e.g. 64 nodes=1536 cores)
#PBS -l select=64
# PBS -l walltime, maximum walltime allowed (e.g. 20 minutes)
#PBS -l walltime=00:20:00
# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 1
# This prevents any system libraries from automatically
# using threading.
export OMP_NUM_THREADS=1

# Launch the parallel job
# Using 1536 MPI processes and 24 MPI processes per node
aprun -n 1536 ./my_mpi_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt
This will run your executable "my_mpi_executable.x" in parallel on 1536 MPI processes. PBS will allocate 64 nodes to your job and place 24 MPI processes on each node (one per physical core).
See above for a detailed discussion of the different PBS options.
5.4.5 Example: job submission script for MPI parallel job on large memory nodes
A simple MPI job submission script to submit a job using 64 large memory compute nodes (with 128 GB of memory per node) for 20 minutes would look like:
#!/bin/bash --login

# PBS job options (name, compute nodes (each node has 24 cores), job time)
# PBS -N is the job name (e.g. Example_MPI_Job)
#PBS -N Example_MPI_Job
# PBS -l select is the number of nodes and type requested (e.g. 64 big memory nodes=1536 cores)
#PBS -l select=64:bigmem=true
# PBS -l walltime, maximum walltime allowed (e.g. 20 minutes)
#PBS -l walltime=00:20:00
# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 1
# This prevents any system libraries from automatically
# using threading.
export OMP_NUM_THREADS=1

# Launch the parallel job
# Using 1536 MPI processes and 24 MPI processes per node
aprun -n 1536 ./my_mpi_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt
This will run your executable "my_mpi_executable.x" in parallel on 1536 MPI processes. PBS will allocate 64 large memory nodes to your job and place 24 MPI processes on each node (one per physical core).
See above for a detailed discussion of the different PBS options.
5.4.6 Example: job submission script for OpenMP parallel job
The following example job submission script uses a single node to run an OpenMP code with 12 threads for 12 hours.
Although ARCHER compute nodes contain 24 physical cores, it is not advisable to run OpenMP jobs that span multiple NUMA regions - hence this example uses 12 threads (the maximum for a single NUMA region) rather than the full 24 (or even 48 with the aprun option "-j 2" and HyperThreads).
If your code was compiled with the Intel compiler, you should add export KMP_AFFINITY=disabled to your batch script, and use the aprun option " -cc none " or " -cc numa_node " .
#!/bin/bash --login

# PBS job options (name, compute nodes, job time)
# PBS -N is the job name (e.g. Example_OMP_Job)
#PBS -N Example_OMP_Job
# PBS -l select is the number of nodes requested (e.g. 1 node=24 cores)
#PBS -l select=1
# PBS -l walltime, maximum walltime allowed (e.g. 12 hours)
#PBS -l walltime=12:0:0
# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 12
export OMP_NUM_THREADS=12

# Launch the parallel job
# Using 1 process and 12 OpenMP threads
aprun -n 1 -d 12 ./my_openmp_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt
5.4.7 Example: job submission script for MPI+OpenMP (mixed mode) parallel job
Mixed mode codes that use both MPI (or another distributed memory parallel model) and OpenMP should take care to ensure that the shared memory portion of the process/thread placement does not span more than one NUMA region. This means that the number of shared memory threads should be a factor of 12.
As with OpenMP codes, if your MPI+OpenMP code was compiled with the Intel compiler, you should add export KMP_AFFINITY=disabled to your batch script, and use the aprun option "-cc none" or "-cc numa_node".
In the example below, we are using 128 nodes (3072 physical cores) for 6 hours. There are 6 MPI processes per node and 4 OpenMP threads per MPI process.
#!/bin/bash --login

# PBS job options (name, compute nodes, job time)
# PBS -N is the job name (e.g. Example_MixedMode_Job)
#PBS -N Example_MixedMode_Job
# PBS -l select is the number of nodes requested (e.g. 128 nodes=3072 cores)
#PBS -l select=128
# PBS -l walltime, maximum walltime allowed (e.g. 6 hours)
#PBS -l walltime=6:0:0
# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

# Set the number of threads to 4
# There are 4 OpenMP threads per MPI process
export OMP_NUM_THREADS=4

# Launch the parallel job
# Using 128*6 = 768 MPI processes
# 6 MPI processes per node
# 3 MPI processes per NUMA region
# 4 OpenMP threads per MPI process
aprun -n 768 -N 6 -S 3 -d 4 ./my_mixed_executable.x arg1 arg2 > my_stdout.txt 2> my_stderr.txt
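As a rough sketch (the variable names here are our own, not PBS or aprun settings), the arithmetic behind the aprun line in the example above can be checked in the shell:

```shell
#!/bin/bash
# Sketch of the arithmetic behind a mixed-mode aprun line, using the
# figures from the example above; variable names are illustrative.
NODES=128
MPI_PER_NODE=6       # aprun -N
THREADS=4            # aprun -d (and OMP_NUM_THREADS)
NUMA_PER_NODE=2      # each ARCHER node has two NUMA regions
CORES_PER_NODE=24

TOTAL_MPI=$(( NODES * MPI_PER_NODE ))             # aprun -n
MPI_PER_NUMA=$(( MPI_PER_NODE / NUMA_PER_NODE ))  # aprun -S

# Check the layout does not oversubscribe the physical cores
if [ $(( MPI_PER_NODE * THREADS )) -le "$CORES_PER_NODE" ]; then
    echo "aprun -n ${TOTAL_MPI} -N ${MPI_PER_NODE} -S ${MPI_PER_NUMA} -d ${THREADS}"
fi
```

Note that 6 processes x 4 threads = 24 uses every physical core on the node, and 3 processes per NUMA region keeps each team of threads within one NUMA region.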
If you are performing mixed mode (hybrid) simulations with MPI communications from threads (i.e. you are using MPI_Init_thread in your MPI code rather than MPI_Init) then you will need to use the MPICH_MAX_THREAD_SAFETY environment variable to specify what level of thread support you require from the MPI library:
This environment variable should be set to single , funneled , serialized or multiple :
- single (MPI_THREAD_SINGLE): Only one thread will execute
- funneled (MPI_THREAD_FUNNELED): The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread)
- serialized (MPI_THREAD_SERIALIZED): The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are serialized)
- multiple (MPI_THREAD_MULTIPLE): Multiple threads may call MPI with no restrictions
If MPICH_MAX_THREAD_SAFETY is unset then the value single is assumed.
Note: the MPI_THREAD_MULTIPLE thread safety implementation is not a high-performance implementation. Specifying MPI_THREAD_MULTIPLE can be expected to produce performance degradation as multiple thread safety uses a global lock.
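An illustrative fragment for a batch script (the level chosen here is just an example; request only the minimum level your code actually needs, given the cost of "multiple" noted above):

```shell
# Illustrative only: set the MPI thread-support level before aprun.
# "funneled" suits codes where only the main thread makes MPI calls.
export MPICH_MAX_THREAD_SAFETY=funneled
echo "MPICH_MAX_THREAD_SAFETY=${MPICH_MAX_THREAD_SAFETY}"
```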
5.4.8 Interactive jobs
The nature of the job submission system on ARCHER does not lend itself to developing or debugging code as the queues are primarily set up for production jobs.
When you are developing or debugging code you often want to run many short jobs with a small amount of editing the code between runs. One of the best ways to achieve this on ARCHER is to use interactive jobs. An interactive job allows you to issue 'aprun' commands directly from the command line without using a job submission script, and to see the output from your program directly in the terminal.
The following screencast demonstrates starting an interactive job and running a parallel program on the compute nodes from within the job.
To submit a request for an interactive job reserving 8 nodes (192 cores) for 1 hour you would issue the following qsub command from the command line:
qsub -IVl select=8,walltime=1:0:0 -A [project code]
When you submit this job your terminal will display something like:
qsub: waiting for job 492383.sdb to start
It may take some time for your interactive job to start. Once it runs you will enter a standard interactive terminal session. Whilst the interactive session lasts you will be able to run parallel jobs by issuing the 'aprun' command directly at your command prompt using the same syntax as you would inside a job script. The maximum number of nodes you can use is limited by the value of select you specify when you submit a request for the interactive job. You can only request an interactive job from a directory on the /work filesystem.
To reduce the amount of time spent waiting for your interactive job to start you may find it useful to use the short queue, though this has restrictions on job length and size. Alternatively if you know you will be doing a lot of intensive debugging you may find it useful to request an interactive session lasting the expected length of your working session, say a full day.
To take maximum advantage of an interactive session submitted to the short queue (longest job length 20 minutes) it can be useful to set up an email alert so that the batch system mails you as soon as your interactive session starts. This can be achieved by using the -m and -M options with qsub when you request your interactive job as follows:
qsub -IVl select=1,walltime=0:20:0 -q short -A [project code] -m b -M [email address]
This should make it easier to do other tasks away from the terminal yet still be ready to use the interactive session as soon as it is available.
Please be aware that any command not prepended with aprun will be running directly on a job launcher node, rather than on a compute node. As the job launcher nodes are a shared resource for all users, you are requested not to run any intensive computations without prepending the command with aprun in order to execute it on the compute node(s) you've reserved for the job. The same applies for commands within job scripts submitted to the batch system.
When using X-forwarding whilst working on the ARCHER login nodes, it is possible to enable further X-forwarding from the parallel nodes being used in an interactive job. To do this, simply add the -X flag to the qsub command, as shown below:
qsub -IVl select=8,walltime=1:0:0 -A [project code] -X
5.5 Array Jobs
It is possible to run array style jobs on ARCHER using PBS. An array style job involves running multiple jobs at once using the same submission script. Each job is subject to the same resource restrictions (i.e. number of nodes and wall time limits), and must use the same execution script, but can vary the executable run or files processed based on a job id provided to each running array job. Below is a simple example of an array job for ARCHER that would run a different executable (named my_executable_1 up to my_executable_10) each on their own node for a maximum of five minutes:
The number of array elements has been capped at 32 per job. This still allows any user to submit 512 jobs at a time.
#!/bin/bash

# Replace [budget code] below with your project code (e.g. t01)
#PBS -A [budget code]
#PBS -l select=1
#PBS -l walltime=00:05:00
#PBS -J 1-10
#PBS -r y
#PBS -N ExampleArray

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

aprun -n 24 ./my_executable_$PBS_ARRAY_INDEX \
    > my_stdout_$PBS_ARRAY_INDEX.txt \
    2> my_stderr_$PBS_ARRAY_INDEX.txt
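To see how $PBS_ARRAY_INDEX drives the per-element behaviour, the mechanism can be sketched locally by simulating a few indices instead of letting PBS set the variable (the executable names are the hypothetical ones from the example):

```shell
#!/bin/bash
# Sketch: PBS runs the same script once per array element, with
# $PBS_ARRAY_INDEX set differently each time. Here we simulate three
# indices locally rather than submitting to PBS.
for PBS_ARRAY_INDEX in 1 2 3; do
    # Each element selects its own executable and output files by index
    echo "element ${PBS_ARRAY_INDEX}: would run ./my_executable_${PBS_ARRAY_INDEX}"
done
```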
5.6 Sharing Nodes with OpenMP/Threaded Jobs
For most of the jobs people run on ARCHER, the desired behaviour for node allocation is that only one job at a time has access to any given compute node. However, sometimes people may wish to run shared-memory jobs (such as OpenMP programs) that do not utilise all of the available cores on a node. In that circumstance it can be useful to be able to run another job on the same node to use the cores that have been left inactive by the first job. By default it is not possible to do this using the queuing system and the aprun command on ARCHER. However, we have developed a small application/set of scripts that allows users to run multiple programs on a single node. Note, this cannot be done with MPI (or any other distributed-memory programming model) programs, only shared-memory programs (such as those that use OpenMP or POSIX threads).
The utility needed to run multiple jobs on a single node can be downloaded here. Simply download this archive, copy it onto ARCHER and unpack it (it can be unpacked using this command: tar zxvf JobWrapperUtility.tar.gz ). It will create a directory called JobWrapper , which includes a README file with instructions on how to use the utility.
The utility is currently only set up to run two programs on a node, one on each of the processors. If you need to run a different number of applications on a node, or have other requirements to vary how programs run, please get in touch with the helpdesk and we can provide different versions of this utility for you.
5.7 Python Task Farm
There is a utility available on ARCHER for running serial python programs as a task farm (a task farm is a mechanism for running multiple copies of a program on a parallel system). The utility, called ptf , is available as a module on ARCHER (accessed using module load ptf ). A readme file and example submission script is available in the ptf module (for the location of the module files use the module show ptf command).
ptf takes a file containing a python program and runs it as a task farm on as many cores as you want on ARCHER. The name of the file containing the python program must be provided as the first argument to the ptf executable when it is run. ptf will pass any additional command line arguments through to the python program. It also supplies one extra command line argument, passed last: the id (rebased to 1) of the MPI process running that instance, for python programs that require a unique id for each running instance.
The utility is currently only set up to run a separate instance of the python program on each core requested when the job is run. There are also, currently, some restrictions on the size and complexity of python program it can execute. If you need to assign different numbers of cores to each python program, or have other requirements to vary how the python programs run, please get in touch with the helpdesk and we can work on providing different versions of this utility for you.
5.8 Scheduling System Layout and Limits
The scheduling system is laid out so that all you need to do is request the number of nodes you need and the time for your job. The scheduling system will then schedule the jobs to ensure fair access.
The current limits are:
- Regular Jobs (no queue specified uses standard queue): 1 minute to 24 hours; 1-4,920 nodes (24-118,080 processing cores with fully-populated nodes). The maximum number of running jobs per user varies to maximise system usage; the default is 12 running jobs per user, with a maximum of 16 in the queue at any one time.
- Long Jobs (-q long): 1 minute to 48 hours; 1-256 nodes (24-6,144 processing cores with fully-populated nodes). Maximum of 2 jobs running per user, maximum of 16 in the queue at any one time.
- Short Jobs (-q short): 1-20 minutes, maximum of 8 nodes, 1 job (waiting or running per user), only available 0800-2000 UK time Mon-Fri.
- Low Priority Jobs (-q low): 1 minute to 3 hours, 1-512 nodes (24-12,288 processing cores with fully-populated nodes), maximum of 3 in the queue per user, only one of which can be running at any time. Only enabled when the backlog of work on the system drops below a certain level.
- Large memory node Jobs (bigmem=true): maximum walltime 48 hours; maximum 376 nodes. Only available 1730-0900 UK time Mon-Fri and all day Sat and Sun.
- Serial Jobs (-l select=serial=true:ncpus=1): 1 minute to 24 hours, maximum of 12 jobs in the queue per user, only 6 of which can be running at any time.
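For reference, the resource requests in the list above map onto PBS directives in a job script header along these lines. The node count, walltime and project code below are placeholders, not recommended values:

```shell
#PBS -l select=64        # number of nodes requested
#PBS -l walltime=6:0:0   # wall clock limit, within the chosen queue's range
#PBS -q long             # optional: choose a non-default queue (long, short, low)
#PBS -A [project code]   # budget to charge
```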
5.8.1 Scheduling System Priorities and Logic
Principally, the system attempts to place the largest jobs possible within the available space, and then tile the remaining nodes with the largest jobs that will fit, until the system is full. Obviously, small jobs are easier to place than large ones - but the system will attempt, where possible, to place the largest it can. There is also a degree of additional scheduling in place to try to prevent jobs from ageing too much in the queue.
The system will deploy the backfill scheduler to try to minimise the time that a large job has to wait for resources, and this can mean that nodes appear to be free when they are actually being reserved in advance (you can use the command qstat -wT to get a snapshot and a general idea of which jobs have been scheduled to run, and when). Under those circumstances, shorter jobs may have an enhanced chance of being released, since they might be able to run and terminate before the large job being backfilled for is scheduled to run.
The system is configured to try and maximise the chances of large jobs running - but it is also true that very small jobs will also have a high likelihood of getting in to fill a small gap.
5.9 checkScript: Job submission script validation tool
The checkScript tool has been written to allow users to validate their job submission scripts before submitting their jobs. The tool will read your job submission script and try to identify errors, problems or inconsistencies.
Note that the tool currently only validates parallel job submission scripts. Serial and low priority jobs are not included.
An example of the sort of output the tool can give would be:
user@eslogin008:/work/x01/x01/user> module add epcc-tools
user@eslogin008:/work/x01/x01/user> checkScript submit.pbs

===========================================================================
checkScript
---------------------------------------------------------------------------
Copyright 2011-2013 EPCC, The University of Edinburgh
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions.
===========================================================================

Using z01 budget. Remaining kAUs = 23221.006

Script details
---------------
User:        user
Script file: submit.pbs
Directory:   /fs3/x01/x01/user (ok)
Job name:    my_job (ok)

Requested resources
-------------------
nodes    = 64 (ok)
walltime = 6:0:0 (ok)
budget   = z01 (ok)

kAU Usage Estimate (if full job time used)
------------------------------------------
kAUs = 36.864

checkScript finished: 0 warning(s) and 0 error(s).
5.10 Setting a time limit for aprun
If you want to leave time at the end of the job to do tidying up, e.g., copying output files, you can limit the time for the aprun command using the leave_time utility. The usage is
leave_time <time needed to tidy up, in seconds> <program and its arguments>
module load leave_time
leave_time 60 aprun -n 24 test.exe
will leave 60 seconds at the end of the job for tidying up.
Note that the leave_time module must be loaded in the batch script.
leave_time sends SIGTERM to the program to terminate it. This can be trapped in the program to do tidying up within the program, e.g., to write out a restart file.
If the time limit is reached, the exit status will be 124 (even if SIGTERM is trapped by the program), otherwise the exit status will be the exit status of the program.
leave_time uses the GNU timeout command.
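Because leave_time is built on GNU timeout, the exit-status behaviour described above can be seen with timeout directly. In this minimal sketch, sleep stands in for a long-running aprun command:

```shell
# GNU timeout kills the command after 1 second; 'sleep 3' stands in
# for an aprun command that has not finished within the limit.
timeout 1 sleep 3
status=$?

# Exit status 124 signals that the time limit was reached,
# so this is the point to do tidy-up work (e.g. copy output files).
if [ "$status" -eq 124 ]; then
    echo "time limit reached - tidying up"
fi
```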
5.11 Low Priority Access
Low priority access is available to all projects except Instant Access ones.
Low priority jobs are not charged against your allocation, although you do require a valid budget in your job script to allow the job to run.
Jobs can range from 1-512 nodes (24-12,288 cores) and can have a maximum walltime of 3 hours. Only 1 low priority job per user can be run at any one time and only 3 jobs can be queued by any one user.
You submit low priority jobs to the queue "low" on the system. For example, if your job submission script is called "submit.pbs" you would use the command:
qsub -q low submit.pbs
to submit a low priority job.
The low priority access queue will be opened when the backlog in the queue system drops below 3 hours.
5.12 Long Queue Access
Long Jobs can run for a maximum of 48 hours. There are two ways to run a long job on ARCHER, the first of which is to submit the job to the "long" queue on the system. For example, if your job submission script is called "submit.pbs" you would use the command:
qsub -q long submit.pbs
to submit a long job.
The second way to make use of the "long" queue on the system is to specify the queue in your submission script, as follows:
#PBS -q long
There is a maximum of 2 long jobs running per user at any one time.
Note: Jobs requiring less than 24 hours will not be accepted onto the long queue as these can run on the standard queue.
5.13 Short (Debug) Queue Access
Jobs can range from 1-8 nodes (24-192 cores) and can have a maximum walltime of 20 minutes. The queue is only enabled between the hours of 0800-2000 UK time, Mon-Fri.
You can submit debug jobs to the queue short on the system. For example, if your job submission script is called "submit.pbs" you would use the command:
qsub -q short submit.pbs
to submit a short job.
There is a maximum of 1 job running per user in the short queue.
5.14 Reservations
Reservations are available on ARCHER. These allow users to reserve a number of nodes for a specified length of time starting at a particular time on the system.
Reservations require justification. They will only be approved if the request could not be fulfilled with the standard queues. Possible uses for a reservation would be:
- An exceptional job requires longer than 48 hours runtime.
- You require a job/jobs to run at a particular time e.g. for a demonstration or course.
Note: Reservation requests must be submitted at least 60 hours in advance of the reservation start time. If requesting a reservation for a Monday at 18:00, please ensure the request is received by 12:00 on the Friday at the latest. The same applies over Service Holidays.
Note: Reservations are only valid for standard compute nodes; high memory compute nodes and PP nodes cannot be included in reservations.
Reservations will be charged at 1.5 times the usual AU rate and you will be charged the full rate for the entire reservation at the time of booking, whether or not you use the nodes for the full time. In addition, you will not be refunded the AUs if you fail to use them due to a job crash unless this crash is due to a system failure.
To request a reservation please use the form on your main SAFE page. You need to provide the following:
- the start time and date of the reservation;
- the end time and date of the reservation;
- the project code for the reservation;
- the number of nodes required;
- your justification for the reservation (this must be provided or the request will be rejected).
Your request will be checked by the Helpdesk and, if approved, you will be provided with a reservation ID which can be used on the system. You submit jobs to a reservation using the qsub command in the following way:
qsub -q <reservation ID> <job submission script>
5.15 Postprocessing/Serial Jobs
The postprocessing (PP) nodes on the ARCHER facility are designed for large compilations, post-calculation analysis and data manipulation. They should be used for jobs which do not require parallel processing but which would have an adverse impact on the operation of the login nodes if they were run interactively.
Example uses include: compressing large data files, visualising large datasets, large compilations and transferring large amounts of data off the system.
The PP nodes can be accessed in two ways:
- Via the serial queues: this is described below.
- Via direct interactive access: as described in the Interactive access to Post Processing nodes section of this User Guide.
5.15.1 Example Postprocessing Job Submission Script
#!/bin/bash --login
#
#PBS -l select=serial=true:ncpus=1
#PBS -l walltime=00:20:00
#PBS -A [project code]

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

gzip output.dat
The following PBS options are used in postprocessing job submission scripts:
- The option -l select=serial=true:ncpus=1 must be used to specify that this is a serial job and that you want to use a single CPU.
- The -l walltime=hh:mm:ss must be used to specify the calculation time limit. Maximum is 24 hours.
- The -A [project code] must be a valid budget but no charges will be applied for time used in the serial queue.
Please note you cannot run a serial job on the large memory (bigmem) compute nodes i.e. select=serial and select=[nodes]:bigmem cannot both be true.
5.15.2 Interactive Postprocessing Job
Direct interactive access to the PP nodes is also available. This means that you do not need to submit a job to access the PP nodes interactively. See Interactive access to Post Processing nodes section of this User Guide.
Postprocessing interactive jobs run in much the same way as the parallel interactive jobs described above except that you need to set 'select=serial=true' to run jobs on the PP nodes. For example, to submit a 1-hour interactive postprocessing job you would use:
qsub -IVl select=serial=true:ncpus=1,walltime=1:0:0 -A budget
(Remember to replace 'budget' with your budget code.) When you submit this job your terminal will display something like:
qsub: waiting for job 492383.sdb to start
and once the job begins running you will be returned to a standard Linux command line from which you can run your commands.
When using X while logged into the ARCHER login nodes, it is also possible to enable X-forwarding from the serial nodes. To do this simply add the -X flag to the qsub command as shown below:
qsub -IVl select=serial=true:ncpus=1,walltime=1:0:0 -A budget -X
When the job runs, you will be able to launch applications with a GUI and the interface will appear on your local machine.
5.16 OOM (Out of Memory) Error Messages
Applications that attempt to access more memory than is available on a node (64GB for normal nodes, 128GB for high-memory nodes) will abort producing an error similar to the following:
OOM killer terminated this process.
If this happens to your code, you will need to run it using more nodes. There are two ways to do this:
- If your application can be easily scaled over more processors, increase the total number of processors with the same dataset (e.g. use 48 processors over two nodes if your application initially used 24 processors on one node).
- If it is difficult or undesirable to run your application with more processors, then you can still increase the number of nodes without changing the number of processors i.e. use fewer processors per node. (Note: you will still be charged as if you had used the full node.)
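As an illustration of the second approach, the sketch below (the executable name and counts are placeholders for a job that originally ran 24 processes on one node) doubles the node count while keeping the same number of processes, using aprun's -N flag to place fewer processes per node and so roughly double the memory available to each process:

```shell
# Request 2 nodes instead of 1 in the job script header:
#PBS -l select=2

# Keep 24 MPI processes but place only 12 per node (-N 12);
# each process now shares a node's memory with half as many others.
aprun -n 24 -N 12 ./my_app.exe
```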