
The High Performance Computing Center (HPCC) @ MSU manages shared computing resources consisting of clusters and development nodes. A queuing system is in place to ensure fair access to these resources; for an explanation of our queuing policies, please see our policy section. HPCC uses a single queue system in which users specify the resources their jobs need, and Torque 5.0.1 is used to manage those resources. Priority access to the cluster can be purchased.

Job Scripts

At HPCC, jobs run from a single queue based on their priority settings, as long as the resources needed to run the job are available. If the resources are not available, the scheduler looks for the next job in the queue that can run without delaying the expected start times of the jobs ahead of it (i.e., backfill mode is enabled). We encourage users to estimate the resources their jobs require (specifically walltime and memory) as accurately as possible to optimize the start time of their jobs.

To submit a job to the cluster, we suggest that users create a job script containing the commands to be run. A sample job script named myjob.sub might include the following commands:

myjob.sub
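A minimal sketch of what such a script might contain, assuming a simple serial program; the module name, program name, email address, and resource values below are illustrative only:

#!/bin/bash --login

#PBS -l nodes=1:ppn=1,walltime=01:00:00,mem=2gb
#PBS -N myjob
#PBS -j oe
#PBS -m abe
#PBS -M username@msu.edu

# change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# load any modules the program needs (module name is illustrative)
module load GNU/4.8.2

# run the program (program name is illustrative)
./myprogram

# show resource usage for this job
qstat -f $PBS_JOBID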

Limitations in submission scripts

Note: All lines starting with #PBS must appear above the first non-comment line in the script. If they appear below the first non-comment line, the scheduler will not read them, leading to unexpected behavior.

A more complete list of qsub options is given in the next section.
Once the job script has been created, the job can be submitted using the qsub command.
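For example, to submit the myjob.sub script shown above:

qsub myjob.sub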

qsub options

Options to qsub that affect the properties of your job:

Key: (tick) Recommended, (warning) Advanced, (error) Not recommended

Each option below is followed by a description and an example.

(warning) -A
This option tells TORQUE to use the specified account (not username) credential. Unless you are an authorized user of the buy-in account, your job will be deleted. Buy-in account users must use this option to reserve nodes on their buy-in machines.
Example: #PBS -A mybuyin

 

-a
This option tells PBS to run the job at the given time; the time is given in the standard Unix format HHMM (24-hour time). Day, month, and year can also be specified in the standard Unix format (see the PBSPro User Manual for more information).
Example: #PBS -a 0615

 

-e / -o
Sets the Error_Path and Output_Path attributes, i.e., the locations of the job's standard error and standard output files. Note that you only need one of these if you use the -j option.
Example: #PBS -e /home/keenandr/myerrorfile

(warning) -I
Declares that the job is to be run "interactively". The job will be queued and scheduled as any PBS batch job, but when executed, the standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running. Adding the -X option forwards X11 windows, which is especially useful if you are using TotalView to debug a multi-node MPI job.
Note: You cannot start an interactive job from the gateway; you must start it from one of the development nodes.
Example: #PBS -I
Example: #PBS -I -X
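In practice, an interactive session with X11 forwarding is usually requested directly on the qsub command line from a development node; the resource values here are illustrative:

qsub -I -X -l nodes=1:ppn=1,walltime=01:00:00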

 

-j
Using the eo option combines STDOUT and STDERR in the file specified by Error_Path; oe combines them in Output_Path.
Example: #PBS -j oe

(tick) -l
Resources to request, separated by commas:

  • nodes=# – the number and/or type of nodes to be reserved for exclusive use by the job. ppn=# specifies the number of processors per node requested (defaults to 1). gpus=# (on Intel16/Laconia) specifies the number of GPUs to use.
  • walltime= – the total run time, in the form HH:MM:SS or DD:HH:MM:SS
  • mem= – the maximum amount of memory needed by the job
  • feature= – the name of the type of compute node and other features related to our cluster configuration
  • (warning) file= – the maximum amount of local disk space needed by the job. This is only needed if your job uses local disk space and the TMPDIR environment variable.

    The following Moab/Torque options are not recommended and not currently used here:
  • (error) cput – CPU time limit
  • (error) pmem – maximum amount of memory needed per processor
  • (error) ncpus – number of CPUs (no longer used; use nodes=n instead)

Example: #PBS -l nodes=4:ppn=1,walltime=01:00:00,mem=2gb
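As a fuller illustration, a single-node request that also asks for a GPU and a specific node type might look like the line below; the feature name intel16 is an assumption, so check the cluster documentation for the valid feature names:

#PBS -l nodes=1:ppn=2:gpus=1,walltime=04:00:00,mem=8gb,feature=intel16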

 

-M
Email address(es) to notify when the job changes state, as specified by -m.
Example: #PBS -M username@msu.edu

 

-m
Specifies when mail is sent about the job:

  • a – send mail when the job is aborted by the batch system
  • b – send mail when the job begins execution
  • e – send mail when the job ends
  • n – do not send mail

Example: #PBS -m abe

 

-N
Names the job.
Example: #PBS -N MySuperComputing

(error) -p
Allows the user to set a job's priority relative to their own other jobs; it does not affect priority relative to other users' jobs.
Example: #PBS -p 200

(warning) -q
Tells the system which queue to use. The HPCC has only one queue, "main"; where the job runs on the main cluster is determined by the resources requested.
Example: #PBS -q main

(warning) -r
Specifies whether a job is rerunnable (y/n) if it is interrupted by a system crash or other failure; by default, PBS will automatically rerun the job script from the beginning if the job is requeued.
Tip: This is very important. If your job does not support checkpointing and your job script does not check before overwriting output, data loss is a real possibility unless you explicitly set your job as non-rerunnable.
Example: #PBS -r n

(warning) -t
Submits an array job with n identical tasks. Each task has the same $PBS_JOBID but a different $PBS_ARRAYID.
Example: #PBS -t 5
Or a range of values: #PBS -t 3-10
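As an illustration, a minimal array-job sketch that uses $PBS_ARRAYID to select a different input file for each task; the program name and input-file naming are hypothetical:

#PBS -t 1-10
#PBS -l nodes=1:ppn=1,walltime=00:30:00,mem=1gb

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# each task processes its own input file (names are hypothetical)
./myprogram input_${PBS_ARRAYID}.dat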

(warning) -V
Passes all current environment variables to the job.
Example: #PBS -V

(warning) -v
Defines additional environment variables for the job.
Example: #PBS -v arg1=phase3,arg2=coalesceData,numpts=50

(tick) -W
Special generic resources, such as software licenses, can be requested using the -W option. This is most commonly used with MATLAB (see Matlab Licenses for more information).
Example: #PBS -W gres:MATLAB

This is a subset of options; consult the PBS manual for more information. You can also type "man qsub" or "man pbs_resources" on the command line.

Scheduling Tips

  • HPCC maintains buy-in nodes, which are able to run non-priority (non-buy-in) jobs that request less than four hours of walltime. If you submit a job requesting less than four hours of walltime, it is likely that the short job will start fairly quickly.
  • HPCC utilizes "backfill" to maximize the utilization of our cluster. For example, if the highest priority job requests 100 cpus but only 40 are available, and the other 60 will become available over the next 37 hours, then the 40 cpus are used to run the next highest priority job that requires ppn < 40 and walltime < 37:00:00.
  • Correctly estimating the amount of memory and walltime that is needed to run your code will enable your job to be scheduled as soon as possible.
  • There is some overhead for system processes, so one should request slightly less memory than the node's total capacity to ensure that a job is not cancelled.

When will my job start?

  • The length of time a job sits in the queue before running (also known as queue time) varies depending on the cluster load and the type of job; such is the nature of shared computing resources. The start time of a job can be estimated by typing showstart <jobid>

    Note: It might take up to 30 seconds for a newly submitted job to be processed.

  • If your job has been in the queue for a while and you would like an analysis of why it has not started, you can use the command checkjob <jobid>

    For example, checkjob may report that none of the available processors satisfy the requirements of the job.

Why was my job killed?

If your job exceeds the walltime or memory it requested, the scheduler will terminate the job and send you an email. Please review the information in the email, or the error file in your working directory, for more details on how to fix these issues. If the error messages indicate a hardware error, please contact the staff by filling out this request form.

Queuing Policies

Queue Limits

  • Users are allowed to run jobs for up to a week (168 hours) in walltime.
  • Users can utilize up to a total of 520 cores at any time.
  • Users can request up to 6TB of memory per job. (note that such a large job might take a while to get scheduled)
  • It is against our fair-use policy to artificially increase your priority in the queue (e.g., by requesting resources that will not be used); accounts that do so will be suspended.

Policy Summary

  • Jobs that run under 4 hours are able to run on the largest set of nodes.
  • Jobs that request more processors or RAM have priority over smaller jobs.
  • Jobs that are queued accrue priority based on how long they have been queued and how much wall-time they request.
  • The scheduler will attempt to balance usage among users.

Shorter jobs can run on more nodes

Jobs that request a total running (wall-clock) time of four hours or less can run on idle buy-in and specialized nodes. Because they can access the largest set of potential nodes, they are likely to start more quickly than jobs that must wait for the smaller pool of general-purpose community nodes.

MSU's buy-in model guarantees that buy-in users will have access to their nodes within four hours. Specialized hardware (large memory nodes, NVIDIA, Phi accelerators) and the nodes dedicated to larger jobs have similar four-hour windows where jobs that do not meet the normal requirements of those nodes can use them. In total, about two thirds of the Intel10, Intel14, and Intel14-XL clusters are able to run four hour jobs of any size when idle.

iCER staff can assist you in restructuring your application to take advantage of these windows, for example by using system-level checkpointing tools such as BLCR or application-level checkpointing.

Bigger jobs are prioritized

The scheduler attempts to gather resources for large jobs and then fill in smaller jobs around them. The size of the job is determined by the number of CPUs and amount of memory requested.

The scheduler also schedules eight or more cores at a time to allow more resources to be gathered for multi-core jobs.

Resource requests are monitored. Abusive resource requests may violate MSU policy.

Queue Time

As jobs wait in the queue, they accrue priority, which allows them to gather resources more effectively. Each user can have up to 15 jobs accruing this additional priority, which is in addition to other priority factors.

The scheduler also provides a priority boost to jobs the longer they have been in the queue relative to the wall-time requested by the job.

User Policy

The scheduler attempts to balance resource utilization based on past consumption by adjusting the priority of users and groups that have used more than the average amount of resources over the past few days.