
      The following limits apply generally to all MSU users of the HPCC. Those at affiliate institutions may be working under slightly different policies. The limits are in place to help our large user community share the HPCC. However, if these policies are an impediment to completing your research, please contact us.

CPU and GPU Time Limits

For 2021, the GPU time limit is calculated as 10,000 hours in addition to any usage before the limits were introduced on February 17th.

  • HPCC users who do not have a buy-in account are given a 'general' SLURM account. Starting in 2021, the general account is limited to 500,000 CPU hours (30,000,000 minutes) and 10,000 GPU hours (600,000 minutes) per calendar year (January 1st to December 31st).
  • There is no CPU or GPU hour limit if jobs use a buy-in account. If you have a buy-in account, your jobs will run under that account by default, unless the manager of the buy-in account has chosen opt-in (where jobs use the account only when submitted with the -A flag) instead of opt-out.
  • Users with a general account can use the powertools command SLURMUsage to check their used CPU and GPU time (in minutes) and their remaining CPU and GPU time (in hours):

    $ ml powertools               # run this command if powertools not loaded
    $ SLURMUsage
     Account             Resource  Usage(m)  Left CPU(h)  UserName
     general             CPU Time        0    500000.00   MSUNetID
                         GPU Time        0     10000.00   MSUNetID

    where the CPU and GPU usage of the user's general account is shown.

  • If users without a buy-in account need more CPU or GPU time because they have reached the limits, they can request additional CPU/GPU hours by filling out the CPU/GPU Increase Request online form.
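
The hour and minute figures above are related by a factor of 60, and SLURMUsage reports usage in minutes but remaining time in hours. The following is a minimal sketch of that arithmetic, not part of powertools; the function names `hours_to_minutes` and `remaining_hours` are hypothetical, and the result is rounded down to whole hours.

```shell
# Yearly limits of the 'general' account, in hours (taken from the policy above).
CPU_LIMIT_HOURS=500000
GPU_LIMIT_HOURS=10000

# Convert a number of hours to minutes.
hours_to_minutes() {
    echo $(( $1 * 60 ))
}

# Remaining whole hours, given a limit in hours ($1) and usage in minutes ($2).
# Hypothetical helper for illustration only; integer division rounds down.
remaining_hours() {
    echo $(( ($1 * 60 - $2) / 60 ))
}
```

For example, `hours_to_minutes "$CPU_LIMIT_HOURS"` prints 30000000, matching the CPU limit in minutes quoted above, and `remaining_hours "$GPU_LIMIT_HOURS" 90` prints 9998.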

Queue Limits

  • Time: Jobs can be scheduled to run for at most 7 days (168 hours) (--time=168:00:00).
  • CPU: Users can have up to a total of 1040 cores in use or 520 jobs running at any one time. (Buy-in groups that have purchased more than 1040 cores can exceed this limit.)
  • Queue: Each user can have at most 1000 jobs queued.
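
As a rough illustration of the per-user limits listed above, the sketch below compares three counts against them. The function name `check_queue_limits` and its arguments are hypothetical, the counts must be supplied by the caller, and the buy-in exception to the core limit is not modeled.

```shell
# Per-user limits from the list above.
MAX_CORES=1040
MAX_RUNNING_JOBS=520
MAX_QUEUED_JOBS=1000

# Hypothetical helper: $1 = cores in use, $2 = running jobs, $3 = queued jobs.
# Prints "within limits" or a comma-separated list of exceeded limits.
check_queue_limits() {
    msg=""
    if [ "$1" -gt "$MAX_CORES" ]; then msg="over core limit"; fi
    if [ "$2" -gt "$MAX_RUNNING_JOBS" ]; then msg="${msg:+$msg, }over running-job limit"; fi
    if [ "$3" -gt "$MAX_QUEUED_JOBS" ]; then msg="${msg:+$msg, }over queued-job limit"; fi
    echo "${msg:-within limits}"
}
```

Calling `check_queue_limits 1040 520 1000` reports that all three values are still within the limits, since each limit is inclusive.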

Policy Summary

  • Jobs that request 4 hours or less are able to run on the largest set of nodes (the combination of community, specialized-hardware, and buy-in nodes; see below for details).
  • Jobs that request more resources (processors or RAM) have priority over smaller jobs because they are more difficult to schedule.
  • Jobs accrue priority based on how long they have been queued.
  • The scheduler will attempt to balance usage among all users. (See Fairshare Policy below.)
  • It is against our fair use policy to artificially increase the priority of a job in the queue (e.g. by requesting more resources which will not be used). Jobs found to be manipulating the scheduler will be canceled, and users continuing to attempt this will be suspended.

      For more information, please see Job Priority.

Buy-in program

      Faculty can purchase nodes via our buy-in program. The program guarantees that jobs submitted under a buy-in group will start running on that group's buy-in nodes within 4 hours. However, because jobs from the same buy-in group compete with each other, the guarantee might not be met if the requested resources are occupied or reserved by other jobs of that group.

Shorter jobs can run on more nodes

      Jobs that request a total running (wall-clock) time of four hours or less can run on any available buy-in and specialized nodes. Because they can access more nodes, they are likely to start sooner than jobs that have to wait for nodes in the general-long partition.
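
To make the four-hour threshold concrete, the sketch below classifies a requested wall-clock time against it and against the 7-day (168-hour) maximum. The function name `classify_walltime` and the labels it prints are hypothetical, and it assumes the request is written as HH:MM:SS; other time formats accepted by SLURM are not handled.

```shell
# Hypothetical helper: classify a --time value given as HH:MM:SS.
#   "short"      : four hours or less
#   "long"       : more than four hours, up to 168 hours
#   "over limit" : more than 168 hours
classify_walltime() {
    echo "$1" | awk -F: '{
        secs = $1 * 3600 + $2 * 60 + $3
        if (secs <= 4 * 3600) { print "short" }
        else if (secs <= 168 * 3600) { print "long" }
        else { print "over limit" }
    }'
}
```

Under these assumptions, `classify_walltime 04:00:00` prints short, while `classify_walltime 04:00:01` prints long.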

Bigger jobs are prioritized & Small jobs are backfilled

      The scheduler attempts to gather resources for large jobs and then backfill smaller jobs around them. The size of the job is determined by the number of CPUs and amount of memory requested.

      The scheduler packs small jobs together to allow more resources to be gathered for multi-core jobs.  Resource requests are monitored. Abusive resource requests may violate MSU policy.

Queue Time

      As jobs wait in the queue, they accrue priority to run.  This is in addition to other job priority factors.

Fairshare Policy

      The scheduler attempts to ensure fair resource utilization among all HPCC users by adjusting the initial priorities of users who have recently used HPCC resources. Under this policy, users who recently ran jobs that used many resources may find that their pending jobs wait longer than before. Users can find the FAIRSHARE contribution to a job's priority by running the command "sprio -u $USER":

[UserID@dev-intel18 UserID]$ sprio -u $USER
          JOBID PARTITION     USER   PRIORITY       SITE        AGE  FAIRSHARE        QOS                 TRES
       53381467 general-l   UserID      49432          0          0      49318          0       cpu=100,mem=15
       53381467 general-s   UserID      49432          0          0      49318          0       cpu=100,mem=15

where it is found under the FAIRSHARE column, with values between 0 and 60,000. The more resources your jobs have used recently, the lower the value becomes, and the lower the priority of your jobs. For the other contributions in the sprio output, please see Job Priority Factors.
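
To read the FAIRSHARE value relative to its maximum, one option is to express it as a percentage of 60,000. The sketch below is a hypothetical filter, `fairshare_percent`, that reads sprio output on standard input; it assumes the column layout shown above and finds the FAIRSHARE column by its header name.

```shell
# Hypothetical filter: print job ID, partition, and FAIRSHARE as a percentage
# of the 60,000 maximum. Expects sprio-style output with a header line on stdin.
fairshare_percent() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FAIRSHARE") col = i; next }
         col     { printf "%s %s %.1f%%\n", $1, $2, 100 * $col / 60000 }'
}
```

It could be used as `sprio -u $USER | fairshare_percent`. For the sample output above, a FAIRSHARE value of 49318 corresponds to 82.2% of the maximum.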
