Blog

Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.

The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.

The GPFS home storage system is currently offline. We are working to identify and resolve the underlying cause of the disruption, and will provide additional information as available. 

Update 3:45 PM: This outage started at about 1:55 PM. We've identified a set of nodes that may be causing this problem and are working to reset them.

Update 4 PM: The system should be fully operational now. We've identified memory exhaustion on four compute nodes as the cause of the problem. Despite existing mechanisms to prevent the overutilization of memory, these nodes were stuck in a state where they did not have sufficient memory to respond to the storage cluster but were still responsive enough to prevent the cluster from automatically recovering without them. We will continue to investigate the cause and work with the storage vendor to address this.

Update 9:07 PM: The scheduler upgrade is complete and the scheduler is back online.

On Thursday, March 4th, at 8:00 PM, the scheduler will go offline to undergo an upgrade to the latest release. The scheduler is expected to come back online before midnight.

This outage will not affect running jobs. However, some other functionality will be affected:

  1. SLURM client commands will be unavailable (squeue, srun, salloc, sbatch, etc.)
  2. New jobs cannot be submitted
  3. Jobs that are already queued will not start
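Scripts that submit jobs automatically may want to wait out a scheduler outage like this one rather than fail. The following is a minimal sketch, not an ICER-provided tool: the function names are hypothetical, and it assumes that `squeue` exits with a non-zero status when the scheduler cannot be reached.

```python
import subprocess
import time


def scheduler_responds(probe_cmd=("squeue",), timeout_s=30):
    """Return True if the probe command exits with status 0.

    Assumption: the probe (squeue by default) returns a non-zero exit
    status while the scheduler is offline.
    """
    try:
        result = subprocess.run(
            list(probe_cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing or hung: treat the scheduler as unavailable.
        return False
    return result.returncode == 0


def wait_for_scheduler(probe_cmd=("squeue",), attempts=5, delay_s=60):
    """Poll until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if scheduler_responds(probe_cmd):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A submission script could call `wait_for_scheduler()` first and only run `sbatch` once it returns True. The attempt count and delay are illustrative defaults.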

If you have any questions about this outage, please contact us at https://contact.icer.msu.edu/.

Update Wednesday, February 24th, 10:45 AM: The accounting database is back online

Update Wednesday, February 24th, 8:02 AM: The accounting database outage is still in progress and now expected to complete in the early afternoon

Update Tuesday, February 23rd, 5:38 PM: The accounting database outage is still in progress and expected to last into the evening

On Tuesday, February 23rd, beginning at 6:00 AM, the SLURM accounting database will go offline for maintenance. This maintenance is in preparation for updating SLURM to the latest version. Jobs can still be submitted and will run as usual; however, users may be affected in several other ways during this outage:

  1. Historical job data accessed through the sacct command will be unavailable.
  2. Some powertools that rely on the sacct command, such as SLURMUsage, will also be unavailable.
  3. New users added to the system during the outage will not be able to submit jobs until the database is back online.

This outage is expected to last approximately 12 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.

Annual Limit on GPU Usage

Starting today, ICER will be limiting the number of GPU hours that non-buyin users can consume on a yearly basis. The yearly limit will be 10000 GPU hours. Users who have already consumed GPU hours this year will be limited to 10000 GPU hours on top of what they have already consumed.
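As a rough illustration of how the limit applies to users who had already consumed GPU hours before the policy started, the arithmetic can be sketched as follows. The function and variable names are hypothetical, and the 10000-hour figure is the only value taken from the announcement.

```python
ANNUAL_GPU_HOUR_LIMIT = 10_000  # yearly limit stated in the announcement


def gpu_hours_remaining(used_before_policy, used_since_policy,
                        limit=ANNUAL_GPU_HOUR_LIMIT):
    """GPU hours left this year for a non-buyin user.

    Hours consumed before the policy took effect raise the cap by the
    same amount, so only usage since the policy start reduces the
    remaining allowance.
    """
    effective_cap = used_before_policy + limit
    total_used = used_before_policy + used_since_policy
    return max(effective_cap - total_used, 0)
```

For example, a user who had consumed 2500 GPU hours before the policy and 1200 since would have `gpu_hours_remaining(2500, 1200)` equal to 8800 hours left.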

Users can check their usage and limits using the SLURMUsage powertool.

$ module load powertools
$ SLURMUsage 

 Account             Resource  Usage(m)  Left CPU(h)  UserName
==============================================================
 general             CPU Time        0    500000.00   hpccuser
                     GPU Time        0     10000.00   hpccuser

If you have any questions, please contact ICER support at https://contact.icer.msu.edu/

Update: The SLURM database maintenance is complete. Access to the sacct command has been restored.

Update: Database maintenance is still in progress and is expected to continue into Wednesday, February 10th.

Update: New users added to the cluster during the outage will not be able to submit jobs until the migration is complete.

On Tuesday, February 9th, beginning at 9:00AM, the SLURM accounting database will go offline for maintenance. During this outage, historical job data accessed through the sacct command will be unavailable. Jobs can still be submitted and will run as usual. This outage is expected to last approximately 8 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.

Database Server Upgrade

On March 1st at 8 AM EDT, we will be deploying an updated database server for user databases. Our current server db-01 will be replaced with db-03. Scripts will need to be updated accordingly. Tickets have been opened with users that have databases on the server. If you would like any databases migrated, please let us know; we will not be migrating databases automatically.
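For users with many scripts that reference the old server, a small helper along these lines could do the substitution. This is only a sketch: the function name is hypothetical, it assumes scripts refer to the server by the short names db-01 and db-03 given above, and any connection settings beyond the hostname are outside its scope.

```python
import re

OLD_HOST = "db-01"  # current server named in the announcement
NEW_HOST = "db-03"  # replacement server named in the announcement


def update_db_host(script_text, old=OLD_HOST, new=NEW_HOST):
    """Replace standalone occurrences of the old hostname with the new one.

    The lookarounds skip longer names that merely contain the old
    hostname (for example "mydb-01" or "db-011").
    """
    pattern = re.compile(r"(?<![\w-])" + re.escape(old) + r"(?![\w-])")
    return pattern.sub(lambda match: new, script_text)
```

Applied to the text `host=db-01`, the helper returns `host=db-03`. Review the result before overwriting any file.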

The scheduler is currently offline. We are working to bring the service back up as quickly as possible, and will provide further updates here as they become available. 

2021-01-09 00:09 - Slurm scheduler is now back online. Jobs have resumed.

Globus Currently Offline

Our Globus server is currently offline. We are waiting on a response from Globus, as the issue is related to a security certificate issued by Globus.


2 PM EDT: The issue with Globus has been resolved.

Due to an external core campus network maintenance outage, the HPCC may be unavailable for external access from 11 PM to 4 AM. Currently running jobs that do not rely on external network connectivity will continue to run.

The HPCC will be unavailable beginning at 7 AM on Tuesday, January 5th to perform routine firmware and software updates to improve stability and performance. All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB) and no jobs that would overlap this maintenance window will run. Please contact ICER if you have any questions.

Update: 9 AM. We're in the process of applying firmware and OS updates to the cluster hardware and are updating the Slurm database to support the newer version of Slurm.

Update: 10:45 AM. Most firmware updates are complete. The OS updates are about 50% complete across the cluster. We have run into an issue with the Slurm upgrade process and are working on a solution.

Update: 1:45 PM: InfiniBand network updates are complete. The compute node OS updates have been completed. We're rolling back the Slurm upgrade attempt due to time constraints.

Update 3 PM. HPCC systems are available for interactive use (ssh, OpenOnDemand). We're doing final checks on the system before resuming scheduling.

Update: 3:30 PM: GPFS recovery on home is running. Users may experience long pauses while the file system recovers snapshots.

Update: 4:25 PM: The GPFS recovery has completed. We have resumed scheduling and are monitoring job status.

Update: 4:45 PM: We have completed the maintenance window. Please let us know if you experience any issues.

Globus has released a new version of software and has indicated that our current version of software will no longer operate completely after January 1, 2020. We will be updating to the latest version of this software at 8 AM on 11/19/20. Please plan your transfers accordingly.


Update: This is no longer needed.


We have experienced a critical failure on our home directory storage system. We have contacted our vendor and are working to correct the issue as soon as possible. We currently have no ETA for restoration. Updates will be provided as available.


Update:

11/3: 1:10 AM The vendor is still looking into the issue.

11/3: 7:30 AM Staff have been working with the vendor through the night to address this. We have not found any fix yet.

11/3: 8:30 AM Shortly after 8 AM, we completed the log recovery and removed the corrupted data structure that had prevented the file system from mounting. Please let us know via the contact form if you see any issues.

11/3: 9 AM login issue: "/usr/bin/xauth: timeout in locking authority file ~/.Xauthority". Writes appear to be returning a "No such device" error on client nodes. We are investigating.

11/3: 9:30 AM Write issue has been resolved. Please let us know if you experience any problems.

We will be updating our RDP gateway software to the latest revision on 11/2. We anticipate the server being offline for less than two hours during this upgrade.


9/2 9:15 AM: The upgrade is complete and the server is now online.

On 10/26 we will be taking our Globus server offline for an upgrade to the latest version of Globus. We estimate the downtime to last less than two hours. Please be sure to time transfers accordingly; ongoing transfers will fail.


8:42 AM: We have completed the switch to our new Globus server. Guest collections are currently not working; we have a ticket open with Globus support for this.


3 PM: All Globus services should now be available.