Tomorrow morning, Nov. 23, 2021, all of the gateways will be rebooted for maintenance.
The reboots will start at 7:30am and should be completed by 8:00am.
Update: All gateways are up and available.
On Tuesday, November 23rd, at 10PM, the scheduler will go offline briefly for a configuration update. This outage will affect the availability of client commands, e.g. sbatch, srun, and salloc. This outage will not affect running or queued jobs. The scheduler is expected to be offline for approximately 30 minutes.
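For workflows that call the scheduler's client commands from a script, a brief outage like this one can be absorbed by retrying. The following is a minimal sketch, not an ICER-provided tool: the helper name `run_with_retry` and its parameters are hypothetical, and it assumes that any non-zero exit status means the command should simply be tried again later.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Run cmd (a list of arguments), retrying while it exits non-zero.

    Returns the CompletedProcess of the first successful run, or of the
    last failed attempt if every attempt fails. Assumption: a non-zero
    exit status is treated as a temporary failure worth retrying.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Wait before the next attempt, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    return result
```

A call such as `run_with_retry(["sbatch", "job.sh"], attempts=10, delay_seconds=300)` would keep trying for roughly 45 minutes, where `job.sh` is a placeholder script name. Because the sketch cannot tell an outage from a genuine submission error, check the captured `stderr` before relying on it.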
At 4:01 PM on November 10th, user logins began failing due to a security setting expiring. As of 4:50 PM most gateways, dev nodes, and compute nodes should be recovering. Please open a ticket via the contact forms for any issues you notice.
Update: An issue with ls15 was identified at 4:50 and resolved by 5 PM.
Update 11/11: The issue was resolved as of 7 PM on the 10th.
We will be upgrading the ICER directory services infrastructure on Monday, 25 October, 2021. Although no user impact is expected, please submit a problem report or reach out on the ICER Help Desk channel if you experience any login issues. If you have any questions about this change, please contact us at https://contact.icer.msu.edu/.
3:45PM Users are unable to log in to the HPCC due to a configuration problem. We are fixing it.
4:40PM Fixes were implemented to restore login services. All systems should again be operational.
Beginning on Monday, October 25th, how job IDs are represented on the HPCC's SLURM scheduler will change. This change will cause new job IDs to jump significantly in value. Currently queued or running job IDs will not change. This configuration change is one step towards implementing additional scheduler features in the future. If you have any questions about this change, please contact us at https://contact.hpcc.msu.edu/.
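Scripts that capture job IDs may be worth checking against this change. The sketch below is one cautious way to read the ID from the output of `sbatch --parsable`, which prints the job ID optionally followed by `;` and a cluster name. The function name `parse_job_id` is hypothetical. It keeps the ID as a string and assumes nothing about its length or magnitude.

```python
def parse_job_id(parsable_output):
    """Extract the job ID from `sbatch --parsable` output.

    The expected form is "jobid" or "jobid;cluster". The ID is returned
    as a string so that no assumption is made about how large it is.
    """
    text = parsable_output.strip()
    if not text:
        raise ValueError("empty sbatch output")
    first_line = text.splitlines()[0]
    job_id = first_line.split(";", 1)[0].strip()
    if not job_id.isdigit():
        raise ValueError(f"unexpected sbatch output: {first_line!r}")
    return job_id
```

Treating the ID as an opaque string avoids problems in scripts that pad, truncate, or pattern-match IDs with a fixed number of digits.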
The HPCC scratch system gs18 (/mnt/scratch ; /mnt/gs18) is nearly full (98%). We're working to move heavy users to alternative scratch spaces, but users may experience "Out of space on device" errors if other users write a significant amount of data to the HPCC. Users are asked to remove any data from scratch that they no longer need and to consider using the Lustre scratch system ls15 in the interim.
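To help decide what to remove, the following minimal sketch lists regular files under a directory that have not been modified for a given number of days, largest first. The function name `find_stale_files` and the 30-day default are illustrative assumptions, not an ICER policy. The sketch only reports files and deletes nothing.

```python
import os
import stat
import time


def find_stale_files(root, min_age_days=30, now=None):
    """Return (size_bytes, path) pairs for old regular files, largest first.

    A file counts as stale when its modification time is more than
    min_age_days before `now` (which defaults to the current time).
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks and special files
            if info.st_mtime < cutoff:
                stale.append((info.st_size, path))
    stale.sort(reverse=True)
    return stale
```

Pointing `root` at your own directory on the scratch system and printing the first few results gives a quick view of the largest candidates for cleanup. Walking a very large directory tree can itself put load on the filesystem, so run it sparingly.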
Update: The firmware upgrade was successfully completed as of 6:48am EDT Wednesday, October 6th, 2021. No restart of the Operating System was needed.
A required firmware upgrade will be applied to the dev-intel16-k80 node starting at 6:00 am on Wednesday, October 6th, 2021. This process will take approximately one (1) hour and the node may be unavailable during the upgrade period.
All users with any active ssh sessions at the beginning of the maintenance window may have those sessions reset and any running processes terminated.
- We rebooted the dev-intel16-k80 node to resolve a GPU issue at 9:00 pm on Wednesday, September 29th, 2021. All users that were logged in at that time have had their sessions reset.
- The reboot of the dev-intel16-k80 node scheduled for 9:00 pm Thursday, September 30th, 2021 has been cancelled.
Update: maintenance is successfully complete as of 11:30am EDT. All NFS and SMB shares should be mappable once again, but please open a ticket if you encounter any issues.
Update: maintenance for NFS/SMB on UFS18 will now take place at 9am EDT on Wednesday, September 29, 2021.
At 8am EDT on Tuesday, September 28, 2021, we will be performing maintenance on the NFS and SMB servers which allow remote mapping via these protocols. The maintenance is expected to take 2 hours, and remote exports of user home directories via NFS and SMB will be unavailable during this time.
Please note, this will not affect home directories on the cluster, only remote mounts set up by users outside of the system. If you have your home directory mapped on your local machine via NFS or SMB, it will become unavailable during the maintenance, but you may log into the cluster to gain access to your data.
At approximately 11:10am EDT on Wednesday, September 15, 2021, we experienced a failure with the UFS18 Home filesystem which caused user home directories to go offline. The outage continued for approximately 35 minutes, and the home filesystem has been restored to service. We will be working with the vendor to help isolate the cause and implement a permanent fix. Please open a support ticket with us if you continue to experience any issues accessing your home directories from this point forward.
The ufs18 home filesystem is currently offline, and we are implementing the previous fix from the vendor to return this to service.
Update: the home filesystem has been brought back online and user home directories should be functional once again. We continue to work with the vendor on a permanent resolution.
The ufs18 home filesystem is currently down, and we are working with the vendor to identify and resolve the issue as quickly as possible. Additional updates will be provided as soon as more information becomes available.
Update: after working with the vendor to perform a check on the filesystem, we have been able to get ufs18 home remounted across the cluster. We will be performing additional checks later in the morning. In the meantime, please open a ticket if you continue to experience issues accessing your home directory.
The HPCC scratch system gs18 (/mnt/scratch ; /mnt/gs18) is nearly full. We're working to move heavy users to alternative scratch spaces, but users may experience "Out of space on device" errors if other users write a significant amount of data to the HPCC. Users are asked to remove any data from scratch that they no longer need and to consider using the Lustre scratch system ls15 in the interim.
The HPCC will be unavailable on Tuesday, August 17th to perform routine firmware and software updates to improve stability and performance of the systems. All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB) and no jobs that would overlap this maintenance window will run. Please contact ICER if you have any questions.
Update: the maintenance will begin at midnight.
Update 8-17 2:32 AM: All user sessions have been disconnected and running jobs have stopped. The network upgrades have been completed. Software updates to the HPCC environment will continue. Interactive logins and services will remain unavailable until the underlying services have been updated.
Update 8:50 AM: Updates are underway on HPCC systems.
Update 12:55 PM: Most upgrades are complete. One has taken longer than expected but we are moving forward. We hope to have the system available for login this afternoon.
Update 1:30 PM: All major maintenance is complete. We are reopening the gateways, dev nodes, and interactive services. The scheduler is still paused. Users may experience intermittent pauses while we finish up some remaining work.
Update 4:15 PM: Scheduling has been resumed. Nearly all compute nodes are available.
Update 4:40 PM: The maintenance window has completed and all HPCC services have returned to normal. Please let us know if you see any issues.
Update: The scheduler performance issues have been resolved and the scheduler is no longer paused.
We are currently experiencing performance issues with the job scheduler. We are working with our software vendor to resolve these issues. The scheduler is currently paused while we investigate these issues further. If you have any questions, please contact us at https://contact.icer.msu.edu.