On Tuesday, 5/11/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/.
On Friday, 4/23/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/.
We will be performing rolling reboots of gateways and development nodes during the week of April 12th. These reboots are required to update the client side of our high performance file system. Reboots will occur overnight and servers are expected to be back online before morning. Servers will be rebooted according to the following schedule:
April 12th at 4:00 AM: gateway-00, gateway-03
April 13th at 4:00 AM: globus-02, rdpgw-01, dev-intel14, dev-intel14-k20
April 14th at 4:00 AM: openondemand-00, dev-intel16, dev-intel16-k80
April 15th at 4:00 AM: dev-amd20, dev-amd20-v100
Dev-intel18, gateway-01, and gateway-02 are already updated and do not require a reboot. If you have any questions, please contact us at https://contact.icer.msu.edu.
Our home filesystem is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We are working with the vendor to gather data and diagnose the problem. There is no ETA for recovery yet.
14:00 - The home filesystem continues to be offline at this time, however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.
14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.
14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage.
15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.
12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.
10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.
The behavior of interactive jobs has changed after last week's update to SLURM's latest release. Previously, when requesting a GPU in an interactive job, an additional srun command was required to use the GPU.
This additional srun is no longer required. The allocated GPU can be used immediately.
The original method will still work, so workflows that depend on running additional srun commands within an allocation (such as testing job steps to later be submitted in a batch job) will not need to be adjusted.
If you have any questions about this change, please contact us at https://contact.icer.msu.edu.
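As a sketch of the new behavior (the GPU type, count, and time limit below are illustrative, not a required configuration):

```shell
# Request an interactive allocation with one GPU.
salloc --gres=gpu:1 --time=00:30:00

# Previously, an extra job step was needed to see the GPU:
#   srun --gres=gpu:1 nvidia-smi
# After the update, the allocated GPU is usable immediately:
nvidia-smi
```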
Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.
The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.
The GPFS home storage system is currently offline. We are working to identify and resolve the underlying cause of the disruption, and will provide additional information as available.
Update 3:45 PM: This outage started at about 1:55 PM. We've identified a set of nodes that may be causing this problem and are working to reset them.
Update 4 PM: The system should be fully operational now. We've identified memory exhaustion on four compute nodes as the cause of the problem. Despite existing mechanisms to prevent the overutilization of memory, these nodes were stuck in a state where they lacked sufficient memory to respond to the storage cluster, yet remained just responsive enough to prevent an automatic recovery. We will continue to investigate the cause and work with the storage vendor to address this.
Update 9:07 PM: The scheduler upgrade is complete and the scheduler is back online.
On Thursday, March 4th, at 8:00 PM, the scheduler will go offline to undergo an upgrade to the latest release. The scheduler is expected to come back online before midnight.
This outage will not affect running jobs; however, some other functionality will be affected:
- SLURM client commands will be unavailable (squeue, srun, salloc, sbatch, etc.)
- New jobs cannot be submitted
- Jobs that are already queued will not start
If you have any questions about this outage, please contact us at https://contact.icer.msu.edu/.
Update Wednesday, February 24th, 10:45 AM: The accounting database is back online
Update Wednesday, February 24th, 8:02 AM: The accounting database outage is still in progress and now expected to complete in the early afternoon
Update Tuesday, February 23rd, 5:38 PM: The accounting database outage is still in progress and expected to last into the evening
On Tuesday, February 23rd, beginning at 6:00AM, the SLURM accounting database will go offline for maintenance. This maintenance is in preparation for updating SLURM to the latest version. Jobs can still be submitted and will run as usual, however, users may be affected in several other ways during this outage:
- Historical job data accessed through the sacct command will be unavailable.
- Some powertools that rely on the sacct command, such as SLURMUsage, will also be unavailable.
- New users added to the system during the outage will not be able to submit jobs until the database is back online.
This outage is expected to last approximately 12 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.
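As an illustration of what will be unavailable during the outage, historical-data queries like the following (the job ID and format fields are illustrative) will fail or return no data until the database is back online:

```shell
# Query accounting data for a past job (unavailable during the outage).
# The job ID and the chosen fields here are examples only.
sacct -j 12345678 --format=JobID,JobName,State,Elapsed,MaxRSS
```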
Starting today, ICER will be limiting the number of GPU hours that non-buyin users can consume each year. The yearly limit will be 10,000 GPU hours. Users who have already consumed GPU hours this year will be limited to an additional 10,000 GPU hours beyond what they have already consumed.
Users can check their usage and limits using the SLURMUsage powertool.
If you have any questions, please contact ICER support at https://contact.icer.msu.edu/
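A minimal sketch of checking your usage, assuming powertools are provided as a module (check `module avail` on your system if the module name differs):

```shell
# Load the powertools collection, then report GPU/CPU usage for your account.
module load powertools
SLURMUsage
```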
Update: The SLURM database maintenance is complete. Access to the sacct command has been restored.
Update: Database maintenance is still in progress and is expected to continue into Wednesday, February 10th.
Update: New users added to the cluster during the outage will not be able to submit jobs until the migration is complete.
On Tuesday, February 9th, beginning at 9:00AM, the SLURM accounting database will go offline for maintenance. During this outage, historical job data accessed through the sacct command will be unavailable. Jobs can still be submitted and will run as usual. This outage is expected to last approximately 8 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.
On March 1st at 8am EDT, we will be deploying an updated database server for user databases. Our current server db-01 will be replaced with db-03, and scripts will need to be updated accordingly. Tickets have been opened with users that have databases on the server. If you would like any databases migrated, please let us know; we will not be migrating databases automatically.
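As a hedged sketch of the kind of script change required (a MySQL client is assumed here; the username and database name are placeholders):

```shell
# Before: scripts connecting to the old server
#   mysql -h db-01 -u myuser -p mydatabase
# After the migration, point them at the new server instead:
mysql -h db-03 -u myuser -p mydatabase
```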
The scheduler is currently offline. We are working to bring the service back up as quickly as possible, and will provide further updates here as they become available.
2021-01-09 00:09 - Slurm scheduler is now back online. Jobs have resumed.
Our Globus server is currently offline. We are waiting on a response from Globus, as the issue is related to a security certificate issued by Globus.
2pm EDT - The issue with Globus has been resolved.