Our home filesystem is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We are working with the vendor to gather data. No ETA on recovery yet.
14:00 - The home filesystem remains offline; however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.
14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.
14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage.
15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.
12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.
10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.
The behavior of interactive jobs has changed after last week's update to SLURM's latest release. Previously, when requesting a GPU in an interactive job, an additional srun command was required to use the GPU.
This additional srun is no longer required. The allocated GPU can be used immediately.
The original method still works, so workflows that depend on running additional srun commands within an allocation, such as testing job steps to later be submitted in a batch job, will not need to be adjusted.
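As a minimal sketch of the difference, assuming salloc-based interactive jobs (the specific resource values below are illustrative, not recommendations):

```shell
# Request an interactive allocation with one GPU (example values):
salloc --gres=gpu:1 --time=00:20:00

# Previously, using the GPU required an extra srun inside the allocation:
#   srun --gres=gpu:1 nvidia-smi
# After the update, the allocated GPU is available directly:
nvidia-smi

# The old pattern still works, e.g. for testing individual job steps
# before moving them into a batch script:
srun nvidia-smi
```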
If you have any questions about this change, please contact us at https://contact.icer.msu.edu.
Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.
The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.