Blog

Update: The scheduler performance issues have been resolved and the scheduler is no longer paused.

We are currently experiencing performance issues with the job scheduler. We are working with our software vendor to resolve them, and the scheduler is paused while we investigate further. If you have any questions, please contact us at https://contact.icer.msu.edu.

At about 12:20 AM on June 30th, ufs18 automatically unmounted the home and research spaces on all HPCC systems due to a known software bug. By 1:10 AM service had been restored. We are currently in the process of updating the GPFS system software to address this issue. We apologize for the disruption.

1PM EST: We are currently experiencing issues with GPFS on the rsync gateway which are preventing the filesystems from being mounted. We are working with the vendor to resolve this as quickly as possible, and will be providing additional updates as information becomes available. 


Update: June 25 - 1:30PM EST: We continue to experience issues with GPFS on the rsync gateway and will continue to work with the vendor. Additional updates will be provided as information becomes available. Users waiting to transfer files may wish to consider whether the Globus server will work for their purposes in the meantime.


Update: June 29 - 12:45PM EST: Work on the rsyncgw continues. While we are unable to provide an ETA at this time, we expect additional information later this afternoon.


Update: June 29 - 4:15PM EST: we have identified the source of the issue, and the vendor is currently testing the solution.


This issue has been resolved, and both scratch and home are once again available on the rsync gateway.

The Singularity installation on the HPCC has been updated to the latest 3.8.0 release. This release includes a number of bug fixes and new features that were not present in the previous version 3.6.4. A full list of changes can be found on the Singularity GitHub page.
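
As a quick check, the installed version can be confirmed from any development node; the command and output below are illustrative:

$ singularity --version
singularity version 3.8.0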

Update: The issues with running interactive applications have been resolved.

Update: The OnDemand update is complete. We are currently troubleshooting issues with running interactive applications.

At 8AM on Tuesday, June 22nd, the HPCC OnDemand service will go offline temporarily to undergo an update. This update will bring several minor improvements to the OnDemand service. If you have any questions about this update, please contact us at https://contact.icer.msu.edu.

Slurm job submission will be offline during a configuration update and is expected to be restored by 5pm EST. Jobs which are already running on the cluster will continue to run during this time.


Update: 4:55pm EST - The slurm configuration update has completed successfully, and the scheduler has now been resumed. 

Off-campus network connectivity will be interrupted for approximately 1-2 minutes at some point between 12 AM and 2 AM on May 28th, when MSU's East Lansing campus border network routers are upgraded. Users with active sessions or transfers may need to reconnect or reestablish their sessions.

For more information, see: https://servicestatus.msu.edu/maintenances/51226

Starting Monday, May 24th, how authentication is handled for the HPCC's OnDemand web portal (https://ondemand.hpcc.msu.edu) will change. OnDemand will begin to use CILogon for authentication instead of Google. CILogon can verify the same MSU credentials as Google, with the added benefit of verifying CommunityIDs and department-sponsored NetIDs. If you are an active OnDemand user, you may notice this difference when you authenticate to OnDemand. If you are an HPCC user with a CommunityID or department-sponsored NetID, we encourage you to explore the OnDemand service. You can find more information about OnDemand on our wiki at https://wiki.hpcc.msu.edu/display/ITH/Open+OnDemand. If you have any questions about this change, please contact us at https://contact.icer.msu.edu/.


ICER's SLURM scheduler is currently configured to automatically requeue jobs in the event of a node failure. This allows jobs to be restarted as soon as possible and limits the impact of system issues on user workflows. However, this behavior is not preferable under all circumstances. For example, a job that transforms data as it runs may not run properly if simply interrupted and restarted. For a job like this, you may want to requeue it manually after cleaning up the output of the partial run. In these cases, you can override the default behavior and prevent jobs from being automatically requeued by passing the --no-requeue option to salloc, sbatch, or srun. This option can also be added directly to the job script with the line '#SBATCH --no-requeue', as in the sketch below. If you have any questions about this behavior or how to adapt it to your jobs, please contact us at https://contact.icer.msu.edu.
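
For reference, a minimal sketch of a job script with automatic requeueing disabled might look like the following; the resource requests and the transform_data.sh script are hypothetical placeholders.

#!/bin/bash
#SBATCH --job-name=transform-data
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00
#SBATCH --no-requeue          # prevent automatic requeueing after a node failure

# Hypothetical workload that modifies its data in place; a partial run
# should be cleaned up manually before the job is resubmitted.
srun ./transform_data.sh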

On Tuesday, 5/11/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/. 

On Friday, 4/23/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/. 

We will be performing rolling reboots of gateways and development nodes during the week of April 12th. These reboots are required to update the client side of our high performance file system. Reboots will occur overnight and servers are expected to be back online before morning. Servers will be rebooted according to the following schedule:


April 12th at 4:00 AM: gateway-00, gateway-03

April 13th at 4:00 AM: globus-02, rdpgw-01, dev-intel14, dev-intel14-k20

April 14th at 4:00 AM: openondemand-00, dev-intel16, dev-intel16-k80

April 15th at 4:00 AM: dev-amd20, dev-amd20-v100


Dev-intel18, gateway-01, and gateway-02 are already updated and do not require a reboot. If you have any questions, please contact us at https://contact.icer.msu.edu. 



The home filesystem is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We are working with the vendor to gather data and diagnose the problem. There is no ETA for recovery yet.

14:00 - The home filesystem continues to be offline at this time; however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.

14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.

14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage. 

15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.

12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.

10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.

The behavior of interactive jobs has changed following last week's update to the latest SLURM release. Previously, when requesting a GPU in an interactive job, an additional srun command was required to use the GPU.

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806593
salloc: Waiting for resource configuration
salloc: Nodes csn-037 are ready for job
$ srun --gres=gpu:1 nvidia-smi
Mon Mar  8 12:58:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   29C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This additional srun is no longer required. The allocated GPU can be used immediately.

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806705
salloc: Waiting for resource configuration
salloc: Nodes lac-342 are ready for job
$ nvidia-smi
Mon Mar  8 13:00:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The original method will still work, so workflows that depend on running additional srun commands within an allocation, such as testing job steps to later be submitted in a batch job, will not need to be adjusted.
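
For example, an srun command tested inside an interactive allocation can be carried over unchanged as a job step in a batch script. A minimal sketch, with hypothetical resource requests:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# The same job step that was tested interactively above
srun --gres=gpu:1 nvidia-smi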


If you have any questions about this change, please contact us at https://contact.icer.msu.edu.