Blog from March 2021

Our home filesystem is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We're working with the vendor to gather data and investigate. No ETA on recovery yet.

14:00 - The home filesystem remains offline at this time; however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.

14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.

14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage. 

15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.

12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.

10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.

The behavior of interactive jobs has changed following last week's upgrade to the latest SLURM release. Previously, when requesting a GPU in an interactive job, an additional srun command was required before the GPU could be used:

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806593
salloc: Waiting for resource configuration
salloc: Nodes csn-037 are ready for job
$ srun --gres=gpu:1 nvidia-smi
Mon Mar  8 12:58:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   29C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This additional srun command is no longer required; the allocated GPU can be used immediately:

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806705
salloc: Waiting for resource configuration
salloc: Nodes lac-342 are ready for job
$ nvidia-smi
Mon Mar  8 13:00:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The original method will still work, so workflows that depend on running additional srun commands within an allocation, such as testing job steps that will later be submitted in a batch job, will not need to be adjusted. A minimal sketch of this pattern is shown below.
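
For illustration only (my_script.py is a hypothetical placeholder for any user application), an interactive session that launches explicit job steps continues to behave as before:

$ salloc -n 1 --gres=gpu:1
$ srun --gres=gpu:1 nvidia-smi           # explicit job step, same as before the upgrade
$ srun --gres=gpu:1 python my_script.py  # the same srun line can later be copied into a batch script
$ exit                                   # leaving the shell releases the allocation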


If you have any questions about this change, please contact us at https://contact.icer.msu.edu.

Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.

The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.