We will be upgrading the ICER directory services infrastructure on Monday, 25 October, 2021.  Although no user impact is expected, please submit a problem report or reach out on the ICER Help Desk channel if you experience any login issues.  If you have any questions about this change, please contact us at https://contact.icer.msu.edu/


3:45PM Users are unable to log in to the HPCC due to a configuration problem. We are fixing it.

4:40PM Fixes were implemented to restore login services; all systems should again be operational.

Beginning on Monday, October 25th, how job IDs are represented on the HPCC's SLURM scheduler will change. This change will cause new job IDs to jump significantly in value. Currently queued or running job IDs will not change. This configuration change is one step towards implementing additional scheduler features in the future. If you have any questions about this change, please contact us at https://contact.hpcc.msu.edu/

The HPCC scratch system gs18 (/mnt/scratch ; /mnt/gs18) is nearly full (98%). We're working to move heavy users to alternative scratch spaces, but users may encounter "No space left on device" errors if other users write a significant amount of data. Users are asked to remove any data from scratch that they no longer need and to consider using the Lustre scratch system ls15 in the interim.
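
For reference, the commands below sketch one way to check and clean up scratch usage; the per-user paths and directory names are assumptions, so substitute your own.

$ du -sh /mnt/scratch/$USER                                   # total size of your scratch data (assumed path)
$ du -h --max-depth=1 /mnt/scratch/$USER | sort -h | tail     # largest subdirectories, to find cleanup candidates
$ rm -r /mnt/scratch/$USER/old_results                        # remove data you no longer need (hypothetical directory)
$ rsync -a /mnt/scratch/$USER/active_project/ /mnt/ls15/$USER/active_project/   # move active work to ls15 (assumed layout)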

Update: The firmware upgrade was successfully completed as of 6:48am EDT Wednesday, October 6th, 2021.  No restart of the Operating System was needed.

A required firmware upgrade will be applied to the dev-intel16-k80 node starting at 6:00 am on Wednesday, October 6th, 2021.  This process will take approximately one (1) hour and the node may be unavailable during the upgrade period. 

All users with any active ssh sessions at the beginning of the maintenance window may have those sessions reset and any running processes terminated.

  • We rebooted the dev-intel16-k80 node to resolve a GPU issue at 9:00 pm on Wednesday, September 29th, 2021.  All users that were logged in at that time have had their sessions reset.
  • The reboot of the dev-intel16-k80 node scheduled for 9:00 pm Thursday, September 30th, 2021 has been cancelled.


Update: maintenance is successfully complete as of 11:30am EDT. All NFS and SMB shares should be mappable once again, but please open a ticket if you encounter any issues.

Update: maintenance for NFS/SMB on UFS18 will now take place at 9am EDT on Wednesday, September 29, 2021. 

At 8am EDT on Tuesday, September 28, 2021, we will be performing maintenance on the NFS and SMB servers which allow remote mapping via these protocols. The maintenance is expected to take 2hrs, and remote exports of user home directories via NFS and SMB will be unavailable during this time.

Please note, this will not affect home directories on the cluster, only remote mounts setup by users outside of the system. If you have your home directory mapped on your local machine via NFS or SMB you will notice this become unavailable during the maintenance, but you may log into the cluster to gain access to your data.
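
If a remote mount on your local machine does not recover on its own after the maintenance, it can usually be re-mapped manually. The Linux SMB (CIFS) commands below are only an illustration; the server, share, and mount point names are placeholders rather than the actual HPCC export paths, so reuse the values from your original setup.

$ sudo umount -l /mnt/hpcc-home                                       # drop the stale mount, if any (hypothetical mount point)
$ sudo mount -t cifs //<smb-server>/<your-share> /mnt/hpcc-home -o username=<your-netid>,vers=3.0   # re-map the share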

At approximately 11:10am EDT on Wednesday, September 15, 2021, we experienced a failure with the UFS18 Home filesystem which caused user home directories to go offline. The outage continued for approximately 35 minutes, and the home filesystem has been restored to service. We will be working with the vendor to help isolate the cause and implement a permanent fix. Please open a support ticket with us if you continue to experience any issues accessing your home directories from this point forward.

The ufs18 home filesystem is currently offline, and we are implementing the previous fix from the vendor to return this to service. 

Update: the home filesystem has been brought back online and user home directories should be functional once again. We continue to work with the vendor on a permanent resolution.


The ufs18 home filesystem is currently down, and we are working with the vendor to identify and resolve the issue as quickly as possible. Additional updates will be provided as soon as more information becomes available.


Update: after working with the vendor to perform a check on the filesystem, we have been able to get ufs18 home remounted across the cluster. We will be performing additional checks later in the morning, however, please open a ticket if you continue to experience issues accessing your home directory. 

The HPCC will be unavailable on Tuesday, August 17th to perform routine firmware and software updates to improve stability and performance of the systems. All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB) and no jobs that would overlap this maintenance window will run. Please contact ICER if you have any questions.

Update: the maintenance will begin at midnight.

Update 8-17 2:32 AM: All user sessions have been disconnected and running jobs have stopped. The network upgrades have been completed. Software updates to the HPCC environment will continue. Interactive logins and services will remain unavailable until the underlying services have been updated.

Update 8:50 AM: Updates are underway on HPCC systems.

Update 12:55 PM: Most upgrades are complete. One has taken longer than expected but we are moving forward. We hope to have the system available for login this afternoon.

Update 1:30 PM: All major maintenance is complete. We are reopening the gateways, dev nodes, and interactive services. The scheduler is still paused. Users may experience intermittent pauses while we finish up some remaining work.

Update 4:15 PM: Scheduling has been resumed. Nearly all compute nodes are available.

Update 4:40 PM: The maintenance window has completed and all HPCC services have returned to normal. Please let us know if you see any issues.


The HPCC scratch system gs18 (/mnt/scratch ; /mnt/gs18) is nearly full. We're working to move heavy users to alternative scratch spaces, but users may encounter "No space left on device" errors if other users write a significant amount of data. Users are asked to remove any data from scratch that they no longer need and to consider using the Lustre scratch system ls15 in the interim.


1PM EST: We are currently experiencing issues with GPFS on the rsync gateway which are preventing the filesystems from being mounted. We are working with the vendor to resolve this as quickly as possible, and will be providing additional updates as information becomes available. 


Update: June 25 - 1:30PM EST: we continue to experience issues with GPFS on the rsync gateway and will continue to work with the vendor. Additional updates will be provided as information becomes available. For users waiting to transfer files, please consider whether the Globus server will work for your purposes in the meantime.


Update: June 29 - 12:45PM EST: work on the rsync gateway continues. While we are unable to provide an ETA at this time, we expect to have additional information later this afternoon.


Update: June 29 - 4:15PM EST: we have identified the source of the issue, and the vendor is currently testing the solution.


This issue has been resolved, and both scratch and home are once again available on the rsync gateway.

Update: The scheduler performance issues have been resolved and the scheduler is no longer paused.

We are currently experiencing performance issues with the job scheduler. We are working with our software vendor to resolve these issues. The scheduler is currently paused while we investigate these issues further. If you have any questions, please contact us at https://contact.icer.msu.edu.

At about 12:20 AM on June 30th, ufs18 automatically unmounted the home and research spaces on all HPCC systems due to a known software bug. By 1:10 AM service had been restored. We are currently in the process of updating the GPFS system software to address this issue. We apologize for the disruption.

Update: The issues with running interactive applications have been resolved.

Update: The OnDemand update is complete. We are currently troubleshooting issues with running interactive applications.

At 8AM on Tuesday, June 22nd, the HPCC OnDemand service will go offline temporarily to undergo an update. This update will bring several minor improvements to the OnDemand service. If you have any questions about this update, please contact us at https://contact.icer.msu.edu.

The Singularity installation on the HPCC has been updated to the latest 3.8.0 release. This release includes a number of bug fixes and new features that were not present in the previous version 3.6.4. A full list of changes can be found on the Singularity GitHub page.
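
To confirm the version in use and give the new release a quick try, something like the following should work on a development node (the container image shown is just an example):

$ singularity --version
singularity version 3.8.0
$ singularity exec docker://alpine:latest cat /etc/os-release   # pull a small container and run a command inside it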

Slurm job submission will be offline during a configuration update and is expected to return to service by 5pm EST. Jobs that are already running on the cluster will continue to run during this time.


Update: 4:55pm EST - The slurm configuration update has completed successfully, and the scheduler has now been resumed. 

Off-campus network connectivity will be interrupted for approximately 1-2 minutes at some point between 12 AM and 2 AM on May 28th, when MSU's East Lansing campus border network routers are upgraded. Users with active sessions or transfers may need to reconnect or reestablish their sessions.

For more information, see: https://servicestatus.msu.edu/maintenances/51226

Starting Monday, May 24th, how authentication is handled for the HPCC's OnDemand web portal (https://ondemand.hpcc.msu.edu) will change. OnDemand will begin to use CILogon for authentication instead of Google. CILogon can verify the same MSU credentials as Google, with the added benefit of verifying CommunityIDs and department-sponsored NetIDs. If you are an active OnDemand user, you may notice this difference when you authenticate to OnDemand. If you are an HPCC user with a CommunityID or department-sponsored NetID, we encourage you to explore the OnDemand service. You can find more information about OnDemand on our wiki at https://wiki.hpcc.msu.edu/display/ITH/Open+OnDemand. If you have any questions about this change, please contact us at https://contact.icer.msu.edu/.


ICER's SLURM scheduler is currently configured to automatically requeue jobs in the event of a node failure. This allows jobs to be restarted as soon as possible and limits the impact of systems issues on user workflow. However, this behavior is not preferable under all circumstances. For example, a job that is transforming data as it runs may not run properly if simply interrupted and restarted. For a job like this, one may want to manually requeue it after cleaning up the output of a partial run. For these cases, it is possible to override the default behavior and prevent jobs from being automatically requeued by specifying the --no-requeue option to salloc, sbatch, or srun. This option can also be added directly to the job script with the line ‘#SBATCH --no-requeue’. If you have any questions about this behavior or how to adapt this to your jobs, please contact us at https://contact.icer.msu.edu.
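
For example, a minimal job script that opts out of automatic requeueing might look like the sketch below; the resource requests and script name are illustrative only.

#!/bin/bash
#SBATCH --job-name=transform_data
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --no-requeue          # do not automatically requeue this job after a node failure

srun ./transform_data.sh      # hypothetical workload; requeue manually after cleaning up any partial output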

On Tuesday, 5/11/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/. 

On Friday, 4/23/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/. 

We will be performing rolling reboots of gateways and development nodes during the week of April 12th. These reboots are required to update the client side of our high performance file system. Reboots will occur overnight and servers are expected to be back online before morning. Servers will be rebooted according to the following schedule:


April 12th at 4:00 AM: gateway-00, gateway-03

April 13th at 4:00 AM: globus-02, rdpgw-01, dev-intel14, dev-intel14-k20

April 14th at 4:00 AM: openondemand-00, dev-intel16, dev-intel16-k80

April 15th at 4:00 AM: dev-amd20, dev-amd20-v100


Dev-intel18, gateway-01, and gateway-02 are already updated and do not require a reboot. If you have any questions, please contact us at https://contact.icer.msu.edu. 



Our home filesystem is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We're working with the vendor to gather data and diagnose the problem. There is no ETA for recovery yet.

14:00 - The home filesystem continues to be offline at this time, however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.

14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.

14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage. 

15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.

12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.

10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.

The behavior of interactive jobs has changed after last week's update to SLURM's latest release. Previously, when requesting a GPU in an interactive job, an additional srun command was required to use the GPU.

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806593
salloc: Waiting for resource configuration
salloc: Nodes csn-037 are ready for job
$ srun --gres=gpu:1 nvidia-smi
Mon Mar  8 12:58:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   29C    P8    26W / 225W |      0MiB /  4743MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This additional srun is no longer required. The allocated GPU can be used immediately.

$ salloc -n 1 --gres=gpu:1
salloc: Granted job allocation 16806705
salloc: Waiting for resource configuration
salloc: Nodes lac-342 are ready for job
$ nvidia-smi
Mon Mar  8 13:00:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The original method will still work, so workflows that depend on running additional srun commands within an allocation (such as testing job steps to be submitted later in a batch job) will not need to be adjusted.
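
For instance, job steps tested interactively inside a salloc allocation can usually be carried over into a batch script unchanged; a minimal sketch, with hypothetical executable names:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

srun -n 1 --gres=gpu:1 ./train_model   # job step previously tested inside the interactive allocation
srun -n 2 ./postprocess                # a second job step using both tasks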


If you have any questions about this change, please contact us at https://contact.icer.msu.edu.

Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.

The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.

Update 9:07PM: The scheduler upgrade is complete and the scheduler is back online.

On Thursday, March 4th, at 8:00PM, the scheduler will go offline to undergo an upgrade to the latest release. The scheduler is expected to come back online before midnight.

This outage will not affect running jobs; however, some other functionality will be affected:

  1. SLURM client commands will be unavailable (squeue, srun, salloc, sbatch, etc.)
  2. New jobs cannot be submitted
  3. Jobs that are already queued will not start

If you have any questions about this outage, please contact us at https://contact.icer.msu.edu/.

The GPFS home storage system is currently offline. We are working to identify and resolve the underlying cause of the disruption, and will provide additional information as available. 

Update 3:45 PM This outage started at about 1:55 PM. We've identified a set of nodes that may be causing this problem and are working to reset them.

Update 4 PM The system should be fully operational now. We've identified memory exhaustion on four compute nodes as the cause of the problem. Despite existing mechanisms to prevent memory overutilization, these nodes were stuck in a state where they lacked the memory to respond to the storage cluster, yet remained just responsive enough to prevent an automatic recovery from proceeding without them. We will continue to investigate the cause and work with the storage vendor to address this.

Update Wednesday, February 24th, 10:45 AM: The accounting database is back online.

Update Wednesday, February 24th, 8:02 AM: The accounting database outage is still in progress and is now expected to complete in the early afternoon.

Update Tuesday, February 23rd, 5:38 PM: The accounting database outage is still in progress and is expected to last into the evening.

On Tuesday, February 23rd, beginning at 6:00AM, the SLURM accounting database will go offline for maintenance. This maintenance is in preparation for updating SLURM to the latest version. Jobs can still be submitted and will run as usual, however, users may be affected in several other ways during this outage:

  1. Historical job data accessed through the sacct command will be unavailable.
  2. Some powertools that rely on the sacct command, such as SLURMUsage, will also be unavailable.
  3. New users added to the system during the outage will not be able to submit jobs until the database is back online.

This outage is expected to last approximately 12 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.
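
Once the database is back online, historical job data can again be queried through sacct as usual; a typical query (the date and format fields shown are illustrative) looks like:

$ sacct -u $USER --starttime=2021-02-20 --format=JobID,JobName,Partition,State,Elapsed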

Annual Limit on GPU Usage

Starting today, ICER will be limiting the number of GPU hours that non-buy-in users can consume each year. The yearly limit will be 10,000 GPU hours. Users who have already consumed GPU hours this year will be limited to 10,000 GPU hours on top of what they have already consumed.

Users can check their usage and limits using the SLURMUsage powertool.

$ module load powertools
$ SLURMUsage 

 Account             Resource  Usage(m)  Left CPU(h)  UserName
==============================================================
 general             CPU Time        0    500000.00   hpccuser
                     GPU Time        0     10000.00   hpccuser

If you have any questions, please contact ICER support at https://contact.icer.msu.edu/

Update: The SLURM database maintenance is complete. Access to the sacct command has been restored.

Update: Database maintenance is still in progress and is expected to continue into Wednesday, February 10th.

Update: New users added to the cluster during the outage will not be able to submit jobs until the migration is complete.

On Tuesday, February 9th, beginning at 9:00AM, the SLURM accounting database will go offline for maintenance. During this outage, historical job data accessed through the sacct command will be unavailable. Jobs can still be submitted and will run as usual. This outage is expected to last approximately 8 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.

Database Server Upgrade

On March 1st at 8am EDT, we will be deploying an updated database server for user databases. Our current server db-01 will be replaced with db-03, and scripts will need to be updated accordingly. Tickets have been opened with users that have databases on the server. If you would like any databases migrated, please let us know; we will not be migrating databases automatically.
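
As an illustration only, and assuming a MySQL/MariaDB-style client (the database name below is a placeholder), updating a connection script is typically just a matter of changing the host name:

$ mysql -h db-01 -u $USER -p my_database      # old server, to be retired
$ mysql -h db-03 -u $USER -p my_database      # new server, after March 1st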

The scheduler is currently offline. We are working to bring the service back up as quickly as possible, and will provide further updates here as they become available. 

2021-01-09 00:09 - Slurm scheduler is now back online. Jobs have resumed.

The HPCC will be unavailable beginning at 7 AM on Tuesday, January 5th to perform routine firmware and software updates to improve stability and performance. All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB) and no jobs that would overlap this maintenance window will run. Please contact ICER if you have any questions.

Update: 9 AM. We're in the process of applying firmware and OS updates to the cluster hardware and are updating the Slurm database to support the newer version of Slurm.

Update: 10:45 AM. Most firmware updates are complete. The OS updates are about 50% complete across the cluster. We have run into an issue with the Slurm upgrade process and are working on a solution.

Update: 1:45 PM Infiniband network updates are complete. The compute node OS updates have been completed. We're rolling back the Slurm upgrade attempt due to time constraints.

Update 3 PM. HPCC systems are available for interactive use (ssh, OpenOnDemand). We're doing final checks on the system before resuming scheduling.

Update: 3:30 PM: GPFS recovery on home is running. Users may experience long pauses while the file system recovers snapshots.

Update: 4:25 PM: The GPFS recovery has completed. We have resumed scheduling and are monitoring job status.

Update: 4:45 PM: We have completed the maintenance window. Please let us know if you experience any issues.

Globus Currently Offline

Our Globus server is currently offline. We are waiting on a response from Globus, as the issue is related to a security certificate issued by Globus.


2pm EDT: The issue with Globus has been resolved.

Due to an external core campus network maintenance outage, the HPCC may be unavailable for external access from 11 PM to 4 AM. Currently running jobs that do not rely on external network connectivity will continue to run.

Globus has released a new version of its software and has indicated that our current version will no longer operate fully after January 1, 2021. We will be updating to the latest version of this software at 8am on 11/19/20. Please plan your transfers accordingly.


Update: This is no longer needed.


We have experienced a critical failure on our home directory storage system. We have contacted our vendor and are working to correct the issue as soon as possible. We currently have no ETA for restoration. Updates will be provided as they become available.


Update:

11/3 1:10am The vendor is still looking into the issue.

11/3: 7:30 AM Staff have been working with the vendor through the night to address this. We have not found any fix yet.

11/3: 8:30 AM Shortly after 8 AM, we were able to complete the log recovery and remove the corrupted data structure that was preventing the file system from mounting. Please let us know via the contact form if you see any issues.

11/3: 9 AM Users are seeing a login issue ("/usr/bin/xauth: timeout in locking authority file ~/.Xauthority"), and writes appear to be returning a "No such device" error on client nodes. We are investigating.

11/3: 9:30 AM Write issue has been resolved. Please let us know if you experience any problems.

We will be updating our RDP gateway software to the latest revision on 11/2. The server will be offline for a short time during this upgrade; we anticipate it will be offline for less than two hours.


11/2 9:15am: The upgrade is complete and the server is now online.

On 10/26 we will be taking our Globus server offline to upgrade to the latest version of Globus. We estimate the downtime will last less than two hours. Please time your transfers accordingly; ongoing transfers will fail.


8:42am: We have completed the switch to our new Globus server. Currently, guest collections are not working; we have a ticket open with Globus support for this.


3pm: All Globus services should now be available.

On Tuesday, Oct 13 at 10:00am, dev-amd20 will be shut down for maintenance. The outage should be brief, and the system will be returned to service as quickly as possible.

At 8:00PM on Saturday, October 10th, the SLURM scheduler will be going offline for maintenance. Client commands (e.g., sbatch, squeue, srun) will not be available during this time. Running jobs will not be affected. Maintenance is expected to last less than one hour. If you have any questions, please contact us at https://contact.icer.msu.edu/.

Unexpected Lustre Outage

On Sunday, October 4th, from approximately 9:00 AM to 10:20 AM, the Lustre file system hung after its metadata server ran out of disk space. Additional space was added and functionality was restored. Jobs using the Lustre file system during this time may have experienced I/O errors.

General availability for the AMD20 cluster (ICER’s new supercomputer) began at 9 AM on Tuesday, September 29th. Please report any issues that you see to ICER through the help ticket system.

We also re-enabled the automatic 45-day purge on /mnt/ls15 on October 1st.

The first wave of AMD20 (210 CPU nodes and 9 GPU nodes) is now available for testing in the amd20-test partition. 

Use:

#SBATCH -q amd20-test
#SBATCH -p amd20-test

to request the test partition and QOS.
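
Put together, a minimal test-job script might look like the sketch below; the resource requests and executable name are illustrative only.

#!/bin/bash
#SBATCH -q amd20-test
#SBATCH -p amd20-test
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

srun ./my_benchmark    # hypothetical executable to exercise the amd20 test nodes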

The dev-amd20 and dev-amd20-v100 nodes are available from other development nodes.

There is no limit on the number of cores you can use, but there is a 24-hour limit on CPU time. Systems may need to be restarted at any time as we complete testing and address any issues that arise.

If everything goes well, we anticipate that this system will be available through the normal scheduler by the end of the month.

For more information, please see:

Cluster amd20 with AMD CPUs

Please contact us if you notice any issues or have additional questions.


On 9-2-20 at 12am, we will be taking the Globus Google Drive server offline for maintenance. We will be attempting to correct an issue that is causing only the My Drive space to be available. The maintenance is expected to last up to 4 hours. When the server is back online, users may need to remove their old collections and map new ones.



Update 08/13/20 1:00PM: A patch has been applied and scheduler functionality has returned to normal.

We are currently encountering performance issues with the job scheduler following updates made during the maintenance. This is causing jobs not to schedule properly, as well as delays in job execution. We are working with our vendor to resolve this.

The HPCC provides two nodes of the newly purchased amd20 cluster for user testing. Please check the wiki page "Running Job on amd20 Test Nodes" for how to run jobs on these nodes. Users can also find more information about the cluster (such as node performance and the AMD libraries) on the page "Cluster amd20 with AMD CPU".

We are experiencing a firewall issue following the HPCC maintenance on August 4th. Network performance is intermittent: sometimes it is fine and sometimes it is very slow. If you log in to an HPCC gateway and receive a response like "Last login:  ... ...", please wait; further responses may take a while, after which you will be logged in. Our system administrators are working with ITS to resolve this issue.

We are currently experiencing high CPU load on ICER's firewall. Users may experience lag when accessing files using the gateway nodes; users are advised to use development nodes until we resolve the issue. MSU IT Security is working with the firewall vendor to diagnose and resolve the issue.


