Blog

We will be updating our RDP gateway software to the latest revision on 11/2. The server will be offline for a short time during this upgrade; we anticipate the downtime lasting less than two hours.

On 10/26 we will be taking our Globus server offline to upgrade to the latest version of Globus. We estimate the downtime to last less than two hours. Please time transfers accordingly; any transfers in progress during the outage will fail.


8:42am: We have completed the switch to our new Globus server. Guest collections are not currently working; we have a ticket open with Globus support for this.


3pm: All Globus services should now be available.

On Tuesday, Oct 13 at 10:00am, dev-amd20 will be shut down for maintenance. The outage should be brief and the system will be returned to service as quickly as possible.

At 8:00PM on Saturday, October 10th, the SLURM scheduler will be going offline for maintenance. Client commands (e.g. sbatch, squeue, srun) will not be available during this time. Running jobs will not be affected. Maintenance is expected to last less than one hour. If you have any questions, please contact us at https://contact.icer.msu.edu/.

Unexpected Lustre Outage

On Sunday, October 4th, from approximately 9:00 AM to 10:20 AM, the Lustre file system was hung after its metadata server ran out of disk space. Additional space was added and functionality was restored. Jobs using the Lustre file system during this time may have experienced I/O errors.

General availability for the AMD20 cluster (ICER’s new supercomputer) began at 9 AM on Tuesday, September 29th. Please report any issues that you see to ICER through the help ticket system.

We also re-enabled the automatic 45-day purge on /mnt/ls15 on October 1st.

The first wave of AMD20 (210 CPU nodes and 9 GPU nodes) is now available for testing in the amd20-test partition. 

Use:

#SBATCH -q amd20-test
#SBATCH -p amd20-test

to request the test partition and QOS.
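
For example, a minimal job script targeting the test partition might look like the following sketch. The job name, resource requests, and executable are placeholders for illustration only; adjust them to your own workload.

#!/bin/bash
# Hypothetical example script for the amd20 test partition.
#SBATCH --job-name=amd20-test-job
#SBATCH -q amd20-test
#SBATCH -p amd20-test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

srun ./my_program    # replace with your actual command

Submit it with sbatch as usual, e.g. "sbatch my_script.sh".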

The dev-amd20 and dev-amd20-v100 development nodes are accessible from the other development nodes.
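
For example, from one of the existing development nodes you should be able to reach them by hostname with ssh (assuming the short hostnames below resolve on the internal network):

ssh dev-amd20
ssh dev-amd20-v100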

There is no limit on the number of cores you can use, but there is a 24-hour limit on CPU time. Systems may need to be restarted at any time as we complete testing and address any issues that arise.

If everything goes well, we anticipate that this system will be available through the normal scheduler by the end of the month.

For more information, please see:

Cluster amd20 with AMD CPUs

Please contact us if you notice any issues or have additional questions.


On 9-2-20 at 12am, we will be taking the Globus Google Drive server offline for maintenance. We will be attempting to correct an issue that is causing only the My Drive space to be available. The maintenance is expected to last up to 4 hours. When the server is back online, users may need to remove their old collections and map new ones.



The HPCC provides two nodes of the newly purchased amd20 cluster for users to do testing. Please check the wiki page "Running Job on amd20 Test Nodes" for how to run your jobs on these nodes. Users can also find more information about the cluster (such as node performance and AMD libraries) on the page "Cluster amd20 with AMD CPU".

Update 08/13/20 1:00PM: A patch has been applied and scheduler functionality has returned to normal.

We are currently encountering some performance issues with the job scheduler following updates during the maintenance. This is causing jobs not to schedule properly as well as delays in job execution. We are working to resolve this with our vendor.

We have a firewall issue following the HPCC maintenance on August 4th. Network performance is intermittently very slow. If you log into an HPCC gateway and only see a response like "Last login:  ... ...", please wait; the rest of the login may take a while to complete. Our system administrators are working with ITS to resolve this issue.

We are currently experiencing high CPU load on ICER's firewall. Users may experience lag when accessing files using the gateway nodes; users are advised to use development nodes until we resolve the issue. MSU IT Security is working with the firewall vendor to diagnose and resolve the issue.

During the legacy scratch upgrade in December of last year, the normal purge of files not modified in the last 45 days was disabled. We will re-enable this purge during the next outage. Please check your legacy scratch directories and back up any data that you may need.
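
As a rough sketch, a command like the following lists files in a scratch directory that have not been modified in the last 45 days and would therefore be candidates for the purge. The path is a placeholder; substitute your own legacy scratch directory.

find /mnt/ls15/scratch/users/$USER -type f -mtime +45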


The HPCC's main scratch system (gs18) is nearing capacity. We ask that users reduce their usage or move work to ls15. If gs18 remains near capacity, a more aggressive purge policy will be required to maintain system stability.

Users can use the 'quota' command to check their usage on gs18.
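
For example, from a development or gateway node (the reported columns vary, and the gs18 path below is illustrative rather than exact):

quota                                   # per-user usage and limits
du -sh /mnt/gs18/scratch/users/$USER    # rough size of your gs18 scratch directory (path is an assumption)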

After the maintenance outage on August 4th, we are going to move a significant fraction of users from gs18 to a scratch space on ufs18. Affected users will be notified.

The HPCC will be unavailable on August 4th, 2020 to do regularly scheduled software, hardware, and network maintenance and to prepare for the new cluster installation. During the maintenance window, interactive access via SSH and OpenOnDemand will be disabled, remote home directory access (via Globus and Windows File Sharing) will be blocked, and no jobs that would overlap the maintenance window will be started until after it completes. Please contact us if you have any questions or concerns.

Update 1 AM 08-04: All services are currently unavailable; initial software updates have been staged and the network equipment is being updated.

Update 3 AM 08-04: The core network upgrades are complete.

Update 10 AM 08-04: Scheduler updates are complete. Compute node updates are underway. Windows file sharing access to the home directory servers is available.

Update 4 PM 08-04: Compute node updates are nearly complete; we anticipate a return to service by 5 PM today. There is an issue with one of our license servers, so some licensed software may fail when started. We are working with the vendors to update the configuration.

Update 6:30 PM 08-04: Interactive access has been resumed. Late in the process we experienced a component failure on the 2016 cluster that has delayed our return to scheduling. We have restored some of the licenses on the failed server and are working with vendors to move the rest to a new license server.

Update 8:00 PM 08-04: The scheduler has been resumed and we have returned to full service. We're finishing up a few outstanding items; if you run into any issues, please contact us.

Update 12:00 PM 08-12: The license server issue is resolved.

Webrdp is currently offline.  We are looking into this and will provide updates when available.


Update 10:00 am: The webrdp server is now back online.