On Tuesday, 5/11/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/.
On Friday, 4/23/21 at 8:00 AM, two of the ten servers that host the ls15 Lustre scratch filesystem will be rebooted. This reboot will return these servers to a fault-tolerant state where one server can take over for the other in the event of a hardware failure. This reboot may cause a momentary hang-up for jobs accessing files on the ls15 scratch filesystem. If you have any questions, please contact us at https://contact.icer.msu.edu/.
We will be performing rolling reboots of gateways and development nodes during the week of April 12th. These reboots are required to update the client side of our high performance file system. Reboots will occur overnight and servers are expected to be back online before morning. Servers will be rebooted according to the following schedule:
April 12th at 4:00 AM: gateway-00, gateway-03
April 13th at 4:00 AM: globus-02, rdpgw-01, dev-intel14, dev-intel14-k20
April 14th at 4:00 AM: openondemand-00, dev-intel16, dev-intel16-k80
April 15th at 4:00 AM: dev-amd20, dev-amd20-v100
Dev-intel18, gateway-01, and gateway-02 are already updated and do not require a reboot. If you have any questions, please contact us at https://contact.icer.msu.edu.
Our home system is currently down due to an internal error in the storage system. Users may see 'Stale File Handle' errors on nodes or in jobs. We're working with the vendor to gather data and diagnose the issue. No ETA on recovery yet.
14:00 - The home filesystem continues to be offline at this time, however, we are working with the vendor and anticipate a fix shortly. Another update will be provided at 14:30.
14:30 - A filesystem check is currently being run on home, after which we anticipate being able to bring the storage back online. Another update will be provided at 15:00.
14:45 - The filesystem check on home has completed and the storage is now back online. Please feel free to open a ticket if you experience any difficulties following the outage.
15:15 - Some nodes continued to experience stale file handles, which have now been corrected across the cluster. Please open a ticket with any ongoing filesystem issues.
12:30pm EDT - Nodes are currently losing connection to /mnt/ufs18. Home and Research spaces are affected. Our system administrators are working on resolving the issue.
10:20am EDT - We are currently experiencing networking issues with the HPCC firewall, causing intermittent connection disruptions and generally degraded performance. We are working to resolve this issue as quickly as possible and will provide further updates.
The behavior of interactive jobs has changed after last week's update to SLURM's latest release. Previously, when requesting a GPU in an interactive job, an additional srun command was required to use the GPU.
This additional srun is no longer required. The allocated GPU can be used immediately.
The original method will still work, so workflows that depend on running additional srun commands within an allocation, such as testing job steps to be submitted later in a batch job, will not need to be adjusted.
If you have any questions about this change, please contact us at https://contact.icer.msu.edu.
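As an illustration, the difference between the two workflows might look like the following (the salloc options shown are generic examples, not the required flags for any particular partition):

```shell
# Request an interactive allocation with one GPU (illustrative flags):
salloc --gres=gpu:1 --time=01:00:00

# Before the update, the GPU was only usable inside an explicit job step:
#     srun --gres=gpu:1 nvidia-smi
# After the update, the allocated GPU can be used directly:
nvidia-smi
```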
Update 11:00 AM: The bug in the scheduler has been patched and the scheduler is back online.
The SLURM scheduler is experiencing intermittent crashes following yesterday's upgrade. We are currently working with our software vendor to resolve the issue.
Update 9:07PM: The scheduler upgrade is complete and the scheduler is back online
On Thursday, March 4th, at 8:00PM, the scheduler will go offline to undergo an upgrade to the latest release. The scheduler is expected to come back online before midnight.
This outage will not affect running jobs; however, some other functionality will be affected:
- SLURM client commands will be unavailable (squeue, srun, salloc, sbatch, etc.)
- New jobs cannot be submitted
- Jobs that are already queued will not start
If you have any questions about this outage, please contact us at https://contact.icer.msu.edu/.
The GPFS home storage system is currently offline. We are working to identify and resolve the underlying cause of the disruption, and will provide additional information as available.
Update 3:45 PM This outage started at about 1:55 PM. We've identified a set of nodes that may be causing this problem and are working to reset them.
Update 4 PM The system should be fully operational now. We've identified memory exhaustion on four compute nodes as the cause of the problem. Despite existing mechanisms to prevent the overutilization of memory, these nodes were stuck in a state where they lacked sufficient memory to respond to the storage cluster but remained just responsive enough to prevent an automatic recovery. We will continue to investigate the cause and work with the storage vendor to address this.
Update Wednesday, February 24th, 10:45 AM: The accounting database is back online
Update Wednesday, February 24th, 8:02 AM: The accounting database outage is still in progress and now expected to complete in the early afternoon
Update Tuesday, February 23rd, 5:38 PM: The accounting database outage is still in progress and expected to last into the evening
On Tuesday, February 23rd, beginning at 6:00AM, the SLURM accounting database will go offline for maintenance. This maintenance is in preparation for updating SLURM to the latest version. Jobs can still be submitted and will run as usual, however, users may be affected in several other ways during this outage:
- Historical job data accessed through the sacct command will be unavailable.
- Some powertools that rely on the sacct command, such as SLURMUsage, will also be unavailable.
- New users added to the system during the outage will not be able to submit jobs until the database is back online.
This outage is expected to last approximately 12 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.
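For reference, the kind of sacct query that will be unavailable during the outage looks like this (the date and format fields are illustrative, not required options):

```shell
# List your jobs since the start of the month with common fields; commands
# like this will fail while the accounting database is offline.
sacct --starttime=2021-02-01 --format=JobID,JobName,Partition,State,Elapsed
```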
Starting today, ICER will limit the number of GPU hours that non-buy-in users can consume each year. The yearly limit will be 10,000 GPU hours. Users who have already consumed GPU hours this year will be limited to 10,000 GPU hours on top of what they have already consumed.
Users can check their usage and limits using the SLURMUsage powertool.
If you have any questions, please contact ICER support at https://contact.icer.msu.edu/
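GPU hours are assumed here to accrue as the number of allocated GPUs multiplied by wall-clock hours, which is the usual accounting convention. A quick sketch of how a single job draws down the yearly allowance:

```shell
# A job using 4 GPUs for 24 hours consumes 4 * 24 = 96 GPU hours.
gpus=4
hours=24
job_gpu_hours=$((gpus * hours))
echo "Job consumed: ${job_gpu_hours} GPU hours"

# Remaining allowance out of the 10,000 GPU-hour yearly limit:
limit=10000
remaining=$((limit - job_gpu_hours))
echo "Remaining:    ${remaining} GPU hours"
```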
Update: The SLURM database maintenance is complete. Access to the sacct command has been restored.
Update: Database maintenance is still in progress and is expected to continue into Wednesday, February 10th.
Update: New users added to the cluster during the outage will not be able to submit jobs until the migration is complete.
On Tuesday, February 9th, beginning at 9:00AM, the SLURM accounting database will go offline for maintenance. During this outage, historical job data accessed through the sacct command will be unavailable. Jobs can still be submitted and will run as usual. This outage is expected to last approximately 8 hours. If you have any questions, please contact us at https://contact.icer.msu.edu/.
On March 1st at 8am EDT, we will be deploying an updated database server for user databases. Our current server db-01 will be replaced with db-03. Scripts will need to be updated accordingly. Tickets have been opened with users that have databases on the server. If you would like any databases migrated please let us know, we will not be migrating databases automatically.
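For example, a script or configuration that hard-codes the old hostname can be updated with a one-line substitution (the file name myapp.conf and its contents are hypothetical stand-ins for a user's actual script):

```shell
# Hypothetical config file containing the old database hostname:
printf 'DB_HOST=db-01\n' > myapp.conf

# Point it at the replacement server:
sed -i 's/db-01/db-03/g' myapp.conf

result=$(cat myapp.conf)
echo "$result"   # DB_HOST=db-03
rm myapp.conf
```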
The scheduler is currently offline. We are working to bring the service back up as quickly as possible, and will provide further updates here as they become available.
2021-01-09 00:09 - Slurm scheduler is now back online. Jobs have resumed.
The HPCC will be unavailable beginning at 7 AM on Tuesday, January 5th to perform routine firmware and software updates to improve stability and performance. All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB) and no jobs that would overlap this maintenance window will run. Please contact ICER if you have any questions.
Update: 9 AM. We're in the process of applying firmware and OS updates to the cluster hardware and are updating the Slurm database to support the newer version of Slurm.
Update: 10:45 AM. Most firmware updates are complete. The OS updates are about 50% complete across the cluster. We have run into an issue with the Slurm upgrade process and are working on a solution.
Update: 1:45 PM Infiniband network updates are complete. The compute node OS updates have been completed. We're rolling back the Slurm upgrade attempt due to time constraints.
Update 3 PM. HPCC systems are available for interactive use (ssh, OpenOnDemand). We're doing final checks on the system before resuming scheduling.
Update: 3:30 PM: GPFS recovery on home is running. Users may experience long pauses while the file system recovers snapshots.
Update: 4:25 PM: The GPFS recovery has completed. We have resumed scheduling and are monitoring job status.
Update: 4:45 PM: We have completed the maintenance window. Please let us know if you experience any issues.
Our globus server is currently offline. We are waiting on a response from Globus as the issue is related to a security certificate issued by Globus.
2pm EDT: The issue with Globus has been resolved.
Due to an external core campus network maintenance outage, the HPCC may be unavailable for external access from 11 PM to 4 AM. Currently running jobs that do not rely on external network connectivity will continue to run.
Globus has released a new version of its software and has indicated that our current version will no longer operate completely after January 1, 2021. We will be updating to the latest version of this software at 8am on 11/19/20. Please plan your transfers accordingly.
Update: This is no longer needed.
We have experienced a critical failure on our home directory storage system. We have contacted our vendor and are working to correct the issue as soon as possible. We currently have no ETA for restoration. Updates will be provided as available.
11/3 1:10am The vendor is still looking into the issue.
11/3: 7:30 AM Staff have been working with the vendor through the night to address this. We have not found any fix yet.
11/3: 8:30 AM Shortly after 8 AM we were able to complete the log recovery and remove the corrupted data structure that was preventing the file system from mounting. Please let us know via the contact form if you see any issues.
11/3: 9 AM We are seeing a login issue: "/usr/bin/xauth: timeout in locking authority file ~/.Xauthority". Writes appear to be returning a "No such device" error on client nodes. We are investigating.
11/3: 9:30 AM Write issue has been resolved. Please let us know if you experience any problems.
We will be updating our RDP gateway software to the latest revision on 11/2. The server will be offline for a short time during this upgrade. We anticipate the server being offline for less than two hours.
11/2 9:15am The upgrade is complete and the server is now online.
On 10/26 we will be taking our Globus server offline for an upgrade to the latest version of Globus. We estimate the downtime to last less than two hours. Please be sure to time transfers accordingly; ongoing transfers during the outage will fail.
8:42am We have completed the switch to our new Globus server. Currently, guest collections are not working; we have a ticket open with Globus support for this.
3pm: All globus services should now be available.
Tuesday Oct 13 at 10:00am dev-amd20 will be shut down for maintenance. The outage should be brief and the system will be returned to service as quickly as possible.
At 8:00PM on Saturday, October 10th, the SLURM scheduler will be going offline for maintenance. Client commands will not be available during this time, e.g. sbatch, squeue, srun. Running jobs will not be affected. Maintenance is expected to last less than one hour. If you have any questions, please contact us https://contact.icer.msu.edu/.
On Sunday, October 4th, from approximately 9:00 AM to 10:20 AM, the Lustre file system was hung after its metadata server ran out of disk space. Additional space was added and functionality was restored. Jobs using the Lustre file system during this time may have experienced I/O errors.
General availability for the AMD20 cluster (ICER’s new supercomputer) began at 9 AM on Tuesday, September 29th. Please report any issues that you see to ICER through the help ticket system.
We also re-enabled the automatic 45 day purge on /mnt/ls15 on October 1st.
The first wave of AMD20 (210 CPU nodes and 9 GPU nodes) is now available for testing in the amd20-test partition.
Please contact us to request access to the test partition and QOS.
The dev-amd20 and dev-amd20-v100 nodes are accessible from the other development nodes.
There is no limit on cores you can use, but a 24 hour limit on CPU time. Systems may need to be restarted at any time as we complete testing and address any issues that may arise.
If everything goes well we anticipate that this system will be available within the normal scheduler by the end of the month.
For more information, please see:
Please contact us if you notice any issues or have additional questions.
9-2-20 at 12am we will be taking the Globus Google Drive server offline for maintenance. We will be attempting to correct an issue that is causing only the My Drive space to be available. The maintenance is expected to last up to 4 hours. When the server is back online, users may need to remove their old collections and map new collections.
Update 08/13/20 1:00PM: A patch has been applied and scheduler functionality has returned to normal
We are currently encountering some performance issues with the job scheduler following updates during the maintenance. This is causing jobs not to schedule properly as well as delays in job execution. We are working to resolve this with our vendor.
The HPCC provides two nodes of the newly purchased amd20 cluster for users to do testing. Please check the wiki page "Running Job on amd20 Test Nodes" for how to run your jobs on these nodes. Users can also find more information about the cluster (such as node performance and AMD libraries) on the page "Cluster amd20 with AMD CPU".
We have a firewall issue following the HPCC maintenance (on August 4th). Network performance is intermittent: sometimes good, sometimes very slow. If you log into an HPCC gateway and receive a response like "Last login: ... ...", please wait for further responses, which might take a while; after the wait, you will be logged in. Our system administrators are working with ITS to resolve this issue.
We are currently experiencing high CPU load on ICER's firewall. Users may experience lag when accessing files using the gateway nodes; users are advised to use development nodes until we resolve the issue. MSU IT Security is working with the firewall vendor to diagnose and resolve the issue.
The HPCC will be unavailable on August 4th, 2020 to do regularly scheduled software, hardware, and network maintenance and to prepare for the new cluster installation. During the maintenance window, interactive access via SSH and OpenOnDemand will be disabled, remote home directory access (via Globus and Windows File Sharing) will be blocked, and no jobs that would overlap the maintenance window will be started until after it completes. Please contact us if you have any questions or concerns.
Update 1 AM 08-04: All services are currently unavailable; initial software updates have been staged and the network equipment is being updated.
Update 3 AM 08-04: The core network upgrades are complete.
Update 10 AM 08-04: Scheduler updates are complete. Compute node updates are underway. Windows file sharing access to the home directory servers is available.
Update 4 PM 08-04: Compute node updates are nearly complete, we anticipate a return to service by 5 PM today. There is an issue with one of our license servers; some licensed software may fail when started. We are working with the vendors to update the configuration.
Update 6:30 PM 08-04: Interactive access has been resumed. Late in the process we experienced a component failure on the 2016 cluster that has delayed our return to scheduling. We have restored some of the licenses on the failed server and are working with vendors to move the rest to a new license server.
Update 8:00 PM 08-04: The scheduler has been resumed and we have returned to full service. We're finishing up a few outstanding issues; if you have any issues please contact us.
Update: 12:00 PM 08-12: The license server issue is resolved.
During the legacy scratch upgrade in December of last year, the normal purge of files not modified in 45 days became disabled. We will be re-enabling this purge during the next outage. Please make sure to check your legacy scratch directories and back up any data that you may need.
The HPCC's main scratch system (gs18) is nearing capacity. We ask that users reduce their usage or move work to ls15. If gs18 remains near capacity, a more aggressive purge policy will be required to maintain system stability.
Users can use the 'quota' command to check their usage on gs18.
After the maintenance outage on August 4th, we are going to move a significant fraction of users from gs18 to a scratch space on ufs18. Affected users will be notified.
Webrdp is currently offline. We are looking into this and will provide updates when available.
Update 10:00 am: The webrdp server is now back online.
We are currently experiencing a network issue causing most of our nodes to be offline. We are investigating and will provide updates as soon as possible.
Update: 9:15 am ITS has resolved a network issue in the MSU data center; all nodes are now back online.
Update 11:15 AM: A core data center switch failed at 1:04 AM this morning due to a bug in that switch's controller software. As part of a redundant switch pair, a single failure should not have taken the network offline, but the second switch did not successfully take over. We have identified why the second switch was unable to bring up the interface and are working to implement a fix that will prevent this from happening again.
On Saturday, June 20th, at 8:00PM, all SLURM clients on the HPCC will be updated. This includes components installed on the development and compute nodes. As a consequence of this update, any pending srun or salloc commands run interactively on the development nodes will be interrupted. Jobs queued with sbatch, and srun processes within those jobs will not be disrupted. Please contact us at https://contact.icer.msu.edu/ if you have any questions.
A recent update of the SLURM scheduler introduced a potential bug when specifying job constraints. Specifying certain constraints may yield a "Requested node configuration not available" error. If you encounter this error when submitting jobs with constraints that worked prior to Wednesday, June 10th, update your constraint to specify the 'NOAUTO' flag, e.g. 'NOAUTO:intel16' instead of 'intel16'. This will circumvent the issue while we work with our software vendor on a permanent fix. Please contact us with any questions at https://contact.icer.msu.edu/.
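As a sketch, a batch script using the workaround might contain the following (the time limit and program name are placeholders, not recommendations):

```shell
#!/bin/bash
# Workaround: prefix the constraint with NOAUTO until the permanent fix.
# Previously this line read:  #SBATCH --constraint=intel16
#SBATCH --constraint=NOAUTO:intel16
#SBATCH --time=01:00:00

srun ./my_program   # placeholder for your actual workload
```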
Update 8:47 PM: The scheduler is back online
We are currently working with our software vendor to address an issue with our job scheduler. The scheduler is currently paused. SLURM client commands will not be available and new jobs will not start until this issue is resolved.
The slowness issue is resolved.
One of the OSS servers on the ls15 (/mnt/ls15) scratch file system is slow to respond to lock requests from the MDS server. We are working on replacing the affected drive at the moment. It will take a while to complete. We will update this announcement once the system is back to normal.
Today we will begin correcting file and directory ownership on all research spaces. Please note this process will take up to several weeks to complete. We will be contacting any users with large numbers of files that may cause research directories to go over quota before we correct ownership.
4/23/20 at 8am we will be performing maintenance to correct issues that are causing slow performance on our ls15 system. The system will be slow to unresponsive during this time. Maintenance is expected to be completed in less than two hours.
Update at 3:05pm: the issue is resolved.
The ls15 scratch system (/mnt/ls15) is currently having an issue and we are working on it. We will update this announcement when it is back to normal.
We will be performing emergency maintenance at 7am on Friday, 4-3-20, on all gateways and development nodes. This will require home directories to be taken offline on those nodes and the nodes rebooted. We expect maintenance to be complete by 8am.
Update 4-2-20: After patching the system a quota check has successfully run. We believe currently that quotas reported are now correct.
Important Note: We are seeing about 50 research spaces over quota. This is likely due to previously under-reported quotas. We have checked these with du and they appear to be reporting properly. Please remember that if you are storing large numbers of small files, the reported quota will not match du due to filesystem block size limitations.
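The mismatch arises because quota accounting counts whole filesystem blocks, while du with --apparent-size counts bytes of file data; a file much smaller than one block still occupies a full block on disk. A small demonstration on any Linux filesystem:

```shell
# Create a 1-byte file; it still occupies a full filesystem block on disk.
tmpdir=$(mktemp -d)
printf 'x' > "$tmpdir/tiny"

apparent=$(du -B1 --apparent-size "$tmpdir/tiny" | cut -f1)  # bytes of data
on_disk=$(du -B1 "$tmpdir/tiny" | cut -f1)                   # bytes allocated

echo "apparent size: $apparent bytes"
echo "on disk:       $on_disk bytes"   # typically 4096 on ext4/XFS

rm -r "$tmpdir"
```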
Note: We have ensured all default quotas are now enforced on research groups. If you are having trouble with your research group please open a ticket and we will assist you.
Currently, our home file system's quota check function will sometimes cause a user's directory to have an incorrect quota. If you see this, please open a ticket and we will work with you to temporarily increase your quota. We continue to work with our vendor to correct this issue.
Update 4-1-20: We have received a patch and are testing to see if all quota issues have been resolved.
Starting this morning, we will be performing a patch upgrade to our home directory system. The patch has been provided by our vendor to correct issues with quota functionality. You may see some pauses in the system while components are restarted.
Update 4-1-20 All maintenance is complete.
As part of MSU’s response to COVID-19, ICER is transitioning to an online-only support model. All HPCC services will continue as normal.
- The contact form is the preferred way to contact us: https://contact.icer.msu.edu
- Monday and Thursday regularly scheduled open office hours will be held via Zoom and Teams. Please click here to view the online open office hours instructions.
- Also feel free to stop by "ICER-public" on Microsoft Teams with questions or to connect with peers.
- Training will be moving online; training events will be updated with webinar details.
- Further updates to our support, contact methods, and system status will be posted to our websites and to our social media accounts; email will be reserved for critical announcements and newsletters. Please click here to see how to get the latest HPCC updates.
We are currently experiencing issues with our network that are causing slow or broken connections to several of our login and transfer systems. We are looking into this issue and will provide updates as available.
Update 2/12/20 1:20pm The issue is currently resolved. We will be monitoring our network for further issues.
UPDATE (9:52 PM): The maintenance is complete and filesystems are remounted on the gateways
UPDATE: This outage is now scheduled for February 8th
On Saturday, February 8th, there will be a partial outage of HPCC storage starting at 8:00PM. This outage will begin with rolling reboots of all gateways and will interrupt storage access on the login gateways and rsync gateway only. This may cause 'cannot chdir to directory' errors when connecting to the HPCC. Users can continue to connect from gateways to development nodes to conduct their work. Development nodes and running jobs will not be affected. This outage is expected to last several hours.
As of 10am today, our gateways are no longer able to communicate with our home storage system. We are looking into the issue and will rectify it as soon as possible. Compute nodes can still mount the home system properly and jobs will continue to run.
From the week of December 9th through early January, we will be performing software upgrades on our gs18 and ufs18 storage systems, which will improve performance and reliability. During this time, users may experience periodic pauses and degraded performance. We will update this blog post if there are any specific impacts users may see and as the work proceeds. Please contact us if you have any concerns.
Update: 3:25 PM 12/20: The upgrade on 12/19 introduced two new bugs. Users may experience "Stale File Handle" messages, slow home directory or research space access, or be unable to log into a gateway or development node when this problem is occurring. The vendor is preparing fixes for us to deploy today or tomorrow, and we understand what triggers the problem and have a workaround to reduce the impact on our users. We're sorry for any impact this has on your research.