We are currently experiencing network issues that are causing slow or broken connections to several of our login and transfer systems. We are investigating and will provide updates as they become available.
Update 2/12/20 1:20pm: The issue is now resolved. We will continue monitoring our network for further issues.
Currently, our home file system's check-quota function will sometimes cause a user's directory to have an incorrect quota. If you see this, please open a ticket and we will work with you to temporarily increase your quota. We continue to work with our vendor to correct this issue.
UPDATE (9:52 PM): The maintenance is complete and filesystems are remounted on the gateways.
UPDATE: This outage is now scheduled for February 8th.
On Saturday, February 8th, there will be a partial outage of HPCC storage starting at 8:00 PM. This outage will begin with rolling reboots of all gateways and will interrupt storage access on the login gateways and the rsync gateway only. This may cause 'cannot chdir to directory' errors when connecting to the HPCC. Users can continue to connect from the gateways to the development nodes to conduct their work. Development nodes and running jobs will not be affected. This outage is expected to last several hours.
As of 10am today, our gateways are no longer able to communicate with our home storage system. We are looking into the issue and will rectify it as soon as possible. Compute nodes can still mount the home system properly, and jobs will continue to run normally.
Starting at 8am on 12-12-19, our vendor will be performing software updates on our home system. We do not anticipate any downtime associated with the upgrades.
Update 12-13-19 1pm: Updates have been completed on 4 of 6 storage nodes. We anticipate the remaining storage nodes will be complete by the end of the day. Updates on the protocol nodes will continue on Monday, 12-16. When all existing equipment has been upgraded, we will add an additional storage block.
Update 12-13-19 4pm: On Monday, 12-16-19, users will see periodic outages of Samba home directory mounts.
Update 12-17-19 8am: Upgrade work on our existing home storage system is complete. We will be adding additional storage to the system on 12-17 and 12-18. During our 12-19 outage, all compute node clients will receive software updates to match the storage cluster. During the outage we will also replace our AFM backup nodes with new hardware for better backups and overall system stability.
The HPCC is undergoing an upgrade of the GS18 scratch. No service interruptions are expected.
2019-12-21: All upgrades on the scratch cluster are now complete.
Today at 3:00, dev-intel16-K80 will go down for maintenance. The node is reporting an incorrect number of available GPUs, and a replacement card will resolve this issue. We will have the system returned to service as soon as possible.
UPDATE: Dev-intel16-K80 is working and available now.
We have a new AMD-based Rome server with 128 cores and 512 GB of memory available to users. It is currently accessible as eval-epyc19 via ssh from any development node. We are considering this architecture for our 2020 cluster purchase and would like your feedback on any strengths or weaknesses you notice.
We've configured 4 NUMA domains per socket (16 cores each). In early testing, a hybrid MPI-OpenMP model that runs one MPI process per NUMA domain (or per L3 cache), with OpenMP threads within each process, provides excellent performance. You can see the system layout with standard NUMA inspection tools.
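As a sketch, the NUMA layout can be inspected with common Linux tools (this assumes `util-linux` and `numactl` are installed; the commented launch line uses Open MPI's mapping flags, and the rank/thread counts are illustrative for 8 NUMA domains of 16 cores):

```shell
# Show NUMA node count and which CPUs belong to each node
lscpu | grep -i 'numa'

# Show per-node memory sizes and inter-node distances, if numactl is available
command -v numactl >/dev/null && numactl --hardware || true

# Hypothetical hybrid launch: one MPI rank per NUMA domain, OpenMP threads within
# (Open MPI syntax; adjust counts to your application)
# export OMP_NUM_THREADS=16
# mpirun -n 8 --map-by numa --bind-to numa ./my_hybrid_app
```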
This node is a shared resource so please be considerate of other users and aware of the impact other users may have on your benchmarking. If you would like exclusive access, please contact us so we can coordinate that.
From the week of December 9th through early January, we will be performing software upgrades on our gs18 and ufs18 storage systems, which will improve performance and reliability. During this time, users may experience periodic pauses and degraded performance. We will update this blog post if there are any specific impacts users may see and as the work proceeds. Please contact us if you have any concerns.
Update: 3:25 PM 12/20: The upgrade on 12/19 introduced two new bugs. Users may experience "Stale File Handle" messages, slow home directory or research space access, or an inability to log into a gateway or dev node while this problem is occurring. The vendor is preparing fixes for us to deploy today or tomorrow, and we understand what triggers the problem and have a workaround to reduce the impact on our users. We're sorry for any impact this has on your research.
On Thursday, December 19th, the HPCC will undergo scheduled maintenance. We will be applying GPFS software updates, adding redundancy to the InfiniBand fabric, and making additional minor fixes and updates. We will be rebooting every compute node, so any jobs that would overlap with the outage will be held until after it completes. The entire system may be unavailable during this work. We will update this blog post as more information becomes available.
Reminder to users: please be vigilant about data kept on scratch. Scratch is for temporary data only; please remove files you no longer need to help prevent the file system from becoming full.
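As a concrete sketch of cleaning up stale scratch data (the `$SCRATCH` variable, the `$HOME` fallback, and the 45-day threshold are assumptions; check your site's actual scratch path and purge policy):

```shell
# Use $SCRATCH if the site defines it; fall back to $HOME here only so the
# example is runnable as-is (adjust to your actual scratch path)
SCRATCH_DIR="${SCRATCH:-$HOME}"

# Dry run: list files not accessed in more than 45 days (assumed threshold)
find "$SCRATCH_DIR" -type f -atime +45 -print

# After reviewing the list, delete deliberately:
# find "$SCRATCH_DIR" -type f -atime +45 -delete
```

Always run the listing step first; `-delete` is irreversible.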
Please see this blog post about additional file system work happening this December.
UPDATE: 6:20 PM The system maintenance has been completed.
Update (11:15 PM): The home file system is back online.
The HPCC is currently experiencing an unexpected outage of the home file system. We are currently working to resolve the issue.
After investigation, we have found that quota enforcement for disk space usage has not been working properly. We will be correcting this on 11-21-19. We encourage users to check their disk usage against their quota and ensure that their research space is not over quota. Based on current space usage, about 30% of research spaces will be over quota. We will also be contacting the PI of each over-quota space directly.
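A quick way to check usage is with generic Linux tools (the HPCC may provide its own quota command; the commented `mmlsquota` line is the GPFS vendor tool and the flags shown are illustrative):

```shell
# Total size of your home directory
du -sh "$HOME"

# Free space on the filesystem holding your home directory
df -h "$HOME"

# On GPFS systems, the vendor tool reports per-user/fileset quotas, e.g.:
# mmlsquota -u $USER
```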
09:55 AM: There is currently a problem logging into the HPCC. Please wait for further updates.
10:00 AM: Over the weekend, a home directory mounting problem affected many compute nodes. The issue has been fixed.
10:25 AM: Logins to the HPCC are back to normal now. However, there is still a problem logging into the dev-intel16-k80 node.
10:55 AM: dev-intel16-k80 can be logged into now. The issue is resolved.
On Wednesday, October 23rd, the HPCC will be updating its installation of the Singularity container software to the latest version 3.4. This update adds new features including increased support for file system overlays and increased stability for containers using MPI. If you have any questions, please contact the HPCC at contact.icer.msu.edu.
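A minimal sketch of checking the installed version and the patterns the new release supports (the image and application names are hypothetical; the `--overlay` flag and `mpirun ... singularity exec` pattern follow the Singularity 3.x documentation):

```shell
# Report the installed Singularity version (guarded in case it is not on PATH)
command -v singularity >/dev/null && singularity --version || echo "singularity not installed"

# Run a command inside a container with a writable filesystem overlay:
# singularity exec --overlay my_overlay.img my_container.sif ./my_app

# Launch an MPI application inside containers, one container per rank:
# mpirun -n 4 singularity exec my_container.sif ./my_mpi_app
```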
10/17/2019 11:33 AM: Most of the compute nodes are working, and the HPCC system is back to normal.
10/17/2019 10:03 AM: There was a filesystem issue that has now been resolved. The gateway and development nodes have resumed full functionality; however, the compute nodes have not yet recovered.
10/17/2019 09:40 AM: The HPCC is currently experiencing system issues. We are working on the problem and will update this message when we have more information. We are sorry for the inconvenience.