Blog

12-12-19 Home system update

Starting at 8 a.m. on 12-12-19, our vendor will be performing updates to the software on our home system. We do not anticipate any downtime associated with the upgrades.


Update 12-13-19 1pm: Updates have been completed on 4 of 6 storage nodes. We expect the remaining storage nodes to be completed by the end of the day. Updates on the protocol nodes will continue on Monday, 12-16. When all existing equipment has been upgraded, we will be adding an additional storage block.

Update 12-13-19 4pm: On Monday, 12-16-19, users will see periodic Samba home directory mount outages.

The HPCC is undergoing an upgrade of the GS18 scratch. No service interruptions are expected.


2019-12-21: All upgrades on the scratch cluster are now complete.

Today at 3:00, dev-intel16-K80 will go down for maintenance. The available GPUs on the node are not correct, and a replacement card will resolve this issue. We will have the system returned to service as soon as possible.

UPDATE:  Dev-intel16-K80 is working and available now.

We have a new AMD-based Rome server with 128 cores and 512 GB of memory available to users. It is currently accessible as eval-epyc19 via ssh from any development node. We are considering this architecture for our 2020 cluster purchase and would like your feedback on any strengths or weaknesses you notice.

We've configured 4 NUMA domains per socket (16 cores each). In early testing, a hybrid MPI-OpenMP model, with one MPI process per NUMA domain (or per L3 cache) and OpenMP threads within each process, provides excellent performance. You can see the system layout with

lstopo-no-graphics

This node is a shared resource so please be considerate of other users and aware of the impact other users may have on your benchmarking. If you would like exclusive access, please contact us so we can coordinate that.
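If you want a starting point for hybrid benchmarking, the sketch below shows one way to launch an MPI+OpenMP job across the node's eight NUMA domains. It assumes Open MPI's mpirun and an OpenMP-enabled binary; my_hybrid_app is a placeholder, and the process and thread counts should be adjusted to match what lstopo-no-graphics reports.

# Hypothetical hybrid launch on eval-epyc19: 8 MPI ranks (one per NUMA domain),
# 16 OpenMP threads per rank (one per core). Open MPI binding syntax.
export OMP_NUM_THREADS=16      # threads per MPI rank
export OMP_PLACES=cores        # pin each thread to its own core
export OMP_PROC_BIND=close     # keep a rank's threads together within its NUMA domain
mpirun -np 8 --map-by numa --bind-to numa ./my_hybrid_app

With each rank bound to a NUMA domain, its memory allocations stay local to that domain, which is what tends to make the hybrid model perform well on this layout.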

From the week of December 9th through early January, we will be performing software upgrades on our gs18 and ufs18 storage systems, which will improve performance and reliability. During this time, users may experience periodic pauses and degraded performance. We will update this blog post if there are any specific impacts users may see and as the work proceeds. Please contact us if you have any concerns.

On Thursday, December 19th, the HPCC will undergo scheduled maintenance. We will be applying GPFS software updates, adding redundancy to the InfiniBand fabric, and applying additional minor fixes and updates. We will be rebooting every compute node, so any jobs that would overlap with the outage will be held until after it ends. The entire system may be unavailable during this work. We will update this blog post as more information becomes available.
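Assuming the scheduler holds overlapping jobs with a maintenance reservation, a job can still start beforehand if its requested walltime fits entirely before the outage begins. For example, a short request like the one below (my_job.sb is a placeholder script) could run before Thursday morning, while a multi-day request submitted now would be held until after the maintenance.

# Hypothetical example: a 4-hour walltime can be scheduled before the outage begins,
# while a request longer than the time remaining before maintenance will be held.
sbatch --time=04:00:00 my_job.sb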

Reminder to users: please be vigilant about data kept on scratch. Temporary data on scratch should be cleaned up regularly to help keep the file system from becoming full.
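For example, find can list files in your scratch space that have not been accessed recently; the path and the 45-day threshold below are only illustrative, so adjust them to your own directory and the current purge policy, and review the output before removing anything.

# List files untouched for more than 45 days (path and threshold are illustrative).
find /mnt/scratch/$USER -type f -atime +45 -ls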

Please see this blog post about additional file system work happening this December.

Update (11:15 PM): The home file system is back online.


The HPCC is currently experiencing an unexpected outage of the home file system. We are working to resolve the issue.

After investigation, we have found that disk space quotas have not been enforced properly. We will be correcting this on 11-21-19. We encourage users to check their disk usage against their quota and ensure that their research space is not over quota. Based on current space usage, about 30% of research spaces will be over quota. We will also be contacting the PI of each over-quota space directly.
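As a rough check in the meantime, du can total a research space's usage; the path below is only an example, and the figure may differ slightly from what the file system's quota accounting reports.

# Roughly total a research space's disk usage (replace the path with your group's directory).
du -sh /mnt/research/my_group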

Login Issue on HPCC Nodes

      09:55 AM:  Right now, there is a problem logging into the HPCC. Please wait for further updates.

      10:00 AM:  Over the weekend, a home directory mounting problem occurred on many compute nodes. The issue has been fixed.

      10:25 AM:  Logins to the HPCC are back to normal now. However, there is still a problem logging into the dev-intel16-k80 node.

      10:55 AM:  dev-intel16-k80 can be logged into now. The issue is resolved.

On Wednesday, October 23rd, the HPCC will be updating its installation of the Singularity container software to the latest version 3.4. This update adds new features including increased support for file system overlays and increased stability for containers using MPI. If you have any questions, please contact the HPCC at contact.icer.msu.edu.
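As a rough illustration of the overlay feature, the commands below layer a writable ext3 image on top of a read-only container; my_overlay.img, my_container.sif, the target path, and my_mpi_app are placeholders, not files we provide.

# Run a command inside the container with a writable overlay on top of the read-only .sif image.
singularity exec --overlay my_overlay.img my_container.sif touch /data/new_file

# MPI programs follow the usual pattern of launching the container under mpirun.
mpirun -np 4 singularity exec my_container.sif ./my_mpi_app

The first command would fail without --overlay, since the .sif image itself is read-only.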

10/17/2019 11:33 AM:   Most of the compute nodes are working. The HPCC system is back to normal.

10/17/2019 10:03 AM:   There was a file system issue that has since been resolved. The gateway and development nodes have resumed full functionality; however, the compute nodes have not yet recovered.

10/17/2019 09:40 AM:   The HPCC is currently experiencing system issues. We are working on the problem and will update this message when we have more information. We are sorry for the inconvenience.


HPCC Staff



On Tuesday, Oct 15, we will be adding new backup hardware to our home storage cluster to replace legacy hardware. As we add the new hardware, the home directory system may be slow or pause at times while the fileset backups recover.

How buy-in accounts are configured in the scheduler is changing. Buy-in accounts are currently configured with one partition per cluster, e.g. buy-in account “FOOBAR” with nodes in both the intel16 and intel18 clusters would have a “FOOBAR-16” and a “FOOBAR-18” partition. Buy-in accounts will soon have only one partition that contains all their buy-in nodes. This change will increase overall scheduler performance and will not affect how buy-in jobs are prioritized.
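To illustrate with the hypothetical FOOBAR account above, a job script would simply name the single consolidated partition; the --constraint line is an assumption about how a specific cluster could still be requested via node features, so check sinfo for the actual feature names.

#!/bin/bash
# Hypothetical job script for the FOOBAR buy-in account after the change.
#SBATCH --account=FOOBAR
#SBATCH --partition=FOOBAR       # single buy-in partition (previously FOOBAR-16 or FOOBAR-18)
#SBATCH --constraint=intel18     # assumed node feature for targeting one cluster; verify with sinfo
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G

srun hostname                    # placeholder workload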

Rolling Reboots of Nodes

The HPCC is currently conducting rolling reboots of nodes to apply a file system update. This update will improve the overall stability of the GPFS file system. The HPCC will coordinate with buy-in node owners when rebooting buy-in hardware. These reboots will not affect running jobs; however, the overall amount of resources available to jobs will be reduced until the reboots are complete.

Update (8:48 PM): The SLURM scheduler is back online and accepting jobs.

The SLURM scheduler server will be offline intermittently for planned maintenance on Thursday, September 19th, from 8:00 PM to 9:00 PM. During this time, SLURM client commands (squeue/sbatch/salloc) will be unavailable and queued jobs will not start. Running jobs will not be affected by this outage.