Blog

On Wednesday, October 23rd, the HPCC will be updating its installation of the Singularity container software to the latest version 3.4. This update adds new features including increased support for file system overlays and increased stability for containers using MPI. If you have any questions, please contact the HPCC at contact.icer.msu.edu.
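
For users interested in the overlay support, a minimal sketch of attaching an overlay at run time is shown below (the overlay, image, and script file names here are placeholders, not files provided by the HPCC):

     # run a command inside a container with an overlay image attached
     singularity exec --overlay my_overlay.img my_container.sif ./my_analysis.sh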

10/17/2019 11:33 AM:   Most of the compute nodes are working. The HPCC system is back to normal.

10/17/2019 10:03 AM:   A filesystem issue occurred and has been resolved.  The gateway and development nodes have resumed full functionality. However, compute nodes have not yet recovered.

10/17/2019 09:40 AM:   The HPCC is currently experiencing system issues.  We are working on the problem and will update this message when we have more information. We apologize for the inconvenience.


HPCC Staff



On Tuesday, October 15, we will be adding new backup hardware to our home storage cluster to replace legacy hardware.  While the new hardware is added, the home directory system may be slow or pause at times as the fileset backups recover.

How buy-in accounts are configured in the scheduler is changing. Buy-in accounts are currently configured with one partition per cluster, e.g. buy-in account “FOOBAR” with nodes in both the intel16 and intel18 clusters would have a “FOOBAR-16” and a “FOOBAR-18” partition. Buy-in accounts will soon have only one partition that contains all their buy-in nodes. This change will increase overall scheduler performance and will not affect how buy-in jobs are prioritized.
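
A minimal before-and-after sketch of this change is below (the "FOOBAR" names follow the example above and are illustrative; the exact partition name for your account will be confirmed by the HPCC):

     # current: job scripts select a cluster-specific buy-in partition
     #SBATCH --partition=FOOBAR-18
     # after the change: job scripts select the account's single buy-in partition
     #SBATCH --partition=FOOBAR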

Rolling Reboots of Nodes

The HPCC is currently conducting rolling reboots of nodes to apply a file system update. This update will improve the overall stability of the GPFS file system. The HPCC will coordinate with buy-in node owners when rebooting buy-in hardware. These reboots will not affect running jobs; however, the overall amount of resources available to jobs will be reduced until the reboots are complete.

Update (8:48 PM): The SLURM scheduler is back online and accepting jobs.

The SLURM scheduler server will be offline intermittently for planned maintenance on Thursday, September 19th, from 8:00 PM to 9:00 PM. During this time, SLURM client commands (squeue/sbatch/salloc) will be unavailable and queued jobs will not start. Running jobs will not be affected by this outage.


At 8 AM EDT on 9/4/19 we will be performing a RAM upgrade on two of the controllers for our new home directory storage system.  We will need to move the system controller between nodes, which may cause several minutes of degraded performance.  We do not expect any significant downtime associated with this upgrade.


Unexpected SLURM Outage

Update (12:15PM): The SLURM server is back online.

The SLURM server is currently offline. Client commands are not available, e.g. srun/salloc/squeue. New jobs cannot be submitted. We are working with our software vendor to find a solution.

We are currently having an issue with our virtual machine stack that is causing logins to fail and other systems to malfunction.  Our scheduler is currently paused and will resume as soon as the issue is corrected.

Update (9:20 AM): The issue has been resolved. Please let us know if you see any other issues.

The HPCC and all systems (including storage) will be unavailable on Tuesday, August 13th for full system maintenance. We will be performing system software updates, a client storage software update, network configuration changes, a scheduler software update, and routine maintenance. We anticipate that this will be a full-day outage. We will be updating this blog post as the date approaches and during the outage with more information. No jobs will start that would overlap this maintenance window. Please contact us with any questions.

Update (10:30 AM): The maintenance is mostly complete. We will be restoring access to the development and gateway systems shortly. We expect a return to service by noon.

Update (11:20 AM): The scheduler was resumed at 11:15 AM and all services should be back in production. Please contact us with any questions.

On Friday, July 26th at 10:00 AM, dev-intel18 will be taken offline for upgrades and maintenance. This task will take approximately one hour, and all other development nodes will remain online.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.

Home directory quota issues

We are aware that some users' quotas do not match what the du command displays. We have worked extensively with the vendor on this issue.  There are two root causes.

1) The quota check process would not complete properly.  On 8/20 we were able to perform a complete quota check, which has corrected many users' quotas.  We are still working with the vendor to ensure this check can run successfully on a regular basis.

2) The new file system has a minimum file block size of 64K.  This means that files between 2K and 64K in size will each occupy 64K of space.  For example, 10,000 files of 4K each will be allocated 64K apiece, roughly 625M in total, even though they contain only about 39M of data.  This greatly inflates disk usage for users with large numbers of small files.  We are working on a solution for this issue.

     One suggested workaround is to pack small files into a single tar archive, which combines many small files into one larger file (a brief sketch follows below).

     Users whose quota is at 1T and who have many small files may request a temporarily larger quota.
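
As a rough sketch of the tar suggestion above (the directory and archive names are placeholders), many small files can be combined into one archive, with the originals removed only after the archive has been verified:

     # bundle a directory of small files into one compressed archive
     tar -czf small_files.tar.gz small_files/
     # list the archive contents to verify it before removing the originals
     tar -tzf small_files.tar.gz
     rm -r small_files/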

If you have questions or need assistance, please let us know.

Update (6:30 PM): The home filesystem is active and should be available on all systems. Scheduling has resumed.

As of 3:30 PM, the ufs18 file system is gradually coming back online. It is mounted on the gateway, the development nodes (except dev-intel16), and some compute nodes. Users will need to wait until it is fully available across all HPCC systems.

As of 11 AM EDT, the home file system is unavailable.  We are working with the vendor to correct the issue.

Starting at 8 AM tomorrow, we will be performing several small filesystem updates with our vendor to improve file system stability on our new home system.  We do not anticipate a significant impact to users, though users may see short pauses during this time. We anticipate all updates will be completed by the end of the day.



Starting at 8PM on Thursday, June 27th, we will be performing rolling reboots of all HPCC gateways, including the Remote Desktop Gateway and Globus.

On Tuesday, June 25th at 10:00 PM, the server which hosts our shared software libraries and modules will be taken offline to be moved to our new system. This task will take less than 30 minutes; however, users will notice that software and modules are inaccessible during this time. Jobs will be paused during the move and will resume as soon as the server is back online. This maintenance should not cause any job failures; however, please do contact us if you experience otherwise.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.