
      The slowness issue is resolved.

     One of the OSS servers for the ls15 (/mnt/ls15) scratch file system is slow to respond to lock requests from the MDS server. We are replacing the affected drive at the moment; this will take a while to complete. We will update this announcement once the system is back to normal.

Today we will begin correcting file and directory ownership on all research spaces.  Please note this process will take up to several weeks to complete.  Before correcting ownership, we will be contacting any users with large numbers of files whose ownership changes may push research directories over quota.

LS15 Maintenance

4/23/20 at 8am we will be performing maintenance to correct issues that are causing slow performance on our ls15 system.  The system will be slow or unresponsive during this time.  Maintenance is expected to be complete in less than two hours.

       Update at 3:05pm: the issue is resolved.

       The ls15 scratch system (/mnt/ls15) is currently having an issue and we are working on it. We will update this information when it is back to normal.

We will be performing emergency maintenance at 7am on Friday 4-3-20 on all gateways and development nodes.  This will require home directories to be taken offline on those nodes and the nodes to be rebooted.  We expect maintenance to be complete by 8am.

Update 4-2-20:  After patching the system, a quota check has run successfully.  We believe the reported quotas are now correct.

Important Note:  We are seeing about 50 research spaces over quota.  This is likely due to previously underreported quotas.  We have checked these with du and they appear to be reporting properly.  Please remember that if you are storing large numbers of small files, the reported quota will not match du due to file system block size limitations.
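
To see the block-size effect yourself, you can compare allocated blocks against logical file sizes with du; a minimal sketch, assuming a research space path of /mnt/research/mygroup (the path is a placeholder):

    du -sh /mnt/research/mygroup                   # usage in allocated blocks (what quota counts)
    du -sh --apparent-size /mnt/research/mygroup   # logical file sizes

On a file system with a large minimum block size, the first number can be much larger than the second when a space holds many small files.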

Note:  We have ensured all default quotas are now enforced on research groups.  If you are having trouble with your research group's quota, please open a ticket and we will assist you.


      Currently, our home file system's quota check function will sometimes cause a user's directory to have an incorrect quota. If you see this, please open a ticket and we will work with you to temporarily increase your quota. We continue to work with our vendor to correct this issue.


Update 4-1-20:  We have received a patch and are testing to see if all quota issues have been resolved.   

Starting this morning, we will be performing a patch upgrade to our home directory system.  The patch has been provided by our vendor to correct issues with quota functionality.  You may see some pauses in the system while components are restarted.


Update 4-1-20: All maintenance is complete.

As part of MSU’s response to COVID-19, ICER is transitioning to an online-only support model. All HPCC services will continue as normal.


We are currently experiencing issues with our network that are causing slow or broken connections to several of our login and transfer systems.  We are looking into this issue and will provide updates as available.


Update 2/12/20 1:20pm: The issue is now resolved.  We will be monitoring our network for further issues.


UPDATE (9:52 PM): The maintenance is complete and filesystems are remounted on the gateways.

UPDATE: This outage is now scheduled for February 8th

On Saturday, February 8th, there will be a partial outage of HPCC storage starting at 8:00PM. This outage will begin with rolling reboots of all gateways and will interrupt storage access on the login gateways and rsync gateway only. This may cause 'cannot chdir to directory' errors when connecting to the HPCC. Users can continue to connect from gateways to development nodes to conduct their work. Development nodes and running jobs will not be affected. This outage is expected to last several hours.
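
For example, after connecting to a gateway you can hop to a development node and keep working; dev-intel18 below is just one of the development nodes mentioned elsewhere on this blog:

    # From a gateway prompt, continue on to a development node
    ssh dev-intel18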

HPCC login issues 12-24-19

At 10am today, our gateways became unable to communicate with our home storage system.  We are looking into the issue and will rectify it as soon as possible.  Compute nodes can still mount the home system properly, and jobs will continue to run properly.



From the week of December 9th through early January, we will be performing software upgrades on our gs18 and ufs18 storage systems, which will improve performance and reliability. During this time, users may experience periodic pauses and degraded performance. We will update this blog post if there are any specific impacts users may see and as the work proceeds. Please contact us if you have any concerns.

Update: 3:25 PM 12/20: The upgrade on 12/19 introduced two new bugs. While this problem is occurring, users may experience "Stale File Handle" messages, slow home directory or research space access, or an inability to log into a gateway or dev node. The vendor is preparing fixes for us to deploy today or tomorrow, and we understand what triggers the problem and have a workaround to reduce the impact on our users. We're sorry for any impact that this has on your research.

On Thursday, December 19th, the HPCC will undergo scheduled maintenance. We will be applying GPFS software updates, adding redundancy to the InfiniBand fabric, and making additional minor fixes and updates. We will be rebooting every compute node, so any jobs that would overlap will be held until after the outage. The entire system may be unavailable during this work. We will update this blog post as more information becomes available.

Reminder to users: please be vigilant about data kept on scratch. Scratch is for temporary data; please clean up files you no longer need to help keep the file system from becoming full.
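
One way to find candidates for cleanup is to list scratch files that have not been accessed in a while; a minimal sketch, assuming your scratch tree lives at /mnt/ls15/scratch/$USER (the layout and the 45-day threshold are examples, not policy):

    # List files under your scratch tree not accessed in 45+ days
    find /mnt/ls15/scratch/$USER -type f -atime +45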

Please see this blog post about additional file system work happening this December.

UPDATE: 6:20 PM The system maintenance has been completed.

Starting at 8am on 12-12-19 our vendor is performing updates to the software on our home system.  We do not anticipate any downtime associated with the upgrades.


Update 12-13-19 1pm: Updates have been completed on 4 of 6 storage nodes.  We anticipate the remaining storage nodes will be complete by end of day.  Updates on the protocol nodes will continue on Monday 12-16.  When all existing equipment is upgraded, we will be adding an additional storage block.

Update 12-13-19 4pm: On Monday 12-16-19, users may see periodic Samba home directory mount outages.

Update 12-17-19 8am:  Upgrade work on our existing home storage system is complete.  We will be adding additional storage to the system on 12-17 and 12-18.  During our 12-19 outage, all compute node clients will receive software updates to match the storage cluster.  During the outage we will also be replacing our AFM backup nodes with new hardware for better backups and overall system stability.

The HPCC is undergoing an upgrade of the GS18 scratch. No service interruptions are expected.


2019-12-21: All upgrades on the scratch cluster are now complete.

Today at 3:00 dev-intel16-K80 will go down for maintenance.  The node is not reporting its available GPUs correctly, and a replacement card will resolve this issue.  We will have the system returned to service as soon as possible.

UPDATE:  Dev-intel16-K80 is working and available now.

We have a new AMD-based Rome server with 128 cores and 512 GB of memory available to users. It is currently accessible as eval-epyc19 via ssh from any development node. We are considering this architecture for our 2020 cluster purchase and would like your feedback on any strengths or weaknesses you notice.

We've configured 4 NUMA clusters per socket (16 cores each). In early testing, a hybrid MPI-OpenMP model that runs one process per NUMA domain or per L3 cache, using OpenMP threads within each process and MPI between processes, provides excellent performance. You can see the system layout with

lstopo-no-graphics
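
As a concrete starting point for the hybrid model described above, here is a minimal launch sketch using Open MPI syntax; ./my_app is a placeholder for your own MPI+OpenMP binary, and 8 ranks x 16 threads matches the 8 NUMA domains on this node:

    # One MPI rank per NUMA domain, 16 OpenMP threads per rank
    export OMP_NUM_THREADS=16
    export OMP_PROC_BIND=close
    export OMP_PLACES=cores
    mpirun -np 8 --map-by numa --bind-to numa ./my_app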

This node is a shared resource so please be considerate of other users and aware of the impact other users may have on your benchmarking. If you would like exclusive access, please contact us so we can coordinate that.

Update (11:15 PM): The home file system is back online.


The HPCC is currently experiencing an unexpected outage of the home file system. We are currently working to resolve the issue.

After investigation, we have found that quota enforcement for disk space usage has not been working properly.  We will be correcting this on 11-21-19.  We encourage users to check disk usage against quota and ensure that your research space is not over quota.  Based on current space usage, about 30% of research spaces will be over quota.  We will also be contacting the PI of each over-quota space directly.

Login Issue on HPCC Nodes

      09:55 AM:  There is currently a problem logging into the HPCC. Please wait for further updates.

      10:00 AM:  Over the weekend, a home directory mounting problem occurred on many compute nodes. The issue is fixed.

      10:25 AM:  Logins to the HPCC are back to normal now. However, there is still a problem logging into the dev-intel16-k80 node.

      10:55 AM:  dev-intel16-k80 can be logged into now. The issue is resolved.

On Wednesday, October 23rd, the HPCC will be updating its installation of the Singularity container software to the latest version 3.4. This update adds new features including increased support for file system overlays and increased stability for containers using MPI. If you have any questions, please contact the HPCC at contact.icer.msu.edu.
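
As a rough illustration of the overlay feature, the sketch below layers a writable overlay image onto a read-only container; my_container.sif and overlay.img are placeholder names, and the overlay image is assumed to have been created beforehand:

    # Open a shell in the container with a writable overlay layered on top
    singularity shell --overlay overlay.img my_container.sif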

10/17/2019 11:33 AM:   Most of the compute nodes are working. The HPCC system is back to normal.

10/17/2019 10:03 AM:   There was a filesystem issue that has been resolved.  The gateway and development nodes have resumed full functionality; however, compute nodes have not yet recovered.

10/17/2019 09:40 AM:   The HPCC is currently experiencing system issues.  We are working on the problem and will update this message when we have more information. We are sorry for the inconvenience.


HPCC Staff



On Tuesday, Oct 15, we will be adding new backup hardware to our home storage cluster to replace legacy hardware.  As we add the new hardware, the home directory system may be slow or pause at times while the fileset backups recover.

How buy-in accounts are configured in the scheduler is changing. Buy-in accounts are currently configured with one partition per cluster, e.g. buy-in account “FOOBAR” with nodes in both the intel16 and intel18 clusters would have a “FOOBAR-16” and a “FOOBAR-18” partition. Buy-in accounts will soon have only one partition that contains all their buy-in nodes. This change will increase overall scheduler performance and will not affect how buy-in jobs are prioritized.
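
In job scripts, the only visible difference should be the partition name; a sketch using the FOOBAR example above (names are illustrative):

    # Before the change: per-cluster buy-in partitions
    #SBATCH --partition=FOOBAR-16
    # After the change: a single partition covering all of the account's buy-in nodes
    #SBATCH --partition=FOOBAR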

Rolling Reboots of Nodes

The HPCC is currently conducting rolling reboots of nodes to apply a file system update. This update will improve the overall stability of the GPFS file system. The HPCC will coordinate with buy-in node owners when rebooting buy-in hardware. These reboots will not affect running jobs; however, the overall amount of resources available to jobs will be reduced until the reboots are complete.

Update (8:48PM): The SLURM scheduler is back online and accepting jobs.

The SLURM scheduler server will be offline intermittently for planned maintenance on Thursday, September 19th, from 8:00 PM to 9:00 PM. During this time, SLURM client commands (squeue/sbatch/salloc) will be unavailable and queued jobs will not start. Running jobs will not be affected by this outage.

Home directory quota issues

We are aware that some users' quotas do not match what the du command displays. We have worked extensively with the vendor on this issue.  There are two root causes:

1) The quota check process would not complete properly.  On 8/20 we were able to perform a complete quota check, which corrected many user quotas.  We are still working with the vendor to ensure this check can run successfully on a regular basis.

2) The new file system has a minimum file block size of 64K, which means that files between 2K and 64K in size each occupy 64K of space.  This greatly inflates disk usage for users with large numbers of small files; for example, 10,000 4K files would consume about 640M of quota rather than 40M.  We are working on a solution for this issue.

     One suggested solution is to pack small files into a single tar archive, reducing many small files to one larger file (see the sketch below).

     Users whose quota is at 1T and who have many small files can request a temporarily larger quota.
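
A minimal sketch of the tar approach, assuming a directory of small files named my_small_files in your home directory (the names are placeholders):

    cd $HOME
    tar -czf small_files.tar.gz my_small_files/
    # Verify the archive is readable before removing the originals
    tar -tzf small_files.tar.gz > /dev/null && rm -rf my_small_files/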

If you have questions or need assistance, please let us know.


At 8am EDT 9-4-19, we will be performing a RAM upgrade on two of the controllers for our new home directory storage system.  We will need to move the system controller between nodes, which may cause several minutes of degraded performance.  We do not expect any significant downtime associated with this upgrade.


Unexpected SLURM Outage

Update (12:15PM): The SLURM server is back online.

The SLURM server is currently offline. Client commands are not available, e.g. srun/salloc/squeue. New jobs cannot be submitted. We are working with our software vendor to find a solution.

We are currently having an issue with our virtual machine stack that is causing logins to fail and other systems to not work properly.  Our scheduler is currently paused and will resume as soon as the issue is corrected.

Update: 9:20 AM The issue has been resolved. Please let us know if you see any other issues.

The HPCC and all systems (including storage) will be unavailable on Tuesday, August 13th for a full system maintenance. We will be performing system software updates, client storage software update, network configuration changes, a scheduler software update, and routine maintenance. We anticipate that this will be a full-day outage. We will be updating this blog post as the date approaches and during the outage with more information. No jobs will start that would overlap this maintenance window. Please contact us with any questions.

Update: 10:30 AM The maintenance is mostly complete. We will be restoring access to the development and gateway systems shortly. We expect a return to service by noon.

Update: 11:20 AM. The scheduler was resumed at 11:15 AM and all services should be returned to production. Please contact us with any questions.

On Friday, July 26th at 10:00 AM, dev-intel18 will be taken offline for upgrades and maintenance. This task will take approximately one hour, and all other development nodes will remain online.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.

Update: 7/11/19 8am: All active and idle users, as well as research spaces, have now been migrated.

Update: 7/10/19 3:20pm: 2 active users and 1 research space continue to migrate.

Update: 7/9/19 4:20pm: 4 active users and 1 research space continue to migrate.

Update: 7/9/19 8:20am: 8 active users and 3 research spaces continue to migrate.  When this is complete, we will be migrating ~15 inactive users.

Update: 7/8/19 10:15pm: Migration of users and research spaces continues; ~20 users and ~20 research spaces remain.

Update: 7/8/19 4pm: Migration of users has progressed steadily all day.  Of 250 active users, about 40 are still migrating; of 200 research spaces, 30 are still migrating.  We will continue to migrate users and research spaces this evening and anticipate most user migrations will be complete by end of day.


UPDATE: Due to the issue with ufs18 on 6/8 and 6/9 we have moved the ufs-12 migration to July 8th. Please let us know if you would like to be moved sooner.

UPDATE: We have moved the ufs-12 migration from June 5th to June 10th.

On Monday, June 10th, we will be migrating all users and research spaces on ufs-12-a and ufs-12-b to the new home directory system. Users with home directories on ufs-12-a and ufs-12-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-12-a and ufs-12-b will be unavailable.

For users on ufs-12-a and ufs-12-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.

Users and jobs using research spaces on ufs-12-a and ufs-12-b will be terminated on June 10th and may be temporarily prevented from accessing the system to ensure a clean migration.

Users should see significantly better performance and stability once the migration is complete. If you are on ufs-12-a or ufs-12-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.

See https://wiki.hpcc.msu.edu/x/loZaAQ for more information. If you have any questions, please contact us.


Update 6:30pm  The home filesystem is active and should be available on all systems. Scheduling has been resumed.

As of 3:30pm, the ufs18 file system is gradually coming back online. It is mounted on the gateways, dev nodes (except dev-intel16), and some compute nodes. Users may still need to wait until it is fully back across the HPCC system.

As of 11 am EDT the home file system is currently unavailable.  We are working with the vendor to correct the issue.

Starting at 8am tomorrow we will be performing several small filesystem updates with our vendor to improve file system stability on our new home system.  We do not anticipate significant impact to users. Users may see short pauses during this time. We anticipate all updates to be completed by end of day.  



Starting at 8PM on Thursday, June 27th, we will be performing rolling reboots of all HPCC gateways, including the Remote Desktop Gateway and Globus.

On Tuesday, June 25th at 10:00 PM, the server which hosts our shared software libraries and modules will be taken offline to be moved to our new system. This task will take less than 30 minutes; however, users will notice that software and modules are inaccessible during this time. Jobs will be paused during the move and will resume as soon as the server is back online. There should be no job failures caused by this maintenance, but please do contact us if you experience otherwise.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.

6-11-19 ufs18 slowness

At 13:00 this afternoon the ufs18 system suffered an issue causing backup nodes to go offline.  We are looking into the issue with the vendor.  While the backup system recovers you may see intermittent slowness on ufs18.

6/8/2019 - ufs18 home directories are currently offline and inaccessible due to issues with the directory exports. We are working to resolve the issue as quickly as possible, and will provide an update to this announcement with further details.


6/8/2019 - 11:00 AM We have determined the scope of the home directory issues and are currently working with the vendor to resolve the underlying problem.


6/8/2019 - 1:15 PM IBM is reviewing the diagnostic information. Next update is expected at 2 PM.


6/8/2019 - 4:25 PM We continue to work with IBM to diagnose the underlying problem. Another update is expected by 6 PM.


6/8/2019 - 9:05 PM The recovery process has started. Next update should be within 1 hour.


6/8/2019 - 10:00 PM The recovery process continues at this time. Another update is expected by 11:00 PM


6/8/2019 - 11:19 PM The recovery process is proceeding successfully at this time. As of now, the recovery is likely to continue until at least the evening of Sunday, 6/9, and we will have additional updates in the early afternoon.

6/9/2019- 12 PM. The file system has recovered and we are working to restore access to all nodes. We expect that we will resume scheduling jobs this afternoon.

6/9/2019 - 4:15 PM We were able to restore access to all file systems, but the system crashed again once job scheduling was restarted. We are continuing to work with IBM.

6/9/2019 - 5:55 PM We have restored access to home on every system except the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 7:00 PM We have restored access to home on the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 9:00 PM We have fixed the corrupted entry that was causing the GPFS cluster to crash, and restarted job scheduling on the cluster. Please contact us if you notice any issues.


Event summary:

At 10:15 AM on 6/8, a file record on the ufs18 home system was corrupted. This corrupted the cluster-wide file system replay logs, which took the entire home file system offline. On Saturday afternoon, IBM and HPCC staff ran into issues when trying to run the diagnostic commands required to remove the corrupted logs; all GPFS clients had to be stopped, which also unmounted gs18. By Saturday evening, the log cleanup command had been run and a full diagnostic scan was started. In the early hours of Sunday morning, the full diagnostic scan crashed due to a lack of available memory on the primary controller node for the scan. On Sunday morning, a command was run to remove the failed file record, and work began to remount the file system and restore access and job scheduling. On Sunday afternoon at 3:15 PM, after scheduling was restored, an access to that file caused the file system to crash again. We removed the logs, and at 7:30 IBM confirmed the cause was the same corrupted file record and provided another method to remove the bad file.

We are continuing to work with IBM to identify why the file record became corrupt, why the maintenance command had difficulty running on Saturday, and why the first command didn't remove the failed record on Sunday.

At 8am 6/11/19 we will be moving the storage pool for ufs-12-b to the standard configuration following a failover.  Home directories on both file systems will be unavailable for about an hour.




At 9am this morning we will be moving the storage pool for ufs-12-b to the standard configuration after a failover last night.  Home directories on both file systems will be unavailable for about an hour.


Update: 9:25am maintenance is now complete



Rolling reboots of HPCC gateway machines will begin on Thursday, June 6th, at 8:00PM. This includes Gateway-01, Gateway-02, Gateway-03, Globus, and the remote desktop gateway. These reboots are to reconfigure how the gateways access the new GPFS storage system and to improve the stability of that connection.

The SLURM scheduler server will be going offline on Tuesday, June 11th, at 8:15PM to be migrated to new hardware. During this time, SLURM client commands (e.g. srun, squeue) will be unavailable, new jobs cannot be submitted, and currently queued jobs will not start. This outage will take approximately 30 minutes.

The HPCC gateway machine, Gateway-00, will be going offline briefly for a reboot on Tuesday, June 4th, at 8:00 AM. This reboot is part of re-configuring how the gateway connects to the new GPFS storage system and is expected to improve the stability and performance of that connection.

At 8:00 PM on Saturday, June 1st, the UFS-12-a file server will go offline for a reboot to address file-system performance issues.

Update 8:51 PM: Bringing UFS-12-a back online necessitated a reboot of UFS-12-b. Both file servers are now back online.

Update: following successful maintenance, rdpgw-01 is now back online and has been restored to full functionality.

At 9:00pm on Thursday, May 30, the rdpgw-01 server will be taken offline temporarily for scheduled maintenance. This process is expected to take no longer than 30 minutes, after which the server will be restored to full functionality. If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method. 

Reboot of these development nodes will happen at 2:00 this afternoon, Tuesday May 28.  This is to clear up stale mounts.



At 7am 5/26, several nodes went down, hanging our home directory system.  This is a known bug in our GPFS system, and we are waiting for a patch for our nodes.  At 5:30pm we were able to locate and restart the culprit node and resolve the issue.

05/25 

Update 5:30 pm All research spaces have been migrated.  Two active users with very large file counts are still syncing.

05/24

Update 5:30 pm We continue to migrate users; this has been slowed by issues with the home file system earlier today. Currently 5 active users and a single research space remain to be migrated.

05/23

Update 10:40pm The remaining 14 active users are still syncing; we anticipate the majority of syncs will finish tomorrow, 5/24.

Update:  1:20pm The remaining 14 active users are currently syncing, along with 3 remaining research spaces.  

Update:  9:00am  An additional ~150 users have been enabled.  We now have ~40 active users still resyncing.  We continue to work on syncing research spaces.

05/22

Update 10:00pm  Resync is ongoing; we anticipate finishing ~150 more users early tomorrow. The remaining ufs-13-a research spaces will also be available tomorrow.

Update:  7:35pm ~300 users have been re-enabled. Nearly all ufs-13-b research spaces are complete.

Update:  5:10pm  ~250 users have been re-enabled.

Update:  4:00pm The sync of data is progressing. Due to the high volume of changes made to user accounts and research spaces, the resync of data is taking longer than anticipated.



On Wednesday, May 22nd, we will be migrating all users and research spaces on ufs-13-a and ufs-13-b to the new home directory system. Users with home directories on ufs-13-a and ufs-13-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-13-a and ufs-13-b will be unavailable.

For users on ufs-13-a and ufs-13-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.

Users and jobs using research spaces on ufs-13-a and ufs-13-b will be terminated on May 22nd and may be temporarily prevented from accessing the system to ensure a clean migration.

Users should see significantly better performance and stability once the migration is complete. If you are on ufs-13-a or ufs-13-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.

See https://wiki.hpcc.msu.edu/x/loZaAQ for more information. If you have any questions, please contact us.







Nodes offline 5-24 (Resolved)

We are working on an issue that has caused all of our nodes to go into an offline state and is preventing users from logging in.  We will have the nodes back online as soon as possible.


Update 10:26: Nodes are back online and running jobs.  Logins at this time may still fail.


Update 12:00 We continue to experience issues with mounting home directories, which is causing gateways to be unresponsive.   We are working with the vendors to resolve this issue.  


Update: 13:30 The vendor has worked to restore functionality of our home storage system.  We continue to work with the vendor on this. At this time most home directories are available.


Update 14:45 The issue has been resolved and the vendor is working on a permanent solution for the issue.

Access to the HPCC is currently unavailable for users on ufs18, due to a bug on the new file system that is the result of contention between the compression and file replication features. We are entering the recovery process and attempting to restore access. We expect a return to service this morning.

UPDATE: 12:30 PM- Service was restored as of 10:15 AM.


