Blog

The HPCC and all systems (including storage) will be unavailable on Tuesday, August 13th for a full system maintenance. We will be performing system software updates, a client storage software update, network configuration changes, a scheduler software update, and routine maintenance. We anticipate that this will be a full-day outage. We will be updating this blog post as the date approaches and during the outage with more information. No jobs will start that would overlap this maintenance window. Please contact us with any questions.

Update: 10:30 AM. The maintenance is mostly complete. We will be restoring access to the development and gateway systems shortly. We expect a return to service by noon.

Update: 11:20 AM. The scheduler was resumed at 11:15 AM and all services should be returned to production. Please contact us with any questions.

On Friday, July 26th at 10:00 AM, dev-intel18 will be taken offline for upgrades and maintenance. This task will take approximately one hour, and all other development nodes will remain online.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket or contact us using your preferred method.

We are aware that quotas are not currently working properly and that some files are being counted against your quota for more space than they actually occupy.

Update 8/14: We continue to work with our vendor to diagnose and correct the issues causing the incorrect quotas.

Update: 6:30 PM. The home file system is active and should be available on all systems. Scheduling has been resumed.

At 3:30 PM, the ufs18 file system is gradually coming back online. It is mounted on the gateways, the development nodes (except dev-intel16), and some compute nodes. Users will still need to wait until it is fully available across the HPCC.

As of 11 AM EDT, the home file system is unavailable. We are working with the vendor to correct the issue.

Starting at 8 AM tomorrow, we will be performing several small file system updates with our vendor to improve file system stability on our new home system. We do not anticipate significant impact to users, though some may see short pauses during this time. We anticipate all updates to be completed by the end of the day.



Starting at 8:00 PM on Thursday, June 27th, we will be performing rolling reboots of all HPCC gateways, including the Remote Desktop Gateway and Globus.

On Tuesday, June 25th at 10:00 PM, the server which hosts our shared software libraries and modules will be taken offline to be moved to our new system. This task will take less than 30 minutes; however, software and modules will be inaccessible during this time. Jobs will be paused during the move and will resume as soon as the server is back online. This maintenance should not cause any job failures; however, please contact us if you experience otherwise.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket or contact us using your preferred method.

6/11/2019 - ufs18 slowness

At 13:00 this afternoon, the ufs18 system suffered an issue that caused backup nodes to go offline. We are looking into the issue with the vendor. While the backup system recovers, you may see intermittent slowness on ufs18.

At 8 AM on 6/11/19, we will be moving the storage pool for ufs-12-b to the standard configuration after a failover today. Home directories on both file systems will be unavailable for about an hour.



6/8/2019 - ufs18 home directories are currently offline and inaccessible due to issues with the directory exports. We are working to resolve the issue as quickly as possible, and will provide an update to this announcement with further details.


6/8/2019 - 11:00 AM We have determined the scope of the home directory issues and are currently working with the vendor to resolve the underlying problem.


6/8/2019 - 1:15 PM IBM is reviewing the diagnostic information. Next update is expected at 2 PM.


6/8/2019 - 4:25 PM We continue to work with IBM to diagnose the underlying problem. Another update is expected by 6 PM.


6/8/2019 - 9:05 PM The recovery process has started. Next update should be within 1 hour.


6/8/2019 - 10:00 PM The recovery process continues at this time. Another update is expected by 11:00 PM.


6/8/2019 - 11:19 PM The recovery process is proceeding successfully at this time. As of now, the recovery is likely to continue until at least the evening on Sunday, 6/9, and we will have additional updates in the early afternoon.

6/9/2019 - 12 PM The file system has recovered and we are working to restore access to all nodes. We expect that we will resume scheduling jobs this afternoon.

6/9/2019 - 4:15 PM We were able to restore access to all file systems, but the system crashed again once job scheduling was restarted. We are continuing to work with IBM.

6/9/2019 - 5:55 PM We have restored access to home on every system except the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 7:00 PM We have restored access to home on the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 9:00 PM We have fixed the corrupted entry that was causing the GPFS cluster to crash, and restarted job scheduling on the cluster. Please contact us if you notice any issues.


Event summary:

At 10:15 AM on 6/8, a file record on the ufs18 home system was corrupted. This caused the cluster-wide file system replay logs to become corrupted, which took the entire home file system offline. On Saturday afternoon, IBM and HPCC staff ran into issues when trying to run the diagnostic commands required to remove the corrupted logs; all GPFS clients had to be stopped, which also unmounted gs18. By Saturday evening, the log cleanup command had been run and a full diagnostic scan was started. Early Sunday morning, the full diagnostic scan crashed because of a lack of available memory on the primary controller node for the scan. On Sunday morning, a command was run to remove the failed file record, and work began to remount the file system and restore access and job scheduling. On Sunday afternoon at 3:15 PM, when scheduling was restored, access to that file caused the file system to crash again, and we removed the logs again. At 7:30 PM, IBM confirmed the cause as the same corrupted file record and provided another method to remove the bad file.

We are continuing to work with IBM to identify why the file record became corrupt, why the maintenance command had difficulty running on Saturday, and why the first command didn't remove the failed record on Sunday.


At 9 AM this morning, we will be moving the storage pool for ufs-12-b to the standard configuration after a failover last night. Home directories on both file systems will be unavailable for about an hour.


Update: 9:25 AM. Maintenance is now complete.



Rolling reboots of HPCC gateway machines will begin on Thursday, June 6th, at 8:00 PM. This includes Gateway-01, Gateway-02, Gateway-03, Globus, and the remote desktop gateway. These reboots are to reconfigure how the gateways access the new GPFS storage system and improve the stability of that connection.

The SLURM scheduler server will be going offline on Tuesday, June 11th, at 8:15 PM to be migrated to new hardware. During this time, SLURM client commands (e.g. srun, squeue) will be unavailable, new jobs cannot be submitted, and currently queued jobs will not start. This outage will take approximately 30 minutes.

The HPCC gateway machine, Gateway-00, will be going offline briefly for a reboot on Tuesday, June 4th, at 8:00 AM. This reboot is part of re-configuring how the gateway connects to the new GPFS storage system and is expected to improve the stability and performance of that connection.

At 8:00 PM on Saturday, June 1st, the UFS-12-a file server will go offline for a reboot to address file-system performance issues.

Update 8:51 PM: Bringing UFS-12-a back online necessitated a reboot of UFS-12-b. Both file servers are now back online.

Update: following successful maintenance, rdpgw-01 is now back online and has been restored to full functionality.

At 9:00 PM on Thursday, May 30, the rdpgw-01 server will be taken offline temporarily for scheduled maintenance. This process is expected to take no longer than 30 minutes, after which the server will be restored to full functionality. If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket or contact us using your preferred method.