
We are aware that quotas are currently not working properly and some files are being counted against your quotas for more space than they actually occupy.

       Update 8/14: We continue to work with our vendor to identify and correct the issues causing incorrect quotas.

The HPCC and all systems (including storage) will be unavailable on Tuesday, August 13th for full system maintenance. We will be performing system software updates, a client storage software update, network configuration changes, a scheduler software update, and routine maintenance. We anticipate that this will be a full-day outage. We will update this blog post as the date approaches and during the outage with more information. No jobs will start that would overlap this maintenance window. Please contact us with any questions.

Update: 10:30 AM The maintenance is mostly complete. We will be restoring access to the development and gateway systems shortly. We expect a return to service by noon.

Update: 11:20 AM. The scheduler was resumed at 11:15 AM and all services should be returned to production. Please contact us with any questions.

On Friday, July 26th at 10:00 AM, dev-intel18 will be taken offline for upgrades and maintenance. This task will take approximately one hour, and all other development nodes will remain online.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.

Update: 7/11/19 8am: All active and idle users, as well as research spaces, have now been migrated.

Update: 7/10/19 3:20pm: 2 active users and 1 research space continue to migrate.

Update: 7/9/19 4:20pm: 4 active users and 1 research space continue to migrate.

Update: 7/9/19 8:20am: 8 active users and 3 research spaces continue to migrate. When this is complete, we will begin migrating ~15 inactive users.

Update: 7/8/19 10:15pm: Migration of users and research spaces continues; ~20 users and ~20 research spaces remain.

Update: 7/8/19 4pm: Migration of users has progressed steadily all day. Currently, about 40 of 250 active users are still migrating, as are 30 of 200 research spaces. We will continue to migrate users and research spaces this evening and anticipate that most user migrations will be complete by end of day.


UPDATE: Due to the issue with ufs18 on 6/8 and 6/9 we have moved the ufs-12 migration to July 8th. Please let us know if you would like to be moved sooner.

UPDATE: We have moved the ufs-12 migration from June 5th to June 10th.

On Wednesday, June 10th, we will be migrating all users and research spaces on ufs-12-a and ufs-12-b to the new home directory system. Users with home directories on ufs-12-a and ufs-12-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-12-a and ufs-12-b will be unavailable.

For users on ufs-12-a and ufs-12-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.

Users and jobs using research spaces on ufs-12-a and ufs-12-b will be terminated on June 10th and may be temporarily prevented from accessing the system to ensure a clean migration.

Users should see significantly better performance and stability once the migration is complete. If you are on ufs-12-a or ufs-12-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.

See https://wiki.hpcc.msu.edu/x/loZaAQ for more information. If you have any questions, please contact us.


Update 6:30pm  The home filesystem is active and should be available on all systems. Scheduling has been resumed.

As of 3:30pm, the ufs18 file system is gradually coming back online. It is mounted on the gateways, the dev nodes (except dev-intel16), and some compute nodes. Users will need to wait until it is fully restored across the HPCC.

As of 11 am EDT the home file system is currently unavailable.  We are working with the vendor to correct the issue.

Starting at 8am tomorrow we will be performing several small filesystem updates with our vendor to improve file system stability on our new home system.  We do not anticipate significant impact to users. Users may see short pauses during this time. We anticipate all updates to be completed by end of day.  



Starting at 8PM on Thursday, June 27th, we will be performing rolling reboots of all HPCC gateways, including the Remote Desktop Gateway and Globus.

On Tuesday, June 25th at 10:00 PM, the server which hosts our shared software libraries and modules will be taken offline to be moved to our new system. This task will take less than 30 minutes; however, software and modules will be inaccessible during this time. Jobs will be paused during the move and will resume as soon as the server is back online. This maintenance should not cause any job failures, but please contact us if you experience otherwise.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.

6-11-19 Ufs18 slowness

At 13:00 this afternoon the ufs18 system suffered an issue causing backup nodes to go offline.  We are looking into the issue with the vendor.  While the backup system recovers you may see intermittent slowness on ufs18.

6/8/2019 - ufs18 home directories are currently offline and inaccessible due to issues with the directory exports. We are working to resolve the issue as quickly as possible, and will provide an update to this announcement with further details.


6/8/2019 - 11:00 AM We have determined the scope of the home directory issues and are currently working with the vendor to resolve the underlying problem.


6/8/2019 - 1:15 PM IBM is reviewing the diagnostic information. Next update is expected at 2 PM.


6/8/2019 - 4:25 PM We continue to work with IBM to diagnose the underlying problem. Another update is expected by 6 PM.


6/8/2019 - 9:05 PM The recovery process has started. Next update should be within 1 hour.


6/8/2019 - 10:00 PM The recovery process continues at this time. Another update is expected by 11:00 PM


6/8/2019 - 11:19 PM The recovery process is proceeding successfully at this time. As of now, the recovery is likely to continue until at least the evening on Sunday, 6/9, and we will have additional updates in the early afternoon.

6/9/2019- 12 PM. The file system has recovered and we are working to restore access to all nodes. We expect that we will resume scheduling jobs this afternoon.

6/9/2019 - 4:15 PM We were able to restore access to all file systems, but the system crashed again once job scheduling was restarted. We are continuing to work with IBM.

6/9/2019 - 5:55 PM We have restored access to home on every system except the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 7:00 PM We have restored access to home on the gateways. We are waiting for IBM to identify the cause of the 3:15 PM crash before resuming the job scheduler.

6/9/2019 - 9:00 PM We have fixed the corrupted entry that was causing the GPFS cluster to crash, and restarted job scheduling on the cluster. Please contact us if you notice any issues.


Event summary:

At 10:15 AM on 6/8, a file record on the ufs18 home system became corrupted. This corrupted the cluster-wide file system replay logs, which took the entire home file system offline. On Saturday afternoon, IBM and HPCC staff ran into issues when trying to run the diagnostic commands required to remove the corrupted logs; all GPFS clients had to be stopped, which also unmounted gs18. By Saturday evening, the log cleanup command had been run and a full diagnostic scan was started. Early Sunday morning, the full diagnostic scan crashed because of a lack of available memory on the node acting as primary controller for the scan. On Sunday morning, a command was run to remove the failed file record, and work began to remount the file system and restore access and job scheduling. On Sunday afternoon at 3:15 PM, after scheduling was restored, an access to that file caused the file system to crash again. We removed the logs. At 7:30, IBM confirmed the cause as the same corrupted file record and provided another method to remove the bad file.

We are continuing to work with IBM to identify why the file record became corrupt, why the maintenance command had difficulty running on Saturday, and why the first command didn't remove the failed record on Sunday.

At 8am on 6/11/19 we will be moving the storage pool for ufs-12-b to its standard configuration after a failover today. Home directories on both file systems will be unavailable for about an hour.




At 9am this morning we will be moving the storage pool for ufs-12-b to its standard configuration after a failover last night. Home directories on both file systems will be unavailable for about an hour.


Update: 9:25am maintenance is now complete



Rolling reboots of HPCC gateway machines will begin on Thursday, June 6th, at 8:00PM. This includes Gateway-01, Gateway-02, Gateway-03, Globus, and the remote desktop gateway. These reboots are to reconfigure how the gateways access the new GPFS storage system and improve the stability of that connection.

The SLURM scheduler server will be going offline on Tuesday, June 11th, at 8:15PM to be migrated to new hardware. During this time, SLURM client commands (e.g. srun, squeue) will be unavailable, new jobs cannot be submitted, and currently queued jobs will not start. This outage will take approximately 30 minutes.

The HPCC gateway machine, Gateway-00, will be going offline briefly for a reboot on Tuesday, June 4th, at 8:00 AM. This reboot is part of re-configuring how the gateway connects to the new GPFS storage system and is expected to improve the stability and performance of that connection.

At 8:00 PM on Saturday, June 1st, the UFS-12-a file server will go offline for a reboot to address file-system performance issues.

Update 8:51 PM: Bringing UFS-12-a back online necessitated a reboot of UFS-12-b. Both file servers are now back online.

Update: following successful maintenance, rdpgw-01 is now back online and has been restored to full functionality.

At 9:00pm on Thursday, May 30, the rdpgw-01 server will be taken offline temporarily for scheduled maintenance. This process is expected to take no longer than 30 minutes, after which the server will be restored to full functionality. If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method. 

A reboot of these development nodes will happen at 2:00 this afternoon, Tuesday, May 28, to clear up stale mounts.



At 7am on 5/26, several nodes went down, hanging our home directory system. This is a known bug in our GPFS system and we are waiting for a patch for our nodes. At 5:30pm we were able to locate and restart the culprit node and resolve the issue.

05/25 

Update 5:30 pm All research spaces have been migrated. Two active users with very large file counts are still syncing.

05/24

Update 5:30 pm We continue to migrate users; this has been slowed due to issues with the home file system earlier today. Currently, 5 active users and a single research space remain to be migrated.

05/23

Update 10:40pm The remaining 14 active users are still syncing; we anticipate the majority of syncs will be finished tomorrow, 5/24.

Update:  1:20pm The remaining 14 active users are currently syncing, along with 3 remaining research spaces.  

Update:  9:00am  An additional ~150 users have been enabled.  We now have ~40 active users still resyncing.  We continue to work on syncing research spaces.

05/22

Update 10:00pm Resync is ongoing; we anticipate finishing ~150 further users early tomorrow. The remaining ufs-13-a research spaces will also be available tomorrow.

Update:  7:35pm ~300 users have been re-enabled. Nearly all ufs-13-b research spaces are complete.

Update:  5:10pm ~250 users have been re-enabled.

Update:  4:00pm The sync of data is progressing. Due to the high number of changes made to user accounts and research spaces, the resync of data is taking longer than anticipated.



On Wednesday, May 22nd, we will be migrating all users and research spaces on ufs-13-a and ufs-13-b to the new home directory system. Users with home directories on ufs-13-a and ufs-13-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-13-a and ufs-13-b will be unavailable.

For users on ufs-13-a and ufs-13-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.

Users and jobs using research spaces on ufs-13-a and ufs-13-b will be terminated on May 22nd and may be temporarily prevented from accessing the system to ensure a clean migration.

Users should see significantly better performance and stability once the migration is complete. If you are on ufs-13-a or ufs-13-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.

See https://wiki.hpcc.msu.edu/x/loZaAQ for more information. If you have any questions, please contact us.







Nodes offline 5-24 (Resolved)

We are working on an issue that has caused all of our nodes to go into an offline state and is preventing users from logging in. We will have the nodes back online as soon as possible.


Update 10:26  Nodes are back online and running jobs.  Logins at this time may still fail.


Update 12:00 We continue to experience issues with mounting home directories, which is causing gateways to be unresponsive.   We are working with the vendors to resolve this issue.  


Update: 13:30 The vendor has worked to restore functionality of our home storage system.  We continue to work with the vendor on this. At this time most home directories are available.


Update 14:45 The issue has been resolved and the vendor is working on a permanent solution for the issue.

Access to the HPCC is currently unavailable for users on ufs18 due to a bug on the new file system that is the result of contention between the compression and file replication features. We are entering the recovery process and attempting to restore access. We expect a return to service this morning.

UPDATE: 12:30 PM- Service was restored as of 10:15 AM.

SSH Login Cipher Removal

We will be removing older, less secure SSH ciphers (3des-cbc, blowfish-cbc, and cast128-cbc) this afternoon. This should not affect anyone with a modern SSH client. If you have problems logging in, please try updating your SSH client; if the problem persists, please contact us: https://contact.icer.msu.edu/contact
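
If you are unsure what your client offers, the following is a minimal sketch (assuming a local OpenSSH client that supports the "ssh -Q cipher" query) that lists the ciphers your client supports and flags the removed legacy ones:

    # Minimal sketch: list the ciphers the local OpenSSH client supports and
    # warn if only the legacy ciphers being removed are available.
    # Assumes an OpenSSH client on your workstation with the -Q query option.
    import subprocess

    REMOVED = {"3des-cbc", "blowfish-cbc", "cast128-cbc"}

    out = subprocess.run(["ssh", "-Q", "cipher"],
                         capture_output=True, text=True, check=True)
    supported = set(out.stdout.split())

    print("Ciphers offered by your client:", ", ".join(sorted(supported)))
    if supported <= REMOVED:
        print("Only legacy ciphers found; please update your SSH client.")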


ls15 scratch issue

May 8, 2019 (10:27am)

      Scratch system ls15 is working normally and available for all nodes.

May 7, 2019

      ls15 (/mnt/ls15/scratch/users/<account> and /mnt/ls15/scratch/groups/<group>) is currently unavailable on most of the nodes. If you need to access the scratch space, please use dev-intel14-phi or the rsync gateway. We are working with the vendor on this issue. Updates will be posted.

On Wednesday, May 8, 2019 at 9pm, the server hosting the HPCC Globus endpoint will be temporarily shut down to allow a migration of the server to our new systems. This process will be disruptive to transfers, so access to the system will be halted at 9am on Wednesday, May 8 to prevent new transfers from being started prior to the outage. The anticipated duration of the outage is 20 minutes.

If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method. 

Our job scheduling software, SLURM, is currently offline. We are working with the software vendor to bring it back online. While SLURM is offline, new jobs cannot be submitted and client commands, e.g. squeue, srun, sinfo, will not function. Running jobs are not affected.

Update 2:20 P.M.: The scheduler is back online after applying a patch from the vendor


On Tuesday, April 30 at 10:00pm gateway-01 and gateway-03 will become temporarily unavailable as we migrate these servers to a new host. A notice will be given 30 minutes prior to the shutdown of each machine, and logins will be temporarily blocked during this time. The total downtime for either gateway will be less than 10 minutes.

On Wednesday, April 17, 2019, unexpected maintenance was required following a system failure on the server which hosts wiki.hpcc.msu.edu. At approximately 3pm on Thursday, April 18, wiki.hpcc.msu.edu was brought back online; however, users may find that they are unable to log into the site. We are aware of this issue and are working to restore the site to full functionality as quickly as possible.

Apr 10. 2019 (3:40pm)

      The login issue from the Globus web site to the HPCC server is resolved.

Apr 09, 2019

      The Globus web site has a connection problem with the HPCC server. When users try to connect to globus-01.hpcc.msu.edu from the Globus web site, they get the error message "Activation failed: Error contacting myproxy server, Failed to receive credentials. ERROR from myproxy-server: invalid password". We are aware of this issue and our system administrators are working on it. We will update this message when it is resolved.

Overnight, the data pool for ufs-13-a experienced a disk failure that triggered a kernel bug, causing the system to go offline. This also caused ufs-13-b to become unavailable as it imported the data pool. We are working on the issue and will have both systems back online as soon as possible.


4-2-19 10:30am As of 10:30 the failed disk has been replaced and both systems are back online.


At 8am on 4-2-19 we will be rebooting our web RDP gateway for maintenance.

On Thursday, March 28th, we will be migrating all users and research spaces on ufs-11-a and ufs-11-b to the new home directory system. Users with home directories on ufs-11-a and ufs-11-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-11-a and ufs-11-b will be unavailable.

Users who are on ufs-11-a and ufs-11-b will have their queued jobs held until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.

Users who are held will see "AssocMaxJobsLimit" in their squeue output.
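
To check whether your queued jobs are being held for this reason, one option is a quick query of the scheduler. The sketch below is a minimal example (the username is a placeholder) that prints the job ID, state, and reason fields from squeue and flags jobs held by the migration:

    # Minimal sketch: list your queued jobs with their state and hold reason,
    # and flag any held for the migration ("AssocMaxJobsLimit").
    # Replace "youruser" with your HPCC username.
    import subprocess

    user = "youruser"  # placeholder username
    fmt = "%i %T %r"   # job ID, state, reason
    out = subprocess.run(["squeue", "-u", user, "-o", fmt],
                         capture_output=True, text=True, check=True)

    for line in out.stdout.splitlines()[1:]:  # skip the header row
        if "AssocMaxJobsLimit" in line:
            print("Held for migration:", line)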

If you acknowledge that any running jobs will be canceled on the 28th, contact us to request that the hold on your account be lifted.

Users should see significantly better performance and stability once the migration is complete. If you are on ufs-11-a or ufs-11-b, let us know if this interruption will have a significant impact on your work.

See https://wiki.hpcc.msu.edu/x/loZaAQ for more information. If you have any questions, please contact us.

UPDATE 2:45 PM Most users on ufs-11-b and 2/3rds of the research spaces have been moved to the new file system. Most users on ufs-11-b can log in and submit jobs now. 70% of the active users on ufs-11-a have been migrated.

UPDATE 5:15 PM. Over 80% of the research spaces on ufs-11-b have been migrated and are available. We expect that most users and research spaces from ufs-11-a and ufs-11-b will be available by the end of the day. There are a small number of users and research spaces that have a large number of files that may not complete until Friday. Affected users will receive a message by the end of the day.

UPDATE 11:30 PM All but one of the research spaces on ufs-11-b have been migrated, and more than 95% of the users on ufs-11-a and ufs-11-b have been migrated. A dozen research spaces from ufs-11-a remain unavailable. Most blocks have been lifted. We expect the remaining transfers will complete overnight.

UPDATE 3/29/19 4PM We continue to finish the migration of users and research spaces from ufs-11-a to new directories, and will enable accounts as the migrations are complete. 

UPDATE 3/30/19 11PM Migration of remaining users continues.  We have 6 users and 1 research space remaining to be migrated at this time.

At 8am 3-22-19 we will be taking ufs-13-a and ufs-13-b offline for approximately 1 hour to return the file system to its high-availability state.

UPDATE: ufs-13-a and ufs-13-b returned to service this morning at 9 AM.

      /mnt/gs18 (/mnt/scratch) is available on the rsync gateway (rsync.hpcc.msu.edu). HPCC users can now use the rsync gateway to transfer files to the scratch space.
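
As one example of such a transfer, the sketch below pushes a local directory to the gs18 scratch space through the rsync gateway; the username, local directory, and destination path under /mnt/gs18 are placeholders, so check your actual scratch path on the HPCC before using it:

    # Minimal sketch: push a local directory to gs18 scratch through the
    # rsync gateway. Username, source directory, and destination path are
    # placeholders, not confirmed paths.
    import subprocess

    user = "youruser"     # placeholder HPCC username
    src = "results/"      # placeholder local directory
    dst = f"{user}@rsync.hpcc.msu.edu:/mnt/gs18/{user}/"  # placeholder path

    subprocess.run(["rsync", "-av", src, dst], check=True)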

There will be an all-day scheduled maintenance outage on March 10th, 2019, starting at 6 AM, to allow power work at the new data center that will accommodate future growth. In addition, updates to improve the performance and stability of the HPCC systems will be applied.

UPDATE: 7:55 PM - most work has been completed. gs18 is not currently available on the intel16 cluster, and scheduling new jobs is currently paused. We anticipate a return to service by 9 PM.

UPDATE: 8:59 PM. All systems have been returned to service. We appreciate your patience.

9:00 AM. The UFS-13-a and UFS-13-b filers are currently offline for emergency maintenance.

UPDATE: 10:20 AM. The file servers have been returned to service.

UPDATE: 4:40 PM. Issues seen earlier have recurred on ufs-13-a, and the filer is currently offline for additional maintenance.

At approximately 1PM on February 28th, we experienced an unexpected reboot of the UFS-12-a filer. Currently, both filers are unavailable while they are restored to production status.

UPDATE: 4:55 PM. Ufs-12-a and ufs-12-b were returned to service by 1:40 PM.

/mnt/scratch Quota

To help prevent /mnt/scratch from becoming completely full, a default quota of 50T has been added to /mnt/scratch for each user. Please open a ticket if additional space is needed.
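
To get a rough idea of how much of the 50T default you are using, a simple walk over your scratch directory works. The sketch below is a minimal example; the path is a placeholder for your own scratch directory, and it sums apparent file sizes, which may differ slightly from how the quota is accounted:

    # Minimal sketch: total the apparent size of files under a scratch
    # directory and compare against the 50T default quota (taken here as
    # 1024**4 bytes per T). The path is a placeholder.
    import os

    root = "/mnt/scratch/youruser"   # placeholder path
    quota_bytes = 50 * 1024**4       # 50T

    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # skip files that vanish mid-walk

    print(f"Using {total / 1024**4:.2f} T of 50 T "
          f"({100 * total / quota_bytes:.1f}%)")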


At 7:00 AM on February 21, 2019, the filers ufs-13-a and ufs-13-b will undergo maintenance to return the home directories to their respective hosts. There will be a brief disruption to directories hosted on the affected filers; however, performance for these home directories will be fully restored following the maintenance. Please contact the HPCC using the ticket portal (https://contact.icer.msu.edu/contact) should you have any questions.


At 9:30 am we experienced a failover on our ufs-13 storage system due to a possible kernel bug. We will be taking both systems offline in an attempt to correct this issue. We anticipate the downtime to be approximately 2 to 3 hours.


Update 15:30 Maintenance is taking longer than expected; we will provide another update when complete.

Update 16:44 Maintenance is complete and all directories should now be available.

The SLURM server will be going down for maintenance at 8:00PM on Sunday, February 17th. Maintenance will last about 15 minutes. During this time, no new jobs may be submitted to SLURM and client commands (squeue, sbatch, etc.) will be unavailable. Running jobs will not be affected by this downtime.

We would like to remind users that files on /mnt/scratch with a modification time over 45 days are purged nightly. If you have files that you need to preserve, please consider moving them to your home directory.
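
If you want to see which of your scratch files are at risk before a purge, something like the sketch below can help; it is a minimal example and the scratch path is a placeholder for your own directory:

    # Minimal sketch: list files under a scratch directory whose modification
    # time is older than 45 days and are therefore candidates for the nightly
    # purge. The path is a placeholder.
    import os
    import time

    root = "/mnt/scratch/youruser"          # placeholder path
    cutoff = time.time() - 45 * 24 * 3600   # 45 days ago

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    print(path)
            except OSError:
                pass  # file may have been removed while scanning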

On 02-03-19 our home file system experienced an issue causing the data store for ufs-13-b to be moved to ufs-13-a.  On 2-7-19 at 8am we will be rebooting both units to restore high availability functionality.  Typically this process takes about an hour.


Update: We have moved the reboot forward as the performance of our home system has suffered.  


Update 2-7-19 9:40 am: Maintenance is now complete and the file systems are back online.


Security certificates for wiki.hpcc.msu.edu and rt.hpcc.msu.edu have expired. We are working on replacing the certificates. Please accept any warnings the sites present until we are able to get the new certificates in place.

UPDATE:  The new security certificates are in place now.  You should not receive any warnings.

The SLURM server will be going down this Thursday at 8:00PM until approximately 9:00PM. We are taking this downtime to increase the resources allocated for the SLURM server. This will improve the overall stability of SLURM. Running jobs will not be affected by this downtime, however, no new jobs will start and SLURM client commands (srun, salloc, sbatch) will not work until the server is back online.


As of approximately 4:30pm, both filers ufs-13-a and ufs-13-b have been restored to full functionality. Users who continue to experience login troubles are encouraged to open a ticket at https://contact.icer.msu.edu/contact

At approximately 1:30pm on Thursday, January 31, the home filers ufs-13-a and ufs-13-b experienced a failure which caused the directories to become unavailable. We continue to work on restoring both of these to full functionality, and will follow this announcement with further updates as available.

On Friday, February 1, 2019, the HPCC will be performing memory upgrades to the following gateways:


gateway-00

gateway-02

gateway-03


The memory upgrades will be performed one at a time, and each host should take approximately five minutes. Upgrades will begin at 9:00am, and will be complete on all hosts by 9:30am.


Access to the HPCC will continue to be available during the upgrades, however, it is possible that some users may notice a slight increase in login times while the hosts are rebooted. 

The rsync gateway (Globus) will be going down briefly for a reboot tonight at 8:00 PM.

We discovered that the policy removing files with an mtime over 45 days has not been working properly. We will be enabling it on Monday, February 4th. Please be aware that any files not modified in 45 days will be removed, with regular purges starting on this date.

UPDATE: 1/18 08:50 AM Scratch space ls15 is available on the rsync gateway, dev nodes, and compute nodes. If you have any questions about accessing your scratch space, please let us know.

UPDATE: 1/17 08:50 AM Scratch space ls15 is available on the rsync gateway (rsync.hpcc.msu.edu) and all compute nodes. It is not yet available on all dev nodes. Our system administrators are working on the issue. Please wait for the next update.

UPDATE: 1/10 11:15 AM The physical move and OS update has been completed on ls15. There is an issue with the Lustre software upgrade which has delayed the return to service. gs18 remains available until the issue has been resolved.


The HPCC downtime reservation on Monday, Jan 3rd is completed.

Scratch space /mnt/ls15 will be taken offline from Monday, Jan 7th to Thursday, Jan 10th for an upgrade. If any work requires scratch space, please use the new scratch system /mnt/gs18 instead. Please check the scratch space transition site for instructions on how to move your work to /mnt/gs18.

If any of your submitted jobs may use the /mnt/ls15 scratch space from Jan 7th to Jan 10th, please modify the script and resubmit it (see the sketch after the job list below for one way to check your scripts). The following jobs have been placed on hold by HPCC administrators due to the ls15 outage:

Jobs on Hold:

5777369,5777370,5777373,5777374,5861807,5862078,5863186,5871676,5871678,5871680,5871683,
5871684,5871685,5871688,5871693,5871695,5871696,5871697,5871698,5871699,5871700,5888175,
5888238,5888387,5889726,5889967,5890135,5897035,5911266,5926822,5929189,5929196,5929244,
5929648,5929652,5929660,5929662
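
One way to find which of your submission scripts reference the ls15 scratch space, so they can be updated to /mnt/gs18 and resubmitted, is a simple text scan. The sketch below is a minimal example; the script directory is a placeholder for wherever you keep your job scripts:

    # Minimal sketch: scan a directory of job submission scripts for
    # references to /mnt/ls15 so they can be updated and resubmitted.
    # The directory is a placeholder.
    import os

    script_dir = os.path.expanduser("~/jobs")   # placeholder directory

    for dirpath, _dirnames, filenames in os.walk(script_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    if "/mnt/ls15" in fh.read():
                        print("References ls15:", path)
            except OSError:
                pass  # skip unreadable files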




