At 7am on 5/26, several nodes went down, hanging our home directory system. This is a known bug in our GPFS system, and we are waiting for a patch for our nodes. At 5:30pm we were able to locate and restart the culprit node, resolving the issue.
Update 5:30 pm All research spaces have been migrated. Two active users with very large file counts are still syncing.
Update 5:30 pm We continue to migrate users; this has been slowed by issues with the home file system earlier today. Currently 5 active users and a single research space remain to be migrated.
Update 10:40pm The remaining 14 active users are still syncing. We anticipate the majority of syncs to be finished tomorrow, 5/24.
Update: 1:20pm The remaining 14 active users are currently syncing, along with 3 remaining research spaces.
Update: 9:00am An additional ~150 users have been enabled. We now have ~40 active users still resyncing. We continue to work on syncing research spaces.
Update 10:00pm Resync is ongoing; we anticipate finishing ~150 further users early tomorrow. The remaining ufs-13-a research spaces will also be available tomorrow.
Update: 7:35pm ~300 users have been re-enabled. Nearly all ufs-13-b research spaces are complete.
Update: 5:10pm ~250 users have been re-enabled.
Update: 4:00pm The sync of data is progressing. Due to the high number of changes made to user accounts and research spaces, the resync of data is taking longer than anticipated.
On Wednesday, May 22nd, we will be migrating all users and research spaces on ufs-13-a and ufs-13-b to the new home directory system. Users with home directories on ufs-13-a and ufs-13-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-13-a and ufs-13-b will be unavailable.
For users on ufs-13-a and ufs-13-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Users and jobs using research spaces on ufs-13-a and ufs-13-b will be terminated on May 22nd and may be temporarily prevented from accessing the system to ensure a clean migration.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-13-a or ufs-13-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.
We are working on an issue that has caused all of our nodes to go into an offline state and has prevented users from logging in. We will have the nodes back online as soon as possible.
Update 10:26 Nodes are back online and running jobs. Logins at this time may still fail.
Update 12:00 We continue to experience issues with mounting home directories, which is causing gateways to be unresponsive. We are working with the vendors to resolve this issue.
Update: 13:30 The vendor has worked to restore functionality of our home storage system. We continue to work with the vendor on this. At this time most home directories are available.
Update 14:45 The issue has been resolved and the vendor is working on a permanent solution for the issue.
On Wednesday, June 5th, we will be migrating all users and research spaces on ufs-12-a and ufs-12-b to the new home directory system. Users with home directories on ufs-12-a and ufs-12-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-12-a and ufs-12-b will be unavailable.
For users on ufs-12-a and ufs-12-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Users and jobs using research spaces on ufs-12-a and ufs-12-b will be terminated on June 5th and may be temporarily prevented from accessing the system to ensure a clean migration.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-12-a or ufs-12-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.
Access to the HPCC is currently unavailable for users on ufs18. The cause is a bug on the new file system resulting from contention between the compression and file replication features. We are entering the recovery process and attempting to restore access. We expect a return to service this morning.
UPDATE: 12:30 PM - Service was restored as of 10:15 AM.
We will be removing older, less secure SSH ciphers (3des-cbc, blowfish-cbc, and cast128-cbc) this afternoon. This should not affect anyone with a modern SSH client. If you have problems logging in, please try updating your SSH client; if the problem persists, please contact us: https://contact.icer.msu.edu/contact
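If you are unsure whether your client depends on the ciphers being removed, a rough local check is possible. The sketch below is illustrative only: it assumes an OpenSSH client on your PATH (it uses `ssh -Q cipher` to list supported ciphers), uses the OpenSSH spelling cast128-cbc, and the helper names are hypothetical. It does not contact the HPCC, so it cannot confirm which ciphers the servers will actually offer.

```python
import shutil
import subprocess

# Ciphers named in the announcement as being removed (OpenSSH spellings).
REMOVED_CIPHERS = {"3des-cbc", "blowfish-cbc", "cast128-cbc"}


def remaining_ciphers(supported):
    """Return the supported ciphers that are not on the removal list, in order."""
    return [cipher for cipher in supported if cipher not in REMOVED_CIPHERS]


def client_ciphers():
    """List the ciphers the local OpenSSH client supports (empty if ssh is missing)."""
    if shutil.which("ssh") is None:
        return []
    result = subprocess.run(
        ["ssh", "-Q", "cipher"], capture_output=True, text=True, check=False
    )
    return result.stdout.split()


if __name__ == "__main__":
    usable = remaining_ciphers(client_ciphers())
    if usable:
        print("Ciphers your client supports that are not being removed:")
        print(", ".join(usable))
    else:
        print("No other ciphers found; consider updating your SSH client.")
```

A client that reports any modern cipher (for example an AES-CTR or ChaCha20 variant) should be unaffected by the removal.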
May 8, 2019 (10:27am)
Scratch system ls15 is working normally and available for all nodes.
May 7, 2019
/mnt/ls15/scratch/groups/<group>) are currently unavailable on most of the nodes. If you would like to access the scratch space, please use dev-intel14-phi or the rsync gateway. We are working with the vendor on this issue. Updates will be posted.
On Wednesday, May 8, 2019 at 9pm, the server hosting the HPCC Globus endpoint will be temporarily shut down to allow a migration of the server to our new systems. This process will be disruptive to transfers, so access to the system will be halted at 9am on Wednesday, May 8 to prevent new transfers from being started prior to the outage. The anticipated duration of the outage is 20 minutes.
If you have any questions regarding the scheduled maintenance, please do not hesitate to open a ticket, or otherwise, contact us using your preferred method.
Our job scheduling software, SLURM, is currently offline. We are working with the software vendor to bring it back online. While SLURM is offline, new jobs cannot be submitted and client commands, e.g. squeue, srun, sinfo, will not function. Running jobs are not affected.
Update 2:20 P.M.: The scheduler is back online after applying a patch from the vendor.
On Tuesday, April 30 at 10:00pm gateway-01 and gateway-03 will become temporarily unavailable as we migrate these servers to a new host. A notice will be given 30 minutes prior to the shutdown of each machine, and logins will be temporarily blocked during this time. The total downtime for either gateway will be less than 10 minutes.
On Wednesday, April 17, 2019, unexpected maintenance was required following a system failure on the server that hosts wiki.hpcc.msu.edu. At approximately 3pm on Thursday, April 18, wiki.hpcc.msu.edu was brought back online; however, users may find that they are unable to log into the site. We are aware of this issue and are working to restore the site to full functionality as quickly as possible.
Apr 10, 2019 (3:40pm)
The login issue between the Globus web site and the HPCC server is resolved.
Apr 09, 2019
The Globus web site has a connection problem with the HPCC server. When users try to connect to globus-01.hpcc.msu.edu from the Globus web site, they get the error message "Activation failed: Error contacting myproxy server, Failed to receive credentials. ERROR from myproxy-server: invalid password". We are aware of this issue and our system administrators are working on it. We will update this message when it is resolved.
Overnight, the data pool for ufs-13-a experienced a disk failure, triggering a kernel bug that caused the system to go offline. This also caused ufs-13-b to become unavailable as it imported the data pool. We are working on the issue and will have both systems back online as soon as possible.
4-2-19 10:30am As of 10:30 the failed disk has been replaced and both systems are back online.
At 8am on 4-2-19 we will be rebooting our web RDP gateway for maintenance.
On Thursday, March 28th, we will be migrating all users and research spaces on ufs-11-a and ufs-11-b to the new home directory system. Users with home directories on ufs-11-a and ufs-11-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-11-a and ufs-11-b will be unavailable.
Users who are on ufs-11-a and ufs-11-b will have their queued jobs held until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Users who are held will see "AssocMaxJobsLimit" in their squeue output.
If you acknowledge that any running jobs will be canceled on the 28th, contact us to request that the hold on your account be lifted.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-11-a or ufs-11-b, let us know if this interruption will have a significant impact on your work.
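To check whether your own queued jobs are being held for this reason, you could filter squeue output for the "AssocMaxJobsLimit" reason. The following is a minimal sketch, not an official HPCC tool: the function name is hypothetical, and it assumes squeue is run with `-h -o "%i %r"` so that each line contains a job ID followed by the pending reason.

```python
import getpass
import shutil
import subprocess


def held_jobs(squeue_text, reason="AssocMaxJobsLimit"):
    """Return job IDs whose reason column matches `reason`.

    Expects lines of the form "<jobid> <reason>", as produced by
    `squeue -h -o "%i %r"` (an assumption about how squeue is invoked).
    """
    held = []
    for line in squeue_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == reason:
            held.append(parts[0])
    return held


if __name__ == "__main__":
    if shutil.which("squeue") is None:
        print("squeue not found; run this on a node with SLURM client tools.")
    else:
        output = subprocess.run(
            ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i %r"],
            capture_output=True, text=True, check=False,
        ).stdout
        jobs = held_jobs(output)
        print("Held jobs:", ", ".join(jobs) if jobs else "none")
```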
UPDATE 2:45 PM Most users on ufs-11-b and 2/3rds of the research spaces have been moved to the new file system. Most users on ufs-11-b can log in and submit jobs now. 70% of the active users on ufs-11-a have been migrated.
UPDATE 5:15 PM. Over 80% of the research spaces on ufs-11-b have been migrated and are available. We expect that most users and research spaces from ufs-11-a and ufs-11-b will be available by the end of the day. There are a small number of users and research spaces that have a large number of files that may not complete until Friday. Affected users will receive a message by the end of the day.
UPDATE 11:30 PM All but one of the research spaces on ufs-11-b have been migrated, and more than 95% of the users on ufs-11-a and ufs-11-b have been migrated. A dozen research spaces from ufs-11-a remain unavailable. Most blocks have been lifted. We expect the remaining transfers will complete overnight.
UPDATE 3/29/19 4PM We continue to finish the migration of users and research spaces from ufs-11-a to new directories, and will enable accounts as the migrations are complete.
UPDATE 3/30/19 11PM Migration of remaining users continues. We have 6 users and 1 research space remaining to be migrated at this time.
At 8am on 3-22-19 we will be taking ufs-13-a and ufs-13-b offline for approximately 1 hour to return the file system to a high availability state.
UPDATE: ufs-13-a and ufs-13-b returned to service this morning at 9 AM.
/mnt/gs18 (/mnt/scratch) is available on the rsync gateway (rsync.hpcc.msu.edu). HPCC users can now use the rsync gateway to transfer files to the scratch space.
There will be an all-day scheduled maintenance outage on March 10th, 2019, starting at 6 AM, for power work at the new data center that will accommodate future growth. In addition, updates to improve the performance and stability of the HPCC systems will be applied.
UPDATE: 7:55 PM - most work has been completed. gs18 is not currently available on the intel16 cluster, and scheduling new jobs is currently paused. We anticipate a return to service by 9 PM.
UPDATE: 8:59 PM. All systems have been returned to service. We appreciate your patience.
9:00 AM. The UFS-13-a and UFS-13-b filers are currently offline for emergency maintenance.
UPDATE: 10:20 AM. The file servers have been returned to service.
UPDATE: 4:40 PM. Issues seen earlier have recurred on ufs-13-a, and the filer is currently offline for additional maintenance.
At approximately 1PM on February 28th, we experienced an unexpected reboot of the UFS-12-a filer. Currently, both filers are unavailable while they are restored to production status.
UPDATE: 4:55 PM. Ufs-12-a and ufs-12-b were returned to service by 1:40 PM.
To help prevent /mnt/scratch from becoming completely full, a default quota of 50T has been added to /mnt/scratch for each user. Please open a ticket if additional space is needed.
At 7:00 AM on February 21, 2019, the filers ufs-13-a and ufs-13-b will undergo maintenance to return the home directories to their respective hosts. There will be a brief disruption to directories hosted on the affected filers; however, performance for these home directories will be fully restored following the maintenance. Please contact the HPCC using the ticket portal (https://contact.icer.msu.edu/contact) should you have any questions.
At 9:30 am we experienced a failover on our ufs-13 storage system due to a possible kernel bug. We will be taking both systems offline in an attempt to correct this issue. We anticipate the downtime to be approximately 2 to 3 hours.
Update 15:30 Maintenance is taking longer than expected. We will provide another update when it is complete.
Update 16:44 maintenance is complete and all directories should now be available.
The SLURM server will be going down for maintenance at 8:00PM on Sunday, February 17th. Maintenance will last about 15 minutes. During this time, no new jobs may be submitted to SLURM and client commands (squeue, sbatch, etc.) will be unavailable. Running jobs will not be affected by this downtime.
We would like to remind users that files on /mnt/scratch with a modification time over 45 days are purged nightly. If you have files that you need to preserve please consider moving them to your home directory.
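To see which of your files are at risk before a nightly purge, you could scan a directory for files whose modification time is older than 45 days. This is a minimal sketch under stated assumptions: the function name is hypothetical, the purge itself is performed by HPCC tooling whose exact rules are not described here, and this script only reports candidates without deleting or moving anything.

```python
import os
import sys
import time

PURGE_AGE_DAYS = 45  # age threshold stated in the reminder above
SECONDS_PER_DAY = 86400


def purge_candidates(root, max_age_days=PURGE_AGE_DAYS, now=None):
    """Return sorted paths under `root` whose mtime is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Pass your own scratch directory as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in purge_candidates(target):
        print(path)
```

Files listed by the script can then be copied to your home directory or another location that is not purged.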
On 02-03-19 our home file system experienced an issue causing the data store for ufs-13-b to be moved to ufs-13-a. On 2-7-19 at 8am we will be rebooting both units to restore high availability functionality. Typically this process takes about an hour.
Update: We have moved the reboot forward as the performance of our home system has suffered.
Update 2-7-19 9:40 am: Maintenance is now complete and the file systems are back online.
Security certificates for wiki.hpcc.msu.edu and rt.hpcc.msu.edu have expired. We are working on replacing the certificates. Please accept any warnings the sites present, until we are able to get the new certificates in place.
UPDATE: The new security certificates are in place now. You should not receive any warnings.
The SLURM server will be going down this Thursday at 8:00PM until approximately 9:00PM. We are taking this downtime to increase the resources allocated for the SLURM server. This will improve the overall stability of SLURM. Running jobs will not be affected by this downtime; however, no new jobs will start and SLURM client commands (srun, salloc, sbatch) will not work until the server is back online.
As of approximately 4:30pm, both filers ufs-13-a and ufs-13-b have been restored to full functionality. Users who continue to experience login troubles are encouraged to open a ticket at https://contact.icer.msu.edu/contact
At approximately 1:30pm on Thursday, January 31, the home filers ufs-13-a and ufs-13-b experienced a failure which caused the directories to become unavailable. We continue to work on restoring both of these to full functionality, and will follow this announcement with further updates as available.
On Friday, February 1, 2019, the HPCC will be performing memory upgrades to the following gateways:
The memory upgrades will be performed one at a time, and each host should take approximately five minutes. Upgrades will begin at 9:00am, and will be complete on all hosts by 9:30am.
Access to the HPCC will continue to be available during the upgrades; however, some users may notice a slight increase in login times while the hosts are rebooted.
The Rsync gateway (Globus) will be going down momentarily for a reboot tonight at 8:00PM.
We discovered that the policy removing files with an mtime over 45 days has not been working properly. We will be enabling this on Monday, February 4th. Please be aware that, starting on this date, any files not modified in 45 days will be removed on a regular basis.
UPDATE: 1/18 08:50 AM Scratch space ls15 is available on the rsync gateway, dev nodes and compute nodes. If you have any questions about accessing your scratch space, please let us know.
UPDATE: 1/17 08:50 AM Scratch space ls15 is available on the rsync gateway (rsync.hpcc.msu.edu) and all compute nodes. It is not yet available on all dev nodes. Our system administrators are working on the issue. Please wait for the next update.
UPDATE: 1/10 11:15 AM The physical move and OS update have been completed on ls15. There is an issue with the Lustre software upgrade which has delayed the return to service. gs18 remains available until the issue has been resolved.
The HPCC downtime reservation on Thursday, Jan 3rd is completed.
Scratch space /mnt/ls15 will be taken offline from Monday, Jan 7th to Thursday, Jan 10th for upgrade. If any work requires scratch space, please use the new scratch system /mnt/gs18 instead. Please check the scratch space transition site for how to move your work to /mnt/gs18.
If any of your submitted jobs might use /mnt/ls15 scratch space from Jan 7th to Jan 10th, please modify the script and resubmit it. The following jobs have been placed on hold by HPCC administrators because ls15 is offline:
Jobs on Hold:
HPCC staff will be performing a minor update to the SLURM software starting Tuesday, January 15th, and completing on Thursday, January 17th. SLURM will be available during this time; however, brief (less than a minute) interruptions in the availability of SLURM commands may occur. This update will fix minor bugs encountered by HPCC users. The update process will not impact running jobs.
The gateways have been updated to CentOS 7 to match the rest of the cluster. When accessing the cluster you may see a warning regarding the host key being incorrect; we are working on resolving the host key issue.
For more information about how to fix this, see our HPCC FAQ: known_hosts error
The HPCC has a downtime reserved for maintenance on Thursday, 01/03/2019. No jobs can run during the reservation. Any submitted job that would be running during that time is postponed until the maintenance is over. Details of the reservation can be found by running the command "scontrol show reservation" on a dev node:
Due to power work at the MSU data center, /mnt/ffs18 will be unavailable on 12-10-18 from 7am EST until 5pm EST.
Update 12-29-18 1pm: Power work is complete and ffs17 is now available.
Due to power work in the MSU data center we have seen some of our nodes go offline. Power work is estimated to complete in the next few hours. We will then bring nodes back online as we can. This is currently only affecting ~83 nodes.
Update 12-29-18 1pm: Power work at the MSU data center has been completed; nodes that were down are now back online.
12-27-18 13:40 The network issue has been resolved and dev-intel18 is online.
Due to network issues dev-intel18 is unavailable. We are working to resolve this issue and will update when it is back online.
At 7:00AM on Friday, December 14th, file systems served by the UFS-11-a and UFS-11-b filers will be unavailable while fault tolerance is restored. Restoration is estimated to take half an hour.
At 7:00AM on Thursday, December 13th, file-systems on ufs-12-a will be unavailable while server redundancy is restored. Restoration is estimated to take approximately 30 minutes.
On Friday, December 7th, we encountered an issue with the ufs-12-a and ufs-12-b filers that affected their availability between 2:30PM and 4:00PM. This issue has been resolved.
The SLURM server will be going down for maintenance from 9:00PM to 11:00PM on Sunday, November 11th. New jobs cannot be submitted during this time and all SLURM commands (sbatch/salloc/srun/squeue etc.) will be unavailable. Running jobs will not be affected by this downtime.
Update: The SLURM server is back online and accepting jobs.
On the week of October 29th, we will be moving the Intel16 cluster over to the new data center. No jobs will run on this cluster during this outage. Intel14 and Intel18 will remain available during this transition. The cluster will return to service by the end of the week. Please let us know if you have any concerns.
All Intel16 nodes have been returned to service.
Samba pushed an update that broke mapping connections to our file storage systems. This has been corrected and drives should mount properly now.
Update Nov 1, 8:00 am: most users will not be able to log-in Thursday Nov 1 during file system restoration. We will update as soon as these systems are available.
On Monday night the ufs-12-b filer experienced a kernel bug. This caused a high availability event, allowing ufs-12-a to provide the ufs-12-b data store. On 11-1-18 we will need to take both file systems offline to restore normal functionality. We anticipate approximately one hour of downtime for this file storage system.
Update: We have added ufs-13-a and ufs-13-b to this maintenance as well due to an HA failover last evening.
File systems ufs-13-a and ufs-13-b are currently unavailable due to unexpected maintenance related to the intel16 cluster move. Users with home directories on these file systems may not be able to log in or see their files, and research spaces may be unavailable. However, your files are safe and your accounts are OK. An update will be provided when they return to service.
Update: 9:40am Ufs-13-a and Ufs-13-b should now be available.
On the week of October 22nd, we will be moving the Intel14 cluster over to the new data center. No jobs will run on this cluster during this outage. Intel16 and Intel18 will remain available during this transition. The cluster will return to service by the end of the week. Please let us know if you have any concerns.
UPDATE: Please note that the ffs17 file system will not be available on intel14 cluster from the 22nd until the Intel16 migration is complete on the week of the 29th.
UPDATE: 10/25 4:58 PM The 2014 cluster has been returned to service. The /mnt/gs18 scratch file system is available now.
At 12:15 PM an Emergency Power Off was triggered by the fire system in the Engineering data center. After restoring power, one of the core network switches failed. HPCC staff are in the process of moving connections off of the failed switch and restoring the HPCC environment. The next update will be posted by 6:30 PM.
UPDATE 6:30 PM: Most of the network connections have been restored. Gateways are currently available as are the ufs-13 file servers. We are working on an issue with the home directory servers ufs-11 and ufs-12. We will continue to return the environment to service; the next update will be 7:30 PM.
UPDATE 7:40 PM: All home directory servers have been returned to service. We are testing the Infiniband fabric and compute nodes and expect that we'll be able to resume scheduling soon. The next update will be at or before 8:30 PM.
UPDATE 8:35 PM Scheduling has been resumed; most systems should have returned to service. Please let us know if you have any problems.