At 7am on 5/26, several nodes went down, hanging our home directory system. This is a known bug in our GPFS system, and we are waiting for a patch for our nodes. At 5:30pm we were able to locate and restart the culprit node and resolve the issue.
We are working on an issue that has caused all of our nodes to go into an offline state and is preventing users from logging in. We will have the nodes back online as soon as possible.
Update 10:26 Nodes are back online and running jobs. Logins may still fail at this time.
Update 12:00 We continue to experience issues with mounting home directories, which is causing gateways to be unresponsive. We are working with the vendors to resolve this issue.
Update: 13:30 The vendor has restored functionality to our home storage system, and we continue to work with them on a full resolution. At this time, most home directories are available.
Update 14:45 The issue has been resolved, and the vendor is working on a permanent fix.
Access to the HPCC is currently unavailable for users on ufs18. A bug in the new file system, caused by contention between the compression and file replication features, is responsible. We are entering the recovery process and attempting to restore access. We expect a return to service this morning.
UPDATE: 12:30 PM- Service was restored as of 10:15 AM.
On Wednesday, June 5th, we will be migrating all users and research spaces on ufs-12-a and ufs-12-b to the new home directory system. Users with home directories on ufs-12-a and ufs-12-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-12-a and ufs-12-b will be unavailable.
For users on ufs-12-a and ufs-12-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Jobs using research spaces on ufs-12-a and ufs-12-b will be terminated on June 5th, and affected users may be temporarily prevented from accessing the system to ensure a clean migration.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-12-a or ufs-12-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.
Update 5:30 pm All research spaces have been migrated. Two active users with very large file counts are still syncing.
Update 5:30 pm We continue to migrate users; this has been slowed by issues with the home file system earlier today. Currently, 5 active users and a single research space remain to be migrated.
Update 10:40pm The remaining 14 active users are still syncing; we anticipate the majority of syncs will be finished tomorrow, 5/24.
Update: 1:20pm The remaining 14 active users are currently syncing, along with 3 remaining research spaces.
Update: 9:00am An additional ~150 users have been enabled. We now have ~40 active users still resyncing. We continue to work on syncing research spaces.
Update 10:00pm Resync is ongoing; we anticipate finishing ~150 further users early tomorrow. The remaining ufs-13-a research spaces will also be available tomorrow.
Update: 7:35pm ~300 users have been re-enabled. Nearly all ufs-13-b research spaces are complete.
Update: 5:10pm ~250 users have been re-enabled.
Update: 4:00pm The sync of data is progressing. Due to the high volume of changes made to user accounts and research spaces, the resync is taking longer than anticipated.
On Wednesday, May 22nd, we will be migrating all users and research spaces on ufs-13-a and ufs-13-b to the new home directory system. Users with home directories on ufs-13-a and ufs-13-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-13-a and ufs-13-b will be unavailable.
For users on ufs-13-a and ufs-13-b, jobs that would overlap with the maintenance window will not start until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Jobs using research spaces on ufs-13-a and ufs-13-b will be terminated on May 22nd, and affected users may be temporarily prevented from accessing the system to ensure a clean migration.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-13-a or ufs-13-b, let us know if this interruption will have a significant impact on your work and we can migrate your spaces before the transition.
We will be removing older, less secure SSH ciphers (3des-cbc, blowfish-cbc, and cast128-cbc) this afternoon. This should not affect anyone with a modern SSH client. If you have problems logging in, please try updating your SSH client; if the problem persists, please contact us: https://contact.icer.msu.edu/contact
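If you want to check whether your client is affected, a quick sketch (assumes an OpenSSH client, version 6.3 or later, which supports the -Q query flag):

```shell
# List every cipher the local OpenSSH client supports.
ssh -Q cipher

# Verify that at least one modern counter-mode cipher is available.
# If this prints "modern cipher available", the removal of the old
# CBC ciphers will not affect you.
if ssh -Q cipher | grep -qE 'aes(128|192|256)-ctr'; then
    echo "modern cipher available"
else
    echo "update your SSH client"
fi
```

Clients other than OpenSSH (e.g., PuTTY) list their supported ciphers in their connection settings instead.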
May 8, 2019 (10:27am)
Scratch system ls15 is working normally and available for all nodes.
May 7, 2019
Scratch group directories (/mnt/ls15/scratch/groups/<group>) are currently unavailable on most of the nodes. If you need to access the scratch space, please use dev-intel14-phi or the rsync gateway. We are working with the vendor on this issue; updates will be posted.
On Wednesday, May 8, 2019 at 9pm, the server hosting the HPCC Globus endpoint will be temporarily shut down to allow migration of the server to our new systems. This process will be disruptive to transfers, so access to the system will be halted at 9am on Wednesday, May 8 to prevent new transfers from being started prior to the outage. The anticipated duration of the outage is 20 minutes.
If you have any questions regarding the scheduled maintenance, please open a ticket or contact us using your preferred method.
Our job scheduling software, SLURM, is currently offline. We are working with the software vendor to bring it back online. While SLURM is offline, new jobs cannot be submitted and client commands (e.g., squeue, srun, sinfo) will not function. Running jobs are not affected.
Update 2:20 P.M.: The scheduler is back online after applying a patch from the vendor.
On Tuesday, April 30 at 10:00pm gateway-01 and gateway-03 will become temporarily unavailable as we migrate these servers to a new host. A notice will be given 30 minutes prior to the shutdown of each machine, and logins will be temporarily blocked during this time. The total downtime for either gateway will be less than 10 minutes.
On Wednesday, April 17, 2019, unexpected maintenance was required following a system failure on the server which hosts wiki.hpcc.msu.edu. At approximately 3pm on Thursday, April 18, wiki.hpcc.msu.edu was brought back online; however, users may find that they are unable to log into the site. We are aware of this issue and are working to restore the site to full functionality as quickly as possible.
Apr 10, 2019 (3:40pm)
The login issue from the Globus web site to the HPCC server is resolved.
Apr 09, 2019
The Globus web site is having a connection problem with the HPCC server. When users try to connect to globus-01.hpcc.msu.edu from the Globus web site, they receive the error message: "Activation failed: Error contacting myproxy server, Failed to receive credentials. ERROR from myproxy-server: invalid password". We are aware of this issue, and our system administrators are working on it. We will update this message when it is resolved.
Overnight, the data pool for ufs-13-a experienced a disk failure that triggered a kernel bug, causing the system to go offline. This also caused ufs-13-b to become unavailable when it imported the data pool. We are working on the issue and will have both systems back online as soon as possible.
4-2-19 10:30am As of 10:30 the failed disk has been replaced and both systems are back online.
At 8am on 4-2-19, we will be rebooting our web RDP gateway for maintenance.
On Thursday, March 28th, we will be migrating all users and research spaces on ufs-11-a and ufs-11-b to the new home directory system. Users with home directories on ufs-11-a and ufs-11-b will not be able to run jobs or log into the HPCC on this day, and research spaces on ufs-11-a and ufs-11-b will be unavailable.
Users who are on ufs-11-a and ufs-11-b will have their queued jobs held until the maintenance is complete. Any running jobs for these users or research spaces will be terminated at the start of the window.
Users who are held will see "AssocMaxJobsLimit" in their squeue output.
If you acknowledge that any running jobs will be canceled on the 28th, contact us to request that the hold on your account be lifted.
Users should see significantly better performance and stability once the migration is complete. If you are on ufs-11-a or ufs-11-b, let us know if this interruption will have a significant impact on your work.
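A minimal sketch of how the AssocMaxJobsLimit hold appears in practice, using a hypothetical snippet of default-format squeue output (the job IDs, names, and node names below are made up for illustration):

```python
# Hypothetical squeue output: held jobs show the hold reason in the
# NODELIST(REASON) column while in the pending (PD) state.
sample = """\
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
12345   general   my_sim user1 PD  0:00     1 (AssocMaxJobsLimit)
12346   general  my_sim2 user1  R  1:23     1 lac-042"""

# Collect the IDs of jobs held by the migration limit.
held = [line.split()[0]
        for line in sample.splitlines()[1:]
        if "(AssocMaxJobsLimit)" in line]
print(held)  # ['12345']
```

In practice you would inspect the REASON column of `squeue -u $USER` directly; once the hold on your account is lifted, the reason clears and the job becomes eligible to start.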
UPDATE 2:45 PM Most users on ufs-11-b and two-thirds of the research spaces have been moved to the new file system. Most users on ufs-11-b can log in and submit jobs now. 70% of the active users on ufs-11-a have been migrated.
UPDATE 5:15 PM Over 80% of the research spaces on ufs-11-b have been migrated and are available. We expect that most users and research spaces from ufs-11-a and ufs-11-b will be available by the end of the day. There are a small number of users and research spaces with a large number of files that may not complete until Friday. Affected users will receive a message by the end of the day.
UPDATE 11:30 PM All but one of the research spaces on ufs-11-b have been migrated, and more than 95% of the users on ufs-11-a and ufs-11-b have been migrated. A dozen research spaces from ufs-11-a remain unavailable. Most blocks have been lifted. We expect the remaining transfers will complete overnight.
UPDATE 3/29/19 4PM We continue to finish the migration of users and research spaces from ufs-11-a to new directories, and will enable accounts as the migrations are complete.
UPDATE 3/30/19 11PM Migration of remaining users continues. We have 6 users and 1 research space remaining to be migrated at this time.