The HPCC will be unavailable on March 6th for security and stability software updates.
On 1-20-18 a single drive failed on ufs-13-a, which caused other drives on the same controller to become unresponsive. The file system was unresponsive for approximately an hour, and then our high-availability system moved the data pool over to ufs-13-b. As a result, a very small number of files, approximately 20, that were being written at the time of the failure were lost. For now, home directories on ufs-13-a cannot be mounted on your local system unless you are using sshfs. Included below is a link on mounting a drive with sshfs.
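For reference, an sshfs mount generally looks like the sketch below. The username, remote path, and local mount point are placeholders; gateway.hpcc.msu.edu is used here as the login host.

```shell
# Sketch: mount an HPCC home directory locally over sshfs.
# "username", the remote path, and the mount point are placeholders.
mkdir -p ~/hpcc-home
sshfs username@gateway.hpcc.msu.edu:/mnt/home/username ~/hpcc-home

# When finished, unmount (Linux; on macOS use `umount ~/hpcc-home`):
fusermount -u ~/hpcc-home
```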
We will be working with our vendor to investigate what caused the failure and how this can be avoided in the future.
On Tuesday, January 30th at 7 AM we will take the data pool offline for both ufs-13-a and ufs-13-b to restore normal operations.
This maintenance will last about one hour, during which all users and research spaces on these storage systems will be unavailable.
Update: Both ufs-13-a and ufs-13-b were taken offline to restore complete functionality at 7:15 am and were returned to online status at 7:45 am.
We have now resolved the issues with Samba on ufs-12-a and ufs-12-b. This service should now be available to all users. If you are having trouble connecting, please see our Samba wiki below.
Due to a system problem, one of the components of the Lustre (ls15) scratch file system is currently offline. We are currently working with the vendor to restore access to the full system. Access to files on the affected server will hang until access has been restored. The scheduler has been paused until the issue is resolved.
Update: 6 PM: The vendor is continuing to diagnose the problem. The scheduler has been resumed. Users can see which of their files are on the affected OST by using the following command:
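As a sketch, Lustre's lfs tool can locate files by object storage target. The OST UUID below is a placeholder; substitute the index announced for the affected server.

```shell
# Sketch: find your files that have objects on a specific OST of /mnt/ls15.
# "ls15-OST0003_UUID" is a placeholder for the affected target's UUID.
lfs find /mnt/ls15/$USER --obd ls15-OST0003_UUID
```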
Update 12:45pm 12-19: We continue to work with our Lustre vendor to diagnose the current issue with /mnt/ls15.
Update 9am 12-21: We are currently cleaning up lost file data from the Lustre failure. We anticipate that Lustre operations will return to normal late this afternoon.
Update 4pm 12-21: All user and group scratch directories have been cleaned of lost files and should be responding properly. Please open a ticket if you are still having trouble with your scratch directory.
The HPCC will be unavailable on January 4th, 2018 to perform scheduled maintenance on core systems.
Update: 4:30 PM. The maintenance has been completed and job scheduling has resumed. There is one outstanding issue: users with home directories or research spaces on ufs-12 will not be able to access those systems via SMB from their desktops. Users can use SSHFS as described here to access their home directory or research space on the ufs-12 server until service is restored.
We have disabled the SMB version 1 (SMBv1) protocol on our filers running Red Hat Enterprise Linux 7 (ufs-11-a, ufs-11-b, ufs-13-a and ufs-13-b) due to a critical vulnerability. SMBv2 and SMBv3 remain available. SMBv1 is an older version of the protocol that allows you to access your home directory or research space on campus from your desktop or laptop. In our testing, all modern operating systems support SMBv2 or v3 unless support has been explicitly disabled; check our documentation for instructions on how to enable SMBv3 and disable SMBv1. If you are having trouble accessing your home directory or research space on one of these systems after this change, please open a ticket.
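As an illustration, on a Linux client the protocol version can be pinned at mount time so SMBv1 is never negotiated. The server name, share, mount point, and username below are placeholders.

```shell
# Sketch: mount a share over SMB, requiring protocol version 3.0.
# Server name, share, mount point, and username are all placeholders.
sudo mount -t cifs //ufs-13-a.hpcc.msu.edu/username /mnt/hpcc \
    -o vers=3.0,username=username
```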
HPCC systems will be unavailable on Tuesday, September 19th, from 7 AM to 7 PM to improve the stability of our home directory storage system. We apologize for the interruption in your work.
Update: The work was completed at 8:25 PM and the system was returned to service. Users with research spaces on ufs-11-b may experience intermittent permissions issues. This will be resolved Wednesday morning.
Due to a bug in ZFS, we have experienced several instances of our home directory filers going offline. This is triggered when a user restores files from a snapshot: ZFS does not unmount the snapshot properly, and when our auto-snapshot process runs it can cause a kernel panic, bringing the system offline. To mitigate this, we will be disabling access to the .zfs directory for all user home directories and research spaces. If you need files restored, please open a ticket. We apologize for the inconvenience; however, this will help improve system stability. When a viable fix for the issue is released for ZFS, we will allow access to the snapshots once again.
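For context, a self-service restore normally reads from the hidden .zfs directory, roughly as sketched below. The snapshot name and file path are illustrative, and this path is now disabled per the announcement, so open a ticket for restores instead.

```shell
# Sketch: how a self-service restore from a ZFS snapshot normally works.
# Snapshot name and file path are illustrative; .zfs access is disabled.
ls ~/.zfs/snapshot/                                      # list snapshots
cp ~/.zfs/snapshot/daily-2017-09-01/file.txt ~/file.txt  # restore a file
```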
Update: 11:27 am : Issues with the scheduling system have been resolved and service has been restored. Some compute nodes have remaining issues that are being addressed.
As of approximately 10 am, the HPC scheduling software is offline for emergency maintenance. We are actively working to restore service as soon as possible. Users can log in, but cannot schedule new jobs or monitor existing jobs until service is restored. As always, thank you for your patience.
Unrelated to yesterday's trouble with ufs-13-b, ufs-13-a experienced a kernel panic, causing it to be unavailable from about 10 pm until this morning. We are looking into ways of mitigating this issue. Again, we apologize for the inconvenience.
At about 4 pm today, we attempted to remove an old snapshot, which caused the ufs-13-b filer to become unresponsive for approximately 40 minutes. The filer has been rebooted and is currently back online. We apologize for any inconvenience.
Users and research spaces affected by the failure of the 11-a filer were migrated to the 13-a and 13-b filers. Accessing data when logged into HPCC systems is unchanged, however, this migration will affect how data is accessed using Samba.
If you are having difficulty accessing your data after the August 8th downtime, please refer to our wiki: Mapping HPC drives to a campus computer with SMB.
Due to a catastrophic error on the file server ufs-11-a, research spaces and home directories on that file server are currently unavailable. All data is safe and we are restoring from a current backup, but access to those spaces has been blocked until they have been restored. We will send affected users an email once their data has been restored and they have been unblocked.
We sincerely apologize for the interruption. Please let us know if this will negatively impact your work and we will see what we can do to ensure that you can complete your work in a timely manner.
Update: 8/11 5 PM: Most home directories (864 of 877) and research spaces (33 of 39) that were on ufs-11-a have been restored.
Update: 8/14 11 AM: All but one user (876 of 877) and research space (38 of 39) that were on ufs-11-a have been restored. About 300 TB of data has been restored from the offsite systems.
Update: 8/15 11 AM: All users and research spaces that were on ufs-11-a have been restored. Please contact us if you have any issues.
At approximately 10 am this morning, our scheduler crashed and is currently offline. We are looking into the issue and will bring the scheduler back online as soon as possible.
UPDATE: 11:30 AM The job manager (TORQUE) failed to restart due to a large number of job array files that were left behind from jobs that had completed successfully. Staff identified the problem and were able to remove the ~130,000 leftover job files and restart the server at approximately 11:10 AM.
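As a rough sketch of the kind of cleanup involved, leftover job files can be counted and aged in TORQUE's spool area. The path below is an assumption; actual locations vary by installation.

```shell
# Sketch: count leftover job files in TORQUE's spool directory.
# The spool path is an assumption; check your installation's layout.
find /var/spool/torque/server_priv/jobs -type f | wc -l

# List files older than 7 days (candidates left behind by finished jobs):
find /var/spool/torque/server_priv/jobs -type f -mtime +7
```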
All HPCC systems will be unavailable on Tuesday, August 8th, from 6 AM to 6 PM. The primary focus will be on improving the performance and stability of our storage systems, as well as general system updates. Users should note the following changes:
- New SMB configuration for Windows File Sharing / Home Directory mounting (Samba). We'll be changing authentication domains from HPCC to CAMPUSAD and raising the minimum SMB version required to connect to version 2. Please see the following link for updated mounting information.
- New gateway servers. We will be deploying additional gateway nodes to provide a better user experience and minimize disruption when one server has an issue. Users may notice that access to gateway.hpcc.msu.edu may report a different host name once they connect. Users can connect to a specific gateway if desired, but it is not recommended.
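After the change, connectivity with the new domain and the SMB2 protocol floor can be checked from a Linux or macOS client with smbclient. The filer hostname and username below are placeholders.

```shell
# Sketch: list shares on a filer, authenticating against the CAMPUSAD
# domain and requiring at least SMB2. Hostname/username are placeholders.
smbclient -L //ufs-13-a.hpcc.msu.edu -U 'CAMPUSAD\username' \
    --option='client min protocol=SMB2'
```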