Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.
Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309
We recently applied a fix to address an urgent Linux kernel security issue. Users have reported that some MPI processes may have been terminated when the patch was applied. New MPI jobs appear to launch successfully. gdb and other debuggers will be disabled until the permanent security fix is applied next week.
From 1:40 to 2 PM today (10-21) access to home directories and research spaces on ufs-11-a was interrupted due to an administrator error. Users may have seen the following error message:
"NFS: Stale File Handle"
Jobs attempting to access ufs-11-a may have failed. We're sorry for any problems that this caused.
File system maintenance is ongoing, and the system is unavailable today for many users who had either home directories or shared research directories on the 10-b file system (approximately 700 users affected).
Those affected cannot log in. See https://wiki.hpcc.msu.edu/x/VQKpAQ for details. Affected users were sent two messages indicating they would not be able to log in today.
Update 1 PM: The first 200 users have been migrated and their accounts have been unblocked and released. Another ~500 users remain to be migrated, which we expect to complete this evening.
Update 7:35 PM: All accounts that we were able to migrate have been migrated. All users should have access to their new home directories within two hours. Emails will be sent tomorrow with instructions for updating Samba mounts if you are using them.
One of our home directory servers (ufs-10-b) has had a number of problems in the last few weeks. These problems have caused the HPCC to become unavailable and jobs to fail. To resolve this, we are moving all active users on ufs-10-b to new home directory servers.
- Jobs submitted by ufs-10-b users that would run during the maintenance window will not start until the maintenance is complete. ufs-10-b users can continue to run jobs that will end before the maintenance.
- ufs-10-b users will not be able to use SSH, Globus, RDP, or rsync to access the HPCC on the 19th.
- Windows file sharing (Samba/smb) access to ufs-10-b will be unavailable on the 19th. You will need to update your client's mount to point at the new file server after the maintenance is complete.
- Any access to research spaces that are on ufs-10-b will be disabled until the migration is complete.
To check a research space (replace myresearchspace with the name of the research space)
If either returns 'ufs-10-b.i' you will be affected.
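As one possible way to perform this kind of check, the sketch below looks up which file server backs a given path by reading the Linux mount table. It is an illustration under stated assumptions, not the command from this notice: the helper names, the example path /mnt/research/myresearchspace, and the assumption that the server name appears as the "host:/export" source of an NFS mount are all hypothetical.

```python
import os


def nfs_source_for(path, mounts_text):
    """Return the source field (e.g. 'host:/export') of the mount whose
    mount point is the longest prefix of `path`, or None if nothing matches.

    `mounts_text` is text in /proc/mounts format: source, mount point, ...
    """
    best_src, best_mp = None, ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        src, mp = fields[0], fields[1]
        # Match only on whole path components ("/mnt/home/al" must not
        # match "/mnt/home/alice").
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            if len(mp) > len(best_mp):
                best_src, best_mp = src, mp
    return best_src


def is_affected(path, mounts_text, server="ufs-10-b"):
    """True if the host part of the mount source starts with `server`."""
    src = nfs_source_for(path, mounts_text)
    return src is not None and src.split(":", 1)[0].startswith(server)


if __name__ == "__main__":
    # Hypothetical paths; replace myresearchspace with your research space.
    with open("/proc/mounts") as f:
        table = f.read()
    for p in (os.path.expanduser("~"), "/mnt/research/myresearchspace"):
        real = os.path.realpath(p)
        print(p, "->", nfs_source_for(real, table),
              "(affected)" if is_affected(real, table) else "")
```

On systems that mount directories on demand, the path may need to be accessed once (for example by listing it) before it appears in the mount table.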
The HPCC will be intermittently available from 8 AM to 5 PM Tuesday, October 4th. No scheduled jobs will be run during the window.
- Cluster expansion: An additional intel16/Laconia rack will be installed, adding 2000 cores to Laconia.
- Lustre (/mnt/scratch) performance and stability improvements: A bugfix release will be applied to the Lustre servers and the network configuration will be adjusted.
- Home directory migration: Active users on ufs-10-a will be migrated to our new file servers. This will significantly improve performance and user experience.
We anticipate that Gateway and development nodes will be available throughout the maintenance with one to two interruptions for reboots. Users on ufs-10-a may have their access blocked until the migration is complete. Lustre will be unavailable until the Lustre-specific maintenance is complete.
If you have any questions please contact us: https://contact.icer.msu.edu/contact
Update: Tuesday, October 4th, 3:12pm EDT:
All of the planned work for the Storage and Cluster Maintenance Outage has been completed, and we expect to resume running jobs around 4:00pm EDT today.
Update: 5:15 PM The scheduler was resumed and the maintenance was completed at 4:30 PM.
We are currently experiencing issues with scratch on the Globus and rsync gateway nodes. Until further notice, please transfer any data to a research space, then transfer this data to scratch from a dev node. We understand this is a major inconvenience, and are working to restore scratch functionality on the Globus and rsync gateway nodes as quickly as possible. During this time, Globus and file transfer services through rsync.hpcc.msu.edu will have interruptions.
Update Sept 19 4:15pm: Connections to scratch from the rsync/file transfer gateway and Globus have been restored. However, issues remain, and there may be intermittent problems while we work on a more stable solution. Thanks for your patience.
A bug is preventing high availability from working properly on Lustre (/mnt/scratch). Users may notice pauses when accessing /mnt/scratch when one of the back-end Lustre servers becomes unresponsive. Pending I/O should wait until the node is repaired, but we have received reports of failures in which recovery was unable to process the pending client requests and clients received a write error. We are working with the vendor and Intel to address this issue.
File servers ufs-11-a and ufs-11-b went offline at 10:00 AM due to a high availability software configuration issue. We are working to restore service by 10:30 AM.
UPDATE: Service was intermittently available between 10:30 and 11 AM and was restored by 11 AM. We apologize for any problems this may have caused.
The Lustre storage system (/mnt/scratch and /mnt/ls15) will be unavailable from 8 AM to 12 PM on Monday, the 22nd of August, for a critical configuration update. It will be unavailable on all compute, development, and file transfer nodes during this window. Interactive development nodes will remain available but may be rebooted during the maintenance window. The main queue will be paused. If your job does not use Lustre and you would like it to run during this window, or if you have any other questions, please contact us via the contact form.
UPDATE: 8:05 AM All clients have been unmounted and the vendor is running the update on the Lustre servers.
UPDATE: 10 AM 60% of the Lustre servers have had the fix applied. The ufs-11 home directory servers have been updated.
UPDATE: 10:30 AM All of the Lustre servers have had the fix applied. The vendor is testing the Lustre servers.
UPDATE: 12:00 PM The vendor has identified an issue with the way the metadata server is passing information to clients on one of the networks. Intel and the vendor are investigating. The estimated return-to-service time for Lustre and for running jobs on the main cluster is now 5 PM. We apologize for the disruption.
UPDATE: 2 PM We are working on a fix that will allow a return to service. ETA is still 5 PM.
UPDATE: 5 PM The maintenance has been completed and Lustre is available to all compute and development nodes. Jobs are running normally. The Globus Online endpoint is currently offline and is expected to return to service by EOB tomorrow. Let us know if you experience any issues.
A user generated a large volume of traffic to the file server ufs-10-b this morning, causing problems for users whose home or research spaces are on ufs-10-b. The issue developed around 8 AM and was resolved by 11 AM. We have contacted the users responsible and held their jobs until their workflow can be fixed.
The new home directory file servers remained available during this outage. We are working to migrate users to the new file servers and retire the older file servers. If you would like to move to the new file servers, please contact us at https://contact.icer.msu.edu/contact .
ufs-10-a has become unresponsive due to high usage. This is causing sluggishness throughout the cluster. The filer is currently rebooting, with an estimated reboot time of 30 to 40 minutes.
Update: ufs-10-a is back online. We are working to mitigate load on the cluster. To allow this, logins through Gateway have been disabled.
Update: 15:00 ufs-10-a went offline again due to excessive traffic. We have identified the cause and are currently mitigating it.
Update: 15:35 The cluster has stabilized, and ufs-10-a is back online under normal operation.
Our regularly scheduled maintenance window began at 6:00 AM today. Access to HPCC services may remain limited until the end of the business day.
UPDATE: 4:45 PM. Our Lustre scratch maintenance is taking longer than anticipated. We will make gateway and development nodes available to users at 4:45, but scratch will remain unavailable and queued jobs will remain queued until the Lustre work is complete; the current estimate is 8 PM. Users whose jobs do not make use of scratch may use the contact form to request that their jobs be started manually.
UPDATE: 7:45 PM All of the Lustre software updates have been applied. Intel and the Lustre vendor have identified and implemented the final configuration change to allow full speed access to Lustre on intel16. However, it needs to be applied to 48 storage configurations. The current estimate for completion is now 10 PM.
UPDATE: 12:20 AM Lustre maintenance has been completed and the scheduler has been resumed. All HPCC services are operational.
6:00 AM Interactive access was suspended and active users were disconnected.
6:30 AM Firewall upgrade began.
7:00 AM Firewall upgrade complete.
7:00 AM High speed (InfiniBand) network, home directory servers, gateway and scratch (Lustre) updates underway.
8:00 AM Intel14 InfiniBand update complete.
10:00 AM Home directory network complete.
2:30 PM Gateway update complete.
3:00 PM New home directory testing complete.
3:00 PM Intel16 InfiniBand update complete.
4:00 PM Home directory maintenance complete.
4:45 PM Gateway will be available to users.
10:00 PM Lustre configuration complete.
12:20 AM Final Lustre testing complete. Scheduler resumed.
Over the weekend, ufs-11-b appears to have suffered a hardware failure. As expected, the high availability partner ufs-11-a has imported the file system for ufs-11-b, allowing continued access to the file system. There will be increased load and latency on ufs-11-a until the hardware failure has been corrected. We currently have a service call open to address the issue.
Update 7-20-16 -
The hardware for ufs-11-b has now been repaired and the server is back online. ufs-11-a is still hosting both file systems because several users are using them. We plan to switch the file systems back during the outage on 7-26.
Update 7-20-16 -
Job activity to the new filers finished this morning. The filesystem for ufs-11-b has been switched back to its main server. During the switch ufs-11-a and ufs-11-b suffered a short period of downtime. They are both operating properly at this time. Performance on both file systems is back to normal.
The interactive node dev-intel14 is currently offline due to a software problem. Staff are working to resolve the issue. The other development nodes remain available, and running and queued jobs on the cluster are unaffected.
Update 3:30pm - dev-intel14 was returned to service after we identified a failure in the environment variable that controls several volume mounts. The logic for the variable was corrected globally, so we do not expect the issue to recur.