Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.
Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309
A bug is preventing high availability from working properly on Lustre (/mnt/scratch). Users may notice pauses when accessing /mnt/scratch if one of the back-end Lustre servers becomes unresponsive. Pending I/O should wait until the node is repaired, but we have received reports of failures in which recovery was unable to process the pending client requests and clients received a write error. We are working with the vendor and Intel to address this issue.
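Until the underlying bug is fixed, one generic client-side mitigation (a sketch only, not an official iCER workaround; the path below is hypothetical) is to wrap writes in a retry loop with backoff, so a transient write error from a briefly unresponsive server does not immediately fail the job:

```python
import time

def write_with_retry(path, data, attempts=5, delay=1.0):
    """Retry a file write with exponential backoff on transient I/O errors."""
    for attempt in range(attempts):
        try:
            with open(path, "w") as f:
                f.write(data)
            return True
        except OSError:
            # Re-raise if this was the final attempt; otherwise back off.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff
    return False

# Hypothetical usage from a job script:
# write_with_retry("/mnt/scratch/username/output.txt", "results\n")
```

This does not prevent the pause itself, but it gives the file server a chance to recover before the job gives up.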
File servers ufs-11-a and ufs-11-b went offline at 10:00 AM due to a high availability software configuration issue. We are working to restore service by 10:30 AM.
UPDATE: Service was intermittently available between 10:30 and 11 AM and was fully restored by 11 AM. We apologize for any problems this may have caused.
The Lustre storage system (/mnt/scratch and /mnt/ls15) will be unavailable from 8 AM to 12 PM on Monday, the 22nd of August, for a critical configuration update. Lustre will be unavailable on all compute, development, and file transfer nodes during this window; the development nodes themselves will remain available but may be rebooted during the maintenance. The main queue will be paused. If your job does not use Lustre and you would like it to run during this window, or if you have any other questions, please contact us via the contact form.
UPDATE: 8:05 AM All clients have been unmounted and the vendor is running the update on the Lustre servers.
UPDATE: 10 AM 60% of the Lustre servers have had the fix applied. The ufs-11 home directory servers have been updated.
UPDATE: 10:30 AM All of the Lustre servers have had the fix applied. The vendor is testing the Lustre servers.
UPDATE: 12:00 PM The vendor has identified an issue with the way the metadata server is passing information to clients on one of the networks. Intel and the vendor are investigating. The new estimated return-to-service time for Lustre and running jobs on the main cluster is now 5 PM. We apologize for the disruption.
UPDATE: 2 PM We are working on a fix that will allow a return to service. ETA is still 5 PM.
UPDATE: 5 PM The maintenance has been completed and Lustre is available to all compute and development nodes. Jobs are running normally. The Globus Online endpoint is currently offline and is expected to return to service by EOB tomorrow. Let us know if you experience any issues.
A user generated a large volume of traffic to the file server ufs-10-b this morning, causing problems for users whose home or research spaces are on ufs-10-b. The issue developed around 8 AM and was resolved by 11 AM. We have contacted the user responsible and held their jobs until their workflow can be fixed.
The new home directory file servers remained available during this outage. We are working to migrate users to the new file servers and retire the older file servers. If you would like to move to the new file servers, please contact us at https://contact.icer.msu.edu/contact.
Due to high usage, ufs-10-a has become unresponsive, which is causing sluggishness throughout the cluster. The filer is currently rebooting, with an estimated reboot time of 30-40 minutes.
Update: ufs-10-a is back online. We are working to mitigate load on the cluster; to allow this, logins through the gateway have been disabled.
Update: 15:00 ufs-10-a went offline again due to excessive traffic. We have identified the cause and are currently mitigating it.
Update: 15:35 The cluster has stabilized, and ufs-10-a is back online under normal operation.
Our regularly scheduled maintenance window began at 6:00 AM today. Access to HPCC services may remain limited until the end of the business day.
UPDATE: 4:45 PM Our Lustre scratch maintenance is taking longer than anticipated. We will make the gateway and development nodes available to users at 4:45 PM, but scratch will remain unavailable and queued jobs will remain queued until the Lustre work is complete; the current estimate is 8 PM. Users whose jobs do not use scratch may use the contact form to request that their jobs be started manually.
UPDATE: 7:45 PM All of the Lustre software updates have been applied. Intel and the Lustre vendor have identified and implemented the final configuration change to allow full speed access to Lustre on intel16. However, it needs to be applied to 48 storage configurations. The current estimate for completion is now 10 PM.
UPDATE: 12:20 AM Lustre maintenance has been completed and the scheduler has been resumed. All HPCC services are operational.
6:00 AM Interactive access suspended; active users disconnected.
6:30 AM Firewall upgrade begun.
7:00 AM Firewall upgrade complete.
7:00 AM High-speed (InfiniBand) network, home directory server, gateway, and scratch (Lustre) updates underway.
8:00 AM intel14 InfiniBand update complete.
10:00 AM Home directory network update complete.
2:30 PM Gateway update complete.
3:00 PM New home directory testing complete.
3:00 PM intel16 InfiniBand update complete.
4:00 PM Home directory maintenance complete.
4:45 PM Gateway made available to users.
10:00 PM Lustre configuration complete.
12:20 AM Final Lustre testing complete; scheduler resumed.
Over the weekend, ufs-11-b appears to have suffered a hardware failure. As expected, its high-availability partner, ufs-11-a, has imported ufs-11-b's file system, preserving access. There will be increased load and latency on ufs-11-a until the hardware failure has been corrected. We have a service call open to address the issue.
Update 7-20-16 -
The hardware for ufs-11-b has been repaired and the server is back online. ufs-11-a is still hosting both file systems because several users are actively using them; we plan to switch the file systems back during the outage on 7-26.
Update 7-20-16 -
Job activity on the new filers finished this morning, and the file system for ufs-11-b has been switched back to its main server. During the switch, ufs-11-a and ufs-11-b suffered a short period of downtime; both are now operating properly, and performance on both file systems is back to normal.
The interactive node dev-intel14 is currently offline due to a software problem. Staff are working to resolve the issue. The other development nodes remain available, and running and queued jobs on the cluster are unaffected.
Update 3:30 PM - dev-intel14 was returned to service after we identified a failure in the environment variable that controls several volume mounts. The variable's logic was corrected globally, so no repeat of this issue is expected.
iCER is deploying a new hybrid home and research storage system which will be significantly faster than the previous home directory servers. The first phase of this system is a two-node, high-availability cluster of 24-core servers with a 20 Gbps network connection and a hybrid disk/SSD storage system with 800 TB of storage.
We’ve already begun the process of migrating users to this new system. Any new large quota increases will be provided on this system, which will eventually replace all the home directory servers.
Due to unusual traffic, users whose home or research spaces are on file server ufs-10-a experienced slow performance starting around 3 PM this afternoon; the server became unresponsive around 4:15 PM. We are working to restore service and anticipate a return to service by 5:30 PM.
Update: 5:45 PM ufs-10-a was returned to service at approximately 5:30 PM. Due to a bug in the NFSv4 client implementation, a few cluster nodes were generating requests faster than the server could respond, creating a significant backlog on the server. The operating system our home directory file servers currently run on does not perform any rate limiting, which allowed several hundred thousand requests to queue. Exacerbating the issue, the file server's kernel has a memory leak. We are working aggressively to move home directory and research servers to a new file system and new hardware, and will continue to monitor the servers.
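For readers unfamiliar with the rate-limiting mechanism described as missing above, it can be sketched as a token bucket (a generic illustration, not the file server's actual implementation): each incoming request consumes a token, tokens refill at a fixed rate, and requests arriving faster than the refill rate are rejected or queued instead of piling up unbounded.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: admits at most `rate` requests
    per second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the rate: reject or defer this request
```

With such a limiter in place, a runaway client exhausts its burst allowance and is throttled, rather than queuing hundreds of thousands of requests on the server.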
At approximately 3 PM today, one of the Lustre file servers experienced a software crash. Its failover partner was unavailable due to an unrelated issue, so while the server was down, any access to /mnt/ls15 or /mnt/scratch would hang. The server was returned to service at approximately 4:30 PM today. We apologize for any disruption this caused.
The system remains stable and available.
On 4/21 at 3:05 PM ufs-10-a crashed. We are currently working to return it to service. Users may experience delays logging into gateways until it is returned to service.
4:38 PM: ufs-10-a has been returned to service. The fault was due to a kernel bug on the file server. We have switched to a version of the kernel that does not have that issue. Gateway access has been restored.
UPDATE: Maintenance was successfully completed and the scheduler has restarted.
Tuesday, April 19 - The scheduler will be paused between 8 and 10 a.m. today to facilitate electrical work being done inside the Machine Room. Jobs that are already running will continue to run, but jobs whose wall time would overlap with this planned maintenance will not start until after the scheduler has resumed. Check back here for updates.
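The scheduler's hold-back behavior amounts to a simple check: a job may start only if its requested wall time would finish before maintenance begins. A minimal sketch of that policy (an illustration only, not the scheduler's actual code; the function name is ours):

```python
from datetime import datetime, timedelta

def can_start_before_maintenance(now, walltime, maint_start):
    """A job may start only if it would finish before maintenance begins."""
    return now + walltime <= maint_start

# Hypothetical example: maintenance window opens at 8 AM.
maint = datetime(2016, 4, 19, 8, 0)
now = datetime(2016, 4, 19, 6, 30)

print(can_start_before_maintenance(now, timedelta(hours=1), maint))  # True: fits
print(can_start_before_maintenance(now, timedelta(hours=4), maint))  # False: overlaps
```

So a job submitted this morning with a short wall time may still run before 8 a.m., while longer jobs simply wait in the queue.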
You can see a tentative timeline for all new cluster work here.