Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.
Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309
Due to high usage, ufs-10-a has become unresponsive, which is causing sluggishness throughout the cluster. The filer is currently rebooting, with an estimated reboot time of 30-40 minutes.
Update: ufs-10-a is back online. We are working to mitigate load on the cluster at this time; to allow this, logins through the gateway have been temporarily disabled.
Update: 15:00 ufs-10-a went offline again due to excessive traffic. We have identified the cause and are currently mitigating it.
Update: 15:35 The cluster has stabilized, and ufs-10-a is back online under normal operation.
Our regularly scheduled maintenance window began at 6:00 AM today. Access to HPCC services may remain limited until the end of the business day.
UPDATE: 4:45 PM. Our Lustre scratch maintenance is taking longer than anticipated. We will make gateway and development nodes available to users at 4:45, but scratch will remain unavailable and queued jobs will remain queued until the Lustre work is complete; the current estimate is 8 PM. Users whose jobs do not make use of scratch may use the contact form to request that their jobs be started manually.
UPDATE: 7:45 PM All of the Lustre software updates have been applied. Intel and the Lustre vendor have identified and implemented the final configuration change to allow full speed access to Lustre on intel16. However, it needs to be applied to 48 storage configurations. The current estimate for completion is now 10 PM.
UPDATE: 12:20 AM Lustre maintenance has been completed and the scheduler has been resumed. All HPCC services are operational.
06:00 Interactive access was suspended and active users were disconnected.
06:30 Firewall upgrade begun.
07:00 Firewall upgrade complete.
07:00 High-speed (InfiniBand) network, home directory server, gateway, and scratch (Lustre) updates underway.
08:00 Intel14 InfiniBand update complete.
10:00 Home directory network update complete.
2:30 PM Gateway update complete.
3 PM New home directory testing complete.
3 PM Intel16 InfiniBand update complete.
4 PM Home directory maintenance complete.
4:45 PM Gateway will be available to users.
10 PM Lustre configuration complete.
12:20 AM Final Lustre testing complete. Scheduler resumed.
Over the weekend, ufs-11-b appears to have suffered a hardware failure. As expected, its high-availability partner ufs-11-a has imported the file system for ufs-11-b, preserving access to the file system. There will be increased load and latency on ufs-11-a until the hardware failure has been corrected. We currently have a service call open to address the issue.
Update 7-20-16 -
The hardware for ufs-11-b has now been repaired and the server is back online. ufs-11-a is still hosting both file systems, as several users are actively using them; we are planning to switch the file systems back during the outage on 7-26.
Update 7-20-16 -
Job activity on the new filers finished this morning, and the file system for ufs-11-b has been switched back to its main server. During the switch, ufs-11-a and ufs-11-b suffered a short period of downtime. Both are operating properly at this time, and performance on both file systems is back to normal.
The interactive node dev-intel14 is currently offline due to a software problem. Staff are working to resolve the issue. The other development nodes remain available, and running and queued jobs on the cluster are unaffected.
Update 3:30 PM - dev-intel14 was returned to service after we identified a failure in the logic of an environment variable that controls several volume mounts. The logic was corrected globally, so we do not expect this issue to recur.
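For illustration only, a minimal sketch of the kind of guard logic involved: resolving which volumes to mount from an environment variable, with a safe default and an explicit error for unrecognized values rather than a silent failure. The variable name, mount sets, and paths here are hypothetical, not iCER's actual configuration.

```python
import os

# Hypothetical mapping from a mount-set name to the volumes it implies.
MOUNT_SETS = {
    "default": ["/mnt/home", "/mnt/research", "/mnt/scratch"],
    "minimal": ["/mnt/home"],
}

def volumes_to_mount(env=os.environ):
    """Resolve the mount list from the (hypothetical) HPCC_MOUNT_SET
    variable, falling back to "default" when it is unset and failing
    loudly on an unknown value instead of silently mounting nothing."""
    name = env.get("HPCC_MOUNT_SET", "default")
    if name not in MOUNT_SETS:
        raise ValueError(f"unknown mount set: {name!r}")
    return MOUNT_SETS[name]
```

The key design point is that an unset or bad value can never quietly skip required mounts.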
iCER is deploying a new hybrid home and research storage system which will be significantly faster than the previous home directory servers. The first phase of this system is a two-node, high-availability cluster of 24-core servers with a 20 Gbps network connection and a hybrid disk/SSD storage system with 800 TB of storage.
We’ve already begun the process of migrating users to this new system. Any new large quota increases will be provided on this system, which will eventually replace all the home directory servers.
Due to unusual traffic, file server ufs-10-a slowed around 3 PM this afternoon and became unresponsive around 4:15 PM; users whose home or research spaces are on it have been affected. We are working to restore service and anticipate a return to service by 5:30 PM.
Update: 5:45 PM ufs-10-a was returned to service at approximately 5:30 PM. Due to a bug in the NFSv4 client implementation, a few cluster nodes were generating requests faster than the server could respond, which created a significant backlog of requests on the server. The operating system our home directory file servers currently run on does not have any rate limiting, which allowed several hundred thousand requests to be queued. Exacerbating the issue, the file server's kernel has a memory leak. We are working aggressively to move the home directory and research servers to a new file system and new hardware, and will continue to monitor the servers.
At approximately 3 PM today, one of the Lustre file servers experienced a software crash. Its fail-over partner was unavailable due to an unrelated issue. While the server was down, any access to /mnt/ls15 or /mnt/scratch would hang. The server was returned to service at approximately 4:30 PM today. We apologize for any disruption that this caused.
The system remains stable and available.
On 4/21 at 3:05 PM ufs-10-a crashed. We are currently working to return it to service. Users may experience delays logging into gateways until it is returned to service.
4:38 PM: ufs-10-a has been returned to service. The fault was due to a kernel bug on the file server. We have switched to a version of the kernel that does not have that issue. Gateway access has been restored.
UPDATE: Maintenance was successfully completed and the scheduler has restarted.
Tuesday, April 19 — The scheduler will be paused between 8-10 a.m. today to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but jobs with a wall time that would overlap with this planned maintenance will not start until after the scheduler has resumed. Check back here for updates.
You can see a tentative timeline for all new cluster work here.
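The start rule described above can be sketched as a simple overlap check: a job may start only if its requested wall time finishes before the maintenance window opens (or the window is already over). This is a simplified illustration, not the scheduler's actual implementation.

```python
from datetime import datetime, timedelta

def can_start(now, walltime, maint_start, maint_end):
    """Return True if a job requesting `walltime` may start at `now`
    without overlapping the maintenance window (simplified rule)."""
    if now >= maint_end:                  # window already over
        return True
    return now + walltime <= maint_start  # must finish before it opens

# The April 19 window from the announcement above.
maint_start = datetime(2016, 4, 19, 8, 0)
maint_end = datetime(2016, 4, 19, 10, 0)
```

For example, a one-hour job submitted at 7:30 a.m. would wait, since it could not finish before 8 a.m.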
2:15 p.m. — dev-intel14 has been returned to operational status. All services are up and operational.
10:30 a.m. — A bug was reported in the NFS client on dev-intel14. To address this issue, dev-intel14 will be rebooted at 1 p.m. today, and users will not be able to access it for about 30 minutes. In the meantime, please use one of the other development nodes, such as dev-intel14-phi, to access your files.
If you have any questions or would like further assistance, please contact us.
The Machine Room is up and operational.
Please note: Next Tuesday, April 19, the scheduler will be paused between 8-10 a.m. to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but no new jobs will start during this time. Check back here for updates.
You can see a tentative timeline for all new cluster work here.
April 7, 2016, 10:30 AM: There was a failure at the MSU power plant, which dropped power for ~15 seconds to the Engineering Building and the HPCC Machine Room. All compute nodes lost power and any running jobs were terminated. Additionally, power was lost to one file server (ufs-10-b); users on ufs-10-b would be unable to log into the HPCC until the file server rebooted (estimated return to service at 11 AM).
UPDATE: 11:00 AM: ufs-10-b has been returned to service and all interactive services are available. We are currently checking the health of the cluster before resuming scheduling.
UPDATE: 12:02 PM: The scheduler remains paused, as the MSU campus alert states that power shedding may occur several more times. Interactive access will continue; however, no new batch jobs will start until MSU issues an all-clear.
UPDATE: 12:38 PM: We have received the all-clear and job scheduling has resumed.
The externally managed iCER firewall stopped routing packets at 6:15 this morning (04-06-2015). It appears there was a fault in one of the routing engines in the system. The firewall was returned to service at 9:10 AM. Running jobs were not disrupted. We apologize for any disruption that this may have caused.