
Announcement History

Emergency Hardware Maintenance Planned for ufs-10-a Filer
The file server which hosts ufs-10-a has a hardware issue with its network interface card. Logins for all users may be affected, and some users (with home directories on ufs-10-a) will not be able to log in at all. To address this issue, a system downtime is planned for March 16, from 7:30am - 9:30am, to replace the card.

We regret any inconvenience this may cause. Service updates will be posted at http://icer.msu.edu/service-status

Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309 or stop by the iCER office.

Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309

  

Last changed Jul 28, 2016 15:39 by Brian Roth

Due to high usage, ufs-10-a has become unresponsive. This is causing sluggishness throughout the cluster. The filer is currently rebooting, with an estimated reboot time of 30-40 minutes.

 

Update: ufs-10-a is currently back online. We are working to mitigate load on the cluster; to allow this, logins through the gateway have been disabled.

 

Update: 15:00  ufs-10-a went offline again due to excessive traffic.  We have identified the cause and are currently mitigating it.

 

Update: 15:35  The cluster has stabilized, and ufs-10-a is back online under normal operation. 

Posted at Jul 28, 2016 by Brian Roth
Last changed Jul 27, 2016 00:30 by Andrew R Keen
Labels: maintenance

Our regularly scheduled maintenance window began at 6:00 AM today. Access to HPCC services may remain limited until the end of the business day.

UPDATE: 4:45 PM. Our Lustre scratch maintenance is taking longer than anticipated. We will make gateway and development nodes available to users at 4:45, but scratch will remain unavailable and queued jobs will remain queued until the Lustre work is complete; the current estimate is 8 PM. Users whose jobs do not make use of scratch may use the contact form to request that their jobs be started manually.

UPDATE: 7:45 PM All of the Lustre software updates have been applied. Intel and the Lustre vendor have identified and implemented the final configuration change to allow full speed access to Lustre on intel16. However, it needs to be applied to 48 storage configurations. The current estimate for completion is now 10 PM.

UPDATE: 12:20 AM Lustre maintenance has been completed and the scheduler has been resumed. All HPCC services are operational.

Timeline:

6:00 AM Interactive access was suspended and active users were disconnected.

6:30 AM Firewall upgrade began.

7:00 AM Firewall upgrade complete.

7:00 AM High-speed (InfiniBand) network, home directory server, gateway, and scratch (Lustre) updates underway.

8:00 AM Intel14 InfiniBand update complete.

10:00 AM Home directory network update complete.

2:30 PM Gateway update complete.

3:00 PM New home directory testing complete.

3:00 PM Intel16 InfiniBand update complete.

4:00 PM Home directory maintenance complete.

4:45 PM Gateway made available to users.

10:00 PM Lustre configuration complete.

12:20 AM Final Lustre testing complete. Scheduler resumed.

Posted at Jul 26, 2016 by Andrew R Keen
Last changed Jul 21, 2016 09:21 by Brian Roth

Over the weekend, ufs-11-b appears to have suffered a hardware failure. As expected, its high-availability partner ufs-11-a has imported the file system for ufs-11-b, so the file system remains accessible. There will be increased load and latency on ufs-11-a until the hardware failure has been corrected. We currently have a service call open to address the issue.

 

Update 7-20-16 - 

The hardware for ufs-11-b has now been repaired and the server is back online. ufs-11-a is still hosting both file systems, as several users are actively using them. We are currently planning to switch the file systems back during the outage on 7-26.

 

 

Update 7-20-16 - 

Job activity to the new filers finished this morning, and the file system for ufs-11-b has been switched back to its main server. During the switch, ufs-11-a and ufs-11-b suffered a short period of downtime. Both are now operating properly, and performance on both file systems is back to normal.

Posted at Jul 18, 2016 by Brian Roth
Last changed Jul 14, 2016 15:55 by Nicholas Rahme

The interactive node dev-intel14 is currently offline due to a software problem. Staff are working to resolve the issue. The other development nodes remain available, and running and queued jobs on the cluster are unaffected.

 

Update 3:30pm - dev-intel14 has been returned to service. Staff identified a failure in the environment variable that controlled several volume mounts and corrected the variable's logic, resolving the underlying failure. Because the logic was corrected globally, no repeat of this issue is expected.

Posted at Jul 14, 2016 by Andrew R Keen
Last changed Jul 12, 2016 21:51 by Allan Ross
Labels: maintenance

iCER is deploying a new hybrid home and research storage system which will be significantly faster than the previous home directory servers. The first phase of this system is a two-node, high-availability cluster of 24-core servers with a 20 Gbps network connection and a hybrid disk/SSD storage system with 800 TB of capacity.

We’ve already begun the process of migrating users to this new system. Any new large quota increases will be provided on this system, which will eventually replace all the home directory servers.

We are currently identifying and contacting users who would benefit from this new system. Users will need to log out and stop all running jobs during the migration process, which typically takes less than one hour. If you would like to migrate to the new system, please contact us.
Posted at Jul 08, 2016 by Brian Roth
Last changed Jun 28, 2016 17:54 by Andrew R Keen
Labels: maintenance

Due to unusual traffic, users whose home or research spaces are on file server ufs-10-a experienced slow performance beginning around 3 PM this afternoon; the server became unresponsive around 4:15 PM. We are working to restore service and anticipate a return to service by 5:30 PM.

Update: 5:45 PM ufs-10-a was returned to service at approximately 5:30 PM. Due to a bug in the NFSv4 client implementation, a few cluster nodes were generating requests faster than the server could respond, which created a significant backlog of requests on the server. The operating system our home directory file servers currently run on does not have any rate limiting, which allowed several hundred thousand requests to be queued. Exacerbating the issue, there is a memory leak in the kernel on the file server. We are working aggressively to move the home directory and research servers to a new file system and new hardware, and we will continue to monitor the servers.

Posted at Jun 28, 2016 by Andrew R Keen
Labels: maintenance

At approximately 3 PM today, one of the Lustre file servers experienced a software crash. Its fail-over partner was unavailable due to an unrelated issue. While the server was down, any access to /mnt/ls15 or /mnt/scratch would hang. The server was returned to service at approximately 4:30 PM today. We apologize for any disruption that this caused.

Posted at Jun 16, 2016 by Andrew R Keen
Last changed May 09, 2016 15:39 by Allan Ross

The system remains stable and available. 

Posted at May 09, 2016 by Allan Ross
Last changed Apr 21, 2016 16:42 by Andrew R Keen
Labels: maintenance

On 4/21 at 3:05 PM, ufs-10-a crashed. We are currently working to return it to service. Users may experience delays logging into the gateways until it is returned to service.

4:38 PM: ufs-10-a has been returned to service. The fault was due to a kernel bug on the file server. We have switched to a version of the kernel that does not have that issue. Gateway access has been restored.

Posted at Apr 21, 2016 by Andrew R Keen
Last changed Apr 19, 2016 11:33 by Allan Ross

UPDATE: Maintenance was successfully completed and the scheduler has restarted. 

Tuesday, April 19 — The scheduler will be paused between 8 and 10 a.m. today to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but jobs whose wall time would overlap with this planned maintenance will not start until after the scheduler has resumed. Check back here for updates.

You can see a tentative timeline for all new cluster work here.

Posted at Apr 19, 2016 by Allan Ross
Last changed Apr 18, 2016 15:36 by Allan Ross
Labels: maintenance

2:15 p.m. — dev-intel14 has been returned to operational status. All services are up and operational. 

10:30 a.m. — A bug was reported in the NFS client on dev-intel14. To address this issue, dev-intel14 will be rebooted at 1 p.m. today. Users will not be able to access dev-intel14 for about 30 minutes. In the meantime, please use one of the other development nodes, such as dev-intel14-phi, to access your files.

If you have any questions or would like further assistance, please contact us.

Posted at Apr 18, 2016 by Camille Alva Archer
Last changed Apr 13, 2016 14:58 by Allan Ross

The Machine Room is up and operational. 

Please note: Next Tuesday, April 19, the scheduler will be paused between 8 and 10 a.m. to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but no new jobs will start during this time. Check back here for updates.

 
You can see a tentative timeline for all new cluster work here.

Posted at Apr 13, 2016 by Allan Ross
Last changed Apr 07, 2016 12:39 by Andrew R Keen

April 7th 2016, 10:30am: There was a failure at the MSU power plant, which dropped power to the Engineering Building and the HPCC Machine Room for roughly 15 seconds. All compute nodes lost power and any running jobs were terminated. Additionally, power was lost to one file server (ufs-10-b); users on ufs-10-b were unable to log into the HPCC until the file server rebooted (estimated return to service at 11 AM).

UPDATE: 11:00 AM: ufs-10-b has been returned to service and all interactive services are available. We are currently checking the health of the cluster before resuming scheduling.

UPDATE: 12:02:48 PM: The scheduler remains paused, as the MSU campus alert has stated that power shedding may occur several more times. Interactive access will continue; however, no new batch jobs will start until MSU issues an all-clear.

UPDATE: 12:38 PM: We have received the all-clear, and job scheduling has resumed.

 



Posted at Apr 07, 2016 by Allan Ross

The externally managed iCER firewall stopped routing packets at 6:15 AM this morning (04-06-2016). It appears there was a fault on one of the routing engines in the system. It was returned to service at 9:10 AM. Running jobs were not disrupted. We apologize for any disruption that this may have caused.

Posted at Apr 06, 2016 by Andrew R Keen

 
