
Announcement History

Emergency Hardware Maintenance Planned for 10-a Filer
The file system that hosts 10-a has a hardware issue with its network interface card. Logins for all users may be affected, and some users (those with home directories on 10-a) will not be able to log in at all. To address this issue, downtime is planned for March 16 from 7:30 to 9:30 a.m. to replace the card.

We regret any inconvenience caused as a result of this.  Service updates will be posted at http://icer.msu.edu/service-status

Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.

Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309

  

Last changed May 09, 2016 15:39 by Allan Ross

The system remains stable and available. 

Posted at May 09, 2016 by Allan Ross | 0 comments
Last changed Apr 21, 2016 16:42 by Andrew R Keen
Labels: maintenance

On 4/21 at 3:05 PM ufs-10-a crashed. We are currently working to return it to service. Users may experience delays logging into gateways until it is returned to service.

4:38 PM: ufs-10-a has been returned to service. The fault was due to a kernel bug on the file server. We have switched to a version of the kernel that does not have that issue. Gateway access has been restored.

Posted at Apr 21, 2016 by Andrew R Keen | 0 comments
Last changed Apr 19, 2016 11:33 by Allan Ross

UPDATE: Maintenance was successfully completed and the scheduler has restarted. 

Tuesday, April 19: The scheduler will be paused between 8 and 10 a.m. today to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but jobs with a wall time that would overlap with this planned maintenance will not start until after the scheduler has resumed. Check back here for updates.

You can see a tentative timeline for all new cluster work here

Posted at Apr 19, 2016 by Allan Ross | 0 comments
Last changed Apr 18, 2016 15:36 by Allan Ross
Labels: maintenance

2:15 p.m.: dev-intel14 has been returned to operational status. All services are up and operational.

10:30 a.m.: A bug was reported in the NFS client on dev-intel14. To address this issue, dev-intel14 will be rebooted at 1 p.m. today. Users will not be able to access dev-intel14 for about 30 minutes. In the meantime, please use one of the other development nodes, such as dev-intel14-phi, to access your files.

If you have any questions or would like further assistance, please contact us.

Posted at Apr 18, 2016 by Camille Alva Archer | 0 comments
Last changed Apr 13, 2016 14:58 by Allan Ross

The Machine Room is up and operational. 

Please note: Next Tuesday, April 19, the scheduler will be paused between 8 and 10 a.m. to facilitate electrical work being done inside the Machine Room. Jobs that are running will continue to run, but no new jobs will start during this time. Check back here for updates.

 
You can see a tentative timeline for all new cluster work here

Posted at Apr 13, 2016 by Allan Ross | 0 comments
Last changed Apr 07, 2016 12:39 by Andrew R Keen

April 7, 2016, 10:30 AM: A failure at the MSU power plant cut power to the Engineering Building and the HPCC Machine Room for about 15 seconds. All compute nodes lost power and any running jobs were terminated. Power was also lost to one file server (ufs-10-b); users on ufs-10-b will be unable to log in to the HPCC until the file server has rebooted (estimated return to service: 11 AM).

UPDATE: 11:00 AM: ufs-10-b has been returned to service and all interactive services are available. We are currently checking the health of the cluster before resuming scheduling.

UPDATE: 12:02:48 PM: The scheduler remains paused because the MSU campus alert states that power shedding may occur several times. Interactive access will continue; however, no new batch jobs will start until MSU issues an all-clear.

UPDATE: 12:38 PM: We have received the all-clear and job scheduling has resumed.

 



Posted at Apr 07, 2016 by Allan Ross | 0 comments

The externally managed iCER firewall stopped routing packets at 6:15 AM today (04-06-2016). It appears there was a fault on one of the routing engines in the system. It was returned to service at 9:10 AM. Running jobs were not disrupted. We apologize for any disruption that this may have caused.

Posted at Apr 06, 2016 by Andrew R Keen | 0 comments

The machine is up and running. Samba servers on ufs-10-a were affected by LDAP. They have been restarted and are back online.

Posted at Apr 04, 2016 by Camille Alva Archer | 0 comments
Last changed Apr 05, 2016 13:18 by Sharan Kalwani

The machine is up and running. ufs-10-a has been returned to service. Currently the network path to the ufs-10-a filer is running at 1 Gigabit/sec instead of the normal 10 Gigabit/sec, so access may appear slow at times. We are working to address this.


Posted at Apr 04, 2016 by Allan Ross | 0 comments
Last changed Mar 31, 2016 22:44 by Andrew R Keen

ufs-10-b is currently unavailable due to a system failure. We are currently restarting the system. A return to service is anticipated by 7:45 PM.


UPDATE: 7:40 PM. ufs-10-b has been returned to service. The cause of the failure is a memory allocation bug in the kernel of ufs-10-b's operating system.

UPDATE: 9:00 PM. ufs-10-b has had two more issues. Staff continue to work on the system.

UPDATE: 10:30 PM. ufs-10-b has been returned to service. A user was generating a high volume of home directory traffic. Due to a failure of rate limiting in the file server, it accumulated several hundred thousand pending requests and exhausted all available memory, crashing the system.

Posted at Mar 31, 2016 by Andrew R Keen | 0 comments
Last changed Mar 30, 2016 10:23 by Allan Ross

As iCER prepares the Machine Room for this year’s new cluster installation, some of the older clusters (Intel07, Intel10 and gfx10) will be retired and removed. Please see the proposed schedule of work below.

With the loss of Intel07, Intel10 and gfx10, users will experience longer wait times in the queue. Researchers who have been selecting these older machines are asked to join the queue for Intel14 to run their jobs. Overall, the usage on these retiring nodes has been modest over the past months, so we expect the impact to be minimal. Once the new machine is up and running, queue wait times should decrease dramatically.  

Important dates:
° Wednesday, March 23 — removal of Intel07 (successfully completed)
° Wednesday, March 30 — removal of gfx10
° Wednesday, April 6 — removal of Intel10 

Posted at Mar 25, 2016 by Allan Ross | 0 comments

As iCER prepares the Machine Room for this year’s new cluster installation, some of the older clusters will be retired and removed. Additionally, an outage has been tentatively scheduled so that infrastructure work in the room can be conducted safely.

The first step in this process begins today, March 23, with the removal of Intel07, one of the Machine Room’s oldest and least-used clusters. Then, next Wednesday, March 30, another cluster, Intel10, will also be retired.

There is also an outage tentatively scheduled for Saturday, April 9 to accommodate some electrical, plumbing, and welding work in the Machine Room. All the gear will be idled as a precaution.  

Important dates:

° Wednesday, March 23 — removal of Intel07

° Wednesday, March 30 — removal of Intel10

° (tentative) Saturday, April 9 — outage from 8 a.m.-noon

Please note: we will scan and sweep the retiring drives and park any data we find to scratch. It will then be erased after 45 days, per usual.

Posted at Mar 23, 2016 by Allan Ross | 0 comments

9:15 a.m., March 18, 2016 — All systems are operating at ideal capacity. 

 

Posted at Mar 18, 2016 by Allan Ross | 0 comments
Last changed Mar 17, 2016 15:23 by Allan Ross

10:15 a.m., March 17, 2016 — This morning’s attempt to return filer 10a to full speed did not work as anticipated. The filer is available, but remains at slower speed. No further outages will be taken until we have a better solution.

We regret any inconvenience this may cause. Service updates will be posted on our Service Status page. Please let us know if you need immediate assistance by filing a ticket with us. You can also reach us at (517) 353-9309 or stop by the iCER office.


Posted at Mar 17, 2016 by Allan Ross | 0 comments
Last changed Mar 16, 2016 14:08 by Allan Ross

1:57 p.m., March 16, 2016 — This morning’s hardware maintenance has been completed, but the network interface is not working at the correct speed. Access to the Filer 10a remains slow because of this. We will continue to work on this issue from 7:30-9:30 a.m. tomorrow, Thursday, March 17. 

We regret any inconvenience this may cause. Service updates will be posted on our Service Status page. Please let us know if you need immediate assistance by filing a ticket with us. You can also reach us at (517) 353-9309 or stop by the iCER office.



 

Posted at Mar 16, 2016 by Allan Ross | 0 comments

 
